QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION

In this paper, we propose a pipeline aimed at automatically extracting tables from heterogeneous Web sources, such as HTML pages, PDF files and images. Table extraction is one of the actively developing areas of Information Extraction, for which many applications, libraries and frameworks are currently...

Bibliographic Details
Main Authors: A. B. Nugumanova, K. S. Apayev, Y. M. Baiburin, M. Mansurova, A. G. Ospan
Format: Article
Language: English
Published: Al-Farabi Kazakh National University 2022-06-01
Series: Вестник КазНУ. Серия математика, механика, информатика
Subjects: web-tables, table extraction, table recognition, table understanding, knowledge base population
Online Access: https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664
author A. B. Nugumanova
K. S. Apayev
Y. M. Baiburin
M. Mansurova
A. G. Ospan
collection DOAJ
description In this paper, we propose a pipeline for automatically extracting tables from heterogeneous Web sources such as HTML pages, PDF files and images. Table extraction is an actively developing area of Information Extraction, for which many applications, libraries and frameworks are currently being developed. Nevertheless, most of these tools focus on a single specific task, for example, only on recognizing tables presented as images. We propose combining these tasks into a single pipeline that supports the full cycle of table processing, from search, recognition and extraction through to semantic analysis and interpretation, i.e. table understanding. Understanding tables and populating knowledge bases (knowledge graphs) with the meaningful information they contain is the ultimate goal of our design. The first part of the work presents methods for detecting tables in web pages and PDF documents, as well as for automatically detecting the attributes and values of objects. The second part presents the architecture and structure of the Qurma tool. The results show an implementation of the parser for the Almaty-Ust-Kamenogorsk flight search topic.
first_indexed 2024-04-10T18:41:15Z
format Article
id doaj.art-ee7b118681994b5fb54d66abba67b374
institution Directory Open Access Journal
issn 1563-0277
2617-4871
language English
last_indexed 2024-04-10T18:41:15Z
publishDate 2022-06-01
publisher Al-Farabi Kazakh National University
record_format Article
series Вестник КазНУ. Серия математика, механика, информатика
spelling doaj.art-ee7b118681994b5fb54d66abba67b374
2023-02-01T14:36:11Z
eng
Al-Farabi Kazakh National University
Вестник КазНУ. Серия математика, механика, информатика
ISSN 1563-0277, 2617-4871
2022-06-01, vol. 114, no. 2, pp. 91-100
https://doi.org/10.26577/JMMCS.2022.v114.i2.08
QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION
A. B. Nugumanova [0] https://orcid.org/0000-0001-5522-4421
K. S. Apayev [1] https://orcid.org/0000-0001-9292-4785
Y. M. Baiburin
M. Mansurova [2] https://orcid.org/0000-0002-9680-2758
A. G. Ospan [3] https://orcid.org/0000-0002-1860-6997
[0] Sarsen Amanzholov East Kazakhstan University
[1] D. Serikbayev East Kazakhstan Technical University
[2] Al-Farabi Kazakh National University, Kazakhstan, Almaty
[3] Al-Farabi Kazakh National University, Kazakhstan, Almaty
https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664
web-tables; table extraction; table recognition; table understanding; knowledge base population
title QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION
topic web-tables
table extraction
table recognition
table understanding
knowledge base population
url https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664
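
The description mentions detecting tables on web pages and automatically detecting the attributes and values of objects. As a rough illustration of that idea only, and not the Qurma implementation (which this record does not show), the sketch below uses Python's standard html.parser to collect each HTML table and treat its first row as attribute names. The names TableExtractor, to_records, extract_records and the sample flight row are hypothetical, and the sketch assumes tables are not nested and that the first row is a header.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> in an HTML document as a list of rows of cell text.

    Hypothetical helper for illustration; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__()
        self.tables = []     # finished tables, each a list of rows
        self._table = None   # table currently being read, or None
        self._row = None     # row currently being read, or None
        self._cell = None    # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table" and self._table is None:
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            self._table = None


def to_records(table):
    """Treat the first row as attribute names and each later row as one object."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def extract_records(html):
    """Return one list of attribute-value records per table that has data rows."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [to_records(t) for t in parser.tables if len(t) > 1]


# Made-up sample row, used only to exercise the sketch.
SAMPLE_HTML = """
<table>
  <tr><th>Flight</th><th>From</th><th>To</th></tr>
  <tr><td>XX 101</td><td>Almaty</td><td>Ust-Kamenogorsk</td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract_records(SAMPLE_HTML))
```

A real pipeline of the kind described would also need to handle PDF and image sources, header rows that are not first, and layout tables that carry no data; none of that is attempted here.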