QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION
In this paper, we propose a pipeline aimed at automatically extracting tables from heterogeneous Web sources, such as HTML pages, PDF files and images. Table extraction is one of the actively developing areas of Information Extraction, for which many applications, libraries and frameworks are currently...
Main Authors: | A. B. Nugumanova, K. S. Apayev, Y. M. Baiburin, M. Mansurova, A. G. Ospan |
---|---|
Format: | Article |
Language: | English |
Published: | Al-Farabi Kazakh National University, 2022-06-01 |
Series: | Вестник КазНУ. Серия математика, механика, информатика |
Subjects: | web-tables, table extraction, table recognition, table understanding, knowledge base population |
Online Access: | https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664 |
_version_ | 1797937219097329664 |
---|---|
author | A. B. Nugumanova, K. S. Apayev, Y. M. Baiburin, M. Mansurova, A. G. Ospan |
author_facet | A. B. Nugumanova, K. S. Apayev, Y. M. Baiburin, M. Mansurova, A. G. Ospan |
author_sort | A. B. Nugumanova |
collection | DOAJ |
description | In this paper, we propose a pipeline aimed at automatically extracting tables from heterogeneous Web sources, such as HTML pages, PDF files and images. Table extraction is one of the actively developing areas of Information Extraction, for which many applications, libraries and frameworks are currently being developed. Nevertheless, most of these tools focus on a single specific task, for example, only on recognizing tables presented as images. We propose combining these tasks into a single pipeline that supports the full cycle of table processing, from search, recognition and extraction through to semantic analysis and interpretation, i.e. table understanding. Understanding tables and populating knowledge bases (knowledge graphs) with the meaningful information they contain is the ultimate goal of our design. The first part of the work presents methods for detecting tables in web pages and PDF documents, as well as for automatically detecting the attributes and values of objects. The second part presents the architecture and structure of the Qurma tool. The results show the implementation of a parser for the Almaty-Ust-Kamenogorsk air search theme. |
first_indexed | 2024-04-10T18:41:15Z |
format | Article |
id | doaj.art-ee7b118681994b5fb54d66abba67b374 |
institution | Directory Open Access Journal |
issn | 1563-0277 2617-4871 |
language | English |
last_indexed | 2024-04-10T18:41:15Z |
publishDate | 2022-06-01 |
publisher | Al-Farabi Kazakh National University |
record_format | Article |
series | Вестник КазНУ. Серия математика, механика, информатика |
spelling | doaj.art-ee7b118681994b5fb54d66abba67b374; 2023-02-01T14:36:11Z; eng; Al-Farabi Kazakh National University; Вестник КазНУ. Серия математика, механика, информатика; ISSN 1563-0277, 2617-4871; 2022-06-01; volume 114, issue 2, pages 91-100; https://doi.org/10.26577/JMMCS.2022.v114.i2.08; QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION; A. B. Nugumanova (0, https://orcid.org/0000-0001-5522-4421); K. S. Apayev (1, https://orcid.org/0000-0001-9292-4785); Y. M. Baiburin; M. Mansurova (2, https://orcid.org/0000-0002-9680-2758); A. G. Ospan (3, https://orcid.org/0000-0002-1860-6997); Sarsen Amanzholov East Kazakhstan University; D. Serikbayev East Kazakhstan Technical University; Al-Farabi Kazakh National University, Kazakhstan, Almaty; Al-Farabi Kazakh National University, Kazakhstan, Almaty; In this paper, we propose a pipeline aimed at automatically extracting tables from heterogeneous Web sources, such as HTML pages, pdf files and images. Table extraction is one of the actively developing areas of Information Extraction, for which many applications, libraries and frameworks are currently being developed. Nevertheless, most of these tools are focused on solving some specific tasks, for example, only on recognizing tables presented in the form of images. We propose combining these tasks into a single pipeline that will support the full cycle of processing tables – starting with the stages of their search, recognition and extraction and ending with the stages of semantic analysis and interpretation, i.e. understanding tables (table understanding). Understanding tables and replenishing (population) knowledge bases (knowledge graphs) with meaningful information contained in these tables is the ultimate goal of our design. The first part of the work presents methods for detecting tables on web pages, pdf documents, as well as automatic detection of attributes and values of objects. The second part presents the architecture of the Qurma tool and its structure. The results show the implementation of the parser for the Almaty-Ust-Kamenogorsk air search theme.; https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664; web-tables; table extraction; table recognition; table understanding; knowledge base population |
title | QURMA: A TABLE EXTRACTION PIPELINE FOR KNOWLEDGE BASE POPULATION |
title_sort | qurma a table extraction pipeline for knowledge base population |
topic | web-tables; table extraction; table recognition; table understanding; knowledge base population |
url | https://bm.kaznu.kz/index.php/kaznu/article/view/1086/664 |
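
The description mentions detecting tables on web pages and automatically detecting the attributes and values of objects. The record does not say how Qurma does this, so the following is only a minimal illustrative sketch in Python, not the authors' implementation. It assumes that tables are marked up with plain `<table>`, `<tr>`, `<th>` and `<td>` tags, that tables are not nested, and that the first row of each table holds the attribute names. The names `TableExtractor`, `table_to_records`, `extract_records` and the sample flight row are all hypothetical.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every non-nested <table> as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._rows = None   # rows of the table currently being read
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def table_to_records(rows):
    """Assumption: the first row names the attributes, the rest hold values."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def extract_records(html):
    """Return one list of attribute/value records per table found in `html`."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return [table_to_records(rows) for rows in parser.tables if rows]


# Made-up sample page; the flight number is a placeholder, not real data.
SAMPLE = """
<table>
  <tr><th>Flight</th><th>From</th><th>To</th></tr>
  <tr><td>XX 101</td><td>Almaty</td><td>Ust-Kamenogorsk</td></tr>
</table>
"""

if __name__ == "__main__":
    print(extract_records(SAMPLE))
```

Each record produced this way is a set of attribute/value pairs, which is the kind of input a later knowledge base population step could consume. Tables in PDF files and images, which the description also covers, would need different detection methods and are outside this sketch.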