Record Linkage of Chinese Patent Inventors and Authors of Scientific Articles

We present an algorithm to find corresponding authors of patents and scientific articles. The authors are given as records in Scopus and the Chinese Patents Database. This issue is known as the record linkage problem, defined as finding and linking individual records from separate databases that ref...


Bibliographic Details
Main Authors: Robert Nowak, Wiktor Franus, Jiarui Zhang, Yue Zhu, Xin Tian, Zhouxian Zhang, Xu Chen, Xiaoyu Liu
Format: Article
Language: English
Published: MDPI AG 2021-09-01
Series: Applied Sciences
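Subjects: probabilistic record linkage; fuzzy string matching; text features extraction; supervised learning; DBpedia; All Science Journal Classification (ASJC)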
Online Access: https://www.mdpi.com/2076-3417/11/18/8417
author Robert Nowak
Wiktor Franus
Jiarui Zhang
Yue Zhu
Xin Tian
Zhouxian Zhang
Xu Chen
Xiaoyu Liu
collection DOAJ
description We present an algorithm to find corresponding authors of patents and scientific articles. The authors are given as records in Scopus and the Chinese Patents Database. This issue is known as the record linkage problem, defined as finding and linking individual records from separate databases that refer to the same real-world entity. The presented solution is based on a record linkage framework combined with text feature extraction and machine learning techniques. The main challenges were low data quality, the lack of common record identifiers, and the limited number of other attributes shared by both data sources. Matching based solely on an exact comparison of authors’ names does not solve the record linkage problem because many Chinese authors share the same full name. Moreover, the English spelling of Chinese names is not standardized in the analyzed data. Three ideas for extending the attribute sets and improving record linkage quality were proposed: (1) fuzzy matching of names, (2) comparison of the abstracts of patents and articles, and (3) comparison of scientists’ main research areas, calculated using all available metadata. The presented solution was evaluated in terms of matching quality and complexity on ≈250,000 record pairs linked by human experts. The results of numerical experiments show that the proposed strategies increase the quality of record linkage compared to typical solutions.
first_indexed 2024-03-10T07:56:10Z
format Article
id doaj.art-77095d6f17e340bda23c7154ff98911e
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T07:56:10Z
publishDate 2021-09-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-77095d6f17e340bda23c7154ff98911e; 2023-11-22T11:52:33Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2021-09-01; vol. 11, no. 18, art. 8417; DOI 10.3390/app11188417
Robert Nowak (0), Wiktor Franus (1): Institute of Computer Science, Warsaw University of Technology, 00-665 Warsaw, Poland
Jiarui Zhang (2), Yue Zhu (3), Xin Tian (4), Zhouxian Zhang (5), Xu Chen (6), Xiaoyu Liu (7): Shanghai Science and Technology Development Co. Ltd., Shanghai 200233, China
title Record Linkage of Chinese Patent Inventors and Authors of Scientific Articles
topic probabilistic record linkage
fuzzy string matching
text features extraction
supervised learning
DBpedia
All Science Journal Classification (ASJC)
url https://www.mdpi.com/2076-3417/11/18/8417
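
The description above names fuzzy matching of names and comparison of abstracts as two of the proposed ideas. The following is a minimal sketch of how such pairwise features might be computed, not the authors' implementation: this record does not say which similarity measures the article uses, so difflib's ratio and a bag-of-words cosine are stand-ins, and the function names and the record keys ("author", "inventor", "abstract") are hypothetical.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order is ignored."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def name_similarity(a: str, b: str) -> float:
    """Fuzzy name score in [0, 1]: difflib ratio on the normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of bag-of-words term counts, in [0, 1]."""
    va = Counter(re.findall(r"[a-z]+", text_a.lower()))
    vb = Counter(re.findall(r"[a-z]+", text_b.lower()))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # at least one text has no usable tokens
    return dot / (norm_a * norm_b)


def compare_records(article: dict, patent: dict) -> dict:
    """Feature vector for one candidate (article author, patent inventor) pair.

    The dictionary keys are assumptions made for this sketch only.
    """
    return {
        "name": name_similarity(article["author"], patent["inventor"]),
        "abstract": cosine_similarity(article["abstract"], patent["abstract"]),
    }
```

Sorting the name tokens makes "Zhang, Jiarui" and "Jiarui Zhang" compare as equal, while a spelling variant such as "Jia-Rui Zhang" still scores close to 1. In a supervised setting like the one the description mentions, feature vectors of this kind would presumably be passed to a classifier trained on the expert-linked pairs; that step is not shown here.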