Record Linkage of Chinese Patent Inventors and Authors of Scientific Articles
We present an algorithm to find corresponding authors of patents and scientific articles. The authors are given as records in Scopus and the Chinese Patents Database. This issue is known as the record linkage problem, defined as finding and linking individual records from separate databases that ref...
Main Authors: | Robert Nowak, Wiktor Franus, Jiarui Zhang, Yue Zhu, Xin Tian, Zhouxian Zhang, Xu Chen, Xiaoyu Liu |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2021-09-01 |
Series: | Applied Sciences |
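Subjects: | probabilistic record linkage; fuzzy string matching; text features extraction; supervised learning; DBpedia; All Science Journal Classification (ASJC) |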
Online Access: | https://www.mdpi.com/2076-3417/11/18/8417 |
_version_ | 1797520394762059776 |
---|---|
author | Robert Nowak; Wiktor Franus; Jiarui Zhang; Yue Zhu; Xin Tian; Zhouxian Zhang; Xu Chen; Xiaoyu Liu |
author_sort | Robert Nowak |
collection | DOAJ |
description | We present an algorithm to find corresponding authors of patents and scientific articles. The authors are given as records in Scopus and the Chinese Patents Database. This issue is known as the record linkage problem, defined as finding and linking individual records from separate databases that refer to the same real-world entity. The presented solution is based on a record linkage framework combined with text feature extraction and machine learning techniques. The main challenges were low data quality, lack of common record identifiers, and a limited number of other attributes shared by both data sources. Matching based solely on an exact comparison of authors’ names does not solve the records linking problem because many Chinese authors share the same full name. Moreover, the English spelling of Chinese names is not standardized in the analyzed data. Three ideas on how to extend attribute sets and improve record linkage quality were proposed: (1) fuzzy matching of names, (2) comparison of abstracts of patents and articles, (3) comparison of scientists’ main research areas calculated using all metadata available. The presented solution was evaluated in terms of matching quality and complexity on ≈250,000 record pairs linked by human experts. The results of numerical experiments show that the proposed strategies increase the quality of record linkage compared to typical solutions. |
first_indexed | 2024-03-10T07:56:10Z |
format | Article |
id | doaj.art-77095d6f17e340bda23c7154ff98911e |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-10T07:56:10Z |
publishDate | 2021-09-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-77095d6f17e340bda23c7154ff98911e; 2023-11-22T11:52:33Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2021-09-01; vol. 11, no. 18, art. 8417; DOI 10.3390/app11188417; Record Linkage of Chinese Patent Inventors and Authors of Scientific Articles; Robert Nowak and Wiktor Franus (Institute of Computer Science, Warsaw University of Technology, 00-665 Warsaw, Poland); Jiarui Zhang, Yue Zhu, Xin Tian, Zhouxian Zhang, Xu Chen, and Xiaoyu Liu (Shanghai Science and Technology Development Co. Ltd., Shanghai 200233, China); https://www.mdpi.com/2076-3417/11/18/8417 |
title | Record Linkage of Chinese Patent Inventors and Authors of Scientific Articles |
topic | probabilistic record linkage; fuzzy string matching; text features extraction; supervised learning; DBpedia; All Science Journal Classification (ASJC) |
url | https://www.mdpi.com/2076-3417/11/18/8417 |
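
The description lists fuzzy matching of names and comparison of abstracts as two of the proposed ways to extend the attribute set. The record does not say which string or text measures the authors used, so the following is only a minimal sketch under stated assumptions: `difflib.SequenceMatcher` stands in for the fuzzy name measure, a plain term-frequency cosine stands in for the abstract comparison, and the function names and the `name`/`abstract` record keys are hypothetical. The resulting pair of scores could serve as input features for a supervised classifier, which is not shown here.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize_name(name: str) -> str:
    """Lowercase, join hyphenated syllables, and sort tokens so that
    'Zhang, Jia-Rui' and 'Jiarui Zhang' normalize to the same string."""
    cleaned = re.sub(r"[-']", "", name.lower())
    return " ".join(sorted(re.findall(r"[a-z]+", cleaned)))


def name_similarity(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] between two romanized names.
    SequenceMatcher is an assumed stand-in, not the paper's measure."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def abstract_similarity(a: str, b: str) -> float:
    """Cosine similarity of raw term-frequency vectors of two abstracts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def pair_features(article: dict, patent: dict) -> tuple[float, float]:
    """Feature vector for one candidate (article author, patent inventor) pair.
    The 'name' and 'abstract' keys are hypothetical field names."""
    return (
        name_similarity(article["name"], patent["name"]),
        abstract_similarity(article["abstract"], patent["abstract"]),
    )
```

Sorting the name tokens makes the comparison insensitive to family-name-first versus given-name-first ordering, which is one common source of mismatch when Chinese names are romanized. It does not address the other problem the description names, that many distinct people share the same full name, which is why the name score is paired with a content-based score rather than used alone.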