Metadata-Based Clustering and Selection of Metadata Items for Similar Dataset Discovery and Data Combination Tasks

Data integration, which aims to solve problems and create new services by combining datasets, has attracted considerable attention. The discovery of similar datasets that can be combined is critical. In the literature on similar dataset discovery, it is important to select an appropriate discovery m...

Bibliographic Details
Main Authors:	Takeshi Sakumoto, Teruaki Hayashi, Hiroki Sakaji, Hirofumi Nonaka
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Access
Subjects:	Dataset discovery; dataset similarity; clustering; data exchange platform; metadata
Online Access:	https://ieeexplore.ieee.org/document/10464313/

author	Takeshi Sakumoto, Teruaki Hayashi, Hiroki Sakaji, Hirofumi Nonaka
collection	DOAJ
description	Data integration, which aims to solve problems and create new services by combining datasets, has attracted considerable attention. The discovery of similar datasets that can be combined is critical. In the literature on similar dataset discovery, it is important to select an appropriate discovery method for each information need, such as the domain. However, conventional studies have evaluated discovery methods under differing conditions, including different domains, test datasets, and evaluation metrics. This inconsistency prevents the selection of an appropriate method for each situation. Furthermore, the specific effects of combining different methods are not well known, even though conventional studies argue that such combinations are important. This study attempts to understand (1) the similarity indicators that should be employed for each domain and (2) the effects of a combination of different indicators on performance. We evaluated 16 inter-dataset clustering models based on different metadata-based similarity indicators, using unified evaluation metrics and datasets for 15 domains. Our results (1) suggest which similarity indicators should be used for each domain and (2) demonstrate that most combinations of different methods can improve clustering performance.
first_indexed	2024-04-24T18:52:32Z
format	Article
id	doaj.art-35ad149ab17f48fbb1733fdb323e33ec
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2024-04-24T18:52:32Z
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-35ad149ab17f48fbb1733fdb323e33ec | 2024-03-26T17:48:20Z | eng | IEEE | IEEE Access | 2169-3536 | 2024-01-01 | 12 | 40213 | 40224 | 10.1109/ACCESS.2024.3375750 | 10464313 | Metadata-Based Clustering and Selection of Metadata Items for Similar Dataset Discovery and Data Combination Tasks | Takeshi Sakumoto [0] https://orcid.org/0000-0002-7589-3283 | Teruaki Hayashi [1] https://orcid.org/0000-0002-1806-5852 | Hiroki Sakaji [2] | Hirofumi Nonaka [3] | [0] Department of Engineering, Nagaoka University of Technology, Nagaoka, Niigata, Japan | [1] Department of Engineering, The University of Tokyo, Bunkyo, Tokyo, Japan | [2] Faculty of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido, Japan | [3] Faculty of Business Administration, Aichi Institute of Technology, Toyota, Aichi, Japan | https://ieeexplore.ieee.org/document/10464313/ | Dataset discovery; dataset similarity; clustering; data exchange platform; metadata
title	Metadata-Based Clustering and Selection of Metadata Items for Similar Dataset Discovery and Data Combination Tasks
topic	Dataset discovery; dataset similarity; clustering; data exchange platform; metadata
url	https://ieeexplore.ieee.org/document/10464313/
