Cluster Analysis of Open Research Data: A Case for Replication Metadata

Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets coming from numerous disciplines, from tabular data and program code to audio-visual files. Metadata...


Bibliographic Details
Main Author:	Ana Trisovic
Format:	Article
Language:	English
Published:	University of Edinburgh 2023-02-01
Series:	International Journal of Digital Curation
Online Access:	http://www.ijdc.net/article/view/833

_version_	1811173553006968832
author	Ana Trisovic
author_facet	Ana Trisovic
author_sort	Ana Trisovic
collection	DOAJ
description	Research data are often released upon journal publication to enable result verification and reproducibility. For that reason, research dissemination infrastructures typically support diverse datasets coming from numerous disciplines, from tabular data and program code to audio-visual files. Metadata, or data about data, is critical to making research outputs adequately documented and FAIR. Aiming to contribute to the discussions on the development of metadata for research outputs, I conducted an exploratory analysis to determine how research datasets cluster based on what researchers organically deposit together. I use the content of over 40,000 datasets from the Harvard Dataverse research data repository as my sample for the cluster analysis. I find that the majority of the clusters are formed by single-type datasets, while in the rest of the sample, no meaningful clusters can be identified. For the result interpretation, I use the metadata standard employed by DataCite, a leading organization for documenting the scholarly record, and map existing resource types to my results. About 65% of the sample can be described with single-type metadata (such as Dataset, Software, or Report), while the rest would require aggregate metadata types. Though DataCite supports an aggregate type such as Collection, I argue that a significant number of datasets, in particular those containing both data and code files (about 20% of the sample), would be more accurately described by a Replication resource metadata type. Such a resource type would be particularly useful in facilitating research reproducibility.
first_indexed	2024-04-10T17:48:51Z
format	Article
id	doaj.art-97797ad3b3724ea0a1432c9573ec8e76
institution	Directory of Open Access Journals
issn	1746-8256
language	English
last_indexed	2024-04-10T17:48:51Z
publishDate	2023-02-01
publisher	University of Edinburgh
record_format	Article
series	International Journal of Digital Curation
spelling	doaj.art-97797ad3b3724ea0a1432c9573ec8e76; 2023-02-03T01:02:43Z; eng; University of Edinburgh; International Journal of Digital Curation; 1746-8256; 2023-02-01; 17; 1; 10.2218/ijdc.v17i1.833; Cluster Analysis of Open Research Data: A Case for Replication Metadata; Ana Trisovic (Harvard University); http://www.ijdc.net/article/view/833
spellingShingle	Ana Trisovic Cluster Analysis of Open Research Data: A Case for Replication Metadata International Journal of Digital Curation
title	Cluster Analysis of Open Research Data: A Case for Replication Metadata
title_full	Cluster Analysis of Open Research Data: A Case for Replication Metadata
title_fullStr	Cluster Analysis of Open Research Data: A Case for Replication Metadata
title_full_unstemmed	Cluster Analysis of Open Research Data: A Case for Replication Metadata
title_short	Cluster Analysis of Open Research Data: A Case for Replication Metadata
title_sort	cluster analysis of open research data a case for replication metadata
url	http://www.ijdc.net/article/view/833
work_keys_str_mv	AT anatrisovic clusteranalysisofopenresearchdataacaseforreplicationmetadata

Similar Items