From Persistent Identifiers to Digital Objects to Make Data Science More Efficient

Bibliographic Details
Main Author: Wittenburg, Peter
Format: Article
Language: English
Published: The MIT Press 2019-03-01
Series: Data Intelligence
Online Access: https://www.mitpressjournals.org/doi/abs/10.1162/dint_a_00004
collection DOAJ
description Data-intensive science is a reality in large scientific organizations such as the Max Planck Society, but because our data practices are inefficient at integrating data from different sources, many projects cannot be carried out and many researchers are excluded. Since, according to surveys, about 80% of the time in data-intensive projects is wasted, we must conclude that we are not fit for the challenges that will come with the billions of smart devices producing continuous streams of data: our methods do not scale. Experts worldwide are therefore looking for strategies and methods that have potential for the future. The first steps have been made: there is now wide agreement, from the Research Data Alliance to the FAIR principles, that data should be associated with persistent identifiers (PIDs) and metadata (MD). In fact, after 20 years of experience we can claim that trustworthy PID systems are already in broad use. It is argued, however, that assigning PIDs is just the first step. If we agree to assign PIDs and also use the PID to store important relationships, such as pointers to the locations where the bit sequences or different metadata can be accessed, we are close to defining Digital Objects (DOs), which could indeed point to a solution for some of the basic problems in data management and processing. In addition to standardizing the way we assign PIDs, metadata and other state information, we could also define a Digital Object Access Protocol as a universal exchange protocol for DOs stored in repositories that use different data models and data organizations. We could also associate each DO with a type and a set of operations allowed on its content, which would pave the way to automatic processing, which has been identified as the major step toward scalability in data science and the data industry. A globally connected group of experts is now working on establishing testbeds for a DO-based data infrastructure.
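The Digital Object idea in the description above (a PID that resolves to the locations of the bit sequences and the metadata, plus a type that determines which operations are allowed) can be illustrated with a small sketch. This is not the article's data model or any real PID system's API: the class `DigitalObject`, the `OPERATIONS` registry, the helper functions, the PID `12345/example-0001` and the `example.org` URLs are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DigitalObject:
    """Hypothetical in-memory view of a Digital Object (DO) record."""
    pid: str                                                      # persistent identifier
    do_type: str                                                  # type associated with the DO
    bit_locations: List[str] = field(default_factory=list)       # where the bit sequence is stored
    metadata_locations: List[str] = field(default_factory=list)  # where metadata can be accessed


# Hypothetical registry: DO type -> {operation name -> function}.
OPERATIONS: Dict[str, Dict[str, Callable[[DigitalObject], str]]] = {}


def register(do_type: str, name: str):
    """Decorator that declares an operation as allowed for one DO type."""
    def decorator(func: Callable[[DigitalObject], str]):
        OPERATIONS.setdefault(do_type, {})[name] = func
        return func
    return decorator


@register("text/plain", "describe")
def describe(do: DigitalObject) -> str:
    # Summarise the record without touching the bit sequence itself.
    return f"{do.pid} ({do.do_type}): {len(do.bit_locations)} copies"


def allowed_operations(do: DigitalObject) -> List[str]:
    """Names of the operations registered for this DO's type."""
    return sorted(OPERATIONS.get(do.do_type, {}))


def invoke(do: DigitalObject, name: str) -> str:
    """Run an operation only if it is allowed for the DO's type."""
    ops = OPERATIONS.get(do.do_type, {})
    if name not in ops:
        raise ValueError(f"operation {name!r} not allowed for type {do.do_type!r}")
    return ops[name](do)


if __name__ == "__main__":
    example = DigitalObject(
        pid="12345/example-0001",  # made-up PID
        do_type="text/plain",
        bit_locations=["https://repo-a.example.org/objects/0001"],
        metadata_locations=["https://repo-a.example.org/metadata/0001"],
    )
    print(allowed_operations(example))
    print(invoke(example, "describe"))
```

Keeping the operations in a registry keyed by type, rather than as methods on the object, mirrors the point that a machine can decide what processing is permitted from the type alone, without inspecting the content.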
first_indexed 2024-12-11T16:48:20Z
format Article
id doaj.art-11406cbf09c94bb2b99ae715199bec7e
institution Directory Open Access Journal
issn 2641-435X