Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers

Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most...

Bibliographic Details
Main Author: Brandon Seah
Format: Article
Language: English
Published: Pensoft Publishers 2023-11-01
Series: Biodiversity Data Journal
Subjects:
Online Access: https://bdj.pensoft.net/article/114076/download/pdf/
collection DOAJ
description Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility, yet the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name-matching caused by homonyms. By correcting such errors during data-cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses.
id doaj.art-6917d357a8a04883a8ad6a383d9def40
institution Directory of Open Access Journals
issn 1314-2828
spelling doaj.art-6917d357a8a04883a8ad6a383d9def40; 2023-11-27T11:00:03Z; eng; Pensoft Publishers; Biodiversity Data Journal; ISSN 1314-2828; 2023-11-01; volume 11, pages 1-17; DOI 10.3897/BDJ.11.e114076; article 114076; Brandon Seah (Thünen Institute for Biodiversity)
topic data curation
biodiversity informatics
data inte