Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers

Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most...


Bibliographic Details
Main Author:	Brandon Seah
Format:	Article
Language:	English
Published:	Pensoft Publishers 2023-11-01
Series:	Biodiversity Data Journal
Subjects:	data curation; biodiversity informatics; data inte
Online Access:	https://bdj.pensoft.net/article/114076/download/pdf/

author	Brandon Seah
collection	DOAJ
description	Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility, yet the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name-matching caused by homonyms. By correcting such errors during data-cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses.
first_indexed	2024-03-09T14:44:29Z
format	Article
id	doaj.art-6917d357a8a04883a8ad6a383d9def40
institution	Directory of Open Access Journals
issn	1314-2828
language	English
last_indexed	2024-03-09T14:44:29Z
publishDate	2023-11-01
publisher	Pensoft Publishers
record_format	Article
series	Biodiversity Data Journal
spelling	doaj.art-6917d357a8a04883a8ad6a383d9def40; 2023-11-27T11:00:03Z; eng; Pensoft Publishers; Biodiversity Data Journal; 1314-2828; 2023-11-01; 11117; 10.3897/BDJ.11.e114076; 114076; Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers; Brandon Seah; 0; Thünen Institute for Biodiversity; https://bdj.pensoft.net/article/114076/download/pdf/; data curation; biodiversity informatics; data inte
title	Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers
topic	data curation; biodiversity informatics; data inte
url	https://bdj.pensoft.net/article/114076/download/pdf/


Similar Items