Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers
Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most...
Main Author: | Brandon Seah |
---|---|
Format: | Article |
Language: | English |
Published: | Pensoft Publishers, 2023-11-01 |
Series: | Biodiversity Data Journal |
Subjects: | |
Online Access: | https://bdj.pensoft.net/article/114076/download/pdf/ |
_version_ | 1797450709350744064 |
---|---|
author | Brandon Seah |
author_facet | Brandon Seah |
author_sort | Brandon Seah |
collection | DOAJ |
description | Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility, yet the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name-matching caused by homonyms. By correcting such errors during data-cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses. |
first_indexed | 2024-03-09T14:44:29Z |
format | Article |
id | doaj.art-6917d357a8a04883a8ad6a383d9def40 |
institution | Directory Open Access Journal |
issn | 1314-2828 |
language | English |
last_indexed | 2024-03-09T14:44:29Z |
publishDate | 2023-11-01 |
publisher | Pensoft Publishers |
record_format | Article |
series | Biodiversity Data Journal |
spelling | doaj.art-6917d357a8a04883a8ad6a383d9def40; 2023-11-27T11:00:03Z; eng; Pensoft Publishers; Biodiversity Data Journal; 1314-2828; 2023-11-01; 11117; 10.3897/BDJ.11.e114076; 114076; Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers; Brandon Seah; 0; Thünen Institute for Biodiversity; https://bdj.pensoft.net/article/114076/download/pdf/; data curation; biodiversity informatics; data inte |
title | Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers |
title_sort | paying it forward crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers |
topic | data curation biodiversity informatics data inte |
url | https://bdj.pensoft.net/article/114076/download/pdf/ |
work_keys_str_mv | AT brandonseah payingitforwardcrowdsourcingtheharmonisationandlinkingoftaxonnamesandbiodiversityidentifiers |
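
The description above notes that name-matching alone is error-prone because of homonyms. The following is a minimal sketch of that problem, not the toolkit proposed in the article: it links two small record lists by name and sends any name without exactly one candidate in the same kingdom to manual review. The function `link_by_name`, the record layout and the identifiers (`A1`, `B1`, ...) are hypothetical placeholders; only the genus name Morus, which is used both for mulberries (plants) and for gannets (birds), is a real homonym.

```python
from collections import defaultdict

# Hypothetical records: (identifier, name, kingdom).
# The identifiers are made-up placeholders, not real GBIF or NCBI IDs.
source_a = [
    ("A1", "Morus", "Plantae"),    # mulberry genus
    ("A2", "Morus", "Animalia"),   # gannet genus (homonym)
    ("A3", "Quercus", "Plantae"),
]
source_b = [
    ("B1", "Morus", "Animalia"),
    ("B2", "Quercus", "Plantae"),
]


def link_by_name(left, right):
    """Link records by name, using kingdom as a context check.

    Returns (linked, review): `linked` holds identifier pairs with exactly
    one candidate in the same kingdom; `review` holds records that need
    manual curation, together with all same-name candidates.
    """
    by_name = defaultdict(list)
    for rec in right:
        by_name[rec[1]].append(rec)

    linked, review = [], []
    for ident, name, kingdom in left:
        candidates = by_name.get(name, [])
        same_context = [c for c in candidates if c[2] == kingdom]
        if len(same_context) == 1:
            linked.append((ident, same_context[0][0]))
        else:
            # A name-only match would have linked this record silently.
            review.append((ident, name, [c[0] for c in candidates]))
    return linked, review


linked, review = link_by_name(source_a, source_b)
print(linked)
print(review)
```

In this toy data, a name-only match would link the plant record `A1` to the animal record `B1`; the kingdom check routes it to review instead. Real curation would need more context than kingdom alone, for example authorship or higher classification.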