Enhanced geocoding precision for location inference of tweet text using spaCy, Nominatim and Google Maps. A comparative analysis of the influence of data selection.

Bibliographic Details
Main Authors: Helen Ngonidzashe Serere, Bernd Resch, Clemens Rudolf Havas
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2023-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0282942
author Helen Ngonidzashe Serere
Bernd Resch
Clemens Rudolf Havas
collection DOAJ
description Twitter location inference methods are developed with the purpose of increasing the percentage of geotagged tweets by inferring locations on a non-geotagged dataset. For validation of proposed approaches, these location inference methods are developed on a fully geotagged dataset on which the attached Global Navigation Satellite System coordinates are used as ground truth data. Whilst a substantial number of location inference methods have been developed to date, questions arise pertaining to the generalizability of the developed location inference models to a non-geotagged dataset. This paper proposes a high-precision location inference method for inferring tweets' point of origin based on location mentions within the tweet text. We investigate the influence of data selection by comparing the model performance on two datasets. For the first dataset, we use a proportionate sample of tweet sources of a geotagged dataset. For the second dataset, we use a modelled distribution of tweet sources following a non-geotagged dataset. Our results showed that the distribution of tweet sources influences the performance of location inference models. Using the first dataset, we outperformed state-of-the-art location extraction models by inferring 61.9%, 86.1% and 92.1% of the extracted locations within 1 km, 10 km and 50 km radius values, respectively. However, using the second dataset, our precision values dropped to 45.3%, 73.1% and 81.0% for the same radius values.
first_indexed 2024-04-09T17:00:38Z
format Article
id doaj.art-6c10988c366b4d07a2d80c249e820293
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-04-09T17:00:38Z
publishDate 2023-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-6c10988c366b4d07a2d80c249e820293; 2023-04-21T05:32:57Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2023-01-01; vol. 18, no. 3, e0282942; 10.1371/journal.pone.0282942; https://doi.org/10.1371/journal.pone.0282942
title Enhanced geocoding precision for location inference of tweet text using spaCy, Nominatim and Google Maps. A comparative analysis of the influence of data selection.
url https://doi.org/10.1371/journal.pone.0282942
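The description reports precision as the share of extracted locations that fall within 1 km, 10 km and 50 km of the ground truth coordinates. A minimal sketch of how such a precision-at-radius figure could be computed is given below. It is not the authors' implementation: the function names `haversine_km` and `precision_within`, the use of the haversine great-circle distance, and the sample coordinates are all assumptions made for illustration.

```python
import math

# Mean Earth radius in kilometres (assumed spherical model).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def precision_within(pairs, radii_km=(1, 10, 50)):
    """Fraction of inferred points lying within each radius of the ground truth.

    `pairs` is a sequence of ((inferred_lat, inferred_lon),
    (truth_lat, truth_lon)) tuples; this layout is hypothetical.
    """
    if not pairs:
        raise ValueError("at least one coordinate pair is required")
    distances = [haversine_km(*inferred, *truth) for inferred, truth in pairs]
    return {r: sum(d <= r for d in distances) / len(distances)
            for r in radii_km}


# Illustrative coordinates only, not data from the article.
sample_pairs = [
    ((47.8095, 13.0550), (47.8095, 13.0550)),  # exact match
    ((47.8595, 13.0550), (47.8095, 13.0550)),  # about 5.6 km apart
    ((48.2082, 16.3738), (47.8095, 13.0550)),  # far more than 50 km apart
]
result = precision_within(sample_pairs)
```

Each radius is evaluated independently against the same list of distances, so the resulting fractions are cumulative: any location counted within 1 km is also counted within 10 km and 50 km.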