Migratable urban street scene sensing method based on vision language pre-trained model

We propose a geographically reproducible approach to urban scene sensing based on large-scale pre-trained models. With the rise of GeoAI research, many high-quality urban observation datasets and deep learning models have emerged. However, geospatial heterogeneity makes these resources difficult to share and to migrate to new application scenarios. Taking street view image analysis as an example, this paper introduces vision-language and semantic pre-trained models. The approach bridges data-format boundaries under location coupling, yielding objective text-image descriptions of urban scenes in physical space from a human perspective, covering entities, entity attributes, and the relationships between entities. In addition, we propose the SFT-BERT model to extract text feature sets for 10 urban land-use categories from 8,923 scenes in Wuhan. The results show that our method outperforms seven baseline models, including computer vision approaches, improving accuracy by 15% over traditional deep learning methods and demonstrating the potential of the pre-train & fine-tune paradigm for GIS spatial analysis. Our model can also be reused in other cities, and more accurate image descriptions and scene judgments can be obtained by inputting street view images from different angles. The code is shared at: github.com/yemanzhongting/CityCaption.

Bibliographic Details
Main Authors: Yan Zhang, Fan Zhang, Nengcheng Chen
Format: Article
Language: English
Published: Elsevier, 2022-09-01
Series: International Journal of Applied Earth Observation and Geoinformation, Volume 113, Article 102989
ISSN: 1569-8432
Subjects: GeoAI; Natural language processing; Data translation; Pretrained model; Street view; Multi-modal
Online Access: http://www.sciencedirect.com/science/article/pii/S1569843222001807
Author Affiliations
Yan Zhang: State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China
Fan Zhang: Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China; Senseable City Laboratory, Massachusetts Institute of Technology, MA 02139, USA
Nengcheng Chen (corresponding author): State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China; National Engineering Research Center of Geographic Information System, China University of Geosciences, Wuhan 430074, China
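
To make the pipeline described in the abstract concrete, here is a minimal sketch of the first stage: turning a street view image into a natural-language description with a pre-trained vision-language captioning model. The record does not name the paper's captioning backbone, so the `nlpconnect/vit-gpt2-image-captioning` checkpoint and the image path below are stand-in assumptions, not the authors' actual setup.

```python
# Sketch of stage 1: street view image -> text description.
# Assumption: Hugging Face transformers with a public ViT-GPT2 captioning
# checkpoint as a stand-in for the paper's (unspecified) captioning model.
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_name = "nlpconnect/vit-gpt2-image-captioning"  # stand-in checkpoint
model = VisionEncoderDecoderModel.from_pretrained(model_name)
processor = ViTImageProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def caption_street_view(image_path: str) -> str:
    """Return a one-sentence description of a single street view image."""
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=32, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical usage (the filename is illustrative):
# print(caption_street_view("wuhan_scene_001.jpg"))
```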
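
And a sketch of the second stage: fine-tuning a BERT classifier on the generated captions to predict one of the 10 urban land-use categories, illustrating the pre-train & fine-tune paradigm the abstract advocates. The record gives no details of SFT-BERT itself, so this is a generic sequence-classification setup with hypothetical captions and labels, not a reconstruction of the authors' model.

```python
# Sketch of stage 2: caption text -> land-use category via fine-tuned BERT.
# Generic pre-train & fine-tune setup; SFT-BERT's specifics are not given in
# this record, and the caption/label below are hypothetical examples.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

NUM_LAND_USE_CLASSES = 10  # the paper classifies 10 urban land-use categories

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_LAND_USE_CLASSES
)

captions = ["a wide road lined with shops and parked cars"]  # hypothetical caption
labels = torch.tensor([3])                                   # hypothetical class id

batch = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
optimizer.zero_grad()
outputs = model(**batch, labels=labels)  # cross-entropy loss over 10 classes
outputs.loss.backward()                  # one illustrative fine-tuning step
optimizer.step()
```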