Migratable urban street scene sensing method based on vision language pre-trained model
We propose a geographically reproducible approach to urban scene sensing based on large-scale pre-trained models. With the rise of GeoAI research, many high-quality urban observation datasets and deep learning models have emerged. However, geospatial heterogeneity makes these resources challenging to share and migrate to new application scenarios.
Main Authors: | Yan Zhang; Fan Zhang; Nengcheng Chen |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2022-09-01 |
Series: | International Journal of Applied Earth Observations and Geoinformation |
Subjects: | GeoAI; Natural language processing; Data translation; Pretrained model; Street view; Multi-modal |
Online Access: | http://www.sciencedirect.com/science/article/pii/S1569843222001807 |
_version_ | 1811183438667972608 |
author | Yan Zhang; Fan Zhang; Nengcheng Chen
author_facet | Yan Zhang; Fan Zhang; Nengcheng Chen
author_sort | Yan Zhang |
collection | DOAJ |
description | We propose a geographically reproducible approach to urban scene sensing based on large-scale pre-trained models. With the rise of GeoAI research, many high-quality urban observation datasets and deep learning models have emerged. However, geospatial heterogeneity makes these resources challenging to share and migrate to new application scenarios. As an example, this paper introduces a vision-language and semantic pre-trained model for street view image analysis. This bridges the boundaries of data formats under location coupling, allowing for the acquisition of objective text-image descriptions of urban scenes in physical space from the human perspective, including entities, entity attributes, and the relationships between entities. In addition, we propose the SFT-BERT model to extract text feature sets for 10 urban land use categories from 8,923 scenes in Wuhan. The results show that our method outperforms seven baseline models, including computer vision models, and improves accuracy by 15% over traditional deep learning methods, demonstrating the potential of a pre-train & fine-tune paradigm for GIS spatial analysis. Our model can also be reused in other cities, and more accurate image descriptions and scene judgments can be obtained by inputting street view images from different angles. The code is shared at: github.com/yemanzhongting/CityCaption. |
first_indexed | 2024-04-11T09:46:03Z |
format | Article |
id | doaj.art-19a7e3a3be204eeb91efae617da6c86c |
institution | Directory Open Access Journal |
issn | 1569-8432 |
language | English |
last_indexed | 2024-04-11T09:46:03Z |
publishDate | 2022-09-01 |
publisher | Elsevier |
record_format | Article |
series | International Journal of Applied Earth Observations and Geoinformation |
spelling | doaj.art-19a7e3a3be204eeb91efae617da6c86c | 2022-12-22T04:30:57Z | eng | Elsevier | International Journal of Applied Earth Observations and Geoinformation | 1569-8432 | 2022-09-01 | Vol. 113, Article 102989 | Migratable urban street scene sensing method based on vision language pre-trained model
Yan Zhang (State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China)
Fan Zhang (Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology, 999077, Hong Kong, China; Senseable City Laboratory, Massachusetts Institute of Technology, MA 02139, USA)
Nengcheng Chen (State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing, Wuhan University, Wuhan 430079, China; National Engineering Research Center of Geographic Information System, China University of Geosciences, Wuhan 430074, China; corresponding author)
http://www.sciencedirect.com/science/article/pii/S1569843222001807 | GeoAI; Natural language processing; Data translation; Pretrained model; Street view; Multi-modal |
spellingShingle | Yan Zhang; Fan Zhang; Nengcheng Chen | Migratable urban street scene sensing method based on vision language pre-trained model | International Journal of Applied Earth Observations and Geoinformation | GeoAI; Natural language processing; Data translation; Pretrained model; Street view; Multi-modal |
title | Migratable urban street scene sensing method based on vision language pre-trained model |
title_full | Migratable urban street scene sensing method based on vision language pre-trained model |
title_fullStr | Migratable urban street scene sensing method based on vision language pre-trained model |
title_full_unstemmed | Migratable urban street scene sensing method based on vision language pre-trained model |
title_short | Migratable urban street scene sensing method based on vision language pre-trained model |
title_sort | migratable urban street scene sensing method based on vision language pre trained model |
topic | GeoAI; Natural language processing; Data translation; Pretrained model; Street view; Multi-modal |
url | http://www.sciencedirect.com/science/article/pii/S1569843222001807 |
work_keys_str_mv | AT yanzhang migratableurbanstreetscenesensingmethodbasedonvisionlanguagepretrainedmodel AT fanzhang migratableurbanstreetscenesensingmethodbasedonvisionlanguagepretrainedmodel AT nengchengchen migratableurbanstreetscenesensingmethodbasedonvisionlanguagepretrainedmodel |
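The abstract describes a two-step idea: a vision-language model first turns a street view image into a textual scene description, and a fine-tuned text model (SFT-BERT in the paper) then maps that description to one of 10 urban land-use categories. The sketch below illustrates only the second step with a dependency-free stand-in: a naive Bayes classifier over toy captions. The captions, category names, and classifier choice are all hypothetical illustrations, not the paper's actual data or SFT-BERT architecture.

```python
# Hypothetical stand-in for the caption-to-land-use classification step.
# The paper fine-tunes a BERT-based model; here a simple naive Bayes text
# classifier over toy captions shows the same input/output contract.
from collections import Counter, defaultdict
import math

def train_nb(samples):
    """samples: list of (caption, label). Returns class counts, per-class
    word counts, and the vocabulary for Laplace-smoothed naive Bayes."""
    label_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for caption, label in samples:
        words = caption.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def classify(caption, label_counts, word_counts, vocab):
    """Return the most probable label for a caption under the trained model."""
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label, n in label_counts.items():
        lp = math.log(n / total)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in caption.lower().split():
            # Laplace smoothing so unseen words do not zero out the class.
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Toy captions standing in for model-generated image descriptions
# (hypothetical data; the paper uses 8,923 Wuhan scenes and 10 categories).
train = [
    ("a wide road with cars and a traffic light", "transport"),
    ("buses and trucks driving on a highway", "transport"),
    ("a park with trees and green grass", "greenspace"),
    ("people walking among trees in a garden", "greenspace"),
    ("tall office buildings along the street", "commercial"),
    ("shops and signs on a busy commercial street", "commercial"),
]
model = train_nb(train)
print(classify("trees and grass beside a quiet path", *model))  # → greenspace
```

In the real pipeline this classifier would be replaced by fine-tuning a pre-trained language model on the generated captions, which is what lets the approach transfer across cities: only the lightweight fine-tuning step depends on local labels.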