An End-to-End Named Entity Recognition Platform for Vietnamese Real Estate Advertisement Posts and Analytical Applications
The volume and complexity of publicly available real estate data have been snowballing. As a result, information extraction and processing have become increasingly challenging and essential for many PropTech (Property Technology) companies worldwide. The challenges are even more pronounced with lang...
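Main Authors: | Binh T. Nguyen, Tung Tran Nguyen Doan, Son Thanh Huynh, Khanh Quoc Tran, An Trong Nguyen, An Tran-Hoai Le, Anh Minh Tran, Nhi Ho, Trung T. Nguyen, Dang T. Huynh |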
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2022-01-01 |
Series: | IEEE Access |
Subjects: | Information extraction; information retrieval and text mining; NLP applications |
Online Access: | https://ieeexplore.ieee.org/document/9846984/ |
_version_ | 1828104137629237248 |
---|---|
author | Binh T. Nguyen; Tung Tran Nguyen Doan; Son Thanh Huynh; Khanh Quoc Tran; An Trong Nguyen; An Tran-Hoai Le; Anh Minh Tran; Nhi Ho; Trung T. Nguyen; Dang T. Huynh |
author_sort | Binh T. Nguyen |
collection | DOAJ |
description | The volume and complexity of publicly available real estate data have grown rapidly. As a result, information extraction and processing have become both more challenging and more essential for many PropTech (Property Technology) companies worldwide. The challenges are even more pronounced for languages other than English, such as Vietnamese, for which few studies in this field exist. This paper presents an end-to-end framework for Vietnamese real estate advertisement posts that automatically collects posts from different data sources, extracts useful information, and stores the computed data in data warehouses and data marts. The aggregated data can then be served for descriptive and predictive analytics. The extraction step combines two models: Noise Filtering and Named Entity Recognition (NER). These models process the initial input data and extract the useful information. The experimental results show that $\text{PhoBERT}_{large}$ achieves the best performance among the compared approaches. The Noise Filtering module and the NER module obtain F1 scores of 0.8697 and 0.8996, respectively. Finally, we use Superset to implement analytic dashboards that visualize the predicted results and support further analysis and management processes. |
first_indexed | 2024-04-11T09:41:03Z |
format | Article |
id | doaj.art-7de9011ea16144db9e0f9a4038993f16 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-04-11T09:41:03Z |
publishDate | 2022-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-7de9011ea16144db9e0f9a4038993f16; 2022-12-22T04:31:11Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2022-01-01; vol. 10, pp. 87681-87697; DOI 10.1109/ACCESS.2022.3195496; article no. 9846984; An End-to-End Named Entity Recognition Platform for Vietnamese Real Estate Advertisement Posts and Analytical Applications; Authors: Binh T. Nguyen [0] (https://orcid.org/0000-0001-5249-9702), Tung Tran Nguyen Doan [1] (https://orcid.org/0000-0003-4659-6164), Son Thanh Huynh [2], Khanh Quoc Tran [3] (https://orcid.org/0000-0003-1288-8003), An Trong Nguyen [4] (https://orcid.org/0000-0002-7782-8389), An Tran-Hoai Le [5] (https://orcid.org/0000-0002-0521-963X), Anh Minh Tran [6], Nhi Ho [7], Trung T. Nguyen [8], Dang T. Huynh [9]; Affiliations, in the order listed: [0] Department of Computer Science, Faculty of Mathematics and Computer Science, Vietnam National University Ho Chi Minh City (VNUHCM)-University of Science, Ho Chi Minh City, Vietnam; [1] AISIA Research Laboratory, Ho Chi Minh City, Vietnam; [2] Department of Computer Science, Faculty of Mathematics and Computer Science, Vietnam National University Ho Chi Minh City (VNUHCM)-University of Science, Ho Chi Minh City, Vietnam; [3] Vietnam National University Ho Chi Minh City (VNUHCM), Ho Chi Minh City, Vietnam; [4] Vietnam National University Ho Chi Minh City (VNUHCM), Ho Chi Minh City, Vietnam; [5] Vietnam National University Ho Chi Minh City (VNUHCM), Ho Chi Minh City, Vietnam; [6] Department of Computer Science, Faculty of Mathematics and Computer Science, Vietnam National University Ho Chi Minh City (VNUHCM)-University of Science, Ho Chi Minh City, Vietnam; [7] Hung Thinh Corporation, Ho Chi Minh City, Vietnam; [8] Hung Thinh Corporation, Ho Chi Minh City, Vietnam; [9] Department of Computer Science, Faculty of Mathematics and Computer Science, Vietnam National University Ho Chi Minh City (VNUHCM)-University of Science, Ho Chi Minh City, Vietnam; https://ieeexplore.ieee.org/document/9846984/; Keywords: Information extraction; information retrieval and text mining; NLP applications |
title | An End-to-End Named Entity Recognition Platform for Vietnamese Real Estate Advertisement Posts and Analytical Applications |
topic | Information extraction; information retrieval and text mining; NLP applications |
url | https://ieeexplore.ieee.org/document/9846984/ |