EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing
In scene parsing, the model is required to be able to process complex multi-modal data such as images and contexts in real scenes, and discover their implicit connections from objects existing in the scene. As a storage method that contains entity information and the relationship between entities, a...
Main Authors: | Sai Ma, Weibing Wan, Zedong Yu, Yuming Zhao |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-06-01 |
Series: | Applied Sciences |
Subjects: | scene parsing, knowledge graph, multi-modality |
Online Access: | https://www.mdpi.com/2076-3417/13/12/7115 |
_version_ | 1797596211656523776 |
---|---|
author | Sai Ma Weibing Wan Zedong Yu Yuming Zhao |
author_facet | Sai Ma Weibing Wan Zedong Yu Yuming Zhao |
author_sort | Sai Ma |
collection | DOAJ |
description | In scene parsing, the model is required to be able to process complex multi-modal data such as images and contexts in real scenes, and discover their implicit connections from objects existing in the scene. As a storage method that contains entity information and the relationship between entities, a knowledge graph can well express objects and the semantic relationship between objects in the scene. In this paper, a new multi-phase process was proposed to solve scene parsing tasks; first, a knowledge graph was used to align the multi-modal information and then the graph-based model generates results. We also designed an experiment of feature engineering’s validation for a deep-learning model to preliminarily verify the effectiveness of this method. Hence, we proposed a knowledge representation method named Entity Descriptor Encoder of Transformer (EDET), which uses both the entity itself and its internal attributes for knowledge representation. This method can be embedded into the transformer structure to solve multi-modal scene parsing tasks. EDET can aggregate the multi-modal attributes of entities, and the results in the scene graph generation and image captioning tasks prove that EDET has excellent performance in multi-modal fields. Finally, the proposed method was applied to the industrial scene, which confirmed the viability of our method. |
first_indexed | 2024-03-11T02:48:22Z |
format | Article |
id | doaj.art-0d8fde09bb804b65938b50726d7479eb |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-11T02:48:22Z |
publishDate | 2023-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-0d8fde09bb804b65938b50726d7479eb; 2023-11-18T09:09:12Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2023-06-01; 13; 12; 7115; 10.3390/app13127115; EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing; Sai Ma (0); Weibing Wan (1); Zedong Yu (2); Yuming Zhao (3); (0) Department of Computer, Shanghai University of Engineering Science, Shanghai 201620, China; (1) Department of Computer, Shanghai University of Engineering Science, Shanghai 201620, China; (2) Department of Computer, Shanghai University of Engineering Science, Shanghai 201620, China; (3) Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China; In scene parsing, the model is required to be able to process complex multi-modal data such as images and contexts in real scenes, and discover their implicit connections from objects existing in the scene. As a storage method that contains entity information and the relationship between entities, a knowledge graph can well express objects and the semantic relationship between objects in the scene. In this paper, a new multi-phase process was proposed to solve scene parsing tasks; first, a knowledge graph was used to align the multi-modal information and then the graph-based model generates results. We also designed an experiment of feature engineering’s validation for a deep-learning model to preliminarily verify the effectiveness of this method. Hence, we proposed a knowledge representation method named Entity Descriptor Encoder of Transformer (EDET), which uses both the entity itself and its internal attributes for knowledge representation. This method can be embedded into the transformer structure to solve multi-modal scene parsing tasks. EDET can aggregate the multi-modal attributes of entities, and the results in the scene graph generation and image captioning tasks prove that EDET has excellent performance in multi-modal fields. Finally, the proposed method was applied to the industrial scene, which confirmed the viability of our method.; https://www.mdpi.com/2076-3417/13/12/7115; scene parsing; knowledge graph; multi-modality |
spellingShingle | Sai Ma Weibing Wan Zedong Yu Yuming Zhao EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing Applied Sciences scene parsing knowledge graph multi-modality |
title | EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing |
title_full | EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing |
title_fullStr | EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing |
title_full_unstemmed | EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing |
title_short | EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing |
title_sort | edet entity descriptor encoder of transformer for multi modal knowledge graph in scene parsing |
topic | scene parsing knowledge graph multi-modality |
url | https://www.mdpi.com/2076-3417/13/12/7115 |
work_keys_str_mv | AT saima edetentitydescriptorencoderoftransformerformultimodalknowledgegraphinsceneparsing AT weibingwan edetentitydescriptorencoderoftransformerformultimodalknowledgegraphinsceneparsing AT zedongyu edetentitydescriptorencoderoftransformerformultimodalknowledgegraphinsceneparsing AT yumingzhao edetentitydescriptorencoderoftransformerformultimodalknowledgegraphinsceneparsing |