EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing

In scene parsing, a model must process complex multi-modal data, such as images and their context in real scenes, and discover the implicit connections among the objects in the scene. As a storage structure that holds both entity information and the relationships between entities, a...


Bibliographic Details
Main Authors: Sai Ma, Weibing Wan, Zedong Yu, Yuming Zhao
Format: Article
Language: English
Published: MDPI AG 2023-06-01
Series: Applied Sciences
Subjects: scene parsing; knowledge graph; multi-modality
Online Access: https://www.mdpi.com/2076-3417/13/12/7115
author Sai Ma
Weibing Wan
Zedong Yu
Yuming Zhao
collection DOAJ
description In scene parsing, a model must process complex multi-modal data, such as images and their context in real scenes, and discover the implicit connections among the objects in the scene. As a storage structure that holds both entity information and the relationships between entities, a knowledge graph can express the objects in a scene and the semantic relationships between them. This paper proposes a new multi-phase process for scene parsing tasks: first, a knowledge graph aligns the multi-modal information; then, a graph-based model generates the results. We also design a feature-engineering validation experiment for a deep-learning model to preliminarily verify the effectiveness of this method. We then propose a knowledge representation method named Entity Descriptor Encoder of Transformer (EDET), which uses both the entity itself and its internal attributes for knowledge representation. This method can be embedded into the transformer structure to solve multi-modal scene parsing tasks. EDET aggregates the multi-modal attributes of entities, and results on scene graph generation and image captioning tasks indicate that EDET performs well in multi-modal settings. Finally, the proposed method was applied to an industrial scene, which confirmed its viability.
first_indexed 2024-03-11T02:48:22Z
format Article
id doaj.art-0d8fde09bb804b65938b50726d7479eb
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-11T02:48:22Z
publishDate 2023-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-0d8fde09bb804b65938b50726d7479eb
2023-11-18T09:09:12Z
eng
MDPI AG
Applied Sciences
2076-3417
2023-06-01
13
12
7115
10.3390/app13127115
EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing
Sai Ma (0)
Weibing Wan (1)
Zedong Yu (2)
Yuming Zhao (3)
(0) Department of Computer, Shanghai University of Engineering Science, Shanghai 201620, China
(1) Department of Computer, Shanghai University of Engineering Science, Shanghai 201620, China
(2) Department of Computer, Shanghai University of Engineering Science, Shanghai 201620, China
(3) Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China
https://www.mdpi.com/2076-3417/13/12/7115
scene parsing
knowledge graph
multi-modality
title EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing
topic scene parsing
knowledge graph
multi-modality
url https://www.mdpi.com/2076-3417/13/12/7115
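
The description above says that a knowledge graph stores entities together with the relationships between them, and that EDET represents an entity using both the entity itself and its internal attributes. As a rough illustration only, the Python sketch below stores scene entities with attribute vectors and pools them into a single descriptor. The class names, the `describe` function, the mean-pooling step, and the sample worker/helmet data are all assumptions made for this example; the record does not specify the paper's data layout or aggregation mechanism.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    """A scene object plus its attribute vectors (hypothetical layout)."""
    name: str
    embedding: List[float]  # vector for the entity itself
    attributes: Dict[str, List[float]] = field(default_factory=dict)


@dataclass
class SceneGraph:
    """Entities plus (subject, predicate, object) triples."""
    entities: Dict[str, Entity] = field(default_factory=dict)
    triples: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        # Only allow relations between entities already in the graph.
        if subject not in self.entities or obj not in self.entities:
            raise KeyError("both endpoints must be added before relating them")
        self.triples.append((subject, predicate, obj))

    def neighbors(self, name: str) -> List[Tuple[str, str]]:
        # Outgoing (predicate, object) pairs for one entity.
        return [(p, o) for s, p, o in self.triples if s == name]


def describe(entity: Entity) -> List[float]:
    """Mean-pool the entity vector with its attribute vectors.

    Mean pooling is a stand-in chosen for brevity (an assumption),
    not the aggregation used in the paper.
    """
    vectors = [entity.embedding, *entity.attributes.values()]
    dim = len(entity.embedding)
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share the entity's dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical industrial-scene example.
graph = SceneGraph()
graph.add_entity(
    Entity("worker", [1.0, 0.0], {"visual": [0.0, 1.0], "text": [2.0, 2.0]})
)
graph.add_entity(Entity("helmet", [0.5, 0.5]))
graph.relate("worker", "wears", "helmet")
descriptor = describe(graph.entities["worker"])
```

In a transformer-based pipeline, a descriptor of this kind would presumably serve as one input token per entity, with the triples supplying the relational structure; that reading is an interpretation of the description, not a detail stated in the record.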