Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence
Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods place instances in a single embedding space. However, such methods struggle to embed instances because matching the visual dynamics in videos to the textual features in sentences is difficult, and a single space is not enough to accommodate a wide variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces so that we can capture multiple relationships between instances, leading to compelling video retrieval. We produce a final similarity between instances by fusing the similarities measured in each embedding space with a weighted sum, and we determine the weights according to the query sentence, so the model can flexibly emphasize a particular embedding space. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, with results competitive with state-of-the-art methods, demonstrating the effectiveness of the proposed multiple-embedding approach.
Main Authors: | Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, Shinichiro Omachi |
---|---|
Affiliation: | Graduate School of Engineering, Tohoku University, Sendai 9808579, Japan |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-04-01 |
Series: | Applied Sciences, Vol. 11, No. 7, Art. 3214 |
ISSN: | 2076-3417 |
DOI: | 10.3390/app11073214 |
Subjects: | video retrieval; visual-semantic embedding; multiple embedding spaces |
Collection: | DOAJ (Directory of Open Access Journals) |
Online Access: | https://www.mdpi.com/2076-3417/11/7/3214 |
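The fusion mechanism described in the abstract (per-space similarities combined by a sentence-conditioned weighted sum) can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the module names, feature dimensions, and the softmax weighting head are assumptions, while the overall structure follows the abstract, with one projection head per embedding space for each modality, a cosine similarity in each space, and fusion weights predicted from the sentence feature.

```python
# Minimal sketch of sentence-conditioned similarity fusion over multiple
# embedding spaces. Dimensions, module names, and the softmax weighting
# head are illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceSimilarity(nn.Module):
    def __init__(self, video_dim=512, sent_dim=300, embed_dim=256, num_spaces=3):
        super().__init__()
        # One projection head per embedding space, for each modality.
        self.video_heads = nn.ModuleList(
            [nn.Linear(video_dim, embed_dim) for _ in range(num_spaces)])
        self.sent_heads = nn.ModuleList(
            [nn.Linear(sent_dim, embed_dim) for _ in range(num_spaces)])
        # Fusion weights over the spaces are predicted from the sentence.
        self.weight_head = nn.Linear(sent_dim, num_spaces)

    def forward(self, video_feat, sent_feat):
        # video_feat: (B, video_dim); sent_feat: (B, sent_dim)
        sims = []
        for v_head, s_head in zip(self.video_heads, self.sent_heads):
            v = F.normalize(v_head(video_feat), dim=-1)
            s = F.normalize(s_head(sent_feat), dim=-1)
            sims.append((v * s).sum(dim=-1))      # cosine similarity per space
        sims = torch.stack(sims, dim=-1)          # (B, num_spaces)
        weights = torch.softmax(self.weight_head(sent_feat), dim=-1)
        return (weights * sims).sum(dim=-1)       # fused similarity, shape (B,)

# Usage: score a batch of 4 video/sentence feature pairs.
model = MultiSpaceSimilarity()
scores = model(torch.randn(4, 512), torch.randn(4, 300))
print(scores.shape)  # torch.Size([4])
```

Because the weights are a function of the sentence, a query emphasizing motion could in principle put more mass on a space attuned to temporal features, which is the flexibility the abstract refers to.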