Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Bibliographic Details
Main Authors: Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, Shinichiro Omachi
Format: Article
Language: English
Published: MDPI AG, 2021-04-01
Series: Applied Sciences
ISSN: 2076-3417
DOI: 10.3390/app11073214
Subjects: video retrieval; visual-semantic embedding; multiple embedding spaces
Online Access: https://www.mdpi.com/2076-3417/11/7/3214
Affiliation: Graduate School of Engineering, Tohoku University, Sendai 9808579, Japan

Description
Visual-semantic embedding aims to learn a joint embedding space in which related video and sentence instances lie close to each other. Most existing methods place all instances in a single embedding space. However, they struggle to embed instances well because matching the visual dynamics of videos to the textual features of sentences is difficult, and a single space is not enough to accommodate the variety of videos and sentences. In this paper, we propose a novel framework that maps instances into multiple individual embedding spaces, allowing us to capture multiple relationships between instances and leading to compelling video retrieval. We produce the final similarity between two instances by fusing the similarities measured in each embedding space with a weighted sum, where the weights are determined from the query sentence; this lets us flexibly emphasize particular embedding spaces. We conducted sentence-to-video retrieval experiments on a benchmark dataset. The proposed method achieved superior performance, with results competitive with state-of-the-art methods, demonstrating the effectiveness of the proposed multiple embedding approach compared with existing methods.
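
The fusion strategy described above can be sketched in a few lines. The following is a minimal, hypothetical PyTorch sketch, not the authors' actual architecture: the feature dimensions, the per-space linear projections, and the softmax head that turns the sentence feature into per-space weights are all illustrative assumptions.

```python
# Minimal sketch of weighted-sum similarity fusion over multiple embedding
# spaces. All dimensions and module choices are illustrative assumptions,
# not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSpaceSimilarity(nn.Module):
    def __init__(self, video_dim=2048, sent_dim=768, embed_dim=512, num_spaces=3):
        super().__init__()
        # One projection pair per embedding space.
        self.video_proj = nn.ModuleList(
            [nn.Linear(video_dim, embed_dim) for _ in range(num_spaces)])
        self.sent_proj = nn.ModuleList(
            [nn.Linear(sent_dim, embed_dim) for _ in range(num_spaces)])
        # Sentence-conditioned head producing one weight per space.
        self.weight_head = nn.Linear(sent_dim, num_spaces)

    def forward(self, video_feat, sent_feat):
        # video_feat: (B, video_dim), sent_feat: (B, sent_dim)
        sims = []
        for v_proj, s_proj in zip(self.video_proj, self.sent_proj):
            v = F.normalize(v_proj(video_feat), dim=-1)
            s = F.normalize(s_proj(sent_feat), dim=-1)
            sims.append((v * s).sum(dim=-1))  # cosine similarity in this space
        sims = torch.stack(sims, dim=-1)                           # (B, num_spaces)
        weights = F.softmax(self.weight_head(sent_feat), dim=-1)  # sentence-dependent
        return (weights * sims).sum(dim=-1)                        # fused score, (B,)

# Example: score a batch of 4 video-sentence pairs with random features.
model = MultiSpaceSimilarity()
score = model(torch.randn(4, 2048), torch.randn(4, 768))
print(score.shape)  # torch.Size([4])
```

In this sketch the weights sum to one via the softmax, so the fused score stays on the same scale as the per-space cosine similarities while the query sentence decides how much each space contributes.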