Audio captioning and retrieval with improved cross-modal objectives

Automated Audio Captioning (AAC) is the task of generating descriptive captions from an input audio clip, while Language-Based Audio Retrieval (LBAR) is the task of retrieving the most relevant audio clip based on an input text query. AAC requires a model that is not only able to comprehend the acou...

Bibliographic details
Main author:	Koh, Andrew Jin Jie
Other authors:	Chng Eng Siong
Format:	Thesis-Doctor of Philosophy
Language:	English
Published:	Nanyang Technological University 2023
Subjects:	Engineering::Computer science and engineering
Online access:	https://hdl.handle.net/10356/172437

Related records

Cross-modal graph with meta concepts for video captioning
by: Wang, Hao, et al.
Published: (2022)

Improved image captioning techniques with comparative study
by: He, Cari
Published: (2021)

Audio pattern discovery and retrieval
by: Wang, Lei
Published: (2013)

Deep learning-based image captioning
by: Chong, Kaydon
Published: (2019)

Neural image and video captioning (NIVC)
by: Lee, Jeremy Kian Kiat
Published: (2022)

Incorporating additional knowledge into image captioners
by: Xu, Yang
Published: (2021)

Evaluations of training paradigms in neural image captioning
by: Lee, Si Min
Published: (2019)

Towards abstractive captioning of infographics
by: Landman, Nathan, M. Eng. Massachusetts Institute of Technology
Published: (2018)

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning
by: Zhengxin Li, et al.
Published: (2024-01-01)

A vector-based approach to broadcast audio database indexing and retrieval
by: Wang, Lei, et al.
Published: (2013)

Text-based image retrieval using image captioning
by: Tan, Kah Hwa
Published: (2019)

Paired cross-modal data augmentation for fine-grained image-to-text retrieval
by: Wang, Hao, et al.
Published: (2023)

Deep robust multilevel semantic hashing for multi-label cross-modal retrieval
by: Song, Ge, et al.
Published: (2023)

Grounded semantic parsing using captioned videos
by: Ross, Candace Cheronda
Published: (2018)

Deconfounded image captioning: a causal retrospect
by: Yang, Xu, et al.
Published: (2022)

Multi-modal reinforcement learning with videogame audio to learn sonic features
by: Nadeem, Faraaz.
Published: (2021)

Understanding what a captioning network doesn't know
by: Yip, Richard B., M. Eng. Massachusetts Institute of Technology.
Published: (2019)

Automatic closed caption generation from video files
by: Tan, Kenneth Chengwei
Published: (2014)

Whispersync : close caption (live-following) of the read speech in a close cation
by: Lam, Chun Yin
Published: (2015)

SWORS : a system for the efficient retrieval of relevant spatial web objects
by: Cao, Xin, et al.
Published: (2013)

Distance metric learning for multi-modal image retrieval and annotation
by: Wu, Pengcheng
Published: (2014)

A framework for efficient spatial web object retrieval
by: Jensen, Christian S., et al.
Published: (2013)

Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
by: Xu, Yuecong, et al.
Published: (2021)

Image retrieval with a multi-modality ontology
by: Wang, Huan
Published: (2010)

Online weighted hashing for cross-modal retrieval
by: Jiang, Zining
Published: (2022)

Audio fingerprint application for media industry
by: Kusuma, Andrew Putra
Published: (2018)

Introduction to the special issue on new subjective and objective methodologies for audio and visual signal processing
by: Loizou, Philip C., et al.
Published: (2013)

Learning decoupled models for cross-modal generation
by: Wang, Hao
Published: (2023)

Learning to collocate Visual-Linguistic Neural Modules for image captioning
by: Yang, Xu, et al.
Published: (2023)

Context-aware visual policy network for fine-grained image captioning
by: Zha, Zheng-Jun, et al.
Published: (2022)

COMIC: Toward A Compact Image Captioning Model With Attention
by: Tan, Jia Huei, et al.
Published: (2019)

An object properties filter for multi-modality ontology semantic image retrieval
by: Sulaiman, Mohd Suffian, et al.
Published: (2017)

Efficient object recognition and image retrieval for large-scale applications
by: Lee, John Jaesung
Published: (2009)

Stack-VS : stacked visual-semantic attention for image caption generation
by: Cheng, Ling, et al.
Published: (2021)

Interconnectable blocks for music and audio processing
by: McPherson, Andrew, 1982-
Published: (2006)

An object-oriented, logic based approach to document retrieval
by: Tan, Nam Beng.
Published: (2009)

Multimodal audio-visual emotion detection
by: Chaudhary, Nitesh Kumar
Published: (2021)

Automated image captioning
by: Teo, Sabrina Jingya
Published: (2017)

Neural image and video captioning
by: Lam, Ting En
Published: (2024)

NTU smart audio tour
by: Aw, Li Jun
Published: (2017)