Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image, and image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO), on which it outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvements over state-of-the-art methods that directly optimize the ranking objective function for retrieval.
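For a concrete picture of the architecture described in the abstract, the snippet below is a minimal PyTorch-style sketch, not the authors' implementation: the class name MRNNSketch, the layer sizes, the GRU cell, and the sum-then-tanh fusion in the multimodal layer are illustrative assumptions. Only the overall structure (a recurrent sentence sub-network and CNN image features fused in a multimodal layer that predicts the next word) follows the description above.

```python
# Minimal sketch of the m-RNN idea (assumed PyTorch implementation).
# It models P(w_t | w_1..w_{t-1}, image): a recurrent layer encodes the word
# history, a pretrained CNN provides the image feature, and a multimodal layer
# fuses the two before a softmax over the vocabulary.
import torch
import torch.nn as nn

class MRNNSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, rnn_dim=256,
                 image_dim=4096, multimodal_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # word embedding layer
        self.rnn = nn.GRU(embed_dim, rnn_dim, batch_first=True)   # sentence sub-network
        # Projections of each input into a shared multimodal space.
        self.word_proj = nn.Linear(embed_dim, multimodal_dim)
        self.rnn_proj = nn.Linear(rnn_dim, multimodal_dim)
        self.image_proj = nn.Linear(image_dim, multimodal_dim)
        self.classifier = nn.Linear(multimodal_dim, vocab_size)   # next-word scores

    def forward(self, word_ids, image_feats):
        # word_ids: (batch, T) previous words; image_feats: (batch, image_dim)
        # image_feats come from the image sub-network (a deep CNN), not shown here.
        w = self.embed(word_ids)                          # (batch, T, embed_dim)
        h, _ = self.rnn(w)                                # (batch, T, rnn_dim)
        img = self.image_proj(image_feats).unsqueeze(1)   # broadcast over time steps
        m = torch.tanh(self.word_proj(w) + self.rnn_proj(h) + img)  # multimodal layer
        return self.classifier(m)                         # logits over the vocabulary

# Usage sketch: logits = MRNNSketch(vocab_size=10000)(word_ids, cnn_features).
# Training minimizes cross-entropy against the next word; captions are generated
# by sampling or beam search from the resulting word distribution.
```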


Bibliographic Details
Main Authors: Mao, Junhua; Xu, Wei; Yang, Yi; Wang, Jiang; Huang, Zhiheng; Yuille, Alan L.
Format: Technical Report
Language: English (en_US)
Published: Center for Brains, Minds and Machines (CBMM), arXiv, 2015
Subjects: multimodal Recurrent Neural Network (m-RNN); Artificial Intelligence; Computer Language
Online Access: http://hdl.handle.net/1721.1/100198
Identifier: arXiv:1412.6632
Series: CBMM Memo Series; 033
Institution: Massachusetts Institute of Technology
Date Issued: 2015-05-07
License: Attribution-NonCommercial 3.0 United States (http://creativecommons.org/licenses/by-nc/3.0/us/)
Funding: This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.