Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image, and image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO), on which it outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvements over state-of-the-art methods that directly optimize the ranking objective function for retrieval.
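For a concrete picture of the architecture described in the abstract, the snippet below is a minimal PyTorch-style sketch, not the authors' implementation: the class name MRNNSketch, the layer sizes, the GRU cell, and the sum-then-tanh fusion in the multimodal layer are illustrative assumptions. Only the overall structure (a recurrent sentence sub-network and CNN image features fused in a multimodal layer that predicts the next word) follows the description above.

```python
# Minimal sketch of the m-RNN idea (assumed PyTorch implementation).
# It models P(w_t | w_1..w_{t-1}, image): a recurrent layer encodes the word
# history, a pretrained CNN provides the image feature, and a multimodal layer
# fuses the two before a softmax over the vocabulary.
import torch
import torch.nn as nn

class MRNNSketch(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, rnn_dim=256,
                 image_dim=4096, multimodal_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # word embedding layer
        self.rnn = nn.GRU(embed_dim, rnn_dim, batch_first=True)   # sentence sub-network
        # Projections of each input into a shared multimodal space.
        self.word_proj = nn.Linear(embed_dim, multimodal_dim)
        self.rnn_proj = nn.Linear(rnn_dim, multimodal_dim)
        self.image_proj = nn.Linear(image_dim, multimodal_dim)
        self.classifier = nn.Linear(multimodal_dim, vocab_size)   # next-word scores

    def forward(self, word_ids, image_feats):
        # word_ids: (batch, T) previous words; image_feats: (batch, image_dim)
        # image_feats come from the image sub-network (a deep CNN), not shown here.
        w = self.embed(word_ids)                          # (batch, T, embed_dim)
        h, _ = self.rnn(w)                                # (batch, T, rnn_dim)
        img = self.image_proj(image_feats).unsqueeze(1)   # broadcast over time steps
        m = torch.tanh(self.word_proj(w) + self.rnn_proj(h) + img)  # multimodal layer
        return self.classifier(m)                         # logits over the vocabulary

# Usage sketch: logits = MRNNSketch(vocab_size=10000)(word_ids, cnn_features).
# Training minimizes cross-entropy against the next word; captions are generated
# by sampling or beam search from the resulting word distribution.
```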


Bibliographic Details
Main Authors: Mao, Junhua; Xu, Wei; Yang, Yi; Wang, Jiang; Huang, Zhiheng; Yuille, Alan L.
Format: Technical Report
Language: English (en_US)
Published: Center for Brains, Minds and Machines (CBMM), arXiv, 2015
Subjects: multimodal Recurrent Neural Network (m-RNN); Artificial Intelligence; Computer Language
Online Access: http://hdl.handle.net/1721.1/100198
Identifier: arXiv:1412.6632
Series: CBMM Memo Series; 033
Institution: Massachusetts Institute of Technology
Date Issued: 2015-05-07
License: Attribution-NonCommercial 3.0 United States (http://creativecommons.org/licenses/by-nc/3.0/us/)
Funding: This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.