Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
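The abstract describes generating each caption word from a distribution conditioned on the previous words and the image, with a recurrent layer and an image feature fused in a multimodal layer. The sketch below illustrates one such decoding step in NumPy; all dimensions, weight names, and the random "image feature" are illustrative placeholders, not the paper's trained model (where the image feature would come from a deep convolutional network).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; the paper's actual layer dimensions differ.
vocab, d_word, d_rec, d_mm, d_img = 50, 8, 16, 12, 10

# Randomly initialized parameters stand in for trained weights.
E   = rng.normal(0, 0.1, (vocab, d_word))   # word embedding table
U_r = rng.normal(0, 0.1, (d_rec, d_rec))    # recurrent -> recurrent
U_w = rng.normal(0, 0.1, (d_rec, d_word))   # word -> recurrent
V_w = rng.normal(0, 0.1, (d_mm, d_word))    # word -> multimodal
V_r = rng.normal(0, 0.1, (d_mm, d_rec))     # recurrent -> multimodal
V_i = rng.normal(0, 0.1, (d_mm, d_img))     # image -> multimodal
W_o = rng.normal(0, 0.1, (vocab, d_mm))     # multimodal -> vocabulary logits

def softmax(x):
    z = np.exp(x - x.max())                 # shift for numerical stability
    return z / z.sum()

def mrnn_step(word_id, r_prev, img_feat):
    """One decoding step: fuse the current word, the recurrent state,
    and the image feature in a multimodal layer, then score the next word."""
    w = E[word_id]                                   # embed previous word
    r = np.tanh(U_r @ r_prev + U_w @ w)              # recurrent layer
    m = np.tanh(V_w @ w + V_r @ r + V_i @ img_feat)  # multimodal fusion
    p = softmax(W_o @ m)                             # P(next word | prev words, image)
    return r, p

img_feat = rng.normal(0, 1, d_img)   # placeholder for a CNN image feature
r = np.zeros(d_rec)
r, p = mrnn_step(word_id=3, r_prev=r, img_feat=img_feat)
print(p.shape, round(float(p.sum()), 6))   # a valid distribution over the vocabulary
```

A full caption is generated by sampling (or beam-searching) a word from `p`, feeding it back as the next `word_id`, and repeating until an end-of-sentence token; the image feature stays fixed across steps.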
Main Authors: | Mao, Junhua; Xu, Wei; Yang, Yi; Wang, Jiang; Huang, Zhiheng; Yuille, Alan L. |
---|---|
Format: | Technical Report |
Language: | en_US |
Published: | Center for Brains, Minds and Machines (CBMM), arXiv, 2015 |
Subjects: | multimodal Recurrent Neural Network (m-RNN); Artificial Intelligence; Computer Language |
Online Access: | http://hdl.handle.net/1721.1/100198 |
author | Mao, Junhua Xu, Wei Yang, Yi Wang, Jiang Huang, Zhiheng Yuille, Alan L. |
collection | MIT |
description | In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval. |
institution | Massachusetts Institute of Technology |
date issued | 2015-05-07 |
series | CBMM Memo Series; 033 |
identifier | arXiv:1412.6632 |
rights | Attribution-NonCommercial 3.0 United States (http://creativecommons.org/licenses/by-nc/3.0/us/) |
file format | application/pdf |
funding | This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. |
title | Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) |