Unified image and video saliency modeling

Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. While image saliency modeling is a well-studied problem and progress on benchmarks like SALICON and MIT300 is slowing, video saliency models have shown rapid gains on the recent DHF1K benchmark. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We identify different sources of domain shift between image and video saliency data, and between different video saliency datasets, as a key challenge for effective joint modeling. To address this, we propose four novel domain adaptation techniques (Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing, and Bypass-RNN) in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2, and UCF-Sports, and the image saliency datasets SALICON and MIT300. With one set of parameters, UNISAL achieves state-of-the-art performance on all video saliency datasets and is on par with the state of the art on the image saliency datasets, despite a faster runtime and a 5- to 20-fold smaller model size than all competing deep methods. We provide retrospective analyses and ablation studies that confirm the importance of the domain shift modeling. The code is available at https://github.com/rdroste/unisal.
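
As an illustration of one idea mentioned in the abstract, the sketch below shows how domain-adaptive learned Gaussian priors might look in code: a separate set of 2D Gaussian prior maps, each with a learnable mean and standard deviation, is kept per dataset (domain) and concatenated to the feature maps. This is a minimal PyTorch sketch; the class name, the number of priors, and the separable-Gaussian parameterization are assumptions made here for illustration, not the authors' implementation, which is available in the linked repository.

import torch
import torch.nn as nn

class DomainAdaptiveGaussianPriors(nn.Module):
    # Hypothetical sketch: one set of learned 2D Gaussian prior maps per domain (dataset).
    # Each prior has a learnable mean and log-std per axis; the maps are evaluated on the
    # feature grid and concatenated to the feature tensor.
    def __init__(self, num_domains, num_priors=16):
        super().__init__()
        self.mu = nn.Parameter(torch.rand(num_domains, num_priors, 2))          # means in [0, 1)
        self.log_sigma = nn.Parameter(torch.zeros(num_domains, num_priors, 2))  # log standard deviations

    def forward(self, features, domain_idx):
        # features: (batch, channels, H, W); domain_idx selects the dataset-specific priors
        b, _, h, w = features.shape
        ys = torch.linspace(0, 1, h, device=features.device).view(1, h, 1)
        xs = torch.linspace(0, 1, w, device=features.device).view(1, 1, w)
        mu = self.mu[domain_idx]                  # (num_priors, 2)
        sigma = self.log_sigma[domain_idx].exp()  # (num_priors, 2)
        # Separable 2D Gaussians evaluated on the H x W grid
        gy = torch.exp(-0.5 * ((ys - mu[:, 1].view(-1, 1, 1)) / sigma[:, 1].view(-1, 1, 1)) ** 2)
        gx = torch.exp(-0.5 * ((xs - mu[:, 0].view(-1, 1, 1)) / sigma[:, 0].view(-1, 1, 1)) ** 2)
        priors = (gy * gx).unsqueeze(0).expand(b, -1, -1, -1)  # (batch, num_priors, H, W)
        return torch.cat([features, priors], dim=1)

For example, with features of shape (2, 64, 12, 16), num_priors=16, and domain_idx=0, the output has shape (2, 80, 12, 16).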

Bibliographic Details
Main Authors: Droste, R; Jiao, J; Noble, JA
Format: Conference item
Language: English
Published: Springer 2020
Institution: University of Oxford