MOSE: a new dataset for video object segmentation in complex scenes

Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J & F) on existing datasets. However, since the target objects in these existing datasets are usu...


Bibliographic Details
Main Authors:	Ding, H; Liu, C; He, S; Jiang, X; Torr, P; Bai, S
Format:	Conference item
Language:	English
Published:	IEEE 2024

_version_	1826312828853682176
author	Ding, H; Liu, C; He, S; Jiang, X; Torr, P; Bai, S
collection	OXFORD
description	Video object segmentation (VOS) aims to segment a particular object throughout an entire video clip. State-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J & F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS in complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex scenarios. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is its complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4 different settings and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot perceive objects well in complex scenes. For example, under the semi-supervised VOS setting, the highest J & F achieved by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ∼90% J & F performance on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges in complex scenes, and more effort is needed to explore these challenges in the future.
first_indexed	2024-04-09T03:58:49Z
format	Conference item
id	oxford-uuid:0d76259d-9c96-486a-acf3-ff5770d093bf
institution	University of Oxford
language	English
last_indexed	2024-09-25T04:01:13Z
publishDate	2024
publisher	IEEE
record_format	dspace
title	MOSE: a new dataset for video object segmentation in complex scenes


Similar Items