MOSE: a new dataset for video object segmentation in complex scenes

Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J & F) on existing datasets. However, since the target objects in these existing datasets are usu...

Bibliographic Details
Main Authors: Ding, H, Liu, C, He, S, Jiang, X, Torr, P, Bai, S
Format: Conference item
Language: English
Published: IEEE 2024
collection OXFORD
Description: Video object segmentation (VOS) aims to segment a particular object throughout an entire video clip sequence. State-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J & F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study tracking and segmenting objects in complex scenarios. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4 different settings and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot yet perceive objects well in complex scenes. For example, under the semi-supervised VOS setting, the highest J & F achieved by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their ∼90% J & F performance on DAVIS. The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes, and more effort is needed to explore these challenges in the future.
first_indexed 2024-04-09T03:58:49Z
id oxford-uuid:0d76259d-9c96-486a-acf3-ff5770d093bf
institution University of Oxford
last_indexed 2024-09-25T04:01:13Z