Summary: | Most scenes in practical applications are dynamic scenes containing moving objects, so accurately segmenting moving objects is crucial for many computer vision applications. In order to efficiently segment all the moving objects in the scene, regardless of whether the object has a predefined semantic label, we propose a two-level nested octave U-structure network with a multi-scale attention mechanism, called U<sup>2</sup>-ONe<inline-formula><math display="inline"><semantics><mi mathvariant="normal">t</mi></semantics></math></inline-formula>. <inline-formula><math display="inline"><semantics><msup><mi mathvariant="normal">U</mi><mn>2</mn></msup></semantics></math></inline-formula>-ONe<inline-formula><math display="inline"><semantics><mi mathvariant="normal">t</mi></semantics></math></inline-formula> takes two RGB frames, the optical flow between these frames, and the instance segmentation of the frames as inputs. Each stage of <inline-formula><math display="inline"><semantics><msup><mi mathvariant="normal">U</mi><mn>2</mn></msup></semantics></math></inline-formula>-ONe<inline-formula><math display="inline"><semantics><mi mathvariant="normal">t</mi></semantics></math></inline-formula> is filled with the newly designed octave residual U-block (ORSU block) to enhance the ability to obtain more contextual information at different scales while reducing the spatial redundancy of the feature maps. In order to efficiently train the multi-scale deep network, we introduce a hierarchical training supervision strategy that calculates the loss at each level while adding knowledge-matching loss to keep the optimization consistent. The experimental results show that the proposed <inline-formula><math display="inline"><semantics><msup><mi mathvariant="normal">U</mi><mn>2</mn></msup></semantics></math></inline-formula>-ONe<inline-formula><math display="inline"><semantics><mi mathvariant="normal">t</mi></semantics></math></inline-formula> method can achieve a state-of-the-art performance in several general moving object segmentation datasets.
|