Dense prediction and deep learning in complex visual scenes


Bibliographic Details
Main Author: Wang, Yi
Other Authors: Lap-Pui Chau
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2021
Subjects:
Online Access: https://hdl.handle.net/10356/152009
Description
Summary: Many computer vision applications, such as video surveillance, autonomous driving, and crowd analysis, suffer from the challenging conditions of complex scenes, including haze, underwater environments, extreme lighting, and crowded or small objects. Such scenes can degrade the performance of computer vision algorithms or cause them to fail altogether, so it is important to develop methods that address them. In this thesis, we take a unified view of a series of dense prediction problems spanning low-level to high-level vision, i.e., restoration, detection, and recognition. In the restoration problem, haze and underwater scenes degrade the contrast and color of images through light scattering and absorption. Research on de-scattering, or dehazing, aims to restore images captured in such scenes. As the first research direction of this thesis, we propose a novel image restoration approach for underwater imagery based on an adaptive attenuation-curve prior (AACP). The prior captures the observation that all pixel values of a clear image can be partitioned into several hundred distinct clusters in RGB space, and that the pixel values in each cluster are distributed along a power-function curve after being attenuated by water. The pixel-wise medium transmission can therefore be predicted from a pixel value's position on such a curve. This method is generalizable and can be extended to hazy images. Moreover, based on the observation that ambient light is found in the infinitely distant region of an outdoor image, we propose a new deep learning-based framework that estimates the ambient light by distant region segmentation (DRS). Qualitative and quantitative results show that the proposed methods achieve superior performance in comparison with state-of-the-art methods. In the detection problem, crowded objects exhibit large variations in scale and severe occlusion, posing great challenges to object detectors.
In addition, current crowd datasets provide only coarse point-level annotations, i.e., human heads are labeled as points, so state-of-the-art object detectors cannot be trivially applied under such point supervision. In our second research direction, we propose a novel self-training approach, termed Crowd-DCNet, that enables a typical object detector trained only with point-level annotations to densely predict the center points and sizes of crowded objects. Specifically, we propose the locally-uniform distribution assumption (LUDA) for initializing pseudo object sizes from point-level supervisory information, a crowdedness-aware loss for regressing object sizes, and a confidence and order-aware refinement scheme for continuously refining the pseudo object sizes during training. With our self-training approach, the detector's ability improves progressively over the course of training. Moreover, bypassing object detection, we introduce a compact convolutional neural network (CNN) for object counting in video surveillance, in which a multi-scale density (MSD) regressor is employed to predict coarse- and fine-scale density maps. Comprehensive experimental results on six challenging benchmark datasets show that our approach significantly outperforms state-of-the-art methods on both detection and counting tasks. In the recognition problem, small objects in unconstrained scenes adversely affect the accuracy of automatic recognition systems. Our third research direction focuses on automatic license plate recognition (ALPR) in unconstrained environments, with conditions such as oblique views, uneven illumination, and various weather conditions.
Our study yields an effective ALPR design built on four insights: (1) a resampling-based cascaded framework benefits both speed and accuracy; (2) highly efficient license plate recognition should drop additional character segmentation and recurrent neural networks (RNNs) in favor of a plain CNN; (3) with a CNN, exploiting vertex information on license plates improves recognition performance; and (4) a weight-sharing character classifier addresses the lack of training images in small-scale datasets. Based on these insights, we propose a real-time, high-performing ALPR approach, termed VSNet. The vertex supervisory information is fully exploited to train a detector (VertexNet) that predicts the geometric shapes of license plates, so that the plates can be rectified and their characters densely predicted by a recognizer (SCR-Net). Moreover, we propose a dynamic regularization method to avoid overfitting and improve the generalization ability of the CNN. Experimental results on two challenging benchmark datasets demonstrate the effectiveness of the proposed method.
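
The rectification step mentioned in the summary can be illustrated with a generic perspective warp. The following is a minimal sketch, not the thesis's implementation: it assumes the detector returns four plate vertices ordered top-left, top-right, bottom-right, bottom-left in (x, y) pixel coordinates, and it uses a standard direct linear transform with nearest-neighbour sampling. The function names `homography_from_vertices` and `rectify_plate` and the 96x32 output size are hypothetical choices made for illustration.

```python
import numpy as np


def homography_from_vertices(src, dst):
    """Estimate the 3x3 homography mapping four src points onto four dst points.

    Uses the standard direct linear transform (DLT); this is a generic
    technique, assumed here for illustration only.
    """
    rows = []
    for (x, y), (u, v) in zip(np.asarray(src, float), np.asarray(dst, float)):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of the 8x9 system (last row of V^T).
    _, _, vt = np.linalg.svd(np.array(rows))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]


def rectify_plate(image, vertices, out_w=96, out_h=32):
    """Warp the quadrilateral given by `vertices` to an out_h x out_w rectangle.

    `vertices` is assumed to be ordered TL, TR, BR, BL in (x, y) pixels.
    """
    target = [(0, 0), (out_w - 1, 0), (out_w - 1, out_h - 1), (0, out_h - 1)]
    # Inverse mapping: for every output pixel, find where to sample the input.
    h_inv = homography_from_vertices(target, vertices)
    xs, ys = np.meshgrid(np.arange(out_w), np.arange(out_h))
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    mapped = h_inv @ pts
    sx = np.rint(mapped[0] / mapped[2]).astype(int).clip(0, image.shape[1] - 1)
    sy = np.rint(mapped[1] / mapped[2]).astype(int).clip(0, image.shape[0] - 1)
    # Nearest-neighbour sampling keeps the sketch dependency-free.
    return image[sy, sx].reshape(out_h, out_w, *image.shape[2:])


if __name__ == "__main__":
    img = np.random.default_rng(0).integers(0, 255, size=(120, 200, 3))
    quad = [(40, 30), (150, 45), (145, 80), (35, 62)]  # made-up vertices
    print(rectify_plate(img, quad).shape)
```

In practice a bilinear sampler from an image library would give smoother output than nearest-neighbour lookup; the sketch only shows how four predicted vertices are enough to determine the warp.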