Summary: | Many problems in machine learning (ML) and computer vision (CV) deal with large amounts of data containing variations and noise relevant to the underlying tasks. For example, object detection requires filtering semantic objects out of noisy and varying backgrounds, and the ImageNet challenge requires building a classification model on top of over one million images. Finding good representations of the data can not only improve the performance of ML systems but also suggest appropriate algorithms for different problems. On the other hand, representing data effectively and efficiently is challenging, partly due to the intrinsic complexity of high dimensional data distributions. Natural images are known to lie on a low dimensional manifold, yet it is extremely difficult to construct a generative model that properly describes this manifold from the high dimensional pixel space. Video data is even more complex given the temporal variations among frames. Building a scalable data representation has therefore become a critical part of the ML pipeline, and a large body of work has addressed this topic. This thesis focuses on a specific representation technique, sparse coding (SC), which has proven effective for natural signals such as images. The idea is to decompose signals into linear combinations of only a few elements from a large collection of predefined vectors. The goal of this thesis is to explore inherent sparse structures in machine learning problems and to design effective algorithms that use the derived sparse representations for supervised and unsupervised learning tasks.

The first part of the thesis investigates dictionary learning: constructing (or learning) a set of ``atomic vectors'' that are over-complete on the data space, so that each signal can be efficiently represented and reconstructed via only a small subset of them. The dictionary determines the sparsity pattern, which in turn reveals the intrinsic structure of the signal space. In this thesis, a discriminative dictionary learning approach is proposed for image classification that explicitly encourages the coding vectors of samples from the same class to share a similar sparsity structure. This is achieved via a novel weighted class-specific group sparsity regularization imposed on the samples of each class. Group sparsity removes unwanted features of a group simultaneously and keeps only those important to the whole group; the regularization therefore makes the sparsity patterns of signals from the same class more uniform and suppresses ``support overlaps'' between different classes. Combined with feature representations via the Extreme Learning Machine (ELM) on the input space and a maximum margin criterion (MMC) regularization on the output space, a unified supervised feature learning framework, ELM-embedded Discriminative Dictionary Learning (ELM-DDL), is presented that produces effective sparse feature representations. This thesis shows that ELM-DDL reveals group structures robust to different conditions, separates samples from different classes well, and achieves state-of-the-art performance on benchmark image datasets.
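To make the role of the group sparsity regularizer concrete, the sketch below shows block (group) soft-thresholding inside a plain proximal-gradient coder. It is only a minimal illustration of the general group-lasso mechanism that class-specific group sparsity builds on, not the ELM-DDL solver itself; the function names, the group weighting, and the ISTA-style optimizer are assumptions made for the example.

```python
import numpy as np

def group_soft_threshold(z, groups, tau):
    # Block (group) soft-thresholding: the proximal operator of a weighted
    # group-lasso penalty. Each group of coefficients is shrunk toward zero
    # as a whole, so a group is either kept or discarded together.
    z = z.copy()
    for idx, w in groups:                        # idx: atom indices, w: group weight
        norm_g = np.linalg.norm(z[idx])
        z[idx] *= max(0.0, 1.0 - tau * w / (norm_g + 1e-12))
    return z

def group_sparse_code(D, x, groups, lam=0.1, n_iter=200):
    # Proximal-gradient (ISTA-style) coding of one sample x:
    #   min_z 0.5 * ||x - D z||^2 + lam * sum_g w_g ||z_g||_2
    L = np.linalg.norm(D, 2) ** 2                # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)
        z = group_soft_threshold(z - grad / L, groups, lam / L)
    return z

# Hypothetical grouping: atoms 0-9 belong to class 0, atoms 10-19 to class 1.
# groups = [(np.arange(0, 10), 1.0), (np.arange(10, 20), 1.0)]
```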
The second part of the thesis explores structural information in unsupervised graph learning. Constructing an informative graph benefits many unsupervised tasks, including clustering analysis and manifold learning. When the graph is sparse, the connectivity is determined by the nonzero entries of the graph representation, which indicate the local structure of the data. In this thesis, sparsity of the data graph is achieved and exploited via a simple locality constraint on each sample, since locality naturally leads to sparsity. The locality-constrained coding method focuses on nearest neighbors and gives rise to a distance-regularized feature representation combined with local decomposition. When a rank constraint is further imposed on the Laplacian of the data graph to force the number of connected components to match a predefined number of classes, the learned graph exhibits a clear cluster structure and is easy to partition. It is demonstrated that the proposed graph learning method, termed Adaptive Locality-constrained Clustering (ALC), generates a more structured graph than predefined graphs and yields better clustering performance on benchmark datasets. This thesis also explores the use of the unsupervised ELM (US-ELM) on top of the learned graph to learn a flexible and discriminative data embedding that further benefits clustering.
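As a rough illustration of how a locality constraint yields a sparse graph, the sketch below builds a k-nearest-neighbor affinity matrix with closed-form weights from a standard adaptive-neighbors construction and then counts its connected components, the quantity that the rank constraint on the Laplacian controls in ALC. This is a generic construction assumed for illustration, not the ALC algorithm itself; the function name, the random data, and the choice k=10 are arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def locality_constrained_graph(X, k=10):
    # Each sample connects only to its k nearest neighbors, with weights given
    # in closed form by a locality-regularized simplex problem; locality makes
    # the resulting affinity matrix sparse by design.
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared distances
    np.fill_diagonal(d, np.inf)                      # exclude self-loops
    S = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d[i])[:k + 1]               # k+1 nearest neighbors
        di = d[i, idx]
        w = (di[k] - di[:k]) / (k * di[k] - di[:k].sum() + 1e-12)
        S[i, idx[:k]] = np.maximum(w, 0.0)
    return (S + S.T) / 2.0

# The rank constraint in ALC asks for exactly c connected components in the
# learned graph; here we only count how many the locality graph already has.
X = np.random.randn(200, 16)
S = locality_constrained_graph(X, k=10)
n_comp, labels = connected_components(csr_matrix(S > 0), directed=False)
```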
The third part of the thesis focuses on convolutional dictionary learning (CDL), a variant of ordinary dictionary learning in which the dictionary is applied to the sparse codes via convolution instead of matrix multiplication. The technique is particularly important for structured inputs such as images (natural or hyperspectral), since the convolution operator is shift-invariant and does not break spatial configurations. This part of the thesis is dedicated to online methods for efficient CDL, in which input signals arrive in a streaming fashion and update the dictionary incrementally, because batch-mode CDL is inherently time- and memory-consuming. The methodology is a systematic extension of classical unsupervised online sparse coding algorithms to the convolutional sparse coding domain, with efficient optimization achieved via a localized ``slice-based'' representation of the sparse features. Specifically, a slice-based online convolutional dictionary learning (SOCDL) method is proposed for unsupervised image processing tasks such as image reconstruction and inpainting. To enable online learning, the expected reconstruction cost with respect to the dictionary is approximated by a surrogate function, and a second-order stochastic approximation approach is proposed for optimization under the slice-based representation. The approximation efficiently generates a dictionary sequence that converges favorably and has the characteristics desired for image reconstruction. A convergence analysis with theoretical justification is presented, and the computational complexity and memory consumption are among the lowest of existing online CDL algorithms. Finally, this thesis proposes a novel masked version of SOCDL that performs online learning on incomplete data for image inpainting.
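For context, the sketch below shows the classical surrogate-function update for ordinary (non-convolutional) online dictionary learning, the principle that SOCDL extends to the convolutional, slice-based setting: sufficient statistics of the past codes are accumulated and the resulting quadratic surrogate of the reconstruction cost is minimized atom by atom. This is a generic illustration under those assumptions, not the SOCDL implementation; the sparse code z for each incoming sample x is assumed to come from a separate coding step.

```python
import numpy as np

def online_dictionary_step(D, A, B, x, z):
    # One online step of surrogate-based dictionary learning: update the
    # running second-order statistics of past codes, then minimize the
    # quadratic surrogate of the reconstruction cost with respect to D by
    # block-coordinate descent over the atoms.
    A += np.outer(z, z)                          # running sum of z z^T
    B += np.outer(x, z)                          # running sum of x z^T
    for j in range(D.shape[1]):
        if A[j, j] < 1e-12:                      # atom not used so far
            continue
        u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
        D[:, j] = u / max(1.0, np.linalg.norm(u))    # keep atoms in the unit ball
    return D, A, B
```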