Deep learning for texture recognition: from streamlined architecture to multimodal extensions

Bibliographic Details
Main Author: Mao, Shangbo
Other Authors: Deepu Rajan
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University, 2024
Subjects:
Online Access: https://hdl.handle.net/10356/179779
Description
Summary: Texture serves as a crucial visual cue for various applications, ranging from identifying materials and spotting industrial defects to classifying terrains. Accordingly, texture recognition has long been a focus within the computer vision community. Traditional methods in this area primarily focused on proposing stable and invariant handcrafted image descriptors, with the goal of maintaining consistency amid variations in color, scale, and rotation within a single texture category. Designing these descriptors usually requires prior knowledge and feature engineering, and the resulting descriptors do not generalize well across different texture datasets. In contrast, deep-learning-based texture recognition methods leverage deep neural networks pretrained on large image datasets to automate feature extraction. This automated feature extraction offers not only exceptional performance but also strong generalizability, setting a new standard in the field of texture recognition.

The challenge that current deep-learning-based approaches face centers on the effective and efficient use of pretrained models for texture representation extraction. These methods mainly emphasize feature encoding or aggregation, with the aim of aligning the extracted features with the requirements of texture recognition. However, increasingly complex designs for feature aggregation do not necessarily yield significant performance improvements; instead, they compromise efficiency by introducing additional modules that incur extra computational overhead. Thus, the first challenge is: How can we maintain the efficiency of deep-learning-based methods while employing streamlined yet robust feature aggregation techniques? 'Streamlined' here refers to keeping the design of the feature aggregation module as simple as possible.

In our first study, we introduce a learnable residual pooling layer with a streamlined design, comprising a residual encoding module and an aggregation module (see the illustrative sketch below). The residual encoder preserves spatial information, thereby enhancing feature learning, while the aggregation module uses simple averaging to generate spatially orderless features, improving overall texture classification performance. Beyond its streamlined architecture, our method sets new performance standards on well-established texture classification datasets including FMD, DTD, and 4D Light, as well as on an industrial dataset for metal surface anomaly detection, and maintains competitive results on the MIT-Indoor scene recognition dataset.

Although the proposed learnable residual pooling layer improves the efficiency of feature aggregation while maintaining competitive performance across various datasets, one limitation remains: the misalignment between texture datasets and the images on which the frozen deep learning models were pretrained hampers the effectiveness of the extracted features. We could improve the alignment by fine-tuning the pretrained model; however, most texture datasets are not large enough for effective fine-tuning without the risk of overfitting. Therefore, the second challenge is: How can we improve the effectiveness of features extracted from pretrained deep learning models?
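To make the first study's residual pooling layer concrete, the following minimal PyTorch sketch shows one possible form of such a head: a learnable 1x1 projection predicts a per-location codeword, the residual between each local feature and its codeword is kept at its spatial position (preserving spatial information), and a simple spatial average produces the orderless descriptor. The specific layer choices (codeword projection, batch normalization, linear classifier) and dimensions are assumptions for exposition, not the exact design reported in the thesis.

import torch
import torch.nn as nn

class ResidualPooling(nn.Module):
    """Minimal illustrative sketch of a residual pooling head (not the thesis code)."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        # Hypothetical residual encoding: a learnable 1x1 projection whose output
        # is subtracted from the input features at every spatial location.
        self.codeword = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(in_channels)
        self.classifier = nn.Linear(in_channels, num_classes)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (B, C, H, W) activations from a frozen pretrained CNN.
        residual = feature_map - self.codeword(feature_map)  # residual encoding, spatially aligned
        residual = self.bn(residual)
        pooled = residual.mean(dim=(2, 3))                   # simple average aggregation -> orderless feature
        return self.classifier(pooled)

# Example: features from a ResNet-50-like stage (2048 channels) for the 47 DTD classes.
head = ResidualPooling(in_channels=2048, num_classes=47)
logits = head(torch.randn(4, 2048, 7, 7))
print(logits.shape)  # torch.Size([4, 47])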
Previous methods have shown that an increasingly complex aggregation module does not necessarily lead to a proportional improvement in performance. Therefore, we redirected our attention from aggregating features derived from pretrained convolutional neural networks (CNNs) to enhancing the pretrained CNN models themselves. We propose to enhance the pretrained backbone with additional information that is either readily available or can be automatically generated, which may offer another solution for texture recognition in the deep learning era. Our subsequent investigations revealed that converting images to grayscale has only a minor impact on texture characteristics but significantly alters the feature representations produced by pretrained CNNs. Capitalizing on this insight, we designed a two-stream framework that integrates convolutional features from both color and grayscale images. Its promising results not only validate grayscale images as a robust and easily accessible source of information for texture recognition but also underscore the complementarity between color and grayscale images in extracting effective texture features. To strengthen this architecture, we introduced Supervised Contrastive Learning for Texture (CoTex), which enforces similarity among local features both within a single image and across multiple images sharing the same texture category. Powered by the combination of the two-stream framework and CoTex, our method achieves new benchmarks on the DTD, GTOS, and GTOS-Mobile datasets and sustains competitive performance on the FMD dataset.

In the final phase of our research, we drew upon the insight from our second study that effective additional information plays a critical role in improving texture recognition performance. Recognizing that texture recognition has been predominantly image-based, we ventured into a novel direction: employing texture descriptions as an additional modality. Advances in large pretrained language and vision-language models such as BERT and CLIP made this previously unexplored avenue feasible. We first developed an Unsupervised Texture Description Generator (UTDG) to automatically generate accurate texture descriptions for each texture image. The trained UTDG module excelled in the phrase retrieval task on the DTD² dataset even without any phrase retrieval ground truth during training, validating the meaningfulness of the automatically generated descriptions. We then employed an image-text fusion module to coherently combine these two disparate modalities. Empowered by automated texture description generation and fine-tuned backbones, our method achieves competitive results on five commonly used texture recognition datasets, marking a pioneering step in harnessing the text modality for texture recognition.
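The sketch below illustrates the general idea of late image-text fusion under the assumption that an image embedding (e.g. from a fine-tuned CNN or CLIP image encoder) and a text embedding of the generated texture description (e.g. from BERT or the CLIP text encoder) have already been computed. The embedding dimensions and the concatenation-plus-MLP fusion are assumptions for exposition; the thesis's UTDG and fusion module designs are not reproduced here.

import torch
import torch.nn as nn

class ImageTextFusion(nn.Module):
    """Hypothetical late-fusion head: concatenates a precomputed image embedding with
    a precomputed text embedding of the texture description and classifies the result."""

    def __init__(self, img_dim: int, txt_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden_dim),  # joint projection of both modalities
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),        # texture-category prediction
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb: (B, img_dim) image features; txt_emb: (B, txt_dim) description features.
        return self.fuse(torch.cat([img_emb, txt_emb], dim=-1))

# Example with assumed dimensions: 2048-d image features, 512-d text features, 47 DTD classes.
fusion = ImageTextFusion(img_dim=2048, txt_dim=512, hidden_dim=256, num_classes=47)
scores = fusion(torch.randn(4, 2048), torch.randn(4, 512))
print(scores.shape)  # torch.Size([4, 47])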