CMMF-Net: A Generative network based on CLIP-guided multi-modal feature fusion for thermal infrared image colorization

Bibliographic Details
Main Authors: Jiang, Qian; Zhou, Tao; He, Youwei; Ma, Wenjun; Hou, Jingyu; Ahmad Shahrizan, Abdul Ghani; Miao, Shengfa; Jing, Xin
Format: Article
Language: English
Published: OAE Publishing Inc. 2025
Online Access: http://umpir.ump.edu.my/id/eprint/43832/1/CMMF-Net_A%20Generative%20network%20based%20on%20CLIP-guided.pdf
Description
Summary: Thermal infrared (TIR) images remain unaffected by variations in light and atmospheric conditions, which makes them widely used in diverse nocturnal traffic scenarios. However, challenges of low contrast and the absence of chromatic information persist. Image colorization has emerged as a pivotal technique for improving the fidelity of TIR images, which in turn facilitates human interpretation and downstream analytical tasks. Because TIR images have blurred and intricate features, networks struggle to extract and process their feature information accurately through image-based approaches alone. Hence, we propose a multi-modal model that integrates text features derived from TIR images with image features to jointly perform TIR image colorization. A vision transformer (ViT) model is employed to extract features from the original TIR images. Concurrently, we manually observe the images, summarize them as textual descriptions, and input these descriptions into a pretrained contrastive language-image pretraining (CLIP) model to capture text-based features. These two sets of features are then fed into a cross-modal interaction (CI) module to establish the relationship between text and image. Subsequently, the text-enhanced image features are processed by a U-Net network to generate the final colorized images. Additionally, we employ a comprehensive loss function to ensure that the network generates high-quality colorized images. The effectiveness of the proposed methodology is evaluated on the KAIST dataset. The experimental results demonstrate the superior performance of our CMMF-Net method compared with other methodologies for the task of TIR image colorization.
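The abstract describes the cross-modal interaction (CI) stage only at a high level. The following is a minimal PyTorch sketch of one plausible reading of that stage, in which ViT image tokens attend to a CLIP text embedding via cross-attention. The module name, feature dimensions, and the choice of cross-attention with a residual connection are illustrative assumptions, not the paper's published implementation.

import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    # Hypothetical CI module: image tokens attend to the CLIP text
    # embedding, and a residual connection preserves the original
    # image information before normalization.
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_feat):
        # img_tokens: (B, N, dim) patch features from a ViT encoder
        # text_feat:  (B, 1, dim) sentence embedding from CLIP's text encoder
        fused, _ = self.attn(query=img_tokens, key=text_feat, value=text_feat)
        return self.norm(img_tokens + fused)  # text-enhanced image features

# Illustrative forward pass with random stand-ins for the real encoders.
B, N, dim = 2, 196, 512
img_tokens = torch.randn(B, N, dim)   # ViT output for a TIR image
text_feat = torch.randn(B, 1, dim)    # CLIP embedding of its text description
ci = CrossModalInteraction(dim)
enhanced = ci(img_tokens, text_feat)  # shape (2, 196, 512)

In the pipeline the abstract describes, the resulting text-enhanced tokens would then be reshaped into a spatial feature map and decoded by the U-Net to produce the colorized image.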