Next Article in Journal
Monitoring Physical Activity in Students with Intellectual Disabilities: The Contribution of Physical Education, Gender and Disability Level
Next Article in Special Issue
A Method for Industrial Smoke Video Semantic Segmentation Using DeffNet with Inter-Frame Adaptive Variable Step Size Based on Fuzzy Control
Previous Article in Journal
Trace-LogVector-Based Relational Retrieval for Conversational System Log Analysis
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Recognition Task-Based Detection Score: A Task-Oriented Evaluation Metric for Infrared Image Colorization

1
School of Optics and Photonics, Beijing Institute of Technology, Beijing 100081, China
2
National Key Laboratory on Near-Surface Detection, Beijing 100072, China
3
School of Optoelectronic Engineering, Changchun University of Science and Technology, Changchun 130022, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(6), 1807; https://doi.org/10.3390/s26061807
Submission received: 5 February 2026 / Revised: 1 March 2026 / Accepted: 5 March 2026 / Published: 13 March 2026
(This article belongs to the Special Issue AI-Based Visual Sensing for Object Detection)

Highlights

What are the main findings?
  • A task-oriented colorization evaluation metric (RDS) is proposed that measures colorization quality through object detection performance, incorporating three key design characteristics: position robustness via IoU-based matching, fine-grained interpretability through category-level analysis, and task adjustability through flexible category partitioning strategies.
  • Experiments demonstrate that RDS not only maintains consistency with traditional metrics under standard conditions but also exhibits superior stability under registration errors (5.65% improvement vs. 11–70% degradation) and uniquely enables category-level performance diagnosis and flexible evaluation dimension adjustment—capabilities that traditional pixel-based metrics lack.
What are the implications of the main findings?
  • By providing the evaluation criteria that directly reflect downstream task performance, RDS guides the development of infrared colorization models toward practical applicability rather than pixel-level similarity, driving innovation in model architectures and training strategies for infrared imaging applications.
  • The three design characteristics enable RDS to provide reliable evaluation under imperfect registration conditions, reveal category-specific model weaknesses for targeted improvements, and adapt evaluation focus to match diverse application requirements, supporting more effective model optimization in practical deployment scenarios.

Abstract

Infrared image colorization has gained widespread attention in recent years as an important means of enhancing image visibility and semantic expression. However, existing evaluation methods mostly rely on pixel-level differences or feature distribution distances, failing to comprehensively reflect the usability of colorization results in practical tasks. To address this, we propose a task-oriented colorization quality evaluation metric called Recognition-Task based Detection Score (RDS), which uses the recognition accuracy of object detection models on colorized images as a proxy indicator to measure their actual performance in downstream tasks, thereby achieving consistency between image quality assessment and task performance. RDS incorporates three key characteristics in its design: enhancing position robustness through the matching mechanism of object detection tasks, providing fine-grained interpretability through category-level accuracy calculation, and achieving task adjustability through flexible category division strategies. Systematic experiments conducted on both NIR–RGB and FLIR-5C datasets demonstrate that RDS maintains good subjective–objective consistency with traditional metrics under standard registration conditions, exhibits superior stability under registration error scenarios, and possesses fine-grained interpretability and task adjustability that traditional metrics lack. RDS maintains a 5.7% improvement in discriminative Score Gap under misalignment while PSNR degrades by 69.8%, and flexible category merging raises TIC-CGAN’s RDS from 76.05% to 96.45% on unseen scenes, providing more practically valuable criteria for the evaluation and optimization of infrared colorization models.

1. Introduction

Infrared imaging technology, independent of visible light illumination, possesses the capability of stable perception in complex environments and is widely applied in scenarios such as night vision surveillance and autonomous driving [1]. However, compared to three-channel visible light images, infrared images are typically single-channel grayscale images with limited degrees of freedom in color space and insufficient semantic information, leading to constrained performance in object detection and recognition [2,3]. To alleviate the information insufficiency of infrared grayscale images, researchers have introduced Image-to-Image Translation technology, which achieves style transformation through mapping relationships and is widely applied in tasks such as image enhancement and style transfer [4,5,6,7]. Prior to the dominance of GAN-based approaches, Convolutional Neural Networks (CNNs) served as the primary deep learning framework for infrared image colorization. Representative works include deep multi-scale CNNs for NIR-to-RGB transfer [8], S-shape network architectures for infrared colorization [9], asymmetric codec-based CNN methods for near-infrared colorization [10], and U-Net-based CNNs for predicting visible spectrum images from near-infrared illumination [11]. In recent years, with the development of Generative Adversarial Networks (GANs) [12,13,14,15], GAN-based image translation methods have become the mainstream approach for grayscale image colorization [8,16,17,18,19,20,21,22,23,24,25,26]. These methods achieve the conversion from infrared grayscale images to color images through deep feature extraction and pixel mapping, significantly enhancing the visual presentation of images. However, methods regarding scientific and objective evaluation of the quality of colorization results remain an urgent problem to be solved.
Evaluation metrics play a crucial driving role in advancing image generation and translation tasks. In the field of image generation, the introduction of Fréchet Inception Distance (FID) [27] directly propelled the transition of generative models from subjective evaluation to objective quantification, catalyzing the rapid iteration of a series of high-quality generative models such as StyleGAN and Diffusion [28]. As Jayasumana et al. [29] pointed out: in many problems of machine learning, reliable evaluation metrics are key to driving progress. Research by Benny et al. also demonstrates that evaluation metrics not only provide unified standards for model comparison but, more importantly, guide research directions and promote innovation in model architectures and training strategies [30].
However, in the field of infrared image colorization, the development of evaluation systems has severely lagged behind model innovation, becoming a bottleneck constraining field advancement. Existing evaluation methods are mainly divided into two categories: no-reference metrics and full-reference metrics. No-reference evaluation metrics primarily evaluate through comparison of images before and after colorization, requiring no reference images. Luo et al. [8] proposed a metric called Average Precision of Canny Edges (APCE) to assess edge consistency before and after thermal infrared grayscale image colorization, but it cannot evaluate the crucial color changes in colorization problems. Full-reference evaluation metrics quantify image quality by comparing pixel-level differences or feature distribution differences between generated images and real reference images. Among them, Structural Similarity Index (SSIM) [31] emphasizes consistency in luminance, contrast, and structural information; Peak Signal-to-Noise Ratio (PSNR) [32] reflects the energy intensity of image reconstruction errors; and Fréchet Inception Distance (FID) [27] evaluates the proximity of image distributions in high-dimensional feature space. However, these metrics all focus on measuring the similarity between generated images and reference images, failing to reflect the support capability of colorization results for downstream tasks in practical applications.
To advance the colorization technology toward practical application, the evaluation system needs to achieve a fundamental transformation: from pixel-level similarity evaluation to task-oriented evaluation. Research by Zhang et al. [33] demonstrates that colorization as a pretraining task can significantly improve object detection and semantic segmentation performance, indicating that the colorization quality should be assessed from its support capability for downstream tasks. Recent studies further confirm that feature representations learned through colorization can be effectively transferred to visual tasks such as detection and segmentation, significantly enhancing downstream task performance [34,35]. This indicates a strong correlation between colorization quality and downstream task performance, and the evaluation system should reflect this task value.
To achieve this transformation, task-oriented evaluation metrics need to possess three key capabilities. First, it requires position robustness to adapt to registration errors. Infrared and color images are often accompanied by non-negligible pixel-level misalignment [20], requiring evaluation methods to tolerate reasonable registration errors and maintain evaluation stability and reliability under imperfect image alignment conditions. Second, it requires providing fine-grained performance analysis capability. Benny et al. [30] pointed out in their research on conditional image generation evaluation that fine-grained category-level evaluation is crucial for identifying model weaknesses and guiding targeted improvements. In colorization tasks, this diagnostic capability can support in-depth analysis and optimization of model performance, enabling researchers to accurately locate performance differences in models across different semantic categories. Third, it requires flexible adjusting evaluation criteria according to task requirements. Infrared image colorization serves two distinct practical goals: color restoration, which requires the model to accurately reproduce the specific colors of known objects in familiar scenes, and structure preservation, which requires the model to maintain semantic structural integrity when generalizing to unseen scenes where color prediction is inherently ambiguous. A practically valuable evaluation system must be able to shift its focus between these two orientations rather than applying a single fixed standard [36]. The evaluation system needs to be able to adjust evaluation criteria according to differences in specific application scenarios, avoiding the use of a single standard to measure diversified task objectives.
In response to the above requirements, this paper proposes Recognition Task-Based Detection Score (RDS), which uses the recognition accuracy of object detection models on colorized images as a proxy indicator for image colorization quality. It quantifies the actual usability of colorized images from a task perspective, thereby achieving consistency between colorization quality evaluation and downstream task performance. To enhance applicability in complex application scenarios, RDS incorporates three key characteristics in its design: by utilizing an object matching mechanism based on Intersection over Union (IoU) [37], it reduces reliance on pixel-level registration and ensures evaluation stability under conditions where position errors exist, thereby possessing position robustness; by supporting both global performance assessment and category-level performance decomposition, the metric can not only reflect overall quality but also reveal performance differences in the model across different object categories, providing fine-grained interpretability; and by combining with a category division strategy, the evaluation dimension can be flexibly switched between color restoration capability and structure preservation capability according to application needs, thereby possessing task adjustability.

2. Methods

To achieve task-oriented colorization quality evaluation, this paper proposes the RDS metric. As illustrated in Figure 1, the pipeline begins with the construction of two experimental datasets (NIR–RGB and FLIR-5C), which provide real color images and corresponding annotations for detection model training. The trained model then serves as the proxy evaluator: colorized infrared images to be evaluated are fed into the detector, and predicted bounding boxes are matched against ground-truth annotations via an IoU-based mechanism, endowing RDS with position robustness. Category-level AP values are subsequently computed, providing fine-grained interpretability, and aggregated through category-weighted aggregation to yield the final RDS. A category division strategy is applied prior to AP computation to flexibly switch the evaluation focus between scene-specific mode and scene generalization mode, enabling task adjustability. The metric definition and computation pipeline are elaborated in Section 2.1; the category division strategy is detailed in Section 2.2; the dataset construction is described in Section 2.3; and the detection model training strategy is presented in Section 2.4.

2.1. RDS Metric

The ultimate goal of infrared image colorization lies not in pixel-level color restoration, but rather in enhancing downstream task performance of the colorized images. Traditional metrics measure images based on pixel differences or feature distributions, failing to reflect the preservation of semantic structure in images. Based on this, we design the RDS as a task-oriented evaluation metric. The design principle is: inputting the infrared colorized images to be evaluated into a trained object detection model, where the detection accuracy can measure the degree of semantic structure preservation and task usability of the infrared colorized images, serving as a proxy evaluation indicator for colorization quality.
The computation process of RDS includes the following three steps:
(1)
Train object detection model.
This paper selects YOLOv5s [38] as the detection network, which, as a classic single-stage lightweight architecture, balances high detection accuracy with low computational complexity, making it suitable for deployment in large-scale colorized image evaluation scenarios. YOLOv5s is trained on a dataset constructed from real color images and their corresponding object annotations until the model achieves high detection accuracy. Since the category definition of the detection task can be adjusted according to the actual requirements, the RDS metric possesses task adjustability.
(2)
Detect colorized images.
The infrared colorized images to be evaluated are input into the trained detection model to obtain predicted bounding boxes, category labels, and confidence scores for each image. Subsequently, these are matched with reference annotation boxes, and following the standard detection evaluation process, the Average Precision (AP) is calculated for each category. This process not only quantifies the semantic preservation of colorized images in detection tasks but also provides fine-grained information on the model’s performance across different object categories, enabling the RDS to possess fine-grained interpretability.
(3)
Calculate RDS.
Based on the AP results of each category, RDS obtains an overall score through category-weighted aggregation. By default, this paper defines the final RDS metric in the form of mean value, namely:
R D S = 1 N i = 1 N A P i
where N represents the total number of detection categories, and i is the category index. To enhance robustness, RDS adopts an IoU-based matching mechanism when matching predicted and annotated boxes, with the judgment threshold, τ, set to 0.5. This strategy can tolerate minor position errors, effectively mitigating interference from insufficient pixel-level registration, enabling the RDS to possess position robustness.
In terms of task selection, the RDS adopts object detection as the downstream task for colorized images, considering that the primary purpose of infrared image colorization is to address its limitations in color space degrees of freedom and insufficient semantic expression capability. Compared to image-level tasks such as image classification, object detection not only requires the model to recognize object categories in images but also needs to localize their spatial boundaries, making it more sensitive to local color, texture, and structural information in images. This attribute makes it an ideal choice for evaluating whether colorization results effectively enhance semantic expression capability. In addition, the output of detection tasks possesses category decomposition capability and spatial matching mechanism, accommodating RDS’s design requirements for fine-grained interpretability and position robustness. By using detection accuracy as the evaluation criterion, RDS can measure the support capability of colorized images for downstream applications from a task-oriented perspective, thereby providing more informative feedback for model optimization and scenario adaptation.

2.2. Category Division Strategy

Due to the limited grayscale value range of infrared grayscale images and the lack of strong correlation between grayscale and object colors, the colorization process inevitably encounters the single-shape-multiple-color problem, where the same grayscale shape corresponds to multiple possible colors. For example, Figure 2 shows a set of images from our dataset, where two spheres in the near-infrared image: (a) have similar grayscale values, but in the color reference image and (b) one appears pink while the other appears purple. For observers familiar with this scene, colors can be inferred based on memory, but in unseen scenes, color prediction exhibits multi-solution characteristics, meaning that even if the generated colors differ from the reference, they may still be reasonable results. This ambiguity has a direct implication for evaluation: in familiar scenes where color information can be learned and memorized, the metric should emphasize the color restoration capability; in unseen scenes where color prediction is uncertain, the metric should instead emphasize the structure preservation capability to avoid penalizing reasonable colorization results. Therefore, category division must be flexibly adjusted according to the test scenario to achieve task-consistent evaluation criteria. The two category mapping modes proposed below are designed precisely to operationalize this distinction.
Let the original category set be:
C o r i g = s i , c i s i S , c i C ,
where S represents the shape set, C represents the color set, and [symbol] represents categories jointly defined by shape, si, and color, ci. This paper proposes two category mapping modes:
(1)
Scene-specific mode.
When test images come from familiar (identical or similar) scenes as the training set, both shape and color dimensions are retained, namely:
C m e m o r y = C o r i g .
This mode can directly evaluate the colorization model’s learning and memory capability for color details in familiar scenes, making RDS more sensitive to color prediction and suitable for analyzing the model’s color restoration performance.
(2)
Scene generalization mode.
When test images come from unseen scenes, color prediction exhibits strong uncertainty, and overemphasizing color consistency leads to evaluation bias. Therefore, the color dimension is ignored, retaining only the shape attribute:
C g e n e r a l = s i s i S .
This mode focuses on evaluating whether the model maintains semantic structural information, avoiding unreasonable penalties due to color multi-solution characteristics, which better aligns with the goals of evaluating generalization performance.
To ensure the objectivity and reproducibility of the RDS across different studies, we provide the following guidelines for category mapping decisions:
(1)
Determine the applicable mode. If test images are drawn from scenes seen during training, apply scene-specific mode, retaining full category definitions, including color attributes. If test images come from entirely new scenes that are not present in the training set, apply the scene generalization mode.
(2)
Identify merge groups under the scene generalization mode. Within the original category set, subcategories that share the same shape attribute but differ only in color attribute (i.e., categories subject to the single-shape-multiple-color problem) are candidates for merging. Merging should be applied when the grayscale values of these subcategories overlap substantially in the infrared domain, confirming that their color distinction is not reliably inferable from the infrared input signal.
(3)
Report the applied strategy. All RDSs should be accompanied by a clear statement of the mode applied, the resulting category mapping, and the criterion used to determine the merge groups to ensure cross-study comparability.
By flexibly switching between the scene-specific mode and scene generalization mode, the RDS can adjust evaluation priorities according to actual application requirements: emphasizing color restoration in seen scenes and highlighting structure preservation in unseen scenes. This flexible category definition approach enables the RDS to no longer depend on fixed evaluation dimensions, but rather dynamically adjust judgment criteria according to task objectives, thereby possessing good task adjustability.

2.3. Datasets

To ensure diversity in evaluation metric validation and comprehensiveness of results, this paper constructs and utilizes two types of experimental datasets: the near-infrared-to-color image pair dataset (NIR–RGB dataset) and the FLIR-5C dataset, which are used to simulate indoor and outdoor scenes, respectively. Among them, the NIR–RGB dataset is collected by this paper, covering multiple typical objects and background combinations, featuring controlled shooting conditions and high registration accuracy between near-infrared and color images, which helps evaluate the model’s learning and memory capability for color details under ideal imaging conditions. The FLIR-5C dataset is constructed based on the publicly available FLIR [39] thermal dataset through category filtering and grayscale processing, containing diverse traffic scenes and five object categories, with complex backgrounds and wide target distribution, suitable for analyzing the model’s generalization performance in real-world scenes. Both datasets are divided into the training set, Seen Test Set, and Unseen Test Set. Images in the Seen Test Set come from the same or similar scenes as the training set, while the Unseen Test Set consists of completely independent new scenes.

2.3.1. NIR–RGB Dataset

This dataset aims to construct indoor near-infrared-to-color image pairs with high registration accuracy and introduce diversified backgrounds to support systematic evaluation. To this end, this paper employs a multispectral area-scan camera FSFE-1600D-10GE, which uses prism light combination to collect near-infrared and color image data, solving the image pair position misalignment problem at the hardware level.
Figure 3 shows the position-matching situation when the collected image pairs are downsampled to the 640 × 480 resolution. From Figure 3d and the enlarged area within the green box, it can be seen that the data collected by this camera has a very high degree of position matching, meeting the requirements of this paper’s experiments.
In terms of scene design, this paper selects 6 types of typical objects as recognition targets: black jar, red cube, brown cup, white cup, pink ball, and purple ball. These targets are placed against backgrounds composed of tablecloths with different patterns and shelves in different arrangements, forming diverse scenes. As shown in Figure 4, (a) and (b) are images captured from different angles of scenes composed of black jar, red cube, and brown cup targets against a gray checkered tablecloth background; (c) and (d) are images captured from different scenes composed of different target combinations against backgrounds of white patterned tablecloth and side-standing shelves.
The NIR–RGB dataset contains approximately 2000 pairs of near-infrared-to-color images, of which approximately one-tenth of the independent scene images are divided into the Unseen Test Set, with these scenes not appearing at all in the training set; the remaining scenes are randomly divided into training set and Seen Test Set at a 4:1 ratio. Table 1 provides data statistics for each division.

2.3.2. FLIR-5C Dataset

Due to issues such as weak image stabilization performance, strong power supply dependency, and inconvenient equipment portability of multispectral cameras in outdoor environments, directly collecting high-quality outdoor near-infrared-to-color image pair datasets is very difficult. As shown in Figure 5, the structural features of grayscale-converted color images exhibit high similarity with actual near-infrared images, indicating that visible light grayscale images can reasonably approximate near-infrared images structurally. Therefore, this paper selects the publicly available FLIR thermal dataset as a basis and constructs the FLIR-5C dataset through category filtering and grayscale processing.
The original FLIR dataset covers diverse traffic scenarios with comprehensive annotation information, containing over 10 target categories. To enhance the evaluation consistency and category distribution balance, this paper retains only samples containing five categories with abundant instances: person, car, bus, traffic light, and traffic sign. Images containing the above five categories are filtered as valid samples, and their color versions are converted to grayscale to construct grayscale-to-color image pairs, simulating a data environment where infrared and color images have good position matching in outdoor scenes. Figure 6 shows an example image pair from the FLIR-5C dataset. Based on the scene independence principle, 1033 images constitute the Unseen Test Set, ensuring that these scenes do not appear in the training set; the remaining 9346 images are divided into training set and Seen Test Set at a 9:1 ratio, used for colorization model training and familiar scene testing, respectively. Data division details are shown in Table 2.

2.4. Training Strategy

To obtain stable and reliable detection performance, this paper trains YOLOv5s-based detection models on the NIR–RGB and FLIR-5C datasets respectively. During training, all images are uniformly scaled to 640 × 480 resolution, and multi-scale data augmentation strategies (including image flipping and color perturbation) are adopted to improve model generalization capability. The number of training epochs is set to 100, the optimizer is Adam, and the initial learning rate is set to 1 × 10−3, and it gradually decreases to 1 × 10−6 using a cosine annealing strategy with epochs. Other parameters maintain YOLOv5s default configurations.
The following platform information is provided for reference to help readers estimate the computational cost of applying the RDS in practice. The experiments are completed on the Ubuntu operating system, with the training platform including: PyTorch 1.11.0 framework, Intel Xeon(R) Platinum 8255C processor, 43 GB memory, and NVIDIA RTX 3090 (24 GB VRAM) GPU.

3. Experiments

3.1. Experimental Setup

To verify the performance of the proposed RDS metric in practical colorization tasks, this paper designs a series of comparative experiments, covering the colorization model selection, evaluation metric settings, and test task configurations. By constructing colorization results with obvious quality differences and introducing multiple commonly used evaluation metrics for comparison, we systematically evaluate the effectiveness and advantages of RDS.
In terms of colorization models, three representative deep learning models are selected: CycleGAN, Pix2pix, and TIC-CGAN. These three models cover the spectrum from unconditional generative networks and conditional generative networks to improved models with structural constraints, providing sufficient differences in generation strategies and visual performance. The colorization models are not the focus of this paper; their main role is to construct colorization result samples with clearly distinguishable quality levels for metric validation. The selection follows three criteria: first, the three models are widely adopted baselines in the infrared colorization literature, ensuring that their relative performance characteristics are well-understood and externally verifiable; second, they produce colorization results with clearly distinguishable visual quality, which is a prerequisite for rigorously testing whether an evaluation metric possesses sufficient discriminative power; and third, they represent architecturally diverse approaches, reducing the risk that any observed metric behavior is specific to a particular network design. To ensure fairness of comparison and experimental reproducibility, these three models are trained for 500 epochs on the NIR–RGB dataset and 100 epochs on the FLIR-5C dataset to adapt to different data scales and task complexity. During training, all images are scaled to the 640 × 480 resolution. All models use the Adam optimizer, with the initial learning rate set to 1 × 10−3 and gradually decreasing to 1 × 10−6 using a cosine annealing strategy, with other training parameters maintaining the default settings of each method.
In terms of evaluation metrics, this paper selects three classic image quality evaluation methods as comparison baselines: SSIM, PSNR, and FID. These metrics are widely applied in image restoration and generation tasks and are capable of measuring image quality from perspectives of structural similarity, signal-to-noise ratio, and distribution distance. Through the comparison with these metrics, we can systematically evaluate the performance differences and advantages of the RDS in colorization tasks.
To keep the focus on metric evaluation rather than model comparison, the experiments in Section 3.2, Section 3.3, Section 3.4 and Section 3.5 are organized around the effectiveness of the RDS and its three key characteristics—position robustness, fine-grained interpretability, and task adjustability—with the three colorization models serving solely as sources of test samples with distinguishable quality levels. Specifically, this paper conducts analysis from four perspectives. First, through subjective–objective comparison experiments, we verify whether the RDS is consistent with human perception, thereby evaluating its effectiveness. Second, we introduce position perturbation to test the robustness of each metric, verifying its position robustness. Furthermore, we analyze the model’s performance differences across different targets through category-level detection accuracy, evaluating RDS’s fine-grained interpretability. Finally, combined with category division strategies, we compare RDS variations under scene-specific mode and scene generalization mode category definition modes, verifying its task adjustability.

3.2. Effectiveness Validation

To verify the effectiveness of the RDS under standard registration conditions, this section conducts analysis from both subjective perception and objective metrics perspectives, focusing on comparing the consistency between RDS and traditional metrics, as well as human evaluation results.
Figure 7 and Figure 8 show typical colorization results on the FLIR-5C and NIR–RGB datasets. From the subjective visual effect perspective, images generated by TIC-CGAN have natural colors and well-preserved details, with the overall style closest to real color images; Pix2pix performs well in color consistency, but local details are blurred with artifacts appearing in edge regions; CycleGAN shows the most obvious color deviation, with lack of coordination between foreground and background, and the lowest detail fidelity. Overall, the subjective ranking for both datasets is TIC-CGAN > Pix2pix > CycleGAN.
To further verify the consistency of subjective judgment, Table 3 and Table 4 respectively list the average evaluation results of each model on the Seen Test Set and Unseen Test Set of the FLIR-5C and NIR–RGB datasets, including four metrics: SSIM, PSNR, FID, and the proposed RDS. The results show that under both datasets and both test subsets, all four metrics demonstrate consistent model ranking: TIC-CGAN scores highest, with TIC-CGAN > Pix2pix > CycleGAN. For example, on the FLIR-5C Unseen Test Set, the RDS of the three models are 39.87% (TIC-CGAN), 33.78% (Pix2pix), and 30.94% (CycleGAN), with the ranking completely consistent with subjective evaluation (i.e., the qualitative visual inspection of colorization results presented in Figure 7 and Figure 8).
In summary, RDS can accurately quantify quality variations in colorization results in standard registration scenarios, with its evaluation results not only highly consistent with subjective perception but also maintaining good consistency with existing metrics, verifying its effectiveness as an image colorization quality evaluation metric.

3.3. Position Robustness Validation

To verify the stability of RDS under registration error conditions, this section introduces artificial misalignment experiments to evaluate the performance changes in each metric before and after image misalignment. Given that infrared images in the FLIR-5C dataset are derived from grayscale conversion of visible light color images, the original image pairs are perfectly aligned at the pixel level, providing an ideal baseline for constructing artificial misalignment samples. Specifically, for each image, one direction (up, down, left, or right) is randomly selected, edge regions of 1~5 pixels width are cropped, and the image is restored to its original size through bilinear interpolation.
To quantify the performance changes in different metrics before and after misalignment, this paper introduces “Score Gap” as an auxiliary analysis indicator, defined as the absolute difference between the best and worst model scores in the evaluation results. The larger the Score Gap, the stronger the metric’s ability to distinguish model quality; the greater the decrease in Score Gap, the worse the metric’s stability under perturbation.
Figure 9 shows the Score Gap changes in each metric on the FLIR-5C dataset Seen Test Set before and after misalignment. As can be observed from the figure, all three traditional metrics—SSIM, PSNR, and FID—exhibit significant degradation after misalignment. Among them, the Score Gap of PSNR decreases from 5.12 dB to 1.54 dB, with a reduction rate as high as 69.8%, indicating its strongest dependence on pixel-level alignment; the Score Gap of SSIM decreases from 0.0959 to 0.0469, with a reduction rate of 51.1%; the Score Gap of FID decreases from 36.7 to 32.5, with a reduction rate of 11.4%. These results demonstrate that traditional metrics significantly decline in their ability to distinguish model quality under registration error scenarios.
In contrast, the Score Gap of the RDS remains essentially stable, changing marginally from 7.26% to 7.67%; this negligible difference falls within normal statistical fluctuation of AP computation over a finite test set and does not indicate a directional improvement, but rather confirms that RDS maintains stable discriminability under position perturbation. This characteristic stems from RDS’s adoption of an IoU-based object matching mechanism during the calculation process, making it insensitive to pixel-level position shifts, thereby possessing good position robustness and enhancing reliability in practical applications.

3.4. Fine-Grained Interpretability Validation

To verify the fine-grained interpretability of the RDS, this section analyzes its category-level scoring results. Figure 10 shows the AP values of each method across six detection categories on the NIR–RGB dataset Seen Test Set, where the detection results of real RGB images (RDS = 98.3%) serve as the performance upper bound. The RDSs of each model are marked with dashed lines to facilitate the observation of the relationship between average performance and category-level performance.
Using real RGB images as the reference baseline, TIC-CGAN achieves AP values of 98.7% and 98.7% on the “jar” and “cube” categories, respectively, almost identical to the RGB baseline (98.9%, 99.7%), indicating that this model possesses strong structural preservation capability on shape-dominant targets. For color-dominant categories such as “white cup,” “brown cup,” “pink ball,” and “purple ball,” TIC-CGAN also maintains AP values above 90%, with an overall RDS of 95.7%, only 2.6 percentage points away from the RGB baseline, demonstrating its strong scene memory and color restoration capability.
Further comparing Pix2pix with TIC-CGAN, it can be concluded that both perform similarly on shape-dominant categories, but show AP gaps of approximately 9% and 11% on the “pink ball” and “purple ball” categories, respectively. This difference corresponds to the phenomenon in the first row of Figure 8, where Pix2pix shows color restoration deviations on spheres, indicating that Pix2pix still has deficiencies in learning color details, resulting in an overall RDS of 91.4%.
CycleGAN’s category-level performance shows obvious imbalance characteristics. For the “white cup” and “brown cup” categories, its AP can still reach 76.6% and 79.6%, but for the “cube” and “purple ball” categories, the AP values are only 33.8% and 26.7%, respectively, far lower than other models. These two severe shortcomings directly pull down its overall RDS to 57.4%. The above quantitative results are consistent with the distortion and discoloration phenomena in CycleGAN output images in Figure 8, verifying the reliability of RDS category-level analysis results.
In summary, RDS can reveal model performance differences across different targets through category-level AP decomposition. It can not only identify the average performance level of models but also precisely locate their weaknesses in specific categories, providing clear basis for performance diagnosis and targeted optimization, possessing fine-grained interpretability that traditional metrics (SSIM, PSNR, FID) do not have.

3.5. Task Adjustability Validation

To verify the RDS’s adaptability under different task objectives, this paper designs task adjustability validation experiments on the Unseen Test Set of NIR–RGB, examining RDS’s evaluation capability under two task settings: color restoration-oriented and structure preservation-oriented.
The experiment employs two category definition strategies: the scene-specific mode (six-category setting) subdivides targets into six categories of jar, cube, white cup, brown cup, pink ball, and purple ball, emphasizing learning and restoration capability for color details; the scene generalization mode (4-category setting) merges targets by shape dimension into four categories of jar, cube, cup, and ball, reducing dependence on color consistency and focusing more on structure and semantic information.
Table 5 shows the RDSs and category-level AP values of the three colorization models under these two category settings. From the table, it can be observed that under the six-category setting, TIC-CGAN’s AP on “purple ball” is only 22.00%, significantly pulling down the overall score to 76.05%. This is consistent with the phenomenon in Figure 8, where the model incorrectly colors the purple ball as pink. Due to the influence of multi-solution characteristics of colors, forcibly distinguishing the color dimension in unseen scenes easily leads to evaluation distortion.
Under the four-category setting, pink ball and purple ball are unified as the “ball” category, eliminating the above penalty. TIC-CGAN’s AP increases to 100%, and the overall RDS also rises to 96.45%. Other models show similar trends, indicating that under scene generalization conditions, adjusting category definitions can effectively avoid misjudgment of reasonable coloring results, making evaluation results more consistent with human perception and application objectives.
In summary, through switchable category mapping methods, RDS enables the evaluation dimension of colorization models to shift from color restoration capability to structure preservation capability, achieving flexible adjustment of evaluation dimensions under task-oriented guidance, demonstrating good task adjustability.

4. Conclusions

This paper proposes a task-oriented colorization image evaluation metric RDS, which achieves consistency between colorization quality evaluation and downstream task performance by using the recognition accuracy of object detection models on colorized images as a proxy evaluation indicator for image quality. This paper designs four types of experimental tasks to evaluate the effectiveness and advantages of the RDS. Results demonstrate that the RDS is not only highly consistent with subjective perception but also maintains good consistency with existing objective metrics; maintains stable discriminability in position perturbation scenarios, verifying its position robustness; supports category-level performance decomposition, revealing model performance differences across different object categories and demonstrating fine-grained interpretability; and when combined with category division strategies, RDS can switch evaluation dimensions between color restoration and structure preservation according to task objectives, exhibiting good task adjustability. Quantitatively, its discriminative Score Gap improves by 5.7% under registration error scenarios while PSNR, SSIM, and FID degrade by up to 69.8%, 51.1%, and 11.4%, respectively; category-level AP decomposition exposes per-category weaknesses—such as CycleGAN’s AP of only 26.7% on the “purple ball” category—that are completely hidden by global scores; and switching from scene-specific mode to scene generalization mode reduces evaluation distortion in unseen scenes, raising TIC-CGAN’s RDS from 76.05% to 96.45%. Overall, the RDS provides a new evaluation metric with enhanced usability and stability for colorization results.
Future work can further apply RDS to broader generation and translation tasks such as night vision enhancement and remote sensing false color, improving its applicability across different data types and application scenarios. Meanwhile, it can also be combined with more complex downstream tasks such as semantic segmentation and action recognition to further expand the coverage scope of task-oriented metrics. In addition, validating the consistency of RDSs across different proxy detection architectures—such as YOLOv8 and Faster R-CNN—represents an important next step to further establish the metric’s generalizability, and it is planned as a priority item in our subsequent work.
Furthermore, applying RDS to evaluate state-of-the-art colorization methods based on Diffusion Models and Vision Transformers would provide a broader benchmark and further demonstrate the metric’s discriminative power across the modern research landscape, representing a natural and straightforward extension of the current work. A formal quantitative correlation analysis between RDSs and human subjective ratings—such as Spearman’s Rank Correlation Coefficient computed over a larger set of colorization methods—would further characterize the metric’s perceptual alignment and is planned as part of a dedicated follow-up study.

Author Contributions

Conceptualization, H.W. and J.C.; methodology, H.W. and J.C.; software, H.W. and J.C.; validation, H.W. and J.C.; formal analysis, H.W.; investigation, H.W. and J.C.; resources, C.Z.; data curation, J.C.; writing—original draft preparation, H.W.; writing—review and editing, Y.H. and C.Z.; visualization, H.W.; supervision, Y.H. and Q.H.; project administration, Y.H. and Q.H.; funding acquisition, Y.H. and Q.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62427817) and Beijing Key Laboratory of Advanced Optical Remote Sensing Technology Fund (AORS202407).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request. The FLIR-5C dataset is derived from the publicly available FLIR thermal dataset available at https://oem.flir.com/zh-cn/solutions/automotive/adas-dataset-form/ (accessed on 27 January 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Park, J.; Chen, J.; Cho, Y.K.; Kang, D.Y.; Son, B.J. CNN-Based Person Detection Using Infrared Images for Night-Time Intrusion Warning Systems. Sensors 2020, 20, 34. [Google Scholar] [CrossRef]
  2. Zeng, Q.; Qin, H.; Yan, X. Fourier Domain Anomaly Detection and Spectral Fusion for Stripe Noise Removal of TIR Imagery. Remote Sens. 2020, 12, 3714. [Google Scholar] [CrossRef]
  3. Wan, M.; Gu, G.; Qian, W.; Ren, K.; Chen, Q.; Maldague, X. Infrared Image Enhancement Using Adaptive Histogram Partition and Brightness Correction. Remote Sens. 2018, 10, 682. [Google Scholar] [CrossRef]
  4. Pang, Y.; Lin, J.; Qin, T.; Chen, Z. Image-to-Image Translation: Methods and Applications. IEEE Trans. Multimed. 2022, 24, 3859–3881. [Google Scholar] [CrossRef]
  5. Hoyez, H.; Schockaert, C.; Rambach, J.; Mirbach, B.; Stricker, D. Unsupervised Image-to-Image Translation: A Review. Sensors 2022, 22, 8540. [Google Scholar] [CrossRef]
  6. Zhang, J.; Xu, C.; Li, J.; Yue, L. SCSNet: An Efficient Paradigm for Learning Simultaneously Image Colorization and Super-Resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; Volume 36, pp. 3432–3440. [Google Scholar] [CrossRef]
  7. Du, Y.; Xu, Y.; Ye, T.; Wang, Q.; Deng, J. Invertible Grayscale with Sparsity Enforcing Priors. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 97. [Google Scholar] [CrossRef]
  8. Limmer, M.; Lensch, H.P.A. Infrared Colorization Using Deep Convolutional Neural Networks. In Proceedings of the 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, CA, USA, 18–20 December 2016. [Google Scholar] [CrossRef]
  9. Dong, Z.; Kamata, S.; Breckon, T.P. Infrared Image Colorization Using a S-Shape Network. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 2242–2246. [Google Scholar]
  10. Ma, X.; Huang, W.; Huang, R.; Liu, X. Near-Infrared Image Colorization Using Asymmetric Codec and Pixel-Level Fusion. Appl. Sci. 2022, 12, 10087. [Google Scholar] [CrossRef]
  11. Browne, A.W.; Deyneka, E.; Ceccarelli, F.; To, J.K.; Chen, S.; Tang, J.; Vu, A.N.; Baldi, P.F. Deep Learning to Enable Color Vision in the Dark. PLoS ONE 2022, 17, e0265185. [Google Scholar] [CrossRef]
  12. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  13. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  14. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar] [CrossRef]
  15. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar] [CrossRef]
  16. Ye, T.; Du, Y.; Deng, J.; Wang, Q. Invertible Grayscale via Dual Features Ensemble. IEEE Access 2020, 8, 89670–89679. [Google Scholar] [CrossRef]
  17. Yang, S.; Sun, M.; Lou, X.; Yang, H.; Liu, D. Nighttime Thermal Infrared Image Translation Integrating Visible Images. Remote Sens. 2024, 16, 666. [Google Scholar] [CrossRef]
  18. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar] [CrossRef]
  19. Suarez, P.L.; Sappa, A.D.; Vintimilla, B.X. Infrared Image Colorization Based on a Triplet DCGAN Architecture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 212–217. [Google Scholar] [CrossRef]
  20. Berg, A.; Ahlberg, J.; Felsberg, M. Generating Visible Spectrum Images from Thermal Infrared. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1143–1152. [Google Scholar] [CrossRef]
  21. Luo, F.; Li, Y.; Zeng, G.; Peng, P.; Wang, G.; Li, Y. Thermal Infrared Image Colorization for Nighttime Driving Scenes with Top-Down Guided Attention. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15808–15823. [Google Scholar] [CrossRef]
  22. Kuang, X.; Zhu, J.; Sui, X.; Liu, Y.; Liu, C.; Chen, Q.; Gu, G. Thermal Infrared Colorization via Conditional Generative Adversarial Network. Infrared Phys. Technol. 2020, 107, 103338. [Google Scholar] [CrossRef]
  23. Liang, W.; Ding, D.; Wei, G. An Improved DualGAN for Near-Infrared Image Colorization. Infrared Phys. Technol. 2021, 116, 103764. [Google Scholar] [CrossRef]
  24. Wang, H.; Cheng, C.; Zhang, X.; Liu, S. Towards High-Quality Thermal Infrared Image Colorization via Attention-Based Hierarchical Network. Neurocomputing 2022, 501, 318–327. [Google Scholar] [CrossRef]
  25. Wu, L.; Shi, Y.; Lin, Y.H.; Wu, W. Thermal Infrared Image Colorization on Nighttime Driving Scenes with Spatially-Correlated Learning. J. Electron. Imaging 2024, 33, 023019. [Google Scholar] [CrossRef]
  26. Sun, T.; Jung, C.; Fu, Q.; Han, Q. NIR to RGB Domain Translation Using Asymmetric Cycle Generative Adversarial Networks. IEEE Access 2019, 7, 112459–112469. [Google Scholar] [CrossRef]
  27. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6626–6637. [Google Scholar]
  28. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  29. Jayasumana, S.; Ramaswamy, S.; Veit, A.; Glasner, D.; Chakrabarti, A.; Kumar, S. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 9283–9292. [Google Scholar]
  30. Benny, Y.; Galanti, T.; Benaim, S.; Wolf, L. Evaluation Metrics for Conditional Image Generation. Int. J. Comput. Vis. 2021, 129, 1712–1731. [Google Scholar] [CrossRef]
  31. Huynh-Thu, Q.; Ghanbari, M. Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 2008, 44, 800–801. [Google Scholar] [CrossRef]
  32. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  33. Zhang, R.; Isola, P.; Efros, A.A. Colorful Image Colorization. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 649–666. [Google Scholar] [CrossRef]
  34. Larsson, G.; Maire, M.; Shakhnarovich, G. Learning Representations for Automatic Colorization. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 577–593. [Google Scholar] [CrossRef]
  35. Treneska, S.; Zdravevski, E.; Pires, I.M.; Lameski, P.; Gievska, S. GAN-Based Image Colorization for Self-Supervised Visual Feature Learning. Sensors 2022, 22, 1599. [Google Scholar] [CrossRef]
  36. Teng, X.; Li, Z.; Liu, Q.; Zhang, C.; Quan, Y. Subjective evaluation of colourized images with different colorization models. Color Res. Appl. 2020, 46, 419–431. [Google Scholar] [CrossRef]
  37. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  38. Jocher, G. YOLOv5 by Ultralytics. Version 7.0. Available online: https://github.com/ultralytics/yolov5 (accessed on 27 January 2026).
  39. FLIR Systems. FLIR ADAS Thermal Dataset. Available online: https://www.flir.com/oem/adas/adas-dataset-form/ (accessed on 27 January 2026).
Figure 1. Overall computation pipeline of the RDS.
Figure 1. Overall computation pipeline of the RDS.
Sensors 26 01807 g001
Figure 2. Example of single grayscale value corresponding to multiple color values. (a) Near-infrared grayscale image; (b) visible light grayscale image.
Figure 2. Example of single grayscale value corresponding to multiple color values. (a) Near-infrared grayscale image; (b) visible light grayscale image.
Sensors 26 01807 g002
Figure 3. Edge alignment schematic diagram. (a) Near-infrared grayscale image; (b) edge feature map of near-infrared grayscale image; (c) (a) overlaid with (b); (d) visible light color image overlaid with (b).
Figure 3. Edge alignment schematic diagram. (a) Near-infrared grayscale image; (b) edge feature map of near-infrared grayscale image; (c) (a) overlaid with (b); (d) visible light color image overlaid with (b).
Sensors 26 01807 g003
Figure 4. NIR–RGB dataset composition schematic diagram. (a) Image from Scene 1; (b) image from Scene 1; (c) image from Scene 2; (d) image from Scene 3.
Figure 4. NIR–RGB dataset composition schematic diagram. (a) Image from Scene 1; (b) image from Scene 1; (c) image from Scene 2; (d) image from Scene 3.
Sensors 26 01807 g004
Figure 5. Comparison between near-infrared grayscale image and visible light grayscale image. (a) Near-infrared grayscale image; (b) visible light grayscale image.
Figure 5. Comparison between near-infrared grayscale image and visible light grayscale image. (a) Near-infrared grayscale image; (b) visible light grayscale image.
Sensors 26 01807 g005
Figure 6. A pair of images from the FLIR-5C dataset. (a) Visible light grayscale image; (b) visible light color image.
Figure 6. A pair of images from the FLIR-5C dataset. (a) Visible light grayscale image; (b) visible light color image.
Sensors 26 01807 g006
Figure 7. Test results on the FLIR-5C dataset. (a) Infrared grayscale image; (b) CycleGAN; (c) Pix2pix; (d) TIC-CGAN; (e) real color image. Green boxes indicate zoomed in regions, and red boxes indicate regions of key focus.
Figure 7. Test results on the FLIR-5C dataset. (a) Infrared grayscale image; (b) CycleGAN; (c) Pix2pix; (d) TIC-CGAN; (e) real color image. Green boxes indicate zoomed in regions, and red boxes indicate regions of key focus.
Sensors 26 01807 g007
Figure 8. Test results on the NIR–RGB dataset. (a) Near-infrared grayscale image; (b) CycleGAN; (c) Pix2pix; (d) TIC-CGAN; (e) color image. Green boxes indicate zoomed in regions, and red boxes indicate regions of key focus.
Figure 8. Test results on the NIR–RGB dataset. (a) Near-infrared grayscale image; (b) CycleGAN; (c) Pix2pix; (d) TIC-CGAN; (e) color image. Green boxes indicate zoomed in regions, and red boxes indicate regions of key focus.
Sensors 26 01807 g008
Figure 9. Comparison of Score Gap changes in each evaluation metric on FLIR-5C dataset Seen Test Set before and after misalignment. The vertical axis represents the relative Score Gap percentage with the no-misalignment Score Gap as the 100% baseline, and the values at the top of bars indicate the original Score Gaps. Blue bars represent the no-misalignment baseline, red/green bars represent the relative Score Gaps after misalignment, with arrows indicating change rates.
Figure 9. Comparison of Score Gap changes in each evaluation metric on FLIR-5C dataset Seen Test Set before and after misalignment. The vertical axis represents the relative Score Gap percentage with the no-misalignment Score Gap as the 100% baseline, and the values at the top of bars indicate the original Score Gaps. Blue bars represent the no-misalignment baseline, red/green bars represent the relative Score Gaps after misalignment, with arrows indicating change rates.
Sensors 26 01807 g009
Figure 10. Category-level average evaluation results and RDS visualization analysis of each model on NIR–RGB Seen Test Set.
Figure 10. Category-level average evaluation results and RDS visualization analysis of each model on NIR–RGB Seen Test Set.
Sensors 26 01807 g010
Table 1. NIR–RGB dataset detailed information.
Table 1. NIR–RGB dataset detailed information.
Data TypeNumber of Image PairsNumber of TargetsNumber of Scenes
Training set14434134138
Seen Test Set3611025
Unseen Test Set20565133
Table 2. FLIR-5C dataset detailed information.
Table 2. FLIR-5C dataset detailed information.
Data TypeNumber of Image PairsNumber of TargetsNumber of Scenes
Training set8403129,172132
Seen Test Set94314,742
Unseen Test Set103315,56517
Table 3. Average evaluation results of each model on the FLIR-5C test set.
Table 3. Average evaluation results of each model on the FLIR-5C test set.
DatasetSeen Test SetUnseen Test Set
MetricSSIMPSNR/dBFIDRDSSSIMPSNR/dBFIDRDS
Model
CycleGAN0.848026.1975.3531.13%0.833125.7377.4030.94%
Pix2pix0.899428.8157.8234.60%0.880527.7366.1633.78%
TIC-CGAN0.943931.3138.7038.39%0.937430.8440.8039.87%
Table 4. Average evaluation results of each model on the NIR–RGB test set.
Table 4. Average evaluation results of each model on the NIR–RGB test set.
DatasetSeen Test SetUnseen Test Set
MetricSSIMPSNR/dBFIDRDSSSIMPSNR/dBFIDRDS
Model
CycleGAN0.787817.01136.257.39%0.715314.89146.250.81%
Pix2pix0.892426.0371.9791.38%0.784417.55125.074.06%
TIC-CGAN0.910925.7931.2695.71%0.816218.0874.3576.05%
Table 5. Average evaluation results of each model on NIR–RGB Unseen Test Set under different category strategies.
Table 5. Average evaluation results of each model on NIR–RGB Unseen Test Set under different category strategies.
Category StrategyScene-Specific ModeScene Generalization Mode
MetricAPRDSAPRDS
TargetJarCubeWhite CupBrown CupPink BallPurple BallJarCubeCupBall
Model
Cycle-GAN74.04%20.07%75.87%66.56%62.45%5.87%50.81%75.55%34.49%92.82%90.68%73.39%
Pix2pix94.07%85.52%82.17%73.91%73.65%35.04%74.06%93.41%85.36%96.99%99.05%93.70%
TIC-CGAN96.12%92.33%86.34%84.05%75.46%22.00%76.05%94.15%94.53%97.13%100.0%96.45%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, H.; Cai, J.; Hu, Y.; Zhang, C.; Hao, Q. Recognition Task-Based Detection Score: A Task-Oriented Evaluation Metric for Infrared Image Colorization. Sensors 2026, 26, 1807. https://doi.org/10.3390/s26061807

AMA Style

Wang H, Cai J, Hu Y, Zhang C, Hao Q. Recognition Task-Based Detection Score: A Task-Oriented Evaluation Metric for Infrared Image Colorization. Sensors. 2026; 26(6):1807. https://doi.org/10.3390/s26061807

Chicago/Turabian Style

Wang, Hao, Jiaming Cai, Yao Hu, Chenglong Zhang, and Qun Hao. 2026. "Recognition Task-Based Detection Score: A Task-Oriented Evaluation Metric for Infrared Image Colorization" Sensors 26, no. 6: 1807. https://doi.org/10.3390/s26061807

APA Style

Wang, H., Cai, J., Hu, Y., Zhang, C., & Hao, Q. (2026). Recognition Task-Based Detection Score: A Task-Oriented Evaluation Metric for Infrared Image Colorization. Sensors, 26(6), 1807. https://doi.org/10.3390/s26061807

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop