Article

CSMR: A Multi-Modal Registered Dataset for Complex Scenarios

1 Key Laboratory of Photoelectronic Imaging Technology and System, Ministry of Education of China, Beijing Institute of Technology, Beijing 100081, China
2 Key Laboratory of Metallurgical Equipment and Control Technology, Ministry of Education, Wuhan University of Science and Technology, Wuhan 430081, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(5), 844; https://doi.org/10.3390/rs17050844
Submission received: 19 January 2025 / Revised: 20 February 2025 / Accepted: 24 February 2025 / Published: 27 February 2025
(This article belongs to the Special Issue Recent Advances in Infrared Target Detection)

Abstract

Complex scenarios pose challenges to tasks in computer vision, including image fusion, object detection, and image-to-image translation. On the one hand, complex scenarios involve fluctuating weather or lighting conditions, where even images of the same scenarios appear to be different. On the other hand, the large amount of textural detail in the given images introduces considerable interference that can conceal the useful information contained in them. An effective solution to these problems is to use the complementary details present in multi-modal images, such as visible-light and infrared images. Visible-light images contain rich textural information while infrared images contain information about the temperature. In this study, we propose a multi-modal registered dataset for complex scenarios under various environmental conditions, targeting security surveillance and the monitoring of low-slow-small targets. Our dataset contains 30,819 images, where the targets are labeled as three classes of “person”, “car”, and “drone” using Yolo format bounding boxes. We compared our dataset with those used in the literature for computer vision-related tasks, including image fusion, object detection, and image-to-image translation. The results showed that introducing complementary information through image fusion can compensate for missing details in the original images, and we also revealed the limitations of visual tasks in single-modal images with complex scenarios.

1. Introduction

Complex scenarios pose a challenging problem for computer vision tasks involving visible-light images. The target may be submerged in the complex background of an image because of noise or because the background details resemble the target, which is a significant hurdle for object detection [1]. In the context of image fusion, the useful information contained in a given image may vary with the weather or lighting conditions, and some details may be concealed by poor lighting or inclement weather. The complementary information contained in infrared and visible-light images can help solve this problem. Infrared target detection technology has the advantages of a strong anti-interference capability, high concealment, and all-weather functionality, and has been widely applied to ecosystem services, urban monitoring, and environmental protection. Areas with different temperatures are salient in infrared images, but such images lack textural information. By contrast, visible-light images contain abundant details that can overwhelm the target. By fusing visible-light and infrared images, the complementary information can be integrated into a single image that contains both textural detail and saliency-related information [2,3].
Image fusion refers to the fusion of information from multiple images to generate a new image that contains all the key information and features of the original images. Image fusion technology can effectively combine image information from different sources to improve the quality and information content of images [4]. Better results can be obtained for visual tasks such as object detection by using fused images. With advances in artificial intelligence in recent years, deep learning-based methods [5,6,7,8] have come to be widely used in computer vision tasks. However, such methods require a large amount of multi-modal image data to train the network. A number of datasets of typical infrared and visible-light images have been developed, including the TNO [9], KAIST Multispectral Database [10], OSU Color–Thermal Database [11], CVC-14 [12], FLIR Thermal Dataset [13], and LLVIP [14]. However, many of the images in these datasets are unregistered, unlabeled, or cover only a single scenario, and none of them contain images of complex scenarios. It is thus important to construct a multi-modal, registered dataset of complex scenarios.
In this study, we develop a dataset called the CSMR, a multi-modal registered dataset for complex scenarios. The structure of our dataset is shown in Figure 1. It contains images of complex scenarios for security surveillance and the monitoring of low-slow-small targets. The CSMR contains pairs of labeled visible-light and infrared images of a variety of complex scenarios. Each infrared–visible image pair in the dataset has been strictly registered and is consistent in terms of shooting scenario and shooting time. We used a binocular camera platform to collect both kinds of images simultaneously, so that images of different modalities were spatio-temporally consistent. Because different cameras were used to capture them, the image pairs had different sizes and fields of view; they therefore needed to be registered so that they were strictly spatio-temporally aligned, after which they could be used for a number of computer vision tasks. The CSMR contains images of complex scenarios involving city streets, forests, fields, and watersides, and features varying levels of interference. For example, trees cover pedestrians and cars in images of city streets, while the lighting conditions and the intensity of light influence target detection. We also collected images under different weather and lighting conditions: the visible-light images provided more useful information on sunny days, while the temperature-related information contained in infrared images was more useful for image fusion under poor lighting. Our dataset contains different kinds of targets, including pedestrians, cars, and drones. The images were also captured from different perspectives, including head-up, top-down, and bottom-up, to provide comprehensive information. We tested several typical algorithms for image fusion, object detection, and image-to-image translation on the CSMR and found that complex scenarios raise important issues for these visual tasks. Our dataset is published at CSMR-A-Multimodal-Registered-Dataset-for-Complex-Scenarios (accessed on 23 February 2025).
The main contributions of this study are as follows:
(1)
We propose the CSMR, which is, to the best of our knowledge, the first multi-modal registered dataset of complex scenarios for security surveillance and the monitoring of low-slow-small targets.
(2)
Our dataset includes many complex scenarios and objects that are not found in other datasets, enabling it to support applications that other datasets cannot. Our complex scenarios and focus on low-slow-small targets are unique.
(3)
We test our dataset on various visual tasks and identify some scientific problems that can guide future research.

2. Related Works

A number of datasets of infrared and visible-light images have been developed in past research, including the TNO Image Fusion Dataset [9], OSU Color–Thermal Database [11], KAIST Multi-spectral Dataset [10], CVC-14 [12], FLIR Thermal Dataset [13], and LLVIP [14]. Table 1 compares our CSMR with these datasets, and Figure 2 shows representative examples of them. Our CSMR outperforms the other datasets in almost all respects. Although the resolution of its images is lower than that of LLVIP, it is sufficiently high for image fusion, object detection, and image-to-image translation. Its number of images, the perspectives from which they were captured, and the complex scenarios they cover distinguish the CSMR from the above-mentioned datasets.
The TNO Image Fusion Dataset [9] is designed for image fusion. It provides intensified visible-light (390–700 nm), near-infrared (700–1000 nm), and thermal infrared (8000–12,000 nm) images that were captured at night in different military and surveillance scenarios, and contains targets such as persons and cars. The TNO was released in 2014, when deep learning was not widely used in research, and is thus unsuitable for deep learning-based image fusion algorithms. Moreover, it is unsuitable for research on object detection because its targets, such as pedestrians, are not annotated. Although it contains images covering several scenarios, it is not large enough to support research on deep learning.
The OSU Color–Thermal Database [11] is part of the OTCBVS Benchmark Dataset Collection developed by Dr. Riad I. Hammoud in 2004, and currently managed by Dr. Guoliang Fan at Oklahoma State University. This dataset focuses on the fusion and fusion-based detection of objects in color and thermal images. It contains 17,089 images of busy intersections on the campus of Ohio State University. The color and thermal images were captured by two cameras mounted adjacent to each other on a tripod placed on a building approximately three storeys high. While the dataset contains a large number of images covering a variety of scenarios, the scenarios considered are too simplistic. The images were all captured during the day, with the intersection of the path as the only background. This dataset is thus not suitable for use in research on complex scenarios.
Each image in the KAIST Multi-spectral Dataset [10] has an RGB version and an infrared version, and the visible-light–infrared image pairs are spatio-temporally aligned. The KAIST contains images of typical traffic-related scenarios on university campuses, city streets, and in the countryside during the day and at night. It is designed for research on autonomous driving, and accordingly all of its images were captured from inside cars. The KAIST is thus rarely used for object detection because its instances usually overlap with one another. CVC-14 [12] is also a typical autonomous-driving dataset, designed for the automatic detection of pedestrians. It contains visible-light and infrared images captured during the day and at night. However, these image pairs are not registered and are thus not suitable for research on image fusion.
The FLIR Thermal Dataset [13] was published in 2018 and is used to develop and train convolutional neural networks (CNNs). It is used by the automotive industry to develop safe and efficient advanced driver assistance systems (ADASs) and autonomous vehicle systems. However, its images cannot be used directly for image fusion because they are not registered.
LLVIP [14] is a dataset of visible-light–infrared image pairs captured under poor lighting conditions. It contains 16,836 registered visible–infrared image pairs, which were captured by a binocular camera platform consisting of a visible-light camera and an infrared camera. The LLVIP provides high-quality image pairs under low-light conditions. However, real scenarios involve many complex conditions other than low light, such as varying light intensities, temperature differences (which especially affect infrared images), and target camouflage. The LLVIP considers only poor lighting conditions and contains only images captured on streets. By contrast, our CSMR contains images captured in a variety of complex scenarios.

3. CSMR Dataset

In this study, we propose a multi-modal registered dataset for complex scenarios called the CSMR. Below, we detail the procedures of image collection and processing for the dataset and analyze its characteristics and applications.

3.1. Image Collection

We used a binocular camera platform and a portable computer to capture images for our dataset. The platform contained a visible-light and an infrared camera as well as a gimbal that could rotate in different directions. Figure 3 shows the equipment used for image collection. We used it to simultaneously capture infrared and visible-light images of the same scenario. We chose several typical scenarios under different lighting and weather conditions to ensure their complexity and comprehensiveness. We also collected images from different observational perspectives, including head-up, top-down, and bottom-up. This yielded a total of 30,819 visible-light–infrared image pairs.

3.2. Camera Parameters

The cameras used for image collection are shown in Figure 4, and their main parameters are listed in Table 2. They were purchased from HIKVISION (Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou, China); the iDS-2ZMN3009N and DS-2TD2067T-6/X were used as the visible-light and infrared cameras, respectively. The optical axes of the visible-light and infrared cameras were precisely calibrated before use to ensure that they were parallel. Table 2 shows only some of their parameters and functions. The resolution and frame rate of the visible-light camera were adjustable, with a maximum resolution of 2560 × 1440, while the infrared camera had a single resolution of 640 × 512. After registration, the image pairs had a size of 640 × 512. The cameras also supported various network protocols, and we used TCP/IP for network communication. Many algorithms could be run on the data provided by these cameras, but we did not develop such functions; nonetheless, we believe they could make our image collection system more intelligent.
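The paper does not describe the acquisition software itself. The sketch below shows, under our own assumptions, how two network cameras of this kind could be read over the network with OpenCV to record roughly time-aligned frame pairs; the RTSP URLs, credentials, and output paths are placeholders, not values taken from the paper.

```python
import os

import cv2

# Placeholder stream addresses; the real URLs, credentials, and channel numbers
# depend on the camera configuration and are not given in the paper.
VIS_URL = "rtsp://user:password@192.168.1.64/Streaming/Channels/101"
IR_URL = "rtsp://user:password@192.168.1.65/Streaming/Channels/101"

os.makedirs("vis", exist_ok=True)
os.makedirs("ir", exist_ok=True)

vis_cap = cv2.VideoCapture(VIS_URL)
ir_cap = cv2.VideoCapture(IR_URL)

frame_id = 0
while vis_cap.isOpened() and ir_cap.isOpened():
    ok_vis, vis_frame = vis_cap.read()
    ok_ir, ir_frame = ir_cap.read()
    if not (ok_vis and ok_ir):
        break
    # Store both frames under a shared index so each pair stays roughly aligned in time.
    cv2.imwrite(f"vis/{frame_id:06d}.png", vis_frame)
    cv2.imwrite(f"ir/{frame_id:06d}.png", ir_frame)
    frame_id += 1

vis_cap.release()
ir_cap.release()
```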

3.3. Complex Scenarios

At present, there is no strict definition of a complex scenario [15]; the notion is relative and needs to be discussed in the context of the specific applications of a dataset. Our dataset is designed for security surveillance and the monitoring of low-slow-small targets. In this context, a complex scenario refers to image or video content that contains many different types of objects, complex backgrounds, and various visual elements. The complexity may arise from the number of objects, the complexity of the background, mutual occlusion between objects, lighting changes, and other visual interference factors.
These complex scenarios cause problems for visual tasks. For image fusion, images exhibit different features under different conditions, and even the same instance can appear different across images. For example, some textural detail that is clear in an image captured during the day becomes fuzzy at night. Moreover, the saliency of cars in infrared images is affected by their color and how long they have been running. This makes it difficult for fusion algorithms to learn how to extract useful information from multi-modal images. For object detection, the complex background contains many textural details that resemble the targets, which negatively influences detection. In addition, multiple instances may occlude one another, or poor lighting or weather may submerge the targets, making it difficult to distinguish between them. For image-to-image translation, complex backgrounds contain a lot of noise and detail, which can cause distortion and blurring in the generated image. In addition, there are significant differences in spectral characteristics and semantic information between visible-light and infrared images; complex backgrounds further exacerbate these differences, making it difficult for translation models to accurately capture and transform these features. Complex backgrounds also increase the difficulty of model training, especially when training data are limited: the model needs to learn how to maintain the quality and consistency of generated images under different background conditions [16,17,18,19].
We selected the scenarios based on two considerations. First, the scenarios must be sufficiently diverse: other datasets, such as KAIST and LLVIP, include only a few fixed scenarios, which makes it difficult to apply them elsewhere. Second, the scenarios must be sufficiently complex: the scenarios we selected contain complex textural details or a large number of interacting objects, or exhibit significant regional differences, such as scenes where water meets the sky. The scenarios we chose include (1) crowded and sparse street or intersection scenes during the daytime, at nighttime, and on rainy and sunny days, containing person and car targets; (2) field scenes with a person and a disguised person; (3) waterside scenes with a person and a ship; (4) mountain scenes with a person, a car, and a disguised person and car; and (5) a pure sky background with a drone. We selected the targets based on the applications of our dataset: for security surveillance, the focus is on persons and cars; for low-slow-small target monitoring, the focus is on drones. We therefore chose these targets while ignoring others. In these scenarios, there are a large number of targets of interest, and the backgrounds are sufficiently complex, which aligns with our application requirements. Figure 5 shows these scenarios.

3.4. Registration

The infrared and visible-light images needed to be registered because they have different resolutions and fields of view. The KAIST dataset uses an optical design with a shared optical path for infrared and visible light, so the captured images are inherently registered and the registration process is greatly simplified; however, the specific registration algorithm employed is not detailed in the relevant literature. The beam splitter in this design reduces light transmittance, which affects image quality, and the co-aperture device cannot handle depth of field and can only image objects at infinity. Moreover, the acquisition method used by KAIST requires camera calibration and color correction. Although the images we collected are not inherently registered, our cameras provide better image quality, which is especially important for complex scenarios with rich texture details. Our subsequent registration process is simpler than camera calibration and color correction, and our method of capturing images is simpler and more practical than that of KAIST. In contrast, the LLVIP dataset features a device structure similar to ours, with the infrared and visible-light sensors arranged in a parallel optical path configuration. Its registration adopts a semi-manual approach in which the registration parameters are derived from manually marked points; this method, while effective, is labor-intensive and not easily scalable.
Our approach leverages the combination of SuperPoint and LightGlue [20,21] to automatically detect and match feature points. This method extracts a large number of robust feature point pairs and significantly reduces the need for manual intervention. Furthermore, the RANSAC algorithm effectively eliminates mismatched point pairs, enhancing the precision and reliability of the registration results. This automated and robust approach not only achieves higher accuracy but also adapts better to diverse scenarios and perspectives, making it a more versatile solution for registration tasks. The results of registration were satisfactory, with an AUC@5px of 79.6%, and registering a 640 × 512 image took 29 ms. Figure 6 shows some results of image registration.
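To make the pipeline concrete, the following is a minimal sketch of this kind of registration using the public SuperPoint and LightGlue implementations together with OpenCV's RANSAC homography estimation. The file names, the keypoint budget, the RANSAC threshold, and the choice of warping the visible image onto the 640 × 512 infrared frame are our assumptions; the paper does not specify these implementation details.

```python
import os

import cv2
import torch
from lightglue import LightGlue, SuperPoint
from lightglue.utils import load_image, rbd

device = "cuda" if torch.cuda.is_available() else "cpu"
extractor = SuperPoint(max_num_keypoints=2048).eval().to(device)   # keypoint budget is our choice
matcher = LightGlue(features="superpoint").eval().to(device)

# Hypothetical file names for one visible/infrared pair.
vis = load_image("vis/000001.png").to(device)
ir = load_image("ir/000001.png").to(device)

feats_vis = extractor.extract(vis)
feats_ir = extractor.extract(ir)
matches01 = matcher({"image0": feats_vis, "image1": feats_ir})
feats_vis, feats_ir, matches01 = [rbd(x) for x in (feats_vis, feats_ir, matches01)]

matches = matches01["matches"]                                      # (K, 2) index pairs
pts_vis = feats_vis["keypoints"][matches[:, 0]].cpu().numpy()
pts_ir = feats_ir["keypoints"][matches[:, 1]].cpu().numpy()

# RANSAC rejects mismatched pairs while estimating the homography that warps
# the visible image onto the infrared frame (640 x 512).
H, inlier_mask = cv2.findHomography(pts_vis, pts_ir, cv2.RANSAC, 5.0)
vis_bgr = cv2.imread("vis/000001.png")
registered = cv2.warpPerspective(vis_bgr, H, (640, 512))

os.makedirs("registered", exist_ok=True)
cv2.imwrite("registered/000001.png", registered)
```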

3.5. Annotation

Annotating instances in multi-modal images is challenging and cumbersome: not only do the images contain a large number of instances, but some instances are also submerged in the complex background, which makes it difficult to label them using only visible-light or only infrared images. We therefore annotated each instance by comparing it across the two modalities and labeling it in the modality in which it appeared more clearly; each instance was labeled only once. We then merged the annotation results from the visible-light and infrared images, so that all instances were correctly labeled in all images. All instances were labeled manually into three classes of “person”, “car”, and “drone” using bounding boxes in YOLO format.
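For reference, a YOLO-format label file contains one line per instance: a class index followed by the normalized center coordinates and box width and height. The lines below are a purely hypothetical example; the mapping of indices 0, 1, and 2 to “person”, “car”, and “drone” is our assumption and is not stated in the paper.

```
0 0.512 0.634 0.041 0.118
1 0.268 0.702 0.093 0.064
2 0.845 0.221 0.018 0.012
```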

3.6. Advantages

Our CSMR dataset has the following advantages:
  • Our dataset contains infrared and visible-light images of the same scenarios, captured at the same time. The image pairs are strictly registered, such that they can be directly used for tasks of computer vision.
  • Our dataset can be applied to security surveillance and monitoring of low-slow-small targets like drones, which are not supported by other datasets.
  • Our dataset includes complex scenarios and a variety of objects under different environmental conditions, such as people, cars, and drones. It allows for research on complex scenarios, which has not been focused on by other datasets. Moreover, we also identify some scientific problems in complex scenarios that are relevant to current visual tasks.

3.7. Disadvantages

Due to geographical and temporal limitations, our dataset only includes a limited number of complex scenarios. In reality, there are still many complex environments, such as deserts and oceans, as well as other weather conditions, such as foggy and snowy days, that are worth paying attention to. According to our applications, we have only focused on a limited number of targets, such as people, cars, and drones. Other targets are included in smaller numbers and are not labeled.

3.8. Applications

Our dataset can be applied to security surveillance and monitoring of low-slow-small targets in complex scenarios under various environmental conditions, and it supports research on different visual tasks like image fusion, object detection, and image-to-image translation.

4. Tasks

4.1. Image Fusion

Image fusion refers to the fusion of information from multiple images to generate a new image that contains all the key information and features of the original images. The basic principle of image fusion is to process and combine multiple images appropriately to achieve complementary and enhanced information. The main image fusion methods include pixel-level fusion, feature-level fusion, and model-level fusion [22]; the fusion of visible and infrared images in this paper is pixel-level fusion. Visible-light images contain a large number of textural details, but the targets depicted in them may cover each other or be submerged in a complex background. By contrast, infrared images do not contain many textural details, but the temperature-related information in them makes it convenient to distinguish heat-emitting targets. By fusing infrared and visible-light images, the complementary information from both can be integrated into one image to facilitate subsequent computer vision tasks.
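As a point of reference for what pixel-level fusion means, the sketch below fuses one registered pair with a fixed-weight average. This naive rule is ours for illustration only; it is not one of the methods evaluated in this paper, which instead learn how much each modality should contribute at every pixel. The file names are placeholders.

```python
import cv2
import numpy as np

# Hypothetical file names for one registered visible/infrared pair.
vis = cv2.imread("vis/000001.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
ir = cv2.imread("ir/000001.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Simplest pixel-level fusion: a fixed-weight average of the two modalities.
alpha = 0.5
fused = alpha * vis + (1.0 - alpha) * ir
cv2.imwrite("fused/000001.png", np.clip(fused, 0, 255).astype(np.uint8))
```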
Image fusion algorithms based on deep learning have lately come to dominate research in the area. CNNs [23], generative adversarial networks (GANs) [24], and diffusion models [25] are known to be effective for image fusion.
DDFM [5], proposed by Zixiang Zhao et al., is the first multi-modal image fusion algorithm based on a denoising diffusion probabilistic model (DDPM). The fusion task is formulated as a conditional generation problem within the DDPM sampling framework and divided into an unconditional generation sub-problem and a maximum-likelihood sub-problem. DDFM performs well in infrared–visible fusion and medical image fusion.
CDDFuse [6] was proposed by the same team as DDFM; it combines the advantages of CNNs and the Transformer [26] and introduces the idea of feature decoupling. The model can be divided into two stages: the first extracts cross-modality shallow features, and the second introduces a dual-branch Transformer-CNN feature extractor to extract low-frequency global features and high-frequency local features. A correlation-driven loss is further proposed to enable the network to decompose features more effectively.
LRRNet [7], proposed by Hui Li et al., is a lightweight end-to-end fusion network. It utilizes learnable representations and optimization algorithms to achieve the fusion of infrared and visible-light images. The key point of the model is the low-rank representation (LRR), a matrix factorization technique that reduces high-dimensional data to low-dimensional representations while preserving the main features of the data. LRRNet shows significant advantages in infrared and visible image fusion tasks and can generate clearer and more accurate fused images while reducing computational complexity and processing time.
FusionGAN [8], proposed by Jiayi Ma et al., is a method for fusing visible and infrared images using a generative adversarial network. The generator produces a fused image with high infrared intensity and additional visible gradients, while the discriminator forces the fused image to contain more texture details. This enables the fused image to simultaneously preserve the thermal radiation of the infrared image and the textural details of the visible-light image. In addition, the end-to-end nature of generative adversarial networks avoids the need to manually design complex activity-level measurements and fusion rules.
Many different metrics are available to assess the results of image fusion and can be divided into four categories: (1) metrics based on information theory, including the EN (entropy), MI (mutual information) [27,28], FMI (feature mutual information) [29], and PSNR (peak signal-to-noise ratio); (2) metrics based on structural similarity, including the SSIM (structural similarity) [30], MS_SSIM [31], and MSE (mean square error); (3) metrics based on image-related features, including the SF (spatial frequency), SD (standard deviation), and AG (average gradient); and (4) metrics based on source and generated images, including the CC (correlation coefficient), SCD (structure content distortion), Qabf (quality assessment based on blur and noise factors) [32], Nabf (no-reference assessment based on blur and noise factors), and a metric based on human visual perception called the VIFF (visual information fidelity for fusion) [33]. We evaluated several image fusion algorithms on our dataset based on eight metrics: EN, SD, SF, MI, SCD, VIFF, Qabf, and SSIM.
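To make a few of these metrics concrete, the sketch below computes EN, SD, SF, and AG for a fused image with NumPy, using their common textbook definitions; the exact implementations behind the reported tables are not specified in the paper, so this is an illustrative approximation.

```python
import numpy as np

def entropy(img):
    """EN: Shannon entropy of the grey-level histogram."""
    hist, _ = np.histogram(img, bins=256, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def spatial_frequency(img):
    """SF: combined row- and column-wise gradient energy."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean((img[:, 1:] - img[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((img[1:, :] - img[:-1, :]) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)

def average_gradient(img):
    """AG: mean magnitude of the local intensity gradient."""
    img = img.astype(np.float64)
    gx = img[:, 1:] - img[:, :-1]
    gy = img[1:, :] - img[:-1, :]
    gx, gy = gx[:-1, :], gy[:, :-1]        # crop both to a common shape
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

# Placeholder fused image; in practice this would be a 640 x 512 fusion result.
fused = np.random.randint(0, 256, (512, 640), dtype=np.uint8)
print(entropy(fused), fused.std(), spatial_frequency(fused), average_gradient(fused))
```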

4.2. Object Detection

The typical object detection datasets like COCO [34] and VOC [35] contain various objects. However, our dataset focuses on pedestrians, cars, and drones. Pedestrian and car detection is of great significance in automatic driving, intelligent traffic systems, and video surveillance. However, there are many interferences in the complex image backgrounds that negatively influence detection. For example, overlaps among targets in the image make it difficult to distinguish between them, while the headlights of cars can obscure the target in visible-light images.
With advances in unmanned aerial vehicle (UAV) technology, drones have come to be commonly used in many areas. Military applications are among the most important, which makes the capability to detect drones critical. Drones are much smaller than pedestrians and cars, usually appear alone in images, and such images often have a clear sky as the background. The images in our dataset contain far fewer drones than pedestrians and cars, so we divided the dataset into two parts: one containing images featuring pedestrians and cars, and the other containing images with drones. The drone images in our dataset were also captured against complex backgrounds: we captured drones with trees, mountains, bodies of water, and the sky in the background, where the trees, mountains, and water contain textural details similar to those of drones. Figure 7 shows examples of images in which drones are difficult to identify or can easily be confused with other targets, such as birds or planes.
While prevalent detection algorithms are sufficiently capable of performing the requisite tasks, datasets of complex-scenario images with which to train them are lacking; our CSMR dataset can support such research. Object detection algorithms can be divided into two categories, one-stage and two-stage algorithms, and the YOLO series [36,37,38,39,40] represents typical one-stage algorithms. We tested our dataset on YOLOv3 [36], YOLOv5 [40], and YOLOv8 [41]. The results showed that the complex scenarios considered in our dataset made it challenging to detect pedestrians, cars, and drones in the corresponding images. We assessed the performance of the object detection algorithms using the mean average precision (mAP) [42]. We first calculated the precision and recall of each class under different IoU thresholds and then plotted a precision–recall (P–R) curve for each target class. The area under the P–R curve is the average precision (AP) of that class, and the mAP is the mean of the per-class APs, which directly reflects the quality of the detection results.
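As a concrete illustration of this evaluation, the sketch below computes the AP of one class as the area under its P–R curve using the standard all-point interpolation. The P–R values are hypothetical, and the exact interpolation used by the evaluation code in the paper is not specified.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing before integrating.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

# Hypothetical P-R points for one class at a single IoU threshold.
recall = np.array([0.1, 0.4, 0.7, 0.9])
precision = np.array([0.95, 0.90, 0.80, 0.60])
ap = average_precision(recall, precision)
# mAP is then the mean of the per-class APs (person, car, drone).
print(f"AP = {ap:.3f}")
```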

4.3. Image-to-Image Translation

Image-to-image translation algorithms transform images from one modality into another. Collecting infrared images requires sophisticated equipment, which is costly. Directly generating infrared images from visible-light images can thus save a considerable amount of effort and money, and image-to-image translation can adequately handle this task. It has been used in many areas, including the transformation of semantic label maps and photographs.
Methods of image translation can be divided into two categories: methods based on manual design and deep learning methods. Because the temperature distribution in infrared images varies and the mapping between infrared and visible-light images is uncertain, manually designed methods are not suitable for this task. The conditional GAN offers promise for image-to-image translation, and pix2pixGAN [43] is considered the main solution in this context. Following the relevant literature [14], we use the PSNR (peak signal-to-noise ratio) and SSIM (structural similarity) to evaluate the experimental results.
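For completeness, the sketch below evaluates one generated infrared image against its real counterpart with the scikit-image implementations of these two metrics. The file paths are placeholders, and the paper does not state which implementation it used.

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder paths: a real infrared image and the infrared image generated
# from its registered visible-light counterpart.
real_ir = cv2.imread("ir/000001.png", cv2.IMREAD_GRAYSCALE)
fake_ir = cv2.imread("generated_ir/000001.png", cv2.IMREAD_GRAYSCALE)

psnr = peak_signal_noise_ratio(real_ir, fake_ir, data_range=255)
ssim = structural_similarity(real_ir, fake_ir, data_range=255)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```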

5. Experimental Results

In this section, targeting the applications of our dataset, we conducted the following experiments: (1) image fusion in complex scenarios, (2) pedestrian and car detection for security surveillance and drone detection for low-slow-small target monitoring, and (3) image-to-image translation in complex scenarios. The experiments were designed around the characteristics of our dataset; other datasets, such as KAIST and LLVIP, cannot support these applications because they do not contain comparable content.

5.1. Image Fusion

We selected DDFM [5], CDDFuse [6], LRRNet [7], and FusionGAN [8] to test our dataset. The models and parameters are the same as those in the original papers. We evaluated the results in two ways: direct visual inspection and calculation of the metrics mentioned above. The experiments were carried out on a server with four NVIDIA 3090 GPUs (24 GB of memory per GPU) (NVIDIA, Santa Clara, CA, USA).
Some examples of fused images are shown in Figure 8; they represent the complex scenarios mentioned above. CDDFuse [6] delivered the best results based on direct observation, while FusionGAN [8] recorded the worst. This reveals the potential of diffusion models for image fusion. The targets, especially human targets, were easily affected by the background in the visible-light images. The first row in Figure 8 shows that some human targets were submerged in light or in the shadows cast by trees in the visible-light images but were clearly identifiable in the infrared images. The disguised person hidden in the trees in the fourth row appears more salient in the corresponding infrared image. The fused images contain the complementary information from the multi-modal images, so the targets are clearer while the textural information of the visible-light images is retained.
However, the cars appearing in the images exhibited a phenomenon that we refer to as “erratic temperature”. Unlike humans, cars do not emit heat all the time, and their color and operating status influenced their temperature distribution in the infrared images. The middle of Figure 9 shows that a black car appeared whiter than a white one in the infrared image, meaning that it had a higher temperature because it absorbed more heat. The black car was more salient in the fused image than in the visible-light image, while the white car retained its textural and color-related information. Moreover, cars of the same color also exhibited different temperatures: the two cars on the right of Figure 9 were both black, but the one on the left had a higher temperature because it had been running for longer. After image fusion, the two cars had different colors even though they were actually the same color. The environment also influenced the temperature: the left part of the figure shows that the portion of the car in sunlight had a higher temperature than the portion in shadow. As a result, cars in the fused images sometimes looked even worse than in the original images. In our analysis, we found that for cars with high temperatures, incorporating infrared information significantly enhances their prominence, whereas for cars with low temperatures, infrared information not only fails to help but may even be detrimental. In general, introducing infrared information into visible-light images is helpful in some situations; however, in complex scenarios the temperature distribution is unstable, so effectively using information from different modalities remains a challenge.
Table 3 shows the mean values of the metrics of the fusion algorithms on our CSMR dataset and on LLVIP. In both datasets, CDDFuse always yields the best results and FusionGAN performs the worst, which demonstrates that CDDFuse is suitable for image fusion in complex scenarios while FusionGAN is not. Compared with LLVIP, our dataset yields better results for most of the metrics, which shows that our dataset is more suitable for image fusion than LLVIP.

5.2. Object Detection

5.2.1. Pedestrian and Car Detection

In this section, we ran the detection algorithms on the infrared and visible-light images separately. We chose YOLOv3 [36], YOLOv5 [40], and YOLOv8 [41] to test our dataset. All models were pre-trained on the COCO dataset and fine-tuned on our dataset. The infrared images are single-channel; before they are fed into the network, the single channel is expanded to three channels by copying the single-channel image three times. For training, we used 80% (18,212 images) of the images containing persons and cars, and the remaining 20% (4553 images) were used for validation and testing. The training parameters were kept consistent across the different models: each model was trained for 100 epochs with a batch size of 8 and images resized to 640 × 640, using SGD with a learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005. The models were trained and validated on four NVIDIA 3090 GPUs (24 GB of memory per GPU).
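The paper does not publish its training scripts; the following is a minimal sketch of an equivalent fine-tuning run using the Ultralytics API with the hyperparameters reported above. The dataset description file name and the choice of the yolov8n weights are our assumptions, and the single-channel infrared images would first be expanded to three channels (e.g., with np.repeat(ir[:, :, None], 3, axis=2)) before being written into the training set.

```python
from ultralytics import YOLO

# Start from COCO-pretrained weights and fine-tune with the reported settings.
model = YOLO("yolov8n.pt")
model.train(
    data="csmr_visible.yaml",   # hypothetical dataset description file
    epochs=100,
    batch=8,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
)
metrics = model.val()           # reports precision, recall, and mAP on the validation split
```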
The experimental results for pedestrian and car detection are shown in Table 4. For all the models, persons in infrared images and cars in visible-light images showed better results than in the other modality. This is because the persons in visible-light images are easily submerged in the complex background while being more salient in infrared images. Although the cars are also influenced by the background, they are larger than persons and thus easier to identify in the visible-light images. Moreover, cars that are not running do not emit heat, which makes it difficult to identify them in the infrared images.
Some examples from different scenarios and perspectives are shown in Figure 10. In the first two rows, because of the different perspectives and lighting conditions, there are missed and false detections in both the visible-light and the infrared images. In the third and fourth rows, the disguised car and person cannot be detected in the visible-light images but can be recognized in the infrared images. We suppose that the disguise conceals the texture information of the instances, so their features differ from those learned by the model; infrared images, however, contain little texture information, and the model learns the shape and temperature distribution of the targets from them, which are not concealed by the disguise. In the fifth and sixth rows, under different lighting conditions, the instances may be hidden in the dark or overwhelmed by strong light, so the detection results for these targets are also very poor, whereas the instances can be clearly identified in the infrared images. In the last row, the complex textural details also make the detection results inaccurate. These examples demonstrate that complex scenarios cause problems for detection in visible-light images. Introducing infrared information into visible-light images is therefore an effective way to address this problem, and our dataset can support the relevant research.
The generalization experiment results are shown in Table 5. LLVIP is the latest infrared–visible dataset, and its low-light conditions constitute a kind of complex scenario. The dataset has well-labeled pedestrian targets and is therefore suitable for a comparative pedestrian detection experiment; since LLVIP contains no annotations for cars, we only performed pedestrian detection. The results show that the models trained on our dataset perform better than the models trained on LLVIP, which demonstrates that our dataset generalizes better: our dataset can partially handle the tasks of LLVIP, but LLVIP cannot accomplish the tasks of our dataset. However, the models trained on our dataset do not perform as well on LLVIP as they do on our own dataset, for two main reasons. First, the persons in our dataset appear much smaller than those in LLVIP, so the features the models learn are different. Second, the images in LLVIP were all captured under low-light conditions, whereas our dataset contains abundant person targets under normal lighting; a model trained on our dataset therefore does not perform as well in such extreme environments, and this effect is mutual. Nevertheless, our dataset still demonstrates better generalization than LLVIP. The features of person targets in infrared images are similar between LLVIP and our dataset, but the features in visible-light images differ greatly; thus, the results on infrared images are better than those on visible-light images.
The comparative experiment results are shown in Table 6. The mAP values on our dataset are lower than those on LLVIP, which shows that the person targets in our dataset are more difficult to detect. Compared with the low-light conditions in LLVIP, the complex backgrounds in our dataset exert a more significant impact on the targets. This fully demonstrates the challenge of target detection against complex backgrounds; no existing datasets focus on this question, and our dataset can support the relevant research.
Figure 11 shows some scenarios in which models trained on our dataset do not perform well. We also tested these scenarios on the model trained on LLVIP. The models trained on our dataset do not perform very well in top-down-view scenarios and in low-light scenarios with severe occlusion. We attribute this to the lack of datasets containing images of such scenarios: existing pre-trained models have not learned from them, and although the detection models were fine-tuned on our dataset, they still cannot perfectly handle all complex scenarios. Nevertheless, the models trained on our dataset still outperform those trained on LLVIP in such scenarios, which demonstrates that the focus of our dataset on complex scenarios is meaningful.

5.2.2. Drone Detection

We also used YOLOv3 [36], YOLOv5 [40], and YOLOv8 [41] for drone detection. Since our dataset contains far fewer drones than persons and cars, we carried out the drone experiments separately to avoid the models becoming biased towards a particular target. For training, we used 80% (2226 images) of the images containing drones, and the remaining 20% (556 images) were used for validation and testing. The training parameters were the same as for pedestrian and car detection.
The experimental results are shown in Table 7. The results on infrared images are better than those on visible-light images. The drones are very small in the images and lack textural details and color information, which makes it difficult for the detection model to learn useful features from the visible-light images; in addition, drones can be submerged by the complex background, which causes missed and false detections. However, a drone continues to emit heat while operating, so it is more prominent in infrared images. Detection on infrared images does not rely on textural or color information but instead focuses on the temperature distribution and shape of the target, which can be learned easily from infrared images. Drones in infrared images are therefore easier to detect, and introducing information from infrared images is a sound way to improve detection based on visible-light images. However, the number of drone images is still not very large, and not all of the drones are very small and difficult to detect. We collected continuous images of drones from only four scenarios, which rendered the training set similar to the validation set and led to a high AP for drone detection. We aim to reveal problems through these evaluation metrics rather than to break records.
Some detection examples are shown in Figure 12. There are false detections in both the visible-light and the infrared images, but they are not the same: in general, there are fewer false detections in the infrared images, which is consistent with the quantitative results. The false detections in the visible-light images are caused by textural details that resemble drones, while drones in the infrared images are subject to interference from pixel blocks with similar shapes. Only heat-emitting objects affect detection in both modalities, whereas objects that do not emit heat affect only visible-light detection; for example, birds affect both modalities simultaneously, but clouds influence the visible-light images more, as shown in the first and last rows of Figure 12. Thus, the influence of complex scenarios on drone detection in infrared imagery is relatively minor, which makes the introduction of infrared information a sensible and justifiable approach.

5.3. Image-to-Image Translation

We used pix2pixGAN [43] to perform image-to-image translation on our dataset. The generator is unet256 [44] and the discriminator is PatchGAN [43]. The images in our dataset are all 640 × 512 but are cropped to 256 × 256 before being input into the network. We trained the model for 200 epochs; the learning rate was 0.0002 for the first 100 epochs and then decayed linearly to zero over the last 100 epochs. The batch size was 8, and the training was conducted on a server with four NVIDIA 3090 GPUs (24 GB of memory per GPU).
Figure 13 shows some typical examples from different scenarios and perspectives. The translation network outputs images of 256 × 256. The images in the first two rows contain a large number of instances and textural details, which makes it difficult to translate them accurately. The visible-light image in the third row is contaminated by rain, so the generated image is also unsatisfactory. In the fifth row, the strong headlights of the car submerge a large area of the image, which appears as a blur in the corresponding generated image. In conclusion, complex scenarios cause significant problems for visible-to-infrared image translation, and the algorithms available for this task have considerable room for improvement.
The experimental results are shown in Table 8. The PSNR and SSIM of our dataset are higher than those of LLVIP [14] but lower than those of KAIST [10], which means that our dataset is more challenging than KAIST but less challenging than LLVIP for image-to-image translation. Although our dataset contains more complex scenarios than LLVIP, the images in LLVIP were all captured under extremely low-light conditions, while our dataset includes a large number of images under good lighting that contain more textural information for the model to learn. Compared with KAIST, which has few scenarios, the complex scenarios in our dataset make the translation task more challenging.

6. Discussion

Our dataset still has limitations and room for improvement. First, the complex scenarios we selected are not comprehensive; geographic limitations are an important factor. There are many other geographic environments, such as deserts and the sea, that also contain complex textural details, and other weather conditions worthy of attention, such as fog and snow. Due to the constraints of the prevailing conditions, we were unfortunately unable to collect the relevant data. Second, the diversity of objects within our dataset needs to be enhanced; a greater variety of objects would give our dataset broader applications. We will continue to collect data from more diverse scenarios in the future to enrich our dataset, and we welcome other researchers to contribute to its expansion.

7. Conclusions

In this paper, we proposed a multi-modal registered dataset for complex scenarios called the CSMR. Our dataset is designed for security surveillance and the monitoring of low-slow-small targets, and it focuses on person, car, and drone targets in complex scenarios under various environmental conditions. It not only surpasses other datasets such as KAIST and LLVIP in terms of content but also supports applications that they do not. KAIST is used only for autonomous driving and has limitations in scenario diversity and target variety. Images containing low-slow-small targets such as drones are a highlight of our dataset that is not found in other datasets, and the complex scenarios in our dataset are likewise unique. We also identified some scientific problems in complex scenarios that can guide subsequent research. Our dataset is more comprehensive than other datasets and supports more research. In summary, we not only make incremental improvements over other datasets but also offer our own unique features.

Author Contributions

Methodology, C.L.; Software, C.L.; Validation, C.L. and Z.Y.; Resources, C.L.; Data curation, C.L., Z.H., Z.Y. and M.C.; Writing—original draft, C.L.; Writing—review & editing, K.G., H.C. and Z.Z.; Visualization, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant U2241275.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Du, P.; Qu, X.; Wei, T.; Peng, C.; Zhong, X.; Chen, C. Research on small size object detection in complex background. In Proceedings of the IEEE 2018 Chinese Automation Congress (CAC), Xi’an, China, 30 November–2 December 2018; pp. 4216–4220. [Google Scholar]
  2. Ma, J.; Ma, Y.; Li, C. Infrared and visible image fusion methods and applications: A survey. Inf. Fusion 2019, 45, 153–178. [Google Scholar] [CrossRef]
  3. Ma, W.; Wang, K.; Li, J.; Yang, S.X.; Li, J.; Song, L.; Li, Q. Infrared and visible image fusion technology and application: A review. Sensors 2023, 23, 599. [Google Scholar] [CrossRef] [PubMed]
  4. Bhataria, K.C.; Shah, B.K. A review of image fusion techniques. In Proceedings of the IEEE 2018 Second International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 15–16 February 2018; pp. 114–123. [Google Scholar]
  5. Zhao, Z.; Bai, H.; Zhu, Y.; Zhang, J.; Xu, S.; Zhang, Y.; Zhang, K.; Meng, D.; Timofte, R.; Van Gool, L. DDFM: Denoising diffusion model for multi-modality image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 8082–8093. [Google Scholar]
  6. Zhao, Z.; Bai, H.; Zhang, J.; Zhang, Y.; Xu, S.; Lin, Z.; Timofte, R.; Van Gool, L. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5906–5916. [Google Scholar]
  7. Li, H.; Xu, T.; Wu, X.J.; Lu, J.; Kittler, J. Lrrnet: A novel representation learning guided fusion network for infrared and visible images. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 11040–11052. [Google Scholar] [CrossRef] [PubMed]
  8. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  9. Toet, A. The TNO multiband image data collection. Data Brief 2017, 15, 249. [Google Scholar] [CrossRef]
  10. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
  11. Davis, J.W.; Sharma, V. Background-subtraction using contour-based fusion of thermal and visible imagery. Comput. Vis. Image Underst. 2007, 106, 162–182. [Google Scholar] [CrossRef]
  12. González, A.; Fang, Z.; Socarras, Y.; Serrat, J.; Vázquez, D.; Xu, J.; López, A.M. Pedestrian detection at day/night time with visible and FIR cameras: A comparison. Sensors 2016, 16, 820. [Google Scholar] [CrossRef]
  13. FLIR, T. Teledyne FLIR ADAS Dataset. 2024. Available online: https://www.flir.com/oem/adas/adas-dataset-form/ (accessed on 10 January 2025).
  14. Jia, X.; Zhu, C.; Li, M.; Tang, W.; Zhou, W. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3496–3504. [Google Scholar]
  15. Liu, M.; Zhang, L.; Tian, Y.; Qu, X.; Liu, L.; Liu, T. Draw like an artist: Complex scene generation with diffusion model via composition, painting, and retouching. arXiv 2024, arXiv:2408.13858. [Google Scholar]
  16. Han, S.; Mo, B.; Xu, J.; Sun, S.; Zhao, J. TransImg: A Translation Algorithm of Visible-to-Infrared Image Based on Generative Adversarial Network. Int. J. Comput. Intell. Syst. 2024, 17, 264. [Google Scholar] [CrossRef]
  17. Han, Z.; Zhang, Z.; Zhang, S.; Zhang, G.; Mei, S. Aerial visible-to-infrared image translation: Dataset, evaluation, and baseline. J. Remote Sens. 2023, 3, 96. [Google Scholar] [CrossRef]
  18. Ma, D.; Xian, Y.; Li, B.; Li, S.; Zhang, D. Visible-to-infrared image translation based on an improved CGAN. Vis. Comput. 2024, 40, 1289–1298. [Google Scholar] [CrossRef]
  19. Wang, Y.; Liang, X.; Chen, L. Research on infrared and visible image registration algorithm for complex road scenes. IEEE Access 2023, 11, 78511–78521. [Google Scholar] [CrossRef]
  20. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  21. Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 17627–17638. [Google Scholar]
  22. Li, S.; Kang, X.; Fang, L.; Hu, J.; Yin, H. Pixel-level image fusion: A survey of the state of the art. Inf. Fusion 2017, 33, 100–112. [Google Scholar] [CrossRef]
  23. LeCun, Y.; Boser, B.; Denker, J.S.; Henderson, D.; Howard, R.E.; Hubbard, W.; Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1989, 1, 541–551. [Google Scholar] [CrossRef]
  24. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  25. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  27. Qu, G.; Zhang, D.; Yan, P. Information measure for performance of image fusion. Electron. Lett. 2002, 38, 313–315. [Google Scholar] [CrossRef]
  28. Ramesh, C.; Ranjith, T. Fusion performance measures and a lifting wavelet transform based algorithm for image fusion. In Proceedings of the IEEE Fifth International Conference on Information Fusion (FUSION 2002), Annapolis, MD, USA, 8–11 July 2002; (IEEE Cat. No. 02EX5997). Volume 1, pp. 317–320. [Google Scholar]
  29. Haghighat, M.B.A.; Aghagolzadeh, A.; Seyedarabi, H. A non-reference image fusion metric based on mutual information of image features. Comput. Electr. Eng. 2011, 37, 744–756. [Google Scholar] [CrossRef]
  30. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  31. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the IEEE Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar]
  32. Piella, G.; Heijmans, H. A new quality metric for image fusion. In Proceedings of the IEEE 2003 International Conference on Image Processing (Cat. No. 03CH37429), Barcelona, Spain, 14–17 September 2003; Volume 3, p. III-173. [Google Scholar]
  33. Han, Y.; Cai, Y.; Cao, Y.; Xu, X. A new image fusion performance metric based on visual information fidelity. Inf. Fusion 2013, 14, 127–135. [Google Scholar] [CrossRef]
  34. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  35. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  36. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6. [Google Scholar]
  37. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  38. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  39. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  40. Jocher, G.; Stoken, A.; Borovec, J.; Liu, C.; Hogan, A.; Diaconu, L.; Ingham, F.; Poznanski, J.; Fang, J.; Yu, L.; et al. ultralytics/yolov5: V3. 1-bug fixes and performance improvements; Zenodo: Geneva, Switzerland, 2020. [Google Scholar]
  41. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 January 2025).
  42. Padilla, R.; Netto, S.L.; Da Silva, E.A. A survey on performance metrics for object-detection algorithms. In Proceedings of the IEEE 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 1–3 July 2020; pp. 237–242. [Google Scholar]
  43. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  44. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
Figure 1. The structure of our CSMR dataset. The original images first need to be registered and labeled. We test our dataset on three visual tasks: image fusion, object detection, and image-to-image translation.
Figure 2. Examples of the related datasets. The first row corresponds to the visible-light images and the second row corresponds to the infrared images. From left to right: (a) TNO, (b) OSU, (c) CVC-14, (d) KAIST, (e) FLIR, (f) LLVIP.
Figure 3. Our collection equipment contains a binocular camera platform and a portable computer.
Figure 4. Cameras for image collection.
Figure 5. Examples of the scenarios. The first row corresponds to the visible-light images and the second row corresponds to the infrared images. From left to right: (a) street at night, (b) intersection from top-down perspective, (c) disguised person in field, (d) waterside, (e) drone in sky.
Figure 6. Registration result of our dataset.
Figure 7. Examples of drones in our CSMR dataset. The images in the first row are visible-light images. The images in the second row are infrared images.
Figure 8. Examples of fusion algorithms on our CSMR dataset.
Figure 9. Examples of “erratic temperature”. From left to right: a single car; two different-colored cars; two same-colored cars.
Figure 10. Examples of pedestrian and car detection on our CSMR dataset.
Figure 11. Examples of failure cases.
Figure 12. Examples of drone detection on our CSMR dataset.
Figure 13. Examples of image-to-image translation results on our CSMR dataset.
Table 1. Comparison of our dataset with existing multi-modal datasets, including the TNO Image Fusion Dataset, the OSU Color-Thermal Database, the KAIST Multispectral Dataset, CVC-14, the FLIR Thermal Dataset, and LLVIP.
Dataset | Image Pairs | Resolution | Registered | Labeled | Observation Perspective | Scenarios | Instances | Application
TNO | 261 | 768 × 576 | ✓ | × | head-up | multiple | person and car | military use
OSU | 285 | 320 × 240 | ✓ | × | top-down | single | person | pedestrian detection
CVC-14 | 8490 | 640 × 512 | × | ✓ | head-up | single | person | autonomous driving
KAIST | 4750 | 640 × 480 | ✓ | ✓ | head-up | single | person | autonomous driving
FLIR | 5258 | 640 × 512 | × | ✓ | head-up | single | person, car, dog, bike | autonomous driving
LLVIP | 15,488 | 1080 × 720 | ✓ | ✓ | top-down | single | person | pedestrian detection
ours | 30,819 | 640 × 512 * | ✓ | ✓ | head-up, top-down, bottom-up | multiple | person, car, drone | security surveillance and monitoring low-slow-small targets
* The original resolution of the visible-light images is 2560 × 1440, while 640 × 512 is the resolution of the visible-light and infrared images after registration.
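The footnote above describes the geometric side of registration: each visible-light frame is mapped into the 640 × 512 infrared frame. The snippet below is a minimal, non-authoritative sketch (not the authors' pipeline) of such a warp with OpenCV, assuming a homography H has already been estimated, for example from cross-modal feature matches such as those produced by LightGlue [21]; the file names and the numeric values of H are placeholders.

```python
import cv2
import numpy as np

# Placeholder homography mapping visible-image coordinates (2560 x 1440)
# into the infrared frame (640 x 512); in practice it would be estimated
# from matched keypoints between the two modalities.
H = np.array([[0.25, 0.0, 0.0],
              [0.0, 0.25, 0.0],
              [0.0, 0.0, 1.0]])

visible = cv2.imread("visible_0001.png")     # hypothetical 2560 x 1440 visible-light frame
infrared = cv2.imread("infrared_0001.png")   # hypothetical 640 x 512 infrared frame

# Warp the visible image into the infrared coordinate system so the pair is pixel-aligned.
registered_visible = cv2.warpPerspective(visible, H, (infrared.shape[1], infrared.shape[0]))

cv2.imwrite("visible_0001_registered.png", registered_visible)
```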
Table 2. The parameters of the visible and infrared cameras.
Parameter | Visible Camera | Infrared Camera
resolution | 2560 × 1440 | 640 × 512
frame rate | 50 Hz: 25 fps | 50 Hz: 50 fps
SNR / NETD | >52 dB | <50 mK
focal length | 4.8–144 mm | 6.3 mm
network protocol | TCP/IP | TCP/IP
power supply | DC 12 V | DC 10–30 V
power consumption | 2.5 W (static); 4.5 W (dynamic) | 2 W
response band | visible spectrum | 8–14 μm
size | 50 × 60 × 102 mm | 45 × 45 × 99.5 mm
weight | 285 g | 265 g
Table 3. Experimental results of image fusion. For all metrics, higher values indicate better performance.
Method | Dataset | EN | SD | SF | MI | SCD | VIFF | Qabf | SSIM
DDFM | LLVIP | 7.06 | 39.72 | 10.91 | 1.93 | 1.40 | 0.69 | 0.46 | 0.67
DDFM | ours | 7.07 | 41.94 | 15.10 | 2.01 | 1.47 | 0.50 | 0.40 | 0.68
CDDFuse | LLVIP | 7.35 | 50.26 | 16.42 | 3.01 | 1.58 | 0.86 | 0.63 | 0.66
CDDFuse | ours | 7.40 | 59.74 | 23.38 | 2.41 | 1.70 | 0.59 | 0.53 | 0.66
LRRNet | LLVIP | 6.41 | 29.69 | 10.69 | 1.69 | 0.85 | 0.55 | 0.42 | 0.63
LRRNet | ours | 6.99 | 53.79 | 15.96 | 2.27 | 1.20 | 0.52 | 0.45 | 0.64
FusionGAN | LLVIP | 6.46 | 26.77 | 8.05 | 1.97 | 0.65 | 0.44 | 0.23 | 0.58
FusionGAN | ours | 6.79 | 33.59 | 10.19 | 1.76 | 0.74 | 0.31 | 0.20 | 0.59
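Several of the metrics in Table 3 are simple image statistics with standard definitions. The snippet below is a minimal sketch, not a reference implementation, that computes entropy (EN), standard deviation (SD), and spatial frequency (SF) for a fused image assumed to be an 8-bit grayscale NumPy array; the remaining metrics (MI, SCD, VIFF, Qabf, SSIM) also involve the two source images and are typically computed with published reference implementations.

```python
import numpy as np

def entropy(img: np.ndarray) -> float:
    """Shannon entropy (EN) of an 8-bit grayscale image, in bits."""
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def standard_deviation(img: np.ndarray) -> float:
    """Standard deviation (SD) of pixel intensities."""
    return float(np.std(img.astype(np.float64)))

def spatial_frequency(img: np.ndarray) -> float:
    """Spatial frequency (SF): RMS of horizontal and vertical first differences."""
    x = img.astype(np.float64)
    rf = np.sqrt(np.mean((x[:, 1:] - x[:, :-1]) ** 2))  # row frequency
    cf = np.sqrt(np.mean((x[1:, :] - x[:-1, :]) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))
```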
Table 4. Experimental results for pedestrian and car detection. mAP50 and mAP75 denote the AP at IoU thresholds of 0.5 and 0.75, respectively, and mAP denotes the AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Modality | Class | Yolov3 mAP50 | Yolov3 mAP75 | Yolov3 mAP | Yolov5 mAP50 | Yolov5 mAP75 | Yolov5 mAP | Yolov8 mAP50 | Yolov8 mAP75 | Yolov8 mAP
visible | car | 0.952 | 0.878 | 0.796 | 0.950 | 0.857 | 0.768 | 0.954 | 0.881 | 0.798
visible | person | 0.865 | 0.632 | 0.565 | 0.836 | 0.538 | 0.509 | 0.842 | 0.609 | 0.548
visible | all | 0.908 | 0.755 | 0.680 | 0.893 | 0.698 | 0.638 | 0.898 | 0.745 | 0.673
infrared | car | 0.925 | 0.759 | 0.686 | 0.916 | 0.719 | 0.651 | 0.921 | 0.755 | 0.678
infrared | person | 0.886 | 0.663 | 0.585 | 0.878 | 0.617 | 0.552 | 0.885 | 0.664 | 0.582
infrared | all | 0.906 | 0.711 | 0.635 | 0.897 | 0.668 | 0.601 | 0.903 | 0.710 | 0.630
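The caption of Table 4 defines mAP as the AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05, i.e., the COCO convention [34]. The sketch below illustrates only that convention, assuming the per-threshold AP values have already been produced by an evaluation tool; the helper functions, their names, and the example values are ours and purely illustrative, not part of any particular detector's API.

```python
import numpy as np

def box_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def coco_map(ap_per_threshold: dict) -> float:
    """Average the per-threshold AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.arange(0.5, 1.0, 0.05)
    return float(np.mean([ap_per_threshold[round(t, 2)] for t in thresholds]))

# Hypothetical per-threshold AP values for one class, for illustration only:
ap = {0.5: 0.90, 0.55: 0.88, 0.6: 0.85, 0.65: 0.82, 0.7: 0.78,
      0.75: 0.72, 0.8: 0.64, 0.85: 0.52, 0.9: 0.35, 0.95: 0.12}
print(coco_map(ap))  # mAP50 and mAP75 are simply the entries at 0.5 and 0.75
```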
Table 5. Generalization experiment results of our dataset and LLVIP for pedestrian detection.
Modality | Train + Test | Yolov3 mAP50 | Yolov3 mAP75 | Yolov3 mAP | Yolov5 mAP50 | Yolov5 mAP75 | Yolov5 mAP | Yolov8 mAP50 | Yolov8 mAP75 | Yolov8 mAP
visible | CSMR + LLVIP | 0.540 | 0.247 | 0.272 | 0.457 | 0.182 | 0.217 | 0.495 | 0.237 | 0.253
visible | LLVIP + CSMR | 0.272 | 0.115 | 0.134 | 0.261 | 0.109 | 0.127 | 0.167 | 0.086 | 0.089
infrared | CSMR + LLVIP | 0.785 | 0.454 | 0.444 | 0.773 | 0.403 | 0.417 | 0.718 | 0.407 | 0.401
infrared | LLVIP + CSMR | 0.558 | 0.175 | 0.244 | 0.418 | 0.123 | 0.178 | 0.404 | 0.143 | 0.186
Table 6. Comparative experiment results between our dataset and LLVIP for pedestrian detection.
Modality | Dataset | Yolov3 mAP50 | Yolov3 mAP75 | Yolov3 mAP | Yolov5 mAP50 | Yolov5 mAP75 | Yolov5 mAP | Yolov8 mAP50 | Yolov8 mAP75 | Yolov8 mAP
visible | LLVIP | 0.871 | 0.455 | 0.466 | 0.908 | 0.564 | 0.527 | 0.889 | 0.542 | 0.511
visible | ours | 0.865 | 0.532 | 0.565 | 0.836 | 0.538 | 0.509 | 0.842 | 0.609 | 0.548
infrared | LLVIP | 0.940 | 0.661 | 0.582 | 0.965 | 0.764 | 0.670 | 0.964 | 0.727 | 0.631
infrared | ours | 0.886 | 0.663 | 0.585 | 0.878 | 0.617 | 0.552 | 0.885 | 0.664 | 0.582
Table 7. Experimental results of drone detection. AP50 and AP75 denote the AP at IoU thresholds of 0.5 and 0.75, respectively, and AP denotes the AP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Modality | Yolov3 AP50 | Yolov3 AP75 | Yolov3 AP | Yolov5 AP50 | Yolov5 AP75 | Yolov5 AP | Yolov8 AP50 | Yolov8 AP75 | Yolov8 AP
visible | 0.985 | 0.811 | 0.678 | 0.979 | 0.783 | 0.663 | 0.981 | 0.800 | 0.686
infrared | 0.994 | 0.887 | 0.701 | 0.993 | 0.848 | 0.683 | 0.994 | 0.876 | 0.708
Table 8. Experimental results of image-to-image translation algorithms on KAIST, LLVIP, and our dataset.
Dataset | PSNR | SSIM
KAIST | 28.9935 | 0.6918
LLVIP | 10.7688 | 0.1757
ours | 21.6358 | 0.5909
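The PSNR and SSIM values in Table 8 compare each translated image with the corresponding real image in the target modality. The snippet below is a minimal sketch of how such scores can be computed with scikit-image, assuming 8-bit grayscale images; the file names are placeholders.

```python
import cv2
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder file names: a translated image and its ground-truth counterpart
# in the target modality.
generated = cv2.imread("translated_0001.png", cv2.IMREAD_GRAYSCALE)
target = cv2.imread("target_0001.png", cv2.IMREAD_GRAYSCALE)

psnr = peak_signal_noise_ratio(target, generated, data_range=255)
ssim = structural_similarity(target, generated, data_range=255)
print(f"PSNR: {psnr:.4f} dB, SSIM: {ssim:.4f}")
```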
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
