1. Introduction
Infrared detection exploits the difference between target and background radiation. Compared with visible-light detection systems, infrared detection offers a long detection range, day-and-night operation, and strong anti-interference capability [1]. Distant infrared targets, especially against a sky cloud background, usually appear dim and small [2,3]. The detection of infrared dim small targets plays an important role in infrared image processing and has high application value in remote sensing.
Current infrared dim small target detection methods fall into two groups: traditional model-driven methods and data-driven deep learning methods. Traditional methods can be divided into three categories: (1) Filter-based methods [4,5,6] and background suppression models [7,8]: these methods have low computational complexity and detect infrared dim small targets by computing differences in grayscale values and suppressing the background; (2) Methods based on the local contrast mechanism of the human visual system [9,10,11,12,13]: these methods are easy to implement and construct a saliency map of the target from the local difference between the target and the background; (3) Methods based on the low-rank model [14,15,16,17]: these methods recast target detection as a sparse and low-rank tensor decomposition problem. However, traditional methods rely mainly on handcrafted features and need prior knowledge of the background scene, so they usually have a high false-alarm rate in detection tasks with complex backgrounds and extremely dim targets. Deep learning methods, in turn, have developed rapidly. Researchers design task-specific modules and customize network architectures to make networks better suited to infrared dim small target detection. For example, [18,19,20,21,22] improved the feature expression of small targets by combining traditional methods with deep learning and special network designs; [23,24,25] addressed the tendency of small targets to vanish in deep networks by adopting dense nesting and encoder–decoder designs; [26,27] used adversarial learning and a multi-scale attention mechanism to balance false alarms and missed detections; and [25,28] improved the adaptability of the detection network to complex backgrounds through a smoothing operator and an attention module.
However, the above research has mainly focused on detection performance, which usually leads to complex structures and high computational complexity. Infrared dim small target detection networks usually need to be deployed on edge devices. It is therefore necessary to consider how to reduce the complexity of the network and transplant the network to the selected edge device for hardware acceleration.
Graphics processing units (GPUs) exhibit formidable parallel computing capabilities, rendering them applicable to diverse AI algorithms. However, their relatively high power consumption makes them unsuitable for network deployment in mobile devices with low power requirements. Application-specific integrated circuits (ASICs) are custom-designed for particular AI algorithms, delivering exceptional computational performance while maintaining low power consumption. Nevertheless, they suffer from fixed functionality, limited flexibility, and protracted development cycles [29,30]. In contrast, heterogeneous FPGA platforms demonstrate balanced performance characteristics, featuring high parallelism, reconfigurability, and energy efficiency, making them a good solution for deploying infrared dim small target detection networks.
Therefore, we propose an infrared dim small target detection network under a sky cloud background and FPGA-based optimization deployment methods. The main contributions of this article are as follows:
We constructed a lightweight sequential-differential-frame-based network (LSDF-Net) for infrared dim small target detection under a sky cloud background.
We proposed optimization methods for the network, including sequential-differential acceleration, convolutional pooling, and image channel optimization.
We explored the impact of the infrared image calibration set on the quantization effect and evaluated the deployment performance of different quantization methods on JFMQL100TAI.
The structure of this article is as follows.
Section 1 introduces the research background and the importance of the optimization and deployment methods of infrared dim small target networks based on a heterogeneous FPGA.
Section 2 introduces the architecture of the proposed LSDF-Net.
Section 3 describes the FPGA-based deployment optimizations in detail, including the impact of the quantization calibration set on the quantization effect and the quantization methods considered for deployment.
Section 4 compares and analyzes the experimental results. Section 5 summarizes the work of this article.
2. LSDF-Net and Structural Optimization of the Network
This section proposes a lightweight sequential-differential-frame-based network (LSDF-Net) for infrared dim small target detection under a sky cloud background, together with methods for optimizing the network structure. The architecture of the network is shown in Figure 1 and consists of the input, differential, intermediate, and output layers. The input layer takes sequential images, which are then differenced in the differential layer. The intermediate layer is the main part of the network and uses a down-sampling pyramid structure to extract the features of the input image. A false-alarm-object learning strategy and a multi-anchor box assignment strategy are adopted to improve the detection performance. The output layer performs target classification and location regression, which completes the target detection task.
2.1. Sequential-Differential-Frame Input and Optimization
Existing infrared dim small target detection algorithms can be divided into single-frame detection methods and sequential-frame detection methods. Much research progress has been made in single-frame detection. However, single-frame methods lack temporal information and rely mainly on the gray-level and spatial information of infrared images, which usually results in a higher false-alarm rate in low-SNR scenes. The input of the proposed LSDF-Net is four time-sequential images. The four input images are differenced frame by frame: each of the three previous frames is subtracted from the current frame. The differential layer is shown in Figure 2, where X is the original sequence image with a size of M × N × 4, Y is the sequential-differential-frame image with a size of M × N × 3, and M and N are the width and height of the image.
The sequential images contain gray-level, spatial, and temporal information, which significantly increases the amount of information in the input. After the sequential images are input, they are differenced to improve the SNR. Through network learning, the motion information of small targets can be used to improve the target detection rate and suppress the false-alarm rate. As shown in Figure 3, after frame differencing, most of the stationary background clutter is removed, while moving targets are retained. Additionally, moving clouds and targets can be distinguished by their different shapes and sizes.
The differential image is the pixel-level difference between sequential frames. It highlights the regions that changed, and therefore the moving target, and suppresses slow-moving background interference, thus improving the SNR and facilitating subsequent target detection and tracking.
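The effect described above can be illustrated with a small synthetic example. The sketch below is only illustrative and is not the authors' implementation: it assumes that channel 0 holds the current frame, that the background is perfectly static, and that a hypothetical point target moves one pixel per frame.

```python
import numpy as np

def sequential_difference(frames):
    """frames: (M, N, 4) stack, channel 0 assumed to be the current frame.
    Returns the (M, N, 3) differences current - previous_k."""
    frames = frames.astype(np.float32)
    return frames[..., :1] - frames[..., 1:]

rng = np.random.default_rng(0)
background = rng.uniform(1000.0, 3000.0, size=(64, 64))   # static clutter
frames = np.repeat(background[..., None], 4, axis=-1)

# Hypothetical target of amplitude 500 that moves one pixel per frame;
# in the current frame (channel 0) it sits at row 32, column 40.
for k in range(4):
    frames[32, 40 - k, k] += 500.0

diff = sequential_difference(frames)
# Static background cancels, the target in the current frame survives.
print(diff[0, 0], diff[32, 40])
```

In this idealized case the background cancels exactly; with real data, sensor noise and slowly drifting clouds leave a residual that the network still has to learn to reject.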
Differential images are generated by the differential layer in the LSDF-Net. Introducing a differential layer into the network structure increases both the number of layers and the inference time. To reduce the number of layers, a sequential-differential acceleration method is proposed, which performs the frame differencing without an additional differential layer. First, the network is trained with the differential layer. Once training is complete, the parameters of the convolution layer are fixed; because the differential layer is a fixed linear operation, it can then be combined with the adjacent convolution layer.
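The differential layer maps the four-channel input to three channels as follows:

$$Y_i = X_1 - X_{i+1}, \quad i = 1, 2, 3,$$

where X1, X2, X3, and X4 are four time-sequential image inputs, and the output three-channel images are the inputs of the next convolution layer, with the convolution kernel Conv (ω, b), where ω represents the weight of the convolution kernel, with a size of 3 × 3 × 3 × 16, and b represents the bias, with a size of 1 × 1 × 16. The convolution of the differential image with Conv (ω, b) is as follows:

$$Z = \sum_{i=1}^{3} \omega_i * Y_i + b = \sum_{i=1}^{3} \omega_i * \left(X_1 - X_{i+1}\right) + b,$$

where Z is the output of the convolution layer, $\omega_i$ is the 3 × 3 × 16 slice of ω acting on input channel i, and * denotes two-dimensional convolution. Rewriting the above equation yields

$$Z = \sum_{j=1}^{4} \omega'_j * X_j + b,$$

where

$$\omega'_1 = \omega_1 + \omega_2 + \omega_3, \qquad \omega'_j = -\omega_{j-1}, \quad j = 2, 3, 4.$$

It follows from the above derivation that, by linearly combining the weights of the original convolutional layer, the differential layer and the adjacent convolution layer are merged into a new differential-convolutional layer Conv (ω′, b), where ω′ has a size of 3 × 3 × 4 × 16 and the bias b is unchanged. The schematic diagram of the sequential-differential acceleration is shown in Figure 4.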
The sequential-differential acceleration method thus merges the differential layer and the adjacent convolution layer into a single convolution layer. Without merging, the differential operation and the adjacent convolution are computed as separate steps on the FPGA, which takes longer. After merging, the input dimensionality of the convolutional layer increases, but the differential layer is removed, and the combined differential-convolution operation is accelerated by the parallel computing resources of the FPGA platform.
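The equivalence behind this merge can be checked numerically. The following sketch is not the deployed code: it assumes that the differential layer computes X1 − X2, X1 − X3, and X1 − X4, uses a naive "valid" convolution written for clarity rather than speed, and the function names `conv2d_valid` and `merge_difference_into_conv` are hypothetical.

```python
import numpy as np

def conv2d_valid(x, w, b):
    """Naive 'valid' cross-correlation.
    x: (H, W, Cin), w: (kh, kw, Cin, Cout), b: (Cout,)."""
    kh, kw, _, cout = w.shape
    h_out, w_out = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((h_out, w_out, cout))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i:i + kh, j:j + kw, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return out

def merge_difference_into_conv(w):
    """Fold the assumed differencing (X1 - X2, X1 - X3, X1 - X4) into the weights.
    w: (kh, kw, 3, Cout) -> (kh, kw, 4, Cout); the bias is left unchanged."""
    merged = np.empty(w.shape[:2] + (4, w.shape[3]))
    merged[:, :, 0, :] = w.sum(axis=2)   # weight applied to X1
    merged[:, :, 1:, :] = -w             # weights applied to X2, X3, X4
    return merged

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 10, 4))         # four sequential frames
w = rng.normal(size=(3, 3, 3, 16))       # trained 3-channel kernel
b = rng.normal(size=16)

two_step = conv2d_valid(x[..., :1] - x[..., 1:], w, b)    # difference, then conv
one_step = conv2d_valid(x, merge_difference_into_conv(w), b)
print(np.allclose(two_step, one_step))
```

Because both operations are linear, the merge is exact in floating point up to rounding; any accuracy difference after deployment would come from quantization, not from the merge itself.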
2.2. Convolutional Pooling
In convolutional neural networks, there are three commonly used pooling operations: max pooling, mean pooling, and convolutional pooling. Max pooling takes the maximum pixel value in the pooling area [31]. The feature map obtained in this way is more sensitive to coarse texture features, but it discards non-extreme information, which may cause a loss of detail in the features of dim small targets. Mean pooling takes the average value of the pixels in the pooling area [32]. The feature map obtained in this way is more sensitive to background information, but it smooths the prominent features of dim small targets. Convolutional pooling is achieved by increasing the stride of a convolution and gives a performance comparable to traditional pooling methods. The parameters of the convolutional pooling kernel are learned automatically during network training. Convolutional pooling can pass the overall information of the target to the next layer and achieve different pooling effects for different channels, which provides high flexibility in extracting the features of dim small targets. Because it combines the convolution layer and the pooling layer, it also reduces the number of network layers. The schematic diagram of the three pooling methods is shown in Figure 5. The experimental comparison for convolutional pooling is given in Section 4.2.
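As a hedged illustration of why a strided convolution can stand in for a pooling layer, the sketch below shows that a stride-2 convolution whose 2 × 2 kernel is fixed to 0.25 reproduces mean pooling exactly; a learned kernel generalizes this. The 2 × 2 kernel size and stride of 2 are assumptions made for the example, not values taken from the network.

```python
import numpy as np

def strided_conv2d(x, kernel, stride):
    """Single-channel 'valid' convolution with a configurable stride."""
    kh, kw = kernel.shape
    h_out = (x.shape[0] - kh) // stride + 1
    w_out = (x.shape[1] - kw) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def mean_pool2x2(x):
    """Reference 2x2 mean pooling for inputs with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

x = np.arange(36, dtype=np.float64).reshape(6, 6)
mean_kernel = np.full((2, 2), 0.25)      # fixed weights: behaves like mean pooling
pooled = strided_conv2d(x, mean_kernel, stride=2)
print(np.allclose(pooled, mean_pool2x2(x)))
```

When the kernel weights are trained instead of fixed, each output channel can settle on its own weighting of the pooling window, which is the flexibility the text refers to.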
2.3. False-Alarm-Object Learning and Multi-Anchor Assignment Strategy
Infrared dim small target detection in complex sky cloud background scenarios is a challenging problem [33]. To reduce the false alarms of the network, we analyzed their sources and found that the interference mainly comes from cloud edges, blind and flickering pixels, and system random noise, as shown in Table 1.
Based on this, a false-alarm-object learning strategy is adopted, in which the detection network learns both infrared dim small targets and the interfering objects, namely cloud edges, blind and flickering pixels, and system random noise. Images of these objects are shown in Figure 6. This strategy converts the object detection task into an object classification task. Compared with detecting only the dim small target class, learning the false-alarm objects explicitly helps to reduce false alarms.
In the task of target detection, one-stage target detection algorithms usually generate a series of anchor boxes on the image, regarding these anchor boxes as potential candidate regions [34]. The model predicts whether these candidate regions contain targets and predicts the category of targets. In addition, since the anchor box position is fixed, it usually cannot coincide with the target bounding box, so the anchor box needs to be adjusted to form a real bounding box that can accurately describe the object position. In infrared dim small target detection, the target usually occupies 1–15 pixels and is usually contained in only one anchor box. In this article, the input image is divided into multiple grids, which are used as anchor boxes to detect the target bounding box.
Each real bounding box contains seven values, and its output is

$$\mathrm{output} = \left(x,\ y,\ w,\ h,\ c_1,\ c_2,\ \mathrm{conf}\right),$$

where output represents the output value of the bounding box; x and y are the target center coordinates; w and h represent the width and height of the target; c1 represents the target category label; c2 represents the label for cloud edges, blind and flickering pixels, and system random noise; and conf represents the target confidence level.
However, with single-anchor box assignment, when the target center lies close to the boundary between anchor boxes, the target label is assigned to the closest anchor box, yet the location of the target center may swing between neighboring anchor boxes because of signal noise and target movement, which may decrease the detection rate. For moving targets that cross anchor box boundaries in particular, switching between anchor boxes may reduce the continuity of target detection.
To reduce the missed detections of the network, this article adopts a multi-anchor box assignment strategy. A buffer area is set, shown as the dashed area in Figure 7. When the target is located in the buffer area, it is considered to belong to all adjacent anchor boxes, and the target label is assigned to each of them. As shown in Figure 7, Target 1 is inside a single anchor box (Anchor1), which performs the target detection; Target 2 lies on the boundary between two adjacent anchor boxes (Anchor1 and Anchor3), and both are assigned target labels and perform the target detection together; and Target 3 lies at the junction of four anchor boxes (Anchor1, Anchor2, Anchor3, and Anchor4), all of which are assigned target labels and perform the target detection together. The multi-anchor box assignment strategy thus allocates multiple anchor boxes to detect boundary targets, which increases the target detection probability and improves the continuity of moving target detection.
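The assignment rule can be sketched as a small function. This is a minimal illustration under stated assumptions, not the training code: the grid cell size of 32 pixels, the buffer width of 4 pixels, and the function name `assign_anchors` are all hypothetical.

```python
def assign_anchors(cx, cy, cell, buffer, n_cols, n_rows):
    """Return the grid cells (col, row) that receive the label of a target
    centered at (cx, cy). A target within `buffer` pixels of a cell boundary
    is also assigned to the neighboring cell across that boundary."""
    col, row = int(cx // cell), int(cy // cell)
    off_x, off_y = cx - col * cell, cy - row * cell
    cols, rows = {col}, {row}
    if off_x <= buffer and col > 0:
        cols.add(col - 1)                 # near the left boundary
    if cell - off_x <= buffer and col + 1 < n_cols:
        cols.add(col + 1)                 # near the right boundary
    if off_y <= buffer and row > 0:
        rows.add(row - 1)                 # near the top boundary
    if cell - off_y <= buffer and row + 1 < n_rows:
        rows.add(row + 1)                 # near the bottom boundary
    return sorted((c, r) for c in cols for r in rows)

# Hypothetical 640 x 512 image split into 32-pixel cells (20 x 16 grid).
print(assign_anchors(16, 16, 32, 4, 20, 16))   # interior: one cell
print(assign_anchors(31, 16, 32, 4, 20, 16))   # vertical boundary: two cells
print(assign_anchors(31, 33, 32, 4, 20, 16))   # corner: four cells
```

The three calls correspond to the three situations in the text: a target inside one anchor box, on the boundary between two, and at the junction of four.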
3. FPGA-Based Optimization of Deployment
3.1. Image Input Channel Optimization
The number of image channels refers to the number of values required to describe a pixel. Generally, a grayscale image has one channel, and a color image has three channels, representing the red, green, and blue (RGB) values of each pixel. In addition, there are some special four-channel images, such as color images with RGB and A channel values, where A represents transparency. The input of the proposed LSDF-Net is a four-channel stack of sequential infrared images.
Most embedded platforms or AI solutions are designed for color images. In the case of a four-channel input, taking the RGBA format as an example, the R, G, B, and A values of each pixel are stored together, and the image is arranged pixel by pixel. Some FPGA platforms only support a pixel-by-pixel four-channel arrangement, while sequential single-band infrared images are arranged channel by channel, as shown in Figure 8.
Therefore, during network deployment, the data format must be rearranged before the images are input, converting the sequential infrared images from a channel-by-channel layout to a pixel-by-pixel layout. This rearrangement is time-consuming and degrades the real-time performance of the detection system. To avoid the rearrangement and accelerate the differential operation, the differential algorithm is optimized. The original differential operation subtracts each of the last three frames from the first frame, whereas the improved differential operation subtracts the mean of the last three frames from the first frame, as shown in Figure 9. Consequently, the differential output is a single channel, and data rearrangement is not required.
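The merging process of the differential layer and adjacent convolution layer is updated as follows:

$$Y = X_1 - \frac{1}{3}\left(X_2 + X_3 + X_4\right), \qquad Z = \omega * Y + b,$$

where X1, X2, X3, and X4 are four time-sequential image inputs, Y is the single-channel differential image, Z is the output of the convolution layer, ω represents the weight of the convolution kernel, with a size of 3 × 3 × 1 × 16, and b represents the bias, with a size of 1 × 1 × 16. The schematic diagram is shown in Figure 10.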
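The improved differential operation is simple enough to sketch directly on the channel-by-channel layout. The example below assumes a planar stack of shape (4, M, N) in which index 0 is the frame that is kept; the function name `mean_difference` is hypothetical and the code is an illustration, not the deployed preprocessing.

```python
import numpy as np

def mean_difference(frames):
    """frames: (4, M, N) planar stack; frames[0] is assumed to be the frame
    from which the mean of the other three is subtracted.
    Returns a single-channel (M, N) differential image."""
    frames = np.asarray(frames, dtype=np.float32)
    return frames[0] - frames[1:].mean(axis=0)

# Four constant 16-bit frames with values 10, 1, 2, 3: the result is 10 - 2 = 8.
stack = np.stack([np.full((4, 5), v, dtype=np.uint16) for v in (10, 1, 2, 3)])
single = mean_difference(stack)
print(single.shape, single[0, 0])
```

Because the operation reads each frame as a contiguous plane and writes one plane, no interleaving of the four frames into a pixel-by-pixel layout is needed.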
3.2. Optimization of the Quantization Calibration Set for Infrared Images
A calibration set is a dataset used to evaluate and adjust quantization parameters during the quantization process. Quantization requires measuring the distribution of each feature map to determine the appropriate quantization scaling factor. The function of the calibration set is to use a portion of the dataset to represent the entire dataset to measure the distribution range of each feature map and to count the input and output data range of each layer as a reference for the feature map quantization.
The relative Euclidean distance and cosine similarity are used as evaluation indicators, which are defined as

$$d_{\mathrm{rel}} = \frac{\sqrt{\sum_{i=1}^{n}\left(x_i - y_i\right)^2}}{\sqrt{\sum_{i=1}^{n} x_i^2}}, \qquad \cos\theta = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}},$$

where $x_i$ represents the output value of each network layer before quantization, $y_i$ represents the output value of each network layer after quantization, and n represents the number of network layers. The relative Euclidean distance and cosine similarity indicate the difference between the model before and after quantization, which reflects the effect of model quantization. The smaller the relative Euclidean distance, the better the quantization effect. Similarly, the closer the cosine similarity is to 1, the better the quantization effect.
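Both indicators are straightforward to compute once the outputs before and after quantization are available as arrays. The sketch below assumes that the relative Euclidean distance is normalized by the norm of the pre-quantization output, which is a common convention but an assumption here; the function names are hypothetical.

```python
import numpy as np

def relative_euclidean_distance(x, y):
    """||x - y|| / ||x||, with x the float output and y the quantized output."""
    x = np.ravel(x).astype(np.float64)
    y = np.ravel(y).astype(np.float64)
    return np.linalg.norm(x - y) / np.linalg.norm(x)

def cosine_similarity(x, y):
    """Cosine of the angle between the flattened outputs x and y."""
    x = np.ravel(x).astype(np.float64)
    y = np.ravel(y).astype(np.float64)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

float_out = np.array([1.0, 2.0, 3.0, 4.0])
quant_out = np.array([1.0, 2.0, 3.0, 3.5])     # hypothetical quantized output
print(relative_euclidean_distance(float_out, quant_out))
print(cosine_similarity(float_out, quant_out))
```

Note that cosine similarity is insensitive to a uniform scaling of the output, whereas the relative Euclidean distance is not, which is one reason to report both.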
The selection of a calibration set affects quantization accuracy. For general target detection tasks, the calibration set should meet the following basic requirements:
Representativeness: the calibration set should well represent the data distribution of the real scene images;
Diversity: the calibration set should cover different types of input data as much as possible;
Scale: the calibration set should have enough scale. The larger the scale, the better the quantization effect will be, but more memory and time will be required for the quantization process.
Beyond these general requirements, the background of an infrared image under a sky cloud background consists mainly of clouds of various kinds. It is therefore necessary to refine the criteria for selecting infrared images as a quantization calibration set. We further explore the effects of the dynamic range, variance, scale, and batch size of infrared images on the quantization effect. The experimental results are compared and analyzed in detail in Section 4.4.
3.3. Int16/Int8/Mixed Quantization
Model quantization aims to reduce the storage space and computing resources consumed by the model. It converts the floating-point parameters of the network into low-bit-width data. Quantization can significantly reduce model size and resource consumption, speed up computing, and make models easier to deploy on resource-constrained edge devices. The commonly used quantization bit widths are int16 and int8, along with binary and ternary quantization at lower bit widths. However, because rounding and truncation errors are inevitable in the quantization process, the accuracy of the neural network may be affected. For neural networks with high accuracy requirements, the choice of quantization bit width is a trade-off between network accuracy and network complexity. In this article, the network is quantized with the Icraft component, and the main process is shown in Figure 11.
Quantization mainly includes the following steps:
Loading: loading the floating-point network into memory to facilitate subsequent processing;
Forward calibration: running inference with the floating-point network on the calibration dataset to obtain the feature maps. The parameter forward_mode determines the forward inference mode; forward_dir and forward_list indicate the location and file list of the calibration set;
Feature map measurement: determining the saturation point of each feature map according to the method specified by the parameter saturation;
Normalization ratio calculation: calculating the normalization coefficient of each feature map from its saturation point;
Normalization: normalizing the network parameters so that the dynamic range of all feature maps is suitable for quantization. After normalization, the network is completely equivalent to the original network;
Quantization: converting the floating-point numbers to fixed-point numbers once the preparation work is complete;
Test and analysis: verifying and analyzing the accuracy of the quantized network with simulation tools.
For the infrared dim small target detection network, int8, int16, and mixed-precision quantization are considered. The mixed-precision method uses int16 for the operators that have a greater impact on quantization accuracy, while the remaining operators are computed in int8 to reduce the complexity of the network. The experimental results are compared and analyzed in detail in Section 4.5.
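The measurement and quantization steps can be illustrated with a generic symmetric fixed-point scheme. The sketch below is not the Icraft implementation or its API: it assumes that the saturation point is the maximum absolute value seen on the calibration set and that quantization is symmetric around zero, and the function names are hypothetical.

```python
import numpy as np

def saturation_point(calibration_maps):
    """Assumed rule: largest absolute value over all calibration feature maps."""
    return max(float(np.max(np.abs(m))) for m in calibration_maps)

def quantize(x, saturation, bits):
    """Symmetric quantization to `bits` bits, returned in dequantized form
    so the rounding error can be compared directly with the float input."""
    qmax = 2 ** (bits - 1) - 1
    scale = saturation / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
calibration_maps = [rng.normal(size=(16, 16)) for _ in range(8)]  # stand-in feature maps
sat = saturation_point(calibration_maps)

x = calibration_maps[0]
err8 = np.abs(quantize(x, sat, 8) - x).max()
err16 = np.abs(quantize(x, sat, 16) - x).max()
print(err8, err16)
```

The int16 error is smaller by roughly the ratio of the two step sizes, which is why mixed precision keeps int16 only for the operators where that extra resolution matters.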
4. Experimental Results and Analysis
4.1. Experimental Setup
4.1.1. Dataset and Training Environment
In the experiment, the dataset and training strategy follow [35]. The training strategy includes small-sized image transfer learning, label refinement, and iterative training. The training dataset comprises 89,510 images, while the validation dataset consists of 2128 independent sequential images, each with a size of 640 × 512 pixels and a bit depth of 16 bits. Samples of the validation dataset are shown in Figure 12, where a, b, c, and d correspond to datasets 1, 2, 3, and 4, respectively.
Quantization is performed with Icraft 3.1.1. The quantized network is deployed on the Fudan-Micro Wukong development board, which is equipped with the Fudan-Micro heterogeneous FPGA chip JFMQL100TAI. The device runs Ubuntu 20.04 Linux.
4.1.2. JFMQL100TAI
The heterogeneous FPGA chip JFMQL100TAI from Fudan Micro integrates a processing system (PS) built around a four-core processor, programmable logic (PL), and the Buyi AI acceleration engine. The schematic diagram of each unit is shown in Figure 13. The PS is a four-core, 64-bit, energy-efficient Cortex-A53 processor based on the ARMv8 instruction set architecture. It can serve as the main task management processor of the system and provides SDIO, QSPI, UART, Ethernet, and other interfaces. The abundant programmable and high-speed interface resources of the PL can implement the main interface logic, including the GTX high-speed communication interface of the main control board, LVDS, PCIe 2.0, HDMI, and other high-speed and low-speed interfaces from the backplane; these programmable logic resources greatly improve the flexibility and scalability of the system. The Buyi AI acceleration engine is an ASIC AI processing engine integrated into the chip, which supports multiple quantization precisions and has strong AI computing capability. Its computing power reaches 27.52 TOPS for int8 and 6.88 TOPS for int16.
4.2. Ablation Experiments with Sequential-Differential Acceleration and Convolutional Pooling
To verify the improvement of detection accuracy by false-alarm-object learning and the multi-anchor assignment strategy, ablation experiments were conducted, and the results are shown in Table 2.
The multi-anchor assignment strategy increased the precision rate from 97.79% to 98.64% and improved the recall rate from 82.66% to 87.61%; the false-alarm-object learning further enhanced the precision rate from 98.64% to 99.17% and boosted the recall rate from 87.61% to 90.18%.
To verify the improvement of the network performance by the sequential-differential acceleration method and convolutional pooling, ablation experiments were conducted. Differential acceleration and convolutional pooling are applied to the LSDF-Net. LSDF-Net + Diff-Acc is converted from the baseline LSDF-Net by the sequential-differential acceleration method, and LSDF-Net + Diff-Acc + Conv-Pl is retrained with the convolutional pooling method. The networks are deployed to the heterogeneous FPGA JFMQL100TAI using the same settings. The results of the experiment are shown in Table 3.
In the table, the hardware inference time is divided into two parts. One is the hardware computation time, which is the time the hardware spends on computation alone, such as convolution and activation; the other is the memory copy time, which covers memory access and data transfer. The average values in the table are averages over the four validation sets. The results indicate that sequential-differential acceleration reduces both the memory copy time and the hardware computation time. Specifically, the average inference time decreases by 11.37%. Additionally, after applying convolutional pooling, the average inference time is further reduced by 15.78%. Notably, all three networks maintain a consistent recall rate.
4.3. Image Input Channel Optimization Experiment
A comparative experiment of four-channel inputs and single-channel inputs was carried out on the four datasets, each with int8 and int16 quantization settings. The experimental results are shown in Table 4. For the four-channel setting, the data processing time includes the image reading, data rearrangement, output decoding, and target association times. For the single-channel setting, the data processing time includes the image reading, differential operation, output decoding, and target association times.
According to the table, by converting the image input from four channels to a single channel, the FPS increased from 28.86 to 51.09 under 16-bit quantization, which corresponds to a 43.51% reduction in per-frame running time (a 77.0% increase in FPS). Similarly, under 8-bit quantization, the FPS increased from 32.20 to 55.73, corresponding to a 42.22% reduction in per-frame running time (a 73.1% increase in FPS). The results show that by optimizing the image input channel, the memory copy, hardware, and data processing times are reduced in both 16-bit quantization and 8-bit quantization.
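The two ways of expressing the speed-up are easy to confuse, so a short check of the arithmetic may help. The sketch below uses only the FPS values quoted above; the function names are hypothetical.

```python
def fps_gain_percent(fps_before, fps_after):
    """Relative increase in throughput."""
    return (fps_after - fps_before) / fps_before * 100.0

def frame_time_reduction_percent(fps_before, fps_after):
    """Relative decrease in per-frame time, since time per frame = 1 / FPS."""
    return (1.0 - fps_before / fps_after) * 100.0

for label, before, after in [("int16", 28.86, 51.09), ("int8", 32.20, 55.73)]:
    print(label,
          round(frame_time_reduction_percent(before, after), 2),
          round(fps_gain_percent(before, after), 1))
```

The percentages of about 43.5% and 42.2% therefore describe the time saved per frame, while the throughput itself rises by more than 70% in both settings.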
4.4. Quantization Calibration Set Optimization Experiment
One hundred independent infrared images from various scenarios were selected for the quantization calibration experiment (some of the images are shown in Figure 14), and incremental experiments on the calibration set were conducted in two ways. Table 5 shows the increment based on the image dynamic range, and Table 6 shows the increment based on the image variance. Cosine similarity and relative Euclidean distance were used as evaluation metrics to reflect the quantization effect.
As shown in Table 5 and Table 6, the experimental results indicate that the dynamic range and variance of the infrared images are positively correlated with the quantization effect. From a portion of the experimental data, it can also be concluded that when the maximum dynamic range remains unchanged and the maximum variance increases, the quantization effect remains the same; however, when the maximum variance remains unchanged and the maximum dynamic range increases, the quantization effect improves. Therefore, the maximum dynamic range of the calibration images is the dominant factor in the quantization effect. Furthermore, the calibration set should contain enough images to cover a diverse range of scenes with both wide dynamic ranges and high variance.
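One way to act on this finding is to rank candidate calibration images by dynamic range and use variance to break ties. The sketch below is a hypothetical selection rule written for illustration; it is not the procedure used in the experiments, and the function names are assumptions.

```python
import numpy as np

def image_statistics(img):
    """Return (dynamic range, variance) of a single-band image."""
    img = np.asarray(img, dtype=np.float64)
    return float(img.max() - img.min()), float(img.var())

def rank_for_calibration(images, count):
    """Indices of the `count` images with the largest dynamic range,
    using variance as a tie-breaker (hypothetical rule)."""
    stats = [image_statistics(im) for im in images]
    order = sorted(range(len(images)), key=lambda i: stats[i], reverse=True)
    return order[:count]

candidates = [
    np.zeros((4, 4)),                            # flat image, no dynamic range
    np.linspace(0, 100, 16).reshape(4, 4),       # moderate dynamic range
    np.linspace(0, 1000, 16).reshape(4, 4),      # wide dynamic range
]
print(rank_for_calibration(candidates, 2))
```

A purely range-based ranking would work against the diversity requirement if applied alone, so in practice such a rule would more plausibly be used to guarantee that wide-dynamic-range scenes are present rather than to pick the whole set.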
In addition, experiments on the quantization effects of different batch sizes were conducted, as shown in Table 7. The results indicate that as batch size increases, the quantization effect improves, and the optimal quantization effect is attained with a batch size of 100. It is important to note that when the calibration set is too large, performing a forward pass on all images at once can put significant pressure on the computer’s available memory. Therefore, when the computer has limited memory, batch processing can be used to handle the calibration set in smaller groups. The quantization component will average the results from multiple measurements to produce the final measurement result.
4.5. int16/int8/Mixed Quantization Experiment
The total running time and recall rates of 8-bit, 16-bit, and mixed quantization are compared across the four validation datasets. The experimental results are shown in Table 8.
The experimental results indicate that for Dataset1 and Dataset2, which have a high target average SNR, all three quantization methods achieve a 100% recall rate; for Dataset3, with a target average SNR of 7.74, the recall of all three quantization methods decreased to 94.12%; and for Dataset4, with a target average SNR of 3.72, the recall of the int16 and mixed quantization methods decreased to 85.71%, and the recall of the int8 quantization method further decreased to 80.95%.
The processing time experiment results are shown in Table 9.
The experimental results demonstrate that the average inference time for int8 quantization is reduced by 37.99% compared to int16 quantization, while the average inference time for mixed quantization is 30.65% less than that of int16 quantization. The mixed quantization method not only decreases the hardware inference time but also maintains the same recall rate as int16 quantization. After deploying the mixed quantization, the comparison of the total running time is presented in Table 10, where FPS increased to 54.10.
After network deployment, the FPGA resource usage of the hardware platform is shown in Table 11. The entire system uses only a small share of the available hardware resources.
4.6. Comparison with the Existing Method
To verify the effectiveness of the proposed method, EfficientNet [36], MobileNetV2 [37], Darknet-19 [38], GoogLeNet [39], ResNet-18 [40], YOLOv5n, and SqueezeNet [41] were selected for comparison.
As shown in Table 12, the proposed LSDF-Net has fewer parameters and lower FLOPs, achieving a high-speed processing performance with an acceptable detection performance.
5. Conclusions
This article proposes a network for infrared dim small target detection under a sky cloud background, together with its optimization and deployment on the heterogeneous FPGA JFMQL100TAI. First, a lightweight sequential-differential-frame-based network (LSDF-Net) is established. This network incorporates sequential-differential input, a false-alarm-object learning strategy, and a multi-anchor box assignment strategy to improve the detection performance under a sky cloud background. Sequential-differential acceleration and convolutional pooling are introduced to optimize the structure of the network, reducing hardware inference time by 15.78%. Subsequently, converting the image input from four channels to a single channel reduces the per-frame running time by 43.5%, raising the FPS from 28.86 to 51.09 under 16-bit quantization. Finally, the selection criteria for the infrared image quantization calibration set are optimized, and mixed quantization is chosen for deploying the network on the JFMQL100TAI platform. Compared to 16-bit quantization, this approach saves 30.65% of the network inference time while maintaining the same recall rate. The recall rate after deployment on the JFMQL100TAI platform is not lower than 85.71% on the four validation datasets, with a performance of 54.10 FPS. Compared to some existing methods, the proposed LSDF-Net achieves high-speed processing performance with acceptable detection performance. This work advances network optimization and deployment on the JFMQL100TAI; however, the compatibility of this type of device with new or custom network layers still needs further improvement.