Multi-Dimensional Information Fusion You Only Look Once Network for Suspicious Object Detection in Millimeter Wave Images

Abstract: Millimeter wave (MMW) imaging systems have been widely used for security screening in public places due to their advantages of being able to detect a variety of suspicious objects, non-contact operation, and harmlessness to the human body. In this study, we propose an innovative, multi-dimensional information fusion YOLO network that can aggregate and capture multimodal information to cope with the challenges of low resolution and susceptibility to noise in MMW images. In particular, an MMW data information aggregation module is developed to adaptively synthesize a novel type of MMW image, which simultaneously contains pixel, depth, phase, and diverse signal-to-noise information to overcome the limitation of current MMW images containing identical pixel information in all three channels. Furthermore, this module is capable of differentiable data enhancement to take into account adverse noise conditions in real application scenarios. In order to fully acquire the augmented contextual information mentioned above, we propose an asymptotic path aggregation network and combine it with YOLOv8. The proposed method is able to adaptively and bidirectionally fuse deep and shallow features while avoiding semantic gaps. In addition, a multi-view, multi-parameter mapping technique is designed to enhance the detection ability. Experiments on measured MMW datasets validate the improvement in object detection achieved by the proposed model.


Introduction
In recent years, due to the increasing emphasis on public safety, millimeter wave (MMW) imaging systems [1][2][3] have gradually become a necessary screening technique. Traditional security screening techniques, which mainly include X-ray machines and metal detectors, have different limitations. X-ray machines are not suitable for human screening due to their use of ionizing radiation, and metal detectors cannot detect non-metallic objects. In contrast, MMW scanners can not only penetrate clothing to detect concealed suspicious items, but they are also harmless to humans due to their utilization of non-ionizing electromagnetic waves. Furthermore, MMW scanners are favored for their non-contact and privacy-protecting advantages. However, as illustrated in Figure 1, achieving the high-precision detection of suspicious objects in MMW images faces the following challenges [4][5][6]: (1) Compared with optical images, MMW images suffer from low-resolution problems. (2) Current MMW images are actually three-channel grayscale images, so they cannot provide enough information to find statistically significant differences between suspicious items and the human body. (3) The size gaps among suspicious items are large, and there are a variety of small targets in MMW images. (4) There is inherent system noise and multipath-reflection noise. Early MMW detection algorithms [7][8][9][10] are mainly based on statistical theory. Because they rely on hand-designed features and lack analyses utilizing big datasets, these algorithms have difficulty solving the above real-world application issues.
Electronics 2024, 13, x FOR PEER REVIEW

Recently, deep learning has achieved great success in the field of optical image detection, which benefits from the powerful feature extraction capabilities of convolutional neural networks (CNNs) [11] and attention mechanisms [12][13][14][15]. Based on whether candidate regions are generated or not, these algorithms can be mainly categorized into two-stage methods and one-stage methods. The representative two-stage algorithm is the region-based convolutional neural network (RCNN) series, including R-CNN [16], Fast-RCNN [17], Faster-RCNN [18], Cascade-RCNN [19], Dynamic-RCNN [20], etc. The RCNN series generates region proposals in the first stage and refines the localization and classification in the second stage. Meanwhile, many remarkable improvements, such as feature pyramid networks (FPNs) [21], region-based fully convolutional networks (R-FCNs) [22], and regions of interest (ROIs) [23], have been successively applied to enhance the semantic interaction capability and estimation accuracy. Because they combine coarse-grained and fine-grained operations, two-stage algorithms can obtain a high accuracy. Compared with two-stage algorithms, one-stage methods are free of candidate-region generation and directly perform predictions. As a result, one-stage methods have a higher efficiency and are more suitable for environments with high foot traffic. The single-shot, multibox detector (SSD) [24] and You Only Look Once (YOLO) [25] algorithms are two well-known early one-stage algorithms. Although they can detect in real time, they cannot match the accuracy of two-stage algorithms. Over time, more and more YOLO versions [26][27][28][29][30][31] were developed to reduce the accuracy gap between one-stage and two-stage methods. Among them, YOLOv5 [32] has become the current main solution in industrial inspection areas due to its excellent performance. Through jointly utilizing the cross-stage partial (CSP) theory [33], mosaic processing, and a path aggregation network (PANet) [34], the accuracy of YOLOv5 has become as good as those of two-stage methods. Building upon the success of YOLOv5, the latest state-of-the-art (SOTA) model, YOLOv8, was proposed in [35]. To promote convergence, YOLOv8 designs the C2f module to control the shortest longest gradient path [36]. Decoupled heads [37] and task-aligned assigners [38] have also been introduced to enhance its capabilities. However, directly applying the above methods to the detection of suspicious objects in MMW images does not fully exploit their advantages. This is because MMW images and optical images have many differences, as stated earlier.
Thanks to the development of deep learning, scholars have proposed some CNN-based detection algorithms for MMW images. In [39], a self-paced, feature attention fusion network was proposed to fuse MMW features with different scales in a top-down manner. A multi-source aggregation transformer was presented in [40] to model the self-correlation and cross-attention of multi-view MMW images. To address the issues in detecting small targets in MMW images, Huang et al. [5] combined YOLOv5 and a local perception swin transformer to increase the algorithm's global information acquisition ability. Liu et al. [41] adopted dilated convolution to construct a network with an enhanced multi-size target detection ability. Wang et al. [42] used normalized accumulation maps to locate targets. A Siamese network was utilized in [43] to change the MMW detection task into a similarity comparison task with a low complexity. However, the susceptibility to noise and insufficient channel information are still challenges that MMW detectors need to solve.
In this paper, we propose an improved YOLO detection algorithm that aims to increase the available information and extend the feature aggregation ability. We name this innovative method the multi-dimensional information fusion YOLO (MDIF-YOLO) network. Current MMW images are actually grayscale images containing three channels, from which the data information cannot be fully mined by existing networks. In view of this issue, we designed a data information aggregation (DIA) module, which can jointly use pixel, depth, phase, and different signal-to-noise ratio (SNR) information to generate a novel type of multi-channel MMW image. Moreover, the DIA module realizes differentiable image enhancement in the generation procedure. The corresponding enhancement parameters can be learned end-to-end during the training of the model. It is worth noting that one can arbitrarily select the information types (pixel, depth, phase, or SNR) for fusion. Therefore, the DIA module has a wide range of application scenarios and can significantly increase the available information and the robustness to various types of noise. After the DIA module, the latest YOLOv8 is applied as the framework to construct a detection model for multi-channel MMW images. In order to better mine the multimodal features, we designed an asymptotic path aggregation network (APAN) and utilized it to replace the original PANet neck of YOLOv8. Unlike the PANet neck, which uses simple concat fusion, the APAN adopts an adaptive spatial weighted fusion strategy that can address the inconsistencies of different layers. Furthermore, the APAN extends the asymptotic incorporation theory [44,45] to realize bidirectional asymptotic aggregation in both top-down and bottom-up paths, which contributes to avoiding feature gaps during network transmission. At the output of the detection head, a multi-view, multi-parameter mapping technique is utilized to refine the detection results. This technique performs a cross-correlation mapping refinement among the three sets of multi-view images generated by using three sets of different DIA parameters for the same scan group, and it applies a padding strategy to enhance the detection of unclear targets. As a result, an improved detection precision and a reduction in false positives can be achieved.
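The paper does not spell out the exact cross-correlation mapping rule, so the following is only a minimal sketch of one plausible refinement step: detections from the three DIA-parameter result sets are matched by IoU, and a box is kept only when enough sets agree. All names and thresholds here (`iou`, `refine`, `iou_thr`, `min_votes`) are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: cross-set refinement by IoU consensus voting.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def refine(det_sets, iou_thr=0.5, min_votes=2):
    """Keep boxes from the first set that are confirmed (IoU >= iou_thr)
    in at least min_votes of the parameter sets, including itself."""
    kept = []
    for box in det_sets[0]:
        votes = sum(any(iou(box, other) >= iou_thr for other in s)
                    for s in det_sets)
        if votes >= min_votes:
            kept.append(box)
    return kept
```

A box detected under only one DIA parameter set is treated as a likely false positive and dropped, which matches the stated goal of reducing false positives.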
The proposed MDIF-YOLO network possesses the following advantages: (1) The DIA module increases the information richness of MMW images and enhances the network's robustness to noise. (2) To our knowledge, we are the first to propose an end-to-end, online, multi-dimensional data synthesis method in the MMW security field, and we improve on the mainstream MMW grayscale image detection method. (3) The constructed APAN addresses feature inconsistencies and avoids semantic gaps during the fusion of deep and shallow layers. (4) A higher detection precision and fewer false positives can be obtained through the proposed multi-view, multi-parameter mapping technique. Extensive experiments verified the favorable performance of the MDIF-YOLO. The rest of this paper is organized as follows: Section 2 details the overall framework and design of the proposed MDIF-YOLO network. The experiments and analysis are provided in Section 3. Section 4 concludes this article.

Overview of the System
In this study, we propose an innovative MDIF-YOLO network for suspicious object detection in MMW images. The MMW dataset [46,47] used in this research was obtained using our BM4203 series MMW security scanners [48,49], which adopt cylindrical scanning and active imaging, as shown in Figure 2. In one scan, ten MMW images from different angles can be obtained, where the first five images are generated by imaging the front of the person and the last five images are of the back. In order to protect the privacy of the person being scanned, BM4203 series MMW security scanners automatically blur the face. The signal bandwidth of the system ranges from 24 GHz to 30 GHz, and the resolution is about 5 mm. As representative security products that have been widely applied in airports, stations, and stadiums, BM4203 series MMW security scanners have a high reliability and good image resolution. Through our scanners, the corresponding phase, depth, and diverse SNR information are simultaneously stored together with the MMW pixel images.

Figure 3 depicts the entire architecture of the proposed MDIF-YOLO network. The MDIF-YOLO consists of three main modules: the DIA module, the YOLOv8-APAN module, and the multi-view, multi-parameter mapping module. At the beginning, the DIA module receives original MMW pixel images together with one or more types of information (depth, phase, or SNR). In the DIA module, these multimodal data can be successively aggregated in both coarse-grained and fine-grained manners, where coarse-grained fusion focuses on estimating enhancement parameters that can be trained by backpropagation, and fine-grained fusion concentrates on further integrating the enhanced data. Then, the generated novel, robust multi-channel MMW images are sent to the YOLOv8-APAN module. The roles of the YOLOv8 backbone, APAN neck, and detection head in the YOLO module are, respectively, multi-scale feature extraction, deep-shallow layer aggregation, and prediction. In particular, the proposed APAN combines bidirectional asymptotic aggregation and an adaptive spatial weighted technique to avoid multi-scale gaps. Finally, the prediction results are sent to the mapping module. By jointly mapping the results of three sets of images with different DIA parameters from the same scan, the mapping module provides sufficient multi-view and multi-dimensional information to fine-tune the results. For unclear targets, a padding strategy is adopted to increase the detection performance.
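The three-stage data flow described above can be sketched as plain function composition. Everything below is a placeholder (the stage bodies are stand-ins, not the actual networks); it is shown only to make the module boundaries and the three-parameter-set mapping step explicit.

```python
# Hedged sketch of the MDIF-YOLO data flow; all stage bodies are
# illustrative placeholders, not the real modules.

def dia_module(pixel, depth, snr):
    """Aggregate the chosen modalities into one multi-channel image."""
    return [pixel, depth, snr]          # stand-in for coarse/fine fusion

def yolov8_apan(image):
    """Backbone -> APAN neck -> head; returns detections for one image."""
    return [("knife", 0.9, (10, 20, 30, 40))]  # placeholder prediction

def multi_view_mapping(result_sets):
    """Cross-map the three DIA-parameter result sets of one scan."""
    return result_sets[0]               # placeholder refinement

def mdif_yolo(scan_views):
    results = []
    for pixel, depth, snr in scan_views:
        sets = [yolov8_apan(dia_module(pixel, depth, snr))
                for _ in range(3)]      # three DIA parameter sets per scan
        results.append(multi_view_mapping(sets))
    return results
```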

Data Information Aggregation Module

Although the current grayscale MMW images have three channels, the elements at the same spatial location in the three channels are identical pixels. The aim of the proposed DIA module is to synthesize a novel type of MMW image containing multi-dimensional information. Figure 4 shows the module's overall structure, which comprises coarse-grained fusion, parameter estimation, data enhancement, and fine-grained fusion submodules. As previously mentioned, one can arbitrarily choose from the available pixel, depth, phase, or SNR data for fusion. In this study, the pixel, depth, and SNR information were chosen to generate a new type of image. Their three single-channel matrices with dimensions of 1 × 352 × 808 were sent to the coarse-grained fusion submodule for the preliminary construction of a 3 × 352 × 808 matrix. To improve the computational efficiency, a letterbox scaling operation [35] was adopted to achieve downsampling so that the dimensions can be reduced to 3 × 256 × 256. Note that the pixel, depth, and phase data can be easily obtained by using a wavenumber domain reconstruction algorithm [8] to process the MMW 3D echo data, which can be expressed as

s(x_r, y_r, z_r) = ∭ f(x_t, y_t, z_t) e^{-j2kR} dx_t dy_t dz_t,  R = √((x_r − x_t)² + (y_r − y_t)² + (z_r − z_t)²), (1)

where (x_t, y_t, z_t) indicates the coordinates of the target, (x_r, y_r, z_r) are the coordinates of the radar element, and f(x_t, y_t, z_t) is the reflectivity. Different SNR data can be obtained by changing the logarithmic imaging threshold or by using simple image processing methods, e.g., the CLAHE method [50].
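As a rough illustration of how changing the logarithmic imaging threshold yields images with different effective SNRs, the sketch below (an assumption for illustration, not the scanners' actual processing chain) converts echo magnitudes to a dB image and clips it at an adjustable floor; a higher (less negative) floor suppresses more low-level noise.

```python
# Hedged sketch: different effective SNR via the dB imaging floor.
import numpy as np

def to_db_image(magnitude, floor_db):
    """Normalize |echo| to dB relative to its peak, clip at floor_db,
    and rescale to [0, 1]."""
    mag = np.asarray(magnitude, dtype=float)
    db = 20.0 * np.log10(mag / mag.max() + 1e-12)  # peak-relative dB
    db = np.clip(db, floor_db, 0.0)                # apply the floor
    return (db - floor_db) / (-floor_db)           # [floor_db, 0] -> [0, 1]
```

With `floor_db = -40`, for example, any return more than 40 dB below the peak is mapped to zero, so raising the floor trades weak-target visibility for noise suppression.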
Based on the differentiable training theory presented in [51,52], after the coarse-grained fusion submodule, the parameter estimation submodule is used to extract three enhancement parameters that can be trained during backpropagation. Similar to [51], the parameter estimation submodule is a tiny CNN with five convolutional layers (the output channel numbers are 16, 32, 32, 32, and 32) and two fully connected layers. Every convolutional layer contains 3 × 3 convolutions with a stride of 2, batch normalization, and a swish activation function. The fully connected layers output three trainable enhancement parameters.
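Assuming a padding of 1 for the 3 × 3 convolutions (the paper does not state the padding explicitly), the spatial sizes through the five stride-2 layers for a 256 × 256 input can be traced with the standard output-size formula:

```python
# Spatial-size trace for the tiny parameter-estimation CNN.
# Padding of 1 is an assumption; kernel 3 and stride 2 are from the text.

def conv_out(size, kernel=3, stride=2, pad=1):
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

def trace(size, layers=5):
    sizes = [size]
    for _ in range(layers):
        sizes.append(conv_out(sizes[-1]))
    return sizes
```

Under these assumptions the feature map shrinks 256 → 128 → 64 → 32 → 16 → 8, leaving a small 8 × 8 × 32 tensor for the two fully connected layers to map to the three parameters.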
The data enhancement submodule receives the three chosen single-channel matrices from the input. Each matrix is gamma enhanced online by setting the output parameters γ1, γ2, and γ3 of the parameter estimation submodule as the enhancement parameters. Gamma enhancement can be expressed as y = l·s^γ, where l represents the luminance coefficient that is generally set to 1, s is any matrix element, and γ is the gamma enhancement parameter. Obviously, the power exponential operation is differentiable.
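The differentiability claim can be made concrete: for y = l·s^γ, the gradient with respect to γ is l·s^γ·ln(s), which is what lets γ be updated by backpropagation. The helper names below are illustrative, and a finite-difference check confirms the analytic gradient.

```python
# Gamma enhancement and its analytic gradient with respect to gamma.
import math

def gamma_enhance(s, gamma, l=1.0):
    """y = l * s**gamma, applied to one normalized element s in (0, 1]."""
    return l * s ** gamma

def dgamma(s, gamma, l=1.0):
    """d/dgamma of l*s**gamma = l * s**gamma * ln(s)."""
    return l * s ** gamma * math.log(s)
```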
At the end of the DIA module, the fine-grained fusion submodule synthesizes the above three enhanced single-channel matrices. As a result, a novel type of real three-channel MMW image with multimodal information can be generated. Figure 5 shows a comparison between a traditional MMW pixel image and the corresponding novel multimodal MMW image. Even to the human eye, the proposed new type of image has richer textures, particularly for poorly imaged areas of the human body, such as the arms. Additionally, from the computer vision perspective, the new image has three different channels of information. In contrast, traditional MMW images only have one channel of information. Therefore, from the perspectives of both image quality and feature extraction, the new image has more advantages. In addition, the online trainable data enhancement increases the network's robustness to various types of noise. In particular, through learning how to use the new kind of image, even the detection ability for the original pixel image can be greatly improved. The reason is that the learning of multi-dimensional information also strengthens the ability to analyze pixels.
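The one-channel-versus-three-channels point can be demonstrated with a toy numpy comparison (the values and per-channel gammas below are synthetic assumptions): a traditional MMW "three-channel" image merely replicates one grayscale plane, while the DIA output stacks three genuinely different modalities.

```python
# Toy comparison of the two image types; data are synthetic.
import numpy as np

def traditional(pixel):
    """Traditional MMW image: one grayscale plane copied three times."""
    return np.stack([pixel, pixel, pixel])

def dia_fused(pixel, depth, snr, gammas=(1.0, 0.8, 1.2)):
    """DIA-style image: three modalities, each gamma-enhanced (sketch)."""
    mods = (pixel, depth, snr)
    return np.stack([m ** g for m, g in zip(mods, gammas)])

def channels_identical(img):
    """True when every channel equals the first one."""
    return all(np.array_equal(img[0], img[c]) for c in range(1, img.shape[0]))
```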

YOLOv8-Asymptotic Path Aggregation Network
In order to adequately extract and analyze the augmented information of the novel MMW image, an APAN was designed and embedded into YOLOv8 as its neck architecture. This novel network is named YOLOv8-APAN; its structure is depicted in Figure 6. YOLOv8-APAN comprises a YOLOv8 backbone, an APAN neck, and a detection head, where the backbone extracts features, the neck aggregates the extracted features, and the head produces the detection results. Like YOLOv8, in the backbone and neck, the proposed method applies C2f instead of the C3 used in YOLOv5. Compared with C3, C2f can control the shortest longest gradient path while remaining lightweight, so that a higher capability boundary can be achieved. The detection strategy of the head uses the anchor-free method rather than the anchor-based method. In this way, the limitations of the anchor box design and the computational complexity of non-maximum suppression (NMS) are alleviated. The structural components of C2f and the detection head can be found in [35].

It is well known that current neck networks generally choose the FPN and its optimized versions, e.g., PANet, because of their strong deep and shallow feature fusion capabilities. Recently, an asymptotic feature pyramid network (AFPN) was proposed in [44] as a new and improved version of the FPN. Through integrating adjacent shallow-layer features and asymptotically uniting deep-layer features, the AFPN is able to eliminate semantic gaps between non-adjacent levels. This paper extends the asymptotic incorporation theory to PANet and proposes the APAN. As an improved version of the AFPN, the proposed APAN realizes asymptotic integration not only in top-down paths but also in bottom-up paths, as shown in Figure 7.
Unlike the AFPN, which first considers shallow-level features and then expands to deeper layers, our APAN involves a bidirectional expansion union, i.e., ASF2-1, ASF2-2, and ASF2-3 first follow a deeper-to-shallower direction, and then ASF3-1, ASF3-2, and ASF3-3 take a shallower-to-deeper path. According to [44], different layers that are far apart have semantic gaps that cannot be ignored. Thus, it is not suitable to fuse them directly. In the APAN, the feature differences between two distant layers are greatly reduced by first merging adjacent layers in pairs, and then further considering the fusion of distant layers. For example, input1 and input2 are the first to be aggregated through ASF2-1 and ASF2-2, and then the fusion of input3 and the output of ASF2-2 is carried out through ASF2-3. Although input1 and input3 are non-adjacent layers, their indirect information fusion in ASF2-3 avoids the gap issue. This is because input1 and input2, as well as input2 and input3, are adjacent layers, and the information of input2 in ASF2-2 plays an important role in regulating the conflict between input1 and input3. The same theory applies to ASF3-1, ASF3-2, and ASF3-3, except that they aim at three layers. Following each ASF submodule, C2f is deployed for processing and learning the fused features. Moreover, upsampling and downsampling operations are introduced to align the dimensions during fusion. Upsampling consists of convolution and an interpolation technique, and downsampling is achieved by convolution.
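The fusion order described above can be traced with the sketch below. The fusion bodies are placeholders (simple averages) and the exact tensor wiring of the ASF3 stage is an assumption; only the ASF2-1 → ASF2-2 → ASF2-3 → ASF3-1 → ASF3-2 → ASF3-3 call order reflects the text.

```python
# Hedged trace of the APAN fusion order; fusion bodies are placeholders.

def apan_forward(in1, in2, in3, log):
    def asf2(name, a, b):
        log.append(name)
        return (a + b) / 2           # placeholder two-layer fusion
    def asf3(name, a, b, c):
        log.append(name)
        return (a + b + c) / 3       # placeholder three-layer fusion
    # deeper-to-shallower pairwise stage: adjacent layers first
    f21 = asf2("ASF2-1", in1, in2)
    f22 = asf2("ASF2-2", in1, in2)
    f23 = asf2("ASF2-3", in3, f22)   # distant layers fused only indirectly
    # shallower-to-deeper three-layer stage (wiring assumed)
    o1 = asf3("ASF3-1", f21, f22, f23)
    o2 = asf3("ASF3-2", f21, f22, f23)
    o3 = asf3("ASF3-3", f21, f22, f23)
    return o1, o2, o3
```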
As discussed above, ASF2 and ASF3 are the key components of the APAN. They are mainly used for adaptive spatial weighted fusion: ASF2 performs the weighted union of two layers, and ASF3 performs the weighted union of three different-level layers. Figure 8 illustrates their structure. The computing procedure of ASF2 can be defined as

$$Y = \mathrm{SiLU}\big(\mathrm{BN}\big(\mathrm{Conv2d}(w_1 \odot X_1 + w_2 \odot X_2)\big)\big), \quad (2)$$

$$[w_1, w_2] = \mathrm{softmax}\big(\mathrm{Conv2d}\big(\mathrm{concat}(\mathrm{Conv2d}(X_1), \mathrm{Conv2d}(X_2))\big)\big), \quad (3)$$

where Conv2d, BN, concat, and softmax represent the convolution, batch normalization, concatenation, and softmax operations, respectively; SiLU is the swish activation function; and ⊙ represents element-by-element multiplication. In Equation (3), both the kernel size and stride of the convolution operations for X_1 and X_2 are 1, and the output channel number is 8. The kernel size and stride of the last convolution operation in Equation (3) are also 1, and the output channel number is 2. For Equation (2), the kernel size and stride of its convolution operation are 3 and 1, respectively, and the corresponding output channel number is equal to the channel number of the input feature that requires no upsampling or downsampling operations.
Similarly, ASF3 can be expressed as

$$Y = \mathrm{SiLU}\big(\mathrm{BN}\big(\mathrm{Conv2d}(w_1 \odot X_1 + w_2 \odot X_2 + w_3 \odot X_3)\big)\big), \quad (4)$$

$$[w_1, w_2, w_3] = \mathrm{softmax}\big(\mathrm{Conv2d}\big(\mathrm{concat}(\mathrm{Conv2d}(X_1), \mathrm{Conv2d}(X_2), \mathrm{Conv2d}(X_3))\big)\big). \quad (5)$$

In Equation (5), the convolution operations for X_1, X_2, and X_3 have the same parameters as those in Equation (3), i.e., both the kernel size and stride are 1, and the output channel number is 8. The kernel size and stride of the last convolution operation are 1, and the output channel number becomes 3 because there are three inputs. The parameters of the convolution operation in Equation (4) are the same as those in Equation (2), i.e., the kernel size is 3, the stride is 1, and the output channel number is equal to the channel number of the input feature that requires no upsampling or downsampling operations. It is well known that for different-level layer fusion, even if upsampling and downsampling operations are performed, the target characteristics inevitably present distortions. Fortunately, ASF2 and ASF3 can generate adaptive coefficients based on an evaluation of feature importance, thus strengthening the desired information and weakening the jamming information. In this way, inconsistencies among different layers in the aggregation procedure can be relieved. As shown in Figure 8, ASF3 is a more complex version of ASF2, with the number of inputs increased from two layers to three. In future work, an ASF4 submodule with four inputs could be designed to further optimize the network model.

The APAN has three outputs with different receptive fields, which are sent separately to three detection heads. The three detection heads are decoupled, meaning each can be divided into two branches, i.e., regression and classification. None of them need anchors designed in advance; anchor-free detection can be used.

Multi-View, Multi-Parameter Mapping Module
Although the above DIA module and YOLOv8-APAN can significantly improve the detection capability, fusing multi-view and different DIA pattern information into the whole network can further refine the detection results and improve the accuracy. In this section, a multi-view, multi-parameter mapping technique is introduced. This technique can simultaneously utilize the spatial-temporal and wide-domain information of three sets of images with different DIA parameters. In this way, higher performance and better robustness against noise can be obtained.
As shown in Figure 9, this technique consists of two parts: (1) multi-view, multi-parameter aggregation and (2) mapping refinement and padding. The multi-view, multi-parameter aggregation submodule is constructed to fuse multi-angle and multiple DIA pattern information. The mapping refinement and padding submodule is designed for screening and modifying the results.
The multi-view, multi-parameter aggregation submodule is used at the beginning of the model detection process. In fact, this submodule can be seen as creating three parallel branches that feed three sets of DIA outputs into the YOLOv8-APAN. As stated before, the DIA module realizes online differentiable image enhancement, during which three DIA enhancement parameters for the three channels are obtained. During training, several optimal sets of DIA parameters can be determined. In the multi-view, multi-parameter aggregation submodule, we choose three sets of optimal DIA parameters to construct novel, multi-dimensional MMW images in three different DIA patterns, i.e., DIA pattern 1, DIA pattern 2, and DIA pattern 3.
It should be noted that the three patterns take the same input data from the same scan, but with different DIA parameters. This means that the three patterns carry the same multi-dimensional, wide-domain information types but represent different SNR cases, because the gamma enhancement method can increase or suppress noise by varying its parameters. Their comparison is depicted in Figure 10. Obviously, different DIA parameters result in different representations because of the varying channel weightings; in other words, the emphasis on different kinds of information is more diverse. Therefore, jointly using the three DIA patterns improves the applicability and stability of the method. DIA pattern 1, DIA pattern 2, and DIA pattern 3 each contain 10 images (equal to the number of original pixel images in the same scan), which provide multi-angle information about the detected human body. In addition, thanks to the powerful parallel computing capabilities of YOLO and GPU-CUDA technology, combining the three patterns does not affect the high efficiency of the detection product. To summarize, multi-angle 3D information, multi-type input information, and multi-SNR information can be aggregated through the multi-view, multi-parameter aggregation submodule to achieve a better performance.

The output data of the multi-view, multi-parameter aggregation submodule are delivered to YOLOv8-APAN for detection. Then, the detection results of the 30 images from the same scan are fed into the mapping refinement and padding submodule for the final adjustment. Figure 11 shows its detailed structure, which mainly comprises three types of components, i.e., a filtering and mapping component, a refinement component, and a padding component.
The filtering and mapping component is used for screening and mapping the detection results from each DIA pattern image group. Every detection result contains target coordinates and a confidence score. In order to reduce false positives and redundancy, the filtering and mapping component first filters the results by setting confidence thresholds according to the type of target. Easily distinguishable target categories, such as guns, have high thresholds, while unclear categories, such as powder, have low thresholds. Assume that there are M kinds of targets, and the given threshold of each kind is T_m, m = 1, 2, ..., M. Then, whether the ith (i > 0) result in the jth (j = 1, 2, ..., 10) image passes the category threshold can be determined by the following mask function:

$$\mathrm{Mask}_{i,j} = \begin{cases} 1, & c_{i,j} \geq T_m \\ 0, & c_{i,j} < T_m \end{cases} \quad (6)$$

where c_{i,j} is the confidence of the result. If the confidence is no less than the category threshold, the mask is set to 1 and the result is saved; otherwise, it is deleted. Furthermore, filtering can be further adjusted based on whether the target coordinates are located in a faint body part area. For example, the arms are the dimmest body parts due to their small scattering area. For targets located on the arms, the mask function can be rewritten as

$$\mathrm{Mask}_{i,j} = \begin{cases} 1, & c_{i,j} \geq \alpha_m T_m \\ 0, & c_{i,j} < \alpha_m T_m \end{cases} \quad (7)$$

where 0 < α_m < 1. The arm region can be obtained using a human posture recognition method or be roughly set using simple image zoning. This region-based threshold setting facilitates accurate screening. In this study, for convenience, we only chose the basic threshold filtering in Equation (6). After filtering, the detection result coordinates of the five front images and five back images in each DIA pattern group are mapped onto the same front and back images of the group, respectively. The mapping can be easily achieved by 3D geometric transformation [46] because the depth, rotation angle, and 2D plane coordinates can be acquired by any MMW security system. Moreover, other mapping methods are available, such as the feature point matching method [53] or the tracking technique [47].
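A minimal sketch of the category-threshold filtering in Equations (6) and (7) follows. The threshold values, category names, and the rectangular arm region are illustrative assumptions, not the paper's calibrated settings.

```python
def box_center_in(box, region):
    """True when the center of box (x1, y1, x2, y2) lies inside region."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]

def filter_detections(dets, thresholds, arm_region=None, alpha=None):
    """Keep a detection when conf >= T_m (Eq. 6); inside the faint arm
    region the threshold is relaxed to alpha_m * T_m (Eq. 7)."""
    kept = []
    for d in dets:  # d = {"cls": str, "conf": float, "box": (x1, y1, x2, y2)}
        t = thresholds[d["cls"]]
        if arm_region and alpha and box_center_in(d["box"], arm_region):
            t *= alpha[d["cls"]]                # 0 < alpha_m < 1
        if d["conf"] >= t:
            kept.append(d)
    return kept

# Illustrative thresholds: a clear category (gun) vs. an unclear one (powder).
dets = [
    {"cls": "gun", "conf": 0.55, "box": (10, 10, 40, 40)},
    {"cls": "powder", "conf": 0.30, "box": (60, 60, 80, 80)},
]
kept = filter_detections(dets, {"gun": 0.60, "powder": 0.25})
```

With the basic rule of Equation (6), the low-confidence gun detection is rejected while the powder detection survives; supplying `arm_region` and `alpha` switches to the relaxed rule of Equation (7).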
The refinement component applies the CIOU to calculate the correlations among the filtering and mapping results. For each mapped detection result, the CIOU values between it and the other results are saved in a list. If the list of a mapped detection result contains only zeros, the corresponding result is deleted. The reason is that all CIOU values being zero means that there are no intersecting detection boxes, so the detection result is highly likely to be a false positive. For instance, the first group of images (DIA pattern 1 images) in Figure 11 has one false positive on the leg in the final front image. Since there are no false positives at close positions in the other front images, this false positive can be deleted by the refinement component. Please note that since all the suspicious objects in the example are placed on the front of the human body, only the front images are shown.

The role of the padding component is complementary to that of the refinement component. For one position, if more than one image has a detection result, it is determined that there is a target at all the corresponding positions in the same-side images, thus correcting some missed detections. In the first image of the second image group (DIA pattern 2 images) in Figure 11, the network failed to detect the target. Fortunately, this absence can be complemented by the correct detection results at close positions in the other images. If only one image has a result, the padding process does not take effect, as in the example in the third image group (DIA pattern 3 images) of Figure 11.
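The refinement rule can be sketched as follows. Plain IoU is used here instead of the CIOU from the paper, which is enough to express the "all-zero overlap means isolated false positive" logic; the box coordinates are toy values.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def refine(mapped):
    """mapped: one box list per image, all mapped onto the same view.
    A box that overlaps no box from any other image is dropped as a
    likely false positive."""
    refined = []
    for i, boxes in enumerate(mapped):
        others = [b for j, bs in enumerate(mapped) if j != i for b in bs]
        refined.append([b for b in boxes if any(iou(b, o) > 0 for o in others)])
    return refined

# Two images agree on a target; the third image has an isolated box.
mapped = [[(0, 0, 10, 10)], [(1, 1, 11, 11)], [(50, 50, 60, 60)]]
refined = refine(mapped)
```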
Once the three DIA groups obtain their refined and padded results, padding among the groups is carried out. Unlike in-group padding, which requires no fewer than two correct results at close locations, padding among groups complements an absence even if only one group detects the target at a certain location. This helps to overcome the limitations of individual DIA pattern images. The final results of all three groups are padded and aggregated onto the same puppet image, as shown in Figure 11. Clearly, higher detection precision and fewer false positives can be achieved through the proposed multi-view, multi-parameter mapping technique.
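The padding rule can be sketched in the same style: when at least `min_votes` images detect a target at overlapping positions, the box is propagated to the images that missed it. Under the description above, in-group padding would use `min_votes=2` and padding among groups would relax this to `min_votes=1`; plain IoU again stands in for the paper's CIOU.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def pad(mapped, min_votes=2):
    """mapped: one box list per image (same mapped view).
    Propagate well-supported boxes to the images that missed them."""
    all_boxes = [(i, b) for i, bs in enumerate(mapped) for b in bs]
    padded = [list(bs) for bs in mapped]
    for i, b in all_boxes:
        votes = {j for j, o in all_boxes if iou(b, o) > 0}  # agreeing images
        if len(votes) >= min_votes:
            for k in range(len(mapped)):
                if not any(iou(b, o) > 0 for o in mapped[k]):
                    padded[k].append(b)                     # fill the miss
    return padded

# Images 0 and 1 agree; image 2 missed the target and gets boxes padded in.
padded = pad([[(0, 0, 10, 10)], [(1, 1, 11, 11)], []])
```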

Experimental Results and Discussion
In this section, the performance of the proposed MDIF-YOLO algorithm is evaluated. Extensive experiments were conducted using the real-world MMW dataset obtained from our BM4203 series MMW security scanners. All experiments were implemented on a Windows 10 system using a 24 GB NVIDIA TITAN RTX GPU. In the following, we describe the dataset, the experimental details, comparisons between the proposed method and other SOTA methods, and the corresponding discussion.

Dataset
For reasons of commercial confidentiality, product copyright, and privacy protection, there are few large MMW datasets collected from practical application scenarios, let alone datasets with non-pixel signal data such as depth and phase, even though non-pixel raw signal data are as readily available as pixels in every MMW screening system. Another reason non-pixel signal data are not provided is that current MMW detection algorithms do not consider the use of multi-dimensional information; obtaining non-pixel signal data has never been an obstacle for any MMW security product.
The MMW image dataset [46,47] and the corresponding multi-dimensional data used in this study were collected using our BM4203 series MMW security scanners [48,49], which have been widely deployed in airports, stations, and stadiums. BM4203 series scanners can perform almost 360-degree imaging of the person being scanned. Each scan generates 10 images, where the first 5 show the front of the body and the last 5 show the back. The resulting images preserve the person's privacy by blurring the face, and a good resolution of 5 mm can be achieved. Based on the application of BM4203 series MMW security scanners in multiple practical scenarios, the large dataset created contains 186,610 real MMW security images and the corresponding depth, phase, and SNR information data. This study selected the pixel, depth, and SNR information to create the novel multi-dimensional MMW images. Note that the novel image type does not affect the labels, so the labels do not need to be modified. The numbers of men and women in the dataset are balanced, the age distribution is 18 to 70 years old, the testers wore a variety of clothes common across the four seasons, and they carried guns, knives, rectangular explosives, lighters, powders, liquids, bullets, and phones. The dataset acquisition details can be summarized as follows: (1) The dataset was collected using four BM4203 series MMW security scanners applied in four different practical scenarios: an airport, a station, a stadium, and an office building. (2) There were about 200 testers, whose ages ranged from 18 to 70 years old. The numbers of men and women were nearly equal, and their body mass indexes covered the thin, normal, and obese ranges. (3) Since spring and autumn clothing are almost identical, the clothing worn by the testers can be divided into three types: winter, spring/autumn, and summer. The frequencies of these three types of clothing in the dataset are similar. (4) There are eight kinds of common suspicious objects: guns, knives, rectangular explosives, lighters, powders, liquids, bullets, and phones. The testers were asked to hide suspicious objects on various body parts, including the upper and lower arms, shoulders, neck, abdomen, crotch, back, waist, buttocks, and legs. For each body part, the suspicious items were placed randomly. (5) During the scan, the testers maintained a fixed posture with their hands raised upward. Each scan generated 10 images at 10 different angles. When one scan was complete, the system prompted the tester to leave. If a tester moved or had the wrong posture during the scan, the system prompted a rescan, thus avoiding image blurring or image occlusion. (6) The systems adopted a wave-number domain imaging algorithm [8]. The depth and phase information were stored together with the pixel information (i.e., the traditional images) during its maximum value projection procedure. Different SNR information can be obtained by changing the logarithmic threshold of the imaging method or by using simple image processing methods, e.g., the CLAHE method. (7) The dataset was annotated using labeling software.
As shown in Figure 12, the division strategy of the dataset can be summarized as follows: (1) After finishing the data collection and labeling, a test set was first constructed by selecting one-tenth of the data from the full dataset. In particular, we ensured that the testers associated with the test set did not appear in the remaining nine-tenths of the dataset. (2) The remaining nine-tenths of the dataset were stored as eight subgroups according to the eight types of suspicious objects, i.e., guns, knives, rectangular explosives, lighters, powders, liquids, bullets, and phones. (3) In each subgroup, the data produced by the same scan were treated as one unit, named a scan unit in this paper. Each subgroup was then shuffled by scan unit, which means that the data produced by the same scan were kept together while different scan units were shuffled. In conclusion, the partition of the training, validation, and test sets was 81/100 : 9/100 : 10/100.
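The scan-unit-based split can be sketched as follows. The random seed is an illustrative choice, the per-category subgrouping step is omitted for brevity, and the tester-disjointness constraint on the test set is not modeled; the sketch only shows how whole scan units are shuffled and partitioned 81/9/10.

```python
import random

def split_by_scan(scan_units, seed=0):
    """Shuffle whole scan units (each unit = the 10 images of one scan)
    and split them into train/val/test = 81/9/10 percent. Keeping units
    intact guarantees that images from one scan never straddle two sets."""
    rng = random.Random(seed)
    units = list(scan_units)
    rng.shuffle(units)
    n = len(units)
    n_test = n // 10                 # one-tenth carved out as the test set
    n_val = round(n * 0.09)          # 9% of the full set for validation
    test = units[:n_test]
    val = units[n_test:n_test + n_val]
    train = units[n_test + n_val:]
    return train, val, test

train, val, test = split_by_scan([f"scan_{i}" for i in range(100)])
```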

Evaluation Metrics
Following the main detection and evaluation indicators used in the security inspection field, this study utilized Precision, Recall, and mean average precision (mAP) to evaluate the method's performance. In addition, a metric named the fake alert rate (FAR) was defined for a more comprehensive evaluation. These metrics can be expressed as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$\mathrm{FAR} = \frac{FP}{TP + FP + TN + FN}, \quad \mathrm{mAP} = \frac{1}{C} \sum_{c=1}^{C} \int_{0}^{1} P_c(R_c)\, dR_c,$$

where TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively, and C is the number of target classes. Precision indicates the proportion of real targets among all detected results, and Recall measures the percentage of real targets that were detected. Here, the definition of FAR differs from the common false alarm rate concepts in the optical detection field: FAR evaluates the rate of false positives over the entire dataset. mAP is a comprehensive evaluation metric, computed as the average over all classes of the integral of the Precision-Recall curve.
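The scalar metrics above can be computed directly from the confusion counts, as in the following sketch; mAP requires the full per-class Precision-Recall curve and is therefore not shown. The counts in the usage line are made-up toy values.

```python
def detection_metrics(tp, fp, tn, fn):
    """Precision, Recall, and the paper's FAR (false positives over all
    samples in the dataset, not the optical-field false alarm rate)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    far = fp / (tp + fp + tn + fn)
    return precision, recall, far

# Toy counts: 8 true positives, 2 false positives, 85 true negatives, 5 misses.
p, r, far = detection_metrics(tp=8, fp=2, tn=85, fn=5)
```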

Implementation Details
The original dimensions of the pixel images were 3 × 352 × 808, where the three channels were identical; thus, the real dimensions of the images can be regarded as 1 × 352 × 808. During the parameter estimation of the DIA module, the data were reshaped to 3 × 256 × 256 to reduce the number of calculations. Through the DIA, a novel type of multi-dimensional MMW image, which is truly multi-channel and has dimensions of 3 × 352 × 808, is generated. When the data entered YOLOv8-APAN, mosaic enhancement and the letterbox process were applied, so the novel image size became 3 × 640 × 640 at the input of the YOLO network and remained so until the end. During training, the proposed MDIF-YOLO algorithm updates its parameters end-to-end using an SGD optimizer with a momentum of 0.937. To ensure stable convergence, a warm-up policy was applied for the first three epochs; then, the one-cycle strategy was used to gradually and nonlinearly decrease the learning rate. We set the initial learning rate and the optimizer weight decay to 0.01 and 0.0005, respectively. Mosaic enhancement was kept running during training. The total number of training epochs was 100, and the batch size was set to 16.
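The schedule described above can be sketched as a per-epoch learning-rate function. The linear warm-up shape and the final-rate fraction `lrf` are assumptions for illustration (the paper only states a three-epoch warm-up followed by a nonlinear one-cycle decrease); a cosine curve is used here as one common one-cycle realization.

```python
import math

def lr_at(epoch, total_epochs=100, warmup_epochs=3, lr0=0.01, lrf=0.01):
    """Learning rate at a given epoch: linear warm-up for the first
    warmup_epochs, then a one-cycle style cosine decay from lr0 down to
    lr0 * lrf over the remaining epochs."""
    if epoch < warmup_epochs:
        return lr0 * (epoch + 1) / warmup_epochs       # ramp up to lr0
    t = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return lr0 * (lrf + (1 - lrf) * 0.5 * (1 + math.cos(math.pi * t)))
```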

Performance Comparison
In this subsection, the performance of the proposed MDIF-YOLO was compared to several other networks that are the most effective and widely used in the field of MMW security. In practical MMW security products, most manufacturers still prefer the SOTA deep learning detectors of each period because of their good end-to-end training capability and ease of deployment. As mentioned previously, the SOTA deep learning detectors of different periods mainly include the RCNN series and the YOLO versions. Here, we selected Faster-RCNN, Cascade-RCNN, Dynamic-RCNN, YOLOv3, YOLOv5, and YOLOv8 as the comparison items. For security products, the detection ability and the fake alert rate are the most important performance indicators. The results are shown in Table 1 and Figure 13. It can be seen that the performance of the early YOLO methods (i.e., YOLOv3 and earlier versions) is not comparable to that of the RCNN algorithms. However, the more recently proposed YOLOv5 and YOLOv8 algorithms showed performances comparable to those of the two-stage algorithms. By overcoming the limitations of traditional MMW images and aggregating multi-dimensional information, the MDIF-YOLO proposed in this paper outperformed the other six SOTA detectors. Thanks to the synthesis of much more abundant image information and its powerful feature aggregation ability, MDIF-YOLO obtained better Precision and mAP values. Furthermore, the combination of multi-dimensional data, including pixel, depth, multi-SNR, and multi-view information, also improves the robustness and applicability. As a result, the proposed method had a significantly lower fake alert rate than the other methods.

To give the reader a clearer understanding of the performance improvement of the proposed method, Figure 14 visualizes the comparison between MDIF-YOLO and YOLOv8. As discussed in the DIA module subsection, learning the DIA pattern images can greatly improve the detection ability relative to the original pixel images. To prove this more succinctly, in the next experiment, we omitted the multi-view, multi-parameter mapping process from MDIF-YOLO and directly predicted pixel images and the corresponding group of DIA pattern images. Note that SOTA algorithms, such as YOLOv8, do not have the structure to generate and learn multi-dimensional MMW images. Meanwhile, the existing SOTA methods are not robust enough to deal with the various types of SNR information contained in DIA pattern images, which has been confirmed in the actual operation of our MMW security products. Therefore, applying them to DIA pattern images causes a significant degradation in performance, and it is pointless to list their test results for DIA pattern images.

This experiment chose bullets as the detection objects, which are among the smallest targets in MMW images. In Figure 14, the bullets are very small relative to the human body and are indistinguishable from the body texture, especially when they are hidden on unclear body parts, such as the back of the head, the legs, and the arms. YOLOv8 failed to identify the bullets at the back of the head and on the leg. In contrast, the proposed MDIF-YOLO network identified almost all the targets. Applying MDIF-YOLO to the pixel images and to the DIA pattern images was almost equally effective, which verifies that aggregating the multi-dimensional information improves the robustness and the ability to analyze various types of MMW data. Another indicator of robustness is the FAR, which has not been used as much as Precision and mAP in the relevant research. In effect, the FAR fully reflects the stability when there are signal oscillations and changes in the SNR. From Figures 13 and 14, we can observe that the proposed method had a significantly lower FAR for the different types of images, verifying that it is more robust.

Another comparison, between MDIF-YOLO and detectors specifically designed for MMW security, is provided in Table 2 and Figure 15. Since the relevant research in the MMW security field is not sufficient and many related methods are not as good as the SOTA deep learning methods, we chose the recently proposed multi-source aggregation transformer (MSAT) [40] and Swin-YOLO [5] algorithms as comparisons. The comparison results showed that, although these two latest specialized MMW detectors offer a certain improvement over the general-purpose SOTA methods, the proposed MDIF-YOLO was still superior, because it revolutionizes how the MMW data are used and aggregated, which proves its effectiveness in the MMW detection field. A lower FAR shows the stability of the proposed method and makes its application scenarios more comprehensive, thus increasing its competitiveness.
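For reference, the detection-ability and fake-alert-rate indicators discussed above can be computed from aggregate per-image detection outcomes. The sketch below is illustrative only: the per-image false-alarm convention used in the code is an assumption (the exact FAR formula used by the products is not restated here), and the counts are hypothetical.

```python
def detection_metrics(tp, fp, fn, n_images):
    """Toy Precision/Recall/FAR from aggregate detection counts.

    tp/fp/fn: true positives, false positives, and missed targets over a
    test set. FAR here is taken as false alarms per scanned image -- one
    common convention, assumed purely for illustration.
    """
    precision = tp / (tp + fp)   # fraction of raised alarms that are real threats
    recall = tp / (tp + fn)      # fraction of concealed objects actually found
    far = fp / n_images          # false alarms per scanned image (assumed definition)
    return precision, recall, far

# Hypothetical counts: 180 threats found, 20 false alarms, 20 misses, 250 scans.
p, r, far = detection_metrics(tp=180, fp=20, fn=20, n_images=250)
print(f"Precision={p:.3f}, Recall={r:.3f}, FAR={far:.3f}")
```

Under this convention, a detector can trade Precision against FAR only by suppressing alarms, which is why the two indicators are reported together for security products.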

Ablation Experiments
Ablation experiments were conducted to evaluate the effectiveness of each proposed module in MDIF-YOLO. The results are given in Table 3 and Figure 16. The Precision and mAP of YOLOv8 were 89.3% and 88.7%. When the proposed DIA module was added to YOLOv8, the two metrics reached 91.2% and 91.1%, an increase of about 2%. The FAR dropped from 11.5% to 8.2%, a decrease of 3.3%. Evidently, the multi-dimensional information fusion and the differentiable data enhancement accomplished by the DIA module raised the upper limit of the available information and enhanced the overall performance. Subsequently, the APAN was used to replace the YOLOv8 neck, which led to 2.6% and 2.9% growth in Precision and mAP compared with YOLOv8. The FAR decreased to 8%. The reason for this improvement is that the APAN relieves feature inconsistencies and avoids semantic gaps. Finally, the multi-view, multi-parameter mapping module was added to form the complete MDIF-YOLO method. The Precision and mAP achieved 2.9% and 3.1% improvements, reaching 92.2% and 91.8%, respectively. The FAR had a significant 5.1% reduction, to 6.4%. Thus, it was proven that the multi-view, multi-parameter mapping module plays a role in fine-tuning the detection results, supplementing missed detections, and suppressing error detections. As shown in Figure 16, the use of multi-dimensional information fusion resulted in obvious performance gaps between MDIF-YOLO and YOLOv8. These ablation experiments verified that the DIA module, the APAN, and the multi-view, multi-parameter mapping module cooperate with each other to improve the performance and practicability of the algorithm from different aspects.
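As a quick cross-check, the improvement figures quoted above follow directly from the per-variant metrics. A minimal sketch (metric values transcribed from the text in percentage points; the variant names and the intermediate "+DIA+APAN" values derived from the stated growths are our own shorthand):

```python
# Ablation metrics transcribed from the text (%); variant names are shorthand.
ablation = {
    "YOLOv8":    {"precision": 89.3, "mAP": 88.7, "FAR": 11.5},
    "+DIA":      {"precision": 91.2, "mAP": 91.1, "FAR": 8.2},
    "+DIA+APAN": {"precision": 91.9, "mAP": 91.6, "FAR": 8.0},
    "MDIF-YOLO": {"precision": 92.2, "mAP": 91.8, "FAR": 6.4},
}

baseline = ablation["YOLOv8"]
for variant, metrics in ablation.items():
    # Improvement over the YOLOv8 baseline, in percentage points.
    deltas = {k: round(v - baseline[k], 1) for k, v in metrics.items()}
    print(variant, deltas)
```

Running this reproduces the deltas stated in the text, e.g. +2.9 Precision, +3.1 mAP, and -5.1 FAR for the complete MDIF-YOLO.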


Conclusions
We proposed the MDIF-YOLO algorithm, which pioneers the fusion of multimodal data in the MMW detection field. Multiple types of MMW data, such as pixel, depth, phase, SNR, and multi-view information, are jointly used in MDIF-YOLO to break the limitation of traditional MMW detectors that use only pixel information. Using online, differentiable, multimodal data enhancement, the DIA module in the proposed method in-

Figure 1. MMW images of a human carrying knives (red boxes), lighters (green boxes), and rectangular explosives (yellow boxes), which have obvious size and shape differences.


Figure 2. BM4203 series MMW security scanner and ten MMW images generated in one scan.


Figure 5. Comparison between the traditional MMW image and the proposed novel MMW image. The person being scanned is carrying four rectangular explosives.


Figure 7. The bidirectional asymptotic aggregation paths of the APAN.


Figure 8. The structure of the ASF submodule in an APAN.


Figure 9. The overall process of the multi-view, multi-parameter mapping technology.


Multi-angle 3D information, multi-type input information, and multi-SNR information can be aggregated through the multi-view, multi-parameter aggregation submodule to achieve a better performance.

Figure 10. Comparison of the constructed multi-dimensional images with different DIA parameters. The person being scanned is carrying four guns, which have been marked by rectangular boxes in the figure.


Figure 11. The structure of the mapping, refinement, and padding submodule.


(4) After the shuffle operation, each subgroup was divided into a training subset and a validation subset. The partition ratio was 9:1. (5) Finally, the training and validation subsets from the eight subgroups were combined into the final training set and validation set.
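Steps (4) and (5) can be sketched as follows. This is an illustrative reimplementation, not the authors' released code; the function and variable names are our own, and the seed and subgroup sizes are arbitrary.

```python
import random

def build_splits(subgroups, train_ratio=0.9, seed=42):
    """Shuffle each subgroup, split it 9:1, then merge into the final sets."""
    rng = random.Random(seed)
    train_set, val_set = [], []
    for samples in subgroups:              # e.g., the eight subgroups
        samples = list(samples)
        rng.shuffle(samples)               # step (4): shuffle within the subgroup
        cut = int(len(samples) * train_ratio)
        train_set.extend(samples[:cut])    # 90% -> training subset
        val_set.extend(samples[cut:])      # 10% -> validation subset
    return train_set, val_set              # step (5): combined final sets

# Eight hypothetical subgroups of 100 samples each.
groups = [[f"g{g}_img{i}" for i in range(100)] for g in range(8)]
train, val = build_splits(groups)
print(len(train), len(val))  # 720 80
```

Splitting within each subgroup before merging keeps the 9:1 ratio (and the subgroup composition) consistent across both final sets.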

Figure 14. Visualized detection comparison between YOLOv8 and MDIF-YOLO. YOLOv8 uses pixel MMW images, while MDIF-YOLO uses both pixel MMW images and the new type of MMW images. The person being scanned is hiding bullets in the back of the head, the lower back, and the right leg.


Table 1. Detection abilities and fake alert rates of Faster-RCNN, Cascade-RCNN, Dynamic-RCNN, YOLOv3, YOLOv5, YOLOv8, and the proposed MDIF-YOLO.

Table 2. Detection abilities and fake alert rates of MSAT, Swin-YOLO, and the proposed MDIF-YOLO.

Table 3. Ablation studies of the proposed MDIF-YOLO.