Article

SMR–YOLO: Multi-Scale Detection of Concealed Suspicious Objects in Terahertz Images

1 Key Laboratory of Grain Information Processing and Control, Ministry of Education, Henan University of Technology, Zhengzhou 450001, China
2 Henan Provincial Key Laboratory of Grain Photoelectric Detection and Control, Zhengzhou 450001, China
3 College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
4 School of Artificial Intelligence and Big Data, Henan University of Technology, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Photonics 2024, 11(8), 778; https://doi.org/10.3390/photonics11080778
Submission received: 18 July 2024 / Revised: 18 August 2024 / Accepted: 20 August 2024 / Published: 22 August 2024

Abstract

The detection of concealed suspicious objects in public places is a critical issue and a popular research topic. Terahertz (THz) imaging technology, as an emerging detection method, can penetrate materials without emitting ionizing radiation, providing a new approach to detecting concealed suspicious objects. This study focuses on the detection of concealed suspicious objects wrapped in different materials such as polyethylene and kraft paper, including items like scissors, pistols, and blades, using THz imaging technology. To address issues such as the lack of texture details in THz images and the contour similarity of different objects, which can lead to missed detections and false alarms, we propose a THz concealed suspicious object detection model based on SMR–YOLO (SPD_Mobile + RFB + YOLO). Based on the MobileNext network, this model introduces the spatial-to-depth convolution (SPD-Conv) module to replace the backbone network, reducing the computational and parameter load. The inclusion of the receptive field block (RFB) module, which uses a multi-branch structure of dilated convolutions, enhances the network's deep features. Using the EIOU loss function to assess the accuracy of predicted box localization further improves convergence speed and localization accuracy. Experimental results show that the improved model achieved mAP@0.5 and mAP@0.5:0.95 scores of 98.9% and 89.4%, respectively, representing improvements of 0.2% and 1.8% over the baseline model. Additionally, the detection speed reached 108.7 FPS, an improvement of 23.2 FPS over the baseline model. The model effectively identifies concealed suspicious objects within packages, offering a novel approach for detection in public places.

1. Introduction

In recent years, the frequent occurrence of events such as terrorist bombings and attacks has raised the standards demanded of anti-terrorism and security measures. In particular, the rapid detection and early warning of concealed suspicious objects, such as knives and firearms, carried by terrorists has become a significant challenge for public safety in crowded public places. Because X-ray photons carry high energy, they are likely to cause ionizing damage to the objects being inspected, so X-ray security scanners cannot safely scan the human body directly. The "metal detection door" screening method employed in airports cannot detect non-metallic dangerous items such as ceramic knives, necessitating a "pat-down" style inspection, which leads to a series of issues including inefficiency and invasion of privacy. Consequently, the development of technology capable of rapidly, accurately, and effectively detecting concealed suspicious items has become a matter of great urgency.
Terahertz technology [1], an emerging detection method, can obtain the internal information of objects without compromising their integrity, enabling the effective detection of concealed suspicious objects. THz waves are electromagnetic waves in the 0.1–10 THz frequency range, lying between millimeter waves and infrared radiation in frequency and between electronics and photonics in energy. They are characterized by strong penetration, low photon energy, and molecular fingerprint spectra. These properties have led to the wide application of THz technology in fields such as agricultural product inspection [2,3], biomedicine [4,5], and security inspection [6,7]. However, THz images are acquired with low resolution and a low signal-to-noise ratio because of the low power of the emission source and environmental noise. How to achieve rapid and accurate detection using the limited features in THz images has therefore become the focus of current research.
Deep learning has powerful feature extraction and learning capabilities and has been widely applied in the field of image processing. With the development of deep learning, the performance of neural networks has greatly improved, and object detection algorithms based on convolutional neural networks have become mainstream. Jia et al. [8] integrated the coordinate attention (CA) mechanism into the YOLOv5 algorithm [9], enhancing the model's regression and localization capabilities by embedding positional information to extract important features. Comparative experiments demonstrated that CA–YOLO achieved an average precision of 96% in detecting maize tassels, surpassing the you only look once (YOLO) series and classic detection models. Kang et al. [10] proposed a Type-1 fuzzy attention method to enhance both the accuracy and real-time performance of vehicle detection. This method introduces fuzzy entropy to re-weight feature maps, reducing map uncertainty and directing the detector's focus toward the object center, which effectively improves vehicle detection accuracy. Su et al. [11] designed the MOD–YOLO algorithm, considering that previous YOLO series algorithms may lose channel information and lack sufficient receptive fields, and applied it to crack detection in civil infrastructure. Compared to the YOLOX algorithm, MOD–YOLO improved accuracy by 27.5%, to 91.1%, on the crack dataset while maintaining a similar detection time, reducing parameters by 19.7%, and lowering computational complexity by 35.9%. Li et al. [12] proposed a spatial pyramid convolutional shuffle (SPCS) module to extract fine information from the limited visible pixels of occluded objects and generate distinguishable representations for significantly overlapping objects. Extensive experimental results demonstrated that the SPCS module effectively enhances crowd detection performance.
In recent years, object detection has also been widely applied in the THz field. Cheng et al. [13] proposed an improved single shot detector (SSD) algorithm to enhance the detection accuracy and speed of concealed objects in THz images. They used a ResNet-50 network instead of the original VGGNet-16 network [14] in SSD for feature extraction to overcome feature degradation, and then fused deep and shallow features using a feature fusion module to construct features rich in semantic information, thereby improving the accuracy of small object detection. Danso et al. [15] employed transfer learning based on the RetinaNet algorithm [16] to improve the identification accuracy of defects in THz images. Considering that the objects to be detected occupy only a very small proportion of THz images, they used the differential evolution search algorithm for optimization, further enhancing detection accuracy. Xu et al. [17] designed a multi-scale filtering and geometric enhancement method, utilizing a spatial distance grid of geometric transformation matrices to improve the detection accuracy of convolutional neural networks (CNNs) on passive THz images. They combined this method with an improved YOLOv5 and validated its detection accuracy on passive THz images.
Although the above methods have achieved certain results in THz image detection, the lack of texture details and unclear object contours in THz images make them prone to missed and false detections in complex backgrounds. Additionally, the significant size differences among objects in THz images and their limited texture features make it difficult to detect concealed suspicious items rapidly and accurately from the limited features available in low-resolution THz images. Accurate and rapid detection of concealed suspicious objects in low-resolution THz images therefore remains a challenge.
In this paper, we propose an improved YOLOv7 [18] algorithm for the multi-scale detection of concealed suspicious objects in THz images to address the aforementioned issues. A concealed suspicious object THz image dataset was constructed, including THz images of scissors, handguns, and blades, with various materials such as polyethylene and kraft paper. The non-local mean filtering algorithm [19] was employed to reduce image noise. By integrating the SPD-Conv [20] and MobileNet [21] structures, the SPD_Mobile network was introduced to replace the original YOLOv7’s backbone, reducing computational and parameter complexity and enhancing detection speed. An RFB [22] module was added after SPD_Mobile to utilize dilated convolutions to obtain different receptive fields and strengthen the extraction of multi-scale features. The EIOU [23] loss function was utilized to measure the precision of the bounding box predictions, improving model accuracy and accelerating model convergence.

2. Data Acquisition and Analysis

2.1. Experimental Setup

The TeraFAST-256-300 system was used to obtain the THz image dataset of concealed objects. The system architecture is illustrated in Figure 1. The system primarily consisted of a THz emitter, a linear fast scan camera, and a scan imaging control system. The output frequency of the THz emitter was approximately 300 GHz. The linear fast scan camera had an acquisition rate of up to 5000 lines per second, an imaging area size of 128 × 0.5 mm, a resolution of 256 × 1 pixels, and a pixel size of 0.5 × 0.5 mm. The scan imaging control system coordinated with the camera’s imaging speed to move the sample for scan imaging. In the experiment, the control system’s movement speed was set to 0.1 m/s to reduce the influence of system vibrations on the imaging process.

2.2. Terahertz Image Data Acquisition

For the experiment, six hazardous items (handguns, scissors, lighters, metal blades, ceramic blades, and iron nails) and three non-hazardous items (gel pens, nail clippers, and keys) were selected as concealed object test samples. Some of the samples to be tested are shown in Figure 2. To simulate real detection scenarios, the samples were randomly combined and placed under two packaging materials, polyethylene foam and kraft paper, to collect THz images of concealed objects.
The experiment utilized the THz fast imaging scanning system to scan the concealed object samples and obtain THz image data. The THz concealed object images are depicted in Figure 3. The size of the captured THz images is 256 × 512 pixels, with the color of each image reflecting the spectral intensity information; darker colors indicate smaller signal intensities. Throughout the image acquisition process, different color-toned THz images were obtained by adjusting the brightness and contrast of the system software images. Although the obstruction from the kraft paper box and polyethylene foam results in significant noise in the images, it is still possible to determine the sample types based on the contours of the objects in the images.
Due to noise interference and weak texture details in the original THz images, image preprocessing was necessary. In this study, the non-local means filtering algorithm was employed to process the original THz images. The THz images before and after non-local means filtering are shown in Figure 4. The processed images visually retain the original characteristics of the prohibited items while attenuating the surrounding noise.
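As a concrete illustration, OpenCV ships a non-local means implementation that can reproduce this preprocessing step. The sketch below is ours, not the authors' pipeline; the file name, filter strength, and window sizes are illustrative assumptions rather than the parameters used in this study.

```python
import cv2

# Load a raw THz image as a single-channel 8-bit array
# (the file name is a placeholder).
raw = cv2.imread("thz_raw.png", cv2.IMREAD_GRAYSCALE)

# Non-local means denoising: h controls filter strength; each 7x7
# patch is compared against patches inside a 21x21 search window.
# Parameter values here are illustrative, not the paper's settings.
denoised = cv2.fastNlMeansDenoising(raw, None, h=10,
                                    templateWindowSize=7,
                                    searchWindowSize=21)

cv2.imwrite("thz_denoised.png", denoised)
```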

3. Method

3.1. Overall Architecture

The experiment used YOLOv7 as the base model, one of the newer algorithms in the YOLO series, offering both high accuracy and high speed. The overall structure of YOLOv7 comprises three principal components: the backbone network, the neck network, and the head network. Each CBS module in the network is composed of a convolutional layer, a batch normalization layer, and the SiLU activation function. The backbone network first downsamples the input image by a factor of 2 using three CBS modules, followed by multiple ELAN and max-pooling layers, ultimately outputting feature maps downscaled by factors of 8, 16, and 32 to the neck. The neck consists of a feature pyramid network, a path aggregation network, and spatial pyramid pooling, which fuse information from the feature maps of different scales output by the backbone. The head predicts on the three feature-map scales from the neck output and speeds up inference using the RepVGG [24] block.
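The CBS block described above maps directly onto a few lines of PyTorch. The following is our illustrative reconstruction, not the authors' code; the kernel-size and stride defaults are assumptions.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU: the basic building block of YOLOv7-style nets."""
    def __init__(self, in_ch, out_ch, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, stride=s,
                              padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# A stride-2 CBS halves the spatial resolution, as in the backbone stem.
x = torch.randn(1, 3, 640, 640)
print(CBS(3, 32, k=3, s=2)(x).shape)  # torch.Size([1, 32, 320, 320])
```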
The SMR–YOLO network structure designed in this paper is shown in Figure 5. In the backbone network, to reduce parameters and improve efficiency, the ELAN and max-pooling modules of the original network were replaced by the SPD_Mobile structure, and the first CBS at the network input was replaced by the SPD-CBS module to reduce the potential omission of small-target features caused by the original stride-2 convolution. In the neck network, the SPPF module was replaced with the RFB, which enlarges the receptive field by combining convolution kernels of different sizes, enhancing the network's ability to detect features at different scales while reducing the model's computational and parameter requirements. Before the multi-scale feature maps are output from the neck, the large selective kernel (LSK) [25] module was added to adaptively adjust the receptive field size, learning contextual information around targets of different scales, enhancing the fusion of multi-scale features, and strengthening the network's attention to salient features. In the head network, the EIOU was used as the bounding box (bbox) loss function, focusing training on samples with good localization and high classification confidence to improve the accuracy and robustness of the detector.

3.2. SPD_Mobile Structure

The preliminary work in this paper [26] focused on meeting the real-time requirements for concealed object detection scenarios. Building upon the MobileNext network, the SPD_Mobile network was designed to replace the original YOLOv7 backbone network, as depicted by the red box in Figure 5. Additionally, before the output of multi-scale feature maps in the YOLOv7 network’s neck, the LSK module based on kernel selection mechanisms was integrated to enhance the fusion of multi-scale features and strengthen the network’s attention to salient features.
Although the use of the SPD_Mobile network significantly improved the model’s detection speed, it also led to a decrease in model accuracy compared to the original YOLOv7. Therefore, in this experiment, based on SPD_Mobile and LSK, the RFB module replaced the original SPPF module. The RFB module utilized dilated convolutions with different dilation rates to obtain feature maps with varying receptive fields, integrating local and global information of the network to enhance the extraction capability of multi-scale features. Additionally, the EIOU loss function replaced the default CIOU [27] loss function of YOLOv7. The EIOU comprehensively considers factors such as the IOU of bounding boxes, the distance between center points, the aspect ratio, etc., and addresses the imbalance of hard and easy samples, speeding up model convergence and improving detection accuracy.
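The key operation underlying SPD_Mobile's downsampling is SPD-Conv [20]: a lossless space-to-depth rearrangement followed by a non-strided convolution. Below is a minimal PyTorch sketch of the idea (our illustrative reconstruction, not the authors' implementation), using pixel_unshuffle for the space-to-depth step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPDConv(nn.Module):
    """Space-to-depth followed by a non-strided convolution.

    Instead of discarding pixels via a stride-2 convolution, every 2x2
    spatial block is stacked into the channel dimension (4x the channels),
    so the 2x downsampling loses no information before the convolution.
    """
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.scale = scale
        self.conv = nn.Conv2d(in_ch * scale * scale, out_ch,
                              kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        x = F.pixel_unshuffle(x, self.scale)  # (B, C*scale^2, H/scale, W/scale)
        return self.conv(x)

x = torch.randn(1, 64, 80, 80)
print(SPDConv(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```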

3.3. RFB Structure

In the SPPF module, three max-pooling layers were sequentially connected to obtain feature maps with different receptive fields. However, max-pooling layers retained only the maximum feature value while discarding the rest, leading to a loss of key information.
The RFB [22] module is a multi-branch dilated convolution structure that borrows its design from the Inception network. Dilated convolution increases the receptive field while keeping the feature map size unchanged, effectively mitigating the feature information loss caused by pooling operations and enhancing the expressive capability of feature maps. Dilated convolution inserts gaps between the elements of the kernel, capturing multi-scale information without increasing the parameter count, which benefits object detection accuracy. The effective size of a dilated convolution kernel is calculated as follows:
$$K = k + (k - 1) \times (r - 1)$$
where $K$ represents the effective size of the dilated convolution kernel, $k$ represents the size of the corresponding regular convolution kernel, and $r$ represents the dilation rate, i.e., the spacing inserted between kernel elements, which controls the size of the receptive field.
The diagrams of dilated convolutions with different dilation rates are depicted in Figure 6, with all corresponding regular convolution kernel k sizes being 3. Figure 6a shows dilated convolution with a dilation rate of 1, which is equivalent to a standard 3 × 3 convolution with a receptive field of 3. Figure 6b shows a dilated convolution with a dilation rate of 2 and a receptive field of 5. Finally, Figure 6c shows a dilated convolution with a dilation rate of 4 and a receptive field of 9.
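These three cases can be checked directly against the formula above; a few lines suffice.

```python
def effective_kernel(k: int, r: int) -> int:
    """Effective size K of a k x k kernel with dilation rate r:
    K = k + (k - 1) * (r - 1)."""
    return k + (k - 1) * (r - 1)

# The three cases of Figure 6, all with k = 3:
for r in (1, 2, 4):
    print(f"dilation rate {r}: receptive field {effective_kernel(3, r)}")
# dilation rate 1: receptive field 3
# dilation rate 2: receptive field 5
# dilation rate 4: receptive field 9
```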
The RFB module first uses a 1 × 1 convolution to reduce the channels of the input features, then employs convolutions of different sizes and dilated convolutions with different rates to obtain feature maps with different receptive fields, and finally concatenates these feature maps to achieve multi-scale feature fusion. The structure of the RFB module is illustrated in Figure 7.
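A simplified sketch of this multi-branch pattern is shown below; the branch count, intermediate widths, and dilation rates here are illustrative assumptions, and the exact configuration is given in [22].

```python
import torch
import torch.nn as nn

class SimpleRFB(nn.Module):
    """Simplified RFB: 1x1 channel reduction, parallel dilated branches,
    then concatenation. The branch layout is illustrative, not the exact
    configuration of the RFB paper."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = in_ch // 4
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, mid, 1),                       # reduce channels
                nn.Conv2d(mid, mid, 3, padding=r, dilation=r),  # dilated conv, size-preserving
            )
            for r in (1, 3, 5)
        ])
        self.fuse = nn.Conv2d(3 * mid, out_ch, 1)  # fuse the concatenated branches

    def forward(self, x):
        feats = [b(x) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

x = torch.randn(1, 256, 20, 20)
print(SimpleRFB(256, 256)(x).shape)  # torch.Size([1, 256, 20, 20])
```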

3.4. EIOU Loss Function

The EIOU loss function [23] is an extension of the CIOU [27]: it splits the aspect-ratio term so that the width and height differences between the predicted box and the ground truth box are penalized separately, each normalized by the corresponding dimension of the minimum enclosing box. The CIOU considers the aspect ratio of the bounding box, and its calculation formula is as follows:
$$IOU = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$$

$$CIOU = IOU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha \nu$$
where $B$ and $B^{gt}$ represent the predicted box and the ground truth box, respectively; $IOU$ denotes the intersection over union between the ground truth box and the predicted box; $b$ and $b^{gt}$ represent the centroids of the predicted box and the ground truth box; $\rho$ refers to the Euclidean distance between the two centroids; and $c$ represents the diagonal length of the smallest enclosing box of the predicted box and the ground truth box. $\nu$ measures the consistency of the aspect ratios of the predicted box and the ground truth box, with its calculation formula as follows:
$$\nu = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2$$
where $w^{gt}$ and $h^{gt}$ represent the width and height of the ground truth box, and $w$ and $h$ represent the width and height of the predicted box. $\alpha$ is a balancing parameter, calculated as
$$\alpha = \frac{\nu}{(1 - IOU) + \nu}$$
The CIOU regression loss function is defined as

$$CIOU_{loss} = 1 - CIOU$$
While the CIOU loss function can optimize the aspect ratio of the predicted box during training, the aspect ratio of the target box contributes little to the detection of small targets. Moreover, the CIOU reflects only the difference in aspect ratios; it cannot capture the real differences between the width and height of the predicted box and those of the ground truth box, which impedes the model's ability to optimize similarity effectively.

In response to these issues, the EIOU separates the aspect-ratio factor of the CIOU and calculates the width and height differences between the predicted box and the ground truth box separately, making the model more sensitive to the width and height of bounding boxes and improving the regression accuracy of the predicted boxes. The EIOU regression loss function is defined as follows:
$$EIOU_{loss} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{c_w^2} + \frac{\rho^2(h, h^{gt})}{c_h^2}$$
where $c_w$ and $c_h$ represent the width and height of the smallest enclosing box that simultaneously contains the predicted box and the ground truth box.
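For concreteness, the EIOU loss above can be transcribed almost line by line into PyTorch. The sketch below is our transcription for boxes in (x1, y1, x2, y2) format, not the authors' implementation; the epsilon guard is an added numerical-stability assumption.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIOU loss for boxes in (x1, y1, x2, y2) format, both of shape (N, 4)."""
    # Intersection and union for the IOU term
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared center distance over the squared enclosing-box diagonal
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    cw, ch = ex2 - ex1, ey2 - ey1  # enclosing box width and height
    center = ((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) / (cw ** 2 + ch ** 2 + eps)

    # Separate width and height penalties: the EIOU extension to CIOU
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    w_term = (w_p - w_t) ** 2 / (cw ** 2 + eps)
    h_term = (h_p - h_t) ** 2 / (ch ** 2 + eps)

    return (1 - iou + center + w_term + h_term).mean()
```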

4. Experimental Results and Discussion

4.1. Implementation Details

In the experiment, a THz image dataset of concealed objects wrapped in two different materials, polyethylene foam and kraft paper, was collected. The dataset was annotated with the LabelImg 1.8.6 software to mark the concealed object targets. Because the number of images obtained from the imaging system was insufficient for training neural networks, a data augmentation strategy was employed to expand the original data through random horizontal and vertical flipping, cropping, scaling, and other methods. The augmented dataset comprised a total of 3556 images, each sized 640 × 640 pixels. The dataset was divided into training, validation, and testing sets in an 8:1:1 ratio, with 2844 images in the training set and 356 images each in the validation and testing sets.
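One way such an augmentation pipeline might look with torchvision is sketched below; the transform choices and probabilities are illustrative assumptions, since the paper specifies only flips, cropping, and scaling. Note that for detection, the bounding boxes must be transformed together with the image.

```python
from torchvision import transforms

# Illustrative image-level augmentation pipeline; parameters are assumed.
# For object detection, box coordinates must be updated alongside the
# image (e.g., with torchvision.transforms.v2 or Albumentations).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomResizedCrop(640, scale=(0.8, 1.0)),  # crop + rescale to 640x640
])
```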
All models in the experiment used Python 3.7 as the development language and the PyTorch 1.13 deep learning framework. To mitigate the impact of pre-trained weights on the results, each model was trained from scratch. The training parameters were set with a batch size of 16 and 150 epochs for the number of iterations.

4.2. Evaluating Metric

To comprehensively evaluate the performance of the object detection models, the mean average precision (mAP), precision (P), recall (R), and frames per second (FPS) are used as evaluation metrics. The formulas for precision and recall are as follows:
$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$
where true positives (TP) are samples correctly predicted as positive, false positives (FP) are samples predicted as positive that are actually negative, and false negatives (FN) are positive samples predicted as negative. With TP held constant, a rise in FP means more objects outside a class are wrongly assigned to it, raising the false alarm rate and lowering precision; a rise in FN means more objects of the class go undetected, raising the miss rate and lowering recall. The mAP calculation formulas are as follows:
$$AP = \int_0^1 P(R) \, dR$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
The mAP@0.5 denotes the average precision at an IOU threshold of 0.5, while mAP@0.5:0.95 represents the average of the precisions at IOU thresholds ranging from 0.5 to 0.95 in steps of 0.05. FPS refers to the number of images that the network can process per second.
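To make the AP definition concrete, here is a small sketch that integrates a precision–recall curve numerically; it is illustrative only, as evaluation toolkits typically use interpolated precision (e.g., the 101-point COCO scheme) rather than raw trapezoidal integration.

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (trapezoidal rule)."""
    order = np.argsort(recall)  # integrate in order of increasing recall
    p, r = precision[order], recall[order]
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2))

# mAP is the mean of the per-class APs; the values here are hypothetical.
aps = [average_precision(np.array([1.0, 0.9, 0.8]), np.array([0.2, 0.6, 1.0]))]
print(sum(aps) / len(aps))  # 0.72 for this single hypothetical class
```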

4.3. Experimental Results and Analysis

The proposed SMR–YOLO algorithm, trained on the constructed concealed object THz image dataset, achieves 98.9% mAP@0.5 and 89.4% mAP@0.5:0.95 on the test set, with an inference speed of 108.7 FPS. To demonstrate the superiority of SMR–YOLO over other object detection models, it is compared with five classic detection algorithms (YOLOv7, YOLOv5, RTMDet [28], RetinaNet, and Faster–RCNN [29]) on the test set of the concealed object THz image dataset. The comparison results are shown in Table 1. The mAP@0.5 and mAP@0.5:0.95 of SMR–YOLO are the highest, at 98.9% and 89.4%, respectively, outperforming all other models. YOLOv7 follows, with mAP@0.5 and mAP@0.5:0.95 of 98.7% and 87.6%, respectively, while YOLOv5 and RTMDet have similar results, with mAP@0.5 of 97.5% and 97.1% and mAP@0.5:0.95 of 84.7% and 84.1%, respectively. Finally, Faster–RCNN and RetinaNet perform worst, with mAP@0.5 of 93.3% and 84.8% and mAP@0.5:0.95 of 75.5% and 71.4%, respectively. In terms of inference speed, SMR–YOLO is the fastest at 108.7 FPS, surpassing YOLOv7 and YOLOv5. SMR–YOLO thus demonstrates superior performance in both detection accuracy and speed compared with the other models.

4.4. Ablation Study

Using YOLOv7 as the baseline, a series of ablation experiments was conducted on the five network components described in Section 3.1. The performance of the different module combinations is presented in Table 2, where "√" indicates that a module is used in the network and "×" indicates its absence.
As shown in Table 2, in the previous work [26] of this paper, the addition of MobileNext and SPD-Conv significantly improves the detection speed of the algorithm, reaching 116.3 FPS, 30.8 FPS higher than the benchmark model. However, this comes at a cost in accuracy, with mAP@0.5 and mAP@0.5:0.95 decreasing by 1% and 5.4%, respectively. Building on Model 3, the inclusion of the RFB module and the EIOU loss function increases mAP@0.5 and mAP@0.5:0.95 by 1.3% and 5.2%, and P and R improve by 1.7% and 4.5%, without compromising detection speed. The experiments demonstrate that the RFB module and the EIOU loss function in SMR–YOLO contribute positively to the optimization of the model, enabling better detection of concealed suspicious objects of varying scales in THz images while maintaining fast detection. Finally, the LSK module was added to the network. Compared to Model 6, the model's P, R, mAP@0.5, and mAP@0.5:0.95 improve by 0.8%, 1.3%, 0.1%, and 2.2%, respectively, with a decrease in detection speed of 7.6 FPS. Compared to the benchmark model, P, R, mAP@0.5, and mAP@0.5:0.95 improve by 0.5%, 2.3%, 0.2%, and 1.8%, respectively, with a speed improvement of 23.2 FPS, enhancing both detection accuracy and detection speed.
Figure 8 displays the object detection results of YOLOv7 and SMR–YOLO on the concealed object THz dataset. Figure 8a shows the detection results of the YOLOv7 model. In the first image, a partially occluded blade in the center of the image is not detected. In the second image, although both the tool and scissors are identified, the confidence of the detection results is relatively low, and there is some deviation between the predicted bounding boxes and the actual positions of the objects. In contrast, Figure 8b displays the detection results of the SMR–YOLO. In the first image, SMR–YOLO successfully detects the occluded target, while in the second image, the confidence of the overlapping objects is significantly improved, and the predicted bounding boxes have a higher IOU with the actual positions of the objects. The results demonstrate that SMR–YOLO performs better in detecting occluded targets, thus enhancing the overall object detection performance of the model.
While this experiment successfully achieved the detection of concealed suspicious objects, there are still some issues that need to be improved and optimized. Firstly, the THz image dataset collected only consisted of eight types of samples, which is relatively limited. Moreover, most of the samples were collected under conditions where the objects were either placed individually or in close proximity. However, in reality, dangerous objects are often not placed individually or close together during personnel and luggage inspections; instead, they tend to intersect or overlap. This complex object arrangement can make it challenging for the detector to accurately distinguish and identify the boundaries and locations of each object, thereby imposing greater demands on the detection algorithm. THz images are characterized by noise interference and weak texture features. This study only leveraged the limited features present in THz images to enhance detection accuracy, without fundamentally addressing the issue of noise interference or improving image resolution. Finally, while the proposed improved model has achieved rapid detection of concealed suspicious objects, there is still room for further performance enhancement.
In the future, we will construct a more comprehensive THz image dataset for concealed objects, considering the inclusion of a wider variety of suspicious objects and their styles, as well as objects under different levels of occlusion. Additionally, we will simulate more realistic scenarios involving intersecting or overlapping objects, thereby improving the model’s robustness and accuracy in handling complex real-world situations. Furthermore, we will explore methods to enhance image resolution. On the hardware side, we will optimize the optical components and detector arrays of the THz imaging system to improve its spatial resolution. On the algorithmic side, we plan to develop and apply new image super-resolution reconstruction techniques and denoising methods to enhance the detail and resolution of the images. Additionally, we will investigate target detection models better suited for detecting concealed suspicious objects, integrating the latest deep learning technologies to further improve detection accuracy and efficiency.

5. Conclusions

This paper focuses on the detection of concealed suspicious objects in THz images. Taking the YOLOv7 object detection algorithm as the benchmark, we proposed a multi-scale detection algorithm for concealed suspicious objects in THz images. SMR–YOLO adds a receptive field block (RFB) module at the tail of the SPD_Mobile network; its multi-branch dilated convolutions effectively alleviate the loss of feature information caused by pooling operations and enhance the expressive capability of the feature maps. The EIOU is used as the bbox loss function to measure the accuracy of predicted box localization, so that the loss pays more attention to the shape of the bounding box and focuses training on samples with good localization and high classification confidence, improving the accuracy and robustness of the detector. The experimental results show that SMR–YOLO achieves the best results in both speed and accuracy compared with the YOLOv7, YOLOv5, RTMDet, RetinaNet, and Faster–RCNN algorithms, with mAP@0.5 as high as 98.9% and a detection speed improved by 23.2 FPS over the baseline model. In summary, the model can effectively identify suspicious objects concealed inside packages, providing a new idea for the detection of concealed objects in public places.

Author Contributions

Conceptualization, Y.Z. (Yuan Zhang) and Z.G.; methodology, H.C. and Z.G.; validation, H.C.; formal analysis, Y.J. and H.G.; data curation, Y.Z. (Yuan Zhang) and Z.G.; writing—original draft preparation, H.C.; writing—review and editing, Y.J., Y.Z. (Yang Zhao) and H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 61975053, No. 62271191), the Natural Science Foundation of Henan (No. 222300420040), the Program for Science and Technology Innovation Talents in Universities of Henan Province (No. 22HASTIT017, No. 23HASTIT024), the Open Fund Project of Key Laboratory of Grain Information Processing and Control, Ministry of Education, Henan University of Technology (No. KFJJ2021102), the major public welfare projects of Henan Province (No. 201300210100), and the Innovative Funds Plan of Henan University of Technology (No. 2021ZKCJ04).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Saeedkia, D.; Safavi-Naeini, S. Terahertz Photonics: Optoelectronic Techniques for Generation and Detection of Terahertz Waves. J. Light. Technol. 2008, 26, 2409–2423. [Google Scholar] [CrossRef]
  2. Jiang, Y.; Ge, H.; Zhang, Y. Quantitative analysis of wheat maltose by combined terahertz spectroscopy and imaging based on Boosting ensemble learning. Food Chem. 2020, 307, 125533. [Google Scholar] [CrossRef]
  3. Ge, H.; Ji, X.; Jiang, Y.; Wu, X.; Li, L.; Jia, Z.; Sun, Z.; Bu, Y.; Guo, C.; Zhang, Y. Tri-band and high FOM THz metamaterial absorber for food/agricultural safety sensing applications. Opt. Commun. 2024, 554, 130173. [Google Scholar] [CrossRef]
  4. Wan, M.; Healy, J.J.; Sheridan, J.T. Terahertz phase imaging and biomedical applications. Opt. Laser Technol. 2020, 122, 105859. [Google Scholar] [CrossRef]
  5. Yang, Z.; Tang, D.; Hu, J.; Tang, M.; Zhang, M.; Cui, H.L.; Wang, L.; Chang, C.; Fan, C.; Li, J. Near-Field Nanoscopic Terahertz Imaging of Single Proteins. Small 2021, 17, 2005814. [Google Scholar] [CrossRef]
  6. Tribe, W.R.; Newnham, D.A.; Taday, P.F.; Kemp, M.C. Hidden object detection: Security applications of terahertz technology. In Terahertz and Gigahertz Electronics and Photonics III; SPIE: Philadelphia, PA, USA, 2004; pp. 168–176. [Google Scholar]
  7. Chen, Z.; Wang, C.; Feng, J.; Zou, Z.; Jiang, F.; Liu, H.; Jie, Y. Identification of blurred terahertz images by improved cross-layer convolutional neural network. Opt. Express 2023, 31, 16035–16053. [Google Scholar] [CrossRef]
  8. Jia, Y.; Fu, K.; Lan, H.; Wang, X.; Su, Z. Maize tassel detection with CA-YOLO for UAV images in complex field environments. Comput. Electron. Agric. 2024, 217, 108562. [Google Scholar] [CrossRef]
  9. YOLOv5 Code. Available online: https://github.com/ultralytics/yolov5 (accessed on 9 August 2024).
  10. Kang, L.; Lu, Z.; Meng, L.; Gao, Z. YOLO-FA: Type-1 fuzzy attention based YOLO detector for vehicle detection. Expert Syst. Appl. 2024, 237, 121209. [Google Scholar] [CrossRef]
  11. Su, P.; Han, H.; Liu, M.; Yang, T.; Liu, S. MOD-YOLO: Rethinking the YOLO architecture at the level of feature information and applying it to crack detection. Expert Syst. Appl. 2024, 237, 121346. [Google Scholar] [CrossRef]
  12. Li, X.; He, M.; Liu, Y.; Luo, H.; Ju, M. SPCS: A spatial pyramid convolutional shuffle module for YOLO to detect occluded object. Complex Intell. Syst. 2023, 9, 301–315. [Google Scholar] [CrossRef]
  13. Cheng, L.; Ji, Y.; Li, C.; Liu, X.; Fang, G. Improved SSD network for fast concealed object detection and recognition in passive terahertz security images. Sci. Rep. 2022, 12, 12082. [Google Scholar] [CrossRef] [PubMed]
  14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  15. Danso, S.A.; Liping, S.; Hu, D.; Afoakwa, S.; Badzongoly, E.L.; Odoom, J.; Muhammad, O.; Mushtaq, M.U.; Qayoom, A.; Zhou, W. An optimal defect recognition security-based terahertz low resolution image system using deep learning network. Egypt. Inform. J. 2023, 24, 100384. [Google Scholar] [CrossRef]
  16. Zhang, H.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Cascade RetinaNet: Maintaining consistency for single-stage object detection. arXiv 2019, arXiv:1907.06881. [Google Scholar]
  17. Xu, F.; Huang, X.; Wu, Q.; Zhang, X.; Shang, Z.; Zhang, Y. YOLO-MSFG: Toward real-time detection of concealed objects in passive terahertz images. IEEE Sens. J. 2021, 22, 520–534. [Google Scholar] [CrossRef]
  18. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  19. Liu, Y.-L.; Wang, J.; Chen, X.; Guo, Y.-W.; Peng, Q.-S. A robust and fast non-local means algorithm for image denoising. J. Comput. Sci. Technol. 2008, 23, 270–279. [Google Scholar] [CrossRef]
  20. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2022; pp. 443–459. [Google Scholar]
  21. Zhou, D.; Hou, Q.; Chen, Y.; Feng, J.; Yan, S. Rethinking bottleneck structure for efficient mobile network design. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. pp. 680–697. [Google Scholar]
  22. Liu, S.; Huang, D. Receptive field block net for accurate and fast object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  23. Zhang, Y.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. arXiv 2021, arXiv:2101.08158. [Google Scholar] [CrossRef]
  24. Shi, C.; Han, L.; Zhang, K.; Xiang, H.; Li, X.; Su, Z.; Zheng, X. Improved RepVGG ground-based cloud image classification with attention convolution. Atmos. Meas. Tech. 2024, 17, 979–997. [Google Scholar] [CrossRef]
  25. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16794–16805. [Google Scholar]
  26. Ge, Z.; Zhang, Y.; Jiang, Y.; Ge, H.; Wu, X.; Jia, Z.; Wang, H.; Jia, K. Lightweight YOLOv7 algorithm for multi-object recognition on contrabands in terahertz images. Appl. Sci. 2024, 14, 1398. [Google Scholar] [CrossRef]
  27. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
  28. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar]
  29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
Figure 1. TeraFAST-256-300 system.
Figure 2. Real image of concealed object samples.
Figure 3. THz images of concealed objects. (a) Kraft paper box packaging and (b) polyethylene bag packaging.
Figure 4. NLM processing results. (a) Original THz image. (b) Image after NLM processing.
Figure 5. SMR–YOLO network structure.
Figure 6. Comparison of dilated convolutions with different dilation rates. (a) Dilated convolution with a dilation rate of 1; (b) dilated convolution with a dilation rate of 2; and (c) dilated convolution with a dilation rate of 4.
Figure 7. Structure of the RFB module.
Figure 8. Comparison of object detection results between SMR–YOLO and YOLOv7 models. (a) Object detection results of YOLOv7. (b) Object detection results of SMR–YOLO.
Table 1. Performance comparison of target detection algorithms.

Method        mAP@0.5 (%)   mAP@0.5:0.95 (%)   FPS
Faster–RCNN   93.3          75.5               34.1
RetinaNet     84.8          71.4               46.5
RTMDet        97.1          84.1               32.7
YOLOv5        97.5          84.7               87.4
YOLOv7        98.7          87.6               85.5
SMR–YOLO      98.9          89.4               108.7
Table 2. Ablation study.

Model   MobileNext   SPD-Conv   RFB   EIOU   LSK   P (%)   R (%)   mAP@0.5 (%)   mAP@0.5:0.95 (%)   FPS
1       ×            ×          ×     ×      ×     98.2    96.4    98.7          87.6               85.5
2       √            ×          ×     ×      ×     96.2    92.9    97.5          82.0               116.3
3       √            √          ×     ×      ×     96.5    94.0    97.7          82.2               116.3
4       √            √          √     ×      ×     97.5    95.2    98.3          82.5               116.3
5       √            √          ×     √      ×     97.5    97.0    98.6          86.1               116.3
6       √            √          √     √      ×     97.9    97.4    98.8          87.2               116.3
7       √            √          ×     √      √     98.9    96.4    98.6          86.1               111.1
8       √            √          √     √      √     98.7    98.7    98.9          89.4               108.7

