Lightweight YOLOv7 Algorithm for Multi-Object Recognition on Contrabands in Terahertz Images

Abstract: With the strengthening of worldwide counter-terrorism initiatives, it is increasingly important to detect contraband such as controlled knives and flammable materials hidden in clothes and bags. Terahertz (THz) imaging technology is widely used in the field of contraband detection due to its high imaging speed and strong penetration. However, terahertz images are of poor quality and lack texture details, and traditional target detection methods suffer from low detection speeds, misdetection, and omission of contraband. This work pre-processes the original dataset using a variety of image processing methods and validates the effect of these methods on the detection results of YOLOv7. Meanwhile, the lightweight and multi-object detection YOLOv7 (LWMD-YOLOv7) algorithm is proposed. Firstly, to meet the real-time demand of multi-target detection, we propose the space-to-depth mobile (SPD_Mobile) network as the lightweight feature extraction network. Secondly, the selective attention module, the large selective kernel (LSK) network, is integrated into the output of the multi-scale feature maps of the LWMD-YOLOv7 network, which enhances the effect of feature fusion and strengthens the network's attention to salient features. Finally, Distance Intersection over Union (DIOU) is used as the loss function to accelerate the convergence of the model and to improve the localisation of small targets. The experimental results show that the YOLOv7 algorithm achieves its best detection results on the terahertz image dataset after non-local mean filtering. The LWMD-YOLOv7 algorithm achieves a detection precision P of 98.5%, a recall R of 97.5%, and a detection speed of 112.4 FPS, which is 26.9 FPS higher than that of the YOLOv7 base network. LWMD-YOLOv7 thus achieves a better balance between detection accuracy and detection speed, and it provides a technological reference for the automated detection of contraband in terahertz images.


Introduction
In recent years, violent incidents have occurred frequently at crowded places such as public transport stations, resulting in many casualties and economic losses. Their social impact is severe, so the effective detection of contraband hidden in clothing and parcels has become an urgent need for public security. Traditional contraband detection is generally divided into baggage screening and human body screening. The most common approach to baggage screening is to use high-power X-ray scanners that penetrate the parcel to produce high-quality images of hidden contraband [1]. However, excessive X-ray radiation may harm the inspected person through ionisation, so X-ray scanning is not suitable for human body screening. For human body screening, metal security gates combined with hand-held metal detectors are therefore in widespread use and can effectively detect metal objects carried by a person. However, metal detectors cannot produce images that provide further information, and they cannot detect ceramic knives, lighters, and other non-metallic items. The low speed of personal inspection with metal detectors also frustrates the people being screened.
To overcome the limitations of these traditional means of contraband detection, contraband detection based on terahertz (THz) imaging technology has received extensive attention from researchers. A THz wave is an electromagnetic wave with a frequency between 0.1 THz and 10 THz and a wavelength of 0.03 to 3 mm, which has the characteristics of low photon energy and non-destructive detection [2]. THz waves can penetrate many commonly used non-polar materials, such as paper, textiles, and plastic products. THz imaging is an imaging technique based on THz waves, and it has developed rapidly to meet the demand for real-time high-resolution imaging [3,4]. Terahertz imaging technology is an effective complement to X-ray imaging technology because of its fast imaging speed, harmlessness to the human body, and strong penetration, and it has good potential for applications in the field of hazardous materials detection [5,6].
The detection of contraband in terahertz images often relies on visual observation, which has a low recognition efficiency. Therefore, extracting target features from terahertz images and automatically and accurately identifying and locating various types of targets is the key to efficient detection. Traditional target detection methods for terahertz images include threshold segmentation methods [7], edge-based detection methods [8], and clustering-based detection methods [9]. However, these traditional algorithms are not effective because terahertz images lack texture details and the target contours are not obvious, so misdetection and missed detection are likely in complex background environments.
In recent years, with the continuous development of deep learning technology, target detection algorithms based on convolutional neural networks (CNNs) have developed rapidly and have been widely used in various fields. The target detection task consists of finding the target location in the image and classifying the target. Deep learning-based target detection algorithms, which replace traditional manual feature selection, can be divided into two-stage and single-stage target detection according to the detection process [10]. The two-stage target detection algorithms, represented by Faster R-CNN [11] and Mask R-CNN [12], first generate candidate regions and then classify the candidate regions with convolutional neural networks. Two-stage algorithms have high accuracy, but their detection speed is slow. The single-stage target detection algorithms, represented by the YOLO [13][14][15][16] series and the SSD [17] series, do not need to generate candidate regions; they directly locate and classify the target with a deep convolutional neural network. The YOLO series algorithms are single-stage detectors with very good performance, and compared to the two-stage algorithms they achieve a better balance between detection accuracy and detection speed [10].
Currently, some researchers are devoted to the study of target detection algorithms for contraband in terahertz images. Lu et al. [18] proposed an improved SSD algorithm to improve the accuracy and speed of detecting concealed objects in terahertz images. The ResNet-50 network was used to replace the original VGGNet-16 network in SSD for feature extraction to overcome the problem of feature degradation, and a feature fusion module then fused deep and shallow features to construct features rich in semantic information, which is conducive to improving the detection accuracy of small targets. Danso et al. [19] introduced transfer learning based on the RetinaNet algorithm to improve the recognition accuracy of defects in terahertz images. Considering the small proportion of a terahertz image occupied by the object to be detected, a differential evolutionary search algorithm was used for optimisation, and the detection accuracy was further improved. Xu et al. [20] combined the spatial distance lattice of the geometric transformation matrix and designed a multi-scale filtering and geometric (MSFG) enhancement method to improve the detection accuracy of CNNs for passive terahertz images. This method was combined with an improved YOLOv5 to verify its detection accuracy on passive terahertz images. Although the above methods have been effective in terahertz image detection, there are many types of hidden objects, the size of the items to be detected varies greatly, and the texture features are limited. Using the limited features in low-resolution terahertz images to detect contraband quickly and accurately therefore remains a difficult task.
Aiming at the above problems, we carry out the following work: (1) We take multi-category concealed objects hidden in packages in a contraband detection scene as the research object, and we acquire the terahertz image data of the concealed objects. (2) In order to verify the impact of image processing methods on the target detection results, the acquired original terahertz images are pre-processed using three methods: non-local mean filtering, wavelet transform, and histogram equalisation. Further, the detection results of the target detection algorithm YOLOv7 on the processed images are recorded. (3) In order to meet the requirements of accuracy and real-time detection of dangerous goods, a newer version of the YOLO series, YOLOv7, is used as the base network; it is a single-stage detection algorithm with advantages in both speed and accuracy over its predecessors. At present, few improvements to YOLOv7 target the characteristics of terahertz images, so the model still needs to be specially designed for the detection of hidden objects in terahertz images. Firstly, to meet the real-time demand of the dangerous goods detection scene, the SPD_Mobile network is designed on the basis of the MobileNeXt network [21] to replace the original backbone of the YOLOv7 network. In addition, to address the insufficient feature extraction ability of the MobileNeXt network for smaller targets, the space-to-depth convolution (SPD_Conv) module [22] is introduced to reduce the omission of small objects caused by the original convolution module with a stride of 2, and to reduce the computation and parameter count of the model without sacrificing too much accuracy. Then, to address the lack of texture details in terahertz images and the similarity of the contours of some different kinds of objects, the selective attention module called the large selective kernel (LSK) network is integrated into the output of the multi-scale feature maps of the YOLOv7 network. This module selects a larger convolution kernel to adaptively adjust the size of the receptive field and thus better acquire the contextual information of the objects to be detected in a terahertz image, which enhances the feature fusion effect and strengthens the network's focus on salient features. Finally, the Distance Intersection over Union (DIOU) is used to measure the accuracy of prediction box localisation, and the samples are normalised and weighted so that the network focuses on training samples that are well-located and have high classification confidence, improving the accuracy and robustness of the detector.

Acquisition of Terahertz Image
In this work, the terahertz image dataset is acquired by a linear terahertz imaging system; the structure of the system is shown in Figure 1. The system consists of two main components: a 0.3 THz terahertz generator and a high-speed linear terahertz camera. The terahertz generator produces terahertz waves at a frequency of 0.3 THz, which are emitted through a flared antenna and dispersed uniformly so that the power received by the camera is the same everywhere; the THz beam is then passed via a concave mirror so that the beam emitted by the radiation source covers every pixel of the camera uniformly and efficiently. The high-speed linear terahertz camera has a scanning frequency of up to 5000 Hz and a scanning speed of 2.5 m/s. The camera operates at 300 GHz, which compensates for the lack of spatial resolution due to high-frequency scanning; the detector's individual pixel size is 0.5 mm × 0.5 mm, and the pixel array has a resolution of 256 × 1. The camera can be fixed underneath the conveyor belt. The sample to be detected is carried by the conveyor belt to the location of the camera, and the phase information of the sample can be observed in real time in the visualisation interface of the computer's software. Moreover, the spectral intensity of the system can be adjusted manually to change the brightness and hue of the image in the software's interface as needed.
In order to adapt to the arbitrary placement of objects to be tested in real contraband detection scenarios, we have prepared eight kinds of samples of concealed objects with multiple types (five kinds of dangerous objects, namely scissors, pistols, blades, lighters, and nails, and three kinds of non-dangerous objects, namely keys, nail scissors, and pens; some of the physical objects are shown in Figure 2). The blade category includes different types of blades, such as metal blades, utility knives, ceramic blades, and plastic knives. We have also prepared different sizes of scissors and nails. We hid them randomly with different poses in two kinds of packages, as shown in Figure 3, and we used the terahertz imaging system to acquire the terahertz image data. Each image acquired by the device is a pseudo-colour image of size 512 × 256, where the pixel size is 0.5 mm × 0.5 mm, and the colour of each image provides information about the spectral intensity.
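The acquisition figures quoted above can be cross-checked with a little arithmetic. The sketch below assumes that the conveyor moves at the quoted scanning speed while the camera reads out at its maximum line rate; the function names are illustrative, and which image side lies along the belt is not stated in the text.

```python
# Line-scan geometry check using only the figures quoted in the text.
LINE_RATE_HZ = 5000        # maximum scanning frequency of the linear camera
BELT_SPEED_MM_S = 2500     # 2.5 m/s scanning speed, expressed in mm/s
PIXEL_PITCH_MM = 0.5       # detector pixel size (0.5 mm x 0.5 mm)
IMAGE_SHAPE = (512, 256)   # image size in pixels


def along_track_step_mm(speed_mm_s, line_rate_hz):
    """Distance the belt travels between two consecutive line readouts."""
    return speed_mm_s / line_rate_hz


def field_of_view_mm(shape, pitch_mm):
    """Physical extent covered by an image with the given pixel pitch."""
    return tuple(n * pitch_mm for n in shape)


step = along_track_step_mm(BELT_SPEED_MM_S, LINE_RATE_HZ)
fov = field_of_view_mm(IMAGE_SHAPE, PIXEL_PITCH_MM)
print(f"along-track sampling step: {step} mm")
print(f"image footprint: {fov[0]} mm x {fov[1]} mm")
```

Under these assumptions the along-track sampling step equals the 0.5 mm pixel pitch, so the pixels are square on the sample, and a 512 × 256 image covers 256 mm × 128 mm.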

Pre-Processing Methods for Terahertz Image
The quality of the original terahertz images is poor. In order to verify the effect of different image processing methods on the target detection results, this paper uses three typical methods, namely non-local mean filtering, wavelet transform, and histogram equalisation, to pre-process the original images.
Gaussian noise exists in the original terahertz images, and the non-local means (NLM) algorithm can effectively weaken the Gaussian noise while retaining the original details of the image. The histogram equalisation (HE) algorithm is an image enhancement technique. Its main idea is to convert the grey-scale histogram of the original image from a relatively concentrated grey-scale range to a uniform distribution over the whole grey-scale range, which improves the contrast of images with an uneven grey-scale distribution. The wavelet transform is an extension of the Fourier transform that can separate the image signal and the noise under the same window to achieve image denoising.
To verify the impact of different image processing methods on the target detection results, this paper retains the original terahertz images and the results of the three kinds of image processing, constructs four datasets, and uses LabelImg 1.8.6 software to label the four sets of image data. Because the objects to be tested differ greatly in size and are located randomly in a real contraband detection scene, this paper applies data augmentation techniques, such as random cropping, rotating, flipping, and size transforming, to the above four datasets to improve the generalisation ability of the target detection model.
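Of the three pre-processing methods, histogram equalisation is the simplest to illustrate. The following is a minimal sketch of the standard algorithm on a single grey channel, written in plain Python; it is not the authors' implementation, and applying it to the pseudo-colour terahertz images would first require choosing a channel or colour space, which the text does not specify.

```python
def equalize_histogram(img, levels=256):
    """Histogram-equalise a greyscale image.

    img is a list of rows, each a list of ints in [0, levels).
    Returns a new image whose grey levels are spread over the full range.
    """
    # 1. Grey-level histogram.
    hist = [0] * levels
    for row in img:
        for v in row:
            hist[v] += 1

    # 2. Cumulative distribution of grey levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    if total == 0:
        return [list(row) for row in img]   # empty image: nothing to do
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        return [list(row) for row in img]   # constant image: leave unchanged

    # 3. Map each level so that the cumulative distribution becomes linear.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[v] for v in row] for row in img]


# A low-contrast patch whose values occupy only the levels 100..103.
low_contrast = [[100, 100, 101], [101, 102, 103]]
stretched = equalize_histogram(low_contrast)
print(stretched)
```

In the example, the darkest input level is mapped to 0 and the brightest to 255, which is the contrast-stretching effect described above.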

Improved YOLOv7 Network Model
In this paper, YOLOv7 is used as the benchmark model; it is a single-stage detection algorithm with both speed and accuracy advantages over its predecessors. The YOLOv7 algorithm incorporates strategies such as concatenation-based model scaling, the extended efficient layer aggregation network (E-ELAN), and RepVGG structural re-parameterisation [23]. Its network structure consists of three parts: Backbone, Neck, and Head. The YOLOv7 network's structure is shown in Figure 4.
The backbone consists of the convolutional basic block (CBS), the efficient layer aggregation network (ELAN) module, and the maximum pooling module (MP). The CBS module consists of a convolutional layer (Conv), a batch normalisation layer (BN), and a SiLU activation function layer, which are mainly used for feature extraction. The ELAN module branches feature maps at different scales through different depths and then splices them together so that the deeper network can be learned efficiently and can converge. The main function of the MP module is downsampling. By splicing the max-pooling downsampling branch and the convolution downsampling branch, the feature maps obtained by different downsampling methods are fused, which retains as much feature information as possible without increasing the amount of calculation. The neck consists of the feature pyramid network (FPN), the path aggregation network (PAN), and spatial pyramid pooling (SPPCSPC), and it is responsible for fusing features from different feature layers of the backbone network and from various scales of the detection layer to generate features that carry richer information. The neck fusion network fuses the feature maps from the backbone network at three different scales with the sampled feature maps of the neck itself, so as to retain both the abstract semantic features from the deep layers and the detailed information from the shallow layers. The detection head (Head) branches the three different sizes of feature maps output from the neck for multi-scale prediction and accelerates model inference through the re-parameterisation module (RepVGG Block, REP). To improve the performance of the terahertz image contraband recognition model, the following three improvements are made to the YOLOv7 model.
Appl. Sci. 2024, 14, 1398
(1) The large number of stacked ELAN modules in the original backbone network of YOLOv7 leads to too many parameters in the network and a large amount of computation. In this paper, the SPD_Mobile network is used as a lightweight backbone network. The network adopts the Sandglass structure proposed in MobileNeXt, and its structure is shown in Figure 5. The Sandglass structure is designed with reference to the inverted residual structure in MobileNet V2; it solves the problem of vanishing or exploding gradients caused by the residual structure in MobileNet V2 and improves the accuracy of the model. The structure follows the depthwise separable convolution (Dwise) [24] approach of MobileNet V1 to achieve a greater efficiency and a low weight.
The structure of the depthwise separable convolution (Dwise) is shown in Figure 6. The module consists of two parts: layer-by-layer convolution (depthwise convolution) and point-by-point convolution (pointwise convolution). Depthwise convolution is used for spatial filtering: the convolution is performed individually on each channel of the input feature map of size $H_1 \times W_1$ with $C_{in}$ channels, the size of the convolution kernel is generally $k \times k \times C_{in}$, and the number of output channels is consistent with the input. Pointwise convolution is used for information fusion: it combines the feature maps obtained in the previous step (depthwise convolution) to generate new feature maps, the size of the convolution kernel is $1 \times 1 \times C_{out}$, the number of output channels $C_{out}$ is consistent with the number of convolution kernels, and the size of the output feature maps is $H_2 \times W_2$.
Depthwise separable convolution reduces the number of parameters in the model compared to regular convolution. Convolution on the feature map using normal convolution has a computation $P_1$:
$$P_1 = k \times k \times C_{in} \times C_{out} \times H_2 \times W_2 \qquad (1)$$
The computation using depthwise separable convolution is $P_2$:
$$P_2 = k \times k \times C_{in} \times H_2 \times W_2 + C_{in} \times C_{out} \times H_2 \times W_2 \qquad (2)$$
The mathematical expression for the ratio $R$ of the computation of the depthwise separable convolution to that of the normal convolution is given in Equation (3):
$$R = \frac{P_2}{P_1} = \frac{1}{C_{out}} + \frac{1}{k^2} \qquad (3)$$
The head of MobileNeXt uses a convolution with a stride of 2, which reduces the size of the input feature map to half of its original size, producing a pooling-like effect. However, terahertz images contain small objects with low resolution, so this downsampling causes the omission of useful features. Therefore, in this paper, SPD_Mobile is designed as the backbone network, which introduces the SPD_Conv module to the network; the structure of SPD_Conv is shown in Figure 7.
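The saving from depthwise separable convolution can be checked numerically. The sketch below assumes the usual multiply-accumulate cost model, in which a normal convolution costs k·k·C_in·C_out per output position and a depthwise separable one costs k·k·C_in + C_in·C_out; the layer sizes are illustrative and are not taken from the paper.

```python
def standard_conv_cost(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulates of a normal k x k convolution."""
    return k * k * c_in * c_out * h_out * w_out


def depthwise_separable_cost(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulates of a depthwise k x k plus a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in * h_out * w_out      # one k x k filter per input channel
    pointwise = c_in * c_out * h_out * w_out      # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 64 -> 128 channels, 80 x 80 output map.
k, c_in, c_out, h, w = 3, 64, 128, 80, 80
p1 = standard_conv_cost(k, c_in, c_out, h, w)
p2 = depthwise_separable_cost(k, c_in, c_out, h, w)
ratio = p2 / p1
closed_form = 1 / c_out + 1 / k ** 2

print(f"P1 = {p1}, P2 = {p2}, ratio = {ratio:.4f}")
```

The ratio depends only on the kernel size and the number of output channels, not on the spatial size of the feature map, so for a 3 × 3 kernel the cost drops to a little over one ninth of the normal convolution.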
The SPD-Conv module consists of a space-to-depth layer and a non-strided convolutional layer. The space-to-depth layer transforms the feature map $X$ of size $S \times S \times C_1$ into a sequence of sub-feature maps, and the conversion formula is as follows:
$$f_{x,y} = X[x : S : scale,\; y : S : scale], \quad x, y \in \{0, 1, \ldots, scale - 1\}$$
The intermediate feature map $X'$ is obtained by concatenating these sub-feature maps along the channel dimension with the Concat operation, and a non-strided convolution with a stride of 1 is then applied to obtain the final feature maps, which effectively reduces the loss of fine-grained features.
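The space-to-depth rearrangement described above can be sketched in a few lines. This is a plain-Python illustration on nested lists, assuming a square S × S × C input and a scale of 2; the channel ordering of the concatenated sub-maps is an arbitrary choice here, and the stride-1 convolution that follows in SPD-Conv is omitted.

```python
def space_to_depth(x, scale=2):
    """Rearrange an S x S x C feature map into (S/scale) x (S/scale) x (scale^2 * C).

    x is a nested list indexed as x[row][col][channel]. Every scale x scale
    spatial block is moved into the channel dimension, so no value is discarded.
    """
    s = len(x)
    if s % scale != 0 or any(len(row) != s for row in x):
        raise ValueError("expected a square map whose side is divisible by scale")

    out = []
    for i in range(0, s, scale):
        out_row = []
        for j in range(0, s, scale):
            channels = []
            # Sub-map f_{dx,dy} contributes the pixel at offset (dx, dy) of this block.
            for dx in range(scale):
                for dy in range(scale):
                    channels.extend(x[i + dx][j + dy])
            out_row.append(channels)
        out.append(out_row)
    return out


# 4 x 4 x 1 input holding the values 0..15 in row-major order.
feature_map = [[[4 * r + c] for c in range(4)] for r in range(4)]
packed = space_to_depth(feature_map, scale=2)
print(len(packed), len(packed[0]), len(packed[0][0]))  # spatial size halves, channels x4
print(packed[0][0], packed[1][1])
```

Unlike a stride-2 convolution or pooling, this downsampling keeps every input value, which is why it is attractive for small, low-resolution targets.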
(2) The lack of texture features of objects in terahertz images, as well as the similarity of the outlines of some different kinds of objects, can easily lead to misdetection. Therefore, it is crucial to fully consider the contextual information of the target and to strengthen the attention to key features in order to improve accuracy. In this paper, we incorporate the selective attention mechanism LSK module before the output of the multi-scale feature maps of the YOLOv7 network, which can dynamically adjust the receptive field size to deal effectively with targets of different ranges [25]. The LSK module consists of a large kernel convolution and a spatial kernel selection mechanism. The structure of this module is shown in Figure 8.
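How a sequence of small depthwise convolutions builds up a large receptive field can be illustrated with the standard recurrence for stacked stride-1 dilated convolutions, in which each layer with kernel size k and dilation d adds d·(k − 1) to the receptive field. The kernel and dilation pairs below are illustrative assumptions, not values reported in this paper.

```python
def receptive_fields(kernels, dilations):
    """Receptive field after each layer of a stack of stride-1 dilated convolutions.

    Each layer with kernel size k and dilation d enlarges the receptive field by
    d * (k - 1); with a first dilation of 1 the first receptive field equals k.
    """
    if len(kernels) != len(dilations):
        raise ValueError("kernels and dilations must have the same length")
    fields = []
    rf = 1  # a single pixel before any convolution
    for k, d in zip(kernels, dilations):
        rf += d * (k - 1)
        fields.append(rf)
    return fields


# Two illustrative decompositions of a large kernel.
print(receptive_fields([5, 7], [1, 3]))
print(receptive_fields([3, 5, 7], [1, 2, 3]))
```

For instance, a 5 × 5 convolution followed by a 7 × 7 convolution with dilation 3 covers a 23 × 23 neighbourhood, at a fraction of the parameters of a single 23 × 23 kernel.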
Firstly, a large convolution kernel is constructed by decomposing it into a sequence of $N$ depthwise convolutions whose kernel size $k$ and dilation rate $d$ increase step by step; the receptive field $RF$ grows accordingly with $k$ and $d$. The kernel size, dilation rate, and receptive field of the $i$-th depthwise convolution are as follows:
$$k_{i-1} \le k_i; \quad d_1 = 1, \quad d_{i-1} < d_i \le RF_{i-1}$$
$$RF_1 = k_1, \quad RF_i = d_i (k_i - 1) + RF_{i-1}$$
To obtain rich contextual information features in different ranges of the input feature map $X$, the LSK module uses a series of depthwise convolutions $F^{dw}_i(\cdot)$ with different receptive fields to process the feature maps using the following formula:
$$U_0 = X, \quad U_{i+1} = F^{dw}_i(U_i)$$
After obtaining the feature map $U_i$, it is then processed with a 1 × 1 convolution $F^{1 \times 1}_i(\cdot)$:
$$\tilde{U}_i = F^{1 \times 1}_i(U_i), \quad i \in [1, N]$$
Next, the feature maps of different receptive fields $\tilde{U}_i$ are concatenated as follows:
$$\tilde{U} = [\tilde{U}_1; \ldots; \tilde{U}_N]$$
The spatial features are then efficiently extracted using average pooling and maximum pooling:
$$SA_{avg} = P_{avg}(\tilde{U}), \quad SA_{max} = P_{max}(\tilde{U})$$
To make full use of the contextual information of targets and to achieve information interaction between different spaces, the LSK module splices the spatially pooled features and uses a convolutional layer to convert the 2-channel pooled features into an $N$-channel spatial attention feature map:
$$\widehat{SA} = F^{2 \to N}([SA_{avg}; SA_{max}])$$
Then, the feature maps of different receptive fields are spatially weighted and summed accordingly, and the attention feature $S$ is obtained using the convolutional layer $F$:
$$S = F\left(\sum_{i=1}^{N} \sigma(\widehat{SA}_i) \cdot \tilde{U}_i\right)$$
where $\sigma$ denotes the sigmoid function. The final output of LSK is the product of the input feature $X$ and the attention feature $S$:
$$Y = X \cdot S$$
(3) This paper proposes a sample-weighted training strategy based on DIOU to enable the model to focus on training well-located samples with high classification confidence and to improve the model's detection accuracy. The YOLOv7 localisation loss adopts Complete Intersection over Union (CIOU), which is based on an improved IOU function, where IOU is the intersection-over-union ratio of the predicted box and the true box, and its expression is as follows:
$$IOU = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$$
where $B$ denotes the predicted box and $B^{gt}$ is the true box. CIOU is defined in Equation (20):
$$CIOU = IOU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha\nu \qquad (20)$$
where $\rho^2(b, b^{gt})$ represents the squared Euclidean distance between the centroid of the predicted box and the centroid of the true box; $b$ and $b^{gt}$ represent the centroids of the predicted box $B$ and the true box $B^{gt}$, respectively; $c$ is the diagonal length of the smallest box enclosing both boxes; $\alpha$ represents a positive trade-off parameter; and $\nu$ measures the consistency of the aspect ratio of the predicted box with that of the true box. The formulas of $\alpha$ and $\nu$ are as follows:
$$\alpha = \frac{\nu}{(1 - IOU) + \nu}$$
$$\nu = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$$
where $w$, $h$ and $w^{gt}$, $h^{gt}$ are the widths and heights of the predicted box and the true box, respectively. In the contraband detection scene of terahertz images, there are many tiny targets such as razor blades, keys, and nails. Moreover, the aspect ratio of the true boxes does not contribute much to the detection of small objects, so the detection performance of CIOU for small targets is not as good as that of the DIOU sample training strategy [26]. Compared to CIOU, DIOU only uses the distance between the centroids of the predicted box and the true box as a penalty term and does not consider the similarity of the aspect ratio. Its expression is as follows:
$$DIOU = IOU - \frac{\rho^2(b, b^{gt})}{c^2}$$
Based on the above improvement measures, this work proposes the LWMD-YOLOv7 algorithm. The network structure of LWMD-YOLOv7 is shown in Figure 9.
Based on the above improvement measures, this work proposes the LWMD-YOLOv7 algorithm.The network structure of LWMD-YOLOv7 is shown in Figure 9.
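To make the difference between the two localisation losses concrete, the following is a minimal sketch that computes IOU, the DIOU loss, and the CIOU loss for a single pair of boxes. It assumes the commonly used loss forms (1 − IOU plus the centre-distance penalty, with the additional aspect-ratio term αν for CIOU), boxes given as (x1, y1, x2, y2) corners with positive width and height, and a small eps for numerical stability; the function name is hypothetical and this is not the authors' training code.

```python
import math


def iou_losses(pred, target, eps=1e-9):
    """Return (IOU, DIOU loss, CIOU loss) for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection and union areas
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + eps)

    # Squared distance between box centres (rho^2)
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
        + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Squared diagonal of the smallest enclosing box (c^2)
    enclose_w = max(px2, tx2) - min(px1, tx1)
    enclose_h = max(py2, ty2) - min(py1, ty1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Aspect-ratio consistency v and trade-off weight alpha (used by CIOU only)
    v = (4 / math.pi ** 2) * (
        math.atan((tx2 - tx1) / (ty2 - ty1)) - math.atan((px2 - px1) / (py2 - py1))
    ) ** 2
    alpha = v / ((1 - iou) + v + eps)

    diou_loss = 1 - iou + rho2 / c2
    ciou_loss = diou_loss + alpha * v
    return iou, diou_loss, ciou_loss


# Two equally shaped boxes: the aspect-ratio term vanishes, so both losses agree
print(iou_losses((0, 0, 2, 2), (1, 1, 3, 3)))
```

When the predicted and true boxes have the same aspect ratio, the two losses coincide; they differ only when the shapes differ, which is the term the text argues contributes little for very small targets.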

Image Pre-Processing Results
The terahertz concealed-object image data in this paper contain eight categories in total: five kinds of dangerous objects (scissors, pistols, blades, lighters, and nails) and three kinds of non-dangerous objects (keys, nail scissors, and pens); hand knives and small razor blades are both classified as blades. Each image is a pseudo-colour image with a size of 512 × 256. During image acquisition, we obtained terahertz images of different hues by adjusting the brightness and contrast in the software system; the original images of the objects concealed in two kinds of packaging are shown in Figure 10. The image dataset contains the eight kinds of samples at different scales. Small objects such as nails, keys, and small razor blades occupy a relatively small proportion of each image. Some of these objects are not displayed completely, such as the nail at the bottom of Figure 10i, which is likely to cause omissions. Additionally, utility knives and pens have similar outlines and are likely to be misdetected because of the lack of texture features in the terahertz images, as shown in Figure 10d.
To verify the impact of different image processing methods on the target detection results, this paper uses two methods, non-local means filtering and histogram equalisation, to pre-process the original terahertz images, and the results of both processes are preserved. The terahertz image before and after non-local means (NLM) filtering is shown in Figure 11. The processed image retains the original features of the objects while weakening the noise around the targets. The terahertz image after histogram equalisation (HE) enhancement is shown in Figure 12. As shown in Figure 12a, some of the original terahertz images have low contrast and indistinct features, and their grey-scale histograms are unevenly distributed (Figure 12b). After histogram equalisation, the contrast of these images is improved (Figure 12c), and their pixel values are uniformly distributed between 0 and 255 (Figure 12d).

The original terahertz image dataset is denoted D1, the dataset processed with non-local means filtering D2, the dataset processed with histogram equalisation D3, and the dataset processed with wavelet transform D4. After data augmentation, each of the four datasets is expanded to 3556 images, and the size of each image is 640 × 640. Some of the augmented images are shown in Figure 13 (taking D1 as an example). Each of the four expanded datasets is divided into a training set, a validation set, and a test set in an 8:1:1 ratio.
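The histogram equalisation step used to produce D3 can be illustrated with a minimal pure-Python sketch. It assumes a single 8-bit grey channel supplied as a flat list of integers and the classic CDF-based mapping; how the pseudo-colour terahertz images were actually equalised is not specified here, and the function name is hypothetical.

```python
def equalise_histogram(pixels, levels=256):
    """Histogram-equalise a flat list of integer grey levels in [0, levels - 1]."""
    if not pixels:
        return []

    # Histogram of grey levels
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution function (running sum of the histogram)
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = len(pixels)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        return list(pixels)  # constant image: there is no contrast to stretch

    # Look-up table that spreads the occupied grey levels over the full range
    lut = [
        max(0, round((c - cdf_min) / (total - cdf_min) * (levels - 1)))
        for c in cdf
    ]
    return [lut[p] for p in pixels]


# A low-contrast patch whose values sit in a narrow band
print(equalise_histogram([100, 101, 102, 103]))  # [0, 85, 170, 255]
```

The narrow band 100 to 103 is stretched across the whole 0 to 255 range, which mirrors the change between the histograms in Figure 12b and Figure 12d.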
The experiments in this paper were conducted on Windows 10 with an NVIDIA GeForce RTX 3060 graphics card, using Python 3.7 as the programming language. Using the four sets of augmented terahertz image data (D1, D2, D3, D4) described above, the YOLOv7 base model was trained with the deep learning framework PyTorch 1.13, and the prediction results on the four datasets were compared. The dataset with the best prediction results was then selected to train the improved YOLOv7 model. To avoid the influence of pre-trained weights on the results, each model was trained from scratch, with the batch size set to 16 and the number of training epochs set to 150.

Model Evaluation Criteria
To effectively evaluate the performance of the object detection model, this work uses several metrics: mean average precision (mAP), precision (P), recall (R), and frames per second (FPS). Here, P reflects how many of the samples predicted to be positive are truly positive, and R reflects how many of the truly positive samples are predicted to be positive. They are calculated, respectively, as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

As seen in Table 1, the YOLOv7 algorithm achieves the best R and mAP@.5 (%) on dataset D2, indicating that the NLM-processed terahertz images are more conducive to YOLOv7 learning the important features in the image.
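The precision and recall metrics used in these comparisons can be computed directly from detection counts. The sketch below is illustrative only: the function name is hypothetical and the counts in the example are made up, not taken from Table 1.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    # Precision: share of predicted positives that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were detected
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts: 90 correct detections, 10 false alarms, 30 missed objects
print(precision_recall(90, 10, 30))  # (0.9, 0.75)
```

A detector that misses small objects lowers R through FN, while confusing similar outlines (for example, pens and utility knives) lowers P through FP.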
To further validate the effectiveness of the improved model, a series of ablation experiments were conducted on the test set of D2 using YOLOv7 as the benchmark. The results are shown in Table 2, where "+" means that the module is added to the network and "-" means that it is not. For example, model 1 is the YOLOv7 benchmark. In model 2, MobileNext is used as the backbone network. In model 4, the SPD_Mobile network is used as the backbone network and DIOU as the training strategy. Model 6 is the LWMD-YOLOv7 network proposed in this work. Table 2 shows that using MobileNext as the backbone network greatly improves the detection speed (FPS) of the algorithm but also decreases both the precision and the recall. Using DIOU as the loss function improves the detection accuracy. Compared to the MobileNext network, the SPD_Mobile network proposed in this paper improves the recall and reduces omissions of small targets. Introducing the selective attention mechanism LSK module into the network improves the detection accuracy and reduces the false detections caused by similar outlines and the lack of detailed features in terahertz images. In summary, the detection speed of the LWMD-YOLOv7 algorithm (model 6) reaches 112.4 FPS, which is 26.9 FPS higher than that of the YOLOv7 benchmark network. It also achieves a precision (P) of 98.5% and a recall (R) of 97.5%, both higher than those of the YOLOv7 benchmark network. Moreover, its mean average precision (mAP@.5) differs little from that of the YOLOv7 benchmark network, and its precision (P) is the highest among the models compared. This indicates that the improved YOLOv7 network effectively improves the detection efficiency without much loss in accuracy. The detection results of the LWMD-YOLOv7 algorithm in each category are shown in Table 3.
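As a quick arithmetic check on the reported speed figures, the baseline frame rate and the per-frame latency implied by 112.4 FPS and the 26.9 FPS gain can be derived as follows. Only those two numbers come from the text; the derived values are simple arithmetic and the variable and function names are illustrative.

```python
lwmd_fps = 112.4   # reported detection speed of LWMD-YOLOv7
fps_gain = 26.9    # reported improvement over the YOLOv7 benchmark

# Implied speed of the YOLOv7 benchmark network
baseline_fps = round(lwmd_fps - fps_gain, 1)


def frame_time_ms(fps):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / fps


speedup = lwmd_fps / baseline_fps

print(baseline_fps)  # 85.5
print(round(frame_time_ms(baseline_fps), 2), round(frame_time_ms(lwmd_fps), 2))
print(round(speedup, 2))
```

Expressed this way, the improvement corresponds to roughly a 1.3× speed-up, or a few milliseconds saved per frame.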
An analysis of the overall detection results of the proposed algorithm and its recognition accuracy in different categories shows that the LWMD-YOLOv7 network reaches an mAP@.5 of 98.6% averaged over all categories. The proposed algorithm achieves high precision (P) for both the pen and blade categories, even though the blade category contains utility knives, whose outlines are similar to those of pens. Although terahertz images lack texture features, LWMD-YOLOv7 can still identify these objects accurately, demonstrating its effectiveness in contraband detection scenes.

Conclusions
To meet the requirements of real-time and accurate detection in real contraband scenes, the lightweight algorithm LWMD-YOLOv7 is proposed for the detection of contraband in terahertz images. To address the background noise and limited texture features in terahertz images, we pre-processed the original terahertz image dataset using non-local means filtering, histogram equalisation, and wavelet transform; retaining the original data together with the results of these methods, we constructed four groups of datasets and used them to train the YOLOv7 network. The results show that the YOLOv7 network achieves the best results on the terahertz image data processed with non-local means filtering. Furthermore, the LWMD-YOLOv7 algorithm is proposed, which uses the SPD_Mobile network as the YOLOv7 feature extraction network to construct a low-complexity, high-efficiency detection model. In addition, the selective attention mechanism LSK module is incorporated before the output of the multi-scale feature maps of the YOLOv7 network. By using a larger receptive field, it makes full use of the contextual information of the targets, which enhances the feature fusion effect and strengthens the network's focus on salient features, reducing the false detection rate caused by similar object outlines and the lack of texture features in terahertz images. The DIOU training strategy is used to measure the localisation accuracy of the prediction boxes and so improve the model's detection accuracy. The experimental results show that, compared to YOLOv7, the LWMD-YOLOv7 algorithm significantly improves the detection efficiency and meets the demand of real-time contraband detection while maintaining high detection accuracy, which provides a technical reference for the automated detection of contraband in terahertz images.

Figure 1. Terahertz scanning imaging system. (a) The structure of the system; (b) the physical diagram of the system; (c) the software interface of the system.

Figure 2. Physical diagrams of some concealed object samples to be tested.

Figure 3. Two types of packaging for concealed objects: (a) Kraft paper packaging; (b) polythene packaging.

Figure 7. SPD-Conv structure diagram. Different colours represent the sequence of sub-features of the input feature map.

Figure 10. Two types of packaging for concealed objects. (a) Scissors and two nail scissors (polythene packaging); (b) key and two blades (polythene packaging); (c) pistol and lighter (polythene packaging); (d) pen, nail, and blade (polythene packaging); (e) pistol and two keys (Kraft paper packaging); (f) scissors and blade (Kraft paper packaging); (g) key and scissors (Kraft paper packaging); (h) nail and blade (Kraft paper packaging); (i) three nails and two blades (polythene packaging).

Figure 12. HE-processed THz image and its corresponding grey scale distribution histogram. (a) Original THz image; (b) grey scale distribution histogram of (a); (c) HE-processed image; (d) grey scale distribution histogram of (c).

Figure 13. Some images in the D1 dataset after data augmentation.

Table 1. Detection results of YOLOv7 base model on three datasets.

Table 3. Detection results of LWMD-YOLOv7 on different categories.