Hidden Dangerous Object Recognition in Terahertz Images Using Deep Learning Methods

Abstract: As a harmless detection method, terahertz imaging has become a new trend in security screening. However, the images collected by terahertz equipment are of inherently low quality, and the detection accuracy for dangerous goods is insufficient. This work introduces BiFPN at the neck of the YOLOv5 deep learning model as a mechanism to compensate for low resolution. We also perform transfer learning, fine-tuning the pre-trained weights of the backbone in our model. Experimental results reveal that the mAP@0.5 and mAP@0.5:0.95 values increase by 0.2% and 1.7%, respectively, attesting to the superiority of the proposed model over YOLOv5, a state-of-the-art model in object detection.


Introduction
The frequencies from 0.1 to 30 THz form the terahertz domain of the electromagnetic (EM) spectrum and bridge the gap between the microwave and infrared bands, as shown in Figure 1. Terahertz radiation is non-ionizing and penetrates many non-metallic materials: several non-crystalline materials, such as cloth, plastic, and paper, are transparent to THz rays. Its effect on organic tissue is such that it cannot damage DNA, making it safe for human applications such as medical imaging. Terahertz rays are, however, reflected by metal surfaces and absorbed by polar liquids such as water [1]. Recently, terahertz technology combined with deep learning has been a focus of research. Applications in the domain of object detection include the inspection of agricultural products [2][3][4][5], the detection of breast cancer and other medical conditions [6][7][8], and hidden object detection [9][10][11][12]. Nevertheless, terahertz imaging suffers from low resolution, with blurry and noisy dark-spotted images stemming from low-energy power sources [13][14][15], which in turn degrades detection accuracy and speed. Therefore, any attempt to increase the detection rate and accuracy must first address the challenge of low resolution while also improving the deep learning model.

Related Work
In this section, we review previous work focusing on terahertz image processing, especially on terahertz image recognition.

Terahertz Image Acquisition & Image Processing
Owing to a capture rate of up to 5000 lines per second, the TeraFAST-256 device is capable of scan speeds of up to 15 m/s. The sensor has a single sensitivity band at 100 ± 10 GHz, and the experimental power source operates at around 100 GHz. Image capture is enabled by a conveyor belt moving at 10.1 m/s. In Figure 2, we show the active THz imaging device used in this work, which uses a coherent source; the THz detectors and source operate in transmission or reflection geometry.

Dataset Description
This section presents the acquisition steps for the terahertz images and the expansion methods applied to the image dataset, which primarily encompass affine transformation, rotation, perspective transformation, and translation, among others. Subsequently, we perform a statistical analysis of the expanded dataset. The size of the images collected by the device is 512 px × 256 px. In all, eight different kinds of terahertz object images were collected: four types of weapon images and four types of non-weapon images (329 instances in total, since a single image may contain more than one instance). The raw data information is shown in Table 1 and Figure 3. To gain insight into the characteristics of the data, and thereby aid model optimization, statistical analysis is pivotal. First, the number of instances and the average bounding box size of the eight categories are shown in Table 1. As can be seen from Table 1, taking weapons as our target of discussion, the blade category has the fewest instances and the smallest average bounding box; the knife is the most numerous, followed by the screwdriver, both with relatively large bounding boxes. From Figure 3, it can be seen that box area ratios of about 3% to 10% make up the major portion, while a small fraction is concentrated around a 1% area ratio, and the maximum ratio is no more than 25%. Furthermore, Figure 4 shows a size distribution analysis of the different types of bounding boxes. An anchor, as used in Figure 4, denotes a set of predefined bounding boxes of specific height and width, essentially capturing the scale and aspect ratio of the object classes of interest. In addition, the THz image pixels are represented by 0-255 values as RGB (Red Green Blue).
Note that in RGB, a color is defined as a mixture of pure red, green, and blue light of varying strengths, with each of the red, green, and blue levels encoded as a value in the range 0-255, where 0 and 255 denote zero light and maximum light, respectively.

Terahertz Image Detection
Image object detection is a fundamental task in image processing: the task must judge not only what an object is but also where it is located. In recent years, with the rise of deep learning, object detection technology in the field of machine vision has developed rapidly. Traditional hand-crafted feature methods, such as HOG, SIFT and LBP [16][17][18][19], only achieve good results in specific scenes and cannot adapt to complex and diverse large-scale image data. The abstract features that adequately describe objects are often missed when designing features manually [20], and such pipelines usually need to be trained separately to perform multi-stage localization. The boom of deep learning based on convolutional neural networks (CNNs) provides another approach to object recognition [21]. Detection algorithms based on CNNs automatically extract image features through convolution layers, which greatly improves efficiency and accuracy. The earliest CNN-based detection algorithm is the R-CNN algorithm proposed by Ross Girshick et al. in 2014, followed by classic two-stage detection networks such as Fast R-CNN and Mask R-CNN [22][23][24][25][26]. Because a two-stage detection network consumes additional computing resources in the region proposal stage and has a large number of model parameters, its overall detection speed is slow. Researchers therefore proposed one-stage detection networks, represented by the YOLO series [16,27,28,29,30]. A one-stage detection network directly uses the output of the convolution feature map for classification prediction and position fitting, reducing the amount of computation. In addition, there are anchor-free algorithms such as FCOS and CenterNet.
Although the above algorithms achieve good results on public datasets such as PASCAL VOC [31,32] and COCO [33], some problems remain in applying them to terahertz image object detection. These problems are mainly caused by the characteristics of the terahertz dataset, namely (1) image blur and (2) uneven distribution of the image histogram (as shown in Figure 4). These characteristics cause detection errors when existing detection frameworks are applied to terahertz images. Based on the YOLOv5s model, this paper redesigns the head structure of the detection model for terahertz datasets and adopts the BiFPN structure [34,35], which realizes skip connections in convolution feature fusion and can fuse richer image features than the original PANet [36,37].
The main contributions of this work are summarized as follows:
1. Improving low resolution using BiFPN at the neck of the YOLOv5 deep learning model.
2. Performing transfer learning by fine-tuning the pre-trained weights of the backbone in our model.
The remainder of this work is organized as follows. In Section 3, we present our model, encompassing the backbone and neck, whereas in Section 4, we present experimental work involving image processing, model comparison and model transfer learning. We conclude the paper in Section 5.

Model Backbone
A key design element of a detection model is the backbone, which determines the quality of image feature extraction and affects the subsequent object classification, recognition and localization. The ResNet series is a widely used backbone network; it uses residual structures to solve the problem of vanishing or exploding gradients when training deep convolutional networks. The classical Fast R-CNN and RetinaNet models use a ResNet network as the backbone to extract rich image features. The detection model in this paper uses a cross-stage partial (CSP) structure with less computation [38]. This structure optimizes the feature maps at different stages of the ResNet network, as shown in Figure 5. The input feature map is split along the channel dimension into two signal flows, which are finally concatenated back together; in this way, the variability of the gradients through the network is preserved.
It is noteworthy in Figure 5 that the computational complexity of the ResNet network is O(C·H²·W²), with the complexity of the cross-stage partial structure essentially determined by the product of the value and key branches. The dimensions of the input feature maps are C × H × W. Note that ⊗ denotes matrix multiplication, whereas ⊕ is broadcast element-wise addition. In the ResNet network, the multiplication operation plays a pivotal role in capturing the attentional feature map. Since this multiplication is very similar to the one used in positional attention, a single operation can provide both the attentional feature map and cross-channel communication of information. With respect to the attention mechanism, the computation and memory occupation of channel attention are significantly lower than those of positional attention. Consequently, channel attention mechanisms are used instead of positional attention mechanisms; in this way, memory occupation and time complexity are greatly decreased without sacrificing performance.
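The channel split into two signal flows can be sketched as below. This is an illustration of the data flow only: `transform` is a hypothetical placeholder for the convolution stage that one flow passes through, not the actual CSP implementation.

```python
import numpy as np

def csp_block(x, transform):
    """Cross-stage partial flow for a feature map x of shape (C, H, W)."""
    c = x.shape[0] // 2
    part1, part2 = x[:c], x[c:]      # split channels into two signal flows
    part2 = transform(part2)         # only one flow passes through the conv stage
    return np.concatenate([part1, part2], axis=0)  # flows re-merge at the end
```

Because one flow bypasses the transform entirely, its gradient path is kept short, which is the source of the gradient variability mentioned above.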

Model Neck
The neck part of the detection network serves the role of convolution feature fusion. In the original YOLOv5s implementation, PANet is used as the neck, which adds a bottom-up pathway on top of the FPN. Because of the singularity of the terahertz image dataset, significant features are hard to obtain and fuse. As an extension of PANet, a bidirectional feature pyramid network (BiFPN) is adopted as our model's neck, as shown in Figure 6. It takes the level 3-5 input features {P3, P4, P5}, where Pi represents a feature level with a resolution of 1/2^i of the input image. For instance, our input terahertz image is resized to 640 px by 640 px; P3 then represents level 3 with resolution 80 px by 80 px (640/2^3 = 80). A skip connection is applied after inputs P4 and P5 to enhance the feature representation. The different features in BiFPN are brought to the same size by upsampling or downsampling and then fused: the output feature at each node is a normalized weighted sum of its inputs, O = Σi wi·Ii / (ε + Σj wj), with learnable non-negative weights wi.
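The weighted fusion at a BiFPN node can be sketched as below, a minimal illustration assuming EfficientDet-style fast normalized fusion; the weights here are plain numbers standing in for learned parameters.

```python
import numpy as np

def fused_node(features, weights, eps=1e-4):
    """Fuse same-sized feature maps: O = sum_i w_i * I_i / (eps + sum_j w_j)."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)  # ReLU keeps weights non-negative
    w = w / (eps + w.sum())                                # normalize so weights sum to ~1
    return sum(wi * f for wi, f in zip(w, features))
```

Equal weights reduce the node to a plain average, so a node can learn to favor whichever resolution carries the stronger signal.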

Classification and Regression Loss
The loss of this model consists of a classification loss and a regression loss. The classification loss adopts binary cross-entropy, defined as L_cls = −[y log(p) + (1 − y) log(1 − p)], where y is the ground-truth label and p is the predicted probability. The regression loss adopts the GIoU loss of the bounding box, L_GIoU = 1 − GIoU with GIoU = IoU − |C \ (A ∪ B)| / |C|, where IoU is expressed as IoU = |A ∩ B| / |A ∪ B|. The calculation of IoU and C for bounding boxes A and B is illustrated in Figure 7, where C denotes the smallest enclosing convex object. IoU fulfills all properties of a metric, such as non-negativity [39]. Note, however, that the IoU loss only works when the bounding boxes overlap: it provides no moving gradient for non-overlapping instances. In other words, IoU does not reflect whether two shapes are close to each other or very far apart. To address these shortcomings, we adopt the GIoU.
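The IoU and GIoU definitions can be made concrete for axis-aligned boxes; a minimal sketch, with boxes given as (x1, y1, x2, y2) corner tuples:

```python
def iou_giou(a, b):
    """Return (IoU, GIoU) for two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union
    # C: smallest axis-aligned box enclosing both A and B
    c_area = ((max(a[2], b[2]) - min(a[0], b[0]))
              * (max(a[3], b[3]) - min(a[1], b[1])))
    giou = iou - (c_area - union) / c_area
    return iou, giou
```

For disjoint boxes IoU is 0 regardless of distance, while GIoU goes increasingly negative as the gap grows, so the loss 1 − GIoU still provides a useful gradient.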

Models
In this subsection, we elucidate on the various models used in this paper (for the purpose of comparison).

YOLOv5 and Variants
The framework architecture of YOLOv5 is composed of three main parts: backbone, neck, and predict head. Primarily, the backbone is a convolutional neural network that aggregates and forms image features at different granularities. It extracts feature information from input pictures. On the other hand, the neck is a series of layers to mix and combine image features to pass them forward to prediction. Typically, the neck combines the gathered feature information and creates three different scales of feature maps. The prediction head consumes features from the neck and takes box and class prediction steps. This is completed by detecting objects based on the created feature maps. In fact, the YOLO model was the first object detector to connect the procedure of predicting bounding boxes with class labels in an end-to-end differentiable network.
It is worth mentioning that YOLOv5 utilizes the CSPDarknet53 framework with an SPP layer as the backbone, PANet as the neck, and the YOLO detection head. YOLOv5 computes the best anchor box values by running a clustering algorithm on each training dataset. Activation functions tried in YOLOv5 include sigmoid, LeakyReLU, and SiLU.
The five derived models of YOLOv5 are YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Although they share the same model architecture, each model has a different width and depth. Smaller models are faster and hence usually intended for mobile deployment, whereas larger models, although more computationally intensive, perform better.
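The width/depth scaling across the variants can be sketched as below. The multipliers used in the test (0.50/0.33 for the s-scale) and the round-channels-to-a-multiple-of-8 rule are assumptions modeled on the published YOLOv5 model YAMLs, not code taken from the paper.

```python
import math

def scale_layer(base_channels, base_repeats, width_mult, depth_mult):
    """Scale one layer's channel count and repeat count by the variant multipliers."""
    channels = math.ceil(base_channels * width_mult / 8) * 8  # keep channels divisible by 8
    repeats = max(round(base_repeats * depth_mult), 1)        # always at least one repeat
    return channels, repeats
```

With this scheme, all variants share one architecture definition and differ only in the two multipliers.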
Other variants of YOLOv5 also use CSP-Darknet (for instance, YOLOv5-P6 and YOLOv5-P7). In YOLOv5-P6, an additional large-object output layer, P6, is added, following the EfficientDet example of increasing the number of output layers for larger models; unique to this case, however, is that the change is applied to all models. Note that the base models have outputs from P3 (stride 8, small objects) to P5 (stride 32, large objects), while the P6 output layer has stride 64 and is designed for extra-large objects. The architectural changes made to add a P6 layer are key: the backbone is extended down to P6, and the PANet head now goes down to P3 (consistent with the state of the art) and back up to P6 instead of stopping at P5. New anchors, evolved at an image size of 1280, are also added. For brevity, we show in Figure 8 [40] the generalized architecture of YOLOv5.

YOLOv5 Ghost
In this model, the focus is on reducing redundancy in the intermediate feature maps computed by mainstream CNNs, and hence on reducing the resources (convolution filters) required for generating them. In practice, given input data x ∈ R^(c×h×w), where c is the number of input channels and h and w denote the height and width of the input, the operation of an arbitrary convolutional layer producing n feature maps can be formulated as Y = x ∗ f + b, where ∗ is the convolution operation, b is the bias term, and Y ∈ R^(h′×w′×n) is the output feature map with n channels. Here, f ∈ R^(c×k×k×n) are the filters, h′ and w′ are the height and width of the output, and k × k is the kernel size of the convolution filters f. The number of FLOPs required by this convolution is n·h′·w′·c·k·k, which is often very large, since the number of filters n and the channel number c are generally large (for instance, 256 or 512). From Equation (5), the number of parameters (in f and b) is explicitly determined by the dimensions of the input and output feature maps. The architecture of the model is shown in Figure 9 [41].
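The FLOP count n·h′·w′·c·k·k can be checked with a one-line helper, which also illustrates why standard convolutions get expensive once c and n reach 256:

```python
def conv_flops(n, h_out, w_out, c, k):
    """Multiply-accumulate count of a standard k x k convolution producing n maps."""
    return n * h_out * w_out * c * k * k
```

For example, a 3 × 3 convolution with c = n = 256 on an 80 × 80 output map already costs roughly 3.8 × 10^9 multiply-accumulates, which is the redundancy the Ghost module targets.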

YOLOv5-Transformer
As shown in Figure 10 [42], YOLOv5-Transformer (TRANS) employs a merger of MixUp, Mosaic and traditional data augmentation methods. Transformer Prediction Heads (TPH) are integrated into YOLOv5; essentially, these accurately localize objects in high-density scenes. The original prediction heads are replaced with TPH equipped with the self-attention mechanism, which enhances prediction. The transformer architecture adopts stacked self-attention and point-wise, fully connected layers for both the encoder and decoder (see the left and right halves of Figure 11 [43]).

YOLOv5-Transformer-BiFPN
In this model, exploring the prediction potential of self-attention based on YOLOv5, the TRANS module is integrated into the prediction heads in place of the original heads. This accurately localizes objects in high-density scenes and can handle large-scale variance of objects. Moreover, at the neck of the network, PANet is replaced with the simple but effective BiFPN structure to weight the combination of multi-level backbone features. The specific details of TRANS together with BiFPN are depicted in Figure 12 [44].

YOLOv5-FPN
YOLOv5-FPN uses PANet to aggregate image features. As demonstrated in Figure 13 [45], PANet builds on FPN's deep-to-shallow unidirectional fusion by incorporating secondary fusion from the bottom up and employing precise low-level localization signals to improve the overall feature hierarchy and encourage information flow.

Terahertz Image Processing
There are 329 images in the original dataset, each of size 512 px by 256 px. After image augmentation (such as flipping, warping, rotating and blending), there are 1884 images in total. The average size of the bounding box is 89.52 px by 74.45 px. The dataset is divided into a training set and a test set at a ratio of 8:2, which yields 1507 training images and 377 test images. During training, we also enabled online mosaic augmentation, as shown in Figure 14; that is, each input image is randomly fused from four sub-images. During training and testing, the input image size is set to 640 px by 640 px. To avoid the influence of pre-trained weights, all comparison models are trained from scratch; the batch size is set to 16, training runs on an RTX 2080 (8 GB) graphics card, and the number of training epochs is set to 200. In addition, the reasons for the improvement in the model's performance are analyzed.
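The mosaic step can be sketched as below: four sub-images tiled onto one canvas. This is a simplified illustration; the real pipeline also applies random scaling, cropping and label remapping, which are omitted here.

```python
import numpy as np

def mosaic(imgs, size=640):
    """Tile four grayscale images onto one size x size canvas, one per quadrant."""
    half = size // 2
    canvas = np.zeros((size, size))
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(imgs, corners):
        patch = img[:half, :half]                 # naive crop instead of random resize
        canvas[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return canvas
```

Each training sample thus exposes the network to four scenes at once, which effectively multiplies the object density seen per batch.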

Model Comparison
In this section, we compare the performance of the proposed model against existing general detection models. We adopt the detection metrics introduced in [33], which include average precision (AP) and average recall (AR) over multiple Intersection over Union (IoU) thresholds. The detection metrics are listed in Table 2, and the true positive and false positive matrices for calculating precision and recall are shown in Figure 15. The performance indicators are precision, recall, mAP@0.5 and mAP@0.5:0.95. mAP (mean average precision) is the average of the per-class AP values; AP is computed for each class and combined in certain situations. In some settings the two terms are used interchangeably; in the COCO sense, for instance, there is no distinction between AP and mAP [33].
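AP itself is the area under the precision-recall curve; a minimal all-point-interpolation sketch is shown below (the same monotone-envelope idea COCO uses, up to its 101-point recall sampling):

```python
def average_precision(recalls, precisions):
    """Area under the PR curve, with the precision envelope made non-increasing."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    for i in range(len(mpre) - 2, -1, -1):        # right-to-left precision envelope
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1))
```

mAP@0.5 averages this quantity over classes at an IoU threshold of 0.5; mAP@0.5:0.95 additionally averages over thresholds from 0.5 to 0.95 in steps of 0.05.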

Experiment Results
Because the research in this paper is based on YOLOv5, different improvement methods are tried in the experimental process, such as changing the transformer-based backbone, using the FPN network as neck, adding an additional prediction head, etc. The relevant experimental comparison results are shown in Table 2 and Figure 16.
It can be seen from the results that the best performance on the test set is achieved by using the BiFPN network as the neck. The test results per category are shown in Table 3. To analyze the detection differences between the models, we examine the convolution feature maps of the different models. Let the input image size be (C, H, W) and the convolution layer feature map be (c, h, w). First, we reduce along the channel dimension c by averaging, then scale the resulting (h, w) map to the original image size, and finally overlay it on the input to produce the final visualization. Figure 17 shows the feature maps of the different models; the (a)-(i) labels are the same as in Table 3. It can be seen from the feature maps that the BiFPN structure suppresses non-target features and reduces feature noise, while the original YOLOv5 model still produces feature responses at the edges of objects. The other models still show large errors in terahertz image feature extraction, which reduces their accuracy.
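The channel-mean-then-upscale visualization described above can be sketched as follows. Integer nearest-neighbor upscaling is assumed here for simplicity; a real pipeline would more likely use bilinear resizing.

```python
import numpy as np

def feature_heatmap(feat, out_h, out_w):
    """Turn (c, h, w) activations into a normalized (out_h, out_w) heat map."""
    heat = feat.mean(axis=0)                       # collapse the channel dimension
    sy, sx = out_h // heat.shape[0], out_w // heat.shape[1]
    heat = np.repeat(np.repeat(heat, sy, axis=0), sx, axis=1)  # integer upscaling
    rng = np.ptp(heat)                             # peak-to-peak range for normalization
    return (heat - heat.min()) / (rng + 1e-8)      # map values into [0, 1]
```

The normalized map can then be alpha-blended over the input image to produce overlays like those in Figure 17.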

Model Transfer Learning
Transfer learning is a common technique to accelerate model convergence in deep learning. The preceding experiments adopted training from scratch to ensure consistency; this section discusses the acceleration effect of transfer learning on the model. Since the backbone of the proposed network is consistent with the original YOLOv5 network, we can use the pre-trained weights of the backbone for transfer learning. The changes of the various indicators during training are shown in Figure 18, and the evaluation results on the test set are shown in Table 4.
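Reusing only the backbone weights amounts to filtering a full-model checkpoint down to the backbone entries before loading, so the neck keeps its fresh initialization. A minimal sketch; the `backbone.` key prefix is a hypothetical naming convention, not the actual YOLOv5 checkpoint layout:

```python
def backbone_subset(checkpoint, prefix="backbone."):
    """Keep only checkpoint entries whose keys belong to the backbone."""
    return {k: v for k, v in checkpoint.items() if k.startswith(prefix)}
```

The filtered dictionary is then loaded non-strictly into the new model, after which all layers are fine-tuned on the terahertz data.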
It can be seen from Figure 18 that transfer learning accelerates the convergence of network training and shortens training time while maintaining the same model accuracy. The fine-tuned network achieves better results on the test set, especially in the detection of some dangerous goods, as observed from Tables 3 and 4, where the mAP@0.5 and mAP@0.5:0.95 values increase by 0.2% and 1.7%, respectively. It is also evident from Table 2 that, compared to [46], our model achieves 0.5% and 7% higher detection accuracy on the same THz dataset under COCO's evaluation metrics.

Conclusions
Terahertz technology is a harmless security detection method, so the rapid and correct recognition of terahertz images is of great significance. In this paper, a terahertz image object detection method based on BiFPN feature fusion is proposed. The results show that, on our custom dataset, the proposed method outperforms the other improved models in terahertz feature extraction and classification. In subsequent research, we will focus on improving the terahertz image dataset and making it suitable for general object detection algorithms in the field of machine vision.

Conflicts of Interest:
The authors have no conflicts of interest that are relevant to the content of this article.