FSVM: A Few-Shot Threat Detection Method for X-ray Security Images

In recent years, automatic detection of threats in X-ray baggage has become important in security inspection. However, the training of threat detectors often requires extensive, well-annotated images, which are hard to procure, especially for rare contraband items. In this paper, a few-shot SVM-constraint threat detection model, named FSVM is proposed, which aims at detecting unseen contraband items with only a small number of labeled samples. Rather than simply finetuning the original model, FSVM embeds a derivable SVM layer to back-propagate the supervised decision information into the former layers. A combined loss function utilizing SVM loss is also created as the additional constraint. We have evaluated FSVM on the public security baggage dataset SIXray, performing experiments on 10-shot and 30-shot samples under three class divisions. Experimental results show that compared with four common few-shot detection models, FSVM has the highest performance and is more suitable for complex distributed datasets (e.g., X-ray parcels).


Introduction
X-ray baggage screening is a crucial component of security inspection. With the rise of deep learning, many automatic threat detection methods based on deep-learning methods have been proposed to increase the effectiveness of security screening and lower the labor intensity of human operators [1]. Deep neural networks, which benefit from deeper layer depth and a larger hypothesis space than conventional opitimization algorithms, can be trained with large amounts of data to achieve expressive results [2]. However, automated detection methods based on supervised learning are highly dependent on the quantity and reliability of annotations in datasets. For X-ray security datasets, collecting large amounts of well-labeled images for each type of contraband is extremely difficult due to the wide variety of contraband categories. Threats in the same categories also have different morphologies. Current public datasets for X-ray baggage only support a few contraband categories, which is far from sufficient to meet practical needs [3]. Furthermore, the high demand for annotations will make the threat detection model more difficult to respond to and employ, increasing the security risk. It is also common for criminals to carry unconventionally shaped contraband, such as shaped knives and weapons, to avoid security inspection, which is more likely to be overlooked. There is a great need to construct threat detection models when annotated samples are scarce.
Big data models are difficult to achieve satisfactory results from a limited number of samples. For recognition tasks, a better embedding space enables features to be more separable on the hyperplane [4,5]. However, models, in the data-hunger setting, may end up with an overfitted representation of the problem. In recent years, few-shot learning(FSL) defines it as an optimization problem in a limited data regime [6]. To acquire a discriminative feature, a few-shot model must utilize some kind of prior knowledge, e.g., reducing parameters to avoid overfitting or searching for a good initialization of the hypothesis Figure 1. Baggage samples in the X-ray security dataset. As shown in the pictures above, prohibited items(Knife, Gun, Wrench) will be cluttered together, making their outline complicated. They are also easily occluded by the compact structures, e.g., laptops and metal zippers.
To bridge this gap, this work proposed a few-shot threat detection method for X-ray security images. Given the complex distribution of X-ray security contraband, we constructed an SVM module to embed it into the original base detector for a few-shot learning task. Within the SVM embedding module, an additional supervised classification task was built to transfer the higher-order decision information into the former layers. The higher-order information, which is in the form of gradients, will back-propagate to finetune the unfrozen parameters. Through the end-to-end training process, the proposed method FSVM will produce the embedding space more suitable for complex distributed data. Experimental evidence proves that FSVM can support situations where the number of threat samples is extremely scarce and achieve better results compared with other few-shot object detection methods.
Our contributions can be concluded as follows: • We proposed a few-shot threat detection model called FSVM for X-ray security images, enabling the model to detect novel contraband items with only a small number of samples. • To generate an informative embedding space, an SVM embedding module is proposed, which can be trained end-to-end, and embedded it freely into an object detection model. • The SVM loss is proposed as a part of the joint loss function. This additional constraint, used in the fine-tuning stage, will back-propagate the supervised information to former layers. Trained with this joint loss, the proposed model is more suitable for few-shot application scenarios.

Automatic Threat Detection for X-ray Security Images
The X-ray imaging principle is that the X-ray tube generates beams that penetrate the scanned items. Depending on the material density, high-density items and low-density items on the X-ray images have distinguishable colors, which facilitates the manual screening process. However, prohibited items are occluded from each other in the single-view imaging, giving different forms to the same type of contraband. The cluttered nature of baggage makes the task harder. Furthermore, the human operators need sufficient knowledge and experience to achieve a confident inspection, which is hard to cultivate. Operators in practice could not maintain an average performance under excessive work intensity.
To tackle the above problems, automated detection methods have been proposed in recent years. With the development of deep learning methods, deep neural networks are showing their superb learning and generalization abilities. Akcay et al. [16] applied a CNN-based transfer learning method on the X-ray baggage security dataset, achieving 88.5% mean average precision(mAP) on a six-class object detection task and 97.4% mAP on a two-class firearm detection task. Hassan et al. [17] enhanced ROI features by extracting contour information from the same object placed at different angles, and detection on SIXray and GDXray achieved 96.8% mAP, exceeding the human-recognizable accuracy of 94.5%.
Supervised methods have shown promising performance in X-ray threat detection, while the lack of annotations limits contemporary deep model training. In response to this weakness, Xu et al. [18] proposed a method for identifying and utilizing attention mechanisms to generate heatmaps intermediately to localize prohibited items. However, this approach still requires sufficient labels to specify the category information.

Few-Shot Learning
Data collection has become one of the bottlenecks in the progress of deep learning. A child, however, could learn novel concepts by reviewing a small amount of data and annotations. To overcome the learning gap between human and artificial intelligence, few-shot learning (FSL) has become a promising research area. It aims at improving the model effect with only a few annotated samples (usually under 30-shots) while getting as close as possible to many-shot models. Initially, few-shot learning only focused on few-shot classification (FSC) [6], but many few-shot object detection (FSOD) methods have emerged in recent years for practical applications. FSOD is more complex than classification as it requires not only classifying categories but also predicting location boxes [19].
In few-shot learning tasks, categories are divided into base/seen classes and novel/unseen classes. Base classes have sufficient annotated data, and novel classes only have a few. Several transfer learning based FSOD methods first train on base classes, and fine-tune on novel classes [20,21]. Wang et al. [20] suggests a transfer learning method, which can effectively improve the model's generalization of the novel class while alleviating the model's forgetfulness of the base class. For few-shot learning tasks, a representative embedding space can enable a representative feature for novel class samples [22]. Sun et al. [21] combines a contrast learning approach to guide parameter updates by constructing CPE losses. The model maps the proposed features to a feature space with smaller intra-class distances and larger inter-class distances, allowing similar instances of different classes to be more distinguishable. Our method conducts research based on such methods.

End-to-End Trainable Embedding Layer
Enriching the representational information extracted by deep learning networks is key to improving the recognition ability of the model [23], and this representational information is expressed as a high-dimensional vector. Metric learning is used to determine whether two high-dimensional vectors are in the same class by calculating the distance between them in an embedding space [24].
To obtain a more transferable and representative embedding space, a range of approaches are proposed, which prove that the end-to-end training can exploit the search process of deep architectures in the hypothesis space to explore the invariant properties behind the data [25][26][27]. Some of them consider second-order statistics, such as covariance operators and gaussian operators, as an end-to-end trainable embedding layer to constrain the model parameter updating [25,26]. Such methods have now been widely used in unsupervised domain adaptation tasks, where they have been shown to yield a more informative embedding representation and more complex distribution [28,29]. In order to fuse higher-order information into few-shot object detection models, the literature [27] gives us the idea of embedding the quadratic programming solution process into FSOD, which is able to back-propagate higher-order information by solving the optimal solution of the convex function with the help of the features that constitute the covariance matrix.

Problem Formulation
For dataset D = {(x, y), x ∈ X, y ∈ Y} , each sample (x, y) consist of the input image x and associated annotation y ∈ {(c i , l i ), i = 1, 2, . . . , N}}, which denotes its categories and bounding box coordinates 1 of N object instances in the image x. Categories are divided into two subsets, the base class C base and the novel class C novel , respectively. C base ∩ C novel = ∅. For each contraband instance, c i ∈ {C n ∪ C b }. In a few-shot learning scenario, the N-way K-shot setting means an N-class classification task, where each category has K samples. The few-shot training comprises of two stages: the base-training stage and the finetuning stage. In the base-training stage, the training dataset consists of all base-class data D base , and the detection model is trained to obtain sufficient semantic information common to X-ray images. In the fine-tuning stage, we only use a small number of sample to verify the performance of the few-shot model. For each category c i in C, K samples are taken from both novel class and base class to form the D f inetune , with a sample size of C × K. In fine-tuning training process, model randomly sampling data from D f inetune and adjusting the parameters by gradient descent algorithm. Knowledge learned from the frist stage will transfer to novel classes during fine-tuning stage. The model validates its performance in both the base class and the novel class in the test set. Figure 2 shows the overall architecture of our model . A two-stage detection model, Faster R-CNN, is used as the base detector. Backbone network and FPN receive the input images and output feature maps as a proportionally sized feature maps at multiple levels, in a fully convolutional fashion. The Region proposal network (RPN) proposes candidate boxes from the feature map. RoI pooling extracts the proposed region corresponding to the anchor frame and unifies it into a fix-sized feature vector. The produced region proposal P = { p ∈ R m , y}, where p is the m-dimension RoI feature vector, will be calculate in the R-CNN predictor head. Then, RoI head output box offsets and class confidences. In fine-tuning stage, the amount of data is far less than base-training stage, so the overfitting problem will be very severe. To avoid the overfitting, we freeze the parameters of the backbone network and fine-tune the remaining parameters. The derivable SVM embedding module participate in the few-shot fine-tuning training. This embedding layer adds SVM multiclassification task as a constraint on the fine-tuning training phase. The additional supervised information will enrich the representation of the embedding space. Details about the SVM module are described in Section 3.3.

Svm Embedding Module
As a powerful and robust algorithm, support vector machines (SVMs) play an important role in traditional classification and regression tasks [30]. The mechanism of SVM is to achieve a hyperplane that could maximize the interval between classes by solving a convex optimization problem, which has advantages especially for low-quantity data regimes. In this part, we utilized the characteristics of the SVM algorithm as an additional constraint to guide the optimization direction for few-shot tasks. Instead of using SVM as a normal classifier, we embedded our proposed SVM module into the base detector, which could be trained end-to-end and therefore enrich the representation of our model.
The SVM module we have designed is shown in Figure 3. The region features p are remapped with f z (x) : R m → R d , before being transported to the SVM module. The projected features p = f z ( p) will pass through an IoU filter. It will filter out a portion of the proposed regions under an IoU threshold. Experimentally, setting the IoU threshold can effectively control the quality of the features that input to the SVM layer, resulting in a better performance of the model. The above filtered features will input into the SVM module. Inspired by the literature [27], the multi-classification SVM solution algorithm [31] is placed into our proposed SVM module. The objective function of the multiclassification SVM algorithm is: where P ∈ R n×d represent the feature matrix which consist of former feature vector p , β is the regularization parameter and δ y i ,c is the Kronecker delta function. We convert Equation (1) into the dual formulation, which can reduce the dependence on the embedding dimension.
Let Figure 3. SVM module. Our proposed SVM module consists of a remap layer, an IoU filter, the multi-class SVM object function, and the QP Sovler. The SVM module is trained end-to-end on the deep neural network for additional supervised decision-making. Gradients calculated from L svm will back-propogate to the former layers relying on the chain rule.
After adding the dual set of variables, we can get the Lagrangian of the dual program as stated in [31], where 1 y i represents the one-hot encoding labeled vector, 1 is the vector whose components are all one. α c is our required solution, where α c ∈ R n×c . Let θ = {α c } C c=1 be the parameters computed in the dual function, we abstract the above equation into the following optimization problem: where f 0 (θ, p ) is the convex function, f (θ, p ) and h(θ, p ) are the inequality constraint and equation constraint of this optimization problem, respectively, where θ is the parameter of the SVM embedding layer. According to Equations (1) and (2), θ = f svm ( p ), where p is the proposed region feature, which has p = f z ( p ) . In order to make the SVM layer trainable end-to-end, it is necessary to obtain ∂ f θ ∂ f p as gradient information to backpropagate, which can also be written as ∂ f svm ∂ f z (Figure 3). Assume that p in Equation (4) satisfy S( p ) = {θ|g(θ, p ) = 0}, where g : R d × R c → R c is an implicit function (implicit function) about θ and p . For the function f : R n → R, the gradient information is denoted as ∇ x f (x) ∈ R n . For the function f : R n → R m , the gradient information is denoted as Jacobian matrix D x f (x) ∈ R m×n . Then the gradient ∂ f θ ∂ f p is the Jacobian matrix D p S( p ) ∈ R d×c , which is the gradient we need to obtain.
We can establish the Lagrangian equation according to Equation (4): Since the SVM objective function in the dual formulation in Equation (3) satisfies the strong duality of Slater's condition, and both the equation constraint and the inequality constraint are differentiable, the gradient can be solved [31]. This KKT condition for the Lagrangian function can be written in the form of an implicit function.
According to Barratt [32], when partial derivatives in L(θ, λ, v, p ) all exist, by the implicit function theorem, set g(θ, λ, v, p ) = 0, we can obtain: where both D θ g(θ,λ,ṽ, p ) and D p g(θ,λ,ṽ, p ) can be solved. Equation (7) can be used to back-propagate the gradient. The QP solver suggested by Amos and Kolter [33] completes this part of the computation. Since the whole computational environment is in the tensor environment for deep learning, we can obtain the gradient from the SVM layer to the remapping layer by this Jacobian matrix. we can incoporate kernel functions to prove its convergence. The objective function with kernel is as follows.
where K( p i · p j ) is the kernel function of the feature, and different kernel functions can be chosen to bring into the above equation. In Section 4, we choose the two most common kernel functions, i.e., linear kernel function and Gaussian kernel function, for our experiments, denoted by FSVM-L and FSVM-G, respectively, see Section 4. The expressions of the linear and Gaussian kernel functions are:

Joint Loss Funtion
To measure the probability of its classification correctness, we use the negative loglikelihood as the SVM loss, which compares the similarity between the predicted distribution and the one-hot label distribution.
In the two-stage training process, the base-training phase uses the standard Faster R-CNN loss [34], namely RPN loss, classification loss and box regression loss. In the fine-tuning stage, adding the SVM loss constraint described above, we end up with a joint loss function of: L = L rpn + L cls + L bbox + γL svm (12) where γ is a scale parametor used to adjust the loss balance. The SVM loss will backpropogate the gradients of SVM decision information to former layers successfully.

Experiment
In this section, we evaluate our approch and compare it with four common fewshot object detection methods, including meta-learning method, transfer learning method, and pretrained transfer learning methods. Results of the former FSOD methods on the X-ray baggage security dataset are provided for the follow-up research, and we provide ablation experiments.

Dataset
The X-ray baggage security inspection image dataset SIXray [35] was collected from subway stations, and there are 8929 manually annotated X-ray baggage images for five different classes: gun, knife, wrench, pliers, and scissors. It comprises of objects with a wide range of scale, viewpoint, and overlap, resulting in heavy noise in dataset. We cleaned and removed a small portion of images in SIXray, which included images with low resolution, noticeable changes in color distribution, and some images that lacked usable labels. The cleaned dataset contains 8827 images in total. Each image may contain multiple contrabands, and there are about 14,000 contraband samples available for training. In this paper, the dataset is partitioned into a training set and a testing set, where the ratio is about 4:1.

Category Division
The dataset is randomly divided into three novel splits, each with three base classes and two novel classes. K = 10, 30 is selected for the N-way K-shot setting to test the model's generalizability under various sample sizes. The novel classes in tree divisions are Split1: gun, knife; Split2: pliers, wrench; Split3: scissors, gun.

Implementation Details
We adopt Faster R-CNN [34] with Resnet-101 and a feature pyramid network (FPN) as a base detector. All experiments were performed on an Nvidia GeForce 3090 GPU using the momentum SGD optimizer with a weight decay of 0.0001. In the base-training stage, we train for 60 epochs with a batch size of 8 and a learning rate of 0.02. To make the solution space more stable, base training starts with a warm-up strategy to adjust the learning rate gradually. The learning rate in the fine-tuning stage was set to 0.01 and the batch size is set to 4. For calculating the SVM loss, we use label smoothing with a smoothing parameter set to 0.1.

Evaluation Metrics
We use the mean average precision (mAP) to describe the performance of detection results. This metric takes into account both the precision and recall of a certain category, by calculating the area under the Precision-Recall Curve. Precision and recall is determined by: where TP represents the positive samples that are correctly identified, FP represents the positive samples that are incorrectly identified, and FN represents the negative samples that are incorrectly identified. The average precision is obtained by: And the mAP take the average of AP values of all categories. In the few-shot setting, the model performance will consider both base class and novel class as follows: where C, C base and C novel represents the total number of categories, the number of base classes and the number of novel classes, respectively.

Comparison Experiment
As in Table 1, we conduct comparison experiments on four common few-shot object detection techniques: the Meta-RCNN [36], FsdetView [37], TFA [20], and FSCE [21]. As with our proposed model, all methods use ResNet101 as the backbone network and FPN for feature enhancement. Among the four comparison methods, the Meta R-CNN algorithm is a meta-learning method, FsdetView is a transfer learning method, and both TFA and FSCE are pre-trained transfer learning methods. The experiment shows that the methods without using pre-trained models (Meta R-CNN, FsdetView) fail to achieve the performance of pre-trained methods (TFA, FSCE). Results show that our method is still superior compared to the two pre-trained methods and obtains the best results in the novel class for all three different divisions. Table 1. Performance evaluation of three SIXray category divisions. For each novel split, the results are reported by bAP and nAP. FSVM-L and FSVM-G are SVM-constrained methods with the linear kernel and the gaussian kernel, respectively. * represents results from 10 random seeds. We set the best results in red and the second-best results in blue.

Model bAP50 Gun Knife nAP50 bAP50 Pliers Wrench nAP50 bAP50 Scissors Gun nAP50
Meta In this paper, the before-and-after changes of the bAP are also counted ( Table 2). Compared to other methods, FSVM is able to maintain the base class AP at a higher level after completing fine-tuning. Like mentioned in [21], this means the base forgetting is much less after fine-tuning, i.e., the model could learn new information of the novel classes while remaining the characteristics of the base classes. Table 2. Base forgetting comparison. Since only base data are input during base-training stage, the closer the bAP value is to mAP after fine-tuning, the more resistant the model is to the base-class forgetting.

Base-Training
Fine-Tuning

Ablation Experiment
To determine the details of the designed model and to understand its impact, the ablation experiments are reported. If not specifically indicated, all ablation experiments were done based on Novel split1 10shot SIXray data. To maintain the consistency, same hyper-parameters were used for all experiments.

FSVM
In principle, the SVM embedding layer needs to back-propagate the learned supervised information in the form of gradients through the additional classification task, and the parameter updatting occurs in the fine-tuning phase depend on the quality of the back-propagated gradients, so the following ablation experiments are conducted for the designed SVM module, as shown in Table 3. To achieve a meaningful gradient, the remap layer (consist of two fully connected layers) is used for remapping the feature input. The experimental results indicate that the model with SVM constrained supervision has higher performance both at 10-shot and 30-shot setting. Table 3 also shows that add remapping layer can enhance the effect of SVM supervision, in addition to freezing part of the parameters can also reduce overfitting. We found that adding a remapping layer can effectively reduce the SVM loss generated by our additional multi-classification task shown as Figure 4. During the first 250 iters of the fine-tuning stage, the remapping layer was able to bring down the SVM loss, while the SVM loss without the remapping layer always oscillated above and below the initial value. Since the parameter space of the remapping layer is smaller than that of the overall model, we analyze that the remapping layer can act as a gradient amplifier by adjusting the effect of SVM layer to amplify a meaningful and expressive SVM losses.

Dimension Control
Experiment shown in Table 4 confirm the mapped feature dimension size. We test the impact of remapping layers from low (16) to high (4096) dimensions. The experiments show that although increasing the dimensionality of the remapping layer can remap the features to a high-dimensional space and make them separable using its sparsity property, using high-dimensional vector mapping when the number of samples is insufficient leads to overfitting, which makes the model accuracy decrease in the high-dimensional (4096) case. We found that nAP50 performs smoothly at the dimensiona equal to 128, which also can obtain more accurate detection boxes (higher nAP75) compared to other fetched values. The quality of the features we input to the SVM layer becomes very crucial for the proformance. We use an IoU filter to control the quality of the input RoI feature. First, we obtained the RoI feature with a detected bounding box, while checking whether the RoI passed the IoU requirement. The greater the overlap between the detected bounding box and the groundtruth, the higher the input quality. Meanwhile, it also means that the number of RoI features to the SVM module will be reduced.
To validate the method, we set the values of IoU requirement to 0.5 and 0.7, respectively, and test them on 10-shot and 30-shot setting, as shown in Table 5. The result shows that the IoU threshold = 0.7 can effectively improve the model performance, proves that a higher IoU threshold in our IoU filter can promote the quaility of RoI features for the SVM multi-classification task, which can generate a better constrained SVM gradient. Further more, the improvement is more obvious when the number of sample size is smaller. In this paper, we use two most common kernel functions for testing, namely linear kernel functions and gaussian kernel functions. The linear kernel is usually applied when the original feature vector is already relatively linearly separable, while the gaussian kernel can construct nonlinearly separable decision intervals by mapping the original feature vector to a gaussian kernel space. We test the impact of these two kernel functions on the model, shown in Table 6. The performance of the two kernel functions is measured by changing the regularization parameter β and the balance factor of SVM loss γ. The regularization parameter β can regulate the solution of the multi-classification SVM task in Equation (1), and a smaller β leads to simpler constitutive decision space. It can also reduce the overfitting in the data-hunger setting. We tested two values of β, 0.1 and 0.5, respectively. Table 6 shows that a larger β leads to a higher performance, β = 0.5 is more suitable for the current sampled sample points. However, fine-tuning the balance fator γ produces a slight perturbation of the model parameters and no significant enhancement effect. So we further test the value of γ while setting β to 0.5 in two kernel-type scenarios. Figure 5 shows the experiment results in both 10-shot and 30-shot settings. Results show that the model achieves the best results in the 10-shot and 30-shot scenarios when lambda is equal to 0.5. For the gaussian kernel-based FSVM, nAP50 will decline when lambda is greater than 0.5. For the linear kernel-based FSVM, nAP50 will decline when lambda is greater than 0.8. Finally, we determined γ = 0.5 to maintain model stability. Table 6. Parameter adjustment of the SVM module. nAP50 are tested by varying the regularization parameter β and the loss balance factor λ. We used 30-shot Novel split1 data, and fixed random seeds to keep the experiment consistent.

Linear Kernel
Gaussian Kernel

Visual Results and Comparison
The detection result are provided in Figure 6. The trained detector was fine-tuned by Novel Split1 with 30-shot samples. Each contraband is denoted in the figure by a bounding box and a category with a confidence score. The novel classes in Novel Split 1 are knife and gun. Results show that the detector can successfully detect novel contrabands after a few steps of finetuning under only 30-shot samples.  We also use Gradcam [38] to visualize the region of interest for the model when making classification decisions. Figure 8 shows the Gradcam heatmap results for the knife class. The upper part of the figure represents original X-ray baggage samples in the dataset. We can see that our model mostly recognizes a knife by its handle when the items are occluded by other metal structures. The activated regions (the red part of the picture) on samples are compact and concentrated, which shows our model learned the correct representation message through SVM-module finetuning. To further compare the Gradcam results, we also visualized the heatmap results of TFA and FSCE as shown in Figure 9. In this figure, we focus on gun class, and we can see that both TFA and FSCE focus not only on the prohibited items but also more on the complex background. The result shows that our model mitigates the negative impact of contraband that is difficult to distinguish in complex background distributions to some extent, which reflects the high learning ability of the model.

Conclusions
In this paper, we propose a few-shot object detection method, FSVM, which is designed for complex distributed X-ray security images with contraband. The FSVM can detect threats in X-ray baggage with an extremely insufficient number of samples. To increase detection performance, a end-to-end trainable SVM module is used to embed it into the object detection network. This allows for the back-propogation of additional supervised information to create a better embedding space for few-shot tasks. Four commonly used FSOD methods are tested to incorporate the few-shot learning method into X-ray threat detection, including the meta-learning method, the transfer learning method, and pretrained transfer learning method. Results for the 10 shot and 30 shot few-shot samples are presented on the X-ray baggage security dataset SIXray. Compared with the four FSOD methods, Table 1 shows that our method not only has highest average precision in both 10-shot and 30-shot settings, but also alleviates base forgetting. However, the confusion matrix results of our model showed more leak detection in novel classes than in base classes. Though our proposed model achieves better results, it is hard to avoid leak detection in the few-shot setting, especially in the cluttered and occluded X-ray security dataset. In the future, we will focus on promoting the quality of proposals produced by RPN and further improving the detection results.