Pixel-Level Analysis for Enhancing Threat Detection in Large-Scale X-ray Security Images

Abstract: Threat detection in X-ray security images is critical for preserving public safety. Recently, deep learning algorithms have begun to be adopted for threat detection tasks in X-ray security images. However, most of the prior works in this field have largely focused on using image-level classification and object-level detection approaches. Adopting object separation as a pixel-level approach to analyze X-ray security images can significantly improve automatic threat detection. In this paper, we investigated the effects of incorporating segmentation deep learning models in the threat detection pipeline of a large-scale imbalanced X-ray dataset. We trained a Faster R-CNN (region-based convolutional neural network) model to localize possible threat regions in the X-ray security images on a balanced dataset to maximize detection of true positives. Then, we trained a DeepLabV3+ model to verify the preliminary detections by classifying each pixel in the threat regions, which resulted in the suppression of false positives. The two models were combined in one detection pipeline to produce the final detections. Experiment results demonstrate that the proposed method significantly outperformed previous baseline methods and end-to-end instance segmentation methods, achieving mean average precisions (mAPs) of 94.88%, 91.40%, and 89.42% across increasing scales of imbalance in the practical dataset.


Introduction
X-ray imaging is widely used for securing public spaces [1]. Developing algorithms that aid human inspectors in the monotonous and nontrivial task of detecting threats in X-ray security images is of utmost importance. Recently, deep learning has become the dominant method used in the field of automatic threat detection in X-ray security images [2]. The most common approach is to adopt deep learning models trained on natural images and apply these models to X-ray security images. However, X-ray security images do not have as many rich features as natural images. For instance, X-ray security images have a limited color range, lower contrast, and poor texture. The most prominent distinguishing feature of X-ray security images is the visibility of overlap between objects, which is a challenge when adopting deep learning models because object overlap aggravates intra-class variations [3]. The pixels in X-ray security images provide insight into the state of overlap between objects since each pixel corresponds to the radiation intensity that is attenuated by all overlapping objects [4]. Darker regions in X-ray security images signify higher attenuation, which could be caused by several overlapping objects or non-overlapping objects made up of higher-density materials. Similarly, lighter regions in the image mean lower attenuation. Thus, an intuitive approach to handle overlapping objects is to analyze the X-ray security images in a pixel-wise manner. Pixel-level deep learning methods continue to be a popular approach in the related field of medical X-ray imaging [5]. Pixel-level analysis can also bring similar improvements to efficiency and reliability not only in X-ray security applications but also in other related X-ray imaging fields, such as structural materials inspection [6].
In natural images, image segmentation is the task domain concerned with the pixel-wise analysis of images, which is further broken down into two tasks, namely, semantic segmentation and instance segmentation [7]. Semantic segmentation aims to classify each pixel in the image into one of the available classes of objects. Instance segmentation combines object localization and pixel-wise classification such that it results in the delineation of each object instance in the image. With regard to threat detection in X-ray security images, instance segmentation would seem to be the appropriate task domain, allowing pixel-level localization of each potential threat. However, this task cannot be directly applied to X-ray security images because it restricts each pixel to a single object instance. Figure 1 shows instance segmentation in natural images, which do not exhibit overlap. Instead, we observe occlusion, wherein the objects in front fully cover parts of the objects behind. Yet, for X-ray security images with overlapping objects, a pixel can belong to as many object instances as overlap at that location, as shown in Figure 2. Thus, a more specific task domain, called object separation, is required to address the issue of overlap. Object separation can be thought of as a multi-class and multi-label version of instance segmentation. This task domain was first introduced in [8] as a method of separating potentially overlapping objects in X-ray security images, then assigning the correct pixel values according to the estimated atomic number of each object. The first part of the object separation task can detect prohibited items with non-organic material properties, such as weapons, through their shape, texture, and other visual features. The second part can detect organic prohibited items that lack such visual features, such as explosives and illegal drugs, by estimating their material composition.
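To make the distinction concrete, the following toy sketch (with illustrative array names, not taken from the original work) contrasts the two label representations: instance segmentation stores one instance label per pixel, so overlap is lost, while object separation keeps one binary mask per instance, so a pixel in an overlap region is simply active in several masks at once.

```python
import numpy as np

# Toy 4x4 image with two overlapping object instances.
# Instance segmentation: one label per pixel -> overlap is lost.
instance_labels = np.zeros((4, 4), dtype=np.int64)
instance_labels[0:3, 0:3] = 1   # object 1
instance_labels[1:4, 1:4] = 2   # object 2 overwrites the overlap region

# Object separation: one binary mask per instance (multi-label).
masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, 0:3, 0:3] = True       # object 1 keeps all of its pixels
masks[1, 1:4, 1:4] = True       # object 2 keeps all of its pixels

# Pixels that belong to more than one instance mark the overlap region.
overlap = masks.sum(axis=0) > 1
print(int(overlap.sum()))  # 4: the 2x2 intersection of the two objects
```

In the single-label representation, object 1 irrecoverably loses its four overlap pixels to object 2, which is exactly the information loss that the object separation ground truth avoids.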
Thus, the complete object separation task domain can be used as a general solution to detecting prohibited items in X-ray security images. Deep learning can be leveraged in the first part of the object separation task since annotations are readily attainable. However, the second part of object separation requires additional information about the material properties of each object in the image, which is currently not available in the public domain. As a response to the growing interest in computer vision research on X-ray security images, Miao et al. [3] recently published SIXray, which is the largest database of X-ray security images that closely mirrors the practical scenario. The dataset demonstrates the visual complexities of X-ray security images, including overlapping of objects, but more importantly, the dataset places emphasis on the issue of imbalance as well. In practice, the data distribution of X-ray security images is highly skewed towards the majority class or negative samples, i.e., images that do not contain threats, since threat objects are rarely encountered during security screening compared to normal objects. Due to the overrepresentation of the majority class, conventional models trained on imbalanced datasets are extremely biased toward always predicting the majority class. This is a major problem because the misclassification cost of the minority class or positive samples, i.e., images containing at least one threat, is significantly higher [9]. A straightforward approach to balancing an imbalanced dataset would be to either oversample the minority class or under-sample the majority class. Oversampling the minority class is not only inefficient when the majority class is orders of magnitude larger but has also been consistently shown to be inferior to other approaches.
Conversely, under-sampling the majority class leads to a loss of a considerable amount of information depending on the scale of imbalance, which results in a surge of false positives as the conventional model tries to fit objects from unseen negative samples to any of the designated threat objects. Yet, it is of utmost importance to keep false positives low if the detection algorithm is to be used as an automation tool in security systems to aid human inspectors [10].
In this paper, we investigated the impact of pixel-level analysis when it is integrated into the threat detection pipeline to address the first part of the object separation task in a large-scale and imbalanced X-ray security image dataset, also referred to as a practical X-ray dataset. We trained a Faster-RCNN [11] to localize possible threat regions on a balanced subset of the dataset to maximize the detection of the true positives. We then trained a DeepLabV3+ [12] to classify each pixel in the possible threat regions such that the regions without any pixel-level predictions are discarded as false positives. The two models were combined in one detection pipeline to produce the final predictions. Both models were selected after an exhaustive evaluation of the state-of-the-art object detection and semantic segmentation models, respectively.
The main contributions of this paper are as follows:
1. Reintroducing object separation as a unique task domain for X-ray security images;
2. Using segmentation as a mechanism to address the class imbalance problem in a practical X-ray security dataset;
3. Exhaustive evaluation of detection and segmentation deep learning models on the SIXray dataset; and
4. Development of a two-stage threat detection model that decouples detection and segmentation in order to maximize the use of available annotations.
Experiment results showed that pixel-level analysis significantly improved the threat detection performance, wherein the proposed method achieved mean average precisions of 94.88%, 91.40%, and 89.42% across increasing imbalance ratios, outperforming previous baseline methods and an end-to-end instance segmentation method by a large margin.
The rest of the paper is organized as follows: Section 2 briefly reviews the related works. Section 3 provides information on the imbalanced dataset. Section 4 defines the evaluation metrics. Section 5 discusses the methodology of the proposed approach. Section 6 describes the details of the experiments and analysis of the results. Section 7 presents the conclusions of this study.

Related Works
In this section, we review prior studies that used pixel-level analysis as a threat detection approach in X-ray security images and the previous works that attempted to solve the class imbalance problem in large-scale X-ray security images.
As the first to introduce the concept of object separation, the work in [6] is also the first to propose a solution to the problem. The authors exploited the log space additivity of pixels in X-ray security image sets. However, their method demands a few images from different views of the same target objects, which may not be the case in most practical port security operations. Furthermore, their approach resulted in extremely inaccurate segmentation, and therefore they instead focused on estimating the material properties using atomic numbers. Thus, their work cannot be directly applied in the detection of other threat objects such as weapons. Since then, no work has attempted to solve the object separation task partially or completely. Alternatively, we briefly reviewed recent pixel-level approaches used in X-ray security images. The scarcity of works in this field is due to the difficulties in acquiring pixel-level annotation [2]. The study conducted in [13] evaluated classical machine learning techniques to segment objects in X-ray security images on the basis of color and later classify them into two broad classes, i.e., organic or inorganic. The work in [14] used a two-stage segmentation method for the broader task of intra-object anomaly detection. A closely related approach used semantic segmentation [15] to detect threat objects in X-ray images. However, semantic segmentation is not compatible with object separation since it cannot distinguish between instances.
Apart from publishing SIXray, Miao et al. initiated the research for threat detection in imbalanced X-ray security images by introducing the class-imbalance hierarchical refinement (CHR) approach [3], wherein a convolutional neural network (CNN), which approximates a function that removes weakly related elements from the feature map, is implemented in multiple levels of the convolution operation hierarchy. In our previous work [16], we explored the effect of filtering the initial image-level predictions using a generative adversarial network (GAN) [17] to learn the underlying distribution of negative samples so that we can designate the positives samples as anomalies when they deviate from the learned distribution. In this way, we can train a classifier on a balanced ideal dataset and suppress the induced false positives by identifying anomalies. Moreover, we also studied the effect of enlarging the pool of positive samples through image synthesis using different image-generating GANs [18]. We found that generating synthetic images by combining isolated threat objects and negative samples causes the model to disassociate the threat objects to the backgrounds distinct to positive samples, resulting in a better generalization and suppression of false positives for object-level threat detection. The deep feature fusion method in [19] combines early fusion by concatenating extracted features at the earlier stages of the classification model and late fusion by taking the weighted sum of losses calculated at different stages of the classification model. These components are integrated into a dual branch network with the aim of exploiting low-level spatial features and alleviating bias caused by the imbalanced data. 
A closely related approach is the tensor pooling method proposed in [20], where the authors train an instance segmentation network to classify pixels of threat objects using preprocessed inputs, wherein the image contours are extracted at different orientations to encode tensor representations. However, the authors did not consider the issue of overlap in X-ray security images and, hence, were not concerned with separating or isolating threat objects as a threat detection approach. Furthermore, their method requires that each variant of the threat category must be labeled, which exponentially increases the cost of annotation.

Dataset
SIXray [3] is the largest publicly available benchmark dataset of X-ray security images, comprising more than 1 million images, of which more than 8000 are labeled as containing at least one of the following objects: gun, knife, wrench, pliers, and scissors. The dataset is further split into subsets that correspond to increasing ratios of imbalance between positive samples and negative samples: SIXray10, SIXray100, and SIXray1000. Even so, the dataset only provides image-level and object-level annotations. For the task of object separation, pixel-level annotations are required. Figure 3 shows three different versions of ground truth masks for the tasks of semantic segmentation, instance segmentation, and object separation. In semantic and instance segmentation, occlusion restricts each pixel to a single semantic group and object instance, respectively, whereas object separation allows pixels to be classified under multiple object instances, which describes the area of overlap between objects. Thus, one of the defining attributes of the object separation task is the ground truth labels used in supervised training. Figure 4 illustrates the difference between the binary masks of each object instance for the tasks of instance segmentation and object separation, respectively. Suppose that the task is instance segmentation; objects at the bottom-most level of the stack will tend to lose most, if not all, of their pixel-wise labels to the objects above them, even though their entirety can still be visually recognized in the image. It is apparent that training a model with these masks would lead to a model that ignores overlaps between objects and lacks sufficient information about the objects at the bottom of the stack. Hence, we manually labeled the images such that all the pixels that describe an instance are included in the ground truth mask.
However, given the high cost of annotating the entire dataset, we first randomly sampled 2500 images from the positive samples used in the training subsets and present the initial results of our investigation with the aim of encouraging future research in this task domain to create more labeled datasets. The distribution of instances across all five categories for the pixel-level labeled dataset is presented in Table 1. Comprehensive data exploration of the dataset was presented in [3,21]. Still, further examination of the dataset revealed that a considerable portion of the negative samples were incorrectly labeled as negative despite noticeably containing threat objects. Some of the erroneous samples are shown in Figure 5. Noisy or corrupted labels are not uncommon in the field of machine learning; it is reported that real-world datasets contain between 8% and 38.5% corrupted labels [22]. As a result, an entire field of methods for training models robust to these noisy labels has emerged, which is beyond the scope of this paper. We found that the SIXray dataset contains only about 1% mislabeled negative samples, as illustrated in Figure 6. Still, including mislabeled negative samples in evaluating algorithms runs counter to the goal of security applications. In practice, we want these samples to be predicted as positives, but during evaluation, they are considered otherwise. Furthermore, using ranking metrics such as classification mAP adds further confusion since the predictions are sorted in descending order on the basis of the confidence scores. The best threat detection algorithms will predict these mislabeled negative samples as positive with very high confidence scores, which will cause a significant deterioration in the performance metric. Hence, in our evaluations, we removed the mislabeled negative samples from the test datasets to reveal the true strength of the algorithms.
The proportion of mislabeled negatives is small enough to not affect the ratio of imbalance in the subsets. For comparison with previous works, we did not remove the mislabeled negative samples in the dataset we used for training our models, and we reevaluated previous works using the test datasets without the mislabeled data. Additionally, we have provided the list of the mislabeled negative samples as well as the pixel-level annotations of the data subset (https://github.com/jodumagpi/Xray-ObjSep-v1, accessed on 8 October 2021).

Evaluation Metrics
To select the detection model, we evaluated the performance using average precision (AP) [23], as defined in Equation (1):

AP = Σ_n (r_{n+1} − r_n) p_interp(r_{n+1}),  (1)

where r is the recall, and p_interp is the interpolated precision given by

p_interp(r_{n+1}) = max_{r̃ ≥ r_{n+1}} p(r̃),

wherein p is the precision at r, and n runs over all the recall points. AP is also defined as the area under the precision-recall curve. The mAP is the average of the APs calculated for each class of threat objects. This is one of the most widely used metrics for evaluating detection and classification performance. Moreover, the segmentation model is chosen on the basis of its performance on various evaluation metrics, namely, intersection-over-union (IoU), dice coefficient (DC), precision, and recall, as defined in Equations (2)-(5):

IoU(A, B) = |A ∩ B| / |A ∪ B|,  (2)

DC(A, B) = 2|A ∩ B| / (|A| + |B|),  (3)

Precision = TP / (TP + FP),  (4)

Recall = TP / (TP + FN),  (5)

where A and B represent the target and predicted segmentation masks, respectively. IoU, also known as the Jaccard index, is calculated for each class of threat objects, and the average is reported as the mean IoU (mIoU). DC is another metric that has become increasingly popular in reporting the performance of modern segmentation algorithms and bears some resemblance to the IoU metric. TP, FP, and FN are the proportions of the pixels in the masks that are considered true positives, false positives, and false negatives, respectively. Precision and recall can be calculated for each class and as the cumulative of all classes. As per the benchmark standard set in [3], we again used the mAP, defined in Equation (1), to evaluate our proposed method against previous works. We took the AP on each of the classes by ranking the classification predictions by their confidence scores, then took the average (mAP) across all the classes to report the overall performance of the models.
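As a minimal NumPy sketch of the mask-level metrics defined in Equations (2)-(5) (the function name is ours, not from the original work), the following computes IoU, DC, precision, and recall for a pair of binary masks:

```python
import numpy as np

def mask_metrics(target, pred):
    """IoU, Dice, precision, and recall for a pair of boolean masks,
    following the definitions of Equations (2)-(5)."""
    tp = np.logical_and(target, pred).sum()
    fp = np.logical_and(~target, pred).sum()
    fn = np.logical_and(target, ~pred).sum()
    iou = tp / (tp + fp + fn)          # |A ∩ B| / |A ∪ B|
    dice = 2 * tp / (2 * tp + fp + fn) # 2|A ∩ B| / (|A| + |B|)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return iou, dice, precision, recall

# Example: two 2x2 masks that agree on one pixel, with one false
# positive and one false negative.
target = np.array([[True, True], [False, False]])
pred = np.array([[True, False], [True, False]])
iou, dice, precision, recall = mask_metrics(target, pred)
print(iou, dice, precision, recall)  # 1/3, 0.5, 0.5, 0.5
```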

Proposed Method
In this section, we discuss the details of the model selection experiments and the description of the complete threat detection pipeline.

Model Selection
Training an end-to-end object separation model requires image-level, object-level, and pixel-level annotations for each data entry. Yet, this kind of dataset is non-existent in the field of X-ray security images. Choosing to train an end-to-end model with a limited dataset has the downside of throwing away the data with incomplete annotations. Instead, we created a threat detection pipeline wherein the localization of the object instances and the prediction of segmentation masks are separated into two different tasks handled by two different models. This way, all the samples in the dataset with only image-level and object-level annotations are used to train the first model, i.e., the object detection model, and the limited samples with additional pixel-level annotations are used to train the second model, i.e., the segmentation model. Combining these models in one pipeline results in an object separation model that can localize each object instance and delineate all the pixels belonging to that instance while also maximizing the utilization of the available annotations. Consequently, the selection of the appropriate model for each of the two tasks is conducted through the experiments discussed in the following subsections.

Detection Models
In selecting the best detection model, we evaluated four of the most popular state-of-the-art approaches.
Faster R-CNN [11] belongs to the family of two-stage object detection models known as region-based CNNs. The first stage of the algorithm is known as the region proposal network (RPN), where areas in the image that are suspected of containing the target objects are extracted and passed down to the second stage of the algorithm, which then classifies the objects contained in the extracted area. The localization is improved by considering the prediction of the bounding boxes as a regression task, where the difference between the coordinates of the real bounding boxes and the coordinates of the proposed regions are considered. This model is designed to be the fastest and most accurate compared to its previous iterations [24,25].
You Only Look Once (YOLOv3) [26] belongs to the one-stage object detectors that do not require a region proposal as part of the algorithm and, instead, analyze a dense sampling of the possible locations. In YOLO, the input image is divided into grid cells, and each cell forecasts a fixed number of bounding boxes, called priors, along with the confidence scores. All these predictions are simultaneously made using a single CNN, making it one of the fastest algorithms suitable for real-time applications. In this paper, we adopted the third version of the algorithm, which is designed to detect small objects better through the incorporation of shortcut connections.
Single-shot Multibox Detector (SSD) [27] is also a one-stage algorithm that performs detection at every pyramidal layer of the CNN to focus on objects of varying scales. Instead of dividing the input into grids, SSD predicts the offset of the default boxes, also known as anchor boxes, for every point in the feature map. Receptive fields of feature maps differ at every level in the pyramid of the CNN. Earlier layers tend to have fine-grained feature maps, and later layers tend to have coarse-grained feature maps. Since the anchor boxes have fixed sizes relative to their corresponding cell, predictions at later layers capture larger objects in the image, while predictions at the earlier layers capture smaller objects in the image.
RetinaNet [28] is another one-stage algorithm that bears a resemblance to SSD in the sense that it also uses the concept of detection at each level of the pyramidal layers of the CNN. However, it also attaches a feature pyramid network (FPN) [29], which concatenates later feature maps with earlier feature maps to gain stronger representations. Furthermore, it introduces a different loss function, called focal loss, with the aim of putting more weight on hard samples, i.e., samples that are often misclassified.
We only used positive samples to evaluate the effectiveness of each detection model in detecting threats when they are present in the image. We divided the data into training and validation sets. We used the same backbone network for all the models, i.e., ResNet-50 [30], and trained them all for 60,000 iterations. Table 2 shows the per-category AP and the mAP of the detectors on the validation set. Although per-category results are not consistent, the overall evaluation favors Faster R-CNN as the most accurate detector. Moreover, RetinaNet already has the advantage of having an FPN attached to its backbone network, which all the other models did not have. This may have been the reason why among the other one-stage detectors, it achieved the closest overall performance to the Faster R-CNN approach. On the basis of this result, we selected Faster R-CNN as our object detection model.

Segmentation Models
To determine the best segmentation model for the task, we evaluated six of the most popular state-of-the-art approaches.
FCN [31] is one of the earliest approaches to pixel-level classification. It takes the conventional CNN architecture and replaces the fully connected layers with a convolutional layer by arguing that the dense layers can be thought of as performing 1 × 1 convolutions. The final convolution layer is then up-sampled using deconvolution to learn non-linear upsampling and produce a feature map of a similar size as the input, wherein each pixel corresponds to the predicted classes. U-Net [32] builds on FCN in that it also comprises symmetric encoding and decoding fully convolutional layers that form a U-shape. Additionally, it adds skip-connections between corresponding encoding and decoding layers to reinforce the information that is lost during downsampling.
PSPNet [33] introduces the pyramid pooling module, which concatenates the feature maps from the layers of the backbone model to capture global context, which is important in providing indications on the distribution of the segmentation classes across the image. Moreover, it uses an auxiliary loss applied at the input of the pooling module to serve as intermediate supervision during the training.
DeepLabV3 [34] adds several techniques to achieve a finer delineation of objects, such as atrous convolutions and spatial pyramid pooling (SPP). SPP is used to capture multi-scale context by applying multiple pooling layers that divide the feature maps of the last convolutional layer into fixed spatial bins and concatenating the output vector to be fed to the subsequent fully connected layer. As this increases the complexity of the model, atrous convolutions, also known as dilated convolutions, are used so that the receptive fields of filters are larger, thereby incorporating a larger context without increasing the number of parameters.
PAN [35] proposes two new modules to the segmentation framework, namely, feature pyramid attention (FPA) and global attention up-sample (GAU). GAU guides the low-level features by combining them with a global context extracted from performing global average pooling on the high-level features. FPA learns better representations by combining global pooling and spatial pyramid attention on the output of the backbone model. MA-Net [36] also applies an attention mechanism in its architecture, specifically by introducing two new blocks: the point-wise attention block (PAB) and the multi-scale fusion attention block (MFAB). PAB captures the spatial dependencies between pixels in the global view, and MFAB captures the channel dependencies between any feature maps by multi-scale semantic feature fusion.

We only used the pixel-level annotated data to evaluate the segmentation models. We used the detection model selected from the previous model selection experiment to extract patches or regions in the samples that were to be used to train the segmentation models. We added background patches, i.e., regions that do not contain threats, by running negative samples through the chosen detection model. The resulting dataset was then divided into training and validation sets. We also used the same backbone network for all the models, i.e., ResNet-50 [30], and trained them all for 60,000 iterations. Table 3 shows the results on the validation set using various evaluation metrics. We selected DeepLabV3 as our segmentation model since it outperformed the other approaches on most of the evaluation metrics.

Full Pipeline
The final threat detection pipeline is illustrated in Figure 7. We preprocessed all images using [37], wherein we cropped out the unnecessary air space in the image, leaving only the relevant information in the data. The cropped image was fed to the detection model, which predicted the locations of potential threat objects. In our final pipeline, we improved the backbone of the Faster R-CNN model by attaching an FPN. The predicted regions were cropped out of the image and then fed to the segmentation model, which classified each pixel in the region. In our final pipeline, we improved the segmentation model by using the next iteration of the algorithm, called DeepLabV3+ [12]. This version of the algorithm introduces atrous depthwise convolution, which combines the ideas of atrous convolution and depthwise separable convolution [38] to drastically reduce the network parameters while maintaining, or even improving, performance. Furthermore, we also enhanced the encoder by using the ResNeXt-50 [39] architecture with squeeze-and-excitation (SE) blocks [40]. We set the pixel-level prediction threshold at 95% so that only pixels with strong detections were considered. Regions without any pixel-level predictions were discarded. In this sense, the segmentation model acts as a verification model that ensures all predicted regions contain threat objects.
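The two-stage logic above can be sketched as follows. This is an illustrative reading of the pipeline, not the authors' code: the function and argument names are ours, the detector is assumed to return torchvision-style dicts of boxes, labels, and scores, and the segmenter is assumed to return a dict with an "out" logit tensor.

```python
import torch

def detect_threats(image, detector, segmenter,
                   score_thresh=0.5, pixel_thresh=0.95):
    """Two-stage pipeline sketch: the detector proposes threat regions,
    and the segmenter verifies them pixel-wise. Regions whose masks have
    no pixel above `pixel_thresh` (95% in the paper) are discarded as
    false positives."""
    detector.eval()
    segmenter.eval()
    verified = []
    with torch.no_grad():
        det = detector([image])[0]
        for box, label, score in zip(det["boxes"], det["labels"],
                                     det["scores"]):
            if score < score_thresh:
                continue
            # Crop the predicted region and resize it to the patch size.
            x1, y1, x2, y2 = box.int().tolist()
            patch = image[:, y1:y2, x1:x2].unsqueeze(0)
            patch = torch.nn.functional.interpolate(
                patch, size=(192, 192), mode="bilinear")
            probs = segmenter(patch)["out"].softmax(dim=1)
            # Keep only pixels the segmenter is at least 95% sure about.
            mask = probs[0, label] >= pixel_thresh
            if mask.any():  # region verified by at least one confident pixel
                verified.append((box, label, score, mask))
    return verified
```

Any region the detector proposes but the segmenter leaves empty is dropped, which is the false-positive suppression mechanism described above.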

Experiment Results and Discussion
In this section, we discuss the experiment setup and the analysis of the results.

Experiment Setup
We trained our models using the predefined subsets of the SIXray dataset. SIXray10, SIXray100, and SIXray1000 have ratios of positive samples to negative samples of 1:10, 1:100, and 1:1000, respectively. All the images in the training subsets were used to train the detection model, while only the pixel-level annotated images in the training subsets were used to train the segmentation model. We generated the patches used to train the segmentation model as described in the previous section. The spatial dimensions of all the patches were resized to 192 × 192.
We trained Faster R-CNN for 80,000 iterations with a batch size of 2 using stochastic gradient descent (SGD) [41] with a momentum of 0.9 and a base learning rate of 0.001, which was decayed by a factor of 0.1 after the 30,000th and 50,000th iterations. We trained DeepLabV3+ for 250 epochs with a batch size of 32 using the Adam optimizer [42] with a base learning rate of 0.001, which was decayed by a factor of 0.1 after the 75th, 150th, and 200th epochs. For further comparison, we also trained Mask R-CNN [43], a state-of-the-art end-to-end instance segmentation framework, using all of the samples in the training set with complete annotations. Mask R-CNN was trained with the same backbone and configurations as Faster R-CNN.

Experiment Results
Tables 4-6 show the quantitative results of the experiments on the predefined test datasets. We compared our approach to the baseline classification method, CHR [3]; our previous classification method, GBAD [16]; and an end-to-end instance segmentation method, Mask R-CNN [43]. The removal of the mislabeled negative samples caused a substantial performance boost in the reevaluation of the previous works, showing that their true strength was obscured by the quality of the dataset. Still, the results revealed that image-level approaches drastically fell behind object-level and pixel-level approaches. Moreover, it is especially concerning how image-level approaches cannot cope with increasing ratios of imbalance. In contrast, both Mask R-CNN and our proposed approach showed more robustness to the imbalance, proving that localized methods such as object-level and pixel-level approaches can drastically enhance the performance of threat detection models, which is of utmost importance in security applications. Furthermore, our approach enjoys an even larger boost in performance compared to Mask R-CNN thanks to the decoupling of the detection and segmentation tasks, which allowed for the use of the entire training set. The multi-label detection confidence of two randomly sampled images from the positive and negative samples is shown in Table 7. A detection confidence of 100% conveys that the model is certain that the particular threat is present in the image, while a detection confidence of 0% conveys that the model is certain that the image does not contain the particular threat. Analysis of the detection confidence enables us to gain more insight into the overall mAP performance of each approach. mAP relies on the ranking of the detection confidence on each target class. Thus, a false positive with high detection confidence and a false negative with low detection confidence cause significant degradation in the performance.
Meanwhile, a true positive with high detection confidence and a true negative with low detection confidence are desired for optimum performance. For the positive sample, i.e., Figure 8a, CHR [3] correctly detected the knife with relatively high confidence, but the rest of the threat classes also had considerably high detection confidence, especially the gun class. On the other hand, GBAD [16] also correctly detected the knife with 100% certainty and had extremely low detection confidence for the wrench, pliers, and scissors classes, but, unfortunately, it also had 100% certainty for the presence of a gun. From this sample alone, we might expect GBAD to perform worse on the gun class and better on the rest of the target classes when compared to CHR; indeed, this is reflected in Table 5. The filtering mechanism of Mask R-CNN [43] and our proposed method allowed potential false positives to be completely suppressed to 0%. Both segmentation approaches correctly detected the knife and made no other false detections; still, our approach boasts higher detection confidence for the same detected object. For the negative sample, i.e., Figure 8b, both CHR and GBAD incorrectly detected a gun in the image with high confidence, and, for the rest of the classes, the latter method continued to have extremely low detection confidence. GBAD tended to produce extreme detection confidence in every case, which may explain its marginal improvement over CHR. Meanwhile, Mask R-CNN was not able to suppress the incorrect detection of scissors, unlike our approach, which correctly suppressed every potentially wrong detection. The high prediction confidence on true positives and the consistent suppression of false positives account for the significant performance improvement of our proposed approach over the other methods.

Table 7. Multi-label detection confidence (%) for randomly sampled images shown in Figure 8.

Figure 9a shows that our approach accurately segmented the detected threat objects even though there was overlap among three objects, i.e., a knife, a wrench, and a gun. However, despite producing segmentations for all the correct detections, Figure 9b shows that our approach struggled to accurately delineate overlapping objects when most of them had high-density material properties. Still, for most of the non-overlapping objects, our approach correctly verified the detections. Figure 10 shows the false positive detections that were suppressed by our approach. These were regions in the negative samples that were predicted by Faster R-CNN to contain threat objects. Since DeepLabV3+ did not predict any segmentation mask for these regions, they were not included in the final prediction output. We observed an overwhelming number of false positives for the scissors class, most of which were suppressed by the second stage of our approach. It may also be the case that the other approaches produced too many unsuppressed false positive predictions for the scissors class, causing that class to suffer the most degradation compared to the other classes. Lastly, Figure 11 shows some of the failure cases of our approach. In these images, Faster R-CNN wrongly predicted suspected regions for which DeepLabV3+ generated segmentation masks. Most of the errors came from the wrong localization and verification of suspected knives and wrenches. This may be due to the plain features of the objects in these classes, which can easily be mistaken for other elongated metal items in the baggage. In contrast, the gun class benefited from very strong visual features and was consistently predicted with high accuracy, regardless of the imbalance.

Figure 11. Exemplar images of unsuppressed detections. Images are wrongly predicted to contain (a) a knife, (b) a knife, and (c) a gun. Images on the right are the inputs, and images on the left are the outputs with overlaid masks.
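The verification step described above, where a first-stage detection is kept only if the segmentation model assigns threat pixels inside the proposed region, can be sketched as follows. This is a hedged illustration assuming boxes in pixel coordinates and a per-pixel class-id mask; the function and parameter names are illustrative, not the authors' API:

```python
import numpy as np

def verify_detections(detections, seg_mask, min_pixels=1):
    """detections: list of (x1, y1, x2, y2, label, score) boxes.
    seg_mask: HxW integer array, 0 = background, >0 = threat class id.
    Keep a box only if the segmentation model predicted at least
    min_pixels pixels of the same class inside it."""
    verified = []
    for (x1, y1, x2, y2, label, score) in detections:
        region = seg_mask[y1:y2, x1:x2]
        if np.count_nonzero(region == label) >= min_pixels:
            verified.append((x1, y1, x2, y2, label, score))
    return verified  # detections with no mask support are suppressed

mask = np.zeros((100, 100), dtype=int)
mask[20:40, 20:40] = 3                 # segmented "knife" pixels (class 3)
dets = [(10, 10, 50, 50, 3, 0.9),      # overlaps the predicted mask -> kept
        (60, 60, 90, 90, 2, 0.8)]      # no mask support -> suppressed
print(verify_detections(dets, mask))   # only the first box remains
```

The design choice here is that the segmentation stage acts purely as a filter: it never adds detections, so recall is bounded by the detector while precision improves.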

Ablation Study
We examined each module of our threat detection pipeline to determine its individual contribution to the final algorithm performance. Table 8 shows the performance of four different configurations of the threat detection pipeline, evaluated on the SIXray100 subset. First, we considered only the detection model (Det), without the preprocessing and the verification by the segmentation model. Then, we added the preprocessing algorithm to the detection model (Crop + Det) while still omitting the segmentation. Next, we combined detection and segmentation (Det + Segm) but removed the preprocessing algorithm. Finally, we merged all the modules into one threat detection pipeline (Crop + Det + Segm). As demonstrated in our previous work [37], a marginal yet valuable improvement can be achieved by simply cropping out irrelevant areas in the image, such as the air gaps/spaces that are very common in X-ray security images. Cropping the images to expose only relevant information and extract more features mostly enhances the detection model's ability to detect more objects with higher confidence. Because it was trained on a balanced training subset, the detection model is forced to fit the vast amount of unseen data from the negative samples to the more familiar targets in the positive samples. While this increases the detection of true positives, it also inadvertently increases the detection of false positives. This issue proved to be the main limiting factor and was rectified by integrating the segmentation model to verify the initial predictions.
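The cropping step can be sketched as follows, under the assumption that air gaps appear as near-white (low-attenuation) rows and columns in the X-ray image; the threshold value and the function name are illustrative, not the exact procedure from our previous work [37]:

```python
import numpy as np

def crop_air_gaps(image, background=240):
    """Crop border rows/columns that contain only near-background pixels.
    image: HxW grayscale array; pixels >= background are treated as air."""
    content = image < background              # True where the beam is attenuated
    rows = np.where(content.any(axis=1))[0]
    cols = np.where(content.any(axis=0))[0]
    if rows.size == 0:                        # blank image: nothing to crop
        return image
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

img = np.full((100, 100), 255, dtype=np.uint8)   # all "air"
img[30:60, 40:80] = 100                          # baggage contents
print(crop_air_gaps(img).shape)                  # (30, 40)
```

Cropping this way enlarges the relative area of the baggage contents seen by the detector without discarding any object pixels.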

Conclusions
In this paper, we investigated the impact of integrating pixel-level analysis into the threat detection pipeline of a large-scale and imbalanced X-ray security image dataset. We reintroduced object separation as a unique task domain for analyzing X-ray security images and addressed the first part of the task, which aims to accurately delineate target object instances in X-ray images, including pixels that are shared by overlapping objects. We developed a straightforward and effective object separation pipeline composed of a detection model for localizing possible threat regions and a segmentation model for verifying the existence of threats in the predicted regions. We chose the appropriate detection and segmentation models by extensively evaluating current state-of-the-art deep learning models.
Our empirical results show that object-level and pixel-level approaches significantly outperformed image-level approaches for threat detection in X-ray security images. Our approach outperformed the baseline classification and end-to-end instance segmentation methods by up to 26.86% and 4.85%, respectively, on the subset that most closely mirrored a practical scenario. Furthermore, our approach consistently outperformed the previous works across all the subsets with increasing scales of imbalance. These preliminary results support the claim that our intuitive approach works effectively without the need to build an entirely new framework or devise a complicated data processing technique. Object separation has the potential to advance research in automated threat detection in X-ray security images as well as other X-ray imaging applications. However, exploration in this field is limited by the lack of pixel-level annotations for X-ray security image datasets. We believe that this study will encourage researchers to create more high-quality labeled datasets and develop more sophisticated approaches to fully solve the object separation task, which can help generalize detection to all prohibited items, such as weapons, explosives, and drugs. In our future research, we intend to create an end-to-end object separation framework.