A Pixel-Wise Foreign Object Debris Detection Method Based on Multi-Scale Feature Inpainting

Abstract: In the aviation industry, foreign object debris (FOD) on airport runways poses a serious threat to aircraft during takeoff and landing. FOD detection is therefore important for improving flight safety. In this paper, an unsupervised anomaly detection method called Multi-Scale Feature Inpainting (MSFI) is proposed to perform FOD detection in images, in which FOD is defined as an anomaly. The method adopts a pre-trained deep convolutional neural network (CNN) to generate multi-scale features for the input images. Based on these features, a deep feature inpainting module is designed and trained to reconstruct the regions masked by multi-scale grid masks. During inference, an anomaly map for a test image is obtained by computing the difference between the original feature and its reconstruction, and the abnormal regions are then identified and located from this map. The performance of the proposed method is demonstrated on a newly collected FOD dataset and the public benchmark dataset MVTec AD. The results show that the proposed method is superior to comparable methods.


Introduction
In the field of aviation, foreign object debris (FOD) refers to objects that appear on the pavements of the whole movement area, including an airport's runways, taxiways and aprons, and may cause damage to aircraft, such as screws, nuts, rubber blocks and stones [1]. Since the presence of FOD poses a serious potential risk to aircraft during takeoff and landing, it needs to be removed from the runway promptly. FOD detection is therefore an indispensable part of airport operation. Traditional FOD detection and removal methods, which rely on staff checking runways and other regions at regular intervals, are inefficient. The reliability of such methods is also unsatisfactory: small FOD items, such as nuts similar in color to the runway surface, are not easily detected by the naked eye. Visual inspection systems are now widely adopted for automatic FOD detection. Meanwhile, FOD detection has become a hot spot in academic research, and many achievements have been attained [2–5].
Deep-learning-based methods are widely used in object detection due to their effectiveness and generality [6–9]. However, the lack of labeled FOD images remains a major challenge in FOD detection tasks. In general, deep-learning-based detection methods require massive labeled datasets and a long period of supervised training. A comprehensive and balanced dataset covering different types of FOD is very difficult to collect and annotate in actual airport environments, leaving the model without effective supervision. More importantly, because FOD can be anything accidentally dropped on airport runways, a model trained on a limited set of FOD samples may fail to generalize to previously unseen objects. Supervised learning is therefore not the best choice for FOD detection. Conversely, unsupervised anomaly detection is expected to solve this problem. In basic anomaly detection tasks, only normal samples are available for training a machine learning system that can detect abnormal samples [10,11], which is exactly the goal of FOD detection. In FOD detection tasks, although images with real FOD are rare, images without FOD are abundant for training. Since labeled FOD samples are not required, unsupervised anomaly detection methods can quickly adapt to different airport environments.
Recent image anomaly detection methods focus on reconstructing the original image into a normal image through an autoencoder network [12–14]. The input image is assigned an anomaly score based on the reconstruction error. These methods assume that a model trained on normal images cannot generalize to abnormal images, that is, the reconstruction error of abnormal images is higher than that of normal images. However, this assumption does not always hold in practice: an autoencoder with strong generalization ability may reconstruct abnormal images well, making abnormal regions indistinguishable from normal regions by reconstruction error alone.
Methods based on pre-trained deep convolutional neural networks (CNNs) have recently been proposed for image anomaly detection [15–17]. Pre-trained CNNs are particularly helpful when the dataset is small and the normal regions exhibit randomness. These methods model the distribution of the pre-trained features of normal data using Gaussian mixture models or clustering. During inference, if the pre-trained features of an image deviate from the modeled distribution, the image is identified as anomalous. These methods provide excellent results on image-level anomaly detection but cannot perform anomaly localization. To tackle this problem, many methods operate in a region-based fashion, splitting images into smaller patches and determining the abnormality of each patch. This demands high computational resources and often leads to inaccurate localization.
In this work, we also leverage pre-trained CNNs to detect anomalies. However, instead of modeling the distribution of the pre-trained features, we train a feature inpainting model in a self-supervised manner to restore damaged feature maps into normal ones. The trained model can then detect abnormal regions by comparing the original and restored features of an image. In particular, multi-scale grid masks are designed to determine the removal and recovery regions in the feature maps. The proposed method is termed multi-scale feature inpainting (MSFI); it realizes unsupervised anomaly detection and localization by reconstructing incomplete multi-scale features generated by a pre-trained CNN. Extensive experiments on two datasets, MVTec AD [18] and FOD, are conducted for image-level anomaly detection and pixel-level anomaly localization.
The rest of this paper is organized as follows. Section 2 discusses recent methods for image anomaly detection. Section 3 introduces the overall anomaly detection framework in detail. Sections 4 and 5 describe the experimental setup and present the experimental results, respectively. Section 6 presents the ablation study. Sections 7 and 8 provide the discussion and conclusion, respectively.

Related Work
Unsupervised image anomaly detection methods require only normal images during training. These methods can be divided into two categories: image reconstruction and feature modeling.
In many recent reconstruction-based anomaly detection methods, autoencoders and their variants have been widely used, including plain autoencoders [19,20] and variational autoencoders [21,22]. The core idea of these methods is to convert an image into an abstract representation and then find its inverse mapping to reconstruct the original image. They assume that a model trained on normal samples cannot reproduce abnormal samples. However, an autoencoder may not only generalize well but also reconstruct abnormal samples well [23,24]. To address this problem, Gong et al. [24] proposed the memory-augmented autoencoder, which introduces a memory module for recording representations of normal samples. Since the reconstruction is composed of representations of normal samples, the reconstruction error of abnormal samples is increased. The generative adversarial network (GAN) [25] is also used for reconstruction, aiming to improve reconstruction quality through adversarial training. For example, Schlegl et al. [26] pioneered the use of GANs for image anomaly detection with AnoGAN. In addition, some methods [27,28] perform anomaly detection by masking multiple regions of the input image and using an autoencoder to reconstruct each masked region only from its neighborhood, rather than from the region itself. The assumption is that the probability of accurately reconstructing abnormal regions by generalizing neighborhood appearance is very low. For instance, Li et al. [27] proposed superpixel masking and inpainting (SMAI), which uses superpixel segmentation to determine the missing regions of an image.
Unlike image-reconstruction-based models, which detect anomalies in image space, feature-modeling-based methods [29–32] detect anomalies in feature space. For example, Ruff et al. [33] proposed deep support vector data description (Deep SVDD), which trains a neural network while minimizing the volume of a hypersphere containing the representations of normal samples. Since the network must map normal samples close to the hypersphere's center, minimizing the hypersphere's volume forces the network to extract features common to normal samples. However, because Deep SVDD maps the whole image to a single point in feature space, it can only infer whether an image is anomalous and cannot indicate the location of the abnormal regions. Patch SVDD was therefore proposed, which scores each patch to localize anomalies [34]. More recently, Bergmann et al. [35] proposed a student-teacher knowledge distillation framework for unsupervised anomaly detection, which uses deep features from pre-trained CNNs to detect anomalies through feature regression. Specifically, a pre-trained CNN (e.g., ResNet-18 [36]) is defined as the teacher network and several simple networks are defined as student networks. During training, the student networks are trained to imitate the behavior of the teacher network on normal images only. During testing, the anomaly score is calculated from the prediction errors between the outputs of the teacher network and the student networks. The method assumes that the student networks only learn to regress the teacher's output on normal images, so they may fail to predict the teacher's output on abnormal images.
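The feature-regression idea behind such student-teacher frameworks reduces to a per-pixel regression error. The sketch below is our own minimal illustration in plain NumPy (function and variable names are ours, not the cited authors' implementation): the anomaly score at each spatial position is the squared difference between teacher and student feature vectors.

```python
import numpy as np

def regression_anomaly_map(teacher_feat, student_feat):
    """Per-pixel anomaly score: squared l2 error between the teacher's and the
    student's feature maps, both of shape (h, w, c)."""
    return np.sum((teacher_feat - student_feat) ** 2, axis=-1)
```

On normal images the student matches the teacher and the map stays near zero; where the student fails to predict the teacher's features, the error, and hence the anomaly score, rises.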

Method
Figure 1 shows the framework of multi-scale feature inpainting (MSFI) for image anomaly detection, which contains four parts: the multi-scale feature generation module, the multi-scale grid masks module, the deep feature inpainting module, and the anomaly detection and localization module. Given an input image, the multi-scale features are first constructed using the pre-trained CNN. The multi-scale features are then transformed into masked feature maps via the multi-scale grid masks. Following this, the deep feature inpainting model recovers and reconstructs each masked feature map. Finally, the anomaly map is obtained by calculating the l2 distance between the original feature and its reconstructed version. The framework is detailed in the following sections.

Multi-Scale Feature Extraction
A pre-trained CNN is used to generate discriminative deep features, which are then fed into the deep feature inpainting module together with the multi-scale grid masks.
Suppose a CNN has L convolutional blocks, each consisting of multiple consecutive convolutional and pooling layers, and let I denote an input image of size h × w × c. Feeding I into the CNN yields a set of feature maps {φ_1(I), φ_2(I), . . . , φ_L(I)} from the L convolutional blocks. The size of the l-th feature map φ_l(I) is h_l × w_l × c_l. Since each feature map comes from a convolutional layer with a specific receptive field, it represents an abstract representation of the input image. In general, shallow convolutional layers with small receptive fields capture low-level features or local structural information, such as texture. In contrast, deep convolutional layers with large receptive fields capture high-level features or global semantic information. Therefore, the fusion of the feature maps {φ_l(I)}, l = 1, . . . , L, naturally forms a discriminative representation of the image. The fusion consists of two steps, as shown in Equation (1). First, each feature map φ_l(I) is resized to the spatial size (h_0, w_0) with its channels unchanged. Then, all the resized feature maps are concatenated into an integrated feature map:

f(I) = cat(resize(φ_1(I)), resize(φ_2(I)), . . . , resize(φ_L(I))),	(1)

where resize(·) denotes the resizing function, cat(·) denotes channel-wise concatenation, and f(I) denotes the generated multi-scale feature of size (h_0, w_0, c_0). The number of channels satisfies c_0 = ∑_{l=1}^{L} c_l.
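As a concrete sketch of Equation (1), the fusion is a resize-then-concatenate operation. The snippet below is an illustrative NumPy version (names are ours); it uses a simple nearest-neighbour resize, since the paper does not specify the interpolation scheme, so that choice is an assumption:

```python
import numpy as np

def resize_nn(fmap, h0, w0):
    """Nearest-neighbour spatial resize of an (h, w, c) feature map to (h0, w0, c)."""
    h, w, _ = fmap.shape
    rows = np.arange(h0) * h // h0
    cols = np.arange(w0) * w // w0
    return fmap[rows][:, cols]

def fuse_features(feature_maps, h0, w0):
    """Eq. (1): resize every per-block feature map to (h0, w0), keep channels,
    and concatenate along the channel axis."""
    return np.concatenate([resize_nn(f, h0, w0) for f in feature_maps], axis=-1)
```

For VGG-style blocks with, say, 3, 5 and 7 channels, fusing at (h_0, w_0) = (64, 64) yields a (64, 64, 15) map, matching c_0 = ∑ c_l.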

Multi-Scale Grid Masks
As mentioned above, MSFI first removes part of the regions in the multi-scale feature map and then makes the deep feature inpainting network learn to regenerate the original feature map. One question in this process is which regions should be removed. To answer it, two design principles are adopted. First, since anomalies may appear anywhere in the feature map, all regions should have an equal probability of being removed. Second, since anomalies may have different sizes, the removed regions should cover multiple scales.
To meet these requirements, multi-scale grid masks are designed to indicate the regions to be removed, where the pixel values of the removed regions are set to zero. As shown in Figure 2, the black grids indicate the regions to be removed, and the number of white grids equals that of black grids. The multi-scale grid masks are generated as follows. A mask with the same spatial size as the input feature map is first divided into (h_0/k) × (w_0/k) grids, where k is the grid size. All grids are then randomly divided into two disjoint sets S_g, g ∈ {1, 2}, each containing half of the grids. Following this, a mask M_{S_g} is generated for each grid set S_g; M_{S_g} is a binary mask in which the pixel values of the regions belonging to S_g are set to zero. Masks at different scales are obtained by changing the grid size. In this paper, three grid sizes are adopted, namely K = {2, 4, 8}.
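The mask-generation procedure above can be sketched as follows. This is an illustrative NumPy implementation (function and variable names are ours): for one grid size k it returns the complementary pair of binary masks M_{S_1}, M_{S_2}, each zeroing out half of the grid cells.

```python
import numpy as np

def grid_mask_pair(h0, w0, k, rng):
    """Split the (h0/k) x (w0/k) grid cells into two disjoint halves S_1, S_2
    and return the binary masks M_S1, M_S2 (zeros at the cells to be removed)."""
    gh, gw = h0 // k, w0 // k
    cells = rng.permutation(gh * gw)   # random assignment of cells
    half = gh * gw // 2
    masks = []
    for part in (cells[:half], cells[half:]):
        grid = np.ones(gh * gw)
        grid[part] = 0.0
        # expand each grid cell to a k x k pixel block
        masks.append(np.kron(grid.reshape(gh, gw), np.ones((k, k))))
    return masks
```

Because the two halves are disjoint and together cover all cells, every pixel is removed in exactly one of the two masks, so each pixel of the feature map gets inpainted exactly once per grid size.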

Deep Feature Inpainting
The U-Net network [37] is adopted to recover the removed regions of the multi-scale feature map. During training, a pair of binary masks M_{S_g} is first generated using a grid size k, randomly selected from the set K = {2, 4, 8}, and used to zero out the regions belonging to S_g in the multi-scale feature map f(I):

f_{S_g}(I) = f(I) ⊙ M_{S_g},	(2)

where f_{S_g}(I) is the masked feature map and ⊙ denotes element-wise multiplication. The masked feature maps f_{S_g}(I) are then fed into the network sequentially. The network reconstructs each masked feature map individually and outputs the partially reconstructed feature map f_{r_g}(I). Finally, the partially reconstructed feature maps f_{r_g}(I) are masked and summed into the entire reconstructed feature map f_r(I):

f_r(I) = ∑_{g=1}^{2} f_{r_g}(I) ⊙ (1_{h_0×w_0} − M_{S_g}),	(3)

where 1_{h_0×w_0} is an all-ones matrix of height h_0 and width w_0. The entire reconstructed feature map f_r(I) is assembled from the partially reconstructed feature maps f_{r_g}(I), in which the values of the regions not belonging to S_g are zeroed. Therefore, each f_{r_g}(I) contributes only the regions belonging to S_g, i.e., the regions removed from the original feature map.
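Equations (2) and (3) amount to: mask, reconstruct, then keep from each partial output only the removed regions. A minimal NumPy sketch (names are ours; the inpainting network is abstracted as a callable `inpaint`, a placeholder for the U-Net):

```python
import numpy as np

def merge_reconstructions(f, masks, inpaint):
    """Eq. (2)-(3): mask f with each binary M_Sg, reconstruct the masked map,
    and keep from each output only the removed regions (1 - M_Sg)."""
    f_r = np.zeros_like(f)
    for m in masks:
        f_masked = f * m[..., None]          # Eq. (2): element-wise masking
        f_rg = inpaint(f_masked)             # partial reconstruction by the network
        f_r += f_rg * (1.0 - m)[..., None]   # Eq. (3): keep removed regions only
    return f_r
```

Since the two masks of a pair are complementary, a perfect inpainter would make f_r(I) equal f(I) exactly; anomaly scoring exploits the fact that this equality fails on abnormal regions.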
The network is trained with a joint loss combining a distance loss L_val and a directional similarity loss L_dir. L_val is the averaged pixel-level l2 distance between the reconstructed feature f_r(I) and the original feature f(I), as shown in Equation (4); the smaller the distance, the higher the similarity between the two:

L_val(f(I), f_r(I)) = (1 / (h_0 w_0)) ∑_{(i,j)} ‖f(I)_{(i,j)} − f_r(I)_{(i,j)}‖_2^2,	(4)

where (i, j) is a spatial position on the deep feature maps. L_dir aims to increase the directional similarity between feature description vectors. Cosine similarity is used to measure the directional similarity between the reconstructed feature f_r(I) and the original feature f(I); the greater the cosine value, the higher the directional similarity. L_dir is defined as:

L_dir(f(I), f_r(I)) = 1 − (vec(f(I))ᵀ vec(f_r(I))) / (‖vec(f(I))‖ ‖vec(f_r(I))‖),	(5)

where vec(·) is a vectorization function transforming a matrix of arbitrary dimensions into a one-dimensional vector. Finally, the total loss function is defined as:

L(f(I), f_r(I)) = λ_val L_val(f(I), f_r(I)) + λ_dir L_dir(f(I), f_r(I)),	(6)

where λ_val and λ_dir are hyper-parameters balancing the weights of the distance loss and the directional similarity loss.

Anomaly Detection and Localization
During testing, masks with different grid sizes are used to remove regions from the multi-scale feature map, and the multiple outputs of the model are merged to compute the final anomaly map.
Given a test image I, the multi-scale feature map f(I) is first extracted. The feature map is then masked and reconstructed for each k ∈ K. The anomaly map A_k(I) for a grid size k is defined as the pixel-level l2 distance between the original feature f(I) and its reconstruction f_r(I):

A_k(I)_{(i,j)} = ‖f(I)_{(i,j)} − f_r(I)_{(i,j)}‖_2^2.	(7)

The final anomaly map A_final(I) is then obtained by averaging the anomaly maps A_k(I):

A_final(I) = (1/N) ∑_{k∈K} A_k(I),	(8)

where A_k(I) is the anomaly map generated using grid size k as defined in Equation (7), and N is the number of grid sizes.
Finally, the image-level anomaly score S is the maximum of A_final(I):

S = max_{(i,j)} A_final(I)_{(i,j)}.	(9)

The spatial size of A_final(I) is h_0 × w_0. To obtain an anomaly map of the same size as the input image, A_final(I) is upsampled by bilinear interpolation. To obtain the segmentation result, the anomaly map is binarized using the threshold corresponding to the maximum F1 score on the test set.
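Equations (7)–(9) reduce to a per-pixel squared distance, an average over grid sizes, and a spatial maximum. A compact sketch (NumPy, names ours; `reconstructions` maps each grid size k to its reconstructed feature map):

```python
import numpy as np

def anomaly_maps(f, reconstructions):
    """Eq. (7)-(9): per-grid-size anomaly maps, their average, and the
    image-level score as the spatial maximum of the final map."""
    maps = [np.sum((f - f_r) ** 2, axis=-1)        # Eq. (7)
            for f_r in reconstructions.values()]
    a_final = np.mean(maps, axis=0)                # Eq. (8)
    return a_final, float(a_final.max())           # Eq. (9)
```

The returned map would then be bilinearly upsampled to image resolution and thresholded for segmentation, as described above.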

Evaluation Metrics
The proposed method is evaluated for image-level anomaly detection and pixel-level anomaly localization. The area under the receiver operating characteristic curve (AUROC) [18] is used as the evaluation metric: image-level AUROC evaluates anomaly detection performance, while pixel-level AUROC evaluates anomaly localization performance. The F1 score is also reported for MSFI and the baselines.

Implementation Details
In the deep feature extraction module, a VGG19 [38] pre-trained on ImageNet [39] was used to produce deep features. The last three fully connected layers were removed, and the output feature maps of the final four convolutional blocks were selected for feature fusion. For all experiments on the MVTec AD and FOD datasets, the deep feature inpainting network was trained with the Adam optimizer using a batch size of 4 for 300 epochs. The initial learning rate was set to 1 × 10⁻⁴ and decayed to 1 × 10⁻⁵ after 200 epochs. During training, the weights of the pre-trained VGG19 were frozen, and only the weights of the deep feature inpainting network were updated. The proposed model was implemented with the deep learning framework PyTorch.

MSFI is evaluated on the MVTec AD and FOD datasets in terms of anomaly detection and localization and is compared with existing methods, namely AE-l2 [19], RIAD [40], MRKD [31] and DFR [41].

Anomaly Detection
Table 1 shows the anomaly detection results on the MVTec AD dataset. MSFI records the highest AUROC in nine categories and surpasses the other anomaly detection methods in average AUROC, outperforming the best baseline, DFR, by 1%. Table 2 shows the anomaly detection results on the FOD dataset. MSFI achieves the highest average AUROC and obtains the highest AUROC for nine foreign object classes. MSFI also achieves superior F1 scores on both datasets.

Anomaly Localization
Table 3 presents the anomaly localization results on the MVTec AD dataset. MSFI exceeds the recent state-of-the-art method DFR in eight categories and achieves a higher average AUROC. Table 4 displays the anomaly localization results on the FOD dataset, where MSFI outperforms all of the tested methods; it surpasses DFR in eleven classes and attains a higher average AUROC. In addition, MSFI is simpler than DFR, as it extracts CNN feature maps from only 4 convolutional layers, whereas DFR requires 16 convolutional layers to generate its regional features. The results in Tables 3 and 4 also show that MSFI outperforms the other models in F1 score on both datasets.
The qualitative comparison between MSFI and the other methods on the MVTec AD and FOD datasets is visualized in Figures 5 and 6, which show the anomaly maps of all methods and the segmentation maps of MSFI. For visualization, each anomaly map is normalized to the range [0, 1] and superimposed on the corresponding test image. MSFI generally produces more reasonable anomaly maps than the other methods: the reconstruction error is low for normal regions and high for abnormal regions, reducing both false detections of normal regions and missed detections of abnormal regions. Visually, MRKD, DFR and MSFI capture abnormal regions more accurately than the image-reconstruction-based AE-l2 and RIAD, presumably because the former adopt pre-trained features that bolster their pattern-recognition abilities. This indicates that methods using pre-trained CNNs are better suited to image anomaly detection than methods learning image representations from scratch.

Hierarchical Features
This study adopts a series of different hierarchical features (that is, the last, the last two, the last three and the last four convolutional blocks) to construct the model. The effectiveness of the models with different hierarchical features is evaluated on the MVTec AD and FOD datasets, and the results are shown in Table 5. The performance of MSFI clearly improves as the number of layers increases.
Figure 7 presents the qualitative results of MSFI with different hierarchical features on the MVTec AD and FOD datasets. With more hierarchical features, the regions incorrectly detected as anomalies gradually decrease, and the predicted abnormal regions gradually approach the real abnormal regions. This is because features drawn from more hierarchy levels encode more local detail and spatial context, making detection more robust and accurate.

Loss Function
This section analyzes the impact of each loss component on MSFI. Table 6 reports the average AUROC over all classes for anomaly detection and anomaly localization on the MVTec AD and FOD datasets. MSFI combining the two loss functions performs best, showing that anomaly detection performance can be improved by considering the directional similarity between feature vectors.

Grid Size
This part analyzes the impact of the grid size on MSFI. The results of MSFI with a single grid size on the MVTec AD and FOD datasets are shown in Table 7. To evaluate this influence, a single grid size is used both during training and during testing. Table 7 shows that the grid size has a strong influence on the detection results. Regardless of the grid size, the deep features of abnormal regions remain difficult to reconstruct, because the model must infer them from the surrounding regions, which is harder than reconstructing abnormal regions with a plain deep autoencoder.
The qualitative results of the model trained with different grid sizes are shown in Figure 8. As the grid size increases, abnormal regions become harder to recover and receive high anomaly scores. However, normal regions also become harder to recover and are likewise assigned high anomaly scores, especially normal regions with more randomness. As shown in the fourth row of Figure 8, the regions with marker lines in the image produce higher anomaly scores. Combining different grid sizes helps generate high anomaly scores in abnormal regions while maintaining low anomaly scores in normal regions, as shown in column (f) of Figure 8.

Discussion
In this study, current FOD detection methods based on optical images were investigated. Existing FOD detection methods are mainly based on supervised learning, which requires massive labeled images. However, since FOD is not a clearly bounded category, FOD samples cannot be collected comprehensively or easily; consequently, applying such a method to a new airport would require a very time-consuming collection of FOD samples. Supervised learning is therefore unsuitable for FOD detection. Conversely, this work introduces unsupervised anomaly detection to perform FOD detection on pavement images for the first time, requiring no real FOD samples. Images containing FOD are defined as abnormal and images without FOD as normal. Since real FOD samples are not required during training, unsupervised anomaly detection methods can quickly adapt to a new airport.
Current anomaly detection methods can be divided into two categories: image reconstruction and feature modeling. Image-reconstruction-based methods mainly use an autoencoder to map an original image to a normal image, assuming that anomalous regions will not be reconstructed well. Feature-modeling methods leverage pre-trained features of normal images to train the model, assuming that the distance between an abnormal image and a normal image in feature space is larger than that between normal pairs. Although these methods achieve some success, the proposed method, which combines pre-trained features with self-supervised learning, is more effective. In the comparison experiments, methods such as AE-l2, RIAD, MRKD and DFR raised many false alarms in regions with pavement defects. The results show that the proposed method can better distinguish FOD from pavement defects.
Hundreds of aircraft take off and land at airports every day, which means that the time available for FOD detection and cleanup is extremely limited. To minimize the interference of FOD detection with airport operations, the detection method needs to run in real time. In this paper, the experiments were performed on an Nvidia GeForce RTX 2080 Ti GPU and an Intel I9-9940 CPU @ 3.30 GHz. In the anomaly detection phase, inference takes about 0.07 s per image, i.e., MSFI runs at about 15 fps. There is still room for improvement in inference speed, which will be investigated in future work.

Conclusions
This study proposes a multi-scale feature inpainting method to perform FOD detection on images with various pavement backgrounds while requiring no real FOD samples. A pre-trained CNN is fully utilized to establish discriminative multi-scale features for the images. A deep feature inpainting module is designed and trained to reconstruct the regions removed by multi-scale grid masks so that they match normal features. During testing, abnormal regions, i.e., FOD, are inferred from the difference between the original feature and its reconstructed version. Furthermore, a new dataset (FOD) containing 9042 airfield pavement images covering 15 types of FOD is established for FOD detection. Extensive experiments and analysis on the FOD dataset and the public benchmark dataset MVTec AD indicate that the proposed method is effective and outperforms other methods.

Figure 1 .
Figure 1. Overview of multi-scale feature inpainting (MSFI). It consists of four parts: multi-scale feature generation, multi-scale grid masks, deep feature inpainting, and anomaly detection and localization.

Figure 2 .
Figure 2. Visualization of the multi-scale grid masks.
Datasets

Specially designed to evaluate the performance of unsupervised image anomaly detection methods, the MVTec AD dataset contains 5354 images. It includes 10 object classes and 5 texture classes with more than 70 different types of anomalies, such as breaks, contamination, holes and other structural defects. The images of each class are divided into a training set and a testing set; the former contains only normal images, while the latter contains both normal and abnormal images. Each abnormal image has a pixel-level annotation, making the MVTec AD dataset well suited for evaluating unsupervised image anomaly detection methods. Figure 3 shows examples of normal and abnormal images in the MVTec AD dataset.

Figure 3. Examples of normal and abnormal images for each class in the MVTec AD dataset. For each class, the top row shows a normal image and the bottom row an abnormal image.

Figure 4.

Figure 5. Qualitative comparison between the proposed method and other methods on the MVTec AD dataset. Input denotes the input abnormal image; Ground Truth denotes the actual abnormal regions (in white); AM denotes the anomaly map; SM denotes the segmentation map. Red regions indicate high anomaly scores in the AM, the solid red line marks the boundary of the actual abnormal region, and green regions indicate the predicted abnormal regions in the SM.

Figure 6.

Figure 7 .
Figure 7. Qualitative results of MSFI with increasing hierarchical features on the MVTec AD and FOD datasets. Last l: anomaly map of MSFI using the hierarchical features from the last l convolutional blocks.

Figure 8 .
Figure 8. Qualitative results of MSFI trained and evaluated with a single grid size on the MVTec AD and FOD datasets. AM (k = 2), AM (k = 4) and AM (k = 8) respectively denote the anomaly maps of MSFI trained with a single grid size of 2, 4 and 8. AM (K = {2, 4, 8}) denotes the anomaly map of MSFI trained and evaluated on the set K = {2, 4, 8}.

Table 1 .
Anomaly detection results on the MVTec AD dataset. The best result for each class is shown in bold.

Table 2 .
Anomaly detection results on the FOD dataset. The best result for each class is shown in bold.

Table 3 .
Anomaly localization results on the MVTec AD dataset. The best result for each class is shown in bold.

Table 4 .
Anomaly localization results on the FOD dataset. The best result for each class is shown in bold.

Table 5 .
Results of MSFI on the MVTec AD and FOD datasets with different hierarchical features. Image-level AUROC denotes the average AUROC over all classes for anomaly detection; pixel-level AUROC denotes the average AUROC over all classes for anomaly localization.

Table 6 .
Results of MSFI using different loss functions on the MVTec AD and FOD datasets.

Table 7 .
Results of MSFI trained and evaluated using a single grid size on the MVTec AD and FOD datasets.