Exploring the Intrinsic Probability Distribution for Hyperspectral Anomaly Detection

In recent years, neural network-based anomaly detection methods have attracted considerable attention in the hyperspectral remote sensing domain due to the powerful reconstruction ability compared with traditional methods. However, actual probability distribution statistics hidden in the latent space are not discovered by exploiting the reconstruction error because the probability distribution of anomalies is not explicitly modeled. To address the issue, we propose a novel probability distribution representation detector (PDRD) that explores the intrinsic distribution of both the background and the anomalies in original data for hyperspectral anomaly detection in this paper. First, we represent the hyperspectral data with multivariate Gaussian distributions from a probabilistic perspective. Then, we combine the local statistics with the obtained distributions to leverage the spatial information. Finally, the difference between the corresponding distributions of the test pixel and the average expectation of the pixels in the Chebyshev neighborhood is measured by computing the modified Wasserstein distance to acquire the detection map. We conduct the experiments on four real data sets to evaluate the performance of our proposed method. Experimental results demonstrate the accuracy and efficiency of our proposed method compared to the state-of-the-art detection methods.


Introduction
Hyperspectral imagery (HSI) captures hundreds of continuous narrow bands covering a wide range of wavelengths to record the spectral signature with abundant information [1].The intensity of a specific object in the hyperspectral image reflects the reflectance or radiance that mainly depends on the category to which the object belongs [2].Therefore, it is possible to apply the unique spectral characteristic to distinguish numerous ground objects [3].With the improvement of hyperspectral technological sensors of spectral resolution [4], HSI has been involved in multiple domains such as aerial reconnaissance [5], military surveillance [6], and marine monitoring [7], where hyperspectral anomaly detection acts as a critical technology [8].
Hyperspectral anomaly detection is considered a particular case of target detection [9], [10].It requires no prior knowledge for background or specific objects compared to target detection tasks [11], [12].As a consequence, it owns a promising application prospect.The leading hyperspectral anomaly detection theory considers that the objects in HSI can be categorized into the background component and the anomaly component [13], [14], [15].Thus, it is rational to model hyperspectral anomaly detection as a binary classification problem.The object whose spectral signature is significantly dissimilar to their adjacent local neighborhood is regarded as an anomaly [16].The objects in most of the continuous areas belong to the background, while anomalies only occupy a small portion of the image [17].Hence, the distribution of anomalies is sparse [18].
Traditional hyperspectral anomaly detection methods pay much attention to the characteristic of background [19], [20].
The landmark Reed-Xiaoli (RX) detector [21] holds the assumption that the whole background obeys a multivariate Gaussian distribution.Based on the generalized likelihood ratio test, the Mahalanobis distance between the test pixel and the mean value of background distribution is calculated.When the distance is greater than a specific threshold, the hypothesis that the test pixel belongs to an anomaly can be verified.Two typical versions of RX, namely global RX (GRX) detector and local RX (LRX) detector [22], evaluate the background by global statistics and local statistics in HSI, respectively.However, it is difficult for all types of background objects to conform to a single normal distribution from a more general viewpoint.To overcome the limit, Kernel RX (KRX) [23] was proposed to model the original data with a nonGaussian model in high dimensional feature space.However, the computational cost remains a troublesome issue to be optimized.
Recently, with the development of compressed sensing technology in the signal processing domain [24], hyperspectral anomaly detection methods have been developed based on representation learning [25] and matrix decomposition theories [26].Collaborative representation detector (CRD) [27] utilizes the conclusion that the background objects can be approximately represented by their local neighborhood, while the anomalies can not.Thus the reconstruction error is leveraged to distinguish the anomalies from the background objects.Moreover, Robust PCA (RPCA) [28] method separates the original hyperspectral data into a background component and an anomaly component by implementing a simple matrix decomposition procedure [29], therein the anomaly component is used for the following detection task.The method assumes that the background objects lie in a single subspace with no complicated implicit structure, whereas it is not realistic in most hyperspectral scenes.To surmount the constraint, low rank and sparse representation (LRASR) model [30] was proposed by incorporating the low rank representation (LRR) theory [31] into the sparse representation detector (SRD) [32].LRASR employs the concept of background dictionary, which considers potential extensive categories of background objects.The nuclear norm constraint is added to the coefficients matrix of background to acquire the lowest rank representation of entire spectra jointly, and 2,1 constraint is imposed on the anomaly part to make the distribution of anomalies sparse.However, the detection performance of the LRASR method depends largely on the selection of the background dictionary.Furthermore, low rank and sparse matrix decomposition (LSMAD) detector [33] inserts a noise term in the matrix decomposition procedure to alleviate the contamination simultaneously.Then, the Mahalanobis distance is computed by using the background statistics to obtain the final detection result.In addition, Li et al. [34] integrates the variational Bayes with the low-rank and sparse decomposition model.The anomaly component can be easily separated from the noise, which is modeled by a mixture of Gaussian.
Apart from the aforementioned methods, novel detectors have been proposed based on neural networks in recent years.The convolutional neural network (CNN) based model [35] trains CNN by constructing the training sample pairs.Moreover, Autoencoder (AE) based framework has become prevalent due to the simplicity.Beti et al. [36] proposed a semi-supervised learning hyperspectral anomaly detection algorithm based on AE.The model trains the neural network with background samples according to the known ground truth.Thus the anomaly pixels produce large reconstruction errors to separate them from the background objects.However, regardless of AE-based or CNN based models, no consideration is attached to local spatial information that makes a big difference in hyperspectral anomaly detection.To address this issue, Lu et al. [37] incorporated the embedding manifold constraint into AE to sufficiently utilize the local structural information.Besides, [38] is the first method to apply variational autoencoder (VAE) [39] to anomaly detection tasks, in which variational inference is integrated with AE in the form of probability distribution.Despite the high reconstruction ability of VAE [40], the intrinsic probability distribution cannot be thoroughly explored.
Most hyperspectral anomaly detection methods rarely consider the explicit distribution of the anomalies.RXbased approaches model the background as a multivariate Gaussian distribution, but the distribution of anomalies is ignored.Ordinary AE and VAE based methods exploit the reconstruction error to distinguish the anomalies from the background.However, intrinsic distribution information is not explored because neither VAE nor AE models the distribution of anomalies.Moreover, some of them work in a semi-supervised manner, which indicates the ground truth needs to be known in advance.Hence, the motivation of this paper is to discover the probabilistic properties of the hyperspectral data from a probabilistic perspective and explore a valid representation that works in an unsupervised manner with higher performance of generalization.
In this paper, we propose a novel probability distribution representation detector (PDRD) based on VAE structure for hyperspectral anomaly detection without using reconstruction error, which explicitly represents the HSI with multivariate Gaussian distributions and detects the anomalies by employing the modified Wasserstein distance.The VAE structure is adopted to acquire the representation of training samples in a probabilistic modality.The outputs of the encoder, including the mean value vector and the standard deviation vector, jointly constitute a multivariate Gaussian distribution in the latent feature space.Then we integrate the spatial information with the obtained distribution for each pixel by introducing the concept of Chebyshev neighborhood.For each pixel in the spatial space with its corresponding probability distribution, we define its Chebyshev neighborhood and compute the average expectation distribution of the pixels in the neighborhood to estimate the local statistics.Finally, we employ the modified Wasserstein distance to evaluate the difference between the corresponding distribution of the test pixel and the average expectation distribution that corresponds to the pixels in the neighborhood.In this way, we combine both spectral and spatial information to obtain the final detection results.
The main contributions of the paper can be concluded as follows.
1) We propose a framework to represent both the background and the anomalies in HSI by multivariate Gaussian distributions, which can discover the probabilistic characteristic of the original data in the latent space.
2) Instead of exploiting reconstruction error, we integrate local statistics with probabilistic structural information by constructing the Chebyshev neighborhood for each pixel.
3) We build a valid criterion according to the actual property of HSI to evaluate the difference between two probability distributions, which highlights the anomalies and suppresses the background pixels.
The remainder of this paper is organized as follows.Section II briefly introduces the related work.In Section III, we elaborate on our proposed method.Experimental results on four data sets are presented in Section IV.Finally, Section V draws the conclusion.

Related Work
The VAE model is a typical generative neural network, which has been broadly applied in computer vision tasks recently.It estimates the probability distribution of the training data and then samples examples from the learned probability distribution, aiming to generate new images that look similar to the original data.Different from the original AE framework, VAE views the reconstruction problem from a probabilistic perspective.
For each sample , the goal of VAE is to maximize ( ).According to the law of total probability, ( ) is represented as where indicates the latent variable to describe the implicit probability distribution.( | ) is intractable in the actual calculation procedure, thus variational inference is imposed to solve the dilemma.The loss function of VAE is then modeled by where and denotes the parameters of the encoder and decoder, respectively.indicates the total number of training samples.The first term is called reconstruction likelihood, which measures the log-likelihood log ( | ) of sampling from ( | ).The second KL-divergence term is usually called complexity constraint.It avoids the estimated posterior ( | ) deviating from the prior ( ) too much.The objective function of VAE ensures reconstruction accuracy, along with the generative ability of the model.
As an anomaly detection method using reconstruction error, VAE is more practical than AE due to prominent representational ability.For a given sample , the probabilistic encoder generates the mean value vector and standard deviation vector in the latent feature space.To reconstruct from the latent space, these two vectors are combined with a random term and then fed into the decoder network.The reconstruction error is employed for the following anomaly detection process.For traditional images, VAE has achieved a satisfactory effect on anomaly detection problems [41], [42].
Ordinary neural network-based methods such as AE and VAE weigh the reconstruction error as a vital term in anomaly detection tasks.However, the difference between the anomalies and the background can be discriminated through the reconstruction error when training samples contain only background pixels.In other words, these methods that depend on reconstruction error require the ground truth of the image to be known in advance.Neither AE nor VAE can serve as an unsupervised method, which heavily limits the practical application.

Methodology
In this section, we elaborate on our proposed method based on the VAE architecture, which is implemented in an unsupervised learning manner.As illustrated in Fig. 1, the proposed PDRD comprises three steps: 1) Represent each pixel by a multivariate Gaussian distribution in the latent space; 2) Select local Chebyshev neighborhood for each pixel to leverage the spatial information; 3) Obtain the detection map by manipulating the modified Wasserstein distance between the corresponding distribution of the test pixel and the average expectation of its Chebyshev neighborhood.

The framework
Traditional RX-based methods mainly focus on the probability distribution of background objects and assume that they obey a multivariate Gaussian distribution while the probability distribution of anomalies is hardly considered.If all training samples are modeled in a unified manner, the probability distributions can be expressed precisely.Thus the discrimination between the background and the anomalies can be evaluated directly.
Fig. 1 illustrates the whole framework of our proposed method.First, the training sample ∈ ℝ is fed into the encoder module 1 .Since our method is implemented in an unsupervised manner, the whole pixels in HSI constitute the training samples.Next, the output of 1 is fed into the probability representation part 2 to generate the mean value vector and standard deviation vector.Then, we combine these two vectors with 3 to produce a specific sample.Finally, the decoder module 4 reconstructs the original data from the computed sample.The training is performed by minimizing the modified loss function.Specifically, the proposed network mainly comprises the four modules that take on unique responsibilities. 1 denotes the encoder module that comprises three fully connected layers with 400 nodes.
2 is the probability representation part, which consists of the mean value module and the standard deviation module of the training sample.Both of the modules comprise a single layer with 20 nodes.These two outputs constitute the Gaussian distribution of the data in the latent feature space.3 represents the sampling module, which converts the probability distribution ( and ) to a specific sample by imposing a random term .4 comprises six fully connected layers with 20 nodes.It is the decoder module to recover the original image.Both of the activation function used in the encoder and decoder network are ReLu.
The VAE structure seeks to construct a latent variable that reflects the features of the image in some aspects while exploring the latent representation in the probabilistic space.Moreover, the reconstructed image is generated by sampling from the latent distribution.Consequently, we regard the latent distribution as the probabilistic representation of the original data.In this paper, we creatively exploit the Gaussian distribution in the latent space to provide an explicit expression for the whole image, which significantly promotes the convenience of exploring the latent characteristic.

Implementation of dimensional independence
Different dimensions in the feature space signify different latent attributes of the original image.However, there is no explicit meaning such as size, width, or angle for each dimension corresponding to the HSI.Each pixel acts as a training sample fed to the neural network, which is substantially different from the ordinary optical image.The latent dimensions may stand for a more intrinsic structural meaning at a deeper level.
For a pixel in HSI = =1 ∈ ℝ × , we extract its distribution ( ) from 2 , which consists of the mean vector and standard deviation vector of the multivariate Gaussian distribution in the latent space.However, each dimension of the distribution is not strictly independent of each other, causing impediments to compute the dissimilarity with other distributions.Hence, we strive to resolve the problem for the simplicity of implementation.
For the distribution ( ) that contains latent dimensions, if all dimensions of are mutually independent, then ( ) is equal to ∏ =1 ( ).Thus, the correlation between different dimensions can be converted to measure the difference between ( ) and ∏ =1 ( ).Our strategies that ensure mutual independence between different dimensions stem from the property.
The loss function acts as the optimization goal of the entire network.In other words, we can reshape our loss function to optimize the framework towards the desired orientation.By adding a weight coefficient [43], the expression of loss function changes to Moreover, the KL divergence term in eq. ( 3) can be written in the form of expectation by [44] ( ) [ ( ( where random variable indicates the index of selected training pixel, and it obeys a uniform distribution.We can see from eq. ( 4) that the newly constructed loss function contains the term (ii) called total correlation, and the expression owns the ability to evaluate the dimensional correlation.Therefore, in order to reduce the total correlation term to the greatest extent, the optimization emphasis should be more focused on the KL divergence term rather than the expectation term in eq. ( 3) from a global viewpoint.The optimization balance is broken up while enhancing the value of , and the KL divergence term receives more attention in the optimization process.Dimensional correlation declines gradually as the weight parameter increases, and it finally falls to a low level where it can be considered negligible.As a consequence, the dimensional independence can be satisfied theoretically.

Selection of Chebyshev Neighborhood
From the process of probability representation for each pixel, we acquire the corresponding multivariate Gaussian distribution in the latent space with independent dimensions.In this way, the spectral information is fully exploited.Thus, we seek to combine the spatial information with the obtained distributions.
Given a test pixel with the two-dimensional (2D) spatial coordinate ( , ), we define the point set with Chebyshev neighborhood by where represents the index of points to be selected, and ( , ) denotes the coordinate of the point that satisfies the eq.( 5).
As mentioned before, each pixel of the image corresponds to a definite multivariate Gaussian distribution in the latent feature space.To approximate the local statistics of the test pixel, we introduce a random variable ∼ ℕ ( , ) that reflects the average expectation of Chebyshev neighborhood for the test pixel and it can be characterized by where indicates the mean value vector of the distributions of -th pixel in , denotes the standard deviation of -th dimension of -th pixel in .The diagonal elements of are composed by the average variance of each dimension of the probability distribution corresponding to the pixels in the Chebyshev neighborhood.

Measurement of Different Distributions
In this step, we measure the dissimilarity of corresponding distributions between the test pixel and the average expectation of its Chebyshev neighborhood.In this paper, the concept of Wasserstein distance is adopted to measure the distinction between two distributions, which is also employed in generative adversarial networks (GAN) to alleviate the vanishing gradient [45].
For two acquired Gaussian distributions =  1 1 , 1 and =  2 2 , 2 on ℝ , with respective mean value vectors 1 and 2 ∈ ℝ , and symmetric positive semi-definite covariance matrices 1 and 2 ∈ ℝ × , the Wasserstein distance between and is given by [46] ( , where   (⋅) denotes the trace of a matrix.Obviously, the computation of eq. ( 8) is troublesome due to the complicated expression of the covariance matrix with high dimensionality.
By observing the expression of eq. ( 8), we know that the relationship among different dimensions of makes a considerable difference.As the aforementioned results demonstrate, all dimensions of are mutually independent, then degenerates into diagonal matrix denoted by where and , ( ∈ {1, 2, ⋯ }) represent the standard deviation of each dimension for 1 and 2 , respectively.Through a series of theoretical derivations, eq. ( 8) can be simplified to following equation: where = 1 , 2 , ⋯ , and = 1 , 2 , ⋯ , .According to eq. ( 10), it is tractable to evaluate the difference between the probability distributions and .

Modified Wasserstein distance
As eq. ( 10) suggests, the standard deviation shares equal significance with the mean value.However, in real hyperspectral scenes, the intensity of pixels from different categories oscillates much more wildly than the pixels from the same category.Hence, the difference between the anomalies and the background is biased towards the mean value rather than the standard deviation.To reshape the mutual relationship of the two terms, we attach a weight parameter to the standard deviation term as follows: Thus, it is tractable to measure the anomalous degree of the test pixel by computing the modified Wasserstein distance between the distribution of the test pixel and the average expectation distribution of the pixels in the Chebyshev neighborhood.The larger value of the computed Wasserstein distance indicates the greater degree of deviation from the average expectation for the test pixel.Accordingly, the pixel is more likely to become an anomaly.By reshaping the computed distance vector, we obtain the final detection map.The proposed method is elucidated in Algorithm 1.

Experimental results
In this section, four real-world hyperspectral data sets are utilized to evaluate the accuracy and efficiency of the proposed method.

San Diego Data Set
It is a widely used data set collected by The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor from San Diego, CA, USA.The spectral resolution is 10 nm, and the spatial resolution is 20 m.It has 224 bands in total ranging from 370 to 2510 nm.After eliminating the water absorption, low signal-to-noise (SNR), and bad quality bands (1-6, 33-35, 97, 107-113, 156-166, and 221-224), the remaining 189 bands are exploited for the experiments.The original image has a spatial size of 400 × 400, and a sub-

Pavia City Data Set
The hyperspectral data set was collected by the reflective optics system imaging spectrometer (ROSIS) sensor, which covered the city center of Pavia in northern Italy.It has 205 spectral channels in total, covering the wavelength ranging from 430 to 860 nm, along with a spatial resolution of 1.3 m.The image has a geometric size of 150 × 150 pixels.The 2D visualization of the Pavia City data set and the corresponding ground truth are illustrated in Fig. 3.

Gulfport Data Set
The data set was captured by the AVIRIS sensor over the airport area of Gulfport, MS, USA.The spatial resolution is 3.4 m, and the spatial size is 100 × 100 pixels.It contains 191 spectral channels spanning the wavelength of 550 to 1850 nm.In the scene, three planes of different sizes are regarded as anomalies.The main background of the image includes roads, vegetation, and runways.The 2D visualization of the Gulfport data set and its corresponding ground truth are depicted in Fig. 4.

Jasper Ridge Data Set
It is a popular hyperspectral data set acquired by the AVIRIS sensor over Jasper Ridge in central California.The spectral resolution is up to 9.46 nm.The image records 224 channels ranging from 380 to 2500 nm.After eliminating the bands 1-3, 108-112, 154-166, and 220-224 (due to dense water vapor and atmospheric effects), the remaining 198 channels are exploited for our experiment.The original image has a spatial size of 512 × 614 pixels, and a sub-image of 100 × 100 is used for the experiment.Since the original ground truth holds the modality of probability, several contaminated objects that possess an ambiguous category are chosen as anomalies.The main background is composed of water, roads, dirt, and trees.The pseudocolor image of Jasper Ridge data and its corresponding ground truth are shown in Fig. 5.

Competitors And Parameter Setup
The following state-of-the-art methods are compared with our proposed method.
1) RX detector [21] is a benchmark hyperspectral anomaly detector.It assumes that the background satisfies a multivariate Gaussian distribution.A dual-window version of RX detector is exploited in the experiments due to the high detection performance.
2) LRASR [30] adopts a background dictionary that can fully discover the implicit background structure in the latent subspace by low rank and sparse representation.A profound separation between the background and anomalies can be achieved, and the separated anomaly part is exploited to detect anomalies.
3) LSMAD [33] decomposes the original data into a background part, an anomaly part, and a noise part.The Mahalanobis distance that reflected background signature is computed for the following detection process.4) CRD [27] assumes that the background pixels can be represented by the linear combination of its spatial neighborhoods, whereas anomalies cannot.5) AE [36] attempts to recover the background pixels by the structure of neural network.The anomalies hold larger reconstruction errors than the background pixels, which is the principle to distinguish the anomalies from the background.
6) LSDM-MoG [34] models the noise component with a mixture of Gaussian distributions.The anomalies are separated from the noise components by variational Bayes.
The parameters are set for the optimal value to make a fair comparison.For RX, the outer window size is set to 19, 25, 27, 25 for San Diego, Pavia City, Gulfport, and Jasper Ridge respectively, while the inner window size is set to 17, 19, 25, 23.For LRASR, parameters and are both set to 0.1.For LSMAD, it is known that the rank value determines the detection accuracy.The is set to 4 for all the considered data sets.For CRD, the is set to 17, 15, 23, 25 and is set to 15, 13, 19, 7 for the corresponding data sets.For LSDM-MoG, the initial rank 0 and the number of mixture Gaussian noise K are set to 60 and 4, respectively.

Detection Performance
Receiver operating characteristic (ROC) curve serves as a valid criterion in hyperspectral anomaly detection tasks.The superiority and effectiveness of our proposed method are illustrated by comparing the ROC curves.Fig. 6 illustrates the ROC curves on four data sets.In addition, the   area under the ROC curve (AUC) value of ( , ) is used to evaluate the detection performance quantitatively.The larger AUC value manifests better detection performance.Furthermore, the AUC score of ( , ) estimates the ability of background suppression [47].The lower value indicates higher performance of background suppression.Tables I and II show the AUC scores of ( , ) and ( , ) on the four real HSIs.

Results of San Diego data
Fig. 7-I displays the visualization of detection results on the San Diego data set.Concerning the San Diego data set, the general background is relatively more complex than others.The LSDM-MoG method holds the worst performance among these methods.Neither it detects many anomalies nor excludes enough background pixels.The RX and LRASR methods fail to detect a certain amount of anomalies, causing poor detection performance.The LS-MAD method integrates matrix decomposition theory with the classical RX detector and employs the noise term in the matrix decomposition process.However, both the anomalies and the background are suppressed.Consequently, the anomalies are not evident enough to be distinguished from the background pixels.Moreover, the AE method detects most of the anomalies, while several background objects whose spectral signature is similar to the airplane anomalies are also detected as anomalies.As a result, the algorithm receives bad detection performance.In contrast, our proposed method achieves outstanding performance, and the detected anomalies are distinct.ROC curves are shown in Fig. 6(a) to demonstrate the effectiveness of our proposed method.The ROC curve of our proposed method has a quick responding speed when the false alarm rate is nearly 10 -2 .When the detection of probability reaches 1, the false alarm rate is better than some other methods.
The AUC scores of ( , ) on the San Diego data set are shown in Table I.Evidently, our proposed method achieves the second-largest AUC value of ( , ) and yields better detection performance than other methods by a large margin except for CRD.

Results of Pavia City data
The visualization of detection results on the Pavia City is shown in Fig. 7-II.RX and LSDM-MoG highlights both the background and the anomalies, leading to a high false alarm rate.The detected anomalies of LRASR contains background components more or less, due to the large detection value of the background pixels.LSMAD faces the dilemma that the intensity of some anomalies is very low.CRD misses several targets, generating lower detection performance.Furthermore, the detection performance of AE attenuates because several bridge objects are detected as anomalies.Our proposed method fully utilizes global information of the data and incorporates the strength of local neighborhood into the detection process to enhance the saliency of anomalies.
ROC curves are plotted for the Pavia City to compare the detection performance qualitatively, as Fig. 6(b) shows.Notably, our proposed method is the fastest to reach the position where the probability of detection is equal to 1.It means the minimal intensity of the detected anomalies in our proposed method is more significant than that of other detectors.In the preliminary stage, the probability of detection is over 0.5, which dominates the value among these methods.When the probability of detection approaches 1, the false alarm rate is less than 10 -2 , which precedes other methods by a large margin.
The background condition of Pavia City data set is not as complicated as that of the San Diego data.Hence, all methods yield excellent detection performance.The AUC value of ( , ) in our proposed method is 0.9993, which is very near to 1.The value is convincing because it is consistent with the aforementioned analysis outcomes concerning the visualization of detection results and ROC curves.

Results of Gulfport data
Fig. 7-III shows the visualization of detection maps on the Gulfport image.RX, CRD, and LSDM-MoG incorporate a number of the runway objects into anomalies, as we can observe that several horizontal stripes are significant.LRASR erroneously detects large areas of pixels located in the lower-right of the original image as anomalies, causing bad detection performance.Besides, some background pixels of LSMAD are salient than anomalies, leading to a low detection rate.AE regards road objects as anomalies, and we can see a horizontal line in the detection result.
ROC curves are plotted as shown in Fig. 6(c) for the Gulfport data to confirm the detection performance qualitatively.The ROC curve of our proposed method keeps at a high level since the false alarm rate is over 10 -3 , and the tendency of growth holds quite steep.When the probability of detection of our proposed method reaches 1, the corresponding value of other methods is far from our proposed method.
For the Gulfport data set, our method makes a considerable improvement towards AUC scores of ( , ) compared with other methods, which can be owed to the utilization of both global and local information of the data.The results are consistent with the evaluation of both ROC curves and the visualization of detection maps.As a consequence, we conclude that our algorithm achieves better performance.

Results of Jasper Ridge data
Fig. 7-IV displays the detection results on Jasper Ridge data.LSDM-MoG gets the worst performance among these methods.It detects most of the anomalies, but several background pixels are highlighted.CRD, LSMAD, and AE yield better results than LSDM-MoG.However, they all miss de-tecting plenty of anomalies.LRASR has a high false alarm rate at the lower-left part of the image.Since the category of each object is not deterministic for this data set, the effect of constructing a background dictionary decreases.Therefore, the detection performance of LRASR diminishes rapidly compared to other data sets.Our method and RX have achieved better performance than other methods.Moreover, the results demonstrate slight superiority for our proposed method because the regularized intensity of anomalies is higher than RX.
ROC curves are shown in Fig. 6(d) to evaluate the detection results from the qualitative perspective.As we can see, the ROC curve of our method lies almost in the upperleft corner of the image.When the false alarm rate is over 10 -2 , the probability of detection is near to 1, which leads the performance among these methods to a great extent.
In line with the conclusion regarding the visualization of detection results and ROC curves, our method holds the best detection performance regarding AUC scores of ( , ).The effectiveness and superiority stem from the consideration of the intrinsic probability distribution of every single pixel.As a consequence, the implicit structure of both anomalies and the background is well-recovered.

Weight Parameter and Tradeoff Parameter
The weight parameter determines the biased degree between the mean value and the standard deviation.After multiple experiments have been conducted, all of them reach the same conclusion that only if is set to 0 can our model reach powerful detection performance.Under the circumstance, the standard deviation acts as interference in our detection task, which suggests only the mean value of the distribution influences the discrimination between the background and the anomalies.Hence, is set to 0 for all following experiments.
The tradeoff parameter is used to balance the weight of reconstruction error and KL divergence term in the VAE loss function.The setting of is significant in our experiment because the mutual independence of different dimensions depends on .The experimental results further verify our aforementioned theoretical analysis.As we can observe from Fig. 8, the AUC value grows as increases in the preliminary stage.After the AUC value reaches the maximum, it declines slowly from a general perspective.Specifically, with the increase of , although the distribution in the latent space updates towards the direction that each dimension is mutually independent, the emphasis of the loss function in eq. ( 3) gradually tilts toward the KL divergence term, leading to the decline of the model reconstruction ability.The optimal parameter is acquired by trial and error, and the critical point is to unearth a balance between the reconstruction error and the KL divergence term.
The dimensionality in the latent space is set to 20, while the local neighborhood is set to 19.Since the value of the KL divergence term is small, the range of the tradeoff parameter we set is {1, 10, 50, 100, 200, 500, 1000, 2000, 5000} in the experiment.When is equal to 1, the neural network represents the vanilla VAE.Moreover, the influence of on four different hyperspectral data sets is not identical.The optimal setting values of are 10, 50, 500, and 50 for San Diego, Pavia City, Gulfport, and Jasper Ridge data, respectively.

Dimensionality of Latent Variable
The distribution of latent variable signifies the intrinsic structure of original data in the feature space.The setting of also deserves considerable attention.We can observe from Fig. 8(b) that the AUC value proliferates as increases at the very beginning.After the AUC value reaches the maximum, it keeps at a high level or drops slightly.The phenomenon can be explained from the perspective of dimensional correlations.When is too small, massive redundancy and high correlations exist between different dimensions.As a result, sufficient information cannot be fully extracted by low dimensional probability representation in the latent space.On the contrary, provided that is large enough or larger than the band number of HSI, massive redundant dimensions emerge.

Chebyshev neighborhood
The Chebyshev neighborhood determines the range of local information we utilize from the original hyperspectral image.Fig. 8(c) demonstrates that when increases, the AUC value enhances at the very beginning.The AUC value diminishes as further increases after it reaches the maximum.When is small, the neighborhood only covers a small portion of adjacent areas, thus the local information cannot be exploited thoroughly.When is too large, the adjacent areas contain multiple unexpected background categories.Moreover, the execution time increases quickly as grows.
The and maintain fixed when exploring the optimal setting of .The detection results displayed in Fig. 8(c) reveal that the best of values we set are 21, 9, 25, and 31 for San Diego, Pavia City, Gulfport, and Jasper Ridge data, respectively.

Parameters of Neural Network
The learning rate is crucial to the training procedure of a neural network.The appropriate setting value of can accelerate the training process and promote the performance of the neural network.In our experiment, is set to 10 -5 for Jasper Ridge data set and 10 -3 for other three data sets.
The batch size of a neural network influences the learning accuracy and updating speed, and the optimal value is derived from trial and error with prior knowledge.It is set to 16, 32, 16, and 32 for San Diego, Pavia City, Gulfport, and Jasper Ridge data sets, respectively.

Execution Time
In this part, we compare the execution time of the aforementioned detectors.The experiments are conducted on a computer with a 64-bit Intel i7-8700 CPU of 3.2 GHz on Windows 10.To make a fair comparison, all methods are implemented with CPU.The average execution time of the state-of-the-art methods on different data sets can be observed from Table III.As we can see, our proposed method takes the least time among these methods.RX and CRD are much slower as the sliding windows consume considerable time.As for matrix decomposition-based methods, LRASR costs much more time than LSMAD due to the optimization of the background dictionary.AE also consumes much time because the reconstruction error cannot be minimized to the expected level until numerous iterations.
Notably, our proposed method not only achieves high detection performance, but also obtains high computational efficiency simultaneously compared to other methods.Low computational cost and high detection accuracy make it possible to implement the method into practical applications in the technological industry.

Ablation Study
To evaluate the effect of each part on the final detection result, including probabilistic representation (PR), Chebyshev neighborhood (CN), and modified loss function (MLF), we conduct an ablation study on four real hyperspectral data sets by using different combinations of the network components.In the first scenario, we eliminate the PR part, thus the architecture degenerate into the VAE network.Evidently, the AUC score decreases significantly on four data sets without the PR module, which indicates the validity of PR module.With the removal of CN in the second scenario, the average AUC score is reduced by 1.5%, which manifests the usefulness of CN.When MLF is removed in the third scenario, the AUC score declines by 0.5%, which embodies the effectiveness of MLF.From a comprehensive point of view, it can be observed that the average AUC score of the models without PR, CN, and MLF are smaller than PDRD, which signifies the combination of Gaussian probabilistic representation, local Chevbyshev neighborhood and the modification of loss function is effective for anomaly detection.

Conclusion
In this paper, we propose a novel PDRD for hyperspectral anomaly detection.We exploit multivariate Gaussian distributions to represent the whole image in the latent space from a probabilistic perspective.To take advantage of the spatial information, we introduce the Chebyshev neighborhood and average expectation to estimate the local statistics.Finally, the modified Wasserstein distance serves as the evaluation criterion for anomaly detection.The distinction between the corresponding distributions of the test pixel and the average expectation in the neighborhood is computed to acquire the anomalous degree for each pixel.Experiments on four real hyperspectral scenes demonstrate the accuracy and efficiency of our proposed method compared to the stateof-the-art methods.In future work, we intend to combine the probability representation strategy with other neural networks.

Algorithm 1
Anomaly detection for HSI based on PDRD Input: Training samples ∈ ℝ × and parameters: 1) trade-off parameter and weight parameter ; 2) Chebyshev neighborhood ; 3) dimensionality of latent distribution.Output: Anomaly detection map.1: Train a VAE neural network with modified loss function; 2: Extract the probability distribution for each training sample in the latent space; 3: Determine the appropriate Chebyshev neighborhood and calculate the average expectation for each pixel; 4: Compute the modified Wasserstein distance between the corresponding distributions of the test pixel and the average expectation of the pixels in the Chebyshev neighborhood via eq.(11); 5: Reshape the computed distance vector to obtain the final detection map.

Fig. 7 :
Fig. 7: Visualization of the detection results.I is the results of San Diego data set.II is the results of Pavia City data set.III is the results of Gulfport data set.IV is the results of Jasper Ridge data set .(a) RX.(b) LRASR.(c) LSMAD.(d) CRD.(e) AE.(f) LSDM-MoG.(g) proposed.

Table 1
AUC Scores of ( , ) on Different Data Sets

Table 2
AUC Scores of ( , ) on Different Data Sets

Table 3
Execution Time (in Seconds) on Different Data Sets

Table 4
AUC Scores for Component Analysis