Nonlinear Bayesian Algorithms for Gas Plume Detection and Estimation from Hyper-spectral Thermal Image Data

This paper presents a nonlinear Bayesian regression algorithm for detecting and estimating gas plume content from hyper-spectral data. Remote sensing data, by its very nature, is collected under less controlled conditions than laboratory data. As a result, the physics-based model used to describe the relationship between the observed remote-sensing spectra and the terrestrial (or atmospheric) parameters to be estimated is typically littered with unknown "nuisance" parameters. Bayesian methods are well-suited for this context because they automatically incorporate the uncertainties associated with all nuisance parameters into the error estimates of the parameters of interest. The nonlinear Bayesian regression methodology is illustrated on simulated data from a three-layer model for longwave infrared (LWIR) measurements from a passive instrument. The generated LWIR scenes contain plumes of varying intensities, which allows estimation uncertainty and probability of detection to be quantified. The results show that this approach should permit more accurate estimation as well as a more reasonable description of estimate uncertainty. Specifically, the methodology produces a standard error that is more realistic than that produced by matched filter estimation.


Introduction
Estimating the constituent concentrations of industrial gas plumes from hyper-spectral data has received much attention in recent years; see, e.g., [1][2][3][4][5][6]. The observed at-sensor radiance is described by a three-layer (ground, plume, atmosphere) model:

$$L(\nu) = \tau_a(\nu)\left\{\tau_p(\nu)\left[\epsilon_g(\nu)\,B(\nu, T_g) + (1-\epsilon_g(\nu))\,L_d(\nu)\right] + (1-\tau_p(\nu))\,B(\nu, T_p)\right\} + L_u(\nu),$$

where $\epsilon_g(\nu)$ represents the ground emissivity, $T_g$ the ground temperature, and $T_p$ the plume temperature. The atmosphere is described by the transmissivity, $\tau_a(\nu)$, the up-welling radiance, $L_u(\nu)$, and the down-welling radiance, $L_d(\nu)$. Meteorological data can supply values for these atmospheric quantities; therefore, they are assumed known. The expression $B(\nu, T)$ represents the Planck spectral radiance of a blackbody at temperature $T$. Finally, the plume transmissivity follows Beer's law, $\tau_p(\nu) = \exp(-\sum_j \gamma_j \alpha_j(\nu))$, where $\gamma_j$ is the burden of the $j$-th gas and $\alpha_j(\nu)$ its absorption coefficient. The observed spectrum is $y(\nu) = L(\nu) + e(\nu)$, where $e(\nu)$ is the instrument noise associated with the wavenumber $\nu$. It is assumed that the errors are independent, unbiased (zero mean), and their variance is known.
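To make the three-layer model concrete, the following is a minimal sketch of the forward calculation in Python. The paper gives no implementation; the function names, the wavenumber grid, and the radiation-constant values (standard approximate constants for radiance in mW/(m² sr cm⁻¹)) are our own illustrative choices.

```python
import numpy as np

C1 = 1.191042e-5   # ~2hc^2, mW / (m^2 sr cm^-4); approximate standard value
C2 = 1.4387769     # ~hc/k, cm K; approximate standard value

def planck(nu, T):
    """Planck spectral radiance B(nu, T) for wavenumber nu (cm^-1), temperature T (K)."""
    return C1 * nu**3 / (np.exp(C2 * nu / T) - 1.0)

def three_layer_radiance(nu, eps, T_g, T_p, gamma, alpha, tau_a, L_up, L_dn):
    """At-sensor radiance: ground emissivity eps(nu), ground/plume temperatures
    T_g/T_p, gas burdens gamma (one per gas) with absorption spectra alpha
    (shape: n_gas x n_nu), and the known atmospheric terms tau_a, L_up, L_dn."""
    tau_p = np.exp(-alpha.T @ gamma)                   # plume transmissivity (Beer's law)
    ground = eps * planck(nu, T_g) + (1 - eps) * L_dn  # emitted plus reflected down-welling
    return tau_a * (tau_p * ground + (1 - tau_p) * planck(nu, T_p)) + L_up
```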
As mentioned earlier, the three-layer model makes some simplifications:

No term for solar radiation: For this model to be appropriate, one must assume that either (1) the observations are taken at night, or (2) the contribution of solar radiation is insignificant for the IR band being used.
The atmospheric terms are known: The terms $\tau_a(\nu)$, $L_u(\nu)$, and $L_d(\nu)$ are assumed to be known. This simplification is invoked to allow us to study the dominant source of variability in this problem, which is background clutter (i.e., variability in $T_g$ and $\epsilon_g(\nu)$). The strategy is to include uncertainty in these atmospheric terms at a later date.
There are no correlations or biases in the instrument errors: A well-calibrated instrument may approximate this assumption. However, periodic instrument calibrations can introduce correlations into these errors.
In general terms, the data model is $y_i = f(\nu_i; \theta, \phi) + e_i$, where $f$ describes a physics-based relationship between the observed radiation and the state parameters that describe the observed system, and the noise, or error, $e_i$, is assumed to come from a Gaussian distribution with known variance. The errors are assumed independent; therefore, their covariance matrix is diagonal, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_K^2)$, where $K$ denotes the number of discrete wavenumbers that the instrument records. The Gaussian assumption is fairly reasonable; the error in most IR instruments is dominated by Poisson shot noise, which is well approximated by a normal distribution.
The above model is nonlinear in the sense that $f$ is a nonlinear function of the state parameters. These parameters are divided into two vectors, $\theta$ and $\phi$, to distinguish between those that we want to estimate, $\theta$, and the nuisance parameters, $\phi$, i.e., parameters of no direct interest, except for the way they complicate the task of estimating the desired parameters.
Which parameters are nuisance parameters or parameters of interest depends on the particular application. For the temperature/emissivity separation problem, $\theta$ represents the background temperature and emissivity. In the plume gas concentration estimation problem, $\theta$ denotes the chemical effluent burdens. No matter what the application is, a typical problem of physics-based models for hyper-spectral data is that the unknown parameters outnumber the spectral channels recorded, so if the model is used to formulate a classic (non-Bayesian) regression model, it will represent an under-determined set of equations. One could say that the spectrum is "under-sampled," but it is important to note that the problem cannot be solved by increasing the sampling: the number of nuisance parameters increases as the sampling rate increases [12].
"Under sampling" is usually solved by including additional information about the nuisance parameters into the problem formulation. The non-Bayesian way to do this is to plug in estimates for the parameters, while the Bayesian formulation considers parameters as random variables and uses a prior distribution to supply this information. The chief advantage a prior distribution has over the plug-in approach is that estimate variabilities or uncertainties can be automatically included.
A secondary advantage is that a prior distribution provides a richer framework for describing what is known about the parameters. In the plume gas concentration estimation problem, the component of the prior that describes the temperatures and the ground emissivity (the nuisance parameters) may include all information known about these parameters. For example, background emissivity must be constrained to the interval [0,1] and must also be smooth across energy channels; this information can easily be built into the prior. Also, the background typically has a strong spatial structure, and it might be very useful to include this information in the prior.
The parameters of interest must also be given a prior. If nothing is known about them, one would apply a non-informative or weakly informative prior to these parameters. However, if some information concerning them is available, their priors can be modified to include such information. For example, it is obvious that the effluent burdens reside within an interval with lower bound 0 and some maximum plausible burden.

Derivation of the Prior Information
Let $\pi(\theta, \phi)$ denote the prior distribution of the model parameters. Without loss of generality, the nuisance parameters can be assumed to be independent of the parameters of interest: $\pi(\theta, \phi) = \pi(\theta)\,\pi(\phi)$. The generalized Beta distribution and the truncated Gaussian distribution with large variance are commonly used to define weakly informative priors on large intervals [15,16]. Using the second approach, a weak or diffuse prior for the concentrations can be formulated as

$$\pi(\gamma_j) = c^{-1} \exp\!\left(-\gamma_j^2 / 2\sigma^2\right), \qquad 0 \le \gamma_j \le \gamma_{\max},$$

where $c$ is a normalization constant so that $\pi(\gamma_j)$ is a true distribution. A prior for the plume temperature $T_p$ can be centered on the temperature of the atmosphere, which is known from meteorological data, since the plume temperature is typically within 2 to 3 K of it. A prior for the background temperature $T_g$ can be developed from scene brightness temperature, i.e., the conversion from radiance to brightness temperature using the inverse of Planck's function, together with the Nonconventional Exploitation Factors Data System (NEFDS), a government database of surface reflection parameters maintained by the National Geospatial-Intelligence Agency (NGA); see, e.g., [22].
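A minimal sketch of such a diffuse truncated-Gaussian burden prior is given below; the bound `gamma_max` and scale `sigma` are illustrative placeholders, not values from the paper. Taking `sigma` much larger than `gamma_max` makes the density nearly flat over the allowed interval.

```python
import numpy as np
from scipy import stats

def burden_prior(gamma_max=500.0, sigma=1000.0):
    """Truncated N(0, sigma^2) restricted to [0, gamma_max]; sigma >> gamma_max
    yields a weakly informative (nearly uniform) prior on the interval."""
    a, b = 0.0, gamma_max / sigma      # bounds in standardized units
    return stats.truncnorm(a, b, loc=0.0, scale=sigma)

prior = burden_prior()
print(prior.pdf(100.0))                # nearly constant density over [0, 500]
```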
Defining priors for the background emissivity is more challenging. The ground emissivity has two important properties that need to be considered in the model and the prior: smoothness and constraint to the interval [0,1]. On the other hand, there is a lot of available information on ground emissivity. For example, the NEFDS database contains pre-measured surface reflection parameters for over 400 materials, corresponding to a wide variety of objects ranging from camouflage to white paint. This context motivates the following approach for building a prior distribution for the ground emissivity: (1) model the emissivity as a smooth function of the wavenumber using, for example, a spline function [23]; (2) choose a Gaussian prior for the spline coefficients, with mean and covariance determined from the mean and covariance of the NEFDS library; and (3) map the spline function to the interval [0,1] via a nonlinear transformation such as the logistic function, $\epsilon(\nu) = 1/(1 + \exp(-s(\nu)))$, where $s(\nu)$ is the spline. The coefficient prior has a Gaussian distribution with mean 0 and covariance equal to $c$ times the square of the diagonal matrix of eigenvalues resulting from the singular value decomposition of the library emissivities, where $c$ is a constant we set to 4 in order to generate a diffuse prior. For our experiments we used six emissivities. These emissivities were chosen to be extreme representatives of the NEFDS library, and are displayed in Figure 1.
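The sketch below illustrates steps (1)-(3) of this construction: a Gaussian draw in spline-coefficient space mapped to (0,1) by the logistic function. The library statistics would come from NEFDS; here `mu` and `Sigma` are placeholders, and the wavenumber grid and spline dimensions are assumed for illustration.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
nu = np.linspace(800.0, 1200.0, 200)   # LWIR wavenumber grid (cm^-1), assumed

k, n_coef = 3, 12                      # cubic spline with 12 coefficients
knots = np.r_[[nu[0]] * k,             # clamped knot vector: n_coef + k + 1 knots
              np.linspace(nu[0], nu[-1], n_coef - k + 1),
              [nu[-1]] * k]

mu = np.zeros(n_coef)                  # library mean (placeholder for NEFDS stats)
Sigma = 4.0 * np.eye(n_coef)           # inflated covariance -> diffuse prior

coef = rng.multivariate_normal(mu, Sigma)   # one draw from the coefficient prior
s = BSpline(knots, coef, k)(nu)             # smooth function of wavenumber
eps = 1.0 / (1.0 + np.exp(-s))              # logistic map into (0, 1)
```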

Parameter Inference
Bayesian inference about the model parameters relies on their posterior distribution. Given the data model (6) and the parameter prior distribution (7), the posterior distribution is

$$\pi(\theta, \phi \mid y) = c^{-1}\, L(y \mid \theta, \phi)\, \pi(\theta, \phi), \qquad (10)$$

where $c$ is a normalization constant. Two approaches are commonly used to construct parameter estimates and their associated uncertainties. One can either use the posterior mean and posterior covariance, or the posterior mode and highest posterior density interval.
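For reference, the unnormalized log of (10) under the Gaussian noise model is a sum of the weighted squared residuals and the log-prior; a short sketch follows, where `forward_model` and `log_prior` stand for the pieces assumed in the sketches above (the constant $c$ is ignored, as it does not affect inference).

```python
import numpy as np

def log_posterior(params, y, nu, sigma, forward_model, log_prior):
    """Unnormalized log pi(params | y): Gaussian log-likelihood with known
    per-channel noise sigma, plus the log-prior density."""
    resid = (y - forward_model(nu, params)) / sigma
    return -0.5 * np.sum(resid**2) + log_prior(params)
```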

NLMPD Algorithm
In order to produce the mode of the posterior (10), we chose a constrained version of the Levenberg-Marquardt iterative algorithm [24], which we coined NLMPD, for Nonlinear Maximum Posterior Density. Since the priors we used are bounded, the maximization algorithm must deal with these constraints. NLMPD is very efficient for nonlinear least-squares problems and usually converged in a few steps (3 to 10) for our problems. Thus this algorithm requires roughly 3 to 10 times the computation that a matched filter estimator would, and we can consider them comparable in speed. NLMPD also requires the first derivatives of the minimized function to be known. Consequently, one can approximate, at no extra cost, the uncertainties associated with the mode estimate using the approximation of the posterior covariance matrix given by (11). This approximation will produce good results when the posterior distribution is close to normal and the gas burden estimates are not highly correlated. However, when the gas burdens are highly correlated (in the posterior, not the prior), the covariance approximation can be too large.
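A minimal NLMPD-style sketch is shown below. With Gaussian noise and Gaussian priors, the negative log-posterior is a sum of squares, so a bound-constrained least-squares solver can play the role of the constrained Levenberg-Marquardt step; here we use SciPy's trust-region reflective method as a stand-in, which is a swap, not the paper's exact algorithm. All names are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def map_estimate(residuals, x0, lower, upper):
    """residuals(x) stacks whitened data residuals (y - f(x)) / sigma and
    whitened prior residuals (x - mu_prior) / sigma_prior; minimizing their
    sum of squares maximizes the posterior density within the bounds."""
    sol = least_squares(residuals, x0, bounds=(lower, upper), method="trf")
    # Gaussian approximation to the posterior covariance at the mode, in the
    # spirit of the approximation the paper references as (11).
    cov = np.linalg.inv(sol.jac.T @ sol.jac)
    return sol.x, cov
```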

Markov Chain Monte Carlo Algorithm
Let $\psi = (\theta, \phi)$ denote the model parameters. MCMC generates a sequence of random variables $\psi^{(1)}, \psi^{(2)}, \ldots$ whose distribution converges to the posterior (10). However, the rate of convergence of the chain to the posterior distribution, as quantified by the effective sample size of the generated sequence and its mixing, will depend on how close the proposal distribution $q$ is to the posterior. The effective sample size is given by the length of the sequence minus the number of rejections in (12), while the quality of the mixing is measured by the auto-correlations found in the sequence. Low autocorrelations (i.e., good mixing) produce faster convergence for the posterior estimates. We have found that an efficient proposal $q$ for our problem is a multivariate Gaussian, centered at the current point in the sequence and with covariance equal to the covariance estimate provided by NLMPD divided by 4.
Due to the typically high correlations observed between the effluent concentrations within the plume, the mixing remains in many cases very low. This forces us to generate very long MCMC sequences and thin them. Thinning consists of subsampling the sequence by selecting every $k$-th observation. In other words, in order to produce a reliable sample of size, say, 5000, one needs to first generate an MCMC sequence of length $5000 \times k$ and then extract every $k$-th observation. As typical values of $k$ in our context are on the order of 100, generating such MCMC sequences becomes costly.
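The sampler just described is a random-walk Metropolis algorithm; a minimal sketch follows, with the Gaussian proposal covariance set to the NLMPD covariance divided by 4 and thinning applied as above. The function `log_post` is assumed to return negative infinity outside the prior bounds, which makes the bounds act as rejections.

```python
import numpy as np

def metropolis(log_post, x0, cov_nlmpd, n_keep=5000, thin=100, rng=None):
    rng = rng or np.random.default_rng()
    chol = np.linalg.cholesky(cov_nlmpd / 4.0)     # proposal scale from NLMPD
    x, lp = x0.copy(), log_post(x0)
    kept = []
    for i in range(n_keep * thin):
        prop = x + chol @ rng.standard_normal(x.size)
        lp_prop = log_post(prop)                   # -inf outside the bounds
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            x, lp = prop, lp_prop
        if (i + 1) % thin == 0:                    # keep every thin-th state
            kept.append(x.copy())
    return np.array(kept)
```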
Recently, Hamiltonian paths have been proposed to both increase the mixing in the MCMC sequence and reduce the number of rejections [17]. Each iteration of Hamiltonian MCMC augments the parameter state $q$ with a momentum vector $p$ and follows an approximate Hamiltonian path generated by a sequence of leapfrog steps:

$$p(t + \delta/2) = p(t) - \tfrac{\delta}{2}\,\nabla U(q(t)),$$
$$q(t + \delta) = q(t) + \delta\, p(t + \delta/2),$$
$$p(t + \delta) = p(t + \delta/2) - \tfrac{\delta}{2}\,\nabla U(q(t + \delta)),$$

where $q(t)$ denotes the $t$-th iteration of Hamiltonian MCMC, $p(t)$ denotes the momentum at the beginning of the leapfrog step, $U$ is the negative log-posterior, and $\delta$ represents a small increment. Because the leapfrog technique is only an approximation, the Hamiltonian $H$ may vary along these paths. A Metropolis step is then used to re-establish the equilibrium: the state at the end of the Hamiltonian path is accepted or rejected according to the change in $H$. The parameter constraints are handled componentwise, where $\max$ denotes the maximum taken over the vector components and $\min$ denotes the minimum.
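As a concrete illustration, here is a minimal leapfrog integrator and Hamiltonian MCMC step in Python. The inputs `U`, `grad_U`, `delta`, and `n_steps` are assumed (the paper gives no implementation), and the componentwise constraint handling is omitted for brevity.

```python
import numpy as np

def leapfrog(q, p, grad_U, delta, n_steps):
    """Approximate Hamiltonian dynamics for potential U(q) = -log posterior."""
    p = p - 0.5 * delta * grad_U(q)          # initial half step for momentum
    for _ in range(n_steps - 1):
        q = q + delta * p                    # full step for position
        p = p - delta * grad_U(q)            # full step for momentum
    q = q + delta * p
    p = p - 0.5 * delta * grad_U(q)          # final half step for momentum
    return q, p

def hmc_step(q, U, grad_U, delta, n_steps, rng):
    p0 = rng.standard_normal(q.size)         # fresh momentum draw
    q_new, p_new = leapfrog(q.copy(), p0, grad_U, delta, n_steps)
    # Metropolis correction: accept based on the change in total energy H.
    dH = (U(q_new) + 0.5 * p_new @ p_new) - (U(q) + 0.5 * p0 @ p0)
    return q_new if np.log(rng.uniform()) < -dH else q
```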

NLBR vs. Matched Filter
To test the methodology, we simulated spectra with known gas burdens and compared the NLBR approach with a matched filter, perhaps the most popular algorithm currently in use for gas plume detection. We considered the case of a plume with three uncorrelated gases (NH3, Ethylene, and Butyl Acetate) at various concentration levels: 0, 10, 20, 30, 50, 70, 90, and 110 ppm-M. For each concentration level, 100 spectra were generated. For each simulation, the background emissivity was randomly selected from the six emissivities shown in Figure 1. Figure 2 compares estimates of NH3 obtained with NLMPD and the matched filter. The figure plots the true burden of NH3 (horizontal axis) versus the estimates (vertical axis); consequently, a plot for a perfect algorithm would have all the estimates on the identity line. One can see that the matched filter estimator displays much more variability than the NLBR estimator (except when the gas burden is 0). In fact, the matched-filter plot contains points that look like outliers. These "outliers" are associated with the simulations that picked an atypical background emissivity (the bottom one in Figure 1). In the linearization that the matched filter algorithm uses, an average emissivity is plugged in, which results in the observed "outliers." In the NLMPD algorithm, the variability in emissivity seen in Figure 1 is accounted for in the prior.
Because of this problem, the matched-filter estimates are biased, and the actual error in the estimates (matched-filter root mean squared error, RMSE = 36 ppm-M) is much greater than the error predicted by the algorithm (matched-filter predicted standard error = 1 ppm-M). On the other hand, these problems are absent from the NLBR estimator; it has no detectable bias, and the observed and predicted estimation errors, given by (11), are the same (RMSE = 2 ppm-M, average predicted standard error = 2 ppm-M).

Gas Detection
The fact that the NLBR algorithm produces realistic standard errors has great benefit when the estimates are used for detection. The usual statistic to test for the presence or absence (0 ppm-M) of a gas is the t-statistic, i.e., the algorithm estimate divided by the algorithm's predicted standard error given by (11). Under the "null" hypothesis of no gas burden, the t-statistic should approximately follow a truncated Gaussian distribution (truncated because burdens are constrained to be non-negative). Figure 3 presents the distributions of the t-statistic for NH3 and Ethylene, calculated from 1000 spectra simulated as in the previous example at concentration level 0 ppm-M. The distributions are presented in the form of Q-normal plots; therefore, the t-statistic values should ideally line up on a straight line except for the truncated values. As one can see from the plots, the null distributions conform closely to the theoretical ideal.
This means that detection thresholds for NLBR estimates and the associated false call rate can be calculated theoretically, and results will correspond to theory. Figure 4 presents probability of detection (POD) curves calculated from 2000 spectra: the 1000 spectra above and 200 spectra for each of the concentration levels 50, 100, 150, 200, and 250 ppm-M. Detection occurs when the t-statistic is larger than 3, a threshold that should give about a one in a thousand chance of a false call. In the figure, POD is plotted as a function of true gas concentration, so POD(0) is the false call rate. For NH3, the observed false call rate is in fact 1/1000; for Ethylene, the false call rate is 3/1000, in very good agreement with theory.
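The detection rule itself is simple enough to state in a few lines; the sketch below is a toy illustration with hypothetical names, where the threshold of 3 corresponds to roughly a one-in-a-thousand false-call rate under a Gaussian null.

```python
import numpy as np

def detect(burden_hat, stderr_hat, threshold=3.0):
    """Declare a gas present when the t-statistic exceeds the threshold."""
    t = burden_hat / stderr_hat
    return t > threshold           # P(Z > 3) ~ 0.0013 under a Gaussian null

print(detect(np.array([1.5, 9.0]), np.array([2.0, 2.0])))  # [False  True]
```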
Fits of the NLBR model to actual hyperspectral cubes have shown that the performance in Figure 4 is more optimistic than that experienced on real data. For example, the residuals from real data fits are typically 2 to 5 times larger than theory would predict, indicating that un-modeled sources of variability exist in the data. One obvious source is the atmospheric terms in the model, which are currently assumed to be perfectly known. The next step in our modeling strategy will be to construct a prior that adequately reflects variability in these terms.

Highly Correlated Gases
The results presented up to this point utilize the "quick" Bayesian solution provided by the NLMPD approach. Figure 5 presents a case for which the complete posterior, as provided by the MCMC algorithm, is very useful. For this example, 11 highly correlated gases were placed in the plume. The gas correlations are so large that gas burden estimation would be an ill-conditioned problem if an unconstrained linear regression model were employed. (The principal constraint that makes this problem solvable is that all gas burdens must be non-negative.) With the aid of this constraint, the Bayesian regression algorithm transforms an ill-conditioned problem into one with a very informative solution. Unfortunately, the posterior distribution for this problem is poorly approximated by a normal, and the NLMPD approach cannot provide reliable estimates. On the other hand, the MCMC algorithm can calculate the full posterior and provide critical insight into what is happening.
The two histograms in Figure 5 represent the posterior distribution for Butyl Acetate, one of the gases in the 11-gas plume. The first posterior distribution represents the estimated burden of Butyl Acetate when the true burden for each of the 11 gases is 0 ppm-M. For this case, the MCMC algorithm estimates the true burden with great accuracy even though large correlations exist between gases. In fact, a 95% highest probability density interval is (0,4.5) ppm-M, quite a small interval. Furthermore, the MCMC output reveals that the posterior for Butyl Acetate has an exponential shape with a modal value of 0, which is far from normal.
The second posterior distribution illustrates the problem that occurs when the plume contains a large amount of the gases (100 ppm-M for each gas). In this case, the shape of the distribution is dramatically different: between 0 and 100 ppm-M it is roughly uniform, and above 100 ppm-M it becomes roughly normal (or more appropriately, half-normal). The uniform portion below 100 ppm-M is produced by the fact that the Butyl Acetate signal is equivalent to a linear combination of the other gases. The normal half of the distribution is caused by the constraints on the linear combinations (burdens are non-negative), which become active at higher burdens. Note that the exact nature of the correlations in the plume gases can create posteriors with quite complex shapes; it is possible to have several modes in the posterior. As one can see, MCMC can be very valuable for evaluating plumes that may contain several gases. The above histograms were constructed from a sample of 5000 data points generated by the Hamiltonian MCMC, after a burn-in phase of size 1000. Similar outputs could be obtained with the "traditional" MCMC algorithm described in Section 4.2, using a thinning of size 100. In other words, the traditional MCMC algorithm had to first generate a sequence of length 600,000, the first 100,000 being used for the burn-in phase, and then extract every 100-th observation to build the final sample. Because of this very high thinning, the "slow" Hamiltonian MCMC turned out to be faster by a factor of 2 than the traditional MCMC algorithm. Note that in less correlated experiments, such as the examples discussed earlier, the two algorithms were found to be comparable in speed.
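Summaries such as the 95% highest posterior density interval quoted above are computed directly from the MCMC draws; a minimal sketch follows (the shortest interval containing 95% of the sorted samples), with a stand-in half-normal sample in place of a real posterior.

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing the given posterior mass, from MCMC draws."""
    x = np.sort(samples)
    n = len(x)
    m = int(np.floor(mass * n))
    widths = x[m:] - x[:n - m]     # widths of all candidate intervals of m points
    i = np.argmin(widths)          # the shortest one is the HPD interval
    return x[i], x[i + m]

draws = np.abs(np.random.default_rng(1).normal(0, 2, 5000))  # stand-in posterior
print(hpd_interval(draws))
```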

Summary and Conclusions
The nonlinear Bayesian regression shows definite promise for producing better estimates from hyperspectral data. To be most useful, NLBR should not be considered to be a specific algorithm, derived from a specific model, but a general framework for producing estimates that can be tailored to the problem at hand.
The effectiveness of these regression models depends heavily on the prior information supplied to them. The methodology does require that realistic prior distributions be constructed on the nuisance parameters. However, the fact that uncertainties can be incorporated into the prior distributions in a straightforward manner means that the Bayesian methodology incorporates these uncertainties in a fairly automatic way. In fact, one of the strongest motivations for using Bayesian models may be that their statements of estimate uncertainty are more believable than those from other methodologies.
Our simulations involving NLBR have produced results that correspond with theory. The application of the present NLBR model to real hyper-spectral data has produced less optimal performance; it is clear that real hyper-spectral data contain more variability than our NLBR model currently describes, most prominently atmospheric variability [26]. However, the flexibility of the NLBR framework should allow us to include these other sources of variability without any dramatic change in the general approach.