Remote Sensing Statistical Distances and Their Applications to Biophysical Parameter Estimation: Information Measures, M-estimates, and Minimum Contrast Methods

Radiative transfer models predicting the bidirectional reflectance factor (BRF) of leaf canopies are powerful tools that relate biophysical parameters such as leaf area index (LAI), fractional vegetation cover f V and the fraction of photosynthetically active radiation absorbed by the green parts of the vegetation canopy (f APAR) to remotely sensed reflectance data. One of the most successful approaches to biophysical parameter estimation is the inversion of detailed radiative transfer models through the construction of Look-Up Tables (LUTs). The solution of the inverse problem requires additional information on canopy structure, soil background and leaf properties, and the relationships between these parameters and the measured reflectance data are often nonlinear. The commonly used approach for optimization of a solution is based on minimization of the least squares estimate between model and observations (referred to as cost function or distance; here we will also use the terms " statistical distance " or " divergence " or " metric " , which are common in the statistical literature). This paper investigates how least-squares minimization and alternative distances affect the solution to the inverse problem. The paper provides a comprehensive list of different cost functions from the statistical literature, which can be divided into three major classes: information measures, M-estimates and minimum contrast methods. We found that, for the conditions investigated, Least Square Estimation (LSE) is not an optimal statistical distance for the estimation of biophysical parameters. Our results indicate that other statistical distances, such as the two power measures, Hellinger, Pearson chi-squared measure, Arimoto and Koenker–Basset distances result in better estimates of biophysical parameters than LSE; in some cases the parameter estimation was improved by 15%.


Introduction
Biophysical parameters estimated from satellite data are important inputs to ecological models and land-surface models [1,2].Various algorithms have been developed to estimate biophysical parameters from remotely-sensed reflectance data [3,4].The forward problem, i.e., to predict the reflected radiation and canopy light interactions given the structure and optical properties of the canopy and surface, is well understood and several models exist that produce realistic results [4].The inverse problem is difficult to solve, however, because the problem is underdetermined.A commonly adopted way to overcome this is to add additional constraints or make a priori assumptions regarding the properties of the land surface, see [5][6][7].
Biophysical parameters are estimated from satellite data by inverting a sample of the of bidirectional reflectance factor, BRF(λ) = f (Angular Geometry, Structural Parameters), where structural parameters (canopy properties, soil background reflectance, etc.) are input parameters and BRF(λ) is the model output (wavelength-dependent reflectance).Numerical solution of this inverse problem adjusts the model parameters such that model-predicted values closely match the measured values [7,8].The match between model output and data is usually based on minimizing the sum of least squares [9].
Four approaches can be distinguished to estimate biophysical parameters from satellite data; the advantages and limitations of various approaches of biophysical parameter estimation are discussed in [10,11].A first approach is to estimate biophysical parameters from an empirical relationship with a spectral index, see for example [1].A second approach is to invert an analytical model; this approach puts a high demand on computing resources if the analytical model is complex.A third approach is to use machine learning, for a example by training a neural network using the inputs and outputs of a BRF model [12,13].A fourth approach is to use LUTs.This is an attractive way to estimate biophysical parameters for various reasons.Solutions of the model can be constrained to a range of realistic input parameters, optimization is fast and the complexity of the analytical model is retained [14].In the present study we adopt a LUT-based inversion using the FLIGHT radiative transfer model [9,15].
The estimation of biophysical parameters from satellite data is hampered by uncertainties and errors that arise from a number of sources.These include uncertainties in instrument calibration, variations in atmospheric composition or simplifying assumptions in the representation of canopy and soil background [16,17].Errors in the representation of the canopy and soil background are of particular concern in the present study since they have non-zero mean and are not normally distributed.Outliers and nonlinearities distort the residuals and in these cases a key assumption for using LSE is violated, which is that errors have a white noise, zero mean distribution of residuals.For this reason, we investigate three broad classes of statistical distances (in the remote sensing literature referred to as cost functions, elsewhere also known as metrics, or divergence measures) that are based on different error distributions.These classes are: information measures of divergence [18], M-estimates [19], and minimum contrast methods [20,21].
The first class of distance or divergence measures is referred to as information measures; optimization using these measures is based on minimization of distances between two probability distributions (Section 3).Thus, we need to rewrite the BRF as a probability distribution function to apply these measures.The information measures can be further divided into three sub-classes.The first subclass is referred to as f -divergences, a term introduced by Kullback and Leibler [22].This class of measures is based on the distance between probability distributions (Section 3.1).The f -divergences are not bounded, i.e., they range between 0 and ∞.A second subclass of divergence measures, referred to as blended measures, allows bounds to be calculated explicitly ( [23] and Section 3.3).A third subclass-consisting of generalized (h, f )-divergences or superposition of two functions, see Section 3.2-is a generalization of the f -divergences [24].
The second class, of M-estimates, is a broad class of functions to which, among others, least squares optimization belongs.For this class of measures, the BRF is not considered a probability distribution.Within the class of M-estimates are a large number of functions with robust or resistant properties (Section 4).
The third class, of minimum contrast estimates, considers the spectral domain.We express the BRF as a spectral density function (Section 5) to apply these measures.
The present paper has two aims.The first is to provide a review of available statistical distances and divergences to date (Sections 3-5 and Appendix).Only a few of these measures have been applied to parameter estimation from remote sensing data.The second aim is to apply the distance and divergence measures to the estimation of biophysical parameters from satellite data.The availability of a large number of statistical distances gives a high degree of flexibility, since it allows model optimization for a wide range of error distributions.We illustrate this in the numerical experiments where retrieval of biophysical parameters is tested for simulated needleleaf and broadleaf forests for ground-measured BRF.The present paper tests the use of alternative distance measures on simulated observations.This allows assessment of distance measures in a well-controlled environment with known errors in estimated biophysical parameters for a wide range of simulated land-surface conditions.The method will be tested on real observations in a follow-on study [25].
The paper is organized as follows.In Section 2 the estimation of biophysical parameters from Earth Observation (EO) data is expressed in a form comparable with statistical distance theories.Sections 3-5 describe the statistical distance and divergence measures that performed best in our study.Section 6 provides a description of the BRF simulations with FLIGHT; this includes a description of the land-surface scenes and the generation of LUTs.In Section 7 the statistical distance and divergence measures are applied to the estimation of LAI, f V and f APAR by numerical inversion of the LUTs.The Appendix contains an extensive list of distance measures with references to examples of applications in the peer-reviewed literature.
We acknowledge that the range of conditions (vegetation type, simulated error distribution, land-surface properties, BRF sampling) is limited; the results of this study can therefore only be used as a guideline.

Statement of the Problem
The present section formulates the BRF in a way that is appropriate for the application of statistical distances and divergence measures.First we represent the following elements in the LUT: R i (λ 1 , ..., λ n , θ) is a realization of the BRF dependent on wavelength, solar zenith angle, relative azimuth and view zenith angle, LAI, f V , leaf area distribution, ground reflectance, etc.In this notation λ is the wavelength and λ 1 , ..., λ n ∈ Λ, i = 1, ..., N, where, N is the number of entries (rows) in the LUT and θ =(η, ζ) is a vector with unknown biophysical parameters η = (η 1 , ..., η k ) of interest to our study (e.g., LAI, f V , f AP AR ) and ζ = (ζ 1 , ..., ζ r ) is a vector with parameters that we do not need to estimate, either because they are already known (e.g., solar zenith angle, relative azimuth and view zenith angle) or because their value is obtained by other means and is not estimated in the inversion (e.g., crown shape, soil reflectance).Denoting satellite observations by R * (λ 1 , ..., λ n ), we estimate the unknown parameters, the elements of the vector η * , by minimizing a measure that provides the best "closeness" between R in the LUT and R * .
Let Γ be a class of measures (distances) Γ(R * (λ j ), R i (λ j , η, ζ)) between two BRF functions; the LUT and the observations.The classical statistical method of inversion (or estimation and finding required η * ) of the radiative transfer model can be formulated as a semi-parametric problem The purpose is to find the best estimate for η * by solving the minimization problem (1) using different statistical distances and divergences between simulated satellite signals ("observations") and LUTs.We consider the parameters η * s and η s,i , The classical approach of this minimization problem is known as LSE, which is based on the minimization of the quadratic function We consider alternative statistical distances, which can be divided into three classes.The majority of statistical distances belong to the so-called class of information measures.This class considers distances or measures of divergence between two probability distributions.To apply these functions to biophysical parameter estimation, the BRF must be normalized such that the sum of probabilities is 1.The expression for the LUTs becomes and for the simulated satellite observations it is with n l=1 q l = 1, and n l=1 p i l = 1 for 1 ≤ i ≤ N. Thus the BRF functions can be rewritten as discrete probability mass functions by a simple normalization.
Finally, to compare the results obtained with different distance measures, the mean absolute error in parameter retrieval is defined, for example to assess the residual error in estimated LAI the following merit function is used In the next three sections we discuss the different statistical distances evaluated in the present study.The distances are applied to two reflectance distributions, P refers to the LUT reflectances and Q refers to the observed "true" reflectances.To simplify the notation we drop the index i.

Information Measures
Information theory was born in 1948 when Shannon [26] published his revolutionary paper motivated by the problem of efficiently transmitting information over a noisy channel.Since Mahalanobis [27] introduced the concept of distances between two probability distributions, several other distance measures have been suggested in the statistical literature and these have been referred to as measures of distance between two distributions, measures of separation, measures of discriminatory information and measures of variation-distance.While these measures were not always introduced for the same reason, they all increase when two distributions become "further away" from each other.
Divergence is an important concept in information theory and it is useful in many applications such as multimedia classification, neuroscience, optimization of the performance of density estimation methods, and cluster analysis.Distance measures also allow a wide range of tests to see if samples are from the same distribution.
Entropies are defined over the space of distributions that form the bases of independence/dependence concepts.For these reasons, Shannon's mutual information function has been increasingly utilized in the literature [28].Shannon's relative entropy and almost all other entropies fail to be "metric", as they violate either symmetry, or the triangular rule, or both.For these reasons, it is more appropriate to refer to these entropies as measures of divergence rather than measures of distance.
Informally, entropy can be understood as "the quantity of surprise one should feel upon reading the result of a measurement".More formally, we can write: if event A occurs with probability P (A), define the "information" I(A) gained by knowing that A has occurred to be The intuitive idea is that the rarer an event A, the more information we gain if we know it has occurred.

f -Divergence Information Measures
Kullback and Leibler (KL) [22] first introduced the concept of information divergences, which are non-symmetric measures between two distributions P and Q. Typically P represents the "true" distribution of the data and Q represents a model or an approximation of P .In information theory, KL divergence can be interpreted as cross Shannon entropy.This class has been extended in many directions since its initial application in decoding schemes and in signal processing.In particular, Rényi proposed a generalization of Shannon entropy [29], one of a family of functionals for quantifying the diversity, uncertainty or randomness of a system.The Rényi entropies are important in ecology and statistics as indices of diversity.Later, KL and Rényi related divergences were included in a broader class of divergences called f -divergences, introduced by Csiszár [30].This class can be formulated as follows.
A general class of divergence measures is given by where a strictly convex function of p so that Γ F (P, Q) is a strictly convex function of 0 ≤ p 1 , ..., p n ≤ 1; • For fixed Q, Γ f (P, Q) attains its unconstrained global minimum when p l = q l for all l, i.e., if P = Q; • for a given strictly convex twice differentiable function f (.) we define Note that the global minimum value is equal to f (1).Thus for a given f Many measures were added to this class from different areas of science and new divergences are still being discovered.Here we present some of these f-divergence measure, see [31] and additional list can be found in Section A.1 of the Appendix.
1. Let f (x) = x ln(x) and f (1) = 0.The corresponding measure is This measure is called the KL divergence; it also corresponds to the maximum likelihood distance.
It has wide application in code theory, signal processing, data compression, data storage and data communication as well as others areas of science.2. Let f (x) = 1 x and f (1) = 1 then This measure is called the Pearson chi-square.This measure is used in the chi-squared test first proposed by K. Pearson.3. χ α -Vajda divergence corresponds to the function f (x) = |x − 1| α and f (1) = 0, where α 1, then is called the squared Hellinger measure.It is bounded full distance and has been applied to parameter estimation in censored models (where the variable of interest is only observable under certain conditions) as well as many other areas of science. 5.It can be generalized in the following form to give more flexibility on parameter estimation. Assume = 0, then power divergence has the form 7. Power divergence measures [32] with minimum at zero.This class was introduced to unite efficiency with robust properties; the class is also Fisher consistent which gives for α = −2, −1, −1/2, 0, 1 the following already known measures: the Neyman chi-squared measure divided by 2, the Kullback-Leibler divergence, the twice-squared Hellinger distance, the likelihood disparity, and the Pearson's chi-squared divided by 2.

Other Divergence Measures between Two Probability Distributions.
There are another two important sub-classes based on the divergence between probability distributions that do not belong to the class of f -divergences.These sub-classes are referred to as (h, f )-measures and f -entropy measures.
The first of these sub-classes, (h, f )-measures, was introduced by [33].It can be written in the following form Under different assumptions, it is shown that the asymptotic distributions of the (h, f )-divergence statistics are either normal or chi-square.These divergences were developed for hypothesis testing on multinomial populations and to test goodness of fit and independence.This class is based on the superposition of two functions and it gives a large degree of flexibility to deal with outliers.
Here is one of the examples of these measures.Additional list can be found in Section A.2 of the Appendix.
Rényi divergence with The second subclass is referred to as entropy measures and can be introduced as follows.Let X be a random variable with probability distribution P .Shannon's entropy [26] has the form while the cross-entropy is It is easy to verify that the KL divergence Equation ( 7) is related to Shannon's entropy, i.e., In order to present a systematic way of studying the different entropy measures, Burbea and Rao introduced the so-called f -entropies, by where that some important entropy measures cannot be written as f -entropy.For this reason [24], defined the (h, f )-entropy as follows, where either f : (0, ∞) → R is concave and h : R → R is differentiable and increasing, or f : (0, ∞) → R is convex and h : R → R is differentiable and decreasing.
Based on the concavity property of the (h, f )-entropy, new generalization was introduced in [34]: These measures of divergence have been introduced to present systematic ways to study different entropy measures.They are used in applications that are associated with random variables with finite support in genetic diversity between populations, the study of taxonomy in biology and to test if populations are homogeneous in genetics and for the analysis of discriminant techniques.
An example of these measures can be seen below and additional list can be found in Section A.2 of the Appendix.

Blended f-Disparities
A third group of divergences is referred to as blended divergences.Lindsay [32] found that inference based on statistics of type f -divergence (obtained by replacing either one or both probability distributions by suitable estimators) requires either bounded differentiability of f or boundedness of f itself.He introduced a new class of divergences by the modification of weights inside the integral expression of Pearson's chi-squared divergence called "blended weight chi-squared disparity"-BWCS(β) and "blended weight Hellinger disparity"-BWHD(β), β ∈ [0, 1].In general all these new classes of disparities have the following common property.If the blending parameter is equal to the limiting values β = 0 or β = 1, then the two original divergences on which the blend was based are achieved in the class of blended divergences.Definitions and theorems can be found in [23].
All blended f -disparities have been used in goodness-of-fit tests in medical statistics and are shown to be an excellent compromise between the Pearson's chi-square and the log likelihood ratio tests.
To illustrate the theory of blended divergences, we give several examples below.Blended weighting scheme that generalizes Hellinger distance:

Nonlinear Regression and M-Estimates
Robust regression is a form of regression analysis designed to circumvent some limitations of traditional parametric and non-parametric methods.Regression analysis seeks to find the relationship between one or more independent variables and a dependent variable.Certain widely used methods of regression, such as LSE, have favorable properties if their underlying assumptions are true, but can give misleading results if those assumptions are violated.In cases where errors are not normally distributed, outliers occur, or ordinary LSE assumptions are violated in some other way, the validity of the regression results is compromised if a non-robust regression technique is used.M-estimators (see for example [35]) form a broad class of estimators that exhibit certain robust properties.Estimates with robust regression methods can be more stable with respect to anomalous errors.
We use M-estimates to the estimation of biophysical parameters from reflectances as follows.Let us consider the BRF R(λ j , θ) as a nonlinear regression function g(λ j , θ) of its parameter set, which is observed at wavelength λ j ∈ Λ with some noise j of complicated nature.The observations set R * (λ j ) = x j can be represented as follows where E j = 0, i.e., the expectation is that j is random noise with zero mean, not Gaussian in general.
In our case we are interested in the unknown parameter vector of interest η =(LAI, f V , f AP AR ).We use the M-estimates generated by a function of loss ρ(x), x ∈ R, see [19,35], for some motivation details and properties of these estimates.
The M-estimate of the unknown parameter vector η obtained from the observations x j j = 1, ..., n is given by the solution of the minimization problem: The function ψ(x) = ρ (x) is called the score function and the minimization problem (Equation ( 23)) can be written in the following equivalent way: The classical LSE corresponds to the case ρ(x) = x 2 , ψ(x) = 2x.In this case random noise j are i.i.d.r.v. that have a Gaussian distribution.In general for nonlinear regression, j may be independent but not necessarily Gaussian or even non-Gaussian dependent random variables.It is well known that LSE regression methods are consistent, asymptotically normal and asymptotically efficient.However, when the density function of errors is non-Gaussian or even has a non-symmetric skewed distribution, LSE estimates are no longer efficient and their application can result in large losses of efficiency.Robust methods replace the sum of squares by more suitable loss functions.
The following examples belong to the class of M-estimates and can be found in [35][36][37] and others.The full list of M-estimates can be found in Section A.2 of the Appendix. ( is LSE and it is non-robust.It is used widely in many application of remote sensing, biology, economy and other areas of science since it has nice properties described above. (2) The function defines the class of Koenker-Basset estimators [38], (0 < c < 1).In contrast to classical methods based on least-squares residuals, this measure is robust and has an explicit probabilistic meaning.It was introduced for the estimation of an unknown parameter η in the nonlinear regression model (Equation ( 23)) when the samples are independent, but errors j = x j − g(λ j , η) have some skewness (non-symmetric) property.The empirical quantile function may be defined in terms of solutions to a simple optimization problem.Explicitly, where ρ c -function (Equation ( 26)).One can also interpret this as (E j = 0) This distance measure has been applied in econometrics to study wage distributions and to study linear regression quantiles on censored data.

Minimum Contrast Estimation
To apply minimum contrast measures to our problem, we interpret the BRF as a spectrum or spectral density of some stochastic process.The basic idea behind a minimum contrast estimator is to minimize the distance (contrast) between a parametric model and a non-parametric spectral density.The estimates obtained are first-order efficient and are also attractive because they have robust properties.This class of estimates is close to the class of quasi-likelihood estimators, where instead of independence (which does not hold for many cases) we use asymptotical independence, as discussed below.
We adopt the following terminology from time series analysis to interpret our observations as measurements in the spectral domain [39].Let {Z t } be a stationary processes with spectral density f (λ) = f θ (λ), λ ∈ Λ ⊆ R, with expectation m, and θ ∈ Θ ⊂ R p .Our aim is to estimate the unknown parameter θ, that is, to identify the true value of spectrum f θ * (λ).
To implement this idea we consider the so-called quasi-likelihood method [40].The BRF observations in the spectral domain are written in a form, I n (λ j ), λ j ∈ Λ = {λ 1 , ..., λ n }, where This is the so-called periodogram non-parametric estimation of spectral density f (λ).
Under some general conditions [39], at the Fourier frequencies λ j ∈ Λ, the random variables I n (λ j ) are asymptotically independent and have an exponential distribution, that is The pdf that corresponds to the distribution function in the right hand side of Equation ( 31) takes the form: ) .Thus, one can construct a quasi-likelihood function or its logarithm which has to be maximized in order to estimate θ.It means that we need to minimize the so-called Whittle functional and the quasi-likelihood estimate is The Whittle estimator was also extended to cover correlated signal-plus-noise models, providing a formal asymptotic distribution theory specifically tailored for parameter estimation.This approach was first applied in time series for exponential volatility models; it then caught attention in financial econometrics and in related fields.These models are able to represent some of the stylized features of financial returns, such as uncorrelation in levels but strong dependence in squares and log-squares and leverage effect.
The Whittle estimator belongs to a class of more general estimates known as minimum contrast estimates, see [20,21].To demonstrate the idea, let us assume that the true value of unknown parameter θ * ∈ intΘ, the interior of Θ.A contrast function for θ * is a deterministic function F (θ * , θ), F θ * : Θ → R + , which has a unique minimum, θ = θ * .
A contrast process for F θ * is a sequence of random variables U n (θ), n = 1, 2... such that the ergodic like condition holds for some U (θ) in probability: The minimum contrast estimator is a value of θ for which the function U n (θ) takes its minimum, or Under some sets of conditions the minimum contrast estimators are consistent, see [41].Often the contrast function can be chosen as a distance L(f θ , g) between two spectral densities f θ (λ) and g(λ), which can be written in the form: where K(x) is a three times continuously differentiable function on (0, ∞) and has a unique minimum at x = 1.The contrast process in practice can be approximated as follows: The following examples of distances L(f θ , g) are widely used in parametric estimation in time-series analysis in frequency domain, in particular for autoregressive and moving average models.
One of the example can be seen below and the full list of such distances can be found in Section A.3 of the Appendix.
Let K(x) = logx + 1 x .This criterion is equivalent to the quasi-Gaussian maximum likelihood (73) and has the following form Note that function K(x) have a unique minimum at x = 1.To find a minimum at zero we need to subtract K(1) under the sum from each of the functions.In practical applications we use the notation from Section 2, i.e., in the above methodology the abstract parameter θ is replaced by the vector parameter of our interest η = (η 1 , ..., η k ).

Radiative Transfer Modeling
The performance of the distance measures in terms of retrieving biophysical parameters (LAI, f APAR , f V ) from reflectance values was tested on simulations.This approach has the advantage that the estimated biophysical parameters can be compared directly with model input parameters and that therefore errors associated with the use of different distance measures can be established with accuracy.Moreover, we can test distance measures on a large number of simulations and this provides a good indication of their robustness.
Simulations were carried out similar to the approach taken by Prieto-Blanco et al. [9].We used two models in conjunction to simulate a set of the ground BRF observations and to generate the LUTs.Simulations were carried out for 3 needleleaf and 2 broadleaf scenes.We used PROSPECT [49] to simulate light scattering and absorption by leaves, and FLIGHT [15] to simulate light scattering and absorption by vegetation canopies.The models used are state of the art and provide realistic simulations of the interaction of solar radiation with the vegetation canopy and the soil.
PROSPECT [49] calculates leaf transmittance and reflectance from 400 to 2,500 nm.In PROSPECT4 each leaf is considered as a stack of N absorbing plates with rough surfaces giving rise to scattering of light.Absorption is calculated as the linear summation of the concentrations of chlorophyll, water, and dry matter, each with their specific absorption coefficients [49].The PROSPECT input parameters are described in Table 1.The inputs include: N , the leaf structure parameter; C ab , the chlorophyll a + b concentration (µg/cm 2 ); Cw, the equivalent water thickness (g/cm 2 ); Cm, the dry matter content (g/cm 2 ).Chlorophyll content (C ab ) in leaves is linked to the maximum photosynthetic capacity of vegetation and varies with leaf development stage, productivity, stress and nitrogen levels.For the LUTs of conifers, a maximum and minimum value for C ab was entered to reflect a range of conditions.Table 1.PROSPECT input parameters.N is the leaf structure parameter; C ab are the chlorophyll a + b concentrations (µg/cm 2 ); Cw is the equivalent water thickness ( g/cm 2 ); and Cm is the dry matter content (g/cm 2 ).FLIGHT [15] is a 3D radiative transfer model for light interaction with vegetation canopies using Monte Carlo simulation of photon transport.The original model traced the photons' trajectories forwards from the source until they were absorbed in the canopy or left the canopy boundary.Subsequent improvements include calculation of paths back from any view direction to the intercepted surface facets [50,51], simulation of fine angular resolution, simulation of photosynthesis and simulation of LiDAR signals [52].A hybrid representation is used to model the discontinuous nature of the forest canopy.Large-scale structure is represented by geometric primitives defining shapes and positions of the tree crowns and trunks, here estimated from a statistical distribution.Within each crown, foliage is approximated by structural parameters of area density, angular distribution and size and optical properties of reflectance and transmittance.These parameters are approximated as homogeneous within each boundary, but may vary between crowns.Simulation of 3D photon trajectories allows accurate evaluation of multiple scattering within crowns, and between distinct crowns, trunks and ground surface.FLIGHT simulations have previously been compared with other 3D canopy radiative transfer models as part of the Radiation Model Intercomparison (RAMI) project [4].The recent analysis within RAMI of six selected 3D models, including FLIGHT, showed dispersion within 1% over a large range of canopy descriptions, see [53].

CONIFER
Radiation was simulated in 15 spectral bands (500, 560, 630, 690, 700, 740, 790, 830, 870, 1,035, 1,200, 1,250, 1,650, 2,100, 2,250 nm).A previous study [54] suggested such a selection of bands could provide approximately 90% of the information about the land surface that is provided by a full spectrum, although this study was based on field spectroscopy.The set is chosen here to demonstrate the retrieval method and selection of error metrics, but the method is applicable to any set of bands or potential view directions, where the study should be repeated to determine optimal error metrics.

Sites
The simulations were carried out on two main types of forest: conifer and broadleaf forests.Three conifer representatives were chosen from the former BOREAS sites [55], each characterized by a different dominant species: the Old Black Spruce (Picea Mariana) site (OBS), the Old Jack Pine (Pinus Banksiana) site (OJP) and the Young Jack Pine (Pinus Banksiana) site.Vegetation of these sites has a complex structure, needles show a high degree of clumping and there is mutual shadowing by crowns.These sites are therefore known to pose a challenge to biophysical parameter estimation.Detailed crown, leaf and soil background measurements are available for these sites as these have been extensively studied in [56,57].Chlorophyll content in coniferous canopies has been estimated in [58].Changes in leaf chlorophyll produce large differences in leaf reflectance and transmittance spectra, therefore three values of C ab were used to obtain a wide range of possible values [59], (Table 1).Broadleaf simulations were carried out for an oak and beech forest since these are among the most important species for the European forestry [60,61].
Table 1 shows the leaf optical properties for the conifer and broadleaf forests that were used in PROSPECT; values are based on [62].Tables 2 and 3 show the vegetation structure parameters and the angular configurations for the FLIGHT model.Five characteristic soil spectra from the Purdue spectral library were selected (Table 2) [63,64].
Table 2. FLIGHT input parameters.The column "range" represents the minimum and maximum values of the input parameters; column "step" is the increment for the LUT; column "Observed" represents the range over which a random number "r.n." is selected to generate satellite observed BRF, R * (λ 1 , ..., λ n ).Thus the simulations and LUT are generated from different input parameters; in particular, different values are used for viewing geometry, soil reflectance, fractional cover, and leaf area index (Section 6.4).

Parameter Range
Step "Observed"

Generation of Look-Up Tables
The LUT contains a total of N = 90, 404 entries of BRF reflectances.These are reflectance values calculated for parameters obtained at regular intervals of solar zenith angle, view zenith angle, relative azimuth angle, LAI, f V and C ab , see Table 1.Crown shape parameters can be found in Table 3 and f AP AR is obtained by summing individual f AP AR values at each band times weighted by the fraction of downwelling light within the band: where F a,i is the mean fraction of radiant energy absorbed by canopy at band i, calculated from FLIGHT output, and W i are the weights, see also [9].

Simulation of Observations
We simulated BRF values using the coefficients in Tables 1-3.For each realization the observation tables contain a total of M = 5,000 entries for the three conifer sites (OBS, OJP, YJP) and M = 5,000 for the broadleaf sites (beech and oak).The tables were simulated similar to the calculation of the LUT but using a random selection from the range of parameters in Table 2, last column (LAI, f V , angular geometry, atmospheric aerosol depth).Note that the parameters from Tables 1 and 3 stay the same.The observations are simulated from a wider range of conditions that are not present in the conifer and broadleaf LUTs (soils, leaf area distribution).Further errors due to sensor noise and calibration are not considered here, but have been examined in previous studies [9,65] and should be considered prior to application to particular instruments.The tables of "observed" ground reflectances were constructed to contain complex errors for needleleaf and broadleaf forests.These errors, originating from the mismatch between LUT and "observations", represent conditions encountered in real applications where we have to match a model representing only a selection of conditions with richly varied real world conditions.
Canopy reflectance models demonstrate increasing sensitivity to soil reflectance at lower vegetation cover.Soil reflectance is one of the most sensitive parameters in canopy reflectance models [17].Errors are biased and unsymmetrical and for this reason we expect some robust distances to perform better than LSE.This effect is more pronounced at lower values of LAI (<3).
Our simulations are necessarily restricted and represent only a subset of all parameters that can be varied within Prospect and FLIGHT.We believe that these simulations of errors are more realistic than the commonly adopted method of adding Gaussian noise to the spectrum.Errors due to incorrect assumptions on soil or leaf spectral properties will be spectrally correlated.

Results
The statistical distances listed in Sections 3-5 were evaluated as follows.Reflectances in the LUTs were matched to "observed" reflectances.For each case, we matched one entry in the "observations" table consisting of M = 5,000 spectral bands with one entry in the LUT for each type of forest.The inversion finds parameters for the nearest angular geometry in the LUT, i.e., the angular geometry is known.Other parameters are assumed to be unknown.The performance of the statistical distances was then assessed on the biophysical parameter estimated (LAI, f V ,f AP AR ).Some statistical distances allow parameters that govern the shape of the error distribution to be varied, thus the choice of these parameters leads to another optimization problem in itself.For these cases we tested a range of parameters and chose the parameter that minimizes the error in estimated biophysical parameters.All distances were tested by comparing the estimated biophysical parameters with the a priori known parameters.
For the purpose of clarity we present a selection of results that consists of the best performing measures from each of the three classes of statistical distance measures.The results show significant variability in retrieval accuracy depending on the chosen divergence measure.Overall, the optimal measures in each case show an improvement over using LSE, see Tables 4 and 5.  Table 4 shows the best distance measures (bold italic) for estimating biophysical parameters on broadleaf forests from BRF as well as results obtained with LSE (bold text).Table 5 shows the same but for estimating biophysical parameters on conifer forest.
The improvement obtained in estimating biophysical parameters for broadleaf forest is further illustrated in Figure 1.In Figure 1 residual errors in broadleaf biophysical parameters obtained from BRF reflectances by using LSE are compared with errors obtained by Hellinger (Equation ( 10)) and power divergence (Equation ( 12)).The range of errors (maximum and minimum errors) for these two methods is the same, however the alternative divergence shows a marked improvement in the error distribution for all biophysical parameters.Figure 1.Comparison of residual errors resulting from the application of two different distance measures used to estimate biophysical parameters for broad leaf canopies with a look-up table (LUT).Biophysical parameters, including leaf area index, LAI, fraction of photosynthetically active radiation absorbed by the canopy, f AP AR , and vegetation cover fraction, f V , are estimated from simulated ground reflectance values.The rows from top to bottom show the comparison of residual errors in estimating LAI with the least squares estimate (LSE) and Hellinger Equation (10) (top row), of residual errors in f V with LSE and the power divergence Equation ( 12) with power 1 and j = 4 (centre row) and of residual errors in f AP AR with LSE and the Hellinger equation (bottom row).The columns from left to right show frequency distributions of residual errors using LSE as a cost function (left column) and the residual errors using either a Hellinger Equation (10) for LAI and f AP AR or a power divergence Equation ( 12) with power 1 and j = 4 for f V (centre column).The right column shows the quantile-quantile plots comparing the frequency distributions of the left and centre columns expressed as absolute errors.The errors in biophysical parameters associated with the alternative cost functions are smaller and better behaved (more symmetrical and smaller bias).

Broadleaf
Figure 2 provides a further illustration of the reduction in the error in biophysical parameter estimation for needleleaf forests from BRF reflectance data obtained by power divergences in Equations ( 12) and ( 13) (see also Table 5).
Figure 2. Comparison of residual errors resulting from the application of two different distance measures used to estimate biophysical parameters for needleleaf canopies with a look-up table (LUT).Biophysical parameters, including leaf area index, LAI, fraction of photosynthetically active radiation absorbed by the canopy, f AP AR , and vegetation cover fraction, f V , are estimated from simulated ground reflectance values.The rows from top to bottom show the comparison of residual errors in estimating LAI with the least squares estimate (LSE) and power Equation ( 13) with power 2 and α = −5 (top row), of residual errors in f V with LSE and the power divergence Equation ( 12) with power 1 and j = 4 (centre row), and of residual errors in f AP AR with LSE and the power Equation ( 13) with power 2 and α = −5 (bottom row).The columns from left to right show frequency distributions of residual errors using LSE as a cost function (left column) and the residual errors using either a power equation for LAI and f AP AR or a power divergence Equation( ( 12)) with power 1 and j = 4 for f V (centre column).The right column shows the quantile-quantile plots comparing the frequency distributions of the left and centre columns expressed as absolute errors.The errors in biophysical parameters associated with the alternative cost functions are smaller and better behaved (more symmetrical and smaller bias).

Conifer
The results summarized in Tables 4 and 5 and Figures 1 and 2 illustrate that the use of alternative distances measures can significantly improve parameter estimation.We found that the following distance measures perform well for the cases described above: (1) For the broadleaf forest the best distances were: • for the estimation of LAI: Hellinger (Equation ( 10)) and Arimoto (Equation ( 20)); • for the estimation of f V : power divergence (Equation ( 12)); • for the estimation of f AP AR : generalized Hellinger measure (Equation ( 10)).
We found that compared with LSE, the Koenker-Basset distances (Equations ( 26) and ( 35)) gave better results in all cases.The improvement compared with LSE was 15% for LAI, 10% for f V and 11% for f AP AR .
(2) For conifer forest the best distances were: • for the estimation of LAI and f AP AR : power divergence (Equation ( 13)) and Pearson chi-square divergence (Equation ( 8)); • for the estimation of f V : power divergence (Equation ( 12)); Similar to the broadleaf case, we found that the Koenker-Basset metric (Equations ( 26) and ( 35)) improved estimation in all cases.All biophysical parameters, LAI, f V and f AP AR , improved by around 10% in this case.

Recommendations for Distance Choice
When optimizing parameterized models for which the error distribution (shape and bias) is known, the user can choose an appropriate cost functions based on specific physical properties of the model and metrics.When we deal with non-parametric models or non-linear model with many parameters (such as the present case where we simulate BRF using PROSPECT, FLIGHT) it may be useful to check a range of available distances to get the optimal cost function.For problems similar to the present study we can provide the following guidance.

Considering the Shape of the Distribution
For non-symmetric error distributions the recommended cost function is Koenker-Basset (Equation ( 26)).Such non-symmetric error distribution may arise, for example, from undetected sub-pixel cloud in pre-processing.Based on the shape of this function, it is expected that for the parameter c for this function becomes close to one with increasing skewness of the error distribution.We expect skewed error distributions to be common for biophysical parameter estimation from satellite data, especially when there is a mismatch between the soil reflectance specified in the LUT and the "true" soil reflectance.
For heavy-tailed distribution (right) and semi-heavy-tailed distribution (left) (i.e., the tail behave as negative exponent divided by power function) we can recommend non-symmetric power divergence (Equation ( 13)), Vajda (Equation ( 9)) and Pearson chi-square divergence (Equation ( 8)) cost functions.As can be seen from Figures 1 and 2, application of specific metrics to the biophysical parameters makes the errors more localized, removes heavy tails and makes the distributions more symmetric.This is due to the non-symmetric nature of the metrics themselves.To the best of our knowledge, these effects are usually not addressed in remote sensing inversion problems.
Special interest should be paid to the class of spectral metrics, since it represents informational distances in the spectral domain.Since informational transformation to the spectral domain usually makes our observation asymptotically independent, it is plausible that the spectral metric (Equation ( 35)) that corresponds to quasi-maximum likelihood provides good results.This is also consistent with the statistical theory that the maximum likelihood estimator is asymptotically optimal.For this reason we recommend to use this or a similar cost function.

Considering Properties of the Estimated Parameters
For all types of forest we found that for f V the optimal metric is power divergence (Equation ( 12)).This symmetric cost function represents a Rényi type entropy that maximizes the entropy distribution using a power law behavior.This gives us a better understanding of the nature of errors in this case.
In the case when parameters of the model are linearly correlated, we observe consistency in the optimal cost function for these parameters (for example LAI and f V and f AP AR in our case).Thus, we can recommend using the same cost function for linearly correlated parameters.However, this does not hold if the correlation has a more complicated nature.

Conclusions
Over 60 statistical distances from three major classes, information measures, M-estimates and minimum contrast methods, were obtained from the mathematical literature.A comprehensive list of these statistical distances was provided.The statistical distances were tested to see, if compared with LSE, they improved the estimation of biophysical parameters for needleleaf and broadleaf forests.We found that the commonly used LSE distance is not the optimal cost function for the cases studied and that better results can be obtained using alternative cost functions.
For the numerical experiments we use PROSPECT and FLIGHT to simulate "observed" reflectance values in 15 different spectral bands.We generate LUTs for a limited set of land-surface and atmospheric conditions.However for the observations we generate reflectance values for a wider range of conditions and thus introduce a mixture of errors caused by variations in angular geometry, LAI, f V , soil reflectance and leaf angle distribution.For the biophysical parameter estimation we match the observed reflectances to the reflectances of the LUT with different cost functions.The largest sources of (biased) error, i.e., the mismatch between observations and LUT, are potentially related to soils, since only a limited amount of variability associated with these variables was incorporated in the LUTs.We conclude that our analysis resembles a common problem for the estimation of biophysical parameters from satellite data, i.e., one estimates biophysical parameters assuming a limited set of ground conditions.A cost function that is based on an asymmetric, biased or heavy-tailed error distribution can therefore result in better estimates of biophysical parameters than LSE, which is based on a normal error distribution.
We found that the information measures from Section 3 provide better results when the BRF is normalized; see Equations ( 3) and ( 4).This result may not be valid for a smaller number of wavebands.
A caveat of the present study is that we analyzed only a limited subset of a wide range of possibilities, and for different applications it is likely that different cost functions may be more suitable.We are preparing a study where cost functions are used on real observations as opposed to simulated observations [25].
Alternative divergence measures and distances have been known in the statistical literature for some time and could find an application in many areas besides remote sensing, such as in biology, geography and geophysics.
The approach outlined in the present study can be extended to other applications that use LUT optimization, interpolation, linearization of parameter space, etc.It can be used in addition to or as an alternative to data training and machine learning schemes.We believe that the use of alternative statistical measures has great potential for remote sensing applications.4. Varma (1966) [18] f (x) = x α/β , h(x) = 1 β(β − α) logx , 0 < α < β, β 1 (q l ) α/β (A22) 5. Havrda and Charvat (1967) [18] f (x) = 1 1 − α (x α − x), h(x) = x, α > 0, α = 1 (p l α − p l ) + (q l α − q l ) (A23) 6. Sharma and Mittal [46]  (q l logq l ) 7. Sharma and Mittal [46] f It has some optimal properties, see [35], and the function ψ c (x) can be approximated by a twice differentiable score function.Huber's M-estimation has been applied in GPS positioning and modeling of complex technical experiments where it reduces the effect of outliers.This estimator is so satisfactory that it has been recommended for almost all situations, however, from time to time, difficulties are encountered, which may be related to a lack of stability.

Table 3 .
FLIGHT input parameters: Crown shapes, where "c" represents cone shape and "e" represents ellipsoid shape.

Table 4 .
Summary statistics for the performance of different distance measures in estimating biophysical parameters from ground reflectance data (BRF); reflectances were simulated for broadleaf trees.

Table 5 .
Summary statistics for the estimation of biophysical parameters from ground reflectance values (BRF) for needleleaf canopies.