DeepIndices: Remote Sensing Indices Based on Approximation of Functions through Deep-Learning, Application to Uncalibrated Vegetation Images

The form of a remote sensing index is generally defined empirically, whether by choosing specific reflectance bands, equation forms, or coefficients. These spectral indices are used as a preprocessing stage before object detection/classification. However, no study appears to search for the best form through function approximation in order to optimize the classification and/or segmentation. The objective of this study is to develop a method to find the optimal index, using a statistical approach based on gradient descent over different forms of generic equations. From six-waveband images, five equation forms were tested: linear, linear ratio, polynomial, universal function approximator, and dense morphological. A few techniques from signal processing and image analysis are also deployed within a deep-learning framework. The performances of standard indices and DeepIndices were evaluated using two metrics, the Dice score (similar to the F1-score) and the mean intersection over union (mIoU). The study focuses on a specific multispectral camera used in near-field acquisition of soil and vegetation surfaces. These DeepIndices are built and compared to 89 common vegetation indices using the same vegetation dataset and metrics. As an illustration, the most widely used vegetation index, NDVI (Normalized Difference Vegetation Index), yields an mIoU score of 63.98%, whereas our best model provides an analytic solution that reconstructs an index with an mIoU of 82.19%. This difference is significant enough to improve the segmentation, the robustness of the index to various external factors, and the shape of the detected elements.


Introduction
An important advance in the field of earth observation is the discovery of spectral indices, which have proved their effectiveness in surface description. Several studies have been conducted using remote sensing indices, often applied to a specific field of study such as the evaluation of vegetation cover, vigor, or growth dynamics [1][2][3][4] for precision agriculture using multispectral sensors. Some spectral indices have been developed using the RGB or HSV color space to detect vegetation from ground cameras [5][6][7]. Remote sensing indices can also be used for the analysis of other surfaces such as water, roads, and snow [8], clouds [9], or shadows [10].
There are two main problems with these indices. First, they are almost all empirically defined, the selection of wavelengths coming from observation, as with NDVI for vegetation. Better spectral combinations or equations may exist to characterize a surface under specific acquisition parameters. It is important to optimize the index upstream, as the data transformation leads to a loss of information and features essential for classification [11]. Most studies have only tried to optimize some parameters of existing indices. For example, an optimization of NDVI, (NIR − Red)/(NIR + Red), was proposed by [12] under the name WDRVI (Wide Dynamic Range Vegetation Index), (αNIR − Red)/(αNIR + Red). The author tested different values of α between 0 and 1, using the ROC curve to determine the best coefficient for a given ground truth. Another optimized NDVI was designed and named EVI (Enhanced Vegetation Index). It takes into account the blue band for atmospheric resistance by including various parameters, G(NIR − Red)/(NIR + C1·Red − C2·Blue + L), where G and L are respectively the gain factor and the canopy background adjustment, and the coefficients C1 and C2 compensate for the influence of clouds and shadows. Many other indices can be found in an online database of indices (IDB: www.indexdatabase.de, accessed 10 August 2019) [10], including choices of wavelengths and coefficients depending on the selected sensors or applications. But none of the presented indices are properly optimized. Thus, in the standard approach, the best index is determined by testing all available indices against the spectral bands of the selected sensor, with a Pearson correlation between these indices and a ground truth [13,14]. Furthermore, correlation is not the best estimator because it considers neither the class ratio nor the shape of the obtained segmentation, and may again result in a non-optimal solution for a specific segmentation task.
Finally, these indices are generally not robust because they remain very sensitive to shadows [11]. For vegetation, until recently, all of the referenced popular indices were man-made and used few spectral bands (usually NIR and red, sometimes blue or red-edge).
The second problem with standard indices is that they work with reflectance-calibrated data. Three calibration methods can be used in proximal sensing. (i) The first method uses an image, taken before acquisition, containing a color patch as a reference [15,16], which is used for correction. The problem with this approach is that if the image is partially shaded, the calibration is only relevant for the non-shaded part. Moreover, the reference should ideally be updated to reduce the interference of weather changes on the spectrum measurement, which is not always possible since it is a human task. (ii) Another method is the use of an attached sunshine sensor [17], which also requires calibration but cannot correct a partially shaded image. (iii) The last method is the use of a controlled lighting environment [18,19], e.g., natural light is suppressed by a curtain and replaced by artificial lighting. All of these approaches are difficult to implement for automatic outdoor use, and even more so in real time, such as detecting vegetation while a tractor is driving through a crop field.
In recent years, machine learning algorithms have been increasingly used to improve the definition of the indices discussed under the first main problem. Some studies favor the use of multiple indices and advanced classification techniques (RandomForest, Boosting, DecisionTree, etc.) [4,[20][21][22][23][24]. One study proposed to optimize the weights of an NDVI equation form with a genetic algorithm [25], but does not optimize the equation form itself. Another approach was proposed to automatically construct a vegetation index using a genetic algorithm [26]: the equation forms are optimized by building a set of arithmetic graphs with mutations, crossovers, and replications that change the shape of each equation during learning, but the weights are not taken into account, since calibrated data are used. Finally, with the emergence of deep learning, current studies try to adapt popular CNN architectures (UNet, AlexNet, etc.) to earth observation applications [27][28][29][30].
However, no study optimizes both the equation form and the spectral band weights. The present study explicitly optimizes both, searching for a form of remote sensing index by learning weights in function approximators. These function approximators can then reconstruct any equation form of the desired remote sensing index for a given acquisition system. To address the second problem, this study evaluates the function approximators on an uncalibrated dataset containing various acquisition conditions. This is not a common approach, but it can be found in the literature [31,32]. It leads to indices that do not require data calibration. The deep-learning framework is used as a general regression toolkit, and several CNN function-approximator architectures are proposed. DeepIndices is presented as a regression problem, which is entirely new, as is the use of signal and image processing.

Instrument Details
The images were acquired with the Airphen (Hyphen, Avignon, France) six-band multispectral camera (Figure 1). This is a multispectral scientific camera developed by agronomists for agricultural applications. It can be integrated into different types of platforms such as drones, phenotyping robots, etc. The camera was configured with the 450/570/675/710/730/850 nm bands (10 nm FWHM), respectively denoted λ0 to λ5. These spectral bands were defined by a previous study [33] for crop/weed discrimination. The focal length of each lens is 8 mm. The raw resolution of each spectral band is 1280 × 960 px with 12-bit precision. Finally, the camera is equipped with an internal GPS antenna.

Image Dataset
The dataset was acquired at the INRAe site in Montoldre (Allier, France) within the framework of the "RoSE challenge" funded by the French National Research Agency (ANR), and in Dijon (Burgundy, France) at the AgroSup Dijon site. Images of bean and corn, containing various natural weeds (yarrows, amaranth, geranium, plantago, etc.) and sowed ones (mustards, goosefoots, mayweed, and ryegrass), with very distinct illumination conditions (shadow, morning, evening, full sun, cloudy, rain, etc.), were acquired in top-down view at 1.8 m from the ground. Table 1 summarizes the dataset. Manual annotation takes about 4 h per image to obtain the best quality of ground truth, which is necessary for use in regression algorithms. The ground truth is therefore small and defined under very distinctive illumination conditions. To simulate the effect of light variations on the ground truth images, a random brightness shift (20%) and a random saturation shift (5%) are applied to each spectral band during the training phase. As an illustration, Figure 2 shows a false-color reconstruction of a corn crop in the field, with various weeds and shadows in the corners of the image (not vignetting). Due to the nature of the camera (Figure 1), spectral band registration is required; it is performed with a registration method based on previous work [34] (with sub-pixel accuracy). The alignment is refined in two steps: (i) a rough estimation of the affine correction and (ii) a perspective correction for refinement, through the detection and matching of key points. The GoodFeaturesToTrack (GFTT) algorithm proved to be the best key-point detector, with the 570 nm band as the spectral reference for registration. After registration, all spectral images are cropped to 1200 × 800 px and concatenated channel-wise into λ, where each dimension λd refers to one of the six spectral bands.

Images Normalization
Spectral bands inherently carry high noise associated with the CCD sensor, which is a potential problem during normalization [35]. To overcome this effect, 1% of the signal is suppressed at each tail by computing the quantiles, the signal is clipped to the resulting range, and each band is rescaled to the interval [0, 1] using min-max normalization to obtain ρd:

ρd = (λd − Q1%(λd)) / (Q99%(λd) − Q1%(λd))

This method also reduces the lighting variation. According to [36], little variation is observed in the spectral correction factors between clear and cloudy days. Thus the correction has a limited impact on the scaling factor and should be handled by this equation. However, the displacement (offset) factor cannot be estimated, so the output images are not calibrated in reflectance.
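The clipping and rescaling step above can be sketched as follows; this is a minimal NumPy illustration of the described procedure (the function name and the `q` parameter are ours, not from the paper):

```python
import numpy as np

def normalize_band(band, q=0.01):
    """Clip 1% of the signal at each tail via quantiles, then
    min-max rescale the band to [0, 1], as described in the text."""
    lo, hi = np.quantile(band, [q, 1.0 - q])
    clipped = np.clip(band, lo, hi)
    return (clipped - lo) / (hi - lo)
```

Each of the six raw bands λd would be passed through this function independently to obtain the normalized stack ρ.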

Enriching Information
In order to enrich the pool of information, some spectral band transformations are added, which take into account spatial gradients and spectral mixing [6] in the image. Seven pieces of information, important in different respects, were chosen. The standard deviation between spectral bands, noted ρstd, can help to detect spectral mixing: for example, between two different surfaces such as soil and leaf, which have opposite spectral radiances, mixing produces a pixel that is a linear combination of the two, so the standard deviation tends towards zero [33]. Three Gaussian derivatives at different orientations, Gxx, Gxy, and Gyy, are computed over ρstd with a fixed σ = 1; they provide important spatial information about gradient breaks corresponding to the outer limits of surfaces. The Laplacian computed over ρstd and the minimum and maximum eigenvalues of the Hessian matrix (obtained from the Gaussian derivatives Gxx, Gxy, and Gyy), also called ridge measures, are also included. These transformations should improve the detection of fine elements [37] such as monocotyledons in vegetation images.
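The seven extra channels can be sketched with SciPy's Gaussian derivative filters; this is an assumed implementation consistent with the description (axis conventions for Gxx/Gyy and the eigenvalue formula for a 2 × 2 symmetric Hessian are ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def enrich(rho, sigma=1.0):
    """Build the seven enrichment channels from the normalized bands.

    rho: normalized spectral stack of shape (H, W, 6).
    Returns an (H, W, 7) stack: band std, Gxx, Gxy, Gyy, Laplacian,
    and the min/max Hessian eigenvalues (ridge measures)."""
    std = rho.std(axis=-1)                               # spectral-mixing cue
    gxx = gaussian_filter(std, sigma, order=(0, 2))      # 2nd deriv. along x
    gyy = gaussian_filter(std, sigma, order=(2, 0))      # 2nd deriv. along y
    gxy = gaussian_filter(std, sigma, order=(1, 1))      # cross derivative
    lap = gxx + gyy                                      # Laplacian
    half_trace = 0.5 * (gxx + gyy)
    root = np.sqrt(0.25 * (gxx - gyy) ** 2 + gxy ** 2)
    eig_min, eig_max = half_trace - root, half_trace + root
    return np.stack([std, gxx, gxy, gyy, lap, eig_min, eig_max], axis=-1)
```

Concatenating this output channel-wise with ρ yields the 13-channel input described below.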
All these transformations are concatenated channel-wise to the normalized spectral bands to build the final input image. In total, seven transformations are added to the six spectral images for a final image of 13 channels, which should help convergence.

Training and Validation Datasets
The input dataset is composed of spectral images I of size 1200 × 800 × 13 (or 6 if the "Enriching Information" part is disabled) and a manual ground truth p of size 1200 × 800 × 1, where p ∈ {0, 1}. The desired output p̂ is a vegetation probability map of size 1200 × 800 × 1, where p̂ ∈ [0, 1]. This dataset is randomly split into two subsets, training (80%) and validation (20%). All random seeds are fixed at start-up to keep the same training/validation split across all trained models, which helps to compare them. Keeping the same random seed also results in the same starting point between different runs, making results reproducible on the same hardware.

Existing Spectral Indices
From the indices database, 89 vegetation indices were identified (Table 2) as compatible with the wavelengths used in this study (as near as possible); they will be tested and compared to the designed DeepIndices. Five forms of simple equations were extracted from this database (a wide variety of indices derive from these forms, generally combinations of 2 or 3 bands). By analyzing these five equations, we can synthesize them into two generic equations (linear combination and linear ratio) that take all spectral bands into account. Three other models can generalize any function: polynomial fitting, continuous function approximation by Taylor expansion, and piecewise-continuous function approximation through morphological operators. These forms are interesting to optimize because they can approximate any function, and this optimization automatically defines new indices (DeepIndices). The following subsections present these different models.

Linear Combination
To synthesize Equations (2) and (3), a simple linear equation such as y = ∑d αd ρd can be defined. This equation can be generalized to the 2D domain using a 2D convolution, allowing neighboring pixels to be taken into account. For a pixel at position [i, j] the convolution is defined by a kernel H of neighborhood weights (corresponding to the αd), where D is the number of channels (6 spectral bands + 7 transformations) and N is the kernel size. The plain linear combination is given by N = 1, D = 13. The kernel weights are initialized by a truncated normal distribution centered on zero [38]; they are updated during the training of the CNN through back-propagation, and unnecessary bands should be driven to zero. Interestingly, increasing the kernel size N takes the neighborhood of a pixel into account and should estimate the spectral mixing more accurately [33]. Figure 3 shows the corresponding network.
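The linear-combination index can be sketched as an explicit cross-correlation over an N × N × D kernel; this NumPy sketch (ours, not the paper's Keras implementation) reduces to the per-pixel weighted sum y = ∑d αd ρd when N = 1:

```python
import numpy as np

def linear_index(image, H):
    """Linear-combination index as a 2D convolution (correlation).

    image: (rows, cols, D) channel stack; H: (N, N, D) kernel of
    learned weights. With N = 1 this is a per-pixel linear mix."""
    N = H.shape[0]
    pad = N // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros(image.shape[:2])
    for i in range(N):          # slide the kernel over the neighborhood
        for j in range(N):
            window = padded[i:i + image.shape[0], j:j + image.shape[1], :]
            out += (window * H[i, j]).sum(axis=-1)
    return out
```

In the paper this map is a trainable Conv2D layer; here H plays the role of its learned weights.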

Linear Ratio
To generalize Equations (4)-(6), a simple model based on the division of two linear combinations is set up. In the same way, this form is generalizable to the 2D domain and then corresponds to two 2D convolutions, one for the numerator and one for the denominator. When the denominator is zero, the result is set to zero as well, to avoid "not a number" outputs. Figure 4 shows the corresponding network.
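The ratio model with its zero-denominator rule can be sketched as follows (a per-pixel NumPy illustration with 1 × 1 kernels; the function and weight names are ours). Note that NDVI is a special case of this form:

```python
import numpy as np

def linear_ratio_index(image, Hn, Hd):
    """Ratio of two linear combinations of the channels.

    image: (H, W, D); Hn, Hd: (D,) numerator/denominator weights.
    Pixels whose denominator is zero are mapped to zero, as in the text."""
    num = (image * Hn).sum(axis=-1)
    den = (image * Hd).sum(axis=-1)
    # Replace zero denominators by 1 before dividing, then zero the result.
    return np.where(den == 0, 0.0, num / np.where(den == 0, 1.0, den))
```

With weights Hn = (−1, 1) and Hd = (1, 1) over a (red, NIR) pair, this reproduces NDVI exactly.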

Polynomial
According to the Stone-Weierstrass theorem, any continuous function defined on a segment can be uniformly approximated by a polynomial. Thus all forms of color indices can be approximated by a polynomial. Setting the degree is a difficult task, which may imply under-fitting or over-fitting, and instability can be caused by near-zero values of ρd. But since the segment is restricted to the domain [0, 1], Bernstein polynomials are a classical choice, and the equation can be written as a weighted sum of Bernstein basis polynomials B(N,i) = (N choose i) ρ^i (1 − ρ)^(N−i), which are more stable during training. Moreover, Bernstein neural networks can solve partial differential equations [39]. For implementation reasons, two different layers are defined in the network (visible in Figure 5): one for the Bernstein expansion, limited to degree N = 11, which takes the input image and produces the different Bernstein basis polynomials; each basis is then concatenated channel-wise, and the linear combination is given by a 2D convolution.
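The Bernstein expansion layer can be sketched as follows; this NumPy sketch (ours) expands each value of ρ into its N + 1 basis channels, over which the network then learns a linear combination:

```python
import numpy as np
from math import comb

def bernstein_basis(rho, N=11):
    """Bernstein basis B(N, i) = C(N, i) * rho^i * (1 - rho)^(N - i)
    for rho in [0, 1]; returns one channel per basis polynomial."""
    return np.stack(
        [comb(N, i) * rho**i * (1 - rho) ** (N - i) for i in range(N + 1)],
        axis=-1,
    )
```

A useful sanity check is the partition-of-unity property: the basis channels sum to 1 at every point of [0, 1].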

Universal Function Approximation
The Gaussian color space model proposed by [40] shows that the spatio-spectral energy distribution of the incident light E is the weighted integration of the spectrum ρd, denoted E(ρd), where E can be described as a Taylor series and the energy function is convolved with different derivatives of a Gaussian kernel, or structured receptive fields [41].
This important point shows that Taylor expansions can decompose any function f(x), especially for color decomposition and remapping, into the series f(x) = ∑ f⁽ⁿ⁾(a)(x − a)ⁿ/n!. Here, the signature of the incident energy distribution of a remote sensing index associated with a surface can be reconstructed. An approach to learn this form of expansion is proposed by [42], commonly called DenseNet, which corresponds to the sum of the concatenation of the signal and these spatio-spectral derivatives. The convolutions learn receptive fields and derivatives in the spectral domain when the kernel size k is 1, and in the spatio-spectral domain when k is larger. Batch normalization is used to reduce the covariate shift across convolution outputs by re-scaling them, and speeds up convergence. Finally, the sigmoid activation function is used: it allows more complex structures and non-linearities of the reconstructed function to be learned. The number of derivatives and receptive fields is configurable with two parameters: the depth, which corresponds to the number of layers in the network, and the width, which refers to the number of outputs of each convolution. By default, the depth is fixed to 3 and the width to 5. Figure 6 shows the corresponding universal function approximator network.
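The dense concatenation pattern can be sketched as a forward pass in NumPy; this is a simplified sketch (1 × 1 convolutions only, batch normalization omitted, all names ours) of the depth/width structure described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense_block_forward(x, weights):
    """DenseNet-style forward pass: each layer applies a per-pixel
    linear map (a 1x1 convolution) and a sigmoid, then concatenates
    its output back onto the running channel stack.

    x: (H, W, C0) input; weights: list of (C_in, width) matrices,
    one per layer, where C_in grows by `width` at each step."""
    for W in weights:
        y = sigmoid(x @ W)                      # 1x1 conv + activation
        x = np.concatenate([x, y], axis=-1)     # dense connection
    return x
```

With depth 3 and width 5 on a 13-channel input, the output stack has 13 + 3 × 5 = 28 channels, mirroring the truncated expansion.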

Dense Morphological Function Approximation
As with the Taylor series, an approximation of any piecewise-continuous function can be established with morphological operators such as dilation and erosion [43], respectively denoted ρ ⊕ s and ρ ⊖ s, where s are the corresponding dilation or erosion coefficients. Several erosions and dilations are defined for each spectral band i; the dilation layer is then defined as the channel concatenation of the z⁺ᵢ, and likewise the erosion layer via the z⁻ᵢ. The output is the linear combination of both, whose coefficients w⁺ᵢ and w⁻ᵢ are obtained by a 2D convolution. We chose to set the number of dilation and erosion neurons to 6. Figure 7 shows the corresponding network. To remove parts of the signal that may be dispensable, the addition of low-pass, high-pass, and band-pass filters upstream of the network is also studied. A good example is provided by vegetation indices: only the high values in the green and near-infrared, and the low values in the red and blue, characterize vegetation.
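The dilation and erosion neurons described at the start of this section can be sketched in the max-plus formulation commonly used for morphological networks; the exact operator definitions below (z⁺ as a max over shifted channels, z⁻ as a min) are our assumption, not reproduced from the paper:

```python
import numpy as np

def morph_layer(rho, s_plus, s_minus):
    """Per-pixel morphological neurons over the channel axis.

    rho: (H, W, D) channel stack; s_plus, s_minus: (K, D) learned
    coefficients for K dilation and K erosion neurons.
    Dilation: z+ = max_d(rho_d + s_d); erosion: z- = min_d(rho_d - s_d).
    Returns the channel concatenation of the K dilations and K erosions."""
    z_plus = (rho[..., None, :] + s_plus).max(axis=-1)    # (H, W, K)
    z_minus = (rho[..., None, :] - s_minus).min(axis=-1)  # (H, W, K)
    return np.concatenate([z_plus, z_minus], axis=-1)
```

In the full model, a 1 × 1 convolution then learns the linear combination coefficients w⁺ᵢ and w⁻ᵢ over these 2K channels.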
This is the principle of the NDVI index. Due to their internal structure, leaves reflect a lot of light in the near-infrared, in sharp contrast to most non-vegetal surfaces. When a plant is dehydrated or stressed, the spongy layer collapses and the leaves reflect less near-infrared light, approaching red values in the visible range [44]. Thus, the mathematical combination of these two signals can help to differentiate plants from non-plant objects, and healthy plants from diseased ones. However, this index is less interesting when only detecting vegetation, and it is strongly influenced by shade or heat.
We therefore add a filter ahead of the previous equations to remove undesirable spectral energies from each ρd, using two thresholds a and b, which are also learned. If the whole signal turns out to be interesting, these two parameters will not change and keep the values a = 0 and b = 1. The low-pass filter z = max(ρ − a, 0)/(1 − a) suppresses low values, the high-pass filter w = max(b − ρ, 0)/b suppresses high values, and the band-pass filter is the product of the two, y = z · w. The output layer is the channel-wise concatenation of the input image and the low-pass, high-pass, and band-pass filters, which produces 4 × 13 = 52 channels. Finally, to reduce the data for the rest of the network, a bottleneck is inserted using a convolution layer, generating a new image with 6 channels. This image is used by the rest of the network defined previously in Section 3.2. Figure 8 shows the corresponding module, inserted upstream of the network.
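The input band filter can be sketched directly from the equations above; in this NumPy sketch (names ours) a and b are fixed scalars, whereas in the network they are learned per channel:

```python
import numpy as np

def band_filters(rho, a, b):
    """Learnable low/high/band-pass filters from the text.

    rho: (H, W, D) input stack; a, b: the two learned thresholds.
    Returns the 4D-channel concatenation [rho, low, high, band]."""
    low = np.maximum(rho - a, 0.0) / (1.0 - a)   # z: suppresses low values
    high = np.maximum(b - rho, 0.0) / b          # w: suppresses high values
    band = low * high                            # y = z * w
    return np.concatenate([rho, low, high, band], axis=-1)
```

With a = 0 and b = 1 the filters are pass-through (z = ρ, w = 1 − ρ), matching the text's statement that an entirely useful signal leaves the thresholds unchanged.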

Spatial Pyramid Refinement Block (SPRB)
To take different scales in the image into account, the addition of a "Spatial Pyramid Refinement Block" at the downstream end of the network is studied. [45] showed that fusing low- to high-level features improves the segmentation task. The block consists of several 2D convolutions whose kernel sizes are set to 3, 5, 7, and 9; the results of all convolutions are concatenated, and the final output image is given by a 2D convolution. Figure 9 shows the corresponding module, inserted downstream of the network.

Last Activation Function
To obtain an index and facilitate convergence, only the values between 0 and 1 at the output of the last layer are of interest, with the help of a clipped-ReLU activation function, defined by f(x) = min(max(x, 0), 1), where x is a pixel of the output image. Each negative or null pixel then belongs to the unwanted class, and values greater than or equal to 1 to the searched class. The indecision border consists of the values between 0 and 1, which are optimized and correspond to the probability that the pixel belongs to the searched surface, P(Y = 1), or not, P(Y = 0). This holds for the output prediction, denoted p̂ ∈ [0, 1], as for the ground truth, denoted p ∈ {0, 1}.

Loss Function
A wide variety of loss functions have been developed with the emergence of deep learning (MSE, MAE, Hinge, Tversky, etc.). A cross-entropy loss function is usually used when optimizing a binary classification [46], but it is not optimized for shape. Recently, for semantic segmentation with deep neural networks, [47] proposed a solution that optimizes a differentiable approximation of the mean intersection over union (mIoU). The performance of this loss function appears higher than that of previous methods [48][49][50]; we therefore use it as our loss function.
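A common differentiable formulation of this loss, which we assume here (the paper's exact expression may differ in detail), replaces set intersection and union with products and sums over the probability maps:

```python
import numpy as np

def soft_iou_loss(p_true, p_pred, eps=1e-7):
    """Differentiable approximation of 1 - IoU for probability maps.

    Intersection ~ sum(p * p_hat); union ~ sum(p + p_hat - p * p_hat).
    Equals 0 for a perfect prediction and approaches 1 for a disjoint one."""
    inter = (p_true * p_pred).sum()
    union = (p_true + p_pred - p_true * p_pred).sum()
    return 1.0 - inter / (union + eps)
```

Because every operation is smooth in p̂, the loss can be minimized directly by gradient descent, unlike the thresholded mIoU metric itself.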

Performance Evaluation
Commonly, accuracy and the Pearson correlation are used to quantify the performance of remote sensing indices [13,14]. However, these metrics take into account neither the class ratio nor the shape of the segmentation. Correlation is also highly sensitive to non-linear relationships, noise, subgroups, and outliers [51,52], leading to incorrect evaluation. According to [53,54], the Dice score and the mean intersection over union (mIoU) are better suited to evaluating segmentation masks. They are defined by Dice = 2|A ∩ B|/(|A| + |B|) and IoU = |A ∩ B|/|A ∪ B|, where A is the ground-truth mask and B the predicted mask, mIoU being the IoU averaged over classes. We then use these two metrics for the performance evaluation. Prior to quantization, a threshold of 0.5 is applied to the output of the network to transform the probability map into a segmentation mask: when p̂ is lower than 0.5 the pixel is considered background, otherwise it is considered part of the searched object mask. Other metrics are not considered because they are not always appropriate for segmentation or for unbalanced data.
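The evaluation step can be sketched as follows for the binary case; this NumPy sketch (ours) thresholds the probability map at 0.5 and averages the IoU over the object and background classes, which is one standard reading of "mIoU" for two classes:

```python
import numpy as np

def dice_and_miou(p_true, p_pred, threshold=0.5):
    """Dice and mean IoU for a binary segmentation.

    p_true: ground truth in {0, 1}; p_pred: probability map in [0, 1].
    Dice = 2|A n B| / (|A| + |B|); mIoU averages the IoU of the
    object class and the background class."""
    pred = p_pred >= threshold
    true = p_true.astype(bool)
    ious = []
    for cls_true, cls_pred in ((true, pred), (~true, ~pred)):
        inter = np.logical_and(cls_true, cls_pred).sum()
        union = np.logical_or(cls_true, cls_pred).sum()
        ious.append(inter / union if union else 1.0)
    inter = np.logical_and(true, pred).sum()
    dice = 2 * inter / (true.sum() + pred.sum())
    return dice, float(np.mean(ious))
```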

Comparison with Standard Indices
In order to make a fair comparison, it is necessary to optimize each standard index as well. A minimal neural network is used to learn a linear regression: the network is composed of the spectral index, followed by a min-max normalization x' = (x − min)/(max − min), then a 2D convolution with kernel size k = 1 for the linear regression. To perform the classification in the same way as our method, a clipped-ReLU activation function is used. This tiny network is presented in Figure 10. Obviously, the same metrics and loss function are used.

Training Setup
Training is done with the Keras module of the TensorFlow 2.2.0 framework. All computation is done on an NVidia GTX 1080 with 8111 MiB of memory, which limits the number of simultaneous layers in memory and hence the size of the model. Each model is compiled with the Adam optimizer, used with the Lookahead mechanism proposed by [55]: it iteratively updates two sets of weights, the search directions for the fast weights being chosen by the inner optimizer, while the slow weights are updated every k steps based on the direction of the fast weights, and the two sets are then synchronized. This method improves learning stability and lowers the variance of its inner optimizer. The initial learning rate is fixed at 2 × 10⁻³. The batch size is fixed at 1 due to memory limitations, and the learning rate is decreased using ReduceLROnPlateau with factor = 0.2, patience = 5, min_lr = 2 × 10⁻⁶. Training runs for 300 iterations. Finally, an EarlyStopping callback stops the training when there is no improvement in the training loss after 50 consecutive epochs.
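This configuration can be sketched as follows; it is a config sketch, not the paper's code, and assumes the TensorFlow Addons implementation of Lookahead (the paper does not name the package used):

```python
# Sketch of the training setup described above (assumed packages:
# tensorflow 2.x and tensorflow_addons for the Lookahead wrapper).
import tensorflow as tf
import tensorflow_addons as tfa

inner = tf.keras.optimizers.Adam(learning_rate=2e-3)   # fast weights
optimizer = tfa.optimizers.Lookahead(inner)            # slow-weight sync

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.2, patience=5,
                                         min_lr=2e-6),
    tf.keras.callbacks.EarlyStopping(monitor="loss", patience=50),
]

# model.compile(optimizer=optimizer, loss=soft_iou_loss, metrics=[...])
# model.fit(train_ds, validation_data=val_ds, epochs=300,
#           batch_size=1, callbacks=callbacks)
```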

Fixed Models
All standard vegetation models were optimized using the same training and validation datasets. Each was optimized using a min-max normalization followed by a single 1 × 1 2D convolution layer and a final clipped-ReLU activation, like the generic models implemented. The top nine standard indices are presented in Table 2, and their respective equations are available in Table A1 in Appendix A. It is interesting to note that most of them are very similar to NDVI in form, which shows that, in line with all previous studies, these forms based on a ratio of linear combinations are the most stable against light variation. The tested NDVI-based indices show very different performances, highlighting the importance of weight optimization. The Modified Triangular Vegetation Index 1, given by vi = 1.2 × (1.2 × (ρ5 − ρ1) − 2.5 × (ρ2 − ρ1)), shows that a simple linear combination can be as efficient as NDVI-like indices by taking one additional spectral band (ρ2 = green) and more adapted coefficients. However, the other 80 spectral indices do not appear stable against light variation and saturation, so it is not relevant to present them.

DeepIndices
Finally, each baseline model (linear, linear ratio, polynomial, universal function approximation, and dense morphological function approximation) is evaluated with four kernel sizes: N = 1, N = 3, N = 5, and N = 7. In addition, the input band filter (ibf) and the spatial pyramid refinement block (sprb) are placed respectively upstream and downstream of the network; Figure 11 shows the network synthesis. To deal with lighting variation and saturation, a BatchNormalization layer is placed upstream of the network in all cases. The ibf and sprb modules are optional and can be disabled. When the input band filter is enabled, the incoming tensor of size 1200 × 800 × 13 is transformed into a tensor of size 1200 × 800 × 6 and passed to the generic equation; when it is not, the generic equation receives the raw input tensor of size 1200 × 800 × 13. In all cases the baseline model outputs a tensor of shape 1200 × 800 × 1, and the spatial pyramid refinement block transforms this output into a new tensor of the same size.
All models are evaluated with two metrics, the Dice and mIoU scores, with and without ibf and sprb for each kernel size. The results are presented in Tables 3-6. The best models, above 80% mIoU, are highlighted in bold, and the last row of each table corresponds to the difference with the baseline model (without ibf and sprb).
For all baseline models, the results (in terms of mIoU) show that increasing the kernel size also increases performance. The performance gain between the best models at kernel sizes 1 and 7 is approximately 2%, which corresponds to the influence of spectral mixing: searching for spectral mixing 3 pixels further away (kernel size 7) still increases performance. It is also possible that the function approximation spatially reconstructs some missing information.
For all kernel sizes, the ibf module enhances the mIoU score by up to 3.6%: it prunes the unneeded part of the input signal, which increases the separability and the performance of all models. The sprb module smooths the output by taking neighboring indices into account, but its gains are not always positive and are generally negligible when it is used alone with the baseline model. The baseline polynomial model is probably over-fitted, because the right polynomial order is hard to find; enabling the ibf fixes this issue. However, further study should be done to set the order of the Bernstein expansion.
The dense morphological model with a kernel size of 5 or 7, using both the ibf and sprb modules, is the best model in terms of Dice (≈90%) and mIoU (≈82%), followed by the universal function approximator with a kernel size of 1 or 3 with both modules (Dice up to 89% and mIoU up to 81%). Further studies on the width of the universal function approximator could probably increase its performance. According to [43], it seems natural that the potential of the dense morphological model is higher, although hyper-parameter optimization of the universal function approximator could also improve its results.

Initial Image Processing
To show the importance of the initial image processing, each model was also trained without the various input transformations (ρstd, the Gxx, Gxy, and Gyy filters, the Laplacian filter, and the minimum and maximum eigenvalues). Table 7 shows the scores of DeepIndices for a kernel size of 1 across the different models. The best models, above 80% mIoU, are highlighted in bold, and the last row corresponds to the difference with the baseline model (without ibf and sprb).
The results show that none of the optimized models reaches the performance previously obtained with the initial image processing (best mIoU at 80.15%). The benefit of the transformations reaches approximately 6% of mIoU depending on the model and modules, especially when combining ibf, sprb, and a small kernel size. This means that the signal processing is more important than spectral mixing and texture.

Discussion
Further improvements could address the hyper-parameters of the previously defined equations, such as the degree of the polynomial (set to 11), the CNN depth and width for the Taylor series (set to 3 and 5), and the number of operations in the morphological network (set to 10). In particular, the learned 2D convolution kernels of the Taylor series could be replaced by structured receptive fields [41]. In addition, it would be interesting to transpose our study to new data for other surfaces such as shadows, water, clouds, or snow.
The training dataset is randomly split with a fixed seed, which is used for every learned model. As previously noted, this is important to ensure reproducible results but could also favor specific models. Further work evaluating the impact of varying the training dataset could be conducted.

Model Convergence
Another way to estimate the robustness of a model against its initialization is to compare convergence speeds: models that converge faster should be less sensitive to the training dataset. As an example, the convergence speed of a few models is shown in Figure 12. For greater readability, only models with ibf are presented; the baseline and sprb modules converge similarly, and although the speed of convergence also increases with the kernel size, this does not alter the observations below. An important difference in convergence speed between models is observed, which allows the model types to be grouped as follows:
• Slow-converging models: polynomial models, as well as the majority of linear and linear-ratio models.
• Fast-converging models: universal-function and dense-morphological models, which converge in fewer than 30 iterations.
A subset of slow- and fast-converging models could be evaluated for sensitivity to initialization. The dense morphological network, followed by the universal function approximator, converges faster than the others, regardless of the module or kernel size used.

Limits of DeepIndices
Shadows can be a relatively hard problem in image processing; the proposed models correctly separate vegetation from soil even in shadowed images, as shown in Figure 13. In addition, Figure A1 in Appendix A shows the impact of various acquisition factors, such as shadow, noise, specular reflections or thin vegetation features. Discrimination errors occur where a shadow is cast by a solid object, as the resulting edge diffraction creates small fringes on the soil and vegetation.
A lack of such images in the training dataset could explain the model failure. Data augmentation could be used to obtain a training model containing such images, from cloud shadows to solid objects shadows. Further work is needed to estimate the benefit of such a data augmentation on the developed models.
The smallest parts of the vegetation (less than 1 pixel wide, such as small monocotyledon leaves or plant stems) cannot be detected because of strong spectral mixing. This limitation is due to the acquisition conditions (optics, CCD resolution and elevation) and should be accepted as such. As vegetation wider than 1 pixel is correctly segmented by our approach, the acquisition parameters should be chosen so that the smallest vegetation parts required by an application are larger than 1 pixel in the resulting image.
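Choosing such acquisition parameters can be sketched with the usual pinhole ground-sampling-distance relation. The function and example values below are illustrative, not the paper's camera setup:

```python
def ground_sampling_distance(elevation_m, pixel_pitch_um, focal_length_mm):
    """Pinhole-model GSD in metres per pixel:
    GSD = elevation * pixel_pitch / focal_length."""
    return elevation_m * (pixel_pitch_um * 1e-6) / (focal_length_mm * 1e-3)

# Hypothetical near-field setup: 2 m elevation, 3.75 um pixels, 8 mm lens
gsd = ground_sampling_distance(2.0, 3.75, 8.0)  # metres per pixel
```

With these example values the GSD is just under 1 mm/pixel, so vegetation features wider than about 1 mm would span more than one pixel; lowering the elevation or lengthening the focal length shrinks the GSD accordingly.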
A few spots of specular light can also be observed in the images, particularly on leaves. These spots are often unclassified (or classified as soil), which modifies the shape of the leaves by creating holes inside them, as can be seen in Figure 15: leaves with holes are visible on the left and in the middle of the top bean row. It would be interesting to train the network to detect these spots and assign them to a dedicated class. The locations of the detected spots could then be studied to re-assign them to two classes, specular-soil and specular-vegetation. To perform this step, a semantic segmentation could be set up to specifically identify the objects surrounding the holes. It could be based on the UNet model, which performs a multi-scale approach by computing, processing and re-convolving images at lower resolutions.
More generally, the quality of the segmentation between soil and vegetation strongly influences the discrimination between crop and weed, which remains a major application following this segmentation task. Three categories of problems have been identified: plant size, ambient light variations (shadows, specular light spots), and the morphological complexity of the studied objects.
The size of the plants mainly impacts their visibility in the acquired images; it is not obviously related to the ability of the algorithm to classify them. However, it leads to the absence of essential elements such as monocotyledon weeds at an early vegetation stage. A solution is to set the acquisition conditions so that the smallest vegetation parts cover more than one pixel.
Conversely, variations of ambient light should be handled by the classification algorithm. As previously mentioned, shadow management requires improving the training dataset, and specular light spots could be treated by a multi-scale approach. Their influence on the discrimination step is expected to be major: they alter the shape of the objects classified as plants, which is a useful criterion to discriminate crops from weeds. The morphological complexity of the plants can be illustrated by the presence of stems; in our case, bean stems are similar to weed leaves. This problem should be treated at the discrimination step, and the creation of a stem class (in addition to the weed and crop classes) will be studied in particular.

Conclusions
In this work, different standard vegetation indices have been evaluated, as well as different methods to estimate new DeepIndices through generic equation forms able to reconstruct them. Among the 89 standard vegetation indices tested, the MTVI (Modified Triangular Vegetation Index 1) gives the best vegetation segmentation. Standard indices remain sub-optimal even when they are optimized downstream with a linear regression, because they are usually designed for calibrated reflectance data. The results show that even a simple linear combination is more efficient (+4.87% mIoU) than any standard index, by taking all spectral bands and a few transformations into account. The results also suggest that uncalibrated data can be used in proximal sensing applications, for both standard indices and DeepIndices, with good performance.
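The linear DeepIndex form referred to here amounts to a learned weighted sum of the six bands plus a bias. A minimal sketch on synthetic pixels, using a closed-form least-squares fit in place of the paper's gradient descent (all data and the surrogate target are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n_pixels, n_bands = 1000, 6
bands = rng.random((n_pixels, n_bands))                  # uncalibrated band values
# Surrogate vegetation mask for the sketch (not real ground truth):
target = (bands[:, 3] - bands[:, 1] > 0).astype(float)

# Fit index(x) = w . x + b in closed form
X = np.hstack([bands, np.ones((n_pixels, 1))])           # append bias column
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
index = X @ coef                                         # fitted linear index per pixel
```

Thresholding such an index yields a binary vegetation map; the richer equation forms (polynomial, universal approximator, morphological) generalize this same fitting idea.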
We therefore argue that it is important to optimize both the arithmetic structure of the equation and the coefficients of the spectral bands, which is why our automatically generated indices are much more accurate. The best model outperforms the best standard index by +8.48% and NDVI by +18.21%. The two modules (ibf and sprb) and the initial image transformations also bring a significant improvement. The developed DeepIndices take lighting variation into account within the equation itself, making it possible to sidestep the difficult problem of data calibration. Thus, partially shaded images are correctly evaluated, which is not possible with standard indices since they rely on spectrum measurements that change with shade. However, it would be interesting to evaluate the performance of standard indices and DeepIndices on calibrated reflectance data.
These results suggest that deep learning algorithms are a useful tool to discover the spectral band combinations that identify vegetation in multispectral camera images. Another conclusion from this research concerns the genericity of the developed methodology. This study presents a first experiment on field images with the objective of finding deep vegetation indices, and demonstrates their effectiveness compared to standard vegetation indices. This paper's contribution improves on the classical methods of vegetation index development and allows the generation of more precise indices (i.e., DeepIndices). Similar conclusions may arise from applying this methodology to remote sensing indices that discriminate other surfaces (roads, water, snow, shadows, etc.).