Technical Note: Regression Analysis of Proximal Hyperspectral Data to Predict Soil pH and Olsen P

This work examines two large data sets to demonstrate that hyperspectral proximal devices may be able to measure soil nutrients. One data set has 3189 soil samples from four hill country pastoral farms and the second data set has 883 soil samples taken from a stratified nested grid survey. The laboratory results were regressed against spectra measured on the same samples with a proximal hyperspectral device. The aim was to identify wavelengths that may be proxy indicators for measurements of soil nutrients. Olsen P and pH were regressed against 2150 wavebands between 350 nm and 2500 nm to find the wavebands that were significant indicators. The 100 most significant wavebands for each of Olsen P and pH were used to regress both data sets. The regression equations from the smaller data set were then used to predict the values of pH and Olsen P in the larger data set as a validation. The predictions from the equations from the smaller data set were as good as the regression analyses from the large data set when applied to it. This may mean that, in the future, hyperspectral analysis could be a proxy for soil chemical analysis, or could increase the intensity of soil testing by finding markers of fertility cheaply in the field.


Introduction
Fertilizer is the largest farm working expense in New Zealand hill country farming systems. Fertilizer applications are usually based on soil test results. This work seeks to establish whether it is possible to measure soil fertility parameters using hyperspectral analysis. Large data sets are used to find the wavebands that may be proxies for laboratory testing [1]. New Zealand's hill country farming systems are largely based on sheep, beef, and sometimes deer production, grazing on ryegrass (Lolium perenne L.) and white clover (Trifolium repens L.) swards. Easier topography exhibits higher concentrations of both, while harder hill country swards are dominated by species with lower feed value, such as browntop (Agrostis capillaris L.) and crested dogstail (Cynosurus cristatus L.), with little clover. Steeper hill slopes tend to have higher concentrations of weed species such as Cirsium arvense (L.) Scop. and Ulex europaeus L. New Zealand's temperate climate allows year-round outside grazing. In the most severe environments, such as those created by altitude, animals are typically grazed on lower and better soils during winter and early spring. This is especially the case through lambing and calving in the South Island and, to a lesser extent, on the North Island's central plateau.
Sheep follow cattle in a grazing rotation because sheep graze shorter pasture than cattle. Each animal class also hosts different classes of parasites, and parasitism is reduced as the parasite larvae of each class are destroyed by the grazing of the other. This farming system is heavily dependent on summer rainfall, which is quite variable, with higher rainfall on the west coast than on the east coast of both islands [2,3] (see Figure 1). Where summer rainfall allows enough grass growth to finish stock, farm incomes are higher than on steeper, colder country, where in dry summers farmers are forced to sell stock unfinished, as store, for others to finish on softer country with irrigation [2]. On the steeper country, the use of vehicular traffic is not possible, and fertilizer and herbicide applications are undertaken by aircraft [4].
Soil sampling for chemical analysis is expensive, which limits the number of samples that it is economic to analyze. This results in the averaging of soil samples over large areas, up to several hundred hectares, for economies of scale [5]. The provision of an extremely large data set of soil samples (the current price for a soil test is around US$70) made it possible to undertake some analysis, as this provided the actual data for this work. The cost of soil testing limits its intensity and is another reason that samples are combined to provide an average. In addition, the samples were measured for spectral reflectance, which provided an opportunity to marry the laboratory results to 2150 variables, being the spectral reflectance values. This sampling was undertaken for another project, so it provided a unique opportunity to initiate work to see if soil elements can be predicted. A second data set, which had also been collected for other work, was available at Massey for testing [6]. The farms that were sampled would not have contemplated such intense soil testing if they were required to pay for it.
At present, on hill country farms, sampling is undertaken in transects that combine samples taken 10 m apart to average values from 27 samples [5]. Combining the samples eliminates the variation, especially for Olsen P, which is the most important test in hill country fertilizer recommendations. The variation within transects is much greater than the 10 m sampling distance; therefore, there is no meaningful geospatial interpolation between sampling points [5-7]. Hyperspectral analysis has the potential to be a useful tool for proxy measurements of soil nutrients that could increase sampling density and reduce cost.
Hyperspectral analysis of soil using proximal visible and near infrared (VIS-NIR) spectra has been undertaken for some years. A review of the work indicates that it has been more useful in providing a proxy for soil organic matter and clay content than in predicting elements. However, clay compounds such as kaolinite, minerals such as calcite (a form of calcium carbonate), and pH have been measured with some degree of success [8]. The proximal hyperspectral sensor used in this work is the Fieldspec 4 ASD (Analytical Spectral Devices, Malvern Panalytical, Malvern, UK), which has also had some success in mining large data sets for soil sensing [9-11]. These studies have shown that clay, total organic carbon, and moisture have been measured with variable accuracy, more so in soils with high soil organic matter [11]. Measurement of pH [9] and soil nitrogen [12] has been possible, but the predictions have been less robust.
This work aims to find the spectral wavebands that are proxies for Olsen P and pH, which are the two most important soil analyses used in New Zealand hill country farming. This is because plant available P reduces as soil acidity increases, and acidic soil conditions are the norm in New Zealand. Measuring Olsen P using proximal hyperspectral analysis has not been successfully undertaken, and pH has been measured with only limited success.

Chemical Analyses
A data set of 3189 soil samples that had been analyzed for full soil tests at Analytical Research Laboratory, an ISO 17025 soil laboratory, was married to spectra from the same soil samples collected using a probe on a Fieldspec 4. These measurements were undertaken at the Analytical Research Laboratory in Napier, New Zealand. The samples were collected by Ravensdown from the four farms shown in Figure 2. These farms are part of a private/government research program between Ravensdown Ltd. and the New Zealand Ministry of Primary Industries, which requires extensive soil testing to 7.5 cm [13]. The project aims to improve the performance of steep pastoral country and the efficiency of fertilizer delivery by aircraft. There is very little cropping and arable production on this land and all the samples are from pasture, which is the focus of the research. This was the original large data set used for the multiple partial least squares regression (PLSR) using "R" 3.41 to identify wavebands of interest. Only Olsen P and pH were regressed, as the second, independently obtained data set of 883 soil samples had only Olsen P and pH measured, at the Massey University soil laboratory. The smaller Massey data set was sampled at three depths (0-3 cm, 3-15 cm, and 15-30 cm), whereas all samples from the larger Ravensdown data set were taken at 0-7.5 cm.
The second data set was from two stratified nested grid samples from different fields on the same farm, used to look at geospatial variance with soil depth [6,7]. For the geospatial analysis, the samples were tested for Olsen P using the chemical extraction methods and a photo-spectrometer [6,7]. Those that had not already been measured were completed using the method employed in References [7,14]. The pH measurements were taken using a digital probe in 10 g samples in 25 mL of distilled water. All Fieldspec 4 ASD measurements, at Massey and in Napier, were on dry samples to eliminate the effects of moisture [15].

Statistical Analysis
A PLSR of 2150 wavebands of spectral data from the Fieldspec 4 (350-2500 nm) for the 3189 samples was regressed against the chemical analysis results. A PLSR assumes the data to be normally distributed; the means and standard deviations of the data sets are presented in Table 1. The shapes of the distributions are shown in Figure 3a-d. The Olsen P distributions for both data sets resembled a Chi² distribution more than a normal distribution. For this reason, the log10 of Olsen P for the Ravensdown and Massey data was regressed, as this passed the normality test (see Figure 4).
For this work, the larger data set was regressed using "R" version 3.41 [11,16,17]. The 100 most significant wavebands were selected for Olsen P and pH to regress the smaller Massey University data set. This is because the Massey data set had only 883 samples, so the number of wavebands (factors) used as independent variables should be much smaller. The regression was undertaken initially using "R", then repeated using the Excel plug-in StatTools version 7.5 (Palisade, NY, USA) to check that the results were the same.
The regression equations from the Massey University data set were then used to predict Olsen P and pH on the large data set. The predictions were then compared to the actual laboratory values. This was to establish if the regression of a large data set had found a proxy means of measuring pH and Olsen P using independent data sets.
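The waveband selection step described in this section can be sketched as follows. This is a minimal illustration on synthetic data and not the authors' R workflow. It ranks each band by the absolute Pearson correlation with the soil property, which for a fixed sample size gives the same ordering as the p-value of a single-band linear regression. The function names (`pearson_r`, `top_k_bands`), the 20-band synthetic spectra, and the choice of k = 2 are assumptions made for illustration; the paper selects 100 bands from 2150 and may have ranked them by a different significance measure.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def top_k_bands(spectra, target, k):
    """Return the sorted indices of the k bands with the largest |r|.

    spectra: list of samples, each a list of reflectance values per band.
    target:  list of laboratory values (for example pH), one per sample.
    """
    n_bands = len(spectra[0])
    scores = []
    for j in range(n_bands):
        column = [row[j] for row in spectra]
        scores.append((abs(pearson_r(column, target)), j))
    scores.sort(reverse=True)  # strongest association first
    return sorted(j for _, j in scores[:k])


# Synthetic stand-in data (hypothetical): 300 samples, 20 bands,
# where only bands 4 and 11 carry information about pH.
random.seed(0)
spectra = [[random.random() for _ in range(20)] for _ in range(300)]
ph = [5.0 + 1.5 * row[4] - 1.0 * row[11] + random.gauss(0, 0.05)
      for row in spectra]

selected = top_k_bands(spectra, ph, k=2)
print("selected band indices:", selected)
```

With real spectra, neighboring wavebands are strongly correlated with each other, so a univariate ranking like this one tends to pick clusters of adjacent bands; that is one reason a latent-variable method such as PLSR is normally preferred over selecting bands one at a time.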

Results
The regression analyses on the data sets for pH and Olsen P, summarized in Table 2 for the Ravensdown Ltd. and Massey University data, showed a significant trend between the modeled and actual values for all data sets (p value < 0.0001). However, there is a high degree of variability within the data sets, shown by the high standard errors and low adjusted R² values, with the best predictions being for pH. The scatter plots obtained from the regressions are summarized in Figure 5a-f. The Ravensdown regression analyses include predictions made using the regression equations developed on the Massey data. The number of wavebands was reduced from 2150 to 100 to reduce the autocorrelation and collinearity effects in the regression, which are inevitable with such a large data set, especially when moving to a smaller data set with fewer rows than columns used in the PLSR. The contributions of the selected wavebands were all significant at less than 0.05. The regression analyses using R on the Ravensdown data suggest that, even though there are more rows than columns, regressing on 2150 columns has an autocorrelation effect (see Table 2). The equations from the analyses appear in Appendix A.
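The degrees of freedom quoted in the footnotes to Table 2 and the penalty that adjusted R² applies for a large number of wavebands can be checked with a few lines of arithmetic. The sketch below assumes an ordinary least squares fit with an intercept, so the residual degrees of freedom are n - p - 1; the function names are hypothetical and the raw R² of 0.5 is an arbitrary illustrative value, not a result from the paper.

```python
def residual_df(n_samples, n_predictors):
    """Residual degrees of freedom for a linear model with an intercept."""
    return n_samples - n_predictors - 1


def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / residual_df(n_samples, n_predictors)


# Sample and predictor counts taken from the text and the Table 2 footnotes.
print(residual_df(3189, 2148))  # 1040 (Ravensdown, full set of wavebands)
print(residual_df(883, 100))    # 782  (Massey, 100 wavebands)
print(residual_df(3189, 100))   # 3088 (Ravensdown, 100 wavebands)

# Same illustrative raw R-squared, very different adjusted values.
illustrative_r2 = 0.5  # hypothetical, for demonstration only
print(adjusted_r2(illustrative_r2, 3189, 100))
print(adjusted_r2(illustrative_r2, 3189, 2148))
```

The three residual values match the second number in each footnote pair, which suggests the footnotes report model and residual degrees of freedom in that order. The last two lines show why reducing the number of wavebands matters: with thousands of predictors, a moderate raw R² can correspond to an adjusted R² near or below zero.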

Discussion
The Massey soil data came from 30 cm soil cores sectioned into 0-3 cm, 3-15 cm, and 15-30 cm soil depths, and so included samples from the lowest stratum with extremely small levels of Olsen P, some samples having undetectable levels. This meant that the Massey data regression for Olsen P passes close to the origin. Olsen P is a measure of exchangeable P that is readily available for plant uptake, while the remainder of the P in the soil is in less available forms: organic (biomass/animal excrement) and complexed with Al and Fe. The stratification of phosphate in the soil is closely related to the organic matter stratification. The most popular fertilizer applied in New Zealand is single super phosphate (Ca(H2PO4)2) and the next most popular fertilizer supplying P is diammonium phosphate ((NH4)2HPO4) [18-20]. Phosphates are not very soluble, which can be problematic in situations where surface run-off is likely [20]. The majority of phosphate tends to remain near the surface of soil horizons through cation exchange with clay minerals, which is why soil samples are taken to 7.5 cm in New Zealand pastures [5,19]. While the hyperspectral data was regressed against Olsen P, the spectral reflectance will be from compounds containing phosphate, similarly to calcium being found in calcite [8].
The regression of the Massey Olsen P data had an adjusted R² of 0.55, whereas the Ravensdown data produced an adjusted R² of 0.27. Both data sets had high variation in data fit, but the trends from the analyses are highly significant. The Massey data probably produced a clearer trend as the stratified sample had a larger range of values, with increasing values as samples neared the surface [6,7]. The Ravensdown regression improved to an R² of 0.34 when the log10 of Olsen P was regressed. This results from the Chi² shape of the raw data, which became more normal when the log values were regressed.
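The effect of the log10 transform on a right-skewed variable can be illustrated with a small sketch. The data here are synthetic and drawn from a lognormal distribution, which is an assumption chosen because its logarithm is exactly normal; the paper describes the real Olsen P data as Chi²-like and does not give their distribution parameters. Sample skewness is used as a simple stand-in for the Q-Q normality check, and the function name `skewness` is hypothetical.

```python
import math
import random


def skewness(values):
    """Sample skewness (third standardized moment); near 0 for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Synthetic, right-skewed stand-in for Olsen P values (hypothetical parameters).
random.seed(1)
olsen_p = [random.lognormvariate(math.log(20), 0.8) for _ in range(2000)]
log_olsen_p = [math.log10(v) for v in olsen_p]

raw_skew = skewness(olsen_p)
log_skew = skewness(log_olsen_p)
print("skewness of raw values:  ", round(raw_skew, 2))
print("skewness of log10 values:", round(log_skew, 2))
```

One practical caveat: log10 is undefined for zero, so samples with undetectable Olsen P would need to be excluded or offset before transforming. The text does not state how such samples were handled.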
The pH data was more normally distributed than the Olsen P data for both data sets. Both data sets were able to be regressed to the hyperspectral data with similar trends, which were highly significant. The R² values of the Massey and Ravensdown data were about 0.42.
The regression equations found from the PLSR of the Massey data were also used to predict Olsen P and pH of the Ravensdown data, as shown in Figure 5c,d,f. This exercise was undertaken to see if the equations developed from one data set were able to provide a reasonable fit with the other data set when using the same wavebands. While the gradients of the equations in Figure 5c,d,f are quite different, they both bisect the data values almost in half. This suggests that the spectral wavebands selected may be transferable to other data sets and that further research and analysis using these wavebands may be worthwhile.
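The calibrate-on-one-set, validate-on-the-other exercise can be sketched in a simplified form. The example below is a deliberately reduced illustration, not the PLSR used in the paper: it uses a single synthetic waveband as the predictor, fits a straight line on a small calibration set, and applies that equation unchanged to a larger validation set. The set sizes mirror the 883 and 3189 samples in the text, but all values, the helper names (`fit_line`, `r_squared`, `make_set`), and the underlying relationship are assumptions.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(measured, predicted):
    """Coefficient of determination of predictions against measurements."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


def make_set(n, rng):
    """Synthetic samples: one reflectance band and a noisy pH value."""
    band = [rng.uniform(0.1, 0.6) for _ in range(n)]
    ph = [6.5 - 2.0 * b + rng.gauss(0, 0.15) for b in band]
    return band, ph


rng = random.Random(42)
cal_band, cal_ph = make_set(883, rng)    # smaller calibration set
val_band, val_ph = make_set(3189, rng)   # larger validation set

# Calibrate on the small set, then predict the large set with the same equation.
intercept, slope = fit_line(cal_band, cal_ph)
predicted = [intercept + slope * b for b in val_band]

transfer_r2 = r_squared(val_ph, predicted)
# Gradient of measured against predicted; close to 1 when the equation transfers.
_, gradient = fit_line(predicted, val_ph)
print("validation R2:", round(transfer_r2, 2))
print("measured vs predicted gradient:", round(gradient, 2))
```

Because both synthetic sets are drawn from the same relationship, the equation transfers almost perfectly here. With real data from different farms and sampling depths, a gradient that departs from 1 in the measured-versus-predicted plot would indicate that the calibration does not carry over cleanly, even if the trend remains significant.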
This exercise shows that there is promise in developing proximal hyperspectral sensing of soil nutrients. This may prove to be a cheaper alternative to chemical analysis if it can be undertaken in situ in the field. This could lead to more intensive testing, which would help deliver the desired fertilizer nutrient to where it is needed through variable rate application, providing more efficient fertilizer utilization.

Figure 2. Location and name of the farms where samples were collected, with the sampling body (Ravensdown Ltd. or Massey University), the size of each farm, and the number of samples taken.

Figure 3. Distribution of (a) the Massey Olsen P data; (b) the Massey pH data; (c) the Ravensdown Olsen P data; and (d) the Ravensdown pH data.

Figure 4. (a) Q-Q normality test for the raw Olsen P data, which has a Chi² distribution. (b) The log10 of the Olsen P data produces a normal distribution.

Figure 5. Scatter plots of modeled Olsen P and pH results compared to measured values, for (a,b) the Massey calibration set and (c-f) the application of the Massey model to the Ravensdown data set for validation and comparison with the regression on that data set.

Table 1. Mean and standard deviations of the two data sets.

Table 2. Summary of regression analyses.
1 R regression, 2148 and 1040 degrees of freedom; 2 R regression, 100 and 782 degrees of freedom; 3 StatTools regression, 100 and 782 degrees of freedom; and 4 StatTools regression, 100 and 3088 degrees of freedom.
