Strategies for the Development of Spectral Models for Soil Organic Matter Estimation

Abstract: Visible (V), near-infrared (NIR), and short-wave infrared (SWIR) spectroscopy has been indicated as a promising tool in soil studies, especially in the last decade. However, applying this method requires prediction models able to capture the intrinsic differences between agricultural areas and incorporate them in the modeling process. High-quality estimates are generally obtained when these models are applied to soil samples with characteristics similar to those of the samples used in their construction, whereas low-quality predictions are obtained for samples from new areas with different characteristics. One way to solve this problem is to recalibrate the models using selected samples from the area of interest. Based on this premise, the aim of this study was to use the spiking technique, alone and associated with hybridization, to expand prediction models and estimate organic matter content in a target area under different uses and management. A total of 425 soil samples were used to generate the state model, and 200 samples from a target area were used to select the subsets (10 samples each) used for model recalibration. The spectral readings of the samples were obtained in the laboratory using an ASD FieldSpec 3 Jr. sensor from 350 to 2500 nm. The spectral curves were then related to the soil attribute by means of partial least squares regression (PLSR). The state model obtained better results when recalibrated with samples selected through cluster analysis. The use of hybrid spectral curves did not generate significant improvements and, in most cases, gave poorer estimates than the state model applied without recalibration. Spiking alone was more effective than spiking combined with hybridization, reaching r², root-mean-square error of prediction (RMSEP), and ratio of performance to deviation (RPD) values of 0.43, 4.4 g dm⁻³, and 1.36, respectively.


Introduction
Soil is a heterogeneous system, displaying complex processes and mechanisms that are difficult to comprehensively understand. Many conventional analytical techniques have been employed in an attempt to establish a direct relationship between soil and soil properties [1]. Knowledge concerning the soil system, its interactions and quality has been systematically supported through routine analyses which, although reliable, involve the collection of large numbers of samples, as well as laborious analysis processes. A high organic matter estimations in areas that are under different uses and management. We chose to work with organic matter because it is an important indicator of the physical and chemical quality of the soil [20,21]. On the other hand, procedures such as that of Walkley and Black [22], traditionally used to quantify this attribute, use hazardous chemical reagents such as a chromium solution [23].
Associated with this, there is a strong demand in the State of Paraná for chemical analyses to determine organic matter, since monitoring the content of this attribute is important for the development of management practices that will improve and maintain the productivity of agricultural soils [24].
Given the above, this study is expected to contribute to understanding how the use of VNIR-SWIR spectroscopy can be expanded to estimate organic matter in new areas, based on a minimum number of soil samples and spectral data from that environment.

Soil Sampling
A total of 425 soil samples (0-20 cm depth) were collected from areas under different uses and management in the state of Paraná, Brazil. The sampling points were selected based on the pedological and geological maps of the state of Paraná, at scales of 1:600,000 and 1:650,000, respectively [25,26]. The sites were chosen so that soil samples were not collected repeatedly on similar parent materials, which prevented repeated samples from influencing the spectral models to be fitted. Soil use and the management practices adopted in agricultural areas were recorded during field work. Subsequently, this information was entered into a geographic information system to create a georeferenced database and a spectral library of the state of Paraná. The predominant soil classes in the areas are lixisols, cambisols, gleysols, ferralsols, arenosols, nitisols, and histosols [27].
In addition, a total of 200 soil samples (0-20 cm depth) were collected in a specific area of 2500 ha located in the municipality of Lobato, northwestern Paraná, Brazil. This area is occupied mainly by forest remnants and sugarcane crops. The soil classes in the area include ferralsols, nitisols, lixisols, cambisols, and arenosols [27]. The location of the target area within the state of Paraná is shown below (Figure 1).

Organic Matter and Spectral Analyses
The soil samples were oven dried at 45 °C for 24 h and subsequently sieved through a 2 mm mesh before chemical analysis. Total organic carbon was determined following the Walkley and Black methodology [22]. The organic matter content was obtained by multiplying the total organic carbon by 1.724, on the assumption that carbon makes up, on average, 58% of humus [28].
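As a quick check of the conversion factor, 1.724 is approximately the reciprocal of the assumed 58% carbon fraction (1/0.58 ≈ 1.724). The following minimal Python sketch illustrates the arithmetic; the function name and the example value are illustrative assumptions and are not taken from the study.

```python
# Conversion of total organic carbon (TOC) to organic matter (OM).
# Assumption stated in the text: carbon makes up 58% of humus on average.
CARBON_FRACTION = 0.58
OM_FACTOR = 1.724  # factor used in the text, approximately 1 / 0.58


def organic_matter(toc):
    """Return the organic matter content, in the same unit as `toc`."""
    return OM_FACTOR * toc


# Hypothetical example value, not a measurement from the study.
example_toc = 10.0
example_om = organic_matter(example_toc)
```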
The organic matter attribute was chosen for spectral modelling because it is an important indicator of soil quality in Brazil and its traditional determination uses reagents which, without proper disposal, may contaminate the environment.
For the spectral readings, in addition to the drying process described above, the samples were milled to homogenize soil particle size and reduce roughness effects [29]. The samples were then placed in a Petri dish 9 cm in diameter and 1.5 cm high, and spectral readings were carried out using an ASD FieldSpec 3 Jr. spectroradiometer, which covers the spectral range from 350 to 2500 nm. The equipment was programmed to perform 50 readings per sample, generating an average spectral curve. A standard white plate with 100% calibrated reflectance was used for data acquisition, according to the Labsphere Reflectance Calibration Laboratory [30]. The fiber optic reader was positioned vertically, 8 cm from the sample support platform. The reading area was approximately 2 cm². The light source was a 650 W lamp with an uncollimated beam for the target plane, positioned 35 cm from the platform and at a 30° angle to the horizontal plane [31].
The spectral readings were repeated three times, with successive 120° clockwise rotations of the Petri dish, allowing for a full sample scan. Subsequently, the simple arithmetic mean of the three readings was determined for each sample, as recommended by Nanni and Demattê [32].

Data Processing and Statistical Analyses
Raw spectral data were preprocessed to improve the stability of the regression models, as described by Lee et al. [33]. Each spectral curve was corrected for baseline and light-scattering effects by the multiplicative scatter correction (MSC) method, according to Buddenbaum and Steffens [34]. For noise reduction, Savitzky-Golay smoothing [35] was applied together with the first derivative, using seven smoothing points. The calibration models were constructed by partial least squares regression (PLSR), using The Unscrambler version 10.3 software package (CAMO, Inc., Oslo, Norway). The prediction performance of the models was assessed using the coefficient of determination (r²), root-mean-square error of prediction (RMSEP), standard error of prediction (SEP), systematic error (BIAS), and ratio of performance to deviation (RPD), as described by D'Acqui et al. [13].
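The two preprocessing steps can be sketched in a few lines of Python. This is an illustration only, not the Unscrambler implementation used in the study. It uses the standard library so that the arithmetic is visible, and it rests on several assumptions not stated in the text: the MSC reference is the mean spectrum unless one is supplied, the Savitzky-Golay derivative uses a first- or second-order polynomial fit (for which the seven-point weights are i/28), bands are evenly spaced, and the three bands at each edge are dropped.

```python
def msc(spectra, reference=None):
    """Multiplicative scatter correction.

    Each spectrum x is regressed on the reference r (x ~ a + b * r) and
    corrected as (x - a) / b. The reference defaults to the mean spectrum.
    """
    n_bands = len(spectra[0])
    if reference is None:
        reference = [sum(s[j] for s in spectra) / len(spectra)
                     for j in range(n_bands)]
    r_mean = sum(reference) / n_bands
    r_var = sum((r - r_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_bands
        slope = sum((r - r_mean) * (x - s_mean)
                    for r, x in zip(reference, s)) / r_var
        intercept = s_mean - slope * r_mean
        corrected.append([(x - intercept) / slope for x in s])
    return corrected


def sg_first_derivative(spectrum, half_window=3):
    """Savitzky-Golay first derivative with a (2 * half_window + 1)-point window.

    For a linear or quadratic local fit the weights are i / sum(i^2),
    which gives i / 28 for the seven-point window. Edge bands are dropped.
    """
    m = half_window
    norm = sum(i * i for i in range(-m, m + 1))
    return [
        sum(i * spectrum[k + i] for i in range(-m, m + 1)) / norm
        for k in range(m, len(spectrum) - m)
    ]
```

In practice a library routine such as SciPy's `savgol_filter` (with `window_length=7` and `deriv=1`) would normally replace the hand-written derivative; the polynomial order would still have to be chosen, since the text does not report it.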

State Model
The state model (unspiked state model) was generated from a total of 425 soil samples collected in the state of Paraná. Its effectiveness in the estimation of organic matter was tested on 200 samples collected in the target area.

Recalibrated State Model
For this step, the spiking technique was used to recalibrate the state model with selected samples from the target area. Sample selection was performed based on spectral sample characteristics. The criterion of choice was based on the distribution of spectra from the set of samples from the target area within the spectral domain they occupy. In this way, we tried to use spectra that were at the limit of the spectral domain, as well as those that were in the center or even randomly distributed, with the objective of covering the entire spectral space.
The selection based on cluster analysis sought to group samples with similar spectra into smaller subsets. Because it is an unsupervised analysis, biased selection is avoided; in addition, recalibration with samples from different clusters may indicate which soil samples from a set are best employed in the spiking technique. Initially, recalibrations were tested with five and 10 samples; however, five samples proved insufficient (not presented).
A larger number of samples was not tested in the recalibration because routine analyses would be required to determine organic matter, which would increase costs and run counter to the purpose of applying spectroscopy. The selection criteria are presented below.
Following Cezar et al. [36], four subsets of 10 samples each were chosen in the spectral space defined by the first two principal components: samples located at the periphery of the spectral space (subset one), samples located in the center of the spectral space (subset two), samples distributed along the spectral space (subset three), and samples belonging to different k-means clusters (subset four).
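Three of the four selection criteria can be illustrated with a short Python sketch operating on (PC1, PC2) score pairs. It is a hedged approximation rather than the authors' procedure: "periphery" and "center" are interpreted as largest and smallest Euclidean distance from the centroid of the scores, the k-means routine uses a deterministic farthest-point initialisation, and one sample (the one nearest the centroid) is taken from each cluster. All function names are hypothetical, and subset three is not covered because the text does not describe how samples "along the spectral space" were picked.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def select_periphery(scores, n=10):
    """Indices of the n samples farthest from the centroid (subset one)."""
    c = centroid(scores)
    order = sorted(range(len(scores)), key=lambda i: math.dist(scores[i], c))
    return order[-n:]


def select_center(scores, n=10):
    """Indices of the n samples closest to the centroid (subset two)."""
    c = centroid(scores)
    order = sorted(range(len(scores)), key=lambda i: math.dist(scores[i], c))
    return order[:n]


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def select_by_kmeans(scores, k=10, max_iter=100):
    """One sample index per k-means cluster (subset four)."""
    # Farthest-point initialisation: an assumption made for reproducibility.
    centers = [scores[0]]
    while len(centers) < k:
        centers.append(
            max(scores, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in scores]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(scores, labels) if lab == j]
            updated.append(centroid(members) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    labels = [nearest(p, centers) for p in scores]
    picks = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            # Representative sample: the member closest to its cluster centroid.
            picks.append(
                min(members, key=lambda i: math.dist(scores[i], centers[j]))
            )
    return picks
```

With the default `n=10` and `k=10`, each function returns the indices of a 10-sample subset, matching the subset size used in the study.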
In a second step, the state model was recalibrated with hybrid spectra, obtained using the datasets of the target area and of the state of Paraná. To obtain these spectra, the four selection criteria mentioned above were applied to both datasets. After subset selection, the simple mean of the corresponding spectra was calculated for each criterion, giving 10 hybrid spectra per criterion and 40 hybrid spectra in total. A general state model recalibration scheme is presented in Figure 2.
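The hybridization step amounts to a band-wise average of paired spectra. A minimal Python sketch follows, assuming that "corresponding" spectra are paired by their position within each 10-sample subset (the text does not specify the pairing rule); the function name is hypothetical.

```python
def hybridize(state_spectra, target_spectra):
    """Band-wise mean of paired state and target spectra.

    Pairing by list position is an assumption of this sketch.
    """
    if len(state_spectra) != len(target_spectra):
        raise ValueError("both subsets must contain the same number of spectra")
    return [
        [(s + t) / 2.0 for s, t in zip(state, target)]
        for state, target in zip(state_spectra, target_spectra)
    ]
```

Applied to two subsets of 10 spectra selected with the same criterion, the function returns 10 hybrid spectra; repeating it for the four criteria gives the 40 hybrid spectra mentioned above.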
After state model recalibration, the model was applied to an unknown data matrix (the remaining 95% of the target-area samples) to test its performance and predictive ability.
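The bookkeeping behind spiking can be sketched as follows: the selected target-area samples are appended to the state calibration set, and the remaining target-area samples are held out for prediction. The sketch assumes plain, unweighted augmentation (the text does not mention extra weighting of the spiking samples) and leaves the PLSR fit itself to whatever regression software is used; the function name is hypothetical.

```python
def spike(state_X, state_y, target_X, target_y, spike_idx):
    """Build a spiked calibration set and a held-out validation set.

    spike_idx holds the indices of the target-area samples added to the
    state calibration set; all other target samples are kept for validation.
    """
    chosen = set(spike_idx)
    cal_X = list(state_X) + [target_X[i] for i in spike_idx]
    cal_y = list(state_y) + [target_y[i] for i in spike_idx]
    val_X = [x for i, x in enumerate(target_X) if i not in chosen]
    val_y = [y for i, y in enumerate(target_y) if i not in chosen]
    return cal_X, cal_y, val_X, val_y
```

With 425 state samples, 200 target samples, and 10 spiking samples, the calibration set has 435 samples and the validation set has 190, which is 95% of the target-area samples.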

Statistical Soil Sample Characterization
The results of the descriptive analysis indicate that organic matter content is variable. Considering the entire state, the range between the minimum and maximum values is wide, with values above 60 g dm⁻³ (Figure 3). Compared to the set of samples from the target area, the standard deviation of the set of samples collected in the state of Paraná is higher, showing lower homogeneity among the data. The high variability (around 72.23%, not presented) of the Paraná state samples is mainly due to differences in climatic conditions across the state [37], leading to significant variations in organic matter content among regions [36].
In addition, the management applied in agricultural areas also results in greater variability, with higher or lower organic matter accumulation on the soil surface. These variations were therefore expected, since both no-tillage and tillage systems are observed in the state of Paraná, with higher organic matter accumulation over the years under no-tillage, in agreement with Martínez et al. [38].
On the other hand, the sample set from the target area presented lower variability, around 59.23% (not presented). In this case, factors such as climate and management were less relevant because of the smaller size of the area (2500 ha), which is mostly used for sugarcane plantations under the same management throughout the crop cycle. To a lesser extent, some forest remnants are also present.

Spectral Soil Sample Characterization
The spectral curves representative of the samples used for model generation also showed differences within the state, as well as differences between the state and the target area samples (Figure 4). The average spectral curves for the samples collected in the target area are better defined, with distinct inter-sample spectral differentiation, mainly at wavelengths greater than 700 nm. On the other hand, the spectral curves for the Paraná samples are less differentiated (except for histosols and arenosols, which have very different spectral behavior) and are better separated at wavelengths greater than 1900 nm. The reflectance factor of the target area samples presents a lower amplitude, ranging from 0.02 to 0.25, while for the Paraná samples it ranges from 0.01 to 0.70. This difference occurs mainly as a function of soil variability in Paraná State, as described in Section 2.1. In addition, soil use and management can lead to changes in spectral behavior, mainly due to variations in organic matter content, which can absorb electromagnetic radiation at all wavelengths, masking absorption bands generated by other elements [39].
This behavior can be observed through the evaluation of the spectral curve of samples collected in the remaining forest area, classified as histosols (Figure 4), which do not display absorption bands except in the 1900 nm region, characteristic for the presence of water [40].
Iron occurrence was the same for all target area samples (absorption around 900 nm), except for arenosols. This is consistent with one of the parent materials that form this soil, which has low iron concentrations. On the other hand, the spectral responses of the Paraná samples indicated the presence of iron only for ferralsols and nitisols, which usually contain more than 150 g kg⁻¹ of this element.
The other classes, besides presenting lower iron levels than the aforementioned soils, displayed impaired iron absorption bands due to the presence of organic matter [41,42]. Organic matter content was higher than 2% for 175 samples, with the potential to influence any spectral curve, as discussed by Baumgardner et al. [43] and Bilgili et al. [44].
Systematic Error, (5) Correlation Coefficient, (6) Ratio of Performance to Deviation; (7) Root-Mean-Square Error for prediction; (8) Standard error prediction; n: number of Paraná and target site soil samples.

Unspiked State Model Calibration and Prediction
Although the BIAS, correlation coefficient, and coefficient of determination results were satisfactory, the RPD value was below the ideal for agricultural use and indicates average precision, following the classification discussed by Viscarra Rossel et al. [1]. According to these researchers, RPD values above 3 are ideal for agricultural use, values from 2 to 3 are good, values from 1.5 to 2 are average, and values below 1.5 are unsatisfactory. This finding is corroborated when the model is used to estimate organic matter content in the target area, where a low ability to predict these values accurately is observed. The RPD in this case was 1.42, somewhat higher than the values obtained by Cezar et al. [36] in a similar study.
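For readers who want to reproduce this kind of assessment, the sketch below computes the performance statistics and applies the RPD classes listed above. The formulas are common chemometric definitions and are assumptions here, since the text only cites D'Acqui et al. [13] for them: BIAS is the mean of predicted minus observed, SEP is the bias-corrected standard error, r² is computed as 1 − SSE/SST, and RPD is the standard deviation of the observed values divided by RMSEP. The handling of the class boundaries (exactly 1.5, 2, or 3) is also an assumption.

```python
import math


def prediction_metrics(observed, predicted):
    """Return r2, RMSEP, SEP, BIAS, and RPD for paired observations."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    bias = sum(errors) / n
    sse = sum(e * e for e in errors)
    rmsep = math.sqrt(sse / n)
    sep = math.sqrt(sum((e - bias) ** 2 for e in errors) / (n - 1))
    mean_obs = sum(observed) / n
    sst = sum((o - mean_obs) ** 2 for o in observed)
    sd_obs = math.sqrt(sst / (n - 1))
    return {
        "r2": 1.0 - sse / sst,
        "rmsep": rmsep,
        "sep": sep,
        "bias": bias,
        "rpd": sd_obs / rmsep,
    }


def classify_rpd(rpd):
    """RPD classes as listed in the text."""
    if rpd > 3.0:
        return "ideal"
    if rpd >= 2.0:
        return "good"
    if rpd >= 1.5:
        return "average"
    return "unsatisfactory"
```

Under these thresholds, the prediction RPD of 1.42 reported for the unspiked state model falls in the "unsatisfactory" class.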
Therefore, it is evident that even a large state model composed of 425 soil samples could not capture the entire variability of the target area and did not accurately represent its organic matter contents, in agreement with Viscarra Rossel et al. [45] and Guerrero et al. [15]. This variability can be seen in the average spectral curves for the target area, which differ in absorption bands and show oscillations in reflectance values among the soil classes (Figure 4).

Recalibration
After recalibration of the state model through spiking and through spiking associated with hybridization, the statistical parameters showed a small improvement over the unspiked state model (Table 2).
Error; (4) Recalibration Systematic Error, (5) Correlation Coefficient, (6) Residual Predictive Deviation; n: number of samples used in the recalibrated model.
When assessing the spiked state model, it is noted that the RMSEC is lower, ranging from 9.6 to 9.9 g dm⁻³, while the correlation coefficient reaches a maximum value of 0.80 for the models recalibrated with subset one and subset four. The RPD values are also relatively better, reaching a maximum of 1.72. These results indicate that the recalibration of the state models with some target area samples can lead to improvements in the statistical parameters, in agreement with Guerrero et al. [46] and Hong et al. [17].
Similar behavior was observed when spiking associated with hybridization was used to recalibrate the state models. However, no significant improvements were observed with this technique. RMSEC values reached a maximum of 9.9 g dm⁻³, while RPD reached 1.66. It should be noted that, in both cases, BIAS was low, indicating that the generated models were not biased.

Prediction
The statistical parameters obtained by the recalibrated model during the prediction phase are presented in Table 3.
A relative improvement in organic matter content estimates is noted after the use of the spiked state model in a new area, in agreement with Brown et al. [9], Sankey et al. [12], and Wetterlind and Stenberg [14]. The RMSEP values were lower than those obtained for the unspiked state model, while the correlation and determination coefficients were higher.
When compared to the work of Lazzaretti et al. [47], who estimated soil organic matter through NIR spectroscopy with an unspiked model in the southern region of Brazil, the results were also slightly better. Emphasis should be given to the correlation coefficient, which ranged from 0.68 to 0.76 (Table 3), depending on the strategy used to select the samples for recalibration of the state model, against 0.62 in the aforementioned work.
(1) Prediction Determination Coefficient; (2) Prediction Root-Mean-Square Error; (3) Prediction Standard Error; (4) Prediction Systematic Error, (5) Correlation Coefficient, (6) Residual Predictive Deviation; n: number of soil samples used in the prediction.
On the other hand, the results were inferior to those obtained by Lazaar et al. [48] who, working in Eastern Morocco with organic matter estimation by means of VNIR-SWIR spectroscopy, obtained r² equal to 0.93 and RMSE equal to 0.13. It should be noted that in this case, the prediction models as well as their validations were developed with a smaller number of samples (lower variability) when compared to the work carried out in Paraná State.
Similarly, the results were lower than those achieved by Qiao et al. [49], who developed organic matter prediction models for Chinese soils using hyperspectral data. These researchers obtained values of r² and RPD (prediction) equal to 0.61 and 5.53, whereas we obtained maximum values for these indices equal to 0.43 and 1.36, respectively, considering the spiked state model.
As already explained for Lazaar et al. [48], the number of samples used for calibration (165) and validation (15) in that work is small compared to that used for the study of Paraná soils. While we obtained a mean value for organic matter above 20 g dm⁻³ and a maximum value above 60 g dm⁻³ (Figure 3), the study by Qiao et al. [49] obtained a mean value of 2.60 g dm⁻³ and a maximum of 4.33 g dm⁻³. These results demonstrate the difference between the organic matter content of tropical soils and that of soils in other parts of the world.
Regarding the use of hybridization, this technique did not allow for improvements in organic matter content estimates when compared to the spiked state model, since lower quality indices were found in most cases (subset one, subset two, and subset three). Likewise, when comparing the recalibrated models using hybrid curves with the unspiked state model, only subset four presented better results, with RMSEP, BIAS, and the correlation coefficient equal to 4.6 g dm⁻³, 1.55, and 0.71, respectively.
The lack of effectiveness of the hybridization to improve organic matter predictions lies in the fact that, although the spectral curves are different in terms of absorption and reflectance, as displayed in Figure 4, both datasets (from Paraná and the target area) are within the same spectral domain, demonstrating that these samples are not spectrally distant, agreeing with Nawar and Mouazen [50] and Hong et al. [17].
This assertion can be corroborated by means of the similarity map formed by principal component analysis (PCA) scores. When the PCA model generated with the Paraná soil sample spectra was applied to the target area sample spectra, the scores calculated for the local site were projected within the spectral space occupied by the state samples (Figure 5), similar to the result obtained by Wetterlind and Stenberg [14].
Figure 5. Principal component (PC) similarity maps between the Paraná and target site datasets. Blue scores were obtained by the calibration model using state spectra. Green scores were obtained by the calibration model using local spectra.
Therefore, for hybrid curves to be successful, the regional, state, or national soil samples that generally make up large spectral libraries must be spectrally very different from the samples collected in new areas where organic matter estimates are of interest, i.e., both datasets must be separated within the occupied spectral space. Thus, after recalibration, this space will be filled by the hybrid curves, forcing the recalibrated model to present better estimation power.
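The similarity check described above, in which a PCA model fitted to the state spectra is applied to the target spectra, can be sketched with NumPy. This is an illustrative approximation rather than the software workflow used in the study. The function names are hypothetical, and the bounding-box overlap measure is an assumption introduced here as a simple numerical stand-in for visual inspection of the score plot.

```python
import numpy as np


def fit_pca(X, n_components=2):
    """Fit PCA by SVD of the mean-centered matrix; rows of X are spectra."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]  # loadings, one row per component


def project(X, mean, loadings):
    """Scores of the spectra in X under an already fitted PCA model."""
    return (np.asarray(X, dtype=float) - mean) @ loadings.T


def fraction_inside(state_scores, target_scores):
    """Share of target scores lying inside the bounding box of state scores.

    Hypothetical overlap measure, not a statistic reported in the study.
    """
    lo = state_scores.min(axis=0)
    hi = state_scores.max(axis=0)
    inside = np.all((target_scores >= lo) & (target_scores <= hi), axis=1)
    return float(inside.mean())
```

A value close to 1 would indicate that the target-area samples fall within the spectral domain of the state library, which is the situation the text reports for Figure 5.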
The improvements in organic matter predictions noted after recalibration of the state model were due to the use of the spiking technique without hybridization, as demonstrated by Guerrero et al. [15]. The presence of both datasets within the same spectral domain (Figure 5) led to more positive results, as advocated by Kuang and Mouazen [51] and Nawar and Mouazen [52]. After spiking, a slight improvement in the model fit was noted, especially when assessing the spiked state model using subset four (Figure 6).
However, although improvements in the predictions of the aforementioned attribute were noted, surpassing the results obtained by Nanni et al. [31], who worked with VNIR-SWIR spectroscopy in soil from this region of Paraná, these were lower than expected. With the use of the spiking technique and hybrid spectral curves, the results were expected to be higher than those reported by Daniel et al. [53], Wetterlind et al. [11], and Wang et al. [54], which was not the case.
The recalibration of the state model with selected samples from the target area, despite displaying slightly improved organic matter estimates, indicated that the type of sample directly influences the result. The selected samples should be able to transfer the maximum variability concerning organic matter composition and amounts to the models to be recalibrated, to allow them to adequately estimate this attribute when it is applied to the sample collection area. Within this context, sample selection based on the cluster analysis (strategy four) was the most adequate for state model recalibration.
Another point that should be highlighted is the number of soil samples used for recalibration. Ten samples were not sufficient to represent the organic matter content distribution within the target area, since 59.23% variation was observed. According to Kuang and Mouazen [55], depending on spatial variability, one to two samples per hectare would be sufficient to capture this dissimilarity at a specific site. Nawar and Mouazen [52] suggested the use of four to five samples per ha for recalibration of European spectral libraries in order to provide adequate accuracy in the prediction of soil organic carbon content.
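To put these sampling densities in perspective, the short Python sketch below scales them linearly to the 2500 ha target area. This is only an illustrative calculation: the cited recommendations were made for the conditions of those studies, and extrapolating them linearly to this area is an assumption of the sketch, not a claim made by the cited authors.

```python
TARGET_AREA_HA = 2500   # size of the target area reported in the text
SPIKING_SAMPLES = 10    # samples used for recalibration in this study


def samples_needed(area_ha, per_ha_low, per_ha_high):
    """Sample counts implied by a density range, scaled linearly with area."""
    return area_ha * per_ha_low, area_ha * per_ha_high


# Density actually used: 10 / 2500 = 0.004 samples per hectare.
density_used = SPIKING_SAMPLES / TARGET_AREA_HA

# One to two samples per hectare (Kuang and Mouazen [55]).
range_one_to_two = samples_needed(TARGET_AREA_HA, 1, 2)

# Four to five samples per hectare (Nawar and Mouazen [52]).
range_four_to_five = samples_needed(TARGET_AREA_HA, 4, 5)
```

The gap between 10 samples and the thousands implied by these densities illustrates why the number of spiking samples has to be balanced against analytical cost.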
Hong et al. [17] observed improvements in the estimation of soil organic carbon when a larger number of samples (from 10 to 60) were used in the recalibration of the prediction model. However, it was detected that there were no significant gains when more than 30 samples were used for recalibration. Considering the edaphoclimatic conditions of the studied areas, the researchers suggested selecting from 20 to 30 samples to recalibrate the models to balance the relationship between the modeling cost and predictive accuracy. In turn, Gogé et al. [56] obtained satisfactory results in the estimation of organic carbon applying the spiking technique when they used 50 local samples to recalibrate the prediction model, corresponding to 35% of the total number of samples from the target area.
It should be emphasized that increasing the number of samples selected from a target area to recalibrate a prediction model is acceptable only to a certain extent, since analytical results are required for each of these samples. If the number is high, spectroscopy ceases to be an attractive technique with innovative potential and becomes an expensive way of estimating organic matter or other soil attributes, in agreement with Hong et al. [17].
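The predictive accuracy that is weighed against the number of recalibration samples is reported in this study as r², RMSEP and RPD. The following is a minimal sketch of how these statistics are commonly computed from paired observed and predicted values. The exact formulas used by the authors are not given in this section, so three choices here are assumptions: r² is taken as the squared Pearson correlation, RMSEP divides by n, and RPD uses the sample standard deviation (n − 1) of the observed values. The function name `prediction_metrics` and the example values are hypothetical.

```python
import math


def prediction_metrics(observed, predicted):
    """Return (r2, rmsep, rpd) for paired observed and predicted values.

    Assumes at least two pairs, non-constant inputs and a non-zero error.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # Root mean square error of prediction, in the units of the attribute.
    rmsep = math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    )

    # r2 taken as the squared Pearson correlation (assumption).
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    r2 = cov ** 2 / (ss_obs * ss_pred)

    # RPD: sample standard deviation of the observed values over RMSEP.
    rpd = math.sqrt(ss_obs / (n - 1)) / rmsep
    return r2, rmsep, rpd


# Made-up organic matter contents (g dm-3), for illustration only.
obs = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]
r2, rmsep, rpd = prediction_metrics(obs, pred)
```

Computing the three statistics for models recalibrated with increasing numbers of local samples is one way to see where additional laboratory analyses stop paying off.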

Conclusions
1. The samples selected through a cluster analysis were more effective for state model recalibration, since they were able to transfer more information about the chemical and physical characteristics of the organic matter attribute present in the target area. This fact allowed for a relatively better prediction when compared to the use of other samples selected by other strategies, reaching r 2 and RPD equal to 0.43 and 1.36, respectively, in the spiked state model; 2. The use of hybrid spectral curves did not allow for improvements in organic matter content estimations, since the target area and Paraná state samples occupied the same spectral space. As the hybrid spectrum contains information from both datasets, these will be effective in completing the spectral space if the groups are in different spectral domain, a fact that did not occur in this work; 3. The spiking technique was more effective in state model recalibration than when in conjunction with hybridization, generating more satisfactory results. Maximum RMSEP and R equal to 4.9 g dm −3 against 6.2 g dm −3 and 0.76 against 0.71, respectively were observed. The use of selected samples from the target area to recalibrate the state model al-