Near-Infrared Spectroscopy and Machine Learning for Wood Species Discrimination in an Amazon Floodplain Forest Management Area

Washington Duarte Silva da Silva; Joielan Xipaia dos Santos; Tawani Lorena Naide Acosta; Deivison Venicio Souza; Ana Paula Souza Ferreira; Pamella Carolline Marques dos Reis Reis; Leonardo Pequeno Reis; Helena Cristina Vieira; Graciela Inés Bolzon de Muñiz; Silvana Nisgoski

doi:10.3390/f16060984

¹ Department of Forest and Technology Engineering, Federal University of Paraná, Curitiba 80210-170, Paraná, Brazil

² Forest Engineering, Federal University of Pará, Altamira 68372-040, Pará, Brazil

³ Capitão Poço Campus, Federal Rural University of Amazonas, Capitão Poço 68650-000, Pará, Brazil

⁴ Department of Forest Science, Federal Rural University of Pernambuco, Recife 52171-900, Pernambuco, Brazil

Forests 2025, 16(6), 984; https://doi.org/10.3390/f16060984

This article belongs to the Special Issue Wood Properties: Measurement, Modeling, and Future Needs

Abstract

This study analyzes the near-infrared (NIR) spectral characteristics of the wood of Hevea spruceana (Benth.) Müll. Arg., Hura crepitans L., Ocotea cymbarum Kunth, and Pseudobombax munguba (Mart. & Zucc.) Dugand from an Amazon floodplain forest area located in the Mamirauá Sustainable Development Reserve, with the aim of discriminating the species using artificial intelligence. The samples were collected as increment cores, and NIR spectra were collected at random positions on the transverse surface and compared. Principal component analysis (PCA) was applied to explore variation patterns in the data. Additionally, three classifiers, namely support vector machine (SVM), partial least squares–discriminant analysis (PLS-DA), and k-nearest neighbors (k-NN), were used to evaluate the accuracy in distinguishing the woods based on the NIR data. The results indicate similar spectral behavior among the species, with differences in absorbance intensities. PCA revealed a tendency for samples of the same species to cluster together, with Ocotea cymbarum showing the highest tendency for grouping. Among the classifiers, PLS-DA achieved the highest accuracy (98%). We conclude that NIR spectroscopy combined with artificial intelligence classifiers has the potential to distinguish the analyzed wood species from the Amazon floodplain forest.

Keywords: Brazilian native species; nondestructive techniques; wood distinction; classification models

1. Introduction

The Amazon Rainforest has great diversity of fauna and flora with global importance, in addition to a diversity of unique ecosystems. Among these ecosystems, the main ones are the upland forests and floodplain forests. Floodplain areas are subject to inundation cycles, which can be daily or annual. In the Amazon, flooded areas are bordered by rivers, lakes, water holes, and/or streams [1]. The environmental conditions of these regions directly impact the floristic composition, resulting in vegetation adapted to seasonal water variations [2].

Logging is a reality in Amazon floodplain forests, with this ecosystem being the second most affected by predatory exploitation of the biome. Due to the diversity of species in floodplains, it is essential to use techniques that aid in their discrimination, reducing the risk of involuntary or intentional replacement, in addition to minimizing waste and the inappropriate use of wood [3].

Timber identification in the Amazon is challenging, since many species share similar characteristics, especially color and texture [4]. In particular, in floodplain areas, this difficulty is even greater due to the smaller number of studies characterizing timber species in this environment in comparison to those in upland (dryland) areas [5].

Traditionally, wood identification is performed using macroscopic and microscopic anatomical analysis, the latter being more accurate because it allows for the detailed visualization of the wood sample’s cellular elements [6]. However, these methods require time, specialized equipment, and advanced technical knowledge, which limit their large-scale application [7].

Currently, near-infrared (NIR) spectroscopy stands out as a nondestructive technique for identifying timber species. This technique allows for quick and efficient chemometric characterization of wood, and can be applied in laboratories or in the field using portable equipment. When combined with machine learning methods, such as partial least squares-discriminant analysis (PLS-DA), support vector machine (SVM), and k-nearest neighbors (k-NN), NIR spectroscopy has shown high potential to distinguish timber species [8].

The species analyzed in this study—Pseudobombax munguba (Mart. & Zucc.) Dugand, Ocotea cymbarum Kunth, Hura crepitans L., and Hevea spruceana (Benth.) Müll. Arg.—have great economic importance for Amazon floodplain areas [6]. According to data from Brazil’s Ministry of the Environment [9], between 2020 and 2025, 116,779.30 m³ of wood of these species was sold, mainly in the form of logs, sawn lumber, and laminated wood.

The following hypotheses were formulated in this study: (I) the near-infrared method can separate wood from the four Amazonian floodplain tree species based on their spectral characteristics; and (II) different machine learning classifiers can equally effectively differentiate the studied wood species.

Based on the factors discussed above, this study aims to analyze the application of near-infrared spectroscopy combined with machine learning algorithms in the discrimination of tree species harvested from a floodplain forest area of Central Amazonia, based on increment core samples.

2. Materials and Methods

2.1. Location of Study

Wood increment core samples were collected in Mamirauá Sustainable Development Reserve (RDS) with geographic coordinates 03°08′ S/64°45′ W × 02°36′ S/67°13′ W, located in the mid-Solimões region, in the center-west of Amazonas State. The region studied is defined as a high floodplain forest area, located in RDS Mamirauá territory, which encompasses approximately 1,240,000 hectares of floodable ecosystems, delimited by the rivers Japurá, Auati Paraná, and Solimões [10], which together integrate the Central Amazon Biosphere Reserve (RDAC).

2.2. Species and Sampling

Samples from the species Hura crepitans L. (‘assacu’) and Hevea spruceana (Benth.) Müll. Arg. (‘seringueira-barriguda’), both from the Euphorbiaceae botanical family, Ocotea cymbarum Kunth (‘louro inamuí’) from the Lauraceae family, and Pseudobombax munguba (Mart. & Zucc.) Dugand (‘munguba’) from the Malvaceae family were evaluated.

For analysis, samples were obtained with an increment borer. For each species, we selected four adult individuals, and the material was collected at two different trunk heights, namely at diameter at breast height (DBH, 1.30 m from the ground) and at a trunk height of 2 m (Figure 1). The materials were identified and stored in a support, separated by a divider. For each species, 8 increment cores were obtained, for a total of 32 for the four studied species.

Figure 1. Schematic representation of sampling and analysis.

From all of the sampled trees, botanical material was collected in accordance with the procedures described in Martins-da-Silva [11]. An experienced parabotanist made the initial recognition of the species in loco, and the vegetative material was then sent to the herbarium of the National Institute of Amazonian Research (INPA-AM) to confirm the identification. The studied species are recorded under code A66D164 in the System of National Genetic Heritage Management (SisGen).

2.3. Near-Infrared Spectroscopy (NIR)

Spectra were acquired using a Vertex 70 spectrophotometer (Bruker Optics, Ettlingen, Germany), operating in diffuse reflectance mode, with a resolution of 8 cm⁻¹, in a spectral range of 4000–10,000 cm⁻¹. Since the sample preparation can directly influence the NIR spectra [12], and to represent the methods commonly applied for wood transportation, the spectra were collected directly from the increment core surfaces, and no separation of heartwood and sapwood was performed.

To standardize the surface and eliminate the influence of oxidation on the increment cores, all samples were polished on the transverse surface with 180 and 320 grit sandpaper (Figure 1). To reach moisture equilibrium with the environment (~12%), samples were kiln-dried and kept in a room with controlled conditions, at a temperature of 23 ± 2 °C and a relative humidity of 60 ± 5%, until final analysis. In total, 20 spectra were obtained from each sample, covering the full core length from bark to pith, for a total of 160 per species and an overall total of 640 spectra.

2.4. Statistical Analysis and Classification Methods

To verify the similarity between NIR spectra from the studied species, principal component analysis (PCA) was performed using R software 4.3.2 [13], applying two packages: FactoMineR [14] and factoextra [15].

To test the recognition of each species from the NIR spectra, three machine learning techniques were applied: partial least squares–discriminant analysis (PLS-DA); support vector machine (SVM); and k-nearest neighbors (k-NN). All analyses were performed using R software 4.3.2. A summary of each technique is given below.

2.4.1. Support Vector Machine (SVM) Learning

SVM is a flexible algorithm used to solve classification problems [16]. In general, its purpose is to find one or more hyperplanes in a high-dimensional space that best separate the evaluated classes. To estimate the decision boundaries, nonlinear SVM applies a kernel function (e.g., radial, polynomial), with the aim of widening the separation between classes [17]. For this study, the radial kernel function served as the basis for SVM, using the implementation available in the ‘kernlab’ library [18]. This algorithm uses the constraint violation cost (C) and the radial basis kernel parameter (sigma) as tuning hyperparameters.
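As a minimal illustration of the radial kernel mentioned above, the following Python sketch computes the similarity between two spectra, assuming the parameterization k(x, y) = exp(-sigma · ||x - y||²). It is not the kernlab implementation used in the study; the function name and the toy absorbance values are hypothetical.

```python
import math

def rbf_kernel(x, y, sigma):
    """Radial basis kernel, assumed here as exp(-sigma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)

# Toy "spectra": made-up absorbance values, not data from the study.
spectrum_a = [0.42, 0.47, 0.51]
spectrum_b = [0.40, 0.49, 0.55]

same = rbf_kernel(spectrum_a, spectrum_a, sigma=0.5)      # identical inputs
close = rbf_kernel(spectrum_a, spectrum_b, sigma=0.5)     # slightly different inputs
sharper = rbf_kernel(spectrum_a, spectrum_b, sigma=50.0)  # same pair, larger sigma
```

A larger sigma makes the similarity fall off faster with distance, which is why sigma is tuned together with the cost C.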

2.4.2. Partial Least Squares—Discriminant Analysis (PLS-DA)

PLS-DA derives from the classical partial least squares (PLS) regression model [19,20], which is applied in prediction models as a generalization of multiple linear regression (MLR) [19]. The PLS technique is relevant because it performs well with many predictors and relatively few samples, in addition to being robust to the presence of noise and strongly collinear variables [19,21]. In classification situations, the approach resembles that employed by classical PLS, except that the desired response variable is a category [20]. To implement PLS-DA in this study, the ‘pls’ algorithm from the CARET package [22] was applied, which interfaces with the package of the same name [21]. In this context, a matrix of indicator (dummy) variables was generated, with each class represented by one column. This matrix served as the response matrix for the plsr() function of the ‘pls’ package [23]. The softmax function transforms the model predictions into probability-like values, and the predicted class is the one with the highest value. The number of components (ncomp) is the tuning hyperparameter in the applied algorithm.
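The class-encoding and decision steps described above can be sketched in a few lines of Python. This is only an illustration of one-column-per-class encoding followed by softmax and selection of the highest value; the PLS regression itself is omitted, and the helper names and raw scores are hypothetical.

```python
import math

CLASSES = ["H. crepitans", "H. spruceana", "O. cymbarum", "P. munguba"]

def one_hot(label, classes=CLASSES):
    """Encode a species label as one response column per class."""
    return [1.0 if label == c else 0.0 for c in classes]

def softmax(scores):
    """Turn raw per-class scores into probability-like values that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_class(scores, classes=CLASSES):
    """Return the class with the highest softmax value, plus all values."""
    probs = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probs[i])
    return classes[best], probs

# Hypothetical regression output for one spectrum (one score per class column).
raw_scores = [0.10, 0.05, 0.80, 0.05]
label, probs = predict_class(raw_scores)
```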

2.4.3. K-Nearest Neighbors (k-NN)

k-NN, a nonparametric method based on supervised learning, is commonly used in regression analysis and classification tasks [24,25]. This method is one of the top 10 data mining algorithms [26]. To apply the k-NN method, some components are required: a labeled dataset; a similarity or distance metric to assess the proximity between samples; the number of nearest neighbors, represented by the value of k; and a method to weight the neighbors, giving greater weight to the samples closest to the observation being predicted [27,28]. In the present study, we implemented the ‘knn’ algorithm from the CARET package [22]. In the algorithm applied, the number of nearest neighbors, k, is the tuning hyperparameter. The standard Euclidean metric was used to calculate the distance between observations, and uniform weights were assigned to the nearest neighbors.
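A minimal Python sketch of k-NN classification with Euclidean distance and uniform weights (a plain majority vote) is shown below. It is not the CARET ‘knn’ implementation; the function names and the two-variable toy data are hypothetical.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training samples (uniform weights)."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data with two made-up classes, not spectra from the study.
train_X = [[0.10, 0.20], [0.12, 0.22], [0.11, 0.19], [0.80, 0.90], [0.82, 0.88]]
train_y = ["A", "A", "A", "B", "B"]

pred_near_a = knn_predict(train_X, train_y, [0.13, 0.21], k=3)
pred_near_b = knn_predict(train_X, train_y, [0.79, 0.91], k=3)
```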

2.5. Experiments, Cross-Validation, and Evaluation Metrics

The predictive models were built with near-infrared spectra, based on the assumption that spectra from different positions of the tree trunk can produce patterns for species discrimination. To estimate the performance of the classifiers, k-fold cross-validation (2-fold, repeated 25 times) was used, with stratification by species class and blocking by increment core trunk position. Thus, 50 validation partitions (2 folds × 25 repetitions) were evaluated for each predictive model. The cross-validation splits were created with the cvo_create_folds() function of the manyROC package [29]. In this process, the vector corresponding to the response variable (species) was selected as the stratification factor, while the increment core trunk position column was used as the blocking factor. This design ensured that spectra from the same tree, and therefore from the same increment core of a species, were present in either the training or the validation set but never both, which avoids data leakage and makes the model performance estimates more realistic.
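The idea of stratified and blocked splitting can be sketched in Python as follows. This is not the cvo_create_folds() implementation from manyROC; the function name is hypothetical, and the sketch assumes that the blocking unit is the individual increment core, so that all 20 spectra of a core stay together. The layout (4 species, 8 cores per species, 20 spectra per core) follows the sampling described in Section 2.

```python
import random
from collections import defaultdict

def blocked_stratified_splits(species, blocks, n_folds=2, n_repeats=25, seed=1):
    """Return (train_idx, valid_idx) pairs; every block stays in a single fold."""
    rng = random.Random(seed)
    blocks_by_species = defaultdict(set)
    for sp, block in zip(species, blocks):
        blocks_by_species[sp].add(block)

    splits = []
    for _ in range(n_repeats):
        fold_of_block = {}
        # Stratification: shuffle and distribute blocks separately per species.
        for sp in sorted(blocks_by_species):
            shuffled = sorted(blocks_by_species[sp])
            rng.shuffle(shuffled)
            for i, block in enumerate(shuffled):
                fold_of_block[block] = i % n_folds
        for fold in range(n_folds):
            valid = [i for i, b in enumerate(blocks) if fold_of_block[b] == fold]
            train = [i for i, b in enumerate(blocks) if fold_of_block[b] != fold]
            splits.append((train, valid))
    return splits

# Layout of the study: 4 species x 8 cores x 20 spectra = 640 spectra.
species, blocks = [], []
for sp in ["Hc", "Hs", "Oc", "Pm"]:        # abbreviated species labels
    for core in range(8):
        for _ in range(20):
            species.append(sp)
            blocks.append(f"{sp}-core{core}")

splits = blocked_stratified_splits(species, blocks)
```

With 2 folds and 25 repetitions, the sketch yields 50 partitions, each holding half of the cores of every species in the validation set.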

For the training of the predictive models, we used the CARET package interface [22], a framework available in the R language for classification and regression analysis. In the preprocessing stage, the predictors were centered and scaled. A grid of hyperparameters was established for each machine learning algorithm applied in the study, and a grid search strategy was used to identify the optimal values. Table 1 presents the hyperparameters evaluated for each algorithm.
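The preprocessing and grid search steps can be illustrated with a short Python sketch. It assumes that the centering and scaling statistics are estimated on the training partition only and then applied to the validation partition; the function names, toy values, and cross-validated accuracies are all hypothetical and do not come from the study.

```python
import statistics

def fit_scaler(train_X):
    """Estimate per-column mean and standard deviation on training data."""
    columns = list(zip(*train_X))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return means, sds

def transform(X, means, sds):
    """Center and scale each column; constant columns are only centered."""
    return [[(v - m) / (s if s > 0 else 1.0) for v, m, s in zip(row, means, sds)]
            for row in X]

# Toy predictors (made-up values): rows are spectra, columns are wavenumbers.
train_X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
valid_X = [[2.5, 30.0]]

means, sds = fit_scaler(train_X)
train_scaled = transform(train_X, means, sds)
valid_scaled = transform(valid_X, means, sds)  # reuses the training statistics

# Grid search: keep the hyperparameter value with the best cross-validated
# accuracy. The accuracies below are invented for illustration.
cv_accuracy = {3: 0.60, 5: 0.70, 7: 0.65}      # candidate k -> mean accuracy
best_k = max(cv_accuracy, key=cv_accuracy.get)
```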

Table 1. Hyperparameter variants evaluated for each algorithm and packages applied.

To select the models, we evaluated the metrics by cross-validation, analyzing accuracy (Equation (1)), recall (Equation (2)), and F1 score (Equation (3)). The model selected for the species recognition rate was the one with the highest accuracy.

Accuracy = (TP + TN)/(TP + FP + TN + FN), (1)

Recall = TP/(TP + FN), (2)

F1-score = (1 + β²) · (Precision · Recall)/((β² · Precision) + Recall), (3)

where true positive (TP) is the number of samples correctly classified in class Ci; true negative (TN) represents the number of samples classified correctly as not from class Ci; false positive (FP) is the number of samples incorrectly classified in class Ci; and false negative (FN) indicates the number of samples incorrectly classified as not from class Ci. Precision is TP/(TP + FP), and β = 1 for the F1-score.
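A short Python sketch shows how Equations (1)–(3) are evaluated for a single class Ci from its one-vs-rest counts. The counts are made up for illustration and are not results from the study; precision is taken as TP/(TP + FP).

```python
def accuracy(tp, tn, fp, fn):
    """Equation (1): share of all samples classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    """Equation (2): share of class Ci samples that were recognized."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Share of samples assigned to class Ci that truly belong to it."""
    return tp / (tp + fp)

def f_beta(prec, rec, beta=1.0):
    """Equation (3); beta = 1 gives the F1-score."""
    return (1 + beta ** 2) * (prec * rec) / ((beta ** 2 * prec) + rec)

# Made-up one-vs-rest counts for one class.
tp, fn, fp, tn = 45, 5, 10, 140

acc = accuracy(tp, tn, fp, fn)
rec = recall(tp, fn)
prec = precision(tp, fp)
f1 = f_beta(prec, rec)
```

With β = 1, Equation (3) reduces to 2TP/(2TP + FP + FN), which gives a quick check on the computed value.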

3. Results

The average NIR spectra for the four Amazonian floodplain species, in the wavenumbers from 4000 cm⁻¹ to 10,000 cm⁻¹ (Figure 2), exhibited similar behaviors, but with different absorbance intensities. In general, from the initial spectra wavenumber to 7400 cm⁻¹, lower absorbance values were observed for the species P. munguba; after this spectral range, the species O. cymbarum presented a lower absorbance value (Figure 2). Throughout the entire spectral wavenumber interval, the species H. spruceana and P. munguba presented similar absorbance values. In the range from 7400 cm⁻¹ to 9900 cm⁻¹, H. crepitans spectra showed a tendency to separate from the others due to a greater increase in absorbance intensity (Figure 2).

Figure 2. Mean NIR wood spectra from the evaluated species.

The PCA performed using the NIR spectra from the wood samples indicated that approximately 98.4% of the total variance was explained by the first two principal components, with 65.4% for PC-1 and 33% for PC-2. Furthermore, these components had eigenvalues (λi) greater than 1, meaning that it is reliable to make inferences from these components (Figure 3).
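The relation between eigenvalues and explained variance can be sketched in Python: each component explains λi divided by the sum of all eigenvalues. The eigenvalues below are made up for illustration and are not those obtained in the study.

```python
def explained_variance(eigenvalues):
    """Proportion of total variance explained by each principal component."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

# Made-up eigenvalues, sorted in decreasing order.
eigenvalues = [6.0, 3.0, 0.6, 0.4]

proportions = explained_variance(eigenvalues)
first_two = proportions[0] + proportions[1]   # cumulative share of PC-1 and PC-2

# Components retained under the "eigenvalue greater than 1" rule.
retained = [i + 1 for i, value in enumerate(eigenvalues) if value > 1]
```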

Figure 3. Score plot (A) and loading graph for PC-1 and PC-2 (B) in the principal component analysis, with the mean NIR spectra of the four species being evaluated.

Observing the biplot score graph (Figure 3A), a greater tendency for approximation between wood samples of the same species was verified. However, O. cymbarum showed the greatest tendency for grouping, while the other species did not present a clear separation, remaining close to each other. The species H. crepitans showed a more dispersed behavior, while H. spruceana and P. munguba had greater similarity. Most samples of the species were concentrated around the central coordinate (0, 0) of the biplot, indicating a region of confusion between the species (Figure 3A).

We then used NIR spectral data to build species recognition models (n = 4) using machine learning techniques. The classifier built using the PLS-DA algorithm had the best predictive performance, with an accuracy and F1-Score above 98% in repeated cross-validation (Table 2). The k-NN model had the worst performance among the classifiers, with an accuracy of 57.14%.

Table 2. Metric performance of classifiers in repeated cross-validation.

Figure 4 shows the recognition rate per species for the validation samples in the most accurate PLS-DA classification. The best performance in species recognition was achieved by the PLS-DA classifier for P. munguba, with 100% recognition (Figure 4). Overall, all of the species evaluated showed excellent recognition rates, above 96%.

Figure 4. Recognition rate, using the most accurate PLS-DA classifiers, by species in repeated cross-validation. ncomp = number of principal components.

4. Discussion

The analysis of the NIR spectra revealed similar behavior among the studied species, although with differences in absorbance intensities. The species P. munguba presented lower absorbance values up to 7400 cm⁻¹, while O. cymbarum had lower absorbance after this spectral range. The similarity between H. spruceana and P. munguba between 8749 and 8547 cm⁻¹, according to Schwanninger et al. [30], is related to the aromatic groups of lignin. On the other hand, the higher absorbance intensity of H. crepitans between 7400 and 9900 cm⁻¹ indicates possible differences in the chemical structure of the wood (Figure 2). These patterns reinforce the importance of NIR for the chemical characterization of wood, as pointed out by Schwanninger et al. [30], who attributed spectral variations to the presence of cellulose, hemicellulose, lignin, and extractives.

Spectral similarities between species were also observed by Soares et al. [31], who reported the need for chemometric techniques to differentiate Amazonian species. Santos et al. [8] highlighted that, despite the similarities in NIR spectra, the distinctions between species occur mainly through absorbance intensity, a pattern identified in the present study.

In the range of 5400–5900 cm⁻¹, a region associated with the cell wall structure, all of the species presented absorbance peaks, mainly indicating the presence of cellulose, lignin, and hemicellulose [30]. This behavior was also reported by Santos et al. [8] in samples of Mezilaurus sp. and Nectandra sp. from the Amazon Rainforest, indicating that this spectral range may be relevant for the discrimination of tropical species.

In their study, Schwanninger et al. [30] described the band at 4146–4335 cm⁻¹ as being related to cell wall components, as well as the band between 5697 and 6110 cm⁻¹. Absorption at 5995 cm⁻¹ was attributed to extractives. Hemicelluloses show peaks between 4401 and 6800 cm⁻¹. The wavenumbers of 6287 cm⁻¹ and 7000 cm⁻¹ were associated, respectively, with the crystalline and amorphous regions of cellulose.

Eugenio da Silva et al. [32] demonstrated that NIR spectroscopy was able to identify seven species with over 95% accuracy, even after exposure to environmental weathering. According to Gomes et al. [33], it is possible to discriminate wood even under different moisture conditions, which indicates that water content does not significantly interfere with identification. In practice, this expands the potential of the technique for application in different contexts of wood use and conservation.

PCA showed that 98.4% of the total variance in the data was explained by the first two principal components, with a clear tendency for samples within each species to converge, especially for O. cymbarum. However, the separation between the other species was not completely defined, which can be attributed to the similarity in wood characteristics between H. spruceana and P. munguba. In PC1, the most informative regions are near 6900–7000 cm⁻¹, which is attributed to amorphous regions of cellulose, amorphous polysaccharides, part of the glucomannan and xylan, and some water bonds [30]. In PC2, wavenumbers from approximately 4000 to 5000 cm⁻¹, related to cell wall composition [30], had a greater influence on species distribution. Despite the similarities in NIR spectra, which occur in all lignocellulosic materials, PCA was performed as a preliminary visualization to verify the distribution of the samples of each species, before testing different machine learning methods on spectra of wood from floodplain regions and verifying the correct distinction of species.

This limited separation may also reflect the intrinsic limitations of PCA when handling high-dimensional spectral data. Therefore, although PCA is a useful technique in various applications, its effectiveness tends to decrease when applied to high-dimensional datasets, such as NIR spectra. The method also faces limitations related to computational complexity and the difficulty of efficiently representing sparse structures. This can impair its ability to maximize the explained variance while simultaneously controlling the number of non-zero elements in the principal components [34]. In contrast, supervised learning techniques are used to enhance class separation by leveraging labeled data, which typically results in better classification performance than that achieved by unsupervised approaches [35].

Machine learning models based on NIR spectra (Table 2) demonstrated high performance in species classification, with PLS-DA achieving accuracy and F1-Score above 98%. P. munguba showed 100% recognition, a result consistent with the smoothing of peaks in the region of 7200–9900 cm⁻¹, which generates a distinct spectral pattern and facilitates discrimination (Figure 2). In contrast, the k-NN model had lower performance, with an accuracy of 57.14%, reinforcing that approaches based on linear regression, such as PLS-DA, are more suitable for this type of analysis.

The suboptimal performance observed for SVM and k-NN classifiers may be closely linked to specific properties of NIR spectral data. In particular, the high number of correlated variables commonly found in spectral datasets can hinder models that rely on distance metrics or that are sensitive to redundancy among features. For instance, k-NN is especially impacted by the curse of dimensionality, where the effectiveness of distance-based discrimination diminishes as the number of features increases [36]. Similarly, SVM performance can suffer in high-dimensional and collinear settings, especially when hyperparameters—such as kernel type and regularization—are not finely tuned [19,37].

Previous studies corroborate these findings. Pastore et al. [38] obtained good results in distinguishing species similar to mahogany using PLS-DA, while Bergo et al. [39] achieved over 96% accuracy in separating samples of mahogany, cedar, and andiroba. Soares et al. [31], using handheld NIR and classification by PLS-DA, obtained 90% efficiency in the distinction of six tropical species. Evaluating Dalbergia L.f. species that are listed as endangered, Snel et al. [40] found efficiencies above 90% when analyzing the potential of NIR combined with PLS-DA to identify wood. Santos et al. [8] obtained an accuracy above 98% when applying PLS-DA in the discrimination of species from the “Louros” group based on wood NIR spectra.

Despite the high overall performance of the model, there was some confusion between H. crepitans and H. spruceana, possibly due to shared anatomical features such as lower vessel abundance and low occurrence of axial parenchyma [6]. This structural similarity, combined with chemical proximity, may explain the difficulty in completely separating the species.

The objective of this paper was attained, namely, to test a fast methodology for classifying wood species from floodplain regions that requires no knowledge of anatomical and chemical characteristics and can be applied by anyone involved in forest harvesting. We suggest a future chemical and anatomical analysis of the species, which will contribute to a detailed explanation of NIR performance in relation to the individual composition of the wood.

5. Conclusions

In the NIR data, the spectral curves were similar, with the main difference being in the absorbance intensity. PCA revealed the potential of this technique to identify the shared characteristics among species. The application of NIR spectra and machine learning algorithms resulted in highly accurate classifiers, with the PLS-DA model standing out with an accuracy and F1-Score of 98.46%. These results clearly point to the informative potential of NIR data combined with artificial intelligence classifiers in distinguishing floodplain tree species, and they contribute to the development of more effective tools for identifying and monitoring timber, promoting the more sustainable use of forest resources. Future studies could expand this evaluation by including other species and by using other classification algorithms, such as logistic regression and random forest, to further enhance the robustness of the models.

Author Contributions

Conceptualization, W.D.S.d.S., J.X.d.S., P.C.M.d.R.R., L.P.R., H.C.V. and S.N.; Methodology, W.D.S.d.S., J.X.d.S., T.L.N.A., D.V.S. and A.P.S.F.; Formal analysis, W.D.S.d.S., J.X.d.S., T.L.N.A. and D.V.S.; Data curation, D.V.S., H.C.V. and S.N.; Writing—original draft preparation, W.D.S.d.S., J.X.d.S., T.L.N.A., D.V.S. and A.P.S.F.; Writing—review and editing, W.D.S.d.S., J.X.d.S., D.V.S., P.C.M.d.R.R., L.P.R., H.C.V., G.I.B.d.M. and S.N.; Visualization, W.D.S.d.S., T.L.N.A. and A.P.S.F.; Supervision, P.C.M.d.R.R., L.P.R., H.C.V., G.I.B.d.M. and S.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Council for Scientific and Technological Development, grant number 140131/2024-8, and the Coordination for the Improvement of Higher Education Personnel, grant number 001.

Data Availability Statement

Data are available on request from the corresponding author.

Acknowledgments

We would like to thank the Federal University of Paraná for providing the research facilities. We also thank the Mamirauá Institute for Sustainable Development for providing the wood samples.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Abreu, J.C.; Guedes, M.C.; Guedes, A.C.L.; Batista, E.M. Estrutura e Distribuição Espacial de Andirobeiras (Carapa spp.) Em Floresta de Várzea Do Estuário Amazônico. Ciência Florest. 2014, 24, 1007–1016. [Google Scholar] [CrossRef]
Semenishchenkov, Y.A.; Lobanov, G.V. Geoecological Conditions of Habitats of Floodplain Oak Forests in River Valleys of the Upper Dnieper Basin. Vestn. St. Petersburg Univ. Earth Sci. 2019, 64, 328–362. [Google Scholar] [CrossRef]
Richardson, S.B.; Simeone, J.C.; Deklerck, V. The global wood species priority list: A living database of tree species most at risk for illegal logging, unsustainable deforestation, and high rates of trade globally. Wood Fiber Sci. 2023, 55, 31–42. [Google Scholar] [CrossRef]
Reis, A.R.S. Anatomia Da Madeira de Quatro Espécies de Aspidosperma Mart. & Zucc. Comercializadas No Estado Do Pará, Brasil. Rev. Ciência Madeira—RCM 2015, 6, 47–62. [Google Scholar] [CrossRef]
Silva da Silva, W.; Santos, J.; Souza, D.; Naide Acosta, T.; Ferreira, A.; Reis, P.; Reis, L.; Vieira, H.; Muñiz, G.; Nisgoski, S. Applying Colorimetry and Visible Spectroscopy to Discriminate Wood Species in a Forest Management Area in the Amazon Foodplain. Holzforschung 2025, 79, 251–261. [Google Scholar] [CrossRef]
Silva, W.D.S.d.; Santos, A.d.S.; Ferreira, A.C.S.; Reis, P.C.M.d.R.; Reis, L.P.; Ferreira, A.P.S.; Naide Acosta, T.L.; Santos, J.X.d.; Ferreira, M.d.L.; Gris, D.; et al. Anatomical Characterization of Wood from Three Tree Species from a Floodplain Forest, Central Amazon, Brazil. Bosque 2024, 45, 315–324. [Google Scholar] [CrossRef]
Li, C.; Wang, Y. Optimizing Recognition Models for Wood Species Identification Using Multi-Spectral Techniques. Holzforschung 2025, 79, 177–187. [Google Scholar] [CrossRef]
Santos, J.X.; Vieira, H.C.; Souza, D.V.; de Menezes, M.C.; de Muñiz, G.I.B.; Soffiatti, P.; Nisgoski, S. Discrimination of “Louros” Wood from the Brazilian Amazon by near-Infrared Spectroscopy and Machine Learning Techniques. Eur. J. Wood Wood Prod. 2021, 79, 989–998. [Google Scholar] [CrossRef]
MMA-Ministério do Meio Ambiente Industrialização, Comércio e Transporte de Produtos Florestais. Available online: https://dd.serpro.gov.br/publico/sense/app/36ac6782-2c5b-4365-b11f-a5e713fbacad/sheet/bd361063-6a3b-4112-9cc2-0ea9d4a38963/state/analysis (accessed on 31 March 2025).
Ramalho, E.E.; Macedo, J.; Vieira, T.M.; Valsecchi, J.; Calvimontes, J.; Marmontel, M.; Queiroz, H.L. Ciclo Hidrológico Nos Ambientes de Várzea Da Reserva de Desenvolvimento Sustentável Mamirauá Médio Rio Solimões, Período de 1990 a 2008. Uakari 2009, 5, 61–87. [Google Scholar]
Martins-da-Silva, R.C.V. Coleta e Identificação de Espécimes Botânicos, 143rd ed.; Embrapa Amazônia Oriental: Belém, Brazil, 2002. [Google Scholar]
Gherardi Hein, P.R.; Lima, J.T.; Chaix, G. Effects of Sample Preparation on NIR Spectroscopic Estimation of Chemical Properties of Eucalyptus urophylla S.T. Blake Wood. Holzforschung 2010, 64, 45–54. [Google Scholar] [CrossRef]
R Core Team R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Available online: https://www.R-project.org/ (accessed on 15 April 2024).
Lê, S.; Josse, J.; Husson, F. FactoMineR R: An R Package for Multivariate Analysis. J. Stat. Softw. 2008, 25, 1–18. [Google Scholar] [CrossRef]
Kassambara, A.; Mundt, F. Factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R Package Version 1.0.5. Available online: https://CRAN.R-project.org/package=factoextra (accessed on 15 April 2025).
Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer: New York, NY, USA, 2013. [Google Scholar]
Ahmad, I.; Basheri, M.; Iqbal, M.J.; Rahim, A. Performance Comparison of Support Vector Machine, Random Forest, and Extreme Learning Machine for Intrusion Detection. IEEE Access 2018, 6, 33789–33795. [Google Scholar] [CrossRef]
Karatzoglou, A.; Smola, A.; Hornik, K.; Zeileis, A. Kernlab—An S4 Package for Kernel Methods in R. J. Stat. Softw. 2004, 11, 1–20. [Google Scholar] [CrossRef]
Wold, S.; Sjöström, M.; Eriksson, L. PLS-Regression: A Basic Tool of Chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130. [Google Scholar] [CrossRef]
Santana, F.; Souza, A.; Almeida, M.; Breitkreitz, M.; Filgueiras, P.; Sena, M.; Poppi, R. Experimento didático de quimiometria para classificação de óleos vegetais comestíveis por espectroscopia no infravermelho médio combinado com análise discriminante por mínimos quadrados parciais: Um tutorial, parte v. Quim. Nova 2020, 43, 371–381. [Google Scholar] [CrossRef]
Mevik, B.H.; Wehrens, R.; Liland, K.H. Pls: Partial Least Squares and Principal Component Regression, R Package Version 2.7-3; RStudio: Boston, MA, USA, 2024.
Kuhn, M.; Wing, J.; Weston, S.; Williams, A.; Keefer, C.; Engelhardt, A.; Cooper, T.; Mayer, Z.; Hunt, T.; Candan, C.; et al. Caret: Classification and Regression Training. Available online: https://CRAN.R-project.org/package=caret (accessed on 15 March 2024).
Kuhn, M. Building Predictive Models in R Using the Caret Package. J. Stat. Softw. 2008, 28, 1–26.
Song, Y.; Liang, J.; Lu, J.; Zhao, X. An Efficient Instance Selection Algorithm for k Nearest Neighbor Regression. Neurocomputing 2017, 251, 26–34.
Hechenbichler, K.; Schliep, K. Weighted K-Nearest-Neighbor Techniques and Ordinal Classification Sonderforschungsbereich. Available online: https://epub.ub.uni-muenchen.de/ (accessed on 4 March 2025).
Wu, X.; Kumar, V.; Ross Quinlan, J.; Ghosh, J.; Yang, Q.; Motoda, H.; McLachlan, G.J.; Ng, A.; Liu, B.; Yu, P.S.; et al. Top 10 Algorithms in Data Mining. Knowl. Inf. Syst. 2008, 14, 1–37.
McRoberts, R.E.; Næsset, E.; Gobakken, T. Optimizing the K-Nearest Neighbors Technique for Estimating Forest Aboveground Biomass Using Airborne Laser Scanning Data. Remote Sens. Environ. 2015, 163, 13–22.
Li, C.; Qiu, Z.; Liu, C. An Improved Weighted K-Nearest Neighbor Algorithm for Indoor Positioning. Wirel. Pers. Commun. 2017, 96, 2239–2251.
Gegzna, V. ManyROC: Tools for ROC Analysis. Available online: https://gegznav.github.io/manyROC (accessed on 4 March 2023).
Schwanninger, M.; Rodrigues, J.C.; Fackler, K. A Review of Band Assignments in near Infrared Spectra of Wood and Wood Components. J. Near Infrared Spectrosc. 2011, 19, 287–308.
Soares, L.F.; da Silva, D.C.; Bergo, M.C.J.; Coradin, V.T.R.; Braga, J.W.B.; Pastore, T.C.M. Avaliação de Espectrômetro NIR Portátil e PLS-DA Para a Discriminação de Seis Espécies Similares de Madeira Amazônicas. Quim. Nova 2017, 40, 418–426.
Eugenio Da Silva, C.; Nascimento, C.S.; Freitas, J.A.; Araújo, R.D.; Durgante, F.M.; Zartman, C.E.; Nascimento, C.C.; Higuchi, N. Alternative Identification of Wood from Natural Fallen Trees of the Lecythidaceae Family in the Central Amazonian Using FT-NIR Spectroscopy. Int. For. Rev. 2024, 26, 29–44.
Gomes, J.N.N.; Medeiros, D.T.; Viana, L.C.; Hein, P.R.G. Influence of Moisture on the Identification of Tropical Wood Species by NIR Spectroscopy. Holzforschung 2025, 79, 188–201.
Xie, Y.; Wang, T.; Kim, J.; Lee, K.; Jeong, M.K. Least Angle Sparse Principal Component Analysis for Ultrahigh Dimensional Data. Ann. Oper. Res. 2024.
Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009; ISBN 978-0-387-84857-0.
Domingos, P. A Few Useful Things to Know about Machine Learning. Commun. ACM 2012, 55, 78–87.
Géron, A. Mãos à Obra Aprendizado de Máquinas Com Scikit-Learn & TensorFlow: Conceitos, Ferramentas e Técnicas Para a Construção de Sistemas Inteligentes; Alta Books: Rio de Janeiro, Brazil, 2019.
Pastore, T.C.M.; Braga, J.W.B.; Coradin, V.T.R.; Magalhães, W.L.E.; Okino, E.Y.A.; Camargos, J.A.A.; de Muñiz, G.I.B.; Bressan, O.A.; Davrieux, F. Near Infrared Spectroscopy (NIRS) as a Potential Tool for Monitoring Trade of Similar Woods: Discrimination of True Mahogany, Cedar, Andiroba, and Curupixá. Holzforschung 2011, 65, 73–80.
Bergo, M.C.J.; Pastore, T.C.M.; Coradin, V.T.R.; Wiedenhoeft, A.C.; Braga, J.W.B. NIRS Identification of Swietenia macrophylla Is Robust across Specimens from 27 Countries. IAWA J. 2016, 37, 420–430.
Snel, F.A.; Braga, J.W.B.; da Silva, D.; Wiedenhoeft, A.C.; Costa, A.; Soares, R.; Coradin, V.T.R.; Pastore, T.C.M. Potential Field-Deployable NIRS Identification of Seven Dalbergia Species Listed by CITES. Wood Sci. Technol. 2018, 52, 1411–1427.

Figure 1. Schematic representation of sampling and analysis.

Figure 2. Mean NIR wood spectra from the evaluated species.

Figure 3. Score plot (A) and loading graph for PC-1 and PC-2 (B) in the principal component analysis, with the mean NIR spectra of the four species being evaluated.

Figure 4. Recognition rate, using the most accurate PLS-DA classifiers, by species in repeated cross-validation. ncomp = number of principal components.

Table 1. Hyperparameter variants evaluated for each algorithm and packages applied.

Algorithm	Hyperparameter Variants	Method/Package	Reference
SVM	C = 2^c(−2, 0, 2, 4, 6, 7, 8, 9, 10, 11, 12) Sigma = c(0.005, 0.01, 0.02, 0.03, 0.05)	svmRadial/kernlab	Karatzoglou et al. [18]
PLS-DA	ncomp = seq(1:30)	pls/pls	Mevik et al. [21]
k-NN	k = seq(2, 25, 1)	knn/caret	Kuhn et al. [22]

SVM = support vector machine; PLS-DA = partial least squares–discriminant analysis; k-NN = k-nearest neighbors; C = cost of constraint violation; sigma = radial basis kernel parameter; k = number of nearest neighbors; ncomp = number of principal components.
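
The grids in Table 1 are written in R notation. As a rough illustration only, the following Python sketch expands them to show how many candidate settings each algorithm implies. The variable names are hypothetical, and the study tuned its models in R (caret, kernlab, pls), not with this code.

```python
from itertools import product

# Grids transcribed from Table 1 (R notation translated to Python).
# Variable names are illustrative, not taken from the study's scripts.
svm_cost_exponents = [-2, 0, 2, 4, 6, 7, 8, 9, 10, 11, 12]
svm_cost = [2.0 ** e for e in svm_cost_exponents]   # C = 2^c(...)
svm_sigma = [0.005, 0.01, 0.02, 0.03, 0.05]
plsda_ncomp = list(range(1, 31))                    # ncomp = 1..30
knn_k = list(range(2, 26))                          # k = seq(2, 25, 1)

# SVM is tuned over every (C, sigma) pair; the other two algorithms
# each have a single hyperparameter.
svm_grid = list(product(svm_cost, svm_sigma))

grid_sizes = {
    "SVM": len(svm_grid),
    "PLS-DA": len(plsda_ncomp),
    "k-NN": len(knn_k),
}

if __name__ == "__main__":
    for name, size in grid_sizes.items():
        print(f"{name}: {size} candidate settings")
```

Read this way, the SVM search covers 55 (C, sigma) pairs, PLS-DA 30 values of ncomp, and k-NN 24 values of k. The optima reported in Table 2 (C = 64 = 2^6 with sigma = 0.005, ncomp = 29, k = 5) all fall inside these grids.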

Table 2. Metric performance of classifiers in repeated cross-validation.

Parameter	2 × 25 Cross-Validation Partitions
	PLS-DA			SVM				k-NN
	HT	Accuracy (%)	F1-Score (%)	HT	Accuracy (%)	F1-Score (%)	Recall (%)	HT	Accuracy (%)	F1-Score (%)
Mean	ncomp = 29	98.46	98.46	sigma = 0.005 C = 64	62.77	62.74	62.77	k = 5	57.14	56.79
SD		0.55	0.55		2.79	2.55	2.79		2.75	2.66
Minimum		97.19	97.19		57.50	57.71	57.50		50.31	49.91
Maximum		99.38	99.38		69.69	68.86	69.69		63.44	62.39

SD = standard deviation; SVM = support vector machine; PLS-DA = partial least squares–discriminant analysis; k-NN = k-nearest neighbors; HT = optimal tuning hyperparameter; C = cost of constraint violation; sigma = radial basis kernel parameter; k = number of nearest neighbors; ncomp = number of principal components.
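
In Table 2 the SVM recall matches its accuracy in every row. One possible explanation, offered here as an assumption and not something the table states, is that recall was macro-averaged over species with equal numbers of spectra per species, in which case the two metrics coincide. The sketch below computes accuracy, macro recall, and macro F1 from a confusion matrix. The matrix values are invented for illustration and are not results from the study.

```python
def classification_metrics(confusion):
    """Return (accuracy, macro_recall, macro_f1) for a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    recalls, f1_scores = [], []
    for k in range(n_classes):
        true_pos = confusion[k][k]
        actual = sum(confusion[k])                                   # row sum
        predicted = sum(confusion[i][k] for i in range(n_classes))   # column sum
        recall = true_pos / actual if actual else 0.0
        precision = true_pos / predicted if predicted else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
        recalls.append(recall)

    return (correct / total,
            sum(recalls) / n_classes,
            sum(f1_scores) / n_classes)


# Made-up 4-class example with 10 samples per true class (balanced rows).
example = [
    [8, 1, 1, 0],
    [3, 5, 1, 1],
    [1, 2, 6, 1],
    [1, 1, 2, 6],
]
accuracy, macro_recall, macro_f1 = classification_metrics(example)
```

With equal class sizes, accuracy and macro recall come out identical, while macro F1 can differ slightly because precision depends on how predictions are distributed across classes. This matches the pattern in the SVM columns, although the table alone does not confirm how the authors averaged the metrics.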

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
