Intelligent Selection of Spectral Bands from High-Precision Spectroradiometer Measurements for Optimizing Cocoa Bean Classification

Ayikpa, Kacoutchy Jean; Mamadou, Diarra; Ballo, Abou Bakary; Gouton, Pierre

doi:10.3390/electronics14101976


¹ Laboratoire Imagerie et Vision Artificielle (ImViA), Université Bourgogne Europe, 21000 Dijon, France

² Unité de Recherche et d’Expertise Numérique (UREN), Université Virtuelle de Côte d’Ivoire, Abidjan 28 BP 536, Côte d’Ivoire

³ Laboratoire de Mécanique et Information (LaMI), Université Felix Houphouët-Boigny, Abidjan 22 BP 801, Côte d’Ivoire

* Author to whom correspondence should be addressed.

Electronics 2025, 14(10), 1976; https://doi.org/10.3390/electronics14101976

Submission received: 31 March 2025 / Revised: 8 May 2025 / Accepted: 10 May 2025 / Published: 12 May 2025

(This article belongs to the Special Issue Artificial Intelligence in Graphics and Images)


Abstract

Evaluating the spectral properties of cocoa beans according to their fermentation state (fermented, poor condition, or unfermented) is essential for ensuring their quality in the cocoa industry. This study examined the spectral response of beans in the range of 380 nm to 780 nm using a Konica-Minolta CS-2000 spectroradiometer in Dijon, France, a device designed to measure the spectrum of objects and sources in the visible range. Several spectral band selection methods were applied to identify the most discriminating wavelengths for classification: ANOVA, F-score, Lasso, Linear Discriminant Analysis (LDA), Mutual Information, and Partial Least Squares (PLS). A band selector voting process was implemented to retain the wavelengths identified in common by the different methods. The selected spectral bands were then used to train classification models, including Random Forest, SVM, and XGBoost. The results show that a restricted subset of wavelengths allows for effective class separation, thereby improving model performance. Among the approaches tested, ANOVA and F-score combined with Random Forest achieved an accuracy of 92.59%, while F-score, Mutual Information, and the voting selection, each coupled with SVM, reached an accuracy of 96.30%. These feature selection methods effectively reduced dimensionality while maintaining high classification accuracy. These results open up promising prospects for automating the quality control of cocoa beans, thus contributing to the optimization of industrial processes.

Keywords:

spectral analysis; selection of spectral bands; machine learning; classification of cocoa beans; spectral measurement

1. Introduction

Côte d’Ivoire is the world’s leading cocoa producer, accounting for 40% of global production. This sector plays a key role in the national economy, contributing approximately 10% of GDP. Improving the quality of cocoa beans is a significant challenge that involves several sectors of activity and attracts growing research interest. Numerous studies focus on diseases, maturity, drying, and processing to optimize cocoa quality and properties [1]. A determining factor in cocoa quality is the fermentation state, which falls into three categories: fermented, bad condition, and non-fermented. Fermentation directly influences the taste and properties of the beans and their use in the food industry. In recent decades, computer vision has emerged as a powerful tool for analyzing cocoa quality, with a gradual transition from color imaging to multispectral imaging, which offers more accurate results [2]. In this study, we exploited the spectral responses of cocoa beans obtained using the Konica-Minolta CS-2000 optical spectroradiometer operated at the ImViA Laboratory of the University of Burgundy. This instrument captures the spectral reflectance of beans in the visible spectrum (380–780 nm), providing a rich database for analysis [3]. We applied several spectral band selection methods, including ANOVA, F-score, LASSO, Linear Discriminant Analysis (LDA), Mutual Information, and Partial Least Squares (PLS), to identify the most discriminating wavelengths for classifying cocoa beans. Each method identified specific bands based on statistical and informational criteria.

The spectral band selection problem has been addressed to reduce the dimensionality of the spectral data while preserving the most informative features necessary for an accurate classification of the cocoa beans. In this study, the goal of band selection was to eliminate redundant or irrelevant bands that could introduce noise and degrade model performance and highlight the wavelengths that carry the most discriminative information for distinguishing between the three classes of beans.

To achieve this, we implemented a voting strategy to enhance the reliability of the selections and avoid biased choices based on a single method. This approach involves identifying the wavelengths most frequently selected using different selection techniques, thus ensuring a more robust and representative selection of key spectral features.

The selected spectral bands were then used to train three supervised classification models: Random Forest, SVM, and XGBoost. This methodology allowed us to systematically assess the impact of different band selections on model precision and to identify the optimal combinations of bands that maximize the performance and robustness of the cocoa bean classification task.

Our study made several significant contributions to the spectral evaluation of cocoa beans based on their fermentation state:

We created a database of spectral reflectance measurements covering the range from 380 nm to 780 nm, corresponding to the visible spectrum. This database groups beans into three categories: fermented, poor condition, and unfermented. Measurements were performed using a Konica-Minolta CS-2000 spectroradiometer, ensuring that the samples’ spectral responses were accurately acquired.
We applied several feature selection methods to identify the most relevant wavelengths for classification. These methods include ANOVA, F-score, LASSO, Linear Discriminant Analysis (LDA), Mutual Information, and Partial Least Squares (PLS). To improve the robustness of the selection, we implemented a voting strategy to retain recurring spectral bands across the different methods. This approach ensures better representativeness of discriminating wavelengths and limits the impact of variations due to individual methods.
The selected spectral bands were used to train and evaluate three supervised classification algorithms: Random Forest, SVM, and XGBoost. The objective was to analyze their ability to discriminate between the bean categories using only the most informative wavelengths.
We compared the performance of the different spectral band selection methods and classification models by examining several performance metrics, including accuracy, F1 score, and Matthews correlation coefficient (MCC). This analysis highlighted the influence of each technique on improving classification and identified the most effective combinations for optimizing bean recognition based on their fermentation state, a crucial factor in determining the quality of cocoa beans.

Our results demonstrate the value of machine learning combined with rigorous wavelength selection for cocoa bean quality control. This research opens up exciting possibilities for industrial applications, paving the way for automating bean sorting based on the fermentation state and enhancing the efficiency of the agricultural industry.

The manuscript is structured as follows: Section 2 reviews the state of the art in the spectral analysis of cocoa beans and the methods for spectral band selection. Section 3 describes the materials used and the methodologies adopted in our approach. Section 4 details the experiments conducted and the results obtained. Section 5 provides a discussion of these findings. Finally, Section 6 concludes the study.

2. Related Work

Hyperspectral imaging research has significantly developed over the past two decades, with increasing adoption in the agricultural sector. Its gradual integration into this sector is driven by methodological and technological advances, particularly in spectral band selection and machine learning. Numerous studies have highlighted its potential impact on optimizing agricultural processes, reinforcing its importance for society, and its future deployment.

Thenkabail et al. highlighted the informative richness of hyperspectral data for crop characterization. They distinguish two main approaches: whole spectral analysis (WSA), which exploits the entire spectrum to capture the most subtle variations, and the selection of optimal hyperspectral narrowbands (OHNBs), which aims to reduce redundancy and simplify analysis. Their study identified 28 bands that are particularly relevant for modeling plants’ biophysical and biochemical properties and constructing effective vegetation indices. However, they emphasize that this selection must be adapted to each specific application [4].

Wieme et al. explored hyperspectral imaging as a method for non-destructively assessing the quality of fruits, vegetables, and mushrooms [5]. Their study analyzed laboratory-measurable quality parameters and assessed complex properties, including maturity, ripening, biotic defect detection, physiological disorders, mechanical damage, and sensory quality.

Furthermore, L. Mou et al. addressed the problem of unsupervised band selection from a Markov decision process perspective, proposing an approach based on deep reinforcement learning [6]. Once the agent is trained, it learns a band selection policy that enables it to select the most relevant wavelengths sequentially, thereby fully exploiting the information in the hyperspectral image and previous selections.

In a study by Peña et al., satellite image time series (SITS) combined with normalized difference indices (NDIs) improved the classification of six types of fruit tree crops in the Aconcagua Valley, Chile [7]. Applying penalized Linear Discriminant Analysis (LDA) techniques to address the high dimensionality of the data resulted in a 23% increase in classification accuracy. The most significant spectral bands were identified as those corresponding to the greening stage of the crops.

The food industry can also benefit from non-destructive, real-time classification methods, particularly for differentiating specialty and traditional green coffee beans. To this end, Gomes et al. applied techniques based on multispectral imaging and machine learning [8]. They could distinguish these two categories using reflectance and autofluorescence data processed using the SVM, RF, XGBoost, and CatBoost algorithms. The SVM achieved the best performance with an accuracy of 0.96. Furthermore, PCA and SVM analysis highlighted the importance of autofluorescence at 405/500 nm and specific fluorophores in the classification process.

Sim J. et al. explored near-infrared hyperspectral imaging (HSI-NIR) combined with machine learning to combat fraud in the geographic origin of coffee [9]. Their study classified coffee origins at different scales (continental, national, and regional) by analyzing green coffee beans from three continents, eight countries, and 22 regions. Four machine learning models (PLS-DA, SVM, RBF-SVM, and Random Forest) were compared, with linear SVM achieving nearly perfect results. This study thus demonstrates the feasibility of using HSI-NIR combined with machine learning for non-destructive screening of coffee origin.

Lin, M. Y. et al. developed a deep learning model for a multispectral vision system integrating a CMOS sensor and a spectrometer [10]. This model features an automatic and adaptive color temperature correction engine, ensuring accurate color recognition even in varying lighting and chromatic conditions. Trained on a nine-dimensional dataset, this system demonstrated remarkable efficiency in classifying adulterated coffee beans, achieving an accuracy of 99.3%.

Furthermore, Castro et al. used hyperspectral imaging (Vis-NIR and NIR) to analyze mango samples from eleven distinct varieties [11]. After data preprocessing, three machine learning models were trained: Linear Discriminant Analysis (LDA), k-nearest neighbors (k-NN), and artificial neural networks (ANNs). Model evaluation using precision, accuracy, and F-score metrics revealed subtle spectral differences between the varieties. The selection of 17 (UV-Vis-NIR) and 21 (NIR) optimal spectral bands improved model performance, achieving an accuracy ranging from 0.752 to 0.972. These results confirm the potential of hyperspectral imaging combined with machine learning for discriminating between mango varieties.

Hyperspectral imaging has emerged as a robust method for assessing the quality of agricultural products, particularly tubers such as potatoes and sweet potatoes. This technique analyzes the interactions between light and matter, enabling a precise and non-destructive characterization of tuber properties. Capturing detailed spectral information facilitates not only the classification and authentication of these products but also the prediction of their physical properties, such as texture and water retention, as well as their chemical composition, including protein and starch content, while detecting potential defects [12].

Integrating machine learning with hyperspectral imaging significantly enhances the accuracy and reliability of analyses. Machine learning models can effectively process complex spectral data and identify subtle patterns that traditional methods might overlook. Furthermore, feature selection helps optimize these models by reducing data dimensionality, improving efficiency, and minimizing the risk of overfitting.

This study highlights the effectiveness of combining hyperspectral imaging with artificial intelligence for assessing tuber quality. It demonstrates this synergy through various real-world applications and discusses key challenges, such as variability in image acquisition conditions and the need for standardized protocols. Ultimately, it examines prospects, including advancements in sensor technology and developing more robust predictive models, which will pave the way for automated and more accessible evaluation systems.

Hyperspectral imaging, combined with machine learning, is emerging as a crucial method for the non-destructive analysis of agricultural products. Various studies have demonstrated its effectiveness in classifying, authenticating, and assessing the quality of coffee beans, fruits, and tubers. Recent advances in deep learning and optimized spectral band selection have significantly improved the accuracy of predictive models. However, challenges remain, particularly related to the variability of acquisition conditions and the need for standardized methodologies.

3. Materials and Methods

The experiments were conducted in Python on a DELL laptop computer at the ImViA laboratory in Dijon, France. This computer features an Intel Core i7-10700 processor operating at 2.90 GHz, 32 GB of RAM, and an NVIDIA Quadro P400 GPU. The models were implemented using Python 3.8.8, with the Keras 2.4.3 API and TensorFlow 2.3 as the backend.

3.1. Implementation of the Dataset and Spectral Measurement Process

To construct the dataset, we measured the spectral characteristics of cocoa beans using the high-precision CS2000 spectroradiometer. This instrument accurately captures the spectral characteristics of materials by measuring the beans’ reflectance across the entire visible spectrum. Figure 1 represents the spectroradiometer.

The CS2000 spectroradiometer was used in a multi-step process:

Step 1: A controlled light illuminates the cocoa bean to be analyzed. This light comes from a standardized white light source, ensuring uniform and reproducible illumination in a controlled environment;
Step 2: When the light strikes the surface of the bean, several optical phenomena occur:
○ The bean absorbs some of the light;
○ Another part is reflected, carrying information about the bean’s composition and condition.
Step 3: The CS2000 captures the light the bean reflects and decomposes it into different wavelengths using an optical system equipped with a diffraction grating. This decomposition provides a detailed Spectral Reflectance of each bean;
Step 4: The device records the intensity of the reflected light for each wavelength of the visible spectrum (from 380 nm to 780 nm). The measurements are converted into standardized digital values, facilitating accurate comparison of beans according to their condition.

For this study, 90 cocoa beans were selected, with 30 samples for each category. This balanced distribution aims to ensure fair representation of each class during the analysis and training of classification models. Bean selection was manually performed from homogeneous batches based on visual and sensory criteria such as odor, color, and texture. This assessment was validated by local farmers operating in the study area in Yakassé, a village in the commune of Grand-Bassam, Côte d’Ivoire.

Each selected bean underwent individual spectral acquisition covering wavelengths from 380 nm to 780 nm, corresponding to the visible spectrum. These measurements were designed to accurately capture each sample’s spectral reflectance.

Acquisitions were conducted under rigorously controlled conditions, particularly ambient lighting, angle of incidence, and humidity, to ensure data reliability. These measurements were carried out in the ImViA laboratory at the University of Burgundy Europe (formerly the University of Burgundy), France.

These data provide a reliable basis for comparing spectral reflectance between fermented, poor-condition, and unfermented beans, ensuring the reproducibility of the experiments and the validity of the results obtained.

Figure 2 shows a sample of the reflected light intensity for each wavelength of the visible spectrum, ranging from 380 nm to 780 nm.

The resulting data are organized in a table format. Each row corresponds to a single bean, and each column represents the spectral intensity measured at a specific wavelength.

The beans are classified into three categories according to their condition:

0: Fermented;
1: Bad condition;
2: Non-fermented.

Each recorded value indicates the bean’s reflectance at a given wavelength, expressed in units normalized by the spectroradiometer. These data are essential for identifying spectral differences between beans and assessing their quality based on fermentation.

Figure 3 illustrates the Spectral Reflectance of cocoa beans obtained from measurements made using the CS2000 spectroradiometer.

Figure 4 shows the average Spectral Reflectance for each class.
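As a rough illustration of the table layout described above and of the per-class averaging behind Figure 4, the following sketch builds a placeholder dataset and computes one mean spectrum per class. The 1 nm sampling step, the variable names, and the synthetic values are assumptions made for the example; they are not the measured data.

```python
import numpy as np

# Assumed layout: 1 nm sampling from 380 to 780 nm (the sampling step is not
# stated in the text, so 401 columns is an assumption).
wavelengths = np.arange(380, 781)
labels = {0: "Fermented", 1: "Bad condition", 2: "Non-fermented"}

# Synthetic placeholder spectra, NOT measured data: 30 beans per class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
base = np.linspace(0.05, 0.30, wavelengths.size)
X = base + 0.02 * y[:, None] + rng.normal(0.0, 0.005, size=(y.size, wavelengths.size))

# One row per bean, one column per wavelength, plus a separate label vector.
# The class means below give one average spectrum per class (codes 0, 1, 2).
class_means = np.stack([X[y == k].mean(axis=0) for k in sorted(labels)])
print(X.shape, class_means.shape)
```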

3.2. Methods for Spectral Band Selection

Spectral band selection methods are essential for optimizing the performance of classification algorithms by reducing redundancy and enhancing the relevance of extracted information. Various approaches are used to identify the most significant bands, including Mutual Information, Voting, ANOVA, F-score, Linear Discriminant Analysis (LDA), Partial Least Squares (PLS), and LASSO. Each method has specific characteristics that influence the accuracy and robustness of classification models.

3.2.1. Analysis of Variance (ANOVA)

ANOVA is a powerful statistical technique used to determine whether significant differences exist between the means of three or more groups. Imagine testing the effectiveness of three different teaching methods on student performance. ANOVA would help determine whether students’ average scores differ significantly among these three groups.

In its functioning, ANOVA focuses on variance, which refers to the dispersion of data around the mean. It compares two types of variances: between-group variance and within-group variance. The between-group variance measures how different the group means are from each other. If the group means vary greatly, this variance will be high. On the other hand, within-group variance measures the variability of individual observations within each group. Even if the group means are similar, there will always be some degree of individual variability [13,14].

ANOVA is a valuable tool for comparing the means of multiple groups. Analyzing variance helps determine whether the observed differences are likely due to chance or if they reflect actual differences between the groups. The equation associated with this model is the following:

F_{j} = \frac{\sum_{k = 1}^{K} n_{k} {(μ_{k, j} - μ_{j})}^{2}}{\sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {(x_{i, j}^{(k)} - μ_{k, j})}^{2}}

(1)

$K$: total number of classes
$n_{k}$: number of samples in class $k$
$μ_{k, j}$: average of the values of band $j$ in class $k$
$μ_{j}$: overall average for the band $j$
$x_{i, j}^{(k)}$: value of sample $i$ for the band $j$ in class $k$
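A minimal sketch of how Equation (1) could be evaluated band by band is given below. The function name `anova_band_scores` and the synthetic data are illustrative assumptions; the study's actual implementation is not described in the text.

```python
import numpy as np

def anova_band_scores(X, y):
    """Score each band with the ratio in Equation (1).

    X: (n_samples, n_bands) reflectance matrix; y: (n_samples,) class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)                 # mu_j
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        class_mean = Xk.mean(axis=0)              # mu_{k,j}
        between += Xk.shape[0] * (class_mean - overall_mean) ** 2
        within += ((Xk - class_mean) ** 2).sum(axis=0)
    return between / within

# Synthetic stand-in for the spectra (assumption): 3 classes x 30 beans,
# 5 bands, with only band index 2 shifted between classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(0.5, 0.05, size=(90, 5))
X[:, 2] += 0.1 * y

scores = anova_band_scores(X, y)
best_band = int(np.argmax(scores))
print(best_band)
```

Bands with a large ratio have class means that differ strongly relative to the spread inside each class, which is the property the selection relies on.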

3.2.2. Mutual Information

Mutual Information is a measure of statistical dependence between two random variables. It quantifies the amount of information gained about one variable by observing the other. In other words, it measures the reduction in uncertainty about one variable due to knowledge of the other. The entropy of a random variable reflects its uncertainty or degree of disorder. The higher the entropy, the more unpredictable the variable is. The joint entropy of two random variables indicates the total uncertainty associated with their simultaneous observation. The conditional entropy of a random variable, given another, measures the remaining uncertainty about the first variable after observing the second. In feature selection, Mutual Information is used to evaluate the relationship between the target variable (the variable to be predicted) and each feature (explanatory variable), allowing for the measurement of their mutual dependence. Features with high Mutual Information with the target variable are considered the most informative and relevant and are thus selected for inclusion in the predictive model [15,16]. Mutual Information is a powerful and flexible measure of dependence between random variables, with applications in fields such as information theory, machine learning, and signal processing. The equation associated with this model is the following:

MI (X_{j}; Y) = \sum_{x_{j} \in X_{j}} \sum_{y \in Y} p (x_{j}, y) \log \left( \frac{p (x_{j}, y)}{p (x_{j}) \cdot p (y)} \right)

(2)

$X_{j}$: spectral band $j$
$Y$: class label
$p (x_{j}, y)$: joint probability
$p (x_{j}), p (y)$: marginal probabilities
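Equation (2) sums over discrete values, whereas reflectance is continuous, so some form of discretization or density estimation is needed in practice. The sketch below uses simple equal-width binning as one possible choice. The number of bins, the function name `mutual_information`, and the synthetic data are assumptions, and the estimator used in the study may differ.

```python
import numpy as np

def mutual_information(x, y, n_bins=8):
    """Plug-in estimate of MI(X_j; Y) in nats from Equation (2) after binning x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    edges = np.histogram_bin_edges(x, bins=n_bins)
    x_bin = np.digitize(x, edges[1:-1])           # bin indices 0 .. n_bins - 1
    classes, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((n_bins, classes.size))
    np.add.at(joint, (x_bin, y_idx), 1.0)         # joint counts
    joint /= joint.sum()                          # p(x_j, y)
    px = joint.sum(axis=1, keepdims=True)         # p(x_j)
    py = joint.sum(axis=0, keepdims=True)         # p(y)
    nonzero = joint > 0                           # 0 * log 0 is taken as 0
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float((joint[nonzero] * np.log(ratio)).sum())

# Synthetic stand-in (assumption): only band index 2 depends on the class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(0.5, 0.05, size=(90, 5))
X[:, 2] += 0.1 * y

mi = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
print(int(np.argmax(mi)))
```

With small samples, the plug-in estimate is biased upward, so the values are better used for ranking bands than as absolute quantities.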

3.2.3. Linear Discriminant Analysis (LDA)

LDA is a statistical technique used for classification and dimensionality reduction. It aims to find a linear combination of features that best separates the classes in a dataset. The primary objectives of this method are to assign an observation to the class to which it is most likely to belong and to project the data into a lower-dimensional space while preserving as much discriminative information as possible between the classes. In its operation, LDA seeks to maximize the distance between class means while minimizing the variance within each class. It assumes that the data for each class follows a normal distribution and that the covariance matrices of the classes are equal. LDA calculates linear discriminant functions that enable projecting the data into a new space where the classes are more clearly separated [17,18]. The equation associated with this model is the following:

δ_{k} (λ) = μ_{k, λ} \cdot \frac{x_{λ}}{σ_{λ}^{2}} - \frac{μ_{k, λ}^{2}}{2 σ_{λ}^{2}} + \ln P (ω_{k})

(3)

$λ$: spectral band
$δ_{k} (λ)$: discriminant function
$x_{λ}$: spectral value at the band $λ$
$μ_{k, λ}$: average of class $k$ for the band $λ$
$σ_{λ}^{2}$: intra-class variance (common to all classes for $λ$)
$P (ω_{k})$: prior probability of the class $k$
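The discriminant function in Equation (3) can be evaluated independently for each band, and a sample is then assigned to the class with the largest value. The sketch below does this and uses the resulting single-band accuracy as a band score. Using accuracy as the ranking criterion, the pooled variance estimator, and the function name `band_discriminants` are assumptions made for illustration; the text does not specify how the LDA output is turned into a band ranking.

```python
import numpy as np

def band_discriminants(X, y):
    """Evaluate Equation (3) for every class, sample and band.

    Returns the sorted class labels and an array delta[k, i, band].
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n = X.shape[0]
    means = np.stack([X[y == k].mean(axis=0) for k in classes])    # mu_{k,lambda}
    priors = np.array([(y == k).mean() for k in classes])          # P(omega_k)
    resid = X - means[np.searchsorted(classes, y)]
    pooled_var = (resid ** 2).sum(axis=0) / (n - classes.size)     # sigma_lambda^2
    delta = (means[:, None, :] * X[None, :, :] / pooled_var
             - means[:, None, :] ** 2 / (2 * pooled_var)
             + np.log(priors)[:, None, None])
    return classes, delta

# Synthetic stand-in (assumption): only band index 2 separates the classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(0.5, 0.05, size=(90, 5))
X[:, 2] += 0.1 * y

classes, delta = band_discriminants(X, y)
pred = classes[delta.argmax(axis=0)]              # predicted class per sample and band
band_accuracy = (pred == y[:, None]).mean(axis=0)
print(int(np.argmax(band_accuracy)))
```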

3.2.4. F-Score

The F-score is a widely used performance metric in statistics and machine learning, particularly for evaluating classification models. When applied to spectral band selection, it is an effective tool for assessing the relevance of different bands in a classification model. In this context, the F-score evaluates the impact of each band on the model’s overall performance by considering both precision (the percentage of positive predictions that are correct) and recall (the percentage of actual positives correctly predicted).

By selecting the spectral bands that maximize the F-score, we optimize the model’s ability to distinguish between different classes while minimizing errors. This is especially useful when dealing with multiple spectral bands, where careful selection is crucial to avoid model overcomplexity and improve performance [19].

Therefore, the F-score helps assess the importance of each spectral band in the classification task, aiming to strike the right balance between precision and recall while minimizing the influence of non-informative or redundant bands. The equation associated with this model, which for each band takes the ratio of between-class variance to within-class variance, is the following:

F (λ) = \frac{\sum_{k = 1}^{K} n_{k} {(μ_{k, λ} - μ_{λ})}^{2} / (K - 1)}{\sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {(x_{i, λ} - μ_{k, λ})}^{2} / (N - K)}

(4)

$μ_{k, λ}$: average of class $k$ for the band $λ$
$μ_{λ}$: overall average for the band $λ$
$n_{k}$: number of samples in class $k$
$N$: total number of samples
$K$: number of classes
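Equation (4) has the same sums as Equation (1), divided by their degrees of freedom, so the two scores differ only by the constant factor (N - K)/(K - 1) and rank the bands identically. The sketch below checks this numerically. The helper names and the synthetic data are illustrative assumptions.

```python
import numpy as np

def variance_sums(X, y):
    """Return the between-class and within-class sums of squares per band."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for k in classes:
        Xk = X[y == k]
        mean_k = Xk.mean(axis=0)
        between += Xk.shape[0] * (mean_k - grand_mean) ** 2
        within += ((Xk - mean_k) ** 2).sum(axis=0)
    return between, within, classes.size

def f_score_eq4(X, y):
    """Equation (4): sums of squares divided by their degrees of freedom."""
    between, within, n_classes = variance_sums(X, y)
    n_samples = X.shape[0]
    return (between / (n_classes - 1)) / (within / (n_samples - n_classes))

# Synthetic stand-in (assumption): only band index 2 separates the classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(0.5, 0.05, size=(90, 5))
X[:, 2] += 0.1 * y

between, within, n_classes = variance_sums(X, y)
ratio_eq1 = between / within                      # Equation (1)
f_eq4 = f_score_eq4(X, y)                         # Equation (4)
scale = (X.shape[0] - n_classes) / (n_classes - 1)
top_bands = np.argsort(f_eq4)[::-1][:2]           # indices of the two best bands
print(np.allclose(f_eq4, ratio_eq1 * scale), top_bands[0])
```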

3.2.5. Partial Least Squares (PLS)

Partial Least Squares (PLS) is a powerful and flexible statistical method that models the relationship between two sets of variables, even in the presence of multicollinearity and high-dimensional data [20]. This technique is particularly useful when the number of predictor variables (X) is large and there is a strong correlation between them.

PLS aims to find a linear relationship between the predictor and response variables (Y). When the number of predictor variables is high, PLS reduces the dimensionality by creating a small number of latent components that capture the most important information. PLS is robust to multicollinearity, which refers to the strong correlation between predictor variables.

In its operation, PLS extracts latent components from the predictor and response variables. These latent components are linear combinations of the original variables that maximize the covariance between X and Y. Once the latent components are extracted, PLS performs a linear regression of the response variables on them. The resulting PLS model can predict the response variables for new observations [21].

PLS is particularly suitable for data with strong multicollinearity. It helps reduce the number of variables to consider, simplifying the model and reducing the risk of overfitting. PLS can be used for linear, logistic, and discriminant analyses. It is widely used in chemometrics for spectroscopic data analysis and bioinformatics for analyzing genomic data. The equation associated with this model is the following:

t = X w

(5)

$t$: vector of latent scores (principal component)
$X$: spectral data matrix
$w$: weight vector

The weights w are determined to maximize the covariance between the scores and the response variable:

w = \arg \max_{\| w \| = 1} {(cov (X w, y))}^{2}

(6)
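For a single response variable, the maximization in Equation (6) has a closed-form solution: after centering, w is proportional to Xᵀy, which follows from the Cauchy-Schwarz inequality. The sketch below computes this first weight vector and checks it against random unit vectors. Treating the class codes 0, 1, 2 as a numeric response, the function names, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def first_pls_weight(X, y):
    """Unit-norm weight vector maximizing cov(Xw, y)^2, as in Equation (6)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    direction = Xc.T @ yc
    return direction / np.linalg.norm(direction)

def squared_cov(X, y, w):
    """Squared sample covariance between the scores t = Xw and y."""
    t = (X - X.mean(axis=0)) @ w                  # Equation (5) on centered data
    return float(np.cov(t, y)[0, 1] ** 2)

# Synthetic stand-in (assumption): only band index 2 is related to the response.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30).astype(float)
X = rng.normal(0.5, 0.05, size=(90, 5))
X[:, 2] += 0.1 * y

w = first_pls_weight(X, y)
best = squared_cov(X, y, w)

# No random unit vector should reach a larger squared covariance.
random_w = rng.normal(size=(100, X.shape[1]))
random_w /= np.linalg.norm(random_w, axis=1, keepdims=True)
others = [squared_cov(X, y, v) for v in random_w]
print(int(np.argmax(np.abs(w))), best >= max(others))
```

Bands with large absolute weights contribute most to the first latent component, which is one way such weights can be read as band relevance.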

3.2.6. Least Absolute Shrinkage and Selection Operator (LASSO)

The LASSO method is a regularization technique used in statistics and machine learning for variable selection and regression. It is particularly useful when many potentially predictive variables are available and the most relevant ones must be identified [22]. LASSO can shrink some regression coefficients to exactly zero, which means that the corresponding variables are effectively excluded from the model. This simplifies the model by keeping only the variables most essential for prediction. Regularization is a technique used to prevent overfitting, which occurs when a model fits the training data too closely and fails to generalize effectively to new data [23]. LASSO uses an L1 penalty, which is the sum of the absolute values of the regression coefficients. This penalty shrinks the coefficients, and some become exactly zero. The equation associated with this model is the following:

\hat{β} = \arg \min_{β} \left( \| y - X β \|_{2}^{2} + λ \| β \|_{1} \right)

(7)

$X$: matrix of input variables (spectral bands)
$y$: target variable
$β$: vector of coefficients
$λ$: regularization parameter
$\| \cdot \|_{2}^{2}$: squared Euclidean norm
$\| \cdot \|_{1}$: L1 norm (sum of absolute values)
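A minimal sketch of LASSO-based band selection with scikit-learn is shown below: bands whose coefficient is not shrunk to zero are kept. Note that scikit-learn's `Lasso` minimizes (1/(2n))‖y − Xβ‖² + α‖β‖₁, so its `alpha` is a rescaled version of λ in Equation (7). The value `alpha=0.2`, the use of class codes as a numeric target, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): only band index 2 is related to the class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(0.5, 0.05, size=(90, 5))
X[:, 2] += 0.1 * y

# Standardize the bands so that the L1 penalty treats them on the same scale.
X_std = StandardScaler().fit_transform(X)

# alpha plays the role of the regularization parameter (value chosen arbitrarily).
model = Lasso(alpha=0.2).fit(X_std, y.astype(float))

# Bands with a non-zero coefficient are the ones retained by the selection.
selected = np.flatnonzero(model.coef_)
print(selected, model.coef_.round(3))
```

In practice, the regularization strength would typically be tuned, for example by cross-validation, rather than fixed by hand.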

3.2.7. Selection by Majority Vote (Voting)

The voting method aggregates results from several variable selection techniques. In this study, it was implemented to increase the robustness and reliability of selecting the most relevant spectral bands for cocoa bean classification.

The process is based on the following key steps:

Application of selection methods: Several statistical methods (LASSO, Mutual Information, ANOVA, PLS, and LDA) are applied independently to the spectral data to identify discriminatory wavelengths.
Generation of spectral band lists: Each method produces a list of selected spectral bands. The number of times each spectral band was chosen among the different techniques is recorded.
Determination of the voting threshold: A selection threshold is set at four. Thus, a spectral band is retained in the final selection only if at least four of the five methods identify it as relevant. This threshold was chosen to balance selection rigor (by reducing biases specific to a given method) with retaining sufficient bands for efficient learning.
Aggregation of results: The spectral bands meeting the majority criterion are grouped to form the final set of explanatory variables. This set is then used for training and evaluating classification models.

The voting approach thus capitalizes on the complementarity of the different selection techniques while limiting the risks of overfitting linked to an isolated method. It contributes to improving the generalization capacity of predictive models by retaining only the spectral bands deemed both stable and informative using several independent methods.
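The voting steps described above can be sketched with a simple counter. The wavelengths listed for each method are hypothetical values chosen for the example, not the bands selected in the study, and the function name `vote_bands` is an assumption.

```python
from collections import Counter

def vote_bands(selections, threshold=4):
    """Keep the bands chosen by at least `threshold` selection methods."""
    # set() ensures that each method votes at most once per band.
    votes = Counter(band for bands in selections.values() for band in set(bands))
    return sorted(band for band, count in votes.items() if count >= threshold)

# Hypothetical per-method selections (wavelengths in nm), for illustration only.
selections = {
    "LASSO": [450, 520, 610, 700],
    "Mutual Information": [450, 520, 610, 655],
    "ANOVA": [450, 520, 610, 700],
    "PLS": [450, 610, 655, 700],
    "LDA": [520, 610, 700, 740],
}

selected = vote_bands(selections, threshold=4)
print(selected)  # [450, 520, 610, 700]
```

Lowering the threshold keeps more bands at the cost of admitting wavelengths supported by fewer methods.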

3.3. Algorithms for Classification

The choice of three algorithms, namely Support Vector Machine (SVM), Random Forest, and XGBoost, for cocoa bean classification is based on their strengths regarding robustness, performance, and ability to handle complex data. Each algorithm offers distinct advantages that can be leveraged to enhance the accuracy and reliability of cocoa bean classification.

3.3.1. Support Vector Machine (SVM)

SVM is known for handling high-dimensional data and finding optimal hyperplanes that separate classes with maximum margin. Margin maximization makes it comparatively resistant to overfitting, even in complex feature spaces. Thanks to kernel functions that transform the feature space, it is effective both when classes are well separated and when the data are not linearly separable. In particular, the radial basis function (RBF) kernel handles nonlinear data and helps capture complex relationships between spectral bands in cocoa bean measurements [24].

3.3.2. Random Forest

Random forests are ensemble methods that combine multiple decision trees, reducing the risk of overfitting and improving generalization. They are robust to outliers and noise in the data.

Random forests offer good accuracy and remain reasonably interpretable through feature importance scores. They can handle both numerical and categorical data, which is helpful for the diverse characteristics of cocoa beans.

Random forests can capture complex interactions between features by building multiple decision trees, which is beneficial for heterogeneous data [25].

3.3.3. XGBoost

XGBoost is a boosting method that combines multiple weak models to form a strong one. It is robust to overfitting, thanks to regularization and model complexity control techniques. XGBoost is known for its high accuracy and training speed. It is often used in machine learning competitions because it produces accurate results. XGBoost can handle data with nonlinear relationships and complex interactions between features. It is also effective at handling missing data and heterogeneous features [26].

3.4. Pipeline Methodology

The methodology pipeline for classifying the condition of cocoa beans from spectral measurement data consists of the following steps:

Data Acquisition: The CS-2000 spectroradiometer measures the light reflected by cocoa beans at each wavelength. These measurements are then converted into digital values corresponding to the different spectral bands.
Data Preparation: The raw dataset comprises spectral intensity measurements of cocoa beans across multiple wavelengths, ranging from 380 nm to 780 nm. The data are structured in a tabular format, where each row corresponds to a single cocoa bean and includes its spectral response along with its assigned class label:
○ 0: Fermented;
○ 1: Bad condition;
○ 2: Non-fermented.
Spectral Band Selection: Spectral band selection is an essential step in reducing the dimensionality of the data while retaining the most discriminating features. Several band selection methods are applied:
○ LASSO (Least Absolute Shrinkage and Selection Operator): A regularization method that selects the most important variables by shrinking the coefficients of the others to zero;
○ Mutual Information: Evaluates the dependence between wavelengths and classes to retain only the most relevant ones;
○ LDA (Linear Discriminant Analysis): Finds a linear combination of features that maximizes the separation between classes;
○ ANOVA (Analysis of Variance): Selects the bands whose mean values differ most significantly between classes;
○ F-score: Evaluates the relevance of features based on inter- and intra-class variance;
○ PLS (Partial Least Squares Regression): A dimensionality reduction method used to identify the most informative spectral bands;
○ Voting: After applying these methods, a final selection is made by combining the results of the different approaches to identify the most relevant spectral bands.
Data Splitting: The resulting data are split into two sets:
○ Training and Validation Set: Used for training and adjusting models;
○ Test Set: Reserved for final performance evaluation.
Classification: Three machine learning algorithms are used to classify the condition of cocoa beans:
○ SVM (Support Vector Machine): A classification model that uses hyperplanes to separate classes;
○ Random Forest: An ensemble of decision trees that improves robustness and generalization;
○ XGBoost (Extreme Gradient Boosting): A decision tree-based model optimized for performance.
Final Classification: The trained model predicts the condition of cocoa beans based on selected spectral characteristics. The goal is to achieve an accurate classification that distinguishes between fermented, bad condition, and non-fermented beans.
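As an illustration of the Spectral Band Selection step, the sketch below ranks bands by a one-way ANOVA F statistic (ratio of between-class to within-class variance) and keeps the highest-scoring ones. It is one plausible reading of a filter-type selector, not the authors' code: the helper names `anova_f` and `top_bands` and the toy reflectance values are assumptions made for the example.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single band across the classes."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class and within-class sums of squares.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def top_bands(spectra, labels, wavelengths, n_keep):
    """Return the n_keep wavelengths with the largest F statistic."""
    scores = [
        anova_f([row[j] for row in spectra], labels)
        for j in range(len(wavelengths))
    ]
    ranked = sorted(zip(scores, wavelengths), reverse=True)
    return [wavelength for _, wavelength in ranked[:n_keep]]


# Toy data: 6 beans x 3 bands (hypothetical reflectance values).
wavelengths = [400, 550, 700]
labels = [0, 0, 1, 1, 2, 2]  # 0: fermented, 1: bad condition, 2: non-fermented
spectra = [
    [0.50, 0.10, 0.30],
    [0.52, 0.12, 0.36],
    [0.51, 0.50, 0.40],
    [0.49, 0.52, 0.46],
    [0.50, 0.90, 0.50],
    [0.52, 0.92, 0.56],
]
best = top_bands(spectra, labels, wavelengths, n_keep=2)
print(best)
```

In this toy example the 400 nm band barely changes between classes and is discarded, while the 550 nm band separates the three classes most clearly.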

Figure 5 presents the general framework of our methodology.

3.5. Performance Metric

To evaluate the performance of the studied models, we use several metrics: accuracy, precision, recall, F1-score, Matthews correlation coefficient (MCC), mean squared error (MSE), and the ROC curve. These metrics are defined as follows:

Accuracy: Measures the model’s ability to correctly classify data into the appropriate classes;
Precision: The ratio of instances correctly classified as positive to the total number of instances predicted as positive;
Recall: The classifier’s ability to correctly identify all truly positive instances;
F1-score: The harmonic mean of precision and recall;
Matthews correlation coefficient (MCC): Assesses the overall quality of classifications;
Mean Squared Error (MSE): The average of the squared differences between the predicted and actual values.

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}

(8)

\mathrm{Precision} = \frac{TP}{TP + FP}

(9)

\mathrm{Recall} = \frac{TP}{TP + FN}

(10)

\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(11)

\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

(12)

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2

(13)

The variables used in performance evaluation are as follows:

True Positives (TP): Instances correctly classified as positive;
False Positives (FP): Instances incorrectly classified as positive while actually negative;
True Negatives (TN): Instances correctly classified as negative;
False Negatives (FN): Instances incorrectly classified as negative when they are positive;
Yi: The actual value (class label) of sample i;
Ŷi: The value predicted by the model for sample i;
n: The number of samples evaluated.
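Equations (8) to (13) translate directly into code. The sketch below is a minimal illustration under stated assumptions: the function names and the example counts are hypothetical, and the counts are taken for a single class against the rest, since the text does not say how the per-class values were averaged.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from the four counts.

    Assumes tp + fp, tp + fn and precision + recall are nonzero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


# Hypothetical one-vs-rest counts for one class.
metrics = binary_metrics(tp=8, fp=1, tn=17, fn=1)
print({name: round(value, 4) for name, value in metrics.items()})
print(mse([0, 1, 2, 2], [0, 1, 2, 1]))
```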

We also used the ROC curve, which plots sensitivity (the true positive rate) against 1 − specificity (the false positive rate) as the decision threshold varies. It gives a visual assessment of the model’s ability to discriminate between classes, summarized by the area under the curve (AUC).
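To make the construction of a ROC curve concrete, the sketch below sweeps a threshold over predicted scores and records the resulting (false positive rate, true positive rate) pairs, then integrates them with the trapezoidal rule. The function names, labels, and scores are hypothetical; for a three-class problem such as this one, a common convention (assumed here, not stated in the text) is to draw one such curve per class against the other two.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC points, computed with the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores for one class versus the rest (1 = that class).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points)
print(round(auc(points), 4))
```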

Additionally, we used the confusion matrix, which provides a detailed analysis of predictions by categorizing the results into true positives, false positives, true negatives, and false negatives.
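For three classes, the confusion matrix is a 3 × 3 table from which the TP, FP, TN, and FN counts of each class can be read. The sketch below shows this bookkeeping; the function names and the nine example predictions are hypothetical, while the label coding (0: fermented, 1: bad condition, 2: non-fermented) follows the data preparation step.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows are actual classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def one_vs_rest_counts(matrix, cls):
    """TP, FP, TN, FN for class `cls` against all other classes."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp
    fp = sum(row[cls] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


# Hypothetical predictions for nine beans.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2, 0]
matrix = confusion_matrix(y_true, y_pred)
for row in matrix:
    print(row)
print(one_vs_rest_counts(matrix, cls=0))
```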

4. Experiments and Results

In our experiment, we applied several statistical methods for spectral band selection to the cocoa bean measurements, including LASSO, Mutual Information, ANOVA, PLS, and LDA. Each method was used to identify the most discriminating spectral bands, i.e., those that provided the most relevant information for classification.

To improve the robustness of the selection, we then adopted a voting-based approach. This principle consists of retaining only spectral bands selected by at least four of the five statistical methods used. This strategy reduces the impact of method-specific biases and optimizes the quality of the bands retained for classification.

Once the spectral band selection was completed, we classified the beans using three benchmark classifiers: XGBoost, SVM, and Random Forest. Classification was performed for each selection method and for the voting-based approach to assess the impact of each strategy on model performance. The results of each selection method are presented in detail below, followed by a general comparison to identify the best combination of techniques for classifying cocoa beans.

4.1. Detailed Overview of the Experimental Setup

Data Source

The data used in this study come from spectral measurements performed on cocoa beans using a Konica-Minolta CS-2000 spectroradiometer. Measurements were carried out in the visible range (380–780 nm), allowing the reflection spectrum of each sample to be obtained. The cocoa beans were categorized into three fermentation states: fermented, unfermented, and bad condition.

Our dataset consists of 90 spectral measurements of cocoa beans, evenly distributed across three classes (30 measurements per class). We adopted a stratified splitting strategy to ensure a reliable assessment of the classification models’ performance and to preserve the proportion of classes in the data subsets. Thus, 70% of the measurements were used for training, and the remaining 30% served as a test set.
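The split described above can be reproduced with a simple stratified procedure. The sketch below is an illustration only: `stratified_split` and the seed are hypothetical, and the authors' own splitting code is not given. With 30 measurements per class, a 70/30 stratified split yields 63 training and 27 test measurements (9 per class), so, assuming accuracy is computed on those 27 test measurements, it can only take values of the form k/27; the loop at the end prints the values for k = 22 to 26.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split sample indices so each class keeps the same train/test ratio."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


labels = [0] * 30 + [1] * 30 + [2] * 30  # 30 measurements per class
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))

# Accuracy values reachable with 27 test samples and k correct predictions.
for correct in (22, 23, 24, 25, 26):
    print(correct, f"{100 * correct / len(test_idx):.2f}%")
```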

Furthermore, to enhance the reliability of the results and limit the effects of a single random split, we applied cross-validation, ensuring better model generalization and optimal use of the dataset. This approach allows for more robust evaluations that are representative of the algorithms’ actual performance.
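A stratified k-fold scheme is one common way to carry out such cross-validation and to report results as mean ± standard deviation. The sketch below assumes k = 5 and applies the folds to all 90 labels; neither choice is stated in the text, and the helper names and the fold scores passed to `summarize` are hypothetical.

```python
import statistics


def stratified_folds(labels, k=5):
    """Assign sample indices to k folds, class by class, in round-robin order."""
    folds = [[] for _ in range(k)]
    seen = {}
    for index, label in enumerate(labels):
        position = seen.get(label, 0)
        folds[position % k].append(index)
        seen[label] = position + 1
    return folds


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


labels = [0] * 30 + [1] * 30 + [2] * 30
folds = stratified_folds(labels, k=5)
print([len(fold) for fold in folds])

# Hypothetical per-fold accuracies, only to show the aggregation.
mean, std = summarize([0.9, 0.8, 1.0])
print(f"{100 * mean:.2f}% ± {100 * std:.2f}")
```

Each fold serves once as the validation set while the remaining folds are used for training; in practice the indices would usually be shuffled within each class first.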

Classifier parameter settings

The classification experiments used three supervised learning algorithms: Random Forest, Support Vector Machine (SVM), and XGBoost. The Random Forest and SVM algorithms were implemented using the scikit-learn library (version 1.0.1), while the XGBoost model was implemented using the xgboost library (version 1.6.2).

The Random Forest classifier was parameterized with n_estimators = 100, resulting in the construction of 100 decision trees. For the SVM, a linear kernel (kernel = “linear”) was used for class separation, and the probability = True option was enabled to calculate the probabilities associated with each prediction.

For the XGBoost model, the use_label_encoder = False and eval_metric = ‘mlogloss’ parameters were specified. The former disables the library’s built-in label encoder, which is deprecated in recent versions, while the latter sets the multiclass log-loss as the evaluation metric, which is suitable for multiclass classification problems.
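The settings above can be collected in one place, as in the sketch below. Only the hyperparameter values come from the text; the dictionary keys, the `build_classifiers` helper, and the lazy imports are illustrative choices, and every unlisted hyperparameter is assumed to stay at its library default. The `use_label_encoder` argument belongs to the xgboost 1.x series named in the text and may be ignored or rejected by later releases.

```python
# Hyperparameters stated in the text; everything else uses library defaults.
CLASSIFIER_PARAMS = {
    "random_forest": {"n_estimators": 100},
    "svm": {"kernel": "linear", "probability": True},
    "xgboost": {"use_label_encoder": False, "eval_metric": "mlogloss"},
}


def build_classifiers(params=CLASSIFIER_PARAMS):
    """Instantiate the three models (requires scikit-learn and xgboost)."""
    # Imports are local so the parameter table can be inspected
    # even where the libraries are not installed.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from xgboost import XGBClassifier

    return {
        "random_forest": RandomForestClassifier(**params["random_forest"]),
        "svm": SVC(**params["svm"]),
        "xgboost": XGBClassifier(**params["xgboost"]),
    }


for name, settings in CLASSIFIER_PARAMS.items():
    print(name, settings)
```

Each returned model exposes the usual `fit` and `predict` methods, so the same training loop can be reused for every band selection method.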

Evaluation criteria

Classifier performance was assessed using the following metrics:
○ Accuracy (overall proportion of correct predictions);
○ Precision, recall, and F1-score for each class;
○ Confusion matrix;
○ ROC curve and AUC to measure the models’ discriminatory ability.

These criteria allow for a detailed assessment of the models’ performance and robustness, including in the presence of any class imbalance.

4.2. Original Spectral Bands Without Selection

Table 1 highlights that Random Forest is the best-performing algorithm for classifying cocoa beans across all spectral bands. It achieves an accuracy of 88.89%, an F1-score of 88.85%, and an MCC of 83.51%, outperforming other models in both performance and robustness.

Analysis of the original spectral data confirms the robustness and stability of Random Forest, which combines high accuracy with low variance. SVM is a reliable alternative with relatively consistent results. In contrast, XGBoost, despite its reputation for efficiency, was unable to adapt optimally to the complex characteristics of the raw spectral data. Figure 6 shows the confusion matrix and ROC curve of Random Forest.

4.3. Spectral Bands Selected by LASSO

Table 2 shows the algorithms’ performance after spectral band selection using the LASSO method. Once again, Random Forest stands out as the best-performing algorithm, achieving an accuracy of 85.19%, an equivalent F1 score, and an MCC of 77.78%, confirming its robustness and superiority even after dimensionality reduction. The stability of the standard deviations reinforces the idea that Random Forest remains reliable on transformed data. The performance of SVM and XGBoost is comparable, with an accuracy of 81.48%, but their lower MCC scores (72.37%) reflect lower discriminatory ability. This suggests that although LASSO selection reduces data complexity, Random Forest takes better advantage of the key variables selected, ensuring a good compromise between model simplicity and predictive accuracy. The confusion matrix and ROC curve in Figure 7 confirm these results, demonstrating good generalization ability and adequate distinction between classes.

4.4. Selected Spectral Bands by Mutual Information

Performance analysis on the spectral bands selected by Mutual Information in Table 3 highlights the superiority of the SVM model, which achieved a remarkable accuracy of 96.30%, an F1 score of 96.28%, and a high MCC of 94.64%, with a low variance (±2.91). These results reflect the SVM’s excellent generalization capacity and strong robustness against features extracted using this selection method. Although Random Forest and XGBoost maintain high performance (88.89% accuracy each), their respective MCC scores (83.51% and 84.92%) and slightly higher standard deviations indicate they are less effective than the SVM. The Mutual Information approach thus better leverages the strengths of the SVM, making it particularly suited to classifying cocoa beans based on an optimized subset of spectral bands. The confusion matrix and ROC curve presented in Figure 8 confirm this performance, illustrating excellent class separation and significant error reduction.

4.5. Selected Spectral Bands by ANOVA

Performance evaluation on the selected spectral bands using ANOVA in Table 4 confirms the superiority of the Random Forest model, which achieves an accuracy of 92.59%, an F1 score of 92.50%, and an MCC of 89.63%, with relatively low standard deviations (±3.92 for accuracy), demonstrating increased stability and robustness. These results indicate that Random Forest effectively exploits the discriminative features extracted by ANOVA, providing excellent classification capability on filtered spectral data. The performance of the SVM and XGBoost models, while acceptable (85.19% accuracy for each), lags behind. Their MCC scores (79.95% and 78.26%, respectively) and F1-scores (83.96% and 85.49%) show lower classification consistency, accompanied by slightly greater variations. These differences suggest that ANOVA favors Random Forest, reinforcing its role as a reference model in classifying cocoa beans from selected spectral bands. The confusion matrix and ROC curve presented in Figure 9 confirm this performance, illustrating good class separation and increased model robustness.

4.6. Selected Spectral Bands Using Partial Least Squares (PLS)

Performance analysis on the spectral bands selected using PLS in Table 5 reveals nearly equal results between the three algorithms evaluated. All achieved an identical accuracy of 85.19%, accompanied by similar values for F1-score, precision, and recall, with relatively moderate standard deviations (ranging from ±3.24 to ±7.32 depending on the metrics and models). Among them, SVM stood out slightly with an MCC of 78.26%, compared with 77.94% for Random Forest and XGBoost, suggesting a slight superiority in classification consistency. However, the overall differences remain small and statistically insignificant, indicating that the PLS method normalizes performance between models without allowing any one to dominate clearly. These results show that variable selection using PLS produces a balanced feature space, enabling consistent classification regardless of the algorithm used. However, for the sake of robustness, the SVM could be preferred, particularly in an application context where the stability of the model is essential. The confusion matrix and ROC curve in Figure 10 confirm these results, attesting to the model’s reliability.

4.7. Selected Spectral Bands Using Linear Discriminant Analysis (LDA)

The results in Table 6 show that the spectral band selection method based on Linear Discriminant Analysis (LDA) achieved excellent overall performance for the three algorithms evaluated. The SVM stands out as the best classifier, with an accuracy of 88.89% ± 3.95, an F1-score of 88.25% ± 4.11, and an MCC of 84.56% ± 5.43, indicating reliable classification and good stability. The performance of the Random Forest is also remarkable, with accuracy and F1-score identical to those of the SVM (88.89% ± 3.24) and an MCC of 83.51% ± 4.83, making it a very competitive alternative. In contrast, XGBoost, while achieving decent performance (85.19% ± 4.32 accuracy), remains behind the other two models, particularly in terms of MCC (77.94% ± 6.42), showing slightly lower robustness for classifying spectral data processed using LDA. These results confirm the relevance of LDA selection to improving spectral data quality and reveal that SVM is somewhat more reliable for this particular configuration. The confusion matrix and ROC curve presented in Figure 11 confirm this performance, illustrating good class separation and increased model robustness.

4.8. Selected Spectral Bands Using F-Score

The results presented in Table 7 show that the F-score-based spectral band selection method achieved excellent overall performance for all three evaluated algorithms. Random Forest stands out as the best classifier, with an accuracy of 92.59% ± 3.92, an F1-score of 92.50% ± 3.25, and an MCC of 89.63% ± 4.83, indicating reliable classification and excellent stability. The performance of SVM is also remarkable, with an accuracy of 85.19% ± 4.43, an F1-score of 83.96% ± 4.77, and an MCC of 79.95% ± 5.97, making it a competitive alternative. On the other hand, although XGBoost shows satisfactory performance, with an accuracy of 85.19% ± 4.20, it falls behind the other two models, especially in terms of MCC (78.26% ± 6.14), showing a slightly lower robustness for the classification of spectral data processed using F-score. These results confirm the relevance of spectral band selection using F-score to improve spectral data quality and reveal that Random Forest is more reliable overall for this particular configuration. The confusion matrix and ROC curve presented in Figure 12 confirm these results, demonstrating excellent class separation and increased model robustness.

4.9. Selected Spectral Bands Using Voting

The results presented in Table 8 show that the Voting-based spectral band selection method significantly improved the algorithms’ performance, particularly for the SVM. The latter stands out as the best classifier, with an accuracy of 96.30% ± 3.49, an F1-score of 96.28% ± 3.44, and an MCC of 94.64% ± 5.25. These results reflect extremely reliable classification, excellent consistency between precision and recall, and a very low prediction error (MSE = 0.0370), demonstrating the model’s robustness. Random Forest also performed well, with an accuracy of 88.89% ± 3.39 and an F1-score of 88.85% ± 3.41, along with an MCC of 83.51% ± 4.97, making it an attractive alternative but significantly less efficient than SVM in this context.

In contrast, XGBoost performed more modestly, with an accuracy of 85.19% ± 3.95 and an MCC of 77.94% ± 5.91, indicating good classification capability but lower stability and reliability than the other two models.

These results confirm the effectiveness of the Voting method for spectral band selection and highlight the superiority of SVM in this specific configuration. The SVM model thus proves to be the most accurate and robust for classifying spectral data resulting from this selection strategy.

The confusion matrix and ROC curve presented in Figure 13 confirm these results, demonstrating excellent class separation and optimized model performance.

4.10. Comparison of Methods for Spectral Band Selection

Table 9 compares the different band selection methods and their impact on the classification models’ accuracy and Matthews correlation coefficient (MCC). The results show that Mutual Information and Voting achieve the best performance, with an accuracy of 96.30% and an MCC of 94.64% when coupled with an SVM. These methods, therefore, appear particularly effective for selecting discriminatory and relevant spectral bands for classification.

Other approaches, such as ANOVA and F-score, combined with a Random Forest, also perform well, with an accuracy of 92.59% and an MCC of 89.63%. These statistical methods identify the most influential bands based on their contribution to class variance.

In contrast, although they remain competitive, methods such as LDA and PLS perform less well, with accuracies of 88.89% and 85.19%, respectively. LDA, which maximizes class separation, appears less effective in this context, while PLS and LASSO, although versatile, do not achieve the performance levels of other, more targeted methods.

These results underscore the importance of selecting the appropriate band selection method to maximize the performance of classification models. The most effective approach depends on the type of algorithm used and the specific characteristics of the hyperspectral data analyzed.

Figure 14 presents a histogram that visually compares the performance of different band selection methods, providing a clear visual representation of the accuracy and MCC results.
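The comparison can also be reproduced programmatically. The sketch below ranks the best classifier per selection method by accuracy and then by MCC; the values are transcribed from Sections 4.2 to 4.9 of this article, while the data structure and variable names are illustrative.

```python
# (best classifier, accuracy %, MCC %) per selection method, as reported
# in Sections 4.2 to 4.9.
results = {
    "Original bands": ("Random Forest", 88.89, 83.51),
    "LASSO": ("Random Forest", 85.19, 77.78),
    "Mutual Information": ("SVM", 96.30, 94.64),
    "ANOVA": ("Random Forest", 92.59, 89.63),
    "PLS": ("SVM", 85.19, 78.26),
    "LDA": ("SVM", 88.89, 84.56),
    "F-score": ("Random Forest", 92.59, 89.63),
    "Voting": ("SVM", 96.30, 94.64),
}

# Sort by accuracy, then MCC, both descending; ties keep insertion order.
ranked = sorted(
    results.items(), key=lambda item: (item[1][1], item[1][2]), reverse=True
)
for method, (classifier, accuracy, mcc) in ranked:
    print(f"{method:<20} {classifier:<14} {accuracy:6.2f} {mcc:6.2f}")
```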

5. Discussion

The evaluation of the algorithms’ performance based on the spectral band selection methods highlights the latter’s significant influence on the classification of cocoa beans. The analysis of the different results reveals that the choice of the optimal algorithm strongly depends on the data characteristics used and the band selection method applied.

Among the three algorithms tested (Random Forest, SVM, and XGBoost), the SVM stands out in several configurations. It achieved an accuracy of 96.30%, an F1-score of 96.28%, and an MCC of 94.64% using the Mutual Information and Voting methods. These results demonstrate that the SVM performs exceptionally well when the input variables are carefully selected, reducing the problem’s complexity while retaining essential discriminant information. This is explained by the SVM’s ability to handle high-dimensional spaces well when relevant variables provide a clear separation between classes.

The Random Forest model also performs well, particularly after selection via ANOVA and F-score, with an accuracy of 92.59% and an MCC of 89.63%. This behavior confirms that Random Forest is a robust algorithm capable of handling datasets where combining several relevant variables allows for effective class separation. However, its performance remains generally lower than SVM’s, particularly when dimensionality reduction is essential.

While XGBoost performs well in some cases, it is overall weaker than the other two models: its accuracy peaks at 88.89% with Mutual Information and stays at or below 85.19% with the other selection methods. This underperformance can be explained by increased sensitivity to variations in the selected features and reduced generalization capacity on limited datasets, as in this study.

Regarding feature selection methods, they strongly influence model performance:

Mutual Information and Voting are the most effective approaches, particularly with SVM, by extracting non-redundant and highly informative variables;
ANOVA and F-score are more suitable for Random Forest, suggesting that these methods identify features well exploited by decision trees;
LDA and PLS yield more consistent results, with no real marked advantage for a given model, indicating that they offer good stability but more limited discriminatory power.

The improved performance with these selection methods highlights the importance of dimensionality reduction, which helps to:

Reduce the risk of overfitting;
Improve the generalization ability of models;
Simplify the model structure without losing critical information.

In particular, SVM provides clear class separation and effectively reduces classification errors. The associated ROC curve shows a high area under the curve (AUC), confirming its strong ability to discriminate between classes. Similarly, Random Forest sometimes achieves high performance, although lower than SVM, reinforcing its value depending on the type of selection applied.

This study highlights the importance of coupling the variable selection method and the classification algorithm. Combined with Mutual Information or Voting methods, SVM offers the best performance, while Random Forest remains a reliable alternative depending on the nature of the variables selected. These results confirm that machine learning techniques can significantly improve sorting and quality control processes in the cocoa industry.

Although the results obtained demonstrate the effectiveness of the proposed spectral band extraction techniques, some limitations should be mentioned. First, the experiments were performed on a limited dataset from a single geographical region, which may limit the generalization of the results to other contexts. However, a more diverse database is being built to expand the geographical and varietal coverage, improving the robustness and transferability of the proposed models. Finally, only classical classifiers were used in this first phase. Integrating more advanced deep learning models could be explored in future work, as the database becomes richer and better balanced.

6. Conclusions

Our study demonstrated the importance of spectral analysis in classifying cocoa beans according to their fermentation state. The evaluation of the classification models’ performance highlighted the impact of spectral band selection methods on prediction accuracy.

The results show that the optimal algorithm choice closely depends on the selected features. Combined with Mutual Information and Voting methods, SVM achieved the best performance with an accuracy of 96.30%, surpassing Random Forest and XGBoost. However, Random Forest remains a robust alternative, particularly effective in conjunction with ANOVA and F-score, achieving an accuracy of 92.59%. On the other hand, XGBoost was found to be less suited to the studied data.

The analysis also highlighted the importance of feature selection techniques in reducing dimensionality and enhancing model generalization. Mutual Information and Voting helped optimize classification with SVM, while ANOVA and F-score enhanced the performance of Random Forest.

These results confirm that machine learning, applied to carefully selected spectral bands, is a promising approach for automating the quality control of cocoa beans. Further optimization of feature selection methods and exploration of other algorithms could further enhance the accuracy and efficiency of models in an industrial context.

In our future work, we will expand our database to other producing regions to strengthen our model’s robustness. Furthermore, we can explore integrating our own deep learning models (such as CNNs or Transformers) to improve our classification accuracy.

Author Contributions

Conceptualization, K.J.A. and P.G.; methodology, K.J.A. and D.M.; software, K.J.A.; validation, P.G. and D.M.; formal analysis, K.J.A. and A.B.B.; investigation, K.J.A. and A.B.B.; resources, P.G.; data curation, K.J.A.; writing—original draft preparation, K.J.A., A.B.B. and D.M.; writing—review and editing, P.G.; visualization, D.M.; supervision, P.G. and D.M.; project administration, P.G.; funding acquisition, P.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset is available upon request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ayikpa, K.J. Système d’Aide Pour Améliorer la Qualité des Récoltes de Café et du Cacao à l’Aide de l’Intelligence Artificielle. Ph.D. Thesis, Université Bourgogne Franche-Comté, Besançon, France, Université Félix Houphouët-Boigny, Abidjan, Côte d’Ivoire, 2023. [Google Scholar]
Nascimento, L.L.; Pereira, M.S.; de Almeida, L.S.; da Silveira Ferreira, L.; de Moura Pita, B.L.; de Souza, C.O.; Ribeiro, C.D.F.; Fricks, A.T. Innovation in Cocoa Fermentation: Evidence from Patent Documents and Scientific Articles. Fermentation 2024, 10, 251. [Google Scholar] [CrossRef]
Ayikpa, K.J.; Gouton, P.; Mamadou, D.; Ballo, A.B. Classification of Cocoa Beans by Analyzing Spectral Measurements Using Machine Learning and Genetic Algorithm. J. Imaging 2024, 10, 19. [Google Scholar] [CrossRef]
Thenkabail, P.S.; Lyon, J.G.; Huete, A. Advances in Hyperspectral Remote Sensing of Vegetation and Agricultural Crops. In Fundamentals, Sensor Systems, Spectral Libraries, and Data Mining for Vegetation; CRC Press: Boca Raton, FL, USA, 2018; ISBN 978-1-315-16415-1. [Google Scholar]
Wieme, J.; Mollazade, K.; Malounas, I.; Zude-Sasse, M.; Zhao, M.; Gowen, A.; Argyropoulos, D.; Fountas, S.; Van Beek, J. Application of Hyperspectral Imaging Systems and Artificial Intelligence for Quality Assessment of Fruit, Vegetables and Mushrooms: A Review. Biosyst. Eng. 2022, 222, 156–176. [Google Scholar] [CrossRef]
Mou, L.; Saha, S.; Hua, Y.; Bovolo, F.; Bruzzone, L.; Zhu, X.X. Deep Reinforcement Learning for Band Selection in Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
Peña, M.A.; Liao, R.; Brenning, A. Using Spectrotemporal Indices to Improve the Fruit-Tree Crop Classification Accuracy. ISPRS J. Photogramm. Remote Sens. 2017, 128, 158–169. [Google Scholar] [CrossRef]
Pinheiro Claro Gomes, W.; Gonçalves, L.; Barboza Da Silva, C.; Melchert, W.R. Application of Multispectral Imaging Combined with Machine Learning Models to Discriminate Special and Traditional Green Coffee. Comput. Electron. Agric. 2022, 198, 107097. [Google Scholar] [CrossRef]
Sim, J.; Dixit, Y.; Mcgoverin, C.; Oey, I.; Frew, R.; Reis, M.M.; Kebede, B. Machine Learning-Driven Hyperspectral Imaging for Non-Destructive Origin Verification of Green Coffee Beans across Continents, Countries, and Regions. Food Control 2024, 156, 110159. [Google Scholar] [CrossRef]
Lin, M.; Chen, C.; Lu, J. Intelligent Classifier for Various Degrees of Coffee Roasts Using Smart Multispectral Vision System. J. Field Robot. 2024, 41, 639–653. [Google Scholar] [CrossRef]
Castro, W.; Tene, B.; Castro, J.; Guivin, A.; Ruesta, N.; Avila-George, H. Mango Varietal Discrimination Using Hyperspectral Imaging and Machine Learning. Neural Comput. Appl. 2024, 36, 18693–18703. [Google Scholar] [CrossRef]
Su, W.-H.; Xue, H. Imaging Spectroscopy and Machine Learning for Intelligent Determination of Potato and Sweet Potato Quality. Foods 2021, 10, 2146. [Google Scholar] [CrossRef]
Davidson, A.; Csillag, F. A Comparison of Nested Analysis of Variance (ANOVA) and Variograms for Characterizing Grassland Spatial Structure under a Limited Sampling Budget. Can. J. Remote Sens. 2003, 29, 43–56. [Google Scholar] [CrossRef]
Lotte, F.; Lécuyer, A.; Arnaldi, B. FuRIA: Un Nouvel Algorithme d’extraction de Caractéristiques Pour Les Interfaces Cerveau-Ordinateur Utilisant Modèles Inverses et Modèles Flous. In Proceedings of the Colloque GRETSI, Troyes, France, 1–14 September 2007. [Google Scholar]
Wang, D.; Tang, L.; Zhu, L.; Yang, Z.-X. Mutual Information Maximization Based Similarity Operation for 3D Point Cloud Completion Network. IEEE Signal Process. Lett. 2022, 29, 1217–1221. [Google Scholar] [CrossRef]
Vergara, J.R.; Estévez, P.A. A Review of Feature Selection Methods Based on Mutual Information. Neural Comput. Appl. 2014, 24, 175–186. [Google Scholar] [CrossRef]
Ali, S.; Hassan, M.; Kim, J.Y.; Farid, M.I.; Sanaullah, M.; Mufti, H. FF-PCA-LDA: Intelligent Feature Fusion Based PCA-LDA Classification System for Plant Leaf Diseases. Appl. Sci. 2022, 12, 3514. [Google Scholar] [CrossRef]
Li, Z.; Nie, F.; Wu, D.; Wang, Z.; Li, X. Sparse Trace Ratio LDA for Supervised Feature Selection. IEEE Trans. Cybern. 2024, 54, 2420–2433. [Google Scholar] [CrossRef] [PubMed]
Song, Q.; Jiang, H.; Liu, J. Feature Selection Based on FDA and F-Score for Multi-Class Classification. Expert Syst. Appl. 2017, 81, 22–27. [Google Scholar] [CrossRef]
Lee, G.; Lee, K. Feature Selection Using Distributions of Orthogonal PLS Regression Vectors in Spectral Data. BioData Min. 2021, 14, 7. [Google Scholar] [CrossRef]
Chang, N.; Jing, X.; Zeng, W.; Zhang, Y.; Li, Z.; Chen, D.; Jiang, D.; Zhong, X.; Dong, G.; Liu, Q. Soil Organic Carbon Prediction Based on Different Combinations of Hyperspectral Feature Selection and Regression Algorithms. Agronomy 2023, 13, 1806. [Google Scholar] [CrossRef]
Hassan, M.M.; Hassan, M.M.; Yasmin, F.; Khan, M.A.R.; Zaman, S.; Galibuzzaman; Islam, K.K.; Bairagi, A.K. A Comparative Assessment of Machine Learning Algorithms with the Least Absolute Shrinkage and Selection Operator for Breast Cancer Detection and Prediction. Decis. Anal. J. 2023, 7, 100245. [Google Scholar] [CrossRef]
Xie, Y.; Shi, H.; Han, B. Bioinformatic Analysis of Underlying Mechanisms of Kawasaki Disease via Weighted Gene Correlation Network Analysis (WGCNA) and the Least Absolute Shrinkage and Selection Operator Method (LASSO) Regression Model. BMC Pediatr 2023, 23, 90. [Google Scholar] [CrossRef]
Wu, Y.; Xia, Z.; Feng, Z.; Huang, M.; Liu, H.; Zhang, Y. Forecasting Heart Disease Risk with a Stacking-Based Ensemble Machine Learning Method. Electronics 2024, 13, 3996. [Google Scholar] [CrossRef]
Ayikpa, K.J.; Mamadou, D.; Gouton, P.; Adou, K.J. Experimental Evaluation of Coffee Leaf Disease Classification and Recognition Based on Machine Learning and Deep Learning Algorithms. J. Comput. Sci. 2022, 18, 1201–1212.
Zhang, J.; Ma, X.; Zhang, J.; Sun, D.; Zhou, X.; Mi, C.; Wen, H. Insights into Geospatial Heterogeneity of Landslide Susceptibility Based on the SHAP-XGBoost Model. J. Environ. Manag. 2023, 332, 117357.

Figure 1. Image of the CS-2000 spectroradiometer from Konica Minolta.

Figure 2. Sample of reflected light intensity across the visible spectrum of cocoa beans (380 nm to 780 nm).

Figure 3. Spectral reflectance of cocoa beans obtained with the CS-2000 spectroradiometer.

Figure 4. Average spectral reflectance for each class.

Figure 5. General framework of our methodology.

Figure 6. Confusion matrix and ROC curve of the best algorithm on the original spectral bands.

Figure 7. Confusion matrix and ROC curve of the best algorithm on the spectral bands selected by LASSO.

Figure 8. Confusion matrix and ROC curve of the best algorithm on the spectral bands selected by Mutual Information.

Figure 9. Confusion matrix and ROC curve of the best algorithm on the spectral bands selected by ANOVA.

Figure 10. Confusion matrix and ROC curve of the best algorithm on the PLS-selected spectral bands.

Figure 11. Confusion matrix and ROC curve of the best algorithm on the LDA-selected spectral bands.

Figure 12. Confusion matrix and ROC curve of the best algorithm on the spectral bands selected using F-Score.

Figure 13. Confusion matrix and ROC curve of the best algorithm on the spectral bands selected using Voting.

Figure 14. Comparison of the performance of different spectral band selection methods according to accuracy and MCC.

Table 1. Performance evaluation of the algorithms on the original spectral band data.

Algorithms	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MSE	MCC (%)
Random Forest	88.89 ± 4.28	88.85 ± 4.43	89.17 ± 4.27	88.89 ± 4.28	0.4444	83.51 ± 6.28
SVM	85.19 ± 4.62	84.65 ± 4.77	85.24 ± 4.52	85.19 ± 4.62	0.4815	78.26 ± 6.75
XGBoost	77.78 ± 6.30	77.33 ± 6.43	77.22 ± 6.47	77.78 ± 6.30	0.7778	66.80 ± 9.40

Table 2. Evaluation of algorithm performance on data from spectral bands selected by LASSO.

Algorithms	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MSE	MCC (%)
Random Forest	85.19 ± 3.39	85.19 ± 3.40	85.19 ± 3.28	85.19 ± 3.39	0.5926	77.78 ± 5.02
SVM	81.48 ± 5.55	81.03 ± 5.68	80.93 ± 5.66	81.48 ± 5.55	0.6296	72.37 ± 8.25
XGBoost	81.48 ± 4.97	81.86 ± 4.86	82.59 ± 4.73	81.48 ± 4.97	0.6296	72.37 ± 7.45

Table 3. Evaluation of algorithm performance on spectral band data selected using Mutual Information.

Algorithms	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MSE	MCC (%)
Random Forest	88.89 ± 3.24	88.85 ± 3.25	89.17 ± 3.20	88.89 ± 3.24	0.4444	83.51 ± 4.83
SVM	96.30 ± 2.91	96.28 ± 2.92	96.67 ± 2.91	96.30 ± 2.91	0.0370	94.64 ± 2.14
XGBoost	88.89 ± 2.91	88.57 ± 3.00	91.67 ± 2.14	88.89 ± 2.91	0.4444	84.92 ± 3.92

Table 4. Evaluation of algorithm performance on spectral band data selected by ANOVA.

Algorithms	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MSE	MCC (%)
Random Forest	92.59 ± 3.92	92.50 ± 3.25	93.94 ± 3.20	92.59 ± 3.24	0.2963	89.63 ± 4.83
SVM	85.19 ± 4.43	83.96 ± 4.77	88.33 ± 3.43	85.19 ± 4.43	0.4815	79.95 ± 5.97
XGBoost	85.19 ± 4.20	85.49 ± 4.10	86.74 ± 3.71	85.19 ± 4.20	0.4815	78.26 ± 6.14

Table 5. Evaluation of the performance of algorithms on spectral band data selected using PLS.

Algorithms	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MSE	MCC (%)
Random Forest	85.19 ± 3.24	85.37 ± 3.25	85.93 ± 3.20	85.19 ± 3.24	0.4815	77.94 ± 4.83
SVM	85.19 ± 5.07	84.65 ± 5.30	85.24 ± 4.83	85.19 ± 5.07	0.4815	78.26 ± 7.32
XGBoost	85.19 ± 7.32	85.37 ± 3.90	85.93 ± 3.82	85.19 ± 3.95	0.4815	77.94 ± 5.91

Table 6. Evaluation of the performance of algorithms on spectral band data selected using LDA.

Algorithms	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MSE	MCC (%)
Random Forest	88.89 ± 3.24	88.85 ± 3.25	89.17 ± 3.20	88.89 ± 3.20	0.4444	83.51 ± 4.83
SVM	88.89 ± 3.95	88.25 ± 4.11	90.61 ± 3.17	88.89 ± 3.17	0.3333	84.56 ± 5.43
XGBoost	85.19 ± 4.32	85.37 ± 4.27	85.93 ± 4.11	85.19 ± 4.32	0.4815	77.94 ± 6.42

Table 7. Evaluation of the performance of algorithms on spectral band data selected using F-score.

Algorithms	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MSE	MCC (%)
Random Forest	92.59 ± 3.92	92.50 ± 3.25	93.94 ± 3.20	92.59 ± 3.24	0.2963	89.63 ± 4.83
SVM	85.19 ± 4.43	83.96 ± 4.77	88.33 ± 3.43	85.19 ± 4.43	0.4815	79.95 ± 5.97
XGBoost	85.19 ± 4.20	85.49 ± 4.10	86.74 ± 3.71	85.19 ± 4.20	0.4815	78.26 ± 6.14

Table 8. Evaluation of the performance of algorithms on spectral band data selected using Voting.

Algorithms	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)	MSE	MCC (%)
Random Forest	88.89 ± 3.39	88.85 ± 3.41	89.17 ± 3.17	88.89 ± 3.39	0.4444	83.51 ± 4.97
SVM	96.30 ± 3.49	96.28 ± 3.44	96.67 ± 3.31	96.30 ± 3.49	0.0370	94.64 ± 5.25
XGBoost	85.19 ± 3.95	85.37 ± 3.90	85.93 ± 3.81	85.19 ± 3.95	0.4815	77.94 ± 5.91

Table 9. Comparison of spectral band selection methods and their impact on algorithm performance.

Methods	Best Associated Algorithm	Accuracy (%)	MCC (%)
Mutual Information	SVM	96.30	94.64
Voting	SVM	96.30	94.64
ANOVA	Random Forest	92.59	89.63
F-score	Random Forest	92.59	89.63
LDA	SVM	88.89	84.56
PLS	Random Forest/SVM/XGBoost	85.19	77.94
LASSO	Random Forest	85.19	77.78
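
The accuracy and MCC columns reported in the tables above can both be derived from a three-class confusion matrix. The following is a minimal sketch, not the authors' code: the confusion matrix is hypothetical (an assumed test split of 27 beans, nine per class, with a single misclassification), and the function names `accuracy` and `multiclass_mcc` are illustrative. MCC is computed with the standard multiclass generalization based on row sums, column sums, and the trace of the matrix.

```python
from math import sqrt


def accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def multiclass_mcc(cm):
    """Multiclass Matthews correlation coefficient from a confusion matrix.

    Rows are true classes, columns are predicted classes.
    """
    k = len(cm)
    s = sum(sum(row) for row in cm)            # total number of samples
    c = sum(cm[i][i] for i in range(k))        # correctly classified samples
    t = [sum(cm[i]) for i in range(k)]         # true count per class (row sums)
    p = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical confusion matrix (assumption, not data from the paper):
# 9 samples per class, one sample of the second class predicted as the third.
cm = [
    [9, 0, 0],
    [0, 8, 1],
    [0, 0, 9],
]

acc = accuracy(cm)
mcc = multiclass_mcc(cm)
print(f"accuracy = {100 * acc:.2f}%  MCC = {100 * mcc:.2f}%")
```

Under these assumptions, a single error out of 27 samples gives an accuracy of 96.30% and an MCC of 94.64%, which coincides with the values listed for SVM with Mutual Information and with Voting in Table 9. The test-set size and class balance used here remain assumptions of the sketch.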

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
