Using Machine Learning to Classify Capsicum Genotypes Based on Agronomic Traits

Freire, Ana Izabella; Souza, Alex Fernandes de; Leal, Gustavo dos Santos; Souza, Filipe Bittencourt Machado de; Verri, Filipe Alves Neto; Balestrassi, Pedro Paulo; Paiva, Anderson Paulo de; Júnior, João José da Silva; Silva, Leonardo França da; Garcia, Fernando Henrique Silva; Fonseca, Guilherme Godoy

doi:10.3390/horticulturae12050623


by Ana Izabella Freire ¹, Alex Fernandes de Souza ¹, Gustavo dos Santos Leal ¹, Filipe Bittencourt Machado de Souza ²,*, Filipe Alves Neto Verri ³, Pedro Paulo Balestrassi ¹, Anderson Paulo de Paiva ¹, João José da Silva Júnior ², Leonardo França da Silva ², Fernando Henrique Silva Garcia ⁴ and Guilherme Godoy Fonseca ²

¹ Institute of Industrial Engineering and Management, Federal University of Itajubá (UNIFEI), BPS Avenue, 1303, Pinheirinho, Itajubá CEP 37500-903, MG, Brazil
² Faculty of Agronomy and Veterinary Medicine, University of Brasília (UnB), Campus Darcy Ribeiro, ICC Centro–Bloco B, Térreo, Asa Norte, Brasília CEP 70910-970, DF, Brazil
³ Division of Computer Science, Aeronautics Institute of Technology (ITA), Marshal Eduardo Gomes Square, 50, Vila das Acacias, São José dos Campos CEP 12228-900, SP, Brazil
⁴ Department of Biological and Health Sciences, Federal University of Amapá (UNIFAP), Rodovia Josmar Chaves Pinto KM 02, Macapá CEP 68903-419, AP, Brazil
* Author to whom correspondence should be addressed.

Horticulturae 2026, 12(5), 623; https://doi.org/10.3390/horticulturae12050623

Submission received: 17 March 2026 / Revised: 1 May 2026 / Accepted: 7 May 2026 / Published: 18 May 2026

(This article belongs to the Section Genetics, Genomics, Breeding, and Biotechnology (G2B2))


Abstract

Peppers from the Capsicum genus are highly valued worldwide for their culinary, medicinal, and nutritional uses. However, accurately classifying and developing new varieties to enhance these traits remains a challenge due to the limitations of traditional methods, which often lack precision and are time-consuming. This study aimed to overcome these limitations by applying advanced multivariate statistical techniques and machine learning models (KNN, RF, XGBoost) to characterize and classify Capsicum genotypes based on genetic and phenotypic features. Sixteen Capsicum genotypes were analyzed using methods such as MANOVA, PCA, and cluster analysis to explore their variabilities and similarities. Cluster analysis revealed the formation of distinct groups, indicating phenotypic similarity patterns among specific varieties. The machine learning models were evaluated using Leave-One-Out cross-validation to address the challenges posed by small datasets. The results indicated that Random Forest outperformed the other models, exhibiting superior class discrimination with an AUC of 0.96, while KNN and XGBoost achieved AUC values of 0.95 and 0.85, respectively. Despite the slightly superior performance of Random Forest relative to KNN, both models demonstrated strong predictive performance, whereas XGBoost exhibited moderate performance. In addition, key agronomic traits such as pericarp thickness, fruit diameter, seeds per fruit, and corolla color were identified as the most relevant variables for classification. Principal component analysis indicated that the first components explained a substantial proportion of the total variance, supporting efficient dimensionality reduction and pattern recognition. Furthermore, the Random Forest model achieved high overall performance, with accuracy, precision, recall, and F1-score values close to 0.93, reinforcing its robustness in multiclass classification. 
This study highlights the effectiveness of machine learning in overcoming the constraints of traditional classification methods, providing a robust approach for the accurate identification and improvement of pepper varieties.

Keywords: KNN; plant breeding; phenotyping; Random Forest; XGBoost


1. Introduction

Pepper and chili species, belonging to the Capsicum genus, originate from the Americas, with consumption records dating back more than 7000 years in Mexico [1,2]. When Europeans arrived in Brazil, they found that indigenous tribes already used ground peppers mixed with ashes as an effective method to preserve seeds of other traditionally cultivated plants [3,4]. Peppers have since spread globally and are now part of the diet of about a quarter of the world’s population, predominantly as condiments [5,6]. They are valued for their diversity in shapes, colors, sizes, flavors, and levels of spiciness, enriching the gastronomy of various cultures around the world [4,7].

The Capsicum genus exhibits rich biological diversity, manifested in various shapes, colors, sizes, and degrees of pungency [8]. This diversity not only plays a significant role in global cuisine by adding distinct flavors to a myriad of dishes but also represents a fertile field for scientific research, especially in the areas of botanical classification and genetic improvement [7,9]. In this context, the use of machine learning techniques for the classification of Capsicum emerges as an innovative approach, leveraging the potential of these technologies to analyze and interpret the complex characteristics that define these varieties [10,11].

Machine learning, a subset of artificial intelligence, offers powerful tools capable of identifying patterns and relationships in large datasets, thereby surpassing the limitations of traditional classification methods [12,13]. By applying machine learning classification algorithms to the characteristics of Capsicum peppers, researchers can automate the classification process, increasing accuracy and efficiency [14,15]. These models are trained to recognize patterns in the data that correspond to different genotypes, taking into account not only visual attributes such as color and shape but also chemical properties, such as capsaicin concentration, which is responsible for spiciness [10].

Classification models in machine learning serve as a bridge between vast repositories of raw information and the actionable insights they can provide [16]. These models work by categorizing data into predefined classes based on their characteristics, making them vital tools in a wide range of scientific and industrial applications [17].

In the context of studying Capsicum peppers, machine learning classification enables the rapid and precise distinction between the numerous varieties, based on attributes such as shape, color, size, and spiciness [18,19]. This process not only optimizes the recognition and cataloging of genotypes but also drives research in genetics, agronomy, and nutrition by identifying patterns and correlations that may lead to the development of varieties with desirable characteristics, such as greater pest resistance or different flavor profiles [20]. Thus, classification models emerge not only as tools for operational efficiency but also as facilitators of significant innovations in the study and application of Capsicum peppers [21].

Equally important as machine learning classification models are multivariate statistical analyses, which represent a key component in the in-depth understanding of complex datasets [22]. These analyses are capable of simultaneously manipulating multiple variables to identify patterns, trends, and correlations that may not be evident at first glance. In studies of Capsicum peppers, multivariate statistical analyses can reveal subtle similarities and marked differences between distinct genotypes, facilitating the identification of groups of genotypes with similar genetic or phenotypic characteristics [23,24]. This approach deepens our understanding of biodiversity within Capsicum and provides pathways for genetic improvement and genotype conservation [25]. By unraveling the complex network of relationships among variables, multivariate analyses significantly enrich classification models, providing a solid foundation for scientific inferences and practical applications in agriculture, biology, and nutrition [26,27].

The adoption of these techniques opens new perspectives for agricultural biotechnology, facilitating the development of new varieties with desired characteristics for gastronomy, and aiding in biodiversity preservation [11,28]. Furthermore, the precise and automated classification of Capsicum peppers can enhance traceability and food safety, benefiting producers, traders, and consumers alike [10,29]. This study aims to explore the use of machine learning techniques in the classification of Capsicum peppers, not only to deepen our understanding of this diversity but also to highlight the potential of these innovative technologies in addressing contemporary challenges in botany and agriculture.

Therefore, the primary objective of this study is to apply multivariate statistical techniques and predictive machine learning models for the detailed characterization of the different types of Capsicum, as well as for the effective classification of genotypes through their distinctive features. This effort aims to explore the potential of these advanced methodologies to unravel the complexities inherent in the variables that define each type of pepper, from visual aspects to chemical properties, with the goal of establishing a precise and reliable classification system. This endeavor seeks not only to enrich the scientific knowledge base regarding the genetic and phenotypic diversity within the Capsicum genus but also to provide valuable statistical data that can assist in the genetic improvement of genotypes, biodiversity conservation, and the development of new culinary and industrial applications.

2. Materials and Methods

2.1. Experimental Design

In this study, 16 genotypes of peppers from the Capsicum genus were utilized (Table 1), including Capsicum frutescens (Grisu, Tabasco, Peter, Malagueta), Capsicum baccatum (Dedo de moça), Capsicum chinense (BRS Seriema, Biquinho amarela, Guaraci Cumari do Pará, BRS Moema, Bode Amarela Arari), and Capsicum annuum (Vulcão, Pimentão amarelo, Doce Italiana, Jalapenho, Jamaican red, Jamaican yellow). Each genotype was assigned an alphanumeric code for reference, along with its common name in Brazil and its English translation, enabling researchers to have a quick and unambiguous reference.

The use of these codes is particularly valuable in contexts where data processing is complex and precision is essential, such as in genetic studies, agronomy, and laboratory experiments. By replacing lengthy names, which can sometimes be difficult to translate, with concise codes, Table 1 promotes efficiency and reduces the possibility of errors in recording and handling information about each pepper variety.

The study evaluated thirteen parameters, with five independent replicates per parameter for each Capsicum genotype (M1–M5), giving a dataset of 80 observations. The quantitative variables included stem diameter (mm), pericarp thickness (mm), and fruit diameter (mm), which were measured using a digital caliper (Whipps, Inc., Athol, MA, USA; stainless steel; 0.01 mm precision). Plant height (cm), canopy diameter (cm), and stem length (cm) were measured using a measuring tape (Stanley Black & Decker, New Britain, CT, USA; model PowerLock 33-425). Fruit weight (g) and the weight of five fruits (g) were determined using a precision balance (Shimadzu Corporation, Kyoto, Japan; model AY220). Soluble solids content (°Brix) was measured using a digital Abbe-type refractometer (Atago Co., Ltd., Tokyo, Japan; model PAL), while pH was measured using a digital pH meter (Hanna Instruments, Woonsocket, RI, USA; model HI2211) and expressed in pH units (Table 2).

Variables such as growth habit and corolla color, although numerically encoded, were treated as categorical traits converted into quantitative representations for use in multivariate analyses and machine learning models. Growth habit was considered an ordinal variable (prostrate = 3, intermediate = 5, erect = 7), preserving the order relationship among classes. In contrast, corolla color was treated as a nominal variable and classified into eight categories: white (1), light yellow (2), yellow (3), yellow–green (4), purple with white base (5), white with purple base (6), white with purple margin (7), and purple (8), without implying hierarchy or metric distance [30]. The inclusion of these variables was conducted with methodological caution, and exploratory analyses were performed to ensure their suitability and contribution to the models.
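The numeric coding described above can be sketched in a few lines of Python. The code values are those stated in the text; the function name `encode_record` and the sample record are hypothetical and serve only as an illustration, not as the authors' actual preprocessing script.

```python
# Codes as stated in the text. Growth habit is ordinal (order is meaningful);
# corolla color is nominal (the integers are labels, not magnitudes).
GROWTH_HABIT_CODES = {"prostrate": 3, "intermediate": 5, "erect": 7}
COROLLA_COLOR_CODES = {
    "white": 1,
    "light yellow": 2,
    "yellow": 3,
    "yellow-green": 4,
    "purple with white base": 5,
    "white with purple base": 6,
    "white with purple margin": 7,
    "purple": 8,
}


def encode_record(growth_habit, corolla_color):
    """Map the two categorical descriptors to their numeric codes (hypothetical helper)."""
    return {
        "growth_habit": GROWTH_HABIT_CODES[growth_habit.lower()],
        "corolla_color": COROLLA_COLOR_CODES[corolla_color.lower()],
    }


# Invented example record, not taken from the study data.
sample = encode_record("Erect", "White")
```

Because the corolla color codes carry no metric meaning, distance-based methods that consume them directly treat code differences as distances, which is one reason the exploratory checks mentioned above matter.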

2.2. Statistical Analyses

In this study, a comprehensive analysis was employed, considering various relevant aspects within the system to understand the interdependencies among different elements and variables. The first step involved performing descriptive statistics to summarize the central tendencies, dispersion, and overall patterns in the data. Subsequently, density curves were plotted to visualize the distribution of the variables and assess underlying data trends.

Descriptive statistics (mean and standard deviation) were computed for each of the 16 genotypes across all 11 numerical variables. Kernel density estimation (KDE) curves were then plotted to visually compare the distribution of each variable across genotypes.

A Shapiro–Wilk normality test was applied to all variables (p ≤ 0.05). Given the predominant departure from normality, one-way ANOVA was applied to each variable independently to assess differences among genotypes (p ≤ 0.05).
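As a minimal sketch of this step, the two tests can be run with `scipy.stats`. The data below are synthetic stand-ins (16 groups of 5 values with invented means), so the numbers produced do not correspond to the study's results; applying the Shapiro–Wilk test to the pooled values of one trait is also an assumption of this sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for one trait: 16 genotypes x 5 replicates (values invented).
genotype_means = np.linspace(10.0, 40.0, 16)
groups = [rng.normal(loc=m, scale=1.5, size=5) for m in genotype_means]

# Shapiro-Wilk normality test on the pooled values of the trait.
all_values = np.concatenate(groups)
w_stat, p_normal = stats.shapiro(all_values)

# One-way ANOVA: does the trait mean differ among genotypes?
f_stat, p_anova = stats.f_oneway(*groups)

print(f"Shapiro-Wilk: W={w_stat:.3f}, p={p_normal:.4f}")
print(f"One-way ANOVA: F={f_stat:.2f}, p={p_anova:.4g}")
```

In practice the same two calls would be repeated for each of the numerical variables.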

MANOVA was conducted in Minitab 18 using all 11 numerical variables simultaneously, with genotype as the grouping factor. Wilks’ Lambda, Lawley–Hotelling Trace, Pillai’s Trace, and Roy’s Largest Root were all evaluated at p ≤ 0.05.

Four test statistics were used in MANOVA. Wilks’ Lambda assesses the overall multivariate effect by comparing the determinant of the error matrix with that of the total matrix. The Lawley–Hotelling Trace is the sum of the eigenvalues of the product of the inverse error matrix and the hypothesis matrix. Pillai’s Trace, a more robust measure, is the sum of the eigenvalues of the product of the inverse total matrix and the hypothesis matrix. Roy’s Largest Root is the largest eigenvalue of the product of the inverse error matrix and the hypothesis matrix. Together, these tests provide a comprehensive assessment of the differences among the groups [31]. The multivariate analysis of variance (MANOVA) model is given by Equation (1), and the four test statistics by Equations (2)–(5).

Y = XB + E

(1)

where:

Y is the matrix of dependent variables (e.g., agronomic traits such as stem diameter, plant height, and pH);
X is the design matrix representing the independent variables or experimental factors (e.g., Capsicum varieties);
B is the matrix of coefficients associated with the effects of the explanatory variables;
E is the matrix of residual errors, representing the unexplained variation.

In MANOVA, two fundamental matrices are defined:

H (Hypothesis matrix): represents the variation explained by the model (i.e., between-group variation);
E (Error matrix): represents the residual or within-group variation.

The statistical tests used in MANOVA are defined as follows:

Wilks’ Lambda

Λ = \frac{|E|}{|H + E|}

(2)

Λ (Lambda) is the test statistic;
|E| and |H + E| denote the determinants of the error matrix and of the total (hypothesis plus error) matrix, respectively.
Lower values of Λ indicate greater differences between groups.

Lawley–Hotelling Trace

U = tr(E⁻¹ H)

(3)

U is the test statistic;
tr(·) denotes the trace of a matrix (sum of diagonal elements);
E⁻¹ is the inverse of the error matrix.

This statistic evaluates the ratio between explained and residual variation.

Pillai’s Trace

V = tr((H + E)⁻¹ H)

(4)

V is the test statistic;
(H + E)⁻¹ represents the inverse of the total variation matrix.

Pillai’s Trace is considered one of the most robust MANOVA statistics, especially when assumptions such as multivariate normality are not fully met.

Roy’s Largest Root

θ = max eigenvalue of E⁻¹ H

(5)

θ (theta) is the largest eigenvalue;
eigenvalue represents the magnitude of variance explained along a given direction.

This statistic focuses on the maximum separation between groups.
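The four statistics can be computed directly from the H and E matrices. The following NumPy sketch builds both matrices for a one-way design and evaluates each statistic from the eigenvalues of E⁻¹H, with Wilks’ Lambda taken as the determinant ratio |E| / |H + E|. The function name `manova_statistics` and the synthetic data are assumptions for illustration; the study itself ran MANOVA in Minitab 18, and p-values are not computed here.

```python
import numpy as np


def manova_statistics(X, labels):
    """One-way MANOVA test statistics from the hypothesis (H) and error (E) matrices."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    p = X.shape[1]
    H = np.zeros((p, p))  # between-group sums of squares and cross-products
    E = np.zeros((p, p))  # within-group sums of squares and cross-products
    for g in np.unique(labels):
        Xg = X[labels == g]
        mean_g = Xg.mean(axis=0)
        d = (mean_g - grand_mean)[:, None]
        H += Xg.shape[0] * (d @ d.T)
        centered = Xg - mean_g
        E += centered.T @ centered
    # Eigenvalues of E^{-1} H (real and non-negative in exact arithmetic).
    eig = np.clip(np.linalg.eigvals(np.linalg.solve(E, H)).real, 0.0, None)
    return {
        "wilks_lambda": np.linalg.det(E) / np.linalg.det(H + E),
        "lawley_hotelling": eig.sum(),       # tr(E^{-1} H)
        "pillai": (eig / (1.0 + eig)).sum(),  # tr((H + E)^{-1} H)
        "roy": eig.max(),                    # largest eigenvalue of E^{-1} H
        "eigenvalues": eig,
    }


# Synthetic example: 3 groups x 10 observations x 3 traits (values invented).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
X = rng.normal(size=(30, 3)) + labels[:, None] * np.array([2.0, 0.0, -1.0])
result = manova_statistics(X, labels)
```

All four statistics are functions of the same eigenvalues, which is why they usually agree on significance; they differ in how much weight they give to the largest eigenvalue.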

Principal component analysis (PCA) is used to simplify multivariate datasets and identify patterns in them. In this study, the data were projected onto two components, referred to as PC1 and PC2. Dimensionality reduction should preserve as much as possible of the information present in the original data. The principal components are linear combinations of the original variables that are mutually uncorrelated (orthogonal) [32].

For PCA, all 11 numerical variables were standardized (zero mean, unit variance; scikit-learn StandardScaler) prior to analysis. Two principal components (PC1 and PC2) were retained for visualization, and a biplot was generated to display sample projections and variable loadings simultaneously.
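A minimal sketch of this procedure with scikit-learn is shown below. The 80 × 11 matrix is synthetic, so the explained-variance values it yields are not those of the study; scaling the loadings by the square root of the component variance is a common biplot convention and an assumption here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the 80 x 11 matrix of numerical traits (values invented).
X = rng.normal(size=(80, 11))
X[:, 1] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=80)  # two correlated traits

# Standardize to zero mean and unit variance, then keep two components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)  # sample coordinates on PC1 and PC2

# Loadings (variable vectors of a biplot), scaled by the component standard deviation.
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
explained = pca.explained_variance_ratio_

print(f"PC1: {explained[0]:.1%}, PC2: {explained[1]:.1%}")
```

Plotting `scores` as points and `loadings` as arrows on the same axes gives the biplot layout described above.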

The dendrogram is an essential tool in cluster analysis, used to separate variables or observations based on their similarities. Through the dendrogram, variables or observations that show greater similarity are grouped together. Each branch of the dendrogram represents a cluster, and the height of the lines connecting the clusters on the y-axis indicates the level of dissimilarity or distance between them [33]. In this study, the Ward clustering method was adopted, a hierarchical approach that minimizes the variation within the clusters by systematically combining them. This method is particularly effective in identifying homogeneous groups and is described by Equation (6).

D (s, t) = \sqrt{\frac{2 n_{s} n_{t}}{n_{s} + n_{t}} \sum_{i = 1}^{p} {(x_{s i} - x_{t i})}^{2}}

(6)

where:

D(s,t) is the distance between cluster s and cluster t;
n_s and n_t are the number of observations in clusters s and t, respectively;
x_si and x_ti are the means of the ith variable for clusters s and t, respectively;
p is the number of variables.

Hierarchical clustering was performed on the same standardized data using Ward’s linkage method with Euclidean distance (scipy.cluster.hierarchy), and results were visualized as a dendrogram.
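The clustering step can be sketched with `scipy.cluster.hierarchy` as follows. The data are synthetic (three well-separated groups of six observations), and cutting the tree into three clusters is an arbitrary choice for the illustration rather than a value from the study.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)

# Synthetic stand-in: 3 well-separated groups of 6 observations on 3 traits.
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(6, 3)) for c in centers])

# Standardize each trait (zero mean, unit variance) before clustering.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Ward linkage on Euclidean distances.
Z = linkage(X_std, method="ward", metric="euclidean")

# Cut the tree into 3 clusters (arbitrary choice for this example).
cluster_ids = fcluster(Z, t=3, criterion="maxclust")

# Dendrogram layout without drawing; tree["leaves"] gives the leaf order.
tree = dendrogram(Z, no_plot=True)

# Smallest pairwise Euclidean distance, which is the height of the first merge.
closest_pair = pdist(X_std).min()
```

For two single observations (n_s = n_t = 1), Equation (6) reduces to the ordinary Euclidean distance, so the first merge in `Z` occurs at the smallest pairwise distance.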

After conducting the cluster analysis, we also performed a Pearson correlation analysis to further explore the relationships between the variables. This additional step allowed us to quantify the strength and direction of the linear relationships between pairs of variables, providing deeper insights into how these variables interact within the dataset [34].
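A short pandas sketch of this correlation step is given below. The three columns and their values are synthetic, and the 0.5 threshold used to list strongly related pairs is an illustrative choice for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 80

# Synthetic stand-in for three traits (values invented).
pericarp = rng.normal(3.0, 1.0, n)
df = pd.DataFrame(
    {
        "pericarp_thickness": pericarp,
        "fruit_diameter": 8.0 * pericarp + rng.normal(0.0, 3.0, n),
        "ph": rng.normal(5.5, 0.3, n),
    }
)

# Pearson correlation matrix.
corr = df.corr(method="pearson")

# Keep each pair once (upper triangle) and list those with |r| > 0.5.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
strong_pairs = pairs[pairs.abs() > 0.5]

print(strong_pairs)
```

The resulting matrix `corr` is what a correlation heatmap displays.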

Confidence ellipses illustrate two-dimensional confidence regions, constructed around mean estimates to show where the samples are expected to lie and to visualize the uncertainty around those estimates. They are built from the variance–covariance matrix, a symmetric square matrix containing information about the relationships among variables in a multivariate dataset [35]. In general, the ellipse is obtained using Equation (7):

[\begin{matrix} x_{1} \\ x_{2} \end{matrix}] = [\begin{matrix} μ_{1} \\ μ_{2} \end{matrix}] + c \times [\begin{matrix} \sqrt{λ_{1}} h_{11} \cos (θ) - \sqrt{λ_{2}} h_{12} \sin (θ) \\ \sqrt{λ_{1}} h_{21} \cos (θ) + \sqrt{λ_{2}} h_{22} \sin (θ) \end{matrix}], \quad c = \sqrt{χ_{p, α/2}^{2}}, \quad 0 \leq θ \leq 2 π

(7)

$[\begin{matrix} x_{1} \\ x_{2} \end{matrix}]$ are the coordinates of a point on the confidence ellipse;
$[\begin{matrix} μ_{1} \\ μ_{2} \end{matrix}]$ represents the means of the two variables (the center of the ellipse);
$λ_{1}$ and $λ_{2}$ are the eigenvalues of the variance–covariance matrix of the variables;
$h_{i j}$ are the elements of the eigenvector matrix that orients the ellipse in space;
c is a scaling factor based on the critical value of the Chi-square distribution χ² with p degrees of freedom at significance level α, which adjusts the size of the ellipse to the desired confidence level;
θ is the angle that traces the points along the ellipse.

In this study, the use of confidence ellipses in analyzing 16 genotypes of Capsicum peppers allows for the visualization and understanding of the relationship between multiple measurable characteristics of these varieties, such as spiciness, color, and fruit size, enabling a visual analysis of the interactions among these variables.
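The construction of a confidence ellipse from the variance–covariance matrix can be sketched as follows. The function name `confidence_ellipse` and the synthetic data are assumptions, and the scaling factor is taken here as the square root of the chi-square quantile at the chosen confidence level with 2 degrees of freedom, which is one common convention and may differ from the exact quantile used in the study.

```python
import numpy as np
from scipy.stats import chi2


def confidence_ellipse(points, level=0.95, n_points=200):
    """Return (ellipse_xy, center, covariance) for 2D data (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    # Eigen-decomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scaling factor from the chi-square distribution (2 variables -> df = 2).
    c = np.sqrt(chi2.ppf(level, df=2))
    # Points on the unit circle, stretched by sqrt(eigenvalues) and rotated by eigenvectors.
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)
    circle = np.stack([np.cos(theta), np.sin(theta)])
    ellipse = center[:, None] + c * eigvecs @ (np.sqrt(eigvals)[:, None] * circle)
    return ellipse.T, center, cov


# Synthetic example: two positively correlated traits (values invented).
rng = np.random.default_rng(3)
data = rng.multivariate_normal([3.0, 25.0], [[1.0, 2.4], [2.4, 9.0]], size=80)
ellipse_xy, center, cov = confidence_ellipse(data, level=0.95)

# Every ellipse point lies at the same squared Mahalanobis distance from the center.
diff = ellipse_xy - center
mahal_sq = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
```

A narrow, elongated ellipse corresponds to one dominant eigenvalue, that is, to a strong linear relationship between the two variables.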

2.3. Machine Learning Models

Random Forest (RF) is a machine learning algorithm based on ensemble learning. During training it builds numerous decision trees, each fitted to a random sample of the data, and then combines their outputs, by voting for classification or averaging for regression, to produce more accurate and stable predictions [36,37]. RF has been reported to achieve high classification accuracy in comparative analyses with other methods. For example, Ref. [38] used the weighted mean, median, and median of components to combine predictions in genotype distribution modeling and compared RF with other methods.

Three classification algorithms were implemented to distinguish among the 16 Capsicum genotypes: K-Nearest Neighbors (KNN), RF, and XGBoost [38,39,40]. The input features used in all models were five variables selected based on the Pearson correlation analysis: pericarp thickness, fruit diameter, corolla color, seeds per fruit, and weight5f. The target variable was genotype, label-encoded as integer classes using scikit-learn’s LabelEncoder.

Extreme Gradient Boosting (XGBoost) is another machine learning technique used for regression and classification, similar to RF [41]. Unlike RF, which constructs a collection of independent decision trees, XGBoost uses a sequential boosting approach, where simpler models are built and subsequently corrected based on the errors of previous models [42].

Along with these methods, K-Nearest Neighbors (KNN) is a machine learning algorithm used for both classification and regression, notably applied in classification problems [39]. KNN classifies Capsicum genotypes by comparing characteristics like color, size, shape, and spiciness. It identifies the ‘k’ nearest neighbors in the training data, where ‘k’ is user-defined, and assigns the most common class among them to the new instance. Though simple and versatile, KNN is computationally intensive due to distance calculations between the test and all training instances.

It is worth noting that we chose not to use Convolutional Neural Networks (CNNs) or Artificial Neural Networks (ANNs) due to the nature and small size of the dataset, which consisted of only five samples per pepper genotype (Figure 1). While CNNs and ANNs are powerful techniques, they generally require large volumes of data to achieve superior performance, as the authors of [34] point out, especially in tasks involving complex patterns such as images. Additionally, these networks demand greater computational power and are more prone to overfitting when trained on small datasets. In contrast, more traditional techniques like KNN, RF, and XGBoost are better suited for small tabular datasets, offering greater simplicity and interpretability, and requiring fewer computational resources, making them more appropriate for the context of this study.

Model performance was assessed using four metrics: recall, precision, F1-score, and accuracy. Additionally, ROC curves were generated for each model using a one-vs-rest approach with label binarization, and the area under the curve (AUC) was calculated and compared across models via boxplot, allowing for a comprehensive evaluation of classification performance across all 16 genotype classes.
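The evaluation workflow can be sketched with scikit-learn as below, using Leave-One-Out cross-validation as stated in the abstract. Everything specific in the sketch is an assumption: the data are synthetic (4 classes of 5 samples with 5 features), the hyperparameters (`n_neighbors=3`, `n_estimators=100`) are illustrative, standardizing the features before KNN is a choice of this example, and XGBoost is omitted to keep the example dependent on scikit-learn only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 4 genotypes x 5 replicates x 5 features (values invented).
names = np.repeat(["T1", "T2", "T3", "T4"], 5)
centers = np.arange(4)[:, None] * 5.0 * np.ones((1, 5))
X = np.repeat(centers, 5, axis=0) + rng.normal(size=(20, 5))
y = LabelEncoder().fit_transform(names)  # integer class labels 0..3

models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # Each sample is predicted by a model trained on all the other samples.
    proba = cross_val_predict(model, X, y, cv=LeaveOneOut(), method="predict_proba")
    pred = proba.argmax(axis=1)  # column index equals the encoded class label
    results[name] = {
        "accuracy": accuracy_score(y, pred),
        "f1_macro": f1_score(y, pred, average="macro"),
        "auc_ovr": roc_auc_score(y, proba, multi_class="ovr"),  # one-vs-rest AUC
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

With five replicates per class, Leave-One-Out keeps four samples of every class in each training fold, so the predicted probability columns line up across folds and a single one-vs-rest AUC can be computed from the pooled out-of-fold predictions.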

CNNs and ANNs were not employed due to the small dataset size, which increases the risk of overfitting and limits generalization capacity [43]. Traditional algorithms such as KNN, Random Forest, and XGBoost are better suited for small tabular datasets, offering greater simplicity, interpretability, and lower computational demand.

All statistical analyses were performed using Python (version 3.10.12), with the following libraries: pandas for data manipulation, scipy for one-way ANOVA and hierarchical clustering, scikit-learn for PCA and data standardization, and matplotlib and seaborn for visualization. The MANOVA was conducted using Minitab 18.

3. Results

3.1. Phenotypic Variability and Data Distribution

Using the collected data and the initial classification established in Table 1, a robust dataset was developed to enable detailed analysis of the various genotypes. To explore possible differences among genotypes, we analyzed the behavior of data from nine specific variables, using density curves. This approach allows for a detailed visualization of the data distribution for each variable, facilitating the identification of distinct patterns among the genotypes (Figure 2).

The genotype T11 (Peter) exhibited consistent distributions for stem diameter, plant height, and stem length, with density peaks concentrated around specific values. For canopy diameter, T13 (Jalapeño) showed a well-defined peak around 35 cm, indicating low variability, whereas T2 (Lady’s Finger) displayed a broader distribution. T12 and T5 presented distinct peaks for seeds per fruit, while pericarp thickness showed pronounced peaks in T12, T7, and T4, with greater variability observed in T6 and T9. For fruit weight and diameter, T12 exhibited narrow and concentrated distributions, indicating uniformity, whereas T6 (Yellow Goat Arari) showed greater dispersion. The pH distribution for T16 also exhibited a defined peak, suggesting consistency in this parameter. Overall, these results highlight the variability among genotypes and provide relevant insights into genetic diversity and the suitability of these traits for machine learning classification.

A comprehensive normality test was conducted to assess the overall distribution of plant parameters, aiming to understand the overall behavior of the variables under study (Supplementary 1).

3.2. Multivariate Analysis and Pattern Recognition

To investigate the presence of statistically significant differences between groups, MANOVA analysis was performed on the data. This technique allows us to discern whether the observed means are influenced by variation between groups or if they can be attributed to chance. The results of this detailed analysis are presented in Table 3.

The MANOVA results indicate significant differences among the 16 pepper genotypes studied. All four tests (Wilks’ Lambda, Lawley–Hotelling Trace, Pillai’s Trace, and Roy’s Largest Root) were highly significant, with reported p-values of 0.000 (i.e., p < 0.001). Wilks’ Lambda was close to zero (0.000), which suggests that the differences between the groups are very pronounced. The Lawley–Hotelling Trace was 200.28449, indicating a clear distinction among the genotypes. Pillai’s Trace, which is a more robust measure, was 8.11467, confirming that the observed differences are significant. Roy’s Largest Root, at 93.03567, indicates strong separation among the groups along the dimension of greatest difference. In summary, all four tests lead to rejection of the null hypothesis that all genotypes are equal, evidencing significant variation among the 16 pepper genotypes analyzed.

Subsequently, we conducted a principal component analysis (PCA) to assess the proportion of information retained in the collected data. This technique shows how much of the total variance is explained by each principal component and, cumulatively, how many components are needed to retain most of the information (Figures 3–6).

We observe that the first principal components capture most of the variance in the data, with the blue line quickly reaching a plateau (Figure 3). This indicates that a relatively small number of principal components is needed to capture the majority of the information contained in the original dataset. The proportion of explained variance stabilizes near 1.0, suggesting that the addition of further components beyond this point offers marginal benefit in explaining the total variance of the data.

Figure 4 presents a PCA biplot, which illustrates the influence of each numerical variable on the principal components in the study of peppers from the Capsicum genus. The biplot also shows the position of each pepper type within the multivariate space defined by the first two principal components, PC1 and PC2.

On the horizontal axis (PC1), which explained 31.9% of the variance, the variables stem diameter (SD), fruit weight (FW), and plant height (PH) have long vectors pointing in the same direction, suggesting a substantial contribution to this component. On the vertical axis (PC2), which accounted for 21.1% of the variance, stem length (SL) and fruit diameter (FD) appear to be more influential.

Brix degree (BD) and weight for five fruits (W5F) have shorter vectors, indicating a smaller contribution to the variation captured by these two principal components (Figure 4). The proximity of data points (representing the genotypes) to a variable’s vector suggests that this variable is a distinguishing characteristic for those genotypes. For example, the genotypes represented by T6 and T16, which are close to the pH vector, can be differentiated from other genotypes based on their pH.

Ward’s method was employed, and the resulting dendrogram helps discern the natural groupings among the pepper varieties, providing a clear visual representation of the relationships and similarities among the genotypes (Figure 5).

The analysis of the dendrogram revealed different levels of phenotypic similarity among the pepper varieties. The dense grouping of varieties such as T6 and T9 in the orange cluster suggests a high phenotypic proximity, indicating a common lineage or low genetic diversification. These groupings may result from selective processes that have maintained similar characteristics over time. In contrast, clusters containing fewer varieties, such as the green cluster with T3 and T8, highlight more distinct lineages that have possibly undergone evolutionary processes leading to more pronounced differences. The proximity between varieties such as T1, T11, and T16 in the gray cluster and T4, T12, and T10 in the red cluster indicates significant similarities, suggesting they share genetic or phenotypic traits that bring them closer. These clustering patterns reveal how selection processes, crossbreeding, and evolution have influenced the characteristics of these Capsicum varieties, providing insights into their origins and potential pathways for genetic improvement.

Subsequently, a Pearson correlation analysis was conducted, which enabled the identification of which agronomic traits are most relevant to the Capsicum genotypes (Figure 6).

The correlation analysis revealed relationships among various characteristics of the peppers (Figure 6), with positive correlations between pericarp thickness and fruit diameter (r = 0.78), corolla color and pericarp thickness (r = 0.77), corolla color and the number of seeds per fruit (r = 0.76), and pericarp thickness and the number of seeds per fruit (r = 0.75). The weight of five fruits (weight5f) and fruit diameter showed a moderate positive correlation (r = 0.57), indicating a linear relationship.

Pairs of variables with correlation values above 0.5, as identified, point to a strong linear relationship and are potential candidates for classification tasks in machine learning models, as variation in one variable can be used to predict variation in another with a reasonable degree of accuracy. Based on the previous analysis, four variables have been selected for use in machine learning classification models (pericarp thickness, fruit diameter, corolla color, and seeds per fruit). To ensure good performance of the models, it is important to check if the data falls within a 95% confidence interval. This step is essential not only for enhancing model accuracy but also for ensuring that the predictive relationships between the variables are statistically robust and not due to random chance (Figure 7).

The relationship between pericarp thickness and fruit diameter (Figure 7A) indicates a clustering of data along an upward imaginary line, suggesting that as pericarp thickness increases, fruit diameter also tends to increase. The narrow and elongated ellipse signals that this correlation is strong, with consistent and predictable variations between the two measurements. This relationship could be quantitatively assessed using a correlation coefficient, such as Pearson’s coefficient, and a hypothesis test could be conducted to verify the significance of this correlation.
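
The geometry of a 95% confidence ellipse can be sketched as follows, assuming the two variables are approximately bivariate normal. The ellipse axes follow the eigenvectors of the sample covariance matrix, scaled by the chi-square quantile with two degrees of freedom. The data values and the names `confidence_ellipse` and `inside_ellipse` are hypothetical and do not come from the study.

```python
import math


def confidence_ellipse(xs, ys, level=0.95):
    """Return center, semi-axes, and orientation of a bivariate-normal confidence ellipse."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # The chi-square quantile with 2 degrees of freedom has a closed form.
    q = -2.0 * math.log(1.0 - level)
    # Eigenvalues of the 2x2 covariance matrix.
    half_trace = (sxx + syy) / 2.0
    disc = math.sqrt(max(half_trace ** 2 - (sxx * syy - sxy ** 2), 0.0))
    lam_major, lam_minor = half_trace + disc, half_trace - disc
    return {
        "center": (mx, my),
        "semi_major": math.sqrt(q * lam_major),
        "semi_minor": math.sqrt(q * lam_minor),
        "angle_rad": 0.5 * math.atan2(2.0 * sxy, sxx - syy),
        "cov": (sxx, sxy, syy),
        "chi2_quantile": q,
    }


def inside_ellipse(ellipse, x, y):
    """True if (x, y) lies within the ellipse (squared Mahalanobis distance <= quantile)."""
    sxx, sxy, syy = ellipse["cov"]
    det = sxx * syy - sxy ** 2
    dx, dy = x - ellipse["center"][0], y - ellipse["center"][1]
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 <= ellipse["chi2_quantile"]


# Hypothetical, strongly correlated measurements (not the study data).
xs = [1.0, 1.3, 1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 4.3, 4.7]
ys = [2.1, 2.4, 3.2, 3.9, 4.6, 4.9, 6.2, 6.0, 8.8, 9.1]

ellipse = confidence_ellipse(xs, ys)
print(f"axis ratio (minor/major) = {ellipse['semi_minor'] / ellipse['semi_major']:.3f}")
```

A small minor-to-major axis ratio corresponds to the narrow, elongated ellipse described above, whereas a ratio near 1 would indicate a weak linear relationship.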

On the other hand, the vertical arrangement of the blue points in the “Corolla & Pericarp” graph (Figure 7B) suggests greater variability in the pericarp thickness for a given corolla color, indicating a weaker correlation between these two characteristics. The final figure, labeled “Overlapped Normalized Confidence Ellipses with Points,” represents a superimposition of the confidence ellipses from the previous three graphs, but normalized to a common scale. This normalization enables a visual comparison of data behaviors across the determined intervals, providing insight into the variability and spread of the datasets within each feature combination.

For instance, while the “Corolla & Pericarp” graph displays the vertical arrangement of blue points, suggesting greater variability in pericarp thickness relative to corolla color, this relationship is more apparent in the normalized ellipse overlap. The exact overlap between the red and blue ellipses in the final graph reflects the data variation and is not necessarily indicative of a strong correlation between the variables. This visual representation aids in identifying relative trends and comparisons that might not be immediately obvious in the individual graphs.

3.3. Machine Learning Model Performance

The final phase of this study focuses on classifying different genotypes of peppers using machine learning models (Supplementary 2).

This algorithm outlines a structured approach for applying machine learning to the classification of pepper types, detailing the process from data loading and feature selection to model evaluation and metric calculation. The Leave-One-Out (LOO) cross-validation method is specifically employed for training and testing the models, ensuring a thorough evaluation by using each observation in the dataset as a single test case while the remainder serves as the training set.
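
The Leave-One-Out procedure can be sketched in a few lines. Each observation is held out once, a model is fitted on the rest, and the held-out prediction is scored. The nearest-centroid classifier and the toy data below are stand-ins chosen for brevity and are assumptions of this example; the study evaluated KNN, RF, and XGBoost.

```python
import math
from collections import defaultdict


def leave_one_out_splits(n):
    """Yield (train indices, test index) so that every observation is tested once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i


def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier: assign x to the class with the closest mean vector."""
    groups = defaultdict(list)
    for point, label in zip(train_X, train_y):
        groups[label].append(point)
    centroids = {
        label: tuple(sum(coord) / len(points) for coord in zip(*points))
        for label, points in groups.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], x))


def loo_accuracy(X, y):
    """Fraction of held-out observations that are classified correctly."""
    hits = 0
    for train_idx, test_idx in leave_one_out_splits(len(X)):
        train_X = [X[j] for j in train_idx]
        train_y = [y[j] for j in train_idx]
        hits += nearest_centroid_predict(train_X, train_y, X[test_idx]) == y[test_idx]
    return hits / len(X)


# Hypothetical two-trait observations for three classes (not the study data).
X = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
     (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
     (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]
y = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

print(f"LOO accuracy = {loo_accuracy(X, y):.2f}")
```

With n observations the model is refitted n times, which is affordable for small datasets such as this one but becomes costly as n grows.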

The optimized hyperparameters for the pepper genotype classification models (Supplementary 3) reflect a careful balance between model complexity and accuracy, with a particular focus on preventing overfitting and ensuring generalization. For KNN, the configuration favors a localized approach, while RF and XGBoost adopt simpler structures and cautious learning rates to gradually improve accuracy, avoiding the capture of data noise.
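
Hyperparameter selection under Leave-One-Out validation can be illustrated with a small grid search over the number of neighbors k. This is a library-free sketch with a hand-written KNN and hypothetical data. The names `knn_predict`, `loo_accuracy`, and `grid_search_k` are assumptions, and the actual grids used in the study are those reported in Supplementary 3.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-One-Out accuracy of KNN for a given k."""
    hits = sum(
        knn_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], k) == y[i]
        for i in range(len(X))
    )
    return hits / len(X)


def grid_search_k(X, y, k_grid):
    """Score every candidate k and return the best one (first maximum wins ties)."""
    scores = {k: loo_accuracy(X, y, k) for k in k_grid}
    best_k = max(k_grid, key=lambda k: scores[k])
    return best_k, scores


# Hypothetical two-trait observations for three classes (not the study data).
X = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
     (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
     (9.0, 1.0), (9.2, 1.1), (8.9, 0.8)]
y = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]

best_k, scores = grid_search_k(X, y, [1, 3, 7])
print(best_k, scores)
```

With only three observations per class, a large k forces each vote to include more points from other classes than from the true class. This is why a localized configuration (small k) is favored on small datasets.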

After configuration and optimization, the models were run, and one of the initial metrics evaluated was the Receiver Operating Characteristic (ROC) curve. Thirteen ROC curves were generated for each model, resulting in 15 AUC values (Figure 8).

The RF model demonstrated the greatest consistency, with most AUC values concentrated close to 1.00 and little dispersion, indicating robust and reliable performance. The KNN showed good performance in most cases, but exhibited greater variability, with AUC values ranging from approximately 0.80 to 1.00, suggesting that its performance may be inconsistent depending on the specific training and testing data used.

The XGBoost model exhibited the greatest variation in AUC values, ranging from 0.75 to 1.00, indicating that its performance is highly sensitive to data characteristics. Despite some values close to 1.00, the significant dispersion suggests that XGBoost is less prone to overfitting compared to the other models, as it does not present an excess of AUC values equal to 1.00. Thus, XGBoost could be considered the best model in terms of avoiding overfitting, offering good overall performance without sacrificing the ability to generalize to new data.

The performance of each model can be quantified by the area under the ROC curve (AUC). AUC values closer to 1 indicate superior classification accuracy. The RF model exhibits an AUC of 0.96, indicating superior performance, followed closely by the KNN model with an AUC of 0.95, while the XGBoost shows an AUC of 0.85, indicating slightly lower performance compared to the other two models (Figure 8). Although RF achieved the highest AUC, the difference relative to KNN is minimal, suggesting that both models exhibit statistically comparable performance in distinguishing among the Capsicum genotypes.
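
For a multiclass problem, one common way to obtain AUC values is the one-vs-rest scheme: the labels are binarized for each class in turn, and a binary AUC is computed from the scores assigned to that class. The sketch below uses the rank interpretation of AUC (the probability that a positive case receives a higher score than a negative one). The scores and the function names are hypothetical, and the sketch is not claimed to match the exact procedure used in the study.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, class_scores, classes):
    """Binarize the labels for each class and compute one AUC per class."""
    aucs = {}
    for c in classes:
        binarized = [1 if label == c else 0 for label in y_true]
        aucs[c] = binary_auc(binarized, [row[c] for row in class_scores])
    return aucs


# Hypothetical predicted class scores for six observations (not the study data).
classes = ["T1", "T2", "T3"]
y_true = ["T1", "T1", "T2", "T2", "T3", "T3"]
class_scores = [
    {"T1": 0.8, "T2": 0.1, "T3": 0.1},
    {"T1": 0.4, "T2": 0.5, "T3": 0.1},
    {"T1": 0.2, "T2": 0.7, "T3": 0.1},
    {"T1": 0.3, "T2": 0.6, "T3": 0.1},
    {"T1": 0.1, "T2": 0.2, "T3": 0.7},
    {"T1": 0.5, "T2": 0.1, "T3": 0.4},
]

aucs = one_vs_rest_auc(y_true, class_scores, classes)
macro_auc = sum(aucs.values()) / len(aucs)
print(aucs, f"macro AUC = {macro_auc:.3f}")
```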

In addition to AUC, other performance metrics were also employed in the evaluation of the machine learning models. Recall was used to measure the models’ ability to correctly identify all relevant positive instances. Precision was adopted to assess the proportion of positive identifications made by the models that were indeed correct.

F1-score is a metric that combines recall and precision into a single measure that seeks a balance between the two, being particularly useful in situations where the classes are imbalanced.
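
The three metrics can be computed directly from the counts of true positives (TP), false positives (FP), and false negatives (FN) for a given class: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 · precision · recall / (precision + recall). The following sketch evaluates them on hypothetical labels; the function name `per_class_metrics` is an assumption of the example.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1-score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: one T1 observation is misclassified as T2.
y_true = ["T1", "T1", "T1", "T2", "T2", "T2"]
y_pred = ["T1", "T1", "T2", "T2", "T2", "T2"]

for cls in ("T1", "T2"):
    precision, recall, f1 = per_class_metrics(y_true, y_pred, cls)
    print(cls, f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Averaging the per-class values (macro averaging) is one way to obtain a single figure for a multiclass model; whether the study used macro or weighted averaging is not stated here.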

The RF model demonstrated the best overall performance across all metrics—recall, precision, F1-score, and accuracy—with values consistently around 0.937 (Supplementary 4). This indicates that the RF was highly effective in correctly classifying the peppers, with an excellent rate of identifying relevant positive instances (recall) and a high proportion of correct positive predictions (precision). In contrast, the KNN model showed inferior performance, with metrics around 0.762, indicating a less robust performance. The XGBoost model, on the other hand, presented intermediate performance, with metrics around 0.800, showing a reasonable balance.

The ROC curves and AUC values provide additional insight into the models’ ability to distinguish between classes. The RF model achieved the highest AUC (0.96), indicating strong discriminative performance, followed closely by the KNN model (AUC = 0.95), while XGBoost exhibited a lower AUC (0.85). Although RF demonstrated a marginally higher AUC than KNN, this difference is minimal and not statistically significant. AUC values above 0.9 indicate excellent discriminative ability, suggesting that both models can be classified within the same performance category. Therefore, RF and KNN can be considered statistically comparable in their ability to discriminate among the Capsicum classes. The AUC of 0.85 obtained for XGBoost is consistent with its intermediate performance on the other metrics.

This discrepancy between the ROC curves and the individual metrics can be explained by the fact that the AUC summarizes the true positive rate against the false positive rate across all possible thresholds, while the other metrics are calculated at a single cutoff point. A model may therefore rank the classes well overall yet perform poorly at the specific cutoff used, which explains the high AUC of KNN (0.95) despite its inferior individual metrics (around 0.762).
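
The distinction between a threshold-free measure and a fixed-cutoff measure can be demonstrated with a small constructed example. The hypothetical scores below rank every positive case above every negative case, so the AUC is perfect, yet the accuracy at a conventional 0.5 cutoff is poor because all scores fall below it. The function names and values are assumptions made for illustration.

```python
def auc(pos_scores, neg_scores):
    """Probability that a positive case outranks a negative case (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


def accuracy_at(threshold, pos_scores, neg_scores):
    """Accuracy when scores at or above the threshold are predicted positive."""
    correct = sum(s >= threshold for s in pos_scores) + sum(s < threshold for s in neg_scores)
    return correct / (len(pos_scores) + len(neg_scores))


# Hypothetical scores: well ordered, but all below the 0.5 cutoff.
pos = [0.45, 0.40, 0.35]
neg = [0.30, 0.20, 0.10]

print("AUC:", auc(pos, neg))
print("accuracy at 0.50:", accuracy_at(0.50, pos, neg))
print("accuracy at 0.33:", accuracy_at(0.33, pos, neg))
```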

In addition to the AUC analysis and individual metrics, it is important to consider the confusion matrices. The confusion matrices provide a detailed view of how each model classifies the different genotypes of peppers, showing the distribution of correct and incorrect classifications for each of the 16 classes (Figure 9).
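
A confusion matrix is simply a table of counts indexed by true class (rows) and predicted class (columns). The sketch below builds one from hypothetical labels and lists the off-diagonal cells, which are the misclassifications. The labels are invented for illustration and do not reproduce the study's predictions.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]


def off_diagonal_errors(matrix, classes):
    """List (true class, predicted class, count) for every misclassification."""
    return [
        (classes[i], classes[j], matrix[i][j])
        for i in range(len(classes))
        for j in range(len(classes))
        if i != j and matrix[i][j] > 0
    ]


# Hypothetical labels for four classes (not the study's results).
classes = ["T3", "T8", "T12", "T15"]
y_true = ["T3", "T3", "T8", "T8", "T12", "T12", "T15", "T15", "T15"]
y_pred = ["T3", "T3", "T8", "T3", "T12", "T12", "T12", "T15", "T12"]

matrix = confusion_matrix(y_true, y_pred, classes)
errors = off_diagonal_errors(matrix, classes)
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(y_true)
print(matrix, errors, f"accuracy = {accuracy:.3f}")
```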

The RF model performed well overall, with most predictions correctly aligned along the diagonal. However, there were notable misclassifications: T1 had three incorrect classifications, T2 was misclassified on three occasions, T7 had one misclassification, and T15 was misclassified five times, often being identified as T12. No misclassification of T8 as T3 was observed; T8 was correctly classified as T8 (Figure 9).

For the KNN model, the performance shows significant limitations, with several misclassifications outside the main diagonal, reflecting challenges in accurately distinguishing between certain classes. Specifically, T1 had one misclassification as T3, indicating some overlap between these categories. Additionally, T15 was misclassified three times, being identified as T10 and T12, suggesting difficulty in differentiating these classes, possibly due to similarities in their feature spaces. T14 also experienced three misclassifications as T11, further highlighting the model’s struggles in distinguishing closely related categories. Notably, T11 had only one correct classification, indicating it was frequently misclassified as other varieties, likely due to shared or overlapping features with T12 and T13, as observed in previous analyses. These errors underscore the limitations of the KNN algorithm in handling complex and closely related classes, where fine-grained distinctions are required. Further analysis and refinement of feature representation could help address these classification challenges.

The XGBoost model exhibited notable challenges in classifying the Capsicum pepper varieties, as evidenced by a significant number of errors outside the main diagonal. The most pronounced issue was with T15, which was correctly classified only twice while being misclassified six times as T4, T10, T11, and T14. This suggests considerable overlap in the feature space of T15 with these varieties, potentially due to shared characteristics that the model struggled to distinguish. Similarly, T5 failed entirely, with no correct classifications, further emphasizing the model’s difficulty in recognizing certain classes.

Beyond these specific cases, all varieties except T4, T9, and T16 were subject to misclassification, underscoring broader limitations in the model’s ability to handle complex relationships within the dataset. These results indicate that XGBoost may require enhanced feature engineering to better capture the unique traits of each genotype. Additionally, refining the training process through hyperparameter tuning or improving the representativeness of the dataset could help address the observed misclassifications, ultimately improving the model’s classification accuracy.

Based on the analysis of the ROC curves and the metrics, it is evident that the RF model performed best. RF provided a more consistent balance across all metrics, avoiding overfitting issues that could affect the other models, and demonstrating a solid ability to generalize to new data.

4. Discussion

This study demonstrates the importance and effectiveness of machine learning models in classifying different genotypes of peppers, enabling the prediction of pepper types based on characteristics of the fruits and plants, as corroborated by [44]. The use of various statistical techniques, such as MANOVA, principal component analysis (PCA), and cluster analysis, allows for a deep understanding of the variabilities and similarities among the genotypes studied.

The statistical significance found in the MANOVA for variables such as corolla color and growth habit indicates marked differences that are essential for classification [40]. Additionally, PCA and cluster analysis help to visualize and infer phenotypic relationships and similarity patterns among the genotypes. These analyses allow for identifying which variables have the most influence on the principal components and which genotypes resemble each other based on the variables analyzed. These methods provide phenotypic insights that can assist breeding strategies focused on trait selection in the development of hybrid genotypes that meet the demands of the food industries [43]. Based on the results obtained in predicting pepper types, new genotypes can be generated using the proposed models, catering to the specific taste and pungency preferences of each consumer. One can envision their application in the development of condiments with varying degrees of spiciness, unique aromas, and vibrant colors, all customized for the user’s culinary experience and optimized in their creation.

The results of this study provide a robust scientific basis for the conservation and genetic improvement of peppers, maintaining genetic diversity while promoting the development of more adaptable and productive varieties. Through detailed analysis of phenotypic traits, researchers can identify key agronomic characteristics associated with classification performance and with desirable traits, such as disease resistance and fruit yield. The performance of the machine learning models used, especially RF, which showed a strong balance across all metrics, facilitates the identification of these characteristics and the application of targeted breeding or genetic engineering techniques to enhance these qualities in crops.

The RF demonstrated superior performance to KNN in several aspects, as evidenced by the results from the metrics and confusion matrices. Firstly, RF achieved a higher AUC (0.96 compared to KNN’s 0.95), showing a superior ability to distinguish between the pepper classes. Additionally, the analysis of the confusion matrices reveals that RF made fewer classification errors, particularly in classes where KNN struggled, such as distinguishing between T11, T12, and T13. RF also exhibited a better balance between recall and precision, resulting in a higher F1-score, which indicates greater effectiveness in correctly identifying classes and reducing false positives. Finally, RF demonstrated superior generalization ability, avoiding overfitting issues that can affect KNN, especially in datasets with high variability between classes. These factors collectively indicate that although the RF model demonstrates superior performance compared to KNN in the classification of Capsicum pepper varieties, both models exhibit statistically comparable performance.

By exploring the rich genetic variability of peppers, this study demonstrates how machine learning models can be effective in classifying and predicting phenotypic patterns, opening doors to future discoveries that could revolutionize how we understand and utilize this fruit, significant in various cultures. The implications extend beyond agriculture and touch the core of biotechnology and nutrition, offering prospects for creating peppers with optimized nutritional profiles. With the technology of modeling and classification advancing rapidly, the journey to unravel the secrets of peppers is just beginning. The possibilities are vast, and the potential for innovations that benefit both producers and consumers is immense. As we progress with this research, we are moving toward a future where the ideal pepper—perfectly adapted to taste preferences, resilience, and productivity—is within everyone’s reach.

Regarding the obstacles encountered, one of the main challenges faced in this research was the use of a small dataset for machine learning tasks. The limited amount of data made it difficult to use traditional methodologies, which typically require a larger volume of samples to effectively train and validate the models. To overcome this limitation, we opted to use the Leave-One-Out cross-validation technique. This approach allowed us to maximize the use of the available samples, ensuring that each one was used for both training and testing, resulting in a more robust evaluation of the models despite the scarcity of data.

Another significant challenge was the complexity of working with a multiclass classification problem. To address the multiple classes present in the dataset, we implemented a label binarization process and utilized methods such as GridSearchCV with LOO validation to optimize the models. Multiclass analysis required additional care in defining performance metrics, and techniques such as AUC-ROC and ROC curve plotting for each model were employed to evaluate the effectiveness of each approach in distinguishing between the different classes.

5. Conclusions

The comparative analysis of performance metrics revealed that RF achieved the best overall balance, while KNN and XGBoost delivered results that may be suitable for specific contexts. The research concludes that combining various performance metrics is essential for a thorough evaluation of the models. This holistic approach not only underscores the feasibility of using machine learning in Capsicum pepper classification but also establishes a methodology for future agronomic and botanical studies that can support a wide range of practical applications. The use of Leave-One-Out cross-validation was essential to mitigate the limitation imposed by the small dataset, but it also highlighted the need for larger datasets in future research to improve model reliability. Additionally, the multiclass nature of the problem required the use of sophisticated techniques such as label binarization and model optimization, which, while effective, may still struggle with highly imbalanced classes. Addressing these limitations in future studies will enhance the robustness and applicability of machine learning models in pepper classification and beyond.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/horticulturae12050623/s1.

Author Contributions

Conceptualization, A.I.F. and F.B.M.d.S.; methodology, A.I.F. and A.F.d.S.; software, G.d.S.L. and F.A.N.V.; validation, F.B.M.d.S. and A.P.d.P.; formal analysis, P.P.B. and A.P.d.P.; investigation, P.P.B. and L.F.d.S.; resources, J.J.d.S.J. and L.F.d.S.; data curation, G.d.S.L. and J.J.d.S.J.; writing—original draft preparation, A.I.F. and F.B.M.d.S.; writing—review and editing, F.B.M.d.S., F.H.S.G. and A.P.d.P.; visualization, A.I.F. and A.P.d.P.; supervision, A.F.d.S., F.H.S.G. and F.B.M.d.S.; project administration, A.I.F. and G.G.F.; funding acquisition, F.A.N.V. and G.G.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by DPI/BCE/UnB (University of Brasília), under Public Notice No. 001/2026; by the Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG), through the Program to Support the Settlement of Young PhDs in Brazil (Grant No. BPD-01045-22); by the Fundação de Amparo à Pesquisa do Estado do Amapá (FAPEAP); and by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), through the Amazônia + 10 Initiative Project (Process No. 401521/2023-0).

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this study.

Abbreviations

The following abbreviations are used in this manuscript:

ANOVA	Analysis of Variance
MANOVA	Multivariate Analysis of Variance
PCA	Principal Component Analysis
KNN	K-Nearest Neighbors
RF	Random Forest
XGBoost	Extreme Gradient Boosting
ROC	Receiver Operating Characteristic
AUC	Area Under the Curve
NMR	Nuclear Magnetic Resonance
HS-SPME	Headspace Solid Phase Microextraction
GC–MS	Gas Chromatography–Mass Spectrometry
RGB	Red Green Blue
IoT	Internet of Things

References

1. Ramírez-Meraz, M.; Méndez-Aguilar, R.; Zepeda-Vallejo, L.G. Exploring the chemical diversity of Capsicum chinense cultivars using NMR-based metabolomics and machine learning methods. Food Res. Int. 2024, 178, 113796.
2. Munjuluri, S.; Wilkerson, D.A.; Sooch, G. Capsaicin and TRPV1 Channels in the Cardiovascular System: The Role of Inflammation. Cells 2021, 11, 18.
3. Tripodi, P.; Rabanus-Wallace, M.T.; Barchi, L. Global range expansion history of pepper (Capsicum spp.) revealed by over 10,000 genebank accessions. Proc. Natl. Acad. Sci. USA 2021, 118, e2104315118.
4. Brilhante, B.D.G.; Santos, T.O.; Santos, P.H.A.D. Phenotypic and molecular characterization of Brazilian Capsicum germplasm. Agronomy 2021, 11, 854.
5. Devi, J.; Sagar, V.; Kaswan, V. Advances in breeding strategies of bell pepper (Capsicum annuum L.). In Advances in Plant Breeding Strategies: Vegetable Crops; Springer: Cham, Switzerland, 2021; pp. 3–58.
6. Bellei, C.M.; Rodrigues-Ferreira, S.; Azevedo-dos-Santos, L. Insecticidal activity of Capsicum annuum leaf proteins on Callosobruchus maculatus. J. Asia Pac. Entomol. 2023, 26, 102158.
7. Srivastava, A.; Baliyan, N.; Mangal, M. Capsaicin: Sources, isolation, quantitative analysis and applications. In Capsaicinoids; Springer: Singapore, 2024; pp. 25–53.
8. Li, H.; Gao, Z.; Tan, C. CRS: An online database of Capsicum annuum RNA-seq libraries. Sci. Hortic. 2023, 312, 111864.
9. Martinez, M.; Santos, C.P.; Verruma-Bernardi, M.R. Agronomic, physicochemical and sensory evaluation of pepper hybrids (Capsicum chinense). Sci. Hortic. 2021, 277, 109819.
10. Sudianto; Herdiyeni, Y.; Haristu, A.; Hardhienata, M. Chilli quality classification using deep learning. In Proceedings of the 2020 International Conference on Computer Science and Its Application in Agriculture (ICOSICA); IEEE: Kuala Lumpur, Malaysia, 2020; pp. 1–5.
11. Ramírez-Meraz, M.; Méndez-Aguilar, R.; Hidalgo-Martínez, D. Experimental races of Capsicum annuum cv. jalapeño: Chemical characterization and classification by NMR/machine learning. Food Res. Int. 2020, 138, 109763.
12. Raschka, S.; Patterson, J.; Nolet, C. Machine learning in Python: Main developments and technology trends. Information 2020, 11, 193.
13. Mouloodi, S.; Rahmanpanah, H.; Gohari, S. Applications of artificial intelligence in equine biomechanical research. J. Mech. Behav. Biomed. Mater. 2021, 123, 104728.
14. Aziz, M.A.; Nazir, W.M.A.; Ali, A.M.; Abawajy, J. Chili ripeness grading simulation using machine learning. In Proceedings of the 2021 IEEE International Conference on Computing (ICOCO); IEEE: Kuala Lumpur, Malaysia, 2021; pp. 253–258.
15. Rasekh, M.; Karami, H.; Fuentes, S. Non-destructive sorting techniques for pepper using odor parameters. LWT 2022, 164, 113667.
16. Tripathi, A.; Waqas, A.; Venkatesan, K. Building flexible machine-learning-ready multimodal oncology datasets. Sensors 2024, 24, 1634.
17. Sharma, P.; Hans, P.; Gupta, S.C. Classification of plant leaf diseases using machine learning. In Proceedings of the 2020 10th International Conference on Cloud Computing, Data Science & Engineering; IEEE: Kuala Lumpur, Malaysia, 2020; pp. 480–484.
18. Azimi, S.; Kaur, T.; Gandhi, T.K. A deep learning approach to measure stress level in plants due to nitrogen deficiency. Measurement 2021, 173, 108650.
19. Jafar, A.; Bibi, N.; Naqvi, R.A. Revolutionizing agriculture with artificial intelligence: Plant disease detection methods. Front. Plant Sci. 2024, 15, 1356260.
20. Kundu, N.; Rani, G.; Dhaka, V.S. Deep learning models for disease classification in bell pepper. In Proceedings of the 2020 Sixth International Conference on Parallel, Distributed and Grid Computing; IEEE: Kuala Lumpur, Malaysia, 2020; pp. 243–247.
21. Yee-Rendon, A.; Torres-Pacheco, I.; Trujillo-Lopez, A.S. Analysis of New RGB Vegetation Indices for PHYVV and TMV Identification in Jalapeño Pepper (Capsicum annuum) Leaves Using CNNs-Based Model. Plants 2021, 10, 1977.
22. Park, J.R.; Kang, H.H.; Cho, J.K. Rapid determination of piperine using NIR and multivariate statistical analysis. Foods 2020, 9, 1437.
23. Niu, W.; Tian, H.; Zhan, P. Pepper volatile flavor compounds using HS-SPME–GC–MS and multivariate statistics. Molecules 2022, 27, 7760.
24. Espichán, F.; Rojas, R.; Quispe, F. Metabolomic characterization of Peruvian chili peppers. Food Chem. 2022, 386, 132704.
25. Ye, Z.; Shang, Z.; Li, M. Evaluation of physicochemical qualities of pickled Chinese pepper. Food Res. Int. 2020, 137, 109535.
26. González-López, J.; Rodríguez-Moar, S.; Silvar, C. Correlation Analysis of High-Throughput Fruit Phenomics and Biochemical Profiles in Native Peppers (Capsicum spp.) from the Primary Center of Diversification. Agronomy 2021, 11, 262.
27. Sahmat, S.S.; Rafii, M.Y.; Oladosu, Y. Genotype and environment interactions in chilli yield attributes. Sci. Rep. 2024, 14, 1698.
28. Yun, B.H.; Yu, H.Y.; Kim, H. Geographical discrimination of Asian red pepper powders using NMR and deep learning. Food Chem. 2024, 439, 138082.
29. Ding, H.; Tian, J.; Yu, W. Application of artificial intelligence and big data in the food industry. Foods 2023, 12, 4511.
30. IPGRI—International Plant Genetic Resource Institute. Descriptor for Capsicum (Capsicum spp.); International Plant Genetic Resource Institute: Rome, Italy, 1995; 49p.
31. Goudet, J. hierfstat: A package for R to compute hierarchical F-statistics. Mol. Ecol. Notes 2005, 5, 184–186.
32. Kurita, T. Principal Component Analysis (PCA). In Computer Vision; Springer: Cham, Switzerland, 2020; pp. 1–4.
33. Forina, M.; Armanino, C.; Raggio, V. Clustering with dendrograms on interpretation variables. Anal. Chim. Acta 2002, 454, 13–19.
34. Harrou, F.; Zeroual, A.; Hittawe, M.M.; Sun, Y. Recurrent and convolutional neural networks for traffic management. In Road Traffic Modeling and Management; Elsevier: Amsterdam, The Netherlands, 2022; pp. 197–246.
35. Wang, B.; Shi, W.; Miao, Z. Confidence analysis of standard deviational ellipse and its extension into higher dimensional Euclidean space. PLoS ONE 2015, 10, e0118537.
36. Reis, I.; Baron, D.; Shahaf, S. Probabilistic RF for noisy datasets. Astron. J. 2019, 157, 16.
37. Lakshmanaprabu, S.K.; Shankar, K.; Ilayaraja, M. RF for big data classification in IoT. Int. J. Mach. Learn. Cybern. 2019, 10, 2609–2618.
38. Valavi, R.; Elith, J.; Lahoz-Monfort, J.; Guillera-Arroita, G. Modelling species presence-only data with random forests. Ecography 2021, 44, 1731–1742.
39. Zhu, X.; Ying, C.; Wang, J. Ensemble of ML-KNN for classification algorithm recommendation. Knowl. Based Syst. 2021, 221, 106933.
40. Kirana, R.; Handayani, T.; Harmanto; Anwarudin, M.J. Selection of chili pepper hybrid variety candidate based on flower characteristics. IOP Conf. Ser. Earth Environ. Sci. 2023, 1172, 012023.
41. Sahin, E.K. Predictive capability of ensemble tree methods for landslide susceptibility mapping. SN Appl. Sci. 2020, 2, 1308.
42. Abedi, R.; Costache, R.; Shafizadeh-Moghadam, H.; Pham, Q.B. Flash-flood susceptibility mapping using machine learning. Geocarto Int. 2022, 37, 5479–5496.
43. Zhang, J.; Xie, Y.; Ali, B. Genome-wide identification of Rboh family genes in pepper. Trop. Plant Biol. 2021, 14, 251–266.
44. Meshram, V.; Patil, K.; Meshram, V. Machine learning in agriculture: A state-of-the-art survey. AI Life Sci. 2021, 1, 100010.

Figure 1. Experimental and analytical sequence adopted for classifying peppers using machine learning, showing the methodological flow of the experiment from the initial data collection to the final analysis using machine learning techniques. The dataset obtained from plant parameters was subjected to multivariate statistical analysis, essential to understanding the complex relationships between the different characteristics of the plants. Notes: M1–M5 represent the five independent measurements (replicates) performed for each Capsicum genotype.

Figure 2. Density distribution of nine agronomic traits across 16 Capsicum genotypes.

Figure 3. Principal component analysis (PCA) among pepper variables from the Capsicum genus.

Figure 4. Principal component analysis (PCA) biplot showing the distribution of Capsicum pepper genotypes based on agronomic traits: stem diameter (SD), plant height (PH), canopy diameter (CD), stem longitude (SL), seeds per fruit (SPF), pericarp thickness (PT), fruit weight (FW), fruit diameter (FD), pH, brix degree (BD), and weight5f (W5F). The vectors indicate the contribution of each agronomic trait to the Capsicum pepper phenotypes.

Figure 5. Hierarchical clustering dendrogram generated using Ward’s method based on Euclidean distance, considering all measured agronomic traits. Each observation represents an individual replicate (n = 5 per genotype), which accounts for the repeated occurrence of Capsicum genotypes in the dendrogram.

Figure 6. Pearson correlation heatmap illustrating the relationships among agronomic traits of Capsicum genotypes. Correlation coefficients range from −1 (strong negative correlation) to 1 (strong positive correlation), with color intensity and circle size representing the magnitude of the relationships.

Figure 7. Relationships among fruit morphological traits represented by scatter plots with 95% normalized confidence ellipses. Pericarp thickness vs. fruit diameter (red ellipse, (A)). Corolla color vs. pericarp thickness (blue ellipse, (B)). Corolla color vs. seed count per fruit (green ellipse, (C)). Overlapping of the three confidence ellipses with their respective data points, illustrating the joint distribution of the variables (D). Axes are normalized, ranging from −2.0 to 3.0 on the X-axis and −2.5 to 3.0 on the Y-axis.

Figure 8. Boxplot of AUC scores for KNN, RF, and XGBoost models in the classification of Capsicum genotypes.

Figure 9. Confusion matrices for RF, KNN, and XGBoost models in the classification of 16 Capsicum pepper varieties.

Table 1. Catalog of Capsicum Varieties with Assigned Identification Codes.

Scientific Name	Common Name in Brazil	Translation to English	Code
Capsicum frutescens	Grisu	Grisu	T1
Capsicum baccatum	Dedo de moça	Lady’s Finger	T2
Capsicum chinense	BRS Seriema	BRS Seriema	T3
Capsicum annuum	Vulcão	Volcano	T4
Capsicum chinense	Biquinho amarela	Little Beak Yellow	T5
Capsicum annuum	Pimentão amarelo	Yellow Bell Pepper	T6
Capsicum chinense	Guaraci Cumari do Pará	Guaraci Cumari from Pará	T7
Capsicum chinense	BRS Moema	BRS Moema	T8
Capsicum annuum	Doce Italiana	Sweet Italian	T9
Capsicum frutescens	Tabasco	Tabasco	T10
Capsicum frutescens	Peter	Peter	T11
Capsicum frutescens	Malagueta	Malagueta	T12
Capsicum annuum	Jalapenho	Jalapeño	T13
Capsicum annuum	Jamaican red	Jamaican Red	T14
Capsicum annuum	Jamaican yellow	Jamaican Yellow	T15
Capsicum chinense	Bode Amarela Arari	Yellow Goat Arari	T16

Table 2. Mean and standard deviation of quantitative variables measured for 16 Capsicum genotypes.

Measured Variable	Statistic	T1	T2	T3	T4	T5	T6	T7	T8	T9	T10	T11	T12	T13	T14	T15	T16
stem diameter (mm)	mean	9.83	14.076	12.814	11.608	11.926	11.284	14.256	12.076	11.372	11.142	12.954	10.834	8.276	9.972	12.278	9.432
stem diameter (mm)	std	2.107	2.248	1.371	0.898	1.109	0.891	1.18	1.248	0.535	0.905	0.39	1.167	1.231	0.729	1.065	1.022
plant height (cm)	mean	41.8	66.4	66.6	47.4	47.5	42.2	42.16	37.7	46.04	54	45.4	47.2	41.9	40.9	50	36.8
plant height (cm)	std	4.817	13.126	5.079	3.912	4.359	6.211	6.301	3.718	2.955	9.407	0.822	5.495	4.006	2.074	2.915	5.45
canopy diameter (cm)	mean	41.774	78.5	58.98	49.624	65.758	51.09	69.776	58.83	45.458	43.95	45.264	47.75	35.556	44.302	52.62	32.398
canopy diameter (cm)	std	4.55	6.586	5.092	4.879	4.343	4.358	4.157	2.398	2.579	3.743	4.911	4.373	2.084	6.368	3.111	3.476
stem length (cm)	mean	9.4	13	41.6	11.06	16.8	18.1	8.9	11.7	15.4	30.2	6.54	13.4	11.7	16.8	18.4	18.56
stem length (cm)	std	1.395	1.581	3.435	1.479	0.972	2.408	1.673	1.754	1.673	1.44	0.74	1.134	1.857	1.605	2.903	1.992
seeds per fruit	mean	112.8	75.2	33	38.2	47.4	123.2	37	36	143.6	45.4	93	19.2	44.8	126	103.8	22.2
seeds per fruit	std	12.317	14.237	2.739	4.266	3.362	13.18	2.55	3.606	9.813	3.05	5.244	2.49	5.45	7.874	2.168	4.438
pericarp thickness (mm)	mean	1.988	2.058	2.456	1.294	1.476	4.66	1.202	1.642	3.034	1.03	1.914	0.988	4.254	2.224	1.434	1.722
pericarp thickness (mm)	std	0.396	0.239	0.339	0.084	0.095	0.346	0.077	0.114	0.505	0.137	0.454	0.07	0.267	0.411	0.316	0.354
fruit weight (g)	mean	0.266	0.96	0.457	0.264	0.705	0.369	0.664	0.817	0.441	0.562	0.3	0.252	0.367	0.376	0.982	0.661
fruit weight (g)	std	0.043	0.114	0.082	0.037	0.097	0.105	0.048	0.129	0.093	0.082	0.062	0.018	0.064	0.034	0.116	0.074
fruit diameter (mm)	mean	14.2	16.02	14.6	7.8	15.2	51.2	10.772	16.2	45.1	8	22	6.8	27.2	50	48.2	14
fruit diameter (mm)	std	1.304	1.689	0.548	0.837	0.837	5.933	0.872	1.483	4.478	0.707	4.243	0.447	1.789	1.581	2.049	0.707
weight5f (g)	mean	13.48	7.92	1.446	1.728	1.886	42.386	1.13	2.246	42.66	3.018	13.83	0.432	20.052	13.408	12.038	0.96
weight5f (g)	std	0.69	0.952	0.058	0.107	0.11	8.102	0.113	0.284	2.015	4.297	0.787	0.089	4.369	1.295	1.435	0.089
brix degree	mean	8.44	8.042	7.844	8.91	9.052	7.64	8.94	9.61	7.23	8.99	8.33	12.8	7.376	8.26	6.874	6.262
brix degree	std	2.06	0.915	1.069	1.69	0.807	0.727	0.532	0.151	0.148	0.27	0.504	0.308	0.438	0.691	0.952	0.262
pH	mean	4.518	4.596	4.162	5.484	4.394	4.596	3.952	4.454	4.082	6.96	4.728	4.816	5.198	4.848	4.886	4.486
pH	std	0.236	0.347	0.223	0.33	0.215	0.322	0.544	0.268	0.841	0.23	0.266	0.249	0.232	0.44	0.426	0.172

Table 3. Multivariate analysis of variance (MANOVA) comparing the 16 Capsicum genotypes, reported for four test criteria: Wilks’ Lambda, Lawley–Hotelling Trace, Pillai’s Trace, and Roy’s Largest Root.

Criterion	Test Statistic	Approx. F	Num DF	Denom DF	p
Wilks’	0.00000	35.070	165	503	0.000
Lawley–Hotelling	200.28449	63.341	165	574	0.000
Pillai’s	8.11467	12.000	165	704	0.000
Roy’s	93.03567	-	165	-	0.000
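
The degrees of freedom and approximate F values in Table 3 can be checked with the standard F approximations for one-way MANOVA. The sketch below is a worked example under stated assumptions, not the authors' procedure: it assumes p = 11 response traits (the variables listed in Table 2), g = 16 genotypes, and N = 80 observations (5 replicates per genotype, as noted in the caption of Figure 5). The function names are hypothetical.

```python
# Assumptions (not stated in Table 3 itself): p = 11 traits as listed in
# Table 2, g = 16 genotypes, N = 16 * 5 = 80 observations.
p = 11
g = 16
N = g * 5

q = g - 1                  # hypothesis degrees of freedom
v = N - g                  # error degrees of freedom
s = min(p, q)
m = (abs(p - q) - 1) / 2
n = (v - p - 1) / 2


def pillai_f(V):
    """Standard F approximation for Pillai's trace V."""
    df1 = s * (2 * m + s + 1)
    df2 = s * (2 * n + s + 1)
    F = (df2 / df1) * V / (s - V)
    return F, df1, df2


def lawley_hotelling_f(U):
    """Standard F approximation for the Lawley-Hotelling trace U."""
    df1 = s * (2 * m + s + 1)
    df2 = 2 * (s * n + 1)
    F = df2 * U / (s * df1)
    return F, df1, df2


# Test statistics copied from Table 3.
F_pillai, df1_pillai, df2_pillai = pillai_f(8.11467)
F_lh, df1_lh, df2_lh = lawley_hotelling_f(200.28449)

print(f"Pillai:           F = {F_pillai:.3f}, DF = ({df1_pillai:.0f}, {df2_pillai:.0f})")
print(f"Lawley-Hotelling: F = {F_lh:.3f}, DF = ({df1_lh:.0f}, {df2_lh:.0f})")
```

Under these assumptions, the numerator DF is 165 for both criteria, the denominator DFs are 704 and 574, and the F values agree with Table 3 to within rounding. Pillai's trace is bounded above by s = 11, so the reported value of 8.11467 indicates strong separation among genotypes.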

Share and Cite

MDPI and ACS Style

Freire, A.I.; Souza, A.F.d.; Leal, G.d.S.; Souza, F.B.M.d.; Verri, F.A.N.; Balestrassi, P.P.; Paiva, A.P.d.; Júnior, J.J.d.S.; Silva, L.F.d.; Garcia, F.H.S.; et al. Using Machine Learning to Classify Capsicum Genotypes Based on Agronomic Traits. Horticulturae 2026, 12, 623. https://doi.org/10.3390/horticulturae12050623

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
