Color Identification on Heterogeneous Bean Landrace Seeds Using Gaussian Mixture Models in CIE L*a*b* Color Space

López-Lobato, Adriana-Laura; Avendaño-Garrido, Martha-Lorena; Acosta-Mesa, Héctor-Gabriel; Morales-Reyes, José-Luis; Aquino-Bolaños, Elia-Nora

doi:10.3390/mca30030064


by Adriana-Laura López-Lobato ¹,*, Martha-Lorena Avendaño-Garrido ², Héctor-Gabriel Acosta-Mesa ¹, José-Luis Morales-Reyes ³ and Elia-Nora Aquino-Bolaños ³

¹ Artificial Intelligence Research Institute, University of Veracruz, Xalapa 91097, Mexico

² Faculty of Mathematics, University of Veracruz, Xalapa 91097, Mexico

³ Center for Food Research and Development, University of Veracruz, Xalapa 91190, Mexico

* Author to whom correspondence should be addressed.

Math. Comput. Appl. 2025, 30(3), 64; https://doi.org/10.3390/mca30030064

Submission received: 15 April 2025 / Revised: 30 May 2025 / Accepted: 3 June 2025 / Published: 6 June 2025

(This article belongs to the Special Issue Feature Papers in Mathematical and Computational Applications 2025)


Abstract

The classification of bean landraces based on their coloration is of particular interest, as the color of these plants is associated with the nutritional components present in their seeds. In this paper, the authors propose a procedure to identify the colors of bean landraces with heterogeneously colored seeds based on the information from their digital images. The proposed methodology combines a three-dimensional histogram representation of the estimated color, expressed in the CIE L*a*b* color space, with an unsupervised learning method called the Gaussian Mixture Model. This approach yields representative information for the colors of a bean landrace, represented as points in the CIE L*a*b* color space. Furthermore, the K-nn method can be trained with these point representations to identify colors, yielding satisfactory results on landraces with homogeneous and heterogeneous seeds.

Keywords:

bean landrace analysis; Gaussian Mixture Model; Gini index; CIE L*a*b* color space; optimization

1. Introduction

The common bean (Phaseolus vulgaris L.) is an important legume in Mexico and Central America, due to its high consumption and significant nutritional value, including proteins, fiber, and other components with antioxidant properties [1]. These properties are well-documented for their ability to prevent cardiovascular diseases and chronic degenerative conditions such as diabetes, obesity, and cholesterol [2,3].

The identification of bean varieties intended for commercial consumption is typically based on the morphology and pigmentation of the seeds, with only those varieties exhibiting similarities between grains being considered [4]. However, there are bean varieties that are cultivated domestically for private consumption, and which exhibit divergent characteristics due to their adaptation to different agricultural practices and the conditions of the rural communities, such as altitude, climate, and soil [5,6,7,8]. These bean varieties are known as common bean landraces, and their seeds can be either homogeneous (exhibiting similar coloration) or heterogeneous (different or variegated colors or patterns) [3], see Figure 1. This diversity makes them very difficult to classify.

Also, it has been established that there is a direct correlation between the chemical composition of common bean landraces and their colorimetric properties [9]. Dark bean landraces (black, red, or a mixture of dark grains) present higher levels of anthocyanin content and antioxidant activity in comparison to white or yellow bean landraces [2]. Furthermore, proanthocyanidins, flavonol glycosides, and anthocyanins have been identified as the compounds responsible for the coloration of bean seeds [10]. Consequently, it is imperative to classify bean landraces according to their coloration and utilize this classification as a reference point for the potential health benefits to the consumer.

The spectrometry laboratory technique has been employed to characterize the color of bean landraces. However, it should be noted that this device only illuminates an area of 8 mm diameter when observing an object, thus only a small portion of the seed surface can be sampled to determine color. Consequently, an average of these measurements may not represent the entire seed [2,11]. This only considers seeds of homogeneous color; therefore, the worst-case scenario would occur if the bean landrace contains seeds of heterogeneous color [11].

Several works have proposed the acquisition of color directly from a digital camera image [12,13,14,15,16,17]. These studies have focused on the analysis of color in fruits, such as grapes and strawberries, and in bean seeds, primarily because a digital image is capable of providing a complete color distribution of the individual analyzed. In the bean landrace case, the color distribution of the seed or a set of seeds can be obtained.

The utilization of digital images in the analysis of the colorimetric properties for classification and characterization involves the employment of numerous computer vision systems. An example of this is the use of color averages in a specific region of interest to represent an individual and determine their classification as a particular variety [16]. However, it has been observed that such averages are unable to adequately represent the color distributions observed in bean landraces [2]. Consequently, studies that utilize these metrics in conjunction with seed color patches [18,19] or morphological characteristics, such as seed shape and size [20], are regarded as more comprehensive. The methodologies employed to analyze the shape and texture of each bean sample encompass the use of Fourier elliptic descriptors [21], the Principal Component Analysis of grey histograms of a bean color region [22], and the dominant color technique on each bean [23].

It has been observed that the primary challenge encountered in the analysis and classification of bean landraces pertains to the heterogeneity in coloration among the seeds of a particular landrace [24]. Consequently, it is more appropriate to consider color distributions within the landraces. The utilization of probability distributions to characterize the color of bean landraces has been documented in [12,25], where two-dimensional and three-dimensional histograms were employed to describe the coloration of a set of seeds.

The work most closely related to this project is presented in [26], where an unsupervised learning model was employed to concentrate the information of these three-dimensional histograms into points in the CIE L*a*b* color space. This information compression allows for efficient classification of bean landraces with homogeneous color seeds based on their color and variety.

An adaptation of the technique proposed in [26] is employed in this work to address the challenge of classifying common bean landraces with heterogeneous colored seeds. This project uses the three-dimensional histogram in the CIE L*a*b* color space representing the landrace color information to fit a Gaussian Mixture Model (GMM). The GMM is a parametric model whose density function is a linear combination of K multivariate normal (Gaussian) distributions. In this context, the number of Gaussians must correspond to the number of colors in the bean landrace analyzed. For this purpose, the Gini Index (GI) algorithm [27,28] is employed, since it is distinguished by its ability to identify the number of Gaussian components present in the dataset. Consequently, in the event of the bean landrace exhibiting seeds of heterogeneous color, the GI algorithm identifies as many Gaussian components as colors, and the representative color information of the bean landrace is characterized by the means of the Gaussian components found.

To our knowledge, the classification of common bean landraces with heterogeneous seeds characterized by points of representative colors in the CIE L*a*b* color space has not yet been explored; therefore, this work aims to demonstrate that the color information on landraces with homogeneous and heterogeneous color seeds can be characterized as points in the CIE L*a*b* color space using their corresponding GMMs fitted with the GI algorithm. Then, the K-nn method can be trained to classify/identify the colors present in the landraces of heterogeneous seeds.

The novel approach proposed in this work for color characterization, which employs the information of the Gaussian components of a GMM as points in the CIE L*a*b* color space, offers a significant reduction in computational cost when compared with previous techniques that manage the information of two-dimensional and three-dimensional histograms.

The remaining sections of the paper are organized as follows. Section 2 details the image acquisition, segmentation, and extraction of color information. In addition, a description of the Gaussian Mixture Model and the GI algorithm is provided, along with the proposed methodology. Section 3 presents the obtained results, and Section 4 describes the discussion based on these results. The conclusions are presented in Section 5, and Section 6 is added to complement and clarify the drawbacks and limitations of the proposed methodology. Finally, future work is presented in Section 7.

2. Materials and Methods

This section provides a concise overview of the process employed to acquire bean landrace images, the segmentation of the seeds, and the representation utilized to obtain the color information. It also presents a detailed exposition of the Gaussian Mixture Model and the Gini Index (GI) algorithm, along with the proposed methodology for color identification in heterogeneous bean landrace seeds.

2.1. Bean Landraces: Image Acquisition, Segmentation, and Extraction of Color Information

The common bean landraces utilized in this study were procured from rural communities in Oaxaca, Mexico. A total of 60 g of seeds, exhibiting diverse coloration patterns, were employed to represent each of the 168 bean landraces analyzed.

The classification of seeds according to their color is divided into seven distinct categories: black, brown, pink, purple, red, white, and yellow. Illustrative examples of the bean landraces associated with these color categories are presented in Figure 2.

To obtain a digital image of the bean landraces that closely resembled the color of their seeds, the artificial illumination environment and color image reproduction workflow delineated in [12,25] were employed. Once the digital images in the CIE L*a*b* color space are obtained, a segmentation step is applied to extract the color information of the bean landrace seeds as three-dimensional histograms (3D histograms) of color. The following subsections describe this process, which is summarized visually in Figure 3.

2.1.1. Artificial Illumination Environment

The illumination configuration comprises a controlled lighting system that ensures uniform illumination, mitigates glare, and reduces specular brightness for shiny bean landraces. The setup, depicted in Figure 4, consists of an aluminum box measuring 68 cm × 68 cm for the base and 60 cm in height (Figure 4A). A diffusion box inside this aluminum box decreases the specular reflections and shadows among seeds (Figure 4B). The dimensions of the diffusion box are 38 cm × 38 cm for the base and 45 cm in height. Eight fluorescent bulbs are distributed between the boxes to illuminate the environment with a color temperature of 6500 K. A hole at the top of the box allows for the camera lens placement (Figure 4C).

2.1.2. Image Acquisition: The Color Image Reproduction Workflow

The image acquisition process entailed meticulously placing the seeds on a sliding base within the aluminum box to ensure the collection of unobstructed and well-lit images. The blue color was strategically employed to differentiate the seeds from their background to achieve optimal visual contrast, see Figure 5.

A camera configured for RAW image acquisition was employed. The ARW file format retains the complete data captured by the camera sensor. Darktable Software (Version 4.4.2) was then used to export each image from the ARW format to the TIFF format without compression. During this process, each TIFF image is assigned a standard ICC profile. ICC profiles are imperative in image conversion to the color space of a disparate device, given that each device possesses its own distinct color space. Finally, Matlab software (Version 2023b) replaced the standard ICC profile with a custom ICC profile. The custom ICC profile used corresponds to the one reported in [12]. This profile was converted from the RGB color space to the CIE L*a*b* color space. For a more detailed description of this process, please refer to [12,25].

The CIE L*a*b* color space is utilized in this study, as it is intended to investigate the chromaticity channels of luminosity. Furthermore, it is acknowledged as the color space most frequently used in numerous laboratory devices for measuring color. Several studies have demonstrated that this particular space is well-suited for characterizing the color of bean landraces, thereby ensuring enhanced precision in the results obtained [25]. Furthermore, it is the color space that most closely aligns with human vision perception.

The CIE L*a*b* color space is three-dimensional. The vertical axis represents the luminance L*, with values ranging from 0 (black) to 100 (white). The primary color axes range from red to green (the a* axis) and from yellow to blue (the b* axis). This is illustrated in Figure 6.

Within this color space, each possible color is represented by a unique point in the coordinate system (L*, a*, b*). For example, the colors black and white are located at coordinates (0, 0, 0) and (100, 0, 0), respectively.

2.1.3. Image Segmentation

The subsequent step was to identify the regions of interest. To this end, the region-growing segmentation algorithm was employed, which utilizes the Euclidean distance as a criterion of similarity between pixels. This algorithm requires a seed pixel as the initial point, from which the region expands as long as the neighboring pixels are sufficiently similar. Given the homogeneity of the background, the seed pixel was selected in that area.

The segmentation process yields a binary image (see Figure 7), thus facilitating the identification of the regions of interest, in this case the bean landrace seeds, which are represented by the black regions in the binary image. This enables the accurate acquisition of the color information of the bean landrace seeds.
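The region-growing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `region_grow`, the 4-neighborhood, the comparison against the seed pixel's color, and the tolerance `tol` are all assumptions introduced here.

```python
from collections import deque

import numpy as np


def region_grow(image, seed, tol):
    """Return a boolean mask of the region grown from `seed`.

    image: (H, W, 3) float array of L*a*b* values.
    seed:  (row, col) start pixel, assumed to lie in the background.
    tol:   hypothetical threshold on the Euclidean distance between a
           candidate pixel and the seed pixel's color.
    """
    h, w, _ = image.shape
    ref = image[seed].astype(float)        # color of the seed pixel
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        # Visit the four direct neighbors of the current pixel.
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr, nc]:
                if np.linalg.norm(image[nr, nc] - ref) <= tol:
                    mask[nr, nc] = True
                    queue.append((nr, nc))
    return mask
```

Because the seed pixel is placed in the background, the grown region is the background itself, and the bean seeds would be recovered as the complement of the returned mask (`~mask`).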

2.1.4. Extraction of Color Information: 3D Histograms

In the context of a digital image, a three-dimensional histogram (3D histogram) can be derived by using the three channels of the CIE L*a*b* color space. The color histogram bins are calculated by discretizing the CIE L*a*b* chromaticity channels into 256 values ($2^{8}$). This discretization was necessary because the range of channels a* and b* lies within [−128, 127]. The values for the L* channel were normalized and scaled to 256 values. Consequently, each 3D histogram is represented as a three-dimensional matrix of size $256 \times 256 \times 256$, as illustrated in Figure 8.

The frequency of each color bin, denoted by $f_{ijk}$, is specified in Equation (1), where $p_{ijk}$ denotes the number of pixels with color channel values belonging to bin $(i, j, k)$, and N is the total number of pixels in the image.

$f_{ijk} = \frac{p_{ijk}}{N}$

(1)

The values $f_{ijk}$ define a joint probability mass function $f(X, Y, Z)$, since the conditions presented in Equations (2) and (3) are fulfilled when $f(X = i, Y = j, Z = k) = f_{ijk}$.

$f_{ijk} \geq 0, \quad \forall (i, j, k) \in X \times Y \times Z .$

(2)

$\sum_{i \in X} \sum_{j \in Y} \sum_{k \in Z} f_{ijk} = 1 .$

(3)

To characterize the color of a set of seeds, the corresponding joint probability mass function is defined solely with the pixels of the image that correspond to the seeds. So, the segmentation step is necessary to capture the colorimetric representation of each bean landrace as the joint probability distribution of their pixel color values. An example of a bean landrace 3D histogram is presented in Figure 9, where the blue squares represent bins of the color space with frequencies greater than 0. These are the colors present in the seeds of this bean landrace.
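As an illustration of Equations (1) to (3), the sketch below builds the histogram from the seed pixels only. It is a sketch under assumptions: the function name `lab_histogram` is hypothetical, N is taken to be the number of seed pixels (as the paragraph above implies), and only the occupied bins are stored rather than a dense $256 \times 256 \times 256$ array.

```python
import numpy as np


def lab_histogram(lab, mask, bins=256):
    """Sparse 3D color histogram of the masked pixels.

    lab:  (H, W, 3) array with L* in [0, 100] and a*, b* in [-128, 127].
    mask: (H, W) boolean array, True on seed pixels.
    Returns (idx, freq): occupied bins as an (n, 3) integer array and
    their relative frequencies f_ijk = p_ijk / N.
    """
    pix = lab[mask].astype(float)                  # (N, 3) seed pixels only
    scale = (bins - 1) / 255.0
    i = np.rint(pix[:, 0] / 100.0 * (bins - 1))    # L* rescaled to a bin index
    j = np.rint((pix[:, 1] + 128.0) * scale)       # a* shifted to 0..bins-1
    k = np.rint((pix[:, 2] + 128.0) * scale)       # b* shifted to 0..bins-1
    ijk = np.clip(np.stack([i, j, k], axis=1), 0, bins - 1).astype(int)
    # Count how many pixels fall into each occupied bin (p_ijk).
    idx, counts = np.unique(ijk, axis=0, return_counts=True)
    return idx, counts / counts.sum()
```

The returned frequencies are non-negative and sum to one, so they satisfy the two conditions of a joint probability mass function.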

A subsequent examination of the scatterplots of the 3D histograms revealed a concentration of points in several regions of the CIE L*a*b* color space. This representation facilitates the implementation of a statistical approach known as the Gaussian Mixture Model (GMM). This model enables the description of the color behavior in the bean landrace through a mixture of Gaussian distributions, using the “means” of these distributions as representative information, or indicators, of the colors present in the bean landrace.

The subsequent section provides detailed descriptions of the GMM and the Gini Index (GI) algorithm, which are necessary for the methodology proposed in this work.

2.2. Gaussian Mixture Model and Gini Index Algorithm

The Gaussian Mixture Model (GMM) is a parametric, multimodal model widely utilized for unsupervised learning problems, specifically density estimation [29]. The objective of density estimation is to employ a parametric distribution to model the behavior of a given dataset, as illustrated in Figure 10. Figure 10a presents a plot of a two-dimensional dataset, where concentrations of points are observed in specific regions. In contrast, Figure 10b offers a graphical representation of a continuous density function derived from a GMM learned from this dataset. This function is characterized by peaks or hills corresponding to the concentrations of points in the analyzed set, providing a concise depiction of the data’s behavior.

The parametric density function employed for a Gaussian Mixture Model is a linear combination of K multivariate normal distributions, as shown in Equation (4). In this equation, the mean $μ_{k}$ and the covariance matrix $Σ_{k}$ of the k-th Gaussian component $f_{θ_{k}}$ are collected in $θ_{k} = (μ_{k}, Σ_{k})$, for $k = 1, \dots, K$, and N denotes the dimension of the data. The parameters $ϕ_{k}$ represent the weights or mixing proportions, which must satisfy the restrictions presented in Equation (5).

$\sum_{k=1}^{K} ϕ_{k} f_{θ_{k}}(x), \quad \text{with} \quad f_{θ_{k}}(x) = \frac{1}{(2π)^{N/2} |Σ_{k}|^{1/2}} \exp\left(-\frac{1}{2} (x - μ_{k})^{T} Σ_{k}^{-1} (x - μ_{k})\right) .$

(4)

$0 \leq ϕ_{k} \leq 1, \ \text{for } k = 1, \dots, K, \quad \text{and} \quad \sum_{k=1}^{K} ϕ_{k} = 1 .$

(5)

The GMM facilitates the modeling of dataset behavior by estimating the parameter $θ = \{ϕ_{k}, θ_{k}\}_{k=1}^{K}$.

One method of determining the values of these parameters is to employ Fisher’s maximum likelihood method, using the log-likelihood function $L$, as shown in Equation (6), where $\{x_{m}\}_{m=1}^{M}$ denotes the dataset. This approach facilitates the implementation of the EM algorithm, which in turn enables the identification of the target values [29]. However, it should be noted that this algorithm is subject to certain limitations and performance concerns [30].

$L = \sum_{m=1}^{M} \log \left( \sum_{k=1}^{K} ϕ_{k} f(x_{m} \mid μ_{k}, Σ_{k}) \right) .$

(6)
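To make the mixture density and the log-likelihood concrete, the following sketch evaluates both with NumPy. The function names are hypothetical, and the Gaussian component uses the standard N-dimensional normalization constant $(2π)^{N/2} |Σ_{k}|^{1/2}$; it is an illustration, not the code used in the study.

```python
import numpy as np


def gaussian_pdf(x, mu, cov):
    """Multivariate normal density evaluated at the rows of x, shape (M, N)."""
    n = mu.shape[0]
    diff = x - mu
    inv = np.linalg.inv(cov)
    # Quadratic form (x - mu)^T cov^{-1} (x - mu) for every row of x.
    quad = np.einsum("mi,ij,mj->m", diff, inv, diff)
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm


def gmm_pdf(x, phis, mus, covs):
    """Mixture density: weighted sum of the K Gaussian components."""
    return sum(phi * gaussian_pdf(x, mu, cov)
               for phi, mu, cov in zip(phis, mus, covs))


def log_likelihood(x, phis, mus, covs):
    """Log-likelihood of the dataset x under the mixture, as in Equation (6)."""
    return float(np.sum(np.log(gmm_pdf(x, phis, mus, covs))))
```

For a single standard normal component in two dimensions, the density at the origin is $1/(2π)$, which gives a quick sanity check of the normalization.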

Consequently, in 2006, there was a significant increase in the use of distances between probability distributions involving optimal transportation problems, such as the Kantorovich metric and the Gini index problem, for solving density estimation problems [31]. The approach adopted in this study for estimating the values of the GMM parameters for a given dataset is the GI algorithm, which utilizes the Gini index problem as the basis for the calculation [27,28].

The Gini index (GI) is the optimal value obtained by solving a variant of the optimal transportation problem. In the GI problem [32,33], two probability distributions, denoted by $ν_{1}$ and $ν_{2}$, are considered over the values of a random variable Y. A distance d in Y is employed as the cost function. The GI problem is stated in Equation (7), where $Π(ν_{1}, ν_{2})$ denotes the set of joint probability distributions on $Y \times Y$ with marginals $ν_{1}$ and $ν_{2}$ in the first and second factors, respectively.

$GI(ν_{1}, ν_{2}) = \min_{π \in Π(ν_{1}, ν_{2})} \left\{ \int_{Y \times Y} d(y_{1}, y_{2}) \, dπ \right\} .$

(7)

From a graphical perspective, the objective is to identify a joint distribution function $π$ on $Y \times Y$ such that the projections of $π$ onto the axes coincide with the distributions $ν_{1}$ and $ν_{2}$, see Figure 11.
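For two distributions on a small finite set, the problem in Equation (7) reduces to a linear program, which the sketch below solves with SciPy's `linprog`. The discrete setting, the cost $d(y_{1}, y_{2}) = |y_{1} - y_{2}|$, and the function name `transport_cost` are assumptions made for illustration; the GI algorithm of [27,28] does not solve the problem this way.

```python
import numpy as np
from scipy.optimize import linprog


def transport_cost(points, nu1, nu2):
    """Minimize sum_ab d(y_a, y_b) * pi_ab over couplings pi of nu1 and nu2.

    points:   (n,) finite support shared by both distributions.
    nu1, nu2: (n,) probability vectors on that support.
    Returns the optimal cost and the optimal coupling as an (n, n) array.
    """
    n = len(points)
    # Cost vector for pi flattened row by row: entry a * n + b is |y_a - y_b|.
    cost = np.abs(points[:, None] - points[None, :]).ravel()
    # Row sums of pi must equal nu1; column sums must equal nu2.
    a_rows = np.kron(np.eye(n), np.ones((1, n)))
    a_cols = np.kron(np.ones((1, n)), np.eye(n))
    res = linprog(cost,
                  A_eq=np.vstack([a_rows, a_cols]),
                  b_eq=np.concatenate([nu1, nu2]),
                  bounds=(0, None),
                  method="highs")
    return res.fun, res.x.reshape(n, n)
```

The two blocks of equality constraints are exactly the marginal conditions: the projections of the coupling onto each axis must reproduce $ν_{1}$ and $ν_{2}$.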

The Gini index $GI(ν_{1}, ν_{2})$ is a distance between the probability distributions $ν_{1}$ and $ν_{2}$ [34]. So, the objective of the GI algorithm is to calculate the parameters of the GMM by minimizing the distance, measured by the Gini index, between an empirical distribution of the analyzed data and the density function of the GMM [27,28,31].

In the model proposed in [27,28], the dataset P is utilized to delineate the three principal components of the GI problem. These components are the random variable Y, the distribution $ν_{1}$, and the shape of the parametric distribution $ν_{2}$. The distribution $ν_{2}$ is assumed to be a GMM of independent multivariate normal distributions, that is, Gaussians whose covariance matrices are diagonal.

Once the GI problem has been established with these three characteristics, the Lagrange multiplier method is used to derive Equations (8) and (9), which define the GI algorithm for a dataset $P = \{p_{m}\}_{m=1}^{M}$, with $p_{m} = (p_{1}^{(m)}, p_{2}^{(m)}, \dots, p_{N}^{(m)}) \in \mathbb{R}^{N}$, where the mean $μ_{k}$ is a real vector in $\mathbb{R}^{N}$, and the covariance matrix $Σ_{k}$ is a positive definite diagonal matrix of dimension $N \times N$, see Equation (10).

$μ_{tr} = \frac{\sum_{m=1}^{M} p_{r}^{(m)} \exp\left(-\frac{(p_{r}^{(m)} - μ_{tr})^{2}}{2 σ_{tr}^{2}}\right)}{\sum_{m=1}^{M} \exp\left(-\frac{(p_{r}^{(m)} - μ_{tr})^{2}}{2 σ_{tr}^{2}}\right)}, \quad \text{for } 1 \leq t \leq K \text{ and } 1 \leq r \leq N,$

(8)

$σ_{tr}^{2} = \frac{\sum_{m=1}^{M} (p_{r}^{(m)} - μ_{tr})^{2} \exp\left(-\frac{(p_{r}^{(m)} - μ_{tr})^{2}}{2 σ_{tr}^{2}}\right)}{\sum_{m=1}^{M} \exp\left(-\frac{(p_{r}^{(m)} - μ_{tr})^{2}}{2 σ_{tr}^{2}}\right)}, \quad \text{for } 1 \leq t \leq K \text{ and } 1 \leq r \leq N .$

(9)

$μ_{k} = \begin{pmatrix} μ_{k1} \\ μ_{k2} \\ \vdots \\ μ_{kN} \end{pmatrix} \quad \text{and} \quad Σ_{k} = \begin{pmatrix} σ_{k1}^{2} & 0 & \dots & 0 \\ 0 & σ_{k2}^{2} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & σ_{kN}^{2} \end{pmatrix} .$

(10)
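The sketch below applies the right-hand sides of Equations (8) and (9) once, for all components and coordinates at the same time. It is only an illustration of the two update expressions: it reads the left-hand side of Equation (9) as the variance $σ_{tr}^{2}$ (consistent with Equation (10)), the function name `gi_sweep` is hypothetical, and the initialization, stopping rule, and handling of mixture proportions in Algorithm A1 are not reproduced here.

```python
import numpy as np


def gi_sweep(P, mu, sigma2):
    """Evaluate the right-hand sides of Equations (8) and (9) once.

    P:      (M, N) data points p_m.
    mu:     (K, N) current means mu_tr.
    sigma2: (K, N) current diagonal variances sigma_tr^2.
    Returns the updated (mu, sigma2), both of shape (K, N).
    """
    diff2 = (P[:, None, :] - mu[None, :, :]) ** 2       # (M, K, N)
    w = np.exp(-diff2 / (2.0 * sigma2[None, :, :]))     # Gaussian weights
    w_sum = w.sum(axis=0)                               # shared denominator
    new_mu = (P[:, None, :] * w).sum(axis=0) / w_sum    # Equation (8)
    new_sigma2 = (diff2 * w).sum(axis=0) / w_sum        # Equation (9)
    return new_mu, new_sigma2
```

Each point contributes to a component in proportion to how close it lies to that component's current mean, so points far from a mean have almost no influence on its update.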

For the mixture proportions $ϕ_{k}$, the Law of Total Probability and Bayes’ Theorem are employed to calculate the conditional probabilities $P(f_{θ_{k}} \mid p_{m})$, for $k = 1, 2, \dots, K$. These values indicate the probability that the point $p_{m}$ originates from the Gaussian $f_{θ_{k}}$. The membership of each point is identified by taking the Gaussian component with the maximum conditional probability for that point. Therefore, the values of the mixture proportions $ϕ_{k}$ are obtained by counting the elements in each class and normalizing the counts.
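The hard-assignment rule for the mixture proportions can be sketched as follows, assuming diagonal covariances as in Equation (10). The function name `mixing_proportions` and its argument layout are hypothetical, and numerical safeguards (for example, against underflow for points far from every component) are omitted.

```python
import numpy as np


def mixing_proportions(P, phis, mu, sigma2):
    """Assign each point to its most probable Gaussian, then count and normalize.

    P:      (M, N) data points.
    phis:   (K,) current mixture proportions.
    mu:     (K, N) means; sigma2: (K, N) diagonal variances.
    Returns the new proportions, shape (K,), and the labels, shape (M,).
    """
    diff2 = (P[:, None, :] - mu[None, :, :]) ** 2
    # Log-density of each diagonal Gaussian at each point, shape (M, K).
    log_pdf = -0.5 * (diff2 / sigma2 + np.log(2.0 * np.pi * sigma2)).sum(axis=2)
    joint = phis[None, :] * np.exp(log_pdf)             # phi_k * f_k(p_m)
    post = joint / joint.sum(axis=1, keepdims=True)     # Bayes' theorem
    labels = post.argmax(axis=1)                        # maximum posterior
    counts = np.bincount(labels, minlength=len(phis))
    return counts / counts.sum(), labels
```

Under this rule, a component that wins no points receives a proportion of exactly zero.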

By assigning initial values to the parameters being sought and iterating, these expressions converge to the values of the means, covariance matrices, and mixture proportions that most accurately reflect the behavior of the data [35]. The pseudocode of the GI algorithm is presented in Algorithm A1 in Appendix A. Further details about the GI algorithm can be found in [27,28].

The experiments conducted in [27,28] demonstrated the efficacy of the GI algorithm when utilizing both real and synthetic data under various conditions. These conditions encompass variations in quantity, the intersection of classes, dimensionality, and non-normal conditions. This finding supports the assertion that the GI algorithm is a robust methodology for fitting a GMM.

It is also essential to consider the critical advantage of the GI algorithm over the EM algorithm. Both approaches allow for estimating the parameters of a GMM. However, the GI algorithm is distinguished by its ability to identify the number of Gaussian components present in a dataset, unlike the classical EM algorithm, which always fits the K components specified by the user.

The present paper proposes using the GI algorithm to identify as many Gaussian components as there are colors in a bean landrace. The mean vectors obtained when fitting the GMM represent the colors of a bean landrace as points in the CIE L*a*b* color space. This color characterization enables a substantial reduction in computational cost when compared with previous color identification techniques that manage all the information presented in two-dimensional and three-dimensional histograms. The subsequent section describes the methodology employed to perform this process.

2.3. Proposed Methodology

In this section, the methodology proposed in this work is described. The GI algorithm is used to concentrate the bean landrace 3D histogram information in several 3D points in the CIE L*a*b* color space. These points are representative of the colors observed in each bean landrace.

Subsequent to the acquisition of these points, the K nearest neighbors’ method, denoted as K-nn [29], will be employed to classify them. This method has been demonstrated to be both simple and effective for small datasets [36,37], as is the case in the present situation. Moreover, the theoretical framework employed in the implementation of this method, which considers the proximity between instances, aligns with the hypothesis that points representing the colors of the bean landraces exhibiting analogous colors are closer together.

The following steps constitute the methodological framework of the color analysis:

1.: Data cleansing on the bean landrace 3D histogram.
The dataset P is obtained by considering positions whose normalized frequencies exceed a predefined threshold. This threshold is determined by dividing the maximum value of the normalized frequencies of the histogram in 3D by a parameter t given by the user. In Figure 12, a graphical representation of the variation in the number of positions that are taken into consideration before the thresholding process (Figure 12A) and after the thresholding process (Figure 12B) is presented. After thresholding, the representative positions constitute a dataset P for the specific landrace.
2.: Transfer of data to space $[0, 100] \times [0, 100] \times [0, 100]$.
The information of each of the three axes (L*, a*, and b*) of the dataset P is transferred to the interval $[0, 100]$ to apply the GI algorithm, which is defined over a domain of the form $Y \times Y \times Y$. Consequently, the information is obtained in the space $[0, 100] \times [0, 100] \times [0, 100]$.
3.: Gaussian fitting with the GI algorithm.
The GI algorithm is employed to fit K Gaussians in the dataset P, where K is the number of colors sought in the analyzed landrace. This approach enables the estimation of the means, the diagonal covariance matrices, and the mixture proportions for each color. Notably, the GI algorithm effectively identifies the diversity of colors exhibited by bean landraces. This is due to its ability to handle so-called “open set classification problems”, wherein the number of classes (colors) is not known in advance. In this step of the process, consideration is given exclusively to those Gaussian components for which the mixture proportions are greater than zero.
4.: Return the mean values to the CIE L*a*b* color space.
The equivalent mean values in the CIE L*a*b* color space are computed with the operation described in Equation (11), where $k = 1, 2, \dots, K^{'}$ indexes the $K^{'}$ Gaussian components whose mixture proportions are greater than zero, and the columns of the dataset in the CIE L*a*b* color space are denoted by $C_{j}$, $j = 1, 2, 3$. The maximum and minimum values of the analyzed axis $C_{j}$ are denoted by $M C_{j}$ and $m C_{j}$, respectively.

$μ_{k C_{j}} = \frac{μ_{k j}}{100} \cdot (M C_{j} - m C_{j}) + m C_{j} .$

(11)
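Steps 1, 2, and 4 above involve only simple array operations, which can be sketched as follows. The function names are hypothetical, the dataset P is assumed to be given in CIE L*a*b* coordinates with one row per retained bin, and the per-axis minimum and maximum are taken from P itself, in line with the definitions of $m C_{j}$ and $M C_{j}$. Step 3 (the GI fit) is not reproduced.

```python
import numpy as np


def threshold_bins(coords, freq, t):
    """Step 1: keep the positions whose frequency exceeds max(freq) / t."""
    keep = freq > freq.max() / t
    return coords[keep].astype(float)


def scale_to_0_100(P):
    """Step 2: map each column of P linearly onto the interval [0, 100]."""
    lo, hi = P.min(axis=0), P.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)    # guard against a constant axis
    return (P - lo) / span * 100.0, lo, hi


def back_to_lab(mu_scaled, lo, hi):
    """Step 4, Equation (11): return scaled means to L*a*b* coordinates."""
    return mu_scaled / 100.0 * (hi - lo) + lo
```

A larger value of t lowers the threshold and therefore keeps more positions of the histogram; the values `lo` and `hi` returned in step 2 must be stored so that step 4 can invert the scaling.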

The pseudocode of the proposed methodology is presented in Algorithm A2 in Appendix A.

The procedure outlined in this section is employed to obtain representative information of the analyzed bean landrace, expressed as the corresponding means in the CIE L*a*b* color space. The landrace means in the CIE L*a*b* color space obtained through this methodology serve as indicators of the colors of the corresponding landrace. By replicating this process for each landrace, it is possible to compare the information of all the landraces under consideration using some learning method, such as the K nearest neighbors method (K-nn), and identify the seed colors of bean landraces.

2.4. General Experiment Design

This section describes the experiments conducted to validate the performance of the proposed methodology. The project involves the identification of the colors of bean landraces’ seeds, considering seven colors: black, brown, pink, purple, red, white, and yellow (see Figure 2).

As a preliminary approximation, the bean landraces considered for these experiments have seeds of different colors but not exceeding three colorations on the same landrace (see Figure 13). Furthermore, landraces exhibiting variegated coloration were excluded from the study. The final sample comprises two hundred and six bean landraces with these characteristics. The number of landraces exhibiting the specific number of colorations is enumerated in Table 1.

To execute the experimental process, the following steps are performed 20 times for each value of K for the K-nn method, considering values from 1 to 10. Consequently, the total number of executions is 200. The experimental process is illustrated in Figure 14.

(A): Acquisition of the training dataset (See Figure 14A)
Given that the bean landraces under analysis possess different seed colors, the training dataset is structured with the color information of individual seeds previously labeled with one of the seven colors to be analyzed. The 3D histogram of each digital seed image is obtained, and the methodology proposed in Section 2.3 is employed, with one Gaussian component in the GI algorithm and a thresholding value t of 100, to get a single point (the mean of the Gaussian component) in the CIE L*a*b* color space with the representative color information of the seed. An example of the process performed with the information of one seed of a bean landrace is shown in Figure 15.
The information of several seeds labeled with the considered colors enables the characterization of the color space with groups of points, wherein the points with identical labels must be closer to each other (see Figure 16).
In the present study, a set of 762 images of single seeds of various colors is utilized in the experimental setup. Given the substantial information these bean landrace seeds can provide, a random selection of 10% of the seeds by color constitutes the training dataset for each experiment performed. The total number of seeds and the number included in the training dataset, by color, are presented in Table 2.
(B): Acquisition of the test dataset (See Figure 14B)
Two experiments are considered to acquire the test dataset. Firstly, 3D histograms of landraces with one- and two-color bean seeds are employed. Secondly, 3D histograms of landraces with one-, two-, and three-color seeds are employed. This distinction enables the analysis of whether the complexity of the classification process increases with the number of colors and bean landraces considered. The first case comprises 172 bean landraces, while the second comprises 206 bean landraces (refer to Table 1 for details).
For both cases, the methodology described in Section 2.3 is applied to each bean landrace. A threshold value t of 500 is used to clean the data, thereby ensuring that the most visible colors are well identified, and 7 Gaussians are considered in the GI algorithm since seven colors are being sought. Consequently, seven means are obtained for each bean landrace, along with a standard deviation matrix of size seven and seven mixture proportions. As previously stated in Section 2.2, the GI algorithm can identify Gaussian components with a mixture proportion of zero. Consequently, the focus is directed towards the means of the Gaussian components that have a proportion greater than zero, since this indicates a concentration of information in those components (see Figure 17). This process yields a set of means with mixture proportion greater than zero for a bean landrace i, denoted by $M_{i}$, within the CIE L*a*b* color space.
The test dataset, denoted by $M$, is formed by all the means with mixture proportion greater than zero in all the bean landraces considered.
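As a rough illustration of how the test dataset is assembled, the sketch below keeps only the component means whose mixture proportion is greater than zero and gathers them across landraces. It is written in Python for illustration only; the container `fitted`, the function names, and all numeric values are hypothetical, and only three components per landrace are shown instead of seven.

```python
def active_means(means, proportions):
    """Keep the component means whose mixture proportion is greater than zero."""
    return [tuple(m) for m, p in zip(means, proportions) if p > 0]

def build_test_set(fitted):
    """Collect the active means of every landrace into one list (the set M).

    fitted: dict mapping a landrace identifier to (means, proportions),
    a hypothetical container for the output of the GI algorithm.
    Returns (landrace_id, mean) pairs so that each mean keeps its origin.
    """
    test_set = []
    for landrace_id in sorted(fitted):
        means, proportions = fitted[landrace_id]
        for mean in active_means(means, proportions):
            test_set.append((landrace_id, mean))
    return test_set

# Made-up L*a*b* means and proportions for two landraces.
fitted = {
    "landrace_1": (
        [(20.0, 1.0, 2.0), (75.0, 5.0, 60.0), (50.0, 0.0, 0.0)],
        [0.7, 0.3, 0.0],
    ),
    "landrace_2": (
        [(30.0, 40.0, 25.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
        [1.0, 0.0, 0.0],
    ),
}
M = build_test_set(fitted)
```

Keeping the landrace identifier next to each mean is a design choice of this sketch; it makes it straightforward to regroup the predicted labels by landrace later.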
It should be noted that the bean landraces used for the test set were previously categorized by experts based on the colors visually discerned in their seeds. This classification allows the colors predicted by the proposed process to be compared with the actual colors observed, and thus the precision achieved by the proposed methodology to be estimated.
(C): Application of the K-$nn$ method (See Figure 14C)
Following the acquisition of the training and test datasets, the K-$nn$ method is employed to classify all the means in $M$. The K-$nn$ method calculates the distance between each test mean and all training means, selects the K nearest neighbors based on these distances, and assigns the majority class among them to the test mean analyzed. In this way, the labels predicted for the means in $M$, denoted by $L$, are obtained.
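The classification step can be sketched in a few lines. This is a plain Python illustration rather than the `knn()` call used in the experiments; the Euclidean distance, the tie-breaking rule, and all coordinates are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label `query` with the majority class among its k nearest training points.

    Distances are Euclidean (an assumption of this sketch). When two classes
    tie, the one whose member appears first among the sorted neighbors wins.
    """
    neighbors = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in neighbors[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training means in L*a*b* coordinates.
train_points = [
    (20.0, 0.0, 0.0), (22.0, 1.0, 1.0), (18.0, -1.0, 0.0),
    (80.0, 5.0, 60.0), (78.0, 4.0, 58.0), (82.0, 6.0, 62.0),
]
train_labels = ["Black", "Black", "Black", "Yellow", "Yellow", "Yellow"]

predicted = knn_predict(train_points, train_labels, (25.0, 2.0, 3.0), k=3)
```

Applying `knn_predict` to every mean in $M$ would give the list of predicted labels $L$.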
(D): Bean landrace label assignment (See Figure 14D)
After executing the aforementioned class assignment, the label predicted for each bean landrace i is the set of colors identified by the K-$nn$ method in its representative means, denoted by $L_{i}$. Since a set is a collection of distinct elements, if different means in the bean landrace i receive the same color label, this color appears only once in the set $L_{i}$.
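A minimal sketch of this grouping step is shown below, assuming each classified mean carries the identifier of the landrace it came from. The function name and the example identifiers are hypothetical.

```python
from collections import defaultdict

def landrace_label_sets(landrace_ids, predicted_colors):
    """Group the predicted colors of the means by landrace, as sets.

    landrace_ids[j] is the landrace that mean j belongs to, and
    predicted_colors[j] is the color assigned to mean j by the classifier.
    """
    label_sets = defaultdict(set)
    for landrace_id, color in zip(landrace_ids, predicted_colors):
        label_sets[landrace_id].add(color)  # duplicates collapse automatically
    return dict(label_sets)

# Made-up example: landrace "A" has two means labeled Black.
ids = ["A", "A", "A", "B"]
colors = ["Black", "Black", "Yellow", "Red"]
L_sets = landrace_label_sets(ids, colors)
```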
(E): Calculate precision (See Figure 14E)
The performance of the proposed process is evaluated through the utilization of the functions presented in Equations (12) and (13), where $L_{i}$ is the set of labels predicted for the means in $M_{i}$ , $A_{i}$ is the set of actual colors observed by the expert on the bean landrace i, I is the number of bean landraces analyzed, and $| \cdot |$ denotes the cardinality of a set (the number of elements in it).

$P (M_{i}) = \frac{| L_{i} \cap A_{i} |}{| A_{i} |}$

(12)

$P (M) = \frac{\sum_{i = 1}^{I} P (M_{i})}{I}$

(13)

Equation (12) calculates the precision for each bean landrace i from its set of means $M_{i}$, taking into account the number of colors found by the proposed method that match the colors seen by the expert. If the colors found by the process match the actual colors, the value of this function is one. If the process finds all the actual colors plus additional ones, the value is also one, as this is taken to indicate that the algorithm found colors that are imperceptible by simple visual observation. Conversely, the value is zero when none of the colors found by the proposed method match the actual colors, and it lies strictly between zero and one when the proposed method finds some, but not all, of the colors seen by the expert.
Equation (13) quantifies the general precision obtained for the test dataset of the experiment under consideration, calculated as the mean of the precisions obtained over all the bean landraces.
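Equations (12) and (13) translate directly into set operations. The following Python sketch is illustrative only; the function names and the example labels are hypothetical.

```python
def landrace_precision(predicted, actual):
    """Equation (12): |L_i intersect A_i| / |A_i| for one landrace."""
    predicted, actual = set(predicted), set(actual)
    if not actual:
        raise ValueError("the expert label must contain at least one color")
    return len(predicted & actual) / len(actual)

def general_precision(predicted_sets, actual_sets):
    """Equation (13): mean of the per-landrace precisions."""
    scores = [
        landrace_precision(p, a) for p, a in zip(predicted_sets, actual_sets)
    ]
    return sum(scores) / len(scores)

# Made-up labels covering the cases discussed in the text.
predicted_sets = [
    {"Black", "Yellow"},   # exact match
    {"Black", "Purple"},   # all actual colors plus an extra one
    {"Purple", "Red"},     # two of three actual colors found
    {"Brown", "Red"},      # no actual color found
]
actual_sets = [
    {"Black", "Yellow"},
    {"Black"},
    {"Black", "Purple", "Red"},
    {"Black", "Purple"},
]
scores = [landrace_precision(p, a) for p, a in zip(predicted_sets, actual_sets)]
overall = general_precision(predicted_sets, actual_sets)
```

Because the denominator is $| A_{i} |$ and not $| L_{i} |$, additional predicted colors never lower the score, which is why the second case evaluates to one.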

The implementation of the proposed methodology and the experiments are performed in the R project software (Version R 4.3.2) on a computer with the characteristics presented in Table 3.

The general precisions $P (M)$ obtained for each of the 20 executions with each K value between 1 and 10 are documented, and their corresponding statistical descriptions are presented in the following section. The relevant statistical tests and an analysis of the results obtained are also included.

3. Results and Analysis

3.1. Experiments with Bean Landraces with One- or Two-Color Seeds and Fitting 7 Gaussians

In Table 4, the results obtained with the methodology proposed in this work are presented, considering 7 Gaussian components in the GI algorithm and bean landraces with one- and two-color seeds. The minimum, maximum, mean, and standard deviation of the precisions of twenty experiments for each K value for the K-$nn$ method are reported, along with the mean time needed to obtain the training and test datasets for the experimental process (steps (A) and (B)). The times employed by the K-$nn$ method were also measured; the mean time is 0.00059 s when using the knn() function of the R project software, so these times are not reported individually due to their low values.

The Shapiro-Wilk statistical test was performed to check the normality of the precisions obtained in the twenty executions of each K value. The p-values are also included in Table 4 and indicate that the distributions are not normal, so, to identify whether the samples exhibit significant differences, the non-parametric Kruskal-Wallis rank sum test was conducted, obtaining a p-value of 0.03162. Since this p-value is smaller than the significance level $\alpha = 0.05$, it is concluded that the differences between the precision results obtained with the different K values for the K-$nn$ method are significant, so a post-hoc test for non-parametric comparison must be performed. However, the post-hoc analysis using Dunn's test with Bonferroni adjustment for multiple comparisons showed no significant differences between any pair of K values at a significance level $\alpha = 0.05$.
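A significant omnibus test followed by no significant pairwise difference is not unusual when many pairs are compared under a Bonferroni adjustment. The sketch below shows the arithmetic, assuming all pairs of the ten K values were compared; the raw p-values are invented for illustration and are not results from this study.

```python
from math import comb

def bonferroni_adjust(p_values, n_comparisons):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    return [min(1.0, p * n_comparisons) for p in p_values]

n_groups = 10                              # K = 1, ..., 10
n_pairs = comb(n_groups, 2)                # 45 pairwise comparisons
alpha = 0.05
per_comparison_alpha = alpha / n_pairs     # raw threshold a pair must beat

# Hypothetical raw pairwise p-values, only to show the effect of the adjustment.
raw = [0.002, 0.02, 0.5]
adjusted = bonferroni_adjust(raw, n_pairs)
significant = [p < alpha for p in adjusted]
```

Under this assumption, a raw pairwise p-value has to fall below roughly 0.0011 to remain significant after adjustment, which is a much stricter requirement than the 0.05 applied to the Kruskal-Wallis test.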

As illustrated in Figure 18, boxplots are employed to visually represent the outcomes and the corresponding median for each K value. Based on Figure 18, the value of K was set at 9 to analyze the specific results obtained, since the median value obtained with $K = 9$ was the highest.
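The selection rule, choosing the K whose precisions have the highest median, can be sketched as follows. The precision values are invented for illustration, and breaking ties in favor of the smallest K is an assumption of this sketch.

```python
from statistics import median

def best_k_by_median(precisions_by_k):
    """Return the K with the highest median precision and all the medians.

    precisions_by_k: dict mapping K to the list of general precisions
    obtained over the repeated executions for that K.
    """
    medians = {k: median(values) for k, values in precisions_by_k.items()}
    # max() keeps the first maximum, so iterating in ascending K order
    # resolves ties in favor of the smallest K.
    best = max(sorted(medians), key=lambda k: medians[k])
    return best, medians

# Hypothetical precisions for three K values and three executions each.
precisions_by_k = {
    1: [0.90, 0.92, 0.91],
    5: [0.93, 0.95, 0.94],
    9: [0.96, 0.94, 0.97],
}
best_k, medians = best_k_by_median(precisions_by_k)
```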

A closer look at the results is warranted, given that the fitness function employed for the process has a specific behavior. Consequently, five types of classifications are considered and listed with examples below:

1. The colors in the real and predicted labels are the same.
Example: Real label = “Black Yellow”, Predicted label = “Black Yellow”

2. The colors in the real label are contained in the predicted label.
Example: Real label = “Black”, Predicted label = “Black Purple”

3. The colors in the predicted label are contained in the real label.
Example: Real label = “Black Purple Red”, Predicted label = “Purple Red”

4. Some colors in the real and predicted labels are the same.
Example: Real label = “Black Brown Yellow”, Predicted label = “Brown White”

5. The colors in the real and predicted labels are different.
Example: Real label = “Black Purple”, Predicted label = “Brown Red”
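The five types listed above correspond to set relations between the real and predicted labels. A minimal Python sketch, assuming labels are given as space-separated color names as in the examples (the function name is hypothetical):

```python
def classification_type(real_label, predicted_label):
    """Return the type (1 to 5) relating a real label to a predicted label."""
    real = set(real_label.split())
    predicted = set(predicted_label.split())
    if real == predicted:
        return 1                      # identical color sets
    if not (real & predicted):
        return 5                      # no color in common
    if real < predicted:
        return 2                      # real colors plus additional ones
    if predicted < real:
        return 3                      # only part of the real colors found
    return 4                          # partial overlap, neither contains the other

# The examples given in the list above.
examples = [
    ("Black Yellow", "Black Yellow"),
    ("Black", "Black Purple"),
    ("Black Purple Red", "Purple Red"),
    ("Black Brown Yellow", "Brown White"),
    ("Black Purple", "Brown Red"),
]
types = [classification_type(real, pred) for real, pred in examples]
```

Types 1 and 2 are the cases in which the precision of Equation (12) equals one, type 5 is the case in which it equals zero, and types 3 and 4 give intermediate values.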

By considering these types of classifications on the best result with $K = 9$, corresponding to the precision 0.9622, the performance described in Table 5 is obtained.

Table 5 shows the number of bean landraces that fit the types of classifications previously described. Of these, 99 bean landraces were labeled with the same colors as observed by the expert, while the remaining 73 have colors in common between the predicted and real colors. It is important to note that there is no bean landrace where the predicted colors completely differed from the real colors in this experiment. However, the 60 bean landraces where the predicted colors contain the real colors must be analyzed: if the proposed methodology assigned all 7 colors searched for to a bean landrace, the precision for that bean landrace would be 1 regardless of the real colors present in it.

In the following section, examples of the types of classifications are presented to analyze whether the additional colors predicted by the methodology, or the colors it fails to detect, are justified.

3.1.1. Analysis of Results Obtained by Type

This section analyzes a random selection of bean landraces to determine whether the colors predicted by the methodology are adequate compared to the real colors. In each case, the corresponding bean landrace image is displayed to analyze the colors on the seeds.

The analysis of types 1 and 5 is omitted: type 1 requires no analysis because the predicted and real colors are identical, and no bean landrace fell into type 5 in this experiment, that is, no case was observed in which the predicted and real colors were completely different.

The colors in the real labels are contained in the predicted labels (Type 2)

In this type of classification, the methodology proposed identifies the colors observed by the expert and additional colors. In such instances, the fitness function invariably yields a value of 1. This decision is justified by the premise that the methodology identifies colors the expert could not perceive. However, in such cases, it is important to analyze how the methodology assigns the additional colors.

To this end, Table 6 presents the number of bean landraces where the real colors are contained in the predicted colors, considering the number of colors in each case. The table shows that the proposed methodology identifies at most two additional colors in a bean landrace.

Figure 19 shows three examples of bean landraces, each with a single-color real label. The proposed methodology identifies two colors (Figure 19A,B) or three colors (Figure 19C) in these cases. In the seeds of Figure 19A, there is a noticeable difference between the yellow tones, some darker than others. The darker seeds were probably numerous enough to contribute significantly to a new Gaussian whose mean is closer to the brown color means in the training dataset. In Figure 19B, the purple color added by the proposed method is also visible in some seeds, which could be due to the lighting environment. The same happens in Figure 19C, where several tonalities can be perceived given the blue background and the arrangement of the bean seeds, so the methodology’s addition of the brown and purple colors can be justified.

Figure 20 shows two examples of bean landraces, each with two colors in their real labels. However, the proposed methodology identifies three colors. The additional colors, yellow and pink, added by the proposed methodology in Figure 20A and Figure 20B, respectively, are perceptible but in a smaller quantity than the other colors. This is probably why the expert did not include these colors in the labels of the bean landraces.

In Figure 21, two examples of bean landraces with two colors in their real labels are presented. However, the proposed methodology identifies four colors. As in the previous case, the two added colors, pink and red for Figure 21A, and pink and purple for Figure 21B, are present in the corresponding bean landraces, but in a lower quantity of seeds, so the expert does not consider the presence of those colors.

The colors in the predicted labels are contained in the real labels (Type 3)

In this type of classification, the methodology proposed identifies fewer colors than those observed by the expert. Consequently, the fitness function invariably yields a value lower than 1. The three bean landrace images that fulfil this characterization are presented in Figure 22.

As illustrated in Figure 22, the proposed methodology identifies the brown color in Figure 22A,B, despite black being the more prevalent color in the seeds. In such instances, the GI algorithm identifies means that are closer to the brown means in the training dataset. This could be attributed to the fact that the black and brown colors lie close together and at the same level in the training dataset, as illustrated in Figure 16. Thus, the means obtained for these bean landraces with the GI algorithm are at the same level and between the concentrations of black and brown means of the training dataset, but closer to the brown ones.

Conversely, in Figure 22C it is evident that the proportions of seeds of each color are significantly imbalanced. Thus, the information of the pixels corresponding to the yellow seeds was probably eliminated during the thresholding process performed by the proposed methodology.

Some colors in the real and predicted labels are the same (Type 4)

In this type of classification, the methodology proposed does not identify all the colors observed by the expert and adds others. In such cases, the fitness function has a value lower than 1.

In Figure 23, three examples of bean landraces are presented, each bearing two colors in their real labels. In these cases, the proposed methodology identifies one of the colors and adds one or two more for the predicted label.

As illustrated in Figure 23, the methodology identifies the presence of yellow in certain seeds of the bean landrace in Figure 23A. However, the black color is excluded despite its observable presence in some seeds. A more in-depth analysis of the image reveals the presence of more than two colors in the seeds, although the estimation performed with the proposed model identifies red and brown as the means instead of black, likely because these colors are spatially close to black in the CIE L*a*b* color space. This substitution of brown for red is also evident in Figure 23C. In Figure 23B, the yellow color identified by the proposed methodology is consistent with the presence of seeds of this color in the bean landrace, and the brown color is also present. However, the black color is substituted by the red one for the same reasons as in the other images: the black and brown colors are found in proximity in the CIE L*a*b* color space.

3.1.2. General Conclusions for the First Set of Experiments Considering Bean Landraces with One- or Two-Color Seeds and the Adjustment of 7 Gaussians

After analyzing the types of classifications, the following implications can be made:

The classification of type 2, which corresponds to cases where the real labels are contained in the predicted labels, occurs when the expert does not perceive the additional colors due to their observations being performed without a unified background. Alternatively, the expert may consider that the additional colors found by the methodology are in lower quantities than the other colors and, therefore, are not included in the real label.
The classification of type 3, corresponding to cases where the colors in the predicted labels are contained in the real labels, occurs when there is a confusion between the black and brown colors due to the proximity of the means of the training dataset in the CIE L*a*b* color space, or when some color is directly eliminated during the thresholding process.
The classification of type 4, corresponding to cases where some colors in the real and predicted labels are the same, occurs when the additional colors found by the methodology substitute the other colors considered in the real label, considering their spatial proximity in the CIE L*a*b* color space.

It can be posited that the proposed methodology is subject to critical errors in the classification of type 4, given that in such instances, certain visible colors are substituted by others that are proximate in the CIE L*a*b* color space. However, this is well penalized in the precision function, which gives values lower than 1 in these cases, considering the colors that are visually present but not found by the process.

The next section presents the second set of experiments, which analyzes bean landraces with one-, two-, and three-color seeds. The objective is to ascertain whether incorporating bean landraces with more than two seed colors significantly increases the complexity of the process.

3.2. Experiments with Bean Landraces with One-, Two-, or Three-Color Seeds and Fitting 7 Gaussians

In Table 7, the results obtained with the proposed methodology, considering 7 Gaussian components in the GI algorithm and bean landraces with one-, two-, and three-color seeds, are shown. The minimum, maximum, mean, and standard deviation of the precisions of twenty experiments for each K value for the K-$nn$ method are reported, along with the mean time needed to obtain the training and test datasets for the experimental process (steps (A) and (B)).

In this instance, the p-values of the Shapiro-Wilk statistical test shown in Table 7 indicate that the distributions of the precisions obtained in the twenty executions of each K value for the K-$nn$ method are non-normal. Therefore, to identify whether the samples exhibit significant differences, the Kruskal-Wallis test was conducted, obtaining a p-value of 0.3802. As this p-value exceeds the significance level of 0.05, it can be concluded that the differences between the values of K for the K-$nn$ method are not statistically significant.

In Figure 24, the boxplots for these results are presented with the corresponding median for each K value. As in the previous set of experiments, the value of K is selected considering the highest median obtained. So the value of K is set at 7 to analyze the specific results in this case.

In this case, the five types of classifications are also considered on the best result with $K = 7$, corresponding to the precision 0.9037. Table 8 shows the number of bean landraces that fit the types of classifications, and it is observed that 118 of the bean landraces were labeled with the same colors as observed by the expert, while the remaining 88 have colors in common between the predicted and real labels. In this experiment, there is no bean landrace where the predicted colors completely differed from the real colors. However, the following section will analyze the types of classifications obtained with the methodology compared to the real colors observed by the expert.

3.2.1. Analysis of Results Obtained by Type

This section analyzes a random selection of bean landraces to determine whether the colors predicted by the methodology are adequate compared to the real colors, adding the corresponding bean landrace image to visually analyze the colors on the seeds.

In this case, the analysis of types 1 and 5 is also omitted: type 1 needs no analysis because the predicted and real colors are the same, and no bean landrace fell into type 5, that is, no case was observed in which the predicted and real colors were completely different.

The colors in the real labels are contained in the predicted labels (Type 2)

Table 9 shows the number of bean landraces where the real colors are included in the predicted colors, considering the number of colors in each case. In this set of experiments, the same behavior is observed as in the previous set of experiments. The proposed methodology finds at most two additional colors in some cases, so 5 colors are found in some bean landraces.

In Figure 25, three examples of bean landraces are presented, wherein the actual labels exhibit a single color, whereas the proposed methodology discerns two distinct colors. In these images, discerning at least two different intensity levels of color on each bean landrace is facilitated by the blue background incorporated during the image acquisition. In Figure 25A, the black coloration appears to have been added to accentuate the deep brown hue of certain seeds. In contrast, brown coloration was introduced in Figure 25B,C to create a contrasting effect with the light red seeds.

Figure 26 and Figure 27 contain examples of bean landraces where the expert discerns single-color seeds. However, the proposed methodology identifies three- and four-color seeds, respectively.

Image A in Figure 26 exhibits at least three distinct color intensity levels. For this reason the proposed methodology assigns three colors. Conversely, in images B and C, the variation in the red intensity can be structured at most in two intensities of red, yet the predicted colors exhibit brown and purple hues. This phenomenon can be attributed to the proximity of these colors to the red color in the CIE L*a*b* color space.

In Figure 27, the three bean landraces with a single-color real label that the methodology predicted to have a four-color label are displayed. These could be the three most critical errors performed by the proposed methodology since the predicted colors on images A, B, and C add three colors that cannot be perceived visually. The proximity of the colors in the CIE L*a*b* color space may be a contributing factor to this phenomenon, given the number of seeds, and the dispersion of their real colors.

In Figure 28, three examples of images are presented, with the predicted labels exhibiting three colors, while the real labels display two colors. A visual analysis of these bean landraces reveals that the colors indicated by the methodology are distinguishable in the corresponding images. In this case, the proposed methodology successfully identifies colors that the expert cannot perceive without the uniform background.

This phenomenon is also observed in the bean landrace images depicted in Figure 29 and Figure 30. The methodology employed identifies all the colors observed in the seeds of the images A, B, and C of both Figures. It is possible that the expert may not have considered these colors due to the limited number of seeds or perhaps due to the difficulty in discerning the colors in the absence of a blue background and under the conditions of the illumination environment.

The only case where the methodology found five colors in a bean landrace previously labeled with three colors is shown in Figure 31. It is evident from this figure that the five colors predicted by the methodology are present in the seeds, but some in smaller quantities than the others.

The colors in the predicted labels are contained in the real labels (Type 3)

In this classification, the methodology identifies fewer colors than those observed by the expert. The number of images with this characterization is documented in Table 10. Nine images of bean landraces are included in this classification, all exhibiting three colors as observed by the expert. However, the methodology detects only one or two colors on the seeds.

As illustrated in Figure 32, the two images of bean landraces display three colors in their labels, yet the methodology identifies only one color. In this case, the three colors of the seeds are perceptible visually; however, in image A, the reduced seed quantity in this sample relative to the other bean landraces examined can potentially influence the outcomes of the methodology. This is particularly salient in the thresholding process, where the color of the seeds not detected by the procedure may be filtered out. A similar scenario might occur in image B, where yellow seeds may have been removed, resulting in a dataset predominantly comprising black and brown seeds that are more closely aligned in terms of spatial distribution with the brown color means of the training dataset.

In Figure 33, three examples of bean landrace images are presented, in which the methodology identified two colors instead of the three indicated in the actual label. Notably, distinguishing more than two colors on the seeds in images A and C is significantly challenging. However, the expert adds red and purple colors, respectively. On the other hand, the image labeled as B presents a case in which the brown and purple seeds are grouped as a single brown category.

Some colors in the real and predicted labels are the same (Type 4)

In this type of classification, the expert and the methodology coincide in some colors of the analyzed bean landraces, but not in all, so neither label is a proper subset of the other, unlike in the previous types of classification.

In Figure 34, three examples of bean landraces are presented, each exhibiting this characterization. In image A, the color indicated in both labels is black. A visual inspection of the landrace seeds reveals the presence of at least three distinct colors, with yellow being particularly notable. However, the proposed methodology erroneously identifies brown, pink, and red colors, even when the number of yellow seeds is significant. This discrepancy indicates a critical error in the methodology. In image B, both labels exhibit brown and yellow colors, while the pink color in the real colors is substituted by a combination of purple and red colors. This phenomenon can be attributed to the proximity of these three colors in the CIE L*a*b* color space. Furthermore, this proximity can provide a potential explanation for the labels in image C, where the red color observed by the expert is replaced by the purple color delivered by the methodology.

3.2.2. General Conclusions for the Second Set of Experiments Considering Bean Landraces with One-, Two-, or Three-Color Seeds and the Adjustment of 7 Gaussians

After analyzing the types of classifications of this set of experiments, it is observed that there are more critical errors by considering those populations with three colors on their seeds. The following implications can be made in this case:

The classification of type 2, in which the colors perceived by the expert are contained within the colors generated by the methodology, necessitates more cases for analysis compared to the previous set of experiments. However, a distinguishing feature of this classification is that the methodology identifies no more than two additional colors beyond the actual colors. Consequently, the methodology identifies a maximum of five colors. As in the previous set of experiments, the presence of additional colors can be attributed to the expert’s inability to perceive the colors in the absence of a plain background and an illumination environment, or to the expert’s perception that the additional colors found by the methodology are present in smaller quantities compared to the other colors. However, there are instances where the addition of colors by the methodology is not evident, as observed in images A and B of Figure 27. This phenomenon can be attributed to the proximity of the predicted colors in the CIE L*a*b* color space.
The classification of type 3, in which the colors in the predicted labels are contained in the actual labels, occurs in nine images previously analyzed by the expert with seeds of three different colors. In these cases, the methodology identified only one or two colors due to the proximity of the means in the CIE L*a*b* color space or when some color is directly eliminated during the thresholding process due to the number of seeds analyzed on the bean landrace.
The classification of type 4 corresponds to cases where some colors in the real and predicted labels are the same. In such instances, the supplementary colors identified by the methodology supplant the other colors specified in the real label, with their spatial proximity in the CIE L*a*b* color space being considered.

The proposed methodology exhibits a higher incidence of critical errors in this set of experiments than in the previous one. Incorporating bean landraces with three distinct colors increases the complexity of the process because of the characteristics of these new images, some of which contain fewer seeds. Since the thresholding process uses the same value t for all the images in the test dataset to clean the information, some colors disappear in this step (step (B) in Section 2.4).

It is also noteworthy that these cases are adequately penalized by the fitness function proposed. As in both sets of experiments, the analysis of the classification of types 3 and 4 employs the distribution of the training dataset in the CIE L*a*b* color space as justifications; the subsequent section will present a brief analysis of these datasets.

3.3. Analysis of the Means in the Training Dataset

As previously stated in step (A) of Section 2.4, the training dataset for the K-$nn$ method considers the information of the means obtained from 3D histograms of individual seeds previously labeled with one color. Consequently, the information of multiple seeds of a single color is utilized to characterize the CIE L*a*b* color space with a group of points. Ideally, points with identical labels would be positioned closer to each other, as illustrated in Figure 16. However, the results obtained suggest that this phenomenon was not observed in the training datasets. So, this is an unconsidered factor that influences the outcomes.

The plots of four training datasets are presented in Figure 35.

As demonstrated in Figure 35, the points corresponding to the means of black, white, and occasionally yellow colors appear to be distributed evenly. That is to say, the points with these colors are close, or there is an absence of other color points overlapping their spatial structure. Conversely, for the colors brown, pink, purple, and red, there is an overlap of concentrations of points. This observation is further substantiated by the analysis of results, which reveals instances of color combination or substitution among these hues when employing the K-$nn$ method for the classification of the means in the test dataset.

4. Discussion

As established in preceding studies, identifying color in bean landraces is paramount, given the demonstrated correlation between the chemical composition and health benefits of the seeds and their colorimetric properties. However, the diversity of colors present in bean landrace seeds poses a challenge to classification.

Previous studies have employed various analytical techniques, including laboratory and digital images, to analyze the color properties of bean landraces. Contemporary investigations have highlighted the employment of probability distributions to characterize the colors of bean landraces on histograms in two and three dimensions in a color space, such as the CIE L*a*b* color space. The present study employs this approach, but instead of utilizing the complete information of these histograms in the learning process, a compression of information is considered with the GMM and the GI algorithm. This model facilitates capturing information regarding the colors in the bean landraces with the means vectors obtained for the GMM.

Characterizing the colors of heterogeneous bean landraces by points in the CIE L*a*b* color space is the novel approach employed in this study. In addition to the favorable outcomes achieved with the proposed methodology, the process facilitates a substantial computational cost reduction in terms of temporal requirements and memory when contrasted with antecedent techniques that employ numerous two-dimensional histograms for each bean landrace, along with a considerable amount of time in the learning process.

The proposed methodology involves the compression of information from each bean landrace into a maximum of seven points in the CIE L*a*b* color space (since seven colors are considered as base) in comparison with several 2D histograms of size 255 × 255 employed in other works.
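The size difference mentioned above can be made concrete with a short calculation. This sketch counts only the coordinates of the means (three per point in the CIE L*a*b* color space) and compares them with a single 255 × 255 histogram; other quantities stored by either approach are not counted, which is a simplifying assumption.

```python
n_means = 7                      # at most seven points per bean landrace
coords_per_mean = 3              # L*, a*, b*
values_per_landrace = n_means * coords_per_mean      # 21 numbers

bins_per_2d_histogram = 255 * 255                    # 65025 bins

# How many times larger a single 2D histogram is than the compressed form.
ratio = bins_per_2d_histogram / values_per_landrace
```

Under these assumptions, one 2D histogram alone holds roughly three thousand times more values than the compressed representation of a bean landrace.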

Concerning the computational time required by the experiments, compressing the information from the set of 3D histograms of single-color seeds takes less than five minutes. Compressing the information from the set of 3D histograms of the bean landraces analyzed takes 8 or 11 min, depending on the number of colors considered in the seeds. After the compression stage, the K-nn method classifies the information of the bean landraces, using the information of the single seeds as the training dataset. This step takes no more than 0.0006 s. Consequently, the total duration of the experiment is no more than 16 min. This represents a substantial reduction compared to alternative approaches that employ convolutional neural networks and more sophisticated techniques, which require more information, more computational time, and a process where structures must be found first.
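The classification stage described above can be sketched in a few lines, which also suggests why it is so fast: it only compares a handful of 3D points. The sketch assumes Euclidean distance in CIE L*a*b* and majority voting. The function names and the toy coordinates are hypothetical, and the means and mixture proportions are assumed to come from an earlier GMM fit that is not shown.

```python
import math
from collections import Counter


def knn_predict(train, point, k=3):
    """Label a 3D point by majority vote among its k nearest training means.

    train is a list of ((L, a, b), color_label) pairs.
    Euclidean distance in CIE L*a*b* is an assumption of this sketch.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def label_landrace(train, means, weights, k=3):
    """Set of colors assigned to a landrace from its fitted mixture.

    Only means with a mixture proportion greater than zero are classified.
    """
    return {knn_predict(train, m, k) for m, w in zip(means, weights) if w > 0}


# Hypothetical single-seed training means, for illustration only.
toy_train = [
    ((20, 0, 0), "black"), ((22, 1, -1), "black"), ((18, -1, 1), "black"),
    ((90, 0, 5), "white"), ((88, 1, 4), "white"), ((92, -1, 6), "white"),
    ((40, 50, 30), "red"), ((42, 48, 28), "red"), ((38, 52, 32), "red"),
]

# Hypothetical means and mixture proportions of one landrace.
toy_means = [(21, 0, 0), (41, 49, 29), (60, 20, 10)]
toy_weights = [0.6, 0.4, 0.0]

if __name__ == "__main__":
    print(label_landrace(toy_train, toy_means, toy_weights))
```

The third mean has a mixture proportion of zero, so it is skipped and the landrace is labeled with the set containing black and red.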

In contrast, the proposed process involves only two parameters to obtain each dataset (training and test): the number of Gaussians for the GMM that the GI algorithm must search for and the thresholding value t. The former restricts the search to the maximum number of colors considered for the bean landraces, while the latter enhances the clarity of the color agglomerations in the CIE L*a*b* color space and eliminates noise from the 3D histograms passed to the GI algorithm. However, there is room for improvement in this regard, since the critical errors made by the proposed methodology are related to the choice of these parameters.
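The effect of the thresholding value t can be illustrated with a minimal sketch. The exact cleansing rule is not restated in this section, so the sketch assumes that a histogram bin is kept only when its count is at least t. The dictionary representation of the 3D histogram, the function name, and the toy counts are all hypothetical.

```python
def threshold_histogram(hist, t):
    """Keep only the bins whose count is at least t.

    hist maps a 3D bin index (i, j, k) in CIE L*a*b* to a pixel count.
    The rule "count >= t" is an assumption made for this sketch.
    """
    return {bin_idx: count for bin_idx, count in hist.items() if count >= t}


# Hypothetical histogram: two dense bins of a dominant color,
# one sparse bin of a minor color, and one isolated noise bin.
toy_hist = {
    (2, 5, 5): 120,
    (2, 5, 6): 95,
    (7, 9, 8): 4,
    (0, 1, 1): 1,
}

if __name__ == "__main__":
    print(sorted(threshold_histogram(toy_hist, 3)))  # noise bin removed
    print(sorted(threshold_histogram(toy_hist, 5)))  # minor color removed too
```

With t = 3 only the noise bin disappears, while t = 5 also removes the sparsely populated minor color. This shows how a threshold that is too high can hide a color that is present in only a few seeds.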

Regarding precision, the proposed methodology and experimentation setup obtained favorable results considering the proposed fitness function.

These enhancements signify a substantial and noteworthy advancement in the color identification of heterogeneous bean landraces.

5. Conclusions

The proposed methodology has yielded favorable outcomes in identifying colors in bean landraces with seeds of heterogeneous color. Employing the means obtained with the GMM and the GI algorithm for each seed (training dataset) and each bean landrace (test dataset) is a good approach to identifying the colors with the help of the K-nn method.

The analysis of the results of this study, considering the five types of classification, along with examples of the "errors" committed by the proposed methodology, enables the description of the main challenges presented in this identification process.

Firstly, the involvement of an expert in the pre-labeling process is paramount since several cases of type 2 occur when the expert does not perceive certain colors in the seeds due to suboptimal lighting and environmental conditions during observation. Additionally, the expert may overlook certain colors due to their minimal presence in a particular bean landrace.

Secondly, it is crucial to analyze the distribution of the means in the training dataset, as cases of types 3 and 4 occur when the means of different colors overlap in the CIE L*a*b* color space, as described in Section 3.

Thirdly, the quantity of seeds shown in the images is relevant to the process. In most images depicting bean landraces, 60 g of seeds are considered, as this is the value reported in several studies that use the pH differential method to measure the anthocyanin concentration of the landrace. Notably, the anthocyanin concentration is closely associated with the coloration of the seeds. However, several errors of the proposed methodology were observed for bean landraces whose images display a lower quantity of seeds. This can be attributed to the threshold value t employed during the cleansing process, which eliminates the information of the seeds with the less pronounced color.

Still, with these limitations, the proposed methodology achieves a high level of precision, with values exceeding 90%.

6. Drawbacks and Limitations of the Proposed Methodology

The methodology proposed in this work is subject to certain limitations:

The first and most critical drawback results from the specific characteristics of the 3D histograms used as inputs for the methodology. As described in Section 2.1, these are derived from images captured in optimal lighting conditions, with reduced specular brightness and a uniform background. In particular, as demonstrated in [12], experiments conducted on low-quality bean landrace images have yielded unsatisfactory results due to the presence of noise that can manifest as new or different colors. Consequently, the methodology may be susceptible to errors in the GMM fitting process when low-quality images are used, which could lead to the detection of more colors than are actually present and to poorer results. However, the model’s objective is to achieve concordance with the outcomes of laboratory-based chemical analyses that measure anthocyanin levels, without incurring the expense of reagents. The structured imaging process, which is already in place in a regulated environment, therefore offers suitable conditions for the algorithm to achieve its best results.
Another limitation is the small number of bean landrace samples available for specific colors and varieties. For example, bean landraces with variegated seeds were not considered in this project for this reason. However, it is essential to recognize that these legumes play a crucial role in the diets of local farmers and communities with limited resources.

7. Future Work

There is room for improvement by considering the following proposals:

A significant step in the process is to consider 3D histograms from images of low quality, such as those resulting from poorly illuminated conditions, to assess the robustness of the proposed methodology.
It is imperative to incorporate bean landraces with variegated seeds in this study. The compilation, image acquisition, segmentation, and extraction of information regarding these specific bean landraces must be performed to facilitate a comprehensive analysis of the proposed methodology’s behavior.
Identify the appropriate parameters for the methodology by performing more experiments:
– The thresholding values t employed in the cleansing process should be analyzed, even considering different values between bean landraces, depending on the quantity of seeds in the images.
– Further experimentation could be conducted by reducing the number of Gaussians considered for the GMM. In this work, seven Gaussians were considered since that is the number of colors searched for. However, given that, in some cases, the GI algorithm identifies a maximum of five colors, this number should be employed in subsequent studies to fit the GMM.
Furthermore, 3D histograms in alternative color spaces, such as RGB, LCh, and HSV, should be considered in subsequent analyses. Alternatively, the color information could be translated to an optimal color space, where pixels of different colors are separated through techniques such as Linear Discriminant Analysis.
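As an illustration of the last proposal, the LCh color space is the cylindrical form of CIE L*a*b*, so the means already obtained could be re-expressed in it without reprocessing the images. The sketch below uses the standard conversion (chroma as the radius in the a*b* plane, hue as the angle in degrees). The function name is hypothetical, and whether this representation separates the overlapping colors better remains to be tested.

```python
import math


def lab_to_lch(lightness, a, b):
    """Convert a CIE L*a*b* point to LCh (lightness, chroma, hue in degrees).

    Chroma is the distance from the neutral axis in the a*b* plane and
    hue is the angle of the point in that plane, mapped to [0, 360).
    """
    chroma = math.hypot(a, b)
    hue = math.degrees(math.atan2(b, a)) % 360.0
    return lightness, chroma, hue


if __name__ == "__main__":
    # Hypothetical mean of a reddish seed, for illustration only.
    print(lab_to_lch(40.0, 30.0, 40.0))
```

In this form, hue and chroma are explicit coordinates, which may help when colors such as red, pink, and purple differ mainly in hue angle.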

Author Contributions

Conceptualization, A.-L.L.-L. and M.-L.A.-G.; Data curation, A.-L.L.-L. and J.-L.M.-R.; Formal analysis, A.-L.L.-L.; Investigation, A.-L.L.-L.; Methodology, A.-L.L.-L. and M.-L.A.-G.; Resources, H.-G.A.-M., J.-L.M.-R. and E.-N.A.-B.; Software, A.-L.L.-L.; Supervision, M.-L.A.-G., H.-G.A.-M., J.-L.M.-R. and E.-N.A.-B.; Validation, M.-L.A.-G.; Visualization, A.-L.L.-L.; Writing—original draft, A.-L.L.-L.; Writing—review & editing, M.-L.A.-G., H.-G.A.-M., J.-L.M.-R. and E.-N.A.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets employed in this article are not readily available since they are part of an ongoing study. Requests for access to the datasets should be directed to eliaquino@uv.mx and heacosta@uv.mx.

Acknowledgments

The first author acknowledges the Secretaría de Ciencia, Humanidades, Tecnología e Innovación (SECIHTI) of Mexico for the financial support provided through scholarship 712182, which was awarded for postdoctoral studies at the Artificial Intelligence Research Institute at the University of Veracruz. The authors also express their gratitude to José Luis Chávez Servia for providing the samples of bean landraces employed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

GMM	Gaussian Mixture Model
GI algorithm	Gini Index algorithm

Appendix A. Pseudocodes

In this section, the pseudocodes for the GI algorithm and the proposed methodology are presented to facilitate the reproduction of the process.

The Gini Index algorithm, described in Section 2.2, is outlined in Algorithm A1. The methodology proposed in Section 2.3 for identifying the colors present in a bean landrace using the GMM and the GI algorithm is presented in Algorithm A2.

Algorithm A1: Gini index algorithm (GI algorithm)

Algorithm A2: Methodology

References

1. Aquino-Bolaños, E.N.; Garzón-García, A.K.; Alba-Jiménez, J.E.; Chávez-Servia, J.L.; Vera-Guzmán, A.M.; Carrillo-Rodríguez, J.C.; Santos-Basurto, M.A. Physicochemical Characterization and Functional Potential of Phaseolus vulgaris L. and Phaseolus coccineus L. Landrace Green Beans. Agronomy 2021, 11, 803.
2. Aquino-Bolaños, E.N.; García-Díaz, Y.D.; Chavez-Servia, J.L.; Carrillo-Rodríguez, J.C.; Vera-Guzmán, A.M.; Heredia-García, E. Anthocyanins, polyphenols, flavonoids and antioxidant activity in common bean (Phaseolus vulgaris L.) landraces. Emir. J. Food Agric. 2016, 28, 581–588.
3. Chávez-Servia, J.L.; Heredia-García, E.; Mayek-Pérez, N.; Aquino-Bolaños, E.N.; Hernández-Delgado, S.; Carrillo-Rodríguez, J.C.; Gill-Langarica, H.R.; Vera-Guzmán, A.M. Diversity of common bean (Phaseolus vulgaris L.) landraces and the nutritional value of their grains. In Grain Legumes; InTech: Rijeka, Croatia, 2016; pp. 1–33.
4. Singh, S.P. Broadening the genetic base of common bean cultivars: A review. Crop Sci. 2001, 41, 1659–1675.
5. Bitocchi, E.; Nanni, L.; Bellucci, E.; Rossi, M.; Giardini, A.; Zeuli, P.S.; Logozzo, G.; Stougaard, J.; McClean, P.; Attene, G.; et al. Mesoamerican origin of the common bean (Phaseolus vulgaris L.) is revealed by sequence data. Proc. Natl. Acad. Sci. USA 2012, 109, E788–E796.
6. Espinoza-García, N.; Martínez-Martínez, R.; Chávez-Servia, J.L.; Vera-Guzmán, A.M.; Carrillo-Rodríguez, J.C.; Heredia-García, E.; Velasco-Velasco, V.A. Contenido de minerales en semilla de poblaciones nativas de frijol común (Phaseolus vulgaris L.). Rev. Fitotec. Mex. 2016, 39, 215–223.
7. Capistrán-Carabarin, A.; Aquino-Bolaños, E.N.; García-Díaz, Y.D.; Chávez-Servia, J.L.; Vera-Guzmán, A.M.; Carrillo-Rodríguez, J.C. Complementarity in Phenolic Compounds and the Antioxidant Activities of Phaseolus coccineus L. and P. vulgaris L. Landraces. Foods 2019, 8, 295.
8. Hernández-Delgado, S.; Muruaga-Martínez, J.S.; Vargas-Vázquez, M.L.P.; Martínez-Mondragón, J.; Chávez-Servia, J.L.; Gill-Langarica, H.R.; Mayek-Pérez, N. Advances in genetic diversity analysis of Phaseolus in Mexico. Mol. Approaches Genet. Divers. 2015, 1, 47–73.
9. Chávez-Servia, J.L.; Carrillo-Rodríguez, J.C.; Guzmán, A.M.V.; Aquino-Bolaños, E.N.; Hernández-Delgado, S.; Mayek-Pérez, N.; Lobato-Ortiz, R. Traditional family production and nutritional-nutraceutical value of common beans (Phaseolus vulgaris L.) in Southeast Mexico. In Phaseolus vulgaris: Cultivars, Production and Uses; Nova Science Publishers, Inc.: Hauppauge, NY, USA, 2018.
10. Campos-Vega, R.; Bassinello, P.Z.; Santiago, R.d.A.C.; Oomah, B.D. Dry beans: Processing and nutritional effects. In Therapeutic, Probiotic, and Unconventional Foods; Elsevier: Amsterdam, The Netherlands, 2018; pp. 367–386.
11. Chávez-Mendoza, C.; Hernández-Figueroa, K.I.; Sánchez, E. Antioxidant capacity and phytonutrient content in the seed coat and cotyledon of common beans (Phaseolus vulgaris L.) from various regions in Mexico. Antioxidants 2018, 8, 5.
12. Morales Reyes, J.L.; Acosta Mesa, H.G.; Aquino Bolaños, E.N.; Herrera Meza, S.; Cruz Ramírez, N.; Chávez Servia, J.L. Classification of Bean (Phaseolus vulgaris L.) Landraces with Heterogeneous Seed Color using a Probabilistic Representation. In Proceedings of the 2021 IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC), Ixtapa, Mexico, 10–12 November 2021; IEEE: Piscataway, NJ, USA, 2021; Volume 5, pp. 1–7.
13. Fernandes, A.M.; Franco, C.; Mendes-Ferreira, A.; Mendes-Faia, A.; da Costa, P.L.; Melo-Pinto, P. Brix, pH and anthocyanin content determination in whole Port wine grape berries by hyperspectral imaging and neural networks. Comput. Electron. Agric. 2015, 115, 88–96.
14. Grimm, E.; Kuhnke, F.; Gajdt, A.; Ostermann, J.; Knoche, M. Accurate quantification of anthocyanin in red flesh apples using digital photography and image analysis. Horticulturae 2022, 8, 145.
15. Yoshioka, Y.; Nakayama, M.; Noguchi, Y.; Horie, H. Use of image analysis to estimate anthocyanin and UV-excited fluorescent phenolic compound levels in strawberry fruit. Breed. Sci. 2013, 63, 211–217.
16. Del Valle, J.C.; Gallardo-López, A.; Buide, M.L.; Whittall, J.B.; Narbona, E. Digital photography provides a fast, reliable, and noninvasive method to estimate anthocyanin pigment concentration in reproductive and vegetative plant tissues. Ecol. Evol. 2018, 8, 3064–3076.
17. Djoulde, K.; Ousman, B.; Hamadjam, A.; Bitjoka, L.; Tchiegang, C. Classification of pepper seeds by machine learning using color filter array images. J. Imaging 2024, 10, 41.
18. Kılıç, K.; Boyacı, I.H.; Köksel, H.; Küsmenoğlu, İ. A classification system for beans using computer vision system and artificial neural networks. J. Food Eng. 2007, 78, 897–904.
19. Nasirahmadi, A.; Behroozi-Khazaei, N. Identification of bean varieties according to color features using artificial neural network. Span. J. Agric. Res. 2013, 11, 670–677.
20. Venora, G.; Grillo, O.; Ravalli, C.; Cremonini, R. Identification of Italian landraces of bean (Phaseolus vulgaris L.) using an image analysis system. Sci. Hortic. 2009, 121, 410–418.
21. Koklu, M.; Ozkan, I.A. Multiclass classification of dry beans using computer vision and machine learning techniques. Comput. Electron. Agric. 2020, 174, 105507.
22. Tormena, C.D.; Campos, R.C.S.; Marcheafave, G.G.; Bruns, R.E.; Scarminio, I.S.; Pauli, E.D. Authentication of carioca common bean cultivars (Phaseolus vulgaris L.) using digital image processing and chemometric tools. Food Chem. 2021, 364, 130349.
23. De Araújo, S.A.; Pessota, J.H.; Kim, H.Y. Beans quality inspection using correlation-based granulometry. Eng. Appl. Artif. Intell. 2015, 40, 84–94.
24. Morales-Reyes, J.L.; Aquino-Bolaños, E.N.; Acosta-Mesa, H.G. Color Quantification in Common Bean Landraces Using a Supervised Learning Technique. In Proceedings of the Mexican International Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2024; pp. 167–178.
25. Morales-Reyes, J.L.; Aquino-Bolaños, E.N.; Acosta-Mesa, H.G.; Márquez-Grajales, A. Estimation of Anthocyanins in Heterogeneous and Homogeneous Bean Landraces Using Probabilistic Colorimetric Representation with a Neuroevolutionary Approach. Math. Comput. Appl. 2024, 29, 68.
26. López-Lobato, A.L.; Avendaño-Garrido, M.L.; Acosta-Mesa, H.G.; Morales-Reyes, J.L.; Aquino-Bolaños, E.N. Bean Landraces Color Identification Through Image Analysis and Gaussian Mixture Model. In Proceedings of the Mexican International Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2024; pp. 112–124.
27. López-Lobato, A.L.; Avendaño-Garrido, M.L. Using the Gini index for a Gaussian Mixture Model. In Proceedings of the Advances in Computational Intelligence: 19th Mexican International Conference on Artificial Intelligence, MICAI 2020, Mexico City, Mexico, 12–17 October 2020; Proceedings, Part II 19. Springer: Berlin/Heidelberg, Germany, 2020; pp. 403–418.
28. López-Lobato, A.L.; Avendaño-Garrido, M.L. Fitting a Gaussian mixture model through the Gini index. Int. J. Appl. Math. Comput. Sci. 2021, 31, 487–500.
29. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
30. Vaida, F. Parameter convergence for EM and MM algorithms. Stat. Sin. 2005, 15, 831–840.
31. Bassetti, F.; Bodini, A.; Regazzini, E. On minimum Kantorovich distance estimators. Stat. Probab. Lett. 2006, 76, 1298–1302.
32. Gini, C. Variabilità e mutabilità. In Reprinted in Memorie di Metodologica Statistica; Pizetti, E., Salvemini, T., Eds.; Libreria Eredi Virgilio Veschi: Rome, Italy, 1912.
33. Gini, C. Sulla misura della concentrazione e della variabilità dei caratteri. Atti R. Ist. Veneto Sci. Lett. Arti 1914, 73, 1203–1248.
34. Villani, C. Optimal Transport: Old and New; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; Volume 338.
35. López Lobato, A.L.; Avendaño Garrido, M.L. Convergence of parameter estimation of a Gaussian mixture model minimizing the Gini index of dissimilarity. Commun. Stat.-Theory Methods 2024, 53, 6030–6037.
36. Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN model-based approach in classification. In Proceedings of the On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Italy, 3–7 November 2003; Proceedings. Springer: Berlin/Heidelberg, Germany, 2003; pp. 986–996.
37. Hand, D.J. Principles of data mining. Drug Saf. 2007, 30, 621–622.

Figure 1. Common bean landraces with homogeneous and heterogeneous seeds. Bean landraces with (A) yellow color seeds (homogeneous), (B) black and red color seeds (heterogeneous), (C) black, yellow and pink color seeds (heterogeneous), (D) variegated seeds (heterogeneous), and (E) yellow and variegated seeds (heterogeneous).

Figure 2. Color categories considered for seeds.

Figure 3. Process performed to obtain the color information of the bean landraces seeds as 3D histograms in the CIE L*a*b* color space.

Figure 4. Prototype for the image acquisition process.

Figure 5. Visual contrast between the bean landrace and the background of the image in blue color.

Figure 6. CIE L*a*b* color space.

Figure 7. Original and binary image obtained with the region growth segmentation algorithm. The black regions obtained correspond to the region of interest, i.e., the bean landrace seeds.

Figure 8. 3D histogram divisions considering the CIE L*a*b* color space.

Figure 9. 3D histogram of a bean landrace in the CIE L*a*b* color space.

Figure 10. Example of (a) a dataset dispersion in $\mathbb{R}^{2}$, and (b) its corresponding fitted Gaussian mixture distribution.

Figure 11. Graphical representation of a joint distribution function $\pi$ whose projections onto the axes are the distributions $\nu_{1}$ and $\nu_{2}$.

Figure 12. Process of data cleansing using thresholding on a 3D histogram of a bean landrace. (A) presents the positions considered in the original 3D histogram and (B) the positions that remain after the thresholding process.

Figure 13. Bean landraces with seeds of heterogeneous colors, considering one, two or three colors in each landrace.

Figure 14. Example of the steps performed for the general design of the experiments. The training dataset (A) is structured by the single-point means of 3D histograms of individual seeds that have been labeled with one of seven colors considered. The test dataset (B) comprises the means with a mixture proportion greater than zero delivered by the GI algorithm for the bean landraces analyzed. The labels predicted by the K-nn method for the test dataset (C) are obtained, and the label assignment for each bean landrace (D) is considered as the set of colors identified by the K-nn method in their respective means. To evaluate the performance of the proposed method, the precision of each bean landrace, $P(M_{i})$, and the general precision, $P(M)$, (E), are considered.

Figure 15. Process performed in one bean landrace seed of homogeneous color to obtain its representative mean in the CIE L*a*b* color space.

Figure 16. Example of the representative means for a training dataset in the CIE L*a*b* color space.

Figure 17. Process performed in one bean landrace of heterogeneous color to obtain its representative means in the CIE L*a*b* color space.

Figure 18. Boxplots of the results obtained considering bean landraces with one- and two-color seeds and the different K values for the K-nn method with the proposed methodology, fitting 7 Gaussians with the GI algorithm. The median for each K value is displayed in the corresponding boxplot.

Figure 19. Examples of bean landraces with single-color seeds (according to the expert), where the proposed methodology identifies the color, and one (A,B) or two colors more (C).

Figure 20. Examples of bean landraces characterized by the presence of two colors on their seeds (according to the expert), where the proposed methodology identifies these colors and one more color.

Figure 21. Examples of bean landraces characterized by the presence of two colors on their seeds (according to the expert), where the proposed methodology identifies these colors and two more colors.

Figure 22. Examples of bean landraces characterized by the presence of two colors on their seeds (according to the expert), where the proposed methodology only identifies one of these colors.

Figure 23. Examples of bean landraces characterized by the presence of two colors on their seeds (according to the expert), where the proposed methodology only identifies one of these colors and assigns two colors (A,B) or one color (C) more.

Figure 24. Boxplots of the results obtained considering bean landraces with one-, two-, and three-color seeds and the different K values for the K-nn method with the proposed methodology, and fitting 7 Gaussians with the GI algorithm. The median for each K value is displayed in the corresponding boxplot.

Figure 25. Examples of bean landraces where the real color observed by the expert is included in the corresponding set of predicted colors with the proposed methodology, with two colors.

Figure 26. Examples of bean landraces where the real color observed by the expert is included in the corresponding set of predicted colors with the proposed methodology, with three colors.

Figure 27. Examples of bean landraces where the real color observed by the expert is included in the corresponding set of predicted colors with the proposed methodology, with four colors.

Figure 28. Examples of bean landraces where the two real colors observed by the expert are included in the corresponding set of predicted colors with the proposed methodology, with three colors.

Figure 29. Examples of bean landraces where the two real colors observed by the expert are included in the corresponding set of predicted colors with the proposed methodology, with four colors.

Figure 30. Examples of bean landraces where the three real colors observed by the expert are included in the corresponding set of predicted colors with the proposed methodology, with four colors.

Figure 31. Bean landrace where the three real colors observed by the expert are included in the corresponding set of predicted colors with the proposed methodology, with five colors.

Figure 32. Examples of bean landraces where the predicted color with the proposed methodology is included in the corresponding set of real colors observed by the expert, with three colors.

Figure 33. Examples of bean landraces where the two predicted colors with the proposed methodology are included in the corresponding set of real colors observed by the expert, with three colors.

Figure 34. Examples of bean landraces where the set of predicted colors with the proposed methodology and the set of real colors observed by the expert have one or two colors in common.

Figure 35. Examples of training datasets considered in the experiments of the proposed methodology.

Table 1. Number of bean landraces that have one, two, or three colors on their seeds.

Number of Colors	Number of Landraces
One	130
Two	42
Three	34
Total	206

Table 2. Number of bean landrace seeds used in the experimental setup and training dataset, categorized by color.

Color	Number of Seeds	Number of Seeds for Training (10%)
Black	108	11 *
Brown	135	14 *
Pink	48	5 *
Purple	110	11
Red	166	17 *
White	66	7 *
Yellow	129	13 *
Total	762	78

* Numbers are rounded to the nearest integer.

Table 3. Computer specifications.

Operating System	Windows 11 Pro 23H2
RAM	64 GB
Processor	AMD Ryzen 5 5600G
Processor speed	3.90 GHz

Table 4. Results of the experiments considering bean landraces with seeds of one and two colors, searching for 7 Gaussian components with the GI algorithm.

K	Min	Max	Mean ± St.D.	Train/Test Mean Time (min)	p-Value (Shapiro-Wilk)
1	0.7209	0.9477	0.8974 ± 0.0651	4.05/8.67	2.51 × 10⁻⁵ *
2	0.7384	0.9419	0.9041 ± 0.0437	4.34/8.40	1.50 × 10⁻⁵ *
3	0.7529	0.9477	0.8953 ± 0.0645	4.17/8.45	3.16 × 10⁻⁵ *
4	0.7616	0.9506	0.9161 ± 0.0414	4.04/8.45	1.88 × 10⁻⁵ *
5	0.7878	0.9477	0.9164 ± 0.0400	4.18/8.47	2.37 × 10⁻⁶ *
6	0.8488	0.9477	0.9256 ± 0.0236	4.06/8.42	3.46 × 10⁻⁵ *
7	0.8256	0.9564	0.9230 ± 0.0279	4.07/8.46	5.92 × 10⁻⁴ *
8	0.7471	0.9622	0.9231 ± 0.0435	4.09/8.48	5.09 × 10⁻⁷ *
9	0.7500	0.9622	0.9259 ± 0.0437	4.12/8.50	6.95 × 10⁻⁷ *
10	0.8692	0.9506	0.9279 ± 0.0192	4.16/8.52	8.05 × 10⁻³ *

* Cases where there is a 95% confidence that the precisions do not fit a normal distribution.

Table 5. Analysis of labels obtained with the proposed methodology considering bean landraces with one- and two-color seeds.

Description	Number of Bean Landraces
Same colors	99
Predicted colors contain real colors	60
Real colors contain predicted colors	3
Some colors are the same	10
Different colors	0
Total	172
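The five categories of Table 5 correspond to simple relations between the set of colors given by the expert and the set predicted by the methodology. A minimal sketch of that comparison follows. The function name is hypothetical and the category strings are copied from the table.

```python
def compare_labels(real, predicted):
    """Return the Table 5 category for a pair of color sets.

    real: colors observed by the expert; predicted: colors assigned by
    the methodology. Both are converted to sets before comparison.
    """
    real, predicted = set(real), set(predicted)
    if real == predicted:
        return "Same colors"
    if real < predicted:        # real is a proper subset of predicted
        return "Predicted colors contain real colors"
    if predicted < real:        # predicted is a proper subset of real
        return "Real colors contain predicted colors"
    if real & predicted:        # partial overlap only
        return "Some colors are the same"
    return "Different colors"


if __name__ == "__main__":
    print(compare_labels({"black"}, {"black", "brown"}))
    print(compare_labels({"red", "pink"}, {"red", "purple"}))
```

Counting the category returned for each bean landrace gives a summary of the same form as the table.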

Table 6. Analysis of bean landraces where the real colors are contained in the predicted colors.

Real Colors	Predicted Colors	Number of Bean Landraces
1	2	33
1	3	7
2	3	16
2	4	4

Table 7. Results of the experiments considering bean landraces with seeds of one, two and three colors, searching for 7 Gaussian components with the GI algorithm.

K	Min	Max	Mean ± St.D.	Train/Test Mean Time (min)	p-Value (Shapiro-Wilk)
1	0.6610	0.8989	0.8408 ± 0.0656	4.17/10.37	9.56 × 10⁻⁵ *
2	0.6731	0.8859	0.8459 ± 0.0564	4.19/10.36	1.73 × 10⁻⁵ *
3	0.6683	0.8859	0.8503 ± 0.0568	4.32/10.38	3.37 × 10⁻⁶ *
4	0.6820	0.8908	0.8590 ± 0.0524	4.29/10.42	1.12 × 10⁻⁶ *
5	0.7112	0.9094	0.8704 ± 0.0403	4.19/10.47	1.64 × 10⁻⁶ *
6	0.7751	0.8997	0.8725 ± 0.0268	4.20/10.50	1.04 × 10⁻⁴ *
7	0.7961	0.9037	0.8765 ± 0.0235	4.34/10.50	9.22 × 10⁻⁴ *
8	0.8139	0.9013	0.8741 ± 0.0207	4.20/10.51	8.09 × 10⁻³ *
9	0.8390	0.9013	0.8765 ± 0.0166	4.27/10.60	3.30 × 10⁻¹ *
10	0.8341	0.8989	0.8739 ± 0.0173	4.29/10.64	2.94 × 10⁻¹ *

* Cases where there is a 95% confidence that the precisions do not fit a normal distribution.

Table 8. Analysis of labels obtained with the proposed methodology considering bean landraces with one-, two-, and three-color seeds.

Description	Number of Bean Landraces
Same colors	118
Predicted colors contain real colors	47
Real colors contain predicted colors	9
Some colors are the same	32
Different colors	0
Total	206

Table 9. Analysis of bean landraces where the real colors are contained in the predicted colors.

Real Colors	Predicted Colors	Number of Bean Landraces
1	2	18
1	3	5
1	4	3
2	3	11
2	4	5
3	4	4
3	5	1

Table 10. Analysis of bean landraces where the predicted colors are contained in the real colors.

Real Colors	Predicted Colors	Number of Bean Landraces
3	1	2
3	2	7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
