Article

Evaluating LDA and PLS-DA Algorithms for Food Authentication: A Chemometric Perspective

Research and Breeding Institute of Pomology Holovousy Ltd., Holovousy 129, 508 01 Hořice, Czech Republic
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(12), 733; https://doi.org/10.3390/a18120733
Submission received: 13 October 2025 / Revised: 12 November 2025 / Accepted: 15 November 2025 / Published: 21 November 2025
(This article belongs to the Collection Feature Papers in Algorithms)

Abstract

High-dimensional analytical datasets, such as those generated by inductively coupled plasma–mass spectrometry (ICP-MS), require robust computational frameworks for dimensionality reduction, classification, and model validation. This study presents a comparative evaluation of Linear Discriminant Analysis (LDA) and Partial Least Squares Discriminant Analysis (PLS-DA) algorithms applied to multivariate chemometric data for food origin authentication. The research employs a workflow that integrates Principal Component Analysis (PCA) for feature extraction, followed by supervised classification using LDA and PLS-DA. Model performance and stability were systematically assessed. The dataset comprised 28 apple samples from four geographical regions and was processed with normalization, scaling, and transformation prior to modeling. Each model was validated via leave-one-out cross-validation and evaluated using accuracy, sensitivity, specificity, balanced accuracy, detection prevalence, p-value, and Cohen’s Kappa. The results demonstrate that, as a linear projection-based classifier, LDA provides higher robustness and interpretability in small and unbalanced datasets. In contrast, PLS-DA, which is optimized for covariance maximization, exhibits higher apparent sensitivity but lower reproducibility under similar conditions. The study also emphasizes the importance of dimensionality reduction strategies, such as PCA-based variable selection versus latent space extraction in PLS-DA, in controlling overfitting and improving model generalizability. The proposed algorithmic workflow provides a reproducible and statistically sound approach for evaluating discriminant methods in chemometric classification.

1. Introduction

Determining the origins of plant-based foods has become more important in response to concerns about food fraud, traceability, and consumer protection [1,2,3]. Spectrometric methods, particularly inductively coupled plasma mass spectrometry (ICP-MS), generate detailed mineral and isotopic profiles that can classify samples by origin [1,4,5,6,7,8,9]. Although analytically powerful, ICP-MS datasets are usually characterized by specific mathematical properties: they produce high-dimensional feature spaces (typically 15–40 elemental variables) with strong multicollinearity due to geochemical co-occurrence patterns and environmental, agronomic, and genetic factors, combined with relatively small sample sizes dictated by analytical costs and sample availability [10,11,12]. From a computational perspective, such datasets present fundamental algorithmic challenges: the ratio of observations to features approaches or falls below unity, the predictor matrix X exhibits rank deficiency or near-singularity, and correlations among predictors violate the independence assumptions of classical discriminant methods. These conditions necessitate dimensionality reduction strategies and careful algorithm selection to prevent overfitting while maintaining discriminatory power.
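To make the rank-deficiency problem concrete, the following illustrative Python sketch (not part of the original R analysis; the dimensions are hypothetical) shows that when a centered data matrix has fewer observations than features, the cross-product matrix X^T X is singular and cannot be inverted:

```python
import numpy as np

# Hypothetical dimensions: fewer observations (n) than features (p),
# as is typical for small ICP-MS datasets.
rng = np.random.default_rng(0)
n, p = 14, 19
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                 # mean-center the data

gram = X.T @ X                      # p x p cross-product matrix
rank = np.linalg.matrix_rank(gram)
print(rank, "<", p)                 # rank is at most n - 1, so the matrix is singular
```

Any classifier that requires inverting this matrix (such as classical LDA) therefore needs prior dimensionality reduction.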
Chemometrics integrates chemical measurements with algorithmic analysis to provide this framework. It extracts latent structures, assesses variable importance, and validates the predictive performance of computational methods. Apples (Malus domestica Borkh.) are a relevant case study due to their global importance and cultivar-specific variability in mineral uptake, which is influenced by soil geochemistry [13]. Recent studies have validated the effectiveness of ICP-MS for apples and related products [12,14]. This method can be complemented by metabolomics approaches [15]. However, fine-scale origin discrimination remains challenging in areas with similar geomorphological and pedological histories, such as Central Europe [10,16]. Current research focuses on algorithm optimization and improved workflows to increase throughput and reproducibility [12,17,18].
To address the complexity of high-dimensional elemental datasets, a common strategy is to apply Principal Component Analysis (PCA) as an initial exploratory step to reduce dimensionality and reveal latent variance structure. PCA projects correlated variables onto orthogonal components that capture maximum variance. Mathematically, PCA performs an eigendecomposition of the sample covariance matrix Σ = X^T X/(n − 1) of the mean-centered data matrix X, extracting principal components as linear combinations of original variables that maximize explained variance under orthogonality constraints. This process facilitates the identification of clustering, separation, and potential outliers while suppressing noise and redundancy [19]. In food authentication, PCA is an effective preprocessing tool before supervised modeling [20,21]. PCA often forms the basis for subsequent discriminant analysis in spectroscopy-based studies [22,23].
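As an illustration of this decomposition (the study itself used R packages), a minimal Python sketch of PCA via eigendecomposition of the sample covariance matrix might look as follows; the function and variable names are hypothetical:

```python
import numpy as np

def pca(X, k):
    """Return the first k component scores, loadings, and explained-variance ratios."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # center-then-scale
    cov = Z.T @ Z / (Z.shape[0] - 1)                   # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)                   # eigendecomposition (ascending)
    order = np.argsort(vals)[::-1]                     # sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    scores = Z @ vecs[:, :k]                           # component scores
    explained = vals[:k] / vals.sum()                  # explained-variance ratio
    return scores, vecs[:, :k], explained
```

The orthogonality constraint means scores of distinct components are uncorrelated, which is what suppresses the redundancy among collinear elemental variables.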
Following this unsupervised reduction, supervised classification methods can evaluate the discriminatory power of chemical profiles with respect to different class factors, such as geographical origin. Linear Discriminant Analysis (LDA), one of the most well-established techniques, seeks linear combinations of predictors that maximize between-group separation while minimizing within-group variance [24]. Specifically, LDA solves the generalized eigenvalue problem (S_B)w = λ(S_W)w, where S_B and S_W represent the between-class and within-class scatter matrices, respectively. This formulation requires S_W to be invertible, which fails when the number of features, p, approaches or exceeds the number of observations, n, or when features are highly correlated—conditions frequently encountered in ICP-MS datasets. LDA has repeatedly demonstrated strong performance in food authentication studies [25,26]. In contrast, Partial Least Squares Discriminant Analysis (PLS-DA) is designed for scenarios involving multicollinearity or when the number of predictors exceeds the number of observations. PLS-DA circumvents the singularity problem by projecting X and the categorical response Y onto a shared latent space, extracting components t_h that maximize the covariance cov(Xw_h, Yc_h) rather than variance alone. This bilinear decomposition inherently performs dimensionality reduction while optimizing for class discrimination, making it theoretically suitable for high-dimensional, small-sample scenarios [27,28].
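The scatter-matrix formulation of LDA can be sketched as follows (an illustrative Python implementation, not the authors' code; in practice library routines such as `MASS::lda` in R are used). Note the explicit solve against S_W, which is exactly the step that fails in the near-singular regime discussed above:

```python
import numpy as np

def lda_directions(X, y):
    """Return up to (n_classes - 1) discriminant directions from S_W^{-1} S_B."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    S_W = np.zeros((X.shape[1], X.shape[1]))
    S_B = np.zeros_like(S_W)
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)          # within-class scatter
        d = (mc - mu).reshape(-1, 1)
        S_B += len(Xc) * (d @ d.T)              # between-class scatter
    # Requires S_W to be invertible, i.e. enough observations per feature.
    vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[: len(classes) - 1]]
```

Projecting samples onto these directions maximizes the between-class to within-class variance ratio, which is why LDA separability degrades as S_W approaches singularity.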
Recent applications demonstrate that combining PCA for feature reduction with LDA or PLS-DA for classification enables robust discrimination even when regional similarities complicate origin assignment [20,21,25,29]. In apple studies, supervised models have achieved high accuracy. However, their relative performance varies depending on the scale at which distinctions are made (e.g., regional versus local) and the variability arising from interactions between soil characteristics and apple cultivars [13,16]. For example, a study by Zhang et al. [30] compared stepwise LDA with PLS-DA in flash-GC volatile analysis, demonstrating subtle yet reliable discrimination of regional origins. Similar strategies in honey, tea, and vinegar likewise highlight the general usefulness of PCA–LDA/PLS workflows [20,21,25]. Nevertheless, a comprehensive evaluation of the model quality and stability—particularly regarding the trade-offs between apparent accuracy, cross-validated performance, and the detection prevalence in unbalanced, small datasets—remains insufficiently characterized in the literature.
Cross-validation is an integral statistical approach for evaluating the performance and generalization ability of mathematical models. It is implemented by splitting the dataset into a training set, used to build the model, and a validation set, used to test its performance. In addition, classification models can be refined by selecting an appropriate subset of variables, for which several different approaches are available [31,32,33]. Cross-validation is crucial to prevent overfitting, support parameter optimization, and provide unbiased estimates of predictive power [34,35].
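As a minimal illustration of leave-one-out cross-validation (the validation scheme used later in this study), the following Python sketch uses scikit-learn with a stand-in dataset; the dataset and classifier here are placeholders, not the study's data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Placeholder data: each of the n samples is held out once, the model is
# trained on the remaining n - 1, and the held-out sample is predicted.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print(scores.mean())   # fraction of correctly classified left-out samples
```

Leave-one-out is attractive for small datasets such as the one analyzed here because every observation contributes to both training and validation.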
In accordance with the best computational practices in chemometrics, this study presents a rigorous algorithmic framework for comparing LDA and PLS-DA under constrained data conditions typical of analytical chemistry applications. The dataset used in our study exhibits critical dimensionality properties: 28 observations × 19 features (18 minerals plus the 10B/11B isotope ratio), yielding an observation-to-feature ratio of only 1.47, well below the thresholds commonly recommended for stable conventional LDA estimation. The dataset also exhibits multicollinearity and class imbalance (4–13 samples per geographic origin), creating an ideal testbed for evaluating discriminant algorithm robustness.
The properties of our model dataset define a near-singular regime in which the conditions for stable discriminant analysis are compromised, necessitating explicit dimensionality reduction or intrinsic latent variable extraction methods.
Our methodological contribution lies in the systematic comparison of two dimensionality reduction philosophies: (1) explicit feature selection via PCA followed by LDA projection versus (2) implicit dimension reduction through PLS-DA’s latent variable extraction. We evaluate these approaches not merely by classification accuracy but through comprehensive metrics, including balanced accuracy (critical for unbalanced classes), Cohen’s Kappa (accounting for chance agreement), and the detection prevalence (revealing model classification behavior) for a quantitative performance assessment. Dimensionality reduction and careful model selection are necessary to ensure the interpretability, stability, and reproducibility of the resulting models.
This integrative approach addresses specific challenges in the multivariate classification of food origins. Notably, it considers the trade-offs between model accuracy, stability, and the risks of overfitting, which are particularly relevant for small, unbalanced, high-dimensional datasets [27,36,37,38,39]. The study also evaluates the impact of dimensionality reduction strategies by contrasting the intrinsic dimension reduction in PLS-DA with the variable selection of PCA. This comparison aims to optimize model interpretability and robustness. By linking computational rigor with domain-specific considerations, including geographic origin, cultivar traits, and environmental variability, our methodological framework advances the chemometric authentication of biological samples and contributes to the best practices of food provenance analytics.

2. Materials and Methods

2.1. Samples Origin, Data Collection, and Analysis

The apple (Malus domestica) samples analyzed in this study were collected by the Research and Breeding Institute of Pomology Holovousy Ltd. (Holovousy, Czech Republic). A total of 28 authentic apple samples with known geographical origins and cultivars were provided over 4 harvest years (2019–2022). The cultivars included in this study were ‘Gala’ and ‘Golden Delicious’, which originated from either the Czech Republic (17 samples from CZ) or Poland (11 samples from PL). The 28 samples originated from 4 districts: East Bohemia (CZ-EaB—13 samples), South Moravia (CZ-SoM—4 samples), Lower Silesian Voivodeship (PL-LSV—4 samples), and Łódź Voivodeship (PL-LoV—7 samples).
The mineral content of the dry matter in fruits was analyzed for 18 minerals: phosphorus (P), potassium (K), magnesium (Mg), calcium (Ca), boron (B), iron (Fe), manganese (Mn), zinc (Zn), molybdenum (Mo), copper (Cu), sodium (Na), aluminum (Al), lead (Pb), arsenic (As), vanadium (V), cobalt (Co), chromium (Cr), and cadmium (Cd). In terms of isotope ratios, the boron isotope ratio 10B/11B was evaluated. The samples were washed in demineralized water and dried at 50 °C to a constant weight in a laboratory dryer after the surface water droplets had evaporated. The fruit samples were then ground into a powder using a Grindomix GM 200 mill. The mineral nutrient content and isotope ratios were determined using ICP-MS (Agilent 7900, Agilent Technologies, Inc., Santa Clara, CA, USA) after nitric acid digestion in a microwave digestion system (Discover SP-D 80, CEM Corporation, Matthews, NC, USA). For trace analysis, each sample (0.25 g) was mixed with 6 mL of 65% nitric acid and digested at 200 °C for 4 min. This method was specified by the CEM Corporation and verified by the Central Institute for Supervising and Testing in Agriculture, Czech Republic under the number JPP 40033.1 [40]. The original dataset is attached as the appendix at the end of this article (Appendix A).

2.2. Mathematical and Statistical Approach

Prior to modeling, the obtained mineral profiles were statistically preprocessed using a univariate T-test or one-way analysis of variance (T-test/ANOVA) to preselect appropriate diagnostic markers. To this end, the data were analyzed for normality of residuals and homogeneity of variance using the Shapiro–Wilk and Levene tests, respectively. Due to the unequal variance and non-normal distribution of some of the analyzed mineral elements and isotopic ratios, normalization and logarithmic transformation were performed before the analysis of variance. The results of the analysis of variance were then back-transformed using exponential transformation and further compared among the districts using the estimated marginal means test through the “emmeans” library [41].
The adapted mineral profiles were further analyzed using different multivariate tests to reduce dimensionality and allow classification of the samples according to state and district. Binary and multi-class classifications were calculated for the state and district factors, respectively. Prior to the multi-class LDA for districts, principal component analysis (PCA) was used as an exploratory algorithm for discriminant selection and feature reduction. For this purpose, the data were standardized using the center-then-scale method, followed by calculation of the contribution of variables. Only the variables with principal component loadings greater than 0.4 for the first two principal components were selected. Then we used the “drop one” method for the further characterization of feature importance and selection for the analyzed LDA models. Furthermore, the dataset was pre-analyzed for collinearity of the analyzed features using the Variance Inflation Factor (VIF). The results were classified as non-collinear with a VIF < 2, as having low collinearity if 2 < VIF < 5, and as having high collinearity if VIF > 5. The data were further processed using linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLS-DA) to assess the discrimination between or among the particular classes for the two factors. The PLS-DA models were also adapted for both factors by including weighted centering of the separator (PLS-DAw) due to the unbalanced distribution of samples between or among the analyzed classes.
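The VIF screening step can be sketched as follows; this is an illustrative numpy implementation of the textbook definition VIF_j = 1/(1 − R_j²), where R_j² comes from regressing feature j on all remaining features, not the R code used in the study:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X."""
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        Z = np.delete(X, j, axis=1)                    # all other features
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS regression
        resid = y - Z @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))                   # VIF_j = 1 / (1 - R_j^2)
    return np.array(out)

# Thresholds used in the text: VIF < 2 non-collinear, 2-5 low, > 5 high.
```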
The formula used for the weighted centering was as follows:
\[
\sigma_{X_j}^{2} = \frac{\left(X_{\cdot j} - \mu_{X_j}\mathbf{1}\right)^{\top} W \left(X_{\cdot j} - \mu_{X_j}\mathbf{1}\right)}{s}
\]
where \(X_{\cdot j}\) is the column vector for variable \(j\), \(\mu_{X_j} = \mathbf{1}^{\top} W X_{\cdot j}/s\) is its weighted mean, \(W = \mathrm{diag}(w_1, w_2, \ldots, w_n)\) is the diagonal matrix of observation weights, and \(s = \sum_i w_i\) is the total weight.
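A numerical sketch of the weighted mean and weighted variance defined above (our assumed reading of the formula; with unit weights it reduces to the ordinary population statistics):

```python
import numpy as np

def weighted_stats(x, w):
    """Weighted mean and variance of a single variable x with weights w."""
    s = w.sum()                          # total weight s = sum_i w_i
    mu = (w @ x) / s                     # weighted mean  mu = 1^T W x / s
    var = (w @ (x - mu) ** 2) / s        # weighted variance (x - mu)^T W (x - mu) / s
    return mu, var
```

Upweighting observations from the smaller class shifts the center toward that class, which is the purpose of the PLS-DAw adaptation.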
The obtained LDA and PLS models were validated using the “leave-one-out” internal cross-validation method. The models were characterized by statistical metrics such as sensitivity, specificity, detection prevalence, balanced accuracy, p-value, and Cohen’s Kappa. The PLS-DA results were further characterized by the described variance (R2X and R2Y), the predicted variance (Q2Y), and permutation tests for R2Y and Q2Y. Variable importance was assessed using permutation importance (PI) for the LDA models and Variable Importance in Projection (VIP) for the PLS-DA models. For PI, each predictor was subjected to 1000 independent random permutations across samples (random seed = 42). For each permutation, the trained model was evaluated using leave-one-out cross-validation (LOO-CV), and the balanced accuracy was computed. PI for each variable was defined as the median decrease in balanced accuracy (original minus permuted) across the 1000 permutations; the interquartile range (IQR) of these decreases represents the dispersion. The model parameters remained fixed during the permutation runs (no retraining per permutation). For the PLS-DA model, VIP values quantify the relative contribution of each variable and were calculated as the weighted sum of squared PLS loadings, proportional to the amount of Y-variance explained by each latent component. Mathematically, the VIP for variable j was expressed as follows:
\[
\mathrm{VIP}_j = \sqrt{\, p \sum_{h=1}^{H} \frac{SSY_h}{SSY_{\mathrm{total}}}\, w_{jh}^{2} \,}
\]
where \(p\) is the total number of predictors, \(H\) is the number of PLS components, \(SSY_h\) is the variance in \(Y\) explained by the \(h\)-th component, and \(w_{jh}\) is the normalized weight of variable \(j\) in that component. Variables with VIP > 1 were considered significant contributors to class discrimination, whereas variables with VIP < 0.8 were regarded as less influential. This approach complements the permutation importance (PI) analysis used for the LDA model.
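The VIP formula above can be computed directly from the PLS weight matrix and the per-component explained Y-variance, as in this illustrative sketch (in the study itself, VIP values come from the R packages used; the function here is hypothetical):

```python
import numpy as np

def vip(W, ssy):
    """VIP scores.

    W   : (p, H) matrix of normalized PLS weights (unit-norm columns)
    ssy : (H,) Y-variance explained by each latent component
    """
    p, H = W.shape
    # VIP_j = sqrt( p * sum_h (SSY_h / SSY_total) * w_jh^2 )
    return np.sqrt(p * (W ** 2 @ (ssy / ssy.sum())))
```

With unit-norm weight columns, the VIP scores satisfy mean(VIP²) = 1, which is why VIP = 1 serves as the natural significance threshold.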
Figure 1 shows the particular steps of the data processing in sequential order. All statistical analyses were performed in R software version 4.5.1 using the following libraries: “ropls”, “mdatools”, “pls”, “caret”, “irr”, “MASS”, and “ggplot2” [42,43,44,45,46,47,48].
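The permutation-importance procedure described above can be sketched as follows. This is an illustrative Python version (the study used R); for simplicity it refits the model within each LOO fold rather than keeping parameters fully fixed, and uses far fewer permutations than the 1000 reported:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

def loo_balanced_accuracy(X, y):
    """LOO-CV balanced accuracy of an LDA classifier (stand-in model)."""
    pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
    return balanced_accuracy_score(y, pred)

def permutation_importance(X, y, n_perm=50, seed=42):
    """Median drop in LOO balanced accuracy when each predictor is permuted."""
    rng = np.random.default_rng(seed)
    base = loo_balanced_accuracy(X, y)
    pi = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_perm):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])       # break feature-class link
            drops.append(base - loo_balanced_accuracy(Xp, y))
        pi.append(np.median(drops))
    return np.array(pi)
```

A large median drop means the model's discrimination genuinely depends on that variable, rather than on chance correlation.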

3. Results and Discussion

3.1. Classification Between Two Groups with Unequal Sample Size

The results of the T-test showed a significant difference between the classes for P, B, Mo, Cu, As, and 10B/11B. The VIF test showed only minor collinearity of the variables P = 3.28, K = 2.95, Co = 2.04, and As = 2.07. All the other features demonstrated no collinearity at all. Consequently, these results allowed us to retain all variables to build the LDA models. Initially, the prevalence of the samples in the dataset was 0.607 for the Czech samples (CZ) and 0.393 for the Polish samples (PL). For this classification, four models were calculated: LDA, a reduced LDA-sub model that kept just a subset of variables (B, Mo, and Cu), PLS-DA, and PLS-DAw (with weighted centering of the separator). The detection prevalence for Czech samples with LDA-sub was identical to the initial prevalence of the dataset (Table 1). The detection prevalence of the full LDA was 0.571, while it was substantially lower with both PLS-DA models. Furthermore, the results showed that, after cross-validation, binary LDA and LDA-sub for the dataset using the state as a factor provided models with very high sensitivities of 0.941 and 1.000, respectively, and a specificity of 1.000 for both models. The sensitivities and specificities of PLS-DA and PLS-DAw were identical and at the same time lower than those of both LDA models. This result was reflected by the high balanced accuracy and Cohen’s Kappa coefficient, especially for the LDA-sub model. LDA was the only model that showed significant separation between the two classes.
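For reference, the binary metrics reported in Table 1 can all be derived from a 2 × 2 confusion matrix; the sketch below is an illustrative Python implementation (names are hypothetical) rather than the caret output used in the study:

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity, balanced accuracy, detection prevalence, Kappa."""
    tp = np.sum((y_true == positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    n = tp + tn + fp + fn
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / n
    # Expected chance agreement from the marginal totals (Cohen's Kappa).
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "sensitivity": sens,
        "specificity": spec,
        "balanced_accuracy": (sens + spec) / 2,
        "detection_prevalence": (tp + fp) / n,
        "kappa": (acc - p_e) / (1 - p_e),
    }
```

Because Kappa discounts the agreement expected by chance from class imbalance, it is a more honest summary than raw accuracy for a 17:11 split.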
PLS-DA is considered an alternative to LDA when the sample size is smaller than the number of variables and dimensionality reduction is required [27,36,39]. PLS-DA aims to project high-dimensional data into a lower-dimensional subspace that maximizes the total covariance between the data vectors and their class membership [49]. One pitfall of this algorithm is that it shifts the separator toward the centroid of the more numerous class, which can provide potentially misleading results when the sample sizes between the classes differ [37]. However, since the mean-centered PLS-DA and weighted mean-centered PLS-DAw models provided the same results, this shift likely did not affect the sample separation in our dataset. Furthermore, our dataset comprises more samples per class than variables (11–17:6), and both permutation importance and variable importance projection confirmed the impact of the variables chosen by ANOVA (P, Mo, Cu, As, B, and 10B/11B) on the model performance (Table 2). On the other hand, the selection of variables using the “drop one” method can further improve the reliability of those models. The most important variables were copper, boron, and molybdenum, while arsenic, phosphorus, and the boron isotopic ratio 10B/11B seemed to be less important. These facts support the strong performance of our LDA model with only one component, which provided a more reliable classification of the samples between the two states (Figure 2 and Figure 3) than the two PLS-DA models. Since the PLS-DA algorithm is known to perform differently than LDA when only one component is used [37], it is likely that LDA becomes more reliable when the dataset is small, with a relatively narrow class distribution and only one component.

3.2. Classification Between Multiple Groups with Unequal Sample Size

Data preprocessing using ANOVA revealed significant differences among the classes for P, K, B, Mn, Co, Mo, Cu, As, and 10B/11B. Multiple-group classifications were calculated for the four districts using the LDA and PLS-DA models (Table 3). The initial prevalence (proportion of samples per district) of the dataset was 13 samples for CZ-EaB (0.464), 4 for CZ-SoM (0.143), 7 for PL-LoV (0.250), and 4 for PL-LSV (0.143). All the models returned three significant components. The problem was the low number of samples per class compared to the relatively high number of variables (4–13:9) identified by the ANOVA test. This is an obvious issue when calculating LDA because, in such cases, the algorithm is susceptible to overestimation of the model [28,36]. To illustrate the results, we calculated a full LDA model and a reduced LDA-sub model that kept just a subset of variables (P, B, Mn, Cu), obtained by PCA dimension reduction and further “drop one” variable selection, as described in the Materials and Methods Section. The selected variables contrast somewhat with the results of the permutation importance of variables calculated for LDA (Table 4). Both LDA models consistently demonstrated a higher sensitivity for the CZ-EaB and PL-LoV classes, whereas sensitivity was only 50% for the CZ-SoM class. Moreover, the greatest difference between the LDA and LDA-sub models was observed in the CZ-EaB and PL-LSV classes. The balanced accuracy of the LDA-sub model was slightly higher, mirroring the pattern observed for sensitivity. On the other hand, the LDA-sub model had a detection prevalence similar to that found in the full LDA. Consequently, variable selection through PCA component reduction with “drop one” variable selection proved to rationalize the LDA model performance, as recommended by many authors [22,23], but the small dataset may compromise the model’s reliability for sample classification.
Comparing the results of LDA and PLS-DA reveals a significant increase in the overall accuracy, particularly due to the improved sensitivity of the PLS-DA models, which ranges from 0.786 to 0.929 (Table 3). To understand the difference, it is important to note that the LDA algorithm derives linear combinations of predictors that maximize between-class variance and minimize within-class variance. In contrast, PLS finds latent components that maximize the covariance between the predictors and the class indicator matrix [27,50]. The dimension reduction incorporated in the PLS algorithm relies on the between-groups sums-of-squares and cross-products matrix, whereas PCA relies on the sample (total) variance/covariance matrix [27]. Therefore, a decreased number of samples per class in small datasets evidently enhances the share of variability attributable to individual samples (Figure 4), which in particular cases may compromise the stability [36] and accuracy of the LDA model [27,28]. Nevertheless, when we examine the detection prevalence of all four models, the results of the LDA-sub model were the closest to those identified in the original dataset. The PLS-DA model algorithm looks for the variables that correlate best with the classifier. In a small dataset, this may lead to correlations that arise purely by chance [36]. Although this algorithm can cope with a higher number of variables than samples [39], there is a risk of overoptimistic results [37]. In our study, it is thus evident that PLS-DA provides higher model sensitivity but correctly classifies only three of the four classes (Figure 5). After weighted centering, the PLS-DAw model decreased the detection prevalence, allowing differentiation of only two of the four classes.
Consequently, the PLS-DA model may provide higher accuracy, but it carries a higher risk of lower stability in cases of small, unbalanced datasets with relatively close sample distributions among the observed classes. Therefore, one-sided reliance on the misclassification rate as the basic tool for evaluating model performance should be approached with caution.

4. Conclusions

Building a reliable classification model requires appropriate data preprocessing as a foundational step (Figure 1). This includes selecting initial variables through univariate tests (T-test/ANOVA), multivariate approaches (PCA), or other complementary approaches for variable selection to reduce dimensionality according to the ratio of the number of samples to the number of variables and the intended classification algorithm. Model selection depends on both data quality and algorithm-specific performance characteristics. Rigorous cross-validation is essential for the valid interpretation of model outputs. Model performance and reliability should be assessed not only through accuracy metrics and misclassification rates but also through statistical significance and detection prevalence, which provide insight into the underlying classification structure. For small, unbalanced datasets with closely distributed classification clusters, LDA consistently produces more reliable models, though not necessarily with higher apparent accuracy. In single-component scenarios, PLS-DA fails to match LDA performance. Therefore, LDA should be prioritized when computationally feasible. Even multi-class PLS-DA models with an equivalent number of significant components require cautious interpretation when applied to such datasets. Successful interpretation and generalization of model results require a consideration of model stability and reproducibility within the broader experimental context. Specifically, factors such as the geographical origin of the samples, the selection of species and cultivars, and agricultural management practices can substantially influence the mineral composition of fruit, thereby altering the discriminatory importance of the analyzed elements and isotopic ratios in the classification framework.
Our work makes an algorithmic and statistical contribution by addressing the selection of appropriate methods under mathematically non-ideal yet real, natural conditions, a fundamental problem in modern scientific statistics and machine learning.

Author Contributions

Conceptualization, M.M. and J.S.; methodology, M.M., J.S., T.B. and A.V.; software, M.M.; validation, M.M., T.B. and A.V.; formal analysis, M.M.; investigation, J.S.; resources, J.S.; data curation, M.M. and J.S.; writing—original draft preparation, M.M., J.S., T.B. and A.V.; writing—review and editing, M.M. and J.S.; visualization, M.M.; supervision, J.S.; project administration, J.S.; funding acquisition, J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Agriculture of the Czech Republic (grant numbers RO1525 and NAZV QK1910104).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

We thank František Švec for the English proofreading. We thank Matěj Semerák for technical help with the formatting of the article.

Conflicts of Interest

The authors declare no conflicts of interest. All authors are employees of the Research and Breeding Institute of Pomology Holovousy Ltd. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Table A1. Dataset of samples mineral profiles resulting from ICP-MS of apples gathered in 2019–2022.
P | K | B | Mn | Cu | Co | As | Mo | 10B/11B | Locality | State
567.02257314.24311.370273.236512.492510.01440.026320.079840.1808CZ-SoMCZ
426.81877104.6986.072673.377181.882540.011740.021280.060240.182PL-LoVPL
562.51948042.6717.857813.714582.03860.007920.023880.063350.2015PL-LoVPL
613.1227923.647.945413.807851.208590.0080.019410.059440.1843PL-LoVPL
436.12355554.11214.096811.293521.330920.00420.018740.034120.1823PL-LSVPL
587.91156395.02817.670482.997261.565710.015690.0360.070720.183CZ-EaBCZ
714.33547889.2815.206563.592872.433520.005170.022390.059680.2009CZ-EaBCZ
636.31137306.94816.111352.852191.69980.004420.020320.087370.227CZ-EaBCZ
548.20056330.34413.257362.046122.193650.011640.018590.074360.221CZ-SoMCZ
608.82516790.08812.488652.236841.868870.002720.016150.084250.213CZ-EaBCZ
585.54386148.05618.946181.991781.532780.102020.045840.045940.223CZ-EaBCZ
711.86217864.70715.591411.948462.287180.007180.027920.126210.2096CZ-EaBCZ
455.99136325.8998.575091.669621.378180.008140.02380.041760.1915PL-LSVPL
662.52358427.5866.940063.021361.128050.005870.019880.050410.1893PL-LoVPL
491.02736271.04610.186242.000481.682660.009730.028580.083590.1806CZ-EaBCZ
489.69186059.4928.827462.50812.771630.024820.03580.082390.1815CZ-SoMCZ
513.38928958.22612.65082.946442.089150.004450.02520.039320.2021PL-LoVPL
484.81527188.74211.545413.041171.172560.008150.018510.070940.1832PL-LoVPL
438.48066240.84713.667471.948961.745490.037170.017270.046030.1827PL-LSVPL
530.87466736.30120.835433.999581.552710.014670.033680.033770.1905CZ-EaBCZ
734.96678666.11116.48173.139252.589110.004490.024590.040350.2008CZ-EaBCZ
575.37697191.59122.138322.609212.198070.003340.020270.043120.224CZ-EaBCZ
484.92516108.11710.512932.170311.815530.008010.017670.084460.2CZ-SoMCZ
677.71269169.34613.539652.823481.430750.00510.02330.076310.228CZ-EaBCZ
606.80757597.72418.658383.018451.266880.024090.042330.081460.225CZ-EaBCZ
630.32626904.6413.238872.141491.499260.005230.031320.095430.1966CZ-EaBCZ
365.96686396.27511.964651.983251.577320.007570.024340.041110.193PL-LSVPL
429.33237065.086.895633.160381.012860.007760.025520.048010.1903PL-LoVPL

Figure 1. Algorithmic workflow for chemometric data processing and supervised classification. It integrates the following algorithms for multivariate feature extraction, model optimization, and validation: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Partial Least Squares Discriminant Analysis (PLS-DA). The color coding indicates: blue boxes—input data; green boxes—processing steps including preprocessing and transformation procedures; yellow boxes—classification decision points (binary vs. multi-class); red boxes—final discriminant models.
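The Figure 1 workflow (preprocessing, PCA feature extraction, then supervised classification validated by leave-one-out cross-validation) can be sketched as follows. This is a minimal illustration in Python with scikit-learn rather than the authors' R implementation, and the data are synthetic placeholders standing in for the 28-sample elemental dataset:

```python
# Sketch of the PCA -> LDA branch of the Figure 1 workflow with LOO-CV.
# Synthetic data; dimensions mimic the study (28 samples, 9 variables).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(28, 9))            # 28 samples x 9 elemental variables
y = np.repeat(["CZ", "PL"], 14)         # two-country labels (placeholder split)
X[y == "CZ"] += 1.5                     # inject a class separation for the demo

# Standardize, reduce to a few principal components, then fit LDA.
model = make_pipeline(StandardScaler(), PCA(n_components=5),
                      LinearDiscriminantAnalysis())
pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
accuracy = float(np.mean(pred == y))
print(f"LOO-CV accuracy: {accuracy:.3f}")
```

Because the scaler and PCA sit inside the pipeline, they are refitted on each LOO training fold, which avoids the information leakage that inflates apparent accuracy when preprocessing is done once on the full dataset.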
Figure 2. Histograms of the discriminant function scores of the (A) LDA and (B) LDA-sub models for the Czech and Polish samples.
Figure 3. Orthogonal distance, regression coefficients, misclassification rate, and prediction plot of the first component for the cross-validated PLS-DAw model used for the country classification.
Figure 4. Score plots (A) of the LDA (left) and LDA-sub (right) models, accompanied by group means, confidence intervals, and leave-one-out cross-validation (LOO-CV) results, and histograms (B) of the first discriminant function scores of both LDA models for each district.
Figure 5. Orthogonal distance, regression coefficients, misclassification rate, and prediction plots for all three non-zero components of the cross-validated (A) PLS-DA and (B) PLS-DAw models used for the district classification.
Table 1. Comparison of the model sensitivity, specificity, detection prevalence, balanced accuracy, p-value, and Cohen's Kappa of the binary models LDA, LDA-sub (model with a subset of variables), PLS-DA, and weighted PLS-DA (PLS-DAw) for the Czech samples class.
                     LDA      LDA-sub  PLS-DA   PLS-DAw
Sensitivity          0.941    1.000    0.824    0.824
Specificity          1.000    1.000    0.636    0.636
Bal. accuracy        0.971    1.000    0.701    0.701
Detec. prevalence    0.571    0.607    0.500    0.500
p-value/pR2Y/pQ2Y    <0.001   <0.001   0.167    0.167
Kappa                0.926    1.000    0.401    0.401
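Each column of Table 1 follows directly from a binary confusion matrix of the kind produced by caret's confusionMatrix in R. A minimal sketch in Python (the class counts below, 17 Czech and 11 Polish samples, are an assumption chosen so that the result reproduces the LDA column, not figures taken from the paper):

```python
# Deriving sensitivity, specificity, balanced accuracy, and Cohen's Kappa
# from a binary confusion matrix (tp/fn/fp/tn counts are illustrative).
def binary_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, balanced accuracy, Cohen's kappa)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    n = tp + fn + fp + tn
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2   # chance agreement
    kappa = (po - pe) / (1 - pe)
    return sens, spec, (sens + spec) / 2, kappa

# Assumed example: 16 of 17 Czech samples and all 11 Polish samples correct.
sens, spec, bal_acc, kappa = binary_metrics(tp=16, fn=1, fp=0, tn=11)
print(round(sens, 3), round(spec, 3), round(bal_acc, 3), round(kappa, 3))
# -> 0.941 1.0 0.971 0.926
```

Kappa corrects the observed agreement for the agreement expected by chance under the row and column marginals, which is why it drops much faster than accuracy when one class dominates the predictions.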
Table 2. Permutation importance (PI) of the selected variables and the variable importance in projection (VIP) for the LDA and PLS-DA models, described with the median and the interquartile range (IQR).
LDA model                       PLS-DA model
Variable  PI Median  IQR       Variable  VIP Median  IQR     Freq_VIP
P         0.1429     0.0714    B         1.3924      0.1292  1.00
Mo        0.0714     0.0714    P         1.2111      0.1909  1.00
B         0.0358     0.0357    10B/11B   1.1026      0.1106  0.80
Cu        0.0357     0.0357    Mo        1.0675      0.2703  1.00
As        0.0357     0.0357    Cu        1.0601      0.1296  0.60
10B/11B   0.0357     0.0357    As        0.9666      0.0523  0.20
Table 3. Comparison of the model sensitivity, specificity, detection prevalence, balanced accuracy, p-value, and Cohen’s Kappa of the multi-class models LDA, LDA-sub (model with a subset of variables), PLS-DA, and weighted PLS-DA (PLS-DAw) for the classes of the four sampled districts.
Model     Districts  Sensitivity  Specificity  Detec. Prevalence  Bal. Accuracy  p-Value  Kappa
LDA       CZ-EaB     0.692        0.867        0.393              0.780          0.007    0.592
          CZ-SoM     0.500        0.917        0.143              0.708
          PL-LoV     0.857        0.905        0.286              0.881
          PL-LSV     0.750        0.917        0.179              0.833
LDA-sub   CZ-EaB     0.923        0.933        0.464              0.928          <0.001   0.791
          CZ-SoM     0.500        0.958        0.107              0.729
          PL-LoV     0.857        1.000        0.214              0.929
          PL-LSV     1.000        0.917        0.214              0.958
PLS-DA    CZ-EaB     0.893        0.933        0.392              0.913          <0.001   0.895
          CZ-SoM     0.786        0.917        0.000              0.851
          PL-LoV     0.929        0.952        0.214              0.941
          PL-LSV     0.929        0.958        0.107              0.944
PLS-DAw   CZ-EaB     0.923        0.933        0.429              0.929          0.041    0.780
          CZ-SoM     0.000        1.000        0.000              0.857
          PL-LoV     0.857        0.905        0.214              0.893
          PL-LSV     0.000        1.000        0.000              0.857
Table 4. Permutation importance (PI) of the selected variables and the variable importance in projection (VIP) for the LDA and PLS-DA models, described with the median and interquartile range (IQR).
LDA model                       PLS-DA model
Variable  PI Median  IQR       Variable  VIP Median  IQR     Freq_VIP
K         6.0892     7.5709    B         1.3453      0.1112  0.95
P         5.1379     6.3081    Mn        1.2393      0.1251  1.00
Mn        4.7401     4.1554    P         1.2372      0.1049  0.96
Co        4.5024     8.0422    K         1.1443      0.1239  0.01
B         4.1656     4.4187    10B/11B   0.9528      0.1403  1.00
Mo        4.0034     4.1890    As        0.7732      0.1695  0.03
As        3.5151     4.4595    Mo        0.7715      0.2022  0.02
Cu        3.4724     3.2303    Cu        0.6587      0.2722  0.02
10B/11B   3.4369     4.1250    Co        0.3743      0.1184  0.32
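Permutation importance of the kind reported for the LDA model measures how much a performance score drops when a single variable's values are shuffled, breaking its association with the class labels. A minimal sketch on synthetic multi-class data using scikit-learn's implementation (an assumption for illustration; the authors' R workflow may compute PI differently):

```python
# Permutation importance for an LDA classifier on a four-class problem.
# Synthetic data; the second variable is constructed to carry the signal.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(28, 9))
y = np.repeat([0, 1, 2, 3], 7)        # four district-like labels
X[:, 1] += y                          # inject class information into variable 1

lda = LinearDiscriminantAnalysis().fit(X, y)
result = permutation_importance(lda, X, y, n_repeats=50, random_state=0)
print(result.importances_mean.round(3))  # mean accuracy drop per shuffled variable
```

Because each variable is permuted independently, strongly correlated predictors (such as an element and its isotope ratio) can share importance, which is one reason PI and VIP rankings need not coincide.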
Share and Cite

Mészáros, M.; Sedlák, J.; Bílek, T.; Vávra, A. Evaluating LDA and PLS-DA Algorithms for Food Authentication: A Chemometric Perspective. Algorithms 2025, 18, 733. https://doi.org/10.3390/a18120733
