1. Introduction
Global concerns regarding food fraud have intensified, particularly in the meat sector, where issues such as species substitution, origin misrepresentation, and mislabeling of halal or organic claims are increasingly reported [1,2]. Economically motivated adulteration poses risks that extend beyond consumer deception, encompassing religious sensitivities, nutritional misrepresentation, and even potential toxicological hazards. For example, pork and its derivatives are considered haram in Muslim communities, making reliable detection crucial for halal certification [3]. At the same time, adulteration undermines local economies when high-value meats such as beef or indigenous pork are replaced with cheaper alternatives.
Traditional methods for meat authentication rely on DNA-based assays, such as the polymerase chain reaction (PCR), and protein-based techniques such as the enzyme-linked immunosorbent assay (ELISA). While sensitive and specific, these techniques are often hindered by high operational costs, laborious sample preparation, and reduced applicability to processed or thermally treated meat, where DNA and proteins may degrade [4]. These limitations have motivated the development of vibrational spectroscopic approaches, particularly Fourier-transform infrared (FTIR) spectroscopy, as rapid, non-destructive, and cost-efficient alternatives [5]. FTIR spectroscopy works by measuring the absorption of infrared (IR) radiation by molecular bonds, producing characteristic spectral fingerprints. Its operational range spans the near-infrared (14,000–4000 cm−1), mid-infrared (4000–400 cm−1), and far-infrared (400–50 cm−1) regions, of which the mid-infrared (MIR) region is most relevant for food authentication, since it contains the fundamental vibrational frequencies of lipids, proteins, and nucleic acids [6,7]. Modern FTIR instruments utilize interferometers, typically based on the Michelson design, which allow the multiplexing of wavelengths and improve resolution and signal-to-noise ratios compared to dispersive IR systems [8]. Importantly, the application of attenuated total reflectance (ATR)-FTIR has simplified sample handling by allowing the direct measurement of intact or heterogeneous samples without extensive preprocessing. Crystals such as diamond, ZnSe, or Ge facilitate the penetration of IR radiation into the sample surface, enabling the analysis of soft tissues, lipids, and protein-rich matrices [7].
In meat analysis, ATR-FTIR spectra commonly exhibit diagnostic peaks: 3000–2800 cm−1 (lipid CH stretching), ~1745 cm−1 (triglyceride carbonyls), 1650–1540 cm−1 (protein Amide I and II bands), and 1200–1000 cm−1 (nucleic acids and phospholipids). The application of chemometrics has significantly advanced the interpretability of FTIR spectra. Chemometrics refers to the integration of mathematical and statistical tools into chemical analysis to extract meaningful information from complex datasets [8]. Techniques such as principal component analysis (PCA) and partial least squares (PLS) regression have been used to classify species, identify adulterants, and even quantify adulteration levels in meat products. For example, studies demonstrated that ATR-FTIR combined with PLS regression achieved coefficients of determination (R2) greater than 0.99 in quantifying lard in butter and differentiating beef sausages adulterated with pork fat [9,10]. Other approaches such as PLS-DA, SIMCA, and support vector machines (SVM) have further enhanced the classification accuracy in multi-class meat authentication problems, sometimes reaching accuracies above 98% [11]. Internationally, the use of ATR-FTIR combined with chemometrics has extended beyond meat to fats, oils, dairy, and functional foods. A comprehensive review of over two decades of studies revealed that edible fats and oils were among the most adulterated food categories, with FTIR emerging as one of the most reliable fingerprinting tools when coupled with multivariate analysis [12]. In meat applications, FTIR has been applied for detecting pork adulteration in beef meatballs, lamb sausages, and mixed minced meats, with detection limits often as low as 1–2% substitution [13,14]. Importantly, portable ATR-FTIR and diffuse reflectance (DR)-FTIR devices have recently been evaluated for on-site authenticity testing, achieving classification accuracies of up to 100% when coupled with SVM models [15].
In Malta, the case of pork authentication carries unique socio-economic and cultural significance. Historically, pork was a dietary staple, and its supply was severely disrupted during outbreaks of African Swine Fever. Although the sector has since recovered, a new challenge has emerged: competition from imported pork products that are often cheaper but of lower quality. Current slaughter rates in Malta stand at approximately 1600 pigs per week, a sharp decline from 2400 in recent years, despite consumption levels remaining constant [16]. The shortfall has been filled by imports, raising concerns about both quality and authenticity. For a small island nation, where pork holds cultural value and represents a critical component of local agriculture, the risks of adulteration, whether through species substitution, origin misrepresentation, or false labelling, have significant implications for the economy and for consumer trust. Finally, although the European Pharmacopoeia has begun incorporating chemometric methods into analytical chapters, their routine adoption in European food industries remains limited [17]. This highlights a gap between methodological innovation and industrial practice. Addressing this gap in the Maltese context through the integration of ATR-FTIR with chemometrics can provide a rapid, non-destructive, and cost-effective solution for pork authentication. The present study, therefore, seeks to pioneer the application of these methods to Maltese pork, ensuring authenticity, strengthening regulatory oversight, and reinforcing consumer confidence in local meat production.
2. Materials and Methods
2.1. Pork Samples and Preparation
A total of 116 Maltese pork samples, consisting of both loin and belly cuts, were obtained directly from KIM (Koperattiva Industijali tal-Majjal, Marsa, Malta). Samples were transported to the laboratory under chilled conditions (4 °C) to prevent degradation and were then stored in a freezer at −15 °C until analysis. A total of 53 foreign pork samples, likewise consisting of loin and belly cuts, were collected and stored at −15 °C until analysis. Before laboratory analysis, both local and foreign pork samples were freeze-dried (BioBase, BK-FD10PT, Jinan, China) for 3 days at −68 °C. After freeze-drying, visible skin, fat, and connective tissue that could interfere with the analysis were excised, and about 100 g of meat was then homogenized and ground with dry ice in a 1:1 ratio, which minimizes unwanted heat generation due to friction.
2.2. ATR-FTIR Measurement
ATR-FTIR measurements were performed using an IRAffinity-1 Shimadzu spectrometer equipped with an attenuated total reflectance (ATR) accessory (Shimadzu, Kyoto, Japan). The instrument was switched on and allowed to stabilize for 30 min prior to analysis. A background spectrum was first acquired (45 scans), followed by measurement of the validation disk (45 scans) to confirm instrument stability and performance. Before each analysis, the ATR crystal surface was thoroughly cleaned with isopropyl alcohol (Biochem Chemopharma, Cosne-Cours-sur-Loire, France) and dried to prevent cross-contamination. Samples were then placed in firm contact with the ATR crystal to ensure optimal penetration of IR radiation. For each sample, spectra were recorded over the 400–5000 cm−1 wavenumber range at a resolution of 2 cm−1, with 45 co-added scans to improve the signal-to-noise ratio. To account for sample heterogeneity and improve reproducibility, three replicate spectra were collected per sample, with the sample being removed and repositioned on the crystal between replicates. After each measurement, the ATR surface was re-cleaned with isopropyl alcohol and dried, and a new background scan was acquired before proceeding to the next sample. To minimize spectral distortion, wavenumber regions associated with atmospheric CO2 (2390–2250 cm−1) and water vapor (3400–3200 cm−1) were excluded from further analysis. Two spectral datasets were prepared for chemometric analysis: (i) the fingerprint region (1800–600 cm−1), selected for its high specificity to functional group vibrations of proteins, lipids, and nucleic acids; and (ii) the full mid-infrared (MIR) region (4000–500 cm−1), which retains broad spectral information while excluding the CO2 and possible water interference zones.
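The construction of the two spectral datasets can be illustrated with a short Python sketch. It is not the authors' code: the wavenumber axis, the placeholder spectra, and the helper name `region_mask` are assumptions made for illustration, while the region limits and the excluded CO2 and water windows are those stated above.

```python
import numpy as np

def region_mask(wavenumbers, keep, exclude=()):
    """Boolean mask selecting points inside `keep` and outside every `exclude` window."""
    wn = np.asarray(wavenumbers, dtype=float)
    mask = (wn >= keep[0]) & (wn <= keep[1])
    for lo, hi in exclude:
        mask &= ~((wn >= lo) & (wn <= hi))
    return mask

# Hypothetical wavenumber axis (1 cm-1 spacing) and random placeholder spectra.
wavenumbers = np.arange(400.0, 5001.0, 1.0)
spectra = np.random.default_rng(0).random((6, wavenumbers.size))

# Windows excluded in the text: atmospheric CO2 and water vapor.
EXCLUDED = [(2250.0, 2390.0), (3200.0, 3400.0)]

fingerprint = region_mask(wavenumbers, keep=(600.0, 1800.0), exclude=EXCLUDED)
full_mir = region_mask(wavenumbers, keep=(500.0, 4000.0), exclude=EXCLUDED)

X_fingerprint = spectra[:, fingerprint]   # fingerprint-region dataset
X_full_mir = spectra[:, full_mir]         # full MIR dataset without excluded windows
```

The excluded windows do not overlap the fingerprint region, so they only shorten the full MIR dataset.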
2.3. Data Treatment
Raw FTIR spectra are inherently complex, containing overlapping peaks, baseline drifts, and scattering effects that can mask subtle chemical differences between meat samples. To optimize the discriminatory power of the spectroscopic dataset, a suite of eleven spectral pre-processing transformations was systematically applied prior to chemometric modeling. These transformations were implemented in Unscrambler X (Camo Analytics, Mölndal, Sweden), following approaches widely reported in FTIR–chemometric applications for meat and edible fats [18,19]. The applied pre-processing methods were as follows. The Savitzky–Golay first derivative (SG 1st der.) enhances spectral resolution and minimizes baseline offsets by calculating the first derivative of absorbance values with a polynomial fitting algorithm. The Savitzky–Golay second derivative (SG 2nd der.) emphasizes subtle differences in overlapping bands and improves peak resolution, particularly within the protein Amide I and II regions [11]. Dersolve (derivative with smoothing) combines differentiation with noise filtering, balancing detail enhancement with signal stability. Detrend correction removes linear baseline shifts and compensates for scattering effects caused by surface irregularities. Median filter smoothing (5 point) reduces high-frequency noise by replacing each spectral point with the median of its neighbors. Multiplicative Scatter Correction (MSC) corrects multiplicative and additive light scattering variations due to heterogeneous particle sizes and pathlength differences. Orthogonal Signal Correction (OSC) removes spectral variance unrelated to the dependent variable (class membership), improving model robustness [12]. Quantile normalization standardizes intensity distributions across spectra, improving comparability. Raw spectra (no treatment) were included as a baseline reference and ATR correction was used to evaluate the added value of pre-processing. Standard Normal Variate (SNV) corrects for scatter and pathlength differences by scaling each spectrum individually. SNV + Detrend combines both SNV scaling and baseline correction to improve reproducibility.
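Several of these transformations have simple open-source equivalents. The sketch below uses NumPy and SciPy rather than Unscrambler X, so it is not the implementation used in this study. The Savitzky–Golay window length and polynomial order are assumed values, since they are not reported here, and the demonstration spectra are synthetic.

```python
import numpy as np
from scipy.signal import detrend, savgol_filter

def sg_derivative(X, deriv=1, window_length=11, polyorder=2):
    """Savitzky-Golay derivative along the wavenumber axis (window and order are assumptions)."""
    return savgol_filter(X, window_length=window_length, polyorder=polyorder,
                         deriv=deriv, axis=1)

def snv(X):
    """Standard Normal Variate: center and scale each spectrum (row) individually."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

def msc(X, reference=None):
    """Multiplicative Scatter Correction against a reference (default: mean spectrum)."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, row in enumerate(X):
        slope, intercept = np.polyfit(ref, row, 1)   # row ~ slope * ref + intercept
        corrected[i] = (row - intercept) / slope
    return corrected

def snv_detrend(X):
    """SNV followed by removal of a linear baseline from each spectrum."""
    return detrend(snv(X), axis=1, type="linear")

# Synthetic demonstration: one band shape with random gain and offset per spectrum.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0.0, 6.0, 80)) + 2.0
X_demo = rng.uniform(0.5, 1.5, size=(5, 1)) * base + rng.uniform(-0.2, 0.2, size=(5, 1))

X_sg1 = sg_derivative(X_demo, deriv=1)
X_sg2 = sg_derivative(X_demo, deriv=2)
X_snv = snv(X_demo)
X_msc = msc(X_demo)
```

Because the synthetic spectra differ only by gain and offset, MSC maps all of them onto the mean spectrum, which is a quick check on the implementation.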
Each pre-processed dataset was structured into a data matrix

$$X \in \mathbb{R}^{n \times p}$$

where n corresponds to the number of samples (Maltese and foreign pork) after averaging the signal from the three independent replicates and p represents the number of spectral variables (wavenumber points). Supervised and unsupervised chemometric methods were carried out in Python 3.11 (Python Software Foundation, Wilmington, DE, USA) using the scikit-learn machine learning library, along with NumPy, pandas, and Matplotlib (version 3.7.2) for data processing and visualization.
2.4. Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much of the variance in the dataset as possible. PCA is useful because it can handle large datasets with thousands of variables. PCA works by finding new axes that maximize variance in the data, which involves computing the eigenvalues and eigenvectors of the covariance matrix [
20]. The PCA decomposition is given by

$$X = T P^{T} + E$$

in which X represents the original data matrix with n observations and p variables, T represents the score matrix in terms of principal components (PCs), P represents the loading matrix containing the eigenvectors that define how the original variables contribute to each principal component, and E represents the residual matrix capturing unexplained variance or noise after projection [20]. In this research, PCA was used to explain the variance within the ATR-FTIR dataset and to visualize clustering trends in relation to pork origin. The extracted PCA scores provided a summary of the sample grouping based on origin, while the PCA loadings provided an insight into the variability of spectral features contributing to differences within the pork profile. Outlier detection was carried out using two statistics. Hotelling’s T2 statistic measures the leverage of sample i in the score space:

$$T_i^{2} = t_i \, S_t^{-1} \, t_i^{T}$$

where $t_i$ is the score vector of sample i and $S_t$ is the covariance matrix of the scores. The Q-residuals (squared prediction error) quantify the variance not captured by the PCA model:

$$Q_i = \lVert x_i - \hat{x}_i \rVert^{2}$$

where $x_i$ is the original spectrum and $\hat{x}_i = t_i P^{T}$ is the PCA reconstruction. Samples exceeding the empirical threshold (mean + three standard deviations of the distribution) for either statistic were flagged as outliers and excluded from subsequent classification steps.
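The two outlier statistics and the mean + 3 SD threshold can be sketched with scikit-learn as follows. This is an illustrative reconstruction, not the authors' script: the number of retained components, the function names, and the synthetic data (with one deliberately perturbed sample) are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_outlier_statistics(X, n_components=3):
    """Return Hotelling's T2 and Q-residuals for every row of X."""
    pca = PCA(n_components=n_components)
    T = pca.fit_transform(X)                               # scores t_i (mean-centered)
    S_inv = np.linalg.inv(np.atleast_2d(np.cov(T, rowvar=False)))
    t2 = np.einsum("ij,jk,ik->i", T, S_inv, T)             # t_i S_t^-1 t_i^T
    X_hat = pca.inverse_transform(T)                       # reconstruction in original units
    q = np.sum((X - X_hat) ** 2, axis=1)                   # squared prediction error
    return t2, q

def flag_outliers(stat):
    """Empirical rule from the text: mean + 3 standard deviations."""
    return stat > stat.mean() + 3.0 * stat.std()

# Synthetic data: three latent factors plus noise, with sample 7 perturbed.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))
X_demo += 0.05 * rng.normal(size=(60, 40))
X_demo[7] += 2.0 * rng.normal(size=40)

t2, q = pca_outlier_statistics(X_demo, n_components=3)
is_outlier = flag_outliers(t2) | flag_outliers(q)          # flagged by either statistic
X_clean = X_demo[~is_outlier]
```

A sample is removed if either statistic exceeds its threshold, matching the "either statistic" rule described above.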
2.5. The Soft Independent Modeling of Class Analogy (SIMCA)
The Soft Independent Modeling of Class Analogy (SIMCA) algorithm was utilized as a supervised classification method for the spectral datasets. In SIMCA, distinct PCA models are created independently for each predefined class, allowing for the modeling of within-class variance while preserving class-specific structure. Unknown samples are then projected into each class model and their class membership is assessed by calculating the residual distances between the original spectrum and its PCA reconstruction. The validation of the SIMCA models, along with all other supervised models, was performed using three approaches. Training accuracy is determined by the classification of samples within the calibration set. Leave-One-Out (LOO) cross-validation involves excluding each sample one at a time and reclassifying it using models developed from the remaining data. Excluded-row validation, also known as Venetian blind cross-validation, systematically omits every 3rd sample from training and classifies it independently. Each unknown spectrum was classified into the class with the lowest residual distance:

$$\hat{k}_i = \arg\min_{k} \, d_{k,i}$$

where $d_{k,i}$ is the residual distance of sample i to class model k. For two-class comparisons, Coomans plots were constructed to visualize sample positions relative to both class models, providing a graphical overview of membership, ambiguous cases, and potential outliers. The SIMCA performance was assessed using different parameters, namely accuracy, defined as the proportion of correctly classified samples relative to the total number of samples:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
Specificity is defined as the ability of the model to correctly identify negative samples (i.e., correctly rejecting samples from the other class):

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$

Selectivity, also known as sensitivity, is defined as the ability of the model to correctly identify positive samples (i.e., correctly accepting samples belonging to the target class):

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
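A deliberately simplified SIMCA-style classifier is sketched below to make the nearest-class-model rule and the performance metrics concrete. Full SIMCA implementations usually scale the residual distances by class-specific critical limits. This sketch instead uses the raw Euclidean residual, and the class name `SimpleSIMCA`, the number of components, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

class SimpleSIMCA:
    """One PCA model per class; assign each sample to the class with the smallest residual."""

    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = {k: PCA(n_components=self.n_components).fit(X[y == k])
                        for k in self.classes_}
        return self

    def residual_distances(self, X):
        """Column k holds the distance d_{k,i} of each sample to class model k."""
        cols = []
        for k in self.classes_:
            model = self.models_[k]
            X_hat = model.inverse_transform(model.transform(X))
            cols.append(np.sqrt(np.sum((X - X_hat) ** 2, axis=1)))
        return np.column_stack(cols)

    def predict(self, X):
        return self.classes_[np.argmin(self.residual_distances(X), axis=1)]

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity from the confusion counts."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == positive) & (y_pred == positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    return {"accuracy": (tp + tn) / (tp + tn + fp + fn),
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp)}

# Synthetic two-class data lying in different low-dimensional subspaces.
rng = np.random.default_rng(2)
A = np.zeros((30, 10)); A[:, :2] = rng.normal(size=(30, 2))
B = np.zeros((30, 10)); B[:, 2:4] = rng.normal(size=(30, 2)); B[:, 4] += 5.0
X_demo = np.vstack([A, B]) + 0.05 * rng.normal(size=(60, 10))
y_demo = np.array([0] * 30 + [1] * 30)

simca = SimpleSIMCA(n_components=2).fit(X_demo, y_demo)
metrics = binary_metrics(y_demo, simca.predict(X_demo))
```

Plotting the two columns of `residual_distances` against each other would give the raw ingredients of a Coomans plot.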
2.6. Multivariate Classification Using PCA-LDA and PLS-LDA
To investigate the discriminatory power of the spectral data and assess sample classification based on origin, two hybrid chemometric workflows were employed: Principal Component Analysis coupled with Linear Discriminant Analysis (PCA-LDA) and Partial Least Squares Regression coupled with Linear Discriminant Analysis (PLS-LDA). Both approaches combined dimensionality reduction with supervised classification, optimizing interpretability while minimizing model overfitting. All absorbance values were standardized using z-score normalization (mean-centered and scaled to unit variance) via the StandardScaler class from scikit-learn, ensuring comparability across wavenumber intensities.
In the PCA-LDA workflow, dimensionality reduction was first achieved by PCA. PCA was performed on the standardized spectral matrix, retaining a maximum of 10 principal components (PCs), or fewer depending on dataset constraints. The selected PCs, which captured the majority of spectral variance, were then used as input features in Linear Discriminant Analysis (LDA). LDA is a supervised classification algorithm that seeks to maximize between-class variance while minimizing within-class variance in the transformed space. LDA was implemented using the LinearDiscriminantAnalysis class from scikit-learn and applied to the PCA scores. The resulting canonical scores were plotted to visualize class separation, and classification performance was evaluated.
For the PLS-LDA approach, Partial Least Squares analysis (PLS) was first used to reduce data dimensionality by projecting the spectral matrix onto a new set of orthogonal latent variables (LVs) that are maximally correlated with the class labels (encoded as binary integers: 0 = non-Maltese, 1 = Maltese). A maximum of 10 LVs or fewer were extracted using the PLSRegression class from scikit-learn. The resulting PLS scores (X-scores) served as input features for LDA, implemented in the same manner as the PCA-LDA model. This approach leveraged both the variance in the spectral dataset and its covariance with class membership, potentially offering greater classification power when relevant discriminatory information is subtly embedded in the data structure. Confusion matrices were generated for training predictions and canonical score plots (LD1 vs. LD2) were produced to visualize class separation. Both loading plots and latent variable scores were also exported to aid interpretation of discriminant features. Model outputs and performance metrics were saved for both whole-spectrum and fingerprint-only preprocessing strategies for comparison purposes. The performance of the PCA-LDA and PLS-LDA classification models was assessed using three complementary validation approaches.
2.6.1. Training Accuracy (Apparent Accuracy)
This metric quantifies the proportion of correctly classified samples within the calibration dataset used to train the model. While informative, it may overestimate performance due to overfitting, particularly in high-dimensional datasets with limited samples.
2.6.2. Leave-One-Out Cross-Validation (LOO-CV)
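LOO-CV is a robust internal validation method where each sample is iteratively excluded from model training and used for testing. This approach reduces bias and provides a more realistic estimate of the model’s predictive ability on unseen data. It is particularly suitable when the dataset is small, as it maximizes training size in each fold. The LOO accuracy is the proportion of samples whose held-out prediction matches the true label:

$$\mathrm{Accuracy}_{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left(y_i = \hat{y}_i^{\mathrm{LOO}}\right)$$

Here, $y_i$ is the true class label of the ith sample, $\hat{y}_i^{\mathrm{LOO}}$ is the predicted class label obtained when the ith sample was excluded from model training, and $\mathbb{1}(\cdot)$ equals 1 when its argument is true and 0 otherwise.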
2.6.3. Excluded Sample Accuracy (Structured Venetian Blind Validation)
In addition to leave-one-out (LOO) cross-validation, a Venetian blinds approach was employed for excluded-sample validation, as this strategy leaves out systematic blocks of spectra rather than single observations, thereby providing a more realistic estimate of prediction error and reducing the tendency of LOO to overestimate error in small samples. In this study, every third sample in the dataset was systematically excluded prior to model training and used solely for model evaluation. This form of systematic sampling is intended to ensure that each excluded observation is not adjacent or strongly correlated to those used for training, thereby mimicking an external validation set and avoiding overly optimistic estimates caused by temporal or batch autocorrelation. Specifically, 33% of the samples (every 3rd entry) were withheld and not used during model training. The remaining 67% formed the training set and were used to build the PCA-LDA and PLS-LDA models. Predictions were then generated for the excluded subset and classification accuracy was computed based on the proportion of correctly predicted labels:

$$\mathrm{Accuracy}_{\mathrm{excluded}} = \frac{\text{number of correctly predicted excluded samples}}{\text{total number of excluded samples}}$$
2.7. Partial Least Squares Regression (PLSR)
Partial Least Squares Regression (PLSR) was performed using the PLSRegression class from the scikit-learn sklearn.cross_decomposition module. Although the response variable in this study is non-continuous, PLSR was applied to evaluate the variability in classification performance across different spectral transformations and regions by calculating the root mean square error (RMSE). The maximum number of latent variables (LVs) was defined as the minimum of n − 1 (where n is the number of samples) and the number of spectral variables. The optimal number of LVs was selected by minimizing the RMSE obtained from Leave-One-Out (LOO) cross-validation. RMSE values were computed for both the training set and the LOO validation set to assess model performance and reduce the risk of overfitting. In this framework, the binary class response was modeled as a continuous variable rather than a discrete categorical outcome. Class labels were encoded as dummy variables, assigning a value of 1 to Maltese samples and 0 to foreign samples. Predicted values generated by the PLSR model were converted to class labels using a 0.5 threshold, consistent with this encoding: samples with predicted values >0.5 were classified as Maltese, while those ≤0.5 were classified as foreign.
Regression coefficients for each wavenumber were extracted from the PLSR model using the optimal number of LVs. Additionally, Variable Importance in Projection (VIP) scores were calculated to assess the relative contribution of each spectral variable to the model. VIP scores were computed following the approach of Wold et al. (2001) [21] using the formula:

$$\mathrm{VIP}_j = \sqrt{\frac{p \sum_{a=1}^{A} S_a \, w_{j,a}^{2}}{\sum_{a=1}^{A} S_a}}$$

where p is the number of variables, $w_{j,a}$ is the weight of variable j on LV a (with each weight vector normalized to unit length), $S_a$ is the amount of variance in y explained by LV a, and A is the number of LVs retained.
PLSR score plots were used to visualize class separation in latent variable space, with samples color-coded by origin (red = foreign, black = Maltese). Regression coefficients and VIP scores were plotted against the original wavenumber axis for interpretability. Model performance was evaluated using the Root Mean Squared Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^{2}}$$

where $y_i$ is the reference class label (0 for foreign, 1 for Maltese), $\hat{y}_i$ is the corresponding predicted value (continuous output from the PLSR model), and n is the total number of samples evaluated. RMSE was computed for the training set, leave-one-out cross-validation (LOOCV), and excluded rows validation (ERV) to evaluate the accuracy and robustness of the model under different validation strategies.
2.8. Support Vector Machine Regression (SVMR) Modeling
Support Vector Machine Regression (SVMR) was implemented using a radial basis function (RBF) kernel via the scikit-learn library. As with PLSR, the response was modeled as a continuous variable rather than a discrete categorical outcome. Model hyperparameters were optimized through an exhaustive grid search combined with five-fold cross-validation, using the coefficient of determination (R2) as the selection criterion. The hyperparameter space explored included C (regularization parameter): {0.1, 1, 10, 100}; ε (width of the insensitive-loss tube): {0.01, 0.1, 0.5, 1.0}; and γ (kernel coefficient): {‘scale’, ‘auto’}. Model performance was evaluated using the RMSE and coefficient of determination (R2) for the training set, leave-one-out cross-validation (LOOCV), and excluded rows validation. To interpret the relative contribution of spectral variables to the SVMR model, permutation importance analysis was performed using 10 randomized repetitions. The top 30 most informative wavenumbers were ranked based on their mean importance scores and visualized for biochemical interpretation.
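The grid search and permutation importance steps map directly onto scikit-learn, as the sketch below illustrates. The hyperparameter grid is the one listed above. The synthetic data, the random seeds, and the decision to compute importances on the training data are assumptions.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for spectra: the first five variables carry the class signal.
rng = np.random.default_rng(5)
y_demo = np.tile([0.0, 1.0], 30)          # 1 = Maltese, 0 = foreign
X_demo = rng.normal(size=(60, 40))
X_demo[:, :5] += 3.0 * y_demo[:, None]

# Hyperparameter grid as listed in the text.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "epsilon": [0.01, 0.1, 0.5, 1.0],
    "gamma": ["scale", "auto"],
}

# Exhaustive grid search with five-fold cross-validation, selecting on R2.
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)
best_svr = search.best_estimator_

# Permutation importance with 10 repetitions: drop in R2 when one variable is shuffled.
importance = permutation_importance(best_svr, X_demo, y_demo, n_repeats=10,
                                    scoring="r2", random_state=0)

# Indices of the 30 variables with the highest mean importance, most important first.
top_30 = np.argsort(importance.importances_mean)[::-1][:30]
```

With real spectra, `top_30` would be mapped back to the wavenumber axis before plotting.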
2.9. Artificial Neural Network (ANN) Modeling
A supervised feed-forward Artificial Neural Network (ANN) was employed to classify the geographical origin of the FTIR spectra. The ANN was implemented as a multilayer perceptron (MLP) with rectified linear unit (ReLU) activation functions and optimized using the Adam algorithm. The hidden layer configurations evaluated included single-layer networks with 50 and 100 nodes, two-layer networks (50–20, 100–50), and a three-layer network (50–30–10), combined with maximum iteration limits of 1000, 2000, and 3000. Early stopping based on validation loss was applied in all models to prevent overfitting and reduce computational cost. Classification performance was assessed using accuracy, precision, recall, specificity, F1-score, misclassification rate, cross-entropy loss, and the area under the receiver operating characteristic curve (AUC).
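The ANN configurations and metrics listed above could be evaluated along the following lines with scikit-learn's `MLPClassifier`. The library choice, the standardization step, the every-third hold-out split, and the synthetic data are assumptions made for this sketch. Note also that `early_stopping=True` in scikit-learn monitors the validation score rather than the validation loss.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score, log_loss,
                             precision_score, recall_score, roc_auc_score)
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

HIDDEN_LAYERS = [(50,), (100,), (50, 20), (100, 50), (50, 30, 10)]
MAX_ITERS = [1000, 2000, 3000]

def evaluate_mlp(X_train, y_train, X_test, y_test, hidden, max_iter, seed=0):
    """Fit one MLP configuration and return the metrics named in the text."""
    model = make_pipeline(
        StandardScaler(),  # assumption: inputs are standardized before the network
        MLPClassifier(hidden_layer_sizes=hidden, activation="relu", solver="adam",
                      max_iter=max_iter, early_stopping=True, random_state=seed))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    p_pos = model.predict_proba(X_test)[:, 1]          # probability of class 1
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    accuracy = accuracy_score(y_test, y_pred)
    return {"hidden": hidden, "max_iter": max_iter,
            "accuracy": accuracy,
            "precision": precision_score(y_test, y_pred, zero_division=0),
            "recall": recall_score(y_test, y_pred, zero_division=0),
            "specificity": tn / (tn + fp),
            "f1": f1_score(y_test, y_pred, zero_division=0),
            "misclassification_rate": 1.0 - accuracy,
            "cross_entropy": log_loss(y_test, p_pos, labels=[0, 1]),
            "auc": roc_auc_score(y_test, p_pos)}

# Synthetic data (1 = Maltese, 0 = foreign); every 3rd sample is held out for testing.
rng = np.random.default_rng(6)
y_demo = np.tile([0, 1], 45)
X_demo = rng.normal(size=(90, 30))
X_demo[:, :5] += 3.0 * y_demo[:, None]
held_out = np.zeros(len(y_demo), dtype=bool)
held_out[2::3] = True

results = [evaluate_mlp(X_demo[~held_out], y_demo[~held_out],
                        X_demo[held_out], y_demo[held_out], hidden, max_iter)
           for hidden in HIDDEN_LAYERS for max_iter in MAX_ITERS]
```

With early stopping enabled, the maximum iteration limit acts only as an upper bound, so configurations that differ only in `max_iter` may return identical metrics.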
4. Discussion
The ATR-FTIR spectra revealed clear biochemical differences between Maltese and non-Maltese pork across protein-, lipid-, ester-, and carbohydrate-associated regions previously identified by other authors [3,4,6]. In the high wavenumber region, the broad Amide A band (~3290 cm−1), corresponding to the N–H stretching of proteins with O–H contributions from polysaccharides, appeared slightly more intense in Maltese pork; however, due to the possible water overlap, this peak was excluded from the analysis. In the lipid region (3000–2800 cm−1), Maltese pork displayed more pronounced CH3 and CH2 stretching vibrations. Both the CH3 asymmetric stretching (~2956 cm−1) and the CH2 asymmetric stretching (~2925 cm−1) bands were stronger, as was the CH2/CH3 symmetric stretching region (~2872–2853 cm−1). These peaks reflect intramuscular lipids, phospholipids, and neutral lipids, indicating that Maltese pork exhibits relatively stronger methyl and methylene vibrational contributions [3,4,5,7]. In contrast, non-Maltese pork exhibited stronger carbonyl and protein-related absorptions. The C=O stretching vibration at ~1715 cm−1, associated with fatty acids and aromatic esters, was more defined in non-Maltese samples, suggesting higher levels of free fatty acids or oxidation products [3,4,9]. Similarly, the Amide I (~1655 cm−1) and Amide II (~1540 cm−1) bands were more intense in non-Maltese pork, indicating higher contributions from structural proteins or differences in secondary structures [3,6,7]. This contrasts with the higher Amide A intensity observed in Maltese pork, suggesting possible differences in protein conformations or hydration states between the two groups [3]. Further differences were evident in the fingerprint region (1500–900 cm−1). Non-Maltese pork exhibited stronger CH2 bending vibrations around ~1465 cm−1, together with more intense signals in the ~1412–1418 cm−1 region associated with cis-olefinic rocking and C–N stretching. The COO− symmetric stretching band at ~1392 cm−1, a marker for fatty acid composition, was also stronger in non-Maltese pork. These absorptions are consistent with a greater lipid bending intensity, higher fatty acid unsaturation, and compositional differences in fatty acid profiles [3,4,7].
In contrast, Maltese pork showed more pronounced signals in the Amide III region (~1315–1230 cm−1), which also overlaps with PO2− asymmetric stretching from phospholipids and nucleic acids [3,4,7]. Additional differences were observed in the 1170–1150 cm−1 region, corresponding to the C–O stretching of serine, threonine, and tyrosine residues, and in the 1080–1030 cm−1 range, assigned to PO2− symmetric stretching and C–O vibrations of carbohydrates and glycogen. These stronger absorptions in Maltese pork indicate a higher contribution from structural proteins, phospholipids, and carbohydrate-related biomolecules [3,7]. Taken together, these spectral observations suggest that Maltese pork is distinguished by stronger Amide A, Amide III, and phosphate/carbohydrate-associated vibrations, alongside pronounced CH2/CH3 stretching bands. Non-Maltese pork, on the other hand, is characterized by stronger Amide I–II absorptions, more defined carbonyl stretching, and greater lipid bending and fatty acid-associated peaks [7]. These compositional differences are likely rooted in production practices: Maltese pork, typically derived from small-scale systems with balanced feeding and shorter supply chains, shows stronger signatures of structural proteins and phospholipids, whereas non-Maltese pork, associated with intensive farming and energy-dense diets, exhibits higher levels of free fatty acids, lipid unsaturation, and protein signals linked to leaner carcass development.
Chemometric modeling confirmed that these spectral features formed the basis for robust classification. The application of Savitzky–Golay derivatives improved the resolution of overlapping peaks in the amide and lipid regions, allowing subtle yet systematic differences between the groups to be emphasized. The superior performance of second-derivative preprocessing in PCA clustering mirrors earlier findings in meat authenticity studies, where derivative treatments consistently enhanced separation [6,8,15]. Supervised classifiers further improved the classification accuracy. PLS-LDA achieved 100% accuracy across preprocessing methods, outperforming PCA-LDA, which does not explicitly optimize for class-related variance. This agrees with earlier studies showing that PLS-DA and SVM consistently outperform PCA-based models in meat species and origin authentication [6,15]. Although whole-spectrum models achieved high accuracy, the fingerprint region (1800–900 cm−1) emerged as the most chemically meaningful. It captures the amide bands, lipid bending modes, and phosphate/carbohydrate absorptions that directly reflect protein-to-lipid ratios and cellular composition. This reinforces the literature consensus that the fingerprint region provides the richest biochemical information for species and origin discrimination [6,7,8]. Nevertheless, second derivative models showed increased outlier sensitivity, suggesting that complementary preprocessing strategies such as detrend or OSC may offer a more stable balance between accuracy and robustness. These observations are summarized in Table 7, which compares the different preprocessing techniques applied in this study, highlighting their relative advantages, limitations, and impact on the spectral resolution and model performance.
Regression modeling further highlighted the discriminatory power of the fingerprint region. PLSR models performed best with derivative preprocessing, though inflated leave-one-out (LOO) errors reflected the known limitations of this validation strategy in small datasets. Nonlinear regression approaches such as SVMR provided stronger predictive robustness, capturing subtle biochemical patterns beyond the linear structure of PLSR. Feature importance from SVMR and region of importance from ANN consistently highlighted Amide I (~1650 cm−1), CH2/CH3 bending (~1465 cm−1), and carbohydrate/phosphate bands (~1117–1031 cm−1) as the most discriminative, fully matching the biochemical assignments of the spectra. ANN models also performed strongly when derivative or OSC preprocessing was applied, corroborating recent evidence that deep learning approaches enhance classification power in FTIR–chemometric workflows [6,8,15].
These results confirm that Maltese and non-Maltese pork can be reliably differentiated based on their FTIR fingerprints. Maltese pork is defined by stronger protein- and phosphate-associated absorptions, while non-Maltese pork is characterized by more pronounced lipid- and ester-associated signals. When coupled with derivative preprocessing and supervised classifiers, ATR-FTIR provides a rapid, non-destructive, and cost-effective strategy for pork origin authentication. Spectral acquisition required approximately 3 min per sample, with negligible reagent consumption, thereby offering a markedly more economical alternative to conventional molecular or proteomic approaches. DNA-based authentication (e.g., PCR or qPCR) typically entails 2–4 h of sample preparation, amplification, and analysis, in addition to recurring expenses for extraction kits and enzymes, while proteomic or mass-spectrometric methods frequently exceed these temporal and financial requirements [26]. Relative to such methods, ATR-FTIR reduces per-sample reagent and consumable costs by an estimated ≥70% and lowers total analytical expenditure to roughly 5–10% of that associated with a conventional workflow [27]. These findings are concordant with previous demonstrations of the robustness of FTIR–chemometric strategies for meat traceability and halal verification [6,7,14,26,27] and, together with reports of the successful deployment of portable ATR-FTIR instrumentation, highlight the feasibility of implementing this approach for rapid, on-site regulatory and industrial monitoring.
5. Conclusions
This study demonstrated the successful application of ATR-FTIR spectroscopy coupled with advanced chemometric and machine learning approaches for the authentication of Maltese versus non-Maltese pork. A comprehensive evaluation of classification and regression strategies revealed that data preprocessing plays a pivotal role in extracting chemically meaningful information from complex FTIR spectra. Derivative transformations, particularly the Savitzky–Golay first and second derivatives, consistently enhanced spectral resolution and improved model robustness across all workflows.
Linear models such as PCA-LDA, SIMCA, and PLSR provided high levels of accuracy and interpretability, with the fingerprint region (1800–600 cm−1) emerging as the most discriminative spectral domain due to its rich representation of proteins, lipids, and nucleic acids. However, these methods were more sensitive to sample variability and exhibited inflated errors under stringent cross-validation. Nonlinear approaches, especially Support Vector Machine Regression (SVMR) and Artificial Neural Networks (ANNs), delivered a superior predictive performance, with accuracies exceeding 0.99 and lower misclassification rates under external validation. The ANN models, when combined with appropriate preprocessing (2nd derivative, OSC, or median filtering), provided the most powerful classification framework, highlighting the capacity of deep learning to capture subtle, nonlinear spectral features.
Collectively, these findings confirm that FTIR spectroscopy coupled with chemometrics and machine learning provides a rapid, cost-effective, and non-destructive tool for meat authenticity assessments. The strong performance of nonlinear models underscores their potential for real-world deployment in quality control and regulatory enforcement. Importantly, the results also emphasize that the careful choice of the preprocessing and validation strategy is essential to prevent overfitting and to ensure model generalizability.