Authentication of Laying Hen Housing Systems Based on Egg Yolk Using 1H NMR Spectroscopy and Machine Learning

(1) Background: The authenticity of eggs in relation to the housing system of laying hens is susceptible to food fraud due to the potential for egg mislabeling. (2) Methods: A total of 4188 egg yolks, obtained from four different breeds of laying hens housed in colony cage, barn, free-range, and organic systems, were analyzed using 1H NMR spectroscopy. The data of the resulting 1H NMR spectra were used for different machine learning methods to build classification models for the four housing systems. (3) Results: The comparison of the seven computed models showed that the support vector machine (SVM) model gave the best results with a cross-validation accuracy of 98.5%. The test of classification models with eggs from supermarkets showed that only a maximum of 62.8% of samples were classified according to the housing system labeled on the eggs. (4) Conclusion: The classification models developed in this study included the largest sample size compared to the literature. The SVM model is most suitable for evaluating 1H NMR data in terms of the hen housing system. The test with supermarket samples showed that more authentic samples to analyze influencing factors such as breed, feeding, and housing changes are required.


Introduction
Organically produced animal foods are more expensive to produce due to the stricter regulations on feeding and housing systems related to animal welfare, making them a target for food fraud [1,2].Food fraud can be counteracted by authenticating organically produced food.Therefore, analytical methods are required for the authentication of animal food, e.g., in relation to animal species, breed, feeding, or housing system [3,4].For example, conventionally produced hen's eggs can easily be labeled as organically produced eggs due to their identical external appearance.The production of eggs and the husbandry of laying hens are subject to a number of EU regulations and directives [1,5,6], which define the minimum requirements for the different housing systems, in particular for organic farming.In Germany, eggs from laying hens are labeled with a code that symbolizes the housing system (colony cage: 3; barn: 2; free-range: 1; organic: 0) and the farm where the eggs were produced.The colony cage system is currently approved in Germany until 2025.
Several studies have focused on the classification of the housing system of laying hens based on the metabolome of the eggs.Ackermann et al. [7] reported a 1 H NMR method combined with linear discriminant analysis (LDA) to differentiate between organically and conventionally produced eggs (n = 344 samples).Puertas and Vázquez [8] determined the housing systems (cage, barn, free-range, and organic) by UV-Vis-NIR spectroscopic analyses of egg yolk lipid extract (n = 48 samples) [8] or egg plasma (n = 84 samples) [9] and support vector machine (SVM), quadratic discriminant analysis (QDA), and LDA.In addition, Hajjar et al. [10] analyzed the triacylglycerol extract of egg yolk (n = 34 samples) using 1 H NMR spectroscopy and classified the housing system by LDA and canonical discriminant analysis.Cardoso et al. [11] showed the discrimination of barn and free-range eggs (n = 48 samples) by 1 H NMR spectroscopy and partial least square discriminant analysis (PLS-DA).Chin et al. [12] reported an LC-MS/MS method to analyze lipid extracts from egg yolks (n = 357 samples) from cage, barn, and free-range systems in combination with the machine learning methods orthogonal projection in latent structures discriminant analysis (OPLS-DA) and SVM.The OPLS-DA and SVM models showed accuracies between 77 and 96% and 80 and 97%, respectively, depending on the class [12].Furthermore, Lösel et al. [13] analyzed 270 eggs from conventional and organic systems with LC-ESI-IM-qTOF-MS and FT-IR spectroscopy in combination with a random forest (RF) calculation and reported an accuracy of 96.3%.Kopec and Abramczyk [14] showed a PLS-DA model based on Raman spectra of 40 eggs from cage, barn, free-range, and organic systems with a sensitivity and specificity of validation between 0.897 and 1.0.
To the best of our knowledge, this is the first study to authenticate the housing system of laying hens with a sample size n > 1000 and to develop seven machine learning methods and apply the models to egg samples purchased from supermarkets.
We hypothesized that it is possible to authenticate the housing system of laying hens by 1 H NMR analysis of a large number of authentic egg yolk samples and an appropriate machine learning model.

Raw Material
The egg samples were collected from the farms of a German egg producer.The eggs were from laying hens from the four different housing systems (colony cage, barn, free-range, and organic) and four different breeds (Lohmann Selected Leghorn, Dekalb, Lohmann Brown, and Sandys).The laying period of the hens was separated into four groups: <25 weeks, 26-35 weeks, 36-55 weeks, and >56 weeks.Eggs were collected throughout the entire laying period of the hens (20 to 94 weeks), with every effort being made to ensure that at least one sample from each combination of age group, breed, and housing system was available.Further information is presented in Table S1 of the Supplementary Materials.A total of 472 eggs were collected from colony cages, 1200 eggs from barn, 1192 eggs from free-range and 1324 eggs from organic housing systems, resulting in a total amount of 4188 eggs used for model development.In addition, 290 eggs labeled as barn, free-range, or organic housing systems were purchased from supermarkets and discounters (Table S2).

1 H NMR Spectroscopy
The samples were analyzed using a 400 MHz Ascend III NMR spectrometer (Bruker Biospin GmbH, Ettlingen, Germany) with the following measurement parameters: pulse program, zg; temperature, 298 K; spectral width, 8223 Hz; number of points, 65 k; number of scans, 128; number of dummy scans, 4; acquisition time, 3.9 s; and relaxation delay, 6.0 s.Phase and baseline corrections were performed automatically using TopSpin 3.6.5 (Bruker Biospin GmbH).

Data Analysis
Data analysis was performed using Matlab R2018a (The Mathworks, Natick, MA, USA) and RStudio 2023.06.0 421 (Posit PBC, Boston, MA, USA) based on the R 4.3.1 software.The 1 H NMR spectra were scaled using the signal of trimethoxybenzene as an internal standard compound and conventionally bucketed into 682 buckets with a size of 0.01 ppm.The signals from MeOD and CDCl3 were excluded.The machine learning models (LDA, QDA, PLS-DA, SVM, RF, k-nearest neighbor (kNN), and artificial neural network (ANN)) were performed based on the buckets of 1 H NMR spectra.Since three out of four classes (housing system) were approximately equal in size, and only one class (colony cage) contained fewer samples, the data were not adjusted for imbalance.This avoids negative effects such as replication of minor effects or loss of information.In the case of LDA, QDA, SVM, RF, kNN, and ANN, a principal component analysis (PCA) was computed beforehand, and 134 principal components were used for these model

1 H NMR Spectroscopy
The samples were analyzed using a 400 MHz Ascend III NMR spectrometer (Bruker Biospin GmbH, Ettlingen, Germany) with the following measurement parameters: pulse program, zg; temperature, 298 K; spectral width, 8223 Hz; number of points, 65 k; number of scans, 128; number of dummy scans, 4; acquisition time, 3.9 s; and relaxation delay, 6.0 s.Phase and baseline corrections were performed automatically using TopSpin 3.6.5 (Bruker Biospin GmbH).

Data Analysis
Data analysis was performed using Matlab R2018a (The Mathworks, Natick, MA, USA) and RStudio 2023.06.0 421 (Posit PBC, Boston, MA, USA) based on the R 4.3.1 software.The 1 H NMR spectra were scaled using the signal of trimethoxybenzene as an internal standard compound and conventionally bucketed into 682 buckets with a size of 0.01 ppm.The signals from MeOD and CDCl 3 were excluded.The machine learning models (LDA, QDA, PLS-DA, SVM, RF, k-nearest neighbor (kNN), and artificial neural network (ANN)) were performed based on the buckets of 1 H NMR spectra.Since three out of four classes (housing system) were approximately equal in size, and only one class (colony cage) contained fewer samples, the data were not adjusted for imbalance.This avoids negative effects such as replication of minor effects or loss of information.In the case of LDA, QDA, SVM, RF, kNN, and ANN, a principal component analysis (PCA) was computed beforehand, and 134 principal components were used for these model calculations.A radial basis kernel function with degree = 3 and cost = 1 was used for SVM computation.The parameters ntree (number of trees) and mtry (number of variables) for the RF model, number of PLS components for PLS-DA, k (number of neighbors) for kNN, number of hidden layers, and number of neurons in each hidden layer for ANN were optimized in terms of model accuracy and set to ntree = 800, mtry = 8, number of PLS components = 13, k = 5, number of hidden layers = 2, and number of neurons in each hidden layer = 5.
In general, the data set was randomly divided into a training set (80% of the data) and a test set (20% of the data).The training set was used to compute the different models and the test set for validation.Additionally, all model calculations were validated by k-fold cross-validation with k = 10.Accuracy, sensitivity, specificity, precision, number of misclassifications (NMC), and area under the curve (AUC) of the receiver operating characteristic (ROC) curve were calculated for the model, the cross-validated model, and the test set fit.A permutation test with 1000 permutations was conducted for each model, and the p-value for accuracy, NMC, and AUC were reported.These tests and parameters provide information on the quality of the models and on the problem of the models with imbalanced data.The equations are provided in the Supplementary Material.

Model Computing Using Multiple Machine Learning Methods
The egg yolks of 4188 eggs from the four housing systems (colony cage, barn, freerange, and organic) were separated, extracted, and measured using 1 H NMR spectroscopy (Figure 1).The 1 H NMR spectra (Figure 2) are the basis for the calculation of the different machine learning models.Literature has shown that several classification methods are applicable to NMR data.Seven machine learning methods (LDA, QDA, SVM, PLS-DA, RF, kNN, ANN) were selected to develop the classification models.In the case of LDA and QDA, the between-class variances are maximized, and the within-class variances are minimized to increase distinctiveness.Unlike QDA, LDA assumes equal covariance matrices between classes.In addition, the LDA has a linear decision surface, while the QDA has a nonlinear decision surface [15,16].The SVM calculates hyperplanes to discriminate different classes, both linearly and non-linearly, by combining with the kernel function [17].The PLS-DA is a supervised linear classifier that uses PCAs on x and y terms.Direct relationships to the classes and the analyzed metabolome can be established by calculating loadings and variables' importance in projection (VIP) scores [18][19][20].In the case of RF, multiple low-correlated decision trees are constructed based on random splits of the data.The predicted class is the class selected by the most trees, thus minimizing the error of using only one decision tree.RF can handle large data sets with high dimensionality and large sample size [21,22].The kNN classification is based on the classes of a certain number (k) of the nearest data points (neighbors).The kNN algorithm is easy to apply and can be helpful in cases where other machine learning methods fail [23].The ANN is a complex method that consists of input, hidden, and output layers with a defined number of artificial neurons depending on the application [24].In general, the data set was randomly divided into a training set (80% of the data) and a test set (20% of the data).The training set was used to compute the different models and the test set for validation.Additionally, all model calculations were validated by kfold cross-validation with k = 10.Accuracy, sensitivity, specificity, precision, number of misclassifications (NMC), and area under the curve (AUC) of the receiver operating characteristic (ROC) curve were calculated for the model, the cross-validated model, and the test set fit.A permutation test with 1000 permutations was conducted for each model, and the p-value for accuracy, NMC, and AUC were reported.These tests and parameters provide information on the quality of the models and on the problem of the models with imbalanced data.The equations are provided in the supplementary material.

Model Computing Using Multiple Machine Learning Methods
The egg yolks of 4188 eggs from the four housing systems (colony cage, barn, freerange, and organic) were separated, extracted, and measured using 1 H NMR spectroscopy (Figure 1).The 1 H NMR spectra (Figure 2) are the basis for the calculation of the different machine learning models.Literature has shown that several classification methods are applicable to NMR data.Seven machine learning methods (LDA, QDA, SVM, PLS-DA, RF, kNN, ANN) were selected to develop the classification models.In the case of LDA and QDA, the between-class variances are maximized, and the within-class variances are minimized to increase distinctiveness.Unlike QDA, LDA assumes equal covariance matrices between classes.In addition, the LDA has a linear decision surface, while the QDA has a nonlinear decision surface [15,16].The SVM calculates hyperplanes to discriminate different classes, both linearly and non-linearly, by combining with the kernel function [17].The PLS-DA is a supervised linear classifier that uses PCAs on x and y terms.Direct relationships to the classes and the analyzed metabolome can be established by calculating loadings and variables' importance in projection (VIP) scores [18][19][20].In the case of RF, multiple low-correlated decision trees are constructed based on random splits of the data.The predicted class is the class selected by the most trees, thus minimizing the error of using only one decision tree.RF can handle large data sets with high dimensionality and large sample size [21,22].The kNN classification is based on the classes of a certain number (k) of the nearest data points (neighbors).The kNN algorithm is easy to apply and can be helpful in cases where other machine learning methods fail [23].The ANN is a complex method that consists of input, hidden, and output layers with a defined number of artificial neurons depending on the application [24].The validation results of the different models are shown in Tables 1 and 2. The accuracies of fit, cross-validation, and test set were close to each other for the classification models, indicating that there were no overfitting models.The p-values of the permutation test showed that the computed models with permuted class labels had a significantly poorer classification performance than the correct models.This indicates that the models with correct labels were useful in predicting the housing system.The models with the highest accuracy of fit (99.9%), the highest accuracy of cross-validation (97.8%; 98.5%), and the lowest NMC (3-7; 1-5) were QDA and SVM.The QDA model showed wider ranges in sensitivity (0.879-0.994), specificity (0.979-0.999), precision (0.949-0.997) and AUC (0.939-0.997) of cross-validation than the SVM model (sensitivity: 0.953-1.000;specificity: 0.993-0.999;precision: 0.983-0.989;AUC: 0.976-0.996),but the QDA is the only model where the AUC for the class colony cage in the permutation test was as high as the AUC of the normal model.This indicates that the classification of the class colony cage was random.QDA and SVM are both classifiers that are not based on a linear decision surface like LDA.QDA can be based on a quadratic function or curve, while SVM is based on several kernel functions (e.g., polynomial, radial, etc.) [15,17].A different characteristic of multi-class applications between SVM and LDA/QDA is that LDA/QDA computation focuses on very different classes, while SVM computation focuses on closer classes [25].This may explain the weakness of QDA in classifying colony cage samples.Furthermore, the class colony cage is unrepresentative in terms of sample size (547 samples) compared to the other classes (>1200 samples) due to the reduced use of colony cage in Germany, which could explain the poorer classification performance of this class.This is reflected in the lower sensitivity of the colony cage class in all models compared to the other three classes.LDA-linear discriminant analysis, QDA-quadratic linear discriminant analysis, PLS-DA-partial least square discriminant analysis, SVM-support vector machine, RF-random forest, kNN-k-nearest neighbor, ANN-artificial neuronal network.Listed as barn, free-range, colony cage, and organic housing system; 1 AUC-area under the ROC (receiver operating characteristic) curve; 2 NMC-number of misclassifications; n.c.-not calculable.
The accuracy of the test set samples was the highest for the QDA and SVM models (98.6% and 99.0%, respectively) compared to the other models.Chin et al. [12] showed that their SVM model provided better classification results than the OPLS-DA model when using MS/MS spectra of eggs from conventional housing systems.In contrast to the results of this study, the SVM model by Puertas and Vázquez [8] and Puertas et al. [9] showed worse classification results than LDA and QDA.These differences may be explained by the use of UV-Vis and NIR data to calculate the models.In general, the SVM algorithm could be useful for authenticating food samples based on NMR spectra [22,[26][27][28].For example, Cui et al. [26] demonstrated that their SVM model had the best classification results compared to LDA, PLS-DA and RF to authenticate the geographical origin of Zanthoxylum bungeanum extract samples.As another example, Nyitrainé Sárdy et al. [28] compared LDA, ANN, SVM, and RF models to classify the region and variety of wine based on 1 H NMR spectra and showed that the SVM and RF models had the best classification performance.
The results of LDA compared to QDA showed a poorer classification model, which is consistent with observations noted in the literature regarding eggs [8,9] or other foods [26,28] and could be an indicator that a linear decision surface is less applicable for these samples.Hajjar et al. [10] and Ackermann et al. [7] used 1 H NMR spectra of conventionally and organically produced eggs to compute a good classification model based on LDA.This may be due to the fact that there were only two classes (conventional vs. organic), which may be easier to separate by the LDA model than the four housing systems.
The model with the lowest accuracies was the PLS-DA, with 85.6% (fit), 84.7% (CV), and 84.6% (test set).In the literature, PLS-DA and OPLS-DA have often been used for food classification approaches based on data from NMR spectra [29][30][31][32].In relation to eggs, Cardoso et al. [11] reported a PLS-DA model to classify barn and free-range eggs using the 1 H NMR spectra.The RF, kNN, and ANN models showed poorer classification results than SVM, QDA, and LDA but better than PLS-DA.
Therefore, the SVM model is the best model in this study with a cross-validation accuracy of 98.5% to predict the housing system of laying hens using 1 H NMR spectra.Compared to the literature, Jiménez-Carvelo et al. [33] summarized that the SVM algorithm is suitable for several food classification/authentication approaches, for example, in the case of honey, meat, milk, plant products and oils.

Prediction of Housing System from Supermarket Samples
To evaluate the models' predictive performance in a practical setting, 290 egg samples labeled '0' (organic), '1' (free-range), and '2' (barn) were purchased from supermarkets or discount stores.Eggs labeled '3' (colony cage) were not commercially available.The outcomes of the model predictions are detailed in Tables 2 and 3. Prediction of the supermarket sample set was performed with all seven models to compare prediction results and provide information on model weakness or sample outliers.For samples V-001, V-002, V-003, V-004, V-005, V-007, V-013, V-017, V-018, V-026, and V-027, (almost) all models correctly predicted the labeled housing system for (almost) all eggs.This observation indicates that the eggs were correctly labeled with a very high statistical certainty.If all or a large proportion of the eggs in a sample cannot be predicted as labeled by all or almost all models then it is reasonable to assume that this sample is an outlier.Mislabeled samples, as well as samples that differ greatly from the training set, for example, due to a different breed, are possible outliers.This was observed for samples V-006, V-009, V-014, V-016, V-019, V-020, V-025, and V-028 (Table 3).All models classified only 45.9% (kNN) to 62.8% (LDA) of the samples into the class printed on the shell (Table 2).The best model (SVM) mentioned above predicted 52.8% of the samples according to the labeled housing system.This observation may be due to the small sample size (4188 eggs) in relation to the number of laying hens in Germany (43.7 million hens in the year 2022 [34]) and/or due to a higher number of mislabeled eggs than expected.The prediction results (Table 3) demonstrated that most of the eggs labeled as barn eggs were misclassified.One reason could be that laying hens from free-range farming have to be housed indoors when sudden external influences occur (e.g., avian influenza risk area) [35], meaning the housing system is temporarily comparable to barn housing.In this case, the eggs from these laying hens had to be labeled as barn eggs if they were kept indoors for more than 12 weeks [36].Since a new regulation came into effect in November 2023, such eggs can still be labeled as free-range eggs today [5].However, the eggs used in this study were collected before the new regulation came into effect.In addition, the sample size for barn eggs may still be too small to provide a representative sample.In Germany, the barn system contains more than twice as many laying hens as the free-range system and more than five times as many laying hens as the organic system.It can therefore be concluded that the population of barn eggs is significantly larger and is therefore less well represented by the same number of samples than in the other systems.Thus, the authentication model developed here may be improved using a larger number of samples, in particular from barn eggs, which will provide a more representative sample.
Additionally, the age of the laying hens may impact the metabolome of the eggs and thus be an influencing factor.It has been reported in the literature that the fatty acid pattern of the egg yolk is influenced by the age of the laying hens [37][38][39].In this study, eggs from laying hens were collected throughout the laying period to minimize the influence of age in the machine learning models.Furthermore, the breed of the laying hens could be another influencing factor.Hejdysz et al. [40] analyzed eggs from 14 different breeds and reported significant differences in fatty acid profiles and cholesterol levels.There are further studies reporting differences in the metabolome of eggs from different breeds [10,[41][42][43].In the authentication model developed in this study, eggs from four different breeds were considered in order to include the variation between these breeds in the model and to be able to determine the housing system of laying hens despite this variation.This should make it possible to authenticate the housing system for laying hens regardless of their breed.Therefore, it is possible that the eggs collected in supermarkets and discounters were from breeds that were not included in the collection of authentic samples and therefore not included in the machine learning models.Furthermore, the classification models found in the literature were based on 34-357 samples, and half of the studies did not ensure the authenticity of the samples.Compared to the literature, this study presents the largest classification models using authentic egg samples.
Improving the authentication model to predict unknown samples requires adding more authentic samples.These authentic samples should include samples from other breeds or breed hybrids, depending on the breeds used in egg production, and from the four housing systems, especially the barn systems.In addition, studies on feeding and its effect on the metabolome of eggs and, thus, on the prediction of the authentication model will be helpful.The machine learning model may be useful to authenticate eggs with and

Figure 1 .
Figure 1.Overview of the preparation, processing, and calculation steps.

Figure 1 .
Figure 1.Overview of the preparation, processing, and calculation steps.

Table 1 .
Results of fit and cross-validation of the models LDA, QDA, PLS-DA, SVM, RF, kNN, and ANN.

Table 2 .
Results of test set fit, permutation test, and purchased sample set for each computed model.

Table 3 .
Results of the purchased sample set of each computed model.More details can be found in TableS3in the Supplemental Material.