Reflected Light Spectrometry and AI-Based Data Analysis for Detection of Rapid Chicken Eggshell Change Caused by Mycoplasma Synoviae

Featured Application: The proposed method may be used in the production process for fast identification of eggs originating from poultry infected with MS.
Abstract: Mycoplasma synoviae (MS) is a pathogen that causes economic losses in the poultry industry. It can be transmitted, among other routes, via the respiratory tract and spreads relatively quickly. As such, MS infections are mainly controlled by maintaining MS-free breeder flocks. Routine diagnosis for the detection of MS may be based on serological, culture, and molecular tests. Here, we propose an optical solution in which AI-based analysis of spectral data obtained from the light reflected from the eggshells is used to determine whether they originate from healthy or Mycoplasma synoviae-infected hens. The wavelengths proposed for spectral MS detection are limited to those of VIS and NIR DPSS lasers, which are freely available on the market. The results are satisfactory: for white eggshells, the F-score is over 95% for five different combinations of wavelengths (using eight or nine wavelengths); for brown eggshells, the F-score is above 85%, also for five different combinations of 6–9 wavelengths.


Introduction
M. synoviae is a member of the genus Mycoplasma of the class Mollicutes, a group of wall-less Gram-positive bacteria causing economic losses in the poultry industry. It is one of the major avian pathogenic agents and has a multifactorial etiology that involves complex interactions amongst pathogen, host, and environmental factors [1]. Poultry healthcare problems in this global industry associated with M. synoviae infections are related to respiratory infections, arthropathic problems, and strains causing eggshell pathology [1][2][3][4]. Eggs with eggshell apex abnormalities (EAAs) have a clear demarcation zone on the top zone of the egg, up to approximately 2 cm from the apex. Eggshell pathologies are also characterized by a roughened shell surface, shell thinning, increased translucency, cracks, and breaks [3,4].
Egg quality is important to the commercial egg industry. The deterioration in the quality of eggshells causes significant losses in the egg industry [5,6]. Any infection of the reproductive system of a laying hen can affect the quality of the eggs and eggshells [7]. The eggshell plays a crucial role in the development of the embryo, protecting it from mechanical damage, regulating gas exchange, and providing a source of nutrients. The eggshell protects the egg from contamination by bacteria and other pathogens, ensuring healthy embryo development [8].
M. synoviae is vertically transmitted and can cause egg infertility, embryonic death, poorly developed embryos, and weak poults, sometimes even without overt signs of active infection in the parent flock [9,10]. The occurrence of clinical signs may be related to many factors, such as high levels of ammonia, inadequate ventilation, high stocking densities, and extremes of temperature. Infections with other microorganisms, such as E. coli, infectious bronchitis virus, Newcastle disease virus, and Ornithobacterium rhinotracheale, are also often involved in the etiology of these diseases [11].
Appl. Sci. 2021, 11, 7799
Mycoplasmas are important causes of disease and loss of production in intensively reared poultry, particularly in those that are under environmental stress [12].
The disease caused by MS is among the diseases for which the OIE must be notified [13]. As MS can be transmitted via the respiratory tract and egg, the main method of controlling MS infections is to maintain MS-free breeder flocks [14]. Routine diagnosis for the detection of MS may be based on serological, culture, and molecular tests. Sera collected from the flock can be tested for the presence of antibodies using the serum plate agglutination test (SPA), enzyme-linked immunosorbent assay (ELISA), and, rarely, hemagglutination inhibition (HI) [15,16]. The culture method requires the use of PPLO broth and is expensive and time-consuming (28 days). The most frequently used and more sensitive method is the detection of DNA using polymerase chain reaction (PCR) and its modifications, real-time PCR, multiplex PCR, and LAMP [17][18][19][20]. A quick and sensitive method is polymerase spiral reaction (PSR) [21]. The method is 100 times more sensitive than PCR and has a higher positive rate (69.9%) than ELISA (65.3%).
Optical methods can be used to classify eggs as being from healthy or MS-infected hens using spectroscopy in transmitted light [22]. The authors obtained the best results for a group of brown shells, attaining 88% accuracy. However, that method is destructive: the egg must be broken because the measurement is taken through a single piece of eggshell. To overcome this disadvantage, we propose a significant change to the spectroscopic data acquisition process.
The proposed approach for spectroscopic data acquisition is based on the detection of reflected VIS light. Since the eggs are not destroyed, this method can be used on the production line for egg analysis.

Materials and Methods
For the evaluation of the proposed approach, a set of 1475 measurements of eggshell samples was used. The set consisted of brown- and white-colored eggshells with a confirmed origin from healthy or MS-infected hens. Table 1 shows the quantity of each subset of samples. The samples from the healthy subgroups originated from the inner reference flock of the National Veterinary Research Institute (NVRI), while the MS-infected eggs originated from commercial flocks that were under the veterinary supervision of the NVRI, with infection confirmed by three techniques: a specific MS PCR [17,18,21], LAMP [19], and sequencing of the vlhA gene [20,21].

Spectral Data Acquisition
Spectral data were collected in a laboratory setup; a schematic of the setup is shown in Figure 1. The measured sample was placed on an XY translation stage, enabling proper sample positioning. The light source was an incandescent lamp. The light beam was formed by a condenser lens for proper illumination of the object. The spectrometer fiber head (Thorlabs CCS100/M with M14L01 fiber head attached) was mounted on a movable arm that could be angularly adjusted to maximize the reflected signal on the detector. Such placement of the head enabled easy and fast adjustment of the setup to the different curvatures of the egg samples. Due to the large number of measurements, they were recorded over several days. Before each measurement set, the setup was calibrated by acquiring spectral data from the calibration object, a white grinding plate. All measurement results were divided by the result of the measurement of the calibration object. This procedure removed the influence of sample illumination and recording path parameters.
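The calibration step described above amounts to an element-wise division of each sample spectrum by the spectrum of the white plate. The following is a minimal sketch of that idea, not the code used in the study; the function name `normalize_spectrum`, the `eps` guard against near-zero reference values, and the toy numbers are all assumptions made for illustration.

```python
def normalize_spectrum(sample, reference, eps=1e-12):
    """Divide a sample spectrum by the calibration (white plate) spectrum.

    Both inputs are sequences of intensities sampled at the same wavelengths.
    `eps` is a hypothetical guard against division by near-zero reference
    values; it is not part of the procedure described in the text.
    """
    if len(sample) != len(reference):
        raise ValueError("sample and reference must have the same length")
    return [s / max(r, eps) for s, r in zip(sample, reference)]


# Toy values (not measured data): raw intensities at three wavelengths.
reference = [200.0, 400.0, 800.0]
sample = [100.0, 100.0, 600.0]

relative = normalize_spectrum(sample, reference)
print(relative)  # [0.5, 0.25, 0.75]
```

Because the lamp spectrum and the detector response appear in both the sample and the reference measurement, they cancel in the ratio, which is why the same reference should be re-acquired before each measurement set.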

Obtained Spectral Data: Initial Analysis
All data obtained in the laboratory setup (Figure 1) were collected in the spectral range of 350–750 nm, with a resolution of 0.5 nm. The amount of data collected in a real stand, for example, by a sorting machine, requires large calculation capacities when the whole range is analyzed. Researchers have tended to apply the method in commercial designs with VIS illumination; therefore, we needed to decrease the amount of analyzed data to ensure a reasonable processing time. From the whole spectral range registered by the spectrometer, we chose the wavelengths that correspond to commercially available lasers, such as diode-pumped solid-state (DPSS) lasers. The DPSS lasers were chosen because of their potential applicability in sorting and detecting systems, and because of their lasing parameters, stability, and relatively small dimensions. The available wavelengths of the DPSS lasers that are within the measurement range used for eggshells are 457, 473, 501, 515, 523, 526, 532, 543, 556, 561, 589, 593, 660, 671, 690, and 729 nm. The incandescent lamp used in the laboratory setup had a significant drop in intensity for wavelengths lower than 450 nm; this is the main reason for the lower limit on DPSS wavelengths because, based on the experiment, no reliable data could be obtained from that region. Table 2 lists each wavelength and its full width at half maximum (FWHM) together with the assigned name of the corresponding input variable. Figure 2 visualizes an example of data for ten samples from each measurement category.
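To illustrate how a full spectrum can be reduced to a handful of laser-line features, the sketch below averages the reflectance inside a window centered on each wavelength. It is only an illustration under stated assumptions: the averaging rule, the window width of 3 (taken to be in nm), the function name `band_features`, and the synthetic spectrum are not taken from the study. The wavelength list is copied from the text above.

```python
LASER_WAVELENGTHS_NM = [457, 473, 501, 515, 523, 526, 532, 543, 556,
                        561, 589, 593, 660, 671, 690, 729]


def band_features(wavelengths_nm, reflectance, centers_nm, fwhm_nm=3.0):
    """Average the reflectance inside a window of width `fwhm_nm` per center.

    Averaging is an assumption made for this sketch; the window width is a
    hypothetical default.
    """
    half = fwhm_nm / 2.0
    features = []
    for center in centers_nm:
        window = [r for w, r in zip(wavelengths_nm, reflectance)
                  if center - half <= w <= center + half]
        if not window:
            raise ValueError(f"no samples near {center} nm")
        features.append(sum(window) / len(window))
    return features


# Synthetic spectrum on the 350-750 nm grid with a 0.5 nm step (801 points).
grid = [350.0 + 0.5 * i for i in range(801)]
spectrum = [w / 1000.0 for w in grid]  # toy linear "reflectance", not real data

features = band_features(grid, spectrum, LASER_WAVELENGTHS_NM)
print(len(grid), "->", len(features))  # 801 -> 16
```

With this reduction, each measurement shrinks from 801 spectral points to one value per laser line, which is the kind of saving that motivates restricting the analysis to a few wavelengths.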

Support Vector Machine
Support vector machine (SVM) is a supervised learning algorithm introduced by Vapnik [23]. This algorithm aims to find a hyperplane that separates two classes of a multi-dimensional dataset with a maximum margin. A linear SVM requires the dataset to be linearly separable. If the dataset does not fulfill this requirement, a nonlinear kernel function must be used that maps the data to a higher-dimensional space in which they can be separated by a hyperplane. The choice of kernel depends on the type of data. The radial basis function (RBF) kernel is commonly considered a good choice for practical applications [23]. When classifying spectral data collected from diseased and healthy eggs, the problem is the overlap of measurements from both classes. In this case, the use of RBF is preferred because it implicitly maps the data to an infinite-dimensional space in which the relationship between observations might be found. The resulting kernel values enable the creation of a decision boundary for the classification of multidimensional data [23][24][25].
Table 2. Selected wavelengths, their FWHM, and the assigned input variable names.
Name  Wavelength (nm)  FWHM (nm)
A0    457              3
A1    473              3
A2    501              3
A3    515              3
A4    523              3
A5    526              3
A6    532              3
A7    543              3
A8    556              3
A9    561              3
A10   589              3
A11   593              3
A12   632              3
A13   660              3
A14   671              3
A15   690              3
A16   729              3

Metrics
To determine whether the SVM algorithm was able to find the optimal solution for binary classification of the data, the numerical values that determine the quality of such a prediction were calculated.
Firstly, we define the outcomes of the model classification compared to their real labels. The technique that helps to represent the quality of classification is the confusion matrix. In binary classification, this matrix consists of four values: true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TNs) [26]. Every observation is placed in the proper cell according to the real and predicted value. Then, the confusion matrix is analyzed to define the quality of the classification.
For binary classification, three metrics extracted from the confusion matrix are often used when a dataset is unbalanced [27]. The first is precision: Precision = TP/(TP + FP). Precision indicates what fraction of the observations assigned to the positive class by the algorithm actually belong to it. The second metric is recall, which indicates the ability of the model to capture positive cases: Recall = TP/(TP + FN).
Since the goal of the experiment was to maximize both metrics, a third, auxiliary metric was introduced, the F-score: F-score = 2 × Precision × Recall/(Precision + Recall). The F-score is the harmonic mean of the above two metrics. Because the harmonic mean is dominated by the smaller of its arguments, a high F-score requires both precision and recall to be high, so maximizing it improves the overall quality of the classification. In our case, this was especially important because the dataset was unbalanced. Thus, using a simple accuracy metric would not be sufficient. A model that classifies everything as one class may obtain a high accuracy, depending on the level of unbalance of the set. In addition, the previously mentioned metrics can be used to optimize individual classification tasks. For example, maximizing recall captures more of the positive cases, i.e., eggs from sick hens. Unfortunately, this incurs a tradeoff: observations belonging to the negative class are more likely to be incorrectly classified as positive, which lowers precision. This phenomenon is called the precision-recall tradeoff.
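The metrics above can be computed directly from the confusion-matrix counts. The sketch below is illustrative only: the function names, the toy labels, and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the study. It also shows why accuracy is misleading on an unbalanced set.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f(tp, fp, fn):
    """Precision, recall, and F-score; 0.0 on an empty denominator (assumed)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy unbalanced set: 10 positives (sick) and 90 negatives (healthy).
y_true = [1] * 10 + [0] * 90

# A degenerate model that labels every egg as healthy.
all_negative = [0] * 100
tp, fp, fn, tn = confusion_counts(y_true, all_negative)
accuracy = (tp + tn) / len(y_true)
_, _, f_degenerate = precision_recall_f(tp, fp, fn)
print(accuracy, f_degenerate)  # 0.9 0.0

# A model with 8 TP, 2 FN, 4 FP, and 86 TN.
y_pred = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 86
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f_score = precision_recall_f(tp, fp, fn)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

The degenerate model reaches 90% accuracy while its F-score is zero, which is the situation the F-score is meant to expose.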

Hyperparameter Optimization
Using SVM with RBF as the kernel function, there are two important hyperparameters, C and gamma, that need to be optimized to obtain a robust and properly regularized classifier. The problems that arise in optimization are overfitting and underfitting of the classifier. The former means that the algorithm fits its decision boundaries too closely to the training dataset. Thus, when a new observation is being classified, there is a high probability that it will be classified incorrectly. Overfitting can be detected using either an additional partitioning of the set into a validation dataset or k-fold cross-validation. Underfitting is easy to recognize because the model performs poorly, assigning what appear to be random classes to the classified observations. The hyperparameter C controls how the decision boundary is generated. The smaller the C, the smoother the decision boundary; as C increases, the decision boundary becomes increasingly complicated, until it can separate individual observations. This achieves maximum classification quality on the training set but generates a highly overfitted model that is unusable when the predicted observations differ from those in the learning set. The gamma used in the kernel function controls the strength of the influence of each observation. When gamma is large, a single observation only affects observations in close proximity in the multidimensional space.
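The role of gamma can be made concrete with the common parameterization of the RBF kernel, K(x, y) = exp(-gamma · ||x - y||²), which is the form used by libraries such as LIBSVM and scikit-learn. The source does not state the formula explicitly, so treating it as the kernel used in the study is an assumption; the function name and the sample points are hypothetical.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two toy points with squared distance 2.
a = [0.0, 0.0]
b = [1.0, 1.0]

# The larger gamma is, the faster the similarity decays with distance.
for gamma in (0.01, 0.84, 10.0):
    print(gamma, rbf_kernel(a, b, gamma))
```

A kernel value near 1 means the two observations are treated as similar; with a large gamma the value drops toward 0 even for nearby points, so each training observation influences only its immediate neighborhood.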
In our case, to optimize hyperparameters, a grid search algorithm was used [28,29]. The algorithm trained many models (classifying algorithms) on given data within given ranges of selected hyperparameters. After finishing the process, the best model with the chosen values of hyperparameters was returned. To increase the accuracy of classification, the search was performed in two stages. Firstly, a coarse search was performed for a wide range of values (depending on the chosen parameter) with a large step size. The algorithm was then used again within the ranges based on the first search results and with a much smaller step size. To obtain a regularized model, 5-fold cross-validation was used. Using k-fold cross-validation helps ensure that the model performs well with different training and validation sets, but for large datasets, this solution is not optimal because the learning time is multiplied k times [30]. In the subsequent sections, we use five-fold cross-validation with the F-score metric.
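The two-stage search described above could be sketched with scikit-learn as follows. This is an illustration, not the authors' pipeline: the synthetic data, the grid ranges, the multipliers used for the fine grid, and the helper name `search` are assumptions. Standardization is placed inside the pipeline so that the scaler is fitted on the training folds only, which is a design choice of this sketch.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for spectral features: two overlapping classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (60, 4)), rng.normal(2.0, 1.0, (40, 4))])
y = np.array([0] * 60 + [1] * 40)


def search(c_values, gamma_values):
    """Grid search over C and gamma with 5-fold CV, scored by the F-score."""
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": list(c_values), "svc__gamma": list(gamma_values)}
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    result = GridSearchCV(pipe, grid, scoring="f1", cv=cv)
    result.fit(X, y)
    return result


# Stage 1: coarse search over several orders of magnitude (assumed ranges).
coarse = search(np.logspace(-2, 3, 6), np.logspace(-3, 2, 6))
c0 = coarse.best_params_["svc__C"]
g0 = coarse.best_params_["svc__gamma"]

# Stage 2: finer search around the best coarse values (assumed multipliers).
steps = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
fine = search(c0 * steps, g0 * steps)

print(fine.best_params_, round(fine.best_score_, 3))
```

The attribute `best_score_` is the mean F-score over the five folds for the best parameter pair, which corresponds to the "five-fold CV mean" reported in the rankings.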

Results
The RBF-based SVM algorithm is highly susceptible to the different magnitudes of the individual features submitted for learning. Therefore, the features all need to be standardized so that one feature is not predominant compared to the rest [31]. In our case, the features were the integer reflectance intervals corresponding to the individual half-widths of the previously selected lasers. The standardization was performed by subtracting the mean of all observations from each observation and dividing by their standard deviation. This yielded a distribution with the mean at zero and a standard deviation equal to one. Even though all measurements were previously normalized after the data collection process, the use of additional standardization raised the classification result by 2% and 4% for white and brown eggs, respectively. However, the choice of scaling should be made on a case-by-case basis, and it is best to test which of the scaling measures returns the highest quality classification. It is also important to check for outliers in the dataset, as they will disrupt not only the standardization process but also the learning process. In our dataset, there were no outliers, so it was possible to start learning.
Since the dataset was not large, it was possible to generate subsets consisting of all combinations without repetition from one to nineteen of all selected wavelengths. The combinations were ranked, and the top twenty positions are shown in Tables 3 and 4 for the white and brown eggshells, respectively. The quality of classification with hyperparameter optimization was then tested for each subset of features, and those with the highest results were selected. A new ranking for white and brown eggshells is shown in Table 5 and Table 6, respectively. This step was important because an overabundance of features can lead to noise in the best solution, and too few features can lead to underfitting the classification model.
The importance of proper values of the C parameter and gamma was discussed in Section 2.2. A new ranking of wavelength combinations was performed with a lower value of C and a higher value of gamma. The results are shown in Table 7 (for white eggshells) and Table 8 (for brown eggshells).
Table 7. Top 20 ranking for multiple wavelength usage for white eggshells with hyperparameter optimization and decreased C parameter value.
Tables 9 and 10 are included to increase the clarity of the results. The exact combination of wavelengths is shown; those in bold are the key parameters that carry crucial spectral information for the proper classification of eggshells.
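The standardization and the exhaustive enumeration of wavelength subsets can be sketched as below. This is a minimal illustration under assumptions: the population standard deviation is used (the text does not say which estimator was applied), the function names and toy values are hypothetical, and the 17 variable names A0 to A16 are taken from Table 2.

```python
from itertools import combinations
from statistics import mean, pstdev


def standardize(column):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        raise ValueError("constant feature cannot be standardized")
    return [(value - mu) / sigma for value in column]


def feature_subsets(names):
    """Yield every non-empty combination of feature names, smallest first."""
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


# Toy feature column (not measured data): mean 5, standard deviation 2.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(values)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]

# All non-empty subsets of the 17 variables named in Table 2.
names = [f"A{i}" for i in range(17)]
n_subsets = sum(1 for _ in feature_subsets(names))
print(n_subsets)  # 131071
```

With 17 candidate variables there are 2^17 - 1 = 131,071 non-empty subsets, which is small enough to rank exhaustively on a dataset of this size but grows exponentially with every added wavelength.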

Discussion
For obtaining reliable results with the proposed algorithm, the value of the optimization parameter C must be decreased, and the value of the gamma parameter must be increased. The ranking of wavelength combinations together with the five-fold CV mean for white and brown eggshells are shown in Tables 9 and 10, respectively. The obtained F-score results are satisfactory. As predicted, the F-score value for the white eggshell dataset is higher than that for the brown eggshell dataset due to the broader variety of brown eggshell pigmentation, which strongly affects the spectral characteristics of the reflected light.
For the brown eggshell dataset, to obtain an F-score higher than 83%, it was necessary to use at least five wavelengths (out of 17). Analysis of the data shown in Table 10 resulted in four key wavelengths: A4 = 523 nm, A8 = 556 nm, A14 = 671 nm, and A15 = 690 nm. The fifth wavelength necessary to reach a high F-score is A3 = 515 nm; nevertheless, the highest score (86.21%) was obtained by adding a sixth wavelength, A5 = 526 nm. Additional wavelengths that may be added to the key ones are A6 = 532 nm and A7 = 543 nm. The usage of only the key wavelengths (A4 = 523 nm, A8 = 556 nm, A14 = 671 nm, A15 = 690 nm) resulted in an F-score of 78.552% and a five-fold CV SD of 3.35, with C = 76.1 and gamma = 0.84.

Conclusions
The application of the proposed algorithm (with the optimization of hyperparameters) for the classification of eggshells into groups from either a healthy flock or an MS-infected flock produced the following results: for the white eggshell dataset, an F-score over 95% was obtained; for the brown eggshell dataset, 86% was the highest obtained result. The significantly lower F-score for the brown eggshell dataset is due to the wider range of pigmentation in the samples. Analysis was performed using the wavelengths of DPSS lasers, which are easily available on the market, so further research with different wavelengths may result in better scores. The crucial conclusion drawn from the obtained results is that parallel classification of white and brown eggshells requires increasing the number of wavelengths to eight. Only two wavelengths are common among the key ones for both groups: A14 = 671 nm and A15 = 690 nm. The sets differ in the number of key wavelengths and the spectral ranges of those wavelengths.
For white eggshells, an F-score higher than 94% was obtained with seven wavelengths, but to reach the top result (95.755%), eight wavelengths were required. For brown eggshells, five wavelengths were sufficient, with six required to reach the top F-score (86.21%).

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy issues.