Advanced Classification of Coffee Beans with Fatty Acids Profiling to Block Information Loss

Abstract: Classification is a core process in the standardization, grading, and sensory aspects of the coffee industry. Chemometric data on fatty acids and crude fat are used to characterize coffee varieties. Two binary classifiers were used to distinguish the species and roasting degree of coffee beans. However, fatty acid profiling with normalized data gave poor discrimination when species and roasting degree were classified together. This result conflicts with human cognition, since roasted coffee beans are easily distinguished visually from green coffee beans. By exploring the effects of error analysis and information processing technologies, the lost information was identified as a bias-variance tradeoff derived from the percentile normalization. The roasting degree, as extensive information, was attenuated by the percentile normalization, whereas the cultivars, as intensive information, were enhanced. An informational spiking technique is proposed to patch the dataset and block the information loss. This approach to blocking information loss could be applicable to multidimensional classification systems based on chemometric data.


Introduction
Various classification techniques are widely used in the identification of cultivars or species, as well as in the standardization and grading of products for commercial and agricultural production [1][2][3]. Classification is also a core process for accurate decision-making after measurements in observation, survey, clinical diagnosis, and industrial quality management [4][5][6][7].
Green coffee is one of the most traded agricultural commodities in the world. Commercial coffee consists almost entirely of two species, Coffea arabica (Arabica) and Coffea canephora (Robusta). Arabica is generally more prominent and expensive in the market [8].
Green beans of the two species can be distinguished by their characteristic appearances and by different compositions that affect the sensory qualities of coffee products [5]. However, most commercial roasted and ground coffees are actually blends of the two species. A molecular genetics approach has been applied to differentiate the two coffee species in green beans and to quantify any adulteration of Arabica with Robusta beans [9]. After roasting and grinding, more advanced analytical methods are required as indicators of subtle differences between the coffee species [10] because these biological features are diminished after roasting at high temperature (>200 °C) [11].
Symmetry 2018, 10, 529
Within this realm, several works have successfully distinguished the coffee varieties by using their chemometric data, such as amino acids, metals, sucrose, organic acids, and sterols [6,12,13]. However, acquisition of measured data should be readily available, assured, and inexpensive for a good predictive model. Otherwise, a good predictor derived from measured data must be associated with sensory evaluations [14][15][16][17]. The sensory descriptors could also be established using regression approaches based on the chemometric data [3,10,18]. Fatty acid profiling is most often evaluated to achieve discrimination among the varieties of coffee beans because the sensory qualities of coffee are complicated and affected by multiple factors [19].
Different types of compositional data have been applied to characterize green beans (cultivars as a nominal variable) or to investigate the roasting degree of coffees (on a ratio scale). The first two principal components of visible micro-Raman spectra reveal different chlorogenic acid and lipid compositions when comparing Arabica and Robusta green coffee [20]. Dong et al. reported the effect of different drying techniques on the molecular composition of green Robusta [21]. Wei et al. used an NMR-based prediction model to evaluate roasted coffee bean extracts [22]. Han et al. and Frank et al. used specific chemical compounds to assess the toxic risk [23] and bitter taste [24] in roasted coffees, respectively. Romano et al. used a specific fatty acid ratio to determine the relative amounts of Arabica and Robusta in a green coffee blend [1]. Martin et al. obtained a classification result with residual errors for green and roasted Arabica and Robusta coffees by using linear discriminant analysis [25]. Recently, Dias and Benassi proposed a two-step discrimination among coffee species and roasting degrees carried out using heat-labile compounds [11]. All of these studies demonstrate that multidimensional discrimination is a challenging task in classification.
As shown in Figure 1, chemometric protocols applied to the fatty acid composition data of specimens provide an approach to extract information on coffee quality. In this study, a discriminant system was developed with a learning model to achieve predictive functions. Two linear classifiers, LC_RG (roasted, green) and LC_AR (Arabica, Robusta), are used to establish four independent groups. Thus, any one specimen (S_i) can be placed into one of the groups, as the logic expression S_i ∈ {(Roasted ∪ Green) ∩ (Arabica ∪ Robusta)} indicates. The performances of the classifiers with chemometric data were evaluated and validated by their correctness. However, information loss causing mislabeling was found when the reliability of the data processing was evaluated. The LC_RG operator has poorer accuracy than LC_AR, showing that the result of the prediction model is in conflict with the context of human cognition, since roasted coffee beans are easily distinguished from green ones by their brown color. A similar bias-variance dilemma was also observed in an earlier classification study [25]. The bias-variance tradeoff has also been applied to explain the effectiveness of heuristics in human learning, even though it is a problem in supervised learning.
As technology progresses, classifications are used across every discipline, and data structures are evolving into more complex forms [26,27]. In this study, the source of the information loss was identified as an obvious pattern of classification errors derived from percentile normalization. Further, the accuracy of the classification system could be successfully enhanced by patching the breach using other featured data with the same properties as the lost information, as shown in Figure 2.
The use of regression analysis aims to find independent latent variables for advanced classification. At the same time, some leaks can be produced by the structural normalization of the dataset. Thus, data integrity and quality must be considered in a preprocessing phase before extracting knowledge from raw data [28]. The preprocessing phase takes over half of the knowledge discovery process. Our study demonstrates informational extraction achieved by patching data structures in a multimodal classification.

Sample Collection and Preparation
Green coffee beans of Arabica and Robusta cultivars were purchased from coffee suppliers who guaranteed the origins and were verified by our experts. Portions of green beans were roasted and collected for further analysis and cupping with reliable and traceable filing.
The roasting and grinding levels of these coffee beans were arbitrary and without specific requirements. We expect that the samples were similar to those obtained in daily life. All of the coffee beans, including green and roasted beans, were stored under steady conditions to avoid oxidation or compositional changes. Then, 200 g of each portion of ground coffee beans (powdered) was sampled and labelled as a specimen in this study.

Lipid Extraction and Crude Fat
The Soxhlet solid-liquid extraction method [29] (Association of Official Analytical Chemists (AOAC) Official Method 2003.05/920.39) was used to extract the lipid fraction from the ground coffee beans. All of the glass apparatus were rinsed using petroleum ether and dried in an oven at 102 °C. Ten grams of ground coffee sample were weighed and placed in the thimble. A quantity of 90 mL of petroleum ether was placed in a 150 mL round-bottom flask. We continued the extraction process for 5 h, and a defatted residue was obtained after distillation. Almost all the solvent was collected and placed in the oven and then removed using a desiccator. The weight of the sample was then noted. As a result, the crude fat (%) = (W − T)/S × 100% was calculated, where W, T, and S are the weights of the thimble with ether extract, the empty thimble, and the sample, respectively.
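The crude fat formula above can be written as a short function. This is a minimal sketch; the function name, argument names, and the example weights are illustrative assumptions and are not taken from the study.

```python
def crude_fat_percent(w_thimble_with_extract, t_empty_thimble, s_sample):
    """Crude fat (%) = (W - T) / S * 100, with all weights in the same unit."""
    if s_sample <= 0:
        raise ValueError("sample weight S must be positive")
    return (w_thimble_with_extract - t_empty_thimble) / s_sample * 100.0


# Hypothetical weights in grams, chosen only to illustrate the arithmetic.
example = crude_fat_percent(
    w_thimble_with_extract=6.2,  # W: thimble with ether extract
    t_empty_thimble=5.0,         # T: empty thimble
    s_sample=10.0,               # S: ground coffee sample
)
print(f"crude fat = {example:.1f}%")
```

Because the result is a ratio of weights, the unit cancels as long as W, T, and S are measured on the same scale.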

Preparation of Fatty Acid Methyl Esters
Fatty acid methyl esters (FAMEs) were prepared by a method modified from the IUPAC standard method [30,31]. Briefly, 200 mg of crude fat (lipid extraction) in a screw-capped glass tube was hydrolyzed with 1 mL of 1 M KOH in 70% ethanol (Sigma-Aldrich, St. Louis, MO, USA) at 90 °C for 1 h. The reaction mixture was acidified with 0.2 mL of 6 M HCl, and then 1 mL of water was added. The free fatty acids (FAs) were extracted with n-hexane to be methylated with 1 mL of 10% BF3 in methanol at 37 °C for 20 min. A quantity of 3 mL of 6% potassium carbonate solution was added to the solution, and then FAMEs were extracted with 1 mL of hexane. Of the n-hexane top layer, 200 µL was transferred into a vial and crimped.

Fatty Acids Profile by GC-FID Analysis
The FAMEs were determined using gas chromatography (TRACE GC Ultra, Thermo Fisher Scientific, Rodano-Milan, Italy) equipped with a flame ionization detector (FID) and liquid auto-injector (AI-3000, Thermo Fisher Scientific, Rodano-Milan, Italy). Separation was carried out in an Rtx-WAX capillary column (60 m × 0.53 mm id × 1 µm, Restek Corporation, Bellefonte, PA, USA). Injection volume was 1 µL in split mode, and inlet temperature was 250 °C. Nitrogen was used as the carrier gas (flow rate of 1.2 mL/min), and the oven temperature was programmed as follows: initial temperature 50 °C, held for 2 min; then increased by 10 °C/min to 280 °C, where it was held for 5 min. All FAME data were recorded and quantitatively integrated using the Chrom-Card data system (version 2.3, Thermo Fisher Scientific, Rodano-Milan, Italy) with an external standards calibration curve.
In addition, the individual FAME peaks were identified using an Agilent gas chromatograph and mass spectrometric detector (models 6890N GC and 5973 MSD, Agilent Technologies, Santa Clara, CA, USA) under the same chromatographic conditions. Scan acquisition (m/z 45-550) for the MSD in EI mode was carried out using HP ChemStation B.04.03 (Agilent Technologies, Santa Clara, CA, USA) and the NIST 17 Mass Spectral Library (Scientific Instrument Services, Ringoes, NJ, USA).

Statistics Software and Calculations
Statistical calculations and analysis were performed using Excel 2010 (Microsoft Corporation, Santa Rosa, CA, USA) and PASW Statistics 18.0.3.25 (International Business Machines Corporation, Armonk, NY, USA). The normalized and standardized data were recalculated into a new data matrix. The discriminant analysis was carried out in the direct mode, and all variables passing the tolerance criterion (0.001) were entered simultaneously with equal prior probabilities. The discriminant output displays a maximum-variance pattern (and structure) matrix without a rotated transformation.

Fatty Acids Analysis by GC-FID
Fats and oils are important ingredients in many foods. Fat contributes to the texture, flavor, mouthfeel, and aroma of foods. The fatty acid composition was determined by the GC-FID method with a calibration curve after methyl esterification and extraction. All quantitative data are listed in Table 1. Regression analysis is widely used to estimate the relationships among variables for prediction models in the field of machine learning [32]. The performance of regression analysis methods in practice depends on the form of the data generating process and on the probability distributions of the dependent variables around the prediction of the regression function.
The majority of the fatty acid composition of coffee beans is contributed by C18:2 and C16:0, and the smallest parts (<1%) are accounted for by C20:1 and C22:0. Because the absolute measurement uncertainty is a constant value, the fatty acids present in smaller proportions have greater relative uncertainty than those present in larger proportions. For instance, the relative standard deviation (RSD) of fatty acid C20:1 is 13.7%, greater than the 0.126% of fatty acid C18:2, since the limit of quantitation (LOQ) is 50 ppm (0.05 mg/g).
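The relationship between a constant absolute uncertainty and the relative uncertainty of major and minor components can be illustrated with a short sketch. The concentration levels below are hypothetical, and setting the absolute uncertainty equal to the LOQ of 0.05 mg/g is an assumption made only for illustration; the sketch does not reproduce the reported RSD values.

```python
def relative_uncertainty_percent(level, absolute_uncertainty):
    """Relative uncertainty (%) of a measured level for a given absolute uncertainty."""
    return absolute_uncertainty / level * 100.0


# Assumption: the same absolute uncertainty applies to every fatty acid.
ABSOLUTE_UNCERTAINTY = 0.05  # mg/g, set equal to the stated LOQ for illustration

# Hypothetical levels in mg/g for a major and a minor fatty acid.
major_level = 50.0
minor_level = 0.5

major_rel = relative_uncertainty_percent(major_level, ABSOLUTE_UNCERTAINTY)
minor_rel = relative_uncertainty_percent(minor_level, ABSOLUTE_UNCERTAINTY)
print(f"major: {major_rel:.2f}%  minor: {minor_rel:.2f}%")
```

Under this assumption the relative uncertainty scales inversely with the level, so a component that is 100 times less abundant carries a relative uncertainty 100 times larger.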
As the featured variables, the distributions of fatty acids were compared to fit the normally distributed populations in Figure 3. However, highly symmetrical variances are not sensitive to the varieties of coffee beans. The dataset was not directly used as input variables for the classification algorithm. The similar distributions among these variables imply that the variances of fatty acids are constrained patterns within the dataset. This pattern may refer to the relationship of continuous variables, as opposed to the discrete variables used in classification.

Figure 3. The measured data of 34 coffee samples were pooled and profiled for the variances of cFAT and fatty acids, as unsupervised data. The boxes and lines represent the mean ± 2s (standard deviations, 95%), the median, and the quartiles (Q1 and Q3) for specific variables after standardization. The numbers below (underlined) give the average composition of each fatty acid as a percentage of total free fatty acids (100%).

Normalization (Percentile) and Standardization (Z-Score)
Many data processing techniques are used to reformat a data framework into a normalized form, including percentages, standardization (Z-score), logarithms, and inverses of the measured data. Generally, normalization removes the physical units of a measured dataset to make it a dimensionless dataset.
The fatty acids C18:0 and C18:2 are used to describe the structural characteristics of the system in Figure 4. The correlation with the original measured data has a strong linearity, which is 5 times the absolute quantities of fatty acids. The groups of roasted Arabica and green Robusta are at the ends of the line, and the other two groups are superposed in the middle zone of the line. The high correlation of the two fatty acids implies dependence among the variables in the quantitative data. Thus, the composition of fatty acids could be considered as an intensive property of individual specimens. After normalization by percentiles, the percentile data show a scattering pattern without an obvious correlation in six times the dimensional quantities of the fatty acids. The results indicate that the structure of the original dataset is ordered and becomes disrupted and more varied after normalization. Percentile normalization can enhance the variability of quantitative data, but at the same time it also amplifies the uncertainty (bias) that adds to the variances.
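The two transformations named in this section can be sketched in a few lines. The sketch assumes that "percentile normalization" means expressing each fatty acid as a percentage of the specimen's total, and the fatty acid amounts are hypothetical values chosen for illustration. It shows that a specimen with the same composition but a larger total amount becomes indistinguishable after percentile normalization.

```python
from statistics import mean, stdev


def percentile_normalize(amounts):
    """Express each amount as a percentage of the specimen total (row-wise)."""
    total = sum(amounts)
    return [100.0 * x / total for x in amounts]


def z_scores(values):
    """Standardize one variable across specimens: (x - mean) / sample std dev."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]


# Hypothetical absolute amounts (mg/g) of four fatty acids in one specimen.
specimen_a = [30.0, 8.0, 10.0, 45.0]
# Same composition, but 1.5 times the total amount (an extensive difference).
specimen_b = [1.5 * x for x in specimen_a]

percent_a = percentile_normalize(specimen_a)
percent_b = percentile_normalize(specimen_b)

# The extensive difference is visible in the raw totals only.
print(sum(specimen_a), sum(specimen_b))
print([round(p, 2) for p in percent_a])
print([round(p, 2) for p in percent_b])

# Z-scores of one variable across three hypothetical specimens.
print(z_scores([1.0, 2.0, 3.0]))
```

The percentages of each specimen sum to 100, so the last value is fixed by the others; this is the closure constraint that removes one degree of freedom from a compositional dataset.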

Discrimination Analysis
The raw data of pooled specimens were calculated using the linear classifiers (LC_RG and LC_AR) to obtain the scores dF_RG and dF_AR, respectively. Further, the discriminant scores were scattered into the groups (quadrants), as shown in Figure 5A. The target of classification was successfully achieved by the linear discriminant algorithm. The percentile data were also given scores by the linear classifiers, and the resulting scatter plot is shown in Figure 5B. It is worth noting that there are five cases of error in the classification, which are noted in the confusion matrix in Table 2. The classifier of coffee species (LC_AR) has perfect correctness, but the classifier of roasting degree (LC_RG) only has 85% correctness. The classification errors in roasting degree occur in two ways: green mistaken as roasted or roasted mistaken as green. The LC_RG has poorer discriminability than the LC_AR in the training model; this is in conflict with the predictive model using human cognition.
Table 2. Classification accuracy is assessed using a confusion matrix based on the discriminant functions with the normalized (percentile) data and classification into the four groups (2 × 2). The correctness is used to describe the performance of the individual classifiers (LC_RG or LC_AR).
In sensory testing, it is easier to differentiate the roasting degrees than to distinguish coffee species. Therefore, some information associated with the roasting categories must be attenuated in the percentile normalization. The dimensional reduction of the dataset matrix, which is rescaled as a reference standard, perhaps causes the information loss. For instance, the eight fatty acids expressed as percentages have 7 degrees of freedom because the total composition must be 100%.
Discriminant analysis deals with taxonomic classification (supervised learning), in which the cases are partitioned into labeled groups. Partial least squares discriminant analysis has demonstrated great success in modelling high-dimensional datasets because of its versatility. Despite that, the user needs to optimize a wealth of parameters before reaching reliable and validated outcomes [26]. In contrast, principal component analysis and cluster analysis are used to explore unknown patterns in prior (unsupervised) learning.
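The way two discriminant scores place a specimen into one of the four groups, and the way correctness is computed for a single classifier, can be sketched as follows. The sign convention (positive dF_RG meaning roasted, positive dF_AR meaning Arabica) is an assumption made for illustration, as are the function names and the example scores; the actual cut-offs depend on the fitted discriminant functions.

```python
def assign_group(df_rg, df_ar):
    """Map two discriminant scores to a (roasting degree, species) pair.

    Assumed convention: a positive score selects the first label of each pair.
    """
    roasting = "Roasted" if df_rg > 0 else "Green"
    species = "Arabica" if df_ar > 0 else "Robusta"
    return roasting, species


def correctness(true_labels, predicted_labels):
    """Fraction of specimens whose predicted label matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)


# Hypothetical scores for one specimen.
print(assign_group(1.2, -0.4))

# Arithmetic check against the reported figures: 34 specimens, 5 errors,
# all in roasting degree since the species classifier is reported as perfect.
lc_rg_correctness = (34 - 5) / 34
print(f"LC_RG correctness = {lc_rg_correctness:.1%}")
```

With 29 of 34 specimens labeled correctly for roasting degree, the correctness is about 85%, which agrees with the value stated in the text.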

Information Loss in Data Processing
For supervised learning, the training dataset was reviewed according to the distributions of labeled categories. We examined the differences in labeled categories for each classifier using Student's t-test, as shown in Figure 6. Interestingly, the Z-scored data differed significantly for discrimination of green and roasted coffees, which is the function of LC_RG. However, the percentile data suppressed the significant difference between the green and roasted coffees, as shown in Figure 6A. Only the percentile data of fatty acids C20:0 and C22:0 have significance (t value > 2) at the 95% confidence level because the fatty acid C22:0 is the smaller part in the composition of fatty acids, with average levels less than 1.0%, as shown in Figure 3. The bias-variance tradeoff is a serious problem in this classification. On the other hand, the Z-scored data have significance for the discrimination of Arabica and Robusta coffees, and the percentile data are enhanced in significance for fatty acids C18:1, C18:2, C18:3, C20:0, and C20:1 in Figure 6B. Thus, the percentile normalization is better suited to distinguishing the Arabica and Robusta coffees.
These results demonstrate that the discrimination of roasting degree is dominated by the extensive property of the raw data, and the discrimination of coffee species is dominated by the intensive property of the percentile data, as it is invariant to relative scale. The majority of the lost information has an extensive property within the raw data. The percentile normalization reduces the dimensions of the data matrix and shrinks the contained information by erasing part of the extensive information.
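The comparison of labeled categories with Student's t-test can be sketched with a pooled-variance t statistic. This is a minimal illustration that assumes equal variances in the two groups; the input values are hypothetical, and the threshold of |t| > 2 is the rule of thumb quoted in the text rather than an exact critical value.

```python
from math import sqrt
from statistics import mean, variance


def student_t(group_a, group_b):
    """Two-sample Student's t statistic with a pooled variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))


# Hypothetical values of one variable in two labeled categories.
category_1 = [1.0, 2.0, 3.0]
category_2 = [4.0, 5.0, 6.0]

t_value = student_t(category_1, category_2)
print(f"t = {t_value:.3f}, significant by the |t| > 2 rule: {abs(t_value) > 2}")
```

Applying the same function to the Z-scored and the percentile versions of a variable would show which preprocessing keeps the two categories separable.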

Patching the Breach in the Classification System
The lost information has the extensive property and is erased in data processing.If related data with extensive property was spiked into the smaller normalized data pool, the discrimination could be enhanced, allowing higher correctness.As shown in our proposal in Figure 2, the crude fat content was used to patch the informational breach, forming a patching process for the classification system with normalized data.
In Figure 7A, the crude fat contents without normalization are distributed into the four labelled groups, and the confusion errors are shown in the grey area.The crude fat content with the extensive property of specimen information was not associated with the percentile fatty acids in order to avoid artificial containment derived from the normalization.Otherwise, the Z-scored data have significance for the discrimination of Arabica and Robusta coffees, and the percentile data are enhanced in significance for fatty acids C18:1, C18:2, C18:3, C20:0, and C20:1 in Figure 6B.Thus, the percentile normalization is better suited to distinguishing the Arabica and Robusta coffees.
These results demonstrate that the discrimination of roasting degree is dominated by the extensive property of the raw data, and the discrimination of coffee species is dominated by the intensive property of the percentile data, as it is relative scale invariant.The majority of the lost information has an extensive property within the raw data.The percentile normalization reduces the dimensions of the data matrix and shrinks the contained information by erasing part of the extensive information.

Patching the Breach in the Classification System
The lost information has the extensive property and is erased in data processing.If related data with extensive property was spiked into the smaller normalized data pool, the discrimination could be enhanced, allowing higher correctness.As shown in our proposal in Figure 2, the crude fat content was used to patch the informational breach, forming a patching process for the classification system with normalized data.
In Figure 7A, the crude fat contents without normalization are distributed into the four labelled groups, and the confusion errors are shown in the grey area. The crude fat content, which carries the extensive property of the specimen information, was kept separate from the percentile fatty acid data in order to avoid the artificial containment derived from the normalization. The patched dataset is well partitioned by the two classifiers; the results are shown in Figure 7B. The informational spiking with the extensive property of crude fat content (cFAT) provides evidence for the source of the information loss. These results demonstrate that the performance of a machine learning system depends on the integrity and type of the input information. Data processing may enhance one system function while suppressing another.
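The patching step can be sketched as follows. This is not the linear discriminant analysis used in this study: a nearest-centroid rule with resubstitution is swapped in as a simple stand-in, the eight samples are invented for illustration, and the names `SAMPLES`, `centroids`, and `correctness` are hypothetical. Plain Euclidean distance is applied to unscaled features for brevity, whereas a practical workflow would likely standardize the features first.

```python
from math import dist

# Hypothetical samples (not taken from Table 1): percentile contents of three
# fatty acids that sum to 100, plus the raw crude fat content (cFAT).
SAMPLES = [
    # (group, [pct_a, pct_b, pct_c], cFAT)
    ("Arabica/Green",   [33.0, 10.0, 57.0], 12.0),
    ("Arabica/Green",   [35.2, 10.0, 54.8], 12.4),
    ("Arabica/Roasted", [32.8, 10.0, 57.2], 15.0),
    ("Arabica/Roasted", [35.0, 10.0, 55.0], 15.4),
    ("Robusta/Green",   [29.0, 14.0, 57.0], 9.0),
    ("Robusta/Green",   [31.2, 14.0, 54.8], 9.4),
    ("Robusta/Roasted", [28.8, 14.0, 57.2], 12.0),
    ("Robusta/Roasted", [31.0, 14.0, 55.0], 12.4),
]


def centroids(rows):
    """Mean feature vector of each labelled group."""
    grouped = {}
    for label, features in rows:
        grouped.setdefault(label, []).append(features)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in grouped.items()
    }


def correctness(rows):
    """Fraction of samples assigned to their own group (resubstitution)."""
    centers = centroids(rows)
    hits = 0
    for label, features in rows:
        predicted = min(centers, key=lambda name: dist(features, centers[name]))
        hits += predicted == label
    return hits / len(rows)


# Percentile fatty acid data alone (the normalized data pool).
percent_only = [(label, pct) for label, pct, _ in SAMPLES]
# The same data spiked with the unnormalized crude fat content.
patched = [(label, pct + [cfat]) for label, pct, cfat in SAMPLES]

print("percentile data only:", correctness(percent_only))
print("patched with cFAT:   ", correctness(patched))
```

With these invented numbers, the percentile-only run separates the species but confuses half of the samples along the roasting dimension, whereas the patched run assigns all eight samples to their own groups. The sketch illustrates only the mechanism of spiking; it does not reproduce the correctness values reported in Table 2.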

Conclusions
All kinds of data are used as a medium for transmitting information in modern life. Different professional explanations are often added in the processes of data transfer and expression. We have shown that if there is no mutual crossover between the two sets of data, the percentile process is more effective for the classification of coffee beans. The source of the information loss in this classification was identified as the normalization processing, and the lost information has the property of an extensive quantity. The loss of information is noted in the quantitative features of coffee beans that have gone through the roasting process. The performance of this coffee classification is enhanced and validated by our patching technique with traceable informational processing. Furthermore, our results can promote correctness and help avoid the bias-variance tradeoff in classification systems with multiple classifiers. For industrial applications, the effects of different processing methods and materials could be associated with food quality and consumer preference through accurate discriminant analysis based on chemometric data.

Figure 1. Information conversion frames are associated with the real individuals, the chemometric data, and the sensory recognitions in this study.

Figure 2. Conceptual diagram showing the informational flows, leaks, and blocking in this 2D classification.

Figure 3. The measured data of 34 coffee samples were pooled and profiled for the variances of cFAT and fatty acids, as unsupervised data. The boxes and lines represent the mean ± 2s (s, standard deviation; 95%), the median, and the quartiles (Q1 and Q3) for specific variables after standardization. The underlined numbers below present the average composition of each fatty acid as a percentage of total free fatty acids (100%).

Figure 4. A restructured correlation of fatty acids (C18:0 and C18:2) is presented with the normalized data in percentiles and juxtaposed with the correlation of fatty acids with the pooled measured data.

Figure 5. Linear discriminant analysis plotted with the raw data (A) or the percentile data (%) (B) of the fatty acids in 34 coffee bean samples.

Figure 6. Comparison of the effects of the Z-scored or percentile (%) data on the t values of fatty acids for discrimination of (A) Green and Roasted or (B) Arabica and Robusta coffees.

Figure 7. Group distributions of the crude fat contents (A) in green and roasted coffee beans are compared using raw data. Further, linear discriminant analysis (B) is plotted with the percentile data of the fatty acids patched with the crude fat content (cFAT) of the 34 coffee bean samples.

Table 1. The measured data of the contents of crude fat (cFAT) and eight fatty acids (FAs) listed for 34 samples of coffee beans.

Table 2. Classification accuracy is assessed using a confusion matrix based on the discriminant functions with the normalized (percentile) data and classification into the four groups (2 × 2). The correctness is used to describe the performance of the individual classifiers (LC_RG or LC_AR).