Multivariate Analysis for the Classification of Chocolate According to Its Percentage of Cocoa by Using Terahertz Time-domain Spectroscopy (THz-TDS)

Abstract: Terahertz time-domain spectroscopy (THz-TDS) is a useful technique for determining physical characteristics of materials, based on the frequency-selective absorption of a broadband electromagnetic pulse. To investigate the potential of this technology for classifying chocolates by cocoa percentage, the terahertz spectra (0.5-10 THz) of five chocolate samples (50%, 60%, 70%, 80% and 90% cocoa) were examined. The acquired data matrices were analyzed in MATLAB 2019b, where the dielectric function and the absorbance curves were obtained, and the samples were classified using 24 classification models. The best result, an accuracy of around 93%, was obtained with a Gaussian SVM model with a kernel scale of 0.35 and a one-against-one multiclass method. It was concluded that the combination of terahertz time-domain imaging and machine learning algorithms can successfully classify chocolates with different percentages of cocoa.


Introduction
The spectroscopy techniques applied to organic products have mostly explored the visible, ultraviolet and infrared ranges of the spectrum, for example to assess how light-sensitive photoreceptors control many crucial biological processes [1]. The large number of studies at these frequencies is due to the availability of instruments, whereas the intermediate band, or terahertz (THz) region, has not yet been fully studied and characterized [2] and shows great potential for products of biological origin.
Non-contact and non-destructive methods, such as NIR spectroscopy [3], multispectral and hyperspectral imaging [4], imaging in the visible range [5] and Raman spectroscopy [6], have been widely used in the food sector because they are sensitive to intramolecular vibrations [7]. They have increasingly been applied as powerful analytical tools for determining food quality and for identifying geographical origin.
Although many of the spectroscopic techniques mentioned above have been applied to food inspection, little attention has been paid to terahertz (THz) spectroscopy, which covers a relatively unexplored range of the electromagnetic spectrum, from 0.1 to 10 THz, lying between the mid-infrared and microwave ranges [8].
The composition of cocoa beans is directly influenced by genetic variability, geographical origin and processing. Therefore, chemical and biochemical characteristics and their relationship to external parameters are key to quality control and technological aspects [9]. Near-infrared spectroscopy (NIRS) is currently being studied in the cocoa and chocolate industry [10]; these studies show that it can detect differences, but that there is still room for improvement, for example by exploring other spectral ranges. Here, THz spectroscopy could provide information in both the time and frequency domains while being insensitive to background thermal radiation [11].
To assess the applicability of this technology in the chocolate industry, the objective of this work is to determine how well chocolate bars can be differentiated by their percentage of cocoa using THz spectroscopy and multivariate analysis.

Raw Material
The cocoa genotype (Theobroma cacao L.) used is called "Marañon Native" and comes from the area of Cajamarca, Peru. It was used to make chocolate bars with 50%, 60%, 70%, 80% and 90% cocoa. Ten samples were prepared for each percentage. The bars measured 10 cm × 10 cm with a depth of 0.5 mm. One image was acquired for each sample, giving 50 images in total, each with 2048 wavelengths, which amounted to an average of 600 Mb per square centimeter analyzed.

Imaging Equipment in the THz Range
Terahertz time-domain measurements were obtained using a TeraPulse 4000 spectrometer (TeraView Ltd., Cambridge, UK) in transmission mode. The transmission chamber and the operating scheme are shown in Figure 1. The chamber was purged with dry nitrogen gas throughout the measurement, and noise was reduced by averaging 10 measurements. Each time-domain waveform covered a range of 150 ps with a resolution of 0.1 ps. Images were built with the equipment's scanner. Data were acquired in the TPRJ format, and the images were analyzed using code developed in-house in MATLAB v.2019b (MathWorks, Massachusetts, USA).

Multivariate Analysis
The flow chart in Figure 2 shows the methodology used to determine the best classifier of chocolates based on their cocoa content. Seven families of classification algorithms were considered: Decision Trees, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Support Vector Machines (SVM), Nearest Neighbor Classifiers, Ensemble Classifiers and Naive Bayes Classifiers. Varying the level of interpretability and flexibility within each family yielded the 24 models used in this research. Discriminating variables were selected with a feature selection technique based on the stepwise decorrelation of variables. Cross-validation was used to randomly divide the original data set of THz spectra into training and test sets, with the mean cross-validation error as the performance indicator. For the remaining parameters, a heuristic procedure was used to select the scale value of the kernel function. The best model was determined by its accuracy. Finally, once the best model had been determined, the features were transformed with PCA to reduce their dimensionality.
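The cross-validation step can be illustrated with a minimal Python sketch. It is not the MATLAB workflow used in this study: the function names, the nearest-centroid stand-in classifier, the toy two-feature data and the choice of 5 folds are all assumptions made only for illustration.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_fit(X, y):
    """Hypothetical stand-in classifier: store one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def cross_val_error(X, y, k, seed=0):
    """Mean misclassification rate over the k held-out folds."""
    errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = nearest_centroid_fit([X[i] for i in train],
                                     [y[i] for i in train])
        wrong = sum(nearest_centroid_predict(model, X[i]) != y[i]
                    for i in held_out)
        errors.append(wrong / len(held_out))
    return mean(errors)

# Toy data (assumed): 5 well-separated classes, 10 samples each,
# labelled with the cocoa percentages 50, 60, 70, 80 and 90.
X = [[10.0 * c + 0.1 * j, 10.0 * c - 0.1 * j]
     for c in range(5) for j in range(10)]
y = [50 + 10 * c for c in range(5) for j in range(10)]

err = cross_val_error(X, y, k=5)
print("mean cross-validation error:", err)
```

Each sample is held out exactly once, so the mean error reflects performance on data that the model did not see during training.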

Terahertz Imaging Analysis
In the experiment, chocolate samples were arranged in 10 cm × 10 cm trays and subjected to reflection image measurements. Figure 3(a) shows the transmitted time-domain pulse and the absorbance values for each type of chocolate. All images were pretreated with a linear filter to reduce image noise, as recommended by Shen [13]. The THz spectral image data set of a sample is based on specific parameters, such as the time interval, amplitude or phase of the THz wave, from which the refractive index, the spatial density distribution, the thickness distribution and the sample contour are derived. For each sample, the absorption coefficient was obtained as a function of THz frequency, as shown in Figure 3(a); the absorption spectrum of the chocolate samples was measured from 0.1 to 8 THz. The region of the absorption spectra from 0.1 to 2.0 THz showed the greatest difference among the samples, and this region was used for the multivariate analysis. Figure 3(b) also shows that the samples separate into 5 groups, except for a slight overlap at a frequency of 1.6 THz. This indicates that terahertz spectra contain enough information to classify products based on their cocoa percentage.
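As an illustration of how an absorbance curve can be derived from time-domain pulses, the following Python sketch uses one common definition, the negative base-10 logarithm of the power transmittance |E_sample(f)|² / |E_reference(f)|². The definition, the synthetic pulses, the 0.1 ps sampling step and the function names are assumptions for the example; they are not taken from the in-house MATLAB code.

```python
import cmath
import math

def dft_magnitude(signal):
    """Magnitudes of the one-sided DFT (naive O(N^2), fine for short signals)."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                  for t, s in enumerate(signal))
        mags.append(abs(acc))
    return mags

def absorbance(sample, reference, dt_ps):
    """Return (frequencies in THz, absorbance) for two equally sampled pulses.

    Absorbance is taken here as -log10(|E_s|^2 / |E_r|^2); other
    conventions exist, so this choice is an assumption.
    """
    n = len(sample)
    e_s = dft_magnitude(sample)
    e_r = dft_magnitude(reference)
    freqs = [k / (n * dt_ps) for k in range(len(e_s))]  # 1/ps equals THz
    absb = [-math.log10((s / r) ** 2) for s, r in zip(e_s, e_r)]
    return freqs, absb

# Synthetic pulses (assumed): a narrow Gaussian reference, and a sample
# pulse that is delayed by 5 points and attenuated to half the amplitude.
n_points = 64
reference = [math.exp(-0.5 * (t - 20) ** 2) for t in range(n_points)]
sample = [0.5 * reference[(t - 5) % n_points] for t in range(n_points)]

freqs, absb = absorbance(sample, reference, dt_ps=0.1)

# Keep only the 0.1 to 2.0 THz band used for the multivariate analysis.
band = [(f, a) for f, a in zip(freqs, absb) if 0.1 <= f <= 2.0]
print(len(band), "frequency points in the 0.1-2.0 THz band")
```

Because the synthetic sample attenuates every frequency equally, its absorbance curve is flat; real chocolate samples would show a frequency-dependent curve.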

Multivariate analysis
To achieve a proper classification, linear (LDA) and nonlinear (SVM) classification models were used. These models were built using the MATLAB Machine Learning application, which allowed interactive exploration of the data set, selection of features and specification of validation schemes. The training performance of the models was evaluated using the accuracy indicator (%). All models used cross-validation (15 folds). In total, 24 models were tested in this multivariate analysis. The models with the best accuracy are shown in Table 1. The best model was the optimized Fine Gaussian SVM, which obtained an accuracy of 93%, with a kernel scale of 1, a cubic function type and a one-vs-one multiclass method, optimized by Bayesian optimization with 30 iterations. This type of model has been reported many times in research on the use of machine learning for image recognition [14]. Figure 3(a) shows the confusion matrix of the model. The variance explained by the principal components in this PCA application is 63.8% for PC1 and 36.2% for PC2. In addition, a prediction model of SVM type was fitted, obtaining an RMSE of 0.171751, which is shown in Figure 4.
THz spectroscopy has been reported to have great potential in the field of chocolate production. A study by Catapano [15] on quality control of chocolate bars contaminated with foreign objects showed that THz technology has a great ability to detect and discriminate different types of materials based on their composition. Quality control based on composition has traditionally relied on techniques such as mass spectrometry [16]. It is therefore necessary to evaluate novel techniques such as time-domain spectroscopy, particularly given the importance of non-destructive inspection for the food industry and the need for modern, rapid techniques in the international food trade.
THz waves can penetrate a wide variety of materials, and the vibrational and rotational energy levels of most biological systems lie in the THz band [17]. Currently, THz-TDS remains an expensive technology. However, the recent and rapid development of THz systems for agro-industrial research opens up real possibilities for these costs to be significantly reduced in the coming years.
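The main ingredients of a Gaussian SVM with a one-vs-one multiclass scheme can be sketched in Python as follows. The sketch assumes the convention in which the predictors are divided by the kernel scale before the Gaussian kernel is evaluated. The function names, the vote list and the small confusion matrix are hypothetical and are not results of this study.

```python
import math
from collections import Counter
from itertools import combinations

def gaussian_kernel(x, y, kernel_scale=1.0):
    """Gaussian kernel exp(-||x - y||^2) on predictors divided by kernel_scale."""
    sq = sum(((a - b) / kernel_scale) ** 2 for a, b in zip(x, y))
    return math.exp(-sq)

def one_vs_one_pairs(classes):
    """One binary learner per unordered pair of classes: K * (K - 1) / 2."""
    return list(combinations(classes, 2))

def one_vs_one_vote(pairwise_winners):
    """Predicted class is the one that wins the most pairwise decisions."""
    return Counter(pairwise_winners).most_common(1)[0][0]

def accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

classes = [50, 60, 70, 80, 90]      # cocoa percentages
pairs = one_vs_one_pairs(classes)   # the binary problems to train

# A smaller kernel scale makes the kernel decay faster with distance.
k_wide = gaussian_kernel([1.0, 0.0], [0.0, 0.0], kernel_scale=1.0)
k_narrow = gaussian_kernel([1.0, 0.0], [0.0, 0.0], kernel_scale=0.35)

# Hypothetical winners of the pairwise learners for one test sample.
votes = [50, 70, 60, 70, 70, 70, 80, 60, 80, 90]
predicted = one_vs_one_vote(votes)

# Hypothetical 2-class confusion matrix (rows: true, columns: predicted).
acc = accuracy([[9, 1], [0, 10]])
```

With five cocoa levels, the one-vs-one scheme trains a separate binary classifier for every pair of levels and combines their decisions by voting.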

Conclusion
The overall results show that terahertz time-domain spectroscopy, together with classification modeling, can successfully identify the composition of chocolate bars based on their cocoa percentage. In addition, the ability of this technique to characterize the molecular structure of many biological substances makes it an attractive process analytical tool for better monitoring in food quality control. However, although terahertz time-domain spectroscopy is proving effective for classification tasks such as the one shown here for chocolate, many parameters must still be taken into account when using this type of technology.