Light and Shadow in Near-Infrared Spectroscopy: A Powerful Tool for Cannabis sativa L. Analysis

Abstract: Cannabis sativa L. is an ancient cultivar that has found applications in various fields, e.g., medicine, owing to its beneficial effects. However, because of its psychotropic effects, regulation of this cultivar has tightened over the decades. In this context, rapid and reliable analytical methods for the quality control of Cannabis cultivars have become extremely important. Near-infrared spectroscopy (NIRS) has emerged as a powerful tool in this field owing to its multiple advantages: it is non-destructive, rapid, and cost-effective. In this article, the chemometric techniques commonly employed in NIRS method development are described, along with their application to the analysis of Cannabis samples. Regarding qualitative methods, different mathematical treatments and classification models are explained. As for quantitative methods, the representative linear and non-linear modelling techniques applied to develop prediction equations are described, alongside their application in the Cannabis field. Although several articles address cannabinoid determination, to the best of our knowledge this is the first review of this type; its main purpose is to highlight the potential of NIRS over the traditional techniques employed for the analysis of Cannabis samples.


Introduction
Cannabis sativa L. is one of the oldest known cultivars and has been exploited for its versatility in various fields, such as medicine and the food and textile industries. Despite the negative connotation associated with its recreational consumption, Cannabis has been used as a potential treatment or palliative measure for various illnesses and medical conditions such as chronic pain, epilepsy, multiple sclerosis, cancer, glaucoma, and neurodegenerative disorders like Parkinson's and Huntington's disease, among others [1]. In this sense, the medical Cannabis industry focuses on the cultivation, processing, and distribution of Cannabis and Cannabis-derived products for medical purposes. It includes the production of pharmaceutical-grade Cannabis products, such as oils, tinctures, and capsules, among others, which are used to treat various medical conditions. The medicinal Cannabis industry is subject to strict regulations and often requires specific licenses and certifications. Cannabis produces at least 144 distinct phytocannabinoids, the main constituents of the plant, which are the potential source of this medical benefit. The most prevalent cannabinoids are the non-psychoactive tetrahydrocannabinolic acid (THCA) and cannabidiolic acid (CBDA); these transform into the biologically active neutral forms, Δ9-tetrahydrocannabinol (Δ9-THC) and cannabidiol (CBD), upon decarboxylation. In recent years, there has been growing interest in the scientific examination of other phytocannabinoids, including cannabinol (CBN), cannabichromene (CBC), and cannabigerol (CBG), with regard to their potential applications in medicine, cosmetics, and other fields [2].
On the other hand, industrial hemp, the non-psychotropic type of Cannabis sativa L., is characterized by minimal concentrations of Δ9-THC. Hemp fibres can be used in the production of textiles, paper, and building materials, while hemp seeds can be processed into food products and dietary supplements. The common agricultural policy (CAP) of the European Union provides subsidies for the cultivation of these specific strains, as long as their Δ9-THC content does not exceed 0.2%. In this sense, according to Regulation (EU) No. 1307/2013 of the European Parliament, the Δ9-THC content must be verified to prevent the cultivation of illicit drug-type Cannabis in hemp fields. However, this limit may vary depending on the country, rising to 1% in the Czech Republic [3].
For this reason, the importance of quality control in the Cannabis industry has increased dramatically, and numerous accredited testing laboratories have emerged in the last decade [4]. The complexity of the Cannabis sample makes the quality control of this matrix challenging, since many parameters may need to be determined. In this sense, the analyses can be classified according to whether they target favorable or unfavorable parameters. In the first group, different compounds such as cannabinoids, terpenes, flavonoids, alkaloids, or sterols, among others, may be quantified [5][6][7][8][9]. All these parameters give useful information about the plant material intended to be used as a medicinal product, since they confer certain characteristics on the Cannabis plant, such as flavour or smell. On the other hand, certain parameters such as pesticides, mycotoxins, or heavy metals are categorized as adverse factors, and their presence may need to be limited to specific quantities [10][11][12][13]. Therefore, they are also essential considerations for the quality control of Cannabis.
Traditionally, the determination of cannabinoids in Cannabis samples has been performed via chromatographic techniques, such as gas chromatography (GC) [14][15][16] or liquid chromatography (LC) [17][18][19], coupled to different detectors, namely diode array detection (DAD), mass spectrometry (MS), and flame ionization detection (FID), among others [20]. These techniques present a series of advantages, depicted in Figure 1, that make them appropriate for the analysis of cannabinoids and other minor analytes such as terpenes or flavonoids. LC and GC provide excellent separation capabilities, allowing the analysis of complex mixtures and the determination of cannabinoids and other compounds with high selectivity and sensitivity. This is especially true when they are coupled with detectors such as MS, which makes them suitable for the detection and quantification of compounds even at low concentrations [21]. However, achieving this analytical performance involves the use of hazardous solvents such as acetonitrile, methanol, and/or hexane (Figure 1). In addition, the expenses associated with the acquisition, maintenance, and operation of LC and GC equipment are significant, and the daily operation of these instruments requires highly skilled personnel [22]. As for GC, this approach is not suitable for the direct determination of the acidic forms of cannabinoids, since high temperatures (up to 300 °C) are reached during the analysis and these thermolabile compounds decarboxylate to their neutral forms when heated. Additionally, some studies have demonstrated the potential degradation of cannabinoids in the chromatographic injector port, leading to inaccuracies [23]. For this reason, derivatization of the acidic cannabinoids is mandatory prior to GC analysis, which can be time-consuming. Furthermore, temperature can also lead to conversion reactions [24][25][26][27][28], thus resulting in an inaccurate determination of the analyte. Other techniques, such as thin-layer chromatography (TLC) or nuclear magnetic resonance (NMR), have also been employed for the determination of cannabinoids [29,30]. However, these techniques are not well suited to quantification because of their limitations, i.e., elevated cost and maintenance in the case of NMR, and low sensitivity in the case of TLC.
Additional techniques have been utilized to determine further components of the plant. Malík et al. [31] describe the use of the Kjeldahl method for the analysis of the nitrogen content in Cannabis as part of a comparative study of nutrition. Other macro- and microelements, as well as trace elements, were determined via flame atomic absorption spectroscopy (FAAS) and inductively coupled plasma optical emission spectroscopy (ICP-OES) [32,33]. In all cases, sample treatment requires the use of hazardous chemicals, like nitric or sulfuric acid, at very high temperatures, which need to be handled with caution by trained personnel. It involves multiple steps, such as sample digestion and, in the case of Kjeldahl, distillation and titration, so a single analysis takes several hours to complete (Figure 1). Moreover, it is imperative to follow the appropriate procedures and comply with environmental regulations when disposing of the chemical waste produced during the analysis. Despite all these disadvantages, the Kjeldahl approach is a well-established chemical method that has been used for many years to determine nitrogen in samples from different fields, such as the food and beverage industry, agriculture and soil science, environmental analysis, and the pharmaceutical industry [34]. On the other hand, measuring moisture content is particularly useful, as it allows results to be normalized, accounting for variations in sample humidity that may occur during storage, and can also provide information regarding the progress of cultivation [8,35]. This parameter can be determined with several analytical techniques, such as nuclear magnetic resonance (NMR) [36], which has been used to measure the reference moisture content of ground and whole hemp materials. Additionally, the determination of sugars in Cannabis is a crucial aspect of plant analysis, since it can provide valuable information regarding physiology or quality. Furthermore, sugars play a pivotal role in the biosynthesis of terpenes and cannabinoids; thus, understanding the sugar content can offer valuable insights into the capacity of the plant to produce these compounds, which are responsible for the plant's aroma and psychoactive effects. Various analytical techniques can be employed to determine sugar content, with HPLC being one of the most widely used [37,38].
As previously mentioned, adverse compounds in the Cannabis plant, i.e., pesticides, heavy metals, or mycotoxins, also need to be determined as part of the quality control of Cannabis [11][12][13]. These compounds can be introduced into the Cannabis plant via various sources, e.g., soil, water, or air [39,40], and can be harmful to human health if consumed, leading to potential neurological problems, organ damage, or cancer, among others. For this reason, many countries have established regulations and limits, and thus quality assessment by a specialized laboratory is mandatory for licensed Cannabis producers [40]. Numerous analytical techniques are typically employed for the determination of heavy metals, e.g., ICP-MS, AAS, or X-ray fluorescence (XRF) [41]; pesticides, e.g., GC-MS or LC-MS; and mycotoxins, e.g., LC-MS [11].
One of the main disadvantages common to all the aforementioned techniques is the need for expensive equipment that requires expert personnel, as well as reagents and solvents. In this context, near-infrared spectroscopy (NIRS) has been gaining much attention in the last decade due to its multiple advantages (Figure 1). On the one hand, its broad spectral range facilitates the detection and quantification of numerous compounds or properties in complex mixtures without the need for extensive separation methods. Additionally, it is a low-cost and rapid technique that can provide results within a few minutes, which is especially useful for high-throughput analysis or applications where quick results are required [42]. Since it is a non-destructive technique, samples can be used for other applications after NIRS measurements [36,43], which is particularly advantageous when working with valuable or limited sample quantities [44]. On the other hand, its versatility allows broad application across various sample types, including solids, liquids, and gases, and it is particularly adept at the simultaneous analysis of multiple components within a sample [45]. Nowadays, with the development of portable NIRS instruments, measurements can be performed in situ and, consequently, sample treatment is either entirely or almost entirely avoided [42]. Therefore, NIRS is well aligned with the principles of green analytical chemistry, i.e., reducing the use of hazardous chemicals and reagents, employing energy-efficient equipment, and producing minimal waste [46]. However, the main drawback of NIRS technology is that the spectra are difficult to interpret, so mathematical and statistical methods, i.e., chemometrics, are required to extract the relevant chemical information [47,48].
Having outlined the potential of NIRS relative to traditional techniques, this article describes the chemometric techniques commonly employed in NIRS method development, along with their application to the analysis of Cannabis sativa L. samples. For this purpose, the article is divided into Sections 2 and 3, i.e., qualitative and quantitative methods, in which the different multivariate statistical techniques are explained in detail. Although several articles review the analytical techniques employed for the determination of cannabinoids [49], to the best of our knowledge this is the first time that all this information, along with the theoretical basis of NIRS method development, has been gathered in a review article.

Qualitative Methods
The term "qualitative methods" refers to the identification and classification of samples on the basis of spectral patterns in their NIR spectra. These methods are particularly useful when the aim is to differentiate between types or classes of samples without quantifying specific compounds. In this sense, Cannabis sativa L. is a cultivar with multiple characteristics that can be employed for its classification, e.g., chemotype, genotype, and growth stage, among others [50][51][52]. Additionally, NIRS methodology has been used to classify samples by crop year or location [53].
Furthermore, to develop a robust NIR methodology, a large number of samples is required in order to cover a wide range of variability. As a consequence, a large matrix of data and variables is usually obtained, which makes it difficult to extract the important information within the data [54]. Different chemometric techniques can facilitate data management (Table 1); they are described in this section, along with their application to the analysis of Cannabis sativa L. samples.
Prior to method development, preprocessing of the spectra is a very important step in order to avoid interferences related to factors that may affect the information of interest, e.g., sample particle size variations, outliers, scatter effects, or baseline shifts, among others [52,55]. For this purpose, different mathematical treatments can be used, e.g., standard normal variate (SNV), detrending, Savitzky-Golay polynomial derivative filters, multiplicative scatter correction (MSC), etc. [56]. These techniques reduce the effects mentioned above while preserving the shape and integrity of the original spectral features, which is particularly valuable in the case of complex spectra [57]. As can be seen in Figure 2, the quality of the spectra is considerably improved after preprocessing, with sharper and well-differentiated bands.
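The preprocessing treatments named above can be sketched in a few lines. The following is a minimal illustration on synthetic spectra, not the procedure of any cited study: the function names (snv, msc, sg_derivative), the window and polynomial settings, and the simulated gain and offset values are all assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    ref = spectra.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(spectra, dtype=float)
    for i, spectrum in enumerate(spectra):
        # Fit spectrum ~ intercept + slope * ref, then undo offset and gain.
        slope, intercept = np.polyfit(ref, spectrum, 1)
        corrected[i] = (spectrum - intercept) / slope
    return corrected


def sg_derivative(spectra, window=11, polyorder=2, deriv=1):
    """Savitzky-Golay smoothed derivative along the wavelength axis."""
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=deriv, axis=1)


# Synthetic data (assumption): one absorption band distorted by
# sample-specific multiplicative (gain) and additive (offset) scatter effects.
points = np.arange(200)
band = np.exp(-((points - 100) / 15.0) ** 2)
gains = np.array([0.8, 1.0, 1.3])
offsets = np.array([0.05, 0.0, 0.2])
raw = gains[:, None] * band[None, :] + offsets[:, None]

snv_spectra = snv(raw)           # rows now have zero mean and unit variance
msc_spectra = msc(raw)           # rows collapse onto the mean spectrum
d1_spectra = sg_derivative(raw)  # first derivative removes constant baselines
```

In this idealized case the three simulated spectra differ only by scatter, so SNV and MSC make them coincide; with real spectra the residual differences are the chemical information of interest.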
Once the data are preprocessed, the classification can be performed following two approaches, i.e., supervised and unsupervised. In supervised methods, there is previous knowledge about some characteristics of the samples, such as the categories into which the samples are classified. However, when developing unsupervised methods, no previous information about the samples is provided and, therefore, self-interpretation of the differences encountered among the spectra is needed [58].

Principal Component Analysis
The aim of principal component analysis (PCA) is to reduce the dimensionality of the spectral data and identify the most significant variations, thus facilitating the visualization and interpretation of sample groupings [58]. In order to achieve this goal, PCA finds a new coordinate system that represents the data with a reduced number of uncorrelated variables called principal components (PCs) [59]. In this sense, PCA is used to obtain a graphical representation with no previous knowledge about the samples.
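As a hedged illustration of how PCs are obtained, the sketch below computes PCA through a singular value decomposition of the mean-centred data matrix. The pca helper and the two synthetic sample groups are assumptions made for the example and do not reproduce any dataset from the cited works.

```python
import numpy as np


def pca(X, n_components):
    """PCA via SVD of the mean-centred data matrix (rows = samples)."""
    centred = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(centred, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]  # sample coordinates on the PCs
    loadings = Vt[:n_components]                     # PC directions in variable space
    explained = (s ** 2) / np.sum(s ** 2)            # fraction of variance per PC
    return scores, loadings, explained[:n_components]


# Synthetic data (assumption): two groups that differ in the intensity of one band.
rng = np.random.default_rng(0)
points = np.arange(100)
band = np.exp(-((points - 50) / 8.0) ** 2)
group_a = 1.0 * band + rng.normal(0.0, 0.01, size=(10, 100))
group_b = 2.0 * band + rng.normal(0.0, 0.01, size=(10, 100))
X = np.vstack([group_a, group_b])

scores, loadings, explained = pca(X, n_components=2)
pc1_a, pc1_b = scores[:10, 0], scores[10:, 0]
```

Plotting scores[:, 0] against scores[:, 1] gives the usual score plot; here the two groups separate along PC1 without any class labels being supplied, which is the exploratory use described above.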
Because PCA reduces the complexity of the spectral data, other advantages arise, such as simpler data processing, since redundant or highly correlated variables are eliminated. Additionally, PCA facilitates the detection of outliers, i.e., samples that deviate significantly from the majority of data points, by means of their abnormal coordinates in the transformed space. By combining all these features, PCA effectively identifies important characteristics within a dataset, thus being a highly valuable technique [60]. In the field of Cannabis, PCA has been particularly interesting, since it allows researchers not only to differentiate between chemotypes, but also to distinguish between illegal and legal Cannabis plantations depending on the country's regulations, which is of great help to the authorities because further analysis and sample treatment can be avoided [51,52,61]. To give an example, Duchateau et al. [52] employed different chemometric techniques to compare the performance of two NIR devices, i.e., a benchtop and a handheld device, for the classification of 189 Cannabis samples as legal or illegal according to European and Swiss legislation. In this article, PCA played a crucial role in visualizing the spectra in two- or three-dimensional plots. Its purpose was to perform an exploratory analysis, ensuring that the data could be effectively modelled afterwards using supervised techniques, and that good predictive results were not due to coincidence or to modelling of the noise.
Borille et al. [51] also employed NIRS technology combined with chemometric methods for the growth stage classification of 29 sample triplicates obtained from Cannabis cultivated in a greenhouse from seized seeds. In this article, PCA by intervals (iPCA) was applied to the raw spectra as a preprocessing step in order to eliminate irrelevant information and select the best spectral range for the classification. This technique is an extension of traditional PCA that splits the spectrum into intervals and evaluates each one separately rather than the full range at once, thus providing a more comprehensive and robust analysis of the data. Subsequent PCA was also used to visualize the clusters of samples according to similarity.
Relatedly, Tran et al. [42] developed prediction models using two different NIRS instruments, i.e., benchtop and handheld, for the quantification of cannabinoids in Cannabis samples. Before quantitative method development, they applied different preprocessing techniques and PCA to obtain an overview of the classification of the data. As can be seen in Figure 3, PCA revealed three well-defined classes depending on the cannabinoid content, i.e., high CBDA, high THCA, and even ratio, for the benchtop instrument, thus showing the potential of this technique for the preliminary classification of samples.

Hierarchical Clustering Analysis
Hierarchical clustering analysis (HCA) is an unsupervised multivariate analysis technique commonly used in NIRS technology. HCA consists of clustering the samples into groups by means of the similarities between the spectra, so that the within-group similarities are larger than the between-group similarities. Thus, by performing successive merges or divisions, a tree-like structure, i.e., a dendrogram, is obtained [62]. In order to obtain this dendrogram, two main approaches are employed: agglomerative and divisive. The agglomerative approach is a bottom-up strategy in which the samples are considered as individual clusters that are progressively grouped according to their similarity until a single cluster containing all the samples is formed. On the other hand, the divisive approach is a top-down method that separates a single cluster containing all the samples into smaller subclusters based on the differences between samples or groups.
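The agglomerative (bottom-up) strategy can be illustrated with SciPy's hierarchical clustering routines. This is only a sketch on synthetic spectra: the Ward linkage, the Euclidean distance, the three simulated groups, and the choice of cutting the tree into three clusters are assumptions for the example, not settings reported in the cited articles.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data (assumption): three groups of 8 spectra, each group with
# its absorption band at a different position, plus a little random noise.
rng = np.random.default_rng(1)
points = np.arange(100)
centres = [25, 50, 75]
spectra = np.vstack([
    np.exp(-((points - c) / 6.0) ** 2) + rng.normal(0.0, 0.01, size=(8, 100))
    for c in centres
])

# Agglomerative clustering: every sample starts as its own cluster and the
# two most similar clusters are merged at each step (Ward criterion,
# Euclidean distance). Z encodes the full merge tree, i.e., the dendrogram.
Z = linkage(spectra, method="ward")

# Cut the tree so that at most three clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
```

The merge tree Z can be drawn with scipy.cluster.hierarchy.dendrogram; cutting it at different heights yields coarser or finer partitions, which is why HCA is mainly an exploratory tool.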
In the previously mentioned article by Duchateau et al. [52], HCA was employed for a preliminary evaluation of the different clusters of samples. For the benchtop NIR, three major clusters were obtained, separating the samples according to their Δ9-THC content: higher than 1% (w/w), between 0.2 and 1% (w/w), and less than 0.2% (w/w). As for the handheld device, HCA showed two major clusters: one with samples with a Δ9-THC content higher than 1% (w/w) and a second cluster with samples containing less than 1% (w/w). Thus, HCA provides a global view of the similarities between samples and therefore supports the suitability of a subsequent supervised method.
Furthermore, in the article by Borille et al. [51] mentioned in the previous section, HCA revealed three main groups according to the three growth stages employed. However, the dendrogram did not entirely resolve the specific clustering by growth stage, which confirms that these chemometric techniques are used only for a preliminary exploratory analysis and not for a precise classification, for which more advanced chemometric tools are required.

Non-Hierarchical Clustering Analysis
The objective of non-hierarchical clustering methods is to obtain one final partition of the data. In this case, the number of clusters is typically fixed by the user, which is an important drawback of these techniques since there is no previous knowledge about the samples. Some examples of non-hierarchical clustering methods are k-means, which assesses the similarities between samples by means of distance measurements; Density-Based Spatial Clustering of Applications with Noise (DBSCAN), based on sample density; and the Self-Organising Map (SOM), which studies the relationship between variables and samples, among others [63]. However, no applications of these methods to Cannabis samples combined with NIRS have been found.
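Since no Cannabis application of these methods was found, the following is only a generic sketch of k-means on synthetic spectra, showing how the user-fixed number of clusters k enters the procedure. The kmeans function, its deterministic farthest-point initialisation (used here instead of the more common random initialisation so that the result is reproducible), and the simulated data are all assumptions for the example.

```python
import numpy as np


def kmeans(X, k, n_iter=100):
    """Basic k-means (Lloyd's algorithm) with a deterministic farthest-point start."""
    # Initialisation: first sample, then repeatedly the sample farthest
    # from the centroids chosen so far.
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid.
        dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Synthetic data (assumption): three groups of 8 spectra with bands at
# different positions.
rng = np.random.default_rng(3)
points = np.arange(100)
spectra = np.vstack([
    np.exp(-((points - c) / 6.0) ** 2) + rng.normal(0.0, 0.01, size=(8, 100))
    for c in (25, 50, 75)
])

labels, centroids = kmeans(spectra, k=3)  # k must be chosen by the user
```

Running the same code with a different k returns a different partition without any warning, which illustrates why fixing the number of clusters in advance is considered a drawback when nothing is known about the samples.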

Soft Independent Modelling of Class Analogy
Soft Independent Modelling of Class Analogy (SIMCA) is a supervised class-modelling technique commonly used in chemometrics for pattern recognition and classification [64,65]. In SIMCA, a PCA model is built for each class, and samples are projected into the PC space that describes a given class in order to evaluate whether they belong to it or not. The number of PCs retained in each class model is key: with too few PCs, important features may be excluded from the model, thus hindering the selectivity, whereas an excessive number of PCs can increase the noise and lead to overfitting of the model [66]. As its name implies, SIMCA focuses on evaluating the similarities among the samples within a class rather than on discrimination between classes [67].
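The class-modelling idea can be sketched as a one-class PCA model with a residual-distance acceptance rule. This is a deliberately simplified illustration: full SIMCA implementations usually derive the acceptance limit from statistical tests and may also use the distance within the PC space, whereas here an empirical 95th percentile of the training residuals is assumed. The function names, the number of PCs, and the synthetic classes are hypothetical.

```python
import numpy as np


def fit_class_model(X_class, n_pcs, quantile=0.95):
    """Build a PCA model of one class and an empirical residual limit (assumption)."""
    mean = X_class.mean(axis=0)
    centred = X_class - mean
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    model = {"mean": mean, "loadings": Vt[:n_pcs]}
    model["q_limit"] = np.quantile(q_residual(model, X_class), quantile)
    return model


def q_residual(model, X):
    """Squared distance of each sample to the PC subspace of the class."""
    centred = X - model["mean"]
    projected = centred @ model["loadings"].T @ model["loadings"]
    return np.sum((centred - projected) ** 2, axis=1)


def belongs_to_class(model, X):
    """Accept a sample when its residual does not exceed the class limit."""
    return q_residual(model, X) <= model["q_limit"]


# Synthetic data (assumption): the target class has a band at position 40
# with variable intensity; the foreign samples have their band at position 70.
rng = np.random.default_rng(4)
points = np.arange(100)
band_target = np.exp(-((points - 40) / 6.0) ** 2)
band_foreign = np.exp(-((points - 70) / 6.0) ** 2)
X_target = ((1.0 + 0.2 * rng.normal(size=(30, 1))) * band_target
            + rng.normal(0.0, 0.01, size=(30, 100)))
X_foreign = band_foreign + rng.normal(0.0, 0.01, size=(5, 100))

model = fit_class_model(X_target, n_pcs=2)
accepted_target = belongs_to_class(model, X_target)
accepted_foreign = belongs_to_class(model, X_foreign)
```

Because the limit is estimated from the training residuals themselves, it is optimistic; in practice an independent validation set, as used by the studies discussed in this section, is needed to estimate sensitivity and specificity.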
Pereira et al. [50] developed a method using NIRS hyperspectral imaging and machine learning methods for the detection and identification of illegal plantations of Cannabis sativa L. PCA was first employed to explore the characteristics of the images and identify the potential clusters related to Cannabis sativa L. Different areas of the images were selected in order to collect the pixels from important parts of the plant, e.g., margin, veins, etc. Prior to SIMCA method development, sparse PCA was performed to eliminate the variables that were not informative. Then, the training set of 401 pixels/spectra selected via sparse PCA was used to build the SIMCA model. By selecting another region of interest from the same image, a validation set was obtained and the developed SIMCA model was applied, obtaining good results in terms of sensitivity and specificity, thus supporting the suitability of the model for the identification of Cannabis sativa L. with a low false-positive rate.

Partial Least Squares Discriminant Analysis
Another supervised method is partial least squares discriminant analysis (PLS-DA), which combines dimensionality reduction and prediction model construction [68]. Although this technique has been gaining attention in recent decades, it is prone to overfitting when the number of variables significantly exceeds the number of samples. Therefore, cross-validation becomes an essential step, not only for feature selection and classification, but also for mere data visualization. In addition, it is a multi-step procedure that involves different mathematical operations and parameters [68]. However, if all the potential obstacles are carefully taken into account and every step of the procedure is thoroughly evaluated, PLS-DA can be a powerful technique for managing highly multidimensional data.
Several articles employ this technique either to obtain an overview of the data that will subsequently be fitted in a quantitative method or to develop very accurate qualitative methods for the classification of Cannabis samples [20,42,69,70]. For instance, Birenboim et al. [20] developed a method for the classification of Cannabis cultivars and the quantification of major cannabinoids and terpenes via Fourier transform near-infrared spectroscopy (FT-NIR) combined with chemometrics. Prior to the development of the quantitative method, multivariate classification and regression models were used, e.g., PLS-DA. This technique was particularly useful for the major class separation of the samples into four categories, namely high-THC, high-CBD, hybrid, and high-CBG, for which the FT-NIR spectra of the ground inflorescence samples were employed, as observed in Figure 4. The results showed no misclassified samples, with sensitivity and specificity values of 1, which indicates that the classification model is highly accurate. Furthermore, the ratio of the root mean square error of cross-validation (RMSECV) to that of calibration (RMSEC), along with the ratio of the root mean square error of prediction (RMSEP) to the RMSECV, were both below 1.5, thus indicating a low probability of model overfitting. Similarly, Tran et al. [42] developed different prediction models using two NIRS instruments, i.e., a benchtop FT-NIR and a handheld microNIR, for the classification and quantification of cannabinoids in Cannabis sativa L. samples. After performing PCA to check data quality and obtain an overview of the trends within the population, which showed three different clusters based on the chemotype, PLS-DA models were built to assess how accurately the chemotypes could be predicted. The results showed three well-defined categories, namely high-CBDA, high-THCA, and even-ratio, for the benchtop device, with a classification accuracy of 100% for both the high-CBDA and high-THCA models and 99.4% for the even-ratio model. The models obtained with the handheld device showed the same three categories as the benchtop instrument; however, only the high-THCA model had a classification accuracy of 100%, and the other models showed lower classification accuracy values than those of the benchtop instrument.
San Nicolas et al. [71] developed a non-invasive method for the classification of Cannabis cultivars based on hyperspectral imaging along with PCA and PLS-DA. In this case, two different approaches employing PLS-DA were proposed, i.e., direct calibration with the complete dataset, which comprises 502 flower spectra, and a two-layer hierarchical model, which classifies the stem separately from the rest of the plant. Both classification methods showed promising results, with 89.47% of the samples correctly classified by the direct calibration and 91.23% by the hierarchical approach.
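The figures of merit quoted above (sensitivity, specificity, and the RMSE ratios used as an overfitting check) can be computed as in the following minimal sketch. The function names, the class labels, and the RMSE values are hypothetical and do not come from any of the cited studies; the ratios are taken here as RMSECV/RMSEC and RMSEP/RMSECV, which is our reading of the text.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity, specificity) for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


def overfitting_ratios(rmsec, rmsecv, rmsep):
    """Return (RMSECV/RMSEC, RMSEP/RMSECV), assumed reading of the two ratios."""
    return rmsecv / rmsec, rmsep / rmsecv


# Hypothetical labels, not data from any cited study.
true_classes = ["high-THC", "high-THC", "high-CBD", "hybrid", "high-CBG", "high-CBD"]
predicted = ["high-THC", "high-THC", "high-CBD", "hybrid", "high-CBG", "hybrid"]

for chemotype in sorted(set(true_classes)):
    sens, spec = sensitivity_specificity(true_classes, predicted, chemotype)
    print(f"{chemotype}: sensitivity={sens:.2f}, specificity={spec:.2f}")

# Hypothetical error values; both ratios below 1.5 would suggest little overfitting.
cv_ratio, pred_ratio = overfitting_ratios(rmsec=0.40, rmsecv=0.50, rmsep=0.55)
print(f"RMSECV/RMSEC={cv_ratio:.2f}, RMSEP/RMSECV={pred_ratio:.2f}")
```

In this toy case one high-CBD sample is assigned to the hybrid class, so the sensitivity of the high-CBD class and the specificity of the hybrid class both fall below 1.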

Parametric and Non-Parametric Methods
Artificial neural networks (ANN) are non-linear, non-parametric supervised methods composed of several layers of neurons, namely input, hidden, and output layers [72]. A neuron is a processing unit that transforms input data into output; the hidden layers, whose weights are typically adjusted by back-propagation during training, represent the modelling process. Several parameters have to be optimized when developing ANN methods, e.g., the number of hidden layers, which can be an intricate and time-consuming process if the proper training algorithm is not selected [72,73]. Although it is challenging to ascertain the structure of an ANN, these methods are suitable for classification purposes due to their ability to calculate and approximate functions of any form [74,75].
In several articles, ANN are applied to create a model for the determination of different parameters from samples of hemp extracts [76] or for optimizing in vitro germination and growth of hemp seeds [77], among others. However, no specific applications of this technique to Cannabis samples in combination with NIRS technology have been found.
On the other hand, k-nearest neighbours (KNN) is a non-parametric supervised technique used for classification via evaluation of distances between samples [58]. Typically, the Euclidean distance between the samples from the validation set and the samples from the training set is calculated, and the different classes into which samples can be divided are established [78]. Then, each unknown sample is placed in the class to which the majority of its k nearest neighbours from the training set belong.
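The KNN rule described above (Euclidean distances followed by a majority vote) can be sketched in a few lines. The two-dimensional "scores", the class labels, and the function name `knn_predict` are hypothetical and serve only to illustrate the voting step; in practice the inputs would be preprocessed spectra or their principal component scores.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, sample, k=3):
    """Assign `sample` to the majority class among its k nearest training samples."""
    # Euclidean distance from the unknown sample to every training sample.
    distances = sorted(
        (math.dist(x, sample), label) for x, label in zip(train_X, train_y)
    )
    # Majority vote among the k closest neighbours.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-variable training set with two chemotypes.
train_X = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 1.1]]
train_y = ["high-THCA"] * 3 + ["high-CBDA"] * 3

print(knn_predict(train_X, train_y, [0.95, 0.15], k=3))
print(knn_predict(train_X, train_y, [0.10, 0.95], k=3))
```

The value of k is a tuning parameter; an odd k is a common choice for two-class problems because it avoids tied votes.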
Linear discriminant analysis (LDA) is a linear, parametric, and, as its name indicates, discriminant supervised technique originally described by R. Fisher [79]. This method aims to reduce the dimensionality of the data set, similarly to PCA; however, LDA finds the hyperplane that maximizes the distance between the different classes [80]. Furthermore, when employing LDA, the location of the data set is not changed, as opposed to PCA. Su et al. [36] developed a NIRS method for rapid measurement of moisture and cannabinoid contents in samples of Cannabis sativa L., along with a qualitative method for classifying the samples as legal or illegal via LDA. The application of this technique yielded up to 94% correct classifications, thus showing the potential of LDA for discriminant analysis. A more flexible technique is quadratic discriminant analysis (QDA), since it can learn quadratic boundaries, as opposed to LDA, which is based on linear boundaries [81]. Borregaard et al. [82] carried out a study on the discrimination of crops and weeds based on high-dimensional spectral data. They stated that QDA and LDA classification present similarities, with a lower performance than PCA/SIMCA and PLS methods in illegal or legal Cannabis plantations.
Table 1 summarizes the applications mentioned throughout the present article, which comprise different chemometric techniques employed for the classification of cannabinoids and other parameters in Cannabis samples, alongside the mathematical techniques employed for spectra pretreatment.

Quantitative Methods
Quantitative methods in near-infrared spectroscopy (NIRS) refer to analytical approaches used to determine the concentration or amount of specific compounds or properties in a sample. They are valuable for quality control, potency assessment, and determining the chemical composition of Cannabis samples in research, forensic, and regulatory settings [22,43]. These methods involve establishing a relationship between the spectral data obtained from NIRS measurements and the concentration of the target analyte or property of interest [83]. Despite all the advantages of NIRS, which are summarized in Figure 1, it is important to note that calibration models for Cannabis analysis should be developed and validated using appropriate reference values and considering the specific Cannabis matrices and regulatory requirements of the intended application [36]. Therefore, high-quality reference data are essential to ensure the accuracy, precision, and reliability of the calibration model and subsequent predictions. To achieve this goal, a set of representative samples with known reference values is first measured with the NIRS equipment to obtain their corresponding spectra. These samples should cover the range of concentrations expected in the samples to be analyzed [20]. The reference values are typically obtained using a reference method, such as traditional wet chemistry techniques. Although this step may appear inconsequential, it is of great importance, as the accuracy of the model predictions relies heavily on these data.
Once all the data are obtained, as was the case with qualitative methods, the NIRS spectra collected from the calibration sample set may be preprocessed to avoid interferences related to factors that may affect the information of interest, e.g., sample particle size variations, outliers, scatter effects, or baseline shifts, among others. This may involve steps like baseline correction, smoothing, normalization, or noise reduction techniques [56]. The preprocessed spectral data are then correlated with the corresponding reference values to establish a calibration model. Various multivariate statistical techniques, such as partial least squares (PLS) regression or principal component regression (PCR), are commonly used. These methods aim to find the best mathematical relationship between the spectral data and the reference values. Finally, the developed calibration model needs to be validated using an independent set of samples not used in the model development. These validation samples are analyzed using NIRS, and the predicted values are compared to their reference values to assess the accuracy and reliability of the model. Statistical parameters like the root mean square error of prediction (RMSEP) or correlation coefficients are typically used to evaluate the performance of the model. According to the ICH Guideline Q2(R2) on the validation of analytical procedures, if RMSEP is comparable to the root mean square error of calibration (RMSEC), then the accuracy of the method is confirmed [84]. Another parameter that is usually employed is the standard error of prediction (SEP), which is similar to RMSEP but, unlike it, is independent of bias [85]. Calibration in NIRS is an ongoing process that should be continuously improved over time by introducing new samples that can enrich the performance of the predictive equation and increase its robustness.
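The relationship between RMSEP, bias, and SEP mentioned above can be made concrete with a short calculation. The sketch below assumes the usual definitions (bias as the mean residual, RMSEP as the root mean square of the residuals, and SEP as the standard deviation of the residuals around the bias with n - 1 degrees of freedom); the reference and predicted values are hypothetical.

```python
import math


def prediction_errors(reference, predicted):
    """Return bias, RMSEP and SEP for an independent validation set."""
    residuals = [p - r for p, r in zip(predicted, reference)]
    n = len(residuals)
    bias = sum(residuals) / n
    # RMSEP includes any systematic offset between predicted and reference values.
    rmsep = math.sqrt(sum(e * e for e in residuals) / n)
    # SEP is computed around the bias, so a constant offset does not affect it.
    sep = math.sqrt(sum((e - bias) ** 2 for e in residuals) / (n - 1))
    return {"bias": bias, "rmsep": rmsep, "sep": sep}


# Hypothetical cannabinoid contents (% w/w): reference method vs. NIRS prediction.
reference = [10.0, 12.0, 14.0, 16.0]
predicted = [10.6, 12.4, 14.7, 16.3]

stats = prediction_errors(reference, predicted)
print({name: round(value, 4) for name, value in stats.items()})
```

With these definitions, RMSEP² = bias² + ((n - 1)/n)·SEP², so a model with a large constant offset can show a small SEP while its RMSEP remains high.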
Quantitative NIRS methods find applications in various fields such as pharmaceutical analysis, food quality control, agricultural monitoring, and environmental analysis. In the field of Cannabis, this technique has garnered increasing attention in recent years, gradually establishing itself as an integral part of the quality control process for this medicinal product [20,71]. As mentioned above, multiple techniques are commonly employed to derive predictive equations during calibration model development; these are described in the following sections.

Partial Least Squares Regression
Partial least squares regression (PLS-R) is a predictive technique that originated from Herman Wold's concept of PLS and has gained considerable importance in various fields over time. This method is particularly useful when dealing with datasets exhibiting a high degree of multicollinearity [86]. To address this problem, PLS-R performs iterative least squares fitting on latent variables, which are found as linear combinations of the initial variables. By doing so, PLS-R identifies the spectral data variables that best describe the reference values [58]. In contrast to other multivariate statistical techniques, such as PCR, PLS-R provides more accurate predictions by placing stronger emphasis on the relationship between the independent and dependent variables during dimensionality reduction.
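To illustrate how a latent variable is built as a linear combination of the original variables, the following minimal sketch extracts a single PLS component for one response (often called PLS1). It is a didactic simplification, not the procedure of any cited study: real calibrations use several components, preprocessed spectra, and cross-validation. The data and function names are hypothetical; the two predictor columns are made perfectly collinear on purpose, a situation in which ordinary least squares has no unique solution.

```python
import math


def center_columns(X):
    """Mean-center each column of X; return the centered matrix and the means."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X], means


def pls1_one_component(X, y):
    """Fit a one-latent-variable PLS1 model."""
    Xc, x_means = center_columns(X)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    # Weight vector: direction in X space with maximal covariance with y.
    w = [sum(row[j] * yi for row, yi in zip(Xc, yc)) for j in range(len(Xc[0]))]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: the latent variable, a linear combination of the original variables.
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]
    # Least squares regression of y on the latent variable.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    return {"x_means": x_means, "y_mean": y_mean, "w": w, "q": q}


def pls1_predict(model, sample):
    """Predict the response for one new sample."""
    t = sum((v - m) * wj for v, m, wj in zip(sample, model["x_means"], model["w"]))
    return model["y_mean"] + model["q"] * t


# Hypothetical data: the second variable is exactly twice the first (collinear).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [3.0, 6.0, 9.0, 12.0]

model = pls1_one_component(X, y)
print(round(pls1_predict(model, [5.0, 10.0]), 6))
```

Because the weight vector is computed from the covariance between the predictors and the response, the latent variable is steered towards the information that is relevant for prediction.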
Since PLS-R is one of the most widely employed multivariate techniques, numerous articles describe the use of this approach for developing calibration models for various matrices in NIRS. In the particular case of Cannabis, PLS-R is usually employed for the construction of regression models for cannabinoids [22,36,42-44,55,69,70,87] and/or terpenes [20], among other parameters [36,88]. Birenboim et al. [20] describe the use of Fourier transform near-infrared spectroscopy (FT-NIR) to determine cannabinoid and terpene content in Cannabis inflorescence samples. Spectral regions of 1450-1880 and 2130-2350 nm were fundamental for predicting all cannabinoids and terpenes by means of PLS-R. As can be observed in Figure 5A, when representing the complete dataset, three well-defined clusters are formed, i.e., THCA high (Figure 5B), THCA mid (Figure 5C), and THCA low (Figure 5D). This is also observed for the rest of the cannabinoids. The authors indicate the high predictive capabilities of the developed models; however, according to Williams [89], the RPD values are too low for that classification (Table 2), and the models should instead be used for screening purposes.
Sánchez-Carnerero et al. [48] carried out a comparative study using two instruments, a dispersive NIR and an FT-NIR spectrometer. In this case, they employed PLS-R for the determination of various cannabinoids, affording RPD values better than 3 for ∆9-THC, CBC, and CBD, being close to 6 for the latter. They also built prediction equations for CBDV, ∆9-THCV, ∆8-THC, CBG, and CBN, obtaining RPD values close to 2 in these cases (Table 2), which are therefore useful for screening purposes. The large number of samples for calibration, as well as the dispersion of the data (SD) in ∆9-THC, CBC, and CBD, play an important role in obtaining good predictive results. Both instruments provided similar results for all cannabinoids, although the chemometric treatment of the data differed.
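Since RPD values are used throughout this section to judge the models, a brief sketch of the calculation may be helpful. RPD is taken here as the standard deviation of the reference values divided by the RMSEP (some authors divide by SEP instead). The thresholds in `usability` are only a rough illustration loosely based on the values discussed in the text, not an official classification, and the data are hypothetical.

```python
import math
import statistics


def rpd(reference, predicted):
    """Ratio of performance to deviation: SD of reference values over RMSEP."""
    n = len(reference)
    rmsep = math.sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)
    return statistics.stdev(reference) / rmsep


def usability(rpd_value, screening_threshold=2.0, good_threshold=3.0):
    """Rough, assumed interpretation bands; adjust to the guideline being followed."""
    if rpd_value >= good_threshold:
        return "beyond screening"
    if rpd_value >= screening_threshold:
        return "screening"
    return "below screening"


# Hypothetical contents (% w/w) spanning a wide concentration range.
reference = [2.0, 6.0, 10.0, 14.0, 18.0]
predicted = [3.0, 5.0, 11.0, 13.0, 19.0]

value = rpd(reference, predicted)
print(round(value, 2), usability(value))
```

The example also shows why the dispersion of the reference data matters: for the same prediction error, a calibration set covering a wider concentration range yields a higher RPD.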

Principal Component Regression
Principal component regression (PCR) is a multivariate analysis technique based on the combination of PCA and least-squares regression. PCR mainly focuses on dimensionality reduction, unlike PLS-R, in which the prediction of the dependent variables is emphasized. This is one reason why PCR components are always orthogonal and unaffected by the dependent variable, as opposed to PLS-R, in which components are not necessarily orthogonal and are chosen to retrieve as much predictive information as possible [90]. Although PCR interpretability can be challenging because the components are linear combinations of the original predictors, it is a very useful technique in those cases in which dimensionality reduction is the priority. PCR models have been extensively employed in different matrices; nevertheless, only one NIRS application in Cannabis has been found. Townsend et al. [91] developed an application note explaining the determination of ∆9-THC and CBD content in Cannabis flowers via FT-NIR. For this purpose, PCR was employed as the chemometric model, affording standard error of prediction (SEP) values of 0.73 and 0.92% (Table 2) for total CBD and total THC, respectively, which demonstrates the potential of this PCR-based FT-NIR model.
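The key point above, that PCR chooses its components from the predictors alone, can be seen in a minimal one-component sketch. The principal axis is obtained here by power iteration on the centered data, which is a simple stand-in for a full PCA and not the algorithm of any cited work; the response is only used afterwards in an ordinary least squares step. All names and data are hypothetical.

```python
import math


def center_columns(X):
    """Mean-center each column of X; return the centered matrix and the means."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X], means


def first_principal_axis(Xc, iterations=100):
    """Leading eigenvector of Xc^T Xc by power iteration (the response is not used)."""
    p = len(Xc[0])
    axis = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, axis)) for row in Xc]
        updated = [sum(row[j] * s for row, s in zip(Xc, scores)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in updated))
        axis = [v / norm for v in updated]
    return axis


def pcr_one_component(X, y):
    """PCA on X, then least squares regression of y on the first component scores."""
    Xc, x_means = center_columns(X)
    y_mean = sum(y) / len(y)
    axis = first_principal_axis(Xc)
    scores = [sum(a * b for a, b in zip(row, axis)) for row in Xc]
    slope = sum(s * (yi - y_mean) for s, yi in zip(scores, y)) / sum(s * s for s in scores)
    return {"x_means": x_means, "y_mean": y_mean, "axis": axis, "slope": slope}


def pcr_predict(model, sample):
    """Predict the response for one new sample."""
    score = sum((v - m) * a for v, m, a in zip(sample, model["x_means"], model["axis"]))
    return model["y_mean"] + model["slope"] * score


# Hypothetical collinear predictors (second column is twice the first).
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [3.0, 6.0, 9.0, 12.0]

model = pcr_one_component(X, y)
print(round(pcr_predict(model, [5.0, 10.0]), 6))
```

In this toy case the direction of largest variance in the predictors also happens to be the predictive one; when it is not, PCR may need more components than PLS-R to reach the same prediction error.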
Other linear multivariate regression techniques, such as multiple linear regression (MLR), are also helpful in those cases in which the number of independent variables is small and the main objective is the interpretation of the results rather than dimensionality reduction or prediction [92]. However, although MLR has been widely used in various matrices, no NIRS approaches employing MLR in Cannabis have been found.

Artificial Neural Networks
Although all the aforementioned techniques have been widely used throughout the years, there are some situations in which these methods are not suitable for data treatment, e.g., when the mathematical model describing the data set is unknown. In this context, artificial neural networks (ANN) emerge as a powerful modelling technique, since they can retrieve useful information from the whole data set and are capable of modelling complex non-linear relationships [93]. For this reason, they have great potential when dealing with multicollinearity and a large number of independent variables. However, due to the complicated relationships provided by ANN, the interpretation of the model can be a hurdle. Furthermore, this technique usually requires a larger amount of data and more computational resources.
ANN models have been used in a wide variety of fields and matrices, from biological to commercial or food and drink approaches. As for Cannabis, there are few applications in which ANN models are employed. For example, Gloerfelt-Tarp et al. [47] developed a NIR-based chemometric application for the quantification of 12 cannabinoids in plant material, with emphasis on the discrimination between neutral and carboxylic forms of each cannabinoid. For this purpose, different machine learning algorithms were employed, including deep neural network and random forest, affording values of root mean square error of validation (RMSEv) in the range of 0.001-0.560% (Table 2). Valinger et al. [76] also employed ANN modelling to predict different physical and chemical properties in hemp extracts, namely total dissolved solids, extraction yield, total polyphenolic content, and antioxidant activity. As observed in Table 2, the values of RMSEv were in the range of 0.0140-305.5601% for solid-liquid extraction (SLE) and 0.0320-21.8810% for microwave-assisted extraction (MAE).

Support Vector Machine
Support vector machine (SVM) is a multivariate statistical technique originally proposed to address classification issues, but it can be applied to various situations and fields, e.g., bioinformatics or handwriting recognition, among others [94]. The main objective of SVM is to find a hyperplane that best separates the different classes of a data set based on particular patterns in those classes or observations. Unlike ANN, SVM models are transparent; i.e., it is easier to interpret which data points contribute to the model. However, SVM applications are mainly designed for binary classification, while ANN can handle multiclass classification without needing to be combined with other techniques. Nonetheless, SVM models are preferred when overfitting is a concern, since they tend to be less prone to this problem [95]. Chen et al. [57] proposed a NIR methodology for the quantification of CBD in hemp oil via comparison of PLS-R and self-optimizing support vector elastic net (SOSVEN) models. This technique is an advanced variant of SVM that combines the typical elements of SVM with characteristics from elastic nets (EN), which are usually employed in linear regression. SOSVEN is particularly useful for feature and model selection, along with hyperparameter optimization. In this case, SOSVEN had lower validation errors than PLS-R when predicting the concentration of CBD and total CBD (Table 2), thus demonstrating the potential of this multivariate statistical technique.

Near-Infrared Hyperspectral Imaging
Although NIRS can predict physical and biochemical characteristics in diverse sample types with high efficiency, simplicity, and accuracy, its measurement scope is restricted to a relatively small section of the specimen for determining average composition values. This shortcoming is especially problematic when dealing with heterogeneous samples, such as Cannabis samples, as NIRS fails to provide important information on the spatial distribution of quality parameters [96]. However, the integration of NIRS with hyperspectral imaging systems facilitates the simultaneous acquisition of spatial and spectral information [73].
Near-infrared hyperspectral imaging (NIR-HSI) is a highly advanced methodology that can capture up to several hundred images at different wavelengths, offering a detailed spectral response of target features [97]. This technique is particularly adept at discerning even the most subtle variations in ground covers, as well as tracking changes over time. Previous research has also demonstrated that HSI surpasses multispectral images in terms of effectively monitoring vegetation properties, such as the leaf area index (LAI), differentiating between crop types, retrieving crop biomass, and assessing leaf nitrogen content [98]. HSI has emerged as a highly promising method for the non-invasive assessment of diverse constituents of the Cannabis plant. This technique involves the measurement of pixel reflectance and subsequent correlation with cannabinoid content, enabling effective and reliable evaluation of the properties of the plant. Holmes et al. [99] utilized NIR-HSI to estimate the content of cannabidiolic acid (CBDA) in flowers and leaves of Cannabis sativa L. The Gaussian process regression (GPR) model was chosen as it was able to effectively predict the values obtained through LC-MS. The proposed model displayed significant potential as a screening technique for implementation in the cultivation of this crop. Additionally, Lu et al. [44] describe an HSI technology for the non-destructive quantification of major cannabinoids, including CBD, ∆9-THC (tetrahydrocannabinol), CBG (cannabigerol), and their acid forms in fresh floral and leaf materials of industrial hemp on a dry weight basis. Parsimonious PLS models were utilized, obtaining the best RPD values of 2.6 for CBD and ∆9-THC in flowers. This value indicates that the prediction of these cannabinoids is fair, being appropriate for screening purposes [100]. The lack of accuracy in the prediction method could be related to the measurement wavelength range used in this article, which was in the short-wave near-infrared (SW-NIR) region, from 400 to 1000 nm. However, according to Sánchez-Carnerero et al. [48], cannabinoids mainly absorb in the 1064-2357 nm range. Therefore, extending the measurement region to a wider range could be a solution, as additional information may have been lost in the shorter region. On the other hand, Abeysekera et al. [101] developed a method based on HSI to determine THCA in Cannabis plant samples. In this case, the camera employed worked from 645 nm to 2070 nm, providing additional spectral information to be related to cannabinoid concentration. Partial least squares feature selection (PLSFS) was selected as the regression model, as it provided the best accuracy in estimating the THCA content. Despite the multiple benefits of NIR-HSI, the technique presents several disadvantages to take into consideration, such as limited accessibility for researchers or industries with budget constraints, due to the high cost of acquisition and maintenance of HSI systems. Additionally, HSI generates large amounts of data due to the high number of spectral bands. Analyzing and processing such voluminous data can be computationally intensive and time-consuming [102]. Moreover, spectral interference and environmental sensitivity make it difficult to obtain predictions good enough for quality control or process control evaluation [97].
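Two of the points above, the use of pixel reflectance from a region of interest and the data volume generated by HSI, can be illustrated with a small sketch. The hypercube is represented as nested lists indexed as [row][column][band]; the cube values, the mask, the image dimensions, and the 2 bytes per value are all hypothetical assumptions chosen for illustration.

```python
def mean_roi_spectrum(cube, mask):
    """Average the reflectance spectra of the pixels selected by a boolean mask."""
    n_bands = len(cube[0][0])
    totals = [0.0] * n_bands
    count = 0
    for cube_row, mask_row in zip(cube, mask):
        for pixel, selected in zip(cube_row, mask_row):
            if selected:
                count += 1
                for band, value in enumerate(pixel):
                    totals[band] += value
    if count == 0:
        raise ValueError("the mask selects no pixels")
    return [total / count for total in totals]


def cube_size_bytes(rows, cols, bands, bytes_per_value=2):
    """Raw size of one hypercube, assuming a fixed number of bytes per value."""
    return rows * cols * bands * bytes_per_value


# Hypothetical 2 x 2 pixel image with 3 spectral bands.
cube = [
    [[0.2, 0.4, 0.6], [0.4, 0.6, 0.8]],
    [[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]],
]
# Mask selecting the two top pixels, e.g. those belonging to a flower.
mask = [[True, True], [False, False]]

print([round(v, 3) for v in mean_roi_spectrum(cube, mask)])
print(cube_size_bytes(rows=1000, cols=1000, bands=200))
```

The averaged spectrum of each region can then be paired with a reference value, e.g. from LC-MS, to build a regression model, while the size estimate (400 MB for a single hypothetical 1000 x 1000 pixel image with 200 bands) gives an idea of why data handling is a practical concern.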
Table 2 summarizes the applications mentioned throughout the present article, which comprise various quantitative multivariate statistical techniques for the quantification of cannabinoids, terpenes, and/or other parameters in Cannabis samples via NIRS.

Conclusions
Traditionally, Cannabis has been analysed using different analytical techniques that, despite their inherent benefits, come with significant disadvantages, hindering their application for routine analysis. In the era of heightened environmental consciousness, the demand for more eco-friendly methodologies has intensified. Near-infrared spectroscopy (NIRS) satisfies this necessity as moderately priced equipment that operates efficiently within seconds without requiring toxic reagents. However, its primary drawbacks include the need for analytical reference techniques to provide reference values and for expertise in chemometrics to correlate spectral and numerical data, both of which are necessary for developing predictive equations. This article offers a comprehensive review of the most employed chemometric techniques, classified into qualitative and quantitative methods, and their application to obtain NIRS equations for predicting cannabinoids, terpenes, and other parameters in Cannabis samples.
Future applications of NIRS in Cannabis analysis should be directed towards developing new predictive equations, expanding beyond cannabinoids, terpenes, and moisture content. Further interesting compounds of Cannabis, such as flavonoids, phenols, or alkaloids, show considerable potential in influencing the organoleptic properties of the future medicine. Furthermore, compounds like heavy metals or pesticides, which are related to possible contamination of the plant, could also benefit from these applications. The portability of this equipment opens doors to new possibilities, enabling direct plant monitoring without the need for harvesting.

Figure 1. Comparison of the traditional techniques employed for the analysis of Cannabis samples versus NIRS. Advantages are represented in a greenish colour and disadvantages in an orangish colour; grey circles are unfilled. A scale from 1 to 5 has been employed to evaluate each parameter in terms of the degree of benefit/drawback. For instance, when comparing the 'Instrument/Maintenance expenses', a 5/5 disadvantage score is chosen for traditional techniques due to their characteristic high cost, while a 3/5 advantage score is selected for NIRS since it tends to have an intermediate cost.

Figure 3. Principal components analysis of the Cannabis samples using the benchtop NIR instrument. Adapted from reference [42].

Figure 4. Partial least squares discriminant analysis classification of the FT-NIR spectra of Cannabis inflorescence samples. Reproduced from reference [20] with permission from Elsevier, copyright 2022.

Figure 5. Correlation between the predicted values of THCA (y-axis) via PLS-R and the reference values (x-axis) provided through HPLC-DAD at global (A), high (B), mid (C) and low (D) concentration. Colours and shapes correspond to the different strains employed for calibration. Adapted from reference [20] with permission from Elsevier, copyright 2022.

Table 1. Highlighted applications of chemometric methods in the analysis of Cannabis samples.

Table 2. Highlighted applications of multivariate statistical techniques in the analysis of Cannabis samples.