Fast Search Method Based on Vector Quantization for Raman Spectroscopy Identiﬁcation

: In spectroscopy, matching a measured spectrum to a reference spectrum in a large database is often computationally intensive. To solve this problem, we propose a novel fast search algorithm that ﬁnds the most similar spectrum in the database. The proposed method is based on principal component transformation and provides results equivalent to the traditional full search method. To reduce the search range, hierarchical clustering is employed, which divides the spectral data into multiple clusters according to the similarity of the spectrum, allowing the search to start at the cluster closest to the input spectrum. Furthermore, a pilot search was applied in advance to further accelerate the search. Experimental results show that the proposed method requires only a small fraction of the computational complexity required by the full search, and it outperforms the previous methods.


Introduction
Spectroscopy techniques, such as infrared and Raman spectroscopy, are increasingly being used to measure and analyze the physical and chemical properties of materials. There are two types of analysis methods related to this technique. The first is to identify the constituents of a given spectrum, and the second is to identify the spectrum itself by comparing it directly to other known spectra in the database [1,2]. The second type of analysis is addressed in this study.
Spectral identification methods can be divided into two categories: classification methods based on machine learning (ML) and algorithms based on the similarity evaluation [3]. The first methods show good classification performance through an optimal learning model by training a given database with a ML-based algorithm. Conventionally, k-nearest neighbor (KNN) [4], random forest (RF) [5] and artificial neural network (ANN) [6] methods have been proposed, and various 1D-convolutional neural network(CNN) models based on deep learning have recently been proposed [7,8]. Good identification performance is expected from these methods if a sufficient number of samples in each spectrum is obtained.
However, most existing Raman libraries provide one sample for each type, such that ML methods require significant time to build up sufficient samples of the target material. The other methods are more suitable for utilizing existing Raman libraries. Representative methods include correlation search [9] and cosine similarity, the Hit-quality index (HQI) [10], and the Euclidean distance (ED) search [11,12]. These methods are intuitive and have often been used for identifying different types of Raman spectra. In recent years, methods have been proposed that improve identification performance in various applications along with the moving window technique [13,14].
Projection methods have likewise been proposed to transform the input data using singular value decomposition (SVD), discrete wavelet transform (DWT), and Karhunen-Loeve Transform (KLT) also known as principal component transformation (PCT). One of the most important properties of these transformations is that most of the data energy can be stored using very few coefficients [33,34]. Based on these advantages, improved search methods combined with existing fast algorithms have been proposed, providing faster search capabilities.
Apart from data transformation, our method adopted a cluster structure to reduce the search area. Spectra from the database to be compared were pre-clustered into hierarchical cluster groups based on similarity. In the proposed method, the spectra were sequentially tested in the order of cluster distance, starting with the reference spectrum belonging to the nearest clusters, which helped to quickly find the nearest spectrum. The search process was further accelerated using a pilot search with fewer dimensions of the transformed data to find the closest candidates as quickly as possible.
The remainder of this paper is organized as follows. In Section 2, the main methods in the existing VQ field were reviewed and compared. In Section 3, a novel fast search algorithm is described.
To confirm the applicability of our method, the simulation results of the proposed method are compared with those of existing methods in Section 4. Finally, a brief conclusion is presented.

Full Search and PDS
The full search method finds a vector y min with the smallest ED between an input vector x = (x 1 , x 2 , . . . , x N ) and vectors in a database, C = (y i |i = 1, 2, . . . , M). This method is expressed by Equation (1), where y i = (y i1 , y i2 , . . . , y iN ) is a reference spectrum of one material in the database and d 2 (x, y i ) is the ED between an input spectrum x and a reference spectrum y i , which is expressed by Equation (2).
This method sequentially searches the entire database to determine the spectrum closest to the input data. Therefore, assuming that the dimension of an input spectrum is N and the number of spectra in the database is M, N M multiplications, M(2N − 1) additions, and M − 1 comparisons are required.
PDS is a very simple way to reduce the amount of computation by allowing only a few dimensions of the input data to be used to avoid unnecessary distance calculations. Assume the current minimum distance is d 2 min in the search process. For the next y i , if the cumulative sum from y i1 to y iq is larger than d 2 min as in Equation (3) where 1 ≤ q ≤ N, the distance calculation is terminated. Therefore, this method reduces (N − q) multiplications and 2(N − q) additions.

ENNS and MPS
ENNS uses hyperplanes orthogonal to the central line to partition search space. Each point on the fixed hyperplane that intersects the central line has the same mean as P m = (m, m, . . . , m). The hyperplane is called an equal average hyperplane. ENNS calculates the mean m x for input data x = (x 1 , x 2 , . . . , x N ) first. Then a reference spectrum y is found, having the minimum mean difference to x. d(x, y) is computed and set to the current minimum distance d min . It is obvious that any reference spectra close to x is inside the hyperplane centered around x with the radius d min . By projecting the input data on the central axis, two boundary projection points, P max = (m max , m max , . . . , m max ) and P min = (m min , m min ,. . . , m min ) can be found, where The search space is now bounded by the equal-average hyperplanes intersecting the above two points. Hence, it is sufficient to search only those spectra having mean values raging from m min to m max . During the search process, a more similar reference spectrum that is found leads to a larger decrease in d min , and the search area decreases further.
MPS based on the local segment mean and its structure were proposed to further accelerate the search process. It showed better performance than ENNS and its variants. It is based on a two dimensional mean pyramid structure and is mainly applied to image coding, so it is not suitable for one dimensional spectra in its original form. However, this technique can be applied to one dimensional cases with minor modifications. We modified the MPS as follows. Let m x,m/n be the sub-mean of x as Equation (5).
For simplicity, the modified one dimensional MPS (MPS1D) for three layers can be represented as follows.
The method first verify with the given y if the lowest right term in the above is larger than d 2 min . If y passes the test, then the second right term is checked along with the first. If y passes all tests, y could be closer than the current closest spectrum; hence, d(x, y) is calculated. Otherwise, it is discarded. As expected, these tests reduce the search area more effectively than ENNS. However, this method also shares the same problem with ENNS because it relies on the mean of the given data at the lower level. To overcome this limitation, methods using the coordinate transformation were proposed.

PCT
The principal component analysis is a type of multivariate analysis that reduces the dimension of data while maintaining the information of the original dataset [35]. Assume thet an N × M matrix Y consists of M number of N dimensional spectra. The N × N correlation matrix R can be decomposed as in Equation (7), where λ 1 ≥ λ 2 ≥ · · · ≥ λ N . Transformed coefficients known as principal components (PCs) can be calculated using Equation (8).
PCT is known to be effective at compressing information of data into the first few PCs. The degree of compression effect can be determined by calculating the percentage of the cumulative sum of eigenvalues, as shown in Equation (9). The calculated results using experimental Raman spectra are shown in Figure 1. The cumulative ratio saturates on a small number of PCs and reaches nearly one on approximately 250 PCs. This indicates that 250 PCs yield approximately the same results as 3300 dimensional data from the original domain. The 250 × 3300 calculations are significantly smaller than 3300 × 3300 calculations for the transformation; however, calculating 250 PCs for a given input spectrum is still considered a high computational overhead. This computational overhead can be further reduced by using an existing PDS. In this same context, MPS + PDS can also be considered. The performance of these methods is reported in Section 4 along with the performance of the proposed method.

Hierarchical Clustering
Hierarchical clustering (HC) is an algorithm that links similar data into groups called clusters [36]. To perform HC analysis, the similarity between data is measured first. The similarity measurement generally uses a distance value, and for a data set consisting of M spectra, M * (M − 1)/2 distance values are obtained. Using the calculated distance values, data close to each other are linked to the same cluster.
HC is divided into two types, divisive and agglomerative. Of the two methods, the divisive method was chosen because it requires less computation and allows clusters to be divided according to certain criteria, such as the maximum cluster distance or the sum of distances within the cluster. This method, unlike K-means clustering, can perform training without predetermining the number of clusters [37]. Thus, after the structure is complete, the desired number of clusters can be decided. Figure 2 is an example of HC using 100 Raman spectra.
In general, the computational complexity of HC is larger than that of the K-means clustering. However, because the clusters are precomputed and determined, the real-time search speed is not affected. After finding the center of each cluster, the data closest to the center of each cluster is set as the practical center of the cluster to minimize unnecessary computation. Then, the distance between the center and all data in that cluster is stored, and the farthest distance from the center is set as the cluster size.

Cluster Search (CS)
First, we calculate the distances between the centers of all clusters and the input spectrumx and sort all clusters in ascending order. Then, we find the center of the cluster closest to the input spectrum x. The distance between the center of that cluster and x as d(x, c) is denoted and set to the current d min . The closest cluster is most likely to contain the closest spectrum; hence, the search starts from that cluster. When examining a specific cluster, if the difference between d(x, c) and d(y, c) is lager than the current d min , then all members of that cluster are excluded from the closest candidate by the following inequality.
If the above inequality is not true, then the spectra of the cluster must be searched in turn. The order in which the spectra were searched was determined in the order far from the center. This is because when a specific member satisfies the above inequality, all remaining members can be excluded. Combining PDS into the process can further avoid unnecessary calculations. The loop is repeated for all N clusters in ascending order to finally locate the spectrum closest to the input vector. Ultimately, the proposed method significantly reduces computational complexity because partial or all spectra of the cluster are excluded from the candidate by the inequality (10) and PDS.

Pilot Search (PS)
Applying a coarse-to-fine strategy to the above method can further reduce the computational overhead. It is an approach that first finds a relatively close spectrum, and subsequently finds the closest spectrum based on it. A relatively close spectrum can be found using a pilot search with few PCs, the pilot search does not guarantee the same results as the full search. However, if a sufficiently close spectrum is found, the closest spectrum can be found quickly, making the overall search much faster.
For the pilot search, previously built clusters or newly built clusters can be used. New clusters require additional memory, so the same previously built clusters are used in this study. The number of PCs to use for the pilot search is discussed in the experimental section. The pseudo-code of the proposed method is given in Agorithm 1.

Algorithm 1: Proposed algorithm
N : number of PCs N p : number of PCs for pilot search C : set of all clusters d yc k : pre-computed distance between y and k-th cluster c k : center of k-th cluster y : pre-transformed spectrum of y ∈ C calculate x 1:N p using PCT on the input spectrum w; sort the clusters in ascending order according to dxc k ; x,y < d 2 min then y min = y; d min = d x,y ; end end end /* Main search */ calculate x N p :N using PCT on the input spectrum; d min = d(w, z min ) ; x,y < d 2 min then y min = y; d min = d(w, z min ); end end end return d min and y min ;

Experimental Section
A total of 40 chemicals and 12 explosives were prepared using ≥ 99% concentration standard from Sigma-Aldrich (St. Louis, MO, USA) and the materials were measured using three Raman instruments. They merged with a commercial Raman library (Thermo Fisher Scientific) of 14,033 spectra to form a Raman database of 14,085 spectra. The Raman database consists of one template for each material. Table 1 shows the detailed specifications of the four Raman spectroscopy systems used to measure the spectrum. All spectra were adjusted to have a resolution of 201-3500 cm −1 by resampling and were preprocessed with additive noise reduction and background noise removal [38,39]. Figure 3 shows an example of Raman spectra after preprocessing. The types of chemicals are acetonitrile, benzene, cyclohexane, and toluene. To analyze the performance of the algorithm, 2817 types of the Raman spectrum, which is 20% of the database, was searched from all 14,085 types of the Raman spectrum. Similar to the real spectrum, noise of approximately 15, 20, and 25 dB was added to the input spectrum. The factor influencing the identification performance of the proposed method is the noise of the input spectrum used in the experiment. The identification performance of the spectrum acquired under harsh noise conditions is expected to be relatively low, and the well-removed spectrum can be expected to have good identification performance. Therefore, it is important to introduce suitable noise reduction methods and find optimal parameters. However, the VQ method has no effect on the identification performance of the full search method, which is the reference identification algorithm. Therefore, to focus on the aim of this study, the main content of this paper is limited to the VQ issue. Figure 4 depicts a flowchart of the proposed method including preprocessing. Black arrows on the right indicate real-time processes, while white arrows indicate previously calculated processes.

Results and Discussion
There are several ways to evaluate the computational complexity of an algorithm. Among them, the execution speed depends on various aspects of the CPU, such as the instruction mix, pipeline structure, cache memory, and the number of cores, thus rendering it difficult to find an explicit relationship between the search speed and execution time. Therefore, in this study, the number of necessary additions and multiplications was chosen as a criterion for evaluating the search speed.
In general, the fast search technique in VQ presumes the same identification performance as the full search technique, which uses the entire dimension of the input data. Therefore, all experimental results were compared focusing only on the computational complexity, under the same identification performance conditions as the full search.
Preliminary experiments were conducted to determine the appropriate number of clusters and PCs. Table 2 shows the results of the PTC + PDS method according to the number of PCs. This method showed the least computational complexity when 150 PCs were used. In Section 2, it was discussed that 250 PCs contain almost all the information from the 3300 raw data points. However, this does not indicate that it is the best parameter. Based on this result, we determined the optimal number of clusters. Table 3 shows the number of multiplications and additions according to the number of clusters in CS using 150 PCs. According to Table 3, the computational complexity decreases as the number of clusters increases. However, once the number of clusters exceeds 80, the computational complexity increases again. Therefore, the number of clusters was set at 80 in this study.
Subsequently, the number of clusters was fixed at 80, and the computational complexity, cluster skip, and element skip according to the number of PCs were analyzed. The results are presented in Table 4. Cluster skip refers to the exclusion of all cluster members from the candidates. In contrast, element skip implies excluding some data in the cluster from the candidates. As the number of PCs increases, the number of cluster skips increases, as information regarding the spectrum that can be used in Equation (10) increases. Therefore, more clusters can be excluded than when the number of PCs is small. However, the overall computational complexity must be considered, as converting more PCs requires additional computation. According to Table 4, the best results are obtained using 80 PCs. For all PCs considered in the experiment, the sum of the number of cluster and element skips approximates the number of total clusters 80. Hence, it can be confirmed that the cluster structure effectively excludes spectra that cannot be candidates.
Finally, the pilot search was applied to further speed up the search. Table 5 shows the experimental results obtained while investigating the effect of the number of PCs used in the pilot search. The results show typical trade-off characteristics. As the number of PCs for the pilot search increases, the computational amount of the pilot search decreases, while the computational amount of CS increases. Due to these characteristics, the total number of calculations is similar. This means that the pilot test reliably helps improve the performance. In the following experiments, we chose 40 as the number of PCs for the pilot search. To analyze the performance of each algorithm, including the proposed method, 2817 Raman spectra were assessed, and the results are listed in Table 6. PDS significantly reduces the number of additions. In contrast, MPS is more effective in reducing the number of multiplications. This is because MPS relies on the segmental mean, which reduces the need for addition rather than multiplication. The overall performance depends on the characteristics of the data. If the data are not distributed around the mean, the benefit decreases. This is the reason for introducing the CS. The combination of MPS1D and PDS reduces the overall computation complexity compared to the case of using only MPS1D. The MPS1D_sort method sorts the reference spectra according to the mean in advance. If the difference between the mean is more than K times the minimum distance, further search is not needed as in ENNS, which speeds up the search. This method showed a performance improvement of approximately 55.2% compared to the MPS1D + PDS.
Subsequently, PCT with PDS shows a reduction of approximately 63.56% in computational complexity compared to MPS1D_Sort + PDS. The proposed method, CS, showed an improvement of approximately 36.34% when using the same number of PCs as PCT + PDS, and nearly 47.57% when using the optimal number of PCs. From these results, it was confirmed that pre-determining the cluster structure of the database is effective for fast search. These properties are not just found in CS. MPS1D, a method of determining the search structure in advance through database analysis, also showed faster search speed compared to the method without structure. Determining an appropriate search structure is a critical issue as this structure is created in advance and does not affect the real-time search.
Finally, the introduction of the pilot search reduced the computational amount by approximately 34.78% compared to CS (80 PCs). This result corresponds to a computational complexity equivalent to 0.48% that of the full search. The overall average results of the major algorithms are shown in Figure 5 for convenience.

Conclusions
In this paper, we proposed a novel search method that speeds up the search for the identification of Raman spectra. The principal component transformation was introduced along with the cluster structure. The reduced number of data dimensions reduces the computational complexity of distance calculations, and the cluster structure combines the well-known trigonometric inequality with PDS to exclude numerous spectra that cannot be the candidate from the search. Finally, the pilot search was applied to further speed up the search. Optimal parameters of the proposed method were investigated and determined experimentally.
Moreover, various algorithms in the VQ field were modified and introduced into structures suitable for 1D signals and compared with the proposed method. From the results of the experiments, it was found that the proposed method significantly surpassed the existing method in terms of the required number of additions and multiplications.
The proposed method is particularly suitable for systems with relatively limited computing power, such as portable Raman spectroscopy and the compact Raman spectrometer. In addition, applications such as hazardous substance detection require fast and accurate detection, and the search time is particularly long for the large database of 14085 considered in the paper, so the proposed technique can be used appropriately. We expect that the proposed method and its variants will be a promising alternative to the spectral search problems in the above-mentioned applications.