You are currently viewing a new version of our website. To view the old version click .
Mathematics
  • Feature Paper
  • Article
  • Open Access

9 February 2025

Fast Search Using k-d Trees with Fine Search for Spectral Data Identification

,
and
Department of Intelligent Electronics and Computer Engineering, Chonnam National University, Gwangju 61186, Republic of Korea
*
Author to whom correspondence should be addressed.
This article belongs to the Section E1: Mathematics and Computer Science

Abstract

Spectral identification is an essential technology in various spectroscopic applications, often requiring large spectral databases. However, the reliance on large databases significantly increases computational complexity. To address this issue, we propose a novel fast search algorithm that substantially reduces computational demands compared to existing methods. The proposed method employs principal component transformation ( P C T ) as its foundational framework, similar to existing techniques. A running average filter is applied to reduce noise in the input data, which reduces the number of principal components ( P C s ) necessary to represent the data. Subsequently, a k-d tree is employed to identify a relatively similar spectrum, which efficiently constrains the search space. Additionally, fine search strategies leveraging precomputed distances enhance the existing pilot search method by dynamically updating candidate spectra, thereby improving search efficiency. Experimental results demonstrate that the proposed method achieves accuracy comparable to exhaustive search methods while significantly reducing computational complexity relative to existing approaches.

1. Introduction

The utilization of spectroscopic techniques, such as Raman spectroscopy and infrared spectroscopy, has steadily increased in recent years. Raman spectroscopy reveals the molecular fingerprint of a sample and provides quantitative information about its chemical composition [1]. Spectroscopy also reveals sensitive biochemical alterations in cells and tissues. These properties can be used in disease diagnosis and treatment, and they have applications in a variety of fields, including biology and medicine [2]. Consequently, significant research efforts have been devoted to advancing spectral signal processing techniques. Among these efforts, one of the most critical and challenging tasks is the fast and precise identification of measured spectra, which requires careful comparison and analysis against well-established and comprehensive spectral databases [3]. This task is essential not only for improving the efficiency and precision of spectral measurements but also for enabling reliable interpretation and utilization of spectral data in various scientific and industrial applications [4].
Various spectrum identification methods have been proposed, with deep learning-based approaches receiving considerable attention in recent years [5,6]. To effectively leverage these techniques, the availability of a large spectral database is essential [7]. However, existing spectral libraries typically provide only a single sample for each type of spectrum, which limits the ability to design robust classification models. As a result, search techniques, such as correlation search and Euclidean distance ( E D ) search, are commonly employed. These methods are intuitive, computationally efficient, and compatible with existing spectral libraries [8].
Fast search techniques are particularly suitable for embedded systems with limited computational resources [9,10]. For instance, portable spectrometer systems designed to detect hazardous substances, such as explosives and poisons, at accident sites exemplify the utility of such techniques [11,12]. Such applications demand the development of identification methods that are both fast and accurate. To meet this requirement, we focus on fast search methods that utilize similarity evaluation techniques. Unlike deep learning and machine learning approaches, which often incur significant computational overhead during the training process due to the size and complexity of datasets, similarity evaluation methods facilitate efficient searches. By relying on a single representative spectrum for comparison, these methods significantly reduce computational demands while maintaining accuracy [13].
The E D between spectra is a commonly used metric for similarity measurement [14]. The spectrum most similar to the input spectrum is identified as the corresponding material. A fast E D comparison method typically utilizes a clustering structure to optimize the search process. The primary advantage of such a structure lies in its ability to reduce the search space by grouping spectra in the database based on their similarity, thereby minimizing computational effort.
In previous studies, hierarchical clustering was employed to assess the similarity between spectra. This approach grouped closely related spectra into clusters, enabling an initial identification of the input spectrum. Subsequently, a pilot search algorithm was applied within the reduced subset to determine the spectrum most similar to the input, achieving both efficient and accurate identification [15].
To surpass the performance of existing methods, this study combines k-d trees with a fine search process. k-d trees are data structures that generalize binary search trees to k dimensions, offering low computational costs. While k-d trees are typically used for exact solutions [16], we adapt them to identify approximate neighbors [17,18]. A fine search is then applied as a postprocessing step to ensure that the results are consistent with those obtained through an exhaustive search.
This paper is organized as follows. Section 2 presents a review of previously developed search methods. Section 3 introduces the proposed fast algorithm, while Section 4 compares its simulation results with those of existing techniques. Finally, Section 5 provides a summary of the experimental results and conclusions.

2. Previous Methods

2.1. Full Search with Partial Distortion Search

The full search method involves identifying the vector that minimizes the squared Euclidean distance between the input vector x = ( x 1 , x 2 , . . . , x K ) and the database vector Y = y i | i = 1 , 2 , , N , as shown in Equation (1). Here, y i = ( y i 1 , y i 2 , . . . , y i K ) represents the reference spectrum, and d 2 ( x , y i ) denotes the squared Euclidean distance between the input vector x and the spectrum y i , as described in Equation (2).
d m i n 2 = m i n { d 2 ( x , y i ) , i = 1 , 2 , , N } .
d 2 ( x , y i ) = k = 1 K ( x k y i k ) 2 .
The search method using the squared Euclidean distance sequentially searches the entire database to find the reference vector closest to the input vector. This method requires N × K multiplications, N ( 2 K 1 ) additions, and N 1 comparisons for a k-dimensional input vector where the number of spectra is N. This method has a limitation in that the computational load increases as the number of spectra N grows. A simple initial solution to this problem is a partial distortion search ( P D S ).
P D S is an algorithm that terminates early by setting a termination condition during the calculation of d 2 ( x , y i ) . Assuming that the current minimum distance is d m i n 2 along the minimum search path, the algorithm stops calculating the distance when there exists a value q < K that satisfies the condition in Equation (3). This method reduces ( K q ) multiplications and 2 ( K q ) additions [19].
k = 1 q ( x k y i k ) 2 d m i n 2 .

2.2. Fast Search Based on Principal Component Analysis

Assuming that the dimension of the spectrum is 3300, the full search method calculates the Euclidean distance between all data points. Performing these calculations on the entire dataset demands substantial computational resources, rendering it impractical for fast searches. To address this challenge, Principal Component Analysis ( P C A ) is frequently employed to reduce computational costs. P C A is a multivariate analysis technique that reduces data dimensionality while preserving the essential information of the original dataset [20,21].
Let the reference spectra Y be an N × M matrix. A correlation matrix R of dimension N × N is constructed. The matrix R can be decomposed as shown in Equation (4), where the diagonal eigenvalue matrix Λ contains the eigenvalues λ 1 λ 2 λ N . The transformed principal components ( P C s ) are then computed using Equation (5).
R = YY T = V Λ V T .
PC = V T Y .
The degree of compression achieved through P C A can be quantified as a percentage of the cumulative sum of eigenvalues, as given in Equation (6).
C R ( K ) = i = 1 K λ i / i = 1 N λ i .
The cumulative ratio reaches approximately 1 when 250 principal components are used. This indicates that employing 250 PCs provides results nearly identical to those obtained via a full search, significantly reducing computational costs. However, using 250 PCs still involves a considerable amount of computation.
Previous studies have demonstrated that a search method utilizing a smaller number of P C s can achieve comparable results. The performance of such methods, along with the proposed approach, is discussed in detail in Section 4.

2.3. Cluster Search with Pilot Search

Cluster search ( C S ) is an algorithm based on hierarchical clustering ( H C ) that utilizes a similarity measure between data clusters. In C S , all distances between clusters in a spectral dataset are precomputed [15]. Assuming that the dataset consists of M clusters, there are M × ( M 1 ) / 2 pairs of distances. The C S method calculates the center of each cluster and designates the data point closest to the center as the “center data”. Distances between the center data and all other points within the same cluster are then calculated and stored, with the farthest distance defining the cluster size.
The C S algorithm begins by computing the distance between the input spectrum and each cluster center, sorting the clusters in ascending order of distance, and identifying the closest cluster. The distance between the input spectrum and the center of the closest cluster is denoted as d ( x , c ) , which is initialized as the minimum distance d m i n .
For a given cluster, if the difference between d ( x , c ) and d ( y , c ) exceeds the current d m i n , all members of that cluster are excluded from consideration as the closest candidates. Conversely, if the condition in Equation (7) is not satisfied, the spectra within the cluster are examined sequentially. The search order progresses from the cluster center outward to the farthest member. If a specific member meets the condition, the remaining members of the cluster are excluded from further inspection. This process is repeated across all M clusters, ultimately identifying the spectrum closest to the input.
d ( x , c ) d ( y , c ) > d m i n .
Pilot search begins by identifying a relatively close spectrum and then refining the search to locate the closest spectrum from this initial point. Previous studies have demonstrated that pilot search can effectively identify a relatively close spectrum using only a small number of P C s . Once a sufficiently close spectrum is initially identified, the subsequent determination of the closest spectrum can be achieved with substantially reduced computational effort.

4. Experimental Results

The Raman spectrum library used in the experiment comprised 14,085 chemical substances, with each spectrum represented as a one-dimensional vector of size 3300. Background noise and additive noise in the spectrum library were removed using noise removal techniques [23,24]. Four Raman spectroscopy systems were used to measure the spectra: the Renishaw 2000, an FT-Raman system, and two inVia instruments. Figure 4 shows examples of preprocessed Raman spectra from the experiments, including 3-aminophenylboronic acid monohydrate, 2-amino-5,6-dimethylbenzimidazole, heneicosane, and triphenylantimony(V) dichloride.
Figure 4. Examples of Raman spectra.
For performance comparison, a total of 2817 types, corresponding to 20% of the entire database, were randomly chosen and had noise levels of 15, 20, and 24 dB added to generate test data. Since actual spectra contain non-linear noise, their presence can negatively impact performance. Therefore, it is crucial to identify the optimal parameters that allow for accurate analysis, even when harsh noise is added to the test data.
The performance of the algorithm was objectively evaluated using two criteria: the number of multiplication operations and the number of addition operations. The computational cost of each operation was calculated as the average number of operations across 2817 test samples.
The performance of the proposed method is significantly influenced by the number of P C s used. Using too few P C s reduces the computational load during the k-d tree search process but increases the burden on the fine search process, potentially resulting in a higher total computational cost. Conversely, using too many P C s minimizes the need for a fine search, as a k-d tree search alone becomes sufficient. Therefore, determining the optimal number of P C s is crucial for balancing efficiency and accuracy.
Table 1 illustrates this trade-off, highlighting the balance between the two processes. Specifically, when more than 40 PCs are used, the fine search step becomes redundant. In our method, 16 PCs were selected as the optimal number, minimizing the total computational cost.
Table 1. Computational complexity of the proposed method according to the number of P C s .
To compare the performance of each algorithm, all test Raman spectra were evaluated, and the results are presented in Table 2. The number of P C s used in the algorithm was consistent with the optimal number of P C s determined in the previous study [15].
Table 2. Comparison of the multiplication and addition operations of each algorithm.
The full search + P D S method achieved a performance improvement of 80.97% compared to the existing full search method. When P C T was applied, a performance improvement of 92.62% was observed. Furthermore, the C S method, which constructs a cluster tree using 80 PCs, achieved a performance improvement of 47.57%, and the P S + C S algorithm, which introduces a pilot search to find a better initial spectrum, achieved a performance improvement of 34.78%. The overall average results of the main algorithms are shown in Figure 5.
Figure 5. Comparison of multiplication and addition operations in major algorithms.

5. Conclusions

In this paper, we propose a novel search method to enhance search speed for spectrum identification. A running average filter is applied to suppress noise, improving the efficiency of P C T . By reducing data dimensionality through P C T , the computational complexity of distance calculations is significantly decreased. The k-d tree is constructed using these reduced-dimensional P C s . Since this approach does not inherently guarantee identical results to a full search, a fine search step is incorporated to ensure accuracy equivalent to that of a full search.
The optimal parameters for the proposed method were investigated and experimentally determined. The experimental results demonstrate that the proposed method significantly outperforms existing methods in terms of the number of required additions and multiplications, offering substantial improvements over traditional approaches. However, the fine search stage of the proposed algorithm necessitates additional storage to store the distances between spectra and their corresponding indices.
Consequently, the proposed method is particularly well suited for environments with limited computational power but sufficient memory resources. Deep learning techniques used for spectrum identification typically require extensive datasets, making them challenging to apply to library data containing only a single sample per type. In contrast, the proposed method is expected to be effective for retrieval problems in library datasets with only one spectrum per type.

Author Contributions

Conceptualization, Y.S. and S.-J.B.; methodology, Y.S. and S.-J.B.; validation, Y.S., T.C. and S.-J.B.; formal analysis, Y.S. and S.-J.B.; investigation, T.C.; resources, Y.S. and S.-J.B.; data curation, T.C.; writing—review and editing, Y.S. and S.-J.B.; visualization, Y.S.; supervision, S.-J.B.; project administration, S.-J.B.; funding acquisition, S.-J.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP)-Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (IITP-2025-RS-2022-00156287, 50%) and ICAN (ICT Challenge and Advanced Network of HRD) support program (RS-2022-00156385, 50%).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, H.; Xu, W.; Broderick, N.; Han, J. An adaptive denoising method for Raman spectroscopy based on lifting wavelet transform. J. Raman Spectrosc. 2018, 49, 1529–1539. [Google Scholar] [CrossRef]
  2. Kong, K.; Kendall, C.; Stone, N.; Notingher, I. Raman spectroscopy for medical diagnostics—From in-vitro biofluid assays to in-vivo cancer detection. Adv. Drug Deliv. Rev. 2015, 89, 121–134. [Google Scholar] [CrossRef] [PubMed]
  3. Li, J.; Chu, X.; Tian, S.; Lu, W. The identification of highly similar crude oils by infrared spectroscopy combined with pattern recognition method. Spectrochim. Acta Part 2013, 112, 457–462. [Google Scholar] [CrossRef]
  4. Loethen, Y.L.; Kauffman, J.F.; Buhse, L.F.; Rodriguez, J.D. Rapid screening of anti-infective drug products for counterfeits using Raman spectral library-based correlation methods. Analyst 2015, 140, 7225–7233. [Google Scholar] [CrossRef] [PubMed]
  5. Chen, T.; Son, Y.; Park, A.; Baek, S.-J. Baseline correction using a deep-learning model combining ResNet and UNet. Analyst 2022, 147, 4285–4292. [Google Scholar] [CrossRef]
  6. Lussier, F.; Thibault, V.; Charron, B.; Wallace, G.Q.; Masson, J.F. Deep learning and artificial intelligence methods for Raman and surface-enhanced Raman scattering. Trends Anal. Chem. 2020, 124, 115796. [Google Scholar] [CrossRef]
  7. Liu, Y. Adversarial nets for baseline correction in spectra processing. Chemom. Intell. Lab. Syst. 2021, 213, 104317. [Google Scholar] [CrossRef]
  8. Park, J.K.; Park, A.; Yang, S.K.; Baek, S.-J.; Hwang, J.; Choo, J. Raman spectrum identification based on the correlation score using the weighted segmental hit quality index. Analyst 2017, 142, 380–388. [Google Scholar] [CrossRef] [PubMed]
  9. Hajjou, M.; Qin, Y.; Bradby, S.; Bempong, D.; Lukulay, P. Assessment of the performance of a handheld Raman device for potential use as a screening tool in evaluating medicines quality. J. Pharm. Biomed. Anal. 2013, 74, 47–55. [Google Scholar] [CrossRef]
  10. Sanchez, L.; Farber, C.; Lei, J.; Zhu-Salzman, K.; Kurouski, D. Noninvasive and nondestructive detection of cowpea bruchid within cowpea seeds with a hand-held Raman spectrometer. Anal. Chem. 2019, 91, 1733–1737. [Google Scholar] [CrossRef] [PubMed]
  11. Moore, D.S.; Scharff, R.J. Portable Raman explosives detection. Anal. Bioanal. Chem. 2009, 393, 1571–1578. [Google Scholar] [CrossRef] [PubMed]
  12. Izake, E.L. Forensic and homeland security applications of modern portable Raman spectroscopy. Forensic Sci. Int. 2010, 202, 1–8. [Google Scholar] [CrossRef]
  13. Gray, R.M.; Neuhoff, D.L. Quantization. IEEE Trans. Inf. Theory 1998, 44, 2325–2383. [Google Scholar] [CrossRef]
  14. Orchard, M.T. A fast nearest-neighbor search algorithm. IEEE ICASSP 1991, 91, 2297–2300. [Google Scholar] [CrossRef]
  15. Park, J.K.; Lee, S.; Park, A.; Baek, S.-J. Fast Search Method Based on Vector Quantization for Raman Spectroscopy Identification. Mathematics 2020, 8, 1970. [Google Scholar] [CrossRef]
  16. Moayeri, N.; Neuhoff, D.L.; Stark, W.E. Fine-coarse vector quantization. IEEE Trans. Signal Process. 1991, 39, 1503–1515. [Google Scholar] [CrossRef]
  17. Murphy, O.J.; Selkow, S.M. The efficiency of using k-d trees for finding nearest neighbors in discrete space. Inf. Process. Lett. 1986, 23, 215–218. [Google Scholar] [CrossRef]
  18. Tiwari, V.R. Developments in KD Tree and KNN Searches. Int. J. Comput. Appl. 2023, 185, 17–23. [Google Scholar] [CrossRef]
  19. Bei, C.; Gary, R.M. An improvement of the minimum distortion encoding algorithms for vector quantization and pattern matching. IEEE Trans. Commun. 1985, 33, 1132–1133. [Google Scholar] [CrossRef]
  20. Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572. [Google Scholar] [CrossRef]
  21. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. Math. Phys. Eng. Sci. 2016, 374, 20150202. [Google Scholar] [CrossRef] [PubMed]
  22. Bentley, J.L. Multidimensional binary search trees used for associative searching. Commun. Acm 1975, 18, 509–517. [Google Scholar] [CrossRef]
  23. Baek, S.-J.; Park, A.; Ahn, Y.-J.; Choo, J. Baseline correction using asymmetrically reweighted penalized least squares smoothing. Analyst 2015, 140, 250–257. [Google Scholar] [CrossRef]
  24. Press, W.H.; Flannery, B.P.; Teukolsky, S.A.; Vetterling, W.T. Numerical Recipes in C; Cambridge University Press: New York, NY, USA, 1988. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Article Metrics

Citations

Article Access Statistics

Multiple requests from the same IP address are counted as one view.