# Pattern Recognition for Human Diseases Classification in Spectral Analysis

## Abstract


## 1. Introduction

## 2. Exploratory Data Analysis (EDA)

#### 2.1. Principal Component Analysis (PCA)

- (i) Standardize the $d$-dimensional dataset.
- (ii) Compute the covariance matrix of the whole dataset using Equation (1):$$cov\left(X,Y\right)=\frac{1}{n-1}\sum_{i=1}^{n}\left({X}_{i}-\overline{x}\right)\left({Y}_{i}-\overline{y}\right)$$
- (iii) Compute the eigenvectors and corresponding eigenvalues of the covariance matrix. A scalar $\lambda$ is called an eigenvalue of a square matrix $A$ if there is a non-zero vector $v$, called an eigenvector, such that [11]:$$Av=\lambda v$$
- (iv) Sort the eigenvalues in decreasing order and choose the $k$ eigenvectors associated with the largest eigenvalues to form a $d \times k$-dimensional matrix $W$.
- PCA projects the feature space onto a smaller subspace, where the eigenvectors form the axes of this new feature subspace.
- Remove the eigenvectors with the lowest eigenvalues, since they carry the least information about the data distribution.
- (v) Construct a projection matrix, $W$, from the top $k$ eigenvectors.
- (vi) Transform the $d$-dimensional input dataset, $X$, using the projection matrix $W$ to obtain the new $k$-dimensional feature subspace:$${X}^{\prime}=XW$$

- (a) The dataset should contain multiple continuous variables, such as ratios or intervals; ordinal variables can also be employed.
- (b) The chosen variables should be in a linear relationship, since this approach is based on Pearson correlation coefficients.
- (c) The sample size should be large enough to yield a valid result. The Kaiser-Meyer-Olkin (KMO) measure and Bartlett’s test of sphericity are two methods for determining sample adequacy.
- (d) It should be possible to apply data reduction to the data: adequate correlations between variables are required to reduce them to a smaller number of principal components.
- (e) There should be no significant outliers in the data.
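
The six PCA steps above can be sketched directly in NumPy. This is a minimal illustration, not the authors' implementation; the data, function name, and choice of $k=2$ are ours.

```python
import numpy as np

def pca(X, k):
    """Project data X (n x d) onto its top-k principal components."""
    # (i) standardize each feature to zero mean and unit variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # (ii) covariance matrix (d x d) with the 1/(n-1) estimator of Equation (1)
    C = np.cov(Z, rowvar=False)
    # (iii) eigendecomposition; eigh applies since C is symmetric
    eigvals, eigvecs = np.linalg.eigh(C)
    # (iv) sort eigenpairs by decreasing eigenvalue
    order = np.argsort(eigvals)[::-1]
    # (v) projection matrix W from the top-k eigenvectors (d x k)
    W = eigvecs[:, order[:k]]
    # (vi) transform: X' = XW gives the k-dimensional subspace
    return Z @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced = pca(X, 2)
```

Because the eigenvectors of the covariance matrix are orthogonal, the resulting component scores are uncorrelated with each other.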

#### 2.2. Kernel Principal Component Analysis (KPCA)

- Compute $\overline{\varphi}=\frac{1}{N}\sum_{j=1}^{N}\varphi\left({x}_{j}\right)$ and the centered kernel $\tilde{\kappa}$ using Equation (5):$$\tilde{\kappa}\left(x,y\right)=\kappa\left(x,y\right)-\frac{1}{N}\sum_{j=1}^{N}\kappa\left(x,{x}_{j}\right)-\frac{1}{N}\sum_{i=1}^{N}\kappa\left({x}_{i},y\right)+\frac{1}{{N}^{2}}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa\left({x}_{i},{x}_{j}\right)$$
- Compute the centered kernel matrix $\tilde{K}$ as in Equation (6):$$\tilde{K}=\left(\tilde{\kappa}\left({x}_{i},{x}_{j}\right)\right)\in {\mathbb{R}}^{N\times N}$$
- Compute the eigenvectors ${w}_{i}\in {\mathbb{R}}^{N}$ of $\tilde{K}$, normalized so that $\Vert {w}_{i}\Vert^{2}={\lambda}_{i}^{-1}$:$$\tilde{K}{w}_{i}={\lambda}_{i}{w}_{i}$$
- For every data point $x$, compute its $i$th nonlinear principal component using Equation (8) for $i=1,2,\dots,d$:$${y}_{i}={w}_{i}^{T}{\left[\tilde{\kappa}\left({x}_{1},x\right),\dots,\tilde{\kappa}\left({x}_{N},x\right)\right]}^{T}$$
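
As an illustrative sketch of these steps (the RBF kernel and $\gamma=0.5$ are our assumptions, not prescribed by the text), the centered kernel of Equation (5) can be computed in matrix form and the training points projected onto the top components:

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    """Pairwise RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

def kpca(X, k, gamma=0.5):
    K = rbf_kernel(X, gamma)
    N = len(X)
    one = np.ones((N, N)) / N
    # centered kernel matrix: matrix form of Equation (5)
    K_tilde = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    order = np.argsort(eigvals)[::-1][:k]
    lam, w = eigvals[order], eigvecs[:, order]
    # normalize eigenvectors so that ||w_i||^2 = 1/lambda_i
    w = w / np.sqrt(lam)
    # for the training points, Equation (8) reduces to K_tilde @ w
    return K_tilde @ w

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
Y = kpca(X, 2)
```

Centering guarantees each extracted component has zero mean over the training set.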

#### 2.3. Successive Projection Algorithm (SPA)

- Before the first iteration $\left(n=1\right)$, let ${x}_{j}$ be the $j$th column of ${X}_{cal}$, $j=1,\dots,J$, and let $k\left(0\right)$ be the index of the starting wavelength.
- Let $S$ be the set of wavelengths not yet selected, that is, $S=\left\{j:1\le j\le J\ \mathrm{and}\ j\notin \left\{k\left(0\right),\dots,k\left(n-1\right)\right\}\right\}$
- Compute the projection $\mathbf{P}{x}_{j}$ of each ${x}_{j}$, $j\in S$, on the subspace orthogonal to ${x}_{k\left(n-1\right)}$
- Let $k\left(n\right)=\mathrm{arg}\underset{j\in S}{\mathrm{max}}\Vert \mathbf{P}{x}_{j}\Vert$
- Let ${x}_{j}=\mathbf{P}{x}_{j}$, $j\in S$
- Let $n=n+1$. If $n<N$, return to step 2.
- The resulting set of selected wavelengths is $\left\{k\left(n\right);n=0,\dots,N-1\right\}$
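
The SPA loop above can be sketched as follows; this is an illustrative rendition (the starting index $k(0)=0$ and the random calibration matrix are our assumptions):

```python
import numpy as np

def spa(X_cal, N, k0=0):
    """Successive Projections Algorithm: select N wavelength indices from X_cal (samples x J)."""
    X = X_cal.astype(float).copy()
    J = X.shape[1]
    selected = [k0]
    for _ in range(1, N):
        S = [j for j in range(J) if j not in selected]
        xk = X[:, selected[-1]]
        # project every unselected column onto the subspace orthogonal to x_{k(n-1)}
        P = np.eye(len(xk)) - np.outer(xk, xk) / (xk @ xk)
        X[:, S] = P @ X[:, S]
        # pick the column with the largest projected norm
        norms = np.linalg.norm(X[:, S], axis=0)
        selected.append(S[int(np.argmax(norms))])
    return selected

rng = np.random.default_rng(2)
X_cal = rng.normal(size=(30, 10))
picks = spa(X_cal, 4)
```

Each iteration keeps only the component of every candidate column orthogonal to the wavelengths already chosen, which is why SPA favors minimally collinear variable subsets.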

#### 2.4. Genetic Algorithm (GA)

- Create initial population
- Evaluate initial population
- Select a subpopulation of parent pairs from the initial population
- Produce offspring from these pairs using the genetic operators of crossover and mutation
- Evaluate the offspring and replace the worst parents with the best offspring
- Repeat until the stopping criteria are satisfied
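
A minimal GA following this loop is sketched below on the toy OneMax problem (maximize the number of 1-bits); the fitness function, tournament selection, and all parameter values are illustrative assumptions, not the wavelength-selection setup of the cited spectroscopy work.

```python
import numpy as np

def genetic_algorithm(fitness, n_bits=16, pop_size=20, generations=40,
                      p_cross=0.9, p_mut=0.05, seed=3):
    rng = np.random.default_rng(seed)
    # create the initial population of random bitstrings
    pop = rng.integers(0, 2, size=(pop_size, n_bits))
    for _ in range(generations):
        # evaluate the current population
        scores = np.array([fitness(ind) for ind in pop])
        # tournament selection of the parent subpopulation
        idx = rng.integers(0, pop_size, size=(pop_size, 2))
        parents = pop[np.where(scores[idx[:, 0]] >= scores[idx[:, 1]],
                               idx[:, 0], idx[:, 1])]
        # crossover and mutation produce the offspring
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                cut = rng.integers(1, n_bits)
                children[i, cut:], children[i + 1, cut:] = \
                    parents[i + 1, cut:].copy(), parents[i, cut:].copy()
        mutate = rng.random(children.shape) < p_mut
        children = np.where(mutate, 1 - children, children)
        # elitism: replace the worst offspring with the best parent
        child_scores = np.array([fitness(ind) for ind in children])
        children[np.argmin(child_scores)] = pop[np.argmax(scores)]
        pop = children
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)], scores.max()

# OneMax: fitness is the number of 1-bits; the optimum is the all-ones string
best, best_score = genetic_algorithm(lambda b: b.sum())
```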

#### 2.5. Partial Least Square Regression (PLS-R)

- Set $u$ to the first column of $Y$
- Let $w=\frac{{X}^{T}u}{{u}^{T}u}$
- Scale $w$ to be of length one
- Let $t=Xw$
- Let $c=\frac{{Y}^{T}t}{{t}^{T}t}$
- Scale $c$ to be of length one
- Let $u=\frac{Yc}{{c}^{T}c}$
- If $t$ has converged, continue to step 9; otherwise return to step 2
- $X$-loadings: $p=\frac{{X}^{T}t}{{t}^{T}t}$
- $Y$-loadings: $q=\frac{{Y}^{T}u}{{u}^{T}u}$
- Regression ($u$ upon $t$): $b=\frac{{u}^{T}t}{{t}^{T}t}$
- Residual matrices: $X\to X-t{p}^{T}$ and $Y\to Y-bt{c}^{T}$
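
This NIPALS loop translates nearly line-for-line into NumPy. The sketch below is illustrative (the component count, tolerance, and synthetic data are our choices); it returns the scores, $X$-loadings, and inner-regression coefficient for each extracted component.

```python
import numpy as np

def nipals_pls(X, Y, n_components=2, tol=1e-10, max_iter=500):
    """NIPALS PLS regression: extract components, deflating X and Y each round."""
    X, Y = X.astype(float).copy(), Y.astype(float).copy()
    results = []
    for _ in range(n_components):
        u = Y[:, [0]]                      # step 1: u <- first column of Y
        for _ in range(max_iter):
            w = X.T @ u / (u.T @ u)        # step 2
            w /= np.linalg.norm(w)         # step 3: scale w to length one
            t = X @ w                      # step 4
            c = Y.T @ t / (t.T @ t)        # step 5
            c /= np.linalg.norm(c)         # step 6: scale c to length one
            u_new = Y @ c / (c.T @ c)      # step 7
            converged = np.linalg.norm(u_new - u) < tol
            u = u_new
            if converged:                  # step 8: convergence check
                break
        p = X.T @ t / (t.T @ t)            # step 9: X-loadings
        q = Y.T @ u / (u.T @ u)            # step 10: Y-loadings
        b = (u.T @ t / (t.T @ t)).item()   # step 11: inner regression of u on t
        X -= t @ p.T                       # step 12: deflate the residual matrices
        Y -= b * t @ c.T
        results.append((t, p, b))
    return results

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 6))
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.01 * rng.normal(size=(40, 3))
comps = nipals_pls(X, Y, n_components=2)
```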

## 3. Classification Algorithm

#### 3.1. Linear Discriminant Analysis (LDA)

- Calculate the d-dimensional mean vectors, ${m}_{i}$ for the different classes from the dataset.
- Construct and evaluate the scatter matrices:
- Within-class scatter matrix ${S}_{W}$:$${S}_{W}=\sum_{i=1}^{c}{S}_{i}$$
- Between-class scatter matrix ${S}_{B}$:$${S}_{B}=\sum_{i=1}^{c}{N}_{i}\left({m}_{i}-m\right){\left({m}_{i}-m\right)}^{T}$$

- Solve the generalized eigenvalue problem for the matrix ${S}_{W}^{-1}{S}_{B}$ using Equation (2), where $A={S}_{W}^{-1}{S}_{B}$, $\lambda$ is the eigenvalue and $v$ is the eigenvector
- Select the linear discriminants for the new feature subspace by sorting the eigenvectors in decreasing order of their eigenvalues and choosing the $k$ eigenvectors with the largest eigenvalues
- Transform into a new subspace:$$Y=X\times W$$
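
The steps above can be sketched as follows; this is an illustrative two-class example with synthetic data, not the authors' implementation.

```python
import numpy as np

def lda(X, y, k=1):
    """Fisher discriminant directions from the eigenproblem S_W^{-1} S_B v = lambda v."""
    classes = np.unique(y)
    m = X.mean(axis=0)                           # overall mean vector
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                     # class mean vector m_i
        S_W += (Xc - mc).T @ (Xc - mc)           # within-class scatter
        diff = (mc - m)[:, None]
        S_B += len(Xc) * diff @ diff.T           # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]       # sort by decreasing eigenvalue
    W = eigvecs.real[:, order[:k]]
    return X @ W                                 # transform: Y = XW

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(3, 1, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
projected = lda(X, y, k=1)
```

On this separable toy data, the projected class means end up far apart relative to the within-class spread, which is exactly the criterion the eigenproblem optimizes.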

#### 3.2. K-Nearest Neighbors (KNN)

- Calculate the distance metric between the new data point and every training sample
- Sort the training samples by distance to the new data point in ascending order and locate the k nearest samples
- Assign the new data point to the majority class among these k neighbors
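
A minimal sketch of these steps, using Euclidean distance and invented toy labels ("healthy"/"disease" are illustrative, not from the source):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # 1) distance from the new point to every training sample
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2) sort by increasing distance and keep the k nearest
    nearest = np.argsort(dists)[:k]
    # 3) majority vote among the k neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array(["healthy", "healthy", "disease", "disease"])
label = knn_predict(X_train, y_train, np.array([0.05, 0.1]), k=3)
```

An odd $k$ avoids ties in the two-class vote; other distance metrics can be swapped in at step 1.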

#### 3.3. Decision Tree (DT)

- Begin from the root node
- Convert each ordered variable $X$ to an unordered variable ${X}^{\prime}$ by categorizing its values in the node into a small number of intervals.
- If $X$ is unordered, let ${X}^{\prime}=X$

- Perform a chi-square test of independence of each ${X}^{\prime}$ variable versus $Y$ on the data in the node and calculate its significance probability.
- Choose the variable ${X}^{*}$ associated with the ${X}^{\prime}$ that has the smallest significance probability.
- Search for the split set $\left\{{X}^{*}\in {S}^{*}\right\}$ that minimizes the sum of Gini indexes and use it to split the node into two child nodes.
- Gini is a measure of impurity obtained by counting how often a randomly selected data instance would be incorrectly labeled if it were labeled randomly according to the distribution of class labels. For binary classification with classes positive and negative, ${p}_{pos}$ is the probability that a data instance belongs to the positive class, and $\left(1-{p}_{pos}\right)$ is the probability that such an instance is incorrectly labeled as negative. The Gini index can therefore be calculated as:$$\mathrm{Gini\ index}={p}_{pos}\left(1-{p}_{pos}\right)+{p}_{neg}\left(1-{p}_{neg}\right)$$

- If the stopping criterion is reached, break the loop. Otherwise, repeat steps 2–5 to each child node until the stopping criterion is finally reached.
- Prune the tree with the CART method.
- Occasionally in step (6), stopped splitting suffers from the horizon effect: a node may be deemed a leaf too early, preventing beneficial splits in subsequent nodes, so a stopping condition may be satisfied “too soon” for overall optimal recognition accuracy. Pruning is therefore performed once a tree has reached full maturity and has the least amount of impurity in its leaf nodes. All pairs of neighboring leaf nodes are then evaluated for elimination: any pair whose removal results in an insignificant increase in impurity is removed, and the shared antecedent node is deemed a leaf [25].
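
The Gini criterion and the split search can be illustrated for a single numeric feature; this sketch (toy data and threshold scan are our assumptions) picks the threshold that minimizes the weighted sum of child Gini indexes:

```python
import numpy as np

def gini(labels):
    """Gini index: sum over classes of p * (1 - p)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def best_split(x, y):
    """Scan candidate thresholds on one feature; minimize the weighted child impurity."""
    best = (None, np.inf)
    for thr in np.unique(x)[:-1]:
        left, right = y[x <= thr], y[x > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best[1]:
            best = (thr, score)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(x, y)
```

Here the split at 3.0 produces two pure children, so the weighted impurity drops to zero; a full tree applies this search recursively to each child node.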

#### 3.4. Random Forest (RF)

- Let $D=\left\{\left({x}_{1},{y}_{1}\right),\left({x}_{2},{y}_{2}\right),\dots,\left({x}_{N},{y}_{N}\right)\right\}$ denote the training data, with ${x}_{i}={\left({x}_{i,1},{x}_{i,2},\dots,{x}_{i,p}\right)}^{T}$
- For $j=1$ to $J$: take a bootstrap sample ${D}_{j}$ of size $N$ from $D$.
- Using the bootstrap sample ${D}_{j}$ as the training data, fit a tree.
- Start with all observations in a single node and recursively repeat the following procedure for each node until the stopping criterion is reached: (1) from the $p$ available predictors, choose $m$ predictors at random;
- (2) find the best binary split among all binary splits on the $m$ predictors from step (1);
- (3) split the node into two descendant nodes using the split from step (2).
- Then predict a new point $x$ using Equation (18):$$f\left(x\right)=\mathrm{arg}\underset{y}{\mathrm{max}}\sum_{j=1}^{J}I\left(\widehat{{h}_{j}}\left(x\right)=y\right)$$
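
The bagging-and-voting structure above can be sketched with depth-1 trees (stumps) standing in for full trees; the stump learner, data, and parameters $J$ and $m$ are illustrative simplifications, not the authors' setup.

```python
import numpy as np

def fit_stump(X, y, m, rng):
    """Depth-1 tree: pick m random predictors, keep the most accurate binary split."""
    feats = rng.choice(X.shape[1], size=m, replace=False)
    best = None
    for f in feats:
        for thr in np.unique(X[:, f])[:-1]:
            mask = X[:, f] <= thr
            left = np.bincount(y[mask]).argmax()     # majority label on each side
            right = np.bincount(y[~mask]).argmax()
            acc = np.mean(np.where(mask, left, right) == y)
            if best is None or acc > best[0]:
                best = (acc, f, thr, left, right)
    if best is None:  # pure (or constant) bootstrap sample: predict its majority class
        c = int(np.bincount(y).argmax())
        return (0, np.inf, c, c)
    return best[1:]

def random_forest(X, y, J=25, m=1, seed=6):
    rng = np.random.default_rng(seed)
    N = len(X)
    trees = []
    for _ in range(J):
        idx = rng.integers(0, N, size=N)   # bootstrap sample D_j of size N from D
        trees.append(fit_stump(X[idx], y[idx], m, rng))
    def predict(x):
        # majority vote over the J trees: argmax_y sum_j I(h_j(x) = y), Equation (18)
        votes = [l if x[f] <= thr else r for f, thr, l, r in trees]
        return np.bincount(votes).argmax()
    return predict

X = np.array([[0.0], [0.2], [0.4], [2.0], [2.2], [2.4]])
y = np.array([0, 0, 0, 1, 1, 1])
predict = random_forest(X, y)
```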

#### 3.5. Support Vector Machine (SVM)

- Identify the class function from which the decision boundary is to be chosen.
- For linear SVM, a linear function is used as follows:$$H\left(w,b\right)=\{x:{w}^{T}x+b=0\}$$

- Define the margin that includes the minimal distance between a candidate decision boundary and the points in each class, as in Equation (20):$$h\left({x}_{i}\right)=\left\{\begin{array}{c}+1\ \mathrm{if}\ w\cdot {x}_{i}+b\ge 0\\ -1\ \mathrm{if}\ w\cdot {x}_{i}+b<0\end{array}\right.$$
- Choose, from the class of functions in step (1), the decision boundary (usually a hyperplane) that maximizes this margin.
- Compute the performance of the chosen decision boundary on the training set.
- Compute the expected classification performance on the new data point.
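
As an illustrative sketch of a linear decision boundary $w\cdot x+b=0$, the snippet below trains with hinge-loss subgradient descent (a Pegasos-style method we substitute for brevity; it is not the quadratic-programming formulation of classical SVM, and the data and hyperparameters are our assumptions):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=7):
    """Pegasos-style subgradient descent on the regularized hinge loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            margin = y[i] * (w @ X[i] + b)
            if margin < 1:                        # inside the margin: move the boundary
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:                                 # safely classified: only shrink w
                w = (1 - eta * lam) * w
    return w, b

def svm_predict(w, b, X):
    # h(x) = +1 if w.x + b >= 0 else -1, as in Equation (20)
    return np.where(X @ w + b >= 0, 1, -1)

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.0],
              [-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])
w, b = train_linear_svm(X, y)
acc = (svm_predict(w, b, X) == y).mean()
```

For nonlinearly separable data, the kernel functions of Table 1 replace the inner products, which is the kernel-SVM setting used in several of the studies reviewed below.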

#### 3.6. Partial Least Squares Discriminant Analysis (PLS-DA)

#### 3.7. Artificial Neural Network (ANN)

- (i) Nonlinearity for a better fit to the data
- (ii) Noise-insensitivity, which offers accurate prediction in the presence of uncertain data and measurement errors
- (iii) High parallelism, which means rapid processing and tolerance of hardware failure
- (iv) Learning and adaptivity, which enable the system to update its internal structure in response to a changing environment
- (v) Generalization, which allows the model to be applied to unseen data

## 4. Performance Metrics for Classification Model

#### 4.1. Accuracy

#### 4.2. Sensitivity (Recall)

#### 4.3. Specificity

#### 4.4. Precision

#### 4.5. F1-Score

#### 4.6. Area under the ROC Curve (AUC)

- (i) Choosing a decision threshold that minimizes error rate or misclassification cost for a particular class and cost distribution
- (ii) Finding a region where one classifier outperforms another
- (iii) Identifying regions where classifiers perform worse than chance
- (iv) Obtaining calibrated class posterior estimates
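
The metrics of Sections 4.1-4.6 follow directly from the confusion matrix; a minimal sketch (the toy labels and scores are illustrative) computes them alongside the AUC via its Mann-Whitney formulation:

```python
import numpy as np

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, precision, and F1 from the confusion matrix."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # true-positive rate (recall)
    specificity = tn / (tn + fp)          # true-negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}

def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive is scored above a random negative."""
    pos = scores[y_true == positive]
    neg = scores[y_true != positive]
    wins = (np.sum(pos[:, None] > neg[None, :])
            + 0.5 * np.sum(pos[:, None] == neg[None, :]))
    return wins / (len(pos) * len(neg))

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1])
m = classification_metrics(y_true, y_pred)
a = auc(y_true, scores)
```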

## 5. Application of Pattern Recognition for Disease Classification in Spectral Analysis

#### 5.1. Ultraviolet-Visible (UV/Vis) Spectroscopy

#### 5.1.1. General Theory of UV/Vis Spectroscopy

#### 5.1.2. Past Studies on UV/Vis Spectroscopy for Disease Classification

#### 5.2. Infrared (IR) Spectroscopy

The mid-IR region (MIR) spans the wavenumber range 4000 cm^{−1} (2.5 μm) to 400 cm^{−1} (25 μm). It is bounded by the far-IR region (FIR), spanning 400 cm^{−1} (25 μm) to 10 cm^{−1} (1 mm), and the critical near-IR region (NIR), spanning 12,500 cm^{−1} (800 nm) to 4000 cm^{−1} (2.5 μm). Infrared spectroscopy is the most widely used spectroscopic technique. Numerous factors contribute to its enormous popularity and spread: the approach is fast, sensitive, and simple to use, and it enables the sampling of gases, liquids, and solids using a variety of procedures. Notable characteristics include the ease with which the spectra may be evaluated qualitatively and quantitatively [43].

#### 5.2.1. General Theory of FTIR Spectroscopy

#### 5.2.2. Past Study on IR Spectroscopy for Disease Classification

With R^{2} = 0.9980 accuracy, the PLSR model was effectively employed to predict infection indications based on biochemical changes in blood samples. Then, in [52], another study was undertaken to discriminate typhoid and dengue illness using the same methodologies but without the PLSR, obtaining 100 percent accuracy between samples. Ref. [50] employed PCA as a dimensionality reduction technique in conjunction with three models to develop a method for rapidly detecting gliomas in human blood. In this research, a non-linear SVM with a radial basis function (RBF) kernel optimized by particle swarm optimization (PSO-SVM) was used alongside a Back Propagation Neural Network (BPNN) and DT. The DT accuracy was found to be quite poor, while PSO-SVM performed the best, with an accuracy of 92.00 percent and an AUC value of 0.919.

#### 5.3. Raman Spectroscopy

#### 5.3.1. General Theory of Raman Spectroscopy

#### 5.3.2. Past Studies on Raman Spectroscopy for Disease Classification

## 6. Initial Experimental Analysis of the Classification of Spectra Data

## 7. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
2. Mitchell, T.M. Machine Learning; McGraw-Hill: New York, NY, USA, 1997.
3. Otto, M. Chemometrics: Statistics and Computer Application in Analytical Chemistry, 3rd ed.; Wiley-VCH Verlag GmbH & Co.: Weinheim, Germany, 2017.
4. Ahmed, N.; Dawson, M.; Smith, C.; Wood, E. Biology of Disease, 1st ed.; Taylor & Francis Group: Abingdon-on-Thames, UK, 2007.
5. Nielsen, S.S. Food Analysis, 5th ed.; Springer: Cham, Switzerland, 2017.
6. Santos, M.C.D.; Nascimento, Y.M.; Araújo, J.M.G.; Lima, K.M.G. ATR-FTIR spectroscopy coupled with multivariate analysis techniques for the identification of DENV-3 in different concentrations in blood and serum: A new approach. RSC Adv. 2017, 7, 25640–25649.
7. Sammut, C.; Webb, G.I. Encyclopedia of Machine Learning; Springer: New York, NY, USA, 2010.
8. Sumithra, V.S.; Surendran, S. A computational geometric approach for overlapping community (cover) detection in social network. In Proceedings of the 2015 International Conference on Computing and Network Communications (CoCoNet), Trivandrum, India, 16–19 December 2015; pp. 98–103.
9. Anowar, F.; Sadaoui, S.; Selim, B. Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE). Comput. Sci. Rev. 2021, 40, 100378.
10. Olver, P.; Shakiban, C. Applied Linear Algebra, 2nd ed.; Springer International Publishing AG: Cham, Switzerland, 2018.
11. Raschka, S.; Mirjalili, V. Python Machine Learning, 3rd ed.; Packt Publishing Ltd.: Birmingham, UK, 2019.
12. Kumar, R.; Sharma, V. Chemometrics in forensic science. Trends Anal. Chem. 2018, 105, 191–201.
13. Zimmer, V.A.M.; Fonolla, R.; Lekadir, K.; Piella, G.; Hoogendoorn, C.; Frangi, A.F. Patient-Specific Manifold Embedding of Multispectral Images Using Kernel Combinations. Mach. Learn. Med. Imaging 2013, 8184, 82–89.
14. Vidal, R.; Ma, Y.; Sastry, S.S. Generalized Principal Component Analysis; Springer: New York, NY, USA, 2016.
15. Araújo, M.C.U.; Saldanha, T.C.B.; Galvão, R.K.H.; Yoneyama, T.; Chame, H.C.; Visani, V. The successive projections algorithm for variable selection in spectroscopic multicomponent analysis. Chemom. Intell. Lab. Syst. 2001, 57, 65–73.
16. Santos, M.C.D.; Morais, C.L.M.; Nascimento, Y.M.; Araujo, J.M.G.; Lima, K.M.G. Spectroscopy with computational analysis in virological studies: A decade (2006–2016). Trends Anal. Chem. 2017, 97, 244–256.
17. Jarvis, R.M.; Goodacre, R. Genetic algorithm optimization for pre-processing and variable selection of spectroscopic data. Bioinformatics 2005, 21, 860–868.
18. Nawaz, H.; Rashid, N.; Saleem, M.; Asif Hanif, M.; Irfan Majeed, M.; Amin, I.; Iqbal, M.; Rahman, M.; Ibrahim, O.; Baig, S.M.; et al. Prediction of viral loads for diagnosis of Hepatitis C infection in human plasma samples using Raman spectroscopy coupled with partial least squares regression analysis. J. Raman Spectrosc. 2017, 48, 697–704.
19. Höskuldsson, A. PLS regression methods. J. Chemom. 1988, 2, 211–228.
20. Sharma, V.; Kumar, R. Trends of chemometrics in bloodstain investigations. Trends Anal. Chem. 2018, 107, 181–195.
21. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning; Springer: New York, NY, USA, 2013.
22. Alfeilat, H.A.A.; Hassanat, A.B.A.; Lasassmeh, O.; Tarawneh, A.S.; Alhasanat, M.B.; Eyal Salman, H.S.; Prasath, V.B.S. Effects of Distance Measure Choice on K-Nearest Neighbor Classifier Performance: A Review. Big Data 2019, 7, 221–248.
23. Miller, J.N.; Miller, J.C. Statistics and Chemometrics for Analytical Chemistry, 6th ed.; Pearson Education Limited: Edinburgh Gate, UK, 2016.
24. Boonamnuay, S.; Kerdprasop, N.; Kerdprasop, K. Classification and Regression Tree with Resampling for Classifying Imbalanced Data. Int. J. Mach. Learn. Comput. 2018, 8, 336–340.
25. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; Wiley: Somerset, NJ, USA, 2012.
26. Maimon, O.; Rokach, L. Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2010.
27. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers, Inc.: Los Altos, CA, USA, 1993.
28. Zhang, C.; Ma, Y. Ensemble Machine Learning: Methods and Applications; Springer: New York, NY, USA, 2012.
29. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152.
30. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
31. Lampropoulos, A.S.; Tsihrintzis, G.A. Machine Learning Paradigms; Springer International Publishing: New York, NY, USA, 2015.
32. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; The MIT Press: Cambridge, MA, USA, 2017.
33. Clarke, B.; Fokoue, E.; Zhang, H.H. Principles and Theory for Data Mining and Machine Learning; Springer: New York, NY, USA, 2009.
34. Zhang, Y.; Li, J.; Hong, M.; Man, Y. Applications of Artificial Intelligence in Process Systems Engineering; Elsevier: Amsterdam, The Netherlands, 2021.
35. Fordellone, M.; Vichi, M. Finding groups in structural equation modeling through the partial least squares algorithm. Comput. Stat. Data Anal. 2020, 147, 106957.
36. Ruiz-Perez, D.; Guan, H.; Madhivanan, P.; Mathee, K.; Narasimhan, G. So you think you can PLS-DA? BMC Bioinform. 2020, 21, 2.
37. Popovic, A.; Morelato, M.; Roux, C.; Beavis, A. Review of the most common chemometric techniques in illicit drug profiling. Forensic Sci. Int. 2019, 302, 109911.
38. Basheer, I.A.; Hajmeer, M. Artificial neural networks: Fundamentals, computing, design, and application. J. Microbiol. Methods 2001, 43, 3–31.
39. Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 2020, 17, 168–192.
40. Japkowicz, N. Why question machine learning evaluation methods? In AAAI 2006 Workshop on Evaluation Methods for Machine Learning; AAAI: Menlo Park, CA, USA, 2006; pp. 6–11.
41. Hand, D.J.; Till, R.J. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Mach. Learn. 2001, 45, 171–186.
42. Huang, J.; Ling, C.X. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 2005, 17, 299–310.
43. Gauglitz, G.; Vo-Dinh, T. Handbook of Spectroscopy; Wiley: Hoboken, NJ, USA, 2001.
44. Banwell, C.N. Fundamentals of Molecular Spectroscopy; McGraw-Hill: London, UK, 1983.
45. Santos, M.C.D.; Nascimento, Y.M.; Monteiro, J.D.; Alves, B.E.B.; Melo, M.F.; Paiva, A.A.P.; Pereira, H.W.B.; Medeiros, L.G.; Morais, I.C.; Fagundes Neto, J.C.; et al. ATR-FTIR spectroscopy with chemometric algorithms of multivariate classification in the discrimination between healthy vs. dengue vs. chikungunya vs. zika clinical samples. Anal. Methods 2018, 10, 1280–1285.
46. Naseer, K.; Ali, S.; Mubarik, S.; Hussain, I.; Mirza, B.; Qazi, J. FTIR spectroscopy of freeze-dried human sera as a novel approach for dengue diagnosis. Infrared Phys. Technol. 2019, 102, 102998.
47. Roy, S.; Perez-Guaita, D.; Bowden, S.; Heraud, P.; Wood, B.R. Spectroscopy goes viral: Diagnosis of hepatitis B and C virus infection from human sera using ATR-FTIR spectroscopy. Clin. Spectrosc. 2019, 1, 100001.
48. Zlotogorski-Hurvitz, A.; Dekel, B.Z.; Malonek, D.; Yahalom, R.; Vered, M. FTIR-based spectrum of salivary exosomes coupled with computational-aided discriminating analysis in the diagnosis of oral cancer. J. Cancer Res. Clin. Oncol. 2019, 145, 685–694.
49. Yue, F.; Chen, C.; Yan, Z.; Chen, C.; Guo, Z.; Zhang, Z.; Chen, Z.; Zhang, F.; Lv, X. Fourier transform infrared spectroscopy combined with deep learning and data enhancement for quick diagnosis of abnormal thyroid function. Photodiagnosis Photodyn. Ther. 2020, 32, 101923.
50. Chen, F.; Meng, C.; Qu, H.; Cheng, C.; Chen, C.; Yang, B.; Gao, R.; Lv, X. Human serum mid-infrared spectroscopy combined with machine learning algorithms for rapid detection of gliomas. Photodiagnosis Photodyn. Ther. 2021, 35, 102308.
51. Elkadi, O.A.; Hassan, R.; Elanany, M.; Byrne, H.J.; Ramadan, M.A. Identification of Aspergillus species in human blood plasma by infrared spectroscopy and machine learning. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 248, 119259.
52. Naseer, K.; Ali, S.; Qazi, J. ATR-FTIR spectroscopy based differentiation of typhoid and dengue fever in infected human sera. Infrared Phys. Technol. 2021, 114, 103664.
53. Yang, C.; Guang, P.; Li, L.; Song, H.; Huang, F.; Li, Y.; Wang, L.; Hu, J. Early rapid diagnosis of Alzheimer’s disease based on fusion of near- and mid-infrared spectral features combined with PLS-DA. Optik 2021, 241, 166485.
54. Naseer, K.; Amin, A.; Saleem, M.; Qazi, J. Raman spectroscopy based differentiation of typhoid and dengue fever in infected human sera. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2019, 206, 197–201.
55. Khan, S.; Ullah, R.; Khan, A.; Wahab, N.; Bilal, M.; Ahmed, M. Analysis of dengue infection based on Raman spectroscopy and support vector machine (SVM). Biomed. Opt. Express 2016, 7, 2249.
56. Khan, S.; Ullah, R.; Khan, A.; Ashraf, R.; Ali, H.; Bilal, M.; Saleem, M. Analysis of hepatitis B virus infection in blood sera using Raman spectroscopy and machine learning. Photodiagnosis Photodyn. Ther. 2018, 23, 89–93.
57. Khan, S.; Ullah, R.; Ashraf, R.; Khan, A.; Khan, S.; Ahmad, I. Optical screening of hepatitis-B infected blood sera using optical technique and neural network classifier. Photodiagnosis Photodyn. Ther. 2019, 27, 375–379.
58. Cheng, H.; Xu, C.; Zhang, D.; Zhang, Z.; Liu, J.; Lv, X. Multiclass identification of hepatitis C based on serum Raman spectroscopy. Photodiagnosis Photodyn. Ther. 2020, 30, 101735.
59. Lu, H.; Tian, S.; Yu, L.; Lv, X.; Chen, S. Diagnosis of hepatitis B based on Raman spectroscopy combined with a multiscale convolutional neural network. Vib. Spectrosc. 2020, 37, 103038.
60. Gao, R.; Yang, B.; Chen, C.; Chen, F.; Chen, C.; Zhao, D.; Lv, X. Recognition of chronic renal failure based on Raman spectroscopy and convolutional neural network. Photodiagnosis Photodyn. Ther. 2021, 34, 102313.
61. Lü, G.; Zheng, X.; Lü, X.; Chen, P.; Wu, G.; Wen, H. Label-free detection of echinococcosis and liver cirrhosis based on serum Raman spectroscopy combined with multivariate analysis. Photodiagnosis Photodyn. Ther. 2021, 33, 102164.
62. Ryzhikova, E.; Ralbovsky, N.M.; Sikirzhytski, V.; Kazakov, O.; Halamkova, L.; Quinn, J.; Zimmerman, E.A.; Lednev, I.K. Raman spectroscopy and machine learning for biomedical applications: Alzheimer’s disease diagnosis based on the analysis of cerebrospinal fluid. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 248, 119188.
63. Paraskevaidi, M.; Morais, C.L.M.; Lima, K.M.G.; Ashton, K.M.; Stringfellow, H.F.; Martin-Hirsch, P.L.; Martin, F.L. Potential of mid-infrared spectroscopy as a non-invasive diagnostic test in urine for endometrial or ovarian cancer. Analyst 2018, 143, 3156–3163.

**Figure 1.**Region of the electromagnetic spectrum along with the corresponding type of quantum changes [44].

**Figure 2.**Algorithm mentioned for disease classification for IR and Raman spectroscopy in this study.

**Figure 3.**Performance comparison of LDA, PCA-SVM, and PCA-RF in classifying ATR-FTIR data of ovarian cancer in urine.

**Table 1.** Kernel functions commonly used in SVM.

Kernel | Formula |
---|---|
Linear | $K\left({x}_{i},{x}_{j}\right)={x}_{i}^{T}{x}_{j}$ |
Radial Basis Function (RBF) | $K\left({x}_{i},{x}_{j}\right)=exp\left(-\gamma\Vert {x}_{i}-{x}_{j}\Vert^{2}\right),\ \gamma>0$ |
Polynomial | $K\left({x}_{i},{x}_{j}\right)={\left(\gamma {x}_{i}^{T}{x}_{j}+r\right)}^{d},\ \gamma>0$ |
Sigmoid | $K\left({x}_{i},{x}_{j}\right)=tanh\left(\gamma {x}_{i}^{T}{x}_{j}+r\right)$ |

**Table 2.**Overview of selected past studies on pattern recognition for disease classification using IR spectroscopy.

Year | Disease (Sample) | Aim of Research | EDA Algorithm | Classification Algorithm | Findings | References |
---|---|---|---|---|---|---|
2017 | Dengue (Human blood and human sera) | Identification of DENV-3 in different concentrations in blood and serum | PCA, SPA, GA | LDA | Blood samples yield the best results, with sensitivity and specificity of 100% | [6] |
2018 | Dengue, Chikungunya and Zika (Human blood) | Discrimination between healthy vs. dengue vs. chikungunya vs. zika clinical samples | PCA, SPA, GA | LDA | PCA-LDA: Sensitivity: 100%; Specificity: 92%. SPA-LDA: Sensitivity: 100%; Specificity: 92%. GA-LDA: Sensitivity: 92%; Specificity: 86% | [45] |
2019 | Dengue (Freeze-dried human blood serum) | Discrimination of dengue positive and healthy serum samples by detected biochemical differences | PCA | LDA, PLSR | Sensitivity: 89%; Specificity: 95%; R^{2} = 0.9980 | [46] |
2019 | Hepatitis B and Hepatitis C (Human blood serum) | Classification of human serum samples based on the presence of Hepatitis B virus and Hepatitis C virus infection | None | PLS-DA | Method 1: HBV vs. control: Sensitivity: 69.4%; Specificity: 73.7%. HCV vs. control: Sensitivity: 51.3%; Specificity: 90.9%. Method 2: HBV vs. control: Sensitivity: 84.4%; Specificity: 93.1%. HCV vs. control: Sensitivity: 80.0%; Specificity: 97.2%. HBV vs. HCV: Sensitivity: 77.4%; Specificity: 83.3%. Method 3: HBV vs. control (high molecular concentrate): Sensitivity: 87.5%; Specificity: 94.9%. HCV vs. control (high molecular concentrate): Sensitivity: 81.6%; Specificity: 89.6% | [47] |
2019 | Oral cancer (Saliva in pellet form) | Discrimination of FTIR spectra of salivary exosomes from oral cancer patients and healthy individuals | PCA | LDA, SVM | LDA: Accuracy: 95%; Sensitivity: 100%; Specificity: 89%. SVM: Training accuracy: 100%; Cross-validation accuracy: 89% | [48] |
2020 | Abnormal thyroid function, specifically hypothyroidism and hyperthyroidism (Human blood) | Diagnosis of abnormal thyroid function in human blood samples | None | MLP, LSTM, CNN | Accuracy (original data): MLP: 91.3%; LSTM: 88.6%; CNN: 89.3%. Accuracy (with data enhancement): MLP: 92.7%; LSTM: 93.6%; CNN: 95.1% | [49] |
2021 | Human glioma (Human blood serum) | Identification of patients with gliomas | PCA | PSO-SVM, BPNN, DT | Accuracy: PSO-SVM: 92.00%; BPNN: 91.83%; DT: 87.20% | [50] |
2021 | Aspergillus species (Human blood plasma in KBr pellet form) | Identification of Aspergillus species in human blood plasma | None | Two models based on PLS-DA | Total accuracy of 84.4%; oversampling and autoscaling improved this accuracy to 93.3% | [51] |
2021 | Typhoid and Dengue (Human blood serum) | Differentiation of typhoid and dengue fever in infected human sera | PCA | LDA | Accuracy: 100% | [52] |
2021 | Alzheimer’s Disease (Human blood serum) | Diagnosis of individuals with Alzheimer’s disease | PCA | PLS-DA | NIR: Accuracy: 80.6%; Sensitivity: 93.3%; Specificity: 93.7%. MIR: Accuracy: 98.5%; Sensitivity: 97.5%; Specificity: 100% | [53] |

**Table 3.**Overview of selected past studies on pattern recognition for disease classification using Raman spectroscopy.

Year | Disease (Sample) | Aim of Research | EDA Algorithm | Classification Algorithm | Findings | References |
---|---|---|---|---|---|---|
2016 | Dengue virus (Human blood sera) | Classification of dengue infected and normal healthy sera | PCA | SVM with three different kernel functions (RBF, polynomial, and linear) | Best: SVM (RBF kernel with the polynomial kernel of order 1): Accuracy: 85%; Sensitivity: 73%; Specificity: 93% | [55] |
2017 | Hepatitis C (Human plasma) | Identification of the biochemical changes associated with the presence of the Hepatitis C virus in infected human blood plasma samples versus healthy samples | PCA | LDA | PCA-LDA: Sensitivity: 98.8%; Specificity: 98.6% | [18] |
2018 | Hepatitis B (Human blood) | Classification of hepatitis B virus infection in human blood serum | PCA | SVM with two different kernels, each with two different implementation methods | Best: SVM (RBF kernel with the polynomial kernel of order 2): Accuracy: 98%; Sensitivity: 100%; Specificity: 95% | [56] |
2019 | Hepatitis B (Human blood serum) | Classification of normal sera and pathological (hepatitis B infected) sera | None | NNC | Accuracy: 99.3%; Sensitivity: 99.2%; Specificity: 99.4% | [57] |
2019 | Dengue and Typhoid (Human blood serum) | Differentiation between typhoid and dengue, which share some symptom similarity | PCA | LDA | The PCA-LDA model yielded sufficient diagnostic accuracy and sensitivity | [54] |
2020 | Hepatitis C (Human blood serum) | Classification of three kinds of serum Raman spectra (healthy people, HCV1 patients, and HCV2 patients) | None | SVM | Total accuracy rate: 91.1% | [58] |
2020 | Hepatitis B (Human blood serum) | Identification of patients infected with hepatitis B virus in human blood serum | PCA | LDA, KNN, SVM, RF, ANN, MSCIR | Accuracy: LDA: 80.77%; KNN: 77.69%; SVM: 89.23%; RF: 86.92%; ANN: 91.53%; MSCIR: 96.15% | [59] |
2021 | Chronic renal failure (CRF) (Human blood serum) | Classification of normal blood serum and CRF infected blood serum | None | CNN, improved AlexNet | Improved AlexNet: Accuracy: 79.44%. CNN: Accuracy: 95.22% | [60] |
2021 | Echinococcosis and liver cirrhosis (Human blood serum) | Discrimination of echinococcosis and liver cirrhosis from healthy volunteers | PCA | LDA | Overall accuracy: 87.7%. Healthy volunteers: Sensitivity: 92.5%; Specificity: 93.2%. Patients with echinococcosis: Sensitivity: 81.5%; Specificity: 96.1%. Liver cirrhosis: Sensitivity: 89.1%; Specificity: 92.4% | [61] |
2021 | Alzheimer’s Disease (Cerebrospinal fluid (CSF)) | Differentiation of CSF samples obtained from Alzheimer’s disease patients and healthy controls | PCA, GA | ANN, SVM-DA | Sensitivity & Specificity: 84% | [62] |

Classification Model | Accuracy (%) | Sensitivity (%) | Specificity (%) |
---|---|---|---|
LDA | 98.75 | 100.00 | 97.50 |
PCA-SVM | 97.50 | 97.50 | 97.50 |
PCA-RF | 95.00 | 97.50 | 92.50 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hasbi, N.H.; Bade, A.; Chee, F.P.; Rumaling, M.I.
Pattern Recognition for Human Diseases Classification in Spectral Analysis. *Computation* **2022**, *10*, 96.
https://doi.org/10.3390/computation10060096
