A Review of Machine Learning Methods Recently Applied to FTIR Spectroscopy Data for the Analysis of Human Blood Cells

Machine learning (ML) is a broad term encompassing several methods that allow us to learn from data. These methods may permit large real-world databases to be more rapidly translated to applications to inform patient–provider decision-making. This paper presents a review of articles published between 2019 and 2023 that discuss the use of Fourier transform infrared (FTIR) spectroscopy and ML for human blood analysis. The literature review was conducted to identify published research that combined ML with FTIR to distinguish between pathological and healthy human blood cells. A search strategy was implemented, and studies meeting the eligibility criteria were evaluated. Relevant data related to the study design, statistical methods, and strengths and limitations were identified. A total of 39 publications from the last 5 years (2019–2023) were identified and evaluated for this review. Diverse methods, statistical packages, and approaches were used across the identified studies. The most common methods included support vector machine (SVM) and principal component analysis (PCA) approaches. Most studies applied internal validation and employed more than one algorithm, while only four studies applied one ML algorithm to the data. A wide variety of approaches, algorithms, statistical software, and validation strategies were employed in the application of ML methods. There is a need to ensure that multiple ML approaches are used, that the model selection strategy is clearly defined, and that both internal and external validation are performed, so that the discrimination of human blood cells is supported by the strongest available evidence.


Introduction
As society continues to evolve, the importance of healthcare as a crucial pillar becomes more evident with each passing day. Providing and improving healthcare services has become an essential target, as it plays a vital role in supporting other societal pillars. Automation and medical devices are an integral part of healthcare services, and their significance has increased with advancements in technology and communication. It is now a prominent time for the healthcare industry to revolutionize the way it delivers services to society, given the rising number of diseases and epidemics.
With the population increasing every second, the demand for laboratory testing has also surged, requiring more health experts to attend to patient analysis and reporting. Particularly, in the case of complex elements such as blood, there is an urgent need for fast and accurate technology to provide initial indications of a patient's status [1,2].
An adult human body contains approximately five liters of blood, with blood cells comprising nearly 45% of the blood tissue volume. Blood cells are categorized into three types: red blood cells (RBCs), white blood cells (WBCs), and platelets. In particular, the WBCs include basophils, lymphocytes, neutrophils, monocytes, and eosinophils. RBCs serve as the primary means of transporting oxygen, WBCs play a crucial role in the immune system by fighting against diseases, and platelets aid in the coagulation process, promoting wound healing through scab formation. Both physiological and pathological changes can affect the composition of blood, which is a crucial clinical factor to consider [3].
Blood tests have emerged as a direct means of detecting an individual's health status or diagnosing illnesses. Complete blood cell (CBC) counting is a traditional blood test that involves identifying and counting basic blood cells to examine, monitor, and manage variations in blood. Although this technology has been used since the last century, it has become burdensome for the healthcare sector due to the time, power, and reagents required [4].
It is now time to explore new technologies for analyzing human blood. For instance, Fourier transform infrared (FTIR) spectroscopy is a powerful, low-cost, and fast analysis tool. It probes the molecular structure of a substance by measuring the specific infrared frequencies that the molecule absorbs, which correspond to vibrational transitions of its bonds or functional groups [4]. It has been used for human blood analysis. However, the interpretation of the output FTIR data is, in most cases, performed by human intervention. To overcome this limitation, researchers have combined FTIR spectroscopy analysis with machine learning (ML) to distinguish between pathological and healthy human blood cells.
Thus, this paper aims to review articles that discuss the use of FTIR spectroscopy and ML for human blood analysis. Such a combined tool can benefit healthcare providers: FTIR relies on precise and specialized equipment, and ML algorithms can automate the interpretation of its output, reducing the need for manual analysis by specialists.
Artificial intelligence (AI) is a broad scientific endeavor that involves developing computational systems to simulate human intelligence and enable problem-solving. ML and deep learning (DL) are subfields of AI, with the study herein presented focusing on ML. It is essential to understand the concept of AI before delving into each of these subfields, including their differences and applications. AI provides a means of automating various tasks without human intervention, making it a sought-after solution in many industries, including homes [5].
ML involves the use of computational tools and methods designed for specific goals to solve problems. In situations where the data input and the desired result are known, but the method to get there is unknown, the ML model can forecast and provide a solution. Therefore, ML presents a different approach from the traditional programming methods where the parameters and approaches are known [6].
ML models undergo data preparation, training, and testing stages. Data preprocessing involves examining and modifying data to improve model comprehension by removing irrelevant information and changing the format. To forecast any value or outcome, the ML model must first be trained using a specified approach. The user responsible for training must provide a substantial amount of data, as well as the expected results for each data point, along with training parameters and settings. This way, the model can learn to understand and interpret the data, making accurate predictions about future outcomes [7,8].
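The three stages described above (data preparation, training, and testing) can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies: the synthetic "spectra", the band positions, the 70/30 split, and the nearest-centroid classifier are all assumptions chosen only to keep the example self-contained and dependent on the Python standard library alone.

```python
import random
from statistics import mean


def make_toy_spectra(n_per_class, n_points=20, seed=0):
    """Synthetic stand-ins for spectra (hypothetical data, not real FTIR).

    Every sample has a band at index 10; class 1 has an extra band at index 5.
    """
    rng = random.Random(seed)
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            spectrum = [rng.gauss(0.0, 0.05) for _ in range(n_points)]
            spectrum[10] += 1.0
            if label == 1:
                spectrum[5] += 0.5  # assumed class-specific band
            data.append((spectrum, label))
    return data


def preprocess(spectrum):
    """Data preparation: min-max scale each spectrum to the range [0, 1]."""
    lo, hi = min(spectrum), max(spectrum)
    return [(x - lo) / (hi - lo) for x in spectrum]


def train_nearest_centroid(samples):
    """Training: store the mean preprocessed spectrum of each class."""
    centroids = {}
    for label in {lab for _, lab in samples}:
        members = [s for s, lab in samples if lab == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def predict(centroids, spectrum):
    """Assign the label of the closest class centroid (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], spectrum))


# Prepare, shuffle, and split the data into training and held-out test sets.
data = [(preprocess(s), lab) for s, lab in make_toy_spectra(30)]
random.Random(1).shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

# Train on one part, then test on spectra the model has never seen.
model = train_nearest_centroid(train)
accuracy = mean(1.0 if predict(model, s) == lab else 0.0 for s, lab in test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The key design point is that the test spectra play no part in training, so the reported accuracy estimates how the model behaves on new samples rather than on the data it was fitted to.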
This paper is organized into 5 sections. Section 2 presents the methods used in the review, and how the papers have been collected and selected. Section 3 presents the results obtained from the collected data, which are analyzed and discussed in Section 4. Finally, Section 5 displays the paper's conclusions and suggestions for future work.

Method
For this paper, a review was conducted using the standard methodology illustrated in Figure 1. The following databases were searched for peer-reviewed journal articles published between 2019 and 2023: PubMed, MEDLINE, PMC, ScienceDirect, and Web of Science. Before 2019, the number of papers reporting ML methods applied to FTIR spectroscopy was low (the authors identified only four papers under this criterion) and, consequently, not significant [9–12], so this period was not considered in the search process. After 2019, the number of papers increased dramatically, and the quality of the data showed promising outcomes [13,14], making enough data available for a more detailed review. In addition to the database search, relevant articles were manually identified by reviewing the reference lists of the included articles, which also served to assess how effective reference lists are for identifying additional relevant studies. The search terms used were "Machine learning", "Machine learning AND FTIR OR Attenuated Total Reflectance (ATR)-FTIR spectroscopy", and "Machine learning AND FTIR OR ATR-FTIR spectroscopy AND human blood". No language restrictions were applied, and Excel software was used to store the articles retrieved from the databases and screen for duplicates. Eligible studies included those that analyzed prospective or retrospective observational data and reported quantitative results for the experiment and an evaluation of the method used. The data from the papers were separated into two tables: Table 1 describes the targeted disease, the study type, the sample source methodology, and the FTIR analysis, among other details, while Table 2 describes the ML methods, metrics, and validation approach.
The information extracted from the articles was analyzed descriptively and qualitatively and grouped into categories such as study characteristics, diseases studied, statistical methodologies employed, software packages, as well as the strengths and weaknesses of the reported studies. The findings were interpreted based on these categories (Section 3, Results).

Results
The search methodology described in Section 2 was conducted to identify articles published in the last five years (2019–2023) that combined FTIR spectroscopy and ML methods to distinguish between human blood cells. A total of 39 eligible studies were identified, and their characteristics, along with patient data, were summarized based on the keywords used in the search. Figure 2 illustrates that, in the last five years, there has been a growing interest among scientists and researchers in the medical field in the application of AI and ML for FTIR data analysis. These studies have demonstrated the potential of AI and ML to improve healthcare services, particularly in the area of disease diagnosis based on human blood.
Figure 2. Number of articles, in the last 5 years, relating to ML and FTIR spectroscopy applied to human blood cells.
Table 1 summarizes, in chronological order, the data from the 39 selected papers. It includes information about the targeted diseases, the criteria followed in the data collection process, the methodology used to collect the samples, and the sample size, as well as the software used and the positive and negative outcomes of each reported study.

Summary of the Eligible Publications
The presented data (see Figure 2) show that in 2022, the number of papers published on the application of ML and FTIR for diagnosing multiple diseases reached an all-time high. The growth was particularly evident after the pandemic in 2020, which acted as a catalyst for the adoption of ML and FTIR technologies. The majority of the data sources used in these papers were based on real experimental datasets or designs, accounting for 75% of the total, while the remaining 25% primarily utilized electronic health records (Table 1). Table 2 lists, also chronologically, the reported ML methods, the metrics used to evaluate them, and the internal validation procedures followed in the selected studies (according to the papers mentioned in Table 1).
Table 1 (excerpt, FTIR analysis): Scanning range from 4000 to 600 cm⁻¹, with an average of 32 scans, at a spectral resolution of 4 cm⁻¹. Analysis parameter: ratios between the baseline-corrected and normalized spectra (using the Savitzky-Golay algorithm).
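The preprocessing steps named above (Savitzky-Golay filtering, baseline correction, and normalization) can be sketched as follows. This is one plausible reading under stated assumptions, not the procedure of any specific study: the 5-point quadratic window, the straight-line baseline through the end points, the vector (L2) normalization, and the synthetic single-band spectrum are all illustrative choices.

```python
import math

# Standard 5-point quadratic/cubic Savitzky-Golay smoothing coefficients.
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(y):
    """Smooth interior points with the 5-point filter; keep the 2 edge points."""
    out = list(y)
    for i in range(2, len(y) - 2):
        out[i] = sum(c * y[i + k - 2] for k, c in enumerate(SG5))
    return out


def subtract_linear_baseline(y):
    """Remove the straight line joining the first and last points (crude baseline)."""
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return [v - (y[0] + slope * i) for i, v in enumerate(y)]


def vector_normalize(y):
    """Scale the spectrum so that its Euclidean norm equals 1."""
    norm = math.sqrt(sum(v * v for v in y))
    return [v / norm for v in y]


# Hypothetical spectrum: sloping baseline plus one Gaussian absorption band.
raw = [0.10 + 0.01 * i + math.exp(-((i - 10) ** 2) / 8.0) for i in range(21)]
processed = vector_normalize(subtract_linear_baseline(savgol5(raw)))
```

In practice, the window length, polynomial order, and baseline model are tuning choices that a study should report, since they change the band intensities that the ML model later sees.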

ML Methods, Metrics, and Internal Validation in the Selected Papers
Table 1 (excerpt, outcome): The spectroscopic method proved to be an effective tool to identify toxicological changes in the blood and serum of individuals with substance use disorder.
Table 2 (excerpt, validation): Leave-one-out cross-validation, root mean square error of cross-validation, and cross-validation.
Table 2 outlines the various ML algorithms used for classification or prediction in the selected studies, including SVM (featured in 24 studies), PCA (featured in 17 studies), KNN and XGB [50] (featured in 6 and 5 studies, respectively), RF [51] (featured in 11 studies), PCA-LDA and OPLS-DA (featured in 5 and 4 studies, respectively), and LDA [52] (featured in 6 studies). The other algorithms/methods presented in the studies were only reported once. Figure 3 reveals that, in addition to the commonly used algorithms, other methods were employed, such as DT, BPNN, MLP, NB, LR [53], and a novel Bayesian approach, as well as various analytical approaches, including HCA, PCC, and PPV.
The figure shows that SVMs are frequently used in the biological field. SVMs are among the most powerful classifiers in ML and can be applied when a dataset is divided into two classes in a high-dimensional feature space, which is the nature of biological cell data.
Most biological genomic data are high-dimensional, heterogeneous, and noisy. This makes methods such as SVMs, PCA, and RFs more suitable for the biological field than others such as DL and PCC.
In addition, as summarized in Table 2, almost all the publications (n = 33, 92%) utilized two or more methods, and fewer than eight percent (n = 6, 8%) applied a single ML algorithm.
Additionally, Figure 4 presents further information to support a visual analysis of the relationship between the type of disease and the ML methods used for its classification.

Internal and External Validation
The evaluation of the reported studies' publication quality identified the most common gap in publications as the lack of external validation, which was conducted by only two studies [13,49]. Twelve of the reported studies predefined the success criteria for model performance [15,19,22,25,26,36,38,41,43,47,48] and nine studies discussed the generalizability of the model [14,17,20,27,29,33,35,42,46]. All the studies, except one [37], discussed the balance between model accuracy and model sensitivity and specificity.
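The difference between internal and external validation can be made concrete with a small sketch. Leave-one-out cross-validation is used here as the internal procedure because it is one of the schemes reported in the selected studies; the 1-nearest-neighbor classifier, the two-feature samples, and the separate "external" set are hypothetical and serve only to illustrate the mechanics.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_label(train, x):
    """Return the label of the training sample closest to x (1-NN)."""
    return min(train, key=lambda item: sq_dist(item[0], x))[1]


def leave_one_out_accuracy(samples):
    """Internal validation: hold out each sample once, train on the rest."""
    correct = 0
    for i, (x, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(rest, x) == label:
            correct += 1
    return correct / len(samples)


def external_accuracy(train, external):
    """External validation: train on one cohort, test on an independent one."""
    hits = sum(nearest_neighbor_label(train, x) == label for x, label in external)
    return hits / len(external)


# Hypothetical two-feature samples (e.g., two band intensities per spectrum).
samples = [
    ([0.10, 0.90], "healthy"),
    ([0.15, 0.85], "healthy"),
    ([0.20, 0.95], "healthy"),
    ([0.80, 0.20], "pathological"),
    ([0.85, 0.10], "pathological"),
    ([0.90, 0.25], "pathological"),
]
# Hypothetical independent cohort, e.g., collected at another site.
external = [([0.12, 0.80], "healthy"), ([0.75, 0.30], "pathological")]

loo_accuracy = leave_one_out_accuracy(samples)
ext_accuracy = external_accuracy(samples, external)
```

Internal validation reuses the same cohort for fitting and checking, so it cannot reveal site-specific or instrument-specific effects; only an independent dataset can, which is why the lack of external validation is treated as a gap.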

Strengths and Weaknesses
The authors of the selected articles noted both strengths and weaknesses in the used ML methods. Overall, the simplicity and low complexity of ML methods were recognized as strengths, as they are powerful and efficient tools for handling large datasets. However, one article highlighted that the effectiveness of ML is highly dependent on proper method selection and parameter optimization and that these steps are essential for obtaining accurate estimates [25].
Even with careful planning and despite their advantages, ML approaches still present several limitations, which warrant attention in future studies. Overfitting, which can occur when a model is made too detailed for the data available, was identified as a weakness. Other limitations stem from the quality and availability of the data sources used, such as incomplete variable sets or missing data, which can negatively affect model development and performance. Retrospective database studies were identified as particularly vulnerable to the lack of relevant variables, as researchers are limited to recorded data. Finally, the lack of external validation was noted as a limitation of the studies reviewed in this analysis.

Discussion
In this review, we examined the methods and approaches used for ML in the context of observational datasets related to the FTIR or ATR-FTIR analysis of human blood cell conditions. While ML methods have been applied more broadly in recent years, our review focused specifically on studies that utilized FTIR or ATR-FTIR spectroscopies and human blood cells, and therefore, our findings may not be applicable to all ML methods. Our primary objective was to explore the potential of ML methods in distinguishing between healthy and pathological cells on a large scale, not limited to a specific disease, to improve healthcare services and provide physicians with reliable evidence applicable to individual patients. This review aims to provide guidance and best practices for the use of ML methods in discriminating between human blood cells, with the goal of improving their effectiveness and increasing their use in generating data and models for healthcare decision-making. The used methods represent a single point on a potentially wide distribution, meaning that any cell spectrum could fall anywhere within that distribution and may be far from the point that distinguishes between healthy and pathological conditions.
Multiple algorithms were used in the majority of the articles, although in some articles single modeling methodologies were considered; this underscores the importance of selecting and developing ML algorithms, particularly considering recent advances in analytics capabilities.
The FTIR analysis presented in Table 1 shows that all the studies discussed utilized the mid-IR domain, specifically the wavenumber range of 400 to 4000 cm⁻¹, with spectral resolutions between 2 and 8 cm⁻¹, the most common being 4 cm⁻¹. While the basic IR parameters were consistent across the studies, the analysis varied in terms of the preprocessed spectral region, i.e., the main regions of interest. This variation is primarily driven by the specific wavenumbers being targeted, which in turn determine the band intensities observed in the selected preprocessed region. While a single model may sometimes produce accurate results that match the data well, creative methods can be used to support the model's certainty. It is advised that this be adopted as a best practice in the future and be used as an extra criterion to evaluate the caliber of research among ML algorithms.
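Quantities expressed in cm⁻¹ are wavenumbers, and the corresponding wavelengths follow from a simple reciprocal relation. The short sketch below shows the conversion for the mid-IR limits quoted above; the function name is an illustrative choice.

```python
def wavenumber_to_wavelength_um(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in micrometers.

    1 cm = 10,000 micrometers, so wavelength_um = 10,000 / wavenumber_cm1.
    """
    return 10_000.0 / wavenumber_cm1


short_edge_um = wavenumber_to_wavelength_um(4000)  # high-wavenumber limit
long_edge_um = wavenumber_to_wavelength_um(400)    # low-wavenumber limit
print(short_edge_um, long_edge_um)  # prints 2.5 25.0
```

The mid-IR window of 4000 to 400 cm⁻¹ therefore corresponds to wavelengths of 2.5 to 25 µm.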
The methods used in each publication performed differently depending on their inputs, varied in their metrics and internal validations, and gave different results describing the potential and significance of using these methods in the medical field.
SVMs and PCA are the two most frequently used ML methods in the biological field, with SVMs being used in 25 of the 39 selected papers and PCA in 17. Other methods such as RFs, LDA, and XGB also play a big part in the biological field as methods for assessing biological cells and enhancing outcomes.
Cancer seems to be the main targeted disease, with the aim of developing a method that can detect carcinogenic cells at the nuclear stage (10 types of ML methods were used in the 39 selected papers), followed by inflammatory diseases in general.
These findings demonstrate the power of basic ML algorithms as applied intelligence tools to distinguish the complex vibrational spectra of, for example, cancer patients from those of healthy patients. These experimental methods show promise as a valid and efficient liquid biopsy for artificial-intelligence-assisted early cancer screening, as shown in Figure 5.
Figure 5. Metrics to evaluate the average ML performance in FTIR applied to different diseases. For each disease, the most common ML methods are reported. The blue column represents the average accuracy, the red column represents the average sensitivity, and the green column represents the average specificity.
By combining FTIR spectroscopy with deep learning, it was possible not only to differentiate between allergic and healthy patients but also to stratify and treat patients, indicating its potential for monitoring the efficacy of SIT in individual patients. However, further investigation is needed to determine whether FTIR-spectroscopic-based identification of allergic status is limited to adults or can also be applied to samples collected from other groups, such as children.
The classification process utilizing PLS and DNNs [54] algorithms shows promising results in distinguishing COVID-19 patients from healthy individuals. The extraction process revealed numerous features of the FTIR signal and combining all these features achieved effective accuracy values.
The results indicate that using these 16 features could be valuable in accurately classifying COVID-19 patients and healthy subjects. The proposed method for predicting biological contour shape enables the application of genomic selection to rice grain shape improvement, as depicted in Figure 5.
Moreover, Figure 5 shows that the neural network models have successfully identified unique infrared absorption spectra in many diseases, such as cancer, allergies, thyroid function, and COVID-19. These spectra can effectively distinguish between malignant and benign breast tissues.
The ML classification that used SVMs [55] to distinguish between pathological and healthy patients for multiple diseases, such as malaria, Alzheimer's disease, and miscarriage, achieved a sensitivity of 95–100% (with three false negatives for each disease) and a specificity of 95–97% (with two false positives for each disease). The method had good predictability with a low error rate and can be sufficiently accurate for the analysis of blood cells, comparable to the reference method. The suggested method is simple, fast, and economical, and does not require polluting solvents or expensive equipment. It is worth noting that one of the false positives was due to a Plasmodium species other than Plasmodium falciparum or Plasmodium vivax, which was not detected by the PCR primers employed. Therefore, the results may actually be better than reported. The study also demonstrated that ATR-FTIR spectroscopy can be a reliable and efficient diagnostic tool for miscarriage, with the potential to be used at the point of care in tropical field conditions. The spectra can be analyzed via a cloud-based system, which makes it easily accessible for mass screening. This approach is highly sensitive, selective, and portable, and requires little logistical support, which makes it a potentially outstanding tool for malaria elimination programs. Currently, the experimental program is focused on reducing the sample requirements to fingerprick volumes [8].
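Sensitivity, specificity, and accuracy, the metrics quoted above and summarized in Figure 5, are all derived from the four counts of a confusion matrix. The sketch below shows the standard definitions. The counts in the usage example are hypothetical: the reviewed studies' cohort sizes are not given here, so the numbers only illustrate how a handful of false negatives and false positives translate into percentages.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard diagnostic metrics from confusion-matrix counts.

    tp: pathological samples correctly flagged (true positives)
    fn: pathological samples missed (false negatives)
    tn: healthy samples correctly cleared (true negatives)
    fp: healthy samples wrongly flagged (false positives)
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of patients detected
        "specificity": tn / (tn + fp),  # fraction of healthy subjects cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Hypothetical cohort: 60 patients (3 missed) and 60 healthy subjects (2 flagged).
metrics = classification_metrics(tp=57, fn=3, tn=58, fp=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because sensitivity and specificity are computed on different subgroups, a model can score well on accuracy while still missing a clinically relevant share of patients, which is why the three metrics are usually reported together.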

Conclusions
This paper has presented a review of articles that discuss the use of FTIR spectroscopy combined with ML for human blood analysis between the years 2019-2023.
The reported implementation of ML techniques covered a wide range of approaches, methods, statistical software, and validation strategies. Based on these findings, it is essential to assess and compare several modeling approaches when creating ML-based models for correctly evaluating FTIR data, aimed at diagnosing human blood, which calls for the highest research standards. Models should be assessed using precise criteria before being chosen.
There is potential to apply FTIR-based ML diagnosis to aid clinical decisions as a triage method for human blood cell discrimination, extending to early cancer screening, health monitoring, disease detection, and food safety. Identifying early stages and supporting referral decisions is essential, and this need is usually unmet by the current diagnostic pathway. That level of proof has been attained by a sizable number of studies that distinguish between abnormal and healthy human blood cells. Moreover, considering data availability, computing power, and access to AI tools, a strong increase in the use of ML and DL approaches for human blood analysis should be expected in the coming years, due to the advantages they provide.
This fast method is highly important and critical for many patients. It is effective for both accessible and inaccessible diseases in which lab tests are normally useless. Thus, it could serve as a significant and objective diagnostic tool that will assist physicians in increasing their diagnostic accuracy of the etiology of diseases, especially for inaccessible patients. Before ML approaches are used to analyze blood cells, we need additional field validation in other study sites with different parasite populations and an in-depth evaluation of the biological basis of blood cells.
ML approaches still present several limitations, which warrant attention in future studies using these methods. Overfitting, which can occur when a model is made too detailed for the data available, has been identified as a weakness. Other limitations stem from the quality and availability of the data sources used, such as incomplete variable sets or missing data, which can negatively affect the model's development and performance. Retrospective database studies were identified as particularly vulnerable to the lack of relevant variables, as researchers are limited to recorded data. Moreover, the lack of external validation was noted as a limitation of the studies reviewed in this analysis. Thus, improving the classification algorithms and training models on larger datasets could improve specificity and sensitivity and increase the potential of FTIR spectroscopy and ML applications in the biological field [56–59]. Finally, the output of this paper, apart from describing the most used ML techniques and corresponding metrics in the biological field, could be a strong basis for further developments focused on applying these methodologies to human blood cell analysis [60], including in microfluidic, lab-on-a-chip [61–64], or other point-of-care miniaturized devices [65], among others.