Excitation–Emission Fluorescence Spectroscopy Combined with Machine Learning for Biomedical Diagnostics: A Systematic Review
Abstract
1. Introduction
2. Materials and Methods
2.1. Inclusion and Exclusion Criteria
2.2. Study Selection Process
3. Fundamentals of Excitation–Emission Matrix Fluorescence Spectroscopy and Machine Learning
3.1. Basic Principles of EEM Fluorescence Spectroscopy
3.2. Data Characteristics and Challenges
3.3. Basic Principles of Machine Learning
3.3.1. Data Preprocessing
- Normalization is a fundamental preprocessing step that rescales input features with different units and magnitudes to a standard range, typically between zero and one. This transformation prevents variables with larger scales from disproportionately influencing machine learning model outcomes and is particularly beneficial for linear models operating on small datasets, where empirical evidence shows it can improve classification accuracy from 61% to over 91% and increase the Matthews correlation coefficient (MCC) from 23% to 83% [33]. In the context of EEM fluorescence spectroscopy, normalization constitutes a critical stage of the data preprocessing workflow, alongside baseline correction and noise filtering, that enables the effective integration of high-dimensional spectral matrices with machine learning algorithms for biomedical diagnostic applications [34].
- Baseline correction addresses systematic low-frequency spectral distortions caused by scattering, fluorescence background, and detector drift, which obscure true analyte signals and compromise quantitative analysis. Effective correction must disentangle analyte signals from background interference without distorting peak morphology, a challenge addressed through physical and statistical models, such as polynomial fitting and asymmetric least squares, as well as adaptive algorithms, including penalized splines and iterative optimization methods. Parametric approaches, such as piecewise polynomial fitting (PPF) and B-spline fitting (BSF), offer computational simplicity and localized control to avoid overfitting but require careful parameter tuning; high-order polynomials risk introducing oscillations, whereas low-order fits may underperform with complex baselines. Adaptive methods, such as asymmetrically reweighted penalized least squares (arPLS), improve robustness by dynamically adjusting weights based on the distribution of residuals, thereby suppressing baseline drift while preserving spectral peaks. However, arPLS remains prone to overfitting faint peaks near the baseline and may misclassify low-intensity signals as baseline variations under high-noise conditions. The choice between parametric and adaptive approaches depends on spectral complexity: simple baselines are well-handled by parametric models, whereas complex fluorescence backgrounds and overlapping interferences demand adaptive or data-driven algorithms. Regardless of the strategy chosen, effective baseline correction is essential for preserving spectroscopic feature fidelity and ensuring that downstream machine learning models operate on chemically meaningful data rather than distortion-dominated signals [35].
- Noise filtering constitutes a critical preprocessing strategy in machine learning pipelines, particularly in medical domains, where training datasets frequently contain mislabeled or anomalous instances. The multi-class saturation filter, grounded in the saturation property of the training data, operates iteratively to identify and remove examples whose elimination reduces the hypothesis complexity, as measured by the complexity of the least complex hypothesis value. An empirical evaluation across eight UCI medical datasets (University of California, Irvine, USA) demonstrated that classifiers trained on noise-filtered data consistently achieved higher mean prediction accuracy than those trained on unfiltered datasets, with statistically significant improvements observed in seven of eight domains for both inductive learning by logic minimization (ILLM) and C4.5 algorithms. Notably, the application of noise elimination prior to model training yielded relative information score improvements of approximately 3% in the diagnosis of rheumatic diseases, surpassing the performance of built-in noise-handling mechanisms, such as the CN2 significance test. These results underscore the value of preprocessing-based noise elimination as a complementary strategy to algorithmic noise tolerance, while also highlighting that excessive filtering may reduce predictive performance, necessitating careful calibration of the noise sensitivity parameter according to the training set size [36].
- Centering is a fundamental preprocessing transformation that adjusts the data so that the mean of each feature becomes zero, thereby eliminating bias related to the location of the data in feature space and allowing machine learning models to focus on the relationships and patterns within the data. This step is particularly critical for algorithms sensitive to the scale and location of the data, such as principal component analysis (PCA) and various regression techniques, as it ensures that the first principal component aligns with the direction of maximum variance. Importantly, centering only shifts the location of the data without affecting its variance or the shape of its distribution; therefore, the total variance, expressed as the trace of the covariance matrix, remains unchanged after the transformation. In iterative optimization algorithms, such as gradient descent, centering accelerates convergence by removing bias from the optimization landscape while reducing numerical instability during high-dimensional computations. For multi-class datasets, centering each class around its own centroid, rather than applying a global centering strategy, maximizes between-class variance while minimizing within-class variance, thereby directly influencing the Fisher discriminant ratio as a measure of class separability. As a standard practice in data preprocessing pipelines, centering is commonly applied as a preliminary step in standardization, where the data are subsequently scaled by dividing by the standard deviation to form Z-score normalization, making features comparable regardless of their original units or scales [37].
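The normalization and centering steps described in this subsection can be illustrated with a minimal NumPy sketch. It assumes that each EEM has already been unfolded into one row of a samples-by-features matrix; the function names and the synthetic data are hypothetical and serve only as an illustration.

```python
import numpy as np

def min_max_scale(X):
    """Rescale each column (feature) of X to the range [0, 1]."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0              # constant features map to 0 instead of dividing by zero
    return (X - lo) / span

def center(X):
    """Subtract the per-feature mean so that every column has mean zero."""
    return X - X.mean(axis=0)

def z_score(X):
    """Center, then divide by the per-feature sample standard deviation."""
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0
    return center(X) / std

# Hypothetical data: 6 samples x 4 unfolded EEM intensities on very different scales.
rng = np.random.default_rng(0)
X = rng.random((6, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])

# Centering shifts location only, so the trace of the covariance matrix is unchanged.
total_var_raw = np.trace(np.cov(X, rowvar=False))
total_var_centered = np.trace(np.cov(center(X), rowvar=False))
print(np.isclose(total_var_raw, total_var_centered))
```

In a real pipeline the minimum, range, mean, and standard deviation would normally be estimated on the training set only and then reused for validation and test samples, so that no information leaks across the split.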
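Baseline correction can be sketched in the same spirit. The example below implements plain asymmetric least squares, which the text names alongside polynomial fitting; it is not arPLS, whose reweighting rule differs. The dense matrices are adequate only for short spectra, and the function name `als_baseline` and the defaults for `lam` and `p` are assumptions chosen for a synthetic spectrum.

```python
import numpy as np

def als_baseline(y, lam=1e6, p=0.01, n_iter=20):
    """Estimate a smooth baseline under the peaks of a 1-D spectrum.

    lam controls stiffness (penalty on second differences); p is the small
    weight given to points that lie above the current baseline estimate.
    """
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)          # second-difference operator, shape (n - 2, n)
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    z = y.copy()
    for _ in range(n_iter):
        # Weighted penalized least squares: (W + lam * D'D) z = W y
        z = np.linalg.solve(np.diag(w) + penalty, w * y)
        # Points above the fit are treated as peak signal and down-weighted.
        w = np.where(y > z, p, 1.0 - p)
    return z

# Synthetic emission scan: sloping background plus one Gaussian band.
x = np.linspace(0.0, 1.0, 200)
background = 0.5 + 0.2 * x
band = np.exp(-0.5 * ((x - 0.5) / 0.03) ** 2)
spectrum = background + band
corrected = spectrum - als_baseline(spectrum)
```

The two parameters illustrate the tuning burden noted above: a `lam` that is too small lets the baseline follow the peaks, while a `p` that is too large pulls the baseline up into them.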
3.3.2. Unsupervised Learning
- k-Means clustering is a fundamental unsupervised learning algorithm that partitions multidimensional data into k distinct clusters by minimizing the total within-cluster variance, thereby identifying natural groupings and latent structures in complex datasets without requiring prior class labels. Formally, the algorithm minimizes the sum of squares error (SSE) through an iterative procedure that assigns each observation to the nearest centroid and recalculates cluster centroids based on updated membership until no further reassignment is possible, making it computationally tractable for large, high-dimensional datasets. A well-recognized limitation of k-means is its susceptibility to convergence at local rather than global optima, a consequence of sensitivity to initial seed selection; this can be mitigated through multiple random restarts or rational initialization strategies, such as hierarchical pre-clustering via Ward’s method, which has shown superior recovery performance in comparative studies. The determination of the optimal number of clusters, k, remains a critical methodological decision, addressable through algorithmic approaches that test for Gaussian distribution assumptions, graphical inspection of the SSE curve across values of k, or quantitative indices, including the Calinski–Harabasz criterion, the Davies–Bouldin index, and the gap statistic. Preprocessing decisions also critically influence cluster quality: standardization by range rather than z-score has been recommended as the most effective normalization strategy, and variable weighting or selection procedures, such as variable selection for k-means (VS-KM), can substantially reduce noise and improve the discriminative structure of the resulting partition. 
Additional methodological safeguards include outlier detection through jackknife-based influence measures and trimmed k-means variants, as well as extensions to alternative metric spaces, including k-medians and k-harmonic means, that offer improved robustness in the presence of outliers or complex data geometries [39]. When applied to spectroscopic data, k-means serves as an exploratory tool for hypothesis generation, identifying sample subgroups that may reflect distinct biochemical states or disease phenotypes, although the results should be validated through complementary supervised methods and clinical correlation before drawing biological conclusions.
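The iterative assign-and-update procedure and the use of multiple random restarts described for k-means can be sketched as follows. This is a minimal NumPy illustration rather than a production implementation; the function name `kmeans` and its parameters are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_restarts=10, max_iter=100, seed=0):
    """Lloyd's algorithm with random restarts; returns (labels, centroids, sse) of the best run."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        # Seed the centroids with k distinct observations.
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = None
        for _ in range(max_iter):
            # Assignment step: nearest centroid by squared Euclidean distance.
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break                                  # no reassignment: converged
            labels = new_labels
            # Update step: recompute each centroid; keep the old one if a cluster is empty.
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:
                    centroids[j] = members.mean(axis=0)
        sse = float(((X - centroids[labels]) ** 2).sum())
        # Keep the restart with the lowest sum of squares error (SSE).
        if best is None or sse < best[2]:
            best = (labels, centroids.copy(), sse)
    return best
```

Running the function for several values of k and plotting the returned SSE gives the curve used for graphical selection of the number of clusters.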
- Hierarchical clustering is an unsupervised grouping procedure that systematically reduces n mutually exclusive subsets to a single group by iteratively merging pairs of subsets whose union produces the smallest increase in the chosen objective function. This approach is particularly suited for large-scale studies (n > 100), in which obtaining a globally optimal solution for a fixed number of groups is computationally prohibitive. At each step, all possible k(k − 1)/2 pairs of active subsets are evaluated, and the merger associated with the optimal objective function value is accepted without modifying the previously formed groups. The objective function, which may reflect any criterion selected by the investigator, such as the SSE, quantifies the information loss associated with each merger, providing a principled basis for evaluating grouping quality. Critically, this procedure does not require the number of groups to be specified in advance; instead, the objective function values computed across all stages from n to one group provide quantitative clues for selecting the operationally appropriate number of clusters. The complete hierarchical structure, together with the incremental loss estimates at each merging stage, enables a multi-resolution inspection of data organization and the identification of meaningful subgroup boundaries [40].
- PCA is a dimensionality reduction technique that transforms high-dimensional datasets into a reduced set of uncorrelated variables, termed principal components (PCs), which are defined directly from the data rather than established a priori, making PCA an inherently adaptive method for exploratory analysis. Mathematically, each PC is obtained as a linear combination of the original variables by solving an eigenvalue problem on the covariance or correlation matrix, where the coefficients of these linear combinations are known as loadings and the projected values for each observation as scores. Because PCA operates without distributional assumptions, it is broadly applicable to numerical data of diverse types. When the original variables differ in units of measurement, correlation matrix PCA, which standardizes variables prior to analysis, is the preferred approach, as it prevents variables with larger scales from disproportionately influencing the components. The proportion of total variance explained by the retained PCs serves as the standard criterion for evaluating the quality of any low-dimensional representation, enabling a principled trade-off between dimensionality reduction and information retention. Complementary visualization tools, such as biplots, further enhance interpretability by simultaneously representing observations and variables in the reduced space, facilitating the identification of structure and relationships within complex datasets [41].
- t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear, non-parametric dimensionality reduction technique that maps high-dimensional data into a two- or three-dimensional representation by converting pairwise Euclidean distances into conditional probabilities that reflect the similarity between data points. Unlike linear approaches, t-SNE employs a Student’s-t distribution with one degree of freedom in the low-dimensional space, rather than a Gaussian, to model these similarities, which provides two critical advantages: it strongly repels dissimilar data points that are incorrectly placed in close proximity and avoids infinite repulsive forces that would otherwise destabilize the optimization. This design directly addresses the crowding problem inherent to earlier SNE formulations, in which the insufficient area of the low-dimensional map caused moderately dissimilar points to collapse toward the center, obscuring natural cluster boundaries. The optimization of the cost function is further supported by early exaggeration, a technique that temporarily amplifies the joint probabilities in the high-dimensional space during initial iterations, encouraging the formation of tight, well-separated clusters that can subsequently reorganize into a globally coherent structure. As a result, t-SNE is capable of simultaneously preserving local neighborhood relationships and revealing global structure at multiple scales, a property that distinguishes it from techniques such as Sammon mapping, isomap, and locally linear embedding. 
However, its computational and memory complexity of O(n²) makes direct application impractical for datasets substantially exceeding 10,000 data points; this limitation is addressed by a random walk variant that computes pairwise affinities for a subset of landmark points by integrating over all paths through a neighborhood graph constructed from the full dataset, thereby incorporating structural information from undisplayed data points into the final visualization [42].
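The two similarity models that t-SNE compares can be written down compactly. The sketch below computes Gaussian affinities in the high-dimensional space, Student's t affinities with one degree of freedom in the map, and the Kullback-Leibler cost between them. It is deliberately incomplete: a single shared bandwidth `sigma` replaces the per-point perplexity search, and the gradient-based optimization and early exaggeration are omitted. All function names are hypothetical.

```python
import numpy as np

def high_dim_affinities(X, sigma=1.0):
    """Symmetrized Gaussian affinities p_ij using one shared bandwidth (a simplification)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)                 # a point is not its own neighbor
    cond = np.exp(logits - logits.max(axis=1, keepdims=True))
    cond /= cond.sum(axis=1, keepdims=True)           # conditional probabilities p_{j|i}
    return (cond + cond.T) / (2.0 * len(X))           # joint probabilities p_ij

def low_dim_affinities(Y):
    """Joint probabilities q_ij from a Student's t kernel with one degree of freedom."""
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    kernel = 1.0 / (1.0 + d2)
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum()

def kl_cost(P, Q, eps=1e-12):
    """Kullback-Leibler divergence KL(P || Q), the quantity t-SNE minimizes."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))
```

The heavy tail is what relieves crowding: at a map distance of 3, the unnormalized Student's t kernel is 1/(1 + 9) = 0.1, whereas a unit-variance Gaussian kernel is exp(-4.5), about 0.011, so moderately dissimilar points can sit much farther apart in the map at the same cost.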
- PARAFAC is a trilinear decomposition method that simultaneously fits multiple two-way arrays, or slices of a three-way array, in terms of a common set of factors with differing relative weights in each slice, representing a generalization of the bilinear factor analysis model to three-way data. Applied to excitation–emission matrix spectroscopy, this trilinear structure naturally accommodates the three-way array formed by samples, excitation wavelengths, and emission wavelengths, decomposing it into a set of latent factors each associated with a distinct spectral component. A defining advantage of PARAFAC over conventional two-way methods, such as PCA, is its intrinsic axis property: when each factor exhibits sufficiently distinct patterns of variation across the three modes of the data, the orientation of factor axes is uniquely determined by minimizing residual error alone, eliminating the rotational indeterminacy that requires an arbitrary separate rotation phase in traditional factor analysis. This uniqueness is guaranteed when each factor displays distinct proportional patterns of variation across the levels of each mode, such that no two factors can be exchanged or recombined without reducing the overall fit of the model. Solution reliability was assessed through fit diagnostics, including R-squared, stress, and mean squared error, and confirmed via split-half validation, in which replication of essentially the same factor structure across independent subsamples demonstrated that the recovered axes reflected genuine systematic patterns in the data rather than artifacts. When the proportional structure of the data is violated or factors exhibit insufficient variation, degenerate solutions characterized by two or more highly negatively correlated factors may arise; these can be addressed by constraining factors to orthogonality in one mode or by employing indirect fitting through PARAFAC, which is generally immune to such degeneracies. 
Extensions of the basic model include PARAFAC2, which permits oblique factors while maintaining consistency constraints on interfactor angles across covariance matrices, and the PARATUCK family of models, which combines the intrinsic axis capabilities of PARAFAC with the greater structural generality of Tucker three-mode factor analysis [43].
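The agglomerative procedure described above for hierarchical clustering can be sketched with the SSE as objective function. For two groups with sizes n_a and n_b and centroids c_a and c_b, merging increases the SSE by n_a n_b / (n_a + n_b) times the squared distance between the centroids. The exhaustive pair search below is written for clarity rather than speed, and the function names are hypothetical.

```python
import numpy as np

def ward_merge_history(X):
    """Reduce n singleton groups to one, always accepting the merger with the
    smallest increase in the within-group sum of squares error (SSE)."""
    groups = {i: [i] for i in range(len(X))}       # group id -> member row indices
    history = []                                    # (kept id, absorbed id, SSE increase)
    while len(groups) > 1:
        ids = list(groups)
        best = None
        # Evaluate all k(k - 1)/2 pairs of currently active groups.
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                na, nb = len(groups[a]), len(groups[b])
                gap = X[groups[a]].mean(axis=0) - X[groups[b]].mean(axis=0)
                increase = na * nb / (na + nb) * float(gap @ gap)
                if best is None or increase < best[2]:
                    best = (a, b, increase)
        a, b, increase = best
        groups[a] = groups[a] + groups.pop(b)       # merge b into a; other groups untouched
        history.append((a, b, increase))
    return history

def cut(history, n, n_groups):
    """Replay the first n - n_groups mergers and return one group label per observation."""
    labels = list(range(n))
    for a, b, _ in history[: n - n_groups]:
        labels = [a if lab == b else lab for lab in labels]
    return labels
```

The third element of each history entry is the loss attached to that merger; a sharp jump between consecutive entries is the kind of quantitative clue the text refers to when choosing the number of groups.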
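The eigenvalue formulation of PCA given above translates almost directly into code. The following sketch returns scores, loadings, and the proportion of variance explained; setting `use_correlation=True` standardizes the variables first, which corresponds to correlation matrix PCA. The function name and signature are hypothetical.

```python
import numpy as np

def pca(X, n_components, use_correlation=False):
    """PCA by eigendecomposition of the covariance (or correlation) matrix."""
    Z = X - X.mean(axis=0)                      # centering is required before PCA
    if use_correlation:
        Z = Z / Z.std(axis=0, ddof=1)           # standardize when units differ
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh: for symmetric matrices, ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs[:, :n_components]        # coefficients of the linear combinations
    scores = Z @ loadings                       # projected value of each observation
    explained = eigvals[:n_components] / eigvals.sum()
    return scores, loadings, explained
```

For unfolded EEMs the number of features usually far exceeds the number of samples, in which case a singular value decomposition of the centered data is the more economical route to the same components.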
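The trilinear PARAFAC model for an EEM array, with modes for samples, excitation wavelengths, and emission wavelengths, can be fitted by alternating least squares. The sketch below is a bare-bones version under stated assumptions: random initialization, a fixed number of iterations, and no non-negativity constraints, convergence check, or scatter handling, all of which practical EEM work would normally add. The function names are hypothetical.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product; the row index runs over (u, v) with v fastest."""
    return (U[:, None, :] * V[None, :, :]).reshape(U.shape[0] * V.shape[0], -1)

def parafac_als(X, rank, n_iter=200, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by alternating least squares."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((n, rank)) for n in (I, J, K))
    X1 = X.reshape(I, J * K)                        # samples x (excitation, emission)
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)     # excitation x (samples, emission)
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)     # emission x (samples, excitation)
    for _ in range(n_iter):
        # Each update is the least-squares solution for one mode with the other two fixed.
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def reconstruct(A, B, C):
    """Rebuild the three-way array from the three loading matrices."""
    return np.einsum("ir,jr,kr->ijk", A, B, C)
```

Columns of `A` play the role of relative component scores per sample, while columns of `B` and `C` are the estimated excitation and emission profiles; these are recovered only up to scaling and ordering of the components.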
3.3.3. Supervised Learning
- Logistic regression (LR) is a statistical method introduced for the analysis of binary sequences, where the outcome of each observation takes one of two forms, such as success or failure, or presence or absence of a condition. The model is formulated through the logistic law, which expresses the probability of a binary outcome as a function of one or more predictor variables via the logit transformation: the log-odds of the outcome are modeled as a linear combination of the predictors, ensuring that estimated probabilities remain bounded within the interval [0, 1], a fundamental constraint that linear regression cannot guarantee. The parameter β associated with each predictor has a well-defined interpretation: when the probability of the outcome is small, β approximates the fractional change in that probability per unit increase in the predictor; when the probability of the complementary outcome is small, β approximates the corresponding fractional change in that quantity. The regression coefficients can be estimated through maximum likelihood or minimum logit χ², two asymptotically equivalent approaches that extend naturally to multiple predictor variables through standard iterative or non-iterative multiple regression calculations [45]. Owing to these properties, LR has become widely used in biomedical research for predicting whether a set of conditions will or will not result in disease or a clinically relevant outcome, with the probability of class membership derived directly from a linear combination of the available features. However, its intrinsic linear nature implies a reduced predictive capacity when the relationship between predictors and the outcome is non-linear, which represents a fundamental limitation in datasets where complex interactions among variables are present [46].
- Support vector machines (SVMs) are supervised machine learning algorithms grounded in statistical learning theory, originally developed for binary classification and subsequently extended to multi-class problems. Their core operating principle consists of identifying an optimal decision boundary that maximizes the separation between classes, a property that confers notable generalization capacity even in complex classification scenarios. When data are not linearly separable, SVMs employ kernel functions to project the input features into higher-dimensional spaces, where class discrimination becomes feasible, thereby allowing the model to capture non-linear relationships between variables. These characteristics have made SVMs particularly well-suited for biomedical applications, where they have been applied for over two decades to tasks such as clinical diagnosis, disease prognosis, biomarker identification, and prediction of treatment outcomes, owing to their high precision and robustness in handling high-dimensional data. Moreover, SVMs have been integrated into personalized medicine frameworks, where their capacity to simultaneously process genomic, demographic, and clinical information supports the identification of patient subgroups and the prediction of individualized therapeutic responses. Although SVMs have limitations related to computational cost and sensitivity to hyperparameter selection, the development of improved variants has substantially broadened their applicability to complex, often imbalanced datasets encountered in real-world biomedical research [47,48].
- Decision trees are supervised machine learning classifiers that assign class labels to data items by recursively partitioning the feature space through a hierarchical series of questions. Each internal node evaluates a feature-based condition and directs items along branches toward terminal leaf nodes, where class assignments are made. The quality of each partition is measured using impurity criteria, such as entropy or the Gini index, and the algorithm selects the question that minimizes the weighted average impurity of the resulting subsets, thereby maximizing information gain at each split. To prevent overfitting, tree complexity is controlled either by early stopping or by post-construction pruning, in which internal nodes collapse into leaves when doing so reduces the classification error on held-out examples. Decision trees are flexible enough to handle both real-valued and categorical features simultaneously, as well as datasets with missing values. Furthermore, ensemble strategies, which aggregate predictions from multiple trees trained on bootstrapped subsets of the data using random feature subsets, can reduce the chance that a new example will be misclassified by avoiding commitment to a single tree [49].
- Random Forest (RF) is an ensemble machine learning method that generates a collection of decision trees, each trained on a random subset of the available data and a random selection of input features, and combines their individual outputs to produce a final prediction. The use of bootstrap aggregation ensures that each tree is built on a different portion of the training data, which introduces diversity among the trees and reduces the correlation between them. The generalization error of an RF converges to a stable limit as the number of trees increases, a property derived from the law of large numbers, which explains why the algorithm does not overfit as the ensemble grows larger. The accuracy of the model depends on two fundamental parameters: the strength of the individual trees and the degree of correlation between them, such that forests composed of stronger and less correlated trees consistently yield lower generalization error. A particularly valuable property of RF is its resistance to noise in the output labels, as the algorithm does not concentrate weight on misclassified instances in the manner of boosting methods, making it more robust under real-world data conditions. The algorithm is applicable to both classification and regression tasks, offers a built-in estimation of variable importance through measures that quantify how much each feature contributes to reducing prediction error across the trees, and can effectively handle datasets with missing values and class imbalance without requiring extensive preprocessing. Furthermore, out-of-bag samples, which are the observations not used in constructing each individual tree, serve as an internal validation set that provides unbiased estimates of the generalization error without the need for a separate test dataset. 
These characteristics collectively make RF one of the most efficient and widely adopted algorithms in machine learning, with demonstrated utility across diverse domains, including health applications such as disease diagnosis and patient outcome prediction [50,51].
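The impurity-based split selection of decision trees and the bootstrap, random-feature, and out-of-bag mechanics of RF can be shown together in one small sketch. To stay short it grows depth-one trees (stumps) rather than full trees, so it illustrates the ensemble mechanics and is not a complete Random Forest. All function names and defaults are hypothetical.

```python
import numpy as np

def gini(y):
    """Gini impurity of a vector of non-negative integer class labels."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, y, feature_ids):
    """Return (feature, threshold) minimizing the weighted Gini impurity of the two children."""
    best = (None, None, np.inf)
    for f in feature_ids:
        values = np.unique(X[:, f])
        for t in (values[:-1] + values[1:]) / 2.0:          # midpoints between sorted values
            left = X[:, f] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best[0], best[1]

def fit_stump(X, y, feature_ids):
    f, t = best_split(X, y, feature_ids)
    if f is None:                                           # no valid split: predict the majority class
        majority = int(np.bincount(y).argmax())
        return (0, np.inf, majority, majority)
    left = X[:, f] <= t
    return (f, t, int(np.bincount(y[left]).argmax()), int(np.bincount(y[~left]).argmax()))

def predict_stump(stump, X):
    f, t, left_class, right_class = stump
    return np.where(X[:, f] <= t, left_class, right_class)

def bagged_stumps(X, y, n_trees=25, n_features=1, seed=0):
    """Train stumps on bootstrap samples with random feature subsets; return them with the OOB error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    stumps, votes = [], np.zeros((n, int(y.max()) + 1))
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)                   # bootstrap sample (with replacement)
        feats = rng.choice(d, size=n_features, replace=False)
        stump = fit_stump(X[boot], y[boot], feats)
        stumps.append(stump)
        oob = np.setdiff1d(np.arange(n), boot)              # rows this tree never saw
        votes[oob, predict_stump(stump, X[oob])] += 1
    voted = votes.sum(axis=1) > 0
    oob_error = float(np.mean(votes[voted].argmax(axis=1) != y[voted]))
    return stumps, oob_error

def predict_forest(stumps, X, n_classes):
    """Majority vote over all stumps."""
    votes = np.zeros((len(X), n_classes))
    for stump in stumps:
        votes[np.arange(len(X)), predict_stump(stump, X)] += 1
    return votes.argmax(axis=1)
```

The out-of-bag error is computed only from trees that did not see a given observation during training, which is why it can stand in for a separate validation set.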
- Naïve Bayes is one of the most efficient and widely validated probabilistic classifiers for disease prediction, occupying a prominent place among the supervised learning algorithms applied to biomedical diagnosis. Based on Bayes’ theorem, this algorithm estimates the posterior probability of a given condition from the independent contribution of multiple predictor variables, assuming conditional independence among features, an assumption that, despite its simplicity, yields remarkably competitive predictive performance. Its structure can be represented as a directed acyclic graph, in which a single parent node is connected to all predictive variables, and within each node, a variable can take multiple values associated with a specific probability. A systematic review encompassing 23 studies and 53,725 patients demonstrated that Naïve Bayesian networks achieved superior predictive performance in most disease categories evaluated, with 80% of the studies reporting accuracy values above 75%, and more than half exceeding an area under the receiver operating characteristic (ROC) curve of 0.80, a threshold classified as having good discriminative ability. Across diverse clinical scenarios, including brain disease, cancer, diabetic kidney disease, and cardiovascular conditions, the classifier consistently outperformed or performed comparably to other algorithms, such as LR, SVMs, and neural networks. Beyond its predictive accuracy, Naïve Bayes supports clinical decision-making by merging multiple patient characteristics into a unified probability estimate, aiding physicians in both diagnosis and prognosis. This performance across heterogeneous disease domains aligns with the broader recognition that machine learning methods enhance diagnostic precision and facilitate the integration of complex multivariate data into actionable clinical insights [12,52].
- k-Nearest Neighbor (KNN) is a supervised, non-parametric machine learning method, meaning that it does not impose any assumptions on the underlying distribution of the data. Its operating principle is grounded in the premise that observations located close to one another tend to belong to the same category; thus, the class of a new data point is determined by a majority vote among its k nearest neighbors in the feature space. A recognized advantage of this algorithm is its simplicity and suitability for multi-class classification problems, particularly when the dataset is not large. In the large-sample limit, the probability of error of the single nearest-neighbor classifier is bounded above by twice the Bayes probability of error, implying that at least half of the classification information available in an infinite sample set is contained in the nearest neighbor. In practice, the selection of the number k of neighbors represents the most critical step in the training process, as small values yield unstable models, whereas k = 5 has been shown to perform reliably across a wide variety of datasets. Unlike other machine learning algorithms, KNN does not construct an explicit function during training; instead, it retains observations in memory and draws upon them when classifying new data points. Neighborhood computation can be performed using distance metrics, such as Euclidean or Manhattan distance, and search algorithms, including the k-d tree and ball tree, allowing the computational complexity to be reduced as a function of the number of training samples and the dimensionality of the feature space [53,54].
- Artificial Neural Networks (ANN) are computational systems that attempt to simulate, in a simplified manner, the organization of biological neural networks, borrowing from neurophysiological knowledge of neurons and their interconnections. The basic computational unit of an ANN is an artificial neuron, which receives inputs scaled by adjustable weights, computes their sum, and passes the result through a non-linear activation function to produce an output. Multiple such units are organized into an architecture comprising an input layer, one or more hidden layers, and an output layer; the number of hidden layers and nodes in each layer are critical factors that directly influence modeling accuracy and the convergence of learning. Training an ANN consists of iteratively adjusting the connection weights to minimize a defined error function between the actual and desired outputs. The back-propagation algorithm, proposed by Rumelhart, Hinton, and Williams in 1986, provided an effective solution for setting weights in multilayer networks, extending the representational capacity of ANN to problems of essentially no bound in complexity. By virtue of this architecture, ANN can solve complex, mathematically ill-defined, non-linear, and stochastic problems through simple computational operations, with a self-organizing feature that allows them to generalize across a wide range of problems without reprogramming. In supervised learning, the network learns the correlation between input data and desired outcomes and, after training, can generate appropriate outputs in response to new inputs, a property known as generalization. 
Deep learning extends this principle through architectures composed of multiple processing layers that learn hierarchical representations of data, with each layer transforming its input into progressively more abstract representations; this approach has demonstrated breakthroughs in image recognition, speech recognition, drug discovery, and genomics [55,56,57].
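The logit formulation of LR described earlier in this subsection can be made concrete with a short maximum-likelihood fit. The sketch uses plain gradient ascent with a fixed step size and iteration count, which is an assumption made for brevity; iteratively reweighted least squares is the more common choice in statistical software. The function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Maximum-likelihood fit by gradient ascent on the mean log-likelihood."""
    Xb = np.hstack([np.ones((len(X), 1)), X])        # prepend an intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ beta)
        beta += lr * Xb.T @ (y - p) / len(y)         # gradient of the mean log-likelihood
    return beta

def predict_proba(beta, X):
    """Probability of the positive class: the log-odds are linear in the predictors."""
    return sigmoid(beta[0] + X @ beta[1:])
```

Because the log-odds are linear in the predictors, `np.log(p / (1 - p))` applied to the predicted probabilities recovers `beta[0] + X @ beta[1:]`, which is the linearity constraint discussed as the method's main limitation.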
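The majority-vote rule of KNN is equally compact. The following brute-force sketch supports the Euclidean and Manhattan distances mentioned above; it does not use a k-d tree or ball tree, so every prediction scans the full training set. The function name and its `metric` argument are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_new, k=5, metric="euclidean"):
    """Classify each row of X_new by majority vote among its k nearest training points."""
    preds = []
    for x in X_new:
        diff = X_train - x
        if metric == "manhattan":
            dist = np.abs(diff).sum(axis=1)
        else:
            dist = np.sqrt((diff ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]                 # indices of the k closest samples
        # On a tied vote, the class encountered first among the neighbors wins.
        vote = Counter(y_train[nearest].tolist()).most_common(1)[0][0]
        preds.append(vote)
    return np.array(preds)
```

Because the method relies on distances, features should be brought to comparable scales beforehand, for example with the normalization or standardization steps of Section 3.3.1.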
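The weighted-sum-and-activation unit and the back-propagation of errors described for ANN can be sketched for a network with one hidden layer. The sigmoid activation, the squared-error function, and all names below are assumptions made for illustration; they are not taken from the cited works.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Each layer forms weighted sums of its inputs and applies a non-linear activation."""
    h = sigmoid(W1 @ x + b1)         # hidden layer
    y = sigmoid(W2 @ h + b2)         # output layer
    return h, y

def loss(x, t, W1, b1, W2, b2):
    """Squared error between the actual output y and the desired output t."""
    _, y = forward(x, W1, b1, W2, b2)
    return 0.5 * float(np.sum((y - t) ** 2))

def backward(x, t, W1, b1, W2, b2):
    """Back-propagate the error; returns gradients in the order (W1, b1, W2, b2)."""
    h, y = forward(x, W1, b1, W2, b2)
    delta2 = (y - t) * y * (1.0 - y)              # error signal at the output layer
    delta1 = (W2.T @ delta2) * h * (1.0 - h)      # error signal propagated to the hidden layer
    return np.outer(delta1, x), delta1, np.outer(delta2, h), delta2
```

Training then consists of repeatedly subtracting a small multiple of each gradient from the corresponding weights and biases, which lowers the error function step by step.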
3.3.4. Model Evaluation Metrics
3.3.5. Machine Learning Workflow
4. Biomedical Applications of EEM Fluorescence Spectroscopy Coupled with Machine Learning
4.1. Cancer Detection
4.2. Neurological and Metabolic Diseases
4.3. Analysis of Bioactive Compounds and Hormonal Contaminants
4.4. Diagnosis of Infectious Diseases
5. Discussion
5.1. Performance and Validity of Current Approaches
5.2. Methodological Challenges in Machine Learning Applied to EEM Data
5.3. Future Perspectives and Research Opportunities
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
| --- | --- |
| ACC | Accuracy |
| AdV-7 | Adenovirus 7 |
| AGE/AGEs | Advanced Glycation End products |
| ANN | Artificial Neural Network |
| APTLD | Alternating Penalty Trilinear Decomposition |
| arPLS | Asymmetrically Reweighted Penalized Least Squares |
| AUC | Area Under the Receiver Operating Characteristic Curve |
| BSA | Bovine Serum Albumin |
| BSF | B-Spline Fitting |
| CBZ | Carbamazepine |
| CBZ-EP | Carbamazepine Epoxide |
| CEA | Carcinoembryonic Antigen |
| CFU | Colony Forming Units |
| CNN | Convolutional Neural Network |
| COVID-19 | Coronavirus Disease 2019 |
| CRC | Colorectal Cancer |
| CWM | Constant Wavelength Mode |
| DMAF | 2-(4′-dimethylaminophenyl)-3-hydroxyflavone |
| DMBA | 7,12-dimethylbenz[a]anthracene |
| DT | Differential Transform |
| EC | Endometrial Cancer |
| EE | Ethinylestradiol |
| EEM | Excitation–Emission Matrix |
| EJCR | Elliptical Joint Confidence Region |
| FAD | Flavin Adenine Dinucleotide |
| FFT | Fast Fourier Transform |
| FN | False Negative |
| FP | False Positive |
| FT/FFT | Fourier Transform/Fast Fourier Transform |
| GA | Genetic Algorithm |
| HSA | Human Serum Albumin |
| HPV | Human Papillomavirus |
| ILLM | Inductive Learning by Logic Minimization |
| KNN | k-Nearest Neighbor |
| LDA | Linear Discriminant Analysis |
| LOD | Limit of Detection |
| LR | Logistic Regression |
| MCC | Matthews Correlation Coefficient |
| MCR-ALS | Multivariate Curve Resolution–Alternating Least-Squares |
| MLP | Multilayer Perceptron |
| MLR | Multiple Linear Regression |
| MOPS | 3-(N-morpholino)propanesulfonic acid (buffer) |
| MTT | 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide |
| NADH | Nicotinamide Adenine Dinucleotide (reduced form) |
| NADPH | Nicotinamide Adenine Dinucleotide Phosphate (reduced form) |
| N-PLS | N-way Partial Least Squares |
| NOR | Norgestimate |
| PAH/OH-PAH | Polycyclic Aromatic Hydrocarbons/Hydroxylated PAH |
| PARAFAC | Parallel Factor Analysis |
| PBS | Phosphate-Buffered Saline |
| PC/PCA | Principal Component/Principal Component Analysis |
| PLS/PLS-DA | Partial Least Squares/Partial Least Squares Discriminant Analysis |
| PIV | Poliovirus |
| PPF | Piecewise Polynomial Fitting |
| PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses |
| PSNR | Peak Signal-to-Noise Ratio |
| Q² | Predictive Squared Correlation Coefficient |
| QDA | Quadratic Discriminant Analysis |
| R² | Coefficient of Determination |
| RBF | Radial Basis Function |
| REP% | Relative Prediction Error Percentage |
| RF | Random Forest |
| RMSECV | Root Mean Square Error of Cross-Validation |
| RMSEP | Root Mean Square Error of Prediction |
| RMS%RE | Relative Mean Square % Error |
| ROC | Receiver Operating Characteristic Curve |
| RT-PCR | Reverse Transcription Polymerase Chain Reaction |
| RVT | trans-Resveratrol |
| SARS-CoV-2 | Severe Acute Respiratory Syndrome Coronavirus 2 |
| SCC | Squamous Cell Carcinoma |
| SENS | Sensitivity |
| SGD | Stochastic Gradient Descent |
| SIL | Squamous Intraepithelial Lesion |
| SIMCA/DD-SIMCA | Soft Independent Modeling of Class Analogy/Data-Driven SIMCA |
| SMOTE | Synthetic Minority Oversampling Technique |
| SNE/t-SNE | Stochastic Neighbor Embedding/t-distributed SNE |
| SPEC | Specificity |
| SPA | Successive Projections Algorithm |
| SVMs/SVM-RFE | Support Vector Machines/SVM Recursive Feature Elimination |
| SSE | Sum of Squares Error |
| SWANRF | Self-Weighted Alternating Normalized Residue Fitting |
| SWATLD | Self-Weighted Alternating Trilinear Decomposition |
| TN | True Negative |
| TM-PLS-DA | Tchebichef Moments Partial Least Squares Discriminant Analysis |
| 3D-FS | Three-Dimensional Fluorescence Spectra |
| TP | True Positive |
| TPA | 12-O-tetradecanoylphorbol-13-acetate |
| U-PLS/RBL | Unfolded Partial Least Squares/Residual Bilinearization |
| UV-Vis | Ultraviolet–Visible Spectroscopy |
| VS-KM | Variable Selection for K-Means |
| WT | Wavelet Transform |
References
- Cifuentes-Rodriguez, N.A.; Benavides-Cuestas, E.R.; Chacón-Chamorro, S.G.; Segura-Giraldo, B. Optical fluorescence spectroscopy using LED-type light in ex-vivo cervical tissues. Ing. Compet. 2023, 25, e-20912532.
- Monici, M. Cell and tissue autofluorescence research and diagnostic applications. Biotechnol. Annu. Rev. 2005, 11, 227–256.
- Suehiro, A.; Uchida, K.; Wakabayashi, I. Measurement of urinary advanced glycation end-products (AGEs) using a fluorescence assay for metabolic syndrome-related screening tests. Diabetes Metab. Syndr. Clin. Res. Rev. 2016, 10, S110–S113.
- Li, Z.; Peleato, N. Comparison of dimensionality reduction techniques for cross-source transfer of fluorescence contaminant detection models. Chemosphere 2021, 276, 130064.
- Sebastiani, M.; Vacchi, C.; Manfredi, A.; Cassone, G. Personalized Medicine and Machine Learning: A Roadmap for the Future. J. Clin. Med. 2022, 11, 4110.
- Nilius, H.; Tsouka, S.; Nagler, M.; Masoodi, M. Machine learning applications in precision medicine: Overcoming challenges and unlocking potential. TrAC—Trends Anal. Chem. 2024, 179, 117872.
- Johnson, K.B.; Wei, W.; Weeraratne, D.; Frisse, M.E.; Misulis, K.; Rhee, K.; Zhao, J.; Snowdon, J.L. Precision Medicine, AI, and the Future of Personalized Health Care. Clin. Transl. Sci. 2021, 14, 86–93.
- Lawaetz, A.J.; Bro, R.; Kamstrup-Nielsen, M.; Christensen, I.J.; Jørgensen, L.N.; Nielsen, H.J. Fluorescence spectroscopy as a potential metabonomic tool for early detection of colorectal cancer. Metabolomics 2012, 8, S111–S121.
- Gomidze, N.; Kalandadze, L.; Khajishvili, M.; Nakashidze, O.; Jabnidze, I.; Jakobia, D.; Makharadze, K. Fluorescence spectroscopy as a novel tool in hematological diagnostics. APL Bioeng. 2025, 9, 026102.
- Binson, V.A.; Thomas, S.; Subramoniam, M.; Arun, J.; Naveen, S.; Madhu, S. A Review of Machine Learning Algorithms for Biomedical Applications. Ann. Biomed. Eng. 2024, 52, 1159–1183.
- Ahsan, M.M.; Luna, S.A.; Siddique, Z. Machine-Learning-Based Disease Diagnosis: A Comprehensive Review. Healthcare 2022, 10, 541.
- Zhang, X.; Rahnavard, A.; Crandall, K.A. Machine learning enhances biomarker discovery: From multi-omics to functional genomics. Med. Res. Arch. 2025, 13.
- Wang, X.; Ding, Q.; Groleau, R.R.; Wu, L.; Mao, Y.; Che, F.; Kotova, O.; Scanlan, E.M.; Lewis, S.E.; Li, P.; et al. Fluorescent Probes for Disease Diagnosis. Chem. Rev. 2024, 124, 7106–7164.
- O’Dea, R.E.; Lagisz, M.; Jennions, M.D.; Koricheva, J.; Noble, D.W.; Parker, T.H.; Gurevitch, J.; Page, M.J.; Stewart, G.; Moher, D.; et al. Preferred reporting items for systematic reviews and meta-analyses in ecology and evolutionary biology: A PRISMA extension. Biol. Rev. 2021, 96, 1695–1722.
- Lakowicz, J.R. Principles of Fluorescence Spectroscopy, 3rd ed.; Springer: New York, NY, USA, 2007.
- Wang, F.; Li, J.; Wang, F.; Wang, H.; Zhou, T.; Zhang, X.; Liu, J.; Wang, Y.; Dong, W. Machine learning-based predictive modeling of HAAs concentration in secondary water supply system using UV–vis absorption and excitation-emission matrix (EEM) fluorescence spectroscopy. Microchem. J. 2025, 219, 114329.
- Qin, X.-Q.; Yao, B.; Jin, L.; Zheng, X.-Z.; Ma, J.; Benedetti, M.F.; Li, Y.; Ren, Z.-L. Characterizing Soil Dissolved Organic Matter in Typical Soils from China Using Fluorescence EEM–PARAFAC and UV–Visible Absorption. Aquat. Geochem. 2020, 26, 71–88.
- Geng, T.; Wang, Y.; Yin, X.L.; Chen, W.; Gu, H.W. A Comprehensive Review on the Excitation-Emission Matrix Fluorescence Spectroscopic Characterization of Petroleum-Containing Substances: Principles, Methods, and Applications. Crit. Rev. Anal. Chem. 2023, 54, 2827–2849.
- Wisuthiphaet, N.; Zhang, H.; Liu, X.; Nitin, N. Detection of Escherichia coli Using Bacteriophage T7 and Analysis of Excitation-Emission Matrix Fluorescence Spectroscopy. J. Food Prot. 2024, 87, 100396.
- Lemes, L.F.R.; Soares, F.L.F.; Nagata, N. Determination of DEET, Icaridin, and IR3535 in insect repellents using excitation-emission matrix (EEM) fluorescence spectroscopy and multiway calibration. Microchem. J. 2024, 206, 111601.
- Siano, G.; Mora, S.; Schenone, A.; Giovanini, L. Sequential acquisition of fluorescence signals with changing fluorophore concentrations. Multivariate Curve Resolution with time measurements assistance. arXiv 2025, arXiv:2502.19431v1.
- Costa Pereira, J.; Pais, A.A.C.C.; Burrows, H.D. Analysis of raw EEM fluorescence spectra—ICA and PARAFAC capabilities. Spectrochim. Acta A Mol. Biomol. Spectrosc. 2018, 205, 320–334.
- Głowacz, K.; Skorupska, S.; Grabowska-Jadach, I.; Bro, R.; Ciosek-Skibińska, P. Excitation–Emission Matrix Fluorescence Spectroscopy Coupled with PARAFAC Modeling for Viability Prediction of Cells. ACS Omega 2023, 8, 15968–15978.
- Dos Santos, R.F.; Paraskevaidi, M.; Mann, D.M.A.; Allsop, D.; Santos, M.C.D.; Morais, C.L.M.; Lima, K.M.G. Alzheimer’s disease diagnosis by blood plasma molecular fluorescence spectroscopy (EEM). Sci. Rep. 2022, 12, 16199.
- Caputi, A.F.; Squeo, G.; Sikorska, E.; Silletti, R.; Noviello, M.; Pasqualone, A.; Summo, C.; Caponio, F. Feasibility of excitation-emission fluorescence spectroscopy in tandem with chemometrics for quantitation of trans-resveratrol in vine-shoot ethanolic extracts. J. Sci. Food Agric. 2025, 105, 1496–1507.
- Abbasi, S.; Gharaghani, S.; Benvidi, A.; Rezaeinasab, M. New insights into the efficiency of thymol synergistic effect with p-cymene in inhibiting advanced glycation end products: A multi-way analysis based on spectroscopic and electrochemical methods in combination with molecular docking study. J. Pharm. Biomed. Anal. 2018, 150, 436–451.
- Corcoran, T.C. Compressive Detection of Highly Overlapped Spectra Using Walsh–Hadamard-Based Filter Functions. Appl. Spectrosc. 2018, 72, 392–403.
- Alpaydin, E. Introduction to Machine Learning, 3rd ed.; MIT Press: Cambridge, MA, USA, 2014.
- Zhao, H.; Chen, Y.; Fu, X. Comparison of Machine Learning Based on Category Theory. J. Web Eng. 2023, 22, 41–54.
- Harris, M.A.; McCoach, D.B. Classify with caution: An illustrative example using mixture models and machine learning. J. Res. Pers. 2025, 116, 104602.
- Rashidi, H.H.; Tran, N.K.; Betts, E.V.; Howell, L.P.; Green, R. Artificial Intelligence and Machine Learning in Pathology: The Present Landscape of Supervised Methods. Acad. Pathol. 2019, 6, 2374289519873088.
- Jayaraman, P.; Desman, J.; Sabounchi, M.; Nadkarni, G.N.; Sakhuja, A. A primer on reinforcement learning in medicine for clinicians. npj Digit. Med. 2024, 7, 337.
- Sujon, K.M.; Hassan, R.B.; Towshi, Z.T.; Othman, M.A.; Samad, M.A.; Choi, K. When to Use Standardization and Normalization: Empirical Evidence from Machine Learning Models and XAI. IEEE Access 2024, 12, 135300–135314.
- Dimas Pratama, Y.; Salam, A. Comparison of Data Normalization Techniques on KNN Classification Performance for Pima Indians Diabetes Dataset. J. Appl. Inform. Comput. 2025, 9, 693–706.
- Yan, C. A Review on Spectral Data Preprocessing Techniques for Machine Learning and Quantitative Analysis. iScience 2024, 28, 112759.
- Gamberger, D.; Lavrac, N.; Dzeroski, S. Noise Detection and Elimination in Data Preprocessing: Experiments in Medical Domains. Appl. Artif. Intell. 2000, 14, 205–223.
- Dagal, I.; Harrison, A.; Ibrahim, A.-W.; Mbasso, W.F. Comprehensive Evaluation of Data Preprocessing and Visualization Techniques for Enhanced Classification and Sampling. Clust. Comput. 2025, 28, 476.
- Tautan, A.M.; Andrei, A.G.; Smeralda, C.L.; Vatti, G.; Rossi, S.; Ionescu, B. Unsupervised learning from EEG data for epilepsy: A systematic literature review. Artif. Intell. Med. 2025, 162, 103095.
- Steinley, D. K-means Clustering: A Half-century Synthesis. Br. J. Math. Stat. Psychol. 2006, 59, 1–34.
- Ward, J.H. Hierarchical Grouping to Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236–244.
- Jolliffe, I.T.; Cadima, J. Principal Component Analysis: A Review and Recent Developments. Philos. Trans. R. Soc. A 2016, 374, 20150202.
- Van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
- Harshman, R.A.; Lundy, M.E. PARAFAC: Parallel Factor Analysis. Comput. Stat. Data Anal. 1994, 18, 39–72.
- Osisanwo, F.Y.; Akinsola, J.E.T.; Awodele, O.; Hinmikaiye, J.O.; Olakanmi, O.; Akinjobi, J. Supervised Machine Learning Algorithms: Classification and Comparison. Int. J. Comput. Trends Technol. 2017, 48, 128–138.
- Cox, D.R. The Regression Analysis of Binary Sequences. J. R. Stat. Soc. B 1958, 20, 215–242.
- Jovel, J.; Greiner, R. An Introduction to Machine Learning Approaches for Biomedical Research. Front. Med. 2021, 8, 771607.
- Guido, R.; Ferrisi, S.; Lofaro, D.; Conforti, D. An Overview on the Advancements of Support Vector Machine Models in Healthcare Applications: A Review. Information 2024, 15, 235.
- Khyathi, G.; Indumathi, K.P.; Jumana Hasin, A.; Lisa Flavin Jency, M.; Siluvai, S.; Krishnaprakash, G. Support Vector Machines: A Literature Review on Their Application in Analyzing Mass Data for Public Health. Cureus 2025, 17, e77169.
- Kingsford, C.; Salzberg, S.L. What are decision trees? Nat. Biotechnol. 2008, 26, 1011–1013.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Salman, H.A.; Kalakech, A.; Steiti, A. Random Forest Algorithm Overview. Babylon. J. Mach. Learn. 2024, 2024, 69–79.
- Langarizadeh, M.; Moghbeli, F. Applying Naive Bayesian Networks to Disease Prediction: A Systematic Review. Acta Inform. Med. 2016, 24, 364–369.
- Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
- Gupta, A.K.; Chakroborty, S.; Ghosh, S.K.; Ganguly, S. A machine learning model for multi-class classification of quenched and partitioned steel microstructure type by the k-nearest neighbor algorithm. Comput. Mater. Sci. 2023, 228, 112321.
- Graupe, D. Principles of Artificial Neural Networks, 2nd ed.; World Scientific: Singapore, 2007.
- Cho, H.S.; Leu, M.C. Artificial neural networks in manufacturing processes: Monitoring and control. IFAC Proc. Vol. 1998, 31, 529–537.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
- Uddin, S.; Khan, A.; Hossain, M.E.; Moni, M.A. Comparing different supervised machine learning algorithms for disease prediction. BMC Med. Inform. Decis. Mak. 2019, 19, 281.
- Blakeley, D.D.; Oddone, E.Z.; Hasselblad, V.; Simel, D.L.; Matchar, D.B. Noninvasive carotid artery testing: A meta-analytic review. Ann. Intern. Med. 1995, 122, 360–367.
- Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
- Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In AI 2006: Advances in Artificial Intelligence; Sattar, A., Kang, B.H., Eds.; Lecture Notes in Computer Science, vol. 4304; Springer: Berlin/Heidelberg, Germany, 2006; pp. 1015–1021.
- Olivieri, A.C.; Faber, N.M.; Ferré, J.; Boqué, R.; Kalivas, J.H.; Mark, H. Uncertainty estimation and figures of merit for multivariate calibration (IUPAC Technical Report). Pure Appl. Chem. 2006, 78, 633–661.
- Nagelkerke, N.J.D. A note on a general definition of the coefficient of determination. Biometrika 1991, 78, 691–692.
- Zambrano, A.; Trilleras, J.; Arana, V.A.; Lima, K.M.G.; Neves, A.C.O.; Morais, C.L.M.; Romero, C.; Falconar, A.K.I.; Muñoz, B.S.; García, R.; et al. ATR-FTIR and multivariate analysis for differential diagnosis of dengue and leptospirosis: A feasibility study. Sci. Rep. 2025, 15, 34092.
- Kheirollahpour, M.; Shokoufi, N.; Lotfi, M. The Potential of Optical Technologies in Early Virus Detection; Prospects in Addressing Future Viral Outbreaks. Crit. Rev. Anal. Chem. 2025, 56, 1198–1226.
- Santos, M.C.D.; Mariz, J.V.M.; Silva, R.V.O.; Morais, C.L.M.; Lima, K.M.G. Clinical applications of spectroscopic techniques in conjunction with multivariate analysis in virus diagnosis. Biomed. Spectrosc. Imaging 2023, 10, 49–75.
- Rumaling, M.I.; Chee, F.P.; Bade, A.; Hasbi, N.H.; Daim, S.; Juhim, F.; Duinong, M.; Rasmidi, R. Methods of optical spectroscopy in detection of virus in infected samples: A review. Heliyon 2022, 8, e10472.
- Lv, R.; Wang, Z.; Ma, Y.; Li, W.; Tian, J. Machine Learning Enhanced Optical Spectroscopy for Disease Detection. J. Phys. Chem. Lett. 2022, 13, 9238–9249.
- Escandar, G.M.; Damiani, P.C.; Goicoechea, H.C.; Olivieri, A.C. A review of multivariate calibration methods applied to biomedical analysis. Microchem. J. 2006, 82, 29–42.
- Mustorgi, E.; Durante, C.; Malegori, C.; Greco, P.; Bartoletti, R.; Cocchi, M.; Casale, M. An analytical approach based on excitation-emission fluorescence spectroscopy and chemometrics for the screening of prostate cancer through urine analysis: A proof–of–concept study. Chemom. Intell. Lab. Syst. 2023, 234, 104752.
- Nath, A.; Rivoire, K.; Chang, S.; West, L.; Cantor, S.B.; Basen-Engquist, K.; Adler-Storthz, K.; Cox, D.D.; Atkinson, E.N.; Staerkel, G.; et al. A pilot study for a screening trial of cervical fluorescence spectroscopy. Int. J. Gynecol. Cancer 2004, 14, 1097–1107.
- Diagaradjane, P.; Yaseen, M.A.; Yu, J.; Wong, M.S.; Anvari, B. Autofluorescence characterization for the early diagnosis of neoplastic changes in DMBA/TPA-induced mouse skin carcinogenesis. Lasers Surg. Med. 2005, 37, 382–395.
- Soares, F.; Becker, K.; Anzanello, M.J. A hierarchical classifier based on human blood plasma fluorescence for non-invasive colorectal cancer screening. Artif. Intell. Med. 2017, 82, 1–10.
- Yin, B.; Mi, J.Y.; Zhai, H.L.; Zhao, B.Q.; Bi, K.X. An effective approach to the early diagnosis of colorectal cancer based on three-dimensional fluorescence spectra of human blood plasma. J. Pharm. Biomed. Anal. 2021, 193, 113757.
- Švecová, M.; Blahova, L.; Kostolný, J.; Birkova, A.; Urdzik, P.; Marekova, M.; Dubayova, K. Enhancing endometrial cancer detection: Blood serum intrinsic fluorescence data processing and machine learning application. Talanta 2025, 283, 127083.
- Hordge, L.Q.N.; McDaniel, K.L.; Jones, D.D.; Fakayode, S.O. Simultaneous determination of estrogens (ethinylestradiol and norgestimate) concentrations in human and bovine serum albumin by use of fluorescence spectroscopy and multivariate regression analysis. Talanta 2016, 152, 401–409.
- Nemati, S.S.; Emadi, S.; Ghiasvand Mohammadkhani, L.; Kompany-Zareh, M.; Hasanzadeh, Z. Multivariate spectrofluorimetric detection of lipase isolated from Serratia marcescens in chromatographic fractions. Spectrochim. Acta A Mol. Biomol. Spectrosc. 2019, 222, 117137.
- Escandar, G.M.; Gómez, D.G.; Mansilla, A.E.; De La Peña, A.M.; Goicoechea, H.C. Determination of carbamazepine in serum and pharmaceutical preparations using immobilization on a nylon support and fluorescence detection. Anal. Chim. Acta 2004, 506, 161–170.
- Goicoechea, H.C.; Calimag-Williams, K.; Campiglia, A.D. Multi-way partial least-squares and residual bi-linearization for the direct determination of monohydroxy-polycyclic aromatic hydrocarbons on octadecyl membranes via room-temperature fluorescence excitation emission matrices. Anal. Chim. Acta 2012, 717, 100–109.
- Pagani, A.P.; Ibañez, G.A. Four-way calibration applied to the processing of pH-modulated fluorescence excitation-emission matrices. Analysis of fluoroquinolones in the presence of significant spectral overlapping. Microchem. J. 2017, 132, 211–218.
- Trevisan, M.G.; Poppi, R.J. Determination of doxorubicin in human plasma by excitation–emission matrix fluorescence and multi-way analysis. Anal. Chim. Acta 2003, 493, 69–81.
- Wang, T.; Liu, Q.; Long, W.J.; Chen, A.Q.; Wu, H.L.; Yu, R.Q. A chemometric comparison of different models in fluorescence analysis of dabigatran etexilate and dabigatran. Spectrochim. Acta A Mol. Biomol. Spectrosc. 2021, 246, 118988.
- Xu, J.; Xu, J.; Tong, Z.; Yu, S.; Liu, B.; Mu, X.; Du, B.; Gao, C.; Wang, J.; Liu, Z.; et al. Impact of different classification schemes on discrimination of proteins with noise-contaminated spectra using laboratory-measured fluorescence data. Spectrochim. Acta A Mol. Biomol. Spectrosc. 2023, 296, 122646.
- Bernardes, C.D.; Poppi, R.J.; Sena, M.M. Direct determination of trans-resveratrol in human plasma by spectrofluorimetry and second-order standard addition. Talanta 2010, 82, 640–645.
- Głowacz, K.; Skorupska, S.; Grabowska-Jadach, I.; Ciosek-Skibińska, P. Excitation-emission matrix fluorescence spectroscopy for cell viability testing in UV-treated cell culture. RSC Adv. 2022, 12, 7652–7660.
- Costa, F.S.L.; Bezerra, C.C.R.; Neto, R.M.; Morais, C.L.M.; Lima, K.M.G. Identification of resistance in Escherichia coli and Klebsiella pneumoniae using excitation-emission matrix fluorescence spectroscopy and multivariate analysis. Sci. Rep. 2020, 10, 12994.
- Henry, J.; Endres, J.L.; Sadykov, M.R.; Bayles, K.W.; Svechkarev, D. Fast and accurate identification of pathogenic bacteria using excitation-emission spectroscopy and machine learning. Sens. Diagn. 2024, 3, 1253–1262.
- Araujo Gomes, G.J.; Beltrão, F.E.d.L.; Fragoso, W.D.; Lemos, S.G. Discrimination between COVID-19 positive and negative blood serum based on excitation-emission matrix fluorescence spectroscopy and chemometrics. Talanta 2024, 280, 126788.
- Zhang, P.; Yang, Q.; Xu, X.; Feng, H.; Du, B.; Xu, J.; Liu, B.; Mu, X.; Wang, J.; Tong, Z. Fluorescence excitation-emission matrix spectroscopy combined with machine learning for the classification of viruses for respiratory infections. Talanta 2025, 286, 127462.





| Disease Detection | Sample | Machine Learning Models | Results | Reference |
|---|---|---|---|---|
| Prostate cancer screening | Urine (69 samples; 46 cancer, 23 healthy); 95 EEMs | PARAFAC (4 factors) → LDA, PLS-DA; 70/30 split (Kennard–Stone) | | [70] |
| Cervical precancer and cancer (SIL) | Cervical tissue (58 women, 92 measurement sites) | PCA → logistic discrimination; LDA (forward stepwise, F = 20) | | [71] |
| Skin SCC early detection/staging | In vivo mouse skin (n = 40 experimental + 6 control + 6 blank); weekly EEMs over 15 weeks | Stepwise Fisher LDA; leave-one-out cross-validation | | [72] |
| Colorectal cancer (CRC) screening | Human blood plasma (299 individuals; 74 CRC, 75 healthy, 75 adenomas, 75 other) | Hierarchical SVM: Level 1 binary SVM (CRC vs. non-CRC); Level 2 one-class SVM | | [73] |
| Colorectal cancer (CRC) discrimination | Human blood plasma (74 CRC, 74 adenomas, 77 non-malignant, 4 outliers removed) | TM-PLS-DA (Venetian blinds 10-fold CV, 8 LVs) | | [74] |
| Colorectal cancer (CRC) early detection (metabonomic) | Human blood plasma (n = 308; 77 CRC, 77 adenomas, 77 other non-malignant, 77 no findings); diluted and undiluted | PARAFAC (3 EEM sets, 10 + 3 + 6 components) → pooled score matrix (19 vars) → PLS-DA (forward selection); also iPLS on raw unfolded; 10-fold CV | | [8] |
| Endometrial cancer (EC) detection | Blood serum (n = 118; 73 EC, 45 healthy controls); diluted 10× in phosphate buffer pH 7.4; 3D synchronous fluorescence spectra (CWM, Δλ = 10–200 nm) | PCA + k-means clustering; PLS-DA; RF, SVM, LR, SGD classifiers; 10-fold CV × 100 repeats; grid search optimization | | [75] |
| Alzheimer’s disease (AD) detection | Human blood plasma (230 individuals: 83 AD, 147 controls); dried plasma on microscope slides; EEM (excitation 230–450 nm); Kennard–Stone split 162/34/34 | PARAFAC-QDA (6 factors) and Tucker3-QDA; Kennard–Stone train/validation/test split | | [24] |
| Estrogen quantification (EE and NOR) in serum albumin | BSA and HSA spiked with EE/NOR in MOPS buffer, pH 7.4 | PLS regression (multivariate); univariate regression (comparison) | | [76] |
| Lipase activity detection in chromatographic fractions | Gel filtration fractions (13 EEMs) from S. marcescens extracellular lipase | N-PLS (with region selection) → GA-RBF-ANN; PARAFAC (3 components) | | [77] |
| Therapeutic drug monitoring (CBZ and CBZ-EP) | Human serum (spiked and real patient samples); pharmaceutical preparations | PARAFAC, SWATLD, N-PLS (three-way EEM); PLS-1 (two-way emission) | | [78] |
| OH-PAH metabolites in urine (PAH exposure biomarkers) | Synthetic urine spiked with four OH-PAH metabolites; C18 membrane SPE | N-PLS/RBL and MCR-ALS; calibration factors A = 4 (leave-one-out CV) | | [79] |
| Fluoroquinolone quantification (CIP, OFLO, NOR) in urine | Spiked human urine (diluted 1/200) with interferents (salicylate, naproxen) | PARAFAC (four-way/third-order calibration); no constraints applied | | [80] |
| Doxorubicin (DXR) monitoring in plasma | Human blood plasma (10 healthy volunteers, spiked 0.75–11.25 µg/mL) | PARAFAC (2 factors) → linear regression; N-PLS (2 factors, leave-one-out CV) | | [81] |
| Dabigatran etexilate/dabigatran quantification | Spiked human plasma and urine (7 calibration, 3 blank, 5 spiked prediction samples per matrix) | SWANRF (trilinear), MCR-ALS (bilinear), U-PLS/RBL (latent variables) | | [82] |
| Proteinaceous biotoxin classification | 14 protein samples in solid powder form (4 biotoxins + 10 harmless proteins); measured at 45°; λex 260–315 nm, λem 270–420 nm; 75/25 split | Four schemes: EEM→MLP, EEM→PCA→MLP, EEM→RF→MLP, EEM→RF→PCA→MLP | | [83] |
| trans-Resveratrol (RVT) quantification in plasma | Human plasma (5 healthy volunteers, diluted 10×, buffered pH 5.0); spiked 0.10–5.00 µg/mL | PARAFAC + SOSAM (Second-Order Standard Addition Method); univariate linear regression | | [84] |
| Cell viability assessment (UV-induced cytotoxicity) | A375 melanoma cells exposed to UV (λ = 365 nm); 72 EEMs (6 exposures × 12 replicates) | UPLS regression; 75/25 train/test split; Venetian blinds CV | | [85] |
| Cell viability prediction (oxaliplatin cytotoxicity) | A375 (melanoma) and HaCaT (keratinocytes) treated with oxaliplatin (5–1000 µM); 144 EEMs (2 lines × 6 concentrations × 12 replicates) | PARAFAC (5 components) → MLR on PARAFAC scores; cross-validation; Tukey post hoc tests | | [23] |
| Antimicrobial resistance (E. coli, K. pneumoniae) | Bacterial suspensions in phosphate buffer (75 samples; 5 replicates per species) | PCA → 2D-LDA, 2D-PCA-LDA/QDA/SVM; Unfolded: UPCA-QDA/SVM, USPA-QDA/SVM, UGA-QDA/SVM | | [86] |
| Pathogenic bacteria identification (species and Gram classification) | Bacterial suspensions (8 species: 4 G+/4 G−); DMAFa fluorescent dye; 595 EEMs (cross-validation dataset) | Convolutional Neural Network (CNN, MATLAB); K-fold cross-validation; independent test set | | [87] |
| COVID-19 (SARS-CoV-2) detection | Blood serum (n = 106; 80 RT-PCR positive, 26 negative); diluted 1:1000 in PBS pH 7.4; 40/66 split (Kennard–Stone) | PARAFAC → SIMCA/DD-SIMCA/PCA-DA/PLS-DA; Unfolded EEM → same classifiers; Single λex 280 nm → SIMCA/PLS-DA | | [88] |
| Respiratory virus classification (8 virus types) | Inactivated virus solutions (8 types, 5 samples each, n = 40); 1.0 × 10⁷ PFU/mL; 28/12 train/test split | PCA → RF and SVM; mid-level data fusion (SNV-FFT, SNV-WT, FFT-WT, SNV-FFT-WT); K-fold CV | | [89] |
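
Several of the studies summarized above partition their samples with the Kennard–Stone algorithm (for example, the 70/30 split in [70] and the train/validation/test split in [24]). As an illustration only, the following minimal sketch shows the max-min selection rule usually associated with that algorithm, assuming each sample has already been reduced to a feature vector (such as PARAFAC or PCA scores) and that Euclidean distance is used. The function name `kennard_stone_split` and the synthetic data are hypothetical and are not taken from any of the reviewed studies.

```python
import math
import random


def kennard_stone_split(X, n_train):
    """Return (train_idx, test_idx) chosen by the Kennard-Stone max-min rule.

    X is a list of equal-length feature vectors (hypothetical input format).
    """
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")

    # Pairwise Euclidean distances between all feature vectors.
    dist = [[math.dist(a, b) for b in X] for a in X]

    # Seed the training set with the two mutually farthest samples.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: dist[p[0]][p[1]],
    )
    selected = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]

    # For each unselected sample, track the distance to its nearest selected one.
    min_d = {k: min(dist[k][i0], dist[k][j0]) for k in remaining}

    while len(selected) < n_train:
        # Pick the sample that is farthest from everything selected so far.
        nxt = max(remaining, key=lambda k: min_d[k])
        selected.append(nxt)
        remaining.remove(nxt)
        del min_d[nxt]
        for k in remaining:
            min_d[k] = min(min_d[k], dist[k][nxt])

    return selected, remaining


# Synthetic stand-in for 10 samples described by 5 scores each (assumed data).
random.seed(0)
X = [[random.random() for _ in range(5)] for _ in range(10)]
train_idx, test_idx = kennard_stone_split(X, n_train=7)
```

Because the rule favors samples that span the feature space, the training set tends to cover the extremes of the data while the held-out samples fall inside that range; whether the reviewed studies applied it to raw, unfolded, or decomposed EEM data is not specified here.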
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Pérez Hincapié, M.; Arana, V.A.; García-Alzate, R.; Lozano-Arias, D.; Trilleras, J. Excitation–Emission Fluorescence Spectroscopy Combined with Machine Learning for Biomedical Diagnostics: A Systematic Review. Sci 2026, 8, 148. https://doi.org/10.3390/sci8070148

