1. Introduction
Autoimmune diseases are a group of disorders in which the immune system mistakenly attacks the body’s own tissues and organs. They are characterized by a wide variety of types, systemic symptoms, and complex pathogenesis. This complexity is reflected in the fact that a single autoimmune disease typically presents as a combination of symptoms, while a single symptom may indicate the pathological processes of several autoimmune diseases [1]. Consequently, these diseases pose significant challenges for clinical diagnosis. Typical autoimmune diseases include ankylosing spondylitis (AS), osteoarthritis (OA), and Sjögren’s syndrome (SS). AS, also known as axial spondyloarthritis, is a chronic inflammatory disease primarily affecting the axial skeleton [2]. The pathogenesis of AS has not been fully elucidated; it is generally believed to involve interactions among genetic, immune, and environmental factors, including abnormal expression of the HLA-B27 antigen and immune system dysregulation [3]. OA is a whole-joint disease involving structural changes in articular cartilage, subchondral bone, ligaments, joint capsules, synovium, and periarticular muscles, with pain as its primary symptom [4]. The pathogenesis of OA is mainly attributed to an imbalance between pro-inflammatory and anti-inflammatory mediators, which leads to low-grade inflammation, cartilage degradation, bone remodeling, and synovial hyperplasia [5]. SS is a systemic chronic autoimmune disease characterized by immune-mediated damage to the salivary and lacrimal glands, resulting in xerostomia (dry mouth) and xerophthalmia (dry eyes) [6]. SS is generally considered to be caused by factors such as congenital immune barrier dysfunction, infections by pathogens such as viruses, and genetic factors.
Clinical diagnostic approaches for the aforementioned diseases typically include clinical symptom assessment, antigen detection, autoantibody testing, biopsy, and imaging examinations [7]. However, conventional diagnostic methods have several limitations: clinical symptom assessment relies on the judgment of experienced physicians; biopsy is often highly invasive; autoantibody testing and imaging examinations are relatively costly; and other diseases that cause similar symptoms must be excluded before a diagnosis can be confirmed. Raman spectroscopy is a non-invasive, non-destructive analytical technique that provides molecular fingerprints of samples and quantitative information about their chemical composition. It possesses multiple properties beneficial for medical diagnosis, including high chemical specificity [8], and thus holds great potential in disease diagnosis. With the advancement of artificial intelligence, the integration of spectroscopy with machine learning and deep learning has helped to address the limitations of traditional diagnostic methods for autoimmune diseases, allowing Raman spectroscopy to play an increasingly important role in the biomedical field, particularly in disease diagnosis [9,10,11,12,13,14]. Leng et al. [15] utilized Raman and Fourier-transform infrared (FTIR) spectra from a total of 119 patients with different types of cancer, performing low-level fusion and feature fusion on the spectra. By employing classifiers such as support vector machine (SVM) and convolutional neural network–long short-term memory (CNN–LSTM), they improved the accuracy of the fusion model by approximately 10%. Yong et al. [16] classified Raman spectra collected from the superficial and deep layers of cartilage in 45 patients with osteoarthritis and 19 patients with osteoporosis (serving as healthy controls). Using a multi-convolutional neural network with sixfold cross-validation, they achieved higher classification accuracy. Yang et al. [17] extracted multi-scale features from surface-enhanced Raman spectroscopy (SERS) data via wavelet transform and constructed a rapid detection method for liver cancer samples by combining data augmentation and deep learning techniques, achieving an accuracy of 99.38%. Cao et al. [18] designed a one-dimensional residual convolutional neural network (1D-ResNet) architecture to classify colorectal cancer tumor tissues, and visualized and interpreted the fingerprint peaks identified by the deep learning model; their study achieved an accuracy of 98.5% in colorectal cancer detection.
However, existing studies often assume that model training is conducted under the condition that all data are labeled, i.e., supervised learning. In practice, annotating medical data is labor-intensive, resource-consuming, and costly. Furthermore, considering patient privacy, techniques such as de-identification and data desensitization may be required for data processing. Additionally, annotations typically rely on physicians’ clinical diagnostic outcomes. For autoimmune diseases—characterized by complex pathogenesis, high heterogeneity, and the need for exclusionary diagnosis—the accuracy of labels is, to some extent, questionable. Unsupervised learning, which can be performed without labeled training data, enables the learning of internal data structures and distributions, as well as the discovery of potential relationships within the data. It effectively addresses the issue of insufficient data and annotations, holding significant implications for the practical application of artificial intelligence in clinical diagnosis. Although unsupervised learning has been widely applied in the fields of machine learning and deep learning with increasingly mature techniques, research on autoimmune disease diagnosis methods addressing insufficient data annotation remains relatively limited. The framework proposed in this study aims to fill the gap in this field, provide a solid foundation for subsequent research, and further enhance the diversity and feasibility of this research direction.
Transfer learning is a class of machine learning methods that leverage existing knowledge to train or refine models in new domains where data or labels are scarce, thereby improving model performance. Deep neural network-based transfer learning has already played a significant role in fields such as computer vision, signal processing, bioinformatics, and recommendation systems [19,20,21]. Domain adaptation is a specific form of transfer learning and shares its goal of transferring knowledge between source and target domains. However, domain adaptation focuses on reducing the distribution discrepancy between the source and target domains to enable semi-supervised or unsupervised model training on the target domain. Both transfer learning and domain adaptation have extensive applications and great potential in Raman spectroscopy. Jiaqi Hu et al. [22] performed pre-training using 10,000 Raman spectra of 200 substances from the RRUFF database and evaluated the feature transfer performance of CNN-1D, ResNet-1D, and Inception-1D on collected pesticide Raman spectral data. Chen et al. [23] selected three sets of serum Raman spectral data as the source domain and two sets as the target domain, with data augmentation applied. They trained three deep neural network models (CNN-LSTM, GoogLeNet, and ResNet) on the source domain data for disease diagnosis, transferred the models to the target domain, and further improved model performance by constructing a decision-level fusion model combined with logistic regression. Yu Yao et al. [24] implemented unsupervised multiplex biomolecular detection by conducting domain adaptation on 15 different Raman spectra in suspension array technology (SAT) based on Raman spectral encoding. Existing studies have combined Raman spectroscopy with unsupervised learning and have made significant contributions across various application domains. However, there remains a lack of systematic research focused specifically on the unsupervised diagnosis of diseases, particularly autoimmune diseases, using Raman spectroscopy. As summarized in Table 1, a comparison between previous Raman spectroscopy studies combined with AI and this study highlights our distinct approach.
This study addresses the challenges in Raman spectroscopy-based diagnosis of autoimmune diseases, including AS, OA, and SS: the difficulty of acquiring spectral data and labels, and the diagnostic complexity arising from the intricate and diverse etiologies of these diseases. We propose an unsupervised domain adaptation framework with consensus voting for pseudo-label generation. Specifically, we generate pseudo-labels for the target domain by leveraging votes from source domains whose feature distributions are similar to that of the target domain, select high-confidence pseudo-labeled samples to compute and optimize the mc-loss [25] and the cross-entropy loss, and update the parameters of the feature extractor accordingly. Furthermore, we draw on the domain discriminator in conditional domain adaptation [26], concatenating features and labels from both the source and target domains to condition the training of the adversarial network. Our framework achieved the best performance in mutual adaptation experiments using serum Raman spectral data from three homologous autoimmune diseases. Additionally, we conducted transfer experiments in which Raman spectral data of three common non-homologous cancers were transferred to autoimmune disease data, achieving unsupervised disease diagnosis and further validating the generalization capability of our model.
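The consensus-voting step described above can be illustrated with a minimal sketch. It assumes that each selected source-domain classifier returns a class-probability vector for every target sample. The function name `consensus_pseudo_labels`, the agreement rule (by default, all voters must agree), and the confidence threshold of 0.9 are illustrative assumptions rather than the settings used in this study.

```python
from collections import Counter

def consensus_pseudo_labels(prob_votes, min_conf=0.9, min_agree=1.0):
    """prob_votes[v][i] holds the class probabilities that voter v assigns
    to target sample i. Returns {sample_index: pseudo_label} for the
    samples that pass both the agreement and the confidence check.
    All thresholds here are hypothetical defaults."""
    n_voters = len(prob_votes)
    n_samples = len(prob_votes[0])
    selected = {}
    for i in range(n_samples):
        # Hard prediction (argmax) of each voter for sample i
        preds = [max(range(len(p[i])), key=p[i].__getitem__) for p in prob_votes]
        label, count = Counter(preds).most_common(1)[0]
        # Mean probability the voters assign to the winning label
        mean_conf = sum(p[i][label] for p in prob_votes) / n_voters
        if count / n_voters >= min_agree and mean_conf >= min_conf:
            selected[i] = label
    return selected

# Toy probabilities from two voters for three target samples (made-up values)
votes = [
    [[0.95, 0.05], [0.40, 0.60], [0.10, 0.90]],
    [[0.97, 0.03], [0.70, 0.30], [0.05, 0.95]],
]
print(consensus_pseudo_labels(votes))  # {0: 0, 2: 1}
```

In this toy example the second sample is discarded because the two voters disagree; only samples that pass both checks would contribute to the losses computed on the target domain.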
4. Interpretability
For end-to-end deep network frameworks, Grad-CAM [37] provides an effective way to interpret the network. Specifically, Grad-CAM backpropagates the network’s predicted score to obtain its gradients with respect to the feature maps of a chosen layer, averages these gradients within each channel to obtain a set of channel weights, and then performs a weighted summation over all channels followed by an activation. The resulting map can be regarded as the degree of contribution of different positions in the feature map to the network’s prediction. We applied this technique to the Raman spectroscopy transfer tasks to plot contribution heatmaps. Figure 5 shows the spectral heatmaps of all homologous disease transfer tasks. In Raman spectroscopy, different peak positions reflect specific molecular structures and their vibrational information in the measured sample, thereby indicating the presence of different substances. Peak positions with large weights in the heatmaps may therefore indicate that the corresponding substances are important for the diagnosis of autoimmune diseases or are candidate biomarkers. To explore this possibility, we analyzed the spectra in Figure 5.
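The Grad-CAM weighting described above can be sketched for a one-dimensional signal as follows. The sketch assumes that the feature maps of the chosen layer and the gradients of the predicted score with respect to them have already been extracted from the network (for example, through an automatic differentiation framework). The function name `grad_cam_1d` and the scaling of the map by its maximum are illustrative assumptions.

```python
def grad_cam_1d(activations, gradients):
    """activations, gradients: lists of C channels, each a list of T positions.
    Returns a length-T contribution map scaled by its maximum (an assumption)."""
    n_pos = len(activations[0])
    # Channel weight = gradient averaged over all positions of that channel
    weights = [sum(g) / len(g) for g in gradients]
    # Weighted sum of the feature maps, then ReLU keeps only positive evidence
    cam = [
        max(0.0, sum(w * a[t] for w, a in zip(weights, activations)))
        for t in range(n_pos)
    ]
    peak = max(cam)
    return [c / peak for c in cam] if peak > 0 else cam

# Toy feature maps and gradients: 2 channels, 3 positions (made-up values)
acts = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
grads = [[1.0, 1.0, 1.0], [-1.0, 0.0, -2.0]]
print(grad_cam_1d(acts, grads))  # [1.0, 1.0, 0.0]
```

In practice, the map is usually interpolated from the feature-map length back to the length of the input spectrum before it is drawn as a heatmap over the wavenumber axis.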
Tasks with the same target-domain disease are grouped together, dividing the six tasks into three groups that correspond to the diagnosis of OA, AS, and SS, respectively. Peaks with a contribution value greater than 0.5 (shown in red in Figure 5) when different diseases were transferred to the target disease were defined as the key characteristic peaks for model-based diagnosis. Table 8 presents the key characteristic peaks across all tasks and their occurrence frequencies. Across the six homologous disease transfer tasks, the features identified as key characteristic peaks may represent traits shared by homologous autoimmune diseases; these features were effectively expressed and transferred to the target domain during domain adaptation. Among them, the peaks at 924.02 cm⁻¹, 929.19 cm⁻¹, 990.84 cm⁻¹, and 1316.85 cm⁻¹ exhibited high contribution in all homologous tasks, indicating that they may correspond to key biomolecules influencing disease diagnosis. The biomolecules represented by the key characteristic peaks in Table 8 are analyzed below across different diseases to identify specific disease markers.
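The selection of key characteristic peaks and the counting of their occurrence frequencies can be sketched as follows. The 0.5 threshold and the wavenumbers are taken from the text, whereas the function name `key_peak_frequencies`, the task names, and the contribution values in the example are made up purely for illustration and are not results of this study.

```python
from collections import Counter

def key_peak_frequencies(task_maps, wavenumbers, threshold=0.5):
    """task_maps: {task_name: contribution values aligned with wavenumbers}.
    Returns {wavenumber: number of tasks in which it exceeds the threshold}."""
    counts = Counter()
    for contributions in task_maps.values():
        for shift, value in zip(wavenumbers, contributions):
            if value > threshold:
                counts[shift] += 1
    return dict(counts)

# Hypothetical contribution values for two transfer tasks
wavenumbers = [924.02, 990.84, 1316.85]
task_maps = {
    "OA->AS": [0.8, 0.4, 0.9],
    "SS->AS": [0.7, 0.6, 0.2],
}
print(key_peak_frequencies(task_maps, wavenumbers))
```

A peak whose count equals the number of tasks is a key characteristic peak in every task, which corresponds to the peaks highlighted in the paragraph above.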
Proline, valine, phenylalanine, and guanine (B, Z-marker), as well as lipids, contributed more than 0.5 in all tasks, suggesting, to some extent, their specificity in the diagnosis of autoimmune diseases. Proline is a non-essential amino acid that is required for the synthesis of collagen, one of the main components of cartilage. It has been shown [38] that circUbqln1, a non-coding RNA, promotes OA by affecting proline metabolism: circUbqln1 upregulates the transcriptional and enzymatic activities of proline dehydrogenase (PRODH), which accelerates proline depletion and leads to the accumulation of its metabolite P5C (pyrroline-5-carboxylic acid), which may interfere with normal collagen synthesis. Proline was also one of the best biomarkers for differentiating autoimmune neuroinflammation from controls, with an area under the receiver operating characteristic curve (AUC) of 0.77 [39]. Valine is generally thought to be associated with the development of autoimmune neuroinflammatory disorders, and it belongs to a larger group of metabolites that consists primarily of amino acids, amino acid metabolites, and acetyl carnitine. It has also been noted that the serum metabolomic profiles of patients with SS and healthy controls (HC) can be distinguished by 21 significant metabolites, including elevated levels of alanine and valine [40]. Methionine or its metabolites may be implicated in the pathogenesis of autoimmune diseases in some cases; for example, methionine plays a key role in the activity of α1-antitrypsin (AAT), especially in its active center loop, and oxidation of methionine can impair its inhibition of proteases such as neutrophil elastase, which in turn affects certain disease processes [41].
Furthermore, in autoimmune diseases such as Sjögren’s syndrome (SS), cholesterol metabolism plays a key role in T-cell biology [42]. Cholesterol helps maintain the stability of cell membranes and regulates their fluidity. In addition, cholesterol is involved in the formation of key structures such as lipid rafts, major histocompatibility complex molecules, and T-cell receptors, all of which are necessary for adaptive immunity.
5. Conclusions and Discussion
This study systematically explores the application of unsupervised domain adaptation (UDA) to Raman spectroscopy analysis, providing an effective solution to the domain shift problem caused by data discrepancies. The proposed CDAN-PL framework demonstrates significant performance advantages over traditional supervised learning methods in scenarios with extremely limited labeled data and outperforms other state-of-the-art (SOTA) domain adaptation algorithms. Experimental results show that in homologous disease transfer tasks, CDAN-PL achieves the highest accuracy among the compared models, with an average accuracy of 92.3% across all tasks. In addition, when three common types of cancer were used as source-domain data, the framework still performed unsupervised diagnosis of autoimmune diseases effectively, with an average accuracy of 90.05%, supporting the generalization ability of our model. Furthermore, using Grad-CAM, we identified spectral peaks that contribute significantly to the diagnostic results, and the substances corresponding to these peaks may become important biomarkers for future diagnosis and research. Overall, this study provides a new perspective and method for the unsupervised diagnosis of autoimmune diseases.
In summary, the CDAN-PL framework provides initial evidence for the feasibility of autoimmune disease diagnosis based on unlabeled Raman spectra. However, this study still has several limitations, including a limited sample size and a single data modality. A total of 297 samples from various autoimmune diseases and their corresponding control groups, as well as 154 samples from various types of cancer and their control groups, were included in this study. Because domain adaptation models typically rely on large-scale data to ensure reliable knowledge transfer, we adopted the Synthetic Minority Oversampling Technique (SMOTE) to augment the disease group and the control group separately in each experiment, increasing the sample size of each category to 300. Nevertheless, considering that excessive use of SMOTE may distort the original statistical distribution of the data [43], no further expansion of the augmentation scale was conducted. In the future, we plan to recruit more real-world samples and develop more specialized algorithms for few-shot domain adaptation, aiming to further improve the performance and application value of Raman spectroscopy in unsupervised and supervised diagnostic tasks.
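The oversampling idea behind SMOTE can be illustrated with a simplified sketch: each synthetic spectrum is placed at a random point on the line segment between an existing sample and one of its nearest neighbours from the same class. The function name `smote`, the neighbour count `k=5`, and the fixed random seed are illustrative assumptions; this is not the implementation used in this study, and a maintained library implementation would normally be preferred.

```python
import random

def smote(samples, n_target, k=5, seed=0):
    """Oversample one class to n_target samples by interpolating between a
    randomly chosen sample and one of its k nearest same-class neighbours.
    Requires at least two input samples. Simplified illustration only."""
    rng = random.Random(seed)
    out = [list(s) for s in samples]
    while len(out) < n_target:
        base = rng.choice(samples)
        # Remaining samples sorted by squared Euclidean distance to `base`
        others = sorted(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        out.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return out

# Toy two-point "spectra"; real spectra would have one value per wavenumber
spectra = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
augmented = smote(spectra, n_target=10, k=2)
print(len(augmented))  # 10
```

Because every synthetic point lies between two real samples, the augmented set stays within the range of the original data, but it adds no new information about the underlying distribution, which is consistent with the caution about excessive oversampling noted above.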
Furthermore, although the current model relies solely on spectral features derived from blood samples, it is well established that clinical variables such as age and gender can significantly influence both susceptibility to and progression of autoimmune disorders [44,45]. Future extensions of this work could therefore benefit from integrating such demographic and clinical metadata, either through multimodal learning architectures or hybrid predictive frameworks. Incorporating these factors would enable more personalized and context-aware diagnostic support, thereby aligning computational predictions more closely with established clinical practice.