Article

Criteria and Protocol: Assessing Generative AI Efficacy in Perceiving EULAR 2019 Lupus Classification

Progentec Diagnostics Inc., Oklahoma City, OK 73104, USA
*
Author to whom correspondence should be addressed.
Diagnostics 2025, 15(18), 2409; https://doi.org/10.3390/diagnostics15182409
Submission received: 1 July 2025 / Revised: 24 August 2025 / Accepted: 19 September 2025 / Published: 22 September 2025
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)

Abstract

Background/Objectives: In clinical informatics, the term ‘information overload’ is increasingly used to describe the operational impediments of excessive documentation. While electronic health records (EHRs) are growing in abundance, many medical records (MRs) remain in legacy formats that impede efficient, systematic processing, compounding the challenges of care fragmentation. Thus, there is a growing interest in using generative AI (genAI) for automated MR summarization and characterization. Methods: MRs for a set of 78 individuals were digitized. Some were known systemic lupus erythematosus (SLE) cases, while others were under evaluation for possible SLE classification. A two-pass genAI assessment strategy was implemented using the Claude 3.5 large language model (LLM) to mine MRs for information relevant to classifying SLE vs. undifferentiated connective tissue disorder (UCTD) vs. neither via the 22-criteria EULAR 2019 model. Results: Compared to clinical determination, the antinuclear antibody (ANA) criterion (whose results are crucial for classifying SLE-negative cases) exhibited favorable sensitivity 0.78 ± 0.09 (95% confidence interval) and a positive predictive value 0.85 ± 0.08 but a marginal performance for specificity 0.60 ± 0.11 and uncertain predictivity for the negative predictive value 0.48 ± 0.11. Averaged over the remaining 21 criteria, these four performance metrics were 0.69 ± 0.11, 0.87 ± 0.04, 0.54 ± 0.10, and 0.93 ± 0.03. Conclusions: ANA performance statistics imply that genAI yields confident assessments of SLE negativity (per high sensitivity) but weaker positivity. The remaining genAI criterial determinations support (per specificity) confident assertions of SLE-positivity but tend to misclassify a significant fraction of clinical positives as UCTD.

Graphical Abstract

1. Introduction

Healthcare fragmentation can pose major impediments to the effective treatment of chronic illness, especially for patients with lengthy medical histories who have transitioned across different providers or who require the attention of multiple medical services [1]. A crucial barrier is the limited interoperability in medical records, such that providers caring for a given patient are faced with the task of manually ascertaining care-critical information from documents in disparate formats [2].
One medical discipline where fragmentation is especially problematic is the care of intractable disorders such as systemic lupus erythematosus (SLE) [3], for which informed treatment is often contingent on harvesting valuable insight from case histories. Manual processing of voluminous, long-duration medical records (MRs) encoded in legacy formats can overlook critical details—a problem relating to what is known both informally and formally as ‘information overload’ [4,5,6,7]. Such problems may prove addressable using generative artificial intelligence (genAI) to intelligently automate document parsing [8], but doing so will first require characterizing the incumbent challenges and opportunities. A recent review by Sequí-Sabater and Benavent [9] surveyed the adoption of AI within rheumatological practice but focused more on machine learning and deep learning, citing genAI as only a support tool for research, communications, and clinical decisions, without mentioning MR deconvolution. This suggests, as Sequí-Sabater and Benavent imply [9], that serious consideration of how genAI may impact medical practice is just beginning.

1.1. Prospective Roles for AI in Medical Record Extraction

Hindering the medical adoption of genAI is concern over the risk and reproducibility [8,10,11] relating to the underlying stochastic algorithm; i.e., multiple replicate runs of the same code reading the same documents may yield different results. This limits the range of target applications to support functions which, to make a human analogy, value ‘subjective perception’ as much as ‘objective precision’. Thus, using genAI to enumerate and sort evidence may attain greater trust than automatically ranking observations or proposing conclusions. Further, this conservative approach aligns well with addressing MR-driven information overload. Ideally, genAI document parsers may soon tame the overload problem associated with bulky MRs and do so according to physician-specified goals at reliable levels of specificity, sensitivity, and consistency.

1.2. Potential Applications for Rheumatology

GenAI-augmented information extraction may prove broadly applicable across medicine, but there is value in beginning with well-defined protocols based on specific observations, tests, or medication profiles of established value to well-defined medical objectives. As a discipline well-rooted in information-intensive decision-making [12,13], rheumatology abounds in such systematic protocols. Examples include well-validated, criterially specific protocols to assess or classify rheumatoid arthritis [14], scleroderma [15], gout [16], fibromyalgia [17], psoriatic arthritis [18], idiopathic inflammatory myopathies [19], and systemic lupus erythematosus (SLE) [20]. These protocols encode statistically significant cross-correlations among laboratory tests, imaging, qualitative clinical observations, patient feedback, etc., relative to patient status markers for a given disease or treatment. With documented clinical sensitivities and specificities all ranging between 80 and 99% [14,15,16,17,18,19,20], derived from moderate criterial bases (ranging from three to roughly two dozen terms), such protocols offer viable, potentially excellent targets for genAI emulation.

1.3. Applications to SLE Classification Criteria

In a recent pilot demonstration [21], we assessed the Claude-3-Haiku large language model (LLM) [22] as a means for mining MRs for criteria associated with the 1997 American College of Rheumatology (ACR) (referred to herein as ‘ACR 1997’) SLE classification protocol [23]. Medical visit annotations and test results were analyzed for 78 patients, including 46 with eventual SLE-positive (SLE+) clinical classifications, 18 consigned to ‘undifferentiated connective tissue disorder’ (UCTD), and 14 cases with inadequate evidence of SLE or adjacent conditions (SLE-).
The pilot study provided proof-of-principle insight into genAI suitability for mining medical criteria, but increased utilization of the 2019 joint ACR/EULAR classification standard [20] (hereafter called EULAR 2019) has motivated this sequel study. ACR 1997 and EULAR 2019 achieve similar levels of predictive specificity (both roughly 93.4%) [20,23], but EULAR 2019 achieves superior sensitivity (96.1%) compared to ACR 1997 (82.8%) despite considering similar core criteria. The difference in sensitivity arises from three significant adjustments to the decision-making structure: EULAR 2019 derives greater sensitivity from finer criterial graining (22 relatively narrow criteria, replacing 11 broader terms), trained criterial weighting, and specification of the antinuclear antibodies test (ANA) as an entry criterion [20]. GenAI determinations should mimic and ideally match such superior clinical performance. To realistically assess a general genAI method for such challenges, our updated study emulates the EULAR 2019 protocol in translating real-world medical histories among candidate SLE patients toward preliminary ANA-related triage and the broader criterial differentiation of SLE+ and UCTD cases.

2. Materials and Methods

To establish preliminary performance benchmarks, genAI was employed to assess the MRs of 78 individuals according to evidence mapping to the 22 criteria associated with EULAR 2019 classifications. Specific details follow.

2.1. Available Medical Records

Medical records from two prior clinical studies [24,25] were used in a manner consistent with IRB (institutional review board) specifications (corresponding approvals appended). For the set of patients retrospectively examined for this analysis, details such as demographic composition, recruitment, outcomes, etc., are reported elsewhere [21]. For the purposes of context, the current study focused on 78 individuals, divided equally between two distinct groups:
  • A total of 39 individuals whose pre-classification case histories [24] (hereafter called ‘pre-SC’) covered 1+ years, ending with clinical evaluation for SLE, via either the ACR 1997 or EULAR 2019 protocols.
  • A total of 39 individuals with confirmed SLE cases [25] (hereafter called ‘post-SC’) covering 1+ years, all beginning at some unspecified duration after prior SLE classification.
MRs for all cases included visit annotations and medical test results, all provided as scanned text or tabular documents, converted to flat electronic text using the Textract program for optical character recognition on AWS [26]. All details relating to the acquisition, secure handling, and processing of documents follow our protocol published previously [21]. For pre-SC records, documented clinical determinations were provided for some SLE classification criteria, including 15 classifications using EULAR 2019 [20] and 24 via ACR 1997 [23].

2.2. EULAR 2019 Criteria

For reference, all classification criteria are listed in Table 1. Among these, clinical determinations (explicit or inferred) were available to validate ANA predictions for all 78 cases; however, only 15 cases contained clinical validation for the remaining criteria.
Notable for EULAR 2019 SLE classification criteria is the data heterogeneity, driven by disparate formats across quantitative test results, descriptive observations, medical imaging, etc. Record standards also vary broadly among practices. ANA determinations, for example, may be conveyed numerically (e.g., ‘ANA titer 1:160’), qualitatively (‘positive ANA result’), or descriptively (‘speckled ANA’), and similar variance exists for other criteria. To address this, genAI can benefit from preparative strategies such as augmented-generation (AG) techniques [27] that supplement prompts with extensive related text examples to steer LLM contextualization.
In our case, lacking a large compendium of sample SLE MRs, the alternative is ‘prompt engineering’ [28], where representative phrasing and reporting formats augment the core extraction query specification. Specifically, the rigors of organic LLM training may be supplemented (or, to an extent, replaced) by biased techniques associated with earlier natural language processing (NLP) methods such as intelligent (i.e., discipline-biased) specification of keyword and keyphrase lists that populate manually dictated genAI prompts. Such keys can be evaluated within the test documents (e.g., medical records to be mined) via rule-based or lexicon-based assessment, thus guiding new iterations of prompts that balance the original discipline-biased vocabulary with terms that are actually representative of the target documents.
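The lexicon-refinement loop described above can be sketched in code. This is an illustrative sketch, not the authors' implementation: the seed terms, record snippets, and function names below are hypothetical examples of how discipline-biased keywords might be checked against target documents before being embedded in a prompt.

```python
import re

# Hypothetical discipline-biased seed terms for the ANA criterion.
SEED_TERMS = ["ANA titer", "positive ANA", "speckled ANA", "antinuclear antibody"]

def observed_term_counts(records, terms):
    """Count case-insensitive occurrences of each seed term across records."""
    counts = {t: 0 for t in terms}
    for text in records:
        for t in terms:
            counts[t] += len(re.findall(re.escape(t), text, flags=re.IGNORECASE))
    return counts

def build_prompt(terms):
    """Embed representative phrasings into the core extraction query."""
    examples = "; ".join(f"'{t}'" for t in terms)
    return ("Identify any mention of antinuclear antibody (ANA) testing, "
            f"including formats such as {examples}.")

# Toy stand-ins for OCR-converted record text.
records = ["ANA titer 1:160, speckled pattern.", "Positive ANA result noted."]
counts = observed_term_counts(records, SEED_TERMS)
# Keep terms actually represented in the target documents; unmatched
# seed terms are candidates for revision in the next prompt iteration.
kept = [t for t, n in counts.items() if n > 0]
prompt = build_prompt(kept)
```

In this toy run, the rule-based pass retains only the seed terms that actually appear in the sample records, balancing the initial vocabulary against the documents' own reporting conventions.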

2.3. Search Parameters and Prompt Specifications

Most genAI text retrieval and assessment protocols in this study are identical to those described in detail in our prior study [21] on ACR 1997 criteria [20], basically involving a two-pass assessment of all criteria, where the first pass condenses all available clinical notes, lab test results, imaging reports, ICD-10 codes, and structured and unstructured data formats down to a body of potentially relevant annotations, upon which the second pass performs rigorous relevance scoring. Variations in criterial performance between the current study and the precursor are attributable to use of a different LLM (Claude 3.5, San Francisco, CA, USA [29] for the current study; Claude 3.0, San Francisco, CA, USA [22] for the prior work [21]) and different criterial definitions determined by the clinical decision protocol encoded for the EULAR 2019 [20] and ACR 1997 [23] studies.
There is fairly strong mapping between the core content of the eleven ACR 1997 criteria and the 22 criteria from EULAR 2019, but many sub-criterial factors in ACR 1997 classification are represented as full criteria in EULAR 2019, thus altering various Boolean constructs in our genAI prompts. Among these modifications, the higher emphasis on correct determination of patient ANA status prompted adjustment of the first-pass NLP prompt in the current study to be more permissive. In particular, any mention of ANA titer, pattern, or test, regardless of value, is grounds for first-pass compilation, and strong exclusionary language from the earlier study was discarded, such as restrictions aimed at disqualifying ANA results associated with drug-induced lupus or any other conditions unrelated to SLE.
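The two-pass flow described above can be sketched as follows. This is a hedged outline, not the study's code: `llm` stands in for a call to the Claude API, and the stub, prompts, and records are hypothetical. The first pass permissively compiles potentially relevant annotations from each document; the second pass scores the condensed evidence against the criterion.

```python
def two_pass_assess(llm, documents, find_prompt, scoring_prompt):
    """Pass 1: permissively collect candidate annotations per document.
    Pass 2: rigorously score the condensed evidence for the criterion."""
    snippets = []
    for doc in documents:
        reply = llm(f"{find_prompt}\n\nRecord:\n{doc}")
        if reply.strip():
            snippets.append(reply.strip())
    evidence = "\n".join(snippets)
    return llm(f"{scoring_prompt}\n\nEvidence:\n{evidence}")

def stub_llm(prompt):
    """Deterministic stand-in for an LLM call, for demonstration only."""
    if prompt.startswith("Compile"):
        body = prompt.split("Record:\n", 1)[1]
        return body if "ANA" in body else ""
    return "ANA criterion met" if "1:160" in prompt else "ANA criterion not met"

result = two_pass_assess(
    stub_llm,
    ["ANA titer 1:160 noted at visit.", "Knee pain; no labs."],
    "Compile any mention of ANA testing.",
    "Score whether the EULAR 2019 ANA entry criterion is met.",
)
```

The design point is that the permissive first pass bounds what the stringent second pass ever sees, so exclusionary logic belongs in the second pass where full context is available.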

2.4. Criterial Assessment Procedure

Criterial classification roles are shown in a schema (Figure 1) that supports high sensitivity and specificity for SLE classifications [20]. Relative to the ACR 1997 framework, the EULAR 2019 stipulation of ANA as an entry criterion often reduces clinical effort by short-circuiting the exhaustive determination of criteria 2–22 (Table 1) for ANA- cases. Regardless of ANA status, however, physicians may derive value from genAI-based clinical support across the fuller set of analytical determinations. For example, a current SLE− determination is overruled by prior ANA+ evidence, even if buried deeply within a patient’s medical history. Also, various EULAR 2019 SLE criteria may shed light on alternative diagnoses, even given ANA− status. Our protocol is thus configured to assess all 22 criteria regardless of ANA determination.
NLP criterial assessments were conducted over five replicate runs, aimed at gauging the internal consistency of genAI determinations as a function of prompt composition and generative sampling governed by stochastic temperature and token size [22]. Consistency over the five replicates was scored according to a trivial implementation of Johnson–Lindenstrauss random projection [30]: full consistency (all five replicate determinations agreed with each other) was scored as 1.0, partial consistency (four replicates agreed but one dissented) was scored as 0.6, and weak consistency (three replicates agreed; two dissented) was scored as 0.2.
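The three-level scoring rule above is simple enough to state directly in code. A minimal sketch, assuming binary (positive/negative) replicate determinations so that the majority size is always at least three:

```python
from collections import Counter

def consistency_score(replicates):
    """Score agreement among five replicate determinations:
    5/5 agree -> 1.0, 4/5 agree -> 0.6, 3/5 agree -> 0.2.
    Assumes binary determinations, so majority size is always >= 3."""
    if len(replicates) != 5:
        raise ValueError("expected five replicate determinations")
    majority = Counter(replicates).most_common(1)[0][1]
    return {5: 1.0, 4: 0.6, 3: 0.2}[majority]
```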
For ANA (criterion 1), prediction statistics were computed over all 78 cases and were assessed for known or inferred clinical ANA status. Predicted ANA outcomes fed into the EULAR framework (top of Figure 1) as a basis for SLE- triage. Prediction statistics for all other criteria (those numbered 2–22 in Table 1) were computed for all 27 pre-SC cases with full criterial clinical determinations and were applied toward UCTD vs. SLE+ discrimination, both for categorical branching and final SLE scoring (see Figure 1).
As indicated previously, aspects of the assessment protocol employed in the current study emulate our prior publication [21] and are not repeated verbatim, but a key enhancement is criterial influence analysis, specifically aimed at discriminating UCTD and SLE+ cases according to disproportionate criterial prevalence and weights differentiating the two classes. Analysis was thus conducted to quantify differential weighted influence of each individual criterion relative to all criteria involved in SLE+ determinations (Equation (1a)), and the weighted influence of each individual criterion relative to all criteria involved in UCTD determinations (Equation (1b)), as follows:
\mathrm{Influence}_S = \frac{N_S W_S}{\sum_{i \in \{\mathrm{crit}\ S\}} N_i W_i} \quad \text{(1a)}
\mathrm{Influence}_U = \frac{N_U W_U}{\sum_{i \in \{\mathrm{crit}\ U\}} N_i W_i} \quad \text{(1b)}
where ‘S’ refers to a given criterion (Table 1 criteria 2–22) with influence on the determination of SLE+ cases, NS is the number of total counts of that criterion across the set of SLE+ cases for which criteria were clinically determined, WS is the EULAR-2019 weight of that criterion, ‘U’ refers to a given criterion with influence on determining UCTD cases, NU is the number of total counts of that criterion across the set of UCTD cases for which criteria were clinically determined, and WU is the corresponding weight of that criterion. Denominator summations include all similarly defined criterial count-weight products for any of the criteria with non-zero counts, either among SLE+ cases (Equation (1a)) or for UCTD cases (Equation (1b)).
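Equations (1a) and (1b) share one computation, applied per class. A minimal sketch follows; the criterion names, counts, and weights here are illustrative placeholders, not values from the study's Table 4:

```python
def influence(counts, weights, criterion):
    """Weighted influence of one criterion relative to all criteria with
    non-zero counts within a class (Equations (1a)/(1b)).
    counts: criterion -> total positive determinations within the class.
    weights: criterion -> EULAR 2019 criterial weight."""
    denom = sum(counts[c] * weights[c] for c in counts if counts[c] > 0)
    return counts[criterion] * weights[criterion] / denom

# Hypothetical per-class tallies for three criteria within the SLE+ class.
counts = {"joint involvement": 30, "anti-dsDNA": 12, "fever": 8}
weights = {"joint involvement": 6, "anti-dsDNA": 6, "fever": 2}
share = influence(counts, weights, "joint involvement")  # 180 / 268
```

By construction the influences of all counted criteria within a class sum to 1, so the metric expresses each criterion's fractional share of the class's total weighted evidence.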

2.5. Statistical Analysis

For standard errors of criterial assessments, Wald interval analysis and Wilson score interval analysis [31] indicated that, to 95% confidence, no criterion exhibited an approximately even balance between positive and negative clinical determinations. This precludes the normal-approximation presumption of a binomial distribution (e.g., Wald analysis) in determining standard error. Consequently, 95% confidence intervals for all derivative performance statistics (i.e., accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and weighted criterial influence) were based on Wilson score interval determination. Considerations of the small sample sizes prompted the use of the Fisher exact test for qualitative assertions of certitude, rank, or superiority, all of which employed a p-value of 0.05 as the upper limit for null-hypothesis rejection.
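For reference, the Wilson score interval used throughout can be computed directly from the standard closed form; a minimal sketch (the sample inputs are illustrative, not tied to any specific table entry):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion.
    Unlike the Wald interval, it remains well-behaved for proportions
    far from 0.5 and for small n."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_interval(61, 78)  # e.g., 61 concordant calls out of 78
```

Note that the interval is asymmetric about the point estimate, which is why the criterial confidence intervals reported in the Results need not be centered on the quoted statistic.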

3. Results

Broader details of clinical and predictive statistics are provided for all EULAR 2019 criteria in Appendix A, but, for brevity, we focus herein on a select subset chosen for significance relative to our main foci of ANA-based triage and the differentiation of SLE+ and UCTD classes. Table 2 below thus targets eight criteria of prospectively elevated clinical and/or genAI impacts.
Notably, while genAI predictive metrics varied substantially across criteria, the summary statistics for all 22 criteria (bottom row) reveal that the specificity and NPV are greater than the sensitivity and PPV (p < 0.02). The diminished sensitivity and PPV scores reflect a low incidence of clinical positives; i.e., only four criteria (ANA, NSA, oral ulcers, and joint involvement) were clinically identified in at least 10% of cases.

3.1. ANA Results

The ANA summary statistics diverged from other SLE criteria by demonstrating strong sensitivity (95% CI: {0.68, 0.87}) and PPV (95% CI: {0.77, 0.93}), while specificity and NPV were marginal. The ANA divergence largely arose from its greater rate of determination among studied cases and higher clinical positivity. Given its unique weight in SLE classification, raw ANA results are outlined in Table 3 to illustrate the origins of this trend.
Table 3 indicates that high TP rates underlie the strong sensitivity and PPV scores. Conversely, TN, FN, and FP counts are all smaller and of similar magnitude, thus suppressing the specificity and NPV. These latter deficiencies (see Table 2) were driven by 14 post-SC cases (all clinically confirmed SLE+) for which manual scrutiny of the MR documents revealed that the genAI first pass generally uncovered ANA status indicators, while the more stringent second-pass scoring was less reliable in confirming that these cases legitimately met EULAR 2019 standards; this was largely because the relevant ANA tests with positive determinations occurred prior to the one to two years of compiled medical history made available to this study. Conversely, there were five pre-SC cases where computational analysis yielded ANA+ indications with supporting quantitative evidence (1:40 titers) that would have been rejected by human scrutiny, plus three other records stating ANA positivity but lacking quantitative corroboration. The final putative FP appears to be an instance where MR evidence contained a clear indication of ANA positivity at 1:80 titers that, apparently, was overlooked during the clinical SLE classification.
Despite criterial deficiencies, genAI findings and clinical SLE classifications led to similar distributions of predictive outcomes. Figure 2A (plotting case attrition after each decision branch) reveals analogous trends (upper portion of Figure 2A) for both real clinical evaluation (blue) and genAI simulation (brown). Final clinical and genAI classification ratios (Figure 2B) suggest a rough similarity in the ratios of SLE− to UCTD to SLE+ classifications, even if class concordances (colored heatmap) are marginal. In total, genAI achieved an exact match with clinical classifications in 54% of cases (predominantly SLE+ true positives), while the dominant mismatch (18% of all cases) involved clinical SLE+ cases for which genAI predicted SLE−, dominated by the aforementioned ANA false negatives.

3.2. Criteria Differentiating UCTD Versus SLE+

In comparing clinical SLE+ vs. UCTD classifications, our analysis reveals some inter-class variation in the relative prevalence of positive criterial determinations. While formal clinical determinations of many criteria are sparse (see Appendix A, Table A1), some criteria (e.g., joint involvement) are well documented in MRs, thus potentially adding insight that, prior to SLE classification, could help to inform clinical priorities. To such ends, Table 4 details the relative influence (see Equations (1a) and (1b)) for selected genAI criterial determinations.
The analytical value of Table 4 may reside in documenting those metrics with influences that differentiate SLE+ from UCTD and vice versa. For this, the differential influence of anti-Smith antibodies (ADS) (p < 0.004) and low C3/C4 (p < 0.03) statistically favored clinical SLE+ determinations, while proteinuria (p < 0.009) and non-scarring alopecia (NSA) (p < 0.015) had elevated representation in UCTD determinations.

4. Discussion

In healthcare and biomedicine, genAI has emerged as a proverbial solution in search of an application. The incumbent algorithms have been tested extensively relative to tasks such as MCAT test-taking [32] or medical diagnoses [33], achieving a performance that seems intriguing but consistently fails to attain levels required for stringent medical practice [32,33]. While genAI may still rise to parity with medical experts, it may be that other medically important objectives will prove more readily attainable. In particular, this study seeks to assess the potential value of genAI toward alleviating information overload scenarios that exacerbate the widespread problem of fragmented care for chronic disease.
Clinical classification of SLE often presents laborious demands for patient evaluation and lab testing. As outlined in Figure 1, the EULAR 2019 framework presents opportunities to reduce this burden via decision branches, including the triage of SLE cases based on failure to achieve the ANA entry criterion, followed by applying a ‘first-to-ten’ pursuit of SLE+ (i.e., a patient is classified with SLE upon reaching ten criterial points, regardless of the total number of criteria evaluated), provided at least one criterion is ‘clinical’.
As discussed earlier, the advantage of this protocol is that decisions may be reached well before all criteria are fully assessed. A possible disadvantage, however, is that some patients may end up with incorrect SLE− or UCTD assessments if useful criterial evidence from past medical history is overlooked—a risk exacerbated by long MRs in legacy formats. Automated MR mining via genAI may help mitigate this risk.
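The decision branches described above can be sketched as a short classifier. This is a hypothetical simplification, not the study's code: it omits the EULAR 2019 rule that only the highest-weighted criterion per domain contributes points, and the `"UCTD"` label for non-classified ANA+ cases is an illustrative stand-in for the study's fuller branching.

```python
def eular_2019_sketch(ana_positive, criterion_points, has_clinical_criterion):
    """Simplified sketch of the Figure 1 flow.
    ana_positive: outcome of the ANA entry criterion.
    criterion_points: attributed weighted points for criteria 2-22.
    has_clinical_criterion: True if at least one attributed criterion
    is 'clinical' (as opposed to immunological)."""
    if not ana_positive:
        return "SLE-"  # triaged at the entry criterion
    # 'First-to-ten': classify SLE upon reaching ten weighted points,
    # provided at least one clinical criterion is present.
    if sum(criterion_points) >= 10 and has_clinical_criterion:
        return "SLE+"
    return "UCTD"  # illustrative label for non-classified ANA+ cases
```

Because the scoring rule is 'first-to-ten', a real evaluation can stop as soon as the threshold is reached, which is the effort-saving property discussed above and, conversely, the source of risk when late-discovered MR evidence would have changed an early triage.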

4.1. GenAI Progress Toward ANA Determination

As observed in our Results, genAI tends to accrue promising measures of sensitivity and PPV, thus suggesting an encouraging performance for characterizing the analytically crucial criterion of ANA status. Unfortunately, MR documents posed challenges such as limited duration—a source of numerous ANA false negatives that suppressed NPV performance. This might suggest an imperative to statistically profile the frequency of ANA testing among classified SLE patients to potentially infer a minimum temporal MR duration to mine in order to achieve set levels of predictive confidence.
A second concern arises from the ANA results (see Table 2 and Table 3) whose statistical specificity (0.57; 95% CI: {0.49,0.70}) reflects a substantial number of false positives (9). Notably, this specificity level aligns with clinical assessments for indirect immunofluorescence testing [34,35], although more recent ELISA tests have a substantially better specificity [34,36]. MR annotations might also report ANA+ status without proximal indication that the positivity was actually at the 1:40 titer level, which is often considered to be a marker for possible autoimmune dysfunction but is generally inadequate for SLE classification. This weakness sometimes relates to limits in token-based contextualization (i.e., conceptual modifiers to a given fact may not influence conclusions about that fact if the fact-modifier and modifiable-fact are not both within the same text window under consideration within granular genAI processing) but may also reflect the residual discussion of an older test result whose precise details (potentially involving a low-precision test or a 1:40 titer) predate the one- to two-year MR window available for genAI analysis in this study.

4.2. GenAI Progress Toward Criterial Discrimination Between UCTD and SLE+

An interesting insight emerged from profiling the potential roles of non-ANA criteria in discriminating between SLE+ cases and those still best described as UCTD. Evidence in Table 4 pinpoints metrics whose weighted genAI prevalences differ between SLE+ and UCTD cases. Specifically, the computed influence of anti-Smith antibodies (ADS) (p < 0.004) and low C3/C4 (p < 0.03) discriminated significantly in favor of SLE+ over UCTD, while proteinuria (p < 0.009) and non-scarring alopecia (NSA) (p < 0.015) had statistically greater weighted genAI prevalences in UCTD than in SLE+.

4.3. Sampling Limitations

While this preliminary evidence is interesting, it is important to recognize sampling limits. SLE+ sampling is modestly robust (48 patients), but limited SLE- sampling (22 patients) hinders full evaluation of the capacity of genAI to distinguish SLE from other non-rheumatological maladies, and too few UCTD outcomes (nine cases) are available to assure representation of the breadth of connective tissue pathological heterogeneity. Furthermore, while we have access to definitive SLE classifications for all 78 cases examined in the study, and we have examined one- to two-year compendia of MRs for all, our access to explicit clinically assessed criterial-level determinations is limited to 15 cases at the EULAR 2019 level, plus 12 cases for which simpler ACR 1997-level criterial assessments were made. Collectively, this means that, while our total sample is fairly robust for evaluating the overall prospect of genAI recognition of true SLE+ cases, we have inadequate fine-grained symptomatic points of validation to confidently differentiate UCTD cases from SLE- or to be confident that genAI truly identifies specific disease measures. Thus, genAI-derived SLE classifications are initially promising but warrant further investigation toward clear characterization of UCTD cases and confident prediction of SLE- status. Furthermore, many criterial-level predictions are not statistically significant; thus, firm criterial conclusions will require the assessment of a larger cohort.

5. Conclusions

Generative AI methods are gaining attention for diverse medical information tasks [8,37,38,39,40], with demonstrable prowess in technical extraction across many disciplines [41,42,43,44]. Within the medical community, however, uncertainty persists regarding whether genAI has risen to the point of reliable decision support and, if so, what roles could be confidently delegated [37,45,46,47]. Achieving such trust requires accepted benchmarks and extensive validation.
In order to extend our prior pilot assessment of genAI replication of the ACR 1997 SLE classification [21], the present study furthers the discourse by considering genAI efficacy for parsing legacy MRs (i.e., digitized paper documents) as a means for characterizing patients according to their case histories. Performance was assessed based on the propensity of genAI to replicate the clinical determinations that support classification of prospective SLE patients under the EULAR 2019 protocol.
To this end, net performance statistics for resolving SLE+ versus UCTD versus SLE- determinations achieved a 54% accuracy ±11% (95% CI) for tripartite classification. This is within the performance range determined by meta-analysis on the use of genAI for diagnosing various medical conditions [33,48], but it can be objectively surmised that none of these studies fully merit widespread practical medical adoption yet.
Advancing from numbers that are marginally promising but functionally inadequate may be supported by the continued improvement of genAI algorithms, but SLE case heterogeneity [49,50] suggests that informatics alone will not fully close the gap. Immediate next steps thus involve assessing which SLE-relevant metrics were handled well by the current methodology, determining why these successes occurred (e.g., which MR instances and data structures tended to accurately document specific criteria), and correspondingly adapting our strategies for information acquisition. While we believe that the current version represents a useful incremental benchmark based on straightforward evidence, it is equally apparent that genAI may be powerfully augmented by complementary algorithms operating on data other than what genAI or other NLP methods might consider. A potentially impactful augmentation, for example, could strengthen associative reasoning by also mining MRs for intervention history. Key data may include procedures, tests, and multiple aspects of prescription history (e.g., dose, compliance, apparent effectiveness, etc.). Such data generally resides outside of conventional digital diagnostics but adds a context that, via machine learning or deep learning, may illuminate statistical associations with specific disease states, whether they be previously diagnosed, suggested, or unsuspected. Such associations would not be considered actionable medical evidence but may prove valuable for assisting medical professionals in identifying lines of inquiry and for filling MR gaps resulting from care fragmentation.
While genAI is racing ahead in science and society, the field of medicine operates under the Hippocratic oath, whose conservative strictures require cautious, measured advancement. Our intended service to the community at this point is to blend preliminary optimism with realism. Understanding pitfalls may be more valuable than claims of transformative breakthroughs. The latter are destined to come in time, and that progress will be hastened with every pitfall that is identified and addressed.

Author Contributions

Conceptualization, S.N., G.H.L., E.R.J., B.R., and M.P.; methodology, S.N., G.H.L., E.R.J., B.R., and M.P.; software, S.N. and G.H.L.; validation, S.N. and G.H.L.; formal analysis, G.H.L. and S.N.; investigation, S.N.; resources, S.N., E.R.J., and M.P.; data curation, S.N., G.H.L., and E.R.J.; writing—original draft preparation, G.H.L.; writing—review and editing, S.N., G.H.L., E.R.J., B.R., and M.P.; visualization, S.N. and G.H.L.; supervision, S.N., G.H.L., E.R.J., B.R., and M.P.; project administration, S.N., E.R.J., and M.P.; funding acquisition, M.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study utilized data from two prior clinical studies (named herein as Pre-SC Study and Post-SC Study), as approved by the governing Institutional Review Board. Pre-SC Study was conducted in accordance with the Declaration of Helsinki and was approved by the Western Institutional Review Board (WIRB)—Copernicus Group (WCG IRB) (Study Number: 1350085, first date of approval: 23 February 2023). Post-SC Study was conducted in accordance with the Declaration of Helsinki and was also approved by the WCG IRB (Study Number: 1332417, first date of approval: 2 May 2022). Both studies have undergone annual continuing review and renewal. Inclusion and exclusion criteria are described in detail in Appendix C [23,24,25,51,52,53].

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study based on conditions stipulated within the IRB-approved study plan (WCG IRB Study Number: 1350085 and WCG IRB Study Number 1332417). The consent terms included explicit approval for use of data and samples in future research studies.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the SLE patients who participated in the two studies that contributed medical records to this analysis and the evaluating clinicians: Cristina Arriens, Teresa Aberle, Joseph Huffstutter, Judith James, Reshma Khan, Jennifer Murphy, Timothy Niewold, Donald Thomas, Beth Valashinas, Chad Walker, and Anil Warrier. We also thank Melissa Munroe, Nancy Redinger, Jessica Crawley, Sneha Nair, Daniele DeFreese, and Adrian Holloway for their assistance with these studies. During the preparation of this manuscript/study, the author(s) used the Claude 3.5 large language model from Anthropic for the purposes of performing testable criterial extraction from medical records. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

Authors Gerald H. Lushington, Sandeep Nair, Eldon R. Jupe, Bernard Rubin and Mohan Purushothaman were employed by the company Progentec Diagnostics, Inc. All authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

acute cutaneous lupus (ACL)
American College of Rheumatology (ACR)
anti-dsDNA/anti-Smith antibodies (ADS)
artificial intelligence (AI)
autoimmune hemolysis (AIH)
antinuclear antibody (ANA)
ANA negative (ANA−)
ANA positive (ANA+)
anti-phospholipid antibodies (APL)
Amazon Web Services (AWS)
confidence interval (CI)
low C3 AND low C4 (C3+4)
low C3 OR low C4 (C3/4)
European Alliance of Associations for Rheumatology (EULAR)
false negative (FN)
false positive (FP)
generative artificial intelligence (genAI)
institutional review board (IRB)
large language model (LLM)
lupus nephritis class II/V (LN25)
lupus nephritis class III/IV (LN34)
medical record (MR)
natural language processing (NLP)
negative predictive value (NPV)
non-scarring alopecia (NSA)
post-SLE classification (post-SC)
pleural/pericardial effusion (PPE)
positive predictive value (PPV)
pre-SLE classification (pre-SC)
systemic lupus erythematosus (SLE)
SLE negative (SLE−)
SLE positive (SLE+)
subacute cutan./discoid lupus (SCD)
thrombocytopenia (Thromb.)
true negative (TN)
true positive (TP)
undifferentiated connective tissue disorder (UCTD)

Appendix A

Table A1. EULAR 2019 SLE criterial accuracy for genAI analysis of MR evidence. Criterial abbreviations are defined in Table 1. PPV = positive predictive value. NPV = negative predictive value. Consistency is based on the Johnson-Lindenstrauss formalism. Sensitivity and PPV are unquantifiable where the number of true positives is 0 and are marked n/a. # Includes 14 post-SC cases lacking explicit ANA records (ANA+ inferred from clinical SLE+ status). @ Includes 4 pre-SC cases lacking explicit ANA records; ANA− inferred from clinical SLE− (non-UCTD) classification.
(Pos., Neg., and Unspec. report clinical determinations; Sens., Spec., PPV, NPV, and Consistency evaluate the genAI predictions.)

Criteria | Pos. | Neg. | Unspec. | Sens. | Spec. | PPV | NPV | Consistency
1. ANA | 57 # | 21 @ | 0 | 0.75 | 0.57 | 0.83 | 0.46 | 1
2. Fever | 1 | 14 | 63 | 0 | 1 | n/a | 0.93 | 1
3. Leukopenia | 1 | 14 | 63 | 0 | 0.93 | 0 | 0.93 | 1
4. Thromb. | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 0.99
5. AIH | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
6. Delirium | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
7. Psychosis | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
8. Seizure | 0 | 15 | 63 | n/a | 0.80 | 0 | 1 | 1
9. NSA | 4 | 11 | 63 | 0.5 | 0.91 | 0.67 | 0.83 | 1
10. Oral ulcers | 5 | 10 | 63 | 0.2 | 0.8 | 0.33 | 0.67 | 1
11. SCD | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
12. ACL | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
13. PPE | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
14. Acute pericarditis | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
15. Joint involvement | 2 | 13 | 63 | 1 | 0.15 | 0.15 | 1 | 1
16. Proteinuria | 0 | 15 | 63 | n/a | 0.53 | 0 | 1 | 0.99
17. LN25 | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
18. LN34 | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
19. APL | 0 | 15 | 63 | n/a | 0.67 | 0 | 1 | 1
20. Low C3/4 | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 0.97
21. Low C3+4 | 0 | 15 | 63 | n/a | 1 | n/a | 1 | 1
22. ADS | 0 | 15 | 63 | n/a | 0.87 | 0 | 1 | 0.99
Total (±95% c.i.) | 70 | 323 | 1323 | 0.69 ±0.11 | 0.87 ±0.04 | 0.54 ±0.10 | 0.93 ±0.03 | 1 ±0.00

Appendix B

Table A2. Weighted assessed influences of non-ANA EULAR-2019 criteria on SLE and UCTD determination, as defined by Equations (1a) and (1b). Criterial counts refer to genAI positives.
Criteria | genAI Positives (SLE+) | genAI Positives (UCTD) | SLE+ Influence (Equation (1a)) [Wilson 95% CI] | UCTD Influence (Equation (1b)) [Wilson 95% CI]
2. Fever | 5 | 1 | 0.227 [0.092, 0.506] | 0.200 [0.010, 0.913]
3. Leukopenia | 19 | 1 | 1.295 [0.801, 1.767] | 0.300 [0.015, 1.368]
4. Thromb. | 5 | 0 | 0.455 [0.184, 1.012] | 0.000 [0.000, 0.000]
5. AIH | 3 | 0 | 0.273 [0.072, 0.788] | 0.000 [0.000, 0.000]
6. Delirium | 2 | 0 | 0.091 [0.016, 0.334] | 0.000 [0.000, 0.000]
7. Psychosis | 1 | 0 | 0.068 [0.006, 0.405] | 0.000 [0.000, 0.000]
8. Seizure | 1 | 0 | 0.045 [0.004, 0.270] | 0.000 [0.000, 0.000]
9. NSA | 10 | 5 | 0.455 [0.240, 0.764] | 1.000 [0.402, 1.598]
10. Oral ulcers | 19 | 4 | 0.864 [0.534, 1.178] | 0.800 [0.274, 1.452]
11. SCD | 2 | 0 | 0.182 [0.032, 0.668] | 0.000 [0.000, 0.000]
12. ACL | 5 | 1 | 0.682 [0.276, 1.518] | 0.600 [0.030, 2.736]
13. PPE | 2 | 0 | 0.227 [0.040, 0.835] | 0.000 [0.000, 0.000]
14. Acute pericarditis | 2 | 0 | 0.273 [0.048, 1.002] | 0.000 [0.000, 0.000]
15. Joint involvement | 40 | 8 | 5.455 [4.644, 5.826] | 4.800 [2.652, 5.790]
16. Proteinuria | 9 | 5 | 0.818 [0.412, 1.432] | 2.000 [0.804, 3.196]
17. LN25 | 0 | 0 | 0.000 [0.000, 0.000] | 0.000 [0.000, 0.000]
18. LN34 | 1 | 0 | 0.227 [0.020, 1.350] | 0.000 [0.000, 0.000]
19. APL | 9 | 1 | 0.409 [0.206, 0.716] | 0.200 [0.010, 0.912]
20. Low C3/4 | 14 | 0 | 0.955 [0.573, 1.431] | 0.000 [0.000, 0.000]
21. Low C3+4 | 4 | 0 | 0.364 [0.120, 0.904] | 0.000 [0.000, 0.000]
22. ADS | 23 | 0 | 3.136 [2.214, 4.038] | 0.000 [0.000, 0.000]
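The Wilson 95% confidence intervals quoted in Table A2 are score intervals on the underlying binomial proportions [31], scaled per the criterial weighting of Equations (1a) and (1b). A generic, uncorrected Wilson score interval can be sketched as follows; this is an illustrative implementation under our own naming, and the study's exact variant (e.g., whether a continuity correction was applied) may differ slightly:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 for a 95% interval; no continuity correction)."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)
```

For example, `wilson_interval(5, 44)` would give the unscaled interval for 5 criterion-positive cases among a hypothetical 44 classifications; multiplying both bounds by the criterion's EULAR weight yields an influence-style interval.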

Appendix C

  • Study Inclusion and Exclusion Criteria
In the study to expedite lupus classification in at-risk individuals, participants were digitally recruited through a publicly available web portal designed for visitors to evaluate their risk of developing rheumatic connective tissue diseases [24]. Study participation was available to females or males aged 18 to 45 years who believed they had an autoimmune condition but had not yet been formally diagnosed with SLE. Individuals completed the Connective Tissue Disease Screening Questionnaire (CSQ) [51,52]. Individuals identified at ‘possible’ (SLE-CSQ = 3) or ‘probable’ (SLE-CSQ ≥ 4) risk of SLE by the CSQ were recruited to consent and participate in the study. Participants had to understand the requirements of the study, provide written informed consent (including consent for the use and disclosure of research-related health information), and comply with the study data collection procedures. Exclusion criteria included prior classification with SLE following medical records evaluation, age under 18 years or over 45 years, and pregnancy.
In the study to evaluate self-efficacy and disease management [25], participants had to have a clinical diagnosis of classified SLE, defined as meeting at least four ACR classification criteria for SLE [23] OR at least four SLICC classification criteria for SLE [53]; be at least 18 years old; and not have been treated with biologic therapy in the previous 12 months (belimumab, rituximab, or similar therapies). All participants also had to meet at least any ONE of the following criteria: history of rheumatologist-documented clinical disease flare in the last year; documentation of an SLE-related emergency room visit in the last year; or history of a rheumatologist-documented measure of active disease in the last year (e.g., organ system involvement due to SLE or a Systemic Lupus Erythematosus Disease Activity Index (SLEDAI) score > 3) combined with a history of steroid or other immune-modifying drug use in the last year (hydroxychloroquine, methotrexate, azathioprine, MMF, cyclophosphamide, etc.). Participants had to be able to understand the requirements of the study, provide written informed consent (including consent for the use and disclosure of research-related health information), and consent to comply with the study data collection procedures. Exclusion criteria included inability to comply with the study visit schedule and procedures, age under 18 years, or exclusion at the discretion of the rheumatologist due to disease severity or ongoing pregnancy.

References

  1. Joo, J.Y. Fragmented care and chronic illness patient outcomes: A systematic review. Nurs. Open 2023, 10, 3460–3473. [Google Scholar] [CrossRef]
  2. Wong, Z.S.-Y.; Gong, Y.; Ushiro, S. A pathway from fragmentation to interoperability through standards-based enterprise architecture to enhance patient safety. npj Digit. Med. 2025, 8, 41. [Google Scholar] [CrossRef]
  3. Walunas, T.L.; Jackson, K.L.; Chung, A.H.; Mancera-Cuevas, K.A.; Erickson, D.L.; Ramsey-Goldman, R.; Kho, A. Disease Outcomes and Care Fragmentation Among Patients with Systemic Lupus Erythematosus. Arthritis Care Res. 2017, 69, 1369–1376. [Google Scholar] [CrossRef] [PubMed]
  4. Khairat, S.; Morelli, J.; Boynton, M.H.; Bice, T.; A Gold, J.; Carson, S.S. Investigation of Information Overload in Electronic Health Records: Protocol for Usability Study. JMIR Res. Protoc. 2025, 14, e66127. [Google Scholar] [CrossRef]
  5. Asgari, E.; Kaur, J.; Nuredini, G.; Balloch, J.; Taylor, A.M.; Sebire, N.; Robinson, R.; Peters, C.; Sridharan, S.; Pimenta, D. Impact of Electronic Health Record Use on Cognitive Load and Burnout Among Clinicians: Narrative Review. JMIR Med. Inform. 2024, 12, e55499. [Google Scholar] [CrossRef]
  6. Cahill, M.; Cleary, B.J.; Cullinan, S. The influence of electronic health record design on usability and medication safety: Systematic review. BMC Health Serv. Res. 2025, 25, 31. [Google Scholar] [CrossRef]
  7. Nijor, S.; Rallis, G.B.; Lad, N.; Gokcen, E. Patient Safety Issues from Information Overload in Electronic Medical Records. J. Patient Saf. 2022, 18, e999–e1003. [Google Scholar] [CrossRef] [PubMed]
  8. Reddy, S. Generative AI in healthcare: An implementation science informed translational path on application, integration and governance. Implement. Sci. 2024, 19, 27. [Google Scholar] [CrossRef] [PubMed]
  9. Sequí-Sabater, J.M.; Benavent, D. Artificial intelligence in rheumatology research: What is it good for? RMD Open 2025, 11, e004309. [Google Scholar] [CrossRef]
  10. Templin, T.; Perez, M.W.; Sylvia, S.; Leek, J.; Sinnott-Armstrong, N.; Silva, J.N.A. Addressing 6 challenges in generative AI for digital health: A scoping review. PLoS Digit. Health 2024, 3, e0000503. [Google Scholar] [CrossRef]
  11. Chustecki, M. Benefits and Risks of AI in Health Care: Narrative Review. Interact. J. Med. Res. 2024, 13, e53616. [Google Scholar] [CrossRef]
  12. Hassan, R.; Faruqui, H.; Alquraa, R.; Eissa, A.; Alshaiki, F.; Cheikh, M. Classification Criteria and Clinical Practice Guidelines for Rheumatic Diseases. In Skills in Rheumatology; Springer: Singapore, 2021; pp. 521–566. [Google Scholar] [CrossRef]
  13. June, R.R.; Aggarwal, R. The use and abuse of diagnostic/classification criteria. Best Pract. Res. Clin. Rheumatol. 2014, 28, 921–934. [Google Scholar] [CrossRef]
  14. Aletaha, D.; Neogi, T.; Silman, A.J.; Funovits, J.; Felson, D.T.; Bingham, C.O., 3rd; Birnbaum, N.S.; Burmester, G.R.; Bykerk, V.P.; Cohen, M.D.; et al. 2010 Rheumatoid arthritis classification criteria: An American College of Rheumatology/European League Against Rheumatism collaborative initiative. Ann. Rheum. Dis. 2010, 69, 1580–1588. [Google Scholar] [CrossRef]
  15. Van den Hoogen, F.; Khanna, D.; Fransen, J.; Johnson, S.R.; Baron, M.; Tyndall, A.; Matucci-Cerinic, M.; Naden, R.P.; Medsger, T.A., Jr.; Carreira, P.E.; et al. 2013 classification criteria for systemic sclerosis: An American college of rheumatology/European league against rheumatism collaborative initiative. Ann. Rheum. Dis. 2013, 72, 1747–1755. [Google Scholar] [CrossRef]
  16. Neogi, T.; Jansen, T.L.T.A.; Dalbeth, N.; Fransen, J.; Schumacher, H.R.; Berendsen, D.; Brown, M.; Choi, H.; Edwards, N.L.; Janssens, H.J.E.M.; et al. 2015 Gout Classification Criteria: An American College of Rheumatology/European League Against Rheumatism collaborative initiative. Arthritis Rheumatol. 2015, 67, 2557–2568. [Google Scholar] [CrossRef]
  17. Wolfe, F.; Clauw, D.J.; Fitzcharles, M.-A.; Goldenberg, D.L.; Katz, R.S.; Mease, P.; Russell, A.S.; Russell, I.J.; Winfield, J.B.; Yunus, M.B. The American College of rheumatology preliminary diagnostic criteria for fibromyalgia and measurement of symptom severity. Arthritis Care Res. 2010, 62, 600–610. [Google Scholar] [CrossRef] [PubMed]
  18. Taylor, W.; Gladman, D.; Helliwell, P.; Marchesoni, A.; Mease, P.; Mielants, H.; CASPAR Study Group. Classification criteria for psoriatic arthritis: Development of new criteria from a large international study. Arthritis Care Res. 2006, 54, 2665–2673. [Google Scholar] [CrossRef] [PubMed]
  19. Lundberg, I.E.; Msc, A.T.; Bottai, M.; Werth, V.P.; Mbbs, M.C.P.; de Visser, M.; Alfredsson, L.; Amato, A.A.; Barohn, R.J.; Liang, M.H.; et al. 2017 European League Against Rheumatism/American College of Rheumatology Classification Criteria for Adult and Juvenile Idiopathic Inflammatory Myopathies and Their Major Subgroups. Arthritis Rheumatol. 2017, 69, 2271–2282. [Google Scholar] [CrossRef] [PubMed]
  20. Aringer, M.; Costenbader, K.; Daikh, D.; Brinks, R.; Mosca, M.; Ramsey-Goldman, R.; Smolen, J.S.; Wofsy, D.; Boumpas, D.T.; Kamen, D.L.; et al. 2019 European League Against Rheumatism/American College of Rheumatology Classification Criteria for Systemic Lupus Erythematosus. Arthritis Rheumatol. 2019, 71, 1400–1412. [Google Scholar] [CrossRef]
  21. Nair, S.; Lushington, G.H.; Purushothaman, M.; Rubin, B.; Jupe, E.; Gattam, S. Prediction of Lupus Classification Criteria via Generative AI Medical Record Profiling. BioTech 2025, 14, 15. [Google Scholar] [CrossRef]
  22. Claude, version 3.0; Anthropic: San Francisco, CA, USA, 2024.
  23. Hochberg, M.C. Updating the American College of Rheumatology revised criteria for the classification of systemic lupus ery-thematosus. Arthritis Rheum. 1997, 40, 1725. [Google Scholar] [CrossRef]
  24. Jupe, E.; Nadipelli, V.; Lushington, G.; Crawley, J.; Rubin, B.; Nair, S.; Nair, S.; Purushothaman, M.; Munroe, M.; Walker, C.; et al. Expediting lupus classification of at-risk individuals using novel technology: Outcomes of a pilot study. J. Rheumatol. 2025, 52 (Suppl. 1), 214. [Google Scholar] [CrossRef]
  25. Jupe, E.R.; Purushothaman, M.; Wang, B.; Lushington, G.; Nair, S.; Nadipelli, V.R.; Rubin, B.; Munroe, M.E.; Crawley, J.; Nair, S.; et al. Impact of a digital platform and flare risk blood biomarker index on lupus: A study protocol design for evaluating self efficacy and disease management. Contemp. Clin. Trials Commun. 2025, 45, 101471. [Google Scholar] [CrossRef] [PubMed]
  26. Belval, E.; Delteil, T.; Schade, M.; Radhakrishna, S. Amazon Textract, version 1.9.2.; Amazon Web Services: Seattle, WA, USA, 2025.
  27. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-T.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  28. Chen, B.; Zhang, Z.; Langrené, N.; Zhu, S. Unleashing the potential of prompt engineering for large language models. Patterns 2025, 6, 101260. [Google Scholar] [CrossRef] [PubMed]
  29. Claude, version 3.5; Anthropic: San Francisco, CA, USA, 2025.
  30. Johnson, W.B.; Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 1984, 26, 189–206. [Google Scholar] [CrossRef]
  31. Sauro, J.; Lewis, J.R. Comparison of Wald, Adj-Wald, exact, and Wilson intervals calculator. In Proceedings of the Human Factors and Ergonomics Society, 49th Annual Meeting (HFES 2005), Orlando, FL, USA, 26–30 September 2005; pp. 2100–2104. [Google Scholar]
  32. Bommineni, V.L.; Bhagwagar, S.; Balcarcel, D.; Davatzikos, C.; Boyer, D. Performance of ChatGPT on the MCAT: The Road to Personalized and Equitable Premedical Learning. medRxiv 2023. [Google Scholar] [CrossRef]
  33. Takita, H.; Kabata, D.; Walston, S.L.; Tatekawa, H.; Saito, K.; Tsujimoto, Y.; Miki, Y.; Ueda, D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. npj Digit. Med. 2025, 8, 175. [Google Scholar] [CrossRef]
  34. Ali, Y. Rheumatologic Tests: A Primer for Family Physicians. Am. Fam. Physician 2018, 98, 164–170. [Google Scholar]
  35. Solomon, D.H.; Kavanaugh, A.J.; Schur, P.H.; American College of Rheumatology Ad Hoc Committee on Immunologic Testing Guidelines. Evidence-based guidelines for the use of immunologic tests: Antinuclear antibody testing. Arthritis Care Res. 2002, 47, 434–444. [Google Scholar] [CrossRef]
  36. Verizhnikova, Z.; Aleksandrova, E.; Novikov, A.; Panafidina, T.; Seredavkina, N.; Roggenbuck, D.; Nasonov, E. AB1027 Diagnostic Accuracy of Automated Determination of Antinuclear Antibodies (ANA) by Indirect REAction of Immunofluorescence on Human Hep-2 Cells (IIF-HEP-2) and Enzyme-Linked Immunosorbent Assay (ELISA) for Diagnosis of Systemic Lupus Erythematosus (SLE). Ann. Rheum. Dis. 2014, 73, 1140. [Google Scholar] [CrossRef]
  37. Chen, Y.; Esmaeilzadeh, P. Generative AI in Medical Practice: In-Depth Exploration of Privacy and Security Challenges. J. Med. Internet Res. 2024, 26, e53008. [Google Scholar] [CrossRef]
  38. Xu, R.; Wang, Z. Generative artificial intelligence in healthcare from the perspective of digital media: Applications, opportunities and challenges. Heliyon 2024, 10, e32364. [Google Scholar] [CrossRef]
  39. Bhuyan, S.S.; Sateesh, V.; Mukul, N.; Galvankar, A.; Mahmood, A.; Nauman, M.; Rai, A.; Bordoloi, K.; Basu, U.; Samuel, J. Generative Artificial Intelligence Use in Healthcare: Opportunities for Clinical Excellence and Administrative Efficiency. J. Med. Syst. 2025, 49, 10. [Google Scholar] [CrossRef]
  40. Loni, M.; Poursalim, F.; Asadi, M.; Gharehbaghi, A. A review on generative AI models for synthetic medical text, time series, and longitudinal data. npj Digit. Med. 2025, 8, 281. [Google Scholar] [CrossRef]
  41. Spillias, S.; Ollerhead, K.M.; Andreotta, M.; Annand-Jones, R.; Boschetti, F.; Duggan, J.; Karcher, D.B.; Paris, C.; Shellock, R.J.; Trebilco, R. Evaluating generative AI for qualitative data extraction in community-based fisheries management literature. Environ. Evid. 2025, 14, 9. [Google Scholar] [CrossRef] [PubMed]
  42. Safdar, M.; Xie, J.; Mircea, A.; Zhao, Y.F. Human–Artificial Intelligence Teaming for Scientific Information Extraction from Data-Driven Additive Manufacturing Literature Using Large Language Models. J. Comput. Inf. Sci. Eng. 2025, 25, 074501. [Google Scholar] [CrossRef]
  43. Li, Y.; Datta, S.; Rastegar-Mojarad, M.; Lee, K.; Paek, H.; Glasgow, J.; Liston, C.; He, L.; Wang, X.; Xu, Y. Enhancing systematic literature reviews with generative artificial intelligence: Development, applications, and performance evaluation. J. Am. Med. Inform. Assoc. 2025, 32, 616–625. [Google Scholar] [CrossRef]
  44. Shahid, F.; Hsu, M.-H.; Chang, Y.-C.; Jian, W.-S. Using Generative AI to Extract Structured Information from Free Text Pathology Reports. J. Med. Syst. 2025, 49, 36. [Google Scholar] [CrossRef] [PubMed]
  45. Hasan, S.S.; Fury, M.S.; Woo, J.J.; Kunze, K.N.; Ramkumar, P.N. Ethical Application of Generative Artificial Intelligence in Medicine. Arthrosc. J. Arthrosc. Relat. Surg. 2024, 41, 874–885. [Google Scholar] [CrossRef] [PubMed]
  46. Tran, M.; Balasooriya, C.; Jonnagaddala, J.; Leung, G.K.-K.; Mahboobani, N.; Ramani, S.; Rhee, J.; Schuwirth, L.; Najafzadeh-Tabrizi, N.S.; Semmler, C.; et al. Situating governance and regulatory concerns for generative artificial intelligence and large language models in medical education. npj Digit. Med. 2025, 8, 315. [Google Scholar] [CrossRef] [PubMed]
  47. Ning, Y.; Teixayavong, S.; Shang, Y.; Savulescu, J.; Nagaraj, V.; Miao, D.; Mertens, M.; Ting, D.S.W.; Ong, J.C.L.; Liu, M.; et al. Generative artificial intelligence and ethical considerations in health care: A scoping review and ethics checklist. Lancet Digit. Health 2024, 6, e848–e856. [Google Scholar] [CrossRef] [PubMed]
  48. Yim, D.; Khuntia, J.; Parameswaran, V.; Meyers, A. Preliminary Evidence of the Use of Generative AI in Health Care Clinical Services: Systematic Narrative Review. JMIR Med. Inform. 2024, 12, e52073. [Google Scholar] [CrossRef]
  49. Sjöwall, C.; Parodis, I. Clinical Heterogeneity, Unmet Needs and Long-Term Outcomes in Patients with Systemic Lupus Erythematosus. J. Clin. Med. 2022, 11, 6869. [Google Scholar] [CrossRef]
  50. Dai, X.; Fan, Y.; Zhao, X. Systemic lupus erythematosus: Updated insights on the pathogenesis, diagnosis, prevention and therapeutics. Signal Transduct. Target. Ther. 2025, 10, 102. [Google Scholar] [CrossRef]
  51. Karlson, E.W.; Sanchez-Guerrero, J.; Wright, E.A.; Lew, R.A.; Daltroy, L.H.; Katz, J.N.; Liang, M.H. A connective tissue disease screening questionnaire for population studies. Ann. Epidemiol. 1995, 5, 297–302. [Google Scholar] [CrossRef]
  52. Karlson, E.W.; Costenbader, K.H.; McAlindon, T.E.; Massarotti, E.M.; Fitzgerald, L.M.; Jajoo, R.; Husni, E.; Wright, E.A.; Pankey, H.; Fraser, P.A. High sensitivity, specificity and predictive value of the Connective Tissue Disease Screening Questionnaire among urban African-American women. Lupus 2005, 14, 832–836. [Google Scholar] [CrossRef] [PubMed]
  53. Petri, M.; Orbai, A.; Alarcón, G.S.; Gordon, C.; Merrill, J.T.; Fortin, P.R.; Bruce, I.N.; Isenberg, D.; Wallace, D.J.; Nived, O.; et al. Derivation and validation of the Systemic Lupus International Collaborating Clinics classification criteria for systemic lupus erythematosus. Arthritis Rheum. 2012, 64, 2677–2686. [Google Scholar] [CrossRef]
Figure 1. EULAR 2019 classification decision branches, beginning with the entry criterion (at least one positive recorded ANA test, denoted ANA+, is required for SLE+ classification), followed by the stipulation that at least one clinical criterion (Table 1, criteria 2–18) be met, and ending with the requirement that the sum of criterial weights (the point listings in Table 1) be 10 or greater.
Figure 2. EULAR clinical and genAI classifications showing (A) attrition plot as a function of decision branches and (B) relative partitioning statistics among the classes.
Table 1. EULAR 2019 SLE criteria [20], their role in classification, and number of clinical determinations (clin-dets). Failure to meet criterion 1 confers an automatic SLE− classification. SLE+ classification requires at least one positive result among clinical criteria 2–18, plus a total weighted score of at least 10.
Criteria (Abbreviation) | Role (Clin-Dets) | Data Type | Weight
1. Antinuclear antibodies (ANA) | Required (78) | quantitative test | -
2. Fever | Clinical (15) | quantitative test | 2
3. Leukopenia | Clinical (15) | quantitative test | 3
4. Thrombocytopenia (Thromb.) | Clinical (15) | quantitative test | 4
5. Autoimmune hemolysis (AIH) | Clinical (15) | quantitative test | 4
6. Delirium | Clinical (15) | qualitative exam | 2
7. Psychosis | Clinical (15) | qualitative exam | 3
8. Seizure | Clinical (15) | qualitative exam | 5
9. Non-scarring alopecia (NSA) | Clinical (15) | qualitative exam | 2
10. Oral ulcers | Clinical (15) | qualitative exam | 2
11. Subacute cutan./discoid lupus (SCD) | Clinical (15) | qualitative exam | 4
12. Acute cutaneous lupus (ACL) | Clinical (15) | quantitative test | 6
13. Pleural/pericardial effusion (PPE) | Clinical (15) | qualitative imaging | 5
14. Acute pericarditis | Clinical (15) | qualitative or imaging | 6
15. Joint involvement | Clinical (15) | qualitative exam | 6
16. Proteinuria | Clinical (15) | quantitative test | 4
17. Lupus nephritis class II/V (LN25) | Clinical (15) | separate classification | 8
18. Lupus nephritis class III/IV (LN34) | Clinical (15) | separate classification | 10
19. Anti-phospholipid antibodies (APL) | Immunolog. (15) | quantitative test | 2
20. Low C3 OR low C4 (C3/4) | Immunolog. (15) | quantitative test | 3
21. Low C3 AND low C4 (C3+4) | Immunolog. (15) | quantitative test | 4
22. Anti-dsDNA/anti-Smith antibodies (ADS) | Immunolog. (15) | quantitative test | 6
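As a concrete illustration, the three decision branches of Figure 1 (ANA entry criterion, at least one positive clinical criterion, total weighted score of at least 10) can be sketched in a few lines of Python. This is a hedged sketch, not the study's code: the function and dictionary names are invented, the weights follow the published EULAR 2019 criteria, and the full EULAR rule that only the highest-weighted criterion within each domain contributes to the score is omitted for brevity.

```python
# Illustrative sketch of the EULAR 2019 decision branches described in Figure 1.
# Criterion keys abbreviate the Table 1 listings; names are ours, not the study's.

CLINICAL_WEIGHTS = {
    "fever": 2, "leukopenia": 3, "thrombocytopenia": 4, "aih": 4,
    "delirium": 2, "psychosis": 3, "seizure": 5, "nsa": 2, "oral_ulcers": 2,
    "scd": 4, "acl": 6, "ppe": 5, "acute_pericarditis": 6,
    "joint_involvement": 6, "proteinuria": 4, "ln25": 8, "ln34": 10,
}
IMMUNOLOGIC_WEIGHTS = {"apl": 2, "c3_or_c4_low": 3, "c3_and_c4_low": 4, "ads": 6}

def classify(ana_positive: bool, positives: set[str]) -> str:
    """Return 'SLE+' or 'SLE-' per the three EULAR 2019 decision branches."""
    if not ana_positive:                              # entry criterion
        return "SLE-"
    if not any(c in CLINICAL_WEIGHTS for c in positives):
        return "SLE-"                                 # need >= 1 clinical criterion
    score = sum(CLINICAL_WEIGHTS.get(c, IMMUNOLOGIC_WEIGHTS.get(c, 0))
                for c in positives)
    return "SLE+" if score >= 10 else "SLE-"
```

For instance, an ANA+ case with joint involvement (6) and anti-dsDNA/anti-Smith antibodies (6) scores 12 and classifies SLE+, while an ANA− case classifies SLE− regardless of other findings.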
Table 2. genAI accuracy for selected EULAR 2019 criteria. Criteria are defined in Table 1. PPV = positive predictive value. NPV = negative predictive value. Sensitivity and PPV are unquantifiable for criteria lacking clinical positives and are marked n/a. # Includes 14 post-SC cases lacking explicit ANA records (ANA+ inferred from clinical SLE+ status). @ Includes 4 pre-SC cases lacking explicit ANA records; ANA− inferred from clinical SLE− (non-UCTD) classification. Details available in Appendix A.
(Pos., Neg., and Unspecified report clinical determinations; Sens., Spec., PPV, and NPV evaluate the genAI predictions.)

Key Criteria | Pos. | Neg. | Unspecified | Sens. | Spec. | PPV | NPV
1. ANA | 57 # | 21 @ | 0 | 0.75 | 0.57 | 0.83 | 0.46
3. Leukopenia | 1 | 14 | 63 | 0 | 0.93 | 0 | 0.93
9. NSA | 4 | 11 | 63 | 0.5 | 0.91 | 0.67 | 0.83
10. Oral ulcers | 5 | 10 | 63 | 0.2 | 0.8 | 0.33 | 0.67
15. Joint involvement | 2 | 13 | 63 | 1 | 0.15 | 0.15 | 1
16. Proteinuria | 0 | 15 | 63 | n/a | 0.53 | n/a | 1
20. Low C3/4 | 0 | 15 | 63 | n/a | 1 | n/a | 1
22. ADS | 0 | 15 | 63 | n/a | 0.87 | n/a | 1
Total (all 22 criteria, ±95% c.i.) | 70 | 323 | 1323 | 0.69 ±0.11 | 0.87 ±0.04 | 0.54 ±0.10 | 0.93 ±0.03
Table 3. Distribution and originating evidence for ANA determinations. (TP = true positive, FN = false negative, FP = false positive, and TN = true negative).
 | Total | Numerical (e.g., Titer > 1:80 or Titer < 1:80) | Phrase (e.g., “ANA Positive” or “ANA Negative”) | Not Reported in MR
TP | 43 | 32 | 11 | 0
FN | 14 | 0 | 0 | 14
FP | 9 | 6 | 3 | 0
TN | 12 | 1 | 7 | 4
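The ANA confusion counts in Table 3 are sufficient to reproduce the headline ANA metrics reported in Table 2. A minimal check (variable names are ours):

```python
# Recompute the four ANA performance metrics from the Table 3 confusion counts.
tp, fn, fp, tn = 43, 14, 9, 12      # ANA: true/false positives and negatives

sensitivity = tp / (tp + fn)        # 43/57: fraction of clinical positives found
specificity = tn / (tn + fp)        # 12/21: fraction of clinical negatives found
ppv = tp / (tp + fp)                # 43/52: positive predictive value
npv = tn / (tn + fn)                # 12/26: negative predictive value

print(round(sensitivity, 2), round(specificity, 2), round(ppv, 2), round(npv, 2))
# → 0.75 0.57 0.83 0.46, matching the ANA row of Table 2
```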
Table 4. Weighted assessed influences of non-ANA EULAR-2019 criteria on SLE and UCTD determination, as defined by Equations (1a) and (1b). Additional information is available in Appendix B.
Criteria | genAI Positives (SLE+) | genAI Positives (UCTD) | SLE+ Influence (Equation (1a)) [Wilson 95% Confidence Interval] | UCTD Influence (Equation (1b)) [Wilson 95% Confidence Interval]
3. Leukopenia | 19 | 1 | 1.295 [0.801, 1.767] | 0.300 [0.015, 1.368]
9. NSA | 10 | 5 | 0.455 [0.240, 0.764] | 1.000 [0.402, 1.598]
10. Oral ulcers | 19 | 4 | 0.864 [0.534, 1.178] | 0.800 [0.274, 1.452]
15. Joint involvement | 40 | 8 | 5.455 [4.644, 5.826] | 4.800 [2.652, 5.790]
16. Proteinuria | 9 | 5 | 0.818 [0.412, 1.432] | 2.000 [0.804, 3.196]
20. Low C3/4 | 14 | 0 | 0.955 [0.573, 1.431] | 0.000 [0.000, 0.000]
22. ADS | 23 | 0 | 3.136 [2.214, 4.038] | 0.000 [0.000, 0.000]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lushington, G.H.; Nair, S.; Jupe, E.R.; Rubin, B.; Purushothaman, M. Criteria and Protocol: Assessing Generative AI Efficacy in Perceiving EULAR 2019 Lupus Classification. Diagnostics 2025, 15, 2409. https://doi.org/10.3390/diagnostics15182409


