1. Introduction
Atrial fibrillation (AF) is the most common sustained arrhythmia and remains a leading cause of cardioembolic stroke [1,2]. Current guidelines promote proactive case detection and structured pathways to minimize missed AF cases and speed up prevention efforts, while recognizing uncertainties about the best screening methods and workflows [1]. Consumer wearables and single-lead devices have facilitated large-scale rhythm monitoring outside clinics, but population prevalence limits positive predictive value, so most programs still depend on confirmatory testing for diagnosis [3]. Patch-based ambulatory monitors and implantable loop recorders (ILR) improve detection, especially after cryptogenic stroke, but they require significant resources and cannot be used indiscriminately [4,5]. Consequently, many health systems face the challenge of deciding whom to monitor, when, and how often to reassess risk, rather than whether to screen at all.
Artificial intelligence (AI)–enabled analysis of standard 12-lead electrocardiograms (ECGs) recorded during sinus rhythm offers an alternative approach: instead of intermittently detecting AF itself, models can infer latent atrial cardiomyopathy and near-term AF risk from subtle features in sinus rhythm [6,7]. Early research shows that convolutional neural network models trained on routine 12-lead ECGs can identify individuals with AF (or imminent AF) even when the tracing appears normal at the time of care, suggesting that AI-ECG could serve as a pre-screening tool to prioritize further rhythm monitoring [6,7]. Since this method uses existing equipment and fits into standard outpatient visits, it can provide immediate, actionable risk assessments without requiring patient-owned devices or extended monitoring.
Building on this concept, we previously reported a single-ECG machine-learning model that estimated AF risk during sinus rhythm using a standard 12-lead tracing, demonstrating strong discrimination in routine clinical data [8]. Although single-timepoint estimation offers high negative predictive value, it may provide false reassurance if risk evolves over time; conversely, indiscriminate escalation based on one elevated score can trigger unnecessary monitoring in low-prevalence settings [1,3,4]. These limitations motivate a sequential strategy that explicitly accounts for temporal evolution of risk while conserving resources.
In this study, we assess a sequential, two-visit AI-ECG protocol that analyzes a baseline 12-lead ECG and a second ECG taken after a short interval, escalating to ambulatory monitoring only if risk remains persistently high. The staged approach aims to maintain sensitivity and negative predictive value at initial assessment—features that enable safe deferral in low-risk patients—while enhancing accuracy at follow-up by requiring confirmation of persistently elevated risk before allocating limited monitoring resources. From a healthcare system perspective, this triage could improve diagnostic accuracy and target monitoring to patients most likely to benefit, with potential downstream improvements in cost-effectiveness [9]. We hypothesized that the sequential design would preserve a high negative predictive value to safely defer monitoring in low-risk individuals while increasing detection among those with consistently high risk, especially in subgroups where timely AF detection could influence antithrombotic treatment.
The present work does not introduce a new algorithm; rather, it operationalizes a diagnostic protocol for opportunistic AF screening using the models described in our prior study [8] and evaluates their performance and workflow implications in a clinic-embedded setting.
2. Method
2.1. Study Design and Oversight
This retrospective review of a diagnostic protocol was conducted at a tertiary academic hospital. The study included adults (≥18 years) with standard 12-lead ECGs recorded from 2010 to 2021. The Institutional Review Board (IRB) approved the study and granted a waiver of informed consent because only de-identified data were used (IRB No. SMC 2020-01-007).
2.2. ECG Acquisition and Preprocessing
ECGs were recorded using Philips PageWriter systems (Philips Medical Systems, Andover, MA, USA) at a sampling rate of 500 Hz for 10 s, with a 5 μV resolution. Standard 12-lead electrode positions were used (Figure 1). Limb electrodes were placed on the right arm, left arm, right leg, and left leg (on distal limbs or, when necessary, proximally on the limbs to reduce artefact while preserving Einthoven geometry). Precordial electrodes were positioned as follows: V1 at the 4th intercostal space at the right sternal border; V2 at the 4th intercostal space at the left sternal border; V4 at the 5th intercostal space on the mid-clavicular line; V3 midway between V2 and V4; V5 on the anterior axillary line at the same horizontal level as V4; and V6 on the mid-axillary line at the same level as V4. These placements were applied consistently across recordings.
Nine independent leads, excluding augmented limb leads, were processed. The signals were subjected to standard filtering and segmentation. From the P-QRS-T complexes, we extracted features at the beat, interval, and morphology levels. Additionally, for the serial model, delta features were derived by calculating differences between corresponding features in baseline and follow-up ECGs.
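To make the preprocessing step concrete, the following is a minimal sketch of per-lead band-pass filtering prior to segmentation. The 0.5–40 Hz Butterworth pass-band and filter order shown here are illustrative assumptions; the exact filter settings and the segmentation procedure follow the prior study [8].

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 500  # sampling rate in Hz, matching the acquisition settings described above

def bandpass_ecg(leads: np.ndarray, low: float = 0.5, high: float = 40.0, order: int = 3) -> np.ndarray:
    """Zero-phase Butterworth band-pass filter applied to each lead (samples on the last axis).

    The pass-band shown here is an illustrative assumption, not the study's exact setting.
    """
    b, a = butter(order, [low / (FS / 2), high / (FS / 2)], btype="band")
    return filtfilt(b, a, leads, axis=-1)

# Example: nine leads of a 10 s recording (9 x 5000 samples)
filtered = bandpass_ecg(np.random.randn(9, FS * 10))
```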
2.3. Outcome Definition and Adjudication
Patients were classified as having definite AF if AF was documented on a 12-lead ECG or Holter monitor and confirmed in their health records. The earliest confirmed AF date was set as the index AF date. The comparison group consisted of individuals without AF documentation but with adjudicated normal sinus rhythm (NSR) ECGs. Exclusion criteria included: (i) a prior AF diagnosis before the index NSR ECG; (ii) no NSR ECG before the index date; (iii) only one NSR ECG available; (iv) an AF indication without a positive AF ECG; (v) incomplete records preventing careful adjudication; and (vi) ECGs that could not be classified as NSR.
2.4. Cohorts and Splitting
For model development, we analyzed 248,612 ECGs from 164,793 patients (including 10,735 with AF). Data splits were performed at the patient level to prevent data leakage, with separate development (training/validation) and test sets. For temporal robustness, the calendar windows for each split are detailed, and a temporal sensitivity analysis was performed in which training/validation used earlier calendar windows and the test set comprised a later window that was not accessed during development; all splits remained at the patient level. The protocol evaluation set included 11,349 patients with longitudinal ECGs suitable for staged assessment; a subgroup with prior stroke (n = 551) was analyzed separately given its clinical relevance.
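As an illustration of the leakage-free splitting described above, the sketch below groups rows by patient so that no patient contributes ECGs to both partitions; a simple temporal variant holds out a later calendar window. The column names (patient_id, acq_date) are hypothetical placeholders, not the study's actual schema.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Split ECG-level rows so that no patient appears in both partitions."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    dev_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
    return df.iloc[dev_idx], df.iloc[test_idx]

def temporal_split(df: pd.DataFrame, cutoff: str):
    """Temporal variant: earlier calendar window for development, later window for testing,
    excluding any patient already seen during development."""
    dev = df[df["acq_date"] < cutoff]
    test = df[(df["acq_date"] >= cutoff) & ~df["patient_id"].isin(dev["patient_id"])]
    return dev, test
```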
2.5. Model Development
We developed a two-step AI-ECG workflow that maps engineered ECG features to the probability of near-term AF. A single-ECG classifier is applied at the baseline visit, and a serial-ECG classifier is applied at a short-interval follow-up. Both models were trained strictly on development partitions defined elsewhere in this Methods section and did not access any information from the test partition.
For input representation, the single-ECG model consumes a fixed-length vector of morphology and timing descriptors derived from sinus-rhythm 12-lead ECGs. The serial-ECG model uses the within-person change between two ECGs by taking the element-wise difference (follow-up minus baseline) of the same feature set, emphasizing temporal progression rather than absolute levels. Pair construction for the serial step, including the prespecified blanking rule, follows the study’s cohort definition described elsewhere.
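A minimal sketch of the serial-ECG input construction, assuming both ECGs have already been reduced to identically ordered feature vectors; the pairing and blanking rules themselves follow the cohort definition referenced above.

```python
import numpy as np

def serial_delta_features(baseline: np.ndarray, followup: np.ndarray) -> np.ndarray:
    """Within-person change: element-wise difference (follow-up minus baseline)
    over the same ordered feature vector."""
    if baseline.shape != followup.shape:
        raise ValueError("Baseline and follow-up feature vectors must align")
    return followup - baseline
```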
The feature taxonomy included peak amplitudes, conventional intervals, segment levels, and durations, with beat-level quantities summarized by mean, minimum, maximum, and standard deviation. To reflect atrial remodeling signals, we also used compact P-wave shape indices, beat-wise correlation statistics against a per-record template, a narrow-band atrial-activity index, and concise heart-rate-variability summaries. Age and sex were appended as non-ECG covariates. Feature generation adhered to consistent conventions across all experiments.
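The sketch below illustrates how per-beat measurements could be collapsed into the record-level summary statistics described above and how age and sex might be appended; the function and key names are illustrative, with the exact feature definitions given in the prior study [8].

```python
import numpy as np

def summarize_beats(per_beat: np.ndarray, prefix: str) -> dict:
    """Collapse per-beat measurements (e.g., P-wave amplitude for each beat)
    into record-level mean/min/max/standard deviation."""
    return {
        f"{prefix}_mean": float(np.mean(per_beat)),
        f"{prefix}_min": float(np.min(per_beat)),
        f"{prefix}_max": float(np.max(per_beat)),
        f"{prefix}_std": float(np.std(per_beat)),
    }

def assemble_feature_vector(ecg_features: dict, age: float, sex_male: int) -> np.ndarray:
    """Concatenate ECG-derived features (in a fixed key order) with the non-ECG covariates."""
    keys = sorted(ecg_features)
    return np.asarray([ecg_features[k] for k in keys] + [age, sex_male], dtype=float)
```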
Modeling was performed with gradient-boosted decision trees (LightGBM) using a logistic objective. We employed Bayesian optimization over a bounded hyperparameter space (number of estimators, learning rate, maximum depth/num_leaves, subsample, colsample_bytree, min_data_in_leaf, and class weights) with early stopping on the validation split to control overfitting. Class imbalance was handled via class weighting during training. All runs used fixed random seeds and version-locked dependencies to ensure determinism and auditability.
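A minimal training sketch under the setup described above, using LightGBM's scikit-learn interface with class weighting and validation-based early stopping. The hyperparameter values and the synthetic data are placeholders; in the study these would come from the Bayesian optimization search and the engineered ECG features.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
# Synthetic stand-in data; in practice these are the engineered ECG feature matrices
# and AF labels from the patient-level development split.
X_train, y_train = rng.normal(size=(2000, 50)), rng.integers(0, 2, size=2000)
X_valid, y_valid = rng.normal(size=(500, 50)), rng.integers(0, 2, size=500)

model = lgb.LGBMClassifier(
    objective="binary",
    n_estimators=2000,          # upper bound; early stopping picks the effective number
    learning_rate=0.05,
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_samples=50,       # LightGBM's min_data_in_leaf
    class_weight="balanced",    # class weighting for the AF/NSR imbalance
    random_state=42,
)
model.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric="auc",
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)
risk_valid = model.predict_proba(X_valid)[:, 1]  # predicted AF probability
```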
Model selection was based on validation of the area under the receiver operating characteristic curve (AUROC). Operating points used in the staged workflow were prespecified and are described in the thresholding subsection; the serial-stage operating point corresponded to Youden’s J, while the baseline stage used a sensitivity-oriented operating point appropriate for screening. Performance metrics and calibration procedures are detailed in the statistical evaluation subsection.
Because tree ensembles are not layer-based, notions such as “layers” and “activation functions” do not apply; model capacity and structure are instead governed by tree depth and number of leaves, learning rate (shrinkage), and the number of boosting iterations. Regularization included row/column subsampling, split constraints (e.g., min_data_in_leaf, minimum split gain), L1/L2 penalties on leaf weights, and early stopping on the patient-level validation split; class weighting addressed the remaining imbalance.
For comprehensive algorithmic definitions, feature computation protocols, and additional implementation details, please refer to “Machine Learning Algorithm to Predict Atrial Fibrillation Using Serial 12-Lead ECGs Based on Left Atrial Remodeling” [8] and its Supplementary Material.
2.6. Threshold Selection and Decision Logic
On the validation set, we identified two operating points: (i) Youden’s J and (ii) a sensitivity-focused point in the 0.85–0.90 range suitable for screening. In the staged protocol, the baseline step (using a single-ECG model) employs the sensitivity-focused threshold to ensure a high NPV, while the follow-up step (with a serial-ECG model) uses Youden’s J to improve positive predictive value (PPV). Confusion matrices and scenario examples are included in the text to help interpretation.
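The operating points can be derived from the validation ROC curve as sketched below; the 0.875 sensitivity target is an illustrative value within the 0.85–0.90 band named above, not the study's exact setting.

```python
import numpy as np
from sklearn.metrics import roc_curve

def select_operating_points(y_true, y_score, target_sensitivity: float = 0.875):
    """Return (Youden threshold, sensitivity-oriented threshold) from validation data."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    youden_thr = thresholds[np.argmax(tpr - fpr)]   # maximizes sensitivity + specificity - 1
    # Highest threshold whose sensitivity still meets the screening target.
    meets_target = np.where(tpr >= target_sensitivity)[0]
    sens_thr = thresholds[meets_target[0]] if meets_target.size else thresholds[-1]
    return float(youden_thr), float(sens_thr)
```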
2.7. Metrics and Statistical Analysis
We report AUROC with 95% confidence intervals (CIs), sensitivity, specificity, PPV/negative predictive value (NPV), accuracy, and F1 score. Continuous variables are summarized as mean ± standard deviation (SD) and analyzed with appropriate tests, while categorical variables are presented as counts and percentages. Bootstrap resampling was employed to calculate CIs when applicable. We additionally report precision–recall AUC and calibration (intercept and slope) with patient-level bootstrap 95% CIs to complement AUROC in this imbalanced setting.
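For transparency, the sketch below shows one way to obtain a patient-level bootstrap 95% CI for AUROC and a calibration intercept and slope via logistic recalibration; it is a generic illustration of the stated analysis, not the study's exact code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

def bootstrap_auroc_ci(y_true, y_score, patient_ids, n_boot=1000, seed=0):
    """Patient-level bootstrap 95% CI for AUROC (resampling patients, not ECGs)."""
    rng = np.random.default_rng(seed)
    y_true, y_score, patient_ids = map(np.asarray, (y_true, y_score, patient_ids))
    patients = np.unique(patient_ids)
    aurocs = []
    for _ in range(n_boot):
        sampled = rng.choice(patients, size=patients.size, replace=True)
        idx = np.concatenate([np.flatnonzero(patient_ids == p) for p in sampled])
        if np.unique(y_true[idx]).size < 2:
            continue  # skip degenerate resamples containing a single class
        aurocs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(aurocs, [2.5, 97.5])

def calibration_intercept_slope(y_true, y_score, eps=1e-6):
    """Calibration intercept and slope from logistic regression of the outcome
    on the logit of the predicted risk (a common recalibration formulation)."""
    p = np.clip(np.asarray(y_score, dtype=float), eps, 1 - eps)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    fit = LogisticRegression(C=1e6).fit(logit, y_true)
    return float(fit.intercept_[0]), float(fit.coef_[0, 0])
```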
2.8. Proposed Algorithm for Screening Atrial Fibrillation (Figure 2)
The proposed algorithm employs a sequential screening process to identify AF. It involves two main stages: an initial prediction using a single-ECG model, followed by reassessment with a serial-ECG model. During the first visit, a 12-lead ECG is recorded and analyzed with the single-ECG model to estimate the likelihood of AF. If AF is predicted at this stage, the patient undergoes intensive ECG monitoring via wearable devices for further investigation. If not, a follow-up visit is scheduled after three months for reassessment, at which a second 12-lead ECG is performed. The serial-ECG model combines data from both ECGs to improve detection accuracy. If AF is predicted at follow-up, the patient is referred for intensive monitoring with wearable devices; if not, routine screening with a 12-lead ECG every six months continues.
This algorithm is designed to sequentially evaluate the risk of AF using both single and serial ECG analyses. Patients identified as having a higher likelihood of AF based on either model will undergo continuous monitoring, while those with negative results will continue with routine ECG assessments at regular intervals.
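As a compact summary of this decision logic, the sketch below encodes the staged triage with hypothetical threshold arguments; the actual thresholds are the validation-derived operating points described in Section 2.6.

```python
from enum import Enum
from typing import Optional

class Action(Enum):
    INTENSIVE_MONITORING = "refer for wearable-based intensive ECG monitoring"
    FOLLOW_UP_3_MONTHS = "repeat 12-lead ECG at the 3-month follow-up visit"
    ROUTINE_6_MONTHS = "routine 12-lead ECG screening every 6 months"

def staged_screening(p_single: float, thr_sensitivity: float,
                     p_serial: Optional[float] = None,
                     thr_youden: Optional[float] = None) -> Action:
    """Two-visit triage: sensitivity-oriented threshold at baseline (single-ECG model),
    Youden's J at follow-up (serial-ECG model)."""
    if p_single >= thr_sensitivity:
        return Action.INTENSIVE_MONITORING
    if p_serial is None or thr_youden is None:
        return Action.FOLLOW_UP_3_MONTHS      # second ECG not yet available
    if p_serial >= thr_youden:
        return Action.INTENSIVE_MONITORING
    return Action.ROUTINE_6_MONTHS
```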
Figure 2. Pre-screening protocol for early diagnosis of AF.
3. Results
3.1. Baseline Characteristics of the Development Cohort for the Two AI Models (Table 1)
A total of 2,083,335 ECGs were collected from 883,568 adult patients. Of these, 1,688,419 ECGs from 712,547 patients were excluded based on the study criteria. Ultimately, 164,793 patients were included, with 154,058 in the NSR group and 10,735 in the AF group. The AF group had a significantly higher mean age (65.7 ± 13.0 years) than the NSR group (56.5 ± 13.5 years; p < 0.001). Additionally, a larger proportion of males was observed in the AF group (53.9%) compared to the NSR group (46.9%; p < 0.001). Patients in the AF group also had more ECGs on average (3.5 ± 3.8) than those in the NSR group (2.1 ± 1.6; p < 0.001).
Table 1. Baseline characteristics.
| Characteristic | Overall (n = 164,793) | NSR Group (n = 154,058) | AF Group (n = 10,735) | p Value |
|---|---|---|---|---|
| Age, years | 57.3 ± 13.7 | 56.5 ± 13.5 | 65.7 ± 13.0 | <0.001 |
| Male, n (%) | 77,983 (47.3) | 72,201 (46.9) | 5782 (53.9) | <0.001 |
| Number of ECGs per patient | 2.4 ± 2.1 | 2.1 ± 1.6 | 3.5 ± 3.8 | <0.001 |
3.2. Performance of the Proposed Algorithm
The effectiveness of the proposed algorithm for AF screening was assessed in 11,349 patients. Participants were divided into two groups: a Normal Group of 10,274 patients with no history of AF and an AF Group of 1075 patients diagnosed with AF. Of those in the AF Group, 551 had a history of embolic stroke, representing a subset with AF-related cerebrovascular disease.
The classification performance of the proposed AI-enabled sequential ECG screening algorithm was assessed by comparing its predictions with the adjudicated clinical diagnoses of AF in the dataset. The algorithm achieved a sensitivity of 88.1%, showing a strong capacity to detect AF correctly, and a specificity of 78.7%, indicating its effectiveness in excluding non-AF cases. The PPV was 30.2%, meaning that about one-third of patients predicted to have AF were true positives, while the NPV was 98.4%, demonstrating a high ability to rule out AF in negative cases. The overall accuracy of the model was 79.6%, and the F1 score, which balances precision and recall, was 0.450 (Table 2). The confusion matrix (Figure 3) displays the model’s classification results: it correctly identified 8088 negative cases and 947 positive cases, whereas 2186 negative cases were wrongly classified as positive and 128 positive cases were misclassified as negative.
The model showed a strong ability to detect AF, with an AUROC of 0.908 (Figure 4). A subgroup analysis was performed for patients with a history of stroke (n = 551), a group especially relevant given the close link between AF and thromboembolic events. The model identified 468 of the 551 stroke patients (84.9%) as having AF, demonstrating a high detection rate in this clinically important subgroup. These findings highlight the algorithm’s effectiveness in screening for AF in the general population and its consistent performance in stroke patients, in whom early AF detection is vital for secondary prevention.
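As a consistency check, the reported metrics can be reproduced directly from the confusion-matrix counts stated above (TN = 8088, TP = 947, FP = 2186, FN = 128):

```python
tn, tp, fp, fn = 8088, 947, 2186, 128  # counts from the confusion matrix above

sensitivity = tp / (tp + fn)                      # 0.881
specificity = tn / (tn + fp)                      # 0.787
ppv = tp / (tp + fp)                              # 0.302
npv = tn / (tn + fn)                              # 0.984
accuracy = (tp + tn) / (tp + tn + fp + fn)        # 0.796
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # 0.450
```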