Abbreviations
AP, average precision; AP-lift, AP − pos_ratio (pos_ratio, positive-class prevalence); AUC, area under the curve; BLE, Bluetooth Low Energy; CORAL, correlation alignment; DANN, domain-adversarial neural network; DomAcc, domain-classifier accuracy; ECE, expected calibration error; ERM, empirical risk minimization; FPR, false positive rate; GRL, gradient reversal layer; IDS, intrusion detection system; IoT, Internet of Things; LOCO, leave-one-capture-out; LogReg, logistic regression; MLP, multilayer perceptron; MMD, maximum mean discrepancy; PR, precision–recall; RF, random forest; ROC-AUC, receiver operating characteristic area under the curve; SWD, sliced Wasserstein distance; t-SNE, t-distributed stochastic neighbor embedding; UDA, unsupervised domain adaptation; XGB, XGBoost; Zigbee, IEEE 802.15.4-based low-power wireless protocol; ZBDS, Zigbee dataset (ZBDS2023).
Figure 1.
Per-seed target BLE performance for all methods (seeds 2024/2025/2026). ROC-AUC: receiver operating characteristic area under the curve; AP: average precision. Points show each seed; this visualization complements Table 4 (mean ± std) and highlights seed-to-seed variability.
Figure 2.
t-distributed stochastic neighbor embedding (t-SNE) visualization of source (IP) and target (BLE) embeddings (seed = 2026) obtained from the 128-d feature extractor: (a) before adaptation (ERM, source only) and (b) after domain-adversarial training (DANN/GRL). t-SNE parameters (scikit-learn): perplexity = 30, init = pca, learning_rate = auto, n_iter = 1000, and random_state = 2026. Points are colored by domain (source IP vs. target BLE), with two colors (purple and yellow) indicating the two domains.
Figure 3.
Domain discriminator behavior during domain-adversarial training (seed = 2026). (a) Domain classifier accuracy (DomAcc) over epochs computed on a balanced domain-validation set (equal source-validation and unlabeled target samples): 1.0 indicates perfect domain separability (no alignment), while 0.5 indicates maximal domain confusion (alignment). (b) DANN training losses over epochs (source classification loss and domain loss). The transient confusion phase motivates domain-aware checkpointing; R3 selects a ‘star’ checkpoint using DomAcc among near-best source-validation epochs (seed = 2026; star epoch: 20; Table 10).
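The balanced DomAcc described in this caption can be illustrated with a minimal sketch. It assumes the discriminator's predictions are already available as integer arrays (0 = source, 1 = target); the function name `balanced_domain_accuracy` and the subsampling-without-replacement choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def balanced_domain_accuracy(src_pred, tgt_pred, rng):
    """Domain accuracy on an equal-size subsample from each domain.

    src_pred / tgt_pred: predicted domain labels (0 = source, 1 = target)
    for source-validation and unlabeled-target samples, respectively.
    """
    n = min(len(src_pred), len(tgt_pred))  # equal count per domain
    src_idx = rng.choice(len(src_pred), size=n, replace=False)
    tgt_idx = rng.choice(len(tgt_pred), size=n, replace=False)
    correct = np.sum(src_pred[src_idx] == 0) + np.sum(tgt_pred[tgt_idx] == 1)
    return correct / (2 * n)

rng = np.random.default_rng(2026)
# A discriminator that always answers "source" scores exactly chance
# on a balanced set, regardless of the original split sizes.
always_source_src = np.zeros(2344, dtype=int)
always_source_tgt = np.zeros(1520, dtype=int)
print(balanced_domain_accuracy(always_source_src, always_source_tgt, rng))  # 0.5
```

Balancing is what makes 0.5 the chance level: on an unbalanced set, a constant predictor would score the majority-domain fraction instead.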
Figure 4.
Target BLE performance summary across methods (seed = 2026): (a) metric summary (ROC-AUC/AP/F1); (b) target BLE precision–recall curve.
Figure 5.
Top-10 feature-level error contributors for DANN on the BLE target test set (seed = 2026). Features are ranked by |(FN−TP)|, the absolute difference between the mean standardized feature value of false negatives (FN) and true positives (TP) on the target test set. Values are computed after applying the source-fitted standardization (StandardScaler).
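The ranking rule in this caption, |mean(FN) − mean(TP)| per standardized feature, can be sketched as follows. The function name, argument layout, and toy data are hypothetical; only the ranking criterion comes from the caption.

```python
import numpy as np

def rank_error_contributors(X_std, y_true, y_pred, feature_names, top_k=10):
    """Rank features by |mean over FN - mean over TP| on standardized inputs."""
    fn = (y_true == 1) & (y_pred == 0)  # missed attacks
    tp = (y_true == 1) & (y_pred == 1)  # detected attacks
    if not fn.any() or not tp.any():
        raise ValueError("need at least one FN and one TP")
    gap = np.abs(X_std[fn].mean(axis=0) - X_std[tp].mean(axis=0))
    order = np.argsort(-gap)[:top_k]  # largest gap first
    return [(feature_names[i], float(gap[i])) for i in order]

# Toy example with two features "a" and "b" (values are illustrative).
X = np.array([[0.0, 2.0], [0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
y_true = np.array([1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0])
print(rank_error_contributors(X, y_true, y_pred, ["a", "b"]))
# [('b', 2.0), ('a', 0.5)]
```

A large gap indicates that missed attacks occupy a different region of that feature than detected attacks; it is a descriptive diagnostic, not a causal attribution.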
Figure 6.
Epoch-wise target-test ROC-AUC versus balanced domain accuracy (DomAcc) for DANN/GRL (seed = 2026). Lower DomAcc values (closer to 0.5) indicate stronger domain confusion on the balanced domain-validation set.
Figure 7.
Training dynamics for DANN/GRL (seed = 2026): source-validation ROC-AUC, target-test ROC-AUC, and balanced DomAcc across epochs. DomAcc approaching 0.5 indicates transient domain confusion; DomAcc near 1.0 indicates that the domains remain separable. The dashed horizontal line indicates the chance-level balanced domain accuracy (DomAcc = 0.5).
Figure 8.
R3 gains in target-test ROC-AUC (active): per-seed improvement in target-test ROC-AUC for the domain-aware checkpoint selected by R3 (AUC*) relative to the default-best checkpoint (AUC_best); ΔROC-AUC = AUC* − AUC_best. The asterisk (*) is used only to denote the R3-selected domain-aware checkpoint (not a statistical significance marker).
Figure 9.
Oracle gap of the domain-aware star checkpoint (active): per-seed oracle gap in target-test ROC-AUC (oracle − star). The oracle is computed post hoc and is used only as an analysis upper bound.
Figure 10.
Threshold-transfer failure case on the BLE target test set (seed = 2026). Confusion matrices (raw counts) are shown for ERM (source only), noGRL (lambda = 0), and DANN (GRL), illustrating that ERM/noGRL can collapse to near all-positive predictions under τ* transfer, while DANN mitigates this failure mode. The color intensity indicates the magnitude of the raw counts in each confusion-matrix cell (darker blue = larger count).
Figure 11.
Target score histograms on the mixed-class BLE capture group (ble_cap_2) under leakage control (seed = 2026): ERM (left) vs. DANN (right).
Figure 12.
Reliability diagrams (calibration curves) for ERM and DANN under leakage-controlled evaluation. The solid curve shows the empirical calibration (fraction of positives) in each probability bin, and the dashed diagonal line indicates perfect calibration (y = x); deviations from the diagonal indicate miscalibration under domain shift.
Table 1.
Dataset statistics for Internet Protocol (IP, source) and Bluetooth Low Energy (BLE, target) after windowing (window size = 64; stride = 64).
| Domain | Windows | y = 0 | y = 1 | Pos. Ratio |
|---|---|---|---|---|
| IP (source) | 15,625 | 7813 | 7812 | 0.500 |
| BLE (target) | 3041 | 1273 | 1768 | 0.581 |
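To make the windowing concrete (window size = 64, stride = 64, so windows do not overlap), the sketch below slices a packet stream and computes a subset of window-level statistics. The function name and the chosen eight statistics are assumptions for illustration; the paper's full 14-feature definition is not reproduced here.

```python
import numpy as np

def window_features(pkt_len, iat, window=64, stride=64):
    """Return one row of summary statistics per window (illustrative subset)."""
    rows = []
    for start in range(0, len(pkt_len) - window + 1, stride):
        L = pkt_len[start:start + window]  # packet lengths in this window
        T = iat[start:start + window]      # inter-arrival times in this window
        rows.append([L.mean(), L.std(), L.min(), L.max(), np.median(L),
                     T.min(), T.max(), T.std()])
    return np.array(rows)

# 200 packets yield (200 - 64) // 64 + 1 = 3 complete windows;
# the trailing 8 packets are dropped.
feats = window_features(np.arange(200.0), np.ones(200))
print(feats.shape)  # (3, 8)
```

With stride equal to window size, each packet contributes to at most one window, which avoids overlap between neighbouring windows but does not by itself prevent capture-level leakage.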
Table 2.
Data splits for the representative run (seed = 2026). The target split is unlabeled during training and used only for alignment/diagnostics; labels are used solely for final reporting. The feature dimension is 14.
| Domain | Train | Unlabeled Split | Test |
|---|---|---|---|
| Source (IP) | 10,937 | 2344 | 2344 |
| Target (BLE) | NA | 1520 | 1521 |
Table 3.
Key architecture and training settings used in all experiments (unless stated otherwise). A complete environment snapshot is provided in Supplementary File S1. Abbreviations: ERM, empirical risk minimization; CORAL, correlation alignment; MMD, maximum mean discrepancy; DANN, domain-adversarial neural network; GRL, gradient reversal layer; DomAcc, domain-classifier accuracy.
| Parameter | Value |
|---|---|
| Input features | 14 window-level statistical features (packet length and inter-arrival time statistics); window size = 64, stride = 64. |
| Standardization | StandardScaler fitted on source (IP) train split only; applied to all splits/domains. |
| Feature extractor (G) | MLP, 14 → 128 → 128 (ReLU). |
| Classifier head (C) | Linear 128 → 2 (binary classification). |
| Domain discriminator (D) | MLP, 128 → 64 → 2 logits. |
| Training length | 20 epochs for all methods. |
| GRL schedule (DANN) | Warm-up epochs 1–3: λ = 0; afterwards, λ increases monotonically toward 1 (see dann_train_history.csv). |
| Checkpoint selection | Best checkpoint selected on labeled source validation ROC-AUC (tie-breakers: source-validation AP; then earliest epoch); no target labels used. |
| Seeds | 2024/2025/2026; seed controls data splitting, initialization, and minibatch order (Python/NumPy/PyTorch RNGs). |
| Optimizer | Adam (PyTorch), default β₁ = 0.9, β₂ = 0.999. |
| Learning rate | 1 × 10⁻³. |
| Batch size | 512. |
| Weight decay | 0.0. |
| Dropout | Included in implementation (p = 0.0); see checkpoints/state_dict keys. |
| DomAcc evaluation set | Computed on a balanced domain-validation set formed by subsampling equal numbers from the source-validation and unlabeled target splits each epoch; therefore, chance = 0.5. |
| CORAL weight | 1.0 (implicit; CORAL uses closed-form covariance alignment, no extra scaling term). |
| MMD weight (β(ep, step)) | β(ep, step) schedule (MMD-ERM): loss = L_cls + β(ep, step)·L_mmd with EPOCHS = 20, WARMUP_EPOCHS = 3, β_max = 1.0. β(ep, step) = 0 for ep ≤ 3; else β(ep, step) = β_max·clip(p, 0, 1), where p = ((ep − WARMUP_EPOCHS − 1)·n_steps + step)/((EPOCHS − WARMUP_EPOCHS)·n_steps). |
| MMD kernel/bandwidth | Multi-kernel RBF MMD with kernel_mul = 2.0 and kernel_num = 5. Bandwidth normalization: bw = bw/kernel_mul^(kernel_num // 2); bandwidth_list = bw·kernel_mul^i (i = 0 to 4). |
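The β(ep, step) schedule in the MMD weight row can be written out directly. The sketch below transcribes the stated formula; the function name `mmd_weight` and the illustrative `n_steps = 10` are assumptions, and whether `step` is 0- or 1-indexed in the original code is not stated.

```python
EPOCHS = 20
WARMUP_EPOCHS = 3
BETA_MAX = 1.0

def mmd_weight(ep, step, n_steps):
    """beta(ep, step): zero during warm-up, then a linear ramp clipped to [0, 1]."""
    if ep <= WARMUP_EPOCHS:
        return 0.0
    p = ((ep - WARMUP_EPOCHS - 1) * n_steps + step) / (
        (EPOCHS - WARMUP_EPOCHS) * n_steps
    )
    return BETA_MAX * min(max(p, 0.0), 1.0)

n_steps = 10  # minibatches per epoch (illustrative value)
print(mmd_weight(3, 9, n_steps))    # 0.0  (still in warm-up)
print(mmd_weight(4, 0, n_steps))    # 0.0  (ramp starts here)
print(mmd_weight(12, 5, n_steps))   # 0.5  (85 / 170)
print(mmd_weight(20, 10, n_steps))  # 1.0  (end of the final epoch)
```

The ramp spans the 17 post-warm-up epochs, so the alignment term reaches full weight only at the very end of training.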
Table 4.
Cross-protocol target BLE performance (mean ± std over seeds 2024/2025/2026). Classical ML baselines (LogReg/RF/XGB) are also reported. All results use active-window filtering (zero-activity windows removed). F1 is not reported (NA) for the classical baselines, and DomAcc_last is not applicable (NA) to models without a domain discriminator. Note: F1@τ* can be inflated when a source-selected threshold transfers poorly to the target; therefore, we treat ROC-AUC/AP as primary metrics. Abbreviations: BLE, Bluetooth Low Energy; ERM, empirical risk minimization; CORAL, correlation alignment; MMD, maximum mean discrepancy; DANN, domain-adversarial neural network; GRL, gradient reversal layer; AP, average precision; DomAcc_last, final-epoch domain-classifier accuracy; LogReg, logistic regression; RF, random forest; XGB, XGBoost.
| Method | Tgt AUC | Tgt AP | Tgt F1 | DomAcc_last |
|---|---|---|---|---|
| CORAL-ERM | 0.555 ± 0.094 | 0.712 ± 0.060 | 0.620 ± 0.030 | NA |
| DANN (GRL) | 0.620 ± 0.043 | 0.668 ± 0.046 | 0.613 ± 0.211 | 0.973 ± 0.002 |
| ERM (source only) | 0.590 ± 0.097 | 0.632 ± 0.057 | 0.735 ± 0.000 | NA |
| MMD-ERM | 0.613 ± 0.062 | 0.655 ± 0.034 | 0.614 ± 0.210 | NA |
| noGRL (lambda = 0) | 0.579 ± 0.079 | 0.625 ± 0.045 | 0.735 ± 0.000 | 0.996 ± 0.002 |
| LogReg | 0.683 ± 0.027 | 0.756 ± 0.078 | NA | NA |
| RF | 0.758 ± 0.051 | 0.770 ± 0.062 | NA | NA |
| XGB | 0.711 ± 0.013 | 0.706 ± 0.009 | NA | NA |
Table 5.
Bootstrap 95% confidence intervals on the BLE target test set (seed = 2026).
| Method | AUC | AUC 95% CI | AP | AP 95% CI |
|---|---|---|---|---|
| ERM (source only) | 0.501 | [0.500, 0.502] | 0.581 | [0.559, 0.606] |
| noGRL (lambda = 0) | 0.502 | [0.500, 0.504] | 0.583 | [0.557, 0.608] |
| DANN (GRL) | 0.626 | [0.597, 0.653] | 0.713 | [0.680, 0.743] |
| Δ (DANN − ERM) | 0.124 | [0.096, 0.153] | 0.131 | [0.106, 0.157] |
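Intervals of this kind can be obtained with a percentile bootstrap over test windows. The sketch below is one common variant and is an assumption: the caption does not state the bootstrap type, the number of resamples, or whether resampling was stratified. The helper names `roc_auc` and `bootstrap_ci` are illustrative.

```python
import numpy as np

def roc_auc(y, s):
    """ROC-AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def bootstrap_ci(y, s, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile-bootstrap (1 - alpha) interval for ROC-AUC."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)  # resample windows with replacement
        yb = y[idx]
        if yb.min() == yb.max():
            continue  # resample lacks one class; AUC undefined
        stats.append(roc_auc(yb, s[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return roc_auc(y, s), float(lo), float(hi)

# Toy check: perfectly separated scores give AUC = 1 in every resample.
y_toy = np.array([0] * 20 + [1] * 20)
s_toy = y_toy.astype(float)
print(bootstrap_ci(y_toy, s_toy, n_boot=200))  # (1.0, 1.0, 1.0)
```

For a difference such as the Δ row, both models would typically be evaluated on the same resampled indices (a paired bootstrap) so that the interval reflects the per-sample correlation between them; whether that was done here is not stated.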
Table 6.
Unified diagnostic workflow for cross-protocol transfer under target-unlabeled UDA (seed = 2026 numbers shown where applicable).
| Stage/Question | Proxy Diagnostic (No Target Labels) | Interpretation and Recommended Action | Where It Is Reported |
|---|---|---|---|
| Split realism and leakage risk | Random window split vs. capture-wise/LOCO; group-wise splitting when IDs are available | Treat random splits as optimistic; always report capture-wise/LOCO results for deployment-faithful claims. | Section 3.3 and Section 4.7; Table 13 |
| Global representation gap | SWD and kernel MMD (with σ sweep); DomAcc curves during adversarial training | Use multiple, complementary proxies; divergences can disagree (seed = 2026: SWD 66.115 → 40.679, MMD 0.579 → 0.669). Avoid final-epoch conclusions when DomAcc returns to ≈1.0. | Tables 7 and 8; Figures 3, 6 and 7 |
| Within-class semantic shift | Class-conditional KS tests (per-feature, per-class) | Large within-class shifts suggest protocol/capture artifacts and can explain LOCO collapse despite apparent ‘alignment’. | Table 9; Section 4.7 |
| Checkpoint selection without target labels | R3: choose a ‘star’ epoch among near-best source-validation epochs by preferring higher domain confusion (DomAcc closer to 0.5) | Mitigates seed-to-seed variance without using target labels. Report δ sensitivity and the oracle gap for context. | Section 4.5; Tables 10 and 11; Supplementary Table S1 |
| Operating-point safety | Threshold-transfer audit (τ_F1 vs. τ at calibrated FPR); micro-FPR on benign-only captures; calibration checks | Prefer PR/AP and calibrated operating points; a source-tuned τ_F1 can yield unsafe micro-FPR = 1.0 under shift. | Section 4.6; Table 12; Figures 10–12 |
| Reproducibility and auditability | Release derived features, split definitions, and scripts; report sensitivity analyses | Enable independent reproduction and auditing even under restricted raw pcap access; reduce single-author interpretation bias. | Supplementary File S1; Appendix A.5 |
Table 7.
Domain-gap diagnostics (maximum mean discrepancy (MMD) and sliced Wasserstein distance (SWD)) before vs. after adaptation (seed = 2026).
| Metric | Before | After |
|---|---|---|
| MMD | 0.579 | 0.669 |
| SWD | 66.115 | 40.679 |
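For readers unfamiliar with SWD, the following is a minimal sketch of a sliced 1-Wasserstein estimate: project both samples onto random unit directions, compare the projected quantile functions, and average. The number of projections, the quantile grid, and the use of the 1-Wasserstein distance are assumptions; the values it produces are not expected to match the scale of the table.

```python
import numpy as np

def sliced_wasserstein(a, b, n_proj=128, seed=0):
    """Approximate sliced 1-Wasserstein distance between two point clouds."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, a.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    q = np.linspace(0.0, 1.0, 101)                       # shared quantile grid
    qa = np.quantile(a @ dirs.T, q, axis=0)              # shape (101, n_proj)
    qb = np.quantile(b @ dirs.T, q, axis=0)
    return float(np.mean(np.abs(qa - qb)))

rng = np.random.default_rng(0)
src = rng.normal(size=(500, 14))
print(sliced_wasserstein(src, src))            # 0.0 for identical samples
print(sliced_wasserstein(src, src + 3.0) > 0)  # True for a shifted copy
```

Because SWD compares one-dimensional projections while kernel MMD compares kernel mean embeddings, the two can move in opposite directions, as they do in this table.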
Table 8.
MMD kernel bandwidth sensitivity (σ sweep) on 14D window-level feature vectors (IP vs. BLE; subsampled windows per domain). σ denotes the RBF kernel bandwidth; MMD² is estimated with an unbiased estimator.
| σ | MMD² (Unbiased) |
|---|---|
| 0.1 | 0.692760 |
| 0.2 | 0.723636 |
| 0.5 | 0.840080 |
| 1 | 1.040332 |
| 2 | 1.110997 |
| 5 | 0.438673 |
| 10 | 0.138348 |
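The unbiased MMD² estimator used for this sweep can be sketched as below. The kernel parameterization k(x, y) = exp(−‖x − y‖² / (2σ²)) is an assumption (other conventions, such as exp(−‖x − y‖² / σ²), are also in use), and the synthetic data are illustrative only.

```python
import numpy as np

def mmd2_unbiased(x, y, sigma):
    """Unbiased MMD^2 estimate with an RBF kernel of bandwidth sigma."""
    def k(a, b):
        d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))
    m, n = len(x), len(y)
    kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
    # Exclude the diagonal (i == j) terms from the within-sample averages.
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return float(term_x + term_y - 2 * kxy.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 14))
y_same = rng.normal(size=(200, 14))
y_shift = x + 5.0
for sigma in (1.0, 5.0):
    print(sigma, mmd2_unbiased(x, y_same, sigma), mmd2_unbiased(x, y_shift, sigma))
```

The estimate depends strongly on σ: very small bandwidths make every pair look dissimilar, very large ones make every pair look alike, which is why a sweep is more informative than a single value.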
Table 9.
Semantic-shift diagnostic via class-conditional KS tests (IP vs. BLE). For each feature, we report KS statistics (D) and p-values within benign and attack subsets. Very small p-values may underflow to 0 in double precision; we report them as <1 × 10⁻³⁰⁰ for readability.
| Feature | KS D (Benign) | p (Benign) | KS D (Attack) | p (Attack) |
|---|---|---|---|---|
| pkt_len_mean | 0.556167 | <1 × 10⁻³⁰⁰ | 1.000000 | <1 × 10⁻³⁰⁰ |
| pkt_len_min | 0.966222 | <1 × 10⁻³⁰⁰ | 1.000000 | <1 × 10⁻³⁰⁰ |
| pkt_len_median | 0.587588 | <1 × 10⁻³⁰⁰ | 1.000000 | <1 × 10⁻³⁰⁰ |
| iat_max | 0.998429 | <1 × 10⁻³⁰⁰ | 0.729937 | <1 × 10⁻³⁰⁰ |
| iat_std | 0.998429 | <1 × 10⁻³⁰⁰ | 0.729681 | <1 × 10⁻³⁰⁰ |
| iat_min | 0.998429 | <1 × 10⁻³⁰⁰ | 0.729372 | <1 × 10⁻³⁰⁰ |
| pkt_len_std | 0.998301 | <1 × 10⁻³⁰⁰ | 0.640511 | <1 × 10⁻³⁰⁰ |
| pkt_len_max | 0.995159 | <1 × 10⁻³⁰⁰ | 0.494227 | <1 × 10⁻³⁰⁰ |
| nonzero_ratio | 0.966094 | <1 × 10⁻³⁰⁰ | 0.625192 | <1 × 10⁻³⁰⁰ |
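The KS statistic D in this table is the largest gap between two empirical CDFs, computed separately within each class. The sketch below computes D only (p-values could be obtained with `scipy.stats.ks_2samp`); the function names and toy arrays are illustrative assumptions.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a - ECDF_b| over all observed points."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def class_conditional_ks(x_src, y_src, x_tgt, y_tgt):
    """KS D for one feature, computed within benign (0) and attack (1) subsets."""
    return {label: ks_statistic(x_src[y_src == label], x_tgt[y_tgt == label])
            for label in (0, 1)}

# Toy feature: benign values overlap across domains, attack values do not.
x_src = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_src = np.array([0, 0, 0, 1, 1, 1])
x_tgt = np.array([1.0, 2.0, 3.0, 50.0, 51.0, 52.0])
y_tgt = np.array([0, 0, 0, 1, 1, 1])
print(class_conditional_ks(x_src, y_src, x_tgt, y_tgt))  # {0: 0.0, 1: 1.0}
```

D = 1.0, as reported for the attack-class packet-length features, means the two samples do not overlap at all on that feature: every value in one domain lies below every value in the other.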
Table 10.
Domain-aware checkpoint selection (R3) under the active protocol. Per-seed comparison between the default-best checkpoint (highest source-validation ROC-AUC) and the domain-aware star checkpoint (DomAcc closest to 0.5 among near-best epochs).
| Seed | Best Epoch | Star Epoch | Tgt ROC-AUC (Best) | Tgt ROC-AUC (Star) | Delta ROC-AUC | Tgt AP (Best) | Tgt AP (Star) | Delta AP |
|---|---|---|---|---|---|---|---|---|
| 2024 | 1 | 11 | 0.504 | 0.538 | 0.034 | 0.570 | 0.554 | −0.015 |
| 2025 | 1 | 12 | 0.496 | 0.542 | 0.046 | 0.561 | 0.555 | −0.006 |
| 2026 | 1 | 20 | 0.475 | 0.556 | 0.081 | 0.547 | 0.555 | 0.008 |
| Mean ± Std | | | 0.492 ± 0.015 | 0.545 ± 0.009 | 0.053 ± 0.024 | 0.559 ± 0.011 | 0.555 ± 0.000 | −0.005 ± 0.012 |
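The R3 rule summarized in the caption can be sketched as a two-step filter: keep epochs whose source-validation ROC-AUC is within a tolerance δ of the best, then pick the one whose DomAcc is closest to 0.5. The default `delta = 0.01`, the earliest-epoch tie-break, and the history layout are assumptions made for illustration; no target labels are used.

```python
def select_star_epoch(history, delta=0.01):
    """Pick the near-best source-validation epoch with DomAcc closest to 0.5.

    history: list of dicts with keys 'epoch', 'src_val_auc', 'dom_acc'.
    """
    best_auc = max(h["src_val_auc"] for h in history)
    near_best = [h for h in history if h["src_val_auc"] >= best_auc - delta]
    # Prefer maximal domain confusion; break ties by the earliest epoch.
    star = min(near_best, key=lambda h: (abs(h["dom_acc"] - 0.5), h["epoch"]))
    return star["epoch"]

toy_history = [
    {"epoch": 1, "src_val_auc": 0.990, "dom_acc": 0.99},
    {"epoch": 2, "src_val_auc": 0.995, "dom_acc": 0.70},
    {"epoch": 3, "src_val_auc": 0.999, "dom_acc": 0.95},
    {"epoch": 4, "src_val_auc": 0.900, "dom_acc": 0.50},  # confused but not near-best
]
print(select_star_epoch(toy_history))  # 2
```

Epoch 4 in the toy history shows why the near-best filter matters: maximal domain confusion alone would select a checkpoint whose source-validation performance has collapsed.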
Table 11.
Oracle comparison for checkpoint selection (analysis-only upper bound) under the active protocol. Oracle epoch maximizes target-test ROC-AUC post hoc.
| Seed | Best Epoch | Star Epoch | Oracle Epoch | Tgt ROC-AUC (Best) | Tgt ROC-AUC (Star) | Tgt ROC-AUC (Oracle) | Oracle Gap (Oracle − Star) |
|---|---|---|---|---|---|---|---|
| 2024 | 1 | 11 | 20 | 0.504 | 0.538 | 0.542 | 0.004 |
| 2025 | 1 | 12 | 20 | 0.496 | 0.542 | 0.557 | 0.015 |
| 2026 | 1 | 20 | 17 | 0.475 | 0.556 | 0.556 | 0.000 |
Table 12.
Operating point and calibration audit under leakage control (seed = 2026). AUC/AP are reported only for the mixed-class BLE capture group (ble_cap_2). Pos. ratio is the positive-class prevalence in ble_cap_2; AP-lift = AP − pos_ratio contextualizes AP under class imbalance. Micro-FPR is computed across all capture groups (ble_cap_0/1/2) using negative counts; τ_F1 is the threshold that maximizes F1 on source validation, and τ(FPR = 1%) is the source-validation threshold that achieves 1% FPR on negatives. Abbreviations: AUC, area under the ROC curve; AP, average precision; FPR, false-positive rate; ECE, expected calibration error.
| Model | AUC (ble_cap_2) | AP (ble_cap_2) | Pos. Ratio (ble_cap_2) | AP-Lift | Micro-FPR@τ_F1 | Micro-FPR@τ(FPR = 1%) | ECE (src-val) | ECE (tgt ble_cap_2) |
|---|---|---|---|---|---|---|---|---|
| ERM_MLP | 0.501 | 0.714 | 0.714 | 0.000 | 1.000 | 1.000 | 0.577 | 0.286 |
| DANN_GRL | 0.501 | 0.714 | 0.714 | 0.000 | 0.684 | 0.684 | 0.562 | 0.286 |
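Two quantities in this table, ECE and micro-FPR, can be sketched as follows. The equal-width 10-bin ECE and the "score > τ means positive" convention are assumptions; the paper's binning and thresholding details are not given in the caption.

```python
import numpy as np

def expected_calibration_error(y, p, n_bins=10):
    """ECE with equal-width bins: weighted mean |accuracy - confidence| per bin."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y[mask].mean() - p[mask].mean())
    return float(ece)

def micro_fpr(neg_scores_by_group, tau):
    """Pool negatives from all capture groups: total false positives / total negatives."""
    fp = sum(int(np.sum(s > tau)) for s in neg_scores_by_group)
    n = sum(len(s) for s in neg_scores_by_group)
    return fp / n

# Illustration: a fully saturated scorer (p = 1 for every window) on data
# with positive prevalence 0.714 has ECE = |0.714 - 1.0| = 0.286.
y = np.array([1] * 714 + [0] * 286)
p = np.ones(1000)
print(round(expected_calibration_error(y, p), 3))  # 0.286
```

The illustrative value coincides with the target ECE reported in the table; whether the models' target scores are in fact saturated is not stated here, so this should be read as one way such a value can arise rather than as an explanation.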
Table 13.
Leakage-controlled leave-one-capture-out (LOCO) performance on the only mixed-class BLE capture group (ble_cap_2). Mean ± std across seeds {2024, 2025, 2026}. AP-lift is defined as AP − pos_ratio to contextualize AP under class imbalance (pos_ratio = 0.714 for ble_cap_2).
| Model | AUC | AP | AP-Lift |
|---|---|---|---|
| LogReg | 0.670 ± 0.001 | 0.862 ± 0.001 | 0.148 ± 0.001 |
| RF | 0.593 ± 0.016 | 0.808 ± 0.004 | 0.094 ± 0.004 |
| XGBoost | 0.490 ± 0.067 | 0.713 ± 0.029 | −0.001 ± 0.029 |
| ERM_MLP | 0.501 ± 0.000 | 0.714 ± 0.000 | 0.000 ± 0.000 |
| DANN_GRL | 0.497 ± 0.006 | 0.713 ± 0.002 | −0.001 ± 0.002 |
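The AP-lift column follows directly from the definition AP − pos_ratio with pos_ratio = 0.714. The short check below recomputes it from the mean AP values in the table; the dictionary layout is illustrative.

```python
POS_RATIO = 0.714  # positive prevalence in ble_cap_2 (from the caption)

mean_ap = {
    "LogReg": 0.862,
    "RF": 0.808,
    "XGBoost": 0.713,
    "ERM_MLP": 0.714,
    "DANN_GRL": 0.713,
}

# AP-lift = AP - pos_ratio, rounded to the table's three decimals.
ap_lift = {model: round(ap - POS_RATIO, 3) for model, ap in mean_ap.items()}
for model, lift in ap_lift.items():
    print(f"{model:9s} {lift:+.3f}")
```

Subtracting the prevalence is useful because a scorer that ranks windows at random has an expected AP close to the positive prevalence, so an AP-lift near zero (as for ERM_MLP and DANN_GRL) indicates no ranking skill on this capture group despite an AP above 0.7.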