1. Introduction
Cross-site scripting (XSS) attacks remain a persistent security threat due to their widespread occurrence and ease of exploitation [
1]. Machine learning-based detection, including reinforcement learning [
2,
3] and ensemble learning [
4,
5], has advanced significantly, with earlier studies [
4,
6,
7] and more recent works [
8,
9,
10,
11] focusing on improving model architectures and feature extraction.
However, many methods still face generalisation issues due to highly distributed data and privacy constraints. Federated learning (FL) has emerged as a privacy-preserving alternative, allowing collaborative training without exposing raw data. This study explores the use of FL for XSS detection, addressing key challenges such as non-independent and identically distributed (non-IID) data, heterogeneity, and out-of-distribution (OOD) generalisation. While FL has been applied in cybersecurity [
12,
13], its role in XSS detection remains underexplored. Most prior works focus on network traffic analysis, rather than text-based XSS payloads.
This study presents the first systematic application of federated learning to XSS detection under text-based XSS threat scenarios. Our key contributions are as follows.
We design a federated learning (FL) framework for XSS detection under structurally non-IID client distributions, incorporating diverse XSS types, obfuscation styles, and attack patterns. This setup reflects real-world asymmetry, where some clients contain partial or ambiguous indicators and others contain clearer attacks. Importantly, structural divergence also affects negatives, whose heterogeneity is a key yet underexplored factor in generalisation failure. Our framework enables the study of bidirectional OOD, where fragmented negatives cause high false positive rates under distribution mismatch.
Unlike prior work that mixes lexical or contextual features across splits, we maintain strict structural separation between training and testing data. By using an external dataset [
14] as an OOD domain, we isolate bidirectional distributional shifts across both classes under FL. Our analysis shows that generalisation failure can also be driven by structurally complicated benign samples, not only by rare or obfuscated attacks, emphasising the importance of structure-aware dataset design.
We compare three embedding models (GloVe [
15], CodeT5 [
16], GraphCodeBERT [
17]) in centralised and federated settings, showing that generalisation depends more on embedding compatibility with class heterogeneity than on model capacity. Using divergence metrics and ablation studies, we demonstrate that structurally complex and underrepresented negatives lead to severe false positives. Static embeddings like GloVe show more robust generalisation under structural OOD, indicating that stability relies more on representational resilience than expressiveness.
2. Related Work
Existing research on federated learning (FL) for XSS detection remains scarce. The most relevant work by Jazi & Ben-Gal [
18] investigated FL’s privacy-preserving properties using simplified setups and traditional models (e.g., MLP, KNN). Their non-IID configuration assumes an unrealistic “all-malicious vs. all-benign” client split, and evaluation is conducted separately on a handcrafted text-based XSS dataset [
14] and the CICIDS2017 intrusion dataset [
19]. However, they do not consider data heterogeneity or OOD generalisation. Still, the dataset [
14] they selected is structurally rich and thus serves as a suitable OOD test dataset in our experiments (see
Section 3.2).
Research on distributional shift or OOD has recently gained popularity [
20,
21,
22,
23], but most studies remain focused on computer vision, with few addressing specific cybersecurity scenarios in federated learning. For example, in addressing distribution shift in image anomaly detection [
23], the authors proposed Generalised Normality Learning to mitigate differences between in-distribution (ID) and OOD. For the FL domain, the FOOGD approach [
22] utilised federated learning to handle two distinct types of distribution shifts, namely covariate shift and semantic shift. This study offers valuable design insights (e.g., MMD and SM3D strategies), though its application domain is still primarily computer vision.
Heterogeneity in datasets remains a significant challenge for XSS detection [
24,
25,
26,
27]. The absence of standardised datasets, particularly in terms of class variety and sample volume, can have a substantial impact on the decision boundaries learned by detection models [
28,
29]. Most existing studies, including [
6,
9,
10,
11], attempt to address this issue through labour-intensive manual processing, aiming to ensure strict control over data quality, feature representation, label consistency, and class definitions.
However, we argue that complete reliance on manual curation often fails to reflect real-world conditions. In practical cybersecurity scenarios, data imbalance is both common and inevitable, especially regarding the ratio and diversity of attack versus non-attack samples [
27,
28,
30]. This often results in pronounced structural and categorical divergence between positive and negative classes. For example, commonly used XSS filters frequently over-filter benign inputs [
31], indicating a mismatch between curated datasets and actual deployment environments.
In light of these challenges, federated learning demonstrates strong potential. It enables models to share decision boundaries through privacy-preserving aggregation [
20,
32], offering an effective alternative to centralised data collection and manual intervention.
Meanwhile, we argue that findings from FL research on malicious URL detection [
33,
34] are partially transferable to XSS detection. Although some malicious URLs may embed XSS payloads, the two tasks differ in semantic granularity, execution context, and structural variability. Given their shared challenges, such as class imbalance, distribution shift, and non-IID data, we argue that FL techniques proven effective for URL detection offer a reasonable foundation for adaptation to XSS.
The high sensitivity of XSS-related information, such as emails or session tokens, makes sharing difficult without anonymisation. Yet studies [
35,
36] show that anonymisation often introduces significant distributional shifts due to strategy-specific biases. Disparities in logging, encoding, and user behaviour further distort data distributions, compromising generalisation [
35,
36].
For example, strings embedded in polyglot-style payloads are hard to anonymise, as minor changes may affect execution. Consider the following sample:
<javascript:/*-><img/src='x'onerror=eval(unescape(/%61%6c%65%72%74%28%27%45%78%66%69%6c%3A%20%2b%20%27%2b%60test@example.com:1849%60%29/))>
Naively replacing “test@example.com” with an unquoted *** breaks JavaScript syntax, rendering the sample invalid and misleading detectors. While AST-based desensitisation can preserve structure, it is complex, labour-intensive, and lacks scalability [
37].
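As a small illustration, the sketch below applies a naive regex-based redaction (a hypothetical anonymisation step, not the AST-based approach cited above) to the payload shown earlier:

import re

# Polyglot-style payload from the example above; the email address sits inside
# an encoded alert() argument that is executed via eval(unescape(...)).
payload = ("<javascript:/*-><img/src='x'onerror=eval(unescape("
           "/%61%6c%65%72%74%28%27%45%78%66%69%6c%3A%20%2b%20%27%2b"
           "%60test@example.com:1849%60%29/))>")

# Naive anonymisation: replace any email-like token with an unquoted marker.
redacted = re.sub(r"[\w.+-]+@[\w.-]+", "***", payload)

# As noted above, such unquoted replacement can invalidate the decoded
# JavaScript, so the redacted sample no longer behaves like the original
# attack and may mislead detectors trained on executable payloads.
print(redacted)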
To address these challenges, this study introduces a federated learning (FL) framework to enhance XSS detection while preserving data privacy, especially under an OOD scenario. FL enables collaborative training without exposing raw data [
12,
32], mitigating distributional divergence and improving robustness [
22,
32]. More importantly, our approach leverages structurally well-aligned, semantically coherent clients to anchor global decision boundaries, allowing their generalisation capabilities to be implicitly shared across clients with fragmented, noisy, or ambiguous data distributions. In doing so, we avoid the need for centralised, large-scale anonymisation or sanitisation and instead provide low-quality clients with clearer classification margins without direct data sharing or manual intervention. This decentralised knowledge transfer mechanism forms the basis of our FL framework, detailed in
Section 5, and evaluated under dual OOD settings across three embedding models.
Section 4 presents the centralised OOD evaluation.
4. Independent Client Testing with OOD Distributed Data
In the first part of our evaluation, we trained on Dataset 1 and tested on Dataset 2, then reversed the setup. While both datasets target reflected XSS, they differ in structural and lexical characteristics, as detailed in
Section 3.1. This asymmetry, present in both positive and negative samples, led to significant generalisation gaps. In particular, models trained on one dataset exhibited lower precision and increased false positive rates when tested on the other, reflecting the impact of data divergence under OOD settings.
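A minimal sketch of this bidirectional protocol is shown below, assuming pre-computed embedding features and labels as NumPy arrays; the logistic-regression classifier is only a placeholder for the models evaluated in this section.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

def cross_dataset_eval(train_X, train_y, test_X, test_y):
    """Train on one dataset's embeddings and evaluate on the other (OOD)."""
    clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)
    pred = clf.predict(test_X)
    # False positive rate: share of benign samples flagged as XSS.
    fpr = ((pred == 1) & (test_y == 0)).sum() / max((test_y == 0).sum(), 1)
    return {
        "precision": precision_score(test_y, pred),
        "recall": recall_score(test_y, pred),
        "fpr": fpr,
    }

# Both directions of the OOD protocol (with hypothetical feature arrays):
# results_d1_to_d2 = cross_dataset_eval(X_d1, y_d1, X_d2, y_d2)
# results_d2_to_d1 = cross_dataset_eval(X_d2, y_d2, X_d1, y_d1)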
We evaluated all three embedding models under both configurations. Confusion matrices (
Figure 4 and
Figure 5) illustrate the classification differences when training on low- versus high-generalisation data, respectively. Beforehand, we established performance baselines via 20% held-out splits of the original training set to rule out overfitting (
Table 3).
Figure 6 summarises cross-distribution performance under each model, and
Figure 7 highlights the extent of performance shifts under structural OOD. These results confirm that both positive and negative class structures play a critical role in the generalisation performance of XSS detectors.
To isolate the impact of positive sample structure, we conducted cross-set training in which the training positives originated from the high-generalisation Dataset 2 while the fragmented negatives from Dataset 1 were retained, using the most structure-sensitive model, GraphCodeBERT. Compared to the baseline trained entirely on Dataset 1, this setup substantially improved accuracy (from 56.80% to 71.57%) and precision (from 44.82% to 68.39%), with recall slightly increased to 99.70%. These findings highlight that structural integrity in positive samples enhances model confidence and generalisability even under noisy negative supervision. Conversely, the fragmented negatives primarily drive false positives (FPR of 68.19%). See
Table 4.
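A minimal sketch of how this cross-set training split can be assembled is given below; the file names and column layout are assumptions for illustration, not the paper's actual data pipeline.

import pandas as pd

# Hypothetical loaders: each dataset is assumed to expose 'text' and 'label'
# columns, with label 1 = XSS payload and label 0 = benign.
d1 = pd.read_csv("dataset1.csv")  # fragmented negatives, structurally diverse
d2 = pd.read_csv("dataset2.csv")  # high-generalisation, structurally cleaner

# Cross-set training data: positives from Dataset 2, negatives from Dataset 1.
train = pd.concat([
    d2[d2["label"] == 1],         # structurally intact positives
    d1[d1["label"] == 0],         # fragmented negatives retained
]).sample(frac=1.0, random_state=42)  # shuffle

# The baseline for comparison is trained entirely on Dataset 1; both models
# are then evaluated under the same OOD protocol described above.
baseline_train = d1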
4.1. Generalisation Performance Analysis
We evaluate the generalisation ability of GloVe, GraphCodeBERT, and CodeT5 embeddings by training on the structurally diverse and fragmented Dataset 1 and testing on the high-generalisation Dataset 2. All models experience a significant drop in performance, particularly in precision and false positive rate (FPR), indicating high sensitivity to structural shifts across datasets.
GraphCodeBERT shows the most severe performance degradation, with precision dropping from 84.38% to 45.03% (−39.35%) and FPR increasing from 19.16% to 65.62% (+46.46%). Despite maintaining nearly perfect recall (99.63%), it heavily overpredicts positives when faced with unfamiliar structures, suggesting poor robustness to syntactic variance due to its code-centric pre-training.
CodeT5 suffers slightly less, but still significant degradation: precision drops from 84.50% to 46.36% (−38.14%), and FPR rises from 18.47% to 61.95% (+43.48%). This suggests that while its span-masked pre-training aids structural abstraction, it still fails under negative class distribution shift.
GloVe demonstrates the most stable cross-dataset performance, with a precision decline from 90.13% to 51.58% (−38.55%) and FPR increasing from 11.90% to 47.90% (+36.00%). Although static and context-agnostic, GloVe is less vulnerable to structural OOD, likely due to its reliance on global co-occurrence statistics rather than positional or syntactic features.
These results support that structural generalisation failure arises from both positive class fragmentation and negative class dissimilarity. Models relying on local syntax (e.g., GraphCodeBERT) are more prone to false positives, while those leveraging global distributional features (e.g., GloVe) exhibit relatively better robustness under extreme OOD scenarios.
Sensitivity of Embeddings to Regularisation Under OOD
Under structural OOD conditions, CodeT5 achieved high recall (≥99%) but suffered from low precision and a high FPR, indicating overfitting to local patterns. Stronger regularisation (dropout = 0.3, lr = 0.0005) improved precision (+4.73%) and reduced FPR (−10.89%), showing modest gains in robustness. GloVe benefited the most from regularisation, with FPR dropping to 29.49% and precision rising to 63.41%. In contrast, GraphCodeBERT remained relatively insensitive to regularisation, with comparatively small changes across settings. These results suggest that structure-sensitive embeddings require tuning to remain effective under structural shift, while static embeddings like GloVe offer more stable performance.
Notably, we also observed that stronger dropout regularisation tends to widen the performance gap between the best and worst OOD scenarios, especially for GloVe (4–9%). See
Table 5.
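For concreteness, a minimal sketch of the stronger-regularisation configuration (dropout = 0.3, lr = 0.0005) is shown below, assuming a simple feed-forward classifier head over frozen embeddings; the layer sizes and optimiser are hypothetical stand-ins rather than the actual architecture.

import torch
import torch.nn as nn

class XSSClassifierHead(nn.Module):
    """Hypothetical classifier head operating on pre-computed embeddings."""
    def __init__(self, embed_dim: int, hidden_dim: int = 256, dropout: float = 0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),           # stronger regularisation setting
            nn.Linear(hidden_dim, 2),      # benign vs. XSS
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = XSSClassifierHead(embed_dim=768)   # e.g. GraphCodeBERT/CodeT5 hidden size
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # lr = 0.0005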
4.2. Embedding Level Analysis
To assess whether embedding similarity correlates with generalisation, we computed pairwise Jensen–Shannon divergence (JSD) [
57] and Wasserstein distances (WD) [
58] across models on both datasets. Here, $P$ and $Q$ denote the probability distributions of two embedding sets, $M = \frac{1}{2}(P + Q)$ is their mean distribution, $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback–Leibler divergence from one distribution to another, and $F_P(x)$ and $F_Q(x)$ are the corresponding cumulative distribution functions:

$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2}\,\mathrm{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathrm{KL}(Q \,\|\, M), \qquad W_1(P, Q) = \int_{-\infty}^{\infty} \bigl| F_P(x) - F_Q(x) \bigr| \, dx.$$

$\mathrm{JSD}(P \,\|\, Q)$ is a symmetric, smoothed divergence metric capturing the balanced difference between $P$ and $Q$.
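As a minimal illustration, the sketch below computes these two metrics over flattened embedding value distributions with SciPy; the histogram binning and the choice to flatten across dimensions are illustrative assumptions rather than the paper's exact procedure.

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def embedding_divergence(emb_a: np.ndarray, emb_b: np.ndarray, bins: int = 100):
    """Compare two embedding sets (n_samples x dim) via their flattened
    value distributions; returns (JSD, Wasserstein-1 distance)."""
    a, b = emb_a.ravel(), emb_b.ravel()

    # Shared histogram support so P and Q are comparable probability vectors.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p, q = p / p.sum(), q / q.sum()

    # SciPy returns the JS distance (square root of the divergence); square it.
    jsd = jensenshannon(p, q, base=2) ** 2
    wd = wasserstein_distance(a, b)
    return jsd, wd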
As shown in
Table 6, the three embedding models respond differently to structural variation. GraphCodeBERT has the lowest JSD (0.2444) but the highest WD (0.0758), suggesting its embeddings shift more sharply in space despite low average token divergence. This sensitivity leads to poor generalisation, with false positive rates exceeding 65% under OOD tests. GloVe shows the highest JSD (0.3402) and moderate WD (0.0562), indicating broader but smoother distribution changes. It performs most stably in OOD scenarios, likely due to better tolerance of structural drift. CodeT5 has the lowest WD (0.0237), meaning its embeddings change little across structure shifts. However, this low sensitivity results in degraded precision, especially for negative-class drift.
Kernel-Based Statistical Validation of OOD Divergence
While metrics like JSD and Wasserstein quantify distributional shifts, they do not assess statistical significance. To address this, we compute the Maximum Mean Discrepancy (MMD) between Dataset 1 and Dataset 2 using Random Fourier Features (RFFs) for efficiency, with 40,000 samples per set.
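A minimal sketch of an RFF-approximated MMD between two embedding sets is given below; the RBF kernel, bandwidth (gamma), and number of random features are illustrative assumptions rather than the exact configuration used in our experiments.

import numpy as np

def rff_mmd(X: np.ndarray, Y: np.ndarray, n_features: int = 2048,
            gamma: float = 1.0, seed: int = 0) -> float:
    """Approximate squared MMD between X and Y (n_samples x dim) using
    Random Fourier Features for an RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1]

    # RFF parameters: frequencies W ~ N(0, 2*gamma*I), phases b ~ U(0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def phi(Z: np.ndarray) -> np.ndarray:
        # Explicit feature map whose inner products approximate the RBF kernel.
        return np.sqrt(2.0 / n_features) * np.cos(Z @ W + b)

    # Squared MMD reduces to the distance between mean embeddings in RFF space.
    diff = phi(X).mean(axis=0) - phi(Y).mean(axis=0)
    return float(diff @ diff)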
The MMD scores between Dataset 1 and Dataset 2 are as follows. Over all samples: 0.001633 (GraphCodeBERT), 0.082517 (GloVe), and 0.118169 (CodeT5). Over positive samples only: 0.000176 (GloVe), 0.000853 (GraphCodeBERT), and 0.106470 (CodeT5). Over negative samples only: 0.004105 (GraphCodeBERT), 0.007960 (CodeT5), and 0.517704 (GloVe). For every embedding, the associated p-value is below 0.001, indicating a clearly distinct (OOD) distribution.
These results confirm a statistically significant distributional shift and a semantic OOD gap in the negative samples. The formulation is given below:

$$\mathrm{MMD}^2(X, Y) = \left\| \frac{1}{|X|} \sum_{x \in X} \phi(x) - \frac{1}{|Y|} \sum_{y \in Y} \phi(y) \right\|^2,$$

where $X$ and $Y$ denote the sets of embeddings from the two datasets and $\phi$ is the kernel feature mapping approximated via Random Fourier Features (RFFs). For the permutation test,

$$p = \frac{1}{B} \sum_{i=1}^{B} \mathbb{1}\!\left[\mathrm{MMD}_i \geq \mathrm{MMD}_{\mathrm{obs}}\right],$$

where $\mathrm{MMD}_{\mathrm{obs}}$ is the observed MMD score, $B$ represents the number of permutations, and $\mathrm{MMD}_i$ is the MMD value obtained for permutation $i$.
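A hedged sketch of this permutation test is shown below, reusing the rff_mmd helper sketched earlier; the number of permutations B shown here is an illustrative choice.

import numpy as np

def mmd_permutation_pvalue(X: np.ndarray, Y: np.ndarray, B: int = 1000,
                           seed: int = 0) -> float:
    """Permutation p-value for the observed MMD under the null hypothesis
    that X and Y are drawn from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = rff_mmd(X, Y)                 # MMD_obs
    pooled = np.vstack([X, Y])
    n = len(X)

    exceed = 0
    for _ in range(B):
        perm = rng.permutation(len(pooled))  # shuffle sample-to-set assignment
        mmd_i = rff_mmd(pooled[perm[:n]], pooled[perm[n:]])
        exceed += int(mmd_i >= observed)

    return exceed / B                        # fraction of permutations >= MMD_obs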
The unusually high negative-class MMD of GloVe largely arises from lexical-surface drift along dimensions that carry negligible classifier weight. Because the decision boundary learned during hard-negative mining lies far from benign regions along these dimensions, the model maintains a low false-positive rate under OOD settings despite the apparent distribution gap. Conversely, the contextual models display a smaller overall MMD yet place their boundary closer to benign clusters, yielding a higher FPR. This suggests that absolute MMD magnitude is not a sufficient indicator of OOD robustness; alignment between drift directions and decision-relevant subspaces is critical.
These results, supported by lexical analysis (
Section 3.2), indicate that the observed generalisation gap is attributable to systematic data divergence, particularly in negative sample distributions, rather than random fluctuations.
6. Conclusions
This study shows that federated learning (FL) can achieve privacy-preserving, OOD-resilient XSS detection provided the embedding geometry aligns well with the aggregation strategy. Analyses with JSD, Wasserstein-1, and RFF-based MMD reveal a split drift pattern: benign inputs mainly reorder words, whereas malicious inputs drift along deeper structural axes. Under centralised training, GloVe's static word-level representation benefits from stable term frequencies, while the structure-aware GraphCodeBERT is penalised, suggesting it overreacts to code edits. When data remain local and FedProx aggregates the updates, the variety of control- and data-flow graphs across clients is ensemble-smoothed, allowing GraphCodeBERT to converge the most stably and to achieve the best balance between FPR and recall.
GloVe still reaches acceptable final accuracy, but its training curve oscillates, likely because client-specific vocabularies pull its sparse lexical weights in conflicting directions. CodeT5, used solely as a frozen encoder, improves fastest during the first few rounds, yet later shows mild jitter, likely because its representations are sensitive to local structural quirks.
Overall, FL effectively averages out the lexical and distributional shifts that harm each model when trained in isolation. Future work should pair FL with embeddings that are structure-aware without being oversensitive and should design aggregation rules that adapt to each client’s drift profile instead of applying uniform weighting.