Contra-KD: A Lightweight Transformer Model for Malicious URL Detection with Contrastive Representation and Model Distillation

Lim, Zheng You; Pang, Ying Han; Jun, Edwin Chan Kah; Ooi, Shih Yin; Ling, Goh Fan

doi:10.3390/fi18030157


by Zheng You Lim ¹, Ying Han Pang ¹,²,*, Edwin Chan Kah Jun ², Shih Yin Ooi ¹,² and Goh Fan Ling ³

¹ Centre for Advanced Analytics, CoE for Artificial Intelligence, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka 75450, Malaysia

² Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, Bukit Beruang, Melaka 75450, Malaysia

³ FINEXT Sdn Bhd, B-23A-7, Vertical Business Suite Avenue 3 Bangsar South City, No 8, Jalan Kerinchi, Kuala Lumpur 59200, Malaysia

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(3), 157; https://doi.org/10.3390/fi18030157

Submission received: 26 February 2026 / Revised: 12 March 2026 / Accepted: 16 March 2026 / Published: 17 March 2026

(This article belongs to the Section Cybersecurity)


Abstract

Malicious URLs are widely regarded as a serious cybersecurity threat, serving as pathways to phishing, malware, and other attacks. Although transformer-based models have demonstrated good performance in malicious URL detection, their high computational cost and latency make them impractical for deployment in real-time or resource-constrained systems. Lightweight models built through knowledge distillation (KD) tend to be efficient but are often not discriminative enough to distinguish malicious from benign URLs with subtle lexical overlaps, particularly on imbalanced datasets. To address these issues, we propose Contra-KD, a lightweight transformer model that incorporates contrastive learning (CL) and KD. The proposed framework imposes structured embedding alignment, allowing the student model to learn more meaningful and generalizable representations. Contra-KD uses a compact 6-layer student transformer architecture based on ELECTRA to scale the parameter count down, achieving a reduction of more than 90% while maintaining high accuracy. In this scheme, CL improves feature discrimination by clustering semantically similar URLs and separating dissimilar ones. This limits confusion, especially when URLs share common lexical traits or in the presence of adversarial obfuscation. On a large-scale, publicly available Kaggle dataset of 651,191 URLs under imbalanced scenarios, the proposed Contra-KD achieves 99.05% accuracy, 99.96% ROC-AUC, and 98.18% MCC, outperforming its counterparts, including lightweight and transformer-based models. In summary, Contra-KD offers a transformer architecture that is both compact and computationally efficient while delivering stable detection performance.

Keywords:

transformer models; contrastive learning; ELECTRA; knowledge distillation; lightweight; malicious URL detection


1. Introduction

The widespread adoption of the internet has made communication, trade, and the spread of information much easier. Nevertheless, it has also provided fertile ground for cyberattacks. Among these, malicious URLs remain one of the most common threats and act as gateways for phishing, malware distribution, and website defacement attacks [1,2]. Conventional defense approaches, such as blocklists and heuristic rules, are unable to keep up with fast-changing attacker tactics [3]. As a result, deep-learning methods have been increasingly adopted for malicious URL detection, especially transformer-based models that use contextual representations to attain high detection rates [4,5].

Although transformer-based architectures, such as BERT and ELECTRA, have demonstrated strong performance in malicious URL detection, their substantial memory requirements and slow inference make them impractical for real-time detection or deployment on resource-constrained edge computing devices [6,7]. Knowledge distillation (KD) has been introduced as a solution: the knowledge of a large teacher model is transferred to a smaller student model to improve computational efficiency [5,8]. Nonetheless, most existing KD-based methods rely predominantly on classification loss or soft-target guidance, which could limit the models’ ability to learn structured and discriminative feature representations. Consequently, these KD-based approaches may struggle to separate malicious and benign URLs that share statistical language patterns, particularly in noisy or imbalanced cases [2,7]. Meanwhile, contrastive learning (CL) has shown great promise in enforcing stricter embedding alignment, although it is currently underutilized for malicious URL detection.

To mitigate these limitations, we propose a lightweight transformer, named Contra-KD, that integrates knowledge distillation and contrastive learning to improve computational efficiency and the quality of the learned representations. Contra-KD discriminates effectively between malicious and benign URLs, even under imbalanced or adversarial conditions. It attains this by leveraging a teacher-student paradigm enhanced with structured embedding alignment.

The contributions of this work are summarized as follows:

Integration of contrastive learning and knowledge distillation in a unified framework: The combination enhances both computational efficiency and discriminative feature learning in a lightweight transformer model.
Design of Contra-KD, an ultra-compact transformer: A 6-layer, 8.8 M parameter model designed for malicious URL detection in resource-constrained environments.
Development of a hybrid loss function: A composite loss that combines supervised classification loss, Kullback–Leibler-divergence-based distillation, and contrastive loss to enhance model generalization under class imbalance and adversarial obfuscation.

2. Related Works

Classic malicious URL detection systems relied primarily on blocklists and heuristic rule-based approaches. Though computationally efficient, these methods can hardly cope with fast-evolving obfuscation tactics such as character substitution, smart redirection, and other stealthy tricks [9]. To overcome these limitations, machine learning methods such as XGBoost, decision trees, and ensemble classifiers have been proposed. These approaches rely on handcrafted lexical and host-based features and perform well on balanced datasets. Nevertheless, they are vulnerable to adversarial engineering and hidden feature correlations [10,11].

With the widespread use of deep learning, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been introduced to analyze and identify malicious URLs. Compared with classical machine learning approaches, deep-learning models eliminate the need for manual feature engineering and instead derive significant attributes from URL data automatically. Sarkhi et al. proposed lightweight CNNs to detect malicious URL attacks [12]. Transformer-based architectures have recently become the state of the art in malicious URL detection due to their capacity to capture long-range and contextual dependencies in URL strings. These models, e.g., BERT, DistilBERT, and ELECTRA, excel at extracting sequential and contextual dependencies in URLs and achieve promising performance in detecting malicious URLs [13,14]. Rao et al. proposed a hybrid super learner ensemble model for phishing detection on mobile devices [15]. The model, named Phish-Jam, aggregates predictions from various machine learning algorithms to distinguish legitimate from phishing websites.

Zaimi et al. introduced BERT-PhishFinder, which uses an optimized DistilBERT to achieve strong results in identifying phishing URLs [16]. Likewise, Zhang et al. and Zheng et al. proposed detectors based on modular lexical and structural webpage data, including ConvBERT-LMS and BiLSTM-based hybrids, to further improve detection accuracy [17,18]. However, these models are generally large and computationally intensive, making them impractical for edge or real-time deployment.

Knowledge distillation (KD) is one of the central model compression strategies for alleviating the inefficiency of large transformers [19]. Early works such as DistilBERT and TinyBERT demonstrated that student models can reach accuracy comparable to that of their teacher models with fewer parameters and lower computational cost [20,21]. More recent developments introduced task-specific and adaptive KD algorithms, e.g., layer-wise adaptive distillation and autocorrelation matrix distillation, which enhance transformer compression [22,23]. In cybersecurity applications, KD has been employed to develop lightweight models for malicious URL detection, malware identification, and intrusion detection [24]. Its contribution is crucial, especially in resource-constrained environments. Nevertheless, existing KD methods depend predominantly on heuristics to design pretext tasks, which could restrict the model’s ability to learn generalizable representations [25].

On the other hand, contrastive learning (CL) has proven effective in enhancing representation learning by promoting similarity among positive pairs and dissimilarity among negative pairs in the embedding space. Çağatan and Gao et al. introduced key techniques such as SimCLR and SimCSE, which show how contrastive objectives can yield strong sentence and visual representations [26,27]. Knowledge transfer has recently been enhanced by using CL together with KD. For instance, Bao et al. proposed contrast-enhanced (representation)-normalization to improve student model embeddings during distillation [28]. Such hybrid strategies have shown better generalization in NLP and vision tasks, yet there is less work on their use in cybersecurity and URL detection.

Advances in efficient transformer architectures enable models to be deployed in real-world applications where computational resources are limited. Surveys highlight pruning, quantization, and efficient attention mechanisms as the major strategies for reducing transformer complexity [29]. Models such as MobileBERT and Reformer provide compact architectures and memory-efficient attention mechanisms [30,31]. These developments underline the potential of integrating KD and CL within a transformer design to detect malicious URLs.

In summary, transformer-based models have substantially advanced malicious URL detection. Nonetheless, their high computational cost and vulnerability to adversarial obfuscation impede real-world deployment. KD is one route to lightweight models, yet it tends to overlook embedding-level discriminative learning. CL, in contrast, offers powerful representation learning, but its integration with KD for malicious URL detection is still underexplored. This study therefore presents a lightweight transformer model, Contra-KD, which combines KD and CL for efficient malicious URL detection.

3. Methodology

This study presents a lightweight transformer-based approach that integrates contrastive learning with knowledge distillation for malicious URL detection. Figure 1 gives an overview of the proposed Contra-KD framework. The framework consists of five main phases: data collection, data preprocessing, model development, ablation study, and system evaluation. Following data collection and exploratory data analysis, an imbalanced class subset was constructed. Next, the data samples were preprocessed to prepare them for the subsequent steps. Model development incorporates knowledge distillation and contrastive loss in a transformer-based model. An ablation study was performed on contrastive learning to evaluate the effects of the contrastive components. Finally, the model is trained and then evaluated on unseen data samples for performance assessment. Each phase is discussed in detail in the following sections.

3.1. Dataset

In this work, a publicly available dataset, named Malicious URL dataset and sourced from Kaggle, is used [32]. The dataset consists of 651,191 URLs grouped into four classes: 428,103 benign, 96,547 defacement, 94,111 phishing, and 32,520 malware URLs. From it, a subset of about 40,000 URLs that preserves the original class distribution was constructed: 26,263 benign URLs, 5941 defacement URLs, 5806 phishing URLs, and 1989 malware URLs. This imbalance is expected; legitimate web traffic vastly outnumbers malicious activity, and among malicious types, defacement (often automated) may be more common than targeted phishing or malware distribution. The small proportion of malware URLs suggests they are either harder to detect or less frequently captured in this dataset.

The subset was then split into training (80%), validation (10%), and testing (10%) sets.
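The 80/10/10 split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `split_indices` is hypothetical, the split is a plain shuffled one (the text does not say whether it was stratified), and reusing the seed value 42 for the split is an assumption.

```python
import random

def split_indices(n_samples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle indices 0..n_samples-1 and cut them into train/val/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local RNG keeps the shuffle reproducible
    n_train = int(n_samples * train_frac)
    n_val = int(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # the remainder goes to the test set
    return train, val, test

# The four class counts of the subset sum to 39,999 URLs.
train_idx, val_idx, test_idx = split_indices(39_999)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the test set takes the remainder, the three parts always cover every sample exactly once, whatever the rounding of the fractions.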

3.2. Dataset Comprehensive Analysis

This dataset is highly imbalanced, with benign URLs comprising 65.66% of the data and malware URLs making up less than 5%. Inspecting the label counts shows a significant class imbalance:

Benign URLs are the most populous, comprising around 66% of the samples.
Defacement URLs follow at about 15%.
Phishing URLs account for almost 15%.
Malware URLs are the rarest, at about 5%.
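As a quick check, the class shares can be recomputed from the subset counts reported in Section 3.1. The snippet below is only an arithmetic sketch; the dictionary and function names are illustrative.

```python
# Class counts of the subset, as listed in Section 3.1.
subset_counts = {
    "benign": 26_263,
    "defacement": 5_941,
    "phishing": 5_806,
    "malware": 1_989,
}

def class_shares(counts):
    """Return each class's share of the total, in percent."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}

shares = class_shares(subset_counts)
for label, pct in shares.items():
    print(f"{label:<11}{pct:6.2f}%")
```

The benign share computed this way matches the 65.66% quoted in the text, and the malware share stays just below 5%.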

The length of a URL (number of characters) can be an indicative feature. For this dataset:

Phishing URLs have the shortest median length at 35 characters. This implies that phishing URLs are often quite short, perhaps in an effort to disguise them as harmless or to meet display constraints (e.g., emails, SMS). Link shorteners can also create short URLs and are commonly abused by phishing campaigns.
Defacement URLs have the longest median length (81 characters) of all classes. Website defacement attacks often target specific pages or inject malicious content, and the longer URLs may contain extensive query parameters or random strings to obscure the attack or to target deep links in a site.
The median malware URL length (49 characters) and the median benign URL length (46 characters) are very similar, both in the mid-to-upper 40s. This proximity suggests that typical malware-hosting URLs cannot be distinguished from benign URLs by length, and therefore URL length alone is a weak discriminator between these classes.
Benign URLs, which have a median length of 46 characters, include typical web addresses with directory structures and parameters but are generally shorter than defacement URLs.

The data show a clear ordering: phishing URLs are the shortest, followed by benign and malware URLs (which are closely grouped), and defacement URLs are the longest. This mirrors the strategy of each class: phishing URLs aim to look simple, malware URLs hide in plain sight, and defacement URLs can afford long paths. That said, these are only median values; a boxplot also displays variance and outliers, which could provide further insight for detection strategies. The distribution of URL length by type is shown as a boxplot in Figure 2.

In addition to length, we also extracted several hand-crafted features to capture structural anomalies:

Number of dots (.): Benign URLs typically contain 3–4 dots, whereas phishing and malware URLs usually include many subdomains or IP-like strings, leading to higher dot counts of 6–8.
Presence of an IP address: Around 12% of phishing URLs include a raw IP address (for example, http://192.168.1.1/…), a clear hint of deception. Fewer than 1% of benign URLs do.
Number of slashes (/): Malicious URLs contain more embedded path segments; phishing URLs have the highest slash count (median 7) compared with benign URLs (median 4).
Use of HTTPS: While most benign sites now use HTTPS, a notable 23% of phishing URLs also use HTTPS to fool users. Defacement URLs tend to stick to HTTP.
Suspicious TLDs: Specific top-level domains (for example, .tk, .ml, .ga, .cf) are over-represented among phishing and malware URLs. A little over 8% of the malicious URLs use such free TLDs, compared with 0.3% of benign ones.
Use of special characters: Special characters such as @, -, or = are commonly used in malicious URLs. Phishing URLs frequently place an @ in the URL to disguise the real domain, and malware URLs use encoded parameters (%3D, %2F).
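The hand-crafted features listed above could be extracted along the following lines. This is a hedged sketch using only the Python standard library; the function name `lexical_features`, the feature keys, and the exact set of special characters are assumptions made for illustration, not the authors' implementation.

```python
import ipaddress
import re
from urllib.parse import urlsplit

SUSPICIOUS_TLDS = {"tk", "ml", "ga", "cf"}  # the free TLDs named in the text
SPECIAL_CHARS = "@-=%"                      # illustrative choice of characters
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")

def host_is_ip(host):
    """True when the host part is a literal IPv4/IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False

def lexical_features(url):
    # URLs without a scheme get a "//" prefix so urlsplit can locate the host.
    parts = urlsplit(url if _SCHEME.match(url) else "//" + url)
    host = parts.hostname or ""
    tld = host.rsplit(".", 1)[-1] if "." in host else ""
    return {
        "url_len": len(url),
        "n_dots": url.count("."),
        "n_slashes": url.count("/"),
        "has_ip": host_is_ip(host),
        "uses_https": parts.scheme == "https",
        "suspicious_tld": tld in SUSPICIOUS_TLDS,
        "n_special": sum(url.count(c) for c in SPECIAL_CHARS),
    }

print(lexical_features("http://192.168.1.1/login"))
print(lexical_features("secure-login.example.tk/a=b"))
```

Parsing the host with `urlsplit` rather than a regular expression also handles the `@` trick, since everything before the `@` is treated as user information and not as the host.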

Top level domain (TLD) analysis over classes:

Benign: Dominated by .com (52%), .org (12%), .net (8%), and country-code TLDs such as .ca and .uk (10%).
Defacement: Similar to benign, but with a larger share of .com (60%) and occasional country-code TLDs such as .br, reflecting the sites that are commonly targeted (popular CMS platforms).
Phishing: A wider range of TLDs, including many free or less common ones (.tk, .ml, .ga, .xyz). In addition, many fake URLs resemble a real brand because they embed it in subdomains (e.g., paypal.com.secure-login.xyz).
Malware: Often served on .com, .net, and .org, but also on .info, .biz, and .ru. Malware URLs frequently include domain names that are intentionally misspelled or concatenated (e.g., update-account-amazon.com).

A closer look at these results shows clear distinctions between the four URL types. Length, dot count, slash count, presence of IP addresses, and TLD already separate the classes reasonably well. More advanced features (n-grams, word embeddings) would probably improve detection further. The dataset is appropriate for training machine learning models; however, the class imbalance should be addressed with a resampling or cost-sensitive approach. These observations lay the groundwork for building a phishing/malware detection system.

In addition, this dataset has been widely used in academic work to benchmark and evaluate phishing detection systems. Notable examples include “Detecting Malicious URLs Using Lexical Analysis” [33] and “Detecting Phishing with Streaming Analytics” [34], which were trained on this same dataset, suggesting that it is a representative sample for malicious URL classification. Its use in such publications supports its suitability for building detection algorithms, because it contains a realistic blend of benign and malicious samples covering several attack types (malware, phishing, and defacement). Using the same dataset as previous work means that our findings can be directly compared with prior results and that our models are assessed on a well-defined, real-world corpus.

3.3. Data Preprocessing

In this phase, several preprocessing steps were taken to clean the dataset and transform it into input for the learning model:

Handling missing values: Rows with null or blank values in any column were deleted, so that only complete data is used for training.
Feature encoding: A new numerical feature (“url-len”) holding the length of each URL was added. This feature was chosen because URL length is a well-known cue on which benign and malicious URLs (especially phishing or malware URLs) tend to differ [7].
Converting categorical labels to numerical labels: The URL class labels were converted into numerical form to make them compatible with the output layer of the model.
Tokenization: All URLs were tokenized into sub-words using a pretrained tokenizer. This step breaks long URL strings into manageable pieces so that the model can efficiently learn structural and lexical patterns.
Dynamic padding: For each training batch, dynamic padding was applied to handle variable-length sequences. Each batch was padded to the length of its longest sequence, preserving the sequence structure and reducing computational overhead.
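The dynamic padding step can be illustrated with a small, library-free sketch. The function `pad_batch` and the padding id 0 are assumptions for illustration; tokenizer libraries typically ship their own collators for this purpose (for example, `DataCollatorWithPadding` in Hugging Face Transformers), and the text does not state which one was used.

```python
def pad_batch(batch, pad_id=0):
    """Pad each token-id list to the longest sequence in this batch only."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in batch]
    # The mask marks real tokens with 1 and padding positions with 0.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return input_ids, attention_mask

# Two already-tokenized URLs of different lengths (token ids are made up).
batch = [[101, 7, 8, 102], [101, 9, 102]]
ids, mask = pad_batch(batch)
print(ids)
print(mask)
```

Padding per batch instead of to a global maximum (such as 512) means short URLs are not padded out to the full length, which is where the reduction in computation comes from.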

3.4. Model Development and Validation

In this phase, Contra-KD was designed and developed as a lightweight transformer-based architecture for malicious URL detection. Unlike conventional deep-learning models, which are usually computationally intensive, the proposed model combines knowledge distillation with contrastive representation learning to capture discriminative features. Furthermore, the model draws on the generalization ability of a large-scale teacher model for reliable detection. Figure 3 illustrates the overall architecture of the proposed Contra-KD.

The proposed Contra-KD incorporates a compact 6-layer transformer encoder equipped with four attention heads, yielding approximately 8.8 million parameters. Input URLs are first tokenized using the HuggingFace AutoTokenizer and truncated to a sequence length of 512. Next, the tokens are processed by the transformer encoder to generate contextualized representations. In this study, the ELECTRA model was chosen as the teacher network due to its capacity to capture subtle token-level dependencies in URLs. Knowledge is transferred from this teacher network (about 109 M parameters) to Contra-KD (approximately 8.8 M parameters). Through this knowledge transfer, Contra-KD serves as a lightweight student model.

The experiments were conducted on Google Colab, a cloud-based environment well suited to machine learning tasks. The GPU used was an NVIDIA A100 with 40 GB of memory. The random seed used in training was 42, and the CUDA deterministic flag was set to True.

In the proposed architecture, AdamW was employed as the optimizer because it improves the model’s generalization by applying decoupled weight decay during optimization. Specifically, it mitigates overfitting by penalizing large weights without interfering with the momentum updates. The learning rate was set to 2 × 10⁻⁵, and the batch size was fixed at 16. The model was trained for 15 epochs. The hyperparameters are summarized in Table 1.

To fully leverage the supervised class labels and the guidance of the teacher model, Contra-KD combines several loss components. Cross-entropy (CE) loss is adopted as the objective for accurate classification of the input URLs against the ground-truth labels (benign, defacement, malware, and phishing). Additionally, a self-supervised contrastive loss groups similar URL embeddings and separates dissimilar ones to capture meaningful discriminative feature representations. Kullback–Leibler (KL) divergence is adopted as the knowledge distillation loss to transfer the soft probability distributions (computed from the logits) of the teacher model to the compact student model. With this, the student model achieves better generalization and greater computational efficiency with fewer parameters. The total loss is formulated as a weighted combination:

\mathcal{L} = \alpha \cdot \mathrm{KL}\left(\frac{P_{s}}{T_{KL}} \,\middle\|\, \frac{P_{T}}{T_{KL}}\right) \cdot T_{KL}^{2} + (1 - \alpha) \cdot \mathrm{CE}(\gamma, P_{s}) + \beta \cdot \mathcal{L}_{\mathrm{contrastive}}

(1)

where α = 0.5 balances distillation and supervised learning, T_KL = 0.1 is the distillation temperature applied to the teacher and student logits, β = 0.5 weights the contrastive loss, which uses a temperature τ = 0.5, P_T and P_s are the teacher and student distributions, and γ is the ground-truth label.

Below is the detailed pseudocode of the training of the proposed Contra-KD (Algorithm 1):

Algorithm 1: Contra-KD Training Pseudocode

Input:

• Training set of URLs with ground-truth labels
• Pre-trained teacher model T (frozen)
• Student model S (6-layer ELECTRA-small that also returns embeddings)
• Hyperparameters:

○ $\alpha \in [0,1]$: distillation weight
○ $\beta$: contrastive loss weight
○ $\tau$: temperature for contrastive similarity
○ $T_{KL}$: temperature for distillation
○ learning rate $\eta$, batch size $B$, number of epochs $E$

Output: Trained student model S

Step 1: Data Preparation
1.1 Remove samples with missing values.
1.2 Add feature url_len (length of each URL).
1.3 Encode labels: benign → 0, defacement → 1, malware → 2, phishing → 3.
1.4 Tokenize all URLs using a pre-trained tokenizer (max length 512, dynamic padding).
1.5 Split into training (80%), validation (10%), and test (10%) sets.

Step 2: Initialization
2.1 Initialize student model S with random weights.
2.2 Freeze teacher model T and set it to evaluation mode.

Step 3: Training Loop
For each epoch e = 1 to E:

Shuffle the training data.
For each mini-batch of size $B$:
a. Forward pass through student:
$\mathrm{logits}_{s}, \mathrm{embeddings}_{s} = S(\mathrm{batch})$
b. Forward pass through teacher (no gradient):
$\mathrm{logits}_{t} = T(\mathrm{batch})$
c. Classification loss (cross-entropy):
$L_{ce} = \mathrm{CrossEntropy}(\mathrm{logits}_{s}, \mathrm{labels})$
d. Distillation loss (KL divergence with temperature):
$p_{t} = \mathrm{softmax}\left(\frac{\mathrm{logits}_{t}}{T_{KL}}\right)$
$p_{s} = \mathrm{log\_softmax}\left(\frac{\mathrm{logits}_{s}}{T_{KL}}\right)$
$L_{kd} = T_{KL}^{2} \cdot \mathrm{KLDiv}(p_{s}, p_{t})$
e. Contrastive loss on student embeddings:
$\hat{e} = \frac{\mathrm{embeddings}_{s}}{\| \mathrm{embeddings}_{s} \|_{2}}$ (L2-normalize)
$\mathrm{sim}_{ij} = \frac{\hat{e}_{i} \cdot \hat{e}_{j}}{\tau}$ (cosine similarity scaled by temperature)
$L_{cl} = \mathrm{CrossEntropy}(\mathrm{sim}, \mathrm{indices})$
f. Total loss:
$L_{total} = \alpha \cdot L_{kd} + (1 - \alpha) \cdot L_{ce} + \beta \cdot L_{cl}$
g. Backward pass: compute gradients of $L_{total}$ w.r.t. $S$.
h. Update student using AdamW optimizer with learning rate $\eta$.
After each epoch, evaluate on validation set and keep the best model.

Step 4: Return trained student model S.
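Steps c to f of Algorithm 1 can be traced numerically with the following dependency-free sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, losses are averaged over the batch, and the KL term follows the common KLDiv convention in which the student's log-probabilities are the input and the teacher's probabilities are the target, i.e. it evaluates KL(p_t ‖ p_s). The default values are the ones reported with Equation (1).

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def cross_entropy(logits, target):
    return -log_softmax(logits)[target]

def kd_loss(student_logits, teacher_logits, t_kl):
    """Step d: T_KL^2 * KL(p_t || p_s) on temperature-scaled logits (assumed direction)."""
    log_ps = log_softmax(student_logits, t_kl)
    log_pt = log_softmax(teacher_logits, t_kl)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_pt, log_ps))
    return t_kl ** 2 * kl

def contrastive_loss(embeddings, tau):
    """Step e: cross-entropy over temperature-scaled cosine similarities."""
    normed = []
    for e in embeddings:
        norm = math.sqrt(sum(x * x for x in e)) or 1.0
        normed.append([x / norm for x in e])
    losses = []
    for i, e_i in enumerate(normed):
        sims = [sum(a * b for a, b in zip(e_i, e_j)) / tau for e_j in normed]
        losses.append(cross_entropy(sims, i))  # target index i, as in Algorithm 1
    return sum(losses) / len(losses)

def total_loss(student_logits, teacher_logits, labels, embeddings,
               alpha=0.5, beta=0.5, t_kl=0.1, tau=0.5):
    """Step f: alpha * L_kd + (1 - alpha) * L_ce + beta * L_cl."""
    n = len(labels)
    l_ce = sum(cross_entropy(s, y) for s, y in zip(student_logits, labels)) / n
    l_kd = sum(kd_loss(s, t, t_kl) for s, t in zip(student_logits, teacher_logits)) / n
    l_cl = contrastive_loss(embeddings, tau)
    return alpha * l_kd + (1 - alpha) * l_ce + beta * l_cl

# Made-up logits and embeddings for a batch of two URLs and four classes.
student = [[2.0, 0.5, 0.1, 0.3], [0.2, 1.5, 0.3, 0.1]]
teacher = [[3.0, 0.2, 0.1, 0.1], [0.1, 2.5, 0.2, 0.1]]
print(total_loss(student, teacher, labels=[0, 1], embeddings=[[1.0, 0.2], [0.1, 1.0]]))
```

In a real training loop these quantities would be computed on tensors by an autodiff framework so that gradients can flow back to the student; the scalar version here only makes the arithmetic of the three terms explicit.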

3.5. System Evaluation

Several metrics are used to measure the performance of Contra-KD, including accuracy, precision, recall, F1-score, Area Under the Receiver Operating Characteristic Curve (ROC-AUC), and Matthews Correlation Coefficient (MCC). Accuracy measures the proportion of URLs, malicious or benign, that the model classifies correctly. Precision is the proportion of URLs predicted as malicious that are truly malicious, and recall is the proportion of malicious URLs in the dataset that are correctly identified. F1-score balances precision and recall in a single measure. ROC-AUC estimates how well the model distinguishes between the URL categories and is computed as the average area under the ROC curve over all classes. Lastly, MCC is used to examine how reliable the classification is, taking into account true and false positives as well as true and false negatives. The metrics are calculated as follows:

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(2)

\mathrm{Precision} = \frac{TP}{TP + FP}

(3)

\mathrm{Recall} = \frac{TP}{TP + FN}

(4)

\mathrm{F1\text{-}Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(5)

\mathrm{ROC\text{-}AUC} = \frac{1}{K} \sum_{k = 1}^{K} \mathrm{AUC}_{k}

(6)

\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

(7)
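Equations (2) to (5) and (7) can be evaluated directly from confusion-matrix counts, as in the sketch below. The helper names are illustrative, and applying the binary formulas one-vs-rest to a multi-class confusion matrix is an assumption; the text does not state how the multi-class scores were aggregated. The example matrix is made up and is not the one in Figure 8.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and MCC from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

def one_vs_rest_counts(confusion, k):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(row[k] for row in confusion) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp
    return tp, tn, fp, fn

# Hypothetical 3-class confusion matrix, for illustration only.
confusion = [[50, 2, 1],
             [3, 40, 2],
             [0, 4, 30]]
for k in range(len(confusion)):
    print(k, binary_metrics(*one_vs_rest_counts(confusion, k)))
```

MCC uses all four cells of the confusion matrix, which is why it is often preferred over accuracy when the classes are imbalanced.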

4. Results and Discussion

A thorough evaluation of the proposed Contra-KD model was performed in this section. First, we conducted an ablation study to investigate the sensitivity of our model toward its hyperparameters (contrastive loss weight and temperature coefficient). In addition, we evaluated how the embedding layer affects the performance of our model. The performance of the model was evaluated based on several performance measures. These metrics are accuracy, precision, recall, F1-score, ROC-AUC, and MCC. Finally, the performance of Contra-KD was compared with other machine learning and deep-learning methods.

4.1. Ablation Study

We conducted ablation experiments on the dataset to evaluate the contribution of each component of Contra-KD. The experiments focus on the effects of hyperparameters and architectural decisions in the context of contrastive learning and knowledge distillation.

4.1.1. Contrastive Loss Weight

This experiment evaluated the performance of the model while varying the contrastive loss weight α. A range of α values was investigated to identify its influence on the optimization process: 0.05, 0.1, 0.15, 0.2, 0.5, 1.0, and 3.0. The accuracy of Contra-KD across these α values is illustrated in Figure 4.

Based on the findings, the best performance is attained at α = 0.5, with an accuracy of 98.4%. When α was set too low, performance deteriorated. This could be because the contrastive signal was inadequate, which hampers the structuring of an effective embedding space and yields sub-optimal separation between benign and malicious URLs. On the other hand, excessively high weights (α ≥ 1.0) also degrade the model’s performance. The optimum at α = 0.5 indicates a good trade-off between imitating the teacher (distillation) and learning from the ground truth (classification). This is consistent with the theoretical principle that student models learn from both the teacher’s softened distributions (which encapsulate inter-class relationships) and the hard labels (which provide crisp decision boundaries). Hence, a balanced contrastive loss weight is important for shaping an effective embedding space and improving the model’s discrimination capability.

4.1.2. Temperature Coefficient

In contrastive learning, the temperature coefficient, τ, regulates the sharpness of the similarity distribution. In this study, experiments were conducted with τ values of {0.01, 0.03, 0.05, 0.1, 0.3, 0.5, 1.0}. Figure 5 depicts the accuracy of Contra-KD at the different contrastive temperatures. The results show that performance improved consistently as τ increased from 0.01 to 0.5, attaining the highest accuracy of 98.6% at τ = 0.5. Beyond this threshold, however, further increases in τ reduced performance. We can deduce that an overly smoothed similarity distribution may shrink the discriminative power of the contrastive signal. Thus, τ = 0.5 was selected as the optimal setting in Contra-KD for subsequent experiments. This value corresponds to an intermediate level of sharpening of the similarity distribution. Higher temperatures (τ > 0.5) generate overly smooth distributions, which weaken the contrastive signal. The best value, τ = 0.5, strikes the balance described in InfoNCE theory [35], maximizing gradient magnitude for informative pairs.
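The effect of τ on the sharpness of the similarity distribution can be seen in a small numerical sketch. The similarity values below are made up for illustration and the function name is hypothetical; the snippet only shows the general behaviour of a temperature-scaled softmax, not results from this study.

```python
import math

def softmax_with_temperature(similarities, tau):
    """Softmax of similarities / tau, computed in a numerically stable way."""
    scaled = [s / tau for s in similarities]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative cosine similarities between one anchor and three candidates.
cosine_sims = [0.9, 0.7, 0.1]
for tau in (0.01, 0.5, 1.0):
    probs = softmax_with_temperature(cosine_sims, tau)
    print(tau, [round(p, 3) for p in probs])
```

A small τ concentrates almost all probability on the most similar candidate, while a large τ flattens the distribution so that similar and dissimilar candidates are weighted more evenly.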

4.1.3. Embedding Layer

A further ablation study examined the impact of applying the contrastive loss at different embedding layers within Contra-KD. Three configurations were tested: (i) applying the contrastive loss to the third layer, (ii) combining the 3rd, 4th, and 5th layers, and (iii) applying it to the final layer. Table 2 tabulates the results of the ablation study.

The results show that applying the contrastive loss at the final layer performs best, reaching ~98.6% across all performance measures. This indicates that the richer semantic abstraction represented by higher layers provides more discriminative reference points for contrastive learning.

4.1.4. Training Performance

The training performance of the proposed model is assessed through the training loss over the epochs, shown in Figure 6a, and the validation loss over the epochs, shown in Figure 6b.

The proposed model shows a stable and effective learning process over the 15 training epochs. The training loss falls gradually from around 0.182 in the first epoch to around 0.019 by epoch 14, suggesting that the model fits the training data progressively better. Although the loss rises slightly to almost 0.029 in the last epoch, overall convergence is strong. The validation loss trends downward in a similar fashion, dropping from 0.305 at epoch 1 to 0.060 at epoch 15, indicating good generalization to unseen data. The validation loss fluctuates slightly around epochs 6 and 9, which is expected given the stochastic training dynamics of deep-learning optimization. Finally, the training and validation losses remain close to each other throughout training, which indicates limited overfitting. Overall, these results verify the stable convergence of the proposed model and its effective learning and generalization across the experimental settings.

4.2. Model Performance

This subsection evaluates the classification performance of Contra-KD; the results are shown in Figure 7.

According to the results, Contra-KD achieves approximately 99% accuracy, precision, recall, and F1-score, which shows that it separates malicious and benign URLs well. The ROC-AUC of 0.9996 indicates strong discriminative power between benign and malicious URLs. In addition, the MCC of 0.9819 reinforces this conclusion, since a high MCC indicates stable and robust performance on imbalanced data. The confusion matrix is illustrated in Figure 8, and the class-wise classification performance is tabulated in Table 3.

The confusion matrix in Figure 8 shows that Contra-KD has a low misclassification rate in every category. In the benign class, only 11 of the 2606 samples are misclassified, while the remaining 2595 are classified correctly. The other classes show misclassification rates of 0.16%, 2.89%, and 3.43%; the latter two, which belong to the malware and phishing classes, respectively, are moderately higher.

Error analysis of the minority classes shows that the 2.89% misclassification rate of the malware class is largely due to the sophisticated obfuscation used in malicious URLs. Manual inspection confirmed that these URLs often rely on techniques such as character substitution (e.g., “m” → “rn”), hexadecimal encoding, and multi-hop redirects to mask their actual purpose, which makes detection harder. The phishing class is misclassified at a rate of 3.43%. Phishing URLs tend to mimic legitimate websites by typosquatting on domain names and including brand names in subdomains (e.g., paypal.security-update.) and TLDs (https://www.secureworks.com), and by imitating words with common grammatical structures. Higher error rates in these minority classes are expected for three reasons: far fewer training samples are available (4.99% of the dataset for malware and 14.45% for phishing), the URLs evolve continuously through adversarial changes intended to evade detection, and they share lexical characteristics with legitimate URLs (e.g., frequent use of terms such as “login” or “secure”). To address these problems, future work could explore targeted data augmentation for the minority classes, ensembles of per-class expert models, and adversarial training.

These results demonstrate the model’s strong ability to discriminate between attack types. Moreover, the class-wise scores in Table 3 show that the proposed method achieves consistently high precision, recall, and F1-scores for all classes, which confirms the effectiveness of Contra-KD in capturing the fine-grained properties of different URL types.
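For readers who want to recompute the MCC reported in this subsection from a confusion matrix, the sketch below uses the standard multi-class generalization of the Matthews correlation coefficient. The function name and the toy matrices are hypothetical; the actual counts are those shown in Figure 8 and are not reproduced here.

```python
import math

def multiclass_mcc(cm):
    """MCC for a K x K confusion matrix (rows = true class, columns = predicted class)."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    true_counts = [sum(cm[i]) for i in range(k)]                      # row sums
    pred_counts = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # column sums
    numerator = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # By convention, return 0 when the denominator vanishes (e.g., one class only).
    return numerator / denominator if denominator else 0.0

# Hypothetical matrices for illustration only.
toy_cm = [[5, 1], [2, 4]]
perfect_cm = [[3, 0, 0], [0, 2, 0], [0, 0, 4]]
print(round(multiclass_mcc(toy_cm), 4), multiclass_mcc(perfect_cm))
```

Unlike accuracy, this score uses the row and column totals of every class, which is why it is a common choice for imbalanced datasets.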

4.3. Performance Comparison with Existing Approaches

The floating-point operations (FLOPs) per forward pass were further calculated to evaluate the computational efficiency of the models:

ELECTRA teacher: 32.5 GFLOPs per forward pass
Contra-KD (proposed): 2.8 GFLOPs per forward pass
KD-ELECTRA-Small: 4.3 GFLOPs per forward pass

The ELECTRA teacher model requires around 32.5 GFLOPs per forward pass, which reflects its substantially higher computational cost. In contrast, Contra-KD requires only 2.8 GFLOPs per forward pass, roughly 11.6× fewer operations than the teacher model. This substantial decrease in computational cost demonstrates that the proposed method yields a compact model that preserves the knowledge learned from the teacher network. Moreover, even compared with KD-ELECTRA-Small, which requires 4.3 GFLOPs, Contra-KD retains a lighter computational footprint, with approximately 1.5× fewer operations. The results indicate a good trade-off between efficiency and detection performance, potentially allowing the Contra-KD framework to be used in resource-limited environments.
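The reduction factors follow directly from the three per-pass figures listed above. A minimal sketch of the arithmetic, in which the dictionary keys and function name are illustrative only:

```python
# Per-forward-pass cost in GFLOPs, as reported in this subsection.
gflops = {
    "ELECTRA teacher": 32.5,
    "Contra-KD": 2.8,
    "KD-ELECTRA-Small": 4.3,
}

def reduction_factor(baseline: float, candidate: float) -> float:
    """How many times fewer operations the candidate needs than the baseline."""
    return baseline / candidate

vs_teacher = reduction_factor(gflops["ELECTRA teacher"], gflops["Contra-KD"])
vs_small = reduction_factor(gflops["KD-ELECTRA-Small"], gflops["Contra-KD"])

print(f"Contra-KD vs. ELECTRA teacher: {vs_teacher:.1f}x fewer FLOPs")
print(f"Contra-KD vs. KD-ELECTRA-Small: {vs_small:.1f}x fewer FLOPs")
```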

Next, a performance comparison is presented. We compare the proposed Contra-KD against a range of existing models, including traditional ML, DL, and EL models as well as transformer-based and lightweight models. All models are evaluated on the same dataset for a fair comparison. Table 4 lists the performance scores (accuracy, precision, recall, and F1-score) together with the parameter count and training time of the models.

Contra-KD performs substantially better than traditional machine learning models such as kNN (0.8896 accuracy) and Decision Tree (0.90 accuracy). Although the ensemble model XGBoost achieves promising performance with 0.93 on all metrics, it remains below Contra-KD’s 0.9905. Among the other transformer-based models, DistilBERT attains a promising accuracy of 0.9604. Nevertheless, Contra-KD achieves a higher recall and F1-score. The higher recall is crucial in the security domain, because the 6.25-percentage-point recall gap (0.9905 versus 0.9280) could leave thousands of malicious URLs undetected and, in turn, amplify the spread of harmful content and spam.

Even though ELECTRA achieves comparable accuracy, recall, and F1-score, its large size (109.5 million parameters) and long training time (5309 s) make it impractical to deploy in real-world applications. In contrast, the proposed Contra-KD achieves similar detection performance with roughly 12 times fewer parameters (8.81 million, a reduction of about 92%) and about a quarter of the training time (1353 s). Compared with the compact KD-ELECTRA-Small model, Contra-KD has approximately 35% fewer parameters, although Table 4 lists a shorter training time for KD-ELECTRA-Small (918 s versus 1353 s). This efficiency matters in resource-constrained environments, where memory and computation are limiting factors. Table 5 summarizes the strengths and weaknesses of each state-of-the-art model.
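The size and training-time comparisons can be recomputed from the Table 4 entries. The sketch below does so; the `models` dictionary and the `compare` helper are illustrative names, and the training times are assumed to be in seconds, as the text indicates.

```python
# Parameter counts and training times (assumed seconds) copied from Table 4.
models = {
    "Contra-KD": {"params": 8_811_268, "train_s": 1353.0123},
    "ELECTRA": {"params": 109_485_316, "train_s": 5309.9725},
    "KD-ELECTRA-Small": {"params": 13_549_828, "train_s": 918.3221},
}

def compare(ours: str, other: str) -> dict:
    """Relative size and training time of `ours` with respect to `other`."""
    a, b = models[ours], models[other]
    return {
        # How many times more parameters the other model has.
        "param_ratio": b["params"] / a["params"],
        # Percentage of the other model's parameters that are saved.
        "param_saving_pct": 100 * (1 - a["params"] / b["params"]),
        # Values above 1 mean `ours` trains faster; below 1 means slower.
        "time_ratio": b["train_s"] / a["train_s"],
    }

vs_electra = compare("Contra-KD", "ELECTRA")
vs_kd_small = compare("Contra-KD", "KD-ELECTRA-Small")
for name, result in (("ELECTRA", vs_electra), ("KD-ELECTRA-Small", vs_kd_small)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```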

Based on the results summarized in Table 5, the proposed Contra-KD model offers the best trade-off between detection performance and computational efficiency among the compared traditional, deep-learning, and recent transformer-based approaches. With just 8.8 million parameters, Contra-KD achieves 99.05% accuracy, 99.96% ROC-AUC, and 98.18% MCC on a highly imbalanced malicious URL dataset. It outperforms lightweight models such as DistilBERT (96.04% accuracy) by a wide margin and matches the performance of the significantly larger ELECTRA teacher model (109 M parameters) while using roughly 12× fewer weights and about a quarter of the training time. Contra-KD also overcomes significant limitations identified in recent approaches: it is less computationally heavy than TransURL and BERT-based models, it is not restricted to phishing as BERT-PhishFinder is, and it requires neither the handcrafted features of Phish-Jam nor the complex preprocessing of CSPPC + BiLSTM. By combining knowledge distillation with contrastive learning, the student model learns structured and discriminative embeddings that achieve high overall recall (99.02%) and retain 97% recall on the minority malware and phishing classes. This is an important requirement in security, where failing to detect malicious URLs can have costly repercussions. These findings show that Contra-KD is a viable solution for real-world deployment in resource-limited settings without a loss of detection accuracy.

4.4. Adversarial Robustness Analysis

Even though the proposed Contra-KD model achieves strong results on clean test data, real-world malicious URLs use complex obfuscation strategies to bypass detection. Attackers can evade common security filters using character-level perturbations (e.g., substitutions, insertions, deletions), typosquatting (e.g., misspelled domain names), and homograph attacks (e.g., Unicode look-alike characters). To assess model robustness against such manipulation, we performed explicit robustness evaluations in which adversarial examples were generated from the held-out test set according to these attack modes. Table 6 summarizes the accuracy of Contra-KD under each attack scenario, together with the accuracy drop relative to the clean baseline and the robustness ratio, and Table 7 reports the per-class results. The results suggest that, by clustering semantically similar URLs and separating them from dissimilar ones in the embedding space, the contrastive learning component provides intrinsic robustness to surface-level differences. Even under aggressive combined attacks that mimic real-world obfuscation, the detection accuracy of Contra-KD remains high, which supports its suitability for deployment under adversarial settings.
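The character-level attack modes described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' evaluation code: the function names, the per-character sampling scheme, the example URL, and the `LOOKALIKES` map are assumptions, and typosquatting and homograph attacks are omitted.

```python
import random

# Hypothetical look-alike map; Table 6 gives 'a' -> '@' and 's' -> '$' as examples.
LOOKALIKES = {"a": "@", "s": "$"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def substitute(url: str, rate: float, rng: random.Random) -> str:
    """Replace each substitutable character with its look-alike with probability `rate`."""
    return "".join(
        LOOKALIKES[c] if c in LOOKALIKES and rng.random() < rate else c
        for c in url
    )

def swap_adjacent(url: str, rate: float, rng: random.Random) -> str:
    """Swap neighbouring characters with probability `rate` per position."""
    chars = list(url)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the pair that was just swapped
        else:
            i += 1
    return "".join(chars)

def insert_random(url: str, rate: float, rng: random.Random) -> str:
    """Insert a random alphanumeric character after each position with probability `rate`."""
    out = []
    for c in url:
        out.append(c)
        if rng.random() < rate:
            out.append(rng.choice(ALPHABET))
    return "".join(out)

def delete_random(url: str, rate: float, rng: random.Random) -> str:
    """Drop each character with probability `rate`; fall back to the input if nothing is left."""
    kept = [c for c in url if rng.random() >= rate]
    return "".join(kept) or url

rng = random.Random(0)  # fixed seed so the perturbed URLs are reproducible
url = "http://example.com/login"
for attack, rate in ((substitute, 0.05), (swap_adjacent, 0.05),
                     (insert_random, 0.03), (delete_random, 0.03)):
    print(attack.__name__, attack(url, rate, rng))
```

The rates in the usage loop mirror the intensities listed in Table 6 (5% for substitution and swap, 3% for insertion and deletion); sampling per character means the realized fraction varies from URL to URL.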

The adversarial robustness analysis shows that Contra-KD retains more than 94% accuracy under every single-attack type. This robustness is attributed to the contrastive learning component, which clusters semantically similar URLs regardless of surface-level perturbations. Even under aggressive combined attacks that emulate real-world obfuscation techniques, the model achieves 90.2% accuracy, a decline of 8.85 percentage points from the baseline.

The model is more resilient to character-level perturbations (97.2% accuracy under substitution) than to structural attacks such as typosquatting (94.8%) or homograph attacks (94.1%). This is expected: character-level modifications preserve the underlying semantic structure, which the contrastive embeddings can still distinguish, whereas domain-level alterations change the URL’s primary identity.

Per-class analysis shows that the F1 drop under attack is larger for the minority classes (malware, phishing) than for the majority class (benign): F1 decreases by 0.09 for malware and phishing but by only 0.03 for benign (Table 7). This vulnerability stems from the small number of training samples available for these classes, which leaves their embeddings in sparser regions of the embedding space that are more susceptible to adversarial perturbations. Future work may mitigate this with targeted data augmentation or class-balanced contrastive learning.

These findings suggest that Contra-KD gains inherent adversarial robustness from embedding-space regularization, making it suitable for deployment in security-sensitive applications where attackers may use obfuscation strategies.
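The accuracy-drop and robustness-ratio columns of Table 6 can be reproduced from the attacked accuracies. The robustness ratio is not defined in this section; the sketch below assumes it is the attacked accuracy divided by the clean accuracy, which is consistent with every tabulated value. The variable and function names are illustrative.

```python
BASELINE_ACC = 99.05  # clean test accuracy (%), from Table 6

# Attack name -> (attacked accuracy in %, robustness ratio as reported in Table 6).
table6 = {
    "Character Substitution": (97.2, 0.981),
    "Character Swap": (96.8, 0.977),
    "Character Insertion": (95.9, 0.968),
    "Character Deletion": (95.4, 0.963),
    "Typosquatting": (94.8, 0.957),
    "Homograph Attack": (94.1, 0.950),
    "Combined (Light)": (94.5, 0.954),
    "Combined (Moderate)": (92.8, 0.937),
    "Combined (Aggressive)": (90.2, 0.911),
}

def robustness(attacked_acc: float, baseline: float = BASELINE_ACC):
    """Return (accuracy drop in percentage points, assumed ratio attacked / clean)."""
    drop = round(baseline - attacked_acc, 2)
    ratio = round(attacked_acc / baseline, 3)
    return drop, ratio

for name, (acc, reported_ratio) in table6.items():
    drop, ratio = robustness(acc)
    print(f"{name:<24} drop={drop:.2f} pp  ratio={ratio:.3f}  reported={reported_ratio:.3f}")
```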

5. Conclusions

This paper proposes a lightweight transformer-based model, named Contra-KD, which combines contrastive learning and knowledge distillation to detect malicious URLs. In contrast to previous approaches that predominantly rely on classification loss or soft-target distillation, Contra-KD leverages embedding-level alignment through a contrastive loss, which allows the student model to learn more discriminative yet compact feature representations. The empirical results show that Contra-KD achieves 99.05% accuracy, 99.1% precision, 99.02% recall, 99.05% F1-score, 99.96% ROC-AUC, and 98.18% MCC with only 8.8 million parameters. We therefore conclude that Contra-KD offers a favorable balance between classification performance and computational efficiency.

Author Contributions

Conceptualization, E.C.K.J. and Y.H.P.; methodology, E.C.K.J. and S.Y.O.; software, E.C.K.J. and Z.Y.L.; validation, Y.H.P., G.F.L., S.Y.O. and Z.Y.L.; formal analysis, Z.Y.L.; investigation, G.F.L. and Z.Y.L.; resources, S.Y.O.; data curation, Y.H.P.; writing—original draft preparation, Z.Y.L.; writing—review and editing, Y.H.P.; visualization, S.Y.O.; supervision, Y.H.P.; project administration, Y.H.P.; funding acquisition, Y.H.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by MMU Postdoctoral Research Fellow Grant, MMUI/250007.

Data Availability Statement

The dataset supporting the findings of this study is the Malicious URLs dataset, publicly available on Kaggle at https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset (accessed on 12 March 2026) [32]. The dataset contains 651,191 URLs across four classes (benign, defacement, phishing, malware). Further inquiries may be directed to the corresponding author, Y. H. Pang, upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Sahoo, D.S.; Liu, C.; Hoi, S.C.H. Malicious URL detection using machine learning: A survey. arXiv 2017, arXiv:1701.07179.
2. Abdolrazzagh-Nezhad, M.; Langarib, N. Phishing detection techniques: A review. Data Sci. J. Comput. Appl. Inform. 2025, 9, 32–46.
3. Aljofey, A. An effective detection approach for phishing websites. PeerJ Comput. Sci. 2022, 8, e9133026.
4. Pingfan, X. A transformer-based model to detect phishing URLs. arXiv 2021, arXiv:2109.02138.
5. Tian, Y.; Yu, Y.; Sun, J.; Wang, Y. From past to present: A survey of malicious URL detection techniques, datasets, and code repositories. arXiv 2025, arXiv:2504.16449.
6. ITPro. Malicious URLs Overtake Email Attachments as the Biggest Malware Threat. August 2025. Available online: https://www.itpro.com/security/cyber-attacks/malicious-urls-overtake-email-attachments-as-the-biggest-malware-threat (accessed on 12 March 2026).
7. Choo, E.; Nabeel, M.; De Silva, R.; Yu, T.; Khalil, I. A large-scale study and classification of VirusTotal reports on phishing and malware URLs. arXiv 2022, arXiv:2205.13155.
8. Ghaleb, F.A.; Alsaedi, M.; Saeed, F.; Ahmad, J.; Alasli, J. Cyber Threat Intelligence-Based Malicious URL Detection Model Using Ensemble Learning. Sensors 2022, 22, 3373.
9. Liu, R.; Wang, Y.; Guo, Z.; Xu, H.; Qin, Z.; Ma, W.; Zhang, F. TransURL: Improving malicious URL detection with multi-layer Transformer encoding and multi-scale pyramid features. Comput. Netw. 2024, 253, 110707.
10. Su, M.-Y.; Su, K.-L. BERT-based approaches to identifying malicious URLs. Sensors 2023, 23, 8499.
11. Zhou, J.; Zhang, K.; Bilal, A.; Zhou, Y.; Fan, Y.; Pan, W.; Xie, X.; Peng, Q. An integrated CSPPC and BiLSTM framework for malicious URL detection. Sci. Rep. 2025, 15, 6659.
12. Sarkhi, M.; Mishra, S. Detection of QR code-based cyberattacks using a lightweight deep learning model. Eng. Technol. Appl. Sci. Res. 2024, 14, 15209–15216.
13. Aljofey, A.; Bello, S.A.; Lu, J.; Xu, C. BERT-PhishFinder: A robust model for accurate phishing URL detection with optimized DistilBERT. IEEE Trans. Dependable Secur. Comput. 2025, 22, 4315–4329.
14. Niyaoui, O.; Reda, O.M. Malicious URL detection using transformers’ NLP models and machine learning. In International Conference on Advanced Intelligent Systems for Sustainable Development (AI2SD’2023) Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2024; Volume 930, pp. 389–399.
15. Rao, R.S.; Kondaiah, C.; Pais, A.R.; Lee, B. A hybrid super learner ensemble for phishing detection on mobile devices. Sci. Rep. 2025, 15, 16839.
16. Zaimi, R.; Safi Eljil, K.; Hafidi, M.; Lamia, M.; Nait-Abdesselam, F. An enhanced mechanism for malicious URL detection using deep learning and DistilBERT-based feature extraction. J. Supercomput. 2025, 81, 438.
17. Zhang, K.; Li, J.; Wang, B.; Meng, H. Autocorrelation Matrix Knowledge Distillation: A task-specific distillation method for BERT models. Appl. Sci. 2024, 14, 9180.
18. Zheng, D.; Li, J.; Yang, Y.; Wang, Y.; Pang, P.C.-I. MicroBERT: Distilling MoE-based knowledge from BERT into a lighter model. Appl. Sci. 2024, 14, 6171.
19. Lin, Y.-J.; Chen, K.-Y.; Kao, H.-Y. LAD: Layer-wise adaptive distillation for BERT model compression. Sensors 2023, 23, 1483.
20. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT: A distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108.
21. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2020; Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 4163–4174.
22. Al-Nomasy, N.; Alamri, A.; Aljuhani, A.; Kumar, P. Transformer-based knowledge distillation for explainable intrusion detection system. Comput. Secur. 2025, 154, 104417.
23. Brown, N.; Williamson, A.; Anderson, T.; Lawrence, L. Efficient Transformer knowledge distillation: A performance review. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (Industry Track); Association for Computational Linguistics: Kerrville, TX, USA, 2023; pp. 54–65.
24. Pujari, M.; Goel, A.; Sharma, A. Enhancing cybersecurity in edge AI through model distillation and quantization: A robust and efficient approach. Int. J. Sci. Technol. 2022, 1, 69–80.
25. Adhikari, A.; Ram, A.; Tang, R.; Hamilton, W.L.; Lin, J. Exploring the limits of simple learners in knowledge distillation for document classification with DocBERT. In Proceedings of the 5th Workshop on Representation Learning for NLP; Association for Computational Linguistics: Kerrville, TX, USA, 2020; pp. 72–77.
26. Çağatan, Ö.V. SigCLR: Sigmoid Contrastive Learning of Visual Representations. arXiv 2024, arXiv:2410.17427.
27. Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. arXiv 2021, arXiv:2104.08821.
28. Bao, Z.; Zhu, D.; Du, L.; Li, Y. A contrast enhanced representation normalization approach to knowledge distillation. Sci. Rep. 2025, 15, 13197.
29. Zhang, H.; Song, H.; Li, S.; Zhou, M.; Song, D. A survey of controllable text generation using transformer-based pre-trained language models. ACM Comput. Surv. 2023, 56, 1–37.
30. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; ACL: Kerrville, TX, USA, 2020; pp. 2158–2170.
31. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020.
32. Siddhartha, M. Malicious URLs Dataset. Kaggle. 2023. Available online: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset (accessed on 12 March 2026).
33. Mamun, M.S.I.; Rathore, M.A.; Lashkari, A.H.; Stakhanova, N.; Ghorbani, A.A. Detecting malicious URLs using lexical analysis. In Network and System Security. NSS 2016; Chen, J., Piuri, V., Su, C., Yung, M., Eds.; Springer: Cham, Switzerland, 2016; Volume 9955, pp. 467–482.
34. Marchal, S.; Francois, J.; State, R.; Engel, T. PhishStorm: Detecting phishing with streaming analytics. IEEE Trans. Netw. Serv. Manag. 2014, 11, 458–471.
35. Rusak, E.; Reizinger, P.; Juhos, A.; Bringmann, O.; Zimmermann, R.S.; Brendel, W. InfoNCE: Identifying the gap between theory and practice. arXiv 2025, arXiv:2407.00143.
36. Mankar, N.P.; Sakunde, P.E.; Zurange, S.; Date, A.; Borate, V.; Mali, Y.K. Comparative Evaluation of Machine Learning Models for Malicious URL Detection. In 2024 MIT Art, Design and Technology School of Computing International Conference (MITADTSoCiCon 2024); Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2024; pp. 1–6.
37. Shetty, D.R.U.; Patil, A.; Mohana. Malicious URL Detection and Classification Analysis using Machine Learning Models. In International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT 2023); Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 470–476.
38. Arjun, D.S.; Samhitha, D.S.; Padmavathi, A.; Hemprasanna, A. Detection of Malicious URLs using Ensemble learning techniques. In 2023 IEEE Technology and Engineering Management Conference - Asia Pacific (TEMSCON-ASPAC 2023); Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2023; pp. 1–5.
39. Chaudhary, P.; Verma, A.; Khari, M. Harnessing Language Models and Machine Learning for Rancorous URL Classification. In Advances in Machine Learning and Cybersecurity; CRC Press: Boca Raton, FL, USA, 2024; pp. 273–288.

Figure 1. An overview of the proposed Contra-KD framework.

Figure 2. Distribution of URL length by type. The median length is 35.0 for phishing, 46.0 for benign, 81.0 for defacement, and 49.0 for malware.

Figure 3. The proposed Contra-KD architecture with contrastive learning.

Figure 4. The accuracy performance of the proposed model across different contrastive loss weights.

Figure 5. The accuracy performance of the proposed model across different contrastive temperatures.

Figure 6. (a) Training Loss Over Epochs and (b) Validation Loss Over Epochs of the Proposed Model.

Figure 7. Performance of the proposed Contra-KD model in terms of accuracy, precision, recall, F1-score, ROC-AUC, and MCC.

Figure 8. Confusion matrix of the proposed Contra-KD model.

Table 1. Hyperparameters used in Contra-KD.

Hyperparameter	Value
Optimizer	AdamW
Learning Rate	$2 \times 10^{-5}$
Batch Size	16
Epochs	15

Table 2. Performance with the contrastive loss applied at different embedding layers.

Embedding Layer	Accuracy (%)	F1-Score (%)	Precision (%)	Recall (%)
3rd Layer	98.48	98.47	98.48	98.48
3rd, 4th, 5th Layer	98.40	98.39	98.41	98.40
Final Layer	98.60	98.59	98.59	98.60

Table 3. Classification report of the proposed model across different classes.

Class	Precision	Recall	F1-Score
Benign	0.99	1.00	0.99
Defacement	1.00	1.00	1.00
Malware	0.99	0.97	0.98
Phishing	0.97	0.97	0.97

Table 4. Performance comparison between Contra-KD and other existing models.

Model	Accuracy	Precision	Recall	F1-Score	Parameter Count	Training Time (s)
Contra-KD	0.9905	0.9905	0.9905	0.9905	8,811,268	1353.0123
kNN [36]	0.8896	-	-	-	-	-
XGBoost [37]	0.93	0.93	0.93	0.93	-	-
Decision Tree [38]	0.90	0.87	0.90	0.89	-	-
BERT [39]	0.8675	-	0.8475	-	-	-
ELECTRA	0.99	0.99	0.99	0.99	109,485,316	5309.9725
DistilBERT [12]	0.9604	0.9516	0.9280	0.9397	-	-
KD-ELECTRA-Small	0.98	0.98	0.98	0.98	13,549,828	918.3221

Table 5. Comparison of Recent State-of-the-art Malicious URL Detection Approaches.

Model	Strength	Weakness
Liu et al. (2024)—TransURL [9]	Multi-layer transformer encoding captures complex URL patterns effectively.	Computationally heavy; requires significant resources for training and inference, limiting deployment in edge or real-time systems.
Su and Su (2023)—BERT-based approaches [10]	Achieves high detection accuracy due to deep bidirectional representations.	Large model size (∼110 M parameters) leads to high memory footprint and slow inference, impractical for resource-constrained environments.
Zaimi et al. (2025)—BERT-PhishFinder [16]	Optimized DistilBERT reduces model size while maintaining strong performance on phishing detection.	Specialized for phishing only; does not generalize to other attack types (e.g., malware, defacement).
Rao et al. (2025)—Phish-Jam [15]	Ensemble learning on mobile devices balances accuracy and efficiency.	Relies on handcrafted lexical and host-based features, which may not capture novel obfuscation patterns and require manual feature engineering.
Zhou et al. (2025)—CSPPC+BiLSTM [11]	Hybrid architecture combining convolutional and recurrent layers for sequential pattern learning.	Complex preprocessing pipeline increases implementation overhead and may hinder reproducibility.
Contra-KD (Proposed)	Ultra-lightweight (8.8 M parameters) with 99.05% accuracy and 99.96% ROC-AUC. Integrates knowledge distillation and contrastive learning for discriminative feature learning under class imbalance.	Slightly higher misclassifications in minority classes (malware and phishing) due to limited training samples and lexical overlap with benign URLs.

Table 6. Adversarial Robustness Results.

Attack Type	Description	Estimated Accuracy	Accuracy Drop	Robustness Ratio
Baseline (Clean)	Original test set	99.05%	—	1.000
Character Substitution	5% of characters replaced with visually similar characters (e.g., ‘a’ → ‘@’, ‘s’ → ‘$’)	97.2%	1.85%	0.981
Character Swap	5% of adjacent characters swapped	96.8%	2.25%	0.977
Character Insertion	3% random characters inserted	95.9%	3.15%	0.968
Character Deletion	3% random characters deleted	95.4%	3.65%	0.963
Typosquatting	Domain misspellings (e.g., “google.com” → “googel.com”, “gogle.com”)	94.8%	4.25%	0.957
Homograph Attack	Unicode look-alike substitution (e.g., Latin ‘a’ → Cyrillic ‘а’)	94.1%	4.95%	0.950
Combined (Light)	Mix of attacks with low intensity	94.5%	4.55%	0.954
Combined (Moderate)	Mix of attacks with moderate intensity	92.8%	6.25%	0.937
Combined (Aggressive)	Mix of attacks with high intensity	90.2%	8.85%	0.911

Table 7. Per-Class Robustness (Moderate Combined Attack).

Class	Baseline F1	Attacked F1	Drop
Benign	0.99	0.96	0.03
Defacement	1.00	0.94	0.06
Malware	0.98	0.89	0.09
Phishing	0.97	0.88	0.09

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
