Article

Session2vec: Session Modeling with Multi-Instance Learning for Accurate Malicious Web Robot Detection

1 School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 National Engineering Research Center of Disaster Backup and Recovery, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(10), 1945; https://doi.org/10.3390/electronics14101945
Submission received: 10 April 2025 / Revised: 7 May 2025 / Accepted: 8 May 2025 / Published: 10 May 2025
(This article belongs to the Special Issue Network Protocols and Network Security)

Abstract

Botnets are a serious side effect of the rapid development of the Internet and pose a severe threat to Internet users within digital ecosystems. Malicious Web robots can facilitate Web/data scraping, DDoS attacks, and data theft, yielding serious cybersecurity threats. Modern botnets are sophisticated and carry unique browser fingerprints, making their detection a real challenge. Traditional feature extraction methods depend heavily on expert knowledge; they also struggle with dimensional inconsistency when processing sessions of varying lengths and fail to counter evolving camouflage attacks. To address these challenges, we propose Session2vec, a session representation framework based on multi-instance learning (MIL) that pioneers the MIL approach for Web session modeling. We treat each request as an instance and the entire session as an instance collection, and use the FastText model to convert each URL request into a vector representation. We then apply two innovative multi-instance aggregation methods, SARD (Session-level Aggregated Residual Descriptors) and SFAR (Session-level Fisher Aggregated Representation), to aggregate variable-length sessions into fixed-dimensional vectors that capture spatiotemporal features and distributional information within sessions. Results on public datasets show that the proposed SARD and SFAR methods improve accuracy by 5.2% and 16.3% on average, respectively, compared to state-of-the-art baselines, and improve F1-scores by 8.5% and 19.7%, respectively.


1. Introduction

Web robots or crawlers have become a major source of network traffic. While some robots (such as those used by search engines) behave well, other robots may perform DDoS [1,2] attacks and carry out low-rate threats [3], posing a serious security threat to websites. Web robots (also known as “Web crawlers”), as a key component supporting data collection and indexing in the modern Internet, are technologically neutral and are thus widely used in building search engine indexes, scanning for network vulnerabilities, and archiving Web content [4,5,6]. According to the latest research [7], for Web applications with known vulnerabilities, 47.4% of the network traffic is Web robot traffic, with malicious traffic accounting for as much as 63.57%. This trend highlights the urgency of strengthening cybersecurity defenses against such network security events.
Assessing whether a website bot harbors malicious intent is often challenging; it is relatively straightforward to identify such bots only when they display clear danger signals (e.g., frequent access to restricted resources within a short time span or initiating numerous connection requests). However, more covert malicious bots tend to evade detection by mimicking human behavior, spoofing trusted IP addresses, or adhering to typical website navigation structures, thereby posing potential threats to both the ethical and security dimensions of websites and their hosted information [8]. Currently, the detection of Web robots primarily relies on Web log data to identify and intercept potential malicious activities [7,9,10,11,12]. Consequently, extracting effective and comprehensive data representations from Web logs is crucial for reliable Web robot detection. Because these detection models heavily depend on how log data are represented, a lack of high-quality feature extraction and representation can significantly hinder the accurate differentiation between legitimate and malicious access.
Despite notable progress in Web bot detection research, significant limitations persist in feature representation generalizability and temporal modeling capabilities. For instance, methods based on static statistical features (e.g., request frequency and error rate) achieve high detection precision (98.7%) in specific scenarios [13], yet their reliance on manually defined feature sets fails to capture the temporal semantics of request sequences, resulting in degraded cross-platform generalization performance. To address this limitation, subsequent studies have attempted to integrate multi-source heterogeneous data, such as combining mouse dynamics with request metadata to enhance detection robustness [12]. However, these approaches heavily depend on manual feature engineering and prior knowledge, making it difficult to handle diverse attack patterns. Recent real-time detection frameworks leveraging multimodal behavioral analysis [14] have improved temporal modeling by defining 43 browser event features for LSTM classifiers. Nevertheless, their feature design heavily depends on prior knowledge of HTTP protocols, struggles to adapt to emerging communication protocols like WebSocket, and incurs substantial deployment costs due to the requirement for fully annotated event sequences. Notably, semantic-enhanced methods (e.g., LDA2Vec topic feature extraction) that combine Web logs with content data have achieved an F1-score of 96.58% in e-commerce contexts [7]. However, their session characterization remains confined to static topic distributions and handcrafted statistics, failing to model dynamic contextual relationships in request sequences, which creates detection blind spots for temporally camouflaged bots.
To avoid reliance on manual feature engineering and to better capture cross-request spatiotemporal relationships, this paper proposes an end-to-end session representation method based on multi-instance learning (MIL), termed Session2vec. Under the MIL paradigm [15], each session is regarded as a bag of multiple request instances, and a comprehensive, scalable session vector representation is constructed through character-level subword embeddings and spatiotemporal clustering. Specifically, we first employ FastText to embed each request URL into a vector. Subsequently, drawing inspiration from the classic VLAD (Vector of Locally Aggregated Descriptors) and Fisher Vector algorithms [16], we propose two innovative aggregation strategies: SARD and SFAR. SARD extracts spatiotemporal relationships from clustering residuals based on hard assignments, while SFAR captures richer distributional features using soft assignments based on Gaussian Mixture Models (GMMs) [17]. Compared to traditional manual statistical features or sequence analysis methods, Session2vec focuses more on spatiotemporal associations among requests and the concealed distributional characteristics that malicious activities may exhibit, thereby significantly enhancing Web robot detection performance and robustness. Our five-fold cross-validation results demonstrate that this method achieves optimal performance in terms of both accuracy and F1-score.
In the field of network and information security, multi-instance learning has been widely applied to enhance the representation of unstructured data, such as text and logs. For example, in the threat behavior extraction framework SeqMask, cyberthreat intelligence (CTI) texts are treated as bags of behavioral phrase instances, and a Mask Attention mechanism is employed to mine key behavioral terms, effectively revealing the adversaries’ strategies, tactics, and procedures (TTPs) and achieving excellent classification performance in distant supervision scenarios [18]. This work demonstrates that under weakly supervised or semi-supervised conditions, semantic aggregation and attention filtering via multi-instance learning can enhance the extraction of critical information while reducing labeling costs. Inspired by this idea, Session2vec extends the multi-instance learning framework to dynamic session behavior modeling: it not only focuses on the content or features of individual requests but also captures the latent associative structures among requests through spatiotemporal distribution modeling, thereby effectively countering complex camouflage attacks.
Furthermore, the MIL-based text representation model BOS proposed by He et al. segments texts into sentence-level instances and employs sentence similarity measures along with an improved KNN algorithm to perform document classification [19]. This method overcomes the bag-of-words independence assumption inherent in conventional vector space models (VSMs) [20], thereby preserving to some extent the internal semantic structure of texts. With regard to semantic structure modeling, the syntax-aware entity embedding model (SEE) developed by He et al. [21] employs tree-structured neural networks (e.g., Tree-GRU/Tree-LSTM) to represent entity contexts and leverages both sentence-level and entity-level attention mechanisms to enhance the accuracy of distant supervision relation extraction. Their work indicates that incorporating richer hierarchical structures (such as syntactic trees or graph structures) can provide more refined contextual information for multi-instance learning in complex texts. However, in Web security detection scenarios, request patterns are typically more dynamic and diverse: randomized URL paths, adversarial request fingerprints, and significant quantities of noisy data render static tree structures unsuitable. For this reason, Session2vec does not rely on predefined syntactic trees; rather, it employs FastText [22] subword embeddings to capture the character-level semantics and structure of request paths, and then applies the SARD and SFAR algorithms to aggregate the entire session sequence, thereby achieving a weakly supervised representation of variable-length request streams.
The main contributions of this work are as follows:
  • Session representation innovation within the multi-instance learning framework: This research pioneers the systematic application of the MIL paradigm to Web session representation, treating each request as a semantic instance and the entire session as an instance collection. By proposing the SARD algorithm to perform clustering residual encoding on FastText-generated request vectors, we effectively model spatiotemporal behavior patterns within sessions. This approach significantly improves detection accuracy for complex camouflage attacks and provides a novel perspective for session representation in the Web security domain.
  • Session vectorization and end-to-end detection framework: Traditional Web robot detection typically relies on handcrafted features or synthetic data, which struggle to comprehensively capture real attack scenarios. Session2vec implements cross-request end-to-end detection through session-level vectorization, reducing over-reliance on manual features while naturally integrating contextual information to improve overall malicious behavior recognition. The proposed SARD and SFAR aggregation strategies enable the model to adapt to variable-length sessions while preserving temporal relationships between requests.
  • Unsupervised embedding and dual aggregation for spatiotemporal modeling: Session2vec addresses the challenges of high annotation costs and data diversity in network security tasks. It combines unsupervised FastText subword embeddings with two complementary aggregation strategies, SARD and SFAR, to achieve efficient spatiotemporal feature capture.
The remainder of this paper is organized as follows: Section 2 reviews related work on Web robot detection and session modeling. Section 3 formulates the problem within a multi-instance learning framework. Section 4 details the proposed approach, including the request embedding process and the design of the SARD and SFAR aggregation methods for session-level representation. Section 5 presents the experimental setup, dataset statistics, and performance evaluation. Section 6 concludes the paper, and Section 7 outlines directions for future research.

2. Related Work

Recent advancements in Web robot detection have explored diverse methodologies to address the evolving challenges posed by malicious bots. Traditional approaches, such as rule-based systems and classical machine learning algorithms (e.g., KNN, decision trees, and neural networks), have shown limitations in handling variable-length session data and inconsistent feature dimensions, particularly when bots mimic human-like behaviors or employ sophisticated evasion tactics [13]. For instance, while neural networks achieve high precision in bot detection, their computational overhead and reliance on fixed-dimensional feature vectors hinder scalability.
Semantic-based detection methods leverage content and behavioral patterns to distinguish bots. Studies like [7,23] utilize semantic features from Web content or logs, assuming human users exhibit topic-specific interests. However, these methods often neglect the temporal and contextual relationships within session-level requests, limiting their ability to model complex bot behaviors. Similarly, dynamic metadata approaches (e.g., mouse dynamics [12]) enhance accuracy but focus on biometric traits rather than session-structured log analysis, which is critical for detecting bots operating at scale.
Semi-supervised learning has emerged to address labeled data scarcity. For example, Web-S4AE [11] employs a stacked sparse autoencoder to exploit unlabeled data, yet its dependency on content–log hybrid features may not fully resolve the challenges of variable request counts within sessions. In contrast, Session2vec introduces a multi-instance learning (MIL) framework coupled with FastText [22] and miVLAD and miFV algorithms [16]. This approach uniquely converts individual requests into fixed-dimensional vectors (via FastText) and aggregates them into session-level representations (via miVLAD and miFV), effectively addressing feature inconsistency while capturing temporal and contextual dependencies. Unlike the study [11], which requires extensive labeled data, MIL reduces annotation costs by treating sessions as bags of instances, aligning with the trend toward weak supervision in bot detection.
Furthermore, while real-time detection methods [24,25] prioritize early classification of active sessions, they often overlook the structural heterogeneity of session data. Session2vec bridges this gap by unifying session modeling and representation learning, offering a robust solution for detecting advanced bots with dynamic behavioral patterns.

3. Problem Formulation

In Web behavior analysis, each session can be treated as a “multi-instance” bag composed of multiple requests. Traditional single-instance approaches often focus on the features of individual requests while neglecting the broader temporal and semantic patterns within a session, making it difficult to detect sophisticated malicious bot activities. To address this issue, we model session-level detection as a multi-instance learning problem.
Formally, consider a Web log dataset containing multiple sessions $\{S_1, S_2, \ldots, S_m\}$. Each session $S_i$ comprises requests $\{r_{i1}, r_{i2}, \ldots, r_{in}\}$ and carries a single label (e.g., “benign” or “malicious”). In the multi-instance setting, each request $r_{ij}$ is treated as an instance, whereas each session $S_i$ is regarded as a bag. Our objective is to make a session-level classification decision to distinguish between human and bot traffic.
Under this framework, we first embed each request into a vector representation. We then deploy an aggregation step that converts all instance vectors within a bag into a fixed-dimensional session vector, which can be fed into supervised or unsupervised methods to uncover potential malicious behaviors. Because the number of requests (instances) per session (bag) can vary, multi-instance learning naturally accommodates this variability, making it highly suitable for our problem setting.
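As an illustration only (the class and helper below are not from the paper, and the names are hypothetical), the bag-of-instances view can be sketched in Python as variable-length collections of per-request vectors that share a single session-level label:

# Minimal sketch of the multi-instance (bag-of-instances) view; names are illustrative.
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class SessionBag:
    session_id: str
    instances: np.ndarray  # shape (n_requests, d): one embedding per request instance
    label: int             # session-level label, e.g., 0 = human, 1 = bot


def make_bags(vectors_by_session: Dict[str, List[np.ndarray]],
              label_by_session: Dict[str, int]) -> List[SessionBag]:
    """Group per-request vectors into variable-length bags, one per session."""
    return [
        SessionBag(sid, np.vstack(vecs), label_by_session[sid])
        for sid, vecs in vectors_by_session.items()
    ]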

4. Methodology

In this study, we propose a structured framework for Web robot detection, illustrated in Figure 1. First, both human and robot users generate significant numbers of Web logs by interacting with a server. We preprocess these logs, extracting relevant features such as the request URL field. After data filtering and feature extraction, the Session2vec approach is employed to obtain session-level representations.
The representation process involves two stages: (1) Converting each request URL into a fixed-dimensional embedding using a FastText model; (2) Treating all requests within a session as a multi-instance bag, which is subsequently aggregated into a session-level vector through two methods: SARD and SFAR. The resulting vectors can then be fed into a classifier to differentiate between human and robot activities.

4.1. Request Embedding

In this study, we first perform text preprocessing and vectorization on each raw data request to construct a session-level representation. Specifically, the URL path in each request is preprocessed by replacing the forward slash (“/”) with a space, thereby generating a continuous text string. We then apply a pretrained FastText-300 model to convert the text into a 300-dimensional vector representation. Since the model is trained on large-scale corpora, it effectively mitigates issues such as spelling errors and out-of-vocabulary words. The complete process of request embedding generation is detailed in Algorithm 1.
For a single session S, multiple requests are contained within. By stacking the vector of each request in the original chronological order, we obtain an n × 300 matrix (n is the number of requests), as shown in Algorithm 1. This matrix preserves the local semantic features of each request and simultaneously captures the temporal relationships within the session, laying the foundation for detecting complex behavior patterns.
The following pseudocode illustrates how to build request embeddings using a pretrained FastText-300 model and stack them into matrices grouped by session.
Algorithm 1 Request embedding generation
Require: Dataset $D$, where each record contains the fields session_id and request_path; pretrained FastText model $M$ (with 300-dimensional output).
Ensure: A vector representation for each request and a request matrix $E_s \in \mathbb{R}^{n \times 300}$ grouped by session.
 1: for each record $r \in D$ do
 2:     Extract the request path: $uri \leftarrow r.\mathrm{request\_path}$
 3:     Preprocess $uri$: replace the forward slash with a space to obtain the string $\mathrm{processed\_uri}$
 4:     Use the pretrained FastText model to compute the sentence vector: $v \leftarrow M.\mathrm{get\_sentence\_vector}(\mathrm{processed\_uri})$
 5:     Save the vector $v$ as a new field for the record (e.g., uri_vector)
 6: end for
 7: for each session $s$ (grouped by session_id) do
 8:     Collect all request vectors for session $s$: $\{v_1, v_2, \ldots, v_n\}$
 9:     Stack these vectors using np.vstack to form the matrix $E_s \in \mathbb{R}^{n \times 300}$
10: end for
11: Return the collection of request matrices $\{E_s\}$ corresponding to all sessions
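To make Algorithm 1 concrete, the following sketch uses the fasttext Python package listed in Section 5.1. The pretrained model filename and the pandas column layout are assumptions for illustration; this is not the authors' released code.

# Sketch of Algorithm 1: embed each request URL with a pretrained FastText-300
# model and stack the vectors per session. The model filename is an assumption.
import fasttext
import numpy as np
import pandas as pd

model = fasttext.load_model("cc.en.300.bin")  # any pretrained 300-dimensional FastText model


def build_session_matrices(logs: pd.DataFrame) -> dict:
    """Return {session_id: E_s}, where E_s has shape (n_requests, 300), in log order."""
    logs = logs.copy()
    # Replace "/" with a space so each URL path becomes a plain text string.
    logs["processed_uri"] = logs["request_path"].str.replace("/", " ", regex=False)
    logs["uri_vector"] = logs["processed_uri"].apply(model.get_sentence_vector)

    matrices = {}
    for sid, group in logs.groupby("session_id", sort=False):
        matrices[sid] = np.vstack(group["uri_vector"].to_numpy())
    return matrices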

4.2. Session-Level Representations: SARD and SFAR

In this section, we propose two innovative session-level aggregation methods, SARD and SFAR, to build fixed-dimensional session representations from request embeddings. Both methods follow a multi-instance aggregation principle, effectively deriving statistical and distributional features from session requests while maintaining a fixed output dimension.

4.2.1. SARD (Session-Level Aggregated Residual Descriptors)

SARD is inspired by the classical VLAD approach. It utilizes hard assignment to cluster each request vector and accumulates their residuals as the final representation. Let $E_s \in \mathbb{R}^{n \times d}$ be the request embedding matrix of session $s$, where $n$ is the number of requests and $d$ is the embedding dimension (e.g., 300). Assume there is a pretrained KMeans [26] model $C$ with $K$ centroids $\{c_1, c_2, \ldots, c_K\}$, $c_k \in \mathbb{R}^d$. For each request vector $x_i$ in $E_s$, we identify its nearest centroid index $k_i^*$ via
$$k_i^* = \arg\min_k \| x_i - c_k \|_2 .$$
We then accumulate the residual into the matrix $V \in \mathbb{R}^{K \times d}$:
$$V[k_i^*] \leftarrow V[k_i^*] + (x_i - c_{k_i^*}) .$$
By row-wise flattening $V$, we obtain $v_{\mathrm{sard}} \in \mathbb{R}^{K \cdot d}$. Finally, we apply power normalization and $L_2$ normalization:
$$v \leftarrow \operatorname{sign}(v)\sqrt{|v|}, \qquad v_{\mathrm{sard}} \leftarrow \frac{v_{\mathrm{sard}}}{\| v_{\mathrm{sard}} \|_2} .$$
Algorithm 2 illustrates the complete SARD procedure.
Algorithm 2 Session embedding—SARD generation
Require: Set of request embeddings $\{E_s\}$, where $E_s \in \mathbb{R}^{n_s \times d}$; number of clusters $K$; pretrained KMeans model $C$.
Ensure: Fixed-dimensional session representation $v_{\mathrm{sard}} \in \mathbb{R}^{K \cdot d}$ for each session.
 1: for the matrix $E_s$ of each session $s$ do
 2:     Initialize the residual matrix $V \leftarrow \mathrm{zeros}(K, d)$
 3:     for each request vector $x \in E_s$ do
 4:         $k^* \leftarrow C.\mathrm{predict}(x)$
 5:         $V[k^*] \leftarrow V[k^*] + (x - C.\mathrm{cluster\_centers}[k^*])$
 6:     end for
 7:     Flatten $V$ to obtain $v_{\mathrm{sard}} \in \mathbb{R}^{K \cdot d}$
 8:     Apply power normalization and $L_2$ normalization
 9:     Save $v_{\mathrm{sard}}$ as the final representation of session $s$
10: end for
11: Return $\{v_{\mathrm{sard}}\}$
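For reference, a minimal NumPy/scikit-learn sketch of the SARD aggregation in Algorithm 2 is given below. It assumes a KMeans model already fitted on request embeddings pooled from the training sessions and is illustrative rather than the authors' implementation.

# Sketch of Algorithm 2 (SARD): hard-assign requests to centroids, accumulate
# residuals, then apply power and L2 normalization.
import numpy as np
from sklearn.cluster import KMeans


def sard_vector(E_s: np.ndarray, kmeans: KMeans) -> np.ndarray:
    """Aggregate an (n, d) request matrix into a fixed K*d SARD vector."""
    K, d = kmeans.cluster_centers_.shape
    V = np.zeros((K, d))

    assignments = kmeans.predict(E_s)            # nearest centroid per request
    for x, k in zip(E_s, assignments):
        V[k] += x - kmeans.cluster_centers_[k]   # accumulate the residual

    v = V.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))          # power normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v           # L2 normalization

The KMeans model (and hence the number of clusters K) would be fitted once on the pooled training-set embeddings before session vectors are computed.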

4.2.2. SFAR (Session-Level Fisher Aggregated Representation)

Unlike the hard assignment in SARD, SFAR leverages a Gaussian Mixture Model (GMM) to capture the underlying probability distribution of request embeddings. The GMM is parameterized by $\lambda = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, where $\pi_k$ denotes the prior probability of the $k$th component, and $\mu_k$ and $\Sigma_k$ are its mean and covariance (often diagonal). For each request vector $x_i$, we compute the posterior probability:
$$\gamma_k(x_i) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} .$$
We then calculate the first-order and second-order gradients with respect to the GMM parameters:
$$\mathcal{G}_{\mu_k} = \frac{1}{n_s \sqrt{\pi_k}} \sum_{i=1}^{n_s} \gamma_k(x_i) \, \frac{x_i - \mu_k}{\sigma_k}, \qquad \mathcal{G}_{\sigma_k} = \frac{1}{n_s \sqrt{2\pi_k}} \sum_{i=1}^{n_s} \gamma_k(x_i) \left[ \frac{(x_i - \mu_k)^2}{\sigma_k^2} - 1 \right] .$$
All these gradients are concatenated to form the Fisher vector $v_{\mathrm{sfar}} \in \mathbb{R}^{2K \cdot d}$, which is then power-normalized and $L_2$-normalized to enhance robustness and comparability. Algorithm 3 provides the complete pseudocode for the SFAR approach.
Algorithm 3 Session embedding—SFAR generation
Require: Set of request embeddings $\{E_s\}$, where $E_s \in \mathbb{R}^{n_s \times d}$; number of Gaussian components $K$; pretrained GMM model $G$.
Ensure: Fixed-dimensional session representation $v_{\mathrm{sfar}} \in \mathbb{R}^{2K \cdot d}$ for each session.
 1: for the matrix $E_s$ of each session $s$ do
 2:     Initialize the Fisher vector $v_{\mathrm{sfar}} \leftarrow \mathrm{zeros}(2K \cdot d)$
 3:     for each request vector $x \in E_s$ do
 4:         Compute posterior probabilities $\gamma_k(x)$ for all components $k$
 5:         Compute gradients $\mathcal{G}_{\mu_k}$ and $\mathcal{G}_{\sigma_k}$ and accumulate them in $v_{\mathrm{sfar}}$
 6:     end for
 7:     Apply power normalization and $L_2$ normalization
 8:     Save $v_{\mathrm{sfar}}$ as the final representation of session $s$
 9: end for
10: Return $\{v_{\mathrm{sfar}}\}$
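Analogously, a minimal sketch of the SFAR aggregation in Algorithm 3 follows. It assumes a diagonal-covariance GaussianMixture from scikit-learn fitted beforehand on training request embeddings; it is illustrative, not the authors' code.

# Sketch of Algorithm 3 (SFAR): Fisher-style gradients of request embeddings
# w.r.t. a diagonal-covariance GMM, then power and L2 normalization.
import numpy as np
from sklearn.mixture import GaussianMixture


def sfar_vector(E_s: np.ndarray, gmm: GaussianMixture) -> np.ndarray:
    """Aggregate an (n, d) request matrix into a fixed 2*K*d SFAR vector."""
    n, d = E_s.shape
    pi = gmm.weights_                   # (K,) mixture weights
    mu = gmm.means_                     # (K, d) component means
    sigma = np.sqrt(gmm.covariances_)   # (K, d) std devs (covariance_type="diag")
    gamma = gmm.predict_proba(E_s)      # (n, K) posterior responsibilities

    K = pi.shape[0]
    g_mu = np.zeros((K, d))
    g_sigma = np.zeros((K, d))
    for k in range(K):
        diff = (E_s - mu[k]) / sigma[k]                                  # (n, d)
        g_mu[k] = gamma[:, k] @ diff / (n * np.sqrt(pi[k]))
        g_sigma[k] = gamma[:, k] @ (diff ** 2 - 1) / (n * np.sqrt(2 * pi[k]))

    v = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    v = np.sign(v) * np.sqrt(np.abs(v))   # power normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v    # L2 normalization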

5. Experiments

5.1. Experiment Setup

In this study, we used a high-performance computing system with two NVIDIA RTX A6000 GPUs (48 GB memory each) and 80 virtual CPUs powered by an Intel Xeon Platinum 8383C processor (2.70 GHz). The experiments ran on Ubuntu 24.04.1, with Python 3.8.8 as the main programming language. Key Python libraries included Numpy 1.20.1, Pandas 1.2.4, XGBoost 2.1.2, Fasttext 0.9.2, and Scikit-Learn 0.24.1 [27]. This setup enhanced computational efficiency and improved model evaluation.

5.2. Datasets

Table 1 presents the basic statistics of the five-fold cross-validation dataset. We used the dataset from [8], which was established as a multimodal detection benchmark comprising session-level Web logs and mouse trajectory data. Its original architecture aimed to improve adversarial bot recognition by fusing spatiotemporal features. The Web logs recorded HTTP request types (GET/POST), resource categories (HTML/CSS/image files), status codes, and timestamps, which together captured visitors’ temporal behavioral patterns (e.g., request frequency, navigation paths). Meanwhile, the mouse trajectory data, collected through embedded frontend scripts, captured spatial interaction features (e.g., movement speed, click coordinates). Because mouse data collection is often limited in real-world scenarios, we focused on single-modal Web log analysis. We retained semantic features in request sequences (e.g., resource-type preferences, error-status distribution) and behavioral patterns (e.g., high-frequency access, abnormal jumps) to examine the viability of a lightweight detection framework.
In this study, we merged data from Phase 1 and Phase 2 and divided them into training and testing sets at a 7:3 ratio, consistent with the method used by [8] for comparison with their statistical feature representation approach, while ensuring that each category (human, advanced bot, and moderate bot) maintained the same proportion in both sets. As shown in the table, each fold of the merged dataset contained an average of 188.40 sessions, with the number of sessions per fold ranging from 186 to 190. The total number of requests per fold averaged 120,330.60, ranging from 117,270 to 124,466. Notably, each session contained an average of 638.64 requests, indicating highly intensive session interactions.
Figure 2 illustrates the distribution of requests per session across the five folds of the cross-validation dataset. The boxplots revealed relatively consistent request distribution patterns across folds, with median values of approximately 600 requests per session, indicating frequent and intensive session interactions. Notably, all folds contained outliers representing sessions with exceptionally high request counts, potentially indicating particularly complex interaction patterns or prolonged sessions. This consistency demonstrated the appropriateness of the data partitioning and indicated stability in session behavior patterns throughout the dataset. This characteristic is especially important for building robust detection models, as it ensures that training and testing data have similar statistical properties, thereby enhancing the model’s generalization capability.

5.3. Comparison Method

In this study, we compared the performance of the proposed Session2vec method with traditional session representation methods based on statistical features [8]. These statistical features are commonly used in Web robot detection tasks and can describe various aspects of a session, such as the number of requests, response status codes, and resource types requested. However, these methods heavily rely on manually defined statistical features, which may fail to capture the complex and dynamic behaviors of bots, especially when the bots mimic human interactions, where traditional methods perform poorly. In contrast, the Session2vec method, through its multi-instance learning framework and subword embeddings, better captures the temporal and behavioral dynamics of Web sessions, overcoming these limitations. The statistical features used for comparison are shown in Table 2.
These statistical features are frequently used in the existing literature for Web robot detection. However, such methods are often limited in their ability to generalize to complex and dynamic attack behaviors; in particular, they fail to model the temporal and spatiotemporal relationships between requests, which are crucial for detecting sophisticated bots. To address this, Session2vec combines a multi-instance learning framework with FastText subword embeddings and the SARD and SFAR aggregation algorithms (built on the miVLAD/miFV paradigm [16]) to aggregate session-level information into a fixed-length vector representation. This approach not only resolves inconsistencies in request counts but also captures the temporal dynamics of the session, providing a more robust and scalable solution for bot detection.
To evaluate the performance of Session2vec against the statistical feature-based methods, we applied several machine learning models: Random Forest [38], Decision Trees [39], MLP [40], XGBoost [41], and KNN [42]. The models were evaluated using two key performance metrics: accuracy and F1-score. These metrics allow a thorough evaluation of classification performance, accounting for both the overall accuracy and the balance between precision and recall, which is essential for imbalanced bot detection tasks. The results demonstrated that Session2vec outperformed traditional methods, providing superior detection accuracy and robustness against sophisticated camouflage attacks.
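As a hedged sketch of this comparison protocol (the paper does not report classifier hyperparameters, so library defaults are used, and variable names are illustrative), the session-level vectors produced by SARD or SFAR can be fed to the five classifiers as follows:

# Sketch of the classifier comparison: fit the five models on session vectors
# with a 70/30 stratified split (as in Section 5.2); hyperparameters are defaults.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier


def benchmark(session_vectors: np.ndarray, labels: np.ndarray, seed: int = 42) -> dict:
    """Return test-set accuracy for each classifier on the given session vectors."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        session_vectors, labels, test_size=0.3, stratify=labels, random_state=seed
    )
    models = {
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "MLP": MLPClassifier(max_iter=500, random_state=seed),
        "XGBoost": XGBClassifier(eval_metric="logloss", random_state=seed),
        "KNN": KNeighborsClassifier(),
    }
    return {name: model.fit(X_tr, y_tr).score(X_te, y_te) for name, model in models.items()}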

5.4. Performance Metrics

In our experiments, we evaluated the performance of our methods using four key metrics: accuracy (Acc), precision (P), recall (R), and the F1-score (F1). These metrics provided a comprehensive analysis of the model’s performance by assessing its correctness, completeness, and balance. Below are the detailed definitions and formulas for each metric:
  • True positives (TP): the number of instances correctly classified as positive by the model.
  • False positives (FP): the number of instances incorrectly classified as positive by the model.
  • False negatives (FN): the number of instances incorrectly classified as negative by the model.
  • True negatives (TN): the number of instances correctly classified as negative by the model.
Accuracy ($Acc$): Accuracy measures the overall correctness of the model by computing the ratio of correctly classified instances to the total number of instances. The formula is:
$$Acc = \frac{TP + TN}{TP + FP + FN + TN}$$
Precision (P): Precision quantifies the proportion of correctly predicted positive instances out of all predicted positive instances. The formula is:
$$P = \frac{TP}{TP + FP}$$
Recall (R): Recall measures the model’s ability to identify all actual positive instances. The formula is:
$$R = \frac{TP}{TP + FN}$$
F1-score (F1): The F1-score combines precision and recall into a single metric by computing their harmonic mean. It is particularly useful when dealing with imbalanced datasets. The formula is:
$$F1 = \frac{2 \cdot P \times R}{P + R}$$
By incorporating these metrics, we ensured a holistic evaluation of the model’s performance, balancing its accuracy, precision, and recall, while addressing potential class imbalances.
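For completeness, these metrics can be computed with scikit-learn as sketched below; the macro averaging over classes is an assumption, since the paper does not state the averaging scheme.

# Sketch of the evaluation metrics using scikit-learn; averaging scheme assumed.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def evaluate(y_true, y_pred) -> dict:
    """Compute Acc, P, R, and F1 for a set of predictions."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
    }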

5.5. Experiment Results

To evaluate the effectiveness of our proposed methods, we conducted extensive experiments comparing the performance of SARD and SFAR with baseline approaches across multiple classification algorithms.

5.5.1. Cross-Fold Validation Performance Analysis

To evaluate the stability and generalization capability of the proposed methods, we conducted a 5-fold cross-validation comparison among three approaches: the baseline method, SARD, and SFAR. For each fold, the performance was averaged over five commonly used classifiers: Random Forest, Decision Trees, MLP, XGBoost, and KNN. This averaging was designed to highlight the robustness and general applicability of the proposed methods across different learning algorithms. Figure 3 illustrates the performance of these three methods across different folds in terms of precision, recall, F1-score, and accuracy. It is evident from the figure that our proposed SFAR method consistently outperformed both the baseline and SARD methods across all evaluation metrics. In particular, in terms of precision, the SFAR method achieved an average improvement of approximately 24.6% compared to the baseline.
Table 3 provides a detailed breakdown of the performance metrics for each fold. The data revealed that the SFAR method achieved the best results (highlighted in bold) across all folds and all evaluation metrics. Notably, the SFAR method performed exceptionally well in fold 5, achieving a precision of 0.892, while the baseline method only reached 0.707. This result suggests that incorporating factor analysis for session representation enabled the more effective capture of latent structures in user behavior patterns, thereby enhancing anomaly detection performance. Moreover, even in fold 1, where the baseline method performed the poorest (precision of only 0.603), the SFAR method maintained a high precision value (0.829), further demonstrating the robustness of the proposed approach.

5.5.2. Overall Performance Comparison

Figure 4 illustrates a comprehensive comparison of classification performance—based on the average results from 5-fold cross validation—across five different models: Decision Tree, Random Forest, MLP, XGBoost, and KNN. For each model, three methods were benchmarked: the traditional baseline approach, our proposed SARD, and SFAR. The evaluation was carried out using four key metrics—precision, recall, F1-score, and accuracy—which are presented in separate panels to enable a multi-dimensional assessment of performance.
As evident from Figure 4, our proposed methods demonstrated significant improvements over the baseline in most scenarios. Particularly noteworthy is the performance enhancement observed in MLP and KNN models, where both SARD and SFAR substantially outperformed the baseline. The improvement was more pronounced with SFAR, which consistently achieved the highest scores across most models and metrics.
Table 4 provides the detailed numerical results corresponding to Figure 4, with the best performance for each model and metric highlighted in bold. The most remarkable improvements were observed with the MLP model, where SFAR enhanced the precision from a baseline of 0.130 to 0.833, representing an improvement of over 500%. For KNN, SFAR achieved a precision of 0.873 compared to the baseline’s 0.625, demonstrating a substantial enhancement of approximately 40%.
Even for models that performed relatively well with the baseline approach, such as XGBoost and Random Forest, our methods still achieved meaningful improvements. SFAR increased XGBoost's precision from 0.867 to 0.897 and Random Forest's from 0.835 to 0.895. Notably, SFAR outperformed both SARD and the baseline across all tested models and metrics; the narrowest margin was the Decision Tree model's precision, where SFAR (0.809) only slightly exceeded the baseline (0.806), while SFAR retained clearer advantages in recall, F1-score, and accuracy for that model.
These results demonstrate that our session-based feature extraction techniques, particularly the feature aggregation approach employed in SFAR, effectively capture the temporal and behavioral patterns in session data, leading to more accurate anomaly detection across various classification algorithms.

5.5.3. Performance Improvement Analysis

This section presents a quantitative analysis of the performance improvements achieved by the SARD and SFAR methods compared to the baseline approach. We employed Absolute Improvement ( A I ) as the evaluation metric, defined as the direct difference between the performance indicator values of the improved method and the baseline method, calculated as follows:
$$AI = P_{\mathrm{improved}} - P_{\mathrm{baseline}}$$
where $P_{\mathrm{improved}}$ represents the performance metric value of the improved method, and $P_{\mathrm{baseline}}$ represents the performance metric value of the baseline method. As shown in Table 5, regarding the F1-score, the SARD method demonstrated an improvement of 0.0846 over the baseline, while the SFAR method achieved an improvement of 0.1971, with the latter being approximately 2.33 times greater than the former. In terms of accuracy, SARD improved by 0.0521, whereas SFAR improved by 0.1627, with the latter again significantly outperforming the former (by approximately 3.12 times). These results indicate that while both methods effectively enhanced the performance of the baseline approach, the SFAR method exhibited more substantial advantages across all evaluation metrics. Notably, the improvement in F1-score (0.1971) achieved by the SFAR method exceeded its improvement in accuracy (0.1627), suggesting that this method demonstrated superior performance in balancing precision and recall. This characteristic is particularly significant for addressing class imbalance problems, as it can effectively maintain classification performance for both minority and majority classes simultaneously.

6. Conclusions

This paper pioneered a novel session representation learning framework from the perspective of multiple instance learning, effectively addressing the inherent limitations of traditional statistical methods in Web robot detection. We designed and implemented two innovative approaches: SARD and SFAR. These methods automatically extract latent semantic features from high-dimensional sparse session data, enabling precise representation of Web behaviors. Through comprehensive five-fold cross-validation experiments, we thoroughly evaluated the performance of the proposed methods under different data partitioning conditions. The experimental results demonstrated that the SFAR method significantly outperformed traditional baseline approaches across all evaluation metrics, particularly achieving an average improvement of 24.6% in precision. Notably, even on datasets where the baseline method underperformed, the SFAR method maintained high precision and stability, showcasing its exceptional robustness and generalization capability. While the SARD method showed more modest improvements, it consistently outperformed traditional methods across all evaluation metrics, validating the effectiveness of our proposed framework.

7. Future Research Directions

Building on the promising results of this study, several avenues for future work could further advance research in this area: (1) Expanding application domains: investigate the applicability of the proposed session representation framework to additional and more diverse datasets, potentially including other types of Web logs or heterogeneous data sources, to further validate the model’s robustness and generalizability. (2) Real-time detection and adaptation: develop real-time detection systems that leverage the proposed techniques and incorporate online learning strategies to adapt dynamically to evolving attack patterns and emerging camouflaged bot behaviors. (3) Scalability and efficiency: investigate computational optimization techniques to improve the scalability and efficiency of the framework for deployment in large-scale, real-world environments with massive Web traffic.

Author Contributions

Methodology, J.Z.; Validation, Z.W. and L.Y.; Resources, S.P.; Writing—original draft, J.Z.; Supervision, D.H. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Tang, D.; Dai, R.; Zuo, C.; Chen, J.; Li, K.; Qin, Z. A Low-rate DoS Attack Mitigation Scheme Based on Port and Traffic State in SDN. IEEE Trans. Comput. 2025, 74, 1758–1770. [Google Scholar] [CrossRef]
  2. Tang, D.; Liu, B.; Li, K.; Xiao, S.; Liang, W.; Zhang, J. PLUTO: A Robust LDoS Attack Defense System Executing at Line Speed. IEEE Trans. Dependable Secur. Comput. 2024, 1–18. [Google Scholar] [CrossRef]
  3. Tang, D.; Dai, R.; Yan, Y.; Li, K.; Liang, W.; Qin, Z. When sdn meets low-rate threats: A survey of attacks and countermeasures in programmable networks. ACM Comput. Surv. 2024, 57, 1–32. [Google Scholar] [CrossRef]
  4. Singh, K.; Singh, P.; Kumar, K. User behavior analytics-based classification of application layer HTTP-GET flood attacks. J. Netw. Comput. Appl. 2018, 112, 97–114. [Google Scholar] [CrossRef]
  5. Stevanovic, D.; Vlajic, N.; An, A. Detection of malicious and non-malicious website visitors using unsupervised neural network learning. Appl. Soft Comput. 2013, 13, 698–708. [Google Scholar] [CrossRef]
  6. Tsvetkova, M.; García-Gavilanes, R.; Floridi, L.; Yasseri, T. Even good bots fight: The case of Wikipedia. PLoS ONE 2017, 12, e0171774. [Google Scholar] [CrossRef] [PubMed]
  7. Jagat, R.R.; Sisodia, D.S.; Singh, P. Exploiting Web Content Semantic Features to Detect Web Robots from Weblogs. J. Netw. Comput. Appl. 2024, 230, 103975. [Google Scholar] [CrossRef]
  8. Iliou, C.; Kostoulas, T.; Tsikrika, T.; Katos, V.; Vrochidis, S. Detection of Advanced Web Bots by Combining Web Logs with Mouse Behavioural Biometrics. Digit. Threat. Res. Pract. 2021, 2, 1–26. [Google Scholar] [CrossRef]
  9. Chu, Z.; Gianvecchio, S.; Wang, H. Bot or human? A behavior-based online bot detection system. In From Database to Cyber Security: Essays Dedicated to Sushil Jajodia on the Occasion of His 70th Birthday; Springer: Cham, Switzerland, 2018; pp. 432–449. [Google Scholar]
  10. Rovetta, S.; Cabri, A.; Masulli, F.; Suchacka, G. Bot or not? A case study on bot recognition from web session logs. In Italian Workshop on Neural Nets; Springer: Cham, Switzerland, 2017; pp. 197–206. [Google Scholar]
  11. Jagat, R.R.; Sisodia, D.S.; Singh, P. Web-S4AE: A Semi-Supervised Stacked Sparse Autoencoder Model for Web Robot Detection. Neural Comput. Appl. 2023, 35, 17883–17898. [Google Scholar] [CrossRef]
  12. See, A.; Wingarz, T.; Radloff, M. Detecting Web Bots via Mouse Dynamics and Communication Metadata. In ICT Systems Security and Privacy Protection; Meyer, N., Grocholewska-Czuryło, A., Eds.; Springer Nature: Cham, Switzerland, 2024; Volume 679, pp. 73–86. [Google Scholar]
  13. Sirikonda, J.; Zabihimayvan, M. Evaluating Machine Learning Techniques for Web Robot Detection. J. Stud. Res. 2024, 13. [Google Scholar] [CrossRef]
  14. Ousat, B.; Shariatnasab, M.; Schafir, E.; Chaharsooghi, F.S.; Kharraz, A. In-Application Defense Against Evasive Web Scans through Behavioral Analysis. arXiv 2024, arXiv:2412.07005. [Google Scholar]
  15. Herrera, F.; Ventura, S.; Bello, R.; Cornelis, C.; Zafra, A.; Sánchez-Tarragó, D.; Vluymans, S.; Herrera, F.; Ventura, S.; Bello, R.; et al. Multiple Instance Learning; Springer: Cham, Switzerland, 2016. [Google Scholar]
  16. Wei, X.S.; Wu, J.; Zhou, Z.H. Scalable algorithms for multi-instance learning. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 975–987. [Google Scholar] [CrossRef]
  17. Reynolds, D.A. Gaussian mixture models. Encycl. Biom. 2009, 741, 3. [Google Scholar]
  18. Ge, W.; Wang, J. SeqMask: Behavior Extraction over Cyber Threat Intelligence via Multi-Instance Learning. Comput. J. 2024, 67, 253–273. [Google Scholar] [CrossRef]
  19. He, W.; Wang, Y. Text Representation and Classification Based on Multi-Instance Learning. In Proceedings of the 2009 International Conference on Management Science and Engineering, Moscow, Russia, 14–16 September 2009; pp. 34–39. [Google Scholar]
  20. Salton, G.; Wong, A.; Yang, C.S. A vector space model for automatic indexing. Commun. ACM 1975, 18, 613–620. [Google Scholar] [CrossRef]
  21. He, Z.; Chen, W.; Li, Z.; Zhang, M.; Zhang, W. SEE: Syntax-Aware Entity Embedding for Neural Relation Extraction. In Proceedings of the Aaai Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  22. Busta, M.; Neumann, L.; Matas, J. FASText: Efficient Unconstrained Scene Text Detector. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1206–1214. [Google Scholar]
  23. Lagopoulos, A.; Tsoumakas, G. Content-Aware Web Robot Detection. Appl. Intell. 2020, 50, 4017–4028. [Google Scholar] [CrossRef]
  24. Cabri, A.; Suchacka, G.; Rovetta, S.; Masulli, F. Online Web Bot Detection Using a Sequential Classification Approach. In Proceedings of the 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Exeter, UK, 28–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1536–1540. [Google Scholar]
  25. Suchacka, G.; Cabri, A.; Rovetta, S.; Masulli, F. Efficient On-the-Fly Web Bot Detection. Knowl.-Based Syst. 2021, 223, 107074. [Google Scholar] [CrossRef]
  26. Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
  27. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  28. Alam, S.; Dobbie, G.; Koh, Y.S.; Riddle, P. Web bots detection using particle swarm optimization based clustering. In Proceedings of the 2014 IEEE Congress on Evolutionary Computation (CEC), Beijing, China, 6–11 July 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 2955–2962. [Google Scholar]
  29. Wang, D.; Xi, L.; Zhang, H.; Liu, H.; Zhang, H.; Song, T. Web robot detection with semi-supervised learning method. In Proceedings of the 3rd International Conference on Material, Mechanical and Manufacturing Engineering (IC3ME 2015), Guangzhou, China, 27–28 June 2015; Atlantis Press: Dordrecht, The Netherlands, 2015; pp. 2123–2128. [Google Scholar]
  30. Iliou, C.; Kostoulas, T.; Tsikrika, T.; Katos, V.; Vrochidis, S.; Kompatsiaris, Y. Towards a framework for detecting advanced web bots. In Proceedings of the 14th International Conference on Availability, Reliability and Security, Canterbury, UK, 26–29 August 2019; pp. 1–10. [Google Scholar]
  31. Sisodia, D.S.; Verma, S.; Vyas, O.P. Agglomerative approach for identification and elimination of web robots from web server logs to extract knowledge about actual visitors. J. Data Anal. Inf. Process. 2015, 3, 1–10. [Google Scholar] [CrossRef]
  32. Stevanovic, D.; An, A.; Vlajic, N. Feature evaluation for web crawler detection with data mining techniques. Expert Syst. Appl. 2012, 39, 8707–8717. [Google Scholar] [CrossRef]
  33. Zabihimayvan, M.; Sadeghi, R.; Rude, H.N.; Doran, D. A soft computing approach for benign and malicious web robot detection. Expert Syst. Appl. 2017, 87, 129–140. [Google Scholar] [CrossRef]
  34. AlNoamany, Y.A.; Weigle, M.C.; Nelson, M.L. Access patterns for robots and humans in web archives. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, Indianapolis, IN, USA, 22–26 July 2013; pp. 339–348. [Google Scholar]
  35. Bai, Q.; Xiong, G.; Zhao, Y.; He, L. Analysis and detection of bogus behavior in web crawler measurement. Procedia Comput. Sci. 2014, 31, 1084–1091. [Google Scholar] [CrossRef]
  36. Doran, D.; Gokhale, S.S. An integrated method for real time and offline web robot detection. Expert Syst. 2016, 33, 592–606. [Google Scholar] [CrossRef]
  37. Rude, H.N.; Doran, D. Request type prediction for web robot and internet of things traffic. In Proceedings of the 2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 9–11 December 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 995–1000. [Google Scholar]
  38. Belgiu, M.; Drăguţ, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 2016, 114, 24–31. [Google Scholar] [CrossRef]
  39. Charbuty, B.; Abdulazeez, A. Classification based on decision tree algorithm for machine learning. J. Appl. Sci. Technol. Trends 2021, 2, 20–28. [Google Scholar] [CrossRef]
  40. Taud, H.; Mas, J. Multilayer perceptron (MLP). In Geomatic Approaches for Modeling Land Change Scenarios; Springer: Cham, Switzerland, 2018; pp. 451–455. [Google Scholar]
  41. Chen, T.; Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  42. Guo, G.; Wang, H.; Bell, D.; Bi, Y.; Greer, K. KNN model-based approach in classification. In Proceedings of the on the Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Proceedings, Catania, Sicily, Italy, 3–7 November 2003; Springer: Berlin/Heidelberg, Germany, 2003; pp. 986–996. [Google Scholar]
Figure 1. Web robot detection framework.
Figure 2. Distribution of requests per session by fold.
Figure 3. Performance comparison by fold.
Figure 4. Five-fold average classification performance.
Table 1. Session and request statistics for each fold in the dataset.

Fold       Total Sessions    Total Requests    Avg. Requests/Session
1          186               118,172           635.33
2          188               118,883           632.36
3          188               117,270           623.78
4          190               124,466           655.08
5          190               122,862           646.64
Average    188.40            120,330.60        638.64
Table 2. Statistical features for session representation [8].

Feature                        Short Description and Literature
session_id                     Unique identifier for each session, distinguishing different sessions.
user_id                        Unique identifier for the user, extracted from the training data.
total_requests                 Total number of HTTP requests made during the session [5,28,29,30,31,32,33].
total_bytes                    Total size of the transferred data during the session (in bytes) [5,10,24,28,30,33].
get_requests                   Number of HTTP GET requests made in the session [24,30,31,33,34,35].
post_requests                  Number of HTTP POST requests made in the session [24,30,31,33,35].
head_requests                  Number of HTTP HEAD requests made in the session [5,10,24,29,30,31,32,33,35].
%_http_3xx                     Proportion of requests with HTTP 3xx status codes (redirection) [24,30,33,34].
%_http_4xx                     Proportion of requests with HTTP 4xx status codes (client errors) [5,10,24,29,30,32,33,34].
%_image_requests               Proportion of image requests in the session [30,31,36,37].
%_css_requests                 Proportion of CSS requests in the session [30,37].
%_js_requests                  Proportion of JavaScript requests in the session [30,36,37].
html_to_image_ratio            Ratio of HTML requests to image requests [29,30,32,33].
depth_sd                       Standard deviation of URL path depth, reflecting the diversity in page access [5,30,32,33].
max_requests_per_page          Maximum number of requests to a single page.
avg_requests_per_page          Average number of requests to each page in the session [30].
max_consecutive_sequential     Maximum number of consecutive requests with a parent–child URL relationship [30,33].
%_consecutive_sequential       Proportion of consecutive requests with a parent–child URL relationship [5,29,30,32].
session_time                   Duration of the session from the first to the last request (in seconds) [10,28,30,31,33,34].
browsing_speed                 Rate of page requests per second during the session [30,34].
sd_inter_request_times         Standard deviation of the time intervals between consecutive requests [30,34].
Table 3. Performance comparison across folds averaged over Random Forest, Decision Trees, MLP, XGBoost, and KNN (the best results are in bold).

Fold    Metric       Baseline    SARD (Ours)    SFAR (Ours)
1       Precision    0.603       0.724          0.829
1       Recall       0.637       0.717          0.807
1       F1-score     0.596       0.717          0.810
1       Accuracy     0.637       0.717          0.807
2       Precision    0.687       0.731          0.873
2       Recall       0.728       0.724          0.872
2       F1-score     0.692       0.724          0.869
2       Accuracy     0.728       0.724          0.872
3       Precision    0.615       0.734          0.864
3       Recall       0.650       0.732          0.861
3       F1-score     0.616       0.728          0.860
3       Accuracy     0.650       0.732          0.861
4       Precision    0.652       0.737          0.847
4       Recall       0.695       0.737          0.842
4       F1-score     0.662       0.727          0.839
4       Accuracy     0.695       0.737          0.842
5       Precision    0.707       0.820          0.892
5       Recall       0.747       0.807          0.888
5       F1-score     0.715       0.808          0.888
5       Accuracy     0.747       0.807          0.888
Table 4. Performance comparison of different methods across various models (the best results are in bold).

Model            Method         Precision    Recall    F1-Score    Accuracy
Decision Tree    Baseline       0.806        0.799     0.798       0.799
Decision Tree    SARD (ours)    0.788        0.786     0.782       0.786
Decision Tree    SFAR (ours)    0.809        0.803     0.800       0.803
Random Forest    Baseline       0.835        0.827     0.826       0.827
Random Forest    SARD (ours)    0.851        0.844     0.842       0.844
Random Forest    SFAR (ours)    0.895        0.886     0.887       0.886
MLP              Baseline       0.130        0.361     0.192       0.361
MLP              SARD (ours)    0.586        0.576     0.571       0.576
MLP              SFAR (ours)    0.833        0.820     0.818       0.820
XGBoost          Baseline       0.867        0.862     0.862       0.862
XGBoost          SARD (ours)    0.837        0.830     0.830       0.830
XGBoost          SFAR (ours)    0.897        0.893     0.893       0.893
KNN              Baseline       0.625        0.608     0.604       0.608
KNN              SARD (ours)    0.683        0.681     0.679       0.681
KNN              SFAR (ours)    0.873        0.868     0.868       0.868
Table 5. Performance improvement of SARD and SFAR methods compared to baseline.

Method    F1-Score Improvement    Accuracy Improvement
SARD      0.0846                  0.0521
SFAR      0.1971                  0.1627
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
