1. Introduction
In the context of deepening digital transformation, modern server systems are required to handle massive and diverse service requests, and their request processing efficiency directly determines service quality and user experience. For server systems, accurately diagnosing whether specific types of service requests have abnormal bottlenecks, and further analyzing their root causes from the host resource layer and network transmission layer has become a core requirement for ensuring stable system operation. Processing delays or failures of critical service requests can lead to service process interruptions, decreased operational efficiency, and even significant economic losses. Therefore, there is an urgent need for a reliable diagnosis method to identify such abnormal bottlenecks and provide targeted optimization strategies.
Existing performance analysis methods still have significant limitations in addressing this need. Although log-based analysis is widely used, it is limited by the heterogeneity of system log formats, which makes parsing complex and hurts generality [1,2]; traditional KPI threshold monitoring can only reflect macroscopic system states (e.g., high CPU utilization), but it can neither pinpoint bottlenecks related to specific service requests nor distinguish environmental noise from actual performance degradation [3]. Furthermore, many active probing methods rely on sending probe packets, which may interfere with normal system operation and introduce security risks. These deficiencies highlight the urgent need for methods that are non-intrusive, service-oriented, and able to support comparative analysis.
To address the above challenges, this paper proposes a diagnosis method for server systems, namely Cross-Environment Server Diagnosis with Fusion (CSDF). The core objectives of CSDF are as follows: (1) detecting abnormal bottlenecks in service requests in a replay environment; (2) analyzing the causes of bottlenecks from the host resource layer.
This CSDF method is based on two types of fundamental data: packets captured at the network card and Key Performance Indicator (KPI) metrics collected via Application Performance Management (APM) systems. It unfolds in two key steps. First, a traffic replay tool reproduces real application requests, originally captured at network cards in the production environment, in a controlled replay environment at a 1:1 ratio, accurately preserving the original traffic patterns; by comparing performance differences between the two environments, abnormal behavior of service requests in the replay environment is precisely identified. Second, combining KPI metrics collected from APM systems (such as CPU utilization and memory usage), the Random Forest algorithm is employed to build a correlation model between KPI metrics and performance bottlenecks, identifying potential bottlenecks at the host resource layer.
The advantages of this method are threefold:
Non-intrusiveness and security: it adopts a passive measurement approach, analyzing existing network packets without sending active probes or modifying system code and thus minimizing interference with the running system.
High universality: unlike log analysis, which is fragmented by format issues, CSDF uses data packets that adhere to unified protocols (e.g., TCP/HTTP) and can adapt to various heterogeneous systems.
Efficient comparative analysis: 1:1 request reproduction in the replay environment effectively isolates performance deviations from environmental noise, while automated data processing and machine learning modeling significantly reduce manual analysis burden.
Despite its significant advantages, the implementation of CSDF still faces two core challenges: first, how to precisely extract effective metrics closely related to performance from massive and complex network packets, and on this basis, achieve accurate correlation of identical service requests across different environments; second, how to identify abnormal points in service requests in the replay environment and further discover the correlation between network metrics and system KPI multi-dimensional metrics. To address these issues, this paper designs a modular data processing flow for CSDF and combines a machine learning-driven correlation analysis model to build a complete solution, laying a solid foundation for the practical implementation of CSDF.
The main contributions of this paper include the following: (1) Constructing a non-intrusive general diagnostic framework, CSDF, which performs analysis by passively capturing network traffic without intruding into the system’s internals or modifying code. It adapts to server systems of different architectures, addressing the poor versatility and strong invasiveness of traditional methods. (2) Developing a host-network layer correlation analysis model for CSDF based on the random forest algorithm to quantify the correlation between KPI metrics and performance bottlenecks. This breaks through the limitations of single-dimensional analysis and achieves multi-dimensional traceability of bottleneck causes. (3) Designing a 1:1 precise replay comparison mechanism for CSDF, which, through request-level mapping between the production and replay environments, provides a stable baseline for locating service request anomalies, improving the accuracy and interpretability of performance difference identification.
CSDF has been validated in a production system of China Tower. By analyzing 102,964 real HTTP requests and their replay results, abnormal bottlenecks for specific service requests were successfully identified. The random forest model in CSDF revealed a strong correlation between key KPIs and bottlenecks. This method has now been deployed and applied in China Tower’s server environment, with promising prospects for wider adoption in the diagnosis and optimization of more complex server systems.
The remainder of this paper is organized as follows.
Section 2 reviews related work.
Section 3 presents the proposed CSDF methodology.
Section 4 introduces the experimental setup and results.
Section 5 discusses the findings and limitations.
Section 6 concludes the paper and outlines future work.
2. Related Work
The core challenge in server system performance diagnosis lies in precisely identifying abnormal bottlenecks in service requests and analyzing their causes. Previous research has explored this from multiple dimensions, including data collection, analysis methods, and validation mechanisms. However, there is still room for improvement in terms of non-intrusiveness, multi-source data fusion, and the depth of comparative analysis.
Early performance diagnosis work relied mainly on two mainstream methods, which long played important roles in identifying and analyzing system performance problems and promoted the adoption of related technologies.
One category is agent-based monitoring tools such as performance probes for specific programming languages, which collect runtime data by embedding code within the application. This approach can obtain fine-grained metrics, giving it an irreplaceable advantage in precisely locating internal application performance bottlenecks and providing critical data support for developers to deeply understand application runtime status. However, this method is somewhat intrusive; the agent program may have potential conflicts with application logic, and in high-concurrency scenarios, its own resource consumption can compromise system performance, biasing the measurement outcomes [1].
Another category is the log-based analysis method, which relies on preset log recording statements and locates problems by parsing logs. Early approaches utilized machine learning techniques like PCA to identify abnormal execution flows from parsed log data [4]. Subsequent research has advanced this area through log clustering [5], temporal analysis [6], and, more recently, deep learning models. For instance, DeepLog pioneered the use of LSTMs to model sequential log patterns [7], while other works have explored various neural architectures for detecting both sequential and quantitative anomalies [8,9,10]. However, these methods remain fundamentally reliant on the quality and completeness of application logs.
These limitations have driven research towards “non-intrusive data collection,” with APM system KPI metrics becoming core data sources. Host-layer metrics (CPU utilization, memory utilization, number of pending connections, etc.) collected via APM systems are widely used to reflect resource constraints [11]. For example, KPIRoot+ identifies system-level bottlenecks by analyzing causal relationships between KPIs [12], but host metrics alone cannot be correlated with specific service requests, making it difficult to distinguish transaction-specific resource exhaustion from systemic resource saturation. Therefore, fusing network-layer service characteristics with host-layer resource metrics becomes an inevitable requirement for precisely locating abnormal bottlenecks in service requests.
Facing the high-dimensional nonlinear relationships of multi-source data, machine learning algorithms show strong modeling capabilities [13]. Systems like Eadro [14] and InstantOps [15] leverage graph neural networks and multi-task learning to build a holistic model of a microservice application for anomaly detection and root cause localization. These approaches represent the cutting edge in AIOps and have demonstrated remarkable effectiveness. However, these studies mainly rely on host metrics from a single environment; they lack cross-environment comparative validation, making it difficult to isolate the interference of environmental noise on the model [16].
To address this problem, traffic replay has been introduced into performance analysis. This method enables comparative analysis between independent execution contexts under identical input configurations. Industry tools such as GoReplay can capture production traffic and replay it for load testing and performance verification. However, existing research on traffic replay often remains at the level of “reproducing load” and fails to integrate network traffic characteristics with host metrics for refined analysis.
In summary, existing research remains fragmented in terms of non-intrusive data collection, multi-source fusion analysis, and cross-environment comparative validation. The method proposed in this paper addresses these limitations by achieving precise comparison through traffic replay, fusing network and host data, and using random forest modeling to build an end-to-end service bottleneck diagnosis framework.
3. Method
This paper proposes a non-intrusive server system performance diagnosis method based on comparative analysis of network packets from a production environment and a replay environment. Through a complete process of “capture-replay-parsing-correlation-analysis,” it achieves precise identification of abnormal service request bottlenecks and analysis of their causes. The flowchart of this process is shown in Figure 1. The overall method is based on passively collected network traffic packets and APM system KPI metrics, sequentially going through four core stages: capture and replay, packet parsing and cross-environment correlation alignment, feature comparison for anomaly identification, and multi-dimensional bottleneck cause analysis for the host resource layer, thus building a full-link diagnostic system from data collection to problem localization and optimization [17].
3.1. Collection and Replay: Building a Comparative Analysis Baseline
To achieve “zero-interference” performance diagnosis for service systems, a combination of passive network traffic collection and precise traffic replay is adopted, as shown in Step 1 of Figure 2, to build the basis for comparison between the production and replay environments [18]:
On the production environment side, by enabling promiscuous mode on key server network cards, tcpdump is used to passively capture all raw network packets flowing through the specified network cards, storing them as industry-standard PCAP files. This process strictly adheres to the non-intrusive principle, does not interfere with normal system service processes, and completely preserves packet headers (such as IP addresses, port numbers, and TCP flags) and payloads, providing full data support for subsequent analysis.
On the replay environment side, with the aid of the traffic replay tool TCPCopy, the real application requests stored in the PCAP files captured in the production environment are reproduced at a 1:1 request frequency and timing in a controlled replay environment. Subjecting both environments to identical inputs minimizes external noise and establishes a stable baseline for detecting bottlenecks.
3.2. Packet Parsing and Cross-Environment Alignment
The collected PCAP files contain unstructured raw network traffic and need to be converted into structured data for analysis through multi-stage parsing and alignment:
3.2.1. Protocol-Aware Intelligent Parsing
We developed a protocol-aware parsing engine to perform layered protocol parsing on data packets in PCAP files:
Transport Layer (TCP): extract Sequence and Acknowledgment numbers, identify TCP connection establishment (SYN, SYN-ACK, ACK) and data transfer phases (e.g., data segments, retransmission flags).
Application Layer (HTTP): parse request-response content in depth based on status-code rules. For HTTP traffic, associate TCP three-way handshake sequences and ACK acknowledgment logic, and match GET/POST request messages with their corresponding status-code responses, achieving precise application-layer request-response binding.
To systematically organize the extracted information, the key fields from each layer are categorized and summarized in
Table 1.
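To make the parsing step concrete, the sketch below shows how such layered extraction might be implemented in Python with the scapy library; the record layout and the lightweight HTTP request/response check are illustrative assumptions, not the exact parsing engine used in CSDF.

```python
from scapy.all import rdpcap, IP, TCP, Raw

def parse_pcap(path):
    """Extract per-packet TCP and HTTP fields from a PCAP file (illustrative sketch)."""
    records = []
    for pkt in rdpcap(path):                       # loads the whole PCAP into memory
        if not (IP in pkt and TCP in pkt):
            continue
        tcp = pkt[TCP]
        rec = {
            "ts": float(pkt.time),                 # capture timestamp
            "src": pkt[IP].src, "dst": pkt[IP].dst,
            "sport": tcp.sport, "dport": tcp.dport,
            "seq": tcp.seq, "ack": tcp.ack,
            "flags": str(tcp.flags),               # e.g. 'S', 'SA', 'A' for handshake phases
            "http": None,
        }
        # Application layer: a lightweight check for HTTP request/response lines
        if Raw in pkt:
            first_line = bytes(pkt[Raw].load).split(b"\r\n", 1)[0]
            if first_line.startswith((b"GET ", b"POST ", b"PUT ")):
                rec["http"] = ("request", first_line.decode(errors="replace"))
            elif first_line.startswith(b"HTTP/"):
                rec["http"] = ("response", first_line.decode(errors="replace"))
        records.append(rec)
    return records
```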
3.2.2. Request-Response Pair Association
The precise association of request-response pairs is the foundation for subsequent analysis, with its core being the use of network communication identification features to pair data packets.
Extract source IP, destination IP, source port, destination port, TCP sequence number/acknowledgment number, and application layer URL from pre-processed packets to generate unique index keys for request and response packets as shown in
Figure 3.
By matching index keys, pair request and response packets belonging to the same communication session, and extract request timestamp, response timestamp, application request path, packet length, and other information to calculate processing latency (difference between response timestamp and request timestamp), storing it in a structured database. For packets that fail to pair, record their characteristic information for subsequent investigation. The pairing flowchart is shown in
Figure 4.
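A minimal sketch of this pairing logic, assuming the records produced by the parsing stage above, is given below. The index key (connection 5-tuple plus request line/URL) follows the description, while the field names and the simplification of matching responses by reversed connection direction rather than exact sequence/acknowledgment numbers are assumptions.

```python
def pair_requests_responses(records):
    """Pair HTTP requests with responses on the same connection and compute latency (sketch)."""
    pending = {}                 # index key -> earliest unanswered request
    pairs, unmatched = [], []
    for rec in sorted(records, key=lambda r: r["ts"]):
        if rec["http"] is None:
            continue
        kind, line = rec["http"]
        if kind == "request":
            # Key: request direction 5-tuple plus request line (method + URL)
            key = (rec["src"], rec["sport"], rec["dst"], rec["dport"], line)
            pending.setdefault(key, rec)
        else:
            # A response travels in the opposite direction of its request
            for key, req in list(pending.items()):
                src, sport, dst, dport, _ = key
                if (rec["src"], rec["sport"], rec["dst"], rec["dport"]) == (dst, dport, src, sport):
                    pairs.append({
                        "url": key[4],
                        "request_ts": req["ts"],
                        "response_ts": rec["ts"],
                        "latency": rec["ts"] - req["ts"],   # processing latency
                    })
                    del pending[key]
                    break
            else:
                unmatched.append(rec)               # response with no matching request
    unmatched.extend(pending.values())              # requests that never got a response
    return pairs, unmatched
```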
3.2.3. Cross-Environment Alignment
When evaluating performance differences between a production environment and a replay environment, traditional direct comparison methods based on TCP sequence numbers or timestamps face challenges, especially when the replay link introduces asynchronous characteristics such as using TCPcopy for Layer 2 replay, where the order of requests may be disrupted, making one-to-one packet-level or flow-level matching unreliable. To overcome this problem, this paper proposes a cross-environment request alignment algorithm based on HTTP semantic content hashing, to achieve precise request pair matching, thereby accurately calculating and comparing end-to-end latencies in different environments.
The core of the alignment method lies in “semantic” hashing HTTP requests. The uniqueness of an HTTP request is determined by its key components. This paper defines the semantic hash value of an HTTP request as the combined hash of the following elements.
HTTP Method: GET, POST, PUT.
Request URI: Includes path and query string, but excludes protocol, domain name, and port, such as /api/getAllUserTodoData?sysId=-510&userId=zhanglz3&index=0.
Normalized Headers: Select a few key, usually unchanging, request headers for hashing, such as Host, User-Agent, Content-Type.
Request Body: For requests with a body like POST or PUT, hash their content. For requests without a body like GET, this part is empty.
Concatenate the standardized elements above into a string, and then compute its cryptographic hash value such as SHA-256. This hash strategy can distinguish requests with different content while remaining robust to minor changes in the transport layer or non-critical HTTP headers.
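The semantic hash can be computed in a few lines of Python; the set of normalized headers follows the text above, while the exact concatenation format and the helper name are illustrative assumptions.

```python
import hashlib

NORMALIZED_HEADERS = ("host", "user-agent", "content-type")

def semantic_hash(method, uri, headers, body=b""):
    """SHA-256 over the semantic components of an HTTP request (illustrative sketch).

    method  : e.g. "GET", "POST"
    uri     : path + query string, without scheme, domain, or port
    headers : dict of header name -> value
    body    : raw request body bytes (empty for body-less requests such as GET)
    """
    headers = {k.lower(): v for k, v in headers.items()}   # normalize header names
    parts = [method.upper(), uri]
    parts += [f"{h}:{headers.get(h, '').strip()}" for h in NORMALIZED_HEADERS]
    digest = hashlib.sha256()
    digest.update("\n".join(parts).encode("utf-8"))
    digest.update(b"\n")
    digest.update(body)
    return digest.hexdigest()

# Example (values taken from the URI shown above):
# semantic_hash("GET", "/api/getAllUserTodoData?sysId=-510&userId=zhanglz3&index=0",
#               {"Host": "example.internal", "User-Agent": "Mozilla/5.0"})
```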
Based on the semantic hash value of each HTTP request and its recorded request start timestamp in the PCAP, this paper adopts a queue-matching strategy, with the specific steps given in Algorithm 1:
Algorithm 1: Queue-matching strategy
Input: ProdRequestInfos, ReplayRequestInfos. Output: MatchedPairs.
Initialize Hash Indexes: Create hash indexes for both production and replay environments. These indexes use the hash value as the key, and a list of all requests corresponding to that hash value (sorted by timestamp) as the value. A request object includes its original timestamp and original data pointer.
Precise Alignment: Iterate through each hash value in the production environment. For each hash value, sequentially retrieve requests from the replay environment’s hash value queue in chronological order. Each successful retrieval of a pair is considered a strictly matched request pair.
Deduplication and Remnant Handling: During the matching process, matched requests are removed from their respective hash indexes. Requests that exist in the production environment but have no corresponding hash value (or an insufficient count) in the replay environment, as well as requests that exist only in the replay environment, are marked as “unmatched requests,” potentially indicating request loss or extra generation.
Through this method, even if the request order is disrupted, as long as the HTTP semantic content of the requests remains consistent, we can accurately align them.
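The three steps of Algorithm 1 can be sketched as follows; the request representation (a dict with 'hash' and 'ts' keys) and the FIFO deque-per-hash data structure are assumptions made for illustration.

```python
from collections import defaultdict, deque

def match_by_semantic_hash(prod_requests, replay_requests):
    """Queue-matching strategy of Algorithm 1 (illustrative sketch).

    Each request is a dict with keys 'hash' and 'ts' plus a pointer to its raw data.
    Returns matched (prod, replay) pairs and the unmatched remainders of each side.
    """
    def build_index(requests):
        index = defaultdict(deque)
        for req in sorted(requests, key=lambda r: r["ts"]):   # chronological queues
            index[req["hash"]].append(req)
        return index

    prod_index = build_index(prod_requests)
    replay_index = build_index(replay_requests)

    matched, unmatched_prod = [], []
    for h, prod_queue in prod_index.items():
        replay_queue = replay_index.get(h, deque())
        while prod_queue and replay_queue:        # strict FIFO pairing per hash value
            matched.append((prod_queue.popleft(), replay_queue.popleft()))
        unmatched_prod.extend(prod_queue)         # no counterpart left in the replay side
    unmatched_replay = [r for q in replay_index.values() for r in q]
    return matched, unmatched_prod, unmatched_replay
```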
3.3. Feature Comparison for Anomaly Detection
On the basis of the aligned latency data from the production and replay environments, a nonlinear threshold method is introduced for data filtering and anomaly marking [19,20]; it precisely identifies high-value latency anomalies in the replay environment relative to the production system while avoiding false positives caused by minor fluctuations in low-latency data points. The core of this method lies in its adaptiveness: the acceptable upper latency limit for the replay environment is adjusted dynamically based on the production system’s latency.
In system performance monitoring, simply using a fixed ratio, such as 1.5 times the production latency, as a threshold can be too lenient in high-latency scenarios and too strict in low-latency scenarios. Especially when latency is close to zero, a tiny absolute latency difference can lead to a huge relative ratio, causing a large number of false positives. To solve this problem, a nonlinear threshold function is designed to achieve the following:
Allow larger fluctuations at low latencies: when production environment latency is very low, the replay environment is allowed to have a relatively large latency multiple difference without being marked as abnormal.
Converge to a near-linear relationship at high latencies: as production environment latency increases, the influence of the nonlinear compensation term gradually diminishes, and the threshold curve approaches a relatively fixed proportional relationship.
Precisely identify true performance degradation: Distinguish between normal environment fluctuations and scenarios where the replay environment’s performance truly experiences an anomaly.
Our anomaly detection logic is based on a nonlinear latency threshold function T(x), parameterized by a scale factor C1 and a shape factor C2, where x is the production environment latency and T(x) is the acceptable upper latency limit for the replay environment.
Its key mathematical properties are elucidated as follows:
Low latency range (x → 0): the threshold is approximately linearly proportional to the production latency x, but with a multiple determined by the parameters C1 and C2. For example, with the experimental parameters, the initial slope of the threshold can be as high as 7 (i.e., as x → 0, the replay latency is allowed to be up to 7 times the production latency), effectively tolerating normal fluctuations in the low-latency range.
High latency range (x → ∞): at high latencies, the threshold approaches 2 times the production latency, matching the design goal of linearly decaying anomaly tolerance.
Medium latency transition range: The threshold function’s derivative decreases monotonically, indicating that the detection logic gradually increases sensitivity to higher latency anomalies while remaining robust to small fluctuations.
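As an illustration of how the detection logic operates, the sketch below implements one plausible threshold function with the stated properties (a linear 2x term plus a saturating compensation term governed by C1 and C2). This specific expression, and the default parameters taken from the grid-search optimum reported later in Table 3, are assumptions for illustration rather than the paper’s exact formula.

```python
import math

def replay_latency_threshold(x, c1=0.07, c2=0.007):
    """Acceptable upper latency limit for the replay environment given production latency x.

    Assumed form (illustration only): T(x) = 2*x + c1 * (1 - exp(-x / c2)).
    - Near x = 0 the compensation term dominates, tolerating large relative fluctuations.
    - For large x the term saturates at c1, so T(x) approaches roughly 2x (linear regime).
    - dT/dx = 2 + (c1/c2) * exp(-x/c2) decreases monotonically, matching the stated properties.
    """
    return 2.0 * x + c1 * (1.0 - math.exp(-x / c2))

def is_anomalous(prod_latency, replay_latency, c1=0.07, c2=0.007):
    """Mark a replayed request as anomalous if its latency exceeds the nonlinear threshold."""
    return replay_latency > replay_latency_threshold(prod_latency, c1, c2)
```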
3.4. Anomaly Data Attribution
To analyze the performance bottlenecks of anomalous service requests in the replay environment from the host resource layer [21], this paper employs a dual-pronged approach: first, utilizing the Pearson correlation coefficient to identify linear relationships between individual KPI metrics and processing latency, particularly highlighting direct impacts during anomaly spikes; and second, leveraging the random forest algorithm to construct a comprehensive correlation model between KPI metrics and processing latency. By quantitatively analyzing the impact of resource status on service performance through these complementary methods, precise attribution is achieved. This process is based on host resource data collected via the APM system and the processing latency of anomalous requests, forming a complete analysis chain through data correlation, model training, and result interpretation.
In the data preparation stage, KPI metrics [22] and anomalous request processing latencies are aligned by timestamp using a 5 min time window. The selection of a 5 min aggregation window reflects the common practice in enterprise operational APM systems, where data is typically exported at 5 min intervals. This granularity aligns with the practical data availability and operational constraints found in many real-world APM deployments, ensuring that our analysis is grounded in representative scenarios.
For the $k$-th time window, KPI metrics such as CPU utilization, memory usage, disk I/O wait time, and the number of pending connections are extracted as feature variables, forming a feature vector $\mathbf{x}_k = (x_{k,1}, x_{k,2}, \ldots, x_{k,m})$; simultaneously, the average processing latency of the anomalous requests within the window is calculated as the target variable $y_k$:

$$y_k = \frac{1}{N_k} \sum_{i=1}^{N_k} \left( t_i^{\mathrm{resp}} - t_i^{\mathrm{req}} \right),$$

where $N_k$ is the number of anomalous requests within the window, and $t_i^{\mathrm{req}}$ and $t_i^{\mathrm{resp}}$ are the time of the first byte of the request and the time of the first byte of the response for the $i$-th anomalous request, respectively. To enhance the model’s ability to capture resource trends, key features are selected using mutual information, retaining metrics strongly correlated with processing latency (mutual information value greater than 0.1).
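Assuming the KPI samples and anomalous-request latencies are available as timestamp-indexed pandas structures, the 5 min window alignment, target computation, and mutual-information filtering could be sketched as follows; the column names and data layout are hypothetical.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def build_window_dataset(kpi_df, anomaly_df, window="5min", mi_threshold=0.1):
    """Align KPI metrics and anomalous-request latencies on 5-minute windows (sketch).

    kpi_df     : DataFrame of KPI metrics with a datetime index (CPU, memory, I/O wait, ...)
    anomaly_df : DataFrame with datetime column 'ts' and per-request 'latency'
    Returns (X, y): per-window feature matrix and mean anomalous latency y_k.
    """
    X = kpi_df.resample(window).mean()                                  # window-averaged KPIs
    y = anomaly_df.set_index("ts")["latency"].resample(window).mean()   # target y_k
    data = X.join(y.rename("y"), how="inner").dropna()
    X, y = data.drop(columns="y"), data["y"]
    # Keep only KPIs whose mutual information with the target exceeds the threshold
    mi = mutual_info_regression(X, y)
    selected = X.columns[mi > mi_threshold]
    return X[selected], y
```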
3.4.1. Pearson Correlation Analysis
Following data collection and alignment, an initial assessment of the linear relationship between each KPI metric and the anomalous request processing latency is conducted using the Pearson correlation coefficient ($r$). The Pearson correlation coefficient gauges the strength and direction of a linear relationship between two variables, $X$ (a KPI) and $Y$ (latency), and it is calculated as follows:

$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
A higher absolute value of r indicates a stronger linear correlation. Pearson correlation is particularly effective at identifying distinct performance changes caused by the increase in a single anomalous metric, offering an intuitive measure of linear dependency. This analysis helps to pinpoint KPIs that show a direct, linear escalation with processing latency, especially during periods where a specific resource becomes overtly constrained.
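For completeness, the per-KPI Pearson screening can be computed directly with scipy; this small sketch assumes the windowed feature matrix X and target y from the previous step.

```python
from scipy.stats import pearsonr

def kpi_pearson_table(X, y):
    """Pearson correlation coefficient (and p-value) between each KPI column and latency y."""
    rows = []
    for col in X.columns:
        r, p = pearsonr(X[col], y)
        rows.append((col, r, p))
    # Sort by absolute correlation strength, strongest first
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)
```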
3.4.2. Random Forest Feature Importance Analysis
The model construction stage employs the random forest algorithm, which integrates multiple decision trees to reduce the risk of overfitting. Unlike linear correlation, random forest can capture complex non-linear relationships and interactions among multiple features, providing a holistic view of bottleneck attribution. Each decision tree is independently trained on a bootstrapped sample of the training set, with random selection of a subset of features at node splitting, using variance reduction as the criterion:
$$\Delta \mathrm{Var}(S, k, t) = \mathrm{Var}(S) - \frac{|S_L|}{|S|}\,\mathrm{Var}(S_L) - \frac{|S_R|}{|S|}\,\mathrm{Var}(S_R),$$

where $S$ is the current node’s sample set, $k$ is the candidate feature, $t$ is the splitting threshold, $S_L$ and $S_R$ are the left and right subsets after splitting, and $\mathrm{Var}(\cdot)$ denotes the variance of the sample latencies. The dataset is divided into a 70% training set and a 30% test set in chronological order. Model performance is evaluated using the mean squared error (MSE), mean absolute error (MAE), and coefficient of determination ($R^2$):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \quad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \quad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$

where $n$ is the number of samples, $y_i$ is the actual latency, $\hat{y}_i$ is the predicted latency, and $\bar{y}$ is the mean actual latency.
After model training, the importance of each KPI for anomalous latency is quantified through feature importance, calculated as follows:

$$\mathrm{Imp}(k) = \frac{1}{T}\sum_{t=1}^{T} \mathrm{Imp}_{h_t}(k),$$

where $T$ is the total number of decision trees, $h_t$ is the $t$-th tree, and $\mathrm{Imp}_{h_t}(k)$ is the total weighted variance reduction contributed by splits on feature $k$ within $h_t$. A higher feature importance score indicates a greater impact of that KPI on abnormal performance. The bottleneck is located by combining feature importance with KPI values during anomalous periods: if CPU utilization has the highest importance and its value continuously exceeds 80% during anomalous windows, then CPU resource overload is inferred as the primary bottleneck; if memory usage importance is significant and accompanied by a surge in page swap rates, it points to insufficient memory.
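A minimal scikit-learn sketch of the training, evaluation, and importance-ranking steps described above is shown below; the hyperparameters (e.g., 200 trees) are illustrative choices, not values reported by the paper.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def attribute_bottlenecks(X, y, train_ratio=0.7):
    """Train a random forest on chronologically ordered windows and rank KPI importance."""
    split = int(len(X) * train_ratio)                 # chronological 70/30 split, no shuffling
    X_train, X_test = X.iloc[:split], X.iloc[split:]
    y_train, y_test = y.iloc[:split], y.iloc[split:]

    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    metrics = {
        "MSE": mean_squared_error(y_test, y_pred),
        "MAE": mean_absolute_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred),
    }
    # Impurity-based importance, sorted from most to least influential KPI
    importance = sorted(zip(X.columns, model.feature_importances_),
                        key=lambda kv: kv[1], reverse=True)
    return model, metrics, importance
```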
4. Experiment Validation
4.1. Experiment Settings
To validate our proposed methodology, experiments were conducted in controlled environments: a production environment and a replay environment, both configured to handle identical HTTP requests. Both environments utilized an Inspur CS5280H server (Inspur Group Co., Ltd., Jinan, China) equipped with a Hygon C86 7265 CPU (Hygon Information Technology Co., Ltd., Tianjin, China) (4 × 24 cores) and 512GB DDR4 memory. The production environment was connected via a 10Gbps network and deployed within China Tower (China Tower Co., Ltd., Beijing, China)’s internal service system. Both servers ran CentOS 7.5.1804. A detailed summary of the hardware and software specifications for both environments is provided in
Table 2.
During the data collection phase, tcpdump (v4.9.2) was used to capture a full day’s worth of real HTTP request traffic from China Tower’s production environment, totaling 102,964 requests. This dataset, originating from an actual running workload, was subsequently replayed in the mirrored environment, resulting in 102,964 effective transactions.
In addition to network traffic data, real-world resource usage metrics were also collected from China Tower’s real-time environment, including CPU utilization, memory consumption, disk I/O wait time, and database response latency, among other host-level metrics. These metrics feed into China Tower’s result analysis and precise adaptation platform (Figure 5), enabling cross-domain correlation analysis between environment-level performance indicators and network-level behavior. Therefore, this experimental setup allows for robust and repeatable evaluation of the environment’s performance under loads directly obtained from China Tower’s service infrastructure, closely resembling the production environment without altering the original application logic.
4.2. Anomaly Detection Validation
To validate the effectiveness of the proposed nonlinear latency threshold in anomaly detection for the replay environment, we conducted experiments on a real-world dataset. The method involves comparing latency in a production environment (Production_Time_since_request) against that in a replay environment.
To determine the optimal parameters for the nonlinear latency threshold curve, a crucial step involved grid search optimization. This systematic exploration of a predefined parameter space allowed us to identify the combination of C1 (scale factor) and C2 (shape factor) that yielded the best performance metrics, specifically F1-score, precision, and recall, against a dataset with manually annotated anomalies. The results of the top-performing parameter combinations are presented in
Table 3.
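The grid search can be sketched as follows, scoring each (C1, C2) candidate against the manually annotated anomalies; the parameter grids and the assumed threshold form (the same illustrative expression used in the Section 3.3 sketch) are ours, not the paper’s.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def grid_search_threshold(prod_latency, replay_latency, manual_labels,
                          c1_grid=np.arange(0.01, 0.21, 0.01),
                          c2_grid=np.arange(0.001, 0.021, 0.001)):
    """Search (C1, C2) maximizing F1 against manually annotated anomalies (sketch).

    prod_latency, replay_latency : numpy arrays of per-request latencies
    manual_labels                : boolean array, True where annotators marked an anomaly
    """
    results = []
    for c1 in c1_grid:
        for c2 in c2_grid:
            # Assumed threshold form: T(x) = 2x + c1 * (1 - exp(-x / c2))
            predicted = replay_latency > (2.0 * prod_latency
                                          + c1 * (1.0 - np.exp(-prod_latency / c2)))
            p, r, f1, _ = precision_recall_fscore_support(
                manual_labels, predicted, average="binary", zero_division=0)
            results.append({"C1": round(float(c1), 3), "C2": round(float(c2), 3),
                            "precision": p, "recall": r, "F1": f1})
    return sorted(results, key=lambda row: row["F1"], reverse=True)
```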
After detection by the nonlinear latency threshold algorithm, a scatter plot is generated for each selected path.
As shown in
Table 3, the optimal parameters C1 = 0.07 and C2 = 0.007 (Rank 1) achieved the highest F1-Score of 82.93%, indicating a robust balance between precision and recall in identifying anomalies. The subsequent analysis will primarily refer to the behavior of the nonlinear threshold curve defined by these optimal parameters.
With
Figure 6 (Path: /workflow/insertBillWorkFlowItem Latency Comparison) as an example, the experimental results clearly demonstrate the superiority of the optimized nonlinear threshold method in several aspects, particularly when visually compared against the simple y = x line and other suboptimal parameter combinations (e.g., Rank 12 and Rank 26 in the figure legend):
Effective Avoidance of Low-Latency Misjudgments: When the production latency (Production_Time_since_request) is extremely low (e.g., from 0.000 s to 0.003 s), the optimized nonlinear anomaly threshold (Rank 1, teal line) allows for a larger absolute difference between replay and production latencies while still classifying data as normal (blue dots). For instance, at a production latency of 0.001 s, even if the replay environment latency reaches 0.005 s to 0.01 s, these points are correctly identified as normal due to the non-linearly upward curving threshold. This successfully addresses the challenge where traditional fixed-multiple or fixed-difference thresholds frequently lead to false positives for normal fluctuations in low-latency scenarios. The dense distribution of blue dots below the Rank 1 threshold line, even those with higher absolute replay latencies, confirms this effectiveness. The optimal curve clearly outperforms the more restrictive Rank 12 curve or the y = x line in this region, which would incorrectly flag many normal fluctuations as anomalies.
Precise Identification of High-Value Anomalies: The algorithm precisely marks data as anomalous (orange dots) when the ratio of replay environment latency to production environment latency exceeds the reasonable range defined by the threshold. Furthermore, the plot includes “Manual Anomaly” points (red triangles), representing ground truth anomalies. The Rank 1 optimized threshold demonstrates high accuracy in encompassing these manual anomalies, signifying its capability to detect critical performance issues. For example, when the production latency is around 0.004 s, if the replay latency reaches 0.04 s or even 0.055 s, these points are accurately classified above the Rank 1 threshold line and identified as high-value anomalies. This indicates that true high-latency anomalies causing performance issues are effectively captured after excluding normal fluctuations, and confirmed by the manual annotations.
Adaptability and superiority of the optimized threshold curve: The shape of the optimal nonlinear threshold curve (Rank 1, teal line), with its higher slope in the low-latency segment and its gradual flattening in the high-latency segment, strongly aligns with empirical observations of latency distribution in actual system operations. This optimal tuning, achieved through grid search, allows the curve to flexibly adapt to the replay environment’s tolerance levels under different production latencies, providing a more contextually appropriate judgment criterion. Compared to a simple y = x (Equal Latency) reference line, the optimized nonlinear threshold offers significantly greater discriminative flexibility and practical significance. Moreover,
Figure 6 vividly illustrates the superiority of the Rank 1 optimal threshold over suboptimal curves like Rank 12 (gray line) and Rank 26 (brown line). The Rank 1 threshold provides a better balance, capturing more true anomalies (aligning with manual anomalies and efficiently identifying auto-detected ones) without excessive false positives, as corroborated by its highest F1-Score in
Table 3. The suboptimal curves either miss apparent anomalies or are too aggressive, leading to lower overall performance.
The experimental results, bolstered by robust parameter optimization through grid search, fully demonstrate the effectiveness and superiority of the nonlinear threshold-based anomaly detection method in a production-replay dual-environment latency comparative analysis. This method successfully solves the problem of misjudgment caused by fluctuations in low-latency data while ensuring the high-precision identification of true high-value anomaly points that are also validated through manual annotations. This provides powerful tool support for real-time monitoring of performance, rapid localization, and resolution of potential performance bottlenecks in replay environments.
4.3. Attribution Verification
To validate the effectiveness of the proposed anomaly attribution method, particularly its capability in identifying bottlenecks at the host resource layer, we conducted a quantitative analysis using both the Pearson correlation coefficient and the feature importance derived from the random forest model. This dual approach provides complementary insights: Pearson correlation identifies direct linear relationships, which are insightful for pinpointing immediate impacts from individual KPI surges, while random forest feature importance quantifies the overall contribution of each KPI within a complex, non-linear system.
4.3.1. Pearson Correlation Analysis
The Pearson correlation coefficients were calculated between each KPI metric and the anomalous request processing delay based on historical anomalous data. The results, as shown in
Table 4, provide a direct measure of linear association.
The interpretation of Pearson correlation values generally follows these guidelines:
|r| ≤ 0.3: weak correlation (often considered insignificant in practical terms).
0.3 < |r| ≤ 0.5: low correlation (requires further sample-size-dependent significance testing).
0.5 < |r| ≤ 0.8: moderate correlation (possesses practical application value).
|r| > 0.8: high correlation (may indicate multicollinearity if multiple variables share high correlation).
As is evident from
Table 4, “1-Minute Average Load” exhibits a remarkably high Pearson correlation coefficient of 0.877. This strongly indicates a robust linear relationship where increasing system load directly and proportionally leads to increased anomalous request processing delays. This finding intuitively aligns with the expectation that system-wide load is a primary driver of performance degradation. Other metrics, like “Closed Connections” (0.338), show a low to moderate correlation, suggesting some linear influence, while “Pending Connections” (0.267) and “Memory Utilization” (0.265) fall into the weak to low correlation range, indicating a less direct linear relationship during the observed periods. Many other KPIs, such as “CPU Utilization” (0.082) and “Filesystem Utilization” (0.080), show very weak linear correlations.
It is crucial to note that Pearson correlation coefficients, while intuitive for direct linear relationships, can often be small for individual KPIs during normal operations when multiple factors interact. However, when a specific metric experiences an anomalous surge and directly causes a performance bottleneck, its Pearson coefficient with latency will typically become significantly large and visually evident, clearly pointing to that particular bottleneck. This makes Pearson correlation valuable for identifying sudden, distinct anomalies within single metrics.
4.3.2. Random Forest Feature Importance Analysis
While Pearson correlation provides insights into individual linear relationships, the random forest model offers a comprehensive view of how multiple KPIs collectively contribute to anomalous processing delays, considering complex non-linear interactions. The model’s performance was evaluated using the mean squared error (MSE), which yielded a value of 0.096558. This low MSE indicates good predictive accuracy for anomalous request processing delay, thereby providing a reliable foundation for the subsequent feature importance analysis.
As clearly depicted in
Figure 7, “1-Minute Average Load” stands out as the most significant KPI in the random forest model, obtaining a feature importance value of 0.6538, which is substantially higher than all other KPIs. This observation strongly validates the finding that the overall system load is the predominant factor contributing to anomalous service request processing delays. Under high load conditions, multiple resources such as CPU, memory, and I/O can simultaneously experience bottlenecks, thereby significantly impeding the request processing speed. This finding aligns strongly with the conventional understanding of host resource bottlenecks, as a high load is frequently a comprehensive indicator of overall performance degradation.
Following closely is “Pending Connections,” with an importance value measured at 0.2379. This suggests that, in our specific scenario, congestion in the connection queues at either the service or network layer is the second most critical factor contributing to anomalous delays. This may indicate issues such as insufficient concurrent processing capacity of application services, database connection pool exhaustion, or network traffic spikes. It also clearly demonstrates the model’s effectiveness in identifying bottlenecks within specific operational contexts.
Notably, common utilization metrics like “CPU Utilization” and “Memory Utilization” exhibited relatively low feature importance in this random forest analysis (0.000030 and 0.006939, respectively). This observation does not imply the unimportance of these metrics; instead, it may suggest that, under the current service load and host configuration, the mere utilization of CPU and memory is not the most direct or prominent bottleneck leading to anomalous processing delays in the context of the overall system behavior. For instance, even with low CPU utilization, a high average load (driven by I/O waits or context switches) can still lead to service processing delays due to numerous processes waiting in run queues. Similarly, memory-related issues might manifest more through significant changes in page swap rates (which could be captured by other metrics like “Swap Space Utilization” in the Pearson analysis or implicitly through load in the RF model) than in mere utilization percentages. This highlights the random forest model’s ability to automatically identify the most explanatory key indicators for the target variable (processing delay) from a multitude of related KPIs, thereby avoiding misjudgments caused by a reliance on single-indicator biases.
In conclusion, the combined analysis of Pearson correlation and random forest feature importance provides a robust and comprehensive understanding of anomaly attribution. Pearson correlation intuitively identifies KPIs that show strong direct linear impacts, especially when a single metric is driving an acute anomaly. Complementing this, the KPI feature importance distribution presented in
Figure 7 not only intuitively quantifies the overall correlation strength between various performance metrics and anomalous processing delay but also provides strong empirical evidence supporting the effectiveness of the proposed random forest-based bottleneck identification method. It demonstrates the model’s capability to effectively differentiate and highlight the key resource KPIs that most significantly impact service performance within a complex system, thus establishing a robust data foundation for accurate anomaly attribution by integrating the instantaneous states of these highly important metrics.
5. Discussion
This study proposes a performance diagnosis method based on network traffic parsing and cross-environment comparison, which, through the collaborative design of non-intrusive data collection, precise traffic replay, and machine learning attribution, provides a practical solution for server bottleneck identification. Going beyond the overview of the CSDF framework in the introduction, this section delves into the uniqueness of CSDF, its practical deployment value, and its remaining challenges.
5.1. Core Value of CSDF
The innovativeness of CSDF is primarily reflected in the combination of non-intrusive design and environmental isolation. Unlike traditional solutions relying on log parsing or agent implantation, CSDF achieves “zero-interference” with service systems by passively capturing packets from network cards and natively collecting APM metrics, aligning with recent advances in non-intrusive performance analysis for softwarized networks [23]. This design demonstrates significant advantages in China Tower’s production environment: data collection can be completed without suspending services, and it is compatible with traffic characteristics of heterogeneous systems, especially work-order systems.
Secondly, the 1:1 traffic replay and hash-indexed alignment mechanism address the pain point of “difficulty in establishing baselines” in performance analysis. While the introduction briefly described the necessity of replay, its actual value lies in the following: successfully isolating environmental noise (such as sudden network fluctuations) by reproducing production request flows in a controlled environment, combined with multi-layer alignment based on URL, port, and temporal constraints. Experimental data show that, even for high-frequency repeated requests, alignment accuracy can still be maintained above 96.8%, laying the foundation for quantitative analysis of latency differences.
Finally, the attribution model achieves a precise correlation between “service anomalies and resource bottlenecks.” Beyond the preliminary introduction to the model’s functionality, its deeper value lies in quantitatively linking macroscopic metrics such as CPU utilization and memory usage with microscopic single-service latency through feature importance quantification (e.g., the correlation coefficient of 1-Minute Average Load reached 0.87), filling the gap in root-cause identification.
5.2. Practical Limitations and Optimization Directions
In practical deployment, the limitations of CSDF gradually become apparent. First, the efficiency bottleneck of large-scale traffic processing: when the PCAP file size exceeds 100 GB, the time consumed by protocol parsing and alignment algorithms increases significantly (about 40% more than for 2 GB files), requiring the introduction of distributed computing frameworks for optimization. Second, the hardware adaptability of the replay environment: although the introduction emphasized the importance of environmental mirroring, differences in CPU were found to cause latency deviations for some compute-intensive requests, necessitating the inclusion of detailed parameters in hardware configuration guidelines.
Furthermore, the model’s temporal sensitivity still has room for improvement. The current KPI sampling interval (5 min) makes it difficult to capture instantaneous bottlenecks (such as memory leaks within 1 min). In the future, adaptive sampling rates combined with sliding windows can be explored to balance accuracy and overhead.
5.3. Security Perspective
While CSDF is primarily designed for performance diagnosis, our cross-environment traffic replay and fine-grained packet analysis also offer potential for detecting traffic-based or physical-layer side-channel attacks in wireless networks. Recent studies demonstrate that even encrypted wireless traffic may leak application-level information: RF energy-harvesting attacks can eavesdrop mobile app activities with high accuracy by exploiting ambient Wi-Fi signals [24], and packet-level open-world app fingerprinting can infer user actions despite encryption and multiplexing challenges [25]. These findings highlight that the detailed packet timing and flow features captured via CSDF could be extended to identify such covert information leaks or to benchmark countermeasures. Integrating side-channel detection modules into the CSDF pipeline could thus provide a unified framework for both performance bottleneck analysis and early warning of privacy threats.
5.4. Differentiation from Existing Technologies
Compared to related work mentioned in the introduction, the uniqueness of CSDF lies in the following:
Compared to pure network traffic analysis tools (e.g., Wireshark), which are typically used for post-hoc inspection or security-oriented anomaly detection [26], CSDF adds cross-environment comparison and resource attribution capabilities.
Compared to APM systems (e.g., KPIRoot+), it achieves refined analysis at the service request level, rather than only focusing on system-level metrics.
Compared to deep learning-driven anomaly detection models (e.g., DeepLog), it avoids reliance on large amounts of labeled data, making it easier to implement in industrial scenarios.
6. Conclusions
This study constructed a complete technical system for server system performance bottleneck diagnosis, from data collection to bottleneck localization. The main conclusions are as follows:
A non-intrusive cross-environment comparison framework was proposed. By synergistically collecting network card captures and APM metrics, combined with 1:1 traffic replay, it quantifies performance differences between production and test environments, addressing issues of service interference or missing baselines in traditional methods.
A hash-indexed request alignment algorithm was designed. Through multi-layer constraints of URL, port, and dynamic temporal thresholds, it solves the matching problem of high-frequency repeated requests, achieving a cross-environment request alignment accuracy exceeding 96.8%.
A random forest attribution model was built. It associates service latency with KPIs such as CPU and memory and quantifies the influence weight of each resource metric on performance, providing data-driven evidence for bottleneck optimization (e.g., strong correlation between database response time and latency).
Experimental validation shows that CSDF can effectively identify performance anomalies in critical service operations, such as billing interfaces, within China Tower’s production system, with an average response time reduction of 32% after optimization. Future work will focus on traffic segmentation processing for distributed systems and dynamic compensation algorithms for hardware characteristics, further enhancing the method’s universality and accuracy.
The value of this study lies not only in providing a technical solution but also in establishing a correlation paradigm of “network behavior-resource status-service performance,” offering a new analytical perspective for performance optimization of complex server systems.
Author Contributions
Conceptualization, R.G.; methodology, Y.H. and X.L.; software, Y.H.; validation, Y.H. and X.L.; formal analysis, Y.H.; investigation, Z.Z., J.Z. and M.W.; resources, Z.Z., J.Z., and M.W.; data curation, Z.Z., J.Z., and M.W.; writing—original draft preparation, Y.H. and X.L.; writing—review and editing, R.G.; visualization, Y.H.; supervision, R.G.; project administration, R.G. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Ethical review and approval were waived for this study due to the nature of the research involving performance diagnosis of server systems based on collected network traffic and APM metrics, which does not involve human subjects or animals and uses anonymized data, thus not requiring specific ethical approval.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data analyzed in this study were collected from China Tower’s internal production system and are not publicly available due to proprietary and privacy reasons. Requests for data access can be made to the corresponding author, subject to approval and applicable data sharing agreements.
Acknowledgments
The authors would like to thank China Tower Corporation Limited for providing the necessary experimental environment and data for this research.
Conflicts of Interest
Authors Zilang Zhang, Jialun Zhao and Mengyuan Wang were employed by the company China Tower Corporation Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
APM | Application Performance Management |
CSDF | Cross-environment Server Diagnosis with Fusion |
KPI | Key Performance Indicator |
CPU | Central Processing Unit |
RAM | Random Access Memory |
I/O | Input/Output |
HTTP | Hypertext Transfer Protocol |
TCP | Transmission Control Protocol |
RTT | Round-Trip Time |
MSE | Mean Squared Error |
MAE | Mean Absolute Error |
R2 | Coefficient of Determination |
PCAP | Packet Capture |
URI | Uniform Resource Identifier |
SHA-256 | Secure Hash Algorithm 256-bit |
References
- Le, V.H.; Zhang, H. Log-based Anomaly Detection Without Log Parsing. In Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Melbourne, Australia, 15–19 November 2021; pp. 492–504. [Google Scholar]
- Wang, J.; Zhao, C.; He, S.; Gu, Y.; Alfarraj, O.; Abugabah, A. LogUAD: Log Unsupervised Anomaly Detection Based on Word2Vec. Comput. Syst. Sci. Eng. 2022, 41, 1207–1222. [Google Scholar] [CrossRef]
- Takei, T.; Horita, H. Analysis of Business Processes with Automatic Detection of KPI Thresholds and Process Discovery Based on Trace Variants. Res. Briefs Inf. Commun. Technol. Evol. 2023, 9, 59–76. [Google Scholar] [CrossRef]
- Fu, Q.; Lou, J.G.; Wang, Y.; Li, J. Execution Anomaly Detection in Distributed Systems through Unstructured Log Analysis. In Proceedings of the Ninth IEEE International Conference on Data Mining, Miami, FL, USA, 6–9 December 2009; pp. 149–158. [Google Scholar]
- Lin, Q.; Zhang, H.; Lou, J.G.; Zhang, Y.; Chen, X. Log Clustering Based Problem Identification for Online Service Systems. In Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C), Austin, TX, USA, 14–22 May 2016; pp. 102–111. [Google Scholar]
- Tak, B.; Park, S.; Kudva, P. Priolog: Mining Important Logs via Temporal Analysis and Prioritization. Sustainability 2019, 11, 6306. [Google Scholar] [CrossRef]
- Du, M.; Li, F.; Zheng, G.; Srikumar, V. DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, Dallas, TX, USA, 30 October–3 November 2017. [Google Scholar]
- Meng, W.; Liu, Y.; Zhu, Y.; Zhang, S.; Pei, D.; Liu, Y.; Chen, Y.; Zhang, R.; Tao, S.; Sun, P.; et al. LogAnomaly: Unsupervised Detection of Sequential and Quantitative Anomalies in Unstructured Logs. arXiv 2019, arXiv:1903.03765. [Google Scholar]
- Chen, Z.; Liu, J.; Gu, W.; Su, Y.; Lyu, M.R. Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection. arXiv 2021, arXiv:2107.05908. [Google Scholar]
- Zhao, X.; Guo, K.; Huang, M.; Qiu, S.; Lu, L. ELFA-Log: Cross-System Log Anomaly Detection via Enhanced Pseudo-Labeling and Feature Alignment. Computers 2025, 14, 272. [Google Scholar] [CrossRef]
- Nedelkoski, S.; Bogatinovski, J.; Mandapati, A.K.; Becker, S.; Cardoso, J.; Kao, O. Multi-source Distributed System Data for AI-Powered Analytics. arXiv 2020, arXiv:2009.11313. [Google Scholar]
- Gu, W.; Zhong, R.; Yu, G.; Sun, X.; Liu, J.; Huo, Y.; Chen, Z.; Zhang, J.; Gu, J.; Yang, Y.; et al. KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems. arXiv 2025, arXiv:2506.04569. [Google Scholar]
- Xu, J.; Ma, X.; Liu, J.; Zhang, C.; Li, H.; Zhou, X.; Wang, Q. Automatically Identifying Imperfections and Attacks in Practical Quantum Key Distribution Systems via Machine Learning. Sci. China Inf. Sci. 2024, 67, 202501. [Google Scholar] [CrossRef]
- Lee, C.; Yang, T.; Chen, Z.; Su, Y.; Lyu, M.R. Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data. In Proceedings of the 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), Melbourne, Australia, 14–20 May 2023; pp. 1750–1762. [Google Scholar]
- Rouf, R.; Rasolroveicy, M.; Litoiu, M.; Nagar, S.; Mohapatra, P.; Gupta, P.; Watts, I. InstantOps: A Joint Approach to System Failure Prediction and Root Cause Identification in Microservices Cloud-Native Applications. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering, London, UK, 7–11 May 2024. [Google Scholar]
- Nguyen, M.H.; Huynh, T.T.; Nguyen, T.T.; Nguyen, P.L.; Pham, H.T.; Jo, J.; Nguyen, T.T. On-device Diagnostic Recommendation with Heterogeneous Federated BlockNets. Sci. China Inf. Sci. 2025, 68, 140102. [Google Scholar] [CrossRef]
- Xing, P.; Zhang, D.; Tang, J.; Li, Z. A Recover-then-discriminate Framework for Robust Anomaly Detection. Sci. China Inf. Sci. 2025, 68, 142102. [Google Scholar] [CrossRef]
- Wang, Q.; Pan, Z.; Liu, N. An Ensemble and Cost-sensitive Learning-based Root Cause Diagnosis Scheme for Wireless Networks with Spatially Imbalanced User Data Distribution. Sci. China Inf. Sci. 2024, 67, 179301. [Google Scholar] [CrossRef]
- Zeng, Y.; Zhang, F.; Chen, T.; Wang, C. Deterministic Learning-based Neural Output-feedback Control for a Class of Nonlinear Sampled-data Systems. Sci. China Inf. Sci. 2024, 67, 192202. [Google Scholar] [CrossRef]
- Wang, Z.; He, W.; Sun, J.; Wang, G. Time-delay Effects on the Dynamical Behavior of Switched Nonlinear Time-delay Systems. Sci. China Inf. Sci. 2025, 68, 179201. [Google Scholar] [CrossRef]
- Duan, J.; Ji, F.; Li, Y.; Sun, H.; He, R. Trustworthy Forgery Detection with Causal Inference. Sci. China Inf. Sci. 2025, 68, 160108. [Google Scholar] [CrossRef]
- Sui, T.; Tao, X.; Wu, H.; Zhang, X.; Xu, J.; Nan, G. Mining KPI Correlations for Non-parametric Anomaly Diagnosis in Wireless Networks. Sci. China Inf. Sci. 2023, 66, 162301. [Google Scholar] [CrossRef]
- Liu, Q.; Lin, J.; Zhang, T.; Linguaglossa, L. DRST: A Non-Intrusive Framework for Performance Analysis in Softwarized Networks. arXiv 2025, arXiv:2506.17658. [Google Scholar]
- Ni, T.; Lan, G.; Wang, J.; Zhao, Q.; Xu, W. Eavesdropping Mobile App Activity via Radio-Frequency Energy Harvesting. In Proceedings of the 32nd USENIX Security Symposium (USENIX Security 2023), Anaheim, CA, USA, 9–11 August 2023; USENIX Association: Anaheim, CA, USA, 2023; pp. 3511–3528. Available online: https://www.usenix.org/conference/usenixsecurity23 (accessed on 16 September 2025).
- Li, J.; Wu, S.; Zhou, H.; Luo, X.; Wang, T.; Liu, Y.; Ma, X. Packet-level Open-world App Fingerprinting on Wireless Traffic. In Proceedings of the 2022 Network and Distributed System Security Symposium (NDSS’22), San Diego, CA, USA, 24–28 April 2022; The Internet Society: San Diego, CA, USA, 2022. Available online: https://www.ndss-symposium.org/ndss2022/ (accessed on 16 September 2025).
- Dodiya, B.; Singh, U.K. Malicious Traffic Analysis Using Wireshark by Collection of Indicators of Compromise. Int. J. Comput. Appl. 2022, 183, 1–6. [Google Scholar] [CrossRef]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).