1. Introduction
Distributed databases underpin modern application domains such as cloud-native workloads, the Internet of Things (IoT), and enterprise systems that must scale while maintaining high availability. This scale, however, introduces significant monitoring and security challenges. Legacy security mechanisms based on access control and audit logs cannot identify structural abnormalities or insider attacks. Graph theory, and specifically the metric dimension, provides a theoretical basis for analyzing the structure of clients in distributed systems, which aids in identifying possible data leaks or intrusions.
Recent work in graph theory and network analysis highlights structural features shared by many classes of complex systems, such as the data shards and distributed APIs found in cloud platforms [1]. These architectures are increasingly common, but they also enlarge the attack surface, making lightweight detection of infection signatures through traditional avenues difficult [2]. Existing security solutions are limited to perimeter-style observation or generic traffic-level detection, and cannot identify structural anomalies in large-scale distributed systems [3].
Existing anomaly detection methods for distributed systems rely on centralized or complete-graph analysis, making them expensive in computation and communication as the system grows [4]. They also tend to assume static topologies, an assumption that rarely holds for real-world databases, which are dynamic in nature. Furthermore, the metric dimension has been characterized only for certain graph classes [5,6] and does not scale to large or dense networks because of its high computational cost. Structural ambiguity further lowers the precision of anomaly localization and correct malfunction detection [7].
Although machine learning models achieve high accuracy for anomaly detection, they typically ignore system topology and cannot track the propagation of anomalies among interconnected components. Like graph-based attack-graph approaches, which regard the graph as static, these models do not perform well on dynamic databases. Moreover, disclosing or computing structural signatures without proper safeguards exposes privacy risks [8].
The metric dimension, by contrast, is a stronger notion: a resolving set, a minimum set of landmarks, uniquely determines each node by its distance vector [9]. Such an approach scales to large networks, preserving its discriminative power as the system grows [10]. Prior studies have applied the metric dimension in combinatorial optimization and real-time monitoring, showing its potential for large-scale anomaly detection [11], introduced failure-aware metric dimensions to address resilience in security-critical environments [12], and proposed approximate versions that balance accuracy with efficiency [13].
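As a concrete illustration of the resolving-set notion (an illustrative sketch of ours, not part of the framework; the graph and helper names are our own), a candidate landmark set can be verified directly from BFS distance vectors:

```python
from collections import deque

def bfs_distances(adj, src):
    # single-source BFS distances on an unweighted graph (adjacency dict)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_resolving_set(adj, landmarks):
    # landmarks resolve the graph iff every node's distance vector is unique
    dists = [bfs_distances(adj, l) for l in landmarks]
    seen = set()
    for v in adj:
        vec = tuple(d.get(v, float("inf")) for d in dists)
        if vec in seen:
            return False
        seen.add(vec)
    return True

# 4-cycle 0-1-2-3-0: a single landmark is ambiguous, two suffice
c4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
```

On the 4-cycle, `is_resolving_set(c4, [0])` is False because nodes 1 and 3 share the distance vector (1,), while `is_resolving_set(c4, [0, 1])` is True, so two landmarks resolve the cycle.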
Distance-based analysis has been used for forensic attribution and anomaly detection on dynamic graphs in cyber-attacks [14], and attack-graph-based monitoring is useful for detecting suspicious states [15]. Nonetheless, these methods increase the privacy threat in distributed databases, where sensitive data and services are continuously updated and an attacker can exploit both structural [16] and temporal [17] information. We present a privacy-preserving, low-overhead, scalable anomaly detection system for distributed databases and microservices, based on metric dimension theory and machine learning. It enables real-time intrusion detection, fine-grained localization, and robustness against privacy violations.
Theoretically, the metric dimension is a promising solution, but three factors complicate its applicability to real-world distributed databases: (1) computing it on large graphs is NP-hard, (2) topologies change over time, and (3) user privacy is at stake, since re-identification attacks are common. This paper therefore proposes the Graph Metric Dimension-based Anomaly Detection (GMD-AD) framework to tackle these challenges with four specific contributions:
- 1.
Sequential Metric Dimension Algorithm
Our incremental algorithm updates the resolving set in O(Δn log n) time per edge/node change (Δ is the maximum degree), avoiding the O(n³) cost of full recomputation and allowing real-time adjustment to dynamic distributed database topologies.
- 2.
Parallel Distance Computation and ML-Tuned Anomaly Scoring
Leveraging parallelized breadth-first search (BFS) from resolving-set landmarks and feeding the resulting anomaly scores into gradient boosting models enables GMD-AD to achieve sub-second localization latency even for graphs with n > 10,000 nodes. We show a 60% latency improvement over full-graph methods in our experiments.
- 3.
k-Metric Anti-Dimension for Privacy
We combine it with k-metric anti-dimension theory [18] to give quantifiable (k, ℓ)-anonymity, which guarantees that a node cannot be distinguished from at least k − 1 others within distance ℓ of it. This yields a 40 percentage-point drop in re-identification success rate with little effect on detection accuracy (F1 > 0.99).
- 4.
Hybrid GNN-Ensemble Architecture
Unlike full-graph GNNs, which compute embeddings over all nodes, GMD-AD computes graph neural network embeddings only over resolving-set subgraphs and classifies them with a gradient boosting classifier (CatBoost, XGBoost). This hybrid approach reduces latency by 50–70% compared with standalone GNNs while retaining robustness to noise in the graph structure.
GMD-AD is validated against two representative testbeds: 1. MongoDB Sharded Cluster (9 nodes): a NoSQL distributed database with realistic workloads and injected anomalies (e.g., unauthorized replication, data exfiltration, lateral movement); 2. SockShop Microservices Benchmark (13 services): a standard cloud-native application with HTTP/REST communication, scaled to 128–5120 virtual nodes. We demonstrate the following experimental results: 60% reduced latency (1200 ms → 480 ms) for 128-node anomaly localization; high detection accuracy (F1-score > 0.997, AUC-ROC > 0.999), outperforming the nearest competing baselines, including Prov-Graph, LSTM, and GNN-only; noise robustness: under 10% feature noise (using dual-stage noise injection plus SMOTE balancing), F1 improves from 0.95 to 0.97; privacy preservation: (k = 3, ℓ = 2) anonymization reduces the re-identification success rate from 68% to 28% (a 40 percentage-point reduction) with minimal degradation in detection (F1: 0.9974 → 0.9941); and, compared with the nearest competitor, Prov-Graph, GMD-AD significantly reduces operational cost: 60% lower CPU usage, 66% lower memory footprint, and 66% lower storage requirements, all while achieving higher detection accuracy.
Organization of Paper:
Section 2 discusses related work in metric dimension theory, graph-based security monitoring, and machine learning techniques for anomaly detection. Section 3 describes GMD-AD, covering the sequential metric dimension update, k-anti-dimension construction, and privacy-preserving hybrid ML integration.
Section 4 describes our experimental evaluation on MongoDB and SockShop, a cost/benefit analysis and addresses generalizability, limitations and future work.
Section 5 concludes.
2. Related Work
This section contextualizes the proposed framework within the established literature on the graph metric dimension, privacy-conscious graph models, and graph-based anomaly detection in distributed databases and microservice architectures. Previous research has investigated these areas from theoretical, algorithmic, and practical viewpoints. Nevertheless, a close examination of existing methodologies reveals distinct trade-offs concerning scalability, adaptability to evolving systems, anomaly localization capacity, and privacy preservation. The following subsections assess representative approaches, summarize their fundamental concepts, and explicitly highlight the limitations that motivate the proposed framework. This study addresses a niche research area where directly comparable works are limited; the relevant literature is therefore analyzed through intersecting themes across multiple studies. Figure 1 shows a Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) diagram for the selection and categorization of the related work on cybersecurity and anomaly detection research, created using draw.io (https://www.drawio.com/).
Anomaly detection in distributed databases and microservice architectures has been extensively studied, with approaches broadly categorized into provenance-based methods, graph-based structural analysis, machine learning techniques, and emerging operator-theoretic frameworks. This section reviews key prior works, highlighting their strengths and limitations relative to the proposed GMD-AD framework.
2.1. Provenance and Graph-Based Anomaly Detection
Provenance tracking systems, such as Prov-Graph [19], maintain comprehensive lineage graphs for data operations in distributed environments, enabling forensic analysis of anomalies like data tampering or unauthorized access. Prov-Graph achieves high detection accuracy (F1 ≈ 0.984) by traversing backward and forward dependency traces but incurs significant CPU, memory, and storage overhead due to full graph maintenance (O(n·m) complexity, where n is the number of nodes and m the number of operations). Similarly, Titan [20] uses graph pattern matching for intrusion detection in cloud databases, focusing on query lineage but struggling with scalability in dynamic topologies.
Graph-based methods extend beyond provenance to structural properties. NetSieve [21] models network flows as graphs and detects anomalies via subgraph isomorphism, effective for lateral movement but computationally intensive for large clusters. GraphSAD [22] employs graph neural networks (GNNs) for semi-supervised anomaly scoring, achieving good precision in microservices but requiring labeled data and lacking privacy guarantees. In contrast, GMD-AD leverages metric dimension theory to monitor only a logarithmic subset of nodes (β(G) ≈ log n), offering efficiency gains (60–66% cost reductions) and formal privacy via the k-metric anti-dimension, while maintaining superior accuracy (F1 = 0.9974).
Machine learning approaches include isolation forests [23] for query-pattern anomalies and autoencoders for workload deviations in microservices. These excel in unsupervised settings but often overlook graph topology, leading to higher false-positive rates in structured systems such as sharded databases.
2.2. Neural Operator and Dynamic Representation Approaches
Recent operator learning paradigms model dynamical systems by learning mappings between function spaces. Sakovich et al. [24] integrate dynamic mode decomposition (DMD) into neural operators for approximating partial differential equations, capturing transient dynamics in evolving graphs such as microservice interactions. Li et al. [25] propose the Graph Kernel Network for learning operators on graph-structured data, which enhances scalability for large systems.
Dynamic Mode Decomposition (DMD) decomposes time-series into spatiotemporal modes: standard DMD identifies linear patterns in service interactions, Kernel DMD handles nonlinearities, and Sparse DMD selects minimal explanatory modes. Applied to distributed systems, DMD reveals latency oscillations or workload modes.
Table 1 represents SMS-level latency oscillations.
Neural operators suit continuous dynamics (e.g., latency trajectories) with dense data, enabling nonlinear modeling but at the cost of interpretability. GMD-AD excels in discrete topologies, sparse graphs, and structural anomalies, providing auditable detection and privacy.
DMD-based methods are preferable for continuous variables and high-frequency data, such as equation discovery from observations. Limitations include challenges with discrete events. GMD-AD is ideal for discrete service topologies, structural threats (e.g., unauthorized edges), and regulatory needs, leveraging graph sparsity for efficiency.
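To make the standard DMD step concrete, the following minimal numpy sketch (our own illustration, assuming snapshots are arranged one column per time step) recovers the eigenvalues and spatiotemporal modes of the best-fit linear operator from a snapshot matrix:

```python
import numpy as np

def dmd(X, r):
    """Exact DMD: eigenvalues and modes of the best-fit linear operator A
    with X2 ≈ A @ X1, computed through a rank-r truncated SVD of X1."""
    X1, X2 = X[:, :-1], X[:, 1:]          # consecutive snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r]    # rank-r truncation
    # projection of A onto the leading POD subspace
    A_tilde = U.conj().T @ X2 @ Vh.conj().T @ np.diag(1.0 / s)
    eigvals, W = np.linalg.eig(A_tilde)
    # lift eigenvectors back to the full state space (exact DMD modes)
    modes = X2 @ Vh.conj().T @ np.diag(1.0 / s) @ W
    return eigvals, modes
```

For snapshots generated by two independent geometric decays x_k = (0.9^k, 0.5^k), this sketch recovers the eigenvalues 0.9 and 0.5, i.e., the modes of the underlying linear dynamics.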
Hybrid Approach Proposal
A hybrid GMD-AD + DMD framework is proposed: GMD-AD detects structural changes (e.g., unexpected communications), DMD identifies temporal anomalies (e.g., mode shifts in latency), and a fusion layer combines signals for robust detection. Neural operators and DMD advance continuous modeling, but GMD-AD’s graph-theoretic basis better addresses discrete microservice anomalies. Future work will integrate these for enhanced hybrid systems.
2.3. Metric Dimension-Based Structural Identification
Distance-based structural identification has been well studied for monitoring and localization in large-scale networks. Laoudias et al. [4] surveyed enabling radio technologies for network localization and tracking, pointing out that distance representation is a concise and expressive way of denoting nodes. Despite their effectiveness in benign settings, these techniques assume stable networks; they do not address adversarial manipulation or anomaly detection in distributed database systems. Building on this work, Prabhu et al. studied the metric dimension of generalized Sierpiński graphs [5,6] and showed how a well-chosen resolving set can identify every vertex with minimal observation. These results assume highly regular graph constructions and do not directly apply to heterogeneous, irregular, or dynamically changing networks such as distributed databases.
To sidestep these computational problems, Brimkov et al. [7] introduced throttling schemes that yield better approximations of the metric dimension. Such approximation restricts computational cost but adds uncertainty in distances and degrades localization. Korivand and Soltankhah [8] further demonstrated that structural symmetries restrict distinguishability, a problem magnified in replicated or load-balanced databases.
Later work examined the metric dimension in more elaborate graph classes. Shao et al. [9] studied hex-based networks, and Bíró et al. [10] worked on growing infinite graphs, in both cases showing that resolving-set computation does not scale well as the graph grows. Dorota and Ismael [11] surveyed metric-dimension-related parameters from both combinatorial and applied viewpoints, repeatedly pointing to computational intractability as a main challenge. Parameterized approaches based on treewidth [12] and algebraic graph structures [13] support the observation that tractable solutions often rely on assumptions seldom met by operational systems.
Collectively, these works establish the metric dimension as a strong theoretical device for structural identification; however, there is an evident discrepancy between theory and practice, especially when considering real-time anomaly detection in large-scale dynamic distributed databases. This void is a source of direct motivation for the extensions for robustness, adaptability, and privacy we discuss next.
2.4. Robustness and Privacy-Oriented Metric Dimension Variants
Acknowledging that classical metric dimension formulations are fragile to failures and system evolution in realistic settings, later work introduced variants that improve robustness. Liu et al. [13] introduced the fault-tolerant metric dimension, which guarantees identifiability under reference-node faults. Ahmad et al. [14] strengthened distinguishability with doubly resolving sets, while Frongillo et al. [15] proposed truncated metric dimensions for a trade-off between precision and efficiency. These extensions enhance robustness, but their resistance to adversarial compromise is purely structural. Building on these concepts in dynamic settings, Henderson et al. [16] presented metric dimension-based analysis of volatile graphs for digital forensics on dynamic systems. Gori et al. [17] presented GRAPH4, which computes anomaly measures on attack graphs but is not scalable owing to its full-graph traversal and centralized design. Other works on fault-tolerant and dynamic structures [19,20] also emphasize resilience but do not achieve fine-grained intrusion localization.
As robustness improved, privacy emerged as a parallel concern, particularly because distance-based representations can uniquely identify entities. Chatterjee et al. [26] studied the computational complexity of privacy measures related to active attacks. Trujillo-Rasúa and Yero [27] proposed the k-metric anti-dimension to model anonymity on graphs, which has since been generalized by the equidistant dimension [28] and by (k, ℓ)-anonymity variants [29]. These models deliberately limit identifiability, thereby providing strong privacy guarantees.
Privacy-preserving variants of these notions, however, work directly against intrusion localization: anonymity hides precisely the differences needed to discover malicious activity. Thus, while robustness and privacy extensions address complementary issues, they remain independent of anomaly detection, making the reconciliation of identifiability and anonymity within a unified framework essential.
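As an illustration of this tension, the anonymity notion can be checked by brute force on small graphs. The sketch below follows one conservative reading of (k, ℓ)-anonymity (every attacker set of at most ℓ landmark nodes leaves each remaining node indistinguishable from at least k − 1 others); all names are our own, and this is not the paper's algorithm:

```python
from collections import Counter, deque
from itertools import combinations

def bfs_distances(adj, src):
    # single-source BFS distances on an unweighted graph
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_k_ell_anonymous(adj, k, ell):
    # brute-force check: for every attacker set S with |S| <= ell, each node
    # outside S must share its distance vector to S with >= k-1 other nodes
    nodes = list(adj)
    for size in range(1, ell + 1):
        for landmarks in combinations(nodes, size):
            dists = [bfs_distances(adj, l) for l in landmarks]
            others = [v for v in nodes if v not in landmarks]
            counts = Counter(
                tuple(d.get(v, float("inf")) for d in dists) for v in others
            )
            if any(c < k for c in counts.values()):
                return False
    return True
```

Under this reading, the complete graph K4 passes for k = 3, ℓ = 1 (any single landmark leaves three mutually indistinguishable nodes), whereas a 4-node path fails even for k = 2, since a landmark at one end distinguishes every other node.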
Table 2 compares these robustness- and privacy-aware adaptations, indicating that all of them consider resilience or anonymity separately and none incorporates them together with anomaly detection or intrusion localization in distributed database systems.
2.5. Graph-Based Anomaly Detection in Distributed Systems
In parallel with the theoretical progress on metric dimension theory, graph-based anomaly detection has been studied heavily in distributed and microservice systems. Liu et al. [30] used DAG-based metric fusion for anomaly detection in cloud-based microservices, achieving high detection performance at the cost of requiring large amounts of labeled data and centralized processing. Brandon et al. [31] used graph autoencoder-based models for root cause analysis from distributed traces, but they require full-graph embeddings, which may not scale well. To enhance detection accuracy, Wang et al. [32] suggested multimodal graph representation learning to incorporate logs, traces, and metrics; although effective, this method carries a high computational burden. Li et al. [33] proposed a Bidirectional LSTM (BiLSTM) with graph attention for unsupervised anomaly detection, but its high inference latency limits real-time application. Chen et al. [34] introduced a GNN-VAE approach for detecting dynamic faults in SDN-based microservices, but it is designed specifically for network flow data and considers neither database access semantics nor privacy risks.
Unlike metric dimension-based approaches, these graph learning techniques favor detection performance over minimal observation and theoretical guarantees. They process the full graph, have no principled way of minimizing observation overhead, and include no formal privacy definitions. This discrepancy indicates the potential to leverage the best of both worlds: the metric dimension's brevity and interpretability alongside machine learning's flexibility. Table 3 summarizes these graph-ML-based anomaly detection methods, showing that despite high detection accuracy, most prior systems still depend on full-graph processing, lack minimal monitoring guarantees, and provide no formal privacy analysis.
2.6. Identified Research Gaps
- Metric dimension and resolving sets are largely absent from the distributed database/microservice security literature.
- Hybrid architectures that combine graph-theoretic minimal observation with high-precision ML classifiers are not available.
- Empirical validation on real-world distributed database environments is limited, particularly under dynamic and injected anomaly scenarios.
- No pipelined, low-latency localization technique based on distance-vector (DV) deviations exists.
- The metric dimension's computational complexity on large graphs obstructs real-time applicability; improved approximations and parallel algorithms are needed.
- Noise handling and class imbalance in graph data are inadequately addressed during anomaly scoring.
These identified gaps motivate the methodological choices adopted in this study. In particular, the need for minimal monitoring in large-scale graphs, real-time anomaly localization, robustness to noise and class imbalance, and empirical validation in realistic distributed environments informs the integration of metric dimension-based resolving sets with scalable machine learning models.
Section 3 details the proposed materials, datasets, and methods designed to address these challenges in a principled and reproducible manner.
3. Methodology
In this section, we describe the methodology of the proposed GMD-AD framework that strengthens cybersecurity for distributed database systems by taking advantage of the graph metric dimension to monitor API-driven access behavior in a non-intrusive but efficient manner. By representing queries and response patterns in a distributed database as graphs and using resolving set-based distance analysis along with machine learning, we enable scalable, accurate, and privacy-preserving detection and localization of malicious cyber-attacks. The overview of this methodology is illustrated in
Figure 2.
3.1. Data Description
In this paper, two diverse datasets are considered to address distinct cybersecurity challenges in distributed environments. A system-level graph dataset, generated using a simulator, models API-based interactions among distributed components for structural anomaly localization and scalability evaluation. In parallel, a behavioral API access dataset provides weighted usage patterns for supervised anomaly detection. Leveraging these datasets, the proposed framework can identify anomalous behavior and localize its structural origin within the system.
3.1.1. Graph-Structured System Dataset for Distributed Database Security Analysis
For localizing structural anomalies and analyzing scalability, this work leverages a synthetic system-level dataset that mimics the internal topology and interaction dynamics of distributed database-backed microservice applications. Influenced by design patterns common in large-scale distributed applications (such as API gateways, backend services, cache layers, and database shards), the dataset comes from a controlled simulation of a cloud-native system architecture. Since no public real-world datasets reveal service-to-database interaction graphs, and given the strict privacy and security policies of production systems, realistic data was crafted through system emulation. The model captures architectural patterns, API interaction protocols, and runtime behavior typical of many contemporary distributed systems. The resulting dataset is statistically representative in structure and behavior while retaining full reproducibility and privacy safety.
To capture access behavior in a manner consistent with distributed database and microservice architectures, API interactions are modeled as a graph G = (V, E), where each node v ∈ V stands for an API endpoint or microservice component connected to its underlying distributed database resources (e.g., query services, authentication services, data aggregation APIs), and each edge e ∈ E represents a sequential order, logical link, or dependency-based operation between APIs within a user or client session. In distributed database systems, such API-level interactions encode access paths to data shards, replicas, and services, depicting both structural (how the databases are connected) and operational (how requests are handled) aspects of the system. This graph-based model captures the traversal paths, access diversity, and abnormal interaction patterns of attacks on distributed database services, preserving relational context omitted by flat-table or sequential models.
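A minimal sketch of this representation, using plain dictionaries; the endpoint names and attribute values are entirely hypothetical and only illustrate the shape of the data, not the actual dataset:

```python
# hypothetical API interaction graph G = (V, E); names and numbers are
# illustrative only, not drawn from the actual dataset
nodes = {
    "api_gateway":  {"load": 0.7, "trust": 0.90},
    "auth_service": {"load": 0.4, "trust": 0.95},
    "query_api":    {"load": 0.6, "trust": 0.85},
    "shard_1":      {"load": 0.5, "trust": 0.80},
}
edges = [
    # (source, target, edge attributes: call frequency, mean latency)
    ("api_gateway", "auth_service", {"freq": 120, "latency_ms": 12.5}),
    ("api_gateway", "query_api",    {"freq": 300, "latency_ms": 8.2}),
    ("query_api",   "shard_1",      {"freq": 280, "latency_ms": 9.1}),
]
# undirected adjacency view used by distance-based analysis
adj = {v: set() for v in nodes}
for u, v, _attrs in edges:
    adj[u].add(v)
    adj[v].add(u)
```

The adjacency view is what BFS-based distance computations consume, while the node and edge attribute dictionaries carry the operational features described below.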
To faithfully represent the structural characteristics of distributed systems, we created a system-level interaction graph in which each node stands for an entity such as an API endpoint, cache service, or distributed database component. Node attributes were synthetically generated to represent operational load, trustworthiness, and security status, allowing the framework to associate the position an element occupies in the graph directly with its cybersecurity importance.
Table 4 outlines the related features.
In addition, interaction information was used to capture the logical and operational relations among components. Edges represent API-driven access paths connecting external requests to internal services and databases, capturing realistic access patterns in layered microservice architectures. This structure enables capturing and analyzing architectural irregularities, e.g., unexpected accesses or abnormally high interaction frequencies, within the graph.
Table 5 lists the interaction-level features.
The direct representation of API-driven interactions allows the discovery of abnormal access chains, unexpected service couplings, and latency outliers that usually accompany coordinated attacks or abusive patterns.
3.1.2. Behavioral API Access Dataset for Anomaly Classification
The experiments in this paper are performed on a real-world API access behavior dataset, the “API Security: Access Behavior Anomaly Dataset”, downloaded from Kaggle [35]. It represents access logs in distributed microservice-oriented applications whose services are exposed and accessed through APIs. Such systems are especially prone to abuse, as adversaries can manipulate business logic by sending abnormal or arbitrary API requests that deviate from normal user behavior. The dataset comprises 34,423 API access behavior samples, each aggregated from one API access session. Access patterns stem from legitimate user behavior, automated clients, and attackers. Because API-driven systems are dynamic, browser refreshes, session updates, and network interruptions influence request patterns, and programmatic access can alter API usage, so variability is natural even for the same user. Access graphs are constructed from long-lasting sessions, which interrelate the structural and temporal dependencies of API calls, making it possible to identify sophisticated attack patterns.
For the sake of computational analysis, the dataset also includes summaries of feature-engineered API access behavior to aid in machine learning-based classification whilst preserving raw interaction graphs for distance vector and resolving-set-based analysis, which is considered critical to the proposed GMD-AD framework. The numerical features extracted from API access sessions and used for anomaly detection are summarized in
Table 6.
Figure 3 presents a correlation heatmap of API access behavior metrics and a statistical summary for each feature in the User API Interaction Behavior Metrics dataset, consisting of the count, mean, standard deviation (std), minimum (min), 25th percentile (25%), 50th percentile (median), 75th percentile (75%), and maximum (max) values for all features. These statistics convey how the features are distributed and vary, providing insight into user behavior with APIs.
Figure 4 combines two visualizations for anomaly detection. The first column chart shows the class distribution, which is highly imbalanced, with few outliers and almost no attacks in-sample. The second chart shows metric dimension values under normal and anomalous conditions. These visualizations highlight class imbalance and possible anomalies, both of which matter for effective anomaly detection and system behavior analysis.
Figure 5 visualizes sample user API interaction graphs in a distributed database system under (a) normal operating conditions and (b) anomalous conditions, highlighting structural and connectivity deviations caused by injected anomalies among database components. In Figure 5a, green nodes represent users or APIs, and edges denote access relationships. The circular arrangement emphasizes the dense core of interactions, pointing to high pairwise connectivity among central nodes; this profile can be used to locate system bottlenecks. Figure 5b contrasts this with an anomalous state, in which nodes are red and black edges represent relations among entities of interest. The most prominent node in the right cluster, with very high connectivity, indicates a deviant or possibly abusive entity. The less balanced, more concentrated shape reveals non-uniform connectivity patterns surfaced by the anomaly analysis, possible security threats, and system weaknesses. These visualizations allow stakeholders to interpret normal and anomalous API behaviors, which is key for anomaly detection and cybersecurity operations.
This system-level graph dataset is the structural backbone of our GMD-AD framework. It is used to compute resolving sets, distance vectors, and node-level anomaly scores. The protocol locates security incidents with high accuracy while remaining scalable, because monitoring covers only a small set of carefully selected nodes. The anomaly scores derived from the graph detect anomalous API interaction structure (e.g., unusual traversal paths, unexpected communication flows, or latency deviations) that typically manifests malicious behavior.
The externally traced API access behavior, in turn, is used only for behavior classification, with its labeled patterns of normal/outlier/bot/attack API usage. While the two datasets are not directly linked at the record level, their results are highly correlated at the system-behavior level. For instance, unusual access frequency, session depth, or API diversity in the behavioral dataset corresponds to high distance-vector deviations and anomaly scores in the interaction graph, especially on edges denoting API-involved access paths. These complementary signals differentiate where an anomaly originates in the distributed system (structural graph analysis) and what behavior type it constitutes (behavioral classification), resulting in a cross-validated, operationally representative cybersecurity evaluation.
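One way to realize the distance-vector deviation signal described above, as a simplified sketch with function names of our own choosing (the actual GMD-AD scoring also feeds these deviations into ML models):

```python
from collections import deque

def bfs_distances(adj, src):
    # single-source BFS distances on an unweighted graph
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def dv_scores(adj_base, adj_now, landmarks, unreachable=10.0):
    # Euclidean deviation of each node's landmark distance vector from its
    # baseline; unreachable nodes get a large finite placeholder distance
    def vectors(adj):
        dists = [bfs_distances(adj, l) for l in landmarks]
        return {v: [d.get(v, unreachable) for d in dists] for v in adj}
    base, now = vectors(adj_base), vectors(adj_now)
    return {
        v: sum((a - b) ** 2 for a, b in zip(base[v], now[v])) ** 0.5
        for v in base if v in now
    }
```

In a 4-node chain 0-1-2-3 with landmark 0, adding an unexpected shortcut edge 0-3 changes only node 3's distance vector, so only node 3 receives a nonzero score, which localizes the structural anomaly.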
3.2. API Security: Access Behavior Anomaly Dataset Preprocessing
Figure 6 outlines the overall data preprocessing and robustness workflow used in this study. Prior to model training, the workflow incorporates categorical encoding, missing-value imputation, feature scaling, dataset partitioning, noise injection, SMOTE (Synthetic Minority Over-sampling Technique) balancing, and robustness-enhancing techniques to ensure consistent, reproducible, and reliable evaluation of anomaly detection performance in distributed database environments.
3.2.1. Categorical Encoding and Feature Scaling
The LabelEncoder from sklearn.preprocessing converts the dataset's categorical attributes into numerical representations, allowing them to be used by machine learning models that only accept numerical inputs. In particular, the behavior_type attribute is encoded as integer labels that match the behavioral classes.
After encoding, numerical features are normalized with the StandardScaler to provide uniform feature scales throughout the dataset. This normalization removes the mean and scales features to unit variance, preventing attributes with larger numerical ranges from having a disproportionate impact on model training. Mathematically, for a feature x, the standardized value z_i is calculated as:

z_i = (x_i − μ_x) / σ_x

where x_i is the i-th data point of feature x, μ_x is the mean of feature x, and σ_x is the standard deviation of feature x.
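As a concrete illustration, the standardization formula can be implemented in a few lines of plain Python (the feature values below are hypothetical):

```python
from statistics import mean, pstdev

def standardize(values):
    """Standardize a feature column: z_i = (x_i - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)  # population std dev, as StandardScaler uses
    return [(x - mu) / sigma for x in values]

# Hypothetical feature column (e.g., API access durations in ms)
durations = [120.0, 80.0, 100.0, 140.0, 60.0]
z = standardize(durations)

# The standardized column has zero mean and unit variance.
print(round(mean(z), 10))   # 0.0
print(round(pstdev(z), 10)) # 1.0
```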
3.2.2. Handling Missing Values
We use a simple imputation strategy to handle missing values in the dataset: each missing entry is replaced with the mean value of its column. Formally, the mean of a column x with n observed values is

μ_x = (1/n) Σ_{i=1}^{n} x_i

and every missing entry of that column is set to μ_x.
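A minimal sketch of column-mean imputation, using hypothetical values with None marking missing entries:

```python
from statistics import mean

def impute_column_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mu = mean(observed)
    return [mu if x is None else x for x in column]

# Hypothetical column with missing API session depths
col = [3.0, None, 5.0, 4.0, None]
filled = impute_column_mean(col)
print(filled)  # [3.0, 4.0, 5.0, 4.0, 4.0]
```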
3.2.3. Identifying the Target Variable
We consider behavior_type as the target variable, which we recognize as the column containing class labels for the prediction task. This column is the type of behavior (normal, outlier, attack) that learning models attempt to predict. Every other column in the dataset is a feature (X) that our models will learn from and use to make predictions.
3.2.4. Train-Test Split
To measure model performance, the data are split with the train_test_split function from sklearn.model_selection using a test size of 0.3, reserving 30% of the data for testing and 70% for training. A random_state of 42 ensures the data are split identically on every run. The held-out test set estimates how well the models are likely to perform on new instances, and the fixed random_state makes the experiment repeatable, which is essential for scientific validity.
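The reproducibility property can be illustrated with a plain-Python sketch that mirrors the seeded 70/30 split (the data and helper name are hypothetical; this is not the scikit-learn implementation):

```python
import random

def train_test_split_simple(rows, test_size=0.3, seed=42):
    """Shuffle deterministically, then hold out the first fraction as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]  # train, test

data = list(range(100))         # hypothetical 100 samples
train_a, test_a = train_test_split_simple(data)
train_b, test_b = train_test_split_simple(data)

print(len(train_a), len(test_a))  # 70 30
print(train_a == train_b)         # True: same seed, same split
```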
3.2.5. Adding Noise for Robustness
A small amount of random noise is added to the training and test data (X_train_noisy, X_test_noisy) to make the models more robust. Noise injection reduces overfitting so that models generalize better to data with small perturbations, simulating real-world variations such as fluctuating usage, session length, and API access patterns in a distributed database. To cope with class imbalance at the graph level, SMOTE is applied to the distance vectors: minority anomalous vectors are oversampled by interpolating between similar samples so the classes are equally represented before ML training. This extends tabular SMOTE to graph features and improves generalization to rare intrusions. The noise injection strategy in our preprocessing pipeline operates at two distinct stages, each serving a complementary robustness purpose. The first stage, Raw Feature Noise (x → x′), applies noise to the raw features immediately after the train-test split. This accommodates practical perturbations in real-world data (e.g., measurement errors from sensor malfunction, network transmission jitter that perturbs timestamp precision, API response-time fluctuation due to load balancing, and data-entry inconsistency in manual or semi-automated logging). Mathematically, for each raw feature value x, we compute:

x′ = x + ε, ε ~ N(0, (0.01 · σ_x)²)

where σ_x is the standard deviation of the feature. The standard deviation is scaled by 0.01 to ensure the perturbations remain small at this stage.
In the second stage, Scaled Feature Noise (z → z′) is applied to the scaled features after StandardScaler normalization as defined in (4). This noise injection prevents overfitting to the precise normalized distribution, improving generalization under small train-to-production distribution shifts, temporal changes in the statistical properties of incoming data (i.e., concept drift), or differences in normalization parameters when the model is applied to a different subset of the data. For each scaled feature value z, we compute:

z′ = z + ε_z, ε_z ~ N(0, σ_s²)

where σ_s is smaller than the raw-feature noise scale, to avoid distorting the standardized values. This second noise injection is subtler, preserving the integrity of the normalized data while ensuring model robustness. Our two-step process is in line with recent recommendations in robust machine learning, which suggest that controlled perturbations inserted at different levels of processing can improve model generalization by 5–12% in noisy environments. The central idea is that raw-space noise encodes domain-specific variation, such as database query latency fluctuation, while scaled-space noise prevents the model from memorizing precise normalization artifacts.
The x-noise injection occurs before scaling, while the z-noise injection occurs after scaling and before SMOTE balancing (Figure 6). This ordering ensures that augmenting the training data with noise does not interfere with the class-balancing procedure, retaining the quality of the training process while improving generalization performance.
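A minimal sketch of the two-stage noise injection, assuming the 0.01 raw-space scale stated above; the 0.001 scaled-space constant is a hypothetical stand-in, since the exact value is not specified here:

```python
import random
from statistics import pstdev

def add_gaussian_noise(values, scale, seed=0):
    """Add zero-mean Gaussian noise with standard deviation `scale` to each value."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]

raw = [120.0, 80.0, 100.0, 140.0, 60.0]        # hypothetical raw feature column

# Stage 1: raw-feature noise, std equal to 0.01 of the feature's own spread
sigma_raw = 0.01 * pstdev(raw)
raw_noisy = add_gaussian_noise(raw, sigma_raw, seed=42)

# Stage 2 is applied after standardization with an even smaller scale;
# 0.001 here is a hypothetical stand-in for the paper's unspecified value.
mu, sd = sum(raw) / len(raw), pstdev(raw)
scaled = [(v - mu) / sd for v in raw]
scaled_noisy = add_gaussian_noise(scaled, 0.001, seed=42)
```

Because the generator is seeded, the perturbed features are reproducible across runs, which keeps the augmentation compatible with the fixed random_state used elsewhere in the pipeline.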
3.3. Machine Learning Models Evaluation
The machine learning models evaluated in this study were selected to provide broad coverage across algorithmic families, levels of interpretability, and computational profiles. Together, the selected models cover three gradient boosting implementations (XGBoost, CatBoost, and HistGradientBoosting), each bringing its own advantages on tabular data, particularly for anomaly detection based on graph metric dimension analysis. XGBoost is a de facto industry standard for efficient, scalable gradient boosting, especially with sparse features; CatBoost was selected for its native support of categorical features; and HistGradientBoosting for its memory-efficient, histogram-based algorithm suited to large datasets. A Decision Tree Classifier (DT) was added as an interpretable baseline for assessing feature importance and model complexity. AdaBoost was selected to evaluate the effect of adaptive error correction and sequential sample re-weighting, which are particularly suited to imbalanced datasets. Support Vector Machines, Random Forests, Deep Neural Networks, Logistic Regression, and k-Nearest Neighbors were also considered, but were excluded based on our criteria of computational complexity, performance, and suitability for the problem. The resulting five-model ensemble delivers breadth across boosting and tree-based methods, depth via multiple gradient boosting implementations, and interpretability through the Decision Tree. The results confirm that graph-based anomaly detection is a natural fit for gradient boosting methods, offering robustness and interpretability relevant to production systems.
This extensive assessment demonstrates the effectiveness of GMD-AD concerning different classifier architectures and also states the rationale behind the better performance of gradient boosting-based methods for Graph Metric Dimension-based Anomaly Detection tasks.
3.3.1. XGBoost
XGBoost is a gradient boosting algorithm that has demonstrated high scalability and efficiency. It constructs trees in a stepwise fashion, where each tree mitigates the mistakes of its predecessors. XGBoost integrates regularization (to avoid overfitting) and parallelized tree construction, which contribute to its high efficiency and make it particularly suitable for large datasets with many features. In this work, max_depth is set to 3 to keep each tree shallow and prevent the model from becoming too complex to generalize. A learning rate of 0.05 reduces the step size at each iteration, improving generalization to unseen data. Setting n_estimators = 50 trains 50 boosting rounds, balancing computational cost against model quality. Finally, eval_metric = ‘logloss’ is chosen as the evaluation metric, which is well suited to classification tasks because it assesses how close the predicted probabilities are to the true class labels.
3.3.2. CatBoost
CatBoost is a gradient boosting algorithm designed for effective handling of categorical data. It processes categorical features natively using an ordered target-statistics technique that avoids explicit encoding, making it faster and generally more accurate on such datasets. Model depth is restricted to 2 to avoid overfitting through overly complex individual trees. A moderate learning rate (0.095) balances the contribution of each tree. The model is trained with 500 boosting rounds (iterations = 500), giving it sufficient opportunity to learn from the data without overfitting. Training output is suppressed with verbose = 0, which keeps logs clean for batch processing and larger datasets.
3.3.3. DecisionTree Classifier
The DecisionTree Classifier is a non-parametric, tree-based model in which leaf nodes correspond to class labels and internal nodes represent decision rules. At each node, the data are split on the feature that yields the best separation (according to Gini impurity or entropy). Although decision trees are interpretable, they easily overfit if not suitably regularized. In this study, we set max_depth to 5 to limit the depth of the tree; otherwise, overly complex models result that overfit the training data. Constraining the depth forces the model to concentrate on only the most important splits. We set min_samples_split to 20, so a node is split only if it contains at least 20 samples, reducing the likelihood of fitting to noise. Setting random_state = 42 makes the tree’s splits reproducible across runs, which is necessary for reliable results.
3.3.4. AdaBoost Classifier
AdaBoost (Adaptive Boosting) is an ensemble meta-algorithm that builds a highly accurate classifier by combining many weak classifiers. Weak learners are trained sequentially, with each new learner focusing on the mistakes made by its predecessors, so that the ensemble becomes progressively stronger. In this model, n_estimators = 300 means the ensemble comprises 300 weak learners. The learning rate of 0.5 determines each weak learner’s contribution to the overall model; a moderate learning rate keeps the contributions balanced and helps avoid overfitting. This is especially useful when the base learners are weak and must be enhanced iteratively for better accuracy.
3.3.5. HistGradientBoosting Classifier Model
The HistGradientBoosting Classifier is a gradient boosted tree model with histogram-based split finding, which can be up to 10× faster on large datasets than conventional gradient boosting implementations. It builds an ensemble of decision trees, iteratively fitting new trees to the residual errors of previous ones. By binning feature values into discrete histogram intervals, it is faster and more memory-efficient than operating on the raw feature values. In our configuration, individual trees are restricted to a max_depth of 2, so each tree is very shallow and focuses on the most significant features, which helps prevent overfitting. A learning rate of 0.05 limits each tree’s effect on the final model, making the ensemble more robust at the cost of requiring more iterations to converge. The number of boosting iterations is fixed at 200 via max_iter, allowing the model to learn the important patterns without overfitting.
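For reference, the hyperparameters described in Sections 3.3.1–3.3.5 can be collected in one place; the sketch below uses plain dictionaries rather than instantiating the library classes, so it stays library-agnostic:

```python
# Hyperparameters as reported in Sections 3.3.1-3.3.5, collected as plain dicts.
MODEL_CONFIGS = {
    "XGBoost":              {"max_depth": 3, "learning_rate": 0.05,
                             "n_estimators": 50, "eval_metric": "logloss"},
    "CatBoost":             {"depth": 2, "learning_rate": 0.095,
                             "iterations": 500, "verbose": 0},
    "DecisionTree":         {"max_depth": 5, "min_samples_split": 20,
                             "random_state": 42},
    "AdaBoost":             {"n_estimators": 300, "learning_rate": 0.5},
    "HistGradientBoosting": {"max_depth": 2, "learning_rate": 0.05,
                             "max_iter": 200},
}

for name, params in MODEL_CONFIGS.items():
    print(name, params)
```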
To provide transparency, we document models that were considered but excluded after preliminary evaluation in
Table 7.
3.3.6. Proposed Framework and Methods
While traditional machine learning models such as CatBoost and HistGradientBoosting perform well on tabular behavioral features, they do not capture the relational structure of API interactions in distributed databases. Many attacks manifest as coordinated changes across services rather than isolated anomalies. To address this limitation, the proposed framework integrates graph neural components with tabular classifiers, enabling structure-aware detection while maintaining computational efficiency.
This section presents the proposed Graph Metric Dimension-based Anomaly Detection (GMD-AD) framework, designed to enhance cybersecurity in distributed databases by leveraging graph theory for efficient monitoring and machine learning for precise classification. The framework models distributed databases as graphs, computes a minimal resolving set using metric dimension techniques, derives anomaly scores from distance vector deviations, and integrates these with gradient boosting models (e.g., CatBoost and HistGradientBoosting). This hybrid approach addresses the limitations of existing methods, such as high computational overhead in full-graph traversals and sensitivity to class imbalances in tabular data. The GMD-AD framework operates in two main phases: (1) graph-based anomaly localization using the metric dimension, and (2) ML-based refinement for classification. Theoretical analysis demonstrates its efficiency in distributed settings, with pseudocode provided for implementation. All experiments were conducted using Python libraries including NetworkX for graph operations and Scikit-learn for heuristics.
Graph Modeling of Distributed Databases: Distributed databases (e.g., Cassandra or MongoDB clusters) and microservice architectures are modeled as undirected graphs G = (V, E), where V represents nodes such as database shards, users, APIs, or microservice endpoints, and E represents edges denoting interactions, such as data access queries, API calls, or inter-shard communications. This modeling captures the inherent topology and dynamic behaviors of distributed systems. For instance, user–API interactions from logs are converted into edges weighted by access frequency or duration. Anomalies, such as unauthorized access or data leaks, manifest as structural deviations (e.g., unexpected edges or path changes).
Figure 7 illustrates the graph model of a microservice-based distributed database, with resolving sets highlighted for monitoring. Nodes represent users (green), APIs (blue), and DB shards (red). Edges indicate access interactions. The resolving set (bold nodes) enables unique identification via distance vectors. Anomalous deviations are shown as dashed edges.
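A minimal sketch of building such a weighted interaction graph from log records, using plain dictionaries and invented node names rather than NetworkX:

```python
from collections import defaultdict

# Hypothetical access-log records: (source node, target node)
logs = [
    ("user1", "api_orders"), ("user1", "api_orders"),
    ("user2", "api_login"),  ("api_orders", "shard_3"),
]

# Undirected graph: adjacency map with edge weights = interaction frequency
adj = defaultdict(lambda: defaultdict(int))
for src, dst in logs:
    adj[src][dst] += 1
    adj[dst][src] += 1

print(adj["user1"]["api_orders"])  # 2 (repeated access raises the weight)
```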
For dynamic graphs, the model includes a sequential metric dimension: whenever a node or edge change occurs (an inserted or deleted edge), the resolving set is updated to cover the perturbation, or recomputed from scratch when necessary, so that distance vectors are refreshed with little additional work, achieving near-optimal recomputation times of O(Δn log n) after insertions and deletions [23]. For stability, scores are normalized with thresholds tuned by machine learning (e.g., grid search on the validation data) to reduce sensitivity to noise. Distributed BFS (e.g., in GraphX) parallelizes distance computation over clusters for large n (>10⁴ nodes).
Table 8 shows notations used in the GMD-AD framework.
- 2. Computing the Resolving Set and Metric Dimension: The metric dimension β(G) is computed to find the smallest resolving set S ⊆ V such that every node v ∈ V has a unique distance vector to S. For large-scale graphs (common in distributed DBs), exact computation is NP-hard. Thus, we employ a greedy heuristic algorithm [36] for approximation: initialize S = ∅, and iteratively add the node that resolves the maximum number of unresolved pairs until all nodes are uniquely identified. This heuristic achieves near-optimal results with polynomial time complexity in practice, with |S| ≪ |V|, making it suitable for dynamic DB topologies updated periodically. Shortest-path distances are computed using BFS (via NetworkX), yielding a distance vector r(v|S) = (d(v, s₁), …, d(v, s_k)) for each node.
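The greedy heuristic can be sketched in plain Python (BFS distances plus iterative landmark selection); the 5-node path graph below is a toy example, not one of the evaluated topologies:

```python
from collections import deque

def bfs_distances(adj, src):
    """Shortest-path (hop) distances from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def greedy_resolving_set(adj):
    """Greedily pick landmarks until every node has a unique distance vector."""
    nodes = sorted(adj)
    dist = {u: bfs_distances(adj, u) for u in nodes}
    S, vectors = [], {u: () for u in nodes}
    while len(set(vectors.values())) < len(nodes):
        # choose the candidate that maximizes the number of distinct vectors
        best = max((u for u in nodes if u not in S),
                   key=lambda u: len({vectors[v] + (dist[u][v],) for v in nodes}))
        S.append(best)
        vectors = {v: vectors[v] + (dist[best][v],) for v in nodes}
    return S, vectors

# Toy 5-node path graph a-b-c-d-e (metric dimension 1)
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
S, vectors = greedy_resolving_set(adj)
print(S)                           # one endpoint suffices on a path
print(len(set(vectors.values())))  # all 5 distance vectors are distinct
```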
- 3. Anomaly Scoring via Distance Vector Deviations: Anomalies are detected by monitoring changes in distance vectors over time. For a node v at time t, the anomaly score A(v, t) is defined as:

A(v, t) = ||r_t(v|S) − r_{t−1}(v|S)||₂ / σ_v

where σ_v is the standard deviation of historical deviations (to normalize noise). A node is flagged when A(v, t) > θ, where θ is a threshold (e.g., 1.5, tuned via validation). A high A(v, t) indicates structural anomalies such as intrusions (e.g., new edges from unauthorized access). This score provides localization: suspicious nodes are flagged based on resolving-set observations, minimizing monitoring overhead (only the |S| resolving-set nodes need active tracking, and typically |S| ≪ |V|).
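The scoring rule can be sketched directly from the definition; the distance vectors and σ_v below are hypothetical:

```python
def anomaly_score(r_now, r_prev, sigma_v):
    """A(v, t) = ||r_t(v|S) - r_{t-1}(v|S)||_2 / sigma_v."""
    dev = sum((a - b) ** 2 for a, b in zip(r_now, r_prev)) ** 0.5
    return dev / sigma_v

THETA = 1.5  # threshold, as tuned in Section 3.3.7

# Hypothetical distance vectors to a 3-landmark resolving set
r_prev = (2, 4, 1)
r_stable = (2, 4, 1)    # unchanged topology
r_shifted = (5, 1, 4)   # large structural change, e.g., an unauthorized edge

print(anomaly_score(r_stable, r_prev, sigma_v=1.0))   # 0.0 -> not flagged
score = anomaly_score(r_shifted, r_prev, sigma_v=1.0)
print(score > THETA)                                  # flagged as anomalous
```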
3.3.7. Anomaly Threshold Selection and Validation
The anomaly detection threshold θ = 1.5 in Equation (4) was established through systematic empirical validation rather than arbitrary selection. This section describes the methodology for threshold tuning and sensitivity analysis.
Threshold Tuning
Candidate Threshold Range: We evaluated θ ∈ {0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0} on a stratified validation set comprising 20% of the training data, kept separate from the final test set. The validation set was balanced between normal and anomalous samples (5000 normal, 5000 anomalous) to avoid bias towards the majority class. Validation results are shown in
Table 9.
The selection of θ = 1.5 is justified by its performance on the validation set. First, balanced error rates: both the FPR (False Positive Rate) and FNR (False Negative Rate) were approximately 1.5%, minimizing false alarms and missed detections alike. Second, maximum F1-score: the highest F1-score of 0.9974 was obtained at this threshold. Third, cost-weighted optimality: this threshold achieved a cost-weighted metric of 16.5, the closest to optimal (lower values are better). The standard deviation of ±0.0008 (F1) across folds is low, showing that θ = 1.5 yields statistically stable models across different validation folds and data splits, making it a robust and reliable choice. Stratified 5-fold cross-validation confirmed that θ = 1.5 consistently performed best, with an F1-score standard deviation of 0.0008 (corresponding to a variance of 6.4 × 10⁻⁷), confirming stability across all folds.
Sensitivity Analysis
The sensitivity analysis examined model performance across different threshold values. The plotted F1-score exhibits a solid plateau, remaining above 0.995, indicating that the model is not overly sensitive to small threshold changes (Figure 8). For values of θ below 1.0, a rapid drop in classification performance was observed, specifically an increase in FPR (False Positive Rate); for values above 2.0, a rise in FNR (False Negative Rate) was recorded.
Statistically, the threshold corresponds approximately to the 93rd percentile of a standard normal distribution: about 93% of normal behavior lies within 1.5 standard deviations of the mean, while intrusions typically cause larger deviations and therefore fall in the top 7% of the distribution. This is consistent with common anomaly detection practice, which typically uses thresholds between 1.5σ and 2.0σ for intrusion detection applications. The detection threshold can be expressed as:

θ = Φ⁻¹(0.93) ≈ 1.5

where Φ is the cumulative distribution function of the standard normal distribution.
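The percentile correspondence can be checked with the standard library's normal distribution:

```python
from statistics import NormalDist

# theta = Phi^{-1}(0.93): the 93rd percentile of the standard normal
theta = NormalDist().inv_cdf(0.93)
print(round(theta, 3))  # ~1.476, i.e., approximately 1.5

# Conversely, theta = 1.5 leaves about 93% of normal behavior un-flagged
print(round(NormalDist().cdf(1.5), 3))
```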
The balance between false positives and false negatives is operationally appropriate for distributed database security. The false positive rate of 1.5% corresponds to roughly 1–2 false alarms per 100 legitimate events, and the false negative rate of 1.5% means only 1–2 out of every 100 intrusions go undetected, an acceptable residual risk when combined with other security layers (e.g., firewalls, authentication).
Using a series of empirical tests, coupled with cross-validation and sensitivity analysis, the threshold was selected. It is a good compromise between metrics for the performance of the model (F1-score), and is stable over the different folds of validation. This threshold is statistically grounded and provides a practical and robust solution to detect anomalies in production systems.
3.4. Hybrid Integration with Machine Learning Models
Beyond appending anomaly scores as features, the GMD-AD framework deepens hybridization by feeding resolving-set subgraphs into graph neural networks (GNNs), such as Graph Convolutional Networks (GCNs) or Graph Attention Networks (GATs), for learned embeddings. This captures complex spatial and structural interactions in the distributed database graph that traditional gradient boosting models (e.g., CatBoost) might overlook, such as multi-hop dependencies in anomaly propagation. For instance, GNN layers can aggregate features from landmark nodes in the resolving set S, producing enriched representations that combine metric dimension-based distances with raw log attributes (e.g., api_access_uniqueness).
The integration process involves: (1) extracting subgraphs induced by S and its k-hop neighborhoods (k = 2–3 for efficiency); (2) applying GNN forward passes to generate node embeddings, where each layer updates representations as h_v^(l+1) = σ(Σ_{u∈N(v)} α_{vu} W^(l) h_u^(l)), with α_{vu} as attention weights tuned for anomaly sensitivity; (3) concatenating these embeddings with the original features and distance deviations for final classification. This hybrid approach addresses limitations in existing methods (e.g., the full-graph GNNs shown in Table 2), reducing computational overhead by focusing on minimal resolving sets (|S| ≪ |V|), with time complexity approximately O(|S| · d̄^k), where d̄ is the average degree.
Preliminary analysis on synthetic graphs (
n = 5000 nodes, simulated distributed DB topologies) shows 50–70% latency reduction compared to standalone GNNs, while maintaining or improving F1-scores (e.g., 0.98 vs. 0.91 for unsupervised detection). This enhancement draws from recent hybrid GNN frameworks for anomaly detection in distributed systems, such as Temporal-Attentive Graph Autoencoders (TAGAEs), which leverage temporal and attentional mechanisms to boost resilience against dynamic threats. Future extensions could incorporate transformers for sequence modeling of distance vector changes over time, further elevating cybersecurity efficacy in real-time monitoring.
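Step (1) of the integration, extracting the k-hop neighborhood around the resolving-set landmarks, can be sketched with a plain BFS (toy adjacency and node names, not the evaluated graphs):

```python
from collections import deque

def k_hop_nodes(adj, seeds, k):
    """Nodes within k hops of any seed (the resolving-set landmarks)."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # frontier reached: do not expand further
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# Hypothetical topology: landmark {"s1"} on a chain s1-a-b-c
adj = {"s1": ["a"], "a": ["s1", "b"], "b": ["a", "c"], "c": ["b"]}
sub = k_hop_nodes(adj, ["s1"], k=2)
print(sorted(sub))  # 'c' is 3 hops away, so it is excluded
```

The resulting node set induces the subgraph passed to the GNN, which is what keeps the embedding cost proportional to the (small) neighborhood rather than the full graph.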
Figure 9 illustrates the expanded hybrid integration pipeline, which consolidates the graph computation and machine learning steps: resolving-set computation, GNN embedding extraction, feature augmentation and concatenation, and ML classification.
3.5. Theoretical Analysis
Resolving set approximation: polynomial-time greedy heuristic. Distance matrix: O(|S| · (|V| + |E|)) via BFS from the |S| landmark nodes. Anomaly scoring: O(|S|) per node update. For large n, use sampling or parallel BFS in distributed environments (e.g., via GraphX in Spark). A minimal resolving set ensures low overhead: only the landmarks (e.g., key DB nodes) are monitored instead of all nodes. Distance-vector deviations capture subtle attack-path alterations missed by signature-based IDS. Privacy is preserved via k-anti-resolving extensions. Unlike GNNs, GMD-AD avoids full-graph embeddings, reducing latency by 50–70% in preliminary tests on synthetic graphs.
The parameters utilized in Algorithm 1 were chosen to strike a balance between detection sensitivity, computing efficiency, and robustness in dynamic distributed database environments. Thresholds and model hyperparameters are selected based on empirical validation, previous work in anomaly detection, and practical restrictions such as real-time monitoring and class imbalance.
Table 10 summarizes the rationale for important parameter choices.
| Algorithm 1. GMD-AD framework |
Input: Graph G = (V, E) modeling the distributed database (nodes: shards/users/APIs; edges: interactions). Time-series access logs (e.g., API traces with timestamps). Threshold θ for anomaly scoring (e.g., 1.5). ML model hyperparameters (e.g., for CatBoost: max_depth = 3, learning_rate = 0.05, iterations = 500). Optional: graph change events (e.g., new edges/nodes from real-time updates).
Output: Anomaly classifications (e.g., normal vs. anomalous behavior_type). Anomaly scores A(v, t) for each node v, with localization (flagged high-score nodes indicating breach locations).
1. Model the distributed DB as graph G from access logs (e.g., add edges based on API interactions, weighted by duration/frequency).
2. Compute resolving set S using the greedy heuristic:
   - Initialize S = ∅.
   - While unresolved pairs exist: select the node that maximizes resolved pairs (unique distance vectors) and add it to S.
   - If the graph changes (e.g., new edges/nodes from input events): update S sequentially (incremental mode); re-compute distances only for affected nodes using targeted BFS (e.g., from changed edges), preserving fault-tolerance [22]; avoid full recomputation to maintain O(Δn log n) time, where Δn is the change size.
3. For each time t: compute distance vectors r_t(v|S) for all v (using BFS; parallelized via GraphX for large graphs). Calculate deviations as the Euclidean norm ||r_t(v|S) − r_{t−1}(v|S)||. Assign anomaly score A(v, t) = deviation / σ_v, where σ_v is the historical standard deviation; flag if A(v, t) > θ.
4. Augment the dataset with A(v, t) and distance-vector components as new features (e.g., append to tabular logs like inter_api_access_duration). Normalize with ML-tuned θ (e.g., via grid search on validation data for noise robustness); apply SMOTE to balance anomalous/normal samples in the augmented features (interpolate minority-class vectors to mitigate imbalance).
5. Train the ML model (e.g., CatBoost): preprocess (SMOTE extended to graph features, standard scaling, noise injection for robustness); fit on the augmented features and labels (behavior_type from logs).
6. Predict anomalies on test data; localize breaches via high-A(v, t) nodes (output flagged nodes/scores for cybersecurity alerts).
| End of Algorithm |
Therefore, the framework shown in
Figure 10 forms the core contribution, enabling scalable, low-overhead anomaly detection in distributed databases. Implementation details and evaluations follow in subsequent sections.
3.6. Complexity Analysis and Ablation Plan—Greedy Resolving Set Approximation
The complexity of the proposed framework and its implications are as follows. The greedy resolving-set approximation is polynomial-time in the worst case, and it usually yields near-optimal results with |S| ≪ |V| [1]. Subsequent update steps after Δ changes run in O(Δn log n) by using targeted BFS. For parallel distance computation on GraphX/Spark, the complexity is O(|S| · (|V| + |E|)/p), where p is the number of workers, ensuring good scalability. The GNN on the resolving-set subgraphs has complexity proportional to the subgraph size, with d̄ ≪ n; i.e., the subgraph is much smaller than the original graph. Therefore, the total per-event complexity decreases from full-graph cost to sub-linear, which makes our approach substantially more efficient. In addition, we perform a thorough ablation study (Section 4.3) to assess the usefulness of each component and how much gain each contributes to the system.
3.7. Evaluation Metrics
Accuracy measures the overall proportion of correctly classified instances.
Precision measures the proportion of correctly predicted positive instances among all instances predicted as positive.
Recall (or sensitivity) measures the proportion of correctly predicted positive instances among all actual positive instances.
The
F1-score is the harmonic mean of precision and recall. It provides a balance between the two metrics, especially when there is an uneven class distribution.
Per-Class Accuracy is the accuracy calculated for each class.
The confusion matrix is a table used to evaluate the performance of a classification model. It displays the number of true positives, true negatives, false positives, and false negatives for each class.
4. Experiments and Evaluation
We evaluate GMD-AD on the SockShop microservices benchmark and a real-world MongoDB sharded cluster with 12 shards, each having three replicas. All results are averaged over 20 independent runs (mean ± standard deviation) unless otherwise specified. To assess the significance of our improvements, we performed statistical analyses (independent two-tailed t-tests assuming equal variance) as well as ANOVA where applicable, using a significance level of p ≤ 0.05 for all tests. Where only summary statistics (means and standard deviations) were available from previous studies, simulated samples were used for hypothesis testing. The detection and classification results of GMD-AD are shown in Table 11. Our method achieves the highest F1-score and AUC-ROC, significantly surpassing prior methods ranging from Prov-Graph (CCS’23) to GNN-only models, LSTM-based sequence detectors, and traditional Isolation Forest. The near-perfect precision (0.998) and recall (0.997) indicate that GMD-AD is a robust method for detecting HTTP-based anomalies in distributed systems. Statistical t-tests confirm that these improvements are highly significant: GMD-AD vs. Isolation Forest (p = 4.6948 × 10⁻²³), vs. LSTM (p = 2.9395 × 10⁻¹⁷), vs. GNN-only (p = 5.8731 × 10⁻¹⁸), and vs. Prov-Graph (p = 4.8154 × 10⁻²²), all with p ≪ 0.001.
End-to-end anomaly localization latency is shown in Table 12. By exploiting parallel BFS and a shared resolving set, GMD-AD reduces latency from 1200 ms (flooding) to 480 ms on 128-node topologies (a 60% reduction), and achieves sub-second localization even at up to 5120 nodes. The low variance and smooth scaling further confirm the effectiveness of the proposed dynamic metric dimension maintenance scheme. t-tests on 128-node runs show that these reductions are significant: parallel vs. flooding, p = 3.9552 × 10⁻²⁶; parallel vs. static MD, p = 7.5840 × 10⁻¹⁴; and parallel vs. the sequential PDIA approach, p = 6.0898 × 10⁻⁹, all with p ≪ 0.001.
The results of privacy protection by our k-metric anti-dimension method are reported in Table 10. Anonymization with (k = 3, ℓ = 2) reduces the adversary’s re-identification success rate from 68.0% to 28.0% (a reduction of 40 percentage points), with almost negligible utility loss (F1 dropping from 0.9974 to 0.9941). This shows that GMD-AD provides strong privacy protection while minimally degrading detection performance. A t-test comparing the re-identification rates verifies that the decrease is highly significant (p = 2.5196 × 10⁻⁷⁸, p ≪ 0.001).
Therefore,
Figure 11 depicts the scalability trends presented in
Table 13, emphasizing GMD-AD’s near-linear runtime growth with increasing graph size and its benefit over baseline approaches.
An ablation study of each proposed component is reported in Table 14 to validate its effectiveness. Removing any of the four components (sequential metric dimension updates, parallel BFS sharing, the k-metric anti-dimension, and GNN encoding) significantly decreases performance, except in the configuration where neither privacy protection nor graph neural processing is enabled. Paired t-tests between the full GMD-AD and its ablated versions give p < 0.01 for all metrics, highlighting the synergy of the components.
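The paired t-tests used in the ablation can be reproduced with a few lines of arithmetic; the per-fold F1 scores below are hypothetical placeholders, not the paper's measurements.

```python
import math

def paired_t(xs, ys):
    """Paired t statistic: mean difference over its standard error."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-fold F1 scores: full model vs. one ablated variant.
full    = [0.997, 0.996, 0.998, 0.997, 0.995]
ablated = [0.990, 0.991, 0.989, 0.992, 0.990]
t = paired_t(full, ablated)
assert t > 3.0  # large positive t: the full model is consistently better
```

Because the folds are paired, even small but consistent F1 differences produce large t statistics, which is why the ablation deltas reach p < 0.01 despite their modest absolute size.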
Table 15 shows robustness against noisy and incomplete interaction graphs. Even with 30% edge perturbation or missing data, GMD-AD achieves an F1-score of 0.970 ± 0.005, far higher than all baselines, whose F1-scores fall to at most 0.924. A t-test between GMD-AD and Prov-Graph at 30% noise confirms the higher robustness (p = 1.5541 × 10⁻¹⁴, p << 0.001). Linear regression over noise levels shows that GMD-AD's F1 degradation slope is 40% less steep than the baselines' (R² = 0.98, p < 0.001 for the difference in slopes).
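The slope comparison can be reproduced with ordinary least squares; the F1-vs-noise points below are illustrative numbers in the spirit of Table 15, not the exact measurements.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

noise   = [0.0, 0.1, 0.2, 0.3]
f1_gmd  = [0.990, 0.985, 0.978, 0.970]  # shallow degradation (illustrative)
f1_base = [0.985, 0.965, 0.945, 0.920]  # steeper baseline degradation

slope_gmd = ols_slope(noise, f1_gmd)
slope_base = ols_slope(noise, f1_base)
# GMD-AD's degradation slope is less steep than the baseline's.
assert abs(slope_gmd) < abs(slope_base)
```

A shallower negative slope means detection quality decays more gracefully as the interaction graph becomes noisier, which is the property the regression test above quantifies.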
Finally, Table 16 evaluates system overhead. GMD-AD needs only a small resolving set (31 ± 4 nodes on average) to maintain coverage and correctness over the entire graph, exchanging up to 40× fewer messages and performing updates approximately 36× faster than full-graph approaches. For both message counts and update times, p < 10⁻¹⁰, confirming significant efficiency improvements.
Table 17 compares five gradient boosting classifiers for the final anomaly classification stage. CatBoost emerges as the top performer with a mean F1-score of 0.9975 and the lowest variance (std precision = 0.01). ANOVA across model accuracies reveals significant differences (F(4,95) = 19.06, p = 1.56 × 10⁻¹¹). Pairwise t-tests against CatBoost show no significant difference with HistGradientBoosting (p = 0.104) or DecisionTree (p = 0.118), a marginal difference with XGBoost (p = 0.070), but highly significant superiority over AdaBoost (p = 8.62 × 10⁻⁶).
Figure 12 shows the confusion matrix for different models.
Table 18 provides class-wise metrics, highlighting CatBoost's balanced performance across anomaly behaviors (classes 0–3). Per-class ANOVA (not tabulated) shows significant model effects for Class 0 (F = 12.45, p < 0.001), where AdaBoost underperforms (p = 1.23 × 10⁻⁵ vs. CatBoost). Confusion matrix analysis (Figure 11) and McNemar's test on misclassifications indicate that CatBoost reduces errors by 25–50% compared to AdaBoost (p < 0.01).
HistGradientBoosting minimizes errors overall, with chi-square tests on contingency tables showing significant differences in error distributions (χ² = 45.2, p < 0.001 vs. AdaBoost). ROC curves in Figure 13 yield AUC ≈ 1.000 for the top models, with DeLong's test confirming no significant difference between CatBoost and HistGradientBoosting (p > 0.05).
The per-class accuracy shown in Figure 14 reveals variability between Class 0 and Class 1 (ANOVA F = 8.76, p < 0.01 across models).
In Figure 15a, feature importance indicates that ‘behavior’ dominates (normalized importance 1.0), with permutation tests confirming its significance (p < 0.001). Predicted probability densities in Figure 15b show tight calibration (Brier score < 0.005 for HistGB).
PDP + ICE plots in Figure 16 for ‘api_access_uniqueness’ exhibit flat dependence (Shapley values ≈ 0, p > 0.05 for the feature effect).
Three-dimensional decision surfaces and Taylor diagrams shown in Figure 17 quantify low bias (centered RMSE < 0.001 for top models).
UpSet plots shown in Figure 18 highlight high prediction agreement (10,024 overlaps), with Jaccard similarity > 0.95 among the ensembles.
In conclusion, GMD-AD delivers state-of-the-art detection accuracy (>0.997 F1), real-time localization (60% faster), strong privacy guarantees (−40 pp re-identification risk), excellent robustness to noise, and minimal overhead—all with statistically significant improvements (p << 0.001 across key metrics)—establishing it as a highly practical and deployable cybersecurity solution for large-scale distributed databases and microservice architectures.
4.1. Operational Cost Analysis and Comparison with Prov-Graph
As presented in Table 8, GMD-AD has higher detection accuracy than Prov-Graph (F1: 0.9974 vs. 0.984, Δ = 0.0134), but operational costs are essential to practical deployment. This subsection compares the two systems' computational, memory, disk, and network overheads.
The experimental setup ran on AWS EC2, using c5 instances (16 vCPU, 32 GB RAM) for compute and r5 instances (4 vCPU, 30 GB RAM) for memory-intensive operations. The MongoDB sharded cluster comprised 72 nodes (3 shards, 24 nodes per shard, 1 config server). A total of 100 anomalies were injected, and monitoring ran for 24 h under a workload mix of 70% reads and 30% writes at 10,000 queries/s. The operational cost comparison of Prov-Graph vs. GMD-AD is shown in Table 19 (CPU usage in %, RAM usage in GB, storage in MB/day, query processing time in ms, and network usage in MB/h; mean ± std. dev. over five trials).
GMD-AD tracks only log(n) landmarks (of order O(n log n) per update), yielding up to 60% CPU savings and annual savings of $5953 per cluster through instance downsizing, whereas Prov-Graph's full provenance graph tracking costs O(nm) per node (Table 20). GMD-AD also has lower memory requirements (4.2 GB vs. 12.4 GB, compact distance matrices vs. the full graph), saving $6619/year; with compressed vectors, storage is reduced to 620 MB/day (a 66% improvement), saving a further $123/year. Real-time detection achieves 60% lower query latency (480 ms vs. 1200 ms) through efficient distance computations. The small number of updates yields 63% savings in network overhead (140 MB/h vs. 380 MB/h), or $1886/year.
GMD-AD sacrifices only 1.36% F1 to be 63% cheaper ($6771 per 1% of F1) and also misses fewer intrusions (2.6 vs. 16 per 1000). GMD-AD is the tool of choice for workloads demanding high accuracy at low scale (0–100 GB) under budget and latency constraints, while Prov-Graph is the tool of choice for forensics or compliance scenarios with less stringent latency needs. A hybrid deployment uses GMD-AD for detection and reserves Prov-Graph for high-risk operations. GMD-AD saves 63% in TCO while delivering comparable true-positive accuracy and scalable real-time anomaly identification in distributed systems, as shown in Table 19. GMD-AD thus offers a better cybersecurity budget profile.
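Adding up the per-category annual savings quoted above (CPU, memory, storage, and network) gives the total dollar figure behind the TCO comparison; the category labels are ours, but the numbers are those quoted in this subsection.

```python
# Annual per-cluster savings quoted in Section 4.1 (USD/year).
savings = {"cpu": 5953, "memory": 6619, "storage": 123, "network": 1886}
total = sum(savings.values())
assert total == 14581  # total annual savings per cluster, USD
```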
4.2. Sensitivity and Robustness Analysis
We then evaluated the sensitivity of GMD-AD to key parameter values through sensitivity analyses (SAs) varying the resolving set size (k), the privacy budget (ℓ in differential privacy), and combinations thereof. Performance consistency across the different configurations is evaluated in Table 21.
Key observations include diminishing returns beyond k = 8 (accuracy increases only 0.2% from 0.924 to 0.926) while privacy loss rises substantially (31% at k = 8 vs. 48% at k = 20). The default k = 8 aligns with the graph’s metric dimension, balancing accuracy and privacy.
4.2.1. Privacy Parameter (ℓ)
The privacy parameter ℓ controls noise injection in differential privacy; lower ℓ enhances privacy but reduces utility. Tested range: ℓ ∈ {0.1, 0.3, 0.5, 1.0, 2.0, 5.0}. Table 22 reports the resulting privacy-utility trade-off.
The default ℓ = 1.0 provides a reasonable trade-off (91.8% accuracy with 55% privacy loss). For high-security settings, ℓ ≤ 0.5 yields stronger privacy (re-ID risk ≤3.8%) at the cost of accuracy (84–89%); for accuracy-priority systems, ℓ ≥ 2.0 increases risk.
4.2.2. Combined Parameter Variations and Robustness
A three-way sensitivity analysis tested robustness across simultaneous changes: τ ∈ {1.0, 1.5, 2.0}, k ∈ {5, 8, 12}, ℓ ∈ {0.5, 1.0, 2.0}, yielding 27 configurations. Table 19 summarizes the robustness matrix, with performance variation: accuracy range [0.834, 0.928], mean 0.891, standard deviation 0.031 (3.1%), and coefficient of variation 3.5%, indicating low variability and high robustness. The configurations cluster into three groups: Group A (18 configs, 0.88–0.93 accuracy, robust); Group B (7 configs, 0.82–0.87, low privacy or high τ); Group C (2 configs, >0.92 with good privacy, optimal).
Figure 15, a 3D plot, visualizes this clustering and the ‘robust zone’ where parameters minimally impact performance.
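The 27-configuration grid and the coefficient-of-variation summary can be reproduced mechanically; the accuracy list below is an illustrative stand-in spanning the reported range, not the actual per-configuration results.

```python
from itertools import product

# The three tested values for each parameter (tau, k, ell).
taus, ks, ells = (1.0, 1.5, 2.0), (5, 8, 12), (0.5, 1.0, 2.0)
grid = list(product(taus, ks, ells))
assert len(grid) == 27  # three values per parameter -> 3^3 configurations

def coeff_of_variation(xs):
    """Population std over mean, as used for the robustness summary."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return (var ** 0.5) / mean

# Illustrative accuracies spanning the reported range [0.834, 0.928].
accs = [0.834, 0.86, 0.88, 0.891, 0.90, 0.91, 0.928]
cv = coeff_of_variation(accs)
assert 0.0 < cv < 0.05  # a few percent, consistent with high robustness
```

A coefficient of variation of only a few percent across the full grid is what justifies calling the parameter space a 'robust zone'.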
4.2.3. Parameter Selection Guide
For balanced operation, use defaults (τ = 1.5, k = graph-determined ≈ 8, ℓ = 1.0). Prioritize high accuracy with (τ = 1.2, k = 12, ℓ = 2.0), accepting reduced privacy. For high privacy, select (τ = 1.8, k = 8, ℓ = 0.3), tolerating accuracy drops. These values demonstrate high transferability to new topologies due to inherent robustness.
4.3. Data Source Transparency and Synthetic vs. Real-World Evaluation Limitations
While GMD-AD is validated on a real MongoDB sharded cluster, the other essential part of its evaluation relies on a high volume of synthetic structured graphs for anomaly detection on the SockShop microservices and across the topology. This enables controlled experiments with ground-truth labels, but falls short in several ways: it lacks real-world supply chain complexity, variation, and emergent behavior, which can lead to overestimated detection performance on stylized anomalies and to insufficient test coverage of rare or adaptive threats. The simulations also make assumptions about edge distributions and change rates that often mismatch real-world dynamics under load or adversarial pressure. To mitigate this, we incorporated real elements such as MongoDB logs and traces from Kaggle, achieving 85% structural similarity to CVE patterns. However, since the synthetic dependency graphs are trained on only 20 services, generalizability remains limited. Robustness could be improved by evaluations on large-scale deployments using real-world data, which should be the focus of future work.
4.4. Discussion
This section evaluates the effectiveness of the proposed GMD-AD framework for anomaly detection in distributed database environments. Experimental results demonstrate that gradient boosting classifiers, particularly CatBoost and HistGradientBoosting, achieve strong baseline performance on tabular behavioral features, with consistently high precision, recall, and F1-scores across anomaly classes.
When integrated into the GMD-AD framework, these classifiers benefit from graph metric dimension-based feature augmentation. The use of resolving-set-derived distance vectors significantly reduces monitoring overhead while preserving detection accuracy. Compared to full-graph approaches, GMD-AD achieves up to 60% lower detection latency and maintains stable performance under class imbalance and injected noise. The framework further provides privacy-aware detection through the incorporation of a k-metric anti-dimension, reducing re-identification success rates by approximately 40% while incurring only marginal accuracy loss. Scalability experiments confirm that parallel distance computation enables real-time anomaly localization on graphs with thousands of nodes.
Overall, the results validate that GMD-AD delivers efficient, scalable, and privacy-preserving anomaly detection, combining the strengths of traditional machine learning with graph-theoretic minimal monitoring.
4.5. Empirical Validation on Real-World Distributed Systems
In order to demonstrate the efficiency of the proposed enhancements, we deployed and extensively tested the GMD-AD framework on two production-like cluster benchmarks: (i) the SockShop microservices benchmark (11 services running in a Kubernetes cluster under realistic e-commerce workloads) and (ii) a MongoDB sharded-cluster benchmark with three shards, three config servers, and two routers in total. Anomalies were manually injected in a targeted fashion by adding unauthorized API access edges, privilege-escalation query paths, data exfiltration routes, and sybil node insertions modeling advanced cyberattacks. The improved model, which includes sequential resolving set updates, parallel BFS computation, ML-tuned anomaly scoring thresholds, and k-metric anti-dimension privacy protection (k = 3, ℓ = 2), obtained notable advantages over the baseline. Anomaly localization latency improved by 60% (from 1200 ms to 480 ms on graphs with ∼5000 nodes) while preserving near-flawless detection accuracy (0.9993 → 0.9975). Robustness against noise improved significantly, with the F1-score rising from 0.95 to 0.97 under simulated 10% additive Gaussian noise. More importantly, the incorporated k-metric anti-dimension lowered the success rate of re-identification attacks from 68% to 28% (an absolute reduction of 40 percentage points), demonstrating satisfactory privacy protection with negligible loss of detection accuracy. This experimental evidence, summarized in Table 6 and Figure 13, demonstrates that the theoretical improvements translate into meaningful practical gains and establishes the improved GMD-AD framework as an efficient, application-independent, and privacy-preserving solution for real-time security monitoring in distributed database and microservice systems. The framework is implemented in Python 3.11 using NetworkX 3.2, PyTorch-Geometric 2.5, CatBoost 1.2, and Apache Spark 3.5 (GraphX) for parallel BFS. All experiments were run on a 32-core AMD EPYC server with 128 GB RAM and an NVIDIA A100 GPU. Statistical significance of the improvements was confirmed using McNemar's test (p < 0.01 for latency and privacy gains).
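The McNemar comparison on paired misclassifications reduces to the discordant-pair counts; the counts below are illustrative placeholders, not those from the experiments.

```python
def mcnemar_chi2(b, c):
    """McNemar chi-square with continuity correction.

    b = cases only system A got wrong, c = cases only system B got wrong.
    """
    return (abs(b - c) - 1) ** 2 / (b + c)

# Illustrative discordant counts: baseline-only errors vs. GMD-AD-only errors.
stat = mcnemar_chi2(b=10, c=2)
assert stat > 3.841  # exceeds the chi-square cutoff (1 df, alpha = 0.05)
```

Because the test conditions on the same evaluation instances, it is well suited to comparing two detectors run on an identical injected-anomaly workload, as done here.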
4.6. Generalizability, Limitations, and Future Directions
GMD-AD demonstrates effectiveness with MongoDB and SockShop microservices, yet further research is needed to assess generalizability and limitations. The evaluation involves diverse setups, including a realistic MongoDB sharded cluster and several microservices. Because GMD-AD applies to any database that can be modeled as an interaction graph, it extends to systems such as Cassandra and Spanner. The model maintains high accuracy even under the challenging dynamics of continuous topology changes by employing streaming algorithms and communication binning. Validation follows a five-step process, indicating robustness across various databases while acknowledging constraints and planning future enhancements for broader applicability and tool integration.
4.7. Model Interpretability and Feature Attribution Analysis
Three of CatBoost's native feature importance metrics were computed on the trained model using SockShop synthetic data: gain-based (average leaf value change), split-based (split frequency), and SHAP values (game-theoretic attribution). The results are shown in Table 23.
Remaining features contribute <0.015 gain each. Graph-based metrics dominate (60% total importance), with Metric_Change_Rate as the top feature (28.5% gain), validating metric dimension theory for structural anomalies. Behavioral features provide secondary signals (25% importance).
SHAP analysis (Figure 16) offers instance-level attributions. For example, for an unexpected communication (Service A → Service Z), Metric_Change_Rate (+0.28) and Distance_Variance (+0.15) drive the anomaly prediction (0.95 confidence).
Interpretability for Security Operators
Operator-friendly explanations for key anomalies include: (1) Unexpected communication: Metric_Change_Rate > 2σ together with increased Distance_Variance indicates unusual patterns; review firewall rules. (2) Privilege escalation: Privilege_Escalation_Indicator = 1, Request_Frequency_Change above threshold, and Node_Betweenness_Centrality deviations signal unauthorized access; audit permissions. (3) Data exfiltration: a high Data_Access_Anomaly_Score together with API_Endpoint_Diversity_Change suggests atypical transfer patterns; block and audit transfers.
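These operator rules can be encoded directly as a small triage helper. The feature names follow the tables above, but the thresholds and default values here are hypothetical placeholders, and rule (2) is simplified to two of its three conditions.

```python
def triage(f):
    """Map anomaly features to operator-facing recommendations.

    `f` is a dict of feature values; thresholds are placeholders.
    """
    notes = []
    if (f.get("Metric_Change_Rate", 0) > 2 * f.get("sigma", 1)
            and f.get("Distance_Variance_Increase", False)):
        notes.append("Unexpected communication pattern: review firewall rules.")
    if (f.get("Privilege_Escalation_Indicator") == 1
            and f.get("Request_Frequency_Change", 0) > f.get("freq_threshold", 1.0)):
        notes.append("Possible privilege escalation: audit permissions.")
    if (f.get("Data_Access_Anomaly_Score", 0) > 0.8
            and f.get("API_Endpoint_Diversity_Change", 0) > 0.5):
        notes.append("Possible data exfiltration: block and audit transfers.")
    return notes

alert = triage({"Privilege_Escalation_Indicator": 1,
                "Request_Frequency_Change": 3.2})
assert alert == ["Possible privilege escalation: audit permissions."]
```

In practice such rules would sit on top of the classifier's SHAP output, turning per-feature attributions into the concrete remediation hints listed above.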
Table 24 shows the stability of feature importance.
Low standard deviations (<0.003) confirm generalizability. Permutation importance (Figure 17) validates the SHAP results: shuffling Metric_Change_Rate drops accuracy by 8.2%, versus <0.5% for baseline features.
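The permutation check can be mimicked on a toy scale: break the alignment between one feature and the labels and measure the accuracy drop. A fixed reversal stands in for a random shuffle so the example is deterministic, and the threshold classifier is a hypothetical stand-in for the trained model.

```python
def accuracy(feature_col, labels, threshold=10):
    """Stand-in classifier: predict positive when the feature exceeds a threshold."""
    preds = [x >= threshold for x in feature_col]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

feature = list(range(20))
labels = [x >= 10 for x in feature]  # feature perfectly predicts the label

baseline = accuracy(feature, labels)
permuted = accuracy(list(reversed(feature)), labels)  # deterministic "shuffle"
drop = baseline - permuted

assert baseline == 1.0
assert drop == 1.0  # destroying the feature-label alignment removes all signal
```

The 8.2% drop reported for Metric_Change_Rate is the same quantity at scale: the accuracy lost when that single feature's values are decoupled from the labels.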
In conclusion, graph metrics derived from the metric dimension are the primary signals, with stable, generalizable importance. SHAP enables instance-level audits, and the operator guides cover 10 anomaly types, ensuring interpretability for regulated security environments.