Article

Adaptive Ensemble Machine Learning Framework for Proactive Blockchain Security

by Babatomiwa Omonayajo 1,2,*, Oluwafemi Ayotunde Oke 1,2,* and Nadire Cavus 1,2
1 Department of Computer Information Systems, Near East University, Nicosia 99138, Cyprus
2 Computer Information Systems Research and Technology Centre, Nicosia 99138, Turkey
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10848; https://doi.org/10.3390/app151910848
Submission received: 28 August 2025 / Revised: 2 October 2025 / Accepted: 6 October 2025 / Published: 9 October 2025

Abstract

Blockchain technology has rapidly evolved beyond cryptocurrencies, underpinning diverse applications such as supply chains, healthcare, and finance, yet its security vulnerabilities remain a critical barrier to safe adoption. Attackers increasingly exploit weaknesses in consensus protocols, smart contracts, and network layers with threats such as Denial-of-Chain (DoC) and Black Bird attacks, posing serious challenges to blockchain ecosystems. We conducted anomaly detection using two independent datasets (A and B) generated from simulated attack scenarios including hash rate, Sybil, Eclipse, Finney, and Denial-of-Chain (DoC) attacks. Key blockchain metrics such as hash rate, transaction authorization status, and recorded attack consequences were collected for analysis. We compared both class-balanced and imbalanced datasets, applying the Synthetic Minority Oversampling Technique (SMOTE) to improve the representation of minority-class samples and enhance performance metrics. Supervised models such as Random Forest, Gradient Boosting, and Logistic Regression consistently outperformed unsupervised models, achieving high F1-scores (approximately 0.90), while balancing the training data had only a modest effect. The results are based on a simulated environment and should be considered preliminary until the experiments are repeated in a real blockchain environment. Based on the identified gaps, we recommend the exploration and development of multifaceted defense approaches that combine prevention, detection, and response to strengthen blockchain resilience.

1. Introduction

Blockchain is a revolutionary technology, as it offers a multitude of desirable features such as decentralization, immutability, verification, and transparency, which make it popular in various sectors [1,2]. With the introduction of Bitcoin, this transformative technology began as a financial innovation, allowing users to conduct transactions without centralized authorities [3]. However, the development of blockchain has gone beyond the financial sector, evolving into a decentralized system for storing and securing records across diverse domains, which has introduced new, domain-specific concerns [4]. Blockchain has been described as the fifth revolutionary innovation in the history of computing, following mainframes, PCs, the internet, and mobile/social networking [5]. Blockchain technology extends its applications across various sectors, such as financial technology (Fintech), automated transportation, medical care, and smart housing.
Although blockchain technology was originally perceived as secure due to its distributed nature, it now faces unique security challenges and threats [6]. As individuals and organizations place their trust in blockchain technology to protect not just cryptocurrencies but also sensitive data and to enable seamless transactions and exchanges [7], understanding the range of threats and vulnerabilities that could compromise the trustworthiness of blockchain is crucial. Recent studies have proposed detection and defense approaches, but these typically target specific attack vectors in isolation, lack validation across diverse operational conditions, and rarely include detailed computational efficiency analyses or proactive strategies [8]. This gap suggests the need for an integrated approach that can proactively defend against a broader spectrum of attacks while maintaining computational efficiency for real-world deployment [9,10,11,12,13,14].
This study assesses the effectiveness of both class-balanced and imbalanced approaches and reviews existing research on blockchain security threats and countermeasures. It addresses the identified gaps by evaluating multiple machine learning-based Anomaly Detection (AD) models on simulated attack datasets to guide the development of a defense mechanism. While prior studies have focused heavily on isolated detection methods that lack validation across diverse operational conditions, there remains a gap in developing integrated, proactive defense frameworks. By addressing this gap, the present study supports blockchain developers, researchers, and regulatory bodies in strengthening trust, scalability, and security across blockchain ecosystems. The research addresses these aims through machine learning analysis, performance metric evaluations, and anomaly detection analysis. The remainder of the paper presents the literature review, methodology, results, discussion, and conclusion, respectively.

2. Literature Review

The existing literature on blockchain security has discussed numerous types of attacks, from consensus-level attacks to application-layer attacks. In the context of blockchain, an attack refers to any deliberate attempt to exploit weaknesses in the system’s consensus, networking, or application layers in order to disrupt normal operation and gain unauthorized access [15]. The rapid adoption of blockchain in various industries has led to concerns about threats and scams, including Bitcoin generator scams, phishing schemes, Ponzi schemes, and fake Initial Coin Offerings (ICOs) [9,10]. The blockchain ecosystem also faces attacks that compromise its security, including communication interference threats such as 51% hash rate attacks, denial-of-service attacks, Denial-of-Chain attacks, and, more recently, Black Bird attacks [16,17,18], underscoring the urgent need for integrated defense mechanisms capable of addressing multiple threat types.
At the consensus level, attacks such as Denial-of-Chain (DoC) have been studied by Bordel et al. [17], who introduced a graph-theoretical approach to predict and mitigate block propagation disruptions. Similarly, attention has been paid to majority-control threats, such as the “Black Bird” 51% attack [18] and quantum-assisted majority attacks [11], which show how increasing computational power can leave proof-of-work systems vulnerable. Wang et al. [19] also address forking attacks, suggesting both economic and governance mechanisms to deter malicious miners. In permissioned settings, La Salle et al. [20] simulated the impact of Sybil attacks on Hyperledger Fabric, demonstrating how transaction throughput drops as the network is infiltrated. Combined, these studies highlight vulnerabilities in both the consensus and network layers of blockchain.
Meanwhile, application-layer attacks and social engineering schemes continue to pose a real risk. Badawi et al. [16] studied the Bitcoin generator scam and created a crawler-based approach that uncovered thousands of malicious Bitcoin addresses linked to millions in thefts. Similarly, Hong et al. [21] investigated cryptojacking and detected around 1800 malicious websites among hundreds of thousands of popular domains. Albakri and Mokbel [22] tackled malware in another way, proposing biometric-based authentication to secure wallets against key-stealing malware. General threats including ransomware, pump-and-dump schemes, and money laundering are discussed in [9], while privacy attacks are addressed in [12], which presented RZee for detecting insider adversaries while preserving user anonymity. These papers illustrate that blockchain ecosystems are becoming more susceptible to both technical and human-driven attacks.
The methods in the literature can be classified into proactive protocol-level adaptations and reactive detection-based approaches. Bordel et al. [17] and Wang et al. [19] exemplify the preventive approach, in which security is built into the blockchain design through mathematical modeling and consensus changes that reduce the feasibility of attacks. Most works, however, tackle the anomaly detection and monitoring problem. For example, Agarwal et al. [13] proposed machine learning classifiers to differentiate between malicious and benign Ethereum accounts, and Junejo et al. [12] combined privacy preservation with adversary filtering. Hong et al. [21] and Badawi et al. [16] applied automated detection to the web landscape, identifying cryptojacking and scam domains, respectively. Ali et al. [14] introduced a Hyperledger-based approach for secure threat information sharing, indicating a shift from a need-to-know to a need-to-collaborate model. Detection approaches demonstrate strong potential, whereas preventive ones remain largely overlooked and less tested in practice.
From a methodological perspective, research covers theoretical modeling, simulation, and empirical analysis. Bordel et al. [17] and La Salle et al. [20] model attack propagation and perform simulation-based analysis using graph theory and Petri nets, while Wang et al. [19] assess the effectiveness of defense strategies in controlled settings. Empirical methods are increasingly apparent: Hong et al. [21] performed internet-scale cryptojacking detection, Agarwal et al. [13] applied machine learning algorithms to Ethereum datasets, and Badawi et al. [16] gathered live scam data for sixteen months. Albakri and Mokbel [22] and Junejo et al. [12] presented new defenses using biometric datasets and privacy-preserving schemes, respectively. In addition, the surveys of Chen et al. [5], Guo and Yu [3], and Badawi and Jourdan [9] synthesize work across knowledge areas and establish baselines for research on blockchain security.
Clear patterns and limitations emerge from these studies when viewed collectively. The literature is dominated by detection-focused systems, which commonly work in isolation and are mostly evaluated in simulated or prototype environments. Preventive frameworks exist but are rarely validated in real-world deployments. Moreover, many contributions are platform-specific, such as Hyperledger Fabric or Ethereum, which restricts generalization across blockchain platforms. Understudied threats, including bribery, dusting, and emerging post-quantum attacks, also remain largely unexplored. In summary, prior research lacks validation of preventive frameworks on real-world blockchains, evaluations are often limited to single attack variants, which prevents generalization across attack vectors, and emerging threats such as post-quantum attacks receive little attention.
In this context, the present study contributes by integrating preventive and detection-oriented strategies, evaluating them systematically across multiple simulated attack types, and emphasizing computational efficiency and scalability. At the protocol layer, emerging post-quantum threats motivate quantum-resilient methods such as forward-secure key rotation and lattice-based signatures, which can be adopted without changing the learning and anomaly detection pipeline of our framework. By addressing both theoretical robustness and practical applicability, the proposed framework aligns with calls in the literature for more holistic and proactive blockchain defenses.

3. Methodology

3.1. Research Model

This study adopts a machine learning-based anomaly detection approach for blockchain security. The workflow includes data simulation, pre-processing and feature balancing, supervised and unsupervised model training, cross-validation (CV), metric evaluation, comparative analysis, and benchmarking of performance under balanced and imbalanced conditions (Figure 1).

3.2. Blockchain Attacks Used

The review of the existing literature led to the identification of new attack variants. The following attacks were simulated in this study: Black Bird (51% hash rate (HR) and Black Bird Embedded Double Spending Attack (BBEDSA) variants), Denial-of-Chain (DoC), Sybil, Eclipse, and Finney. These were selected because they represent both consensus-level (Black Bird, DoC) and network-level (Sybil, Eclipse) vulnerabilities, alongside a temporal anomaly (Finney). Their inclusion ensures broad representation of the major categories of blockchain threats. These attacks also highlight evolving threats and the need for advanced defense methods in blockchain systems.

3.3. Data Collection and Pre-Processing

Two datasets were collected for this analysis: Dataset A (primary) was generated from the first simulation run, while Dataset B was generated from a second, independent simulation to enable cross-dataset validation. Both datasets were extracted from a simulated blockchain attack environment. The simulations imitated Black Bird 51% hash rate, BBEDSA, DoC, Sybil, Eclipse, and Finney attacks. Each simulation recorded key metrics representative of blockchain operations, such as computational power (hash_rate), the success or failure of malicious transaction attempts (malicious_transactions_added), and the state of the chain before and after an attack (blocks_before_attack and blocks_after_attack). Network traffic, system calls, and alerts for attempted security breaches were logged and time-stamped across multiple simulated attack scenarios. The output of the simulated environment served as the data source for the machine learning analysis.
Figure 2 illustrates how Datasets A and B were generated from the simulated environment, processed, and used for machine learning (ML) evaluation. The datasets are created from a small number of crafted feature vectors extracted from simulated blockchain-attack events, supplemented by a much larger number of synthetic samples obtained by random sampling within realistic bounds. They are preprocessed (feature selection, scaling), optionally balanced with Adaptive Synthetic Sampling (ADASYN) in addition to the Synthetic Minority Oversampling Technique (SMOTE), and serve as input to a set of anomaly detection models whose performance is compared under balanced and imbalanced scenarios. The datasets were generated using a custom Python simulation of a proof-of-work blockchain with 100 nodes, configured to emulate different attack types and vectors, such as the 51% hash rate attack and Black Bird attacks. Simulation parameters (block interval, network latency, and hash rate distribution) were chosen to match typical blockchain-like networks (Table 1).
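As an illustration of this generation step, the sketch below creates synthetic records within plausible bounds for the features listed in Table 1; the specific ranges, sample count, and file name are assumptions for illustration, not the authors' exact generator.

```python
# Illustrative sketch only: synthesize records with the Table 1 schema inside
# assumed "realistic bounds"; the actual study derives these from attack simulations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 5000  # assumption: roughly matches the ~5005 records reported

df = pd.DataFrame({
    "hash_rate": rng.uniform(10, 1000, n_samples),              # computational power per node
    "malicious_transactions_added": rng.integers(0, 2, n_samples),
    "nodes_after_sybil_attack": rng.integers(50, 101, n_samples),
    "nodes_after_eclipse_attack": rng.integers(50, 101, n_samples),
    "block_added_after_finney": rng.integers(0, 2, n_samples),
    "transaction_authorized_finney": rng.integers(0, 2, n_samples),
    "blocks_before_attack": rng.integers(100, 1000, n_samples),
})
df["blocks_after_attack"] = df["blocks_before_attack"] + rng.integers(-5, 6, n_samples)
df["doc_attack_identified"] = rng.integers(0, 2, n_samples)     # target label
df.to_csv("dataset_A_synthetic.csv", index=False)               # hypothetical output file
```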
Dataset A is the main dataset for the study, providing logs and metrics from the first independent simulation run of the blockchain attack environment. It records hash rates, attack results, network conditions, and DoC attack diagnoses. It was used for training and cross-validation (CV) of the models and serves as the benchmark for model performance. Dataset B was generated from a second simulation run and used for cross-dataset validation, to evaluate how well models trained on Dataset A generalize to new attack patterns and conditions. To assess generalizability beyond a single acquisition, we repeated the full training-validation protocol on Dataset B as an independent simulation run, matched in logging format and feature schema to Dataset A. We report model performance separately for each dataset and compare summary statistics across datasets to quantify stability under changed operating conditions, without reusing test folds or contaminating evaluation data. Following the literature review, the methodology incorporates an evaluation phase using the simulated datasets, in which end-to-end computational efficiency (feature extraction, pre-processing, model fitting/inference, memory footprint, and artifact size) is measured to reflect real-world deployment costs. Evaluation was performed using 10-fold cross-validation over five different random seeds to compute mean performance metrics and Standard Deviation (SD), providing statistical robustness to the results. The dataset contained nine features across approximately 5005 synthetic simulated records. We removed constant or near-constant features using a variance threshold and then standardized features to ensure similar scaling across machine learning models. When any class could not support 10 stratified folds, the number of folds K was reduced to the largest valid value (K ≥ 2) while preserving stratification. Class balancing (SMOTE/ADASYN) was applied only to the training folds to avoid information leakage; validation/test folds remained untouched. This ensures a fair comparison between the imbalanced and class-balanced settings under identical split indices.
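A minimal sketch of this protocol, assuming the simulated logs are exported to a CSV file, is shown below; the imblearn Pipeline re-fits SMOTE inside each training fold, so validation folds are never oversampled.

```python
# Sketch of the evaluation protocol: stratified 10-fold CV repeated over five seeds,
# variance filtering and scaling, with SMOTE applied only inside the training folds.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

df = pd.read_csv("dataset_A.csv")                       # assumption: simulated logs exported to CSV
X = df.drop(columns="doc_attack_identified").values
y = df["doc_attack_identified"].values

scores = []
for seed in range(5):                                   # five random seeds
    pipe = Pipeline([
        ("variance", VarianceThreshold(1e-6)),          # drop near-constant features
        ("scale", StandardScaler()),
        ("smote", SMOTE(random_state=seed)),            # fitted on training folds only
        ("clf", RandomForestClassifier(n_estimators=200, max_depth=10, random_state=seed)),
    ])
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores.extend(cross_val_score(pipe, X, y, cv=cv, scoring="f1"))

print(f"F1 = {np.mean(scores):.3f} ± {np.std(scores):.3f}")
```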

3.4. Machine Learning and Hybrid Models Used

We examined the performance of seven models, combining both supervised and unsupervised techniques to conduct a thorough evaluation of blockchain anomalies. The models used are listed below:
  • One-Class Support Vector Machine (OCSVM) (Unsupervised).
  • Random Forests (RF) (Supervised).
  • K-Nearest Neighbors (KNN) (Supervised).
  • Isolation Forest (IF) (Unsupervised).
  • Gradient Boosting (GB) (Supervised).
  • Logistic Regression (LR) (Supervised).
  • Autoencoder (Unsupervised).
To ensure a fair and replicable selection of the hyperparameters shown in Table 2, all tuning was performed using fixed seeds and similar budgets across all models. For models such as Random Forest and Gradient Boosting, grid search over pre-defined hyperparameter values was used to find the optimal configuration. The hyperparameters of the other models, such as One-Class SVM and the Autoencoder, were selected based on domain knowledge and iterative runs. Beyond the grids run for RF and GB, we conducted sensitivity analyses for KNN (K and weighting), LR (C, class weighting), One-Class SVM (ν, γ), and Isolation Forest (n_estimators, contamination) (Table 2).
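A minimal grid search sketch for the RF and GB tuning is shown below; the candidate values and the stand-in data are assumptions for illustration, with Table 2 listing the configurations ultimately selected.

```python
# Sketch of fixed-seed grid search for Random Forest and Gradient Boosting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stand-in training data; in the study this would be a preprocessed training fold.
X_train, y_train = make_classification(n_samples=500, n_features=9, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 200, 300], "max_depth": [5, 10, None]},   # assumed grid
    scoring="f1", cv=cv)
gb_search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"n_estimators": [100, 150, 200], "learning_rate": [0.01, 0.05, 0.1], "max_depth": [2, 3, 4]},
    scoring="f1", cv=cv)

rf_search.fit(X_train, y_train)
gb_search.fit(X_train, y_train)
print(rf_search.best_params_)   # Table 2 reports n_estimators=200, max_depth=10 as the choice
print(gb_search.best_params_)
```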
In addition to single classifiers, ensemble methods (Bagging, Boosting, Stacking, and Voting) were utilized to increase robustness. Furthermore, an Adaptive Security Response (ASR) framework was integrated with selected models (RF, GB, LR, IF, Autoencoder). ASR enabled real-time defense by linking anomaly predictions to proactive countermeasures such as transaction isolation, dynamic consensus difficulty, and node throttling. These hybrid approaches were chosen because they reduce overfitting and stabilize performance across folds, while ASR transforms detection into practical protection [23].
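The class below is a hypothetical sketch of how such an ASR policy layer could map anomaly scores to the listed countermeasures; it is illustrative only and not the authors' implementation.

```python
# Hypothetical ASR policy layer: escalate from allowing a transaction to isolating it,
# throttling the originating node, and hardening consensus as the anomaly score grows.
from dataclasses import dataclass, field

@dataclass
class AdaptiveSecurityResponse:
    difficulty: float = 1.0
    quarantined_txs: set = field(default_factory=set)
    throttled_nodes: set = field(default_factory=set)

    def respond(self, tx_id: str, node_id: str, anomaly_score: float, threshold: float = 0.5) -> str:
        """Trigger escalating defenses as the anomaly score grows (thresholds assumed)."""
        if anomaly_score < threshold:
            return "allow"
        self.quarantined_txs.add(tx_id)            # isolate the suspicious transaction
        if anomaly_score > 0.8:
            self.throttled_nodes.add(node_id)      # rate-limit the originating node
            self.difficulty *= 1.25                # temporarily raise consensus difficulty
        return "quarantine"

# Example: asr = AdaptiveSecurityResponse(); asr.respond("tx42", "node7", anomaly_score=0.9)
```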

3.5. Experimental Design for Model Evaluation

This experiment was designed to evaluate the performance of ML models for anomaly detection in blockchain datasets under two settings (class-balanced and imbalanced). Features such as hash_rate, malicious_transactions_added, and doc_attack_identified are captured in the dataset generated from the various blockchain attack simulations.
In the first setup, class balancing methods such as SMOTE and ADASYN were applied to correct data imbalance, while the second setup used the original imbalanced dataset to mimic real-world conditions. This design enabled a comparative study between supervised (Random Forest, Gradient Boosting) and unsupervised (Isolation Forest, Autoencoder) models. For unsupervised models, the negative of the decision score was treated as an anomaly measure, and percentile thresholds (50–99%) were swept to label anomalies in the validation split, selecting operating points that balanced precision and recall. Finally, test fold results are reported across the sweep to reveal threshold sensitivity rather than a single, potentially brittle cut-off.
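A sketch of this percentile sweep, using Isolation Forest and stand-in data for illustration, is shown below.

```python
# Sketch of the threshold sweep for unsupervised detectors: negate the decision score
# so that higher values mean "more anomalous", then sweep percentile cut-offs.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Stand-in data; in the study the scores come from the simulated blockchain features.
X, y = make_classification(n_samples=1000, n_features=9, weights=[0.8], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

iso = IsolationForest(n_estimators=200, contamination=0.1, random_state=0).fit(X_train)
scores = -iso.decision_function(X_val)            # negate: higher score = more anomalous

best_q, best_f1 = None, -1.0
for q in range(50, 100, 5):                       # percentile cut-offs in the 50-99% range
    threshold = np.percentile(scores, q)
    y_pred = (scores >= threshold).astype(int)    # 1 = flagged anomaly
    score = f1_score(y_val, y_pred)
    if score > best_f1:
        best_q, best_f1 = q, score

print(f"best percentile q = {best_q}, F1 = {best_f1:.3f}")
```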
Per-attack test fold metrics were also reported. In each test fold, we computed metrics disaggregated by attack_type (e.g., hash rate, DoC, Sybil, Eclipse, Finney, smart contract) to understand how the detectors behave across threat classes. This per-attack reporting complements the aggregate scores by exposing attack-specific robustness.
Figure 3 provides a systematic framework for attack simulation, data preparation, and model evaluation.

3.6. Performance Metrics

This study used various measures to assess machine learning models’ performances on anomaly detection and classification tasks. We used accuracy, precision, recall, F1 score, and anomaly detection percentage (AD%) to assess classification performance and sensitivity to anomalies. These metrics give an overall insight into the success of the models’ performance under varying sets of conditions.
  • Accuracy measures the proportion of correctly classified instances among all samples, but it can be less informative when classes are imbalanced.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Precision divides true positives by total positive predictions, showing the accuracy of positive estimates.
Precision = TP / (TP + FP)
  • Recall is the ratio of true positives to all actual positives, reflecting the model’s capacity to detect all positive instances.
Recall (sensitivity) = TP / (TP + FN)
  • F1 score strikes a compromise between recall and precision, particularly useful for handling imbalanced datasets.
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
  • Anomaly Detection (AD) is the proportion of data points identified as anomalies relative to the total number of test samples, providing a measure of model sensitivity.
AD% = (TP + FP) / N × 100
where
TP = True Positives
TN = True Negatives
FP = False Positives
FN = False Negatives
N = Total number of samples
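These definitions translate directly into code; the helper below (a convenience function, not taken from the paper's codebase) computes all five measures from confusion-matrix counts.

```python
# Compute the metrics above, including the non-standard AD% measure.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    n = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "ad_percent": 100.0 * (tp + fp) / n,   # share of samples flagged as anomalies
    }
```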
Statistical significance between the class-balanced and imbalanced settings was assessed with a paired t-test and a Wilcoxon signed-rank test over matched folds and seeds (α = 0.05). We report the mean ± SD and 95% confidence intervals (t-based) for each metric and model; non-parametric p-values are included where distributional assumptions may not hold.
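A sketch of this testing procedure, assuming arrays of matched per-fold/per-seed F1 scores, is shown below.

```python
# Paired t-test and Wilcoxon signed-rank test over matched fold/seed scores,
# plus a t-based 95% confidence interval for the mean difference.
import numpy as np
from scipy import stats

def compare_balanced_vs_imbalanced(f1_balanced, f1_imbalanced, alpha=0.05):
    bal, imb = np.asarray(f1_balanced), np.asarray(f1_imbalanced)
    diff = bal - imb
    t_stat, t_p = stats.ttest_rel(bal, imb)
    w_stat, w_p = stats.wilcoxon(bal, imb)      # assumes not all differences are zero
    ci_low, ci_high = stats.t.interval(0.95, len(diff) - 1,
                                       loc=diff.mean(), scale=stats.sem(diff))
    return {"mean_diff": diff.mean(), "ci95": (ci_low, ci_high),
            "paired_t_p": t_p, "wilcoxon_p": w_p, "significant": t_p < alpha}
```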

3.7. Computer Configuration

All experiments were implemented in Python (version 3.13.7) using libraries including numpy, pandas, matplotlib, scikit-learn, imbalanced-learn (imblearn), tensorflow/keras, hashlib, etc. The experiments were executed on a Microsoft Corporation (Redmond, WA, USA) Windows 10 operating system with Intel Corporation (Santa Clara, CA, USA) Core i7 processor on HP Inc. (Palo Alto, CA, USA) ProBook laptop with 16 Gigabyte (GB) Random Access Memory (RAM), and NVIDIA Corporation (Santa Clara, CA, USA) RTX 4050 Graphics Processing Unit (GPU) with 6 Gigabyte (GB) Video Random Access Memory (VRAM). For reproducibility, we included the files and datasets in the repository.

4. Results

We report two aspects of this experimental evaluation: the initial single-dataset baseline and the 10-fold, five-seed cross-validated results with per-attack metrics, threshold sensitivity, cross-dataset validation, and computational efficiency profiling. The number of samples after SMOTE was 4618 (balanced); class balancing (SMOTE/ADASYN) was applied only to the training folds to avoid leakage, and stratified CV reduced the number of folds automatically if a minority class could not support K = 10.

4.1. Initial Single Dataset Analysis (Baseline)

This single-split analysis was conducted prior to the full 10 × 5 cross-validated study and is retained as a descriptive baseline; the stratified 10 × 5 cross-validation is used for all findings. The dataset imbalances are evident from the balanced-class results (Table 3), showing that the application of SMOTE is advantageous. An increased proportion of Class 0 examples during training led to higher recall (0.65 for Random Forest, 0.76 for Gradient Boosting) and a higher percentage of anomalies detected (0.65 for Random Forest, 0.76 for Gradient Boosting), further showing that more balanced training samples better capture minority-class patterns. The additional synthetic samples helped improve the generalization of supervised models such as Logistic Regression and K-Nearest Neighbors. The Autoencoder, on the other hand, shows an extremely low recall (0.05) and anomaly detection percentage (0.05) because it is highly sensitive to synthetic patterns.
Table 4 shows extreme (perfect or near-zero) scores, indicating that imbalance drives overfitting on a single split and motivating the cross-validated analysis. The difference between the balanced and imbalanced settings is shown in Figure 4.

4.2. Cross-Validation (CV) Performance

Ten-fold CV (×5 seeds) shows tightly clustered F1 among the top supervised models (GB/RF/LR ≈ 0.90 ± 0.06), with KNN slightly lower (Table 5). In the balanced setting, the mean ± SD for GB is 0.907 ± 0.055, RF 0.907 ± 0.055, LR 0.902 ± 0.056, KNN 0.896 ± 0.075, OCSVM 0.608 ± 0.195, Isolation Forest 0.206 ± 0.154, and Autoencoder 0.363 ± 0.163. Median fit times are small (e.g., LR ~0.009 s, GB ~0.416 s, RF ~1.607 s, KNN ~0.002 s) with compact model sizes (LR ~0.0008 MB, GB ~0.168 MB, RF ~0.246 MB).
On imbalanced data, GB/RF remain ≈ 0.907 ± 0.055, LR and KNN dip modestly (0.899 ± 0.054 and 0.865 ± 0.118), and OCSVM declines further (≈0.584 ± 0.226). Fit/predict times are near-identical to the balanced setting (Table 6).
Comparison of the balanced and imbalanced cross-validation shows almost no F1 change for most models (Figure 5). This can be attributed to the overall class-frequency configuration of the datasets: even the imbalanced datasets did not display extreme class bias, so the signal was strong enough for most classifiers to learn good decision boundaries. SMOTE/ADASYN balancing therefore led to only a minor increase (≈0.01–0.02 F1 in some cases). In other words, the balanced and imbalanced datasets are structurally very similar, so model performance is also similar. This close agreement confirms that the classifiers were relatively resilient to a small amount of class imbalance and that balancing predominantly improved minority-class recall without altering overall accuracy.
The delta F1 analysis shows that only modest changes occur between the balanced and imbalanced training settings. KNN improves by 0.031, Logistic Regression by 0.003, OCSVM by 0.023, and the Autoencoder by 0.041. Both Gradient Boosting and Random Forest show no change (ΔF1 = 0.000), while Isolation Forest drops slightly by 0.001 (Table 7).
Paired tests (per fold × per seed) show that LR benefits significantly from class balancing (paired t p ≈ 0.0437; Wilcoxon p ≈ 0.0455), while GB/RF show positive but non-significant trends (p ≈ 0.083) (Table 8).

4.3. Per-Attack Performance

Across the balanced CV runs, Finney remains the hardest class for the supervised models, whereas DoC, Eclipse, hash rate, Sybil, and smart contract audit attacks achieve consistently high detection performance (F1 ≈ 0.90–0.96). For the top models (LR, GB, RF, KNN), Finney F1 is consistently ≈0.55–0.60, with moderate variance (±0.12–0.15) (Table 9). This pattern echoes the qualitative difficulty of replay-style and temporal anomalies, which remain more challenging than structural or consensus-based attacks.

4.4. Threshold Sensitivity (Unsupervised)

Threshold sweeps for OCSVM and Isolation Forest (percentile cut-offs over anomaly scores) indicate the best operating point near the q = 50th percentile for both. Averaged over seeds, OCSVM achieved its best F1 of 0.790 ± 0.066 at q ≈ 50 (Interquartile Range (IQR) 0), while IF achieved its best F1 of 0.672 ± 0.064 at q ≈ 50 (IQR 20), suggesting OCSVM is less threshold-sensitive than IF (Table 10). Best operating points occur near the 50th percentile for both detectors, suggesting that heavy-tail cut-offs are unnecessary and may harm recall.

4.5. Cross Dataset Validation (Datasets A and B)

Balanced CV across both datasets shows consistent rankings and stable means, with somewhat larger dispersion in Dataset A (expected given the shift in operating conditions between runs). Under the balanced setting (10 × 5 CV), the top supervised models in Dataset A cluster at high F1 with moderate dispersion (LR ~0.912 ± 0.144, GB ~0.893 ± 0.153, RF ~0.893 ± 0.153, KNN ~0.865 ± 0.161). In Dataset B, the ranking is consistent and the dispersion is tighter (GB ~0.907 ± 0.055, RF ~0.907 ± 0.055, LR ~0.902 ± 0.056, KNN ~0.896 ± 0.075). Under the imbalanced setting (10 × 5 CV), class imbalance reduces F1 modestly in Dataset A (LR ~0.874 ± 0.159, GB ~0.865 ± 0.161, RF ~0.865 ± 0.161, KNN ~0.846 ± 0.164), while in Dataset B, GB/RF remain strong and stable and LR/KNN dip slightly (GB ~0.907 ± 0.055, RF ~0.907 ± 0.055, LR ~0.899 ± 0.054, KNN ~0.865 ± 0.118) (Table 11).

4.6. Computational Efficiency

We profiled pre-processing and model cost per fold/seed. Median fit times and artifact sizes remain small, supporting deployability under real-time constraints. Feature extraction on the raw logs averaged approximately 0.013 s (n = 15, 9 features), with negligible variance. These costs, combined with compact model artifacts (≤~0.25 MB for tree ensembles, ~1 KB for LR), make online scoring feasible under typical throughput conditions. Model costs show LR fit ~0.010 s, predict ~0.001 s, size ~0.001 MB; GB fit ~0.420 s, predict ~0.003 s, size ~0.170 MB; RF fit ~1.600 s, predict ~0.160 s, size ~0.245 MB; KNN fit ~0.002 s, predict ~0.009 s, size ~0.025 MB. OCSVM required ~0.050 s to fit with a small footprint (~0.008 MB), while Isolation Forest was heavier (fit ~2.050 s, predict ~0.032 s, size ~1.100 MB). The Autoencoder was the most computationally intensive (fit ~3.200 s, predict ~0.020 s, size ~2.000 MB, peak memory ~250 MB) (Table 12). Overall, memory peaks remained minimal across models, indicating feasibility for deployment in blockchain anomaly detection systems.

4.7. Combined Models (Ensemble and Adaptive Defense Performance)

4.7.1. Ensemble Models Performance

While the previous results concern single ML algorithms, we also explored the effect of combining models with ensemble techniques and incorporating adaptive response mechanisms. This was intended to address the weaknesses of single classifiers, whose robustness, stability, and detection rates under various attack situations were limited. We compared the single classifiers (Random Forest, Gradient Boosting, Logistic Regression, K-Nearest Neighbors, Isolation Forest, One-Class SVM, and Autoencoder) with the ensemble methods (Bagging, Boosting, Stacking, and Voting). Ensemble methods offered a modest improvement (F1 ≈ 0.54) over single baselines (F1 ≈ 0.52) but remained far below the cross-validated supervised models (F1 ≈ 0.90). For instance, while Gradient Boosting alone achieved an F1 score of approximately 0.544 and Logistic Regression reached 0.520, boosting ensembles attained an F1 score of 0.544 with an accuracy of 0.545 (Table 13). This suggests that iterative ensemble methods can yield anomaly detectors that generalize consistently over both balanced and imbalanced data.
Ensemble methods further decreased the variance and alleviated overfitting, especially with highly unbalanced blockchain datasets where minority-class attack instances are hard to detect. Bagging made performance more stable over the folds, and Voting and Stacking exploited model diversity to improve precision and recall. These results verify that ensemble anomaly detectors offer a better solution than single detectors when applied to blockchain systems. Significant gains were obtained only when the ensemble was combined with the Adaptive Security Response (ASR) mechanism; the ASR system relies on the ensemble model decision outputs to activate flexible blockchain protection actions.

4.7.2. ASR Policy Effects

Predictions from multiple models were combined to trigger real-time countermeasures: transaction verification, node isolation, and consensus difficulty adaptation. Unlike the classification metrics reported in Section 4.7.1, the ASR results in Table 14 show the impact of ASR intervention on blockchain resilience. These policy-effect outcomes are reported separately to emphasize that they represent operational outcomes rather than classification performance.
The same single baseline configuration outlined in Section 4.1 was used for evaluating the ensemble models (Table 13) and, separately, the ASR-augmented models (Table 14). Although seven machine learning models were assessed initially, only five were included in the ASR framework (Random Forest, Gradient Boosting, Logistic Regression, Isolation Forest, and Autoencoder). K-Nearest Neighbors (KNN) and One-Class SVM were excluded for practical reasons. KNN worked well in some static settings, but prediction was computationally expensive because each incoming transaction required computing distances to all transactions in the dataset, making it unsuitable for real-time blockchain defense. Likewise, One-Class SVM showed inconsistent performance across folds and sensitivity to parameter tuning, reducing its scalability for large volumes of blockchain traffic. In contrast, the five retained models struck a balance between detection performance, interpretability, and computational cost, making them suitable candidates for linking to adaptive, real-time response modules.
The ASR integration substantially increased performance above static single-model detection. ASR results are not directly equivalent to the classification metrics obtained on the original test fold; to evaluate the effect of the intervention, we present ASR effects separately from the previous metrics, evaluated on the same test split used for Table 13 and Table 14. As an example, Random Forest as an individual classifier achieved an F1 score of 0.498, which rose to 0.9609 in policy F1 (F1-ASR) when ASR was enabled. The Autoencoder similarly increased from a baseline F1 of 0.051 to an F1-ASR of 0.8653 with ASR integration. Table 13 demonstrates that the ensemble models alone performed only moderately (around 50%) due to the lack of additional features, whereas the same models with ASR performed markedly better (Table 14). Without adaptive actions, ensemble methods yielded only moderate F1 scores (approximately 0.54), comparable to or slightly above the single models. The dramatic improvement (F1-ASR > 0.90) occurred only when ASR was introduced, underscoring that proactive, real-time interventions such as dynamic consensus difficulty, transaction isolation, and node throttling are essential to achieve high detection performance (Figure 6). With the combined model, anomaly detection rates are improved and the blockchain's robustness against new threats is strengthened. This makes ensemble and ASR-based models more appealing and efficient for real-life blockchain security deployments.

5. Discussion

This study examines contemporary blockchain attack vectors and evaluates a multilayer proactive mechanism using two simulated datasets (A and B), with emphasis on methodological rigor and computational practicality. Our evaluation shows that both supervised and unsupervised machine learning models have distinct strengths and limitations for blockchain anomaly detection. We applied statistically rigorous protocols, including 10 × 5 CV, balancing comparisons, per-attack reporting, threshold sweeps, cross-dataset validation, and efficiency profiling, to ensure robust findings.
The unsupervised models (Isolation Forest and One-Class SVM) showed signal on unseen attacks with reasonable accuracy, which is valuable given the ever-evolving nature of blockchain activity. Because they do not require labeled data, these models have potential as real-time early warning monitors, although their overall F1 scores were lower than those of supervised models [24]. In our cross-validated results, however, unsupervised detectors trailed supervised baselines on overall F1, reinforcing their role as complementary early warning monitors rather than primary classifiers. A likely reason is that unsupervised detectors are sensitive to synthetic anomalies, whereas supervised models benefited from labeled attack data. Threshold sweeps revealed the best operating point near the 50th percentile for both OCSVM and IF (OCSVM best F1 ≈ 0.79 ± 0.07 @ q ≈ 50, IF ≈ 0.67 ± 0.06 @ q ≈ 50), indicating OCSVM is less threshold-sensitive. Supervised models such as Random Forest and Gradient Boosting generally performed well, particularly on known attacks, including BBEDSA [25].
This trend is reflected in the cross-validated means on balanced Dataset A (LR 0.912 ± 0.144, GB 0.893 ± 0.153, RF 0.893 ± 0.153, KNN 0.865 ± 0.161) and balanced Dataset B (GB 0.907 ± 0.055, RF 0.907 ± 0.055, LR 0.902 ± 0.056, KNN 0.896 ± 0.075). Under class imbalance, performance drops modestly yet the ranking remains consistent: in Dataset A (LR 0.874 ± 0.159, GB/RF 0.865 ± 0.161, KNN 0.846 ± 0.164) and in Dataset B (GB/RF 0.907 ± 0.055, LR 0.899 ± 0.054, KNN 0.865 ± 0.118). The tighter dispersion in Dataset B suggests that its attack distributions were less noisy than Dataset A's, enabling more consistent classification boundaries. For the top supervised models, balancing improved Logistic Regression's mean F1 from 0.899 to 0.902 and KNN's from 0.865 to 0.896, with little to no change for Gradient Boosting and Random Forest, primarily via minority-class recall; for LR the improvement was statistically significant (paired t p ≈ 0.044; Wilcoxon p ≈ 0.046). Per-attack analysis clarifies where errors concentrate: DoC, Eclipse, hash rate, Sybil, and smart contract anomalies are typically saturated (F1 ≈ 0.90–0.96) by GB/RF/LR/KNN, while Finney remains challenging (F1 ≈ 0.58 ± 0.13–0.15), consistent with its temporal character. The consistently low F1 on Finney attacks (≈0.58) suggests that their temporal patterns are not captured by static classifiers. Incorporating sequential architectures such as Long Short-Term Memory (LSTM)-based detectors or adversarial augmentation for rare temporal anomalies may improve performance on this class. The delta F1 analysis shows that KNN (+0.031), OCSVM (+0.023), and the Autoencoder (+0.041) gain modestly from balancing, Logistic Regression gains only marginally (+0.003), Gradient Boosting and Random Forest show no change, and Isolation Forest declines marginally (−0.001). These small deltas indicate that the tree ensemble models and Logistic Regression are inherently robust to class imbalance, whereas KNN and the unsupervised detectors are more sensitive. Ensemble models without adaptive response maintained moderate performance (F1 ≈ 0.54) because they only aggregated classifiers; in contrast, ASR integrated real-time defense actions, transforming predictions into effective interventions, which explains the large performance difference.
The multi-layered framework shows promise for adaptive consensus and immutable verification techniques. In our simulations, the three layers (consensus hardening, transaction-level controls, and immutable block validation) acted as complementary defenses, with protocol hardening reducing takeover feasibility, transaction checks constraining the blast radius, and detection providing monitoring, response, and action. These effects are demonstrated in simulation and motivate evaluation on real-world systems. In addition, performance could be further enhanced by optimizing reconstruction-error thresholds in models such as the Autoencoder. More broadly, targeted temporal features may lift Finney performance, and richer augmentation of rare classes can reduce variance in minority recall [13]. Future research can also test the ability of Generative Adversarial Networks (GANs) to create synthetic data points for rare classes to improve model generalization and reduce bias [26]. This aligns with our finding that balanced training yields the largest gains in minority-class recall with negligible runtime overhead. Efficiency profiling confirms feasibility: even RF and GB require <2 s training and <0.2 s inference, supporting their use in near real-time blockchain defense. Feature extraction is non-dominant (~0.013 s for a representative 181 × 9 batch), and the high-accuracy models are lightweight; for instance, LR fit ≈ 0.01 s with an artifact of ≈ 0.001 MB, GB fit ≈ 0.42 s with ≈ 0.17 MB, RF fit ≈ 1.6–1.7 s with ≈ 0.25 MB, and KNN fit ≈ 0.002 s with predict ≈ 0.009 s.
We compare our study with prior studies on blockchain attacks (Table 15). While earlier studies often focused on a single attack type and proposed protocol-level defenses without an ML benchmark, our study presents a multi-layer ML-based benchmark with adaptive defense and per-attack performance.
Compared with previous studies that were conceptual only and lacked empirical validation [3,5], focused solely on a single threat such as DoC (Denial-of-Chain) [17] or Black Bird variants [18], or were restricted to specific mechanisms like BAR (Block Access Restriction) [25] and cryptojacking detection [21], our study makes a stronger, more generalizable, and thus more practical contribution. By integrating multiple machine learning models with ASR spanning consensus, network, and application-layer attacks, and delivering robust supervised detection performance (F1 > 0.90) with computational efficiency (training < 2.00 s, inference < 0.20 s), our study is more comprehensive and proactive than existing work, which is usually limited to theoretical taxonomies or isolated attack simulations. Meanwhile, we acknowledge that our evaluation is conducted on synthetic blockchain logs; real-world blockchain scenarios with adaptive adversaries, dynamic transaction workloads, and fluctuating network conditions could undermine generalization. Hence, future work should extend validation to real blockchain systems, incorporate quantum-resilient protocols, and adopt automated hyperparameter tuning or data augmentation strategies to strengthen defenses against rare attack patterns [20,27,28,29].

6. Conclusions

In this study, we proposed and implemented a proactive defense mechanism. Our findings reveal that many studies focus on detection rather than developing proactive strategies to combat blockchain attacks, and they also demonstrate that as the technology advances, new attack variants will emerge. The most important finding is that supervised models such as Random Forest and Gradient Boosting achieved consistently high accuracy, while ensemble methods combined with adaptive response significantly enhanced resilience. These results benefit blockchain developers, researchers, and regulatory bodies by offering scalable and practical defense strategies to strengthen trust and security in blockchain ecosystems.
This study is limited by the exclusive use of simulated Datasets A and B. The use of simulated datasets is justified by scarce real-world datasets, ethical limitations, the need for diverse attack modeling, and alignment with academic standards for safe and repeatable blockchain security experimentation and validation. Real blockchain networks exhibit adaptive adversaries, dynamic traffic, and higher noise, which may reduce generalization. As a result, we consider all performance figures to be simulation-based and subject to validation in real blockchain environments. In addition, hyperparameter searches were bounded for practicality, and broader sweeps or Automated Machine Learning (AutoML) could yield further improvements. In future research, we plan to reuse the 10 × 5 CV methodology with the same seeds on real blockchain data, use an operational node to gather transaction- and block-level data, and evaluate variance sensitivity under controlled disturbances such as noise injection and class-prior shifts. This will assess deployment readiness and measure generalization gaps.
Stakeholders, regulators, and industry practitioners should prioritize resilience testing under realistic adversarial conditions, while developers are encouraged to adopt proactive ML anomaly detection combined with Adaptive Security Response mechanisms to improve blockchain security in practice. The proposed multilayered framework addresses current and future challenges of decentralized security. We emphasize that these effectiveness claims are simulation-based; real-world system validation remains essential. We recommend further research into protecting blockchain against present and future threats. To ensure that defensive responses remain effective, future enhancements will combine detection with quantum safeguards such as adaptive confirmation policies and post-quantum signature schemes. Furthermore, combining Hyperledger Fabric and cryptographic mechanisms alongside the inherent security features of blockchain could represent a transformative approach to defending the technology. In line with our empirical findings, future work will evaluate the framework on real blockchain operational datasets to test generalization and drift handling, assess ethical risk under practical conditions using live blockchain data, strengthen protocol components including quantum-resilient variants, and use data augmentation to improve rare-attack coverage, particularly for Finney-type anomalies.

Author Contributions

Conceptualization, B.O.; methodology, B.O. and O.A.O.; software, B.O.; validation, B.O., O.A.O. and N.C.; formal analysis, B.O. and O.A.O.; investigation, B.O.; resources, B.O.; data curation, B.O.; writing—original draft preparation, B.O.; writing—review and editing, B.O., O.A.O. and N.C.; visualization, B.O. and O.A.O.; supervision, N.C.; project administration, B.O. and O.A.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data and files used in this study are archived on Zenodo at https://doi.org/10.5281/zenodo.17250524 (accessed on 2 October 2025), corresponding to the GitHub release “Version 1.0—Adaptive Ensemble Machine Learning Framework for Proactive Blockchain Security” (Tag: MDPI2, Commit: 1c733de). A readme file is provided for reproduction.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AD%: Anomaly Detection Percentage
ADASYN: Adaptive Synthetic Sampling
AI: Artificial Intelligence
ASR: Adaptive Security Response
AutoML: Automated Machine Learning
BAR: Block Access Restriction
BBA: Black Bird Attack
BBEDSA: Black Bird Embedded Double Spending Attack
CV: Cross-Validation
DoC: Denial-of-Chain
F1: F1 Score
FN: False Negative
FP: False Positive
GB: Gradient Boosting
GPU: Graphics Processing Unit
HR: Hash Rate
IF: Isolation Forest
IQR: Interquartile Range
KNN: K-Nearest Neighbors
LR: Logistic Regression
ML: Machine Learning
OCSVM: One-Class Support Vector Machine
RAM: Random Access Memory
RF: Random Forest
SD: Standard Deviation
SMOTE: Synthetic Minority Oversampling Technique
SVM: Support Vector Machine
TN: True Negative
TP: True Positive
VRAM: Video Random Access Memory

References

  1. Ahmed, M.R.; Islam, M.; Shatabda, S.; Islam, S. Blockchain-Based Identity Management System and Self-Sovereign Identity Ecosystem: A Comprehensive Survey. IEEE Access 2022, 10, 113436–113481. [Google Scholar] [CrossRef]
  2. Laroiya, C.; Saxena, D.; Komalavalli, C. Applications of Blockchain Technology. In Handbook of Research on Blockchain Technology; Krishnan, S., Balas, V.E., Julie, E.G., Robinson, H., Balaji, S., Kumar, R., Eds.; Academic Press (Elsevier): Cambridge, MA, USA, 2020; pp. 213–243. [Google Scholar] [CrossRef]
  3. Guo, H.; Yu, X. A Survey on Blockchain Technology and Its Security. Blockchain Res. Appl. 2022, 3, 100067. [Google Scholar] [CrossRef]
  4. Uddin, M.; Obaidat, M.; Manickam, S.; Shams, A.; Dandoush, A.; Ullah, H.; Ullah, S.S. Exploring the Convergence of Metaverse, Blockchain, and AI: Opportunities, Challenges, and Future Research Directions. WIREs Data Min. Knowl. Discov. 2024, 14, e1556. [Google Scholar] [CrossRef]
  5. Chen, Y.; Chen, H.; Zhang, Y.; Han, M.; Siddula, M.; Cai, Z. A Survey on Blockchain Systems: Attacks, Defenses, and Privacy Preservation. High-Confid. Comput. 2022, 2, 100048. [Google Scholar] [CrossRef]
  6. Zhang, P.; Schmidt, D.C.; White, J.; Lenz, G. Blockchain Technology Use Cases in Healthcare. In Advances in Computers; Zelkowitz, M.V., Ed.; Academic Press: Cambridge, MA, USA, 2018; Volume 111, pp. 1–41. [Google Scholar] [CrossRef]
  7. Li, M.; Zeng, L.; Zhao, L.; Yang, R.; An, D.; Fan, H. Blockchain-Watermarking for Compressive Sensed Images. IEEE Access 2021, 9, 56457–56467. [Google Scholar] [CrossRef]
  8. Singh, S.; Hosen, A.S.M.S.; Yoon, B. Blockchain Security Attacks, Challenges, and Solutions for the Future Distributed IoT Network. IEEE Access 2021, 9, 13938–13959. [Google Scholar] [CrossRef]
  9. Badawi, E.; Jourdan, G.V. Cryptocurrencies Emerging Threats and Defensive Mechanisms: A Systematic Literature Review. IEEE Access 2020, 8, 200021–200037. [Google Scholar] [CrossRef]
  10. Grobys, K. When the Blockchain Does Not Block: On Hackings and Uncertainty in the Cryptocurrency Market. Quant. Financ. 2021, 21, 1267–1279. [Google Scholar] [CrossRef]
  11. Bard, D.A.; Kearney, J.J.; Perez-Delgado, C.A. Quantum Advantage on Proof of Work. Array 2022, 15, 100225. [Google Scholar] [CrossRef]
  12. Junejo, Z.; Hashmani, M.A.; Alabdulatif, A.; Memon, M.M.; Jaffari, S.R.; Abdullah, M.Z. RZee: Cryptographic and Statistical Model for Adversary Detection and Filtration to Preserve Blockchain Privacy. J. King Saud Univ.—Comput. Inf. Sci. 2022, 34, 7885–7910. [Google Scholar] [CrossRef]
  13. Agarwal, R.; Barve, S.; Shukla, S.K. Detecting Malicious Accounts in Permissionless Blockchains Using Temporal Graph Properties. Appl. Netw. Sci. 2021, 6, 9. [Google Scholar] [CrossRef]
  14. Ali, H.; Ahmad, J.; Jaroucheh, Z.; Papadopoulos, P.; Pitropakis, N.; Lo, O.; Abramson, W.; Buchanan, W.J. Trusted Threat Intelligence Sharing in Practice and Performance Benchmarking through the Hyperledger Fabric Platform. Entropy 2022, 24, 1379. [Google Scholar] [CrossRef] [PubMed]
  15. Saad, M.; Spaulding, J.; Njilla, L.; Kamhoua, C.; Shetty, S.; Nyang, D.; Mohaisen, D. Exploring the Attack Surface of Blockchain: A Comprehensive Survey. IEEE Commun. Surv. Tutor. 2020, 22, 1977–2008. [Google Scholar] [CrossRef]
  16. Badawi, E.; Jourdan, G.V.; Onut, I.V. The “Bitcoin Generator” Scam. Blockchain Res. Appl. 2022, 3, 100084. [Google Scholar] [CrossRef]
  17. Bordel, B.; Alcarria, R.; Robles, T. Denial of Chain: Evaluation and Prediction of a Novel Cyberattack in Blockchain-Supported Systems. Future Gener. Comput. Syst. 2021, 116, 426–439. [Google Scholar] [CrossRef]
  18. Xing, Z.; Chen, Z. Black Bird Attack: A Vital Threat to Blockchain Technology. Procedia Comput. Sci. 2022, 198, 556–563. [Google Scholar] [CrossRef]
  19. Wang, Q.; Li, R.; Zhan, L. Blockchain Technology in the Energy Sector: From Fundamentals to Applications. Comput. Sci. Rev. 2021, 39, 100362. [Google Scholar] [CrossRef]
  20. Salle, A.L.; Kumar, A.; Jevtic, P.; Boscovic, D. Joint Modeling of Hyperledger Fabric and Sybil Attack: Petri Net Approach. Simul. Model. Pract. Theory 2022, 122, 102674. [Google Scholar] [CrossRef]
  21. Hong, H.; Woo, S.; Park, S.; Lee, J.; Lee, H. CIRCUIT: A JavaScript Memory Heap-Based Approach for Precisely Detecting Cryptojacking Websites. IEEE Access 2022, 10, 95356–95368. [Google Scholar] [CrossRef]
  22. Albakri, A.; Mokbel, C. Convolutional Neural Network Biometric Cryptosystem for the Protection of the Blockchain’s Private Key. Procedia Comput. Sci. 2019, 160, 235–240. [Google Scholar] [CrossRef]
  23. Dua, M.; Sadhu, A.; Jindal, A.; Mehta, R. A Hybrid Noise Robust Model for Multireplay Attack Detection in ASV Systems. Biomed. Signal Process. Control 2022, 74, 103517. [Google Scholar] [CrossRef]
  24. Campos, G.O.; Zimek, A.; Sander, J.; Campello, R.J.G.B.; Micenková, B.; Schubert, E.; Assent, I.; Houle, M.E. On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. Data Min. Knowl. Discov. 2016, 30, 891–927. [Google Scholar] [CrossRef]
  25. Xing, Z.; Chen, Z. Using BAR Switch to Prevent Black Bird Embedded Double Spending Attack. Procedia Comput. Sci. 2022, 198, 829–836. [Google Scholar] [CrossRef]
  26. Saxena, D.; Cao, J. Generative Adversarial Networks (GANs): Challenges, Solutions, and Future Directions. ACM Comput. Surv. 2021, 54, 1–42. [Google Scholar] [CrossRef]
  27. Mahmood, M.; Dabagh, A. Blockchain Technology and Internet of Things: Review, Challenge and Security Concern. Int. J. Power Electron. Drive Syst. 2023, 13, 718–735. [Google Scholar] [CrossRef]
  28. Hamdi, A.; Fourati, L.C.; Ayed, S. Vulnerabilities and Attacks Assessments in Blockchain 1.0, 2.0 and 3.0: Tools, Analysis and Countermeasures. Int. J. Inf. Secur. 2023, 23, 713–757. [Google Scholar] [CrossRef]
  29. Allende, M.; León, D.L.; Cerón, S.; Pareja, A.; Pacheco, E.; Leal, A.; Silva, M.D.; Pardo, A.; Jones, D.; Worrall, D.J.; et al. Quantum-Resistance in Blockchain Networks. Sci. Rep. 2023, 13, 5664. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Process flow diagram.
Figure 2. Dataset generation process and flow.
Figure 3. Flow model evaluation protocol and attack taxonomy.
Figure 4. Model performance difference: balanced vs. imbalanced dataset.
Figure 5. Cross-validation results (10 × 5 CV) under balanced and imbalanced dataset.
Figure 6. Model performance with vs. without ASR.
Table 1. Dataset features and descriptions.
Feature | Description
hash_rate | Computational power assigned to a node
malicious_transactions_added | Indicates whether a smart-contract audit allowed a bad transaction through (1 = yes, 0 = no)
nodes_after_sybil_attack | Number of nodes remaining after a Sybil attack
nodes_after_eclipse_attack | Number of nodes remaining after an Eclipse attack
block_added_after_finney | Finney attack success indicator (1 = fraudulent block added, 0 = failed)
transaction_authorized_finney | Whether the Finney attack transaction was authorized (1 = yes, 0 = no)
blocks_before_attack | Chain length immediately before an attack
blocks_after_attack | Chain length immediately after an attack
doc_attack_identified | Target variable: DoC attack detection (1 = detected, 0 = not detected)
Table 2. ML models and hyperparameters.
ML Model | Hyperparameters
One-Class SVM | Kernel: rbf, Nu: 0.5, Gamma: scale
Random Forest | n_estimators: 200, max_depth: 10, criterion: gini
KNN | n_neighbors: 5, Weights: uniform, Algorithm: auto
Isolation Forest | n_estimators: 200, Max_samples: auto, Contamination: 0.1
Gradient Boosting | n_estimators: 150, learning_rate: 0.05, max_depth: 3
Logistic Regression | Penalty: l2, C: 1.0, Solver: lbfgs
Autoencoder | Hidden Layers: [64, 32, 16, 32, 64], Activation: relu, Optimizer: adam, Loss: mse, Epochs: 100, Batch Size: 16
Table 3. Model performance metrics—balanced class.
Model | Accuracy | Precision | Recall | F1 | AD%
One-Class SVM | 0.52 | 0.51 | 0.55 | 0.53 | 0.53
Random Forest | 0.50 | 0.49 | 0.65 | 0.57 | 0.65
KNN | 0.52 | 0.50 | 0.59 | 0.54 | 0.57
Isolation Forest | 0.48 | 0.49 | 0.96 | 0.66 | 1.00
Gradient Boosting | 0.49 | 0.48 | 0.76 | 0.59 | 0.76
Logistic Regression | 0.52 | 0.52 | 0.54 | 0.51 | 0.54
Autoencoder | 0.51 | 0.50 | 0.05 | 0.09 | 0.05
Table 4. Model performance metrics—imbalanced class.
Model | Accuracy | Precision | Recall | F1 | AD%
One-Class SVM | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Random Forest | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
KNN | 0.00 | 0.50 | 0.50 | 0.00 | 0.00
Isolation Forest | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Gradient Boosting | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
Logistic Regression | 0.00 | 0.50 | 0.50 | 0.00 | 0.00
Autoencoder | 1.00 | 1.00 | 1.00 | 1.00 | 1.00
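Tables 3 and 4 contrast training on SMOTE-balanced versus imbalanced data. As a minimal sketch of that comparison, the snippet below applies SMOTE (from the imbalanced-learn package) to the training split only and scores the same classifier under both conditions; the split ratio, seed, and wrapper function are assumptions for illustration.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def evaluate_balanced_vs_imbalanced(X, y, seed=42):
    """Train one classifier on raw and on SMOTE-balanced training data."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    results = {}

    # Imbalanced condition: fit directly on the original class distribution.
    clf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=seed)
    results["imbalanced"] = f1_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))

    # Balanced condition: oversample the minority class in the training split only,
    # so the held-out test set keeps its original distribution.
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=seed)
    results["balanced"] = f1_score(y_te, clf.fit(X_bal, y_bal).predict(X_te))
    return results
```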
Table 5. Balanced class cross validation performance.
Model | F1_Mean | F1_Std | Acc_Mean | Acc_Std | Fit_Median | Pred_Median | Size_Median
Autoencoder | 0.363 | 0.163 | 0.442 | 0.151 | 0.184 | 0.020 | 2.000
Gradient Boosting | 0.907 | 0.055 | 0.878 | 0.078 | 0.416 | 0.003 | 0.168
Isolation Forest | 0.206 | 0.154 | 0.479 | 0.073 | 2.061 | 0.032 | 1.104
KNN | 0.896 | 0.075 | 0.868 | 0.086 | 0.002 | 0.009 | 0.025
Logistic Regression | 0.902 | 0.056 | 0.872 | 0.08 | 0.009 | 0.001 | 0.001
One-Class SVM | 0.608 | 0.195 | 0.591 | 0.138 | 0.003 | 0.002 | 0.008
Random Forest | 0.907 | 0.055 | 0.878 | 0.078 | 1.607 | 0.161 | 0.245
Table 6. Imbalanced class cross validation performance.
Model | F1_Mean | F1_Std | Acc_Mean | Acc_Std | Fit_Median | Pred_Median | Size_Median
Autoencoder | 0.322 | 0.15 | 0.418 | 0.133 | 0.181 | 0.020 | 2.000
Gradient Boosting | 0.907 | 0.055 | 0.878 | 0.078 | 0.415 | 0.003 | 0.168
Isolation Forest | 0.206 | 0.156 | 0.473 | 0.079 | 2.07 | 0.032 | 1.048
KNN | 0.865 | 0.118 | 0.848 | 0.104 | 0.002 | 0.009 | 0.022
Logistic Regression | 0.899 | 0.054 | 0.867 | 0.078 | 0.009 | 0.001 | 0.001
One-Class SVM | 0.584 | 0.226 | 0.563 | 0.174 | 0.003 | 0.002 | 0.007
Random Forest | 0.907 | 0.055 | 0.878 | 0.078 | 1.604 | 0.161 | 0.245
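The statistics in Tables 5 and 6 (mean and standard deviation of F1 and accuracy, plus median fit and predict times) come from repeated cross-validation. The sketch below is one plausible way to compute them with scikit-learn; the exact number of splits and repeats behind the "10 × 5 CV" protocol is not restated here, so n_splits and n_repeats are placeholders rather than the authors' settings.

```python
import time
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import RepeatedStratifiedKFold


def cross_validate_model(model, X, y, n_splits=5, n_repeats=10, seed=0):
    """Repeated stratified CV reporting the statistics used in Tables 5 and 6.

    X and y are expected to be NumPy arrays so that index-based slicing works.
    """
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    f1s, accs, fit_times, pred_times = [], [], [], []
    for train_idx, test_idx in cv.split(X, y):
        est = clone(model)  # fresh copy per fold so folds stay independent
        t0 = time.perf_counter()
        est.fit(X[train_idx], y[train_idx])
        fit_times.append(time.perf_counter() - t0)

        t0 = time.perf_counter()
        y_pred = est.predict(X[test_idx])
        pred_times.append(time.perf_counter() - t0)

        f1s.append(f1_score(y[test_idx], y_pred))
        accs.append(accuracy_score(y[test_idx], y_pred))
    return {
        "F1_Mean": np.mean(f1s), "F1_Std": np.std(f1s),
        "Acc_Mean": np.mean(accs), "Acc_Std": np.std(accs),
        "Fit_Median": np.median(fit_times), "Pred_Median": np.median(pred_times),
    }
```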
Table 7. Delta F1 (balanced–imbalanced) performance.
Model | F1_Mean_Bal | F1_Std_Bal | F1_Mean_Imbal | F1_Std_Imbal | F1_Delta_Bal_Minus_Imbal
Gradient Boosting | 0.907 | 0.055 | 0.907 | 0.055 | 0.000
Random Forest | 0.907 | 0.055 | 0.907 | 0.055 | 0.000
Logistic Regression | 0.902 | 0.056 | 0.899 | 0.054 | 0.003
KNN | 0.896 | 0.075 | 0.865 | 0.118 | 0.031
One-Class SVM | 0.608 | 0.195 | 0.584 | 0.226 | 0.023
Autoencoder | 0.363 | 0.163 | 0.322 | 0.15 | 0.041
Isolation Forest | 0.206 | 0.154 | 0.206 | 0.156 | −0.001
Table 8. Pair tests (per-fold × per-seed) performance.
Model | Paired_T_Pvalue | Wilcoxon_Pvalue | Pairs_N
Autoencoder | N/A | N/A | 35
Gradient Boosting | 0.083 | 0.083 | 35
Isolation Forest | N/A | N/A | 35
KNN | 0.16 | 0.157 | 35
Logistic Regression | 0.044 | 0.046 | 35
One-Class SVM | 1 | 1 | 35
Random Forest | 0.083 | 0.083 | 35
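The p-values in Table 8 come from paired significance tests over per-fold, per-seed F1 scores obtained under the balanced and imbalanced conditions. A minimal SciPy sketch of such a comparison is shown below; the function wrapper and the NaN fallback are illustrative assumptions, not the authors' exact procedure.

```python
import math
from scipy.stats import ttest_rel, wilcoxon


def paired_significance(f1_balanced, f1_imbalanced):
    """Paired t-test and Wilcoxon signed-rank test over matched F1 scores.

    Both inputs must be equal-length sequences of F1 scores obtained on the
    same folds and seeds under the two training conditions.
    """
    _, t_p = ttest_rel(f1_balanced, f1_imbalanced)
    try:
        _, w_p = wilcoxon(f1_balanced, f1_imbalanced)
    except ValueError:
        # The signed-rank test is undefined when every paired difference is zero;
        # returning NaN here would correspond to an N/A entry in a results table.
        w_p = math.nan
    return {"Paired_T_Pvalue": t_p, "Wilcoxon_Pvalue": w_p,
            "Pairs_N": len(f1_balanced)}
```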
Table 9. Per-attack performance.
Model | Attack_Type | F1_Mean | F1_Std
Gradient Boosting | DoC | 0.95 | 0.04
Gradient Boosting | Eclipse | 0.94 | 0.05
Gradient Boosting | Hashrate | 0.96 | 0.03
Gradient Boosting | Smart Contract | 0.92 | 0.06
Gradient Boosting | Sybil | 0.93 | 0.05
Gradient Boosting | Finney | 0.60 | 0.12
KNN | DoC | 0.91 | 0.07
KNN | Eclipse | 0.90 | 0.08
KNN | Hashrate | 0.92 | 0.06
KNN | Smart Contract | 0.89 | 0.09
KNN | Sybil | 0.90 | 0.08
KNN | Finney | 0.56 | 0.15
Logistic Regression | DoC | 0.93 | 0.06
Logistic Regression | Eclipse | 0.92 | 0.07
Logistic Regression | Hashrate | 0.94 | 0.05
Logistic Regression | Smart Contract | 0.90 | 0.08
Logistic Regression | Sybil | 0.91 | 0.07
Logistic Regression | Finney | 0.57 | 0.14
Random Forest | DoC | 0.94 | 0.05
Random Forest | Eclipse | 0.93 | 0.06
Random Forest | Hashrate | 0.95 | 0.04
Random Forest | Smart Contract | 0.91 | 0.07
Random Forest | Sybil | 0.92 | 0.06
Random Forest | Finney | 0.58 | 0.13
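Per-attack scores such as those in Table 9 are obtained by restricting the evaluation to the test samples generated under each simulated attack scenario. A short sketch of that grouping is given below; the attack-label array and the helper name are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score


def per_attack_f1(y_true, y_pred, attack_labels):
    """Compute F1 separately for each simulated attack scenario.

    `attack_labels` is an array giving the scenario (e.g., "DoC", "Sybil",
    "Finney") that produced each test sample.
    """
    results = {}
    for attack in np.unique(attack_labels):
        mask = attack_labels == attack
        results[attack] = f1_score(y_true[mask], y_pred[mask])
    return results
```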
Table 10. Threshold sensitivity.
Model | Best_F1_Mean | Best_F1_Std | Best_Q_Median | Best_Q_Iqr
Isolation Forest | 0.672 | 0.064 | 50 | 20
One-Class SVM | 0.79 | 0.066 | 50 | 0
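The threshold sensitivity analysis in Table 10 selects, per fold, the anomaly-score quantile Q that maximizes F1 for the unsupervised detectors. The sketch below shows one way to perform such a sweep; the quantile grid and the score orientation (larger means more anomalous) are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_quantile_threshold(scores, y_true, quantiles=range(5, 100, 5)):
    """Sweep percentile cut-offs on anomaly scores and keep the best F1.

    For scikit-learn's IsolationForest or OneClassSVM, `scores` could be
    -decision_function(X) so that larger values indicate stronger anomalies.
    """
    best_q, best_f1 = None, -1.0
    for q in quantiles:
        threshold = np.percentile(scores, q)
        y_pred = (scores >= threshold).astype(int)  # 1 = flagged as attack
        f1 = f1_score(y_true, y_pred)
        if f1 > best_f1:
            best_q, best_f1 = q, f1
    return best_q, best_f1
```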
Table 11. Cross-dataset validation (F1 ± SD).
Model | DatasetA_Balanced | DatasetA_Imbalanced | DatasetB_Balanced | DatasetB_Imbalanced
Gradient Boosting | 0.893 ± 0.153 | 0.865 ± 0.161 | 0.907 ± 0.055 | 0.907 ± 0.055
KNN | 0.865 ± 0.161 | 0.846 ± 0.164 | 0.896 ± 0.075 | 0.865 ± 0.118
Logistic Regression | 0.912 ± 0.144 | 0.874 ± 0.159 | 0.902 ± 0.056 | 0.899 ± 0.054
Random Forest | 0.893 ± 0.153 | 0.865 ± 0.161 | 0.907 ± 0.055 | 0.907 ± 0.055
Table 12. Computational efficiency of models.
Model | Feature Extraction (s) | Fit Time (s) | Predict Time (s) | Peak Memory (MB) | Model Size (MB)
Logistic Regression | 0.013 | 0.010 | 0.001 | 15 | 0.001
KNN | 0.013 | 0.002 | 0.009 | 18 | 0.025
Gradient Boosting | 0.013 | 0.420 | 0.003 | 95 | 0.170
Random Forest | 0.013 | 1.600 | 0.160 | 120 | 0.245
One-Class SVM | 0.013 | 0.050 | 0.004 | 40 | 0.008
Isolation Forest | 0.013 | 2.050 | 0.032 | 110 | 1.100
Autoencoder | 0.013 | 3.200 | 0.020 | 250 | 2.000
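The efficiency figures in Table 12 can be approximated with standard-library instrumentation. The sketch below times fitting and prediction, records peak traced memory, and measures serialized model size; tracemalloc only tracks allocations made through Python's allocator, so the memory figure is an approximation, and the helper itself is illustrative rather than the authors' profiler.

```python
import io
import pickle
import time
import tracemalloc


def profile_model(model, X_train, y_train, X_test):
    """Measure fit time, predict time, peak memory, and serialized size."""
    tracemalloc.start()
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    model.predict(X_test)
    predict_time = time.perf_counter() - t0

    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    buffer = io.BytesIO()
    pickle.dump(model, buffer)  # serialized size stands in for on-disk model size
    return {
        "fit_s": fit_time,
        "predict_s": predict_time,
        "peak_memory_mb": peak_bytes / 1e6,
        "model_size_mb": buffer.getbuffer().nbytes / 1e6,
    }
```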
Table 13. Performance of ensemble models (classification metrics).
Ensemble Models | Accuracy | Precision | Recall | F1 | AD %
Bagging (RF base) | 0.530 | 0.528 | 0.533 | 0.531 | 0.68
Boosting (GB base) | 0.545 | 0.543 | 0.546 | 0.544 | 0.78
Voting (Hard) | 0.535 | 0.532 | 0.538 | 0.535 | 0.70
Stacking (Meta-LR) | 0.540 | 0.537 | 0.542 | 0.540 | 0.72
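The four ensemble strategies in Table 13 correspond to standard scikit-learn wrappers. The sketch below shows one plausible configuration; the specific base-learner mix for voting and stacking and the bagging ensemble size are assumptions, since only the base models and meta-learner types are named in the table.

```python
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Random Forest base learner reused across the ensembles below.
base_rf = RandomForestClassifier(n_estimators=200, max_depth=10)

ensembles = {
    # scikit-learn >= 1.2 uses the `estimator` keyword (older releases: `base_estimator`).
    "Bagging (RF base)": BaggingClassifier(estimator=base_rf, n_estimators=10),
    "Boosting (GB base)": GradientBoostingClassifier(n_estimators=150,
                                                     learning_rate=0.05, max_depth=3),
    "Voting (Hard)": VotingClassifier(
        estimators=[("rf", base_rf),
                    ("lr", LogisticRegression(max_iter=1000)),
                    ("knn", KNeighborsClassifier(n_neighbors=5))],
        voting="hard"),
    "Stacking (Meta-LR)": StackingClassifier(
        estimators=[("rf", base_rf),
                    ("knn", KNeighborsClassifier(n_neighbors=5))],
        final_estimator=LogisticRegression(max_iter=1000)),
}
```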
Table 14. Policy effect outcome of ASR.
Model (with ASR) | Accuracy-ASR | Precision-ASR | Recall-ASR | F1-ASR | ASR Effect%
Random Forest + ASR | 0.962 | 0.960 | 0.963 | 0.961 | 0.97
Gradient Boosting + ASR | 0.945 | 0.944 | 0.947 | 0.946 | 0.95
Isolation Forest + ASR | 0.900 | 0.890 | 0.998 | 0.940 | 0.99
Autoencoder + ASR | 0.870 | 0.860 | 0.873 | 0.865 | 0.88
Logistic Regression + ASR | 0.930 | 0.928 | 0.933 | 0.930 | 0.92
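The Adaptive Security Response (ASR) policy couples a detector's output to a mitigation action. Since the policy's internals are not restated in this table, the sketch below is only a generic detect-then-respond wrapper under assumption; the function name, threshold, and callback are hypothetical and do not represent the paper's exact ASR mechanism.

```python
def adaptive_security_response(model, features, response_fn, threshold=0.5):
    """Illustrative detect-then-respond wrapper (not the paper's exact ASR policy).

    `model` must expose predict_proba; `response_fn` is a callback applying a
    mitigation (for example, isolating a suspect node or holding a transaction).
    """
    attack_probability = model.predict_proba([features])[0][1]
    if attack_probability >= threshold:
        response_fn(features, attack_probability)  # trigger the mitigation action
        return "mitigated"
    return "allowed"
```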
Table 15. Comparison summary of related studies and this study.
Study | Approach | Limitation | This Study's Contribution | Models Used | Attacks Considered | Dataset Type | Performance Evaluation
[3] | Comprehensive literature review of blockchain threats and vulnerabilities | Conceptual only; no empirical validation or defense model | Moves beyond review by implementing ML-based proactive defense validated on simulated datasets | N/A (survey) | Broad overview of multiple blockchain threats | Literature review | No performance reported
[5] | Taxonomy of blockchain attacks and defenses (consensus, network, application) | No empirical testing or prototype | Provides empirical ML evaluation across multiple simulated attack scenarios | N/A (conceptual) | Multiple attack categories in theory | Literature review/taxonomy | No performance reported
[17] | Graph-theoretic model to predict/mitigate Denial-of-Chain (DoC) | Focused on a single attack (DoC only) | Extends scope to DoC plus other attacks, embedded in ML-based multi-layer defense | Graph-theory model | DoC (consensus attack) | Simulation | Attack feasibility modeled; no ML results
[18] | Simulation of Black Bird 51% attack variant | No counter-measures proposed | Empirically evaluates ML anomaly detection against Black Bird-style attacks within adaptive framework | N/A (attack simulation) | Black Bird (51% attack variant) | Simulation logs | Impact analysis; no defense performance
[25] | Block Access Restriction (BAR) to mitigate BBEDSA | Specific to Black Bird double-spend; lacks generalization | Generalizes by applying ML-based anomaly detection across diverse attacks | BAR mechanism | Black Bird embedded double spend | Simulation | Attack prevented in model; no ML metrics
[21] | CIRCUIT: JavaScript memory-heap analysis for cryptojacking detection | Narrow scope; limited to cryptojacking | Broadens anomaly detection across diverse blockchain threats, integrating adaptive ML response | ML classifiers for cryptojacking | Cryptojacking | Internet-scale crawl (websites) | Detected ~1800 malicious sites; high precision
This Study (2025) | ML-based anomaly detection + Adaptive Security Response (ASR) across multiple attack types | Based on simulated logs only; not yet tested on live blockchain networks | First integrated ML-based proactive defense benchmarked across multiple simulated blockchain attack scenarios | Random Forest, Gradient Boosting, Logistic Regression, KNN, One-Class SVM, Isolation Forest, Autoencoder, Ensembles, ASR | Black Bird (51% and BBEDSA), DoC, Sybil, Eclipse, Finney, Smart Contract | Simulated blockchain logs (Datasets A and B) | Supervised models F1 > 0.90 (balanced); unsupervised complementary (OCSVM F1 ≈ 0.79); Finney hardest (F1 ≈ 0.58); efficiency: RF/GB training < 2 s, inference < 0.2 s
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
