
Search Results (158)

Search Parameters:
Keywords = Apache Spark

14 pages, 417 KB  
Article
An Architectural Optimization Framework for Scalable Spatial Clustering in High-Redundancy Environments
by Carlos Roberto Valêncio, Wellington Reguera Gouveia, Geraldo Francisco Donegá Zafalon, Angelo Cesar Colombini, Mario Luiz Tronco and Tiago Luís de Andrade
Technologies 2026, 14(3), 171; https://doi.org/10.3390/technologies14030171 - 10 Mar 2026
Viewed by 244
Abstract
Spatial Big Data mining is often hindered by high computational complexity and the intrinsic autocorrelation of georeferenced records. To address these challenges, this study proposes an architectural optimization framework for the CHSMST+ algorithm, designated as CHSMST+MR. Rather than introducing a brand-new clustering paradigm, the framework focuses on a Distributed Spatial Cardinality Reduction (DSCR) layer that aggregates redundant spatial records before the core iterative mining logic begins. By transforming raw records into a weighted key-value representation within the Apache Spark environment, the proposed approach significantly mitigates the shuffling bottleneck common in distributed systems. Experimental validation using high-density biological datasets demonstrates an average execution-time reduction of 51.36%, with performance gains reaching up to 79.96% in specific high-redundancy scenarios. The results, obtained through controlled local emulation, confirm that this architectural optimization provides a scalable, deterministic, and lossless solution for accelerating spatial clustering. This work contributes a methodological path for enhancing the performance of iterative spatial mining algorithms in environments characterized by massive data density and coordinate redundancy. Full article
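The core DSCR step described above, collapsing duplicate coordinates into weighted key-value pairs before the iterative clustering begins, can be sketched in plain Python; in Spark this would be a `map` to `(point, 1)` pairs followed by `reduceByKey`. Function and variable names here are illustrative, not the authors' code:

```python
from collections import Counter

def dscr_reduce(records):
    """Collapse duplicate spatial records into (coordinate, weight) pairs.

    Local emulation of the Spark pipeline
        rdd.map(lambda p: (p, 1)).reduceByKey(lambda a, b: a + b)
    so the clustering stage sees one weighted point per distinct
    coordinate instead of every raw record.
    """
    return dict(Counter(records))

# Nine raw records collapse to three weighted points, shrinking the
# data shuffled between clustering iterations.
points = [(1.0, 2.0)] * 5 + [(3.0, 4.0)] * 3 + [(5.0, 6.0)]
weighted = dscr_reduce(points)
```

Because the aggregation is exact, the reduction is lossless in the sense the abstract claims: any statistic computed from the weighted points equals the one computed from the raw records.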

8 pages, 252 KB  
Proceeding Paper
Enhancing Candidate Generation in Recommendation Systems Through LLM-Powered Semantic Enrichment in a Distributed Environment
by Balagangadhar Reddy Kandula and Lija Jacob
Eng. Proc. 2026, 124(1), 55; https://doi.org/10.3390/engproc2026124055 - 6 Mar 2026
Viewed by 407
Abstract
Effective candidate generation is a critical component of two-stage recommender systems; however, traditional methods such as Term Frequency–Inverse Document Frequency (TF-IDF) often fail to capture deep semantic context. This limitation leads to suboptimal recall rates, particularly for new or niche items—a challenge commonly referred to as the cold start problem—thereby degrading overall recommendation quality and user experience. This study proposes a semantically aware approach to improve the initial recall phase of recommendation pipelines. The methodology integrates Large Language Models (LLMs) into a distributed Apache Spark pipeline for large-scale content enrichment, generating 768-dimensional vector embeddings and concise, context-aware summaries for each content item. These enriched representations are indexed in Elasticsearch to enable efficient vector-based retrieval during candidate generation. Quantitative evaluation on a corpus of 143,000 Wikipedia articles demonstrates that the LLM-enriched method achieves a Recall@10 of 62%, representing a 37% relative improvement over the TF-IDF baseline (45%). When relevance is measured using only embedding-independent signals (category overlap and keyword similarity), the method still achieves a Recall@10 of 58%, confirming that gains are not an artifact of the evaluation metric. The resulting candidate pools exhibit improved semantic diversity and broader category coverage, delivering richer input for downstream ranking models. Full article
(This article belongs to the Proceedings of The 6th International Electronic Conference on Applied Sciences)
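For context, the Recall@10 figures quoted above measure the fraction of relevant items that appear among the top 10 retrieved candidates, averaged over queries. A minimal sketch (function and variable names are illustrative):

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of relevant items found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_recall_at_k(runs, k=10):
    """Average Recall@k over (retrieved, relevant) pairs, one per query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(scores) / len(scores)
```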

23 pages, 1907 KB  
Article
Intelligent Hybrid Caching for Sustainable Big Data Processing: Leveraging NVM to Enable Green Digital Transformation
by Lei Tong, Qing Shen and Zhenqiang Xie
Sustainability 2026, 18(5), 2601; https://doi.org/10.3390/su18052601 - 6 Mar 2026
Viewed by 305
Abstract
Apache Spark has gained widespread adoption for large-scale data processing. However, conventional caching methods inadequately address the dual challenges of performance bottlenecks and escalating energy consumption in data-intensive workloads. This paper introduces a sustainable computing framework that integrates Directed Acyclic Graph (DAG) dependency analysis with garbage collection (GC) behavior monitoring to optimize data placement between DRAM and non-volatile memory (NVM). The proposed Intelligent Hybrid Caching Management Framework (IHCMF) dynamically predicts data access patterns and migrates cache blocks based on cost–benefit analysis, achieving a 37.5% execution time reduction over default Spark configurations in SparkBench evaluations. By improving throughput-per-watt and projecting potential benefits from NVM’s near-zero idle power and extended hardware lifespan, IHCMF provides a scalable, cost-effective caching solution for resource-constrained edge computing environments. This work demonstrates that high-performance computing can be reconciled with environmental sustainability through intelligent memory management. Full article
(This article belongs to the Topic Green Technology Innovation and Economic Growth)
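The abstract does not detail IHCMF's cost–benefit model, but the general shape of such a policy, scoring each cached block and demoting low scorers from DRAM to NVM, might look as follows. The scoring formula and all names are assumptions for illustration, not the authors' actual model:

```python
def placement_score(access_freq, recompute_cost_s, size_mb, gc_pressure):
    """Hypothetical cost-benefit score: blocks that are accessed often,
    expensive to recompute, small, and under low GC pressure earn DRAM."""
    return access_freq * recompute_cost_s / (size_mb * (1.0 + gc_pressure))

def plan_placement(blocks, dram_slots):
    """Keep the top-scoring blocks in DRAM; demote the rest to NVM.

    Each block is a tuple (name, access_freq, recompute_cost_s,
    size_mb, gc_pressure).
    """
    ranked = sorted(blocks, key=lambda b: placement_score(*b[1:]), reverse=True)
    return {name: ("DRAM" if i < dram_slots else "NVM")
            for i, (name, *_) in enumerate(ranked)}

plan = plan_placement(
    [("hot", 100, 2.0, 64, 0.1), ("cold", 2, 0.5, 256, 0.4)],
    dram_slots=1,
)
```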

14 pages, 810 KB  
Article
Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark
by Ylenia Galluzzo, Raffaele Giancarlo, Mario Randazzo and Simona E. Rombo
Data 2026, 11(3), 48; https://doi.org/10.3390/data11030048 - 2 Mar 2026
Viewed by 295
Abstract
With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of “omics” data are collected daily and need to be processed. Indexing and compressing large sequence datasets are among the most important tasks in this context. Here, we propose a novel approach for the computation of the Burrows Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. We implement three algorithms based on the MapReduce framework, distributing the index computation itself and not only the input dataset, unlike previous approaches in the literature. Experimental results on real datasets show that the proposed approach is promising. Full article
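For reference, the transform being distributed is the classic Burrows Wheeler transform: sort all rotations of the sentinel-terminated input and read off the last column. A single-machine sketch of what the paper's Spark algorithms compute at scale:

```python
def bwt(text, sentinel="$"):
    """Burrows Wheeler transform: last column of the sorted rotation matrix.

    The sentinel must be absent from the input and sort before every
    other character, which makes the transform invertible.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```

The naive version above materializes all rotations, which is exactly what becomes infeasible on NGS-scale inputs and motivates distributing the index computation itself.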

13 pages, 1497 KB  
Article
A Spatio-Temporal Model for Intelligent Vehicle Navigation Using Big Data and SparkML LSTM
by Imad El Mallahi, Jamal Riffi, Hamid Tairi, Mostafa El Mallahi and Mohamed Adnane Mahraz
World Electr. Veh. J. 2026, 17(1), 54; https://doi.org/10.3390/wevj17010054 - 22 Jan 2026
Viewed by 332
Abstract
The rapid development of autonomous driving systems has increased the demand for scalable frameworks capable of modeling vehicle motion patterns in complex traffic environments. This paper proposes a big data spatio-temporal modeling architecture that integrates Apache Spark version 4.0.1 (SparkML) with Long Short-Term Memory (LSTM) networks to analyze and classify vehicle trajectory patterns. The proposed SparkML–LSTM framework exploits Spark’s distributed processing capabilities and LSTM’s strength in sequential learning to handle large-scale traffic trajectory data efficiently. Experiments were conducted using the DETRAC dataset, which is a large-scale benchmark for vehicle detection and multi-object tracking consisting of more than 10 h of video captured at 24 different locations. The videos were recorded at 25 frames per second with a resolution of 960 × 540 pixels and annotated across more than 140,000 frames, covering 8250 vehicles and approximately 1.21 million bounding box annotations. The dataset provides detailed annotations, including vehicle categories (Car, Bus, Van, Others), weather conditions (Sunny, Cloudy, Rainy, Night), occlusion ratio, truncation ratio, and vehicle scale. Based on the extracted trajectory features, vehicle motion patterns were categorized into predefined movement classes derived from trajectory dynamics. The experimental results demonstrate strong classification performance. These findings suggest that the proposed SparkML–LSTM architecture is effective for large-scale spatio-temporal trajectory modeling and traffic behavior analysis, and can serve as a foundation for higher-level decision-making modules in intelligent transportation systems. Full article
(This article belongs to the Section Automated and Connected Vehicles)

28 pages, 12374 KB  
Article
A Distributed Instance Selection Algorithm Based on Cognitive Reasoning for Regression Tasks
by Linzi Yin, Wendi Cai, Zhanqi Li and Xiaochao Hou
Appl. Sci. 2026, 16(2), 913; https://doi.org/10.3390/app16020913 - 15 Jan 2026
Viewed by 234
Abstract
Instance selection is a critical preprocessing technique for enhancing data quality and improving machine learning model efficiency. However, existing algorithms for regression tasks face a fundamental trade-off: non-heuristic methods offer high precision but suffer from sequential dependencies that hinder parallelization, while heuristic methods support parallelization but often yield coarse-grained results susceptible to local optima. To address these challenges, we propose CRDISA, a novel distributed instance selection algorithm driven by a formalized cognitive reasoning logic. Unlike traditional approaches that evaluate subsets, CRDISA transforms each instance into an independent “Instance Expert” capable of reasoning about the global data distribution through a unique difference knowledge base. For regression tasks with continuous outputs, we introduce a soft partitioning strategy to define adaptive error boundaries and a bidirectional voting mechanism to robustly identify high-quality instances. Although the fine-grained reasoning implies high computational complexity, we implement CRDISA on Apache Spark using an optimized broadcast mechanism. This architecture provides linear scalability in wall-clock time, enabling scalable processing without sacrificing theoretical rigor. Experiments on 22 datasets demonstrate that CRDISA achieves an average compression rate of 31.7% while maintaining predictive accuracy (R² = 0.681) comparable to or better than state-of-the-art methods, proving its superiority in balancing selection granularity and distributed efficiency. Full article
(This article belongs to the Special Issue Big Data Driven Machine Learning and Deep Learning)

26 pages, 3290 KB  
Article
Empirical Evaluation of Big Data Stacks: Performance and Design Analysis of Hadoop, Modern, and Cloud Architectures
by Widad Elouataoui and Youssef Gahi
Big Data Cogn. Comput. 2026, 10(1), 7; https://doi.org/10.3390/bdcc10010007 - 24 Dec 2025
Viewed by 1711
Abstract
The proliferation of big data applications across various industries has led to a paradigm shift in data architecture, with traditional approaches giving way to more agile and scalable frameworks. The evolution of big data architecture began with the emergence of the Hadoop-based data stack, leveraging technologies like Hadoop Distributed File System (HDFS) and Apache Spark for efficient data processing. However, recent years have seen a shift towards modern data stacks, offering flexibility and diverse toolsets tailored to specific use cases. Concurrently, cloud computing has revolutionized big data management, providing unparalleled scalability and integration capabilities. Despite their benefits, navigating these data stack paradigms can be challenging. While existing literature offers valuable insights into individual data stack paradigms, there remains a dearth of studies that offer practical, in-depth comparisons of these paradigms across the entire big data value chain. To address this gap, this paper examines three main big data stack paradigms: the Hadoop data stack, the modern data stack, and the cloud-based data stack. In this study, we conduct an exhaustive architectural comparison of these stacks, covering the entire big data value chain from data acquisition to exposition. Moreover, this study extends beyond architectural considerations to include end-to-end use case implementations for a comprehensive evaluation of each stack. Using a large dataset of Amazon reviews, different data stack scenarios are implemented and compared. Furthermore, the paper explores critical factors such as data integration, implementation costs, and ease of deployment to provide researchers and practitioners with a relevant and up-to-date reference for navigating the complex landscape of big data technologies and making informed decisions about data strategies. Full article
(This article belongs to the Topic Big Data and Artificial Intelligence, 3rd Edition)

50 pages, 856 KB  
Article
LLM-Driven Big Data Management Across Digital Governance, Marketing, and Accounting: A Spark-Orchestrated Framework
by Aristeidis Karras, Leonidas Theodorakopoulos, Christos Karras, George A. Krimpas, Anastasios Giannaros and Charalampos-Panagiotis Bakalis
Algorithms 2025, 18(12), 791; https://doi.org/10.3390/a18120791 - 15 Dec 2025
Viewed by 1601
Abstract
In this work, we present a principled framework for the deployment of Large Language Models (LLMs) in enterprise big data management across digital governance, marketing, and accounting domains. Unlike conventional predictive applications, our approach integrates LLMs as auditable, sector-adaptive components that robustly and directly enhance data curation, lineage, and regulatory compliance. The study contributes (i) a systematic evaluation of seven LLM-enabled functions—including schema mapping, entity resolution, and document extraction—that directly improve data quality and operational governance; (ii) a distributed architecture that deploys Apache Spark orchestration with Markov Chain Monte Carlo sampling to achieve quantifiable uncertainty and reproducible audit trails; and (iii) a cross-sector analysis demonstrating robust semantic accuracy, compliance management, and explainable outputs suited to diverse assurance requirements. Empirical evaluations reveal that the proposed architecture persistently attains elevated mapping precision, resilient multimodal feature extraction, and consistent human supervision. These characteristics collectively reinforce the integrity, accountability, and transparency of information ecosystems, particularly within compliance-driven organizational settings. Full article

34 pages, 831 KB  
Review
Recent Trends in Machine Learning for Healthcare Big Data Applications: Review of Velocity and Volume Challenges
by Doaa Yaseen Khudhur, Abdul Samad Shibghatullah, Khalid Shaker, Aliza Abdul Latif and Zakaria Che Muda
Algorithms 2025, 18(12), 772; https://doi.org/10.3390/a18120772 - 8 Dec 2025
Viewed by 1777
Abstract
The integration and emerging adoption of machine learning (ML) algorithms in healthcare big data has revolutionized clinical decision-making, predictive analytics, and real-time medical diagnostics. However, the application of machine learning in healthcare big data faces computational challenges, particularly in efficiently processing and training on large-scale, high-velocity data generated by healthcare organizations worldwide. In response to these issues, this study critically reviews and examines current state-of-the-art advancements in machine learning algorithms and big data frameworks within healthcare analytics, with a particular emphasis on solutions addressing data volume and velocity. The reviewed literature is categorized into three key areas: (1) efficient techniques, arithmetic operations, and dimensionality reduction; (2) advanced and specialized processing hardware; and (3) clustering and parallel processing methods. Key research gaps and open challenges are identified based on the evaluation of the literature across these categories, and important future research directions are discussed in detail. Among the several proposed solutions are the utilization of federated learning and decentralized data processing, as well as efficient parallel processing through big data frameworks such as Apache Spark, neuromorphic computing, and multi-swarm large-scale optimization algorithms; these highlight the importance of interdisciplinary innovations in algorithm design, hardware efficiency, and distributed computing frameworks, which collectively contribute to faster, more accurate, and resource-efficient AI-driven healthcare big data analytics and applications. This research supports UNSDG 3 (Good Health and Well-Being) and UNSDG 9 (Industry, Innovation and Infrastructure) by integrating machine learning into healthcare big data and by promoting product innovation in the healthcare industry, respectively. Full article

9 pages, 433 KB  
Proceeding Paper
Contextual Modeling and Intelligent Decision-Making for IoT Systems: A Combined Ontology and Machine Learning Approach
by Sanaa Mouhim
Eng. Proc. 2025, 112(1), 71; https://doi.org/10.3390/engproc2025112071 - 18 Nov 2025
Viewed by 656
Abstract
In the context of the Internet of Things (IoT), this article proposes an innovative approach combining ontologies and the Apache Spark MLlib library to design an intelligent system capable of dynamically adapting to its environment. The aim is to model the context, including users, devices, events, and environmental conditions, and to exploit massive sensor data to generate intelligent, contextualized predictions. The architecture relies on two pillars: an ontology as a formal way to structure and semantically annotate knowledge, and Spark MLlib to execute big data machine learning algorithms, notably random forest regression. The solution is targeted at real-time applications such as energy or air quality management in smart homes. The results demonstrate the value of combining ontology and machine learning to improve contextual knowledge and automatic decision-making. Full article

41 pages, 762 KB  
Article
MCMC Methods: From Theory to Distributed Hamiltonian Monte Carlo over PySpark
by Christos Karras, Leonidas Theodorakopoulos, Aristeidis Karras, George A. Krimpas, Charalampos-Panagiotis Bakalis and Alexandra Theodoropoulou
Algorithms 2025, 18(10), 661; https://doi.org/10.3390/a18100661 - 17 Oct 2025
Cited by 1 | Viewed by 1497
Abstract
The Hamiltonian Monte Carlo (HMC) method is effective for Bayesian inference but suffers from synchronization overhead in distributed settings. We propose two variants: a distributed HMC (DHMC) baseline with synchronized, globally exact gradient evaluations and a communication-avoiding leapfrog HMC (CALF-HMC) method that interleaves local surrogate micro-steps with a single global Metropolis–Hastings correction per trajectory. Implemented on Apache Spark/PySpark and evaluated on a large synthetic logistic regression (N = 10⁷, d = 100, workers J ∈ {4, 8, 16, 32}), DHMC attained an average acceptance of 0.986, mean ESS of 1200, and wall-clock of 64.1 s per evaluation run, yielding 18.7 ESS/s; CALF-HMC achieved an acceptance of 0.942, mean ESS of 5.1, and 14.8 s, i.e., ≈0.34 ESS/s under the tested surrogate configuration. While DHMC delivered higher ESS/s due to robust mixing under conservative integration, CALF-HMC reduced the per-trajectory runtime and exhibited more favorable scaling as inter-worker latency increased. The study contributes (i) a systems-oriented communication cost model for distributed HMC, (ii) an exact, communication-avoiding leapfrog variant, and (iii) practical guidance for ESS/s-optimized tuning on clusters. Full article
(This article belongs to the Special Issue Numerical Optimization and Algorithms: 4th Edition)
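The leapfrog integrator that both variants build on alternates half-step momentum updates with full-step position updates. A minimal single-chain sketch for a standard Gaussian target, where the potential is U(q) = q²/2 and its gradient is simply q (names are illustrative, not the paper's PySpark code):

```python
def leapfrog(q, p, grad_u, step, n_steps):
    """Leapfrog integration of Hamiltonian dynamics.

    Half momentum step, n full position/momentum steps, final half
    momentum step: the scheme is symplectic and time-reversible, which
    is what keeps the Metropolis-Hastings correction in HMC exact.
    """
    p = p - 0.5 * step * grad_u(q)
    for _ in range(n_steps):
        q = q + step * p
        p = p - step * grad_u(q)
    p = p + 0.5 * step * grad_u(q)  # roll back the surplus half step
    return q, p

# Harmonic oscillator: energy (q^2 + p^2) / 2 should stay near 0.5.
q, p = leapfrog(1.0, 0.0, lambda x: x, step=0.1, n_steps=63)
```

The communication cost the paper targets comes from `grad_u`: in distributed HMC every gradient evaluation is a cluster-wide aggregation, so reducing how often it is called globally is what CALF-HMC's surrogate micro-steps buy.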

30 pages, 2573 KB  
Article
Agent Systems and GIS Integration in Requirements Analysis and Selection of Optimal Locations for Energy Infrastructure Facilities
by Anna Kochanek, Tomasz Zacłona, Michał Szucki and Nikodem Bulanda
Appl. Sci. 2025, 15(19), 10406; https://doi.org/10.3390/app151910406 - 25 Sep 2025
Cited by 3 | Viewed by 1251
Abstract
The dynamic development of agent systems and large language models opens up new possibilities for automating spatial and investment analyses. The study evaluated a reactive AI agent with an NLP interface, integrating Apache Spark for large-scale data processing with PostGIS as a reference point. The analyses were carried out for two areas: Nowy Sącz (36,000 plots, 7 layers) and Ostrołęka (220,000 plots). For medium-sized datasets, both technologies produced similar results, but with large datasets, PostGIS exceeded time limits and was prone to failures. Spark maintained stable performance, analyzing 220,000 plots in approximately 240 s, confirming its suitability for interactive applications. In addition, clustering and spatial search algorithms were compared. The basic DFS required 530 s, while the improved one reduced the time almost tenfold to 54–62 s. The improved K-Means increased the spatial compactness of clusters (0.61–0.76 vs. <0.50 in most base cases) with a time of 56–64 s. Agglomerative clustering, although accurate, was too slow (3000–6000 s). The results show that the combination of Spark, improved algorithms, and agent systems with NLP significantly speeds up the selection of plots for renewable energy sources, supporting sustainable investment decisions. Full article
(This article belongs to the Special Issue Urban Geospatial Analytics Based on Big Data)

26 pages, 1607 KB  
Article
Analyzing Performance of Data Preprocessing Techniques on CPUs vs. GPUs with and Without the MapReduce Environment
by Sikha S. Bagui, Colin Eller, Rianna Armour, Shivani Singh, Subhash C. Bagui and Dustin Mink
Electronics 2025, 14(18), 3597; https://doi.org/10.3390/electronics14183597 - 10 Sep 2025
Viewed by 1959
Abstract
Data preprocessing is usually necessary before running most machine learning classifiers. This work compares three different preprocessing techniques: minimal preprocessing, Principal Components Analysis (PCA), and Linear Discriminant Analysis (LDA). The efficiency of these three preprocessing techniques is measured using the Support Vector Machine (SVM) classifier. Efficiency is measured in terms of statistical metrics such as accuracy, precision, recall, the F-1 measure, and AUROC. The preprocessing times and the classifier run times are also compared using the three differently preprocessed datasets. Finally, a comparison of performance timings on CPUs vs. GPUs with and without the MapReduce environment is performed. Two newly created Zeek Connection Log datasets, collected using the Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework, UWF-ZeekData22 and UWF-ZeekDataFall22, are used for this work. Results from this work show that binomial LDA, on average, performs the best in terms of statistical measures as well as timings using GPUs or MapReduce GPUs. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)

27 pages, 2960 KB  
Article
(H-DIR)²: A Scalable Entropy-Based Framework for Anomaly Detection and Cybersecurity in Cloud IoT Data Centers
by Davide Tosi and Roberto Pazzi
Sensors 2025, 25(15), 4841; https://doi.org/10.3390/s25154841 - 6 Aug 2025
Viewed by 1434
Abstract
Modern cloud-based Internet of Things (IoT) infrastructures face increasingly sophisticated and diverse cyber threats that challenge traditional detection systems in terms of scalability, adaptability, and explainability. In this paper, we present (H-DIR)², a hybrid entropy-based framework designed to detect and mitigate anomalies in large-scale heterogeneous networks. The framework combines Shannon entropy analysis with Associated Random Neural Networks (ARNNs) and integrates semantic reasoning through RDF/SPARQL, all embedded within a distributed Apache Spark 3.5.0 pipeline. We validate (H-DIR)² across three critical attack scenarios—SYN Flood (TCP), DAO-DIO (RPL), and NTP amplification (UDP)—using real-world datasets. The system achieves a mean detection latency of 247 ms and an AUC of 0.978 for SYN floods. For DAO-DIO manipulations, it increases the packet delivery ratio from 81.2% to 96.4% (p < 0.01), and for NTP amplification, it reduces the peak load by 88%. The framework achieves vertical scalability across millions of endpoints and horizontal scalability on datasets exceeding 10 TB. All code, datasets, and Docker images are provided to ensure full reproducibility. By coupling adaptive neural inference with semantic explainability, (H-DIR)² offers a transparent and scalable solution for cloud–IoT cybersecurity, establishing a robust baseline for future developments in edge-aware and zero-day threat detection. Full article
(This article belongs to the Special Issue Privacy and Cybersecurity in IoT-Based Applications)
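The entropy signal underlying this kind of detector is simple: a spoofed-source SYN flood flattens the source-IP distribution of a traffic window and thus raises its Shannon entropy. A minimal sketch of windowed entropy scoring (the threshold logic and names are illustrative stand-ins, not the paper's ARNN-based decision):

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def looks_like_syn_flood(src_ips, baseline_bits, margin=1.0):
    """Flag a traffic window whose source-IP entropy exceeds the learned
    baseline by a margin -- a crude stand-in for the neural decision."""
    return shannon_entropy(src_ips) > baseline_bits + margin
```

In a streaming pipeline this per-window score is cheap to compute in parallel, which is what makes entropy features attractive at the data volumes the paper reports.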

20 pages, 3412 KB  
Article
Scalable Graph Coloring Optimization Based on Spark GraphX Leveraging Partition Asymmetry
by Yihang Shen, Xiang Li, Tao Yuan and Shanshan Chen
Symmetry 2025, 17(8), 1177; https://doi.org/10.3390/sym17081177 - 23 Jul 2025
Viewed by 1134
Abstract
Many challenges in solving large graph coloring through parallel strategies remain unresolved. Previous algorithms based on Pregel-like frameworks, such as Apache Giraph, encounter parallelism bottlenecks due to sequential execution and the need for a full graph traversal in certain stages. Additionally, GPU-based algorithms face the dilemma of costly and time-consuming processing when moving complex graph applications to GPU architectures. In this study, we propose Spardex, a novel parallel and distributed graph coloring optimization algorithm designed to overcome these challenges. We design a symmetry-driven optimization approach wherein the EdgePartition1D strategy in GraphX induces partitioning asymmetry, leading to overlapping locally symmetric regions. This structure is leveraged through asymmetric partitioning and symmetric reassembly to reduce the search space. A two-stage pipeline consisting of partitioned repaint and core conflict detection is developed, enabling the precise correction of conflicts without traversing the entire graph as in previous algorithms. We also integrate symmetry principles from combinatorial optimization into a distributed computing framework, demonstrating that leveraging locally symmetric subproblems can significantly enhance the efficiency of large-scale graph coloring. Combined with Spark-specific optimizations such as AQE skew join optimization, all these techniques contribute to efficient parallel graph coloring in Spardex. We conducted experiments on the Aliyun Cloud platform. The results demonstrate that Spardex achieves a reduction of 8–72% in the number of colors and a speedup of 1.13–10.27 times over concurrent algorithms. Full article
(This article belongs to the Special Issue Symmetry in Solving NP-Hard Problems)
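The primitives behind such a two-stage pipeline (color within partitions, then detect and repair conflicts on boundary vertices) can be illustrated on a single machine: color greedily, then list edges whose endpoints collide, which is exactly the set a repaint stage must fix. A sequential stand-in, not the GraphX implementation:

```python
def greedy_color(adj):
    """Assign each vertex the smallest color unused by its neighbors."""
    colors = {}
    for v in sorted(adj):
        taken = {colors[u] for u in adj[v] if u in colors}
        colors[v] = next(c for c in range(len(adj)) if c not in taken)
    return colors

def conflicts(adj, colors):
    """Edges whose endpoints share a color -- what a repaint stage fixes."""
    return [(u, v) for u in adj for v in adj[u]
            if u < v and colors[u] == colors[v]]
```

When partitions are colored independently, conflicts can only arise on edges that cross partitions, so restricting the second stage to those edges avoids the full graph traversal the abstract criticizes.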
