Algorithms

2025

52 pages, 567 KB

Open Access Review

Algorithmic Techniques for GPU Scheduling: A Comprehensive Survey

by Robert Chab, Fei Li and Sanjeev Setia

Algorithms 2025, 18(7), 385; https://doi.org/10.3390/a18070385 - 25 Jun 2025

Cited by 5 | Viewed by 15625

Abstract

In this survey, we provide a comprehensive classification of GPU task scheduling approaches, categorized by their underlying algorithmic techniques and evaluation metrics. We examine traditional methods—including greedy algorithms, dynamic programming, and mathematical programming—alongside advanced machine learning techniques integrated into scheduling policies. We also evaluate the performance of these approaches across diverse applications. This work focuses on understanding the trade-offs among various algorithmic techniques, the architectural and job-level factors influencing scheduling decisions, and the balance between user-level and service-level objectives. The analysis shows that no one paradigm dominates; instead, the highest-performing schedulers blend the predictability of formal methods with the adaptability of learning, often moderated by queueing insights for fairness. We also discuss key challenges in optimizing GPU resource management and suggest potential solutions. Full article

21 pages, 1005 KB

Open Access Article

Q8S: Emulation of Heterogeneous Kubernetes Clusters Using QEMU

by Jonathan Decker, Vincent Florens Hasse and Julian Kunkel

Algorithms 2025, 18(6), 324; https://doi.org/10.3390/a18060324 - 29 May 2025

Cited by 1 | Viewed by 2134

Abstract

Kubernetes has emerged as the industry standard for container orchestration in cloud environments, with its scheduler dynamically placing container instances across cluster nodes based on predefined rules and algorithms. Various efforts have been made to extend and improve upon the Kubernetes scheduler. However, as the majority of Kubernetes clusters operate on homogeneous hardware, most scheduling algorithms are also only developed for homogeneous systems. Heterogeneous infrastructures, which include IoT devices or specialized hardware, have become more widespread and require specialized tuning to optimize workload assignment. Researchers and developers working on scheduling systems therefore need access to heterogeneous hardware for development and testing, and such hardware may not be available. While simulations such as CloudSim or K8sSim can provide insights, the level of detail they can offer to validate new schedulers is limited, as they are only simulations. To address this, we introduce Q8S, a tool for emulating heterogeneous Kubernetes clusters including x86_64 and ARM64 architectures on OpenStack using QEMU. Emulations created through Q8S provide a higher level of detail than simulations and can be used to train machine learning scheduling algorithms. By providing an environment capable of executing real workloads, Q8S enables researchers and developers to test and refine their scheduling algorithms, ultimately leading to more efficient and effective heterogeneous cluster management. We release our implementation of Q8S as open source. Full article

16 pages, 289 KB

Open Access Article

A Local 6-Approximation Distributed Algorithm for Minimum Dominating Set Problem in Planar Triangle-Free Graphs

by Wojciech Wawrzyniak

Algorithms 2025, 18(5), 280; https://doi.org/10.3390/a18050280 - 10 May 2025

Viewed by 1254

Abstract

In this paper, we present a new distributed approximation algorithm for the minimum dominating set problem in planar triangle-free graphs. The algorithm operates in a constant number of rounds in the LOCAL model. Using the bunch technique, we prove that our algorithm achieves an approximation ratio of 6, which is a significant improvement over previous results for distributed algorithms, where the best known approximation ratio was $8 + \epsilon$ for any $\epsilon > 0$. While sequential algorithms can achieve approximation ratios below 5 for this problem, our distributed algorithm achieves the best known approximation ratio in the LOCAL model. We provide a detailed proof and analysis of the algorithm, which can be implemented in a distributed manner. Full article

14 pages, 6384 KB

Open Access Article

Parallel CUDA-Based Optimization of the Intersection Calculation Process in the Greiner–Hormann Algorithm

by Jiwei Zuo, Junfu Fan, Kuan Li, Qingyun Liu, Yuke Zhou and Yi Zhang

Algorithms 2025, 18(3), 147; https://doi.org/10.3390/a18030147 - 5 Mar 2025

Viewed by 2069

Abstract

The Greiner–Hormann algorithm is a commonly used polygon overlay analysis algorithm. It uses a doubly linked list structure to store vertex data, and its intersection calculation step has a significant effect on the overall operating efficiency of the algorithm. To address the time-consuming intersection calculation process in the Greiner–Hormann algorithm, this paper presents two kernel functions that implement a GPU parallel improvement algorithm based on CUDA multi-threading. This method allocates a thread to each edge of the subject polygon, determines in parallel whether it intersects with each edge of the clipping polygon, transfers the number of intersection points back to the CPU for calculation, and allocates the corresponding storage space on the GPU side on the basis of the total number of intersection points; then, information such as intersection coordinates is calculated in parallel. In addition, experiments are conducted on the data of eight polygons with different complexities, and the optimal thread mode, running time, and speedup ratio of the parallel algorithm are statistically analyzed. The experimental results show that when a single CUDA thread block contains 64 threads or 128 threads, the parallel transformation step of the Greiner–Hormann algorithm has the highest computational efficiency. When the complexity of the subject polygon exceeds 53,000, the parallel improvement algorithm can obtain a speedup ratio of approximately three times that of the serial algorithm. This shows that the design method in this paper can effectively improve the operating efficiency of the polygon overlay analysis algorithm in the current large-scale data context. Full article
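The count-then-fill pattern described in this abstract can be illustrated with a small sequential sketch in Python. This is not the authors' CUDA code: the function names are hypothetical, each iteration over a subject edge stands in for one GPU thread, and the use of an exclusive prefix sum to derive write offsets is an assumption about how the per-edge counts might be turned into storage positions.

```python
from itertools import accumulate

def seg_intersection(p, q, r, s):
    """Return the intersection point of segments pq and rs, or None."""
    d = (q[0] - p[0]) * (s[1] - r[1]) - (q[1] - p[1]) * (s[0] - r[0])
    if d == 0:
        return None  # parallel or collinear segments are ignored in this sketch
    t = ((r[0] - p[0]) * (s[1] - r[1]) - (r[1] - p[1]) * (s[0] - r[0])) / d
    u = ((r[0] - p[0]) * (q[1] - p[1]) - (r[1] - p[1]) * (q[0] - p[0])) / d
    if 0 <= t <= 1 and 0 <= u <= 1:
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
    return None

def edges(poly):
    """Closed polygon given as a vertex list -> list of (start, end) edges."""
    return [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]

def intersections_two_pass(subject, clip):
    s_edges, c_edges = edges(subject), edges(clip)
    # Pass 1 (stand-in for the first kernel): one count per subject edge.
    counts = [
        sum(seg_intersection(a, b, c, d) is not None for c, d in c_edges)
        for a, b in s_edges
    ]
    # Host step (assumed): exclusive prefix sum gives each edge its write offset,
    # and the total tells us how much output storage to allocate.
    offsets = [0] + list(accumulate(counts))
    out = [None] * offsets[-1]
    # Pass 2 (stand-in for the second kernel): each edge fills its own slice.
    for i, (a, b) in enumerate(s_edges):
        k = offsets[i]
        for c, d in c_edges:
            pt = seg_intersection(a, b, c, d)
            if pt is not None:
                out[k] = pt
                k += 1
    return counts, out
```

Because every subject edge writes only into its own slice of the output, the second pass needs no locking, which is what makes the two-pass layout attractive on a GPU.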

26 pages, 1149 KB

Open Access Article

A Massively Parallel SMC Sampler for Decision Trees

by Efthyvoulos Drousiotis, Alessandro Varsi, Alexander M. Phillips, Simon Maskell and Paul G. Spirakis

Algorithms 2025, 18(1), 14; https://doi.org/10.3390/a18010014 - 2 Jan 2025

Cited by 3 | Viewed by 2104

Abstract

Bayesian approaches to decision trees (DTs) using Markov Chain Monte Carlo (MCMC) samplers have recently demonstrated state-of-the-art accuracy performance when it comes to training DTs to solve classification problems. Despite the competitive classification accuracy, MCMC requires a potentially long runtime to converge. A widely used approach to reducing an algorithm’s runtime is to employ modern multi-core computer architectures, either with shared memory (SM) or distributed memory (DM), and use parallel computing to accelerate the algorithm. However, the inherent sequential nature of MCMC makes it unsuitable for parallel implementation unless the accuracy is sacrificed. This issue is particularly evident in DM architectures, which normally provide access to larger numbers of cores than SM. Sequential Monte Carlo (SMC) samplers are a parallel alternative to MCMC, which do not trade off accuracy for parallelism. However, the performance of SMC samplers in the context of DTs is underexplored, and the parallelization is complicated by the challenges in parallelizing its bottleneck, namely redistribution, especially on variable-size data types such as DTs. In this work, we study the problem of parallelizing SMC in the context of DTs both on SM and DM. On both memory architectures, we show that the proposed parallelization strategies achieve asymptotically optimal $O(\log_2 N)$ time complexity. Numerical results are presented for a 32-core SM machine and a 256-core DM cluster. For both computer architectures, the experimental results show that our approach has comparable or better accuracy than MCMC but runs up to 51 times faster on SM and 640 times faster on DM. In this paper, we share the GitHub link to the source code. Full article

2024

26 pages, 3378 KB

Open Access Article

Parallel PSO for Efficient Neural Network Training Using GPGPU and Apache Spark in Edge Computing Sets

by Manuel I. Capel, Alberto Salguero-Hidalgo and Juan A. Holgado-Terriza

Algorithms 2024, 17(9), 378; https://doi.org/10.3390/a17090378 - 26 Aug 2024

Cited by 6 | Viewed by 3314

Abstract

The training phase of a deep learning neural network (DLNN) is a computationally demanding process, particularly for models comprising multiple layers of intermediate neurons. This paper presents a novel approach to accelerating DLNN training using the particle swarm optimisation (PSO) algorithm, which exploits the GPGPU architecture and the Apache Spark analytics engine for large-scale data processing tasks. PSO is a bio-inspired stochastic optimisation method whose objective is to iteratively enhance the solution to a (usually complex) problem by approximating a given objective. The expensive fitness evaluation and updating of particle positions can be supported more effectively by parallel processing. Nevertheless, the parallelisation of an efficient PSO is not a simple process due to the complexity of the computations performed on the swarm of particles and the iterative execution of the algorithm until a solution close to the objective with minimal error is achieved. In this study, two forms of parallelisation have been developed for the PSO algorithm, both of which are designed for execution in a distributed execution environment. The synchronous parallel PSO implementation guarantees consistency but may result in idle time due to global synchronisation. In contrast, the asynchronous parallel PSO approach reduces the necessity for global synchronisation, thereby enhancing execution time and making it more appropriate for large datasets and distributed environments such as Apache Spark. The two variants of PSO have been implemented with the objective of distributing the computational load supported by the algorithm across the different executor nodes of the Spark cluster to effectively achieve coarse-grained parallelism. The result is a significant performance improvement over current sequential variants of PSO. Full article
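To make the synchronous variant concrete, here is a minimal single-process sketch of PSO in Python. It is not the paper's Spark or GPGPU implementation: the function name, parameter values, and search bounds are assumptions chosen for illustration. The global best is refreshed only once per iteration, after every particle has been evaluated, which is the point where a distributed version would have to synchronise.

```python
import random

def pso_sync(fitness, dim, n_particles=20, iters=100, seed=0,
             w=0.729, c1=1.49445, c2=1.49445, lo=-5.0, hi=5.0):
    """Minimise `fitness` with a basic synchronous PSO (illustrative parameters)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        # Update phase: each particle only reads the shared gbest, so this
        # loop is the part that could be spread across workers.
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
        # Synchronisation point: the global best changes once per iteration.
        g = min(range(n_particles), key=pbest_val.__getitem__)
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g][:], pbest_val[g]
    return gbest, gbest_val
```

An asynchronous variant would, under the same assumptions, update the shared best as soon as any particle improves on it instead of waiting for the end of the iteration.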

23 pages, 5573 KB

Open Access Article

Research on Distributed Fault Diagnosis Model of Elevator Based on PCA-LSTM

by Chengming Chen, Xuejun Ren and Guoqing Cheng

Algorithms 2024, 17(6), 250; https://doi.org/10.3390/a17060250 - 7 Jun 2024

Cited by 8 | Viewed by 2818

Abstract

A Distributed Elevator Fault Diagnosis System (DEFDS) is developed to tackle frequent malfunctions stemming from the widespread distribution and aging of elevator systems. Due to the complexity of elevator fault data and the subtlety of fault characteristics, traditional methods such as visual inspections and basic operational tests fall short in detecting early signs of mechanical wear and electrical issues. These conventional techniques often fail to recognize subtle fault characteristics, necessitating more advanced diagnostic tools. In response, this paper introduces a Principal Component Analysis–Long Short-Term Memory (PCA-LSTM) method for fault diagnosis. The distributed system decentralizes the fault diagnosis process to individual elevator units, utilizing PCA’s feature selection capabilities in high-dimensional spaces to extract and reduce the dimensionality of fault features. Subsequently, the LSTM model is employed for fault prediction. Elevator models within the system exchange data to refine and optimize a global prediction model. The efficacy of this approach is substantiated through empirical validation with actual data, achieving an accuracy rate of 90% and thereby confirming the method’s effectiveness in facilitating distributed elevator fault diagnosis. Full article

19 pages, 773 KB

Open Access Article

Distributed Data-Driven Learning-Based Optimal Dynamic Resource Allocation for Multi-RIS-Assisted Multi-User Ad-Hoc Network

by Yuzhu Zhang and Hao Xu

Algorithms 2024, 17(1), 45; https://doi.org/10.3390/a17010045 - 19 Jan 2024

Cited by 13 | Viewed by 5038

Abstract

This study investigates the problem of decentralized dynamic resource allocation optimization for ad-hoc network communication with the support of reconfigurable intelligent surfaces (RIS), leveraging a reinforcement learning framework. In the present context of cellular networks, device-to-device (D2D) communication stands out as a promising technique to enhance the spectrum efficiency. Simultaneously, RIS have gained considerable attention due to their ability to enhance the quality of dynamic wireless networks by maximizing the spectrum efficiency without increasing the power consumption. However, prevalent centralized D2D transmission schemes require global information, leading to a significant signaling overhead. Conversely, existing distributed schemes, while avoiding the need for global information, often demand frequent information exchange among D2D users, falling short of achieving global optimization. This paper introduces a framework comprising an outer loop and inner loop. In the outer loop, decentralized dynamic resource allocation optimization has been developed for self-organizing network communication aided by RIS. This is accomplished through the application of a multi-player multi-armed bandit approach, completing strategies for RIS and resource block selection. Notably, these strategies operate without requiring signal interaction during execution. Meanwhile, in the inner loop, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm has been adopted for cooperative learning with neural networks (NNs) to obtain optimal transmit power control and RIS phase shift control for multiple users, with a specified RIS and resource block selection policy from the outer loop. Through the utilization of optimization theory, distributed optimal resource allocation can be attained as the outer and inner reinforcement learning algorithms converge over time. Finally, a series of numerical simulations are presented to validate and illustrate the effectiveness of the proposed scheme. Full article

18 pages, 492 KB

Open Access Article

GPU Algorithms for Structured Sparse Matrix Multiplication with Diagonal Storage Schemes

by Sardar Anisul Haque, Mohammad Tanvir Parvez and Shahadat Hossain

Algorithms 2024, 17(1), 31; https://doi.org/10.3390/a17010031 - 12 Jan 2024

Cited by 4 | Viewed by 5785

Abstract

Matrix–matrix multiplication is of singular importance in linear algebra operations with a multitude of applications in scientific and engineering computing. Data structures for storing matrix elements are designed to minimize overhead information as well as to optimize the operation count. In this study, we utilize the notion of the compact diagonal storage method (CDM), which builds upon the previously developed diagonal storage—an orientation-independent uniform scheme to store the nonzero elements of a range of matrices. This study exploits both these storage schemes and presents efficient GPU-accelerated parallel implementations of matrix multiplication when the input matrices are banded and/or structured sparse. We exploit the data layouts in the diagonal storage schemes to expose a substantial amount of fine-grained parallelism and effectively utilize the GPU shared memory to improve the locality of data access for numerical calculations. Results from an extensive set of numerical experiments with the aforementioned types of matrices demonstrate orders-of-magnitude speedups compared with the sequential performance. Full article
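The reason diagonal storage exposes parallelism in banded multiplication can be seen in a short Python sketch. It uses a simple padded per-diagonal layout chosen for illustration, not the compact diagonal storage method (CDM) from the paper, and all function names are hypothetical. The key fact is that the diagonal with offset $d_A$ of $A$ multiplied by the diagonal with offset $d_B$ of $B$ contributes only to the diagonal with offset $d_A + d_B$ of $C$.

```python
def to_diagonals(a):
    """Dense square matrix -> {offset: length-n list indexed by row}, offset = j - i."""
    n = len(a)
    diags = {}
    for i in range(n):
        for j in range(n):
            if a[i][j] != 0:
                diags.setdefault(j - i, [0] * n)[i] = a[i][j]
    return diags

def diag_matmul(da, db, n):
    """Multiply two n x n matrices given in the padded diagonal layout above."""
    dc = {}
    for off_a, va in da.items():
        for off_b, vb in db.items():
            off_c = off_a + off_b
            if abs(off_c) >= n:
                continue  # the product diagonal falls outside the matrix
            out = dc.setdefault(off_c, [0] * n)
            # Each row index i is independent: this loop is the fine-grained
            # work that could be mapped onto GPU threads.
            for i in range(n):
                k = i + off_a   # column of A's entry, row of B's entry
                j = k + off_b   # column of the result entry
                if 0 <= k < n and 0 <= j < n:
                    out[i] += va[i] * vb[k]
    return dc

def to_dense(diags, n):
    """Convert the diagonal layout back to a dense list of lists."""
    c = [[0] * n for _ in range(n)]
    for off, v in diags.items():
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                c[i][j] = v[i]
    return c
```

For a tridiagonal input, only three diagonals per operand are visited, so the work scales with the number of stored diagonals rather than with $n^2$ entries per row sweep.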

2023

27 pages, 2365 KB

Open Access Article

Finding Bottlenecks in Message Passing Interface Programs by Scalable Critical Path Analysis

by Vladimir Korkhov, Ivan Gankevich, Anton Gavrikov, Maria Mingazova, Ivan Petriakov, Dmitrii Tereshchenko, Artem Shatalin and Vitaly Slobodskoy

Algorithms 2023, 16(11), 505; https://doi.org/10.3390/a16110505 - 31 Oct 2023

Viewed by 3366

Abstract

Bottlenecks and imbalance in parallel programs can significantly affect performance of parallel execution. Finding these bottlenecks is a key issue in performance analysis of MPI programs especially on a large scale. One of the ways to discover bottlenecks is to analyze the critical path of the parallel program: the longest execution path in the program activity graph. There are a number of methods of finding the critical path; however, most of them suffer a performance drop when scaled. In this paper, we analyze several methods of critical path finding based on classical Dijkstra and Delta-stepping algorithms along with the proposed algorithm based on topological sorting. Corresponding algorithms for each approach are presented including additional enhancements for increasing performance. The implementation of the algorithms and resulting performance for several benchmark applications (NAS Parallel Benchmarks, CP2K, OpenFOAM, LAMMPS, and MiniFE) are analyzed and discussed. Full article
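The idea of finding a critical path by topological sorting can be sketched in a few lines of Python. This is a generic longest-path computation on a weighted directed acyclic graph using Kahn's algorithm, not the authors' implementation; the function name and the edge-list input format are assumptions, and edge weights are assumed to be non-negative durations.

```python
from collections import deque

def critical_path(num_nodes, edges):
    """Longest path in a DAG. `edges` is a list of (u, v, weight) tuples."""
    adj = [[] for _ in range(num_nodes)]
    indeg = [0] * num_nodes
    for u, v, w in edges:
        adj[u].append((v, w))
        indeg[v] += 1
    dist = [0] * num_nodes      # longest distance found so far to each node
    pred = [None] * num_nodes   # predecessor on that longest path
    queue = deque(n for n in range(num_nodes) if indeg[n] == 0)
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v, w in adj[u]:
            # Relax in topological order: dist[u] is final when u is dequeued.
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != num_nodes:
        raise ValueError("graph contains a cycle")
    end = max(range(num_nodes), key=dist.__getitem__)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = pred[node]
    return dist[end], path[::-1]
```

Each edge is relaxed exactly once, so the running time is linear in the number of nodes and edges, with no priority queue involved.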

19 pages, 1143 KB

Open Access Article

Algorithm for Enhancing Event Reconstruction Efficiency by Addressing False Track Filtering Issues in the SPD NICA Experiment

by Gulshat Amirkhanova, Madina Mansurova, Gennadii Ososkov, Nasurlla Burtebayev, Adai Shomanov and Murat Kunelbayev

Algorithms 2023, 16(7), 312; https://doi.org/10.3390/a16070312 - 22 Jun 2023

Viewed by 2136

Abstract

This paper introduces methods for parallelizing the algorithm to enhance the efficiency of event recovery in Spin Physics Detector (SPD) experiments at the Nuclotron-based Ion Collider Facility (NICA). The problem of eliminating false tracks during the particle trajectory detection process remains a crucial challenge in overcoming performance bottlenecks in processing collider data generated in high volumes and at a fast pace. In this paper, we propose and show fast parallel false track elimination methods based on the introduced criterion of a clustering-based thresholding approach with a chi-squared quality-of-fit metric. The proposed strategy achieves a good trade-off between the effectiveness of track reconstruction and the pace of execution on today’s advanced multicore computers. To facilitate this, a quality benchmark for reconstruction is established, using the root mean square (rms) error of spiral and polynomial fitting for the datasets identified as the subsequent track candidate by the neural network. Choosing the right benchmark enables us to maintain the recall and precision indicators of the neural network track recognition performance at a level that is satisfactory to physicists, even though these metrics will inevitably decline as the data noise increases. Moreover, it has been possible to improve the processing speed of the complete program pipeline by 6 times through parallelization of the algorithm, achieving a rate of 2000 events per second, even when handling extremely noisy input data. Full article

30 pages, 19621 KB

Open Access Article

Probability Density Estimation through Nonparametric Adaptive Partitioning and Stitching

by Zach D. Merino, Jenny Farmer and Donald J. Jacobs

Algorithms 2023, 16(7), 310; https://doi.org/10.3390/a16070310 - 21 Jun 2023

Cited by 2 | Viewed by 3093

Abstract

We present a novel nonparametric adaptive partitioning and stitching (NAPS) algorithm to estimate a probability density function (PDF) of a single variable. Sampled data is partitioned into blocks using a branching tree algorithm that minimizes deviations from a uniform density within blocks of various sample sizes arranged in a staggered format. The block sizes are constructed to balance the load in parallel computing as the PDF for each block is independently estimated using the nonparametric maximum entropy method (NMEM) previously developed for automated high throughput analysis. Once all block PDFs are calculated, they are stitched together to provide a smooth estimate throughout the sample range. Each stitch is an averaging process over weight factors based on the estimated cumulative distribution function (CDF) and a complementary CDF that characterize how data from flanking blocks overlap. Benchmarks on synthetic data show that our PDF estimates are fast and accurate for sample sizes ranging from $2^{9}$ to $2^{27}$, across a diverse set of distributions that account for single and multi-modal distributions with heavy tails or singularities. We also generate estimates by replacing NMEM with kernel density estimation (KDE) within blocks. Our results indicate that NAPS(NMEM) is the best-performing method overall, while NAPS(KDE) improves estimates near boundaries compared to standard KDE. Full article

10 pages, 327 KB

Open Access Article

A Multithreaded Algorithm for the Computation of Sample Entropy

by George Manis, Dimitrios Bakalis and Roberto Sassi

Algorithms 2023, 16(6), 299; https://doi.org/10.3390/a16060299 - 15 Jun 2023

Cited by 4 | Viewed by 2683

Abstract

Many popular entropy definitions for signals, including approximate and sample entropy, are based on the idea of embedding the time series into an m-dimensional space, aiming to detect complex, deeper and more informative relationships among samples. However, for both approximate and sample entropy, the high computational cost is a severe limitation. Especially when large amounts of data are processed, or when parameter tuning requires a large number of executions, the necessity of fast computation algorithms becomes urgent. In the past, our research team proposed fast algorithms for sample, approximate and bubble entropy. In the general case, the bucket-assisted algorithm was the one presenting the lowest execution times. In this paper, we exploit the opportunities given by the multithreading technology to further reduce the computation time. Without special requirements in hardware, since today even our cost-effective home computers support multithreading, the computation of entropy definitions can be significantly accelerated. The aim of this paper is threefold: (a) to extend the bucket-assisted algorithm for multithreaded processors, (b) to present updated execution times for the bucket-assisted algorithm since the achievements in hardware and compiler technology affect both execution times and gain, and (c) to provide a Python library which wraps fast C implementations capable of running in parallel on multithreaded processors. Full article
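The cost that motivates this line of work is easy to see in a direct implementation of sample entropy. The sketch below is the straightforward quadratic-time definition in Python, not the bucket-assisted or multithreaded algorithm from the paper, and it does not use the authors' library. The function name is hypothetical, and treating a pair as a match when the Chebyshev distance is at most `r` is a convention assumed here.

```python
import math

def sample_entropy(x, m, r):
    """Direct O(N^2) sample entropy: -ln(A / B) for embedding dimension m, tolerance r."""
    n = len(x)

    def count_matches(length):
        # Count template pairs (i < j) whose Chebyshev distance is within r.
        # The same n - m starting points are used for both template lengths.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if max(abs(x[i + k] - x[j + k]) for k in range(length)) <= r:
                    count += 1
        return count

    b = count_matches(m)        # matches of length m
    a = count_matches(m + 1)    # matches of length m + 1
    if a == 0 or b == 0:
        return math.inf         # undefined in this case; reported as infinity here
    return -math.log(a / b)
```

The nested loops over all template pairs are the part that fast algorithms try to prune and that multithreading can split across cores, since each value of `i` can be processed independently.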

16 pages, 1679 KB

Open Access Article

Fully Parallel Homological Region Adjacency Graph via Frontier Recognition

by Fernando Díaz-del-Río, Pablo Sanchez-Cuevas, María José Moron-Fernández, Daniel Cascado-Caballero, Helena Molina-Abril and Pedro Real

Algorithms 2023, 16(6), 284; https://doi.org/10.3390/a16060284 - 31 May 2023

Cited by 1 | Viewed by 2712

Abstract

Relating image contours and regions and their attributes according to connectivity based on incidence or adjacency is a crucial task in numerous applications in the fields of image processing, computer vision and pattern recognition. In this paper, the crucial incidence topological information of 2-dimensional images is extracted in an efficient manner through the computation of a new structure called the HomDuRAG of an image; that is, the dual graph of the HomRAG (a topologically consistent extended version of the classical RAG). These representations are derived from the two traditional self-dual square grids (in which physical pixels play the role of 2-dimensional cells) and encapsulate the whole set of topological features and relations between the three types of objects embedded in a digital image: 2-dimensional (regions), 1-dimensional (contours) and 0-dimensional objects (crosses). Here, a first version of a fully parallel algorithm to compute this new representation is presented, whose timing complexity order (in the worst case and supposing one processing element per 0-cell) is $O(\log(M \times N))$, M and N being the height and width of the image. Efficient implementations of this parallel algorithm would allow images to be processed in real time, as well as permit us to uncover fast algorithms for contour detection and segmentation, opening new perspectives within the image processing field. Full article

27 pages, 958 KB

Open Access Article

Parallel Algorithm for Solving Overdetermined Systems of Linear Equations, Taking into Account Round-Off Errors

by Dmitry Lukyanenko

Algorithms 2023, 16(5), 242; https://doi.org/10.3390/a16050242 - 7 May 2023

Cited by 6 | Viewed by 3983

Abstract

The paper proposes a parallel algorithm for solving large overdetermined systems of linear algebraic equations with a dense matrix. This algorithm is based on the use of a modification of the conjugate gradient method, which is able to take into account rounding errors accumulated during calculations when making a decision to terminate the iterative process. The parallel algorithm is constructed in such a way that it takes into account the capabilities of the message passing interface (MPI) parallel programming technology, which is used for the software implementation of the proposed algorithm. The programming examples are shown using the Python programming language and the mpi4py package, but all programs are built in such a way that they can be easily rewritten using the C/C++/Fortran programming languages. The advantage of using the modern MPI-4.0 standard is demonstrated. Full article
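As background for the method family mentioned in this abstract, the following Python sketch applies the conjugate gradient method to the normal equations of an overdetermined system (often called CGLS or CGNR). It is a plain serial version with a fixed gradient-norm tolerance as stopping rule. It does not reproduce the paper's round-off-aware termination criterion or its MPI parallelisation, and all names and default values are assumptions.

```python
def matvec(a, x):
    """Compute A x for a dense matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def rmatvec(a, y):
    """Compute A^T y without forming the transpose."""
    return [sum(a[i][j] * y[i] for i in range(len(a))) for j in range(len(a[0]))]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cgls(a, b, tol=1e-12, max_iter=None):
    """Least-squares solution of A x = b via conjugate gradients on A^T A x = A^T b."""
    n = len(a[0])
    x = [0.0] * n
    r = list(b)              # residual b - A x (x starts at zero)
    s = rmatvec(a, r)        # gradient direction A^T r
    p = s[:]
    gamma = dot(s, s)
    for _ in range(max_iter or 10 * n):
        if gamma <= tol:     # simple stopping rule assumed for this sketch
            break
        q = matvec(a, p)
        alpha = gamma / dot(q, q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(a, r)
        gamma_new = dot(s, s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x
```

The two matrix-vector products per iteration dominate the cost for a dense matrix, so they are the natural candidates for distribution across processes in a parallel version.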

19 pages, 1208 KB

Open Access Article

Asynchronous Gathering in a Dangerous Ring

by Stefan Dobrev, Paola Flocchini, Giuseppe Prencipe and Nicola Santoro

Algorithms 2023, 16(5), 222; https://doi.org/10.3390/a16050222 - 26 Apr 2023

Cited by 5 | Viewed by 2092

Abstract

Consider a set of k identical asynchronous mobile agents located in an anonymous ring of n nodes. The classical Gather (or Rendezvous) problem requires all agents to meet at the same node, not a priori decided, within a finite amount of time. This problem has been studied assuming that the network is safe for the agents. In this paper, we consider the presence in the ring of a stationary process located at a node that disables any incoming agent without leaving any trace. Such a dangerous node is known in the literature as a black hole, and the determination of its location has been extensively investigated. The presence of the black hole makes it deterministically unfeasible for all agents to gather. So, the research concern is to determine how many agents can gather and under what conditions. In this paper we establish a complete characterization of the conditions under which the problem can be solved. In particular, we determine the maximum number of agents that can be guaranteed to gather in the same location depending on whether k or n is unknown (at least one must be known). These results are tight: in each case, gathering with one more agent is deterministically unfeasible. All our possibility proofs are constructive: we provide mobile agent algorithms that allow the agents to gather within a predefined distance under the specified conditions. The analysis of the time costs of these algorithms shows that they are optimal. Our gathering algorithm for the case of unknown k is also a solution for the black hole location problem. Interestingly, its bounded time complexity is $\Theta(n)$; this is a significant improvement over the existing $O(n \log n)$ bounded time complexity. Full article

2022

21 pages, 653 KB

Open AccessArticle

A Methodology to Design Quantized Deep Neural Networks for Automatic Modulation Recognition

by David Góez, Paola Soto, Steven Latré, Natalia Gaviria and Miguel Camelo

Algorithms 2022, 15(12), 441; https://doi.org/10.3390/a15120441 - 22 Nov 2022

Cited by 13 | Viewed by 3810

Abstract

Next-generation communication systems will face new challenges related to efficiently managing the available resources, such as the radio spectrum. Deep learning (DL) is one of the optimization approaches to address and solve these challenges. However, there is a gap between research and industry. Most AI models that solve communication problems cannot be implemented in current communication devices due to their high computational capacity requirements. New approaches seek to reduce the size of DL models through quantization techniques, changing the traditional representation used for operations from 32-bit (or 64-bit) floating point to a (usually small) fixed-point one. However, there is no analytical method to determine the level of quantization that gives the best trade-off between the reduction of computational costs and an acceptable accuracy in a specific problem. In this work, we propose an analysis methodology to determine the degree of quantization in a deep neural network (DNN) model to solve the problem of automatic modulation recognition (AMR) in a radio system. We use the Brevitas framework to build and analyze different quantized variants of the DL architecture VGG10 adapted to the AMR problem. The computational inference cost is evaluated with the FINN framework of Xilinx Research Labs. The proposed design methodology allows us to obtain the combination of quantization bits per layer that provides an optimal trade-off between the model performance (i.e., accuracy) and the model complexity (i.e., size) according to a set of weights associated with each optimization objective. For example, using the proposed methodology, we found a model architecture that reduced the model size by 75.8% compared to the non-quantized baseline model, with a performance degradation of only 0.06%.
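
To make the size side of this trade-off concrete, the following is a minimal sketch of symmetric uniform fixed-point quantization and of the size reduction obtained from a per-layer bit-width assignment. It is an illustration only: the function names, the symmetric scheme, and the 32-bit baseline are assumptions, and it does not use or reproduce the Brevitas or FINN APIs or the authors' methodology.

```python
def quantize(values, bits, max_abs=1.0):
    """Symmetric uniform quantization onto a signed fixed-point grid.

    Hypothetical helper for illustration; real frameworks add learned
    scales, rounding modes, and per-channel handling.
    """
    levels = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits
    scale = max_abs / levels               # width of one quantization step
    out = []
    for v in values:
        q = round(v / scale)               # nearest grid index
        q = max(-levels, min(levels, q))   # clamp to the representable range
        out.append(q * scale)
    return out


def model_size_bits(layer_params, layer_bits):
    """Weight storage in bits for a given per-layer bit-width assignment."""
    return sum(n * b for n, b in zip(layer_params, layer_bits))


def size_reduction(layer_params, layer_bits, baseline_bits=32):
    """Fraction of weight storage saved relative to a uniform baseline."""
    baseline = model_size_bits(layer_params, [baseline_bits] * len(layer_params))
    return 1.0 - model_size_bits(layer_params, layer_bits) / baseline
```

With two layers of 100 parameters each quantized to 8 bits, `size_reduction([100, 100], [8, 8])` gives 0.75, i.e., a 75% reduction against a 32-bit baseline. A search over `layer_bits` would weigh this figure against measured accuracy.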

28 pages, 699 KB

Open AccessArticle

Recent Developments in Low-Power AI Accelerators: A Survey

by Christoffer Åleskog, Håkan Grahn and Anton Borg

Algorithms 2022, 15(11), 419; https://doi.org/10.3390/a15110419 - 8 Nov 2022

Cited by 20 | Viewed by 13810

Abstract

As machine learning and AI continue to rapidly develop, and with the ever-closer end of Moore’s law, new avenues and novel ideas in architecture design are being created and utilized. One avenue is accelerating AI as close to the user as possible, i.e., at the edge, to reduce latency and increase performance. Therefore, researchers have developed low-power AI accelerators, designed specifically to accelerate machine learning and AI at edge devices. In this paper, we present an overview of low-power AI accelerators published from 2019 to 2022. Low-power AI accelerators are defined in this paper based on their acceleration target and power consumption. In this survey, 79 low-power AI accelerators are presented and discussed. The reviewed accelerators are discussed based on five criteria: (i) power, performance, and power efficiency, (ii) acceleration targets, (iii) arithmetic precision, (iv) neuromorphic accelerators, and (v) industry vs. academic accelerators. CNNs and DNNs are the most popular accelerator targets, while Transformers and SNNs are on the rise.

25 pages, 751 KB

Open AccessArticle

Modeling Different Deployment Variants of a Composite Application in a Single Declarative Deployment Model

by Miles Stötzner, Steffen Becker, Uwe Breitenbücher, Kálmán Képes and Frank Leymann

Algorithms 2022, 15(10), 382; https://doi.org/10.3390/a15100382 - 19 Oct 2022

Cited by 7 | Viewed by 3651

Abstract

For automating the deployment of composite applications, typically, declarative deployment models are used. Depending on the context, the deployment of an application has to fulfill different requirements, such as costs and elasticity. As a consequence, one and the same application, i.e., its components and their dependencies, often needs to be deployed in different variants. If each variant of a deployment is described using an individual deployment model, the result is quickly a large number of models, which are error-prone to maintain. Deployment technologies, such as Terraform or Ansible, support conditional components and dependencies, which allow modeling different deployment variants of a composite application in a single deployment model. However, there are deployment technologies, such as TOSCA and Docker Compose, which do not support such conditional elements. To address this, we extend the Essential Deployment Metamodel (EDMM) by conditional components and dependencies. EDMM is a declarative deployment model which can be mapped to several deployment technologies including Terraform, Ansible, TOSCA, and Docker Compose. Preprocessing such an extended model, i.e., evaluating its conditional elements and either preserving or removing them, generates an EDMM-conformant model. As a result, conditional elements can be integrated on top of existing deployment technologies that are unaware of such concepts. We evaluate this by implementing a preprocessor for TOSCA, called OpenTOSCA Vintner, which employs the open-source TOSCA orchestrators xOpera and Unfurl to execute the generated TOSCA-conformant models.
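
The preprocessing step described above can be sketched as follows. This is a hedged illustration under stated assumptions: the dictionary layout, the `condition` field holding a Python callable, and the function name `resolve_variability` are all hypothetical and are neither EDMM or TOSCA syntax nor the OpenTOSCA Vintner API.

```python
def resolve_variability(model, inputs):
    """Evaluate conditional elements and return a condition-free model.

    Assumed (hypothetical) layout: model["components"] maps names to dicts,
    model["relations"] is a list of dicts with "source" and "target";
    any element may carry a "condition" callable taking the inputs.
    """
    def present(element):
        condition = element.get("condition")
        # Elements without a condition are always part of the deployment.
        return True if condition is None else bool(condition(inputs))

    components = {
        name: {k: v for k, v in comp.items() if k != "condition"}
        for name, comp in model["components"].items()
        if present(comp)
    }
    relations = [
        {k: v for k, v in rel.items() if k != "condition"}
        for rel in model["relations"]
        # Drop relations whose own condition fails or whose endpoint was removed.
        if present(rel) and rel["source"] in components and rel["target"] in components
    ]
    return {"components": components, "relations": relations}
```

The output contains no conditional fields, so it could in principle be handed to a tool that knows nothing about variability, which is the point the abstract makes about layering conditions on top of existing technologies.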

30 pages, 1418 KB

Open AccessArticle

A Dynamic Distributed Deterministic Load-Balancer for Decentralized Hierarchical Infrastructures

by Spyros Sioutas, Efrosini Sourla, Kostas Tsichlas, Gerasimos Vonitsanos and Christos Zaroliagis

Algorithms 2022, 15(3), 96; https://doi.org/10.3390/a15030096 - 18 Mar 2022

Cited by 2 | Viewed by 3199

Abstract

In this work, we propose $D^3$-Tree, a dynamic distributed deterministic structure for data management in decentralized networks, by engineering and extending an existing decentralized structure. Conducting an extensive experimental study, we verify that the implemented structure outperforms other well-known hierarchical tree-based structures since it provides better complexities regarding load-balancing operations. More specifically, the structure achieves an $O(\log N)$ amortized bound (N is the number of nodes present in the network), using an efficient deterministic load-balancing mechanism, which is general enough to be applied to other hierarchical tree-based structures. Moreover, our structure achieves $O(\log N)$ worst-case search performance. Last but not least, we investigate the structure’s fault tolerance, which has not been sufficiently tackled in previous work, both theoretically and through rigorous experimentation. We prove that $D^3$-Tree is highly fault-tolerant and achieves $O(\log N)$ amortized search cost under massive node failures, accompanied by a significant success rate. Afterwards, by incorporating this novel balancing scheme into the ART (Autonomous Range Tree) structure, we go one step further to achieve sub-logarithmic complexity and propose the ART$^+$ structure. ART$^+$ achieves an $O(\log_b^2 \log N)$ communication cost for query and update operations (b is a double-exponential power of 2 and N is the total number of nodes). Moreover, ART$^+$ is a fully dynamic and fault-tolerant structure, which supports the join/leave node operations in an $O(\log \log N)$ expected number of hops WHP (with high probability) and performs load-balancing in $O(\log \log N)$ amortized cost.

17 pages, 432 KB

Open AccessArticle

Tries-Based Parallel Solutions for Generating Perfect Crosswords Grids

by Virginia Niculescu and Robert Manuel Ştefănică

Algorithms 2022, 15(1), 22; https://doi.org/10.3390/a15010022 - 13 Jan 2022

Cited by 3 | Viewed by 5351

Abstract

General crossword grid generation is considered an NP-complete problem, and theoretically it could be a good candidate for use by cryptography algorithms. In this article, we propose a new algorithm for generating perfect crossword grids (with no black boxes) that relies on trie data structures, which are very important for reducing the time needed to find solutions and also offer a good opportunity for parallelisation. The algorithm uses a special trie representation and is very efficient, but through parallelisation the performance is improved to a level that allows the solution to be obtained extremely fast. The experiments were conducted using a dictionary of almost 700,000 words, and the solutions were obtained using the parallelised version with an execution time in the order of minutes. We demonstrate here that a perfect crossword grid can be found faster than has been estimated before if we use tries as supporting data structures together with parallelisation. Still, if the size of the dictionary is increased substantially (e.g., by considering a set of dictionaries for several languages rather than one), or through a generalisation to a 3D space or multidimensional spaces, then the problem could still be investigated for possible usage in cryptography.
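
The role of the trie can be illustrated with a small sequential sketch: rows are placed one at a time, and a branch is abandoned as soon as some partial column is no longer a prefix of any dictionary word. This is a plain nested-dictionary trie with simple backtracking, written under the assumption of a square grid; it is not the special trie representation or the parallel algorithm of the article, and the names `Trie` and `fill_grid` are illustrative.

```python
class Trie:
    """Minimal prefix tree stored as nested dictionaries."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker

    def has_prefix(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node:
                return False
            node = node[ch]
        return True


def fill_grid(words, n):
    """Return n rows of an n x n grid whose rows and columns are all words.

    Returns None when no such grid exists for the given dictionary.
    """
    words = sorted(w for w in set(words) if len(w) == n)
    trie = Trie(words)

    def extend(rows):
        if len(rows) == n:
            return rows
        for word in words:
            candidate = rows + [word]
            # Prune: every partial column must still be a prefix of some word.
            if all(trie.has_prefix("".join(r[c] for r in candidate))
                   for c in range(n)):
                result = extend(candidate)
                if result:
                    return result
        return None

    return extend([])
```

Because only words of length n are stored, a column prefix of length n that passes `has_prefix` is itself a word, so no separate final check is needed. The independent subtrees explored from different first rows are a natural unit for parallel work, which is one plausible reading of the parallelisation opportunity mentioned above.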

2021

13 pages, 1466 KB

Open AccessArticle

Parallel Computing of Edwards—Anderson Model

by Mikhail Alexandrovich Padalko, Yuriy Andreevich Shevchenko, Vitalii Yurievich Kapitan and Konstantin Valentinovich Nefedev

Algorithms 2022, 15(1), 13; https://doi.org/10.3390/a15010013 - 27 Dec 2021

Cited by 7 | Viewed by 4154

Abstract

A scheme for parallel computation of the two-dimensional Edwards–Anderson model based on the transfer matrix approach is proposed. Free boundary conditions are considered. The method may find application in calculations related to spin glasses and in quantum simulators. Performance data are given. The parallelisation scheme is tested for various numbers of threads. Application to a quantum computer simulator is considered in detail, in particular a parallelisation scheme for the operation of the quantum computer simulator.

21 pages, 811 KB

Open AccessArticle

An O(log₂N) Fully-Balanced Resampling Algorithm for Particle Filters on Distributed Memory Architectures

by Alessandro Varsi, Simon Maskell and Paul G. Spirakis

Algorithms 2021, 14(12), 342; https://doi.org/10.3390/a14120342 - 26 Nov 2021

Cited by 15 | Viewed by 5651

Abstract

Resampling is a well-known statistical algorithm that is commonly applied in the context of Particle Filters (PFs) in order to perform state estimation for non-linear non-Gaussian dynamic models. As the models become more complex and accurate, the run-time of PF applications becomes increasingly long. Parallel computing can help to address this. However, resampling (and, hence, PFs as well) necessarily involves a bottleneck, the redistribution step, which is notoriously challenging to parallelize if using textbook parallel computing techniques. A state-of-the-art redistribution takes $O((\log_2 N)^2)$ computations on Distributed Memory (DM) architectures, which most supercomputers adopt, whereas redistribution can be performed in $O(\log_2 N)$ on Shared Memory (SM) architectures, such as GPUs or mainstream CPUs. In this paper, we propose a novel parallel redistribution for DM that achieves an $O(\log_2 N)$ time complexity. We also present empirical results that indicate that our novel approach outperforms the $O((\log_2 N)^2)$ approach.
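
For readers unfamiliar with the redistribution step, the following sequential sketch shows what it computes: resampling first decides how many copies of each particle to keep, and redistribution then materialises those copies. Systematic resampling is used here as one common choice and is an assumption, as are the function names; the sketch runs in O(N) on a single core and does not reproduce the parallel $O(\log_2 N)$ algorithm of the article.

```python
import math


def systematic_copies(weights, u):
    """Copies per particle under systematic resampling.

    weights: normalised weights (summing to 1); u: one uniform draw in [0, 1).
    Particle i receives one copy per threshold (u + k) / n, k = 0..n-1,
    that falls inside its slice of the cumulative weight.
    """
    n = len(weights)
    copies, cumulative, taken = [], 0.0, 0
    for i, w in enumerate(weights):
        cumulative += w
        if i == n - 1:
            upto = n  # guard: rounding must not lose the last threshold
        else:
            upto = min(n, max(taken, math.ceil(n * cumulative - u)))
        copies.append(upto - taken)
        taken = upto
    return copies


def redistribute(particles, copies):
    """Sequential redistribution: repeat each particle copies[i] times."""
    out = []
    for particle, count in zip(particles, copies):
        out.extend([particle] * count)
    return out
```

The difficulty on distributed memory comes from `redistribute`: the copy counts are unevenly spread, so particles must move between nodes until every node again holds the same number, and it is that data movement whose cost the complexities above describe.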

21 pages, 4019 KB

Open AccessArticle

Parallel Implementation of the Algorithm to Compute Forest Fire Impact on Infrastructure Facilities of JSC Russian Railways

by Nikolay Viktorovich Baranovskiy, Aleksey Podorovskiy and Aleksey Malinin

Algorithms 2021, 14(11), 333; https://doi.org/10.3390/a14110333 - 15 Nov 2021

Cited by 6 | Viewed by 3423

Abstract

Forest fires have a negative impact on the economy in a number of regions, especially in Wildland Urban Interface (WUI) areas. An important link in the fight against fires in WUI areas is the development of information and computer systems for predicting the fire safety of infrastructural facilities of Russian Railways. In this work, a numerical study of heat transfer processes in the enclosing structure of a wooden building near the forest fire front was carried out using the technology of parallel computing. The novelty of the development is explained by the creation of its own program code, which is planned to be put into operation either in the Information System for Remote Monitoring of Forest Fires ISDM-Rosleskhoz, or in the information and computing system of JSC Russian Railways. In the Russian Federation, it is forbidden to use foreign systems in the security services of industrial facilities. The implementation of the deterministic model of heat transfer in the enclosing structure with the complexity of the algorithm $O(2N^2 + 2K)$ is presented. The program is implemented in Python 3.x using the NumPy and Concurrent libraries. Calculations were carried out on a multiprocessor cluster at the Sirius University of Science and Technology. The results of calculations and the acceleration coefficient for operating modes for 1, 2, 4, 8, 16, 32, 48 and 64 processes are presented. The developed algorithm can be applied to assess the fire safety of infrastructure facilities of Russian Railways. The main merit of the new development should be noted, which is explained by the ability to use large computational domains with a large number of computational grid nodes in space and time. The use of caching intermediate data in files made it possible to distribute a large number of computational nodes among the processors of a computing multiprocessor system. However, one should also note a drawback; namely, a decrease in the acceleration of computational operations with a large number of involved nodes of a multiprocessor computing system, which is explained by the write and read cycles in cache files.

24 pages, 1026 KB

Open AccessArticle

Load Balancing Strategies for Slice-Based Parallel Versions of JEM Video Encoder

by Héctor Migallón, Otoniel López-Granado, Miguel O. Martínez-Rach, Vicente Galiano and Manuel P. Malumbres

Algorithms 2021, 14(11), 320; https://doi.org/10.3390/a14110320 - 1 Nov 2021

Cited by 1 | Viewed by 2798

Abstract

The proportion of video traffic on the internet is expected to reach 82% by 2022, mainly due to the increasing number of consumers and the emergence of new video formats with more demanding features (depth, resolution, multiview, 360, etc.). Efforts are therefore being made to constantly improve video compression standards to minimize the necessary bandwidth while retaining high video quality levels. In this context, the Joint Collaborative Team on Video Coding has been analyzing new video coding technologies to improve the compression efficiency with respect to the HEVC video coding standard. A software package known as the Joint Exploration Test Model has been proposed to implement and evaluate new video coding tools. In this work, we present parallel versions of the JEM encoder that are particularly suited for shared memory platforms, and can significantly reduce its huge computational complexity. The proposed parallel algorithms are shown to achieve high levels of parallel efficiency. In particular, in the All Intra coding mode, the best of our proposed parallel versions achieves an average efficiency value of 93.4%. They also had high levels of scalability, as shown by the inclusion of an automatic load balancing mechanism.

10 pages, 284 KB

Open AccessArticle

A Parallel Algorithm for Dividing Octonions

by Aleksandr Cariow and Janusz P. Paplinski

Algorithms 2021, 14(11), 309; https://doi.org/10.3390/a14110309 - 24 Oct 2021

Viewed by 2661

Abstract

The article presents a parallel hardware-oriented algorithm designed to speed up the division of two octonions. The advantage of the proposed algorithm is that the number of real multiplications is halved as compared to the naive method for implementing this operation. In the synthesis of the discussed algorithm, the matrix representation of this operation was used, which allows us to present the division of octonions by means of a vector–matrix product. Taking into account a specific structure of the matrix multiplicand allows for reducing the number of real multiplications necessary for the execution of the octonion division procedure.
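
As a point of comparison, the naive method that the article improves on can be sketched as follows: octonions are multiplied by the recursive Cayley–Dickson rule, and division is performed as multiplication by the conjugate followed by scaling with the squared norm. The sign convention $(a, b)(c, d) = (ac - \bar{d}b,\; da + b\bar{c})$, the right-division form $x y^{-1}$, and all function names are assumptions of this sketch; the reduced-multiplication algorithm of the article is not reproduced.

```python
def add(x, y):
    return [p + q for p, q in zip(x, y)]


def sub(x, y):
    return [p - q for p, q in zip(x, y)]


def conj(x):
    """Conjugate: keep the real part, negate every imaginary component."""
    if len(x) == 1:
        return list(x)
    h = len(x) // 2
    return conj(x[:h]) + [-v for v in x[h:]]


def mul(x, y):
    """Cayley-Dickson product; for length-8 inputs this is octonion
    multiplication and uses 4**3 = 64 real multiplications."""
    if len(x) == 1:
        return [x[0] * y[0]]
    h = len(x) // 2
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    left = sub(mul(a, c), mul(conj(d), b))
    right = add(mul(d, a), mul(b, conj(c)))
    return left + right


def norm_sq(x):
    return sum(v * v for v in x)


def divide(x, y):
    """Right division x * y^{-1}, with y^{-1} = conj(y) / |y|^2."""
    n = norm_sq(y)
    if n == 0:
        raise ZeroDivisionError("division by the zero octonion")
    return [v / n for v in mul(x, conj(y))]
```

Multiplying the quotient back by the divisor recovers the dividend up to rounding, for example `mul(divide(x, y), y)` for two length-8 lists `x` and `y`. The 64 multiplications inside `mul` are the cost that a structured vector–matrix formulation aims to cut.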

24 pages, 1776 KB

Open AccessArticle

Rough Estimator Based Asynchronous Distributed Super Points Detection on High Speed Network Edge

by Jie Xu and Wei Ding

Algorithms 2021, 14(10), 277; https://doi.org/10.3390/a14100277 - 25 Sep 2021

Viewed by 2596

Abstract

Super points detection plays an important role in network research and application. With the increase of network scale, distributed super points detection has become a hot research topic. The key point of super points detection in a multi-node distributed environment is how to reduce communication overhead. Therefore, this paper proposes a three-stage communication algorithm to detect super points in a distributed environment, called the Rough Estimator based Asynchronous Distributed super points detection algorithm (READ). READ uses a lightweight estimator, the Rough Estimator (RE), which is fast in computation and needs little memory, to generate candidate super points. Meanwhile, the well-known Linear Estimator (LE) is applied to accurately estimate the cardinality of each candidate super point, so as to detect the super point correctly. In READ, each node scans IP address pairs asynchronously. When reaching the time window boundary, READ starts three-stage communication to detect the super point. This paper proves that the accuracy of READ in a distributed environment is no less than that in the single-node environment. Four groups of 10 Gb/s and 40 Gb/s real-world high-speed network traffic are used to test READ. The experimental results show that READ not only has high accuracy in a distributed environment, but also incurs less than 5% of the communication burden of existing algorithms.
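
The cardinality estimation step can be illustrated with the classic linear-counting estimator, on the assumption that this is what the abstract's Linear Estimator refers to. Each contacted address sets one bit of an m-bit map, and the number of distinct addresses is estimated from the fraction of bits still zero as $\hat{n} = -m \ln(z/m)$. The bitmap size, the use of SHA-256 as the hash, and the function name are choices of this sketch, not details taken from the article.

```python
import hashlib
import math


def linear_estimate(items, m=1024):
    """Estimate the number of distinct items with an m-bit linear counter."""
    bits = [False] * m
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        # Map the item to one bit; duplicates always hit the same bit.
        bits[int.from_bytes(digest[:8], "big") % m] = True
    zeros = bits.count(False)
    if zeros == 0:
        return float("inf")  # bitmap saturated: m is too small for this load
    return -m * math.log(zeros / m)
```

A host would be reported as a super point when the estimate for its set of peers exceeds a chosen threshold. Bitmaps built with the same hash can be merged with a bitwise OR, which is one reason such estimators suit a setting where several nodes observe the same host.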

22 pages, 6397 KB

Open AccessArticle

Accelerating In-Transit Co-Processing for Scientific Simulations Using Region-Based Data-Driven Analysis

by Marcus Walldén, Masao Okita, Fumihiko Ino, Dimitris Drikakis and Ioannis Kokkinakis

Algorithms 2021, 14(5), 154; https://doi.org/10.3390/a14050154 - 12 May 2021

Viewed by 3115

Abstract

Increasing processing capabilities and input/output constraints of supercomputers have increased the use of co-processing approaches, i.e., visualizing and analyzing data sets of simulations on the fly. We present a method that evaluates the importance of different regions of simulation data and a data-driven approach that uses the proposed method to accelerate in-transit co-processing of large-scale simulations. We use the importance metrics to simultaneously employ multiple compression methods on different data regions to accelerate the in-transit co-processing. Our approach strives to adaptively compress data on the fly and uses load balancing to counteract memory imbalances. We demonstrate the method’s efficiency through a fluid mechanics application, a Richtmyer–Meshkov instability simulation, showing how to accelerate the in-transit co-processing of simulations. The results show that the proposed method can quickly identify regions of interest, even when using multiple metrics. Our approach achieved a speedup of $1.29\times$ in a lossless scenario. The data decompression time was sped up by $2\times$ compared to using a single compression method uniformly.

Parallel and Distributed Computing: Algorithms and Applications

Published Papers (28 papers)
