Applications of Parallel Computing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 February 2022) | Viewed by 27878

Special Issue Editor


Prof. Dr. Wojciech Bożejko
Guest Editor
Department of Control Systems and Mechatronics, Wrocław University of Science and Technology, 50-372 Wrocław, Poland
Interests: methods of solving NP-hard problems; discrete optimization problems; parallel computing; project management; automation in construction; permutation groups; free probability

Special Issue Information

Dear Colleagues,

Over the past few years, parallel computing has been widely used to solve computational problems, especially in optimization. It is an effective way to improve the computing speed and processing power of computer systems. The main target of parallel computing is scientific applications, and many large-scale scientific applications involve problems modeled as optimization problems, often discrete ones, based on graph models and exploiting artificial intelligence methods.

This Special Issue is devoted to topics in parallel computing, including theory and applications. The focus will be on applications involving parallel methods of solving hard computational problems, especially of optimization.

Prof. Dr. Wojciech Bożejko
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Parallel computing
  • High-performance computing
  • Parallel solvers
  • Sparse matrices
  • Parallel arithmetic
  • Interconnection networks
  • Parallel graph algorithms
  • Parallel combinatorial optimization algorithms
  • Supercomputing, clusters

Published Papers (12 papers)


Research

20 pages, 882 KiB  
Article
Parallelization of Array Method with Hybrid Programming: OpenMP and MPI
by Apolinar Velarde Martínez
Appl. Sci. 2022, 12(15), 7706; https://doi.org/10.3390/app12157706 - 31 Jul 2022
Cited by 2 | Viewed by 1547
Abstract
For the parallelization of applications with long processing times and large storage requirements in High-Performance Computing (HPC) systems, both shared-memory and distributed-memory programming have been used; a parallel application is represented by Parallel Task Graphs (PTGs) using Directed Acyclic Graphs (DAGs). To execute PTGs in HPC systems, a scheduler runs in two phases, scheduling and allocation; scheduling is an NP-complete combinatorial problem that demands large amounts of storage and long processing times. The Array Method (AM) is a scheduler that executes the task schedule on a set of clusters; in previous work, this method was programmed sequentially, analyzed, and tested using real and synthetic application workloads. Building on the designs proposed in that work, this research extends the method with hybrid OpenMP and MPI parallelization on a server farm and on a set of geographically distributed clusters; at the same time, a novel method for searching for free resources in clusters using Lévy random walks is proposed. Synthetic and real workloads were used to evaluate the performance of the new parallel scheduler and compare it to the sequential one. The metrics of makespan, waiting time, quality of assignments, and search for free resources were evaluated; the results described in the experiments section show better performance for the new parallel algorithm than for the sequential version. Applying the hybrid parallel approach to the extraction of PTG characteristics, to the search for geographically distributed resources with Lévy random walks, and to the metaheuristic used improves all of these metrics: the makespan decreases even as loads increase, the time tasks spend in the waiting queue decreases, the quality of assignments improves because tasks and their subtasks are placed in the same or neighboring clusters, and, finally, searches for free resources run concurrently across geographically distributed clusters rather than sequentially. Full article
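The Lévy-random-walk resource search mentioned above can be illustrated with a minimal sketch (an illustration of the general idea, not the authors' implementation; the function names and the α parameter are assumptions): cluster indices are probed with power-law-distributed jump lengths, so occasional long jumps let the search escape locally exhausted regions.

```python
import random

def levy_step(alpha=1.5, min_step=1.0):
    # Inverse-transform sampling of a Pareto (power-law) step length:
    # P(step > s) ~ s^(-alpha), giving the heavy tail of a Levy walk.
    u = 1.0 - random.random()          # u in (0, 1], avoids division by zero
    return min_step / (u ** (1.0 / alpha))

def levy_search(is_free, n_clusters, start=0, max_probes=100):
    """Probe cluster indices with Levy-distributed jumps until a free one is found."""
    pos = start
    for _ in range(max_probes):
        if is_free(pos):
            return pos
        direction = random.choice((-1, 1))
        pos = (pos + direction * int(levy_step())) % n_clusters
    return None

# Toy resource map: only clusters 70..79 out of 100 have free slots.
random.seed(42)
print(levy_search(lambda c: 70 <= c < 80, 100))
```

In the paper's setting each probe would query a remote cluster's scheduler; here `is_free` is a plain predicate standing in for that query.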
(This article belongs to the Special Issue Applications of Parallel Computing)

17 pages, 1745 KiB  
Article
Resource Profiling and Performance Modeling for Distributed Scientific Computing Environments
by Md Azam Hossain, Soonwook Hwang and Jik-Soo Kim
Appl. Sci. 2022, 12(9), 4797; https://doi.org/10.3390/app12094797 - 09 May 2022
Viewed by 1339
Abstract
Scientific applications often require a substantial amount of computing resources for running challenging jobs that may consist of many tasks, from hundreds of thousands to even millions. As a result, many institutions collaborate to solve large-scale problems by creating virtual organizations (VOs) and integrating hundreds of thousands of geographically distributed heterogeneous computing resources. Over the past decade, VOs have proven to be a powerful research testbed for accessing massive amounts of computing resources shared by several organizations at almost no cost. However, VOs often struggle to provide exact dynamic resource information due to their scale and autonomous resource management policies. Furthermore, shared resources are inconsistent, making it difficult to accurately forecast resource capacity. An effective VO resource profiling and modeling system can address these problems by forecasting resource characteristics and availability. This paper presents effective resource profiling and performance prediction models, including Adaptive Filter-based Online Linear Regression (AFOLR) and Adaptive Filter-based Moving Average (AFMV), based on a linear difference equation combining past predicted values with recently profiled information, which aim to support large-scale applications in distributed scientific computing environments. We performed quantitative analysis and conducted microbenchmark experiments on a real multinational shared computing platform. Our evaluation results demonstrate that the proposed prediction schemes outperform well-known common approaches in terms of accuracy and can help users in a shared resource environment run their large-scale applications by effectively forecasting various computing resource capacities and performance. Full article
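The abstract does not spell out the AFMV recurrence; the following is a hedged sketch of the general idea it describes, blending the previous prediction with a moving average of recently profiled values (the blending factor `alpha` and the window size are illustrative assumptions, not the paper's tuned parameters).

```python
def adaptive_forecast(history, alpha=0.6, window=3):
    """One-step-ahead forecast: blend the previous prediction with the
    moving average of the most recent profiled measurements.
    pred[t] = alpha * pred[t-1] + (1 - alpha) * mean(history[t-window:t])
    """
    pred = history[0]  # bootstrap with the first observation
    for t in range(1, len(history) + 1):
        recent = history[max(0, t - window):t]
        pred = alpha * pred + (1 - alpha) * sum(recent) / len(recent)
    return pred

# Forecast the next CPU availability (%) from profiled samples.
samples = [80, 82, 79, 81, 80, 78]
print(round(adaptive_forecast(samples), 2))
```

A real profiler would feed this with live measurements; the adaptive-filter aspect of AFOLR/AFMV (adjusting `alpha` online) is omitted here.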

51 pages, 5605 KiB  
Article
PSciLab: An Unified Distributed and Parallel Software Framework for Data Analysis, Simulation and Machine Learning—Design Practice, Software Architecture, and User Experience
by Stefan Bosse
Appl. Sci. 2022, 12(6), 2887; https://doi.org/10.3390/app12062887 - 11 Mar 2022
Cited by 5 | Viewed by 3778
Abstract
In this paper, a hybrid distributed–parallel cluster software framework for heterogeneous computer networks is introduced that supports simulation, data analysis, and machine learning (ML), using widely available JavaScript virtual machines (VMs) and web browsers to carry the workload. This work addresses parallelism primarily on the control-path level and partially on the data-path level, targeting different classes of numerical problems that can be either data-partitioned or replicated. Computations are composed of a set of interacting worker processes that can be easily parallelized or distributed, e.g., for large-scale multi-element simulation or ML. Their suitability and scalability for static and dynamic problems are experimentally investigated with regard to the proposed multi-process and communication architecture, as well as data management using customized SQL databases with network access. The framework consists of a set of tools and libraries, mainly the WorkBook (processed by a web browser) and the WorkShell (processed by node.js). The proposed distributed–parallel multi-process approach, with a dedicated set of inter-process communication methods (message- and shared-memory-based), is shown to scale up efficiently with problem size and the number of processes. Finally, it is demonstrated that this JavaScript-based approach to exploiting parallelism can be used easily by any typical numerical programmer or data analyst and does not require special knowledge of parallel and distributed systems and their interaction. The study also focuses on VM processing. Full article

13 pages, 6378 KiB  
Article
Calculation of Surface Offset Gathers Based on Reverse Time Migration and Its Parallel Computation with Multi-GPUs
by Dingjin Liu, Bo Li and Guofeng Liu
Appl. Sci. 2021, 11(22), 10687; https://doi.org/10.3390/app112210687 - 12 Nov 2021
Cited by 1 | Viewed by 1280
Abstract
As an important method for seismic data processing, reverse time migration (RTM) has high precision but involves high-intensity calculations. The calculation of RTM surface offset (shot–receiver distance) domain gathers provides intermediate data for the iterative calculation of migration and for velocity building. How to generate such data efficiently is of great significance for the industrial application of RTM. We propose a method for the calculation of surface offset gathers (SOGs) based on attribute migration wherein, using two migration calculations, the attribute profile of the surface offsets can be obtained and the image results can thus be sorted into offset gathers. To address the high-intensity computations required for RTM, we put forth a multi-GPU (graphics processing unit) computing strategy: image computational domains are distributed to different GPUs, and multi-stream calculations are used to conceal data transmission between GPUs. The computing efficiency was higher than with a single GPU and scaled nearly linearly as more GPUs were used. A model test showed that the attribute migration method correctly outputs SOGs, while the GPU parallel computation effectively improves computing efficiency. The method is therefore of practical value for wider industrial application. Full article
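The multi-GPU strategy above rests on splitting the image domain across devices. Below is a minimal, GPU-free sketch of that decomposition step (the contiguous-slab layout is an assumption; the paper's actual partitioning scheme is not detailed in the abstract):

```python
def split_domain(n_points, n_devices):
    """Partition [0, n_points) into contiguous slabs, one per device,
    distributing the remainder so slab sizes differ by at most one."""
    base, extra = divmod(n_points, n_devices)
    slabs, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs

# 1000 image columns over 3 GPUs -> [(0, 334), (334, 667), (667, 1000)]
print(split_domain(1000, 3))
```

Each slab would then be processed on its own device, with per-device streams used, as in the paper, to overlap boundary data transfer with computation.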

19 pages, 1267 KiB  
Article
Data-Oriented Language Implementation of the Lattice–Boltzmann Method for Dense and Sparse Geometries
by Tadeusz Tomczak
Appl. Sci. 2021, 11(20), 9495; https://doi.org/10.3390/app11209495 - 13 Oct 2021
Viewed by 1547
Abstract
The performance of lattice–Boltzmann solver implementations usually depends mainly on memory access patterns. Achieving high performance therefore requires complex code that handles careful data placement and ordering of memory transactions. In this work, we analyse the performance of an implementation based on a new approach called the data-oriented language, which allows complex memory access patterns to be combined with simple source code. As a use case, we present and provide the source code of a solver for the D2Q9 lattice and show its performance on a GTX Titan Xp GPU for dense and sparse geometries of up to 4096² nodes. The obtained results are promising: around 1000 lines of code allowed us to achieve performance in the range of 0.6 to 0.7 of the maximum theoretical memory bandwidth (over 2.5 and 5.0 GLUPS for double and single precision, respectively) for meshes larger than 1024² nodes, which is close to the current state of the art. However, we also observed relatively high and sometimes difficult-to-predict overheads, especially for sparse data structures. Additional issues were a rather long compilation time, which extended the duration of short simulations, and a lack of access to low-level optimisation mechanisms. Full article
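For readers unfamiliar with the D2Q9 layout the solver uses, the standard nine-velocity set and weights can be written down and checked directly (a generic textbook definition, not code from the paper):

```python
# Standard D2Q9 lattice: 9 discrete velocities and their weights.
D2Q9_VELOCITIES = [(0, 0),
                   (1, 0), (0, 1), (-1, 0), (0, -1),
                   (1, 1), (-1, 1), (-1, -1), (1, -1)]
D2Q9_WEIGHTS = [4/9] + [1/9] * 4 + [1/36] * 4

def moments(f):
    """Density and momentum of one node's distribution f (length 9)."""
    rho = sum(f)
    ux = sum(fi * c[0] for fi, c in zip(f, D2Q9_VELOCITIES))
    uy = sum(fi * c[1] for fi, c in zip(f, D2Q9_VELOCITIES))
    return rho, ux, uy

# At equilibrium with zero velocity, f equals the weights: rho = 1, u = 0.
print(moments(D2Q9_WEIGHTS))
```

The memory-layout question the paper studies is precisely how arrays of these nine per-node values are placed and streamed on the GPU.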

25 pages, 552 KiB  
Article
Parallel Makespan Calculation for Flow Shop Scheduling Problem with Minimal and Maximal Idle Time
by Jarosław Rudy
Appl. Sci. 2021, 11(17), 8204; https://doi.org/10.3390/app11178204 - 03 Sep 2021
Cited by 1 | Viewed by 3115
Abstract
In this paper, a flow shop scheduling problem with minimal and maximal machine idle time with the goal of minimizing makespan is considered. The mathematical model of the problem is presented. A generalization of the prefix sum, called the job shift scan, for computing the required shifts for overlapping jobs is proposed. A work-efficient algorithm for computing the job shift scan in parallel on the PRAM model with n processors is proposed and its time complexity of O(log n) is proven. Then, an algorithm for computing the makespan in time O(m log n) in parallel using the prefix sum and the job shift scan is proposed. Computer experiments on GPU were conducted using the CUDA platform. The results indicate multi-thread GPU vs. single-thread GPU speedups of up to 350 and 1000 for the job shift scan and makespan calculation algorithms, respectively. Multi-thread GPU vs. single-thread CPU speedups of up to 4.5 and 14.7, respectively, were observed as well. The experiments on Taillard-based problem instances, using a simulated annealing solving method employing the parallel makespan calculation, show that the method is able to perform many more iterations in the given time limit and obtain better results than the non-parallel version. Full article
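The makespan being parallelized follows the classical permutation flow shop recurrence; a sequential reference version (without the paper's minimal/maximal idle-time constraints, which are what require the job shift scan) looks like this:

```python
def makespan(proc_times):
    """Permutation flow shop makespan: proc_times[i][j] is the processing
    time of job j on machine i. Classical recurrence:
    C[i][j] = max(C[i-1][j], C[i][j-1]) + p[i][j]."""
    m, n = len(proc_times), len(proc_times[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            above = C[i - 1][j] if i > 0 else 0
            left = C[i][j - 1] if j > 0 else 0
            C[i][j] = max(above, left) + proc_times[i][j]
    return C[-1][-1]

# 2 machines, 3 jobs
print(makespan([[3, 2, 4],
                [2, 5, 1]]))   # -> 11
```

The paper's contribution is evaluating each machine's row of this recurrence in O(log n) parallel time with scans instead of the O(n) left-to-right sweep shown here.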

18 pages, 414 KiB  
Article
A Parallel Algorithm for Scheduling a Two-Machine Robotic Cell in Bicycle Frame Welding Process
by Andrzej Gnatowski and Teodor Niżyński
Appl. Sci. 2021, 11(17), 8083; https://doi.org/10.3390/app11178083 - 31 Aug 2021
Cited by 2 | Viewed by 1414
Abstract
Welding frames with differing geometries is one of the most crucial stages in the production of high-end bicycles. This paper proposes a parallel algorithm and a mixed integer linear programming formulation for scheduling a two-machine robotic welding station. The time complexity of the introduced parallel method is O(log² n) on an n³-processor Exclusive Read Exclusive Write Parallel Random-Access Machine (EREW PRAM), where n is the problem size. The algorithm is designed to take advantage of modern graphics cards to significantly accelerate the computations. To present the benefits of the parallelization, the algorithm is compared to the state-of-the-art sequential method and a solver-based approach. Experimental results show an impressive speedup for larger problem instances: up to 314 on a single Graphics Processing Unit (GPU), compared to a single-threaded CPU execution of the sequential algorithm. Full article

18 pages, 4178 KiB  
Article
Affinity-Based Task Scheduling on Heterogeneous Multicore Systems Using CBS and QBICTM
by Sohaib Iftikhar Abbasi, Shaharyar Kamal, Munkhjargal Gochoo, Ahmad Jalal and Kibum Kim
Appl. Sci. 2021, 11(12), 5740; https://doi.org/10.3390/app11125740 - 21 Jun 2021
Cited by 6 | Viewed by 2416
Abstract
This work presents the grouping of dependent tasks into clusters using a Bayesian analysis model to solve the affinity scheduling problem in heterogeneous multicore systems. Non-affinity scheduling of tasks has a negative impact, as the overall execution time of the tasks increases. Furthermore, non-affinity-based scheduling also limits the potential for data reuse in the caches, so the same data must be brought into the caches multiple times. In heterogeneous multicore systems, it is essential to address the load balancing problem, as the cores operate at varying frequencies. We propose two techniques to solve the load balancing issue: one, designated the "chunk-based scheduler" (CBS), is applied to heterogeneous systems, while the other, "quantum-based intra-core task migration" (QBICTM), gives each task a fair and equal chance to run on the fastest core. Results show a 30–55% improvement in the average execution time of tasks when applying our CBS or QBICTM scheduler compared to other traditional schedulers under the same operating system. Full article
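The abstract does not give the CBS chunking rule, so the following is only a plausible minimal sketch of frequency-proportional chunk sizing on heterogeneous cores (the proportional rule and the fastest-core tie-breaking are assumptions, not the paper's algorithm):

```python
def chunk_sizes(n_tasks, core_freqs):
    """Split n_tasks into per-core chunks proportional to core frequency,
    assigning leftover tasks to the fastest cores first."""
    total = sum(core_freqs)
    sizes = [int(n_tasks * f / total) for f in core_freqs]
    leftover = n_tasks - sum(sizes)
    # Hand out the remainder starting with the fastest core.
    for idx in sorted(range(len(core_freqs)),
                      key=lambda i: core_freqs[i], reverse=True)[:leftover]:
        sizes[idx] += 1
    return sizes

# 100 tasks over big.LITTLE-style cores at 2.4, 2.4, 1.8, 1.8 GHz
print(chunk_sizes(100, [2.4, 2.4, 1.8, 1.8]))   # -> [29, 29, 21, 21]
```

Sizing chunks by core speed is one standard way to keep cores running at different frequencies finishing at roughly the same time.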

18 pages, 1856 KiB  
Article
Designing Parallel Adaptive Laplacian Smoothing for Improving Tetrahedral Mesh Quality on the GPU
by Ning Xi, Yingjie Sun, Lei Xiao and Gang Mei
Appl. Sci. 2021, 11(12), 5543; https://doi.org/10.3390/app11125543 - 15 Jun 2021
Cited by 1 | Viewed by 2955
Abstract
Mesh quality is a critical issue in numerical computing because it directly impacts both computational efficiency and accuracy. Tetrahedral meshes are widely used in various engineering and science applications. However, in large-scale and complicated application scenarios there are large numbers of tetrahedrons, and in this case the improvement of mesh quality is computationally expensive. Laplacian mesh smoothing is a simple mesh optimization method that improves mesh quality by changing the locations of nodes. In this paper, by exploiting the parallelism features of the modern graphics processing unit (GPU), we specifically designed a parallel adaptive Laplacian smoothing algorithm for improving the quality of large-scale tetrahedral meshes. In the proposed adaptive algorithm, we defined the aspect ratio as a metric to judge mesh quality after each iteration, to ensure that every smoothing step improves the mesh quality. The adaptive algorithm avoids the shortcoming of the ordinary Laplacian algorithm of creating potentially invalid elements in concave areas. We conducted five groups of comparative experiments to evaluate the performance of the proposed parallel algorithm. The results demonstrated that the proposed adaptive algorithm is up to 23 times faster than the serial algorithms, and the quality of the tetrahedral mesh is satisfactorily improved after adaptive Laplacian mesh smoothing. Compared with the ordinary Laplacian algorithm, the proposed adaptive Laplacian algorithm is more widely applicable and can effectively deal with tetrahedrons of extremely poor quality. This indicates that the proposed parallel algorithm can be applied to improve mesh quality in large-scale and complicated application scenarios. Full article
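The adaptive idea, accepting a Laplacian move only if a quality metric improves, can be sketched in two dimensions (the `edge_ratio_quality` proxy below is an illustrative stand-in, not the paper's tetrahedral aspect-ratio formula):

```python
def smooth_node(node, neighbors, quality):
    """Adaptive Laplacian step: move `node` to the centroid of its
    neighbors, but keep the move only if the quality metric improves."""
    cx = sum(p[0] for p in neighbors) / len(neighbors)
    cy = sum(p[1] for p in neighbors) / len(neighbors)
    candidate = (cx, cy)
    better = quality(candidate, neighbors) > quality(node, neighbors)
    return candidate if better else node

def edge_ratio_quality(node, neighbors):
    # Quality proxy (an assumption, not the paper's metric): ratio of the
    # shortest to the longest edge incident to the node, in (0, 1].
    d = [((node[0] - p[0])**2 + (node[1] - p[1])**2) ** 0.5 for p in neighbors]
    return min(d) / max(d)

# A node sitting far off-centre among four ring neighbors moves to the centroid.
ring = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(smooth_node((1.9, 1.9), ring, edge_ratio_quality))   # -> (1.0, 1.0)
```

The per-node acceptance test is exactly what prevents the ordinary Laplacian update from pushing nodes into concave regions and creating invalid elements.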

18 pages, 5272 KiB  
Article
Overset Grid Assembler and Flow Solver with Adaptive Spatial Load Balancing
by Orxan Shibliyev and Ibrahim Sezai
Appl. Sci. 2021, 11(11), 5132; https://doi.org/10.3390/app11115132 - 31 May 2021
Cited by 2 | Viewed by 2142
Abstract
An overset mesh approach is useful for unsteady flow problems that involve components moving relative to each other. Since the generation of a single mesh around all components is prone to mesh stretching due to the relative motion of bodies, the overset grid methodology allows an individual mesh to be generated for each component. In this study, a parallel overset grid assembler was developed to establish connectivity across component meshes. Connectivity information was transferred to the developed parallel flow solver. The assembler uses multiple methods, such as alternating digital trees and stencil walking, to reduce the time spent on domain connectivity. Both the assembler and the solver were partitioned spatially so that overlapping mesh blocks reside in the same partitions. Spatial partitioning was performed using a 3D space partitioning structure, namely an octree, to which mesh blocks are registered. The octree was refined adaptively until its bins could be evenly distributed to processors. The assembler and solver were tested on a generic helicopter configuration in terms of load balance, scalability, and memory usage. Full article
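The octree registration the assembler relies on can be sketched with a plain recursive subdivision (the capacity and depth limits are illustrative; the paper refines until bins can be dealt out evenly to processors):

```python
def build_octree(points, bounds, capacity=4, depth=0, max_depth=8):
    """Recursively subdivide a cubic region until each leaf holds at most
    `capacity` points; leaves can then be distributed to processors.
    bounds = (min_corner, size); returns a nested dict."""
    if len(points) <= capacity or depth == max_depth:
        return {"bounds": bounds, "points": points}
    (ox, oy, oz), size = bounds
    half = size / 2
    children = []
    for i in range(8):  # one octant per bit pattern of (x, y, z)
        off = (ox + half * (i & 1),
               oy + half * ((i >> 1) & 1),
               oz + half * ((i >> 2) & 1))
        sub = [p for p in points
               if all(off[k] <= p[k] < off[k] + half for k in range(3))]
        children.append(build_octree(sub, (off, half), capacity, depth + 1, max_depth))
    return {"bounds": bounds, "children": children}

def leaves(tree):
    if "points" in tree:
        return [tree]
    return [l for c in tree["children"] for l in leaves(c)]

# 9 points clustered along a diagonal of the unit cube force the root to split.
pts = [(0.1 * k, 0.1 * k, 0.1 * k) for k in range(9)]
tree = build_octree(pts, ((0.0, 0.0, 0.0), 1.0))
print(len(leaves(tree)))
```

In the paper's setting the registered objects are mesh blocks rather than points, and refinement stops once the bins balance across processors instead of at a fixed capacity.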

17 pages, 1214 KiB  
Article
Parallel Algorithm with Blocks for a Single-Machine Total Weighted Tardiness Scheduling Problem
by Mariusz Uchroński
Appl. Sci. 2021, 11(5), 2069; https://doi.org/10.3390/app11052069 - 26 Feb 2021
Cited by 2 | Viewed by 1699
Abstract
In this paper, the weighted tardiness single-machine scheduling problem is considered. To solve it, an approximate (tabu search) algorithm is used, which works by improving the current solution through a search of its neighborhood. Methods of eliminating bad solutions from the neighborhood (the so-called block elimination properties) are also presented and implemented in the algorithm. Blocks significantly shorten the process of searching the neighborhood generated by insert-type moves. The designed parallel tabu search algorithm was implemented using the MPI (Message Passing Interface) library. The obtained speedups are very large (over 60,000×) and superlinear. This may be a sign that the parallel algorithm is superior to the sequential one, as the sequential algorithm is not able to search the solution space effectively for the problem under consideration; only the introduction of diversification through parallelization provides adequate coverage of the entire search space. Current methods of parallelizing metaheuristics give speedups that depend strongly on the problem instances and are rarely greater than the number of parallel processors used. The method proposed here obtains huge speedup values (over 60,000×), but only when the so-called blocks are used. Such speedups can be obtained on high-performance computing infrastructures such as clusters using the MPI library. Full article
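For reference, the objective the tabu search minimizes, total weighted tardiness on a single machine, is straightforward to evaluate for a given job sequence:

```python
def total_weighted_tardiness(sequence, proc, due, weight):
    """Objective of the single-machine problem: sum of w_j * max(0, C_j - d_j),
    where C_j is the completion time of job j in the given sequence."""
    t, cost = 0, 0
    for j in sequence:
        t += proc[j]                     # job j completes at time t
        cost += weight[j] * max(0, t - due[j])
    return cost

proc = [3, 2, 4]      # processing times
due = [2, 7, 8]       # due dates
weight = [2, 1, 3]    # tardiness weights
print(total_weighted_tardiness([0, 1, 2], proc, due, weight))   # -> 5
```

A neighborhood search permutes `sequence` (e.g., by insert moves) and re-evaluates this objective; the block properties let many such moves be discarded without evaluation.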

15 pages, 368 KiB  
Article
Atomicity Violation in Multithreaded Applications and Its Detection in Static Code Analysis Process
by Damian Giebas and Rafał Wojszczyk
Appl. Sci. 2020, 10(22), 8005; https://doi.org/10.3390/app10228005 - 12 Nov 2020
Cited by 8 | Viewed by 2328
Abstract
This paper is a contribution to the field of research dealing with parallel computing as used in multithreaded applications. The paper discusses the characteristics of atomicity violation in multithreaded applications and develops a new definition of atomicity violation, based on previously defined relationships between operations, that can be used for atomicity violation detection. A method for detecting the conflicts that cause atomicity violation was also developed, using a source code model of multithreaded applications that predicts errors in the software. Full article
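A minimal illustration of the phenomenon the paper targets (a language-agnostic sketch, not the paper's static-analysis method): the read-modify-write in `unsafe_increment` is not atomic, so interleaved threads can lose updates, while the lock in `safe_increment` restores atomicity. Whether the unsafe version actually loses updates depends on interpreter scheduling, so only the locked version has a deterministic result.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    # read-modify-write is NOT atomic: two threads can read the same
    # value and one update is lost (an atomicity violation).
    global counter
    for _ in range(n):
        counter += 1       # expands to read, add, write

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:         # the lock makes the read-modify-write atomic
            counter += 1

def run(worker, n=50_000, threads=4):
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter

# With the lock, the result is always threads * n.
print(run(safe_increment))   # -> 200000
```

A static detector like the one described in the paper would flag the unlocked read-modify-write on the shared `counter` without ever running the program.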
