Applications of Parallel Computing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 February 2022) | Viewed by 27878

Special Issue Editor


Prof. Dr. Wojciech Bożejko
Guest Editor
Department of Control Systems and Mechatronics, Wrocław University of Science and Technology, 50-372 Wrocław, Poland
Interests: methods of solving NP-hard problems; discrete optimization problems; parallel computing; project management; automation in construction; permutation groups; free probability

Special Issue Information

Dear Colleagues,

Over the past few years, parallel computing has been widely used to solve computational problems, especially in optimization. It is an effective way to improve the computing speed and processing power of computer systems. The main target of parallel computing is scientific applications, and many large-scale scientific applications involve problems modeled as optimization problems, often discrete ones, based on graph models and exploiting artificial intelligence methods.

This Special Issue is devoted to topics in parallel computing, including theory and applications. The focus will be on applications involving parallel methods of solving hard computational problems, especially of optimization.

Prof. Dr. Wojciech Bożejko
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Parallel computing
  • High-performance computing
  • Parallel solvers
  • Sparse matrices
  • Parallel arithmetic
  • Interconnection networks
  • Parallel graph algorithms
  • Parallel combinatorial optimization algorithms
  • Supercomputing, clusters

Published Papers (12 papers)


Research

20 pages, 882 KiB  
Article
Parallelization of Array Method with Hybrid Programming: OpenMP and MPI
by Apolinar Velarde Martínez
Appl. Sci. 2022, 12(15), 7706; https://doi.org/10.3390/app12157706 - 31 Jul 2022
Cited by 2 | Viewed by 1547
Abstract
For the parallelization of applications with long processing times and large storage requirements in High-Performance Computing (HPC) systems, both shared-memory and distributed-memory programming have been used; a parallel application is represented by Parallel Task Graphs (PTGs) using Directed Acyclic Graphs (DAGs). To execute PTGs in HPC systems, a scheduler runs in two phases, scheduling and allocation; scheduling is an NP-complete combinatorial problem that demands large amounts of storage and long processing times. The Array Method (AM) is a scheduler that executes the task schedule on a set of clusters; in previous work, this method was programmed sequentially, analyzed, and tested using real and synthetic application workloads. Building on the designs proposed in that work, this research extends the method with hybrid OpenMP and MPI parallelization on a server farm and on a set of geographically distributed clusters; at the same time, a novel method for searching for free resources in clusters using Lévy random walks is proposed. Synthetic and real workloads were used to evaluate the performance of the new parallel scheduler and compare it to the sequential one. The metrics of makespan, waiting time, quality of assignments, and search for free resources were evaluated; the results described in the experiments section show better performance for the new parallel algorithm than for the sequential version. Applying the hybrid parallel approach to the extraction of PTG characteristics, to the search for geographically distributed resources with Lévy random walks, and to the metaheuristic used improves all of these metrics: the makespan decreases even as loads increase, the time tasks spend in the waiting queue decreases, the quality of assignments improves because tasks and their subtasks are placed in the same or neighboring clusters, and, finally, searches for free resources run concurrently across geographically distributed clusters rather than sequentially. Full article
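The Lévy-random-walk resource search mentioned above can be illustrated with a minimal sketch (an illustration of the general idea, not the authors' implementation; the function names and the α parameter are assumptions): cluster indices are probed with power-law-distributed jump lengths, so occasional long jumps let the search escape locally exhausted regions.

```python
import random

def levy_step(alpha=1.5, min_step=1.0):
    # Inverse-transform sampling of a Pareto (power-law) step length:
    # P(step > s) ~ s^(-alpha), giving the heavy tail of a Levy walk.
    u = 1.0 - random.random()          # u in (0, 1], avoids division by zero
    return min_step / (u ** (1.0 / alpha))

def levy_search(is_free, n_clusters, start=0, max_probes=100):
    """Probe cluster indices with Levy-distributed jumps until a free one is found."""
    pos = start
    for _ in range(max_probes):
        if is_free(pos):
            return pos
        direction = random.choice((-1, 1))
        pos = (pos + direction * int(levy_step())) % n_clusters
    return None

# Toy resource map: only clusters 70..79 out of 100 have free slots.
random.seed(42)
print(levy_search(lambda c: 70 <= c < 80, 100))
```

In the paper's setting each probe would query a remote cluster's scheduler; here `is_free` is a plain predicate standing in for that query.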
(This article belongs to the Special Issue Applications of Parallel Computing)

17 pages, 1745 KiB  
Article
Resource Profiling and Performance Modeling for Distributed Scientific Computing Environments
by Md Azam Hossain, Soonwook Hwang and Jik-Soo Kim
Appl. Sci. 2022, 12(9), 4797; https://doi.org/10.3390/app12094797 - 09 May 2022
Viewed by 1339
Abstract
Scientific applications often require a substantial amount of computing resources for running challenging jobs that may consist of many tasks, from hundreds of thousands to even millions. As a result, many institutions collaborate to solve large-scale problems by creating virtual organizations (VOs) and integrating hundreds of thousands of geographically distributed heterogeneous computing resources. Over the past decade, VOs have proven to be a powerful research testbed for accessing massive amounts of computing resources shared by several organizations at almost no cost. However, VOs often struggle to provide exact dynamic resource information due to their scale and autonomous resource management policies. Furthermore, shared resources are inconsistent, making it difficult to accurately forecast resource capacity. An effective VO resource profiling and modeling system can address these problems by forecasting resource characteristics and availability. This paper presents effective resource profiling and performance prediction models, including Adaptive Filter-based Online Linear Regression (AFOLR) and Adaptive Filter-based Moving Average (AFMV), based on a linear difference equation combining past predicted values with recently profiled information, which aim to support large-scale applications in distributed scientific computing environments. We performed quantitative analysis and conducted microbenchmark experiments on a real multinational shared computing platform. Our evaluation results demonstrate that the proposed prediction schemes outperform well-known common approaches in terms of accuracy and can help users in a shared resource environment run their large-scale applications by effectively forecasting various computing resource capacities and performance. Full article
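The abstract does not spell out the AFMV recurrence; the following is a hedged sketch of the general idea it describes, blending the previous prediction with a moving average of recently profiled values (the blending factor `alpha` and the window size are illustrative assumptions, not the paper's tuned parameters).

```python
def adaptive_forecast(history, alpha=0.6, window=3):
    """One-step-ahead forecast: blend the previous prediction with the
    moving average of the most recent profiled measurements.
    pred[t] = alpha * pred[t-1] + (1 - alpha) * mean(history[t-window:t])
    """
    pred = history[0]  # bootstrap with the first observation
    for t in range(1, len(history) + 1):
        recent = history[max(0, t - window):t]
        pred = alpha * pred + (1 - alpha) * sum(recent) / len(recent)
    return pred

# Forecast the next CPU availability (%) from profiled samples.
samples = [80, 82, 79, 81, 80, 78]
print(round(adaptive_forecast(samples), 2))
```

A real profiler would feed this with live measurements; the adaptive-filter aspect of AFOLR/AFMV (adjusting `alpha` online) is omitted here.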

51 pages, 5605 KiB  
Article
PSciLab: An Unified Distributed and Parallel Software Framework for Data Analysis, Simulation and Machine Learning—Design Practice, Software Architecture, and User Experience
by Stefan Bosse
Appl. Sci. 2022, 12(6), 2887; https://doi.org/10.3390/app12062887 - 11 Mar 2022
Cited by 5 | Viewed by 3778
Abstract
In this paper, a hybrid distributed–parallel cluster software framework for heterogeneous computer networks is introduced that supports simulation, data analysis, and machine learning (ML), using widely available JavaScript virtual machines (VMs) and web browsers to carry the workload. This work addresses parallelism primarily on the control-path level and partially on the data-path level, targeting different classes of numerical problems that can be either data-partitioned or replicated. Computations are composed of a set of interacting worker processes that can be easily parallelized or distributed, e.g., for large-scale multi-element simulation or ML. Their suitability and scalability for static and dynamic problems are experimentally investigated with regard to the proposed multi-process and communication architecture, as well as data management using customized SQL databases with network access. The framework consists of a set of tools and libraries, mainly the WorkBook (processed by a web browser) and the WorkShell (processed by node.js). The proposed distributed–parallel multi-process approach, with a dedicated set of inter-process communication methods (message- and shared-memory-based), is shown to scale up efficiently with problem size and the number of processes. Finally, it is demonstrated that this JavaScript-based approach to exploiting parallelism can be used easily by any typical numerical programmer or data analyst and does not require special knowledge of parallel and distributed systems and their interaction. The study also focuses on VM processing. Full article

13 pages, 6378 KiB  
Article
Calculation of Surface Offset Gathers Based on Reverse Time Migration and Its Parallel Computation with Multi-GPUs
by Dingjin Liu, Bo Li and Guofeng Liu
Appl. Sci. 2021, 11(22), 10687; https://doi.org/10.3390/app112210687 - 12 Nov 2021
Cited by 1 | Viewed by 1280
Abstract
As an important method for seismic data processing, reverse time migration (RTM) has high precision but involves high-intensity calculations. The calculation of RTM surface offset (shot–receiver distance) domain gathers provides intermediate data for the iterative calculation of migration and for velocity building. How to generate such data efficiently is of great significance for the industrial application of RTM. We propose a method for the calculation of surface offset gathers (SOGs) based on attribute migration wherein, using two migration calculations, the attribute profile of the surface offsets can be obtained and the image results can thus be sorted into offset gathers. To address the high-intensity computations required for RTM, we put forth a multi-GPU (graphics processing unit) computing strategy: image computational domains are distributed to different GPUs, and multi-stream calculations are used to conceal data transmission between GPUs. The computing efficiency was higher than with a single GPU and scaled nearly linearly as more GPUs were used. A model test showed that the attribute migration method correctly outputs SOGs, while the GPU parallel computation effectively improves computing efficiency. The method is therefore of practical value for wider industrial application. Full article
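The multi-GPU strategy above rests on splitting the image domain across devices. Below is a minimal, GPU-free sketch of that decomposition step (the contiguous-slab layout is an assumption; the paper's actual partitioning scheme is not detailed in the abstract):

```python
def split_domain(n_points, n_devices):
    """Partition [0, n_points) into contiguous slabs, one per device,
    distributing the remainder so slab sizes differ by at most one."""
    base, extra = divmod(n_points, n_devices)
    slabs, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs

# 1000 image columns over 3 GPUs -> [(0, 334), (334, 667), (667, 1000)]
print(split_domain(1000, 3))
```

Each slab would then be processed on its own device, with per-device streams used, as in the paper, to overlap boundary data transfer with computation.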

19 pages, 1267 KiB  
Article
Data-Oriented Language Implementation of the Lattice–Boltzmann Method for Dense and Sparse Geometries
by Tadeusz Tomczak
Appl. Sci. 2021, 11(20), 9495; https://doi.org/10.3390/app11209495 - 13 Oct 2021
Viewed by 1547
Abstract
The performance of lattice–Boltzmann solver implementations usually depends mainly on memory access patterns. Achieving high performance therefore requires complex code that handles careful data placement and ordering of memory transactions. In this work, we analyse the performance of an implementation based on a new approach called the data-oriented language, which allows complex memory access patterns to be combined with simple source code. As a use case, we present and provide the source code of a solver for the D2Q9 lattice and show its performance on a GTX Titan Xp GPU for dense and sparse geometries of up to 4096² nodes. The obtained results are promising: around 1000 lines of code allowed us to achieve performance in the range of 0.6 to 0.7 of the maximum theoretical memory bandwidth (over 2.5 and 5.0 GLUPS for double and single precision, respectively) for meshes larger than 1024² nodes, which is close to the current state of the art. However, we also observed relatively high and sometimes difficult-to-predict overheads, especially for sparse data structures. Additional issues were a rather long compilation time, which extended the duration of short simulations, and a lack of access to low-level optimisation mechanisms. Full article
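For readers unfamiliar with the D2Q9 layout the solver uses, the standard nine-velocity set and weights can be written down and checked directly (a generic textbook definition, not code from the paper):

```python
# Standard D2Q9 lattice: 9 discrete velocities and their weights.
D2Q9_VELOCITIES = [(0, 0),
                   (1, 0), (0, 1), (-1, 0), (0, -1),
                   (1, 1), (-1, 1), (-1, -1), (1, -1)]
D2Q9_WEIGHTS = [4/9] + [1/9] * 4 + [1/36] * 4

def moments(f):
    """Density and momentum of one node's distribution f (length 9)."""
    rho = sum(f)
    ux = sum(fi * c[0] for fi, c in zip(f, D2Q9_VELOCITIES))
    uy = sum(fi * c[1] for fi, c in zip(f, D2Q9_VELOCITIES))
    return rho, ux, uy

# At equilibrium with zero velocity, f equals the weights: rho = 1, u = 0.
print(moments(D2Q9_WEIGHTS))
```

The memory-layout question the paper studies is precisely how arrays of these nine per-node values are placed and streamed on the GPU.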

25 pages, 552 KiB  
Article
Parallel Makespan Calculation for Flow Shop Scheduling Problem with Minimal and Maximal Idle Time
by Jarosław Rudy
Appl. Sci. 2021, 11(17), 8204; https://doi.org/10.3390/app11178204 - 03 Sep 2021
Cited by 1 | Viewed by 3115
Abstract
In this paper, a flow shop scheduling problem with minimal and maximal machine idle time with the goal of minimizing makespan is considered. The mathematical model of the problem is presented. A generalization of the prefix sum, called the job shift scan, for computing the required shifts for overlapping jobs is proposed. A work-efficient algorithm for computing the job shift scan in parallel on the PRAM model with n processors is proposed and its time complexity of O(log n) is proven. Then, an algorithm for computing the makespan in time O(m log n) in parallel using the prefix sum and the job shift scan is proposed. Computer experiments on GPU were conducted using the CUDA platform. The results indicate multi-thread GPU vs. single-thread GPU speedups of up to 350 and 1000 for the job shift scan and makespan calculation algorithms, respectively. Multi-thread GPU vs. single-thread CPU speedups of up to 4.5 and 14.7, respectively, were observed as well. The experiments on Taillard-based problem instances, using a simulated annealing solving method employing the parallel makespan calculation, show that the method is able to perform many more iterations in the given time limit and obtain better results than the non-parallel version. Full article
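The makespan being parallelized follows the classical permutation flow shop recurrence; a sequential reference version (without the paper's minimal/maximal idle-time constraints, which are what require the job shift scan) looks like this:

```python
def makespan(proc_times):
    """Permutation flow shop makespan: proc_times[i][j] is the processing
    time of job j on machine i. Classical recurrence:
    C[i][j] = max(C[i-1][j], C[i][j-1]) + p[i][j]."""
    m, n = len(proc_times), len(proc_times[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            above = C[i - 1][j] if i > 0 else 0
            left = C[i][j - 1] if j > 0 else 0
            C[i][j] = max(above, left) + proc_times[i][j]
    return C[-1][-1]

# 2 machines, 3 jobs
print(makespan([[3, 2, 4],
                [2, 5, 1]]))   # -> 11
```

The paper's contribution is evaluating each machine's row of this recurrence in O(log n) parallel time with scans instead of the O(n) left-to-right sweep shown here.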

18 pages, 414 KiB  
Article
A Parallel Algorithm for Scheduling a Two-Machine Robotic Cell in Bicycle Frame Welding Process
by Andrzej Gnatowski and Teodor Niżyński
Appl. Sci. 2021, 11(17), 8083; https://doi.org/10.3390/app11178083 - 31 Aug 2021
Cited by 2 | Viewed by 1414
Abstract
Welding frames with differing geometries is one of the most crucial stages in the production of high-end bicycles. This paper proposes a parallel algorithm and a mixed integer linear programming formulation for scheduling a two-machine robotic welding station. The time complexity of the introduced parallel method is O(log² n) on an n³-processor Exclusive Read Exclusive Write Parallel Random-Access Machine (EREW PRAM), where n is the problem size. The algorithm is designed to take advantage of modern graphics cards to significantly accelerate the computations. To present the benefits of the parallelization, the algorithm is compared to the state-of-the-art sequential method and a solver-based approach. Experimental results show an impressive speedup for larger problem instances: up to 314 on a single Graphics Processing Unit (GPU), compared to a single-threaded CPU execution of the sequential algorithm. Full article

18 pages, 4178 KiB  
Article
Affinity-Based Task Scheduling on Heterogeneous Multicore Systems Using CBS and QBICTM
by Sohaib Iftikhar Abbasi, Shaharyar Kamal, Munkhjargal Gochoo, Ahmad Jalal and Kibum Kim
Appl. Sci. 2021, 11(12), 5740; https://doi.org/10.3390/app11125740 - 21 Jun 2021
Cited by 6 | Viewed by 2416
Abstract
This work presents the grouping of dependent tasks into clusters using a Bayesian analysis model to solve the affinity scheduling problem in heterogeneous multicore systems. Non-affinity scheduling of tasks has a negative impact, as the overall execution time of the tasks increases. Furthermore, non-affinity-based scheduling also limits the potential for data reuse in the caches, so the same data must be brought into the caches multiple times. In heterogeneous multicore systems, it is essential to address the load balancing problem, as the cores operate at varying frequencies. We propose two techniques to solve the load balancing issue: one, designated the "chunk-based scheduler" (CBS), is applied to heterogeneous systems, while the other, "quantum-based intra-core task migration" (QBICTM), gives each task a fair and equal chance to run on the fastest core. Results show a 30–55% improvement in the average execution time of tasks when applying our CBS or QBICTM scheduler compared to other traditional schedulers under the same operating system. Full article
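The abstract does not give the CBS chunking rule, so the following is only a plausible minimal sketch of frequency-proportional chunk sizing on heterogeneous cores (the proportional rule and the fastest-core tie-breaking are assumptions, not the paper's algorithm):

```python
def chunk_sizes(n_tasks, core_freqs):
    """Split n_tasks into per-core chunks proportional to core frequency,
    assigning leftover tasks to the fastest cores first."""
    total = sum(core_freqs)
    sizes = [int(n_tasks * f / total) for f in core_freqs]
    leftover = n_tasks - sum(sizes)
    # Hand out the remainder starting with the fastest core.
    for idx in sorted(range(len(core_freqs)),
                      key=lambda i: core_freqs[i], reverse=True)[:leftover]:
        sizes[idx] += 1
    return sizes

# 100 tasks over big.LITTLE-style cores at 2.4, 2.4, 1.8, 1.8 GHz
print(chunk_sizes(100, [2.4, 2.4, 1.8, 1.8]))   # -> [29, 29, 21, 21]
```

Sizing chunks by core speed is one standard way to keep cores running at different frequencies finishing at roughly the same time.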

18 pages, 1856 KiB  
Article
Designing Parallel Adaptive Laplacian Smoothing for Improving Tetrahedral Mesh Quality on the GPU
by Ning Xi, Yingjie Sun, Lei Xiao and Gang Mei
Appl. Sci. 2021, 11(12), 5543; https://doi.org/10.3390/app11125543 - 15 Jun 2021
Cited by 1 | Viewed by 2955
Abstract
Mesh quality is a critical issue in numerical computing because it directly impacts both computational efficiency and accuracy. Tetrahedral meshes are widely used in various engineering and science applications. However, in large-scale and complicated application scenarios there are large numbers of tetrahedrons, and in this case the improvement of mesh quality is computationally expensive. Laplacian mesh smoothing is a simple mesh optimization method that improves mesh quality by changing the locations of nodes. In this paper, by exploiting the parallelism features of the modern graphics processing unit (GPU), we specifically designed a parallel adaptive Laplacian smoothing algorithm for improving the quality of large-scale tetrahedral meshes. In the proposed adaptive algorithm, we defined the aspect ratio as a metric to judge mesh quality after each iteration, to ensure that every smoothing step improves the mesh quality. The adaptive algorithm avoids the shortcoming of the ordinary Laplacian algorithm of creating potentially invalid elements in concave areas. We conducted five groups of comparative experiments to evaluate the performance of the proposed parallel algorithm. The results demonstrated that the proposed adaptive algorithm is up to 23 times faster than the serial algorithms, and the quality of the tetrahedral mesh is satisfactorily improved after adaptive Laplacian mesh smoothing. Compared with the ordinary Laplacian algorithm, the proposed adaptive Laplacian algorithm is more widely applicable and can effectively deal with tetrahedrons of extremely poor quality. This indicates that the proposed parallel algorithm can be applied to improve mesh quality in large-scale and complicated application scenarios. Full article
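The adaptive idea, accepting a Laplacian move only if a quality metric improves, can be sketched in two dimensions (the `edge_ratio_quality` proxy below is an illustrative stand-in, not the paper's tetrahedral aspect-ratio formula):

```python
def smooth_node(node, neighbors, quality):
    """Adaptive Laplacian step: move `node` to the centroid of its
    neighbors, but keep the move only if the quality metric improves."""
    cx = sum(p[0] for p in neighbors) / len(neighbors)
    cy = sum(p[1] for p in neighbors) / len(neighbors)
    candidate = (cx, cy)
    better = quality(candidate, neighbors) > quality(node, neighbors)
    return candidate if better else node

def edge_ratio_quality(node, neighbors):
    # Quality proxy (an assumption, not the paper's metric): ratio of the
    # shortest to the longest edge incident to the node, in (0, 1].
    d = [((node[0] - p[0])**2 + (node[1] - p[1])**2) ** 0.5 for p in neighbors]
    return min(d) / max(d)

# A node sitting far off-centre among four ring neighbors moves to the centroid.
ring = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(smooth_node((1.9, 1.9), ring, edge_ratio_quality))   # -> (1.0, 1.0)
```

The per-node acceptance test is exactly what prevents the ordinary Laplacian update from pushing nodes into concave regions and creating invalid elements.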

18 pages, 5272 KiB  
Article
Overset Grid Assembler and Flow Solver with Adaptive Spatial Load Balancing
by Orxan Shibliyev and Ibrahim Sezai
Appl. Sci. 2021, 11(11), 5132; https://doi.org/10.3390/app11115132 - 31 May 2021
Cited by 2 | Viewed by 2142
Abstract
An overset mesh approach is useful for unsteady flow problems that involve components moving relative to each other. Since the generation of a single mesh around all components is prone to mesh stretching due to the relative motion of bodies, the overset grid methodology allows an individual mesh to be generated for each component. In this study, a parallel overset grid assembler was developed to establish connectivity across component meshes. Connectivity information was transferred to the developed parallel flow solver. The assembler uses multiple methods, such as alternating digital trees and stencil walking, to reduce the time spent on domain connectivity. Both the assembler and the solver were partitioned spatially so that overlapping mesh blocks reside in the same partitions. Spatial partitioning was performed using a 3D space partitioning structure, namely an octree, to which mesh blocks are registered. The octree was refined adaptively until its bins could be evenly distributed to processors. The assembler and solver were tested on a generic helicopter configuration in terms of load balance, scalability, and memory usage. Full article
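The octree registration the assembler relies on can be sketched with a plain recursive subdivision (the capacity and depth limits are illustrative; the paper refines until bins can be dealt out evenly to processors):

```python
def build_octree(points, bounds, capacity=4, depth=0, max_depth=8):
    """Recursively subdivide a cubic region until each leaf holds at most
    `capacity` points; leaves can then be distributed to processors.
    bounds = (min_corner, size); returns a nested dict."""
    if len(points) <= capacity or depth == max_depth:
        return {"bounds": bounds, "points": points}
    (ox, oy, oz), size = bounds
    half = size / 2
    children = []
    for i in range(8):  # one octant per bit pattern of (x, y, z)
        off = (ox + half * (i & 1),
               oy + half * ((i >> 1) & 1),
               oz + half * ((i >> 2) & 1))
        sub = [p for p in points
               if all(off[k] <= p[k] < off[k] + half for k in range(3))]
        children.append(build_octree(sub, (off, half), capacity, depth + 1, max_depth))
    return {"bounds": bounds, "children": children}

def leaves(tree):
    if "points" in tree:
        return [tree]
    return [l for c in tree["children"] for l in leaves(c)]

# 9 points clustered along a diagonal of the unit cube force the root to split.
pts = [(0.1 * k, 0.1 * k, 0.1 * k) for k in range(9)]
tree = build_octree(pts, ((0.0, 0.0, 0.0), 1.0))
print(len(leaves(tree)))
```

In the paper's setting the registered objects are mesh blocks rather than points, and refinement stops once the bins balance across processors instead of at a fixed capacity.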

17 pages, 1214 KiB  
Article
Parallel Algorithm with Blocks for a Single-Machine Total Weighted Tardiness Scheduling Problem
by Mariusz Uchroński
Appl. Sci. 2021, 11(5), 2069; https://doi.org/10.3390/app11052069 - 26 Feb 2021
Cited by 2 | Viewed by 1699
Abstract
In this paper, the weighted tardiness single-machine scheduling problem is considered. To solve it, an approximate (tabu search) algorithm is used, which works by improving the current solution through a search of its neighborhood. Methods of eliminating bad solutions from the neighborhood (the so-called block elimination properties) are also presented and implemented in the algorithm. Blocks significantly shorten the process of searching the neighborhood generated by insert-type moves. The designed parallel tabu search algorithm was implemented using the MPI (Message Passing Interface) library. The obtained speedups are very large (over 60,000×) and superlinear. This may be a sign that the parallel algorithm is superior to the sequential one, as the sequential algorithm is not able to search the solution space effectively for the problem under consideration; only the introduction of diversification through parallelization provides adequate coverage of the entire search space. Current methods of parallelizing metaheuristics give speedups that depend strongly on the problem instances and are rarely greater than the number of parallel processors used. The method proposed here obtains huge speedup values (over 60,000×), but only when the so-called blocks are used. Such speedups can be obtained on high-performance computing infrastructures such as clusters using the MPI library. Full article
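For reference, the objective the tabu search minimizes, total weighted tardiness on a single machine, is straightforward to evaluate for a given job sequence:

```python
def total_weighted_tardiness(sequence, proc, due, weight):
    """Objective of the single-machine problem: sum of w_j * max(0, C_j - d_j),
    where C_j is the completion time of job j in the given sequence."""
    t, cost = 0, 0
    for j in sequence:
        t += proc[j]                     # job j completes at time t
        cost += weight[j] * max(0, t - due[j])
    return cost

proc = [3, 2, 4]      # processing times
due = [2, 7, 8]       # due dates
weight = [2, 1, 3]    # tardiness weights
print(total_weighted_tardiness([0, 1, 2], proc, due, weight))   # -> 5
```

A neighborhood search permutes `sequence` (e.g., by insert moves) and re-evaluates this objective; the block properties let many such moves be discarded without evaluation.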

15 pages, 368 KiB  
Article
Atomicity Violation in Multithreaded Applications and Its Detection in Static Code Analysis Process
by Damian Giebas and Rafał Wojszczyk
Appl. Sci. 2020, 10(22), 8005; https://doi.org/10.3390/app10228005 - 12 Nov 2020
Cited by 8 | Viewed by 2328
Abstract
This paper is a contribution to the field of research dealing with parallel computing as used in multithreaded applications. The paper discusses the characteristics of atomicity violation in multithreaded applications and develops a new definition of atomicity violation, based on previously defined relationships between operations, that can be used for atomicity violation detection. A method for detecting the conflicts that cause atomicity violation was also developed, using a source code model of multithreaded applications that predicts errors in the software. Full article
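A minimal illustration of the phenomenon the paper targets (a language-agnostic sketch, not the paper's static-analysis method): the read-modify-write in `unsafe_increment` is not atomic, so interleaved threads can lose updates, while the lock in `safe_increment` restores atomicity. Whether the unsafe version actually loses updates depends on interpreter scheduling, so only the locked version has a deterministic result.

```python
import threading

counter = 0
lock = threading.Lock()

def unsafe_increment(n):
    # read-modify-write is NOT atomic: two threads can read the same
    # value and one update is lost (an atomicity violation).
    global counter
    for _ in range(n):
        counter += 1       # expands to read, add, write

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:         # the lock makes the read-modify-write atomic
            counter += 1

def run(worker, n=50_000, threads=4):
    global counter
    counter = 0
    ts = [threading.Thread(target=worker, args=(n,)) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return counter

# With the lock, the result is always threads * n.
print(run(safe_increment))   # -> 200000
```

A static detector like the one described in the paper would flag the unlocked read-modify-write on the shared `counter` without ever running the program.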
