Search Results (27)

Search Parameters:
Keywords = GPGPU parallelization

22 pages, 3408 KB  
Article
A High-Performance Branch Control Mechanism for GPGPU Based on RISC-V Architecture
by Yao Cheng, Yi Man and Xinbing Zhou
Electronics 2026, 15(1), 125; https://doi.org/10.3390/electronics15010125 - 26 Dec 2025
Viewed by 708
Abstract
General-Purpose Graphics Processing Units (GPGPUs) rely on warp scheduling and control flow management to organize parallel thread execution, making efficient control flow mechanisms essential for modern GPGPU design. Currently, the mainstream RISC-V GPGPU Vortex adopts the Single Instruction Multiple Threads (SIMT) stack control mechanism. This approach introduces high complexity and performance overhead, becoming a major limitation for further improving control efficiency. To address this issue, this paper proposes a thread-mask-based branch control mechanism for the RISC-V architecture. The mechanism introduces explicit mask primitives at the Instruction Set Architecture (ISA) level and directly manages the active status of threads within a warp through logical operations, enabling branch execution without jumps and thus reducing the overhead of the original control flow mechanism. Unlike traditional thread mask mechanisms in GPUs, our design centers on RISC-V and realizes co-optimization at both the ISA and microarchitecture levels. The mechanism was modeled and validated on Vortex SimX. Experimental results show that, compared with the Vortex SIMT stack mechanism, the proposed approach maintains correct control semantics while reducing branch execution cycles by an average of 31% and up to 40%, providing a new approach for RISC-V GPGPU control flow optimization. Full article
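The mask-based branching idea can be sketched in a few lines, assuming a hypothetical 4-lane warp; this is a pure-Python model for illustration, not the paper's ISA or microarchitecture:

```python
# Minimal sketch: steering both sides of a branch with explicit thread
# masks and logical operations, so no per-lane jumps are needed.

WARP_SIZE = 4

def run_branch(values):
    """Each lane runs: x = x*2 if x is even, else x = x + 1."""
    active = (1 << WARP_SIZE) - 1            # all lanes active: 0b1111
    cond = 0
    for lane in range(WARP_SIZE):            # evaluate predicate per lane
        if values[lane] % 2 == 0:
            cond |= 1 << lane
    taken = active & cond                    # lanes taking the 'if' side
    not_taken = active & ~cond               # lanes taking the 'else' side
    out = list(values)
    for lane in range(WARP_SIZE):            # 'if' side under taken mask
        if taken >> lane & 1:
            out[lane] = out[lane] * 2
    for lane in range(WARP_SIZE):            # 'else' side under complement
        if not_taken >> lane & 1:
            out[lane] = out[lane] + 1
    return out

print(run_branch([1, 2, 3, 4]))  # [2, 4, 4, 8]
```

Both paths execute over the whole warp, but each lane commits results only under its mask, which is the behavior a SIMT stack otherwise has to reconstruct with push/pop and jump bookkeeping.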

26 pages, 2178 KB  
Article
Hierarchical Parallelization of Rigid Body Simulation with Soft Blocking Method on GPU
by Rikuya Tomii and Tetsu Narumi
Computation 2025, 13(11), 250; https://doi.org/10.3390/computation13110250 - 2 Nov 2025
Viewed by 1112
Abstract
This paper proposes and implements a method to efficiently parallelize constraint solving in rigid body simulation using GPUs. Rigid body simulation is widely used in robot development, computer games, movies, and other fields, and there is a growing need for faster computation. As current computers are reaching their limits in terms of scale-up, such as clock frequency improvements, performance improvements are being sought through scale-out, which increases parallelism. However, rigid body simulation is difficult to parallelize efficiently due to its characteristics. This is because, unlike fluid or molecular physics simulations, where each particle or lattice can be independently extracted and processed, rigid bodies can interact with a large number of distant objects depending on the instance. This characteristic causes significant load imbalance, making it difficult to evenly distribute computational resources using simple methods such as spatial partitioning. Therefore, this paper proposes and implements a computational method that enables high-speed computation of large-scale scenes by hierarchically clustering rigid bodies based on their number and associating the hierarchy with the hardware structure of GPUs. In addition, to effectively utilize parallel computing resources, we considered a more relaxed parallelization condition for the conventional Gauss–Seidel block parallelization method and demonstrated that convergence is guaranteed. We investigated how speed and convergence performance change depending on how much computational cost is allocated to each hierarchy and discussed the desirable parameter settings. By conducting experiments comparing our method with several widely used software packages, we demonstrated that our approach enables calculations at speeds previously unattainable with existing techniques, while leveraging GPU computational resources to handle multiple rigid bodies simultaneously without significantly compromising accuracy. 
Full article
(This article belongs to the Section Computational Engineering)
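The block-parallel Gauss-Seidel idea can be illustrated with a small sketch; the system, block partition, and sweep count below are invented for illustration, and the true GPU concurrency between blocks is elided:

```python
# Illustrative sketch (not the paper's GPU code): a blocked Gauss-Seidel
# sweep for A x = b. Unknowns inside a block update sequentially; the
# separate blocks are the units one would map to GPU thread blocks in a
# block-parallel Gauss-Seidel scheme.

def block_gauss_seidel(A, b, blocks, sweeps=50):
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for block in blocks:                 # blocks are the parallel units
            for i in block:                  # sequential within a block
                s = sum(A[i][j] * x[j] for j in range(n) if j != i)
                x[i] = (b[i] - s) / A[i][i]
    return x

# Diagonally dominant test system with known solution x = [1, 2, 3, 4].
A = [[10, 1, 0, 0],
     [1, 10, 1, 0],
     [0, 1, 10, 1],
     [0, 0, 1, 10]]
b = [12, 24, 36, 43]
x = block_gauss_seidel(A, b, blocks=[[0, 1], [2, 3]])
print([round(v, 6) for v in x])  # converges to [1.0, 2.0, 3.0, 4.0]
```

The "soft blocking" relaxation in the paper concerns which couplings between concurrently updated blocks may be tolerated while keeping convergence; the sketch only shows the baseline block structure.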

22 pages, 3673 KB  
Article
Massively Parallel Lagrangian Relaxation Algorithm for Solving Large-Scale Spatial Optimization Problems Using GPGPU
by Ting L. Lei, Rongrong Wang and Zhen Lei
ISPRS Int. J. Geo-Inf. 2025, 14(11), 419; https://doi.org/10.3390/ijgi14110419 - 26 Oct 2025
Viewed by 904
Abstract
Lagrangian Relaxation (LR) is an effective method for solving spatial optimization problems in geospatial analysis and GIS. Among others, it has been used to solve the classic p-median problem that served as a unified local model in GIS since the 1990s. Despite its efficiency, the LR algorithm has seen limited usage in practice and is not as widely used as off-the-shelf solvers such as OPL/CPLEX or GLPK. This is primarily because of the high cost of development, which includes (i) the cost of developing a full gradient descent algorithm for each optimization model, with various tricks and modifications to improve the speed, (ii) the high computational cost for large problem instances, (iii) the need to test and choose from different relaxation schemes, and (iv) the need to derive and compute the gradients in a programming language. In this study, we aim to solve the first three issues by utilizing the computational power of GPGPU and existing facilities of modern deep learning (DL) frameworks such as PyTorch. Based on an analysis of the commonalities and differences between DL and general optimization, we adapt DL libraries for solving LR problems. As a result, we can choose from the many gradient descent strategies (known as “optimizers”) in DL libraries rather than reinventing them from scratch. Experiments show that implementing LR in DL libraries is not only feasible but also convenient. Gradient vectors are automatically tracked and computed. Furthermore, the computational power of GPGPU is automatically used to parallelize the optimization algorithm (a long-term difficulty in operations research). Experiments with the classic p-median problem show that we can solve much larger problem instances (of more than 15,000 nodes) optimally or nearly optimally using the GPU-based LR algorithm. Such capabilities allow for a more fine-grained analysis in GIS.
Comparisons with the OPL solver and the CPU version of the algorithm show that the GPU version achieves speedups of 104× and 12.5×, respectively. The GPU utilization rate on an RTX 4090 GPU reaches 90%. We then conclude with a summary of the findings and remarks regarding future work. Full article
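The core subgradient loop behind LR can be sketched without any DL framework; the toy problem and step-size rule below are illustrative stand-ins for the paper's PyTorch-based, GPU-parallel implementation:

```python
# Hedged sketch of Lagrangian Relaxation by subgradient ascent on a tiny
# problem:  min x1 + 2*x2  s.t.  x1 + x2 >= 1,  x binary,
# with the constraint relaxed into the objective via multiplier lam.

def dual_step(lam):
    # Inner minimization of the Lagrangian decomposes per variable.
    x1 = 1 if 1 - lam < 0 else 0
    x2 = 1 if 2 - lam < 0 else 0
    value = lam + min(0, 1 - lam) + min(0, 2 - lam)  # dual bound at lam
    subgrad = 1 - x1 - x2            # derivative w.r.t. lam (a subgradient)
    return value, subgrad

def solve(iters=100):
    lam, best = 0.0, float("-inf")
    for t in range(1, iters + 1):
        value, g = dual_step(lam)
        best = max(best, value)
        lam = max(0.0, lam + g / t)  # projected step with diminishing size
    return best

print(round(solve(), 3))  # 1.0, matching the primal optimum x = (1, 0)
```

In the paper's setting this hand-written update is replaced by an autograd-tracked Lagrangian and an off-the-shelf DL optimizer, with the per-variable inner minimizations running in parallel on the GPU.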

31 pages, 2573 KB  
Article
Hardware Design of DRAM Memory Prefetching Engine for General-Purpose GPUs
by Freddy Gabbay, Benjamin Salomon, Idan Golan and Dolev Shema
Technologies 2025, 13(10), 455; https://doi.org/10.3390/technologies13100455 - 8 Oct 2025
Cited by 1 | Viewed by 2264
Abstract
General-purpose graphics processing units (GPGPUs) face significant performance limitations due to memory access latencies, particularly when traditional memory hierarchies and thread-switching mechanisms prove insufficient for complex access patterns in data-intensive applications such as machine learning (ML) and scientific computing. This paper presents a novel hardware design for a memory prefetching subsystem targeted at DDR (Double Data Rate) memory in GPGPU architectures. The proposed prefetching subsystem features a modular architecture comprising multiple parallel prefetching engines, each handling distinct memory address ranges with dedicated data buffers and adaptive stride detection algorithms that dynamically identify recurring memory access patterns. The design incorporates robust system integration features, including context flushing, watchdog timers, and flexible configuration interfaces, for runtime optimization. Comprehensive experimental validation using real-world workloads examined critical design parameters, including block sizes, prefetch outstanding limits, and throttling rates, across diverse memory access patterns. Results demonstrate significant performance improvements with average memory access latency reductions of up to 82% compared to no-prefetch baselines, and speedups in the range of 1.240–1.794. The proposed prefetching subsystem successfully enhances memory hierarchy efficiency and provides practical design guidelines for deployment in production GPGPU systems, establishing clear parameter optimization strategies for different workload characteristics. Full article
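A minimal software model of the stride detection such engines rely on might look like this; the class name, confidence threshold, and interface are hypothetical, not the paper's hardware design:

```python
# Illustrative sketch of adaptive stride detection: confirm a recurring
# delta between successive addresses, then predict the next address to
# prefetch. Real engines track many streams and throttle issue rates.

class StrideDetector:
    def __init__(self):
        self.last_addr = None
        self.stride = None
        self.confidence = 0

    def access(self, addr):
        """Record an access; return a predicted prefetch address or None."""
        prediction = None
        if self.last_addr is not None:
            delta = addr - self.last_addr
            if delta == self.stride:
                self.confidence += 1
            else:                            # pattern broke: retrain
                self.stride = delta
                self.confidence = 0
            if self.confidence >= 2:         # stride confirmed twice
                prediction = addr + self.stride
        self.last_addr = addr
        return prediction

d = StrideDetector()
preds = [d.access(a) for a in [0x100, 0x140, 0x180, 0x1C0]]
print(preds)  # [None, None, None, 0x200]
```

The confidence counter is what makes the detector "adaptive": it stays silent on irregular streams and only issues prefetches once a stride has repeated.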

18 pages, 826 KB  
Article
Efficient GPU Parallel Implementation and Optimization of ARIA for Counter and Exhaustive Key-Search Modes
by Siwoo Eum, Minho Song, Sangwon Kim and Hwajeong Seo
Electronics 2025, 14(10), 2021; https://doi.org/10.3390/electronics14102021 - 15 May 2025
Viewed by 1497
Abstract
This paper proposes an optimized shared memory access technique to enhance parallel processing performance and reduce memory accesses for the ARIA block cipher in GPU environments. To overcome the limited size of GPU shared memory, we merged ARIA’s four separate S-box tables into a single unified 32-bit table, effectively reducing the total memory usage from 4 KB to 1 KB. This allowed the consolidated table to be replicated 32 times within the limited shared memory, efficiently resolving the memory-bank conflict issues frequently encountered during parallel execution. Additionally, we utilized CUDA’s built-in function __byte_perm() to efficiently reconstruct the desired outputs from the reduced unified table, without imposing additional computational overhead. In exhaustive key-search scenarios, we implemented an on-the-fly key-expansion method, significantly reducing the memory usage per thread and enhancing parallel processing efficiency. In the RTX 3060 environment, profiling was performed to accurately analyze shared memory efficiency and the performance degradation caused by bank conflicts, yielding detailed profiling results. The results of experiments conducted on the RTX 3060 Mobile and RTX 4080 GPUs demonstrated significant performance improvements over conventional methods. Notably, the RTX 4080 GPU achieved a maximum throughput of 1532.42 Gbps in ARIA-CTR mode, clearly validating the effectiveness and practical applicability of the proposed optimization techniques. On the RTX 3060, the performance of 128-bit ARIA-CTR was improved by 2.34× compared to previous state-of-the-art implementations. Furthermore, for exhaustive key searches on the 128-bit ARIA block cipher, a throughput of 1365.84 Gbps was achieved on the RTX 4080 GPU. Full article
(This article belongs to the Special Issue Network Security and Cryptography Applications)
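The table-merging trick can be modeled in plain Python; `byte_perm` below is a simplified software stand-in for CUDA's `__byte_perm` (selector values 0–7 only, no sign-replication modes), and the table entries are made-up values, not ARIA's S-boxes:

```python
# Hedged sketch: in table-based ciphers the four 32-bit lookup tables are
# often byte rotations of one another, so one table plus a byte
# permutation can stand in for all four, cutting shared memory 4x.

def byte_perm(lo, hi, selector):
    """Software model of CUDA __byte_perm: build a 32-bit word by picking
    4 bytes out of the 8-byte value hi:lo, one per selector nibble."""
    combined = (hi << 32) | lo
    out = 0
    for i in range(4):
        idx = (selector >> (4 * i)) & 0x7
        byte = (combined >> (8 * idx)) & 0xFF
        out |= byte << (8 * i)
    return out

T0 = [0x11223344, 0xAABBCCDD]   # unified table (illustrative values only)

def lookup_T1(x):
    # Rotate a T0 entry left by one byte via a byte permutation:
    # result bytes (3,2,1,0) come from source bytes (2,1,0,3).
    return byte_perm(T0[x], 0, 0x2103)

print(hex(lookup_T1(0)))  # 0x22334411
```

On a GPU this permutation is a single `__byte_perm` instruction, so the three derived tables are reconstructed at no extra table-memory cost, which is what lets the 1 KB unified table be replicated across banks to avoid conflicts.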

20 pages, 899 KB  
Article
Boundary-Aware Concurrent Queue: A Fast and Scalable Concurrent FIFO Queue on GPU Environments
by Md. Sabbir Hossain Polak, David A. Troendle and Byunghyun Jang
Appl. Sci. 2025, 15(4), 1834; https://doi.org/10.3390/app15041834 - 11 Feb 2025
Viewed by 2377
Abstract
This paper presents Boundary-Aware Concurrent Queue (BACQ), a high-performance queue designed for modern GPUs, which focuses on high concurrency in massively parallel environments. BACQ operates at the warp level, leveraging intra-warp locality to improve throughput. A key to BACQ’s design is its ability to replace conflicting accesses to shared data with independent accesses to private data. It uses a ticket-based system to ensure fair ordering of operations and supports infinite growth of the head and tail across its ring buffer. The leader thread of each warp coordinates enqueue and dequeue operations, broadcasting offsets for intra-warp synchronization. BACQ dynamically adjusts operation priorities based on the queue’s state, especially as it approaches boundary conditions such as overfilling the buffer. It also uses a virtual caching layer for intra-warp communication, reducing memory latency. Rigorous benchmarking results show that BACQ outperforms the BWD (Broker Queue Work Distributor), the fastest known GPU queue, by more than 2× while preserving FIFO semantics. The paper demonstrates BACQ’s superior performance through real-world empirical evaluations. Full article
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
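A sequential toy model of two of BACQ's ingredients, unbounded head/tail tickets over a ring buffer and boundary-aware full/empty handling, can be sketched as follows; all concurrency, warp coordination, and the virtual caching layer are omitted:

```python
# Illustrative sequential model (not BACQ itself): tickets grow without
# bound and fix FIFO order; the modulo maps them onto a finite buffer;
# the boundary checks are where a real queue adjusts operation priority.

class TicketRingQueue:
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.capacity = capacity
        self.head = 0                    # monotonically increasing tickets
        self.tail = 0

    def enqueue(self, item):
        if self.tail - self.head == self.capacity:
            return False                 # boundary condition: full
        self.buf[self.tail % self.capacity] = item
        self.tail += 1                   # ticket fixes the FIFO position
        return True

    def dequeue(self):
        if self.tail == self.head:
            return None                  # boundary condition: empty
        item = self.buf[self.head % self.capacity]
        self.head += 1
        return item

q = TicketRingQueue(2)
q.enqueue(1); q.enqueue(2)
assert not q.enqueue(3)                  # full
print(q.dequeue(), q.dequeue())          # 1 2
```

In the GPU version these counters are advanced with atomics, a warp's leader reserves a batch of tickets at once, and the broadcast offsets replace the per-item bookkeeping shown here.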

15 pages, 1106 KB  
Article
GPU@SAT DevKit: Empowering Edge Computing Development Onboard Satellites in the Space-IoT Era
by Gionata Benelli, Giovanni Todaro, Matteo Monopoli, Gianluca Giuffrida, Massimiliano Donati and Luca Fanucci
Electronics 2024, 13(19), 3928; https://doi.org/10.3390/electronics13193928 - 4 Oct 2024
Cited by 7 | Viewed by 3627
Abstract
Advancements in technology have driven the miniaturization of embedded systems, making them more cost-effective and energy-efficient for wireless applications. As a result, the number of connectable devices in Internet of Things (IoT) networks has increased significantly, creating the challenge of linking them effectively and economically. The space industry has long recognized this challenge and invested in satellite infrastructure for IoT networks, exploiting the potential of edge computing technologies. In this context, it is of critical importance to enhance the onboard computing capabilities of satellites and develop enabling technologies for their advancement. This is necessary to ensure that satellites are able to connect devices while reducing latency, bandwidth utilization, and development costs, and improving privacy and security measures. This paper presents the GPU@SAT DevKit: an ecosystem for testing a high-performance, general-purpose accelerator designed for FPGAs and suitable for edge computing tasks on satellites. This ecosystem provides a streamlined way to exploit GPGPU processing in space, enabling faster development times and more efficient resource use. Designed for FPGAs and tailored to edge computing tasks, the GPU@SAT accelerator mimics the parallel architecture of a GPU, allowing developers to leverage its capabilities while maintaining flexibility. Its compatibility with OpenCL simplifies the development process, enabling faster deployment of satellite-based applications. The DevKit was implemented and tested on a Zynq UltraScale+ MPSoC evaluation board from Xilinx, integrating the GPU@SAT IP core with the system’s embedded processor. A client/server approach is used to run applications, allowing users to easily configure and execute kernels through a simple XML document. 
This intuitive interface provides end-users with the ability to run and evaluate kernel performance and functionality without dealing with the underlying complexities of the accelerator itself. By making the GPU@SAT IP core more accessible, the DevKit significantly reduces development time and lowers the barrier to entry for satellite-based edge computing solutions. The DevKit was also compared with other onboard processing solutions, demonstrating similar performance. Full article

26 pages, 3378 KB  
Article
Parallel PSO for Efficient Neural Network Training Using GPGPU and Apache Spark in Edge Computing Sets
by Manuel I. Capel, Alberto Salguero-Hidalgo and Juan A. Holgado-Terriza
Algorithms 2024, 17(9), 378; https://doi.org/10.3390/a17090378 - 26 Aug 2024
Cited by 5 | Viewed by 3213
Abstract
The training phase of a deep learning neural network (DLNN) is a computationally demanding process, particularly for models comprising multiple layers of intermediate neurons. This paper presents a novel approach to accelerating DLNN training using the particle swarm optimisation (PSO) algorithm, which exploits the GPGPU architecture and the Apache Spark analytics engine for large-scale data processing tasks. PSO is a bio-inspired stochastic optimisation method whose objective is to iteratively enhance the solution to a (usually complex) problem by approximating a given objective. The expensive fitness evaluation and updating of particle positions can be supported more effectively by parallel processing. Nevertheless, the parallelisation of an efficient PSO is not a simple process due to the complexity of the computations performed on the swarm of particles and the iterative execution of the algorithm until a solution close to the objective with minimal error is achieved. In this study, two forms of parallelisation have been developed for the PSO algorithm, both of which are designed for execution in a distributed environment. The synchronous parallel PSO implementation guarantees consistency but may result in idle time due to global synchronisation. In contrast, the asynchronous parallel PSO approach reduces the necessity for global synchronisation, thereby enhancing execution time and making it more appropriate for large datasets and distributed environments such as Apache Spark. The two variants of PSO have been implemented with the objective of distributing the computational load supported by the algorithm across the different executor nodes of the Spark cluster to effectively achieve coarse-grained parallelism. The result is a significant performance improvement over current sequential variants of PSO. Full article
(This article belongs to the Collection Parallel and Distributed Computing: Algorithms and Applications)
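The synchronous variant's structure can be sketched in dependency-free Python; the hyperparameters are illustrative, and `fitness_map` marks the synchronisation point where a Spark- or GPU-backed parallel map would plug in:

```python
# Hedged sketch of synchronous PSO minimizing a test function. Fitness
# evaluation goes through a pluggable `fitness_map`; swapping in a
# distributed map gives the coarse-grained parallel version, at the cost
# of the global barrier each iteration implies.

import random

def pso(fitness, dim=2, particles=20, iters=200, fitness_map=map, seed=1):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_val = list(fitness_map(fitness, pbest))
    gbest = min(zip(pbest_val, pbest))[1][:]
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):        # inertia + cognitive + social terms
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (pbest[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
        vals = list(fitness_map(fitness, pos))   # global synchronisation
        for i, v in enumerate(vals):
            if v < pbest_val[i]:
                pbest_val[i], pbest[i] = v, pos[i][:]
        gbest = min(zip(pbest_val, pbest))[1][:]
    return gbest

sphere = lambda x: sum(v * v for v in x)
best = pso(sphere)
print(sphere(best))  # a small value near zero
```

The asynchronous variant described in the abstract removes the `list(...)` barrier, letting particles update as their fitness results arrive, which trades strict iteration consistency for less idle time.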

37 pages, 9513 KB  
Article
Parallel Implicit Solvers for 2D Numerical Models on Structured Meshes
by Yaoxin Zhang, Mohammad Z. Al-Hamdan and Xiaobo Chao
Mathematics 2024, 12(14), 2184; https://doi.org/10.3390/math12142184 - 12 Jul 2024
Cited by 1 | Viewed by 1576
Abstract
This paper presents the parallelization of two widely used implicit numerical solvers for the solution of partial differential equations on structured meshes, namely, the ADI (Alternating-Direction Implicit) solver for tridiagonal linear systems and the SIP (Strongly Implicit Procedure) solver for the penta-diagonal systems. Both solvers were parallelized using CUDA (Compute Unified Device Architecture) Fortran on GPGPUs (General-Purpose Graphics Processing Units). The parallel ADI solver (P-ADI) is based on the Parallel Cyclic Reduction (PCR) algorithm, while the parallel SIP solver (P-SIP) uses the wave front method (WF) following a diagonal line calculation strategy. To map the solution schemes onto the hierarchical block-threads framework of the CUDA on the GPU, the P-ADI solver adopted two mapping methods, one block thread with iterations (OBM-it) and multi-block threads (MBMs), while the P-SIP solver also used two mappings, one conventional mapping using effective WF lines (WF-e) with matrix coefficients and solution variables defined on the original computational mesh, and a newly proposed mapping using all WF mesh (WF-all), on which matrix coefficients and solution variables are defined. Both the P-ADI and the P-SIP have been integrated into a two-dimensional (2D) hydrodynamic model, the CCHE2D (Center of Computational Hydroscience and Engineering) model, developed by the National Center for Computational Hydroscience and Engineering at the University of Mississippi. This study for the first time compared these two parallel solvers and their efficiency using examples and applications in complex geometries, which can provide valuable guidance for future uses of these two parallel implicit solvers in computational fluid dynamics (CFD).
Both parallel solvers demonstrated higher efficiency than their serial counterparts on the CPU (Central Processing Unit): speedup ratios of 3.73–4.98 for flow simulations and 2.166–3.648 for sediment transport simulations. In general, the P-ADI solver is faster than, but not as stable as, the P-SIP solver; and for the P-SIP solver, the newly developed mapping method WF-all significantly improved on the conventional mapping method WF-e. Full article
(This article belongs to the Special Issue Mathematical Modeling and Numerical Simulation in Fluids)
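The PCR algorithm underlying the P-ADI solver reduces a tridiagonal system to decoupled single-unknown equations in O(log n) steps; the sketch below is a pure-Python rendering in which the inner loop over `i` is what becomes one GPU thread per row:

```python
# Hedged sketch of Parallel Cyclic Reduction for a tridiagonal system
# with sub/main/super diagonals a, b, c and right-hand side d
# (a[0] and c[-1] are unused). Each step halves the coupling distance.

def pcr_solve(a, b, c, d):
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = a[:], b[:], c[:], d[:]
        for i in range(n):                  # data-parallel on a GPU
            k1 = a[i] / b[i - stride] if i - stride >= 0 else 0.0
            k2 = c[i] / b[i + stride] if i + stride < n else 0.0
            na[i] = -a[i - stride] * k1 if i - stride >= 0 else 0.0
            nc[i] = -c[i + stride] * k2 if i + stride < n else 0.0
            nb[i] = b[i]
            nd[i] = d[i]
            if i - stride >= 0:             # fold in equation i - stride
                nb[i] -= c[i - stride] * k1
                nd[i] -= d[i - stride] * k1
            if i + stride < n:              # fold in equation i + stride
                nb[i] -= a[i + stride] * k2
                nd[i] -= d[i + stride] * k2
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    return [d[i] / b[i] for i in range(n)]  # fully decoupled

# 4x4 system: -x[i-1] + 4 x[i] - x[i+1] = rhs, solution x = [1, 2, 3, 4].
x = pcr_solve([0, -1, -1, -1], [4, 4, 4, 4], [-1, -1, -1, 0], [2, 4, 6, 13])
print([round(v, 6) for v in x])  # [1.0, 2.0, 3.0, 4.0]
```

Unlike the serial Thomas algorithm, every row update within a step is independent, which is what makes the method map naturally onto CUDA's block-thread hierarchy.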

14 pages, 2988 KB  
Article
Performance Investigation of the Conjunction Filter Methods and Enhancement of Computation Speed on Conjunction Assessment Analysis with CUDA Techniques
by Phasawee Saingyen, Sittiporn Channumsin, Suwat Sreesawet, Keerati Puttasuwan and Thanathip Limna
Aerospace 2023, 10(6), 543; https://doi.org/10.3390/aerospace10060543 - 7 Jun 2023
Cited by 1 | Viewed by 3050
Abstract
The growing number of space objects increases the potential risk of damage to satellites and of generating space debris after collisions. Conjunction assessment analysis is one of the keys to evaluating the collision risk of satellites, and satellite operators require the analyzed results as fast as possible to decide on and execute collision maneuver planning. However, the computation time to analyze the potential risk of all satellites is proportional to the number of space objects. Conjunction filters and parallel computing techniques can shorten the computation cost of conjunction analysis. Therefore, this paper investigates the performance (accuracy and computation speed) of the conjunction filters Smart Sieve, CSieve and CAOS-D (a combination of Smart Sieve and CSieve) in both the single-satellite (one vs. all) and all-space-objects (all vs. all) cases. All the screening filters are then implemented in an algorithm that executes general-purpose computing on graphics processing units (GPGPU) using NVIDIA's Compute Unified Device Architecture (CUDA). The analyzed results show the comparison of the accuracy of conjunction screening analysis and the computation times of each filter when implemented with the parallel computation techniques. Full article
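The cheapest screen in sieve-style filters, the apogee/perigee test, can be sketched as follows; the orbit values and threshold are illustrative, and the real filters add several further stages before any propagation:

```python
# Hedged sketch: if the orbital shells of two objects are separated by
# more than the screening distance, no close approach is possible and
# the pair is discarded before any expensive propagation.

def apogee_perigee_filter(orbit_a, orbit_b, threshold_km):
    """orbit = (perigee_km, apogee_km); True means 'keep for further checks'."""
    peri_a, apo_a = orbit_a
    peri_b, apo_b = orbit_b
    gap = max(peri_a - apo_b, peri_b - apo_a)   # shell separation
    return gap <= threshold_km

leo_sat = (500.0, 520.0)       # near-circular LEO orbit (illustrative)
debris = (480.0, 900.0)        # eccentric debris crossing that shell
geo_sat = (35780.0, 35790.0)   # geostationary: shells never meet

print(apogee_perigee_filter(leo_sat, debris, 10.0))   # True: must check
print(apogee_perigee_filter(leo_sat, geo_sat, 10.0))  # False: pruned
```

Because each candidate pair is tested independently, this screening stage is embarrassingly parallel, which is why mapping it to one CUDA thread per pair pays off in the all-vs-all case.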

17 pages, 2801 KB  
Article
Multi-Gbps LDPC Decoder on GPU Devices
by Jingxin Dai, Hang Yin, Yansong Lv, Weizhang Xu and Zhanxin Yang
Electronics 2022, 11(21), 3447; https://doi.org/10.3390/electronics11213447 - 25 Oct 2022
Cited by 7 | Viewed by 4361
Abstract
To meet the high throughput requirement of communication systems, the design of high-throughput low-density parity-check (LDPC) decoders has attracted significant attention. This paper proposes a high-throughput GPU-based LDPC decoder, aimed at large-scale data processing scenarios, which optimizes the decoder from the perspectives of decoding parallelism and data scheduling strategy, respectively. For decoding parallelism, the intra-codeword parallelism is fully exploited by combining the characteristics of the flooding-based decoding algorithm and the GPU programming model, and the inter-codeword parallelism is improved using single-instruction multiple-data (SIMD) instructions. For the data scheduling strategy, the utilization of off-chip memory is optimized to satisfy the demands of large-scale data processing. The experimental results demonstrate that the decoder achieves 10 Gbps throughput by incorporating the early termination mechanism on general-purpose GPU (GPGPU) devices and can also achieve high-throughput, high-power-efficiency performance on low-power embedded GPU (EGPU) devices. Compared with the state-of-the-art work, the proposed decoder achieved a 1.787× normalized throughput speedup at the same error-correction performance. Full article

20 pages, 6027 KB  
Article
3D Numerical Analysis Method for Simulating Collapse Behavior of RC Structures by Hybrid FEM/DEM
by Gyeongjo Min, Daisuke Fukuda and Sangho Cho
Appl. Sci. 2022, 12(6), 3073; https://doi.org/10.3390/app12063073 - 17 Mar 2022
Cited by 10 | Viewed by 3466
Abstract
Recent years have seen an increase in demand for the demolition of obsolete and potentially hazardous structures, including reinforced concrete (RC) structures, using blasting techniques. However, because the risk of failure is significantly higher when applying blasting to demolish RC structures than with mechanical dismantling, it is critical to achieve the optimal demolition design and conditions by taking into account the major factors affecting a structure's demolition. To this end, numerical analysis techniques have frequently been used to simulate the progressive failure resulting in the collapse of structures. In this study, the three-dimensional (3D) combined finite-discrete element method (FDEM), which is accelerated by a parallel computation technique incorporating a general-purpose graphics processing unit (GPGPU), was coupled with a one-dimensional (1D) reinforcing bar (rebar) model as a numerical simulation tool for simulating the process of RC structure demolition by blasting. Three-point bending tests on RC beams were simulated to validate the developed 3D FDEM code, including the calibration of 3D FDEM input parameters to simulate the concrete fracture in the RC beam accurately. The effect of the element size for the concrete part on the RC beam's fracture process was also discussed. Then, the developed 3D FDEM code was used to model the blasting demolition of a small-scale RC structure. The numerical simulation results for the progressive collapse of the RC structure were compared to the actual experimental results and found to be highly consistent. Full article
(This article belongs to the Special Issue Dynamics of Building Structures)

19 pages, 17299 KB  
Article
Evaluation of NVIDIA Xavier NX Platform for Real-Time Image Processing for Plasma Diagnostics
by Bartłomiej Jabłoński, Dariusz Makowski, Piotr Perek, Patryk Nowak vel Nowakowski, Aleix Puig Sitjes, Marcin Jakubowski, Yu Gao, Axel Winter and the W7-X Team
Energies 2022, 15(6), 2088; https://doi.org/10.3390/en15062088 - 12 Mar 2022
Cited by 15 | Viewed by 5764
Abstract
Machine protection is a core task of real-time image diagnostics aiming for steady-state operation in nuclear fusion devices. The paper evaluates the applicability of the newest low-power NVIDIA Jetson Xavier NX platform for image plasma diagnostics. This embedded NVIDIA Tegra System-on-a-Chip (SoC) integrates a Graphics Processing Unit (GPU) and Central Processing Unit (CPU) on a single chip. The hardware differences and features compared to the previous NVIDIA Jetson TX2 are signified. Implemented algorithms detect thermal events in real-time, utilising the high parallelism provided by the embedded General-Purpose computing on Graphics Processing Units (GPGPU). The performance and accuracy are evaluated on the experimental data from the Wendelstein 7-X (W7-X) stellarator. Strike-line and reflection events are primarily investigated, yet benchmarks for overload hotspots, surface layers and visualisation algorithms are also included. Their detection might allow for automating real-time risk evaluation incorporated in the divertor protection system in W7-X. For the first time, the paper demonstrates the feasibility of complex real-time image processing in nuclear fusion applications on low-power embedded devices. Moreover, GPU-accelerated reference processing pipelines yielding higher accuracy compared to the literature results are proposed, and remarkable performance improvement resulting from the upgrade to the Xavier NX platform is attained. Full article

30 pages, 21731 KB  
Article
Wave Propagation Studies in Numerical Wave Tanks with Weakly Compressible Smoothed Particle Hydrodynamics
by Samarpan Chakraborty and Balakumar Balachandran
J. Mar. Sci. Eng. 2021, 9(2), 233; https://doi.org/10.3390/jmse9020233 - 22 Feb 2021
Cited by 4 | Viewed by 4499
Abstract
Generation and propagation of waves in a numerical wave tank constructed using Weakly Compressible Smoothed Particle Hydrodynamics (WCSPH) are considered here. Numerical wave tank simulations have been carried out with implementations of different Wendland kernels in conjunction with different numerical dissipation schemes. The simulations were accelerated by using General-Purpose computing on Graphics Processing Units (GPGPU) to exploit their massively parallel nature and thus improve computational efficiency. Numerical experiments with short domains have been carried out to validate the dissipation schemes used. The wave tank experiments consist of piston-type wavemakers and appropriate passive absorption arrangements to facilitate comparisons with theoretical predictions. The different numerical wave tank experiments were compared on the basis of the hydrostatic pressure and wave surface elevations. The effect of numerical dissipation with the different kernel functions was also studied on the basis of energy analysis. Finally, the observations and results were used to arrive at the best possible numerical set-up for simulating waves over medium and long propagation distances, which can play a significant role in the study of extreme waves and energy localisations observed in oceans through such numerical wave tank simulations. Full article
(This article belongs to the Special Issue Dynamic Instability in Offshore Structures)
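As a concrete reference for one of the kernel choices compared above, here is a sketch of the 2-D Wendland C2 (quintic) smoothing kernel commonly used in WCSPH. The abstract does not specify which Wendland variants were tested, so the functional form and the 2-D normalisation constant 7/(4πh²) below are standard textbook assumptions, and the names are illustrative.

```python
import math

def wendland_c2_2d(r, h):
    """Kernel value W(r, h) for particle separation r and smoothing length h.

    Uses the 2-D Wendland C2 form W(q) = alpha * (1 - q/2)^4 * (2q + 1)
    with q = r/h and compact support of radius 2h.
    """
    q = r / h
    if q >= 2.0:
        return 0.0  # compact support: the kernel vanishes beyond 2h
    alpha = 7.0 / (4.0 * math.pi * h * h)  # 2-D normalisation constant
    return alpha * (1.0 - 0.5 * q) ** 4 * (2.0 * q + 1.0)

print(wendland_c2_2d(0.0, 0.1))   # peak value alpha = 7 / (4*pi*h^2)
print(wendland_c2_2d(0.25, 0.1))  # 0.0, outside the support radius 2h
```

Compact support is what makes such kernels attractive on GPUs: each particle only interacts with neighbours within 2h, so the per-particle work is bounded and maps well onto massively parallel hardware.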

26 pages, 8694 KB  
Article
Modules and Techniques for Motion Planning: An Industrial Perspective
by Stefano Quer and Luz Garcia
Sensors 2021, 21(2), 420; https://doi.org/10.3390/s21020420 - 9 Jan 2021
Viewed by 4284
Abstract
Research on autonomous cars has become one of the main research paths in the automotive industry, with many critical issues still to be explored in terms of the overall methodology and its practical applicability. In this paper, we present an industrial experience in which we build a complete autonomous driving system, from the sensor units to the car control equipment, and we describe its adoption and testing phase in the field. We report how we organize data fusion and map manipulation to represent the required reality. We focus on the communication and synchronization issues between the data-fusion device and the path planner, between the CPU and GPU units, and among the different CUDA kernels implementing the core local planner module. In these frameworks, we propose simple representation strategies and approximation techniques which incur almost no accuracy penalty while yielding large savings in memory occupation and memory transfer times. We show how we adopt a recent implementation on parallel many-core devices, such as CUDA-based GPGPUs, to reduce the computational burden of rapidly-exploring random trees used to explore the state space along a given reference path. We report on our use of the controller and the vehicle simulator. We run experiments on several real scenarios, and we report the paths generated with the different settings, together with their relative errors and computation times. We show that our approach can generate reasonable paths for a multitude of standard maneuvers in real time. Full article
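The rapidly-exploring random tree (RRT) at the heart of the local planner can be sketched sequentially. The GPGPU version evaluates many samples and nearest-neighbour distances in parallel across CUDA kernels; the hypothetical single-threaded code below only illustrates the basic extend step, with function names and the unit step size chosen for illustration.

```python
import math
import random

def nearest(tree, point):
    """Index of the tree node closest to `point` (Euclidean distance)."""
    return min(range(len(tree)),
               key=lambda i: math.dist(tree[i][0], point))

def extend(tree, sample, step=1.0):
    """Grow the tree by one fixed `step` from its nearest node towards `sample`."""
    i = nearest(tree, sample)
    x, y = tree[i][0]
    sx, sy = sample
    d = math.dist((x, y), sample)
    if d < 1e-9:
        return  # sample coincides with an existing node; nothing to add
    new = (x + step * (sx - x) / d, y + step * (sy - y) / d)
    tree.append((new, i))  # store each node with its parent index

random.seed(0)
tree = [((0.0, 0.0), None)]  # root node at the origin
for _ in range(50):
    extend(tree, (random.uniform(-10, 10), random.uniform(-10, 10)))
print(len(tree))  # 51 nodes after 50 successful extensions
```

The nearest-neighbour search is the natural candidate for GPU offloading: distances from the sample to all existing nodes are independent and can be reduced in parallel, which is the kind of computational burden the paper moves onto CUDA kernels.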
