Search Results (62)

Search Parameters:
Keywords = GPGPUs

18 pages, 826 KiB  
Article
Efficient GPU Parallel Implementation and Optimization of ARIA for Counter and Exhaustive Key-Search Modes
by Siwoo Eum, Minho Song, Sangwon Kim and Hwajeong Seo
Electronics 2025, 14(10), 2021; https://doi.org/10.3390/electronics14102021 - 15 May 2025
Viewed by 472
Abstract
This paper proposes an optimized shared memory access technique to enhance parallel processing performance and reduce memory accesses for the ARIA block cipher in GPU environments. To overcome the limited size of GPU shared memory, we merged ARIA’s four separate S-box tables into a single unified 32-bit table, effectively reducing the total memory usage from 4 KB to 1 KB. This allowed the consolidated table to be replicated 32 times within the limited shared memory, efficiently resolving the memory-bank conflict issues frequently encountered during parallel execution. Additionally, we utilized CUDA’s built-in function __byte_perm() to efficiently reconstruct the desired outputs from the reduced unified table, without imposing additional computational overhead. In exhaustive key-search scenarios, we implemented an on-the-fly key-expansion method, significantly reducing the memory usage per thread and enhancing parallel processing efficiency. Profiling in the RTX 3060 environment provided a detailed analysis of shared memory efficiency and of the performance degradation caused by bank conflicts. Experiments conducted on the RTX 3060 Mobile and RTX 4080 GPUs demonstrated significant performance improvements over conventional methods. Notably, the RTX 4080 GPU achieved a maximum throughput of 1532.42 Gbps in ARIA-CTR mode, clearly validating the effectiveness and practical applicability of the proposed optimization techniques. On the RTX 3060, the performance of 128-bit ARIA-CTR was improved by 2.34× compared to previous state-of-the-art implementations. Furthermore, for exhaustive key searches on the 128-bit ARIA block cipher, a throughput of 1365.84 Gbps was achieved on the RTX 4080 GPU.
(This article belongs to the Special Issue Network Security and Cryptography Applications)
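
The two tricks named in this abstract are compact enough to sketch. Below is a hedged CUDA illustration, not the authors' code: a hypothetical unified table (g_unified_sbox) is replicated across all 32 shared-memory banks so each lane always reads its own bank, and __byte_perm() reconstructs the byte-rotated variants that would otherwise need four tables.

```cuda
#include <cstdint>

__device__ __constant__ uint32_t g_unified_sbox[256]; // hypothetical unified table

__device__ inline uint32_t sbox_lookup(const uint32_t (*T)[32],
                                       uint8_t idx, int lane, int rot)
{
    uint32_t v = T[idx][lane];              // bank == lane, conflict-free
    // Byte rotations via __byte_perm: each selector nibble picks a source byte.
    switch (rot) {
    case 1:  return __byte_perm(v, v, 0x2103); // rotate left  8
    case 2:  return __byte_perm(v, v, 0x1032); // rotate left 16
    case 3:  return __byte_perm(v, v, 0x0321); // rotate left 24
    default: return v;
    }
}

__global__ void demo_round(const uint32_t *in, uint32_t *out)
{
    __shared__ uint32_t T[256][32];          // 32 KiB: the 1 KiB table x 32 copies
    int lane = threadIdx.x & 31;
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        for (int l = 0; l < 32; ++l)
            T[i][l] = g_unified_sbox[i];     // replicate into every bank
    __syncthreads();

    uint32_t x = in[blockIdx.x * blockDim.x + threadIdx.x];
    uint32_t y = sbox_lookup(T, x & 0xFF, lane, 0)
               ^ sbox_lookup(T, (x >> 8) & 0xFF, lane, 1)
               ^ sbox_lookup(T, (x >> 16) & 0xFF, lane, 2)
               ^ sbox_lookup(T, x >> 24, lane, 3);
    out[blockIdx.x * blockDim.x + threadIdx.x] = y;
}
```

Because lane k only ever touches column k, every shared-memory access maps to a distinct bank, which is the conflict-free layout the abstract describes.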

20 pages, 899 KiB  
Article
Boundary-Aware Concurrent Queue: A Fast and Scalable Concurrent FIFO Queue on GPU Environments
by Md. Sabbir Hossain Polak, David A. Troendle and Byunghyun Jang
Appl. Sci. 2025, 15(4), 1834; https://doi.org/10.3390/app15041834 - 11 Feb 2025
Viewed by 1105
Abstract
This paper presents Boundary-Aware Concurrent Queue (BACQ), a high-performance queue designed for modern GPUs, which focuses on high concurrency in massively parallel environments. BACQ operates at the warp level, leveraging intra-warp locality to improve throughput. A key to BACQ’s design is its ability to replace conflicting accesses to shared data with independent accesses to private data. It uses a ticket-based system to ensure fair ordering of operations and supports infinite growth of the head and tail across its ring buffer. The leader thread of each warp coordinates enqueue and dequeue operations, broadcasting offsets for intra-warp synchronization. BACQ dynamically adjusts operation priorities based on the queue’s state, especially as it approaches boundary conditions such as overfilling the buffer. It also uses a virtual caching layer for intra-warp communication, reducing memory latency. Rigorous benchmarking results show that BACQ outperforms the BWD (Broker Queue Work Distributor), the fastest known GPU queue, by more than 2× while preserving FIFO semantics. The paper demonstrates BACQ’s superior performance through real-world empirical evaluations.
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
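
As a rough illustration of the warp-level pattern described here (a leader lane reserves a batch of tickets with one atomic and broadcasts the base offset), the following CUDA sketch uses hypothetical names (RingQueue, warp_enqueue) and omits the boundary handling that gives BACQ its name.

```cuda
#include <cstdint>

struct RingQueue {
    int      *slots;
    unsigned  capacity;          // power of two
    unsigned long long *tail;    // monotonically growing ticket counter
};

__device__ void warp_enqueue(RingQueue q, int value)
{
    unsigned mask   = __activemask();            // lanes taking part
    int      lane   = threadIdx.x & 31;
    int      leader = __ffs(mask) - 1;           // lowest active lane leads
    int      count  = __popc(mask);

    unsigned long long base = 0;
    if (lane == leader)                          // one atomic per warp
        base = atomicAdd(q.tail, (unsigned long long)count);
    base = __shfl_sync(mask, base, leader);      // broadcast the offset

    // Rank of this lane among the active lanes of the warp.
    int rank = __popc(mask & ((1u << lane) - 1));
    q.slots[(base + rank) & (q.capacity - 1)] = value;
    // A full design must also guard against wrapping past unconsumed
    // entries (the "boundary" case BACQ handles); omitted here.
}
```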

17 pages, 2607 KiB  
Article
A Coarse- and Fine-Grained Co-Exploration Approach for Optimizing DNN Spatial Accelerators: Improving Speed and Performance
by Hao Sun, Junzhong Shen, Changwu Zhang and Hengzhu Liu
Electronics 2025, 14(3), 511; https://doi.org/10.3390/electronics14030511 - 27 Jan 2025
Viewed by 1043
Abstract
The rapid advancement of deep neural networks has significantly increased computational complexity and data volume demands. This trend is especially evident with the emergence of large language models, which have rendered traditional architectures such as CPUs and GPGPUs insufficient in meeting performance and energy efficiency requirements. Spatial accelerators present a promising solution by optimizing on-chip compute, storage, and communication resources. In exploring spatial accelerator design spaces, analytical model-based simulators and cycle-accurate simulators are commonly employed, each offering distinct advantages: high computational speed and superior simulation accuracy, respectively. However, the limited accuracy of analytical models and the slow simulation speed of cycle-accurate simulators impede the achievement of globally optimal solutions during design space exploration. Therefore, effectively leveraging the strengths of both simulator types while mitigating their inherent trade-offs is a critical challenge in designing customized spatial accelerators. In this work, we introduce a novel co-exploration methodology that integrates both coarse-grained and fine-grained approaches to navigate design and mapping spaces effectively. We utilize the rapid simulation capabilities of analytical models to perform coarse-grained global exploration, quickly eliminating designs and mapping configurations with inferior performance. Building on the results of this initial exploration, we employ cycle-accurate simulators to conduct fine-grained local exploration within the identified promising regions of the design and mapping spaces. This dual-phase approach aims to identify optimal hardware designs and dataflow mapping strategies that enhance performance and energy efficiency. The experimental results demonstrate that, compared to state-of-the-art methods, our approach reduces the number of exploration points by up to 99%, while achieving a 17.9% reduction in latency, a 2.5% decrease in energy consumption, and a 30.3% improvement in throughput.
(This article belongs to the Section Computer Science & Engineering)
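
The dual-phase idea reduces to a short host-side loop. The sketch below is an assumption-laden outline: DesignPoint, analytical_cost(), cycle_accurate_sim() and the 1% keep ratio are all hypothetical stand-ins for the paper's tools, with toy cost functions so the code compiles.

```cuda
// Host-side C++ (compiles as CUDA host code).
#include <algorithm>
#include <utility>
#include <vector>

struct DesignPoint { int pe_rows, pe_cols, buf_kb; /* mapping params ... */ };

// Toy stand-ins: the real tools are an analytical model (fast, approximate)
// and a cycle-accurate simulator (slow, accurate).
double analytical_cost(const DesignPoint &d) {
    return 1.0 / (d.pe_rows * d.pe_cols) + 0.001 * d.buf_kb;
}
double cycle_accurate_sim(const DesignPoint &d) {
    return analytical_cost(d) * 1.1;   // pretend-accurate latency
}

DesignPoint co_explore(const std::vector<DesignPoint> &space,
                       double keep_ratio = 0.01)
{
    // Phase 1: coarse global exploration with the analytical model.
    std::vector<std::pair<double, DesignPoint>> ranked;
    ranked.reserve(space.size());
    for (const auto &d : space) ranked.push_back({analytical_cost(d), d});
    std::sort(ranked.begin(), ranked.end(),
              [](const auto &a, const auto &b) { return a.first < b.first; });
    size_t keep = std::max<size_t>(1,
                  static_cast<size_t>(ranked.size() * keep_ratio));

    // Phase 2: fine local exploration of the surviving region only.
    DesignPoint best = ranked[0].second;
    double best_lat = cycle_accurate_sim(best);
    for (size_t i = 1; i < keep; ++i) {
        double lat = cycle_accurate_sim(ranked[i].second);
        if (lat < best_lat) { best_lat = lat; best = ranked[i].second; }
    }
    return best;
}
```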

40 pages, 1079 KiB  
Article
Context-Adaptable Deployment of FastSLAM 2.0 on Graphic Processing Unit with Unknown Data Association
by Jessica Giovagnola, Manuel Pegalajar Cuéllar and Diego Pedro Morales Santos
Appl. Sci. 2024, 14(23), 11466; https://doi.org/10.3390/app142311466 - 9 Dec 2024
Viewed by 1898
Abstract
Simultaneous Localization and Mapping (SLAM) algorithms are crucial for enabling agents to estimate their position in unknown environments. In autonomous navigation systems, these algorithms need to operate in real-time on devices with limited resources, emphasizing the importance of reducing complexity and ensuring efficient performance. While SLAM solutions aim at ensuring accurate and timely localization and mapping, one of their main limitations is their computational complexity. In this scenario, particle filter-based approaches such as FastSLAM 2.0 can significantly benefit from parallel programming due to their modular construction. The parallelization process involves identifying the parameters affecting the computational complexity in order to distribute the computation among single multiprocessors as efficiently as possible. However, the computational complexity of methodologies such as FastSLAM 2.0 can depend on multiple parameters whose values may, in turn, depend on each specific use case scenario (i.e., the context), leading to multiple possible parallelization designs. Furthermore, the features of the hardware architecture in use can significantly influence the performance in terms of latency. Therefore, the selection of the optimal parallelization modality still needs to be empirically determined. This may involve redesigning the parallel algorithm depending on the context and the hardware architecture. In this paper, we propose a CUDA-based adaptable design for FastSLAM 2.0 on GPU, in combination with an evaluation methodology that enables the assessment of the optimal parallelization modality based on the context and the hardware architecture without the need for the creation of separate designs. The proposed implementation includes the parallelization of all the functional blocks of the FastSLAM 2.0 pipeline. Additionally, we contribute a parallelized design of the data association step through the Joint Compatibility Branch and Bound (JCBB) method. Multiple resampling algorithms are also included to accommodate the needs of a wide variety of navigation scenarios.
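
The per-particle parallelism that makes FastSLAM 2.0 amenable to CUDA can be shown with a toy propagate-and-weight kernel. Everything below (the motion noise levels, the Gaussian likelihood, all names) is an illustrative placeholder, and the paper's pipeline additionally parallelises JCBB data association and resampling.

```cuda
#include <curand_kernel.h>

struct Particle { float x, y, theta, weight; };

// Toy observation model: Gaussian likelihood around an observed landmark.
__device__ float measurement_likelihood(const Particle &p, float2 obs)
{
    float dx = p.x - obs.x, dy = p.y - obs.y;
    return expf(-0.5f * (dx * dx + dy * dy));
}

__global__ void propagate_and_weight(Particle *ps, int n,
                                     float v, float w, float dt, float2 obs,
                                     unsigned long long seed)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState rng;                 // vary `seed` per filter step in practice
    curand_init(seed, i, 0, &rng);

    // Sample a noisy motion (velocity model); sigmas are illustrative.
    float nv = v + 0.05f * curand_normal(&rng);
    float nw = w + 0.02f * curand_normal(&rng);
    ps[i].theta += nw * dt;
    ps[i].x     += nv * dt * cosf(ps[i].theta);
    ps[i].y     += nv * dt * sinf(ps[i].theta);

    // Importance weight from the observation model.
    ps[i].weight *= measurement_likelihood(ps[i], obs);
}
```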

19 pages, 3909 KiB  
Article
GPU-Enabled Volume Renderer for Use with MATLAB
by Raphael Scheible
Digital 2024, 4(4), 990-1007; https://doi.org/10.3390/digital4040049 - 30 Nov 2024
Viewed by 1265
Abstract
Traditional tools, such as 3D Slicer, Fiji, and MATLAB®, often encounter limitations in rendering performance and data management as dataset sizes increase. This work presents a GPU-enabled volume renderer with a MATLAB® interface that addresses these issues. The proposed renderer uses flexible memory management and leverages the GPU texture-mapping features of NVIDIA devices. It transfers data between the CPU and the GPU only when the data have changed between renderings, and uses texture memory to exploit specific hardware capabilities of the GPU and improve rendering quality. A case study using the ViBE-Z zebrafish larval dataset demonstrated the renderer’s ability to produce visualizations while managing extensive data effectively within the MATLAB® environment. The renderer is available as open-source software.
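
The transfer-only-on-change policy amounts to a version-checked upload. A minimal sketch, assuming a host-side version counter and plain linear device memory (the actual renderer binds the volume to texture memory for hardware-assisted sampling); names and the error-handling-free style are illustrative:

```cuda
#include <cuda_runtime.h>
#include <cstddef>

struct VolumeCache {
    const float *host = nullptr;
    float *dev = nullptr;
    size_t bytes = 0;
    unsigned long uploaded_version = 0;

    // Returns the device copy, re-uploading only if the data changed.
    float *sync(const float *h, size_t b, unsigned long version) {
        if (!dev || b != bytes) {            // (re)allocate on size change
            if (dev) cudaFree(dev);
            cudaMalloc(&dev, b);
            bytes = b;
            uploaded_version = 0;            // force a fresh copy
        }
        if (version != uploaded_version) {   // skip the PCIe transfer otherwise
            cudaMemcpy(dev, h, b, cudaMemcpyHostToDevice);
            uploaded_version = version;
            host = h;
        }
        return dev;
    }
};
```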

15 pages, 1106 KiB  
Article
GPU@SAT DevKit: Empowering Edge Computing Development Onboard Satellites in the Space-IoT Era
by Gionata Benelli, Giovanni Todaro, Matteo Monopoli, Gianluca Giuffrida, Massimiliano Donati and Luca Fanucci
Electronics 2024, 13(19), 3928; https://doi.org/10.3390/electronics13193928 - 4 Oct 2024
Cited by 3 | Viewed by 2207
Abstract
Advancements in technology have driven the miniaturization of embedded systems, making them more cost-effective and energy-efficient for wireless applications. As a result, the number of connectable devices in Internet of Things (IoT) networks has increased significantly, creating the challenge of linking them effectively and economically. The space industry has long recognized this challenge and invested in satellite infrastructure for IoT networks, exploiting the potential of edge computing technologies. In this context, it is of critical importance to enhance the onboard computing capabilities of satellites and develop enabling technologies for their advancement. This is necessary to ensure that satellites are able to connect devices while reducing latency, bandwidth utilization, and development costs, and improving privacy and security measures. This paper presents the GPU@SAT DevKit: an ecosystem for testing a high-performance, general-purpose accelerator designed for FPGAs and suitable for edge computing tasks on satellites. This ecosystem provides a streamlined way to exploit GPGPU processing in space, enabling faster development times and more efficient resource use. The GPU@SAT accelerator mimics the parallel architecture of a GPU, allowing developers to leverage its capabilities while maintaining flexibility. Its compatibility with OpenCL simplifies the development process, enabling faster deployment of satellite-based applications. The DevKit was implemented and tested on a Zynq UltraScale+ MPSoC evaluation board from Xilinx, integrating the GPU@SAT IP core with the system’s embedded processor. A client/server approach is used to run applications, allowing users to easily configure and execute kernels through a simple XML document. This intuitive interface provides end-users with the ability to run and evaluate kernel performance and functionality without dealing with the underlying complexities of the accelerator itself. By making the GPU@SAT IP core more accessible, the DevKit significantly reduces development time and lowers the barrier to entry for satellite-based edge computing solutions. The DevKit was also compared with other onboard processing solutions, demonstrating similar performance.

26 pages, 3378 KiB  
Article
Parallel PSO for Efficient Neural Network Training Using GPGPU and Apache Spark in Edge Computing Sets
by Manuel I. Capel, Alberto Salguero-Hidalgo and Juan A. Holgado-Terriza
Algorithms 2024, 17(9), 378; https://doi.org/10.3390/a17090378 - 26 Aug 2024
Cited by 3 | Viewed by 2251
Abstract
The training phase of a deep learning neural network (DLNN) is a computationally demanding process, particularly for models comprising multiple layers of intermediate neurons. This paper presents a novel approach to accelerating DLNN training using the particle swarm optimisation (PSO) algorithm, which exploits the GPGPU architecture and the Apache Spark analytics engine for large-scale data processing tasks. PSO is a bio-inspired stochastic optimisation method that iteratively improves candidate solutions to a (usually complex) problem with respect to a given objective. The expensive fitness evaluation and updating of particle positions can be supported more effectively by parallel processing. Nevertheless, the parallelisation of an efficient PSO is not a simple process due to the complexity of the computations performed on the swarm of particles and the iterative execution of the algorithm until a solution close to the objective with minimal error is achieved. In this study, two forms of parallelisation have been developed for the PSO algorithm, both of which are designed for execution in a distributed environment. The synchronous parallel PSO implementation guarantees consistency but may result in idle time due to global synchronisation. In contrast, the asynchronous parallel PSO approach reduces the need for global synchronisation, thereby improving execution time and making it more appropriate for large datasets and distributed environments such as Apache Spark. The two variants of PSO have been implemented with the objective of distributing the computational load supported by the algorithm across the different executor nodes of the Spark cluster to effectively achieve coarse-grained parallelism. The result is a significant performance improvement over current sequential variants of PSO.
(This article belongs to the Collection Parallel and Distributed Computing: Algorithms and Applications)
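
The per-particle update both variants distribute is the classical PSO step. A CUDA sketch with conventional coefficient values (w = 0.729, c1 = c2 ≈ 1.494), not the paper's tuning or its Spark-level partitioning; the synchronous variant would put a barrier (or a new kernel launch) after every generation, while the asynchronous one lets workers proceed with a possibly stale global best:

```cuda
#include <curand_kernel.h>

__global__ void pso_step(float *x, float *v, const float *pbest,
                         const float *gbest, int dim, int n_particles,
                         unsigned long long seed, int iter)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= n_particles) return;

    curandState rng;                         // fresh offset per generation
    curand_init(seed, p, (unsigned long long)iter * 2 * dim, &rng);
    const float w = 0.729f, c1 = 1.49445f, c2 = 1.49445f;

    for (int d = 0; d < dim; ++d) {
        int i = p * dim + d;
        float r1 = curand_uniform(&rng), r2 = curand_uniform(&rng);
        v[i] = w * v[i]
             + c1 * r1 * (pbest[i] - x[i])   // cognitive pull
             + c2 * r2 * (gbest[d] - x[i]);  // social pull
        x[i] += v[i];
    }
}
```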

37 pages, 9513 KiB  
Article
Parallel Implicit Solvers for 2D Numerical Models on Structured Meshes
by Yaoxin Zhang, Mohammad Z. Al-Hamdan and Xiaobo Chao
Mathematics 2024, 12(14), 2184; https://doi.org/10.3390/math12142184 - 12 Jul 2024
Cited by 1 | Viewed by 1103
Abstract
This paper presents the parallelization of two widely used implicit numerical solvers for the solution of partial differential equations on structured meshes, namely, the ADI (Alternating-Direction Implicit) solver for tridiagonal linear systems and the SIP (Strongly Implicit Procedure) solver for penta-diagonal systems. Both solvers were parallelized using CUDA (Compute Unified Device Architecture) Fortran on GPGPUs (General-Purpose Graphics Processing Units). The parallel ADI solver (P-ADI) is based on the Parallel Cyclic Reduction (PCR) algorithm, while the parallel SIP solver (P-SIP) uses the wave front (WF) method, following a diagonal-line calculation strategy. To map the solution schemes onto the hierarchical block-threads framework of CUDA on the GPU, the P-ADI solver adopted two mapping methods, one block thread with iterations (OBM-it) and multi-block threads (MBMs), while the P-SIP solver also used two mappings: a conventional mapping using effective WF lines (WF-e), with matrix coefficients and solution variables defined on the original computational mesh, and a newly proposed mapping using the whole WF mesh (WF-all), on which matrix coefficients and solution variables are defined. Both the P-ADI and the P-SIP solvers have been integrated into a two-dimensional (2D) hydrodynamic model, the CCHE2D (Center of Computational Hydroscience and Engineering) model, developed by the National Center for Computational Hydroscience and Engineering at the University of Mississippi. This study for the first time compared these two parallel solvers and their efficiency using examples and applications in complex geometries, which can provide valuable guidance for future uses of these two parallel implicit solvers in computational fluid dynamics (CFD). Both parallel solvers demonstrated higher efficiency than their serial counterparts on the CPU (Central Processing Unit): speedup ratios of 3.73–4.98 for flow simulations and 2.166–3.648 for sediment transport simulations. In general, the P-ADI solver is faster than, but not as stable as, the P-SIP solver; for the P-SIP solver, the newly developed WF-all mapping significantly improved on the conventional WF-e mapping.
(This article belongs to the Special Issue Mathematical Modeling and Numerical Simulation in Fluids)
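
For reference, one PCR sweep of the kind P-ADI maps onto CUDA threads looks as follows. This is a single-block teaching sketch in CUDA C++ for one power-of-two-sized tridiagonal system a[i]x[i-1] + b[i]x[i] + c[i]x[i+1] = d[i]; the paper's CUDA Fortran solver batches many such systems under its OBM-it and MBM mappings.

```cuda
__global__ void pcr_tridiag(const float *a, const float *b, const float *c,
                            const float *d, float *x, int n)
{
    extern __shared__ float s[];             // 4*n floats: a, b, c, d
    float *sa = s, *sb = s + n, *sc = s + 2 * n, *sd = s + 3 * n;
    int i = threadIdx.x;                     // one thread per equation
    sa[i] = a[i]; sb[i] = b[i]; sc[i] = c[i]; sd[i] = d[i];
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2) {
        // Fold the neighbours at distance `stride` into equation i.
        float alpha = (i - stride >= 0) ? -sa[i] / sb[i - stride] : 0.0f;
        float beta  = (i + stride <  n) ? -sc[i] / sb[i + stride] : 0.0f;
        float nb = sb[i]
                 + ((i - stride >= 0) ? alpha * sc[i - stride] : 0.0f)
                 + ((i + stride <  n) ? beta  * sa[i + stride] : 0.0f);
        float nd = sd[i]
                 + ((i - stride >= 0) ? alpha * sd[i - stride] : 0.0f)
                 + ((i + stride <  n) ? beta  * sd[i + stride] : 0.0f);
        float na = (i - stride >= 0) ? alpha * sa[i - stride] : 0.0f;
        float nc = (i + stride <  n) ? beta  * sc[i + stride] : 0.0f;
        __syncthreads();                     // everyone has read old values
        sa[i] = na; sb[i] = nb; sc[i] = nc; sd[i] = nd;
        __syncthreads();                     // everyone has written new values
    }
    x[i] = sd[i] / sb[i];                    // equations are now decoupled
}
```

A launch such as pcr_tridiag<<<1, n, 4 * n * sizeof(float)>>>(a, b, c, d, x, n) solves one system of size n; after log2(n) doubling steps each equation no longer references its neighbours.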

19 pages, 38481 KiB  
Article
Dispersion and Radiation Modelling in ESTE System Using Urban LPM
by Ľudovít Lipták, Peter Čarný, Michal Marčišovský, Mária Marčišovská, Miroslav Chylý and Eva Fojciková
Atmosphere 2023, 14(7), 1077; https://doi.org/10.3390/atmos14071077 - 26 Jun 2023
Cited by 1 | Viewed by 1572
Abstract
In cases of accidental or deliberate incidents involving a harmful agent in urban areas, a detailed modelling approach is required to include the building shapes and spatial locations. Simultaneously, when applied to crisis management, a simulation tool must meet strict time constraints. This work presents a Lagrangian particle model (LPM) for computing atmospheric dispersion. The model is implemented in the nuclear decision support system ESTE CBRN, a software tool developed to calculate the atmospheric dispersion of airborne hazardous materials and radiological impacts in built-up areas. The implemented LPM is based on Thomson’s solution of the nonstationary, three-dimensional Langevin equation model for turbulent diffusion. The simulation results are successfully analyzed by testing compatibility with Briggs sigma functions in the case of continuous release. The implemented LPM is compared with the Joint Urban 2003 Street Canyon Experiment for instantaneous puff releases. We compare the maximum concentrations and peak times measured during two intensive operational periods. The modeled peak times are mostly 10–20% smaller than the measured ones. Except for a few detector locations, the maximum concentrations are reproduced consistently. Finally, we demonstrate, via calculations on single computers utilizing general-purpose computing on graphics processing units (GPGPU), that the implementation is well suited for actual emergency response, since the computational times (including dispersion and dose calculation) for an acceptable level of result accuracy are similar to the duration of the modeled event itself.
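
The kernel at the core of such a GPGPU dispersion code is one Langevin step per particle per thread. A sketch in simplified homogeneous-turbulence form, with all names and constants illustrative; ESTE's urban LPM uses Thomson's inhomogeneous 3D formulation with building-resolved flow fields:

```cuda
#include <curand_kernel.h>

struct Puff { float x, y, z, u, v, w; };   // position + velocity fluctuation

__global__ void lpm_step(Puff *p, int n, float3 wind, float dt,
                         float t_lagr, float sigma,
                         unsigned long long seed, int step)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    curandState rng;                        // advance the stream each step
    curand_init(seed, i, 3ULL * step, &rng);

    // Langevin update: du = -(u / T_L) dt + sqrt(2 sigma^2 dt / T_L) dW
    float decay = dt / t_lagr;
    float diff  = sigma * sqrtf(2.0f * decay);
    p[i].u += -p[i].u * decay + diff * curand_normal(&rng);
    p[i].v += -p[i].v * decay + diff * curand_normal(&rng);
    p[i].w += -p[i].w * decay + diff * curand_normal(&rng);

    // Advect by mean wind plus fluctuation.
    p[i].x += (wind.x + p[i].u) * dt;
    p[i].y += (wind.y + p[i].v) * dt;
    p[i].z += (wind.z + p[i].w) * dt;
}
```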

14 pages, 2988 KiB  
Article
Performance Investigation of the Conjunction Filter Methods and Enhancement of Computation Speed on Conjunction Assessment Analysis with CUDA Techniques
by Phasawee Saingyen, Sittiporn Channumsin, Suwat Sreesawet, Keerati Puttasuwan and Thanathip Limna
Aerospace 2023, 10(6), 543; https://doi.org/10.3390/aerospace10060543 - 7 Jun 2023
Cited by 1 | Viewed by 2180
Abstract
The growing number of space objects leads to increased risk of damage to satellites and of space debris generated by collisions. Conjunction assessment analysis is one of the keys to evaluating the collision risk of satellites, and satellite operators require the analyzed results as fast as possible to decide on and execute collision avoidance maneuvers. However, the computation time to analyze the potential risk of all satellites is proportional to the number of space objects. Conjunction filters and parallel computing techniques can shorten the computation cost of conjunction analysis. This paper therefore investigates the performance (accuracy and computation speed) of the conjunction filters Smart Sieve, CSieve and CAOS-D (a combination of Smart Sieve and CSieve) in both single-satellite (one vs. all) and all-space-objects (all vs. all) cases. All the screening filters are then implemented in an algorithm that performs general-purpose computing on graphics processing units (GPGPU) using NVIDIA’s Compute Unified Device Architecture (CUDA). The results compare the accuracy of the conjunction screening analysis and the computation times of each filter when implemented with the parallel computation techniques.
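
Screening sieves of this family start from cheap geometric tests that map naturally onto one GPU thread per candidate pair. The sketch below shows only a generic perigee/apogee shell-overlap pre-filter under assumed data structures; the actual Smart Sieve and CSieve chains apply several further time- and distance-based sieves before any close-approach propagation:

```cuda
struct OrbitShell { float perigee_km, apogee_km; };

__global__ void shell_filter(const OrbitShell *objs, int n, int target,
                             float threshold_km, int *candidates,
                             int *n_candidates)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pair
    if (j >= n || j == target) return;

    const OrbitShell t = objs[target], o = objs[j];
    // The radial shells [perigee, apogee] must come within `threshold_km`
    // of overlapping for a conjunction to be geometrically possible.
    float gap = fmaxf(t.perigee_km, o.perigee_km)
              - fminf(t.apogee_km,  o.apogee_km);
    if (gap <= threshold_km)
        candidates[atomicAdd(n_candidates, 1)] = j;  // survives the sieve
}
```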

26 pages, 2988 KiB  
Article
Hybrid Lattice-Boltzmann-Potential Flow Simulations of Turbulent Flow around Submerged Structures
by Christopher M. O’Reilly, Stephan T. Grilli, Christian F. Janßen, Jason M. Dahl and Jeffrey C. Harris
J. Mar. Sci. Eng. 2022, 10(11), 1651; https://doi.org/10.3390/jmse10111651 - 3 Nov 2022
Cited by 2 | Viewed by 2509
Abstract
We report on the development and validation of a 3D hybrid Lattice Boltzmann Model (LBM), with Large Eddy Simulation (LES), to simulate the interactions of incompressible turbulent flows with ocean structures. The LBM is based on a perturbation method, in which the velocity and pressure are expressed as the sum of an inviscid flow and a viscous perturbation. The far- to near-field flow is assumed to be inviscid and represented by potential flow theory, which can be efficiently modeled with a Boundary Element Method (BEM). The near-field perturbation flow around structures is modeled by the Navier–Stokes (NS) equations, based on a Lattice Boltzmann Method (LBM) with a Large Eddy Simulation (LES) of the turbulence. In the paper, we present the hybrid model formulation, in which a modified LBM collision operator is introduced to simulate the viscous perturbation flow, resulting in a novel perturbation LBM (pLBM) approach. The pLBM is then extended for the simulation of turbulence using the LES and a wall model to represent the viscous/turbulent sub-layer near solid boundaries. The hybrid model is first validated by simulating turbulent flows over a flat plate, for moderate to large Reynolds number values, Re ∈ [3.7×10⁴; 1.2×10⁶]; the plate friction coefficient and near-field turbulence properties computed with the model are found to agree well with both experiments and direct NS simulations. We then simulate the flow past a NACA-0012 foil using a regular LBM-LES and the new hybrid pLBM-LES models with the wall model, for Re = 1.44×10⁶. A good agreement is found for the computed lift and drag forces, and pressure distribution on the foil, with experiments and results of other numerical methods. Results obtained with the pLBM model are either nearly identical or slightly improved, relative to those of the standard LBM, but are obtained in a significantly smaller computational domain and hence at a much reduced computational cost, thus demonstrating the benefits of the new hybrid approach.
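
To make the decomposition concrete, the sketch below shows a plain BGK collision (in 2D D2Q9 form for brevity) where the equilibrium is evaluated at the sum of an inviscid velocity field and a perturbation. The paper's operator is a modified 3D collision acting on perturbation distributions with LES, so this conveys only the structural idea:

```cuda
__constant__ float w9[9] = { 4.f/9,
    1.f/9, 1.f/9, 1.f/9, 1.f/9,
    1.f/36, 1.f/36, 1.f/36, 1.f/36 };
__constant__ int cx9[9] = { 0, 1, 0, -1, 0, 1, -1, -1, 1 };
__constant__ int cy9[9] = { 0, 0, 1, 0, -1, 1, 1, -1, -1 };

__global__ void bgk_collide(float *f, const float2 *u_inviscid,
                            const float2 *u_pert, const float *rho,
                            int n_cells, float tau)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= n_cells) return;

    // Total velocity = potential-flow field + near-field perturbation.
    float ux = u_inviscid[c].x + u_pert[c].x;
    float uy = u_inviscid[c].y + u_pert[c].y;
    float usq = ux * ux + uy * uy;

    for (int q = 0; q < 9; ++q) {
        float cu = 3.0f * (cx9[q] * ux + cy9[q] * uy);
        float feq = w9[q] * rho[c]
                  * (1.0f + cu + 0.5f * cu * cu - 1.5f * usq);
        int idx = q * n_cells + c;           // SoA layout
        f[idx] -= (f[idx] - feq) / tau;      // BGK relaxation toward feq
    }
}
```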

17 pages, 18540 KiB  
Article
Robust Identification and Segmentation of the Outer Skin Layers in Volumetric Fingerprint Data
by Alexander Kirfel, Tobias Scheer, Norbert Jung and Christoph Busch
Sensors 2022, 22(21), 8229; https://doi.org/10.3390/s22218229 - 27 Oct 2022
Cited by 6 | Viewed by 2813
Abstract
Despite the long history of fingerprint biometrics and its use to authenticate individuals, there are still some unsolved challenges with fingerprint acquisition and presentation attack detection (PAD). Currently available commercial fingerprint capture devices struggle with non-ideal skin conditions, including soft skin in infants. They are also susceptible to presentation attacks, which limits their applicability in unsupervised scenarios such as border control. Optical coherence tomography (OCT) could be a promising solution to these problems. In this work, we propose a digital signal processing chain for segmenting two complementary fingerprints from the same OCT fingertip scan: One fingerprint is captured as usual from the epidermis (“outer fingerprint”), whereas the other is taken from inside the skin, at the junction between the epidermis and the underlying dermis (“inner fingerprint”). The resulting 3D fingerprints are then converted to a conventional 2D grayscale representation from which minutiae points can be extracted using existing methods. Our approach is device-independent and has been proven to work with two different time domain OCT scanners. Using efficient GPGPU computing, it took less than a second to process an entire gigabyte of OCT data. To validate the results, we captured OCT fingerprints of 130 individual fingers and compared them with conventional 2D fingerprints of the same fingers. We found that both the outer and inner OCT fingerprints were backward compatible with conventional 2D fingerprints, with the inner fingerprint generally being less damaged and, therefore, more reliable.
(This article belongs to the Special Issue Biometric Technologies Based on Optical Coherence Tomography (OCT))

17 pages, 2801 KiB  
Article
Multi-Gbps LDPC Decoder on GPU Devices
by Jingxin Dai, Hang Yin, Yansong Lv, Weizhang Xu and Zhanxin Yang
Electronics 2022, 11(21), 3447; https://doi.org/10.3390/electronics11213447 - 25 Oct 2022
Cited by 4 | Viewed by 3062
Abstract
To meet the high throughput requirement of communication systems, the design of high-throughput low-density parity-check (LDPC) decoders has attracted significant attention. This paper proposes a high-throughput GPU-based LDPC decoder, aimed at large-scale data processing scenarios, which optimizes the decoder from the perspectives of decoding parallelism and data scheduling strategy. For decoding parallelism, the intra-codeword parallelism is fully exploited by combining the characteristics of the flooding-based decoding algorithm and the GPU programming model, and the inter-codeword parallelism is improved using single-instruction multiple-data (SIMD) instructions. For the data scheduling strategy, the utilization of off-chip memory is optimized to satisfy the demands of large-scale data processing. The experimental results demonstrate that the decoder achieves 10 Gbps throughput by incorporating the early termination mechanism on general-purpose GPU (GPGPU) devices and can also achieve high-throughput, high-power-efficiency performance on low-power embedded GPU (EGPU) devices. Compared with the state-of-the-art work, the proposed decoder achieved a 1.787× normalized throughput speedup at the same error-correction performance.
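
The flooding-schedule kernel such decoders parallelise per check node can be sketched as a two-pass min-sum update. Connectivity is assumed in CSR form, and the paper's SIMD inter-codeword packing and off-chip scheduling optimisations are omitted; all names are illustrative:

```cuda
#include <math.h>

__global__ void min_sum_check_update(const float *v2c,   // incoming messages
                                     float *c2v,         // outgoing messages
                                     const int *row_ptr, // CSR edge offsets
                                     int n_checks)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;       // one check node
    if (m >= n_checks) return;

    int beg = row_ptr[m], end = row_ptr[m + 1];
    float min1 = INFINITY, min2 = INFINITY;
    int   argmin = -1;
    unsigned sign = 0;                    // parity of negative inputs

    for (int e = beg; e < end; ++e) {     // pass 1: two smallest magnitudes
        float v = v2c[e], a = fabsf(v);
        sign ^= (v < 0.0f);
        if (a < min1) { min2 = min1; min1 = a; argmin = e; }
        else if (a < min2) { min2 = a; }
    }
    for (int e = beg; e < end; ++e) {     // pass 2: extrinsic messages
        float mag = (e == argmin) ? min2 : min1;   // exclude own magnitude
        unsigned s = sign ^ (v2c[e] < 0.0f);       // exclude own sign
        c2v[e] = s ? -mag : mag;
    }
}
```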

20 pages, 6027 KiB  
Article
3D Numerical Analysis Method for Simulating Collapse Behavior of RC Structures by Hybrid FEM/DEM
by Gyeongjo Min, Daisuke Fukuda and Sangho Cho
Appl. Sci. 2022, 12(6), 3073; https://doi.org/10.3390/app12063073 - 17 Mar 2022
Cited by 8 | Viewed by 2834
Abstract
Recent years have seen an increase in demand for the demolition of obsolete and potentially hazardous structures, including reinforced concrete (RC) structures, using blasting techniques. However, because the risk of failure is significantly higher when demolishing RC structures by blasting than by mechanical dismantling, it is critical to achieve an optimal demolition design and blasting conditions by taking into account the major factors affecting a structure’s demolition. To this end, numerical analysis techniques have frequently been used to simulate the progressive failure resulting in the collapse of structures. In this study, the three-dimensional (3D) combined finite-discrete element method (FDEM), accelerated by a parallel computation technique incorporating a general-purpose graphics processing unit (GPGPU), was coupled with a one-dimensional (1D) reinforcing bar (rebar) model as a numerical simulation tool for the process of RC structure demolition by blasting. Three-point bending tests on RC beams were simulated to validate the developed 3D FDEM code, including the calibration of 3D FDEM input parameters to accurately simulate the concrete fracture in the RC beam. The effect of the element size of the concrete part on the RC beam’s fracture process was also discussed. Then, the developed 3D FDEM code was used to model the blasting demolition of a small-scale RC structure. The numerical simulation results for the progressive collapse of the RC structure were compared to the actual experimental results and found to be highly consistent.
(This article belongs to the Special Issue Dynamics of Building Structures)

19 pages, 17299 KiB  
Article
Evaluation of NVIDIA Xavier NX Platform for Real-Time Image Processing for Plasma Diagnostics
by Bartłomiej Jabłoński, Dariusz Makowski, Piotr Perek, Patryk Nowak vel Nowakowski, Aleix Puig Sitjes, Marcin Jakubowski, Yu Gao, Axel Winter and The W7-X Team
Energies 2022, 15(6), 2088; https://doi.org/10.3390/en15062088 - 12 Mar 2022
Cited by 13 | Viewed by 4562
Abstract
Machine protection is a core task of real-time image diagnostics aiming for steady-state operation in nuclear fusion devices. The paper evaluates the applicability of the newest low-power NVIDIA Jetson Xavier NX platform for image plasma diagnostics. This embedded NVIDIA Tegra System-on-a-Chip (SoC) integrates a Graphics Processing Unit (GPU) and a Central Processing Unit (CPU) on a single chip. The hardware differences and features compared to the previous NVIDIA Jetson TX2 are highlighted. The implemented algorithms detect thermal events in real-time, utilising the high parallelism provided by embedded General-Purpose computing on Graphics Processing Units (GPGPU). The performance and accuracy are evaluated on experimental data from the Wendelstein 7-X (W7-X) stellarator. Strike-line and reflection events are primarily investigated, yet benchmarks for overload hotspots, surface layers and visualisation algorithms are also included. Their detection might allow for automating the real-time risk evaluation incorporated in the divertor protection system in W7-X. For the first time, the paper demonstrates the feasibility of complex real-time image processing in nuclear fusion applications on low-power embedded devices. Moreover, GPU-accelerated reference processing pipelines yielding higher accuracy compared to the literature results are proposed, and a remarkable performance improvement resulting from the upgrade to the Xavier NX platform is attained.
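
As a flavour of the per-pixel parallelism such diagnostics exploit on an embedded GPGPU, here is a toy CUDA kernel that flags overload-hotspot pixels by a temperature threshold and counts them atomically. The W7-X pipelines (strike-line segmentation, reflection discrimination, tracking) are substantially more elaborate; names and the threshold test are illustrative only:

```cuda
__global__ void hotspot_mask(const float *temp_K, unsigned char *mask,
                             int n_pixels, float limit_K,
                             unsigned int *n_hot)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pixel
    if (i >= n_pixels) return;
    bool hot = temp_K[i] > limit_K;                  // per-pixel overload test
    mask[i] = hot ? 255 : 0;
    if (hot) atomicAdd(n_hot, 1u);                   // global hotspot count
}
```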