Search Results (68)

Search Parameters:
Keywords = general purpose GPU

31 pages, 2573 KB  
Article
Hardware Design of DRAM Memory Prefetching Engine for General-Purpose GPUs
by Freddy Gabbay, Benjamin Salomon, Idan Golan and Dolev Shema
Technologies 2025, 13(10), 455; https://doi.org/10.3390/technologies13100455 - 8 Oct 2025
Viewed by 440
Abstract
General-purpose graphics processing units (GPGPUs) face significant performance limitations due to memory access latencies, particularly when traditional memory hierarchies and thread-switching mechanisms prove insufficient for complex access patterns in data-intensive applications such as machine learning (ML) and scientific computing. This paper presents a novel hardware design for a memory prefetching subsystem targeted at DDR (Double Data Rate) memory in GPGPU architectures. The proposed prefetching subsystem features a modular architecture comprising multiple parallel prefetching engines, each handling distinct memory address ranges with dedicated data buffers and adaptive stride detection algorithms that dynamically identify recurring memory access patterns. The design incorporates robust system integration features, including context flushing, watchdog timers, and flexible configuration interfaces, for runtime optimization. Comprehensive experimental validation using real-world workloads examined critical design parameters, including block sizes, prefetch outstanding limits, and throttling rates, across diverse memory access patterns. Results demonstrate significant performance improvements with average memory access latency reductions of up to 82% compared to no-prefetch baselines, and speedups in the range of 1.240–1.794. The proposed prefetching subsystem successfully enhances memory hierarchy efficiency and provides practical design guidelines for deployment in production GPGPU systems, establishing clear parameter optimization strategies for different workload characteristics. Full article
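
The adaptive stride detection at the heart of each prefetching engine can be pictured with a small behavioral model. The Python sketch below is illustrative only and is not the paper's RTL design: the confirmation threshold, prefetch degree, and block size are assumed parameters.

```python
# Minimal behavioral sketch of an adaptive stride prefetch engine.
# Illustrative only: thresholds, degree, and structure are assumptions,
# not the paper's hardware design.

class StridePrefetchEngine:
    def __init__(self, confirm_threshold=2, prefetch_degree=4, block_size=128):
        self.last_addr = None
        self.last_stride = None
        self.confidence = 0
        self.confirm_threshold = confirm_threshold  # stride repeats needed before issuing
        self.prefetch_degree = prefetch_degree      # blocks issued per confirmed stride
        self.block_size = block_size                # bytes per prefetched block

    def observe(self, addr):
        """Feed one demand-miss address; return a list of prefetch addresses."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                self.confidence += 1
            else:
                self.confidence = 0
                self.last_stride = stride
            if self.confidence >= self.confirm_threshold:
                prefetches = [addr + self.last_stride * (i + 1)
                              for i in range(self.prefetch_degree)]
        self.last_addr = addr
        return prefetches

# Example: a regular stride of one 128-byte block is confirmed after two repeats.
engine = StridePrefetchEngine()
for a in range(0x1000, 0x1000 + 10 * 128, 128):
    issued = engine.observe(a)
    if issued:
        print(hex(a), "->", [hex(p) for p in issued])
```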

29 pages, 5375 KB  
Article
Application of PINNs to Define Roughness Coefficients for Channel Flow Problems
by Sergei Strijhak, Konstantin Koshelev and Andrei Bolotov
Water 2025, 17(18), 2731; https://doi.org/10.3390/w17182731 - 16 Sep 2025
Viewed by 921
Abstract
This paper considers the possibility of using Physics-Informed Neural Networks (PINNs) to study the hydrological processes of model river sections. A fully connected neural network is used for the approximation of the Saint-Venant equations in both 1D and 2D formulations. This study addresses the problem of determining the velocities, water level, discharge, and area of water sections in 1D cases, as well as the inverse problem of calculating the roughness coefficient. To evaluate the applicability of PINNs for modeling flows in channels, it seems reasonable to start with cases where exact reference solutions are available. For the 1D case, we examined a rectangular channel with a given length, width, and constant roughness coefficient. An analytical solution is obtained to calculate the discharge and area of the water section. Two-dimensional model examples were also examined. The synthetic data for PINN training, comprising the velocity field and water level, were generated with the Delft3D code; the Delft3D calculation took about 2 min. The influence of PINN hyperparameters on the prediction quality was studied. Finally, the absolute error value was assessed. The prediction error of the roughness coefficient n value in the 2D case for the inverse problem did not exceed 10%. A typical training process took from 2.5 to 3.5 h and prediction took 5–10 s using the developed PINN models on a server with an NVIDIA A100 40 GB GPU. Full article
(This article belongs to the Special Issue Application of Machine Learning in Hydrologic Sciences)
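
As a rough illustration of how such a PINN can treat the roughness coefficient as a trainable parameter, the sketch below builds a small fully connected network for the 1D Saint-Venant equations with PyTorch autograd. This is a minimal sketch, not the authors' implementation: the rectangular-channel geometry (width b, bed slope S0), the conservative momentum form, and the loss weighting are assumptions.

```python
# Hedged sketch of a PINN for 1D Saint-Venant flow with a trainable Manning
# roughness n (inverse problem). Channel geometry, loss weights, and the
# simplified momentum balance are assumptions for illustration only.
import torch
import torch.nn as nn

g, b, S0 = 9.81, 10.0, 1e-3          # gravity, assumed channel width, bed slope

class PINN(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 2))                       # outputs: area A, discharge Q
        self.n = nn.Parameter(torch.tensor(0.02))      # Manning roughness to infer

    def forward(self, x, t):
        out = self.net(torch.stack([x, t], dim=-1))
        A = torch.nn.functional.softplus(out[..., 0]) + 1e-3   # keep wetted area positive
        Q = out[..., 1]
        return A, Q

def residuals(model, x, t):
    x.requires_grad_(True); t.requires_grad_(True)
    A, Q = model(x, t)
    dA_dt = torch.autograd.grad(A.sum(), t, create_graph=True)[0]
    dQ_dx = torch.autograd.grad(Q.sum(), x, create_graph=True)[0]
    dQ_dt = torch.autograd.grad(Q.sum(), t, create_graph=True)[0]
    flux = Q ** 2 / A + g * A ** 2 / (2 * b)            # rectangular-channel assumption
    dflux_dx = torch.autograd.grad(flux.sum(), x, create_graph=True)[0]
    R = A / (b + 2 * A / b)                             # hydraulic radius for A = b*h
    Sf = model.n ** 2 * Q * Q.abs() / (A ** 2 * R ** (4 / 3))
    continuity = dA_dt + dQ_dx
    momentum = dQ_dt + dflux_dx - g * A * (S0 - Sf)
    return continuity, momentum

# Training would combine these PDE residuals with a misfit to observed
# (synthetic) water levels and velocities, e.g.:
model = PINN()
x = torch.rand(256) * 1000.0; t = torch.rand(256) * 3600.0
c_res, m_res = residuals(model, x, t)
loss = (c_res ** 2).mean() + (m_res ** 2).mean()        # + data loss in practice
loss.backward()
```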

31 pages, 3735 KB  
Article
An Analysis of Layer-Freezing Strategies for Enhanced Transfer Learning in YOLO Architectures
by Andrzej D. Dobrzycki, Ana M. Bernardos and José R. Casar
Mathematics 2025, 13(15), 2539; https://doi.org/10.3390/math13152539 - 7 Aug 2025
Viewed by 1636
Abstract
The You Only Look Once (YOLO) architecture is crucial for real-time object detection. However, deploying it in resource-constrained environments such as unmanned aerial vehicles (UAVs) requires efficient transfer learning. Although layer freezing is a common technique, the specific impact of various freezing configurations on contemporary YOLOv8 and YOLOv10 architectures remains unexplored, particularly with regard to the interplay between freezing depth, dataset characteristics, and training dynamics. This research addresses this gap by presenting a detailed analysis of layer-freezing strategies. We systematically investigate multiple freezing configurations across YOLOv8 and YOLOv10 variants using four challenging datasets that represent critical infrastructure monitoring. Our methodology integrates a gradient behavior analysis (L2 norm) and visual explanations (Grad-CAM) to provide deeper insights into training dynamics under different freezing strategies. Our results reveal that there is no universal optimal freezing strategy but, rather, one that depends on the properties of the data. For example, freezing the backbone is effective for preserving general-purpose features, while a shallower freeze is better suited to handling extreme class imbalance. These configurations reduce graphics processing unit (GPU) memory consumption by up to 28% compared to full fine-tuning and, in some cases, achieve mean average precision (mAP@50) scores that surpass those of full fine-tuning. Gradient analysis corroborates these findings, showing distinct convergence patterns for moderately frozen models. Ultimately, this work provides empirical findings and practical guidelines for selecting freezing strategies. It offers a practical, evidence-based approach to balanced transfer learning for object detection in scenarios with limited resources. Full article
(This article belongs to the Special Issue Artificial Intelligence: Deep Learning and Computer Vision)
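
Layer freezing of the kind analyzed here comes down to disabling gradients for a prefix of the network. The sketch below shows the generic PyTorch pattern on a toy stand-in model rather than a real YOLOv8/YOLOv10 checkpoint; the split between "backbone" and "head" and the cutoff index are illustrative assumptions.

```python
# Generic layer-freezing sketch for transfer learning in PyTorch. The module
# cutoff and the backbone/head split are illustrative; real YOLOv8/YOLOv10
# models expose their own layer indexing.
import torch
import torch.nn as nn

def freeze_first_n(model: nn.Module, n: int):
    """Freeze the first n top-level children (e.g., backbone stages)."""
    for i, child in enumerate(model.children()):
        requires_grad = i >= n
        for p in child.parameters():
            p.requires_grad = requires_grad
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"frozen {frozen / total:.1%} of parameters")

# Toy stand-in for a detector: four "backbone" stages plus a head.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.Conv2d(16, 32, 3),
    nn.Conv2d(32, 64, 3), nn.Conv2d(64, 64, 3),
    nn.Conv2d(64, 8, 1))                     # head
freeze_first_n(model, n=4)                   # "freeze the backbone" configuration

# Only trainable parameters go to the optimizer, which is where the GPU-memory
# savings reported for frozen configurations come from.
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=1e-3)
```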

56 pages, 3118 KB  
Article
Semantic Reasoning Using Standard Attention-Based Models: An Application to Chronic Disease Literature
by Yalbi Itzel Balderas-Martínez, José Armando Sánchez-Rojas, Arturo Téllez-Velázquez, Flavio Juárez Martínez, Raúl Cruz-Barbosa, Enrique Guzmán-Ramírez, Iván García-Pacheco and Ignacio Arroyo-Fernández
Big Data Cogn. Comput. 2025, 9(6), 162; https://doi.org/10.3390/bdcc9060162 - 19 Jun 2025
Viewed by 1382
Abstract
Large-language-model (LLM) APIs demonstrate impressive reasoning capabilities, but their size, cost, and closed weights limit the deployment of knowledge-aware AI within biomedical research groups. At the other extreme, standard attention-based neural language models (SANLMs)—including encoder–decoder architectures such as Transformers, Gated Recurrent Units (GRUs), and Long Short-Term Memory (LSTM) networks—are computationally inexpensive. However, their capacity for semantic reasoning in noisy, open-vocabulary knowledge bases (KBs) remains unquantified. Therefore, we investigate whether compact SANLMs can (i) reason over hybrid OpenIE-derived KBs that integrate commonsense, general-purpose, and non-communicable-disease (NCD) literature; (ii) operate effectively on commodity GPUs; and (iii) exhibit semantic coherence as assessed through manual linguistic inspection. To this end, we constructed four training KBs by integrating ConceptNet (600k triples), a 39k-triple general-purpose OpenIE set, and an 18.6k-triple OpenNCDKB extracted from 1200 PubMed abstracts. Encoder–decoder GRU, LSTM, and Transformer models (1–2 blocks) were trained to predict the object phrase given the subject + predicate. Beyond token-level cross-entropy, we introduced the Meaning-based Selectional-Preference Test (MSPT): for each withheld triple, we masked the object, generated a candidate, and measured its surplus cosine similarity over a random baseline using word embeddings, with significance assessed via a one-sided t-test. Hyperparameter sensitivity (311 GRU/168 LSTM runs) was analyzed, and qualitative frame–role diagnostics completed the evaluation. Our results showed that all SANLMs learned effectively from the point of view of the cross-entropy loss. In addition, our MSPT provided meaningful semantic insights: for the GRUs (256-dim, 2048-unit, 1-layer), mean similarity (μ_sts) of 0.641 to the ground truth vs. 0.542 to the random baseline (gap 12.1%; p < 10^-180). For the 1-block Transformer, μ_sts = 0.551 vs. 0.511 (gap 4%; p < 10^-25). While Transformers minimized loss and accuracy variance, GRUs captured finer selectional preferences. Both architectures trained within <24 GB GPU VRAM and produced linguistically acceptable, albeit over-generalized, biomedical assertions. Due to their observed performance, LSTM results were designated as baseline models for comparison. Therefore, properly tuned SANLMs can achieve statistically robust semantic reasoning over noisy, domain-specific KBs without reliance on massive LLMs. Their interpretability, minimal hardware footprint, and open weights promote equitable AI research, opening new avenues for automated NCD knowledge synthesis, surveillance, and decision support. Full article
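
The MSPT scoring described above can be reproduced in a few lines: compare the cosine similarity of each generated object to the ground-truth object against its similarity to a random baseline, then test the mean gap for significance. The sketch below uses placeholder vectors and a paired one-sided t-test from SciPy; the embedding model and the exact pairing scheme are assumptions, not the paper's setup.

```python
# Sketch of the MSPT scoring step: surplus cosine similarity of generated
# objects over a random baseline, with a one-sided t-test on the gap.
# The "embeddings" here are synthetic placeholders.
import numpy as np
from scipy import stats

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def mspt(generated_vecs, truth_vecs, random_vecs):
    """Return mean similarity to truth, to the baseline, and the one-sided p-value."""
    sim_truth = np.array([cosine(g, t) for g, t in zip(generated_vecs, truth_vecs)])
    sim_rand = np.array([cosine(g, r) for g, r in zip(generated_vecs, random_vecs)])
    t_stat, p_value = stats.ttest_rel(sim_truth, sim_rand, alternative="greater")
    return sim_truth.mean(), sim_rand.mean(), p_value

# Synthetic demo with 300-dimensional vectors standing in for word embeddings.
rng = np.random.default_rng(0)
truth = rng.normal(size=(1000, 300))
generated = truth + rng.normal(scale=0.8, size=truth.shape)   # correlated with truth
random_baseline = rng.normal(size=truth.shape)
print(mspt(generated, truth, random_baseline))
```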

22 pages, 3260 KB  
Article
A Novel Adaptive Fine-Tuning Algorithm for Multimodal Models: Self-Optimizing Classification and Selection of High-Quality Datasets in Remote Sensing
by Yi Ren, Tianyi Zhang, Zhixiong Han, Weibin Li, Zhiyang Wang, Wenbo Ji, Chenhao Qin and Licheng Jiao
Remote Sens. 2025, 17(10), 1748; https://doi.org/10.3390/rs17101748 - 16 May 2025
Cited by 2 | Viewed by 1405
Abstract
The latest research indicates that Large Vision-Language Models (VLMs) have a wide range of applications in the field of remote sensing. However, the vast amount of image data in this field presents a challenge in selecting high-quality multimodal data, which are essential for saving computational resources and time. Therefore, we propose an adaptive fine-tuning algorithm for multimodal large models. The core steps of this algorithm involve two stages of data truncation. First, the vast dataset is projected into a semantic vector space, where the MiniBatchKMeans algorithm is used for automated clustering. This classification ensures that the data within each cluster exhibit high semantic similarity. Next, the data within each cluster are processed by calculating the translational difference between the original and perturbed data in the multimodal large model’s vector space. This difference serves as a generalization metric for the data. Based on this metric, we select data with high generalization potential for training. We apply this algorithm to train the InternLM-XComposer2-VL-7B model on two 3090 GPUs, using one-third of the GeoChat multimodal remote sensing dataset. The results demonstrate that our algorithm outperforms state-of-the-art baselines. The model trained on our optimally chosen one-third dataset, as validated through experiments, showed only a 1% reduction in performance across various remote sensing metrics compared to the model trained on the full dataset. This approach significantly preserved general-purpose capabilities while reducing training time by 68.2%. Furthermore, the model achieved scores of 89.86 and 77.19 on the UCMerced and AID evaluation datasets, respectively, surpassing the GeoChat dataset by 5.43 and 5.16 points. It only showed a 0.91-point average decrease on the LRBEN evaluation dataset. Full article
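
The two-stage selection can be sketched as follows: cluster the semantic embeddings with scikit-learn's MiniBatchKMeans, then rank samples inside each cluster by the shift between original and perturbed embeddings and keep the top fraction. The embeddings, the perturbation, and the keep fraction below are placeholders rather than the paper's actual models and settings.

```python
# Sketch of the two-stage data selection: (1) cluster semantic embeddings with
# MiniBatchKMeans, (2) within each cluster rank samples by the shift between
# original and perturbed embeddings and keep the top fraction.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 512))            # stand-in for VLM embeddings
perturbed = embeddings + rng.normal(scale=0.1, size=embeddings.shape)

# Stage 1: semantic clustering.
km = MiniBatchKMeans(n_clusters=50, batch_size=1024, n_init=10, random_state=0)
labels = km.fit_predict(embeddings)

# Stage 2: "generalization" score = translational difference in embedding space.
shift = np.linalg.norm(perturbed - embeddings, axis=1)

keep_fraction = 1 / 3
selected = []
for c in range(km.n_clusters):
    idx = np.where(labels == c)[0]
    k = max(1, int(len(idx) * keep_fraction))
    # Keep the samples with the largest shift (assumed to generalize best).
    selected.extend(idx[np.argsort(shift[idx])[::-1][:k]])
print(f"kept {len(selected)} of {len(embeddings)} samples")
```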

31 pages, 2436 KB  
Article
Application of Graphics Processor Unit Computing Resources to Solution of Incompressible Fluid Dynamics Problems
by Redha Benhadj-Djilali, Arturas Gulevskis and Konstantin Volkov
Computers 2025, 14(5), 170; https://doi.org/10.3390/computers14050170 - 1 May 2025
Viewed by 657
Abstract
The structure and memory organization of graphics processing units (GPUs) manufactured by NVIDIA and the use of CUDA programming technology to solve computational fluid dynamics (CFD) problems are reviewed and discussed. The potential of using a general-purpose GPU to solve fluid dynamics problems is examined. Code optimization through the use of various GPU memory types is considered. Some CFD benchmark problems focused on the simulation of viscous incompressible fluid flows are solved on GPUs. Consideration is given to the application of the finite volume method and the projection method. The programming implementation of the main components of the computational procedure is described, including the solution of the Poisson equation for pressure and a multigrid method for the resulting system of algebraic equations. Using meshes of varying resolutions and different techniques for dividing the input data into blocks, the speedup of the GPU solution over the CPU approach is evaluated. Full article
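
The pressure-Poisson solve is the heart of the projection method mentioned above. The NumPy sketch below shows the numerical structure of one simple option, a Jacobi iteration on a uniform 2D grid with a manufactured solution for checking; the paper's CUDA and multigrid implementations map this same kind of stencil onto the GPU, and none of the parameters here are taken from it.

```python
# Illustration of the pressure-Poisson step of a projection method:
# solve laplacian(p) = rhs with a plain Jacobi iteration on a uniform 2D grid.
import numpy as np

def solve_pressure_poisson(rhs, h, iters=2000):
    """Jacobi iterations for laplacian(p) = rhs with p = 0 on the boundary."""
    p = np.zeros_like(rhs)
    for _ in range(iters):
        p[1:-1, 1:-1] = 0.25 * (p[2:, 1:-1] + p[:-2, 1:-1] +
                                p[1:-1, 2:] + p[1:-1, :-2] -
                                h * h * rhs[1:-1, 1:-1])
    return p

# Manufactured test: rhs for p(x, y) = sin(pi x) sin(pi y).
n = 65
h = 1.0 / (n - 1)
x, y = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")
exact = np.sin(np.pi * x) * np.sin(np.pi * y)
rhs = -2 * np.pi ** 2 * exact
p = solve_pressure_poisson(rhs, h)
print("max error:", np.abs(p - exact).max())
```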

45 pages, 10045 KB  
Article
An Automated Framework for Streamlined CFD-Based Design and Optimization of Fixed-Wing UAV Wings
by Chris Pliakos, Giorgos Efrem, Dimitrios Terzis and Pericles Panagiotou
Algorithms 2025, 18(4), 186; https://doi.org/10.3390/a18040186 - 24 Mar 2025
Cited by 1 | Viewed by 2259
Abstract
The increasing complexity of the UAV aerodynamic design, imposed by novel configurations and requirements, has highlighted the need for efficient tools for high-fidelity simulation, especially for optimization purposes. The current work presents an automated CFD framework, tailored for fixed-wing UAVs, designed to streamline the geometry generation of wings, mesh creation, and simulation execution into a Python-based pipeline. The framework employs a parameterized meshing module capable of handling a broad range of wing geometries within an extensive design space, thereby reducing manual effort and achieving pre-processing times in the order of five minutes. Incorporating GPU-enabled solvers and high-performance computing environments allows for rapid and scalable aerodynamic evaluations. An automated methodology for assessing the CFD results is presented, addressing the discretization and iterative errors, as well as grid resolution, especially near wall surfaces. Comparisons with the results produced by a specialized mechanical engineer with over five years of experience in aircraft-related CFD indicate high accuracy, with deviations below 3% for key aerodynamic metrics. A large-scale deployment further demonstrates consistency across diverse wing samples. A Bayesian Optimization case study then illustrates the framework’s utility, identifying a wing design with an 8% improvement in the lift-to-drag ratio, while maintaining an average y+ value below 1 along the surface. Overall, the proposed approach streamlines fixed-wing UAV design processes and supports advanced aerodynamic optimization and data generation. Full article
(This article belongs to the Special Issue Numerical Optimization and Algorithms: 3rd Edition)
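
To make the optimization step concrete, the sketch below wires a Bayesian optimization loop over a few wing parameters using scikit-optimize's gp_minimize, which is one possible driver for such a framework rather than the authors' choice. The objective is a cheap analytical stand-in for the real geometry-mesh-CFD evaluation, and the parameter names and bounds are assumptions.

```python
# Sketch of a Bayesian-optimization loop over wing parameters driving an
# automated CFD pipeline. The objective is an analytical placeholder for the
# real mesh-and-solve step; parameter names and bounds are assumptions.
from skopt import gp_minimize
from skopt.space import Real

space = [Real(6.0, 12.0, name="aspect_ratio"),
         Real(0.2, 1.0, name="taper_ratio"),
         Real(-3.0, 3.0, name="twist_deg")]

def negative_lift_to_drag(params):
    aspect_ratio, taper_ratio, twist = params
    # Placeholder for: generate wing geometry -> mesh -> run CFD -> extract L/D.
    l_over_d = (18.0 + 0.8 * aspect_ratio
                - 4.0 * (taper_ratio - 0.45) ** 2
                - 0.1 * (twist - 1.0) ** 2)
    return -l_over_d          # gp_minimize minimizes, so negate the objective

result = gp_minimize(negative_lift_to_drag, space, n_calls=30, random_state=0)
print("best L/D:", -result.fun, "at", result.x)
```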

20 pages, 3107 KB  
Article
Computer Simulation and Speedup of Solving Heat Transfer Problems of Heating and Melting Metal Particles with Laser Radiation
by Arturas Gulevskis and Konstantin Volkov
Computers 2025, 14(2), 47; https://doi.org/10.3390/computers14020047 - 4 Feb 2025
Viewed by 1073
Abstract
The study of laser action on powder materials requires mathematical models of the interaction of laser radiation with powder particles that take into account the features of energy supply and are applicable over a wide range of beam parameters and particle material properties. A model of the interaction of pulsed or pulse-periodic laser radiation with a spherical metal particle is developed. To find the temperature distribution in the particle volume, the non-stationary three-dimensional heat conduction equation with a source term accounting for the action of laser radiation is solved. In the plane normal to the direction of propagation, the radiation intensity follows a Gaussian profile. Changes in the laser intensity in space due to absorption by the surrounding medium can also be taken into account. To accelerate the numerical calculations, the computational algorithm relies on vectorized data structures and the parallel implementation of operations on general-purpose graphics accelerators. The features of the software implementation are presented for the iterative solution of the system of difference equations that arises from the finite-volume discretization of the heat conduction equation with an implicit scheme. The model developed describes the heating and melting of a spherical metal particle exposed to multi-pulse laser radiation, and its GPU implementation again exploits vectorized data structures. The model and calculation results are of interest for constructing a two-phase flow model describing the interaction of test particles with laser radiation on the scale of the entire calculation domain. Such a model is implemented using a discrete-trajectory approach to modeling the motion and heat exchange of a dispersed admixture. Full article
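
A minimal sketch of the numerical core, reduced to 2D for brevity and ignoring melting: backward-Euler heat conduction with a Gaussian surface source, where the implicit system is approximated by a few vectorized Jacobi sweeps per time step. All material and beam parameters below are placeholders, not values from the paper.

```python
# Minimal 2D sketch: implicit (backward-Euler) heat conduction with a Gaussian
# laser source, the linear system approximated by vectorized Jacobi sweeps.
import numpy as np

nx = ny = 81
h = 1e-6                                 # cell size [m]
dt = 1e-7                                # time step [s]
rho, cp, k = 7800.0, 500.0, 30.0         # assumed steel-like properties
alpha = k / (rho * cp)
lam = alpha * dt / h ** 2

x = (np.arange(nx) - nx // 2) * h
X, Y = np.meshgrid(x, x, indexing="ij")
I0, r0 = 1e9, 10e-6                      # assumed absorbed intensity, beam radius
source = I0 * np.exp(-(X ** 2 + Y ** 2) / r0 ** 2) / (rho * cp * h)   # [K/s]

T = np.full((nx, ny), 300.0)
for step in range(200):                  # a short pulse worth of time steps
    rhs = T + dt * source
    Tn = T.copy()
    for _ in range(50):                  # Jacobi sweeps approximate the implicit solve
        Tn[1:-1, 1:-1] = (rhs[1:-1, 1:-1] +
                          lam * (Tn[2:, 1:-1] + Tn[:-2, 1:-1] +
                                 Tn[1:-1, 2:] + Tn[1:-1, :-2])) / (1 + 4 * lam)
    T = Tn
print("peak temperature [K]:", T.max())
```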

25 pages, 11027 KB  
Article
A Novel Approach for the Counting of Wood Logs Using cGANs and Image Processing Techniques
by João V. C. Mazzochin, Giovani Bernardes Vitor, Gustavo Tiecker, Elioenai M. F. Diniz, Gilson A. Oliveira, Marcelo Trentin and Érick O. Rodrigues
Forests 2025, 16(2), 237; https://doi.org/10.3390/f16020237 - 26 Jan 2025
Cited by 1 | Viewed by 1206
Abstract
This study tackles the challenge of precise wood log counting, where applications of the proposed methodology can span from automated approaches for materials management, surveillance, and safety science to wood traffic monitoring, wood volume estimation, and others. We introduce an approach leveraging Conditional Generative Adversarial Networks (cGANs) for eucalyptus log segmentation in images, incorporating specialized image processing techniques to handle noise and intersections, coupled with the Connected Components Algorithm for efficient counting. To support this research, we created and made publicly available a comprehensive database of 466 images containing approximately 13,048 eucalyptus logs, which served for both training and validation purposes. Our method demonstrated robust performance, achieving an average Accuracy_pixel of 96.4% and Accuracy_logs of 92.3%, with additional measures such as F1 scores ranging from 0.879 to 0.933 and IoU values between 0.784 and 0.875, further validating its effectiveness. The implementation proves to be efficient with an average processing time of 0.713 s per image on an NVIDIA T4 GPU, making it suitable for real-time applications. The practical implications of this method are significant for operational forestry, enabling more accurate inventory management, reducing human errors in manual counting, and optimizing resource allocation. Furthermore, the segmentation capabilities of the model provide a foundation for advanced applications such as eucalyptus stack volume estimation, contributing to a more comprehensive and refined analysis of forestry operations. The methodology’s success in handling complex scenarios, including intersecting logs and varying environmental conditions, positions it as a valuable tool for practical applications across related industrial sectors. Full article
(This article belongs to the Special Issue Applications of Artificial Intelligence in Forestry: 2nd Edition)
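
The counting stage that follows segmentation is easy to illustrate. The sketch below assumes a binary mask (here a synthetic one standing in for the cGAN output), applies a small morphological opening to break thin bridges between touching logs, and counts connected components with SciPy; the structuring-element size and area threshold are assumptions.

```python
# Sketch of the counting stage after segmentation: label connected components
# in a binary mask of log faces and count them.
import numpy as np
from scipy import ndimage

# Synthetic "segmentation mask" with three circular log faces.
yy, xx = np.mgrid[0:200, 0:200]
mask = np.zeros((200, 200), dtype=bool)
for cy, cx in [(50, 50), (50, 140), (140, 95)]:
    mask |= (yy - cy) ** 2 + (xx - cx) ** 2 < 30 ** 2

cleaned = ndimage.binary_opening(mask, structure=np.ones((5, 5)))
labels, n_components = ndimage.label(cleaned)
sizes = ndimage.sum(cleaned, labels, index=range(1, n_components + 1))
n_logs = int(np.sum(sizes > 100))        # drop tiny fragments below an area threshold
print("estimated log count:", n_logs)
```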

15 pages, 1170 KB  
Article
MAIL: Micro-Accelerator-in-the-Loop Framework for MCU Integrated Accelerator Peripheral Fast Prototyping
by Jisu Kwon and Daejin Park
Appl. Sci. 2025, 15(3), 1056; https://doi.org/10.3390/app15031056 - 21 Jan 2025
Viewed by 1536
Abstract
Resource-constrained MCU-based platforms lack the resources to use high-performance accelerators such as GPUs or servers for ML applications. We define a Micro-Accelerator (MA) that can accelerate ML operations by being connected to the on-chip bus peripheral of the MCU core. ML applications using general-purpose accelerators have a well-equipped SDK environment, making the design and verification flow straightforward. In contrast, an MA must be connected to the MCU core and on-chip bus interface within the chip. However, evaluating the interaction between the MCU core and an MA is challenging, as it requires the MA to connect with the core and the on-chip bus interface during target software execution. The cost of fabricating physical MA hardware is enormous, compounded by licensing issues with commercial cores. We propose an MA-in-the-loop (MAIL) framework that integrates a custom-designed MA into an emulation platform. This platform enables virtual execution by loading software onto the MCU, allowing observation of hardware–software interactions during ML execution. The proposed framework combines software that emulates the environment in which typical ML applications run on the MCU with RTL simulation to profile the acceleration achieved by the MA. Within the framework, the MA can be reconfigured at runtime to explore the design space and evaluate how ML software execution flow and performance change with different MA architectures. To benchmark our proposed framework, we compared TinyML application profiles against pure software execution. Experimental results show that the MA-accelerated framework performs comparably to actual MCUs, validating the efficacy of the proposed approach. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

34 pages, 1063 KB  
Review
A Survey on Design Space Exploration Approaches for Approximate Computing Systems
by Sepide Saeedi, Ali Piri, Bastien Deveautour, Ian O’Connor, Alberto Bosio, Alessandro Savino and Stefano Di Carlo
Electronics 2024, 13(22), 4442; https://doi.org/10.3390/electronics13224442 - 13 Nov 2024
Cited by 1 | Viewed by 2973
Abstract
Approximate Computing (AxC) has emerged as a promising paradigm to enhance performance and energy efficiency by allowing a controlled trade-off between accuracy and resource consumption. It is extensively adopted across various abstraction levels, from software to architecture and circuit levels, employing diverse methodologies. The primary objective of AxC is to reduce energy consumption for executing error-resilient applications, accepting controlled and inherently acceptable output quality degradation. However, harnessing AxC poses several challenges, including identifying segments within a design amenable to approximation and selecting suitable AxC techniques to fulfill accuracy and performance criteria. This survey provides a comprehensive review of recent methodologies proposed for performing Design Space Exploration (DSE) to find the most suitable AxC techniques, focusing on both hardware and software implementations. DSE is a crucial design process where system designs are modeled, evaluated, and optimized for various extra-functional system behaviors such as performance, power consumption, energy efficiency, and accuracy. A systematic literature review was conducted to identify papers that describe their DSE algorithms, excluding those relying on exhaustive search methods. This survey aims to detail the state-of-the-art DSE methodologies that efficiently select AxC techniques, offering insights into their applicability across different hardware platforms and use-case domains. For this purpose, papers were categorized based on the type of search algorithm used, with Machine Learning (ML) and Evolutionary Algorithms (EAs) being the predominant approaches. Further categorization is based on the target hardware, including Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), general-purpose Central Processing Units (CPUs), and Graphics Processing Units (GPUs). A notable observation was that most studies targeted image processing applications due to their tolerance for accuracy loss. By providing an overview of techniques and methods outlined in the existing literature pertaining to the DSE of AxC designs, this survey elucidates the current trends and challenges in optimizing approximate designs. Full article

25 pages, 1511 KB  
Article
Performance Study of an MRI Motion-Compensated Reconstruction Program on Intel CPUs, AMD EPYC CPUs, and NVIDIA GPUs
by Mohamed Aziz Zeroual, Karyna Isaieva, Pierre-André Vuissoz and Freddy Odille
Appl. Sci. 2024, 14(21), 9663; https://doi.org/10.3390/app14219663 - 23 Oct 2024
Cited by 3 | Viewed by 1801
Abstract
Motion-compensated image reconstruction enables new clinical applications of Magnetic Resonance Imaging (MRI), but it relies on computationally intensive algorithms. This study focuses on the Generalized Reconstruction by Inversion of Coupled Systems (GRICS) program, applied to the reconstruction of 3D images in cases of non-rigid or rigid motion. It uses hybrid parallelization with the MPI (Message Passing Interface) and OpenMP (Open Multi-Processing). For clinical integration, the GRICS needs to efficiently harness the computational resources of compute nodes. We aim to improve the GRICS’s performance without any code modification. This work presents a performance study of GRICS on two CPU architectures: Intel Xeon Gold and AMD EPYC. The roofline model is used to study the software–hardware interaction and quantify the code’s performance. For CPU–GPU comparison purposes, we propose a preliminary MATLAB–GPU implementation of the GRICS’s reconstruction kernel. We establish the roofline model of the kernel on two NVIDIA GPU architectures: Quadro RTX 5000 and A100. After the performance study, we propose some optimization patterns for the code’s execution on CPUs, first considering only the OpenMP implementation using thread binding and affinity and appropriate architecture-compilation flags and then looking for the optimal combination of MPI processes and OpenMP threads in the case of the hybrid MPI–OpenMP implementation. The results show that the GRICS performed well on the AMD EPYC CPUs, with an architectural efficiency of 52%. The kernel’s execution was fast on the NVIDIA A100 GPU, but the roofline model reported low architectural efficiency and utilization. Full article
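
The roofline model used in the study reduces to a one-line bound: attainable performance is the minimum of peak compute and arithmetic intensity times peak bandwidth, and architectural efficiency is the ratio of measured to attainable performance. The helper below illustrates the calculation; the peak and measured numbers are placeholders, not values from the paper.

```python
# Roofline bound: attainable GFLOP/s = min(peak compute, AI x peak bandwidth).
def roofline(peak_gflops, peak_gb_s, arithmetic_intensity, measured_gflops):
    attainable = min(peak_gflops, arithmetic_intensity * peak_gb_s)
    return {
        "attainable_gflops": attainable,
        "architectural_efficiency": measured_gflops / attainable,
        "memory_bound": arithmetic_intensity * peak_gb_s < peak_gflops,
    }

# Example: a hypothetical kernel at 0.5 FLOP/byte on a 200 GB/s CPU socket.
print(roofline(peak_gflops=1500.0, peak_gb_s=200.0,
               arithmetic_intensity=0.5, measured_gflops=52.0))
```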

15 pages, 1106 KB  
Article
GPU@SAT DevKit: Empowering Edge Computing Development Onboard Satellites in the Space-IoT Era
by Gionata Benelli, Giovanni Todaro, Matteo Monopoli, Gianluca Giuffrida, Massimiliano Donati and Luca Fanucci
Electronics 2024, 13(19), 3928; https://doi.org/10.3390/electronics13193928 - 4 Oct 2024
Cited by 5 | Viewed by 2653
Abstract
Advancements in technology have driven the miniaturization of embedded systems, making them more cost-effective and energy-efficient for wireless applications. As a result, the number of connectable devices in Internet of Things (IoT) networks has increased significantly, creating the challenge of linking them effectively and economically. The space industry has long recognized this challenge and invested in satellite infrastructure for IoT networks, exploiting the potential of edge computing technologies. In this context, it is of critical importance to enhance the onboard computing capabilities of satellites and develop enabling technologies for their advancement. This is necessary to ensure that satellites are able to connect devices while reducing latency, bandwidth utilization, and development costs, and improving privacy and security measures. This paper presents the GPU@SAT DevKit: an ecosystem for testing a high-performance, general-purpose accelerator designed for FPGAs and suitable for edge computing tasks on satellites. This ecosystem provides a streamlined way to exploit GPGPU processing in space, enabling faster development times and more efficient resource use. Designed for FPGAs and tailored to edge computing tasks, the GPU@SAT accelerator mimics the parallel architecture of a GPU, allowing developers to leverage its capabilities while maintaining flexibility. Its compatibility with OpenCL simplifies the development process, enabling faster deployment of satellite-based applications. The DevKit was implemented and tested on a Zynq UltraScale+ MPSoC evaluation board from Xilinx, integrating the GPU@SAT IP core with the system’s embedded processor. A client/server approach is used to run applications, allowing users to easily configure and execute kernels through a simple XML document. This intuitive interface provides end-users with the ability to run and evaluate kernel performance and functionality without dealing with the underlying complexities of the accelerator itself. By making the GPU@SAT IP core more accessible, the DevKit significantly reduces development time and lowers the barrier to entry for satellite-based edge computing solutions. The DevKit was also compared with other onboard processing solutions, demonstrating similar performance. Full article

37 pages, 9513 KB  
Article
Parallel Implicit Solvers for 2D Numerical Models on Structured Meshes
by Yaoxin Zhang, Mohammad Z. Al-Hamdan and Xiaobo Chao
Mathematics 2024, 12(14), 2184; https://doi.org/10.3390/math12142184 - 12 Jul 2024
Cited by 1 | Viewed by 1223
Abstract
This paper presents the parallelization of two widely used implicit numerical solvers for the solution of partial differential equations on structured meshes, namely, the ADI (Alternating-Direction Implicit) solver for tridiagonal linear systems and the SIP (Strongly Implicit Procedure) solver for penta-diagonal systems. Both solvers were parallelized using CUDA (Compute Unified Device Architecture) Fortran on GPGPUs (General-Purpose Graphics Processing Units). The parallel ADI solver (P-ADI) is based on the Parallel Cyclic Reduction (PCR) algorithm, while the parallel SIP solver (P-SIP) uses the wave front method (WF) following a diagonal line calculation strategy. To map the solution schemes onto the hierarchical block-threads framework of CUDA on the GPU, the P-ADI solver adopted two mapping methods, one block thread with iterations (OBM-it) and multi-block threads (MBMs), while the P-SIP solver also used two mappings: a conventional mapping using effective WF lines (WF-e), with matrix coefficients and solution variables defined on the original computational mesh, and a newly proposed mapping using the all-WF mesh (WF-all), on which matrix coefficients and solution variables are defined. Both the P-ADI and the P-SIP have been integrated into a two-dimensional (2D) hydrodynamic model, the CCHE2D (Center of Computational Hydroscience and Engineering) model, developed by the National Center for Computational Hydroscience and Engineering at the University of Mississippi. This study is the first to compare these two parallel solvers and their efficiency using examples and applications in complex geometries, which can provide valuable guidance for future uses of these two parallel implicit solvers in computational fluid dynamics (CFD). Both parallel solvers demonstrated higher efficiency than their serial counterparts on the CPU (Central Processing Unit): speedup ratios of 3.73–4.98 for flow simulations and 2.166–3.648 for sediment transport simulations. In general, the P-ADI solver is faster than, but not as stable as, the P-SIP solver; for the P-SIP solver, the newly developed WF-all mapping significantly improved on the conventional WF-e mapping. Full article
(This article belongs to the Special Issue Mathematical Modeling and Numerical Simulation in Fluids)
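
Parallel Cyclic Reduction, the algorithm behind the P-ADI solver, reduces every row of a tridiagonal system simultaneously, which is what lets one GPU thread handle one row. The NumPy sketch below is a CPU illustration of that reduction under common assumptions (a diagonally dominant system and identity-row padding at the boundaries); it is not the paper's CUDA Fortran code.

```python
# CPU illustration of Parallel Cyclic Reduction (PCR) for a tridiagonal system.
import numpy as np

def pcr_solve(a, b, c, d):
    """Solve a*x[i-1] + b*x[i] + c*x[i+1] = d with PCR (a[0] = c[-1] = 0).

    Every row is reduced simultaneously at each stride, which is the property
    that maps one row to one GPU thread; NumPy vectorization stands in here.
    """
    n = len(b)
    a, b, c, d = (np.asarray(v, dtype=float).copy() for v in (a, b, c, d))
    stride = 1
    while stride < n:
        # Pad with identity rows (x = 0) so out-of-range neighbours are harmless.
        ap = np.concatenate([np.zeros(stride), a, np.zeros(stride)])
        bp = np.concatenate([np.ones(stride), b, np.ones(stride)])
        cp = np.concatenate([np.zeros(stride), c, np.zeros(stride)])
        dp = np.concatenate([np.zeros(stride), d, np.zeros(stride)])
        lo = slice(0, n)                         # rows i - stride (padded indexing)
        mi = slice(stride, stride + n)           # rows i
        hi = slice(2 * stride, 2 * stride + n)   # rows i + stride
        alpha = -ap[mi] / bp[lo]
        gamma = -cp[mi] / bp[hi]
        a_new = alpha * ap[lo]
        c_new = gamma * cp[hi]
        b_new = bp[mi] + alpha * cp[lo] + gamma * ap[hi]
        d_new = dp[mi] + alpha * dp[lo] + gamma * dp[hi]
        a, b, c, d = a_new, b_new, c_new, d_new
        stride *= 2
    return d / b           # all off-diagonal couplings have been eliminated

# Check against a dense solve on a small diagonally dominant system.
rng = np.random.default_rng(0)
n = 64
a = rng.uniform(-1, 0, n); a[0] = 0.0
c = rng.uniform(-1, 0, n); c[-1] = 0.0
b = 4.0 + rng.uniform(0, 1, n)
d = rng.normal(size=n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
x = pcr_solve(a, b, c, d)
print("max abs error vs. dense solve:", np.abs(x - np.linalg.solve(A, d)).max())
```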

23 pages, 12799 KB  
Article
Construction of Three-Dimensional Semantic Maps of Unstructured Lawn Scenes Based on Deep Learning
by Xiaolin Xie, Zixiang Yan, Zhihong Zhang, Yibo Qin, Hang Jin, Cheng Zhang and Man Xu
Appl. Sci. 2024, 14(11), 4884; https://doi.org/10.3390/app14114884 - 4 Jun 2024
Cited by 2 | Viewed by 1959
Abstract
Traditional automatic gardening pruning robots generally employ electronic fences for the delineation of working boundaries. In order to quickly determine the working area of a robot, we combined an improved DeepLabv3+ semantic segmentation model with a simultaneous localization and mapping (SLAM) system to construct a three-dimensional (3D) semantic map. To reduce the computational cost of its future deployment in resource-constrained mobile robots, we replaced the backbone network of DeepLabv3+, ResNet50, with MobileNetV2 to decrease the number of network parameters and improve recognition speed. In addition, we introduced an efficient channel attention (ECA) mechanism to enhance the accuracy of the neural network, forming an improved Multiclass MobileNetV2 ECA DeepLabv3+ (MM-ED) network model. Through the integration of this model with the SLAM system, the entire framework was able to generate a 3D semantic point cloud map of a lawn working area and convert it into octree and occupancy grid maps, providing technical support for future autonomous robot operation and navigation. We created a lawn dataset containing 7500 images, using our own annotated images as ground truth, and employed it for the experiments. Experimental results showed that the proposed MM-ED network model achieved 91.07% and 94.71% for the MIoU and MPA metrics, respectively. Using a GTX 3060 Laptop GPU, the frame rate reached 27.69 frames per second, demonstrating superior recognition performance compared to similar semantic segmentation architectures and better adaptation to SLAM systems. Full article
(This article belongs to the Special Issue Advanced 2D/3D Computer Vision Technology and Applications)
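
The ECA block added in MM-ED is compact enough to sketch directly: global average pooling, a 1D convolution across the channel descriptor, and a sigmoid gate. The PyTorch sketch below follows the adaptive kernel-size rule commonly used in ECA-Net; the authors' exact settings may differ.

```python
# Sketch of an Efficient Channel Attention (ECA) block: global average pooling,
# a 1D convolution across channels, and a sigmoid gate for channel re-weighting.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, channels, gamma=2, beta=1):
        super().__init__()
        k = int(abs((math.log2(channels) + beta) / gamma))
        k = k if k % 2 else k + 1                      # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                              # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                         # global average pooling -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)       # 1D conv across channels
        return x * torch.sigmoid(y)[:, :, None, None]  # channel-wise re-weighting

# Example: re-weight a MobileNetV2-sized feature map.
feat = torch.randn(2, 320, 32, 32)
print(ECA(320)(feat).shape)
```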
