Search Results (218)

Search Parameters:
Keywords = central processing unit (CPU)

19 pages, 4756 KiB  
Article
Quasi-3D Mechanistic Model for Predicting Eye Drop Distribution in the Human Tear Film
by Harsha T. Garimella, Carly Norris, Carrie German, Andrzej Przekwas, Ross Walenga, Andrew Babiskin and Ming-Liang Tan
Bioengineering 2025, 12(8), 825; https://doi.org/10.3390/bioengineering12080825 - 30 Jul 2025
Viewed by 85
Abstract
Topical drug administration is a common method of delivering medications to the eye to treat various ocular conditions, including glaucoma, dry eye, and inflammation. Drug efficacy following topical administration, including the drug’s distribution within the eye, absorption and elimination rates, and physiological responses can be predicted using physiologically based pharmacokinetic (PBPK) modeling. High-resolution computational models of the eye are desirable to improve simulations of drug delivery; however, these approaches can have long run times. In this study, a fast-running computational quasi-3D (Q3D) model of the human tear film was developed to account for absorption, blinking, drainage, and evaporation. Visualization of blinking mechanics and flow distributions throughout the tear film were enabled using this Q3D approach. Average drug absorption throughout the tear film subregions was quantified using a high-resolution compartment model based on a system of ordinary differential equations (ODEs). Simulations were validated by comparing them with experimental data from topical administration of 0.1% dexamethasone suspension in the tear film (R2 = 0.76, RMSE = 8.7, AARD = 28.8%). Overall, the Q3D tear film model accounts for critical mechanistic factors (e.g., blinking and drainage) not previously included in fast-running models. Further, this work demonstrated methods toward improved computational efficiency, where central processing unit (CPU) time was decreased while maintaining accuracy. Building upon this work, this Q3D approach applied to the tear film will allow for more seamless integration into full-body models, which will be an extremely valuable tool in the development of treatments for ocular conditions. Full article
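As a concrete illustration of the compartmental modeling approach mentioned above, the following minimal sketch expresses a two-compartment tear-film/cornea system as ODEs and solves it with SciPy. The structure and rate constants are illustrative assumptions, not the paper's Q3D model.

```python
# Illustrative two-compartment ODE sketch of topical drug disposition in the
# tear film (drainage, corneal absorption, elimination). Rate constants and
# compartment structure are hypothetical, NOT the paper's Q3D model.
import numpy as np
from scipy.integrate import solve_ivp

k_drain, k_abs, k_elim = 0.5, 0.1, 0.05   # 1/min, assumed values

def rhs(t, y):
    tear, cornea = y                      # drug amounts in each compartment
    d_tear = -(k_drain + k_abs) * tear    # loss to drainage + absorption
    d_cornea = k_abs * tear - k_elim * cornea
    return [d_tear, d_cornea]

sol = solve_ivp(rhs, (0.0, 60.0), [1.0, 0.0], dense_output=True)
t = np.linspace(0.0, 60.0, 7)
print(np.round(sol.sol(t), 4))            # tear-film and corneal amounts over 1 h
```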

16 pages, 2270 KiB  
Article
Performance Evaluation of FPGA, GPU, and CPU in FIR Filter Implementation for Semiconductor-Based Systems
by Muhammet Arucu and Teodor Iliev
J. Low Power Electron. Appl. 2025, 15(3), 40; https://doi.org/10.3390/jlpea15030040 - 21 Jul 2025
Viewed by 429
Abstract
This study presents a comprehensive performance evaluation of field-programmable gate array (FPGA), graphics processing unit (GPU), and central processing unit (CPU) platforms for implementing finite impulse response (FIR) filters in semiconductor-based digital signal processing (DSP) systems. Utilizing a standardized FIR filter designed with the Kaiser window method, we compare computational efficiency, latency, and energy consumption across the ZYNQ XC7Z020 FPGA, Tesla K80 GPU, and Arm-based CPU, achieving processing times of 0.004 s, 0.008 s, and 0.107 s, respectively, with FPGA power consumption of 1.431 W and comparable energy profiles for GPU and CPU. The FPGA is 27 times faster than the CPU and 2 times faster than the GPU, demonstrating its suitability for low-latency DSP tasks. A detailed analysis of resource utilization and scalability underscores the FPGA’s reconfigurability for optimized DSP implementations. This work provides novel insights into platform-specific optimizations, addressing the demand for energy-efficient solutions in edge computing and IoT applications, with implications for advancing sustainable DSP architectures. Full article
(This article belongs to the Topic Advanced Integrated Circuit Design and Application)
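As a rough illustration of the Kaiser-window FIR design step mentioned in the abstract, a minimal SciPy sketch follows; the sample rate, cutoff, and ripple specification are assumed values, not the paper's filter.

```python
# Minimal Kaiser-window FIR design and CPU timing sketch (NumPy/SciPy).
# Filter specs and signal length are illustrative assumptions, not the
# parameters benchmarked in the paper.
import time
import numpy as np
from scipy.signal import kaiserord, firwin, lfilter

fs = 48_000.0          # sample rate (Hz), assumed
cutoff = 6_000.0       # passband edge (Hz), assumed
width = 1_000.0        # transition width (Hz), assumed
ripple_db = 60.0       # stopband attenuation (dB), assumed

# Kaiser window method: estimate tap count and beta from the ripple/transition specs.
numtaps, beta = kaiserord(ripple_db, width / (0.5 * fs))
taps = firwin(numtaps, cutoff, window=("kaiser", beta), fs=fs)

x = np.random.default_rng(0).standard_normal(1_000_000)
t0 = time.perf_counter()
y = lfilter(taps, 1.0, x)          # CPU reference implementation
print(f"{numtaps} taps, CPU time: {time.perf_counter() - t0:.4f} s")
```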

25 pages, 8302 KiB  
Article
A Real-Time Microkernel for the dsPIC33E Family
by Wilmar Hernandez and Norberto Cañas
Electronics 2025, 14(11), 2160; https://doi.org/10.3390/electronics14112160 - 26 May 2025
Viewed by 1044
Abstract
A microkernel for the dsPIC33E microcontroller family is presented. This is useful because existing microkernels and real-time operating systems do not implement the real-time locking policy immediate ceiling priority protocol (ICPP) for this family, or if they do, their configuration is not straightforward (as far as we know). For complex embedded systems, it is advantageous to design the system using a concurrent process model. The simplest entity that can be used to achieve this is the microkernel. The most important result of this work is the development of a new microkernel (AμK) that implements priority-based process scheduling along with an ICPP shared resource access policy. AμK is characterized by its easy configuration. Furthermore, among other good properties of AμK, it is possible to achieve a central processing unit (CPU) utilization of more than 99.9% for scheduling periods of 10 ms. Finally, two case studies with real-time constraints are shown. Full article
(This article belongs to the Special Issue Recent Advances in Embedded Systems)
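The core ICPP idea can be sketched in a few lines: a task that locks a resource immediately inherits the resource's ceiling priority. The toy Python below illustrates the rule only and is not the AμK implementation.

```python
# Toy illustration of the immediate ceiling priority protocol (ICPP): on locking
# a resource, a task immediately runs at the resource's ceiling priority (the
# highest priority of any task that ever uses it). Conceptual sketch only;
# nested locking and the real scheduler are omitted.
from dataclasses import dataclass, field

@dataclass
class Resource:
    ceiling: int                  # precomputed ceiling priority

@dataclass
class Task:
    name: str
    base_priority: int
    active_priority: int = field(init=False)

    def __post_init__(self):
        self.active_priority = self.base_priority

    def lock(self, res: Resource):
        # Raise immediately to the ceiling so no other user of the resource can
        # preempt while it is held, bounding blocking to one critical section.
        self.active_priority = max(self.active_priority, res.ceiling)

    def unlock(self):
        self.active_priority = self.base_priority

adc = Resource(ceiling=7)         # shared by tasks with priorities up to 7
t = Task("logger", base_priority=3)
t.lock(adc)
print(t.active_priority)          # 7 while holding the resource
t.unlock()
print(t.active_priority)          # back to 3
```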

16 pages, 4292 KiB  
Article
PreEdgeDB: A Lightweight Platform for Energy Prediction on Low-Power Edge Devices
by Woojin Cho, Dongju Kim, Byunghyun Lim and Jaehoi Gu
Electronics 2025, 14(10), 1912; https://doi.org/10.3390/electronics14101912 - 8 May 2025
Viewed by 424
Abstract
Rising energy costs due to environmental degradation, climate change, global conflicts, and pandemics have prompted the need for efficient energy management. Edge devices are increasingly recognized for improving energy efficiency; however, their role as primary computing units remains underexplored. This study presents PreEdgeDB, a lightweight platform deployed on low-power edge devices to optimize energy usage in industrial complexes, which consume approximately 57.29% of South Korea’s total energy. The platform integrates real-time data preprocessing, time-series storage, and prediction capabilities, enabling independent operation at individual factories. A low-resource preprocessing module was developed to handle missing and anomalous data. For storage, RocksDB—a lightweight, high-performance key–value database—was optimized for edge environments. For prediction, Light Gradient Boosting Machine (LightGBM) was adopted due to its efficiency and high accuracy on limited-resource systems. The resulting model achieved a coefficient of variation of the root mean squared error (CV(RMSE)) of 14.36% and a prediction score of 0.8240. The total processing time from data collection to prediction was under 300 milliseconds. With memory usage below 150 MB and CPU utilization around 60%, PreEdgeDB enables fully autonomous energy prediction and analysis on edge devices, without relying on centralized servers. Full article
(This article belongs to the Section Computer Science & Engineering)
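A minimal sketch of the prediction-and-evaluation step described above, using LightGBM on synthetic data and the CV(RMSE) metric reported in the abstract; the features, data, and hyperparameters are placeholders.

```python
# Sketch of the prediction stage: a LightGBM regressor on synthetic load data,
# scored with CV(RMSE) = 100 * RMSE / mean(y). Features and hyperparameters are
# placeholders, not the PreEdgeDB configuration.
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 4))                 # e.g., hour, temperature, lagged loads
y = 50 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 3, 2000)

X_train, X_test, y_train, y_test = X[:1600], X[1600:], y[:1600], y[1600:]
model = LGBMRegressor(n_estimators=200, learning_rate=0.05)
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(np.mean((pred - y_test) ** 2))
cv_rmse = 100 * rmse / y_test.mean()
print(f"CV(RMSE) = {cv_rmse:.2f}%")
```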

9 pages, 2062 KiB  
Article
Versal Adaptive Compute Acceleration Platform Processing for ATLAS-TileCal Signal Reconstruction
by Francisco Hervás Álvarez, Alberto Valero Biot, Luca Fiorini, Héctor Gutiérrez Arance, Fernando Carrió, Sonakshi Ahuja and Francesco Curcio
Particles 2025, 8(2), 49; https://doi.org/10.3390/particles8020049 - 1 May 2025
Viewed by 470
Abstract
Particle detectors at accelerators generate large amounts of data, requiring analysis to derive insights. Collisions lead to signal pile-up, where multiple particles produce signals in the same detector sensors, complicating individual signal identification. This contribution describes the implementation of a deep-learning algorithm on a Versal Adaptive Compute Acceleration Platform (ACAP) device for improved processing via parallelization and concurrency. Connected to a host computer via Peripheral Component Interconnect express (PCIe), this system aims for enhanced speed and energy efficiency over Central Processing Units (CPUs) and Graphics Processing Units (GPUs). We describe in detail the data processing and the hardware, firmware, and software components of the system, including the implementation of the deep-learning algorithm on the Versal ACAP device and the mechanism for transferring data efficiently. Full article

17 pages, 421 KiB  
Article
CNN-Based End-to-End CPU-AP-UE Power Allocation for Spectral Efficiency Enhancement in Cell-Free Massive MIMO Networks
by Yoon-Ju Choi, Ji-Hee Yu, Seung-Hwan Seo, Seong-Gyun Choi, Hye-Yoon Jeong, Ja-Eun Kim, Myung-Sun Baek, Young-Hwan You and Hyoung-Kyu Song
Mathematics 2025, 13(9), 1442; https://doi.org/10.3390/math13091442 - 28 Apr 2025
Viewed by 569
Abstract
Cell-free massive multiple-input multiple-output (MIMO) networks eliminate cell boundaries and enhance uniform quality of service by enabling cooperative transmission among access points (APs). In conventional cellular networks, user equipment located at the cell edge experiences severe interference and unbalanced resource allocation. However, in cell-free massive MIMO networks, multiple access points cooperatively serve user equipment (UEs), effectively mitigating these issues. Beamforming and cooperative transmission among APs are essential in massive MIMO environments, making efficient power allocation a critical factor in determining overall network performance. In particular, considering power allocation from the central processing unit (CPU) to the APs enables optimal power utilization across the entire network. Traditional power allocation methods such as equal power allocation and max–min power allocation fail to fully exploit the cooperative characteristics of APs, leading to suboptimal network performance. To address this limitation, in this study we propose a convolutional neural network (CNN)-based power allocation model that optimizes both CPU-to-AP power allocation and AP-to-UE power distribution. The proposed model learns the optimal power allocation strategy by utilizing the channel state information, AP-UE distance, interference levels, and signal-to-interference-plus-noise ratio as input features. Simulation results demonstrate that the proposed CNN-based power allocation method significantly improves spectral efficiency compared to conventional power allocation techniques while also enhancing energy efficiency. This confirms that deep learning-based power allocation can effectively enhance network performance in cell-free massive MIMO environments. Full article
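A toy PyTorch sketch of a CNN that maps per-(AP, UE) input features to normalized power coefficients follows; the layer sizes, feature channels, and softmax budget normalization are illustrative assumptions, not the proposed architecture.

```python
# Toy CNN mapping per-(AP, UE) features (e.g., channel gain, distance,
# interference) to power-allocation coefficients. Shapes, layers, and the
# softmax power-budget normalization are assumptions for illustration only.
import torch
import torch.nn as nn

class PowerAllocCNN(nn.Module):
    def __init__(self, in_ch: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),
        )

    def forward(self, x):                      # x: (batch, in_ch, n_APs, n_UEs)
        logits = self.net(x).squeeze(1)        # (batch, n_APs, n_UEs)
        b, n_ap, n_ue = logits.shape
        # Softmax over all (AP, UE) pairs so coefficients sum to 1 per sample,
        # i.e., they split a single total power budget.
        return torch.softmax(logits.reshape(b, -1), dim=1).reshape(b, n_ap, n_ue)

x = torch.randn(2, 4, 16, 8)                   # 2 samples, 16 APs, 8 UEs
p = PowerAllocCNN()(x)
print(p.shape, float(p.sum(dim=(1, 2))[0]))    # (2, 16, 8), sums to 1.0
```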

35 pages, 6933 KiB  
Article
Matrix-Based ACO for Solving Parametric Problems Using Heterogeneous Reconfigurable Computers and SIMD Accelerators
by Vladimir Sudakov and Yuri Titov
Mathematics 2025, 13(8), 1284; https://doi.org/10.3390/math13081284 - 14 Apr 2025
Viewed by 491
Abstract
This paper presents a new matrix representation of ant colony optimization (ACO) for solving parametric problems. This representation allows us to perform calculations using matrix processors and single-instruction multiple-data (SIMD) calculators. To solve the problem of stagnation of the method without a priori information about the system, a new probabilistic formula for choosing the parameter value is proposed, based on the additive convolution of the number of pheromone weights and the number of visits to the vertex. The method can be performed as parallel calculations, which accelerates the process of determining the solution. However, the high speed of determining the solution should be correlated with the high speed of calculating the objective function, which can be difficult when using complex analytical and simulation models. Software has been developed in Python 3.12 and C/C++ 20 to study the proposed changes to the method. With parallel calculations, it is possible to separate the matrix modification of the method into SIMD and multiple-instruction multiple-data (MIMD) components and perform calculations on the appropriate equipment. According to the results of this research, when solving the problem of optimizing benchmark functions of various dimensions, it was possible to accelerate the method by more than 12 times on matrix SIMD central processing unit (CPU) accelerators. When calculating on the graphics processing unit (GPU), the acceleration was about six times due to the difficulties of implementing a pseudo-random number stream. The developed modifications were used to determine the optimal values of the SARIMA parameters when forecasting the volume of transportation by airlines of the Russian Federation. Mathematical dependencies of the acceleration factors on the algorithm parameters and the number of components were also determined, which allows us to estimate the possibilities of accelerating the algorithm by using a reconfigurable heterogeneous computer. Full article
(This article belongs to the Special Issue Optimization Algorithms, Distributed Computing and Intelligence)
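The matrix (SIMD-friendly) selection step can be illustrated with NumPy as below; the additive weighting constants and the exact combination rule are assumptions in the spirit of the paper's description, not its formula.

```python
# NumPy sketch of one matrix-form ACO selection step: per-parameter selection
# weights are an additive combination of pheromone levels and (inverse) visit
# counts, and all ants sample in a vectorized pass. Constants and the exact
# combination rule are assumptions, not the paper's formula.
import numpy as np

rng = np.random.default_rng(0)
n_values, n_params, n_ants = 16, 5, 32
pheromone = np.ones((n_values, n_params))
visits = np.zeros((n_values, n_params))
alpha, beta = 1.0, 0.5

# Additive combination -> per-parameter probability columns (SIMD-friendly).
weights = alpha * pheromone + beta / (1.0 + visits)
probs = weights / weights.sum(axis=0, keepdims=True)

# Each ant draws one value index per parameter from its column distribution.
choices = np.stack([rng.choice(n_values, size=n_ants, p=probs[:, j])
                    for j in range(n_params)], axis=1)
np.add.at(visits, (choices.ravel(), np.tile(np.arange(n_params), n_ants)), 1)
print(choices.shape, visits.sum())   # (32, 5) selections; 160 visit increments
```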

18 pages, 1821 KiB  
Article
Embedded Streaming Hardware Accelerators Interconnect Architectures and Latency Evaluation
by Cristian-Tiberius Axinte, Andrei Stan and Vasile-Ion Manta
Electronics 2025, 14(8), 1513; https://doi.org/10.3390/electronics14081513 - 9 Apr 2025
Viewed by 591
Abstract
In the age of hardware accelerators, increasing pressure is applied on computer architects and hardware engineers to improve the balance between the cost and benefits of specialized computing units, in contrast to more general-purpose architectures. The first part of this study presents the embedded Streaming Hardware Accelerator (eSAC) architecture. This architecture can reduce the idle time of specialized logic. The remainder of this paper explores the integration of an eSAC into a Central Processing Unit (CPU) core embedded inside a System-on-Chip (SoC) design, using the AXI-Stream protocol specification. The three evaluated architectures are the Tightly Coupled Streaming, Protocol Adapter FIFO, and Direct Memory Access (DMA) Streaming architectures. When comparing the tightly coupled architecture with the one including the DMA, the experiments in this paper show an almost 3× decrease in frame latency when using the DMA. Nevertheless, this comes at the price of an increase in FPGA resource utilization as follows: LUT (2.5×), LUTRAM (3×), FF (3.4×), and BRAM (1.2×). Four different test scenarios were run for the DMA architecture, showcasing the best and worst practices for data organization. The evaluation results highlight that poor data organization can lead to a more than 7× increase in latency. The newly released MicroBlaze-V soft-core processor was selected as the CPU. The designs presented herein successfully operate on a popular low-cost Field-Programmable Gate Array (FPGA) development board at 100 MHz. Block diagrams, FPGA resource utilization, and latency metrics are presented. Finally, based on the evaluation results, possible improvements are discussed. Full article
(This article belongs to the Section Computer Science & Engineering)

15 pages, 2874 KiB  
Article
Optimized Hybrid Central Processing Unit–Graphics Processing Unit Workflow for Accelerating Advanced Encryption Standard Encryption: Performance Evaluation and Computational Modeling
by Min Kyu Yang and Jae-Seung Jeong
Appl. Sci. 2025, 15(7), 3863; https://doi.org/10.3390/app15073863 - 1 Apr 2025
Viewed by 772
Abstract
This study addresses the growing demand for scalable data encryption by evaluating the performance of AES (Advanced Encryption Standard) encryption and decryption using CBC (Cipher Block Chaining) and CTR (Counter Mode) modes across various CPU (Central Processing Unit) and GPU (Graphics Processing Unit) hardware models. The objective is to highlight GPU acceleration benefits and propose an optimized hybrid CPU–GPU workflow for large-scale data security. Methods include benchmarking encryption performance with provided data, mathematical models, and computational analysis. The results indicate significant performance gains with GPU acceleration, particularly for large datasets, and demonstrate that the hybrid CPU–GPU approach balances speed and resource utilization efficiently. Full article
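The key difference between the two modes, CBC's serial chaining versus CTR's independent counter blocks, is what makes CTR attractive for GPU offload; a minimal PyCryptodome sketch with illustrative key, nonce, and data sizes follows.

```python
# Sketch contrasting CBC and CTR with PyCryptodome. CTR keystream blocks are
# independent, which is what makes the mode amenable to GPU parallelization,
# while CBC encryption is inherently serial because each block chains on the
# previous ciphertext. Key, nonce, and data sizes are illustrative.
from Crypto.Cipher import AES
from Crypto.Random import get_random_bytes

key = get_random_bytes(32)                    # AES-256
data = get_random_bytes(1 << 20)              # 1 MiB of test data

# CBC: serial chaining; data here is already a multiple of the 16-byte block size.
iv = get_random_bytes(16)
cbc_ct = AES.new(key, AES.MODE_CBC, iv=iv).encrypt(data)

# CTR: counter mode; each 16-byte block can be processed independently.
nonce = get_random_bytes(8)
ctr_ct = AES.new(key, AES.MODE_CTR, nonce=nonce).encrypt(data)

print(len(cbc_ct), len(ctr_ct))               # both 1 MiB of ciphertext
```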

25 pages, 3751 KiB  
Article
ORAN-HAutoscaling: A Scalable and Efficient Resource Optimization Framework for Open Radio Access Networks with Performance Improvements
by Sunil Kumar
Information 2025, 16(4), 259; https://doi.org/10.3390/info16040259 - 23 Mar 2025
Viewed by 871
Abstract
Open Radio Access Networks (ORANs) are transforming the traditional telecommunications landscape by offering more flexible, vendor-independent solutions. Unlike previous systems, which relied on rigid, vertical configurations, ORAN introduces network programmability that is AI-driven and horizontally scalable. This shift is facilitated by modern container orchestrators, such as Kubernetes and Red Hat OpenShift, which simplify the development and deployment of components such as gNB, CU/DU, and RAN Intelligent Controllers (RICs). While these advancements help reduce costs by enabling shared infrastructure, they also create new challenges in meeting ORAN's stringent latency requirements, especially when managing large-scale xApp deployments. Near-RT RICs are responsible for controlling xApps that must adhere to tight latency constraints, often less than one second. Current orchestration methods fail to meet these demands, as they lack the required scalability and suffer from long latencies. Additionally, non-API-based E2AP (over SCTP) further complicates the scaling process. To address these challenges, we introduce ORAN-HAutoscaling, a framework designed to enable horizontal scaling through Kubernetes. This framework ensures that latency constraints are met while supporting large-scale xApp deployments with optimal resource utilization. ORAN-HAutoscaling dynamically allocates and distributes xApps into scalable pods, ensuring that central processing unit (CPU) utilization remains efficient and latency is minimized, thus improving overall performance. Full article
(This article belongs to the Section Information Systems)
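Horizontal scaling driven by CPU utilization is typically expressed as a HorizontalPodAutoscaler object in Kubernetes; the sketch below builds an illustrative autoscaling/v2 manifest, with the deployment name, namespace, replica bounds, and target utilization assumed rather than taken from the paper.

```python
# Illustrative HorizontalPodAutoscaler (autoscaling/v2) manifest of the kind a
# CPU-utilization-driven xApp autoscaler could generate. The names, replica
# bounds, and the 70% target are assumptions, not values from the paper.
import yaml

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "xapp-autoscaler", "namespace": "near-rt-ric"},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": "xapp"},
        "minReplicas": 1,
        "maxReplicas": 10,
        "metrics": [{
            "type": "Resource",
            "resource": {"name": "cpu",
                         "target": {"type": "Utilization", "averageUtilization": 70}},
        }],
    },
}
print(yaml.safe_dump(hpa, sort_keys=False))   # apply with kubectl apply -f <file>
```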

25 pages, 1038 KiB  
Review
Review of Task-Scheduling Methods for Heterogeneous Chips
by Zujia Miao, Cuiping Shao, Huiyun Li and Zhimin Tang
Electronics 2025, 14(6), 1191; https://doi.org/10.3390/electronics14061191 - 18 Mar 2025
Viewed by 2045
Abstract
Heterogeneous chips, by integrating multiple processing units such as the central processing unit (CPU), graphics processing unit (GPU), and field-programmable gate array (FPGA), are capable of providing optimized processing power for different types of computational tasks. In modern computing environments, heterogeneous chips have gained increasing attention due to their superior performance. However, the performance of heterogeneous chips falls short of that of traditional chips without an appropriate task-scheduling method. This paper reviews the current research progress on task-scheduling methods for heterogeneous chips, focusing on key issues such as task-scheduling frameworks, scheduling algorithms, and experimental and evaluation methods. Research indicates that task scheduling has become a core technology for enhancing the performance of heterogeneous chips. However, in high-dimensional and complex application environments, the challenges of multi-objective and dynamic demands remain insufficiently addressed by existing scheduling methods. Furthermore, the current experimental and evaluation methods are still in the early stages, particularly in software-in-the-loop testing, where test scenarios are limited and there is a lack of standardized evaluation criteria. In the future, further exploration of scenario generation methods combining large-scale models and simulation platforms is required, along with efforts to establish standardized test scenario definitions and feasible evaluation metrics. In addition, in-depth research on the impact of artificial intelligence algorithms on task-scheduling methods should be conducted, emphasizing the complementary advantages of algorithms such as reinforcement learning. Full article
(This article belongs to the Section Electronic Materials, Devices and Applications)

7 pages, 12219 KiB  
Proceeding Paper
Fast Collision Detection Method with Octree-Based Parallel Processing in Unity3D
by Kunthroza Hor, Taeheon Kim and Min Hong
Eng. Proc. 2025, 89(1), 37; https://doi.org/10.3390/engproc2025089037 - 13 Mar 2025
Viewed by 854
Abstract
Performing accurate and precise collision detection is key to real-time applications in computer graphics, games, physics-based simulation, virtual reality, augmented reality, and research and development. Researchers have developed numerous methods to minimize computation time and enhance the accuracy of collision detection for pair-object collisions. Although the performance of the central processing unit (CPU) has improved significantly in recent years, it is still insufficient for many applications. In this study, we developed an improved algorithm for a geometric bounding volume hierarchy (BVH) over 3D spatial subdivisions using an Octree-based axis-aligned bounding box (AABB) structure. The AABB structure is used for collision detection, computed on both the CPU and the graphics processing unit (GPU), and is implemented in a compute shader in Unity3D. An AABB is defined by the minimum and maximum coordinates of an object, forming a hexahedron whose faces are parallel to the coordinate axes, and GPU computing is essential for enhancing performance. The proposed algorithm applies Octree AABB-based GPU parallel processing to reduce the computational cost of real-time collision detection and to handle multiple computations in parallel. In the CPU environment, the algorithm ran at 2.9 fps when simulating up to 20 objects of the Torus model, which contains 2.3 K vertices and 4.6 K triangles. In the GPU environment, it ran at 635.62 fps with 20 objects, and the maximum number of objects handled in real time increased to 180. Full article
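The broad-phase AABB overlap test at the heart of such methods is simple to state; a NumPy sketch follows, with box data generated randomly for illustration (this is not the paper's Unity3D compute-shader code).

```python
# Minimal AABB overlap test of the kind used in broad-phase collision checks,
# vectorized over many box pairs with NumPy. Boxes are (min_xyz, max_xyz).
import numpy as np

def aabb_overlap(min_a, max_a, min_b, max_b):
    """True where boxes a and b overlap on all three axes."""
    return np.all((min_a <= max_b) & (min_b <= max_a), axis=-1)

rng = np.random.default_rng(1)
centers = rng.uniform(0, 10, size=(1000, 3))
half = rng.uniform(0.1, 0.5, size=(1000, 3))
mins, maxs = centers - half, centers + half

# Check every box against box 0 (a broad-phase culling step before narrow phase).
hits = aabb_overlap(mins[0], maxs[0], mins, maxs)
print("candidate collisions with box 0:", int(hits.sum()) - 1)
```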

17 pages, 7698 KiB  
Article
Plant Disease Segmentation Networks for Fast Automatic Severity Estimation Under Natural Field Scenarios
by Chenyi Zhao, Changchun Li, Xin Wang, Xifang Wu, Yongquan Du, Huabin Chai, Taiyi Cai, Hengmao Xiang and Yinghua Jiao
Agriculture 2025, 15(6), 583; https://doi.org/10.3390/agriculture15060583 - 10 Mar 2025
Cited by 1 | Viewed by 1146
Abstract
The segmentation of plant disease images enables researchers to quantify the proportion of disease spots on leaves, known as disease severity. Current deep learning methods predominantly focus on single diseases, simple lesions, or laboratory-controlled environments. In this study, we established and publicly released image datasets of field scenarios for three diseases: soybean bacterial blight (SBB), wheat stripe rust (WSR), and cedar apple rust (CAR). We developed Plant Disease Segmentation Networks (PDSNets) based on LinkNet with ResNet-18 as the encoder, including three versions: ×1.0, ×0.75, and ×0.5. The ×1.0 version incorporates a 4 × 4 embedding layer to enhance prediction speed, while versions ×0.75 and ×0.5 are lightweight variants with reduced channel numbers within the same architecture. Their parameter counts are 11.53 M, 6.50 M, and 2.90 M, respectively. PDSNetx0.5 achieved an overall F1 score of 91.96%, an Intersection over Union (IoU) of 85.85% for segmentation, and a coefficient of determination (R2) of 0.908 for severity estimation. On a local central processing unit (CPU), PDSNetx0.5 demonstrated a prediction speed of 34.18 images (640 × 640 pixels) per second, which is 2.66 times faster than LinkNet. Our work provides an efficient and automated approach for assessing plant disease severity in field scenarios. Full article
(This article belongs to the Special Issue Computational, AI and IT Solutions Helping Agriculture)
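Severity estimation from a predicted segmentation mask reduces to a pixel ratio; the sketch below assumes hypothetical class indices (0 background, 1 leaf, 2 lesion) purely for illustration.

```python
# Sketch of severity estimation from a predicted segmentation mask: severity is
# the lesion-pixel fraction of the leaf area. Class indices are assumptions.
import numpy as np

def disease_severity(mask: np.ndarray, leaf_cls: int = 1, lesion_cls: int = 2) -> float:
    leaf_px = np.count_nonzero(mask == leaf_cls)
    lesion_px = np.count_nonzero(mask == lesion_cls)
    total = leaf_px + lesion_px          # lesions lie on the leaf
    return lesion_px / total if total else 0.0

demo = np.zeros((640, 640), dtype=np.uint8)
demo[100:500, 100:500] = 1               # leaf region
demo[200:300, 200:300] = 2               # lesion region
print(f"severity = {disease_severity(demo):.3f}")
```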

30 pages, 1684 KiB  
Article
Efficient GPU Implementation of the McMurchie–Davidson Method for Shell-Based ERI Computations
by Haruto Fujii, Yasuaki Ito, Nobuya Yokogawa, Kanta Suzuki, Satoki Tsuji, Koji Nakano, Victor Parque and Akihiko Kasagi
Appl. Sci. 2025, 15(5), 2572; https://doi.org/10.3390/app15052572 - 27 Feb 2025
Cited by 1 | Viewed by 1049
Abstract
Quantum chemistry offers the formal machinery to derive molecular and physical properties arising from (sub)atomic interactions. However, as molecules of practical interest are largely polyatomic, contemporary approximation schemes such as the Hartree–Fock scheme are computationally expensive due to the large number of electron repulsion integrals (ERIs). Central to the Hartree–Fock method is the efficient computation of ERIs over Gaussian functions (GTO-ERIs). Here, the well-known McMurchie–Davidson method (MD) offers an elegant formalism by incrementally extending Hermite Gaussian functions and auxiliary tabulated functions. Although the MD method offers a high degree of versatility to acceleration schemes through Graphics Processing Units (GPUs), the current GPU implementations limit the practical use of supported values of the azimuthal quantum number. In this paper, we propose a generalized framework capable of computing GTO-ERIs for arbitrary azimuthal quantum numbers, provided that the intermediate terms of the MD method can be stored. Our approach benefits from extending the MD recurrence relations through shells, batches, and triple-buffering of the shared memory, and ordering similar ERIs, thus enabling the effective parallelization and use of GPU resources. Furthermore, our approach proposes four GPU implementation schemes considering the suitable mappings between Gaussian basis and CUDA blocks and threads. Our computational experiments involving the GTO-ERI computations of molecules of interest on an NVIDIA A100 Tensor Core GPU (NVIDIA, Santa Clara, CA, USA) have revealed the merits of the proposed acceleration schemes in terms of computation time, including up to a 72× improvement over our previous GPU implementation and up to a 4500× speedup compared to a naive CPU implementation, highlighting the effectiveness of our method in accelerating ERI computations for both monatomic and polyatomic molecules. Our work has the potential to explore new parallelization schemes of distinct and complex computation paths involved in ERI computation. Full article
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
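The underlying 1D Hermite expansion coefficients E_t^{ij} obey a standard McMurchie-Davidson recurrence; a scalar Python reference is sketched below. The paper's contribution lies in batching and parallelizing such terms on the GPU, not in the recurrence itself.

```python
# Scalar reference for the McMurchie-Davidson recurrence giving the Hermite
# expansion coefficients E_t^{ij} of a product of two 1D Cartesian Gaussians
# with exponents a, b separated by Qx = Ax - Bx. Textbook recursion only.
import math

def hermite_E(i: int, j: int, t: int, Qx: float, a: float, b: float) -> float:
    p = a + b
    q = a * b / p
    if t < 0 or t > i + j:
        return 0.0                               # coefficient vanishes out of range
    if i == j == t == 0:
        return math.exp(-q * Qx * Qx)            # pre-exponential (overlap) factor
    if j == 0:
        # decrement index i
        return ((1.0 / (2.0 * p)) * hermite_E(i - 1, j, t - 1, Qx, a, b)
                - (q * Qx / a) * hermite_E(i - 1, j, t, Qx, a, b)
                + (t + 1) * hermite_E(i - 1, j, t + 1, Qx, a, b))
    # decrement index j
    return ((1.0 / (2.0 * p)) * hermite_E(i, j - 1, t - 1, Qx, a, b)
            + (q * Qx / b) * hermite_E(i, j - 1, t, Qx, a, b)
            + (t + 1) * hermite_E(i, j - 1, t + 1, Qx, a, b))

print(hermite_E(1, 1, 0, 0.5, 1.2, 0.8))         # example coefficient
```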

27 pages, 1081 KiB  
Article
ConBOOM: A Configurable CPU Microarchitecture for Speculative Covert Channel Mitigation
by Zhewen Zhang, Yao Liu, Yuhan She, Abdurrashid Ibrahim Sanka, Patrick S. Y. Hung and Ray C. C. Cheung
Electronics 2025, 14(5), 850; https://doi.org/10.3390/electronics14050850 - 21 Feb 2025
Viewed by 1971
Abstract
Speculative execution attacks are serious security problems that cause information leakage in computer systems by building speculative covert channels. Hardware defenses mitigate speculative covert channels through microarchitectural changes. However, two main limitations become the major bottleneck in existing hardware defenses. High-security hardware defenses, such as eager delay, can effectively mitigate both known and unknown covert channels; however, these defenses incur high performance overhead due to the long, fixed delayed execution applied in all potential attack scenarios. In contrast, hardware defenses with low performance overhead are faster and can mitigate known covert channels, but lack sufficient security to mitigate unknown covert channels. These limitations indicate that it is difficult to achieve better security and performance of a processor against speculative execution attacks using a single defense method. In this paper, we propose ConBOOM, a configurable central processing unit (CPU) microarchitecture that provides optimized switchable hardware defensive modes, including the high-security eager delay mode and two proposed performance-optimized modes based on the anticipated attack scenarios. The defensive modes allow for flexibility in mitigating different speculative execution attacks with better performance, unlike existing defenses that have a fixed performance overhead for all attack scenarios. The ConBOOM modes can be switched without modifying the hardware, and switching ConBOOM to the suitable mode for the anticipated attack scenario is achieved through the provided software configuration interface. We implemented ConBOOM on Berkeley's RISC-V out-of-order processor core (SonicBOOM) and evaluated it on the VCU118 FPGA platform. Compared to an existing representative work with a fixed performance overhead of 39.1%, ConBOOM has a lower performance overhead, ranging between 15.1% and 39.1%, when mitigating different attack scenarios. ConBOOM provides more defensive flexibility, with a negligible hardware resource overhead of about 2.0%, and good security. Full article
