Search Results (828)

Search Parameters:
Keywords = speedup

25 pages, 876 KB  
Article
Multi-Scale Digital Twin Framework with Physics-Informed Neural Networks for Real-Time Optimization and Predictive Control of Amine-Based Carbon Capture: Development, Experimental Validation, and Techno-Economic Assessment
by Mansour Almuwallad
Processes 2026, 14(3), 462; https://doi.org/10.3390/pr14030462 (registering DOI) - 28 Jan 2026
Abstract
Carbon capture and storage (CCS) is essential for achieving net-zero emissions, yet amine-based capture systems face significant challenges including high energy penalties (20–30% of power plant output) and operational costs ($50–120/tonne CO2). This study develops and validates a novel multi-scale Digital Twin (DT) framework integrating Physics-Informed Neural Networks (PINNs) to address these challenges through real-time optimization. The framework combines molecular dynamics, process simulation, computational fluid dynamics, and deep learning to enable real-time predictive control. A key innovation is the sequential training algorithm with domain decomposition, specifically designed to handle the nonlinear transport equations governing CO2 absorption with enhanced convergence properties. The algorithm achieves prediction errors below 1% for key process variables (R2 > 0.98) when validated against CFD simulations across 500 test cases. Experimental validation against pilot-scale absorber data (12 m packing, 30 wt% MEA) confirms good agreement with measured profiles, including temperature (RMSE = 1.2 K), CO2 loading (RMSE = 0.015 mol/mol), and capture efficiency (RMSE = 0.6%). The trained surrogate enables computational speedups of up to four orders of magnitude, supporting real-time inference with response times below 100 ms suitable for closed-loop control. Under the conditions studied, the framework demonstrates reboiler duty reductions of 18.5% and operational cost reductions of approximately 31%. Sensitivity analysis identifies liquid-to-gas ratio and MEA concentration as the most influential parameters, with mechanistic explanations linking these to mass transfer enhancement and reaction kinetics. Techno-economic assessment indicates favorable investment metrics, though results depend on site-specific factors. The framework architecture is designed for extensibility to alternative solvent systems, with future work planned for industrial-scale validation and uncertainty quantification through Bayesian approaches. Full article
(This article belongs to the Section Petroleum and Low-Carbon Energy Process Engineering)
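The physics-informed training step described in this abstract can be illustrated with a minimal sketch, which is not the authors' multi-scale framework: a small network u(x, t) is penalized on the residual of a toy 1D advection equation via automatic differentiation. The equation, network size, and coefficients below are illustrative assumptions only.

```python
# Minimal physics-informed loss sketch (illustrative only; not the paper's
# multi-scale digital twin). Assumes a toy 1D advection equation u_t + c*u_x = 0.
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
c = 1.0  # assumed advection speed

def pde_residual(x, t):
    """Residual of u_t + c*u_x at collocation points, via autograd."""
    x.requires_grad_(True)
    t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_x, u_t = torch.autograd.grad(u, (x, t), grad_outputs=torch.ones_like(u),
                                   create_graph=True)
    return u_t + c * u_x

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(200):
    x = torch.rand(256, 1)          # collocation points in space
    t = torch.rand(256, 1)          # collocation points in time
    x0 = torch.rand(64, 1)          # initial-condition points
    u0 = torch.sin(torch.pi * x0)   # assumed initial profile
    loss = pde_residual(x, t).pow(2).mean() \
         + (net(torch.cat([x0, torch.zeros_like(x0)], 1)) - u0).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```
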
26 pages, 2618 KB  
Article
A Cascaded Batch Bayesian Yield Optimization Method for Analog Circuits via Deep Transfer Learning
by Ziqi Wang, Kaisheng Sun and Xiao Shi
Electronics 2026, 15(3), 516; https://doi.org/10.3390/electronics15030516 - 25 Jan 2026
Viewed by 138
Abstract
In nanometer integrated-circuit (IC) manufacturing, advanced technology scaling has intensified the effects of process variations on circuit reliability and performance. Random fluctuations in parameters such as threshold voltage, channel length, and oxide thickness further degrade design margins and increase the likelihood of functional failures. These variations often lead to rare circuit failure events, underscoring the importance of accurate yield estimation and robust design methodologies. Conventional Monte Carlo yield estimation is computationally infeasible as millions of simulations are required to capture failure events with extremely low probability. This paper presents a novel reliability-based circuit design optimization framework that leverages deep transfer learning to improve the efficiency of repeated yield analysis in optimization iterations. Based on pre-trained neural network models from prior design knowledge, we utilize model fine-tuning to accelerate importance sampling (IS) for yield estimation. To improve estimation accuracy, adversarial perturbations are introduced to calibrate uncertainty near the model decision boundary. Moreover, we propose a cascaded batch Bayesian optimization (CBBO) framework that incorporates a smart initialization strategy and a localized penalty mechanism, guiding the search process toward high-yield regions while satisfying nominal performance constraints. Experimental validation on SRAM circuits and amplifiers reveals that CBBO achieves a computational speedup of 2.02×–4.63× over state-of-the-art (SOTA) methods, without compromising accuracy or robustness. Full article
(This article belongs to the Topic Advanced Integrated Circuit Design and Application)
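As background for the yield-estimation step, the contrast between plain Monte Carlo and importance sampling can be sketched on a toy one-dimensional failure model; the shifted Gaussian proposal below is an assumption for illustration, not the paper's fine-tuned neural sampler or CBBO loop.

```python
# Toy yield estimation: Monte Carlo vs. importance sampling for a rare failure
# event (illustrative of the general IS idea, not the paper's CBBO framework).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
fail = lambda x: x > 4.0          # assumed failure region under N(0,1) variations
p_true = 1.0 - stats.norm.cdf(4.0)

# Plain Monte Carlo: most samples miss the rare failure region.
x_mc = rng.standard_normal(100_000)
p_mc = fail(x_mc).mean()

# Importance sampling: draw from a proposal shifted toward the failure region
# and reweight by the likelihood ratio p(x)/q(x).
mu_q = 4.0                         # assumed proposal shift
x_is = rng.normal(mu_q, 1.0, 10_000)
w = stats.norm.pdf(x_is) / stats.norm.pdf(x_is, loc=mu_q)
p_is = np.mean(fail(x_is) * w)

print(f"true={p_true:.2e}  MC={p_mc:.2e}  IS={p_is:.2e}")
```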

33 pages, 18247 KB  
Article
Learning Debris Flow Dynamics with a Deep Learning Fourier Neural Operator: Application to the Rendinara–Morino Area
by Mauricio Secchi, Antonio Pasculli, Massimo Mangifesta and Nicola Sciarra
Geosciences 2026, 16(2), 55; https://doi.org/10.3390/geosciences16020055 - 24 Jan 2026
Viewed by 111
Abstract
Accurate numerical simulation of debris flows is essential for hazard assessment and early-warning design, yet high-fidelity solvers remain computationally expensive, especially when large ensembles must be explored under epistemic uncertainty in rheology, initial conditions, and topography. At the same time, field observations are typically sparse and heterogeneous, limiting purely data-driven approaches. In this work, we develop a deep-learning Fourier Neural Operator (FNO) as a fast, physics-consistent surrogate for one-dimensional shallow-water debris-flow simulations and demonstrate its application to the Rendinara–Morino system in central Italy. A validated finite-volume solver, equipped with HLLC and Rusanov fluxes, hydrostatic reconstruction, Voellmy-type basal friction, and robust wet–dry treatment, is used to generate a large ensemble of synthetic simulations over longitudinal profiles representative of the study area. The parameter space of bulk density, initial flow thickness, and Voellmy friction coefficients is systematically sampled, and the resulting space–time fields of flow depth and velocity form the training dataset. A two-dimensional FNO in the (x,t) domain is trained to learn the full solution operator, mapping topography, rheological parameters, and initial conditions directly to h(x,t) and u(x,t), thereby acting as a site-specific digital twin of the numerical solver. On a held-out validation set, the surrogate achieves mean relative L2 errors of about 6–7% for flow depth and 10–15% for velocity, and it generalizes to an unseen longitudinal profile with comparable accuracy. We further show that targeted reweighting of the training objective significantly improves the prediction of the velocity field without degrading depth accuracy, reducing the velocity error on the unseen profile by more than a factor of two. Finally, the FNO provides speed-ups of approximately 36× with respect to the reference solver at inference time. These results demonstrate that combining physics-based synthetic data with operator-learning architectures enables the construction of accurate, computationally efficient, and site-adapted surrogates for debris-flow hazard analysis in data-scarce environments. Full article
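The core of a Fourier Neural Operator is a spectral convolution that mixes channels on a truncated set of Fourier modes; a minimal 1D layer is sketched below. The channel counts and mode truncation are illustrative assumptions, not the paper's 2D (x, t) configuration.

```python
# Minimal 1D spectral convolution layer in the spirit of an FNO
# (illustrative sketch; the paper uses a 2D (x,t) operator).
import torch

class SpectralConv1d(torch.nn.Module):
    def __init__(self, in_ch, out_ch, modes):
        super().__init__()
        self.modes = modes  # number of low-frequency Fourier modes kept
        scale = 1.0 / (in_ch * out_ch)
        self.weight = torch.nn.Parameter(
            scale * torch.randn(in_ch, out_ch, modes, dtype=torch.cfloat))

    def forward(self, x):                 # x: (batch, in_ch, n_points)
        x_ft = torch.fft.rfft(x, dim=-1)  # to Fourier space
        out_ft = torch.zeros(x.shape[0], self.weight.shape[1],
                             x_ft.shape[-1], dtype=torch.cfloat)
        # Mix channels mode-by-mode on the retained low frequencies only.
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weight)
        return torch.fft.irfft(out_ft, n=x.shape[-1], dim=-1)

layer = SpectralConv1d(in_ch=3, out_ch=3, modes=16)
h = layer(torch.randn(8, 3, 128))         # e.g. depth/velocity/topography channels
print(h.shape)                             # torch.Size([8, 3, 128])
```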

26 pages, 3967 KB  
Article
A General-Purpose AXI Plug-and-Play Hyperdimensional Computing Accelerator
by Rocco Martino, Marco Pisani, Marco Angioli, Marcello Barbirotta, Antonio Mastrandrea, Antonello Rosato and Mauro Olivieri
Electronics 2026, 15(2), 489; https://doi.org/10.3390/electronics15020489 - 22 Jan 2026
Viewed by 62
Abstract
Hyperdimensional Computing (HDC) offers a robust and energy-efficient paradigm for edge intelligence; however, current hardware accelerators are often proprietary, tailored to the target learning task and tightly coupled to specific CPU microarchitectures, limiting portability and adoption. To address this and democratize the deployment of HDC hardware, we present a general-purpose, plug-and-play accelerator IP that implements the Binary Spatter Code framework as a standalone, host-agnostic module. The design is compliant with the AMBA AXI4 standard and provides an AXI4-Lite control plane and DMA-driven AXI4-Stream datapaths coupled to a banked scratchpad memory. The architecture supports synthesis-time scalability, enabling high-throughput transfers independently of the host processor, while employing microarchitectural optimizations to minimize silicon area. A multi-layer C++ software stack (GitHub repository commit 3ae3b46) running in Linux userspace provides a unified programming model, abstracting low-level hardware interactions and enabling the composition of complex HDC pipelines. Implemented on a Xilinx Zynq XC7Z020 SoC, the accelerator achieves substantial gains over an ARM Cortex-A9 baseline, with primitive-level speedups of up to 431×. On end-to-end classification benchmarks, the system delivers average speedups of 68.45× for training and 93.34× for inference. The complete RTL and software stack are released as open-source hardware to support reproducible research and rapid adoption on heterogeneous SoCs. Full article
(This article belongs to the Special Issue Hardware Acceleration for Machine Learning)
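The Binary Spatter Code primitives that such an accelerator offloads (binding, bundling, and similarity search) can be sketched in a few lines of plain software; the hypervector dimensionality and item counts below are illustrative assumptions.

```python
# Binary Spatter Code primitives in software (conceptual reference for what the
# accelerator offloads; dimensions are illustrative).
import numpy as np

rng = np.random.default_rng(0)
D = 8192                                   # assumed hypervector dimensionality

def random_hv(n=1):
    return rng.integers(0, 2, size=(n, D), dtype=np.uint8)

def bind(a, b):                            # element-wise XOR
    return np.bitwise_xor(a, b)

def bundle(hvs):                           # bit-wise majority vote
    return (hvs.sum(axis=0) * 2 > hvs.shape[0]).astype(np.uint8)

def hamming_sim(a, b):                     # 1 - normalized Hamming distance
    return 1.0 - np.count_nonzero(a != b, axis=-1) / D

keys, values = random_hv(3), random_hv(3)
record = bundle(bind(keys, values))        # bundle of key*value bindings
probe = bind(record[None, :], keys[0:1])   # unbind with key 0 -> noisy value 0
print(hamming_sim(probe, values))          # highest similarity at index 0
```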

19 pages, 1481 KB  
Article
GPU-Accelerated FLIP Fluid Simulation Based on Spatial Hashing Index and Thread Block-Level Cooperation
by Changjun Zou and Hui Luo
Modelling 2026, 7(1), 27; https://doi.org/10.3390/modelling7010027 - 21 Jan 2026
Viewed by 115
Abstract
The Fluid Implicit Particle (FLIP) method is widely adopted in fluid simulation due to its computational efficiency and low dissipation. However, its high computational complexity makes it challenging for traditional CPU architectures to meet real-time requirements. To address this limitation, this work migrates the FLIP method to the GPU using the CUDA framework, achieving a transition from conventional CPU computation to large-scale GPU parallel computing. Furthermore, during particle-to-grid (P2G) mapping, the conventional scattering strategy suffers from significant performance bottlenecks due to frequent atomic operations. To overcome this challenge, we propose a GPU parallelization strategy based on spatial hashing indexing and thread block-level cooperation. This approach effectively avoids atomic contention and significantly enhances parallel efficiency. Through diverse fluid simulation experiments, the proposed GPU-parallelized strategy achieves a nearly 50× speedup ratio compared to the conventional CPU-FLIP method. Additionally, in the P2G stage, our method demonstrates over 30% performance improvement relative to the traditional GPU-based particle-thread scattering strategy, while the overall simulation efficiency gain exceeds 20%. Full article
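The contention-free particle-to-grid idea can be illustrated with a CPU analogue: instead of scattering with atomic adds, particles are sorted by their cell hash and each cell's contributions are reduced in one segmented pass. The 1D nearest-cell weighting below is a deliberate simplification of the actual FLIP kernel.

```python
# CPU analogue of a gather-style P2G: sort particles by cell index, then do a
# segmented reduction per cell (avoids the atomic adds of naive scattering).
# Simplified to 1D nearest-cell deposition for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_particles, n_cells, dx = 100_000, 256, 1.0 / 256
pos = rng.random(n_particles)                      # particle positions in [0,1)
vel = rng.standard_normal(n_particles)             # particle velocities

cell = np.minimum((pos / dx).astype(np.int64), n_cells - 1)
order = np.argsort(cell, kind="stable")            # group particles by cell
cell_sorted, vel_sorted = cell[order], vel[order]

# Start index of each occupied cell's segment in the sorted arrays.
occupied, starts = np.unique(cell_sorted, return_index=True)
mom = np.zeros(n_cells)
cnt = np.zeros(n_cells)
mom[occupied] = np.add.reduceat(vel_sorted, starts)            # summed momentum
cnt[occupied] = np.add.reduceat(np.ones_like(vel_sorted), starts)

grid_vel = np.divide(mom, cnt, out=np.zeros(n_cells), where=cnt > 0)
print(grid_vel[:5])
```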

14 pages, 1255 KB  
Article
Real-Time Control of Six-DOF Serial Manipulators via Learned Spherical Kinematics
by Meher Madhu Dharmana and Pramod Sreedharan
Robotics 2026, 15(1), 27; https://doi.org/10.3390/robotics15010027 - 21 Jan 2026
Viewed by 98
Abstract
Achieving reliable and real-time inverse kinematics (IK) for 6-degree-of-freedom (6-DoF) spherical-wrist manipulators remains a significant challenge. Analytical formulations often struggle with complex geometries and modeling errors, and standard numerical solvers (e.g., Levenberg–Marquardt) can stall near singularities or converge slowly. Purely data-driven approaches may require large networks and struggle with extrapolation. In this paper, we propose a low-latency, polynomial-based IK solution for spherical-wrist robots. The method leverages spherical coordinates and low-degree polynomial fits for the first three (positional) joints, coupled with a closed-form analytical solver for the final three (wrist) joints. An iterative partial-derivative refinement adjusts the polynomial-based angle estimates using spherical-coordinate errors, ensuring near-zero final error without requiring a full Jacobian matrix. The method systematically enumerates up to eight valid IK solutions per target pose. Our experiments against Levenberg–Marquardt, damped least-squares, and an fmincon baseline show an approximate 8.1× speedup over fmincon while retaining higher accuracy and multi-branch coverage. Future extensions include enhancing robustness through uncertainty propagation, adapting the approach to non-spherical wrists, and developing criteria-based automatic solution-branch selection. Full article
(This article belongs to the Section Intelligent Robots and Mechatronics)
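The combination of a low-degree polynomial fit with a derivative-based refinement step can be shown on a toy two-link planar arm, where the elbow angle is recovered from the target radius; the link lengths, polynomial degree, and single Newton-style update below are illustrative assumptions, not the paper's 6-DoF spherical-wrist formulation.

```python
# Toy illustration of "polynomial fit + iterative derivative refinement" on a
# 2-link planar arm: recover the elbow angle q2 from the target radius r.
# (Illustrative assumptions only; the paper treats full 6-DoF spherical-wrist IK.)
import numpy as np

l1, l2 = 0.4, 0.3                                  # assumed link lengths [m]

def radius(q2):                                    # forward map q2 -> reach radius
    return np.sqrt(l1**2 + l2**2 + 2 * l1 * l2 * np.cos(q2))

def d_radius(q2):                                  # analytic partial derivative
    return -l1 * l2 * np.sin(q2) / radius(q2)

# Offline: fit a low-degree polynomial approximating the inverse map r -> q2.
q_samples = np.linspace(0.05, np.pi - 0.05, 200)
coeffs = np.polyfit(radius(q_samples), q_samples, deg=5)

def ik_elbow(r_target, iters=2):
    q2 = np.polyval(coeffs, r_target)              # fast polynomial initial guess
    for _ in range(iters):                         # Newton-style refinement
        q2 -= (radius(q2) - r_target) / d_radius(q2)
    return q2

r_t = radius(1.1)                                  # a reachable target radius
print(ik_elbow(r_t), abs(radius(ik_elbow(r_t)) - r_t))   # near 1.1, ~0 error
```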

26 pages, 663 KB  
Article
Energy–Performance Trade-Offs of LU Matrix Decomposition in Java Across Heterogeneous Hardware and Operating Systems
by Francisco J. Rosa, Juan Carlos de la Torre, José M. Aragón-Jurado, Alberto Valderas-González and Bernabé Dorronsoro
Appl. Sci. 2026, 16(2), 1002; https://doi.org/10.3390/app16021002 - 19 Jan 2026
Viewed by 105
Abstract
The increasing core counts and architectural heterogeneity of modern processors make performance optimization insufficient if energy consumption is not simultaneously considered. By providing a novel characterization of how the interaction between hybrid architectures and system software disrupts the traditional correlation between execution speed and energy efficiency, this study analyzes the performance–energy trade-offs of parallel LU matrix decomposition algorithms implemented in Java, focusing on the Crout and Doolittle variants. The study is conducted on four different platforms, including ARM-based, hybrid x86, and many-core accelerators. Execution time and speedup are evaluated for varying thread counts, while energy consumption is measured externally to capture whole-system energy usage. Experimental results show that the configuration yielding the maximum speedup does not necessarily minimize energy consumption. While x86 systems showed energy savings exceeding 80% under optimal parallel configurations, the ARM-based platform required distinct thread counts to minimize energy consumption compared with maximizing speed. These findings demonstrate that energy-efficient configurations represent a distinct optimization space that often contradicts traditional performance metrics. In the era of hybrid computing, green software optimization must transition from a simplistic “race-to-sleep” paradigm toward sophisticated, architecture-aware strategies that account for the specific power profiles of heterogeneous cores to achieve truly sustainable high-performance computing. Full article
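For reference, the Doolittle variant evaluated in the study factorizes A = LU with a unit-diagonal L; a compact single-threaded sketch of the operation structure is given below in Python purely for illustration, while the study's parallel implementations are in Java.

```python
# Doolittle LU decomposition (A = L @ U with unit-diagonal L). Sequential
# reference sketch of the algorithm structure; the study's parallel versions
# are implemented in Java.
import numpy as np

def doolittle_lu(A):
    n = A.shape[0]
    L = np.eye(n)
    U = np.zeros_like(A, dtype=float)
    for i in range(n):
        # Row i of U uses previously computed parts of L and U.
        for j in range(i, n):
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
        # Column i of L (below the diagonal).
        for j in range(i + 1, n):
            L[j, i] = (A[j, i] - L[j, :i] @ U[:i, i]) / U[i, i]
    return L, U

A = np.array([[4., 3., 2.], [6., 3., 1.], [8., 8., 9.]])
L, U = doolittle_lu(A)
print(np.allclose(L @ U, A))   # True (no pivoting; assumes nonzero pivots)
```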

22 pages, 5031 KB  
Article
Data-Driven Prediction of Stress–Strain Fields Around Interacting Mining Excavations in Jointed Rock: A Comparative Study of Surrogate Models
by Anatoliy Protosenya and Alexey Ivanov
Mining 2026, 6(1), 4; https://doi.org/10.3390/mining6010004 - 16 Jan 2026
Viewed by 130
Abstract
Assessing the stress–strain state around interacting mining excavations using the finite element method (FEM) is computationally expensive for parametric studies. This study evaluates tabular machine-learning surrogate models for the rapid prediction of full stress–strain fields in fractured rock masses treated as an equivalent continuum. A dataset of 1000 parametric FEM simulations using the elastoplastic generalized Hoek–Brown constitutive model was generated to train Random Forest, LightGBM, CatBoost, and Multilayer Perceptron (MLP) models based on geometric features. The results show that the best models achieve R2 scores of 0.96–0.97 for stress components and 0.99 for total displacements. LightGBM and CatBoost provide the optimal balance between accuracy and computational cost, offering speed-ups of 15 to 70 times compared to FEM. While Random Forest yields slightly higher accuracy, it is resource-intensive. Conversely, MLP is the fastest but less accurate. These findings demonstrate that data-driven surrogates can effectively replace repeated FEM simulations, enabling efficient parametric analysis and intelligent design optimization for mine workings. Full article
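The surrogate-modeling workflow (geometric features in, field quantities out) follows the usual tabular-regression pattern; a minimal sketch with a random-forest regressor on synthetic stand-in features is shown below. The feature names, target, and data are assumptions for illustration only, whereas the study trains on 1000 parametric FEM simulations and also evaluates LightGBM, CatBoost, and an MLP.

```python
# Tabular surrogate sketch: geometric features -> predicted field quantity.
# Synthetic stand-in data; the study trains on parametric FEM results.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.uniform(2.0, 6.0, n),      # assumed feature: excavation span [m]
    rng.uniform(1.0, 10.0, n),     # assumed feature: pillar width [m]
    rng.uniform(100.0, 800.0, n),  # assumed feature: depth [m]
])
# Synthetic target standing in for a stress component from FEM.
y = 0.027 * X[:, 2] * (1.0 + X[:, 0] / X[:, 1]) + rng.normal(0, 0.5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R2:", r2_score(y_te, model.predict(X_te)))
```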

14 pages, 5251 KB  
Article
Facade Unfolding and GANs for Rapid Visual Prediction of Indoor Daylight Autonomy
by Jiang An, Jiuhong Zhang, Xiaomeng Si, Mingxiao Ma, Chen Du, Xiaoqian Zhang, Longxuan Che and Zhiyuan Lin
Buildings 2026, 16(2), 351; https://doi.org/10.3390/buildings16020351 - 14 Jan 2026
Viewed by 212
Abstract
Achieving optimal daylighting is a cornerstone of sustainable architectural design, impacting energy efficiency and occupant well-being. Fast and accurate prediction during the conceptual phase is crucial but challenging. While physics-based simulations are accurate but slow, existing machine learning methods often rely on restrictive parametric inputs, limiting their application across free-form designs. This study presents a novel, geometry-agnostic framework that uses only building facade unfolding diagrams as input to a Generative Adversarial Network (GAN). Our core innovation is a 2D representation that preserves 3D facade geometry and orientation by “unfolding” it onto the floor plan, eliminating the need for predefined parameters or intermediate features during prediction. A Pix2pixHD model was trained, validated, and tested on a total of 720 paired diagram-simulation images (split 80:10:10). The model achieves high-fidelity visual predictions, with a mean Structural Similarity Index (SSIM) of 0.93 against RADIANCE/Daysim benchmarks. When accounting for the practical time of diagram drafting, the complete workflow offers a speedup of approximately 1.5 to 52 times compared to conventional simulation. This work provides architects with an intuitive, low-threshold tool for rapid daylight performance feedback in early-stage design exploration. Full article
(This article belongs to the Special Issue Daylighting and Environmental Interactions in Building Design)
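The reported fidelity metric is the Structural Similarity Index between predicted and simulated daylight maps; evaluating it is essentially a one-liner, sketched below on placeholder arrays (the study compares GAN outputs against RADIANCE/Daysim benchmarks).

```python
# SSIM between a predicted daylight-autonomy map and a simulation benchmark
# (placeholder arrays; the study compares Pix2pixHD outputs with Daysim maps).
import numpy as np
from skimage.metrics import structural_similarity

rng = np.random.default_rng(0)
benchmark = rng.random((256, 256)).astype(np.float32)     # stand-in simulation map
predicted = np.clip(benchmark + rng.normal(0, 0.05, benchmark.shape), 0, 1)

score = structural_similarity(benchmark, predicted.astype(np.float32),
                              data_range=1.0)
print(f"SSIM = {score:.3f}")
```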

46 pages, 3979 KB  
Article
GeoMIP: A Geometric-Topological and Dynamic Programming Framework for Enhanced Computational Tractability of Minimum Information Partition in Integrated Information Theory
by Jaime Díaz-Arancibia, Luz Enith Guerrero, Jeferson Arango-López, Luis Fernando Castillo and Ana Bustamante-Mora
Appl. Sci. 2026, 16(2), 809; https://doi.org/10.3390/app16020809 - 13 Jan 2026
Viewed by 204
Abstract
The computational tractability of Integrated Information Theory (IIT) is fundamentally constrained by the exponential cost of identifying the Minimum Information Partition (MIP), which is required to quantify integrated information (Φ). Existing approaches become impractical beyond ~15–20 variables, limiting IIT analyses on realistic neural and complex systems. We introduce GeoMIP, a geometric–topological framework that recasts the MIP search as a graph-based optimization problem on the n-dimensional hypercube graph: discrete system states are modeled as graph vertices, and Hamming distance adjacency defines edges and shortest-path structures. Building on a tensor-decomposed representation of the transition probabilities, GeoMIP constructs a transition-cost (ground cost) structure by dynamic programming over graph neighborhoods and BFS-like exploration by Hamming levels, exploiting hypercube symmetries to reduce redundant evaluations. We validate GeoMIP against PyPhi, ensuring reliability of MIP identification and Φ computation. Across multiple implementations, GeoMIP achieves 165–326× speedups over PyPhi while maintaining 98–100% agreement in partition identification. Heuristic extensions further enable analyses up to ~25 variables, substantially expanding the practical IIT regime. Overall, by leveraging the hypercube’s explicit graph structure (vertices, edges, shortest paths, and automorphisms), GeoMIP turns an intractable combinatorial search into a scalable graph-based procedure for IIT partitioning. Full article
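The hypercube view used here treats each n-bit system state as a vertex whose neighbors differ in exactly one bit; a breadth-first sweep by Hamming level is then a natural traversal order, as sketched below. The partition scoring and Φ evaluation of GeoMIP are not reproduced, only the graph exploration.

```python
# BFS over the n-dimensional hypercube graph by Hamming levels: states are
# bit-strings, edges connect states at Hamming distance 1. (Exploration only;
# the partition/Phi evaluation of GeoMIP is not reproduced here.)
from collections import deque

def hypercube_bfs_levels(n, start=0):
    """Return {Hamming level: states} reachable from `start`."""
    seen = {start}
    levels = {0: [start]}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        for bit in range(n):                 # flip each bit -> a neighbor vertex
            nxt = state ^ (1 << bit)
            if nxt not in seen:
                seen.add(nxt)
                levels.setdefault(dist + 1, []).append(nxt)
                queue.append((nxt, dist + 1))
    return levels

levels = hypercube_bfs_levels(n=4)
for d, states in sorted(levels.items()):
    print(d, [format(s, "04b") for s in states])   # level sizes follow C(4, d)
```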

18 pages, 3457 KB  
Article
Parallel Optimization for Coupled Lattice Boltzmann-Finite Volume Method on Heterogeneous Many-Core Supercomputer
by Xiaojing Lv, Chengsheng Wu, Zhao Liu, Yujing Fan, Jianchun Wang, Yaying Zhang, Yixing Jin and Xuesen Chu
Appl. Sci. 2026, 16(2), 721; https://doi.org/10.3390/app16020721 - 9 Jan 2026
Viewed by 260
Abstract
Various coupling strategies have been developed to combine the strengths of different numerical methods in computational fluid dynamics (CFD), among which the coupled algorithm of the lattice Boltzmann-finite volume method (LBM-FVM) has gained widespread attention. However, research on parallel optimization of LBM-FVM coupled solvers remains limited, with most work focused on independent solvers. In this work, we propose a flexible framework and optimization schemes to explore the coordinated balance of accuracy, efficiency, and hardware adaptability. First, we designed a processor layout strategy to address load imbalance and communication redundancy in the coupled solver. We then developed several parallelization techniques, including LBM restructuring, data reuse, and SIMD optimization for targeted kernels on the most advanced architecture of the Sunway series in China, namely SW26010P heterogeneous many-core processors, which provide hardware architectural advantages well suited for large-scale parallel computational fluid dynamics. Finally, the accuracy of the LBM-FVM coupling simulations was validated through benchmark simulations of 2D/3D lid-driven cavity flow. The results show that our LBM-FVM coupling solver can accurately capture flow characteristics, with vortex structures consistent with experimental data. Additionally, we achieved a 152× speedup for the LBM solver and a 126× speedup for the coupled simulation compared to the standalone FVM simulation on the New Sunway supercomputer system. Our approach marks a milestone in the field of LBM implementations and provides a promising future for coupled algorithms in CFD. Full article
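For context, the lattice Boltzmann side of such a coupling typically advances particle distribution functions with the standard single-relaxation-time (BGK) collide-and-stream update shown below; the paper's specific collision model and coupling interface are not restated here.

```latex
f_i(\mathbf{x} + \mathbf{c}_i \,\Delta t,\; t + \Delta t)
  = f_i(\mathbf{x}, t)
  - \frac{\Delta t}{\tau}\,\bigl[f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\mathbf{x}, t)\bigr],
\qquad
f_i^{\mathrm{eq}} = w_i\,\rho\left[1 + \frac{\mathbf{c}_i\cdot\mathbf{u}}{c_s^2}
  + \frac{(\mathbf{c}_i\cdot\mathbf{u})^2}{2c_s^4}
  - \frac{\mathbf{u}\cdot\mathbf{u}}{2c_s^2}\right]
```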

20 pages, 4911 KB  
Article
Autonomous Real-Time Regional Risk Monitoring for Unmanned Swarm Systems
by Tianruo Cao, Yuxizi Zheng, Lijun Liu and Yongqi Pan
Mathematics 2026, 14(2), 259; https://doi.org/10.3390/math14020259 - 9 Jan 2026
Viewed by 163
Abstract
Existing State-of-the-Art (SOTA) methods for situational awareness typically rely on high-bandwidth transmission of raw data or computationally intensive models, which are often impractical for resource-constrained edge devices in unstable communication environments. To address these limitations, this paper introduces a comprehensive framework for Regional Risk Monitoring utilizing unmanned swarm systems. We propose an innovative knowledge distillation approach (SIKD) that leverages both soft label dark knowledge and inter-layer relationships, enabling compressed models to run in real time on edge nodes while maintaining high accuracy. Furthermore, recognition results are fused using Bayesian inference to dynamically update the regional risk level. Experimental results demonstrate the feasibility of the proposed framework. Quantitatively, the proposed SIKD algorithm reduces the model parameters by 52.34% and computational complexity to 44.21% of the original model, achieving a 3× inference speedup on edge CPUs. Furthermore, it outperforms state-of-the-art baseline methods (e.g., DKD and IRG) in terms of convergence speed and classification accuracy, ensuring robust real-time risk monitoring. Full article
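The soft-label component of knowledge distillation referenced above is the standard temperature-scaled KL term between teacher and student logits; a minimal version is sketched below. The inter-layer relationship term that distinguishes SIKD is not reproduced, and the temperature and weighting are illustrative assumptions.

```python
# Standard soft-label knowledge-distillation loss (temperature-scaled KL between
# teacher and student logits). SIKD's inter-layer relationship term is omitted.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student = torch.randn(8, 10, requires_grad=True)   # stand-in logits, 10 classes
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(student, teacher, labels))
```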

21 pages, 5182 KB  
Article
Quantitative Assessment of the Computing Performance for the Parallel Implementation of a Time-Domain Airborne SAR Raw Data Focusing Procedure
by Jorge Euillades, Paolo Berardino, Carmen Esposito, Antonio Natale, Riccardo Lanari and Stefano Perna
Remote Sens. 2026, 18(2), 221; https://doi.org/10.3390/rs18020221 - 9 Jan 2026
Viewed by 219
Abstract
In this work, different implementation strategies for a Time-Domain (TD) focusing procedure applied to airborne Synthetic Aperture Radar (SAR) raw data are presented, with the key objective of quantitatively assessing their computing time. In particular, two methodological approaches are proposed: a pixel-wise strategy, which processes each image pixel independently, and a matrix-wise strategy, which handles data blocks collectively. Both strategies are further extended to parallel execution frameworks to exploit multi-threading and multi-node capabilities. The presented analysis is conducted within the context of the airborne SAR infrastructure developed at the Institute for Electromagnetic Sensing of the Environment (IREA) of the National Research Council (CNR) in Naples, Italy. This infrastructure integrates an airborne SAR sensor and a high-performance Information Technology (IT) platform well-tailored to the parallel processing of huge amounts of data. Experimental results indicate an advantage of the pixel-wise strategy over the matrix-wise counterpart in terms of computing time. Furthermore, the adoption of parallel processing techniques yields substantial speedups, highlighting its relevance for time-critical SAR applications. These findings are particularly relevant in operational scenarios that demand a rapid data turnaround, such as near-real-time airborne monitoring in emergency response contexts. Full article
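The pixel-wise strategy loops over image pixels and, for each one, coherently sums the range-compressed echoes at the pixel's instantaneous range; a heavily simplified single-channel sketch of that inner loop is given below. The geometry, sampling, interpolation, and phase handling are assumptions for illustration, not IREA's operational processor.

```python
# Simplified pixel-wise time-domain backprojection for one image line
# (illustrative geometry and sampling assumptions; not the operational code).
import numpy as np

c0, fc = 3e8, 9.6e9                       # assumed speed of light, carrier [Hz]
wavelength = c0 / fc
n_pulses, n_range, dr, r0 = 64, 512, 1.0, 3000.0   # assumed range sampling [m]

rng = np.random.default_rng(0)
rc = rng.standard_normal((n_pulses, n_range)) \
   + 1j * rng.standard_normal((n_pulses, n_range))   # stand-in range-compressed data
sensor_pos = np.column_stack([np.linspace(0, 63, n_pulses),       # along-track x
                              np.zeros(n_pulses),
                              np.full(n_pulses, 3000.0)])          # altitude [m]

pixels = np.column_stack([np.linspace(0, 63, 32),                  # ground line
                          np.full(32, 500.0), np.zeros(32)])
bins = r0 + np.arange(n_range) * dr
image = np.zeros(len(pixels), dtype=complex)
for p, pix in enumerate(pixels):                       # pixel-wise strategy
    for k in range(n_pulses):
        R = np.linalg.norm(sensor_pos[k] - pix)        # sensor-to-pixel range
        echo = np.interp(R, bins, rc[k].real) + 1j * np.interp(R, bins, rc[k].imag)
        image[p] += echo * np.exp(4j * np.pi * R / wavelength)
print(np.abs(image[:4]))
```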

22 pages, 899 KB  
Article
Rapid MRTA in Large UAV Swarms Based on Topological Graph Construction in Obstacle Environments
by Jinlong Liu, Zexu Zhang, Shan Wen, Jingzong Liu and Kai Zhang
Drones 2026, 10(1), 48; https://doi.org/10.3390/drones10010048 - 9 Jan 2026
Viewed by 206
Abstract
In large-scale Unmanned Aerial Vehicle (UAV) and task environments—particularly those involving obstacles—dimensional explosion remains a significant challenge in Multi-Robot Task Allocation (MRTA). To this end, a novel heuristic MRTA framework based on Topological Graph Construction (TGC) is proposed. First, the physical map is transformed into a pixel map, from which a Generalized Voronoi Graph (GVG) is generated by extracting clearance points, which is then used to construct the topological graph of the obstacle environment. Next, the affiliations of UAVs and tasks within the topological graph are determined to partition different topological regions, and the task value of each topological node is calculated, followed by the first-phase Task Allocation (TA) on these topological nodes. Finally, UAVs within the same topological region with their allocated tasks perform a local second-phase TA and generate the final TA result. The simulation experiments analyze the influence of different pixel resolutions on the performance of the proposed method. Subsequently, robustness experiments under localization noise, path cost noise, and communication delays demonstrate that the total benefit achieved by the proposed method remains relatively stable, while the computational time is moderately affected. Moreover, comparative experiments and statistical analyses were conducted against k-means clustering-based MRTA methods in different UAV, task, and obstacle scale environments. The results show that the proposed method improves computational speed while maintaining solution quality, with the PI-based method achieving speedups of over 60 times and the CBBA-based method over 10 times compared with the baseline method. Full article
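The two-phase structure (assign tasks to topological regions first, then allocate locally within each region) can be sketched with a simple distance-based stand-in; the region centers and greedy local assignment below are illustrative simplifications, whereas the paper derives the regions from a GVG-based topological graph.

```python
# Two-phase task allocation sketch: (1) bucket tasks and UAVs by nearest region
# node, (2) greedy nearest-task assignment inside each region. The paper builds
# the regions from a GVG topological graph; here the region nodes are assumed.
import numpy as np

rng = np.random.default_rng(0)
region_nodes = np.array([[20.0, 20.0], [80.0, 30.0], [50.0, 80.0]])  # assumed
uavs = rng.uniform(0, 100, size=(12, 2))
tasks = rng.uniform(0, 100, size=(30, 2))

def nearest_region(points):
    d = np.linalg.norm(points[:, None, :] - region_nodes[None, :, :], axis=2)
    return d.argmin(axis=1)

uav_region, task_region = nearest_region(uavs), nearest_region(tasks)

assignment = {}                                   # task index -> UAV index
for r in range(len(region_nodes)):                # phase 2: local greedy matching
    u_idx = np.flatnonzero(uav_region == r)
    t_idx = list(np.flatnonzero(task_region == r))
    while t_idx and len(u_idx):
        for u in u_idx:
            if not t_idx:
                break
            d = np.linalg.norm(tasks[t_idx] - uavs[u], axis=1)
            t = t_idx.pop(int(d.argmin()))        # nearest remaining task
            assignment[int(t)] = int(u)
print(len(assignment), "tasks assigned")
```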

20 pages, 1423 KB  
Article
Efficient Low-Precision GEMM on Ascend NPU: HGEMM’s Synergy of Pipeline Scheduling, Tiling, and Memory Optimization
by Erkun Zhang, Pengxiang Xu and Lu Lu
Computers 2026, 15(1), 39; https://doi.org/10.3390/computers15010039 - 8 Jan 2026
Viewed by 270
Abstract
As one of the most widely used high-performance kernels, General Matrix Multiplication, or GEMM, plays a pivotal role in diverse application fields. With the growing prevalence of training for Convolutional Neural Networks (CNNs) and Large Language Models (LLMs), the design and implementation of high-efficiency, low-precision GEMM on modern Neural Processing Unit (NPU) platforms are of great significance. In this work, HGEMM for Ascend NPU is presented, which enables collaborative processing of different computation types by Cube units and Vector units. The major contributions of this work are the following: (i) dual-stream pipeline scheduling is implemented, which synchronizes padding operations, matrix–matrix multiplications, and element-wise instructions across hierarchical buffers and compute units; (ii) a suite of tiling strategies and a corresponding strategy selection mechanism are developed, comprehensively accounting for the impacts from M, N, and K directions; and (iii) SplitK and ShuffleK methods are introduced to address the challenges of memory access efficiency and AI Core utilization. Extensive evaluations demonstrate that our proposed HGEMM achieves an average 3.56× speedup over the CATLASS template-based implementation under identical Ascend NPU configurations, and an average 2.10× speedup relative to the cuBLAS implementation on Nvidia A800 GPUs under general random workloads. It also achieves a maximum computational utilization exceeding 90% under benchmark workloads. Moreover, the proposed HGEMM not only significantly outperforms the CATLASS template-based implementation but also delivers efficiency comparable to the cuBLAS implementation in OPT-based bandwidth-limited LLM inference workloads. Full article
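Tiling is the central idea behind GEMM kernels like the one described: the output is computed block-by-block so that each tile of A and B is reused from fast memory many times. A plain NumPy illustration of the blocking structure is given below; the tile sizes are arbitrary and no NPU-specific pipeline scheduling or SplitK/ShuffleK logic is modeled.

```python
# Blocked (tiled) matrix multiplication: the loop structure that GEMM kernels
# map onto fast on-chip buffers. Tile sizes here are arbitrary illustrations;
# no Ascend-specific scheduling is modeled.
import numpy as np

def tiled_matmul(A, B, tile_m=64, tile_n=64, tile_k=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0 in range(0, M, tile_m):            # tile over rows of C
        for j0 in range(0, N, tile_n):        # tile over columns of C
            acc = np.zeros((min(tile_m, M - i0), min(tile_n, N - j0)), A.dtype)
            for k0 in range(0, K, tile_k):    # accumulate over the K dimension
                a = A[i0:i0 + tile_m, k0:k0 + tile_k]   # tile kept in fast memory
                b = B[k0:k0 + tile_k, j0:j0 + tile_n]
                acc += a @ b
            C[i0:i0 + tile_m, j0:j0 + tile_n] = acc
    return C

rng = np.random.default_rng(0)
A, B = rng.random((300, 200)), rng.random((200, 150))
print(np.allclose(tiled_matmul(A, B), A @ B))   # True
```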
