Search Results (389)

Search Parameters:
Keywords = GPU parallel computing

14 pages, 2351 KB  
Article
TwinArray Sort: An Ultrarapid Conditional Non-Comparison Integer Sorting Algorithm
by Amin Amini
Electronics 2026, 15(3), 609; https://doi.org/10.3390/electronics15030609 - 30 Jan 2026
Viewed by 128
Abstract
TwinArray Sort is a non-comparison integer sorting algorithm designed for non-negative integers with relatively dense key ranges, offering competitive runtime performance and reduced memory usage relative to other counting-based methods. The algorithm introduces a conditional distinct-array verification mechanism that adapts the reconstruction strategy based on data characteristics while maintaining worst-case time and space complexity of O(n + k). Comprehensive experimental evaluations were conducted on datasets containing up to 10⁸ elements across multiple data distributions, including random, reverse-sorted, nearly sorted, and their unique variants. The results demonstrate consistent performance improvements compared with established algorithms such as Counting Sort, Pigeonhole Sort, MSD Radix Sort, Spreadsort, Flash Sort, Bucket Sort, and Quicksort. TwinArray Sort achieved execution times up to 2.7 times faster and reduced memory usage by up to 50%, with particularly strong performance observed for unique and reverse-sorted datasets. The algorithm exhibits good scalability for large datasets and key ranges, with performance degradation occurring primarily in extreme cases where the key range significantly exceeds the input size due to auxiliary array requirements. These findings indicate that TwinArray Sort is a competitive solution for in-memory sorting in high-performance and distributed computing environments. Future work will focus on optimizing performance for wide key ranges and developing parallel implementations for multi-core and GPU architectures. Full article
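
The abstract does not reproduce the TwinArray Sort pseudocode, but the counting-based family it belongs to is easy to sketch. The snippet below is a generic O(n + k) counting sort for non-negative integers with a simple all-keys-distinct fast path; it is only an illustration of the approach, not the published algorithm, and the dataset size and key range are arbitrary assumptions.

```python
# Generic counting-based integer sort with a distinct-key fast path.
# Illustrative only -- NOT the published TwinArray Sort; key range and
# dataset size below are arbitrary assumptions.
import numpy as np

def counting_sort_dense(keys: np.ndarray) -> np.ndarray:
    """Sort non-negative integers whose key range k is comparable to n."""
    k = int(keys.max()) + 1                  # key range (assumed dense)
    counts = np.bincount(keys, minlength=k)  # O(n + k) histogram
    if counts.max() == 1:                    # all keys distinct: reconstruct directly
        return np.flatnonzero(counts)        # indices of the ones, already sorted
    return np.repeat(np.arange(k), counts)   # general case: expand the histogram

rng = np.random.default_rng(0)
data = rng.integers(0, 1_000_000, size=1_000_000)
assert np.array_equal(counting_sort_dense(data), np.sort(data))
```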

17 pages, 4235 KB  
Article
GPU Ray Tracing Analysis of Plasma Plume Perturbations on Reflector Antenna Radiation Characteristics
by Yijing Wang, Weike Yin and Bing Wei
Symmetry 2026, 18(2), 243; https://doi.org/10.3390/sym18020243 - 29 Jan 2026
Viewed by 80
Abstract
During ion thruster operation, electromagnetic waves propagating through the plasma plume undergo absorption and refraction effects. This paper presents a graphics processing unit (GPU) parallel ray tracing (RT) algorithm for inhomogeneous media to analyze plasma plume-induced perturbations on the radiation characteristics of a satellite reflector antenna, substantially improving computational efficiency. This algorithm performs ray path tracing in the plume, with the vertex and central rays in each ray tube assigned to dedicated GPU threads. This enables the parallel computation of electromagnetic wave attenuation, phase, and polarization. By further applying aperture integration and the superposition principle, the influence of the plume on the far-field antenna radiation patterns is efficiently analyzed. Comparison with serial results validates the accuracy of the algorithm for plume calculation, achieving approximately 319 times speed-up for 586,928 ray tubes. Within the 2–5 GHz frequency range, the plume causes amplitude attenuation of less than 3 dB. This study provides an efficient solution for real-time analysis of plume-induced interference in satellite communications. Full article
(This article belongs to the Section Physics)
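
As a rough illustration of the per-ray work the paper assigns to individual GPU threads, the sketch below accumulates the phase perturbation of straight rays crossing an inhomogeneous plasma, using the standard cold-plasma refractive index. The Gaussian electron-density profile, straight-line paths, ray count, and frequency are illustrative assumptions; refraction, attenuation, polarization tracking, and the aperture integration of the actual algorithm are not shown.

```python
# Phase perturbation of parallel rays crossing an assumed plasma plume,
# vectorised over rays (the serial analogue of one GPU thread per ray tube).
import numpy as np

E, EPS0, ME, C = 1.602e-19, 8.854e-12, 9.109e-31, 2.998e8

def electron_density(r, z):                      # assumed plume profile, m^-3
    return 1e16 * np.exp(-(z / 0.8) ** 2 - (r / 0.3) ** 2)

f = 3e9                                          # carrier within the 2-5 GHz band
omega, k0 = 2 * np.pi * f, 2 * np.pi * f / C
z = np.linspace(0.0, 2.0, 2001)                  # straight 2 m path, 1 mm steps
offsets = np.linspace(-0.3, 0.3, 64)             # 64 parallel rays across the aperture

ne = electron_density(offsets[:, None], z[None, :])        # (rays, steps)
wp2 = ne * E**2 / (EPS0 * ME)                    # plasma frequency squared
n_refr = np.sqrt(np.maximum(1.0 - wp2 / omega**2, 0.0))    # cold-plasma index
dphi = k0 * np.trapz(1.0 - n_refr, z, axis=1)    # phase shift vs. vacuum, per ray
print(dphi.max(), "rad")
```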
19 pages, 1481 KB  
Article
GPU-Accelerated FLIP Fluid Simulation Based on Spatial Hashing Index and Thread Block-Level Cooperation
by Changjun Zou and Hui Luo
Modelling 2026, 7(1), 27; https://doi.org/10.3390/modelling7010027 - 21 Jan 2026
Viewed by 158
Abstract
The Fluid Implicit Particle (FLIP) method is widely adopted in fluid simulation due to its computational efficiency and low dissipation. However, its high computational complexity makes it challenging for traditional CPU architectures to meet real-time requirements. To address this limitation, this work migrates the FLIP method to the GPU using the CUDA framework, achieving a transition from conventional CPU computation to large-scale GPU parallel computing. Furthermore, during particle-to-grid (P2G) mapping, the conventional scattering strategy suffers from significant performance bottlenecks due to frequent atomic operations. To overcome this challenge, we propose a GPU parallelization strategy based on spatial hashing indexing and thread block-level cooperation. This approach effectively avoids atomic contention and significantly enhances parallel efficiency. Through diverse fluid simulation experiments, the proposed GPU-parallelized strategy achieves a nearly 50× speedup ratio compared to the conventional CPU-FLIP method. Additionally, in the P2G stage, our method demonstrates over 30% performance improvement relative to the traditional GPU-based particle-thread scattering strategy, while the overall simulation efficiency gain exceeds 20%. Full article
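
The scatter-versus-gather trade-off described above can be sketched on the CPU with NumPy: the scatter pass mirrors per-particle atomicAdd on the GPU, while grouping particles by cell first (the idea behind the spatial-hashing, block-per-cell strategy) lets each cell be reduced without contention. The 1D grid, particle count, and mass quantity are illustrative assumptions rather than the paper's configuration.

```python
# Two particle-to-grid (P2G) strategies: scatter (atomicAdd analogue) vs.
# group-by-cell gather (spirit of the spatial-hashing approach). Sketch only;
# not the paper's CUDA kernels.
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_particles = 256, 100_000
pos = rng.uniform(0, n_cells, n_particles)        # particle positions (1D)
mass = rng.uniform(0.5, 1.5, n_particles)
cell = pos.astype(np.int64)                       # owning cell index ("hash")

# Strategy 1: scatter -- every particle adds into its cell.
grid_scatter = np.zeros(n_cells)
np.add.at(grid_scatter, cell, mass)

# Strategy 2: gather -- sort particles by cell, then reduce each contiguous
# segment; on the GPU one thread block owns a cell and needs no atomics.
order = np.argsort(cell, kind="stable")
grid_gather = np.bincount(cell[order], weights=mass[order], minlength=n_cells)

assert np.allclose(grid_scatter, grid_gather)
```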

22 pages, 5297 KB  
Article
A Space-Domain Gravity Forward Modeling Method Based on Voxel Discretization and Multiple Observation Surfaces
by Rui Zhang, Guiju Wu, Jiapei Wang, Yufei Xi, Fan Wang and Qinhong Long
Symmetry 2026, 18(1), 180; https://doi.org/10.3390/sym18010180 - 19 Jan 2026
Viewed by 247
Abstract
Geophysical forward modeling serves as a fundamental theoretical approach for characterizing subsurface structures and material properties, essentially involving the computation of gravity responses at surface or spatial observation points based on a predefined density distribution. With the rapid development of data-driven techniques such as deep learning in geophysical inversion, forward algorithms are facing increasing demands in terms of computational scale, observable types, and efficiency. To address these challenges, this study develops an efficient forward modeling method based on voxel discretization, enabling rapid calculation of gravity anomalies and radial gravity gradients on multiple observational surfaces. Leveraging the parallel computing capabilities of graphics processing units (GPU), together with tensor acceleration, Compute Unified Device Architecture (CUDA) execution, and Just-in-time (JIT) compilation strategies, the method achieves high efficiency and automation in the forward computation process. Numerical experiments conducted on several typical theoretical models demonstrate the convergence and stability of the calculated results, indicating that the proposed method significantly reduces computation time while maintaining accuracy, thus being well-suited for large-scale 3D modeling and fast batch simulation tasks. This research can efficiently generate forward datasets with multi-view and multi-metric characteristics, providing solid data support and a scalable computational platform for deep-learning-based geophysical inversion studies. Full article
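
The core forward operation, summing every voxel's contribution at every observation point, is what the GPU/tensor pipeline parallelises. The sketch below uses a point-mass approximation of each voxel and broadcasting over an observation grid; the toy density model, voxel size, and single observation surface are assumptions, and the paper's exact prism responses and multi-surface setup are not reproduced.

```python
# Voxel-based gravity forward sketch: collapse each voxel to a point mass and
# sum its vertical attraction at every observation point via broadcasting.
import numpy as np

G = 6.674e-11                                   # gravitational constant

# assumed density model: 20x20x10 voxels of 100 m cubes, one dense block inside
nx, ny, nz, h = 20, 20, 10, 100.0
rho = np.zeros((nx, ny, nz)); rho[8:12, 8:12, 3:6] = 500.0   # kg/m^3 contrast
xv, yv, zv = np.meshgrid((np.arange(nx) + 0.5) * h,
                         (np.arange(ny) + 0.5) * h,
                         (np.arange(nz) + 0.5) * h, indexing="ij")
mass = (rho * h**3).ravel()
vox = np.stack([xv.ravel(), yv.ravel(), zv.ravel()], axis=1)  # z is depth (down)

# observation grid on the surface z = 0
xo, yo = np.meshgrid(np.arange(0, nx * h, h), np.arange(0, ny * h, h), indexing="ij")
obs = np.stack([xo.ravel(), yo.ravel(), np.zeros(xo.size)], axis=1)

d = vox[None, :, :] - obs[:, None, :]           # (n_obs, n_vox, 3) separations
r3 = np.linalg.norm(d, axis=2) ** 3
gz = G * np.sum(mass * d[:, :, 2] / r3, axis=1) # vertical component, m/s^2
print(gz.max() * 1e5, "mGal (approx.)")
```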

17 pages, 2889 KB  
Technical Note
Increasing Computational Efficiency of a River Ice Model to Help Investigate the Impact of Ice Booms on Ice Covers Formed in a Regulated River
by Karl-Erich Lindenschmidt, Mojtaba Jandaghian, Saber Ansari, Denise Sudom, Sergio Gomez, Stephany Valarezo Plaza, Amir Ali Khan, Thomas Puestow and Seok-Bum Ko
Water 2026, 18(2), 218; https://doi.org/10.3390/w18020218 - 14 Jan 2026
Viewed by 215
Abstract
The formation and stability of river ice covers in regulated waterways are critical for uninterrupted hydro-electric operations. This study investigates the modelling of ice cover development in the Beauharnois Canal along the St. Lawrence River with the presence and absence of ice booms. Ice booms are deployed in this canal to promote the rapid formation of a stable ice cover during freezing events, minimizing disruptions to dam operations. Remote sensing data were used to assess the spatial extent and temporal evolution of an ice cover and to calibrate the river ice model RIVICE. The model was applied to simulate ice formation for the 2019–2020 ice season, first for the canal with a series of three ice booms and then rerun under a scenario without booms. Comparative analysis reveals that the presence of ice booms facilitates the development of a relatively thinner and more uniform ice cover. In contrast, the absence of booms leads to thicker ice accumulations and increased risk of ice jamming, which could impact water management and hydroelectric generation operations. Computational efficiencies of the RIVICE model were also sought. RIVICE was originally compiled with a Fortran 77 compiler, which restricted modern optimization techniques. Recompiling with NVFortran significantly improved performance through advanced instruction scheduling, cache management, and automatic loop analysis, even without explicit optimization flags. Enabling optimization further accelerated execution, albeit marginally, reducing redundant operations and memory traffic while preserving numerical integrity. Tests across varying ice cross-sectional spacings confirmed that NVFortran reduced runtimes by roughly an order of magnitude compared to the original model. A test GPU (Graphics Processing Unit) version was able to run the data interpolation routines on the GPU, but frequent data transfers between the CPU (Central Processing Unit) and GPU caused by shared memory blocks and fixed-size arrays made it slower than the original CPU version. Achieving efficient GPU execution would require substantial code restructuring to eliminate global states, adopt persistent data regions, and parallelize at higher level loops, or alternatively, rewriting in a GPU-friendly language to fully exploit modern architectures. Full article

16 pages, 328 KB  
Article
SemanticHPC: Semantics-Aware, Hardware-Conscious Workflows for Distributed AI Training on HPC Architectures
by Alba Amato
Information 2026, 17(1), 78; https://doi.org/10.3390/info17010078 - 12 Jan 2026
Viewed by 231
Abstract
High-Performance Computing (HPC) has become essential for training medium- and large-scale Artificial Intelligence (AI) models, yet two bottlenecks remain under-exploited: the semantic coherence of training data and the interaction between distributed deep learning runtimes and heterogeneous HPC architectures. Existing work tends to optimise multi-node, multi-GPU training in isolation from data semantics or to apply semantic technologies to data curation without considering the constraints of large-scale training on modern clusters. This paper introduces SemanticHPC, an experimental framework that integrates ontology and Resource Description Framework (RDF)-based semantic preprocessing with distributed AI training (Horovod/PyTorch Distributed Data Parallel) and hardware-aware optimisations for Non-Uniform Memory Access (NUMA), multi-GPU and high-speed interconnects. The framework has been evaluated on 1–8 node configurations (4–32 GPUs) on a production-grade cluster. Experiments on a medium-size Open Images V7 workload show that semantic enrichment improves validation accuracy by 3.5–4.4 absolute percentage points while keeping the additional end-to-end overhead below 8% and preserving strong scaling efficiency above 79% on eight nodes. We argue that bringing semantic technologies into the training workflow—rather than treating them as an offline, detached phase—is a promising direction for large-scale AI on HPC systems. We detail an implementation based on standard Python libraries, RDF tooling and widely adopted deep learning runtimes, and we discuss the limitations and practical hurdles that need to be addressed for broader adoption. Full article
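
A generic PyTorch DistributedDataParallel skeleton of the kind such a workflow builds on is shown below (one process per GPU, launched with torchrun). The toy model, the random dataset, and the semantic_filter placeholder are illustrative assumptions; SemanticHPC's RDF/ontology preprocessing, the Horovod path, and NUMA/interconnect tuning are not represented.

```python
# Minimal multi-GPU data-parallel training loop with PyTorch DDP.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def semantic_filter(x, y):            # placeholder for semantics-aware curation
    return x, y

def main():
    dist.init_process_group("nccl")   # env vars supplied by torchrun
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()

    x, y = semantic_filter(torch.randn(4096, 64), torch.randint(0, 10, (4096,)))
    ds = TensorDataset(x, y)
    sampler = DistributedSampler(ds)              # shards data across ranks
    loader = DataLoader(ds, batch_size=64, sampler=sampler)

    model = DDP(torch.nn.Linear(64, 10).to(device), device_ids=[device])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(3):
        sampler.set_epoch(epoch)                  # reshuffle consistently per epoch
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb.to(device)), yb.to(device))
            loss.backward()                       # gradients all-reduced by DDP
            opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```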

31 pages, 8135 KB  
Article
A High-Performance Stochastic Framework for Landslide Uncertainty Analysis Using the Material Point Method and Random Field Theory
by Qinyang Sang, Yonglin Xiong and Zhigang Liu
Symmetry 2026, 18(1), 88; https://doi.org/10.3390/sym18010088 - 4 Jan 2026
Viewed by 356
Abstract
This study proposes a novel high-performance computational framework to address the computational challenges in probabilistic large-deformation landslide analysis. By integrating a GPU-accelerated material point method (MPM) solver with a parallelized covariance matrix decomposition (CMD) algorithm for decomposing symmetric matrices, the framework achieves exceptional efficiency, demonstrating speedups of up to 532× (MPM solver) and 120× (random field generation) compared to traditional serial methods. Leveraging this efficiency, extensive Monte Carlo simulations (MCSs) were conducted to quantify the effects of spatial variability in soil properties on landslide behaviors. Quantitative results indicate that runout and influence distances follow normal distributions, while sliding mass volume exhibits log-normal characteristics. Crucially, deterministic analysis was found to systematically underestimate the hazard; the probabilistic mean sliding volume significantly exceeded the deterministic value, with 73–80% of stochastic realizations producing larger failures. Furthermore, sensitivity analyses reveal that increasing the coefficient of variation (COV) and the cross-correlation coefficient (from −0.5 to 0.5) leads to a monotonic increase in both the mean and standard deviation of large-deformation metrics. These findings confirm that positive parameter correlation amplifies failure risk, providing a rigorous physics-based basis for conservative landslide hazard assessment. Full article
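
The covariance matrix decomposition step mentioned above amounts to Cholesky-factorising a symmetric correlation matrix and colouring standard normal samples with it. The 1D grid, exponential autocorrelation model, correlation length, and log-normal cohesion statistics below are illustrative assumptions; the paper performs the decomposition on the GPU for the fields that feed its MPM Monte Carlo runs.

```python
# Random-field generation by covariance matrix decomposition (CMD).
import numpy as np

n, dx, corr_len = 200, 0.5, 5.0                  # points, spacing (m), corr. length (m)
x = np.arange(n) * dx
C = np.exp(-2.0 * np.abs(x[:, None] - x[None, :]) / corr_len)   # correlation matrix
L = np.linalg.cholesky(C + 1e-10 * np.eye(n))    # CMD step on the symmetric matrix

rng = np.random.default_rng(42)
n_mc = 1000                                      # Monte Carlo realisations
gauss_field = L @ rng.standard_normal((n, n_mc)) # correlated standard normal fields

mean_c, cov_c = 20.0, 0.3                        # assumed cohesion mean (kPa) and COV
sigma_ln = np.sqrt(np.log(1 + cov_c**2))
mu_ln = np.log(mean_c) - 0.5 * sigma_ln**2
cohesion = np.exp(mu_ln + sigma_ln * gauss_field)  # log-normal soil-property fields
print(cohesion.mean(), cohesion.std())
```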

17 pages, 558 KB  
Article
FPGA-Accelerated Multi-Resolution Spline Reconstruction for Real-Time Multimedia Signal Processing
by Manuel J. C. S. Reis
Electronics 2026, 15(1), 173; https://doi.org/10.3390/electronics15010173 - 30 Dec 2025
Viewed by 354
Abstract
This paper presents an FPGA-based architecture for real-time spline-based signal reconstruction, targeted at multimedia signal processing applications. Leveraging the multi-resolution properties of B-splines, the proposed design enables efficient upsampling, denoising, and feature preservation for image and video signals. Implemented on a mid-range FPGA, the system supports parallel processing of multiple channels, with low-latency memory access and pipelined arithmetic units. The proposed pipeline achieves a throughput of up to 33.1 megasamples per second for 1D signals and 19.4 megapixels per second for 2D images, while maintaining average power consumption below 250 mW. Compared to CPU and embedded GPU implementations, the design delivers >15× improvement in energy efficiency and deterministic low-latency performance (8–12 clock cycles). A key novelty lies in combining multi-resolution B-spline reconstruction with fixed-point arithmetic and streaming-friendly pipelining, making the architecture modular, compact, and robust to varying input rates. Benchmarking results on synthetic and real multimedia datasets show significant improvements in throughput and energy efficiency compared to conventional CPU and GPU implementations. The architecture supports flexible resolution scaling, making it suitable for edge-computing scenarios in multimedia environments. Full article
(This article belongs to the Special Issue Digital Signal and Image Processing for Multimedia Technology)
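
A floating-point reference for the reconstruction the FPGA pipeline implements in fixed-point: evaluate the cubic B-spline basis on a finer grid and combine it with the signal samples. Coefficient prefiltering is omitted, so this reconstructs a smoothing rather than interpolating spline, and the test signal and 4× upsampling factor are assumptions.

```python
# Cubic B-spline reconstruction / upsampling reference (float; FPGA uses fixed-point).
import numpy as np

def bspline3(t):
    """Cubic B-spline basis function B3(t)."""
    t = np.abs(t)
    out = np.zeros_like(t)
    m1 = t < 1
    m2 = (t >= 1) & (t < 2)
    out[m1] = 2/3 - t[m1]**2 + t[m1]**3 / 2
    out[m2] = (2 - t[m2])**3 / 6
    return out

def reconstruct(samples, upsample=4):
    """s(x) = sum_k c_k * B3(x - k), evaluated on a finer grid."""
    n = len(samples)
    x = np.arange(0, n - 1, 1.0 / upsample)
    k = np.arange(n)
    return bspline3(x[:, None] - k[None, :]) @ samples

t = np.linspace(0, 1, 64)
sig = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.default_rng(7).standard_normal(64)
fine = reconstruct(sig, upsample=4)
print(len(sig), "->", len(fine), "samples")
```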

21 pages, 5590 KB  
Article
A Position-Based Fluid Method with Dynamic Smoothing Length
by Changjun Zou and Xirun Li
Computers 2026, 15(1), 11; https://doi.org/10.3390/computers15010011 - 30 Dec 2025
Viewed by 299
Abstract
Traditional position-based fluid (PBF) methods often suffer from interpolation inaccuracies and limited computational efficiency due to their fixed smoothing length. To address these limitations, this paper proposes an adaptive smoothing length model and implements full-pipeline parallel acceleration on GPUs. By incorporating both local neighbor count and density variation, the model dynamically adjusts particle smoothing length. This adaptation effectively mitigates two issues: surface distortion due to insufficient interpolation in sparse regions, and performance degradation caused by computational redundancy in dense regions. To resolve neighbor search asymmetry introduced by dynamic smoothing lengths, we designed a symmetry handling technique based on maximum smoothing length and an efficient spatial hashing search algorithm. Experimental results across multiple scenarios (including dam break and droplet impact) demonstrate that our method maintains simulation stability comparable to the fixed smoothing length approach while improving computational efficiency and enhancing local particle distribution uniformity. The improved uniformity is evidenced by a significant reduction in the variance of neighbor particle counts. Visually, the method yields more natural results for dynamic details such as splashing and fragmentation, thereby ensuring the visual realism of the simulations. Full article
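
One simple way to realise the adaptive smoothing-length idea is to nudge each particle's h toward a target neighbour count, clamp it, and use the global maximum h as a symmetric search radius so neighbour lookups stay mutual. The naive O(n²) neighbour count, the volume-based update exponent, and the target count below are illustrative assumptions; the paper additionally folds in the local density variation and uses spatial hashing for the search.

```python
# Adaptive per-particle smoothing length driven by neighbour counts (sketch).
import numpy as np

rng = np.random.default_rng(3)
pos = rng.uniform(0, 1, size=(1500, 3))          # particle positions in a unit box
h = np.full(len(pos), 0.17)                      # initial smoothing lengths
target, h_min, h_max = 30, 0.10, 0.30            # assumed neighbour target and clamps

for _ in range(5):                               # a few relaxation sweeps
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    n_nbr = (d < h[:, None]).sum(axis=1) - 1     # neighbours inside each particle's h
    h = h * (target / np.maximum(n_nbr, 1)) ** (1.0 / 3.0)  # volume-based rescale
    h = np.clip(h, h_min, h_max)

h_search = h.max()                               # symmetric search radius for lookups
print(n_nbr.mean(), h.mean(), h_search)
```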

8 pages, 446 KB  
Proceeding Paper
Enhanced Early Detection of Epileptic Seizures Through Advanced Line Spectral Estimation and XGBoost Machine Learning
by K. Rama Krishna and B. B. Shabarinath
Comput. Sci. Math. Forum 2025, 12(1), 4; https://doi.org/10.3390/cmsf2025012004 - 17 Dec 2025
Viewed by 385
Abstract
This paper proposes a fast epileptic seizure detection method to allow for early clinical intervention. The primary goal is to enhance computational and predictive performance to make the method viable for online implementation. An advanced Line Spectral Estimation (LSE)-based method for EEG analysis was developed with Bayesian inference and Toeplitz structure-based fast inversion with Capon and non-uniform Fourier transforms to reduce computational requirements. XGBoost classifier with parallel boosting was employed to increase prediction performance. The method was tested with patients’ EEG data using multiple embedded Graphic Processing Unit (GPU) platforms and achieved 95.5% accuracy, and 23.48 and 33.46 min average and maximum lead times before a seizure, respectively. The sensitivity and specificity values (92.23% and 93.38%) show the method to be reliable. The integration of LSE and XGBoost can be extended to create an efficient and practical online seizure detection and management tool. Full article
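
The classification stage can be sketched with the scikit-learn interface of XGBoost: spectral features per EEG window feed a gradient-boosted classifier that builds its trees in parallel across CPU cores. Synthetic two-class signals and a plain FFT periodogram stand in for patient EEG and the Bayesian Line Spectral Estimation front end, and the hyperparameters are illustrative assumptions.

```python
# Spectral features -> XGBoost classifier (toy stand-in for the paper's pipeline).
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
fs, win = 256, 512                                 # sampling rate, window length

def make_window(preictal):
    t = np.arange(win) / fs
    f = 7.0 if preictal else 10.0                  # toy spectral difference
    return np.sin(2 * np.pi * f * t) + 0.5 * rng.standard_normal(win)

y = rng.integers(0, 2, 600)
X = np.array([np.abs(np.fft.rfft(make_window(lbl)))[:64] for lbl in y])  # features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
clf = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                        n_jobs=-1)                 # boosting parallelised over cores
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```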

24 pages, 866 KB  
Article
A GPU-CUDA Numerical Algorithm for Solving a Biological Model
by Pasquale De Luca, Giuseppe Fiorillo and Livia Marcellino
AppliedMath 2025, 5(4), 178; https://doi.org/10.3390/appliedmath5040178 - 8 Dec 2025
Viewed by 466
Abstract
Tumor angiogenesis models based on coupled nonlinear parabolic partial differential equations require solving stiff systems where explicit time-stepping methods impose severe stability constraints on the time step size. Implicit–Explicit (IMEX) schemes relax this constraint by treating diffusion terms implicitly and reaction–chemotaxis terms explicitly, reducing each time step to a single linear system solution. However, standard Gaussian elimination with partial pivoting exhibits cubic complexity in the number of spatial grid points, dominating computational cost for realistic discretizations in the range of 400–800 grid points. This work presents a CUDA-based parallel algorithm that accelerates the IMEX scheme through GPU implementation of three core computational kernels: pivot finding via atomic operations on double-precision floating-point values, row swapping with coalesced memory access patterns, and elimination updates using optimized two-dimensional thread grids. Performance measurements on an NVIDIA H100 GPU demonstrate speedup factors from 3.5× to 113× across spatial discretizations spanning M ∈ [25, 800] grid points relative to sequential CPU execution, approaching 94.2% of the theoretical maximum speedup predicted by Amdahl’s law. Numerical validation confirms that GPU and CPU solutions agree to within twelve digits of precision over extended time integration, with conservation properties preserved to machine precision. Performance analysis reveals that the elimination kernel accounts for nearly 90% of total execution time, justifying the focus on GPU parallelization of this component. The method enables parameter studies requiring 10⁴ PDE solves, previously computationally prohibitive, facilitating model-driven investigation of anti-angiogenic therapy design. Full article
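
A sequential NumPy reference makes the three per-column steps that the paper maps onto CUDA kernels explicit: pivot search, row swap, and the elimination update that dominates runtime (and is spread over a two-dimensional thread grid on the GPU). The tridiagonal test system below is only a stand-in for the linear systems produced by the IMEX discretisation.

```python
# Gaussian elimination with partial pivoting, structured as the three steps
# the paper assigns to separate GPU kernels. Sequential CPU sketch only.
import numpy as np

def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    A = A.astype(float).copy(); b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))        # step 1: pivot search
        if p != k:                                  # step 2: row swap
            A[[k, p]] = A[[p, k]]; b[[k, p]] = b[[p, k]]
        m = A[k+1:, k] / A[k, k]                    # multipliers for column k
        A[k+1:, k:] -= np.outer(m, A[k, k:])        # step 3: elimination update
        b[k+1:] -= m * b[k]
    x = np.zeros(n)                                 # back substitution
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

M = 400                                             # grid points (stand-in system)
A = np.eye(M) * 4 + np.diag(-np.ones(M - 1), 1) + np.diag(-np.ones(M - 1), -1)
x = gauss_solve(A, np.ones(M))
assert np.allclose(A @ x, np.ones(M))
```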

31 pages, 11710 KB  
Article
An Efficient GPU-Accelerated High-Order Upwind Rotated Lattice Boltzmann Flux Solver for Simulating Three-Dimensional Compressible Flows with Strong Shock Waves
by Yunhao Wang, Qite Wang and Yan Wang
Entropy 2025, 27(12), 1193; https://doi.org/10.3390/e27121193 - 24 Nov 2025
Viewed by 414
Abstract
This paper presents an efficient and high-order WENO-based Upwind Rotated Lattice Boltzmann Flux Solver (WENO-URLBFS) on graphics processing units (GPUs) for simulating three-dimensional (3D) compressible flow problems. The proposed approach extends the baseline Rotated Lattice Boltzmann Flux Solver (RLBFS) by redefining the interface tangential velocity based on the theoretical solution of the Euler equations. This improvement, combined with a weighted decomposition of the numerical fluxes in two mutually perpendicular directions, effectively reduces numerical dissipation and enhances solution stability. To achieve high-order accuracy, the WENO interpolation is applied in the characteristic space to reconstruct physical quantities on both sides of the interface. The density perturbation test is employed to assess the accuracy of the scheme, which demonstrates 5th- and 7th-order convergence as expected. In addition, this test case is also employed to confirm the consistency between the CPU serial and GPU parallel implementations of the WENO-URLBFS scheme and to assess the acceleration performance across different grid resolutions, yielding a maximum speedup factor of 1208.27. The low-dissipation property of the scheme is further assessed through the inviscid Taylor–Green vortex problem. Finally, a series of challenging three-dimensional benchmark cases demonstrate that the present scheme achieves high accuracy, low dissipation, and excellent computational efficiency in simulating strongly compressible flows with complex features such as strong shock waves and discontinuities. Full article
(This article belongs to the Section Statistical Physics)
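
The reconstruction building block applied in characteristic space is the classic fifth-order WENO scheme of Jiang and Shu; a scalar 1D version is sketched below. Only the nonlinear-weight reconstruction is shown; the rotated lattice Boltzmann flux evaluation and the characteristic projection themselves are not reproduced here.

```python
# Fifth-order WENO (Jiang-Shu) reconstruction of the left-biased interface value.
import numpy as np

def weno5_left(f, eps=1e-6):
    """Reconstruct f at i+1/2 from cell averages [i-2, i-1, i, i+1, i+2]."""
    fm2, fm1, f0, fp1, fp2 = f
    # candidate third-order reconstructions
    p0 = (2*fm2 - 7*fm1 + 11*f0) / 6
    p1 = (-fm1 + 5*f0 + 2*fp1) / 6
    p2 = (2*f0 + 5*fp1 - fp2) / 6
    # smoothness indicators
    b0 = 13/12*(fm2 - 2*fm1 + f0)**2 + 1/4*(fm2 - 4*fm1 + 3*f0)**2
    b1 = 13/12*(fm1 - 2*f0 + fp1)**2 + 1/4*(fm1 - fp1)**2
    b2 = 13/12*(f0 - 2*fp1 + fp2)**2 + 1/4*(3*f0 - 4*fp1 + fp2)**2
    # nonlinear weights around the ideal weights (1/10, 6/10, 3/10)
    a = np.array([0.1/(eps+b0)**2, 0.6/(eps+b1)**2, 0.3/(eps+b2)**2])
    w = a / a.sum()
    return w[0]*p0 + w[1]*p1 + w[2]*p2

# on smooth data the reconstruction is close to the exact interface value
x = np.array([-2., -1., 0., 1., 2.]) * 0.1
print(weno5_left(np.sin(x)), np.sin(0.05))
```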

24 pages, 2582 KB  
Article
A Novel Approach for Vessel Graphics Identification and Augmentation Based on Unsupervised Illumination Estimation Network
by Jianan Luo, Zhichen Liu, Chenchen Jiao and Mingyuan Jiang
J. Mar. Sci. Eng. 2025, 13(11), 2167; https://doi.org/10.3390/jmse13112167 - 17 Nov 2025
Viewed by 402
Abstract
Vessel identification in low-light environments is a challenging task since low-light images contain less information for detecting objects. To improve the feasibility of vessel identification in low-light environments, we present a new unsupervised low-light image augmentation approach to augment the visibility of vessel features in low-light images, laying a foundation for subsequent identification. This guarantees the feasibility of vessel identification with the augmented image. To this end, we design an illumination estimation network (IEN) to estimate the illumination of a low-light image based on the Retinex theory. Then, we augment the low-light image by estimating its reflectance with the estimated illumination. Compared with the existing deep learning-based supervised low-light image augmentation approach that depends on the low- and normal-light image pairs for model training, IEN is an unsupervised approach without using normal-light images as references during model training. Compared with the traditional unsupervised low-light image augmentation approach, IEN shows faster image augmentation speed by parallel computation acceleration with Graphics Processing Units (GPUs). The proposed approach builds an end-to-end pipeline integrating a vessel-aware weight matrix and SmoothNet, which optimizes illumination estimation under the Retinex framework. To evaluate the effectiveness of the proposed approach, we build a low-light vessel image set based on the Sea Vessels 7000 dataset, a public maritime image set containing 7000 vessel images across multiple categories. Then, we carry out an experiment to evaluate the feasibility of vessel identification using the augmented image. Experimental results show that the proposed approach boosts the AP75 metric of the RetinaNet detector by 6.6 percentage points (from 56.8 to 63.4) on the low-light Sea Vessels 7000 dataset, confirming that the augmented image significantly improves vessel identification accuracy in low-light scenarios. Full article
(This article belongs to the Special Issue New Technologies in Autonomous Ship Navigation)
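
A classical Retinex-style enhancement illustrates the decomposition the network learns: estimate an illumination map, take reflectance as I / (L + eps), and relight. The channel-maximum initial estimate, Gaussian smoothing, and gamma value below are illustrative assumptions standing in for the unsupervised IEN and its vessel-aware weighting.

```python
# Classical Retinex-style low-light enhancement (stand-in for the learned IEN).
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance(img, sigma=15, gamma=0.6, eps=1e-3):
    """img: float RGB array in [0, 1], shape (H, W, 3)."""
    lum = img.max(axis=2)                           # rough illumination guess
    L = gaussian_filter(lum, sigma)                 # spatially smooth illumination
    R = img / (L[..., None] + eps)                  # Retinex reflectance
    out = R * (L[..., None] ** gamma)               # relight with a brightened L
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(5)
dark = np.clip(0.1 * rng.random((120, 160, 3)), 0, 1)   # synthetic low-light frame
bright = enhance(dark)
print(dark.mean(), "->", bright.mean())
```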

21 pages, 898 KB  
Article
Scalable QR Factorisation of Ill-Conditioned Tall-and-Skinny Matrices on Distributed GPU Systems
by Nenad Mijić, Abhiram Kaushik, Dario Živković and Davor Davidović
Mathematics 2025, 13(22), 3608; https://doi.org/10.3390/math13223608 - 11 Nov 2025
Viewed by 514
Abstract
The QR factorisation is a cornerstone of numerical linear algebra, essential for solving overdetermined linear systems, eigenvalue problems, and various scientific computing tasks. However, computing it for ill-conditioned tall-and-skinny (TS) matrices on large-scale distributed-memory systems, particularly those with multiple GPUs, presents significant challenges in balancing numerical stability, high performance, and efficient communication. Traditional Householder-based QR methods provide numerical stability but perform poorly on TS matrices due to their reliance on memory-bound kernels. This paper introduces a novel algorithm for computing the QR factorisation of ill-conditioned TS matrices based on CholeskyQR methods. Although CholeskyQR is fast, it typically fails due to severe loss of orthogonality for ill-conditioned inputs. To solve this, our new algorithm, mCQRGSI+, combines the speed of CholeskyQR with stabilising techniques from the Gram–Schmidt process. It is specifically optimised for distributed multi-GPU systems, using adaptive strategies to balance computation and communication. Our analysis shows the method achieves accuracy comparable to Householder QR, even for extremely ill-conditioned matrices (condition numbers up to 10¹⁶). Scaling experiments demonstrate speedups of up to 12× over ScaLAPACK and 16× over SLATE’s CholeskyQR2. This work delivers a method that is both robust and highly parallel, advancing the state-of-the-art for this challenging class of problems. Full article
(This article belongs to the Special Issue Parallel, Distributed Computing and Computational Mathematics)
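
The baseline named in the abstract, CholeskyQR2, is compact enough to sketch directly: one CholeskyQR pass (Gram matrix, Cholesky factor, triangular solve) followed by a second pass on Q to repair orthogonality. The Gram–Schmidt-based stabilisation and the multi-GPU distribution of mCQRGSI+ are not shown, and the random tall-and-skinny test matrix is an assumption.

```python
# CholeskyQR2 for a tall-and-skinny matrix (serial NumPy sketch).
import numpy as np

def cholesky_qr(A):
    W = A.T @ A                          # Gram matrix (one all-reduce in parallel)
    R = np.linalg.cholesky(W).T          # upper-triangular factor, W = R^T R
    Q = np.linalg.solve(R.T, A.T).T      # Q = A R^{-1} via triangular solve
    return Q, R

def cholesky_qr2(A):
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)              # second pass repairs orthogonality
    return Q, R2 @ R1

rng = np.random.default_rng(0)
A = rng.standard_normal((100_000, 32))   # tall-and-skinny test matrix
Q, R = cholesky_qr2(A)
print(np.linalg.norm(Q.T @ Q - np.eye(32)),              # loss of orthogonality
      np.linalg.norm(A - Q @ R) / np.linalg.norm(A))     # relative residual
```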

29 pages, 2147 KB  
Article
An Analysis of the Computational Complexity and Efficiency of Various Algorithms for Solving a Nonlinear Model of Radon Volumetric Activity with a Fractional Derivative of a Variable Order
by Dmitrii Tverdyi
Computation 2025, 13(11), 252; https://doi.org/10.3390/computation13110252 - 2 Nov 2025
Cited by 1 | Viewed by 531
Abstract
The article presents a study of the computational complexity and efficiency of various parallel algorithms that implement the numerical solution of the equation in the hereditary α(t)-model of radon volumetric activity (RVA) in a storage chamber. As a test example, a problem based on such a model is solved, which is a Cauchy problem for a nonlinear fractional differential equation with a Gerasimov–Caputo derivative of a variable order and variable coefficients. Such equations arise in problems of modeling anomalous RVA variations. Anomalous RVA can be considered one of the short-term precursors to earthquakes as an indicator of geological processes. However, the mechanisms of such anomalies are still poorly understood, and direct observations are impossible. This determines the importance of such mathematical modeling tasks and, therefore, of effective algorithms for their solution. This subsequently allows us to move on to inverse problems based on RVA data, where it is important to choose the most suitable algorithm for solving the direct problem in terms of computational resource costs. An analysis and an evaluation of various algorithms are based on data on the average time taken to solve a test problem in a series of computational experiments. To analyze effectiveness, the acceleration, efficiency, and cost of algorithms are determined, and the efficiency of CPU thread loading is evaluated. The results show that parallel algorithms demonstrate a significant increase in calculation speed compared to sequential analogs; hybrid parallel CPU–GPU algorithms provide a significant performance advantage when solving computationally complex problems, and it is possible to determine the optimal number of CPU threads for calculations. For sequential and parallel algorithms implementing numerical solutions, asymptotic complexity estimates are given, showing that, for most of the proposed algorithm implementations, the complexity tends to be n² in terms of both computation time and memory consumption. Full article
(This article belongs to the Section Computational Engineering)
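
The roughly n² cost quoted above comes from the history sum of the fractional derivative: every time step revisits all previous steps. The sketch below applies a standard L1-type discretisation of a variable-order Caputo derivative to a linear relaxation test problem; the α(t) profile and decay rate are illustrative assumptions, and this generic scheme is not the author's production solver.

```python
# L1-type scheme for D^{alpha(t)} u(t) = -lam * u(t), u(0) = 1.
# The growing history sum gives the ~n^2 time/memory behaviour discussed above.
import numpy as np
from math import gamma

def alpha(t):                       # assumed variable fractional order in (0, 1)
    return 0.6 + 0.3 * np.sin(np.pi * t)

lam, T, n = 1.0, 2.0, 400
dt = T / n
u = np.empty(n + 1); u[0] = 1.0

for m in range(1, n + 1):
    a = alpha(m * dt)
    A = dt ** (-a) / gamma(2.0 - a)            # L1 scaling factor at this step
    j = np.arange(1, m)                        # history indices (empty when m = 1)
    b = (j + 1) ** (1.0 - a) - j ** (1.0 - a)  # L1 weights
    hist = np.sum(b * (u[m - j] - u[m - j - 1]))   # O(m) work -> O(n^2) in total
    u[m] = A * (u[m - 1] - hist) / (A + lam)   # implicit update of the linear term
print(u[-1])
```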
