1. Introduction
Unmanned aerial vehicle (UAV) laser scanning (ULS) has become a core technology for high-resolution three-dimensional geospatial data acquisition [1]. By tightly integrating light detection and ranging (LiDAR) sensors with high-rate global navigation satellite system (GNSS) receivers and inertial measurement units (IMUs), ULS systems enable rapid collection of dense and accurate point clouds under flexible flight conditions [2,3]. Compared with traditional airborne laser scanning (ALS) and terrestrial laser scanning (TLS), ULS offers advantages in deployment flexibility, operational safety, and cost-effectiveness [4], and has therefore been widely applied in topographic mapping [5], infrastructure surveying and inspection [6,7], forestry inventory [8], disaster assessment [9], and digital twin construction [10].
Driven by advances in laser ranging technology, modern ULS systems now operate at pulse repetition rates in the megahertz range [11] and increasingly adopt multi-head configurations [12]. As a result, an individual ULS project routinely generates hundreds of millions to billions of laser points. This dramatic growth in data volume has shifted the efficiency bottleneck of the ULS workflow from field acquisition to office-based data processing. Efficient processing of massive point clouds has thus become a key factor limiting the scalability of ULS applications.
Within the ULS processing chain, direct georeferencing constitutes the first and one of the most data-intensive steps. By using high-frequency GNSS/IMU-derived platform position and attitude (POS trajectory) and system calibration information, direct georeferencing transforms raw laser measurements from the sensor-owned coordinate system (SOCS) directly into a georeferenced map coordinate system, eliminating the need for ground control points and significantly improving operational efficiency [13]. However, rigorous direct georeferencing requires executing a complex sequence of coordinate transformations, interpolations, and geodetic projections for each laser point [14]. When implemented serially, these computations become prohibitively time-consuming for large datasets.
To alleviate such computational bottlenecks, parallel computing has been widely adopted in geospatial data processing [15]. Multicore CPU parallelism using OpenMP and heterogeneous CPU-GPU computing using CUDA have demonstrated significant performance gains in downstream LiDAR tasks such as DEM generation, point cloud segmentation, filtering, and registration. Nevertheless, most existing studies focus on iterative or algorithmically complex analysis stages applied to preprocessed point clouds. In contrast, the direct georeferencing stage, despite being the earliest, most fundamental, and most data-intensive component of the ULS workflow, has received limited systematic investigation regarding its parallelization potential and accuracy-efficiency trade-offs.
Before the widespread adoption of parallel hardware, direct georeferencing of point clouds relied primarily on traditional serial CPU processing. Altuntas et al. [16] performed serial correction of LiDAR point cloud attitude using data from continuously operating reference stations to achieve direct georeferencing. This approach belongs to conventional GNSS post-processing, with its accuracy dependent on serial adjustment of the processing chain, and is mainly applicable to static terrestrial scanning stations. Zhang et al. [17] systematically analyzed the direct georeferencing workflow of airborne LiDAR measurements in national coordinate systems and map projections, identifying multiple geometric distortion factors, including 3D scale, earth curvature, length, and angular distortions. They further proposed both high-precision and approximate map projection correction formulas to compensate for projection distortions within traditional serial point cloud processing pipelines, thereby ensuring point cloud accuracy in national coordinate systems.
While traditional CPU serial computing offers advantages regarding versatility and the handling of complex logic, it faces significant bottlenecks. These include low parallelism inherent to single-core architectures, memory access latency, and inefficiency when processing large-scale datasets.
With the widespread adoption of multi-core CPUs, parallel computing based on shared-memory architectures has become the preferred solution. As an industry standard, OpenMP enables developers to conveniently distribute computationally intensive loops across multiple CPU cores for parallel execution through compiler directives [18]. Owing to its performance portability and ease of use, OpenMP has been widely applied in scientific computing libraries and geospatial data processing tasks [19].
In the LiDAR data processing domain, applications of OpenMP have primarily focused on computationally intensive or algorithmically complex downstream analysis tasks. For the highly time-consuming DEM construction stage, Song et al. [20] proposed a block-based parallel Delaunay triangulation growth algorithm, achieving task-level parallelism in multi-core environments. Experimental results demonstrated significant speedups over serial algorithms when processing large-scale airborne LiDAR point clouds, while maintaining efficient memory usage. To address the clustering demands of massive point cloud datasets, Deng et al. [21] optimized the density-based spatial clustering of applications with noise algorithm using OpenMP by parallelizing the neighborhood search of core points, effectively alleviating efficiency bottlenecks in large-scale urban point cloud segmentation. For upstream data transformation in LiDAR processing, Mochurad et al. [22] proposed an OpenMP-based parallel genetic algorithm to determine the current coordinates of LiDAR data.
In recent years, general-purpose computing on graphics processing units (GPGPU) has become a key paradigm shift in high-performance computing [23]. GPUs adopt a high-throughput many-core architecture, integrating thousands of computing units and being specifically designed for large-scale parallel workloads under the single-instruction multiple-thread (SIMT) model [24]. These architectural characteristics have established a typical CPU–GPU heterogeneous computing workflow, in which the CPU is responsible for overall logic control, while the GPU acts as a co-processor to execute computation-intensive kernels [25]. When processing massive and unstructured data such as LiDAR point clouds, this heterogeneous paradigm can significantly overcome the computational limitations of traditional serial algorithms [26].
Similarly, research on GPU parallelization has been largely focused on downstream analysis tasks. In real-time localization, GPU acceleration has been widely applied to point cloud registration algorithms to meet the real-time requirements of autonomous navigation and Simultaneous Localization and Mapping (SLAM) systems. For example, Koide et al. [27] employed a CUDA-optimized parallel normal distribution transform algorithm, achieving speedups ranging from several to tens of times while maintaining accuracy, thereby satisfying the localization demands of high-speed platforms. Methods such as VAN-ICP further reduce the execution time of the ICP algorithm by accelerating approximate nearest neighbor searches through voxel inflation [28]. In preprocessing, GPU acceleration mainly targets fundamental operations such as filtering and ground segmentation to remove noise and separate ground points, providing clean data for subsequent analysis. For instance, Zermas et al. [29] proposed a scanline-based parallel segmentation algorithm that exploits the independent processing of individual scan lines on the GPU, enabling millisecond-level ground extraction. In addition, the application of CPU–GPU heterogeneous computing frameworks in large-scale geospatial point cloud processing further highlights the advantages of GPUs. In DEM generation, interpolation, and analysis of large-scale LiDAR point clouds, heterogeneous parallel algorithms effectively leverage the general-purpose capabilities of multi-core CPUs and the high throughput of GPUs, resulting in substantial performance improvements [30].
Current research on parallel computing for LiDAR data processing is highly concentrated on downstream analysis tasks. These tasks are typically iterative and complex, and are often performed on datasets that have already undergone preprocessing. In contrast, the direct georeferencing process, which is the most fundamental, earliest-stage, and data-intensive component of the entire ULS workflow, has received little systematic investigation and performance evaluation with respect to its parallelization potential.
This study addresses this research gap by conducting a comprehensive evaluation of parallel computing strategies for ULS direct georeferencing. The main contributions are threefold: (1) the detailed formulation and implementation of a rigorous direct georeferencing model and an approximate model incorporating Meridian Convergence Angle Compensation; (2) the design of OpenMP-based multicore CPU and CUDA-based GPU parallel implementations for both models; and (3) a quantitative assessment of accuracy and performance using large-scale real-world ULS datasets, providing practical guidance for engineering applications.
The remainder of this paper is organized as follows. Section 2 presents the theoretical principles and implementation details of the two direct georeferencing models, including the POS data search, interpolation, and parallelization strategies and implementations. Section 3 comparatively analyzes the accuracy and efficiency of the serial, OpenMP-based, and CUDA-based parallel implementations using diverse datasets. Section 4 presents the conclusions of this work.
2. Methods
This study implements two models of direct georeferencing: a rigorous model and an approximate model. Furthermore, two parallel computing schemes, OpenMP-based and CUDA-based, are utilized to enhance the data processing efficiency. The overall workflow of the rigorous and approximate direct georeferencing models is illustrated in Figure 1. The left branch corresponds to the rigorous direct georeferencing model, which performs strict point-wise coordinate transformations from the LiDAR SOCS to the projected map coordinate system. The right branch represents the approximate direct georeferencing model, in which meridian convergence angle correction and Gauss–Krüger projection preprocessing are applied to the POS trajectory, simplifying point-wise computations into two rigid-body transformations and thereby enabling efficient processing.
2.1. Rigorous Direct Georeferencing
The rigorous direct georeferencing model strictly follows the complete geodetic transformation chain from the SOCS to the projected map coordinate system, and each laser point undergoes the full sequence of transformations. The raw LiDAR point cloud is measured in the SOCS; each point carries a timestamp, its xyz coordinates, and attributes such as intensity.
To ensure rigorous direct georeferencing accuracy, it is necessary to locate the trajectory records immediately preceding and following the precise timestamp of each laser point and to interpolate position and attitude from these adjacent POS records before any subsequent coordinate transformation. The search is performed using binary search [31]. The position $(\lambda, \varphi, h)$ is interpolated linearly [32], while the attitude is interpolated using spherical linear interpolation (slerp) [33] to ensure smooth rotational motion. In this way, high-precision attitude angles $(r, p, y)$ and geodetic coordinates $(\lambda, \varphi, h)$ at the laser emission instant are obtained.
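To make the lookup-and-interpolation step concrete, the following minimal C++ sketch locates the bracketing records with std::lower_bound and blends position linearly and attitude by quaternion slerp in its textbook form. The PosRecord layout and function names are illustrative, not the paper's actual data structures:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical trajectory record: timestamp, position, attitude quaternion.
struct PosRecord {
    double t;                 // GNSS time [s]
    double lon, lat, h;       // geodetic position
    double qw, qx, qy, qz;    // attitude as unit quaternion
};

// Find the bracketing records via binary search (trajectory sorted by time),
// then blend position linearly and attitude by spherical linear interpolation.
PosRecord Interpolate(const std::vector<PosRecord>& traj, double t) {
    auto hi = std::lower_bound(traj.begin(), traj.end(), t,
        [](const PosRecord& r, double ts) { return r.t < ts; });
    if (hi == traj.begin()) return traj.front();
    if (hi == traj.end())   return traj.back();
    const PosRecord& a = *(hi - 1);
    const PosRecord& b = *hi;
    double u = (t - a.t) / (b.t - a.t);          // normalized interval position

    PosRecord out;
    out.t   = t;
    out.lon = a.lon + u * (b.lon - a.lon);       // linear interpolation
    out.lat = a.lat + u * (b.lat - a.lat);
    out.h   = a.h   + u * (b.h   - a.h);

    // Slerp between the two attitude quaternions.
    double dot = a.qw*b.qw + a.qx*b.qx + a.qy*b.qy + a.qz*b.qz;
    double sign = (dot < 0.0) ? -1.0 : 1.0;      // take the shorter arc
    dot = std::fabs(dot);
    double w0, w1;
    if (dot > 0.9995) {                          // nearly parallel: fall back to lerp
        w0 = 1.0 - u; w1 = u;
    } else {
        double th = std::acos(dot);
        w0 = std::sin((1.0 - u) * th) / std::sin(th);
        w1 = std::sin(u * th) / std::sin(th);
    }
    out.qw = w0*a.qw + sign*w1*b.qw;  out.qx = w0*a.qx + sign*w1*b.qx;
    out.qy = w0*a.qy + sign*w1*b.qy;  out.qz = w0*a.qz + sign*w1*b.qz;
    double n = std::sqrt(out.qw*out.qw + out.qx*out.qx
                       + out.qy*out.qy + out.qz*out.qz);
    out.qw /= n; out.qx /= n; out.qy /= n; out.qz /= n;  // renormalize
    return out;
}
```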
Using the interpolated navigation parameters, the laser point coordinates are then transformed through a series of coordinate systems. First, the target coordinates in the laser scanner coordinate system are transformed into the IMU coordinate system using the lever-arm offset and the boresight angles between the laser scanner and the IMU. Then, using the interpolated IMU attitude angles (roll, pitch, and heading), the coordinates in the IMU coordinate system are transformed into the Local Level Coordinate System. Finally, based on the latitude, longitude, and ellipsoidal height provided by the GNSS/IMU, the coordinates in the local coordinate system are converted into the WGS84 ECEF coordinate system.
The transformation of a laser point from the laser scanner coordinate system to the WGS84 ECEF coordinate system is given by:

$$\mathbf{X}_{ECEF} = \mathbf{T}_{LL}^{ECEF} + \mathbf{R}_{LL}^{ECEF}\,\mathbf{R}_{IMU}^{LL}\left(\mathbf{R}_{S}^{IMU}\,\mathbf{x}_{S} + \mathbf{t}_{S}^{IMU}\right)$$

where $\mathbf{X}_{ECEF}$ is the coordinate of the laser point in the WGS84 ECEF coordinate system; $\mathbf{x}_{S}$ is the raw laser point coordinate in the SOCS; $\mathbf{R}_{S}^{IMU}$ is the rotation matrix from the SOCS to the IMU coordinate system, parameterized by the boresight angles $(\alpha, \beta, \gamma)$ between the laser scanner and the IMU; $\mathbf{t}_{S}^{IMU}$ is the translation vector from the SOCS to the IMU coordinate system, given by the lever-arm offsets $(l_x, l_y, l_z)$ between the laser scanner and the IMU; $\mathbf{R}_{IMU}^{LL}$ is the rotation matrix from the IMU coordinate system to the Local Level Coordinate System, built from the attitude angles $(r, p, y)$ (roll, pitch, and heading) provided by the IMU; and $\mathbf{R}_{LL}^{ECEF}$ and $\mathbf{T}_{LL}^{ECEF}$ are the rotation matrix and translation vector from the Local Level Coordinate System to the WGS84 ECEF coordinate system, computed from the longitude, latitude, and ellipsoidal height $(\lambda, \varphi, h)$ provided by the GNSS/IMU.
To obtain the final planar coordinates of the laser point, $\mathbf{X}_{ECEF}$ must first be iteratively solved to obtain the geodetic coordinates $(\varphi, \lambda, h)$. Then, the final planar coordinates $(E, N)$ are obtained through the Gauss–Krüger projection.
In summary, the rigorous direct georeferencing model requires performing a sequence of strict coordinate transformations for every individual laser foot point. To clearly illustrate this computational process, we summarize the workflow in Algorithm 1. Theoretically, the time complexity of this serial algorithm is $O(N \log M)$, determined by the binary search for each of the $N$ laser points within the $M$ trajectory records, combined with the computationally expensive iterative geodetic transformations. The space complexity is $O(N + M)$, as it primarily requires memory to store the massive point cloud and the trajectory data.
| Algorithm 1: Rigorous Direct Georeferencing |

Input: Raw LiDAR points $P = \{p_i = (t_i, \mathbf{x}_{S,i})\}_{i=1}^{N}$; POS trajectory $S = \{s_j = (t_j, \lambda_j, \varphi_j, h_j, r_j, p_j, y_j)\}_{j=1}^{M}$; calibration params: $(\alpha, \beta, \gamma)$ (boresight), $(l_x, l_y, l_z)$ (lever-arm); ellipsoid params: $(a, e^2)$; central meridian: $L_0$
Output: Georeferenced points $G = \{g_i = (E_i, N_i, H_i)\}_{i=1}^{N}$

1: Load trajectory $S$ and sort by time
2: for $i = 1$ to $N$ do
3:  $\mathbf{X}_{IMU} \leftarrow \mathbf{R}_{S}^{IMU}\,\mathbf{x}_{S,i} + \mathbf{t}_{S}^{IMU}$ // Step 1: SOCS to IMU Coordinate System
4:  $(s_k, s_{k+1}) \leftarrow$ BinarySearch($S$, $t_i$) // Step 2: POS Search and Interpolation
5:  State_interp $\leftarrow$ Lerp($s_k$, $s_{k+1}$, $t_i$)
6:  Attitude_interp $\leftarrow$ Slerp($s_k$, $s_{k+1}$, $t_i$)
7:  $\mathbf{R}_{IMU}^{LL} \leftarrow$ BuildRotationMatrix(Attitude_interp) // Step 3: IMU to Local Level Coordinate System
8:  $\mathbf{X}_{LL} \leftarrow \mathbf{R}_{IMU}^{LL}\,\mathbf{X}_{IMU}$
9:  $\mathbf{R}_{LL}^{ECEF} \leftarrow$ CalcEarthRotation(State_interp.lat, State_interp.lon) // Step 4: LLCS to ECEF
10: $\mathbf{T}_{LL}^{ECEF} \leftarrow$ CalcEarthTranslation(State_interp.lat, State_interp.lon, State_interp.h)
11: $\mathbf{X}_{ECEF} \leftarrow \mathbf{R}_{LL}^{ECEF}\,\mathbf{X}_{LL} + \mathbf{T}_{LL}^{ECEF}$
12: $(\varphi_i, \lambda_i, h_i) \leftarrow$ ECEF2Geodetic($\mathbf{X}_{ECEF}$) // Step 5: ECEF to Geodetic (Iterative)
13: $(E_i, N_i) \leftarrow$ Gauss-KrügerProjection($\varphi_i$, $\lambda_i$, $L_0$) // Step 6: Map Projection (Gauss-Krüger)
14: $g_i \leftarrow (E_i, N_i, h_i)$
15: Append $g_i$ to $G$
16: end for
17: return $G$ |
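For illustration, a compact C++ sketch of the per-point transformation chain (Steps 1–4 of Algorithm 1) is given below. The small Vec3/Mat3 helpers and function names are ours, not the paper's implementation; in particular, EnuToEcef assumes an east-north-up local level frame, which the paper does not explicitly state:

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };
struct Mat3 { double m[3][3]; };

Vec3 Mul(const Mat3& R, const Vec3& v) {
    return { R.m[0][0]*v.x + R.m[0][1]*v.y + R.m[0][2]*v.z,
             R.m[1][0]*v.x + R.m[1][1]*v.y + R.m[1][2]*v.z,
             R.m[2][0]*v.x + R.m[2][1]*v.y + R.m[2][2]*v.z };
}
Vec3 Add(const Vec3& a, const Vec3& b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }

// Local-level (ENU) to ECEF rotation for geodetic latitude/longitude in
// radians; the columns are the ENU axes expressed in ECEF.
Mat3 EnuToEcef(double lat, double lon) {
    const double sl = std::sin(lon), cl = std::cos(lon);
    const double sp = std::sin(lat), cp = std::cos(lat);
    return {{{ -sl, -sp * cl, cp * cl },
             {  cl, -sp * sl, cp * sl },
             { 0.0,       cp,      sp }}};
}

// Steps 1-4 of Algorithm 1 for one point: SOCS -> IMU -> local level -> ECEF.
// R_si/t_si: boresight rotation and lever-arm; R_il: built from the
// interpolated attitude; R_le/t_le: from the interpolated geodetic position.
Vec3 RigorousToECEF(const Vec3& p_socs,
                    const Mat3& R_si, const Vec3& t_si,
                    const Mat3& R_il,
                    const Mat3& R_le, const Vec3& t_le) {
    Vec3 p_imu = Add(Mul(R_si, p_socs), t_si);  // Step 1: SOCS -> IMU
    Vec3 p_ll  = Mul(R_il, p_imu);              // Step 3: IMU -> local level
    return Add(Mul(R_le, p_ll), t_le);          // Step 4: local level -> ECEF
}
```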
2.2. Approximate Direct Georeferencing with Meridian Convergence Angle Compensation
To reduce the computational cost of processing massive datasets, the approximate model adopts a POS trajectory preprocessing and rigid body transformation strategy. The core idea is to assume that within a local scanning range, the variations in the meridian convergence angle and the Gauss projection scale factor are negligible.
In this method, the POS trajectory is first preprocessed. The meridian convergence angle $\gamma$ at each trajectory point is computed, and the heading angle is corrected as $y' = y - \gamma$. The POS positions are then directly projected into the ENH coordinate system using the Gauss–Krüger projection. Subsequently, the computation of each laser point is simplified to a local rigid body transformation with the POS position taken as the origin:
$$\mathbf{X}_{map} = \mathbf{R}_{IMU}^{map}\,\mathbf{X}_{IMU} + \mathbf{T}_{IMU}^{map}, \qquad \mathbf{X}_{IMU} = \mathbf{R}_{S}^{IMU}\,\mathbf{x}_{S} + \mathbf{t}_{S}^{IMU}$$

where $\mathbf{X}_{map}$ is the coordinate of the laser point in the map projection coordinate system; $\mathbf{R}_{IMU}^{map}$ is the rotation matrix from the IMU coordinate system to the map projection coordinate system, built from the attitude angles $(r, p, y')$ provided by the IMU, namely roll, pitch, and the corrected heading; and $\mathbf{T}_{IMU}^{map}$ is the translation vector from the IMU coordinate system to the map projection coordinate system, given by the Gauss-projected POS trajectory coordinates $(E, N, H)$, corresponding to the easting, northing, and height of the POS position.
By incorporating meridian convergence angle compensation and trajectory pre-projection, the approximate model avoids computationally expensive iterative transformations during the point-wise loop. Algorithm 2 details this method, which decouples the process into a ‘Trajectory Preprocessing’ phase and a simplified ‘Point-wise Rigid Transformation’ phase. While the overall time complexity remains $O(N \log M)$, the constant factor is significantly reduced because the heavy projection math is moved to the preprocessing stage (which takes only $O(M)$ time). The space complexity remains $O(N + M)$, with a negligible increase for storing the intermediate projected trajectory.
| Algorithm 2: Approximate Direct Georeferencing |

Input: Same as Algorithm 1
Output: Georeferenced points $G = \{g_i = (E_i, N_i, H_i)\}_{i=1}^{N}$

// Phase 1: Preprocessing POS Trajectory (Done once)
1: for $j = 1$ to $M$ do
2:  $\gamma_j \leftarrow$ CalcMeridianConvergence($\varphi_j$, $\lambda_j$, $L_0$) // Calculate Meridian Convergence Angle
3:  $y'_j \leftarrow y_j - \gamma_j$ // Compensate Heading
4:  $(E_j, N_j) \leftarrow$ Gauss-KrügerProjection($\varphi_j$, $\lambda_j$, $L_0$) // Direct Projection of POS position
5:  $s'_j \leftarrow (t_j, E_j, N_j, h_j, r_j, p_j, y'_j)$ // Store projected state
6:  Append $s'_j$ to $S'$
7: end for
// Phase 2: Simplified Point Processing
8: for $i = 1$ to $N$ do
9:  $\mathbf{X}_{IMU} \leftarrow \mathbf{R}_{S}^{IMU}\,\mathbf{x}_{S,i} + \mathbf{t}_{S}^{IMU}$ // Step 1: SOCS to IMU
10: State_proj $\leftarrow$ SearchAndInterpolate($S'$, $t_i$) // Step 2: Search and Interpolate in Projected Domain
11: $\mathbf{R}_{IMU}^{map} \leftarrow$ BuildRotationMatrix(State_proj.$r$, State_proj.$p$, State_proj.$y'$) // Step 3: Rigid Body Transformation (IMU to Map)
12: $\mathbf{T}_{IMU}^{map} \leftarrow$ (State_proj.E, State_proj.N, State_proj.H)
13: $g_i \leftarrow \mathbf{R}_{IMU}^{map}\,\mathbf{X}_{IMU} + \mathbf{T}_{IMU}^{map}$ // Direct transformation without geodetic iteration
14: Append $g_i$ to $G$
15: end for
16: return $G$ |
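A minimal sketch of the two phases follows, reusing the Vec3/Mat3 helpers from the earlier sketch. The first-order convergence formula $\gamma \approx (\lambda - \lambda_0)\sin\varphi$ is a common approximation used here only for illustration; the paper's exact series and sign convention may differ:

```cpp
#include <cmath>

// Phase 1 (per trajectory record): meridian convergence via the first-order
// approximation gamma ~ (lon - lon0) * sin(lat); angles in radians.
double MeridianConvergence(double lat, double lon, double lon0) {
    return (lon - lon0) * std::sin(lat);
}

// Phase 2 (per laser point): rigid-body transform in the projected frame.
// R_map is built from (roll, pitch, corrected heading); pos_ENH is the
// projected POS position interpolated at the point's timestamp.
// No iterative geodetic step is required.
Vec3 ApproxGeoref(const Vec3& p_imu, const Mat3& R_map, const Vec3& pos_ENH) {
    return Add(Mul(R_map, p_imu), pos_ENH);
}
```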
2.3. Parallel Direct Georeferencing Architecture and Implementation
To overcome the I/O wall and memory wall encountered in large-scale ULS data processing, this study designs a direct georeferencing architecture based on a multithreaded pipeline to maximize the overlap between I/O operations and core computation. The overall technical architecture and data flow are shown in Figure 2.
2.3.1. Asynchronous I/O Pipeline Based on Producer-Consumer Model
Direct georeferencing is essentially a dataflow-driven, high-throughput task. In order to solve the speed mismatch problem where the disk I/O speed is far lower than the computing speed, this study constructs a three-stage asynchronous pipeline based on the producer-consumer model.
- (1)
Data Prefetching Stage: As the “producer” of the pipeline, a dedicated reading thread continuously loads raw binary point cloud data from disk. To reduce file system I/O overhead, this module adopts a large-block reading strategy, directly filling preallocated memory buffers so that the computational core does not stall while waiting for data input.
- (2)
Core Processing Stage: The main thread acts as the “consumer,” retrieving data from the buffers in batches and dispatching coordinate transformation computations to either the CPU thread (via serial or OpenMP) or the GPU (via CUDA) according to the settings.
- (3)
Asynchronous Write-back Stage: Once georeferencing is completed, the resulting coordinates are pushed into an output queue and written to disk by an independent background thread, thereby achieving temporal overlap among the reading, computation, and writing stages.
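The following self-contained C++ sketch illustrates one way to wire up such a three-stage pipeline with a bounded blocking queue. The Block type, queue capacities, and the placeholder georeferencing call are assumptions for illustration, not the paper's implementation:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One data block flowing through the pipeline; contents are illustrative.
struct Block { std::vector<float> pts; };

// Simple bounded blocking queue shared between pipeline stages.
template <typename T>
class BoundedQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    size_t cap_;
    bool closed_ = false;
public:
    explicit BoundedQueue(size_t cap) : cap_(cap) {}
    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push(std::move(v));
        cv_.notify_all();
    }
    bool pop(T& v) {                    // returns false once closed and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        v = std::move(q_.front());
        q_.pop();
        cv_.notify_all();
        return true;
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        cv_.notify_all();
    }
};

// Three-stage pipeline: reader (producer) -> compute (consumer) -> writer.
void RunPipeline() {
    BoundedQueue<Block> in(4), out(4);
    std::thread reader([&] {
        // Producer: fill blocks from disk into 'in' with large-block reads,
        // e.g. in.push(LoadNextBlock()); then signal completion:
        in.close();
    });
    std::thread writer([&] {
        Block b;
        while (out.pop(b)) { /* write b to disk asynchronously */ }
    });
    Block b;
    while (in.pop(b)) {
        // Consumer: dispatch georeferencing (serial / OpenMP / CUDA) on b.
        out.push(std::move(b));
    }
    out.close();
    reader.join();
    writer.join();
}
```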
2.3.2. Adaptive Memory Management Strategy
To address potential memory overflow and fragmentation caused by large-scale point cloud data, this study designs an adaptive memory management strategy that coordinates host memory and GPU memory, ensuring system robustness across different hardware configurations.
- (1)
Circular memory pool: When processing large, continuous data streams, frequent dynamic memory allocation and deallocation can lead to severe memory fragmentation and high system call overhead. To mitigate this issue, a memory pool mechanism based on a ring buffer is used.
During system initialization, a large contiguous block of memory is preallocated. The data loading and computation modules reuse this memory region in a lock-free cyclic manner by maintaining read and write pointers, thereby fundamentally eliminating the overhead of dynamic memory management.
Parallel memory copying: To overcome memory bandwidth limitations, parallel memory copy techniques are introduced during data filling and extraction operations within the memory pool. By partitioning large data transfer tasks and mapping them to multiple CPU physical cores for concurrent execution, this strategy significantly increases the transfer rate between the “I/O buffer” and the “compute buffer,” effectively reducing data movement latency.
- (2)
Adaptive dynamic batching for GPU memory: Unlike CPU memory, GPU global memory is typically limited in capacity and not expandable. To prevent memory overflow caused by loading excessive data at once, this system implements a hardware-aware dynamic batching strategy.
During initialization, the system queries the currently available GPU memory in real time via runtime APIs. Based on the available memory capacity and the GPU memory required to process a single laser point (including the timestamp, raw observations, and computed results), the system dynamically determines the maximum allowable batch size for a single kernel launch. This strategy establishes a general resource mapping model:

$$N_{batch} = \left\lfloor \frac{\eta \cdot M_{free}}{S_{point}} \right\rfloor$$

where $M_{free}$ denotes the currently available GPU memory, $S_{point}$ is the memory consumption of a single point data structure, and $\eta \in (0, 1)$ is a safety factor reserved to avoid memory exhaustion. This mechanism endows the algorithm with strong hardware adaptability, allowing it to maintain efficient and stable performance across computing devices with different GPU memory capacities without recompilation.
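The two host-side mechanisms described above can be sketched as follows: ParallelCopy illustrates the multi-threaded memory copy of the circular pool, and ComputeBatchSize the resource mapping model, using the CUDA runtime call cudaMemGetInfo. The 0.95 safety factor mirrors Algorithm 4; the function names are illustrative:

```cpp
#include <cuda_runtime.h>
#include <omp.h>
#include <cstddef>
#include <cstring>

// Parallel memory copy: split one large transfer into per-core chunks.
void ParallelCopy(void* dst, const void* src, size_t bytes) {
    const int nthreads = omp_get_max_threads();
    #pragma omp parallel num_threads(nthreads)
    {
        const size_t chunk = (bytes + nthreads - 1) / nthreads;
        const size_t begin = (size_t)omp_get_thread_num() * chunk;
        if (begin < bytes) {
            const size_t len = (begin + chunk <= bytes) ? chunk : bytes - begin;
            std::memcpy((char*)dst + begin, (const char*)src + begin, len);
        }
    }
}

// Hardware-aware batch sizing: query free GPU memory at run time and derive
// the largest safe per-launch batch, N_batch = floor(eta * M_free / S_point).
size_t ComputeBatchSize(size_t bytes_per_point, double safety = 0.95) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);   // CUDA runtime API query
    return (size_t)(free_bytes * safety) / bytes_per_point;
}
```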
2.3.3. Parallel Implementation Details
This study designs two parallelization strategies using multicore CPUs and many-core GPUs: a coarse-grained task-parallel strategy based on OpenMP, and a fine-grained data-parallel strategy based on CUDA. The schematic diagram of the parallel implementation strategy is shown in
Figure 3.
Coarse-Grained CPU Parallelism via OpenMP
LiDAR point cloud data exhibit inherent data independence: the georeferencing of any individual laser point depends solely on its own timestamp, raw observations, and the corresponding POS trajectory data, making it highly suitable for data-parallel execution. On multicore CPU platforms, this study employs the OpenMP framework to implement coarse-grained, loop-level parallelism.
To maximize CPU utilization and prevent thread stalling, two specific optimization strategies were incorporated into the OpenMP implementation (as detailed in Algorithm 3):
Dynamic Load Balancing Strategy: In the rigorous direct georeferencing model, the time required to search for the corresponding POS record can vary slightly depending on the temporal distribution of the points and the binary search execution path. If static scheduling were used, threads assigned to denser point clusters might take longer, causing other threads to idle at the synchronization barrier. To overcome this, we utilized OpenMP’s dynamic scheduling (#pragma omp parallel for schedule(dynamic, chunk_size)). The runtime logically partitions millions of loop iterations into fixed-size data chunks and dynamically dispatches them to worker threads in real-time, effectively eliminating load imbalance.
Thread-Private Memory and False Sharing Avoidance: During the concurrent coordinate transformation, multiple threads need to compute and store intermediate states (such as rotation matrices and interpolated coordinates). To avoid the severe performance penalty of “false sharing” at the CPU L1/L2 cache level, all intermediate variables are strictly declared as thread-private. The computed georeferenced coordinates are directly written into a pre-allocated, conflict-free output array based on their globally unique index, completely avoiding the use of atomic locks or critical sections.
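A minimal sketch of this loop structure is shown below. PointIn, PointOut, and GeoreferenceOne stand in for the paper's per-point types and the Algorithm 1/2 math, and the chunk size of 4096 is illustrative:

```cpp
#include <omp.h>
#include <vector>

// Hypothetical per-point types; stand-ins for the paper's data structures.
struct PointIn  { double t, x, y, z; };
struct PointOut { double E, N, H; };
PointOut GeoreferenceOne(const PointIn& p);   // Algorithm 1 or 2, per point

void GeorefParallel(const std::vector<PointIn>& pts, std::vector<PointOut>& out) {
    out.resize(pts.size());
    // Dynamic chunks absorb per-point cost variation (e.g., binary search
    // paths); all intermediates are stack locals, hence thread-private.
    #pragma omp parallel for schedule(dynamic, 4096)
    for (long long i = 0; i < (long long)pts.size(); ++i) {
        out[i] = GeoreferenceOne(pts[i]);  // conflict-free write by global index
    }
}
```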
| Algorithm 3: OpenMP Coarse-Grained Parallelism |

Input: File list $F$, memory pool size $S_{pool}$
Global: RingBuffer RB, OutputQueue Q

// Thread 1: Data Producer (I/O Reading)
1: procedure DataProducer()
2: for file in $F$ do
3:  Block $\leftarrow$ ReadBlock(file, BlockSize)
4:  while RB.isFull() do sleep() end while
5:  RB.push(Block)
6: end for
7: Set IsFinished = true
8: end procedure

// Thread 2: Main Compute (OpenMP)
9: procedure ComputeWorker()
10: while not (IsFinished and RB.isEmpty()) do
11:  Batch $\leftarrow$ RB.pop()
12:  if Batch is Empty then continue
// Fork: Parallel Execution on CPU Cores
13:  #pragma omp parallel for schedule(dynamic)
14:  for $i = 0$ to Batch.size do
15:   Result[$i$] $\leftarrow$ Georeference(Batch.points[$i$]) // Execute Logic from Algorithm 1 or Algorithm 2
16:   Batch.results[$i$] $\leftarrow$ Result[$i$]
17:  end for // Join: Implicit Barrier
18:  Q.push(Batch.results)
19: end while
20: end procedure

// Thread 3: Data Writer (Asynchronous Write)
21: procedure DataWriter()
22: while true do
23:  if not Q.isEmpty() then
24:   Data $\leftarrow$ Q.pop()
25:   WriteToDisk(Data)
26:  end if
27: end while
28: end procedure |
Fine-Grained GPU Parallelism via CUDA
To process point cloud arrays at the scale of hundreds of millions of points, the proposed system constructs a one-dimensional thread grid comprising tens of thousands of thread blocks. During kernel execution, each GPU thread computes a globally unique logical index based on its block index and intra-block thread index. This logical index is directly mapped to the memory offset of the input point cloud array.
Since the direct georeferencing algorithm has a relatively low arithmetic intensity compared to its massive data throughput, its performance on the GPU is inherently memory-bound. To overcome the “memory wall” and maximize the utilization of the thousands of CUDA cores, three fine-grained memory optimization strategies were designed (as illustrated in Algorithm 4):
Memory Coalescing for Global Memory Access: To maximize the utilization of global memory bandwidth, the input and output point cloud data structures were reorganized from an Array of Structures (AoS) to a Structure of Arrays (SoA). Under this layout, arrays containing timestamps, X, Y, and Z coordinates are stored independently. When a Warp (32 threads) accesses the data, consecutive threads access consecutive memory addresses, ensuring fully coalesced memory transactions and significantly reducing global memory latency.
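A sketch of the two layouts (field names are illustrative):

```cpp
#include <cstddef>

// Array of Structures (AoS): one struct per point. A warp reading only the
// x field of 32 consecutive points touches 32 strided memory locations.
struct PointAoS { double t, x, y, z; };

// Structure of Arrays (SoA): each field is contiguous, so threads with
// consecutive indices read consecutive addresses -- coalesced transactions.
struct PointsSoA {
    double* t;   // timestamps
    double* x;   // X coordinates
    double* y;   // Y coordinates
    double* z;   // Z coordinates
    size_t  n;   // number of points
};
```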
Constant Memory for Trajectory Caching: The POS trajectory data are read-only and frequently accessed by all threads for binary search and interpolation. We leverage the GPU’s Constant Memory (or Texture/Read-Only Cache, depending on the architecture) to store the currently active segment of the POS trajectory. Because threads within the same Warp typically process temporally adjacent laser points, they access the same or neighboring POS records. Constant memory broadcasts this data to all requesting threads simultaneously, drastically accelerating the trajectory lookup phase.
Shared Memory for Extrinsic Calibration Matrices: The boresight rotation matrices and lever-arm translation vectors constitute global shared data that must be accessed repeatedly for every point transformation. Instead of fetching them from global memory for every calculation, a shared-memory-based cooperative loading strategy is introduced. As shown in Algorithm 4, the leader thread of each block pre-loads the calibration parameters from global memory into the on-chip Shared Memory (which operates at L1 cache speeds). After a __syncthreads() barrier, all threads in the block access these parameters directly from shared memory, completely eliminating redundant global memory reads for matrix multiplications.
| Algorithm 4: CUDA Fine-Grained Parallelism |

Input: Points (SoA layout), Calib, PosData
Output: Georeferenced coordinates

// Host Side: Adaptive Batching & Dispatch
1: Query GPU Available Memory (Mem_free)
2: Batch_Size = (Mem_free * 0.95) / SizeOf(PointStruct)
3: Num_Batches = Ceiling(Total_Points / Batch_Size)
4: for $b = 1$ to Num_Batches do
5:  Current_Chunk = Points[$b$ * Batch_Size ... ($b$+1) * Batch_Size]
// H2D Transfer
6:  cudaMemcpyAsync(d_Points, Current_Chunk, HostToDevice, Stream)
// Kernel Launch: 1 Thread per Point
7:  Blocks = (Current_Chunk.size + ThreadsPerBlock − 1)/ThreadsPerBlock
8:  Kernel_DG<<<Blocks, ThreadsPerBlock, Stream>>>(d_Points, Calib, PosData, d_Output)
// D2H Transfer
9:  cudaMemcpyAsync(h_Output, d_Output, DeviceToHost, Stream)
10: end for

// Device Side: Kernel Function (__global__)
11: function Kernel_DG(Points, Calib, PosData, Output)
// Calculate Global Thread Index
12: idx = blockIdx.x * blockDim.x + threadIdx.x
13: if idx >= Points.size then return
// Optimization: Load Calibration to Shared Memory
14: __shared__ Shared_Calib [12]
15: if threadIdx.x == 0 then
16:  Shared_Calib = Calib // Cooperative loading
17: end if
18: __syncthreads()
// Independent Per-Thread Computation; no data dependency between threads
19: p = Points[idx]
20: res = Georeference(p, Shared_Calib, PosData) // Math logic same as Algorithm 1 or Algorithm 2, executed in parallel
21: Output[idx] = res
22: end function |
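As a concrete counterpart to Algorithm 4's device side, the following compilable CUDA kernel sketch combines the SoA layout with the shared-memory calibration load. Here the first 12 threads of a block each load one calibration value (a common variant of the leader-thread load in Algorithm 4), and the remaining transformation chain is elided:

```cuda
// Minimal CUDA kernel sketch. Calib packs a 3x3 boresight matrix (row-major,
// elements 0-8) and the lever-arm (elements 9-11); the remainder of the
// Algorithm 1/2 math is represented by a placeholder.
__global__ void GeorefKernel(const double* __restrict__ x,
                             const double* __restrict__ y,
                             const double* __restrict__ z,
                             const double* __restrict__ calib,  // 12 values
                             double* __restrict__ outE,
                             double* __restrict__ outN,
                             double* __restrict__ outH,
                             size_t n) {
    __shared__ double sCalib[12];
    if (threadIdx.x < 12) sCalib[threadIdx.x] = calib[threadIdx.x]; // cooperative load
    __syncthreads();                     // calibration visible to the whole block

    size_t idx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // Step 1: rotate by boresight (sCalib[0..8]) and add lever-arm (sCalib[9..11]).
    double px = x[idx], py = y[idx], pz = z[idx];
    double ix = sCalib[0]*px + sCalib[1]*py + sCalib[2]*pz + sCalib[9];
    double iy = sCalib[3]*px + sCalib[4]*py + sCalib[5]*pz + sCalib[10];
    double iz = sCalib[6]*px + sCalib[7]*py + sCalib[8]*pz + sCalib[11];

    // ... remaining chain (POS interpolation, rotation to map frame) elided ...
    outE[idx] = ix; outN[idx] = iy; outH[idx] = iz;
}
```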
3. Results
3.1. Experimental Setup
The experimental hardware platform was a laptop computer equipped with an AMD Ryzen 7 4800H CPU (8 cores and 16 threads, base frequency 2.9 GHz), 16 GB of DDR4 memory, and an NVIDIA GeForce RTX 2060 GPU (6 GB GDDR6 memory with 1920 CUDA cores). The software environment was based on the Windows 10 64-bit operating system, with Microsoft Visual Studio 2022 Community Version as the development tool. The parallel computing environments employed the OpenMP 2.0 standard for CPU-based multicore parallelism and the CUDA 12.5 Toolkit for GPU heterogeneous parallel computing.
The experimental data consisted of real ULS datasets acquired by the Huace Navigation AA10 airborne LiDAR surveying system (Shanghai Huace Navigation Technology Ltd., Shanghai, China; https://www.huace.cn/pdDetailNew/137, accessed on 20 March 2026) and the AU20 multi-platform laser scanning system (https://www.huace.cn/pdDetailNew/141, accessed on 20 March 2026). To comprehensively evaluate the robustness and efficiency of the proposed algorithms under different terrain characteristics and data scales, six representative datasets were selected. The point cloud sizes ranged from tens of millions to 700 million points, covering diverse scenarios including residential areas, high-rise buildings, plains, sports fields, and large industrial parks. Detailed information on the datasets is provided in Table 1, and visualizations of the point clouds obtained after rigorous direct georeferencing are shown in Figure 4.
To ensure an objective evaluation of computational efficiency, an asynchronous I/O architecture based on the producer–consumer model was adopted. The timing measurements included only the in-memory computation time of the core algorithms, namely POS preprocessing, searching, interpolation, and coordinate transformation, while excluding disk I/O latency.
3.2. Accuracy Evaluation
To verify the reliability of the approximate model in practical engineering applications, this section adopts the results obtained from the rigorous model as the reference ground truth and quantitatively evaluates the model simplification errors introduced by the approximate approach. Specifically, point-to-point differences between the point clouds generated by the approximate model and those produced by the rigorous model were computed. The statistical metrics include the mean error (ME) and the root mean square error (RMSE) in the east (E/X), north (N/Y), and up (H/Z) directions. The detailed statistical results are summarized in Table 2.
The results indicate that the approximate model achieves millimeter-level approximation accuracy across all test scenarios: the 3D RMSE ranges from 0.149 mm to 6.744 mm across the datasets, while the MEs are at the sub-millimeter level. Considering the inherent ranging errors of the laser scanner and the GNSS positioning and IMU measurement errors of low-cost ULS systems, the approximate model is practical for most ULS systems, except those equipped with ultra-high-precision laser scanners and IMUs. These findings demonstrate that, under typical operating conditions, performing local rigid-body transformations on preprocessed POS data is a high-fidelity alternative to per-point rigorous projection transformations.
It is noteworthy that the comprehensive error of Data 6 is noticeably higher than that of the other datasets. This deviation is primarily driven by the exceptional spatial extent and data scale of Data 6. Because the approximate model relies on local constant assumptions for the meridian convergence angle, laser points located at the far extremities of such a vast survey area experience significantly larger scan ranges. This inevitably leads to a natural accumulation of projection distortions at the boundaries. A detailed geometric analysis and spatial interpretation of this error evolution will be further discussed in Section 4.1.
3.3. Efficiency Comparison
To comprehensively evaluate the computational efficiency of the proposed parallel algorithms in the ULS direct georeferencing workflow, the core computation time of two direct georeferencing models was recorded under three computing modes: serial execution on a CPU, multicore CPU parallel execution based on OpenMP, and heterogeneous GPU parallel execution based on CUDA.
The experimental results are reported in Table 3 and Table 4. These tables list the execution times of each dataset under different computing modes and present the corresponding speedup factors: the speedup of OpenMP relative to serial CPU execution, the speedup of GPU execution relative to serial CPU execution, and the speedup of GPU execution relative to OpenMP.
Overall, parallel computing achieves substantial performance gains for both models, with the approximate model exhibiting significantly lower total runtime due to its simplified computational pipeline. The OpenMP-based parallel scheme attains speedups ranging from 7.0 times to 8.1 times on an 8-core, 16-thread CPU. These improvements primarily result from efficient parallelization of the per-point point-cloud computation loop and the independent task allocation in the POS preprocessing stage. The speedup increases slightly with data scale, as larger datasets better amortize thread creation and synchronization overheads, leading to improved load balancing.
In comparison, the GPU-based parallel scheme demonstrates superior performance, with speedups between 11.9 times and 16.7 times, significantly outperforming the OpenMP approach. This advantage arises from the thousands of CUDA cores on the GPU, which can concurrently process tens of thousands of laser-point threads, closely matching the highly data-parallel nature of point-cloud computation. In addition, the batch-processing pipeline effectively alleviates GPU memory capacity constraints, while high-speed access to shared memory further reduces global memory latency.
Because the rigorous model involves a complete coordinate transformation chain, its computational intensity is significantly higher than that of the approximate model, and consequently, it achieves more pronounced benefits from parallel acceleration. The OpenMP-based parallel scheme attains speedups ranging from 7.7 times to 8.9 times, approaching the theoretical limit imposed by the number of physical CPU cores. In contrast, the GPU-based parallel scheme achieves speedups as high as 21.2 times to 24.6 times, substantially exceeding the GPU acceleration observed for the approximate model.
This difference arises because the rigorous model includes computationally intensive operations such as Gauss-Krüger projection, ECEF transformations, and extensive trigonometric and matrix computations, resulting in higher arithmetic intensity. With thousands of computing cores, the GPU can more effectively hide global memory access latency when processing such compute-intensive workloads, thereby exploiting its floating-point performance. Conversely, due to the substantially simplified computation pipeline of the approximate model, its performance bottleneck is more likely constrained by memory bandwidth, preventing the GPU’s computational capability from being fully utilized.
It is noteworthy that, for both the low-complexity approximate model and the computationally intensive rigorous model, the speedup factors of the OpenMP-based and GPU-based parallel schemes remain essentially consistent, despite substantial differences in terrain complexity and point-cloud spatial distribution among the datasets. This observation indicates that point-wise direct georeferencing algorithms inherently exhibit strong data decoupling characteristics: their computational efficiency scales linearly with the size of the point cloud and is largely independent of scene geometry or local point density, demonstrating high algorithmic robustness.
3.3.1. Impact of Asynchronous I/O Pipeline
In addition to the core computational acceleration, the end-to-end processing efficiency is critical for massive ULS datasets. The I/O latency often becomes a bottleneck when the computation speed significantly increases. To verify the effectiveness of the proposed three-stage asynchronous pipeline, we conducted a comparative experiment between the “synchronous I/O” mode and the “asynchronous I/O” mode.
In the synchronous I/O mode, data reading, georeferencing computation, and result writing are executed sequentially in a single thread. In contrast, the asynchronous I/O mode employs the producer-consumer architecture described in
Section 2.3, where data loading, computation, and writing are handled by independent threads concurrently.
The comparison was performed on the largest dataset, Data 6 (700 million points), using both the GPU-based and OpenMP-based rigorous models. The results are summarized in Table 5.
The results demonstrate that the asynchronous pipeline delivers substantial performance improvements for both parallel implementations.
For the GPU-based model, the total runtime decreases from 247.189 s to 117.996 s, achieving a system-level speedup of 2.1 times. In the synchronous mode, the core calculation takes only 19.962 s, meaning the system spends over 90% of the time waiting for disk I/O. By adopting the asynchronous pipeline, the effective I/O overhead perceived by the system drops significantly to 98 s. This indicates that the proposed architecture successfully masks a large portion of the data transfer latency by overlapping GPU computation with disk operations, thereby keeping the high-speed GPU cores fully occupied.
Similarly, for the OpenMP-based model, the asynchronous strategy reduces the total runtime from 282.884 s to 142.897 s, yielding a 2.0 times speedup. Although the OpenMP calculation time is longer than that of the GPU, the I/O overhead remains the dominant factor in the synchronous mode. The asynchronous pipeline effectively reduces this overhead to 84.757 s.
It is worth noting that the “I/O Overhead” in the asynchronous mode is significantly lower than in the synchronous mode. This reduction occurs because the data reading and writing threads operate in parallel with the computation thread; thus, the visible system latency is determined primarily by the slowest stage of the pipeline rather than the sum of all stages. These findings confirm that the proposed three-stage asynchronous architecture effectively breaks the I/O wall in massive point cloud processing, ensuring that the efficiency gains from parallel algorithms translate directly into end-to-end productivity.
3.3.2. OpenMP Scalability Analysis
To evaluate the scalability and parallel efficiency of the proposed OpenMP-based implementation, a strong scaling test was conducted using the rigorous model on Data 6, the largest dataset. The number of worker threads was varied from 1 to 16, spanning both the physical cores and the logical threads (via simultaneous multithreading, SMT) of the processor, and the absolute speedup relative to the serial baseline execution was recorded. The quantitative results are listed in Table 6, and the speedup trend is illustrated in Figure 5.
As illustrated in Figure 5, the speedup curve exhibits two distinct phases corresponding to the hardware architecture of the test platform (AMD Ryzen 7 4800H).
Physical Core Scaling (1–8 Threads): When the thread count is within the number of physical cores (8 cores), the algorithm demonstrates strong scalability. At 8 threads, the processing time drops from 450.307 s to 77.840 s, achieving a speedup of 5.8 times with a parallel efficiency of 0.72. The high efficiency (0.90 at 4 threads) indicates that the OpenMP dynamic scheduling strategy effectively balances the workload among cores, preventing significant load imbalance even when processing massive unstructured point clouds. The slight deviation from ideal linear speedup in this phase is primarily attributed to memory bandwidth contention, as eight cores simultaneously request high-frequency memory access for POS interpolation and coordinate data retrieval.
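For reference, the parallel efficiency values quoted here follow the standard strong-scaling definition

$$E(p) = \frac{S(p)}{p} = \frac{T_1}{p \cdot T_p},$$

so that at $p = 8$ threads, $E = 450.307/(8 \times 77.840) \approx 0.72$, consistent with Table 6.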
Logical Core Saturation (12–16 Threads): As the thread count exceeds the number of physical cores, enabling SMT, the performance gains begin to saturate. Increasing threads from 8 to 16 yields a marginal speedup increase from 5.8 times to 7.7 times, while parallel efficiency drops to 0.48. This behavior occurs because the rigorous direct georeferencing model is a compute-intensive task involving heavy double-precision floating-point operations (trigonometric functions and matrix multiplications). Logical threads share the physical Floating-Point Units (FPUs) with the main threads; therefore, SMT provides limited benefits for such dense calculation workloads compared to I/O-bound tasks.
Despite the efficiency drop at high thread counts, the total runtime continues to decrease, reaching a minimum of 58.14 s at 16 threads. This confirms that the proposed parallel architecture successfully maximizes the utilization of available CPU resources to accelerate the georeferencing process.