Efficient Generation of Gridded Ship Emission Inventories from Massive AIS Data Using Spatial Hashing

Liu, Chen; Chen, Rongchang; Sun, Shuting; Xue, Qingqing; Li, Zichao; Xing, Xinying; Wang, Zhixia

doi:10.3390/atmos16111279


by Chen Liu, Rongchang Chen *, Shuting Sun *, Qingqing Xue, Zichao Li, Xinying Xing and Zhixia Wang

China Waterborne Transport Research Institute, Beijing 100088, China

* Authors to whom correspondence should be addressed.

Atmosphere 2025, 16(11), 1279; https://doi.org/10.3390/atmos16111279

Submission received: 24 September 2025 / Revised: 4 November 2025 / Accepted: 6 November 2025 / Published: 11 November 2025

(This article belongs to the Special Issue Air Pollution from Shipping: Measurement and Mitigation)


Abstract

With the development of global maritime trade, ship emissions pose an increasing threat to the global atmospheric environment, especially in international navigation waters and important port areas, where their impact on coastal air quality and ecosystems is becoming increasingly significant. This study proposes a high-throughput gridding algorithm (H-Grid) based on spatial hashing to rapidly generate ship emission inventories, which overcomes the inefficiency of traditional methods caused by complex index building and maintenance. The H-Grid algorithm achieves a constant processing time per data point and possesses inherent parallelism. Using the H-Grid algorithm, with the Yellow Sea area between China and the Republic of Korea as a case study, the emissions of atmospheric pollutants from ships in 2024 were calculated, and their spatiotemporal distribution characteristics were analyzed. In our empirical study, the algorithm’s computational efficiency for processing millions of AIS records was improved by over 10 times compared to traditional geometric calculations, and by more than 4 times compared to mainstream database spatial queries. Our findings provide an efficient tool for large-scale maritime emission analysis, strongly supporting the green development of global shipping.

Keywords:

hash function; AIS; ship emission inventory; air pollution

1. Introduction

Global trade is highly reliant on maritime transportation, with ship transportation accounting for approximately 80 percent of total volume [1,2]. International navigation waters, as areas with dense ship activities, pose a serious threat to the air quality, ecosystems, and human health in coastal areas due to the emissions of pollutants such as sulfur dioxide (SO₂), nitrogen oxides (NO_x), and particulate matter (PM) [3,4,5,6,7,8]. These pollutants have been demonstrated to have significant adverse effects on the health of coastal populations, contributing to respiratory illnesses and environmental issues like acid deposition [9]. Therefore, accurately measuring ship emissions is critical for formulating effective emission reduction policies, a process guided by international frameworks like the MARPOL Convention Annex VI established by the International Maritime Organization (IMO) [10], and for improving coastal air quality and addressing climate change [11,12].

The traditional emission inventory estimation methods mainly adopt “fuel based” methods or “power based” methods. In the preliminary research of our research group, a model to calculate the nitrogen-oxide-emission intensity of ships was designed [13]. While these methods are feasible at a global or national scale, they suffer from low spatial resolution. The popularization of the Automatic Identification System (AIS) has enabled a new paradigm of high-resolution, dynamic emission measurement [14,15,16], becoming a cornerstone for a wide range of maritime studies beyond safety, including traffic analysis, environmental monitoring, and economic modeling [17]. AIS data provides real-time dynamic information such as the position, speed, and heading of ships, forming the data foundation for bottom-up emission inventories, as demonstrated in early regional studies like that by Kim et al. [18].

The core technical challenge in leveraging AIS data lies in efficiently aggregating billions of discrete emission data points into a structured spatial grid. A powerful and widely adopted approach involves using spatial databases, such as PostgreSQL with its PostGIS extension [19,20,21]. Numerous studies have successfully employed this methodology to construct ship emission inventories [22]. These systems typically rely on sophisticated spatial indexes, most notably R-tree-based GiST (Generalized Search Tree) indexes, to accelerate queries. However, for the specific task of high-throughput gridding from massive, append-only datasets, this approach faces inherent limitations. The continuous insertion of data points necessitates frequent updates to the R-tree index, including costly node splitting and balancing operations. This index maintenance overhead becomes a significant performance bottleneck, particularly with the highly skewed spatial distribution of maritime traffic [23].

To circumvent database overhead, other studies have focused on algorithmic approaches. The most straightforward are traditional geometric methods, which involve a brute-force traversal where each data point is checked against every grid cell. More advanced methods employ spatial data structures such as Quadtrees or KD-Trees to recursively partition the space and accelerate the search [24,25]. These structures are theoretically efficient for uniformly distributed data. However, their performance can degrade significantly when faced with the extreme spatial clustering of maritime traffic. Ship movements, concentrated along quasi-linear shipping lanes, can lead to highly unbalanced tree structures and degenerate performance that approaches that of a simple linear search [26].

A third strategy addresses the performance challenge by reducing the data volume itself prior to processing. Techniques such as trajectory simplification or data thinning are employed to create a more manageable dataset. As demonstrated in our group’s previous work on establishing a Degree of Ship Activity (DSA) data pool after cleaning and thinning massive original AIS data [27], this approach can greatly improve calculation efficiency. However, this strategy introduces a fundamental trade-off between computational performance and data fidelity. The process of thinning may remove critical data points corresponding to specific ship operations (e.g., maneuvering, hotelling), potentially affecting the accuracy of the final emission inventory, a well-documented challenge in trajectory data analysis [28].

The preceding analysis reveals a critical research gap: a need for a gridding methodology that can handle massive, skewed AIS datasets at high speed without the overhead of complex index maintenance and without sacrificing the integrity of the original data [29,30]. To address this critical challenge, this study introduces a high-throughput gridding algorithm (H-Grid) based on spatial hashing for the rapid generation of ship emission inventories [31,32]. Our approach leverages this principle to map continuous geographic coordinates directly to discrete grid cell identifiers in constant time, O(1). This design choice resolves the inefficiency of traditional methods and grants the algorithm inherent parallelism, making it uniquely suited for the high-throughput computational demands of large-scale environmental data analysis.

In this study, the H-Grid algorithm was applied to a massive dataset of AIS records from the Yellow Sea between China and the Republic of Korea to demonstrate its enhanced computational efficiency. The performance was improved by up to 10-fold compared to traditional geometric calculations and by over 4-fold when benchmarked against mainstream database spatial queries. These findings not only provide an efficient and scalable tool for large-scale maritime emission analysis but also offer valuable data support for regional planning aimed at mitigating maritime pollution and supporting sustainable shipping practices.

2. Materials and Methods

This section describes the research methodology. The overall technical workflow for transforming raw AIS data into the final gridded emission inventory is visualized in Figure 1. The section is organized into four subsections: Section 2.1 describes the data preprocessing pipeline designed to ensure data quality and integrity; Section 2.2 details the application of the Ship Traffic Emission Assessment Model (STEAM) for calculating instantaneous emissions; Section 2.3 explains the H-Grid gridding algorithm, which is the core of this study; and Section 2.4 outlines the benchmark methodologies used for performance evaluation.

2.1. Data Preprocessing

The raw AIS data, despite its volume, contains various errors, inconsistencies, and irregularities that can compromise the accuracy of the final emission inventory [33,34]. Therefore, a multi-step preprocessing pipeline was designed and implemented to produce a clean, regularized dataset suitable for input into the STEAM model.

First, a rule-based filtering process was applied to remove logically erroneous data points. Records with invalid Maritime Mobile Service Identity (MMSI) numbers, coordinates falling outside the defined study area, or implausible kinematic values were identified and discarded. The threshold for implausible values (e.g., Speed Over Ground (SOG) exceeding established maximums for the vessel type) was determined by analyzing the 99.9th percentile of the historical speed distribution for each vessel type, ensuring a quantifiable basis for filtering. Following this, attribute consistency checks were performed to rectify logical inconsistencies. For instance, discrepancies between a vessel’s reported navigational status (e.g., ‘at anchor’) and its kinematic data (e.g., reporting a high SOG) were identified, and the status was corrected based on the vessel’s movement characteristics.

Next, a spatial outlier detection algorithm was employed to identify and remove geographically anomalous points that do not conform to the general shipping lane patterns. The algorithm operates on a density-based principle, where a point is identified as an outlier if its local neighborhood contains too few other points. For each data point p_i, its neighborhood density ρ(p_i) is calculated based on the number of other points within a specified search radius ε. A point is classified as an outlier if this density falls below a predefined minimum threshold τ, as defined in Equation (1). This step effectively removes erroneous points resulting from GPS signal drift or transmission errors.

p_i \text{ is an outlier if } \rho(p_i) = \left| \{ p_j \in D \mid \mathrm{dist}(p_i, p_j) \le \varepsilon \} \right| < \tau

(1)

The parameters ε (neighborhood radius) and τ (minimum threshold) were determined through an iterative sensitivity analysis. ε was empirically set to 0.005° (half the grid cell size) to capture the immediate local clustering characteristics, while τ was set to 5 points to effectively maximize the removal of isolated noise while preserving legitimate low-traffic vessel tracks.
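The density test in Equation (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `flag_outliers` is hypothetical, the distance is assumed to be plain Euclidean distance in degrees, and the point itself is excluded from its own count, following the wording "other points" in the text. The double loop is quadratic in the number of points and is only meant to make the rule concrete.

```python
import math


def flag_outliers(points, eps=0.005, tau=5):
    """Return one boolean per point: True if the point is an outlier.

    points: list of (lon, lat) tuples in degrees.
    eps:    neighborhood radius (0.005 degrees in the text).
    tau:    minimum neighbor count (5 in the text).
    Assumption: Euclidean distance in degrees; the point itself is not counted.
    """
    flags = []
    for i, (lon_i, lat_i) in enumerate(points):
        rho = 0  # neighborhood density rho(p_i)
        for j, (lon_j, lat_j) in enumerate(points):
            if i != j and math.hypot(lon_i - lon_j, lat_i - lat_j) <= eps:
                rho += 1
        flags.append(rho < tau)
    return flags


# Six tightly spaced points (a plausible lane segment) plus one isolated fix.
lane = [(120.0 + 0.0005 * k, 35.0) for k in range(6)]
stray = [(121.0, 36.0)]
flags = flag_outliers(lane + stray)
```

In this toy input each lane point has five neighbors within ε, so it meets the threshold τ = 5, while the isolated fix has none and is flagged.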

The final and most critical preprocessing step involved temporal regularization to standardize the trajectory data to a uniform 5-min interval, matching the operational requirements of the STEAM model. This was achieved through a two-phase process. First, for dense segments of a trajectory where multiple AIS signals were recorded within a single 5-min window, the kinematic properties (latitude, longitude, SOG, COG) were averaged to generate a single, representative point for that interval. This downsampling step prevents the over-representation of high-frequency data. Second, for sparse segments with temporal gaps longer than 5 min, linear interpolation was used to generate new data points at regular 5-min timestamps. This upsampling step ensures the temporal continuity of the vessel’s voyage. For any two consecutive points P(t₁) and P(t₂), an interpolated point P(t) at a desired time t (t₁ < t < t₂) is calculated as follows:

P(t) = P(t_1) + \frac{t - t_1}{t_2 - t_1} \left( P(t_2) - P(t_1) \right)

(2)
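The two-phase regularization can be sketched as below. This is a hedged illustration under stated assumptions: the function name `regularize` and the tuple layout are hypothetical, each averaged point is stamped with the start of its 5-min window (the text does not say which timestamp is used), and COG is left out because averaging a heading requires circular handling that the text does not describe.

```python
from collections import defaultdict

STEP = 300  # seconds; the 5-min interval used in the text


def regularize(track, step=STEP):
    """track: list of (t_seconds, lat, lon, sog), sorted by time.

    Returns one point per step-aligned timestamp between the first and
    last occupied windows.
    """
    # Phase 1 (downsampling): average all fixes that share a window.
    bins = defaultdict(list)
    for t, lat, lon, sog in track:
        bins[int(t // step)].append((lat, lon, sog))
    averaged = []
    for k in sorted(bins):
        rows = bins[k]
        mean = tuple(sum(col) / len(rows) for col in zip(*rows))
        averaged.append((k * step, *mean))  # assumption: stamp = window start

    # Phase 2 (upsampling): fill gaps by linear interpolation, Equation (2).
    out = []
    for (t1, *p1), (t2, *p2) in zip(averaged, averaged[1:]):
        out.append((t1, *p1))
        t = t1 + step
        while t < t2:
            w = (t - t1) / (t2 - t1)
            out.append((t, *(a + w * (b - a) for a, b in zip(p1, p2))))
            t += step
    if averaged:
        out.append(averaged[-1])
    return out


# Two fixes in the first window, then a 15-min gap.
track = [(0, 35.0, 120.0, 10.0), (60, 35.2, 120.2, 12.0), (900, 36.0, 121.0, 14.0)]
regular = regularize(track)
```

For this input the first window collapses to a single averaged point, and two interpolated points are generated at 300 s and 600 s.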

It is acknowledged that this temporal regularization approach, while essential for aligning with the STEAM model’s Δt_i input and managing the massive data volume, introduces a necessary trade-off. While the averaging and linear interpolation steps may marginally under-represent emission peaks during sharp maneuvering or rapid acceleration, the 5-min interval is a widely adopted standard in bottom-up emission inventory studies, providing an optimal and necessary balance between trajectory fidelity and computational feasibility for large-scale analysis.

Upon completion of these steps, a clean, complete, and temporally regularized trajectory dataset was produced, ready for the subsequent emission calculation stage.

2.2. Emission Calculation Model (STEAM)

For the bottom-up calculation of ship emissions, this study adopted the Ship Traffic Emission Assessment Model (STEAM), a widely recognized activity-based methodology first developed by Jalkanen et al. [35,36]. This model was selected for its distinct advantages in high-resolution emission inventory studies. Its core characteristic is that it is activity-based: it estimates emissions from a vessel’s real-time operational status, which is ideal for leveraging the rich dynamic data provided by the Automatic Identification System (AIS) [37,38,39]. Furthermore, the STEAM methodology is internationally recognized, extensively validated, and has been adopted in authoritative reports by the IMO, ensuring the credibility of our approach [40,41].

The overall workflow of this model as implemented in this study is illustrated in Figure 2. The figure provides a schematic of the entire process, from the initial data inputs (dynamic AIS data and static ship information) to the conceptual calculation framework—conceptually broken down into Time, Load, and Factor components. It also visualizes the detailed logic for deriving these components, which will be mathematically detailed below.

The conceptual framework shown in Figure 2 is mathematically formalized by the following equation, which calculates the total emission mass (E) for a pollutant species (s) by summing the emissions from each discrete time interval (i):

E = \sum_{i} \left( P_{me} \times LF_i \times EF \times \Delta t_i \right)

(3)

where

E is the total emission for pollutant species s (in grams).

P_me is the maximum continuous rating power of the main engine (in kW).

LF_i is the engine load factor at time step i, calculated based on dynamic AIS data such as vessel speed and draught.

EF is the emission factor for pollutant species s (in g/kW·h), which is dependent on engine type and fuel quality.

Δt_i is the time interval between consecutive AIS signals (in hours).

Among these parameters, P_me is a static value obtained from the ship’s technical database, and Δt_i is a constant (5 min, or 1/12 h) as a result of our data preprocessing. Specifically, the vessel-specific static parameters, including maximum continuous rating power (P_me), main engine type, and fuel characteristics, were integrated by cross-referencing the MMSI from the AIS data with a comprehensive external ship technical database [42]. The most critical variables are the dynamic engine load factor (LF_i) and the context-dependent emission factor (EFs), which are determined based on the detailed logic illustrated in Figure 2. To provide full transparency on this process, Table 1 details the specific rules used to classify a vessel’s operational status from its real-time speed and the corresponding engine load that is consequently assigned [43,44]. Similarly, Table 2 provides examples of the emission factors for key pollutants, which are selected from our compiled library based on the vessel’s engine type and presumed fuel.
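Equation (3) reduces to a product per time step followed by a sum. The sketch below makes that explicit and keeps the per-point masses, since those are what the gridding stage consumes. The function name `point_emissions` is hypothetical, and the numeric values (engine power, load factors, emission factor) are made-up inputs for illustration only; they are not taken from Table 1 or Table 2.

```python
DT_HOURS = 1.0 / 12.0  # constant 5-min time step after preprocessing


def point_emissions(p_me_kw, load_factors, ef_g_per_kwh, dt_hours=DT_HOURS):
    """Per-point emission mass in grams: P_me * LF_i * EF * dt_i.

    p_me_kw:       main-engine maximum continuous rating (kW)
    load_factors:  one load factor LF_i per regularized AIS point
    ef_g_per_kwh:  emission factor for one pollutant species (g/kWh)
    """
    return [p_me_kw * lf * ef_g_per_kwh * dt_hours for lf in load_factors]


# Hypothetical vessel: 10,000 kW engine, three time steps, EF = 12 g/kWh.
masses = point_emissions(10_000.0, [0.8, 0.8, 0.2], 12.0)
total_grams = sum(masses)  # E in Equation (3)
```

With these illustrative inputs the two cruising steps contribute 8000 g each and the low-load step 2000 g, for a total of 18,000 g.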

It is essential to discuss the uncertainties associated with the STEAM application. The EFs (Table 2) are sourced from widely recognized international reports (IMO/TNO) and are assumed to be representative of specific engine types and fuel compliance, yet they are subject to uncertainties arising from vessel-specific engine aging, actual fuel quality variations, and real-time operational deviations [45,46]. Furthermore, the rules for classifying operational status (Table 1) are based on established industry standards and validated against kinematic behavior; ambiguous cases, such as discrepancies between reported status and SOG, were primarily resolved by prioritizing the dynamic SOG data during the preprocessing stage (Section 2.1), ensuring the load factor reflects actual vessel movement. Despite these necessary assumptions, the activity-based approach provides a scientifically robust framework for high-resolution inventory studies.

The application of the STEAM model to each preprocessed AIS point serves as the crucial upstream stage of the methodology. This process transforms the trajectory data into a massive, unstructured point cloud, where each point is individually assigned a precise geographic location and a calculated emission mass. The primary advantage of this per-point calculation method lies in its scientific accuracy and high fidelity. By assessing emissions at such a fine-grained level, the model accurately captures the variations in a vessel’s pollution output corresponding to its real-time activities, such as accelerating, cruising, or maneuvering in port. This preserves the full detail of the original data, ensuring the resulting emission estimates are closely tied to actual ship operations.

However, while this point cloud is rich in detail, it is not directly usable for the primary application of creating an emission inventory map. The necessary downstream stage is to convert this unstructured data into a structured, gridded format suitable for spatial analysis. This task of spatial aggregation—summarizing millions of individual points into their respective grid cells—is a distinct methodological step. Therefore, the detailed point cloud produced upstream serves as the direct input for the downstream gridding algorithm, which is detailed in the following section.

2.3. H-Grid: A High-Throughput Spatial Hashing Algorithm

The preceding section produced a massive point cloud of emission data. The critical next step, a fundamental task in environmental analysis, is to transform these scattered points into a structured spatial grid, as depicted conceptually in Figure 3. Traditional methods for this task, such as geometric traversal or spatial database queries, are often bottlenecked by computationally expensive operations like complex index maintenance or nested-loop searches [47,48]. To circumvent these limitations entirely, the H-Grid algorithm is proposed. Instead of relying on conventional spatial indexing, H-Grid employs a direct mathematical mapping—a simple yet powerful hash-like encoding—to instantly assign any geographical coordinate pair to a unique grid cell ID.

H-Grid therefore requires no auxiliary data structure or spatial query. The algorithm consists of two main steps: coordinate discretization and hash key generation.

Step 1: Coordinate Discretization

First, each AIS point’s longitude and latitude are transformed into integer grid indices based on the defined grid resolution. For a given grid resolution r (in degrees), the grid indices (i_col, i_row) for a point with coordinates (lon, lat) are calculated as:

i_{col} = \left\lfloor \frac{lon - lon_{min}}{r} \right\rfloor, \quad i_{row} = \left\lfloor \frac{lat - lat_{min}}{r} \right\rfloor

(4)

where lon_min and lat_min represent the minimum longitude and latitude of the study area, respectively. This floor operation effectively serves as a direct, constant-time mapping from a continuous coordinate to a discrete grid cell.

Step 2: Hash Key Generation

After obtaining the integer grid indices, a unique hash key for each grid cell was generated. This key serves as a unique identifier for each cell in the hash map, where emission values will be aggregated. The key can be generated by combining the row and column indices using a bitwise operation, which is highly efficient for integer data types:

\text{hash\_key} = (i_{row} \ll s) \mid i_{col}

(5)

where s is the number of bits required to store the maximum column index.

This bit-shift and OR operation ensures that each grid cell has a unique identifier, allowing all emission points falling into the same cell to be aggregated in O(1) average time.

The key advantage of the H-Grid algorithm lies in its exceptional efficiency and scalability. Since the gridding of each data point is an entirely independent operation, the entire process can be easily distributed across multiple CPU cores or a computer cluster. This approach completely bypasses the time-consuming index building and traversal common in traditional methods, making it ideally suited for the rapid, large-scale processing of maritime big data.
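The two steps can be combined into a short sketch. It is a minimal single-process illustration rather than the authors' code: the names `grid_key`, `h_grid`, and `decode` are hypothetical, the bounding box and resolution are the study-area values given in Section 3.1, and points are assumed to have already been filtered to that box during preprocessing. Floating-point division can misplace a coordinate that lies exactly on a cell edge; a production version might work in scaled integers instead.

```python
import math
from collections import defaultdict

LON_MIN, LON_MAX = 119.0, 126.5  # study-area bounds from Section 3.1
LAT_MIN, LAT_MAX = 34.0, 37.5
R = 0.01                         # grid resolution in degrees

N_COLS = round((LON_MAX - LON_MIN) / R)  # 750 columns
S = (N_COLS - 1).bit_length()            # bits needed for the largest column index


def grid_key(lon, lat):
    """Equations (4) and (5): discretize, then pack row and column into one int."""
    i_col = math.floor((lon - LON_MIN) / R)
    i_row = math.floor((lat - LAT_MIN) / R)
    return (i_row << S) | i_col


def h_grid(points):
    """Aggregate (lon, lat, mass) points into a hash map keyed by cell."""
    grid = defaultdict(float)
    for lon, lat, mass in points:
        grid[grid_key(lon, lat)] += mass  # O(1) average per point
    return grid


def decode(key):
    """Recover (i_row, i_col) from a packed key."""
    return key >> S, key & ((1 << S) - 1)


points = [(120.005, 35.005, 1.5), (120.006, 35.004, 2.5), (125.123, 36.456, 4.0)]
grid = h_grid(points)
```

The first two points fall in the same cell (row 100, column 100) and are summed under one key. Because each point is handled independently, the input can be split into chunks, gridded separately, and the partial hash maps merged by adding values with equal keys.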

2.4. Benchmark Methodologies

To objectively evaluate the performance of the proposed H-Grid algorithm, two baseline methods representing common approaches for spatial aggregation were implemented for comparison: a Traditional Geometric Method and a PostGIS-based spatial database method.

2.4.1. Traditional Geometric Method

This method represents a fundamental, brute-force computational approach. The algorithm iterates through each AIS data point in the dataset and, for each point, performs a nested iteration through every cell in the predefined grid matrix. A spatial containment test is then conducted to determine which grid cell contains the point.

Let a data point be P = (lon_p, lat_p) and a grid cell C_ij (at column i and row j) be defined by its minimum corner (lon_min,ij, lat_min,ij) and its resolution r. A point P is considered to be inside cell C_ij if the following condition is met:

\left( lon_{min,ij} \le lon_p < lon_{min,ij} + r \right) \land \left( lat_{min,ij} \le lat_p < lat_{min,ij} + r \right)

(6)

Once the correct cell is found, the point’s emission value is added to that cell’s accumulator, and the inner loop terminates. This method requires no special data structures but has a theoretical worst-case time complexity of O(N × M), where N is the number of data points and M is the total number of grid cells.
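For comparison, the brute-force baseline can be sketched as a nested loop that applies the containment test of Equation (6) to every cell until a match is found. The function name `brute_force_grid` and its argument layout are hypothetical; the benchmark in the paper was run inside a database environment, so this Python version only illustrates the logic and its O(N × M) cost.

```python
def brute_force_grid(points, lon_min, lat_min, n_cols, n_rows, r):
    """Assign (lon, lat, mass) points to cells by testing every cell in turn.

    Returns a row-major accumulator acc[j][i] (row j, column i).
    """
    acc = [[0.0] * n_cols for _ in range(n_rows)]
    for lon_p, lat_p, mass in points:
        found = False
        for j in range(n_rows):
            lat0 = lat_min + j * r
            for i in range(n_cols):
                lon0 = lon_min + i * r
                # Equation (6): half-open containment test
                if lon0 <= lon_p < lon0 + r and lat0 <= lat_p < lat0 + r:
                    acc[j][i] += mass
                    found = True
                    break  # inner loop terminates once the cell is found
            if found:
                break
    return acc


# Tiny 4 x 3 grid with unit cells so the result is easy to check by hand.
acc = brute_force_grid(
    [(0.5, 0.5, 1.0), (0.7, 0.2, 2.0), (3.2, 2.9, 5.0)],
    lon_min=0.0, lat_min=0.0, n_cols=4, n_rows=3, r=1.0,
)
```

A point in the last cell forces a scan of all M cells, which is the worst case referred to in the text.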

2.4.2. Spatial Indexing Method

This method represents a canonical approach for accelerating spatial queries, based on the principle of hierarchical, space-partitioning index structures. Unlike one-dimensional data which can be efficiently indexed by B+-trees, multi-dimensional spatial data requires specialized structures. The R-tree is a foundational data structure in this domain, organizing data into a height-balanced tree of nested Minimum Bounding Rectangles (MBRs). The Generalized Search Tree (GiST) is a more abstract and powerful evolution of this concept, providing a generalized framework for building such indexes.

The search and pruning logic that defines the efficiency of this index structure can be broken down into the following operational steps, beginning from the root of the tree for any given spatial query:

1. Node Examination: The algorithm starts at the root node and examines each of its entries. In an internal (non-leaf) node, each entry is a Minimum Bounding Rectangle (MBR) that spatially encloses all data within the child node it points to.
2. Recursive Pruning and Traversal: For each entry, a spatial overlap test is performed between its MBR and the query region. If there is no overlap, the entire branch of the tree represented by this entry is pruned and completely ignored by the search. This is the key to the algorithm’s efficiency. If there is an overlap, the algorithm recursively descends to the corresponding child node and repeats the process from Step 1.
3. Data Retrieval: This recursive process continues until a leaf node is reached. At the leaf level, the entries are the actual data objects, not bounding boxes. The algorithm then tests each data object against the query region and adds any that overlap to the final result set.

Through this recursive pruning mechanism, the search space is dramatically reduced. This allows the algorithm to achieve an average-case logarithmic time complexity (O(logN)), a significant improvement over the linear complexity (O(N)) of a full data scan.
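The search steps above can be sketched on a small hand-built tree. This is a simplified illustration of MBR pruning only: the `Node` class, `overlaps`, and `search` are hypothetical names, the tree is constructed manually, and insertion, node splitting, and balancing (the parts of a real R-tree or GiST index that incur maintenance cost) are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    mbr: tuple                                     # (xmin, ymin, xmax, ymax)
    children: list = field(default_factory=list)   # child Nodes (internal node)
    points: list = field(default_factory=list)     # (x, y) objects (leaf node)


def overlaps(a, b):
    """True if rectangles a and b intersect (edges included)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(node, query, visited=None):
    """Return all points inside the query rectangle, pruning by MBR."""
    if visited is not None:
        visited.append(node.mbr)  # record which nodes were actually examined
    if not node.children:  # leaf: test the data objects themselves
        return [p for p in node.points
                if query[0] <= p[0] <= query[2] and query[1] <= p[1] <= query[3]]
    hits = []
    for child in node.children:
        if overlaps(child.mbr, query):  # otherwise the whole branch is skipped
            hits.extend(search(child, query, visited))
    return hits


leaf_a = Node(mbr=(0, 0, 1, 1), points=[(0.2, 0.2), (0.8, 0.9)])
leaf_b = Node(mbr=(5, 5, 6, 6), points=[(5.5, 5.5)])
root = Node(mbr=(0, 0, 6, 6), children=[leaf_a, leaf_b])

visited = []
hits = search(root, (0, 0, 0.5, 0.5), visited)
```

Here the query rectangle does not overlap the second leaf's MBR, so that branch is never visited. Note that once a leaf is reached, every object in it is still tested individually.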

This well-established search algorithm is robustly implemented in the PostgreSQL database via its PostGIS extension, using a GiST index. For this study’s benchmark, the gridding and aggregation task, executed via an SQL query, leverages this underlying indexed structure to efficiently group spatially proximal points, thereby providing a highly optimized, industry-standard benchmark for performance comparison.

3. Results

3.1. Experimental Setup and Data

The study focuses on the Yellow Sea region, a major international maritime hub located between mainland China and the Republic of Korea. A map of this study area is presented in Figure 4. The precise geographical scope of our analysis is defined by the bounding box from 119° E to 126.5° E longitude and 34° N to 37.5° N latitude. This area was discretized into a high-resolution grid with a cell size of 0.01° by 0.01° for the emission inventory calculation.

The specific choice of 0.01° × 0.01° resolution was determined by balancing the need for spatial detail with computational feasibility. This grid size is fine enough to resolve key features such as major international shipping lanes and detailed port activities, ensuring a high-fidelity inventory, while keeping the total number of grid cells (M) manageable for high-throughput processing and avoiding unnecessary computational overhead.
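As a worked check of the grid size, the number of cells M follows directly from the bounding box and resolution stated above, assuming the box is tiled exactly by half-open 0.01° cells:

```latex
N_{col} = \frac{126.5 - 119}{0.01} = 750, \qquad
N_{row} = \frac{37.5 - 34}{0.01} = 350, \qquad
M = N_{col} \times N_{row} = 262{,}500
```

The largest column index is therefore 749, which fits in 10 bits, so s = 10 in Equation (5) for this study area.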

The dataset used for this research consists of raw AIS records collected throughout the year 2024. These records include two primary message types: dynamic positional reports and static voyage-related data. To illustrate the structure of the source data, Table 3 presents a selection of key fields that are essential for the subsequent emission calculation and gridding processes. This table shows a simulated 2024 data point based on the typical structure of the dataset. Some values are decoded for clarity (e.g., SOG, COG).

This massive data volume, comprising over one billion records, provides a realistic and challenging testbed for evaluating high-throughput data processing algorithms in a real-world scenario.

All experiments were conducted on a high-performance workstation equipped with an Intel Core i9-13900K CPU, 64 GB of RAM (Beijing, China), and a high-speed PCIe 4.0 NVMe SSD (Kingston NV3) (Beijing, China). Our proposed H-Grid algorithm was implemented in Python 3.10. To ensure a controlled and reproducible benchmark, the comparison methods (Traditional Geometric and Spatial Indexing) were executed within a PostgreSQL 15 database environment with the PostGIS 3.3 extension. This setup allowed for a direct and fair performance comparison of our algorithm against a mainstream, highly optimized solution for handling spatial data.

3.2. Data Transformation and Gridding Results

The methodology successfully transformed the raw AIS data into a high-resolution, gridded ship emission inventory. To illustrate the step-by-step data transformation process at a granular level, Table 4 tracks several example AIS records through our entire data pipeline. The table showcases the evolution from the key fields in the raw AIS message, to the calculated instantaneous emissions after applying the STEAM model, and finally to the discrete grid indices and unique hash key assigned by the H-Grid algorithm.

After aggregating all processed data points, the final gridded emission inventories were produced. Figure 5 displays the spatial distribution of the annualized total emissions for several key pollutants (SO₂, NO_x, and PM_2.5). The maps clearly reveal that emissions are not uniformly distributed across the sea but are highly concentrated along major international shipping lanes and in the approaches to busy port areas. Prominent hotspots are evident around major coastal cities, particularly near Qingdao, Rizhao, and Lianyungang in China, as well as the region connecting Incheon to the southwest coast of the Republic of Korea. These patterns directly correspond to the expected high-density maritime traffic in the region.

The visual evidence from these results powerfully illustrates the core computational challenge addressed by this study. The observed spatial patterns confirm a highly non-uniform data distribution, a finding supported by the statistic that over 90% of the emission data points are concentrated within less than 10% of the total grid cells. This extreme data sparsity and clustering create significant performance bottlenecks for traditional aggregation algorithms that rely on complex indexing or traversal. This characteristic of maritime AIS data is precisely what highlights the necessity for an efficient, direct-mapping gridding solution like the H-Grid algorithm.

While Figure 5 visually illustrates the spatial concentration of emissions, the comprehensive results of the high-resolution ship air pollution inventory are numerically presented in Table 5, which summarizes the annualized total emissions by key sub-region. The data provides quantitative confirmation of the spatial patterns observed: the Yellow Sea Main Channel accounts for a dominant 68.5% of the total NO_x emissions in the study area, underscoring the necessity of international cooperation in managing trans-boundary pollution. Furthermore, the port areas of Qingdao and Incheon, as primary regional logistics hubs, collectively contribute over 23% of the total NO_x pollution, emerging as significant localized hotspots. This quantitative summary directly validates the efficiency of the H-Grid algorithm in producing a high-fidelity gridded emission inventory suitable for detailed regulatory analysis.

4. Discussion

4.1. The Computational Challenge of Skewed Spatial Data

The research reveals that maritime traffic data exhibits a highly non-uniform spatial distribution, with ship activities concentrating heavily along major shipping lanes and within port areas. This characteristic, while intuitive, is the primary source of computational bottlenecks for traditional gridding methodologies. To quantify this phenomenon, a density histogram was generated from the gridded emission inventory, as illustrated in Figure 6.

The histogram confirms a pronounced right-skewed distribution. A statistical summary of the gridded data further substantiates this finding: the mean value of ship activity (0.0157) is orders of magnitude larger than the median (0.0001), a classic indicator of a distribution dominated by a small number of high-value outliers. Furthermore, the large standard deviation (0.2675) relative to the mean underscores the extreme variance and data concentration within a few grid cells. These statistics provide quantitative evidence that emission data is not uniformly distributed.

This extreme spatial clustering is the root cause of performance degradation in conventional spatial algorithms. While spatial indexes such as R-trees (commonly used in databases like PostGIS) are efficient at pruning empty space, they lose their efficacy when tens of thousands or even millions of data points fall within the same minimal index polygon or “leaf node” [49,50]. In these high-density “hotspot grids”, the task degenerates from an efficient indexed search into a brute-force, linear scan over a massive number of points within that single cell. This effectively nullifies the benefits of the index, leading to significant computational overhead and unpredictable processing times.

Therefore, the central test for any high-throughput gridding algorithm is not its performance on uniformly distributed data, but its resilience and efficiency when handling these high-density hotspots. To rigorously evaluate the H-Grid algorithm’s capabilities in this regard, the subsequent sections will present a detailed performance analysis, moving from a micro-level examination of a single hotspot grid to a macro-level assessment across the entire dataset.

4.2. Micro-Level Performance Analysis in High-Density Grids

Following the identification of data concentration as the key computational challenge, a micro-level benchmark was designed to isolate and quantify the performance of each gridding method under increasing data density. In this controlled experiment, the time required to process a varying number of data points, all falling within the boundaries of a single grid cell, was measured, simulating a “worst-case” scenario for a hotspot grid. Table 6 summarizes the results, reporting the processing time in milliseconds (ms) for assigning a specified number of points to a single grid cell with each of the three methods.

For low data concentrations (e.g., 100–1000 points), the performance differences between the methods are relatively small. This is attributable to initial system overheads, such as function call latency and database connection time, which constitute a significant portion of the total execution time. However, as the number of points per cell rises from 10,000 to 100,000 and beyond, a dramatic performance divergence emerges.

The Traditional Geometric Method: This approach, which relies on a point-in-polygon test for each data point, shows a steep increase in processing time, from 450 ms at 10,000 points to 25,000 ms at one million points. Its execution time grows with the number of points n, consistent with O(n) complexity and a high per-point cost, which renders it exceptionally inefficient for the hotspot grids identified in our dataset.
The PostGIS Method: The database method demonstrates better scalability initially due to its use of spatial indexing. However, as discussed in Section 4.1, the index’s utility diminishes once the target grid cell is located. The database must still process all candidate points within that cell, leading to a significant, albeit sub-linear, increase in query time. The jump in query time at 100,000 points and beyond reflects the inherent overhead of the database engine’s query planner, execution context, and memory management when handling a large result set within a single query operation.
The H-Grid Method: In stark contrast, the H-Grid algorithm maintains exceptional performance, with processing time rising only from 120 ms to 620 ms as the point count grows from 100 to one million. This is because its core operation, transforming a coordinate pair (lon, lat) into a grid ID via simple arithmetic and bitwise operations, has a constant time complexity of O(1) per point. The measured increase is primarily due to the incidental overhead of iterating through the larger dataset in memory, not the gridding logic itself. The results demonstrate that H-Grid’s per-point cost is decoupled from the spatial density of the data.
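The constant-time mapping described for H-Grid can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the 0.01° cell size is inferred from the `grid_id` examples in Table 4 (for instance, 122.1512, 35.8225 maps to `12215_3582`), and the function names and the 16-bit packing layout are assumptions made for the example.

```python
import math

# ASSUMPTION: 100 cells per degree (0.01-degree cells), inferred from Table 4.
CELLS_PER_DEGREE = 100


def grid_label(lon, lat):
    """Map a coordinate to a Table-4-style label: one multiply and floor per axis."""
    col = math.floor(lon * CELLS_PER_DEGREE)
    row = math.floor(lat * CELLS_PER_DEGREE)
    return f"{col}_{row}"


def packed_key(lon, lat):
    """Pack the cell indices into one integer hash key with a bit shift.

    ASSUMPTION: coordinates are offset to be non-negative, and the row index
    fits in 16 bits (at most 18,000 rows at this resolution).
    """
    col = math.floor((lon + 180.0) * CELLS_PER_DEGREE)
    row = math.floor((lat + 90.0) * CELLS_PER_DEGREE)
    return (col << 16) | row


def unpack_key(key):
    """Recover the (col, row) pair from a packed key."""
    return key >> 16, key & 0xFFFF


print(grid_label(122.1512, 35.8225))
print(unpack_key(packed_key(122.1512, 35.8225)))
```

Neither function inspects other points or any index structure, so the work per record does not depend on how many records share the same cell. A production version would also need to decide how to treat coordinates that lie exactly on a cell boundary, where floating-point rounding can shift a point into the neighboring cell.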

This micro-level analysis confirms that H-Grid is well suited to the extreme data concentration characteristic of real-world maritime traffic. The critical question that follows is whether this efficiency advantage at the single-cell level translates into a similar performance gain when processing the entire 100-million-record dataset at a macro scale.

4.3. Macro-Benchmark: End-to-End Scaling at 100 Million Records

To validate the scalability of the approach, an end-to-end benchmark was conducted on a dataset of 100 million AIS records. The experiment was performed under identical conditions for all methods, with data pre-loaded into memory to isolate computational performance. As shown in Figure 7, the results confirm a significant performance advantage for H-Grid.

The H-Grid method completed the entire task in 32 s. In comparison, the PostGIS spatial query approach required 128 s (approximately 4 times slower), and the traditional geometric baseline took 320 s (approximately 10 times slower). All reported runtime metrics, including these macro-benchmark results and the micro-benchmark data in Table 6, represent the average of five independent runs conducted under identical conditions. The observed variance across repeated runs was consistently low (e.g., less than 3% for the H-Grid method), supporting the stability and reliability of the reported performance results. These findings are consistent with the micro-benchmarks in Section 4.2: by minimizing per-point computational work and avoiding the overhead of spatial index construction and maintenance, H-Grid achieves stable, high-throughput performance at scale. The near-linear scaling of H-Grid reflects its single-pass arithmetic mapping and efficient hash-based accumulation, whereas the database and geometric methods are encumbered by query planning overhead and less cache-friendly memory access patterns.
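For reference, the speed-up factors and the implied throughput can be worked out directly from the reported runtimes. The throughput figures below are derived here from the 100-million-record dataset size and the stated times; they are not separately reported measurements.

```latex
\begin{align*}
\text{Speed-up over PostGIS} &= \frac{128\ \text{s}}{32\ \text{s}} = 4.0 \\
\text{Speed-up over geometric baseline} &= \frac{320\ \text{s}}{32\ \text{s}} = 10.0 \\
\text{H-Grid throughput} &= \frac{10^{8}\ \text{records}}{32\ \text{s}} \approx 3.1 \times 10^{6}\ \text{records/s} \\
\text{PostGIS throughput} &= \frac{10^{8}\ \text{records}}{128\ \text{s}} \approx 7.8 \times 10^{5}\ \text{records/s} \\
\text{Geometric throughput} &= \frac{10^{8}\ \text{records}}{320\ \text{s}} \approx 3.1 \times 10^{5}\ \text{records/s}
\end{align*}
```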

4.4. Micro-Mechanism Validation: Skew Robustness and Local Aggregation

To explain why H-Grid is insensitive to the spatial hotspots that degrade the performance of the baseline methods, two compact, controlled experiments were designed. These tests validate the core mechanisms—data partitioning and local aggregation—that allow H-Grid to handle the “few lanes dominate” characteristic of real-world maritime traffic.

Experiment A: Skew Robustness

This experiment tests how performance changes as the spatial distribution of data becomes more concentrated, a common characteristic of real-world maritime traffic. To quantify this effect, three 100-million-point datasets were generated with varying levels of spatial skew, and the total processing time of each method was measured (Figure 8). The skew level is defined by the percentage of data points concentrated within a small fraction of grid cells.

The results in Table 7 demonstrate H-Grid’s robustness. Its runtime remains virtually constant (31.5–32 s) regardless of the data’s spatial concentration. In contrast, the performance of both PostGIS and the traditional method degrades significantly as skew intensifies, from 55 s to 128 s for PostGIS and from 150.2 s to 320 s for the traditional method. This is due to their reliance on spatial partitioning, which leads to severe load imbalance in the presence of spatial hotspots. In such scenarios, a few threads are overwhelmed by the highly concentrated data in hotspot cells, creating a bottleneck for the entire system. H-Grid’s data partitioning strategy, by contrast, keeps the load balanced across threads, making its performance independent of spatial distribution. This controlled experiment validates the skew robustness of our method and explains why H-Grid is well suited to the “few lanes dominate” characteristic of real-world maritime traffic, a property that was observed in the AIS data and replicated in this experiment with specific levels of data concentration.
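The load-imbalance argument can be made concrete with a small simulation. The sketch below generates cell indices at the three concentration levels listed in Table 7 and compares the busiest worker's share under two hypothetical schemes: assigning contiguous blocks of cells to workers, and splitting the record stream evenly. The dataset size, cell count, worker count, and all function names are assumptions chosen for illustration; the code does not model PostGIS internals or the authors' benchmark.

```python
import random
from collections import Counter


def skewed_cells(n_points, n_cells, hot_share, hot_fraction, seed=0):
    """Draw cell indices so that hot_share of the points land in the first
    hot_fraction of cells (the hotspot cells)."""
    rng = random.Random(seed)
    n_hot = max(1, int(n_cells * hot_fraction))
    cells = []
    for _ in range(n_points):
        if rng.random() < hot_share:
            cells.append(rng.randrange(n_hot))
        else:
            cells.append(rng.randrange(n_hot, n_cells))
    return cells


def imbalance(loads):
    """Busiest worker's load divided by the mean load (1.0 means balanced)."""
    return max(loads) / (sum(loads) / len(loads))


def spatial_partition_loads(cells, n_cells, n_workers):
    """Each worker owns a contiguous block of cells."""
    block = -(-n_cells // n_workers)  # ceiling division
    counts = Counter(c // block for c in cells)
    return [counts.get(w, 0) for w in range(n_workers)]


def data_partition_loads(cells, n_workers):
    """Each worker receives an equal-sized slice of the record stream."""
    base, extra = divmod(len(cells), n_workers)
    return [base + (1 if w < extra else 0) for w in range(n_workers)]


# Concentration levels taken from Table 7: (share of points, fraction of cells).
scenarios = {
    "Uniform": (0.10, 0.10),
    "Moderately Skewed": (0.90, 0.01),
    "Highly Skewed": (0.99, 0.001),
}
results = {}
for name, (hot_share, hot_fraction) in scenarios.items():
    cells = skewed_cells(200_000, 10_000, hot_share, hot_fraction)
    results[name] = (
        imbalance(spatial_partition_loads(cells, 10_000, 8)),
        imbalance(data_partition_loads(cells, 8)),
    )
    print(f"{name}: spatial={results[name][0]:.2f}, by-record={results[name][1]:.2f}")
```

With eight workers, an imbalance close to 8 means that one worker is handling almost all of the records while the others idle, which is the bottleneck described above. Splitting by record keeps the ratio at 1.0 regardless of where the points fall.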

Experiment B: Efficacy of Local Aggregation

This experiment aims to validate the advantages of H-Grid’s two-stage aggregation strategy, which is designed to mitigate contention on “hot keys” (grid cells with many data points). To isolate this benefit, the total runtime was broken down into the local aggregation phase and the global merge phase.

The data in Table 8 illustrate the critical role of H-Grid’s two-stage aggregation strategy. By pre-aggregating results within each thread’s private memory, the method completes over 95% of the aggregation work (30.7 s of 32 s) in parallel, avoiding cross-thread synchronization overhead. The final global merge operation is a minor expense, accounting for only about 4% of the total runtime (1.3 s). This breakdown confirms that H-Grid’s MapReduce-like architecture is essential for its performance on highly concentrated data, as it allows for highly efficient parallel processing with a negligible sequential bottleneck.
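The two-stage pattern can be sketched as follows. This is a structural illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sample records reuse the `grid_id` and `no_x_total` values from Table 4, and Python threads are used only to show the shape of the computation (in CPython they do not provide true CPU parallelism because of the global interpreter lock).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def local_aggregate(chunk):
    """Stage 1: sum emissions per grid cell in a worker-private dict (no locks)."""
    partial = defaultdict(float)
    for cell, emission in chunk:
        partial[cell] += emission
    return partial


def global_merge(partials):
    """Stage 2: fold the per-worker dicts into a single result."""
    total = defaultdict(float)
    for partial in partials:
        for cell, emission in partial.items():
            total[cell] += emission
    return dict(total)


def two_stage_aggregate(records, n_workers=4):
    """Split records into equal chunks, aggregate each locally, then merge."""
    size = max(1, -(-len(records) // n_workers))  # ceiling division, at least 1
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(local_aggregate, chunks))
    return global_merge(partials)


# (grid_id, no_x_total) pairs taken from the four example rows in Table 4.
records = [
    ("12215_3582", 1.95),
    ("12217_3584", 1.97),
    ("12215_3582", 1.65),
    ("12217_3584", 1.66),
]
gridded = two_stage_aggregate(records, n_workers=2)
for cell, total in sorted(gridded.items()):
    print(f"{cell}: {total:.2f}")
```

The merge step touches one entry per distinct cell per worker rather than one entry per record, which is why its cost stays small even when a few hot cells receive most of the data.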

In synthesis, the analyses in this section demonstrate that H-Grid’s superior performance is not merely an incremental improvement but the result of a fundamentally different architectural approach. The combination of a computationally trivial O(1) hashing function, a load-balanced data partitioning strategy that is robust to spatial skew, and an efficient two-stage aggregation model collectively overcomes the bottlenecks that affect traditional spatial methods. The resulting gains in computational efficiency are substantial, enabling the generation of higher-resolution emission inventories at greater frequencies and paving the way for near real-time environmental monitoring. Having validated the algorithm’s performance and its underlying mechanisms, the discussion now turns to the environmental and policy implications of the resulting inventory.

4.5. Environmental Interpretation and Policy Relevance of Emission Hotspots

The highly non-uniform spatial distribution of emissions, visually represented in Figure 5 and quantitatively summarized in Table 5, carries significant environmental and policy implications. The distinct emission hotspots along the Yellow Sea Main Channel underscore the trans-boundary nature of maritime pollution, which requires international cooperation on mitigation strategies. Furthermore, the intense clustering of emissions within the Qingdao and Incheon port regions highlights the acute environmental burden placed on these major coastal cities. The ability of the H-Grid algorithm to rapidly generate this high-resolution data is critical for policy implementation, enabling regulators to precisely define Emission Control Area (ECA) boundaries and to evaluate the effectiveness of policy interventions (e.g., fuel sulfur limits) at the granular level necessary for environmental protection and sustainable maritime development.

5. Conclusions

This study proposed the H-Grid algorithm, an efficient method based on spatial hashing for processing large-scale AIS data. Combined with the STEAM model, a high-resolution ship air pollution emission inventory for 2024 was calculated using the Yellow Sea between China and the Republic of Korea as a case study. Our performance benchmarks demonstrated that the H-Grid algorithm provides a significant speed-up, outperforming traditional geometric methods by approximately 10 times and mainstream PostGIS spatial queries by approximately 4 times on a 100-million-record dataset. This performance gain is attributed to its simple design, which effectively handles the highly skewed spatial distribution of real-world AIS data, a known challenge for conventional spatial indexing approaches. This efficiency makes it practical to generate large-scale, high-resolution emission inventories on standard hardware.

The H-Grid algorithm can serve as a useful tool for researchers and environmental agencies, offering a scalable solution for monitoring shipping emissions. The approach is applicable to other regions with heavy maritime traffic, such as the Baltic Sea or Mediterranean Sea, and can be used to more rapidly evaluate the effects of emission reduction policies. For future work, the algorithm’s speed suggests it could be a valuable component in more complex analytical pipelines. For instance, it could be combined with machine learning techniques for emission prediction or used for the timely detection of traffic anomalies. Such applications could further support the sustainable management of global shipping.

Despite the substantial performance gains demonstrated in the spatial aggregation phase, a key limitation of the current work lies in the architectural separation between the upstream emission calculation process and the downstream H-Grid gridding process. While the H-Grid component is highly parallel and optimized, the initial STEAM calculation remains structurally independent. Future research will focus on developing a unified, end-to-end parallel computing framework that tightly integrates the instantaneous emission calculation logic with the H-Grid’s aggregation mechanism. Such integration is essential for minimizing intermediate data transfer overhead and achieving maximal computational efficiency across the entire AIS-to-inventory data pipeline.

Author Contributions

Conceptualization, C.L.; methodology, C.L. and R.C.; software and validation, S.S.; investigation, Z.L.; resources, Q.X. and X.X.; data curation, Z.W.; writing—original draft preparation, C.L.; writing—review and editing, C.L. and R.C.; visualization and supervision, C.L.; funding acquisition, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of Ministry of Science and Technology of the People’s Republic of China (Grant 2024YFC3712302), and the Fundamental Research Funds of Ministry of Finance (Grant WTI 62407 and 62510).

Data Availability Statement

The data presented in this study are available from the corresponding author upon request.

Acknowledgments

We would like to thank the Technical Innovation Team for Waterway Traffic Pollution Prevention and Control and Major Accident Risk Prevention and Control for their technical support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Asariotis, R.; Assaf, M.; Benamara, H.; Hoffmann, J.; Premti, A.; Rodríguez, L.; Weller, M.; Youssef, F. Review of Maritime Transport 2018; UNCTAD: Geneva, Switzerland, 2018.
2. Mao, X.L.; Meng, Z.H.; Comer, B.; Decker, T. Greenhouse Gas Emissions and Air Pollution from Global Shipping, 2016–2023. The International Council on Clean Transportation. 2025. Available online: https://globalmaritimehub.com/report-presentation/greenhouse-gas-emissions-and-air-pollution-from-global-shipping-2016-2023 (accessed on 4 November 2025).
3. Feng, X.; Ma, Y.P.; Lin, H.P.; Fu, T.M.; Zhang, Y.; Wang, X.L.; Zhang, A.X.; Yuan, Y.P.; Han, Z.M.; Mao, J.B.; et al. Impacts of Ship Emissions on Air Quality in Southern China: Opportunistic Insights from the Abrupt Emission Changes in Early 2020. Environ. Sci. Technol. 2023, 57, 16999–17010.
4. Saxe, H.; Larsen, T. Air pollution from ships in three Danish ports. Atmos. Environ. 2004, 38, 4057–4067.
5. Song, S. Ship emissions inventory, social cost and eco-efficiency in Shanghai Yangshan port. Atmos. Environ. 2014, 82, 288–297.
6. Smyth, T.; Deakin, A.; Pewter, J.; Snee, D.; Proud, R.; Verbeek, R.; Verhagen, V.; Paschinger, P.; Bell, T.; Fishwick, J.; et al. Faster, Better, Cheaper: Solutions to the Atmospheric Shipping Emission Compliance and Attribution Conundrum. Atmosphere 2023, 14, 500.
7. Schwarzkopf, D.A.; Petrik, R.; Matthias, V.; Quante, M.; Yu, G.Y.; Zhang, Y. Comparison of the Impact of Ship Emissions in Northern Europe and Eastern China. Atmosphere 2022, 13, 894.
8. Merico, E.; Cesari, D.; Gregoris, E.; Gambaro, A.; Cordella, M.; Contini, D. Shipping and air quality in Italian port cities: State-of-the-art analysis of available results of estimated impacts. Atmosphere 2021, 12, 536.
9. Dong, X.Y.; Zhang, Y.; Yu, G.Y.; Xiong, Y.Q.; Han, Z.M.; Huo, J.T.; Huang, C.; Kan, H.D.; Zheng, M.; Ning, Z.; et al. Environmental and health impacts of reduced PM_2.5 and trace metals from ship emissions under low-sulfur fuel oil policy in Shanghai, China. Environ. Pollut. 2025, 377, 126409.
10. IMO MARPOL Annex VI. Regulations for the prevention of air pollution from ships. Resolut. MEPC 2005, 176, 58.
11. Zhai, J.H.; Yu, G.Y.; Zhang, J.Y.; Shi, S.; Yuan, Y.P.; Jiang, S.L.; Xing, C.B.; Cai, B.H.; Zeng, Y.L.; Wang, Y.X.; et al. Impact of Ship Emissions on Air Quality in the Greater Bay Area in China under the Latest Global Marine Fuel Regulation. Environ. Sci. Technol. 2023, 57, 12341–12350.
12. Kim, Y.; Moon, N.; Chung, Y.; Seo, J. Impact of IMO Sulfur Regulations on Air Quality in Busan, Republic of Korea. Atmosphere 2022, 13, 1631.
13. Wang, Z.; Ma, Q.C.; Zhang, Z.D.; Li, Z.C.; Qin, C.H.; Chen, J.F.; Peng, C.S. A Study on Monitoring and Supervision of Ship Nitrogen-Oxide Emissions and Fuel-Sulfur-Content Compliance. Atmosphere 2023, 14, 175.
14. Mocerino, L.; Murena, F.; Quaranta, F.; Toscano, D. Port Emissions Assessment: Integrating Emission Measurements and AIS Data for Comprehensive Analysis. Atmosphere 2024, 15, 446.
15. Han, J.; Peng, D.; Wang, Y.J.; Fu, M.L. Comparison of Inland Ship Emission Results from a Real-World Test and an AIS-Based Model. Atmosphere 2021, 12, 1611.
16. Liu, Z.M.; Lu, X.H.; Feng, J.L.; Fan, Q.Z.; Zhang, Y.; Yang, X. Influence of Ship Emissions on Urban Air Quality: A Comprehensive Study Using Highly Time-Resolved Online Measurements and Numerical Simulation in Shanghai. Environ. Sci. Technol. 2017, 51, 202–211.
17. Marti, S.; Vendela, S.; Axel, H.; Holm, H.; Finnsgård, C. AIS in maritime research. Mar. Policy 2019, 106, 103520.
18. Kim, H.Y.; Bui, H.D.; Hong, S.S. Estimation of Air Pollution from Ships in Port Area: A Case Study of Yeosu and Gwangyang Ports in Korea. Atmosphere 2022, 13, 1890.
19. Zhu, D.Y.; Huang, M.; Lin, Q.F.; Wang, Y.Y.; Li, S.; Cheng, C.Q. Efficient management of ubiquitous location information using geospatial grid region name. Int. J. Appl. Earth Obs. Geoinf. 2025, 137, 104400.
20. Ilba, M. Parallel algorithm for improving the performance of spatial queries in SQL: The use cases of SQLite/SpatiaLite and PostgreSQL/PostGIS databases. Comput. Geosci. 2021, 155, 104840.
21. Khan, J.; Kakosimos, K.; Raaschou-Nielsen, O. Development and performance evaluation of new AirGIS—A GIS based air pollution and human exposure modelling system. Atmos. Environ. 2019, 198, 102–121.
22. Markakis, K.; Poupkou, A.; Melas, D.; Zerefos, C. A GIS based anthropogenic PM10 emission inventory for Greece. Atmos. Pollut. Res. 2010, 1, 71–81.
23. Yang, X.Y.; Guan, X.F.; Pang, Z.X.; Kui, X.; Wu, H.Y. GridMesa: A NoSQL-based big spatial data management system with an adaptive grid approximation model. Future Gener. Comput. Syst. 2024, 155, 324–339.
24. Goodchild, M.F.; Hill, L.L. Introduction to digital gazetteer research. Int. J. Geogr. Inf. Sci. 2008, 22, 1039–1044.
25. Mahdavi-Amiri, A.; Alderson, T.; Samavati, F. A survey of digital earth. Comput. Graph 2015, 53, 95–117.
26. Masaoud, E.; Stryhn, H. A simulation study to assess the impact missing values on the performance of different statistical methods for analysis of binary repeated measures data with an additional hierarchical structure. J. Stat. Res. 2023, 57, 35–67.
27. Chen, R.C.; Liu, C.; Xue, Q.Q.; Rui, R. Research on Fine Ship Sewage Generation Inventory Based on AIS Data and Application on Yangtze River. Water 2022, 14, 3109.
28. Li, X.; Zhang, Z.; Ma, B.; Zheng, D.Y.; Yang, W.T.; Yan, Y.W. High spatio-temporal resolution estimation of urban road traffic carbon dioxide emissions and analysis of influencing factors using GPS trajectory data. Environ. Monit. Assess. 2025, 197, 665.
29. Maasakkers, J.D.; McDuffie, E.E.; Sulprizio, M.P.; Chen, C.; Schultz, M.; Brunelle, L.; Thrush, R.; Steller, J.; Sherry, C.; Daniel, J.; et al. A Gridded Inventory of Annual 2012–2018 U.S. Anthropogenic Methane Emissions. Environ. Sci. Technol. 2023, 57, 16276–16288.
30. Yu, K.A.; Li, M.; Harkins, C.; He, J.; Zhu, Q.D.; Verreyken, B.; Schwantes, R.H.; Cohen, R.C.; McDonald, B.C.; Harley, R.A. Improved Spatial Resolution in Modeling of Nitrogen Oxide Concentrations in the Los Angeles Basin. Environ. Sci. Technol. 2023, 57, 20689–20698.
31. Chen, Y.; Lu, Z.; Zheng, Y.; Li, P.; Luo, W.; Kang, S. Deep hashing with mutual information: A comprehensive strategy for image retrieval. Expert Syst. Appl. 2025, 264, 125880.
32. Zamora, J.; Mendoza, M.; Allende, H. Hashing-based clustering in high dimensional data. Expert Syst. Appl. 2016, 62, 202–211.
33. Xu, D.H.; Deng, Y.W.; Xin, P.; Zhou, X.Q. Path planning for large ships in inland waterways considering risk assessment of AIS data. Ocean Eng. 2025, 342, 122792.
34. Sun, Z.C.; Xu, S.D.; Jiang, J. Spatial-temporal characteristics of ship carbon emission based on AIS data. Ocean Coast. Manag. 2025, 265, 107629.
35. Jalkanen, J.P.; Brink, A.; Kalli, J.; Pettersson, H.; Kukkonen, J.; Stipa, T. A modelling system for the exhaust emissions of marine traffic and its application in the Baltic Sea area. Atmos. Chem. Phys. 2009, 9, 9209–9223.
36. Jalkanen, J.P.; Johansson, L.; Kukkonen, J.; Brink, A.; Kalli, J.; Stipa, T. Extension of an assessment model of ship traffic exhaust emissions for particulate matter and carbon monoxide. Atmos. Chem. Phys. 2012, 12, 2641–2659.
37. Shu, Y.Q.; Hu, A.Y.; Zheng, Y.Z.; Gan, L.X.; Xiao, G.N.; Zhou, C.H.; Song, L. Evaluation of ship emission intensity and the inaccuracy of exhaust emission estimation model. Ocean Eng. 2023, 287, 115723.
38. Weng, J.; Shi, K.; Gan, X.; Li, G.R.; Huang, Z. Ship emission estimation with high spatial-temporal resolution in the Yangtze River estuary using AIS data. J. Clean. Prod. 2020, 248, 119297.
39. Nunes, R.A.O.; Alvim-Ferraz, M.C.M.; Sousa, S.I.V. Assessment of shipping emissions on four ports of Portugal. Environ. Pollut. 2017, 231 Pt 2, 1370–1379.
40. Vutukuru, S.; Dabdub, D. Modeling the effects of ship emissions on coastal air quality: A case study of southern California. Atmos. Environ. 2008, 42, 3751–3764.
41. Wan, Z.; Ji, S.; Liu, Y.; Zhang, Q.; Chen, J.H.; Wang, Q. Shipping emission inventories in China’s Bohai bay, Yangtze River Delta, and Pearl River Delta in 2018. Mar. Pollut. Bull. 2020, 151, 110882.
42. Topic, T.; Murphy, A.J.; Pazouki, K.; Norman, R. Assessment of ship emissions in coastal waters using spatial projections of ship tracks, ship voyage and engine specification data. Clean. Eng. Technol. 2021, 2, 100089.
43. Chen, D.; Zhao, Y.; Nelson, P.; Li, Y.; Wang, X.T.; Zhou, Y.; Lang, J.L.; Guo, X.R. Estimating ship emissions based on AIS data for port of Tianjin, China. Atmos. Environ. 2016, 145, 10–18.
44. Zhou, M.; Jiang, W.; Gao, W.; Gao, X.M.; Ma, M.C.; Ma, X. Anthropogenic emission inventory of multiple air pollutants and their spatiotemporal variations in 2017 for the Shandong Province, China. Environ. Pollut. 2021, 288, 117666.
45. Zhao, J.; Zhang, Y.; Patton, A.P.; Ma, W.C.; Kan, H.D.; Wu, L.B.; Fung, F.; Wang, S.X.; Ding, D.; Walker, K. Projection of ship emissions and their impact on air quality in 2030 in Yangtze River delta, China. Environ. Pollut. 2020, 263 Pt A, 114643.
46. Nguyen, P.N.; Woo, S.-H.; Kim, H. Ship emissions in hotelling phase and loading/unloading in Southeast Asia ports. Transp. Res. Part D Transp. Environ. 2022, 105, 103223.
47. Xia, H.L.; Xue, F. A Novel Query Method for Spatial Database Based on Improved K-Nearest Neighbor Algorithm. Int. J. Decis. Support Syst. Technol. 2024, 16, 1–15.
48. Islam, M.S.; Shen, B.J.; Wang, C.; Taniar, D.; Wang, J.H. Efficient processing of reverse nearest neighborhood queries in spatial databases. Inf. Syst. 2020, 92, 101530.
49. Liu, M.M.; Zhang, M.K. Role of kmax mapping image annotation method combined with R-tree index in civil engineering supervision. Syst. Soft Comput. 2025, 7, 200319.
50. Yang, Y.; Bai, P.; Ge, N.; Gao, Z.P.; Qiu, X.S. LAZY R-tree: The R-tree with lazy splitting algorithm. J. Inf. Sci. 2020, 46, 243–257.

Figure 1. The workflow for calculating the gridded ship emission inventory.

Figure 2. Workflow of the STEAM-based emission calculation.

Figure 3. Visualization of the H-Grid transformation from scattered points to a regular grid.

Figure 4. The study area, covering the Yellow Sea and its major port regions.

Figure 5. Gridded annual emission inventories for major pollutants in the Yellow Sea (2024).

Figure 6. Histogram of Ship Activity Density per Grid Cell.

Figure 7. Macro-Benchmark Performance Comparison of H-Grid, PostGIS, and Geometric Baseline.

Figure 8. Skew Robustness Analysis of Spatial Aggregation Methods.

Table 1. Classification Rules for Ship Operational Status and Corresponding Engine Load Factor (LF).

Ship Status	Condition	Assumed Load Factor (LF) (%)	Typical Activity
Berthing	Speed < 1 knot	−2–5	At dock, loading/unloading, hoteling
Mooring	1 knot ≤ Speed ≤ 3 knots	5–15	At anchorage, slow movement in designated areas
Port Maneuvering	Speed > 3 knots AND LF < 20%	<20	Entering/leaving port, navigating channels, docking
Low-Speed Navigation	20% ≤ LF < 65%	20–65	Coastal shipping, slow steaming, navigating congested waters
Cruise	LF ≥ 65%	≥65	Open sea transit at or near service speed

Note: The classification is based on real-time AIS data. The Load Factor (LF) for maneuvering, low-speed, and cruise modes is dynamically calculated from the vessel’s speed relative to its maximum design speed and then categorized.
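The rules in Table 1 can be expressed as a short decision function. The sketch below is illustrative only: the table states that the load factor is derived from the vessel's speed relative to its maximum design speed but does not give the formula, so the cubic propeller-law relation used here is an assumption, as are the function names and the 20-knot design speed in the usage lines.

```python
def load_factor_percent(speed_knots, design_speed_knots):
    """Estimate the engine load factor in percent.

    ASSUMPTION: cubic propeller law, LF = (v / v_max)^3, capped at 100%.
    """
    ratio = speed_knots / design_speed_knots
    return min(100.0, 100.0 * ratio ** 3)


def ship_status(speed_knots, design_speed_knots):
    """Classify operational status following the thresholds in Table 1."""
    if speed_knots < 1.0:
        return "Berthing"
    if speed_knots <= 3.0:
        return "Mooring"
    lf = load_factor_percent(speed_knots, design_speed_knots)
    if lf < 20.0:
        return "Port Maneuvering"
    if lf < 65.0:
        return "Low-Speed Navigation"
    return "Cruise"


# Hypothetical vessel with a 20-knot design speed; 14.5 knots is the sog
# value shown in the example AIS record of Table 3.
for speed in (0.5, 2.0, 10.0, 14.5, 19.0):
    print(f"{speed:>5.1f} kn -> {ship_status(speed, 20.0)}")
```

Under these assumptions, the speed thresholds are checked first, so the load factor only needs to be computed for vessels moving faster than 3 knots.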

Table 2. Examples of Emission Factors (EFs) Used in This Study.

Pollutant(s)	Engine Type/IMO Tier	Fuel Sulphur Content (%)	Emission Factor (g/kW·h)
NO_x	Slow Speed Diesel/Tier I	-	17
NO_x	Slow Speed Diesel/Tier II	-	14.4
SO_x	All Engines	0.5% (Global Cap)	1.8
SO_x	All Engines	0.1% (ECA)	0.36
PM_2.5	Medium Speed Diesel	0.5% (Global Cap)	0.95

Table 3. Example of Key Fields from a Raw AIS Data Record.

Field Name	Example Value	Description
userId	356490000	Vessel’s unique identifier (MMSI)
currTime	20 May 2024 14:22	Record timestamp
longitude	123.13965	Geographical longitude in degrees
latitude	37.581733	Geographical latitude in degrees
sog	14.5	Speed Over Ground (knots)
cog	307	Course Over Ground (degrees)
naviState	0	Navigational Status (‘Under way using engine’)
shiptypekey	1	Vessel Type (‘Cargo Ship’)

Table 4. Illustration of the data transformation pipeline for individual AIS points.

Raw AIS Data
userId	curr_time	longitude	latitude	sog_knots	…
2.02 × 10⁸	20 May 2024 10:00	122.1512	35.8225	12.1	…
2.02 × 10⁸	20 May 2024 10:05	122.1734	35.8451	12.2	…
2.02 × 10⁸	20 May 2024 11:30	122.1548	35.8259	10.5	…
2.02 × 10⁸	20 May 2024 11:35	122.1791	35.8493	10.6	…
Calculated Emissions and Gridding (same four records, in the same order)
so₂_total	no_x_total	pm_2.5_total	…	…	grid_id
1.80 × 10⁻¹	1.95	1.17 × 10⁻¹	…	…	12215_3582
1.82 × 10⁻¹	1.97	1.18 × 10⁻¹	…	…	12217_3584
1.50 × 10⁻¹	1.65	9.80 × 10⁻²	…	…	12215_3582
1.51 × 10⁻¹	1.66	9.90 × 10⁻²	…	…	12217_3584

Table 5. Summary of Annual Ship Emissions by Major Sub-Region (2024).

Sub-Region	NO_x (Tonnes/Year)	SO₂ (Tonnes/Year)	PM_2.5 (Tonnes/Year)	NO_x Percentage of Total (%)
Yellow Sea Main Channel	18,500	2750	1150	68.5
Qingdao Port Area	3800	600	250	14.1
Incheon Port Area	2600	350	150	9.6
Other Areas	2100	300	100	7.8
Total Study Area	27,000	4000	1650	100

Table 6. Gridding Performance Comparison for a Single High-Density Cell.

Points Processed in a Single Grid Cell	Traditional Geometric Method (ms)	PostGIS Method (ms)	H-Grid Method (ms)
100	180	150	120
1000	250	200	150
10,000	450	300	200
100,000	6000	4500	370
1,000,000	25,000	8000	620

Table 7. Performance Impact of Data Skew (Total Time in Seconds).

Data Distribution	Data Concentration	H-Grid Method (s)	PostGIS Method (s)	Traditional Method (s)
Uniform	10% in top 10% of cells	31.5	55	150.2
Moderately Skewed	90% in top 1% of cells	31.8	91.4	254.8
Highly Skewed	99% in top 0.1% of cells	32	128	320

Table 8. Performance Breakdown of H-Grid’s Aggregation Strategy.

Aggregation Phase	Total Time (s)
Local Aggregation	30.7
Global Merge	1.3
Total Time	32

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
