Supervised segmentation of NO2 plumes from individual ships using TROPOMI satellite data

The shipping industry is one of the strongest anthropogenic emitters of $\text{NO}_\text{x}$ -- substance harmful both to human health and the environment. The rapid growth of the industry causes societal pressure on controlling the emission levels produced by ships. All the methods currently used for ship emission monitoring are costly and require proximity to a ship, which makes global and continuous emission monitoring impossible. A promising approach is the application of remote sensing. Studies showed that some of the $\text{NO}_\text{2}$ plumes from individual ships can visually be distinguished using the TROPOspheric Monitoring Instrument on board the Copernicus Sentinel 5 Precursor (TROPOMI/S5P). To deploy a remote sensing-based global emission monitoring system, an automated procedure for the estimation of $\text{NO}_\text{2}$ emissions from individual ships is needed. The extremely low signal-to-noise ratio of the available data as well as the absence of ground truth makes the task very challenging. Here, we present a methodology for the automated segmentation of $\text{NO}_\text{2}$ plumes produced by seagoing ships using supervised machine learning on TROPOMI/S5P data. We show that the proposed approach leads to a more than a 20\% increase in the average precision score in comparison to the methods used in previous studies and results in a high correlation of 0.834 with the theoretically derived ship emission proxy. This work is a crucial step toward the development of an automated procedure for global ship emission monitoring using remote sensing data.


INTRODUCTION
The international shipping sector is one of the strongest sources of anthropogenic emission of NO x -a substance that has a negative impact both on ecology and human health. The contribution of the shipping industry is estimated to vary from 15% to 35% worldwide [17,27], which leads to approximately 60,000 premature deaths annually [16]. While over the last 20 years the pollution produced by power plants, the industry sector, and cars has been constantly decreasing, the impact of maritime transport continues to grow [8]. This causes a big societal pressure, resulting in regulations [26] that put restrictions on emission levels that can be produced by individual ships.
All methods currently used for ship emission monitoring such as in-situ [3,34], on-board [1], and airborne platform-based [44] have several disadvantages: they all require close proximity to a ship, they are costly, and they do not allow for monitoring on a global scale. A potential solution to the problem is the application of remote sensing instruments [39]. Studies [23,29] show that NO 2 traces of some individual ship plumes can be visually distinguished on images from the TROPOspheric Monitoring Instrument on board the Copernicus Sentinel 5 Precursor (TROPOMI/S5P) satellite [45] that was launched in 2018.
An important next step required for the development of an automated global-scale monitoring system is an automated interpretable method for the evaluation of NO x emission produced by individual ships. Among the main challenges for this development are low temporal sample rate and spatial resolution resulting in the extremely low signal-to-noise ratio. In addition, there is a high risk of interference of the ship plume with other NO x sources and a high frequency of occurrence of plume-like objects that cannot be associated with any ship. Finally, the ground truth for this task is not available. In this paper, we present a methodology that allows addressing the above-mentioned challenges. Using machine learning and taking advantage of spatial characteristics of a ship plume, the developed method allows automatic segmentation of NO 2 plumes from the background, simultaneously assigning detected signal to a ship-emitter, circumventing the listed limitations.
We use NO 2 retrievals from the TROPOMI/S5P as this is the only available remote sensing instrument that performs NO 2 measurements with the resolution high enough 1 to distinguish plumes from individual ships. To increase the number of potentially distinguishable plumes, we enhance the contrast between the ship plumes and the background. The used enhancement technique allows for a differentiation between the ship plumes and random co-occurring concentration peaks in the ships' neighborhood. The application of the image enhancement technique also allows for an improvement of the low signal-to-noise ratio. Then, for each analyzed ship, we perform an automatic generation of a Region of Interest (ROI) that we call a ship sector. The purpose of the ship sector is to focus the area of analysis to the region where the ship plume is expected to be located. Subsequently, we normalize the ship sector and divide it into sub-regions. This way, we distinguish the plume of interest from all the other NO 2 plumes or land-origin outflows that potentially might be located within the ship sector. Based on the ship sector division, we create a set of spatial features that characterize the location of the NO 2 plume within the ship sector. Due to the absence of other sources of ground truth, each pixel of the ship sectors we manually label as a "plume" or "not a plume". Trained on the manually labeled data, a machine learning model will enable us to automatically segment plumes in unseen images. We study five robust machine learning models of increasing complexity and compare their performance with the threshold-based methods used in previous studies. To validate the developed pipeline, we compare the estimated based on the segmentation results amount of NO 2 to the theoretically derived ship emission proxy [23].
The rest of this paper is organized as follows: In Section 2, we start with an overview of the related literature. We then provide a description of the data sources used in this study and introduce the developed methodology (Section 3). In Section 4, the reader can find the results of the study, which are followed by the conclusions in Section 5 and discussion in Section 6.

RELATED WORK
For more than a decade scientists were trying to use the available satellite data in order to quantify the NO 2 emission from the shipping industry as a whole. For instance, using the measurements from the Global Ozone Monitoring Experiment (GOME) [11] instrument onboard the second European Remote Sensing satellite (ERS-2), the authors estimated NO 2 emission level above the shipping lane between Sri Lanka and Indonesia [5]. With the images from the SCanning Imaging Absorption spectroMeter for Atmospheric CartograpHY (SCIAMACHY) [9] onboard the ENVIronmental SATellite (Envisat) mission, traces from the shipping industry over the Red Sea were quantified [37]. Finally, data from the Ozone Monitoring Instrument (OMI) [31] aboard the NASA Aura spacecraft was used to visualize a ship's NO x emission inventory for the Baltic Sea [46]. However, due to the significantly lower resolution capabilities of the above-mentioned predecessors of TROPOMI/S5P (GOME: 40 × 320 km 2 , SCIAMACHY: 30 × 60 km 2 , OMI: 13 × 25 km 2 , TROPOMI: 3.5 × 5.5 km 2 ), these studies were based on multi-month data averaging, which does not give the possibility to quantify the emission from individual ships.
There are many studies demonstrating the capability of TROPOMI NO 2 measurements to pinpoint the emission from urban (e.g. [4,22,32]) or industrial sources, such as mining industry [25], or a gas pipeline [43], as well as showing the possibility of the usage of TROPOMI data to quantify the positive effects of COVID-19 lockdown (e.g. [18]). Nonetheless, all studied emission sources are stationary (therefore, can be observed over an extended period of time) and emit much higher quantities of NO x than an individual ship. Thus, all the above-mentioned problems are arguably less complex than the one discussed in this paper.
In [23] it was reported that the NO 2 plumes produced by individual ships can be visually distinguished in TROPOMI data. However, since the NO 2 traces of most of the ships in the area are not sufficiently stronger than the background concentrations, only plumes of the largest ships were addressed in that study. In addition, the presented approach requires multiple manual steps, which makes it impossible to apply the method on a global scale. In [29], we introduced the first attempt of a fully automated pipeline for the estimation of NO 2 emission from individual ships. In the study, we showed that pre-processing of the TROPOMI signal allows a visual distinction of a greater amount of ship plumes. The plumebackground separation itself, however, was based on a locally optimized threshold, established individually for each of the analyzed ships. Due to the fact that the threshold was established on the basis of the only variable (NO 2 concentration), it was not flexible enough for a good quality of separation of the ship plume from the background. Finally, the one-feature-based method of thresholding does not allow for differentiation of the plume produced by the analyzed ship from all the other NO 2 plumes that might be located in the ship's proximity.
Machine learning has proven to be an efficient technique for solving problems in geosciences and remote sensing [30,33]. Recently emerging studies show the efficiency of applying machine learning models to the TROPOMI data as well. For instance, in [22], the authors have built a deep convolution neural network to classify images into those that contain an NO 2 plume from those that do not. The developed model, however, does not differentiate the sources of the detected plumes. Moreover, the study does not provide the attempts of segmentation of the detected plumes from the background, thus there is no possibility to quantify the amount of NO 2 that was emitted by a given source. Several studies reported a successful application of various multivariate machine learning models for the estimation of surface-level emissions. For example, in [13] the authors created a multivariate machine learning model for the estimation of NO 2 emission over Germany [13], while in [47] the O 3 concentrations were estimated for California. The areas of analysis of the above-mentioned studies, however, are restricted to over-land territories. In [28] was reported the first attempt to extend near-surface concentration predictions to an ocean region. The authors acknowledged that the performance of emission estimations over the ocean is significantly more challenging than the equivalent task for the over-land areas, mainly, due to the absence of in-situ measurements possibilities. In addition, the sources of the detected emission levels have not been studied in the abovementioned paper.
In this study, we present a pipeline that for the first time allows the application of multivariate machine learning models for the estimation of NO 2 emission from seagoing ships. The developed method uses TROPOMI satellite data, AIS data on ship positions, as well as ECMWF wind data for segmentation of NO 2 plumes from individual ships, allowing differentiation between the plume produced by the ship of interest from all the other NO 2 plumes in the ship neighborhood. In the following Section, the used data sources and the developed methodology is presented in detail.

Data sources
In this paper, we perform the integration of several data sources: TROPOMI satellite measurements, wind data, information about the ships' positions, and information about the properties of analyzed ships. In this Section, the reader can find a general overview of the data sources that will be utilized throughout the study. We also provide here the criteria applied for data selection, which set the scope of this study.
3.1.1 TROPOMI data. TROPOspheric Monitoring Instrument (TROPOMI) [45] is a UV-Vis-NIR-SWIR (UV-visible-near-infrared-short-wave infrared) spectrometer on board the Copernicus Sentinel 5 Precursor (S5P) satellite. The satellite was launched in October 2017 and entered its operational phase in May 2018. It is a sun-synchronous satellite with a local equatorial overpass time at 13:30. In 24 hours, the satellite performs approximately 14 orbits and with these covers the full globe.
The TROPOMI spectrometer measures spectra of several trace gases including nitrogen dioxide (NO 2 ). Since the NO 2 gas is the most notable product of photochemical reactions of NO x emitted by ships, it can be utilized for the ships' emission monitoring. The maximal ground pixel resolution of the TROPOMI instrument is equal to 3.5 × 5.5 km 2 at nadir. Due to the projection of the satellite images, the real size of the pixel will vary, depending on the true distance between the satellite and the part of the surface of the earth being imaged. To generate images of regular size, we regridded 2 the original TROPOMI data into a regular-size grid of size of 0.045 • × 0.045 • , which for the pixel in the middle of the analyzed area translates to approximately 4.2 × 5 km 2 .
The following filtering criteria were applied to TROPOMI data: _ > 0.5, < 0.5. Previous studies [23,29] showed that such a selection of filtering values can yield considerably good results for a given task. The detailed description of the variables can be found in [40].
In this study, 68 days from the period between 1 April 2019 and 31 December 2019 of TROPOMI measurements were analyzed. The analyzed days were mostly sunny -the distribution of the variable cloud fraction for the scope of this study is provided in Figure 1. The studied data product: tropospheric vertical column of nitrogen dioxide [19]. Data version: 1.3.0. For the analysis, an area in the Mediterranean Sea, restricted by the Northern coasts of Libya and Egypt from the south and South coast of Crete from the north 3 was chosen. This particular region was selected because of the presence of a busy shipping lane connecting Europe and Asia, the high frequency of occurrence of sunny days, and relatively low levels of NO 2 background concentrations, which are favorable conditions for the analysis. An outline of the area studied is presented in Figure  2   ECMWF operational model analyses at a spatial resolution of 0.25 •4 , the temporal resolution of 6 hours and altitude of 10 meters. It was shown in [23] that wind data at 10 meters altitude is an optimal choice for a task of ship-plume matching. Starting from the product version upgrade from 1.2.2 to 1.3.0 that took place on March 27, 2019, the ECMWF 10 meters wind data for coinciding time is available as a support product in the TROPOMI/S5P data file.

Ship-related data.
Since 2002 all commercial sea-going vessels are obliged to carry on board an AIS transponder [35] which transmits information about position, speed, heading (direction), and unique identifier (MMSI) and type of each ship. Thus, another source of data used in this study is data from Automatic Identification System (AIS) transponders. At the moment there is no open access AIS data available for the region and time period as described in Section 3.1.1, as well as the quality required for this study. The data, however, can be accessed through several commercial providers. For the scope of this study, the AIS data as well as information about the dimensions of the ships were provided by the Netherlands Human Environment and Transport Inspectorate (ILT). This is the Dutch national designated authority for shipping inspections, has access to commercial databases for the AIS data set used in this study, and is participating in this research.
With the aim of reduction of the number of images where the ship plume cannot be visually detected, in our study, we only focus on ships with a speed that exceeds 14 kt. If two ships move in immediate proximity to each other, only the ship with the highest speed was taken into consideration. From the analysis were also excluded ships that are not involved in global trade, such as Yachts, Leisure Vessels, or Research Vessels. In Figure 3, the information about the dates used for this study as well as the number of ships per day studied is depicted. The differences in the number of ships per studied day can be caused by bad weather conditions on the measurement day.

Method
In this Subsection, we present the developed methodology. Taking advantage of the characteristics of the analyzed ship as well as wind conditions in the studied region, our approach allows the segmentation of NO 2 plume produced by the particular ship of interest distinguishing it from all the other concentration peaks in the surrounding area. The results produced by the proposed approach are easily interpretable and thus can be used as a reliable source of information by ship inspectors.  The method consists of the following steps: an AIS data-based interpolation of the ship tracks at the moment and just before the satellite overpass 3.2.1, definition 3.2.2 and enhancement 3.2.3 of a ship plume image, definition of a ship sector 3.2.4 that allows the further restriction of the analyzed area, normalization of the defined ship sector and split of the normalized sector into sub-regions 3.2.5 that, finally, give the possibility to retrieve the set of necessary features. These steps are described below.
3.2.1 Ship tracks. The first step is to estimate the tracks of the studied ships. Taking into account a lifetime of NO x equal to a few hours, we study the track of the ship starting from two hours before the moment of the satellite overpass. We distinguish two types of ships'-tracks: the ship track, obtained based on resampling and interpolation of the AIS data, and the wind-shifted ship track. For calculation of the wind-shifted ship track, we assume that the plume emitted by a ship has moved in accordance to wind direction by a distance = × |Δ |, where is the speed at a given location at 10 m above sea level from ECMWF for 12:00 UTC, and |Δ | is a time difference between the time of the satellite overpass and the time of a given AIS ship position. Both wind speed and wind direction are assumed to be constant for the whole time during which we study the plume. Such an assumption may create uncertainties in the expected position of the plume of the ship. However, with the methodology presented further in this section, such uncertainties will not affect the correctness of the ship-plume allocation.
Summing up, the ship track provides us the information on the position of the ship from where the studied ship plume was emitted.

Ship plume image.
Utilizing the knowledge of a ship's position summarized in its ship track and wind-shifted ship track, we are able to focus our attention on the area that lies within immediate proximity to the analyzed ship. For this, the concept of a ship plume image (see Figure 5(a)) is introduced. The area of the ship plume image is determined based on the wind-shifted ship track as follows: the average coordinate of the studied windshifted ship track defines the center ( , ) of the ship plume image, the borders of the image are defined as , ± 0.4 •5 . This particular size of a ship plume image was determined in order to allow for optimal plume coverage for the most typical range of ship speeds (14kt -20 kt) 6 . Given the size of the pixel grid, such an offset results in an image of a maximal dimension of 18 × 18 pixels.

Pre-processing of a ship plume image.
To improve the quality of the satellite signal, in the data pre-processing step, on each of the analyzed ship plume images we apply spatial auto-correlation statistic local Moran's [2]. In [29] we showed that the application of this technique substantially improves separability between the ship plume and the background.
The local Moran's spatial auto-correlation statistic allows the enhancement of the intensity of high-value pixels located in a cluster while suppressing isolated concentration peaks randomly occurring in the background. We characterize a ship plume as a cluster of pixels adjacent to each other with a concentration higher than the background average. This way, calculating the spatial autocorrelation of a ship plume image, we combine image denoising with the enhancement of the relevant part of the image.
An example of a result of an enhanced ship plume image is provided in Figure 5 Formally, the Moran's spatial auto-correlation statistic is defined as follows: for each pixel of a ship plume image, the local Moran's is calculated as: where is the value of the respective pixel, is the number of analyzed pixels of a ship plume image (in our case 18 × 18), is a mean value of all pixels, 2 their variance, and is the value of an element in a binary spatial contiguity weight matrix at location with regards to the analyzed pixel . The value of an element of the binary spatial contiguity matrix is 1 for pixels that are considered to be the neighbors of the analyzed pixel , and 0 otherwise. For the study, the queen spatial contiguity [24], which is the 3 × 3 8-connected neighborhood of the analyzed (central) pixel was applied. The value becomes the value of the corresponding pixel of the resulting enhanced image.
From Equation 1 we can imply the weak side of using the Moran's statistic for ship plume enhancement -apart from the clusters of high values, the given method enhances clusters of low-value pixels at the same time. The methodology proposed in this study, however, is designed in a way so that this negative impact is minimized.

Parameter Value
Trace track duration 2 hours Wind speed uncertainty 5 m/s Wind direction uncertainty 40 • Table 1: Parameters applied for ship sector definition.

Ship sector.
A plume produced by a ship at a given moment will be displaced, over time, in the direction of the wind in the analyzed area. Having the wind information available, we would like to restrict the analysis to the part of the ship plume image, where the probability to find the plume of the ship is the highest. We perform the area restriction by defining an ROI of an analyzed ship, which we call a ship sector. The area of the ship sector is determined on the basis of information about the ship's trajectory and the speed/direction of the local wind.
As a starting point of the ship sector's definition, we use the wind-shifted ship track, calculated as described in 3.2.1. We then calculate the extreme wind-shifted tracks by adding the margin of wind-related uncertainty to either side of the wind-shifted ship track. The extreme wind-shifted tracks delineate the borders of the ship sector, showing the extreme possible positions of the plume, taking the wind measurement uncertainty into account. The wind uncertainty is assumed due to the limited spatial and temporal resolution of wind data [23], based on reported in several studies ( [6,42]) measurement bias, as well as assumptions of constant wind mentioned in Section 3.2.1. Illustrations of the extreme wind-shifted tracks and the resulting ship sector are shown in Figures 5(e) and 5(f), respectively. As a result of the delineation of the ship sector area, the plume should always lie within the ship sector boundaries. Only pixels lying within the ship sector are taken into consideration in further analysis. Parameters related to the ship sector can be found in Table 1. 3.2.5 Feature set construction. In order to obtain a multivariate description of the ship sector pixels, we encode the spatial information into a set of generic features. First, we perform a ship sector normalization to make spatial information in the sector comparable between the different sectors. We define a normalized sector by standardization of the orientation and the scale of the original ship sector. In this way, the position of the plume within the ship sector becomes invariant to the heading (direction) and speed of the ship, as well as to the direction and speed of the wind.
We standardize the orientation of a ship sector by rotating to 320 •7 , so that the angle of the polar coordinate of the corresponding wind-shifted ship track is the same for all ships (see Figure 6). Assuming is a set of ship sectors in the dataset, formally, the rotation coordinates of a ship sector are defined in the following way: where _ and _ are the polar coordinates of the pixel within the rotated ship sector, , is the radial distance of the pixel from the origin of the ship sector (in our case, sector origin corresponds to the position of the ship at the moment of satellite overpass), , is a counterclockwise rotation angle of the pixel from the axis ( ) of the ship sector , Θ = − is a counterclockwise rotation angle that will be applied for the orientation change of each pixel of the ship sector , is a rotation angle of a ship sector that corresponds to the counterclockwise rotation angle of the pixel , with the radial distance from the origin , = ( ), = 320 • is a new rotation angle of each ship sector after the rotation.
We standardize the ship sector's scale so that the horizontal and vertical coordinates of the rotated ship sector are rescaled into the range [0, 1] by applying a min-max scaler on the horizontal and vertical coordinates of the pixel: The second step of the feature construction procedure is the division of the normalized sector into a set of sub-regions that enable encoding spatial information of the pixels within the normalized sector. First, we define levels of the normalized sector by splitting it into six sub-regions on the basis of the radial distance of the pixel from the origin of the sector. Then, we define sub-sectors by splitting the normalized sector into four sub-regions on the basis of the pixel's rotation angle. As a result, the position of each pixel within the normalized sector image can be characterized in terms of two values: a level and a sub-sector. An illustration of the normalized sector divided into a set of levels and sub-sectors is presented in Figure 7.

Experiment design
Here, we describe the experimental setup used in this study: first, we describe the dataset used for the training of the multivariate models, then we explain the models used for the benchmarking, and provide a list of used multivariate classifiers. In addition, in this Section, the reader can find the description of the methods used for hyperparameters' optimization and measures utilized for performance evaluation.

Dataset composition.
Following the steps provided in the previous subsections, we created 754 images and cropped them to an area of the ship sector. The ship sector images were enhanced by Moran's operator and manually labeled so that they can be used for training machine learning models. Not all ship sector images contained a visually identifiable NO 2 plume. Moreover, due to the dispersion and chemical transformation of a ship plume, some parts of the plume will always be under the detection limit of the satellite and therefore, indistinguishable. Thus, labeling errors are possible. To minimize the chance of mistakes the labeler was    Figure 8. In Table 2, the information on the data distribution within the two classes of the dataset is shown. All mentioned numbers correspond to the full dataset before the training/test set division.   [21], Support Vector Machines with radial basis kernel [14], Random Forest 8 [10], and Extreme Gradient Boosting (XGBoost) 9 [15]. The above-mentioned models are multivariate, and thus are able to benefit from the set of prepared features. Namely, the set of spatial features developed with the method described in Subsection 3.2.5, along with ship and wind-related features. All models selected for the experiment are highly robust. Therefore, the potential mistakes in human labeling, if present in reasonable amounts, should still allow for models' proper training.
The first feature of the model is enhanced by Moran's values of the pixels that were translated into a one-dimensional feature vector. As mentioned in 3.2.3, the application of Moran's may result in the creation of additional high-value pixels resulting from the enhancement of clusters of low-value pixels. To mitigate the negative impact of this side effect, apart from the Moran's , the feature set was composed of the corresponding value of NO 2 . This way, a supervised learning model will be able to differentiate between high and low-value enhanced NO 2 clusters. Other features used by the model are Wind Speed, Wind Direction 10 , Ship Speed, and Ship Length. Finally, the position of an analyzed pixel within the normalized sector in terms of levels and sub-sectors was translated into the feature vectors using one-hot encoding. The resulting feature set was composed of 17 features in total. For the full feature list, see Figure 13. The used binary label indicates whether the given pixel is a part of the ship plume or not.

Benchmarks.
To quantify the performance improvement gained by the usage of multivariate supervised models, we performed ship plume segmentation by applying a thresholding method on a single selected feature. First, we applied a thresholding method on the tropospheric vertical column of NO 2 TROPOMI product regridded in accordance to the description in Section 3.1.1. No image enhancement technique was applied. This simplest way of plumebackground separation was used, among the others, in [37] for the quantification of NO 2 emission from the international shipping sector. In [23], separation of pixels related to NO 2 plumes from individual ships was also performed based on solely TROPOMI NO 2 data. In this paper, we will refer to this benchmarking method NO 2 threshold. Visualization of the input data for this thresholding technique can be found in Figure 9(a).
As a second benchmarking method, following the suggestion made in [29] we performed a ship plume segmentation based on images enhanced with Moran's statistic. The satellite image enhancement allows effective separation of a greater amount of NO 2 plumes. However, as mentioned in Section 3.2.3, the application of Moran's statistic may result in the enhancement of low-value clusters that are not a part of a plume. Visualization of the input data for this benchmarking technique is presented in Figure 9(b). In the rest of the article, we call this method Moran's threshold.
To overcome the problem of enhancement of low-value clusters by Moran's , we propose to assign zero value to all pixels of the image with intensity lower than the median of the given ship sector picture, and afterward apply the Moran's enhancement. This is the third benchmarking method used in this study. We call it Moran's I on high NO 2 . Visualization of the results of the application of Moran's only on high NO 2 values can be found in Figure 9(c). As presented in Figure 9, for all three benchmarking methods only pixels that lie within the ship sector area were taken into account for segmentation.

Segmentation validation metrics.
For the assessment of classification quality, we used a precision-recall curve, an average precision score (AP) 11 that is defined as the area under the precisionrecall curve (PR-AUC), a receiver operating curve (ROC), and, finally, area under ROC curve (ROC-AUC).
Precision and Recall are respectively defined as follows: where stands for true positive and corresponds to the pixels that were labeled as a "plume" and were correctly identified by  Class-wise data distribution the classifier.
-false positive, it corresponds to the pixels that were not labeled as a "plume", but were identified as a "plume" by the classifier. stands for false negative and corresponds to the pixels that were not classified as a "plume" by the classifier, but were labeled as such by the labeler. The ROC curve visualizes True Positive ( ) scores as a function of False Positive ( ) scores.

Cross-validation and parameters optimization.
For the model fine-tuning and model performance evaluation, nested cross-validation [12,41] was used. In the inner loop, we performed a randomized grid-search [7] with 5-fold cross-validation to optimize the hyperparameters of the used models. The AP score was used as a target function for optimization. The performance of the best model identified during the inner loop of cross-validation is evaluated on the "hold-out" test set, generated during the outer loop of crossvalidation. The above-mentioned procedure is repeated five times, generating five independent training and test sets. An illustration of the applied cross-validation scheme can be found in Figure 10. The search space of the hyperparameters for each of the analyzed multivariate models is provided in Appendix A. The optimal parameters selected for each model by the randomized grid search can be found in Appendix B. the uncertainty hidden in human labeling, the reference value is required. Due to the fact that there are no on-site emission measurements available at the scale of this analysis, it is, therefore, necessary to use a ship emission proxy to represent the reference value. Here, we propose to use a theoretically derived NO x emission proxy defined as follows: where is the length of the ship in , and is its speed in / . The details of the derivation of the given measure can be found in [23], where the proxy was introduced. As it is noted in [23], the advantage of in comparison to other ship emission proxies (e.g. [20]) is that it can be calculated based on AIS data only, while other existing emission proxies require ship information that is not in the AIS data and is not available publicly.
The ship emission proxy is calculated for each ship of the test sets (see Figure 10). We compare the obtained values of emission proxy with the estimated on the basis of segmentation results amount of produced NO 2 . We estimate the amount of produced NO 2 by summing up NO 2 concentration within the pixels classified as a "plume" by each of the studied models. For the comparison between the emission proxy and the estimated amount of NO 2 , Pearson linear correlation was used.

RESULTS
In this section we present the results of our study. We begin with the presentation of the results of the plume segmentation model in Subsection 4.1. Appropriate segmentation quality is necessary for a correct estimation of NO 2 produced by ships. In Subsection 4.2, we validate the concept presented in this paper. In the Subsection, we compare the obtained on the basis of segmentation model results of ship NO 2 estimation with the theoretical ship emission proxy.  Table 3: Results on the test set with 5-fold cross-validation. Bold font indicates the best-obtained result. Under the dashed line: results obtained from univariate thresholdbased methods that, in this study, we considered as benchmarks.

Plume segmentation
In Table 3, we report the results of the pixel classification based on a 5-fold cross-validation for all models and benchmarks studied. Figure 11 provides the corresponding precision-recall curves, obtained by averaging the scores over all cross-validation test sets. In Figure  13, we visualize the model coefficients for the linear models studied, as well as the impurity-based feature importance coefficients for the tree-based models (Random Forest and XGBoost). The obtained results can be summarized as follows: (i) From Table 3, Figure 11, as well as Figure 12 we can conclude that nonlinear classifiers clearly outperform both linear classifiers and threshold-based univariate benchmarks. Both used measures: AP score and ROC-AUC resulted in a similar rank of the studied classifiers. With XGBoost, Random Forest, or RBF SVM models, a very high level of precision can be achieved. For the task of ship plume segmentation, our biggest interest lies in the correct segmentation of the most representative pixels of the ship plume. Thus, the obtained level of recall we consider as reasonably satisfactory. From Table 3, we can also see that the level of the standard deviation of AP scores for multivariate non-linear models is significantly lower than for linear or univariate models. This suggests that the results obtained with the nonlinear classifiers are more robust.
(ii) From Figure 13, we can see that Linear SVM, Logistic Regression, Random Forest, and XGBoost multivariate models utilize the spatial information provided by sub-sectors and levels. The complexity of the RBF SVM model does not allow the direct calculation of the importance of the utilized features. Even though due to the different nature of the models, the coefficients' values depicted in Figure 13 cannot be compared directly, the relative differences between the models' features go along with our intuition on where the plume produced by an analyzed ship should be located within a normalized sector. For instance, high negative coefficients for the linear models that correspond to the features Level 4 and Level 5 suggest that even if a high-value pixel does occur in those regions of the normalized sector, it was most probably produced by a source other than the analyzed ship. On the other hand, the high positive coefficients corresponding to a feature Sub-sector 2, tell us that if a high-value pixel occurs in the middle of the sector, it is most probably a part of the plume produced by the studied ship.   Figure 14 provides the correlation plots of NO 2 values estimated for a given ship on a given day based on the segmentation results of a given model and the theoretically derived NO x ship emission proxy . Table 4 gives information on the achieved level of Pearson correlation and the number of plumes that were segmented by a certain model. Here, our baseline result is the level of Pearson correlation and the number of plumes that were identified by Manual Labeling. We can see that majority of the models detected more plumes than the labeler. However, in all cases apart from XGBoost, the higher number of segmented plumes caused the decrement of the correlation score. The XGBoost model, on the other hand, was able to detect more plumes than the manual labeler, while achieving the highest correlation score. Such a result allows us to form a hypothesis that the developed machine learning-based methodology is able to segment plumes better than a human labeler. An example of a case where the XGBoost classifier identifies a plume better than the human labeler can be found in Figure 15. More experiments are, however, required in order to make final conclusions.   Table 4: Results on the comparison between the estimated amount of NO 2 and theoretically derived NO x ship emission proxy. Sorted in accordance with the achieved level of Pearson correlation. Italic font indicates baseline results.

Validation with emission proxy
The highest contrast between the scores of the performance metrics and the correlation with the emission proxy can be noted for the NO 2 threshold benchmark model. This is due to the fact that the ship plumes composed out of one pixel in our dataset were not labeled as plumes. The substantially high correlation with the emission proxy suggests that the single-pixel plumes were, nevertheless, identified by the method correctly. An illustration of such an example is provided in Figure 16.

CONCLUSIONS
In this study, we presented a new supervised-learning-based method for the automatic evaluation of emission plumes produced by individual ships using satellite data. The experiments were performed using NO 2 measurements from the TROPOMI/S5P satellite. We started with the enhancement of the satellite data in order to increase the contrast between the ship plume and the background. The applied image pre-processing technique enhances the intensity of high-value pixels located in a cluster (plume) and suppresses random concentration peaks in the background. We then automatically assigned a ship sector to each analyzed ship, which excludes from the analysis parts of the image where the plume of the studied ship cannot be located based on wind conditions and speed/direction of the ship.
As a next step, we presented a feature construction method consisting of the normalization of the ship sector and its division into smaller sub-regions. Each sub-region has a different probability to contain a plume produced by the ship of interest. This way, we differentiate the plume produced by the ship of interest from all the other plumes potentially located within the ship sector. The set of newly created spatial ship sector-based features allows us to perform ship plume segmentation using multivariate machine learning models. The application of the multivariate models gives the possibility to support the ship plume segmentation process with a set of additional one-dimensional features such as ship characteristics and speed.
We integrated several data sources into a multivariate dataset. We manually labeled the data, so that the problem of individual shipplume segmentation can be addressed with supervised learning. We trained a set of robust linear and nonlinear multivariate classifiers and compared their performance with the segmentation results of thresholding-based univariate benchmarks. All studied non-linear classifiers showed superior results in comparison to both linear models and univariate benchmarks. With the XGBoost model, we were able to achieve more than a 20% increase in the segmentation average precision in comparison to the best benchmark univariate model.
We validated the proposed methodology using an independent measure, i.e. a theoretically derived NO x ship emission proxy that we use as a reference value. For the comparison, we estimated  the amount of NO 2 produced by each of the analyzed ships and calculated the Pearson correlation of the obtained results with the ship emission proxy. We compared the obtained correlations and the number of plumes segmented by each of the studied models with the results obtained from manual segmentation. We showed that with the XGBoost model we are able to segment more plumes while achieving a 6.8% higher correlation with the emission proxy than when the plumes were segmented manually. That might suggest that the proposed method is able to find plumes that are hardly or not detectable by the human eye.

DISCUSSION
The presented approach opens new perspectives for the application of remote sensing in the domain of ship emission monitoring. However, there are several points on the generalization of results, the methodology, and the TROPOMI detection limit we would like to address here.
Firstly, we would like to discuss the possibility of the application of the proposed methodology to other regions. In this study, we presented a general approach that allows for the application of machine learning models for more efficient, automated segmentation of plumes from individual ships using TROPOMI data. All steps of feature preparation can be performed on the data from any region of the globe. Nevertheless, the machine learning models will have to be retrained on the region-specific datasets.
Secondly, not all regions will be equally suitable for the performance of ship emission monitoring with remote sensing. In particular, at the moment there is no scientific evidence that under the thick layer of land-based emission outflow it will still be possible to differentiate plumes produced by ships. Therefore, areas that lie in close proximity to big cities, ports, or industrial objects are currently challenging to analyze.
The next point is the validation approaches used in this study. For the training of the machine learning model, we used human labels. Human labeling is the basis of all machine learning methods and this study pioneers ship plume segmentation with more efficient supervised learning based on human labeling. However, the dispersion and chemical transformation of a ship plume, as well as its non-rigid structure mean that there are always some parts of this plume that are at or beyond the visible detection limit of the combination of the TROPOMI instrument and the retrieval algorithm. This can cause errors in labeling as is demonstrated in Figure 15. Such mistakes if present in reasonable amounts should not affect the performance of the model, but, if the number of labeling errors is too high, the machine learning model will not be able to learn properly, and thus, the resulting performance will be very poor. The fact that non-linear models were able to easily outperform thresholding-based benchmarks suggests that the models were able to use the provided labels for training and thus, the labeling error rate is low. Nevertheless, an independent measure of the method evaluation is needed. Since the interest of our study centers around seagoing ships, the in-situ measurements cannot be considered as a potential way of method validation. The option of on-board measurement of fuel samples, cannot be performed at the scale of the study. Therefore, a theoretical measure of ship emission potential which is ship emission proxy turns out to be the only available option of a reference value for the results of this study.
The usage of the ship emission proxy, however, has its limitations. Namely, the used ship emission proxy does not take into account many factors that influence the expected level of emission for a given ship. Nonetheless, the used proxy allows us to rank the emission potential of the analyzed ships properly.
As a following, we would like to discuss the fact that even though only fast ships were taken into consideration in this study, the number of ships for which the plume was possible to distinguish is higher than the number of ships for which the plume was invisible for the labeler. This study focuses on observing emission sources at the edge of the detection limits of the TROPOMI instrument. It is, therefore, likely that under certain circumstances ship plumes remain undetected. We can only in part explain under what circumstances plumes are not visible. With the data presented in Figure  17 and Table 5, we show that, as expected, smaller and slower ships are more often not detected. Similarly, for high wind speeds -the detection is more challenging due to the high dilution of the ships' emissions and therefore low concentrations (the evidence can also be found in Figure 17 and Table 5). Regarding the lower detectability at lower wind speeds that can as well be observed in Figure 17, we find some accordance with the findings from [38], where it is described how the wind speed impacts the reflectivity of the sea surface due to the shape of the waves, which in turn influences the sensors' sensitivity. However, this topic needs further study in the satellite retrieval community.
To sum up, the method presented in this study is a big step towards automated and global ship emission monitoring with remote sensing and should not be devalued by the above-mentioned limitations. Firstly, one can train a machine learning model per region as commonly done in remote sensing. In addition, the region can serve as a feature of the model itself to make it invariant to geographic locations. Moreover, adding of such variables like month, solar radiation, or temperature will make the model invariant to the seasonal changes that might be more severe at northern latitudes. Secondly, main ship routes go through both more and less suitable regions for the satellite observations. Thus, a selection of the more convenient regions will still allow us to use our approach for efficient monitoring of the emission levels produced by ships that follow those routes. Moreover, the obtained good results both in terms of segmentation quality and comparison with the emission proxy suggest that labeling has been of substantial quality. The proposed methodology also opens new research directions. For instance, human labeling can be replaced with chemical plume dispersion models, which will further improve the labeling quality and make the proposed methodology even more effective. Finally, the problem of visibility of ship plumes that have been unrevealed with the presented study, once solved, will give us a great overview of the capabilities of TROPOMI sensors.

Appendices Appendix A HYPERPARAMETERS SETTINGS
Below the reader can find the hyperparameters' search space that was used for the optimization of the multivariate model's performance, along with the hyperparameters that were always used for the model training.

Appendix B HYPERPARAMETERS SELECTED BY THE OPTIMIZATION PROCESS
Here we provide the set of the hyper-parameters that were identified as optimal through the performance of randomized grid search procedure.
• Linear SVM In every iteration of cross-validation procedure the optimal value of parameter C was: C=0.02. • Logistic regression In every iteration of cross-validation procedure the optimal value of parameter C was: C=0.001. • SVM RBF In every iteration of cross-validation procedure the optimal value of parameter C was: C=0.1.