1. Introduction
Atmospheric ozone plays a crucial role in absorbing ultraviolet (UV) radiation, thus protecting the biosphere. While a moderate amount of UV radiation stimulates vitamin D production in the skin, excessive exposure poses risks to humans and most other organisms [1]. Monitoring the total ozone column (TOC) amount is therefore important, providing strong motivation to develop improved satellite and ground-based TOC retrieval techniques. Recently, machine learning techniques and neural networks [2] have been successfully applied in many fields of the natural sciences. Neural networks have great flexibility and the potential to address a variety of problems in cognitive science, information science, computer science, marketing, artificial intelligence, biology, and chemistry.
In 2013, the latest TOC and cloud optical depth (COD) results for the New York City area were reported, derived using a radial basis function neural network [3]. Since then, new developments have been made in radiative transfer simulations, machine learning, and computer hardware. The two main motivations of this study are (i) to follow up on the state and trends of the ozone layer in this area for the period 2014–2024, and (ii) to create an updated machine-learning-based retrieval technique that is accurate and easy to use.
The main instruments used in this research are the NILU-UV radiometer (No. 115), which has been deployed and operated in the New York City area (40.74° N, 74.03° W) for eleven years (2014–2024), and the Ozone Monitoring Instrument (OMI).
A detailed description of the radiative transfer simulations, the retrieval methodology, and the machine learning and training methods used in this research is provided in the following sections. Furthermore, comparisons of results obtained by the neural network and lookup table (LUT) methods applied to the NILU-UV data with the OMI results, as well as the relationship between the radiation modification factor (RMF) and the COD, are presented and discussed.
4. Radiative Transfer Simulations
In order to train the neural network, one needs to provide training data to the algorithm. To create a suitable training (synthetic) dataset, we used AccuRT [4], a state-of-the-art radiative transfer simulation package designed to provide a reliable, well-tested, robust, versatile, and easy-to-use radiative transfer tool for coupled (atmosphere and underlying surface) systems. AccuRT uses the discrete ordinate method for the radiative transfer modeling.
The desired outputs from the neural network are the TOC amount and the COD at 380 nm. The available inputs from the NILU-UV are the raw measurements from its six channels and the so-called radiation modification factor (RMF) inferred from the measurements, as follows.
From [7]: “When solar radiation passes through the ozone layer of the atmosphere, a portion of the UV radiation will be absorbed by ozone, while the portion that penetrates the ozone layer will be multiply scattered or absorbed by air molecules, aerosols, and cloud particles. To take into account the effects of clouds, aerosol particles, and surface albedo on the UV radiation, a radiation modification factor (RMF) is introduced. The RMF is the measured irradiance at wavelength λ and solar zenith angle θ0, E_meas(λ, θ0), divided by the calculated irradiance, E_calc(λ, θ0), at the same λ, θ0, and TOC at the instrument location for a cloud- and aerosol-free sky and for zero surface albedo”:

RMF = 100 × E_meas(λ, θ0) / E_calc(λ, θ0).
In this study, the 380 nm channel was used to determine the RMF. As already mentioned, the RMF is relatively insensitive to the ozone abundance because the ozone absorption cross section is very small at 380 nm, but it is sensitive to clouds, aerosol particles, and the surface albedo. “The RMF may be larger than 100 when broken clouds are present and the direct beam from the unobscured Sun is measured by the instrument as well as diffuse sky radiation scattered by broken clouds” [8].
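The RMF ratio is simple enough to sketch directly; the following is a minimal illustration (the function name is ours, and the irradiance values are arbitrary stand-ins, not measured data):

```python
def radiation_modification_factor(e_measured, e_clear):
    """RMF (in percent): measured irradiance at a given wavelength and solar
    zenith angle, divided by the modeled clear-sky, aerosol-free, zero-albedo
    irradiance for the same wavelength, solar zenith angle, and TOC."""
    return 100.0 * e_measured / e_clear

# Overcast sky: measured irradiance reduced to half of the clear-sky value
print(radiation_modification_factor(0.5, 1.0))  # 50.0

# Broken clouds: the direct beam plus cloud-scattered diffuse radiation can
# make the measured irradiance exceed the clear-sky value, so RMF > 100
print(radiation_modification_factor(1.2, 1.0) > 100)  # True
```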
The goal was to train the neural network in terms of the voltages measured in each channel under different atmospheric conditions for varying COD values, TOC amounts, and solar zenith angles. Hence, the solar spectrum (290–387 nm), at the 1 nm resolution of the absolute response functions, was used in the AccuRT computations to obtain the irradiances at these wavelengths.
The cloud optical depth (COD) is not used as a direct input to AccuRT. Instead, the cloud volume fraction is used in conjunction with the cloud extinction coefficient to generate the cloud optical depth at 380 nm. The relationship between the cloud volume fraction and the cloud optical depth is shown in Figure 2 below.
The input parameters and their ranges are shown in Table 1. A total of 20,000 simulations were prepared. For each simulation, values of the cloud volume fraction, the TOC amount (O3), and the solar zenith angle were randomly selected within the ranges presented in Table 1. A MATLAB [9] script was used to cycle through the 20,000 prearranged cases and run the individual AccuRT simulations.
For the simulations, the US Standard Atmosphere [10] model was used, and the surface albedo was set to 0.14, which is typical for cities [11]. The location-specific aerosol size distribution and fine- and coarse-mode values were adopted from our previous research [12]. The aerosol particles were placed between 0 and 2000 m altitude with a fixed volume fraction. The COD retrieval sensitivity assessment is given in Section 7.2. With the cloud model used (see Section 4), the limits of the cloud volume fraction in Table 1 are equivalent to the minimum and maximum COD values considered. In Table 1, the minimum solar zenith angle reflects the lowest possible solar zenith angle at the measurement site throughout the year, while the maximum was limited both to accommodate the slab geometry used in the calculations, in which the curvature of the atmosphere is not accounted for, and to keep measurement errors low. At large solar zenith angles, the impact of cloudiness, the vertical profiles of ozone and temperature, the imperfect cosine response of the instrument, and the absolute calibration error reduce the accuracy of the results [13].
Cloud Model
Clouds were assumed to consist of a collection of homogeneous water spheres having a single-mode log-normal volume size distribution with a specified volume mode radius and width. Clouds were placed between 2000 m and 4000 m altitude, and a Mie code was used to compute the inherent optical properties of the cloud particles. To convert the cloud volume fraction to cloud optical depth, repeated AccuRT simulations were conducted to obtain the COD at 380 nm at different cloud volume fractions. The results are shown in Figure 2. The exact formulation of the relationship between the cloud volume fraction and the COD is presented in Appendix A.
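In practice, such a simulated mapping can be tabulated and interpolated. The sketch below uses hypothetical (volume fraction, COD) pairs standing in for the Figure 2 / Appendix A results; the real values come from the repeated AccuRT runs described above, and the assumption of a smooth, monotonic log-log relationship is ours:

```python
import numpy as np

# Hypothetical (cloud volume fraction, COD at 380 nm) pairs -- stand-ins for
# the tabulated AccuRT results, not the paper's actual numbers.
F_V = np.array([1.0e-7, 1.0e-6, 5.0e-6, 1.0e-5])
TAU_380 = np.array([0.35, 3.5, 17.5, 35.0])

def cod_from_volume_fraction(f_v):
    """Interpolate the volume-fraction -> COD mapping in log-log space,
    assuming it is smooth and monotonic as in Figure 2."""
    return float(np.exp(np.interp(np.log(f_v), np.log(F_V), np.log(TAU_380))))

# At a tabulated point the interpolation reproduces the tabulated COD
print(round(cod_from_volume_fraction(1.0e-6), 3))  # 3.5
```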
The relationship between the RMF and the COD was also investigated. Using AccuRT to calculate the irradiance at 380 nm for different COD values yielded the theoretical relationship between the RMF and the COD shown in Figure 3.
5. Neural Network
A machine learning algorithm based on a multi-layer neural network (MLNN) [14] was built with three hidden layers consisting of 100, 90, and 75 neurons using the scikit-learn Python 3 library [15]. To deal with the large data range and to improve the training performance, logarithmic scaling was applied to the inputs, except for the solar zenith angle, for which the cosine of the angle was used. An adaptive learning rate was applied during training with hyperbolic tangent activation functions. The Adam optimizer [16] was applied with a validation fraction of 0.1. To prevent overfitting, early stopping was applied. It took about 90 iterations to train the network.
The predecessor of our present method was based on a radial basis function neural network (RBFNN) [17] that required experimentation for optimization, as described by Fan et al. (2014) [3]. This “manual” tuning of the RBFNN is not needed in our MLNN approach, which increases practicality and ease of use. In terms of training and retrieval, Fan et al. (2014) [3] used irradiances, while our MLNN was trained on simulated measured voltages, as described in Section 5.1. The TOC and COD retrievals are likewise based on the voltages measured by the NILU-UV instrument. The training data preparation and the two types of validation are described below.
5.1. Training Data
Besides taking the cosine of the solar zenith angle, to prepare the inputs for the neural network training, the simulated irradiances E(λ) were convolved with the absolute response functions R_i(λ) of the corresponding NILU-UV channels, where i denotes the channel number. This transformation yields the voltages that would be measured in the NILU-UV channels. For channels 1, 3, and 5, the convolutions were as follows:

V_i = Σ_λ E(λ) R_i(λ) Δλ,

where the summation is carried out using a wavelength step of Δλ = 1 nm.
The ratio of the voltages in two channels was used for the TOC retrieval. As a reminder, our goal is to retrieve the TOC and the COD at 380 nm. The inputs and outputs of the neural network are shown in Table 2.
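The channel-voltage convolution is a simple discrete sum; a minimal sketch (the function name is ours, and the toy spectra are arbitrary, not NILU-UV response data):

```python
import numpy as np

def channel_voltage(irradiance, response, step_nm=1.0):
    """Simulated channel voltage V_i = sum over lambda of
    E(lambda) * R_i(lambda) * delta_lambda, evaluated on the 1 nm grid
    (290-387 nm in the simulations)."""
    irradiance = np.asarray(irradiance, dtype=float)
    response = np.asarray(response, dtype=float)
    return float(np.sum(irradiance * response) * step_nm)

# Toy example: flat irradiance and a box-shaped response over three 1 nm bins
print(channel_voltage([1.0, 1.0, 1.0], [0.5, 0.5, 0.5]))  # 1.5
```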
5.2. Holdout Validation
Holdout validation involves splitting the dataset into two separate sets: one for training the model and one for testing it. K-fold validation divides the dataset into K equal parts, or “folds”; the model is trained and tested K times, each time using a different fold as the test set and the remaining folds as the training set, which ensures that every data point is used for both training and validation. After data processing and setting up the training (defining the loss function, selecting an optimizer, and determining the learning rate and number of neurons), the neural network was ready for supervised learning, and the synthetic data were applied for the training. To validate the results, two procedures were used: first a 75:25 holdout, and then a K-fold validation (see Section 5.3). For the holdout method [18] (Section 8.2.2), the mean percent error (PE), the mean absolute percent error (APE), and the squared correlation coefficient (R²) were calculated using 5000 data points of the neural network (MLNN) predictions versus the modeled results described in Section 4. The APE and PE are defined as follows:

APE = (100/N) Σ_{i=1..N} |x_i^NN − x_i^mod| / x_i^mod

and

PE = (100/N) Σ_{i=1..N} (x_i^NN − x_i^mod) / x_i^mod,

where x_i^NN and x_i^mod are the results from the neural network and the radiative transfer simulations (Section 4 model), respectively, for the i-th set of parameters, or i-th case, represented by the input array. The results predicted by the trained neural network vs. the simulated values are plotted in Figure 4 and Figure 5.
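These two metrics can be sketched in a few lines (the function and variable names are ours):

```python
import numpy as np

def percent_errors(x_nn, x_model):
    """Mean percent error (PE) and mean absolute percent error (APE), both in
    percent, of neural-network predictions vs. modeled (simulated) values."""
    x_nn = np.asarray(x_nn, dtype=float)
    x_model = np.asarray(x_model, dtype=float)
    rel = (x_nn - x_model) / x_model
    return 100.0 * rel.mean(), 100.0 * np.abs(rel).mean()

# +1% and -1% errors cancel in the PE but not in the APE
pe, ape = percent_errors([101.0, 99.0], [100.0, 100.0])
print(pe, ape)  # 0.0 1.0
```

The signed PE reveals systematic over- or underestimation (e.g., the slight COD underestimation noted below), while the APE measures overall accuracy.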
In Figure 5, the red color indicates the values from the trained MLNN plotted against the model data. The x-axis represents the scaled TOC amounts used in AccuRT for the simulations. The TOC amount is scaled to the US Standard Atmosphere [10], for which the equivalent TOC depth is 0.00345 m, corresponding to 345 DU. The lower and upper limits of the modeled TOC amounts (220 DU, 440 DU) correspond to 0.6377 and 1.2754, respectively, in terms of scaled ozone amounts. The strong correlation between the data points is evident from Figure 4, Figure 5, the high R² value, and the low error rates. The calculated statistical parameters are provided in Table 3.
5.3. K-Fold Validation
A K-fold cross-validation [19] was also performed on the training data to see how the model performs on “unseen” data. A standard cross-validation technique is based on partitioning a portion of the training data and using it to generate predictions from the neural network; the resulting error estimate provides insight into how the model performs on unseen data (a validation set). This technique is commonly referred to as the holdout method, a simple form of cross-validation. K-fold cross-validation divides the dataset into K subsets, and the holdout method is repeated K times, with each subset used once as the validation set while the remaining subsets are combined to form the training set. The error estimate is averaged over all K trials to determine the overall effectiveness of the model. This approach ensures that each data point is included exactly once in the validation set and K − 1 times in the training set; swapping the roles of the training and validation data across folds further enhances the reliability of the estimate. In this case, K = 5 was adopted, which is a common choice for this kind of validation: in each of the five trials, 16,000 data points were used for training and 4000 for validation. The statistical results for the K-fold cross-validation are provided in Table 4.
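The fold bookkeeping can be sketched with scikit-learn's `KFold`; the 16,000/4000 split of the 20,000 synthetic cases follows directly from five folds (the variable names are ours):

```python
import numpy as np
from sklearn.model_selection import KFold

# 20,000 synthetic cases, as in the training set (contents are placeholders)
X = np.zeros((20000, 1))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(val_idx)) for train_idx, val_idx in kf.split(X)]

# Each of the five trials trains on 16,000 points and validates on 4000
print(fold_sizes[0])  # (16000, 4000)
```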
The R² values for both O3 and the COD at 380 nm are 0.999, demonstrating a high correlation between the predicted and simulated values. This high correlation indicates that the model explains almost all of the variability in the data. The percentage errors for O3 and the COD are low, reflecting the model’s high accuracy. The negative PE values reveal a slight underestimation of the COD by the MLNN in both Table 3 and Table 4. The low APEs for O3 and the COD suggest that the model performs accurately. Overall, the neural network model exhibits good performance for both variables, with high R² values and low errors.
6. NILU-UV and OMI Data Preparation
Once the MLNN had been trained and validated, the NILU-UV raw data were prepared. Unfortunately, some of the measurements were erroneous, as the instrument occasionally logged faulty data: missing data points, unfinished readings, and runs of random characters were the most common faults. These NILU-UV data were removed from the dataset before the following steps were taken.
The NILU-UV instrument registers timestamps in UTC. First, the solar zenith angle is calculated from the location and time. The other required inputs are ratios of the channel readings. As mentioned in Section 3b in [7], the drift factor for each channel is used to compensate for the degradation of the Teflon diffuser of the instrument; the drift data nearest to the measurement time were always used to take instrument drift into account. The imperfect cosine response of the instrument was also accounted for by dividing the measured signal in each channel by the corresponding cosine correction value, as described in Section 3 in [8]. The cosine response functions are provided in Figure A5 in Appendix A. A schematic illustration of the full retrieval methodology is provided in Figure 6.
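A minimal sketch of the per-channel correction step follows. The cosine correction is applied by division, as stated above; applying the drift factor by division as well is an assumption here (the exact drift formula is given in Section 3b of [7]), and all names and values are illustrative:

```python
def corrected_signal(v_raw, drift_factor, cosine_correction):
    """Correct a raw NILU-UV channel reading for instrument drift
    (Teflon-diffuser degradation) and for the imperfect cosine response.

    Assumption: both corrections are divisions; the cosine division follows
    the text, the drift division is our reading of [7]."""
    return v_raw / (drift_factor * cosine_correction)

# Example: no drift, cosine response at 50% of ideal doubles the signal
print(corrected_signal(1.0, 1.0, 0.5))  # 2.0
```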
There were some missing data in the available dataset for the years 2014–2019, mainly in 2016. Approximately the first third of the 2017 data and the last third of the 2019 data were missing. Unfortunately, because of technical difficulties, NILU-UV data collected after 2019 had significant gaps.
Level 3 OMI data were acquired from NASA’s Goddard Earth Sciences Data and Information Services Center (GESDISC) in Hierarchical Data Format release 5 (HDF5) for the years of interest.
8. Conclusions
Our multi-layer neural network (MLNN) method, which retrieves total ozone column (TOC) amounts and cloud optical depth (COD) values at 380 nm simultaneously from NILU-UV measurements, yields TOC amounts in close agreement with OMI retrievals. The percentage error estimates were similar to those presented in [3].
The neural network method described in [3] relies on the tuning of radial basis functions. Since such tuning is not needed in our MLNN method, it is more robust and easier to use. In addition, our MLNN method takes the imperfect cosine response of the instrument into account in the retrieval, as described in Section 6. Overall, the two methodologies (MLNN and LUT) yielded results and error estimates similar to those of the OMI method (see Table 6).
In contrast to the LUT method, our MLNN approach accounts for cloud effects in the retrieval of TOC amounts. Even under heavily overcast conditions, the retrieved TOC amounts were consistent and showed the same seasonal variation of ozone as the OMI measurements. To obtain acceptable TOC results from the NILU-UV measurements using the LUT method, one should require RMF ≥ 30, because if the RMF is less than 30, the cloud is deemed too optically thick for NILU-UV measurements to yield reliable TOC amounts with the LUT method. Our MLNN method provides a significant improvement in this regard, as the MLNN results remain in agreement with the OMI results without requiring the exclusion of low-RMF data.
The results in Figure 8 and Figure 9 show no significant change in the TOC amounts over the study period. The lowest annual average TOC amount, 299.4 DU, was found in 2017, and the highest, 321.6 DU, in 2019.
The error in the COD due to the presence of aerosols was found to be small (see Section 7.2). Beyond the previously discussed sources of error, the large footprint and low time resolution of OMI (one overpass per day) contribute to the discrepancies between the results obtained from the NILU-UV and OMI measurements. For the same reasons, the NILU-UV instrument is well suited for local short- or long-term monitoring of total ozone column (TOC) amounts, aerosol optical depth (AOD) values, and cloud optical depth (COD) values.