1. Introduction
The problem of optimal sensor placement is of great interest in many fields, such as health [1], high-dimensional systems [2], and engineering [3]. In ocean modeling, the necessity of finding the best place to install sensors is driven by the high cost of in situ measuring platforms (e.g., [4,5]) and the need to accurately reconstruct physical fields from a limited number of measurements [6].
In our study, we will use the word “sensor” in the most general sense as a means of obtaining the value of the physical field at a particular point in space. Thus, by the “optimal” placement of sensors, we mean a finite set of spatial coordinates for which the maximum reconstruction accuracy of the physical field is achieved with a minimum number of sensors.
Classical approaches to optimal sensor placement typically assume that the measurements of the physical field at the sensor locations can be related to the physical field in the entire observation area through a linear transformation. The Fisher Information Matrix is commonly used to measure the error between the true and reconstructed fields and can be calculated as the inverse of the covariance matrix of the reconstruction errors. Classical methods for sensor placement can be divided into three main groups based on the loss function used [7,8]. The first group of methods maximizes the determinant of the Fisher Information Matrix or, equivalently, minimizes the determinant of the error covariance matrix [9,10,11]. The second group of methods is based on minimizing the trace of the inverse Fisher Information Matrix [12,13,14]. In particular, if the assumption of diagonal error covariance is made, the loss function is reduced to the well-known Mean Squared Error (MSE) between the observed and reconstructed fields. The third group of classical methods minimizes the maximum eigenvalue of the inverse Fisher Information Matrix or, equivalently, minimizes the spectral norm of the error covariance matrix [12,13].
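The three groups correspond to what the experimental design literature commonly calls D-, A-, and E-optimality. As a minimal sketch, not taken from the cited works, the snippet below evaluates the three criteria for a hypothetical 2×2 Fisher Information Matrix. The helper names and the example matrix are illustrative assumptions.

```python
import math


def inverse_2x2(matrix):
    """Invert a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def symmetric_eigenvalues_2x2(matrix):
    """Closed-form eigenvalues of a symmetric 2x2 matrix (smallest first)."""
    (a, b), (_, d) = matrix
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mean - radius, mean + radius


def design_criteria(fisher):
    """Return the D-, A-, and E-criteria for a 2x2 Fisher Information Matrix."""
    covariance = inverse_2x2(fisher)  # error covariance = inverse FIM
    det_fisher = fisher[0][0] * fisher[1][1] - fisher[0][1] * fisher[1][0]
    return {
        "D": det_fisher,                                  # to be maximized
        "A": covariance[0][0] + covariance[1][1],         # trace, to be minimized
        "E": max(symmetric_eigenvalues_2x2(covariance)),  # to be minimized
    }


# Hypothetical FIM for one candidate sensor configuration.
fisher = [[4.0, 0.0], [0.0, 1.0]]
criteria = design_criteria(fisher)
```

Comparing these numbers across candidate sensor sets is what a combinatorial search would do, which illustrates why the cost grows quickly with the number of candidate locations.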
Singular Value Decomposition (SVD) and QR decomposition are commonly used for finding optimal sensors in linear methods. However, these methods face challenges when applied to high-dimensional data, as they scale cubically and quadratically with the dimensionality of the data for full and truncated SVD, respectively. This can make them computationally intensive, even with the use of massively parallel computing systems. Additionally, these linear methods may produce non-physical results because they do not take physical constraints into account.
Recently, deep learning methods have been increasingly used to analyze climate data at high resolution and improve operational forecasting systems. For example, in [15], it was shown that modern deep learning models with transformer architecture could be comparably accurate to classical numerical models in short-term forecasting of the state of the atmosphere on a grid of 0.25 degrees. However, classical numerical models still outperform deep learning models when it comes to the medium- and long-term forecast of the Earth system.
The introduction of the Gumbel-softmax distribution [16] and the Concrete distribution [17] made it possible to use traditional gradient-based methods to optimize the parameters of discrete probability distributions. Several deep learning models for optimal sensor placement have been proposed. These include the Concrete Autoencoder [18,19], deep probabilistic sampling [20], and feature selection networks [21]. These deep learning methods are more memory efficient and do not require the computation of expensive covariance matrices. However, optimization using the Gumbel-softmax trick can require a long exploration period to find the optimal sensor locations. In [22], Shallow Decoder Networks (SDNs) were introduced with two algorithms for optimizing sensor locations: a linear selection algorithm based on QR decomposition (Q-SDN) and a nonlinear selection algorithm based on neural network pruning (P-SDN). Pruning-based sensor search is similar to the annealing of the temperature parameter in the Concrete Autoencoder of [18], but it also enables an automatic search for the optimal number of sensors. In contrast, in [18], the number of sensors was fixed from the start.
Using a high-resolution World Ocean model exclusively for solving the sensor placement problem through a direct combinatorial search is impractical due to its high computational intensity and long duration of a single experiment. A potential solution is to use a hybrid approach that combines statistical processing of retrospective hydrodynamic modeling data using neural network models to capture nonlinear dependencies, followed by the assessment of the sensor locations using an ocean general circulation model. This approach allows for physically justified optimal sensor coordinates to be obtained within an acceptable timeframe. Furthermore, it scales linearly with the size of the computational domain and does not require computing the covariance matrix of the numerical solution. This makes it well-suited for applications such as global ocean circulation models with a resolution of 0.25 or 0.1 degrees, where the covariance matrix as a vector can have dimensions of  to .
The aim of this research is to explore the feasibility of using modern neural network methods to determine the optimal locations for temperature sensors in the World Ocean at 0.25-degree resolution. We propose a Concrete Autoencoder architecture and training procedure that improve the methodology proposed in [
23], where it was tested for a relatively small grid of size 
. The method uses an adversarial loss and a differentiable feature selection by using straight-through gradient estimation. Our approach is applicable to high-resolution grids and utilizes entropy initialization to accelerate the convergence of the sensor sparsification algorithm. It includes a reconstruction operator, which uses the measurements to try to reconstruct the physical field at the full computational grid. The sensor placement and the reconstruction operator are jointly optimized using a binary mask as a parameter.
This paper is organized into five sections. The introduction is provided in Section 1. In Section 2, we describe a method for approximating the information entropy field and present a Concrete Autoencoder architecture for optimizing sensor placement. We also provide a method for analyzing the sensitivity of the reconstruction to sensor perturbations using the autoencoder and approximate the region of influence of an observation using mutual information. In Section 3, we describe the training and test datasets, baselines, and evaluation metrics. In Section 4, we present numerical results for a high-resolution grid and discuss the geophysical interpretation of the obtained information entropy field and sensitivity. Finally, we provide our main conclusions in Section 5.
  2. Materials and Methods
The developed methods for optimizing sensor placement and determining the area of influence of individual sensors can be divided into four components, as illustrated in Figure 1. The orange blocks represent the information entropy approximation (Section 2.1). The blue blocks refer to the optimization of sensor placement using the Concrete Autoencoder (Section 2.2). The yellow block indicates the sensitivity analysis method, which utilizes a trained Concrete Autoencoder (Section 2.3). Lastly, the purple block describes the methods of assessing the sensitivity of a sensor through mutual information (Section 2.4). The input data for these procedures, indicated by the green blocks, are specified further in Section 3.1.
  2.1. Approximation of the Information Entropy Field
To apply optimization algorithms efficiently in a high-dimensional space of sensor locations, it is necessary to introduce an informative prior distribution. One approach is to place sensors at points where the physical field has high variability. This can be achieved by defining a non-negative scalar field 
 that reflects the historical variability of the physical field and using it to define the prior distribution as:
        where 
 is a temperature parameter that controls the concentration of sensors near the maximum of the variability field 
. At the first stage of our method, we estimate the variability of the physical field as a function of spatial coordinates 
r by approximating its information entropy. To handle non-normally distributed variables, we use deep neural networks to approximate the probability density. This approach can be applied to spatially correlated non-Gaussian random variables by using the Conditional Pixel Convolutional Neural Network (PixelCNN)  [
24,
25] to approximate the joint probability density of patches taken from a dataset of historical values of physical fields.
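The role of the temperature parameter can be illustrated with a small sampling sketch. It assumes a softmax (Boltzmann) form for the prior, with probability proportional to exp(V / τ) for a variability value V; this form matches the description above, but it is an assumption here, as are the function names and the toy field.

```python
import math
import random


def softmax_prior(variability, temperature):
    """Turn a non-negative variability field into sampling probabilities.

    Assumed form: p(r) proportional to exp(V(r) / temperature).
    """
    peak = max(variability)  # subtract the peak for numerical stability
    weights = [math.exp((v - peak) / temperature) for v in variability]
    total = sum(weights)
    return [w / total for w in weights]


def sample_sensor_mask(prior, n_draws, rng):
    """Draw cell indices from the prior and mark them in a binary mask."""
    cells = rng.choices(range(len(prior)), weights=prior, k=n_draws)
    mask = [0] * len(prior)
    for cell in cells:
        mask[cell] = 1  # repeated draws of one cell collapse to a single sensor
    return mask


variability = [0.1, 0.5, 2.0, 1.0, 0.2]  # hypothetical flattened field
sharp = softmax_prior(variability, temperature=0.2)  # concentrated near the maximum
flat = softmax_prior(variability, temperature=5.0)   # close to uniform
mask = sample_sensor_mask(sharp, n_draws=3, rng=random.Random(0))
```

Under this assumed form, a low temperature places almost all sensors at the most variable cell, while a high temperature spreads them nearly uniformly.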
Suppose that we have a set of 
N patches 
 with sizes 
 cropped from the 
K-dimensional domain of interest 
, each of which is labeled with spatial coordinates of the patch center 
Our goal is to approximate the probability density of physical fields 
 as a function of spatial coordinates 
r. A single patch can be represented as a vector of dimension 
, and the probability density can be expanded as the product of probabilities of individual pixels, each of which is conditioned on all previous pixels: 
        where 
 stands for the 
i-th pixel of image 
 with respect to the chosen ordering.
Conditional PixelCNN is designed to generate images by predicting the values of individual pixels in the image, given the values of previous pixels. This is achieved by using a convolutional neural network (CNN) with a special architecture known as a PixelCNN, which is able to capture the dependencies between pixels in an image and generate samples that are highly realistic and detailed. PixelCNN uses masked convolutions to ensure that a generated pixel depends only on the previous ones. The use of conditional information, such as class labels or other features, allows the model to generate images that are conditioned to specific characteristics, such as the presence of specific objects or styles. This makes it a powerful tool for tasks such as image generation, inpainting, and density estimation. In this study, we employ a Conditional PixelCNN configuration with 15 convolutional layers, each with 64 hidden channels, resulting in a network with 6.0 million parameters.
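To make the masking idea concrete, the sketch below builds the binary mask applied to a square convolution kernel so that an output pixel only sees pixels that come earlier in the ordering. It assumes raster-scan ordering (left to right, top to bottom) and the plain masked-convolution formulation; the function name is illustrative, and the network used in this study may organize the masking differently.

```python
def causal_kernel_mask(kernel_size, include_center):
    """Binary mask for a square kernel under raster-scan pixel ordering.

    include_center=False is used in the first layer, so a pixel never sees
    its own value; include_center=True is used in the later layers.
    """
    center = kernel_size // 2
    mask = [[0] * kernel_size for _ in range(kernel_size)]
    for row in range(kernel_size):
        for col in range(kernel_size):
            comes_before = row < center or (row == center and col < center)
            is_center = row == center and col == center
            if comes_before or (include_center and is_center):
                mask[row][col] = 1
    return mask


first_layer_mask = causal_kernel_mask(3, include_center=False)
later_layer_mask = causal_kernel_mask(3, include_center=True)
```

Multiplying the kernel weights by such a mask before each convolution is what keeps the factorized density autoregressive.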
To compute the information entropy of a system, we first need to approximate the probability density function (PDF) that describes it (see Algorithm 1). Once we have approximated the PDF, we can compute the information entropy using the following formula:
$$I(r) = -\int p(s \mid r) \log p(s \mid r) \, ds,$$
        where $I$ is the information entropy, $p(s \mid r)$ is the conditional PDF of the system, and $s$ is a variable that represents the possible states or outcomes of the system.
To reduce the dependence of the information entropy field on the random weight initialization of the Conditional PixelCNN, we train an ensemble of several models, 
. Final information entropy is obtained by averaging over the ensemble
        
        where 
 represents the information entropy of the individual ensemble members and is obtained by the following Monte Carlo approximation
        
| Algorithm 1 Information Entropy Approximation. | 
| Input:    —train dataset of patches,  denotes a spatial grid cell with coordinates  and  denotes a time moment    —train dataset with patches centers coordinates    —randomly initialized Conditional PixelCNN network, which predicts probability of a patch conditioned on its center’s coordinates    —number of training stepsfor 
                       
                      to 
                      
                       do   choose  randomly from    choose  randomly from       end for    Approximate information entropy field   given a trained Conditional PixelCNN network   with parameters  :    
                    To get the final information entropy approximation we take an average over an ensemble of independently trained networks  , where M  is an ensemble size
                    Output:    —final information entropy field
 | 
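The Monte Carlo step and the ensemble average can be sketched as follows. The entropy of one ensemble member is estimated as the negative mean log-density over samples, and the final value is the mean over members. A Gaussian density stands in for the trained Conditional PixelCNN, and the member parameters are hypothetical; both are assumptions made so that the estimate can be checked against a closed-form entropy.

```python
import math
import random


def gaussian_log_pdf(x, sigma):
    """Log-density of a zero-mean Gaussian; stand-in for a trained density model."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - x * x / (2.0 * sigma ** 2)


def monte_carlo_entropy(samples, log_pdf):
    """Estimate entropy as the negative average log-density over the samples."""
    return -sum(log_pdf(x) for x in samples) / len(samples)


rng = random.Random(0)
true_sigma = 2.0
samples = [rng.gauss(0.0, true_sigma) for _ in range(20000)]

# Hypothetical ensemble: each member learned a slightly different density.
member_sigmas = [1.9, 2.0, 2.1]
member_entropies = [
    monte_carlo_entropy(samples, lambda x, s=s: gaussian_log_pdf(x, s))
    for s in member_sigmas
]
ensemble_entropy = sum(member_entropies) / len(member_entropies)

# Closed-form entropy of the sampling distribution, for comparison.
exact_entropy = 0.5 * math.log(2.0 * math.pi * math.e * true_sigma ** 2)
```

In the setting described above, the samples would be patches centered at a given location and the log-density would come from the network conditioned on that location, so one such estimate is produced per grid cell.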
  2.2. Concrete Autoencoder for Sensor Placement Optimization
In the second stage, the entropy of the physical field is used to initialize the distribution of sensor locations, so that the density of the initial locations is proportional to the information entropy. This configuration is further optimized by the Concrete Autoencoder architecture to simultaneously minimize the number of sensors and maximize the reconstruction accuracy. At the initial warm-up stage of the optimization, the maximum possible reconstruction accuracy is achieved. As the optimization continues, the number of sensors gradually decreases because the weight of the loss term associated with the number of sensors increases. This process allows for a substantial reduction in the number of sensors without a significant decrease in the accuracy of the physical field reconstruction.
Our Concrete Autoencoder architecture consists of a trainable binary mask and a reconstructing image-to-image neural network with a U-Net architecture [26] and bilinear upsampling. The U-Net configuration has four down-sampling and four up-sampling blocks. Each down-sampling block reduces the resolution of the feature tensor by a factor of two and doubles the number of feature channels, and each up-sampling block reverses this change. This network has a total of 31.0 million parameters. The detailed network architecture is presented in Figure 2.
The U-Net takes, as input, a physical field multiplied by the binary mask and predicts the reconstructed field in the full computational domain. Our binary mask representing sensor locations is parametrized by the scalar field of parameter 
w via a step function
        
To determine the optimal number of sensors, we optimize the parameters of the binary mask using a straight-through gradient estimator proposed in [
27] with gradient clipping. In the backward pass, we replace the ill-defined gradients of the step function with the gradients of the identity function. The straight-through gradient estimator allows us to use a single matrix 
w of parameters for different numbers of sensors.
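The forward and backward behavior of the mask can be sketched without a deep learning framework. The forward pass thresholds the parameters, and the backward pass treats the step as the identity and clips the gradient. The threshold of zero, the clipping value, and the learning rate are assumptions for illustration.

```python
def step_mask(weights, threshold=0.0):
    """Forward pass: a sensor is present where the parameter exceeds the threshold."""
    return [1.0 if w > threshold else 0.0 for w in weights]


def straight_through_grad(grad_wrt_mask, clip_value=1.0):
    """Backward pass: pass the mask gradient to the parameters as if the step
    were the identity function, then clip it to a fixed range."""
    return [max(-clip_value, min(clip_value, g)) for g in grad_wrt_mask]


def sgd_step(weights, grad_wrt_mask, learning_rate=0.1):
    """One plain gradient-descent update of the mask parameters."""
    grads = straight_through_grad(grad_wrt_mask)
    return [w - learning_rate * g for w, g in zip(weights, grads)]


weights = [0.3, -0.05, 0.05]   # hypothetical mask parameters for three cells
grad_wrt_mask = [0.5, -3.0, 1.0]  # hypothetical loss gradients w.r.t. the mask

mask_before = step_mask(weights)
new_weights = sgd_step(weights, grad_wrt_mask)
mask_after = step_mask(new_weights)
```

In this toy update the second cell gains a sensor and the third loses one, even though the true derivative of the step function is zero almost everywhere, which is why a single parameter field can represent masks with different numbers of sensors.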
The loss function of the Concrete Autoencoder has three main terms
        
        where 
 dynamically changes during training. In the first warm-up stage, 
, this is performed to initially achieve good reconstruction quality. Next, at the sparsification stage, 
 is increased by 
 at every epoch. This allows the Concrete Autoencoder to minimize the number of sensors without a significant increase in the reconstruction error. A similar procedure was proposed in the original version of the Concrete Autoencoder [
18] where the annealing procedure was applied to the temperature parameter of the Gumbel-softmax distribution.
Consider a random batch 
 sampled from 
. Following the Pix2Pix framework [
28], the loss function 
 requires that the Concrete Autoencoder produces physical fields that are indistinguishable by the PatchGAN discriminator from the real fields
        
In addition to the requirement for the realism of the generated physical fields, we require an element-by-element correspondence of the reconstructed physical fields with the ground truth ones. The 
-norm is used to measure pixel-wise error
        
The sparsification of sensors is achieved by adding the average value of the binary mask to the loss function
        
We require the discriminator to distinguish the physical fields produced by the Concrete Autoencoder from the real ones, thus defining its loss as
        
The full training procedure for the Concrete Autoencoder is outlined in Algorithm 2.
        
| Algorithm 2 Concrete Autoencoder Training. | 
| Input:    —train dataset of historical values of a physical field,  denotes a time moment    —randomly initialized Concrete Autoencoder network with parameters , which reconstructs physical field from sparse measurements; parameters of the binary mask are also included in     —randomly initialized PatchGAN Discriminator with parameters     mask—binary mask defining initial sensor locations sampled proportionally to the information entropy field    —number of training steps on the warm-up stage    —number of training steps on the sensor sparsification stage    —matrix filled with ones    —matrix filled with zeros    —weight of the mask term in CA loss    —weight increase of  on the sensor sparsification stagefor  to  do   choose  randomly from    Update the Concrete Autoencoder weights:                  Update the discriminator weights:         if  then     Increase the sparsification weight:        end ifend forOutput:    —Concrete Autoencoder network with optimal parameters     —binary mask with optimal sensor placement
 | 
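As a rough sketch of how the three loss terms could be combined, the snippet below sums an adversarial term, a pixel-wise absolute error, and the mean of the binary mask scaled by the sparsification weight. The non-saturating form of the adversarial term, the unit weights on the first two terms, and all function names are assumptions; the exact expressions used in this study are not reproduced here.

```python
import math


def l1_loss(reconstruction, target):
    """Mean absolute pixel-wise error."""
    return sum(abs(r - t) for r, t in zip(reconstruction, target)) / len(target)


def mask_loss(mask):
    """Average value of the binary mask, i.e., the fraction of cells with a sensor."""
    return sum(mask) / len(mask)


def generator_adversarial_loss(disc_probs_on_fake):
    """Assumed non-saturating GAN loss: push discriminator outputs on
    reconstructed fields toward 1."""
    return -sum(math.log(p) for p in disc_probs_on_fake) / len(disc_probs_on_fake)


def discriminator_loss(probs_on_real, probs_on_fake):
    """Binary cross-entropy: real fields labeled 1, reconstructions labeled 0."""
    real_term = -sum(math.log(p) for p in probs_on_real) / len(probs_on_real)
    fake_term = -sum(math.log(1.0 - p) for p in probs_on_fake) / len(probs_on_fake)
    return real_term + fake_term


def autoencoder_loss(reconstruction, target, mask, disc_probs_on_fake, sparsity_weight):
    """Sum of the three terms; only the mask term carries a changing weight."""
    return (
        generator_adversarial_loss(disc_probs_on_fake)
        + l1_loss(reconstruction, target)
        + sparsity_weight * mask_loss(mask)
    )


total = autoencoder_loss(
    reconstruction=[1.0, 2.0],
    target=[1.5, 2.0],
    mask=[1, 0, 0, 1],
    disc_probs_on_fake=[0.5, 0.5],
    sparsity_weight=4.0,
)
```

With these toy inputs the mask term dominates the total, which mirrors how a growing sparsification weight eventually forces sensors to be removed.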
  2.3. Sensitivity Analysis for Sensor Measurements
After the Concrete Autoencoder is trained, we can investigate the area of influence of a sensor by perturbing the sensor measurements and examining the resulting changes in the reconstructed field. This can help us understand how changes in a sensor’s measurements affect the overall accuracy of the reconstructed physical field.
The sensitivity field can be computed by perturbing a sensor measurement with Gaussian noise and recording the resulting reconstructed fields. Since a single forward pass of the Concrete Autoencoder network takes less than a second, an ensemble of the perturbed reconstructed fields can be constructed in a few minutes. The sensitivity field can then be derived as a field of standard deviations, as outlined in Algorithm 3.
        
| Algorithm 3 Sensitivity field computation from a trained Concrete Autoencoder. | 
| Input:    —test dataset of historical values of a physical field     denotes a time moment    n denotes spatial grid cell    —a trained Concrete Autoencoder    mask—binary mask defining sensor locations    —binary mask that is equal to 1 in the sensor with coordinate p (chosen for sensitivity analysis) and 0 otherwise    —list of the perturbed reconstructed fields, initially emptyfor  to T do   —perturbation of sensor measurements   Compute the perturbed reconstructed field:      Append  to the list end forCompute sensitivity by taking the standard deviation along the time axis:Output:    —sensitivity field for one sensor location p
 | 
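The perturbation procedure can be sketched on a one-dimensional toy grid. A linear interpolation between two sensors stands in for the trained Concrete Autoencoder, and a single field is perturbed repeatedly rather than iterating over the test time steps; both are simplifying assumptions, as are the function names and the noise level.

```python
import random
import statistics


def interpolate_between_two_sensors(measurements):
    """Hypothetical stand-in for the trained network: linear interpolation
    between sensors located at the first and last cell of a 1-D grid."""
    last = len(measurements) - 1
    left, right = measurements[0], measurements[last]
    return [left * (1.0 - c / last) + right * (c / last) for c in range(last + 1)]


def sensitivity_field(reconstruct, field, mask, sensor_index, noise_std, n_members, rng):
    """Perturb one sensor with Gaussian noise and return the per-cell
    standard deviation of the resulting reconstructions."""
    ensemble = []
    for _ in range(n_members):
        measurements = [value * m for value, m in zip(field, mask)]
        measurements[sensor_index] += rng.gauss(0.0, noise_std)
        ensemble.append(reconstruct(measurements))
    return [
        statistics.pstdev([member[c] for member in ensemble])
        for c in range(len(field))
    ]


field = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical temperatures on five cells
mask = [1, 0, 0, 0, 1]                  # sensors at both ends of the grid
sensitivity = sensitivity_field(
    interpolate_between_two_sensors, field, mask,
    sensor_index=0, noise_std=0.5, n_members=200, rng=random.Random(0),
)
```

For this stand-in, the influence of the perturbed sensor decays linearly with distance and vanishes at the other sensor; a trained network would instead produce a data-driven area of influence.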
  2.4. Mutual Information Field
As an independent method for evaluating the sensitivity of the reconstruction to measurements and the influence of measurements at a specific point on the entire computational domain, we also calculated the field of mutual information [29,30]. The calculation was performed pairwise between a specific measurement coordinate and all other points. A detailed description of the procedure is provided in Algorithm 4.
        
| Algorithm 4 Mutual Information Approximation. | 
| Input:    —train dataset of temperature time series     is a physical field realization at time t and point r    Normalization:    —train dataset of normalized temperature time series, where the normalization for time step t  is given by
					     —multiyear monthly averaged values for each r point.Converting normalized values to discrete bins Mutual Information:    For point p  calculating mutual information field  :
                                        —is the joint probability mass function of X and Y.Output:    —mutual information field for point p and all other points r.
 | 
This algorithm estimates the mutual information for a given dataset of physical field realizations. The input is a training dataset containing field values at each point r at a series of time moments. The dataset is first normalized by subtracting the multiyear monthly average value for each point r and dividing by the standard deviation. The normalized values are then converted to discrete bins.
Next, the mutual information field is calculated for each point p using the joint probability mass function of the discrete bin values for that point and all other points r. The mutual information is computed as the sum of the joint probabilities multiplied by the log of the ratio of the joint probability to the product of the marginal probabilities, that is, $\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$, where $p(x,y)$ is the joint probability mass function of the binned values X and Y.
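The binning and the mutual information sum can be sketched for a pair of time series as follows. Equal-width bins and the function names are assumptions; the series are taken to be already normalized.

```python
import math
from collections import Counter


def to_bins(values, n_bins):
    """Convert values to integer bin indices using equal-width bins (an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    marginal_x = Counter(xs)
    marginal_y = Counter(ys)
    total = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        total += p_xy * math.log(p_xy / ((marginal_x[x] / n) * (marginal_y[y] / n)))
    return total


reference = to_bins([0.0, 1.0, 2.0, 3.0], n_bins=2)   # series at the chosen point p
identical = mutual_information(reference, reference)    # fully dependent series
unrelated = mutual_information(reference, [0, 1, 0, 1])  # independent series
```

Repeating the pairwise computation between the chosen point and every other grid cell yields the mutual information field for that point.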
  3. Numerical Experiments
The training dataset for this study contains daily temperature fields for 9, 135, and 1250 m depth levels taken from the INMIO global ocean circulation model hindcast. The data cover a period from 2 January 2004 to 31 December 2020, and the calculation of the training dataset for 17 model years took approximately 340 h using 239 processor cores, including 176 cores for the ocean, 60 cores for ice, and 3 cores for the coupler, atmospheric forcing, and river runoff. The number of processors was chosen to ensure that the time for calculating the integration steps for the ice and ocean models was approximately equal, leading to more efficient use of computing resources when scaling the model to larger grids. This issue is discussed in further detail in [31].
The test set for this study was chosen to be the INMIO reanalysis from 1 January 2020 to 31 December 2020. During the reanalysis experiment, ocean and ice state observations, including ARGO temperature and salinity profiles [4,32], AVISO sea surface height [33], and EUMETSAT/OSISAF ice concentration [34], in the northern and southern hemispheres were assimilated using the Ensemble Optimal Interpolation (EnOI) scheme with an ensemble of 209 elements. The calculation of one model year for the test dataset took approximately 50 h using 208 cores, including 120 cores for the ocean, 72 cores for ice, 10 cores for EnOI, and 16 cores for the coupler, atmospheric forcing, and river runoff.
We trained the Concrete Autoencoder with Adam optimizer and computed the daily climate based on the training set. After training, we computed the reconstruction error on the test set and compared all methods using bias and RMSE metrics. The neural networks were trained using the compute nodes with Nvidia Tesla V100 GPUs. The information entropy field was computed by averaging across the ensemble of Conditional PixelCNN networks. For every depth level out of the three considered, an ensemble of three Conditional PixelCNN networks was trained, which for a single ensemble element took 60 h on a single V100 for patch size . The training of a single Concrete Autoencoder takes about 160 h on a single V100 for a computational grid of size . Therefore, in total, neural network training took about 340 GPU-hours per field for every level.
Theoretically, the computational time of our method scales linearly with the number of computational cells in the ocean model. However, this depends on the amount of data used for training and the available computational resources. At present, for global operational ocean forecast systems, the maximum resolution is 
 of a degree [
35], which is limited by the number of CPU cores and RAM in the computing cluster being utilized. In our study concerning data with 1/4 degree resolution (which is about 28 km at the equator), we utilized approximately 300 CPU cores for generating the data and 8 GPUs with 32 GB of GPU memory for neural network training. With a larger compute cluster featuring 1000–2000 CPU cores and 8–10 GPUs with 80 GB of GPU memory, it is possible to extend the method to a resolution of 1/10 degree or 11 km.
  3.1. Datasets Description
INMIO Ocean general circulation model. The system of equations of three-dimensional ocean dynamics and thermodynamics in the Boussinesq and hydrostatic approximations is solved by the finite volume method [
36] on the type B grid [
37,
38,
39]. The ocean model INMIO [
40] and the sea ice model CICE [
41] operate on the same global tripolar grid with a nominal resolution of 0.25°. The vertical axis of the ocean model uses z-coordinates on 49 levels with a spacing from 6 m in the upper layer to 250 m at depth. The barotropic dynamics are described with the help of a two-dimensional system of shallow water equations by the scheme [
42]. The horizontal turbulent mixing of heat and salt is parameterized with a background (time-independent) diffusion coefficient equal to the nominal value at the equator and scaled toward the poles proportionally to the square root of the grid cell area. To ensure numerical stability in the equations of momentum transfer, the biharmonic filter is applied with a background coefficient scaled proportionally to the cell area to the power 3/2 and with the local addition by Smagorinsky scheme in formulation [
43] for maintaining sharp fronts. Vertical mixing is parameterized by the Munk–Anderson scheme [
44], with convective adjustment performed in the case of an unstable vertical density profile. At the ocean–atmosphere interface, the nonlinear kinematic free-surface condition is imposed with heat, water, and momentum fluxes calculated by the CORE bulk formulae [45]. Except for vertical turbulent mixing, all the processes are described using time-explicit numerical methods, which allow simple and effective parallel scaling. The time steps of the main cycle for solving model equations are equal for the ocean and the ice. The ocean model, within the restrictions of its resolution, implements the eddy-permitting mode by not using the Laplacian viscosity in the momentum equations.
 The sea ice model CICE v. 5.1. For this experiment, an elastic-viscous-plastic rheology model was applied to parameterize the ice dynamics, and zero-layer approximation was used for thermodynamics calculations. To explicitly resolve elastic waves, a subcycle with small time steps was set. The simulation mode includes the processing of five categories of ice thickness and one category of snow thickness using an upwind transport scheme and a description of melt ponds.
ERA5 atmospheric forcing. We used the ERA5 reanalysis [
46] for the period 2004–2020 as the external forcing to determine the water and momentum fluxes on the ocean–atmosphere and ice–atmosphere interfaces. Wind speed at 10 m above sea level and temperature and dew point temperature at 2 m were transmitted to the ice–ocean system every 3 h. In addition, the accumulated fluxes of incident solar and long-wave radiation and precipitation (snow and rain) were also read with the same period.
 EnOI. A detailed description of the EnOI method is presented in the work [
47]. For our calculation, we used the original parallel realization of the EnOI method for data assimilation in the INMIO model [
48].
 The basic equations of the EnOI method are as follows [49]:
$$x^a = x^b + K \left( y - H x^b \right),$$
        where $K$ is defined as
$$K = B H^{T} \left( H B H^{T} + R \right)^{-1}.$$
Here, $x^b$ (background) and $x^a$ (analysis) are vectors of size $n$ representing the model solution before and after data assimilation, respectively; $n$ is the number of model grid points weighted by the number of model variables to be corrected (temperature, salinity, sea level, etc.); $y$ is the vector of observations of size $m$; $m$ is the total number of observation points where various data were obtained; $K$ is the gain matrix; $R$ is the covariance matrix of observation errors, and it is assumed that the matrix $R$ is the identity matrix multiplied by a scalar parameter $r$; $H$ is the matrix representing the projection operator of model values into the observational data space; $B$ is the covariance matrix of model errors.
The EnOI method belongs to a group of assimilation methods that rely on some approximation of matrix $B$ based on an ensemble of model solution vectors. This approximation allows for the estimation of the covariance matrix of model errors. In practice, the ensemble is used to approximate the matrix $H B H^{T}$ of size $m \times m$. The inverse $\left( H B H^{T} + R \right)^{-1}$ is then computed using SVD, which can be a limiting factor in the assimilation of a large number of observations.
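A minimal sketch of an ensemble-based analysis update is given below for the special case of a single observation, where the matrix to be inverted reduces to a scalar. It uses the standard optimal interpolation update (analysis = background + gain × innovation) with the model error covariance estimated from ensemble anomalies. The toy ensemble, the function names, and the single-observation simplification are assumptions; this is not the parallel implementation used for the reanalysis.

```python
def ensemble_covariance(members):
    """Sample covariance of the state estimated from ensemble anomalies."""
    n_members = len(members)
    n_state = len(members[0])
    mean = [sum(m[i] for m in members) / n_members for i in range(n_state)]
    anomalies = [[m[i] - mean[i] for i in range(n_state)] for m in members]
    return [
        [sum(a[i] * a[j] for a in anomalies) / (n_members - 1) for j in range(n_state)]
        for i in range(n_state)
    ]


def enoi_update_single_obs(background, covariance, obs_index, obs_value, obs_error_var):
    """Analysis update when one state component is observed directly.

    The observation operator selects component obs_index, so the projected
    covariance plus observation error is the scalar covariance[k][k] + r.
    """
    innovation = obs_value - background[obs_index]
    denominator = covariance[obs_index][obs_index] + obs_error_var
    gain = [covariance[i][obs_index] / denominator for i in range(len(background))]
    return [xb + k * innovation for xb, k in zip(background, gain)]


# Hypothetical three-member ensemble of a two-component state.
members = [[9.0, 19.0], [11.0, 21.5], [10.0, 19.5]]
covariance = ensemble_covariance(members)
analysis = enoi_update_single_obs(
    background=[10.0, 20.0], covariance=covariance,
    obs_index=0, obs_value=12.0, obs_error_var=1.0,
)
```

The unobserved second component is corrected through its ensemble covariance with the observed one. With many observations the scalar division becomes an m × m inversion, which is the cost that the text identifies as the limiting factor.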
  3.2. Baseline
Climate. The simplest baseline in ocean modeling, reconstruction, and forecasting is climate interpolated in time to the correct date. We calculated our climate values on the training set for every day of a year, according to the formula
$$T^{\mathrm{clim}}_{i,j}(d) = \frac{1}{N_d} \sum_{y} T_{i,j}(d, y),$$
        where $T_{i,j}(d, y)$ is the value of a physical field with coordinates $(i, j)$ at day number $d$ in year $y$ from the training set, and $N_d$ is the number of years with day $d$ in the training set.
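For a single grid cell, the daily climate can be sketched as an average over all years that contain a given day number. The record layout and the function name are assumptions.

```python
from collections import defaultdict


def daily_climate(records):
    """Average a field value over all years for each day-of-year.

    records: iterable of (year, day_of_year, value) tuples for one grid cell.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _year, day, value in records:
        sums[day] += value
        counts[day] += 1  # days present only in leap years get a smaller count
    return {day: sums[day] / counts[day] for day in sums}


# Hypothetical values for one cell; day 366 occurs only in the leap year 2004.
records = [(2004, 1, 10.0), (2005, 1, 12.0), (2006, 1, 11.0), (2004, 366, 8.0)]
climate = daily_climate(records)
```

Dividing by the per-day count rather than the total number of years is what handles day numbers that appear only in leap years.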
   3.3. Evaluation Metrics
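In this study, we use evaluation metrics based on the GODAE OceanView Class 4 forecast verification framework [50]. The bias metric measures the correspondence between the mean forecast and the mean observation. To calculate the spatial and temporal distributions of the bias, we average over time at each spatial location $(i, j)$ using Equation (7) and average over the spatial coordinates at each time point in the test set using Equation (8), respectively.
$$\mathrm{bias}_{i,j} = \frac{1}{N_t} \sum_{t=1}^{N_t} \left( \hat{T}_{i,j,t} - T_{i,j,t} \right), \quad (7)$$
        where $\hat{T}_{i,j,t}$ is the reconstructed value of a physical field at a point with coordinates $(i, j)$ and at time moment $t$, $T_{i,j,t}$ is the original reanalysis value of the physical field at the same point, and $N_t$ is the number of time moments in the test set.
$$\mathrm{bias}_{t} = \frac{1}{N} \sum_{i,j} \left( \hat{T}_{i,j,t} - T_{i,j,t} \right), \quad (8)$$
        where $N$ is the total number of computational cells for the field considered.
The second metric used is the Root Mean Square Error (RMSE). It was calculated for each grid point $(i, j)$ by averaging along the time dimension using Equation (9) and for each time moment in the test set by averaging along the spatial dimensions using Equation (10):
$$\mathrm{RMSE}_{i,j} = \sqrt{ \frac{1}{N_t} \sum_{t=1}^{N_t} \left( \hat{T}_{i,j,t} - T_{i,j,t} \right)^2 }, \quad (9)$$
$$\mathrm{RMSE}_{t} = \sqrt{ \frac{1}{N} \sum_{i,j} \left( \hat{T}_{i,j,t} - T_{i,j,t} \right)^2 }. \quad (10)$$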
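The two averaging directions can be sketched on a small example with fields stored as nested lists indexed by time and then by flattened cell index. The sign convention (reconstruction minus reference), the data layout, and the function names are assumptions.

```python
import math


def bias_per_cell(reconstructed, reference):
    """Time-averaged difference at each cell."""
    n_times = len(reference)
    n_cells = len(reference[0])
    return [
        sum(reconstructed[t][c] - reference[t][c] for t in range(n_times)) / n_times
        for c in range(n_cells)
    ]


def bias_per_time(reconstructed, reference):
    """Space-averaged difference at each time moment."""
    return [
        sum(r - o for r, o in zip(rec_t, ref_t)) / len(ref_t)
        for rec_t, ref_t in zip(reconstructed, reference)
    ]


def rmse_per_cell(reconstructed, reference):
    """Root mean square error at each cell, averaged over time."""
    n_times = len(reference)
    n_cells = len(reference[0])
    return [
        math.sqrt(
            sum((reconstructed[t][c] - reference[t][c]) ** 2 for t in range(n_times))
            / n_times
        )
        for c in range(n_cells)
    ]


def rmse_per_time(reconstructed, reference):
    """Root mean square error at each time moment, averaged over cells."""
    return [
        math.sqrt(sum((r - o) ** 2 for r, o in zip(rec_t, ref_t)) / len(ref_t))
        for rec_t, ref_t in zip(reconstructed, reference)
    ]


reference = [[1.0, 2.0], [3.0, 4.0]]      # two time moments, two cells
reconstructed = [[2.0, 2.0], [4.0, 2.0]]  # hypothetical reconstruction
```

In this toy case the per-cell biases have opposite signs and cancel in a global mean, while the RMSE does not cancel, which is why both metrics are reported.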
        
  4. Results
In this section, we demonstrate the developed methods. In 
Section 4.1, information entropy fields are presented. 
Section 4.2 discusses the balance between the number of sensors and the reconstruction accuracy during the training of the Concrete Autoencoder. 
Section 4.3 presents the reconstructed fields by the Concrete Autoencoder, climate, and baseline methods, which utilize the Nearest Neighbor interpolation. In 
Section 4.4, the reconstruction accuracy of the models is investigated in terms of RMSE and bias on the test dataset and compared to the baselines. The sensitivity of the reconstructed fields to individual measurements is studied in 
Section 4.5 and compared to the mutual information fields.
In the course of data-driven optimal sensor placement studies [
3,
8,
9,
51], it is customary to analyze global sea surface temperature (SST) data as a validation case. Therefore, we consider the optimal sensor locations and reconstruction of the temperature field from the top ocean layer at 9 m depth. However, the information entropy field is shown at a depth of 135 m because at the stage of analysis, it became clear that this level is of greater interest and clearly corresponds to the local hydrophysical features. The field of information entropy at a depth of 9 m can be viewed in 
Appendix A.
  4.1. Information Entropy
Information entropy was calculated for layers at depths of 9 (
Figure A1), 135 (
Figure 3), and 1250 m (
Figure A2). The layer at a depth of 135 m (
Figure 3) is of particular interest from a hydro-physical perspective, as it is at a depth where the atmosphere has little direct influence, and ocean dynamics are active. This makes it a suitable depth for comparing the information entropy field to the behavior of real ocean currents.
In Figure 3, low information entropy can be observed in regions where the ocean temperature does not change significantly throughout most of the year. This indicates that we had a greater degree of a priori information about the subsurface temperature in these regions. We found that the information entropy (IE) reaches its maximum values in regions where warm and cold waters mix, and the boundary between them is likely to fluctuate, such as to the northwest of Spitsbergen, to the east of Argentina, and in the Arabian Sea and the Sea of Japan. These areas are characterized by the mixing of cold and warm currents, which can lead to significant uncertainty in ocean conditions.
To the northwest of Spitsbergen, the boundary between the warm West Spitsbergen and cold East Greenland currents results in a high level of uncertainty, likely due to the “atlantification” of the Arctic Ocean, which refers to the strengthening of the influence of waters of Atlantic origin on the hydrophysical regime of the Arctic Ocean (see [
52]). To the east of Argentina, the confluence of the cold Falkland Current and the warm Brazil Current also exhibits high IE values. The high levels of information entropy observed in the Arabian Sea may be attributed to two factors. First, the direction of ocean currents in the upper 200 m changes seasonally due to the prevailing easterly trade winds in the Northern Hemisphere winter and the strong southwest monsoon in the summer. Second, the upper 700 m of the tropical Indian Ocean has warmed rapidly over the past 70 years [
53]. These factors likely contribute to the high levels of uncertainty in this region. In the Sea of Japan, the intra-annual variation of two currents, the Tsushima Warm Current [
54] and the Liman Current, which carries colder water from the Sea of Okhotsk, is likely the cause of the high IE observed.
  4.2. Training Dynamics of the Concrete Autoencoder at the Sensor Sparsification Stage
After computing the information entropy field 
, we use it to initialize the binary mask with approximately 60,000 sensors, which are independently sampled from the prior distribution
        
        with the temperature parameter 
.
During the warm-up stage, the Concrete Autoencoder is trained with a fixed binary mask and  for 50 epochs to improve reconstruction quality. To encourage sparsification of the binary mask that defines the sensor positions, the weight of the corresponding term in the loss function is then gradually increased: in the sparsification stage, this weight is initially set to 4 and increased by 1.0 every 5 epochs.
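The schedule described above can be sketched as a small function. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, epochs are counted from zero starting with the warm-up, and the weight used during the warm-up is not fixed by the sketch and must be supplied by the caller as `warmup_value`.

```python
def sparsity_weight(epoch, warmup_value, warmup_epochs=50,
                    initial=4.0, step=1.0, every=5):
    """Weight of the sparsification term in the loss at a given epoch.

    epoch        : zero-based epoch index (assumption of this sketch)
    warmup_value : weight used during the warm-up stage (caller-supplied)
    """
    if epoch < warmup_epochs:
        return warmup_value
    # Number of completed 5-epoch blocks since the end of the warm-up.
    blocks = (epoch - warmup_epochs) // every
    return initial + step * blocks
```

Keeping the schedule in a pure function makes it easy to log the weight next to the RMSE and the sensor count at every epoch.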
Figure 4a shows the dynamics of the RMSE between reconstructed and ground truth temperature fields during the training. The initial RMSE is low due to the warm-up stage, and it gradually increases as the number of sensors decreases. The reconstruction error remains relatively low as the number of sensors exponentially decreases from 60,000 to around 1500 over the first 400 epochs, as shown in 
Figure 4b. Spikes in the RMSE plot can be observed when the weight of the sparsification term is increased. After the number of sensors drops below 800, the reconstruction error starts to grow significantly. The dependence of RMSE on the number of sensors is illustrated in 
Figure 4c.
To analyze the reconstruction quality of the Concrete Autoencoder, we chose the checkpoint at epoch 508, at which the number of sensors is 1236 and the RMSE remains relatively low. The corresponding sensor locations are shown in 
Figure 5. For comparison, the reconstruction quality was also analyzed using the Nearest Neighbor method applied to a regularly spaced grid of sensors, as depicted in 
Figure 6.
  4.3. Reconstructed Fields
In this subsection, we present the temperature fields reconstructed using several methods. 
Figure 7 shows the original temperature field for 29 July 2020. 
Figure 8 shows the field reconstructed with the Concrete Autoencoder using 1236 sensors. 
Figure 9 shows the field reconstructed using climate data. 
Figure 10 shows the field reconstructed using the Nearest Neighbor method with 1326 regularly placed sensors shown in 
Figure 6. In terms of detail, the Concrete Autoencoder reconstruction resembles an improved climate: the fields are as blurry as those predicted from climate alone, but thanks to the sensors, individual regions are on average closer to reality, as can be seen in the Arctic north of Spitsbergen and north of the Laptev Sea. The Nearest Neighbor method produces a mosaic structure in the reconstructed field; as the number of sensors increases in 
Figure A3, the reconstructed field approaches the original field, as shown in 
Figure A5.
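The mosaic structure follows directly from how a nearest-neighbor reconstruction works: every grid cell copies the value of its closest sensor. The sketch below is a brute-force illustration only. It measures distance in grid-index space and ignores spherical geometry, land masks, and the periodic boundary; these simplifications and the function names are assumptions of this example.

```python
import numpy as np


def regular_sensor_grid(ny, nx, step):
    """Row and column indices of sensors placed every `step` cells."""
    rows, cols = np.meshgrid(np.arange(step // 2, ny, step),
                             np.arange(step // 2, nx, step), indexing="ij")
    return rows.ravel(), cols.ravel()


def nearest_neighbor_reconstruct(field, sensor_rows, sensor_cols):
    """Fill every cell with the value measured at the nearest sensor."""
    sensor_rows = np.asarray(sensor_rows)
    sensor_cols = np.asarray(sensor_cols)
    ny, nx = field.shape
    rows, cols = np.meshgrid(np.arange(ny), np.arange(nx), indexing="ij")
    # Squared index distance from each cell to each sensor: (ny, nx, n_sensors).
    d2 = ((rows[..., None] - sensor_rows) ** 2
          + (cols[..., None] - sensor_cols) ** 2)
    nearest = d2.argmin(axis=-1)  # index of the closest sensor for each cell
    return field[sensor_rows[nearest], sensor_cols[nearest]]
```

The distance array grows as grid size times sensor count, so for a global grid with thousands of sensors a spatial index such as a k-d tree would be the more practical choice.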
  4.4. Reconstruction Accuracy
The accuracy metrics calculated using Equations (
7) and (
9) are shown in 
Figure 11, 
Figure 12, 
Figure 13 and 
Figure 14. The median values were calculated from the time series of 
 and 
 using Equations (
8) and (
10), and the corresponding historical time series are shown in 
Figure 15. The total reconstruction accuracy for all methods is shown in 
Table 1. The spatial distribution of the temperature field reconstruction accuracy by the Nearest Neighbor method is shown in 
Appendix A for 1326 regularly placed sensors in 
Figure A8 and 
Figure A9 and 8671 regularly placed sensors in 
Figure A10 and 
Figure A11, as well as for 1236 sensors placed by the Concrete Autoencoder in 
Figure A6 and 
Figure A7.
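For orientation, bias and RMSE maps, their daily time series, and the medians of those series can be computed as in the sketch below. It is not a transcription of Equations (7)–(10): it assumes plain unweighted means over time and space, with no area weighting or land mask, and the function names are hypothetical.

```python
import numpy as np


def error_maps(recon, truth):
    """Per-cell bias and RMSE over time for (time, lat, lon) arrays."""
    err = np.asarray(recon, dtype=float) - np.asarray(truth, dtype=float)
    bias_map = err.mean(axis=0)
    rmse_map = np.sqrt((err ** 2).mean(axis=0))
    return bias_map, rmse_map


def error_time_series(recon, truth):
    """Spatially averaged bias and RMSE for every time step."""
    err = np.asarray(recon, dtype=float) - np.asarray(truth, dtype=float)
    bias_t = err.mean(axis=(1, 2))
    rmse_t = np.sqrt((err ** 2).mean(axis=(1, 2)))
    return bias_t, rmse_t


def median_scores(recon, truth):
    """Medians of the bias and RMSE time series."""
    bias_t, rmse_t = error_time_series(recon, truth)
    return float(np.median(bias_t)), float(np.median(rmse_t))
```

Using the median of the daily series rather than the mean makes the summary less sensitive to a few days with unusually large errors.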
We show the spatial distribution of the reconstruction bias and RMSE by the Concrete Autoencoder with 1236 measurements in 
Figure 11 and 
Figure 13 and from the climate without measurements in 
Figure 12 and 
Figure 14. The errors of the Concrete Autoencoder are significantly lower over the entire Pacific Ocean. In the Indian Ocean, the improvement is present but less pronounced. For the Gulf Stream, Kuroshio, Brazil, and Circumpolar currents, the CA errors are generally smaller than those of the climate, but the maximum values are approximately preserved.
To demonstrate the intra-annual variability of reconstruction accuracy, we plotted the bias and RMSE for every second day throughout 2020 from the test set. 
Figure 15 indicates that the temporal patterns of RMSE remain largely unchanged across the reconstruction methods, which employ different numbers of sensors. The accuracy of the reanalysis relative to real observation data is given for reference. Note that it is calculated only for a limited number of measurements, corresponding to the roughly 300 daily ARGO float profiles, and not over the entire computational domain as for the other methods.
In 
Table 1, the accuracy statistics for the temperature field reconstruction are presented. The reconstruction accuracy by climate is 
 worse in RMSE than the proposed field reconstruction method using the Concrete Autoencoder. Moreover, the Concrete Autoencoder requires about seven times fewer sensors to achieve approximately the same accuracy as reconstruction with the Nearest Neighbor method. With the same number of sensors, the accuracy of the reconstruction with the Concrete Autoencoder is 
 better than that of the Nearest Neighbor method. The comparison is based on the RMSE metric: although the bias of the Nearest Neighbor method is very small when averaged over the entire global ocean, large deviations of opposite sign often occur near individual sensors (
Figure A8), so the bias is not suitable for comparison. It is also important to note that the accuracy metrics for the INMIO reanalysis (test dataset) are calculated against the ARGO observation data; the number of measurements in this case fluctuates daily within the range of 300 to 400. Thus, the Concrete Autoencoder can reconstruct the geophysical field from a small number of sensors better than climate and, in terms of accuracy, is close to state-of-the-art operational ocean forecasting systems [
55,
56,
57].
  4.5. Sensitivity Analysis for Sensor Measurements
For several sensor coordinates obtained by the CA method, mutual information and sensitivity fields were calculated. The most representative fields are shown in 
Figure 16 and 
Figure 17, displaying the influence of one point on the global ocean, and in 
Figure 18, displaying the influence of two different points on a local area.
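One simple way to obtain a sensitivity field for a single sensor is to perturb its measurement slightly and observe how the reconstructed field changes. The finite-difference sketch below illustrates the idea; the `reconstruct` callable stands in for a trained model, and the step size and function name are assumptions. In practice the same quantity could also be obtained with automatic differentiation.

```python
import numpy as np


def sensitivity_field(reconstruct, measurements, sensor_index, eps=1e-3):
    """Finite-difference sensitivity of the reconstruction to one sensor.

    reconstruct  : callable mapping a 1-D measurement vector to a field
                   (hypothetical stand-in for the trained model)
    measurements : 1-D array of sensor values
    sensor_index : index of the sensor whose influence is examined
    """
    measurements = np.asarray(measurements, dtype=float)
    base = reconstruct(measurements)
    perturbed = measurements.copy()
    perturbed[sensor_index] += eps  # nudge a single measurement
    return (reconstruct(perturbed) - base) / eps
```

For a linear reconstruction `field = W @ measurements`, the result is the column of `W` belonging to the chosen sensor, which provides a quick sanity check of the implementation.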
In 
Figure 16, the mutual information shows the influence of one point on distant areas in the ocean. This effect may be due to the subtraction of monthly averaged climate values, which may not be sufficient to negate the global effect of the sun and atmosphere. In 
Figure 17, the sensitivity field shows a more locally influenced area with a sharp left border. The point is close to the left edge of the calculation domain, where the periodic boundary condition is set, and the CA model does not recognize the geographical proximity of the area on the other side of the domain, so the sensitivity pattern is clipped on one side. A potential solution is to augment the training set by shifting the fields along longitude, which would make the output of the Concrete Autoencoder invariant with respect to these shifts.
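The longitudinal shift augmentation suggested here amounts to a circular roll of each field along its longitude axis. The sketch below assumes that longitude is the last array axis and that the grid is periodic in that direction; the function names and the use of a uniformly random shift per sample are illustrative assumptions.

```python
import numpy as np


def shift_longitude(field, shift):
    """Circularly shift a (..., lat, lon) field by `shift` cells in longitude."""
    return np.roll(field, shift, axis=-1)


def augment_batch(fields, rng):
    """Apply an independent random longitudinal shift to every sample.

    fields : array of shape (batch, lat, lon)
    rng    : numpy.random.Generator
    Returns the shifted batch and the shifts that were applied.
    """
    shifts = rng.integers(0, fields.shape[-1], size=fields.shape[0])
    shifted = np.stack([shift_longitude(f, int(s))
                        for f, s in zip(fields, shifts)])
    return shifted, shifts
```

Because the roll wraps values around the edge, a feature located just west of the domain boundary reappears just east of it, which is exactly the adjacency the model currently cannot see.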
In 
Figure 18a,b, the influence of a point close to the front of the Agulhas and Antarctic Circumpolar currents on a local area is shown. Filamentous structures, where the model is insensitive to changes in a given sensor measurement, are typically located near other sensors.
In 
Figure 18c,d, the influence of a point located in the wide Brazil Current is shown, which explains the large size of the spot in the mutual information and sensitivity fields. It is also clear that in areas without sensors, the sensitivity pattern is similar to the field of mutual information.
  5. Discussion
In this study, we focused on regions with high information entropy, which indicates a high degree of uncertainty. By comparing the information entropy field with maps of ocean currents, we found that these regions tend to be located near the fronts between cold and warm currents. However, not all regions where cold and warm currents mix correspond to high information entropy values. Future research may investigate these regions to identify the factors that contribute to high uncertainty in some areas but not in others. This could help us better understand the relationship between the mixing of cold and warm currents and uncertainty in the ocean state.
A unique feature of the proposed method for sensor placement and field reconstruction is its ability to adjust the number of sensors during the Concrete Autoencoder training. By choosing different checkpoints of the Concrete Autoencoder, we can balance the number of sensors and the reconstruction accuracy. The results of this study showed that reducing the number of sensors below the optimal value leads to rapid error growth. Therefore, this method can be used to determine the minimum acceptable number of sensors and provide the optimal placement strategy.
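The trade-off between sensor count and accuracy can be made explicit by scanning the training history for the checkpoint with the fewest sensors whose error is still acceptable. The sketch below is a hypothetical helper, and the history values in the usage example are made-up numbers for illustration, not results from this study.

```python
def pick_checkpoint(history, max_rmse):
    """Return the admissible checkpoint with the fewest sensors.

    history  : iterable of (epoch, n_sensors, rmse) tuples
    max_rmse : largest acceptable reconstruction error
    Returns None if no checkpoint satisfies the error bound.
    """
    admissible = [h for h in history if h[2] <= max_rmse]
    if not admissible:
        return None
    # Fewest sensors first; ties are broken by the lower error.
    return min(admissible, key=lambda h: (h[1], h[2]))


# Made-up training history, for illustration only.
example_history = [
    (10, 5000, 0.40),
    (20, 2000, 0.45),
    (30, 900, 0.50),
    (40, 400, 0.90),
]
best = pick_checkpoint(example_history, max_rmse=0.6)
```

Here `best` is the third entry: it is the sparsest configuration before the error exceeds the chosen bound.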
The sensitivity of the Concrete Autoencoder can be used as an approximation of the mutual information field that also takes into account the joint influence of other sensors. Investigating the sensitivity fields of the Concrete Autoencoder could also improve the interpretability of the obtained sensor locations. This sensitivity analysis tool could have several important applications. First, when placing a new observation station, it is important to find the area in which the values of the measured physical field will correlate with the values obtained at the location of this station. A system of sensors with non-overlapping areas of influence will be more cost-effective and will provide high accuracy with a minimum number of sensors. Second, sensitivity analysis tools are also important in numerical simulations, for example, for optimizing data assimilation methods in global ocean circulation models with ensemble Kalman filters. As the number of satellite observations grows, it is important to find a balance between the amount of assimilated data and the speed of the data assimilation algorithm, which is limited by the speed of calculating the Singular Value Decomposition in ensemble Kalman filters. By using the method for calculating a sensor's area of influence, we can exclude sensors with significantly overlapping areas of influence from the data assimilation process, which can significantly speed up the calculation of the ocean model without sacrificing the accuracy of the forecast. In general, further investigation of the sensitivity field using the Concrete Autoencoder could provide valuable insights into the relationship between sensor placement and mutual information fields and help us better understand the factors that influence the accuracy of field reconstruction from sparse measurements.
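The idea of excluding sensors with strongly overlapping areas of influence can be illustrated with a simple greedy filter. In this sketch each area of influence is represented as a set of grid-cell identifiers, for instance obtained by thresholding a sensitivity field. The overlap measure, the threshold, the visiting order, and the function name are all assumptions of the example and not a procedure taken from this study.

```python
def thin_sensors(influence, max_overlap=0.5):
    """Greedily keep sensors whose influence areas overlap little.

    influence   : dict mapping sensor id -> set of grid-cell ids
    max_overlap : largest allowed fraction of a sensor's area that is
                  already covered by previously kept sensors
    """
    kept = []
    covered = set()
    # Visit sensors with the largest area of influence first.
    order = sorted(influence, key=lambda s: len(influence[s]), reverse=True)
    for sensor_id in order:
        area = influence[sensor_id]
        if not area:
            continue  # a sensor without influence adds nothing
        overlap = len(area & covered) / len(area)
        if overlap <= max_overlap:
            kept.append(sensor_id)
            covered |= area
    return kept
```

For example, with areas `{"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {7, 8}}` and the default threshold, sensor `b` is dropped because two thirds of its area is already covered by `a`, while `a` and `c` are kept.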