Downscaling of MODIS NDVI by Using a Convolutional Neural Network-Based Model with Higher Resolution SAR Data

Abstract: The normalized difference vegetation index (NDVI) is a simple but powerful indicator that can be used to observe green live vegetation efficiently. Since its introduction in the 1970s, NDVI has been used widely for land management, food security, and physical models. For these applications, acquiring NDVI in both high spatial resolution and high temporal resolution is preferable. However, there is generally a trade-off between temporal and spatial resolution when using satellite images. To relieve this problem, a convolutional neural network (CNN) based downscaling model was proposed in this research. This model is capable of estimating 10-m high resolution NDVI from MODIS (Moderate Resolution Imaging Spectroradiometer) 250-m resolution NDVI by using Sentinel-1 10-m resolution synthetic aperture radar (SAR) data. First, this downscaling model was trained to estimate Sentinel-2 10-m resolution NDVI from a combination of upscaled 250-m resolution Sentinel-2 NDVI and 10-m resolution Sentinel-1 SAR data, by using data acquired in 2019 in the target area. Then, the generality of this model was validated by applying it to test data acquired in 2020, with the result that the model predicted the NDVI with reasonable accuracy (MAE = 0.090, ρ = 0.734 on average). Next, 250-m NDVI from MODIS data was used as input to confirm this model under conditions replicating an actual application case. Although there was a mismatch between the original MODIS and Sentinel-2 NDVI data, the model predicted NDVI with acceptable accuracy (MAE = 0.108, ρ = 0.650 on average). Finally, this model was applied to predict high spatial resolution NDVI using MODIS and Sentinel-1 data acquired in the target area from 1 January 2020 to 31 December 2020. In this experiment, double cropping of cabbage, which was not observable at the original MODIS resolution, was observed by the enhanced temporal resolution of the high spatial resolution NDVI images (approximately ×2.5).
The proposed method enables the production of 10-m resolution NDVI data with acceptable accuracy when cloudless MODIS NDVI and Sentinel-1 SAR data are available, and can enhance the temporal resolution of high resolution 10-m NDVI data.


Introduction
Food is one of the basic elements for human life and survival. In order to feed the increasing population, massive and stable food production is necessary [1]. One of the key factors towards food security and production is the frequent and accurate acquisition of data pertaining to croplands.
Due to recent developments in the fields of cloud computing, robotics, and the internet, several approaches to data acquisition are available, such as Internet of Things (IoT) devices [2], unmanned aerial vehicles (UAV) [3], and satellite data [4]. Among them, satellite remote sensing has the advantages of wide-range monitoring with a short revisit period and low use cost, which provide various stakeholders with both cost-efficient and reliable methods for cropland monitoring [5,6].
High resolution both temporally and spatially is desirable for cropland monitoring. Higher spatial resolution enables a more precise examination of the land. Moreover, for applications such as land classification, higher spatial resolution can achieve higher accuracy due to the smaller number of mixed pixels [7]. Higher temporal resolution, on the other hand, leads to more precise detection of land cover changes. This is especially useful for croplands, since changes to cropland cover occur with high frequency. However, satellite images generally suffer from a trade-off between spatial and temporal resolution. In order to acquire images having both high temporal and high spatial resolution, either of two approaches is generally used.
The first approach is to enhance the temporal resolution of high spatial, low temporal resolution images. Temporal interpolation methods [8], temporal replacement methods [9,10] and temporal filtering methods [11,12] are usually applied in this approach. In general, these methods assume that adjacent temporal images have the same vegetation type, and this assumption can yield good results over a short time interval. However, the land cover of croplands usually changes dynamically, and the assumption of similar adjacent temporal observations is not always applicable to croplands. Also, this approach is not suitable for areas with continuous cloud cover, since the interval for interpolation tends to become long under that condition. To overcome this problem, some studies have recently used synthetic aperture radar (SAR) data to support temporal interpolation. SAR has attracted attention due to its high revisit frequency and all-weather imaging capacity and is used in many applications, such as land cover mapping [13], disaster evaluation [14], and crop monitoring [15,16]. For enhancement of temporal resolution, some researchers have used a deep learning model to predict NDVI time series from pixel-based optical and SAR time series [17]. SAR images have also been used for temporal image interpolation and cloud removal [18]. However, these studies were conducted at a single location with flat ground and little cloud cover, which are ideal conditions for temporal interpolation with abundant data. In addition, the application of SAR images can be especially difficult in locations with continuous cloud cover, since optical images with relatively high temporal resolution still need to be collected to ensure a high level of accuracy.
The second approach is to enhance the spatial resolution of low spatial, high temporal resolution images, which is called downscaling. Super-resolution is a straightforward method to achieve this goal. General super-resolution, which is the task of converting low resolution RGB images into high resolution RGB images, is a popular task in computer vision, since training datasets can easily be created by down-sampling existing images. In general, deep learning models with many parameters have recently been used to achieve high accuracy, and the order of super-resolution is around ×2 to ×8 [19]. However, the number of satellite images is limited compared to RGB images, especially for cloudy regions, and deep models with many parameters can be difficult to train sufficiently. Also, in order to achieve the higher spatial resolution that is useful for satellite data, it is necessary to perform a higher order of downscaling. For example, downscaling of MODIS (250-m, observed daily) to Sentinel-2 (10-m) requires an order of ×25. Thus, simply applying single-image super-resolution models to satellite images can be difficult in practical applications. One way to overcome this problem is to use additional supportive data. Some previous studies made use of the higher resolution bands of Sentinel-2 (10-m~60-m) to enhance the resolution of lower resolution bands [20,21]. Some research also achieved a relatively high order of downscaling by making use of higher resolution observations as support [22,23]. Also, some researchers have used Sentinel-1 (10-m) SAR images to downscale land surface temperature and soil moisture [24][25][26]. However, few studies have investigated the use of SAR images to enhance the spatial resolution of NDVI, which is a widely used indicator of vegetation growth and coverage.
Several previous studies have shown that there are strong relationships between SAR and vegetation [18,27,28]. One group concluded that SAR backscatter signals change significantly when responding to different vegetation types [29], and another group showed that the backscatter value varies at different growth stages even for the same vegetation [30]. Such studies have generally used NDVI as an indicator of vegetation dynamics. NDVI is one of the most widely used indices for evaluating vegetation and is used for many applications, such as cropland monitoring [31], landcover monitoring [32], and climate impact modeling [33]. In this research, a model to downscale NDVI from 250-m to 10-m was developed. This model is a convolutional neural network (CNN)-based model which learns the concept of downscaling of low spatial resolution NDVI images by making use of Sentinel-1 10-m SAR data as an additional supportive dataset.
Prediction of NDVI using a similar CNN-based model was also attempted in previous work [27]. However, that research aimed to predict the NDVI under clouds using SAR data acquired on the same day. Since SAR data are strongly affected by daily conditions such as soil moisture, using only SAR data to predict NDVI for a different day results in low accuracy due to overfitting to the conditions of the training data (see Section 4.1.2). The proposed method addresses this issue by also using coarse NDVI as input and using SAR as supportive data for the downscaling.
The main contributions of this work are as follows:

1. A CNN-based NDVI downscaling model using higher spatial resolution SAR data was proposed.
2. The model was trained using an up-sampled and original Sentinel-2 image pair together with a Sentinel-1 image, and was evaluated across different seasons.
3. MODIS 250-m NDVI data were input into the trained model and showed advantages for application.

Study Areas
The study area is Tsumagoi in Gunma Prefecture, latitude 36.461°N~36.515°N, longitude 138.426°E~138.537°E, which is in the mountainous region of Japan. The approximate location is shown in Figure 1a. Tsumagoi consists of mountainous and field crop areas. The size of individual crop fields in Tsumagoi is smaller than the spatial resolution of MODIS (250-m), making the area suitable for evaluating the performance of the downscaling model. The annual mean temperature is around 8 °C due to the high altitude. The annual precipitation is around 1500 mm. The rainy season is summer, with a monthly average precipitation of around 180 mm from June to September. Cabbage is mainly grown in the sloping fields (Figure 1b), and accounts for about 45% of total cabbage production in Japan. The growing season is early May to late October, and double cropping also takes place in some fields. The interval between plantings for fields with double cropping is typically 2~3 months, which also requires high temporal resolution for accurate monitoring. Figure 1c shows a 10-m resolution RGB image of the target area taken by Sentinel-2. The extracted target area is approximately 58 km², and consists of 573 × 1016 pixels in 10-m resolution.

Sentinel Data
For this work, images from Sentinel-1A and Sentinel-2A were collected from the Copernicus Open Access Hub [34]. The revisit times of Sentinel-1A and Sentinel-2A over Tsumagoi are 6 and 5 days, respectively, since Tsumagoi is located in the mid-latitudes, while the observation frequency is approximately half that in other regions. Only data from one satellite of each constellation was used in this study, to show the efficiency of the proposed model for other data-limited regions as well.
Sentinel-1 data were acquired in interferometric wide swath (IW) mode, in the Ground Range Detected (GRD) format, which has 10-m spatial resolution. The SAR data processing includes: (1) calibration (VH/VV polarization intensities to sigma naught), (2) single-product speckle filtering by Lee Sigma filtering [35], (3) transformation of the backscatter coefficients (σ⁰) to decibels (dB), and (4) Range-Doppler Terrain Correction using digital elevation model data. Preprocessing was conducted with the Sentinel Application Platform (SNAP) software [36]. Finally, the Sentinel-2 optical image was collocated with the preprocessed Sentinel-1 data. In this work, the closest counterparts of SAR and optical data (acquired within 5 days) were extracted for collocation. Data were acquired from January 2019 to December 2020 for the study area shown in Figure 2. Less data was acquired during June to August, since summer is the rainy season in Tsumagoi. NDVI was calculated by Equation (1), using RED (4th band, 665-nm) and NIR (8th band, 842-nm) of the Sentinel-2 image.
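As a concrete illustration, Equation (1), NDVI = (NIR − RED) / (NIR + RED), can be sketched in a few lines of NumPy. The function name and the division-by-zero guard are assumptions for illustration, not part of the original processing chain:

```python
import numpy as np

def ndvi(red, nir, eps=1e-12):
    """Equation (1): NDVI = (NIR - RED) / (NIR + RED).

    `red` and `nir` are reflectance arrays (Sentinel-2 bands 4 and 8 here);
    `eps` guards against division by zero over no-data pixels.
    """
    red = np.asarray(red, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    return (nir - red) / (nir + red + eps)

# Dense vegetation has NIR >> RED, so NDVI approaches 1.
print(ndvi(0.05, 0.45))  # ≈ 0.8
```

The same function applies unchanged to the MODIS bands (RED band 1, NIR band 2) used later in the paper.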

MODIS Data
In order to evaluate the applicability of the model to different satellite images, MODIS data were used. MOD09 and MYD09 (Terra and Aqua Surface Reflectance Daily L2G Global 250-m) products were acquired from EARTHDATA Search [37]. Cloudless Sentinel-2 and MODIS optical data acquired within 3 days of each other were collocated for evaluation. The MODIS data acquisition dates differed among the experiments; a detailed explanation is given in Sections 4.2.1 and 4.3.1. NDVI was calculated by Equation (1), using RED (1st band, 620~670-nm) and NIR (2nd band, 841~876-nm) of the MODIS image.

The CNN-Based Downscaling Model
Convolutional neural networks (CNNs) have been successfully applied to many image processing problems, such as image classification [38], object recognition [39], and image generation [40]. Their main advantages are the capability to approximate complex non-linear end-to-end mappings that take surrounding pixels into account, and their parallel computational architecture. Recently, in the field of super-resolution, deep models with many parameters have been trained by using large datasets [19]. For super-resolution, manual labeling is not needed, because the reference data can be automatically generated from the data themselves. However, satellite data, especially in cloudy regions, are limited compared to general RGB images.
In this work, the architecture of the Super-Resolution Convolutional Neural Network (SRCNN) [41] was used, as shown in Figure 3 and Table 1. This model has fewer parameters [19] and is trainable with a smaller dataset. Models with a similar architecture have also been used for NDVI estimation [27] and have shown acceptable results. Input and output are trimmed into 1250-m (125 pixels in 10-m resolution) square patches with a spacing of 400-m (40 pixels in 10-m resolution), referencing the aforementioned work [18]. By applying this trimming method to the target area (573 × 1016 pixels), 242 patches were extracted. No data augmentation was used in this study. The detailed information of the input and output is summarized below.
• Input - NDVI 250-m; Sentinel-1 VH and VV, 10-m
• Output - NDVI Sentinel-2, 10-m


Concept of the Proposed Model
The model can be conceptually understood as downscaling using surrounding higher resolution SAR features. Figure 4 shows an example of an input and reference image patch pair. In the case of ×25 downscaling, one coarse NDVI pixel is an average (mixed pixel) of 25 × 25 = 625 fine NDVI pixels. This is an ill-posed problem and cannot be solved exactly. CNN models have achieved considerable improvements, since they can consider surrounding pixels and learn the pattern of downscaling from images in the training set. Furthermore, by also considering SAR images, which have a strong relation with vegetation (as seen in Figure 4), the downscaled NDVI can be estimated with higher accuracy and using less data. SAR data can provide additional information related to the surface roughness of the target area. Since vegetation is mainly characterized by a vertical geometry, the VV and VH components of SAR data can explain vegetation especially well [42]. For example, in Figure 4, values of VH tend to be higher in areas with vegetation (high NDVI values). This is due to the perturbation of the vertically polarized signal emitted by the radar, since vegetated surfaces tend to be rougher than bare soil.

For inference, square patches for the test area were generated in the same way, with 125 pixels of side length and 40 pixels of spacing. The output was then stitched into the original shape of the target area by taking the mean value over the overlapped regions.
Table 1. Hyper-parameters of the proposed CNN. y(i) represents the variable in the equations of Figure 3.
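The overlap-averaging stitch can be sketched as follows; this is a minimal NumPy version, assuming predictions are accumulated with a per-pixel count and then divided:

```python
import numpy as np

def stitch(patches, origins, out_shape, size=125):
    """Reassemble overlapping patches into a full image, averaging the
    predictions over every overlapped pixel (sum / count)."""
    acc = np.zeros(out_shape, dtype=np.float64)
    cnt = np.zeros(out_shape, dtype=np.float64)
    for p, (i, j) in zip(patches, origins):
        acc[i:i + size, j:j + size] += p
        cnt[i:i + size, j:j + size] += 1.0
    return acc / np.maximum(cnt, 1.0)  # uncovered pixels stay 0

# Round trip on a toy image: stitching exact patches recovers the input
# wherever the patch grid covers it.
img = np.random.rand(205, 205)
ps, org = [], []
for i in range(0, 205 - 125 + 1, 40):
    for j in range(0, 205 - 125 + 1, 40):
        ps.append(img[i:i + 125, j:j + 125]); org.append((i, j))
rec = stitch(ps, org, img.shape)
print(np.allclose(rec, img))  # the 3×3 patch grid covers the whole image
```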


Training Settings
For the loss function, the L1-norm (Equation (2)) was selected after preliminary experiments. This result corresponds with the result for a similar task in [18]. For the optimization, Adam [43], one of the most widely used procedures, was adopted. In Adam, weights are updated by Equations (3)-(5), where m and v are the moving averages of the mean and uncentered variance of the gradient, and g is the gradient on the current mini-batch. α and β are hyperparameters, and β was set to the following widely used and successful values: β1 = 0.9 and β2 = 0.999 [43].
In order to decide the number of epochs and the optimal α value for training, a preliminary experiment was conducted. To prevent data leakage, only data from 2019 were used in this experiment, since inference by the model was conducted on 2020 data in this study. For the 2019 data, one date was selected as the validation data and the model was trained using data from the other dates. This experiment was conducted for every acquisition date in 2019 and for several values of α, and the average training curve is shown in Figure 5. Training for 250 epochs was sufficient for convergence, and α = 0.001 performed best after convergence. In this work, the number of epochs was therefore set to 250 and α was set to 0.001. Training was conducted on an NVIDIA GeForce GTX 1080 GPU. The CNN model was implemented using TensorFlow.
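A single Adam update following Equations (3)-(5) can be sketched as below. The bias-correction terms follow the original Adam paper [43]; the variable names and the scalar example are illustrative:

```python
import numpy as np

def adam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient mean (m)
    and uncentered variance (v), bias-corrected by the step index t, then a
    scaled step on the weights. alpha=0.001 follows the paper's setting."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction [43]
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
w, m, v = adam_step(w, np.array([0.5]), m, v, t=1)
print(w)  # first step moves by ≈ alpha in the gradient direction: ≈ [0.999]
```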


Experimental Design
In this experiment, only Sentinel-2 data were used for NDVI: a pair of up-sampled 250-m Sentinel-2 NDVI and original 10-m Sentinel-2 NDVI was used for training. Up-sampling was calculated by averaging the 10-m resolution NDVI values within each 250-m grid cell. Data acquired during 2019 (1 January 2019~31 December 2019) were used for model training and data obtained in 2020 (1 January 2020~31 December 2020) were used for tests. For the test data, only dates with 100% availability were selected for fair comparison. The prediction was evaluated using two metrics, the mean absolute error (MAE, Equation (2)) and Pearson's correlation coefficient (ρ, Equation (6)). This validation was applied to all dates, and seasonal differences were discussed.
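The up-sampling by block averaging and the two evaluation metrics can be sketched as follows; this is a minimal NumPy version and the function names are illustrative:

```python
import numpy as np

def block_average(ndvi_10m, factor=25):
    """Up-scale 10-m NDVI to 250-m by averaging each factor-by-factor block
    (the coarse training input); H and W must be divisible by `factor`."""
    h, w = ndvi_10m.shape
    return ndvi_10m.reshape(h // factor, factor,
                            w // factor, factor).mean(axis=(1, 3))

def mae(pred, ref):
    """Mean absolute error (Equation (2))."""
    return np.abs(pred - ref).mean()

def pearson(pred, ref):
    """Pearson's correlation coefficient (Equation (6))."""
    return np.corrcoef(pred.ravel(), ref.ravel())[0, 1]

fine = np.random.rand(250, 250)   # toy 10-m NDVI tile
coarse = block_average(fine)      # 10 x 10 grid of 250-m pixels
print(coarse.shape)               # (10, 10)
```

Note that block averaging preserves the overall mean, which is consistent with each coarse pixel being a mixed pixel of 25 × 25 fine pixels.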

Results
In order to evaluate the proposed model, a linear regression model was used as the baseline for comparison. The input for the linear regression model corresponded with the input for the proposed model (VH, VV and 250-m NDVI), and it was designed to learn the linear relationship between corresponding pixels. Figure 6 shows an example of a qualitative result for Tsumagoi, 5 October 2020. The proposed CNN method (Figure 6f) is compared with linear regression (Figure 6c), the proposed model with only SAR data as input (Figure 6e, trained by SAR only), and with only 250-m NDVI as input (Figure 6d, trained by NDVI only). The proposed method using both inputs generated a sharper image than the other methods. Also, the linear regression appears to predict more precise features than the input 250-m NDVI (Figure 6b) and the proposed model with only NDVI input. This result implies that SAR data are effective for explaining NDVI features. On the other hand, the proposed model with only SAR input captures fine features, but the difference in value is large compared to the reference data (Figure 6a). These results imply that the SAR and NDVI data work complementarily for prediction, with each supplying different information.

Quantitative results are shown in Table 2. The overall accuracy of the proposed method is 0.090 in MAE and 0.734 in ρ. On average, the proposed model performed better than linear regression. Also, the MAE and ρ of the linear regression appeared to have a strong positive relationship with the results of the proposed model. Seasonal differences were not clearly observed. However, the interval in days between the date of available imagery from 2019 and the equivalent date in 2020 seemed to be related to accuracy. For data with a date interval of less than 7 days (3 March, 2 May, 4 December), the average MAE (0.081) was lower than the overall average and the average ρ (0.777) was higher than the overall average.
On the other hand, for data with a date interval of more than 7 days (27 March, 20 April, 18 August, 5 October, 10 November), the average MAE (0.094) was higher than the overall average and the average ρ (0.723) was lower than the overall average.
The results also show that the values of MAE and ρ do not necessarily have a positive correlation. For example, the results for 18 August tend to have high MAE and high ρ. This can be explained by the difference in the NDVI histograms, as shown in Figure 7. From the NDVI images, the difference between fields is clearer on 18 August (high MAE, high ρ) than on 4 December (moderate MAE, moderate ρ). The bottom figure shows the histogram of the NDVI distribution. The histograms for both dates have two peaks. This is likely caused by the superposition of two NDVI distributions from different sources: vegetation and background. In August, which is the mid-season for cropping, the NDVI differences between fields are larger, and the NDVI for the mountainous area is also large, which can also be observed from the expanded NDVI distribution shown in the histogram. For this expanded histogram, both ρ and MAE tend to become larger. On the other hand, in December, the differences between fields in the original NDVI are smaller, which also causes the histogram of the predicted NDVI to have a similar distribution. For this coherent distribution of NDVI, MAE tends to take a smaller value. The pros and cons of these different histograms depend on the application. For example, for evaluation of the NDVI value in a certain area, low MAE values may be better, while for classification of harvested and unharvested fields, high ρ may be better. However, more additional information can be obtained by downscaling in the seasons with the latter type of histogram, since the former information can largely be obtained from the original 250-m resolution data.
According to the histograms in Figure 7, the model seems to make conservative predictions, especially for the data on 18 August. This could be improved by implementing a different loss function or optimization method for training. Also, using more data acquired during the summer season for model training could be effective, since less summer data was acquired due to the rainy season in Tsumagoi (see Figure 2).

Prediction with MODIS Data
In Section 4.1, the model was trained and evaluated with the pair of original Sentinel-2 10-m NDVI and up-sampled 250-m Sentinel-2 NDVI. However, for practical application, this model needs to be applied not to up-sampled Sentinel-2 data but to data from a different satellite source such as MODIS. In this section, the model trained on data from 2019 (see Section 4.1) was applied to 250-m NDVI from MODIS images to investigate its potential for application. Table 3 shows the date information for the acquired data. The corresponding MODIS data (see Section 2.2.2 for the detailed product information) were extracted according to the following protocol.

Data Information
-Search for MODIS data within 3 days of the date of the Sentinel-2 data.
-If no cloudless data are found within 3 days, no data are used.
-If any cloudless data are found from either Terra or Aqua, acquire all of them for ensembling (see Section 4.2.3 for a detailed explanation).
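The selection protocol above can be sketched as a small date-matching routine. This is an illustrative sketch, assuming the candidate lists already contain only cloudless acquisition dates; the function name and toy dates are hypothetical:

```python
from datetime import date, timedelta

def collect_modis_scenes(s2_date, terra_dates, aqua_dates, window_days=3):
    """Return all cloudless Terra/Aqua acquisition dates within +/- window_days
    of the Sentinel-2 date; an empty list means the pair is skipped."""
    window = timedelta(days=window_days)
    return [d for d in sorted(set(terra_dates) | set(aqua_dates))
            if abs(d - s2_date) <= window]

s2 = date(2020, 10, 5)
terra = [date(2020, 10, 3), date(2020, 10, 9)]   # 9 October is outside the window
aqua = [date(2020, 10, 6)]
scenes = collect_modis_scenes(s2, terra, aqua)   # both in-window dates are kept
```

All in-window scenes are kept rather than just the closest one, so they can be combined by the median ensembling described in Section 4.2.3.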


Experimental Design
In this section, the model trained in Section 4.1 (trained with the Sentinel-2 image pair acquired in 2019) was adopted. The model predicted downscaled NDVI from the input of the MODIS 250-m NDVI and Sentinel-1 SAR image (the data pairs shown in Table 3). Finally, the output image was evaluated by comparing it with the Sentinel-2 10-m NDVI acquired on the nearest date.
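Before the 250-m NDVI can be stacked with the 10-m SAR bands as model input, it must be brought onto the 10-m grid. A minimal sketch of this step is shown below; nearest-neighbour replication and the channel-stacking layout are assumptions here, since the paper does not specify the resampling kernel:

```python
import numpy as np

def upsample_nearest(ndvi_250m, factor=25):
    """Nearest-neighbour up-sampling of a 250-m NDVI grid onto the 10-m grid
    (250 m / 10 m = factor 25), so it can be stacked with the SAR bands."""
    return np.kron(ndvi_250m, np.ones((factor, factor), dtype=ndvi_250m.dtype))

coarse = np.array([[0.2, 0.8],
                   [0.5, 0.3]])            # toy 2 x 2 block of 250-m pixels
fine = upsample_nearest(coarse)            # 50 x 50 grid of 10-m pixels
# Hypothetical input stack: up-sampled NDVI + VV + VH as three channels
stack_shape = (3,) + fine.shape
```

Each coarse pixel simply becomes a 25 × 25 block of identical values; the CNN then learns to redistribute that block's NDVI using the fine-scale SAR texture.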

Difference between Sentinel-2 and MODIS NDVI
A mismatch between NDVI data acquired at different resolutions has been reported [44]. This was also the case for Sentinel-2 and MODIS, with the Sentinel-2 data appearing to correspond better with field observation data [45]. This may be due to differences in data processing, acquisition time, and bandwidth.
In this research, the difference in NDVI between Sentinel-2 and MODIS was also measured. Table 4 shows the difference between the acquired Sentinel-2 and MODIS NDVI. The columns "Terra" and "Aqua" show the original difference between MODIS and up-sampled Sentinel-2 data acquired on the same day. In previous research, time-series filtering and anomaly detection were applied to daily MODIS NDVI data, and improved correspondence with Sentinel-2 NDVI data was reported [45]. In this study, in order to reduce the difference between MODIS and Sentinel-2 NDVI, median filtering was applied to the NDVI acquired from Terra and Aqua before input, as follows:
-Collect Terra and Aqua data within 3 days of the Sentinel-2 data (see Section 4.2.1).
-For all of the collected data, adopt the median value for each pixel as the ensembled value.
The "Ensemble" column in Table 4 shows the difference between MODIS and Sentinel-2 NDVI after ensembling. Both MAE and ρ improved on average after ensembling. This may be due to the suppression of errors with strong randomness.
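The per-pixel median ensembling described above can be sketched with numpy. The scene values below are toy numbers chosen to show how a single spurious observation is suppressed; they are not the study's data:

```python
import numpy as np

def median_ensemble(scenes):
    """Per-pixel median across all Terra/Aqua NDVI scenes collected within
    3 days of the Sentinel-2 date (NaN would mark cloud-masked pixels)."""
    stack = np.stack(scenes, axis=0)
    return np.nanmedian(stack, axis=0)

terra_1 = np.array([[0.60, 0.20], [0.40, 0.90]])
terra_2 = np.array([[0.62, 0.24], [0.38, 0.10]])   # 0.10: spurious low value
aqua_1 = np.array([[0.58, 0.22], [0.42, 0.88]])
ens = median_ensemble([terra_1, terra_2, aqua_1])
```

The median keeps the two consistent observations of the bottom-right pixel (0.90, 0.88) and discards the outlier (0.10), which is why ensembling suppresses errors with strong randomness.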

Results
Quantitative results are shown in Table 5. On average, compared to the model with up-sampled Sentinel-2 input, MAE increased by 0.015, and ρ decreased by 0.085. Figure 8 shows the strong correlation between the MAE of the input MODIS versus up-sampled Sentinel-2 data and the increase in MAE after prediction. This result implies room for improvement by suppressing the difference between MODIS NDVI and Sentinel-2 NDVI before input.
Figure 9 shows the qualitative prediction results for Tsumagoi (5 October 2020). According to Table 5, the prediction results are moderate compared to the average in MAE and ρ. In Figure 9, the prediction from up-sampled Sentinel-2 NDVI (c) looks clearer than that from MODIS input (d), and the predicted values seem to be closer to the reference data, for example, near the lake area where the MAE between MODIS and Sentinel-2 was large before input (a). However, even for prediction with MODIS data, the prediction captures the characteristics of NDVI for the fields to a certain extent.

Example of an Application in Tsumagoi
The prediction with MODIS data was conducted and evaluated in Section 4.2. In this section, as a practical application, the proposed model is applied to 2020 data from Tsumagoi in order to enhance the temporal resolution of 10-m NDVI. As reported in [46,47], double cropping of cabbage takes place during the summer season in Tsumagoi. One goal of this application is to detect the timeline of double cropping using data with enhanced temporal and spatial resolution.

Experimental Design
In this experiment, data acquired from 1 January 2019 to 31 December 2019 were used to train the proposed model (see Section 4.1). Then, the trained model was applied to MODIS data from 1 January 2020 to 31 December 2020 to predict the 10-m NDVI for the corresponding dates. By combining these data with Sentinel-2 data collected from 1 January 2020 to 31 December 2020, NDVI data with both high spatial and high temporal resolution were generated. The data used for inference in this experiment are shown in Table 6.

Table 6. Date information of the reference data.

Sentinel-1 MODIS-Terra MODIS-Aqua Sentinel-2
3 January 1 January, 2 January, 4 January, 6 January
Results of the time series plots are shown in Figure 11. Three locations with different land usage were selected for time-series evaluation. Location A is a crop field with single cropping, location B is fallow land, and location C is a crop field with double cropping, as confirmed by Sentinel-2 images. The mean value within the 3 × 3 (30 m × 30 m) center pixels of each location is used as the extracted NDVI value. The three plots in Figure 11 show the transition of the NDVI value at each location. The blue plots indicate the values extracted from MODIS data (250-m resolution; cloudless images were extracted), and the red plots show 10-m resolution NDVI from the combination of Sentinel-2 data and predictions by the proposed model using MODIS and Sentinel-1 data. Among the red plots, the outlined points represent estimates from the proposed model, and the error bars represent the average MAE (see Section 4.2.4).
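The 3 × 3 center-pixel averaging used for the time-series extraction can be sketched as follows. This is a minimal numpy illustration with a toy grid; the indexing convention is an assumption:

```python
import numpy as np

def window_mean(ndvi, row, col, half=1):
    """Mean NDVI over the (2*half+1)^2 window centred on (row, col);
    a 3 x 3 window at 10-m resolution covers 30 m x 30 m."""
    win = ndvi[row - half:row + half + 1, col - half:col + half + 1]
    return float(win.mean())

ndvi = np.arange(25, dtype=float).reshape(5, 5) / 25.0   # toy 5 x 5 NDVI grid
value = window_mean(ndvi, 2, 2)                          # centre of the grid
```

Averaging a small window rather than taking a single pixel makes the extracted series less sensitive to residual registration error between the sensors.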
In location A (single crop), the 250-m resolution plot (blue) shows an increase of NDVI in May, which is earlier than in the 10-m resolution plot (red). At 250-m resolution, one pixel becomes a mixed pixel and includes NDVI values of the surrounding fields. At such a resolution, the planting season cannot be estimated accurately, considering that planting is conducted earlier in crop fields with double cropping than in single-cropping fields in order to fit in two growing cycles. By downscaling to 10-m resolution, the planting season can be estimated more clearly by decomposing the mixed pixel.
In location B (fallow land), the NDVI transition at 250-m resolution is higher than that at 10-m resolution. This is because the value at 250-m resolution becomes the average of the surrounding pixels, which have higher NDVI values.
In location C (double crop), the 10-m resolution NDVI plot shows an increase from May to June, a drop in August, and then a rise again in September. This is likely the result of double cropping, since the field camera surveys conducted in previous research also observed the first harvest at the end of July [46,47]. The drop in NDVI values can be seen more clearly at 10-m resolution. However, the downscaled data still seem to have a larger value at the end of August compared to the reference Sentinel-2 data. This implies that the model is generating conservative predictions that are closer to the average value. Modification of the loss function or the use of a Generative Adversarial Network (GAN) architecture could be applied to predict sharper values in a future study [48,49].
A decrease in NDVI can be observed at all locations in March and December. Observed snowfall data were acquired from the Japan Meteorological Agency. The amount of snowfall observed at the Kusatsu Observatory (15 km away from the target area, but with similar meteorological conditions) is shown by the bar graph in Figure 11. The decrease in NDVI is likely caused by snowfall, since intense snowfall can be observed around the dates of the decreased NDVI.
The Sentinel-2 observed NDVI and the predicted NDVI at nearby dates have close values on average. Time-series anomaly detection or filtering could be applied to further increase the accuracy. Also, applications such as land cover classification can be improved by using the high temporal and spatial resolution images generated by the proposed method.
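The time-series filtering suggested above could be as simple as a sliding-window median over the date-ordered NDVI series. The sketch below is one illustrative option, not the filter used in [45]; the window size and toy values are assumptions:

```python
import numpy as np

def rolling_median(series, window=3):
    """Sliding-window median over a date-ordered NDVI series; edge values
    are repeated so the output has the same length as the input."""
    half = window // 2
    padded = np.pad(series, half, mode="edge")
    return np.array([np.median(padded[i:i + window])
                     for i in range(len(series))])

series = np.array([0.30, 0.35, 0.90, 0.40, 0.45])   # 0.90: likely anomaly
smooth = rolling_median(series)
```

The isolated spike is replaced by a neighbourhood-consistent value, while the gradual seasonal trend is preserved. A longer window suppresses more noise at the cost of smearing genuine rapid changes such as harvest events.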
The MODIS NDVI time series (Figure 12a) has sufficient temporal resolution to track the vegetation transition, but the difference between the two fields is unclear due to insufficient spatial resolution. The difference between the two fields can be observed in the time series plot of Sentinel-2 NDVI (Figure 12b), but the transition of vegetation cannot be tracked sufficiently due to insufficient temporal resolution. The NDVI values of the two fields sometimes fall within the same 250-m resolution pixel and take the same value, and even if not, they have similar values due to the mixed-pixel effect, where one pixel becomes the average of the surrounding 250 m × 250 m. By combining Sentinel-2 NDVI with the predicted 10-m resolution NDVI (Figure 12c), both the high spatial resolution needed to distinguish the two fields and the high temporal resolution needed to track the transition of vegetation growth can be achieved. By applying this method, double cropping in field A and single cropping in field B could be observed.


Pros and Cons
In recent years, several studies have focused on the coordination of optical and SAR images. One popular approach is to use SAR, which is available even on cloudy days, to estimate optical features under cloudy conditions. However, a model trained in this manner relies fully on SAR for its prediction, making it unstable against SAR noise or observation error. This worsens the ability of the model to make predictions using data acquired from different observations or dates.
In the proposed model, 250-m resolution NDVI was input into the model together with SAR, which leads to less dependence on the SAR data, under the assumption that the input NDVI is an up-sampled image of the actual NDVI. This can lead to stable predictions for different locations and seasons, even when the SAR data are not ideal.
However, there are still sources of error in the prediction. One is the difference between MODIS and Sentinel-2 NDVI. This can be suppressed to a certain extent by median ensembling of the Terra and Aqua NDVI (see Section 4.2.3). In addition, temporal anomaly filtering or smoothing can be applied to further reduce the difference between MODIS and Sentinel-2 NDVI [45].
A second source of error is the data included in the training set. The quantity and quality of the data necessary for precise prediction can vary according to the features of the target area: in general, an area with high variation of land cover will require more, and more varied, training data. However, for locations that experience rainy and dry seasons, the available optical data will be scarce for the rainy season, even though dynamic change in cropland usually takes place during the rainy season. This mismatch between training data and prediction demand should be considered. One way to overcome this issue is data augmentation [50]; applying data augmentation to the scarce rainy-season data can be effective. Another possible solution is fine-tuning, which has also been successfully applied to SAR images [51]. Applying a model pre-trained with data under similar conditions can enhance accuracy with a small training dataset.
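As one illustration of the augmentation idea, co-registered input/target patch pairs can be expanded with the eight dihedral variants (four rotations, each optionally flipped). This is a hedged sketch assuming single-band 2-D patches for simplicity; the patch size and function name are hypothetical:

```python
import numpy as np

def augment_pair(sar, ndvi):
    """Yield the 8 dihedral variants (4 rotations x optional left-right flip)
    of a co-registered SAR/NDVI patch pair, applied identically to both
    so the geometric correspondence is preserved."""
    for k in range(4):
        r_sar, r_ndvi = np.rot90(sar, k), np.rot90(ndvi, k)
        yield r_sar, r_ndvi
        yield np.fliplr(r_sar), np.fliplr(r_ndvi)

sar = np.arange(16.0).reshape(4, 4)        # toy single-band SAR patch
ndvi = np.arange(16.0).reshape(4, 4) / 16  # toy NDVI target patch
pairs = list(augment_pair(sar, ndvi))      # 8 augmented training pairs
```

Geometric transforms are a natural fit here because NDVI and backscatter have no preferred orientation; the same cannot be said of, e.g., radiometric jitter, which would need more care for calibrated SAR values.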

Effective Conditions for Application
After applying the proposed model to different seasons and locations, it can be concluded that there are two important characteristics for choosing an ideal location for the application. One is the histogram of NDVI, as discussed in Section 4.1.2. The ideal shape of the histogram may differ by application; however, in order to exploit the benefits of enhanced spatial resolution, a high ρ value is preferable (see Section 4.3.2), which is likely to be obtained in an area with an expanded NDVI histogram. The second important characteristic is the ability to explain differences in land cover by VH and VV backscatter within the target area. If the soil and crop areas have the same or similar backscatter characteristics, predicting the difference will also be difficult for the model. Also, the transition of backscatter values differs by crop type [16]. Whether land-cover differences can be explained by backscatter values can be preliminarily confirmed by applying a heuristic linear regression model in the target area, since the results of linear regression and the proposed model were strongly correlated (see Section 4.1.2); this may serve as a simple indicator of the explainability of NDVI from backscatter signals. In addition, unsupervised learning can be applied to confirm the difference in backscatter signals between different land-cover types.

Conclusions
This study proposed a method to downscale the spatial resolution of NDVI from the 250-m MODIS resolution to the 10-m Sentinel-2 resolution. The proposed model used 10-m resolution SAR together with 250-m resolution NDVI as input to learn the concept of downscaling. The proposed model was first trained with Sentinel-2 NDVI pairs (an original 10-m image and an up-sampled 250-m image) acquired in 2019. The experimental results showed that the prediction was reasonably accurate (MAE = 0.090, ρ = 0.734 on average for the test data acquired in 2020). Then, the trained model was applied to downscale NDVI from MODIS NDVI data in Tsumagoi. Even with the difference between MODIS NDVI and Sentinel-2 NDVI, the accuracy of the downscaled NDVI was acceptable (MAE = 0.108, ρ = 0.650 on average). Finally, the model was applied to enhance the temporal resolution (by approximately × 2.5) of NDVI with high spatial resolution (10 m), and it successfully observed the features of the double cropping of cabbage in Tsumagoi.
However, a method to reduce the difference between the original MODIS NDVI and Sentinel-2 NDVI will be needed for further improvement. A method for identifying effective locations for the application is also needed. In summary, this study provides a new approach to produce 10-m resolution NDVI data with acceptable accuracy when cloudless MODIS NDVI and Sentinel-1 SAR data are available. This can be a step toward improving both the temporal and the spatial resolution of vegetation monitoring, and is hoped to contribute to improved food security and production.