Convolutional Neural Network Coupled with a Transfer-Learning Approach for Time-Series Flood Predictions

Abstract: East Asian regions along the North Pacific have recently experienced severe riverine flood disasters. State-of-the-art neural networks are currently utilized as quick-response flood models. Neural networks typically require ample training time because of the numerous datasets involved. To reduce the computational cost, we introduced a transfer-learning approach to a neural-network-based flood model. Under the concept of transfer learning, once a model is pretrained in a source domain with large datasets, it can be reused in other target domains; after retraining only parts of the model with the target-domain datasets, the training time can be reduced through this reuse. A convolutional neural network (CNN) was employed because CNNs with transfer learning have numerous successful applications in two-dimensional image classification. However, our flood model predicts time-series variables (e.g., water level), so the CNN with transfer learning requires a preprocessing tool that converts time-series datasets into image datasets. First, the CNN time-series classification was verified in the source domain with less than 10% error for the variation in water level. Second, the CNN with transfer learning in the target domain efficiently reduced the training time to approximately 1/5 of, and the mean error difference by 15% from, those obtained by the CNN without transfer learning. Our method can provide a novel flood model in addition to physics-based models.


Introduction
The East Asian regions along the North Pacific have recently experienced an increase in catastrophic flood disasters due to larger and stronger typhoons. To reduce and mitigate flood disasters, artificial neural network (ANN) models may be a beneficial tool for accurately and quickly forecasting riverine flood events in localized areas [1], in addition to conventional physical models [2]. In Japan, areas vulnerable to strong typhoons and heavy rainfall have experienced severe riverine flood disasters in the last 3-4 years [3]. An overflow-risk warning system based on real-time observed data has been working successfully for most major rivers in Japan. However, a flood warning system that can forecast with a quick response has not been practically implemented for specific locations along rivers. If the specific time and location of inundation had been forecasted by such a system before the inundation occurred, most people could have evacuated these locations in past flood disasters, and the number of victims might have been reduced. When a forecast flood warning system is developed for practical use, an ANN model with deep learning is a candidate due

Study Site
We selected two watersheds that do not have flood-control facilities such as large dams and ponds (Figure 1). The first watershed is part of the Oyodo River watershed, located in Southwest Japan. The watershed has an area of 861 km², a main river with a length of 52 km that flows to the Hiwatashi gauge station near Oyodo River Dam (31.7870° N, 131.2392° E) downstream, and an elevation that ranges from 118 to 1350 m. The area often experiences heavy rainfall events from summer to early fall due to typhoons. This watershed is named "Domain A" in this study. The other watershed is the Abashiri River watershed in the northeast of Hokkaido. The watershed has an area of 1380 km², a main river with a length of 115 km that flows to its outlet in the North Pacific, and an elevation that ranges from 0 to 978 m. Heavy rainfall events have seldom occurred there due to weak monsoon impacts and few typhoon tracks and approaches. We refer to this watershed as "Domain B".

Data Acquisition
The hourly datasets of rainfall and water level from 1992 to 2017 in Domain A and from 2000 to 2019 in Domain B were obtained from the website of the hydrology and water-quality database [24], managed by the Ministry of Land, Infrastructure, Transport and Tourism in Japan (MLIT Japan), and from the meteorological database [25] of the Japan Meteorological Agency (JMA). The observed datasets in Domain A were obtained from five gauge stations for water level and 11 gauge stations for rainfall; Domain B had four water-level stations and nine rainfall stations. The downstream stations where the model predicts water levels were Hiwatashi for Domain A and Hongou for Domain B (Figure 1). We focused on the historical flood events in each domain. Each flood event has a 123-hour duration, including the time of the maximum flood peak, roughly three days before the peak, and two days after the peak. One maximum peak during each flood event was identified when the water level exceeded a certain criterion: approximately 45% to 60% of the highest peak among all flood events in the observed datasets. Each flood event thus comprises 123 hourly data points. If multiple local peaks exist in one flood event, note that only the highest peak is regarded as the maximum value of the event. The datasets in Domain A and Domain B are listed in Tables 1 and 2, respectively. The geographical and observational characteristics of both domains are shown in Table 3.
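The event-extraction step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code; the exact before/after split of the 123-hour window is an assumption here.

```python
import numpy as np

def extract_event(levels, criterion, before=74, after=48):
    """Cut a 123-hour flood event around the maximum peak that exceeds a
    water-level criterion (roughly three days before and two days after the
    peak; the before/after split used here is an assumption)."""
    levels = np.asarray(levels)
    peak = int(np.argmax(levels))        # highest peak defines the event
    if levels[peak] < criterion:
        return None                      # no flood event in this record
    return levels[peak - before: peak + after + 1]

rng = np.random.default_rng(2)
series = rng.random(1000)                # synthetic hourly water levels
series[500] = 5.0                        # synthetic flood peak
event = extract_event(series, criterion=2.0)   # 123 hourly values
```

Local secondary peaks inside the window are ignored, matching the rule that only the highest peak defines the event maximum.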

ANN and CNN Features
An ANN model usually consists of three layers (input layer, hidden layer, and output layer) and creates a network of neurons. Note that the hidden layer often comprises multiple layers in a deep-learning approach. Each neuron has an activation function that filters input values x with weighted coefficients w into an output value z. If a layer has n neurons, a certain neuron j in the subsequent layer receives n input values. These inputs, weighted by coefficients, are summed and added to a bias b_j:

y_j = Σ_{i=1}^{n} w_{i,j} x_i + b_j. (1)

The activation function f then outputs z_j from the neuron. The structure of the ANN model is shown in Figure 2.
z_j = f(y_j). (2)

A CNN is a type of ANN architecture with a deep-learning approach. We created a relatively simple CNN algorithm based on past studies [16,23]. The variables in Equations (1) and (2) are mapped to a 2D spatial image in the CNN, which consists of a convolution layer (convolution) and a max-pooling layer (pooling). We consider an input variable x of L × L 2D source pixels, which is filtered with a small H × H convolution window holding a table of weighted values. The convolution takes a same-size block of values from the source image and multiplies the weighted values of the H × H window by the filtered values of the source pixels. The filtering is repeated over the source image by shifting the window. Components of the input and the filter are defined as

x = {x_{i,j}}, w = {w_{k,l,n}}, (3)

respectively, where n is the filter index. This process is shown in Figure 3. Note that a zero-padding technique was introduced to ensure that the size of the input is equivalent to that of the output. The x_{i,j} is multiplied by w_{k,l,n} while the window is shifted over grid points (i,j) during the filtering. This equation, including the convolution form and a bias b_n, is expressed as follows:

y_{i,j,n} = Σ_{k=0}^{H-1} Σ_{l=0}^{H-1} w_{k,l,n} x_{i+k,j+l} + b_n. (4)

We use a rectified linear unit (ReLU) as the activation function f, which selects positive input values and improves convergence, as expressed in the following equation.
f(y_{i,j,n}) = max(y_{i,j,n}, 0). (5)
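The zero-padded convolution and the ReLU activation described above can be sketched in NumPy. This is an illustrative toy example (single channel, stride 1), not the authors' implementation; the 3 × 3 identity filter is chosen only so the output is easy to check.

```python
import numpy as np

def conv2d(x, w, b):
    """Single-channel 2D convolution with zero padding and stride 1,
    so the output has the same size as the L x L input."""
    L = x.shape[0]
    H = w.shape[0]
    pad = H // 2
    xp = np.pad(x, pad)                      # zero-padding around the image
    y = np.empty((L, L))
    for i in range(L):
        for j in range(L):
            y[i, j] = np.sum(xp[i:i + H, j:j + H] * w) + b
    return y

x = np.arange(16.0).reshape(4, 4)            # toy 4 x 4 "image"
w = np.zeros((3, 3)); w[1, 1] = 1.0          # identity filter for checking
y = conv2d(x, w, b=0.0)                      # equals x for this filter
relu = np.maximum(y, 0.0)                    # ReLU activation
```

Shifting the H × H window over every grid point (i,j) is exactly the double loop above; a real CNN vectorizes this and stacks many filters n.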
By filtering in the convolution, common features among images are extracted when the filter pattern is similar to a portion of each image. The pooling, which has no weighted coefficients or activation functions, is usually utilized to reduce 2D data and is defined in the following equation:

z_{p,q,n} = max_{(i,j) ∈ U_{p,q}} x_{i,j,n}, (6)

where U_{p,q} is the square unit domain of size R × R and (p,q) are its horizontal and vertical components. U_{p,q} is moved over the 2D image with a non-overlap approach. As the horizontal and vertical scales are shrunk by 1/R, the maximum value of x_{i,j,n} in U_{p,q} provides output with a data size 1/R² times that of the input 2D data. The features extracted by the convolution can keep their consistency by means of the pooling, even if the same real-world features are recognized as different features due to searching errors. Our CNN algorithm had two convolution layers and two pooling layers. Through these processes, smaller 2D image data are generated and then passed to a fully connected layer, which has the same function as Equations (1) and (2). These 2D image data are converted into numerous 1D digital datasets that carry information about the classification of the original image. A softmax function, the normalized exponential function in Equation (7), converts outputs to probability quantities in the output layer and evaluates the binary classification:

softmax(y_k) = exp(y_k) / Σ_m exp(y_m). (7)
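The non-overlapping R × R max pooling described above can be sketched in NumPy as follows (an illustrative toy example, not the authors' code):

```python
import numpy as np

def max_pool(x, R):
    """Non-overlapping R x R max pooling: each axis shrinks by 1/R,
    so the data size shrinks by 1/R^2."""
    L = x.shape[0]
    x = x[:L - L % R, :L - L % R]            # trim to a multiple of R
    x = x.reshape(L // R, R, L // R, R)      # group into R x R blocks
    return x.max(axis=(1, 3))                # maximum within each block

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 1., 1., 0.],
              [2., 0., 0., 9.]])
z = max_pool(x, R=2)                         # 4 x 4 input -> 2 x 2 output
```

Each output value is the maximum of one R × R block, so a feature detected a pixel or two away from its "usual" position still produces the same pooled value, which is the consistency property mentioned above.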
However, our model must predict a specific magnitude of water level instead of a classification. Therefore, we placed the sigmoid function in the softmax position to assign the magnitude from the normalized value (0 to 1) between the maximum and minimum values of the water levels.
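In this output arrangement, the sigmoid maps a raw network output to (0, 1), which is then rescaled to a water level using the minimum and maximum of the observed data. A minimal sketch, where the h_min/h_max values are purely illustrative:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

# Hypothetical normalization range of observed water levels (metres);
# the real values come from the min/max of all observed datasets.
h_min, h_max = 0.5, 10.5

y_out = 0.0                                  # raw output of the last layer
h_pred = h_min + sigmoid(y_out) * (h_max - h_min)   # denormalized level
```

Because sigmoid(0) = 0.5, this example prediction lands exactly mid-range; training moves y_out so that h_pred tracks the observed level.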

Transfer Learning
We assume that two domains exist separately. When a model is trained in a certain (source) domain with large datasets, it usually takes ample time to reach an accurate prediction. If the model is applied to another (target) domain without any connection to the source domain, training it again also takes ample time. Transfer learning is one solution for improving prediction efficiency (e.g., reducing run time) in the target domain [13]: it reuses common knowledge obtained from the source domain for the target domain. An overview sketch is shown in Figure 4. The CNN coupled with a transfer-learning approach (CNN transfer learning) is a recently successful method in image classification. As our model should perform time-series predictions, CNN transfer learning was utilized in this study by means of a conversion tool between time-series data and image data. First, the CNN was run in the source domain (Figure 5a). Second, parts of the hidden layers of the CNN were fixed and reused in the target domain. Last, "the fully connected layer 1" and "the fully connected layer 2" in the deep layers were retrained using the datasets of the target domain (Figure 5b).
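The freeze-and-retrain idea above can be demonstrated with a toy two-layer network in NumPy. This is a conceptual sketch with synthetic data, not the authors' Keras model: a frozen feature extractor stands in for the reused convolution/pooling layers, and a small trainable head stands in for fully connected layers 1 and 2.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pretrained" network: frozen weights reused from the source domain,
# plus a head that is retrained in the target domain.
W_frozen = rng.normal(size=(8, 4))           # fixed (transferred) layer
W_head = rng.normal(size=(4, 1))             # retrained layer

def features(x):
    return np.maximum(x @ W_frozen, 0.0)     # frozen ReLU feature layer

X = rng.normal(size=(64, 8))                 # toy target-domain inputs
t = rng.normal(size=(64, 1))                 # toy target-domain outputs

def mse(W):
    return float(np.mean((features(X) @ W - t) ** 2))

frozen_before = W_frozen.copy()
loss_before = mse(W_head)
for _ in range(20):                          # cf. the ~20 retrainings later
    F = features(X)
    grad = F.T @ (F @ W_head - t) / len(X)   # gradient of the MSE/2 loss
    W_head -= 0.01 * grad                    # only the head is updated
loss_after = mse(W_head)
```

Only W_head changes during retraining, which is why the computational cost drops: far fewer parameters are updated than in full training through all layers.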

Data Conversion from Time-Series to Image
To perform time-series predictions with the CNN, a conversion method between time-series data and image data is necessary. As shown in Figure 6, our method simply arranges the 16 variables (rainfall and water level) at all gauge stations from upstream to downstream on the vertical axis, and the temporal variation of each variable on the horizontal axis. The digital values are then converted to an image by using a black-and-white gradation that ranges from zero to one, normalized by the maximum and minimum values of all observed water-level and rainfall datasets. The spacing between two values on the horizontal axis is one hour; the interval on the vertical axis is one for simplicity. Each image in the CNN has to correspond to a value of the water level as time progresses when training and prediction are conducted. A 16 × 16 image was selected from the 2D image of a flood event due to the limited number of variables, and this square image is used as input data. The information from the input data is associated with a 1 × 1 image (i.e., a value) of the output data at the predicted point. We assumed that the flood-related information in a square image generates the value at the predicted point in the next time step (i.e., in an hour). Note that the predicted point indicates the location where water levels are predicted. This treatment was repeated through the time progression, shifting from the head of the 2D image to the tail, as illustrated in Figure 7. This data treatment was also employed when the LSTM predicted the value one step ahead using the current and 15 past data values of the 16 variables. The LSTM searches for relationships among plural line-based data, whereas the CNN finds spatial patterns in 2D images. The number of variables (i.e., four water-level gauges and nine rain gauges) in Domain B differs from that of Domain A.
To make the image size of Domain B equivalent to that of Domain A, as shown in Figure 6, the image size was expanded from 123 × 13 to 123 × 16 pixels, using interpolation with the Lanczos algorithm [26], which maintains high-resolution images.
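The sliding-window conversion described above can be sketched as follows. This is a toy example with random data, not the authors' code; the 123 × 13 to 123 × 16 Lanczos resize could be performed beforehand with an image library such as Pillow and is not shown here. The row index of the predicted gauge is a hypothetical choice.

```python
import numpy as np

def to_training_pairs(event, target_row, width=16):
    """Slice a (16, 123) event matrix (rows: gauges upstream to downstream,
    columns: hours) into square input images and next-hour targets."""
    n_vars, n_hours = event.shape            # (16, 123) after resizing
    images, targets = [], []
    for t0 in range(n_hours - width):
        images.append(event[:, t0:t0 + width])         # 16 x 16 input image
        targets.append(event[target_row, t0 + width])  # level one hour ahead
    return np.stack(images), np.array(targets)

event = np.random.default_rng(1).random((16, 123))  # values normalized to [0, 1]
X, y = to_training_pairs(event, target_row=15)      # hypothetical gauge row
```

Each 123-hour event thus yields 107 image/target pairs, the window being shifted hour by hour from the head of the 2D image to the tail.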

Computational Setups in CNN and CNN Transfer Learning
Four cases were selected to evaluate the errors between predicted and observed outputs in Domain A, using the CNN, and in Domain B, using CNN transfer learning. These cases involved the datasets with the first- to third-highest peaks in each domain (top three datasets) and two datasets with midlevel peaks, chosen among the other events in each domain. The four test cases used the top three datasets and one midlevel-peak dataset. Note that the other midlevel-peak dataset remains in the validation datasets, so that at least one midlevel-peak dataset is evaluated during the validation process. First, the CNN was trained with the training datasets, excluding the top three datasets and the two midlevel-peak datasets. Second, the trained CNN was checked for overfitting of the loss function, using validation datasets that include the top three datasets and the two midlevel-peak datasets as representatives of most datasets. This validation is usually performed to conduct a no-bias evaluation of a model. Last, a test was performed using one of the top three datasets or the midlevel-peak dataset. These procedures are commonly utilized in machine learning [27,28]. In Domain A, we chose 38 events for training, four events for validation, and one event for testing (Table 4). Domain B had 12 events for training, four events for validation, and one event for testing (Table 5). The programs for the flood models (CNN and CNN transfer learning) were written in Python (version 3.6.4) [29] with the Keras libraries [30] on a Windows OS PC with an Intel(R) Core(TM) i7-4770K CPU @ 3.50 GHz. The setups of several parameters, such as batch size, epoch number, and activation functions, are shown in Table 6.

Results
We verified a typical classification using the CNN as a preliminary examination, which is often used for image analyses based on "true or false" binary classification. The classification in this study was defined as upward or downward trends of water levels. The results showed a satisfactory accuracy rate of 88.8-93.5% for cases A1 to A4 (A1-A4) in the selected datasets of Domain A. However, errors were caused by small up-and-down oscillations that primarily occurred during the early rising phase of a waveform. This verification of the binary classification shows that our CNN can capture the upward and downward trends of flood waves. A detailed explanation is given in Appendix A.
We also evaluated the effect of the pixel-size difference between the original image (123 × 13 pixels) and the resized image (123 × 16 pixels) in Domain B on the CNN predictions of water levels. Predicted waveforms of the resized image in the test datasets of cases B1 to B4 (B1-B4) were similar to those of the original image in another preliminary examination. However, the RMSEs obtained with the resized image were moderately reduced from those with the original image. Therefore, we assumed that the effect of the pixel-size difference is limited and does not dramatically modify the CNN predictions of water levels. A detailed explanation is given in Appendix B.

Verification of CNN Prediction With Source Datasets
The CNN predicted the water levels of A1-A4 in Domain A after training among 38 flood events. Using the validation datasets for each case, including the top three datasets with the highest flood peaks, the RMSEs ranged from 0.14 to 0.73 m, which corresponds to 2.6-6.9% of the variation between the minimum and maximum values among the validation datasets. These errors are acceptable because the top three datasets were excluded from the training process. In the test process for A1-A4, the CNN performed acceptably with a relative error of less than 10% (Figure 8 and Table 7), although the higher peak of each A1-A3 prediction did not follow the observed peak. For example, the A3 prediction of the highest flood peak (3 September 2005) had an RMSE of 0.73 m and a relative error of 6.9%, although the peak height was missed by approximately 20% (Figure 8c). For the dataset with the moderate flood peak (14 August 1999), the CNN in A4 predicted a considerably better shape of the water level, whose relative error was only 2.6% of the total variation in the water level during the event (Figure 8d). These results indicate that the moderate-peak flood of A4 was typical of the datasets prepared for the training process, whereas the higher peak of each A1-A3 dataset was poorly predicted because the A1-A3 datasets were excluded from the training process.
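The two error measures used throughout the verification, the RMSE and the relative error as a percentage of the min-max variation of the water level, can be computed as follows. The water-level values here are illustrative, not the paper's data.

```python
import numpy as np

def rmse(obs, pred):
    """Root-mean-square error between observed and predicted series."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def relative_error(obs, pred):
    """RMSE as a percentage of the variation (max - min) of the
    observed water levels."""
    return 100.0 * rmse(obs, pred) / (max(obs) - min(obs))

obs = [1.0, 2.0, 6.0, 11.0, 6.0, 3.0]        # illustrative water levels (m)
pred = [1.1, 2.3, 5.6, 10.4, 6.2, 3.1]
```

Normalizing the RMSE by the min-max variation makes errors comparable across flood events of very different magnitudes, which is why the tables report both quantities.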

Verification of CNN Transfer Learning with Target Datasets in Domain B
After the verification of the CNN via the training, validation, and testing processes in Domain A, transfer learning on fully connected layers 1 and 2 was conducted in Domain B. The number of retrainings can be an important factor for improving the prediction of the CNN transfer learning. We evaluated how the RMSEs were reduced depending on the number of retrainings. Figure 9 shows the RMSEs for all cases (B1-B4), calculated from the observed and predicted water levels with respect to the number of retrainings. Zero on the horizontal axis means that the CNN was employed without any retraining in Domain B. As the number of retrainings increased, the RMSEs for B1-B4 were reduced, and the retraining converged after approximately 20 iterations. Figure 10 shows the comparison between the predicted and observed water levels in a time series during flood events in the test datasets of B1-B4 after the 20 retrainings. The predicted wave shapes tend to rise earlier than the observed waves before the peaks, and the predicted wave curves had small oscillations or non-smoothness.
In addition, the RMSEs of B1-B4 in the test datasets indicate 0.06-0.15 m, which corresponds to 3.3% to 4.3% of the variation between the minimum and maximum water levels of each dataset (Table 8). These RMSEs are approximately equivalent to, or reduced by at most 20% from, those (0.06-0.20 m) of the CNN prediction with 100 trainings in Domain B, without transfer learning and with resized 123 × 16 pixels (Table 8). The retraining time of the CNN transfer learning was considerably shorter than the computational cost of the CNN without transfer learning because only part of the CNN structure was retrained. During the check of the validation datasets after the 20 retrainings, the RMSEs of B1-B4 ranged from 0.10 to 0.12 m, which corresponds to reasonable values of 2.6% to 3.2% of the variation between the minimum and maximum water levels of each dataset. ¹ Relative error was the same as that in Table 7.

Discussion
The time-series water-level predictions of A1-A4 using the CNN in Domain A showed acceptable agreement with the observed data, although the higher peaks of each of A1-A3 were poorly captured. Comparisons with other neural networks, such as the fully connected deep neural network (FCDNN) [7] and the recurrent neural network (RNN) [31], indicated that our CNN prediction was poorer than the performance of these models. For example, Hitokoto and Sakuraba [7] reported that their FCDNN provided better predictions of water levels for the top-four largest flood events from 1990 to 2014 in the same watershed. The RMSE of an hour-ahead prediction of water level was 0.12 m, approximately 1/3 of the mean RMSE of our CNN prediction; however, two different verification methods were used. The RMSEs of the FCDNN were calculated by the leave-one-out cross-validation method [32], in which three of the top four largest flood events were included as training datasets for each prediction, whereas our CNN excluded the top three datasets from the training process. Therefore, the accuracy of the CNN may not be worse than that of the FCDNN.
In addition to the flood prediction, the time-series predictions in other fields have been performed utilizing the CNN. Temporal predictions of the precipitation occurrence, as performed by the CNN, showed that the accuracy exceeded 90% in the first lead time based on the quadrant classification, which is based on the combination of yes and no occurrences in prediction and observation [23]. Another example of stock market prediction by the CNN that reported 50% accuracy was achieved in the upward or downward trend of 31 stocks [22]. Although the backgrounds and computational setups of these studies completely differ from those of our study, the preliminary numerical examination of the CNN predictions based on binary classification (i.e., upward or downward value of water level) achieved approximately 90% accuracy. Therefore, our temporal prediction by the CNN can produce high accuracy.
Our method uses the CNN coupled with a transfer-learning approach in time-series predictions. It may be original, based on a literature review of machine-learning-based flood models over the last two decades [33]. The results of the CNN transfer learning after 20 retrainings show that the RMSEs were less than 5% of each variation in water level in the test datasets in Domain B. These errors are slightly reduced from those of the CNN in Domain A. However, some specific differences exist between the predicted and observed water levels in Domain B that were not observed in the CNN predictions in Domain A. For example, the predicted flood peaks were slightly underestimated, by only 5-10% of the variation of the observed water levels, in B1-B3 (Figure 10). The slight underestimation of the wave peaks in B1-B3 may be caused by a tendency inherited from the CNN trained in Domain A; that is, the CNN was substantially fitted to flood waveforms with steeper peak slopes.
Quantitative improvement in the CNN transfer learning appeared in the reduction of computational costs. The RMSEs of the CNN transfer learning are slightly reduced from those of the CNN without transfer learning (i.e., a single use of the CNN) in Domain B (Table 8). The former had 20 retrainings of "the fully connected layer 1" and "the fully connected layer 2" of the CNN, which means that the number of parameters in the training process was also reduced (Figure 5b). The latter required full training of 100 iterations (i.e., epochs = 100) through all layers of the CNN (Figure 5a). On our computer resources, the computational cost was reduced to approximately 18% of that of the CNN without transfer learning. Thus, the beneficial aspect of the CNN transfer learning is that the computational costs are substantially lower than those of the CNN without transfer learning once the CNN is well verified in Domain A (source domain). The CNN must first be pretrained; once pretrained, it can be applied to other domains.
A few studies of time-series prediction have been performed using the LSTM with a transfer-learning approach. For example, Laptev et al. [14] successfully reduced the prediction errors in a target dataset of time-series electricity loads obtained from a US electric company, using transfer learning with multiple layers of the LSTM. For error evaluation, they reported that the symmetric mean absolute percentage error (SMAPE) in small-size datasets (up to a maximum of 20% of the total datasets) in the target domain was improved to approximately 1/3 of the SMAPE for the total datasets. Their LSTM was thus likely to improve the accuracy even in small-size datasets with transfer learning. Although comparing our results is difficult due to the different types of datasets and different methods of error evaluation, the accuracy improvement in a target domain (Domain B) using transfer learning was achieved with an approximately 15% reduction in the mean error difference between the CNN with transfer learning and that without transfer learning in this first trial of our study. Note that the mean error difference is defined in the following equation:

mean error difference (%) = 100 × (E_NONTL − E_TL) / E_NONTL,

where E_TL is the mean RMSE with transfer learning (TL) and E_NONTL is the mean RMSE without transfer learning (NONTL). The CNN transfer learning in Domain B may need to further reduce the RMSEs compared with those from a single use of the CNN without the transfer-learning approach. The RMSEs of the CNN transfer learning could be smaller because the transfer-learning approach in this study was simple (i.e., retraining only fully connected layers 1 and 2). In general, if some deeper layers close to the end of the CNN are retrained, the transfer learning can be more effective for improving predictions (e.g., Wang et al. [34]). For example, we may perform sensitivity tests of retraining additional layers, such as max-pooling 2 and fully connected layers 1 and 2 (Figure 5a).
Therefore, the next step of our study can be to retrain additional layers of the CNN in Domain B.
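The mean error difference used above can be computed as follows. The RMSE values in this sketch are illustrative stand-ins chosen within the ranges reported in the text, not the actual per-case results.

```python
def mean_error_difference(rmse_tl, rmse_nontl):
    """Percentage reduction of the mean RMSE with transfer learning (TL)
    relative to the mean RMSE without it (NONTL)."""
    e_tl = sum(rmse_tl) / len(rmse_tl)
    e_nontl = sum(rmse_nontl) / len(rmse_nontl)
    return 100.0 * (e_nontl - e_tl) / e_nontl

# Illustrative RMSEs (m) for four test cases, with and without transfer
# learning; the individual values are hypothetical.
tl = [0.06, 0.10, 0.15, 0.12]
nontl = [0.06, 0.13, 0.20, 0.12]
med = mean_error_difference(tl, nontl)
```

A positive value means transfer learning reduced the mean error; with these stand-in values the reduction comes out near the 15% scale discussed above.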

Conclusions
We created a new model that consists of the CNN with a transfer-learning approach and a conversion tool between image datasets and time-series datasets. The model can predict time-series floods an hour ahead in the target domain (Domain B) after pretraining the CNN with numerous datasets in the source domain (Domain A). As the first trial of the proposed method, we note the following findings:

1. CNN binary classification of upward/downward trends of water levels provided highly accurate predictions in a preliminary examination.

2. CNN time-series predictions in Domain A had less than 10% errors relative to the total variation of water levels in each test dataset.

3. CNN with transfer learning in Domain B reduced the RMSEs as the number of retrainings increased, and the RMSEs after 20 retrainings were slightly lower than those of the CNN without transfer learning in Domain B, with a substantially reduced computational cost.
As our method is at a beginning stage, future works require some extensions of the CNN-based flood modeling with the transfer-learning approach, as follows:

• Lead time in the prediction should be extended from an hour to three to six hours, based on the time lags of the watersheds.

• The best retraining process in the deep layers should be investigated.

Appendix B
Resizing pixels in the 2D image of Domain B may affect image classification using the CNN. We compared CNN predictions with the original image (123 × 13 pixels) and those with the resized image (123 × 16 pixels). Figure A2 shows the water levels in the four test datasets, comparing the observed data, the prediction with the original image, and the prediction with the resized image. The two CNN-predicted waveforms are similar in each of B1-B4. As shown in Table A2, the RMSEs between observed data and predictions in the testing of B1-B4 with the original image range from 0.19 to 0.22 m, whereas the RMSEs with the resized image range from 0.06 to 0.20 m. The resized image reduced the RMSEs by 4.5% to 38% relative to the original image. This suggests that an expanded image may provide better predictions in CNN time-series classifications, possibly due to the additional information gained from the larger pixel size.