Application of Artificial Neural Network and Information Entropy Theory to Assess Rainfall Station Distribution: A Case Study from Colombia

Abstract: An assessment of the rainfall station distribution in the mountainous area of the Regional Autonomous Corporation of Cundinamarca (CAR, for its acronym in Spanish), Colombia, was conducted by applying concepts from information entropy and artificial neural networks (ANNs). This study was divided into two phases: first, a classification of the meteorological stations using two-dimensional self-organizing maps (SOMs); second, the evaluation of the performance of the ANN by applying concepts of information entropy. Three scenarios were defined for the classification of the meteorological stations by adjusting the number of neurons in the output layer. A high number of neurons in the output layer caused the model to over-fit by emphasizing differences among patterns. When comparing the results of the scenarios, certain characteristics and features were found to persist in the system, validating the model classification. Subsequently, the results of the first scenario were used to evaluate the entropy of the historical series. Finally, the results show that the area of study presents a lack of information due to the uncertainty associated with the probabilistic arrangement, which can be corrected with the developed model. Consequently, some recommendations for the redesign of the rainfall network are provided.


Introduction
An appropriate rainfall network is fundamental when planning watershed management strategies because it must capture and supply reliable spatial and temporal precipitation data needed for the design, construction, and operation of hydraulic structures such as urban stormwater drainage systems [1]. The design of a rainfall network consists of determining the number and location of stations. The combined approach of ANN and SOM is recommended for the design of rainfall networks where there are large-scale requirements and random criteria for station location, which make conventional methods inappropriate. Such random criteria result in monitoring stations being located in redundant sites while other areas are neglected. In the region studied in this paper, the lack of a single design criterion is explained in part by the fact that stations were installed at different times over the last eight decades. Thus, it is relevant to assess the performance of the rainfall network to determine whether there is such redundancy in the location of the monitoring stations. Consequently, the rainfall station distribution as of 2016, under the jurisdiction of the Regional Autonomous Corporation of Cundinamarca (CAR, for its acronym in Spanish) in Colombia, was evaluated using concepts of information entropy and ANNs to provide recommendations for the redesign of the rainfall network in the studied mountainous region.

Characteristics of the Studied Region
The studied region covers 18,615 km²; there are 104 municipalities, of which 98 belong to the Department of Cundinamarca, and six are under the jurisdiction of the Department of Boyacá and the rural area of the City of Bogota, Colombia. Although this territory is mostly for agricultural and livestock use (Soacha, Central Sabana, and West Sabana), there exists a major industrial and mining development plan for Ubaté, Cundinamarca. Figure 1 shows the map of the studied region, which is divided into nine watersheds, including the Sumapaz, Bogota, Magdalena, Black, Minero, Ubaté and Suarez, White, Gachetá, and Machetá rivers. Table 1 lists the corresponding areas for each of these watersheds. Topographically, 30% of the study area is located at heights between 2500 and 3000 m above sea level (Figure 2), which forms the so-called high mountain area in the region of Bogota Sabana and the Ubaté-Chiquinquirá valley with its adjacent hillsides. On the other hand, about 16% of the area corresponds to the Andean peaks, with heights of 3000 m above sea level [30]. In the study area, the bimodal regime of rains predominates, typical of the Andean region, which extends over the western slope and the highlands of the Eastern Cordillera. On the eastern slope of the mountain range, the rain regime is monomodal.
The spatial and temporal variations in precipitation in the study area are governed by three climatological phenomena. The first is the circulation of the atmosphere through the equatorial zone, which is affected by the Intertropical Convergence Zone (ITCZ), where warm and humid air currents converge from the large high-pressure belts of the southern and northern hemispheres, giving rise to large water-laden clouds. In most of the region, the movement of the ITCZ causes, during the year, a double maximum and a double minimum of rainfall, associated with other meteorological elements during the wet and dry seasons. The second climatological phenomenon is due to the circulation of air masses arising locally due to thermal differences, which produces cloudiness and precipitation in the upper parts of the valleys and clear skies in the center of them; at night, this phenomenon is reversed. This phenomenon is influenced by the shape and orientation of the terrain, altitude, vegetation, and presence of water [31]. A third phenomenon is that of the southeast trade winds from the Orinoquía (Eastern border of Colombia with Venezuela), which blow with higher intensity from June to September, discharging large amounts of moisture onto the eastern side of the Cordillera Oriental and causing a maximum rainfall from June to August [32].
Regarding the minimum number of rain gauges in the network located in the studied area, the World Meteorological Organization (WMO) proposes a specific number of stations depending on the physiographic unit of the region where the network will be installed. Thus, a minimum density of one station per 250 km² is recommended for mountainous areas [33].
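As a rough check of the WMO criterion, the arithmetic below applies the guideline of one station per 250 km² to the 18,615 km² of the studied region. It is only an illustrative sketch: it treats the entire jurisdiction as mountainous, which is an assumption, and the function name `min_stations` is hypothetical.

```python
import math

def min_stations(area_km2: float, km2_per_station: float) -> int:
    """Smallest whole number of stations so that each one covers at most km2_per_station."""
    return math.ceil(area_km2 / km2_per_station)

STUDY_AREA_KM2 = 18_615    # area of the studied region reported in the text
WMO_MOUNTAIN_KM2 = 250     # WMO guideline for mountainous areas [33]

required = min_stations(STUDY_AREA_KM2, WMO_MOUNTAIN_KM2)
print(required)  # 75
```

Under this simplification the guideline is met with 75 evenly spread stations, so a raw station count alone says little; the spatial arrangement of the stations is what the rest of the study examines.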

Methods
The study involved two phases. In the first phase, a classification of the meteorological stations was conducted using two-dimensional SOMs for the studied region. In the second phase, we took advantage of the fact that information entropy can not only represent the uncertainty of the rainfall distribution but also reflect the correlation and the transmission of information between rainfall stations [34]; on this basis, the performance of the rainfall gauge network of the Cundinamarca region was evaluated.

Meteorological Network Data
The meteorological network data were gathered from CAR [35] and its area of jurisdiction. The CAR has a historical record of 182 stations for measuring precipitation distribution in the studied area, with historical records since 1931. Some of these stations have been installed recently. Thus, the historical series is not uniform in all cases. Of the total of registered stations, 37% corresponds to rainfall stations, 28% to pluviographic stations, 15% to primary meteorological stations, 14% to secondary meteorological stations, and 5% to the remaining automatic and satellite stations. Gathered data included the meteorological station locations with coordinates and the historical record of monthly precipitation.

Data Processing
Most meteorological stations were installed in the sixties, and the CAR has operated the rainfall network since 1961. There are currently 47 non-active stations and 135 active stations registered under the CAR jurisdiction. Figure 3a depicts the number of stations installed per decade from the 1930s to the 2010s. The rainfall records in the studied meteorological stations are not homogeneous in terms of the amount of information. Because the stations in the analyzed network came into operation at different times, the period with the highest possible number of stations with available data was selected. Thus, the stations holding more than 80% of monthly data for not less than twenty years were selected for this study (Figure 3b). The monthly records were enumerated, starting with the oldest of the series from January 1927 to the most recent one. They were checked one by one to find out which stations had information about the precipitation in every particular month.
The obtained information was plotted, as shown in Figure 3b. The period between 1986 and 2008 was found to have the most substantial number of active stations. As a result, ten stations with at least 80% available monthly data were selected for this study.
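The selection rule described above (at least 80% of monthly data over a window of no less than twenty years) can be sketched as a simple completeness filter. This is a minimal illustration under stated assumptions, not the authors' procedure: the station identifiers and the synthetic series are hypothetical, missing months are encoded as NaN, and the 80% limit is treated as inclusive.

```python
import numpy as np

def select_stations(records, min_years=20, min_fraction=0.80):
    """Return the station ids whose monthly series meets the completeness rule.

    records: dict mapping station id -> 1-D sequence of monthly totals over a
             common window, with np.nan marking missing months.
    """
    selected = []
    for station, series in records.items():
        series = np.asarray(series, dtype=float)
        n_months = series.size
        if n_months < 12 * min_years:
            continue  # window shorter than the required number of years
        available = np.count_nonzero(~np.isnan(series)) / n_months
        if available >= min_fraction:
            selected.append(station)
    return selected

# Synthetic example: a 23-year window (276 months), as in 1986 to 2008.
a = np.full(276, 100.0); a[::10] = np.nan   # about 10% missing
b = np.full(276, 100.0); b[::3] = np.nan    # about 33% missing
c = np.full(120, 100.0)                     # only 10 years of data
demo = {"A": a, "B": b, "C": c}
print(select_stations(demo))  # ['A']
```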

Development of the Artificial Neural Network Model
The classification was performed using the MATLAB Neural Network Toolbox and the SOM approach, which identifies the homogeneous regions with more precision than the K-means and Ward methods, two of the most commonly used classification methods [4]. The networks used are made up of an input layer, in which the input patterns are entered into the model, and an output layer, where the weights of neurons are updated based on the input patterns. Moreover, the output layer is the two-dimensional space that is self-organized based on the structure of the input patterns. Thus, each of the neurons in the input layer connects to all the neurons in the output layer.
Although there are no lateral connections between neurons in the same layer, updating the output layer weights based on the neighborhood of the winning neuron creates a similarity link between nearby neurons that leads to the grouping or self-organization of neurons with similar characteristics.
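The competitive update described above can be illustrated with a small NumPy sketch of a Kohonen map. It is not the MATLAB Neural Network Toolbox implementation used in the study: the rectangular lattice, the Gaussian neighborhood, the linear decay of the learning rate and radius, and all function names are assumptions made for illustration, and the input patterns are synthetic.

```python
import numpy as np

def train_som(patterns, rows, cols, n_iter, lr0=0.5, seed=0):
    """Train a small self-organizing map and return its weight matrix.

    Assumptions (not from the paper): rectangular lattice, Gaussian
    neighborhood, linearly decaying learning rate and radius.
    """
    rng = np.random.default_rng(seed)
    patterns = np.asarray(patterns, dtype=float)
    weights = rng.random((rows * cols, patterns.shape[1]))
    # Lattice position of each output neuron, used only to define neighborhoods.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        decay = 1.0 - t / n_iter
        lr = lr0 * decay
        sigma = max(sigma0 * decay, 0.5)
        x = patterns[rng.integers(len(patterns))]
        # Winning neuron: the one whose weights are closest to the input pattern.
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))
        grid_d2 = np.sum((grid - grid[winner]) ** 2, axis=1)
        influence = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Neurons near the winner on the lattice move toward the pattern as well.
        weights += lr * influence[:, None] * (x - weights)
    return weights

def best_matching_units(patterns, weights):
    """Index of the winning neuron for every pattern."""
    patterns = np.asarray(patterns, dtype=float)
    dists = np.linalg.norm(patterns[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Synthetic, normalized patterns forming two well-separated groups.
demo = np.array([[0.05, 0.10], [0.10, 0.05], [0.90, 0.95], [0.95, 0.90]])
som = train_som(demo, rows=3, cols=3, n_iter=9 * 500)  # 500 iterations per neuron
bmu = best_matching_units(demo, som)
print(bmu)
```

Because neighbors of the winner are updated together, patterns from the same group end up represented by the same or adjacent neurons, which is the self-organization effect the paragraph describes.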
In this case, a two-dimensional array was made, setting the number of rows and columns of the array of neurons. To accurately differentiate pattern groups, it is recommended to use neuron arrays in the output layer as large as possible. However, it is essential to note that if the number of neurons in the output layer is quite large, the model can over-train and highlight the differences between each of the patterns, yielding the same number of groups as patterns [36].
Consequently, and considering the above, three scenarios were defined with different numbers of neurons in the output layer. The number of iterations is given, according to González-Cuéllar [36], by the number of neurons in the output layer multiplied by 500. Finally, with the help of the MATLAB Neural Network Toolbox, a computational application was developed to classify the rainfall stations of the studied region, considering the defined typologies and the built-in visualization of the results. To prevent differences of scale between variables from affecting the classification, the variables were transformed using the relative change and difference normalization approach so that their ranges were homogeneous and the results comparable [4]. Table 2 shows the applied transformation for each variable to obtain values between zero and one.

Input Variable Transformation
Large neuron arrays were used in the output layer to differentiate the patterns of groups adequately. However, it is essential to note that if the number of neurons in the output layer is very large, the model may over-fit and highlight the differences between each of the patterns, generating as many groups as patterns [36,37]. Three scenarios with different numbers of neurons in the output layer were defined (100, 400, and 900 neurons for Types 1, 2, and 3, respectively) as per preliminary studies by the authors [17,36]; thus, the number of iterations was estimated by the number of neurons in the output layer multiplied by 500 (50,000, 200,000, and 450,000 iterations for Types 1, 2, and 3, respectively).
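The iteration counts quoted above follow directly from the rule of 500 iterations per output neuron. The short sketch below reproduces that arithmetic; the dictionary and function names are hypothetical.

```python
ITERATIONS_PER_NEURON = 500  # rule reported in the text [36]

# Output-layer sizes of the three scenarios described above.
SCENARIO_NEURONS = {"Type-1": 100, "Type-2": 400, "Type-3": 900}

def training_iterations(n_neurons, per_neuron=ITERATIONS_PER_NEURON):
    """Number of training iterations for a map with n_neurons output neurons."""
    return n_neurons * per_neuron

budget = {name: training_iterations(n) for name, n in SCENARIO_NEURONS.items()}
for name, iterations in budget.items():
    print(f"{name}: {iterations:,} iterations")
```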
As mentioned earlier, a computer application was developed using the MATLAB Neural Network Toolbox to classify rainfall stations of the CAR based on the classification types defined in each of the three developed scenarios. Type-1 was based on a Kohonen network model of 100 neurons distributed in an array of ten rows and ten columns. The number of iterations was defined as the number of neurons in the output layer multiplied by 500 (i.e., 50,000 iterations). Type-2 was based on a Kohonen network model of 400 neurons distributed in an array of 20 rows and 20 columns. Thus, 200,000 iterations were used. Type-3 was based on a Kohonen network model of 900 neurons distributed in an array of 30 rows and 30 columns. Consequently, 450,000 iterations were used. In all three cases, a hexagonal topology, with neurons also in hexagonal shape, was chosen. Thus, neurons that are not on the edges of the Kohonen layer have six neighboring neurons that are connected virtually.
Water 2020, 12, 1973
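The statement that interior neurons of a hexagonal layer have six neighbors can be checked with a small sketch. The indexing below uses an "odd rows shifted right" offset layout, which is an assumption chosen for illustration; the internal layout of the MATLAB toolbox may differ, and the function name is hypothetical.

```python
def hex_neighbors(r, c, rows, cols):
    """Lattice neighbors of neuron (r, c) on a hexagonal grid.

    Assumed layout: offset coordinates with odd rows shifted half a cell right.
    """
    if r % 2 == 0:
        offsets = [(0, -1), (0, 1), (-1, -1), (-1, 0), (1, -1), (1, 0)]
    else:
        offsets = [(0, -1), (0, 1), (-1, 0), (-1, 1), (1, 0), (1, 1)]
    return [(r + dr, c + dc) for dr, dc in offsets
            if 0 <= r + dr < rows and 0 <= c + dc < cols]

def count_six_neighbor_neurons(rows, cols):
    """How many neurons in the layer have the full set of six neighbors."""
    return sum(len(hex_neighbors(r, c, rows, cols)) == 6
               for r in range(rows) for c in range(cols))

for name, size in {"Type-1": 10, "Type-2": 20, "Type-3": 30}.items():
    print(name, count_six_neighbor_neurons(size, size))
```

In this layout only the neurons away from the border reach six neighbors, that is, 64 of 100, 324 of 400, and 784 of 900 neurons for the three scenarios.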

Performance Evaluation of the Rainfall Network in the Cundinamarca Region
The information entropy is a measure of the uncertainty of a specific outcome in a random process [20]. The concept of entropy has been used to investigate the variability associated with monthly, seasonal, and annual series of precipitation, and thus, characterize the precipitation to generate formulations on the efficient management of rainfall water [29]. In a study by Lohani et al. [38], artificial neural networks and the neuro-fuzzy system were used to forecast monthly inflow in a reservoir; results from this study were useful to understand how water supply and flood control measures can be generated from these models.
By using the information entropy concept, this section describes the distribution of information in each of the developed groups. The time series retained for each station covered no less than 20 years, with missing data not exceeding 20% [28]. The marginal entropy for each station was calculated by Equation (1):

$$H(x) = -\sum_{k} p(x_k) \log p(x_k) \quad (1)$$

where k is the discrete data interval, $x_k$ is the result corresponding to the interval k, and $p(x_k)$ is the probability of $x_k$. For each station, an estimated series of predicted values was obtained by multiple linear regression from the data of the meteorological stations in the same group. Thus, the marginal entropy for the estimated series was also calculated by Equation (1). Subsequently, the joint entropy between actual and estimated values for each station was calculated by Equation (2):

$$H(x, y) = -\sum_{k} \sum_{l} p(x_k, y_l) \log p(x_k, y_l) \quad (2)$$

where l is the discrete data interval for the estimated values, $y_l$ is the result corresponding to the interval l, and $p(x_k, y_l)$ is the joint probability of $x_k$ and $y_l$. The mutual information is the amount of information about one process that is contained in another process [28]. In this case, it corresponds to the rainfall information contained in one station and simultaneously in the others of its own group, and it is calculated from the values found previously by Equation (3):

$$I(x, y) = H(x) + H(y) - H(x, y) \quad (3)$$

where H(x), H(y), and H(x,y) are the marginal entropy of the actual data, the marginal entropy of the simulated series, and the joint entropy, respectively. As the fundamental basis of monitoring network design using the information entropy approach, stations should share as little mutual information as possible; that is, stations should be independent of each other. Low mutual information values indicate that those stations are more independent and share little information, while high mutual information values represent those with more dependency.
Therefore, there may be no need for more stations in that area. It is recommended to install additional rainfall stations where mutual information values are close to zero. Table 3 shows the classification of mutual information according to the obtained values. This criterion was used to produce the corresponding recommendations for a proper redesign of the rainfall network in the studied area based on precipitation variability and rainfall antecedents, as suggested by Mishra [29] and Chang et al. [39], respectively.

Table 3. Mutual information classification.
Mutual Information | Range | Index
Above average | >2.0 | Excess
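The entropy-based procedure of this section (a regression estimate from stations in the same group, followed by marginal entropy, joint entropy, and mutual information) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the number of bins, the use of the natural logarithm, and all function names are assumptions, and the series are synthetic. Because the resulting values depend on the binning and on the logarithm base, they are not directly comparable with the ranges in Table 3.

```python
import numpy as np

def regress_from_group(target, neighbours):
    """Estimate a station's series by multiple linear regression on its group."""
    target = np.asarray(target, dtype=float)
    design = np.column_stack([np.ones(len(target)), neighbours])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design @ coef

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution; zero cells skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(x, y, n_bins=10):
    """Return H(x), H(y), H(x, y) and I(x, y) = H(x) + H(y) - H(x, y)."""
    counts, _, _ = np.histogram2d(x, y, bins=n_bins)
    p_xy = counts / counts.sum()
    h_x = entropy(p_xy.sum(axis=1))   # marginal entropy of the actual series
    h_y = entropy(p_xy.sum(axis=0))   # marginal entropy of the estimated series
    h_xy = entropy(p_xy.ravel())      # joint entropy
    return h_x, h_y, h_xy, h_x + h_y - h_xy

# Synthetic checks: a series shares all of its information with itself ...
x = np.arange(100.0)
mi_same = mutual_information(x, x)[3]
# ... and none with a series that is independent of it.
u = np.repeat(np.arange(10.0), 10)
v = np.tile(np.arange(10.0), 10)
mi_indep = mutual_information(u, v)[3]
print(round(mi_same, 4), round(abs(mi_indep), 4))  # 2.3026 0.0

# Regression step on synthetic neighbours that explain the target exactly.
n1 = np.arange(20.0)
n2 = (n1 ** 2) % 7
estimate = regress_from_group(1.0 + 2.0 * n1 + 3.0 * n2, np.column_stack([n1, n2]))
```

In the first check the mutual information equals the marginal entropy ln 10, the upper bound for ten equally filled bins; in the second it vanishes, which is the situation the text associates with independent stations.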

Results and Discussion
As mentioned earlier, the studied region encompasses 18,615 km², where 30% is located at altitudes between 2500 and 3000 m above sea level, and 16% of the area is mountainous. The region is divided into nine second-order basins, where the Bogota river basin is the largest. The precipitation has a bimodal behavior, a period of rain and drought. During the period from December to March, rainfall is equivalent to only 19% of the annual average. The rainiest months are October and April, with 15 and 16% of the total annual precipitation, respectively. The driest months are January and February, and rains are distributed throughout the year (142 days on average).

Classification of Rainfall Stations
As mentioned in Section 2.2.2 Data Processing, ten stations with at least 80% of available monthly data were selected for this study. Table 4 shows the normalized input variables for the selected ten stations and their corresponding normalization.

Table 4. Normalized input variables for the first ten selected rainfall stations (S1, S4, S8, S9, S10, S13, S15, S16, S18, and S22).

The input variables used for the classification were the annual rainfall (mm), elevation (m), latitude (m), longitude (m), monthly precipitation (mm), and standard deviation of the annual rainfall in each meteorological station. The values were transformed using the relative change and difference normalization approach, previously shown in Table 2, for obtaining homogeneous and comparable ranges.

Scenario Configurations
The classification method by the SOM approach offers the advantage of indicating the results on two-dimensional maps regardless of the number of variables included using the map of Hits and the distance between neurons by the SOM neighbor weight distances (U-matrix map). Three scenarios were analyzed, as described in previous sections. In the Type-1 scenario, the winning neurons were identified in the map of Hits shown in Figure 4a, where the number within neurons indicates the number of stations represented (i.e., the number of wins for each neuron). Neurons with a value of zero are those that pose no pattern, in this case, no station. On the other hand, the U-matrix map shows how different a neuron is from another. Consequently, it is possible to identify the groups in which the information is divided.
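The two maps described above can be computed from a trained weight matrix in a few lines. The sketch below is a simplified illustration, not the toolbox plots used in the study: it assumes a rectangular lattice with four lattice neighbors per neuron, and the weights and patterns are small synthetic arrays with hypothetical names.

```python
import numpy as np

def hit_map(patterns, weights, rows, cols):
    """Number of patterns (stations) won by each neuron, as a rows x cols array."""
    patterns = np.asarray(patterns, dtype=float)
    dists = np.linalg.norm(patterns[:, None, :] - weights[None, :, :], axis=2)
    winners = np.argmin(dists, axis=1)
    return np.bincount(winners, minlength=rows * cols).reshape(rows, cols)

def u_matrix(weights, rows, cols):
    """Mean weight-space distance from each neuron to its lattice neighbors."""
    w = np.asarray(weights, dtype=float).reshape(rows, cols, -1)
    u = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = [np.linalg.norm(w[r, c] - w[rr, cc])
                     for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                     if 0 <= rr < rows and 0 <= cc < cols]
            u[r, c] = np.mean(dists)
    return u

# Synthetic 2 x 2 map: the top row and the bottom row are far apart in weight space.
weights = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
patterns = np.array([[0.1, 0.0], [0.0, 0.9], [0.1, 1.1], [5.0, 5.9]])
hits = hit_map(patterns, weights, 2, 2)
umat = u_matrix(weights, 2, 2)
print(hits.tolist())  # [[1, 2], [0, 1]]
```

A neuron with zero hits represents no station, and large U-matrix values mark the boundaries between groups, which is how the groups are read from the maps in each scenario.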
Likewise, Figure 4b depicts the map of Hits for the Type-2 scenario, where the winning neurons are identified. Winning neurons are more dispersed in Type-2 than in Type-1 due to the increased number of neurons in the network. Furthermore, the U-matrix map of distances between neurons for Type-2 is provided in Figure 5b. In this case, the demarcation is more noticeable than in Type-1. Fifty groups were distinguished; some of these were formed by non-winning neurons and were therefore irrelevant to the study. Thus, only 47 groups remained.
Finally, the map of Hits for the Type-3 scenario is shown in Figure 4c. Winning neurons are scattered similarly to the Type-2 case due to the excessive number of neurons in the network. This outcome indicates that the groups will consist of fewer stations. In addition, the U-matrix map for the Type-3 scenario shows darker areas marking the division between the sets, as shown in Figure 5c. In this case, the demarcation is more noticeable than in Type-1. Sixty-five groups were distinguished. Some of these consisted of non-winning neurons and were found irrelevant to the study; thus, only 56 groups were used.

Mutual Information Classification Analysis
The results obtained for the different scenario types, in which at least one neuron is the winner, are 13, 47, and 56 groups formed in Type-1 (100 neurons in the output layer), Type-2 (400 neurons), and Type-3 (900 neurons), respectively. By increasing the number of neurons, the number of groups formed also increases. Therefore, considering a clustering process, a high number of neurons in the output layer can result in the over-training of the model, emphasizing the differences amid patterns; thus, in the Type-3 scenario, with 900 neurons, about 55% of the stations were classified individually. It is noteworthy that when making a comparative analysis of the different scenarios, specific patterns remain in the classification. A particular case is station S140, which was classified individually in all scenarios, and station S80, which was classified individually in Types 1 and 3.
For the mutual information classification, the input variable to measure the entropy was monthly precipitation, as explained in previous sections. Although the CAR rainfall network has historical records of monthly rainfall since 1931, the length of the series for each of the stations varies according to the installation date. The amount of input data in each group varies based on the available information. Consequently, a minimum of 120 datasets was established to execute the classification procedure; thus, the mutual information classification of each combination of stations for each of the groups formed was performed. The results show that the stations present an information deficit, indicating that each of them is independent. This outcome makes it unfeasible to rebuild the historical series of precipitation of each station from other stations with which it shares the same group. As a result, groups formed in Type-1 were used, as shown in Table 5.

Table 5. Distribution of stations by groups.
It is recommended to include this correlation in future rainfall network studies considering a possible reconstruction of the historical series using information from one station to the other. Thus, this allows the relocation of one of the stations to places with fewer stations. The same situation occurs between stations S42 (San Jorge) and S129 (Doña Juana), S76 (Monserrate) and S122 (Esclusa), as well as S16 (Tres Esquinas) and S101 (El Hato No. 2), which were classified into similar groups in different types. Conversely, stations S80 (Las Margaritas) and S140 (Central No. 2) were classified individually. Thus, their information could not be reconstructed from other nearby stations. Consequently, it is recommended that the Regional Autonomous Corporation of Cundinamarca, in the case of a future re-engineering of the network, should not relocate these stations since their information is unique, with valuable historical records since 1959.
It was found that the northeastern part (corresponding to the municipalities of Yacopí and Puerto Salgar) and the south of the studied area (corresponding to the municipality of Cabrera) have low coverage of meteorological stations, so it is recommended to increase the number of stations in these areas. In this case, the CAR should evaluate transferring some stations classified under the same group in all three scenarios, such as S16, S28, S42, S76, S101, S122, S129, and S142. Moreover, despite the high number of stations in the central part of the studied area, the transfer of stations other than those mentioned above is not recommended since, with this method, it is not possible to reconstruct the information that each of these stations provides to neighboring stations.
Spatial analysis tools were used to construct Figures 7 and 8, which show interpolated surfaces calculated using the kriging method within ArcMap software. As discussed earlier, kriging is a method of applied spatial analysis that allows the estimation of values in unsampled locations using the information provided by the sample [4][5][6][7]. Figure 7 shows the distribution of mutual information in terms of entropy from values of precipitation of 80 stations within the studied region from 1986 onwards.
It is noted that the variation of mutual information shown in Figure 7 is not homogeneous, presenting higher values in the center and north of the study area (>0.89). These results may be justified for the regions with a large number of stations; however, the values found for mutual information are deficient.
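The kriging interpolation behind Figures 7 and 8 can be illustrated with a minimal ordinary kriging sketch. It is not the ArcMap procedure used in the study, whose settings are not reported here: the exponential variogram, its sill and range, the coordinates, and the station values are all assumptions chosen for illustration.

```python
import numpy as np

def exponential_variogram(h, sill=1.0, range_=1.0):
    """Assumed variogram model without nugget: gamma(h) = sill * (1 - exp(-h / range))."""
    return sill * (1.0 - np.exp(-np.asarray(h, dtype=float) / range_))

def ordinary_kriging(coords, values, target, variogram=exponential_variogram):
    """Estimate the value at `target` from sampled points; returns (estimate, weights)."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(values)
    pair_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    # Kriging system: variogram between samples, bordered by the unbiasedness constraint.
    system = np.ones((n + 1, n + 1))
    system[:n, :n] = variogram(pair_dist)
    system[n, n] = 0.0
    rhs = np.ones(n + 1)
    rhs[:n] = variogram(np.linalg.norm(coords - target, axis=1))
    solution = np.linalg.solve(system, rhs)
    weights = solution[:n]            # the last entry is the Lagrange multiplier
    return float(weights @ values), weights

# Hypothetical stations on the corners of a unit square with synthetic values.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [0.2, 0.4, 0.6, 0.8]
center_estimate, center_weights = ordinary_kriging(coords, values, (0.5, 0.5))
corner_estimate, _ = ordinary_kriging(coords, values, (1.0, 0.0))
print(round(center_estimate, 6), round(corner_estimate, 6))  # 0.5 0.4
```

The weights always sum to one, and without a nugget the estimate at a sampled location reproduces the observed value, so the interpolated surface passes through the station values.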
As mentioned in Section 2.1 Characteristics of the studied region, the complex topography typical of the Andean region can cause a considerable variation in precipitation; thus, when comparing nearby stations, low values for mutual information are obtained. Consequently, it can be stated that the criterion of proximity does not determine the homogeneity of the records. It should be noted here that the topographic factor is of great importance to explain the spatial variability of rain in the study area [31], considering that 30% of the study area is located at heights between 2500 and 3000 m above sea level, which makes up the so-called High Mountain area. In comparison, 16% of the area corresponds to the so-called Andean peaks (3000 m above sea level). Figure 7 also shows that the rainfall gauge network in the studied area presents an information deficiency. Only two stations, E32 (Santa Isabel) and E154 (Campobello), which are in the remote municipalities of Tabio and Madrid, respectively, yielded acceptable mutual information values (between 1 and 1.5). However, it is not recommended to relocate either of these stations based on this analysis but to expand the network coverage, especially in the northeastern and southern parts of the studied area. Figure 8 shows the map of isohyets corresponding to the distribution of the annual precipitation values in the studied area, which is characterized by the influence of the geography of the Andean region. By analyzing the map of isohyets, a heterogeneous behavior and a gradient in the South-North direction are identified, with cumulative annual rainfall values increasing from 1200 to 2400 mm.
This map of isohyets can support future estimation of design rainfalls, especially in ungauged areas of Colombia, where the lack of readily available, processed information often becomes an obstacle to hydrological studies [40]; it also illustrates the predictive capabilities of data-driven modeling applied to hydrology [41].
The heterogeneous behavior of the precipitation and the gradients observed in Figure 8 correspond to the mutual information values shown in Figure 7. This relationship can be established because heterogeneity in the distribution of precipitation means that the historical record of a station can behave differently from the records of its neighboring stations.
As a result, deficient mutual information values are found when one station is compared with another. As stated above, the complex Andean topography is decisive for the distribution of rainfall and, hence, for the mutual information values.
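The comparison between pairs of stations described above can be sketched in a few lines. The example below is a minimal illustration, not the procedure used in this study: it assumes that mutual information is computed as T(X, Y) = H(X) + H(Y) - H(X, Y) from series discretized into fixed-width bins. The 50 mm bin width, the base-2 logarithm, and the three monthly series are hypothetical choices made only for the illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def discretize(series, bin_width):
    """Map each precipitation value to the index of its fixed-width bin."""
    return [int(value // bin_width) for value in series]


def mutual_information(series_x, series_y, bin_width=50.0):
    """T(X, Y) = H(X) + H(Y) - H(X, Y) for two equally long series."""
    if len(series_x) != len(series_y):
        raise ValueError("series must cover the same time steps")
    x = discretize(series_x, bin_width)
    y = discretize(series_y, bin_width)
    joint = list(zip(x, y))  # joint outcomes, one pair per time step
    return entropy(x) + entropy(y) - entropy(joint)


# Hypothetical monthly totals (mm) for three illustrative stations.
station_a = [20, 80, 130, 170, 140, 60, 40, 50, 90, 160, 150, 70]
station_b = [25, 85, 135, 175, 145, 65, 45, 55, 95, 165, 155, 75]  # tracks A
station_c = [110, 30, 60, 20, 90, 140, 170, 120, 40, 70, 10, 160]  # different regime

mi_ab = mutual_information(station_a, station_b)
mi_ac = mutual_information(station_a, station_c)
print(f"T(A, B) = {mi_ab:.3f} bits")
print(f"T(A, C) = {mi_ac:.3f} bits")
```

In this sketch, station B follows station A closely, so the pair shares all of the information in A's discretized record, whereas station C shares less. With series this short, the estimate is biased upward, so values from such a toy example should not be compared with those in Figure 7.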
The SOM approach was used to identify homogeneous areas from the precipitation time series. This choice is justified because the method has previously been compared, through experimental design, with two of the most commonly used classification methods: Ward's method and K-means.
All three methods were tested on experimental datasets for which both the number of groups and their members were known in advance. The accuracy of the groups obtained by the SOM method was 100%, whereas the accuracy of the K-means and Ward methods was 97% and 95%, respectively. This comparison indicates that the SOM approach recovered the exact number of groups and the elements belonging to them; thus, the results are validated [4]. Such a comparison has also been the objective of other studies, in which the SOM approach identified homogeneous regions with more precision than the most widely used classification methods, K-means and Ward [42].
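The source does not state how the accuracy values were computed. One common definition, shown here only as a sketch, scores a clustering against known groups after finding the best one-to-one matching between cluster labels and true groups. The function name `clustering_accuracy` and the example labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_groups, predicted_groups):
    """Best-match accuracy between known groups and cluster labels.

    Cluster labels are arbitrary, so every one-to-one relabelling of the
    predicted clusters is tried and the highest agreement is returned.
    Brute force: only suitable for a small number of groups.
    """
    if len(true_groups) != len(predicted_groups):
        raise ValueError("label lists must have the same length")
    true_ids = sorted(set(true_groups))
    pred_ids = sorted(set(predicted_groups))
    # Pad with placeholders so surplus clusters can stay unmatched.
    size = max(len(true_ids), len(pred_ids))
    targets = true_ids + [None] * (size - len(true_ids))
    best = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[p] == t for t, p in zip(true_groups, predicted_groups))
        best = max(best, hits)
    return best / len(true_groups)


# Hypothetical example: nine elements in three known groups.
known = [0, 0, 0, 1, 1, 1, 2, 2, 2]
perfect = ["b", "b", "b", "c", "c", "c", "a", "a", "a"]
one_error = ["b", "b", "c", "c", "c", "c", "a", "a", "a"]

print(clustering_accuracy(known, perfect))
print(clustering_accuracy(known, one_error))
```

Under this definition, a method reaches 100% only when it recovers both the number of groups and every membership, which is consistent with how the comparison above is described.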

Conclusions
By combining self-organizing maps and the concept of information entropy, it was possible to evaluate the distribution of the rainfall station network of the Cundinamarca region. The application of ANNs to classify rainfall stations in the studied region showed that the stations could be grouped into 13, 47, or 56 groups, depending on the number of neurons in the output layer.
Three scenarios were defined for the classification of the meteorological stations by varying the number of neurons in the output layer. The results show that increasing the number of neurons increased the number of groups formed: Type-1, with 100 neurons in the output layer, yielded 13 groups; Type-2, with 400 neurons, yielded 47 groups; and Type-3, with 900 neurons, yielded 56 groups. Consequently, in a clustering process, a high number of neurons in the output layer can over-train the model.
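The effect of the output-layer size can be illustrated with a small self-organizing map. The sketch below is not the model used in this study: it assumes a square grid, counts a group as the set of samples that share a best-matching neuron, and uses randomly generated data. The learning-rate and neighborhood schedules, the grid sizes, and all function names are hypothetical choices kept small so the example runs quickly.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((wi - xi) ** 2 for wi, xi in zip(weights[i], x)))


def train_som(samples, rows, cols, epochs=20, seed=0):
    """Train a rows x cols SOM and return its weight vectors (row-major)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    sigma0 = max(rows, cols) / 2.0
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac) + 0.01       # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5  # shrinking neighborhood radius
            br, bc = divmod(best_matching_unit(weights, x), cols)
            for idx, w in enumerate(weights):
                r, c = divmod(idx, cols)
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                for k in range(dim):
                    w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


def count_groups(samples, rows, cols):
    """Number of distinct output neurons that win at least one sample."""
    weights = train_som(samples, rows, cols)
    return len({best_matching_unit(weights, x) for x in samples})


# Hypothetical data: 30 "stations" described by 4 normalized features.
data_rng = random.Random(42)
stations = [[data_rng.random() for _ in range(4)] for _ in range(30)]

for side in (2, 4, 6):  # the study used 100, 400, and 900 output neurons
    groups = count_groups(stations, side, side)
    print(f"{side * side:3d} neurons -> {groups} occupied neurons (groups)")
```

Because the number of groups can never exceed the number of output neurons or the number of samples, a very large output layer lets the map separate samples almost individually, which is the over-training behavior noted above.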
The best mutual information values were expected at stations in the central zone of the studied area because of its high station density. However, the results show that the same trend of deficient values was present throughout the rest of the studied area. Thus, it can be concluded that proximity between stations does not guarantee the homogeneity of the information provided; it would be unwise to relocate any of these stations under this criterion.
Because of the low mutual information values, it is recommended to integrate the information available from the CAR with that of other entities, such as the Institute of Hydrology, Meteorology and Environmental Studies of Colombia (IDEAM, for its acronym in Spanish), to complement the station data in the same area and expand network coverage. Detailed maps of isohyets such as the one shown in this study can be used in both regional and global hydrological studies to develop water management strategies, considering the lack of readily available processed data in Colombia.
Finally, it is recommended to extend the coverage of the rainfall network in the northeastern and southern parts of the studied area using a new group of stations with similar characteristics, based on the results of the classification by ANNs. However, this relocation must be assessed in detail because the mutual information values obtained for the same stations are deficient. Consequently, a future study should assess the mutual information of these stations with their nearest neighbors. If acceptable mutual information values were found in that case, the transfer would be feasible.