Assessing Lightning and Wildfire Hazard by Land Properties and Cloud to Ground Lightning Data with Association Rule Mining in Alberta, Canada

Hotspot analysis was implemented to find regions in the province of Alberta (Canada) where high-frequency Cloud to Ground (CG) lightning strikes cluster together. Generally, hotspot regions are located in the central, central east, and south central parts of the study region. About 94% of annual lightning occurred during warm months (June to August) and the daily lightning frequency was influenced by the diurnal heating cycle. The association rule mining technique was used to investigate frequent CG lightning patterns, which were verified by similarity measurement to check the patterns' consistency. The similarity coefficient values indicated that there were high correlations throughout the entire study period. Most wildfires (about 93%) in Alberta occurred in forests, wetland forests, and wetland shrub areas. It was also found that lightning and wildfires occur in two distinct areas: frequent wildfire regions with a high frequency of lightning, and frequent wildfire regions with a low frequency of lightning. Further, the preference index (PI) revealed locations where wildfires occurred more frequently than in other class regions. The wildfire hazard area was estimated with the CG lightning hazard map and specific land use types.


Introduction
Cloud to Ground (CG) lightning is a common meteorological hazard in Canada. It is a leading cause of fatalities, injuries, property damage, forest fires, and interruptions to business, and it can damage almost every electrical or electronic system. About 9 to 10 lightning-related deaths and 19 to 164 injuries occur each year in Canada, costing between $3.6 and $79.2 million [1]. Lightning activity is a major natural ignition source for wildfires globally and has been parameterized as a driving factor in simulated wildfire models [2,3]. An average of 816 fires are ignited by lightning each year. These fires cause an average of $16.4 million in property damage, and between 3900 and 5300 insurance claims are estimated to be filed for lightning-related property damage (excluding fires) each year [4]. Around 75% of all forest fires are caused by lightning, and these fires account for about 85% of the total area burned in Canada [5]. CG lightning is the single largest cause of transients, faults and outages in electric power transmission and distribution systems in lightning-prone areas, and lightning is a major cause of electromagnetic interference that can affect all electronic systems [6].
Over the past few decades, a series of studies have focused on understanding lightning activity in many countries and areas using different lightning location systems. In Finland, the accuracy of a lightning location system was verified using natural lightning strikes to trees [7]. The severity of lightning damages based on different sorts of trees was

Terrain Elevation and Terrain Slope Data
We obtained terrain elevation data for our study area (the Province of Alberta, Canada) from the Shuttle Radar Topography Mission (SRTM) 1 Arc-Second Global data set [35]. The absolute vertical accuracy of the elevation data is 16 m (at 90% confidence). The SRTM radar system gathered data intended to produce the most accurate and complete topographic map of the Earth's surface assembled up to that time. Elevation is given in meters, referenced to the WGS84/EGM96 geoid. We then calculated terrain slope from this digital elevation model using ArcGIS software (ESRI, Redlands, CA, USA) [36].
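As an illustration of how slope can be derived from a gridded elevation model, the following minimal sketch uses central differences on a small grid. It is not the ArcGIS algorithm; the function name `slope_degrees`, the toy grid, and the 30 m cell size (an approximation for 1 arc-second data on a projected grid) are assumptions made for this example only.

```python
import math

def slope_degrees(dem, cell_size):
    """Slope in degrees for interior cells of a DEM (list of rows).

    Uses simple central differences; border cells are left as None.
    """
    rows, cols = len(dem), len(dem[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Rate of elevation change in the x and y directions
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            out[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return out

# Hypothetical 3 x 3 grid rising 30 m per 30 m cell in the x direction
dem = [[30.0 * c for c in range(3)] for _ in range(3)]
slope = slope_degrees(dem, cell_size=30.0)
print(slope[1][1])
```

A surface that rises one cell size per cell, as in this toy grid, corresponds to a 45 degree slope at the interior cell.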

Land Uses Data
The Land Uses (LU) map (Figure 1) covers all areas of Alberta at a spatial resolution of 30 m. The LU classes follow the protocol of the Intergovernmental Panel on Climate Change (IPCC) and consist of: Forest, Water, Cropland, Grassland, Settlement and Other land (barren land, ice, rock and unclassified). The LU maps were prepared using existing source data, including a variety of land cover (LC) and crop maps and various topographic layers such as Buildings and Structures, Hydrography, Industrial and Commercial Areas, Transportation and Wetlands from "Canada Vectors" (CanVec), a digital cartographic product in vector format supplied by Natural Resources Canada (NRCan). Details about the data and their accuracy are given in the Data Product specifications [37].

Soil Types Data
The Soil Landscapes of Canada data (SLC version 3.2, Soil Landscapes of Canada Working Group, Ottawa, ON, Canada; Figure 2) are the latest revision of the Soil Landscapes of Canada, which was developed by Agriculture and Agri-Food Canada to provide information about the country's agricultural soils at the provincial and national levels. SLCs provide GIS coverage that shows the major characteristics of soil and land for the whole country. The information is organized according to a uniform national set of soil and landscape criteria based on permanent natural attributes. The SLCs are based on existing soil survey maps that have been recompiled at 1:1,000,000 scale. SLC polygons may contain one or more distinct soil landscape components and may also contain small but highly contrasting inclusion components. From these data, we obtained the distinct soil type of each polygon as classified by the Canadian System of Soil Classification, 3rd edition [38]. There are ten classes in the Canadian System of Soil Classification, and the major levels used in the classification include Order, Great Group and Subgroup. We used the first level (Order) within the System to distinguish soil types [38,39].

Canadian National Fire Database (CNFDB)
The Canadian National Fire Database (CNFDB) is a collection of wildfire data from various sources; these data include fire locations (point data) and fire perimeters (polygon data) as provided by Canadian fire management agencies (provinces, territories, and Parks Canada) [40]. The properties of the wildfires are composed of agency information (province, territory, parks) illustrating which agency collected the fire data, wildfire occurrence dates, coordinates, fire size (hectares), cause of fire as reported by each agency, and fire types. These data will be used in the application section (Section 7) to demonstrate how the frequent pattern results from data mining can be utilized. We only use lightning-caused wildfire data, among all possible different causes of fire, in order to know the impact of the CG lightning strikes on wildfire outbreak.

Spatial Distribution of Lightning Strikes Activity (Hotspot Analysis)
The research purpose of this section is to investigate how CG lightning is distributed spatially over Alberta. We count CG lightning frequency in each municipal boundary to see which areas have more lightning. The study is based on the cloud to ground (CG) lightning data from 2010 to 2014. Hotspot analysis enables us to know not only where high-value regions cluster together but also where low-value regions cluster. Several processes precede actually carrying out hotspot analysis. The flow chart of the processes is below (Figure 3). We will summarize these processes first and then we will explain each step in greater detail. All processes are performed by ArcGIS 10.3 software (ESRI, Redlands, CA, USA) [36].

First, using the cloud to ground (CG) lightning data for the study period from 2010 to 2014, we counted the lightning strikes that occurred in each of the 84 predefined Alberta municipalities. To compensate for size variations, we normalized the number of CG lightning flashes within each district by dividing it by the district's area. The next step in hotspot analysis is to specify a threshold distance for neighboring features, which defines a neighborhood. The neighborhood consists of the features that are analyzed together to assess local clustering, and it is defined by a threshold distance band. To set the threshold distance band, we used the Incremental Spatial Autocorrelation tool [41] in ArcGIS 10.3 [36]. This tool finds the distances at which spatial clustering is most pronounced by testing incremental distances. It requires both a starting distance and a distance increment, which we selected with the help of the 'Calculate Distance Band from Neighbor Count' tool in ArcGIS 10.3. That tool reports the minimum, average, and maximum distances at which each feature has one neighbor.
As mentioned, we need to find a reasonable distance value that includes the possible neighborhood features. Measuring spatial autocorrelation is one way to find the appropriate distance. Global Moran's I [41] measures spatial autocorrelation based simultaneously on feature locations and feature values, and it indicates whether a set of features is clustered, dispersed or random. To identify the spatial pattern of the normalized incidence of CG lightning, we performed hotspot analysis (one of the spatial analysis tools in ArcGIS 10.3), which identifies statistically significant hotspots where high incidence values cluster together. The hotspot analysis calculates the Getis-Ord $G_i^*$ statistic [42] as follows:

$$G_i^* = \frac{\sum_{j=1}^{n} w_{i,j}\, x_j - \bar{X} \sum_{j=1}^{n} w_{i,j}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{i,j}^2 - \left(\sum_{j=1}^{n} w_{i,j}\right)^2}{n - 1}}}$$

where $x_j$ is the attribute value for feature $j$, $w_{i,j}$ is the spatial weight between features $i$ and $j$, and $n$ is the total number of features:

$$\bar{X} = \frac{\sum_{j=1}^{n} x_j}{n}, \qquad S = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{X}^2}$$

The $G_i^*$ statistic is a Z-score, so no further calculation is required. A Z-score expresses a value in units of standard deviations. This score serves as a basis for deciding whether we can reject the null hypothesis.
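The Getis-Ord statistic can be sketched in a few lines of plain Python. This is a minimal illustration of the standard formula, not the ArcGIS implementation; the function name `getis_ord_gi_star`, the six-feature toy data, and the binary "adjacent on a line" weights are assumptions made for this example.

```python
import math

def getis_ord_gi_star(values, weights):
    """Return the Gi* z-score of every feature.

    values  : attribute value x_j of each feature
    weights : weights[i][j] is the spatial weight w_ij (self included)
    """
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        w = weights[i]
        sum_w = sum(w)
        sum_w2 = sum(x * x for x in w)
        numerator = sum(wj * xj for wj, xj in zip(w, values)) - mean * sum_w
        denominator = s * math.sqrt((n * sum_w2 - sum_w ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores

# Hypothetical normalized strike densities for six districts on a line
values = [8.0, 9.0, 7.0, 1.0, 2.0, 1.0]
# Binary weights: a district neighbors itself and its adjacent districts
weights = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(6)] for i in range(6)]
z = getis_ord_gi_star(values, weights)
print([round(v, 3) for v in z])
```

In this toy case the second district sits inside a cluster of high values and receives a positive score, while the fifth sits inside a cluster of low values and receives a negative score of the same magnitude.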

Temporal Distribution of Lightning Strikes Activity
The maximum and minimum CG lightning occurrence densities differ slightly among studies, but the temporal distribution is in agreement across several studies conducted on global and regional scales [11,14,16–18,20,21,23,25,43–45]. In lightning pattern research conducted in different countries and areas, the vast majority of lightning over land occurred during the warm months (May to October), with a strong peak in cloud to ground lightning in June and July and a seasonal minimum in the cold months (December to February). In addition, lightning strikes appear to be influenced by the diurnal heating cycle. We assumed that the temporal distribution of CG lightning over Alberta was similar to that of other regions. To identify the temporal trend of lightning over Alberta precisely, we divided all lightning occurrence counts by month and distributed each month's count by hour. The resulting graphs, tables and interpretations are shown in the results section (Section 5.2).
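The monthly and hourly tabulation described above can be sketched as follows. The function names and the four sample timestamps are hypothetical and serve only to show the counting step; real strike records would be read from the lightning data set.

```python
from collections import Counter
from datetime import datetime

def monthly_hourly_counts(timestamps):
    """Count strikes per month and per (month, hour) pair."""
    by_month = Counter(t.month for t in timestamps)
    by_month_hour = Counter((t.month, t.hour) for t in timestamps)
    return by_month, by_month_hour

def monthly_percentages(by_month):
    """Share of all strikes falling in each calendar month, in percent."""
    total = sum(by_month.values())
    return {m: 100.0 * by_month.get(m, 0) / total for m in range(1, 13)}

# Hypothetical strike times (local time), for illustration only
strikes = [
    datetime(2012, 7, 1, 17),
    datetime(2012, 7, 2, 18),
    datetime(2012, 6, 3, 17),
    datetime(2012, 1, 5, 3),
]
by_month, by_month_hour = monthly_hourly_counts(strikes)
pct = monthly_percentages(by_month)
print(pct[7], by_month_hour[(7, 17)])
```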

Discovery of Frequent CG Lightning Patterns Using CFP-Growth++ Algorithm
We carried out experiments to identify the implicit lightning frequency patterns under given conditions. These given conditions are expressed in the four land physical properties (i.e., Elevation, Slope, Land Uses, and Soil Type). We subdivided each property into 10 classes of elevation data, 10 classes of slope data, 15 classes of land uses data, and 10 classes of soil type data. We used the Jenks natural breaks optimization method [46] for dividing elevation and slope data. The land uses classes follow the protocol of the Intergovernmental Panel on Climate Change (IPCC) and consist of: Forest, Water, Cropland, Grassland, Settlement and Other land (barren land, ice, rock and unclassified). There are distinct types of soil according to the Canadian System of Soil Classification [38]. Therefore, each lightning strikes record has those four different properties.
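To make the class-break step concrete, the sketch below finds breaks that minimize the within-class sum of squared deviations by dynamic programming, which is the objective of natural breaks classification. It is an illustrative implementation only and may differ from the one in ArcGIS; the function name `natural_breaks` and the sample values are assumptions.

```python
def natural_breaks(values, k):
    """Upper bound of each of k classes minimizing within-class variance."""
    data = sorted(values)
    n = len(data)
    # Prefix sums allow the squared deviation of any slice in O(1)
    ps, ps2 = [0.0], [0.0]
    for v in data:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)

    def ssd(i, j):
        """Sum of squared deviations from the mean of data[i:j]."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = cost[c - 1][i] + ssd(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    back[c][j] = i
    # Walk back through the table to recover the class upper bounds
    bounds, j = [], n
    for c in range(k, 0, -1):
        bounds.append(data[j - 1])
        j = back[c][j]
    return bounds[::-1]

# Hypothetical elevation samples (m) with three obvious groups
breaks = natural_breaks([1, 2, 3, 10, 11, 12, 50, 51], k=3)
print(breaks)
```

This exact search costs on the order of k times n squared operations, so for a full raster it would normally be run on a sample of cell values.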
The main purpose of this experiment is to identify the relationships between those four properties and lightning strikes, that is, to see whether there are any particular frequent patterns and whether those patterns are consistent throughout the study period. We implemented the main theory of CFP-Growth++ but modified the way minimum item support (MIS) values are assigned. With the original MIS functions, the user-specified MIS values are chosen by iterative calculation to generate the best result. However, the appropriate method of finding the MIS value for each item differs slightly in our case because the items have unequal coverage areas. It is therefore possible for an item with a larger coverage area to have more lightning strikes than an item with a smaller coverage area, even though the larger area has a lower rate of lightning strikes per unit area. The modified MIS functions account for the uneven coverage area of each item and take the rare item problem into account as well. They are defined in terms of the following quantities. The value $f(a_i)$ is the actual frequency (or the support expressed as a percentage of the data set size) of item $a_i$ in the data. The value LS is the user-specified lowest minimum item support allowed. β (0 ≤ β ≤ 1) is a parameter that controls how the MIS values for items should be related to their frequencies. The modification concerns the β value. Originally, β was a user-specified value found by iterative calculation to give the best MIS values. This is not only a non-standard way of choosing the MIS value, but it also fails to consider the different coverage area sizes. The value $A_{a_i}$ is the coverage area of item $a_i$, while $A_T$ is the total study area. By using the modified β value, we can fit the MIS to the characteristics of each individual item.
The modified MIS function prevents the problem mentioned above, in which an item with a larger coverage area has more lightning strikes than an item with a smaller area even though the larger area has fewer lightning strikes per unit area. With the non-modified MIS functions, such a large-area item with few lightning strikes per unit area could be extracted as a frequent item. Therefore, we use the modified MIS function in this research.
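For orientation, the general shape of a frequency-based MIS assignment (M = β·f, floored at LS) can be sketched as below. The sketch takes one β per item as an input rather than deriving it from coverage areas, so it does not reproduce the area-based modification itself; the function names, the item labels, and all numeric values are hypothetical.

```python
def minimum_item_support(freq, beta, ls):
    """MIS of one item: beta * frequency, but never below the floor LS."""
    m = beta * freq
    return m if m > ls else ls

def assign_mis(freqs, betas, ls):
    """MIS for every item, given per-item frequencies and beta values."""
    return {item: minimum_item_support(freqs[item], betas[item], ls)
            for item in freqs}

# Hypothetical item frequencies (fraction of records) and per-item betas
freqs = {"forest": 0.40, "settlement": 0.02}
betas = {"forest": 0.5, "settlement": 0.9}
mis = assign_mis(freqs, betas, ls=0.05)
print(mis)
```

Passing β per item keeps the sketch neutral about how the area ratio enters the calculation: any rule that maps an item's coverage area to a β in [0, 1] can be plugged in by building the `betas` dictionary accordingly.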

Hotspot Analysis (Spatial Distribution of CG Lightning)
The hotspot analysis was conducted to identify statistically significant hotspot regions where high-incidence CG lightning counts cluster together across the 84 predefined Alberta municipalities. The resulting map shows clear spatial patterns: most hotspot regions were located in the central, central east, and south central regions of the study area, and coldspots are more typically found in northern areas. There are areas in which lightning strikes occur frequently or rarely, and there are also areas in which locations with statistically significant high or low numbers of lightning strikes cluster together.
Since we used only the spatial location of lightning strikes to identify the presence or absence of spatial patterns, we cannot explain what types of topographical characteristics influence the spatial patterns. We are therefore limited in describing the precise relationship between lightning strikes and topographical characteristics (land properties). For this reason, we use an Association Rule Mining technique to identify the kinds of factors, and their combinations, that are related to the locations where CG lightning flashed. We do so using four different topographical characteristics, which enables us to find the frequent patterns of the four factors (land properties) in the lightning strikes data.

Temporal Distribution of CG Lightning Activity
Table 1 shows the monthly percentages of the total lightning strike count. A total of 93.8% of annual lightning occurred in the warm months of June through August. Lightning activity was at its peak in July (45.7%). There was almost no lightning activity (0.0019%) in the cold months of December through February. Lightning strike density steadily increases from about 10:00 local time (MST) to afternoon peaks occurring between 17:00 and 20:00 local time (MST). It then declines steadily to a morning minimum between 04:00 and 11:00 local time. Table 1 also shows that during the study period about 99.5% of all lightning strikes in the province of Alberta occurred between May and September.

Frequent CG Lightning Patterns Using CFP-Growth++ Algorithm
To find frequent patterns of CG lightning over Alberta, we used the CFP-Growth++ algorithm, but with the modified MIS function described in Section 4.3. We divided the frequent patterns of the data mining results into five classes using the Jenks natural breaks optimization method. The classes are designated 'High Risk', 'Risk', 'Moderate', 'Low Risk' and 'No Risk'. The average number of frequent patterns is about four hundred annually out of fifteen hundred given all possible combinations. We present only the 'High Risk' and 'Risk' results for the research period (2010-2014) here (Table 2).

Table 2. Frequent CG lightning patterns in the 'High Risk' and 'Risk' classes.

Year and class    Elevation  Slope  Land Uses  Soil Types  Support count
2011 High Risk    5          12     25         39          7450
                  5          11     29         44          6269
                  3          13     25         39          6126
                  5          13     25         39          5502
                  2          11     26         38          4738
                  6          12     25         39          4437
                  6          13     25         39          4109
2011 Risk         3          11     26         39          3328
                  6          11     25         39          2628
                  6          14     25         39          2566
                  5          14     25         39          2413
                  4          11     30         44          2410
                  4          11     30         43          2304
                  4          12     30         44          2250
                  3          11     29         43          2205
                  4          11     29         43          1993
2012 High Risk    5          12     25         39          10,321
                  5          11     29         44          8530
                  5          13     25         39          7769
                  3          13     25         39          7153
                  6          12     25         39          6702
                  6          13     25         39          6268
2012 Risk         2          11     26         38          5101
                  3          11     26         39          4556
                  6          11     25         39          3929
                  6          14     25         39          3887
                  5          14     25         39          3628

The frequent patterns, sorted in descending order of support count, are quite similar among all years of the study period. However, we need to measure statistically how similar the patterns are in order to demonstrate that CG lightning on our four land properties is consistent or highly similar throughout the study years. In addition, if we can find consistent CG lightning patterns on those land properties, we can use these frequent patterns for many applications in a variety of fields. In Section 6 (Verifications), we present a way of measuring similarity for our result sets and report the similarity value for each pair of result sets.
Each code number represents a category of land properties (i.e., Elevation, Slope, Land Uses, and Soil Types) and we explain all code numbers and their properties in Appendix A.

CG Lightning Hazard Maps
Based on the frequent tuples and support counts in Section 5.3, we generated CG lightning hazard maps that illustrate which regions of high CG lightning risk combine the four different land properties. The procedure for generating CG lightning hazard maps for the period from 2010 to 2014 is as follows. First, we collect all CG lightning data from 2010 to 2014. Each CG lightning record has a pattern, which is a sequence of four different land properties, and the support count of that pattern. The CG lightning data are also geo-referenced, so we can distribute the lightning points over Alberta. These distributed CG lightning points show where CG lightning occurs more frequently and which combinations of land properties are more closely related to CG lightning strikes over Alberta. To make a continuous map of CG lightning hazards, we use the Inverse Distance Weighted (IDW) interpolation tool [47,48]. High CG lightning risk areas are similarly positioned throughout the study years (2010-2014, Figures 6 and 7). They are located in southern Alberta, along the east side of the Rocky Mountains. We can identify their corresponding land properties and support counts (how much more frequent they are than other patterns) in the frequent pattern results from the data mining.
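The idea behind inverse distance weighting can be shown with a minimal sketch. It omits the search radius and neighbor limits that GIS tools typically offer, and the function name `idw`, the power of 2, and the two sample points are assumptions made for this example.

```python
import math

def idw(points, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y).

    points : iterable of (px, py, value), e.g. strike locations with
             the support count of their frequent pattern as the value
    """
    numerator = 0.0
    denominator = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0.0:
            # The query location coincides with a sample point
            return value
        w = 1.0 / d ** power
        numerator += w * value
        denominator += w
    return numerator / denominator

# Two hypothetical sample points with values 10 and 20
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint = idw(samples, 1.0, 0.0)
print(midpoint)
```

Evaluating `idw` on every cell center of a regular grid yields a continuous surface; nearer samples dominate more strongly as `power` increases.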

Similarity Measure (Consistency Test) for Results
In this section, we measure the similarity between two frequent itemsets results from five years results. Similarity is a quantity that reflects the strength of relationship between two objects or two features. This quantity usually ranges between −1 and +1, or is normalized to between 0 and 1. Distance measures dissimilarity [49].
If the frequent patterns from the data mining algorithm are consistent or highly similar throughout the years, we can conclude that the lightning data have specific patterns and that these patterns are homogeneous. Our data mining result is of multivariate categorical (nominal) data type. For continuous and bivariate data, the notion of similarity is relatively well established, but for categorical and multivariate data, the similarity computation is not straightforward. For continuous data, the Minkowski Distance is a general method used to compute the distance between two continuous multivariate points [50]. In our case, however, we are not measuring the similarity between two individual results for a year; rather, we are measuring the similarity between two different sets. In other words, we need to measure the similarity between two clusters (sets) composed of discrete (nominal) multivariate data. There are many criteria in the literature (see [51–54]) for comparing two sets or groups with a view to measuring their similarity. However, these methods have some limitations. For instance, the method in [52] may not permit overlapping or joint clusters (sets), and the methods in [53,54] may measure similarity for overlapping clusters but assume that a cluster does not contain duplicates. Given these limitations, the conventional methods in the literature are not applicable to our frequent itemset results. We explain our data type and structure first (Table 3) and then propose an improved method of measuring similarity that overcomes these limitations. The variables (Var1 to Var4) are independent of each other, and each describes one land property. There are many joint (overlapping) frequent tuples within each result. For example, in the 2010 results (Table 3, left), the Elevation property 5 in the first pattern is duplicated in the second. Furthermore, the support count of each frequent pattern implies that there are duplicates.
There are more details about the limitations in measuring similarity between two groups in the literature [52–54].

Table 3. Frequent patterns and support counts of two result sets (left: 2010; right: 2011).

Elevation  Slope  Land Uses  Soil Types  Support  |  Elevation  Slope  Land Uses  Soil Types  Support
5          12     25         39          6210     |  5          12     25         39          7450
5          11     29         44          6117     |  5          11     29         44          6269
3          13     25         39          5638     |  3          13     25         39          6126
2          11     26         38          4059     |  5          13     25         39          5502
6          12     25         39          3432     |  2          11     26         38          4738
6          13     25         39          3230     |  6          12     25         39          4437

To address those limitations, we use a Best Match algorithm [53], which takes as its input the two cluster sets, $C_1$ ($S_{i=1,\ldots,n}$) and $C_2$ ($S_{j=1,\ldots,m}$), and a set difference measure. We modify this algorithm by converting support counts into a ranking score in order to assign a weight value to each frequent pattern. The Best Match algorithm determines how well the frequent patterns reflect each other. Specifically, for each frequent pattern $S_i \in C_1$, we can compute its best representative in $C_2$ by:

$$\mathrm{BM}(S_i, C_2) = \max_{S_j \in C_2} J(S_i, S_j) \qquad (12)$$

where $J(S_i, S_j)$ is the similarity between two frequent patterns. The original Best Match algorithm finds the minimum dissimilarity (distance) among all comparable patterns, whereas Equation (12) finds the maximum similarity among all comparable patterns. To measure the similarity between two nominal frequent patterns, we used Jaccard's coefficient, which measures asymmetric information on variables. Since our variables are categorical (nominal), we cannot measure them in a quantitative way. We assigned a range of numeric indices to represent each item of the variables; this is called consistent labeling. To calculate similarities or distances between two data sets represented by nominal variables, we need to convert the nominal variables into binary dummy variables. A binary dummy variable is one that takes the value 0 or 1 to indicate the absence or presence of some categorical effect.
We can calculate the similarity between two sets of binary dummy variables using Jaccard's coefficient, given by:

$$J(S_i, S_j) = \frac{p}{p + q + r}$$

where $p$ is the number of variables that are positive (1) for both objects; $q$ is the number of variables that are positive (1) for the $i$th object and negative (0) for the $j$th object; and $r$ is the number of variables that are negative (0) for the $i$th object and positive (1) for the $j$th object. We compute how well $C_2$ can be represented by $C_1$ by summing the similarity coefficients from each member of $C_1$ to its respective best representative in $C_2$. We can make this similarity coefficient symmetric by also summing the similarity coefficients from every member of $C_2$ to its corresponding best representative in $C_1$ [53]. This gives the final symmetric measure:

$$\mathrm{Sim}(C_1, C_2) = \sum_{S_i \in C_1} \max_{S_j \in C_2} J(S_i, S_j) + \sum_{S_j \in C_2} \max_{S_i \in C_1} J(S_j, S_i)$$

We can normalize this similarity by the number of clusters, $|C_1| + |C_2|$, so that the final similarity score ranges from 0 to 1. There is one more aspect of this similarity measure to consider for our data sets. As mentioned in the discussion of our data type, each frequent itemset has a support count. We therefore need to take these support counts into account when we measure the similarity; we can do so by assigning weight values according to the support counts.
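The unweighted, normalized measure described above can be sketched as follows, using the two six-pattern sets of Table 3. Representing each pattern as a set of item codes is equivalent to the binary dummy coding, under the assumption that every code number belongs to exactly one land property; the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient p / (p + q + r) of two item-code tuples."""
    sa, sb = set(a), set(b)
    p = len(sa & sb)   # codes present in both patterns
    q = len(sa - sb)   # codes present only in the first pattern
    r = len(sb - sa)   # codes present only in the second pattern
    return p / (p + q + r)

def best_match_similarity(c1, c2):
    """Symmetric Best Match similarity, normalized by |C1| + |C2|."""
    forward = sum(max(jaccard(s, t) for t in c2) for s in c1)
    backward = sum(max(jaccard(t, s) for s in c1) for t in c2)
    return (forward + backward) / (len(c1) + len(c2))

# Frequent patterns of Table 3 (Elevation, Slope, Land Uses, Soil Types)
c1 = [(5, 12, 25, 39), (5, 11, 29, 44), (3, 13, 25, 39),
      (2, 11, 26, 38), (6, 12, 25, 39), (6, 13, 25, 39)]
c2 = [(5, 12, 25, 39), (5, 11, 29, 44), (3, 13, 25, 39),
      (5, 13, 25, 39), (2, 11, 26, 38), (6, 12, 25, 39)]
sim = best_match_similarity(c1, c2)
print(round(sim, 4))  # 0.9333
```

Five patterns in each set have an identical counterpart (coefficient 1), and the remaining pattern shares three of four codes with its best match (coefficient 0.6), which gives (5.6 + 5.6) / 12.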
To clarify, here is one example (Figure 8). Table 3 lists the frequent itemset results for each year, sorted in descending order of support count. If we rank the patterns by support count, the data look like Table 4. The similarity coefficient in Figure 8 is about 0.9333, but this method ignores the very important information added by the support count of each frequent pattern. The support count illustrates how much more important (frequent) one pattern is than another within the result itself. The absolute size of a support count can vary with the size of the input data. Therefore, when we measure the similarity between two clusters, we need to focus on the relative order of priority within each cluster and consider the rank difference between two patterns as a weighting factor. To illustrate the potential influence of these rankings, consider an example using Table 4. When we measure the maximum similarity of $S_4 \in C_1$ against $S_{j=1,\ldots,6} \in C_2$, the maximum is found at $S_5$; their Jaccard's coefficient is 1 even though there is a difference of one rank. When there is a rank difference between two comparable frequent patterns, we need to consider how their ranks differ and reduce their similarity coefficient accordingly by assigning a weight factor ($W_{i,j} \in [0, 1]$). In this way, the similarity coefficient of $S_4$ (rank 4) against $S_5$ (rank 5) must be less than 1 because there is a difference of rank. In other words, the similarity equals 1 (the maximum similarity value) only when the frequent patterns from the two clusters are the same and their rank order is the same.
Therefore, we consider both the conformity of each pattern and its ranking order. The extended Best Match algorithm [53] applies a weight factor $W_{i,j}$ to the similarity of each pair of patterns, where $r_i$ and $r_j$ are the rank orders of the two frequent patterns and $p \in [0, 1]$. The weight value ranges from 0 to 1, and the p value in the weight factor may be estimated by a decision maker using the weight of the most important criterion. p = 1 gives weight values that decrease at equal intervals, and p = 0 gives equal weight values. In general, p values (0 < p < 1) make the weight values decrease proportionally as the rank differences increase. For example, when we measure the weight values for the two frequent pattern results of Table 4, the possible rank differences are from 0 to 5 and the number of clusters is six for both $|C_1|$ and $|C_2|$. The larger the rank difference between two frequent patterns from each cluster, the smaller the weight value generated; the degree of decrease is based on the user-specified p value. As p decreases from 1 to close to 0 (non-zero) in Figure 9, the weight value decreases more slowly as the rank differences increase. If the p value is 0, the method is equivalent to the original Best Match algorithm [53], which ignores the rank differences of the two frequent patterns. Figure 9 shows the possible weight distributions for different p values for Table 4.
If we calculate the similarity with rank-difference weights based on the Best Match algorithm with p = 1 for the example in Table 4, the similarity measurement result can be calculated as below (Figure 10).
$$\mathrm{Sim}(C_1, C_2) = \frac{(1 + 1 + 1 + 1 + 0.4823 + 0.6) + (1 + 1 + 1 + 0.3472 + 0.6 + 1)}{12} = 0.8353$$
The Best Match algorithm modified by assigning weight factors based on rank similarity results in a similarity coefficient value of 0.8694 (Figure 10), which is lower than 0.9333 (conventional Best Match algorithm, Figure 8). This is because the compared frequent tuples from the two clusters differ in rank. We computed the similarity coefficient values between every pair of frequent pattern data sets (clusters) among the five yearly results, from 2010 to 2014, with different p values (0 < p ≤ 1) that can be chosen by the decision maker. Table 5 lists the final similarity coefficient results obtained with the modified Best Match algorithm, which assigns weight factors based on rank similarity.
The minimum similarity coefficient value is about 0.85 (85%) and the maximum is about 0.95 (95%) across the different p values and all possible pairs of yearly frequent tuple sets. These results indicate high correlations between the frequent sets of every pair of years; that is, the patterns are quite consistent across all years. In other words, there are quite consistent frequent lightning patterns for the four land properties (i.e., Elevation, Slope, Land Uses, and Soil Types).
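The similarity measurement described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the exact procedure used in this study: each frequent pattern is treated as a set of code labels, the two-sided Best Match score averages the best Jaccard match of every pattern in both directions, and `linear_weight` is a hypothetical stand-in for the weight factor W_{i,j} (it only reproduces the two stated end points: no rank penalty at p = 0 and an equal-interval decrease at p = 1). Applying the weight inside the maximum is also an assumption.

```python
from typing import Callable, Sequence, Set

Pattern = Set[int]


def jaccard(a: Pattern, b: Pattern) -> float:
    """Jaccard coefficient: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def linear_weight(rank_i: int, rank_j: int, n: int, p: float) -> float:
    """Hypothetical stand-in weight, NOT the weight formula of the paper.

    Returns 1 for every pair when p = 0 (ranks ignored) and falls at equal
    intervals with the rank difference when p = 1.
    """
    return 1.0 - p * abs(rank_i - rank_j) / n


def best_match_similarity(
    c1: Sequence[Pattern],
    c2: Sequence[Pattern],
    p: float = 0.0,
    weight: Callable[[int, int, int, float], float] = linear_weight,
) -> float:
    """Two-sided Best Match similarity of two non-empty ranked pattern lists.

    Both lists are assumed to be sorted by descending support count, so the
    position in the list (starting at 1) is the rank of a pattern.
    """
    n = max(len(c1), len(c2))

    def one_sided(src: Sequence[Pattern], dst: Sequence[Pattern]) -> float:
        total = 0.0
        for i, s in enumerate(src, start=1):
            # Best (weighted) match of pattern s among all patterns in dst.
            total += max(
                weight(i, j, n, p) * jaccard(s, t)
                for j, t in enumerate(dst, start=1)
            )
        return total

    return (one_sided(c1, c2) + one_sided(c2, c1)) / (len(c1) + len(c2))
```

With p = 0 the function reduces to the unweighted two-sided score; with p > 0 a pattern that is matched perfectly but at a different rank contributes less than 1.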

Additional Verification Process by Comparison between 2010-2014 and 2015-2016 Data Mining Results
In the early stages of our research, we had only the 2010-2014 CG lightning data and carried out all the related experiments with them. It is desirable to use the latest dataset (2015-2016) to check whether our experimental results (2010-2014) remain reliable. We therefore applied Association Rule Mining to the more recent dataset and tested how consistent the new results are with the old ones, by measuring similarity values and by comparing the two CG lightning hazard maps from the two periods. Table 6 lists frequent patterns obtained by Association Rule Mining (CFP-Growth++ algorithm). We also divided the frequent patterns of the data mining results into five classes ('High Risk', 'Risk', 'Moderate', 'Low Risk' and 'No Risk') using the Jenks natural breaks optimization method. We present the 'High Risk' and 'Risk' results only.
Table 6 (each row gives the code labels of the four land properties, followed by the support count).
2   11   26   38   9443
4   12   30   44   9072
3   11   26   39   9020
4   11   29   43   8279
4   11   30   43   8165
3   11   29   43   7033
6   11   25   39   7000
6   14   25   39   6909
5   14   25   39   6672
The frequent patterns for 2010-2014 and for 2015-2016 are quite similar. We also measured the similarity coefficient value between the two results to quantify this agreement. The methodology for measuring similarity among Association Rule Mining results is described in Section 6.1. We measured the similarity for several values of the parameter p (p = 1, 0.8, 0.6, 0.4 and 0.2); the similarity results are given in Table 7.
As mentioned in Section 5.4, the most recognizable high CG lightning risk areas are located in southern Alberta along the east side of the Rocky Mountains. Their corresponding land properties and support counts can be found in the frequent pattern results of the data mining. The spatial distribution of the CG lightning hazardous area for the new dataset (2015-2016) is very similar to that for the 2010-2014 dataset.
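The split of frequent patterns into five risk classes with the Jenks natural breaks method can be illustrated as follows. The sketch below is an assumption about how such a classification may be computed, not the software used in this study: it is a plain dynamic-programming version of the Fisher-Jenks idea, which sorts the values and chooses contiguous classes that minimize the total within-class sum of squared deviations. The function name `jenks_classes` and the sample values are hypothetical.

```python
from typing import List, Sequence


def jenks_classes(values: Sequence[float], k: int) -> List[List[float]]:
    """Split values into k contiguous classes (after sorting) that minimize
    the total within-class sum of squared deviations. Runs in O(k * n^2)."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums give the squared deviation of any slice in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def ssd(a: int, b: int) -> float:
        """Sum of squared deviations of xs[a:b] from its own mean."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    # cost[c][i]: best cost of splitting the first i values into c classes.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for i in range(c, n + 1):
            for j in range(c - 1, i):
                candidate = cost[c - 1][j] + ssd(j, i)
                if candidate < cost[c][i]:
                    cost[c][i] = candidate
                    cut[c][i] = j

    # Walk the stored cut points back from the end to recover the classes.
    classes: List[List[float]] = []
    i = n
    for c in range(k, 0, -1):
        j = cut[c][i]
        classes.append(xs[j:i])
        i = j
    return classes[::-1]


# Made-up support counts, split into three classes for illustration.
print(jenks_classes([1, 2, 3, 10, 11, 12, 50, 51], 3))
```

For the five risk labels used here, the same function would be called with k = 5 on the support counts, and the classes would be labeled from 'No Risk' (lowest values) to 'High Risk' (highest values).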

Comparison of CG Lightning Hazard Map (2010-2014) with Actual Raw CG Lightning Data (2015-2016)
The CG lightning hazard map for 2010-2014 based on the data mining results is divided into 20 classes of equal area. The risk values of the map are sorted in ascending order into these 20 classes, so that each class covers about 5% of the total study area (Table 8). The hazard map based on the data mining results and the unprocessed raw CG lightning data are independent of each other. What we want to know is whether the higher CG lightning risk classes received more actual CG lightning from 2015 to 2016. Answering this question lets us verify whether the patterns obtained from the data mining techniques are reliable.
Under our assumption, the higher the risk level of a class, the more 2015-2016 lightning strikes should fall in it. We counted the actual CG lightning strikes that fall within each class; Table 9 records the results. Table 9 shows the actual CG lightning distribution (2015-2016) by class (class code number) of the CG lightning hazard map (2010-2014) derived from the data mining results. The count values in Table 9 show that, in general, raw CG lightning counts increase as the hazard map class (code number) increases. Figure 12 plots the lightning frequency of the raw CG lightning sorted into the hazard map classes (2010-2014).
We calculated the hazard map from the data mining results, which consist of two parts: the four land properties and their supports. From these results we made a continuous CG lightning hazard map over Alberta, Canada. The frequency of the unprocessed raw CG lightning data from 2015 to 2016, classed by the 2010-2014 hazard map, generally increases as the risk class (class code number) increases. This trend is most pronounced in the classes with code numbers 13 to 20. On the other hand, a considerable number of CG lightning strikes still occur in lower risk areas, which may point to limitations of this research.
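The verification step above (equal-area classes, then a strike count per class) can be sketched as follows. This is only an illustration under simplifying assumptions: the hazard map is represented as a flat list of risk values for grid cells of equal size, and each lightning strike is represented by the index of the cell it hit. The function names and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def equal_area_codes(risk: Sequence[float], n_classes: int = 20) -> List[int]:
    """Assign each grid cell a class code 1..n_classes by ascending risk rank,
    so that every class holds (almost) the same number of equal-size cells."""
    order = sorted(range(len(risk)), key=lambda idx: risk[idx])
    codes = [0] * len(risk)
    for position, idx in enumerate(order):
        codes[idx] = position * n_classes // len(risk) + 1
    return codes


def strikes_per_class(
    codes: Sequence[int], strike_cells: Sequence[int], n_classes: int = 20
) -> Dict[int, int]:
    """Count the strikes that fall in each class; strike_cells holds the index
    of the grid cell hit by each strike."""
    counts = Counter(codes[cell] for cell in strike_cells)
    return {code: counts.get(code, 0) for code in range(1, n_classes + 1)}


# Made-up example: 40 cells whose risk decreases with the cell index.
risk_values = [40 - i for i in range(40)]
codes = equal_area_codes(risk_values)
counts = strikes_per_class(codes, strike_cells=[0, 0, 1, 39])
```

If the hazard map is informative, the counts returned by `strikes_per_class` should tend to grow with the class code, which is the trend examined in Table 9 and Figure 12.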
We limited our analysis to the four land properties (i.e., Elevation, Slope, Land Uses, and Soil Types), but there could be many other triggers for lightning flashes on the surface. Despite the limited assumptions we set, the general trends of CG lightning for 2015-2016, classed on the hazard map for 2010-2014, show that higher risk areas on the hazard map have more actual CG lightning. We can also check these patterns by generating a density map of CG lightning. The lightning density map is based on the calculation of a magnitude-per-unit area from the lightning point features that fall within a neighborhood. We used the Point Density tool [55] in ArcGIS software [36] to make this density map (Figure 13 (Right)).
We divided the Province of Alberta into two clusters to look at the regions of high CG lightning density. The first cluster includes the higher lightning risk areas (code numbers 17-20), which cover 20% of the total study area (Figure 13 (Left)), and the second cluster includes the lower lightning risk areas (code numbers 1-16), which cover 80% of the total study area. According to Table 9, these two clusters (groups of code numbers) account for 26% and 74% of the CG lightning counts from 2015 to 2016, respectively. In Figure 14, the area marked in white is the same as the high risk (first cluster) regions in Figure 13 (Left). The contour lines describe a density map of raw CG lightning for 2015-2016. Most high CG lightning density areas are located in the higher risk regions (code numbers 17-20, Figure 13 (Left)). In conclusion, the quantitative measurement in Table 9 and the visual interpretation of Figure 14 verify that the higher risk areas on the hazard map have more numerous and more densely occurring CG lightning flashes.
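The density map used in this comparison follows the general idea of a magnitude-per-unit-area surface: for every output cell, the points inside a neighborhood are counted and the count is divided by the neighborhood area. The sketch below illustrates that idea only; it is not a reimplementation of the ArcGIS tool, and the circular neighborhood, the grid parameters and the function name `point_density` are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def point_density(
    points: Sequence[Tuple[float, float]],
    x0: float,
    y0: float,
    cell: float,
    n_cols: int,
    n_rows: int,
    radius: float,
) -> List[List[float]]:
    """Strikes per unit area around each cell centre of a regular grid.

    (x0, y0) is the lower-left corner of the grid, cell is the cell size and
    radius is the radius of the circular neighborhood (same length units).
    """
    area = math.pi * radius ** 2
    grid: List[List[float]] = []
    for row in range(n_rows):
        cy = y0 + (row + 0.5) * cell
        line: List[float] = []
        for col in range(n_cols):
            cx = x0 + (col + 0.5) * cell
            # Count the points that lie within the circular neighborhood.
            inside = sum(
                1
                for px, py in points
                if (px - cx) ** 2 + (py - cy) ** 2 <= radius ** 2
            )
            line.append(inside / area)
        grid.append(line)
    return grid


# Made-up strike coordinates on a 3 x 3 grid of unit cells.
density = point_density([(0.5, 0.5), (0.6, 0.5), (2.5, 2.5)], 0.0, 0.0, 1.0, 3, 3, 0.5)
```

Contour lines such as those in Figure 14 can then be drawn from a grid of this kind. The double loop is quadratic in practice and is meant for clarity; a spatial index would be preferable for a province-wide dataset.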

Applications (Analysis of Wildfire Hazardous Regions Based on the CG Lightning Hazard Map)
This section examines the relationship between wildfires and the CG lightning strike patterns derived from the data mining processes (and the CG lightning strike data). We extracted the wildfire data described in Section 3.5 to isolate only the fires in Alberta, and counted how many wildfires occurred on each land use type (Table 10). In Alberta, from 2010 to 2014, approximately 93% of all wildfires caused by lightning occurred on the Forest, Forest Wetland, and Wetland Shrub land types (shaded boxes in Table 10). To better see how the wildfires are geographically distributed, we plotted the wildfire data points on the hazard map.
A higher risk of CG lightning does not always imply a higher risk of wildfires caused by CG lightning. There may be areas with more frequent CG lightning but less frequent wildfires because of other conditions, including weather and the strength of the CG lightning strikes. Therefore, we limited the CG lightning hazard map to the three dominant Land Uses classes, to reflect the finding that 93% of all wildfires caused by lightning occurred on Forest, Forest Wetland, and Wetland Shrub (Table 10). Figure 15 contains the hazard map overlaid with the wildfire data.

We divided the CG lightning hazard map (Figure 15) into 20 classes, based on the same criteria used in Table 9, then counted the wildfire frequency for each class. The code numbers in Table 11 are labels that represent each class, where higher code numbers correspond to higher CG lightning risk areas. We also divided the wildfire count of each class by its code size to normalize the counts. This normalization compensates for the different sizes of the code number classes. Figure 16 is a normalized wildfire frequency graph that shows the trends visually.
Table 11. CG lightning hazard map classes and wildfire frequency within a class.

Columns: Ranges; Code Number; Code Size.
The trend of the normalized wildfire counts in Table 11 and Figure 16 contains two peaks as the code numbers (risk labels of the hazard map) change. We divided the graph into two groups. First, the counts increase from code number 2, peak at code number 6, and then decrease to code number 8. Second, the counts increase again from code number 13, peak at code number 17, and then decrease to code number 20. Given these ups and downs, it is hard to say that wildfire occurrence necessarily increases with the risk of CG lightning. We can identify four possible cases to describe and account for this shifting relationship between CG lightning and wildfire:
(1) The regions have a relatively high incidence of CG lightning strikes, but wildfires caused by lightning rarely occurred.
(2) The regions have a relatively high incidence of lightning strikes and wildfires, and fires caused by lightning occurred frequently.
(3) The regions have a relatively low frequency of CG lightning, but wildfires caused by lightning nevertheless occurred frequently.
(4) The regions have a relatively low frequency of CG lightning, and wildfires caused by lightning rarely occurred.
The data (and two of the cases listed above) suggest that lightning-caused wildfires may have other triggers, such as weather conditions and the particular characteristics of the lightning, even when there are few lightning strikes. Because the objective of this study is to research wildfire hazard, the second and third cases are the ones of interest. Therefore, we need to distinguish the two kinds of areas with a relatively high incidence of wildfires from the rest of the study region. To do so, we need a relative measure that shows which areas have more wildfires than others. This relative measure can be calculated with the Preference Index (PI) as follows:
$$\mathrm{PI}_n = \frac{WF_n / A_n}{WF_t / A_t} \qquad (12)$$
where WF and A are the number of wildfires and the area size, respectively; the subscript n indicates a given category (code number) in the CG lightning hazard map, and the subscript t indicates the whole study area. A PI of about 1 means that the percentage of wildfires over that category of the CG lightning hazard map is equal to the percentage of wildfires over the entire study area. Therefore, if the PI is higher than 1, wildfires are relatively frequent in that category compared with the other categories in the CG lightning hazard map. Table 12 records the PI for each code number from 1 to 20. In Table 12, the PI is higher than 1 for 9 code numbers (5, 6, 7, 15, 16, 17, 18, 19 and 20) out of the 20 classes. As mentioned above, we can classify these into two groups. The first includes the regions with a relatively low frequency of CG lightning, yet a relatively high frequency of wildfires, and the second includes the regions with a relatively high incidence of both lightning strikes and wildfires caused by lightning.
We determined that code numbers 5, 6 and 7 define the first group and that code numbers 15-20 define the second. We present these two groups (Red and Blue) in Figure 17 to illustrate the two relatively higher risk wildfire regions in the PI graph. The combined Red and Blue regions are the final wildfire hazardous regions (Figure 18).
Using association rule mining techniques, we analyzed CG lightning frequent patterns and their related land properties and generated a hazard map based on the patterns. We then looked for a relationship between the CG lightning frequent patterns and the wildfires caused by CG lightning. We found that some regions have a high incidence of both CG lightning and wildfires (Blue color regions), but others combine a lower incidence of CG lightning with a relatively high incidence of wildfires (Red color regions). These results can be used to locate regions of high wildfire risk and to identify their related land properties from the association rule mining results, so that the wildfire hazard in Alberta can be managed and prepared for.
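The Preference Index used in this section can be computed per class with a few lines of Python. The sketch assumes, following the definitions of WF_n, WF_t, A_n and A_t given above, that the PI of a class is its wildfire density (wildfires per unit area) divided by the wildfire density of the whole study area. The function name and the sample numbers are hypothetical and are not taken from Table 12.

```python
from typing import Dict


def preference_index(
    wildfires: Dict[int, int], areas: Dict[int, float]
) -> Dict[int, float]:
    """PI per class code: (WF_n / A_n) / (WF_t / A_t).

    wildfires maps a class code to its wildfire count and areas maps the same
    codes to the class area (any consistent unit).
    """
    wf_total = sum(wildfires[code] for code in areas)
    area_total = sum(areas.values())
    overall_density = wf_total / area_total
    return {
        code: (wildfires[code] / areas[code]) / overall_density
        for code in areas
    }


# Made-up example with two classes of equal area.
pi = preference_index({1: 10, 2: 30}, {1: 50.0, 2: 50.0})
high_preference = sorted(code for code, value in pi.items() if value > 1.0)
```

Classes whose PI exceeds 1 (here only class 2) receive more wildfires than their share of the area would suggest, which is the criterion used to select the Red and Blue regions.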

Discussion and Conclusions
We investigated the characteristics of CG lightning over Alberta, Canada, using lightning data from 2010 to 2016. We implemented a hotspot analysis to find the regions where high frequency CG lightning strikes cluster together. Generally, hotspot regions are located in the central, central east and south central regions of the study area. A total of 93.8% of annual lightning occurs in the warm months (June to August), and the daily lightning frequency is influenced by the diurnal heating cycle. We used the association rule mining technique (CFP-Growth++ algorithm) to investigate frequent CG lightning patterns. The frequent CG lightning patterns were verified by a similarity measurement to check the patterns' consistency. The resulting similarity coefficient values showed a high correlation throughout the study period. We also verified the CG lightning hazard map for 2010-2014 by comparing it to unprocessed, independent raw CG lightning data from 2015 to 2016. The actual CG lightning generally struck more often in the higher risk regions of the lightning hazard map. Most wildfires in Alberta (approximately 93%) occur in Forests, Wetland Forests, and Wetland Shrub areas. We found two distinct areas of interest: frequent wildfire regions with a high frequency of lightning, and frequent wildfire regions with a low frequency of lightning. Further, the preference index (PI) revealed locations where wildfires occurred more frequently than in other class regions. One potential application of this research is to estimate wildfire hazard areas from CG lightning hazard maps and frequency data for specific land use types. There are limitations in this study. First, we analyzed only seven years of CG lightning data. Analyzing additional years of CG lightning data would provide more accurate results. Second, this study considered a limited number of land properties (i.e., Elevation, Slope, Land Uses, and Soil Types). Including more land properties in the study would enhance the accuracy of our results.
In addition, there may be other factors that we did not consider, such as Convective Available Potential Energy (CAPE), moisture content, surface temperature, etc. Accuracy and reliability of the results could be improved by adapting other data mining techniques as well.
Acknowledgments: Elements of this study were supported by Xin Wang and Jeong Woo Kim's Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants. The authors thank Environment Canada for permission to use Canadian Lightning Detection Network (CLDN) data.
Author Contributions: As a leading author, DongHwan Cha collected data, conducted data processing and analysis, and completed this study under the guidance of his academic advisor, Jeong Woo Kim. Xin Wang provided theory and implementation of the Association Rule Mining technique to this research.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Code label numbers for the classified land properties [38].