Discovering Energy Consumption Patterns with Unsupervised Machine Learning for Canadian In Situ Oil Sands Operations

Abstract: Canada’s in situ oil sands can help meet the global oil demand. Because of the energy-intensive extraction processes, in situ oil sands operations also play a critical role in meeting the global carbon budget. The steam oil ratio (SOR) is an indicator used to measure energy efficiency and assess greenhouse gas (GHG) emissions in the in situ oil sands industry. A low SOR indicates an extraction process that is more energy efficient and less carbon intensive. In this study, we applied machine learning methods for data-driven discovery to a public database, Petrinex, containing operating data from 2015 to 2019 extracted from over 35 million records for 20 in situ oil sands extraction operations. Two unsupervised machine learning methods, clustering and association rules, showed that the cyclic steam stimulation (CSS) recovery method was less efficient than the steam-assisted gravity drainage (SAGD) recovery method. Chi-square tests showed a statistically significant association between the CSS recovery method and high SOR (p < 0.005). Two association rules suggested that the occurrence of non-condensable gas (NCG) co-injection produced a low SOR. Chi-square tests on the two rules identified a statistically significant relationship between gas co-injection and low SOR (p < 0.005). Association rules also indicated that there was no association between the production regions and SORs. For future in situ oil sands development, decision-makers should consider SAGD as the preferred method because it is less carbon intensive. Existing in situ oil sands projects and future development should explore the possibility of NCG co-injection with steam to reduce steam consumption and consequently reduce GHG emissions from the extraction processes.


Introduction
To keep the average global temperature rise below 2 °C, a third of global oil reserves have to remain undeveloped [1]. In 2019, Canada was the fourth largest oil producer, contributing 5% to the global oil production [2], and had the third largest proven oil reserves (following Venezuela and Saudi Arabia) with over 167 billion barrels (bbls) [3]. Canada plays a critical role in meeting the global carbon budget. Masnadi et al. [4] reported that Canada was the fourth highest carbon-intensive upstream oil producer in the world, after Algeria, Venezuela, and Cameroon. This is because over half of the oil production in Canada comes from an unconventional oil resource called oil sands.
Oil sands account for 64% of Canada's oil production and 98% of Canada's oil reserves [5]. Oil sands is a mixture of sand, water, clay, and heavy oil. The heavy oil separated from the oil sands is called bitumen, which contains particulate organic material, hydrocarbons, associated metals, and sulphur compounds [6]. A solid at room temperature, bitumen is the most viscous hydrocarbon [7]. Almost all oil sands reserves in Canada are concentrated in the Athabasca, Cold Lake, and Peace River regions in Northern Alberta. In situ oil sands extraction is one of two methods used to recover bitumen from oil sands.

volume, and temperature (PVT) properties of crude oil [26,27], crude oil price [28,29], and enhanced oil recovery [30,31]. In oil sands operations, machine learning methods have been applied to analyse incident reports and increase process safety [32,33], and to predict crude oil production from in situ oil sands extraction [34,35]. To the best of our knowledge, data mining techniques have not been applied to the Canadian oil and gas data warehouse or, more broadly, to any oil and gas data warehouse. This study:
1. Assesses the impact of production regions and recovery methods on steam injection and oil production using clustering, an unsupervised machine learning method;
2. Evaluates whether production regions have a relationship with solution gas production using association rules, an unsupervised machine learning method;
3. Evaluates whether solvent co-injection with steam can reduce SORs and whether production regions have a relationship with solution gas production, again using association rules.

Materials and Methods
The knowledge discovery in databases (KDD) process is iterative and interactive, and includes the following main steps [36]:
1. Data selection: Relevant data are retrieved from the database, then a subset of data samples is selected to create a target dataset on which the discovery will be performed.
2. Data pre-processing: Outliers and inconsistent or missing data are removed.
3. Data transformation: Appropriate data forms are created for mining. The task may consist of dimension reduction, data integration, and other steps.
4. Data mining or pattern discovery: Interesting patterns are extracted. Data mining is an essential step in the process of KDD [37]. Data mining tasks are generally grouped as predictive or descriptive. The predictive task builds a model to predict the future with methods such as correlation and regression. The descriptive task characterises properties of the data with methods such as clustering, identifying frequent patterns, and understanding associations.
5. Interpretation and evaluation: The mined patterns are interpreted and evaluated (commonly with pattern visualisation techniques).
KDD is a computational process for finding useful knowledge from a large amount of data [38]. For this study, we followed the standard KDD steps described above using the data science library Pandas (version 0.25.3) and the programming language Python (version 3.7.3) (Figure 1).

Data Selection
The monthly operating data obtained from Petrinex [39] contain mandatory reports on monthly activities from oil and gas licensees or operators in Alberta to the AER [40]. The monthly data from 2015 to 2019 were then tabulated into one dataset with 29 columns and over 35 million rows. The 29 columns provided information such as facility location, facility operator, well status, facility activity, and facility type. The 35 million rows contained monthly records for the entire oil and gas industry in Alberta. The monthly records included oil and solution gas production from oil batteries, fuel gas use, steam injection volumes, and NCG injection volumes. The following procedures were performed to select data associated with in situ oil sands schemes:

1. Under the reporting facility types, battery (BT) and injection facility (IF) were selected.
2. Under the reporting facility subtypes, in situ oil sands and sulphur reporting at oil sands were selected.
3. BT and IF were linked by 11,000 well IDs provided in the Well to Facility Link Report [39]. The paired injection wells and producing wells for the scheme had the same well IDs. Depending on the stage of production, the number of wells for each scheme ranged from 100 to over 600 wells. The linked BT and IF IDs formed a dataset for in situ oil sands extraction schemes only, which was the target dataset in this study. The linked BT and IF IDs for each scheme are provided in the Supplementary Material.
The oil sands scheme included all BT and IF IDs associated with in situ oil sands extraction and excluded bitumen upgrading and producing wells. The BT is the facility that separates and measures products from producing wells. The IF is where steam is injected into the oil sands reservoir.
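The selection procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`facility_type`, `subtype`, `well_id`), and the sample facility IDs are hypothetical and do not reproduce the actual Petrinex schema or the authors' Pandas code.

```python
# Hypothetical records; field names and IDs are illustrative, not the Petrinex schema.
records = [
    {"facility_id": "BT001", "facility_type": "BT", "subtype": "in situ oil sands", "well_id": "W1"},
    {"facility_id": "IF001", "facility_type": "IF", "subtype": "in situ oil sands", "well_id": "W1"},
    {"facility_id": "BT002", "facility_type": "BT", "subtype": "crude oil", "well_id": "W2"},
    {"facility_id": "GP001", "facility_type": "GP", "subtype": "gas plant", "well_id": "W3"},
]

KEEP_TYPES = {"BT", "IF"}  # battery and injection facility
KEEP_SUBTYPES = {"in situ oil sands", "sulphur reporting at oil sands"}


def select_in_situ(rows):
    """Steps 1 and 2: keep BT/IF records with an in situ oil sands subtype."""
    return [
        r for r in rows
        if r["facility_type"] in KEEP_TYPES and r["subtype"] in KEEP_SUBTYPES
    ]


def link_by_well(rows):
    """Step 3: pair battery and injection facility IDs that share a well ID."""
    by_well = {}
    for r in rows:
        by_well.setdefault(r["well_id"], {}).setdefault(r["facility_type"], set()).add(r["facility_id"])
    pairs = set()
    for facilities in by_well.values():
        for bt in facilities.get("BT", ()):
            for inj in facilities.get("IF", ()):
                pairs.add((bt, inj))
    return sorted(pairs)


selected = select_in_situ(records)
linked = link_by_well(selected)
print(linked)
```

In this toy input only the battery and injection facility that share well `W1` survive the filter and are linked into one scheme.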

Data Preprocessing
In this study, 11 monthly records with oil production less than 5000 m³ (approximately 100 bbl/day) were removed. These months had production interruptions such as the 2016 forest fire in Northern Alberta, or production started with volumes that were 5 to 10 times smaller than the following months. A detailed analysis of the data removal is provided in the Supplementary Material. In addition, MEG Energy's Christina Lake scheme did not have any fuel use data due to confidentiality. Therefore, this scheme was removed. After these exclusions, 20 in situ oil sands schemes with 1127 monthly records were populated for knowledge discovery (Table 1). The 20 schemes accounted for 82.4% of all in situ oil sands extractions in 2019 [41].
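As a minimal sketch of these two exclusion rules, assuming a list of monthly records with hypothetical field names (`scheme`, `oil_m3`, `fuel_e3m3`) and made-up volumes, the filtering could look like this. Only the 5000 m³ threshold comes from the text.

```python
MIN_OIL_M3 = 5000  # monthly oil production threshold stated in the text

# Hypothetical monthly records; names and volumes are illustrative only.
monthly = [
    {"scheme": "A", "month": "2016-04", "oil_m3": 120000.0, "fuel_e3m3": 9000.0},
    {"scheme": "A", "month": "2016-05", "oil_m3": 3200.0, "fuel_e3m3": 400.0},  # interrupted month
    {"scheme": "B", "month": "2016-04", "oil_m3": 80000.0, "fuel_e3m3": None},  # no fuel data
    {"scheme": "B", "month": "2016-05", "oil_m3": 82000.0, "fuel_e3m3": None},
]


def preprocess(rows, min_oil=MIN_OIL_M3):
    """Drop schemes with no fuel data at all, then drop low-production months."""
    schemes = {r["scheme"] for r in rows}
    no_fuel = {
        s for s in schemes
        if all(r["fuel_e3m3"] is None for r in rows if r["scheme"] == s)
    }
    return [
        r for r in rows
        if r["scheme"] not in no_fuel and r["oil_m3"] >= min_oil
    ]


clean = preprocess(monthly)
print([(r["scheme"], r["month"]) for r in clean])
```

Here scheme B is dropped entirely because it reports no fuel use, and the interrupted month of scheme A is dropped because it falls below the threshold.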

Data Transformation
Of the 29 columns in the target dataset, 12 were removed. The removed columns contained information such as scheme locations and submission dates. A list of removed columns is provided in the Supplementary Material. Data in the target dataset were summarised to extract the operating parameters listed in Table 2 for pattern discovery and unsupervised machine learning. The dataset had 1127 rows, each representing one monthly record. Of the 20 schemes, 13 had 60 monthly records, 3 schemes had 59 monthly records, and the remaining 4 schemes had 29 to 57 monthly records. The target dataset is provided in [42].

Table 2. Operating parameters retrieved from the data warehouse.
Operating Parameters | Units | Selection Method
Fuel Use | 10³ m³ | ActivityID column: select FUEL; ProductID column: select GAS

Monthly SORs were calculated by dividing injection steam volumes by oil volumes and then cross-checked against AER ST53 statistical reports to ensure BT and IF were linked correctly and other parameters were appropriately extracted for each scheme. In Petrinex, steam quantity is reported in m³ of cold water equivalent at a temperature of 15 °C, and fuel gas quantity is reported in 10³ m³ at 15 °C and 101.325 kPa absolute pressure.
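The monthly SOR calculation is a simple ratio, sketched below. The function name and the sample volumes are hypothetical; the only thing taken from the text is the definition (steam volume in m³ of cold water equivalent divided by oil volume in m³).

```python
def steam_oil_ratio(steam_m3_cwe, oil_m3):
    """Monthly SOR: injected steam (m3 cold water equivalent) per m3 of oil produced."""
    if oil_m3 <= 0:
        raise ValueError("oil volume must be positive to compute an SOR")
    return steam_m3_cwe / oil_m3


# Illustrative month: 300,000 m3 of steam injected for 120,000 m3 of oil.
sor = steam_oil_ratio(300000.0, 120000.0)
print(sor)
```

A value computed this way could then be compared with the corresponding figure in a published statistical report to check that the battery and injection facilities were linked correctly.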

Data Mining
Two unsupervised machine learning techniques were used: clustering and association rules. Unsupervised learning is used to discover the underlying patterns within the data to learn more about it. Unsupervised learning was conducted using the R programming language. The k-means algorithm was executed using an R function called kmeans. The association rule algorithm was implemented using a package in R called arules.

Clustering
Cluster analysis splits data into groups based on a similarity measure and is used to explore hidden patterns [43]. In this study, we used a k-means algorithm with the Euclidean distance similarity metric. We divided monthly production volumes and steam injection volumes into k clusters based on the distance to the centroid of a cluster, with the objective of maximising the similarity within groups and minimising the similarity between groups [44]. The algorithm aims to minimise the Euclidean distances of all points with their nearest cluster centres by minimising the within-cluster sum of squared errors (SSE).
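To make the objective concrete, the following is a minimal pure-Python sketch of k-means using plain Lloyd iterations and the within-cluster SSE. It is not the authors' code: the study used R's kmeans function, whose default algorithm may differ, and the toy points and starting centres here are made up.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centres):
    """Label each point with the index of its nearest centre."""
    return [
        min(range(len(centres)), key=lambda j: squared_distance(p, centres[j]))
        for p in points
    ]


def kmeans(points, centres, max_iter=100):
    """Lloyd's algorithm from given initial centres; returns (labels, centres, sse)."""
    centres = [tuple(c) for c in centres]
    for _ in range(max_iter):
        labels = assign(points, centres)
        new_centres = []
        for j, old in enumerate(centres):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centres.append(old)  # keep an empty cluster's centre unchanged
        if new_centres == centres:
            break
        centres = new_centres
    labels = assign(points, centres)
    sse = sum(squared_distance(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, centres, sse


# Toy standardised (steam injection, oil production) pairs, two obvious groups.
points = [(-1.0, -1.1), (-0.9, -1.0), (1.0, 0.9), (1.1, 1.0)]
labels, centres, sse = kmeans(points, centres=[points[0], points[2]])
print(labels, sse)
```

The returned SSE is the quantity the algorithm minimises; it is also the within-cluster variation used when choosing the number of clusters.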
By clustering, we analysed how oil production responded to steam injection. Minimising steam injection quantities is the key to reducing GHG emissions from in situ oil sands extraction. We also examined how production regions and recovery methods influenced oil production and steam injection. The steam injection and oil production data were normalised using the z standardisation method before being fed into the algorithm. The number of clusters (k) was selected based on the rule suggested by Hartigan [45]. The rule uses the intuition that when clusters are well separated and K* is the right number of clusters, then:
• For K < K*, a (K + 1) cluster partition should be the K cluster partition with one of its clusters split into two. This would significantly decrease the total within-cluster variation (W_K);
• For K > K*, both the K and (K + 1) cluster partitions will be equal to the right cluster partition with some of the right clusters split randomly, so that W_K and W_{K+1} are not significantly different.
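The z standardisation and the cluster-count rule can be sketched as follows. Two assumptions are made that the text does not state: the standardisation uses the sample standard deviation, and the rule is applied in its commonly cited form, H(K) = (W_K / W_{K+1} - 1)(N - K - 1), taking the smallest K with H(K) at or below a conventional threshold of 10. The W_K values below are invented for illustration.

```python
import statistics


def z_standardise(values):
    """Centre to mean 0 and scale to unit sample standard deviation (assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def hartigan_index(w_k, w_k_plus_1, n, k):
    """H(K) = (W_K / W_{K+1} - 1) * (N - K - 1)."""
    return (w_k / w_k_plus_1 - 1.0) * (n - k - 1)


def choose_k(within_ss, n, threshold=10.0):
    """within_ss[i] holds W_K for K = i + 1; return the smallest K with H(K) <= threshold."""
    for i in range(len(within_ss) - 1):
        k = i + 1
        if hartigan_index(within_ss[i], within_ss[i + 1], n, k) <= threshold:
            return k
    return None  # no K in the tested range satisfied the rule


z = z_standardise([1.0, 2.0, 3.0])

# Invented within-cluster variations for K = 1..5 on 100 observations.
w = [1000.0, 400.0, 150.0, 140.0, 135.0]
best_k = choose_k(w, n=100)
print(z, best_k)
```

In this invented series the within-cluster variation drops sharply up to K = 3 and barely changes afterwards, so the rule stops at three clusters.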

Association Rule
Association rules are used to identify sets of items that frequently occur together in a dataset. Association rule mining is a popular unsupervised machine learning technique for market basket analyses, writer evaluations, medical diagnoses, etc. [43].
In this study, we evaluated whether co-injections were associated with low SORs and whether production regions were associated with high solution gas oil ratios (SGORs). An association rule is characterised by three parameters: support, confidence, and lift [46,47]. In this context, the association rule can be written as:
\[ X \Rightarrow Y \]
Support measures how frequently X and Y happen together and is expressed as:
\[ \mathrm{Support}(X \Rightarrow Y) = \frac{\text{Number of months containing both } X \text{ and } Y}{m} = P(X \cap Y) \]
where X is the co-injection or production region, Y is low SOR or high SGOR, and m is the number of months in the entire dataset, which was 1127 in this study. Confidence is the conditional probability that Y is true under the condition of X and is expressed as:
\[ \mathrm{Confidence}(X \Rightarrow Y) = \frac{\text{Number of months containing both } X \text{ and } Y}{\text{Number of months containing } X} = P(Y \mid X) \]
Lift is used to measure the correlation between X and Y and is written as:
\[ \mathrm{Lift}(X \Rightarrow Y) = \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cap Y)}{P(X)\,P(Y)} \]
When Lift < 1, X is negatively correlated with Y. When Lift > 1, X is positively correlated with Y, and when Lift = 1, X and Y are independent. For association rule mining, two parameters need to be defined: the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf). In this study, we set the min_sup to 10% to ensure that at least two schemes had co-injection or a high SGOR with at least 96 monthly records. The min_sup threshold also filtered out some injection activities that might not have been intended to recover bitumen. We set the min_conf to 80% to ensure a high P(Y | X).
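The three rule parameters can be computed directly from counts of monthly records, as in the sketch below. Each month is represented as a set of item labels; the labels and the ten sample months are hypothetical, and only the 10% and 80% thresholds come from the text. The study itself used the arules package in R.

```python
def rule_metrics(months, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    m = len(months)
    n_x = sum(1 for items in months if antecedent <= items)
    n_y = sum(1 for items in months if consequent <= items)
    n_xy = sum(1 for items in months if (antecedent | consequent) <= items)
    if n_x == 0 or n_y == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_xy / m             # P(X and Y)
    confidence = n_xy / n_x        # P(Y | X)
    lift = confidence / (n_y / m)  # P(Y | X) / P(Y)
    return support, confidence, lift


# Ten hypothetical monthly records, each a set of categorical items.
months = (
    [{"Gas co-injection", "Low SOR"}] * 4
    + [{"Gas co-injection", "High SOR"}] * 1
    + [{"No co-injection", "Low SOR"}] * 1
    + [{"No co-injection", "High SOR"}] * 4
)

MIN_SUP, MIN_CONF = 0.10, 0.80  # thresholds stated in the text

support, confidence, lift = rule_metrics(months, {"Gas co-injection"}, {"Low SOR"})
keeps_rule = support >= MIN_SUP and confidence >= MIN_CONF and lift > 1.0
print(support, confidence, lift, keeps_rule)
```

For these ten months the rule holds in 4 of the 5 co-injection months, so it clears both thresholds with a lift above 1.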
We categorised SORs, solvent co-injection volumes, and SGORs based on the median values. The criteria used are presented in Table 3. The solvents co-injected with steam by the 20 selected schemes were gas (mainly methane), natural gas condensate, and propane (C3). The cut-off values (medians) are presented in Table 4.

Interpretation and Evaluation
The uncovered patterns were visualised and are presented in the Results section. A chi-square test for independence was used to assess the statistical significance of the dependence between the antecedent (X) and the consequent (Y) in an association rule (X ⇒ Y) [48,49]. The null and alternative hypotheses for the chi-square test are:
• H_0: The antecedent (X) and the consequent (Y) are independent.
• H_a: The antecedent (X) and the consequent (Y) are not independent.
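For a single rule, the test reduces to a 2 × 2 contingency table of monthly record counts. The sketch below computes the Pearson statistic without a continuity correction and uses the closed-form tail probability for one degree of freedom. Whether the authors applied a continuity correction is not stated, and the counts are invented.

```python
import math


def chi_square_2x2(n_xy, n_x_only, n_y_only, n_neither):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction).

    Rows are X present / X absent; columns are Y present / Y absent.
    Returns (chi2, p_value) with one degree of freedom.
    """
    observed = [[n_xy, n_x_only], [n_y_only, n_neither]]
    total = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Invented counts: 40 months with X and Y, 10 with X only, 10 with Y only, 40 with neither.
chi2, p_value = chi_square_2x2(40, 10, 10, 40)
reject_h0 = p_value < 0.05
print(chi2, p_value, reject_h0)
```

With these counts the statistic is large and the p-value is far below 0.05, so the null hypothesis of independence would be rejected.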

Clustering
There were 1127 monthly records grouped into nine clusters based on steam injection and oil production. Among the nine clusters, clusters 4 and 9 were the least efficient, with more steam injection used per unit of bitumen produced (compared to the average). Cluster 2 was the most efficient, with the lowest steam injection per unit of bitumen produced (Figure 2).

Association Rule and Chi-Square Test
We tested 23 rules to determine whether solvent co-injection with steam, recovery methods, and production regions impacted SORs and SGORs. The results of the association rules are presented in Table 5. Among the 23 rules, Rules 1, 5, 11, 17, and 22 met the criteria of support, confidence, and lift, indicating that the antecedent itemset implies the consequent itemset.
Chi-square tests were conducted on Rules 1, 5, 11, 17, and 22 for statistical significance. The Pearson p-values from the chi-square tests for all five rules were less than 0.05; therefore, we rejected the null hypothesis and concluded that there was a statistically significant association between the antecedent itemset and the consequent itemset.

Efficiency of Recovery Methods
Twenty schemes were clustered into nine groups based on steam injection and oil production (Figure 3). CNULPR, which uses the CSS recovery method, was grouped with SAGD schemes into cluster 5, which had the lowest overall oil production and steam injection volumes. This pattern indicated that the CSS method shared similar characteristics with the SAGD method when production volume was low. The maximum oil production under cluster 5 was 110,468 m³/month. The other two CSS schemes, IMOCL and CNRLWL, were different from SAGD. IMOCL had a steady operation in 2015-2019, and all 60 monthly data points were clustered together and formed independent cluster 4. Fifty-three out of 60 monthly data points for CNRLWL were grouped into cluster 4. Clusters 4 and 9 injected more steam to generate similar oil production in comparison to other clusters that were SAGD schemes. This pattern indicated that the CSS method might be less efficient than the SAGD method when the schemes proceed toward maturity. Rule 5 in Table 5 and the subsequent chi-square test also indicated that the CSS method has a higher SOR and is less efficient, with the rule: {Method = CSS} ⇒ {High SOR} (support: 16%, confidence: 99%, lift: 2.0).
The HSETL, OSUM, and PGFLB schemes are located in the Cold Lake region. They were grouped together with the schemes in the Athabasca region, which implied that different regions might not have an impact on the oil and steam interaction.

Solvent Co-Injection with Steam
Solvent co-injection with steam to improve heavy oil recovery efficiency was first reported in the 1960s [50] and has been successfully used in California for producing and transporting heavy crude oil [51]. The solvents used in the 20 selected schemes were gas (mainly methane), C3, and natural gas condensate. The CNOOCLK, COGGD, and IMOCL schemes injected condensate between 2015 and 2019. The injection volumes per month were 683 m³ for CNOOCLK and 740 m³ for COGGD. The IMOCL scheme had a monthly average condensate injection of 9827 m³; large monthly volumes (greater than 10,000 m³) were injected from June 2017 to January 2019. By December 2019, the condensate injection by IMOCL was stopped. Only CVEFC injected C3 at an average of 2868 m³/month from January 2018 to December 2019. The weighted average of the SOR for CVEFC increased by 8% from 2.56 to 2.77 m³/m³ when comparing before and after C3 co-injection. However, these co-injection activities did not meet the min_sup threshold of 10%. Only gas co-injection met both min_sup and min_conf thresholds.
For gas co-injection, we used the median value (1456 × 10³ m³) of gas injection volume as a cut-off. Six schemes were considered co-injection schemes. Three schemes, CVECL, CVEFC, and SUFB, continuously injected gas between 2015 and 2019 for 60 months. The weighted average SOR of these three schemes was 2.36 m³/m³; it was 45% lower than the weighted average SOR for the 14 schemes without gas co-injection that were fully operational between 2015 and 2019 (Figure 4). The CNRLJF and COPSM schemes began gas co-injection in mid-2016 and early 2017, respectively. The SHAMR scheme was a new operation that began in June 2017; gas co-injection started in September 2018. The weighted average SOR of SHAMR was two times higher than the weighted average SOR of CVECL, CVEFC, and SUFB.
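The comparisons above rely on weighted average SORs and percentage changes. The sketch below assumes, since the text does not define the weighting, that the weighted average is total steam divided by total oil over the period (equivalently, monthly SORs weighted by oil volume). The monthly volumes are invented; the 2.56 and 2.77 m³/m³ figures are the CVEFC values quoted in the text.

```python
def weighted_average_sor(steam_m3, oil_m3):
    """Oil-weighted average SOR over several months: total steam / total oil (assumed definition)."""
    return sum(steam_m3) / sum(oil_m3)


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Invented two-month example: monthly SORs are 2.0 and 3.0, but the low-SOR month
# produced more oil, so the weighted average sits below the simple mean of 2.5.
steam = [200000.0, 90000.0]
oil = [100000.0, 30000.0]
weighted = weighted_average_sor(steam, oil)

# SOR values reported in the text for CVEFC before and after C3 co-injection.
change = percent_change(2.56, 2.77)
print(round(weighted, 2), round(change, 1))
```

The percentage change computed from the two reported CVEFC values rounds to the 8% increase stated in the text.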
The two association rules and chi-square tests suggested that the occurrence of gas co-injection implied a low SOR: {Gas Co-injection} ⇒ {Low SOR} (support: 19%, confidence: 93%, lift: 1.9) and {Method = SAGD, Gas Co-injection} ⇒ {Low SOR} (support: 22%, confidence: 93%, lift: 1.6). The distribution of SORs is provided in Figure 5.

Solution Gas and Production Region
On average, between 2015 and 2019, in situ oil sands extractions produced 21 m³ of solution gas per m³ of bitumen, with a median of 14 m³/m³. The Peace River region only had one scheme: CNULPR. The arithmetic mean of the SGOR for CNULPR was 81 m³/m³. The arithmetic mean of the SGOR for the Cold Lake region was 36 m³/m³, and for the Athabasca region, it was 11 m³/m³. Although the schemes in the Cold Lake region had higher SGORs (Figure 6), none of these schemes used gas co-injection (Figure 5).

[Figure caption fragment: see Figure 5 for an explanation of the boxplot. The boxplot is based on monthly data. The X symbol in the box represents the arithmetic mean.]
Rule 22 and its chi-square test also suggested that there was a strong relationship between the Cold Lake production region and a high SGOR, with 29% support, 87% confidence, and 3.0 lift.

Conclusions
In this study, machine learning methods for data-driven discovery were applied to a public database, Petrinex, containing operating data from 2015 to 2019 that were extracted from over 35 million records for 20 in situ oil sands extraction schemes. The use of clustering and association rules, two unsupervised machine learning methods, implied that: (1) the CSS recovery method was less efficient than SAGD recovery as schemes proceed toward maturity (Rule 5); (2) gas co-injection resulted in low SORs (Rules 1 and 11); and (3) the Cold Lake region had a higher SGOR compared to the two other regions (Rule 22). The procedures and analyses introduced in this study for the two unsupervised machine learning algorithms can be applied to any database in any country for data-driven pattern discovery.
The chi-square test carried out on Rule 5, {Method = CSS} ⇒ {High SOR} (support: 16%, confidence: 99%, lift: 2.0), showed that there was a significant association between the CSS recovery method and high SOR (p < 0.005). SAGD recovery might be the preferred method for decision makers to consider in the future for in situ oil sands development projects because the SAGD method is less GHG-emission-intensive. Choosing SAGD recovery as the preferred method maximises the economic benefits while minimising GHG emissions.
Rule 1, {Gas Co-injection} ⇒ {Low SOR} (support: 19%, confidence: 93%, lift: 1.9), and Rule 11, {Method = SAGD, Gas Co-injection} ⇒ {Low SOR} (support: 22%, confidence: 93%, lift: 1.6), shown in Table 5, suggested that the occurrence of gas co-injection implied a low SOR. Chi-square tests on Rules 1 and 11 showed that there was a statistically significant relationship between gas co-injection and low SOR (p < 0.005). The association rules also indicated that there were no associations between the production regions and SORs.
SORs are also affected by other factors, such as operational efficiency and equipment maintenance. The application of the SAGD method and gas co-injection alone may not result in low SORs. Existing in situ oil sands projects and future developments should explore the possibility of gas co-injection with steam to reduce steam consumption and consequently reduce GHG emissions from the extraction processes.
Supplementary Materials: The following are available online at https://www.mdpi.com/2071-1050/13/4/1968/s1, Table S1: Battery and Injection Facility IDs for each scheme, Table S2: Number of monthly data used in the study, Table S3: Removed monthly data, Table S4: Removed Columns, Table S5: Criteria used for Association Rule, Table S6: Summary of Gas Co-injection.