Abstract
Oil and gas development is characterized by high technical complexity, strong interdisciplinarity, long investment cycles, and significant uncertainty. To meet the need for quick evaluation of overseas oilfield projects with limited data and experience, this study develops an analogy indicator system and tests multiple machine-learning algorithms on two analogy tasks to identify the optimal method. Using an initial set of basic indicators and a database of 1436 oilfield samples, a combined subjective–objective weighting strategy that integrates statistical methods with expert judgment is used to select, classify, and assign weights to the indicators. This process results in 26 key indicators for practical analogy analysis. Single-indicator and whole-asset analogy experiments are then performed with five standard machine-learning algorithms: support vector machine (SVM), random forest (RF), backpropagation neural network (BP), k-nearest neighbor (KNN), and decision tree (DT). Results show that, for medium-to-high permeability sandstone oilfields, SVM achieves classification accuracies of 86% and 95% in the single-indicator and whole-asset analogy tasks, respectively, substantially surpassing the other methods. These results demonstrate the effectiveness of the proposed indicator system and methodology, providing efficient and objective technical support for evaluating and making decisions on overseas oilfield development projects.
1. Introduction
The complex process of oil and gas development involves high technical demands, strong interdisciplinarity, lengthy investment cycles, significant uncertainty, and substantial capital commitments [1]. Assessing potential development projects is further complicated by various project types, limited data availability, and tight decision-making timelines [2,3,4,5,6].
The reservoir analogy, which derives the potential of a new project from the experience of developed fields, plays a crucial role in reducing geological risk, optimizing development plans, and speeding up decision-making. It is particularly helpful during pre-development and early production stages and is most effective when used with newly discovered reservoirs within or near mature oil and gas fields. In the evaluation of new oilfield development projects, especially those overseas lacking operational experience or enough data, reservoir analogy offers guidance for reserves estimation, production forecasting, and strategy development. Both the U.S. Securities and Exchange Commission (SEC) regulations and the Society of Petroleum Engineers’ (SPE) Petroleum Resources Management System (PRMS) standard explicitly support using reliable analogies for reserves assessment when direct data are limited [7,8].
Analogy methods have been proposed or applied in reservoir characterization [9], seismic attribute analysis [10], and resource assessment [11]. With the widespread use of machine learning in petroleum engineering and its strong modeling ability and potential for generalization in reservoir identification, parameter prediction, and development scheme optimization, an increasing number of studies have used machine-learning techniques for analogy and prediction of various oilfield indicators [12,13,14,15,16,17,18]. Bai et al. [19] developed productivity prediction models using linear regression (LR), random forest regression (RF), support vector regression (SVR), backpropagation neural networks (BP), extreme gradient boosting (XGBoost), and LightGBM, demonstrating the role of machine learning in oilfield analogy and production forecasting. Guo et al. [20] introduced a new analogy and machine-learning approach for predicting reservoir permeability. Mahdaviara et al. [21] created a tool using statistical and machine-learning methods to evaluate and screen enhanced oil recovery (EOR) scenarios for low-permeability reservoirs. Rahimi and Riahi [22] classified offshore reservoir facies based on logging data with the RF method.
However, limitations remain in the current analogy practices used in oilfield development. The choice of analogy processes and parameters still heavily depends on expert knowledge and subjective judgment, lacking a systematic and objective evaluation framework. Oilfield projects involve numerous parameters, including geological, reservoir, fluid, reserves, and development factors. Because of the complexity in evaluating multiple parameters, many analogy studies focus on only a limited subset or a single category of parameters for similarity assessment, leading to the absence of a comprehensive and systematic indicator framework [23,24,25,26,27,28]. The weighting of analogy indicators is traditionally based on expert scoring or the Analytic Hierarchy Process (AHP). These methods, being highly subjective and poorly standardized, often produce significant discrepancies among different technical personnel, thus reducing comparability and reproducibility. Additionally, most existing studies on analogy methods concentrate on individual oilfield indicators, while research on analogy at the level of whole asset evaluation remains limited.
To address these challenges, this study aims to develop a comprehensive analogy indicator system specifically designed for overseas oilfield development projects and to optimize analogy methods using multiple machine-learning algorithms. First, raw indicators are extracted and categorized from commercial databases and representative development projects. Next, a set of key analogy indicators is generated through screening, classification, and weighting processes. Five commonly used machine-learning algorithms are then employed to model these key indicators and conduct both single-indicator and whole-asset analogy experiments, with performance evaluated based on adaptability and predictive accuracy. The experimental results validate the effectiveness of the proposed indicator system and identify support vector machine (SVM) as the most appropriate algorithm for this application.
The main contributions of this study are as follows:
- A range of statistical techniques, including the correlation coefficient, systematic clustering, and principal component analysis, are used to screen the original set of indicators and identify key analogy indicators. A classification scheme is then created for the selected key indicators based on probability statistics analysis and expert judgment, ensuring both representativeness and engineering relevance.
- A combined subjective–objective weighting method is proposed for key indicators. Subjective weights are assigned using direct expert scoring, while objective weights are derived from the averaged results of the entropy method and the coefficient of variation method. This approach ensures that the weighting reflects both expert experience and data characteristics.
- Through screening, classification, and weighting, a comprehensive analogy indicator system is developed. It combines static and dynamic parameters from geological, petrophysical, and development aspects, integrating both subjective and objective views to support similarity evaluation between target and candidate oilfields.
- Five machine-learning methods, namely support vector machine (SVM), random forest (RF), backpropagation neural network (BP), k-nearest neighbor (KNN), and decision tree (DT), are used to perform both single-indicator and whole-asset analogy experiments. The adaptability and prediction accuracy of each method are evaluated under different reservoir conditions, such as medium-to-high permeability sandstone and low-permeability sandstone, leading to the identification of the optimal algorithm.
The paper is organized as follows. Section 2 provides a detailed discussion of the methodology and procedures for constructing the analogy indicator system and selecting analogy methods for oilfield development projects. Section 3 presents and analyzes the application of the proposed analogy indicator system and the selection of analogy methods, using real data from oilfield projects. Finally, Section 4 summarizes the main findings of the study.
2. Materials and Methods
Focusing on analogy analysis for oilfield development projects, the study includes two main components: the construction of analogy indicators and the optimization of analogy methods, as shown in Figure 1.
Figure 1.
Indicator analogy process for oilfield development projects.
In the indicator construction component, technical indicators and representative oilfield data were first categorized and organized to create a base indicator system and database tailored to the characteristics of overseas oilfield development projects. Based on the requirements of the analogy task, key analogy indicators were then identified through a series of steps, including indicator screening, classification, and weighting, to prepare for subsequent analogy procedures. In the analogy method component, machine-learning algorithms were applied to perform both single-indicator and whole-asset analogy experiments for new oilfield development projects, to identify the most suitable method to meet the needs of such projects.
2.1. Analogy Indicator System for Oilfield Development Projects
The initial process for selecting basic analogy indicators involved several steps. First, relevant data were collected from multiple sources, including commercial databases like the C&C Reservoir database (http://www.ccreservoirs.com, accessed on 27 June 2025), the IHS Markit Energy Portal (https://energyportal.ci.spglobal.com, accessed on 27 June 2025), and the Wood Mackenzie database (https://www.woodmac.com, accessed on 27 June 2025), as well as historical project records and reviewed literature. Second, the objective analysis method was used to identify key factors that influence project evaluation, ensuring the indicators were comprehensive, complete, and easily interpretable. Third, the key factors were grouped into different indicator sets based on their technical features. Finally, expert consultation was conducted to validate the results, resulting in the basic analogy indicators.
A total of 36 basic analogy indicators were initially chosen, covering both static and development parameters. The static parameters include reservoir properties, trap and structural characteristics, fluid properties, and reserve parameters. Of the 36 indicators, eight are qualitative (highlighted in blue) and twenty-eight are quantitative (shown in black), as shown in Figure 2.
Figure 2.
Basic analogy indicators for oilfield development projects. Qualitative indicators are highlighted in blue, and quantitative indicators are shown in black.
Among the 36 selected indicators, static parameters directly represent key geophysical and petrophysical attributes that govern reservoir performance. Reservoir properties such as lithology, average porosity, and permeability quantify storage capacity, while trap and structural characteristics define the reservoir framework. Fluid properties control fluid mobility and drive mechanisms, and reserve parameters measure volumetric resource potential [29]. Development parameters directly reflect production-process performance and reservoir drive characteristics. By integrating static and development parameters, the basic analogy indicators are firmly grounded in reservoir characteristics and field production performance, thus providing a comprehensive and reliable basis for analogy in oilfield development projects.
Using these 36 indicators, data from various oilfields were compiled and screened to build a basic analogy indicator database for different oilfield types. This database includes data from 1436 oilfields, categorized by region (onshore and offshore) and lithology (sandstone and carbonate).
Among the thirty-six basic analogy indicators, eight qualitative indicators—derived from oilfield classification standards or expert judgment—have clear categorical distinctions and classification functions. As categorical variables, they are not suitable for quantitative analysis and were therefore directly included in the construction of the analogy indicator system. For the remaining 28 quantitative indicators, statistical methods were used for screening, classification, and weight calculation to establish a more scientifically based key analogy indicator system. This system provides parameters and a foundation for future research on analogy methods for oilfield development projects. The flowchart for constructing the key analogy indicator system is shown in Figure 3.
Figure 3.
Flowchart for the construction of the key analogy indicator system in oilfield development projects.
2.1.1. Key Indicator Screening
The initially selected basic analogy indicators aimed to comprehensively capture all key factors characterizing oilfield development projects and influencing project evaluation. However, practical issues may arise with these indicators, such as overlapping or correlated parameters, parameters that do not clearly reflect evaluation-relevant features, and the need to clarify each parameter’s relative importance. Therefore, a secondary screening of the basic indicators is necessary to reduce redundancy and correlation among them and to identify the key indicators for analogy analysis.
In this study, three statistical methods were used for key indicator screening: the correlation coefficient method, the systematic clustering method, and the principal component analysis method.
Correlation Coefficient Method
The correlation coefficient is a statistical metric used to measure the strength of the relationship between two variables. It is calculated using the product-moment method, which is based on the deviations of each variable from its respective mean. The degree of correlation between the two variables is reflected by the product of these deviations.
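Given two variables $x$ and $y$ with $n$ paired observations $(x_i, y_i)$, the correlation coefficient $r_{xy}$ is calculated using the following formula:
$$r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
where $\bar{x}$ and $\bar{y}$ represent the mean values of variables $x$ and $y$, respectively. The absolute value of the correlation coefficient reflects the degree of similarity between the two variables.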
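As a minimal sketch of the product-moment formula, the function below computes the coefficient directly from the deviations of each variable from its mean. The function name `pearson_r` and the sample values are hypothetical and are not taken from the study's database.

```python
import numpy as np

def pearson_r(x, y):
    """Product-moment correlation coefficient of two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of x from its mean
    dy = y - y.mean()  # deviations of y from its mean
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Hypothetical values for two indicators (illustrative only).
depth = [1200.0, 1850.0, 2400.0, 3100.0, 3600.0]  # reservoir burial depth, m
pressure = [12.5, 19.0, 24.8, 31.5, 37.2]         # initial reservoir pressure, MPa

r = pearson_r(depth, pressure)
print(f"r = {r:.3f}")
```

The result agrees with `np.corrcoef`, which could be used instead when a full correlation matrix over many indicators is needed.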
Systematic Clustering Method
Cluster analysis groups sample data based on individual characteristics of the research objects according to predefined classification criteria. Systematic clustering is a hierarchical classification method based on distance metrics, with the core idea of iteratively merging similar objects to form hierarchical groupings from fine to coarse levels. In this method, each sample or variable is initially treated as an individual class. The algorithm then repeatedly calculates inter-class distances and merges the two closest classes until all objects are grouped into a single cluster. This process can be visually represented by a dendrogram, allowing researchers to select an appropriate distance threshold based on practical needs to determine the final clustering scheme. Systematic clustering does not require a predefined number of clusters and can naturally reveal the data’s hierarchical structure. However, its computational complexity increases significantly with the number of samples, making it more suitable for exploratory analysis of small to medium-sized datasets.
In variable clustering studies, the systematic clustering method is often combined with the correlation coefficient to identify groups of key indicators with similar variation patterns. For clustering analysis, the correlation coefficient was converted into a distance metric as follows:
$$d_{xy} = 1 - |r_{xy}|$$
where $d_{xy}$ denotes the distance between variables x and y, calculated as one minus the absolute value of their correlation coefficient $r_{xy}$. A smaller value of $d_{xy}$ indicates a higher similarity between variables x and y, and such variables should be clustered together with higher priority.
Systematic clustering groups data by iteratively merging similar classes, with its core process being the definition of inter-class distances. This study adopts the Single Linkage method. The specific steps are as follows:
- Each sample or variable is initially treated as an independent class, denoted as G1, G2, …, Gₙ. All pairwise distances are calculated to form the initial distance matrix D(0), where the element Dij = dxy represents the distance between variables.
- The minimum distance element in the current matrix is selected as Dpq = min {Dij}, and the corresponding classes Gp and Gq are merged to form a new class Gs.
- The distance matrix is updated by calculating the distance between the new class Gs and any other class Gk as Dsk = min {Dpk, Dqk}, where Dpk and Dqk are the distances between the original classes Gp, Gq, and class Gk.
- The above steps are repeated until all classes are merged into a single class. The entire clustering process can be visualized using a dendrogram, and researchers can determine the final classification scheme by selecting an appropriate distance threshold based on practical needs.
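The steps above can be sketched with SciPy's hierarchical clustering routines, using one minus the absolute correlation as the distance and single linkage for merging. The synthetic data, the variable layout, and the cut-off of 0.4 (equivalent to an absolute correlation of 0.6) are assumptions chosen for illustration, not the study's data or settings.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Synthetic indicators: columns 0-1 share one driver, columns 2-3 share
# another, and column 4 is independent (illustrative data only).
rng = np.random.default_rng(0)
n = 200
base1 = rng.normal(size=n)
base2 = rng.normal(size=n)
X = np.column_stack([
    base1,
    base1 + 0.1 * rng.normal(size=n),
    base2,
    base2 + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# Distance between variables: one minus the absolute correlation.
R = np.corrcoef(X, rowvar=False)
D = 1.0 - np.abs(R)
np.fill_diagonal(D, 0.0)

# Single linkage: the distance between two classes is the minimum
# pairwise distance between their members.
Z = linkage(squareform(D, checks=False), method="single")

# Cut the dendrogram at a distance threshold (assumed value).
labels = fcluster(Z, t=0.4, criterion="distance")
print(labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding tree for visual selection of the threshold.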
Principal Component Analysis Method
Factor analysis is a multivariate dimensionality reduction method aimed at extracting a small number of latent, unobservable common factors from a set of correlated observed variables. The goal is to simplify the data structure and reveal the intrinsic relationships among variables while retaining as much original information as possible. Factor analysis explains most of the variation in the original observed variables through common factors. It assumes that the $p$ original variables $X_1, \dots, X_p$ can be expressed as a linear combination of $k$ latent factors $F_1, \dots, F_k$ (where $k < p$), along with a unique error term $\varepsilon_i$ associated with each variable. The mathematical model of factor analysis can thus be expressed as:
$$X_i = a_{i1} F_1 + a_{i2} F_2 + \cdots + a_{ik} F_k + \varepsilon_i, \quad i = 1, 2, \dots, p$$
It can also be expressed in matrix form as:
$$X = AF + \varepsilon$$
where $A = (a_{ij})_{p \times k}$ is the factor loading matrix, representing the degree of correlation between the original variables and the extracted factors.
We extract principal components using principal component analysis and employ them as factor inputs in the subsequent modeling. The specific steps for determining factor variables using principal component analysis in this study are as follows:
1. Standardize the original variables to zero mean and unit variance in order to eliminate the influence of differing units and scales.
2. Construct the correlation matrix $R$ for the original variables.
3. Perform eigenvalue decomposition on the correlation matrix $R$. Let $U_k$ be the matrix composed of the top $k$ principal eigenvectors, and $\Lambda_k$ be the diagonal matrix of the corresponding eigenvalues. The factor loading matrix $A$ can then be computed as:
$$A = U_k \Lambda_k^{1/2}$$
4. Determine the number of factors k based on the cumulative variance contribution rate, choosing the smallest k such that the cumulative explained variance is at least 90%.
5. Linearly transform the original variables into factor scores to be used as inputs for subsequent modeling. Let wⱼᵢ denote the weight of variable Xᵢ on factor Fⱼ, and the factor score, which is a weighted sum of all variables on a given factor, can be calculated as follows:
$$F_j = \sum_{i=1}^{p} w_{ji} X_i, \quad j = 1, 2, \dots, k$$
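The procedure above can be sketched with NumPy as follows. This is an illustrative sketch rather than the study's implementation: the function name `pca_factors` and the synthetic data are hypothetical, and the score weights are assumed to be the eigenvector entries, which is one common choice among several.

```python
import numpy as np

def pca_factors(X, var_threshold=0.90):
    """Extract principal-component factors from the correlation matrix."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: correlation matrix of the variables.
    R = np.corrcoef(Z, rowvar=False)
    # Step 3: eigen-decomposition, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: smallest k whose cumulative variance share reaches the threshold.
    cum = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(cum, var_threshold)) + 1, len(eigvals))
    U, lam = eigvecs[:, :k], eigvals[:k]
    loadings = U * np.sqrt(lam)  # loading matrix: eigenvectors scaled by sqrt(eigenvalue)
    # Step 5: factor scores as weighted sums of the standardized variables
    # (weights assumed to be the eigenvector entries).
    scores = Z @ U
    return k, loadings, scores, cum

# Synthetic example: six variables driven by two latent factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
mix = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
                [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
X = latent @ mix.T + 0.05 * rng.normal(size=(300, 6))

k, loadings, scores, cum = pca_factors(X)
print("factors retained:", k)
```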
2.1.2. Key Indicators Classification
After identifying the key analogy indicators, it is necessary to further classify and assign weights to them to support both single indicator and whole asset analogy tasks. In this study, quantitative indicators were classified using both probability distribution curves and expert knowledge, ensuring a balance between objective data features and subjective domain experience. This method improves the credibility of the classification results by considering both expert insights and variability from random sampling.
Based on the probabilistic analysis of key quantitative indicators, the distribution patterns were categorized into the following types:
1. Normal distribution type: This includes indicators such as Average Porosity, Reservoir Burial Depth, Initial Reservoir Pressure, Oil API Gravity, Original Oil in Place, and Recovery Factor. For normally distributed indicators, the classification thresholds were calculated from the mean ($\mu$) and standard deviation ($\sigma$).
2. Exponential distribution type: This category includes Net Pay Thickness, Average Permeability, Oil Volume Factor, Bubble Point Pressure, Reserves Abundance, Well Pattern Density, Initial Production per Well, Peak Production Rate, and Oil Production Rate. For exponentially distributed indicators, the classification thresholds were determined based on the characteristic quantiles extracted from the cumulative distribution function (CDF) at probability levels of 15%, 30%, 70%, and 85%.
3. Uniform distribution type: This category includes Average Net-to-Gross Ratio and Recovery Efficiency of Reserves. For uniformly distributed indicators, the classification thresholds were defined by the characteristic quantiles extracted from the CDF at probability levels of 20%, 40%, 60%, and 80%.
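A quantile-based classification of this kind can be sketched as below. The function name `quantile_classes`, the synthetic permeability values, and the convention that class 1 holds the lowest values are assumptions made for illustration. The probability levels are those stated above for the exponential type.

```python
import numpy as np

def quantile_classes(values, probs):
    """Split an indicator into len(probs) + 1 classes at empirical quantiles."""
    values = np.asarray(values, dtype=float)
    thresholds = np.quantile(values, probs)
    # digitize returns 0 below the first threshold, so add 1 to get classes 1..5.
    classes = np.digitize(values, thresholds) + 1
    return thresholds, classes

# Synthetic, exponentially distributed indicator (illustrative only).
rng = np.random.default_rng(3)
permeability = rng.exponential(scale=200.0, size=1000)

thresholds, classes = quantile_classes(permeability, [0.15, 0.30, 0.70, 0.85])
print("thresholds:", np.round(thresholds, 1))
print("class counts:", np.bincount(classes)[1:])
```

For the uniform type, the same function would be called with `[0.20, 0.40, 0.60, 0.80]`.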
2.1.3. Key Indicator Weighting
In analogy-based evaluation of new oilfield development projects, different analogy objectives relate to various key influencing factors and their associated weights. The weight assigned to each key indicator indicates its relative contribution to the final evaluation result. Common weighting methods are divided into three categories: subjective methods (such as analytic hierarchy process, expert scoring), objective methods (such as entropy method, coefficient of variation method), and combined subjective–objective methods.
Subjective methods reflect the preferences and judgments of decision-makers, but they often lack objectivity. Objective methods depend on the inherent structure of the data but may miss the practical significance of indicators. To balance expert opinions with data-driven insights, this study uses a combined subjective–objective weighting approach. For the subjective part, expert scoring was used, where domain specialists assigned scores to each indicator based on engineering experience and the analogy objective, producing the subjective weights. For the objective part, both the entropy method and the coefficient of variation method were applied to calculate the objective weights, and their average was used as the final objective value.
Entropy Method
The entropy method is an objective weighting technique based on information entropy theory. Entropy measures uncertainty, with an indicator’s entropy value indicating its degree of dispersion. The lower the entropy value, the higher the dispersion and the greater the impact of that indicator on the overall evaluation. The steps for calculating the entropy method are as follows:
1. Normalize the data. Let $i = 1, \dots, n$ denote the sample index and $j = 1, \dots, m$ denote the indicator index. Let $x_{ij}$ represent the original value of the $j$-th indicator for the $i$-th sample. Denote the minimum and maximum values of all samples for the $j$-th indicator as $x_j^{\min}$ and $x_j^{\max}$, respectively. Each $x_{ij}$ is linearly mapped to the interval [0, 1], resulting in the normalized value $x'_{ij}$.
For a positive indicator, the normalization is computed as follows:
$$x'_{ij} = \frac{x_{ij} - x_j^{\min}}{x_j^{\max} - x_j^{\min}}$$
For a negative indicator, the normalization is computed as follows:
$$x'_{ij} = \frac{x_j^{\max} - x_{ij}}{x_j^{\max} - x_j^{\min}}$$
2. Calculate the proportion of the $j$-th indicator in the $i$-th sample, denoted as $p_{ij}$, using the following formula:
$$p_{ij} = \frac{x'_{ij}}{\sum_{i=1}^{n} x'_{ij}}$$
3. Compute the entropy value $e_j$ of the $j$-th indicator. Let $n$ be the total number of samples and $k = 1/\ln n$ be the normalization constant. The entropy value is calculated as follows:
$$e_j = -k \sum_{i=1}^{n} p_{ij} \ln p_{ij}$$
4. Compute the redundancy and derive the weight of each indicator. The redundancy is given by $d_j = 1 - e_j$, and the final weight $w_j^{E}$ is calculated as follows:
$$w_j^{E} = \frac{d_j}{\sum_{j=1}^{m} d_j}$$
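The four steps can be sketched in NumPy as follows. The function name `entropy_weights`, the `positive` flag, and the small data matrix are hypothetical. The sketch also assumes that no indicator is constant across samples and adopts the usual convention that a zero proportion contributes zero to the entropy sum.

```python
import numpy as np

def entropy_weights(X, positive=None):
    """Entropy-method weights for an (n samples x m indicators) matrix."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    if positive is None:
        positive = np.ones(m, dtype=bool)  # treat all indicators as positive
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    # Step 1: min-max normalization, reversed for negative indicators.
    Y = np.where(positive, (X - xmin) / (xmax - xmin), (xmax - X) / (xmax - xmin))
    # Step 2: proportion of each sample within each indicator.
    P = Y / Y.sum(axis=0)
    # Step 3: entropy with normalization constant 1 / ln(n); 0 * ln(0) := 0.
    plogp = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    e = -plogp.sum(axis=0) / np.log(n)
    # Step 4: redundancy and normalized weights.
    d = 1.0 - e
    return d / d.sum()

# Illustrative data: the first indicator is concentrated in one sample,
# the second is spread evenly across samples.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [5.0, 4.0]])
w = entropy_weights(X)
print(np.round(w, 3))
```

In this toy example the first indicator has lower entropy and therefore receives the larger weight, matching the rule stated above.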
Coefficient of Variation Method
The coefficient of variation method is an objective weighting approach based on the degree of variability of indicator data. The coefficient of variation is a relative measure of data dispersion, calculated as the standard deviation divided by the mean. This method removes the effects of units and magnitude. Because of its properties, the coefficient of variation measures how much an indicator varies relative to others. A higher coefficient indicates more dispersion, meaning the indicator has a greater influence on the overall evaluation.
Let the mean and standard deviation of the $j$-th indicator across all samples be denoted as $\bar{x}_j$ and $\sigma_j$, respectively. The coefficient of variation $v_j$ is calculated using the following formula:
$$v_j = \frac{\sigma_j}{\bar{x}_j}$$
The weight of indicator $j$ based on the coefficient of variation method, denoted as $w_j^{V}$, is calculated over the $m$ indicators using the following formula:
$$w_j^{V} = \frac{v_j}{\sum_{j=1}^{m} v_j}$$
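A short sketch of the coefficient of variation weighting is given below. The function name `cv_weights` and the data are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator is used.

```python
import numpy as np

def cv_weights(X):
    """Coefficient-of-variation weights for an (n samples x m indicators) matrix."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)  # sample standard deviation (assumed)
    v = std / np.abs(mean)       # coefficient of variation per indicator
    return v / v.sum()           # normalize so the weights sum to 1

# Illustrative data: the first indicator varies much more relative to its mean.
X = np.array([[1.0, 100.0],
              [2.0, 110.0],
              [3.0, 120.0]])
w_cv = cv_weights(X)
print(np.round(w_cv, 3))
```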
Combined Subjective–Objective Weighting Method
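This study adopts a linear weighting approach to combine subjective and objective methods. Let $w_j^{S}$ denote the subjective weight of indicator $j$, obtained using the expert direct rating method, and let $w_j^{E}$ and $w_j^{V}$ denote the weights from the entropy method and the coefficient of variation method, respectively. Let $\alpha$, $\beta$, and $\gamma$ represent the weight coefficients for the subjective method, entropy method, and coefficient of variation method, respectively. The final weight $W_j$ for each indicator is calculated by linearly combining the subjective weight and the two objective weights, as shown below:
$$W_j = \alpha\, w_j^{S} + \beta\, w_j^{E} + \gamma\, w_j^{V}$$
In this study, the two objective weights are averaged to form the objective part, so the entropy and coefficient-of-variation coefficients are equal ($\beta = \gamma$).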
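The linear combination can be sketched as below. The coefficient values used here (0.5 for the subjective part, 0.25 for each objective method) are placeholders chosen only so that the two objective methods contribute equally. They are not the values used in the study, and the function name and weight vectors are likewise hypothetical.

```python
import numpy as np

def combined_weights(w_subj, w_entropy, w_cv, alpha, beta, gamma):
    """Linearly combine subjective and objective indicator weights."""
    w = (alpha * np.asarray(w_subj, dtype=float)
         + beta * np.asarray(w_entropy, dtype=float)
         + gamma * np.asarray(w_cv, dtype=float))
    # Renormalize so the result sums to 1 even if the coefficients do not.
    return w / w.sum()

# Hypothetical weights for three indicators.
w_subj = [0.5, 0.3, 0.2]     # expert scoring
w_entropy = [0.2, 0.3, 0.5]  # entropy method
w_cv = [0.4, 0.4, 0.2]       # coefficient of variation method

# Placeholder coefficients (assumed, with equal objective contributions).
W = combined_weights(w_subj, w_entropy, w_cv, alpha=0.5, beta=0.25, gamma=0.25)
print(np.round(W, 3))
```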
2.2. Analogy Methods for Oilfield Development Projects
Analogy, also referred to as analogical reasoning, is a method of inference that deduces the presence of certain properties in a target entity based on the known existence of similar properties in a comparable reference. The validity of such inference must be empirically verified, and the more attributes the two entities share, the more reliable the analogical conclusion becomes.
In oilfield development projects, analogy involves selecting similar reference projects to evaluate a new development project through comparison and inference, thereby enabling rapid screening and evaluation. In this study, two analogy approaches are employed: (1) single indicator analogy, which supplies reference values for missing indicators in the target asset, and (2) whole asset analogy, which qualitatively evaluates the overall potential and value of the target project. Both approaches are implemented using five machine-learning algorithms: support vector machine (SVM), random forest (RF), backpropagation neural network (BP), k-nearest neighbor (KNN), and decision tree (DT). The workflow of the analogy method is shown in Figure 4.
Figure 4.
Flowchart for machine-learning-based optimization of analogy methods for oilfield development projects.
2.2.1. Machine-Learning Methods
Machine learning is an interdisciplinary field that combines statistics, artificial intelligence, and computer science. It is also called predictive analytics or statistical learning. Machine-learning algorithms find patterns and features directly from data using computational methods, without depending on predetermined equations. As the number of training samples grows, these algorithms can improve their performance adaptively. The main idea is to train models to discover the underlying patterns and key rules of a phenomenon, ultimately enabling prediction or decision-making.
This study builds on the previously established key analogy indicator system and uses five machine-learning methods to predict key indicators and assess the whole asset level of oilfield development projects. A comparative overview of the characteristics of these methods is shown in Table 1.
Table 1.
Comparison of machine-learning methods.
2.2.2. Single Indicator Analogy
Oilfields were first grouped by type, and for each type the key indicator system supplied features covering both static reservoir parameters and development parameters. Machine-learning methods were then applied to predict the target indicator. In this study, experiments were carried out separately for two types of oilfields: onshore medium-to-high permeability sandstone reservoirs and onshore low-permeability sandstone reservoirs [30].
Before the analogy process, data preprocessing was performed on the key indicator set. Indicators with more than 70% sample coverage were chosen as feature variables. Using the recovery factor as the target variable for the analogy, seven key indicators were selected as features: well pattern density, oil API gravity, original oil in place, reserves abundance, average porosity, net pay thickness, and reservoir burial depth. After removing outliers, imputing missing values with the median, and normalizing the data, 663 samples from medium-to-high permeability sandstone oilfields and 157 samples from low-permeability oilfields were retained.
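A preprocessing and prediction workflow of this kind might look like the scikit-learn sketch below. It runs on synthetic data and is not the study's code: the seven columns merely stand in for the seven features, the three-class target is an invented proxy for recovery-factor classes, outlier removal is omitted, and the SVM hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for seven feature indicators (illustrative only).
rng = np.random.default_rng(42)
n_samples, n_features = 300, 7
X = rng.normal(size=(n_samples, n_features))

# Invented target: three classes derived from two of the features.
score = X[:, 0] + 0.5 * X[:, 1]
y = np.digitize(score, np.quantile(score, [1 / 3, 2 / 3]))

# Knock out about 5% of the entries to mimic missing values.
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Median imputation, min-max normalization, then an RBF-kernel SVM.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    MinMaxScaler(),
    SVC(kernel="rbf", C=10.0, gamma="scale"),
)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

Wrapping the imputer and scaler in the pipeline keeps their statistics fitted on the training split only, which avoids leaking information from the test samples.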
2.2.3. Whole Asset Analogy
Using conventionally classified oilfield asset categories as the reference standard and incorporating the selected key indicators, machine-learning methods were applied to classify asset levels for onshore medium-to-high permeability sandstone oilfields, with a total of 663 samples. Based on oilfield characteristics, development potential, and economic benefits, the assets were divided into five levels, with Level 1 representing the highest priority. Compared to the limited sample size of low-permeability oilfields, the larger dataset of medium-to-high permeability oilfields is more suitable for training and optimizing machine-learning models, making it ideal for whole asset analogy and selection.
In the analogy indicator system for oilfield development projects, nine features were chosen: well pattern density, oil API gravity, original oil in place, reserves abundance, average porosity, average permeability, net pay thickness, reservoir burial depth, and recovery factor. These were used as inputs for the five machine-learning methods mentioned earlier to classify and prioritize target oilfield assets.
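A comparison of the five method families could be sketched with scikit-learn as follows. The data are synthetic, the five-level labels are generated from an equal-weight score purely for illustration, and the hyperparameters are untuned defaults or arbitrary choices. `MLPClassifier` is used here as a stand-in for a backpropagation neural network.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for nine normalized features (illustrative only).
rng = np.random.default_rng(7)
X = rng.random(size=(400, 9))

# Invented five-level labels: quantile bins of an equal-weight score,
# with Level 1 assigned to the highest scores.
score = X.mean(axis=1)
y = 5 - np.digitize(score, np.quantile(score, [0.2, 0.4, 0.6, 0.8]))

models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0)),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "BP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy per method.
results = {
    name: float(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
    for name, model in models.items()
}
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Accuracies obtained on this synthetic data say nothing about the relative performance of the methods on real oilfield samples.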
In the conventional empirical classification, asset categories are determined via a linear weighted scoring procedure applied to the n selected key quantitative indicators, comprising the following steps:
1. Construct the evaluation matrix by arranging the indicators as columns and the 663 oilfields as rows.
2. Normalize each indicator to the [0, 1] range and multiply by its obtained key-indicator weight.
3. Compute each oilfield’s analogy score $S$ by summing the weighted, normalized scores.
4. Plot the score histogram and, based on empirical distribution characteristics, divide $S$ into five intervals to define the asset levels.
The histogram of analogy scores for the 663 medium-to-high permeability sandstone oilfields is shown in Figure 5. According to conventional empirical classification, the five asset levels are defined as Level 1 for $S \geq 0.58$, Level 2 for $0.53 \leq S < 0.58$, Level 3 for $0.43 \leq S < 0.53$, Level 4 for $0.38 \leq S < 0.43$, and Level 5 for $S < 0.38$.
Figure 5.
Histogram of composite analogy scores for 663 medium-to-high permeability sandstone oilfields.
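The scoring and level assignment can be sketched as below, using the interval boundaries stated above. The function names, the tiny data matrix, and the equal weights are hypothetical, and the sketch assumes every indicator is a positive one (higher is better).

```python
import numpy as np

# Level boundaries for the analogy score, as stated in the text.
LEVEL_CUTS = np.array([0.38, 0.43, 0.53, 0.58])

def analogy_scores(X, weights):
    """Weighted sum of min-max normalized indicators (rows are oilfields)."""
    X = np.asarray(X, dtype=float)
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    Xn = (X - xmin) / (xmax - xmin)  # each indicator scaled to [0, 1]
    return Xn @ np.asarray(weights, dtype=float)

def asset_level(scores):
    """Map scores to Levels 1-5, with Level 1 for the highest scores."""
    # digitize gives 0 for S < 0.38 up to 4 for S >= 0.58.
    return 5 - np.digitize(scores, LEVEL_CUTS)

# Hypothetical example: three oilfields, two indicators, equal weights.
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [2.0, 4.0]])
S = analogy_scores(X, [0.5, 0.5])
levels = asset_level(S)
print(S, levels)
```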
All analyses were conducted in Python 3.8, using NumPy, pandas, and SciPy for data handling and statistical computations, scikit-learn for machine-learning tasks, and Matplotlib 3.3.2 for visualization.
3. Results and Analysis
3.1. Construction of the Analogy Indicator System
3.1.1. Results of Key Indicator Screening
Correlation Coefficient Method
Among the twenty-eight basic quantitative indicators, three indicators—pressure coefficient, reservoir oil density, and surface oil viscosity—had a high proportion of missing data in the collected dataset and were therefore excluded from the correlation analysis. The remaining 25 indicators were divided into two categories: static parameters and development parameters. Correlation analysis was conducted separately for each category. The correlation coefficient matrix of static parameters is shown in Table 2, and that of development parameters is shown in Table 3.
Table 2.
Correlation matrix of static parameters. Correlation coefficients greater than or equal to 0.6 are highlighted in bold to indicate relatively strong associations.
Table 3.
Correlation matrix of development parameters.
The correlation coefficient $r$ is used to assess the degree of linear association between two variables. When $0 < |r| < 1$, it indicates that a certain degree of linear correlation exists between the two variables. The closer $|r|$ is to 1, the stronger the linear relationship between them. Conversely, the closer $|r|$ is to 0, the weaker the linear correlation. In general, the correlation coefficient is interpreted in three levels: $|r| < 0.4$ indicates low linear correlation, $0.4 \leq |r| < 0.7$ indicates significant correlation, and $0.7 \leq |r| < 1$ indicates high linear correlation.
However, due to differences in sample size, the number of variables, their characteristics, and the relationships among them, the threshold values used to determine the degree of correlation may vary in different situations. In some cases, a correlation coefficient greater than 0.5 is regarded as indicating good correlation, whereas in others, a threshold of 0.8 or even higher is required. We selected a threshold of 0.6. This choice balances the need for strong internal consistency within each indicator group against the requirement for clear separation between groups. Preliminary grouping experiments on our dataset showed that using a threshold of 0.6 produced stable clusters with high repeatability and minimal cross-group correlation.
In this study, based on the correlation coefficient threshold (0.6 ≤ |r| < 1), the static parameter indicators were classified into three groups. Within each group, indicators show strong internal correlation, while the correlation between groups is relatively weak. The first group includes Reservoir Burial Depth, Initial Reservoir Pressure, and Initial Reservoir Temperature. The second group consists of Initial Gas–Oil Ratio, Oil Volume Factor, and Bubble Point Pressure. The third group includes Reservoir Area, Original Oil in Place, and Recoverable Oil Reserves. The remaining eight static parameters were treated as independent indicators. For the eight development parameters, all pairwise correlation coefficients were below 0.6, indicating no strong correlation, so each was also treated as an independent indicator. The detailed classification of parameters is shown in Table 4.
Table 4.
Classification statistics of static and development parameters based on the correlation coefficient method.
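The grouping step described above can be illustrated as follows. This is a minimal sketch, not the study's implementation: it assumes Pearson correlation and assumes that indicators are linked transitively (connected components of the graph in which |r| ≥ 0.6 defines an edge). The column names and the synthetic data are hypothetical stand-ins for the real indicator table.

```python
import numpy as np
import pandas as pd
from scipy.sparse.csgraph import connected_components

def correlation_groups(df, threshold=0.6):
    """Group columns whose absolute Pearson correlation reaches the threshold."""
    abs_corr = df.corr().abs().to_numpy()
    # Link two indicators when |r| >= threshold; groups are the connected
    # components of that graph (an assumed grouping rule).
    adjacency = (abs_corr >= threshold).astype(int)
    _, labels = connected_components(adjacency, directed=False)
    groups = {}
    for name, label in zip(df.columns, labels):
        groups.setdefault(int(label), []).append(name)
    return list(groups.values())

# Synthetic stand-in data; column names are hypothetical.
rng = np.random.default_rng(0)
depth = rng.normal(size=500)
demo = pd.DataFrame({
    "burial_depth": depth,
    "initial_pressure": depth + 0.3 * rng.normal(size=500),
    "initial_temperature": depth + 0.3 * rng.normal(size=500),
    "porosity": rng.normal(size=500),
})
groups = correlation_groups(demo)
```

In this toy table the three depth-driven columns end up in one group and the unrelated column stays on its own, which mirrors the distinction between grouped and independent indicators in Table 4.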
Systematic Clustering Method
Systematic clustering using the Single Linkage method was performed separately for static and development parameters, and the corresponding dendrograms are shown in Figure 6 and Figure 7. In the dendrogram, a smaller distance indicates a higher tendency for variables to cluster together, and indicators within the same cluster tend to have strong internal correlation and redundancy. The clustering results for both static and development parameters are summarized with cluster thresholds ranging from 2 to 5. Indicators grouped in the same cluster are enclosed in parentheses, while those without parentheses are considered independent indicators, as shown in Table 5 and Table 6.
Figure 6.
Hierarchical clustering dendrogram of static parameters.
Figure 7.
Hierarchical clustering dendrogram of development parameters.
Table 5.
Classification of static parameters under different clustering distance thresholds. Grouped parameters are indicated within parentheses, while ungrouped entries represent individual indicators.
Table 6.
Classification of development parameters under different clustering distance thresholds. Grouped parameters are indicated within parentheses, while ungrouped entries represent individual indicators.
According to the statistical results, a smaller clustering distance threshold leads to fewer merged clusters and more independent indicators, while a larger threshold results in more integrated clusters and fewer independent indicators. In this study, we consider the threshold value of 3 to be appropriate, as the resulting indicator sets are well suited for project evaluation requirements.
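The effect of the cut-off distance can be reproduced with SciPy's hierarchical clustering routines. The sketch below is illustrative only: the distance between indicators is assumed to be 1 − |r|, which is not stated in the text, so its cut-off values are on a different scale from the thresholds of 2 to 5 used in Tables 5 and 6. The indicator names and correlation values are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def single_linkage_groups(corr, names, cut_distance):
    """Cluster indicators with single linkage and cut the tree at cut_distance."""
    dist = 1.0 - np.abs(np.asarray(corr, dtype=float))  # assumed distance: 1 - |r|
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="single")
    labels = fcluster(tree, t=cut_distance, criterion="distance")
    groups = {}
    for name, label in zip(names, labels):
        groups.setdefault(int(label), []).append(name)
    return list(groups.values())

# Hypothetical correlation matrix for four indicators (illustrative values).
names = ["depth", "pressure", "ooip", "reserves"]
corr = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
]
tight = single_linkage_groups(corr, names, cut_distance=0.2)
loose = single_linkage_groups(corr, names, cut_distance=0.5)
```

With the smaller cut-off only the most similar pair merges and two indicators remain independent; with the larger cut-off both pairs merge, which is the trade-off described in the paragraph above.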
These results align with the engineering significance of oilfield attributes. Static parameters were divided into 12 categories: Reservoir Area, Reservoir Oil Viscosity, and Net Pay Thickness form one group, reflecting the fundamental storage space and flow characteristics of the reservoir. Original Oil in Place and Recoverable Oil Reserves form another group, highlighting the dominant role of reserve volume in oilfield assessment. Initial Gas–Oil Ratio and Oil Volume Factor form a group, indicating how fluid properties affect drive efficiency and sustain production. Reservoir Burial Depth and Initial Reservoir Pressure form a further group, demonstrating the critical role of reservoir pressure in maintaining production energy. Development parameters were divided into five categories: Well Pattern Density, Initial Production per Well, and Peak Production Rate form one group, characterizing the decisive influence of well spacing and initial output on production capacity. Composite Decline Rate and Oil Production Rate form another group, illustrating production decline behavior and long-term performance stability. The remaining indicators each form individual groups, underscoring their independent value in evaluation.
Principal Component Analysis Method
To evaluate the explanatory power of each factor and the contribution of variables, a scree plot was used to visualize the importance of factors. The number of factors was determined by finding the inflection point of the eigenvalue curve. According to this method, calculations were performed separately for the static parameters and development parameters, and the results were displayed using scree plots, as shown in Figure 8 and Figure 9.
Figure 8.
Scree plot of static parameters.
Figure 9.
Scree plot of development parameters.
In the scree plot, a smaller eigenvalue threshold retains a larger number of factors, which yields a higher cumulative contribution rate and better retention of the original information. The number of factors and the corresponding cumulative variance explained for various eigenvalue thresholds are shown in Table 7 and Table 8.
Table 7.
Explained variance of factors for static parameters.
Table 8.
Explained variance of factors for development parameters.
In this study, the number of factors was determined based on a cumulative variance explained threshold of 90%. For the static parameters, nine factors were extracted, meaning that the 17 indicators were grouped into nine factor categories. For the development parameters, six factors were extracted, corresponding to six categories for the eight development indicators.
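The 90% rule can be sketched directly from the eigenvalues that a scree plot displays. The example below is a minimal illustration, assuming the eigenvalues are taken from the correlation matrix of the indicators (that is, PCA on standardized variables). The function name and the synthetic data are hypothetical.

```python
import numpy as np

def factors_for_cumulative_variance(X, target=0.90):
    """Return the smallest number of factors whose cumulative variance >= target."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted in descending order
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # First position at which the cumulative share reaches the target.
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(eigenvalues)), eigenvalues, cumulative

# Synthetic stand-in: six indicators driven by two latent factors plus noise.
rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(2, 300))
noise = 0.1 * rng.normal(size=(300, 6))
demo = np.column_stack([f1, f1, f1, f2, f2, f2]) + noise
k, eigenvalues, cumulative = factors_for_cumulative_variance(demo)
```

Plotting `eigenvalues` against their rank gives the scree plot, and `cumulative` gives the kind of cumulative variance figures reported in Table 7 and Table 8.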
Based on the number of factors corresponding to the 90% cumulative contribution rate, the rotated factor loading matrices were calculated, as shown in Table 9 and Table 10.
Table 9.
Static parameter rotated factor loadings. The bold values represent the loading with the greatest absolute magnitude for each parameter, indicating its primary associated factor.
Table 10.
Development parameter rotated factor loadings. The bold values represent the loading with the greatest absolute magnitude for each parameter, indicating its primary associated factor.
Based on the results of the factor loading matrices, the classification of static and development parameters was summarized in Table 11.
Table 11.
Classification statistics of static and development parameters based on the principal component analysis method.
These classification results align with the engineering significance of oilfield attributes. Static parameters were divided into nine categories: Reservoir Burial Depth, Initial Reservoir Pressure, and Initial Reservoir Temperature form one group, reflecting the subsurface pressure–temperature conditions that govern drive energy. Initial Gas–Oil Ratio, Oil Volume Factor, and Bubble Point Pressure form another group, indicating how fluid phase behavior and volumetric expansion potential control recovery efficiency. Reservoir Area, Original Oil in Place, and Recoverable Oil Reserves form a third group, highlighting the volumetric capacity that underpins overall field value. Average Permeability, Reservoir Oil Viscosity, and Oil API Gravity form a fourth group, capturing the combined effects of reservoir permeability and fluid mobility on deliverability. The remaining five static parameters each form individual categories. Development parameters were divided into six categories: Well Pattern Density and Composite Water Cut form a group, characterizing development intensity and waterflooding performance. Recovery Factor and Oil Production Rate form another group, illustrating overall production capacity. The remaining four development indicators each form individual groups, underscoring their independent value in evaluation.
Correlation coefficient analysis, systematic clustering, and principal component analysis were used to classify both static and development parameters. The classification results from the three methods are summarized in Table 12 and Table 13. Indicators marked with the same color in the tables belong to the same category, while uncolored indicators are considered independent, meaning they are not grouped with any other indicators.
Table 12.
Summary of static parameter indicator classification results by three screening methods. Indicators marked in the same color belong to the same category; those left unshaded are treated as individual indicators.
Table 13.
Summary of development parameter indicator classification results by three screening methods. Indicators marked in the same color belong to the same category; those left unshaded are treated as individual indicators.
As shown in the tables above, the classification results of static parameters obtained by the three methods are generally consistent. For development parameters, the correlation coefficient method does not reduce the number of indicators, while the other two approaches yield alternative classification patterns. It is important to note that all three methods rely solely on statistical analysis and do not consider the geological or reservoir engineering significance of the indicators. Therefore, based on practical reservoir engineering needs and through expert discussion and validation, a comprehensive evaluation was performed to select the key indicators. In the end, twelve indicators were chosen from the seventeen static parameters and six from the eight development parameters. These eighteen quantitative indicators, combined with eight qualitative indicators, form a total of twenty-six key analogy indicators for oilfield development projects, as shown in Figure 10.
Figure 10.
Key analogy indicators for oilfield development projects. Qualitative indicators are highlighted in blue, and quantitative indicators are shown in black.
3.1.2. Results of Key Indicators Classification
Following the procedure outlined in Section 2.1.2, the probability distribution of each key quantitative indicator was calculated based on the grading criteria, and the results are summarized in Table 14. For the three parameters with existing industry-standard classification systems (porosity, permeability, and formation oil viscosity), we primarily used the established industry criteria. By systematically incorporating expert knowledge, the classification thresholds for all key quantitative indicators were optimized and calibrated, with the final classification standards shown in Table 15.
Table 14.
Threshold values for classifying indicators by distribution types.
Table 15.
Classification standards for key quantitative indicators.
3.1.3. Results of Key Indicator Weighting
For the two categories of key quantitative indicators—static parameters and development parameters—we calculated objective weights using the entropy method and the coefficient of variation method. We assigned reasonable subjective weights based on expert judgment and then used the combined subjective–objective weighting formula to determine the comprehensive weight for each indicator. The results are shown in Table 16 and Table 17.
Table 16.
Key indicator weights for static parameters.
Table 17.
Key indicator weights for development parameters.
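For orientation, the two objective weighting schemes named above can be sketched in a few lines. This is a hedged illustration using the textbook forms of the entropy method and the coefficient of variation method. The averaging of the two objective weights and the blending rule in `combined_weights` (a convex mix with a hypothetical parameter `alpha`) are placeholders, not the combined formula defined in the methodology section. The data and the subjective weights are invented for the example.

```python
import numpy as np

def entropy_weights(X):
    """Entropy-method weights; assumes no indicator column is constant."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  # min-max scaling
    P = Z / Z.sum(axis=0)
    # Treat 0 * log(0) as 0 without triggering log(0) warnings.
    plogp = np.where(P > 0, P * np.log(np.where(P > 0, P, 1.0)), 0.0)
    entropy = -plogp.sum(axis=0) / np.log(X.shape[0])
    divergence = 1.0 - entropy
    return divergence / divergence.sum()

def cv_weights(X):
    """Coefficient-of-variation weights (sample std divided by |mean|)."""
    X = np.asarray(X, dtype=float)
    cv = X.std(axis=0, ddof=1) / np.abs(X.mean(axis=0))
    return cv / cv.sum()

def combined_weights(subjective, objective, alpha=0.5):
    """Placeholder blend of subjective and objective weights (assumed rule)."""
    w = alpha * np.asarray(subjective, dtype=float) + (1 - alpha) * np.asarray(objective, dtype=float)
    return w / w.sum()

# Illustrative data: 4 fields x 3 indicators (not values from the study).
X = [[10, 200, 0.50], [12, 800, 0.60], [11, 50, 0.55], [13, 1500, 0.52]]
w_entropy = entropy_weights(X)
w_cv = cv_weights(X)
objective = (w_entropy + w_cv) / 2           # assumed way of merging the two
subjective = [0.3, 0.4, 0.3]                 # hypothetical expert weights
w_final = combined_weights(subjective, objective)
```

Both objective methods reward indicators that vary more across fields, so the expert weights serve as a counterbalance when a highly variable indicator has limited engineering relevance.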
3.2. Analogy Method Optimization
Based on a systematic review of prior studies and algorithms, we selected five classical machine-learning methods—support vector machine (SVM), random forest (RF), backpropagation neural network (BP), k-nearest neighbors (KNN), and decision tree (DT)—for comparative analysis. These methods span linear and nonlinear, parametric and nonparametric, model-based and distance-based paradigms, enabling a comprehensive evaluation of the proposed analogy indicator system. Their interpretability and ease of implementation also satisfy the reproducibility and engineering requirements of oilfield development decision-making. Although recent work has applied next-generation ensemble methods such as extreme gradient boosting (XGBoost) and LightGBM to production forecasting [19], those algorithms typically demand large sample sizes and extensive hyperparameter tuning, which risks overfitting under our sample conditions. In contrast, the five selected classical algorithms achieve robust performance without complex tuning, enhancing the generalizability of our findings and providing a valuable reference for future research.
3.2.1. Results of Single Indicator Analogy
A dataset of 663 onshore medium-to-high permeability sandstone oilfields was used for single indicator analogy. Machine-learning models were trained on 80% of the samples and tested on the remaining 20%. Test-set fitting accuracy was calculated to assess each method’s predictive performance. Table 18 shows that SVM and RF achieved accuracies of 85.81% and 74.1%, respectively, both meeting engineering requirements. The BP neural network, KNN, and DT showed lower accuracies.
Table 18.
Test set accuracy (%) of five machine-learning methods for recovery factor prediction in medium-high permeability sandstone oilfields.
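The 80/20 comparison can be outlined with scikit-learn as shown below. This is a sketch under stated assumptions rather than the study's code: the data are synthetic (the oilfield database is not public), the hyperparameters are arbitrary, and the score reported is the coefficient of determination R² on the test set, which is not necessarily the accuracy measure used in Table 18.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with the same sample count as the medium-high permeability set.
X, y = make_regression(n_samples=663, n_features=18, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with standardization.
models = {
    "SVM": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0)),
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "BP": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "DT": DecisionTreeRegressor(random_state=0),
}

# R^2 on the held-out 20% for each model (an assumed metric).
test_r2 = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
```

Fitting the scaler inside each pipeline keeps test-set statistics out of training, which matters when comparing models on a few hundred samples.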
Figure 11 presents predicted versus actual recovery values for each method. Green points denote training-data predictions, and red points denote test-data predictions. In the SVM plot, both green and red points cluster tightly along the 45° line, indicating an excellent fit on the training data and strong generalization to unseen cases. RF also aligns densely around the diagonal, although test-data points show slightly greater scatter, indicating robust performance with limited variance. By contrast, the BP neural network achieves a near-perfect fit on the training data but exhibits marked dispersion on the test data, signaling overfitting. KNN yields a broadly dispersed point cloud for both sets, revealing underfitting due to its local-averaging nature. Finally, DT displays a staircase pattern in the training data, characteristic of overfitting, and notable scatter in the test data, reflecting poor generalization. These observations confirm that only SVM and RF achieve a balance between bias and variance that meets engineering requirements.
Figure 11.
Comparison of predicted versus actual recovery values for medium-high permeability sandstone oilfields computed by five machine-learning methods.
Therefore, support vector machine and random forest were selected for recovery factor prediction.
A total of 157 onshore low-permeability sandstone oilfields were analyzed similarly, with 80% of samples used for training and 20% for testing. Test set accuracy was calculated to evaluate the predictive performance of each method. As shown in Table 19, the small sample size resulted in poor outcomes across all five machine-learning methods, with the highest accuracy below 70%. This indicates that when the sample size is small (n < 200), the ability of conventional machine-learning algorithms to generalize is severely limited, making it difficult to attain high-precision predictions in oilfield development.
Table 19.
Test set accuracy (%) of five machine-learning methods for recovery factor prediction in low-permeability sandstone oilfields.
Figure 12, which presents the low-permeability dataset, shows that the training-data points follow similar trends as in the medium-high permeability case, while the test-data points are substantially more dispersed, indicating markedly poorer generalization when the sample size is small.
Figure 12.
Comparison of predicted versus actual recovery values for low-permeability sandstone oilfields computed by five machine-learning methods.
3.2.2. Results of Whole Asset Analogy
SVM, RF, BP neural network, KNN, and DT methods were used to perform whole asset analogy on 663 onshore medium-to-high permeability sandstone oilfields. The dataset comprises five asset levels with approximately equal numbers of oilfields in each level, indicating a balanced multi-class dataset. Twenty oilfields were set aside for testing, and the predicted asset categories were compared with the conventional empirical classification categories. The results are shown in Table 20.
Table 20.
Whole asset analogy results and test set accuracy (%) of five machine-learning methods.
DT achieved an accuracy of 25%, KNN 55%, and BP neural network 55%, indicating unsatisfactory performance. RF reached 70% accuracy, while SVM achieved 95%, both of which meet engineering requirements. SVM’s superior performance reflects its ability, through kernel-based high-dimensional mappings, to capture the complex nonlinear relationships among oilfield parameters even with limited training samples, making it particularly suitable for asset-level classification of medium-to-high permeability sandstone oilfields. With 95% accuracy, SVM can effectively replace traditional empirical classification methods, improving the objectivity and efficiency of oilfield evaluation.
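The whole-asset experiment can be sketched as a five-class classification with 20 held-out fields. The example below is illustrative only: the features and labels are synthetic, the hold-out is stratified by class (an assumption, since the text does not say how the 20 fields were chosen), and the hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 663 fields, 26 indicators, 5 roughly balanced asset levels.
X, y = make_classification(
    n_samples=663, n_features=26, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
# Hold out exactly 20 fields, stratified by asset level (assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, stratify=y, random_state=0
)

classifiers = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0)),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "BP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "DT": DecisionTreeClassifier(random_state=0),
}

accuracy = {
    name: accuracy_score(y_test, clf.fit(X_train, y_train).predict(X_test))
    for name, clf in classifiers.items()
}
# Number of correctly classified fields out of 20 for each method.
n_correct = {name: int(round(acc * len(y_test))) for name, acc in accuracy.items()}
```

With 20 test fields, each field contributes 5 percentage points, so the reported 95%, 70%, 55%, and 25% correspond to 19, 14, 11, and 5 correct classifications.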
4. Conclusions
This study addresses the need for quick evaluation of overseas oilfield development projects facing complex processes, limited data, and scarce experience by developing a systematic analogy indicator system and testing related machine-learning-based analogy methods.
A database of 1436 oilfields was compiled, and 36 original indicators were systematically screened using correlation coefficient analysis, systematic clustering, and principal component analysis. The selected indicators were then classified based on probability distribution curves and expert judgment and weighted with a combined approach that integrates expert scoring with entropy and coefficient of variation methods. The outcome is a set of 26 key analogy indicators covering reservoir properties, trap and structural characteristics, fluid properties, reserve parameters, and development parameters. This indicator system balances static and dynamic features, as well as subjective and objective information, providing a strong foundation for further research using the analogy method.
Five machine-learning algorithms—support vector machine (SVM), random forest (RF), backpropagation neural network (BP), k-nearest neighbors (KNN), and decision tree (DT)—were used for both single-indicator and whole-asset analogy tasks on actual oilfield data. For the 663 onshore medium-to-high permeability sandstone samples, SVM and RF achieved accuracies above 70%, meeting engineering requirements. SVM performed the best, with 86% accuracy in recovery factor prediction and 95% accuracy in whole-asset classification. Conversely, for the 157 low-permeability sandstone samples, all methods scored below 70%, showing that traditional machine-learning algorithms are not practical when the sample size is small.
The developed methodology provides a standardized, programmatic workflow for the quick evaluation of new overseas oilfield development projects. Using the key analogy indicator system along with the SVM-based analogy method enables rapid screening of multiple candidate fields, boosting evaluation efficiency and reducing investment risks caused by inconsistent judgments or limited experience. Future work could involve iterative refinement and reweighting of the indicator system as development technologies progress, as well as incorporating and comparing additional advanced algorithms to further enhance the accuracy and reliability of quick project screening and assessment.
Author Contributions
Conceptualization, M.Z. and B.Z.; methodology, M.Z., Z.L. and C.Y.; software, M.Z. and F.H.; investigation, T.Q., B.W. and L.F.; data curation, Z.L.; writing—original draft preparation, M.Z.; writing—review and editing, Z.L.; supervision, Z.L.; project administration, Z.L.; funding acquisition, Z.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Major Science and Technology Project of China National Petroleum Corporation, “Research on Integrated Technical, Economic, and Commercial Evaluation Technologies for Oil and Gas Exploration Assets in a Low-Carbon Context” (Grant No. 2023ZZ07-05).
Data Availability Statement
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.
Conflicts of Interest
Authors Muzhen Zhang, Zhanxiang Lei, Baoquan Zeng, Fei Huang, Tailai Qu, Bin Wang and Li Fu were employed by the company Research Institute of Petroleum Exploration & Development, PetroChina. Author Chengyun Yan was employed by the company The First Natural Gas Plant of PetroChina Qinghai Oilfield Company. All the authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
- Dong, W.; Jiao, J.; Xie, S.; Lyu, C.; Cui, G.; Meng, J. Cumulative production curve method for the quantitative evaluation on the effect of oilfield development measures: A case study of the nitrogen injection pilot in Yanling oilfield, Bohai Bay Basin. Pet. Explor. Dev. 2016, 43, 672–678.
- Ponomarenko, T.; Marin, E.; Galevskiy, S. Economic Evaluation of Oil and Gas Projects: Justification of Engineering Solutions in the Implementation of Field Development Projects. Energies 2022, 15, 3103.
- Ponomarenko, T.V.; Sergeev, I.B. Valuation of mineral assets of a mining company on the basis of the option approach. J. Min. Inst. 2011, 191, 164–175.
- Mu, L.X.; Fan, Z.F.; Xu, A.Z. Development characteristics, models and strategies for overseas oil and gas fields. Pet. Explor. Dev. 2018, 45, 735–744.
- Li, Z.X.; Liu, J.Y.; Luo, D.K.; Wang, J.J. Study of evaluation method for the overseas oil and gas investment based on risk compensation. Pet. Sci. 2020, 17, 858–871.
- Yusgiantoro, P.; Hsiao, F.S.T. Production-sharing contracts and decision-making in oil production: The case of Indonesia. Energy Econ. 1993, 15, 245–256.
- Sidle, R.E.E.; Lee, W.J.J. An Update on the Use of Reservoir Analogs for the Estimation of Oil and Gas Reserves. SPE Econ. Manag. 2010, 2, 80–85.
- Liu, Z.L.; Geng, M.; Zhang, Y.Z. The Method for Selecting Analogous Reservoirs Based on SEC. J. Phys. Conf. Ser. 2023, 2520, 012008.
- Martín Rodríguez, H.; Escobar, E.; Embid, S.; Rodríguez Morillas, N.; Hegazy, M.; Lake, L.W. New Approach to Identify Analogous Reservoirs. SPE Econ. Manag. 2014, 6, 173–184.
- El-Nikhely, A.; El-Gendy, N.H.; Bakr, A.M.; Zawra, M.S.; Ondrak, R.; Barakat, M.K. Decoding of seismic data for complex stratigraphic traps revealing by seismic attributes analogy in Yidma/Alamein concession area Western Desert, Egypt. J. Petrol. Explor. Prod. Technol. 2022, 12, 3325–3338.
- Liu, X.H.; Hu, T.; Pang, X.Q.; Xu, Z.; Wang, T.; Zhang, X.W.; Wang, E.Z.; Wu, Z.Y. Evaluation of natural gas hydrate resources in the South China Sea using a new genetic analogy method. Pet. Sci. 2022, 19, 48–57.
- Awoleke, O.O.; Lane, R.H. Analysis of data from the Barnett Shale using conventional statistical and virtual intelligence techniques. SPE Res. Eval. Eng. 2011, 14, 544–556.
- Iraji, S.; Soltanmohammadi, R.; Matheus, G.F.; Basso, M.; Vidal, A.C. Application of unsupervised learning and deep learning for rock type prediction and petrophysical characterization using multi-scale data. Geoenergy Sci. Eng. 2023, 230, 212241.
- Wang, Y.; Cheng, S.Q.; Zhang, F.B.; Feng, N.C.; Li, L.; Shen, X.Z.; Li, J.H.; Yu, H. Big data technique in the reservoir parameters’ prediction and productivity evaluation: A field case in western South China Sea. Gondwana Res. 2021, 96, 22–36.
- Werneck, R.d.O.; Prates, R.; Moura, R.; Gonçalves, M.M.; Castro, M.; Soriano-Vargas, A.; Mendes Júnior, P.R.; Hossain, M.M.; Zampieri, M.F.; Ferreira, A.F.; et al. Data-driven deep-learning forecasting for oil production and pressure. J. Pet. Sci. Eng. 2022, 210, 109937.
- Zhou, Q.; Dilmore, R.; Kleit, A.; Wang, J.Y. Evaluating gas production performances in Marcellus using data mining technologies. J. Nat. Gas Sci. Eng. 2014, 20, 109–120.
- Yuan, Z.H.; Qin, W.Z.; Zhao, J.S. Smart Manufacturing for the Oil Refining and Petrochemical Industry. Engineering 2017, 3, 179–182.
- Zhang, M.Z.; Jia, A.L.; Lei, Z.X. Inter-well reservoir parameter prediction based on LSTM-Attention network and sedimentary microfacies. Geoenergy Sci. Eng. 2024, 235, 212723.
- Bai, W.P.; Cheng, S.Q.; Guo, X.Y.; Wang, Y.; Guo, Q.; Tan, C.D. Oilfield analogy and productivity prediction based on machine learning: Field cases in PL oilfield, China. Pet. Sci. 2024, 21, 2554–2570.
- Guo, Q.; Cheng, S.Q.; Zeng, F.H.; Wang, Y.; Lu, C.; Tan, C.D.; Li, G.L. Reservoir permeability prediction based on analogy and machine learning methods: Field cases in DLG Block of Jing’an Oilfield, China. Lithosphere 2022, 2022, 5249460.
- Mahdaviara, M.; Sharifi, M.; Ahmadi, M. Toward evaluation and screening of the enhanced oil recovery scenarios for low permeability reservoirs using statistical and machine learning techniques. Fuel 2022, 325, 124795.
- Rahimi, M.; Riahi, M.A. Reservoir facies classification based on random forest and geostatistics methods in an offshore oilfield. J. Appl. Geophys. 2022, 201, 104640.
- Zhang, M.Z.; Jia, A.L.; Lei, Z.X.; Lei, G. A comprehensive asset evaluation method for oil and gas projects. Processes 2023, 11, 2398.
- Kassem, M.A.; Khoiry, M.A.; Hamzah, N. Using Relative Importance Index Method for Developing Risk Map in Oil and Gas Construction Projects. J. Kejuruter. 2020, 32, 441–453.
- Bi, A.; Huang, S.; Sun, X. Risk Assessment of Oil and Gas Pipeline Based on Vague Set-Weighted Set Pair Analysis Method. Mathematics 2023, 11, 349.
- Ni, S.; Tang, Y.; Wang, G.; Yang, L.; Lei, B.; Zhang, Z. Risk identification and quantitative assessment method of offshore platform equipment. Energy Rep. 2022, 8, 7219–7229.
- Rui, Z.; Lu, J.; Zhang, Z.; Guo, R.; Ling, K.; Zhang, R.; Patil, S. A quantitative oil and gas reservoir evaluation system for development. J. Nat. Gas Sci. Eng. 2017, 42, 31–39.
- Vilela, M.; Oluyemi, G.; Petrovski, A. A fuzzy inference system applied to value of information assessment for oil and gas industry. Decis. Mak. Appl. Manag. Eng. 2019, 2, 1–18.
- Ilyushin, Y.; Nosova, V.; Krauze, A. Application of Systems Analysis Methods to Construct a Virtual Model of the Field. Energies 2025, 18, 1012.
- Li, M.; Qu, Z.; Wang, M.; Ran, W. The Influence of Micro-Heterogeneity on Water Injection Development in Low-Permeability Sandstone Oil Reservoirs. Minerals 2023, 13, 1533.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).