A Machine Learning-Centric Taxonomy and Structured Characterization of Public Datasets for Upstream Oil and Gas
Abstract
1. Introduction
2. Oil and Gas Datasets
2.1. Netherlands F3 Dataset
2.2. Volve Field Dataset
2.3. 3W Dataset
2.4. COSTA Dataset
2.5. KGS Datasets
3. Dataset Comparisons
- Sensor Data Type: This field identifies the physical quantities or channels provided by the dataset, such as seismic amplitudes and attributes, well logs (e.g., GR, RHOB, NPHI, DT, and resistivity), production signals (pressures, flow rates, temperatures, and choke and valve states), or image intensities and segmentation masks. The breadth and diversity of sensor data determine the feasibility of machine learning tasks and inform feature extraction and model design [3,22,24].
- Resolution: This criterion describes the temporal or spatial sampling of the measurements, such as the time step of production signals (e.g., 1 min, 10 min, hourly), the depth increment of well logs (e.g., 0.5 m), or the inline/crossline spacing and sampling rate of seismic volumes. Resolution affects both the level of detail learnable by machine learning algorithms and the computational cost of model training and inference [24,39].
- Volume: This criterion quantifies the overall amount of data available for learning in terms of the numbers of wells, traces, records, or labeled patterns. Examples include the number of labeled seismic sections, depth samples in well logs, time series instances and event segments in production datasets, or labeled rock images and image patches [24,39,45].
- Context: This factor addresses how geographic location influences the geological setting, reservoir type, operational practices, and measurement characteristics [46]. Datasets from different regions reflect distinct depositional environments (e.g., clastic versus carbonate systems), structural styles, petrophysical relationships, and production behavior, which may induce domain shift and affect model generalization. Consequently, models trained in one geographic context may generalize poorly to geologically dissimilar regions without appropriate validation or domain adaptation.
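The effect of resolution described in the list above can be illustrated with a minimal sketch. The rate series, the five-minute shut-in, and the helper name `block_mean` are hypothetical and chosen only for illustration; the point is that averaging a finely sampled signal into coarser bins attenuates a short transient.

```python
def block_mean(signal, factor):
    """Downsample by averaging consecutive, non-overlapping blocks of `factor` samples.

    Trailing samples that do not fill a complete block are dropped.
    """
    usable = len(signal) - len(signal) % factor
    return [sum(signal[i:i + factor]) / factor for i in range(0, usable, factor)]


# Hypothetical production rate sampled every minute for two hours:
# a steady rate of 100 units with a five-minute shut-in at the start of hour two.
minute_rate = [100.0] * 120
for t in range(60, 65):
    minute_rate[t] = 0.0

hourly_rate = block_mean(minute_rate, 60)

print(min(minute_rate))  # the shut-in is fully visible at 1-minute resolution
print(hourly_rate)       # at hourly resolution it appears only as a modest dip
```

At one-minute sampling the rate drops to zero, whereas the hourly average for the same period falls by less than 10%, so a model trained on the coarser series never sees the event itself.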
4. Application of Machine Learning Using Oil and Gas Datasets
- Seismic Interpretation and Facies Classification: Convolutional architectures such as CNNs and U-Net variants, generative models (GANs), and, more recently, self-supervised and contrastive learning frameworks are widely used for seismic facies segmentation, horizon picking, and stratigraphic interpretation [10,11,14,15,19,20,39,40,47]. The Netherlands F3 dataset is the dominant benchmark for these tasks, whereas KGS well logs are primarily used as supplementary inputs for facies or lithology prediction when labels are derived from external interpretations [20,39]. The prominence of the Netherlands F3 dataset does not reflect any geological or geographical exceptionalism of the North Sea setting [16,19,39,40]. Accordingly, performance gains reported on the Netherlands F3 dataset should be interpreted as improvements within a well-defined supervised benchmark, rather than as evidence of broader model superiority across geological settings or labeling conventions—the Netherlands F3 dataset represents a single geological and geographical location [10,16]. Studies consistently report higher segmentation accuracy on the Netherlands F3 dataset than on datasets with sparser labels [10,11,15,20]. In contrast, the KGS dataset, despite its considerable scale (>120,000 wells), remains limited for supervised facies benchmarking owing to the absence of standardized labels [19,43]. Overall, the evidence suggests that, in this domain, label quality is more significant than dataset volume [16,19].
- Well-Log Analysis and Petrophysics: Tree-based models (e.g., random forest and XGBoost), support vector machines (SVMs), and deep learning architectures are widely applied to lithology prediction, missing-log reconstruction, and petrophysical property estimation from wireline curves [19,20,21,48]. The Volve Field dataset supports multi-log studies under realistic field conditions, whereas COSTA provides a controlled setting for carbonate property modeling, and the KGS dataset enables large-scale log interpretation workflows [12,17,43]. Across well-log machine learning studies, performance variability appears to be driven more strongly by log-suite completeness and depth-sampling consistency than by model architecture alone [19,21]. Studies employing the full Volve Field dataset consistently report higher performance than those based on reduced data subsets for the same prediction task, as multi-curve inputs provide complementary lithological information that cannot be fully recovered from a restricted log subset [12,48]. The COSTA dataset, with its fully synthetic and internally consistent label coverage, enables controlled isolation of modeling choices, a degree of control that is generally not achievable with real datasets, in which geological heterogeneity and measurement artifacts act as confounding factors [17]. However, the absence of measurement noise in COSTA means that models achieving very high accuracy on this dataset may overestimate expected real-world performance [16,17]. In KGS-based studies, heterogeneous log availability means that imputation strategies applied prior to model training may introduce systematic biases sufficient to dominate reported accuracy differences between models, particularly when imputation sensitivity analyses are not conducted [19,21,43].
- Reservoir Characterization and History Matching: Physics-informed neural networks (PINNs), surrogate models such as particle swarm optimization–artificial neural network (PSO-ANN) and genetic algorithm–artificial neural network (GA-ANN) hybrids, ensemble regressors, and hybrid evolutionary algorithms have been applied to accelerate history matching and uncertainty quantification [3,4,12,22]. The COSTA and Volve Field datasets are the primary datasets supporting these machine learning implementations [4,17,23]. The synthetic nature of the COSTA dataset produces a consistent and noteworthy outcome: surrogate models trained on COSTA simulation outputs achieve very high $R^2$ values (often >0.95), but these values reflect the smoothness and internal consistency of numerically generated simulations rather than the complexity of real subsurface behavior [16,17,23]. Furthermore, the Volve Field dataset’s limited well count means that history matching studies trained and evaluated on it operate in a severe data-scarcity regime for deep learning architectures, rendering cross-architectural generalization unreliable [3,12]. In addition, sequence models (LSTM, TCN), gradient boosting machines (GBM), and hybrid physics-data-driven architectures are employed for well-level and field-level production rate forecasting [12,48,49]. The Volve Field dataset is the primary real-field benchmark, the COSTA dataset provides synthetic production responses, and the 3W dataset offers multivariate operational signals under disturbance conditions [12,17,24]. Production forecasting outcomes are highly sensitive to the temporal resolution and operational completeness of the training data [12,48]. The Volve Field production rates are available primarily at daily aggregation, which attenuates transient dynamics and renders short-term operational events (e.g., choke adjustments and brief shut-ins) difficult to resolve [12,44].
Forecasting studies that report high accuracy on daily Volve Field data demonstrate the ability to model slowly evolving field-level trends, not the ability to capture rapid operational dynamics [12,49]. The 3W dataset, sampled at 1 Hz, provides the temporal resolution needed to capture transient dynamics [24,25]. An important implication is that no single public dataset simultaneously supports both long-horizon forecasting and high-resolution transient modeling, a gap that should be explicitly acknowledged as a limitation in any forecasting paper that relies exclusively on these benchmarks [3,16].
- Drilling Optimization and Dysfunction Detection: Ensemble methods such as random forest (RF), GBM, and metaheuristic-assisted models (e.g., MOPSO- or Fireworks-optimized predictors) are used to forecast rate of penetration (ROP) and detect drilling dysfunctions [12,13,44,50]. The Volve Field WITSML drilling records, sampled at 1–10 s intervals, constitute the primary publicly available dataset supporting drill-string-related machine learning applications in this study, including ROP forecasting, weight-on-bit optimization, and detection of drill-string dysfunctions such as stick-slip, bit bounce, and lateral vibration events [41,42,44]. The high temporal resolution of the WITSML data, which capture surface and downhole drilling parameters, makes them uniquely suited for modeling the dynamic behavior of the drill string under varying lithological and operational conditions [41,42,44]. The 3W dataset is primarily oriented toward production-system events rather than drilling; nonetheless, its labeled rare-event time series can inform methodologies for event detection under class imbalance [24,25,26,27,28]. Drilling machine learning outcomes are particularly sensitive to temporal resolution. Notably, the Volve Field WITSML records are the only data in this study to approach the resolution required for real-time drilling control applications [12,44]. However, the dataset covers a single field with specific lithological and operational characteristics, and its WITSML records are not uniformly complete across all wells and depth intervals, introducing inconsistencies that differentially affect model training depending on preprocessing choices [12,13].
Studies applying ensemble methods and metaheuristic-assisted predictors to Volve Field WITSML data for ROP forecasting and dysfunction detection have confirmed that preprocessing decisions, including the treatment of missing drilling records and sensor dropouts, materially influence reported prediction accuracy, yet these choices are inconsistently documented across studies [12,13,50]. The taxonomy’s geographic context and volume dimensions directly predict this limitation: a single-field, single-rig drilling dataset cannot serve as a general-purpose benchmark for ROP prediction [12,16], and reinforcement learning and multi-objective optimization frameworks proposed for autonomous drilling parameter selection under real-time constraints [5,13] require broader operational diversity than the Volve Field dataset provides alone. Claimed generalization beyond the Norwegian North Sea clastic setting should be explicitly conditioned on this constraint and validated against independent field data [16,46].
- Well Placement and Geosteering: Machine learning algorithms, including tree-based methods and integrated seismic-facies classifiers, are applied to support trajectory placement and geosteering decisions. The Netherlands F3 dataset provides structural and facies interpretations that can be used for synthetic geosteering experiments, whereas the Volve Field dataset supports development-planning studies under real reservoir conditions [10,12,39,40]. The COSTA dataset provides an open carbonate geomodel for testing placement strategies in controlled settings [17,22]. Geosteering experiments using the Netherlands F3 dataset operate on interpreted stratigraphic intervals rather than direct formation boundaries, and the 10-class facies scheme is a product of a specific reinterpretation campaign rather than a ground-truth geological characterization [39,40]. Models trained to navigate these interpreted intervals may achieve high simulated geosteering performance, but the mapping from predicted facies intervals to actionable drilling targets involves assumptions that are not encoded in the dataset itself [10,11]. The controlled carbonate structure of the COSTA dataset provides a cleaner experimental setting for evaluating placement algorithms [17], but its synthetic origin means that rock-physics relationships and lateral property variability are governed by modeling assumptions rather than conveying real geological information [17,22]. The key implication is that geosteering machine learning studies based on current public datasets are necessarily demonstrations of methodological feasibility rather than validated operational systems [16], and claims about real-field applicability require explicit geological qualification that most published studies fail to provide [10,46].
- Production Optimization and Smart Control: Artificial Neural Networks (ANNs), GBM, reinforcement learning, and hybrid physics-machine learning approaches are applied to optimize lift, choke, and well-network performance. The Volve Field dataset supports optimization and control studies using production and operational variables, whereas the COSTA dataset supports optimization and field-development studies under geological uncertainty using simulation outputs [5,12,17,23]. Reinforcement learning and control-oriented machine learning methods are particularly sensitive to the coverage of the operational feature domain in the training dataset [5,24]. The Volve Field dataset’s partial coverage of choke and valve states limits reinforcement learning-based optimization experiments to a subset of the control space relevant to real field operations [12,44]. The COSTA dataset’s simulation outputs allow full coverage of the operational parameter space within the model’s numerical framework [17,23]; however, this introduces a risk of over-optimization to the simulation’s physical assumptions [5,17]. Optimization results reported for the COSTA dataset should therefore be interpreted in light of its idealized conditions [16,17,23].
- Predictive Maintenance and Equipment Health: Time series models (e.g., Long Short-Term Memory (LSTM)) and ensemble methods (e.g., XGBoost/RF) are used to detect equipment degradation and forecast remaining useful life from multivariate telemetry. In practice, however, predictive-maintenance studies frequently rely on proprietary data or specialized open datasets. Nonetheless, the Volve Field dataset supports methodological development for monitoring and fault detection using available operational and production variables [12]. This application category represents the most acute dataset gap in the current public ecosystem [3,16]. Predictive maintenance requires high-frequency, equipment-specific sensor streams with labeled degradation events or failure records, a data type that the Volve Field dataset provides only partially through its WITSML and operational records [3,6,12,44]. Notably, datasets comprising task-agnostic or unlabeled operational records cannot support rigorous predictive maintenance benchmarking regardless of the sophistication of the model applied [29,30]. Studies applying LSTM or GBM to Volve Field operational data for maintenance-oriented tasks employ records that were not collected or labeled for this purpose, which introduces severe label noise and task misalignment [12,21]. The absence of a dedicated public maintenance benchmark for upstream oil and gas is a structural gap in the dataset ecosystem rather than a limitation of any individual study [3,16]. Published studies should nonetheless acknowledge this gap explicitly rather than implying that Volve Field-based maintenance results are directly comparable to purpose-built industrial benchmarks [21,29].
- Anomaly, Leak, and Undesirable Event Detection: Autoencoders, isolation forest, one-class SVM, and CNN-LSTM variants are used to detect anomalies and faults in production systems. The 3W dataset is a widely used benchmark for rare undesirable well events (e.g., slugging and sensor-related faults) [24,25,26,27,28]. In addition, the Volve Field dataset supports time series anomaly detection studies using operational and production variables [12,49]. Seismic datasets such as the Netherlands F3 dataset can support structural or facies-related anomaly studies within subsurface interpretation tasks [10,39,40]. This is the application category where the linkage between dataset properties and machine learning outcomes is rigorously documented in the existing literature, largely because the 3W dataset was explicitly designed to make these linkages visible [24,25]. The dataset’s built-in class imbalance forces researchers to confront this imbalance directly; studies reporting aggregate accuracy on the 3W dataset without class-stratified metrics present systematically misleading performance claims [26,27,28]. The inclusion of real, simulated, and hand-drawn sequences introduces data-origin heterogeneity that affects model training in ways that are not always controlled. Specifically, models trained on a mixture of real and synthetic instances may overfit to the statistical regularities of the simulation engine rather than to the physical dynamics of the real events [24,25,26]. Furthermore, the expansion of the 3W dataset from 21 wells in v1.0.0 to 42 wells in v2.0.0 introduces non-trivial distributional changes [25]; studies trained on v1.0.0 and evaluated against v2.0.0 baselines are performing an implicit domain adaptation experiment that should be acknowledged explicitly [24,25,28].
- Drilling and Completion/Fracturing Design: ANNs, GBM, and evolutionary algorithms (e.g., PSO-ANN, GA-ANN) are applied to optimize drilling and completion decisions, including ROP prediction and parameter selection. The Volve Field dataset supports drilling optimization and ROP prediction studies [12,44]. Model-based datasets (e.g., COSTA) can be used to generate controlled scenarios for completion or development-plan sensitivity analyses, while seismic benchmarks (e.g., the Netherlands F3 dataset) can support synthetic trajectory-design experiments [16,17,39,40]. Completion design machine learning studies based on the current public dataset ecosystem face a fundamental feasibility constraint: no public dataset provides the well-completion metadata, hydraulic fracturing records, or post-completion production attribution data needed to train and validate completion optimization models in a statistically rigorous manner [3,16]. The Volve Field dataset’s completion data are available at the well-design level but do not include the fracture propagation measurements or microseismic records needed for stimulation machine learning [12,44]. This represents a direct outcome of the Feature Domains–Geomechanical gap identified in the taxonomy mapping: with no public dataset providing geomechanical attributes alongside completion records, the reported machine learning contributions in this area are necessarily limited to ROP and parameter sensitivity analyses rather than full completion optimization [13,50]. Evolutionary and surrogate-based methods applied to the Volve Field and COSTA datasets for drilling parameter optimization [13,50] are therefore best interpreted as proof-of-concept demonstrations conditioned on this data availability constraint [3,5,8,12,16,17].
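The caution raised in the anomaly detection item above, that aggregate accuracy without class-stratified metrics can mislead on imbalanced event data, can be made concrete with a minimal sketch. The labels, class proportions, and function names below are hypothetical and are not taken from the 3W dataset or from any cited study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall for every label that occurs in y_true, as {label: recall}."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical imbalanced evaluation set: 95 normal windows and 5 event windows.
y_true = ["normal"] * 95 + ["event"] * 5
# A trivial classifier that never flags an event.
y_pred = ["normal"] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(per_class_recall(y_true, y_pred))   # {'normal': 1.0, 'event': 0.0}
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The trivial classifier scores 95% aggregate accuracy while detecting none of the rare events, which is the failure mode that per-class recall and balanced accuracy expose.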
5. Dataset Taxonomy
5.1. Data Type
- Well-centric: Wellbore measurements include (i) wireline logs (e.g., gamma ray, resistivity, density, neutron porosity, and sonic logs), (ii) mud logs (e.g., gas readings and cuttings descriptions), (iii) core and image-derived data (e.g., computed tomography (CT)/X-ray transmission imaging, where available), and (iv) interpreted petrophysical properties (e.g., volume of shale ($V_{sh}$), water saturation ($S_w$), and porosity) [19,20,43].
- Dynamic: Time-varying operational data include production/injection rates (oil/gas/water), pressures (e.g., bottomhole pressure (BHP), flowing wellhead pressure, and tubing pressure), temperatures, and control variables, such as choke positions and valve states. In some settings, these data are accompanied by supervisory control and data acquisition (SCADA) tags or event annotations (e.g., kick events or stuck-pipe indicators) [25,26,27].
- Structural: Structural datasets describe subsurface architecture using interpreted horizons and faults, often represented in two-way travel time (TWT) and converted to true vertical depth (TVD) using velocity models. These interpretations may be calibrated to well formation tops and subsequently integrated into 3D geocellular models, which provide gridded reservoir frameworks for simulation and machine learning tasks.
- Multi-modal: Datasets in this category integrate heterogeneous measurements across different spatial scales and physical principles, subject to two explicit qualifying conditions. First, the dataset must simultaneously provide at least two physically distinct measurement types that are acquired through different sensing principles or represent different physical domains, for example, seismic acoustic impedance contrasts alongside wireline-derived petrophysical curves, or production pressure time series alongside structural horizon interpretations. Second, the measurements must exhibit meaningful spatiotemporal correspondence: they must be co-registered, co-located, or temporally aligned such that joint learning across modalities is physically meaningful rather than incidental. A dataset that contains multiple data types but lacks this spatiotemporal linkage (for example, well logs from one field combined with seismic data from an unrelated survey) does not qualify as multi-modal under this taxonomy. Applying these criteria to the five datasets examined, only the Volve Field dataset satisfies both conditions, integrating geophysical surveys, wireline logs, dynamic production telemetry, and structural interpretations within a single internally consistent field dataset. This integration reduces single-modality bias and enables the capture of complementary information, potentially improving characterization performance in complex subsurface settings [10,11,15].
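The two qualifying conditions for the multi-modal category can be expressed as a small check. This is a simplified sketch, not a procedure given in the text: the `Measurement` record, the `frame_id` field (standing in for co-registration, co-location, or temporal alignment), and the example entries are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    name: str
    sensing_principle: str  # e.g. "seismic reflection", "wireline logging"
    frame_id: str           # hypothetical key for a shared survey, field, or time reference


def is_multimodal(measurements):
    """True if at least two distinct sensing principles share one reference frame.

    Condition 1 (distinct measurement types) and condition 2 (spatiotemporal
    correspondence) are both required, so principles are counted per frame.
    """
    principles_by_frame = {}
    for m in measurements:
        principles_by_frame.setdefault(m.frame_id, set()).add(m.sensing_principle)
    return any(len(p) >= 2 for p in principles_by_frame.values())


# Seismic and logs acquired over the same (hypothetical) field.
integrated = [
    Measurement("seismic cube", "seismic reflection", "field_A"),
    Measurement("density log", "wireline logging", "field_A"),
]
# Logs from one field combined with seismic from an unrelated survey.
unlinked = [
    Measurement("seismic cube", "seismic reflection", "survey_B"),
    Measurement("density log", "wireline logging", "field_A"),
]

print(is_multimodal(integrated))  # True
print(is_multimodal(unlinked))    # False
```

Counting sensing principles per reference frame, rather than across the whole collection, is what rejects the second case: it contains two data types but no linkage between them.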
5.2. Data Characterization
- Resolution: Resolution refers to the temporal and/or spatial granularity of the measurements, including the sampling interval in time (e.g., milliseconds for seismic traces or seconds/minutes for time series telemetry), the depth sampling step for well logs (e.g., sub-meter increments), and the spatial sampling or bin size for gridded data (e.g., seismic inline/crossline spacing or bin size in meters) [25,40,43]. Higher resolution improves the detectability of fine-scale features (e.g., thin beds or short transients) but may increase noise sensitivity and computational cost.
- Volume: Dataset volume reflects the scale or amount of usable data available for learning, such as areal coverage and trace counts for seismic volumes, the number of wells and depth samples for log archives, and the number of instances, channels, and sequence lengths for multivariate time series [25,40,43]. Large-scale datasets can enable more robust model training and generalization, but they typically require substantial storage, curation, and preprocessing. Notably, dataset size is often reported using raw indicators such as the number of wells, files, seismic traces, logs, or time series instances; these counts do not necessarily represent the amount of statistically independent information available for machine learning. Highly correlated measurements, repeated time windows, spatially adjacent seismic traces, or redundant well-log intervals may reduce the effective information content of a dataset. Therefore, the revised taxonomy includes effective information content as a complementary criterion that considers effective dimensionality, redundancy, correlation structure, intrinsic data complexity, and the diversity of independent geological, operational, or physical conditions represented in the data.
- Fidelity: This sub-category describes measurement trustworthiness across four quantifiable dimensions that should be explicitly reported when characterizing upstream datasets for machine learning: (i) missing value proportion (MVP)—the fraction of missing or null entries per channel or variable, reported as a percentage, where proportions below 5% indicate high fidelity, proportions between 5% and 20% indicate moderate fidelity requiring imputation, and proportions exceeding 20% indicate low fidelity that may introduce systematic bias and should be explicitly flagged in any study using that variable; (ii) noise level—quantified as the signal-to-noise ratio (SNR) in decibels for continuous sensor streams, or as the coefficient of variation (CV) for depth-indexed log measurements, with known acquisition artifacts such as cycle skipping in sonic logs, mud filtrate invasion effects in resistivity measurements, or high-frequency drilling vibration noise in WITSML records documented qualitatively where SNR or CV cannot be computed; (iii) sensor dropout and frozen signal rate (SDFSR)—the proportion of time steps or depth samples affected by sensor dropout, frozen readings, or physically implausible constant values reported per channel, a metric particularly relevant for high-frequency SCADA and WITSML streams such as those in the Volve Field and 3W datasets, where frozen signals are a known artifact [24,25]; and (iv) labeling consistency—the inter-annotator agreement or proportion of samples with conflicting, ambiguous, or partially labeled ground truth where multiple annotation sources exist, with the reinterpretation methodology and its known limitations serving as a proxy for single-annotation-source datasets such as the Netherlands F3 and 3W datasets [24,39]. 
Applying these criteria to the five datasets examined, the Netherlands F3 dataset exhibits high fidelity for seismic image data but limited log fidelity owing to the availability of only four wells; the Volve Field dataset exhibits moderate fidelity overall due to sensor dropouts and aggregated production records; the 3W dataset explicitly documents frozen signals and missing variables as realistic artifacts; the COSTA dataset exhibits maximum fidelity by design as a noise-free synthetic benchmark; and the KGS archive exhibits variable fidelity across wells and logging vintages, with MVP values that vary substantially depending on the log suite and well vintage selected. Lower fidelity across any of these dimensions typically necessitates denoising, imputation, and quality-control preprocessing to avoid introducing systematic learning bias [24,38,39].
- Imbalance and Rare Events: Many upstream datasets exhibit strong class imbalance, including minority lithofacies classes in well logs, thin stratigraphic units in seismic interpretation, and rare abnormal events in production systems [24,26]. Such imbalance motivates strategies such as resampling, cost-sensitive learning, data augmentation, and anomaly detection formulations.
- Label Density: Label density describes how frequently ground truth is available relative to the raw measurements, for example, per depth sample (well logs), per time step or segment (time series), or per pixel/voxel (seismic images/volumes). Public datasets may provide dense labels derived from interpretation (e.g., horizon-bounded facies intervals) or sparse interval annotations, thereby affecting the suitability of supervised versus semi-supervised and self-supervised learning [11,15,24].
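Two of the fidelity dimensions defined above, the missing value proportion (MVP) with its 5% and 20% bands and the sensor dropout and frozen signal rate (SDFSR), lend themselves to a short sketch. The function names, the example channel, and the `min_run` parameter (the run length at which repeated readings count as frozen) are assumptions made for illustration; the text does not specify a run-length threshold.

```python
import math


def missing_value_proportion(values):
    """Fraction of entries that are None or NaN."""
    if not values:
        raise ValueError("channel is empty")
    missing = sum(
        1 for v in values
        if v is None or (isinstance(v, float) and math.isnan(v))
    )
    return missing / len(values)


def fidelity_band(mvp):
    """Map an MVP fraction to the bands in the text: <5% high, 5-20% moderate, >20% low."""
    if mvp < 0.05:
        return "high"
    if mvp <= 0.20:
        return "moderate"
    return "low"


def frozen_signal_rate(values, min_run=3):
    """Fraction of samples inside runs of at least `min_run` identical consecutive readings.

    `min_run` is an illustrative assumption. Missing entries (None) are not
    counted as frozen; they are covered by the MVP instead.
    """
    if not values:
        raise ValueError("channel is empty")
    n, frozen, i = len(values), 0, 0
    while i < n:
        j = i
        while j + 1 < n and values[j + 1] == values[i]:
            j += 1
        run = j - i + 1
        if values[i] is not None and run >= min_run:
            frozen += run
        i = j + 1
    return frozen / n


# Hypothetical pressure channel with one missing sample and a four-sample frozen run.
channel = [10.1, 10.3, None, 10.2, 10.2, 10.2, 10.2, 10.4, 10.6, 10.5]

mvp = missing_value_proportion(channel)
print(mvp, fidelity_band(mvp))      # 0.1 moderate
print(frozen_signal_rate(channel))  # 0.4
```

Both quantities are reported per channel, as the text recommends, so a dataset-level summary would apply these functions to each variable separately.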
5.3. Feature Domains
- Petrophysical: Petrophysical variables describe rock and fluid properties, including porosity ($\phi$), permeability ($k$), water saturation ($S_w$), capillary pressure ($P_c$), and discrete facies labels derived from logs, cores, or seismic interpretation products [20,21]. These attributes govern storage capacity and flow behavior and are commonly measured or inferred from well logs and core data, or provided as model outputs in synthetic benchmarks (e.g., the COSTA carbonate reservoir model) [11,15,17].
- Geomechanical: Geomechanical attributes influence wellbore stability, compaction, and fracture initiation/propagation and are therefore relevant to drilling-risk assessment and stimulation design. They quantify how the subsurface deforms under stress and may include in situ stresses, strain, elastic moduli, and derived brittleness indices, often estimated from well-log- and petrophysics-based proxies in practical workflows [22].
- Operational: Operational variables capture how the field is controlled over time and strongly influence observed production rates and pressures. Such variables are essential for distinguishing subsurface-driven behavior from operational interventions in forecasting, anomaly detection, and optimization workflows [5]. Examples include choke settings, valve states, artificial-lift modes (e.g., ESP on/off or gas-lift rate), and downtime/shut-in indicators. In public datasets, these signals are most directly represented in production-oriented time series benchmarks (e.g., the 3W dataset) and integrated field releases (e.g., the Volve Field dataset) through operational channels such as choke/valve states and related control parameters [24,26].
5.4. Machine Learning
- Task Type: This attribute specifies the core learning objective and its physical meaning. For example, inversion refers to estimating subsurface properties (e.g., porosity or permeability) from indirect measurements, whereas interpretation covers tasks such as facies classification and segmentation and horizon or fault picking. Additionally, forecasting targets the prediction of production rates, pressures, or other operational variables. Beyond these, broader reservoir data analytics (RDA) tasks include proxy modeling, uncertainty quantification, and optimization [25,39,48]. Clearly defining the task type helps ensure that inputs, labels, and metrics are aligned with a coherent physical question rather than conflating heterogeneous objectives within a single benchmark [10].
- Learning Paradigm: The learning paradigm describes how models use labels and domain knowledge, encompassing supervised learning on expert-labeled facies or well events, as well as self-supervised learning that exploits large volumes of unlabeled seismic or log data via pretext or contrastive objectives [11,14,15]. It also includes physics-informed or hybrid frameworks in which physical constraints (e.g., flow or reservoir equations) guide training, as demonstrated in the Volve Field-based production modeling [12]. Selecting an appropriate paradigm ensures that benchmarks reflect realistic label availability and incorporate domain structure, particularly where comprehensive labeling is costly or uncertain [5].
- Ground Truth: Ground truth describes how target labels are produced, including expert interpretations for facies and structural features (e.g., Netherlands F3 seismic horizons and facies), synthetic labels generated from numerical models (e.g., COSTA carbonate simulations), and laboratory measurements such as core-plug porosity ($\phi$), permeability ($k$), and capillary pressure ($P_c$) used to calibrate or validate predictions derived from logs or seismic attributes [17,22,43]. Each source involves trade-offs: synthetic labels enable controlled experimentation and complete target coverage but may omit real-world complexity, operational disturbances, measurement noise, and sensor imperfections. In contrast, expert interpretations, field measurements, and laboratory measurements provide higher physical realism but can introduce domain-dependent bias, missing values, inconsistent sampling, and measurement uncertainty that should be accounted for when designing robust benchmarks [19].
- Benchmark Maturity: Benchmark maturity describes the extent to which a dataset is standardized and ready for reproducible evaluation. Indicators include the availability of standard train, validation, and test splits, published baselines and reference results, and clearly specified metrics and protocols (e.g., well-documented seismic facies benchmarks for the Netherlands F3 dataset and event detection protocols for the 3W dataset) [25,38,39]. Mature benchmarks provide clear procedures and reusable pipelines that support fair comparison and cumulative progress. In contrast, emerging datasets may lack agreed-upon tasks, splits, or metrics, requiring additional community effort to establish consistent evaluation standards [38].
- PIML/SciML Readiness: This attribute evaluates whether a dataset contains the information required to support physics-informed machine learning (PIML), scientific machine learning (SciML), or hybrid physics-data-driven workflows. A dataset is considered suitable for physics-constrained learning only when it provides, or can be reliably linked to, physically meaningful constraints such as governing equations, simulation outputs, boundary or initial conditions, conservation relationships, or temporally consistent measurements of coupled physical variables [3,12,16,17]. Four criteria are assessed. Governing equations are considered available when a dataset includes, or can be explicitly linked to, flow equations, reservoir equations, material balance, pressure–rate relationships, or conservation laws. Simulation outputs are considered available when a dataset is generated from numerical models or contains simulator outputs, such as pressure, saturation, permeability, porosity, production response, or scenario-based reservoir states. Boundary and initial conditions are considered available when a dataset provides information about initial pressure, saturation, grid conditions, well controls, injection/production constraints, or boundary assumptions. Finally, temporally consistent multiphysics measurements are considered available when a dataset comprises time-aligned physical variables, such as pressure, flow rate, temperature, choke setting, valve state, water cut, gas rate, and operational events. Based on these criteria, physics-constrained learning suitability is rated as high, partial, limited, or low, reflecting the dataset’s capacity to support physics-informed losses, residual constraints, hybrid surrogate models, or scientific machine learning workflows.
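The four readiness criteria and the four-level rating can be made concrete with a small scoring sketch. The text does not state how the criteria combine into a rating, so the counting rule below (three or more criteria give "high", two "partial", one "limited", none "low") is purely an assumption for illustration, as are the names `PhysicsEvidence` and `readiness_rating` and the example dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsEvidence:
    """Which of the four readiness criteria a dataset satisfies."""
    governing_equations: bool
    simulation_outputs: bool
    boundary_initial_conditions: bool
    time_aligned_multiphysics: bool


def readiness_rating(evidence: PhysicsEvidence) -> str:
    """Map satisfied criteria to a rating.

    ASSUMPTION: the thresholds are illustrative only; the article
    assigns its ratings qualitatively and gives no numeric rule.
    """
    satisfied = sum([
        evidence.governing_equations,
        evidence.simulation_outputs,
        evidence.boundary_initial_conditions,
        evidence.time_aligned_multiphysics,
    ])
    if satisfied >= 3:
        return "high"
    if satisfied == 2:
        return "partial"
    if satisfied == 1:
        return "limited"
    return "low"


# Hypothetical dataset: simulator outputs plus documented well controls,
# but no explicit equations and no time-aligned field measurements.
example = PhysicsEvidence(
    governing_equations=False,
    simulation_outputs=True,
    boundary_initial_conditions=True,
    time_aligned_multiphysics=False,
)
print(readiness_rating(example))  # partial
```

A weighted rule, or one that treats governing equations as mandatory for a "high" rating, would be an equally defensible choice; the point is only that recording the four criteria explicitly makes a rating auditable.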
5.5. Context
- Asset Lifecycle: Indicates the upstream stage from which the data originate, such as exploration, appraisal, or brownfield (mature-field) operations. This distinction helps align dataset characteristics with typical use cases, for example, structural mapping and prospect screening in exploration versus production monitoring and optimization in mature assets.
- Source Type: Specifies whether the dataset is derived from field measurements (e.g., well logs, drilling data, production telemetry), from fully synthetic modeling workflows (e.g., reservoir model benchmarks), or from hybrid sources that combine simulated and real measurements. Source type influences realism, label availability, and the extent to which learned patterns are expected to generalize to operational settings.
- Utility: Describes the dataset’s intended use, such as a benchmarking resource with defined tasks and labels, a source for transfer learning and domain adaptation studies, or an open resource intended to promote reproducibility and accessibility in upstream machine learning research.
- Geographic Context: Captures the geological and geographic setting (e.g., basin, reservoir lithology such as clastic or carbonate, and tectonic regime). This context is critical for interpreting learned patterns, assessing domain shift, and designing cross-basin generalization and adaptation experiments.
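The four context attributes lend themselves to a simple metadata record. The sketch below is one possible encoding, not a schema defined by the article: the class name `DatasetContext` and the helper `covering` are hypothetical, while the field values are copied from the context rows of the article's taxonomy comparison table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetContext:
    """Context attributes of one dataset (hypothetical record layout)."""
    name: str
    asset_lifecycle: tuple[str, ...]
    source_type: str
    utility: tuple[str, ...]
    geographic_context: str


# Values transcribed from the context rows of the comparison table.
CONTEXTS = [
    DatasetContext("Netherlands F3", ("Exploration", "Development"), "Real",
                   ("Benchmarking", "Transfer learning"),
                   "Offshore Netherlands, North Sea"),
    DatasetContext("Volve Field", ("Development", "Production"), "Real",
                   ("Benchmarking", "Transfer learning", "Open-source"),
                   "Norwegian North Sea"),
    DatasetContext("3W", ("Production",), "Real and synthetic",
                   ("Benchmarking", "Open-source"),
                   "Offshore Brazil"),
    DatasetContext("COSTA", ("Exploration", "Development"), "Synthetic",
                   ("Benchmarking", "Method development"),
                   "Carbonate reservoir (simulated Middle East)"),
    DatasetContext("KGS", ("Exploration", "Development"), "Real",
                   ("Open-source", "Transfer learning"),
                   "Kansas, USA"),
]


def covering(stage: str) -> list[str]:
    """Names of datasets whose asset lifecycle includes the given stage."""
    return [c.name for c in CONTEXTS if stage in c.asset_lifecycle]


print(covering("Production"))  # ['Volve Field', '3W']
```

Keeping lifecycle and utility as tuples rather than free text makes it straightforward to filter candidate datasets before designing a domain-shift or transfer-learning experiment.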
5.6. Application Scope
- Application Type: Specifies the domain-specific task(s) for which a dataset is suitable, such as seismic interpretation, well-log analysis, production forecasting, anomaly detection, or carbon capture and storage (CCS) monitoring. This classification promotes alignment between dataset content (inputs and labels) and model design.
- Value Chain: Identifies the stage of the upstream value chain targeted by the application, including exploration, development, and production. This classification contextualizes how datasets support operational objectives and decision-making workflows across the asset lifecycle.
6. Conclusions and Future Work
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Tariq, Z.; Aljawad, M.S.; Hasan, A.; Murtaza, M.; Mohammed, E.; El-Husseiny, A.; Alarifi, S.A.; Mahmoud, M.; Abdulraheem, A. A Systematic Review of Data Science and Machine Learning Applications to the Oil and Gas Industry. J. Pet. Explor. Prod. Technol. 2021, 11, 4339–4374.
- Chen, F.; Sun, L.; Jiang, B.; Huo, X.; Pan, X.; Feng, C.; Zhang, Z. A Review of AI Applications in Unconventional Oil and Gas Exploration and Development. Energies 2025, 18, 391.
- Azmi, R.P.A.; Yusoff, M.; Mohd Sallehud-din, M.T. A Review of Predictive Analytics Models in the Oil and Gas Industries. Sensors 2024, 24, 4013.
- Desai, J.N.; Pandian, S.; Vij, R.K. Big Data Analytics in Upstream Oil and Gas Industries for Sustainable Exploration and Development: A Review. Environ. Technol. Innov. 2021, 21, 101186.
- Waqar, A.; Othman, I.; Shafiq, N.; Mansoor, M.S. Applications of AI in oil and gas projects towards sustainable development: A systematic literature review. Artif. Intell. Rev. 2023, 56, 12771–12798.
- Salem, A.M.; Yakoot, M.S.; Mahmoud, O. Addressing Diverse Petroleum Industry Problems Using Machine Learning Techniques: Literary Methodology–Spotlight on Predicting Well Integrity Failures. ACS Omega 2022, 7, 2504–2519.
- Benayoune, A. Factors influencing industry 4.0 implementation in oil and gas sector: Empirical study from a developing economy. Acad. Strateg. Manag. J. 2022, 21, 1–18. Available online: https://www.abacademies.org/articles/factors-influencing-industry-40-implementation-in-oil-and-gas-sector-empirical-study-from-a-developing-economy.pdf (accessed on 4 February 2026).
- Lu, H.; Guo, L.; Azimi, M.; Huang, K. Oil and Gas 4.0 era: A systematic review and outlook. Comput. Ind. 2019, 111, 68–90.
- Wang, T.; Wei, Q.; Xiong, W.; Wang, Q.; Fang, J.; Wang, X.; Liu, G.; Jin, C.; Wang, J. Current Status and Prospects of Artificial Intelligence Technology Application in Oil and Gas Field Development. ACS Omega 2024, 9, 3173–3183.
- Lin, L.; Zhong, Z.; Li, C.; Gorman, A.; Wei, H.; Kuang, Y.; Wen, S.; Cai, Z.; Hao, F. Machine learning for subsurface geological feature identification from seismic data: Methods, datasets, challenges, and opportunities. Earth-Sci. Rev. 2024, 257, 104887.
- Liu, X.; Li, B.; Li, J.; Chen, X.; Li, Q.; Chen, Y. Semi-supervised deep autoencoder for seismic facies classification. Geophys. Prospect. 2021, 69, 1295–1315.
- Nikitin, N.O.; Revin, I.; Hvatov, A.; Vychuzhanin, P.; Kalyuzhnaya, A.V. Hybrid and automated machine learning approaches for oil fields development: The case study of Volve field, North Sea. Comput. Geosci. 2022, 161, 105061.
- Abd-Elwahed, M.S. Multi-Objective Optimization of Drilling GFRP Composites Using ANN Enhanced by Particle Swarm Algorithm. Processes 2023, 11, 2418.
- Li, M.; Yan, X.; Wu, Q. A self-supervised deep learning framework for seismic facies segmentation. Expert Syst. Appl. 2025, 288, 128290.
- Li, K.; Liu, W.; Dou, Y.; Xu, Z.; Duan, H.; Jing, R. CONSS: Contrastive Learning Method for Semisupervised Seismic Facies Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 7838–7849.
- Dramsch, J.S. Chapter One—70 years of machine learning in geoscience in review. In Machine Learning and Artificial Intelligence in Geosciences; Moseley, B., Krischer, L., Eds.; Advances in Geophysics; Elsevier: Amsterdam, The Netherlands, 2020; Volume 61, pp. 1–55.
- Costa Gomes, J.; Geiger, S.; Arnold, D. The design of an open-source carbonate reservoir model. Pet. Geosci. 2022, 28, petgeo2021-067.
- Al-Fakih, A.; Koeshidayatullah, A.; Mukerji, T.; Al-Azani, S.; Kaka, S.I. Well-log data generation and imputation using sequence-based generative adversarial networks. Sci. Rep. 2025, 15, 11000.
- Ribeiro Mendes, P.; Salavati, S.; Linares, O.; Moreira Gonçalves, M.; Ferreira Zampieri, M.; de Sousa Ferreira, V.H.; Castro, M.; de Oliveira Werneck, R.; Moura, R.; Morais, E.; et al. Rock-type classification: A (critical) machine-learning perspective. Comput. Geosci. 2024, 193, 105730.
- Hall, B. Facies classification using machine learning. Lead. Edge 2016, 35, 906–909.
- Jiang, S.; Sun, P.; Lyu, F.; Zhu, S.; Zhou, R.; Li, B.; He, T.; Lin, Y.; Gao, Y.; Song, W.; et al. Machine learning (ML) for fluvial lithofacies identification from well logs: A hybrid classification model integrating lithofacies characteristics, logging data distributions, and ML models applicability. Geoenergy Sci. Eng. 2024, 233, 212587.
- Balaguera, A.; Torné, M.; Carbonell, R.; Martí, A.; Vergés, J.; Jurado, M.J.; Sánchez-Pastor, P.; Farci, A.; Davoise, D.; Rodríguez, S. Machine learning in subsurface physical properties and lithofacies prediction in a mining context. Sci. Rep. 2025, 15, 26495.
- Arinze, C.A.; Jacks, B.S. A comprehensive review on AI-driven optimization techniques enhancing sustainability in oil and gas production processes. Eng. Sci. Technol. J. 2024, 5, 962–973.
- Vargas, R.E.V.; Munaro, C.J.; Ciarelli, P.M.; Medeiros, A.G.; do Amaral, B.G.; Barrionuevo, D.C.; de Araújo, J.C.D.; Ribeiro, J.L.; Magalhães, L.P. A realistic and public dataset with rare undesirable real events in oil wells. J. Pet. Sci. Eng. 2019, 181, 106223.
- Vargas, R.E.V.; de Melo Junior, A.J.; Munaro, C.J.; de Campos Lima, C.B.; de Lima Junior, E.T.; Barrocas, F.M.; Varejão, F.M.; Peixer, G.F.; Oliveira, I.M.N.; Barbosa, J.R., Jr.; et al. 3W Dataset 2.0.0: A realistic and public dataset with rare undesirable real events in oil wells. arXiv 2025, arXiv:2507.01048.
- Oliveira, I.M.N.; Aranha, P.E.; Vieira, T.M.A.; da Silva, A.C.A.; Ramos, D.L.; de Lima Junior, E.T. Advancing Anomaly Detection in Oil Production Wells with TranAD: A Deep Transformer Network Approach. In Proceedings of the XLV Ibero-Latin American Congress on Computational Methods in Engineering (CILAMCE 2024), Maceió, Brazil, 11–14 November 2024.
- Turan, E.M.; Jäschke, J. Classification of undesirable events in oil well operation. In Proceedings of the 2021 23rd International Conference on Process Control (PC), Štrbské Pleso, Slovakia, 1–4 June 2021; pp. 157–162.
- Brønstad, C.; Netto, S.L.; Ramos, A.L.L. Data-driven Detection and Identification of Undesirable Events in Subsea Oil Wells. In Proceedings of the SENSORDEVICES 2021: The Twelfth International Conference on Sensor Device Technologies and Applications, Athens, Greece, 14–18 November 2021; pp. 1–6. Available online: https://personales.upv.es/thinkmind/SENSORDEVICES/SENSORDEVICES_2021/sensordevices_2021_1_10_28039.html (accessed on 4 February 2026).
- Priyanka, E.B.; Thangavel, S.; Gao, X.Z.; Sivakumar, N.S. Digital twin for oil pipeline risk estimation using prognostic and machine learning techniques. J. Ind. Inf. Integr. 2022, 26, 100272.
- Wanasinghe, T.R.; Wroblewski, L.; Petersen, B.K.; Gosine, R.G.; James, L.A.; de Silva, O.; Mann, G.K.I.; Warrian, P.J. Digital Twin for the Oil and Gas Industry: Overview, Research Trends, Opportunities, and Challenges. IEEE Access 2020, 8, 104175–104197.
- Jia, Z.; Wang, J.; Deng, C. IIoT-based Predictive Maintenance for Oil and Gas Industry. In Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering (EITCE 2022), Virtual, China, 21–23 October 2022; ACM: New York, NY, USA, 2022; pp. 432–436.
- Zhang, T.; Gao, L.; He, C.; Zhang, M.; Krishnamachari, B.; Avestimehr, A.S. Federated Learning for the Internet of Things: Applications, Challenges, and Opportunities. IEEE Internet Things Mag. 2022, 5, 24–29.
- Baqer, M. Energy-Efficient Federated Learning for Internet of Things: Leveraging In-Network Processing and Hierarchical Clustering. Future Internet 2025, 17, 4.
- Baqer, M. Lightweight Federated Learning Approach for Resource-Constrained Internet of Things. Sensors 2025, 25, 5633.
- Verma, P.K.; Verma, R.; Prakash, A.; Agrawal, A.; Naik, K.; Tripathi, R.; Alsabaan, M.; Khalifa, T.; Abdelkader, T.; Abogharaf, A. Machine-to-Machine (M2M) communications: A survey. J. Netw. Comput. Appl. 2016, 66, 83–105.
- Baqer, M.; Kamal, A. S-Sensors: Integrating Physical World Inputs with Social Networks Using Wireless Sensor Networks. In Proceedings of the 2009 Fifth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP 2009), Melbourne, Australia, 7–10 December 2009; IEEE: New York, NY, USA, 2009; pp. 213–218.
- Baqer, M. Enabling Collaboration and Coordination of Wireless Sensor Networks via Social Networks. In Proceedings of the 2010 6th IEEE International Conference on Distributed Computing in Sensor Systems Workshops (DCOSSW), Santa Barbara, CA, USA, 21–23 June 2010; IEEE: New York, NY, USA, 2010; pp. 1–2.
- McDonald, A. Public Datasets for Machine Learning in Geoscience. Medium (TDS Archive), 2022. Available online: https://medium.com/data-science/public-datasets-for-machine-learning-in-geoscience-cf880862300a (accessed on 3 February 2026).
- Alaudah, Y.; Michałowicz, P.; Alfarraj, M.; AlRegib, G. A machine-learning benchmark for facies classification. Interpretation 2019, 7, SE175–SE187.
- Baroni, L.; Silva, R.M.; Ferreira, R.S.; Chevitarese, D.; Szwarcman, D.; Vital Brazil, E. Netherlands F3 Interpretation Dataset, Version 2.0.0; Zenodo: Geneva, Switzerland, 2018.
- Equinor ASA. Disclosing all Volve Data. Equinor News Archive. 2018. Available online: https://www.equinor.com/news/archive/14jun2018-disclosing-volve-data (accessed on 4 February 2026).
- Energistics Consortium. Equinor’s Volve Field Test Data. Energistics Consortium, n.d. Available online: https://energistics.org/equinors-volve-field-test-data (accessed on 10 January 2026).
- Kansas Geological Survey. Oil and Gas Data Bases: Digital Well Logs and Oil & Gas Well Data; Data Resources Library, University of Kansas: Lawrence, KS, USA, 2006; Available online: https://www.kgs.ku.edu/PRS/petroDB.html (accessed on 10 January 2026).
- Ng, C.S.W.; Jahanbani Ghahfarokhi, A.; Nait Amar, M. Well production forecast in Volve field: Application of rigorous machine learning techniques and metaheuristic algorithm. J. Pet. Sci. Eng. 2022, 208, 109468.
- Zhang, Y.; Wu, X.; You, J. A benchmark dataset and baseline methods for rock microstructure interpretation in SEM images. Sci. Data 2025, 12, 1671.
- Wang, Y.; Lian, J.; Li, C. A dataset of natural gas and liquid level for oil field production prediction in China. Sci. Data 2025, 12, 1071.
- Lemos, J.B.; Santos, L.d.S.O.; Cerqueira, A.G. Seismic Facies Segmentation Using Convolutional Neural Networks. In Proceedings of the XVII Congresso Brasileiro de Inteligência Computacional (CBIC 2025), Horizonte, Brazil, 27–30 October 2025; pp. 1–6.
- Samad, A.; Khan, I.M.; Rahaman, M.S.; Sakib, A.; Islam, M.A. Data-Driven Approach to Predict Future Oil Production of an Oil Field Using Machine Learning Techniques. In Proceedings of 8th International Conference on Mechanical, Industrial and Energy Engineering; Springer: Cham, Switzerland, 2025; Volume 3, pp. 68–73.
- López, R. Forecast Oil Production Using Machine Learning. Neural Designer. 2023. Available online: https://www.neuraldesigner.com/blog/volve-oil-forecasting (accessed on 4 February 2026).
- Yang, L.; Lu, Z.; Ren, W.; Liu, T. Improving the Drilling Parameter Optimization Method Based on the Fireworks Algorithm. ACS Omega 2022, 7, 38074–38083.
- Ramachandran, N.; Irvin, J.; Omara, M.; Gautam, R.; Meisenhelder, K.; Rostami, E.; Sheng, H.; Ng, A.Y.; Jackson, R.B. Deep learning for detecting and characterizing oil and gas well pads in satellite imagery. Nat. Commun. 2024, 15, 7036.

| Attribute | Netherlands F3 | Volve Field | 3W | COSTA | KGS |
|---|---|---|---|---|---|
| 3D seismic surveys | ✓ | ✓ | × | × | × |
| 2D seismic surveys | × | ▵ | × | × | × |
| Seismic attributes/interpretations | ✓ | ✓ | × | × | × |
| Well logs (GR, RHOB, NPHI, DT, resistivity) | ▵ | ✓ | × | ▵ | ✓ |
| Image logs/core data | × | ✓ | × | × | ▵ |
| Measured production rates (oil/gas/water) | × | ✓ | × | × | ✓ |
| Synthetic production/simulation outputs | × | × | ✓ | ✓ | × |
| Pressure and temperature measurements | × | ✓ | ✓ | ▵ | × |
| Choke/valve/operational states | × | ✓ | ✓ | × | × |
| SCADA/control-system tags | × | ✓ | ✓ | × | × |
| Drilling/WITSML data | × | ✓ | × | × | × |
| Data representativeness | Real | Real | Real and synthetic | Synthetic (geologically realistic) | Real |
| Resolution | Inline/crossline ≈ 25 m; 3D seismic ≈ 4 ms | Daily production; 0.1–0.5 m log sampling; seconds-level drilling data (1–10 s) | 1 Hz (1 sample/s) | Grid-defined, synthetic | Logs: 0.15–0.3 m; monthly production |
| Volume | ≈190,000 labeled seismic patches | 7–24 producing wells; part of ≈40,000 files | 1984 time series; 21 wells, >8000 labeled events (3W v1.0.0) | 447 synthetic wells (based on 43 real wells) | >120,000 wells; tens of millions of samples |
| Context | Offshore Netherlands, North Sea | Norwegian North Sea | Offshore Brazil | Carbonate reservoir (simulated Middle East) | Kansas, USA |

| Taxonomy Dimension/Attribute | Netherlands F3 | Volve Field | 3W | COSTA | KGS |
|---|---|---|---|---|---|
| 1. Data type | |||||
| Geophysical (3D/2D seismic, VSP) | ✓ | ▵ | × | × | × |
| Well-centric (wireline, mud logs, core) | ▵ | ✓ | × | ▵ | ✓ |
| Dynamic (SCADA, rates, pressures, temps) | × | ✓ | ✓ | ▵ | × |
| Structural (horizons, faults, geomodels) | ✓ | ✓ | × | ✓ | × |
| Multi-modal (seismic, logs, structure) | × | ✓ | × | × | × |
| 2. Data characterization | |||||
| Resolution | 25 m bin; 4 ms | Daily prod.; 0.1–0.5 m log | 1 Hz | Grid-based synthetic | 0.15–0.3 m log |
| Volume | ∼190,000 patches | 40,000 files; 7–24 wells | 1984 instances; 21 wells; >8000 labeled events (v1.0.0) | 447 synthetic wells (based on 43 real wells) | 120,000+ wells |
| Fidelity (MVP, SNR/CV, SDFSR, label consistency) | High seismic fidelity; limited log fidelity (4 wells only) | Moderate; sensor dropouts and aggregated production records | ▵; frozen signals and missing variables documented as artifacts | Maximum; noise-free synthetic benchmark | Variable; MVP varies by log suite and well vintage |
| Imbalance and rare events | ▵ | ▵ | ✓ | × | ▵ |
| Label density | ✓ | ▵ | ✓ | ✓ | × |
| 3. Feature domains | |||||
| Petrophysical (ϕ, k, , , facies) | × | ✓ | × | ✓ | ✓ |
| Geomechanical (stress, moduli, brittleness) | × | ▵ | × | × | × |
| Operational (choke, lift, valve, downtime) | × | ✓ | ✓ | × | × |
| 4. Machine learning | |||||
| Primary task type | Interpretation; segmentation | Forecasting; anomaly detection | Classification; anomaly detection | Inversion; uncertainty quantification; surrogate | Property prediction; clustering |
| Dominant learning paradigm | Supervised; self-supervised | Supervised; physics-informed | Supervised; semi-supervised | Supervised; surrogate | Supervised; unsupervised |
| Ground-truth source | Expert reinterpretation | Field data and reports | Expert and simulation | Numerical simulation | Study-specific |
| Benchmark maturity | ✓ | ▵ | ✓ | ▵ | × |
| PIML/SciML readiness | Low; requires external physical constraints | Partial; multi-modal field data but limited explicit equations | Limited; dynamic sequences with partial simulation support | High; simulation-derived and physically consistent | Low; requires external petrophysical or reservoir constraints |
| 5. Context | |||||
| Asset lifecycle | Exploration; development | Development; production | Production | Exploration; development | Exploration; development |
| Source type | Real | Real | Real and synthetic | Synthetic | Real |
| Utility | Benchmarking; transfer learning | Benchmarking; transfer learning; open-source | Benchmarking; open-source | Benchmarking; method development | Open-source; transfer learning |
| Geographic/geological context | Offshore Netherlands, North Sea | Norwegian North Sea | Offshore Brazil | Carbonate reservoir (simulated Middle East) | Kansas, USA |
| 6. Application scope | |||||
| Primary application type | Seismic interpretation; facies segmentation | Production forecasting; hybrid modeling; anomaly detection | Event detection; early warning | Reservoir characterization; uncertainty quantification; history matching | Log interpretation; lithofacies classification |
| Value-chain coverage | Exploration; development | Development; production | Production | Exploration; development | Exploration; development |
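
The symbol rows of the taxonomy table above can be queried programmatically to shortlist datasets for a given study. The sketch below transcribes four rows from the table; the reading of the symbols as full (✓), partial (▵), and absent (×) coverage is an assumption, since no legend survives in this text, and the function names `datasets_with` and `shortlist` are hypothetical.

```python
DATASETS = ["Netherlands F3", "Volve Field", "3W", "COSTA", "KGS"]

# ASSUMED legend: symbols are read as full / partial / absent coverage.
SCORE = {"✓": 2, "▵": 1, "×": 0}

# Four rows transcribed from the taxonomy table, in DATASETS column order.
TABLE = {
    "Dynamic": ["×", "✓", "✓", "▵", "×"],
    "Operational": ["×", "✓", "✓", "×", "×"],
    "Petrophysical": ["×", "✓", "×", "✓", "✓"],
    "Benchmark maturity": ["✓", "▵", "✓", "▵", "×"],
}


def datasets_with(attribute: str, minimum: str = "▵") -> list[str]:
    """Datasets whose symbol for `attribute` is at least `minimum`."""
    threshold = SCORE[minimum]
    return [
        name
        for name, symbol in zip(DATASETS, TABLE[attribute])
        if SCORE[symbol] >= threshold
    ]


def shortlist(requirements: dict[str, str]) -> list[str]:
    """Datasets meeting every (attribute -> minimum symbol) requirement."""
    selected = set(DATASETS)
    for attribute, minimum in requirements.items():
        selected &= set(datasets_with(attribute, minimum))
    # Preserve the table's column order in the result.
    return [name for name in DATASETS if name in selected]


# Example: full dynamic data and at least partial benchmark maturity.
print(shortlist({"Dynamic": "✓", "Benchmark maturity": "▵"}))
# ['Volve Field', '3W']
```

Under the assumed legend, this query returns the two production-oriented datasets, which matches their description in the table as the sources of SCADA-style dynamic signals.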
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Baqer, M. A Machine Learning-Centric Taxonomy and Structured Characterization of Public Datasets for Upstream Oil and Gas. Big Data Cogn. Comput. 2026, 10, 188. https://doi.org/10.3390/bdcc10060188

