A Machine Learning-Centric Taxonomy and Structured Characterization of Public Datasets for Upstream Oil and Gas

Baqer, M.

doi:10.3390/bdcc10060188

Department of Computer Engineering, College of Information Technology, University of Bahrain, Zallaq P.O. Box 32038, Bahrain

Big Data Cogn. Comput. 2026, 10(6), 188; https://doi.org/10.3390/bdcc10060188

Submission received: 17 April 2026 / Revised: 26 May 2026 / Accepted: 3 June 2026 / Published: 9 June 2026

(This article belongs to the Topic Data Intelligence and Computational Analytics)

Abstract

Upstream oil and gas operations generate large volumes of multivariate data from seismic surveys, well logs, production sensor networks, and reservoir simulation models. Advances in machine learning, artificial intelligence, and other Industry 4.0 technologies are increasingly enabling data-driven applications across exploration, reservoir characterization, drilling optimization, and production forecasting. However, publicly available upstream datasets vary substantially in data modality, labeling strategy, machine learning compatibility, and benchmark maturity. To date, no standardized framework or taxonomy exists to guide dataset selection, benchmark design, or cross-study comparison. This study addresses that gap by proposing a structured, machine learning-centric taxonomy that organizes upstream datasets according to properties that are directly relevant to machine learning requirements. The proposed taxonomy provides a shared reference framework to support consistent dataset description, informed selection, and reproducible benchmarking in upstream machine learning research and applications.

Keywords:

upstream oil and gas; Industrial Internet of Things (IIoT); Internet of Things (IoT); machine learning; public datasets; taxonomy; seismic data; well logs; anomaly detection

1. Introduction

Machine learning has transformed upstream oil and gas operations by enabling data-driven applications across seismic interpretation, reservoir characterization, drilling optimization, and production forecasting [1,2,3,4,5,6]. The success of these applications, however, depends critically on the availability of high-quality, well-structured datasets. The increasing digitization of the energy sector—often referred to as Industry 4.0—has generated massive volumes of data from seismic surveys, well logs, sensor networks, and simulation models [7,8]. Leveraging such data with machine learning has become essential for reducing uncertainty, accelerating decision-making, and improving operational efficiency and overall performance in oil and gas applications [9].

Machine learning has achieved broad methodological coverage across upstream exploration and production applications. At the level of individual application domains, supervised deep learning architectures—principally convolutional neural networks (CNNs) and U-Net variants—have become the dominant modeling paradigm across seismic interpretation, well-log analysis, production forecasting, and drilling optimization [10,11,12,13]. Self-supervised, contrastive, and physics-informed machine learning models have emerged as important complements in upstream operations, particularly where labeled data are scarce or where physical constraints must govern model behavior [14,15,16,17]. Generative models, including generative adversarial networks (GANs), have been applied to data augmentation and synthetic log generation, addressing the persistent scarcity of labeled training instances in upstream applications [10,18].

Across these domains, several recurring observations are evident. First, model performance depends strongly on data properties, including log-suite completeness, labeling consistency, temporal resolution, and class balance [18,19,20,21,22]. Second, hybrid physics-data-driven approaches that combine surrogate models or physics-informed constraints with data-driven components have outperformed purely data-driven baselines in reservoir characterization, history matching, and production forecasting, particularly under data-scarce conditions [3,5,12,17,23]. Third, rare-event and anomaly detection applications, such as undesirable event classification in offshore production systems and equipment degradation forecasting, have leveraged machine learning algorithms [24,25,26,27,28,29,30].

Furthermore, advances in Industry 4.0 technologies are transforming the upstream oil and gas sector through three interconnected enablers: the Industrial Internet of Things (IIoT), digital twins, and advanced data analytics [7,8]. IIoT applications connect drilling rigs, wellheads, flowlines, and surface facilities through distributed sensor networks that continuously acquire real-time measurements of pressure, temperature, flow rate, vibration, and equipment status [31]. These high-frequency data streams support condition-based monitoring, early fault detection, and predictive maintenance [31]. Digital twins extend these capabilities by creating virtual representations of physical assets that are updated using IIoT measurements and integrated with machine learning models to monitor system behavior, predict equipment failures, optimize operating parameters, and evaluate operational scenarios under reduced risk [29,30]. In parallel, cloud and edge computing architectures enable the processing and analysis of terabyte-scale seismic volumes and real-time drilling data, thereby supporting field-wide surveillance and near-real-time decision-making [8]. Moreover, the increasing adoption of machine-to-machine (M2M) communication, collaborative intelligence, and federated learning is expected to further accelerate autonomous upstream operations, improve data-driven coordination, and enhance operational productivity [32,33,34,35,36,37].

Many high-fidelity industrial datasets remain proprietary, and publicly available alternatives are limited in number, often lack standardization, and are unevenly distributed [16,38]. As a result, research increasingly relies on a small number of open datasets, which do not always capture the full diversity or complexity of real-world subsurface conditions, thereby complicating reproducible benchmarking and the validity of generalization claims. Among the most widely adopted upstream datasets are the Netherlands F3, Volve Field, 3W, COSTA, and Kansas Geological Survey (KGS) datasets [17,24,25,39,40,41,42,43]. These five datasets were selected because they represent widely used and publicly accessible examples of major upstream data modalities, including seismic volumes, well logs, production and sensor time series, undesirable-event records, and reservoir/simulation-based data. Accordingly, the objective is not to claim exhaustive coverage of all available upstream datasets, but to develop and demonstrate a machine learning-centric taxonomy using representative datasets that reflect important differences in data modality, labeling strategy, benchmark maturity, and application scope.

Although the selected datasets differ substantially in data objects, acquisition mechanisms, and original research purposes, they can be systematically compared through machine learning-relevant properties rather than on the basis of direct task equivalence. The comparison is therefore not intended to rank the datasets or treat them as interchangeable benchmarks. Instead, datasets are characterized according to shared dimensions that influence machine learning use, including data modality, spatial or temporal structure, label availability, feature coverage, contextual metadata, benchmark maturity, and application suitability.

Each of the aforementioned datasets serves a distinct purpose; they vary substantially in data types and structure, labeling strategy, machine learning readiness, and benchmark maturity. These datasets are described in detail in Section 2. Several limitations persist in the current machine learning research landscape for upstream tasks. First, publicly available datasets for upstream machine learning research lack a standardized characterization framework, making it difficult for researchers to assess dataset suitability, compare results across studies, or justify dataset selection on principled grounds [16,38]. Second, these datasets are frequently applied without a systematic evaluation of whether their properties align with the requirements of the intended machine learning task, undermining reproducibility and limiting the validity of generalization claims [10,19]. Third, existing research on machine learning applied to upstream oil and gas datasets lacks the data-centric perspective required to support rigorous experimental design and fair benchmarking [3,4,5]. Fourth, datasets are described inconsistently across studies, making cross-study comparison unreliable even when the same dataset is used.

Despite this breadth of applications, a critical gap remains. Machine learning model designs, dataset descriptions, evaluation metrics, and generalization claims vary substantially across publications, even when identical public datasets are employed [16,19,38]. No standardized framework currently exists to guide researchers in dataset selection, machine learning model selection, benchmark design, or the interpretation of cross-study performance results. The proposed taxonomy in this research is intended to organize upstream datasets according to properties directly consequential for machine learning: data type, data characterization, feature domains, machine learning setup, context, and application scope. The proposed taxonomy provides the field with a reusable, consistent, and domain-specific vocabulary not offered by prior reviews. Furthermore, the taxonomy is applied as a diagnostic mapping tool across five widely adopted public datasets, explicitly linking each dataset’s properties to its suitability for specific tasks, learning paradigms, and evaluation metrics, a level of structured analysis not currently available in the literature. Moreover, by encoding benchmark maturity, label density, rare-event handling, and geographic context as explicit taxonomy dimensions, the present study provides selection guidance that exceeds what dataset-specific research or general machine learning reviews currently provide.

The remainder of this article is organized as follows. Section 2 describes the five datasets analyzed herein. Section 3 provides a comparison of the datasets from a machine learning perspective. Section 4 examines the interaction between dataset properties and machine learning outcomes across ten application categories. Section 5 presents the proposed taxonomy and its explicit mapping to the studied datasets. Section 6 concludes by summarizing the key findings and outlining directions for future research.

2. Oil and Gas Datasets

Publicly available oil and gas datasets are essential for advancing the upstream oil and gas sector by supporting operational decision-making and performance evaluation. These datasets include three-dimensional (3D) seismic surveys, well logs, multivariate production time series, and static and dynamic reservoir models derived from simulation. Such data underpin numerous Industry 4.0 applications, including seismic interpretation, lithofacies classification, production forecasting, and anomaly and event detection [3,4,38].

Five widely adopted public datasets that predominantly represent upstream workflows are examined in this study: the Netherlands F3, Volve Field, 3W, COSTA, and KGS datasets [17,24,39,41,43]. They were selected because they are publicly accessible, commonly used in upstream machine learning studies, and collectively cover several important data modalities and learning tasks, including seismic interpretation, well-log analysis, production monitoring, anomaly and event detection, and reservoir-oriented simulation. Together, they provide a representative basis for demonstrating the proposed taxonomy and examining how dataset properties influence machine learning suitability, benchmarking, and reproducibility, and they have shaped much of the recent machine learning literature on upstream oil and gas characterization and production analytics.

In the remainder of this section, each dataset is described in detail. Because the five datasets were developed for different purposes, a direct one-to-one comparison based on identical prediction tasks is not appropriate. For example, seismic interpretation datasets, production time series datasets, well-log archives, and reservoir simulation datasets differ in their objects of analysis, sampling structures, labels, and intended applications. Therefore, a characterization-based comparison is adopted rather than a performance-based one. It focuses on machine learning-relevant properties that are common across heterogeneous upstream data resources, such as data structure, label availability, feature domains, preprocessing requirements, task compatibility, and reproducibility potential.

2.1. Netherlands F3 Dataset

The Netherlands F3 dataset is derived from the publicly available 3D seismic survey acquired offshore in the North Sea. The survey covers approximately 384 km² and provides time-migrated 3D seismic data with 651 inlines and 951 crosslines, a 4 ms sampling interval, and a 25 m bin size.

The Netherlands F3 dataset consists of a 3D seismic survey acquired in the late 1990s with a compressed data volume of approximately 1.5–2 GB, along with interpreted horizons and facies labels, making it suitable for large-scale seismic machine learning benchmarks. The original public release includes eight interpreted horizons and well logs from four wells [40]. To support machine learning research, the seismic volume was subsequently reinterpreted to delineate nine horizons (H1–H9) that subdivide the volume into ten stratigraphic intervals, which serve as facies-interval classes [39]. Each horizon file is provided in XYZ format; the intersections of these horizons with seismic inlines and crosslines were used to generate pixel-wise labeled images.

The labeled dataset comprises more than 1600 seismic sections (651 inline images and 951 crossline images) saved as 8-bit PNG files. For each section, pixels are assigned labels from 0 to 9 corresponding to the stratigraphic interval between successive horizons, resulting in ten facies-interval classes for supervised facies segmentation. The dataset also comprises JSON files containing the annotations, XYZ files for each horizon, and the original unlabeled seismic data in TIFF format [11,15,39,40].
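The labeling rule described above can be illustrated with a minimal sketch. It assumes that the horizons have already been converted from XYZ coordinates to vertical sample indices along one section, and that a pixel lying exactly on a horizon belongs to the interval below it; both are illustrative assumptions, and the function name `label_section` is hypothetical rather than part of the published dataset tooling.

```python
from bisect import bisect_right


def label_section(horizon_depths, n_samples):
    """Assign each pixel the index of the stratigraphic interval it falls in.

    horizon_depths: one entry per trace; each entry is the list of horizon
        depths (as sample indices) at that trace, sorted shallowest first.
    n_samples: number of vertical samples per trace.
    Returns labels[z][x], an integer between 0 and the number of horizons.
    """
    n_traces = len(horizon_depths)
    labels = [[0] * n_traces for _ in range(n_samples)]
    for x, depths in enumerate(horizon_depths):
        for z in range(n_samples):
            # The number of horizons at or above sample z is the interval index
            # (assumption: a pixel on a horizon belongs to the interval below).
            labels[z][x] = bisect_right(depths, z)
    return labels


# Toy section: two traces, two horizons, eight samples per trace.
toy_labels = label_section([[2, 5], [3, 6]], n_samples=8)
```

With nine horizons per trace, the same rule yields the ten classes (0 to 9) described above.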

2.2. Volve Field Dataset

The Volve Field dataset is a comprehensive, open, real-field dataset released by Equinor [12,41,42,44]. It integrates well logs, production histories, reports, drilling data, and static and dynamic reservoir simulation models, making it well-suited for data-driven studies in reservoir modeling, production forecasting, and drilling optimization [41]. The Volve Field is located in Block 15/9 in the central Norwegian North Sea.

The Volve Field dataset is often described as a high-fidelity digital representation of a producing oil field because it bundles many of the data types required to support upstream workflows. These include geophysical data (2D and 3D seismic surveys and interpretations), petrophysical and drilling logs, geological and stratigraphic interpretations, static reservoir models, dynamic flow-simulation models (e.g., Eclipse grids and schedules), production and injection time series, well design and completion data, and real-time drilling measurements in Well-site Information Transfer Standard Markup Language (WITSML) format [41,42]. The dataset comprises approximately 40,000 individual files, reflecting the heterogeneity and scale typically encountered in industrial field-development projects [12,44]. The dataset also includes production time series for seven producing wellbores.

Overall, the Volve Field dataset is distinguished by its realism, internal consistency, and broad coverage of upstream data modalities [38]. Nevertheless, despite being exceptionally rich, the dataset has several limitations that affect its use as a general-purpose machine learning benchmark. The publicly released data are bounded by the field life cycle, with production beginning in 2008 and ending in 2016, which constrains studies focused on long-horizon forecasting beyond the available operational period. In addition, the number of producing wells included in commonly used open subsets is limited, with seven available producing wellbores, restricting large-scale generalization and the evaluation of models that rely on broad well populations. Furthermore, the dataset requires significant preprocessing, cleaning, and data-integration effort, which can hinder reproducibility and rapid experimentation. Moreover, seismic, well-log, and production data are not always temporally or operationally aligned, limiting their direct use in joint spatio-temporal learning tasks. The dataset is geographically and geologically specific to a clastic reservoir in the Norwegian North Sea, which may reduce the transferability of trained models to other reservoir types or regions. Finally, some production measurements are provided in aggregated form rather than as high-frequency raw sensor streams, constraining fine-grained analysis of transient events and short-term operational dynamics.

2.3. 3W Dataset

The 3W dataset is a publicly available, large-scale benchmark developed by Petrobras to support research on the detection and diagnosis of undesirable events in offshore oil well operations [24,25]. The dataset comprises multivariate time series instances annotated by domain experts, and it was designed to foster the development and fair comparison of machine learning methods for production monitoring, early-warning systems, and abnormal-event management. The name 3W reflects the dataset composition: instances originate from three sources, namely real, simulated, and hand-drawn [24,25].

Originally introduced in 2019 and subsequently extended as version 2.0.0 in 2025, the dataset focuses on offshore producing wells operating without manifolds [24,25]. Version 1.0.0 covers real data from 21 wells and contains 1984 labeled multivariate time series instances, whereas version 2.0.0 expands the real-well coverage to 42 wells and increases the dataset to 2228 instances across real, simulated, and hand-drawn categories [24,25]. All instances were sampled at a fixed temporal resolution (1 Hz in the 2.0.0 release) and included synchronized measurements of pressures, temperatures, flow-related variables, and operational signals (e.g., choke positions and valve states) along the wellbore and subsea production system [24,25].

Each instance was labeled using a predefined classification scheme that distinguishes normal operation from multiple classes of undesirable events (e.g., severe slugging, hydrate formation, and spurious downhole safety-valve closures). To address the scarcity of certain real-world events, the dataset combines real operational data with simulated and manually generated sequences, while preserving realistic artifacts such as missing variables, sensor noise, frozen measurements, and partially labeled intervals [24,25]. This design differentiates the 3W dataset from idealized or purely synthetic benchmarks and increases its relevance to operational machine learning research.
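Because such artifacts are preserved in the data, a simple per-channel quality check is a common first preprocessing step. The sketch below is illustrative only: the function name `channel_quality` and the threshold `min_frozen_run` are hypothetical, and a run of identical consecutive readings is used as an assumed proxy for a frozen measurement.

```python
import math


def _is_missing(value):
    """Treat None and NaN as missing samples."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def channel_quality(values, min_frozen_run=5):
    """Summarise missing and frozen samples in one sensor channel.

    values: list of readings; None or NaN marks a missing sample.
    min_frozen_run: hypothetical threshold; runs of identical consecutive
        readings at least this long are counted as frozen.
    """
    n = len(values)
    missing = sum(1 for v in values if _is_missing(v))
    frozen = 0
    run = 1  # length of the current run of identical readings
    for prev, cur in zip(values, values[1:]):
        if not _is_missing(prev) and not _is_missing(cur) and cur == prev:
            run += 1
        else:
            if run >= min_frozen_run:
                frozen += run
            run = 1
    if run >= min_frozen_run:
        frozen += run
    return {
        "missing_fraction": missing / n if n else 0.0,
        "frozen_fraction": frozen / n if n else 0.0,
    }


# Ten samples: one missing value and one run of five identical readings.
quality = channel_quality([1.0, 2.0, 2.0, 2.0, 2.0, 2.0, float("nan"), 3.0, 4.0, 4.0])
```

A genuinely constant signal, such as a fully closed valve, would also be flagged by this rule, so the threshold needs to be chosen per variable.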

The 3W dataset is widely used as a reference benchmark for supervised time series classification applications and, more recently, for unsupervised and semi-supervised anomaly detection in upstream oil and gas production systems [24,26]. Its standardized representation and expert-defined labels have enabled comparative studies employing classical classifiers, ensemble methods, deep neural networks, and transformer-based architectures [26,27,28]. At the same time, it poses non-trivial challenges—including class imbalance, heterogeneous sequence lengths, and operational variability across wells—that require careful preprocessing, robust validation protocols, and appropriate modeling strategies [25]. The 3W dataset thus complements image-based and log-based benchmarks by providing a realistic foundation for the development and validation of data-driven monitoring and decision-support methods in offshore oil and gas production.

2.4. COSTA Dataset

The COSTA dataset is based on an open-source carbonate reservoir model introduced by Costa Gomes et al. and developed to support research in reservoir characterization, uncertainty analysis, and reproducible subsurface workflows in carbonate systems [17]. The model was designed as a geologically realistic yet anonymized carbonate reservoir, constructed using concepts representative of Middle Eastern carbonate build-up reservoirs. By adopting an open and reproducible design approach, COSTA provides a controlled environment for methodological development and benchmarking in data-driven subsurface studies suitable for machine learning applications.

The COSTA dataset includes a network of approximately 447 synthetic wells distributed across a three-dimensional carbonate reservoir model. Machine learning samples are typically derived as depth-indexed observations along wells or as grid cell-level extractions from the reservoir grid, yielding thousands of usable instances depending on task definition and sampling resolution. The dataset comprises synthetic subsurface variables—including porosity, permeability, water saturation, and facies—generated through numerical geological and petrophysical modeling.

Unlike real-field operational datasets, COSTA does not include measured production rates, pressure, or temperature time series, SCADA/control signals, or seismic surveys, which reflects its role as a controlled reservoir-modeling benchmark rather than an instrumented field dataset [17]. All variables are generated within a numerical modeling framework, ensuring internal consistency and complete labeling, while deliberately omitting measurement noise and operational disturbances. This design makes COSTA valuable for controlled experimentation, algorithm comparison, and sensitivity analysis, but limits its direct applicability to operational field conditions. Consequently, COSTA should be regarded as a complementary benchmark for method development and conceptual validation, rather than as a substitute for real-field machine learning implementation [17].

2.5. KGS Datasets

The public well-log archives maintained by the Kansas Geological Survey (KGS) constitute an important open-access collection of subsurface well data, providing digital Log ASCII Standard (LAS)-format logs and associated metadata for thousands of oil and gas wells drilled across Kansas, United States of America [38,43]. The KGS dataset includes depth-indexed wireline measurements—such as gamma ray, resistivity, density, neutron porosity, and sonic logs—accompanied by well headers, stratigraphic markers, and location information. Owing to their scale, diversity, and open accessibility, the KGS logs have been used in academic research to support machine learning studies in log interpretation, lithofacies classification, synthetic log generation, and subsurface property regression.

Unlike controlled synthetic benchmarks, the KGS dataset is derived from real field measurements acquired over multiple decades, spanning a wide range of geological settings, stratigraphic intervals, and logging vintages. Machine learning samples are typically constructed as depth-level observations along wells, yielding millions of potential samples when large subsets of the archive are aggregated. However, the dataset does not provide standardized labels for facies or petrophysical properties. Instead, labels are typically inferred from external interpretations, core measurements, or stratigraphic correlations on a case-by-case basis. As a result, the KGS dataset is often used in customized applications rather than as a fixed benchmark with predefined training and testing splits. Overall, the KGS well-log archives serve as a valuable real-world resource for large-scale, data-driven subsurface analysis rather than as a benchmark with standardized labels and splits [38].
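How depth-level samples are obtained from a LAS file can be sketched as follows. This is a deliberately minimal reader that assumes an unwrapped LAS 2.0 file, a null value of -999.25, and curve lines of the form `MNEM.UNIT : description`; real archive files vary considerably, and an established reader such as the `lasio` package would normally be used instead. The function name `las_to_samples` and the sample text are hypothetical.

```python
SAMPLE_LAS = """~Version
VERS. 2.0 : CWLS LOG ASCII STANDARD
WRAP. NO : One line per depth step
~Curve
DEPT.M : Depth
GR.GAPI : Gamma ray
RHOB.G/C3 : Bulk density
~A
1000.0 85.2 2.45
1000.5 -999.25 2.47
"""


def las_to_samples(text, null_value=-999.25):
    """Turn a minimal, unwrapped LAS 2.0 string into depth-level samples."""
    section = None
    curves = []
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("~"):
            # Sections are identified by the first letter after the tilde.
            section = line[1].upper() if len(line) > 1 else None
            continue
        if section == "C":
            # Keep only the mnemonic that precedes the first dot.
            curves.append(line.split(".", 1)[0].strip())
        elif section == "A":
            values = [float(token) for token in line.split()]
            # One dictionary per depth step; null readings become None.
            samples.append({
                name: (None if value == null_value else value)
                for name, value in zip(curves, values)
            })
    return samples


kgs_samples = las_to_samples(SAMPLE_LAS)
```

Each returned dictionary is one depth-level sample; labels would still have to be joined from an external interpretation, as noted above.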

Although a growing number of public oil and gas datasets are available, they differ substantially in data modality, scale, labeling strategy, and realism. These differences influence task feasibility, preprocessing requirements, evaluation protocols, and the comparability of reported results. Without a systematic comparison, assessing dataset suitability for specific machine learning applications and interpreting performance claims across heterogeneous data sources remain challenging. A comparative analysis of the aforementioned datasets is provided in the following section to support informed dataset selection and fair benchmarking.

3. Dataset Comparisons

The selected public upstream datasets are compared using machine learning-centric criteria. Together, these criteria enable a consistent evaluation of heterogeneous datasets that differ substantially in data modality, size, and intended application [24,38,45]. The criteria are as follows:

Sensor Data Type: This criterion identifies the physical quantities or channels provided by the dataset, such as seismic amplitudes and attributes, well logs (e.g., GR, RHOB, NPHI, DT, and resistivity), production signals (pressures, flow rates, temperatures, and choke and valve states), or image intensities and segmentation masks. The breadth and diversity of sensor data determine the feasibility of machine learning tasks and inform feature extraction and model design [3,22,24].
Data Representativeness: This criterion evaluates whether the data originate from real field measurements, synthetic simulations, or hybrid sources, and whether they capture realistic noise, missingness, operational disturbances, and geological variability [19,24].
Resolution: This criterion describes the temporal or spatial sampling of the measurements, such as the time step of production signals (e.g., 1 min, 10 min, hourly), the depth increment of well logs (e.g., 0.5 m), or the inline/crossline spacing and sampling rate of seismic volumes. Resolution affects both the level of detail learnable by machine learning algorithms and the computational cost of model training and inference [24,39].
Volume: This criterion quantifies the overall amount of data available for learning in terms of the numbers of wells, traces, records, or labeled patterns. Examples include the number of labeled seismic sections, depth samples in well logs, time series instances and event segments in production datasets, or labeled rock images and image patches [24,39,45].
Context: This criterion addresses how geographic location influences the geological setting, reservoir type, operational practices, and measurement characteristics [46]. Datasets from different regions reflect distinct depositional environments (e.g., clastic versus carbonate systems), structural styles, petrophysical relationships, and production behavior, which may induce domain shift and affect model generalization. Consequently, models trained in one geographic context may generalize poorly to geologically dissimilar regions without appropriate validation or domain adaptation.
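The five criteria above can be captured in a small record type so that datasets are described with the same fields. The sketch below is illustrative: the class name `DatasetProfile`, its field names, and the helper `select` are hypothetical, and the field values are short paraphrases of the descriptions in Section 2 rather than entries copied from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProfile:
    """One record per dataset, one field per comparison criterion."""
    name: str
    sensor_data_types: tuple
    representativeness: str  # "real", "synthetic", or "hybrid"
    resolution: str
    volume: str
    context: str


PROFILES = (
    DatasetProfile(
        name="Netherlands F3",
        sensor_data_types=("3D seismic", "interpreted horizons", "well logs"),
        representativeness="real",
        resolution="4 ms sampling, 25 m bin size",
        volume="651 inlines, 951 crosslines",
        context="offshore North Sea",
    ),
    DatasetProfile(
        name="3W",
        sensor_data_types=("pressure", "temperature", "choke and valve states"),
        representativeness="hybrid",
        resolution="1 Hz (version 2.0.0)",
        volume="2228 instances (version 2.0.0)",
        context="offshore producing wells",
    ),
    DatasetProfile(
        name="COSTA",
        sensor_data_types=("porosity", "permeability", "water saturation", "facies"),
        representativeness="synthetic",
        resolution="depends on the sampling chosen for the task",
        volume="approximately 447 synthetic wells",
        context="carbonate build-up reservoir model",
    ),
)


def select(profiles, representativeness):
    """Return the names of profiles with the given data origin."""
    return [p.name for p in profiles if p.representativeness == representativeness]


real_only = select(PROFILES, "real")
```

Keeping the record immutable (`frozen=True`) is a design choice that makes a characterization safe to share across experiments without accidental modification.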

The sensor data types and label structures available in each dataset constitute prerequisites for selecting appropriate learning paradigms. For example, densely labeled seismic images are well suited to supervised segmentation models, whereas sparse or weakly labeled production data typically motivate unsupervised or semi-supervised anomaly detection methods. Table 1 summarizes the sensor data types available in the Netherlands F3, Volve Field, 3W, COSTA, and KGS datasets.

As summarized in Table 1, the Netherlands F3 dataset primarily provides time-migrated 3D seismic volumes together with interpretation products such as horizons and facies-interval labels, and only limited well-log information (four wells) [39,40]. Furthermore, it does not include production data, simulated production history, pressure-temperature time series, or operational/control-system tags. Consequently, its applications are largely limited to seismic image interpretation and facies segmentation using supervised and self-supervised learning methods. In contrast, the Volve Field dataset offers a broad combination of subsurface and operational information: 3D seismic volumes (including limited derived 2D sections), seismic interpretation layers, rich wireline logs, core-related data, measured production rates, simulation outputs, pressure and temperature measurements, drilling and WITSML-style records, and partially available choke and other operational signals [12,41,42]. This diversity of sensor and model data enables a wide range of machine learning applications, including production forecasting, hybrid physics–machine learning reservoir modeling, drilling optimization, and anomaly detection.

The inclusion of a variety of modalities in the Volve Field dataset, however, comes at a significant cost: the dataset lacks standardized training/test splits, universal benchmark protocols, and predefined task definitions. Consequently, results from different studies based on the Volve Field dataset are difficult to compare consistently. By contrast, the Netherlands F3 and 3W datasets are narrow in modality coverage, being restricted to seismic imagery and production sensor information, respectively. Both are regarded as the most mature benchmarks among the five datasets, with established data splits, labels, published baselines, and community-adopted evaluation protocols. This inverse relationship between modality breadth and benchmark maturity is not coincidental; the richer and more heterogeneous a dataset, the harder it is to standardize into a fixed benchmark. An explicit trade-off must therefore be made between scope and reproducibility when selecting a dataset, and this trade-off should be reported transparently in any study using these resources.

The observed inverse relationship between modality breadth and benchmark maturity suggests that upstream machine learning research remains constrained by the limited availability of standardized multi-modal benchmarks. Datasets such as the Volve Field dataset provide rich, heterogeneous data, including seismic, geological, well-log, production, pressure, and operational records. However, the modality richness of Volve Field also complicates reproducible evaluation because the modalities differ in sampling frequency, spatial reference, temporal coverage, metadata completeness, and task definition. The proposed taxonomy can guide the design of future community-scale benchmarks by specifying the minimum reporting dimensions required for standardized evaluation: modality inventory, temporal consistency, spatial co-registration, label availability, task definition, train–test splitting strategy, and evaluation metrics.
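The minimum reporting dimensions listed above lend themselves to a simple completeness check. The following sketch is an illustration under stated assumptions: the key names in `REQUIRED_DIMENSIONS` are hypothetical spellings of the seven dimensions, and a benchmark description is assumed to be a plain dictionary.

```python
# Hypothetical key names for the seven reporting dimensions named in the text.
REQUIRED_DIMENSIONS = (
    "modality_inventory",
    "temporal_consistency",
    "spatial_co_registration",
    "label_availability",
    "task_definition",
    "split_strategy",
    "evaluation_metrics",
)


def missing_dimensions(report):
    """Return the required dimensions that are absent or left empty."""
    return [name for name in REQUIRED_DIMENSIONS if not report.get(name)]


# Example: a benchmark description that omits the split and metric details.
draft_report = {
    "modality_inventory": ["well logs", "production time series"],
    "temporal_consistency": "daily production, 2008 to 2016",
    "spatial_co_registration": "not applicable",
    "label_availability": "none",
    "task_definition": "production forecasting",
    "split_strategy": "",
}
gaps = missing_dimensions(draft_report)
```

A check of this kind only verifies that each dimension is reported; it does not assess whether the reported choices are appropriate for the task.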

The 3W dataset is distinguished by its focus on multivariate sensor and control time series from offshore producing wells, with variables such as pressure, temperature, flow-related measurements, choke positions, valve states, and other SCADA-style control-system tags, as well as labeled operational states and undesirable events [24,25]. It does not provide seismic data, well logs, or drilling time series, but instead offers real, simulated, and hand-drawn sequences that make it a strong benchmark for time series classification and anomaly detection under rare-event conditions [26]. The COSTA dataset, by contrast, is a synthetic benchmark centered on reservoir characterization and simulation rather than operational sensor streams. It provides grid-based geological and flow-property realizations and simulated production outputs (including pressure responses), but it does not include seismic surveys, SCADA/control tags, or drilling/WITSML data [17]. This design makes COSTA well suited for uncertainty quantification, history matching, and surrogate modeling of reservoir behavior. Finally, the KGS dataset consists mainly of digital well logs in LAS format, supplying depth-indexed petrophysical curves such as gamma ray, density, neutron porosity, sonic, and resistivity, with limited core imagery and no seismic, production, SCADA, or drilling-series data [43]. Although labeling is partial and heterogeneous, this log-centric data structure supports tasks such as lithology prediction, log reconstruction, and clustering for rock-typing and petrophysical analysis [18,19].

Table 1 provides a comparative overview of oil and gas datasets, highlighting their nature, resolution, scale, and geographic origin. Oil and gas datasets can be real, synthetic, or a mix of both. For instance, the Netherlands F3 dataset comprises real 3D seismic data from the North Sea, the Volve Field dataset is a rich multi-modal time- and depth-based production dataset from the Norwegian North Sea, and the COSTA dataset comprises a synthetic yet geologically realistic carbonate model. Datasets vary in spatial and temporal resolution—from high-frequency drilling data (Volve, 1–10 s) to depth-indexed logs (KGS, 0.15–0.3 m) and seismic patches (Netherlands F3, 25 m spacing, 4 ms). Their sizes range widely: KGS includes over 120,000 wells, while the 3W dataset features 1984 time series files with over 8000 labeled events. Notably, certain datasets are classified as real-field data even when their labels are generated through human expert interpretation or automated machine learning-based annotation. Among the datasets examined in the present study, the 3W and COSTA datasets are the only ones to incorporate synthetic data.

Overall, dataset selection directly and significantly affects the choice of machine learning algorithms, preprocessing strategies, model performance, and evaluation protocols. Larger-scale datasets generally support more complex model architectures and enable more robust training and validation, whereas smaller or sparsely labeled datasets often necessitate simpler models or different machine learning paradigms. Furthermore, the diversity and richness of available variables constrain the range of feasible applications—such as facies classification, petrophysical property regression, or event and anomaly detection—and influence the extent of feature extraction or representation learning required.

Overfitting represents one of the most significant risks in upstream machine learning applications and is frequently exacerbated by dataset properties rather than model architecture alone. Several characteristics identified in the proposed taxonomy directly increase susceptibility to overfitting. Small dataset volume, as encountered when using limited well subsets from the Volve Field dataset or the 1984 instances of the 3W dataset, reduces the effective sample size available for robust generalization, particularly for deep learning architectures with large parameter counts. Sparse or heterogeneous labeling, as observed in the KGS dataset, may cause models to overfit to the specific label derivation strategy employed rather than to the underlying geological signal. Class imbalance, which is a defining characteristic of the 3W dataset and a known challenge in seismic facies benchmarks such as the Netherlands F3 dataset, can cause models to over-predict majority classes and underfit rare but operationally critical events. Geographic and geological specificity can produce models that memorize field-specific patterns rather than learning transferable representations; this risk may arise, for example, when models are trained exclusively on data from the clastic Norwegian North Sea reservoir in the Volve Field dataset. Researchers using the datasets characterized in this taxonomy are therefore advised to apply dataset-appropriate regularization and validation strategies, including cross-validation with splits stratified by well, dropout and weight decay for deep architectures, cost-sensitive learning or synthetic oversampling for imbalanced classes, and explicit evaluation on held-out wells or fields to assess generalization beyond the training distribution. The taxonomy’s benchmark maturity dimension directly supports this evaluation by identifying which datasets provide standardized splits and published baselines, thereby enabling more reliable and reproducible overfitting assessment.
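Evaluation on held-out wells requires that no well contributes samples to both the training and the test side of a split. The sketch below shows one way to build such folds using only the Python standard library. It is an illustration, not the protocol of any cited study: the function name `grouped_k_fold`, the round-robin assignment of wells to folds, and the well identifiers are assumptions. Libraries such as scikit-learn provide a comparable utility (`GroupKFold`).

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def grouped_k_fold(
    groups: Sequence[Hashable], n_splits: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    by_group: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)

    unique = sorted(by_group, key=str)
    if n_splits < 2 or n_splits > len(unique):
        raise ValueError("n_splits must be between 2 and the number of groups")

    # Shuffle the groups (wells), not the samples, then deal them round-robin.
    random.Random(seed).shuffle(unique)
    for fold in range(n_splits):
        test_groups = set(unique[fold::n_splits])
        test_idx = sorted(i for g in test_groups for i in by_group[g])
        train_idx = sorted(
            i for g in unique if g not in test_groups for i in by_group[g]
        )
        yield train_idx, test_idx


# Hypothetical well identifier for each depth sample or time window.
well_ids = ["W1", "W1", "W2", "W2", "W3", "W3", "W4"]
for train_idx, test_idx in grouped_k_fold(well_ids, n_splits=2):
    print("train:", train_idx, "test:", test_idx)
```

Splitting by group rather than by sample is the design choice that matters here: adjacent depth samples or time windows from the same well are strongly correlated, so a sample-level random split would tend to understate the generalization gap.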

Label and classification availability are the primary constraints on learning paradigm selection, not data volume. A common assumption in machine learning is that larger datasets support more sophisticated models. In the upstream oil and gas domain, this assumption can be misleading. The KGS dataset is the largest dataset examined in this study—with over 120,000 wells and tens of millions of potential depth samples—yet it is less suitable for supervised benchmarking due to the absence of standardized labels. Researchers with access to the KGS dataset cannot directly apply supervised classification without first investing substantial effort in label definition, class construction, and annotation. Conversely, the 3W dataset, with only 1984 instances, is immediately deployable for supervised classification and anomaly detection because its labels are dense, expert-defined, and consistently applied. These observations suggest that dataset selection for upstream machine learning should be guided primarily by label availability and label quality, and only secondarily by raw data volume.

Real-world fidelity and synthetic control are both valuable for machine learning applications. The COSTA dataset occupies a distinct role within the dataset ecosystem owing to its synthetic nature: complete label coverage, internal consistency, and the absence of measurement noise render it uniquely suited to controlled experimentation and simulation, where the aim is to isolate the effect of a modeling choice rather than to demonstrate real-world generalization. No real-field dataset examined in this study can serve this function, as all real-world datasets introduce confounding variability arising from geological heterogeneity, operational disturbances, and measurement artifacts.

Geographic and geological concentration constrains cross-domain generalization claims. This concentration suggests that machine learning models trained and evaluated exclusively on these datasets may encode region-specific petrophysical relationships, structural styles, and operational practices that do not transfer to other basins. Generalization claims reported in such studies cannot be substantiated by the available public benchmark ecosystem alone. For example, the claim that a model trained on the Netherlands F3 dataset can segment seismic facies in other regions cannot be validated from these benchmarks alone. This represents a collective gap in the current public dataset landscape rather than a limitation of any individual dataset, with direct implications for the breadth with which results from these benchmarks may be interpreted.

4. Application of Machine Learning Using Oil and Gas Datasets

Machine learning advances have increasingly been applied to address long-standing challenges in upstream oil and gas operations. Rather than merely cataloging applications, this section identifies how dataset properties influence machine learning tasks, methods, and outcomes in upstream oil and gas datasets. The identified categories are as follows:

Seismic Interpretation and Facies Classification: Convolutional architectures such as CNNs and U-Net variants, generative models (GANs), and, more recently, self-supervised and contrastive learning frameworks are widely used for seismic facies segmentation, horizon picking, and stratigraphic interpretation [10,11,14,15,19,20,39,40,47]. The Netherlands F3 dataset is the dominant benchmark for these tasks, whereas KGS well logs are primarily used as supplementary inputs for facies or lithology prediction when labels are derived from external interpretations [20,39]. The prominence of the Netherlands F3 dataset does not reflect any geological or geographical exceptionalism of the North Sea setting [16,19,39,40]. Accordingly, performance gains reported on the Netherlands F3 dataset should be interpreted as improvements within a well-defined supervised benchmark that represents a single geological and geographical location, rather than as evidence of broader model superiority across geological settings or labeling conventions [10,16]. Studies consistently report higher segmentation accuracy on the Netherlands F3 dataset than on datasets with sparser labels [10,11,15,20]. In contrast, the KGS dataset, despite its considerable scale (>120,000 wells), remains limited for supervised facies benchmarking owing to the absence of standardized labels [19,43]. Overall, the evidence suggests that, in this domain, label quality is more significant than dataset volume [16,19].
Well-Log Analysis and Petrophysics: Tree-based models (e.g., random forest and XGBoost), support vector machines (SVMs), and deep learning architectures are widely applied to lithology prediction, missing-log reconstruction, and petrophysical property estimation from wireline curves [19,20,21,48]. The Volve Field dataset supports multi-log studies under realistic field conditions, whereas COSTA provides a controlled setting for carbonate property modeling, and the KGS dataset enables large-scale log interpretation workflows [12,17,43]. Across well-log machine learning studies, performance variability appears to be driven more strongly by log-suite completeness and depth-sampling consistency than by model architecture alone [19,21]. Studies employing the full Volve Field dataset consistently report higher performance than those based on reduced data subsets for the same prediction task, as multi-curve inputs provide complementary lithological information that cannot be fully recovered from a restricted log subset [12,48]. The COSTA dataset, with its fully synthetic and internally consistent label coverage, enables controlled isolation of modeling choices, a degree of control that is generally not achievable with real datasets, in which geological heterogeneity and measurement artifacts act as confounding factors [17]. However, the absence of measurement noise in COSTA means that models achieving very high accuracy on this dataset may overestimate expected real-world performance [16,17]. In KGS-based studies, heterogeneous log availability means that imputation strategies applied prior to model training may introduce systematic biases sufficient to dominate reported accuracy differences between models, particularly when imputation sensitivity analyses are not conducted [19,21,43].
Reservoir Characterization and History Matching: Physics-informed neural networks (PINNs), surrogate models such as particle swarm optimization–artificial neural network (PSO-ANN) and genetic algorithm–artificial neural network (GA-ANN) hybrids, ensemble regressors, and hybrid evolutionary algorithms have been applied to accelerate history matching and uncertainty quantification [3,4,12,22]. The COSTA and the Volve Field datasets are the primary datasets supporting these machine learning implementations [4,17,23]. The synthetic nature of the COSTA dataset produces a consistent and noteworthy outcome: surrogate models trained on COSTA simulation outputs achieve very high $R^{2}$ values (often >0.95), but these values reflect the smoothness and internal consistency of numerically generated simulations rather than the complexity of real subsurface behavior [16,17,23]. Furthermore, the Volve Field dataset’s limited well count means that history matching studies trained and evaluated on the Volve Field dataset are operating in a severe data-scarcity regime for deep learning architectures, rendering cross-architectural generalization unreliable [3,12]. In addition, sequence models (LSTM, TCN), gradient boosting machines (GBM), and hybrid physics-data-driven architectures are employed for well-level and field-level production rate forecasting [12,48,49]. The Volve Field dataset is the primary real-field benchmark, the COSTA dataset provides synthetic production responses, and the 3W dataset offers multivariate operational signals under disturbance conditions [12,17,24]. Production forecasting outcomes are highly sensitive to the temporal resolution and operational completeness of the training data [12,48]. The Volve Field dataset’s production data are available primarily at daily aggregation for production rates, which attenuates transient dynamics and renders short-term operational events (e.g., choke adjustments and brief shut-ins) difficult to resolve [12,44]. 
Forecasting studies that report high accuracy on the daily Volve Field production data demonstrate the ability to model slowly evolving field-level trends, not the ability to capture rapid operational dynamics [12,49]. The 3W dataset, sampled at 1 Hz, provides the temporal resolution needed to capture transient dynamics [24,25]. An important implication is that no single public dataset simultaneously supports both long-horizon forecasting and high-resolution transient modeling, a gap that should be explicitly acknowledged as a limitation in any forecasting paper that relies exclusively on these benchmarks [3,16].
Drilling Optimization and Dysfunction Detection: Ensemble methods such as random forest (RF), GBM, and metaheuristic-assisted models (e.g., MOPSO- or Fireworks-optimized predictors) are used to forecast rate of penetration (ROP) and detect drilling dysfunctions [12,13,44,50]. The Volve Field WITSML drilling records, sampled at 1–10 s intervals, constitute the primary publicly available dataset supporting drill-string-related machine learning applications in this study, including ROP forecasting, weight-on-bit optimization, and drill-string dysfunction detection such as stick-slip, bit bounce, and lateral vibration events [41,42,44]. The high temporal resolution of the WITSML data, which capture surface and downhole drilling parameters, makes them uniquely suited for modeling the dynamic behavior of the drill string under varying lithological and operational conditions [41,42,44]. The 3W dataset is primarily oriented toward production-system events rather than drilling; nonetheless, its labeled rare-event time series can inform methodologies for event detection under class imbalance [24,25,26,27,28]. Drilling machine learning outcomes are uniquely sensitive to temporal resolution. Notably, the Volve Field WITSML data, available at 1–10 s intervals, are the only data in this study to approach the resolution required for real-time drilling control applications [12,44]. However, the dataset covers a single field with specific lithological and operational characteristics, and its WITSML records are not uniformly complete across all wells and depth intervals, introducing inconsistencies that differentially affect model training depending on preprocessing choices [12,13].
Studies applying ensemble methods and metaheuristic-assisted predictors to Volve Field WITSML data for ROP forecasting and dysfunction detection have confirmed that preprocessing decisions—including the treatment of missing drilling records and sensor dropouts—materially influence reported prediction accuracy, yet these choices are inconsistently documented across studies [12,13,50].
The taxonomy’s geographic context and volume dimensions directly predict this limitation: a single-field, single-rig drilling dataset cannot serve as a general-purpose benchmark for ROP prediction [12,16], and reinforcement learning and multi-objective optimization frameworks proposed for autonomous drilling parameter selection under real-time constraints [5,13] require broader operational diversity than the Volve Field dataset provides alone. Claimed generalization beyond the Norwegian North Sea clastic setting should be explicitly conditioned on this constraint and validated against independent field data [16,46].
Well Placement and Geosteering: Machine learning algorithms, including tree-based methods and integrated seismic-facies classifiers, are applied to support trajectory placement and geosteering decisions. The Netherlands F3 dataset provides structural and facies interpretations that can be used for synthetic geosteering experiments, whereas the Volve Field dataset supports development-planning studies under real reservoir conditions [10,12,39,40]. The COSTA dataset provides an open carbonate geomodel for testing placement strategies in controlled settings [17,22]. Geosteering experiments using the Netherlands F3 dataset operate on interpreted stratigraphic intervals rather than direct formation boundaries, and the 10-class facies scheme is a product of a specific reinterpretation campaign rather than a ground-truth geological characterization [39,40]. Models trained to navigate these interpreted intervals may achieve high simulated geosteering performance, but the mapping from predicted facies intervals to actionable drilling targets involves assumptions that are not encoded in the dataset itself [10,11]. The controlled carbonate structure of the COSTA dataset provides a cleaner experimental setting for evaluating placement algorithms [17], but its synthetic origin means that rock-physics relationships and lateral property variability are governed by modeling assumptions rather than conveying real geological information [17,22]. The key implication is that geosteering machine learning studies based on current public datasets are necessarily demonstrations of methodological feasibility rather than validated operational systems [16], and claims about real-field applicability require explicit geological qualification that most published studies fail to provide [10,46].
Production Optimization and Smart Control: Artificial Neural Networks (ANNs), GBM, reinforcement learning, and hybrid physics-machine learning approaches are applied to optimize lift, choke, and well-network performance. The Volve Field dataset supports optimization and control studies using production and operational variables, whereas the COSTA dataset supports optimization and field-development studies under geological uncertainty using simulation outputs [5,12,17,23]. Reinforcement learning and control-oriented machine learning methods are particularly sensitive to the coverage of the operational feature domain in the training dataset [5,24]. The Volve Field dataset’s partial coverage of choke and valve states limits reinforcement learning-based optimization experiments to a subset of the control space relevant to real field operations [12,44]. The COSTA dataset’s simulation outputs allow full coverage of the operational parameter space within the model’s numerical framework [17,23]; however, this introduces a risk of over-optimization to the simulation’s physical assumptions [5,17]. Optimization results reported for the COSTA dataset should therefore be interpreted in light of its idealized conditions [16,17,23].
Predictive Maintenance and Equipment Health: Time series models (e.g., Long Short-Term Memory (LSTM)) and ensemble methods (e.g., XGBoost/RF) are used to detect equipment degradation and forecast remaining useful life from multivariate telemetry. In practice, however, predictive-maintenance studies frequently rely on proprietary data or specialized open datasets. Nonetheless, the Volve Field dataset supports methodological development for monitoring and fault detection using available operational and production variables [12]. This application category represents the most acute dataset gap in the current public ecosystem [3,16]. Predictive maintenance requires high-frequency, equipment-specific sensor streams with labeled degradation events or failure records—a data type that the Volve Field dataset provides only partially through its WITSML and operational records [3,6,12,44]. Notably, datasets comprising task-agnostic or unlabeled operational records cannot support rigorous predictive maintenance benchmarking regardless of the sophistication of the model applied [29,30]. Studies applying LSTM or GBM to Volve Field operational data for maintenance-oriented tasks employ records that were not collected or labeled for this purpose, which introduces severe label noise and task misalignment [12,21]. The absence of a dedicated public maintenance benchmark for upstream oil and gas represents a structural gap in the dataset ecosystem [3,16]—rather than a limitation of any individual study—but this remains a gap that published studies should acknowledge explicitly rather than implying that Volve Field-based maintenance results are directly comparable to purpose-built industrial benchmarks [21,29].
Anomaly, Leak, and Undesirable Event Detection: Autoencoders, isolation forest, one-class SVM, and CNN-LSTM variants are used to detect anomalies and faults in production systems. The 3W dataset is a widely used benchmark for rare undesirable well events (e.g., slugging and sensor-related faults) [24,25,26,27,28]. In addition, the Volve Field dataset supports time series anomaly detection studies using operational and production variables [12,49]. Seismic datasets such as the Netherlands F3 dataset can support structural or facies-related anomaly studies within subsurface interpretation tasks [10,39,40]. This is the application category where the linkage between dataset properties and machine learning outcomes is most rigorously documented in the existing literature, largely because the 3W dataset was explicitly designed to make these linkages visible [24,25]. The dataset’s built-in class imbalance forces researchers to confront this imbalance directly; studies reporting aggregate accuracy on the 3W dataset without class-stratified metrics present systematically misleading performance claims [26,27,28]. The inclusion of real, simulated, and hand-drawn sequences introduces data-origin heterogeneity that affects model training in ways that are not always controlled. Specifically, models trained on a mixture of real and synthetic instances may overfit to the statistical regularities of the simulation engine rather than to the physical dynamics of the real events [24,25,26]. Furthermore, the expansion of the 3W dataset from 21 wells in v1.0.0 to 42 wells in v2.0.0 introduces non-trivial distributional changes [25]; studies trained on v1.0.0 and evaluated against v2.0.0 baselines are performing an implicit domain adaptation experiment that should be acknowledged explicitly [24,25,28].
Drilling and Completion/Fracturing Design: ANNs, GBM, and evolutionary algorithms (e.g., PSO-ANN, GA-ANN) are applied to optimize drilling and completion decisions, including ROP prediction and parameter selection. The Volve Field dataset supports drilling optimization and ROP prediction studies [12,44]. Model-based datasets (e.g., COSTA) can be used to generate controlled scenarios for completion or development-plan sensitivity analyses, while seismic benchmarks (e.g., the Netherlands F3 dataset) can support synthetic trajectory-design experiments [16,17,39,40]. Completion design machine learning studies based on the current public dataset ecosystem face a fundamental feasibility constraint: no public dataset provides the well-completion metadata, hydraulic fracturing records, or post-completion production attribution data needed to train and validate completion optimization models in a statistically rigorous manner [3,16]. The Volve Field dataset’s completion data are available at the well-design level but do not include the fracture propagation measurements or microseismic records needed for stimulation machine learning [12,44]. This represents a direct outcome of the Feature Domains–Geomechanical gap identified in the taxonomy mapping: with no public dataset providing geomechanical attributes alongside completion records, the reported machine learning contributions in this area are necessarily limited to ROP and parameter sensitivity analyses rather than full completion optimization [13,50]. Evolutionary and surrogate-based methods applied to the Volve Field and COSTA datasets for drilling parameter optimization [13,50] are therefore best interpreted as proof-of-concept demonstrations conditioned on this data availability constraint [3,5,8,12,16,17].
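The point made above for anomaly and undesirable event detection, that aggregate accuracy without class-stratified metrics can mislead on imbalanced data, can be shown with a small worked example. The sketch below is illustrative only: the label counts are invented and do not come from the 3W dataset, and the function names are hypothetical. It compares overall accuracy with per-class recall and balanced accuracy (the mean of per-class recalls) for a trivial classifier that always predicts the majority class.

```python
from typing import Dict, Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """Recall computed separately for every class present in y_true."""
    totals: Dict[Hashable, int] = {}
    hits: Dict[Hashable, int] = {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + int(t == p)
    return {c: hits[c] / totals[c] for c in totals}


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Invented example: 95 normal windows, 5 event windows, majority-class predictor.
y_true = ["normal"] * 95 + ["event"] * 5
y_pred = ["normal"] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(per_class_recall(y_true, y_pred))   # {'normal': 1.0, 'event': 0.0}
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The classifier detects none of the events, yet its aggregate accuracy is 0.95; only the class-stratified figures reveal the failure.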

5. Dataset Taxonomy

This section presents a taxonomy for organizing upstream oil and gas datasets according to properties that influence machine learning performance and the types of applications they support. The taxonomy organizes datasets along multiple dimensions, each targeting a distinct aspect of machine learning readiness and applicability. By providing a structured definition for describing oil and gas datasets, the taxonomy supports more consistent comparison across datasets and algorithms and enables more informed selection for both benchmarking and industrial use cases. The taxonomy is detailed in the following subsections.

5.1. Data Type

The data type dimension categorizes datasets according to their sensor-collected modalities and physical quantities. This categorization, in turn, constrains which machine learning tasks and model families are appropriate [16,38]. In practice, aligning data inputs with the intended learning objective (e.g., image segmentation versus time series anomaly detection) is essential for selecting suitable architectures and evaluation protocols [10,51].

Geophysical: Survey-scale geophysical data include 2D/3D seismic volumes, vertical seismic profiling (VSP) for borehole-constrained imaging, and controlled-source electromagnetic (CSEM) measurements for subsurface resistivity characterization [11,15,39].
Well-centric: Wellbore measurements include (i) wireline logs (e.g., gamma ray, resistivity, density, neutron porosity, and sonic logs), (ii) mud logs (e.g., gas readings and cuttings descriptions), (iii) core and image-derived data (e.g., computed tomography (CT)/X-ray transmission imaging, where available), and (iv) interpreted petrophysical properties (e.g., volume of shale ($V_{sh}$), water saturation ($S_{w}$), and porosity) [19,20,43].
Dynamic: Time-varying operational data include production/injection rates (oil/gas/water), pressures (e.g., bottomhole pressure (BHP), flowing wellhead pressure, and tubing pressure), temperatures, and control variables, such as choke positions and valve states. In some settings, these data are accompanied by supervisory control and data acquisition (SCADA) tags or event annotations (e.g., kick events or stuck-pipe indicators) [25,26,27].
Structural: Structural datasets describe subsurface architecture using interpreted horizons and faults, often represented in two-way travel time (TWT) and converted to true vertical depth (TVD) using velocity models. These interpretations may be calibrated to well formation tops and subsequently integrated into 3D geocellular models, which provide gridded reservoir frameworks for simulation and machine learning tasks.
Multi-modal: Datasets in this category integrate heterogeneous measurements across different spatial scales and physical principles, subject to two explicit qualifying conditions. First, the dataset must simultaneously provide at least two physically distinct measurement types that are acquired through different sensing principles or represent different physical domains, for example, seismic acoustic impedance contrasts alongside wireline-derived petrophysical curves, or production pressure time series alongside structural horizon interpretations. Second, the measurements must exhibit meaningful spatiotemporal correspondence: they must be co-registered, co-located, or temporally aligned such that joint learning across modalities is physically meaningful rather than incidental. A dataset that contains multiple data types but lacks this spatiotemporal linkage, such as well logs from one field combined with seismic data from an unrelated survey, does not qualify as multi-modal under this taxonomy. Applying these criteria to the five datasets examined, only the Volve Field dataset satisfies both conditions, integrating geophysical surveys, wireline logs, dynamic production telemetry, and structural interpretations within a single internally consistent field dataset. This integration reduces single-modality bias and enables the capture of complementary information, potentially improving characterization performance in complex subsurface settings [10,11,15].
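The two qualifying conditions for the multi-modal category can be expressed as a simple predicate. The sketch below assumes a deliberately coarse encoding: a set of measurement-domain names and a single Boolean for spatiotemporal linkage. The class `DatasetModalities`, the function `is_multimodal`, and the second example record are hypothetical; deciding whether measurements are truly co-registered or temporally aligned requires domain judgment that a Boolean flag does not capture.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DatasetModalities:
    """Coarse description of a dataset for the multi-modal test (hypothetical)."""
    name: str
    measurement_domains: FrozenSet[str]  # physically distinct measurement types
    spatiotemporally_linked: bool        # co-registered, co-located, or time-aligned


def is_multimodal(dataset: DatasetModalities) -> bool:
    """Condition 1: at least two distinct domains. Condition 2: linkage."""
    return len(dataset.measurement_domains) >= 2 and dataset.spatiotemporally_linked


# Encoded from the description of the Volve Field dataset in the text.
volve = DatasetModalities(
    name="Volve Field",
    measurement_domains=frozenset(
        {"geophysical", "wireline logs", "dynamic production", "structural"}
    ),
    spatiotemporally_linked=True,
)

# Hypothetical counter-example: logs and seismic from unrelated surveys.
unlinked = DatasetModalities(
    name="mixed archive (hypothetical)",
    measurement_domains=frozenset({"wireline logs", "geophysical"}),
    spatiotemporally_linked=False,
)

print(is_multimodal(volve), is_multimodal(unlinked))  # True False
```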

5.2. Data Characterization

Data characterization provides a structured framework for evaluating public upstream datasets prior to machine learning by linking dataset properties to task requirements such as seismic interpretation, well-log analysis, and production monitoring [38]. Inspired by geoscience data-quality assessments, the present taxonomy quantifies dataset attributes across five dimensions that frequently govern model choice, preprocessing effort, and attainable performance [51].

Resolution: Resolution refers to the temporal and/or spatial granularity of the measurements, including the sampling interval in time (e.g., milliseconds for seismic traces or seconds/minutes for time series telemetry), the depth sampling step for well logs (e.g., sub-meter increments), and the spatial sampling or bin size for gridded data (e.g., seismic inline/crossline spacing or bin size in meters) [25,40,43]. Higher resolution improves the detectability of fine-scale features (e.g., thin beds or short transients) but may increase noise sensitivity and computational cost.
Volume: Dataset volume reflects the scale or amount of usable data available for learning, such as areal coverage and trace counts for seismic volumes, the number of wells and depth samples for log archives, and the number of instances, channels, and sequence lengths for multivariate time series [25,40,43]. Large-scale datasets can enable more robust model training and generalization, but they typically require substantial storage, curation, and preprocessing. Notably, dataset size is often reported using raw indicators such as the number of wells, files, seismic traces, logs, or time series instances; these counts do not necessarily represent the amount of statistically independent information available for machine learning. Highly correlated measurements, repeated time windows, spatially adjacent seismic traces, or redundant well-log intervals may reduce the effective information content of a dataset. Therefore, the taxonomy includes effective information content as a complementary criterion that considers effective dimensionality, redundancy, correlation structure, intrinsic data complexity, and the diversity of independent geological, operational, or physical conditions represented in the data.
Fidelity: This sub-category describes measurement trustworthiness across four quantifiable dimensions that should be explicitly reported when characterizing upstream datasets for machine learning: (i) missing value proportion (MVP)—the fraction of missing or null entries per channel or variable, reported as a percentage, where proportions below 5% indicate high fidelity, proportions between 5% and 20% indicate moderate fidelity requiring imputation, and proportions exceeding 20% indicate low fidelity that may introduce systematic bias and should be explicitly flagged in any study using that variable; (ii) noise level—quantified as the signal-to-noise ratio (SNR) in decibels for continuous sensor streams, or as the coefficient of variation (CV) for depth-indexed log measurements, with known acquisition artifacts such as cycle skipping in sonic logs, mud filtrate invasion effects in resistivity measurements, or high-frequency drilling vibration noise in WITSML records documented qualitatively where SNR or CV cannot be computed; (iii) sensor dropout and frozen signal rate (SDFSR)—the proportion of time steps or depth samples affected by sensor dropout, frozen readings, or physically implausible constant values reported per channel, a metric particularly relevant for high-frequency SCADA and WITSML streams such as those in the Volve Field and 3W datasets, where frozen signals are a known artifact [24,25]; and (iv) labeling consistency—the inter-annotator agreement or proportion of samples with conflicting, ambiguous, or partially labeled ground truth where multiple annotation sources exist, with the reinterpretation methodology and its known limitations serving as a proxy for single-annotation-source datasets such as the Netherlands F3 and 3W datasets [24,39]. 
Applying these criteria to the five datasets examined, the Netherlands F3 dataset exhibits high fidelity for seismic image data but limited log fidelity owing to the availability of only four wells; the Volve Field dataset exhibits moderate fidelity overall due to sensor dropouts and aggregated production records; the 3W dataset explicitly documents frozen signals and missing variables as realistic artifacts; the COSTA dataset exhibits maximum fidelity by design as a noise-free synthetic benchmark; and the KGS archive exhibits variable fidelity across wells and logging vintages, with MVP values that vary substantially depending on the log suite and well vintage selected. Lower fidelity across any of these dimensions typically necessitates denoising, imputation, and quality-control preprocessing to avoid introducing systematic learning bias [24,38,39].
Imbalance and Rare Events: Many upstream datasets exhibit strong class imbalance, including minority lithofacies classes in well logs, thin stratigraphic units in seismic interpretation, and rare abnormal events in production systems [24,26]. Such imbalance motivates strategies such as resampling, cost-sensitive learning, data augmentation, and anomaly detection formulations.
Label Density: Label density describes how frequently ground truth is available relative to the raw measurements, for example, per depth sample (well logs), per time step or segment (time series), or per pixel/voxel (seismic images/volumes). Public datasets may provide dense labels derived from interpretation (e.g., horizon-bounded facies intervals) or sparse interval annotations, thereby affecting the suitability of supervised versus semi-supervised and self-supervised learning [11,15,24].
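The MVP thresholds and the frozen-signal criterion described under Fidelity can be illustrated with a per-channel check. The following is a minimal sketch rather than a reference implementation: the function names, the use of `None` or NaN to mark missing samples, and the minimum run length that counts as "frozen" (five identical consecutive samples here) are assumptions made for illustration only.

```python
from itertools import groupby


def missing_value_proportion(channel):
    """Fraction of entries in a channel that are missing (None or NaN)."""
    if not channel:
        raise ValueError("channel is empty")
    missing = sum(1 for v in channel if v is None or v != v)  # NaN != NaN
    return missing / len(channel)


def mvp_fidelity(mvp):
    """Apply the thresholds stated in the text: <5% high, 5-20% moderate, >20% low."""
    if mvp < 0.05:
        return "high"
    if mvp <= 0.20:
        return "moderate"
    return "low"


def frozen_signal_rate(channel, min_run=5):
    """Proportion of samples lying in runs of >= min_run identical, non-missing values.

    min_run is an assumed, dataset-specific choice, not a value from the study.
    """
    if not channel:
        raise ValueError("channel is empty")
    frozen = 0
    for value, run in groupby(channel):
        length = sum(1 for _ in run)
        is_missing = value is None or value != value
        if not is_missing and length >= min_run:
            frozen += length
    return frozen / len(channel)


# Hypothetical ten-sample channel: one missing entry, one frozen run of five.
channel = [1.0, None, 2.0, 2.0, 2.0, 2.0, 2.0, 3.0, 4.0, 5.0]
mvp = missing_value_proportion(channel)
print(mvp, mvp_fidelity(mvp), frozen_signal_rate(channel))
```

In practice, the run-length threshold would need to reflect the sampling rate and the physical variability of each sensor, since slowly varying quantities can legitimately repeat values.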

5.3. Feature Domains

This taxonomy groups variables by domain to support task design and feature engineering, separating properties intrinsic to rocks and fluids from reservoir mechanical attributes and operational control signals [10,22].

Petrophysical: Petrophysical variables describe rock and fluid properties, including porosity ($\phi$), permeability ($k$), water saturation ($S_{w}$), capillary pressure ($P_{c}$), and discrete facies labels derived from logs, cores, or seismic interpretation products [20,21]. These attributes govern storage capacity and flow behavior and are commonly measured or inferred from well logs and core data, or provided as model outputs in synthetic benchmarks (e.g., the COSTA carbonate reservoir model) [11,15,17].
Geomechanical: Geomechanical attributes influence wellbore stability, compaction, and fracture initiation/propagation and are therefore relevant to drilling-risk assessment and stimulation design. They quantify how the subsurface deforms under stress and may include in situ stresses, strain, elastic moduli, and derived brittleness indices, often estimated from well-log- and petrophysics-based proxies in practical workflows [22].
Operational: Operational variables capture how the field is controlled over time and strongly influence observed production rates and pressures. Such variables are essential for distinguishing subsurface-driven behavior from operational interventions in forecasting, anomaly detection, and optimization workflows [5]. Examples include choke settings, valve states, artificial-lift modes (e.g., ESP on/off or gas-lift rate), and downtime/shut-in indicators. In public datasets, these signals are most directly represented in production-oriented time series benchmarks (e.g., the 3W dataset) and integrated field releases (e.g., the Volve Field dataset) through operational channels such as choke/valve states and related control parameters [24,26].

5.4. Machine Learning

This category defines how datasets are converted into reproducible machine learning benchmarks to enable fair algorithmic comparison across studies [11,12,15,16,38].

Task Type: This attribute specifies the core learning objective and its physical meaning. For example, inversion refers to estimating subsurface properties (e.g., porosity or permeability) from indirect measurements, whereas interpretation covers tasks such as facies classification and segmentation, as well as horizon or fault picking. Forecasting, in turn, targets the prediction of production rates, pressures, or other operational variables. Beyond these, broader reservoir data analytics (RDA) tasks include proxy modeling, uncertainty quantification, and optimization [25,39,48]. Clearly defining the task type helps ensure that inputs, labels, and metrics are aligned with a coherent physical question rather than conflating heterogeneous objectives within a single benchmark [10].
Learning Paradigm: The learning paradigm describes how models use labels and domain knowledge, encompassing supervised learning on expert-labeled facies or well events, as well as self-supervised learning that exploits large volumes of unlabeled seismic or log data via pretext or contrastive objectives [11,14,15]. It also includes physics-informed or hybrid frameworks in which physical constraints (e.g., flow or reservoir equations) guide training, as demonstrated in Volve Field-based production modeling [12]. Selecting an appropriate paradigm ensures that benchmarks reflect realistic label availability and incorporate domain structure, particularly where comprehensive labeling is costly or uncertain [5].
Ground Truth: Ground truth describes how target labels are produced, including expert interpretations for facies and structural features (e.g., Netherlands F3 seismic horizons and facies), synthetic labels generated from numerical models (e.g., COSTA carbonate simulations), and laboratory measurements such as core-plug porosity ($\phi$), permeability ($k$), and capillary pressure ($P_{c}$) used to calibrate or validate predictions derived from logs or seismic attributes [17,22,43]. Each source involves trade-offs: synthetic labels enable controlled experimentation and complete target coverage but may omit real-world complexity, operational disturbances, measurement noise, and sensor imperfections. In contrast, expert interpretations, field measurements, and laboratory measurements provide higher physical realism but can introduce domain-dependent bias, missing values, inconsistent sampling, and measurement uncertainty that should be accounted for when designing robust benchmarks [19].
Benchmark Maturity: Benchmark maturity describes the extent to which a dataset is standardized and ready for reproducible evaluation. Indicators include the availability of standard train, validation, and test splits, published baselines and reference results, and clearly specified metrics and protocols (e.g., well-documented seismic facies benchmarks for the Netherlands F3 dataset and event detection protocols for the 3W dataset) [25,38,39]. Mature benchmarks provide clear procedures and reusable pipelines that support fair comparison and cumulative progress. In contrast, emerging datasets may lack agreed-upon tasks, splits, or metrics, requiring additional community effort to establish consistent evaluation standards [38].
PIML/SciML Readiness: This attribute evaluates whether a dataset contains the information required to support physics-informed machine learning (PIML), scientific machine learning (SciML), or hybrid physics-data-driven workflows. In this category, a dataset is considered suitable for physics-constrained learning only when it provides, or can be reliably linked to, physically meaningful constraints such as governing equations, simulation outputs, boundary or initial conditions, conservation relationships, or temporally consistent measurements of coupled physical variables [3,12,16,17]. Four criteria are assessed. Governing equations are considered available for datasets that include, or can be explicitly linked to, flow equations, reservoir equations, material balance, pressure–rate relationships, or conservation laws. Simulation outputs are considered available for datasets generated from numerical models or containing simulator outputs, such as pressure, saturation, permeability, porosity, production response, or scenario-based reservoir states. Boundary/initial conditions are considered available for datasets that provide information about initial pressure, saturation, grid conditions, well controls, injection/production constraints, or boundary assumptions. Finally, temporally consistent multiphysics measurements are considered available for datasets that comprise time-aligned physical variables, such as pressure, flow rate, temperature, choke setting, valve state, water cut, gas rate, and operational events. Physics-constrained learning suitability is rated as high, partial, limited, or low, reflecting the dataset’s capacity to support physics-informed losses, residual constraints, hybrid surrogate models, or scientific machine learning workflows.
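The four readiness criteria and the four rating levels described above could be recorded per dataset as a simple checklist. The sketch below is illustrative only: the text does not specify how criteria map to ratings, so the count-based rule used here (all four met gives "high", two or three "partial", one "limited", none "low") is an assumption, as are the class and field names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PhysicsReadiness:
    """Availability of the four physics-related criteria for one dataset."""
    governing_equations: bool
    simulation_outputs: bool
    boundary_initial_conditions: bool
    time_aligned_multiphysics: bool

    def rating(self) -> str:
        """Assumed mapping from the number of criteria met to a rating level."""
        met = sum(bool(getattr(self, f.name)) for f in fields(self))
        if met == 4:
            return "high"
        if met >= 2:
            return "partial"
        if met == 1:
            return "limited"
        return "low"


# Hypothetical dataset that only offers time-aligned sensor measurements.
example = PhysicsReadiness(
    governing_equations=False,
    simulation_outputs=False,
    boundary_initial_conditions=False,
    time_aligned_multiphysics=True,
)
print(example.rating())
```

A weighted rule, or one that treats governing equations as mandatory for a "high" rating, would be an equally defensible choice; the point of the sketch is only that the rating becomes auditable once the criteria are recorded explicitly.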

5.5. Context

Understanding the geological properties and geographical context of a dataset enables researchers to assess its applicability, limitations, and implicit assumptions when developing or evaluating machine learning models. This category captures the origin, nature, and broader relevance of the dataset beyond its raw measurements.

Asset Lifecycle: Indicates the upstream stage from which the data originate, such as exploration, appraisal, or brownfield (mature-field) operations. This distinction helps align dataset characteristics with typical use cases, for example, structural mapping and prospect screening in exploration versus production monitoring and optimization in mature assets.
Source Type: Specifies whether the dataset is derived from field measurements (e.g., well logs, drilling data, production telemetry), from fully synthetic modeling workflows (e.g., reservoir model benchmarks), or from hybrid sources that combine simulated and real measurements. Source type influences realism, label availability, and the extent to which learned patterns are expected to generalize to operational settings.
Utility: Describes the dataset’s intended use, such as a benchmarking resource with defined tasks and labels, a source for transfer learning and domain adaptation studies, or an open resource intended to promote reproducibility and accessibility in upstream machine learning research.
Geographic Context: Captures the geological and geographic setting (e.g., basin, reservoir lithology such as clastic or carbonate, and tectonic regime). This context is critical for interpreting learned patterns, assessing domain shift, and designing cross-basin generalization and adaptation experiments.

5.6. Application Scope

This category specifies the machine learning tasks and corresponding algorithm classes supported by each dataset, together with the oil and gas value-chain domains in which these tasks are typically deployed.

Application Type: Specifies the domain-specific task(s) for which a dataset is suitable, such as seismic interpretation, well-log analysis, production forecasting, anomaly detection, or carbon capture and storage (CCS) monitoring. This classification promotes alignment between dataset content (inputs and labels) and model design.
Value Chain: Identifies the stage of the upstream value chain targeted by the application, including exploration, development, and production. This classification contextualizes how datasets support operational objectives and decision-making workflows across the asset lifecycle.

The proposed taxonomy, illustrated in Figure 1, organizes datasets along six dimensions: Data Type, Data Characterization, Feature Domains, Machine Learning, Context, and Application Scope. The taxonomy is proposed as an initial machine learning-centric framework rather than a closed or exhaustive classification system. Its dimensions were defined based on recurring dataset properties that directly affect machine learning task design, model selection, preprocessing requirements, evaluation protocols, and reproducibility. The five datasets examined in this study were used to demonstrate the applicability of the taxonomy across representative upstream data modalities; however, additional datasets should be incorporated in future work to further test, refine, and generalize the framework. To make these dimensions concrete, Table 2 applies the proposed taxonomy to each of the five datasets examined in this study.

The datasets are distinguished by their modality with respect to the Data Type dimension. The Netherlands F3 dataset maps clearly to the Geophysical and Structural branches, providing 3D seismic volumes and interpreted horizons but no dynamic or well-centric signals beyond the four well logs included in the dataset. The 3W dataset is mapped solely to the Dynamic branch, supplying high-frequency SCADA-style sensor streams but no seismic, structural, or petrophysical data. The Volve Field dataset is the only dataset examined here to span all five Data Type branches, integrating geophysical surveys, wireline logs, dynamic production telemetry, structural interpretations, and petrophysical data. The COSTA dataset is mapped to the Structural branch with simulated petrophysical outputs, whereas the KGS dataset is mapped to the Well-centric branch.

The Data Characterization category reveals further differentiation. For instance, the Resolution sub-category shows that the 3W dataset offers the finest temporal granularity at 1 Hz, whereas the Netherlands F3 dataset provides 4 ms seismic sampling and 25 m spatial bins, and KGS logs are sampled at 0.15–0.3 m depth increments. On Volume, the KGS dataset dominates with over 120,000 wells, whereas the 3W dataset provides 1984 labeled time series instances. Critically, the datasets differ sharply in terms of Label Density and Imbalance: the Netherlands F3 and 3W datasets provide dense, expert-defined labels suitable for direct supervised learning, while KGS labels are sparse, heterogeneous, and study-dependent.

In addition to Label Density, the taxonomy also distinguishes label provenance and label uncertainty. Labels may originate from expert interpretation, numerical simulation, weak supervision, automatic annotation, or study-specific post-processing, and each source introduces different uncertainty characteristics. Expert-interpreted labels, such as seismic facies or event annotations, may reflect domain expertise but can include interpreter bias and limited inter-annotator reproducibility. Simulation-derived labels provide complete and internally consistent target coverage, but their validity depends on the assumptions and fidelity of the underlying physical or numerical model. Weak labels and automatically generated annotations can increase coverage, but may introduce noise, class ambiguity, or systematic labeling errors. Therefore, label availability should be interpreted together with label provenance, uncertainty, and reproducibility when assessing benchmark reliability and supervised-learning performance. The 3W dataset is the only benchmark explicitly designed around rare-event imbalance, combining real, simulated, and hand-drawn sequences to address class scarcity. The COSTA dataset, by contrast, provides fully synthetic labels with complete coverage and no missingness, maximizing label density at the cost of real-world representativeness.

For multi-modal upstream oil and gas datasets, the Data Characterization dimensions most relevant to joint spatio-temporal learning are temporal consistency, spatial co-registration, modality alignment, metadata completeness, data quality, and operational synchronization. Temporal consistency indicates whether different measurements share compatible timestamps, sampling intervals, and observation windows. Spatial co-registration evaluates whether seismic, well-log, production, geological, and reservoir-model data can be linked through common well identifiers, coordinates, depth references, horizons, grids, or reservoir zones. Modality alignment describes whether heterogeneous data sources, such as seismic volumes, well logs, production time series, pressure data, and operational reports, can be mapped to the same physical asset or reservoir interval. Metadata completeness is essential because units, coordinate systems, depth references, time zones, well names, and acquisition dates determine whether multi-modal fusion is technically feasible. Operational synchronization further evaluates whether measurements correspond to the same production regime, intervention period, reservoir state, or monitoring campaign. This is particularly important for datasets such as the Volve Field dataset, where seismic, geological, well-log, and production data are valuable but are not always temporally or operationally aligned for direct multi-modal learning, reservoir surveillance, or 4D seismic monitoring workflows.
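The temporal-consistency criterion (compatible observation windows and sampling intervals) can be checked mechanically before any fusion is attempted. The following is a minimal sketch under stated assumptions: the `Stream` record, its field names, and the rule that sampling intervals are "compatible" when the coarser one is an integer multiple of the finer one are illustrative choices, and the two example streams are hypothetical rather than taken from any of the datasets examined.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Stream:
    """Minimal temporal metadata for one modality (hypothetical structure)."""
    name: str
    start: datetime
    end: datetime
    step: timedelta  # nominal sampling interval


def overlap(a, b):
    """Return the shared observation window (start, end), or None if disjoint."""
    start, end = max(a.start, b.start), min(a.end, b.end)
    return (start, end) if start < end else None


def steps_compatible(a, b):
    """True if the coarser interval is an integer multiple of the finer one."""
    fine, coarse = sorted((a.step, b.step))
    if fine <= timedelta(0):
        raise ValueError("sampling intervals must be positive")
    return coarse % fine == timedelta(0)


# Hypothetical streams: a 1 s pressure gauge and a daily production record.
pressure = Stream("pressure", datetime(2020, 1, 1), datetime(2020, 1, 10),
                  timedelta(seconds=1))
daily_rate = Stream("daily_rate", datetime(2020, 1, 5), datetime(2020, 2, 1),
                    timedelta(days=1))

print(overlap(pressure, daily_rate))
print(steps_compatible(pressure, daily_rate))
```

A check of this kind covers only timestamps; spatial co-registration and operational synchronization still depend on metadata such as well identifiers, depth references, and intervention logs.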

Across Feature Domains, the Petrophysical sub-category maps to the KGS, Volve Field, and COSTA datasets, while the Operational sub-category is covered only by the 3W and Volve Field datasets. The Geomechanical branch is partially addressed by the Volve Field dataset alone, through its drilling and WITSML records. This mapping confirms that no single public dataset simultaneously covers all three feature domains.

The Machine Learning category of the taxonomy clarifies benchmark maturity and dataset compatibility with machine learning algorithms. The Netherlands F3 and 3W datasets represent the most mature benchmarks, with standardized splits, published baselines, and community-adopted evaluation protocols. The Volve Field and KGS datasets are rich data resources but lack universal benchmark splits, requiring research-specific task and label definitions. The COSTA dataset offers reproducibility through its fully synthetic, internally consistent design; however, its adaptability remains limited. The category also rates PIML/SciML readiness separately from label availability, because the presence of labels alone is insufficient for PIML/SciML workflows; the dataset must also support the construction of physically meaningful loss terms, residual constraints, surrogate-model targets, or consistency checks. Datasets generated from numerical reservoir models, such as the COSTA dataset, are therefore more directly suitable for physics-constrained surrogate modeling because their outputs are internally consistent with an underlying simulation framework. Field datasets such as the Volve Field dataset may support hybrid workflows when production, pressure, structural, and petrophysical variables are temporally or spatially aligned, but they generally require additional assumptions or external reservoir models to formulate explicit physics residuals. In contrast, interpretation-oriented datasets such as the Netherlands F3 dataset and large well-log archives such as the KGS dataset are primarily suitable for data-driven or self-supervised learning unless supplemented with governing equations, calibrated petrophysical models, or simulation-derived constraints.
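To illustrate what a physically meaningful loss term requires from a dataset, a generic physics-informed objective can be written as a data misfit plus a weighted penalty on the residual of a governing relation. This is the standard textbook form from the PIML literature, shown here only as an illustration; it is not a formulation proposed in this study, and all symbols are introduced for this example:

$$
\mathcal{L}(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl\| f_{\theta}(x_{i}) - y_{i} \bigr\|^{2} \;+\; \lambda\,\frac{1}{M}\sum_{j=1}^{M}\bigl\| \mathcal{R}[f_{\theta}](\tilde{x}_{j}) \bigr\|^{2}
$$

Here $f_{\theta}$ is the model with parameters $\theta$, $(x_{i}, y_{i})$ are the $N$ labeled samples, $\mathcal{R}$ is the residual operator of an assumed governing relation (for example, a mass-conservation or material-balance equation), $\tilde{x}_{j}$ are the $M$ points at which the residual is evaluated, and $\lambda \ge 0$ weights the physics term. The first term needs only labels, whereas the second can be formed only if the dataset supplies, or can be linked to, the governing relation and the inputs needed to evaluate it, which is the distinction the readiness attribute is meant to capture.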

Finally, under the Context and Application Scope categories, the taxonomy distinguishes real-field datasets (the Netherlands F3, Volve Field, and KGS datasets) from synthetic or hybrid sources such as the COSTA and 3W datasets. This distinction was further refined by adding physical realism and operational fidelity as explicit dataset-characterization criteria. These criteria describe the extent to which a dataset captures real-world measurement noise, operational disturbances, geological heterogeneity, sensor imperfections, missing values, irregular sampling, and non-stationary field conditions. This is particularly important for assessing the transferability of models validated on synthetic or simulation-derived datasets. For example, the COSTA dataset provides strong physical consistency and controlled reservoir-model variation, but it does not fully reproduce operational field noise or sensor imperfections. In contrast, the Volve Field and 3W datasets provide higher operational fidelity because they include field measurements, production variability, and realistic sensor artifacts, although they offer less control over boundary conditions and physical assumptions. This category also maps datasets to their applicable value-chain stages: the Netherlands F3 and COSTA datasets primarily serve exploration and development workflows, while the 3W and Volve Field datasets are central to production monitoring and optimization. Geographic diversity is limited, as the datasets originate from Brazil, Northwest Europe, or a simulated Middle Eastern carbonate model, with the KGS archive representing a North American clastic setting.

To further clarify the effect of geographic and geological specificity on model transferability, the taxonomy can also be interpreted in terms of domain-shift risk. In this context, domain-shift risk refers to the likelihood that a model trained on one dataset will experience degraded performance when applied to a different geological setting, reservoir type, depositional environment, operational regime, or acquisition vintage. This is particularly important for transfer-learning and foundation-model approaches in subsurface machine learning, where models are expected to generalize beyond the basin, field, or acquisition conditions represented in the training data. Relevant taxonomy dimensions include geographic coverage, reservoir type, depositional setting, lithological diversity, production regime, sensor modality, acquisition date, and data-generation source. For example, datasets focused on a single field or basin may provide strong local consistency but limited evidence of cross-domain generalization, whereas datasets spanning multiple reservoir types, operating conditions, and acquisition campaigns may provide stronger support for evaluating transferability.

Overall, this mapping demonstrates that the taxonomy is not merely a classification vocabulary but a diagnostic tool. When no public dataset satisfies every requirement of a study, the taxonomy can be used to identify the closest available option and to state explicitly the dimensions on which that dataset falls short, thereby strengthening reproducibility and the credibility of generalization claims.
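The diagnostic use described above can be sketched as a small lookup over taxonomy attributes. The example below is a simplified illustration, not the study's Table 2: it encodes only the Data Type branches and the benchmark-maturity observation stated in the preceding paragraphs, the tag names and the `closest_dataset` function are hypothetical, and ranking by the number of unmet requirements is an assumed selection rule.

```python
# Simplified attribute tags drawn from the mapping discussed in the text.
# "mature_benchmark" is assigned only to the two datasets the text singles out
# as most mature; COSTA's petrophysical tag refers to simulated outputs.
DATASETS = {
    "Netherlands F3": {"geophysical", "structural", "mature_benchmark"},
    "3W": {"dynamic", "mature_benchmark"},
    "Volve Field": {"geophysical", "well_centric", "dynamic",
                    "structural", "petrophysical"},
    "COSTA": {"structural", "petrophysical"},
    "KGS": {"well_centric"},
}


def closest_dataset(required):
    """Return (dataset name, sorted list of unmet requirements).

    Assumed rule: choose the dataset with the fewest unmet requirements;
    ties are broken by the order in which datasets are listed above.
    """
    required = set(required)

    def shortfall(name):
        return required - DATASETS[name]

    best = min(DATASETS, key=lambda name: len(shortfall(name)))
    return best, sorted(shortfall(best))


# Hypothetical study needing logs, production telemetry, petrophysics,
# and an established benchmark protocol.
needs = {"well_centric", "dynamic", "petrophysical", "mature_benchmark"}
print(closest_dataset(needs))
```

For this hypothetical requirement set, the Volve Field dataset is selected with benchmark maturity reported as the unmet dimension, which is the kind of explicit shortfall a study would then be expected to acknowledge.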

6. Conclusions and Future Work

This study presents a detailed comparative evaluation and taxonomy of publicly available datasets commonly used to develop and benchmark machine learning solutions in the upstream oil and gas domain. This study offers a comprehensive characterization of the datasets’ structure, content, and machine learning suitability by systematically analyzing widely adopted datasets, namely the Netherlands F3, Volve Field, 3W, COSTA, and KGS datasets. The proposed taxonomy establishes clear relationships between dataset properties and their suitability for specific machine learning paradigms and tasks. The taxonomy is designed to support the development of robust, wellbore-centric, data-driven geoscience applications. Moreover, the taxonomy enables researchers and practitioners to identify appropriate datasets, uncover previously untapped correlations, and improve the accuracy, efficiency, and impact of machine learning workflows across exploration, reservoir management, and production optimization in the upstream sector.

In addition to developing the taxonomy, three main findings are presented in this research. First, the results reveal that no single public dataset simultaneously achieves high resolution, large volume, real-world fidelity, dense labeling, full feature-domain coverage, and benchmark maturity—a gap that has practical implications for study design and interpretation of the results. Second, it was shown that dataset properties, rather than model architecture alone, are the primary determinants of what can be reliably inferred from reported results. Third, the taxonomy provides a shared reference framework that enables the research community to describe, compare, and select datasets consistently, thereby supporting reproducibility and cumulative scientific progress in upstream machine learning research.

Although primarily descriptive, the proposed taxonomy is intended to provide the basis for future quantitative dataset evaluation frameworks. In future extensions, each taxonomy dimension could be associated with measurable scoring criteria or standardized benchmarking metrics. For example, benchmark maturity could be quantified using the availability of predefined train–test splits, baseline models, public leaderboards, documented preprocessing workflows, and reproducible evaluation scripts. Such a scoring framework would allow future dataset evaluations to move from qualitative comparison to more consistent, transparent, and comparable benchmark assessment across upstream oil and gas machine learning studies.
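As one possible starting point for such a scoring framework, the five maturity indicators listed above could be combined into a simple checklist score. The sketch below is a hedged illustration, not a proposed standard: the indicator identifiers, the `maturity_score` function, and the equal weighting of indicators are all assumptions.

```python
# Indicator names paraphrase the five examples given in the text.
MATURITY_INDICATORS = (
    "predefined_splits",
    "baseline_models",
    "public_leaderboard",
    "documented_preprocessing",
    "evaluation_scripts",
)


def maturity_score(available):
    """Fraction of maturity indicators a dataset provides (equal weights assumed)."""
    available = set(available)
    unknown = available - set(MATURITY_INDICATORS)
    if unknown:
        raise ValueError(f"unknown indicators: {sorted(unknown)}")
    return len(available) / len(MATURITY_INDICATORS)


# Hypothetical dataset offering splits, baselines, and evaluation scripts.
print(maturity_score({"predefined_splits", "baseline_models", "evaluation_scripts"}))
```

Equal weights are the simplest choice; a community-agreed scheme might weight predefined splits and evaluation scripts more heavily, since they most directly determine whether reported results are comparable.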

While this study provides a comprehensive taxonomy of public upstream oil and gas datasets for machine learning applications, several opportunities remain for future research. Extending the taxonomy to include midstream and downstream datasets would offer a more holistic view of the oil and gas sector’s data landscape. Additionally, future work could incorporate dynamic updates to reflect evolving datasets and community contributions. Further exploration of interoperability challenges, such as format standardization and metadata consistency, is also needed to support deployment in heterogeneous oil and gas sensor data environments. Understanding data structure and storage conventions is a prerequisite for deploying distributed and federated learning frameworks across heterogeneous upstream environments. Finally, including dimensions such as licensing constraints, annotation workflows, and real-world industrial deployment case studies would strengthen the framework’s utility for both academic researchers and practitioners. A natural and important extension of this work is the empirical validation of the taxonomy’s predictive utility through controlled machine learning experiments. Such experiments would complement the descriptive and structural contributions of the present taxonomy with direct empirical evidence.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available in the Netherlands F3 dataset at https://zenodo.org/records/1471548 (accessed on 4 February 2026); the Volve Field dataset at https://www.equinor.com/energy/volve-data-sharing (accessed on 4 February 2026); the 3W Dataset at https://www.kaggle.com/datasets/afrniomelo/3w-dataset (accessed on 4 February 2026); the COSTA dataset at https://researchportal.hw.ac.uk/en/datasets/costa-model-hierarchical-carbonate-reservoir-benchmarking-case-st/ (accessed on 4 February 2026); and the KGS dataset at https://www.kgs.ku.edu/PRS/petroDB.html (accessed on 4 February 2026).

Acknowledgments

During the preparation of this manuscript, the author used ChatGPT 5.5 to develop the graphical abstract and for the purposes of English language proofing and wording. The author has reviewed and edited the output and takes full responsibility for the content of this publication.

Conflicts of Interest

The author declares no conflicts of interest.

References

Tariq, Z.; Aljawad, M.S.; Hasan, A.; Murtaza, M.; Mohammed, E.; El-Husseiny, A.; Alarifi, S.A.; Mahmoud, M.; Abdulraheem, A. A Systematic Review of Data Science and Machine Learning Applications to the Oil and Gas Industry. J. Pet. Explor. Prod. Technol. 2021, 11, 4339–4374.
Chen, F.; Sun, L.; Jiang, B.; Huo, X.; Pan, X.; Feng, C.; Zhang, Z. A Review of AI Applications in Unconventional Oil and Gas Exploration and Development. Energies 2025, 18, 391.
Azmi, R.P.A.; Yusoff, M.; Mohd Sallehud-din, M.T. A Review of Predictive Analytics Models in the Oil and Gas Industries. Sensors 2024, 24, 4013.
Desai, J.N.; Pandian, S.; Vij, R.K. Big Data Analytics in Upstream Oil and Gas Industries for Sustainable Exploration and Development: A Review. Environ. Technol. Innov. 2021, 21, 101186.
Waqar, A.; Othman, I.; Shafiq, N.; Mansoor, M.S. Applications of AI in oil and gas projects towards sustainable development: A systematic literature review. Artif. Intell. Rev. 2023, 56, 12771–12798.
Salem, A.M.; Yakoot, M.S.; Mahmoud, O. Addressing Diverse Petroleum Industry Problems Using Machine Learning Techniques: Literary Methodology–Spotlight on Predicting Well Integrity Failures. ACS Omega 2022, 7, 2504–2519.
Benayoune, A. Factors influencing industry 4.0 implementation in oil and gas sector: Empirical study from a developing economy. Acad. Strateg. Manag. J. 2022, 21, 1–18. Available online: https://www.abacademies.org/articles/factors-influencing-industry-40-implementation-in-oil-and-gas-sector-empirical-study-from-a-developing-economy.pdf (accessed on 4 February 2026).
Lu, H.; Guo, L.; Azimi, M.; Huang, K. Oil and Gas 4.0 era: A systematic review and outlook. Comput. Ind. 2019, 111, 68–90.
Wang, T.; Wei, Q.; Xiong, W.; Wang, Q.; Fang, J.; Wang, X.; Liu, G.; Jin, C.; Wang, J. Current Status and Prospects of Artificial Intelligence Technology Application in Oil and Gas Field Development. ACS Omega 2024, 9, 3173–3183.
Lin, L.; Zhong, Z.; Li, C.; Gorman, A.; Wei, H.; Kuang, Y.; Wen, S.; Cai, Z.; Hao, F. Machine learning for subsurface geological feature identification from seismic data: Methods, datasets, challenges, and opportunities. Earth-Sci. Rev. 2024, 257, 104887.
Liu, X.; Li, B.; Li, J.; Chen, X.; Li, Q.; Chen, Y. Semi-supervised deep autoencoder for seismic facies classification. Geophys. Prospect. 2021, 69, 1295–1315.
Nikitin, N.O.; Revin, I.; Hvatov, A.; Vychuzhanin, P.; Kalyuzhnaya, A.V. Hybrid and automated machine learning approaches for oil fields development: The case study of Volve field, North Sea. Comput. Geosci. 2022, 161, 105061.
Abd-Elwahed, M.S. Multi-Objective Optimization of Drilling GFRP Composites Using ANN Enhanced by Particle Swarm Algorithm. Processes 2023, 11, 2418.
Li, M.; Yan, X.; Wu, Q. A self-supervised deep learning framework for seismic facies segmentation. Expert Syst. Appl. 2025, 288, 128290.
Li, K.; Liu, W.; Dou, Y.; Xu, Z.; Duan, H.; Jing, R. CONSS: Contrastive Learning Method for Semisupervised Seismic Facies Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 7838–7849.
Dramsch, J.S. Chapter One—70 years of machine learning in geoscience in review. In Machine Learning and Artificial Intelligence in Geosciences; Moseley, B., Krischer, L., Eds.; Advances in Geophysics; Elsevier: Amsterdam, The Netherlands, 2020; Volume 61, pp. 1–55.
Costa Gomes, J.; Geiger, S.; Arnold, D. The design of an open-source carbonate reservoir model. Pet. Geosci. 2022, 28, petgeo2021-067.
Al-Fakih, A.; Koeshidayatullah, A.; Mukerji, T.; Al-Azani, S.; Kaka, S.I. Well-log data generation and imputation using sequence-based generative adversarial networks. Sci. Rep. 2025, 15, 11000.
Ribeiro Mendes, P.; Salavati, S.; Linares, O.; Moreira Gonçalves, M.; Ferreira Zampieri, M.; de Sousa Ferreira, V.H.; Castro, M.; de Oliveira Werneck, R.; Moura, R.; Morais, E.; et al. Rock-type classification: A (critical) machine-learning perspective. Comput. Geosci. 2024, 193, 105730.
Hall, B. Facies classification using machine learning. Lead. Edge 2016, 35, 906–909.
Jiang, S.; Sun, P.; Lyu, F.; Zhu, S.; Zhou, R.; Li, B.; He, T.; Lin, Y.; Gao, Y.; Song, W.; et al. Machine learning (ML) for fluvial lithofacies identification from well logs: A hybrid classification model integrating lithofacies characteristics, logging data distributions, and ML models applicability. Geoenergy Sci. Eng. 2024, 233, 212587.
Balaguera, A.; Torné, M.; Carbonell, R.; Martí, A.; Vergés, J.; Jurado, M.J.; Sánchez-Pastor, P.; Farci, A.; Davoise, D.; Rodríguez, S. Machine learning in subsurface physical properties and lithofacies prediction in a mining context. Sci. Rep. 2025, 15, 26495.
Arinze, C.A.; Jacks, B.S. A comprehensive review on AI-driven optimization techniques enhancing sustainability in oil and gas production processes. Eng. Sci. Technol. J. 2024, 5, 962–973.
Vargas, R.E.V.; Munaro, C.J.; Ciarelli, P.M.; Medeiros, A.G.; do Amaral, B.G.; Barrionuevo, D.C.; de Araújo, J.C.D.; Ribeiro, J.L.; Magalhães, L.P. A realistic and public dataset with rare undesirable real events in oil wells. J. Pet. Sci. Eng. 2019, 181, 106223.
Vargas, R.E.V.; de Melo Junior, A.J.; Munaro, C.J.; de Campos Lima, C.B.; de Lima Junior, E.T.; Barrocas, F.M.; Varejão, F.M.; Peixer, G.F.; Oliveira, I.M.N.; Barbosa, J.R., Jr.; et al. 3W Dataset 2.0.0: A realistic and public dataset with rare undesirable real events in oil wells. arXiv 2025, arXiv:2507.01048.
Oliveira, I.M.N.; Aranha, P.E.; Vieira, T.M.A.; da Silva, A.C.A.; Ramos, D.L.; de Lima Junior, E.T. Advancing Anomaly Detection in Oil Production Wells with TranAD: A Deep Transformer Network Approach. In Proceedings of the XLV Ibero-Latin American Congress on Computational Methods in Engineering (CILAMCE 2024), Maceió, Brazil, 11–14 November 2024.
Turan, E.M.; Jäschke, J. Classification of undesirable events in oil well operation. In Proceedings of the 2021 23rd International Conference on Process Control (PC), Štrbské Pleso, Slovakia, 1–4 June 2021; pp. 157–162.
Brønstad, C.; Netto, S.L.; Ramos, A.L.L. Data-driven Detection and Identification of Undesirable Events in Subsea Oil Wells. In Proceedings of the SENSORDEVICES 2021: The Twelfth International Conference on Sensor Device Technologies and Applications, Athens, Greece, 14–18 November 2021; pp. 1–6. Available online: https://personales.upv.es/thinkmind/SENSORDEVICES/SENSORDEVICES_2021/sensordevices_2021_1_10_28039.html (accessed on 4 February 2026).
Priyanka, E.B.; Thangavel, S.; Gao, X.Z.; Sivakumar, N.S. Digital twin for oil pipeline risk estimation using prognostic and machine learning techniques. J. Ind. Inf. Integr. 2022, 26, 100272.
Wanasinghe, T.R.; Wroblewski, L.; Petersen, B.K.; Gosine, R.G.; James, L.A.; de Silva, O.; Mann, G.K.I.; Warrian, P.J. Digital Twin for the Oil and Gas Industry: Overview, Research Trends, Opportunities, and Challenges. IEEE Access 2020, 8, 104175–104197.
Jia, Z.; Wang, J.; Deng, C. IIoT-based Predictive Maintenance for Oil and Gas Industry. In Proceedings of the 2022 6th International Conference on Electronic Information Technology and Computer Engineering (EITCE 2022), Virtual, China, 21–23 October 2022; ACM: New York, NY, USA, 2022; pp. 432–436.
Zhang, T.; Gao, L.; He, C.; Zhang, M.; Krishnamachari, B.; Avestimehr, A.S. Federated Learning for the Internet of Things: Applications, Challenges, and Opportunities. IEEE Internet Things Mag. 2022, 5, 24–29.
Baqer, M. Energy-Efficient Federated Learning for Internet of Things: Leveraging In-Network Processing and Hierarchical Clustering. Future Internet 2025, 17, 4.
Baqer, M. Lightweight Federated Learning Approach for Resource-Constrained Internet of Things. Sensors 2025, 25, 5633.
Verma, P.K.; Verma, R.; Prakash, A.; Agrawal, A.; Naik, K.; Tripathi, R.; Alsabaan, M.; Khalifa, T.; Abdelkader, T.; Abogharaf, A. Machine-to-Machine (M2M) communications: A survey. J. Netw. Comput. Appl. 2016, 66, 83–105.
Baqer, M.; Kamal, A. S-Sensors: Integrating Physical World Inputs with Social Networks Using Wireless Sensor Networks. In Proceedings of the 2009 Fifth International Conference on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP 2009), Melbourne, Australia, 7–10 December 2009; IEEE: New York, NY, USA, 2009; pp. 213–218.
Baqer, M. Enabling Collaboration and Coordination of Wireless Sensor Networks via Social Networks. In Proceedings of the 2010 6th IEEE International Conference on Distributed Computing in Sensor Systems Workshops (DCOSSW), Santa Barbara, CA, USA, 21–23 June 2010; IEEE: New York, NY, USA, 2010; pp. 1–2.
McDonald, A. Public Datasets for Machine Learning in Geoscience. Medium (TDS Archive), 2022. Available online: https://medium.com/data-science/public-datasets-for-machine-learning-in-geoscience-cf880862300a (accessed on 3 February 2026).
Alaudah, Y.; Michałowicz, P.; Alfarraj, M.; AlRegib, G. A machine-learning benchmark for facies classification. Interpretation 2019, 7, SE175–SE187.
Baroni, L.; Silva, R.M.; Ferreira, R.S.; Chevitarese, D.; Szwarcman, D.; Vital Brazil, E. Netherlands F3 Interpretation Dataset, Version 2.0.0; Zenodo: Geneva, Switzerland, 2018.
Equinor ASA. Disclosing all Volve Data. Equinor News Archive. 2018. Available online: https://www.equinor.com/news/archive/14jun2018-disclosing-volve-data (accessed on 4 February 2026).
Energistics Consortium. Equinor’s Volve Field Test Data. Energistics Consortium, n.d. Available online: https://energistics.org/equinors-volve-field-test-data (accessed on 10 January 2026).
Kansas Geological Survey. Oil and Gas Data Bases: Digital Well Logs and Oil & Gas Well Data; Data Resources Library, University of Kansas: Lawrence, KS, USA, 2006; Available online: https://www.kgs.ku.edu/PRS/petroDB.html (accessed on 10 January 2026).
Ng, C.S.W.; Jahanbani Ghahfarokhi, A.; Nait Amar, M. Well production forecast in Volve field: Application of rigorous machine learning techniques and metaheuristic algorithm. J. Pet. Sci. Eng. 2022, 208, 109468.
Zhang, Y.; Wu, X.; You, J. A benchmark dataset and baseline methods for rock microstructure interpretation in SEM images. Sci. Data 2025, 12, 1671.
Wang, Y.; Lian, J.; Li, C. A dataset of natural gas and liquid level for oil field production prediction in China. Sci. Data 2025, 12, 1071.
Lemos, J.B.; Santos, L.d.S.O.; Cerqueira, A.G. Seismic Facies Segmentation Using Convolutional Neural Networks. In Proceedings of the XVII Congresso Brasileiro de Inteligência Computacional (CBIC 2025), Horizonte, Brazil, 27–30 October 2025; pp. 1–6.
Samad, A.; Khan, I.M.; Rahaman, M.S.; Sakib, A.; Islam, M.A. Data-Driven Approach to Predict Future Oil Production of an Oil Field Using Machine Learning Techniques. In Proceedings of 8th International Conference on Mechanical, Industrial and Energy Engineering; Springer: Cham, Switzerland, 2025; Volume 3, pp. 68–73.
López, R. Forecast Oil Production Using Machine Learning. Neural Designer. 2023. Available online: https://www.neuraldesigner.com/blog/volve-oil-forecasting (accessed on 4 February 2026).
Yang, L.; Lu, Z.; Ren, W.; Liu, T. Improving the Drilling Parameter Optimization Method Based on the Fireworks Algorithm. ACS Omega 2022, 7, 38074–38083.
Ramachandran, N.; Irvin, J.; Omara, M.; Gautam, R.; Meisenhelder, K.; Rostami, E.; Sheng, H.; Ng, A.Y.; Jackson, R.B. Deep learning for detecting and characterizing oil and gas well pads in satellite imagery. Nat. Commun. 2024, 15, 7036.

Figure 1. Proposed machine learning-centric taxonomy of public upstream oil and gas datasets.

Table 1. Sensor data types and data characteristics available in major public upstream oil and gas datasets.

Attribute	Netherlands F3	Volve Field	3W	COSTA	KGS
3D seismic surveys	✓	✓	×	×	×
2D seismic surveys	×	▵	×	×	×
Seismic attributes/interpretations	✓	✓	×	×	×
Well logs (GR, RHOB, NPHI, DT, resistivity)	▵	✓	×	▵	✓
Image logs/core data	×	✓	×	×	▵
Measured production rates (oil/gas/water)	×	✓	×	×	✓
Synthetic production/simulation outputs	×	×	✓	✓	×
Pressure and temperature measurements	×	✓	✓	▵	×
Choke/valve/operational states	×	✓	✓	×	×
SCADA/control-system tags	×	✓	✓	×	×
Drilling/WITSML data	×	✓	×	×	×
Data representativeness	Real	Real	Real and synthetic	Synthetic (geologically realistic)	Real
Resolution	Inline/crossline ≈ 25 m; 3D seismic ≈ 4 ms	Daily production; 0.1–0.5 m log sampling; seconds-level drilling data (1–10 s)	1 Hz (1 sample/s)	Grid-defined, synthetic	Logs: 0.15–0.3 m; monthly production
Volume	≈190,000 labeled seismic patches	7–24 producing wells; part of ≈40,000 files	1984 time series; 21 wells; >8000 labeled events (3W v1.0.0)	447 synthetic wells (based on 43 real wells)	>120,000 wells; tens of millions of samples
Context	Offshore Netherlands, North Sea	Norwegian North Sea	Offshore Brazil	Carbonate reservoir (simulated Middle East)	Kansas, USA

Notes: ✓ = present; ▵ = partially included; × = not present.
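The availability matrix in Table 1 can also be held as a small lookup structure, which makes it easy to ask which datasets carry a given data type. The following is a minimal sketch, not part of the original study: the flags are transcribed from Table 1, while the single-letter encoding (Y/P/N), the shortened "Well logs" and "Measured production rates" keys, and the helper names `datasets_with` and `full_coverage` are illustrative assumptions.

```python
# Dataset order follows the column order of Table 1.
DATASETS = ("Netherlands F3", "Volve Field", "3W", "COSTA", "KGS")

# One flag per dataset, in DATASETS order (encoding is an assumption):
# Y = present, P = partially included, N = not present.
AVAILABILITY = {
    "3D seismic surveys": "YYNNN",
    "2D seismic surveys": "NPNNN",
    "Seismic attributes/interpretations": "YYNNN",
    "Well logs": "PYNPY",
    "Image logs/core data": "NYNNP",
    "Measured production rates": "NYNNY",
    "Synthetic production/simulation outputs": "NNYYN",
    "Pressure and temperature measurements": "NYYPN",
    "Choke/valve/operational states": "NYYNN",
    "SCADA/control-system tags": "NYYNN",
    "Drilling/WITSML data": "NYNNN",
}


def datasets_with(attribute, include_partial=False):
    """Return the datasets that provide `attribute`, optionally counting partial coverage."""
    accepted = {"Y", "P"} if include_partial else {"Y"}
    flags = AVAILABILITY[attribute]
    return [name for name, flag in zip(DATASETS, flags) if flag in accepted]


def full_coverage(dataset):
    """Count the Table 1 attributes that are fully present in `dataset`."""
    column = DATASETS.index(dataset)
    return sum(1 for flags in AVAILABILITY.values() if flags[column] == "Y")


if __name__ == "__main__":
    print(datasets_with("Well logs"))
    print(datasets_with("Well logs", include_partial=True))
    print(full_coverage("Volve Field"))
```

Keeping the partial flag separate from the present flag preserves the three-state legend of the table, so a query can be strict or permissive depending on how much preprocessing a study is willing to accept.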

Table 2. The mapping of five public upstream oil and gas datasets to the proposed taxonomy.

Taxonomy Dimension/Attribute	Netherlands F3	Volve Field	3W	COSTA	KGS
1. Data type
Geophysical (3D/2D seismic, VSP)	✓	▵	×	×	×
Well-centric (wireline, mud logs, core)	▵	✓	×	▵	✓
Dynamic (SCADA, rates, pressures, temps)	×	✓	✓	▵	×
Structural (horizons, faults, geomodels)	✓	✓	×	✓	×
Multi-modal (seismic, logs, structure)	×	✓	×	×	×
2. Data characterization
Resolution	25 m bin; 4 ms	Daily prod.; 0.1–0.5 m log	1 Hz	Grid-based synthetic	0.15–0.3 m log
Volume	≈190,000 patches	40,000 files; 7–24 wells	1984 instances; 21 wells; >8000 labeled events (v1.0.0)	447 synthetic wells (based on 43 real wells)	120,000+ wells
Fidelity (MVP, SNR/CV, SDFSR, label consistency)	High seismic fidelity; limited log fidelity (4 wells only)	Moderate; sensor dropouts and aggregated production records	Partial; frozen signals and missing variables documented as artifacts	Maximum; noise-free synthetic benchmark	Variable; MVP varies by log suite and well vintage
Imbalance and rare events	▵	▵	✓	×	▵
Label density	✓	▵	✓	✓	×
3. Feature domains
Petrophysical (ϕ, k, $S_{w}$, $P_{c}$, facies)	×	✓	×	✓	✓
Geomechanical (stress, moduli, brittleness)	×	▵	×	×	×
Operational (choke, lift, valve, downtime)	×	✓	✓	×	×
4. Machine learning
Primary task type	Interpretation; segmentation	Forecasting; anomaly detection	Classification; anomaly detection	Inversion; uncertainty quantification; surrogate	Property prediction; clustering
Dominant learning paradigm	Supervised; self-supervised	Supervised; physics-informed	Supervised; semi-supervised	Supervised; surrogate	Supervised; unsupervised
Ground-truth source	Expert reinterpretation	Field data and reports	Expert and simulation	Numerical simulation	Study-specific
Benchmark maturity	✓	▵	✓	▵	×
PIML/SciML readiness	Low; requires external physical constraints	Partial; multi-modal field data but limited explicit equations	Limited; dynamic sequences with partial simulation support	High; simulation-derived and physically consistent	Low; requires external petrophysical or reservoir constraints
5. Context
Asset lifecycle	Exploration; development	Development; production	Production	Exploration; development	Exploration; development
Source type	Real	Real	Real and synthetic	Synthetic	Real
Utility	Benchmarking; transfer learning	Benchmarking; transfer learning; open-source	Benchmarking; open-source	Benchmarking; method development	Open-source; transfer learning
Geographic/geological context	Offshore Netherlands, North Sea	Norwegian North Sea	Offshore Brazil	Carbonate reservoir (simulated Middle East)	Kansas, USA
6. Application scope
Primary application type	Seismic interpretation; facies segmentation	Production forecasting; hybrid modeling; anomaly detection	Event detection; early warning	Reservoir characterization; uncertainty quantification; history matching	Log interpretation; lithofacies classification
Value-chain coverage	Exploration; development	Development; production	Production	Exploration; development	Exploration; development

Note: ✓ = present; ▵ = partially included; × = not present.
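To illustrate how the mapping in Table 2 could support dataset selection, the sketch below encodes the "Machine learning" and "Source type" rows as records and filters them by task and benchmark maturity. It is a hedged illustration under stated assumptions rather than a tool from the study: the `DatasetProfile` class, the `select` function, and the ordering absent < partial < present are hypothetical, and only the field values are taken from Table 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProfile:
    """Subset of the Table 2 taxonomy entries for one dataset (hypothetical record type)."""
    name: str
    tasks: tuple
    paradigms: tuple
    benchmark_maturity: str  # "present", "partial", or "absent" (mirrors the table legend)
    source_type: str


PROFILES = (
    DatasetProfile("Netherlands F3", ("interpretation", "segmentation"),
                   ("supervised", "self-supervised"), "present", "real"),
    DatasetProfile("Volve Field", ("forecasting", "anomaly detection"),
                   ("supervised", "physics-informed"), "partial", "real"),
    DatasetProfile("3W", ("classification", "anomaly detection"),
                   ("supervised", "semi-supervised"), "present", "real and synthetic"),
    DatasetProfile("COSTA", ("inversion", "uncertainty quantification", "surrogate"),
                   ("supervised", "surrogate"), "partial", "synthetic"),
    DatasetProfile("KGS", ("property prediction", "clustering"),
                   ("supervised", "unsupervised"), "absent", "real"),
)

# Assumed ordering of the three legend states, used only for thresholding.
MATURITY_RANK = {"absent": 0, "partial": 1, "present": 2}


def select(task, min_maturity="partial"):
    """Return dataset names that list `task` and meet the maturity threshold."""
    threshold = MATURITY_RANK[min_maturity]
    return [
        profile.name
        for profile in PROFILES
        if task in profile.tasks
        and MATURITY_RANK[profile.benchmark_maturity] >= threshold
    ]


if __name__ == "__main__":
    print(select("anomaly detection"))
    print(select("anomaly detection", min_maturity="present"))
```

Treating benchmark maturity as an ordered field is a design choice of this sketch; it lets a study request, for example, only datasets with an established benchmark, or relax the threshold when no such dataset lists the task.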
