Abstract
Digitalization and intelligent technologies have become pivotal in oil and gas field development. At the same time, the volume of fracturing-related documentation has grown markedly, and most of it exists in unstructured form, which leads to inefficient data governance and knowledge utilization. To address this, the paper proposes an integrated technical approach of document governance, knowledge extraction, and intelligent Q&A. We first achieve document standardization through multi-format document parsing and semantic segmentation. Subsequently, by leveraging a knowledge dictionary and the Qwen3-14B (Q8 precision) large language model, professional information extraction is improved via keyword positioning and dynamic prompt strategies, achieving an extraction accuracy of 95% and a recall rate of 80%. This provides data and knowledge support for building a structured database and a vector knowledge base. Finally, a natural language Q&A system based on a structured query language agent (SQLAgent) is developed, achieving a Q&A accuracy of over 92%, a response time of less than 30 s, and support for interactive querying of complex questions. We offer a practical technical pathway for the intelligent management and knowledge services of fracturing engineering documents.
1. Introduction
Hydraulic fracturing is a pivotal technology for enhancing well productivity and development efficiency in oil and gas fields. However, its widespread use has produced a rapidly growing body of fracturing-related documents, including geological data, operational plans, production monitoring records, and rock analysis reports. Most of these documents are stored in unstructured formats (e.g., Word, PDF, images, Excel), which hampers data governance and knowledge reuse [1] and impedes the digital and intelligent transformation of oilfields [2]. The rapid advancement of artificial intelligence (AI) and machine learning (ML) technologies has brought new opportunities to the oil and gas industry, achieving remarkable results in various fields, such as drilling and completion [3,4]. These technologies provide robust technical support for the efficient development and digital transformation of the oil and gas industry [5,6].
Conventional document data extraction techniques have established a comparatively mature technical framework in practical applications and offer advantages such as simplicity and high execution efficiency. Nevertheless, they show significant limitations when processing structurally complex and diverse engineering documents [7,8]. These methods tend to rely on fixed templates and predefined rules [9], and they adapt poorly to varying document formats and linguistic styles [10]. Furthermore, they are not well equipped to capture latent semantic relationships within contexts. This is particularly challenging for fracturing engineering literature, which is replete with professional terminology and lacks structural uniformity, leading to issues such as insufficient accuracy and poor scalability [11,12].
In recent years, rapid advancements in document information extraction and multimodal understanding technologies have laid a solid foundation for knowledge management and intelligent question-and-answer (Q&A) systems in industrial fields. Ashwini Zadgaonkar et al. [13] showed that topic modeling, an unsupervised machine learning dimensionality reduction technique, improves the efficiency of extracting information from unstructured texts. This offers new insights into constructing large-scale domain knowledge bases. John Dagdelen et al. [14] proposed a methodology for recognizing named entities and extracting relations, providing a new way to build structured knowledge bases from scientific literature. Derong Xu et al. [15] systematically explored the application of large language models (LLMs) in tasks such as named entity recognition, relation extraction, and event extraction. Their study showed that generative models outperform traditional discriminative models in complex contexts and low-resource scenarios. Deng et al. [16] proposed a table extraction method based on coordinate and text state analysis. This method overcomes the limitations of traditional, layout-dependent approaches, providing a reliable solution for automatically capturing key data from report-style documents.
While these studies have advanced information extraction technologies in general scenarios, most models remain oriented towards generic contexts and have yet to systematically address challenges in the oil and gas fracturing domain [17], such as terminology density, structural heterogeneity, and semantic specificity [18,19]. In industrial applications such as intelligent Q&A and design report generation, weak domain adaptability and insufficient stability remain persistent issues [20]. These issues are compounded by a lack of systematic validation in real-world oilfield environments. Consequently, there is an urgent need to develop customized knowledge extraction models and corpus training systems tailored to the fracturing domain [21,22,23] to support the digital and intelligent transformation of the oil and gas industry [24].
To address the limitations in existing approaches and support digital transformation in oilfields, this study proposes an integrated workflow for intelligent governance and utilization of fracturing knowledge. The primary objective is to convert large-scale, heterogeneous, and unstructured engineering documents into structured and queryable knowledge resources, enabling automated access to operational insights.
The scientific novelty of this work is reflected in two aspects:
- (1) A domain-adapted extraction paradigm, which incorporates a professionally curated fracturing knowledge dictionary to enhance the semantic alignment and stability of large language models in specialized engineering contexts;
- (2) A dual-layer knowledge storage architecture, combining a relational database for precise parameter-level retrieval and a vector knowledge base for semantic-aware search.
Overall, this study establishes a practical and reusable knowledge management foundation for hydraulic fracturing operations, enabling more accurate information extraction and more efficient knowledge access, thereby facilitating intelligent decision support in modern oil and gas field development.
2. Methods
We address fundamental challenges in fracturing engineering: the difficulty of unified governance of unstructured documents, the high complexity of structured extraction, and the unclear pathways for intelligent utilization. A closed-loop technical pathway driven by a large language model (LLM), illustrated in Figure 1, is constructed. It comprises three elements: Document Governance, Knowledge Extraction, and Intelligent Q&A. The specific methods are outlined below.
Figure 1.
Technology Roadmap.
First, to address the diversity and heterogeneity of fracturing documents, a document governance workflow is designed for end-to-end processing, covering file screening, format conversion, semantic segmentation, and keyword extraction. Documents are converted to a standardized Markdown structure that incorporates metadata such as well numbers, blocks, and affiliated units, producing normalized input corpora. These serve as the foundational dataset for knowledge extraction.
Second, an LLM-based fracturing knowledge extraction system was developed. The fracturing knowledge dictionary is used to accurately identify fracturing-related entity and parameter information in the segmented semantic paragraphs. Keyword positioning and prompt-guided strategies enable the extraction of key knowledge points. The results then undergo field standardization and semantic categorization before being stored in both a structured database and a vector knowledge base, thus establishing a dual-channel data foundation.
Finally, a natural language Q&A system was developed on the LangChain framework. Because vector knowledge bases are difficult to deploy and perform unstably in engineering field environments, the system relies primarily on the structured database for data support. An integrated SQLAgent executes natural language queries at the parameter level. The system was tested and validated with a locally deployed large language model to evaluate the accuracy of knowledge extraction and the responsiveness of the Q&A system.
2.1. Software Environment and Tools
2.1.1. Software and Computational Environment
All software tools, libraries, and computational environments used in this study were documented with specific version information to ensure reproducibility. The document preprocessing and text normalization pipeline was implemented in Python (Version 3.10.14, Python Software Foundation, Beaverton, OR, USA). The workflow integrated LangChain (Version 0.2.12, LangChain Inc., San Francisco, CA, USA), PyMuPDF (Version 1.24.13, Artifex Software Inc., San Rafael, CA, USA), python-docx (Version 1.1.2), pandas (Version 2.2.2), numpy (Version 1.26.4), and scikit-learn (Version 1.5.0) for document parsing, segmentation, data structuring, and model evaluation.
Visualization tasks were conducted using matplotlib (Version 3.9.0) and seaborn (Version 0.13.2). The SQL-based natural language query system was implemented through the SQLAgent module of LangChain (Version 0.2.12). The vector knowledge base was constructed using FAISS (Version 1.8.0, Facebook AI Research, Menlo Park, CA, USA).
All large language model inference processes were executed via an API-based service and local deployment model (API and Local Model Version 2024-08, provider anonymized according to journal policy).
2.1.2. Hardware Configuration
Experiments were conducted on a workstation equipped with an NVIDIA RTX 4090 GPU (NVIDIA Corp., Santa Clara, CA, USA), 64 GB RAM, and an AMD Ryzen 9 7950X CPU (AMD Inc., Santa Clara, CA, USA), providing sufficient computational capacity for large-scale document processing and model inference.
2.2. Document Governance and Standardized Processing
2.2.1. Document Governance
To ensure the validity and relevance of the research data, we designed and implemented a systematic four-level screening strategy: (1) file screening; (2) information extraction; (3) classification and archiving; and (4) deduplication and consolidation. This process extracts high-quality target documents from vast and disorganized raw data. The specific steps are illustrated in Figure 2.
Figure 2.
Document Governance Technology Roadmap.
First, in the file screening phase, the Word documents collected for this study were processed. These included .doc, .docm, and .docx files, as well as Word files contained within compressed packages. A script performed a preliminary screening based on keywords in the filenames (e.g., “fracturing design,” “plan design,” “operation summary,” “operation report”) and retained only the Word files related to fracturing. To simplify subsequent processing, the .doc and .docm files were then converted to .docx. The result was a uniform set of .docx files containing only fracturing designs and operation reports.
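The filename screening step can be illustrated with a minimal sketch. It assumes the keyword list quoted above (the list actually used is likely longer and may not be in English), and the function name and sample filenames are hypothetical. Format conversion and unpacking of compressed packages are not shown.

```python
from pathlib import PurePath

# Keywords quoted in the text; the real screening list may differ.
KEYWORDS = ("fracturing design", "plan design", "operation summary", "operation report")
WORD_SUFFIXES = {".doc", ".docm", ".docx"}


def is_fracturing_word_file(filename: str) -> bool:
    """Return True for Word files whose name contains a screening keyword."""
    path = PurePath(filename)
    name = path.stem.lower()
    return path.suffix.lower() in WORD_SUFFIXES and any(k in name for k in KEYWORDS)


# Hypothetical filenames, for illustration only.
candidates = [
    "Well-A Fracturing Design.docx",
    "Well-A core photo.png",
    "Well-B Operation Summary.doc",
    "Meeting notes.docx",
]
kept = [f for f in candidates if is_fracturing_word_file(f)]
print(kept)
```

Matching on the lower-cased file stem keeps the check insensitive to capitalization, which is a design choice of this sketch rather than something stated in the study.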
In the information extraction phase, the program analyzed the first 50 paragraphs of each document. A combination of regular expressions and structured pattern matching was used to extract key information, including well numbers, affiliated blocks, and responsible units. The extracted information was then cross-verified against production well data tables. In the classification and archiving phase, the screened and converted .docx documents were categorized into two types, “Operation Summary Reports” and “Fracturing Plan Designs,” based on their content features and filenames. For documents containing mixed content (for example, both “perforation” and “operation summary”), a supplementary classification mechanism was designed. After classification, files in the two main categories were renamed according to a standardized naming convention, in which each file receives a unique identifier composed of the well number and the document type. The files were then organized into separate folders for systematic management.
In the final deduplication and consolidation phase, the program automatically retained the largest file (by file size) for each combination of well number and document type and removed the redundant files. The outcome was a concise, high-quality dataset. Finally, the results were exported to an Excel summary table.
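As a rough illustration of the metadata extraction and renaming steps, the sketch below scans the first 50 paragraphs for labeled fields. The label patterns, the sample text, and the `<well number>_<document type>.docx` naming format are all assumptions made for the example; the study does not specify its patterns or exact naming format.

```python
import re

# Hypothetical label patterns; real reports need patterns matched to their templates.
PATTERNS = {
    "well_number": re.compile(r"Well\s*(?:No\.|Number)\s*[:：]\s*([A-Za-z0-9-]+)"),
    "block": re.compile(r"Block\s*[:：]\s*(.+)"),
    "unit": re.compile(r"Responsible\s+Unit\s*[:：]\s*(.+)"),
}


def extract_metadata(paragraphs, limit=50):
    """Scan the first `limit` paragraphs and keep the first match per field."""
    found = {}
    for text in paragraphs[:limit]:
        for field, pattern in PATTERNS.items():
            if field not in found:
                match = pattern.search(text)
                if match:
                    found[field] = match.group(1).strip()
    return found


def standard_name(meta, doc_type):
    """Build a '<well number>_<document type>.docx' name (assumed convention)."""
    return f"{meta['well_number']}_{doc_type}.docx"


# Made-up sample paragraphs.
sample = [
    "Fracturing Plan Design",
    "Well No.: X-101",
    "Block: Example Block",
    "Responsible Unit: Example Unit",
]
meta = extract_metadata(sample)
print(meta)
print(standard_name(meta, "FracturingPlanDesign"))
```

With python-docx, the paragraph list could be built as `[p.text for p in Document(path).paragraphs]`. Cross-verification against the production well tables is omitted here.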
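The keep-the-largest rule can be sketched in a few lines. The record type and the sample entries are hypothetical. Only the rule itself (one file per well number and document type, chosen by size) comes from the text.

```python
from typing import NamedTuple


class FileRecord(NamedTuple):
    path: str
    well_number: str
    doc_type: str
    size_bytes: int


def keep_largest(records):
    """Keep one record per (well number, document type): the largest by size."""
    best = {}
    for rec in records:
        key = (rec.well_number, rec.doc_type)
        if key not in best or rec.size_bytes > best[key].size_bytes:
            best[key] = rec
    return list(best.values())


# Made-up records for illustration.
records = [
    FileRecord("a/X-101_design_v1.docx", "X-101", "design", 120_000),
    FileRecord("b/X-101_design_v2.docx", "X-101", "design", 185_000),
    FileRecord("a/X-101_summary.docx", "X-101", "summary", 90_000),
]
unique = keep_largest(records)
print([r.path for r in unique])
```

In practice the sizes could come from `os.path.getsize`, and the surviving records could be written to the Excel summary with pandas.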
2.2.2. Document Standardized Processing
To achieve structured and standardized processing of fracturing documents, we developed an intelligent standardized processing workflow for multiple document types. This workflow was based on the LangChain framework and the DeepSeek large language model API. The overall technical pathway is illustrated in Figure 3.
Figure 3.
Technical Roadmap for Document Standardized Processing.
First, for Word-type raw data, the python-docx library was used to automate text extraction, ensuring effective parsing of Word documents and addressing data source heterogeneity. To refine the granularity of semantic processing, LangChain’s RecursiveCharacterTextSplitter was introduced to reorganize the text paragraph by paragraph while preserving contextual integrity and information continuity.
During the semantic aggregation phase, an SQLAgent was designed and implemented in combination with the DeepSeek large model API to determine whether adjacent paragraphs belong to the same semantic theme. This step enabled intelligent aggregation across paragraphs, forming logically clear and structurally reasonable semantic segments.
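The aggregation logic can be sketched as a greedy pass over consecutive paragraphs. In the study the same-theme decision is made by an LLM call. Here that call is replaced by a toy stand-in function so the example runs offline, and all names and sample strings are hypothetical.

```python
from typing import Callable, List


def merge_adjacent(
    paragraphs: List[str], same_topic: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedily group consecutive paragraphs while the judge says the topic continues."""
    groups: List[List[str]] = []
    for para in paragraphs:
        if groups and same_topic(groups[-1][-1], para):
            groups[-1].append(para)
        else:
            groups.append([para])
    return groups


def toy_judge(prev: str, curr: str) -> bool:
    """Stand-in for the LLM decision: compares a leading 'topic:' tag."""
    return prev.split(":")[0] == curr.split(":")[0]


paras = [
    "geology: porosity 8%",
    "geology: permeability low",
    "design: 12 stages",
    "design: slickwater",
]
groups = merge_adjacent(paras, toy_judge)
print(groups)
```

Passing the judge as a callable keeps the grouping logic independent of the model, so different LLM APIs can be compared without touching the merge loop.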
Subsequently, an automated conversion script wrote the semantically reorganized content to a standardized Markdown format. A hierarchical architecture of first-level headings, second-level headings, and attributes was established based on the structural characteristics of fracturing documents, enabling modular management and visualization of the data.
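A minimal sketch of such a conversion is shown below. It assumes one first-level heading per document, metadata written as a bullet list of attributes, and one second-level heading per semantic segment. The study does not publish its exact layout, so the function and field names are illustrative.

```python
def to_markdown(title, metadata, sections):
    """Render one document: '#' title, metadata attributes, '##' sections."""
    lines = [f"# {title}", ""]
    lines += [f"- {key}: {value}" for key, value in metadata.items()]
    for heading, body in sections:
        lines += ["", f"## {heading}", "", body]
    return "\n".join(lines) + "\n"


# Made-up content for illustration.
doc = to_markdown(
    "X-101 Fracturing Plan Design",
    {"well_number": "X-101", "block": "Example Block"},
    [("Geological Overview", "Porosity 8%."), ("Pump Schedule", "12 stages.")],
)
print(doc)
```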
2.3. Fracturing Knowledge Extraction
2.3.1. Knowledge Dictionary
The knowledge dictionary is defined as a standardized collection of terms, knowledge units, and their intrinsic relationships. It is formed through a systematic process that includes the governance of unstructured fracturing-related documents, the annotation and categorization of key information, and the incorporation of expert judgment. The dictionary covers fundamental knowledge, including commonly used concepts, parameter names, and technical processes in fracturing operations. It also records aliases, hierarchical relationships, and synonyms for the data items, which improves the model’s recognition capability across different expressions.
The construction of the knowledge dictionary comprises three primary steps, as illustrated in Figure 4.
Figure 4.
Flowchart of Knowledge Dictionary Construction.
First, data governance is performed on a large collection of fracturing documents to preliminarily identify core knowledge points. Second, manual annotation and review by domain experts ensures the accuracy and completeness of the knowledge points. Third, a hierarchically structured knowledge dictionary framework is formed through classification, summarization, and semantic integration, thereby addressing knowledge extraction needs at different levels.
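One way to picture how alias and synonym entries are used is an index that maps every known expression to a canonical parameter name. The entries below are hypothetical examples, not taken from the dictionary built in the study.

```python
# Hypothetical entries; the study's own dictionary content is not reproduced here.
SYNONYMS = {
    "displacement rate": ["pump rate", "injection rate"],
    "formation pressure": ["reservoir pressure", "pore pressure"],
}


def build_alias_index(synonyms):
    """Map every alias (and the canonical name itself) to the canonical name."""
    index = {}
    for canonical, aliases in synonyms.items():
        for term in [canonical, *aliases]:
            index[term.lower()] = canonical
    return index


ALIAS_INDEX = build_alias_index(SYNONYMS)


def normalize(term):
    """Return the canonical parameter name, or None if the term is unknown."""
    return ALIAS_INDEX.get(term.strip().lower())


print(normalize("Pump Rate"))
print(normalize("viscosity"))
```

Returning `None` for unknown terms, rather than guessing, leaves room for expert review of expressions the dictionary does not yet cover.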
2.3.2. Construction of the Knowledge Extraction Model
The knowledge extraction model designed in this study adopts a modular architecture based on a large language model (LLM), consisting of three main components: an input layer, a processing layer, and an output layer. These modules collaborate to achieve efficient extraction of structured knowledge from unstructured documents, as illustrated in Figure 5.
Figure 5.
Structural Diagram of the knowledge extraction model.
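The keyword positioning and prompt-guided strategy can be sketched as two steps: locate which dictionary parameters are mentioned in a segment, then build a prompt that asks only for those parameters. The alias table, the prompt wording, and the sample segment are assumptions for illustration. The prompts actually used in the study are not given.

```python
# Hypothetical alias table standing in for the knowledge dictionary.
ALIASES = {
    "porosity": ["porosity"],
    "displacement rate": ["displacement rate", "pump rate"],
    "formation pressure": ["formation pressure", "reservoir pressure"],
}


def locate_keywords(segment, aliases):
    """Return canonical parameter names whose aliases occur in the segment."""
    text = segment.lower()
    return [
        canonical
        for canonical, names in aliases.items()
        if any(name in text for name in names)
    ]


def build_prompt(segment, hits):
    """Assemble a prompt that asks only for the parameters found in the segment."""
    fields = "\n".join(f"- {name}" for name in hits)
    return (
        "Extract the following fracturing parameters from the text. "
        "Return JSON and use null for values that are not stated.\n"
        f"Parameters:\n{fields}\n\nText:\n{segment}"
    )


segment = "Average porosity is 8.5%; the pump rate was held at 12 m3/min."
hits = locate_keywords(segment, ALIASES)
prompt = build_prompt(segment, hits)
print(prompt)
```

Restricting the prompt to parameters that actually appear in the segment keeps it short and reduces the chance that the model invents values for absent fields.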
To identify the most effective large language model (LLM) for knowledge extraction from fracturing documents, systematic comparative tests were conducted on current mainstream models, including the Qwen3 series, DeepSeek-v3, and ChatGPT-4o. The evaluation focused on the models’ ability to recognize structured tables and extract parameters. The evaluation criteria were model size, the use of quantization compression, prompt strategy, and extraction accuracy, with extraction accuracy serving as the primary metric.
To enhance the applicability and extraction effectiveness of the knowledge extraction model in the fracturing domain, we combined reinforcement learning, supervised fine-tuning, instruction input, and few-shot learning to construct a comprehensive training system, as shown in Figure 6. Several evaluation metrics were used to systematically assess the performance of the model.
Figure 6.
Flowchart of the knowledge extraction model training.
In order to provide a comprehensive evaluation of the training effectiveness and practical performance of the knowledge extraction model in this study, five representative metrics were selected. These metrics cover two dimensions: numerical accuracy evaluation and extraction effectiveness evaluation. These metrics were employed to assess the model’s capacity for error control and the precision of fracturing parameter extraction.
① Root Mean Square Error (RMSE): RMSE evaluates the extraction accuracy of numerical fracturing parameters (e.g., displacement rate, formation pressure):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
where $y_i$ is the true value of the fracturing knowledge, $\hat{y}_i$ is the extracted value, and $n$ is the total number of extractions.
② Mean Absolute Error (MAE): MAE is also used to evaluate the extraction of numerical information. It measures the average absolute deviation between the extracted values and the true values.
③ Accuracy: The proportion of correct extractions among all extraction results, used to evaluate the model’s precision in identifying fracturing knowledge.
④ Recall: The proportion of correctly extracted information among all information that should have been extracted, used to evaluate how completely the model covers the extraction task.
⑤ F1 Score: The harmonic mean of precision and recall, used to evaluate the overall performance and stability of the model in fracturing knowledge extraction.
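The five metrics can be written out as a short sketch. It follows the verbal definitions above and assumes that Accuracy, as defined here (correct extractions over all extraction results), plays the role of precision in the F1 score. The counts and values in the example are illustrative and are not data from the study.

```python
import math


def rmse(true, pred):
    """Root mean square error between true and extracted numeric values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))


def mae(true, pred):
    """Mean absolute error between true and extracted numeric values."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)


def extraction_scores(n_correct, n_extracted, n_expected):
    """Return (accuracy, recall, F1) from extraction counts."""
    accuracy = n_correct / n_extracted  # share of extracted items that are correct
    recall = n_correct / n_expected     # share of expected items that were found
    f1 = 2 * accuracy * recall / (accuracy + recall)
    return accuracy, recall, f1


# Illustrative numbers only.
print(rmse([12.0, 35.5], [12.5, 35.5]), mae([12.0, 35.5], [12.5, 35.5]))
accuracy, recall, f1 = extraction_scores(n_correct=76, n_extracted=80, n_expected=95)
print(round(accuracy, 2), round(recall, 2), round(f1, 2))
```

Under this reading, an accuracy of 0.95 and a recall of 0.80 give an F1 score of about 0.87.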
2.4. Data Storage and Knowledge Organization
To support subsequent question-answering applications and ensure the standardization and usability of the extracted knowledge, we have designed a systematic normalization process for the extraction results, as shown in Figure 7. This process converts the unstructured information produced by the model into a format that can be stored directly, ultimately forming two core knowledge bases: a structured database and a vector knowledge base. Due to the challenges associated with deploying vector knowledge bases in engineering environments, the study primarily relies on the structured database to implement question-answering functionality, with the vector knowledge base providing supplementary support.
Figure 7.
Flowchart for standardizing the extraction results.
2.4.1. Structured Database
This database stores key information extracted from fracturing documents, enabling the systematic and organized management of various types of fracturing knowledge. Its content spans multiple dimensions, including geological information, fracturing design parameters, fracturing operation data and production reports. This meets the needs of subsequent knowledge retrieval, intelligent Q&A and data analysis.
2.4.2. Vector Knowledge Base
This knowledge base stores critical information extracted from fracturing documents and transforms it into semantic vectors using deep learning models, such as large language models (LLMs). This process enables natural language-based knowledge retrieval and supports subsequent applications in knowledge reasoning and decision-making.
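Semantic retrieval over such vectors amounts to ranking stored entries by their similarity to a query vector. The sketch below uses brute-force cosine similarity on made-up 3-dimensional vectors in place of real embeddings and a dedicated index, and the entry texts are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, entries, k=2):
    """Return the texts of the k entries most similar to the query vector."""
    scored = sorted(entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
    return [text for text, _ in scored[:k]]


# Toy vectors; real entries would hold model-generated embeddings.
entries = [
    ("segment about proppant concentration", [0.9, 0.1, 0.0]),
    ("segment about rock mechanics", [0.0, 0.2, 0.9]),
    ("segment about pump schedule", [0.7, 0.6, 0.1]),
]
query = [1.0, 0.2, 0.0]
results = top_k(query, entries)
print(results)
```

A linear scan like this is only practical for small collections. A vector index library serves the same purpose at scale.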
2.5. Intelligent Question-Answering System Design
This system’s intelligent question-answering functionality is based on an architecture that combines a large language model (LLM) with an SQLAgent, using the LangChain agent model as the core framework. Integrating the DeepSeek LLM and an SQLAgent module enables the system to seamlessly convert and interact between natural language and structured query language (SQL), as illustrated in Figure 8.
Figure 8.
Intelligent Question Answering Technology Roadmap.
First, the system parses the user’s input question into an SQL query. Then, the query is executed to retrieve results from the structured database. These results are enriched using the vector knowledge base and then transformed and reorganized into natural language responses via the DeepSeek model to ensure easily understandable answers. The system supports functions such as maximum/minimum value queries, range statistics, multi-turn dialogs, and complex question answering.
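The kind of SQL that such an agent would need to produce for extreme-value and range questions can be illustrated with an in-memory SQLite table. The table name, column names, and rows are hypothetical, and the SQL is written by hand here, whereas in the system it is generated by the LLM from the user's question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fracturing (well TEXT, block TEXT, proppant_conc REAL, daily_oil_t REAL)"
)
# Made-up rows for illustration.
conn.executemany(
    "INSERT INTO fracturing VALUES (?, ?, ?, ?)",
    [("W-1", "A", 320.0, 18.0), ("W-2", "A", 410.0, 35.0), ("W-3", "B", 290.0, 47.5)],
)

# "What is the maximum proppant concentration in block A?"
max_conc = conn.execute(
    "SELECT MAX(proppant_conc) FROM fracturing WHERE block = ?", ("A",)
).fetchone()[0]

# "Which wells have daily oil production between 20 and 50 tons?"
wells = [
    row[0]
    for row in conn.execute(
        "SELECT well FROM fracturing WHERE daily_oil_t BETWEEN 20 AND 50 ORDER BY well"
    )
]
print(max_conc, wells)
conn.close()
```

The query results would then be passed back to the language model to be phrased as a natural language answer.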
3. Results
3.1. Document Governance and Standardization Results
The data used in this study were derived from historical fracturing engineering documents from the Jimsar Basin from 2013 to 2025. These documents cover multiple aspects, including geological data, fracturing design, operational summaries, production monitoring, and rock mechanics. The documents are diverse and include Word, Excel, PDF, PowerPoint, image, compressed package, and other file types, exhibiting the typical characteristics of multi-source heterogeneity and complex formats. Figure 9 shows the specific file types and their distribution.
Figure 9.
File Quantity Distribution Pie Chart.
A total of 544,972 raw files, amounting to approximately 1256 GB of data, were collected for this experiment. After preliminary screening and organization, 17 GB of fracturing design documents and 53 GB of operational summary reports were identified as targets for data governance. These documents underwent standardized conversion according to a unified processing workflow. Unstructured Word documents contained a substantial amount of fracturing design parameters and operation summaries, making them the most information-rich document type and the primary focus of governance and analysis in this study.
To assess the applicability of different large language models during the semantic reorganization stage of fracturing documents, a comparative evaluation was conducted on three mainstream industrial-grade LLM APIs (Qwen3, DeepSeek-v3, and ChatGPT-4o). Ten original fracturing documents were randomly selected and uniformly segmented into paragraphs, followed by semantic merging and topic clustering performed by each model. Manual annotation results were used as the ground truth for evaluation. The performance metrics included field-level Accuracy, Recall, and Human Semantic Consistency (HSC, ranging from 0–100). The evaluation results are presented in Table 1.
Table 1.
Semantic Merging Performance Comparison of LLM APIs.
The results demonstrate that DeepSeek-v3 achieved the best overall performance in semantic merging tasks, particularly showing superior Recall and consistency with human judgment. It also exhibited notable advantages in capturing implicit semantic associations across paragraphs, and therefore is recommended as the preferred model for the semantic aggregation module in this study.
Using a recursive text segmentation strategy and semantic merging via a large language model, the content was integrated into 243,976 semantic paragraph groups, with an average of 90 lines per group. During the keyword extraction process, the system referenced a lexicon of 142 fracturing terms, achieving a 96% coverage rate. Based on the requirements for a standardized document structure, the segmentation results were converted into 30,372 Markdown files. These files included approximately 60,864 first-level headings (averaging two per document) and 243,976 second-level headings (averaging eight per document). The document format validation pass rate was 100%, and the completeness rate of key fields (e.g., well number, block, and operation time) was 96%.
3.2. Knowledge Dictionary Construction Results
Based on an expert analysis of fracturing documents, key fracturing knowledge was extracted and categorized into three types of dictionaries: a synonym dictionary, a value format dictionary, and an explanation dictionary. The synonym dictionary contains 100 entries addressing synonymous expressions; the value format dictionary contains 50 entries dealing with specialized value formats; and the explanation dictionary contains 97 entries clarifying specialized terminology. Examples of these dictionaries are provided in Table 2, Table 3 and Table 4.
Table 2.
Fracturing Field Synonym Dictionary.
Table 3.
Fracturing Field Value Dictionary.
Table 4.
Fracturing Field Interpretation Dictionary.
These dictionaries encompass three major categories of fracturing parameters (formation, design, and operation), along with ten subcategories, such as temperature-pressure, solids, and properties. In total, there are 142 types of fracturing parameters. Additionally, a standardized rock mechanics database model was established that incorporates multiple major blocks of the Jimsar Basin. This model improves the knowledge dictionaries by integrating three types of parameters: basic, experimental, and core characteristics, which comprise a total of 27 experimental rock mechanics parameters.
3.3. Knowledge Extraction Model Performance Evaluation
A comparative analysis was conducted on three mainstream large language models for the task of fracturing knowledge extraction: Qwen3, DeepSeek-v3, and ChatGPT-4o. To ensure evaluation credibility, ten standardized fracturing documents were randomly selected as the test samples. The extracted fields were compared with manually curated ground-truth data to measure extraction accuracy.
As shown in Figure 10, the experimental results indicate that the recommended minimum model configuration for the current system during deployment is Qwen3-14B in its 8-bit quantized version (Q8), used in conjunction with optimized prompts. This configuration strikes a favorable balance between performance and resource consumption. Therefore, Qwen3-14B (Q8) was selected as the core model for knowledge extraction in this study.
Figure 10.
Model Extraction Accuracy Comparison Bar Chart (Porosity Extraction Test).
To further improve extraction performance, the model was trained and optimized using domain-specific corpora, followed by a systematic validation of the model outputs to confirm reliability and domain robustness. After training and optimization, the model achieved an extraction accuracy of 95%, a recall rate of 80%, and an F1 score of 87%. Both the root mean square error (RMSE) and mean absolute error (MAE) were maintained at low levels. Figure 11 illustrates a comparison of performance before and after training. These results demonstrate that this model configuration meets practical application requirements in terms of accuracy and stability.
Figure 11.
Model Training Indicator Comparison Histogram.
3.4. Data Storage and Knowledge Base Construction Results
We established a comprehensive data management system that comprises a structured relational database and a vector knowledge base. This system is designed to efficiently store, manage, and apply multi-source, heterogeneous data related to fracturing stimulation. In total, 30,372 standardized Markdown documents were processed by the extraction model, and the resulting structured knowledge was imported into the database system.
The structured database contains approximately 676,700 records related to fracturing operations and includes around 35 core fields. These fields contain information such as well numbers, blocks, design indicators, operation records, and production data. See Table 5 for an example.
Table 5.
Example of the Structured Database.
The vector knowledge base covers 30,372 Markdown document segments and comprises approximately 8500 vector entries, each containing a 768-dimensional embedding vector that supports semantic retrieval. See Table 6 for examples of the embedding data stored in the vector knowledge base.
Table 6.
Example vector embeddings stored in the vector knowledge base.
3.5. Intelligent Q&A System Validation Results
The intelligent Q&A system was tested during its experimental deployment on the Integrated Platform in the Jimsar Basin. It responded accurately to various types of queries, including extreme value queries (e.g., “What is the maximum proppant concentration in a specific block?”), range-based statistical queries (e.g., “Which wells have daily oil and gas production between 20 and 50 tons?”), and multi-turn complex inquiries. Example interactions are illustrated in Figure 12.
Figure 12.
Question and Answer Example.
The system achieved an accuracy rate of over 92% in the Q&A process, with an average response time of under 30 s. Case studies indicate that the system significantly reduces the time required for manual data retrieval and organization, thereby improving fracturing data utilization efficiency.
4. Discussion
The intelligent governance and knowledge extraction methodology for fracturing documents proposed in this study proved effective and feasible in practical applications within the Jimsar Basin. Unlike traditional approaches that rely on manual processing or rule-based templates, this research achieves a systematic closed loop, from document collection and standardization to knowledge extraction and Q&A application, significantly enhancing data utilization efficiency and automation levels.
First, at the document governance level, we address challenges such as fragmented sources, diverse formats, and the complex structures of fracturing documents by using multi-format parsing and semantic segmentation technologies. Unlike many existing studies that focus on single formats or small-scale datasets, this work covers over 500,000 raw documents, demonstrating strong scalability and industrial applicability.
Second, in terms of knowledge extraction, experimental results indicate that the Qwen3-14B model (the 8-bit quantized version with optimized prompts) is highly precise and robust in extracting fracturing-related knowledge. Compared to DeepSeek-v3 and ChatGPT-4o, Qwen3-14B achieves a better balance of accuracy and stability, making it more suitable for large-scale deployment in real-world oilfield scenarios. These results underscore the importance of prompt optimization and domain-specific lexical constraints in improving the performance of large language models in professional applications.
Beyond general-purpose models, several domain-oriented NLP frameworks such as PetroBERT and PetroNLP have recently been introduced to enhance terminology recognition in petroleum engineering. Although these systems demonstrate notable improvements in entity identification over traditional rule-based techniques, their application is typically restricted to structured or semi-structured corpora and requires extensive domain-specific pretraining corpora. PetroBERT, for example, focuses predominantly on geological and drilling text and exhibits performance degradation when directly applied to fracturing operational records featuring heterogeneous formats and parameter variability. In contrast, the method proposed in this study integrates a knowledge dictionary with prompt-optimized LLMs, enabling stronger adaptability to diverse document templates and implicit semantic expressions within fracturing reports. Experimental results further validate its superior performance in recognizing engineering parameters while maintaining a scalable deployment capability in real oilfield environments.
Furthermore, in terms of data storage and organization, we establish a dual-support system consisting of a structured database and a vector knowledge base. The structured database provides efficient access to essential information like well numbers, blocks, and fracturing parameters. The vector knowledge base facilitates semantic fuzzy retrieval and is instrumental in addressing complex queries. This two-tier architecture improves the flexibility of knowledge utilization and offers a reference paradigm for intelligent data management in oilfield operations.
However, this method has certain limitations. First, the effectiveness of semantic retrieval in the vector knowledge base depends heavily on the embedding model and similarity thresholds, which can lead to errors in semantically ambiguous scenarios. Second, while the intelligent Q&A system performed well during an experimental deployment in the Jimsar Basin, its response speed and concurrency capability require optimization for larger-scale, real-time applications. Additionally, the current knowledge extraction model focuses primarily on textual data and has limited capability in processing multimodal information, such as images, curves, and tables.
To further overcome these limitations and advance the industrial applicability of the proposed approach, future work will focus on the following areas: (1) Expansion to multimodal knowledge extraction, integrating table recognition, curve digitization, image feature parsing, and report-level data fusion, enabling a more comprehensive understanding of fracturing operations; (2) Optimization of system performance and high-concurrency capability through model compression, distributed inference, and efficient indexing strategies to ensure stable operation under large-scale deployment in oilfield environments; (3) Development of an adaptive semantic retrieval mechanism by refining embedding generation and similarity evaluation to reduce ambiguity-related errors.
Overall, we provide a feasible technical pathway for the intelligent governance and knowledge service of fracturing documents, validating the potential of large language models in oil and gas engineering applications. Future research could focus on cross-modal data fusion, lightweight models for deployment, and the construction of domain-specific knowledge graphs to enhance the system’s intelligence and utility.
5. Conclusions
This study proposes an integrated framework for the intelligent management of fracturing knowledge in oil and gas fields, enabling the transformation of unstructured engineering documents into structured and computable data resources. By developing an automated Python- and LangChain-based processing pipeline, 1256 GB of multi-source and heterogeneous fracturing documents were systematically standardized, effectively addressing issues of data fragmentation and inconsistency.
Through the incorporation of a domain-specific knowledge dictionary and an optimized prompting strategy, the large language model-based extraction system achieved an accuracy of 95% and a recall of 80%, demonstrating high robustness and adaptability in complex engineering text scenarios. Compared with traditional manual processing, the system improved data extraction and organization speed by more than an order of magnitude and reduced human inspection workload by over 80%, significantly enhancing the efficiency and consistency of data governance.
A dual-storage architecture comprising a structured relational database and a vector knowledge base was established to support both precise parameter retrieval and semantic search. The SQLAgent-driven natural language interface enables engineers to perform intuitive knowledge queries. The system has been successfully deployed and validated in the Jimsar Basin, with an accuracy exceeding 92% and an average response time of less than 30 s, demonstrating its practicality and scalability and significantly lowering data management costs in field operations.
In summary, the proposed approach provides a robust and scalable solution for the intelligent utilization of fracturing knowledge. It not only enhances the efficiency of document management and data retrieval but also lays a solid foundation for the digital transformation and intelligent decision-making of modern oil and gas field operations.
Author Contributions
J.L., Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing—original draft, Writing—review & editing, Visualization, and Project administration. J.P., Methodology, Software, Validation, Investigation, Resources, Data curation, Writing—original draft, and Writing—review & editing. Z.Z., Software, Validation, Investigation, and Data curation. G.L., Supervision and Project administration. M.X., Software, Validation, and Resources. X.H., Resources and Supervision. S.X., Resources and Supervision. S.T., Supervision and Project administration. T.W., Supervision and Project administration. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Xinjiang Key Laboratory of Intelligent Petroleum Exploration and Engineering (Karamay, 834000, China), the Science and Technology Program of Xinjiang Uyghur Autonomous Region Project (No. 2024B01013), and the National Natural Science Foundation of China Project (No. 52320105002).
Data Availability Statement
The data presented in this study are available on request from the corresponding author. Restrictions apply to the availability of these data due to confidentiality obligations related to industrial documents and proprietary analysis results.
Acknowledgments
The authors would like to acknowledge the National Key Laboratory of Petroleum Resources and Engineering, China University of Petroleum (Beijing), for supporting the development of this research.
Conflicts of Interest
Authors Jie Li, Zhihua Zhu, Xiaodong He, and Shengjiang Xu were employed by the Xinjiang Oilfield Company, PetroChina Company Limited. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| LLM | Large language model |
| SQL | Structured query language |
| Q&A | Question and answer |
References
- Padmanabhan, E.; Jayasangar, T.; Gamage, R.P. Digitalization in the Oil and Gas Industry. In Unconventional Methods for Geoscience, Shale Gas and Petroleum in the 21st Century; IOS Press: Amsterdam, The Netherlands, 2023; pp. 1–7.
- Halsey, T.C. A Grand Challenge: Digital Transformation for the Upstream Oil and Gas Industry. J. Pet. Technol. 2024, 76, 47–51.
- Ma, Z.; Hu, H.; Zhou, X.; Zhang, H.; Zhang, Y.; Li, G.; Tian, S.; Wang, T. Interpretable Automated Machine Learning Workflow for Intelligent Drilling in the Petroleum Industry: Case Study on Rate of Penetration Prediction. SPE J. 2025, 30, 3240–3259.
- Ma, Z.; Weng, J.; Zhang, J.; Zhang, Y.; Hao, Y.; Tian, S.; Li, G.; Wang, T. Intelligent Prediction of Rate of Penetration through Meta-Learning and Data Augmentation Synergy under Limited Sample. Geoenergy Sci. Eng. 2025, 250, 213818.
- Li, G.S.; Tian, S.C.; Sheng, M.; Wang, T.; Liao, Q. Research Progress and Prospect of Intelligent Hydraulic Fracturing Technologies. Drill. Prod. Technol. 2025, 48, 1–9. (In Chinese)
- Li, G.; Wang, T.; Li, J.; Tian, S.; Song, X.; Liu, Z.; Ma, Z. Pathways and prospects for intelligent and green development of oil and gas driven by multi-energy integration. Xinjiang Oil Gas 2025, 21, 1–13. (In Chinese)
- Zhang, Q.; Huang, V.S.J.; Wang, B.; Zhang, J.; Wang, Z.; Liang, H.; He, C.; Zhang, W. Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction. arXiv 2024, arXiv:2410.21169. Available online: https://arxiv.org/abs/2410.21169 (accessed on 11 September 2025).
- Adnan, K.; Akbar, R. Limitations of Information Extraction Methods and Techniques for Heterogeneous Unstructured Big Data. Int. J. Eng. Bus. Manag. 2019, 11, 1847979019890771.
- Chambers, N.; Jurafsky, D. Template-Based Information Extraction without the Templates. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 976–986.
- Waltl, B.; Bonczek, G.; Matthes, F. Rule-Based Information Extraction: Advantages, Limitations, and Perspectives. Jusletter IT 2018, 4, 427–436.
- Ma, Z.; Santos, J.E.; Lackey, G.; Viswanathan, H.; O’Malley, D. Information Extraction from Historical Well Records Using a Large Language Model. Sci. Rep. 2024, 14, 31702.
- Hoffswell, J.; Liu, Z. Interactive Repair of Tables Extracted from PDF Documents on Mobile Devices. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–13.
- Zadgaonkar, A.; Agrawal, A.J. An Approach for Analyzing Unstructured Text Data Using Topic Modeling Techniques for Efficient Information Extraction. New Gener. Comput. 2024, 42, 109–134.
- Dagdelen, J.; Dunn, A.; Lee, S.; Walker, N.; Rosen, A.S.; Ceder, G.; Persson, K.A.; Jain, A. Structured Information Extraction from Scientific Text with Large Language Models. Nat. Commun. 2024, 15, 1418.
- Xu, D.; Chen, W.; Peng, W.; Zhang, C.; Xu, T.; Zhao, X.; Wu, X.; Zheng, Y.; Wang, Y.; Chen, E. Large Language Models for Generative Information Extraction: A Survey. Front. Comput. Sci. 2024, 18, 186357.
- Deng, J.; Liu, G.; Wang, L.; Liang, J.; Dai, B. An Efficient Extraction Method of Journal-Article Table Data for Data-Driven Applications. Inf. Process. Manag. 2025, 62, 104006.
- Tian, H.; Liu, R.; Li, D.; You, S.; Liao, Q.; Tian, S. Applications and prospect of deepseek large language model in petroleum engineering. Xinjiang Oil Gas 2025, 21, 55–63. (In Chinese)
- Batura, T.; Yerimbetova, A.; Mukazhanov, N.; Shvarts, N.; Sakenov, B.; Turdalyuly, M. Information Extraction from Multi-Domain Scientific Documents: Methods and Insights. Appl. Sci. 2025, 15, 9086.
- Cordeiro, F.C.; da Silva, P.F.; Tessarollo, A.; Freitas, C.; de Souza, E.; Gomes, D.D.S.M.; Souza, R.R.; Coelho, F.C. Petro NLP: Resources for Natural Language Processing and Information Extraction for the Oil and Gas Industry. Comput. Geosci. 2024, 193, 105714.
- Chen, X.; Ye, J.; Zu, C. Robustness of GPT Series Large Language Models in Natural Language Processing Tasks. J. Comput. Res. Dev. 2024, 61, 1128–1142. (In Chinese)
- Lu, X.F.; Jin, T. Application and Prospects of Large Language Model Fine-Tuning Techniques in Language Analysis and Testing. Mod. Foreign Lang. 2025, 48, 413–421. (In Chinese)
- Rodrigues, R.B.M.; Privatto, P.I.M.; de Sousa, G.J.; Murari, R.P.; Afonso, L.C.; Papa, J.P.; Pedronette, D.C.; Guilherme, I.R.; Perrout, S.R.; Riente, A.F. PetroBERT: A Domain Adaptation Language Model for Oil and Gas Applications in Portuguese. In Proceedings of the International Conference on Computational Processing of the Portuguese Language, Fortaleza, Brazil, 21–24 March 2022; pp. 101–109.
- Zhao, X.; Hu, Y.; Qin, T.; Wan, W.; Wang, Y. A Domain-Specific Lexicon for Improving Emergency Management in Gas Pipeline Networks through Knowledge Fusing. Appl. Sci. 2024, 14, 8094.
- Mao, S.; Li, G.; Tian, S.; Liao, Q.; Wang, T.; Song, X. Research Status and Prospect of Artificial Intelligence in Reservoir Fracturing Stimulation. Drill. Prod. Technol. 2022, 45, 1–8. (In Chinese)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).