MIRA-ChatGLM: A Fine-Tuned Large Language Model for Intelligent Risk Assessment in Coal Mining

Sun, Yi; Zhang, Chao; Wang, Chen; Han, Ying

doi:10.3390/app142412072


School of Communication and Information Engineering, Xi’an University of Science and Technology, Xi’an 710054, China


Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(24), 12072; https://doi.org/10.3390/app142412072

Submission received: 1 December 2024 / Revised: 18 December 2024 / Accepted: 20 December 2024 / Published: 23 December 2024

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Intelligent mining risk assessment (MIRA) is a vital approach for enhancing safety and operational efficiency in mining. In this study, we introduce MIRA-ChatGLM, which applies pre-trained large language models (LLMs) to gas risk assessment in coal mines. We constructed a dataset designed specifically for mining risk analysis and performed parameter-efficient fine-tuning on a locally deployed GLM-4-9B-chat base model to develop MIRA-ChatGLM. Using consumer-grade GPUs, LoRA, and quantized fine-tuning with QLoRA at several quantization levels, we investigated the impact of different data scales and instruction settings on model performance. The evaluation results show that MIRA-ChatGLM achieved BLEU-4, ROUGE-1, ROUGE-2, and ROUGE-L scores of 84.47, 90.63, 86.88, and 90.63, respectively, on coal mine gas risk assessment. In comparative experiments with other large language models of similar size and in manual evaluation, MIRA-ChatGLM performed better across multiple key metrics, indicating its potential for intelligent mine risk assessment and decision support.

Keywords: intelligent assessment; large language models; fine-tuning; coal mine; gas risk; quantization algorithm

1. Introduction

As the global mining industry continues to expand, safety issues in mining have garnered increasing attention. The mining environment is complex and variable, with factors such as equipment failures and environmental changes potentially leading to serious safety incidents, posing significant risks to workers [1]. Consequently, risk assessment in mining has become a focal point for both industry and academia. Existing risk assessment and alarm systems often rely on fixed rules or singular data monitoring methods [2], making it challenging to address the vast amounts of multidimensional and complex data present at mining sites. Traditional methods have shown limitations in computational efficiency and accuracy, particularly when it comes to the real-time processing and analysis of heterogeneous data from multiple sources, failing to effectively implement early warning and response measures for mining risks.

The development of large language models (LLMs) presents new opportunities for intelligent risk assessment in coal mining. Trained on extensive corpora, LLMs possess powerful data processing and pattern recognition capabilities, enabling efficient analysis and judgment of complex, multi-source data [3]. By integrating LLMs with mining sensor data, we can construct intelligent risk assessment systems that significantly enhance the decision-making efficiency and accuracy of mine safety management. The proposed MIRA-ChatGLM was developed in this context, aiming to provide more precise and real-time risk assessment and early warning support for coal mining safety monitoring.

During the training process, MIRA-ChatGLM learned from over 10,000 historical alarm data points collected from coal mine safety production sensors, as well as the causes of these alarms and the corresponding on-site response measures. Leveraging its powerful data analysis capabilities, this system can identify risk signals from sensors with over 90% accuracy and, based on historical handling experiences, make preliminary judgments and propose appropriate response measures to alarms. Compared to traditional rule-based alarm systems, MIRA-ChatGLM not only automates data analysis but also possesses rapid response and dynamic adjustment capabilities, adapting to the complex and variable conditions in the mining environment.

The innovation of this study lies in the deep integration of LLMs with coal mine gas sensor data, creating an intelligent assessment model that features adaptive decision-making and anti-interference capabilities. MIRA-ChatGLM enhances the efficiency and accuracy of mining risk assessments and provides intelligent support for workers’ decision-making. The application of this system offers new ideas and tools for coal mine risk assessment and demonstrates its broad application prospects in complex industrial environments.

2. Related Work

In this section, we review the existing literature relevant to our research. We begin by discussing the current methods and technologies used for coal mine assessments, highlighting their limitations and the need for more advanced approaches. Next, we explore the latest advancements in large language models and their applications across various domains, providing a foundation for the framework we propose. This comprehensive review establishes the context and significance of our work.

2.1. Current Research Status of Intelligent Assessment Systems

Coal mine risk assessment is a critical component for ensuring the safe operation of mines, with a primary focus on monitoring gas concentrations, such as methane and carbon monoxide, and providing early risk warnings [4]. Traditional coal mine risk assessment methods often rely on rule-based expert systems, which analyze sensor data to identify potential hazards, such as elevated gas concentrations or leaks, and trigger alarms based on predefined thresholds. However, as mining conditions become increasingly complex, these methods face growing limitations in adaptability and accuracy.

In recent years, data-driven machine learning methods have been introduced into coal mine risk assessment. Traditional machine learning algorithms, such as random forests and support vector machines (SVMs) [5,6], have been employed to analyze gas concentration data, aiding in the identification of risk patterns. Meanwhile, deep learning techniques, known for their robust feature extraction capabilities, have further advanced intelligent coal mine research. For example, long short-term memory (LSTM) networks and gated recurrent units (GRUs) have been applied to time-series predictions of gas concentrations, significantly improving the accuracy of risk forecasting by capturing temporal dependencies in the data [7,8]. However, these methods primarily focus on unimodal data inputs, making it challenging to integrate multi-dimensional data—such as gas measurements, environmental parameters, and equipment states—for comprehensive risk assessment.

To overcome the limitations of single-source data, multi-dimensional data fusion techniques have gradually been introduced into coal mine risk analysis research [9]. By combining data from various sensors—such as methane, carbon monoxide, oxygen, and gas concentrations—researchers have developed more sophisticated models [10,11]. These models not only enhance the robustness of risk assessments but also improve system adaptability to complex scenarios [12]. Nevertheless, the processing of multi-dimensional data imposes higher demands on computational resources and data labeling quality. In real-world coal mining environments, complex operational conditions and limited annotated data pose significant challenges to the adoption of such methods.

Additionally, intelligent algorithms have garnered significant attention in coal mine risk assessments. For instance, You et al. utilized t-SNE to process high-dimensional gas incident data and employed genetic algorithms to optimize SVMs, achieving notable improvements in accuracy and efficiency [13]. Similarly, Du et al. developed an integrated time-series evaluation model based on multi-source dynamic detection and data fusion. This model was successfully implemented in practical coal mine settings, providing reliable and precise dynamic hazard warnings [14]. However, most existing studies focus on optimizing individual algorithms or modules, lacking a comprehensive exploration of systematic risk assessment. Such limitations may result in the omission of critical information, thereby reducing reliability in real-world applications.

Notably, with the rapid development of artificial intelligence, pre-trained large language models (LLMs) have demonstrated promising potential in coal mine risk assessment due to their powerful generalization capabilities in multimodal tasks. These models can semantically interpret multimodal data and leverage instruction learning and knowledge transfer to support complex decision-making tasks, paving the way for enhanced intelligence in coal mine operations.

2.2. Current Research Status of Large Language Models (LLMs)

In recent years, the rapid development of LLMs has significantly expanded the boundaries of natural language processing technology. These models have demonstrated tremendous potential across diverse application scenarios. For instance, the GPT-4 model has achieved notable advancements in text generation and contextual understanding [15], laying the foundation for the widespread application of pre-trained models. Simultaneously, LangChain technology has combined LLMs with knowledge graphs, greatly enhancing their reasoning and contextual processing capabilities when executing complex tasks [16]. Additionally, InstructGPT has integrated an instruction-following mechanism, substantially improving the model’s ability to interpret and execute user directives [17].

In specialized fields, the application of LLMs is deepening. In the healthcare field, Liu et al.’s research applied LLMs to the domain of traditional Chinese medicine product instructions (CPMI), developing the CPMI-ChatGLM model. This study explored the application of LoRA and P-Tuning v2 technologies through efficient fine-tuning of ChatGLM-6B, achieving significant performance improvements and demonstrating the potential of LLMs in the medical field, particularly in drug recommendations and diagnostic support [18]. In the financial industry, LLMs integrated with financial knowledge have been used to summarize contextual information in financial texts [19]. The education sector is also seeing the use of LLMs to create educational content, enhance student engagement and interaction, and personalize learning experiences [20]. Furthermore, research indicates that LLMs create unprecedented opportunities for large-scale analysis and the generation of language data, which play a central role across all areas of psychology, suggesting that LLMs could transform this field [21]. Some researchers are even using LLMs to develop multimodal models and autonomous driving systems [22], employing customized visualization instruction datasets and mixed fine-tuning strategies for autonomous vehicle applications [23].

Despite the significant achievements of LLMs and multi-agent systems in various fields, research and practice in the specific application of coal mine risk forecasting and assessment remain insufficient. Currently, gas monitoring and risk analysis in coal mines predominantly rely on manual operations, which consume substantial time and human resources and show clear limitations when processing real-time data in high-risk or complex environments. Although IoT technologies have been studied for safety monitoring, these studies often do not apply advanced LLMs to predicting and analyzing potential risks in coal production. To address this research gap, our proposed MIRA-ChatGLM combines LLMs with historical alarm data accumulated from sensors at various locations in coal mines, exploring the application potential of LLMs in risk assessment. This approach is expected to significantly enhance the efficiency and accuracy of risk assessments in complex coal production scenarios, providing more robust support for safety in mining operations.

In summary, this study makes the following contributions:

We propose a language model specific to the field of coal mine risk assessment, named MIRA-ChatGLM, marking the first LLM designed for mining risk evaluation.
We employ a high-quality risk assessment (RA) dataset for fine-tuning the base model with instructions.
A high-quality RA dataset has been constructed based on historical alarm data from gas sensors at coal production sites, encompassing various measurement types such as carbon monoxide (CO), laser methane (CH₄), carbon dioxide (CO₂), and oxygen (O₂). Our goal is to provide valuable resources for research and application in the field of mining risk assessment.

The remainder of this paper is organized as follows: Section 3 summarizes all the tools and content used in training MIRA-ChatGLM; Section 4 proposes a comprehensive evaluation system for MIRA-ChatGLM and conducts comparative experiments based on this framework; Section 5 discusses the experiments from Section 4; and the final section summarizes the study and outlines future research objectives in intelligent mining risk assessment.

3. Methods

This study presents MIRA-ChatGLM, which involves constructing a specialized dataset for coal mine risk assessment to train a dedicated LLM for this field. Utilizing the LoRA fine-tuning mechanism and instruction tuning, the model’s performance is enhanced, facilitating tasks such as coal mine risk evaluation and gas alarm response consultation. The specific framework of MIRA-ChatGLM is illustrated in Figure 1.

3.1. Base Model (GLM-4)

In the field of artificial intelligence, LLMs have been a research hotspot, especially with the GPT series and Claude series leading the development of LLM technology. However, the General Language Model 4 (GLM-4) series released by Zhipu AI (Beijing, China), with its exceptional performance and multimodal capabilities, is gradually becoming a representative of the new generation of powerful models and has gained widespread attention. GLM-4 shows significant improvements in language understanding and generation capabilities compared to its predecessors, particularly excelling in long-text processing and multimodal tasks, with performance now approaching that of GPT-4 and Claude 2.1 [24]. Therefore, we chose GLM-4-9B-chat as the base model for the coal mine risk assessment task, leveraging its strong language processing capabilities to effectively handle the complex natural language data processing demands in mine risk evaluation. The overall pipeline of GLM-4 All Tools and customized GLM (agent), as shown in Figure 2, enhances the model’s flexibility and customization across different application scenarios.

We selected GLM-4-9B-chat as the base model in this study for several key reasons. Firstly, considering the existing computational resources in the laboratory, the parameter scale of GLM-4-9B-chat is moderate, enabling it to run efficiently on two RTX 4090 GPUs with 24 GB of VRAM each (NVIDIA, Santa Clara, CA, USA). This configuration avoids the memory pressure associated with overly large models while ensuring efficient use of computational resources. Secondly, the primary language of the dataset is Chinese, and the GLM-4 series performs exceptionally well in Chinese language processing, especially in natural language generation and understanding tasks, which is crucial for the multimodal tasks in this study. Furthermore, compared to its base model, GLM-4-9B-chat has superior instruction-following capabilities, which make it well suited to complex problems that require specific task instructions and constraints and give it strong potential for customized fine-tuning.

To further improve the training efficiency and performance of the model, we combined fine-tuning algorithms such as LoRA with techniques like mixed-precision training and gradient accumulation. These methods ensure efficient model fine-tuning under limited computational resources and memory while improving training stability and effectiveness. By fine-tuning GLM-4-9B-chat, we were able to achieve intelligent risk assessment and prediction for coal mine safety, enhancing the overall effectiveness of mine safety management.

In conclusion, GLM-4-9B-chat not only meets the language processing requirements of this study but also strikes a good balance between parameter scale and computational resource demands, making it a suitable choice for this research task. Through customized fine-tuning of this model, we can effectively improve the accuracy and efficiency of coal mine risk assessment, providing stronger support for mine safety management.

3.2. Data Sources and Pre-Processing

To ensure that the MIRA-ChatGLM model can efficiently and accurately conduct a coal mine risk assessment, we meticulously constructed a high-quality coal mine risk assessment (RA) dataset and performed detailed preprocessing. The dataset is derived from gas sensors installed in critical safety production areas of the coal mine, collecting a wealth of historical alarm data. It covers various types of measurement points, including concentration data for gases such as CO, laser methane, and CO₂, as well as sensor data from locations across the mine, including ventilation shaft sensors, coal face sensors, and more. Additionally, the dataset includes different types of alarm information, such as limit exceedance alarms, sensor disconnection, and sensor calibration, ensuring comprehensive coverage of all areas within the mine. The dataset also includes the alarm causes and corresponding mitigation measures uploaded by relevant personnel after risk alarms, providing rich contextual information for the model. A sample of the main components of the dataset is shown in Table 1.

In the data preprocessing stage, we conducted two rounds of screening and cleaning of the collected data. The first round eliminated invalid and duplicate information. The alarm data, stored in Excel format, contains many incomplete entries, such as records with gas concentration values but no alarm cause or mitigation measure. It also contains redundant information, such as the name of the person who registered the alarm, and fields unrelated to the risk assessment, such as alarm time, production construction type, and production construction status. To address these issues, we manually screened and cleaned the data in the table, ensuring that the data used was closely related to risk assessment. The second round reviewed the data fields directly related to the assessment and removed records of little value. For example, we reduced the emphasis on alarm data triggered by factors such as fan switches, ensuring that the selected data was of higher quality for the assessment process. Additionally, we standardized the data format and units, unifying the gas concentration units and alarm threshold standards to ensure data consistency.
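The first screening round described above (dropping incomplete records, discarding fields unrelated to the assessment, and removing duplicates) was performed manually by the authors. As an illustration only, the same rules could be sketched programmatically as follows. The field names in `KEEP_FIELDS` and `REQUIRED_FIELDS` are hypothetical, since the actual spreadsheet columns are not listed in the text.

```python
from typing import Dict, List, Optional

# Hypothetical column names; the real spreadsheet headers are not given in the paper.
KEEP_FIELDS = ("gas_type", "location", "alarm_type", "value", "cause", "measure")
REQUIRED_FIELDS = ("alarm_type", "cause", "measure")


def clean_records(rows: List[Dict[str, Optional[str]]]) -> List[Dict[str, str]]:
    """Drop incomplete rows, irrelevant columns, and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Keep only the assessment-related columns, with whitespace stripped.
        record = {k: (row.get(k) or "").strip() for k in KEEP_FIELDS}
        # Skip rows that lack an alarm type, a cause, or a mitigation measure.
        if any(not record[k] for k in REQUIRED_FIELDS):
            continue
        # Skip rows identical to one already kept.
        key = tuple(record[k] for k in KEEP_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned
```

Fields such as the registrant's name or the alarm time are dropped simply because they are absent from `KEEP_FIELDS`. The second round, which judges data quality, still requires domain review and is not covered by this sketch.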

Furthermore, we labeled the data by classifying and annotating the alarm data uploaded by sensors and the alarm causes and fault handling measures from safety inspection records. This step helps the model better understand and learn the characteristics of mining risks. During the data labeling process, we strictly followed standards such as the “Coal Mine Safety Monitoring System and Detection Instrument Usage Management Specification AQ 1029-2019” and “Coal Mine Safety Regulations” to ensure the accuracy and consistency of the annotations. These standards provided clear guidance for our data labeling, allowing the model to more accurately identify and predict potential gas concentration risks.

The dataset we created contains 10,000 gas risk alarm records, covering four gas types: carbon monoxide, carbon dioxide, laser methane, and oxygen. Each gas type corresponds to different alarm types and their causes. From the collected data, the number of laser methane alarm records is relatively high, while oxygen alarm data is sparse, and the records for carbon monoxide and carbon dioxide alarms are moderate. After screening and processing, we selected an appropriate amount of data for each gas type, ensuring that all potential alarm situations were covered. Specifically, the dataset contains the highest proportion of sensor disconnection alarms, followed by sensor calibration alarms, with limit exceedance alarm data being less frequent.

To comprehensively simulate various scenarios for coal mine gas risk assessment, our dataset design fully considered the following factors: alarm types for different gases, installation locations of various sensors, specific alarm causes, and corresponding mitigation measures. This diversity ensures that the dataset is broadly applicable and representative.

By constructing a dataset that covers various gas types, alarm types, sensor installation locations, alarm causes, and mitigation measures, we ensured the diversity of the data, which provides rich training scenarios for the fine-tuned model. This diversity helps the model learn about different types of gas risks, alarm situations, and response strategies, thereby enhancing its generalization ability. The model can extract patterns from multiple perspectives and scenarios, preventing overfitting, and improving its reasoning and prediction abilities on unknown data. Furthermore, the comprehensiveness of the dataset also enhances the model’s ability to recognize and respond to complex coal mine gas risks, enabling it to more accurately handle different alarm situations and take appropriate mitigation measures. Therefore, this dataset design not only improves the model’s accuracy in known scenarios but also enhances its adaptability to unknown risk situations in practical applications.

Finally, we split the preprocessed dataset into training, validation, and test sets at a ratio of 8:1:1 for model training and evaluation. The training set is used for model training, the validation set is used to adjust model parameters and select the best model, and the test set is used to evaluate the model’s performance and generalization ability.
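A minimal sketch of an 8:1:1 split is given below. The shuffling step and the fixed seed are assumptions made for reproducibility of the illustration; the text does not state how the records were assigned to each subset.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T], ratios: Tuple[int, int, int] = (8, 1, 1), seed: int = 42
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the records and split them into train/validation/test subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (an assumption)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test
```

For a dataset of 10,000 records, this yields 8000 training, 1000 validation, and 1000 test records.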

Through these dataset construction and preprocessing steps, we have built a comprehensive, accurate, and representative mine risk assessment dataset, laying a solid foundation for subsequent model training and evaluation.

3.3. Dataset Construction

By fine-tuning a large language model (LLM) with machine-generated instruction data, its zero-shot ability on new tasks can be significantly improved without the need for manually writing instructions. During the construction of the RA dataset, a system prompt was first designed to guide the large language model (GLM-4V) to generate a variety of questions and answers based on gas monitoring data from different locations within the coal mine. Since our data is stored in Excel format, the prompt required the generated questions to adhere to the information in the table, ensuring the accuracy and professionalism of the answers, while avoiding repetition in the questions. Each question should include specific details, such as the coal mine name, installation location, etc. (Table 2). The generated answers must be based on the table content and provide a detailed description of the alarm type, cause, and corresponding mitigation measures. In the end, all the generated question-and-answer pairs were stored in JSON format in a list for subsequent organization and analysis. Additionally, in this process, the LangChain extraction chain was used to extract relevant information from historical gas risk assessment data, and by setting prompts, GLM-4V generated the corresponding Q&A dataset. (The dataset and related processing code have been made publicly available.)
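The generation step above can be illustrated with a small sketch that assembles a prompt from one table row and stores question-and-answer pairs as a JSON list. The system prompt wording, the row keys, and the `instruction`/`input`/`output` record layout are assumptions for illustration; the authors' actual prompt is the one referenced in Table 2, and the call to GLM-4V through LangChain is deliberately omitted.

```python
import json
from typing import Dict, List

# Hypothetical system prompt paraphrasing the constraints described in the text.
SYSTEM_PROMPT = (
    "Generate question-and-answer pairs strictly from the table row provided. "
    "Each question must name the coal mine and the sensor installation location. "
    "Each answer must describe the alarm type, its cause, and the mitigation measure. "
    "Do not repeat questions."
)


def build_generation_prompt(row: Dict[str, str]) -> str:
    """Render one table row as text and append it to the system prompt."""
    table_text = "\n".join(f"{key}: {value}" for key, value in row.items())
    return f"{SYSTEM_PROMPT}\n\nTable row:\n{table_text}"


def to_instruction_record(question: str, answer: str) -> Dict[str, str]:
    """Wrap one Q&A pair in an instruction-tuning record (layout is an assumption)."""
    return {"instruction": question, "input": "", "output": answer}


def dump_records(records: List[Dict[str, str]]) -> str:
    """Serialize the list of records as JSON, keeping Chinese characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Setting `ensure_ascii=False` keeps Chinese text human-readable in the stored file, which is convenient when the generated pairs are reviewed manually.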

3.4. Parameter-Efficient Fine-Tuning (PEFT)

With the rapid development of large language models (LLMs), the scale of model parameters has been increasing significantly. However, due to limitations in computing resources and costs, it has become difficult for regular researchers to perform full-parameter fine-tuning on consumer-grade hardware. Furthermore, since the size of the fine-tuned models remains the same as the original model, storing and deploying separate fine-tuned models for each task has become prohibitively expensive. To address these issues, the parameter-efficient fine-tuning (PEFT) method was proposed [25]. This method significantly reduces computation and storage costs by fine-tuning only a small set of additional parameters while keeping the majority of pre-trained model parameters frozen. Currently, advanced PEFT techniques such as adapter-tuning, prefix-tuning, P-tuning, and LoRA have demonstrated performance comparable to full parameter fine-tuning.

Adapter-tuning [26] works by integrating smaller neural network modules, known as adapters, into each layer of the pre-trained model. During the fine-tuning phase, only the parameters of these adapters are updated, while the original model’s parameters remain unchanged. Prefix-tuning [27], on the other hand, introduces additional trainable prefix tokens to the model’s input or hidden layers, with only these prefix tokens being fine-tuned. P-tuning [28] follows a similar approach to prefix-tuning, but instead utilizes a limited set of continuous embedding parameters as prompts to boost the performance of generative pre-trained transformers (GPT) in natural language understanding (NLU) tasks. The main distinction between the two methods is that prefix-tuning is designed for natural language generation (NLG) tasks, while P-tuning specifically targets embedding layers, without modifying other layers as prefix-tuning does.

LoRA (low-rank adaptation) [29] is an efficient method for fine-tuning large-scale pre-trained models. Whereas traditional fine-tuning updates all parameters of the model, LoRA restricts the parameter update to a low-rank subspace by introducing a pair of low-rank matrices, which lowers computational and storage overhead. During training, only the low-rank matrices are optimized, and the pre-trained weights remain frozen. For a linear layer $h = W x$, the forward propagation is replaced by the following equation:

$$h = W x + B A x \tag{1}$$

where $W \in \mathbb{R}^{d \times d}$, $B \in \mathbb{R}^{d \times r}$, and $A \in \mathbb{R}^{r \times d}$, with rank $r \ll d$. Matrix A is initialized using a random Gaussian distribution, whereas matrix B starts with all zeros, so that $B A = 0$ and only the main path $W x$ is active during the initial phase. The data flow in the forward propagation of LoRA is shown in Figure 3.
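The forward pass in Equation (1) can be sketched in a few lines of plain Python. This is a toy illustration rather than the training code used in this study: the matrix sizes, the standard deviation of 0.02 for the Gaussian initialization, and the helper names are assumptions.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def init_lora(d: int, r: int, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Return (A, B): A is r x d with Gaussian entries, B is d x r with zeros."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)]  # std is an assumption
    b = [[0.0] * r for _ in range(d)]
    return a, b


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector) -> Vector:
    """Compute h = W x + B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [p + q for p, q in zip(base, update)]


def trainable_params(d: int, r: int) -> int:
    """Number of LoRA parameters for one d x d layer: d*r for B plus r*d for A."""
    return 2 * d * r
```

Because B is zero at initialization, `lora_forward` returns exactly `W x` before any training step. For a layer with d = 4096 and r = 8, the adapter has 65,536 trainable parameters, compared with 16,777,216 in the full weight matrix.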

QLoRA [30] builds upon LoRA with further optimizations, employing quantization to reduce the memory footprint of fine-tuning (Figure 4). Specifically, QLoRA stores the frozen weight matrices of the base model in a 4-bit data type (4-bit NormalFloat, NF4, in [30]) instead of 16-bit or 32-bit floating-point numbers, while the LoRA matrices are trained in higher precision. This approach significantly decreases memory requirements, enabling the fine-tuning of large models on consumer-grade GPUs.
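The memory effect of quantization can be estimated with simple arithmetic. The sketch below assumes a nominal count of 9 × 10^9 parameters for the base model and counts only the storage of the weights themselves; quantization constants, activations, gradients, and optimizer state are ignored, so real usage is higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes) at a given bit width."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 9e9  # nominal parameter count, an assumption for illustration

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, the weights occupy about 18 GB at 16 bits, 9 GB at 8 bits, and 4.5 GB at 4 bits, which shows why quantization matters on GPUs with 24 GB of VRAM.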

In this study, we fine-tune the GLM-4-9B-chat model using the LoRA method and compare it with the QLoRA quantization fine-tuning approach using the same instruction data. The goal of the research is to develop a high-performance, low-cost, and practically applicable language model for the field of mining risk assessment to meet industry demands.

3.5. Evaluation Metrics

This study employs metrics such as bilingual evaluation understudy (BLEU) [31] and recall-oriented understudy for gisting evaluation (ROUGE) [32] to assess the degree of match between candidate texts and reference texts. We use metrics like Predict_runtime, Predict_samples_per_second, and Predict_steps_per_second to evaluate the inference speed of the fine-tuned models. These metrics enable a comprehensive assessment of the model’s performance in terms of accuracy, fluency, and information completeness, thereby guiding model improvement and optimization.

BLEU is a metric used to assess the similarity between two texts. The formula for calculating BLEU is as follows:

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{2}$$

where BP is the brevity penalty, a penalty term on the length of the generated text, defined as follows:

$$BP = \begin{cases} 1, & \text{if } c > r \\ \exp\left(1 - \frac{r}{c}\right), & \text{if } c \leq r \end{cases} \tag{3}$$

In Equation (3), c is the length of the generated text and r is the length of the reference text. The n-gram precision $p_n$ in Equation (2) is computed as follows:

$$p_n = \frac{\sum_{i=1}^{E} \sum_{k=1}^{K} \min\left(h_k(c_i), \max_{1 \leq j \leq M} h_k(s_{i,j})\right)}{\sum_{i=1}^{E} \sum_{k=1}^{K} h_k(c_i)} \tag{4}$$

where E denotes the total number of candidate texts and K denotes the total number of word groups (n-grams). $h_k(c_i)$ denotes the frequency of occurrence of the kth word group in the candidate text $c_i$. $s_{i,j}$ denotes the jth reference answer for the ith candidate, where $j \in \{1, \ldots, M\}$ and M denotes the number of reference answers. $h_k(s_{i,j})$ denotes the frequency of occurrence of the kth word group in the standard answer $s_{i,j}$. The maximum over the references clips each candidate count at the largest count found in any single reference.

In Equation (2), $w_n$ is the weight. Equal weights are usually used for each n-gram order, $w_n = \frac{1}{N}$, where N is the order of the largest n-gram (e.g., N = 4 covers 1-grams to 4-grams).
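Equations (2) to (4) can be sketched for a single candidate text as follows. This is an unsmoothed illustration, not the evaluation script used in this study: the function names are hypothetical, the input is assumed to be already tokenized (for Chinese text, for example, into characters), and with several references r is taken as the reference length closest to c.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: Sequence[str], references: List[Sequence[str]], max_n: int = 4) -> float:
    """Unsmoothed BLEU for one candidate with equal weights w_n = 1/N."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate too short for this n-gram order
        # Largest count of each n-gram in any single reference (the max in Eq. 4).
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clipped matches: the numerator of Eq. 4.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # log(0) is undefined; no smoothing is applied here
        log_precision_sum += (1.0 / max_n) * math.log(clipped / total)
    c = len(candidate)
    # Reference length closest to c; ties go to the shorter reference (an assumption).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty, Eq. 3
    return bp * math.exp(log_precision_sum)
```

Clipping matters when a candidate repeats a token: for the candidate "the the the the the the the" and the reference "the cat is on the mat", the unigram precision is 2/7 rather than 7/7, because "the" occurs only twice in the reference.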

ROUGE (recall-oriented understudy for gisting evaluation) is a metric used to evaluate the quality of automatic text summarization, machine translation, and text generation. It primarily assesses quality by comparing the overlap between system-generated texts and reference texts. Unlike BLEU, ROUGE focuses more on recall, which is the proportion of words or n-grams in the reference text that also appear in the generated text. The calculation formula for ROUGE is as follows:

$$ROUGE\text{-}N = \frac{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count(gram_n)} \tag{5}$$

where N denotes the length of the n-grams, and $Count_{match}(gram_n)$ represents the maximum number of times the n-gram occurs in both the candidate text and the reference text. ROUGE-1 evaluates single-character matches, ROUGE-2 focuses on two-character matches, ROUGE-L identifies the longest common subsequence, and so on.
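A minimal sketch of the recall form in Equation (5) for a single reference, together with a recall-style ROUGE-L based on the longest common subsequence, is given below. The function names are hypothetical and the input is assumed to be tokenized already. Common evaluation toolkits often report an F-score that combines precision and recall, so their values can differ from this recall-only sketch.

```python
from collections import Counter
from typing import Sequence


def rouge_n_recall(candidate: Sequence[str], reference: Sequence[str], n: int) -> float:
    """Matched reference n-grams divided by all reference n-grams (Eq. 5)."""
    def grams(tokens: Sequence[str]) -> Counter:
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts = grams(reference)
    cand_counts = grams(candidate)
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Count_match: each reference n-gram is matched at most as often as it
    # appears in the candidate.
    matched = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return matched / total


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0
```

For the reference tokens a, b, c, d and the candidate tokens a, b, x, d, the sketch gives a ROUGE-1 recall of 3/4, a ROUGE-2 recall of 1/3, and a ROUGE-L recall of 3/4.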

Predict_runtime represents the total time taken by the model to generate a batch of samples; Predict_samples_per_second indicates the number of samples the model can generate per second, commonly used to evaluate the model’s inference speed; and Predict_steps_per_second denotes the number of steps the model can execute per second.
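The relationship between the three speed metrics can be sketched as follows. The helper `measure_prediction_speed` and the `generate_batch` callable are hypothetical stand-ins for the model's inference routine, and treating one batch as one step is an assumption.

```python
import time
from typing import Callable, Dict, List, Sequence


def measure_prediction_speed(
    generate_batch: Callable[[Sequence[str]], List[str]],
    prompts: Sequence[str],
    batch_size: int,
) -> Dict[str, float]:
    """Time batched generation and derive throughput from the total runtime."""
    start = time.perf_counter()
    steps = 0
    for i in range(0, len(prompts), batch_size):
        generate_batch(prompts[i:i + batch_size])
        steps += 1  # one batch is counted as one step (an assumption)
    runtime = max(time.perf_counter() - start, 1e-9)  # guard against division by zero
    return {
        "predict_runtime": runtime,
        "predict_samples_per_second": len(prompts) / runtime,
        "predict_steps_per_second": steps / runtime,
    }
```

Since both throughput values share the same runtime, their ratio equals the average number of samples per step, so a larger batch size raises samples per second only if the per-step time grows more slowly than the batch.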

4. Experiments and Results

4.1. Impact of Dataset Size

Our study investigates the impact of dataset size on the performance of the MIRA-ChatGLM model, particularly across eight different data scales, using LoRA and various quantization levels (8-bit and 4-bit) for fine-tuning. These datasets originate from historical alarm information uploaded by gas sensors during the coal mining production process. The performance comparisons of the MIRA-ChatGLM fine-tuned under different data scales and methods are illustrated in Figure 5.

Generally, model performance improves with increasing data volume [33,34,35]. This is because large-scale data provides richer information, helping the model to learn features and patterns more effectively while reducing the risk of overfitting. Our research confirms this: as the data volume increases, the MIRA-ChatGLM model learns more data patterns and features, and its performance improves significantly up to approximately 3000 data points. However, as shown in Figure 5, when the data volume exceeds 3000, the model's ability to benefit from new data diminishes, and the performance indicators change slowly despite the continued increase in training data. Specifically, the performance indicators of models fine-tuned with 5000, 7000, and 10,000 data points are nearly identical. This phenomenon may stem from information redundancy, model capacity limitations, and diminishing marginal returns. As the data volume increases, the effective information provided by additional data decreases; the model has already mastered most of the important features, and excessive data may introduce noise, leading to a gradual reduction or saturation of performance gains.

When fine-tuning with QLoRA using 8-bit quantization, we observed a slight decline in performance indicators as the data volume increased from 500 to 700, followed by a steady improvement as the data volume continued to rise to 1000. We hypothesize that this performance fluctuation may be related to the variability in results when evaluating quantized models, especially in cases with smaller dataset sizes, where this variability significantly impacts the comparative performance of the quantized model. This variability may arise from various factors, including the diversity of samples in the dataset, randomness during model training, and the model’s sensitivity to specific data. To mitigate the impact of this variability on performance comparisons, we implemented measures such as conducting multiple repeated experiments and reporting average performance and standard deviation, as well as employing techniques like cross-validation to more accurately estimate model performance on limited data.
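The two mitigation measures named above, reporting the mean and standard deviation over repeated runs and using cross-validation, can be sketched as follows. This is an illustrative outline using only the Python standard library; the function names and the example scores are hypothetical, and the exact protocol used in the experiments is not specified in the text.

```python
import statistics


def summarize_runs(scores):
    """Mean and sample standard deviation of one metric over repeated runs."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# Hypothetical scores from three repeated fine-tuning runs with different seeds.
run_scores = [80.0, 82.0, 84.0]
mean, std = summarize_runs(run_scores)
print(f"{mean:.2f} +/- {std:.2f}")

# Split a hypothetical 10-sample dataset into 3 folds.
folds = list(k_fold_indices(10, 3))
```

In practice the samples would be shuffled before being split into folds, so that each fold reflects the overall mix of alarm types.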

Specifically, the base model GLM-4-9B-chat without fine-tuning demonstrates significant limitations in processing alarm data assessments, analyzing alarm causes, and providing relevant mitigation measures. It can only offer general information regarding common situations of gas alarm events in coal mines and general response measures, failing to conduct in-depth risk assessments or propose specialized handling methods for the specific alarm data uploaded by sensors.

After fine-tuning on small-scale datasets (particularly those composed of 300, 500, and 700 data points), the MIRA-ChatGLM model is generally able to conform to the required format when answering questions and can accurately identify alarm data. However, its ability to analyze alarm signals precisely, determine the causes of alarms, and provide corresponding mitigation measures still requires further enhancement. This shortcoming may be attributed to the small dataset size and insufficient sample diversity, which lead to high variability in the model’s training results. If the samples in the dataset do not comprehensively represent all scenarios of the target task, the performance of a model trained on these limited samples may not accurately reflect its true performance on larger datasets.

When the dataset reaches a sufficient scale, specifically about 3000 data points or more, the model can learn more features and patterns. Its assessments then align more closely with the actual causes of alarm events, and the mitigation measures it proposes become more accurate. This indicates that the model’s performance and generalization capability improve significantly as the data volume increases.

4.2. LoRA Versus QLoRA

On a server equipped with two RTX 4090 (48 GB) GPUs, we evaluated three parameter-efficient fine-tuning (PEFT) methods to investigate their impact on the performance of the MIRA-ChatGLM model. We utilized the Rouge_chinese [32] and NLTK [36] toolkits to compute ROUGE and BLEU scores and assessed the model’s Predict_runtime, Predict_samples_per_second, and Predict_steps_per_second using the evaluation module in the LLaMA-Factory framework [37].
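To illustrate what the ROUGE scores measure, the following sketch computes ROUGE-N and ROUGE-L F1 values from token lists in plain Python. It is not the implementation used by the toolkits cited above and may differ from them in tokenization and smoothing details; in particular, Chinese text would normally be word-segmented first, whereas this toy example simply splits English text on whitespace. All function names and the example strings are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, hyp_total, ref_total):
    """Harmonic mean of precision (overlap/hyp) and recall (overlap/ref)."""
    if overlap == 0 or hyp_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(hyp, ref, n):
    """ROUGE-N F1: clipped n-gram overlap between hypothesis and reference."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    overlap = sum((hyp_counts & ref_counts).values())
    return f1(overlap, sum(hyp_counts.values()), sum(ref_counts.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(hyp, ref):
    """ROUGE-L F1 based on the longest common subsequence."""
    return f1(lcs_length(hyp, ref), len(hyp), len(ref))


reference = "check the sensor connection".split()
hypothesis = "check the sensor".split()
print(rouge_n(hypothesis, reference, 1), rouge_n(hypothesis, reference, 2), rouge_l(hypothesis, reference))
```

ROUGE-1 and ROUGE-2 reward shared words and word pairs, while ROUGE-L rewards shared word order, which is why the three scores are reported side by side.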

As shown in Figure 6, when fine-tuning with smaller datasets (fewer than 3000 entries), the models exhibited varying performance on BLEU and ROUGE metrics; however, the overall results were suboptimal. Conversely, when the data volume reached 3000 entries, the MIRA-ChatGLM models fine-tuned using the three different methods showed consistent performance on the BLEU metric (and similarly for ROUGE), with relatively high scores. As the data volume continued to increase, improvements in model performance became gradual, while the required hardware resources and training time continued to rise. At approximately 10,000 data points, models that were fine-tuned using all three methods achieved optimal performance across all metrics.

To evaluate the performance of the different models more comprehensively, we used three key metrics, Predict_runtime, Predict_samples_per_second, and Predict_steps_per_second, to assess the inference speed of the fine-tuned models. The specific data are presented in Table 3. The observations indicate that, with comparable text generation quality and accuracy, inference speed differed significantly depending on whether the model was fine-tuned with quantization. In particular, models fine-tuned with LoRA required the shortest time to generate a batch of samples. In contrast, the 4-bit quantized QLoRA method took 35% longer than LoRA, while the 8-bit quantized QLoRA method took nearly 60% longer than LoRA. LoRA fine-tuning also led in the number of samples generated per second and the number of steps executed per second. These data indicate that LoRA fine-tuning not only maintains the quality of generated text but also preserves the model’s inference efficiency. The quantized fine-tuned models exhibited a loss in inference speed, and in these measurements the slowdown was larger with 8-bit than with 4-bit quantization. Although QLoRA reduces the memory footprint of the model, its inference speed is significantly lower than that of LoRA fine-tuning. In practical applications, this suggests that LoRA fine-tuning may be the better choice when efficient inference is the priority, while QLoRA is more suitable for resource-sensitive scenarios that require a smaller memory footprint.
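The percentages quoted above are relative slowdowns with respect to the LoRA runtime. A small worked example of that arithmetic is given below; the runtimes are placeholder values chosen to reproduce the stated 35% and 60% figures and are not the measurements reported in Table 3, and both function names are hypothetical.

```python
def relative_slowdown(baseline_runtime, other_runtime):
    """Percentage by which other_runtime exceeds baseline_runtime."""
    return (other_runtime - baseline_runtime) / baseline_runtime * 100.0


def relative_throughput(slowdown_percent):
    """Throughput as a fraction of the baseline, for the same workload."""
    return 1.0 / (1.0 + slowdown_percent / 100.0)


# Placeholder runtimes in seconds (NOT the values from Table 3).
lora_runtime = 100.0
qlora_4bit_runtime = 135.0
qlora_8bit_runtime = 160.0

for name, runtime in [("4-bit", qlora_4bit_runtime), ("8-bit", qlora_8bit_runtime)]:
    slowdown = relative_slowdown(lora_runtime, runtime)
    print(f"{name}: {slowdown:.0f}% longer, {relative_throughput(slowdown):.2f}x throughput")
```

Note that a runtime that is 60% longer corresponds to roughly 0.63 times the baseline throughput, not 0.40 times, because throughput is the reciprocal of runtime.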

We randomly provided the model with a common gas alarm dataset and inquired about the reasons for the alarm and corresponding measures from the MIRA-ChatGLM fine-tuned on a 3 k dataset. Table 4 presents the responses of the models that were fine-tuned using the three methods.

Experimental results indicate that, under the same data volume, the model fine-tuned with LoRA learns more deeply from the dataset, which enhances its professionalism and accuracy in answering questions. In contrast, the quantized fine-tuned models may contain more redundant information in their responses, and the proposed handling measures may show some discrepancies. As shown in Table 4, when the CO gas concentration is 0 ppm, a sensor disconnection is the reason for the alarm, and restoring normal operation as per protocol is the corresponding handling measure. The model fine-tuned with LoRA accurately provided this assessment; the 8-bit quantized model identified the risk signal and suggested professional handling measures, but also generated some redundant information. Meanwhile, the 4-bit quantized model recognized the risk signal but failed to offer professional handling measures and produced redundant information. Notably, as quantization became more aggressive (from 8-bit to 4-bit), the differences between the fine-tuned model’s inference output and the standard assessment results became more pronounced, a factor that warrants further investigation in our future work.

Despite these differences, all three fine-tuned models were able to identify the alarm signals based on the information provided and assess possible alarm causes and related handling measures. As the training data volume increased, the quantized models exhibited more standardized and accurate responses. This indicates that a larger data volume contributes to enhancing the accuracy and reliability of responses from quantized models.

4.3. Impact Analysis of Instructional Data in Model Fine-Tuning

In instruction tuning, a form of supervised fine-tuning (SFT) for large language models, the instruction component plays a pivotal role by defining the specific objectives of a task, the way inputs should be processed, and the desired output format. By clearly specifying task requirements, the instruction guides the model on the operations to perform, such as classification, text generation, or comparison. It also standardizes how input data is handled and determines the format in which outputs should be presented. For example, in a comparison task, the instruction might direct the model to compare based on specific features and express the results in sentence form. Instructions not only help the model understand the precise requirements of a task but also steer it toward generating outputs that align with expected formats, thereby enhancing generation quality and training efficiency. Moreover, the instruction component provides supervision signals during the fine-tuning process, enabling the model to produce accurate content based on the instructions. This ultimately improves the model’s performance and adaptability to specific tasks [38].

In this study, we conducted an ablation experiment on the instruction dataset of the MIRA model. Specifically, we removed the instruction component from the dataset while keeping the hyperparameters and LoRA fine-tuning method unchanged, and fine-tuned the base model under the condition of 3 k data points. By comparing the performance of instruction-tuned and non-instruction-tuned models, we found that instruction tuning can enhance the performance of the MIRA-ChatGLM model to some extent, improving the accuracy and consistency of the generated text.

According to the experimental results presented in Table 5, we observed that instruction tuning led to an approximate 5% improvement in the performance of the MIRA-ChatGLM model. This finding indicates that instruction tuning aids large-scale models in gaining a deeper understanding of the input context and user intent, thereby generating text that is more accurate and aligned with expectations. Thus, instruction tuning is an effective means of enhancing model quality and efficacy.
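The ablation can be pictured as blanking the instruction field of every training record while leaving the input and target untouched. The sketch below assumes an Alpaca-style record with `instruction`, `input`, and `output` keys; the text does not specify the dataset schema, so this layout, the helper names, and the example record (including the sensor location) are hypothetical.

```python
import copy


def drop_instruction(records):
    """Return a copy of the dataset in which every instruction is blanked out."""
    ablated = copy.deepcopy(records)
    for record in ablated:
        record["instruction"] = ""
    return ablated


def build_prompt(record):
    """Join the non-empty instruction and input parts into one prompt string."""
    parts = [record["instruction"], record["input"]]
    return "\n".join(part for part in parts if part)


# Hypothetical training record; field contents are illustrative only.
dataset = [
    {
        "instruction": "Assess the risk of the following gas alarm, explain the likely cause, "
                       "and recommend handling measures.",
        "input": "Measurement point type: CO; location: return airway; maximum alarm value: 0 ppm",
        "output": "Likely cause: sensor disconnection. Measure: restore normal operation as per protocol.",
    }
]

ablated_dataset = drop_instruction(dataset)
print(build_prompt(dataset[0]))
print(build_prompt(ablated_dataset[0]))
```

Because only the instruction text changes between the two variants, any difference in the resulting scores can be attributed to the instruction component rather than to the data or hyperparameters.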

4.4. Performance Comparisons with Alternative Models

To demonstrate the superior performance of MIRA-ChatGLM and ensure fairness in the experiments, we conducted a comparative analysis with three widely used models trained on Chinese pre-training corpora.

Bloom-7B [39]: Developed and released by the BigScience community, this is a large-scale transformer-based language model trained on a wide range of pre-training data, including internet text, specialized books, and diverse code collections.
Qwen-7B [40]: A 7-billion parameter model from Alibaba Cloud’s Tongyi Qianwen series.
Baichuan-7B [41]: An open-source, commercially usable large-scale pre-trained language model developed by Baichuan Intelligence. This 7-billion parameter model, based on the Transformer architecture, was trained on approximately 1.2 trillion tokens, supports both Chinese and English, and has a context window length of 4096.

The experimental results in Table 6 show that MIRA-ChatGLM outperformed other models of similar size. When considering the training data, Qwen-7B and Baichuan-7B, which were trained on a larger corpus of Chinese text, performed better than Bloom-7B, which primarily used English text. Additionally, Baichuan-7B exhibited superior performance in instruction fine-tuning compared to Qwen-7B.

4.5. Expert Assessment

Although automatic evaluation metrics are useful in assessing the performance of large language models (LLMs), manual evaluation is still essential to ensure the model’s effectiveness in terms of safety, expertise validation, flexibility, adaptability, and ethical considerations. Since expert judgment remains the primary standard in coal mine risk assessment, human evaluation plays a crucial role in verifying the reliability of model outcomes. For the coal mine risk assessment task, we employed the SUS evaluation method [42], which scores three aspects: safety, usability, and smoothness. “Safety” examines whether the model-generated content could lead to potential safety hazards, “usability” evaluates the relevance of the content to the risk assessment results and measures, and “smoothness” assesses the fluency of the generated text. The SUS scoring system uses a scale of 1 (unacceptable) to 3 (good), with 2 indicating acceptable performance. To assess MIRA’s performance, we asked five coal mine risk assessment experts to score 100 randomly selected cases related to alarm signal evaluation. Table 7 shows the average SUS scores and their 95% confidence intervals.
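As an illustration of how an average score and its 95% confidence interval can be obtained from expert ratings, the sketch below uses the normal approximation (mean plus or minus 1.96 standard errors). The text does not state which interval method was used, so this choice is an assumption, and the ratings shown are hypothetical rather than data from the expert study.

```python
import math
import statistics


def mean_with_ci95(scores):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    n = len(scores)
    mean = statistics.mean(scores)
    if n < 2:
        return mean, 0.0
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


# Hypothetical usability ratings on the 1 (unacceptable) to 3 (good) scale.
usability_ratings = [3, 2, 3, 3, 2, 3, 3, 2, 3, 3]
mean, half_width = mean_with_ci95(usability_ratings)
print(f"usability: {mean:.2f} +/- {half_width:.2f}")
```

With only a handful of ratings, a Student's t interval would be somewhat wider than this approximation; with several hundred ratings, the two are nearly identical.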

Compared to other open-source models with similar parameter counts, the content generated by Qwen-7B contained more redundant information, which significantly impaired its usability and fluency and led to lower usability scores. In contrast, our MIRA-ChatGLM model not only ensured the safety and reliability of the results but also significantly enhanced the usability of gas alarm data in the coal mining field.

4.6. Fine-Tuning Parameter Configurations and Associated Costs

We successfully developed MIRA-ChatGLM by fine-tuning GLM-4-9B-chat using PEFT methods. Below are the detailed hyperparameter settings for fine-tuning GLM-4-9B-chat with LoRA and QLoRA algorithms.

For LoRA fine-tuning, training was conducted for 2.5 h on two RTX 4090 (48 GB) GPUs, and 4.2 h on a single RTX 4090T (24 GB) GPU. The number of training epochs (train_epochs) was set to 8, with a batch size of 2. The initial learning rate for the AdamW optimizer was set to 2 × 10⁻⁵, using bf16 for computational precision, with a maximum gradient norm of 1.0. The total training steps were 6000, and the maximum sequence length was set to 1024. No special warmup or weight decay strategies were applied. Additionally, the rank of the low-rank matrix was set to 8, with a scaling factor of 16 and a dropout rate of 0.1. Figure 7 shows the loss curve during the LoRA fine-tuning process. As observed, the loss decreases rapidly at the initial stage, dropping from a relatively high value, which indicates that the model effectively captures the features of the data during this phase. With the increase in training steps, the loss gradually stabilizes and approaches zero, demonstrating that the model has progressively converged and the training process is stable. The smoothed curve further highlights the overall downward trend, showing no significant oscillations or upward spikes, which suggests that the model has been optimized effectively without signs of overfitting. This indicates that the model has achieved satisfactory learning performance on the training dataset.
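To give a sense of what a rank of 8 and a scaling factor of 16 imply, recall that LoRA replaces the update of a weight matrix W with a low-rank product scaled by alpha/r, that is, W + (alpha/r)·B·A. The sketch below counts the trainable parameters this introduces for one matrix. The rank and scaling factor are the values reported above; the 4096 × 4096 layer shape is a hypothetical example and is not taken from the model configuration.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def lora_scaling(alpha, rank):
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / rank


rank, alpha = 8, 16      # values reported in the text
d_out = d_in = 4096      # hypothetical layer shape, for illustration only

full_params = d_out * d_in
low_rank_params = lora_trainable_params(d_out, d_in, rank)

print(f"full matrix: {full_params:,} parameters")
print(f"LoRA factors: {low_rank_params:,} parameters "
      f"({100 * low_rank_params / full_params:.2f}% of the full matrix)")
print(f"update scaling alpha/r: {lora_scaling(alpha, rank)}")
```

For a matrix of this shape, the adapter trains well under one percent of the parameters that full fine-tuning would update, which is what makes training on consumer-grade GPUs feasible.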

In the QLoRA fine-tuning with two quantization levels, the batch size was also set to 2, using the default learning rate decay strategy of the AdamW optimizer with a learning rate of 2 × 10⁻⁵ and bf16 precision. Gradient accumulation was performed every 16 steps, with a maximum source length of 32 and a maximum sequence length of 1024. Similar to LoRA, the total training steps were 6000, without warmup or weight decay. The quantization implementation from the bitsandbytes library was used, with training conducted at both 4-bit and 8-bit quantization levels. The 4-bit QLoRA fine-tuning was completed in 5.5 h on a single RTX 4090T (24 GB) GPU and 3 h on two RTX 4090 (48 GB) GPUs, significantly reducing the hardware burden for most researchers. Table 8 presents the training durations for different fine-tuning methods.
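Gradient accumulation changes how many samples contribute to each optimizer update. The short calculation below shows the resulting effective batch size, assuming the reported batch size of 2 applies per device and per forward pass; the text does not state this explicitly, so that reading, along with the function name, is an assumption.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples contributing to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


# Batch size 2 and accumulation over 16 steps, as reported in the text.
single_gpu = effective_batch_size(per_device_batch=2, grad_accum_steps=16)
dual_gpu = effective_batch_size(per_device_batch=2, grad_accum_steps=16, num_devices=2)
print(single_gpu, dual_gpu)
```

Accumulating gradients in this way gives the smoother updates of a larger batch without holding that batch in GPU memory at once, which suits the memory-constrained setting QLoRA targets.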

From the training duration, it can be observed that under the same conditions, the training time for models with quantization is relatively longer, yet the performance of the fine-tuned models is similar across all metrics. In terms of memory and resource consumption, QLoRA outperforms LoRA, especially at lower bit widths (such as 4-bit). Regarding time overhead, LoRA may be faster in some scenarios due to the absence of quantization operations, though this depends on hardware support and model complexity.

Thus, in memory-constrained environments, QLoRA is the ideal choice. By employing 4-bit quantization, QLoRA significantly reduces memory usage, enabling large model fine-tuning on resource-limited hardware. If computational resources and memory are relatively abundant, and higher model accuracy is required, then LoRA is the better option. LoRA excels in maintaining high accuracy while accelerating the fine-tuning process, particularly for medium to small-scale models. LoRA retains higher computational precision, suitable for tasks demanding high accuracy, while QLoRA strikes a good balance between accuracy and resource consumption, making it ideal for scenarios involving large models with limited hardware resources. Both have their advantages, catering to different application needs: LoRA leans towards high-precision environments, while QLoRA demonstrates superiority in large model applications with limited resources.
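The memory argument can be made concrete with a back-of-the-envelope estimate of the space needed just to hold the base model weights at each precision. The sketch below assumes a nominal 9 billion parameters for a "9B" model and ignores activations, adapter weights, optimizer states, and quantization metadata, so the figures are rough lower bounds rather than measured usage.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 9e9  # nominal parameter count for a "9B" model (assumption)

for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions, the bf16 weights alone occupy most of a 24 GB card, whereas the 4-bit weights leave ample headroom, which is consistent with QLoRA being the practical option on a single consumer GPU.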

Additionally, this study employed supervised fine-tuning methods to train three comparison models. To ensure a fair comparison, we maintained the LoRA parameters consistent with those used in the development of MIRA-ChatGLM, and all models were set to bf16 computational precision. For training the Bloom-7B model, we configured a batch size of 10 and a maximum input sequence length of 512. We selected the AdamW optimizer and set its initial learning rate to 1 × 10⁻⁵. To adjust the learning rate, we employed a cosine learning rate scheduler. Furthermore, to enhance training stability, we performed gradient accumulation every 8 steps. For the Qwen-7B and Baichuan-7B models, we adjusted the training batch size to 8. Similarly, we chose AdamW as the optimizer and set its initial learning rate to 1 × 10⁻⁴. The learning rate scheduler for these models also utilized the cosine strategy. Like the Bloom-7B model, gradient accumulation for the Qwen-7B and Baichuan-7B models was performed every 8 steps. Through these training configurations, we aimed to ensure that each model could be trained in a fair environment, allowing for an accurate assessment of their performance on the task.
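The cosine learning rate strategy mentioned above lowers the learning rate smoothly from its initial value toward a minimum over the course of training. A minimal sketch of the standard schedule follows; it assumes no warmup phase and a minimum of zero, and the total step count is a hypothetical value, since none of these details are given for the comparison models.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine-annealed learning rate at a given step (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


BASE_LR = 1e-4        # initial learning rate reported for Qwen-7B and Baichuan-7B
TOTAL_STEPS = 1000    # hypothetical number of optimizer steps

for step in (0, 250, 500, 750, 1000):
    print(step, f"{cosine_lr(step, TOTAL_STEPS, BASE_LR):.2e}")
```

The schedule starts at the initial learning rate, passes through half of it at the midpoint, and approaches the minimum at the end, with the slowest changes at the beginning and end of training.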

5. Discussion

In the field of intelligent risk assessment for coal mines, traditional risk evaluation primarily relies on expert-driven manual analysis and judgment. This process is not only time-consuming and labor-intensive but also heavily dependent on the expertise of individuals, making it susceptible to subjective biases. Furthermore, manual assessments struggle to provide rapid responses and handle large-scale data, failing to meet the efficiency demands of modern mine safety management. With advancements in technology, existing large-scale models have begun to extract environmental and risk factors from input data to improve assessment accuracy. However, these models often lack the ability to generate detailed risk management and emergency response recommendations. Mine managers and safety personnel are more concerned with obtaining specific countermeasures and management suggestions for each risk factor, as these insights are critical for enhancing mine safety and preventing accidents.

However, current intelligent technologies in the coal mining domain are primarily focused on risk prediction and early warning, rather than direct risk assessment. As shown in Table 9, a comparison with existing methods highlights the unique advantages of the proposed MIRA-ChatGLM method in several aspects. Compared to traditional expert systems that rely on rules and risk evaluation methods based on a single data type, MIRA-ChatGLM is more effective at handling multidimensional data, integrating factors such as gas concentration, sensor location, and sensor equipment status, thereby providing more accurate and comprehensive risk assessments. Additionally, MIRA-ChatGLM employs a large language model, fine-tuned with instructions, which efficiently processes complex multisource data, reduces dependence on large volumes of labeled data, and significantly enhances both the efficiency and accuracy of the evaluation. While existing methods can still provide certain risk assessments in specific scenarios, they have limitations when dealing with complex mining environments, multidimensional data fusion, and limited labeled data. Therefore, MIRA-ChatGLM demonstrates greater adaptability and practical application potential in intelligent risk assessment, decision support, and emergency response scenarios.

In order to meet the demand for intelligent risk assessment in coal mining operations, we have integrated various data sources, including mining safety protocols, alarm data collected from on-site sensors, alarm causes, and disposal measures, to construct a comprehensive dataset. By training on a large volume of historical alarm data, the model is able to accurately assess risks based on information provided by the sensors, such as measurement point types, locations, and maximum alarm data. It can also analyze the causes of alarms and propose corresponding disposal suggestions. The primary objective of this study is to explore the practical application of large language models in coal mine risk assessment. By fine-tuning the GLM-4-9B-chat base model, we developed the MIRA-ChatGLM, which is capable of automatically generating mining risk assessment results and providing detailed response strategies for specific scenarios. The development of MIRA-ChatGLM fills the gap in the application of large language models in the coal mine risk assessment field.

Compared to traditional manual assessment methods, MIRA-ChatGLM significantly improves assessment efficiency, ensures the consistency and professionalism of results, and reduces human error. This approach assists mine managers and safety personnel in obtaining more comprehensive and accurate risk information, thereby providing strong support for the development of efficient safety management plans and emergency response strategies. MIRA-ChatGLM, trained on data from specific mines, is capable of offering personalized risk suggestions based on the unique conditions of different mines, further enhancing the effectiveness of safety management and the timeliness of responses. To ensure that MIRA-ChatGLM can be effectively applied at other mine sites, it is necessary to consider the specific conditions of each mine. The model’s ability to provide personalized risk suggestions depends on the unique characteristics of the mine, such as environmental conditions, historical risk data, etc. When applying MIRA-ChatGLM to new mine sites, fine-tuning with local data may be required, including sensor readings, environmental factors, and risk patterns unique to that mine. This approach enables the model to adapt flexibly to different conditions, ensuring the relevance and accuracy of the risk assessment. Additionally, the model can be periodically retrained, incorporating new data to continuously improve the accuracy of risk suggestions as conditions change.

During the fine-tuning process of the model, we employed LoRA and QLoRA methods from PEFT. Experimental results indicate that when the dataset size is small (fewer than 3000 entries), the model’s assessment performance is suboptimal. However, when the data volume reaches approximately 3000, the performance of the MIRA-ChatGLM model significantly improves, demonstrating excellent performance on evaluation metrics and achieving accurate and professional assessment results. Nevertheless, as the data volume continues to increase, the rate of performance improvement slows down, which may be related to data diversity. In comparing the fine-tuning processes of LoRA and QLoRA, we found that although LoRA consumes more hardware resources, the training time is significantly reduced, and the model exhibits faster inference speed with stronger professional response capabilities. Conversely, models fine-tuned with QLoRA take longer during training but require relatively fewer hardware resources, making it more manageable for most researchers. Additionally, we investigated the impact of the instruction data on model performance, and through a comprehensive analysis of automated metrics and manual evaluations, we validated the superiority of MIRA-ChatGLM.

Despite the significant achievements of this study, there are also some limitations. Firstly, the base model is relatively small, and the limited amount of training data may lead to some errors in the content generated. To address this, we plan to use larger base models in future experiments and incorporate manual evaluations to further improve the accuracy of the generated content. Secondly, we hope to expand the application scope of the model from the coal mining safety domain to other risk assessment scenarios and mitigate performance fluctuations caused by data diversity. Moreover, we will explore integrating image information (such as images of mining environments) into the model to achieve multimodal risk assessment, further enhancing the model’s accuracy and practicality.

6. Conclusions

In summary, we have successfully developed a novel large-scale model—MIRA-ChatGLM—in the field of intelligent risk assessment for mining, exploring innovative pathways for the deep integration of mining risk evaluation and artificial intelligence. The introduction of this model not only injects new vitality into traditional risk assessment methods but also provides the industry with more scientific and efficient solutions. By leveraging PEFT methods, we systematically fine-tuned the instruction data, achieving significant performance improvements over the base model GLM-4-9B-chat, which demonstrates the effectiveness and forward-looking nature of our approach. Additionally, we conducted an in-depth study of the specific impacts of different fine-tuning methods, data scales, and instruction data on MIRA-ChatGLM’s performance, aiming to identify the optimal combination to enhance the model’s applicability.

During this process, we also constructed a dedicated dataset for mining risk assessment, designed to provide substantial support for mining safety management and intelligent decision-making. This dataset not only helps decision-makers promptly identify potential risks but also provides a data foundation for subsequent safety measures. Currently, the MIRA-ChatGLM project is still in its early stages and may have some shortcomings in terms of assessment professionalism and other aspects. To address this, we are actively collaborating with mining enterprises and safety experts to seek feedback and suggestions, further enhancing the model’s accuracy and assistance capabilities in risk assessment, and ensuring it genuinely serves the safety management needs of mines.

Looking ahead, given the vast complexity and diversity of coal mine risk alarm data, we will delve into the integration of multimodal large language models and multi-agent mechanisms, striving for continuous updates and iterations based on intelligent assessment and large model inference capabilities. We believe this combination will provide strong momentum for the ongoing development of mining automation, allowing risk management to become more refined and intelligent, ultimately achieving a dual enhancement of safety and efficiency.

Author Contributions

Conceptualization, C.Z. and Y.S.; methodology, C.Z.; validation, C.Z.; investigation and formal analysis, C.Z.; resources, C.Z.; data curation, C.W. and C.Z.; writing—original draft preparation, C.Z.; writing—review and editing, Y.H.; visualization, C.Z.; supervision, C.Z.; project administration, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The configuration files, environment dependencies (including the version information of relevant software packages), and the de-identified dataset involved in this study have been made publicly available and can be accessed via the following link: https://gitee.com/sun-yi-laboratory-team/MIRA-ChatGLM (accessed on 2 December 2024). Due to data privacy and compliance reasons, the original data are only accessible with specific permissions. For access to the original data, please contact the corresponding author.

Acknowledgments

The authors wish to thank the reviewers for their valuable comments and suggestions concerning this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, X.; Cao, Z.; Xu, Y. Characteristics and trends of coal mine safety development. Energy Sources Part A Recover. Util. Environ. Eff. 2021, 1–19. [Google Scholar] [CrossRef]
Li, M.; Wang, H.; Wang, D.; Shao, Z.; He, S. Risk assessment of gas explosion in coal mines based on fuzzy AHP and bayesian network. Process Saf. Environ. Prot. 2020, 135, 207–218. [Google Scholar] [CrossRef]
Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
Sun, J.P. Research on method of coal mine gas and coal dust explosion perception alarm and explosion source judgment. Ind. Mine Autom. 2020, 46, 1–5. [Google Scholar]
Matloob, S.; Li, Y.; Khan, K.Z. Safety measurements and risk assessment of coal mining industry using artificial intelligence and machine learning. Open J. Bus. Manag. 2021, 9, 1198–1209. [Google Scholar] [CrossRef]
Zhang, G.; Wang, E. Risk identification for coal and gas outburst in underground coal mines: A critical review and future directions. Gas Sci. Eng. 2023, 118, 205106. [Google Scholar] [CrossRef]
Miao, D.; Lv, Y.; Yu, K.; Liu, L.; Jiang, J. Research on coal mine hidden danger analysis and risk early warning technology based on data mining in China. Process Saf. Environ. Prot. 2023, 171, 1–17. [Google Scholar] [CrossRef]
Dey, P.; Chaulya, S.K.; Kumar, S. Hybrid CNN-LSTM and IoT-based coal mine hazards monitoring and prediction system. Process Saf. Environ. Prot. 2021, 152, 249–263. [Google Scholar] [CrossRef]
Wang, E.; Li, Z.; Li, B.; Qin, B.; Xu, J.; Li, N.; Xia, H.; Zhang, G.; Li, Y.; Feng, X.; et al. Big data monitoring and early warning cloud platform for coal mine gas disaster risk and potential danger and its application. Coal Sci. Technol. 2022, 50, 142–150. [Google Scholar]
Li, J.; Li, T. A decision system based on intelligent perception and decision for scene ventilation safety. Int. J. Comput. Sci. Eng. 2021, 24, 162–170.
Zhang, G.; Wang, E.; Zhang, C.; Li, Z.; Wang, D. A comprehensive risk assessment method for coal and gas outburst in underground coal mines based on variable weight theory and uncertainty analysis. Process Saf. Environ. Prot. 2022, 167, 97–111.
Xu, K.; Li, S.; Lu, C.; Liu, J. Risk assessment of coal mine gas explosion based on cloud integrated similarity and fuzzy DEMATEL. Process Saf. Environ. Prot. 2023, 177, 1211–1224.
You, M.; Li, S.; Li, D.; Xu, S. Applications of artificial intelligence for coal mine gas risk assessment. Saf. Sci. 2021, 143, 105420.
Du, J.; Chen, J.; Pu, Y.; Jiang, D.; Chen, L.; Zhang, Y. Risk assessment of dynamic disasters in deep coal mines based on multi-source, multi-parameter indexes, and engineering application. Process Saf. Environ. Prot. 2021, 155, 575–586.
Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
Chen, Z.; Mao, H.; Li, H.; Jin, W.; Wen, H.; Wei, X.; Wang, S.; Yin, D.; Fan, W.; Liu, H.; et al. Exploring the potential of large language models (LLMs) in learning on graphs. ACM SIGKDD Explor. Newsl. 2024, 25, 42–61.
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
Liu, C.; Sun, K.; Zhou, Q.; Duan, Y.; Shu, J.; Kan, H.; Gu, Z.; Hu, J. CPMI-ChatGLM: Parameter-Efficient Fine-Tuning of ChatGLM with Chinese Patent Medicine Instructions. Sci. Rep. 2024, 14, 6403.
Huang, A.H.; Wang, H.; Yang, Y. FinBERT: A large language model for extracting information from financial text. Contemp. Account. Res. 2023, 40, 806–841.
Kasneci, E.; Seßler, K.; Küchemann, S.; Bannert, M.; Dementieva, D.; Fischer, F.; Gasser, U.; Groh, G.; Günnemann, S.; Hüllermeier, E.; et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 2023, 103, 102274.
Demszky, D.; Yang, D.; Yeager, D.S.; Chandhok, S.; Eichstaedt, J.C.; Bryan, C.J.; Clapper, M.; Hecht, C.; Jamieson, J.; Johnson, M.; et al. Using large language models in psychology. Nat. Rev. Psychol. 2023, 2, 688–701.
Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.-D.; et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 958–979.
Xu, Z.; Zhang, Y.; Xie, E.; Zhao, Z.; Guo, Y.; Wong, K.-Y.K.; Li, Z.; Zhao, H. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. IEEE Robot. Autom. Lett. 2024.
Team GLM; Zeng, A.; Xu, B.; Wang, B.; Zhang, C.; Yin, D.; Zhang, D.; Rojas, D.; Feng, G.; Zhao, H.; et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. arXiv 2024, arXiv:2406.12793.
Liu, H.; Tam, D.; Muqeeth, M.; Mohta, J.; Huang, T.; Bansal, M.; Raffel, C. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Adv. Neural Inf. Process. Syst. 2022, 35, 1950–1965.
Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, PMLR 2019, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799.
Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190.
Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT Understands, too; AI Open: San Francisco, CA, USA, 2023.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685.
Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient finetuning of quantized LLMs. arXiv 2024, arXiv:2305.14314.
Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
Lin, C.Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81.
Urbizu, G.; San Vicente, I.; Saralegi, X.; Corral, A. Not Enough Data to Pre-train Your Language Model? MT to the Rescue! In Proceedings of the Findings of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; pp. 3826–3836.
Diao, S.; Xu, R.; Su, H.; Jiang, Y.; Song, Y.; Zhang, T. Taming pre-trained language models with n-gram representations for low-resource domain adaptation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 3336–3349.
Edwards, A.; Camacho-Collados, J.; De Ribaupierre, H.; Preece, A. Go simple and pre-train on domain-specific corpora: On the role of training data for text classification. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 8–13 December 2020; pp. 5522–5529.
Bird, S. NLTK: The natural language toolkit. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Sydney, Australia, 17–18 July 2006; pp. 69–72.
Zheng, Y.; Zhang, R.; Zhang, J.; Ye, Y.; Luo, Z.; Feng, Z.; Ma, Y. LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models. arXiv 2024, arXiv:2403.13372.
Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35.
Le Scao, T.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, F.; Gallé, M.; et al. BLOOM: A 176B-parameter open-access multilingual language model. arXiv 2023, arXiv:2211.05100.
Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv 2023, arXiv:2309.16609.
Yang, A.; Xiao, B.; Wang, B.; Zhang, B.; Yin, C.; Lv, C.; Pan, D.; Wang, D.; Yan, D.; Yang, F.; et al. Baichuan 2: Open large-scale language models. arXiv 2023, arXiv:2309.10305.
Wang, H.; Liu, C.; Xi, N.; Qiang, Z.; Zhao, S.; Qin, B.; Liu, T. HuaTuo: Tuning LLaMA model with Chinese medical knowledge. arXiv 2023, arXiv:2304.06975.

Figure 1. Large language model for coal mine risk assessment under the MIRA-ChatGLM framework.

Figure 2. Overall pipeline of GLM-4 All Tools and customized GLM (agent).

Figure 3. Data flow in LoRA’s forward propagation. The input data x passes through the left weight matrix W and the two matrices A and B on the right. Both hidden layers have the same output dimension, d. The final output, h, is obtained by summing the results from both sides.
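
The data flow described in this caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names `matvec` and `lora_forward`, the toy dimensions, the zero initialization of B, and the optional `scale` factor are all assumptions made for the example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, scale=1.0):
    """Return h = W x + scale * B (A x).

    W is the frozen d x d weight, A is r x d and B is d x r (r << d in practice).
    `scale` is an illustrative knob, not a value taken from the paper.
    """
    frozen_path = matvec(W, x)                # left branch: pre-trained weights
    low_rank_path = matvec(B, matvec(A, x))   # right branch: down-project, then up-project
    return [f + scale * l for f, l in zip(frozen_path, low_rank_path)]


# Toy example with d = 3 and r = 1 (hypothetical values).
W = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[1.0, 2.0, 3.0]]
x = [1.0, 1.0, 1.0]

# With B set to zero, the adapter contributes nothing and h equals W x.
B_zero = [[0.0], [0.0], [0.0]]
h_init = lora_forward(W, A, B_zero, x)

# After B has moved away from zero, the low-rank update shifts the output.
B = [[0.5], [0.0], [-0.5]]
h = lora_forward(W, A, B, x)
```

Both branches produce a vector of dimension d, which is why they can be summed element-wise to give h.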

Figure 4. Memory requirements of different tuning methods. QLoRA improves upon LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to manage memory spikes.
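
As a rough back-of-the-envelope illustration of why quantization matters on a 24 GB consumer GPU, the weight storage alone can be estimated as below. The parameter count of 9 billion is an assumption read off the model name (GLM-4-9B), and the estimate deliberately ignores activations, optimizer state, adapter weights, and quantization constants, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed to store the weights only, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: a nominal 9 billion parameters, taken from the "9B" in the model name.
N_PARAMS = 9_000_000_000

# Weight storage at 16-bit, 8-bit, and 4-bit precision.
estimates = {bits: weight_memory_gib(N_PARAMS, bits) for bits in (16, 8, 4)}

for bits, gib in estimates.items():
    print(f"{bits:>2}-bit weights: about {gib:.2f} GiB")
```

Halving the bit width halves the weight footprint, which is the main saving QLoRA adds on top of LoRA.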

Figure 5. Line graph of the performance of MIRA-ChatGLM under three fine-tuning methods with different data scales. As the data volume increases, the performance of MIRA-ChatGLM gradually improves. The horizontal axis represents the parameter scale, and the vertical axis represents the score.

Figure 6. Scores of BLEU under different fine-tuning methods and data scales. The horizontal axis represents the parameter scale, and the vertical axis represents the score.

Figure 7. Loss curve during LoRA fine-tuning.

Table 1. Dataset sources and core parameter descriptions.

Data Source	Overview of Main Content	Example Data (Simplified)
Gas alarm range	Locations, numbers, alarm thresholds, power outage ranges of gas sensors	Location: return corner of coal mining face, number: T0, alarm concentration: ≥0.5%, power outage range: all non-intrinsically safe electrical equipment in the working face and return airway.
Historical alarm event records	Details of alarm events, including time, location, type, and handling measures	Duration: 1 min 11 s; mine name: A Mining; type: laser methane; cause: test failure lockout; measure: standardize testing

Table 2. An example of instruction data.

Example

Instruction: You are an expert in the field of coal mine gas risk analysis and are familiar with relevant coal mine safety regulations. Please professionally and correctly answer questions about coal mine safety that users want to ask.
Input: The name of the coal mine is xxx, and the installation position is carbon monoxide in the general return air lanes. The measurement type at this point is for carbon monoxide, and the maximum value of carbon monoxide measurement is 511.0 ppm. What happens? What is the response?
Output: The alarm type is calibration, the alarm cause is sensor calibration, and the treatment measure is the cause: sensor calibration.
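
A record with these three fields could be assembled and serialized as sketched below. The helper `build_record`, its argument names, and the sentence templates are hypothetical and only mirror the structure of the example above; the paper does not specify how its records were generated.

```python
import json

# Instruction text quoted from the example above.
INSTRUCTION = (
    "You are an expert in the field of coal mine gas risk analysis and are "
    "familiar with relevant coal mine safety regulations. Please professionally "
    "and correctly answer questions about coal mine safety that users want to ask."
)


def build_record(mine, location, gas, max_ppm, alarm_type, cause, measure):
    """Build one instruction/input/output record (hypothetical helper)."""
    question = (
        f"The name of the coal mine is {mine}, and the sensor is installed at "
        f"{location}. The measurement type is {gas}, and the maximum measured "
        f"value is {max_ppm} ppm. What happens? What is the response?"
    )
    answer = (
        f"The alarm type is {alarm_type}, the alarm cause is {cause}, "
        f"and the treatment measure is {measure}."
    )
    return {"instruction": INSTRUCTION, "input": question, "output": answer}


record = build_record(
    mine="xxx",
    location="the general return airway",
    gas="carbon monoxide",
    max_ppm=511.0,
    alarm_type="calibration",
    cause="sensor calibration",
    measure="sensor calibration",
)

# One JSON object per record; ensure_ascii=False keeps non-ASCII text readable.
line = json.dumps(record, ensure_ascii=False)
```

Keeping the instruction constant and varying only the input and output makes it straightforward to build the non-instruction variant for an ablation by dropping the first field.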

Table 3. Evaluation parameters for model inference speed.

Method	Predict_Runtime (min)	Predict_Samples_per_Second (samples/s)	Predict_Steps_per_Second (steps/s)
LoRA	25.93	1.294	0.324
QLoRA-8bit	41.60	0.806	0.202
QLoRA-4bit	35.40	0.947	0.237
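
The three columns are related by simple arithmetic, which gives a quick sanity check on the table. The sketch below multiplies runtime by throughput to recover the implied number of evaluated samples and the implied samples per step; these implied values are inferences from the reported figures, not numbers stated in the table.

```python
# (runtime in minutes, samples per second, steps per second), copied from Table 3.
rows = {
    "LoRA": (25.93, 1.294, 0.324),
    "QLoRA-8bit": (41.60, 0.806, 0.202),
    "QLoRA-4bit": (35.40, 0.947, 0.237),
}

implied = {}
for method, (runtime_min, samples_per_s, steps_per_s) in rows.items():
    runtime_s = runtime_min * 60
    implied[method] = {
        "samples": runtime_s * samples_per_s,             # total samples processed
        "steps": runtime_s * steps_per_s,                 # total prediction steps
        "samples_per_step": samples_per_s / steps_per_s,  # effective batch size
    }

for method, values in implied.items():
    print(
        f"{method}: ~{values['samples']:.0f} samples, "
        f"~{values['steps']:.0f} steps, "
        f"~{values['samples_per_step']:.2f} samples/step"
    )
```

All three methods imply roughly the same sample count, which is consistent with them being evaluated on the same test set, so the runtime differences reflect per-sample speed only.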

Table 4. Comparison of the three fine-tuning methods using instances.

Dialog Example

Input:
Coal mine name: XXX Mining, the installation site returns carbon monoxide readings to the wind tunnel wind measurement station, the point of measurement type is carbon monoxide, carbon monoxide measurement shows a maximum value of 0.0 ppm, what happens? What countermeasures are taken?
LoRA output:
The alarm type is for a sensor disconnection, the alarm reason is to repair the line, and the treatment measure is to return to normal according to standard operation.
QLoRA (8-bit) output:
The alarm type is for a sensor disconnection, the cause of the alarm is to organize the wiring, and the treatment measure is to return to normal according to standard operation. It is important to note that the alarm type is “Sensor Disconnected” and not “Abnormal Carbon Monoxide Concentration”. Carbon monoxide is measured at a maximum value of 0.0 ppm, which means that the carbon monoxide concentration is very low and does not have any safe alarm level. Therefore, the cause of the alarm was due to the sensor cable being tidied up and causing a disconnection, not a high carbon monoxide concentration. The treatment was for the staff to restore the cable in time to ensure that the sensor was working properly. In this case, no special countermeasures were taken because the carbon monoxide concentration was safe.
QLoRA (4-bit) output:
The alarm type is for a disconnected sensor, the reason for the alarm is to service the wiring, and the treatment is to restore it promptly.
In this case, the alarm type is “Sensor Disconnected”, which indicates a problem with the connection between the carbon monoxide sensor and the monitoring system. The carbon monoxide measurement maximum value is 0.0 ppm, which is usually the reading when the sensor is working properly, indicating that no carbon monoxide is detected.
The cause of the alarm is the “Service Line”, which means that staff may be carrying out maintenance or overhaul work, causing the sensor to be temporarily disconnected. The response is “Restore in time”, which means that the crew needs to complete the wiring service as soon as possible to restore the sensor to its normal operating condition.
The following are possible countermeasures:
1. The personnel immediately stop servicing and focus on restoring the sensor’s wiring connection.
2. Ensure adequate ventilation or other safety measures are in place to prevent carbon monoxide buildup during restoration of the line connection.
3. Once the line connection has been restored, restart the sensor and monitor its readings to ensure that all is well.
4. Upon completion, evaluate the overhaul process to prevent a similar situation from recurring.
5. Update the maintenance record with information such as the reason for the service, what was done, and the recovery time.
As the carbon monoxide concentration read 0.0 ppm, no additional contingency measures were taken as the sensor readings indicated no risk of carbon monoxide exceeding the limit.

Table 5. Ablation study results: fine-tuning with instruction data performs better than fine-tuning with non-instruction data (bold values indicate better performance).

Training data	BLEU-4	ROUGE-1	ROUGE-2	ROUGE-L
Non-instruction data	80.0892	87.5659	81.1023	86.2961
Instruction data	84.4695	90.6289	86.8820	90.6289
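
The size of the gain from instruction data can be read directly off the two rows. The short sketch below computes the absolute difference per metric from the values in Table 5; the variable names are illustrative.

```python
# Scores copied from Table 5.
non_instruction = {"BLEU-4": 80.0892, "ROUGE-1": 87.5659, "ROUGE-2": 81.1023, "ROUGE-L": 86.2961}
instruction = {"BLEU-4": 84.4695, "ROUGE-1": 90.6289, "ROUGE-2": 86.8820, "ROUGE-L": 90.6289}

# Absolute gain in points for each metric.
gain = {metric: instruction[metric] - non_instruction[metric] for metric in instruction}

for metric, delta in gain.items():
    print(f"{metric}: +{delta:.4f} points")

largest = max(gain, key=gain.get)
```

Every metric improves by roughly 3 to 6 points, with the largest gain on ROUGE-2.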

Table 6. Comparison of experimental results of different models (bold values indicate better performance).

Model	BLEU-4	ROUGE-1	ROUGE-2	ROUGE-L
Bloom-7B	65.1465	70.2396	68.9872	71.8386
Qwen-7B	73.3362	78.9910	76.1294	81.3647
Baichuan-7B	77.3001	82.0610	80.8721	85.9376
MIRA-ChatGLM	84.4695	90.6289	86.8820	90.6289

Table 7. SUS scores of different models and corresponding 95% confidence intervals (bold values indicate better performance).

Model	Safety	Usability	Smoothness
Bloom-7B	2.232 ± 0.338	1.992 ± 0.580	2.389 ± 0.482
Qwen-7B	2.495 ± 0.451	2.015 ± 0.537	2.401 ± 0.410
Baichuan-7B	2.563 ± 0.279	2.710 ± 0.336	2.719 ± 0.305
MIRA-ChatGLM	2.796 ± 0.152	2.830 ± 0.456	2.912 ± 0.217
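
For readers unfamiliar with the "mean ± half-width" notation, one common way to obtain such an interval is the normal approximation shown below. The table does not state how its intervals were computed, so the method, the helper `mean_with_ci`, and the example ratings are all assumptions for illustration.

```python
import math
import statistics


def mean_with_ci(scores, z=1.96):
    """Return (mean, half-width) of a normal-approximation 95% confidence interval."""
    n = len(scores)
    mean = statistics.fmean(scores)
    # Standard error uses the sample standard deviation (n - 1 in the denominator).
    half_width = z * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width


# Hypothetical ratings from four evaluators for a single criterion.
ratings = [2, 3, 3, 3]
mean, half = mean_with_ci(ratings)
summary = f"{mean:.3f} ± {half:.3f}"
print(summary)
```

With small evaluator panels, a Student's t critical value would give a wider and more conservative interval than the fixed z = 1.96 used here.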

Table 8. Training time.

Method	Single GPU (24 GB)	Dual GPU (48 GB)
LoRA	4.2 h	2.5 h
QLoRA (quantization level: 4)	5.5 h	3.2 h
QLoRA (quantization level: 8)	9.5 h	5.5 h
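
The benefit of the second GPU can be quantified as a speedup and a parallel efficiency. The sketch below derives both from the training times in Table 8; the dictionary keys are shortened labels chosen for the example.

```python
# (single-GPU hours, dual-GPU hours), copied from Table 8.
hours = {
    "LoRA": (4.2, 2.5),
    "QLoRA (4-bit)": (5.5, 3.2),
    "QLoRA (8-bit)": (9.5, 5.5),
}

# Speedup is single-GPU time divided by dual-GPU time.
speedup = {method: single / dual for method, (single, dual) in hours.items()}

# Efficiency compares the speedup against the ideal factor of 2 for two GPUs.
efficiency = {method: s / 2 for method, s in speedup.items()}

for method in hours:
    print(f"{method}: speedup {speedup[method]:.2f}x, efficiency {efficiency[method]:.0%}")
```

All three methods reach about 1.7x on two GPUs, that is, roughly 85% of ideal scaling.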

Table 9. Comparison summary of existing models/technologies in the coal mining domain with our proposed solution.

Aspect	Existing Methods	Proposed Method
Risk assessment method	Rule-based expert systems that analyze sensor data and trigger alarms based on predefined thresholds [4].	Integrates advanced pre-trained large language models (LLMs) for semantic interpretation and decision-making on multidimensional data.
Data type	Primarily single-modal data inputs (e.g., methane, carbon monoxide concentrations) [4].	Supports multidimensional data integration, including gas concentrations, sensor locations, and equipment statuses.
Technological approaches	Traditional algorithms (e.g., random forest, SVM) and deep learning models (e.g., LSTM, GRUs) focused on specific patterns or time-series predictions [7,8].	Utilizes instruction-tuned LLMs with strong generalization capabilities for complex and systematic risk assessments.
Data fusion	Combines multidimensional sensor data to enhance robustness in risk assessment but faces challenges with high computational resource demands and limited annotated data [9,10,11,12].	Efficiently processes multidimensional data using pre-trained LLMs, reducing reliance on extensive annotated datasets through knowledge transfer.
Challenges addressed	Limited adaptability to complex mining scenarios, high computational demands of multidimensional data fusion, and insufficient annotated data [12].	Enhanced adaptability to complex scenarios and alleviated reliance on extensive annotated datasets through pre-trained LLMs.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
