Data
  • Data Descriptor
  • Open Access

4 December 2025

China’s 15-Year Mine Accident Report Dataset (2010–2025): Construction and Analysis

1 Research Institute of Mine Software, Chinese Institute of Coal Science, Beijing 100013, China
2 Beijing Technology Research Branch, Tiandi Science and Technology Co., Ltd., Beijing 100013, China
3 Research Institute of Mine Artificial Intelligence, Chinese Institute of Coal Science, Beijing 100013, China
4 School of Civil and Resource Engineering, University of Science and Technology Beijing, Beijing 100083, China

Abstract

Mine accidents pose severe threats to worker safety and sustainable mining development in China. However, existing mine accident data in China are often scattered, unstructured, and lack systematic integration, which limits their application in safety research and practice. This study constructed a standardized structured dataset using 532 mine accident reports from official channels covering the period 2010–2025. The dataset went through four stages: data collection, standardized cleaning, structured annotation, and quality validation. It is stored in JSON Lines (JSONL) format for easy reuse. The dataset covers 27 provinces/autonomous regions/municipalities in China. Among accident levels, general accidents account for 65.6%; among accident types, roof accidents account for 20.3%. Accidents are geographically concentrated, with 11.7%, 8.3%, and 7.7% occurring in Shanxi, Gansu, and Inner Mongolia, respectively. Official data have shown an annual average decrease of 9.7% in mine accidents from 2018 to 2022, reflecting improved safety governance. This dataset addresses the gap of a full-element structured mine accident database in China, providing high-quality data for accident causation modeling, regional risk early warning, and safety policy evaluation. It also supports mine enterprises in targeted risk prevention and regulatory authorities in precise regulatory enforcement.
Dataset: Available at https://doi.org/10.5281/zenodo.17388738 (accessed on 19 October 2025).
Dataset License: CC-BY 4.0

1. Summary

Mining serves as a foundational pillar of China’s national economy, playing an irreplaceable role in energy supply and raw material security [1,2,3]. The economic impact of mining accidents extends beyond direct losses, affecting supply chain stability and regional development, with estimates suggesting each fatality generates indirect costs 3–5 times the direct economic loss. However, the complex geological conditions, high-intensity operational environments, and multi-link production processes inherent to mining make it one of the industries with notably high safety risks [4,5]. According to official data from the National Mine Safety Administration (NMSA), over 3000 mine accidents occurred in China between 2010 and 2025, resulting in thousands of casualties and cumulative direct economic losses exceeding CNY 10 billion. These frequent accidents not only threaten the lives of workers but also constrain the sustainable development of the mining industry, becoming a prominent issue affecting social stability and high-quality economic growth [6,7,8,9].
Accident data are the core foundation for revealing safety patterns and optimizing prevention strategies. According to official Chinese statistics, the coal mine death rate per million tons dropped from 0.093 in 2018 to 0.044 in 2021, showing a significant downward trend. This indicator is still higher than that of developed countries: the death rate per million tons in the United States and Australia is about 0.01–0.02 (2020 data), meaning China’s work risk level is roughly 2–4 times that of the United States and Australia [10]. Other studies also show that the death rate per million tons in developed countries generally ranges from 0.03 to 0.04, while China’s 0.044 in 2021 has approached but is still slightly higher than this range [11]. In terms of information collection rules, the U.S. Mine Safety and Health Administration (MSHA) operates the Mine Data Retrieval System (MDRS), a unified national database requiring electronic reporting within 15 min, with data publicly searchable in real time [12,13]. The European Commission has mandated standardized accident reporting under Regulation (EU) 2022/123, establishing cross-national data comparability [14,15] (Figure 1). In contrast, the management and application of mine accident data in China face significant limitations: first, officially disclosed accident information is often presented in fragmented reports without a unified structured format, hindering data integration and reuse; second, existing data primarily focus on recording basic accident details, lacking in-depth analysis of accident causes, dynamic tracking of disposal processes, and systematic sorting of responsibility chains—all of which restrict their value in academic research and practical applications.
Figure 1. Coal mining mortality rate comparison (2010–2025).
In recent years, with the advancement of the “safety resilience” theory [16,17] and machine learning technologies [18,19], the academic community’s demand for high-quality accident data has become increasingly urgent. Li et al. (2023) [20] developed an early warning gas accident model with 82% accuracy based on manually annotated small-sample data, but its promotion was limited by an insufficient data scale. Qiu et al. (2021) [21] pointed out that the lack of multi-dimensional data on mine accident causes has confined causal analysis studies to qualitative descriptions. Leveson (2011) [22] and Hollnagel (2012) [23] emphasized that system-theoretic accident models require granular data on human–machine–environment-management (HME-M) interactions—a critical deficiency in existing Chinese datasets. Tixier et al. (2015) [24] reviewed the applications of text mining in occupational safety, underscoring the imperative for standardized protocols for causal information extraction. Dong et al. (2023) [25] stated that safe and intelligent mining is an inevitable trend for information-based mining engineering, with its disciplinary connotations, challenges, and prospects discussed. Zhang et al. (2021) [26] proposed a mine safety status assessment framework based on multi-source heterogeneous data fusion methods, providing technical support for multi-dimensional data analysis. Against this backdrop, constructing a standardized, full-element, and fine-grained mine accident dataset has become critical to breaking through current research bottlenecks.
To address this gap, the present study developed a structured mine accident dataset using official Chinese mine accident reports covering a 15-year period (2010–2025). The dataset comprises 532 samples and is distinguished by three key features: (1) dimension completeness: it captures the entire accident lifecycle—from occurrence to disposal—covering 17 core dimensions and 43 subfields; (2) structural standardization: it adopts the JSON Lines (JSONL) format to unify data fields, ensuring direct reusability by different research teams; and (3) in-depth interpretability: manual annotation is used to clarify the direct (technical/operational) and indirect (managerial/institutional) causes of accidents, as well as specific disposal measures, providing fine-grained data for causal mechanism research.
This dataset is designed to support three key research directions: first, causal mechanism research, which explores the coupled failure laws of the “human-machine-environment-management” system through correlation analysis of direct and indirect causes; second, risk prediction research, which builds regional early warning models based on the mapping relationship between spatio-temporal features and accident types; and third, policy evaluation research, which quantifies the effectiveness of safety policies by comparing accident characteristics before and after regulatory campaigns. In practice, it provides mining enterprises with prevention templates for similar accidents and offers data-driven support for regulatory authorities to conduct precise regulatory enforcement.
Currently, no research projects (funded or unfunded) based on this dataset have been formally published; however, preliminary analyses have been completed to validate the dataset’s reliability. The public release of this dataset is expected to promote empirical research and interdisciplinary collaboration in the field of mine safety, facilitating the transition of mining safety management from “experience-driven” to “data-driven” practices.

2. Data Description

This section details the source, scope, and processing workflow of the dataset, as well as the structured annotation rules and quality control measures, to ensure the transparency and reusability of the data.

2.1. Data Source and Scope

The raw data of this dataset are derived from publicly available official mine accident reports in China, spanning the period from January 2010 to June 2025. Three types of authoritative sources are included to ensure data authenticity and representativeness:
  • Government supervision platforms: Accident information columns on the official website of the National Mine Safety Administration (NMSA), and accident notices released by provincial Emergency Management Departments.
  • Authoritative media publications: In-depth reports on major and relatively large mine accidents by central media (Xinhua News Agency and People’s Network) and local mainstream media.
  • Industry reports and the literature: The Annual Mine Work Safety Report issued by the NMSA and officially verified accident cases cited in academic journals.
Geographically, the dataset covers mines in 27 provinces/autonomous regions/municipalities in China. No samples were collected from Beijing, Shanghai, Tianjin, Heilongjiang, or Jilin, where mining activities are limited due to economic structure and geographical conditions. This exclusion is justified by authoritative data: according to 2022 data from the National Bureau of Statistics, these five regions collectively account for only 2.14% of national coal reserves (Beijing 0.05%, Tianjin 0%, Shanghai 0%, Jilin 0.24%, and Heilongjiang 1.77%) and less than 1.8% of active mining capacity. The 13th Five-Year Plan for Coal Industry Development outlined exit or restrictive policies for these regions, with Beijing completing its commercial coal mining phase-out by 2020. Shanghai has historically had no commercial mines. Therefore, their exclusion has minimal impact (<2% sampling bias) on the representativeness of the dataset regarding national mine safety characteristics. Temporally, the sample distribution is as follows: 88 samples (16.5%) from 2010 to 2014, 159 samples (29.9%) from 2015 to 2019, and 285 samples (53.6%) from 2020 to 2025. This temporal imbalance is primarily attributed to the lower retrievability of older reports (pre-2019) due to incomplete digital archiving of official information.
Notably, the dataset exhibits a certain degree of sample selection bias: relatively large and major accidents account for a higher proportion due to high social attention and comprehensive information disclosure, while general accidents are underrepresented. This distribution deviates from the actual occurrence pattern of mine accidents, indicating room for improvement in sample comprehensiveness.

2.2. Data Collection and Annotation Process

The dataset construction follows a four-stage workflow: raw data collection → standardized cleaning → structured annotation → quality validation. After each stage, systematic quality checks were conducted to ensure the reliability of each data point.

2.2.1. Raw Data Collection

A combination of “targeted web crawling” and “manual retrieval” was adopted to collect data.
For accident reports on government websites, the Scrapy framework was used to crawl through public documents (in PDF and HTML formats) from 2010 onwards, with duplicate and invalid links filtered out to obtain valid reports.
For cases from media reports and the industry literature, keyword retrieval was conducted to screen for relevant content. Core accident information was manually extracted and compiled into text records to avoid missing key details.

2.2.2. Standardized Cleaning

Raw data suffered from format disorganization, inconsistent terminology, and information redundancy, which required standardized processing:
  • Format unification: Reports in PDF, HTML, and plain text formats were converted to uniform plain text, with irrelevant content removed to ensure consistency in data structure.
  • Terminology normalization: Standard terms from the Measures for Rewards for Reports in the Work Safety Field and Measures for the Investigation and Handling of Mine Accidents were adopted to unify the expression of accident types and accident levels. Hu et al. (2025) [27] and Gao et al. (2024) [28] demonstrate that domain-specific term normalization improves information extraction accuracy by 15–22% in Chinese accident reports, justifying our manual curation approach.
  • Deduplication: Duplicate records were eliminated through triple-keyword matching (“accident name + occurrence time + location”), resulting in 532 unique samples retained for subsequent processing.
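The triple-key deduplication step can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the `normalize` helper, the flat string `location`, and the sample records are assumptions made for the example.

```python
import re

def normalize(text):
    """Lowercase and drop whitespace/punctuation so trivially different spellings match."""
    return re.sub(r"[\s\W_]+", "", text.lower())

def dedup_key(record):
    # Triple key from the text: accident name + occurrence time + location.
    return (
        normalize(record["accident_name"]),
        record["accident_time"].strip(),
        normalize(record["location"]),
    )

def deduplicate(records):
    """Keep the first record seen for each triple key, preserving input order."""
    seen = set()
    unique = []
    for rec in records:
        key = dedup_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records for illustration only (location flattened to one string).
raw = [
    {"accident_name": "Example Mine roof accident", "accident_time": "2021-05-03 14:20", "location": "Shanxi, Linfen"},
    {"accident_name": "Example  Mine Roof Accident", "accident_time": "2021-05-03 14:20", "location": "Shanxi,Linfen"},
    {"accident_name": "Example Mine gas explosion", "accident_time": "2022-08-11 02:05", "location": "Guizhou, Bijie"},
]
unique = deduplicate(raw)
print(len(unique))
```

Exact matching on a normalized key is the simplest reading of "triple-keyword matching"; reports that name the same accident differently would still need manual review.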

2.2.3. Structured Annotation

A structured annotation framework encompassing 17 core dimensions was developed based on accident causation theories and mine safety management standards, covering the entire accident lifecycle (from occurrence to disposal). For the critical “causes of accident” link in the lifecycle, standardized annotation rules were established: direct_cause (technical/operational triggers) and indirect_cause (managerial/institutional issues) were clearly defined, extracted from the text, and expressed under a unified norm of ≤50 characters per item. This turns unstructured causal information into structured fields and addresses the lack of standardized causal data in existing datasets. To ensure professional accuracy, each report was annotated independently by two postgraduate students specializing in mine safety, and inter-annotator consistency was assessed via the Kappa coefficient (κ = 0.87). Artstein & Poesio (2008) [29] and Hripcsak & Rothschild (2005) [30] confirm that κ ≥ 0.8 indicates “excellent agreement” beyond chance, supporting the reliability of our annotations.
The dataset is stored in JSON Lines (JSONL) format, with each accident sample containing 17 core dimensions and 43 subfields. Key annotation rules for major dimensions are summarized as follows:
  • Basic Information: accident_id uses official report numbers (or “null” if unavailable); accident_name includes the responsible entity, time, level, and type, with missing details supplemented via authoritative online platforms and databases such as the official website of the National Mine Safety Administration (NMSA), the Enterprise Credit Information Publicity System, and the official platforms of provincial/municipal Emergency Management Departments; accident_time is recorded to the minute in “YYYY-MM-DD HH:mm” format.
  • Spatio-Temporal Characteristics: location includes province (full name, consistent with National Bureau of Statistics standards), city, county, and specific_location; missing geographic information is supplemented via administrative division databases.
  • Responsible Entities: accident_unit covers group (parent group’s legal name, supplemented via the Enterprise Credit Information Publicity System), company (consistent with the business license), and mine (consistent with the mining license).
  • Accident Characteristics: accident_category follows industry standards (10 categories for coal mines, including roof, gas, and transportation accidents); accident_level is determined per the Regulations on the Reporting, Investigation, and Handling of Production Safety Accidents (four levels: extraordinarily serious, major, relatively large, and general), using the highest level corresponding to casualties or direct economic losses; accident_nature is classified as “liability accident,” “non-liability accident,” or “natural disaster-induced accident.”
  • Consequence Information: casualties records deaths, serious injuries, minor injuries, and missing persons (0 if no data) and “involved” persons (null if no data); economic_loss (in CNY 10,000) includes direct_loss and indirect_loss (retained to two decimal places, null if no data).
  • Process and Causes: accident_process describes key chronological nodes (≤500 characters); direct_cause (≤50 characters) reflects technical/operational triggers, while indirect_cause (≤50 characters) addresses management/institutional issues.
  • Disposal and Responsibility: responsibilities lists direct/related units (legal names) and regulatory units (null if unclear); disposal includes emergency measures (≤50 characters), personnel/unit punishments, and rectification requirements.
  • Reporting and Timeline: report_info includes initial_report_time (same format as accident_time), delay_report (Boolean), and conceal_report (Boolean); timeline uses a nested structure with stages and sub-events (time + description), marked “not disclosed” if time is unclear.
  • Supplementary Information: other_info uses an empty string if there is no data.
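To make the rules above concrete, the sketch below builds one hypothetical JSONL line and reads it back. Only the top-level field names and `casualties.death` come from the text; the remaining sub-key names, the nesting, and every value are assumptions and may differ from the released file.

```python
import io
import json

# One invented record; field values are placeholders, not data from the dataset.
sample_line = json.dumps({
    "accident_id": None,  # null when no official report number is available
    "accident_name": "Example Coal Mine 2021 general roof accident",
    "accident_time": "2021-05-03 14:20",  # "YYYY-MM-DD HH:mm"
    "location": {"province": "Shanxi", "city": "Linfen", "county": "Hongtong",
                 "specific_location": "working face"},
    "accident_category": "roof",
    "accident_level": "general",
    "casualties": {"death": 1, "serious_injury": 0, "minor_injury": 0,
                   "missing": 0, "involved": None},
    "economic_loss": {"direct_loss": 150.00, "indirect_loss": None},  # CNY 10,000
    "direct_cause": "insufficient support strength",
    "indirect_cause": "inadequate supervision",
    "report_info": {"initial_report_time": "2021-05-03 15:05",
                    "delay_report": False, "conceal_report": False},
    "other_info": "",
})

def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# io.StringIO stands in for an opened .jsonl file.
records = list(read_jsonl(io.StringIO(sample_line + "\n\n")))
print(records[0]["location"]["province"], records[0]["casualties"]["death"])
```

With the real file, `open(path, encoding="utf-8")` would replace the `io.StringIO` object; the path is left out here because the file name inside the Zenodo archive is not given in the text.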
Inter-annotator consistency was assessed via the Kappa coefficient, yielding a value of 0.87 (≥0.8 indicates excellent consistency), confirming the reliability of the annotation results.
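For readers who want to reproduce this kind of agreement check, a minimal sketch of the two-annotator (Cohen's) Kappa is given below. The text only says "Kappa coefficient", so the choice of Cohen's variant is an assumption, and the label lists are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical accident_category labels from two annotators.
annotator_1 = ["roof", "gas", "roof", "transport", "roof", "gas", "transport", "roof", "gas", "roof"]
annotator_2 = ["roof", "gas", "roof", "transport", "gas", "gas", "transport", "roof", "gas", "roof"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa applies directly to categorical fields such as accident_category; free-text fields such as direct_cause would first need to be mapped to categories, a step the text does not describe.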

2.2.4. Quality Validation

Three experts in the field of mine safety were invited to validate the annotated results, focusing on three aspects:
  • Accuracy of field extraction.
  • Completeness of information (statistical missing rate of key fields such as accident causes and casualties).
  • Rationality of annotation rules (consistency with national standards and industry practices).
The final validation pass rate was 95.3%. For the 25 samples that failed validation, secondary annotation and correction were conducted to ensure data quality.
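The pass rate follows directly from the counts in the text, and the missing-rate check on key fields can be sketched in a few lines. The helper names, the dotted-path convention, and the demo records are assumptions for illustration.

```python
def get_path(record, path):
    """Follow a dotted path such as 'casualties.death'; return None if any step is missing."""
    value = record
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def missing_rate(records, path):
    """Share of records whose field is absent, None, or an empty string (0 counts as present)."""
    missing = sum(1 for r in records if get_path(r, path) in (None, ""))
    return missing / len(records)

# Pass rate implied by the text: 25 of 532 samples failed validation.
total, failed = 532, 25
pass_rate = (total - failed) / total
print(f"{pass_rate:.1%}")

# Hypothetical mini-sample to exercise missing_rate.
demo = [
    {"direct_cause": "insufficient support strength", "casualties": {"death": 1}},
    {"direct_cause": "", "casualties": {"death": 0}},
    {"casualties": {}},
    {"direct_cause": "gas accumulation", "casualties": {"death": 3}},
]
print(missing_rate(demo, "direct_cause"), missing_rate(demo, "casualties.death"))
```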

3. Methods

All statistical analyses and visualizations of the dataset were conducted using Python 3.9, with core libraries including pandas 1.5.3 for data manipulation, matplotlib 3.7.1 for graphical representation, and scipy 1.10.1 for descriptive statistics. The analytical framework aligned with the dataset’s structural features, focusing on quantifying and interpreting patterns across basic dataset characteristics, temporal distribution, spatial distribution, accident type distribution, casualty and economic loss characteristics, emergency disposal and accountability, and report timeliness—consistent with the logical hierarchy of the dataset’s core information.

3.1. Basic Dataset Characteristics

The descriptive statistics summarize the fundamental attributes of the 532-sample dataset, with key metrics organized in Table 1.
Table 1. Statistical characteristics of the dataset.

3.1.1. Temporal Distribution

Temporal patterns of accidents were analyzed at annual and monthly scales, with interpretations adjusted for sample retrievability bias—a key consideration due to uneven digital archiving of historical data (Figure 2). Table 2 presents the detailed annual distribution. Annual distribution analysis showed a “gradually increasing over time” pattern: annual samples remained low (average 17.2 samples/year) during 2010–2019, increased significantly after 2020, peaked at 73 samples in 2022, and accounted for 68.4% of the total during 2020–2024. This pattern was attributed to improved digital accessibility: the National Mine Safety Administration (NMSA) launched the “Mine Safety Supervision Information System” in 2020, increasing the digital disclosure rate of official reports, while mainstream media retained more real-time accident coverage; in contrast, pre-2019 reports suffered from incomplete electronic archiving, leading to retrieval gaps.
Figure 2. Annual distribution in 2010–2024 of the number of mine accidents (excluding data from the first half of 2025).
Table 2. Annual distribution of mine accidents.
To avoid conflating sample distribution with real accident trends, official NMSA data confirmed a sustained downward trend: 2018–2022 saw an average annual decrease of 9.7% in accident counts, with major and extraordinarily serious accidents declining by 67.2% compared to the previous five-year period; 2021–2024 average accident counts fell by 29.7% relative to the “13th Five-Year Plan” period, and 2024 achieved reductions in total accidents, relatively large accidents, and major/extraordinarily serious accidents—linked to policy interventions such as the “Three-Year Action Plan for Addressing Fundamental Issues in Mine Work Safety”.
Monthly distribution analysis revealed a seasonal pattern: concentrated occurrences in Q2–Q3 and relative lulls in Q1 and Q4. Samples in April–September typically exceeded 40 per month (peaking in October), while January–March and November–December showed lower counts (trough in February). This pattern correlated with mining industry dynamics and natural conditions: Q2–Q3 corresponds to peak production, increasing operational intensity and risk accumulation; summer heavy rainfall and high temperature/humidity also elevate risks of water inrush and abnormal gas emission, a long-documented challenge that the National Mine Work Safety Annual Report refers to as the “rainy season three-prevention” pressure. In contrast, Q1 is marked by the Spring Festival holiday and post-holiday safety inspections, slowing production and reducing accident frequency. Notably, post-2020 digital disclosure enhanced the clarity of this monthly pattern, but the trend itself is an industry-wide phenomenon rather than an artifact of biased data. Annual and monthly distributions are visualized as bar charts (Figures 2 and 3, respectively).
Figure 3. Monthly distribution of mine accidents.
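The annual and monthly counts behind Figures 2 and 3 can be derived from accident_time alone. The sketch below uses only the standard library rather than pandas, and the timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # matches the "YYYY-MM-DD HH:mm" rule for accident_time

def count_by(records, part):
    """Count records per calendar 'year' or 'month' of accident_time."""
    counts = Counter()
    for rec in records:
        moment = datetime.strptime(rec["accident_time"], TIME_FORMAT)
        counts[getattr(moment, part)] += 1
    return dict(sorted(counts.items()))

# Hypothetical timestamps for illustration.
demo = [{"accident_time": t} for t in (
    "2020-02-11 08:30", "2021-06-02 23:10", "2021-07-19 04:45", "2022-07-30 16:00",
)]
by_year = count_by(demo, "year")
by_month = count_by(demo, "month")
print(by_year, by_month)
```

Counts produced this way describe the sample, not the true accident frequency, for the retrievability reasons discussed above.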

3.1.2. Spatial Distribution

Spatial patterns of accidents were analyzed via provincial-scale aggregation statistics, with interpretations linked to China’s mineral resource zoning and mining activity intensity. The provincial distribution showed resource-producing region-oriented agglomeration: high-concentration regions (proportion > 8%) included Shanxi (11.7%), Gansu (8.3%), Inner Mongolia (7.7%), and Guizhou (7.3%); medium-concentration regions (4–8%) included Anhui (6.8%), Henan (6.4%), and Sichuan (6.2%); low-concentration regions (<4%) included municipalities like Beijing and Shanghai, and provinces like Hainan and Tibet.
High-concentration regions align with China’s core mineral-producing zones—Shanxi, Inner Mongolia, and Guizhou are major coal-producing areas, while Gansu is a key base for non-ferrous metals—characterized by dense mine distributions, long mining histories, and overlapping risks from complex geological conditions and high-intensity operations, leading to higher accident exposure and better documentation in public channels. Medium-concentration regions are regional mineral hubs with more dispersed resource types or varied management standards, while low-concentration regions have minimal mining activity, resulting in fewer samples. A bar graph (Figure 4) visualizes provincial sample concentrations, with color intensity proportional to the number of samples.
Figure 4. Provincial distribution of mine accidents.

3.1.3. Accident Type Distribution

Categorical frequency analysis quantified the distribution of accident types (from the accident_category field), revealing a “concentrated core types, dispersed minor types” pattern. Table 3 provides detailed statistical distributions. Core high-risk types—roof accidents (108 samples, 20.3%) and transportation accidents (81 samples, 15.3%)—accounted for over 35% of the total. Roof accidents were linked to roof stability control and support management in underground mining roadways/working faces, while transportation accidents were associated with operational compliance and equipment maintenance for underground transport systems—both directly tied to the high-risk “excavation-transport” core processes of mining.
Table 3. Statistical distribution of accident types, casualties, and economic losses.
Major risk types included electromechanical accidents (39 samples, 7.3%), coal and gas outbursts (32 samples, 6.0%), gas explosions (29 samples, 5.5%), mechanical injuries (27 samples, 5.1%), and water inrushes (24 samples, 4.5%), collectively covering key safety domains including electromechanical equipment operation, gas hazard prevention, and water control. Dispersed minor accident types—over 30 types—accounted for ~40% of the total, reflecting the complexity of mining operations and the low but non-negligible occurrence of scenario-specific risks. A pie chart (Figure 5) illustrates the proportion of each accident type, with distinct colors for core, major, and minor types.
Figure 5. Type distribution of mine accidents.

3.1.4. Casualty and Economic Loss Characteristics

The severity of accident consequences was analyzed by separating casualty and economic loss metrics, using descriptive statistics to quantify variability and patterns. In terms of casualties, accidents were grouped by the number of deaths: general accidents (1–2 deaths, 349 samples, 65.6%), relatively large accidents (3–9 deaths, 85 samples, 16.0%), major accidents (10–29 deaths, 91 samples, 17.1%), and extraordinarily serious accidents (≥30 deaths, 7 samples, 1.3%). The most severe case was the “12·6” water inrush accident at Ruizhiyuan Coal Industry Co., Ltd. in Hongtong County, Linfen City, Shanxi Province (2010), which caused 70 deaths.
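The death-count bands can be written as a small classifier, and the reported shares can be rechecked from the reported counts. This sketch uses deaths only; the full accident_level rule also considers serious injuries and direct economic loss, which are omitted here, and treating zero deaths as "general" is an assumption.

```python
def level_from_deaths(deaths):
    """Map a death count to the accident-level bands quoted in the text."""
    if deaths >= 30:
        return "extraordinarily serious"
    if deaths >= 10:
        return "major"
    if deaths >= 3:
        return "relatively large"
    return "general"  # assumption: also used for zero deaths

# Counts per level as reported in the text; shares are recomputed from them.
reported = {"general": 349, "relatively large": 85, "major": 91, "extraordinarily serious": 7}
total = sum(reported.values())
shares = {level: round(100 * n / total, 1) for level, n in reported.items()}
print(total, shares)
print(level_from_deaths(70), level_from_deaths(2))
```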
For economic losses, only samples with valid direct economic loss records (424 samples) were analyzed, revealing a “middle-concentrated, two-end dispersed” pattern: the average direct economic loss was CNY 6.38 million (median CNY 2.09 million), with a minimum of CNY 3000 and a maximum of CNY 204.3 million. By interval, CNY 1–5 million accounted for the highest proportion (60.6%, 257 samples), followed by CNY >10 million (15.1%, 64 samples), CNY 0–1 million (13.7%, 58 samples), and CNY 5–10 million (10.6%, 45 samples). By accident level, extraordinarily serious accidents had the highest average direct loss (CNY 100.40 million), followed by major (CNY 26.26 million), relatively large (CNY 9.45 million), and general accidents (CNY 1.97 million); by accident type, slope collapses (CNY 106.67 million), explosions (CNY 68.47 million), and goaf collapses (CNY 41.34 million) had the highest average losses. The case with the most severe economic loss was the “2·22” extraordinarily serious collapse accident at Xinyuan Coal Industry Co., Ltd. in Alxa, Inner Mongolia (2023), with direct losses of CNY 204.3 million (20,430.2 in the dataset’s unit of CNY 10,000) and 53 deaths.
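Because economic_loss is stored in units of CNY 10,000 while the statistics here are quoted in CNY million, a unit conversion is needed before binning. The sketch below shows one way to do it; the interval boundary handling (which side a value of exactly 1, 5, or 10 million falls on) and the sample values are assumptions.

```python
from statistics import mean, median

WAN = 10_000  # economic_loss fields are recorded in units of CNY 10,000

def to_million(loss_wan):
    """Convert a loss recorded in CNY 10,000 into CNY million."""
    return loss_wan * WAN / 1_000_000

def loss_interval(loss_million):
    """Assign a direct loss (CNY million) to the four intervals used in the text."""
    if loss_million < 1:
        return "0-1"
    if loss_million < 5:
        return "1-5"
    if loss_million <= 10:
        return "5-10"
    return ">10"

# Hypothetical direct_loss values in CNY 10,000; None marks a missing record.
direct_losses = [0.3, 95.0, 209.0, 720.0, None, 20430.2]
valid = [to_million(x) for x in direct_losses if x is not None]
summary = {"n": len(valid), "mean": round(mean(valid), 2), "median": round(median(valid), 2)}
intervals = [loss_interval(x) for x in valid]
print(summary, intervals)
```

A single very large loss pulls the mean far above the median, which is the same skew visible in the reported CNY 6.38 million mean versus CNY 2.09 million median.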
Detailed statistical distributions of casualties and economic losses are summarized in Table 4, providing a quantitative foundation for risk severity assessment.
Table 4. Statistical distribution of casualties and economic losses.

3.2. Emergency Disposal and Accountability

Emergency disposal analysis, based on data from the disposal.emergency_measures and timeline fields, showed that 92.3% of accidents activated emergency plans, with mine rescue teams having an average response time of 47 min (shortened to 28 min for major accidents due to prioritized resource allocation). Common on-site disposal measures included “search and rescue for trapped persons” (86.5%), “hazard source isolation” (e.g., ventilation/power cutoff, 67.2%), and “on-site alert setup” (58.4%). For aftermath handling, 100% of cases included medical treatment or aftermath arrangements for casualties, with 89.1% explicitly documenting medical assistance and family consolation measures.
Accountability analysis, using data from the disposal.personnel_punishment, disposal.unit_punishment, and responsibilities.regulatory_units fields, revealed that among 734 punished individuals, criminal prosecution (356 persons) and administrative sanctions (359 persons) together accounted for over 97% of cases, reflecting strict liability for accidents causing casualties. Among the recorded punishments, 71.9% involved fines (average CNY 1.399 million), 12.8% involved orders to suspend work, and 13.0% involved suspension or revocation of licenses. Higher-loss cases corresponded to more severe penalties, consistent with a “loss-penalty matching” principle. Additionally, 86.1% of cases involved accountability for regulatory units, indicating strengthened oversight of regulatory responsibilities in casualty-related accidents.

3.3. Report Timeliness Analysis

Reporting behavior was analyzed using time-interval calculation and trend statistics based on the report_info field. Among 569 valid records, 25.3% (144 records) involved delayed reporting, 12.5% (71 records) involved concealed reporting, and 6.3% (36 records) involved both delay and concealment. For 355 samples with valid initial_report_time, the average delay was 5191.2 min (86.5 h), with a median of 45.0 min and extreme values ranging from −18.0 min (advance reporting due to rapid response) to 1,692,540.0 min (28,209 h, reflecting severe intentional delay).
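The time-interval calculation can be sketched as the difference between initial_report_time and accident_time, skipping records where either value is missing or unparseable (for example, a "not disclosed" marker). The nesting under report_info follows the text; the records themselves are hypothetical.

```python
from datetime import datetime
from statistics import mean, median

TIME_FORMAT = "%Y-%m-%d %H:%M"

def delay_minutes(record):
    """Minutes from accident_time to initial_report_time, or None if either is unusable."""
    try:
        occurred = datetime.strptime(record["accident_time"], TIME_FORMAT)
        reported = datetime.strptime(record["report_info"]["initial_report_time"], TIME_FORMAT)
    except (KeyError, TypeError, ValueError):
        return None
    return (reported - occurred).total_seconds() / 60

# Hypothetical records, including one missing report time and one extreme delay.
demo = [
    {"accident_time": "2021-05-03 14:20", "report_info": {"initial_report_time": "2021-05-03 14:50"}},
    {"accident_time": "2021-06-10 02:00", "report_info": {"initial_report_time": "2021-06-10 02:45"}},
    {"accident_time": "2022-01-01 08:00", "report_info": {"initial_report_time": "2022-01-11 08:00"}},
    {"accident_time": "2022-03-05 09:00", "report_info": {"initial_report_time": None}},
]
delays = [d for d in map(delay_minutes, demo) if d is not None]
print(mean(delays), median(delays))
```

One ten-day delay among three valid records is enough to push the mean into the thousands of minutes while the median stays at 45, which mirrors the gap between the reported mean (5191.2 min) and median (45.0 min).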
Temporal trends showed fluctuations rather than a linear downward pattern: 2010–2014 (delay: 11.5%, concealment: 6.2%), 2015–2019 (delay: 19.9%, concealment: 9.9%), 2020–2024 (delay: 33.0%, concealment: 16.2%), and 2025 (delay: 20.0%, concealment: 0%, 5 samples). Peaks included a 41.1% delay rate in 2022 and a 20.5% concealment rate in 2020, which were potentially linked to cyclical changes in mining capacity and phased variations in regulatory enforcement intensity. A line chart (Figure 6) visualizes the annual trends of delay and concealment rates.
Figure 6. Annual change trend of the timeliness of accident reporting.

4. Applications and Value of the Dataset

The dataset, characterized by its standardized structure and comprehensive coverage of mine accident information, provides substantial support for both academic research and practical safety management in the mine safety field. Its multi-dimensional and fine-grained data features address the long-standing gap of high-quality structured mine accident data in China, enabling in-depth exploration of safety laws and optimization of safety governance practices.

4.1. Academic Research Value

The dataset serves as a high-quality foundation for deepening theoretical research and innovating analytical methods in mine safety, with three key application directions. In the study of accident causation mechanisms, the clear distinction between direct (technical/operational) and indirect (managerial/institutional) causes in the causes field allows for the integration of analytical tools such as association rule mining and network analysis. This approach is grounded in systems thinking frameworks like STAMP [22], which emphasize hierarchical control structures and systemic failures. This integration facilitates the construction of a “human–machine-environment-management” coupled failure model. Using the direct_cause and indirect_cause data annotated in this dataset (e.g., 108 roof accidents with clear “insufficient support strength” as direct cause and “inadequate supervision” as indirect cause), association rule mining can quantify the correlation strength between variables, thereby revealing the core triggering paths of specific accidents—realizing the application of the dataset in causal mechanism research. This avoids the limitations of qualitative descriptions in traditional causal analysis and promotes the development of data-driven causal inference in mine safety research. Advanced techniques like SHAP value interpretation can further elucidate the contribution of each causal factor [31].
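As a rough illustration of how correlation strength between causes could be quantified, the sketch below computes support, confidence, and lift for a single rule "indirect cause → direct cause". It is not the authors' analysis: the cause pairs are hypothetical, and real direct_cause and indirect_cause strings would first have to be normalized into categories.

```python
def rule_metrics(pairs, direct, indirect):
    """Support, confidence, and lift for the rule: indirect cause -> direct cause."""
    n = len(pairs)
    n_direct = sum(1 for d, _ in pairs if d == direct)
    n_indirect = sum(1 for _, i in pairs if i == indirect)
    n_both = sum(1 for d, i in pairs if d == direct and i == indirect)
    support = n_both / n                                   # P(direct and indirect)
    confidence = n_both / n_indirect if n_indirect else 0.0  # P(direct | indirect)
    lift = confidence / (n_direct / n) if n_direct else 0.0  # confidence / P(direct)
    return support, confidence, lift

# Hypothetical (direct_cause, indirect_cause) pairs.
pairs = [
    ("insufficient support strength", "inadequate supervision"),
    ("insufficient support strength", "inadequate supervision"),
    ("insufficient support strength", "inadequate training"),
    ("gas accumulation", "inadequate supervision"),
    ("gas accumulation", "inadequate training"),
]
support, confidence, lift = rule_metrics(
    pairs, "insufficient support strength", "inadequate supervision"
)
print(round(support, 3), round(confidence, 3), round(lift, 3))
```

A lift above 1 means the direct cause appears more often alongside the indirect cause than it does overall; a full analysis would scan all candidate rules, for instance with the Apriori algorithm.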
In risk prediction research, the dataset’s integration of spatio-temporal features (from the location field), enterprise attributes (from the accident_unit field), and accident consequence indicators (from accident_category and casualties.death) provides a rich set of variables for building predictive models. Modern deep learning architectures and ensemble methods such as XGBoost [32] can leverage these multi-dimensional features: taking the spatio-temporal features (location province/city), accident_unit attributes, and historical accident types (accident_category) in the dataset as input variables, and the number of deaths (casualties.death) or accident type as target variables, a regional high-risk accident prediction model can be constructed. For example, using the Shanxi accident samples and their roof/transportation accident distribution data, the model can predict the high-risk periods and types of accidents in coal-producing areas—realizing the dataset’s application in risk prediction. This shift from “post-accident response” to “pre-accident prevention” enhances the scientific basis of mine risk management.
In the evaluation of safety policies, the dataset’s 15-year temporal coverage (2010–2025) enables longitudinal comparative analysis of accident characteristics before and after major safety initiatives. Quasi-natural experimental designs such as difference-in-differences (DID) can quantify policy impacts [33,34]. For instance, DID can be used to estimate the actual effectiveness of policies such as the “Three-Year Action Plan for Addressing Fundamental Issues in Mine Work Safety” and the “General Survey of Hidden Disaster-Causing Factors” in reducing accident rates and mitigating accident severity. This provides empirical evidence for optimizing policy design, adjusting implementation strategies, and improving the efficiency of safety governance.
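In its simplest two-group, two-period form, the DID estimate is the change in the mean outcome of the treated group minus the change in the mean outcome of the control group. The sketch below implements only this 2×2 case. How units are assigned to treated and control groups, and which outcome is used (for example, deaths per accident), are assumptions left to the analyst, and the estimate is only interpretable if the two groups would have followed parallel trends without the policy.

```python
from statistics import mean


def did_estimate(rows):
    """Two-group, two-period difference-in-differences.

    rows: iterable of (treated: bool, post: bool, outcome: float).
    Returns (treated_post - treated_pre) - (control_post - control_pre).
    """
    cells = {(t, p): [] for t in (False, True) for p in (False, True)}
    for treated, post, outcome in rows:
        cells[(bool(treated), bool(post))].append(outcome)
    # statistics.mean raises StatisticsError if any of the four cells is empty.
    m = {key: mean(values) for key, values in cells.items()}
    treated_change = m[(True, True)] - m[(True, False)]
    control_change = m[(False, True)] - m[(False, False)]
    return treated_change - control_change
```

A regression formulation with an interaction term between group and period gives the same point estimate in this case and additionally allows covariates and standard errors to be included.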

4.2. Practical Application Value

In practical mine safety scenarios, the dataset offers targeted guidance for mining enterprises, regulatory authorities, and emergency response teams. For mining enterprises: (1) in Shanxi, Gansu, and Inner Mongolia, enterprises should prioritize roof support inspections (roof accidents account for 20.3% of the sample) and can reference direct causes such as “insufficient support strength” in the dataset when setting weekly roof separation monitoring standards; (2) high-gas mines should train workers on gas detector calibration, gas detection techniques, and equipment maintenance, drawing on accident cases in which “equipment failure caused missed detection”; (3) enterprises with frequent coal bunker operations can reference cases of coal bunker collapse (attributed to inadequate equipment fixation) to refine procedures for blockage handling and regular equipment inspections. In this way, generic safety requirements are transformed into actionable, case-based prevention strategies.
For regulatory authorities, the spatial distribution patterns of accidents (derived from the location field) enable rational allocation of regulatory enforcement resources. For example, regions with a high concentration of roof accidents can prioritize inspections of roof support systems in underground working faces, while regions prone to gas accidents can focus on supervising gas monitoring and ventilation systems. This “region-specific, type-focused” supervision model improves the precision and efficiency of regulatory enforcement, avoiding the inefficiency of scattered, one-size-fits-all inspections.
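The region-specific, type-focused allocation described above reduces to a cross-tabulation of accident type by region. The following sketch finds the most frequent accident category in each province; the nesting `location.province` is an assumption based on the field names used in the text.

```python
from collections import Counter, defaultdict


def top_category_by_province(records):
    """Map each province to (most frequent category, its count, province total).

    Assumes (hypothetically) records shaped like
    {"location": {"province": ...}, "accident_category": ...}.
    """
    counts = defaultdict(Counter)
    for r in records:
        counts[r["location"]["province"]][r["accident_category"]] += 1

    result = {}
    for province, by_category in counts.items():
        category, n = by_category.most_common(1)[0]
        result[province] = (category, n, sum(by_category.values()))
    return result
```

Because the sample over-represents some accident levels (see Section 5), such counts are better read as indications of inspection priorities than as estimates of true regional accident rates.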
In terms of emergency capacity building, the timeline field—recording key nodes in the accident response process, such as rescue team arrival time and multi-department coordination links—provides a basis for optimizing emergency plans. For instance, the average rescue response time can be used to set benchmarks for emergency response efficiency; successful and failed on-site disposal cases can be integrated into emergency training materials, shortening the learning curve for emergency teams and enhancing their on-site disposal capabilities.
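A response-time benchmark of this kind could be computed as sketched below. The internal structure of the timeline field is not specified in this section, so the sketch assumes, purely for illustration, that it is a mapping with two ISO 8601 timestamps under the hypothetical keys `accident_time` and `rescue_arrival_time`; records lacking either timestamp are skipped rather than imputed.

```python
from datetime import datetime
from statistics import mean


def mean_response_minutes(records,
                          start_key="accident_time",
                          arrival_key="rescue_arrival_time"):
    """Average minutes between accident occurrence and rescue team arrival.

    Both key names are hypothetical placeholders for whatever the
    timeline field actually contains.
    """
    delays = []
    for r in records:
        timeline = r.get("timeline") or {}
        start = timeline.get(start_key)
        arrival = timeline.get(arrival_key)
        if not start or not arrival:
            continue  # skip incomplete timelines
        delta = datetime.fromisoformat(arrival) - datetime.fromisoformat(start)
        delays.append(delta.total_seconds() / 60)
    return mean(delays) if delays else None
```

If the timeline is stored as free text or as a list of events, a parsing step would be needed before this calculation applies.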

4.3. Data Access and Usage Guidelines

To ensure the dataset’s reusability while maintaining compliance and traceability, detailed access and usage specifications are established as follows. The dataset is stored in a single JSON Lines (JSONL) file named mine_accidents.jsonl, which integrates all structured accident records, including multi-dimensional fields such as basic accident information, spatio-temporal characteristics, causal chains, and disposal and responsibility details. For ease of transmission and storage, the file is packaged into a compressed ZIP file titled mine_accidents.zip, which can be decompressed using standard file compression tools to access the core data.
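The packaging described above can be read without manual decompression. The following sketch streams mine_accidents.jsonl directly from mine_accidents.zip using only the Python standard library; it assumes that the JSONL file sits at the root of the archive and is UTF-8 encoded, neither of which is stated explicitly here.

```python
import io
import json
import zipfile


def load_records(zip_path, member="mine_accidents.jsonl"):
    """Read one JSON object per non-empty line from a JSONL file inside a ZIP."""
    records = []
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Wrap the binary stream so lines are decoded as UTF-8 text (assumed).
            for line in io.TextIOWrapper(raw, encoding="utf-8"):
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return records
```

A typical call would be `records = load_records("mine_accidents.zip")`, after which each element is a dictionary holding the fields of one accident report.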
In terms of usage compliance, the dataset is limited to non-commercial purposes, including academic research, mine safety management optimization, and safety policy effectiveness evaluation; secondary sale, unauthorized distribution, or use for illegal activities is strictly prohibited. When using the dataset to produce research outputs, the source must be clearly cited as “China Mine Accident Report Dataset (JSON Lines Format), obtained from the research team’s public dataset.” If the original data is processed to form derived datasets, the original data source and detailed processing workflow must be explicitly stated to ensure data traceability and avoid misrepresentation.
The dataset will be maintained through an annual update mechanism: each year, new accident cases will be added, and missing fields in historical records will be supplemented based on newly accessible official documents. Information on the latest version of the dataset, including update logs and access links, will be published through the research team’s official channel to ensure its long-term timeliness and practical value.

4.4. Cause–Action Mapping Analysis

To strengthen the link between statistical patterns and actionable interventions, we analyzed the causal pathways leading to high-frequency accident types and mapped them to specific prevention measures. Table 5 presents a cause–action mapping matrix for the four most common accident types.
Table 5. Cause–action mapping matrix for high-frequency accident types.
This mapping directly translates statistical findings into practice: for instance, the dominance of “insufficient support strength” in roof accidents (40.71%) justifies the mandatory deployment of real-time monitoring, while the near-universal presence of “safety management deficiency” across all types (76–92%) underscores the need for systemic organizational reforms beyond technical fixes.

5. Limitations

Despite systematic collection and standardized processing, this dataset exhibits three notable limitations due to the characteristics of public information disclosure and the nature of its data sources. First, there is a bias in sample representativeness: official and media disclosures of mine accidents focus predominantly on major and relatively large accidents, while detailed reports on general accidents have low coverage in public channels. This leads to a discrepancy between the proportion of general accident samples in the dataset and the actual occurrence distribution of mine accidents, potentially underestimating the risk characteristics of minor accidents and thus affecting the comprehensive assessment of overall mine safety risk levels and evolutionary patterns of accidents with different severity levels.
Second, there are cross-year differences in field completeness: key fields such as economic_loss and causes.indirect_cause have high missing rates in early accident reports (2010–2014). This is directly attributed to the low standardization of early mine accident reports, where information disclosure focused on basic accident details rather than in-depth analysis, restricting longitudinal comparative studies across the full time span—particularly limiting the analysis of long-term trends such as the evolution of accident causes and changes in economic loss scales.
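Users can check field completeness for their own period of interest before running longitudinal analyses. The sketch below computes the missing rate of a dotted field path (such as causes.indirect_cause) per group, where the grouping function is supplied by the caller. The top-level `year` key in the usage note is a hypothetical placeholder, since the name of the date field is not given in this section.

```python
from collections import Counter


def get_path(record, dotted):
    """Follow a dotted path such as 'causes.indirect_cause'; None if absent."""
    value = record
    for key in dotted.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def missing_rate_by_group(records, dotted, group_of):
    """Share of records per group whose field is None, empty string or empty list."""
    totals, missing = Counter(), Counter()
    for r in records:
        group = group_of(r)
        totals[group] += 1
        if get_path(r, dotted) in (None, "", []):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}
```

For example, `missing_rate_by_group(records, "economic_loss", lambda r: "2010-2014" if r["year"] <= 2014 else "2015-2025")` would compare the early and later periods, assuming each record carries a numeric `year`.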
Third, the precision of spatial information needs improvement: the location.specific_location field for some accident cases is only accurate to the mine name or main production area, without specification of specific working faces or equipment points. This hinders in-depth analysis of accident triggers at the micro-spatial scale and limits the expansion of application scenarios such as spatial risk mapping.
To address these limitations, future improvements will focus on three aspects. First, sample acquisition channels will be expanded by establishing cooperation with local mine safety supervision departments and emergency management agencies to obtain unpublished general accident files and internal investigation reports, thereby improving the coverage of accidents of different severity levels and reducing representativeness bias. Second, early-stage data fields will be completed: for key information missing in the 2010–2014 records, original materials such as historical ledgers of local supervision departments and accident rectification and acceptance reports of mining enterprises will be consulted, or reasonable imputation will be performed by linking to concurrent industry statistics and the characteristics of similar accident cases, to enhance the consistency and completeness of cross-year data. Third, spatial information collection standards will be refined: future data updates will explicitly require the recording of specific working face numbers, equipment models, and relative positions of accident sites, mark spatial coordinates based on mine excavation engineering plans, and supplement environmental information such as working face geological conditions to support micro-scale accident risk analysis and spatial distribution research.

6. Conclusions

This study used officially published Chinese mine accident reports from 2010 to 2025 as the data source. Data were collected via targeted web crawling and manual retrieval, followed by standardized cleaning, structured annotation (covering 17 core dimensions and 43 subfields), and quality validation—resulting in the construction of a standardized mine accident dataset with 532 samples. The dataset covers 27 provinces/autonomous regions/municipalities in China, is stored in JSON Lines (JSONL) format, and systematically integrates full-process accident information, addressing the gap of a full-element structured database in China’s mine safety field.
Statistical analysis revealed key patterns. Temporally, the sample distribution shows an increasing trend over time due to retrieval availability, whereas data from the National Mine Safety Administration indicate an average annual decrease of 9.7% in mine accident counts from 2018 to 2022; accidents are concentrated in the second and third quarters. Spatially, accidents are concentrated in core mineral-producing regions, with Shanxi, Gansu, and Inner Mongolia Autonomous Region accounting for 11.7%, 8.3%, and 7.7% of the sample, respectively. In terms of accident types and consequences, roof accidents (20.3%) and transportation accidents (15.3%) are the main risk types, with an average of 4.6 deaths per accident and an average direct economic loss of CNY 6.38 million for samples with recorded losses. High-loss accidents correspond to severe accountability, with 71.9% of involved units fined (average fine: CNY 1.399 million). In terms of reporting and disposal, 25.3% of cases involved delayed reporting, 12.5% involved concealed reporting, and 92.3% of accidents activated emergency plans.
The dataset has both academic and practical value: academically, it supports causal mechanism modeling, risk prediction, and policy evaluation, breaking through the bottleneck of small-sample research; practically, it provides prevention templates for mining enterprises, supports regulatory authorities in optimizing the allocation of enforcement resources, and offers a basis for emergency plan improvement. In the future, expanding sample channels, completing early-stage fields, and refining spatial information will further enhance the dataset’s completeness and facilitate the transition of mining safety management from “experience-driven” to “data-driven” practices.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/data10120202/s1, Dataset S1: mine_accidents.jsonl.

Author Contributions

Conceptualization, M.W. and H.W.; methodology, H.L.; software, H.G. and J.H.; validation, M.W., H.L., and H.W.; formal analysis, J.H.; investigation, J.H.; resources, M.W.; data curation, H.G. and J.H.; writing—original draft preparation, M.W.; writing—review and editing, H.W.; visualization, J.H.; supervision, H.W.; project administration, H.L.; funding acquisition, H.L. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 52304169, and the CCTEG Technology Innovation and Entrepreneurship Fund, grant number 2024-TD-ZD014.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are publicly available at https://doi.org/10.5281/zenodo.17388738 (uploaded on 19 October 2025).

Acknowledgments

During the preparation of this manuscript, the authors used DeepSeek-R1 (version 250120) and ChatGPT-4o (version 250326) for the purposes of text translation and text polishing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

Authors Maoquan Wan, Hao Li, Hao Wang, and Hanjun Gong were employed by the company Tiandi Science and Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare no conflict of interest.

References

  1. Zhu, R.; Lin, B. Energy and Carbon Performance Improvement in China’s Mining Industry: Evidence from the 11th and 12th Five-Year Plan. Energy Policy 2021, 154, 112312.
  2. Wang, J.; Xue, Y.; Xiao, J.; Shi, D. Diffusion Characteristics of Airflow and CO in the Dead-End Tunnel with Different Ventilation Parameters after Tunneling Blasting. ACS Omega 2023, 8, 36269–36283.
  3. Zhu, M.; Xie, G.; Liu, L.; Wang, R.; Ruan, S.; Yang, P.; Fang, Z. Strengthening Mechanism of Granulated Blast-Furnace Slag on the Uniaxial Compressive Strength of Modified Magnesium Slag-Based Cemented Backfilling Material. Process Saf. Environ. Prot. 2023, 174, 722–733.
  4. Li, X.; Cao, Z.; Xu, Y. Characteristics and Trends of Coal Mine Safety Development. Energy Sources Part A Recovery Util. Environ. Eff. 2025, 47, 2316–2334.
  5. Wu, B.; Wang, J.; Zhong, M.; Xu, C.; Qu, B. Multidimensional Analysis of Coal Mine Safety Accidents in China—70 Years Review. Min. Metall. Explor. 2023, 40, 253–262.
  6. Ismail, S.N.; Ramli, A.; Aziz, H.A. Influencing Factors on Safety Culture in Mining Industry: A Systematic Literature Review Approach. Resour. Policy 2021, 74, 102250.
  7. Stemn, E.; Amoh, P.O.; Joe-Asare, T. Analysis of Artisanal and Small-Scale Gold Mining Accidents and Fatalities in Ghana. Resour. Policy 2021, 74, 102295.
  8. Liu, C.; Yang, S. Using Text Mining to Establish Knowledge Graph from Accident/Incident Reports in Risk Assessment. Expert Syst. Appl. 2022, 207, 117991.
  9. Noraishah Ismail, S.; Ramli, A.; Abdul Aziz, H. Research Trends in Mining Accidents Study: A Systematic Literature Review. Saf. Sci. 2021, 143, 105438.
  10. Huang, X.; Li, W.; Wang, C. Analysis of China’s Coal Resource Safety and Enterprise Performance; Atlantis Press: Dordrecht, The Netherlands, 2022; pp. 416–424.
  11. Meng, F. Safety Warning Model of Coal Face Based on FCM Fuzzy Clustering and GA-BP Neural Network. Symmetry 2021, 13, 1082.
  12. Gilje, E.P.; Wittry, M.D. Is Public Equity Deadly? Evidence from Workplace Safety and Productivity Tradeoffs in the Coal Industry. NBER Work. Pap. 2021, 28798.
  13. Mine Safety and Health Administration (MSHA). 30 CFR Part 50—Notification, Investigation, Reports and Records of Accidents, Injuries, Illnesses, Employment, and Coal Production in Mines. Available online: https://www.ecfr.gov/current/title-30/part-50 (accessed on 23 November 2025).
  14. Regulation (EU) 2023/988 of the European Parliament and of the Council of 10 May 2023 on general product safety. Available online: https://eur-lex.europa.eu/eli/reg/2023/988/oj (accessed on 23 November 2025).
  15. IEC 80601-2-30:2018. Available online: https://www.iso.org/standard/70653.html (accessed on 23 November 2025).
  16. Leveson, N.; Dulac, N.; Zipkin, D.; Cutcher-Gershenfeld, J.; Carroll, J.; Barrett, B. Engineering Resilience into Safety-Critical Systems. In Resilience Engineering; CRC Press: Boca Raton, FL, USA, 2006.
  17. Tong, B.; Liu, H.; Zhu, J.; Wang, Y.; Mei, T.; Kou, M. Exploring Safety Research Progress and Prospects for the Sustainable Development of Resilient Cities. Buildings 2025, 15, 505.
  18. Park, J.; Kang, D. Artificial Intelligence and Smart Technologies in Safety Management: A Comprehensive Analysis Across Multiple Industries. Appl. Sci. 2024, 14, 11934.
  19. Xu, Z.; Saleh, J.H. Machine Learning for Reliability Engineering and Safety Applications: Review of Current Status and Future Opportunities. Reliab. Eng. Syst. Saf. 2021, 211, 107530.
  20. Li, H.; Zhang, Y.; Yang, W. Gas Explosion Early Warning Method in Coal Mines by Intelligent Mining System and Multivariate Data Analysis. PLoS ONE 2023, 18, e0293814.
  21. Qiu, Z.; Liu, Q.; Li, X.; Zhang, J.; Zhang, Y. Construction and Analysis of a Coal Mine Accident Causation Network Based on Text Mining. Process Saf. Environ. Prot. 2021, 153, 320–328.
  22. Leveson, N.G. Engineering a Safer World: Systems Thinking Applied to Safety; The MIT Press: Cambridge, MA, USA, 2012.
  23. Hollnagel, E. FRAM: The Functional Resonance Analysis Method: Modelling Complex Socio-Technical Systems; CRC Press: London, UK, 2017; ISBN 978-1-315-25507-1.
  24. Tixier, A.J.-P.; Hallowell, M.R.; Rajagopalan, B.; Bowman, D. Automated Content Analysis for Construction Safety: A Natural Language Processing System to Extract Precursors and Outcomes from Unstructured Injury Reports. Autom. Constr. 2016, 62, 45–56.
  25. Dong, L.; Wang, J.; Wang, J.; Wang, H. Safe and Intelligent Mining: Some Explorations and Challenges in the Era of Big Data. J. Cent. South Univ. 2023, 30, 1900–1914.
  26. Zhang, S.; Liu, T.; Wang, C. Multi-Source Data Fusion Method for Structural Safety Assessment of Water Diversion Structures. J. Hydroinformatics 2021, 23, 249–266.
  27. Hu, H.; Feng, Y.; Wang, C.; Wang, Z.; Ma, X.; Wu, P.; Dong, W.; Yan, Q. AugMine: Boosting Coal Mine Accident News Classification with Text Data Augmentation. In Proceedings of the International Conference on Image, Vision and Intelligent Systems 2024 (ICIVIS 2024), Xining, China, 16–17 June 2024; You, P., Zheng, Y., Eds.; Springer Nature: Singapore, 2025; pp. 1–22.
  28. Gao, F.; Zhang, L.; Wang, W.; Zhang, B.; Liu, W.; Zhang, J.; Xie, L. Named Entity Recognition for Equipment Fault Diagnosis Based on RoBERTa-Wwm-Ext and Deep Learning Integration. Electronics 2024, 13, 3935.
  29. Artstein, R.; Poesio, M. Inter-Coder Agreement for Computational Linguistics. Comput. Linguist. 2008, 34, 555–596.
  30. Hripcsak, G.; Rothschild, A.S. Agreement, the F-Measure, and Reliability in Information Retrieval. J. Am. Med. Inf. Assoc. 2005, 12, 296–298.
  31. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30.
  32. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
  33. Dougherty, C.; Jorgenson, D.W. International Comparisons of the Sources of Economic Growth. Am. Econ. Rev. 1996, 86, 25–29.
  34. Ashenfelter, O.; Card, D. Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs. Rev. Econ. Stat. 1985, 67, 648–660.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
