Near Miss Archive: A Challenge to Share Knowledge among Inspectors and Improve Seveso Inspections

Abstract: In the European Seveso legislation for the control of major accident hazards (Directive 2012/18/EU), the safety management system (SMS) is an essential obligation for managers, and the authorities are required to verify its adequacy through periodic inspections at Seveso sites. One of the pillars of the SMS is the collection and analysis of documents on accidents, near misses, and possible anomalies, in order to identify weaknesses and implement continuous improvement. In Italy, for a few years, the documents gathered by the inspectors from all Italian Seveso sites have been archived and used for research purposes. The archive currently contains some 4000 reports, collected in 5 years by some 100 inspectors throughout Italy. This paper discusses in detail the challenges faced in extracting the knowledge hidden in the documents and making it usable through the design of a robust model. To this end, machine learning techniques have been used to preprocess the reports and extract the concepts and their relations, which are organized into an entity-relation model. The effectiveness and potential of this methodology are illustrated by investigating a few hot topics using the information contained in the repository.


Introduction
The analysis of near misses is a need in those sectors, including aeronautics and shipping, nuclear, chemical, and petrochemical, where accidents are not frequent but, when they do occur, have catastrophic consequences [1]. Thus, the study of near misses and anomalies provides an opportunity to recognize unsafe conditions or situations and prevent incidents [2]. In the literature there are several definitions of near miss, depending on the context, but all of them share the general meaning that a near miss is an unplanned and unexpected event that did not result in injury, illness, damage, or environmental harm, but had the potential to do so.
The context considered by this research is the process industry, including the establishments under the European Seveso Directive on the control of major accident hazards. The Seveso III Directive 2012/18/EU [3] requires the implementation of a safety management system (SMS) for establishments handling hazardous substances. One of the pillars of the SMS is to gather and analyze reports of accidents, incidents, near misses, and anomalies that occurred in the establishment, with the aim of pointing out the weaknesses in the SMS so that it can be reviewed and improved.
In Italy, there are about a thousand establishments under Seveso legislation, and over half of them are upper tier. They cover many industrial sectors, including oil and gas, petrochemical, chemical, metal processing, pharmaceuticals, explosives, and pyrotechnics.
Since 1999, the Italian Competent Authorities adopted a standard guide for the inspections at establishments under Seveso legislation, based on a detailed checklist. The Italian inspectors are also required to discuss the anomalies, near misses, and incident reports provided by the operators, with the goal of prioritizing the scrutiny of the points in the checklist.
In Italy, the latest implementation of the Seveso Directive, in July 2015, includes a form in which the operator records anomalies, near misses, and accidents. The set of documents is called operative experience. Each document contains the description of the events that occurred in the establishment in the last ten years, the analysis that identifies the critical points of the SMS, and the solutions adopted for safety improvement.
The practice of analyzing near miss events for improving the efficiency of the SMS, presented years ago by few pioneers [4], has been adopted in a systematic mode by the Inail research group. Since 2015, this group, in which researchers are also Seveso inspectors, started to collect the operative experiences, sent by Inail inspectors who operate throughout the national territory. The reports are organized and uploaded into a central archive and managed for research purposes.
The system of Seveso inspections is part of the system for national industrial safety, therefore, without going into details of the SMS of a single establishment, the near miss repository represents one of the rare cases of a national collection of this type of document, without being mandatory.
The systematic activity of collecting those reports has significantly increased the number of documents in the operative experience repository. The reports have heterogeneous content, because they describe different types of events (i.e., anomalies, near misses or incidents) and deal with diverse categories of equipment, substances, processes, and working activities.
The definition of near miss is very general. In Seveso context, the Italian Standard UNI 10617:2019 [5] gives the following definition: "major near miss is any extraordinary event that could have turned into a major accident. The difference between a major accident and a major near miss does not lie in the causes or modalities of evolution of the event, but only in the different degree of development of the consequences or in the randomness of the presence of items or people".
In some cases, that definition matches with events that may be a precursor of accident [6] i.e., an item in known incidental chain, and their analysis allows the investigation of the causes that triggered the incidental mechanism, usually hidden from relevant consequences. The near misses, reported in the application of the above-mentioned Directive, also include minor incidents without injuries, anomalies, malfunctions, and deviations from the normal operation of the equipment or processes. A few of these occurrences would appear meaningless, but their analysis can contribute to discovering safety weaknesses related to new or emerging risks. This highlights another point of view for near miss definition, as "an opportunity for surveillance and risk reduction" [2], to improve human, environmental, and process safety [7].
Based on the Seveso near miss repository, several types of study may be developed, from the statistical analysis on occurrences to the extraction of information up to learning lessons. The inspector may exploit this archive to improve the inspection, also relying, in addition to his/her own expertise, on operational experiences that have occurred in establishments similar to the one being inspected. On the other hand, researchers can address their research activities to more recurrent issues for improving the solutions, but also face new or emerging risks which have already occurred, although not frequently. The objective of this paper is to describe the methods adopted to extract knowledge from this repository and provide the inspectors and researchers with information or pieces of insight.
The challenge, which is also the objective of this study, is to extract the knowledge contained in the documents and make it usable. Thus, it is first necessary to have tools capable of managing texts in natural language. Among the information technology systems adopted for processing natural language, text mining (TM) is the most appropriate method to automatically analyze and classify the parts of speech. Recently, more advanced techniques of artificial intelligence (AI), including machine learning (ML) processes, have greatly improved the automated analysis of large amounts of data.
This article is structured as follows. Section 2 describes the background addressing the management of the incident and near miss reports and different approaches for extracting knowledge. Section 3 explains the objective of this research. Section 4 details the peculiarities of the near miss archive considered and describes the methods, including text mining, machine learning, and artificial intelligence functionalities, adopted for managing unstructured natural language text and extracting knowledge into an entity-relation model. In Section 5, a few use cases describe the results of the model application and related discussions. Sections 6 and 7 provide a general discussion and few concluding remarks, respectively.

Background
For many years now, the reports of major accidents occurring in the chemical process industries have been systematically analyzed to learn from what happened and avoid the repetition of the same conditions and unfavorable situations. Thus, there are open databases that gather the accident reports occurred in industrial chemical sectors.
The Major Accident Reporting System (eMARS) [8], established under the EU's Seveso Directive 82/501/EEC in 1982, is the official European database; it collects accidents, incidents, and near misses that occurred in European Seveso establishments. The ARIA (Analysis, Research, and Information on Accidents) database (in French and English), organized by the Bureau for Analysis of Industrial Risks and Pollutions in France, "catalogues incidents or accidents that were, or could have been, deleterious to human health, public safety or the environment" [9]. Other sites collecting industrial accidents are the following: Zema, the German database of major accidents and incidents [10], the Chemical Safety Board in the United States [11], the Tukes Varo registry in Finland [12], and the Japanese Failure Knowledge database (in Japanese and English) [13].
The study of major accidents aims at understanding and, above all, learning as much as possible from what happened, but, unfortunately, some causes could be covered by relevant consequences.
Looking at near misses as potential accidents intercepted and interrupted by chance, luck, or skill provides the opportunity to analyze them with respect to the safety management system (SMS), for identifying its weak points. Moreover, the lack of consequences guarantees a more open narrative without fear of repercussions. Thus, in complex sectors, including chemical and petrochemical industries, where major accidents can be catastrophic events with serious consequences for human health, environment, and assets, the analysis of near misses and anomalies is strongly encouraged.
Some issues in near miss management are common to different sectors; in [14], the authors describe the difficulties of identifying and reporting the events in shipping, a high-risk industry, including the lack of time to report them or a reporting format that is too complex to fill in. The first issue is also present in Seveso industries; therefore, the repository contains a variety of diverse near misses, from expected failures (e.g., pump, gasket) to the loss of hazardous substances that required emergency interventions.
The study and analysis of near misses leads to the rediscovery of hidden or forgotten knowledge and the learning of how to improve safety [15]. Although the analysis of near misses is an adopted and consolidated approach in major hazard industries, it is also developing in other sectors where the frequency of severity is very high, e.g., the construction sector. Thus, [16] explores the potentiality of using near miss information for improving construction safety performance; the objective is to fill the gap from the theoretical definition of near miss and its practical understanding to better identify the causes and its process management.
The accident reports (e.g., those contained into eMARS database) usually contain keywords for better classification of the events, their causes, and their consequences; they are written by experts in a technical language that is understandable to all the community with the aim to learn the lessons and avoid recurrence of the same events. By reading the narrative of an accident, however, human experts are able to even extract information and concepts not explicitly represented by the keywords, but contained within the text, including cause-effect relationships.
Natural language processing (NLP) applications aim to extract information contained in the accident story. Single et al. [17] use NLP techniques to extract information from the eMARS accident database and insert it into an ontology, used to represent the knowledge base for inquiring purposes. In [18], the authors described their approach to extracting meaning from multilingual free-text safety incident reports in railway transport. The approach is to import the text into a graph database and connect it with an ontology for managing multiple languages using NLP techniques. Nakata [19] proposes a method for recognizing the typical flow of events in a large set of accident reports, by adopting text-mining capabilities, focusing on adjacent sentences, and extracting the pair of predecessor and successor words that characterize the flow.
Near miss reports are quite different from accident reports: although they deal with similar matters, they may lack a systematic view of the events. In fact, near misses are detected by workers and usually registered by the supervisor or operator; they describe the facts and the direct causes, the actions taken, and those that are planned. The objective of their analysis is also to learn the lessons, so that workers can correct unsafe conditions and situations and report critical issues of daily operations, as described by Bragatto et al. [20].
Although in some reports of large organizations, predefined keywords also appear, which is useful for organizing statistical frequency analyses more quickly, the textual story always remains the most interesting and complete aspect. For this reason, the analysis of the near misses requires tools able to interpret and process the natural language used for their description. The usual methods adopted were based on predefined taxonomies of the most important concepts involved, including substance, equipment, people, and process activity. Thus, the near miss analysis tried to classify the event and representing elements with the items contained in the taxonomies [21].
Using the same items for representing both accidents and near misses has been a method of understanding and measuring the distance of near misses from major accidents. Ansaldi et al. [22] described how to apply similarity techniques to documents, for measuring the semantic distance of near misses with respect to an accident report. This method is applicable and effective when the managed data are few and homogeneous, that is, coming from the same plant or the same industrial sector, so that the taxonomies can be defined a priori because the reports use the same terminology.
Two changes in near miss management have made this a priori method difficult to adopt. One is the cultural change from a blame approach toward safety awareness, that is, from an attitude of hiding negative events toward a greater sensitivity to recording all anomalies, deviations from normal situations, and near misses, also highlighting the positive aspects that stopped their escalation; this has greatly increased the number of reports. The other is that near misses, because they can be a means of identifying weaknesses in the safety management system, are aimed mainly at the workers of the plant; thus, they use the language and jargon common to the sector and to the plant itself.
On the other hand, for extracting valuable information and sharing it among the stakeholders, it would be efficient to arrange all reports, recorded from each establishment, into a single repository. In this way, however, the "a priori" definition of taxonomies would require continuous updates, additions, and checks of new terms or synonyms used in several industrial sectors and in different jargons, with the risk of omitting important terminology.
In recent years, artificial intelligence (AI) techniques, including machine learning, have become more effective at handling huge amounts of data, including in the NLP field, which involves text data such as technical documentation, and are spreading in many industrial sectors. The AI techniques, especially machine learning (ML) technology, are used for extracting information from a bulk of data; thus, together with text-mining tools, they can work on a large number of documents to elicit knowledge.
Cheng et al. [23] showed a comparison of different algorithms, based on NLP techniques, for extracting knowledge from construction accident reports and classifying the narratives. Arteaga et al. [24] addressed the issue of analyzing reports on the severity of traffic crashes and extracting meaningful information for developing safety countermeasures. Kurian et al. [25] applied ML and keyword analysis for defining a customized library that more efficiently supports ways to report incidents; special attention is paid to those with minor consequences that often lack details useful for understanding the causes. Paltrinieri et al. [26] suggested a risk assessment approach that is based on machine learning techniques, including a deep neural network model. Xu and Saleh [27] provided a detailed overview of different ML categories and corresponding models and algorithms used, reviewing ML applications in reliability and safety applications. They also give a rough definition of ML, as a data analysis method that iteratively learns from past data and adapts independently when applied to new data. The authors also "believe a most promising application of ML is in unleashing its power to harvest more value from near miss data and other accident databases for ultimately improving accident and occupational injury prevention".

Objectives
The first aim of this work is to provide the Seveso inspectors with a valuable knowledge resource, to improve the quality of the Seveso inspections. The sharing of what happened in different establishments, for example, to the same type of equipment, or in the same process with a certain substance although with the involvement of different parts of an equipment, has many advantages. Among these, it allows inspections to be carried out according to a more homogeneous criterion throughout the national territory. It also allows us to identify the possible solutions to be adopted by considering those that have given a better outcome or those that are more adequate in similar situations, highlighting which barriers have worked best and which ones proved to be lacking.
A more general goal is to understand the weakness of the safety management in Seveso industries, and address efforts in those directions. It has been at least a couple of decades that near miss analysis has been used to better understand whether the risk assessment, that has been carried out in the plant, actually includes all the possible triggers that can lead to the top event and then to a possible accident scenario. On the other hand, an event, whose probability of occurrence was considered too low to be included in the incidental chain, has instead proved, by the analysis of operative experiences, to be more plausible. In this case, the most remarkable feedback is the improvement of safety procedures and the update of risk analysis by extending it, where appropriate, with the examination of new critical items.
With the evolution of technologies, operative contexts and working methods, even the risks, whether traditional or emerging, face a change. Indeed, the traditional risks are affected by the new context, while emerging risks represent a novelty. In both cases, an accurate analysis of what is reported in the near miss reports certainly allows an early identification of elements that could represent unexpected hazards up to that moment.

Methods
The method adopted is text mining with machine learning capabilities to extract concepts and model them into a knowledge representation.
The aim is not only to identify the terms but also to recognize their semantic meaning and, above all, their relationships. The semantic recognition of the single words is not sufficient to understand the story; in fact, a term may be present in the document without having a direct role in the event. For instance, in the phrase "leakage from the drainage valve of the suction pump used for the tank", the tank is not a primary term in the description of the near miss, so it would not be correct to count such an event as an occurrence of the tank entity.
The definition of the model for representing the knowledge of the near misses is the core of this research. In this model, called EsOpIA (Operative Experience and Artificial Intelligence), the text mining and the machine learning techniques provide the capabilities for the extraction and classification of the concepts, and modeling them into the knowledge base. The following subsections describe the EsOpIA model (entities and relations) and the AI techniques applied for extracting the knowledge.

Operative Experience Reports
The operative experience documents, collected during the Italian Seveso inspections, tell about the anomalies, near misses, and minor incidents that occurred at the establishment during the previous decade.
Each report, written in natural language, i.e., Italian, adopts the standard format provided by the Italian Seveso legislation, whose fields contain information related to the description of the event, the recovery activities undertaken, and the follow-up actions. The description is the narrative of the event occurred and highlights the substance and equipment involved, the technical devices or the procedures that failed or were misapplied, as well as those that stopped the escalation of the occurrence, avoiding the consequences or mitigating their effect.
In spite of using the same format, the reports are compiled differently for the accuracy of the description and the detailed information recorded. The interpretation of the operative experience concept is also different from one establishment to another. At a few establishments, only the release of hazardous substances is recorded; in other cases, anomalies, unsafe conditions, and situations are detected, as well as those not related to major accident hazards. This diversity represents truthful pictures of the events occurred into establishments and depots but increases the complexity of extracting knowledge from those reports.
The valuable information in those documents is in the story itself: in a few sentences, the report tells what happened, which elements were involved (equipment, substances, people), what failed, and what succeeded. Therefore, for this research, knowledge extraction means representing the story in a mathematical model.

Entities
Since the model must represent the story contained in each near miss document, its definition has to reflect the concepts that best describe the facts. The concepts were identified by answering simple questions, including: what, when, and where did it happen? Who and what were involved? What stopped the escalation, and what failed?
The first question identifies the key elements to identify what happened and give it a place in space (where?) and time (when?). Thus, the entities are, respectively, event, industrial sector, and date. The second question identifies the persons, who were involved in the story (entity people), as well as the equipment or a part of it (apparatus) concerned with the event, the substance involved, and, eventually, the type of work (activity) undertaken when the event occurred. During the Seveso inspections, the operator provides the inspectors with the list of technical and organizational measures (barrier) adopted for preventing the accidents and for mitigating the consequences when undesired events occurred. Thus, the third question refers to an entity barrier that failed or succeeded in the near miss occurrence.
The identified concepts are used for classifying the terms extracted from the documents. Indeed, they have a broad meaning and often would require further specification for a more effective knowledge representation. On the other hand, a more detailed specification could make the machine learning applicability difficult in the classification process; therefore, a balance between keeping some details and ensuring the success of ML is the strategy adopted.
Therefore, the entities, including substance, people, and activity, are not further specified, while apparatus, barrier, and event are classified into several subclasses. The apparatus is subdivided into equipment and its parts, i.e., component; thus, tank is classified as equipment, while flange is a component. Barrier is split into two subsets, the technical and the organizational barriers. Thus, the level gauge, a safety physical device, is a technical barrier, while permit to work and instructions for loading/unloading operations are procedures or technical instructions classified into an organizational barrier.
The event entity is the core of the story of a near miss; without it, the near miss or anomaly would not exist. Therefore, its subdivision into several classes provides a greater and more useful specification of the concepts. The subclass loss collects the terms related to a loss of containment, e.g., leakage, overfilling, overflow; failure gathers the mentions dealing with any breakdown, malfunctioning, or damage of machinery or devices, but also wrong behaviors or errors in working activities. The defects of equipment that would cause integrity problems (e.g., corrosion, erosion, pitting, holes) or lower efficiency (e.g., occlusion, lack of elasticity, fouling) are grouped into the deterioration subclass. The near miss archive also contains incidents and a few accidents; thus, a subclass defines major events.
The groups described above represent negative occurrences, that is, what went wrong, but it is also important to point out the actions or the circumstances that succeeded in interrupting and blocking the event escalation, in noticing unsafe conditions at an early phase, or in promptly stopping working activities; thus, the success subclass contains those terms.
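The entity classes and subclasses described in this section can be written down as a small data structure. The following Python sketch is only illustrative and is not the EsOpIA implementation: the names `EntityType`, `EXAMPLE_LEXICON`, and `top_level_class` are hypothetical, and the lexicon contains only example terms quoted in the text, whereas in the actual model the classification is learned by the ML annotator.

```python
from enum import Enum
from typing import Optional


class EntityType(Enum):
    """Entity classes and subclasses named in the text ("class/subclass")."""
    DATE = "date"
    INDUSTRIAL_SECTOR = "industrial sector"
    PEOPLE = "people"
    SUBSTANCE = "substance"
    ACTIVITY = "activity"
    # apparatus is subdivided into equipment and its parts
    EQUIPMENT = "apparatus/equipment"
    COMPONENT = "apparatus/component"
    # barrier is split into technical and organizational
    TECHNICAL_BARRIER = "barrier/technical"
    ORGANIZATIONAL_BARRIER = "barrier/organizational"
    # event subclasses
    LOSS = "event/loss"
    FAILURE = "event/failure"
    DETERIORATION = "event/deterioration"
    MAJOR_EVENT = "event/major event"
    SUCCESS = "event/success"

    @property
    def parent(self) -> str:
        """Top-level class, e.g. 'apparatus' for 'apparatus/equipment'."""
        return self.value.split("/")[0]


# Hypothetical hand-written lexicon using only examples given in the text.
EXAMPLE_LEXICON = {
    "tank": EntityType.EQUIPMENT,
    "flange": EntityType.COMPONENT,
    "level gauge": EntityType.TECHNICAL_BARRIER,
    "permit to work": EntityType.ORGANIZATIONAL_BARRIER,
    "leakage": EntityType.LOSS,
    "overfilling": EntityType.LOSS,
    "corrosion": EntityType.DETERIORATION,
    "fouling": EntityType.DETERIORATION,
}


def top_level_class(term: str) -> Optional[str]:
    """Return the top-level entity class of a known term, else None."""
    entity = EXAMPLE_LEXICON.get(term.lower())
    return entity.parent if entity else None
```

Keeping the subclass and its parent in one value mirrors the balance described above: queries can work at the coarse level (apparatus, barrier, event) while the finer subclass remains available when needed.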

Relations
The identification and classification of the terms in a text are not exhaustive for representing the knowledge contained in the document. In a sentence, the words that are parts of a discourse, further to their meaning, may have a role or be irrelevant in the story. This ambiguity can be solved by relating concepts to each other.
The relations designed in the model are the following: related_to, part_of, involves, and causes. The related_to relation is a weak link that connects two elements without specifying the type of connection. The other three relations implicitly give a meaning to the connection between the terms: part_of links a physical component to an equipment, involves relates an element to a substance, and causes points out something (event, activity) or someone (people) that led to an event. These three relations also differ from the first one because they are oriented connections, that is, they have a direction from one concept to another. Table 1 shows the model elements, with the relations put in correspondence with their entity types in a predefined order. Figure 1 shows the graph, designed with Protégé (https://protegewiki.stanford.edu/wiki/WebProtegeUsersGuide (accessed on 26 July 2021)), corresponding to the conceptual model; the rectangles describe the entities (classes and subclasses) and the arcs correspond to the relations.
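The difference between the oriented relations and the weak related_to link can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the EsOpIA implementation: the `Relation` class, the `DIRECTED` table, and the `key` method are hypothetical names, and the admissible entity types of each relation (Table 1) are deliberately not encoded.

```python
from dataclasses import dataclass

# Which relations are oriented, as stated in the text:
# part_of, involves, and causes have a direction; related_to does not.
DIRECTED = {
    "related_to": False,
    "part_of": True,
    "involves": True,
    "causes": True,
}


@dataclass(frozen=True)
class Relation:
    kind: str
    source: str
    target: str

    def __post_init__(self) -> None:
        if self.kind not in DIRECTED:
            raise ValueError(f"unknown relation type: {self.kind}")

    def key(self) -> tuple:
        """Canonical key: direction is kept for oriented relations and
        ignored for the undirected related_to link."""
        if DIRECTED[self.kind]:
            return (self.kind, self.source, self.target)
        first, second = sorted((self.source, self.target))
        return (self.kind, first, second)
```

With this convention, `Relation("related_to", "valve", "pump")` and `Relation("related_to", "pump", "valve")` share the same key, while swapping the arguments of a causes relation produces a different one.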

AI Techniques for Extracting Knowledge
The methods adopted for extracting knowledge from the near miss archive are based on text-mining capabilities, which are used for analyzing the parts of speech and extracting the tokens (words). Text mining is the process of eliciting information from unstructured, free-form text by analyzing the parts of speech and classifying them with an appropriate meaning.
The token classification and their relationships defined in the conceptual model are processed with the support of machine learning techniques.
In this context, data is the text describing the narrative of near miss or anomaly, and the task of ML techniques is to learn how to extract, classify the terms, and define their relationships by referring to the designed EsOpIA model.
The ML application is an iterative process, characterized by two phases: the training for building the machine learning model and the evaluation of its performance.

Machine Learning Model Construction
The adopted ML system is based on techniques for annotating the terms contained in the text and defining their relationships. In the jargon of the text analysis process, "annotation" is the technique of assigning a type of the entity model to a term or a part of speech, that is, of providing it with a meaning or semantics.
The goal is for the system to learn to annotate the text correctly, thus a team of experts has supported this operation. At each step of this iterative process, experts manually annotate a set of documents using the ML tool adopted for developing the project (IBM Watson Knowledge Studio [28]).
Figure 2 shows an example of annotations and their relationships. The documents are in Italian, but in the figure, the key terms are translated into English to facilitate reading. According to the EsOpIA model, the colored boxes correspond to the different types of entities, including the classes: event (red), apparatus (green), activity (blue), and substance (gray). The other boxes with colored outlines represent the relationships, and the lines link the mention items. The statement is a brief description of the event that occurred, that is, "breakage and leakage as result of damage to the flexible pipe used for tapping the sulfuric acid from the tank". The entity annotator classifies breakage and damage as terms belonging to the failure subclass of the event class, while leakage is in the loss subclass. The terms flexible pipe and tank are annotated as apparatus, members of the component and equipment subclasses, respectively. The activity in progress at the time the event occurred is the tapping of sulfuric acid (entity substance).
The relations related_to and involves are quite simple to define, while the causes relation must take into account the order between the entities. Indeed, when reading the statement, it is easy for a human to understand that the damage caused a breakage, with a leakage as a consequence. The challenge is to train the ML system to learn this reasoning in order to classify the causes relations correctly.
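To illustrate why the order of the entities matters for the causes relation, the example statement can be encoded as a small set of triples. This is a hedged Python sketch, not the output of the EsOpIA annotator: the entity labels follow the description of Figure 2, the two causes links follow the reading given in the text, and the involves and related_to links, as well as the function names, are assumptions made for illustration.

```python
# Entity annotations as described for the example statement.
mentions = {
    "damage": "event/failure",
    "breakage": "event/failure",
    "leakage": "event/loss",
    "flexible pipe": "apparatus/component",
    "tank": "apparatus/equipment",
    "tapping": "activity",
    "sulfuric acid": "substance",
}

# (source, relation, target); "causes" is oriented from cause to effect.
relations = [
    ("damage", "causes", "breakage"),           # from the text
    ("breakage", "causes", "leakage"),          # from the text
    ("tapping", "involves", "sulfuric acid"),   # assumed for illustration
    ("damage", "related_to", "flexible pipe"),  # assumed for illustration
]


def causal_chain(effect, relations):
    """Walk the 'causes' links backwards from an effect to its root cause
    and return the chain ordered from root cause to effect."""
    cause_of = {target: source
                for source, kind, target in relations if kind == "causes"}
    chain = [effect]
    while chain[-1] in cause_of:
        previous = cause_of[chain[-1]]
        if previous in chain:  # guard against accidental cycles
            break
        chain.append(previous)
    return list(reversed(chain))


def terms_with_role(relations):
    """Terms that take part in at least one relation."""
    return {term for source, _, target in relations
            for term in (source, target)}
```

Here `causal_chain("leakage", relations)` returns `["damage", "breakage", "leakage"]`; reversing the direction of either causes triple would break the chain. Under the assumed links, the tank is annotated as an entity but takes part in no relation, so it would not be counted as directly involved in the event.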

Performance of the Machine Learning Model
Following the overview provided by Xu and Saleh [27] on ML methods, based on their capabilities and features, the ML adopted in EsOpIA has the characteristics of supervised learning, since the aim for the system is to learn a target function that can be adopted to predict the values of a class. The annotation process, as described in [29], for training the EsOpIA model required about 15 iterations, with sets of documents ranging from 5 to 10 in the starting phase, up to 20 and 50 in the most advanced stages of learning. At each phase, the system evaluates the test model through some metrics, usually adopted by ML techniques [23], including precision, recall, and F1 score.
Denoting by TP, TN, FP, and FN the numbers of true positive, true negative, false positive, and false negative outcomes, respectively, the metrics are computed as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (1)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (2)$$

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)$$

The F1 measure in Equation (3) is interpreted as a weighted average (the harmonic mean) of the precision and recall values; its best value is 1 and its worst is 0. Table 2 shows the statistics of the test set of the deployed model for each entity. According to Formula (1), a high precision value means that almost all mentions annotated as a certain type of entity really belong to that class. Table 2 shows that the DATE, APPARATUS, and SUBSTANCE entities are annotated with the highest precision values, but the other mentions also have a high level of correctness. Thus, we are quite confident that the system is able to correctly classify the annotated concepts.
A high value of recall (Formula (2)) means that almost all mentions that should be annotated as a certain type of entity really are. Table 2 shows satisfactory values (greater than 0.7) for many of the types of entities, whereas for two of them, activity and barrier, the values are only sufficient. One explanation is that both of these types of entities have terms that can be classified as other types (homonyms), and the system is not always able to classify them correctly, possibly because the sentences are too short to provide an explicit context. For example, the same term block is used as an event (failure), an action to interrupt something (activity), or a technical mechanism (barrier) for preventing undesired events. Thus, if the sentence is short, the system may have difficulty classifying it correctly. Table 2 shows the values of the deployed model, but during the iterative ML process, lower values required specific interventions to improve performance, including annotating new documents, chosen from those containing the most ambiguous terms, and coordinating the human annotators so that their choices converge on the same solutions.
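As a worked illustration of the three metrics, the following Python sketch computes precision, recall, and the F1 score from outcome counts using their standard definitions. The function names and the counts in the usage note are invented for the example and are not taken from Table 2.

```python
def precision(tp: int, fp: int) -> float:
    """Share of annotated mentions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of expected mentions that were annotated: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: 2 * P * R / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

For instance, with 8 true positives, 2 false positives, and 4 false negatives (invented counts), precision is 0.8, recall is about 0.67, and the F1 score is about 0.73. The F1 score sits closer to the lower of the two values, which is why an entity with many ambiguous terms, such as those discussed above for activity and barrier, is penalized even when its precision is high.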

Application of Other AI Functionalities
This section briefly mentions other AI functionalities adopted for cleaning the document repository and optimizing the management of models and their terms.
One of the critical points faced in managing the near miss archive has been to ensure the anonymization of the content. Indeed, proper names of people or companies have to be removed from the text; this operation, unthinkable to do manually, takes advantage of text-mining capabilities to recognize the undesired terms and remove them.
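The removal step can be sketched as follows. This is only an illustration: the recognizer used in the project is not described here, so a small dictionary of known names (`KNOWN`, invented for this example) stands in for it, and the placeholder labels are assumptions.

```python
def find_spans(text, known):
    """Locate every occurrence of each known name; returns (start, end, label) tuples."""
    spans = []
    for surface, label in known.items():
        start = text.find(surface)
        while start != -1:
            spans.append((start, start + len(surface), label))
            start = text.find(surface, start + len(surface))
    return spans


def redact(text, spans):
    """Replace each span with a placeholder such as [PERSON].

    Spans are processed from the end of the text backwards so that earlier
    offsets stay valid; spans are assumed not to overlap.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


# Hypothetical names standing in for the output of a real recognizer.
KNOWN = {"Mario Rossi": "PERSON", "Acme Srl": "COMPANY"}
report = "Mario Rossi of Acme Srl opened the wrong valve."
print(redact(report, find_spans(report, KNOWN)))
```

Replacing names with a typed placeholder, rather than deleting them, keeps the sentence readable for later processing.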
Other activities deal with organizing the extracted terms by considering synonyms and lemmas, or by discarding the parts of speech that are not useful for the model. All these activities are strictly related to the characteristics of the language used in the documents, in our case Italian; we therefore omit the details, which would probably be different for other languages.
Another characteristic of AI techniques is the management of stop words, i.e., terms that are not useful in clarifying the semantic meaning of the text content and therefore should be ignored by the system. In this project, however, the stop words are considered only in the search phase; thus, only a few parts of speech are ignored, e.g., articles. Prepositions are usually considered stop words, but in Italian they are often an important part of the description of a concept; for example, the loading arm is literally translated from Italian as "arm of loading". Thus, if the preposition of is removed as a stop word, the concept becomes meaningless or, worse, is split into two concepts: arm (component) and loading (activity).
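The effect of this choice can be illustrated with a small sketch. The word lists below are partial and assumed for the example (they are not the lists used in the project), and `braccio di carico` is used as the Italian form of loading arm.

```python
# Partial, illustrative lists of Italian articles and simple prepositions.
ARTICLES = {"il", "lo", "la", "i", "gli", "le", "un", "uno", "una"}
PREPOSITIONS = {"di", "a", "da", "in", "con", "su", "per", "tra", "fra"}


def normalize_query(query, drop_prepositions=False):
    """Lowercase and tokenize a query, dropping articles and, optionally, prepositions."""
    stop = (ARTICLES | PREPOSITIONS) if drop_prepositions else ARTICLES
    return [token for token in query.lower().split() if token not in stop]


# Keeping prepositions preserves the multi-word concept.
kept = normalize_query("il braccio di carico")
# Dropping them leaves two unrelated tokens: arm (component) and loading (activity).
split = normalize_query("il braccio di carico", drop_prepositions=True)
print(kept, split)
```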
Lemmatization and synonym grouping have been performed manually on the list of words extracted and classified by the system, which has decreased the number of entities: more than 33,000 terms have been reduced to about 1500 words in normal form. The ML model also counts more than 27,000 relations.
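A minimal sketch of how a manually curated table can collapse surface forms into normal forms is given below. The mapping `NORMAL_FORM` and the English terms are invented for illustration; the project works on Italian terms and on a far larger list.

```python
# Hypothetical curated mapping: surface form -> normal form (lemma or preferred synonym).
NORMAL_FORM = {
    "valves": "valve",
    "leak": "loss",
    "leakage": "loss",
    "losses": "loss",
}


def normalize(terms):
    """Map each extracted term to its normal form and return the distinct results, sorted."""
    return sorted({NORMAL_FORM.get(t.lower(), t.lower()) for t in terms})


extracted = ["Valves", "valve", "leak", "leakage", "losses", "tank"]
normal = normalize(extracted)
print(len(extracted), "terms reduced to", len(normal), ":", normal)
```

Terms absent from the table are kept unchanged (apart from lowercasing), so the table only needs entries for forms that actually differ from their normal form.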

EsOpIA Application
The EsOpIA application is a tool to access and query the near miss repository; it is aimed at people working on this matter, both experts and trainees for Seveso inspections. The user can express queries in natural language (Italian, the language used in the reports), with the option of selecting the entities from the model.
The search may start from two hypotheses: the first begins from consolidated knowledge, to verify whether operational experiences still confirm it; the second supports intuitions or foresight, to understand whether they match real cases [30]. Figure 3 shows the user interface panel of EsOpIA: at the top, there is the query in natural language, then the filter section containing the terms classified according to the EsOpIA model. After running the search, the system updates the lists, loading only the terms contained in the documents found, which makes the search refinement easier for the user. The picture shows the equipment combo listing only the items extracted from the outcomes, including line, oil pipeline, tank, and piping. The bottom of the panel shows the choices previously selected. The EsOpIA application also provides the functionalities to directly query the model; in this case, the user selects the terms from the lists of the entity types and combines them with Boolean operators. Figure 4 shows the Advanced Search panel; the example runs the following query: heavy rain AND tank. The items are classified as event-none and apparatus-equipment, respectively, and the AND operator means looking for reports that contain both entities.
The search term is, of course, extended to the synonyms associated with each entity.
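The Boolean query with synonym extension can be sketched as follows. The report index `DOCS`, the synonym table `SYNONYMS`, and the function names are hypothetical and only illustrate the logic of the heavy rain AND tank example; they are not the EsOpIA implementation.

```python
# Hypothetical index: report id -> set of entity terms found in the report.
DOCS = {
    "r1": {"heavy rain", "tank", "overflow"},
    "r2": {"downpour", "reservoir"},
    "r3": {"heavy rain", "pipeline"},
}

# Hypothetical synonym table: entity term -> alternative surface forms.
SYNONYMS = {"heavy rain": {"downpour"}, "tank": {"reservoir"}}


def expand(term):
    """Return the term together with its registered synonyms."""
    return {term} | SYNONYMS.get(term, set())


def search_and(terms, docs=DOCS):
    """Reports containing every query term (or one of its synonyms)."""
    return sorted(d for d, ents in docs.items() if all(ents & expand(t) for t in terms))


def search_or(terms, docs=DOCS):
    """Reports containing at least one query term (or one of its synonyms)."""
    return sorted(d for d, ents in docs.items() if any(ents & expand(t) for t in terms))


print(search_and(["heavy rain", "tank"]))
print(search_or(["heavy rain", "tank"]))
```

In this toy index, report r2 is matched by the AND query only because of the synonym extension, which is the behavior described above.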

Results
This section describes a few hot topics by extracting some pieces of knowledge contained in this archive using the different approaches, described above, to run the search activities.
The first two case studies deal with known issues, i.e., the difficulties involving the permit to work and the loss of containment into the ground; the aim is to understand whether those problems persist despite the efforts made to ensure work safety and to limit containment losses. The third case study starts from an intuition, by looking at some terms contained in the model that are apparently out of context but, since the ML has classified them in the EsOpIA model, are interesting for safety purposes.
The following sections describe, for each case study, the most significant searching steps developed with the EsOpIA application and their outcomes. The reports are written in Italian, but to make the reader understand the results, the entities are translated into English and listed in tables together with their classes and subclasses.
The editing types, adopted for the tables, have the following meaning: all model components are in italics, the names of classes, subclasses, and relations are in uppercase, the individuals (i.e., terms) are in lowercase.
Since the EsOpIA model has an entity-relation structure, each model extracted from a report is a set of (connected or disjoint) triples, i.e., (entity, relation, entity); thus, the sequences of triples described in the following sections provide a representation of the natural language text as a mathematical model. The terms in bold correspond to the words used in the discussion.
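To make the notion of connected or disjoint triples concrete, the sketch below stores a report model as a list of (entity, relation, entity) tuples and groups the entities into connected components. The triples are adapted from the relations reported for Case Study #1; the plain-tuple representation and the function name are assumptions made for illustration.

```python
from collections import defaultdict

# A report model as a list of (entity, relation, entity) triples.
MODEL = [
    ("third-party worker", "causes", "not waiting"),
    ("not waiting", "related_to", "permit to work"),
    ("training", "related_to", "worker"),
]


def components(triples):
    """Group entities into connected components, ignoring relation direction."""
    adjacency = defaultdict(set)
    for head, _relation, tail in triples:
        adjacency[head].add(tail)
        adjacency[tail].add(head)

    seen, groups = set(), []
    for node in adjacency:
        if node in seen:
            continue
        # Depth-first traversal collecting every entity reachable from `node`.
        stack, group = [node], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(adjacency[current] - group)
        seen |= group
        groups.append(group)
    return groups


for group in components(MODEL):
    print(sorted(group))
```

Here the model splits into two disjoint groups: the chain describing what happened, and the follow-up action on training.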

Case Study #1: Risks Known in Managing Working Activities
The first case study faces the issue related to the permit to work (PtW), to check if its management has been directly involved in some events.
The PtW is a document addressed to third-party companies or internal workers who must carry out maintenance, improvement, or modification activities inside the plant. Agreed between the operator of the establishment and the external company, it is a written document specifying, among other things, responsibilities, means, times, interfaces, intervention limits, precautions (including personal and collective protective equipment), and reports. The PtW is a systematic and formalized tool that collects the information needed to carry out work in full compliance with safety. It must take into account all the risks of the working activity, as well as the conditions and situations in which it takes place, in order to indicate the preventive and protective measures to be adopted.
Searching "permit to work" as free text in EsOpIA also returns the reports that refer to the PtW as a procedure correctly compiled and followed. Looking for cases in which the PtW failed, because it was misapplied or wrongly filled in, means checking among the model entities extracted from the reports, including EVENT and BARRIER.
The words listed in Table 3 as failure events summarize general concepts, but the terms extracted from the reports are more detailed. Thus, error corresponds to errors of application, intervention, operation, and compilation; lack is lack of analysis, supervision, preventive checks, and end-of-work checks or, more seriously, a missing PtW. A frequent error, extracted from the reports, is the misapplication of the PtW when handing the plant back, after the maintenance operations, in correct operating conditions. In one case, in fact, there was a solvent leakage due to the lack of a blind disk at the end of the line; in another, the operator opened a wrong valve connected to a provisional line on which the blind flange had not been mounted. Both cases occurred after maintenance work. The extracted model highlights the relation incorrect application related_to permit to work.
Another report describes the release of product from the detachment of the manometer on a pump when the systems restarted after a general stop, since the threaded plug had not been applied correctly. In this case, the verification of the restoration of standard conditions following a maintenance intervention failed. The relations representing the follow-up actions are: awareness related_to workers; training related_to workers; training related_to use related_to ptw; review related_to ptw.
In this report, the revision of PtW foresees explicitly adding a section on verifying the restoration of standard operation after maintenance work. Checking the plant restoration to standard operation after maintenance or change activities should be part of the PtW procedure, but the above relation (review of PtW) highlights that in some cases it is still an open issue.
In many reports, however, the PtW compilation is correct, but its execution failed. As described at the beginning of this section, the PtW contains the list of collective and personal protective equipment that must be adopted; one report states that, during a check, a supervisor found that the fire extinguishers and the explosivity detector foreseen in the PtW, listed in Table 3 as technical barriers, were missing. This case also highlights the need for supervisors to be present in the working area: they are in charge of guiding external workers inside the establishment and overseeing their activities from the beginning.
In another case, due to other concurrent works, a supervisor postponed the issuance of the PtW but did not check the working area, and a third-party worker started the welding activity without waiting for the PtW and, therefore, without carrying out the planned cleaning operations. The result was a fire, promptly extinguished by other workers. The model contains the following relations: Third-party worker causes not waiting related_to permit to work; training related_to worker.
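The relation strings quoted in this case study can be turned into explicit triples with a small parser. The sketch assumes that a chain such as "A related_to B related_to C" decomposes into (A, related_to, B) and (B, related_to, C), and that related_to, causes, involves, and part_of are the relation names; both points are assumptions made for illustration, not a description of the EsOpIA internals.

```python
import re

# Relation names as they appear in the quoted models (assumed to be the full set here).
RELATIONS = ("related_to", "causes", "involves", "part_of")
_SPLIT = re.compile(r"\s+(" + "|".join(RELATIONS) + r")\s+")


def parse_chain(chain):
    """Split 'A rel B rel C' into [(A, rel, B), (B, rel, C)]."""
    # With a capturing group, re.split keeps the relation names in the result,
    # so `parts` alternates entity, relation, entity, relation, entity, ...
    parts = _SPLIT.split(chain.strip())
    return [(parts[i], parts[i + 1], parts[i + 2]) for i in range(0, len(parts) - 2, 2)]


def parse_model(text):
    """Parse a semicolon-separated list of relation chains into triples."""
    triples = []
    for chain in text.split(";"):
        if chain.strip():
            triples.extend(parse_chain(chain))
    return triples


follow_up = ("awareness related_to workers; training related_to workers; "
             "training related_to use related_to ptw; review related_to ptw")
for triple in parse_model(follow_up):
    print(triple)
```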

Discussion on Case Study #1
The Italian Seveso legislation foresees that the permit to work contains all necessary information related to maintenance activities, including authorizations and responsibilities, preventive checks of conditions and materials adopted, workers' qualifications, instructions for safe working, list of safety equipment, scheduling, communication, verification of correct execution, and restarting.
The inspectors already check if those points are addressed in the maintenance procedures provided by operators, as well as in those for information and training of third-party companies, and for procurement of goods and services.
The results of this study, therefore, show that the attention of inspectors toward the management of work permits and their contents, including the role of supervisor and scheduling of activities, is still strongly motivated by the near-miss events that continue to occur.

Case Study #2: Environmental Risks for Leakages of Hazardous Substances
The second case study is to verify if there are still situations of dispersion of hazardous substances in the ground, despite the safety measures certainly adopted and controlled in recent years. Below are the search steps that address this issue, as depicted in Figure 3.
At the first step, the question to the NLP system is: Which documents deal with losses in the ground?
What we are looking for is in which situations the leaks of harmful substances required land remediation. Indeed, looking at the list of events classified as MAJOR, the term contamination suggests that some events deal with polluting conditions, as does the item remediation contained in the list of organizational barriers.
Filtering the search with the above terms (i.e., contamination and remediation) reduces the number of documents, and as expected, the types of equipment mainly involved in those events are tanks and pipelines. The third column of Table 4 lists the entities extracted from the occurrence related to tanks.
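The refinement step, i.e., filtering by selected terms and then looking at which equipment appears in the remaining reports, can be sketched as below. The four records in `REPORTS` are invented for the example and do not reproduce the content of the archive or of Table 4.

```python
from collections import Counter

# Invented records: each report lists the terms extracted per entity type.
REPORTS = [
    {"id": "a", "event": {"contamination"}, "barrier": {"remediation"}, "equipment": {"tank"}},
    {"id": "b", "event": {"contamination"}, "barrier": {"remediation"}, "equipment": {"pipeline"}},
    {"id": "c", "event": {"release"}, "barrier": {"containment basin"}, "equipment": {"tank"}},
    {"id": "d", "event": {"contamination"}, "barrier": {"remediation", "maintenance"},
     "equipment": {"pipeline", "pump"}},
]


def refine(reports, **required):
    """Keep the reports whose entity type `etype` contains the required term."""
    return [r for r in reports
            if all(term in r[etype] for etype, term in required.items())]


def facet(reports, etype):
    """Count how many of the given reports mention each term of one entity type."""
    return Counter(term for r in reports for term in r[etype])


hits = refine(REPORTS, event="contamination", barrier="remediation")
equipment_counts = facet(hits, "equipment")
print([r["id"] for r in hits], equipment_counts.most_common())
```

Recomputing the facet counts on the filtered subset mirrors the way the filter lists are updated after each search.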
The containment basin is a technological barrier for gathering hazardous substances accidentally released by tanks and avoiding ground contamination. Thus, the goal is to understand whether such a barrier worked or failed. The results of the search describe losses of hazardous substances from tanks into the containment basin. The model representing such reports contains the following relations: (event-loss) release involves (substance) product; (event-loss) release related_to (barrier-technological) containment basin involves (substance) product.
Here, product is a general term that indicates the hazardous substances involved, including hydrocarbon, diathermic fuel oil, and gasoline.
In many cases, this barrier worked and therefore only cleaning operations of the basins were required, but two reports describe cases where this type of barrier failed and traces of hazardous substance residues were found in the soil under the basins. One event involved a tank no longer in use, while in the other case an accidental release of gasoline into the basin occurred during preliminary reclamation activities for the maintenance of a tank; the loss required the removal of part of the ground. The latter document does not specify the reason why the basin was not able to contain the loss; it could be due to an inadequacy of the basin itself or to its cracking. This example points out some limits of these reports, which do not always describe in detail the reasons why an event occurred.
The number of reports relating to leaks from piping is greater than the number of those relating to tanks. Table 5 lists some of the entities extracted from those documents. The losses are mainly due to deterioration mechanisms that, in a few cases, have caused serious soil contamination. The list of organizational barriers contains several types of procedures, including maintenance, checks, and controls, sometimes referring to specific non-destructive tests (NDT). Among the results, one report describes a release of diesel fuel from an abandoned pipeline, which caused contamination of the underlying soil. The document outlines the lack of controls on those types of equipment, and the follow-up actions relate to remediation and the subsequent removal of the unused pipeline.
Another case describes a leak from a pipe whose replacement had already been planned. The loss superficially affected a portion of the underlying land, which was covered with a waterproof sheet in order to avoid the washout of contaminants due to rain, before the polluted soil (waste) was transferred to an appropriate disposal facility.

Discussion on Case Study #2
The loss of containment of hazardous substances and their possible dispersion into the environment is one of the cornerstones of the Seveso directive. On these hazards, the operator develops the quantitative and qualitative risk analysis, whose results lead the operator to implement the measures necessary to prevent them and those to mitigate the consequences. The search activities on the near miss repository, however, highlight that there are still several reports related to this topic.
The main problem is the deterioration of the equipment that can increase with its aging. The reports often describe the lack of controls and verifications and, sometimes, the ineffectiveness of some specific tests.
Another interesting point refers to equipment not currently in use: it is often forgotten that decommissioned equipment might cause problems if it is not completely reclaimed and still contains residues.
During the Seveso inspections, at maintenance verification, the inspectors usually check the procedures adopted for managing equipment that is out of service, decommissioned or in demolition, including remediation and disposal of residues. Thus, the outcomes extracted from the operative experience repository and discussed above confirm the need to assess this topic in-depth.

Case Study #3: Unexpected Risks-Bad Weather Conditions
This case study deals with the issue that the occurrence of external factors with unpredictable consequences can put the safety management system in crisis. That is the case of strong and exceptional meteorological phenomena.
Looking at the entities extracted in EsOpIA, some terms, classified as event-none, relate to weather conditions, including thunderstorm, heavy rain, strong wind, lightning, and ice. The system was able to classify those terms as events and link them to other event items through a causes relation. Table 6 shows the list of terms related to bad weather conditions, classified as event-none, in the first column; each of them has caused one or more events, contained in the third column, belonging to a certain subclass of event (second column). Thus, each row of the table is readable as a triple in the following form: ((event-none) term, causes, (event-subclass) terms).
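Reading table rows as triples can be sketched as follows. The two rows in `ROWS` are placeholders built from events mentioned in the text (a blackout caused by a storm, an overflow caused by heavy rain); they are not the actual content of Table 6, and the subclass assigned to each event is an assumption.

```python
# Placeholder rows in the shape (weather term, event subclass, caused events).
ROWS = [
    ("storm", "failure", ["electrical blackout"]),
    ("heavy rain", "loss", ["overflow"]),
]


def row_to_triples(term, subclass, caused):
    """Expand one table row into ((class, subclass, term), 'causes', (class, subclass, term)) triples."""
    head = ("event", "none", term)
    return [(head, "causes", ("event", subclass, event)) for event in caused]


TRIPLES = [triple for row in ROWS for triple in row_to_triples(*row)]


def caused_by(term, triples=TRIPLES):
    """Events linked to a weather term through the causes relation."""
    return [tail[2] for head, relation, tail in triples
            if relation == "causes" and head[2] == term]


print(caused_by("storm"))
print(caused_by("heavy rain"))
```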
Starting from the subset of documents that refer to meteorological phenomena, the EsOpIA application provides the functionalities to look for terms of other entity types.
The electrical blackout is one of the events caused by the storm; the interruption of power is usually included in the risk analysis as a possible situation that could occur. However, when this situation is caused by atmospheric events, external to the process and to the establishment, it is interesting to see whether other elements not foreseen in the risk analysis are involved. Table 7 shows the list of classified terms found in the reports describing a blackout caused by meteorological effects. There are no terms related to events in the subclasses major and deterioration, while there have been losses of containment and device failures. Some events, classified as success, describe how the development of the event was interrupted by the activation of the foreseen safety procedures. The scrutiny of the documents can continue by selecting some specific terms. Selecting the equipment electric generator, the search result describes two near misses dealing with the opening of a rupture disk, both of which occurred at chemical sites.
In one case, the available electric generator was activated manually, but, in the meantime, a reactor had gone into overpressure, with the consequent opening of its rupture disk and release of the product. In the other case, the co-generator was out of order due to a previous fault; the supervisor, in accordance with the emergency instruction, tried to restore the power supply, but during this short period a reactor went into high pressure, causing the opening of the rupture disk. These cases, however, represent success stories, since the technical barrier, i.e., the rupture disk, worked correctly.
The EsOpIA model, extracted from the reports, contains the following relationships between entities: blackout related_to manual activation related_to electric generator; opening related_to rupture disk part_of reactor.
Another report describes the impact that a blackout had on the process activities. During the emergency procedure for stopping the process, an extraordinary supply of liquid was used to neutralize the high-concentration hazardous substance, which caused the tank to overflow. The procedure would probably have worked in the case of a normal power outage, but it had not been considered that the pump to dispose of the water did not work without electricity; moreover, the additional element of the water from the storm had not been foreseen.
In Table 7, at the row SUBSTANCE, the list contains two hazardous substances (i.e., chlorine and diathermic fuel oil), a general term product that is meaningless, and the term meteoric water. Selecting this last term, the result gives a report that has a blackout condition similar to those described above, but the inability to restore the power was caused by the simultaneous activation of three pumps to empty the basin from the meteoric waters.
Another case refers to an overflow from a tank for collecting rainwater, due to heavy rains. The interesting aspect of this document is that one of the follow-up actions was the installation of a radar device for monitoring the level of the tanks gathering the meteoric water. This suggests that equipment dedicated to services might become critical items as well as those involving hazardous substances.
Operative experiences related to adverse meteorological conditions, however, represent deviations from the normal operation of the establishment. Thus, for controlling the containment of rainwater in a tank, in case of lack of technical devices (e.g., level gauge), operative instructions and procedures (i.e., organizational barriers) should include appropriate modes to operate in those specific anomalies and emergency conditions.

Discussion on Case Study #3
The Seveso III Directive describes the minimum information that should be contained in a safety report (SR), including the identification and analysis of accidental risks and the causes of accident scenarios, natural causes such as earthquakes or floods included. Therefore, in their SR, operators of the establishment have to collect historical information relating to the meteorological, geophysical, and hydrogeological events which occurred on their site. From interviews with some inspectors, however, it emerges that, during Seveso SMS inspections, usually no one asks questions relating to bad weather problems, that is, what measures have been taken for this issue, unless there are explicit operational experiences in this regard.
The feedback from operational experiences described in Section 5.3 would be useful for the authorities who periodically audit the safety reports, to understand whether the measures taken by the operators address issues related to the rapid worsening of climatic conditions. It is therefore reasonable that auditors of the SR assess whether extreme meteorological events may occur that go beyond the historical time series. For example, knowing whether the amount of rain fallen in a short time is equal to that recorded over a long period can help the inspector to assess whether the necessary prevention measures have been taken to cope with extreme conditions. Thus, knowing the events that have already occurred in similar conditions, the critical aspects, and also the solutions adopted could be useful to auditors.

Discussion
The study presented is the outcome of the collaboration of experts in ML application and text-mining techniques, together with the Seveso inspectors whose skills have allowed the definition of the most appropriate knowledge domain, in such a complex sector as the process industry.
The model built and the ways of querying events and relationships allow us to highlight hidden knowledge that otherwise would have been lost, as described in the case study on weather events. Even for traditional risks, such as the loss of hazardous substances into the ground, or for organizational procedures, such as work permits, further lessons can still be drawn, in addition to confirming those already known. The three case studies reported have shown that the goal has been achieved.
Representing the content of operational experiences and analyzing it has the aim of sharing knowledge among all inspectors. Over the years, indeed, each inspector gains experience and knowledge that are often more in-depth in some sectors than in others. Thus, to analyze different situations, the inspector would have to read a lot of documents and ask other colleagues for them. Hence, the usefulness of a shared repository and a tool for extracting knowledge is demonstrated.
A further strength is to allow inspections to be conducted in a homogeneous manner throughout the national territory, through lessons learned from which inspectors can take inspiration to make their activity more effective. These prompts for reflection and study act as warnings, which may give rise to further queries from the inspectors to the operator, following a slogan such as "Make you think of!" The discussions made in the previous section, for each case study, can be considered answers to general questions, some of which are listed in Table 8.

Concluding Remarks
The importance of near miss analysis has been known for decades. This research has shown how NLP and ML capabilities reinforce the power of near miss management by extracting hidden knowledge and highlighting both wrong conditions or situations and success stories. The model implemented is strongly based on events and their relationships with other entities that have been identified to express the concepts contained in the operational experiences; thus, it is able to represent the story contained in the text. The model is also suitable for searching purposes: through the EsOpIA functionalities, the inspector can browse the entity-relation graph.
The robustness of the model is tested by evaluating its applicability on other types of documents of the same domain of interest, including accident reports and equipment failure data sheets. From the first results of this test, it emerged that the conceptual model is valid, while it may be necessary to update the term sets on specific fields chosen with further training, i.e., accidents and equipment failures.
Even if the application works on documents in Italian, the conceptual model is usable with text archives in other languages after appropriate training.
This project is also part of a wider framework of the Italian Seveso inspections. Indeed, a peculiarity of the Italian implementation of this European Directive is that several stakeholders are involved, both national and regional technical institutions; therefore, the Decree has established a coordination group for the uniform application on the national territory, composed of all regional and national representatives of the bodies involved. This group, in addition to providing formal instructions and guidance on questions from operators or associations, promotes working group initiatives to address important issues (e.g., the management of obsolete equipment). The initiative to collect the operative experiences for the extraction and sharing of knowledge is in line with the objectives of this coordination group; thus, some initiatives have been launched under Inail's leadership, for example the production of bulletins, an initiative already under development in the European Community for major accidents, but in this case focused on near misses or minor incidents.
The risk of this repository is to have a huge number of documents that are difficult to manage, or only useful for confirming concepts already known. The challenge of this project is to overcome the a priori knowledge and investigate new indications and solutions or face emerging risks.