Cross-Sectorial Semantic Model for Support of Data Analytics in Process Industries

The process industries rely on various software systems and use a wide range of technologies. Predictive modeling techniques are often applied to data obtained from these systems to build the predictive functions used to optimize the production processes. Therefore, there is a need to provide a proper representation of knowledge and data and to improve the communication between the data scientists who develop the predictive functions and domain experts who possess the expert knowledge of the domain. This can be achieved by developing a semantic model that focuses on cross-sectorial aspects rather than concepts for specific industries, and that specifies the meta-classes for the formal description of these specific concepts. This model should cover the most important areas, including modeling the production processes, data analysis methods, and evaluation using performance indicators. In this paper, our primary objective was to introduce the specifications of the Cross-sectorial domain model and to present a set of tools that support data analysts and domain experts in the creation of process models and predictive functions. The model and the tools were used to design a knowledge base that could support the development of predictive functions in green anode production in the aluminum production domain.


Introduction
Process industries account for a significant share of European industry and involve the intense use of raw resources and energy. Therefore, these industries provide a context where even small optimizations can lead to large savings in terms of both economic and environmental costs [1,2]. This is especially true for specific industrial processes, such as aluminum smelting or injection moulding, characterized by production in high volumes divided among many distributed production units. Predictive modeling can be effective for optimizing processes in this context. The application of predictive models is not straightforward for several reasons, including interoperability issues among existing software systems. As a consequence, the deployment of predictive functions in these production environments at a sustainable cost or with sufficient reliability is not always feasible. As process industries usually integrate various software systems and incorporate a wide range of technologies on different execution layers [3,4], proper representation of the information and underlying data obtained from these systems is necessary. Such semantic models should provide machine-readable conceptualizations that describe the available and needed assets, equipment, and production processes involved in the specific production, as well as the functionality and information flow between these assets. In the process industries, production processes must be accurately modeled to achieve the optimization goals set by the companies.
In this paper, we describe a semantically-enriched model for process industries. Semantic techniques are used to define, inter-link, and share distributed models describing all the key aspects of the multi-scale processes of interest. First, the models include all system-specific classes existing at different layers; then, all the different classes describing the same or related information are logically inter-linked by means of semantic web standards. This results in an open, extendable semantic description of all data entities handled by all systems available at the production site.
The proposed semantic model was designed in the context of the MOdel based coNtrol framework for Site-wide OptimizatiON of data-intensive processes (MONSOON), a SPIRE (Sustainable Process Industry through Resource and Energy Efficiency) research project, which aims to develop an infrastructure with the main objective of establishing a data-driven methodology supporting the identification and exploitation of optimization potentials by applying model-based predictive controls in the production processes [5]. The platform was evaluated in two domains: Aluminum production and plastic injection. The proposed architecture consists of two main components, as depicted in Figure 1. The Cross-Sectorial Data Lab (hereafter called the Data Lab) is the core component of the MONSOON platform. It is a collaborative environment built on top of the big data frameworks and technologies and used to collect, store, and process large volumes of data obtained from multiple production sites from different process industries. In the Data Lab, the data scientists process and analyze the production data and develop predictive functions to optimize the production processes. Data Lab also provides a set of tools to evaluate those models, simulate the functions performance in testing environment, and deploy them into the production environment.
The Real-Time Plant Operation Platform is a set of components used during run-time in a particular plant. The components are used to integrate various sensors and systems deployed on-site (including PLC, MES, SCADA, and ERP) and provide a common interface for communication with these systems. The components are then used to transfer relevant data from the production site to the Data Lab. Predictive models developed in the Data Lab are deployed in these components and are integrated within the on-site environment.
This paper is organized as follows. Section 2 describes the motivation for this work from the perspective of the data analysis process and its relation to the proposed semantic model. An overview of existing standards and solutions relevant to our work is given in Section 3, followed by a description of the methodology adopted in the development of the semantic model in Section 4. Section 5 introduces the model, describes its structure and main components, and demonstrates its usage. The following sections then present the model evaluation and the software tools using the presented semantic model. Appendix A contains several examples from the demonstration scenarios from one of MONSOON's application domains.
The terms model and modeling in this paper have different meanings: (a) the semantic model is the formal specification of the concepts from a given domain; (b) the data analysis model is the result of applying data modeling techniques (e.g., a predictive function); and (c) the process model specifies the decomposition of a process into steps and describes the steps' data and execution dependencies (usually based on graphical notation in the form of flowcharts).
The aim of this study was to propose a formalized semantic model that represents the knowledge exchanged by domain experts and data scientists. Parts of this semantic model also cover process modeling for production processes and describe metadata about the data analysis models (i.e., predictive functions).

Motivation
This section describes the main objectives of the proposed Cross-sectorial domain model in the context of the general data analysis process. Several standard methodologies, such as CRISP-DM or SEMMA [6][7][8], break the process into several major phases.
The initial phase, called Problem understanding, focuses on understanding the objectives and requirements of data analysis from a business perspective, i.e., how data analytics methods can optimize the production process. During this phase, the domain experts, in cooperation with the data scientists, analyze the overall production process and its steps (production segments) and specify which key performance indicators (KPIs) will be optimized. This knowledge is then converted by the data scientists into the data analysis problem definition (e.g., classification, regression, anomaly detection, etc.). The subsequent Data understanding phase starts with data collection and proceeds with activities to ensure familiarity with the data, to identify data quality problems, to provide the first insights into the data, and to detect interesting subsets to form hypotheses for hidden information. During this phase, data scientists, with the cooperation of domain experts, select and describe the subset of relevant data required to achieve the optimization objectives specified in the problem understanding phase. The data are described using the already known dependencies between the data and KPIs.
The Data understanding phase is followed by the Data preparation phase, which covers all activities necessary to construct the final dataset (data that will be fed into the data analysis tools), and the modelling phase, where the various data analysis techniques (such as decision trees, neural networks, logistic or linear regression models, etc.) are selected and applied, and their parameters are calibrated to optimal values. While these two phases are essential for the overall results of the data analysis process, they are more technical and mostly involve activities of data scientists, with some exceptions, e.g., to collect initial raw data from the production site environment, data scientists have to cooperate with the site IT-specialists responsible for the setup of the data integration component.
The result of the Modelling phase is a set of data analytics models (i.e., predictive functions) that appear to have high quality from a data analysis perspective. This quality is usually evaluated by applying the model to an independent testing dataset not used during the building of the model and computing evaluation statistics such as classification accuracy (the number of test records whose value is correctly predicted by the model divided by the total number of records in the testing dataset). Before proceeding to the final deployment of the model, it is essential to evaluate the model more thoroughly and review the steps executed to construct it, to ensure it adequately achieves the business objectives [9]. Within the Evaluation phase, the quality of the models evaluated on the testing dataset is projected onto the business-oriented KPIs specified in the problem understanding phase. As such, domain experts can directly interpret the evaluation results of the overall data analysis and assess the quality of the constructed models from a business perspective.
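The classification accuracy described above can be sketched in a few lines. This is an illustrative example only; the function name and the label values are hypothetical and not part of the MONSOON platform.

```python
# Minimal sketch of the classification accuracy described in the text:
# correctly predicted test records divided by the total number of test records.
def classification_accuracy(predicted, actual):
    """Fraction of test records whose value the model predicted correctly."""
    if len(predicted) != len(actual):
        raise ValueError("prediction and ground-truth lists must have equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

# Hypothetical quality labels for five test records: 4 of 5 correct -> 0.8
predicted = ["ok", "scrap", "ok", "ok", "scrap"]
actual = ["ok", "scrap", "ok", "scrap", "scrap"]
print(classification_accuracy(predicted, actual))  # 0.8
```

In the Evaluation phase, such a statistic would then be projected onto a business-oriented KPI (e.g., expected scrap rate) so that domain experts can interpret it directly.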
The overall motivation for developing the Cross-sectorial semantic model was to improve the communication between data scientists and domain experts during these phases and to formally capture and externalize the communicated or created knowledge. Such a semantic model should enable cross-sectorial sharing of knowledge about the optimization of production processes by data analytics methods and predictive functions (Figure 2). The model itself should represent the common language between domain experts and data scientists, providing the concepts that are important to capture in the context of data analytics in a particular domain. The model should assist the domain experts in externalizing their tacit domain-specific knowledge (about production processes, resources, etc.) into a more comprehensive form, represented in the terminology used by the data scientists, and in capturing all aspects of the domain-specific knowledge that could be necessary when applying data analytics in a particular domain. Such knowledge sharing should support and improve the Problem and Data understanding phases of the data analytics process by providing a common ground when describing the domain and the related problem and data aspects. On the other hand, the data scientists could use the semantic model to define the predictive functions that solve the data analytical tasks in a particular domain and to present to the domain experts how such functions influence the domain KPIs. The main knowledge management objectives of the Cross-sectorial domain model are: (1) to improve communication between data scientists and domain experts during the phases of problem understanding, data understanding, and evaluation by creating a shared knowledge base; and (2) to allow within- and cross-sectorial sharing of knowledge about the optimization of production processes using data analytics methods and predictive functions.
Besides the knowledge management objectives, another objective of the semantic model design is the automated evaluation of predictive functions by computing their impact on the KPIs from the evaluation statistics. The model, in combination with related tools, should enable the development of what-if simulation scenarios and automatic optimization of the production process during the modelling and evaluation phases, and should automate the validation of data dependencies during the deployment of a predictive function in the operational environment. To achieve these objectives, the semantic model should support: (1) presentation in a human-readable form using modeling tools that allow creation, visualization, querying, filtering, and navigation in the semantic model elements by data scientists and domain experts; and (2) a machine-readable format with formally defined semantics, which supports automatic validation of deployment, quality evaluation of the predictive functions, and simulation and automatic optimization of the production processes.
As the main motivation was to develop an approach applicable to multiple industrial domains, the proposed semantic model does not aim to specify the concepts for a specific industry domain, but instead focuses on cross-sectorial aspects. From that point of view, the Cross-sectorial domain model can rather be considered a meta-model that determines meta-classes for the formal description of these particular concepts. Such meta-classes should cover all important aspects, common to the process industries, in the context of optimization of production processes using data analytics. Therefore, the Cross-sectorial domain model should include elements to describe the production itself, including the processes and their performance, equipment and personnel; concepts enabling the description of any kind of data obtained from the production; and concepts used to describe the predictive models used in optimization. We therefore decided to formally divide the semantic model into three respective modules: production process modeling, data and KPI modeling, and modeling of predictive functions. The semantic model and the respective modules are described in more detail in Section 5.
To ensure presentation in human-readable form and to support the adoption of the semantic model in real-world scenarios, a set of modelling tools was designed as a part of the MONSOON platform. These tools use the semantic model as a basis and provide a graphical user interface for the creation of industrial domain models. The Semantic framework, a tool specifically designed to capture the domain knowledge, is described in Section 7.

Standards for Modeling Manufacturing Processes
In this section, we describe two types of conceptual information models in the industrial domain: models based on industrial standards and semantic models. In the first group, ANSI/ISA-95 is a well-established international standard for developing an automated interface between enterprise and manufacturing control systems [10]. Data models of this standard are implemented in the Business To Manufacturing Markup Language (B2MML), an XML-based set of schemas for modeling manufacturing processes [11]. B2MML is frequently used to integrate business systems (e.g., ERP systems) with manufacturing systems (e.g., control systems). In general, B2MML allows describing the manufacturing processes as well as material, personnel, equipment, and physical assets related to the production. B2MML enables us to describe how to produce the product, to specify the resources needed in the production, and to define the production schedule and performance. ISA-88 is another standard, addressing batch production processes [12]. ISA-95 and ISA-88 overlap and are inconsistent in several aspects. ISA-88 is better in terms of conceptualization and flexibility, whereas ISA-95 is applicable to a broader spectrum of processes. Therefore, activities involving the integration of both standards have been explored [13], including the development of an upper-level ontology approach to map the concepts of both standards [14].
Several applications of semantic and knowledge technologies to the modeling of manufacturing processes have been presented and developed. Lemaignan et al. [15] introduced MASON, an upper ontology that represents a common semantic model of the manufacturing processes and environment. The primary objective of the ontology was to help estimate the manufacturing costs. The ontology describes the most important concepts of the industrial processes but only uses the cost factor as a method of their evaluation. Kharlamov et al. [16] used an OWL 2 to develop an ontology for the manufacturing and energy production sectors. The authors developed a set of tools for engineers with little knowledge of semantic technology in the creation of ontology-based models and in populating them with data. In Pakonen et al. [17], fuzzy ontologies were used in the industrial domain to improve information retrieval from the plant knowledge base consisting of reports obtained from different plant systems. With the emergence of the Industry 4.0 concept, several other ontological approaches have been created. Ontologies have been used to represent manufacturing knowledge [18], to describe industrial production processes [19], and to explain the capabilities of manufacturing resources [20]. An important objective of the use of the ontological approach is to support the interoperability between the industrial standards [21,22].
The main objective of the cross-sectorial domain model is to capture the common generic concepts related to the process industries. This approach leads to the design of a meta-model, which would enable the development of specific domain models for particular industries. From that perspective, the use of a standard, such as B2MML, as the basis for the semantic model proved to be useful.

Standards for Modeling Predictive Functions
The need has increased for formalized representations of the data analytics processes and for the formal representation of the outcomes of those processes in general. Several formalisms for describing scientific investigations and research findings are available, but most of them are specific to a domain, such as biomedicine. Examples of such formalisms include the ontology of biomedical investigations (OBI) [23] or the ontology of experiments (EXPO) [24]. These ontologies specify useful concepts that describe general processes producing output data given some input data and formalize the outputs and results of data analytics investigations. They are aligned with general upper-level ontologies such as SUMO [25] or DOLCE [26].
Semantic technologies are also applied directly to formalize knowledge about data analytics processes and to discover knowledge in databases. The initial goal of this effort was to build an intelligent data mining assistant that combines planning and meta-learning for the automatic design of data mining workflows. The assistant relies on formalized ontologies of data mining operators that specify constraints, required inputs, and provided outputs for various operations in the data pre-processing, modeling, and validation phases. Examples of ontologies for data mining and data analytics include the Data Mining OPtimization Ontology (DMOP) [27] and the Ontology of Data Mining (OntoDM) [28]. The DMOP ontology provides concepts for the description of the input analyzed data, data pre-processing operations, data mining algorithms, and models and analysis results. OntoDM provides a unified framework for data mining and contains definitions of the basic data mining concepts together with concepts describing scientific investigations. This was further extended by the Exposé ontology [29]. Exposé provides a formal domain model for a database of data mining experiments based on OntoDM and extends it with the data mining algorithm descriptions from DMOP. Another extension is the description of experiments (e.g., algorithm parameters, execution environment, methods of evaluation, etc.). OntoDM and Exposé are primarily used to provide a controlled vocabulary for data analytics investigations. Other studies [30][31][32] presented an interesting approach for developing a data science ontology, used to understand data analytics code using dataflow graphs.
Formalized ontologies, such as DMOP or OntoDM, have already been built by the conceptualization of existing industry standards. These standards include XML-based or JSON-based formats for exchanging of data mining processes and models such as PMML [33], PFA [34], or ONNX, or interfaces of common software libraries for data mining such as Rapid-miner, Weka, scikit-learn, or Keras. Alignment of semantic models with the data analytics technologies is particularly important for interoperability with existing tools, which will allow seamless integration of semantic technologies with the data analytics components of the platform.

Methodology
For the design of the semantic model, we adopted a methodology that aims to facilitate the development of production process and data analytics ontologies so that the ontology developer can justify the rationale behind the involved decisions. The base of the proposed methodology is SMOL [35], which was extended using the concepts introduced by Grüninger and Fox [36]. The final proposed methodology emphasizes conceptualization based on the usage scenarios for the knowledge model. The methodology consists of the following steps: (1) Methodology strategy selection. In this phase, the best methodological strategy is selected according to the data available or recoverable from the existing knowledge sources related to the specific domain, or from the opinions of domain experts. For our case, we adopted the middle-out strategy: We started with the conceptualization of the main domain concepts, which were subsequently mapped to the upper-level ontology and further specified with new domain sub-concepts.
(2) Specification of use cases and query requirements. In this phase, we specified base use cases outlining how the knowledge system implementing the semantic models is used within the scope of the whole data analytics platform. From each use case, we defined the requirements of the semantic models in the form of competency questions expressed in natural or pseudo-natural language (with limited question structure). The competency questions in this phase are informal, since they are not expressed in a formalized ontology language. Examples of the questions are shown in Section 6. (3) Knowledge structure construction. Once the informal competency questions have been proposed for the new ontology, the ontology terminology (classes, relations, and instances) is specified using a first-order logical language. This language must provide the necessary vocabulary to formally restate the informal competency questions [37], i.e., after conceptualization, the developed ontology is accompanied by a set of formal competency questions. After the base concepts of the proposed semantic model were specified, the concepts were mapped and refined according to the selected upper-level ontologies. (4) Knowledge structure validation and reorganization. The formal competency questions associated with the developed ontology represent how to evaluate the ontology to verify that it is adequate. Each competency question is formally evaluated to determine whether the proposed structure of the knowledge model is consistent and complete. In cases where some knowledge is missing or inconsistent, changes to the ontology structure are proposed by introducing new classes, relations, or instances, or by restructuring the existing hierarchies of concepts.
Finally, the formal competency questions were also mapped to the web service interfaces of the implemented tools and to the underlying JavaScript Object Notation (JSON) data model implementing the semantic models. A part of the competency questions was then implemented in the form of software integration tests to validate the functionality of the final implemented knowledge system.
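As a minimal illustration of how a competency question might be turned into such an integration test, the sketch below encodes the question "Which KPIs are influenced by a given production segment?" against a toy in-memory knowledge base. All names, the knowledge base structure, and the example segment and KPIs are hypothetical and serve only to illustrate the validation pattern.

```python
# Toy in-memory knowledge base; the structure is illustrative, not the
# actual MONSOON data model.
knowledge_base = {
    "segments": {"mixing": {"influences": ["anode_density", "energy_per_tonne"]}},
    "kpis": {"anode_density": {}, "energy_per_tonne": {}},
}

def kpis_influenced_by(segment):
    """Formal counterpart of the informal competency question:
    'Which KPIs are influenced by production segment <segment>?'"""
    return knowledge_base["segments"][segment]["influences"]

def test_competency_question():
    # The model is adequate for this question only if the query returns
    # a non-empty answer and every returned KPI is defined in the model.
    answer = kpis_influenced_by("mixing")
    assert answer, "segment must influence at least one KPI"
    assert all(kpi in knowledge_base["kpis"] for kpi in answer)

test_competency_question()
```

A failing assertion here would correspond to step (4) of the methodology: missing or inconsistent knowledge triggers a proposed change to the ontology structure.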

Cross-Sectorial Domain Model
The main tool introduced by the cross-sectorial domain model to improve communication between users is a shared vocabulary collaboratively created by domain experts and data scientists. The domain concepts are described as the vocabulary entries that unambiguously define their meanings. Each entry has an assigned primary label presented as the unique title representing the concepts for the user. Optionally, each entry can list additional labels such as synonyms, acronyms, or abbreviations representing the same concept in the particular natural language(s). Besides the definition of the meaning, vocabulary can be used as the Knowledge Organization System (KOS), i.e., the entries can be used as the indexing keywords for the organization of any documents or other resources (e.g., datasets, scripts, etc.), which contain relevant information about the domain or solved business problem. The concepts in the vocabulary are further organized in a hierarchical structure (taxonomy) from broader to narrower concepts. The taxonomy structure allows efficient navigation in the domain knowledge and retrieval of relevant information.
The structure of the cross-sectorial domain model vocabulary is based on the Simple Knowledge Organization System (SKOS) standard with the extensions defined in the JSKOS standard, which defines the JSON (JavaScript Object Notation) format for knowledge organization systems such as taxonomies. All domain concepts are represented as objects of the JSKOS concept type, which formally defines the fields of the vocabulary entries and their relationships, e.g., narrower or broader, previous/next, related, etc. The structure of the controlled vocabulary is domain independent, and the proposed JSKOS format can be used to describe concepts in any domain. For application in the process industry domain, the cross-sectorial domain model defines the core of the vocabulary with the basic concepts for the representation of the main entities in the process industry. The top concepts can be divided into: (1) Production process modeling, which specifies the decomposition of production processes into segments (production phases) and describes resources required for production; (2) Data modeling, which specifies concepts for the description of the data elements and key performance indicators; and (3) Predictive function modeling, which specifies concepts for the description of the data analytics models.
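A JSKOS-style vocabulary entry of the kind described above might look as follows. This is a hedged sketch, not an excerpt from the actual MONSOON vocabulary: the URIs, labels, and example concept ("Green anode") are invented for illustration, and only a subset of JSKOS fields is shown.

```python
# Illustrative JSKOS-style concept entry for a hypothetical "Green anode"
# concept, with a primary label, alternative labels, and taxonomy links.
green_anode = {
    "uri": "https://example.org/vocab/green-anode",            # assumed namespace
    "type": ["http://www.w3.org/2004/02/skos/core#Concept"],
    "prefLabel": {"en": "Green anode"},                        # unique title
    "altLabel": {"en": ["Unbaked anode"]},                     # synonyms
    "broader": [{"uri": "https://example.org/vocab/anode-production"}],
    "narrower": [],
    "related": [{"uri": "https://example.org/vocab/paste-plant"}],
}

# Navigation in the taxonomy follows the linked URIs, e.g., from a concept
# to its broader concepts:
broader_uris = [c["uri"] for c in green_anode["broader"]]
```

Because the entries are plain JSON objects, the same structure can index documents, datasets, or scripts when the vocabulary is used as a Knowledge Organization System.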
Besides the shared vocabulary, the execution workflow of the production processes and their decomposition into activities and sub-processes are usually modeled using graphical notation in the form of various schemas or workflow diagrams. The structure of the vocabulary specified by the cross-sectorial domain model allows linking each graphical object from the diagram to the definition entry in the shared vocabulary, i.e., it is possible to navigate from the process model to the definition of the related concepts or from the concept (e.g., data element, KPI, or predictive function) to the relevant part of the graphical process model. As the reference graphical notation for the modeling of the production processes, the cross-sectorial domain model adopts a subset of the BPMN 2.0 standard.
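The bidirectional navigation between diagram objects and vocabulary entries can be sketched as a pair of indexes. The BPMN element IDs and concept URIs below are hypothetical; the point is only that one forward mapping suffices to derive the reverse one.

```python
# Forward link: each graphical BPMN object points to its vocabulary entry.
# Element IDs and concept URIs are invented for illustration.
bpmn_to_concept = {
    "Task_Mixing": "https://example.org/vocab/paste-mixing",
    "Task_Forming": "https://example.org/vocab/anode-forming",
}

# Reverse index: from a vocabulary concept back to the diagram object(s)
# that reference it, enabling navigation in the opposite direction.
concept_to_bpmn = {}
for element, concept in bpmn_to_concept.items():
    concept_to_bpmn.setdefault(concept, []).append(element)

print(concept_to_bpmn["https://example.org/vocab/paste-mixing"])  # ['Task_Mixing']
```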
To summarize, the overall structure of the proposed cross-sectorial domain model is presented in Figure 3. The bottom layer provides interoperability with the linked data and semantic web technologies based on the RDF and JSON-LD standards. Over this standard knowledge model, we defined the meta-model for the knowledge organization system based on the SKOS and JSKOS standards. Using this meta-model, we identified common vocabulary for the process industry domain, which describes production processes and production resources. The domain concepts are complemented by the general data analytics concepts adopted from the DMOP ontology, which are used to describe the data elements, KPIs, and predictive functions. Finally, the definition of the production processes is extended with graphical notation based on the simplified BPMN diagrams.
The top concepts can be divided into: (1) Production process modeling, which specifies the decomposition of production processes into segments (production phases) and describes resources required for production; (2) Data modeling, which specifies concepts for the description of the data elements and key performance indicators; and (3) Predictive function modeling, which specifies concepts for the description of the data analytics models.
Besides the shared vocabulary, the execution workflow of the production processes and their decomposition to activities and sub-processes are usually modeled using graphical notation in the form of various schemas or workflow diagrams. The structure of the vocabulary specified by the cross-sectorial domain model allows linking each graphical object from the diagram to the definition entry in the shared vocabulary, i.e., it is possible to navigate from the process model to the definition of the related concepts or from the concept (e.g., data element, KPI, or predictive function) to the relevant part of the graphical process model. As the reference graphical notation for the modeling of the production processes, the cross-sectorial domain model adopts the subset of the BPMN 2.0 standard.
To summarize, the overall structure of the proposed cross-sectorial domain model is presented in Figure 3. The bottom layer provides interoperability with the linked data and semantic web technologies based on the RDF and JSON-LD standards. Over this standard knowledge model, we defined the meta-model for the knowledge organization system based on the SKOS and JSKOS standards. Using this meta-model, we identified common vocabulary for the process industry domain, which describes production processes and production resources. The domain concepts are complemented by the general data analytics concepts adopted from the DMOP ontology, which are used to describe the data elements, KPIs, and predictive functions. Finally, the definition of the production processes is extended with graphical notation based on the simplified BPMN diagrams.

Production Process Modeling

Core Concepts
Besides the description of the properties of concepts from the production industry domain, the cross-sectorial domain model provides core vocabulary that consists of the main categories dividing all domain concepts, production processes, and production resources. The vocabulary is represented as a JSKOS concept scheme, and each category is represented as a JSKOS concept. Note that it is necessary to distinguish the types for the domain concepts (production process and production resource) from the category concepts (production process concept and production resource concept). All specific concepts of a given type are linked as the narrower concepts of the corresponding category. For example, each specific process concept (e.g., Green Anode Production) has the production process type and is linked as a narrower concept of the production process concept category. The structure of the core concept scheme is presented in Figure 4.
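The distinction between category concepts and the specific concepts linked under them can be sketched as follows; every concept name below is illustrative, and a real model would carry JSKOS URIs rather than plain strings.

```python
# Sketch of the core concept scheme: category concepts group the specific
# domain concepts via narrower links. All names are illustrative.
categories = {
    "production process concept": ["Green Anode Production", "Anode Baking"],
    "production resource concept": ["Anode Press", "Operator"],
}

# Each specific concept carries a type distinct from the category concepts.
concept_types = {
    "Green Anode Production": "production process",
    "Anode Baking": "production process",
    "Anode Press": "production resource",
    "Operator": "production resource",
}

def concepts_in_category(category):
    """Specific concepts linked as narrower concepts of a category concept."""
    return categories[category]

process_concepts = concepts_in_category("production process concept")
```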

Production Process and Production Segment
Production process and process segment concepts (Figure 5) represent the decomposition of the overall production process into particular production steps. The production process is a container composed of production segments (specific steps in the process) ordered in the execution workflow. The production process can be further decomposed into sub-processes. This is represented by the compound production segment, which delegates execution to the underlying sub-process. The sub-processes refer to the parent process on the upper level of the process hierarchy by the inverse partOf relation. The taxonomy of the production process concepts corresponds to the decomposition of the production processes into sub-processes. The broader relation is equivalent to the partOf field, whereas narrower specifies the inverse relationship.

The process segment is the logical grouping of personnel and equipment resources required to complete a production step. It defines what types of personnel and equipment are needed, and can describe specific resources, e.g., a particular kind of equipment for a specific segment. The partOf field, in this case, refers to the production process that consists of this production segment. Process segments are ordered in the execution workflow by an execution dependency relationship. The execution workflow order between the production segments (steps) is represented by the hasExecutionDependency relation, which specifies that a segment can be executed only after the previous segment is finished. The ordering of the production segment concepts can also be expressed using the previous and next fields: previous is equivalent to hasExecutionDependency, and next represents the inverse relationship. The list of production resources (equipment or personnel role concepts) required to perform a production segment is represented using the requires field.
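The ordering implied by hasExecutionDependency can be sketched in a few lines; the segment names, resources, and dictionary encoding below are illustrative assumptions, as a real model would express these as JSKOS relations.

```python
# Hypothetical segments of a production process. A segment may execute only
# after all segments it depends on (hasExecutionDependency) have finished.
segments = {
    "mixing":  {"hasExecutionDependency": [],          "requires": ["mixer"]},
    "forming": {"hasExecutionDependency": ["mixing"],  "requires": ["press"]},
    "cooling": {"hasExecutionDependency": ["forming"], "requires": ["cooling tunnel"]},
}

def execution_order(segs):
    """Topological order of segments implied by hasExecutionDependency.
    Assumes the dependency relation is acyclic."""
    order, done = [], set()
    while len(done) < len(segs):
        for name, seg in segs.items():
            if name not in done and all(d in done for d in seg["hasExecutionDependency"]):
                order.append(name)
                done.add(name)
    return order

workflow = execution_order(segments)
```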
The cross-sectorial domain model is not limited to a particular graphical notation and can be used to describe processes using various schemas, workflow diagrams, or dynamic visualizations presented in the process visualization software (e.g., InTouch screens). One of the suitable forms for modeling the knowledge about the production processes involves using diagrams where the process steps, their execution dependencies, and the involved resources are represented as the interconnected graphical objects. As the reference graphical notation for the implementation of the MONSOON semantic framework, we adopted a subset of the BPMN 2.0 standard [38]. Examples are provided in Section 8.

Production Resources
Production resources are conceptually divided into equipment resources and person roles (Figure 4). The common super-type for these concepts is the production resource concept, which represents resources required to complete a particular production step. Grouping of resources with similar characteristics and purposes can be described by resource classes; any production resource may be a member of zero or more classes. The semantic model does not prescribe any classification scheme for the resources but defines the Simple Knowledge Organization System (SKOS) [39] as the preferred concept schema for the specification of classification thesauri, subject heading lists, taxonomies, folksonomies, and other similar types of controlled vocabulary. Additionally, each production resource can be annotated with the relevant data elements describing its parameters and capabilities (Figure 6).
The equipment concept covers any equipment-related entities, e.g., production sites and areas, production units, production lines, and work and process cells. Equipment may include other equipment; for example, a production line may be composed of work cells, each of which may be defined in the model as a separate equipment element with distinct properties and capabilities. The person concept covers any human roles involved in the production step and describes their capabilities. Properties and capabilities of equipment or human roles are specified as the linked data elements.
Both equipment and personnel concepts can be organized in concept hierarchies. The taxonomy of equipment concepts corresponds either to the decomposition of the equipment into sub-pieces, or to the categorization of the equipment into more generic classes.


Data Modeling
Data modeling elements represent the main concepts used to describe the data and relationships between the physical data stored in the Data Lab, particular data elements of those datasets (e.g., attributes), and how they are related to overall KPIs specified for a particular process segment. High-level descriptions of those elements were introduced by Sarnovsky et al. [5]. The next sections provide a more in-depth description.

Data Elements
The main concepts for modeling the data elements are divided into two levels of abstractions: logical and physical. The logical data element (data element concept in the model, Figure 6) concept specifies role, type, and quantity (e.g., units of measurements) of each data element. A role describes how the data elements are treated in the production step. In general, we recognize two different roles: Input and output. Data elements with the input role can be selected as the input of the predictive functions and describe, for example, measurements or input control signals. Data elements with output roles describe diagnostic or control signals, which can be optimized as the outputs of the predictive functions. Each data element object can have both input and output roles specified, e.g., a control signal can serve as an output for the predictive control or as an input for the predictive maintenance. The type of the data element specifies the values of the data element and denotes real intervals (continuous values), ordered sets (ordinal values), or unordered sets (nominal values). Data elements can be related to the production process or equipment, or any other domain concept.
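A logical data element with its role, type, and quantity can be sketched as follows; the field names and the element itself are illustrative assumptions, not the model's normative vocabulary.

```python
# A logical data element as described above: it carries a role (input and/or
# output), a value type, and a quantity. The concrete element is illustrative.
paste_temperature = {
    "name": "paste temperature",
    "roles": {"input", "output"},   # usable as a predictive input and as an optimized output
    "type": "continuous",           # real interval; alternatives: ordinal, nominal
    "quantity": "degree Celsius",
    "relatedTo": "Green Anode Production",
}

def candidate_inputs(elements):
    """Data elements selectable as inputs of a predictive function."""
    return [e["name"] for e in elements if "input" in e["roles"]]

inputs = candidate_inputs([paste_temperature])
```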
Physical data elements are used to link the data elements to the physical representation of the data in the cross-sectorial Data Lab, e.g., structured records or files. A (logical) data element can be mapped to multiple physical data elements depending on the data pre-processing methods applied to the data during the process of data acquisition and integration. Each physical data element has a specified data type that represents how the data are physically encoded in the records. The data type can be atomic (e.g., byte, integer, double, string) or composed (e.g., array, enumeration, map, or union of types). Multiple physical data elements (fields) can be grouped into one physical data set that corresponds to one relational table, a collection of objects in a NoSQL database, or a flat structured file with tabular or transaction data. The physical location of the data set is represented as an IRI. For physical data sets, it is possible to infer the datatype schema (represented, for example, as a composed Avro datatype), which is the composition of the datatypes of the grouped physical data elements.
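The schema inference step can be sketched as composing the atomic datatypes of the grouped fields into one Avro-like record schema; the field names and the simplified schema shape are assumptions for illustration.

```python
# Inferring a composed record schema for a physical data set from the atomic
# datatypes of its grouped physical data elements (Avro-like, simplified).
fields = [
    {"name": "timestamp", "datatype": "long"},
    {"name": "anode_id",  "datatype": "string"},
    {"name": "weight",    "datatype": "double"},
]

def infer_record_schema(name, fields):
    """Compose the field datatypes into one record schema."""
    return {
        "type": "record",
        "name": name,
        "fields": [{"name": f["name"], "type": f["datatype"]} for f in fields],
    }

schema = infer_record_schema("GreenAnodeMeasurement", fields)
```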
A simple yet effective specification for physical data elements could be provided by the media types for sensor measurement lists (SenML) [40], a data model that can be serialized on devices with limited processing capabilities and that fits a large amount of data into a contextually representative format readable and writable by both machines and humans. This specification defines media types for representing simple sensor measurements and device parameters in SenML based on JSON, Concise Binary Object Representation (CBOR), eXtensible Markup Language (XML), and Efficient XML Interchange (EXI). Su et al. [41] showed how to transform data expressed with SenML into RDF to support semantic interoperability. An alternative to the SenML specification is the OGC SensorThings API standard specification [42]. Both specifications provide a JSON-based method of modeling physical and logical data elements and were thus further investigated within the cross-sectorial domain model.
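A minimal sketch of the SenML JSON representation and its resolution rule (RFC 8428): base fields such as `bn` and `bt` apply to subsequent records in the pack. The device name and values below are made up for illustration.

```python
import json

# A SenML pack: the first record sets base name (bn) and base time (bt),
# which apply to later records; "t" is an offset from the base time.
pack = json.loads("""
[{"bn":"urn:dev:ow:10e2073a01080063:","bt":1553500000,
  "n":"temperature","u":"Cel","v":23.1},
 {"n":"pressure","u":"Pa","v":101325,"t":5}]
""")

def resolve(pack):
    """Resolve base fields into absolute names and times per record."""
    bn, bt, out = "", 0, []
    for rec in pack:
        bn = rec.get("bn", bn)
        bt = rec.get("bt", bt)
        out.append({"name": bn + rec["n"], "unit": rec["u"],
                    "value": rec["v"], "time": bt + rec.get("t", 0)})
    return out

measurements = resolve(pack)
```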

KPIs
KPI concepts are used to model the overall performance parameters for a given production process. The KPIs are linked to the concepts of process segments or are specified for the whole production process. They represent metrics that can be used for visualization, assessment, and management of the performance or impact of the specific operations within the process industries. Since the KPIs must be measurable, they inherit the main properties (i.e., data type and quantity) from the (logical) data elements (as described in the previous section). KPIs can be further organized into a hierarchy of KPI categories according to the specified classification schema, such as the classification of environmental impact defined in ISO 14031 [43].
Besides the organization of the KPIs in the classification hierarchies, causal relationships between the KPIs can be expressed by the non-hierarchical KPI dependency relationships. Such relations can represent the transformation of one KPI to another (e.g., transformation of energy or material savings to environmental impact KPI, such as level of emissions). Using the KPI dependency relations, KPIs can also be inferred from the evaluation results, which estimate the performance of the predictive functions (see the description of the predictive functions concepts in the following subsection). Using the composition of the KPI dependency relations, it is possible to evaluate the impact of the application of a predictive function from various perspectives, such as product quality, environmental impact, material/energy consumption, or effective usage of the production resources.
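One way the composition of KPI dependency relations can be pictured is as a chain of transformations propagating an estimated saving toward a higher-level impact KPI. The linear-factor encoding and all KPI names and values below are illustrative assumptions; the model itself only asserts the dependency relations.

```python
# Hypothetical KPI dependency relations, each modeled here as a linear
# conversion factor between the source and target KPI units.
kpi_dependencies = {
    ("energy savings [kWh]", "CO2 reduction [kg]"): 0.4,
    ("CO2 reduction [kg]", "environmental impact score"): 0.01,
}

def propagate(value, source, target, deps):
    """Follow a chain of KPI dependencies from the source KPI to the target KPI."""
    for (src, dst), factor in deps.items():
        if src == source:
            value *= factor
            if dst == target:
                return value
            return propagate(value, dst, target, deps)
    raise KeyError(f"no dependency path from {source}")

impact = propagate(1000.0, "energy savings [kWh]",
                   "environmental impact score", kpi_dependencies)
```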

Predictive Functions Modeling
As mentioned in Section 4, the majority of the ontological models used to describe data mining models support the overall data mining workflows. Several other models are available that provide more complete frameworks for data mining, e.g., the OntoDM, DMOP, or Exposé ontologies, but those provide rather general concepts related to the data mining tasks, datasets, etc., as well as more in-depth and detailed entities related to algorithms or experiments. Regarding the description of the data mining models, DMOP covers a detailed description of the data mining algorithms and their internal structure.
In the cross-sectorial domain model, we decided to re-use the concepts from the DMOP ontology. Figure 7 depicts the main concepts re-used from DMOP and their linking to the rest of the cross-sectorial domain model. The model training task specifies a particular step in the data mining process, and it is realized by an algorithm. The algorithm concept from DMOP (or its sub-classes) describes the particular learning algorithm. The algorithm is applied to the data and produces a model, e.g., a predictive model training algorithm transforms input data into an output predictive model. Algorithm characteristics and parameters then specify various aspects of the model induction algorithm, e.g., the type of the solved problem, tolerance to noise in the input data, tolerance to missing values, etc., and are represented by the hasQuality and hasParameter properties, respectively. The model then represents the output of a learning algorithm. A model concept represents a logical definition of the predictive function. The model class (or its subclasses) then specifies the model structure and model parameters. The model structure describes the nature of the model, e.g., a logical structure (tree, rules, etc.) or a mathematical expression (e.g., kernel functions). A model parameter specifies one of the various parameters for a given model type; each specific model type has a specific set of model parameters, such as weights in neural network models or threshold splits in tree models. The model complexity measure class then contains concepts to quantify the model complexity; specific model types with specific parameters have assigned particular complexity measures.
In the case of a tree classification model, this could be the depth of the tree or the number of leaves; in the case of a support vector machine (SVM), this could be the number of support vectors or the sum of their weights. DMOP provides a concept hierarchy for both models and algorithms. The trained model can be used, when deployed on-site, in real-world application scenarios. Such scenarios are usually designed to optimize one or more KPIs for the given process. Using the cross-sectorial model, it is possible to specify the influences relationship between the trained model and a KPI related to the production process.
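The tree-based complexity measures mentioned above can be computed directly from a toy model; the nested-tuple tree encoding is an ad hoc assumption for illustration, not the model's representation.

```python
# A toy decision tree: a leaf is a class label (string); an inner node is a
# tuple (test, left_subtree, right_subtree). Tests and labels are made up.
tree = ("temp > 160", ("pressure > 2.1", "ok", "scrap"), "ok")

def depth(node):
    """Depth of the tree: number of tests on the longest root-to-leaf path."""
    if isinstance(node, str):
        return 0
    return 1 + max(depth(node[1]), depth(node[2]))

def leaves(node):
    """Number of leaves of the tree."""
    if isinstance(node, str):
        return 1
    return leaves(node[1]) + leaves(node[2])
```

Both quantities would be instances of the model complexity measure class attached to a tree model concept.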


Evaluation of the Semantic Models
Various methodologies for ontology evaluation are available that specify existing approaches for different types of ontologies [44–46]. Bandeira et al. [47] divided the evaluation techniques according to the ontology type (application, task, domain, and top-level ontologies) and presented the main evaluation methodology applicable for a specific ontology type. Different groups of evaluation methods are used based on what should be evaluated and when (in what stage of ontology design) [48]. In our approach, we decided to use the approach described by Gómez-Pérez et al. [49]. The primary reason for evaluating the model was driven by the specified use cases within the application domains of the MONSOON project.
The cross-sectorial domain model can be considered a domain ontology, which can be evaluated against a frame of reference. This can be achieved by formulating a set of competency questions, i.e., requirements taken from the real world. As the evaluation is an iterative process, it can be performed starting from the ontology development to achieve a complete model. In this case, competency questions can represent a part of the functional requirements that can be used during the design and must be satisfied by the implemented model. To use them during the evaluation phase, there is a need to determine how to properly formulate the competency questions and how to measure whether a question can be successfully answered using the developed ontology [50].
If we consider a competency question to be a natural language formulation that should be answered by the ontology, then the ability of the ontology to answer such questions can be measured by the possibility of expressing the competency question as a single query (or multiple queries) in a chosen query language using the ontology concepts and relationships. To collect a proper set of competency questions, different real-world scenarios covering most of the ontology use cases must be considered, and invalid questions must be removed. These include redundant and/or incomplete questions, and questions that cannot be answered due to limitations of the used formalism. Using the approach previously described [46], we designed a set of competency questions that were common for both the aluminum and plastic application domains and categorized them according to different patterns and competency question archetypes. A list of these competency questions was established in the design phase for all three developed modules. Table 1 summarizes the competency questions and defines which of the patterns each question covers. The answerability of the ontology can be evaluated by testing how the competency questions can be formulated in the chosen query language using the ontology concepts and relations. Table 2 summarizes the questions and presents their representation in a query language, which demonstrates the ability of the ontology module to answer the questions it was designed to answer. For that purpose, we formulated the questions using the SPARQL query language [51] and the concepts and relations from the presented semantic model.
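To illustrate answerability without a full triple store, the graph pattern behind a typical competency question (e.g., "which data elements are inputs of the predictive function that influences a given KPI?") can be sketched as a plain triple match; in practice this would be a SPARQL query over the RDF representation, and all names below are illustrative.

```python
# A tiny in-memory "graph" of (subject, predicate, object) triples standing
# in for the RDF knowledge base. All resource names are hypothetical.
triples = {
    ("BakingModel", "influences", "energy consumption KPI"),
    ("BakingModel", "hasInput", "pit temperature"),
    ("BakingModel", "hasInput", "fuel flow"),
}

def inputs_of_function_optimizing(kpi):
    """Match the pattern: ?f influences <kpi> . ?f hasInput ?input."""
    functions = {s for (s, p, o) in triples if p == "influences" and o == kpi}
    return sorted(o for (s, p, o) in triples if p == "hasInput" and s in functions)

answer = inputs_of_function_optimizing("energy consumption KPI")
```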
Besides the evaluation of the ontology expressiveness using competency questions, we also designed and performed tests aimed to evaluate how the designed ontology can support communication between the domain experts and data scientists and how the ontology can be used for automatization of the data mining tasks. In the first case (evaluation of the efficiency of communication), we performed two experiments [52].
In the first experiment, we divided the domain experts into two groups: (1) one documenting the results of the communication using textual documents and graphical schemas for the process models, and (2) one using the semantic framework to capture and directly formalize the domain knowledge in the cross-sectorial domain model. Both groups of domain experts were supported by the same team of data scientists. The goal of both groups was to complete the phases of problem understanding and data understanding (the modelling and model evaluation phases were not covered in this experiment). After the data scientists and domain experts produced the documentation and semantic models, the textual documents were manually annotated by the data scientists to extract the vocabulary of domain concepts described in the documents and the relationships between these concepts explicitly defined in the text sentences. The list of concepts extracted from the textual documentation of the first group was then compared with the concepts modelled directly in the cross-sectorial domain model by the second group of domain experts. The comparison was evaluated using metrics such as precision and recall to assess the coverage of the knowledge in both forms of documentation. The conclusion was that the semantic model fully covers the textual documentation (i.e., good recall), with some issues related to process modelling (i.e., lower precision for process models). The issues were mainly related to the use of wrong terminology for some process activities (e.g., using a term describing a tool instead of a verb describing the tool usage) and wrong decomposition of the processes into sub-activities and sub-processes.
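The coverage comparison described above reduces to set-based precision and recall over the two concept lists; the concept sets below are invented for illustration, not the experiment's actual data.

```python
# Concepts extracted from the textual documentation vs. concepts modelled
# directly in the semantic model (illustrative sets only).
from_text  = {"green anode", "paste mixing", "mixer", "anode weight"}
from_model = {"green anode", "paste mixing", "paste forming", "anode weight"}

def precision_recall(predicted, reference):
    """Precision and recall of one concept set against the other."""
    tp = len(predicted & reference)
    return tp / len(predicted), tp / len(reference)

p, r = precision_recall(from_model, from_text)
```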
In the second experiment, we divided the data scientists into two groups, and both groups worked with the same group of domain experts from the two different domains (i.e., aluminum production and plastic production). The task for the groups was again to complete the problem understanding and data understanding phases in cooperation with the domain experts, using the semantic models to document the domain knowledge. Each group of data scientists worked on one domain: one group on the aluminum use case and one on the plastic use case. After the groups finished the phases of problem and data understanding, they exchanged their domains and continued with the phases of data pre-processing, modelling, and validation. The group that described the problem and data for the plastic domain then completed the pre-processing, modelling, and evaluation for the aluminum domain, and vice versa. Note that the data scientists did not have any previous experience with either domain, so their understanding of the problem and data was based solely on the documentation expressed by the other group. In this case, we measured how much information was missing in the semantic model by collecting the list of competency questions asked by each group during the phases of pre-processing, modelling, and validation. The result was that both groups were able to correctly identify the problem and how it can be transformed into a data mining task, which predictive functions would be built from data, and which KPIs would be optimized by these functions. The missing knowledge formulated in the competency questions was mainly related to the dependencies between the data attributes and to the dependencies between the KPIs in the validation phase. However, all missing competency questions could be covered by the semantic model; it was possible to add the missing dependencies without extending the cross-sectorial domain model conceptualization.

Semantic Modeller
To improve the effectiveness of communication between the experts and to implement the shared knowledge base using the semantic model described in the previous sections, we designed and implemented a prototype of semantic modeling tools customized for the proposed model. The tool is primarily intended to be used collaboratively by the domain experts and data scientists. It provides a graphical interface enabling users to describe the characteristics of a particular industrial domain, e.g., draw the diagrams of the processes, describe the equipment and personnel related to the processes and process steps, annotate the data produced by those elements, etc. Besides the human-readable knowledge base maintained in the tool, a machine-readable representation of the domain model is created on the back-end.
The web-based user interface of the semantic tools is presented in Figure 8. The layout of the interface is divided into the navigation panel (1), detailed contextual view (2), and main content view (3). The navigation panel consists of the navigation tree presenting the domain concepts organized according to the main types of views (processes, equipment, and people). The detailed contextual view provides an additional description of the domain concepts selected in the navigation panel. The navigation panel also provides the context for the main view and filter information related to the selected domain concepts (i.e., the structure of the domain part of the semantic model is always visible for the experts for navigation). The main views correspond to the main concepts and present the semantic properties and relations for processes, data (elements), KPIs, and predictive functions.

Process View
In the process view, the user can use the graphical process editor to model production processes. The graphical notation is based on the simplified BPMN diagrams that consist of the main building blocks: start/end node, activity (representing a step of the production process), generic gateway (branching of the process into parallel phases), and sub-processes. The process view is synchronized with the navigation bar, i.e., the user can select an element in the navigation bar, and the process view will switch to the corresponding process/sub-process and select the element. A description of the selected element is provided in the detailed contextual view.

Data and KPIs View
The data and KPIs view provides a list of the data elements and KPIs defined for the specified part of the production process or resource. For data elements, it provides information about the role of the data attribute (which can be used as an input/output of a predictive function), measurement units, and data type. An important part of the view is the filter, which allows selecting only the subset of elements relevant to the specified domain concepts. In this way, users can create a contextual view of a specific data analytics task targeted at the optimization of a production step or piece of equipment. Additionally, the user can annotate each data element with arbitrary tags and use these annotations to further filter the presented information.
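The tag-based filtering described above can be sketched as follows; the element records and tag names are invented for illustration:

```python
# Hypothetical data-element records as they might be listed in the view.
elements = [
    {"name": "mixer_power", "role": "input", "unit": "kW",
     "tags": {"mixing", "energy"}},
    {"name": "paste_temperature", "role": "output", "unit": "degC",
     "tags": {"mixing", "quality"}},
    {"name": "furnace_pressure", "role": "input", "unit": "Pa",
     "tags": {"baking"}},
]

def filter_elements(elements, required_tags):
    """Keep only the elements annotated with every requested tag."""
    required = set(required_tags)
    return [e for e in elements if required <= e["tags"]]

# Contextual view for a task optimizing the mixing step:
mixing_view = filter_elements(elements, {"mixing"})
```

Combining several tags narrows the view further, e.g., `{"mixing", "energy"}` keeps only the mixer power element.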

Predictive Functions View
Similar to the data view, the predictive functions view provides a list of the semantic elements describing the predictive functions specified by the data scientists. The list of functions can be filtered according to the context specified by tags and domain concepts. For the selected function, the view provides an overview of the main information about the function, such as a table summarizing its inputs, outputs, and influenced KPIs.

Case Study
This section presents the application of the cross-sectorial domain model to one of the project's application domains: aluminum production. The MONSOON aluminum use case was evaluated at AP Dunkerque smelter, France, which has the highest primary aluminum production in the EU.

Currently, the aluminum production process can be considered highly mature, but several areas could still benefit from the application of modern data analytics.
The aluminum production process depicted in Figure 9 consists of several main steps. Aluminum is produced using an electrolysis process in the potline. The potline consists of several hundred pots in which liquid aluminum is produced by electrolysis. The main inputs to the process are electricity, alumina (Al2O3, obtained from bauxite), anodes (produced on-site, in the carbon plant), and an electrolytic bath [53]. Each pot is equipped with anodes and cathodes. The anodes are consumed during the process and therefore have to be continually replaced. The anodes and their quality are among the most critical and controllable inputs for the electrolysis. Using the cross-sectorial domain model and the simple BPMN graphical notation, the top-level aluminum production process can be modeled as a production process consisting of process segments, as depicted in Figure 10.
Anodes are produced in the carbon plant. Anode production consists of three stages:
(1) Green anode production: the paste is prepared and the anodes are formed before their baking in the furnace.
(2) Anode baking: baked anodes are produced from green anodes in the chambers of the baking furnace.
(3) Anode rodding: rodded anodes are produced from baked anodes by splicing them together with stems; these are the final product of the carbon plant stage and are distributed to the electrolysis phase.
Anode production can be modeled as a process segment, i.e., as a separate sub-process, as depicted in Figure 11.
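As an illustration, a sub-process such as this could be expressed as a JSON-LD-like document. The `@context` URL, identifiers, and property names below are assumptions for the sketch, not the actual project vocabulary (Appendix A of the paper contains the real documents):

```python
import json

# Hypothetical JSON-LD description of the anode production sub-process
# and its three process segments.
anode_production = {
    "@context": "https://example.org/cross-sectorial-model/context.jsonld",
    "@id": "ex:AnodeProduction",
    "@type": "ProcessSegment",
    "label": "Anode production",
    "hasSegment": [
        {"@id": "ex:GreenAnodeProduction", "@type": "ProcessSegment",
         "label": "Green anode production"},
        {"@id": "ex:AnodeBaking", "@type": "ProcessSegment",
         "label": "Anode baking"},
        {"@id": "ex:AnodeRodding", "@type": "ProcessSegment",
         "label": "Anode rodding"},
    ],
}

# Serialize to the machine-readable form stored on the back-end.
document = json.dumps(anode_production, indent=2)
```

The nesting of `hasSegment` mirrors the hierarchical decomposition of processes into segments shown in the diagrams.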
At the production segment level, each of the anode production process segments can be hierarchically modeled in a similar fashion as a separate process. The example in Figure 12 presents the sub-process diagram for the green anode production segment. Green anodes are produced from calcined petroleum coke, coal-tar pitch, recycled scraps, and anode butts. These ingredients are prepared and dosed. The material is then processed (including milling, screening, proportioning, and mixing of the grains), which results in a paste used to form the anodes.
Each of the process steps on each level of the structure can be described by specifying its relation to the equipment used in the process. We can specify a particular device and its relation to a process segment (e.g., the buss mixer is related to the mixing phase) and define the concept hierarchy for the devices (e.g., define the mixer as a hydraulic device type). Equipment can be described using several parameters (input or output). For example, the buss mixer can be characterized by a mixer power parameter. A parameter is, in general, specified by a corresponding data element, which defines the units, description, type, and tags for full-text search in the modeler tool (e.g., the mixer power attribute is an input, which means that it can be set by personnel on the plant site). The parameter can then be tied (using the represents relation) to a concrete data element in a specific data set. The data set is described similarly, using concepts from the DMOP ontology.
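A device description along these lines might look as follows in a JSON-LD-like sketch; all identifiers and property names are hypothetical:

```python
# Hypothetical description of the buss mixer device: its place in the
# device concept hierarchy, its relation to a process segment, and an
# input parameter tied to a data element via a "represents" relation.
buss_mixer = {
    "@id": "ex:BussMixer",
    "@type": "HydraulicDevice",       # concept hierarchy: mixer as a hydraulic device
    "relatedTo": "ex:MixingSegment",  # device-to-process-segment relation
    "parameters": [
        {"@id": "ex:MixerPower",
         "direction": "input",        # settable by personnel on the plant site
         "unit": "kW",
         "tags": ["mixing", "energy"],
         "represents": "ex:Dataset1/mixer_power"},
    ],
}
```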
A KPI is described by specifying a dependency relation between a data element and a KPI element. In this scenario, we specified an energy consumption KPI that represents the overall energy consumption of the green anode production process. The buss mixer, as a device involved in one of the process segments, influences this KPI. Its contribution can be represented by defining a dependency between the energy consumption element and the buss mixer power consumption data element.
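A minimal sketch of such a dependency, again with invented identifiers:

```python
# Hypothetical KPI element with a dependency on a data element.
kpi = {
    "@id": "ex:EnergyConsumptionKPI",
    "@type": "KPI",
    "unit": "kWh",
    "dependsOn": ["ex:BussMixerPowerConsumption"],
}

data_element = {
    "@id": "ex:BussMixerPowerConsumption",
    "@type": "DataElement",
    "role": "input",
    "unit": "kW",
}

def influences(data_id, kpi):
    """A data element influences a KPI if the KPI depends on it."""
    return data_id in kpi["dependsOn"]
```

Making the dependency explicit is what later allows tracing how a predictive function acting on a data element affects the overall KPI.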
The random forest (RF) classification model can be used in a predictive maintenance scenario to anticipate the breakdown of a buss mixer during its operation. This model can be described by the DMOP concepts using the semantic model. The model itself is the output of an RF algorithm (a univariate decision tree algorithm) applied to a concrete dataset (Dataset1) and specified using a set of settings, e.g., the maximum depth of the trees in the forest and the minimal number of records in a split or leaf, and it has several quality attributes. The algorithm was applied using concrete values of its parameters, which are specified as individuals of the Parameter class; e.g., the Min_samples_leaf parameter instance was set to an integer value of 1, Min_samples_split was set to 5, and Max_depth was set to none (which specifies that the parameter was not constrained in model training). The model was trained to anticipate breakdowns of the buss mixer, which directly influence the energy consumption of the device (and therefore of the process as a whole), as depicted in Figure 13.
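The parameter setting described above maps naturally onto scikit-learn's `RandomForestClassifier`. The following sketch uses synthetic sensor data, since the actual dataset (Dataset1) is not available here; the feature semantics are invented:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for Dataset1: 200 records of 4 sensor readings
# (e.g., mixer power, temperature, ...), labeled 1 for imminent breakdown.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 1).astype(int)

# Parameter individuals from the semantic description:
# Min_samples_leaf = 1, Min_samples_split = 5, Max_depth = none.
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,       # depth not constrained during training
    min_samples_split=5,
    min_samples_leaf=1,
    random_state=0,
)
model.fit(X, y)
accuracy = model.score(X, y)  # one possible quality attribute of the model
```

In the semantic model, each of these keyword arguments corresponds to a Parameter individual attached to the algorithm application, and `accuracy` would be recorded as a quality attribute of the resulting model.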
The described scenario and the presented diagrams can be represented using the semantic model in a machine-readable format in JSON-LD notation. Appendix A contains example JSON-LD documents for each of the presented elements.

Conclusions
In this paper, we proposed a unified ontology for the formalization of data analytics processes applied for optimization in the process industry. This ontology promotes the sharing of best practices and results within and between domain experts and data scientists, reducing both the duplication and the loss of knowledge. It is an essential step in formalizing data analytics methods and how they can be applied to the optimization of production processes. The proposed ontology has a modular architecture divided into a domain-specific part and a data analytics part. The domain-specific part specifies concepts for the description of production processes, equipment, human resources, and overall KPIs. The data analytics part specifies concepts for data representation, pre-processing, and the modeling of predictive functions. Both parts are interlinked in one unified semantic framework. The proposed semantic framework was developed using a sound methodology with an emphasis on the usability of the semantic models in various user scenarios covering most of the data analytics phases, from problem understanding to model validation. Besides the formal specification of the semantic models, we also designed and implemented modeling tools that allow collaborative semantic modeling and the sharing of knowledge between domain experts and data scientists. An important feature of the modeling tools is that they are designed for users without skills in semantic technology. In future work, we would like to further extend our model with functional modeling of the dependencies between data elements and KPIs. Such extensions will allow automatic reasoning and computation of the impact of the data analytics methods on the overall performance of the production process as assessed by the specified KPIs.