Applied Sciences
  • Article
  • Open Access

31 August 2021

Development of Knowledge Base Using Human Experience Semantic Network for Instructive Texts

1
Faculty of Energy Systems and Nuclear Science, Ontario Tech University, Oshawa, ON L1G 0C5, Canada
2
Department of Electrical and Computer Engineering, Ontario Tech University, Oshawa, ON L1G 0C5, Canada
3
IRI, Reactor Innovation, Ontario Power Generation, Whitby, ON L1N 9E3, Canada
*
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Integrating Knowledge Representation and Reasoning in Machine Learning

Abstract

An organized knowledge structure, or knowledge base, plays a vital role in retaining knowledge: data are processed and organized so that machines can understand them. Instructive text (iText) consists of a set of instructions to accomplish a task or operation. Hence, an iText comprises a title or name of the task or operation and step-by-step instructions on how to accomplish it. For iTexts, storing only entities and their relationships with other entities does not always capture the knowledge they contain, because iTexts include parameters and attributes of different entities and their actions under different operations or procedures, and the values differ for each operation or procedure even for the same entity. This research gap limits the ability to learn about different operations, capture human experience, and dynamically update knowledge for each individual operation or instruction. This research presents a knowledge base for capturing and retaining knowledge from iTexts in operational documents. From each iText, small pieces of knowledge are extracted and represented as nodes linked to one another in a knowledge network called the human experience semantic network (HESN). HESN is the crucial component of the proposed knowledge base. The knowledge base also contains domain knowledge, consisting of classified terms and key phrases of the specific domain.

1. Introduction

Instructive texts (iTexts) differ from standard texts in structure and textual pattern. iTexts usually instruct or describe how to do something in a step-by-step process. Consider the question of how to fix a turbine: the answer consists of a few procedures to follow, which together accomplish the main goal or operation. An iText usually consists of a title, which could be the name of the process or operation, and a set of instructions or procedures that accomplish the operation step by step. Figure 1 shows the differences between regular or standard text and iText in terms of structure and textual pattern.
Figure 1. Difference between regular text and instructive text (iText).
Employees in large industries, such as the nuclear industry, store information such as procedures, precautions, experiments, and risk factors in a large number of handwritten or PDF documents. These are called operational documents, and employees follow them during operation in order to accomplish each task efficiently. Experienced personnel continually move to other departments or retire, and a tremendous amount of expertise is lost as a result. This loss is costly: the industry has to invest in training less experienced personnel, and delayed or incorrect activities lead to indirect losses. A less experienced employee cannot carry out complex tasks without sufficient knowledge of, and training in, the documents and their operations. Training can take months to cover the different operations, and the longer the training period, the more expensive it is for the industry. It is also often difficult to retrieve specific information during operation or other practices. Quick retrieval of the desired information helps employees work faster when they are in the middle of an industrial activity or in a lab, whereas much time is wasted searching innumerable documents for a specific piece of information during a complex operation. Inaccurate information retrieval carries a high risk of operational failure, which is again costly for the industry to recover from. If the information and human experience in this large number of documents could be extracted, structured, and retained in a knowledge base from which the desired information could be easily retrieved at any time, operational time could be saved and put to better use. This could also reduce training and learning expenses and speed up the learning process.
With the help of the knowledge base, a less experienced employee would also be able to perform complex operations that were previously beyond them. However, managing this knowledge base becomes critical as the amount of information grows. Without proper structuring of knowledge, information retrieval will be expensive.
Hence, this knowledge can be structured in the form of a network that retains the human expertise in these documents in an organized way. The network establishes relationships among the entities, their actions, attributes, and different values and parameters found in each iText and procedure, which may carry information about the human role, tool, equipment, location, document, operation, procedure, etc., associated with that particular operation. Current research approaches to developing a knowledge base, retaining relationships among entities, and structuring knowledge from standard texts do not fully apply to iTexts, because the relationships, attributes, and properties of entities in iTexts differ from one operation to another. This paper presents a knowledge base consisting of HESN, which structures knowledge by capturing the human experience from iTexts using grammar and semantic meaning, and domain knowledge, which consists of classes and properties of information related to the specific domain. The knowledge base retains the properties, relationships, and values of the different entities, action terms or verbs, attributes, and attribute values found in an iText for different operations. It extracts the real expertise from iTexts and dynamically updates the HESN stored in the knowledge base. The contributions of this work can be summarized as follows:
  • The development of an adaptive, dynamic, and deterministic knowledge structure with qualitative and quantitative attributes, called the human experience semantic network (HESN), which captures and structures knowledge from iTexts in the form of nodes and edges;
  • The development of a knowledge base, consisting of HESN and domain knowledge, for retaining the properties, values, and relationships of the different terms or key phrases found in iTexts. These terms or key phrases could be an entity, action term or verb, attribute, or attribute value. The knowledge is structured for the different entities, action terms, attributes, and attribute values based on different operations.
The rest of this paper is organized as follows. Section 2 reviews the literature related to this work. Section 3 explains the methodology and HESN. Section 4 demonstrates the advantages and abilities of this approach. Section 5 draws conclusions.

3. Research Methodology

The development of the knowledge base for iTexts is a step-by-step process. The two major parts of the knowledge base are the domain knowledge and HESN itself. Together they accomplish several tasks: identifying different terms and key phrases, establishing relationships among them, structuring the knowledge about the terms and key phrases found in iTexts according to the operation under which each instruction is provided, and finally updating HESN and the knowledge base. For simplicity, all entities and named entities are termed entities in this paper. In this research, relationships are established among four types of terms or key phrases: entities, action terms or verb terms, attribute terms, and attribute values. Domain knowledge consists of information related to these terms or key phrases. They are represented using classes and properties, which help to detect and identify terms and phrases in iTexts. Each time new instructions are learned, the HESN is updated. Updating HESN or the domain knowledge also means updating the knowledge base, since HESN and the domain knowledge constitute the knowledge base.

3.1. iTexts Extraction and Preprocessing

This research is based on test documents shared by the maintenance section of Ontario Power Generation (OPG), which is responsible for approximately half of the electricity generation in the Province of Ontario, Canada. The test documents had different contents related to purpose, pre-requisites, instructions, post-requisites, definitions, summary of changes, validation and verification, and similar information about different processes, operations, inspections, equipment, etc. Only the texts related to operational procedures and instructions were extracted, using an algorithm developed to follow a standard procedure and group the texts by combining related sentences. In this way, the entire document is divided into small chunks, where each chunk consists of the title of the operation or procedure (Parent-iText or PT) and the set of instructions underneath it (Child-iText or CT). These chunks were further processed to capture knowledge and retain it in the knowledge base with the help of HESN and domain knowledge. This paper, however, focuses mainly on extracting knowledge from iTexts and developing the knowledge base and HESN rather than on text extraction from documents.

3.2. Domain Knowledge Development

The domain knowledge is an essential part of the knowledge base. It is used to identify the entities, action terms, attribute terms, and attribute values in the iTexts. The entities are the nouns or names of different equipment, documents, tools, human roles, etc. The attribute terms could be terms like height, pressure, weight, etc. Attribute values refer to any number or value; a value could be a status, such as complete or in progress, or a condition, such as poor, high, dry, and so on. The term “action” refers to verbs, such as measure, work, move, check, etc. Each entity, action, attribute, and attribute value, except for numbers, is pre-defined and classified as part of the knowledge base. There can be many classes, and under each class there exist multiple entities, actions, attributes, or attribute values. Each of these classes has properties. For example, “humanRole” is the name of a class. Under this class, there are entities such as “engineer”, “personnel”, “manager”, etc. The domain could be a nuclear power plant, the chemical industry, or any other category. Based on this domain, different terms and phrases are defined and classified with the help of domain experts. The richer the domain knowledge, the more terms can be identified and, as a result, the more relationships can later be established among the terms found in an iText. The properties of each class record a term or phrase’s associations with other terms, and the iText and operation from which the term was identified. This information about different terms or phrases is updated each time new iTexts are read from a document.
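The class-and-term organization described above can be pictured with a minimal sketch. It assumes a simple in-memory layout; only the class “humanRole” and its three terms come from the text, while `DomainClass`, `identify`, and every other class name and term are hypothetical placeholders for what domain experts would supply.

```python
from dataclasses import dataclass, field


@dataclass
class DomainClass:
    """One class of the domain knowledge, e.g. "humanRole"."""
    name: str
    kind: str  # "entity", "action", "attribute" or "value"
    terms: set = field(default_factory=set)


# Only "humanRole" and its terms appear in the paper; the rest is illustrative.
DOMAIN_KNOWLEDGE = [
    DomainClass("humanRole", "entity", {"engineer", "personnel", "manager"}),
    DomainClass("equipment", "entity", {"pump", "turbine"}),
    DomainClass("activity", "action", {"measure", "check", "move"}),
    DomainClass("measurement", "attribute", {"pressure", "height", "weight"}),
    DomainClass("condition", "value", {"poor", "high", "dry"}),
]


def identify(term):
    """Return (class name, kind) for a term, or None if the term is unknown."""
    term = term.lower()
    # Numbers are not pre-defined in the domain knowledge; treat them as values.
    if term.replace(".", "", 1).isdigit():
        return ("number", "value")
    for cls in DOMAIN_KNOWLEDGE:
        if term in cls.terms:
            return (cls.name, cls.kind)
    return None
```

With this layout, `identify("engineer")` yields the class and kind of the term, and a term missing from the domain knowledge yields `None`, which mirrors the remark that richer domain knowledge allows more terms to be identified.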

3.3. Human Experience Semantic Network (HESN)

HESN is the key component of our proposed knowledge base. Different terms and key phrases are identified from iTexts with the help of domain knowledge. A document may contain different operations or procedures, and under each operation there may be multiple instructions or procedures that describe how to accomplish that particular operation. The same term or key phrase may appear in different operations. HESN is the knowledge network that shows the association or relation of a term or key phrase with other terms or key phrases based on different operations. Each of these terms or key phrases could be an entity, action, attribute, or attribute value. The network is represented in the form of nodes and edges that constitute a tree or undirected graph. Figure 2 gives a small glimpse of HESN, where nodes and edges are connected. Figure 3 presents detailed information about each node. Because the network retains the relations, semantics, and information about different terms and key phrases, and captures the knowledge and experience from iTexts in operational documents, it is called the human experience semantic network (HESN). Relations among different terms or key phrases are tied to their operation with the help of tags. The methodology for creating relations among terms and phrases, and the use of tags, is explained later in this paper.
Figure 2. Human experience semantic network (HESN).
Figure 3. Three different entities and their classes and properties found in domain knowledge. Values are updated when a new iText is read.

3.4. Entity, Action, Attribute and Value Recognition and Linking

Domain knowledge is used to recognize terms and key phrases, each of which could be an entity, action term, attribute term, or value. If a named entity, action, or attribute that consists of more than one word is identified, its words are concatenated into a single word. For example, water pump = waterpump. This makes it easier to establish relationships among the words or key phrases later on. An attribute is a property of an entity, such as pressure, height, or condition. Its value could be high, low, poor, etc., or it could be numeric. The domain knowledge contains all of these terms except the numeric values. Once identification and concatenation are done, the next task is to establish relationships among the words or phrases. First, the stop words are removed from the sentence except for a few, which are “on,” “in,” “this,” “have,” “has,” and “should.” There can be six types of relationships: (i) entity-action (E-Ac), (ii) entity-entity (E-E), (iii) entity-attribute (E-Att), (iv) entity-value (E-V), (v) action-attribute (Ac-Att), and (vi) attribute-value (Att-V). A relationship is always created between two terms or key phrases. Grammar pattern-based linguistic matching is done with the help of a library named spaCy [20]. This identifies the direct dependency of one word on another word in a sentence in the form of a duplet. Each of these duplets is further processed and reorganized.
Figure 4 shows the algorithm by which the tags are created and the duplets are generated from each iText. Tags are the nouns and verbs found in an iText. A set of tags is stored against each duplet. It identifies the particular iText from which the duplet was generated. This information, in turn, helps to distinguish the relationships between different terms and key phrases based on different operations. The use of tags is explained further later in this paper.
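The concatenation and stop-word steps can be illustrated with a short sketch in plain Python. The list of retained stop words is taken from the text; the base stop-word list, the multi-word term list, and the function names are assumptions made for illustration, and a real pipeline would likely take its stop words from an NLP library.

```python
import re

# Stop words that the text says are kept.
KEPT_STOP_WORDS = {"on", "in", "this", "have", "has", "should"}
# Illustrative stop-word list only (assumption, not the authors' list).
STOP_WORDS = {"the", "a", "an", "of", "to", "be", "is", "and", "at",
              "on", "in", "this", "have", "has", "should"}
# Multi-word terms would come from the domain knowledge.
MULTI_WORD_TERMS = ["water pump"]


def concatenate_terms(sentence):
    """Join each known multi-word term into one word (water pump -> waterpump)."""
    for phrase in MULTI_WORD_TERMS:
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        sentence = re.sub(pattern, phrase.replace(" ", ""), sentence,
                          flags=re.IGNORECASE)
    return sentence


def remove_stop_words(sentence):
    """Lower-case, strip punctuation, and drop stop words except the kept ones."""
    tokens = [t.strip(".,;:!?").lower() for t in sentence.split()]
    return [t for t in tokens
            if t and (t not in STOP_WORDS or t in KEPT_STOP_WORDS)]


def preprocess(sentence):
    # Concatenation happens first, then stop-word removal, as in the text.
    return remove_stop_words(concatenate_terms(sentence))
```

For the hypothetical instruction "The pressure of the water pump should be 3.", `preprocess` keeps `pressure`, `waterpump`, `should`, and `3`, so the retained stop word "should" survives while "the", "of", and "be" are dropped.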
Figure 4. Algorithm of creating tags and duplet formation.
In the algorithm, OP in Step 2 refers to a set of instructions having a title or operation name (PT) and one or more instructions (CT). T[i] represents each iText, which could be a PT or a CT. ‘N’ and ‘V’ denote all nouns and verbs extracted from that particular iText. ‘TAGS’ in Step 5 denotes all the nouns and verbs of T[i], whereas ‘ALLTAGS’ in Step 6 denotes the tags of that particular iText together with ‘PTAGS’, the tags extracted from the PT. In Step 8, all necessary components of the spaCy library are loaded and assigned to ‘sp’, which can then be used for tasks such as finding word dependencies within a sentence. In Step 9, the stop words are removed from the iText except for a few, which are ‘on’, ‘in’, ‘this’, ‘have’, ‘has’, and ‘should’. In Step 10, the iText is processed using ‘sp’ to obtain information such as direct word dependencies and the part-of-speech tag of each word. In Step 11, the function ‘getAllDD’ returns the word dependency for each word in the sentence in the form of duplets. The two elements of a duplet are represented as d[0] and d[1], as shown in Step 12. ‘DK’ consists of all terms found in the domain knowledge. The final ‘DD’ obtained in Step 14, after the loop ends, consists of the sorted duplets. Concatenating the duplets creates a small network for that particular iText, as shown in Figure 5, which illustrates how a small HESN network is generated from an iText. This network is the building block of HESN. Step 15 is described in Section 3.6, “Update HESN”.
Figure 5. Generation of duplets and formation of small network from iText.
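As a rough illustration of how raw dependency duplets could be filtered against the domain knowledge and arranged into the six relation types, consider the sketch below. It does not reproduce the algorithm of Figure 4: the input duplets are assumed to have already been produced by a dependency parser such as spaCy, and the `DK` contents, the ordering rule, and the function names are hypothetical.

```python
# Hypothetical stand-in for "DK": term -> kind (E, Ac, Att or V).
DK = {"pump": "E", "personnel": "E", "check": "Ac", "pressure": "Att", "high": "V"}

# The six relation types named in the text, as ordered (kind, kind) pairs.
ALLOWED = {("E", "Ac"), ("E", "E"), ("E", "Att"),
           ("E", "V"), ("Ac", "Att"), ("Att", "V")}


def kind_of(term):
    """Look up the kind of a term; numeric strings count as values."""
    if term.replace(".", "", 1).isdigit():
        return "V"
    return DK.get(term)


def classify(duplet):
    """Return (first, second, relation) in canonical order, or None."""
    a, b = duplet
    ka, kb = kind_of(a), kind_of(b)
    if ka is None or kb is None:
        return None  # at least one term is not in the domain knowledge
    if (ka, kb) in ALLOWED:
        return (a, b, f"{ka}-{kb}")
    if (kb, ka) in ALLOWED:
        return (b, a, f"{kb}-{ka}")
    return None  # not one of the six relation types


def build_network(raw_duplets):
    """Keep only duplets that map onto one of the six relation types."""
    classified = (classify(d) for d in raw_duplets)
    return [c for c in classified if c is not None]


# Duplets as a parser might emit them for the pump example of Figure 5.
raw = [("pressure", "pump"), ("3", "pressure"), ("should", "pressure")]
network = build_network(raw)
```

Here `network` holds the E-Att link between `pump` and `pressure` and the Att-V link between `pressure` and `3`; the duplet containing "should" is dropped because that word is not a domain term. Chaining the two surviving duplets gives the small pump, pressure, 3 network of the kind shown in Figure 5.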

3.5. Tag Generation and Relation Tracking

For iTexts, it is essential to track the information about different terms and phrases provided in different sets of instructions or operations. In Figure 5, the entity is “pump”, its attribute is “pressure”, and the value is 3. Let this be the value for “pump” in operation OP1. There could be another operation, OP2, where the entity and attribute are the same but the value is 7. In this case, two different values are obtained for the same attribute of the same entity, but for different operations, OP1 and OP2. Tags play an important role in keeping track of this knowledge. Figure 6 shows how relations of the same entity are structured for two different operations. In this research, tags are the nouns and verbs extracted from the text, where verbs must be longer than two characters and nouns may be of any length. For every network generated from an instruction, tags are stored against it. These tags contain the nouns and verbs extracted from that particular instruction and from the title of the operation under which the instruction appears. Considering the same example from Figure 5, if T1 is the set of tags for the associations in the small network formed from that particular iText, then T1 consists of the nouns and verbs of that iText (CT), along with the tags generated from the title of its operation (PT). This is done for every instruction under the same operation, which makes it possible to track which information comes from which operation. Part-of-speech (POS) tagging, a popular natural language processing technique, is used in this research to generate the tags from each iText. The process is shown in Figure 7.
Figure 6. Updating the value of the same entity from two different iTexts for two different operations, which shows how HESN is updated.
Figure 7. Extracting nouns and verbs from text and converting them to root form as tags.
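The tag rule stated above (nouns of any length, verbs longer than two characters, CT tags combined with PT tags) can be sketched as follows. To stay self-contained, the sketch assumes the tokens have already been lemmatized and POS-tagged by a tagger such as spaCy and are supplied as `(lemma, pos)` pairs; the `NOUN`/`VERB` labels, the example sentences, and the function names are assumptions.

```python
def make_tags(tagged_tokens):
    """Collect nouns (any length) and verbs longer than two characters."""
    tags = []
    for lemma, pos in tagged_tokens:
        keep = pos == "NOUN" or (pos == "VERB" and len(lemma) > 2)
        if keep and lemma not in tags:
            tags.append(lemma)  # preserve first-seen order, no duplicates
    return tags


def all_tags(ct_tokens, pt_tokens):
    """Tags of an instruction (CT) plus the tags of its operation title (PT)."""
    tags = make_tags(ct_tokens)
    for tag in make_tags(pt_tokens):
        if tag not in tags:
            tags.append(tag)
    return tags


# Hypothetical title "Pump inspection" and instruction
# "Go to the pump and measure pressure", already lemmatized and tagged.
pt = [("pump", "NOUN"), ("inspection", "NOUN")]
ct = [("go", "VERB"), ("pump", "NOUN"), ("measure", "VERB"), ("pressure", "NOUN")]

t1 = all_tags(ct, pt)
```

In this example the two-character verb "go" is discarded, and `t1` contains `pump`, `measure`, `pressure`, and `inspection`, so the tag set carries both the content of the instruction and the title of the operation it belongs to.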

3.6. Update HESN

Figure 5 shows how a small network is generated from each iText, consisting of the relationships among the entity, action, attribute, and value terms, and how the respective tags are generated from that particular instruction (CT) and the title of the operation (PT). HESN consists of nodes and edges. Figure 3 presents detailed information about a node. Whenever a new relation is created between two terms or key phrases, the properties of both terms are updated. For example, in Figure 3, the terms “personnel” and “sign” are related and were found in an iText. The term “personnel” is an entity, whereas the term “sign” is an action. The property “AssociatedAction” of “personnel” is updated with the term “sign”, which indicates that the entity “personnel” is related to an action called “sign”. The property also records the iText in which these terms were identified and the tags extracted from the title of the operation, which indicates the operation under which the iText was found. If a new term, such as “move”, is found to be related to “personnel” in a new iText, then “move” is added to the property “AssociatedAction” of “personnel” in the same way. Every new association creates a small network. All these small networks together form a more extensive network, which is the HESN. This is how each node of HESN, and the network itself, is updated. Because the information in HESN is stored in the knowledge base, the knowledge base is updated as well.
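The node update described above might look like the following sketch. Only the property name “AssociatedAction” and the terms “personnel”, “sign”, and “move” come from the text; the other property names, the record layout, the example iTexts and tags, and the `HESN` class itself are hypothetical.

```python
from collections import defaultdict

# "AssociatedAction" is named in the paper; the other names are assumptions.
PROPERTY_FOR_KIND = {
    "E": "AssociatedEntity",
    "Ac": "AssociatedAction",
    "Att": "AssociatedAttribute",
    "V": "AssociatedValue",
}


class HESN:
    def __init__(self):
        # nodes[term][property] -> list of association records
        self.nodes = defaultdict(lambda: defaultdict(list))

    def add_relation(self, term_a, kind_a, term_b, kind_b, itext, tags):
        """Update the properties of both terms of a newly found relation."""
        pairs = ((term_a, term_b, kind_b), (term_b, term_a, kind_a))
        for src, dst, dst_kind in pairs:
            record = {"term": dst, "itext": itext, "tags": set(tags)}
            self.nodes[src][PROPERTY_FOR_KIND[dst_kind]].append(record)

    def associated(self, term, prop):
        """List the terms stored under one property of a node."""
        return [r["term"] for r in self.nodes[term][prop]]


hesn = HESN()
hesn.add_relation("personnel", "E", "sign", "Ac",
                  itext="Personnel should sign the form",
                  tags={"personnel", "sign", "form", "inspection"})
hesn.add_relation("personnel", "E", "move", "Ac",
                  itext="Personnel should move the tool",
                  tags={"personnel", "move", "tool", "inspection"})
```

After both calls, the “AssociatedAction” property of “personnel” lists `sign` and `move`, each with the iText and tags it came from, while the nodes for “sign” and “move” each point back to “personnel”.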

4. Advantage of the Proposed Knowledge Base

Structuring knowledge from iTexts and capturing the human experience in them is not only about extracting or establishing relationships among the entities found in each iText, but more about linking that information and those relations to different operations, which is done using tags, as shown in the methodology. The six types of relation established from iTexts are: (i) entity-action (E-Ac), (ii) entity-entity (E-E), (iii) entity-attribute (E-Att), (iv) entity-value (E-V), (v) action-attribute (Ac-Att), and (vi) attribute-value (Att-V).

4.1. Query Evaluation

The knowledge base proposed in this paper is advantageous when learning from operational and procedural documents that consist of iTexts. The information in such documents needs to be retrieved based on different operations. Status, condition, involvement of a human role, measurement, activity, etc., vary between operations even though the terms are the same. Hence, when a query refers to an operation, HESN can provide information specific to that operation. This is what distinguishes HESN and makes it efficient for iTexts. From Figure 6, two queries could be considered:
  • What should be the pressure of pump for Operation 1?
  • What should be the pressure of pump for Operation 2?
Here, “Operation 1” and “Operation 2” are the titles (PT) of two separate operations. If the tags of the CT of “Operation 1” are T1 and the tags of the CT of “Operation 2” are T2, then it is possible to retrieve the network relating “pressure”, “pump”, and “3” based on T1, and the network relating “pressure”, “pump”, and “7” based on T2. In this way, both questions can be answered using HESN. Moreover, the domain knowledge contains information about the class of each term. This helps to identify entities, actions, attributes, and attribute values, and supports complex reasoning through HESN.
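One way to picture tag-based retrieval for these two queries is the sketch below. The paper leaves the retrieval mechanism to future work, so the tag-overlap scoring used here is purely an assumption, as are the network layout, the function name, and the titles "startup" and "shutdown", which stand in for “Operation 1” and “Operation 2”.

```python
# Two small networks for the same entity and attribute, tracked by tag sets.
NETWORKS = [
    {"tags": {"pump", "pressure", "startup"},    # stands in for T1
     "duplets": [("pump", "pressure"), ("pressure", "3")]},
    {"tags": {"pump", "pressure", "shutdown"},   # stands in for T2
     "duplets": [("pump", "pressure"), ("pressure", "7")]},
]


def lookup_value(entity, attribute, query_tags):
    """Return the attribute value from the network whose tags best match."""
    best, best_overlap = None, -1
    for net in NETWORKS:
        if (entity, attribute) not in net["duplets"]:
            continue  # this network does not relate the entity and attribute
        overlap = len(net["tags"] & set(query_tags))
        if overlap > best_overlap:
            best, best_overlap = net, overlap
    if best is None:
        return None
    for head, tail in best["duplets"]:
        if head == attribute:
            return tail  # the attribute-value duplet
    return None
```

Calling `lookup_value("pump", "pressure", {"pump", "pressure", "startup"})` selects the first network and returns its value, while the same call with `"shutdown"` in the query tags selects the second, so the same entity and attribute resolve to different values depending on the operation.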

4.2. Relation Extraction

The accuracy of relation extraction is measured on the procedural and operational test documents provided by OPG. In total, 25 different types of sentences or iTexts were selected, and 102 relations were extracted. Each relation is made between two keywords or phrases. A total of 16 relations were ignored because they do not fall into the six types of relations mentioned previously, and 79 relations were correctly extracted. Figure 8 is a table that shows, as an example, how duplets are generated and finalized from each iText. Each of these duplets contains a relation, and the terms are already classified in the domain knowledge. In the “relation” column, “TRUE” means that a particular duplet follows one of the six types of relations, “IGNORED” means that it does not, and “FALSE” means that the relation of the duplet is wrong. ‘E’, ‘Ac’, ‘Att’, and ‘V’ in the “duplets” column of Figure 8 stand for “Entity”, “Action”, “Attribute”, and “Value”, respectively. The combination of these duplets for a particular iText forms a network that is the building block of HESN. Each of these networks is tracked with the help of tags. Finally, the information for each entity, action, attribute, and value is updated, which updates HESN and the knowledge base as a whole.
Figure 8. Relations extracted from different types of sentences or iTexts.
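The counts reported above can be tallied as in the short sketch below. The text does not state which ratio it treats as the accuracy, nor does it give the number of “FALSE” relations explicitly, so both ratios and the remainder are shown only as derived quantities under the assumption that every relation is exactly one of TRUE, IGNORED, or FALSE.

```python
total_relations = 102   # relations extracted from the 25 iTexts
ignored = 16            # outside the six relation types
correct = 79            # marked TRUE

# Relations that fall into one of the six types.
considered = total_relations - ignored
# Remainder, assuming each relation is TRUE, IGNORED or FALSE.
wrong = considered - correct

# Two possible readings of "accuracy"; the paper does not say which applies.
share_of_considered = correct / considered
share_of_all = correct / total_relations

print(f"considered={considered}, wrong={wrong}")
print(f"correct / considered = {share_of_considered:.3f}")
print(f"correct / all        = {share_of_all:.3f}")
```

Under that assumption, 86 relations fall within the six types and 7 of them are wrong.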

5. Conclusions and Future Work

Knowledge extraction from iTexts differs from knowledge extraction from regular text or paragraphs. For iTexts, it is imperative to structure the information about, and the relationships of, a term or key phrase with other terms based on different operations. This research work proposes a knowledge base that helps to capture the knowledge or human experience from iTexts and dynamically update the knowledge structure. HESN and domain knowledge are the two parts of our proposed knowledge base. Domain knowledge is used to identify different terms in the iTexts. HESN is used to represent the knowledge from iTexts. This knowledge is the relationship of different terms and key phrases based on different operations and is represented in the form of nodes and edges. All these nodes and edges constitute a knowledge network called HESN. HESN is the combination of small networks, each consisting of relationships among different terms and each tracked so that the instruction and operation from which it was formed are known. HESN is updated each time new information is learned. The methodology is suitable for extracting knowledge from iTexts. The current research focused on iTexts found in industrial documents from the nuclear power plant domain. Limited test data were used to test the approach due to the confidentiality of the information. Future work includes working with more data and more extensive domain knowledge, an improved structure of HESN for better representation of relations, and an information retrieval mechanism for HESN based on natural language queries.

Author Contributions

H.A.G. and S.S.A.J.; methodology, H.A.G. and S.S.A.J.; software, S.S.A.J.; validation, H.A.G. and H.A.H.; formal analysis, H.A.G. and S.S.A.J.; investigation, H.A.G.; resources, H.A.H.; writing—original draft preparation, S.S.A.J.; writing—review and editing, H.A.G.; visualization, H.A.G. and S.S.A.J.; supervision, H.A.G. and J.R.; project administration, H.A.G.; funding acquisition, H.A.G. and J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research is funded by NSERC and Ontario Power Generation (OPG).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhao, Y.; Smidts, C. A method for systematically developing the knowledge base of reactor operators in nuclear power plants to support cognitive modeling of operator performance. Reliab. Eng. Syst. Saf. 2019, 186, 64–77.
  2. Rodríguez-García, M.Á.; García-Sánchez, F.; Valencia-García, R. Knowledge-Based System for Crop Pests and Diseases Recognition. Electronics 2021, 10, 905.
  3. Skobelev, P.; Simonova, E.; Smirnov, S.; Budaev, D.; Voshchuk, G.; Morokov, A. Development of a Knowledge Base in the “Smart Farming” System for Agricultural Enterprise Management. Procedia Comput. Sci. 2019, 150, 154–161.
  4. Ritou, M.; Belkadi, F.; Yahouni, Z.; Cunha, C.D.; Laroche, F.; Furet, B. Knowledge-based multi-level aggregation for decision aid in the machining industry. CIRP Ann. 2019, 68, 475–478.
  5. Zhong, W.; Li, C.; Peng, X.; Wan, F.; An, X.; Tian, Z. A Knowledge Base System for Operation Optimization: Design and Implementation Practice for the Polyethylene Process. Engineering 2019, 5, 1041–1048.
  6. Li, T.; Chen, Z. An ontology-based learning approach for automatically classifying security requirements. J. Syst. Softw. 2020, 165, 110566.
  7. Wu, C.; Wu, P.; Wang, J.; Jiang, R.; Chen, M.; Wang, X. Ontological knowledge base for concrete bridge rehabilitation project management. Autom. Constr. 2021, 121, 103428.
  8. Sanfilippo, E.M.; Belkadi, F.; Bernard, A. Ontology-based knowledge representation for additive manufacturing. Comput. Ind. 2019, 109, 182–194.
  9. Wątróbski, J. Ontology learning methods from text - An extensive knowledge-based approach. Procedia Comput. Sci. 2020, 176, 3356–3368.
  10. Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. Int. J. Linguist. Lang. Resour. 2007, 30, 3–26.
  11. Bach, N.; Badaskar, S. A Review of Relation Extraction. Lit. Rev. Lang. Stat. II 2007, 2, 1–15.
  12. Martinez-Rodriguez, J.L.; Lopez-Arevalo, I.; Rios-Alvarado, A.B. OpenIE-based approach for Knowledge Graph construction from text. Expert Syst. Appl. 2018, 113, 339–355.
  13. Kim, T.; Yun, Y.; Kim, N. Deep Learning-Based Knowledge Graph Generation for COVID-19. Sustainability 2021, 13, 2276.
  14. Bekoulis, G.; Deleu, J.; Demeester, T.; Develder, C. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Syst. Appl. 2018, 114, 34–45.
  15. Wang, Z.; Xu, S.; Zhu, L. Semantic relation extraction aware of N-gram features from unstructured biomedical text. J. Biomed. Inform. 2018, 86, 59–70.
  16. Nie, B.; Sun, S. Knowledge graph embedding via reasoning over entities, relations, and text. Future Gener. Comput. Syst. 2019, 91, 426–433.
  17. Xu, B.; Zhuge, H. The influence of semantic link network on the ability of question-answering system. Future Gener. Comput. Syst. 2020, 108, 1–14.
  18. Guo, A.; Tan, Z.; Zhao, X. Measuring Triplet Trustworthiness in Knowledge Graphs via Expanded Relation Detection. In Knowledge Science, Engineering and Management; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 65–76.
  19. Xiao, S.; Song, M. A Text-Generated Method to Joint Extraction of Entities and Relations. Appl. Sci. 2019, 9, 3795.
  20. Spacy.io. Industrial-Strength Natural Language Processing in Python. Available online: https://spacy.io/ (accessed on 18 April 2021).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.