Knowledge Extraction and Quality Inspection of Chinese Petrographic Description Texts with Complex Entities and Relations Using Machine Reading and Knowledge Graph: A Preliminary Research Study

Abstract: (1) Background: Geological surveying in China is undergoing a digital transformation towards the adoption of intelligent methods. Cognitive intelligence methods, such as those based on knowledge graphs and machine reading, have made progress in many domains and also provide a technical basis for quality inspection of unstructured petrographic description texts. (2) Methods: First, the named entities and relations of the domain-specific knowledge graph of petrography were defined based on petrographic theory. Second, research was carried out based on a manually annotated corpus of petrographic descriptions. The extraction of N-ary and single-entity overlapping relations and the separation of complex entities are key steps in this process. Third, a petrographic knowledge graph was formulated based on prior knowledge. Finally, the consistency between the knowledge triples extracted from the corpus and the petrographic knowledge graph was calculated. The 1:50,000 sheet of Fengxiangyi located in the Dabie orogenic belt was selected for the empirical research. (3) Results: Using machine reading and the knowledge graph, petrographic knowledge can be extracted, and the knowledge consistency calculation can quickly detect description errors concerning textures, structures and mineral components in petrographic descriptions. (4) Conclusions: The proposed framework can be used to realise the intelligent inspection of petrographic knowledge with complex entities and relations and to improve the quality of petrographic description texts effectively.


Introduction
Due to the growing availability of massive earth observation data, the research on and application of artificial intelligence technologies, such as knowledge graphs (KGs), machine learning and deep learning, are receiving increasing attention [1][2][3] in the solid earth [4], remote sensing [5], geological image recognition [6][7][8] and metallogenic process [9] domains. In response to the rapidly increasing and varying types of field data, scholars have proposed that studies be guided by big data thinking and by techniques for deep information mining, which uncover hidden patterns, unknown correlations or other useful information that can be leveraged for decision making [10,11]. Meanwhile, geological data tend to be uncertain, sparse, multiresolution and multiscale and need knowledge-rich intelligent systems for processing [12][13][14][15][16]. Among these approaches, machine reading based on natural language processing [17][18][19][20] and domain-specific KGs of geosciences [21][22][23][24] have also attracted increasing attention from geologists.
PaleoDeepDive, a digital library and machine reading system, is an early application of Data Mining and Knowledge Base in the geosciences [25]. To measure the relative frequency of the occurrence of stromatolites in Macrostrat (https://macrostrat.org, accessed on 18 March 2022) for the North America-Caribbean region, a similar approach named GeoDeepDive was used to extract mentions of the term "stromatolite(s)" or "stromatolitic" within the published documents. A total of 10,683 papers were retrieved and 612 unique stratigraphic names were found linked to stromatolites [26]. For the Chinese geological literature, which constitutes unstructured data, researchers carried out keyword extraction based on Chinese word segmentation and word frequency statistics and showed the intrinsic information of the literature using a KG [27,28].
The Google KG, released in 2012, is the basis of Google's semantic search and intelligent question answering service. In general, KGs can be divided into generic KGs, such as the Google KG, and domain-specific KGs. At present, domain-specific KGs have attracted research attention and have been deployed in commercial applications such as intelligent question answering, intelligent decision making and intelligent detection services in health, education, geology and other fields [21]. In KGs, knowledge is expressed as factual triples in the form of (subject, predicate, object), where each entity is represented as a node and each edge represents the relation between two nodes [29]. As large-scale probabilistic knowledge triples were extracted by information extraction tools, probabilistic databases were also proposed to associate probabilities with triples [30][31][32]. With the applications of KGs in different domains, geologists have also begun to study how to extract knowledge from unstructured literature sources. Knowledge extraction in the geosciences has focused on the recognition of geological entities or keywords, such as geological time. For instance, Liu et al. [33] divided the information into two types, general time entity and geological time entity, depending on the description characteristics in geological and mineral texts, and realised the structural extraction of geological time entities using a BiLSTM-CRF model. Named entity recognition (NER) using deep learning has also been applied in the extraction of information to construct a domain-specific KG of geological hazards [34]. In Western Australia, KGs were generated from the mineral exploration reports for iron ore deposits in the Chichester Range Project and gold deposits in the Coolgardie Gold Project [17]. The automated KG formulation framework showed the prospect of machine reading in knowledge extraction from unstructured geological texts.
During knowledge extraction in the geosciences, entity recognition is an important task, and relation extraction is equally crucial [35,36]. The traditional relation extraction task is to predict whether a relation exists between two entities in a single sentence and to classify that relation; this task is also called binary relation extraction. However, practical applications also involve complex relation extraction and entity recognition tasks. Figure 1 shows some types of relations encountered in actual scenarios, including a binary relation, N-ary relation, overlapping relation (subdivided into single-entity overlapping relation and entity pair overlapping relation) and cross relation [37]. In general, professionally trained geologists follow certain rules when forming complex entities in petrographic descriptions. For instance, dual-structure and dual-colour entities often appear in structure and colour descriptions, whereas metamorphic rocks with an equigranular blastic texture are often described using multistructure entities in Chinese petrographic descriptions.
Previous research on KG formulation in geosciences mainly focused on simple NER and relation extraction. The extraction of complex knowledge characterised by complex named entities or complex relations has not been studied. The applications of KGs in the geosciences have thus far prioritised basic queries and visualisation [36]. Smart applications, such as the automatic quality inspection of petrographic descriptions, have not been developed. A module of the intelligent mineral geological survey cloud platform, which was named as the "information release and knowledge question", was only just designed. Research on prospective prediction based on KGs was proposed, but has not yet been carried out [38].
At present, a digital geological survey has been published for China, and a cognitive geological survey is also under development. In this paper, the massive Chinese rock descriptions obtained through field observations are taken as the research object to carry out the geological record quality inspection using artificial intelligence. An automatic knowledge extraction and quality inspection framework based on KGs and machine reading is studied. The framework proposed in this paper will eventually provide a quality inspection service on rock description texts in the form of a web service interface. The rest of this paper is organised as follows: the framework for knowledge extraction and quality inspection is introduced in Section 2. The framework components include the definition of the named entities and relations of the petrographic descriptions based on prior knowledge, sequence labelling of rock named entities based on word embedding, N-ary relation extraction of petrographic descriptions based on an enriched pre-trained Chinese language model and complex entity separation based on prior rules. In Section 3, a case study based on the 1:50,000 sheet of Fengxiangyi located in the Dabie orogenic belt is presented. Error propagation in the pipeline mode, integration of variant data and specifications, knowledge recommendation and knowledge reasoning are discussed in Section 4. The paper is concluded in Section 5.

Knowledge Extraction and Quality Inspection Framework
The proposed automatic knowledge extraction and quality inspection framework for rock descriptions involves several processes, including rock named entity and relation definitions, NER, relation extraction and knowledge consistency calculation. Figure 2 shows the process of the proposed framework. First, the types of the named entities and relations of rock descriptions are defined according to the prior petrographic knowledge, and the semi-automatic formulation of the petrographic KG is completed. Second, according to the defined entity and relation types, manual annotation of petrographic descriptions is carried out to formulate the labelled corpus. The corpus is divided into a training dataset, validation dataset and testing dataset, according to the general practice of supervised learning methods. In this paper, a pipeline mode is adopted for petrographic knowledge extraction, which consists of two closely linked components, namely, NER and relation extraction. Training and fine-tuning of rock NER and relation extraction models are carried out using the labelled corpus. After inputting a rock description, entities and relations are extracted using the trained models and entity separation is carried out in cases where the entities extracted from the description are complex. Then, the knowledge triples are created from the extracted entities and relations. Finally, using the formulated KG and the extracted knowledge triples, a consistency calculation is carried out on the petrographic knowledge obtained. Geologists verify the validity of the extracted knowledge and consistency calculation through random sampling. Some sampled descriptions which are not extracted correctly are used as an incremental annotation corpus.

Predefinition of Named Entities and Relations Based on Prior Petrographic Knowledge
Scholars have different understandings of the predefinition of geological entities and relations. Wang et al. [39] opined that entity relation extraction in the geological field needs to conform to the diversity of entity and relation types in the domain's corpus. This problem makes accurate predefinition of geological entities and relations difficult. Their proposed solution was to directly extract entities and the relations from the geological texts without predefinition.
Geological texts usually contain basic concepts, spatial distribution, attribute information and relations [40]. Chu et al. [41] clearly defined geological named entities with four categories: entity objects (GEO), geological age (TIME), geological processes (PROCESS) and other geological indicators (OTHERS). Xie et al. [42] further subdivided geological named entities into six categories, namely, geological age, geological structure, strata, rock, mineral and location.
Petrographic description texts differ from the geological corpora discussed above. In petrographic studies, the contents of rock observations and descriptions generally include colour, texture, structure and mineral composition. Rock types are classified based on these descriptions and on specific classification principles. Therefore, the named entity types in rock description texts can be predefined as rock, colour, texture, structure and mineral.
In rock descriptions, colour is the most striking feature; it is also an important identification characteristic and genesis marker. When observing rocks, fresh and weathered colour should be distinguished. For crystalline rocks, petrographic descriptions need to distinguish the major, minor and accessory minerals. For rocks with a porphyritic or porphyroblastic structure, the description should also compare the phenocrystic and groundmass minerals. Interstitial materials or cements are also important descriptors for rocks with clastic or granular structures. In summary, the relation types in rock descriptions can be predefined as follows: fresh colour (FRESH_COLOR), weathered colour (WEATH_COLOR), preserved texture (PRESERVE_T), preserved structure (PRESERVE_S), major mineral (MA_MINERAL), minor mineral (MI_MINERAL), accessory mineral (ACC_MINERAL), phenocrystic mineral (PHE_MINERAL) and groundmass mineral (GRO_MINERAL). There are various relation types among named entities and most relations point to the same rock entity. Hence, the relations in a rock description can be considered as N-ary relations or single-entity overlapping relations. There is also the subordinate relation type (CATEGORY_OF) between rock entities, which may also exist between mineral entities. To simplify the named entities and relations of rocks, organic matter, fossils, Quaternary sediments and related relations were not considered in this study. Figure 3 shows the meta-graph for the named entities and relations of the domain-specific KG of petrography.
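The predefined entity and relation types listed above can be written down as a small schema. The following is a minimal sketch in Python, not the representation used in this study: the direction of each relation (rock as the head entity) and the helper name `is_valid_triple` are assumptions made for illustration, since the meta-graph in Figure 3 is not reproduced here.

```python
from enum import Enum


class EntityType(Enum):
    ROCK = "Rock"
    COLOUR = "Colour"
    TEXTURE = "Texture"
    STRUCTURE = "Structure"
    MINERAL = "Mineral"


R, C, T, S, M = (EntityType.ROCK, EntityType.COLOUR, EntityType.TEXTURE,
                 EntityType.STRUCTURE, EntityType.MINERAL)

# Relation name -> allowed (head type, tail type) pairs.
# Assumption: the rock entity is always the head of a descriptive relation.
RELATION_SCHEMA = {
    "FRESH_COLOR": {(R, C)},
    "WEATH_COLOR": {(R, C)},
    "PRESERVE_T": {(R, T)},
    "PRESERVE_S": {(R, S)},
    "MA_MINERAL": {(R, M)},
    "MI_MINERAL": {(R, M)},
    "ACC_MINERAL": {(R, M)},
    "PHE_MINERAL": {(R, M)},
    "GRO_MINERAL": {(R, M)},
    # Subordinate relation between rocks, possibly also between minerals.
    "CATEGORY_OF": {(R, R), (M, M)},
}


def is_valid_triple(head_type, relation, tail_type):
    """Return True if the relation may connect the two entity types."""
    return (head_type, tail_type) in RELATION_SCHEMA.get(relation, set())
```

A schema of this kind can be used to reject candidate entity pairs whose types cannot be related before they reach the relation classifier.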

Petrographic Named Entity Recognition Based on Sequence Labelling Model
Existing NER approaches are based either on rules and dictionaries or on deep learning. An unsupervised geological knowledge extraction method based on the geological domain vocabulary and association rules was proposed for unstructured Chinese documents [27]. In recent years, NER based on deep learning has become the mainstream method [2]. Deep learning methods transform geological NER into a sequence labelling task. Models such as DBN [40], BiLSTM-CRF [33] and BiGRU-CRF [34] were used in the corresponding experiments. The GRU is a variant of the LSTM, and its advantages are fewer parameters and faster training. However, LSTM models have greater expressive capacity when large amounts of data are available [43]. The optimal choice between an LSTM and a GRU model depends on the specific task at hand. With the widespread use of large-scale pre-trained Chinese models, some approaches, such as ELMO-CNN-BiLSTM-CRF [41] and BERT-BiLSTM-CRF [17,44], are beginning to be adopted to identify named entities in the geoscience domain. In addition, the emergence of the ELECTRA and XLNet models [45] offers more choices for downstream Chinese natural language processing tasks.
In this paper, the sequence labelling method is also adopted. Based on the labelled corpus of petrographic descriptions, comparative experiments between bidirectional RNN models (BiLSTM-CRF and BiGRU-CRF) and pre-trained Chinese models (BERT, ELECTRA, XLNet) were carried out to determine which model is suitable for NER of petrographic descriptions. The comparison process for the rock named entity sequence labelling models is shown in Figure 4. The petrographic description texts are first labelled and saved as ANN format files and then tokenised at the Chinese character level. ANN is the file format of BRAT (Brat Rapid Annotation Tool) [46]. The character-level representation of the input sequence is produced by an embedding layer, and feature extraction is realised through an encoding layer. The token classification layer is finally used to determine the probability of each entity type. Models based on RNNs adopt a randomly initialised embedding layer and a CRF classification layer, whereas the models based on pre-trained language models only require fine-tuning of the dense layer.
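The conversion from BRAT annotations to character-level training labels can be sketched as follows. This is an illustrative sketch rather than the preprocessing code of this study: it assumes a BIO tagging scheme (the scheme is not stated here), it reads only the text-bound "T" lines of the BRAT standoff format, and the sample sentence and the function name `ann_to_bio` are hypothetical.

```python
def ann_to_bio(text, ann_lines):
    """Convert BRAT text-bound annotations into per-character BIO tags."""
    tags = ["O"] * len(text)
    for line in ann_lines:
        # Text-bound annotations look like: "T1<TAB>Rock 0 3<TAB>surface text"
        if not line.startswith("T"):
            continue
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) < 2 or ";" in fields[1]:
            continue  # skip malformed lines and discontinuous spans
        label, start, end = fields[1].split(" ")
        start, end = int(start), int(end)
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return list(zip(text, tags))


# Hypothetical example: "eclogite, massive structure"
sample_text = "榴辉岩，块状构造"
sample_ann = [
    "T1\tRock 0 3\t榴辉岩",
    "T2\tStructure 4 8\t块状构造",
]
labelled = ann_to_bio(sample_text, sample_ann)
```

Each character of the description receives exactly one tag, which matches the character-level tokenisation described above.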

Petrographic Relation Extraction Based on Enriching R-Transformer Model
As mentioned above, relation extraction comprises binary relation extraction, N-ary relation extraction and entity overlapping relation extraction. Binary relation extraction was proposed earlier as a means to identify the relation between two entities in a single input sentence [47]. N-ary relation extraction pertains to the recognition of relations among n entities through multiple sentences [48]. As shown in Figure 1b, the relations among the three entities also need to be classified. The possible relation categories between entities are predefined. In addition, "NA" is included in the predefined relation set to indicate that there is no association between entities. Overlapping relation extraction means that different relation triples in one or more sentences may have various degrees of overlap [37]. In general, there are two forms of overlap: single-entity overlap (SEO) and entity pair overlap (EPO). SEO refers to cases where triples share an overlapping entity, but they do not share overlapping entity pairs (Figure 1c). In contrast, EPO refers to the triples sharing overlapping entity pairs (Figure 1d). The extraction of cross relations (Figure 1e) is a challenging problem in geoscience, though some advanced network models were proposed for biomedical cross-sentence relation extraction [49].
Existing methods used to extract relations in the geosciences are mostly based on templates, i.e., a template library is used to match the context of two given entities in the input text. If the context fragment is successfully matched with a template in the library, the corresponding relation in the template is regarded as the relation between the two entities. Template-based methods contain two specific template implementations, one based on trigger words and one based on syntactic structure. Trigger words usually include verbs and prepositions. A word-level relation extraction approach using such trigger words was proposed to identify relations in mineral exploration reports [17]. Methods based on syntactic structure usually take verbs as the starting point to formulate rules that place entities on nodes and the dependency relations on edges. For instance, an open Chinese syntactic structure extraction model was established in the geological field, in which relations were extracted based on the syntactic structure [39]. The model uses the open Chinese language technology platform developed by the Harbin Institute of Technology to analyse the dependency syntax and obtain the syntactic structure. Based on a small number of annotated geological corpora, the syntactic structure-based patterns are automatically learned to obtain the high-frequency relation extraction templates. Finally, the learned templates are used to match the structure of the dependency relation and then identify entities and relations. However, the relation extraction templates in the model only cover the high-frequency syntactic structure, as it is difficult to achieve comprehensive templates. It can be seen that methods based on a template in the geosciences have realised unsupervised relation extraction. 
However, the overlapping relations, which appear often in geological knowledge descriptions, cannot be determined using syntactic-based relation extraction models and word-level relation extraction methods.
To achieve the overlapping relation extraction from the petrographic descriptions, an approach based on an enriching R-Transformer model is proposed in this paper. The method transforms the relation extraction task into a relation classification task. For single relations between rocks and mineral entities or between rocks and structure entities, relation classification mainly involves determining whether there is a relation between the two entities and the problem is considered as a binary classification problem. If there are multiple possible relations between rocks and colour entities or between rocks and mineral entities in a single rock description sentence, the relation classification is called a multiclassification problem. In this paper, the absence of a relation is considered a special relation type (marked as NA). Sequence semantic feature extraction in the R-Transformer model is based on pre-trained language models such as BERT, XLNet and ELECTRA. The framework of the proposed R-Transformer relation classification model is shown in Figure 5. First, the position of the entity pair in the sequence is marked in the input of the model; thus, the extracted vector representation of the sentence contains the position information of the entity pair. Second, the model extracts semantic information from the sentence vector and the two-entity vectors. Each entity vector is aggregated via an averaging method, and dimension reduction is realised using a fully connected dense layer with the Tanh activation function. Third, the two-entity vectors with reduced dimensions are concatenated with the sentence vector, and the annotation classification prediction of each sequence character is realised through the fully connected dense layer, which adopts the softmax multiclassification activation function. 
Considering the scale of the corpus and the total number of the entity pairs, a dropout layer is added after the combination layer to deal with the possible over-fitting problem and improve the model performance. The relation extraction method used in this paper will be presented in detail in another article.
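The classification head described above (entity vectors obtained by averaging, a Tanh dense layer for dimension reduction, concatenation with the sentence vector, and a softmax output) can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not the model of this study: the weights are random, the dimensions are tiny, both entities are assumed to share one reduction layer, the sentence vector is a stand-in for the output of a pre-trained encoder, and dropout is omitted because it is inactive at inference time.

```python
import math
import random

RELATION_LABELS = ["NA", "MA_MINERAL", "MI_MINERAL"]  # toy subset of labels


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entity_vector(token_vecs, span, params):
    """Average the token vectors of an entity span, then reduce with Tanh."""
    pooled = mean_pool(token_vecs[span[0]:span[1]])
    return [math.tanh(v) for v in dense(pooled, params["W_ent"], params["b_ent"])]


def classify_relation(sentence_vec, token_vecs, span1, span2, params):
    """Concatenate sentence and entity vectors and score each relation."""
    features = (list(sentence_vec)
                + entity_vector(token_vecs, span1, params)
                + entity_vector(token_vecs, span2, params))
    return softmax(dense(features, params["W_out"], params["b_out"]))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
HIDDEN, REDUCED = 4, 2  # hypothetical sizes; real encoders are far larger
params = {
    "W_ent": random_matrix(REDUCED, HIDDEN, rng),
    "b_ent": [0.0] * REDUCED,
    "W_out": random_matrix(len(RELATION_LABELS), HIDDEN + 2 * REDUCED, rng),
    "b_out": [0.0] * len(RELATION_LABELS),
}
token_vecs = random_matrix(6, HIDDEN, rng)  # stand-in for encoder outputs
sentence_vec = mean_pool(token_vecs)        # stand-in for the sentence vector
probs = classify_relation(sentence_vec, token_vecs, (0, 3), (4, 6), params)
predicted = RELATION_LABELS[probs.index(max(probs))]
```

Because "NA" is one of the labels, an entity pair with no relation is handled by the same softmax as every other class.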

Rule-Based Complex Entity Separation
In general, geological investigators write Chinese petrographic descriptions according to certain rules. For example, dual terms often appear in structure and colour descriptions, such as "massive-gneissic structure" (块状-片麻状构造) and "grey-light flesh red" (灰红-肉红色). Here and below, the Chinese term is given in parentheses after its English translation. The rule of "grain size + minor mineral morphology + major mineral morphology" is often used to describe rocks with a granoblastic structure in Chinese geological texts. Thus, the extraction of such entities with complex descriptions is an important problem to be solved in this process.
Sequence labelling models based on deep learning require manual entity annotation to realise semantic information extraction of the labelled entities. However, dual-construct entities are usually labelled as single entities, thus models trained on corpora annotated in this manner usually recognise the dual-construct entity as a single entity. To realise the extraction of dual-structure entities, it is necessary to separate entities based on rules. In this paper, dual-construct entities were split and reformed according to the concatenation character using the complex entity separation algorithm shown in Algorithm 1. For example, after splitting and reformation, "massive-gneissic structure" (块状-片麻状构造) was extracted as two entities, namely, "massive structure" (块状构造) and "gneissic structure" (片麻状构造).

Algorithm 1. Complex entity separation algorithm.
Input: a complex entity
Output: entities separated
1: input complex entity containing the entity type
2: if entity type is Texture
3:     if entity is blastic texture and len(entity) > 7
4:         execute extraction of grain size, minor and major mineral textures
5: else if entity type is Structure
6:     if concatenation characters are present in entity
7:         execute entity separation based on the concatenation character
8: return entities
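Algorithm 1 can be sketched in Python as follows. This is one possible reading of the pseudocode rather than the implementation used in this study: the small lexicons of grain sizes and mineral morphologies, the set of concatenation characters and the helper names are all hypothetical, and the way each component is recombined with the shared suffix follows the "massive-gneissic structure" example given above.

```python
CONCAT_CHARS = "-~～"                 # assumed concatenation characters
STRUCTURE_SUFFIX = "构造"             # "structure"
BLASTIC_SUFFIX = "变晶结构"           # "blastic texture"
# Hypothetical lexicons; a real system would need a curated term list.
GRAIN_SIZES = ("细粒", "中粒", "粗粒")
MORPHOLOGIES = ("鳞片", "花岗", "粒状", "柱状", "纤状")


def _split_structure(entity, concat_char):
    """Split on the concatenation character and restore the shared suffix."""
    suffix = STRUCTURE_SUFFIX if entity.endswith(STRUCTURE_SUFFIX) else ""
    parts = [p for p in entity.split(concat_char) if p]
    return [p if p.endswith(suffix) else p + suffix for p in parts]


def _split_blastic_texture(entity):
    """Read grain size and morphology terms in order from the entity stem."""
    stem = entity[:-len(BLASTIC_SUFFIX)]
    components, i = [], 0
    while i < len(stem):
        for term in GRAIN_SIZES + MORPHOLOGIES:
            if stem.startswith(term, i):
                components.append(term + BLASTIC_SUFFIX)
                i += len(term)
                break
        else:
            return [entity]  # unknown term: leave the entity unchanged
    return components


def separate_complex_entity(entity, entity_type):
    """Return the simple entities contained in a possibly complex entity."""
    if entity_type == "Texture":
        if entity.endswith(BLASTIC_SUFFIX) and len(entity) > 7:
            return _split_blastic_texture(entity)
    elif entity_type == "Structure":
        for concat_char in CONCAT_CHARS:
            if concat_char in entity:
                return _split_structure(entity, concat_char)
    return [entity]


structures = separate_complex_entity("块状-片麻状构造", "Structure")
textures = separate_complex_entity("鳞片花岗变晶结构", "Texture")
```

Entities that match neither rule are returned unchanged, so the function can be applied to every extracted entity without a separate check for complexity.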

Experimental Results
In this paper, the 1:50,000 sheet of Fengxiangyi located in the Dabie-Sulu orogenic belt in central and eastern China (Figure 6a) was selected for the empirical research. From 2014 to 2016, the Institute of Geological Survey of Anhui province carried out a digital geological and mineral survey in this area [50], thus creating a large number of electronic rock description texts. Middle-deep metamorphic strata and Neoproterozoic intermediate-acid metamorphic intrusive rocks, which are part of the core of the Dabie-Sulu orogenic belt [51], are widely distributed in this area. Figure 6b shows the major distribution of the metamorphic plutonic rocks and metamorphic supracrustal rocks in the studied sheet, including paragneiss, granitic gneiss, monzogranitic gneiss, granodioritic gneiss, eclogite, amphibolite, marble, quartz-muscovite schist and quartzite. Quaternary sediments are not studied in this paper. The rock descriptions are typical N-ary and single-entity overlapping relation texts. For example, the description text of the quartz-muscovite schist covers the single-entity overlapping relation between rock and structure (or texture). It also contains some N-ary relations between rock and mineral, covering the major, minor and accessory minerals. The structure and texture in the metamorphic rock description texts are typical complex entities. For example, the structural description of monzogranitic gneiss is a "massive-gneissic structure" (块状-片麻状构造), which is a dual-structure entity. Its texture is a typical granoblastic texture, which is usually described using multistructure description modes, such as the "lepidoblastic granoblastic texture" (鳞片花岗变晶结构).

Construction of the Prior Petrographic KG
Once a medium-scale regional geological survey has taken place in an area, e.g., at a scale between 1:200,000 and 1:250,000, the rock types in the region are generally known. From petrographic knowledge, previous survey reports and expertise, the textures, structures and material composition of the different rock types are also known. Therefore, in this paper, a KG is constructed based on prior knowledge for the inspection of rock description texts. Due to the high credibility of prior knowledge, a probabilistic database approach is not adopted in this paper.
The rock types and characteristics in the experimental sheet were comprehensively summarised in the survey report, which formed the prior knowledge for the formulation of the domain-specific petrographic KG. Taking the Neoproterozoic intermediate-acid metamorphic plutonic rocks as an example, the rock types mainly contain monzogranitic gneiss, granitic gneiss and granodioritic gneiss. These plutons are ancient intrusions disintegrated from the original Susong Group, and have undergone multistage metamorphism and deformation [50]. Table 1 summarises the prior knowledge on the metamorphic deformation intrusions in the Fengxiangyi sheet, including rock type, texture, structure and mineral composition. The characteristics of rock composition are described by means of major, minor and accessory minerals.
Note (Table 1): monzogranitic gneiss (二长花岗质片麻岩): "二长花岗质片麻岩" is a term in Chinese. "Monzogranitic gneiss" is the corresponding translation in English.

Knowledge Extraction and Quality Inspection
The quality inspection task requires that the computer system can accurately extract named entities and relations from the input texts. In the proposed quality inspection framework, the sequence labelling model and the enriching R-Transformer relation classification model recognise the named entities and extract the relations from the input rock description texts in a pipeline mode. The extracted entities and relations are eventually composed into knowledge triples. The rock types in the selected sheet are mainly metamorphic rocks, which are mostly classified on the basis of texture (grain size, shape, orientation), structure and mineral composition. Using the extracted knowledge triples and the petrographic KG constructed from prior knowledge, a quality inspection of the rock description texts can be realised by calculating the consistency between the two. Figure 7 shows the calculation process applied for knowledge alignment. Step 1 is the consistency calculation of the extracted texture, structure and mineral composition information. Based on the extracted rock type, Cypher, a graph query language, is used to match the extracted textures, structures, major minerals and minor minerals one by one with the corresponding information for this rock type in the KG. The matching results of step 1 are then evaluated: if any extracted knowledge triples are mismatched, step 2 is executed; if all triples match, the algorithm proceeds to step 3. In step 2, the unmatched triples may represent either a description error or new knowledge, and the program returns the mismatch information. At the same time, the program automatically saves the rock description text to the corpus for manual verification.
Step 3 involves matching the rock entity extracted from the rock description text with the rock entities in the petrographic KG that conform to the characteristics of the extracted texture, structure, major minerals and minor minerals. The output of this step is the number of rock entities that match the description. The process is terminated if only one match is found; if two or more entities are returned in step 3, then step 4 is executed. In step 4, the program indicates that several rock entities share the same descriptive characteristics, and the knowledge that distinguishes these rock entities is returned. This step plays the role of knowledge recommendation while conducting the consistency calculation. As an example, Table 2 presents the rock description text for a granitic gneiss outcrop, together with the extracted rock entity name, structure entities, texture entities and mineral entities, along with the relations of the major and minor minerals. Figure 8a is the subgraph for granitic gneiss. The consistency calculation for the extracted triples went through steps 1 and 3. In the petrographic KG, only granitic gneiss has the same characteristics as those extracted from the rock description text. Table 3 is the rock description text for an eclogite outcrop, and Figure 8b is the subgraph of the petrographic KG for eclogite. The consistency calculation after knowledge extraction likewise executes steps 1 and 3 and then terminates.
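The four-step consistency calculation can be sketched with an in-memory dictionary standing in for the graph database. This is an illustrative simplification, not the implementation of this study, which queries the KG with Cypher: the function name `inspect_description`, the status strings and the review queue are hypothetical, and the toy KG holds only a few illustrative values rather than the contents of Table 1.

```python
# Relations compared against the KG in step 1.
CHECKED_RELATIONS = ("PRESERVE_T", "PRESERVE_S", "MA_MINERAL", "MI_MINERAL")

# Toy prior KG: rock -> relation -> accepted values (illustrative only).
TOY_KG = {
    "eclogite": {
        "PRESERVE_T": {"fine-grained blastic texture"},
        "PRESERVE_S": {"massive structure"},
        "MA_MINERAL": {"garnet", "pyroxene"},
    },
    "quartz-muscovite schist": {
        "MA_MINERAL": {"quartz"},
        "MI_MINERAL": {"muscovite"},
    },
}


def inspect_description(rock, triples, kg, review_queue, text=""):
    """Run steps 1 to 4 for one description; triples are (relation, value)."""
    checked = [(rel, val) for rel, val in triples if rel in CHECKED_RELATIONS]

    # Step 1: match every extracted feature against the KG entry of the rock.
    known = kg.get(rock, {})
    mismatched = [(rel, val) for rel, val in checked
                  if val not in known.get(rel, set())]

    # Step 2: report mismatches and keep the text for manual verification.
    if mismatched:
        review_queue.append(text)
        return {"status": "mismatch", "details": mismatched}

    # Step 3: find every rock in the KG that conforms to the extracted features.
    candidates = sorted(
        name for name, props in kg.items()
        if all(val in props.get(rel, set()) for rel, val in checked)
    )
    if len(candidates) == 1:
        return {"status": "consistent", "candidates": candidates}

    # Step 4: several rocks share these features; return them as a recommendation.
    return {"status": "ambiguous", "candidates": candidates}


queue = []
ok = inspect_description(
    "eclogite",
    [("PRESERVE_S", "massive structure"), ("MA_MINERAL", "garnet")],
    TOY_KG, queue,
)
bad = inspect_description(
    "quartz-muscovite schist",
    [("MA_MINERAL", "muscovite")],
    TOY_KG, queue, text="(description text)",
)
```

In the second call the description records muscovite as a major mineral, so the check stops at step 2 and the text is queued for review, mirroring the quartz-muscovite schist case discussed in this section.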

Description text (Table 3): The north of the point is eclogite. Grey-green, fine-grained blastic texture, massive structure, mainly composed of garnet 30%, pyroxene 70%. Mineral particles are small, mostly around 0.5 mm. Weathered garnet is mahogany and has a rounded grain.
For the quartz-muscovite schist, the knowledge extraction and consistency calculation process showed that muscovite is described as the major mineral in most quartz-muscovite schist description texts. However, muscovite is the minor mineral in the standard description of quartz-muscovite schist. The process executed steps 1 and 2 in turn and terminated. The related rock description text was automatically stored in the corpus to await manual validation by users of the proposed framework. A review by geologists revealed that the mismatch was caused by imprecise description by the investigators.

Error Transmission in Pipeline Mode
In this paper, comparative experiments were carried out on the sequence labelling models used for the rock NER and the R-Transformer relation classification models used for the relation extraction. Table 4 shows the results of the comparative experiments. Based on the F1 scores, the BERT-based sequence labelling model and the BERT-based relation classification model achieved the best performance in named entity recognition and relation extraction, respectively. In particular, the F1 value of the BERT-based sequence labelling model reached 98.04%. This high accuracy effectively reduces the errors of the NER stage and mitigates error transmission in the pipeline mode. Meanwhile, there was relatively little difference between the performance of the various models; a possible reason is the corpus size. Further experimental studies comparing the different models on corpora of different scales will be carried out in the future.
Note (Table 4): "-": non-execution.
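For reference, the F1 score combines precision P and recall R as F1 = 2PR / (P + R). The sketch below computes the three values for a set of predicted entities against a gold standard. It is a generic illustration, assuming exact-match scoring over (start, end, type) tuples; the evaluation protocol actually used for Table 4 is not specified here, and the example data are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two sets of items."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical entities as (start, end, type) tuples.
gold_entities = {(0, 3, "Rock"), (4, 8, "Structure"),
                 (9, 12, "Mineral"), (13, 16, "Mineral")}
predicted_entities = {(0, 3, "Rock"), (4, 8, "Structure"),
                      (9, 12, "Mineral"), (13, 15, "Mineral")}
p, r, f1 = precision_recall_f1(gold_entities, predicted_entities)
```

In a pipeline, a relation can only be extracted correctly if both of its entities were recognised correctly, which is why a high NER score limits the errors passed on to the relation extraction stage.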

Integration of Variant Data and Specifications
In digital mapping systems, in addition to unstructured data, there are also important structured data, such as the location, landform and mapping unit of the geological observation point. At present, the objects of information extraction in the geosciences are mostly unstructured data, including texts and documents. However, experience with domain-specific KG formulation in other fields shows that structured data are also an important source of knowledge. The integration of structured and unstructured data to realise the rapid construction of a large-scale KG is an aspect that needs further research.
In geological texts, the terms "texture" and "structure" are often confused. However, in petrology specifications, such as the terminology classification and code of geology mineral resources, Part 10: Petrology (GB/T 9649.10-2009), the terms "texture" and "structure" have unambiguous definitions. In this paper, standardised terms are stored in the KG of petrography as a form of prior knowledge. If a nonstandard entity term is extracted or separated, a triple containing that term is unlikely to pass the consistency calculations.
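A minimal sketch of how standardised vocabularies could expose confused or nonstandard terms is given below. The two-entry vocabularies, the relation names `has_texture` and `has_structure`, and the three result labels are assumptions made for illustration; they are not taken from GB/T 9649.10-2009 or from the KG built in this study.

```python
# Illustrative vocabularies; a real KG would hold the full standardised lists.
STANDARD_TERMS = {
    "texture": {"fine-grained blastic texture"},
    "structure": {"mass structure"},
}

# Hypothetical relation names mapped to the vocabulary they must draw from.
RELATION_CATEGORY = {"has_texture": "texture", "has_structure": "structure"}


def check_triple(triple: tuple) -> str:
    """Classify a (rock, relation, term) triple against the standard terms."""
    _rock, relation, term = triple
    expected = RELATION_CATEGORY[relation]
    if term in STANDARD_TERMS[expected]:
        return "standard"
    # The term is standard but belongs to the other vocabulary, i.e. the
    # description confused "texture" with "structure".
    if any(term in terms for terms in STANDARD_TERMS.values()):
        return "wrong_category"
    return "nonstandard"


ok = check_triple(("eclogite", "has_structure", "mass structure"))
confused = check_triple(("eclogite", "has_texture", "mass structure"))
unknown = check_triple(("eclogite", "has_texture", "sugary texture"))
print(ok, confused, unknown)
```

Separating the "wrong_category" outcome from the "nonstandard" outcome would let the inspection report tell the investigator whether a term was misfiled or simply not part of the specification.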

Knowledge Recommendation and Knowledge Reasoning
This study takes the extraction and inspection of petrographic knowledge with complex entities and relations as the research object. As described in Section 3.2, when matching the rock description characteristics with the rock entities in the petrographic KG, there may be more than one match with the same description characteristics. In particular, some metamorphic rocks have the same fabric characteristics and mineral composition. Owing to their different geological environments (geological occurrences), the basic names of these rocks may vary greatly, so that rocks with the same description characteristics carry different names. For example, massive rocks mainly composed of muscovite and quartz are named muscovite quartzite when formed by regional metamorphism, but those formed through gas-liquid metamorphism of granitic rocks are also named muscovites. In such cases, the process needs not only to return the possible rock entity matches, but also to prompt the geological investigator to pay more attention to the field observation of the geological occurrence. Therefore, apart from the quality inspection of petrographic descriptions, rock identification knowledge recommendation is another possible application of petrographic KGs.
Another potential application of KGs is knowledge reasoning. Generally, metamorphic facies can be determined according to the minerals and mineral assemblages of metamorphic rocks. Since the proposed framework can obtain the mineral information of metamorphic rocks in the studied sheet through machine reading, the metamorphic facies of a metamorphic stratum can be inferred from the mineral information of the rocks belonging to that stratum and from decision rules stored in computable form in the KG.
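Such rule-based reasoning can be sketched as follows. The two rules are deliberately simplified textbook placeholders (omphacite with garnet for the eclogite facies, glaucophane for the blueschist facies) and are not the decision rules of this study; the rule format and function names are likewise assumptions.

```python
from collections import Counter

# Simplified placeholder rules: a facies is suggested when all of its
# required minerals are present in the extracted mineral set.
FACIES_RULES = [
    ("eclogite facies", frozenset({"garnet", "omphacite"})),
    ("blueschist facies", frozenset({"glaucophane"})),
]


def infer_facies(minerals) -> list:
    """Return every facies whose required minerals are all present."""
    present = set(minerals)
    return [facies for facies, required in FACIES_RULES
            if required <= present]


def infer_for_stratum(rocks: dict) -> Counter:
    """Count the facies suggested by each rock description in a stratum."""
    counts = Counter()
    for minerals in rocks.values():
        counts.update(infer_facies(minerals))
    return counts


# Hypothetical extracted mineral sets for three descriptions in one stratum.
stratum = {
    "point_1": {"garnet", "omphacite", "quartz"},
    "point_2": {"garnet", "omphacite"},
    "point_3": {"quartz", "muscovite"},
}
facies_counts = infer_for_stratum(stratum)
print(facies_counts.most_common(1))
```

Counting suggestions per stratum rather than trusting a single description is one possible way to tolerate individual imprecise texts; a real rule set would need to be authored and checked by petrologists.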

Conclusions
In this study, methods for the automatic knowledge extraction and quality inspection of petrographic description texts with complex entities and relations were investigated. The proposed framework comprises rock named entity and relation definitions based on prior petrographic knowledge, rock NER based on a sequence labelling model, petrographic relation extraction based on an enriching R-Transformer relation classification model, and rule-based complex entity separation. Given the high accuracy of NER, the framework performs rock named entity sequence labelling and relation classification in a pipeline mode. The petrographic descriptions of regional metamorphic rock types in the sheet of Fengxiangyi, located in the Dabie orogen, were selected as the experimental dataset. The experimental results showed that: (1) Large-scale pre-trained language models are suitable for complex entity recognition and complex relation extraction on small-scale petrographic description texts. (2) The framework proposed in this paper can automatically extract knowledge from petrographic descriptions of regional metamorphic rocks in the Dabie orogen. (3) Adoption of the proposed method for KG-based quality inspection can improve rock description quality and help avoid obviously inconsistent descriptions.