1. Introduction
With the continuous advancement of smart grid development, substation secondary systems have accumulated massive volumes of operation and maintenance data from multiple heterogeneous sources, as they play a critical role in ensuring the secure and stable operation of power grids. Such data are often stored in a scattered manner in the form of isolated single-page diagrams or unstructured text, exhibiting typical characteristics of high entropy and disorder. In complex operational scenarios, such as signal topology tracing across different panels and collaborative maintenance across different bays, the weak correlation among heterogeneous data features results in cumbersome workflows and low efficiency in information retrieval and logical association. Knowledge graphs provide strong capabilities in semantic association, reasoning, and information integration. By establishing deterministic association mappings among heterogeneous nodes from multiple sources, fragmented and discrete data can be reconstructed into a highly structured relational network. This offers theoretical support for the structured representation and intelligent retrieval of information in the power domain [
1].
Extensive research has been conducted globally on the reconstruction of unstructured data in the power domain into structured knowledge graphs. For the text modality, Natural Language Processing (NLP) technology is widely used to extract entity and relation information from unstructured text sequences [
2]. Among these, methods based on pre-trained language models and their hybrid architectures have become the dominant paradigm for information extraction in this field, owing to their superior capability in capturing global semantic distributions [
3,
4,
5]. The authors of [
6] employed the Bert-BiLSTM-CRF model to extract entities and relations associated with power equipment. By integrating traditional keyword filtering with subgraph querying techniques, this approach effectively enhanced both the query accuracy and response speed for technical standards and product information concerning power equipment. Ref. [
7] utilized the Bert-BiLSTM-ATT model to extract diverse entities and relations, including devices, credentials, timeframes, and locations. This methodology facilitated the construction of corpus resources for power service centers and a specialized knowledge base for the power sector. In terms of textual feature fusion and alignment, ref. [
8] proposed an enhanced ERNIE-CNN model that incorporates an attention mechanism to effectively integrate semantic features and spatial patterns within operational texts. Ref. [
9] proposed a learnable convolutional attention network for unsupervised entity alignment that effectively captures structural information while reducing the overlap of redundant information, thereby providing valuable insights for entity alignment in the power domain. Ref. [
10] introduced a model architecture for attribute type recognition based on knowledge graphs. By transferring the inferred fine-grained type probability distribution of target objects and integrating it with the outputs of conventional generative models, the model predicts the distribution over the complete set of types. These techniques have achieved remarkable performance in applications such as power equipment information retrieval, fault diagnosis, and decision support [
11,
12,
13]. However, entities in substation secondary systems are often characterized by diverse forms, ambiguous boundaries, and long-range dependencies, which still limit the accuracy of entity and relation extraction achieved by existing methods.
Regarding the image modality, computer vision techniques have been widely applied to tasks such as appearance recognition of substation equipment, automatic meter reading acquisition, and switchgear status perception [
14,
15]. Ref. [
16] employed D-LLE manifold learning and the Canny algorithm for local dimensionality reduction and edge detection of graphical features. By integrating YOLO-based object detection with OCR-based text recognition, semantic reconstruction of graphical nodes and connection relationships in secondary wiring diagrams was achieved. Based on ontology modeling theory, ref. [
17] extracted multidimensional entities and their attribute topologies from unstructured high-dimensional visual streams such as equipment photographs and surveillance videos, thereby providing effective support for the evolution of low-level observational data into structured knowledge graphs and their dynamic adaptive updating. The authors of [
18] modified image features, including positional information and edge connectivity, through self-supervised learning during the pre-training stage. Combined with transfer learning and category semantic fusion modules, this approach enabled automatic detection of conventional defects in transmission lines, thereby reducing the risks and computational costs associated with manual inspection. In addition, the authors of [
19] abstracted physical grid topology, equipment metadata, and electrical connection states into multidimensional nodes and relational edges in graph space. Based on a breadth-first multi-source path search algorithm, automated topological tracing of electrical paths and hierarchical upstream–downstream division were realized. However, existing studies mainly focus on object detection based on equipment appearance or simple operational states, while research on deep topological logic analysis of complex engineering drawings remains relatively limited.
As the information dimensions of complex systems continue to expand, information networks constructed from a single modality are no longer sufficient to comprehensively represent the complex physical and logical states of power systems. Multimodal information fusion techniques enable the integration of fragmented local information into a globally consistent Multimodal Knowledge Graph (MMKG) through cross-dimensional semantic alignment and feature complementarity [
20]. Ref. [
21] proposed a Semantic Enhanced Cross-modal Collaborative Attention Network (SCCN) that employs a collaborative attention mechanism to achieve effective cross-modal fusion between textual and visual information. Ref. [
22] proposed an information extraction and retrieval-augmented generation (GAT-RAG) method based on graph attention (GAT) networks. This methodology constructs a unified knowledge graph that integrates textual, visual, and structural data. Additionally, it utilizes graph neural networks to rigorously model the semantic dependencies and contextual relations among entities. Ref. [
23] realized the high-precision semantic alignment and feature fusion of power equipment images and text entities by introducing a feature extraction network based on Vision Transformer and combining momentum contrastive learning and cross-modal attention fusion mechanisms, which significantly improved the retrieval performance of the multimodal power knowledge graph. Ref. [
24] utilized natural language processing and image recognition technologies to analyze the alarm signals of substation equipment, as well as the forms and parameters of the equipment. Eventually, it integrated multi-source data to achieve intelligent handling strategies for substation alarms, providing a technical foundation for the monitoring, processing, and decision-making services of inspectors.
To address the inefficiencies in information retrieval and correlation matching within complex operation and maintenance scenarios of substation secondary systems, on the basis of deep analysis of the characteristics of multimodal data on the substation secondary side, this paper defines an ontology model oriented to this field, formulates standardized triple mapping rules, and further proposes an information extraction and knowledge graph construction framework for information flow diagrams and safety-measure tickets. The main contributions of this paper are summarized as follows:
To address the difficulty of topology parsing caused by dense graphic elements and cluttered, intersecting lines in information flow diagrams, the Heuristic Circular Stepping Search Algorithm (HCSA) is proposed. By incorporating a dynamic directional masking strategy and an extremum-point identification mechanism, the algorithm suppresses local topological noise, enabling deterministic reconstruction of information flow paths and accurate extraction of entity connectivity relationships in complex directed networks.
To address the issues of contextual information attenuation and high entity boundary uncertainty in long-sequence, unstructured instructions, a RoFormer-BiLSTM-CRF hybrid information extraction model enhanced with rotary position embedding in the underlying layer is constructed. By transforming absolute positional encoding into relative distance-aware representations between characters, the model effectively mitigates semantic information loss caused by long-range dependencies, thereby enabling high-precision extraction of entities and relationships from textual data.
To overcome the limitations of information expression in a unimodal data source, cross-modal entity matching of image and text information is conducted based on string similarity to construct an MMKG that encompasses key elements such as equipment, information flow directions, and circuit wiring terminals. This provides a valuable reference for the intelligent management and control of power systems, as well as the efficient retrieval of multimodal information.
2. Framework for Knowledge Graph Construction
This section analyzes the characteristics of multimodal data and defines the domain ontology model of the secondary side of the power system. On this basis, a comprehensive framework for constructing a multimodal knowledge graph is proposed.
2.1. Multimodal Data Feature Analysis
Multimodal data on the secondary side of the substation include cable information flow diagrams, site survey forms, secondary operation safety-measure tickets, protection replacement drawings, etc. This paper mainly focuses on information flow diagrams and safety-measure tickets as the data basis for research.
As representative data within the image modality, information flow diagrams primarily delineate the physical and spatial interconnections and signal transmission logic among various devices. The blue rectangular boxes in
Figure 1 identify secondary-side substation cabinets or protection devices, and the multi-colored directed arrows and the text above them characterize the connection relationships and signal flows between the main equipment and associated equipment, which have the characteristics of complex spatial topology and severe overlapping and coupling of graphics and text.
However, information flow diagrams are usually stored in a scattered manner in the form of isolated single-page drawings, which can only present the physical topology of local subsystems or a single piece of main equipment and cannot intuitively reflect the global interconnection status of the secondary-side power-system equipment in its entirety. In scenarios such as information flow tracing across different panels and cabinets or complex fault troubleshooting, maintenance personnel need to carry out cumbersome manual browsing and logical integration among a large number of drawings, which makes the retrieval of equipment associated across different drawings extremely difficult. Therefore, it is imperative to perform structured analysis of the information flow diagrams for the entire station, transforming the physical associations in discrete drawings into deterministic graph data structures, effectively eliminating the topological uncertainty of local representations and thereby achieving accurate retrieval of equipment associated across drawings and tracing of signal flow throughout the entire chain.
Secondary safety-measure tickets refer to textual data that are pre-compiled, audited, and executed on site during the maintenance, testing, or modification of secondary equipment in operational power systems (such as relay protection devices, automatic devices, and monitoring and control devices) to achieve physical or logical isolation between “maintenance equipment” and “operational equipment” in electrical circuits and logical pathways. Safety-measure tickets are usually stored in a scattered manner in the form of unstructured, independent text. The traditional mode, which relies on operation and maintenance personnel to manually read scattered drawings line by line based on experience and perform cross-checking, is inefficient and highly susceptible to logical oversights and potential misoperation risks stemming from human fatigue.
Table 1 shows some examples of safety-measure entries that contain diverse forms of equipment and terminal entities. The position and boundary information of these entities is uncertain, and there are long-range dependencies between entities, which are extremely sensitive to the relative positions between characters. Therefore, it is necessary to extract granular information regarding equipment, terminals, and actions from each operation record and perform cross-modal information integration and global mapping with the topological relationships between devices in the information flow diagrams so as to advance the intelligent operation and maintenance capabilities of substation secondary systems.
2.2. Construction of the Domain Ontology for Substation Secondary Systems
Ontology serves as the schema layer and data skeleton of the knowledge graph, defining the conceptual hierarchies, entity types, and permissible relational rules within a specific domain [
25]. In order to eliminate the heterogeneous data gap between information flow diagrams and safety-measure tickets while accurately representing the topological connections and terminal-side equipment operation semantics of the secondary system, a substation secondary-side domain ontology model is constructed in a top-down manner. This model is established on the basis of an in-depth analysis of the heterogeneous characteristics of secondary-side multimodal data and combines power-domain expert knowledge with business-logic deconstruction, which defines three primary categories of core entities and three fundamental types of semantic relationships. As shown in
Table 2, in the entity dimension, this paper defines “Equipment” as the entity representing all secondary devices, the “Loop” as the entity representing circuit information, and the “Terminal” entity as the business execution unit; in the relationship dimension, it defines the “Information Flow” as the relationship representing physical cable routing, the “Action” as the relationship depicting business operation logic, and “Subordination” as the relationship mapping different logic circuits and equipment terminal levels.
Based on the standardized domain ontology model mentioned above, this study formulates specialized triple mapping mechanisms for both image and text modalities. Within the image modality, the model extracts <Equipment, Information Flow, Equipment> triplets to convert unstructured imagery into a directed topological network, thereby reconstructing the fundamental physical backbone of the substation secondary system. Regarding the text modality, the model extracts <Equipment, Action, Loop> and <Loop, Subordination, Terminal> triples, defining the business operation scope of operation and maintenance instructions and accurately restoring the hierarchical subordinate structure from loop to terminal inside the equipment. These mapping mechanisms transform heterogeneous multimodal data into a structured representation that intertwines physical topology with operational logic, effectively making up for the semantic deficiency of a single modality in information representation. Furthermore, this approach establishes a standardized schema-layer foundation, which is essential for the subsequent development of information extraction models and the alignment of multi-source entities.
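The mapping rules above can be read as a small schema check. The following is a minimal sketch, assuming the three entity types and three triple patterns named in the text; the class names, the `is_valid` helper, and the example device names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Entity types named in the ontology.
ENTITY_TYPES = {"Equipment", "Loop", "Terminal"}

# Permitted (head type, relation, tail type) patterns from the mapping rules.
ALLOWED_PATTERNS = {
    ("Equipment", "Information Flow", "Equipment"),  # image modality
    ("Equipment", "Action", "Loop"),                 # text modality
    ("Loop", "Subordination", "Terminal"),           # text modality
}


@dataclass(frozen=True)
class Entity:
    name: str
    etype: str  # expected to be one of ENTITY_TYPES


@dataclass(frozen=True)
class Triple:
    head: Entity
    relation: str
    tail: Entity


def is_valid(triple: Triple) -> bool:
    """Return True if the triple matches one of the ontology patterns."""
    pattern = (triple.head.etype, triple.relation, triple.tail.etype)
    return pattern in ALLOWED_PATTERNS


# Hypothetical example names, for illustration only.
dev_a = Entity("Line protection panel A", "Equipment")
dev_b = Entity("Bus protection panel B", "Equipment")
loop = Entity("Trip circuit", "Loop")

ok = Triple(dev_a, "Information Flow", dev_b)
bad = Triple(loop, "Information Flow", dev_a)
```

Rejecting triples that do not match a permitted pattern at load time keeps extraction errors from either modality out of the graph.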
2.3. The Framework for Constructing Knowledge Graphs Based on Multimodal Data
Based on the ontology model of substation secondary systems and the mapping rules of triplets among different modalities, this paper proposes a general framework for constructing a multimodal knowledge graph oriented towards information flow diagrams and safety-measure tickets. As shown in
Figure 2, the framework is mainly composed of three core modules working together. Within the image modality, an information extraction model for information flow diagrams is constructed. This model integrates YOLOv8n, OCR, and spatial topological analysis technology based on the HCSA. It accurately locates the device nodes from unstructured pixel drawings and tracks the topology link directions, thereby extracting the information flow relationships between devices and constructing the physical topology framework of the secondary-side knowledge graph of the substation. Regarding the text modality, an information extraction model for safety-measure tickets based on RoFormer-BiLSTM-CRF is constructed. By integrating a pre-trained language model with deep learning networks, the model accurately identifies equipment, terminal, and operational entities within complex and unstructured long-form text, thereby filling the terminal-level dynamic business logic into the equipment-level physical topology skeleton. Based on this, cosine similarity is used to perform synonymous disambiguation for heterogeneous elements and to align entities for cross-modal triples. Finally, the aligned text and image triples are stored and visualized in the Neo4j graph database, constructing a multimodal knowledge graph that overcomes the limitations of information expression in substation secondary systems.
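As an illustration of the alignment step, the sketch below computes cosine similarity over character bigram counts and matches an entity name from one modality to the closest candidate from the other. The text does not specify the vectorization, so the bigram representation, the `align` helper, and the 0.6 threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams after removing whitespace."""
    s = "".join(text.split()).lower()
    if len(s) < n:
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_similarity(a: str, b: str, n: int = 2) -> float:
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(name: str, candidates: list[str], threshold: float = 0.6):
    """Return the best-matching candidate, or None if none reaches threshold."""
    best = max(candidates, key=lambda c: cosine_similarity(name, c), default=None)
    if best is None or cosine_similarity(name, best) < threshold:
        return None
    return best
```

For example, `align("220kV line protection", ["220 kV Line Protection", "Bus protection"])` selects the first candidate, because whitespace and case differences vanish before the bigrams are counted.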
3. Multimodal Information Extraction and Alignment
Based on the constructed knowledge graph framework, this section introduces information extraction methodologies for both image and text modalities and aligns the information from the two modalities to construct the triples of the multimodal knowledge graph.
3.1. Information Extraction for the Image Modality
To address the limitations of traditional single-image processing algorithms in capturing deep semantic relationships within information flow diagrams, this subsection introduces an information extraction methodology that integrates object detection, OCR, and the HCSA, converting disordered pixels into structured graph triples.
3.1.1. Entity Extraction Method of Information Flow Diagrams
Equipment entities within substation secondary system schematics are visually characterized by blue rectangular frames. The YOLOv8n model is used for entity localization, and it outputs a sequence of target bounding-box coordinates, along with their corresponding confidence scores. Given an input image with spatial dimensions of width
W and height
H, the predicted coordinates for the
i-th entity bounding box are denoted as
In order to eliminate the influence of the image-resolution scale on the subsequent spatial topological calculations, this paper performs global normalization on the predicted bounding-box coordinates and calculates the geometric center-point coordinates of each equipment entity as follows:
The normalized geometric centroids exhibit scale-invariant properties within the topological space, which provides a unified spatial metric baseline for subsequent Euclidean distance measurements and topological link anchoring across diverse equipment nodes. Based on this, considering that the information flow diagram has a high level of clarity and the text contained therein is in regular fonts, without any issues such as irregular fonts or blurred characters, this paper uses PaddleOCR 3.1 for text recognition and conducts error detection and verification through manual inspection methods. Eventually, the text labels of the entity nodes in the image can be obtained, achieving the cross-modal transformation of device entities from visual features to semantic entities.
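The normalization step can be sketched as follows. This is a minimal example that assumes the detector returns corner-format boxes `(x1, y1, x2, y2)` in pixels; the paper's own box notation is not reproduced here, so the argument layout and the function name are assumptions.

```python
def normalized_center(box, width, height):
    """Map a pixel box (x1, y1, x2, y2) to its center in [0, 1] x [0, 1]."""
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / (2.0 * width)   # horizontal center, scaled by image width
    cy = (y1 + y2) / (2.0 * height)  # vertical center, scaled by image height
    return cx, cy


# The same drawing rendered at two resolutions yields the same center.
low_res = normalized_center((100, 50, 300, 150), width=1000, height=500)
high_res = normalized_center((200, 100, 600, 300), width=2000, height=1000)
```

Because both coordinates are divided by the image dimensions, distances computed between centers are comparable across drawings of different resolutions.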
3.1.2. Relationship Extraction Method of Information Flow Diagrams
The information flow diagram represents the topological connection relationships and signal flow between devices using directed arrows of different colors. However, there is a large number of dense attribute texts and overlapping graphic elements in the diagram, which causes serious interference in the identification of the topological relationships between devices. Therefore, in this paper, HSV (Hue, Saturation, Value) color-space technology is utilized to extract the connection lines between devices from the complex background, generating a binarized mask image. Subsequently, morphological erosion is applied to eliminate isolated pixel-level artifacts and textual edge adhesions, finally extracting the connection topology skeleton.
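The color filtering and erosion steps can be illustrated with the standard library alone. The sketch below is not the authors' implementation: the hue range, the saturation and value thresholds, and the 3x3 structuring element are assumptions, and a production pipeline would normally rely on an image processing library.

```python
import colorsys


def hsv_mask(image, hue_range, min_sat=0.4, min_val=0.4):
    """Binary mask of pixels whose hue lies in hue_range (all values in [0, 1])."""
    lo, hi = hue_range
    mask = []
    for row in image:
        out = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            out.append(1 if lo <= h <= hi and s >= min_sat and v >= min_val else 0)
        mask.append(out)
    return mask


def erode(mask):
    """3x3 binary erosion: keep a pixel only if its full neighborhood is set."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(all(mask[y + dy][x + dx]
                                for dy in (-1, 0, 1) for dx in (-1, 0, 1)))
    return out


# Toy image: a 3-pixel-thick red band plus one isolated red speck.
WHITE, RED = (255, 255, 255), (255, 0, 0)
image = [[WHITE] * 7 for _ in range(9)]
for y in (4, 5, 6):
    image[y] = [RED] * 7
image[1][1] = RED
mask = hsv_mask(image, hue_range=(0.0, 0.05))
cleaned = erode(mask)
```

In this toy case the speck at row 1, column 1 is present in `mask` but removed in `cleaned`, while the band survives as a thinner line.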
However, accurately determining arrow orientations within a connection topology is the critical challenge in extracting inter-device relationships. It is difficult to achieve path tracing and endpoint recognition with arrow connections using traditional image processing algorithms. Specifically, when the Zhang–Suen thinning algorithm processes hollow ellipses superimposed on line segments and small triangles at arrow tips, it generates closed-loop skeletons and topological spurs. This leads to the extraction of numerous pseudo-endpoints, making it impossible to uniquely determine the true endpoints of the connection topology. The Pavlidis algorithm, a contour-tracing method based on local pixel neighborhoods, is primarily suitable for extracting the closed boundaries of standard connected regions. Lacking the capacity to extract global geometric features, this method cannot directly locate and output the terminal coordinates of a connection line. While the Breadth-First Search (BFS) algorithm performs well in identifying the endpoints of simple lines, its accuracy decreases significantly in the presence of intersecting lines or overlapping hollow ellipses [
26]. Although the Hough transform is effective for extracting global straight-line segments, applying it to detect polylines in information flow diagrams yields numerous isolated horizontal and vertical segments, thereby failing to directly output the true endpoints of the connection lines.
Therefore, this paper innovatively proposes the HCSA. By establishing a detection circle based on the geometric centroid for iterative stepping, the algorithm strictly constrains the forward trajectory. Furthermore, by incorporating direction masking, extreme point detection, and branch detection mechanisms, it effectively overcomes interference from local noise and complex edges. This ensures continuous and stable path tracing, alongside the precise localization of true endpoints, even in the presence of multiple angular turns along the lines. The procedural framework is shown in Algorithm 1.
Specifically, a circular detection window with a radius of
R is first constructed, using the geometric centroid (
) of the connection contour as the initial reference point. The step radius (
R) is subject to the geometric constraint of
, where
represents the maximum pixel width of the lines within the mask image and
represents the minimum pixel length of the line segment on the fold line. Based on the spatial intersections between the detection circumference and the binary mask of the connection lines (
), the algorithm detects the local extension directions of the link at the initial position and adds these directions to the search queue as active paths. Subsequently, using the current endpoint of the active path (
) as the center, the algorithm performs multi-step iterative stepping along the connection trajectory. The maximum number of steps is defined as
, where
signifies the maximum pixel length of the line.
Algorithm 1: Heuristic Circular Stepping Search Algorithm (HCSA).
To mitigate the interference caused by circular overlapping graphical primitives during the stepping progression, the algorithm applies a dynamic direction-masking strategy before each search for forward intersections. When a new valid intersection is detected along the current traversal heading, this point becomes the next stepping reference, so that the path extends in one direction only. If the circular detection window finds no valid intersection along the heading, which indicates that the path has either reached a geometric corner or its topological termination, the algorithm locates the local extreme point in the current direction and performs exploratory branch detection at that location. If a new branch deviating from the original heading is detected, a new search path is created. If no such branch exists, the point is classified as a genuine topological endpoint and added to the endpoint set. By transforming continuous pixel traversal into discrete directed stepping, the HCSA incurs a per-step cost of only O(R), since each step scans a circumference of radius R. Combined with the bound on the maximum number of tracking steps, the theoretical time complexity of the algorithm for extracting a single connection line is linear in the length of the target line.
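The stepping logic described above can be illustrated with a heavily simplified sketch. It is not the authors' Algorithm 1: it follows a single path instead of maintaining a queue of active paths, it samples the detection circle at fixed angular intervals, and the angular `window` used as the direction mask, the function names, and all default values are assumptions.

```python
import math


def in_mask(mask, x, y):
    return 0 <= y < len(mask) and 0 <= x < len(mask[0]) and mask[y][x] == 1


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def circle_hits(mask, center, radius, samples=72):
    """Sample the detection circle; return (angle, pixel) pairs on the mask."""
    cx, cy = center
    hits = []
    for k in range(samples):
        a = 2 * math.pi * k / samples
        x = round(cx + radius * math.cos(a))
        y = round(cy + radius * math.sin(a))
        if in_mask(mask, x, y):
            hits.append((a, (x, y)))
    return hits


def trace_endpoint(mask, start, heading, radius=3, max_steps=200,
                   window=math.radians(48)):
    """Step along a line from `start`; return the endpoint that is reached."""
    pos = start
    for _ in range(max_steps):
        hits = circle_hits(mask, pos, radius)
        # Direction mask: only intersections close to the heading count as forward.
        ahead = [(angle_diff(a, heading), a, p) for a, p in hits
                 if angle_diff(a, heading) <= window]
        if ahead:
            _, heading, pos = min(ahead)
            continue
        # No forward intersection: slide to the local extreme point.
        dx, dy = round(math.cos(heading)), round(math.sin(heading))
        while in_mask(mask, pos[0] + dx, pos[1] + dy):
            pos = (pos[0] + dx, pos[1] + dy)
        # Branch detection, ignoring the direction the path came from.
        back = heading + math.pi
        branches = [a for a, _ in circle_hits(mask, pos, radius)
                    if angle_diff(a, back) > window]
        if not branches:
            return pos  # no branch: genuine endpoint
        heading = min(branches, key=lambda a: angle_diff(a, heading))
    return pos


# Toy mask: an L-shaped, one-pixel-wide polyline with a single right-angle turn.
mask = [[0] * 24 for _ in range(24)]
for x in range(2, 13):
    mask[10][x] = 1
for y in range(10, 21):
    mask[y][12] = 1
```

Calling `trace_endpoint(mask, (2, 10), 0.0)` steps along the horizontal segment, detects the corner at (12, 10), turns onto the vertical segment, and stops at the far end. Each iteration inspects a fixed number of circle samples, so the work grows with the number of steps rather than with the number of pixels in the image.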
After successfully extracting the coordinates of the two endpoints of the connection line, it is necessary to distinguish between the tip and the tail of the arrow. By analyzing geometric features, it can be seen that the arrow tip is usually a small triangle, while the arrow tail transitions smoothly with the main line, meaning the tip has more local white pixel blocks than the tail. Therefore, an endpoint polarity discrimination function based on local pixel density is constructed. By comparing the pixel density features of the two endpoints, the framework precisely identifies the specific flow of signal transmission and achieves logical transformation from an undirected topology to a directed relationship graph.
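A density-based polarity check of this kind might look as follows. The window size and the function names are assumptions; the text states only that the endpoint with more local foreground pixels is treated as the arrow tip.

```python
def local_density(mask, point, half=2):
    """Fraction of foreground pixels in a (2*half+1)^2 window around point."""
    px, py = point
    h, w = len(mask), len(mask[0])
    count = total = 0
    for y in range(py - half, py + half + 1):
        for x in range(px - half, px + half + 1):
            if 0 <= y < h and 0 <= x < w:
                total += 1
                count += mask[y][x]
    return count / total if total else 0.0


def tip_and_tail(mask, end_a, end_b, half=2):
    """Return (tip, tail): the denser endpoint is taken to be the arrow tip."""
    da = local_density(mask, end_a, half)
    db = local_density(mask, end_b, half)
    return (end_a, end_b) if da >= db else (end_b, end_a)


# Toy arrow: a horizontal shaft with a small triangular head on the right.
arrow = [[0] * 20 for _ in range(9)]
for x in range(2, 16):
    arrow[4][x] = 1
for y in range(2, 7):
    arrow[y][13] = 1
for y in range(3, 6):
    arrow[y][14] = 1
```

For this toy arrow the window around (15, 4) contains 9 foreground pixels out of 25, while the window around (2, 4) contains 3, so the right end is identified as the tip.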
3.1.3. Generation of Information Flow Triples
After identifying the tips and tails of the connection lines, Euclidean distance measurement is utilized to match the equipment connection relationships. Assume that N equipment entities are detected in the image and that each is represented by its globally normalized center-point coordinates. For any extracted directed link with a known tip and tail, the two ends of the link are anchored to the corresponding sending equipment and receiving equipment by minimizing the Euclidean distance between each end and the equipment center points.
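The anchoring step amounts to a nearest-neighbor lookup. In the sketch below, the function names are hypothetical, and the convention that the tail attaches to the sending equipment and the tip to the receiving equipment is an assumption that follows the usual reading of an arrow.

```python
import math


def nearest_equipment(point, centers):
    """Index of the equipment center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def anchor_link(tip, tail, centers):
    """Return (source_index, target_index) for one directed link."""
    return nearest_equipment(tail, centers), nearest_equipment(tip, centers)


# Normalized centers of three detected devices (illustrative values).
centers = [(0.2, 0.5), (0.8, 0.5), (0.5, 0.1)]
source, target = anchor_link(tip=(0.75, 0.52), tail=(0.25, 0.50), centers=centers)
```

Here the link is anchored from device 0 to device 1. A distance threshold could be added to reject ends that lie far from every device, although the text does not mention one.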
To endow topological connections with specific business semantics, it is necessary to accurately map the attribute texts scattered around the links to the corresponding directed edges. Using the Hough line detection algorithm to extract the ordinate (
) of the main line of the link and combining the horizontal boundary domain (
) of the link, a label recognition region (
) is adaptively constructed for each directed edge. Assuming the center point of the
j-th text bounding box obtained by OCR is
, the discrimination criterion for its spatial subordination relationship with a specific link is defined as follows:
If the text center point strictly satisfies the above geometric inclusion conditions, it is determined that the text belongs to this connection line, realizing the transformation of unstructured text into semantic link attributes. Through the spatial mapping and logical aggregation of the source device name, target device name, signal transmission orientation, and corresponding attribute texts, the raw image pixels and character elements within the schematic are effectively converted into a structured knowledge graph triplet set (
), which is defined as follows:
where
and
respectively represent the source equipment and target equipment of the topological link,
is the connection relationship between pieces of equipment,
is the equipment entity set extracted by detection, and
is the complex semantic edge set that fuses signal attributes and directional connection relationships.
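The label assignment and triple assembly can be sketched as below. The exact inclusion criterion is not reproduced here, so this example assumes a simple rule: a text box belongs to a link if its center lies within the link's horizontal extent and within a vertical band around the main line. The `band` width, the dictionary keys, and the device names are hypothetical.

```python
def text_belongs_to_link(text_center, y_line, x_range, band=0.03):
    """True if the text center lies inside the link's label region."""
    x_t, y_t = text_center
    x_min, x_max = x_range
    return x_min <= x_t <= x_max and abs(y_t - y_line) <= band


def build_triples(links, texts, names):
    """Combine anchored links and OCR texts into (source, relation, target)."""
    triples = []
    for link in links:
        labels = tuple(label for label, center in texts
                       if text_belongs_to_link(center, link["y_line"],
                                               link["x_range"]))
        relation = ("Information Flow", labels)  # edge type plus attribute texts
        triples.append((names[link["src"]], relation, names[link["dst"]]))
    return triples


# Illustrative inputs in normalized coordinates.
names = ["Device A", "Device B"]
links = [{"src": 0, "dst": 1, "y_line": 0.50, "x_range": (0.30, 0.70)}]
texts = [("trip signal", (0.50, 0.48)), ("unrelated note", (0.50, 0.80))]
triples = build_triples(links, texts, names)
```

Only the text whose center falls inside the label region is attached to the edge; the other text is left for a different link or discarded.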
3.2. Information Extraction for the Text Modality
To address the challenges of heterogeneous text formats and complex entity extraction in safety-measure tickets, a RoFormer-BiLSTM-CRF based information extraction model is developed, enabling accurate extraction of textual triples.
3.2.1. Rotary Position Embedding
Traditional pre-trained models utilizing absolute position embedding, such as BERT, are highly susceptible to feature extraction degradation and positional information decay when processing texts characterized by long-range inter-entity dependencies and intra-entity character position sensitivity. Therefore, this paper introduces the Rotary Position Embedding mechanism [
27] in the bottom-layer feature extraction stage. It encodes position by mapping the context representation to complex space and multiplying it by an orthogonal rotation matrix determined by the absolute position. When the self-attention inner product is computed, the absolute position indices combine so that the result depends only on the relative distance between characters; the principle is illustrated in
Figure 3.
Specifically, let the input textual sequence of the safety-measure ticket be denoted as
. Suppose the word-embedding vectors at positions
m and
n are linearly projected to yield the
d-dimensional query vector (
) and key vector (
), respectively. The RoPE mechanism, by constructing orthogonal rotation matrices
and
, explicitly injects the absolute position information into the corresponding vectors in the form of a rotation operation. After fusing the position information, vectors
and
can be formulated as follows:
wherein the rotation matrix (
) possesses orthogonality. Based on the geometric properties of rotation matrices within Euclidean space, it follows that
. Therefore, when the self-attention inner product is computed, the absolute-position rotation matrices cancel, and the inner product of the two vectors becomes a function that depends only on the relative position
:
As indicated by the preceding equation, RoPE mathematically converts the additive constraints of absolute positions into a rotational operation applied to the feature vectors dictated by their positional indices. This mechanism not only preserves the norm invariance of the word embeddings but also endows the model with an exceptional length extrapolation capability, significantly enhancing the extraction ability of the network for deep semantic features in the complex long instructions of safety-measure tickets [
28].
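The relative-position property and the norm invariance can be checked numerically. The sketch below applies the standard rotary formulation to plain Python lists; the base of 10000 and the pairing of consecutive dimensions follow the common RoPE convention and are assumptions with respect to the configuration used in this paper.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # rotation angle for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Illustrative query and key vectors.
q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]

score_a = dot(rope(q, 3), rope(k, 7))     # positions 3 and 7, offset 4
score_b = dot(rope(q, 10), rope(k, 14))   # positions 10 and 14, offset 4
```

The two scores agree up to floating-point error because only the offset between the positions enters the inner product, and `dot(rope(q, 5), rope(q, 5))` equals `dot(q, q)` because each rotation preserves length.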
3.2.2. Text Information Extraction Method Based on the RoFormer-BiLSTM-CRF Model
To address the ambiguity of entity boundaries and the challenges of long-range dependencies within the lengthy texts of safety-measure tickets, this study constructs a joint information extraction architecture based on RoFormer-BiLSTM-CRF, as illustrated in
Figure 4. During the data processing phase, manual sequence annotation is performed on the safety-measure documents utilizing the Label Studio platform. By employing the BIO tagging scheme, three core entity categories are explicitly defined: equipment (DEV), terminal (TERM), and action (ACT). In the feature encoding phase, a sequence of feature vectors infused with relative positional information is acquired via the RoPE mechanism. Assume the feature sequence output by the RoFormer layer is
, where
n is the sequence length of the input text. To further capture the deep temporal and contextual dependencies within the text, the model feeds the sequence (
H) into a BiLSTM network for bidirectional encoding. By concatenating the forward and backward hidden states at each time step, a comprehensive representation (
) encapsulating global contextual features at time step
i is obtained. To mitigate overfitting on the small-sample dataset, dropout is applied to
. Subsequently, it is mapped to a
k-dimensional label space through a fully connected layer, generating the emission score (
). Thus, the emission score matrix is constructed as
, where
k is the total number of label categories and
represents the un-normalized score of the
i-th character being assigned the label of
.
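To make the label space concrete, the sketch below enumerates the BIO labels for the three entity categories named above and lists the label transitions that the BIO scheme forbids. It assumes that the label set contains only the B-/I- tags of each category plus O (no padding or start/stop tags), in which case k = 7; the helper name `is_legal_transition` is illustrative.

```python
ENTITY_TYPES = ["DEV", "TERM", "ACT"]  # categories named in the text

# BIO scheme: one B- and one I- label per category, plus the shared O label.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

def is_legal_transition(prev, curr):
    """Under BIO, I-X may only follow B-X or I-X of the same category X."""
    if not curr.startswith("I-"):
        return True
    category = curr[2:]
    return prev in (f"B-{category}", f"I-{category}")

# All ordered label pairs that violate the BIO constraint.
illegal = [(a, b) for a in LABELS for b in LABELS if not is_legal_transition(a, b)]
```

Under these assumptions, 15 of the 49 ordered label pairs are illegal, for example O followed by I-DEV. Emission scores are computed per character and therefore cannot rule such pairs out on their own.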
Due to the strict dependency constraints among entity labels in sequence labeling tasks, relying solely on emission scores is highly likely to result in illegal label transitions. Therefore, this paper introduces a Conditional Random Field (CRF) module at the top layer of the network to perform global optimal decoding and loss calculation. Assume the transition matrix is
, where element
represents the score associated with transitioning from state tag
to
. For a specific input sequence (
x) and a corresponding predicted label sequence (
), the global path-scoring function is defined as the summation of the emission scores and the transition scores:
During the model training phase, parameter optimization aims to maximize the conditional likelihood of the ground-truth label sequence. The negative log-likelihood is employed as the training objective, which is equivalent to minimizing the sequence cross-entropy between the predicted distribution and the empirical data distribution. The global loss function is expressed as follows:
where
denotes the partition function, which normalizes over all possible label sequences by summing the exponentiated scores under a given input. Minimizing this loss enables the model to suppress invalid label transitions, reduce uncertainty in the decoding space, and jointly optimize both the feature representations and the transition-matrix parameters.
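The path score, the partition function, and the resulting loss can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the training code of this study: emissions are given as an n-by-k list of lists, transitions as a k-by-k list of lists, start and stop transitions are omitted, and all function names and toy values are hypothetical. The log-partition term is computed with the forward recursion instead of enumerating all label sequences, and a max-based variant of the same recursion (Viterbi) is included because it operates on the same two score tables.

```python
import math

def path_score(emissions, transitions, tags):
    """Emission scores plus transition scores along one label path."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exponentiated scores of all paths."""
    k = len(transitions)
    alpha = list(emissions[0])
    for row in emissions[1:]:
        alpha = [
            logsumexp([alpha[a] + transitions[a][b] for a in range(k)]) + row[b]
            for b in range(k)
        ]
    return logsumexp(alpha)

def crf_nll(emissions, transitions, gold_tags):
    """Negative log-likelihood of the gold path: log-partition minus path score."""
    return log_partition(emissions, transitions) - path_score(emissions, transitions, gold_tags)

def viterbi(emissions, transitions):
    """Highest-scoring label path and its score (max instead of log-sum-exp)."""
    k = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        step, new_best = [], []
        for b in range(k):
            a_star = max(range(k), key=lambda a: best[a] + transitions[a][b])
            step.append(a_star)
            new_best.append(best[a_star] + transitions[a_star][b] + row[b])
        backpointers.append(step)
        best = new_best
    last = max(range(k), key=lambda b: best[b])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    return path[::-1], max(best)

# Toy scores for a 3-character input and 3 labels (illustrative values only).
emissions = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [0.4, 0.6, 1.5]]
transitions = [[0.5, 1.0, -1.0], [-0.5, 0.8, 0.2], [0.3, -4.0, 0.1]]
gold = [0, 1, 2]

loss = crf_nll(emissions, transitions, gold)
best_path, best_score = viterbi(emissions, transitions)
```

A strongly negative transition score, such as the -4.0 entry in the toy matrix, lowers the score of every path that uses that transition. This is how a trained transition matrix suppresses invalid label sequences.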
During the model inference phase, the Viterbi dynamic programming algorithm is employed to decode the optimal label sequence (
) that achieves the maximum global score. After parsing it into specific entity vocabularies, the equipment entity set (
), action relationship set (
), and terminal entity set (
) in the safety-measure ticket instructions can be obtained. In actual field operations, the terminal operations of devices within the same circuit are often highly consistent. Exploiting the hierarchical structure of the safety-measure ticket text, this section applies heuristic rules based on document structure parsing to match each safety-measure instruction item, top-down through the structural tree, to the hierarchical headings it belongs to, thereby extracting the loop entity set (
) corresponding to the equipment entities. Based on these entity sets and the domain ontology model of the substation secondary side, the equipment and operation actions identified by the model are semantically integrated with the associated circuits obtained by structure parsing to generate the triplet set (
):
where
is the specific equipment entity involved in the operation instruction,
is the specific action of the operation instruction, and
is the wiring terminal of the specific loop where the equipment is located. Subsequently, to restore the hierarchical structure inside the equipment and identify the logical circuit to which each terminal belongs, the containment-relation triplet set (
) is constructed:
where
represents the specific device terminal number and
represents the containment relationship defined in the ontology layer. Ultimately, the complete set of structured triplets generated by text-modality extraction is
, realizing the structured conversion of the unstructured text of the safety-measure tickets and providing data support for the construction of the multimodal graph of substation secondary systems.
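The post-decoding steps described above can be illustrated with a small sketch. It is a simplified reading of the procedure, not the implementation of this study: the English tokens, the terminal number, the heading text, the relation name `contains`, and the exact triple layout are all hypothetical, and every device, action, and terminal found in one instruction is simply combined. The loop is assumed to have been obtained beforehand from the heading under which the instruction appears.

```python
def bio_spans(tokens, tags):
    """Collect (category, text) spans from a BIO-tagged token sequence."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(token)
        else:  # "O" or a stray I- tag closes the open span
            current = None
    return [(category, " ".join(parts)) for category, parts in spans]

def build_triples(loop, spans, contains="contains"):
    """Assemble operation triples and containment triples for one instruction."""
    devices = [text for cat, text in spans if cat == "DEV"]
    actions = [text for cat, text in spans if cat == "ACT"]
    terminals = [text for cat, text in spans if cat == "TERM"]
    # Operation triples: (equipment, action, terminal).
    operation = [(d, a, t) for d in devices for a in actions for t in terminals]
    # Containment triples: (loop, contains, terminal), an assumed layout.
    containment = [(loop, contains, t) for t in terminals]
    return operation + containment

# Hypothetical decoded instruction and the heading-derived loop it sits under.
tokens = ["Disconnect", "terminal", "1D5", "on", "line", "protection", "panel"]
tags = ["B-ACT", "O", "B-TERM", "O", "B-DEV", "I-DEV", "I-DEV"]
loop = "current circuit"

spans = bio_spans(tokens, tags)
triples = build_triples(loop, spans)
```

For character-level Chinese text, the span text would be joined without spaces; the space-joined form is used here only because the example tokens are English words.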
3.3. Information Alignment
Because of human errors in dataset compilation and errors introduced during information extraction, the same entity often appears in different textual forms. To resolve this issue, this paper introduces a character-level N-gram technique combined with term frequency–inverse document frequency (TF-IDF) to map the equipment entities extracted from the image and text modalities into a continuous feature-vector space for semantic disambiguation.
The character-level N-gram method decomposes an entity string into a set of overlapping continuous character substrings of length n. In this study, to simultaneously capture the local morphological features of specialized vocabulary in the power domain and retain robust sequence patterns, the N-gram range is configured to extract character sequences of length 2, 3, and 4, representing bigrams, trigrams, and four-grams. Subsequently, the TF-IDF algorithm is utilized to calculate the statistical weight of each extracted N-gram. This mechanism effectively penalizes high-frequency and generic character combinations lacking distinctiveness, such as common power-system prefixes, while amplifying the weights of unique and highly discriminative substrings, thereby enhancing the robustness of entity representation against noise.
Let the feature vector of the image-modality entity be
and the feature vector of the text-modality entity be
. The semantic similarity between the two is measured by cosine similarity, calculated as follows:
A similarity-matching threshold is established such that two cross-modal entities are identified as referencing the same physical object if and only if their cosine similarity meets this threshold, at which point a node-merging operation is executed.
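The alignment step can be sketched with the standard library alone. This is an illustrative approximation, not the configuration used in this study: the entity names are invented, the TF-IDF variant is an assumption (raw term counts with the smoothed IDF ln((1 + N) / (1 + df)) + 1, the form also used by default in scikit-learn's `TfidfVectorizer`), and the threshold is left as a required parameter because its value is not given here.

```python
import math
from collections import Counter

def char_ngrams(text, n_min=2, n_max=4):
    """All overlapping character substrings of length n_min..n_max."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def tfidf_vectors(names):
    """Sparse TF-IDF vectors (dicts) over the character n-grams of each name."""
    docs = [Counter(char_ngrams(name)) for name in names]
    df = Counter(gram for doc in docs for gram in doc)  # document frequency
    n_docs = len(docs)
    idf = {gram: math.log((1 + n_docs) / (1 + count)) + 1.0 for gram, count in df.items()}
    return [{gram: tf * idf[gram] for gram, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def same_entity(u, v, threshold):
    """Merge decision: similarity at or above the caller-supplied threshold."""
    return cosine(u, v) >= threshold

# Hypothetical names: two spellings of one device and one unrelated device.
names = [
    "Line1 protection panel A",   # e.g. read from a diagram by OCR
    "Line 1 protection panel A",  # e.g. extracted from a ticket
    "Bus coupler control panel",
]
vectors = tfidf_vectors(names)
sim_same = cosine(vectors[0], vectors[1])
sim_diff = cosine(vectors[0], vectors[2])
```

The two spellings of the same device share most of their n-grams and therefore score much closer to 1 than the unrelated pair, which overlaps mainly on the generic substring "panel".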
5. Discussion
With the development of smart grids, substation secondary systems have accumulated a massive amount of multi-source heterogeneous data. Traditional multimodal power-system research is mostly limited to fault detection, visual recognition of equipment appearance, or the monitoring of simple operational states of the system, and cannot fully solve the problems of cross-drawing signal flow tracing and relational retrieval at the equipment and terminal levels. Therefore, this paper conducted information extraction on two data modalities, information flow diagrams and safety-measure tickets, and ultimately constructed a structured knowledge graph integrating equipment entities, electrical circuits, signal flow directions, and terminal-level information.
During the extraction of connection relationships within the image modality, conventional topology extraction algorithms such as BFS, PCA, and the modified Zhang–Suen skeletonization method are highly susceptible to interference from the hollow elliptical primitives that overlap the connection lines and from the triangular tips of the arrows, making it difficult to accurately parse the deep topological logic embedded in complex engineering drawings. The HCSA proposed in this research achieves efficient topology tracing and precise endpoint recognition by constructing detection circles integrated with directional masks, together with branch-point and extremum-point detection mechanisms. Within the text modality, long-sequence safety-measure tickets frequently cause contextual information degradation. To address this issue, the RoFormer-BiLSTM-CRF model was constructed; it demonstrates superior information extraction performance compared with RoUIE, Bert, and alternative combined models, and its inference latency is significantly lower than that of the GLM-4-FS model. Subsequently, character-level N-grams combined with TF-IDF and cosine similarity are adopted to align the device entities in the text and image modalities, constructing a standardized knowledge graph.
Despite the strong performance of the proposed method in specific scenarios, certain limitations remain. Specifically, the HCSA is primarily designed for extracting connection relationships within information flow diagrams of 500 kV substation secondary systems. Its applicability may decrease on substations with different voltage levels or on schematic drawings governed by other design specifications. Furthermore, in the text extraction and modal alignment stages, the current approach mainly relies on traditional discriminative deep learning models and has not achieved end-to-end deep integration of the original data from different modalities.
Future research will focus on enhancing the generalization capability of the HCSA. Dynamic adjustment of the detection radius and optimization strategies for the stepping direction will be introduced to achieve adaptive tracing of paths with arbitrary curvature. Concurrently, subsequent research will consider integrating richer heterogeneous data sources, including data from different voltage levels and different substations, to enable cross-validation. Future research will also utilize cross-modal feature fusion technologies such as graph neural networks to construct a more comprehensive knowledge graph, providing more systematic information support for the intelligent operation and maintenance of substation secondary systems and for equipment information retrieval.
6. Conclusions
Considering the high entropy and disordered nature of multi-source heterogeneous data in substation secondary systems, an information processing method for multimodal substation data is proposed. This method utilizes information flow diagrams and safety-measure tickets to construct structured knowledge graphs and achieves the integration and visualization of the physical topology connections of secondary equipment and terminal-level information. For the image modality, equipment entities are extracted by integrating YOLOv8n and OCR techniques, and the HCSA is proposed to identify information flows with an accuracy of 100%. This approach effectively addresses the challenges in recognizing equipment connectivity caused by complex line intersections and discontinuities, enabling reliable path tracing and endpoint identification of signal flow among devices in the secondary system. For the text modality, a RoFormer-BiLSTM-CRF-based information extraction model is developed. By incorporating rotary position embedding, the model effectively addresses long-range dependency issues and entity-boundary ambiguity in extended instruction sequences, enabling high-precision extraction of equipment, terminal, and operational action entities. Experimental results demonstrate that the overall performance of this model is superior to that of other combined models such as Bert. Finally, string similarity is used to align entities across the two modalities, yielding a comprehensive knowledge graph of the substation secondary system that integrates both equipment connection topology and terminal operation logic.