The structure of SFRSN is shown in Figure 1. It consists of five parts: a character embedding layer, a feature reuse stacked BiLSTM layer, a convolution layer, a single-tail selection function layer, and a global feature gated attention mechanism layer. In the character embedding layer, character embedding vectors are generated by the BERT pre-trained model. The feature reuse stacked BiLSTM layer concatenates the output features of each BiLSTM layer with the original features to form the input of the next BiLSTM layer, thereby obtaining deep contextual features of the text. The convolution layer captures spatial relations and structural patterns, such as nested and overlapping entity spans, in the sentence span matrix. The single-tail selection function layer obtains classification features for all possible entity span suffix categories, and the global feature gated attention mechanism layer combines the entity span suffix category classification features with the span structure representation through weight selection, ultimately achieving entity span suffix category classification.
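The feature-reuse wiring described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the BiLSTM itself is abstracted as a stand-in function, and the layer count and dimensions are made-up examples; only the concatenation pattern (each layer's output joined with the original embeddings) follows the text.

```python
# Sketch of the feature-reuse stacked BiLSTM wiring (assumed shapes).
# Each BiLSTM's output is concatenated with the original character
# embeddings before feeding the next layer, so deep layers always see
# the raw features alongside the processed ones.

def dummy_bilstm(features, hidden_size):
    """Stand-in for a BiLSTM: maps each timestep to a 2*hidden_size vector."""
    return [[0.0] * (2 * hidden_size) for _ in features]

def feature_reuse_stack(char_embeddings, num_layers=3, hidden_size=4):
    original = char_embeddings          # BERT character embeddings
    layer_input = char_embeddings
    for _ in range(num_layers):
        out = dummy_bilstm(layer_input, hidden_size)
        # feature reuse: concatenate layer output with the original features
        layer_input = [o + e for o, e in zip(out, original)]
    return layer_input

seq = [[0.1] * 8 for _ in range(5)]     # 5 characters, embedding dim 8
out = feature_reuse_stack(seq)
print(len(out), len(out[0]))            # 5 timesteps, dim 2*4 + 8 = 16
```

Because the original embeddings are re-concatenated at every layer, the per-timestep dimensionality stays fixed at 2·hidden + embedding after the first layer rather than growing without bound.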
3.3. Convolution Layer
Because the main objective of the entity recognition task in this paper is to predict entity label suffixes between character pairs, it is necessary to generate high-quality character-pair span representations. In traditional entity label sequence labeling, one character corresponds to a complete label, such as B-PER or I-LOC. Entity labels consist of two parts: entity boundary identifiers, such as B- (Begin) and I- (Inside), and category identifiers, such as PER (Person) and LOC (Location). In this paper, category identifiers are defined as entity suffixes. Non-entity characters carry the label O, which this paper maps to a special suffix category N. Therefore, this paper adds three columns to the annotation content of the dataset, as shown in Figure 2a. In Figure 2a, the character position represents the position of the character in the sentence, and the entity label mainly includes the entity boundary and the entity suffix category. Within an entity, the entity category of the entity head character is marked as the corresponding entity suffix, and the position of the corresponding tail character is marked as the position of the entity tail character. The entity category of all other characters is marked as 'N', and their tail character position is set to their own position in the sentence. For a nested entity, all entity suffixes are marked on the entity category corresponding to the head character of the nested entity, and the positions of the corresponding tail characters are marked.
Figure 2b shows the span representation matrix corresponding to the sample in Figure 2a. In that sample, the text "I love Beijing tian'anmen, the scenery in Beijing is really beautiful" has a sentence length of 15, so a 15 × 15 span representation matrix is constructed, and each element (i, j) of the matrix represents the span from the i-th character to the j-th character of the sentence. When mapping entity labels, if an entity starts with character i and ends with character j, element (i, j) of the matrix is labeled with the suffix category of the entity. The sentence sample includes the entities "Beijing tian'anmen" and "Beijing". In the entity "Beijing tian'anmen", the connection between "bei" and "men" carries the entity category, and their positions in the sentence are 2 and 6, respectively; therefore, the "POI" entity suffix is marked at position (2,6) of the representation matrix. Similarly, for "bei" and "jing" in "Beijing", the "LOC" entity suffix is marked at position (8,9) of the representation matrix. Additionally, if a character c has no entity category relation with any other character, only its self-connection position is assigned the entity suffix category N. For a nested entity that starts with character i and ends with character k, where k is not equal to j, element (i, k) of the matrix is labeled with the suffix category of the nested entity. Characters inside an entity (i.e., neither head nor tail characters) form a span with their own position (i, i), whose category is marked as N. The spans of all character pairs (i, j) that do not form any entity are also marked with category N. In this way, the entity recognition task is transformed into a classification task that predicts the corresponding entity suffix category for each element (i, j) of the span matrix.
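The labeling scheme above can be made concrete with a short sketch. The positions and suffix categories follow the Figure 2 example ("Beijing tian'anmen" spanning characters 2–6 with suffix POI, "Beijing" spanning 8–9 with suffix LOC); the function itself is an illustration, not the paper's code.

```python
# Minimal sketch of span-label matrix construction: every character pair
# defaults to the non-entity category N, and each entity marks its suffix
# at the (head, tail) cell.

def build_span_matrix(sent_len, entities):
    """entities: list of (head_index, tail_index, suffix_category)."""
    matrix = [["N"] * sent_len for _ in range(sent_len)]
    for head, tail, suffix in entities:
        matrix[head][tail] = suffix
    return matrix

m = build_span_matrix(15, [(2, 6, "POI"), (8, 9, "LOC")])
print(m[2][6], m[8][9], m[0][0])  # POI LOC N
```

Nested entities need no special machinery here: a nested span (i, k) simply occupies a different cell from the outer span (i, j).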
Based on the span representation matrix, and referring to the architecture of the convolution layer of the W2NER model in [26], the convolution layer in this paper includes conditional layer normalization, span representation construction, and DCNN. In conditional layer normalization, the normalized gain parameter γ_ij for character i and character j, conditioned on character i, is first calculated. The calculation formula is shown in Formula (6). Among them, h_t is the context feature corresponding to time t in the sequence output by the feature reuse stacked BiLSTM network in Section 3.2, and W_γ and b_γ are the training parameter matrix and the bias, respectively. Then, the normalization bias λ_ij is calculated; its calculation formula is shown in Formula (7). In Formula (7), W_λ and b_λ are the training parameter matrix and the bias, respectively. Finally, the normalized representation feature V_ij corresponding to character i and character j is calculated, as shown in Formula (8). Among them, d_h represents the dimension of the context feature vector, and h_jk is the k-th dimension of h_j.
The gain and bias of conditional layer normalization are parameters dynamically generated from character i, which allows the representation used for entity suffix category classification to adapt to the specific character and enhances the model's ability to express context structures.
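The conditional normalization step can be sketched in plain Python. This is a hedged illustration of the pattern Formulas (6)–(8) describe, not the paper's exact formulation: the parameter names (W_gamma, b_gamma, W_lambda, b_lambda) and the toy inputs are assumptions, and the normalization statistics are taken over the feature dimensions of h_j.

```python
import math

# Sketch of conditional layer normalization: gain and bias are generated
# from character i's feature (Formulas (6) and (7)), then applied to the
# layer-normalized feature of character j (Formula (8)).

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def conditional_layer_norm(h_i, h_j, W_gamma, b_gamma, W_lambda, b_lambda, eps=1e-5):
    gain = [g + b for g, b in zip(matvec(W_gamma, h_i), b_gamma)]    # gamma_ij
    bias = [l + b for l, b in zip(matvec(W_lambda, h_i), b_lambda)]  # lambda_ij
    d = len(h_j)
    mu = sum(h_j) / d                                   # mean over dimensions
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h_j) / d + eps)
    # elementwise: gain * normalized h_j + bias
    return [g * (x - mu) / sigma + l for g, x, l in zip(gain, h_j, bias)]

I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity weights for the toy example
v = conditional_layer_norm([0.5, -0.5], [1.0, 3.0], I2, [0.0, 0.0], I2, [0.0, 0.0])
print([round(x, 3) for x in v])
```

The key property is visible in the code: two different conditioning characters h_i produce different gains and biases for the same h_j, so each cell of the span matrix gets a pair-specific representation.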
In span representation construction, the relative position information between each pair of characters is mapped to a distance embedding E_d, and the regional information of the lower-triangle and upper-triangle regions of the character adjacency span matrix is mapped to a region embedding E_t. The two vectors are concatenated with the character pair representation feature V to obtain the span representation feature with the relative position and regional information of character pairs, denoted as C.
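A sketch of this construction, under stated assumptions: the embedding tables below are toy lookups standing in for learned embeddings, and the concatenation order is illustrative.

```python
# Sketch of span-representation construction: per character pair, the pair
# feature is concatenated with a relative-position (distance) embedding and
# a region embedding distinguishing upper vs lower triangle of the matrix.

def span_representation(v_ij, i, j, pos_emb, region_emb):
    distance = abs(i - j)
    region = 0 if i <= j else 1        # 0: upper triangle, 1: lower triangle
    return v_ij + pos_emb[distance] + region_emb[region]

pos_emb = {d: [float(d)] * 2 for d in range(16)}   # toy distance embeddings
region_emb = {0: [0.0], 1: [1.0]}                  # toy region embeddings
c = span_representation([0.3, 0.7], 2, 6, pos_emb, region_emb)
print(len(c))  # 2 (pair feature) + 2 (distance) + 1 (region) = 5
```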
In the character adjacency span matrix, there are inclusion and intersection relationships among nested entities. Therefore, a multi-scale DCNN is used to obtain the spatial structural relationships of the span matrix. Given the span representation feature C, let the dilated convolution with dilation step d be represented as DConv_d; dilation steps d = 1, 2, 3 and 3 × 3 convolution kernels are used. Among them, d = 1 is the standard convolution, which obtains locally continuous n-gram features that help identify short entities. When d = 2, the receptive field is expanded to obtain n-gram features separated by one character, which helps identify medium-length entities. When d = 3, the maximum receptive field is obtained, which helps identify long-distance entities. Additionally, the 3 × 3 convolution kernel is a standard trade-off between capturing sufficient spatial context information and maintaining computational efficiency. The first layer of the DCNN network takes the input C, and the first dilated convolution result Q_1 is obtained through the activation function σ; the calculation formula is shown in Formula (9). Q_1 is then input to the second dilation step for iteration, as shown in Formula (10). Repeating the above calculation yields the dilated convolution result Q_3 of the third dilation step. Finally, the DCNN outputs with different dilation rates are concatenated to obtain the final span representation feature, as shown in Formula (11).
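The effect of the dilation steps can be demonstrated in one dimension (an assumption for brevity; the model convolves the 2D span matrix). The averaging kernel and the impulse input are made-up examples; what the sketch shows is how a 3-tap kernel with dilation d covers 2d + 1 input positions.

```python
# Illustrative 1D dilated convolution with zero padding ("same" output size).
# A unit impulse input makes the receptive field of each dilation visible.

def dilated_conv1d(signal, kernel, dilation):
    k = len(kernel)
    pad = dilation * (k // 2)
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(kernel[m] * padded[t + m * dilation] for m in range(k))
        for t in range(len(signal))
    ]

x = [0, 0, 0, 1, 0, 0, 0]              # impulse at the center
avg = [1 / 3] * 3                       # 3-tap averaging kernel
for d in (1, 2, 3):                     # dilation steps used by the model
    y = dilated_conv1d(x, avg, d)
    # one 3-tap layer with dilation d spans 2*d + 1 input positions
    print(d, [round(v, 2) for v in y])
```

With d = 1 the impulse spreads to 3 adjacent outputs; with d = 3 it reaches outputs 3 positions away, which is why larger dilation steps help match longer entity spans.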
3.5. Global Feature Gated Attention Mechanism Layer
After obtaining the entity suffix category classification features of character pairs and the final representation of the character pair span features, a difficulty remains: some entities in specific fields share the same characters but have different entity categories. For example, in "Peking University" and "Beijing", the character "bei" belongs to the "organization" category and the "place" category, respectively, so similar entity spans correspond to different entity suffixes. Simple feature addition may therefore fail to adapt to such context differences. Because each sample is affected by context differences to a different degree, this paper introduces a global feature gated attention mechanism. Through joint training of the model, the entity suffix category classification feature of character pairs and the final representation feature of the character pair span are adaptively weighted, and the weighted combination of the two features further improves the classification accuracy of entity suffix categories.
The calculation process of the global feature gated attention mechanism is as follows. First, the final representation of the character pair span feature is subjected to dimensionality reduction, with the aim of obtaining, for the given character pair x_i and x_j, a classification feature for each suffix category in the predefined entity suffix category set, based on the final representation of the character pair span feature. The dimension reduction formula is as follows:
Among them, W and W′ are trainable parameter matrices, and b and b′ are bias terms; C_ij is the final representation of the span feature of the character pair x_i and x_j.
Let the entity suffix category classification feature of character pairs be matrix A, and let the classification feature obtained from the final representation of the character pair span feature be matrix B. Taking matrix A as an example, the average value over the whole matrix is first calculated to obtain the global average feature; its main purpose is to preserve the overall features of the sample and suppress noise interference. Then, the maximum value is taken along the span-head character direction of the matrix and the average value along the span-tail character direction, in order to obtain the salient features in the span-head direction, which can be regarded as the key information of the entity's span-head characters. Symmetrically, the maximum value is taken along the span-tail character direction and the average value along the span-head character direction, in order to obtain the salient features in the span-tail direction, which can be regarded as the key information of the entity's span-tail characters. The three types of features are concatenated to form the combined global feature, denoted as G_A. The same operations are applied to matrix B to obtain the corresponding combined global feature, denoted as G_B. The two features are concatenated and dimensionality-reduced to obtain the weight of matrix A, denoted as α, and the weight of matrix B, denoted as β, with the constraint α + β = 1. The calculation formula of α is shown in Formula (16), and the feature fusion formula is shown in Formula (17).
In Formula (16), a trainable weight matrix and bias term parameterize the gate, and z in Formula (17) is the predicted feature of the entity suffix category obtained by combining the two kinds of character pair features. These weights adaptively determine which type of feature (semantic relationship or structural feature) the final classification relies on more. For example, when classifying a long entity, the head and tail characters themselves may not be directly related, but the DCNN can capture long-distance entity span features, so the weight β of the structural feature will be relatively high. When addressing the problem of the same character belonging to different entity categories, the gated mechanism can dynamically adjust the weights according to different context structures or semantics.
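The gating computation can be sketched end to end. This is a hedged illustration under stated assumptions: the projection from pooled descriptors to weights is a toy stand-in (the paper uses trained parameters, Formula (16)), and the two-weight softmax enforces α + β = 1 as described above.

```python
import math

# Sketch of global feature gated fusion: pool each score matrix into a
# global descriptor (global average, head-direction max / tail-direction
# average, tail-direction max / head-direction average), map the combined
# descriptors to weights alpha and beta, and fuse the matrices elementwise.

def global_descriptor(M):
    n = len(M)
    flat = [v for row in M for v in row]
    avg_all = sum(flat) / len(flat)                                  # global average
    head_max_avg = sum(max(M[i][j] for i in range(n)) for j in range(n)) / n
    tail_max_avg = sum(max(M[i][j] for j in range(n)) for i in range(n)) / n
    return [avg_all, head_max_avg, tail_max_avg]

def gated_fusion(A, B):
    g = global_descriptor(A) + global_descriptor(B)
    # toy projection: one score per matrix, normalized so alpha + beta = 1
    s_a, s_b = sum(g[:3]), sum(g[3:])
    alpha = math.exp(s_a) / (math.exp(s_a) + math.exp(s_b))
    beta = 1.0 - alpha
    fused = [[alpha * a + beta * b for a, b in zip(ra, rb)]
             for ra, rb in zip(A, B)]
    return alpha, beta, fused

A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
alpha, beta, z = gated_fusion(A, B)
print(round(alpha + beta, 6))
```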
The softmax function is used to calculate the prediction probability of the character pair x_i and x_j for each entity suffix category, denoted as p_ij^k, where k is the entity suffix category.
During the training period, this study used binary cross entropy as the model loss function. The calculation formula of the loss function is as follows:
Among them, n is the number of characters in the sentence, m is the number of entity suffix categories, and y_ij^k represents the gold probability of marking entity suffix category k between characters x_i and x_j, as annotated in Figure 2b. When performing entity label suffix classification, the input sentence is processed through modules such as BERT and the feature reuse stacked BiLSTM to obtain the final classification score matrix z, where z_ij^k represents the score of the character pair (i, j) belonging to entity suffix category k. This study first applies the softmax function along the category dimension to calculate the probability distribution, obtaining the probability matrix P. Then, P is traversed, and every span whose highest-probability category is a non-N category is extracted as an entity output. If overlapping or nested entities occur, the one with the highest predicted probability score is retained for output, and the rest are filtered out.
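The decoding procedure above can be sketched as follows. The category names, logit values, and the restriction to upper-triangle spans are assumptions for the example; the softmax-then-argmax per cell and the keep-highest-probability overlap resolution follow the text.

```python
import math

# Sketch of span decoding: softmax over the category dimension, keep the
# best non-N category per cell, then resolve overlapping/nested spans by
# retaining the highest-probability one.

def softmax(scores):
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decode(score, categories, n_index):
    """score[i][j] holds per-category logits; n_index marks category N."""
    spans = []
    for i, row in enumerate(score):
        for j, logits in enumerate(row):
            if j < i:
                continue                       # only spans with i <= j
            probs = softmax(logits)
            k = max(range(len(probs)), key=probs.__getitem__)
            if k != n_index:
                spans.append((probs[k], i, j, categories[k]))
    spans.sort(reverse=True)                   # highest probability first
    kept = []
    for p, i, j, cat in spans:
        if all(j < i2 or i > j2 for _, i2, j2, _ in kept):
            kept.append((p, i, j, cat))
    return [(i, j, cat) for _, i, j, cat in kept]

cats = ["N", "LOC", "POI"]
n = 4
logits = [[[5.0, 0.0, 0.0] for _ in range(n)] for _ in range(n)]
logits[0][2] = [0.0, 1.0, 4.0]   # span (0,2): strongly POI
logits[0][1] = [0.0, 2.0, 0.0]   # overlapping span (0,1): weakly LOC
print(decode(logits, cats, 0))   # only the higher-probability span survives
```

Note that with this resolution rule, a genuinely nested gold entity would also be filtered out; the example simply mirrors the filtering behavior stated in the text.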