Article

Multi-Head Hierarchical Attention Framework with Multi-Level Learning Optimization Strategy for Legal Text Recognition

1 Big Data Research and Development Center, North China Institute of Computing Technology, Beijing 100083, China
2 School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 China Academy of Electronics and Information Technology, Beijing 100041, China
4 Strategic Planning Research Institute of CETC, Beijing 100041, China
5 China Justice Big Data Institute Co., Ltd., Beijing 100041, China
6 China Satellite Network Group Co., Ltd., Beijing 100020, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(10), 1946; https://doi.org/10.3390/electronics14101946
Submission received: 31 March 2025 / Revised: 6 May 2025 / Accepted: 9 May 2025 / Published: 10 May 2025
(This article belongs to the Special Issue Image Processing Based on Convolution Neural Network: 2nd Edition)

Abstract

Owing to the rapid growth of legal text data and the increasing demand for intelligent processing, multi-label legal text recognition is becoming increasingly important in practical applications such as legal information retrieval and case classification. However, traditional methods have limitations in handling the complex semantics and multi-label characteristics of legal texts, making it difficult to accurately extract features and effective category information. Therefore, this study proposes a novel multi-head hierarchical attention framework suitable for multi-label legal text recognition tasks. This framework comprises a feature extraction module and a hierarchical module: the former extracts multi-level semantic representations of the text, while the latter obtains multi-label category information. In addition, this study proposes a novel hierarchical learning optimization strategy that balances the learning needs of multi-level semantic representation and multi-label category information through data preprocessing, loss calculation, and weight updating, effectively accelerating the convergence of framework training. We conducted comparative experiments on the legal domain dataset CAIL2021 and the general multi-label recognition datasets AAPD and Web of Science (WOS). The results indicate that the method proposed in this study is significantly superior to mainstream methods in both legal and general scenarios, demonstrating excellent performance. The study findings are expected to be widely applied in the field of intelligent processing of legal information, improving the accuracy of intelligent classification of judicial cases and further promoting the digitalization and intelligent transformation of the legal industry.

1. Introduction

Legal text classification falls under the category of text hierarchical multi-label classification tasks. As a subtask within the natural language processing (NLP) field, general text hierarchical multi-label classification aims to assign labels to texts based on a given label hierarchy, where each input text can correspond to multiple different labels structured hierarchically. Multi-label hierarchical text classification plays a significant role in various domains, such as news categorization, legal applications, and document management, owing to its alignment with real-world application requirements [1,2]. Unlike traditional flat classification methods, hierarchical multi-label classification tasks require capturing the associations between texts and categories, as well as taking into account the hierarchical relationships and correlations between categories. However, increasing the number of categories and hierarchical levels introduces challenges such as an imbalanced sample distribution and semantic similarity between hierarchical labels [3,4], further complicating the task.
Legal text classification tasks exhibit distinct characteristics compared with traditional multi-label text classification, including stronger semantic reasoning logic embedded in labels and limited labeled samples. Identifying these labels requires contextual semantic analysis combined with factual content and underlying legal logic, further increasing task complexity. An example of a legal text classification task is shown in Figure 1. For a factual description of a loan relationship case, the deep-level labels (level-3) include “joint liability guarantee, scope of guarantee, other costs for realizing creditor’s rights, guarantee period, attorney fee”. The intermediate-level labels (level-2) are “guarantee, scope of guarantee, other costs, period of duty guarantee, other costs”, while the shallow-level labels (level-1) comprise “guarantee, guarantee, calculation of private lending principal, ‘statute of limitations, period of duty guarantee, exclusion period’, calculation of private lending principal”. Here, the level-3 labels “joint liability guarantee” and “scope of guarantee” both fall under the level-1 category “guarantee”, whereas the level-3 label “guarantee period” belongs to the level-1 category “statute of limitations, period of duty guarantee, exclusion period”. Although “joint liability guarantee”, “scope of guarantee”, and “guarantee period” all appear to relate to “guarantee” on the surface, legally they belong to two distinct categories. Accurate classification therefore requires not only literal interpretation but also precise legal semantic analysis.
Scholars have dedicated efforts to exploring efficient and accurate hierarchical text classification methods to address the aforementioned challenges. Several studies have proposed model design strategies incorporating hierarchical structures to tackle issues such as class imbalance and hierarchical relationship modeling. Zhou et al. [5] employed directed graphs to represent hierarchical labels and used a hierarchy-sensitive structural encoder to model them, effectively integrating hierarchical label information into text and label semantics. The hierarchy-aware semantics matching network (Hi-Match) of Chen et al. [6] performs representation learning on texts and hierarchical labels, using separate text and label encoders to extract semantic features. The model then calculates correlations between text and label embeddings within a joint semantic embedding space to identify multi-label types, defining distinct optimization objectives based on the two representation vectors to enhance hierarchical multi-label text classification performance. However, these methods suffer from insufficient semantic representation of label hierarchies and an inability to resolve imbalanced sample distributions. An increasing number of researchers have recently adopted contrastive learning approaches to optimize hierarchical label semantic representation and address sample distribution imbalance. Zhang et al. [7] introduced a hierarchy-aware and label-balanced model (HALB), which utilizes multi-label negative supervision to push the text representations of samples with different labels further apart. In addition, to mitigate label imbalance in hierarchical text classification, an asymmetric loss is applied to compute the classification loss, enabling the model to focus on learning from difficult samples and to balance the contribution of positive and negative labels to the loss function.
Furthermore, scholars have improved classification performance by optimizing multi-label semantic representations [4], augmenting negative samples [8], or incorporating external knowledge [3,9] to further enhance the accuracy of multi-label recognition. Chen et al. [3] proposed a few-shot hierarchical multi-label classification framework based on in-context learning (ICL) with large language models (LLMs), leveraging contrastive learning to accurately retrieve text keywords from a retrieval database and improve hierarchical label recognition accuracy. However, these approaches primarily address either imbalanced sample distribution or hierarchical semantic representation individually, failing to resolve both challenges simultaneously. Zhang et al. [4] combined multi-label contrastive learning with K-nearest neighbors (MLCL-KNN), enabling the text representations of sample pairs with more shared labels to be closer while separating pairs without common labels. Zhou et al. [8] designed a hierarchical sequence ranking (HiSR) method to generate diverse negative samples that maximize contrastive learning effectiveness, enhancing the ability of the model to distinguish fine-grained labels by emphasizing the differences between true labels and generated negatives. Feng et al. [9] categorized external knowledge into micro-knowledge (basic concepts associated with individual class labels) and macro-knowledge (correlations between class labels), using both to improve the discriminative power of text and semantic label representations.
This study addresses these limitations and capitalizes on the advantages of prototype networks in handling imbalanced sample distributions by proposing a multi-label recognition method for legal texts based on hierarchical prototypical networks. In particular, we employ the Sentence-BERT model [10] to obtain a unified long-text embedding vector representation. A hierarchical prototype network architecture is designed for multi-level label recognition, in which a hierarchical prototype structure is constructed according to the data label levels and relationships. In addition, a hierarchical prototype network loss function is proposed. By integrating inter-layer correlation information between labels and prototypes at different levels, the method achieves unified optimization of cross-level prototype parameters within the prototype network, thereby enhancing the accuracy of multi-level label recognition under conditions of uneven sample distribution.
The main contributions of this article are as follows:
(1)
We propose a new multi-head hierarchical attention framework suitable for multi-label legal text recognition tasks, which mainly comprises a feature extraction module and a hierarchical module. The feature extraction module is mainly used to extract multi-level semantic representations of the text, while the hierarchical module is used to obtain multi-label category information.
(2)
We propose a novel hierarchical learning optimization strategy that considers multi-level semantic representation and multi-label classification information learning requirements through data preprocessing, loss calculation, and weight updating, effectively improving the convergence speed of framework training.
(3)
We conduct comparative experiments on the legal domain dataset CAIL2021 and the general multi-label recognition datasets AAPD and Web of Science (WOS). The experimental results show that the proposed method is significantly superior to mainstream methods in legal and general scenarios.

2. Related Work

2.1. Multi-Label Representation

The prototype neural network [11] utilizes neural networks to project inputs into a latent embedding space, where multiple reference points (class prototypes) are defined. The model improves classification accuracy by optimizing the mapping function and prototype representations. During inference, the Euclidean distance between the input embedding and each class prototype is computed to assign labels.
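As a minimal sketch of this decision rule (tensor shapes and names are illustrative, not the authors' implementation), the nearest-prototype assignment can be written as follows:

```python
import torch

def prototype_classify(embeddings, prototypes):
    """Nearest-prototype decision rule of a prototypical network [11].

    embeddings: (B, D) encoder outputs; prototypes: (C, D), one prototype per class.
    Returns the index of the closest class prototype for each input.
    """
    # Squared Euclidean distance between every embedding and every prototype: (B, C)
    dists = torch.cdist(embeddings, prototypes, p=2) ** 2
    return dists.argmin(dim=1)
```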
Prototype neural networks have demonstrated robust performance in few-shot classification tasks, particularly in image classification [11] and open-domain problems [12]. Their applications in NLP include entity recognition [13], text classification [14], and relation extraction [15,16]. In multi-intent recognition, Luo et al. [17] introduced an intent fusion feature extraction mechanism and an intent separation mechanism to eliminate irrelevant noise, thereby improving multi-label classification. Xian et al. [18] employed a single-layer recurrent neural network to generate text vector representations, further refining classification performance via a mean-value prototype neural network.
While current hierarchical multi-label recognition methods effectively capture label dependencies, they remain less effective in handling long-tail distributions. Conversely, prototype learning methods exhibit strong robustness in few-shot multi-label classification but lack adequate modeling of hierarchical labels. We attempt to bridge this gap by proposing a hierarchical prototype neural network that integrates hierarchical multi-label learning and prototype-based classification, enhancing accuracy in hierarchical multi-label recognition.

2.2. Multi-Label Text Recognition

Multi-label recognition of legal texts represents a specialized subdomain within multi-label classification, necessitating the precise identification of domain-specific terminologies and hierarchical structures from lengthy legal documents. Compared with general multi-label recognition tasks, legal text classification is distinguished by its strict sentence structures, systematic semantic tagging, specialized domain-specific terminology requiring expert interpretation, and long-tailed data distributions. These characteristics render legal text classification more challenging and complex than general tasks. Current research in this domain can be categorized into three primary approaches based on hierarchical traversal methods: flat methods, local methods, and global methods.
The flat method simplifies the hierarchical multi-label classification problem into a standard multi-label classification task. This approach assumes mutual independence between labels at different hierarchical levels, flattening them into a single-layer label prediction or focusing solely on terminal-level label prediction. For instance, Peng et al. [19] integrated TextCNN, RNN, and attention-based capsule networks to optimize classification networks for multi-label tasks. Liu et al. [20] proposed XML-CNN, which incorporates bottleneck layers and dynamic max-pooling to enhance hierarchical label recognition. However, this assumption disregards the inherent hierarchical structure of legal labels. In addition, the resulting label predictions fail to capture inter-label hierarchical dependencies, yielding suboptimal practical classification accuracy.
The local method utilizes independent classifiers for different hierarchical levels, progressively predicting labels from lower to higher levels. For instance, Cai et al. [21] developed a hierarchical support vector machine (HSVM) that constructs separate SVM classifiers for each level and integrates discriminant functions to maintain hierarchical consistency. Cerri et al. [22] utilized multiple neural networks independently trained through transfer learning to enhance hierarchical classification. Wehrmann et al. [23] introduced a hierarchical multi-label classification network (HMCN) that jointly models local classification dependencies and global hierarchical information, optimizing classification at local and global levels. However, this method propagates classification errors from lower levels upward, increasing uncertainty in higher-level predictions. In addition, it amplifies erroneous predictions at intermediate levels, so that prediction accuracy declines progressively as the hierarchy ascends.
Global methods typically involve a single classifier that fully integrates hierarchical category information into the classification process, designing optimization strategies to capture hierarchical label relationships and enabling direct prediction of hierarchical labels [24]. Zhou et al. [5] proposed the hierarchy-aware global model (HiAGM) for hierarchical label prediction, which combines a bidirectional tree long short-term memory network (Bi-TreeLSTM) with a graph convolutional network (GCN) to model label hierarchical relationships. Chen et al. [6] developed a Hi-Match network that models text and hierarchical multi-label semantics. They transformed label recognition into a semantic matching problem, incorporating hierarchical information by calculating the semantic similarity between texts and hierarchical labels. As such, they achieved hierarchical label identification. Deng et al. [25] addressed the long-tail distribution challenge of last-level labels by proposing the HTCinfoMax model, which introduces text-label mutual information maximization and prior label matching to filter irrelevant information. Zhang et al. [4] developed MLCL-KNN to further optimize semantic representations across different hierarchy levels, designing a label contrastive learning method that pulls text representations of sample pairs with more shared labels closer while pushing pairs without common labels apart. During inference, KNN retrieves nearest neighbor samples to enhance multi-label recognition accuracy. Furthermore, Zhang et al. [7] and Zhou et al. [8] improved semantic label discriminability using negative sample augmentation and sibling label contrastive learning, respectively, to boost hierarchical label recognition performance. Nooten et al. [26] explored the effects of label-aware loss and contrastive loss in Euclidean and hyperbolic spaces on hierarchical label semantic representations.

3. Methods

3.1. Problem Description

Legal text multi-label recognition falls under hierarchical multi-label classification, where the objective is to identify hierarchically structured legal labels from a given factual description. The set of all multi-labels forms a hierarchical structure, defined as $C = \{C^1, C^2, \ldots, C^H\}$, where $H$ represents the depth of the hierarchical label structure. The classification set of the $i$-th layer is denoted as $C^i = \{c_1, c_2, \ldots, c_{|C^i|}\}$, where $|C^i|$ is the total number of labels at that level. The hierarchical structure $T$ resembles a forest in data structures: the depth of the label hierarchy corresponds to the depth of the trees in the forest, each parent node may correspond to multiple child nodes, and a child node belongs to only one parent node.
The formal definition of the legal text multi-label recognition problem is as follows. Let $D$ denote the dataset containing $N$ data samples:

$$D = \{(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)\}$$

where $X_i$ represents the input legal text comprising $L$ words:

$$X_i = \{w_1, w_2, \ldots, w_L\}$$

and $Y_i$ denotes the corresponding hierarchical multi-label set:

$$Y_i = \{y^1, y^2, \ldots, y^H\}, \quad y^i \subseteq C^i \ \text{and}\ Y_i \in T$$

Let the multi-label recognition model be $\Omega$. The legal text multi-label recognition task is then to learn, from the sample set $D$ and the hierarchical structure $T$, a classification model $\Omega$ that predicts the multi-label set $Y_i$ for the input text:

$$\Omega(X_i, \theta) \rightarrow Y_i, \quad Y_i \in T$$

where $\theta$ denotes the parameters of model $\Omega$.

3.2. Multi-Head Hierarchical Attention Framework

The overall framework of the legal text multi-label recognition model based on the hierarchical prototype neural network proposed in this study is shown in Figure 2. This framework comprises two parts, the feature extraction module on the left and the hierarchical module on the right. The feature extraction module mainly comprises a vector encoder and a multi-head attention layer. The text sentence vector encoder is responsible for encoding the input text to generate an initial sentence vector, which serves as the semantic representation of the text while preserving contextual dependencies and key semantic features. Subsequently, multiple multi-head attention layers are used to hierarchically represent sentence vectors, enabling the extraction of semantic representations at each level. The hierarchical module calculates the semantic distance between different levels and their corresponding prototype representations for multi-level label recognition. Figure 2 shows a scenario with three label levels, corresponding to the legal dataset (CAIL2021). The subsequent description of the method is based on this three-layer label structure. The model can be extended to accommodate different depths of hierarchical labels (e.g., 2 layers, 4 layers). Depending on the label hierarchy of the target task, the depth of the feature extraction layer and prototype network in the model can be adjusted accordingly. Model parameters are optimized by calculating the loss function.

3.3. Feature Extraction Module

This study effectively extracts multi-level semantic representations from the input legal text by utilizing the Sentence-BERT model [10] to obtain the initial sentence vector $E$ from the input text $\{w_1, w_2, \ldots, w_L\}$. Sentence-BERT is a supervised sentence embedding model that extends the BERT architecture by incorporating a mean pooling layer, enabling the extraction of fixed-length sentence embeddings. By leveraging a Siamese network architecture, it compares semantic embeddings of input text with manually annotated reference samples, optimizing model parameters using a contrastive learning strategy. This process enhances the semantic representation capacity of the sentence vectors.
Multi-head attention mechanisms are utilized to extract hierarchical semantic information at different levels to align with the multi-level label structure [27]. Each multi-head attention layer captures semantic features at increasing depths, thereby enabling progressive abstraction of textual representations. Given a three-level hierarchical label structure, the multi-level sentence vector representation is computed as follows:
$$H_1 = \mathrm{Multihead}(E), \quad H_2 = \mathrm{Multihead}(H_1), \quad H_3 = \mathrm{Multihead}(H_2)$$

where $H_1$, $H_2$, and $H_3$ denote sentence vector representations at three different hierarchical depths, and $\mathrm{Multihead}(\cdot)$ represents the transformation function applied at each level, implemented using multi-head attention mechanisms.
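A minimal PyTorch sketch of this stacked extraction is given below. It assumes the input is a batch of Sentence-BERT sentence embeddings treated as a sequence; the embedding dimension, head count, and checkpoint name are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class HierarchicalFeatureExtractor(nn.Module):
    """Stacked multi-head self-attention producing H1, H2, H3 from E."""

    def __init__(self, dim=768, heads=8, depth=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)]
        )

    def forward(self, E):
        # E: (batch, seq, dim) sequence of sentence embeddings
        reps, h = [], E
        for attn in self.layers:
            h, _ = attn(h, h, h)  # self-attention: query = key = value
            reps.append(h)        # one representation per label level
        return reps               # [H1, H2, H3]

# Example encoding with the sentence-transformers library (checkpoint is an assumption):
# from sentence_transformers import SentenceTransformer
# encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
# E = torch.tensor(encoder.encode(sentences)).unsqueeze(0)          # (1, S, 384)
# H1, H2, H3 = HierarchicalFeatureExtractor(dim=E.size(-1))(E)
```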

3.4. Hierarchical Module

Conventional prototype neural networks typically employ a single-layer structure, which fails to capture hierarchical relationships between label classes, leading to suboptimal recognition accuracy. This study overcomes this limitation by proposing a hierarchical prototype neural network model that incorporates a transition matrix to define prototype transitions between adjacent hierarchical levels. The model simultaneously optimizes prototype parameters across all hierarchical levels, ensuring global optimization of multi-label classification.
The hierarchical prototype representations are formally defined as follows:
$$P = \{P_1, P_2, P_3\} = \left\{ \{m_{1k}^{l_1}, m_{2k}^{l_1}, \ldots, m_{xk}^{l_1}\},\ \{m_{1k}^{l_2}, m_{2k}^{l_2}, \ldots, m_{yk}^{l_2}\},\ \{m_{1k}^{l_3}, m_{2k}^{l_3}, \ldots, m_{zk}^{l_3}\} \right\}$$

where $l_1$, $l_2$, and $l_3$ correspond to the prototype parameters at three different hierarchical depths; $x$, $y$, and $z$ represent the number of prototype labels at each respective level; and $k$ denotes the number of prototypes under each prototype label.
Let $A$ denote the connection matrix between the prototype parameters of the first and second levels, with dimensionality $y \times x$. $A_{ij}$ represents the connection between the $i$-th prototype parameter in the first level and the $j$-th prototype parameter in the second level: if the $j$-th prototype parameter in the second level corresponds to the $i$-th prototype parameter in the first level, the value is 1; otherwise, it is 0. Similarly, $B$ denotes the connection matrix between the prototype parameters of the second and third levels, with dimensionality $z \times y$. The prototypes at different levels are computed as follows:

$$P_2 = A \cdot f(P_1)$$

$$P_3 = B \cdot f(P_2)$$

where $f(\cdot)$ represents the transformation operation between prototypes at different levels. The calculation of prototype parameters is implemented using an attention mechanism layer, and a mean pooling layer is applied to process the parameter calculation results. This stabilizes the distribution of prototype parameters and enables the model to converge quickly.
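The sketch below illustrates one such transition under stated assumptions: for clarity it uses one prototype per label, an attention-based transformation $f$, a head count of 4, and BatchNorm for the smoothing step, all of which are illustrative choices; `conn` stands for the 0/1 connection matrix $A$ (or $B$).

```python
import torch
import torch.nn as nn

class PrototypeTransition(nn.Module):
    """Derive child-level prototypes from parent-level ones: P_child = conn . f(P_parent)."""

    def __init__(self, dim, conn):
        super().__init__()
        self.register_buffer("conn", conn.float())  # (n_child, n_parent), 0/1 entries
        self.f = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm = nn.BatchNorm1d(dim)  # stabilizes the prototype parameter distribution

    def forward(self, parent):             # parent: (n_parent, dim)
        p = parent.unsqueeze(0)            # (1, n_parent, dim) for attention
        t, _ = self.f(p, p, p)             # transformation f(.) between levels
        child = self.conn @ t.squeeze(0)   # select/aggregate along the hierarchy
        return self.norm(child)            # (n_child, dim)

# Usage: P2 = PrototypeTransition(dim, A)(P1); P3 = PrototypeTransition(dim, B)(P2)
```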

3.5. Hierarchical Label Classification

Traditional prototype neural networks are primarily designed for single-label classification tasks, where they compute the distance between the feature representation and multiple prototypes and assign the class of the nearest prototype to the input data. The classification process is defined as follows:
$$\mathrm{class}(x) = \arg\max_{i=1,\ldots,C} g_i(x)$$

where $g_i$ is the discriminant function corresponding to the $i$-th class:

$$g_i(x) = -\min_{j=1,\ldots,K} \left\| \Omega(X_i; \theta) - m_{ij} \right\|_2^2$$

In addition, $g_i(x)$ can also be interpreted as the matching value of sample $x$ to the $i$-th class.
However, multi-label recognition tasks require assigning zero or more labels to a single sample, making the minimum-distance approach unsuitable as it is inherently limited to single-label classification. Yang et al. [12] sought to address this by proposing a distance-based prototype neural network for hierarchical multi-label classification. Their method computes distances between hierarchical multi-labels and prototypes, introducing a threshold-based decision mechanism; no label is assigned if the minimum prototype distance of a sample exceeds a pre-defined threshold. However, the sample is assigned all corresponding labels if the distance to at least one prototype falls below the threshold. For a sample x, which does not correspond to any label,
$$\max_{i=1,\ldots,C} g_i(x) < \mathrm{threshold}$$
For a sample x corresponding to one or more labels, the label set is defined as follows:
$$\left\{\, i \;\middle|\; \min_{j=1,\ldots,K} \left\| \Omega(X_i; \theta) - m_{ij} \right\|_2^2 < \mathrm{threshold},\; i \in \{1, 2, \ldots, C\} \,\right\}$$
Parent–child constraints are applied to ensure hierarchical consistency, enforcing structural dependencies by considering only prototype distances within the same hierarchical parent–child relationships.
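A compact sketch of this decision procedure follows; all names are illustrative, and the threshold value and the exact form of the constraint are assumptions, since the text does not specify them.

```python
def assign_labels(scores, threshold, parent_of=None, parent_labels=None):
    """Threshold-based multi-label assignment with a parent-child constraint.

    scores: matching values g_i(x) for the C classes at one level.
    If no score reaches the threshold, the sample receives no label.
    parent_of maps a class index to its parent index at the level above.
    """
    labels = {i for i, s in enumerate(scores) if s >= threshold}
    if parent_of is not None and parent_labels is not None:
        # Hierarchical consistency: keep a child label only if its parent was predicted.
        labels = {i for i in labels if parent_of[i] in parent_labels}
    return sorted(labels)

# Usage: level-2 labels are filtered against the labels predicted at level 1,
# and level-3 labels against those predicted at level 2.
```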

3.6. Loss Function

Traditional single-label classification employs loss functions such as DCE [12] and OVA [28], which optimize the distance between text embeddings and prototype representations. However, these methods are incompatible with multi-label classification, as they do not account for multi-label assignments. This study introduces a hierarchical cross-entropy loss function that optimizes text embedding representations and prototype parameters to address this limitation, ensuring compliance with the multi-label constraints in Equation (12).
For an input $x_i$, the multi-level text vector representation is computed, followed by confidence estimation for each prototype class, as follows:
$$\hat{y}^{l_1} = \max_{k}\; \sigma\!\left( \lambda\, d\!\left( \Omega^{l_1}(x_i; \theta),\; m_{ik}^{l_1} \right) \right)$$
where $\sigma$ represents the sigmoid function and $d$ is the distance function (cosine similarity is used). During the calculation, the maximum sigmoid-scaled similarity between the text vector representation and the prototypes of the same class is taken as the confidence value for that class. The loss function for the model output at the current level and its corresponding prototype is defined as follows:
$$\mathrm{loss}^{l_1} = -\, y^{l_1} \log \hat{y}^{l_1} - \left(1 - y^{l_1}\right) \log\left(1 - \hat{y}^{l_1}\right)$$

The total loss of the model is obtained by summing the losses of the three levels:

$$\mathrm{Loss} = \mathrm{loss}^{l_1} + \mathrm{loss}^{l_2} + \mathrm{loss}^{l_3}$$
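Under stated assumptions (the value of $\lambda$, cosine similarity as $d$, and pooled level-specific text vectors as input), the per-level loss could be sketched as follows:

```python
import torch
import torch.nn.functional as F

def level_loss(text_repr, protos, targets, lam=10.0):
    """Per-level hierarchical cross-entropy.

    text_repr: (B, D) level-specific text vectors; protos: (C, K, D) prototypes;
    targets: (B, C) multi-hot labels. lam is the scaling factor (value assumed).
    """
    sims = torch.einsum(
        "bd,ckd->bck",
        F.normalize(text_repr, dim=-1),  # cosine similarity via normalized dot products
        F.normalize(protos, dim=-1),
    )
    conf = torch.sigmoid(lam * sims).amax(dim=-1)         # (B, C): max over K prototypes
    return F.binary_cross_entropy(conf, targets.float())  # BCE against multi-hot targets

# Total loss sums the per-level terms:
# loss = level_loss(H1, P1, y1) + level_loss(H2, P2, y2) + level_loss(H3, P3, y3)
```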

3.7. Implementation Details

All experiments were conducted on a high-performance computing server running CentOS 7.6. The system specifications include two RTX-TITAN 24 GB GPUs and eight 32 GB memory modules. The experimental environment was set up under the PyTorch 1.7.1 framework, utilizing the transformers model library [29], with Python version 3.7.6.
For the CAIL2021 dataset, the model employs the three-layer hierarchical structure described above. For the AAPD and WOS datasets, which have label depths of two layers, we modify the feature extraction and hierarchical modules to align with this label hierarchy. In particular, the multi-layer sentence representation is restricted to $H_1$ and $H_2$, the hierarchical prototype representation is defined as $P = \{P_1, P_2\}$, and the loss function reduces to $\mathrm{Loss} = \mathrm{loss}^{l_1} + \mathrm{loss}^{l_2}$.
During the training process, the data in the training samples were first preprocessed. Legal text data typically exhibit strong structural characteristics and considerable length. In our training procedure, we processed the training samples of the CAIL2021 dataset as follows: (1) Based on the characteristics of legal data and label types, we segmented the lengthy document data, retaining only key sections such as the trial process, the appellant’s claim, the respondent’s defense, and the court’s findings. In addition, excessively long sentences within the training samples were truncated to improve processing efficiency. (2) Sentences and labels in the factual description section were extracted separately and treated as independent data entries. Finally, we constructed a corresponding label set for each sample according to the number of model labels, thereby obtaining all the required training sample data. A schematic of this pipeline is sketched below.
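In the sketch, the document structure, section names, and truncation length are assumptions for illustration, not the authors' exact pipeline:

```python
KEY_SECTIONS = {"trial process", "appellant's claim",
                "respondent's defense", "court's findings"}

def preprocess(document, max_len=510):
    """Keep key sections, truncate over-long sentences, and emit
    (sentence, hierarchical label set) pairs as independent samples."""
    samples = []
    for section in document["sections"]:          # assumed document layout
        if section["title"] not in KEY_SECTIONS:
            continue
        for sent, labels in zip(section["sentences"], section["labels"]):
            samples.append((sent[:max_len], labels))  # truncation step
    return samples
```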
This study employs the AdamW optimizer with a weight decay of 0.001, a mini-batch size of 8, and an initial learning rate of $1 \times 10^{-8}$. In addition, a warmup strategy was adopted to adjust the learning rate.
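In code, this setup corresponds roughly to the following sketch; the warmup length and total step count are not reported in the paper and are placeholders:

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)  # placeholder for the framework of Section 3
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-8, weight_decay=0.001)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1000,       # assumption: warmup length is not reported
    num_training_steps=100_000,  # assumption: total steps are not reported
)
```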

4. Experimental Results

4.1. Datasets and Evaluation Criteria

We utilized the CAIL2021 case label prediction dataset [30] as the primary experimental resource to validate the effectiveness of the proposed hierarchical prototype neural network. The labels in the CAIL2021 case label prediction task are curated by legal experts and represent factual labels for private lending cases. These labels are applied to publicly available judgment documents to construct a case label prediction dataset. The CAIL2021 dataset comprises 2496 legal text samples, with labels organized into three hierarchical levels covering evidence, private lending relationships, contract parties, and contract performance. The numbers of label types at the three levels are 11, 75, and 234, respectively. The dataset is split into training, validation, and test sets in an 8:1:1 ratio, with an average of 7.6 labels per sentence.
Considering the limited sample size in the CAIL2021 dataset and the current unavailability of other legal multi-label recognition datasets, we also selected the WOS [31] and AAPD [32] datasets to supplement our validation of the effectiveness of the proposed model for general-level multi-label classification tasks. Despite the relatively shallow depth of the label hierarchy in these two datasets (comprising only two levels), the abundance of labels and samples within them provides an effective supplementary means to assess the capabilities of the model in multi-label classification tasks, particularly under conditions characterized by a high number of labels and uneven sample distribution. The detailed distribution of the three benchmark datasets is presented in Table 1, while the distribution of sample counts across different labels is illustrated in Figure 3. The sample distributions in these datasets vary significantly, enabling an effective evaluation of the performance of the model under conditions of uneven sample distribution.
The study assessed model performance using micro-F1 and macro-F1, which are widely used in multi-label classification tasks [33]. Micro-F1 is a micro-averaging algorithm that focuses more on the overall classification performance, while macro-F1 is a macro-averaging method that emphasizes the classification performance of each individual class.
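Both metrics can be computed directly from multi-hot prediction matrices; a minimal illustration using scikit-learn (the toy arrays are assumptions) is as follows:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multi-hot matrices: rows = samples, columns = labels.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0]])

micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools all label decisions
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
```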

4.2. Experimental Results

We evaluated the performance of the proposed hierarchical prototype neural network by comparing it with general deep learning models, including TextRCNN [5], BERT [6], and SGM [32], as well as the top-performing model from Phase 1 of the CAIL 2021 competition. Since micro-F1 was the primary evaluation metric used in the competition, we adopt it for this comparison; the experimental results for the CAIL2021 dataset are summarized in Table 2.
To demonstrate the effectiveness of the proposed method, we conducted a rigorous evaluation of the statistical significance of the performance improvement. Specifically, the CAIL2021 dataset was randomly partitioned into training, validation, and test sets in an 8:1:1 ratio. Subsequently, five independent random experiments were carried out, and the results are summarized in Table 3. A paired-samples t-test was employed for the significance analysis. Using these experimental data, we evaluated the significance of the differences between the micro-F1 means of the proposed method and those of TextRCNN, BERT, and SGM. The detailed significance analysis is presented in Table 4. As shown in Table 4, the performance improvement of the proposed method is statistically significant (p < 0.05) compared to the baseline models. These evaluation results indicate that our method exhibits substantial advantages over the alternative approaches.
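As an illustration, the BERT-versus-ours comparison can be reproduced from the five runs in Table 3 with a standard paired t-test:

```python
from scipy import stats

ours = [61.28, 61.27, 61.13, 61.07, 60.53]  # micro-F1 over five runs (Table 3)
bert = [53.74, 52.88, 53.66, 53.23, 52.59]

t_stat, p_two_sided = stats.ttest_rel(bert, ours)
p_one_tailed = p_two_sided / 2  # one-tailed P(T <= t), as reported in Table 4
print(t_stat, p_one_tailed)
```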
The experimental results demonstrate that the proposed method significantly enhances multi-label recognition performance. Compared with the top-performing model in the CAIL 2021 competition, our approach achieves a 6.18 percentage-point improvement in the micro-F1 score, substantially outperforming commonly used deep learning models for this task. This suggests that our method effectively captures core semantic information within hierarchical levels while better integrating cross-layer semantic relationships, ultimately yielding superior results. A rigorous comparative evaluation was also performed between the conventional binary cross-entropy (BCE) loss function and our optimization strategy under identical experimental configurations. Figure 4 illustrates the loss trajectories throughout the training process, revealing a distinct convergence pattern: the proposed methodology achieves stable convergence at approximately 40,000 iterations, a two-fold acceleration over the BCE baseline, which requires 80,000 iterations.
The generalizability of the proposed method is further validated by conducting experiments on WOS and AAPD, two widely used multi-label text classification datasets. Comparisons were made against Retrieval [3], TextRCNN [5], HiMatch [6], HiAGM [5], HGCLR [34], MLCL-KNN [4], HiSR [8], HALB [7], DPT [35], and MLCL [26], with the results presented in Table 5.
The experimental results show that the proposed method performs on par with or better than state-of-the-art approaches across multiple datasets. The model achieved the highest micro-F1 score (88.24%) and a competitive macro-F1 score (80.24%) on the WOS dataset, as well as competitive micro-F1 (81.21%) and macro-F1 (57.65%) scores on the AAPD dataset. These results demonstrate its efficacy in general multi-label text classification tasks. The success of this method can be attributed primarily to its ability to effectively capture label semantics at multiple levels and to address the challenge of imbalanced sample distribution. This further confirms that semantic analysis in the latent embedding space and the effective integration of cross-layer label semantic relationships constitute a viable approach for multi-label text classification tasks, including legal-specific domains. The proposed model exhibits strong domain-specific effectiveness together with notable task generalization capability.
The model achieved higher accuracy on the WOS and AAPD datasets than on the legal text multi-label recognition task. Our analysis attributes these differences mainly to (1) the different languages of the datasets, (2) the differences in the number of labels, and (3) the differences in sample size. The observed discrepancies further validate the hypothesis of this study that legal multi-label classification represents a complex scenario characterized by abundant label types, demanding semantic correlations, and strong hierarchical dependencies between labels. Traditional multi-label recognition approaches often struggle to deliver satisfactory performance on such intricate legal-text tasks, particularly owing to their limitations in addressing the sophisticated hierarchical interdependencies and semantic associations inherent in legal label systems.

4.3. Ablation Experiment

An ablation study was conducted on the multi-label legal text recognition dataset (CAIL2021) to examine the contribution of different model components. The independent effects of key components were evaluated by removing the multi-layer loss function (Layer), which converts multi-level labels into a single-layer structure, and the batch normalization (BN) layer [36], which ensures stable distribution of prototype parameters. The results are presented in Table 6.
As outlined in Table 6, the proposed method achieves the best performance. When the “Layer” component is removed, the recall rate and F1-score of the model decrease, indicating that this mechanism effectively captures the hierarchical structural relationships between labels. The layer-wise approach leverages inter-label dependencies to partially rectify erroneous predictions, thereby significantly enhancing recall capability and improving the adaptability of the model to the complexity of multi-label tasks. Similarly, ablation of the BN layer results in notable performance degradation, suggesting that BN enhances model stability and optimizes the balance between recall and the F1-score by normalizing parameter distributions and mitigating training fluctuations. The complete model achieved the optimal F1-score, which collectively verifies the rationality of our architectural design and the effectiveness of the proposed technical solutions.

5. Conclusions

This study focuses on the task of multi-label legal text recognition, constructing a multi-head hierarchical attention framework together with a new hierarchical learning optimization strategy. In terms of method construction, the framework achieves precise extraction of multi-level semantic representations of text and effective acquisition of multi-label category information through the collaborative operation of the feature extraction and hierarchical modules. Concurrently, the hierarchical learning optimization strategy overcomes the limitations of traditional methods in balancing the learning of multi-level semantic and multi-label category information, accelerates the convergence of framework training, and lays a solid foundation for efficient and accurate multi-label legal text recognition. In the experimental verification phase, the proposed method showed significant advantages over mainstream methods on the CAIL2021 legal-domain dataset and the general multi-label recognition datasets AAPD and WOS. The model completes tasks more accurately and efficiently, whether recognizing multiple labels in complex legal text or handling diverse text types in general scenarios, highlighting its strong generalization ability and adaptability. However, the model can be further optimized in terms of cross-level label modeling, for example by employing a learnable soft-linked hierarchical label association approach to enhance its robustness and generalizability to real-world legal data. Such an approach would improve the adaptability of the model to challenges such as category interference, label variations, and semantic ambiguity between different labels.
The study findings are expected to be widely applied to the intelligent processing of legal information, such as assisting legal practitioners in quickly searching and classifying large volumes of legal literature, improving the accuracy of intelligent classification of judicial cases, and further promoting the digitalization and intelligent transformation of the legal industry. Concurrently, this method provides new ideas for related NLP tasks. Future research can explore and extend its potential value to text processing in other professional fields.

Author Contributions

Conceptualization, K.Z.; methodology, K.Z., Y.T., J.L., Z.A. and Z.L.; software, K.Z. and L.W.; validation, L.W. and Z.A.; formal analysis, L.W. and K.Z.; investigation, K.Z., Z.L. and Z.A.; resources, L.W., J.L. and Z.A.; data curation, K.Z. and L.W.; writing—original draft preparation, K.Z., Y.T. and Z.L.; writing—review and editing, K.Z., Y.T., J.L., X.L. and Z.A.; visualization, J.L., L.W., X.L. and Z.A.; supervision, L.W.; funding acquisition, L.W. and K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

The work in this paper was supported by the National Natural Science Foundation of China under Grant No. U23B2056, and the National Key Research and Development Program of China under Grant No. 2022YFC3340900.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All the datasets used in this research are publicly accessible.

Conflicts of Interest

Author Zhonglin Liu was employed by the company China Justice Big Data Institute Co., Ltd. Author Xuelin Liu was employed by the company China Satellite Network Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this study:
CAIL: Challenge of AI in Law
HSVM: Hierarchical support vector machine
HMCN: Hierarchical multi-label classification network
Hi-Match: Hierarchy-aware semantics matching network
Bi-TreeLSTM: Bidirectional tree long short-term memory
GCN: Graph convolutional network
HiAGM: Hierarchy-aware global model

References

  1. Zangari, A.; Marcuzzo, M.; Rizzo, M.; Giudice, L.; Albarelli, A.; Gasparetto, A. Hierarchical text classification and its foundations: A review of current research. Electronics 2024, 13, 1199. [Google Scholar] [CrossRef]
  2. Caled, D.; Won, M.; Martins, B.; Silva, M.J. A hierarchical label network for multi-label eurovoc classification of legislative contents. In Proceedings of the Digital Libraries for Open Knowledge: 23rd International Conference on Theory and Practice of Digital Libraries, TPDL 2019, Oslo, Norway, 9–12 September 2019; Springer International Publishing: Berlin/Heidelberg, Germany, 2019; pp. 238–252. [Google Scholar]
  3. Chen, H.; Zhao, Y.; Chen, Z.; Wang, M.; Li, L.; Zhang, M.; Zhang, M. Retrieval-style in-context learning for few-shot hierarchical text classification. Trans. Assoc. Comput. Linguist. 2024, 12, 1214–1231. [Google Scholar] [CrossRef]
  4. Zhang, J.; Li, Y.; Shen, F.; He, Y.; Tan, H.; He, Y. Hierarchical text classification with multi-label contrastive learning and KNN. Neurocomputing 2024, 577, 127323. [Google Scholar] [CrossRef]
  5. Zhou, J.; Ma, C.P.; Long, D.K.; Xu, G.W.; Ding, N.; Zhang, H.Y.; Xie, P.J.; Liu, G.S. Hierarchy-aware global model for hierarchical text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020; pp. 1106–1117. [Google Scholar]
  6. Chen, H.B.; Ma, Q.L.; Lin, Z.X.; Yan, J.Y. Hierarchy-aware label semantics matching network for hierarchical text classification. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL), Online, 1–6 August 2021; pp. 4370–4379. [Google Scholar]
  7. Zhang, J.; Li, Y.; Shen, F.; Xia, C.; Tan, H.; He, Y. Hierarchy-aware and label balanced model for hierarchical text classification. Knowl.-Based Syst. 2024, 300, 112153. [Google Scholar] [CrossRef]
  8. Zhou, J.; Zhang, L.; He, Y.; Fan, R.; Zhang, L.; Wan, J. A Novel Negative Sample Generation Method for Contrastive Learning in Hierarchical Text Classification. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 5645–5655. [Google Scholar]
  9. Feng, Z.; Mao, K.; Zhou, H. Adaptive micro-and macro-knowledge incorporation for hierarchical text classification. Expert Syst. Appl. 2024, 248, 123374. [Google Scholar] [CrossRef]
  10. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3982–3992. [Google Scholar]
  11. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 3–9 December 2017; pp. 4080–4090. [Google Scholar]
  12. Yang, H.M.; Zhang, X.Y.; Yin, F.; Yang, Q.; Liu, C.L. Convolutional prototype network for open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2358–2370. [Google Scholar] [CrossRef]
  13. Ren, Q. Fine-Grained Entity Typing with Prototypical Networks. J. Chin. Inf. Process. 2020, 34, 65–72. [Google Scholar]
  14. Yang, Y.Y.; Xie, M.X.; Cao, J.X.; Wang, X.B.; Liu, T.W.; Du, Y.H. Adversarial Sample Generation for Chinese Classification Model. Comput. Eng. 2023, 49, 54–62. [Google Scholar]
  15. Gao, T.Y.; Han, X.; Liu, Z.Y.; Sun, M.S. Hybrid attention-based prototypical networks for noisy few-shot relation classification. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA, 27 January–1 February 2019; pp. 6407–6414. [Google Scholar]
  16. Liu, H.X.; Dong, C.; Gou, Z.N.; Gao, K. Few-Shot Relation Extraction Method Fusing with Hybrid Representation. Comput. Eng. 2023, 49, 63–68. [Google Scholar]
  17. Luo, S.Y.; He, J. Few-shot Multi-intent Recognition with Intent Information. J. Chin. Inf. Process. 2023, 37, 61–70. [Google Scholar]
  18. Xian, Y.T.; Xiang, Y.; Yu, Z.T.; Wen, Y.H.; Wang, H.B.; Zhang, Y.F. Mean Prototypical Network for Text Classification. J. Chin. Inf. Process. 2020, 34, 73–80. [Google Scholar]
  19. Peng, H.; Li, J.X.; Wang, S.Z.; Wang, L.H.; Gong, Q.R.; Yang, R.Y.; Li, B.; Philip, S.Y.; He, L.F. Hierarchical taxonomy-aware and attentional graph capsule RCNNs for large-scale multi-label text classification. IEEE Trans. Knowl. Data Eng. 2019, 33, 2505–2519. [Google Scholar] [CrossRef]
  20. Liu, J.Z.; Chang, W.C.; Wu, Y.X.; Yang, Y.M. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), Shinjuku, Tokyo, Japan, 7–11 August 2017; pp. 115–124. [Google Scholar]
  21. Cai, L.J.; Hofmann, T. Hierarchical document categorization with support vector machines. In Proceedings of the 13th ACM International Conference on Information and Knowledge Management (CIKM), Washington, DC, USA, 8–13 November 2004; pp. 78–87. [Google Scholar]
  22. Cerri, R.; Barros, R.C.; de Carvalho, A.C.P.L.F.; Jin, Y.C. Reduction strategies for hierarchical multi-label classification in protein function prediction. BMC Bioinform. 2016, 17, 373. [Google Scholar] [CrossRef] [PubMed]
  23. Wehrmann, J.; Cerri, R.; Barros, R. Hierarchical multi-label classification networks. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 5075–5084. [Google Scholar]
  24. Cao, Y.K.; Wei, Z.Y.; Tang, Y.J.; Jin, C.K.; Li, Y.F. Hierarchical Label Text Classification Method with Deep Label Assisted Classification Task. Comput. Eng. Appl. 2024, 60, 105–112. [Google Scholar]
  25. Deng, Z.F.; Peng, H.; He, D.X.; Li, J.X.; Yu, P.S. HTCInfoMax: A global model for hierarchical text classification via information maximization. arXiv 2021, arXiv:2104.05220. [Google Scholar]
  26. Van Nooten, J.; Daelemans, W. Jump To Hyperspace: Comparing Euclidean and Hyperbolic Loss Functions for Hierarchical Multi-Label Text Classification. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 4260–4273. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 3–9 December 2017; pp. 6000–6010. [Google Scholar]
  28. Liu, C.L. One-vs-all training of prototype classifier for pattern classification and retrieval. In Proceedings of the 20th IAPR International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 23–26 August 2010; pp. 3328–3331. [Google Scholar]
  29. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  30. Challenge of AI in Law (CAIL). Available online: http://cail.cipsc.org.cn/task_summit.html?raceID=6&cail_tag=2021 (accessed on 24 February 2025).
  31. Kowsari, K.; Brown, D.E.; Heidarysafa, M.; Meimandi, K.J.; Gerber, M.S.; Barnes, L.E. Hdltex: Hierarchical deep learning for text classification. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), Cancun, Mexico, 18–21 December 2017; pp. 364–371. [Google Scholar]
  32. Yang, P.C.; Sun, X.; Li, W.; Ma, S.M.; Wu, W.; Wang, H.F. SGM: Sequence generation model for multi-label classification. arXiv 2018, arXiv:1806.04822. [Google Scholar]
  33. Gopal, S.; Yang, Y.M. Recursive regularization for large-scale classification with hierarchical and graphical dependencies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Chicago, IL, USA, 11–14 August 2013; pp. 257–265. [Google Scholar]
  34. Wang, Z.H.; Wang, P.Y.; Huang, L.Z.; Sun, X.; Wang, H.F. Incorporating Hierarchy into Text Encoder: A Contrastive Learning Approach for Hierarchical Text Classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), Dublin, Ireland, 22–27 May 2022; pp. 7109–7119. [Google Scholar]
  35. Xiong, S.; Zhao, Y.; Zhang, J.; Mengxiang, L.; He, Z.; Li, X.; Song, S. Dual prompt tuning based contrastive learning for hierarchical text classification. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 12146–12158. [Google Scholar]
  36. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
Figure 1. The task of multi-label recognition of legal text.
Figure 2. Overall framework.
Figure 3. Distribution of samples with different labels in the dataset.
Figure 4. Convergence dynamics of training loss.
Table 1. Dataset statistics.

| Datasets | Label Total Quantity | Label Depth | Labels per Level | Training/Validation/Test |
|---|---|---|---|---|
| CAIL2021 | 320 | 3 | 11/75/234 | 1996/250/250 |
| WOS | 141 | 2 | 7/134 | 30,070/7518/9397 |
| AAPD | 61 | 2 | 9/52 | 5380/1000/1000 |
Table 2. Experimental results on the CAIL2021 dataset (%).

| Model | Accuracy | Recall Rate | Micro-F1 |
|---|---|---|---|
| TextRCNN | 73.32 | 37.18 | 49.34 |
| BERT | 72.72 | 42.62 | 53.74 |
| SGM | 63.29 | 48.22 | 54.74 |
| CAIL 2021-Top1 | 57.40 | 53.00 | 55.10 |
| Method in this study | 67.57 | 56.06 | 61.28 |
Table 3. Micro-F1 results of the models over five runs on the CAIL2021 dataset (%).

| Models | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
|---|---|---|---|---|---|
| TextRCNN | 49.34 | 49.28 | 49.31 | 48.93 | 49.19 |
| BERT | 53.74 | 52.88 | 53.66 | 53.23 | 52.59 |
| SGM | 54.74 | 53.95 | 54.02 | 54.61 | 53.71 |
| Ours | 61.28 | 61.27 | 61.13 | 61.07 | 60.53 |
Table 4. Significance analysis of the performance improvement of the proposed model compared to other models.

| Statistic | TextRCNN | Ours | BERT | Ours | SGM | Ours |
|---|---|---|---|---|---|---|
| μ (%) | 49.21 | 61.06 | 53.22 | 61.06 | 54.21 | 61.06 |
| σ² (variance) | 0.028 | 0.095 | 0.244 | 0.095 | 0.199 | 0.095 |
| n | 5 | 5 | 5 | 5 | 5 | 5 |
| ρ | 0.284 | | 0.671 | | 0.579 | |
| Δ | 0 | | 0 | | 0 | |
| df | 4 | | 4 | | 4 | |
| t Stat | −86.78 | | −47.72 | | −41.78 | |
| P(T ≤ t) one-tail | 5.29 × 10⁻⁸ | | 5.77 × 10⁻⁷ | | 9.81 × 10⁻⁷ | |
| t critical (one-tail) | 2.132 | | 2.132 | | 2.132 | |
Table 5. Experimental results on the multi-label recognition datasets (%).

| Model | WOS Micro-F1 | WOS Macro-F1 | AAPD Micro-F1 | AAPD Macro-F1 |
|---|---|---|---|---|
| Retrieval | 81.38 | 73.82 | - | - |
| HARNN | 81.50 | 69.69 | 79.58 | 48.83 |
| HiMatch | 86.20 | 80.53 | 80.74 | 57.16 |
| HiAGM | 85.82 | 80.28 | 80.33 | 56.72 |
| HGCLR | 87.11 | 81.20 | - | - |
| MLCL-KNN | 87.37 | 81.88 | - | - |
| HiSR | 87.52 | 82.04 | - | - |
| HALB | 87.45 | 82.04 | - | - |
| DPT | 87.25 | 81.51 | - | - |
| MLCL | 87.35 | 77.82 | 81.75 | 58.75 |
| Ours | 88.24 | 80.24 | 81.21 | 57.65 |
Table 6. Ablation results on the CAIL2021 dataset (%).

| Method | Accuracy | Recall Rate | Micro-F1 |
|---|---|---|---|
| Ours −Layer −BN | 70.23 | 53.63 | 60.82 |
| Ours −Layer | 71.68 | 52.48 | 60.60 |
| Ours −BN | 68.70 | 54.83 | 60.99 |
| Ours | 67.57 | 56.06 | 61.28 |