1. Introduction
Job-related crimes are criminal offenses committed by public officials within state institutions, state-owned enterprises, or people’s organizations. These crimes involve the abuse of state-delegated powers through acts of corruption, fraud, dereliction of duty, infringement of citizens’ personal rights, or subversion of state regulations, all punishable under China’s Criminal Law. If job-related crimes cannot be dealt with in a timely and fair manner, the credibility and efficiency of the government suffer, which is why the anti-corruption campaign aims to crack down on them. Although China’s rigorous anti-corruption campaigns have yielded significant results in recent years, the adjudication of job-related crimes faces systemic challenges, including protracted trial cycles and severe case backlogs, particularly in grassroots courts. Public judicial data reveal a consistent upward trend in such cases processed by courts and procuratorates. Within traditional judicial workflows, judges manually analyze voluminous historical precedents and legal provisions, a process that is labor-intensive, time-consuming, and prone to sentencing inconsistencies arising from subjective interpretation. These deviations not only undermine judicial fairness but also risk eroding public trust in judicial transparency. Consequently, research that enhances sentencing efficiency, minimizes human interference, and standardizes judicial outcomes has become an urgent imperative.
Research on artificial intelligence has advanced rapidly in recent years and now has a significant impact on practical work in the legal field. With the promulgation of the EU Artificial Intelligence Act, legal artificial intelligence has gradually attracted widespread interest from both judicial practitioners and AI researchers, drawing rich research paradigms and methodological diversity from related disciplines such as natural language processing, machine learning, mathematical statistics, and big data science. At the same time, legal AI has brought efficiency gains and fairer judgments to judicial practice.
Legal AI should play a role in the governance of job-related crimes, especially by fully leveraging the legal semantic capture capabilities of BERT and TextCNN. To address the urgent judicial challenges posed by job-related crimes in China’s anti-corruption campaign, this study pursues three research objectives: designing a hierarchical feature fusion network for legal semantics, optimizing domain adaptation for the legal classification of job-related crimes, and advancing AI’s role in China’s anti-corruption campaign through feature fusion and a hierarchical workflow.
2. Related Work
The rapid development of artificial intelligence technology has brought new possibilities to judicial AI [1]. Early systems mostly relied on expert rule libraries or traditional machine learning methods (such as support vector machines), matching historical precedents with structured case features to provide trial references for judges. Goncalves et al. annotated legal texts, including 42 rulings of the European Court of Human Rights [2]. Nallapati et al. showed that traditional machine learning classifiers struggle to extract and connect relationships when converting unstructured text into structured representations [3]. However, such methods rely on manually defined feature engineering, which makes it difficult to handle the complex semantics and unstructured nature of legal texts. In job-related crime cases especially, the variability of the facts and the cross-application of legal provisions often make the suggestions of judicial AI rigid, limiting their practical usefulness.
Natural language processing technology, with its advantages in text understanding and semantic analysis, has shown potential in scenarios such as legal Q&A and contract review [4]. Mikolov et al. proposed the word2vec word embedding method to address the inherent shortcomings of traditional word representations with respect to word order and idiomatic expression [5]. The CBOW (continuous bag of words) and Skip-Gram models use neural networks to predict words from their context, capture a large number of accurate syntactic and lexical semantic relationships, and learn high-quality distributed representations of words, namely, word embeddings. The introduction of negative sampling and hierarchical softmax improved the speed at which dense word embedding vectors are learned [6]. Aletras et al. combined legal provisions, word embeddings, and part-of-speech tagging to predict judgment outcomes from descriptions of the facts of a legal case [7]. To capture multi-level contextual information, Vaswani et al. proposed the Transformer, a model built entirely on attention mechanisms, which makes gradient-based training more effective and is broadly applicable [8].
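To make the mechanism concrete, the following minimal PyTorch sketch implements scaled dot-product self-attention in the spirit of Vaswani et al. [8]; the tensor shapes and toy input are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Scaled dot-product attention: each position attends to all others.

    query, key, value: tensors of shape (batch, seq_len, d_model).
    Returns the attended values and the attention weight matrix.
    """
    d_k = query.size(-1)
    # Correlation weights between all pairs of token positions.
    scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, value), weights

# Example: one sentence of 6 tokens with 16-dimensional representations.
x = torch.randn(1, 6, 16)
out, attn = scaled_dot_product_attention(x, x, x)  # self-attention
print(out.shape, attn.shape)  # torch.Size([1, 6, 16]) torch.Size([1, 6, 6])
```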
BERT (Bidirectional Encoder Representations from Transformers) is composed of multiple stacked layers of Transformer encoders, each containing a multi-head self-attention mechanism and a feedforward neural network [9]. The self-attention mechanism dynamically calculates the correlation weights between words, enabling the model to attend to global context and local key information simultaneously [10]. Because BERT uses bidirectional Transformer encoders, each token can directly attend to all other tokens in the input sequence. Every token representation output by BERT’s later layers is therefore inherently contextualized, incorporating global, coarse-grained information from the entire sequence, which allows the model to capture long-range dependencies and semantic relationships beyond immediate neighbors when predicting a crime. Kim proposed the TextCNN model [11], which uses a convolutional neural network (CNN) to classify text. The text is first transformed into a dense vector representation through word embedding, mapping the discrete vocabulary into a continuous semantic space and effectively capturing local features of the text to be classified [12]. In addition, TextCNN applies multiple convolutional filters of varying widths that slide over the word embedding sequence, so each filter specializes in detecting specific local patterns (n-grams) within its window size. Each filter produces a feature map that highlights where its local, fine-grained information is located, enabling quantitative analysis of the case.
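The following is a minimal sketch of a TextCNN in the style of Kim [11]: parallel convolutions with different window sizes over word embeddings, max-pooled over time and concatenated. The vocabulary size, dimensions, and batch shapes are illustrative assumptions, not this paper’s settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Parallel n-gram convolutions over word embeddings, max-pooled
    and concatenated into a single feature vector for classification."""

    def __init__(self, vocab_size, embed_dim=100, num_classes=6,
                 kernel_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One Conv1d per window size: each filter detects a local n-gram pattern.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-over-time pooling keeps each filter's strongest response.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # (batch, num_classes)

model = TextCNN(vocab_size=30000)
logits = model(torch.randint(0, 30000, (8, 128)))  # batch of 8 texts
print(logits.shape)  # torch.Size([8, 6])
```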
In recent years, LLMs (large language models) have developed rapidly; these are deep learning models trained on vast amounts of text data that can generate natural language text or understand its meaning [13]. LLMs fully leverage the advantages of pre-training on large datasets through massive computing power and deep learning techniques [14]. Yue proposed DISC-LawLLM, an LLM-based model that can solve specific tasks in the legal domain [15]. Moreover, Cui proposed combining LLaMA with a knowledge graph to construct the ChatLaw model, which has been iterated to the second-generation ChatLaw2 MoE [16]. Represented by these models, multiple Chinese legal models have achieved excellent performance in Chinese legal Q&A and case matching [17]. The Lexis+ AI platform launched by LexisNexis integrates large language models such as GPT-4 and can generate sentencing recommendations by analyzing variables such as the nature of the case, the amount involved, and the defendant’s background [18]. The AI legal tool developed by Casetext uses natural language processing to assist lawyers in legal retrieval, following human instructions [19]. LLMs have a wide range of application scenarios: they are pre-trained on massive data through self-supervised or semi-supervised learning and then further optimized through fine-tuning and human alignment [20]. However, existing legal intelligence applications mainly focus on general legal service scenarios [21], such as case retrieval or legal consultation, and their support for sentencing decisions mostly remains at the level of macro suggestions.
Expert systems and traditional machine learning methods have notable limitations in predicting job-related crimes. Manually crafted rules fail to handle edge cases, e.g., novel corruption tactics, causing higher error rates in complex job-related crimes [22]. Typically, lawyers dedicate months to encoding rules and keeping systems current with evolving law [23]. Without contextual knowledge, ambiguous language, such as the terms “gift” and “bribe,” cannot be interpreted, leading to rigid outcomes [24]. Documented failures of rule-based systems in common law jurisdictions show inaccuracies in statutory interpretation [25]. Traditional machine learning likewise proved inadequate for job-related crimes, where case variability and contextual semantics, such as abuse of power, demand adaptive reasoning [26]. Compared with expert systems and traditional machine learning methods, word embeddings and convolutional neural networks have therefore attracted more attention from researchers. Word embeddings encode words as dense vectors, enabling similarity metrics and distributional semantics [27]. For large corpora, negative sampling makes training efficient and reduces compute costs [28]. The analogies learned by word embeddings (e.g., man is to king as woman is to queen) could help parse legal relationships in job-related crimes [29]. However, word embeddings are static [30]: a word receives the same vector across all of its senses, and the embeddings fail to encode word windows, so polysemous words such as “gift” may be understood differently in legal and everyday contexts. Feature fusion is a method that addresses both overfitting in global feature associations and the encoding of word windows. Feature fusion in deep learning integrates multiple cues or features representing different aspects of the input to create a more comprehensive representation for tasks such as classification. The linear combination of feature statistics, and its extension to a general nonlinear fusion method, involves minimizing a mean-square error plus a regularization term [31]. As a combination of features from different layers or branches, fusion is often implemented via simple operations such as summation or concatenation. Attentional feature fusion has been described with short and long skip connections as well as fusion within the frontend layers [32].
However, current research on legal classification prediction lacks an in-depth examination of auxiliary sentencing for the specific scenario of job-related crimes, which requires logical analysis of criminal circumstances, factual determination, and legal relationships. In particular, insufficient attention has been paid to the feature fusion of criminal circumstances. Job-related crime is a type of criminal offense that arises from the facts and legal consequences of an act of duty. Its legal texts include both coarse-grained information for distinguishing crime from noncrime and fine-grained information for quantitatively analyzing the length of the defendant’s imprisonment. This paper proposes a multi-level feature fusion method based on attention and convolutional kernels to address the separation of global and local features in legal texts on job-related crimes. The attention mechanism captures global features by embedding different semantic positions, while the convolutional kernels obtain local features through sliding windows. Features of multiple granularities are then fused to enhance the model’s ability to perceive semantic information in job-related crime scenarios and to improve its predictive performance for job-related crimes.
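To illustrate the general idea, the sketch below fuses a global BERT feature with local TextCNN-style features by concatenation. The class name, fusion operator, and hyperparameters are illustrative assumptions and do not reproduce the exact MLFFN architecture (its segment embeddings and dual-channel kernels, for example, are omitted here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel

class BertCnnFusion(nn.Module):
    """Illustrative BERT-frontend / TextCNN-backend fusion by concatenation."""

    def __init__(self, num_classes=6, kernel_sizes=(2, 3, 4), num_filters=100):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        for p in self.bert.parameters():   # frozen frontend (cf. Section 5.4)
            p.requires_grad = False
        hidden = self.bert.config.hidden_size  # 768
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, num_filters, k) for k in kernel_sizes])
        # Fuse the global [CLS] feature with concatenated local CNN features.
        self.fc = nn.Linear(hidden + num_filters * len(kernel_sizes), num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        tokens = out.last_hidden_state             # (batch, seq_len, 768)
        global_feat = tokens[:, 0]                 # [CLS]: coarse-grained context
        x = tokens.transpose(1, 2)                 # (batch, 768, seq_len)
        local = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        fused = torch.cat([global_feat] + local, dim=1)
        return self.fc(fused)                      # six sentencing classes
```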
3. Features of Job-Related Crimes
3.1. Basis of Sentencing for Job-Related Crimes
The theory of sentencing for job-related crimes draws on the Criminal Law of the People’s Republic of China, combined with judicial interpretations, criminal policies, and judicial practice experience, to construct a clear hierarchy of punishments that balances fairness and efficiency. The sentencing levels for job-related crimes are divided into six categories, i.e., exemption from punishment, 0–1 year, 1–3 years, 3–5 years, 5–10 years, and more than 10 years. The theoretical basis is rooted in the principle of proportionality between crime and punishment and the criminal trial policy of balancing leniency and severity, aiming to align social harm, subjective malice, and punishment through scientific, standardized criteria.
In specific sentencing practice, the identification of punishment levels is based mainly on the objective harm and subjective malice of the criminal behavior. Taking the crime of embezzlement as an example, Article 383 of the Criminal Law provides that an embezzled amount between CNY 30,000 and CNY 200,000 is considered “relatively large” and punishable by fixed-term imprisonment of up to 3 years or detention; CNY 200,000 to CNY 3 million is regarded as a “huge amount” and punishable by imprisonment of not less than 3 years and not more than 10 years; and an amount over CNY 3 million is considered “exceptionally large” and is subject to imprisonment ranging from more than 10 years to life imprisonment, as shown in Table 1. Although these amount thresholds are rigid, judicial interpretations further introduce flexible standards such as “serious circumstances” and “dire circumstances”. For example, where embezzlement involves disaster relief or poverty alleviation funds, or has an adverse social impact, the punishment can still be escalated even if the amount does not reach the threshold, as shown in Table 2.
3.2. Global and Local Features of Sentencing
The global features of job-related crimes comprise the overall information about their legal relationships obtained from analyzing the entire legal text. This information focuses on the legal relationships regulated in the text, ignores local details, and is highly stable, supporting appropriate penalties and accountability. The elements of the crime, causal connections, and criminal accountability involved in job-related crimes exhibit legality, specificity, and universality; the subjective purpose matches the consequences of the crime, and the criminal behavior is consistent with the duties of the position. The global features of criminal embezzlement are shown in Table 3. The local features of job-related crimes are those obtained from specific regions of the legal text; they reflect the local structure and detailed information bearing on conviction and sentencing. The determination of the embezzled amount is a local feature, as are aggravating factors. Combining these two kinds of features can elevate the quality of judgment in job-related crime cases. The local features of criminal embezzlement are shown in Table 4.
In judicial practice, the scientific rigor and adaptability of sentencing standards should be considered together. On the one hand, fixed-amount standards may lead to sentencing differences between areas with unbalanced economic development. The practical tolerance for “large amounts” in economically developed areas is higher than in less developed regions, which damages legal unity and weakens the punitive and deterrent power of the law. For example, embezzling CNY 30,000 in an economically underdeveloped area may affect poverty relief for hundreds of people, so its criminal consequences far exceed those of the same act in a developed region. On the other hand, the boundaries of judges’ discretion need to be further refined through judicial interpretation. Based on judicial opinions on several issues concerning the strict application of probation and exemption from criminal punishment in handling cases of job-related crimes, the standard stipulates that probation or exemption shall not be applied to job-related crimes involving serious circumstances and destructive social impact, thereby reducing the scope of discretion. A dynamic adjustment mechanism linking sentencing amounts to regional per capita income could be introduced to keep the standard adaptable over time.
3.3. Definition of Classification Task and Sentencing
According to the sentencing provisions for job-related crimes, cases can be clearly classified into six categories, i.e., exemption from punishment, 0–1 year, 1–3 years, 3–5 years, 5–10 years, and more than 10 years. These six categories not only reflect the quantitative grading of the social harm of job-related crimes but also provide clear classification targets for constructing machine learning models. In judicial practice, judges need to accurately match the text of a job-related crime to the corresponding sentencing range based on the facts of the case. Essentially, this maps unstructured text (such as case descriptions in judgment documents) to preset category labels, which aligns closely with the multi-classification framework of mapping input features to output labels.
The choice of multi-classification over regression or binary classification is based mainly on the legally required discreteness of sentences and the normative needs of judicial decision-making. The law draws strict boundaries for sentencing; for example, there is an unbreakable legal difference between “3–5 years” and “5–10 years”. If a regression model were used to predict the specific length of a sentence, its numerical continuity could ignore such structural boundaries, producing predictions that deviate from legal standards. A regression model might output “4.8 years”, but under criminal law this result must be classified as “3–5 years” rather than used directly as an independent value. In addition, multi-classification produces a probability distribution over the sentencing intervals; for example, the probability of a case being sentenced to “5–10 years” might be 75% and to “3–5 years” 20%. This probabilistic output not only provides a reference for judges but also helps them weigh the rationality of different sentencing options, enhancing the transparency and scientific quality of decision-making.
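As an illustration of this probabilistic output, the snippet below converts hypothetical classifier logits into a ranked distribution over the six sentencing intervals; the logit values are invented for demonstration only.

```python
import torch

LABELS = ["exemption from punishment", "0-1 year", "1-3 years",
          "3-5 years", "5-10 years", "more than 10 years"]

# Hypothetical logits for one case, e.g., from a six-class sentencing model.
logits = torch.tensor([[-2.1, -1.5, 0.2, 1.6, 3.0, 0.4]])
probs = torch.softmax(logits, dim=1)[0]

# The judge sees a ranked distribution over the six statutory intervals
# rather than a single point estimate such as "4.8 years".
for label, p in sorted(zip(LABELS, probs.tolist()),
                       key=lambda t: t[1], reverse=True):
    print(f"{label}: {p:.1%}")
```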
5. Experiments and Results
5.1. Data Description
Current legal case datasets mostly come from China Judgments Online [33], the question bank of the National Judicial Examination Center [34], and the Challenge of AI in Law [35]. China Judgments Online covers various types of cases, including civil, criminal, and administrative, together with factual determinations, the evidence and basis for those determinations, and verdict results. The National Judicial Examination Center provides teaching cases and standardized question banks selected by legal scholars. The case data from China Judgments Online and the National Judicial Examination Center had to be obtained with web crawler tools. Unlike these two sources, the Challenge of AI in Law offers an authoritative judicial dataset that can be downloaded publicly. Accordingly, the job-related crime datasets in this paper comprise cases from China Judgments Online, the question bank from the National Judicial Examination Center, and the CAIL2018 (Challenge of AI in Law) dataset.
Data cleaning and data preprocessing are important and interconnected steps in preparing data for experiments. Personal privacy in cases involving sensitive data should not be exposed to the public; therefore, the content of cases related to personal privacy has been desensitized through technological means. The facts of the cases and sentencing standards, after being cleaned and preprocessed, comply with the sentencing circumstances stipulated in the Criminal Law. The structured data ensures a one-to-one correspondence between criminal facts and sentencing recommendations, providing natural attributes and labels for the corresponding relationship between input and output during modeling. Each case only contains two fields, namely, “statement of fact” and “range of verdict and sentence”. The ranges of verdicts and sentences are strictly divided into six categories according to the criminal law, namely, exemption from punishment, 0–1 year, 1–3 years, 3–5 years, 5–10 years, and more than 10 years.
The datasets of job-related crimes involved in this paper are specialized resources containing approximately 270,921 Chinese legal documents related to job-related crimes. The cases used for the experiments are divided into a training set, a verification set, and a test set, with a ratio of 6:2:2. This includes 162,552 cases in the training dataset, 54,185 cases in the verification dataset, and 54,184 cases in the test dataset.
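A minimal sketch of the 6:2:2 split described above follows, assuming the cases are held in a Python list of dicts; the helper function and seed are illustrative, not the authors’ actual preprocessing code.

```python
import random

def split_cases(cases, seed=42):
    """Shuffle and split cases into train/validation/test at a 6:2:2 ratio.

    `cases` is a list of dicts with the two fields named in Section 5.1:
    'statement of fact' and 'range of verdict and sentence'.
    """
    rng = random.Random(seed)
    cases = cases[:]          # copy so the caller's list is untouched
    rng.shuffle(cases)
    n = len(cases)
    n_train, n_val = int(n * 0.6), int(n * 0.2)
    return (cases[:n_train],
            cases[n_train:n_train + n_val],
            cases[n_train + n_val:])

# With 270,921 documents this yields 162,552 / 54,184 / 54,185 cases;
# the paper reports 162,552 / 54,185 / 54,184, i.e., the same split up to
# a one-case difference in boundary rounding.
```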
5.2. Evaluation Metrics
On the one hand, legal judgments require both precision and reliability; on the other, legal predictions rely on extensive historical case data, and artificial intelligence is transforming the legal profession by enabling attorneys to predict case outcomes with greater accuracy. Although accuracy reflects the overall correctness of predictions, it is sensitive to class-imbalanced data, as shown in Formula (10). TP denotes a true positive, where the model predicts positive and the sample is actually positive; TN denotes a true negative, where the model predicts negative and the sample is actually negative; FP denotes a false positive, where the model predicts positive but the sample is actually negative; and FN denotes a false negative, where the model predicts negative but the sample is actually positive. To describe the model’s performance objectively, precision, recall, and F1 score are used as evaluation metrics, as shown in Formulas (11)–(13). Precision emphasizes the reliability of positive predictions to avoid wrongful judgments, suited to scenarios that require reducing false positives; recall ensures that enough positive samples are identified to avoid missed judgments; and the F1 score is the harmonic mean of precision and recall, meeting the rigid demand for fairness in judicial scenarios.
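For reference, these metrics take their standard forms, restated here in conventional notation consistent with the TP, TN, FP, and FN quantities above:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R}
```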
The experimental results concern the multi-classification task of predicting sentences for job-related crimes, and the dataset is class-imbalanced; for example, the number of cases with judgments of more than 10 years reached 36,653 in the test set, accounting for 70.2%. Therefore, weighted average precision (WAP), weighted average recall (WAR), and weighted average F1 (WAF1) are used for performance evaluation, as shown in Formulas (14)–(16). Here, $P_i$ and $R_i$ denote the precision and recall of label $i$, respectively, and $n_i$ and $N$ denote the count of label $i$ and the size of the dataset in Formulas (14)–(16), respectively.
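These weighted metrics take the conventional support-weighted form; assuming $F_{1,i}$ denotes the F1 score of label $i$, they can be written as:

```latex
\mathrm{WAP} = \sum_i \frac{n_i}{N}\,P_i, \qquad
\mathrm{WAR} = \sum_i \frac{n_i}{N}\,R_i, \qquad
\mathrm{WAF1} = \sum_i \frac{n_i}{N}\,F_{1,i}
```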
5.3. Experimental Setup
Bert-base-Chinese, TextCNN, ERNIE, and MLFFN (our model) were selected for experiments evaluating the performance of different models on the job-related crime dataset. The experimental environment was provided by AutoDL’s cloud service, using PyTorch 2.3.0 (https://pytorch.org/get-started/previous-versions/, accessed on 23 August 2025) as the framework and an NVIDIA V100 GPU as the hardware. The hyperparameter setup is shown in Table 7; the hyperparameters were selected with regard to common practice, individual experience, and available computational power. The GloVe pre-trained word embeddings used as input to TextCNN are 300-dimensional, and the convolution kernels cover windows of two, three, and four words, with 100 filters per size. The segment embedding dimension of the MLFFN frontend is 5, the positional embedding dimension is 512, the word embedding adopts 100-dimensional GloVe vectors, and the MLFFN backend convolution kernels adopt a dual-channel setting.
5.4. Results and Discussion
The large models used for comparative analysis have a huge number of parameters and are prone to overfitting the training data, memorizing noise rather than learning patterns. It is therefore necessary to evaluate performance on the validation set (e.g., loss and accuracy) after each training epoch. If the training set metrics continue to improve while the validation set metrics decline, the model is beginning to overfit, and the training strategy must be adjusted promptly.
Training ends automatically when the performance on the validation set has not improved for several consecutive epochs. At the start of training, the cross-entropy loss drops rapidly. The validation accuracy curves show that the models reached a stable state after approximately three epochs and 6000 iterations of training, as shown in Figure 5 and Figure 6. The evaluation metrics of each model on the test set, including ACC (accuracy), WAP, WAR, and WAF1, are shown in Table 8 and Table 9.
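For clarity, the early-stopping rule described above can be sketched as follows; the `patience` and `max_epochs` values and the `evaluate_accuracy` helper are assumptions for illustration rather than the authors’ exact training code.

```python
def train_with_early_stopping(model, train_loader, val_loader,
                              optimizer, loss_fn, patience=3, max_epochs=20):
    """Stop when validation accuracy has not improved for `patience`
    consecutive epochs (the paper does not state the exact threshold)."""
    best_acc, epochs_without_gain = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        acc = evaluate_accuracy(model, val_loader)  # user-supplied helper
        if acc > best_acc:
            best_acc, epochs_without_gain = acc, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # validation accuracy has plateaued
    return best_acc
```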
From Table 8, it can be seen that, compared to Bert-base-Chinese, MLFFN improves WAP, WAR, and WAF1 by 10%, 6%, and 9%, respectively; compared to TextCNN, MLFFN improves them by 10%, 14%, and 23%, respectively; and compared to ERNIE, MLFFN improves them by 6%, 7%, and 10%, respectively.
The experimental data show that the MLFFN model achieves an improvement of over 6% in weighted precision, weighted recall, and weighted F1, further indicating that the fusion of global and local features is effective for predicting job-related crimes. The model adopts a frontend BERT and backend TextCNN framework, enhanced with segment embeddings for the legal classification of job-related crimes. The key findings are threefold.
First, segment embeddings improve performance by encoding structural boundaries in legal relationships, e.g., sections and clauses. Second, the frontend BERT captures long-range contextual semantics in the legal relationship clauses of job-related crimes, especially modifiers, while the backend TextCNN extracts discriminative local patterns, such as legal phrases, particularly those related to duty positions. Third, freezing the parameters of the frontend BERT and fine-tuning those of the backend TextCNN helped the training loss converge and the validation accuracy stabilize.