Article

BHE+ALBERT-Mixplus: A Distributed Symmetric Approximate Homomorphic Encryption Model for Secure Short-Text Sentiment Classification in Teaching Evaluations

by Jingren Zhang 1,2,*, Siti Sarah Maidin 2 and Deshinta Arrova Dewi 2

1 School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China
2 Faculty of Data Science, INTI International University, Nilai 71800, Negeri Sembilan, Malaysia
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(6), 903; https://doi.org/10.3390/sym17060903
Submission received: 10 April 2025 / Revised: 24 May 2025 / Accepted: 26 May 2025 / Published: 7 June 2025
(This article belongs to the Section Computer)

Abstract

This study addresses the sentiment classification of short texts in teaching evaluations. To mitigate concerns regarding data security in cloud-based sentiment analysis and to overcome the limited feature extraction capacity of traditional deep-learning methods, we propose a distributed symmetric approximate homomorphic hybrid sentiment classification model, denoted BHE+ALBERT-Mixplus. To enable homomorphic encryption of non-polynomial functions within the ALBERT-Mixplus architecture—a mixing-and-enhancement variant of ALBERT—we introduce the BHE (BERT-based Homomorphic Encryption) algorithm. The BHE establishes a distributed symmetric approximation workflow, constructing a cloud–user symmetric encryption framework. Within this framework, simplified computations and mathematical approximations are applied to handle non-polynomial operations (e.g., GELU, Softmax, and LayerNorm) under the CKKS homomorphic-encryption scheme. Consequently, the ALBERT-Mixplus model can securely perform classification on encrypted data without compromising utility. To improve feature extraction and enhance prediction accuracy in sentiment classification, ALBERT-Mixplus incorporates two core components: 1. A meta-information extraction layer, employing a lightweight pre-trained ALBERT model to capture extensive general semantic knowledge and thereby bolster robustness to noise. 2. A hybrid feature-extraction layer, which fuses a bidirectional gated recurrent unit (BiGRU) with a multi-scale convolutional neural network (MCNN) to capture both global contextual dependencies and fine-grained local semantic features across multiple scales. Together, these layers enrich the model’s deep feature representations. Experimental results on the TAD-2023 and SST-2 datasets demonstrate that BHE+ALBERT-Mixplus achieves competitive improvements in key evaluation metrics compared to mainstream models, despite a slight increase in computational overhead. The proposed framework enables secure analysis of diverse student feedback while preserving data privacy. This allows marginalized student groups to benefit equally from AI-driven insights, thereby embodying the principles of educational equity and inclusive education. Moreover, through its innovative distributed encryption workflow, the model enhances computational efficiency while promoting environmental sustainability by reducing energy consumption and optimizing resource allocation.

1. Introduction

The advent of artificial intelligence has made big data and cloud computing pivotal in the digital transformation of higher education. At present, many universities host their business systems directly on public clouds, relying on cloud computing platforms for data storage and processing. The advantages of cloud computing include scalability, low hardware requirements, enhanced computational power, improved resource utilization, and cost reduction. Furthermore, the extensive clusters and substantial computational capabilities of cloud environments facilitate the extraction of valuable insights from data. However, the intrinsic characteristics of cloud computing also present considerable security challenges: to reduce costs and enhance resource utilization, cloud platforms often employ heterogeneous and diverse node architectures, a characteristic that itself introduces security risks. In the context of higher education, teaching evaluation data are highly sensitive. Such data are typically stored either locally on campus or in the cloud, yet relying on networks of limited reliability and semi-trusted cloud storage servers to process and analyze large-scale educational data inevitably poses risks of sensitive information leakage, making it difficult to guarantee the confidentiality, integrity, and availability of the data. Once sensitive resources are stolen or tampered with, the consequences can be severe.
Text sentiment analysis is a pivotal downstream task within Natural Language Processing (NLP). Its objective is to analyze and comprehend the content of a text in order to ascertain the attitude or emotional inclination of the writer, which may be positive, negative, or neutral. In the context of the digital transformation of education, text sentiment analysis offers a viable avenue for enhancing the efficacy of teaching evaluations through the integration of AI. Short teaching evaluation texts reflect students’ attitudes toward the teaching process and provide invaluable feedback. Educators can use this feedback to understand students’ needs and improve the quality of their teaching. Rapid, accurate, and secure identification of sentiment in evaluation texts enables teachers to quickly pinpoint areas for improvement, enhancing the overall classroom learning environment. Applying traditional machine learning or sentiment lexicon-based methods to text sentiment analysis requires manual feature engineering, a process that is both time-consuming and labor-intensive. Furthermore, these methods cannot effectively capture subtle semantic nuances, making them unsuitable for analyzing large-scale text data. Deep learning has introduced models such as CNN, RNN, LSTM, the transformer, and BERT, along with their variants, significantly advancing research in text sentiment analysis. However, compared to other text sentiment analysis tasks, evaluations of classroom teaching quality are more subjective and varied, often accompanied by distinctive internet culture characteristics. Such texts are also highly noisy and lack contextual information, frequently containing emojis, abbreviations, and internet slang. Consequently, models must extract deeper semantic information, discern contextual nuances, and identify long-range word relationships to accurately determine sentiment polarity. This necessitates sentiment analysis algorithms that are more robust and generalizable.
In educational settings, analyzing students’ short feedback comments plays a crucial role in assessing teaching effectiveness. In this work, we focus on classifying the sentiment of short teaching evaluation comments (such as course evaluation feedback), motivated by the need for educators to automatically understand student opinions while preserving data confidentiality. Given the above challenges, this paper proposes a distributed symmetric approximate homomorphic encryption algorithm called BHE. The algorithm approximates the non-linear operations in the ALBERT workflow of the hybrid model, enabling it to handle computations in pre-trained transformer-based models that are challenging for current FHE schemes, namely the GELU activation function, Softmax, and LayerNorm. In addition, to further reduce the encryption overhead of the text sentiment classification hybrid model and to improve pre-training and classification efficiency, we optimized the existing hybrid model and propose a parameter-efficient ALBERT-Mixplus hybrid-enhanced model. ALBERT-Mixplus consists of a meta-information extraction layer and a hybrid feature extraction layer. The meta-information extraction layer adopts a lightweight pre-trained ALBERT model to learn a large amount of general semantic knowledge, thereby enhancing the model’s noise resistance. The hybrid feature extraction layer is mainly composed of a BiGRU + MCNN neural network, which comprehensively captures the high-dimensional semantic representations of teaching evaluation texts.
This paper’s contributions mainly include the following aspects:
  • We propose a BHE distributed symmetric approximate homomorphic encryption algorithm. To address the challenge of homomorphic encryption for non-polynomial functions in the BERT model, the BHE algorithm provides a distributed, approximate workflow. By using a “computation simplification” method to approximate non-polynomial functions such as GELU, Softmax, and LayerNorm within the hybrid model, this approach enables the fusion model to support inference under a fully homomorphic encryption (FHE) environment.
  • We develop a BHE-compatible ALBERT-Mixplus (a mixing plus enhancements variant of ALBERT) text sentiment classification fusion model. When data are encrypted under the CKKS homomorphic encryption scheme, ALBERT-Mixplus can still effectively extract deep semantic features of sentiment. Experimental results demonstrate that with only a modest increase in training overhead, ALBERT-Mixplus outperforms mainstream classification models on multiple performance metrics.
  • We design a cloud–user distributed symmetric deployment paradigm to overcome homomorphic encryption limitations. Content that cannot be computed under CKKS encryption on the cloud is returned to the user side for local processing. As shown in Figure 1, this workflow architecture offers a practical solution for AI + business distributed application scenarios in higher education, leveraging local data storage, cloud-based computation, and secure, efficient communication.

2. Related Works

2.1. Current Research Status on Sentiment Classification of Teaching Evaluation Short Texts

Research on sentiment classification of short texts in teaching evaluations primarily focuses on accurately analyzing students’ emotions expressed in feedback. Compared to general sentiment analysis tasks, student evaluations of classroom teaching quality exhibit greater subjectivity and variability, often containing more nuanced opinions and implicit sentiment polarities. To date, studies in this domain encompass traditional methods, deep learning techniques, and recent model enhancements [1,2,3,4]. Early approaches to short text sentiment classification predominantly relied on lexicon-based and rule-based methods, such as constructing sentiment lexicons (e.g., SentiWordNet) and devising sentiment analysis rules. While these methods are straightforward to implement and interpret, they often struggle with complex emotions and context-dependent language [5]. The advent of machine learning prompted researchers to employ classifiers such as support vector machines (SVM) and Naive Bayes (NB) for sentiment classification. Extracting text features (e.g., TF-IDF, n-grams) and applying machine learning classifiers improved the accuracy of sentiment classification. Nevertheless, these techniques continue to depend on feature engineering and are unable to effectively capture the intricate semantic relationships embedded within the text [6]. The application of deep learning, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), has significantly enhanced the performance of sentiment classification [7]. Xiao et al. employed a bidirectional LSTM model integrated with word embeddings for Chinese sentiment analysis, effectively mitigating the limitations in capturing contextual information during sentiment classification [8]. Building on the self-attention mechanism, the Google team successively proposed the transformer [9] and the BERT pre-training model [10], both of which achieved impressive results across various performance metrics. Subsequent models such as RoBERTa and ALBERT were developed by researchers to improve upon different aspects of BERT [11]. To address the issue of BERT’s large number of parameters, Google Research introduced ALBERT (A Lite BERT for Self-supervised Learning of Language Representations), which significantly reduces the number of parameters and computational cost through techniques such as parameter sharing and embedding matrix factorization, while maintaining performance similar to BERT [12]. To enhance the generalization ability of models, researchers have explored combining the local feature extraction capabilities of CNNs with the strengths of RNNs and LSTMs in modeling sequences and capturing long-range dependencies in text, leading to hybrid models that provide more comprehensive and effective text representations [13]. Ananthi et al. proposed an Attention-Enhanced Bidirectional Long Short-Term Memory (BiLSTM) model incorporating contextual semantic knowledge, which significantly enhances the ability to capture semantic dependencies and sentiment features in sentiment analysis [14].
Zhao et al. proposed a text classification model based on BERT-GRU-ATT, leveraging Chinese semantics and utilizing an attention mechanism to amplify the influence of sentiment words on classification outcomes [15]. To better accommodate the characteristics of short texts on the internet, Guo et al. proposed an emotion classification model based on EK-INIT-CNN [16]. Ren et al. introduced bidirectional gated temporal convolution and attention mechanisms (BG-TCA), addressing the challenges of modeling long-range dependencies, inadequate extraction of salient features, and limited information flow in text classification tasks [17]. Yao et al. proposed a text classification method based on TextGCN, which constructs a text graph structure and utilizes the global word co-occurrence relationships in the graph for feature extraction. This method effectively addresses the challenges of traditional models in capturing non-local relationships between words and the lack of semantic structural information, thereby improving the accuracy of text classification [18]. Li et al. proposed a causality extraction method based on BERT-GCN, which combines BERT’s contextual representations with the structured semantic information of graph convolutional networks. This method effectively addresses the issues of insufficient semantic understanding and inadequate modeling of inter-entity dependencies in causality extraction [19].

2.2. HE in Privacy-Preserving Sentiment Analysis

In the context of sentiment classification for teaching evaluation texts in higher education institutions, teaching evaluation data often need to be deployed locally due to security requirements. However, given the need to process massive amounts of data, higher education institutions rely on cloud computing resources to carry out various analytical tasks. Since evaluation data and results involve students’ private information, the data cannot be directly disclosed and are typically managed by teaching administrative departments. In practical implementation, if the plaintext content of evaluation data is transmitted to the cloud for classification prediction, ensuring data security becomes challenging due to the complexity of cloud environments. Therefore, a potential approach to addressing this issue is to enable cloud-based models to perform effective predictions on encrypted data. The application of privacy-preserving technologies is intended to achieve a balance between the usability of data and the protection of privacy. Homomorphic encryption permits user data to be stored in encrypted form in the cloud, ensuring that the data content cannot be accessed by service providers; this safeguards against the potential misuse or tampering of user data and thus protects user privacy. At present, it represents a pivotal area of investigation within the domain of privacy-preserving sentiment analysis. Homomorphic encryption can be classified according to its developmental stage and the types and number of operations that can be performed on ciphertexts, yielding three categories: partial homomorphic encryption (PHE), somewhat homomorphic encryption (SHE), and fully homomorphic encryption (FHE) [20,21,22,23]. Fully homomorphic encryption (FHE) supports an arbitrary number of both additive and multiplicative operations on ciphertexts. Its effectiveness in protecting privacy has been demonstrated in a number of fields, including image classification and statistical analysis [24,25,26]. Kim et al. proposed an efficient non-interactive text classification method based on fully homomorphic encryption (FHE), called PrivFT, which enables inference and prediction using simple neural network language models on encrypted user data [27]. Zhang et al. explored the combined application of data obfuscation and homomorphic encryption in privacy-preserving machine learning, providing specific case studies in sentiment classification [28]. Marcos Florencio et al. proposed an end-to-end Homomorphic Neural Network (HNN) architecture that enables sentiment analysis tasks to be performed while maintaining data in an encrypted state [29].
Although fully homomorphic encryption (FHE) supports both addition and multiplication operations, and thereby the construction of polynomial functions, it does not support comparison operations such as max and min, or common non-polynomial functions such as the exponential and sigmoid functions. Furthermore, pre-trained models based on self-attention, such as the transformer and BERT, cannot directly encrypt operations like Softmax, GELU, and LayerNorm, which presents challenges for applying FHE to these types of models.

3. BHE+ALBERT-Mixplus

3.1. Preliminaries

In this section, we will briefly introduce the basic concepts used in the BHE+ALBERT-Mixplus model.

3.1.1. Fully Homomorphic Encryption (FHE) Algorithm

Fully homomorphic encryption (FHE) is an encryption scheme that allows users to perform computations on encrypted data without first decrypting the data. The computation results remain in encrypted form, and decrypting them yields the same output as performing the corresponding computations on the unencrypted data. Let F be the function of the entire pre-trained model, E the encryption function, and D the decryption function. For any allowed plaintext input x, the relation in Equation (1) holds.
$F(x) = D(g(E(x)))$  (1)
where g is a constructed function that plays the same role as F, except that it operates on encrypted data.
Figure 2 describes the encryption process of fully homomorphic encryption. We deploy a pre-trained language model, such as ALBERT, to the cloud. The pre-trained model receives encrypted ciphertext from the database, which has been encrypted with a public key, and performs inference operations g on the ciphertext. Finally, the encrypted result is returned to the user, who decrypts it using a private key. Throughout the entire workflow, the cloud model does not have access to the plaintext data or the prediction results, thereby ensuring privacy and confidentiality.
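To make the relation in Equation (1) concrete, the following minimal sketch evaluates a small polynomial on CKKS ciphertexts using TenSEAL, a Python wrapper around the SEAL library referenced in Section 3.5. The library choice, encryption parameters, and variable names are illustrative assumptions rather than settings taken from this paper.

```python
# Minimal sketch of F(x) = D(g(E(x))) with the CKKS scheme via TenSEAL.
import tenseal as ts

# Client-side context: holds the secret key; in practice the cloud would only
# receive a public copy of this context together with the ciphertexts.
ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()

x = [0.5, -1.2, 3.0]
enc_x = ts.ckks_vector(ctx, x)      # E(x): encrypt on the user side

# g: a polynomial the cloud can evaluate directly on ciphertexts
enc_y = enc_x * enc_x + enc_x       # only additions and multiplications

print(enc_y.decrypt())              # D(g(E(x))) ~= [v**2 + v for v in x]
```

Because CKKS is an approximate scheme, the decrypted values match the plaintext computation only up to small numerical noise.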

3.1.2. Pre-Trained BERT Model

The architecture of the BERT model is primarily composed of multiple layers of the bidirectional transformer’s encoder, constructed through the stacking of transformer encoder models [9]. This structure enables BERT to fully grasp the bidirectional contextual information within the text, leading to its exceptional performance across various NLP tasks.
At the core of the encoder lies the self-attention mechanism, which revolves around three key concepts: query, key, and value, denoted by the matrices Q, K, and V, respectively. In the self-attention mechanism, the target word is treated as the query, while its surrounding context words serve as the keys. The similarity between the query and each key is computed and used as a weight, which integrates the values of the context words into the original value of the target word.
In the BERT structure shown in Figure 3, all word embeddings are placed into a matrix X, which is then multiplied by the pre-trained weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$, respectively, to obtain Q, K, and V. Here, Q represents the query vector of the target word, K represents the key vector of each context word, and V represents the original value vector of the target word and its context words. Subsequently, the self-attention hybrid feature representation is obtained according to Equation (2).
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$  (2)
The Softmax(·) operation normalizes the similarity scores into a probability distribution; the normalized weights are then applied to the value matrix, yielding the self-attention representation matrix.
In the encoder, a multi-head attention mechanism is introduced to replace single self-attention. Specifically, h different linear transformations are applied to map Q, K, and V, and each attention head maps the input into a different sub-representation space. The entire computation process is outlined in Equations (3) and (4).
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)W^{O}$  (3)
$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$  (4)
Here, h represents the number of attention heads, and $W^{O}$ is the weight matrix applied to the concatenated multi-head output, with $W_i^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_{Q}}$. In practice, h is often set to 8, with $d_k = d_v = d_{\text{model}}/h = 64$. In this structure, each attention head independently maintains its own Q, K, and V weight matrices.
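As a concrete illustration of Equations (2)–(4), the short PyTorch sketch below computes scaled dot-product attention and concatenates h = 8 heads; the model dimension of 512 and the random projection matrices are illustrative assumptions rather than values used in this paper.

```python
# Scaled dot-product attention (Eq. 2) and multi-head attention (Eqs. 3-4).
import math
import torch

def attention(Q, K, V):
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # QK^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V                  # Eq. (2)

d_model, h, d_k = 512, 8, 64                 # d_k = d_v = d_model / h
X = torch.randn(10, d_model)                 # embeddings for 10 tokens
W_Q = torch.randn(h, d_model, d_k)           # per-head projection matrices
W_K = torch.randn(h, d_model, d_k)
W_V = torch.randn(h, d_model, d_k)
W_O = torch.randn(h * d_k, d_model)

heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]  # Eq. (4)
Z = torch.cat(heads, dim=-1) @ W_O           # Eq. (3): Concat(head_1..head_h) W^O
print(Z.shape)                               # torch.Size([10, 512])
```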

3.2. Overview of BHE+ALBERT-Mixplus

The BHE+ALBERT-Mixplus (a mixing plus enhancements variant of ALBERT) model primarily consists of two components: the approximate symmetric homomorphic BHE algorithm and the ALBERT-Mixplus hybrid model. The BHE algorithm will be discussed in detail in later sections.
To enable efficient homomorphic encryption in our BHE framework, we adopt ALBERT (A Lite BERT) [11], a compact variant of BERT specifically designed to minimize model size while maintaining high accuracy. BERT-Base comprises 12 transformer layers with 768 hidden units and roughly 110 million parameters. In contrast, ALBERT retains the same 12-layer architecture but applies factorized embedding parameterization—decoupling the vocabulary embedding dimension from the hidden size—and cross-layer parameter sharing, whereby a single set of weights is reused across all encoder layers. Together, these optimizations shrink the model to only about 12 million parameters (a ~90% reduction) without performance loss. The resulting reduction in encrypted parameter count and ciphertext volume lowers computational and communication overhead, making ALBERT ideally suited for privacy-preserving inference on the cloud.
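The parameter figures quoted above can be checked directly; the snippet below is a hedged sketch that counts the parameters of the public Hugging Face checkpoints (the checkpoint names are assumptions and not artifacts of this paper, and running it downloads the weights).

```python
# Compare the parameter counts of ALBERT-Base and BERT-Base.
from transformers import AlbertModel, BertModel

albert = AlbertModel.from_pretrained("albert-base-v2")
bert = BertModel.from_pretrained("bert-base-uncased")

count_m = lambda m: sum(p.numel() for p in m.parameters()) / 1e6
print(f"ALBERT-Base: {count_m(albert):.1f}M parameters")   # roughly 12M
print(f"BERT-Base:   {count_m(bert):.1f}M parameters")     # roughly 110M
```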
In the ALBERT-Mixplus model, semantic feature extraction is mainly dependent on the meta-information extraction layer and the hybrid feature extraction layer. The meta-information extraction layer utilizes an ALBERT model, which dynamically adjusts vector representations by incorporating the specific context of target words, thereby enhancing the semantic expressiveness of words. The hybrid feature extraction layer, composed of BiGRU and MCNN, is designed to capture local features at different scales and high-dimensional global sentiment semantic information in the text, thereby improving the model’s ability to extract sentiment features. After dimensionality reduction of the mixed text feature vectors, sentiment classification is performed using a classifier. The overall structure of the BHE+ALBERT-Mixplus model is shown in Figure 4.
In our proposed architecture, the ALBERT encoder depth was configured to six layers. As illustrated in Figure 4, SA denotes the self-attention mechanism, while Conv represents multi-scale convolutional operations. To meet the requirements of homomorphic encryption (HE), we selected ReLU as the activation function and placed the Max computation on the user side. Additionally, we employed an approximate Softmax function supported by BHE as the classifier. This aspect is explained thoroughly in Section 3.5.

3.3. Meta-Information Extraction Layer

To maintain the high performance of the hybrid model while significantly reducing the number of parameters and the computational resources required, we employed the ALBERT model in the meta-information extraction layer. ALBERT shares parameters across different encoder layers rather than using distinct parameters for each layer, as illustrated in Figure 5.
$X = (x_1, x_2, \ldots, x_n)$ represents the input vectors with integrated positional encoding, while TR denotes the encoder model within the transformer architecture. The model’s principles have been thoroughly detailed in the previous section. It is crucial to note that in the multi-head attention component of each encoder, we constructed the hybrid features according to the following procedure:
  • The query, key, and value matrices are mapped through h different linear transformations.
  • The Appro-Softmax classifier, approximated using the BHE algorithm, computes the different attention representations according to Equation (2).
  • The different attention feature representations are concatenated into $\hat{Z}$, which can be denoted as $\hat{Z} = (Z_0, Z_1, \ldots, Z_8)$. In this model, we set the depth of self-attention to 8.
  • The concatenated matrix is multiplied by a weight matrix $W^{O}$, resulting in the final representation Z, which incorporates information from all self-attention heads. This representation is then fed into the position-wise feed-forward network (FFN).

3.4. Hybrid Feature Extraction Layer

The feature hybrid extraction layer primarily consists of a bidirectional gated recurrent unit (BiGRU) and multiscale convolutional neural network (MCNN) models connected in series [30,31]. The feature fusion vector representation output by ALBERT is input into BiGRU to capture semantic context and global information. Subsequently, multi-scale convolutional kernels from MCNN-MCO are employed to extract local semantic features at varying granularities, thereby comprehensively obtaining high-dimensional semantic representations of teaching evaluation texts. The self-attention mechanism identifies key emotional features that significantly influence the model’s classification results.
To enhance the hybrid model’s classification performance on encrypted data, we replaced the traditional BiLSTM, with its three-gate structure (forget gate, input gate, and output gate), with the BiGRU model. BiGRU is a recurrent neural network and an improved variant of the RNN and BiLSTM. It consists of two relatively independent GRU units: one processes data in the forward direction and the other in the backward direction. Through this bidirectional structure, BiGRU is capable of simultaneously capturing both forward and backward information. The GRU unit includes an update gate and a reset gate. The update gate controls how much of the previous state is retained in the current state, while the reset gate determines the impact of the previous state on the current state. The GRU computations are given in Equations (5)–(8).
$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$  (5)
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$  (6)
$\tilde{h}_t = \tanh(W \cdot [r_t \times h_{t-1}, x_t])$  (7)
$h_t = (1 - z_t) \times h_{t-1} + z_t \times \tilde{h}_t$  (8)
where σ represents the sigmoid activation function, $z_t$ is the update gate, $r_t$ is the reset gate, $h_t$ denotes the current state, $\tilde{h}_t$ represents the candidate state, $W_r$ is the weight matrix for the reset gate, and $W_z$ is the weight matrix for the update gate.
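A minimal PyTorch sketch of such a BiGRU is given below; the hidden size and sequence length are illustrative assumptions, while the input dimension of 768 matches ALBERT-Base’s hidden size.

```python
# Bidirectional GRU over ALBERT token representations (Eqs. 5-8 are computed
# internally by nn.GRU for both directions).
import torch
import torch.nn as nn

albert_dim, hidden = 768, 128
bigru = nn.GRU(input_size=albert_dim, hidden_size=hidden,
               batch_first=True, bidirectional=True)

tokens = torch.randn(1, 32, albert_dim)   # ALBERT output for a 32-token comment
H, _ = bigru(tokens)                      # H = (H_1, ..., H_n)
print(H.shape)                            # (1, 32, 256): forward and backward states
```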
After passing through the BiGRU layer, we obtain the new feature vector representation $H = (H_1, H_2, \ldots, H_n)$. This vector representation H is then input into the multiscale convolutional kernels to extract local emotional features from the evaluation texts at multiple scales. The structure of the MCNN model is illustrated in Figure 6.
The MCNN primarily includes multiscale convolutional operations (MCNN-MCO) and a self-attention mechanism (MCNN-SA). It is important to note that, to accommodate HE encryption requirements, we have omitted pooling operations in the MCNN. MCNN-MCO employs convolutional kernels of various sizes that slide over the text embedding matrix from top to bottom to extract local textual features. The specific computation process is described by Equations (9) and (10).
$c_i = f(W \ast H_{i:i+m-1} + b)$  (9)
$c = (c_1, c_2, \ldots, c_{n-m+1})$  (10)
Here, $\ast$ represents the convolution operation, W denotes the parameter weight matrix, b indicates the bias term, and $H_{i:i+m-1}$ refers to the text feature vectors from rows i to $i+m-1$ output by the BiGRU model. f is the activation function, with ReLU used in this study. The sizes of the convolutional filters are set to (2, 3, 4), and after the convolution operations we obtain the feature representations $C_2$, $C_3$, and $C_4$.
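The following sketch implements the multi-scale convolutions of Equations (9) and (10) with the kernel sizes (2, 3, 4) stated above; the input dimension, sequence length, and filter count are illustrative assumptions.

```python
# MCNN-MCO: parallel Conv1d branches with kernel sizes 2, 3, and 4.
import torch
import torch.nn as nn

d, n, filters = 256, 32, 64                   # BiGRU output dim, sequence length, filters per scale
convs = nn.ModuleList(
    nn.Conv1d(in_channels=d, out_channels=filters, kernel_size=m) for m in (2, 3, 4)
)

H = torch.randn(1, n, d)                      # output of the BiGRU layer
H = H.transpose(1, 2)                         # Conv1d expects (batch, channels, length)
C2, C3, C4 = (torch.relu(conv(H)) for conv in convs)  # c_i = ReLU(W * H_{i:i+m-1} + b)
print(C2.shape, C3.shape, C4.shape)           # lengths n-1, n-2, n-3 (Eq. 10)
```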
MCNN-SA aims to enhance the key feature extraction capability of the vectors after convolution operations, highlighting critical emotional features. It calculates the attention scores $\beta_i$ for the convolutional layer’s output features $C_i$. Through weighted summation, the overall attention output V is obtained. The specific computation process is detailed in Equations (11)–(13).
$U_i = \mathrm{ReLU}(W_z C_i + b_z)$  (11)
$\beta_i = \frac{\exp(U_i)}{\sum_i \exp(U_i)}$  (12)
$V = \sum_i \beta_i C_i$  (13)
ReLU is the nonlinear activation function, and $W_z$ and $b_z$ are the weight matrix and bias term, respectively. Finally, the attention features from each convolutional channel are fused into the feature representation embeddings $S_1$, $S_2$, and $S_3$. The concatenation of the attention feature embeddings from the multiple convolutional channels is shown in Equation (14).
$Z = \mathrm{Concat}(S_1, S_2, S_3)$  (14)
The sentiment classification probability $P_m$ is computed using the approximated Softmax function (Appro-Softmax), as shown in Equation (15).
$P_m = \text{Appro-Softmax}(W_m Z + b_s)$  (15)
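The channel-wise attention and fusion in Equations (11)–(15) can be sketched as follows; C2, C3, and C4 stand in for the multi-scale feature maps of the previous sketch, and all shapes, along with the plain Softmax classifier head used here in place of Appro-Softmax (Section 3.5.3), are illustrative assumptions.

```python
# MCNN-SA attention over each convolutional channel, concatenation, and classification.
import torch
import torch.nn as nn

filters, n_classes = 64, 5
C2, C3, C4 = (torch.randn(1, filters, length) for length in (31, 30, 29))
W_z = nn.Linear(filters, filters)
W_m = nn.Linear(3 * filters, n_classes)

def mcnn_sa(C):
    U = torch.relu(W_z(C.transpose(1, 2)))        # Eq. (11): U_i = ReLU(W_z C_i + b_z)
    beta = torch.softmax(U, dim=1)                 # Eq. (12): attention scores
    return (beta * C.transpose(1, 2)).sum(dim=1)   # Eq. (13): weighted sum V

S1, S2, S3 = (mcnn_sa(C) for C in (C2, C3, C4))    # per-channel attention features
Z = torch.cat([S1, S2, S3], dim=-1)                # Eq. (14): Concat(S1, S2, S3)
P = torch.softmax(W_m(Z), dim=-1)                  # Eq. (15), with Softmax as a stand-in
print(P.shape)                                     # (1, 5): sentiment class probabilities
```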

3.5. BHE Approximate Homomorphic Encryption Algorithm

The BHE algorithm proposed in this paper is based on the CKKS encryption scheme by Cheon et al. [32], which can be implemented using the SEAL encryption library. Based on the aforementioned issues, the innovative aspects of the proposed approximate homomorphic encryption (BHE) algorithm include the following key points:
  • Integration of User Devices for FHE Predictions: Traditional FHE-based encryption algorithms typically rely entirely on cloud servers. The BHE algorithm incorporates user-side devices to address the issue of non-polynomial operations in FHE encryption within the model. This approach helps improve inference efficiency and enhances the privacy protection capabilities of sentiment classification models.
  • Simplification of Computation for Approximating Non-Polynomial Functions: The BHE algorithm employs a “simplified computation” concept to approximate non-polynomial functions in the ALBERT model, such as GELU, Softmax, and LayerNorm. Since these functions cannot be directly computed under homomorphic encryption, BHE uses approximation techniques to enable homomorphic encryption support.
  • Establishment of a Staged Approximation Workflow: The BHE algorithm divides the approximation workflow for the ALBERT model into two phases: initially replacing GELU and Softmax functions, followed by incorporating LayerNorm approximation after standard fine-tuning. This staged approach ultimately transforms the model into one that fully supports homomorphic encryption operations.

3.5.1. Symmetrical Approximation Workflow

The symmetrical approximation workflow consists of two phases, standard fine-tuning and LN extraction, as illustrated in Figure 7. Given an original ALBERT-Mixplus model that has not undergone HE approximation, denoted M, fine-tuning with educational evaluation data produces a fully HE-supported model $\bar{M}$. The two-phase objective of Algorithm 1 is to identify the best approximation checkpoint.
Algorithm 1: BHE Symmetrical Approximation Workflow
Data: labeled course-appraisal data D.
Input: pre-trained ALBERT-Mixplus model M, Appro-Softmax model S.
1. $\hat{M} \leftarrow M(S, \mathrm{ReLU})$  // replace the tanh-based GELU and Softmax
2. while not done do
3.   sample batches $(x_i, y_i)$ from D
4.   optimize $\hat{M}$ with S on $(x_i, y_i)$
  end
5. $\tilde{M} \leftarrow \hat{M} \oplus \tilde{N}$  // add Appro-LN
6. while not done do
    sample batches $(x_i, y_i)$ from D
    freeze the parameters of $\tilde{M}$ except $\tilde{N}$
    compute the k-th LN outputs $O_k$ and $\tilde{O}_k$
    compute the loss $l_k = \mathrm{MSELoss}(O_k, \tilde{O}_k)$
    update $\tilde{N}$ with the loss $\Gamma = \sum_k l_k$
  end
$\bar{M} \leftarrow \hat{M} \oplus \tilde{N}$  // drop the original LayerNorm (LN)
return $\bar{M}$
It is particularly emphasized that the BHE algorithm primarily reconstructs three important non-polynomial functions—GELU, Softmax, and LayerNorm—using approximation techniques. These will be discussed in detail in subsequent sections.
As illustrated in Figure 7, during the standard fine-tuning process, we fine-tuned the pre-trained model M by replacing its non-polynomial functions (GELU and Softmax) with ReLU and the Appro-Softmax model. This was followed by fine-tuning the ALBERT-Mixplus model using labeled short text data from educational sentiment evaluations. After completing the standard fine-tuning, we introduced Appro-LN while retaining the original LN. Given that the original LN involves complex nonlinear operations, the purpose of Appro-LN is to approximate these operations as linear ones. We optimized the parameters by comparing the mean squared error (MSE) between the original LN and Appro-LN. Once the parameters were tuned against the original LN, we removed the original LN, resulting in a model $\bar{M}$ that fully supports homomorphic encryption operations.

3.5.2. Gaussian Error Linear Units (GELU)

In the original ALBERT model, the Gaussian error linear unit (GELU) is adopted as the activation function. However, the Gaussian kernel within the GELU involves exponential operations, and the tanh function in GELU remains non-polynomial. This results in GELU being incompatible with homomorphic encryption (HE) requirements. The original algorithm is illustrated in Equation (16).
$G(x) = 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^{3})\right]\right)$  (16)
G(x) represents the GELU activation function, with x being the input value. To identify a function that can optimally replace GELU, we analyzed the convergence properties of several mainstream activation functions. Among them, ReLU was found to have the closest numerical behavior to the GELU function. When the input value is near zero, the activation results of both functions are very similar, and as the input value becomes significantly larger or smaller, the activation outputs also converge. Apart from the max function, the remaining computations of the ReLU function are well-suited to the requirements of homomorphic encryption.
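The numerical closeness claimed above can be checked in a few lines; this is purely an illustrative comparison of the tanh-based GELU of Equation (16) against ReLU.

```python
# Compare GELU (tanh approximation) and ReLU on a small grid of inputs.
import numpy as np

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-4, 4, 9)
print(np.round(gelu(x), 3))
print(np.round(relu(x), 3))   # close near zero and for large |x|
```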
To implement the Max(·) operation, we incorporated the user-side devices into the computation process. Although this introduces some communication overhead, it also ensures data security. The specific calculation process is detailed in Algorithm 2: the cloud side C sends the encrypted input to the user (local) side U, which performs the Max(·) operation locally. Upon receiving the ciphertext, the user’s device decrypts it, performs the local Max operation, and then returns the re-encrypted result to C.
Algorithm 2: ReLU HE Workflow
1. Split the ReLU computation: $\mathrm{ReLU} \Rightarrow \mathrm{Max}(\cdot) \rightarrow U$  // U denotes the user device
2. $C(E) \rightarrow U(E)$  // the cloud C sends the encrypted input to the user
3. $U(E) \xrightarrow{\text{private key}} U(T)$  // the user device decrypts the received data
4. $U(T) \xrightarrow{\mathrm{ReLU\ Max}} \tilde{U}(T)$  // $\tilde{U}(T)$ denotes the plaintext outcome of the Max operation
5. $\tilde{U}(T) \xrightarrow{\text{public key}} C(\mathrm{ReLU}(E))$  // the re-encrypted result is returned to the cloud C
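The sketch below mirrors Algorithm 2 with TenSEAL as an illustrative CKKS implementation (the paper itself builds on SEAL): the cloud holds only ciphertexts, while the user decrypts, applies the Max part of ReLU locally, and re-encrypts. Keys, parameters, and variable names are assumptions.

```python
# Split ReLU: the non-polynomial Max step runs on the user side in plaintext.
import tenseal as ts

ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40

# Cloud side C: holds only the ciphertext of some intermediate pre-activation.
enc_pre_act = ts.ckks_vector(ctx, [-1.5, 0.3, 2.0])

# User side U: decrypts with the private key, applies Max(0, .), re-encrypts.
plain = enc_pre_act.decrypt()                   # U(E) -> U(T)
relu_plain = [max(0.0, v) for v in plain]       # local Max operation
enc_relu = ts.ckks_vector(ctx, relu_plain)      # re-encrypted with the public key

# enc_relu is returned to the cloud, which continues the homomorphic computation.
```

In a real deployment the cloud would hold a copy of the encryption context without the secret key, so steps 2–5 of Algorithm 2 correspond to serializing and exchanging these ciphertexts over the network.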

3.5.3. Softmax

The second non-polynomial function approximated by the BHE algorithm is Softmax. The Softmax function converts a set of values into a probability distribution, as shown in Equation (17). It includes exponential and division operations, which are complex and not directly supported within the framework of homomorphic encryption (HE). Here, $x_i$ represents the $i$-th element of the input vector, $x_j$ the $j$-th element, $\exp(\cdot)$ the exponential operation, and $\sum_j$ the sum of scores across all classes.
$\mathrm{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j}\exp(x_j)}$  (17)
Direct replacements for the Softmax function, such as Taylor series approximation and Softmax-free linear attention, have been proposed to address computational and privacy concerns. However, these methods exhibit certain limitations. The Taylor series approximation can only approximate the exponential function, which may lead to gradient vanishing issues due to saturation areas when used without cross-entropy loss. Softmax-free linear attention, employing Newton-inverse methods to approximate division, can result in unbounded approximation errors in full-scale attention settings [33].
To enable Softmax operations compatible with homomorphic encryption, we propose an approximation scheme utilizing only addition and multiplication. This scheme comprises linear transformations (matrix multiplications) combined with ReLU activations, aligning with the requirements of the CKKS encryption scheme. The formulation is detailed in Equation (18).
$\text{Appro-Softmax}(x_i) = x_i \times T\left(\sum_{j}\mathrm{ReLU}\left(\frac{x_j}{2} + 1\right)^{3}\right)$  (18)
Since exponential operations and ciphertext division, which the original Softmax function requires, are not supported under the CKKS homomorphic encryption scheme, an approximate replacement is required. First, the exponential function in Equation (17) is approximated by $\mathrm{ReLU}(x_j/2 + 1)^{3}$. Second, a three-layer neural network T, composed only of additions and multiplications, is used to approximate the ciphertext division, i.e., the reciprocal of $\sum_j \mathrm{ReLU}(x_j/2 + 1)^{3}$. Finally, the result is multiplied by $x_i$ to obtain the approximate Softmax formulation in Equation (18).
Similar to the ReLU activation function discussed in Section 3.5.2, the Max(·) operation in ReLU is handled by delegating it to the user side. A three-layer linear neural network T is used to approximate the inverse operation without employing division. To better estimate Softmax, we randomly generate input tensors with values in the range [−3, 3] and use their Softmax scores as the mean squared error (MSE) target. We then optimize T for 100k steps with a learning rate of 1 × 10−3 until the MSE loss falls to 1 × 10−6.
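A hedged PyTorch sketch of how T could be fit is shown below: random inputs in [−3, 3], the Softmax scores as the MSE target, a 1 × 10−3 learning rate, and up to 100k steps follow the text, while the layer widths, batch size, and vector length are assumptions (under encryption, the ReLU inside T would again be split to the user side).

```python
# Train a small three-layer network T so that x_i * T(sum_j ReLU(x_j/2 + 1)^3)
# matches the true Softmax scores (Eq. 18).
import torch
import torch.nn as nn

T = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                  nn.Linear(32, 32), nn.ReLU(),
                  nn.Linear(32, 1))
opt = torch.optim.Adam(T.parameters(), lr=1e-3)

def appro_exp(x):
    return torch.relu(x / 2 + 1) ** 3              # exponential replacement from Eq. (18)

for step in range(100_000):
    x = torch.empty(256, 16).uniform_(-3, 3)       # random input tensors in [-3, 3]
    denom = appro_exp(x).sum(dim=1, keepdim=True)  # sum_j ReLU(x_j/2 + 1)^3
    pred = x * T(denom)                            # Appro-Softmax(x_i), Eq. (18)
    loss = nn.functional.mse_loss(pred, torch.softmax(x, dim=1))
    opt.zero_grad(); loss.backward(); opt.step()
    if loss.item() < 1e-6:                         # stop once the target MSE is reached
        break
```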

3.5.4. LayerNorm

Another operation that requires approximation is LayerNorm. The primary goal of LayerNorm is to normalize the activations of each input sample, ensuring a mean of 0 and a variance of 1. This process can be represented by Equation (19).
$\mathrm{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^{2} + \xi}}$  (19)
x represents the input variable, μ is the mean, $\sigma^{2}$ is the variance, and ξ is a constant. The approximation of LayerNorm (LN) is a crucial step to ensure compatibility with homomorphic encryption (HE). Similar to the GELU and Softmax functions discussed earlier, LN operations are not directly supported in HE. To achieve HE approximation, the BHE algorithm employs a piecewise polynomial fitting approach. The approximation strategy is roughly as follows:
  • Approximation of Mean and Variance: To compute the mean and variance, BHE utilizes an estimation method based on addition and multiplication.
  • Piecewise Polynomial Fitting: BHE approximates the LN operation using a set of precomputed piecewise polynomials. These polynomials are derived through fitting and optimization on extensive training data, providing high approximation accuracy across different input ranges.
  • Iterative Optimization: By training on a large number of samples, the model learns the optimal polynomial parameters, achieving high-precision approximation of LN.
The formula for the final approximated version of LayerNorm is shown in Equation (20).
$\text{Appro-LayerNorm}(x) = \sum_{i=1}^{n} a_i x^{i}$  (20)
$a_i$ represents the polynomial coefficients obtained through training, and x is the input embedding. Following the approximation process illustrated in Figure 7, we use Appro-LN as the LN-Extraction model to extract valuable semantic information from the LN layers. By employing this “simplified replacement” for LayerNorm, BHE effectively implements LayerNorm operations within the homomorphic encryption framework, enabling efficient sentiment classification tasks while preserving data privacy.
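As a hedged sketch, the coefficients $a_i$ of Equation (20) can be fit by minimizing the MSE against the original LayerNorm outputs, as in phase two of Algorithm 1; the polynomial degree, sampling distribution, and optimizer settings below are assumptions.

```python
# Fit polynomial coefficients a_0..a_n so that sum_i a_i * x**i mimics LayerNorm(x).
import torch

torch.manual_seed(0)
d, degree = 768, 3
ln = torch.nn.LayerNorm(d)                              # the original LN to be replaced
a = torch.zeros(degree + 1, requires_grad=True)         # polynomial coefficients
opt = torch.optim.Adam([a], lr=1e-2)

for step in range(2000):
    x = torch.randn(64, d)                              # sampled hidden states
    target = ln(x).detach()                             # original LN output O_k
    approx = sum(a[i] * x ** i for i in range(degree + 1))  # Appro-LN output (Eq. 20)
    loss = torch.nn.functional.mse_loss(approx, target)     # l_k = MSELoss(O_k, O~_k)
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final MSE: {loss.item():.4f}")
```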

3.5.5. The Workflow of BHE-Based Text Sentiment Classification Task

The approximation workflow is detailed in Algorithm 3. Please note that the ReLU activation splitting and interaction process is explained in detail in Algorithm 2. The BHE encryption algorithm relies on the cloud–user secure data transmission network to perform the sentiment classification task. Under the CKKS encryption scheme, when approximating non-polynomial functions such as GELU, Softmax, and LayerNorm with polynomials, additional homomorphic multiplications must be performed. The encrypted data are stored as large integers or polynomials, so each operation involves additions, multiplications, and modular reductions on these large objects. Ciphertext size is large, increasing both transmission overhead and memory usage. Moreover, because data communication occurs on both the cloud and the user sides, objective factors such as network bandwidth and user-device performance introduce further resource costs.
Algorithm 3: BHE-Based Text Sentiment Classification Workflow
Input: plaintext query $P_q$, private key $K_{private}$, public key $K_{public}$, encrypted BHE-based model M.
1. The client constructs the word embedding: $P_q \rightarrow \varsigma_q$.
2. The client encrypts: $\mathrm{Encrypt}(\varsigma_q, K_{public}) \rightarrow C_q$.
3. The cloud runs the BHE-based model: $C_i = M(C_q)$.
4. The client evaluates the activation function: $C_a = \mathrm{ReLU}(C_i)$.
5. The cloud receives the result and continues: $C_o = M(C_a)$.
6. The client decrypts the classification result: $\mathrm{Decrypt}(C_o, K_{private}) \rightarrow P_o$.
In addition, since the tanh function is not supported by homomorphic encryption, all pooling operations have been removed from the ALBERT-Mixplus model. Matrix multiplications are therefore converted into element-wise operations.

4. Experiments

To assess the practical contributions of the BHE+ALBERT-Mixplus model in sentiment classification tasks for short texts within teaching evaluations, we will evaluate its effectiveness using the Stanford Sentiment Treebank (SST) and a proprietary offline dataset of teaching evaluation short texts (TAD-2023). The model’s performance will be rigorously validated across multiple dimensions.

4.1. Experimental Data and Performance Metrics

The experimental data were drawn from both public and offline datasets. The offline data TAD-2023 come from the Teaching Resource Platform of a normal university, comprising student evaluation texts from the 2023–2024 academic year. These texts cover students’ feedback on required courses, general education courses, and elective courses. We manually annotated this corpus to build the fine-tuning training set for the pre-trained model, with each sample consisting of an evaluation text and its corresponding sentiment label. The sentiment labels include five categories: very negative, negative, neutral, positive, and very positive. In terms of preprocessing, we first performed initial cleaning on the texts by removing animated emojis, special characters, and meaningless tokens. Then we normalized the texts through simplified/traditional character conversion and synonym/near-synonym standardization. Next, we applied Jieba for word segmentation, removed any remaining special symbols and stopwords, and annotated each token for part of speech and sentiment. Finally, we constructed the dataset into a standard corpus format. After these steps, we obtained a total of 3122 samples, which were split into training, test, and validation sets at an 8:1:1 ratio.
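As an illustration of the preprocessing and the 8:1:1 split described above, the following sketch uses Jieba and scikit-learn; the sample texts, stopword list, and random seed are placeholders, since the TAD-2023 corpus itself is not public.

```python
# Tokenize placeholder evaluation texts and split them 8:1:1.
import jieba
from sklearn.model_selection import train_test_split

samples = [("老师讲得很好，课堂收获很大", "very positive"),
           ("课程内容太枯燥了", "negative")] * 5          # placeholder (text, label) pairs
stopwords = {"的", "了", "太"}                             # illustrative stopword list

tokenized = [(" ".join(w for w in jieba.lcut(text) if w not in stopwords), label)
             for text, label in samples]

train, rest = train_test_split(tokenized, test_size=0.2, random_state=42)
test, val = train_test_split(rest, test_size=0.5, random_state=42)   # 8:1:1 overall
print(len(train), len(test), len(val))                    # 8, 1, 1
```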
Additionally, as a supplement to the offline dataset, we utilized the SST-2 public dataset. SST-2 is a widely used two-class sentiment analysis dataset created by Stanford University, consisting of movie reviews annotated with sentiment labels. It is a subset of the original SST dataset, with data labeled as either positive or negative, excluding neutral and other fine-grained labels.
The evaluation metrics for the experiments include accuracy, precision, recall, F1, and time cost. The experiments were repeated five times, with the average values reported as the final results.

4.2. Experimental Environment and Parameter Settings

The experimental environment consists of two components: cloud and user. To ensure consistency in data communication during HE encryption, the experimental environment was standardized. The details are provided in Table 1.
To better support HE encryption, the BHE+ALBERT-Mixplus model utilizes the ALBERT-Base model, with the total number of parameters kept at 12 M. The key hyperparameter settings of the integrated model are detailed in Table 2.

4.3. Analysis of Experimental Results

In order to provide a comprehensive evaluation of the performance of the BHE+ALBERT-Mixplus model, a series of comparative experiments were conducted on both the offline teaching evaluation dataset (TAD-2023) and the SST-2 dataset. The performance metrics that were analyzed include accuracy, precision, recall, and F1. The baseline models that were used for comparison encompass a range of different neural network architectures, including sequence-based, attention-based, neural networks that combine sequence and attention mechanisms, and graph-based neural networks.
(A)
CNN [7]: Utilizing convolutional kernels and pooling to extract textual features, CNNs are an early classic model for text sentiment classification.
(B)
BiLSTM [8]: BiLSTM can process both forward and backward word vectors simultaneously and concatenate the final hidden states as global features, effectively capturing contextual semantic dependencies.
(C)
Transformer [9]: Composed of multiple encoders and decoders, the transformer model is based on a self-attention mechanism. It can process sequence data in parallel, significantly enhancing training speed and efficiency.
(D)
BERT [10]: A pre-trained language model based on the transformer architecture, which retains only the Encoder component of the transformer, can understand text context bidirectionally. By pre-training on a large-scale corpus and then fine-tuning on specific tasks, it significantly improves the accuracy of tasks such as question answering, text classification, and named entity recognition.
(E)
BERT-GRU-ATT [15]: A hybrid model that combines BERT, GRU, and attention provides context-aware word representations and captures bidirectional textual information. The GRU processes sequential data to capture temporal dependencies, while attention dynamically focuses on important parts of the sequence, thereby enhancing the model’s ability to capture key information.
(F)
BG-TCA (bidirectional gated temporal convolutional attention) [16]: BG-TCA consists of three models: BERT, GRU, and TCA. It introduces a temporal contextual attention (TCA) mechanism, which enhances the model’s accuracy in handling sentences with complex semantic relationships. This approach is commonly used in various downstream tasks in natural language processing.
(G)
EK-INIT-CNN (Embedding Knowledge Initialization Convolutional Neural Network) [17]: EK-INIT-CNN is specifically designed for sentiment classification tasks involving Weibo (Chinese) text. This model combines embedding knowledge initialization with a CNN-based hybrid approach. At the start of model training, prior knowledge is used to initialize the model, thereby enhancing its learning capability and stability.
(H)
TextGCN (Text Graph Convolutional Network) [18]: TextGCN represents textual data as a graph, where documents or words are treated as nodes, and their relationships (such as co-occurrence or semantic similarity) are treated as edges. Graph convolution operations are used to learn node representations, thereby capturing the complex semantic and structural information in the text data.
(I)
Bert-GCN [19]: Building on TextGCN, the features of graph nodes are initialized using pre-trained BERT, and joint training is employed to obtain feature representations.
Table 3 and Figure 8 present the text sentiment classification results of the ALBERT-Mixplus model and nine baseline models on the TAD-2023 and SST-2 datasets. The ALBERT-Mixplus model achieved the best classification results on both datasets, demonstrating an improvement over other classification models. This indicates the effective value of the integrated encryption model.
Additionally, we also observed that due to the five-class sentiment classification task of the TAD-2023 dataset, which is based on teaching evaluation short texts and has higher performance requirements compared to the two-class classification task of the SST-2 dataset, the performance metrics of the baseline models are generally lower on the TAD-2023 dataset. The ALBERT-Mixplus model also reflects this trend overall. However, an exception is noted in the F1 score, where the five-class teaching evaluation dataset (94.63%) outperforms the two-class sentiment classification dataset (94.32%). This suggests that the attention mechanism following convolutional operations may be better suited for capturing key semantics in teaching evaluation content, indicating that ALBERT-Mixplus might be particularly well-suited for sentiment analysis tasks in teaching evaluation texts.
From the perspective of the baseline models, among sequence-based neural networks, CNNs can only extract local key semantic information, whereas BiLSTM can capture longer semantic dependencies and process input word vectors in both forward and backward directions, providing more semantic value. As a result, BiLSTM generally achieves better classification results than CNN. Both transformer and BERT are based on a self-attention mechanism. BERT, consisting of multiple layers of bidirectional encoders and a greater number of self-attention heads, has learned substantial semantic value during pre-training and exhibits stronger transfer capabilities.
The BERT-GRU-ATT, BG-TCA, and EK-INIT-CNN models are hybrid models that combine sequence and attention mechanisms. Compared to the baseline neural network models such as BERT, CNN, and BiGRU used individually, the hybrid models show improvements in various performance metrics, indicating that they can better capture deeper semantic information. Among these hybrid models, BERT-based models such as BERT-GRU-ATT and BG-TCA outperform EK-INIT-CNN, highlighting the significant value of the BERT model in sentiment classification tasks for teaching evaluation texts. Additionally, this comparison suggests that the EK-INIT-CNN model, which is not a pre-trained model, exhibits weaker generalization capabilities and limited transferability, particularly for sentiment classification tasks on Weibo.
TextGCN and Bert-GCN are graph-based neural network models, and the experimental results demonstrate the beneficial contributions of graph convolution to sentiment classification tasks for teaching evaluation texts. TextGCN captures global relationships between words through graph structures and GCN, making it suitable for handling long texts and complex semantic structures. Bert-GCN, which adds graph convolutional network layers to BERT, combines the strengths of BERT and GCN. The integration of BERT with GCN enhances the model’s ability to capture fine-grained semantic information within sentences, leading to richer semantic representations. This capability is a key reason for the high performance metrics observed in Bert-GCN.

4.3.1. The Impact of the BHE on ALBERT-Mixplus Model

To validate the impact of the BHE algorithm on the model’s classification performance and to explore the gains from the multiple approximate workflows of the BHE algorithm on sentiment analysis, we conducted a comparative experiment using the hybrid model on the TAD-2023 and SST-2 datasets. The experiment followed the hyperparameters outlined in Section 4.2, with all key performance indicators remaining consistent. The two important HE encryption parameters, the polynomial modulus degree (Poly-modulus) and the coefficient modulus bit sizes (Coeff-modulus), were randomly initialized from the sets [1024, 2048, 4096, 8192, 16,384] and [20, 30, 60], respectively. The results of the experiment are shown in Table 4 and Figure 9.
(A)
Baseline: The hybrid model is used without any replacements or approximations, and the original training dataset is employed for fine-tuning the pre-trained model.
(B)
ReLU: The GELU activation function in the hybrid model is replaced with ReLU.
(C)
ReLU-S: In addition to replacing GELU with ReLU, the Softmax function is substituted with the approximate Softmax model, referred to as Appro-Softmax.
(D)
ReLU-S-L: Building on the ReLU-S model, all approximations, including Appro-LN, are applied.
(E)
BHE: The approximate fine-tuned hybrid model is encrypted using the SEAL algorithm library and applied to the sentiment classification task.
Based on the experimental results, the application of the BHE approximation did lead to a decline in various performance metrics for the model. However, the extent of this decline remains within an acceptable range. Analyzing the accuracy metric, there was a decrease of 0.73% on the TAD-2023 dataset and 0.41% on the SST-2 dataset. Among the different approximation replacement models, the most significant performance drop was observed with the Appro-LN model, which led to decreases of 0.30% and 0.15% on the two datasets, respectively. This indicates that there is still room for improvement in the BHE algorithm, particularly within the Appro-LN model. The smallest impact on performance was observed with the replacement of the activation function by ReLU. Experiments on both datasets demonstrate that ReLU exhibits similar performance to the GELU activation function in both sentiment five-class and two-class classification tasks.

4.3.2. The Impact of Each Component in the Hybrid Feature Extraction Layer on the ALBERT-Mixplus Model

In the previous section, we validated the positive impact of the BHE algorithm on the ALBERT-Mixplus model. In this section, we will conduct ablation experiments to further assess the contributions of each functional component within the hybrid feature extraction layer under the support of the BHE algorithm. Using the same hyperparameters as in the previous experiments, we will compare the effects of the BiGRU, MCNN-MCO, and MCNN-SA components. The results for the accuracy metric are illustrated in Figure 10.
(A)
Baseline: ALBERT-Mixplus: Only approximate homomorphic operations are performed, with no additional modifications made to the hybrid feature extraction layer.
(B)
w/o BiGRU: Removal of the BiGRU component from the hybrid feature extraction layer, while retaining the MCNN-MCO and MCNN-SA models.
(C)
w/o MCNN-SA: Removal of the MCNN-SA component from the hybrid feature extraction layer, while retaining the BiGRU and MCNN-MCO components.
(D)
w/o MCNN-MCO: Removal of the multi-scale convolution operations from the MCNN component in the hybrid feature extraction layer, while retaining the BiGRU and MCNN-SA components.
(E)
w/o BiGRU and MCNN-SA: Removal of both the BiGRU and MCNN-SA components from the hybrid feature extraction layer, while retaining the MCNN-MCO component, i.e., ALBERT + MCO.
(F)
w/o BiGRU and MCNN-MCO: Removal of both the BiGRU and MCNN-MCO components from the hybrid feature extraction layer, i.e., ALBERT + MCNN-SA.
(G)
w/o MCNN-MCO and MCNN-SA: Removal of both the MCNN-MCO and MCNN-SA components from the hybrid feature extraction layer, i.e., ALBERT + BiGRU.
(H)
ALBERT: Removal of the entire hybrid feature extraction layer, retaining only the ALBERT model.
Based on the experimental results presented in Figure 10, the baseline model improves accuracy by 2.27% and 1.57% over the plain ALBERT model on the SST-2 and TAD-2023 datasets, respectively, confirming the positive contribution of the functional components in the ALBERT-Mixplus hybrid feature extraction layer. In particular, the MCNN-MCO component has the largest impact on the hybrid model’s performance, in both single-component and dual-component ablations, which indicates that the multi-scale convolution kernels of MCNN-MCO effectively extract features at different granularities and contribute substantially to overall performance.
Furthermore, when both the BiGRU and MCNN-MCO components are removed and only the MCNN-SA component remains, there is no significant improvement over the plain ALBERT model. This suggests that, without BiGRU and MCO, the deep semantic extraction capability is diminished, and the standalone self-attention mechanism adds little on its own: the semantic enhancement provided by self-attention is largely already achieved inside the ALBERT encoder, so simply stacking an additional SA module does not meaningfully benefit the classification task.
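To make the ablated components concrete, the following is a minimal PyTorch sketch of a hybrid feature extraction layer that combines a BiGRU branch, a multi-scale convolution (MCO) branch, and a self-attention (SA) branch on top of the ALBERT token representations. The dimensions follow Table 2, but the class name HybridFeatureLayer and the fusion strategy (mean/max pooling followed by concatenation) are illustrative assumptions, not the exact wiring of ALBERT-Mixplus.

```python
import torch
import torch.nn as nn

class HybridFeatureLayer(nn.Module):
    """Illustrative BiGRU + multi-scale CNN (MCO) + self-attention (SA) block.

    Dimensions follow Table 2; the pooling-and-concatenation fusion is an
    assumption, not the paper's exact design. n_classes=5 targets the
    five-class TAD-2023 task.
    """

    def __init__(self, hidden=768, gru_hidden=256, n_filters=128,
                 kernel_sizes=(2, 3, 4), sa_dim=512, dropout=0.5, n_classes=5):
        super().__init__()
        # BiGRU branch: global contextual dependencies.
        self.bigru = nn.GRU(hidden, gru_hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # MCNN-MCO branch: multi-scale local features.
        self.convs = nn.ModuleList(
            nn.Conv1d(hidden, n_filters, k) for k in kernel_sizes)
        # MCNN-SA branch: self-attention over token representations.
        self.sa_proj = nn.Linear(hidden, sa_dim)
        self.self_attn = nn.MultiheadAttention(sa_dim, num_heads=8,
                                               batch_first=True)
        self.dropout = nn.Dropout(dropout)
        fused = 2 * gru_hidden + n_filters * len(kernel_sizes) + sa_dim
        self.classifier = nn.Linear(fused, n_classes)

    def forward(self, albert_out):            # (batch, seq_len, hidden)
        gru_out, _ = self.bigru(albert_out)    # (batch, seq_len, 2*gru_hidden)
        gru_feat = gru_out.mean(dim=1)

        conv_in = albert_out.transpose(1, 2)   # (batch, hidden, seq_len)
        conv_feats = [torch.relu(conv(conv_in)).max(dim=2).values
                      for conv in self.convs]

        sa_in = self.sa_proj(albert_out)
        sa_out, _ = self.self_attn(sa_in, sa_in, sa_in)
        sa_feat = sa_out.mean(dim=1)

        fused = torch.cat([gru_feat, *conv_feats, sa_feat], dim=1)
        return self.classifier(self.dropout(fused))
```

Under this sketch, ablating a branch simply corresponds to dropping its pooled features from the concatenation, which is how the w/o variants listed above can be realized.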

4.3.3. The Impact of ALBERT on the Performance of the Hybrid Model in Standard Fine-Tuning

To validate the beneficial contribution of ALBERT in standard fine-tuning, we designed experiments on two datasets. The experimental setup and hyperparameters follow those outlined in Section 4.2. We compared ALBERT-Base, BERT-Base, and BERT-Large in terms of training time and accuracy metrics for standard fine-tuning tasks. Notably, aside from the baseline model, none of the comparison models underwent BHE approximate homomorphic operations in this experiment. The Hidden Layer Dimension and Maximum Sequence Length were standardized across models. The comparison of model hyperparameters is presented in Table 5, and the experimental results are shown in Figure 11.
(A)
Baseline: This refers to the proposed BHE+ALBERT-Mixplus model. The meta-information extraction layer utilizes the ALBERT-Base version, with approximate homomorphic fine-tuning training conducted as described in Section 3.5.1.
(B)
BERT-Base: The meta-information extraction layer employs the BERT-Base model, while the hybrid feature extraction layer remains unchanged. The BERT-Base model has the same hidden layer and attention head configurations as the ALBERT-Base model but has approximately nine times the number of parameters.
(C)
BERT-Large: The meta-information extraction layer uses the BERT-Large model, with the hybrid feature extraction layer unchanged. The BERT-Large model includes 24 hidden layers and 16 attention heads, making it suitable for handling larger datasets and more complex tasks. It has approximately three times the number of parameters of the BERT-Base model.
(D)
ALBERT-Base: The meta-information extraction layer utilizes the ALBERT-Base model, with the hybrid feature extraction layer remaining unchanged. ALBERT-Base leverages cross-layer parameter sharing, which significantly reduces the total number of parameters while maintaining performance.
The experimental results indicate that, owing to cross-layer parameter sharing, the ALBERT-Base model has far fewer parameters. Consequently, the meta-information extraction layer based on ALBERT-Base incurs significantly lower time overhead on both datasets than the BERT-Base and BERT-Large variants, further demonstrating the efficiency of the ALBERT-Mixplus hybrid model.
Comparing against the baseline, we observe that the relatively complex symmetric approximation workflow of BHE introduces additional time overhead during standard fine-tuning of the baseline model. This time cost remains a limitation and will be a focus of future improvement.
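As a quick sanity check on the parameter counts in Table 5, the backbones can be compared directly. The Hugging Face checkpoint names below are assumptions about the public weights corresponding to ALBERT-Base, BERT-Base, and BERT-Large; the paper does not specify which checkpoints were used.

```python
# Rough parameter-count comparison of the three backbones (sketch).
from transformers import AutoModel

def millions_of_params(name: str) -> float:
    model = AutoModel.from_pretrained(name)
    return sum(p.numel() for p in model.parameters()) / 1e6

for name in ("albert-base-v2", "bert-base-uncased", "bert-large-uncased"):
    print(f"{name}: ~{millions_of_params(name):.0f}M parameters")
# Expected on the order of 12M / 110M / 340M, in line with Table 5.
```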

4.3.4. The Effect of Weight Decay

In this section, we discuss the attention-overflow issue that arises before layer normalization in the BHE+ALBERT-Mixplus fused model. Attention overflow occurs because the attention scores in the approximate homomorphic BHE algorithm may not be properly constrained, leading to numerical instability. Although the output of multi-head attention typically lies within [−1, 1], the pre-normalization attention scores can occasionally grow to extreme magnitudes on the order of 1 × 10^4. Such extreme values cause difficulties during the LN-Extraction phase.
To mitigate attention overflow, this study applies weight decay with the RAdam optimizer as a regularization technique. Increasing the weight decay encourages the attention scores to remain bounded, which improves the quality of the approximate results.
Figure 12 illustrates the attention-overflow phenomenon on both datasets. Without regularization, our approximation method produces uncontrolled attention scores and suboptimal performance; as the weight decay is increased, the attention scores converge and the approximation results improve.
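The following is a minimal sketch of the regularization described above, assuming the RAdam implementation that ships with PyTorch 1.10 (Table 1). The weight-decay sweep values are illustrative, not the paper’s exact grid; the learning rate follows Table 2.

```python
import torch

def make_optimizer(model: torch.nn.Module, weight_decay: float):
    # RAdam [34] with explicit weight decay; lr follows Table 2.
    return torch.optim.RAdam(model.parameters(), lr=2e-5,
                             weight_decay=weight_decay)

# Illustrative sweep: larger weight decay keeps the pre-LayerNorm attention
# scores bounded, which stabilizes the LN-Extraction approximation.
# for wd in (0.0, 0.01, 0.05, 0.1):
#     optimizer = make_optimizer(model, weight_decay=wd)
#     ...fine-tune and monitor the attention-score range...
```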

5. Conclusions

This paper addresses the task of short-text sentiment classification for teaching evaluations by proposing a fusion model that integrates the BHE distributed symmetric approximate homomorphic encryption algorithm, namely BHE+ALBERT-Mixplus. By approximating certain operations within the ALBERT-Mixplus model and leveraging the cloud–user distributed interaction paradigm, we deploy the hybrid model in the cloud under the CKKS encryption scheme, thereby ensuring data security in a cloud-computing environment. While protecting user privacy, the model also achieves a lightweight design, high accuracy, and computational security. Experimental results on the TAD-2023 and SST-2 datasets demonstrate that our model outperforms mainstream models at the cost of only a modest increase in computational overhead.
Our approach offers valuable insights for designing similar privacy-preserving models. For example, using a lightweight ALBERT-Base model significantly reduces parameter size and computational cost, making it suitable for resource-constrained environments. The multi-scale fusion and multi-head attention mechanisms maintain high-quality feature extraction even with a smaller model. Meanwhile, the BHE encryption scheme offloads the main computational burden to the cloud, requiring only simple encryption and decryption operations on the user side, enabling deployment on devices with limited computing power. These methods and design principles can serve as useful references for other researchers building privacy-preserving NLP systems.
Regarding fairness and sustainability, the model ensures consistent privacy protection for all users through uniform encryption processing. In addition, the ALBERT-Base model contains only about one-ninth as many parameters as the standard BERT-Base model, significantly reducing model size and memory usage and thereby lowering energy consumption during inference. In our fine-tuning experiments, the ALBERT backbone reduced training-time overhead by 45–50% compared with BERT-Base, offering useful guidance for building green, low-carbon AI systems. In terms of fairness, our combination of AI and data privacy ensures that every student can participate in course evaluation on equal terms, reflecting the principles of educational equity and inclusive education.
In future work, we intend to build upon our approximate encryption techniques for BERT models and extend them to the domain of multimodal sentiment classification by incorporating audio and video data from instructional settings. By integrating cutting-edge findings from the field of educational science, we aim to establish a more diverse and scientifically rigorous teaching-evaluation framework, thereby making a positive contribution to the advancement of educational equity [35].

Author Contributions

Conceptualization, J.Z.; Methodology, S.S.M.; Software, J.Z.; Formal analysis, J.Z. and D.A.D.; Data curation, D.A.D.; Writing—original draft, J.Z.; Project administration, J.Z.; Funding acquisition, S.S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Shandong Province Higher Education Research Project in Philosophy and Social Sciences (2024ZSZX305); INTI International University and Colleges (INTI-FDSIT-01-15-2024); and the Key Project of Undergraduate Teaching Reform at Shandong Normal University (2024ZJ37).

Data Availability Statement

The datasets presented in this article are not readily available because student teaching evaluation data contains sensitive personal information, and therefore cannot be publicly disclosed to protect student privacy. Requests to access the datasets should be directed to zjr@sdnu.edu.cn.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Liu, K.; Feng, Y.; Zhang, L.; Wang, R.; Wang, W.; Yuan, X.; Cui, X.; Li, X.; Li, H. An effective personality-based model for short text sentiment classification using BiLSTM and self-attention. Electronics 2023, 12, 3274. [Google Scholar] [CrossRef]
  2. Islam, M.S.; Kabir, M.N.; Ghani, N.A.; Zamli, K.Z.; Zulkifli, N.S.A.; Rahman, M.M. Challenges and future in deep learning for sentiment analysis: A comprehensive review and a proposed novel hybrid approach. Artif. Intell. Rev. 2024, 57, 62. [Google Scholar] [CrossRef]
  3. Sivakumar, S.; Rajalakshmi, R. Context-aware sentiment analysis with attention-enhanced features from bidirectional transformers. Soc. Netw. Anal. Min. 2022, 12, 104. [Google Scholar] [CrossRef]
  4. Xu, W.; Chen, J.; Ding, Z.; Wang, J. Text sentiment analysis and classification based on bidirectional gated recurrent units (grus) model. arXiv 2024, arXiv:2404.17123. [Google Scholar] [CrossRef]
  5. Esuli, A.; Sebastiani, F. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, 22–28 May 2006; Volume 6, pp. 417–422. [Google Scholar]
  6. Pang, B.; Lee, L. Opinion mining and sentiment analysis. Found. Trends® Inf. Retr. 2008, 2, 1–135. [Google Scholar] [CrossRef]
  7. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar, 25–29 October 2014; pp. 1746–1751. [Google Scholar]
  8. Xiao, Z.; Liang, P.J. Chinese sentiment analysis using bidirectional LSTM with word embedding. In Proceedings of the Cloud Computing and Security: Second International Conference, ICCCS 2016, Nanjing, China, 29–31 July 2016; Revised Selected Papers, Part II 2. pp. 601–610. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  10. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  11. Liu, Z.; Lin, W.; Shi, Y.; Zhao, J. A robustly optimized BERT pre-training approach with post-training. In Proceedings of the 20th China National Conference on Chinese Computational Linguistics, Huhhot, China, 13–15 August 2021; Springer International Publishing: Cham, Switzerland, 2021; pp. 471–484. [Google Scholar]
  12. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. Albert: A lite bert for self-supervised learning of language representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
  13. Zhao, J.; Zhang, Z.; Chen, K. A Unified Framework for Multimodal Sentiment Analysis Using BERT and DINOv2. arXiv 2025, arXiv:2503.07943. [Google Scholar]
  14. Ananthi, D.; Prakash, A. An Attention-Enhanced Bidirectional Long Short-Term Memory with Contextual Semantic Knowledge for Enhancing Sentiment Analysis. Int. J. Intell. Eng. Syst. 2025, 18, 947–962. [Google Scholar]
  15. Zhao, D.; Huang, D.; Meng, J.; Dong, Y.; Zhang, P. Chinese Entity Relations Classification Based on BERT-GRU-ATT. Comput. Sci. 2022, 49, 319–325. [Google Scholar]
  16. Guo, X.; Lai, H.; Yu, Z.; Gao, S.; Xiang, Y. Emotion classification of case-related microblog comments integrating emotional knowledge. Chin. J. Comput. 2021, 44, 564–578. [Google Scholar]
  17. Ren, J.; Wu, W.; Liu, G.; Chen, Z.; Wang, R. Bidirectional gated temporal convolution with attention for text classification. Neurocomputing 2021, 455, 265–273. [Google Scholar] [CrossRef]
  18. Yao, L.; Mao, C.; Luo, Y. Graph convolutional networks for text classification. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 7370–7377. [Google Scholar]
  19. Li, Y.; Zuo, X.; Zuo, W.; Liang, S.; Zhu, Y.; Zhu, Y. Causality extraction based on BERT-GCN. J. Jilin Univ. Sci. Ed. 2023, 61, 325–330. [Google Scholar]
  20. Subramaniyaswamy, V.; Jagadeeswari, V.; Indragandhi, V.; Jhaveri, R.H.; Vijayakumar, V.; Kotecha, K.; Ravi, L. Somewhat homomorphic encryption: Ring learning with error algorithm for faster encryption of IoT sensor signal-based edge devices. Secur. Commun. Netw. 2022, 2022, 2793998. [Google Scholar] [CrossRef]
  21. Serengil, S.; Ozpinar, A. Encrypted Vector Similarity Computations Using Partially Homomorphic Encryption: Applications and Performance Analysis. arXiv 2025, arXiv:2503.05850. [Google Scholar]
  22. Pan, Y.; Chao, Z.; He, W.; Jing, Y.; Hongjia, L.; Liming, W. FedSHE: Privacy preserving and efficient federated learning with adaptive segmented CKKS homomorphic encryption. Cybersecurity 2024, 7, 40. [Google Scholar] [CrossRef]
  23. Panzade, P.; Takabi, D.; Cai, Z. MedBlindTuner: Towards Privacy-Preserving Fine-Tuning on Biomedical Images with Transformers and Fully Homomorphic Encryption. In AI for Health Equity and Fairness: Leveraging AI to Address Social Determinants of Health; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 197–208. [Google Scholar]
  24. Hamza, R.; Hassan, A.; Ali, A.; Bashir, M.B.; Alqhtani, S.M.; Tawfeeg, T.M.; Yousif, A. Towards secure big data analysis via fully homomorphic encryption algorithms. Entropy 2022, 24, 519. [Google Scholar] [CrossRef]
  25. Jia, H.; Cai, D.; Yang, J.; Qian, W.; Wang, C.; Li, X.; Yang, S. Efficient and privacy-preserving image classification using homomorphic encryption and chunk-based convolutional neural network. J. Cloud Comput. 2023, 12, 175. [Google Scholar] [CrossRef]
  26. Rovida, L.; Leporati, A. Encrypted image classification with low memory footprint using fully homomorphic encryption. Int. J. Neural Syst. 2024. [Google Scholar] [CrossRef]
  27. Kim, M.; Song, Y.; Xia, Y.; Zhan, J. PrivFT: Private and Fast Text Classification with Homomorphic Encryption. arXiv 2021, arXiv:2101.12345. [Google Scholar]
  28. Zhang, T.; He, Z.; Lee, R.B. Privacy-preserving machine learning through data obfuscation. arXiv 2018, arXiv:1807.01860. [Google Scholar]
  29. Florencio, M.; Alencar, L.; Lima, B. An End-to-End Homomorphically Encrypted Neural Network. arXiv 2025, arXiv:2502.16176. [Google Scholar]
  30. Song, T.; Li, Y.; Meng, F.; Xie, P.; Xu, D. A novel deep learning model by Bigru with attention mechanism for tropical cyclone track prediction in the Northwest Pacific. J. Appl. Meteorol. Climatol. 2022, 61, 3–12. [Google Scholar] [CrossRef]
  31. Cui, Z.; Chen, W.; Chen, Y. Multi-scale convolutional neural networks for time series classification. arXiv 2016, arXiv:1603.06995. [Google Scholar]
  32. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic encryption for arithmetic of approximate numbers. In Proceedings of the Advances in Cryptology—ASIACRYPT 2017: 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, 3–7 December 2017; Part I 23. pp. 409–437. [Google Scholar]
  33. Lu, J.; Yao, J.; Zhang, J.; Zhu, X.; Xu, H.; Gao, W.; Xu, C.; Xiang, T.; Zhang, L. Soft: Softmax-free transformer with linear complexity. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Online, 6–14 December 2021; Volume 34, pp. 21297–21309. [Google Scholar]
  34. Liu, L.; Jiang, H.; He, P.; Chen, W.; Liu, X.; Gao, J.; Han, J. On the variance of the adaptive learning rate and beyond. arXiv 2019, arXiv:1908.03265. [Google Scholar]
  35. Ting, T.T.; Lim, E.T.S.; Lee, J.; Wong, J.S.; Tan, J.H.; Tam, R.C.M.; Chaw, J.K.; Aitizaz, A.; Teoh, C.K. Educational big data mining: Mediation of academic performance in crime among digital age young adults. Online J. Commun. Media Technol. 2024, 14, e202403. [Google Scholar] [CrossRef]
Figure 1. Overview of cloud–user structure.
Figure 2. Overview of FHE procedure.
Figure 3. An overview of BERT model.
Figure 4. BHE+ALBERT-Mixplus model.
Figure 5. Encoder detail of BHE+ALBERT-Mixplus model.
Figure 6. The structure of the MCNN model.
Figure 7. The symmetrical approximation workflow of BHE.
Figure 8. Comparison of classification effects of different models.
Figure 9. 3D performance metrics display of BHE approximation models.
Figure 10. The effect of each component of ALBERT-Mixplus.
Figure 11. Time cost of different BERT models in standard fine-tuning.
Figure 12. The effect of weight decay values on model performance.
Table 1. Experimental parameter settings.

| Model Parameters | Values |
|---|---|
| Operating System | Ubuntu 20.04 |
| Graphics Card | NVIDIA RTX 2080 Super |
| GPU Memory | 12 GB |
| Deep Learning Framework | PyTorch 1.10.0 |
| Python Version | 3.9 |
| SEAL-Py | 4.1 |
Table 2. Hyperparameter settings of BHE+ALBERT-Mixplus.

| Model Parameters | Values |
|---|---|
| ALBERT Hidden Layer Dimension | 768 |
| Number of ALBERT Hidden Layers | 12 |
| Number of ALBERT Attention Heads | 12 |
| Learning Rate | 2 × 10^−5 |
| BiGRU Hidden Layer Dimension | 256 |
| BiGRU Layers | 2 |
| Convolutional Kernel Sizes | (2, 3, 4) |
| CNN Feature Maps | 128 |
| Self-Attention Dimension | 512 |
| Maximum Sequence Length | 128 |
| Batch Size | 32 |
| Epochs | 8 |
| Dropout | 0.5 |
| Optimizer | RAdam [34] |
Table 3. Effects comparison of different models.

| Model | TAD-2023 Accuracy | TAD-2023 Precision | TAD-2023 Recall | TAD-2023 F1 | SST-2 Accuracy | SST-2 Precision | SST-2 Recall | SST-2 F1 |
|---|---|---|---|---|---|---|---|---|
| CNN | 81.62% | 81.40% | 80.83% | 81.11% | 83.47% | 82.13% | 81.52% | 81.83% |
| BiLSTM | 82.92% | 82.66% | 83.22% | 82.94% | 85.23% | 84.47% | 83.85% | 84.24% |
| Transformer | 84.12% | 85.79% | 83.42% | 84.59% | 89.15% | 88.52% | 88.36% | 88.51% |
| BERT | 91.13% | 91.48% | 90.14% | 90.81% | 92.87% | 92.35% | 92.28% | 92.30% |
| BERT-GRU-ATT | 93.11% | 92.91% | 92.15% | 92.53% | 93.68% | 93.14% | 93.07% | 93.02% |
| BG-TCA | 90.20% | 90.55% | 91.66% | 91.10% | 94.35% | 93.82% | 93.76% | 93.84% |
| EK-INIT-CNN | 89.62% | 89.17% | 89.33% | 89.25% | 90.73% | 90.73% | 93.12% | 91.89% |
| TextGCN | 91.15% | 90.66% | 92.12% | 91.38% | 86.74% | 86.21% | 86.19% | 86.20% |
| Bert-GCN | 94.13% | 93.72% | 92.91% | 93.31% | 93.42% | 92.89% | 92.83% | 92.43% |
| BHE+ALBERT-Mixplus | 95.16% | 94.50% | 94.77% | 94.63% | 95.48% | 94.95% | 94.88% | 94.32% |
Table 4. The effects of the BHE approximation model.

| Model | TAD-2023 Accuracy | TAD-2023 Precision | TAD-2023 Recall | TAD-2023 F1 | SST-2 Accuracy | SST-2 Precision | SST-2 Recall | SST-2 F1 |
|---|---|---|---|---|---|---|---|---|
| Baseline | 95.93% | 95.54% | 95.79% | 95.42% | 96.55% | 96.13% | 97.11% | 96.50% |
| ReLU | 95.86% | 95.42% | 95.68% | 95.55% | 96.46% | 96.00% | 97.02% | 96.51% |
| ReLU-S | 95.61% | 95.20% | 95.42% | 95.31% | 96.37% | 95.88% | 96.89% | 96.38% |
| ReLU-S-L | 95.31% | 94.93% | 95.14% | 95.03% | 96.22% | 95.81% | 96.54% | 96.17% |
| BHE | 95.20% | 94.81% | 94.91% | 94.74% | 96.14% | 95.74% | 96.37% | 95.86% |
| Perf↓ | 0.73% | 0.73% | 0.88% | 0.68% | 0.41% | 0.39% | 0.74% | 0.64% |
Table 5. Parameter settings comparison of ALBERT and BERT.

| Model Parameters | ALBERT-Base | BERT-Base | BERT-Large |
|---|---|---|---|
| Hidden Layer Dimension | 768 | 768 | 768 |
| Number of Hidden Layers | 12 | 12 | 24 |
| Number of Attention Heads | 12 | 12 | 16 |
| Maximum Sequence Length | 128 | 128 | 128 |
| Total Parameters | 12 M | 108 M | 340 M |