1. Introduction
Within academic research, peer review constitutes an indispensable mechanism for safeguarding the integrity and advancement of scholarship [1]. It not only upholds academic standards but also cultivates critical thinking and refines research through expert feedback, making it a cornerstone of academic development [2]. This rigorous process subjects scholarly work to the scrutiny of domain experts to validate its reliability, validity, and originality. However, traditional peer-review methodologies, which rely predominantly on manual assessment and qualitative analysis, are becoming increasingly inadequate: they are overwhelmed by the growing volume and complexity of scientific production, a challenge exacerbated by the demand for faster and broader dissemination [3]. Bauchner [4] accordingly argued that artificial intelligence should be used to assist peer-review publication.
Rapid advances in natural language processing (NLP), particularly the emergence of deep learning models, have opened novel avenues for enhancing peer-review methodologies [5]. Modern NLP facilitates the automated processing of large quantities of textual data, enables quantitative analysis, and uncovers complex underlying patterns [6]. For instance, by leveraging automated text classification and sentiment analysis, NLP can differentiate between constructive suggestions and critical commentary, as well as identify emotional nuances within reviewer feedback [7]. This capability holds significant practical value for graduate students in refining dissertation writing and effectively addressing reviewer critiques. Nevertheless, conventional NLP approaches often fail to capture the intricate semantic relationships and contextual dependencies inherent in review texts, resulting in superficial analyses that overlook latent evaluation patterns and substantive quality indicators embedded in expert feedback.
To overcome these limitations, this study draws inspiration from the concept of sensors, prevalent in industrial process control, to reframe automated review analysis for dissertations. The approach positions the soft-sensor as an essential supplement that extends data acquisition beyond physical hardware, targeting the wealth of data from software platforms. This academic soft-sensor aims to infer latent, difficult-to-quantify dimensions of scholarly quality from these digital streams. The proposed interpretable BERT-based soft-sensor framework thus plays a critical role in building a comprehensive data-driven educational system by transforming subjective textual feedback into quantifiable assessments. Furthermore, by integrating SHAP (Shapley Additive exPlanations), the framework provides a calibration and output interface, translating the model’s complex inferences into human-interpretable insights and thereby bridging the gap between black-box NLP models and actionable evaluative principles.
The key contributions of this study are summarized as follows:
An enhanced BERT encoder architecture is proposed as the core inference engine of the soft-sensor. By integrating an additional self-attention layer, it significantly improves the ability to capture critical textual features, with experimental results demonstrating substantial gains in performance metrics compared to traditional baselines.
SHAP value analysis is introduced as an interpretability interface for the soft-sensor, enabling quantitative identification of key dimensions influencing evaluation outcomes in peer-reviewed texts. This approach offers novel insights into the reviewer decision-making process.
The soft-sensor framework is leveraged to systematically quantify factors prioritized by reviewers during assessment, providing actionable insights for manuscript improvement while supporting data-driven research into academic peer-review mechanisms.
The rest of our paper is organized as follows. Section 2 delves into the foundational concepts, including BERT, self-attention mechanisms, and SHAP. Section 3 critically reviews the related literature. Section 4 elaborates on the construction of the interpretable soft-sensor, detailing its implementation from data collection and preprocessing to text classification using BERT and significance analysis of words via SHAP. Section 5 evaluates our proposed framework through various metrics, performance comparisons, and detailed result analysis. Section 6 discusses the limitations of this work and future directions for research. Finally, Section 7 provides a comprehensive summary of the paper.
3. Related Work
3.1. Peer Review
Peer review is essential in graduate education as a critical mechanism of academic quality control. However, with the expanding scale of graduate education, traditional paper-review methods face challenges in efficiency and objectivity, and the academic community has shown strong interest in improving the peer-review process in recent years. Li [14] employed three tools (LIWC, SentimentR, and Stanford CoreNLP) to evaluate the emotional tones of peer-review comments. Additionally, Buljan et al. [15] utilized text analysis software to conduct a thorough analysis of 472,449 peer-review reports, examining multiple dimensions including tone, authenticity, impact, emotions, and ethics. With the advent of large language models (LLMs), Kousha et al. [16] conducted a systematic review and discussion on how LLMs can enable the partial or complete automation of tasks related to paper review processes. Meanwhile, Jin et al. [17] developed a peer-review simulation framework that leverages LLMs, effectively decoupling the influences of multiple potential factors and offering valuable insights for improving the design of peer-review mechanisms.
In an era of rapidly developing information technology, exploring how to combine AI technology with open science effectively is an essential direction for the future development of peer review.
3.2. Applications of NLP
NLP technology has been widely used in text analysis, covering many aspects from basic text processing to complex semantic analysis. With the development of machine learning, and of deep learning in particular, NLP applications in higher education have become increasingly sophisticated.
Research by Kastrati et al. [18] shows that sentiment analysis techniques have been widely used to analyze student feedback on learning platforms, although the field still faces challenges posed by informal and diverse student language and by the processing of large volumes of data, especially when applying deep learning. Further, the review by Wu et al. [19], covering 2480 studies, organizes the use of NLP in education into five domains by applying neural topic modeling and pre-trained language models; it highlights a shift from technology-centric to problem-oriented research topics, suggesting that NLP techniques are increasingly tailored to specific challenges in education. In addition, in the academic peer-review setting, the opposition-sentence attention (OSA) mechanism introduced by Lin et al. [20] significantly improved the prediction accuracy of overall peer-review recommendations by prioritizing sentences with strong opposition cues identified through positive-unlabeled learning.
These studies advance our understanding of educational feedback and interaction and drive innovation in educational research methods and applications, highlighting the potential value of NLP in educational progress.
With the development of pre-trained language models such as BERT, the application of NLP in text analysis has reached new heights, further enhancing the ability to analyze education-related texts and facilitating performance improvements on multiple tasks.
3.3. BERT in Text Classification
Text classification, a fundamental task in NLP, has undergone a transformative shift with the advent of pre-trained language models, particularly BERT. This revolutionary technique has emerged as a cornerstone for enhancing text classification performance, opening up new possibilities in the field.
BERT models, with their capacity for fine-grained classification, find practical applications in tasks such as sentiment analysis and intent recognition. Their effectiveness has been empirically validated in domains such as movie review analysis and consumer feedback sentiment analysis, as demonstrated by their performance on benchmark datasets like IMDb and Yelp [21,22]. Sun et al. [23] further validate this by showcasing BERT's accuracy in capturing emotional changes and subtle tendencies in text.
Furthermore, compared with traditional Word2Vec word embeddings, BERT shows advantages in feature extraction, feature fusion, and model generalization; Lang Cong's research [24] demonstrates this experimentally. At the same time, Zheng et al. [25] combined BERT with deep learning for sentiment analysis of microblog text, achieving better results than traditional deep learning methods.
BERT's high performance is not limited to standalone applications. When combined with networks such as BiLSTM and BiGRU, the resulting hybrid models are particularly effective in capturing complex text sentiment [26,27]. These hybrid models combine BERT's deep semantic understanding with the sequence-processing ability of LSTM/GRU networks to significantly improve sentiment classification accuracy.
BERT is also notable in academic applications. Through pre-training, it can understand complex academic language and terminology and effectively identify key information such as the quality of academic papers, the novelty of the research, and the rigor of the methods, providing important support for editors and researchers in reviewing and screening academic papers.
4. Methodology
This section introduces the datasets and models used. After that, we describe the approach used to analyze word significance.
Figure 3 presents the overall architecture of the soft-sensor framework.
4.1. Original Signal Data
This study uses a dataset of 15,923 master's thesis evaluation records collected over the past five years at a single university, covering 28 faculties and 171 disciplines.
Each evaluation record in this dataset originates from manual annotation by domain experts and forms a rich, multidimensional assessment. The data structure includes both quantitative scores and categorical grades. Specifically, each thesis was graded on a categorical scale (A–D) for its overall quality, with supporting scores across several critical dimensions: topic selection, foundational theory and specialized knowledge, scientific research capability and technological innovation, and thesis standardization. Crucially, the dataset also contains unstructured textual feedback from experts, comprising both academic comments that summarize strengths and sections on shortcomings and suggestions for improvement.
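For concreteness, a single record can be pictured as the following structure; the field names below are hypothetical illustrations for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ThesisEvaluationRecord:
    """Hypothetical sketch of one evaluation record (field names are illustrative)."""
    overall_grade: str          # categorical grade: "A", "B", "C", or "D"
    topic_selection: float      # score for topic selection
    theory_knowledge: float     # foundational theory and specialized knowledge
    research_innovation: float  # scientific research capability and technological innovation
    standardization: float      # thesis standardization
    strengths_comment: str      # academic comments summarizing strengths
    weaknesses_comment: str     # shortcomings and suggestions for improvement
    discipline: str             # one of the 171 disciplines
```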
To facilitate subsequent analysis, this study grouped the data into six main subject categories (engineering, management, economics, education, jurisprudence, and science) plus a small number of additional categories; the distribution of subjects is shown in Figure 4. Each evaluation record is labeled with a categorical grade from A to D. The evaluations were conducted by experts who assessed various aspects of the theses, including topic selection, innovation, theoretical and specialized knowledge, and scientific research and writing proficiency. The data are authentic and comprehensive, providing a robust foundation for analysis. This corpus of review texts serves as the raw input signal for our academic soft-sensor.
4.2. Signal Preprocessing
To ensure the quality and consistency of the input signal fed into the sensing core, the raw text data undergo a series of preprocessing and standardization procedures. These steps are designed to facilitate the construction of a BERT-based model for academic paper review, with the overall preprocessing pipeline illustrated in Figure 5. Initially, text tokenization is performed using the jieba segmentation tool to decompose the text into discrete words. Subsequently, non-informative stopwords are removed based on the comprehensive Baidu stopwords list, eliminating high-frequency yet low-information terms. This step reduces textual noise and highlights meaningful features, thereby enhancing the model's ability to comprehend and predict textual content accurately.
Furthermore, label encoding is applied to convert the original paper grades ('A', 'B', 'C', 'D') into binary classification labels: grades 'A' and 'B' are mapped to class 1, while 'C' and 'D' are mapped to class 0. This mapping simplifies the training and prediction procedures and yields a more interpretable output from the classification model.
To ensure data quality, we also apply data cleaning measures. Excessively long paragraphs are truncated or segmented to adhere to the 512-token limit of the BERT model, preventing errors or performance degradation caused by overly long inputs and maintaining training stability and efficiency. This process yields a clean, standardized textual signal suitable for ingestion by the BERT model.
Finally, the dataset is divided into training, validation, and test sets in the ratios of 70%, 15%, and 15%, respectively.
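A minimal sketch of this preprocessing pipeline is shown below. It assumes a pandas DataFrame with hypothetical column names (`comment`, `grade`) and a local copy of the Baidu stopwords list, and is intended as an illustration under those assumptions rather than the exact implementation.

```python
import jieba
import pandas as pd
from sklearn.model_selection import train_test_split

MAX_TOKENS = 512  # BERT input length limit

def preprocess(df: pd.DataFrame, stopword_path: str) -> pd.DataFrame:
    """Tokenize comments, drop stopwords, binarize grades, and truncate long texts."""
    with open(stopword_path, encoding="utf-8") as f:
        stopwords = set(line.strip() for line in f)

    def clean(text: str) -> str:
        tokens = [t for t in jieba.lcut(text) if t.strip() and t not in stopwords]
        return " ".join(tokens[:MAX_TOKENS])  # coarse truncation to respect the 512-token limit

    df = df.copy()
    df["text"] = df["comment"].astype(str).map(clean)
    df["label"] = df["grade"].map({"A": 1, "B": 1, "C": 0, "D": 0})  # A/B -> 1, C/D -> 0
    return df.dropna(subset=["label"])

def split(df: pd.DataFrame, seed: int = 42):
    """70% / 15% / 15% train / validation / test split, stratified by label."""
    train_df, temp_df = train_test_split(df, test_size=0.30, random_state=seed, stratify=df["label"])
    val_df, test_df = train_test_split(temp_df, test_size=0.50, random_state=seed, stratify=temp_df["label"])
    return train_df, val_df, test_df
```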
4.3. The Sensing Core: Feature Encoding and Inference with an Enhanced BERT Model
The core of the proposed soft-sensor is a modified BERT model, which serves as a powerful feature extractor and inference engine. It maps preprocessed textual tokens into contextually enriched representations, from which the final evaluation outcome is derived.
As illustrated in Figure 6, an additional global self-attention layer is incorporated on top of the BERT encoder to function as a signal refinement module. This addition enables the architecture to perform a second, more focused integration of the encoded features, thereby enhancing its ability to capture long-range dependencies and key phrases in the review text, a capability that is critical for a sensor intended to discern nuanced academic qualities.
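A compact PyTorch sketch of this sensing core is given below. The pretrained checkpoint name, the number of attention heads, and the layer sizes of the classifier head (which is described later in this subsection) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class EnhancedBertSensor(nn.Module):
    """BERT encoder + additional global self-attention layer + MLP head (illustrative sketch)."""
    def __init__(self, pretrained_name: str = "bert-base-chinese",
                 num_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained_name)
        hidden = self.bert.config.hidden_size
        # Extra global self-attention layer acting as a signal refinement module.
        self.refine_attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        # MLP inference head: hidden layer + dropout + sigmoid output (sizes are assumptions).
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 256),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(256, 1),
            nn.Sigmoid(),
        )

    def forward(self, input_ids, attention_mask):
        enc = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Second, more focused integration of the encoded features.
        refined, _ = self.refine_attn(enc, enc, enc,
                                      key_padding_mask=~attention_mask.bool())
        cls_repr = refined[:, 0, :]                    # [CLS] token as holistic sequence representation
        return self.classifier(cls_repr).squeeze(-1)   # probability of the positive class
```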
During the training stage, the model was optimized using the AdamW optimizer with an initial learning rate of 5 × 10⁻⁵. The optimizer was configured with beta parameters (0.9, 0.999), an epsilon value of 1 × 10⁻⁸, and a weight decay of 0.01. Gradient clipping was applied with a maximum norm of 1.0 to stabilize the training process. The training ran for 15 epochs, using BCELoss as the loss function with a batch size of 32. A random seed of 42 was applied to ensure reproducibility, and the BERT layers were set with freeze_bert = False to allow full fine-tuning.

For feature aggregation, a special classification token ([CLS]) is inserted at the beginning of each sequence. After being processed through the Transformer encoder and the global self-attention layer, the output feature vector corresponding to this token serves as a holistic representation of the entire sequence. This representation encapsulates the overall semantic information of the text and is the pivotal input for the downstream classification task.
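A minimal training-setup sketch matching these hyperparameters is shown below. It assumes the `EnhancedBertSensor` module sketched above and a standard PyTorch data loader, and illustrates the configuration rather than the full training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)  # fixed random seed for reproducibility
device = "cuda" if torch.cuda.is_available() else "cpu"

model = EnhancedBertSensor().to(device)  # freeze_bert = False: BERT layers are fully fine-tuned
criterion = nn.BCELoss()                 # binary cross-entropy on the sigmoid output
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, betas=(0.9, 0.999),
                              eps=1e-8, weight_decay=0.01)

EPOCHS, BATCH_SIZE, MAX_GRAD_NORM = 15, 32, 1.0

def train_one_epoch(loader):
    model.train()
    for batch in loader:  # each batch holds input_ids, attention_mask, labels
        optimizer.zero_grad()
        probs = model(batch["input_ids"].to(device), batch["attention_mask"].to(device))
        loss = criterion(probs, batch["labels"].float().to(device))
        loss.backward()
        # Gradient clipping with a maximum norm of 1.0 stabilizes training.
        torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
        optimizer.step()
```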
Subsequently, the refined feature representation from the [CLS] token is fed into a Multi-layer Perceptron (MLP) that acts as the final inference unit, mapping the high-dimensional features to a binary prediction. The MLP comprises multiple fully connected layers, typically including one or more hidden layers. Between these layers, a dropout layer is applied to mitigate overfitting by randomly turning off a subset of neurons during training, thereby improving generalization. The output layer consists of a linear projection followed by a Sigmoid activation function, which produces a probability score between 0 and 1 indicating the likelihood of the input text belonging to the positive class.
Leveraging the robust feature extraction capabilities of the BERT model, combined with nonlinear transformation via the MLP, the proposed soft-sensor demonstrates significantly enhanced performance in binary classification tasks. This architecture allows the sensor to not only encode complex textual patterns but also perform reliable inference on nuanced academic features.
This improvement in predictive performance substantiates more trustworthy SHAP-based interpretations of feature importance, reduces model-induced uncertainty, and clarifies associations between input features and outcomes. As a result, the soft-sensor can more accurately identify and articulate critical phrases and contextual cues that drive decisions, strengthening its role as an interpretable and effective tool for academic quality assessment.
4.4. The Interpretation Interface: Explaining Predictions with SHAP
To make the predictions of the sensing core interpretable and to uncover latent evaluation dimensions, we employ SHAP as the calibration and explanation interface of our soft-sensor. This interface decomposes the BERT model's output to quantify the contribution of each input feature (word), effectively reverse-engineering the black-box inference process. An overview of the process is shown in Figure 7.
A SHAP interpreter is instantiated to evaluate the sensing core’s response to individual input tokens. The SHAP values are computed through a transformation process wherein textual data are encoded via a tokenizer to ensure compatibility with the model architecture. By integrating the encoding process and tokenizer, the interpreter systematically quantifies the influence of each feature on the overall result. The resulting SHAP values assign a quantitative measure to each word’s contribution to the final prediction, thereby clarifying the model’s decision-making process and facilitating improved interpretation.
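A sketch of how such an interpreter can be instantiated with the shap library is given below. It assumes the fine-tuned `model` and `device` from the previous subsection together with a Hugging Face tokenizer (the checkpoint name is an assumption), and wraps them in a plain prediction function fed to a text-aware explainer.

```python
import torch
import shap
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint

def predict_proba(texts):
    """Map a batch of raw review strings to positive-class probabilities."""
    enc = tokenizer(list(texts), padding=True, truncation=True,
                    max_length=512, return_tensors="pt").to(device)
    with torch.no_grad():
        probs = model(enc["input_ids"], enc["attention_mask"])
    return probs.cpu().numpy()

# The Text masker lets SHAP perturb the input at the token level.
explainer = shap.Explainer(predict_proba, shap.maskers.Text(tokenizer))
shap_values = explainer(["该论文选题新颖，研究方法严谨，但实验部分论证不足。"])
# shap_values[0].values holds per-token contributions to the predicted probability.
```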
During the data screening and statistical processing phase, non-letter characters, stop words, non-nouns, and other lexically irrelevant elements were excluded to ensure data accuracy and relevance. For each remaining word, its occurrence frequency and the cumulative SHAP value associated with it were computed to form the basis of the saliency analysis. In addition, each word's document frequency was assessed to evaluate the term's distribution across the corpus and its importance in the decision process.
Based on these metrics, a composite saliency score was calculated for each term by integrating its average SHAP value, term frequency (TF), and document frequency (DF). The composite saliency score for each term w is defined below.
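A plausible form of this score, reconstructed from the component descriptions that follow and assuming the two components are combined additively, is

$$
\mathrm{Score}(w) \;=\; \frac{\sum_{i}\lvert \phi_i(w)\rvert}{\mathrm{TF}(w)} \;+\; \frac{\mathrm{DF}(w)}{\max_{w'} \mathrm{DF}(w')},
$$

where $\phi_i(w)$ denotes the SHAP value attributed to the $i$-th occurrence of term $w$ in the corpus.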
The former component represents the mean absolute SHAP value per occurrence of term w, calculated by dividing the cumulative absolute SHAP values by TF. This normalization ensures that terms with high frequency but low per-occurrence impact are appropriately weighted, preventing the overemphasis of commonly occurring but individually insignificant terms. The latter component quantifies the prevalence of term w across the document corpus by dividing its DF by the maximum document frequency observed across all terms in the corpus. This ratio, bounded between 0 and 1, emphasizes terms that exhibit broad contextual relevance rather than those significant only within limited textual contexts.
Following this computation, a final normalization step scales all scores to the interval [0, 1] by dividing each score by the maximum observed value, and the resulting scores are sorted in descending order of significance. This analytical process enables the systematic filtering and ranking of salient keywords based on their aggregated SHAP values, providing a human-interpretable output that reveals the key linguistic signals the model relies on for its assessment while ensuring statistical robustness and contextual relevance across the document corpus.
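A compact sketch of this scoring and ranking step is given below; it follows the reconstruction above (additive combination assumed) and takes per-word SHAP values, already filtered and grouped per document, as its input.

```python
from collections import defaultdict

def rank_salient_terms(token_shap_per_doc):
    """Rank terms by the composite saliency score.

    token_shap_per_doc: list of dicts, one per document, mapping word -> list of SHAP values
    (tokens already screened to keep only informative terms).
    """
    abs_shap_sum = defaultdict(float)  # cumulative |SHAP| per term
    tf = defaultdict(int)              # term frequency over the whole corpus
    df = defaultdict(int)              # number of documents containing the term

    for doc in token_shap_per_doc:
        for word, values in doc.items():
            abs_shap_sum[word] += sum(abs(v) for v in values)
            tf[word] += len(values)
            df[word] += 1

    max_df = max(df.values())
    # Mean |SHAP| per occurrence plus normalized document frequency (assumed additive combination).
    scores = {w: abs_shap_sum[w] / tf[w] + df[w] / max_df for w in tf}

    max_score = max(scores.values())
    scores = {w: s / max_score for w, s in scores.items()}  # scale to [0, 1]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```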
Furthermore, the SHAP-based keyword extraction process is entirely automated, requiring no manual intervention. To validate the distinctiveness of the selected keywords, we compared them with those generated by traditional methods such as TF-IDF. The results indicate that SHAP effectively captures contextually salient terms that correspond more closely with expert assessments, while TF-IDF tends to emphasize frequently occurring yet less distinctive words. This comparison confirms that our approach yields more semantically meaningful and review-specific insights.
6. Discussion
In the current era of rapidly developing artificial intelligence technologies, exemplified by LLMs, this study proposes an interpretable soft-sensor framework based on BERT and SHAP. The core objective of this framework is to provide a stable, interpretable, and auditable discriminative foundation for academic evaluation. Rather than relying directly on generative models for end-to-end assessment, the framework constructs a reliable and transparent evaluation core by deeply integrating semantic encoding with attribution analysis. This core can serve as a crucial quality calibration module in future intelligent evaluation systems, used to verify, assist, and even constrain generative models whose outputs may contain hallucinations and inconsistencies. In this way, it can establish a distinct value in an AI-driven academic evaluation ecosystem.
However, although this framework is designed to achieve reliable perception and interpretable reasoning, its practical application still has several limitations. First, the dataset used in this study comes mainly from a single institution; although it spans multiple disciplines, the uneven distribution of disciplines and differences in their evaluation criteria may limit the model's generalization ability. Second, the class distribution in the dataset is imbalanced, which may weaken the model's recognition of under-represented categories and makes a more detailed, nuanced assessment of thesis quality difficult.
In future work, we will first expand the evaluation dimensions of the model, moving from the current binary classification task to multi-class classification and modeling texts from broader linguistic perspectives beyond the lexical level, so as to provide more granular quality scores or level predictions. Second, we will explore collaboration with large language models. A promising direction is to combine the stable discrimination and attribution capabilities of this framework with the fluent generation and deep semantic understanding of generative models, building a hybrid intelligent system that retains a reliable evaluation core while also producing natural-language comments, thereby making academic review more scientific, transparent, and instructive.