sBERT: Parameter-Efficient Transformer-Based Deep Learning Model for Scientific Literature Classification

Abstract: This paper introduces a parameter-efficient transformer-based model designed for scientific literature classification. By optimizing the transformer architecture, the proposed model significantly reduces the memory usage, training time, inference time, and carbon footprint associated with large language models. The proposed approach is evaluated against various deep learning models and demonstrates superior performance in classifying scientific literature. Comprehensive experiments conducted on datasets from Web of Science, ArXiv, Nature, Springer, and Wiley demonstrate the effectiveness of the proposed approach.


Introduction
The scientific community has witnessed an unprecedented surge in the volume of published literature, with millions of articles disseminated annually across myriad platforms. As of recent estimates, over 2.5 million scientific articles are published each year across more than 30,000 peer-reviewed journals globally [1,2]. In addition to journal articles, the proliferation of scientific conferences contributes significantly to the literature pool, with thousands of conferences generating substantial numbers of proceedings and papers annually. The academic sector also plays a crucial role, with universities around the world producing a vast number of theses and dissertations each year. For instance, in 2018 alone, the United States saw the publication of approximately 60,000 doctoral dissertations [3]. This exponential growth in scientific publications is facilitated by advances in digital technology and open access initiatives, further underscoring the dynamic and expansive nature of modern scientific research. The usefulness of such a large amount of information depends on how it is automatically organized and grouped into various subjects, domains, and themes. Text classification is a crucial tool for organizing, managing, and retrieving textual data repositories.
Classic machine learning-based classification algorithms have been used for the task of text classification. The problems inherent to such algorithms, such as the need for feature engineering, have limited their application. To address this, deep learning algorithms, including models based on convolutional neural networks (CNNs) [4][5][6][7][8] and recurrent neural networks (RNNs) [9,10], have been proposed.
Transformer-based models [11][12][13][14][15][16] have recently been used as well. When it comes to text categorization tasks, these models perform better than simpler models. However, the performance gains are accompanied by larger and more complex architectures. Such complexity is necessary to achieve satisfactory outcomes in sequence-to-sequence tasks, but these models are not the best choice for relatively easy tasks such as text classification.

This paper addresses the following research questions:
• How do different text classification models compare in terms of performance for scientific literature classification (SLC)?
- This paper evaluates the performance of various text classification models, including CNN, RNN, and transformer-based models, across multiple datasets, such as WOS, ArXiv, Nature, Springer, and Wiley.
- Detailed performance metrics are provided, which highlight the strengths and weaknesses of each model.
- The results indicate that sBERT (small BERT) outperforms other models in accuracy and robustness across diverse datasets.
• What are the limitations of existing CNN and RNN models in handling text classification tasks?
- The Introduction (Section 1) and Related Work (Section 2) discuss the representational limitations of CNN and RNN models, particularly the challenge of detecting long-range dependencies in the input.
- This sets the stage for introducing transformer-based models that address these limitations through mechanisms such as self-attention.
• How can parameter efficiency be achieved while maintaining high performance?
- The proposed sBERT model is designed to utilize a multi-headed attention mechanism and hybrid embeddings to capture global context efficiently.
- The paper details the architecture of sBERT, emphasizing its parameter efficiency compared to other transformer-based models.
- It requires only 15.7 MB of memory and achieves rapid inference times, demonstrating a significant reduction in computational resources without compromising performance.

Paper Outline
Section 2 of this paper includes the relevant literature review. Section 3 introduces the proposed approach, while Section 4 provides a detailed account of the datasets employed and the experiments carried out. Section 5 is devoted to the presentation and discussion of the study's findings.

Related Work
In this section, we summarize the relevant literature on various approaches to text classification. The review covers works published between 2015 and 2022 to provide a comprehensive overview of recent advances in this field.

Convolutional Neural Networks for Sentence Classification
The foundational work of Kim [4] demonstrated the effectiveness of convolutional neural networks (CNNs) for sentence classification, illustrating significant improvements in text classification tasks by employing simple yet powerful CNN architectures. Building upon this, Zhang et al. [17] introduced the Multi-Group Norm Constraint CNN (MGNC-CNN), which leveraged multiple word embeddings to enhance sentence classification performance. The utility of CNNs in combining contextual information was further explored by Wu et al. [18], who integrated self-attention mechanisms with CNNs to improve text classification accuracy.

Advances in CNN Architectures
Several studies have proposed advancements in CNN architectures for text classification. Zhang et al. [19] introduced character-level convolutional networks, highlighting their ability to handle texts at a granular level and outperform traditional word-level models. Similarly, Conneau and Schwenk [20] developed very deep convolutional networks, emphasizing the importance of depth in CNNs for capturing intricate text patterns.
Johnson and Zhang [21] compared shallow word-level CNNs and deep character-level CNNs, demonstrating that deep character-level models achieve superior performance in text categorization tasks. More recently, Wang et al. [22] combined n-gram techniques with CNNs to enhance short text classification, while Soni et al. [23] introduced TextConvoNet, a robust CNN-based architecture specifically designed for text classification.

Word Embeddings and Contextual Information
The role of word embeddings in enhancing sentence classification has been a focal point in many studies. Mandelbaum and Shalev [24] explored various word embeddings and their effectiveness in sentence classification tasks. Senarath and Thayasivam [25] further advanced this by employing multiple word-embedding models for implicit emotion classification in tweets. Additionally, the combination of recurrent neural networks (RNNs) and CNNs with attention mechanisms, as proposed by Liu et al. [9], demonstrated significant improvements in sentence representation and classification.

Hierarchical and Attention-Based Models
The introduction of hierarchical and attention-based models has significantly influenced sentence classification methodologies. Yang et al. [26] proposed Hierarchical Attention Networks (HANs), which effectively captured the hierarchical structure of documents for better classification. Zhou et al. [27] extended this approach by using attention-based LSTM networks for cross-lingual sentiment classification. Furthermore, Bahdanau et al. [28] introduced the attention mechanism in neural machine translation, which has since been widely adopted in various text classification models.

Hybrid Models and Domain-Specific Applications
Hybrid models that combine CNNs with other neural network architectures have also shown promising results. Hassan and Mahmood [29] developed a deep-learning convolutional recurrent model that took advantage of the strengths of CNNs and RNNs for sentence classification. In domain-specific applications, Gonçalves et al. [30] utilized deep learning approaches for classifying scientific abstracts, while Jin and Szolovits [31] proposed hierarchical neural networks for sequential sentence classification in medical scientific abstracts. Yang and Emmert-Streib [32] introduced a CNN specifically designed for multi-label text classification of electronic health records.

Transformer-Based Models
The advent of transformer-based models has revolutionized text classification. Devlin et al. [11] introduced BERT, a pre-trained deep bidirectional transformer, which set new benchmarks in various NLP tasks. Subsequent models, such as RoBERTa by Liu et al. [14], ALBERT by Lan et al. [15], and XLNet by Yang et al. [13], further optimized the BERT architecture for improved performance. More recent innovations include ConvBERT by Jiang et al. [33], which incorporated dynamic convolution into the BERT architecture, and ELECTRA by Clark et al. [34], which proposed a new pretraining method for text encoders. DeBERTa by He et al. [35] and its subsequent versions [36] have continued to push the boundaries of what is achievable with transformer-based models in text classification.
The reviewed literature illustrates significant progress in text classification methodologies and the broader context of scientific publishing (Table 1). This period has seen the evolution of deep learning techniques, particularly CNNs and transformer-based models, which have dramatically improved the accuracy and efficiency of text classification systems.

Overview
The motivation for the proposed model is the successful application of transformer-based models such as BERT [11] to tasks that require understanding of natural language. However, these models have large parameter spaces. Additionally, in our experiments, we found that simpler models can perform well on the task of text classification. For example, BERT [11] uses an embedding size of 768 to represent each input token in the text. It also uses 12 transformer blocks, and each block in turn uses 12 attention heads. These design choices are suboptimal and result in parameter inefficiency when applied to text classification. They also result in a huge model (BERT base has 108 M parameters). Although useful in other, more complex NLP tasks, such a large model is inefficient when used for text classification. For comparison, the model architectures and corresponding parameter space sizes are provided in Table 2.
Figure 2 provides an overview of the proposed approach. The input text is passed through an embedding block that generates input to a lightweight encoder block. The encoded input is then used for classification by the classification block. The subsequent sections provide detailed descriptions of the blocks.

Detailed Description
The following subsections provide an in-depth description of the different steps involved in the proposed approach.

Embedding Block
The embedding block reduces the dimensionality of the input text and encodes the semantics and positional information of the words. This enriched and compact input representation serves to reduce computational requirements and improve performance. sBERT employs a hybrid embedding comprising a word embedding and a positional embedding. This is depicted in Figure 3. The weights in the word embedding layer are initialized using GloVe [41] and refined during training. The layer outputs the word vectors as a linear projection of the input words (Equation (1)).
x_i^word = w_i^T W^word    (1)

In Equation (1), x_i^word ∈ R^embedding_dim denotes the output word vector corresponding to the ith position in the input, and w_i ∈ R^vocab_size is the one-hot vector for the ith input word. W^word ∈ R^(vocab_size × embedding_dim) denotes the word embedding matrix. Here, embedding_dim is the dimensionality of the word embedding vectors. The order and position of different input words is encoded by the positional embedding layer. This layer outputs a vector by transformation using a weight matrix (Equation (2)).

x_k^pos = w_k^T W^pos    (2)

Here, x_k^pos ∈ R^embedding_dim is the positional embedding vector corresponding to the kth word in the input, and w_k ∈ R^vocab_size. W^pos ∈ R^(vocab_size × embedding_dim) denotes the weight matrix for the positional embedding layer. This matrix is initialized by weights calculated as a function of the position of a token in the input. A pair of even and odd positions in a row of the embedding matrix corresponding to the kth word, i.e., W^pos_(k,2i) and W^pos_(k,2i+1), is calculated using Equation (3) and Equation (4), respectively.

W^pos_(k,2i) = sin(k / n^(2i/d))    (3)
W^pos_(k,2i+1) = cos(k / n^(2i/d))    (4)

Here, 0 ≤ i < d/2, d is the intended embedding dimensionality, and n in the denominator is used to manage frequencies across the dimensions of the embeddings. By using a large value such as 10,000 for n, the frequency spectrum is spread to ensure that the embeddings can capture patterns that occur over different sequence lengths. The outputs of the two embeddings are summed to obtain an output vector. This creates a word embedding enriched with position information (Equation (5)).

x_i = x_i^word + x_i^pos    (5)

Here, x_i ∈ R^embedding_dim represents the new hybrid embedded word vector, and embedding_dim is the dimensionality of the word embedding vector.
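The hybrid embedding above can be sketched in NumPy. The sizes below are toy values, and the word-embedding matrix is randomly initialized here as a stand-in for the GloVe-initialized weights:

```python
import numpy as np

def sinusoidal_positions(max_len, d, n=10000.0):
    """Build the positional weight matrix W_pos (Equations (3) and (4))."""
    W = np.zeros((max_len, d))
    pos = np.arange(max_len)[:, None]   # token positions k
    i = np.arange(d // 2)[None, :]      # dimension-pair index i
    angle = pos / n ** (2 * i / d)
    W[:, 0::2] = np.sin(angle)          # even dimensions: sine
    W[:, 1::2] = np.cos(angle)          # odd dimensions: cosine
    return W

# Hypothetical sizes for illustration only.
vocab_size, max_len, d = 1000, 250, 100
W_word = np.random.randn(vocab_size, d) * 0.01   # would be GloVe-initialized in sBERT
W_pos = sinusoidal_positions(max_len, d)

token_ids = np.array([5, 42, 7])                 # a toy input sequence
x = W_word[token_ids] + W_pos[: len(token_ids)]  # hybrid embedding (Equation (5))
print(x.shape)   # (3, 100)
```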

Query, Key, and Value Projections
To prepare the word representations for input to the encoder block (Figure 2), we apply three linear transformation layers (query, key, and value) to the combined word vectors, as shown in Equations (6)-(8). The three projections are shown in Figure 3.
q_i = W^q x_i    (6)

Here, q_i ∈ R^query_dim denotes the query vector for the ith input word x_i, x_i ∈ R^embedding_dim denotes the hybrid word vector obtained from the embedding block, and W^q represents the weight matrix associated with generating query vectors. Also, W^q ∈ R^(query_dim × embedding_dim), where query_dim is the dimensionality of the query vectors.

k_i = W^k x_i    (7)

Here, k_i ∈ R^query_dim denotes the key vector for the ith input word x_i, and W^k represents the weight matrix associated with generating key vectors. Also, W^k ∈ R^(query_dim × embedding_dim).

v_i = W^v x_i    (8)

Here, v_i ∈ R^query_dim denotes the value vector for the ith input word x_i, and W^v represents the weight matrix associated with generating value vectors. Also, W^v ∈ R^(query_dim × embedding_dim).
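The three projections reduce to matrix multiplications. In the sketch below, the dimensions and the randomly initialized weights are illustrative stand-ins for the learned W^q, W^k, and W^v:

```python
import numpy as np

embedding_dim, query_dim, seq_len = 100, 100, 3
rng = np.random.default_rng(0)

X = rng.standard_normal((seq_len, embedding_dim))        # hybrid word vectors x_i
W_q = rng.standard_normal((query_dim, embedding_dim)) * 0.1
W_k = rng.standard_normal((query_dim, embedding_dim)) * 0.1
W_v = rng.standard_normal((query_dim, embedding_dim)) * 0.1

Q = X @ W_q.T   # query vectors q_i (Equation (6))
K = X @ W_k.T   # key vectors   k_i (Equation (7))
V = X @ W_v.T   # value vectors v_i (Equation (8))
print(Q.shape, K.shape, V.shape)   # (3, 100) (3, 100) (3, 100)
```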

Self-Attention
The purpose of the self-attention mechanism is to obtain a context-aware representation of the input words. This helps detect long-range dependencies within the input, which in turn improves model performance. The following subsections describe the self-attention mechanism employed by the proposed model to enhance word representation and hence the overall model performance.

Overview
To represent a sequence of words, self-attention connects various positions of the sequence. If the same word is surrounded, in two different instances, by different words, humans understand it differently. By self-attention, we mean attending to other words in the context when interpreting each word in a text sample.
Regularities within a natural language, such as sentence structure, grammar, and the semantics associated with each word (word embedding vectors), cause a model with an attention mechanism built into its architecture to learn to attend to important words within the text. Attention is learned because it is rewarding for the task that the model is trained on. Training the model on a task that requires language understanding, such as text classification, improves this attention mechanism. This is because training improves the contextual representation (one that attends to other areas of the text). Since the contextual representation is calculated using self-attention, the representation can only be improved by improving the attention mechanism itself.

Self-Attention Mechanism
The encoder block generates an attention-based representation that can focus on specific information from a large context. The attention score for a key word when generating a representation for a given query word is calculated by scaling the dot product a_ij of the query vector q_i and the key vector k_j (of dimensionality d_k). This is represented in Equation (9).

a_ij = q_i · k_j    (9)

score(q_i, k_j) = a_ij / √d_k    (10)

In Equation (10), q_i denotes the query vector, and k_j is the key vector whose attention score against q_i is being determined, while d_k is the dimensionality of the key vectors. The three projections (query, key, and value) corresponding to each input word are calculated using three linear transformations. To obtain the attention scores for each token (query word) in the input, its dot product with all the words (key words) in the input is calculated. The dot product is calculated between the query vector representation of the query word and the key vector representation of the key word. Scaling of the dot product a_ij is achieved by dividing it by the square root of the dimensionality of the key vectors (d_k) to obtain score(q_i, k_j), the attention score for q_i against each k_j, as shown in Equation (10). The scores for each query q_i are subjected to a soft-max normalization to obtain the attention vector α_i, as shown in Equation (11).

α_ij = e^score(q_i, k_j) / Σ_j e^score(q_i, k_j)    (11)

Here, α_ij denotes the attention score for the ith word against the jth word. The above operations can be combined into a single matrix operation that calculates an attention matrix A, in which each row a_i represents the score vector for the word at the ith position in the input (Equation (12)).

A = QK^T / √d_k    (12)

Here, Q is the query matrix, in which each row q_i denotes the ith query word; similarly, K denotes the key matrix, in which each row k_i denotes the ith key word. The ith row (a_i) of the matrix A in Equation (12) denotes the scaled dot product of the query word q_i with every key word k_j. Each row a_i of A is subjected to soft-max to calculate α_i, which makes the sum of all attention weights equal to one, i.e., α_i = softmax(a_i), Σ_j α_(i,j) = 1. To reduce the effect of dot products growing in value, which pushes the soft-max function into flat regions, the dot products are scaled by the fraction 1/√d_k, where d_k is the dimensionality of the key vectors. The above steps are shown in Figure 4. Next, each value vector v_j is scaled by the attention weight α_(i,j). To obtain the attention-enriched word representation for the ith position in the input (z_i), the scaled value vectors are summed (Equation (13)).

z_i = Σ_j α_(i,j) v_j    (13)

Figure 4 shows this diagrammatically. Here, ⊗ represents the scaling operation.
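The scaled dot-product attention described above can be sketched in NumPy. The subtraction of the row maximum before exponentiation is a standard numerical-stability step that does not change the soft-max result:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product self-attention (Equations (9)-(13))."""
    d_k = K.shape[-1]
    A = Q @ K.T / np.sqrt(d_k)             # scaled scores, Equation (12)
    A = A - A.max(axis=-1, keepdims=True)  # stability shift; soft-max is unchanged
    alpha = np.exp(A) / np.exp(A).sum(axis=-1, keepdims=True)  # Equation (11)
    Z = alpha @ V                          # attention-weighted sum, Equation (13)
    return Z, alpha

rng = np.random.default_rng(1)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((3, 8))
V = rng.standard_normal((3, 8))
Z, alpha = self_attention(Q, K, V)
print(np.allclose(alpha.sum(axis=1), 1.0))   # each attention row sums to one -> True
```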

Residual Connections
Residual connections [42] in the proposed model improve model performance by addressing the problem of vanishing gradients, facilitating easier optimization, encouraging feature reuse, and leveraging the residual learning principle to focus on learning the challenging parts of the mapping. Nonlinear activation functions can cause gradients to explode or vanish (depending on the weights). Skip connections provide a path through the network along which gradients can also travel backward.
The outputs calculated by the attention mechanism (z_i) are added to the inputs to the encoder block (x_i) to obtain n_i ∈ R^embedding_dim, the vectors used in the subsequent layer normalization step (Equation (14)).

n_i = z_i + x_i    (14)
3.2.5. Layer Normalization
sBERT employs layer normalization to improve model performance by stabilizing training, reducing sensitivity to initialization, improving generalization, and facilitating faster convergence, thereby reducing the training time [43]. We first calculate μ_i, the mean, and σ²_i, the variance, of each input word vector (n_i), as shown in Equations (15) and (16).

μ_i = (1/K) Σ_(k=1)^K n_(i,k)    (15)

σ²_i = (1/K) Σ_(k=1)^K (n_(i,k) − μ_i)²    (16)

Here, K represents the dimensionality of the input, which, in our case, is equal to embedding_dim.
The mean is subtracted from each of the K features of the word vector, and the difference is divided by the square root of the variance calculated above. A very small number ε is added to the variance for numerical stability. In our experiments, we use a value of 0.001 for ε. This is shown in Equation (17).

n̂_(i,k) = (n_(i,k) − μ_i) / √(σ²_i + ε)    (17)

As the final step in layer normalization, we scale the normalized vector n̂_i by a factor γ_1 and shift it by β_1, as shown in Equation (18) and depicted in Figure 5.

p_i = γ_1 ⊙ n̂_i + β_1    (18)

All four steps of layer normalization can be represented as shown in Equation (19).

p_i = LayerNorm(n_i)    (19)
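The four layer-normalization steps can be sketched as follows; the scalar defaults for gamma and beta stand in for the learned γ_1 and β_1 parameters, and ε matches the 0.001 used in our experiments:

```python
import numpy as np

def layer_norm(n, gamma=1.0, beta=0.0, eps=1e-3):
    """Layer normalization over the feature dimension (Equations (15)-(19))."""
    mu = n.mean(axis=-1, keepdims=True)    # per-vector mean, Equation (15)
    var = n.var(axis=-1, keepdims=True)    # per-vector variance, Equation (16)
    n_hat = (n - mu) / np.sqrt(var + eps)  # normalize, Equation (17)
    return gamma * n_hat + beta            # scale and shift, Equation (18)

n = np.array([[1.0, 2.0, 3.0, 4.0]])
p = layer_norm(n)
print(np.isclose(p.mean(), 0.0, atol=1e-6))   # normalized features have ~zero mean -> True
```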

Figure 5. Applying an FCNN (fully connected neural network) sub-unit of the encoder block to the outputs of the self-attention sub-unit (Figure 4). The superscript of the m^0 output vectors denotes the attention head index (0).

FCNN (Fully Connected Neural Network)
Two linear transformation layers are applied to the normalized output (Equations (20) and (21)). The fully connected layers help the model capture intricate relationships within the data, leading to improved performance. Additionally, they allow the model to extract features at different levels of granularity.
Figure 5 shows the application of the FCNN to the layer-normalized outputs obtained by adding the scaled value projections according to the attention vector corresponding to each position.

Concatenating Outputs from Multiple Attention Heads
To generate a consolidated representation from multiple attention heads for each word, we concatenate the vectors obtained from all attention heads (Figure 5).

Residual Connections
A residual connection is employed to skip the transformations shown in Equations (20) and (21). Thus, p_i (the output of Equation (18)) is added to s_i (the output of Equation (21)). This addition gives m_i, the combined output (Equation (22)).

m_i = p_i + s_i    (22)
3.2.9. Combining the Representations Obtained from All Attention Heads
Outputs from multiple attention heads are consolidated by employing a linear transformation layer on the concatenated token representations obtained from all heads. This is shown in Equation (23) and illustrated in Figure 6.

o_i = W^o [m_i^(0); m_i^(1); …; m_i^(h−1)]    (23)

Here, h is the number of attention heads. The vectors o_i are then subjected to layer normalization, which can be represented as shown in Equation (24). This is illustrated in Figure 6.

t_i = LayerNorm(o_i)    (24)

Classification
Finally, we average over all n positions (t_i) to compute the vector u that represents the entire text, as shown in Equation (25). Averaging over all positions helps improve classification performance by capturing global context, reducing positional bias, enhancing robustness to input variations, and enhancing semantic understanding.

u = (1/n) Σ_(i=1)^n t_i    (25)

A soft-max-activated dense layer of neurons is used to output the classification probabilities (Equations (26) and (27)).

out = W^out u    (26)

ŷ = softmax(out)    (27)

Here, W^out represents the weight matrix of the output layer, and out ∈ R^nClasses. Figure 6 depicts the soft-max classification step.
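The classification block can be sketched end to end in NumPy; the 7-class output size (matching, e.g., the WOS parent categories) and the random weights are illustrative only:

```python
import numpy as np

def classify(T, W_out):
    """Mean-pool token representations and apply a soft-max output layer
    (Equations (25)-(27))."""
    u = T.mean(axis=0)                      # average over all n positions, Equation (25)
    out = W_out @ u                         # logits, Equation (26)
    out = out - out.max()                   # stability shift before exponentiation
    return np.exp(out) / np.exp(out).sum()  # soft-max probabilities, Equation (27)

rng = np.random.default_rng(2)
T = rng.standard_normal((250, 100))          # token vectors t_i
W_out = rng.standard_normal((7, 100)) * 0.1  # 7 classes, chosen for illustration
probs = classify(T, W_out)
print(np.isclose(probs.sum(), 1.0))   # probabilities sum to one -> True
```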

Datasets and Experiments
The proposed model was applied to eight different datasets summarized in Table 3. For comparison, seven other deep-learning-based text classification models were also tested on the task. We ran a grid search to find the best-performing combination of hyperparameters. The grid search was run on eight datasets, and five-fold cross-validation was used to determine the best configuration for each dataset. The configuration described above showed the best performance for most datasets. The following subsections describe the datasets and the experimental setup used in the study.

Datasets
Ten different datasets comprising abstracts of research papers were used in the study. In addition to the three Web of Science (WOS) datasets [7], we created seven new SLC datasets for the experiments. The following subsections describe the datasets.

WOS
The WOS datasets [7] consist of 46,985 paper abstracts from 7 categories and 134 subcategories.

ArXiv
The dataset was gathered from ArXiv [44], an online preprint repository. It includes works in mathematical finance, electrical engineering, math, quantitative biology, physics, statistics, economics, astronomy, and computer science. There are 7 categories and 146 subcategories.

Nature
The dataset contains 49,782 abstracts from Nature [45], and it is divided into 8 categories and 102 subcategories.

Springer
This dataset contains 116,230 abstracts from Springer [46], which are divided into 24 categories and 117 subcategories. A subset (SPR-50317) consisting of the 6 largest categories was also created.

Wiley
The dataset contains 179,953 samples and was obtained from Wiley [47]. It contains 494 categories and 74 subcategories. A subset (WIL-30628) consisting of the 6 largest categories was also created.

COR233962
COR233962 has 233,962 abstracts divided into 6 categories. It was obtained from the repository made available by Cornell University [48].
The datasets differ in domains, sample sizes for training and testing, average words and characters per sample, and vocabulary size. These factors can affect the performance of text classification models. Generally, larger and more diverse datasets enhance model performance. Table 3 outlines the datasets used in this study.

Experimental Setup
Experiments for the study were performed on a hardware configuration with an Intel® Xeon® CPU @ 2.30 GHz, 12 GB of RAM, approximately 358 GB of free disk space, and an NVIDIA Tesla T4 GPU (Table 4).

Data Acquisition
To acquire some of the datasets, an HTTP request was sent to retrieve the required data, and the HTML content was retrieved using the HTTP library. The lxml library was then used to parse the HTML content and extract the abstracts and categories. The BeautifulSoup library was used to transform the data into a Pandas dataframe, which was stored as a CSV file for further processing and analysis.
The datasets were sourced from individual repositories, ensuring consistency in class distribution between samples and their respective sources. In instances where certain classes contained an insufficient number of samples, exacerbating class imbalance, such classes were omitted from the datasets. Additionally, the variation in the number of classes within the created datasets aimed to enhance diversity.

Data Cleaning and Preprocessing
Special characters were filtered out from the abstracts, and tokenization (vocabulary size of 20,000) was performed. The input was restricted to a length of 250 tokens. This was achieved by padding and truncation.
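A simplified sketch of this preprocessing pipeline follows. The paper does not specify its tokenizer, so the frequency-ranked vocabulary and the reserved padding/out-of-vocabulary ids here are assumptions:

```python
import re

def preprocess(texts, vocab_size=20000, max_len=250, pad_id=0, oov_id=1):
    """Strip special characters, build a frequency-ranked vocabulary, and
    pad/truncate every sample to max_len token ids (a simplified sketch)."""
    cleaned = [re.sub(r"[^a-z0-9\s]", " ", t.lower()) for t in texts]
    tokenized = [t.split() for t in cleaned]
    # Rank words by frequency; ids 0 and 1 are reserved for padding and OOV.
    freq = {}
    for toks in tokenized:
        for w in toks:
            freq[w] = freq.get(w, 0) + 1
    ranked = sorted(freq, key=freq.get, reverse=True)[: vocab_size - 2]
    index = {w: i + 2 for i, w in enumerate(ranked)}
    # Encode, then truncate or pad to a fixed length.
    encoded = [[index.get(w, oov_id) for w in toks][:max_len] for toks in tokenized]
    return [seq + [pad_id] * (max_len - len(seq)) for seq in encoded]

X = preprocess(["A sample abstract, with punctuation!", "Another abstract."])
print(len(X[0]))   # 250
```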

Data Splitting for Training, Validation, and Testing
The datasets were randomly shuffled to remove ordering bias and then split into training and testing subsets using an 80:20 ratio. The training subsets were used for model training, while the test subsets were used for performance evaluation.
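The shuffle-then-split step might look like the following sketch; the fixed seed is an arbitrary illustrative choice for reproducibility:

```python
import random

def train_test_split(samples, test_ratio=0.2, seed=42):
    """Shuffle to remove ordering bias, then split 80:20."""
    data = list(samples)
    random.Random(seed).shuffle(data)  # seeded shuffle for reproducibility
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))   # 80 20
```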

Training Details
The Adam optimizer with a learning rate of 0.01 was employed to train sBERT. A batch size of 16 and 100 epochs with early stopping were used. Figures 7 and 8 show the training graphs obtained while training sBERT on the WOS-46985 and COR-233962 datasets. Training times for sBERT on the different datasets are listed in Table 5.

Performance Metric
Classification accuracy percentage, a measure of the percentage of correctly classified instances in a dataset, was employed. It is defined as the ratio of the number of correctly classified instances to the total number of instances in the dataset, multiplied by 100. Classification accuracy can be described mathematically as shown in Equation (28).

Accuracy (%) = (N_correct / N_total) × 100    (28)
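The metric is straightforward to implement from the definition above:

```python
def accuracy_pct(y_true, y_pred):
    """Classification accuracy percentage (Equation (28)):
    correctly classified instances over total instances, times 100."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

print(accuracy_pct([0, 1, 2, 1], [0, 1, 1, 1]))   # 75.0
```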

Results and Discussion
We evaluate the performance of sBERT against several other deep-learning text classification techniques using classification accuracy percentages. Section 5.1 contrasts the parameter space sizes of different transformer-based models with the proposed model. In Section 5.2, we compare different models based on their carbon emissions. Section 5.3 explores the classification accuracy outcomes for various datasets. Finally, in Section 5.4, we present the results of hypothesis testing.

Parameter Space Comparison
Table 2 shows a comparison of various language models based on their parameter space size, which is an important consideration when selecting a model for a specific task.
As described in Table 2, BERT base, XLNet, and RoBERTa base all have 12 transformer layers, with varying hidden sizes, attention heads, and numbers of parameters. Models such as BERT large have 24 encoder layers, with larger hidden sizes and more attention heads than their base counterparts, resulting in significantly larger parameter spaces. BERT large has 340 million parameters, while RoBERTa large has 355 million. DistilBERT and ALBERT are both designed to be smaller, more efficient versions of BERT. DistilBERT has only six transformer layers, resulting in a smaller parameter space of 66 million parameters. ALBERT has 24 transformer layers like BERT large, but with a smaller hidden size and fewer attention heads, resulting in a much smaller parameter space of only 18 million parameters.
sBERT uses a single lightweight encoder block, a hidden size of 100, and 12 attention heads. This ensures a very small parameter space of 11.9 million. sBERT's parameter efficiency makes it optimal for applications involving text classification tasks. It also makes sBERT more suitable for low-resource applications and reduces the carbon footprint associated with training and fine-tuning more complex models (Table 6).

Carbon Emissions Comparison
The carbon emissions for training a deep learning model largely depend on the model's complexity and size, including the number of parameters and layers, which directly influence computational demands. Larger, more intricate models require more processing power and memory, leading to higher energy consumption. Additionally, the duration of training, dictated by the number of epochs and iterations, significantly impacts overall energy use. Efficient software implementation and optimization algorithms can mitigate some of these demands, but ultimately, more complex and sizable models inherently consume more energy, contributing to greater carbon emissions. Table 6 compares different models in terms of their carbon emissions. Measurements were taken while training the models for 20 epochs in a training environment with the specifications in Table 4.

• Power Consumption (Watts): Power consumption is the total power used by the hardware resources (CPU, GPU, and RAM) during model training. It can be calculated using Equation (29), where P_CPU, P_GPU, and P_RAM represent the power consumption of the CPU, GPU, and RAM, respectively.

P_total = P_CPU + P_GPU + P_RAM    (29)

• Energy Consumption (kWh): Energy consumption is the total amount of power consumed over a period of time. It is given by Equation (30), where t is the duration of the model training in hours.

E = (P_total × t) / 1000    (30)

• Carbon Emission (kg CO2 per kWh): The carbon emission is calculated by multiplying the energy consumption by the carbon intensity of the electricity grid. It can be calculated using Equation (31), where CI is the carbon intensity factor.

C = E × CI    (31)
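Equations (29)-(31) chain together as follows. The wattage, duration, and carbon-intensity values below are illustrative placeholders, not the paper's measurements, and the division by 1000 converts watt-hours to kilowatt-hours:

```python
def carbon_emission_kg(p_cpu, p_gpu, p_ram, hours, carbon_intensity):
    """Chain Equations (29)-(31): total power (W) -> energy (kWh) -> kg CO2.
    carbon_intensity is the grid's carbon intensity in kg CO2 per kWh."""
    power_w = p_cpu + p_gpu + p_ram          # Equation (29)
    energy_kwh = power_w * hours / 1000.0    # Equation (30), Wh -> kWh
    return energy_kwh * carbon_intensity     # Equation (31)

# Illustrative numbers only (not measurements from the paper).
print(round(carbon_emission_kg(p_cpu=65, p_gpu=70, p_ram=5,
                               hours=2, carbon_intensity=0.4), 3))   # 0.112
```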

Performance Comparison
In this section, we report evaluation results for various models on the WOS datasets and the other seven datasets.

Discussion
The experimental results offer insights into the performance of various text classification models across the three datasets: WOS-5736, WOS-11967, and WOS-46985. Each model's classification accuracy percentage was evaluated on these datasets. Among the models assessed, TextCNN exhibited relatively modest performance, achieving its highest accuracy on WOS-5736 at 49.46%. However, its overall performance across datasets was less than satisfactory, implying limitations in capturing intricate relationships within the data. Conversely, the Multi-Group Norm Constraint CNN (MGNC-CNN) demonstrated better performance across the datasets, most notably on WOS-5736, with an accuracy of 98.41%. The character-level CNN (CharCNN) displayed moderate performance, with its highest accuracy on WOS-5736 at 88.48%. Its performance may be attributed to its ability to exploit character-level information, making it suitable for datasets where such details are pivotal. In contrast, the recurrent CNN (RCNN) showcased robust performance across the three datasets, suggesting its competence in capturing both local and sequential patterns in the data. The detailed results are listed in Tables 8 and 9.

The Very Deep CNN (VDCNN) achieved its highest accuracy on WOS-5736 at 82.9%. However, deep models like VDCNN often entail higher computational requirements. The hybrid models, HDLTEX-CNN and HDLTEX-RNN, demonstrated strong performance, particularly on WOS-5736 and WOS-11967.
Most notably, the proposed model, sBERT, consistently outperformed the other models across the datasets, achieving its highest accuracy on WOS-5736 at 99.21%. This superior performance stems from the multi-headed attention mechanism, which enables sBERT to focus on the most salient portions of the text. Moreover, word embeddings enriched with positional information contribute to its success by effectively capturing semantic relationships and context, a vital aspect of text classification tasks.
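The two ingredients credited above can be sketched in a few lines of NumPy: a sinusoidal positional encoding added to the word embeddings, and multi-head scaled dot-product attention over the resulting sequence. The dimensions, head count, and random weights below are illustrative only and do not reflect sBERT's actual configuration.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position signal, one row per token position."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Even embedding dimensions use sine, odd dimensions use cosine.
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project and split into heads: (n_heads, seq_len, d_head).
    Q, K, V = (
        (X @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
        for W in (Wq, Wk, Wv)
    )
    # Scaled dot-product attention, computed per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V
    # Concatenate heads and mix with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo
```

In the full model, the word embeddings are summed with the positional encoding before being fed to the attention layers, so each attention score can depend on both content and position.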

Other Datasets
Table 10 lists the classification accuracy percentages across the Nature, Springer, ArXiv, Wiley, and CornellArXiv datasets. Figure 10 shows confusion matrices for sBERT and RCNN on the WOS-46985 dataset. Figure 11 presents the results graphically.

Discussion
The experimental results present a comprehensive evaluation of various text classification models applied across the five distinct datasets: Nature, Springer, ArXiv, Wiley, and CornellArXiv.
TextCNN, the first model considered, shows variable performance across the datasets, with relatively low accuracies overall. It struggles notably on the Wiley dataset, attaining an accuracy of only 8.15%, but performs more robustly on the COR233962 dataset, reaching an accuracy of 85.06%. These variations suggest that TextCNN may face challenges on datasets with diverse characteristics.
Conversely, the Multi-Group Norm Constraint CNN (MGN-CNN) showcases consistent and robust performance across all datasets, with notable accuracy percentages of 100% on Springer and 85.35% on ArXiv.
The Character-level CNN (CharCNN) exhibits moderate performance, with its highest accuracy observed on the ArXiv dataset at 80.53%. However, it faces challenges on the Wiley dataset, where it attains an accuracy of only 7.82%. These results may be attributed to CharCNN's reliance on character-level information, which may be less relevant or informative in certain datasets.
The Recurrent CNN (RCNN) achieves competitive accuracy percentages across the datasets, notably reaching 85% on Springer and 83.32% on ArXiv. Its recurrent architecture enables it to effectively capture sequential patterns in the data, which contributes to its adaptability.
The Very Deep CNN (VDCNN) achieves an accuracy of 63% on Springer and 66.11% on ArXiv. These relatively modest results suggest that the depth of the model may not universally benefit all datasets.
The models HDLTEX-CNN and HDLTEX-RNN perform well across most datasets, with notable accuracy percentages.These models effectively leverage CNN and RNN architectures, showcasing their adaptability and utility in various text classification scenarios.
The proposed model, sBERT, performs well on all datasets, achieving the highest accuracy on Springer at 100% and on ArXiv at 85.94%. This versatility can be attributed to its multi-headed attention-based architecture and its hybrid word and positional embeddings, which enable sBERT to excel across domains and dataset characteristics. The consistently strong performance of sBERT across different datasets underscores its robustness and generalizability, suggesting its effectiveness in handling diverse domains and dataset sizes. sBERT's parameter efficiency is particularly important in reducing the computational resources and carbon footprint associated with large-scale language models: the proposed model requires just 15.7 MB of memory and takes 0.06 s per prediction (averaged over 100 samples) in the training environment described in Section 4.2.
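As a back-of-envelope check on the reported footprint: under the common assumption of 32-bit (4-byte) parameters, which the paper does not state explicitly, 15.7 MB of parameter memory corresponds to roughly 4.1 million parameters. A minimal sketch of that conversion:

```python
def param_memory_mb(n_params: int, bytes_per_param: int = 4) -> float:
    """Memory occupied by model parameters alone (float32 by default)."""
    return n_params * bytes_per_param / (1024 ** 2)

def params_for_memory(mb: float, bytes_per_param: int = 4) -> int:
    """Inverse: parameter count that fits in `mb` megabytes of storage."""
    return int(mb * (1024 ** 2) / bytes_per_param)

# 15.7 MB at 4 bytes per parameter -> roughly 4.1 million parameters,
# orders of magnitude fewer than full-size BERT variants.
n = params_for_memory(15.7)
```

This kind of estimate covers parameters only; activations, optimizer state, and framework overhead add to the footprint during training.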

Hypothesis Testing
Hypothesis testing was performed to compare sBERT with other models using the 2 × 5 cross-validation method.For example, comparison with the RCNN model on the WOS-5736 dataset yielded the following results.

Accuracy Measurements of Compared Models
The accuracy measurements of the two models for the 10 splits in 2 × 5 cross-validation are presented in Table 11. The paired t-test statistic is obtained using Equation (32):

t = d̄ / (s_d / √n), (32)

where d̄ denotes the average difference between paired observations, s_d denotes the standard deviation of the differences, and n is the number of pairs.
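Equation (32) can be implemented directly. The accuracy lists in the example below are hypothetical placeholders, not the values from Table 11; they merely illustrate the shape of the computation.

```python
import math

def paired_t_statistic(a, b):
    """Paired t-test statistic, Equation (32): t = d_bar / (s_d / sqrt(n))."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_bar = sum(diffs) / n
    # Sample standard deviation of the paired differences (n - 1 divisor).
    s_d = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
    return d_bar / (s_d / math.sqrt(n))

# Hypothetical 2x5 cross-validation accuracies (NOT the values in Table 11):
rcnn  = [90.1, 89.7, 90.4, 89.9, 90.2, 90.0, 89.8, 90.3, 90.1, 89.9]
sbert = [99.0, 98.8, 99.2, 98.9, 99.1, 99.0, 98.7, 99.3, 99.0, 98.9]
t = paired_t_statistic(rcnn, sbert)   # negative: sBERT accuracies exceed RCNN's
```

With differences taken as RCNN minus sBERT, a consistently higher sBERT yields a large negative t, matching the sign convention of the −7.48 reported below.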

Interpretation of the Results
The paired t-test is used to decide if there is a statistically significant difference between the means of two related groups (in this case, the classification accuracy of sBERT and RCNN across the different splits).
• t-statistic: The t-statistic of −7.48 measures the difference between the two groups relative to the variability observed within the groups. The large negative value shows that the accuracy of sBERT is significantly different from that of RCNN.
• p-value: The p-value of 3.78 × 10⁻⁵ is much lower than the significance threshold (0.05), indicating that the difference in classification accuracy between sBERT and RCNN is statistically significant. This provides strong evidence to reject the null hypothesis (that there is no difference in performance between the two models).
The low p-value in the test provides strong evidence against the null hypothesis, indicating that the observed difference in classification accuracy is highly unlikely to be due to random chance.

Conclusions and Future Work
In this study, we proposed sBERT, a parameter-efficient transformer model tailored for the classification of scientific literature. Through extensive experiments on multiple datasets, sBERT has been shown to outperform traditional models in both accuracy and efficiency. Our findings highlight the advantages of the multi-headed attention mechanism and optimized embeddings used in sBERT. Furthermore, the reductions in memory use, training and inference times, and carbon footprint emphasize the model's efficiency and environmental benefits. Future work will explore the application of sBERT to other text classification domains and further optimize its architecture for even greater performance.

Figure 2.
Figure 2. Proposed model at a conceptual level.

Figure 4.
Figure 4. Applying the self-attention sub-unit of the encoder block to an embedded input position i (Figure 3). A single attention head is shown for simplicity.

Figure 6.
Figure 6. Obtaining a single vector representation for each position by transforming the concatenated attention outputs (m_i) from 12 attention heads (Figure 5) using a fully connected layer. This representation is then used for classification after normalization and averaging over positions.

Figure 9.
Figure 9. Graph showing classification accuracy (percentage) of different models on the WOS datasets.

Figure 10.
Figure 10. Confusion matrices for sBERT and RCNN on the WOS-46985 dataset, shown for comparison. For analysis, see Tables 8 and 9.

Figure 11.
Figure 11. Graph showing classification accuracy (percentage) of different models on the other datasets.

Table 1.
Comparison of different models based on their main features and limitations.

Table 2.
Transformer-based models with their parameter size, layers, attention heads, and hidden size.

Table 3.
Summary of datasets used.
Table 5 lists the training times of different models. Code for the proposed model can be found online (https://github.com/munziratcoding/sBERT, accessed on 11 July 2024). The following subsections present a detailed account of the experimental setup.

Table 5.
Training times for sBERT on different datasets.

Table 6.
Energy consumption and carbon emission of various models.

Table 11.
Accuracy measurements of RCNN and sBERT for the 10 splits in 2 × 5 cross-validation.