Abstract
Recent advances in natural language processing (NLP) have enabled the automation of Software Requirements Classification (SRC), particularly through fine-tuning models such as Bidirectional Encoder Representations from Transformers (BERT). While BERT-based models have shown promising results, the impact of hyperparameter sensitivity and dataset size on SRC performance remains underexplored. To address this gap, we present three main contributions: (1) the development and evaluation of BERT fine-tuning for SRC, with emphasis on the effects of key hyperparameters and dataset size; (2) comprehensive experiments analyzing the influence of individual hyperparameters to identify optimal configurations for robust and efficient performance; and (3) controlled experiments highlighting the critical factors that affect fine-tuning outcomes in SRC, particularly dataset size and hyperparameter sensitivity. Our approach was assessed on two datasets: the PROMISE dataset and FR_NFR, a dataset tailored to the SRC task. The proposed method outperformed baseline models, achieving an average F1-score of 0.99 on PROMISE and 0.97 on FR_NFR. These findings provide empirical evidence on optimization strategies for BERT-based requirements classification and offer practical guidance to software engineering practitioners.
1. Introduction
Software requirements define what a system should do, specifying its expected behavior, essential properties, and constraints on development, such as performance, security, or regulatory compliance. The process of managing these requirements is governed by requirements engineering (RE), which provides a structured approach for eliciting, analyzing, documenting, and maintaining requirements throughout the software development lifecycle. Through systematic management of requirements, RE ensures the final system’s conformance to stakeholder needs, a reduction in development risks, and improvements in overall software quality [].
Software requirements are commonly classified into two categories: functional requirements (FRs) and non-functional requirements (NFRs). FRs specify the system’s expected behavior in particular situations and define what developers must implement, while NFRs describe the properties, quality attributes, or constraints the system must satisfy. A critical phase of the Software Development Life Cycle (SDLC) is the requirements analysis process, which focuses on identifying, refining, and structuring both FRs and NFRs based on broader system context and stakeholder needs.
Developers rely on these requirements to design solutions that deliver the intended functionality, while testers use them to ensure the implemented system satisfies both functional specifications and quality expectations. Together, FRs and NFRs provide the foundation for software development and validation, guaranteeing the final system meets stakeholder needs and adheres to required standards []. However, manually extracting FRs and NFRs from requirement documents is error-prone, labor-intensive, and time-consuming. Consequently, automating this process can improve efficiency, reduce human errors, and ensure more consistent requirement analysis.
Recently, the task of SRC has become increasingly associated with NLP, since requirements are typically written in natural language by analysts or domain experts. NLP techniques enable the automated processing and interpretation of these textual requirements, but also introduce the challenge of ensuring classification is both efficient and accurate. Within this NLP-based setting, SRC can be framed as a binary classification problem that distinguishes between FRs and NFRs. Alternatively, it can be modeled as a multi-class classification problem, where NFRs are further categorized into specific quality attributes such as security, usability, and interoperability, among others.
Researchers have proposed several approaches based on machine learning (ML), deep learning (DL), and transfer learning (TL) techniques []. Traditional ML has achieved good accuracy in classification tasks; however, it depends on manual feature engineering and lacks contextual understanding. DL methods such as CNN classifiers have demonstrated high precision and recall for NFR classification []; a major limitation of CNN models, however, is their reliance on large labeled datasets. Recent advances in NLP have significantly improved requirements classification, particularly through pre-trained language models such as BERT and its variants. Numerous studies [,] have examined the application of these models to automate the classification of software requirements. However, the optimal configuration of BERT for SRC tasks remains unclear, with limited understanding of which factors most significantly affect performance during fine-tuning. This knowledge gap often leads to suboptimal model configurations and inconsistent results across requirements datasets.
1.1. Research Problem
While BERT has shown remarkable effectiveness across a wide range of NLP tasks, its application to SRC has not yet been comprehensively explored. In particular, there is limited understanding of the factors that most significantly influence performance improvements. Researchers often face challenges such as:
- Assessing the sensitivity of model performance to different hyperparameter settings
- Identifying the key components of the fine-tuning process that contribute most to classification accuracy
- Determining the minimum and optimal dataset sizes required for effective fine-tuning
1.2. Research Questions
To address these gaps, this study is guided by the following research questions (RQs):
- RQ1: Which hyperparameters have the most significant impact on BERT fine-tuning effectiveness for the SRC task?
- RQ2: How does dataset size affect BERT performance in the SRC task?
- RQ3: What is the optimal combination of dataset size and hyperparameter settings for maximizing performance in SRC tasks?
1.3. Research Contribution
This research advances the field of RE in the following ways:
1. Optimally tuned BERT model for SRC.
We develop and evaluate a systematically fine-tuned BERT model specifically tailored for SRC. Unlike prior studies that applied BERT without a comprehensive investigation of fine-tuning, this paper identifies the optimal hyperparameter configurations that significantly impact classification performance. By addressing this gap, our approach outperforms state-of-the-art methods.
2. Empirical analysis of fine-tuning factors.
We conduct a systematic empirical analysis of the influence of hyperparameters and dataset size on BERT's performance in SRC. Through controlled experiments and ablation studies, we quantify the relative importance of individual components in the fine-tuning process. The findings not only provide practical guidelines for researchers and practitioners on effective model setup and resource utilization but also contribute to advancing transformer-based approaches in RE.
The remainder of the paper is organized as follows: Section 2 reviews related work, highlighting recent advancements in NLP models for requirements classification. Section 3 describes the proposed methodology, including the fine-tuning process, experimental setup, and datasets. Section 4 presents and discusses the results, with comparisons against existing approaches. Section 5 outlines the research limitations and directions for future work. Section 6 concludes the paper by summarizing the key contributions and the potential of large language models (LLMs) in RE.
2. Literature Review
Recently, the automation of the SRC task has gained considerable attention in the RE research community, as it plays a crucial role in managing large-scale datasets by significantly reducing the manual effort required from requirements engineers. To address the challenges of SRC, researchers have proposed a range of approaches, including those based on ML, DL, and TL. This section reviews the existing literature on SRC, focusing on the key contributions, methodologies, and findings reported in prior studies.
For example, ref. [] developed a Support Vector Machine (SVM) classifier that leveraged metadata, lexical, and syntactical features, achieving precision and recall of up to 92% in automatically distinguishing between FRs and NFRs. Similarly, ref. [] applied ML methods as baseline models for comparison in their study, following the approach described by [], to classify software requirements into functional and non-functional categories. Using TF-IDF for text feature representation, their experiments showed that Naïve Bayes achieved the best performance, with an F1-score of 86% for functional requirements and 90% for non-functional requirements.
Abad et al. [] conducted a comparative study of several machine learning algorithms for classifying NFRs, including Latent Dirichlet Allocation (LDA), Biterm Topic Model (BTM), K-means, Naïve Bayes, and a hybrid approach combining hierarchical clustering with K-means. Their experiments utilized both raw and preprocessed versions of the PROMISE dataset. Data preprocessing involved rule-based techniques and domain-specific dictionaries, significantly enhancing classification performance and yielding a weighted average F1-score of 94% on the processed data. However, a notable limitation of their approach is its reliance on manual preprocessing, which may be tailored to specific datasets and thus limits generalizability.
Additionally, several studies have applied traditional machine learning techniques to the SRC task, specifically for identifying subclasses of the NFRs. Notable examples include [,,], who explored various algorithms to enhance NFR classification accuracy. However, despite promising results, the generalizability of these approaches across different domains remains unestablished, as highlighted by [], raising concerns about their applicability in diverse real-world settings.
Research in NFR classification has gradually shifted toward deep learning techniques, offering improved performance over traditional machine learning approaches. For instance, ref. [] employed a CNN to classify NFR subclasses in the PROMISE dataset without relying on handcrafted features. Their model outperformed traditional methods, demonstrating the effectiveness of deep learning in capturing complex patterns. Similarly, ref. [] applied both CNNs and artificial neural networks (ANNs) to classify NFRs into five categories across two datasets, achieving F-scores ranging from 82% to 92% with the CNN model. However, CNNs typically require large amounts of labeled training data, which can be a limitation in many domains []. To address this, recent approaches have increasingly adopted transfer learning and pretrained models. One prominent example is BERT, which combines bidirectional transformers with transfer learning to deliver state-of-the-art performance in various language tasks [].
Li et al. [] proposed a novel method for the SRC task by incorporating sentence structure and syntactic information. They constructed dependency parse trees and utilized a Graph Attention Network (GAT) to extract both implicit structural features and syntactic features from the requirements. Their approach involved four graph construction strategies combined with GAT, and the dependency-graph-based method achieved precision, recall, and F1-scores exceeding 90% for both functional and non-functional requirement classification. To further enhance performance, they integrated BERT with the graph construction method, resulting in a weighted average F1-score of 94%, comparable to the results reported by []. This integration highlighted the effectiveness of combining syntactic modeling with pretrained language representations in SRC tasks.
Kaur and Kaur [] proposed a BERT-CNN-based approach for the SRC task. To evaluate its effectiveness, they conducted experiments on the PROMISE dataset, focusing on multi-class classification across four NFR categories: Operability, Performance, Security, and Usability. Their results showed that the BERT-CNN model outperformed the baseline BERT model, demonstrating improved accuracy in distinguishing between NFR subclasses.
Derya et al. [] evaluated the effectiveness of the DistilBERT transformer as a transfer learning model for multi-class text classification on software requirements documents. Their findings showed that DistilBERT significantly outperformed traditional RNN-based models.
Additionally, ref. [] introduced a prompt-based learning approach for the SRC task using a BERT-based pretrained language model called PRCBERT. They conducted experiments on two small-scale software requirements datasets (PROMISE and NFR-Review) and found that PRCBERT outperformed both NoRBERT and MLM-BERT in classification performance. Similarly, ref. [] fine-tuned BERT for the SRC task and achieved F1-scores of up to 94% for binary classification on the PROMISE dataset. To optimize memory usage and training efficiency, they limited the maximum sequence length to 128 for base models and 50 for larger models. They also investigated the impact of training epochs, concluding that 10–32 epochs were optimal for binary classification, while 10–64 epochs yielded better results for multi-class settings.
A few semi-supervised approaches have also been explored for the SRC task. For example, ref. [] proposed a framework that integrates Generative Adversarial Networks (GANs) with a BERT-based model to reduce reliance on large annotated datasets, making the approach more suitable for real-world applications. However, the limitations and broader applicability of such semi-supervised methods remain underexplored. According to [], TL-based approaches consistently achieve higher accuracy compared to traditional ML and DL techniques, further highlighting the growing preference for pretrained models in SRC tasks.
Recent studies increasingly highlight BERT’s robustness in addressing the SRC task, particularly emphasizing the importance of proper fine-tuning to capture contextual dependencies within requirement texts. BERT’s flexibility in fine-tuning makes it well-suited for modern SRC challenges and enhances performance across domain-specific applications. Despite the widespread use of BERT in SRC research, no existing study has thoroughly investigated the impact of hyperparameter tuning and dataset size on model performance. Therefore, ablation studies in this context remain underexplored, and further research is needed to optimize and improve performance in SRC tasks.
3. Materials and Methods
This section details the methodological steps and implementation specifics, including the datasets, model configuration, and training setup, to ensure reproducibility and clarity. Here, we explain the process of fine-tuning BERT for SRC. The experimental setup used the PROMISE and FR_NFR datasets as input. Prior to feeding the data into BERT, the text was carefully pre-processed and cleaned to ensure suitability for the classification task.
3.1. Datasets
Two datasets were employed in the experimental analysis:
(A) PROMISE Dataset
The PROMISE dataset is one of the most widely used resources in software requirements research. Maintained by the School of Information Technology and Engineering at the University of Ottawa since 2005, it comprises 625 software requirements. These are categorized into 255 FRs and 370 NFRs, with the latter further divided into 11 distinct NFR types. Due to its long-standing availability and structured classification, this dataset remains a benchmark for comparative studies in the field [].
(B) FR_NFR Dataset (derived from PURE)
The PURE (Public REquirements) dataset consists of 79 publicly available natural language requirements documents collected from various online sources. It includes a total of 34,268 sentences and supports a range of natural language processing tasks relevant to requirements engineering, such as model synthesis, abstraction detection, and document structure analysis []. Ref. [] extracted and processed requirements from the PURE repository and additional open-source software documentation to create the FR_NFR dataset. This refined dataset contains 6117 requirements, of which 3964 are functional and 2153 are non-functional.
Table 1 provides representative examples of functional and non-functional requirements from both datasets. Figure 1 illustrates the distribution of FRs and NFRs in the PROMISE and FR_NFR datasets, highlighting that the FR_NFR dataset is approximately 9.8 times larger than the PROMISE dataset.
Table 1.
Representative examples of FR and NFR from the PROMISE and FR_NFR datasets.
Figure 1.
Comparative Distribution of FR and NFR in the PROMISE and FR_NFR Datasets.
3.2. BERT Model Configuration
The architecture employed in this study is based on BERT, a multi-layer bidirectional Transformer encoder. All model configurations leverage pre-trained BERT encoders in conjunction with a Multi-Layer Perceptron (MLP) for classification tasks.
BERT processes input sequences using a combination of token embeddings, segment embeddings, and position embeddings, which are summed for each token. The model comprises multiple layers of bidirectional Transformer encoders that utilize self-attention mechanisms and feed-forward networks to capture contextual information from both directions of the input sequence.
A special classification token, [CLS], is prepended to each input sequence. Its final hidden state serves as the aggregate representation for classification. Additionally, the [SEP] token is used to denote sentence boundaries or separate multiple segments within the input. The final hidden state of the [CLS] token is passed through a dropout layer, followed by a fully connected linear layer, and finally a softmax activation function to generate class probabilities. This classification head maps the [CLS] representation to the target label space.
For our experiments, we adopted the BERT-base model due to its balance between performance and computational efficiency compared to BERT-large. BERT-base consists of (see the configuration sketch after this list):
- 12 Transformer layers (blocks),
- 768-dimensional hidden states,
- 12 self-attention heads,
- approximately 110 million parameters [,].
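As a concrete illustration, the following minimal sketch shows how this architecture can be assembled with the Hugging Face transformers library. The checkpoint name bert-base-uncased and the build_classifier wrapper are our assumptions for illustration; the paper does not prescribe a specific implementation.

```python
import tensorflow as tf
from transformers import BertConfig, TFBertModel

# Check the BERT-base constants listed above (assumed checkpoint:
# the standard "bert-base-uncased"; the paper does not name one).
config = BertConfig.from_pretrained("bert-base-uncased")
assert config.num_hidden_layers == 12    # 12 Transformer layers (blocks)
assert config.hidden_size == 768         # 768-dimensional hidden states
assert config.num_attention_heads == 12  # 12 self-attention heads

def build_classifier(num_classes: int = 2, dropout_rate: float = 0.2,
                     max_len: int = 256) -> tf.keras.Model:
    """Pre-trained BERT encoder plus the classification head of Section 3.2:
    [CLS] hidden state -> dropout -> fully connected layer -> softmax."""
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32,
                                    name="attention_mask")
    encoder = TFBertModel.from_pretrained("bert-base-uncased")
    hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state
    cls_state = hidden[:, 0, :]  # final hidden state of the [CLS] token
    x = tf.keras.layers.Dropout(dropout_rate)(cls_state)
    probs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model([input_ids, attention_mask], probs)
```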
3.3. Fine-Tuning Process
Fine-tuning the BERT model for SRC involves several critical steps, beginning with pre-processing, which is essential for effective natural language understanding. Raw software requirement texts are first cleaned by removing punctuation, converting all characters to lowercase, eliminating stop words, and applying stemming to reduce words to their base forms. This ensures a standardized input format conducive to accurate classification.
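The cleaning step can be sketched as follows; this is an assumed implementation using NLTK, and the exact pipeline in the released code may differ.

```python
import re
from nltk.corpus import stopwords  # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_requirement(text: str) -> str:
    text = re.sub(r"[^\w\s]", " ", text.lower())  # lowercase, strip punctuation
    tokens = [t for t in text.split() if t not in stop_words]  # drop stop words
    return " ".join(stemmer.stem(t) for t in tokens)  # stem to base forms

# Example (illustrative sentence, not drawn from either dataset):
print(clean_requirement("The system SHALL respond within 2 seconds."))
```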
Next, the cleaned text is transformed into a format compatible with BERT. This involves tokenization, which splits the text into subword units. Special tokens such as [CLS] are added at the beginning of each sequence to represent the classification context, while [SEP] tokens are used to denote sentence boundaries or separate segments. Each token is then mapped to its corresponding index in BERT’s vocabulary, resulting in a sequence of input IDs—numerical representations of the tokens.
To maintain uniform input lengths, sequences are padded to a fixed size, and longer sequences are truncated to fit within BERT’s maximum supported length. Attention masks are generated to differentiate between actual tokens and padding, allowing the model to focus on meaningful content during training.
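For illustration, the snippet below shows how a transformers tokenizer produces these inputs; the example sentence is ours, and the maximum length of 256 matches the value adopted later in Table 3.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(
    "The system shall respond to user queries within two seconds.",
    padding="max_length",  # pad shorter sequences to a fixed size
    truncation=True,       # truncate sequences beyond the maximum length
    max_length=256,        # sequence length used in this study (Table 3)
    return_tensors="tf",
)
print(enc["input_ids"][0][:12])       # [CLS] ... [SEP], then 0-padding
print(enc["attention_mask"][0][:12])  # 1 = real token, 0 = padding
```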
The fine-tuning process customizes BERT’s pre-trained weights for the SRC task. Instead of relying on default settings, optimal hyperparameters are identified through a systematic ablation study, which evaluates the impact of various configurations on model performance.
Training involves backpropagation, enabling the model to learn task-specific patterns from both the PROMISE and FR_NFR datasets. The overall research methodology and model development workflow are illustrated in Figure 2. A detailed analysis of hyperparameter effects is provided in Section 4.3, and the optimal configurations for both datasets are summarized in Table 3.
Figure 2.
Workflow for Methodology and Fine-Tuned BERT Model Development.
For optimization, the model employs the Adaptive Moment Estimation (Adam) optimizer [], widely recognized for its adaptive learning rate and efficient convergence []. The sparse categorical cross-entropy loss function measures the discrepancy between predicted and actual class labels, making it well suited for multi-class classification tasks.
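In Keras terms, this pairing amounts to the compile call below, a minimal sketch assuming the build_classifier wrapper from Section 3.2; because the classification head already applies softmax, the loss operates on probabilities rather than logits.

```python
import tensorflow as tf

model = build_classifier()  # sketch from Section 3.2
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),  # integer labels: 0 = FR, 1 = NFR
    metrics=["accuracy"],
)
```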
To ensure consistency and reliability in evaluation, the same dataset-splitting methodology used in the baseline studies is adopted. Specifically, a 10-fold stratified K-fold cross-validation technique is implemented to maintain balanced class distributions across all splits, ensuring unbiased performance assessment, as sketched below. The results of this evaluation are presented in the next section.
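The protocol can be sketched as follows, assuming texts and labels (0 = FR, 1 = NFR) hold the full dataset; train_and_evaluate is a hypothetical helper that fine-tunes a fresh model on the nine training folds and returns predictions for the held-out fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

labels = np.asarray(labels)  # class labels for all requirements
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_f1 = []
for train_idx, test_idx in skf.split(texts, labels):
    preds = train_and_evaluate(train_idx, test_idx)  # hypothetical helper
    fold_f1.append(f1_score(labels[test_idx], preds, average="weighted"))
print(f"Mean weighted F1 over 10 folds: {np.mean(fold_f1):.2f}")
```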
4. Results and Discussion
This study investigates the impact of hyperparameter sensitivity and dataset size on the performance of the traditional BERT model following fine-tuning for the SRC task. An ablation analysis approach was adopted to systematically evaluate how different configurations influence model effectiveness. This section presents and discusses the experimental findings in detail.
4.1. Experimental Results
The complete source code (Python 3) used for experimentation is publicly available on GitHub (https://github.com/omercomail/Requirements-BERT, accessed on 12 September 2025), ensuring transparency and reproducibility.
To assess the model's performance, we conducted comparative experiments using the PROMISE dataset (version 3), which serves as a benchmark in SRC research. Our results are compared against state-of-the-art baseline models, including both traditional machine learning algorithms (e.g., Support Vector Machines, Naïve Bayes) and deep learning approaches, particularly BERT-based models. These comparisons are summarized in Table 4.
In addition to the PROMISE dataset, we evaluated our model on the FR_NFR dataset, which is approximately ten times larger, thus allowing for the evaluation of the scalability and robustness of the fine-tuned BERT model across datasets of varying sizes.
To evaluate the performance of both the baseline models and the proposed BERT-based model, we utilized standard classification metrics: accuracy, precision, recall, F1-score, and the confusion matrix.
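These metrics map directly onto scikit-learn, as in the brief sketch below; y_true and y_pred stand for the gold and predicted integer labels of one evaluation fold.

```python
from sklearn.metrics import classification_report, confusion_matrix

# y_true / y_pred: assumed integer label arrays (0 = FR, 1 = NFR)
# collected from a single cross-validation fold.
print(classification_report(y_true, y_pred, target_names=["FR", "NFR"], digits=2))
print(confusion_matrix(y_true, y_pred))
```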
With the aim of ensuring robust and unbiased evaluation, we adopted a 10-fold stratified K-fold cross-validation strategy. Through this approach, the dataset is systematically divided into ten equal subsets while maintaining the original class distribution within each fold. The model is then trained on nine folds and tested on the remaining one, and this process is repeated ten times so that each subset serves as the test set exactly once. The final performance metrics are then averaged across all folds, providing a stable and reliable estimate of the model’s generalization capability while minimizing the effect of random data partitioning.
Following the approach used by [], we evaluated model performance using the weighted average F1-score, calculated across all requirement categories. This metric facilitates a fair comparison across models, especially when dealing with imbalanced datasets. Our proposed fine-tuned BERT model achieved state-of-the-art results, outperforming existing baseline approaches for the SRC task. Among the baseline models, the highest reported weighted average F1-score was 0.94, achieved by [,,] on the PROMISE dataset.
4.2. Fine-Tuned BERT Model Results for the SRC Task
This section presents a detailed analysis of the results obtained using our proposed fine-tuned BERT model, evaluated on two datasets of different sizes. For the PROMISE dataset, the model demonstrated superior performance compared to baseline methods, as shown in Table 4.
For the PROMISE dataset, the corresponding confusion matrix in Figure 3a visually represents the classification accuracy across classes, while Table 2a presents the complete classification report.

Figure 3.
(a). Confusion matrix of SRC task in PROMISE dataset. (b). Confusion matrix of SRC task in FR_NFR dataset.
For the FR_NFR dataset, which is significantly larger, the model achieved an F1-score of 0.98 for the NFR class and 0.96 for the FR class using the optimal hyperparameter configuration. These results are detailed in Table 2b, with the associated confusion matrix shown in Figure 3b. To the best of our knowledge, no prior studies have reported binary classification results on this dataset, highlighting the novelty and contribution of our work.
Table 2.
(a). PROMISE dataset: classification report for SRC task. (b). FR_NFR dataset: classification report for SRC task.
(a)

| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| FR | 0.98 | 0.98 | 0.98 | 255 |
| NFR | 0.99 | 0.99 | 0.99 | 370 |
| Accuracy | | | 0.98 | 625 |
| Macro Avg | 0.98 | 0.98 | 0.98 | 625 |
| Weighted Avg | 0.98 | 0.98 | 0.98 | 625 |

(b)

| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| FR | 0.97 | 0.98 | 0.98 | 3964 |
| NFR | 0.96 | 0.95 | 0.96 | 2153 |
| Accuracy | | | 0.97 | 6117 |
| Macro Avg | 0.97 | 0.97 | 0.97 | 6117 |
| Weighted Avg | 0.97 | 0.97 | 0.97 | 6117 |
4.3. Hyperparameter Ablation
To address RQ1 ("Which hyperparameters have the most significant impact on BERT fine-tuning effectiveness for the SRC task?"), this study conducted a systematic ablation analysis across two datasets of varying sizes to identify the most influential hyperparameter configurations. The findings highlight that the effectiveness of BERT fine-tuning for the SRC task is highly sensitive to the choice of hyperparameters. Specifically, learning rate, batch size, number of training epochs, dropout rate, and weight decay were found to significantly affect the model's performance and its ability to generalize to unseen data. These hyperparameters play a crucial role in optimizing the training process and ensuring robust model behavior across different dataset scales, as previously indicated by []. We then employed a one-at-a-time approach, varying each hyperparameter individually while keeping the others constant. This method allowed us to isolate and understand the impact of each parameter on model performance.
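A schematic of this one-at-a-time protocol is given below. Only the batch sizes (8, 16, 32) and learning rates (1 × 10⁻⁵, 5 × 10⁻⁵) are reported explicitly in Sections 4.3.2 and 4.3.3; the epoch candidates shown are illustrative, and run_cv is a hypothetical helper wrapping the 10-fold protocol of Section 3.3.

```python
# Baseline configuration; each sweep changes exactly one entry.
baseline = {"learning_rate": 1e-5, "batch_size": 8, "epochs": 8,
            "dropout": 0.2, "max_len": 256}
candidates = {
    "epochs":        [2, 4, 8, 12],  # illustrative sweep around the optimum
    "batch_size":    [8, 16, 32],    # values tested in Section 4.3.2
    "learning_rate": [1e-5, 5e-5],   # values tested in Section 4.3.3
}
for name, values in candidates.items():
    for value in values:
        cfg = {**baseline, name: value}  # vary one knob, hold the rest fixed
        score = run_cv(cfg)              # hypothetical 10-fold CV helper
        print(f"{name}={value}: weighted F1 = {score:.3f}")
```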
4.3.1. Effect of Training Epochs on BERT Fine-Tuning Performance
We first investigated the effect of varying the number of training epochs on BERT’s performance for the SRC task using the PROMISE dataset. Given the moderate class imbalance in this dataset, we used the F1-score as the primary evaluation metric and reported performance separately for FRs and NFRs. The F1-score, which balances precision and recall, is particularly suitable for evaluating models under class imbalance conditions.
Figure 4 illustrates the performance trends across different epoch settings. For FRs, the F1-score ranged from 0.96 to 0.99, showing a gradual improvement up to 8 epochs, after which performance slightly declined. A similar pattern was observed for NFRs, with F1-scores also ranging between 0.96 and 0.99, peaking at 8 epochs before experiencing a minor drop. These results suggest that 8 epochs represent an optimal training duration for this task, beyond which the model may begin to overfit.
Figure 4.
F1-Score across software requirement types and epoch settings in PROMISE dataset.
In conclusion, for all classes in the PROMISE dataset, the model achieved its best performance at 8 training epochs, beyond which a decline in F1-score was observed due to overfitting. This indicates that 8 epochs strike a balance between effective learning and generalization to unseen data. In contrast, for the larger FR_NFR dataset, the optimal performance was reached at 4 epochs, achieving an F1-score of 0.98 for the FR class and 0.96 for the NFR class, as shown in Table 2b.
4.3.2. Effect of Batch Size Selection on BERT Fine-Tuning Performance
Batch size plays a crucial role in model convergence and generalization, especially when fine-tuning BERT for classification tasks. In our experiments on the PROMISE dataset, a batch size of 8 consistently yielded superior F1-scores across both FR and NFR classes compared to larger batch sizes of 16 and 32. As illustrated in Figure 5, these results suggest that smaller batch sizes are more effective for fine-tuning BERT on smaller datasets, likely due to better gradient estimation and reduced overfitting.
Figure 5.
F1-Score across software requirement types and batch size settings in PROMISE dataset.
For the larger FR_NFR dataset, however, a batch size of 16 resulted in higher F1-scores for both classes, outperforming batch sizes of 8 and 32. These findings underscore the importance of aligning batch size selection with dataset scale. Specifically, smaller datasets benefit from smaller batch sizes, while larger datasets require moderately larger batch sizes to optimize learning efficiency and performance.
This observation is consistent with prior research, including [], which emphasizes the interdependence between dataset size and batch size in BERT fine-tuning. Selecting an appropriate batch size is therefore a critical step in optimizing model performance for SRC tasks.
4.3.3. Effect of Learning Rate on BERT Fine-Tuning Performance
The learning rate is a critical hyperparameter in BERT fine-tuning, as it determines the step size for updating model parameters during training. Its selection directly affects both training stability and convergence speed, making it essential for achieving optimal performance [].
In our experiments, we evaluated two learning rate settings: a lower rate of 1 × 10⁻⁵ and a higher rate of 5 × 10⁻⁵. As illustrated in Figure 6, the lower learning rate consistently yielded higher F1-scores across all requirement classes for both the PROMISE and FR_NFR datasets. These results suggest that a more conservative learning rate allows the model to fine-tune more precisely, especially on smaller datasets.
Figure 6.
F1-Score across software requirement types and learning rate settings in PROMISE dataset.
Interestingly, when the batch size was increased to 32 and the learning rate to 5 × 10⁻⁵ on the larger FR_NFR dataset, the model achieved performance comparable to the optimal configuration. This highlights the importance of dataset-size-aware hyperparameter tuning: larger datasets can tolerate more aggressive learning rates and larger batch sizes without compromising performance.
These findings align with observations from the original BERT paper and the GLUE benchmark, which showed that different tasks and dataset sizes require tailored learning rate strategies to achieve optimal results.
Table 3.
Top-performing Hyperparameters for BERT Fine-Tuning on SRC task.
| Parameter | PROMISE Dataset | FR_NFR Dataset |
|---|---|---|
| Optimizer | Adam | Adam |
| Max Seq Length | 256 | 256 |
| Dropout Rate | 0.2 | 0.2 |
| Batch Size | 8 | 16 |
| Learning Rate | 1 × 10⁻⁵ | 1 × 10⁻⁵ |
| Epochs | 8 | 4 |
To address RQ3, the optimal configuration was identified through a detailed ablation study assessing the impact of dataset size and hyperparameter variations. The configuration shown in Table 3 was derived from this analysis, rather than relying on default settings. This process involved systematically testing various combinations to identify the most effective setup for optimizing model performance. For smaller datasets, the best results were achieved using a learning rate of 1 × 10⁻⁵, batch size of 8, and 8 training epochs. For larger datasets, the optimal configuration included a learning rate of 1 × 10⁻⁵, batch size of 16, and 4 training epochs. Across all experiments, we consistently applied the Adam optimizer, set the maximum sequence length to 256 tokens, and used a dropout rate of 0.2 to reduce overfitting. The classification head incorporated softmax activation and sparse categorical cross-entropy loss, enhancing the model's ability to distinguish between classes. Each hyperparameter was carefully selected based on empirical results, ensuring that the model was effectively tuned to learn task-specific patterns. These configurations reflect deliberate choices grounded in experimental evidence rather than arbitrary defaults.
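Put together, the Table 3 configuration for the PROMISE setting corresponds to the training call sketched below, again assuming the build_classifier wrapper of Section 3.2 and tokenised inputs prepared as in Section 3.3; for FR_NFR, the batch size would be 16 and the epochs 4.

```python
import tensorflow as tf

model = build_classifier(num_classes=2, dropout_rate=0.2, max_len=256)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"],
)
# train_enc / train_labels: tokenised requirements and 0/1 labels for
# the current training folds (assumed prepared as in Section 3.3).
model.fit(
    {"input_ids": train_enc["input_ids"],
     "attention_mask": train_enc["attention_mask"]},
    train_labels,
    batch_size=8,  # PROMISE: 8; FR_NFR: 16
    epochs=8,      # PROMISE: 8; FR_NFR: 4
)
```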
To address RQ2—How does dataset size affect BERT performance in the SRC task?—our findings show that dataset size plays a critical role in shaping generalization, convergence speed, and training stability. The larger FR_NFR dataset enabled robust learning, achieving optimal performance at 4 epochs with a batch size of 16, while the smaller PROMISE dataset required 8 epochs and a batch size of 8 to prevent overfitting. Although PROMISE yielded a slightly higher peak F1-score, FR_NFR demonstrated more consistent performance across classes and epochs. These results align with prior studies [,], which highlight the instability of fine-tuning on small datasets and the generalization benefits of larger ones. Thus, even when peak metrics are similar, larger datasets offer a more stable foundation for fine-tuning BERT in SRC tasks.
Table 4.
Comparison of methods for classifying software requirements as Functional (FR) or Non-Functional (NFR) on the PROMISE dataset using 10-fold cross-validation.
| Study | Technique | FR P | FR R | FR F1 | NFR P | NFR R | NFR F1 | F1 Avg |
|---|---|---|---|---|---|---|---|---|
| [] | SVM (word features) | 0.92 | 0.93 | 0.93 | 0.93 | 0.92 | 0.92 | 0.93 |
| [] | Naïve Bayes (TF-IDF) | 0.80 | 0.93 | 0.86 | 0.95 | 0.85 | 0.90 | 0.88 |
| [] | BERT classifier | 0.92 | 0.88 | 0.90 | 0.92 | 0.95 | 0.93 | 0.92 |
| [] | BERT + GAT | 0.94 | 0.90 | 0.92 | 0.94 | 0.98 | 0.96 | 0.94 |
| [] | Processed data | 0.90 | 0.97 | 0.93 | 0.98 | 0.93 | 0.95 | 0.94 |
| [] | PRCBERT with RoBERTa-large | 0.92 | 0.95 | 0.93 | 0.94 | 0.96 | 0.95 | 0.94 |
| Our model | BERT (fine-tuned) | 0.98 | 0.98 | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 |
5. Limitations and Future Work
This research lays a solid foundation for enhancing SRC through BERT fine-tuning. However, several limitations remain that should be addressed in future studies to further improve the model’s applicability and performance.
One key limitation is the scale of dataset testing. While our model demonstrated strong performance on the PROMISE and FR_NFR datasets, it has not yet been evaluated on much larger software requirements datasets. Applying the model to datasets containing hundreds of thousands or even millions of requirements would require significantly more computational resources and extended training time. Such testing would help determine whether the current hyperparameter configurations—batch size 8 with 8 epochs for smaller datasets, and batch size 16 with 4 epochs for larger ones—remain effective at scale or need to be adjusted for larger collections.
Another limitation is the scope of classification. This study focused solely on binary classification between FRs and NFRs, without further distinguishing between specific NFR types. In practice, NFRs encompass a wide range of categories, including security, performance, usability, reliability, maintainability, and portability. Future work should aim to develop models capable of multi-class classification to identify and categorize these distinct NFR types.
To address these limitations, future research should focus on three main directions:
- Scaling the model to handle larger software requirements datasets while maintaining efficiency and accuracy.
- Developing specialized models for fine-grained classification of individual NFR types.
- Creating adaptive tuning mechanisms that automatically adjust hyperparameters based on dataset characteristics.
These enhancements will contribute to building more robust, scalable, and intelligent SRC systems capable of supporting real-world software engineering tasks.
6. Conclusions
This study presented a BERT-based approach for the SRC task, with a strong emphasis on systematic hyperparameter optimization. The primary goal was to automate the classification of FRs and NFRs by leveraging the capabilities of advanced natural language processing techniques. Through careful fine-tuning of the BERT model, we significantly enhanced its performance by optimizing key hyperparameters such as batch size, learning rate, and the number of training epochs.
Experiments were conducted on two datasets of varying sizes: the PROMISE dataset and the larger FR_NFR dataset. Our proposed fine-tuning methodology consistently outperformed existing baseline models, delivering high accuracy and reliable classification results. The ablation study allowed us to identify optimal configurations tailored to dataset size: a batch size of 8 with 8 training epochs for smaller datasets, and a batch size of 16 with 4 training epochs for larger datasets. Both configurations used a learning rate of 1 × 10⁻⁵ and the Adam optimizer, ensuring stable and effective training.
Based on the experimental results, three key conclusions can be drawn. First, BERT fine-tuning with optimized hyperparameters significantly outperforms traditional machine learning models such as SVM, Naïve Bayes, and logistic regression in SRC tasks. Second, systematic hyperparameter tuning is essential for achieving optimal performance, especially for smaller datasets where smaller batch sizes tend to yield better results. Third, the relationship between dataset size, batch size, learning rate, and training epochs is critical to successful BERT fine-tuning. Our findings are consistent with established best practices in transformer-based model optimization.
Overall, this study demonstrates that BERT fine-tuning, when properly optimized, is a powerful and scalable solution for automated software requirements classification. It offers strong potential for practical applications in software engineering, particularly in improving the efficiency and accuracy of requirements analysis.
Author Contributions
S.E. designed the study and wrote the first draft of the manuscript. O.D. and I.S. contributed to implementation and reviewing the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are openly available in Requirements-BERT at https://github.com/omercomail/Requirements-BERT (accessed on 13 September 2025).
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Ozkaya, M.; Kardas, G.; Kose, M.A. An Analysis of the Features of Requirements Engineering Tools. Systems 2023, 11, 576. [Google Scholar] [CrossRef]
- Wiegers, K.E.; Beatty, J.; Wiegers, K.E. Software Requirements, 3rd ed.; Best practices; Microsoft Press: Redmond, WA, USA, 2013; ISBN 978-0-7356-7966-5. [Google Scholar]
- Kaur, K.; Kaur, P. The Application of AI Techniques in Requirements Classification: A Systematic Mapping. Artif. Intell. Rev. 2024, 57, 57. [Google Scholar] [CrossRef]
- Shreda, Q.A.; Hanani, A.A. Identifying Non-Functional Requirements from Unconstrained Documents Using Natural Language Processing and Machine Learning Approaches. IEEE Access 2025, 13, 124159–124179. [Google Scholar] [CrossRef]
- Kaur, K.; Kaur, P. BERT-CNN: Improving BERT for Requirements Classification Using CNN. Procedia Comput. Sci. 2023, 218, 2604–2611. [Google Scholar] [CrossRef]
- Luo, X.; Xue, Y.; Xing, Z.; Sun, J. PRCBERT: Prompt Learning for Requirement Classification Using BERT-Based Pretrained Language Models. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, Rochester, MI, USA, 10–14 October 2022; ACM: New York, NY, USA, 2022; pp. 1–13. [Google Scholar]
- Kurtanović, Z.; Maalej, W. Automatically Classifying Functional and Non-Functional Requirements Using Supervised Machine Learning. In Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal, 4–8 September 2017; pp. 490–495. [Google Scholar]
- Li, G.; Zheng, C.; Li, M.; Wang, H. Automatic Requirements Classification Based on Graph Attention Network. IEEE Access 2022, 10, 30080–30090. [Google Scholar] [CrossRef]
- EzzatiKarami, M.; Madhavji, N.H. Automatically Classifying Non-Functional Requirements with Feature Extraction and Supervised Machine Learning Techniques: A Research Preview. In Proceedings of the Requirements Engineering: Foundation for Software Quality, Virtual, 12–15 April 2021; Dalpiaz, F., Spoletini, P., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 71–78. [Google Scholar]
- Abad, Z.S.H.; Karras, O.; Ghazi, P.; Glinz, M.; Ruhe, G.; Schneider, K. What Works Better? A Study of Classifying Requirements. In Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal, 4–8 September 2017; pp. 496–501. [Google Scholar]
- Haque, M.A.; Rahman, M.A.; Siddik, M.S. Non-Functional Requirements Classification with Feature Extraction and Machine Learning: An Empirical Study. In Proceedings of the 2019 1st International Conference on Advances in Science, Engineering and Robotics Technology (ICASERT), Dhaka, Bangladesh, 3–5 May 2019; pp. 1–5. [Google Scholar]
- Jindal, R.; Malhotra, R.; Jain, A.; Bansal, A. Mining Non-Functional Requirements Using Machine Learning Techniques. e-Inform. Softw. Eng. J. 2021, 15, 85–114. [Google Scholar] [CrossRef]
- Airlangga, G. Enhancing Software Requirements Classification with Semisupervised GAN-BERT Technique. J. Electr. Comput. Eng. 2024, 2024, 4955691. [Google Scholar] [CrossRef]
- Navarro-Almanza, R.; Juarez-Ramirez, R.; Licea, G. Towards Supporting Software Engineering Using Deep Learning: A Case of Software Requirements Classification. In Proceedings of the 2017 5th International Conference in Software Engineering Research and Innovation (CONISOFT), Merida, Mexico, 25–27 October 2017; pp. 116–120. [Google Scholar]
- Baker, C.; Deng, L.; Chakraborty, S.; Dehlinger, J. Automatic Multi-Class Non-Functional Software Requirements Classification Using Neural Networks. In Proceedings of the 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Milwaukee, WI, USA, 15–19 July 2019; Volume 2, pp. 610–615. [Google Scholar]
- Kici, D.; Malik, G.; Cevik, M.; Parikh, D.; Başar, A. A BERT-Based Transfer Learning Approach to Text Classification on Software Requirements Specifications. In Proceedings of the 34th Canadian Conference on Artificial Intelligence, Vancouver, BC, Canada, 25–28 May 2021. [Google Scholar] [CrossRef]
- Hey, T.; Keim, J.; Koziolek, A.; Tichy, W.F. NoRBERT: Transfer Learning for Requirements Classification. In Proceedings of the 2020 IEEE 28th International Requirements Engineering Conference (RE), Zurich, Switzerland, 31 August–4 September 2020; IEEE: New York, NY, USA, 2020; pp. 169–179. [Google Scholar]
- Ferrari, A.; Spagnolo, G.O.; Gnesi, S. PURE: A Dataset of Public Requirements Documents. In Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal, 4–8 September 2017. Available online: https://ieeexplore.ieee.org/document/8049173 (accessed on 16 September 2024).
- Sonali, S.; Thamada, S. FR_NFR_Dataset 2024. Available online: https://data.mendeley.com/datasets/4ysx9fyzv4/1 (accessed on 12 September 2025).
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, 3–5 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186. [Google Scholar]
- Gardazi, N.M.; Daud, A.; Malik, M.K.; Bukhari, A.; Alsahfi, T.; Alshemaimri, B. BERT Applications in Natural Language Processing: A Review. Artif. Intell. Rev. 2025, 58, 166. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar] [CrossRef]
- Gkouti, N.; Malakasiotis, P.; Toumpis, S.; Androutsopoulos, I. Should I Try Multiple Optimizers When Fine-Tuning Pre-Trained Transformers for NLP Tasks? Should I Tune Their Hyperparameters? arXiv 2024, arXiv:2402.06948. [Google Scholar]
- Taj, S.; Daudpota, S.M.; Imran, A.S.; Kastrati, Z. Aspect-Based Sentiment Analysis for Software Requirements Elicitation Using Fine-Tuned Bidirectional Encoder Representations from Transformers and Explainable Artificial Intelligence. Eng. Appl. Artif. Intell. 2025, 151, 110632. [Google Scholar] [CrossRef]
- Mosbach, M.; Andriushchenko, M.; Klakow, D. On the Stability of Fine-Tuning BERT: Misconceptions, Explanations, and Strong Baselines. arXiv 2021, arXiv:2006.04884. [Google Scholar] [CrossRef]
- Zhao, L.; Alhoshan, W. Machine Learning for Requirements Classification. In Handbook on Natural Language Processing for Requirements Engineering; Ferrari, A., Ginde, G., Eds.; Springer Nature: Cham, Switzerland, 2025; pp. 19–59. ISBN 978-3-031-73143-3. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).