1. Introduction
Natural language processing (NLP) has evolved dramatically in recent years, leading to significant advances across a wide range of applications. Among these, text classification stands out for its fundamental role: it involves categorizing text into predefined labels and is crucial for a variety of applications [1]. The process entails analyzing textual data and assigning it to categories based on its content, significantly enhancing the efficiency of information retrieval models and content management platforms [2,3]. This increase in capability is largely attributed to breakthroughs in machine learning techniques and the introduction of deep learning models, which have revolutionized the ways in which machines understand and process human language [4]. These models can grasp nuanced language patterns and contextual variations, enabling more accurate and dynamic classification systems.
However, while these advancements have greatly improved the accuracy of text classification, challenges remain. As the volume and variety of textual data continue to grow, driven by digitalization and the expansion of online platforms, the need for sophisticated text classification mechanisms becomes even more pronounced [1]. At the same time, many models still struggle to capture deeper grammatical structures, which can be pivotal for more complex tasks such as question answering, sentiment analysis, and automated summarization [5]. These models often rely on implicit, context-dependent embeddings, which can fail to represent grammatical cues explicitly. This limitation highlights the need for approaches that explicitly incorporate grammar-based features and thus provide a more structured understanding of the text.
Historically, text classification relied on simpler statistical methods and feature-based models, such as the bag-of-words approach [6,7]. These methods lacked the ability to capture deeper grammatical or syntactic structures. Moreover, not all text encountered in applications carries a specific, clear meaning: text is often short and ambiguous, and many phrases or words admit multiple interpretations. This ambiguity makes it difficult to infer context from limited information [8]. Consequently, classifying content based solely on the surface text can lead to misleading results.
The shift toward integrating more complex features, including semantic and syntactic features, has addressed some of these issues [9,10,11]. Recent research has increasingly focused on incorporating grammatical features [12,13], which provide a structured understanding of sentence syntax and semantics. This structured approach is particularly valuable in domains such as question–answer systems, where specific grammatical cues often point to distinct types of answers. Despite the potential of these grammar-based features, their integration into modern deep learning models remains underexplored.
The advent of deep learning models [14,15], particularly transformer-based architectures such as BERT, RoBERTa, and GPT [16,17,18], has dramatically improved text classification performance. These models leverage dense contextual embeddings to achieve state-of-the-art performance across a variety of tasks. However, while these models capture semantic and syntactic relationships, they often do so implicitly within their embeddings. The question remains whether explicitly integrating grammatical features could further enhance classification performance, especially with respect to syntactic understanding.
This paper investigates the impact of incorporating grammatical structures into text classification models. The study focuses on a dataset composed of questions aggregated from multiple question–answer systems, assessing whether explicit grammatical features contribute to improved classification accuracy and robustness. We argue that grammar-based feature engineering, when explicitly integrated, can provide valuable insights into the structure and meaning of text that go beyond the capabilities of traditional embeddings.
The main objectives of this study are as follows:
To design and validate a grammar-based feature engineering framework for text classification. This framework integrates both grammatical structures (e.g., part-of-speech tags and question types) and domain-specific features (e.g., named entities) to construct enriched, structured input representations that go beyond conventional word embeddings (an illustrative sketch of such features follows this list).
To evaluate and compare the effectiveness of deep learning and transformer-based models, including CNN, BiLSTM, MLP, BERT, DistilBERT, ELECTRA, and GPT-2, using these grammar-informed inputs. The goal is to assess how well each architecture leverages structured syntactic features for classification.
To assess model performance in both binary and multi-class settings, distinguishing between factoid and non-factoid questions in binary classification, and categorizing complex question types (causal, choice, confirmation, hypothetical, and list) in the multi-class scenario.
To analyze the impact of class imbalance and the application of the SMOTE oversampling technique, particularly in enhancing the recognition of under-represented classes such as choice and hypothetical, and to determine how oversampling affects the balance between precision and recall in each model.
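As a concrete illustration of the first objective, the sketch below shows how part-of-speech tags, WH-word cues, and named entities of the kind described above could be extracted with spaCy. The function name, the feature set, and the `en_core_web_sm` pipeline are illustrative assumptions rather than the exact implementation used in this study.

```python
# Hypothetical sketch of grammar-based feature extraction; the feature set and
# the spaCy pipeline are assumptions, not the framework's exact configuration.
import spacy

nlp = spacy.load("en_core_web_sm")

def grammar_features(question: str) -> dict:
    doc = nlp(question)
    return {
        "pos_tags": [token.pos_ for token in doc],                  # part-of-speech sequence
        "wh_word": next((t.lower_ for t in doc
                         if t.tag_ in {"WDT", "WP", "WP$", "WRB"}), None),  # question-type cue
        "entity_labels": [ent.label_ for ent in doc.ents],          # domain-specific entities
        "root_lemma": next((t.lemma_ for t in doc if t.dep_ == "ROOT"), None),
    }

print(grammar_features("Which planet in our solar system has the most moons?"))
```

Structured features of this kind can then be concatenated with, or embedded alongside, conventional word representations before being passed to the classifiers evaluated later.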
The remainder of this paper is structured as follows. Section 2 presents a detailed review of prior work in text classification, covering traditional machine learning, deep learning, and transformer-based approaches, with a focus on the use of grammatical and syntactic features. Section 3 introduces the proposed grammar-based classification framework, including its components for text parsing, feature extraction, and classification, along with examples and feature mapping tables. Section 4 outlines the experimental setup, detailing the dataset characteristics, model architectures, evaluation metrics, and the design of the binary and multi-class classification experiments, including a SMOTE-enhanced setting to address class imbalance. Section 5 presents and analyzes the experimental results, comparing the performance of grammar-based deep learning and transformer models across all tasks. Finally, Section 6 concludes the paper by summarizing key findings and outlining future directions for grammar-informed NLP systems.
2. Related Work
This section provides a comprehensive overview of prior research in text classification, focusing on traditional machine learning approaches, neural networks, transformer-based models, and large language models.
Table 1 summarizes key techniques and representative works.
Traditional text classification methods rely heavily on machine learning algorithms such as naive Bayes, support vector machines (SVM), and decision trees. These models typically employ simple text representation techniques such as bag-of-words and TF-IDF [20]. Early research explored a wide range of features for classification, including unigram features, word shape patterns, and semantic and syntactic cues [7,21,25,26,47]. Feature selection algorithms were used to optimize these features for specific question types, enhancing accuracy. Further improvements were achieved by incorporating headword features and leveraging external resources such as WordNet to refine semantic representations [48]. Studies showed that integrating such complex feature sets helped capture deeper linguistic structures [21,25,49].
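For readers less familiar with this classical pipeline, a minimal sketch is given below, assuming scikit-learn's TF-IDF vectorizer combined with a linear SVM; the tiny inline dataset is purely a placeholder, not data used in this study.

```python
# Minimal sketch of a traditional TF-IDF + linear SVM text classifier (placeholder data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

questions = ["Who wrote Hamlet?", "Why does ice float on water?",
             "When did World War II end?", "How should I start learning the piano?"]
labels = ["factoid", "non-factoid", "factoid", "non-factoid"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(questions, labels)
print(clf.predict(["When was the telephone invented?"]))
```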
To improve model robustness and generalizability, ensemble techniques such as random forests and AdaBoost were adopted [19,22,24]. K-nearest neighbor (KNN) classifiers were also widely used for baseline comparisons in short-text and question classification tasks [23,27]. In parallel, distributional semantic models such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) were employed to enhance feature representation beyond sparse encodings like TF-IDF [28,30].
The emergence of deep learning brought significant advances in text classification [50]. Neural architectures such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) replaced manual feature engineering with automatic representation learning. These models, coupled with word embeddings such as Word2Vec and GloVe, enabled a more nuanced understanding of semantic relationships between words [29,31]. Feature selection techniques such as chi-square, information gain, and mutual information remained important for managing model complexity and improving performance [32,51].
CNNs and RNNs demonstrated strong results in general text classification as well as domain-specific tasks such as question classification [33,34,52]. These models were instrumental in shifting from static features to context-aware learning. LSTM networks, in particular, proved effective for handling sequential dependencies in text, achieving high performance across tasks [35,36].
More recently, large language models (LLMs) have reshaped the landscape of text classification through transfer learning and extensive pre-training. Ref. [38] explored the integration of LLMs with deep learning architectures, demonstrating improved accuracy in question classification via fine-tuning of pre-trained models. Models such as GPT-4 and LLaMA3 have shown considerable success in capturing nuanced language patterns due to their scale and generalization capabilities [37]. In addition, prompting techniques such as clue and reasoning prompting (CARP) have been developed to enhance LLM reasoning and classification ability [39].
Beyond general-purpose LLMs, several domain-specific transformer-based frameworks have been proposed. CoBerTC, introduced in [42], is a COVID-19-specific classification framework consisting of three main components: fine-tuning, model inference, and optimal model selection. The study evaluated six transformer models (mBERT, XLM-RoBERTa, mDistilBERT, IndicBERT, MuRIL, and mDeBERTa-V3) on the ECovC dataset, with XLM-RoBERTa achieving 94.22% accuracy.
Scalability and label sparsity have also been addressed through transformer innovations. Chang et al. [40] proposed X-Transformer, a scalable framework for extreme multi-label text classification (XMC), which combines semantic label indexing, transformer fine-tuning, and ensemble ranking to outperform models such as Parabel and AttentionXML. Similarly, Guo et al. [41] introduced the multi-scale transformer, which integrates multi-scale multi-head self-attention (MSMSA) to capture both local and global context, yielding better performance than standard transformer architectures, particularly on smaller datasets.
Comparative evaluation remains central to understanding model efficacy. Petridis [43] compared pre-trained transformer models (BERT, RoBERTa, and DistilBERT) with classical machine learning algorithms (SVM, random forest, and logistic regression) and traditional neural networks (MLP, RNN, and TransformerEncoder). Using TF-IDF and GloVe embeddings for the non-transformer baselines, the results confirmed the consistent superiority of transformer-based models, with BERT and RoBERTa achieving up to 85.16% accuracy on level-1 classification tasks.
Comprehensive surveys [45,46] documented the evolution from traditional pipelines to neural and transformer-based models, highlighting their improved contextual awareness and generalization. Foundational contributions such as BERT [53], a deep bidirectional transformer pre-trained on large corpora, and ULMFiT [44], a transfer learning approach for fine-tuning language models, set new benchmarks in text classification.
While the above subsections outline the technical evolution of classification models, it is also important to acknowledge their practical deployments in applied research. Transformer models such as BERT have been successfully applied in biomedical document classification [54], legal document analysis [55], and educational question answering [56]. Similarly, CNN and BiLSTM architectures have been widely used in sentiment analysis [46], social media content classification [57], and multi-label news categorization. These applications illustrate the versatility of these models across domains, while also revealing a gap in the integration of grammatical or syntactic feature engineering. Our work addresses this gap by incorporating linguistically grounded features that improve classification robustness, particularly in imbalanced and semantically diverse question classification tasks.
Despite these advancements, the explicit integration of grammar-based feature engineering in deep learning and transformer models remains underexplored. While many models implicitly capture syntactic and semantic structures, few studies incorporate structured grammatical features directly.
5. Experimental Results
This section presents the results of the experimental study, analyzing the performance of grammar-based classification models across binary and multi-class settings. The evaluation compares deep learning and transformer-based models, assessing their ability to leverage structured grammatical features. Results are reported for both baseline and SMOTE-enhanced configurations, with particular focus on classification accuracy, class-wise F1-scores, and the impact of class imbalance mitigation techniques.
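For reference, the per-class precision, recall, and F1-scores reported throughout this section, together with accuracy and the macro and weighted averages, can be computed with scikit-learn; the short sketch below uses placeholder predictions purely to show how these quantities are obtained.

```python
# Sketch of the evaluation metrics used below; y_true and y_pred are placeholders.
from sklearn.metrics import classification_report

y_true = ["factoid", "non-factoid", "factoid", "factoid", "non-factoid"]
y_pred = ["factoid", "factoid",     "factoid", "factoid", "non-factoid"]

# Prints per-class precision/recall/F1 plus accuracy, macro avg, and weighted avg.
print(classification_report(y_true, y_pred, digits=2))
```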
5.1. Grammar-Based Model Performance for Binary Classification
This subsection presents and analyzes the performance of grammar-based models in a binary classification task, distinguishing between factoid and non-factoid questions.
Table 5 provides the precision, recall, and F1-scores for each model across both classes, along with overall accuracy, macro, and weighted averages.
As shown in the table, transformer-based models outperformed traditional deep learning models. DistilBERT achieved the highest overall accuracy (0.94), factoid recall (0.96), and F1-score (0.95), confirming its superior capability in capturing grammatical and contextual features. BERT and RoBERTa also demonstrated high performance with accuracies of 0.91 and 0.89, respectively. GPT-2 and ELECTRA yielded solid but slightly lower results, especially in precision and recall for non-factoid questions.
Among deep learning models, BiLSTM and CNN showed competitive performance. BiLSTM achieved an F1-score of 0.90 and recall of 0.91 for factoid questions, indicating its effectiveness in handling sequential grammatical dependencies. CNN excelled in precision (0.92) but had slightly lower recall (0.89), reflecting its strength in local feature detection. In contrast, the MLP model underperformed, with an accuracy of only 0.57 and a non-factoid recall of just 0.08, highlighting its limitations in capturing complex grammatical relationships. The ANN provided a more balanced profile, achieving an accuracy of 0.89 and robust scores across both classes.
Figure 2 further illustrates these trends, showing the individual precision, recall, and F1-score values for factoid and non-factoid questions.
It is evident that transformer-based models, especially DistilBERT, maintain consistent high performance across all metrics, whereas MLP demonstrates significant drops, particularly in non-factoid recall and F1-score. BiLSTM and CNN maintain high metric scores, but with more variation between precision and recall.
Figure 3 presents a comparative summary across all models using bar charts, facilitating clearer visual analysis of the overall accuracy, macro average, and weighted average.
As shown in the chart, DistilBERT leads across all overall metrics, while BERT, RoBERTa, and CNN form the next tier. The visualization also highlights the underperformance of MLP across all dimensions, underscoring the need for more expressive architectures for binary classification based on grammatical structures.
Overall, transformer-based models demonstrated superior generalization, particularly for factoid questions, while BiLSTM and CNN remained competitive among deep learning models. DistilBERT emerged as the most balanced and effective model for grammar-based binary classification tasks.
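To indicate how a transformer such as DistilBERT can be fine-tuned for this binary task, a hedged sketch using the Hugging Face Transformers Trainer is shown below; the hyperparameters and the two-example dataset are illustrative placeholders, not the configuration used in our experiments.

```python
# Illustrative DistilBERT fine-tuning sketch (placeholder data and hyperparameters).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # 0 = factoid, 1 = non-factoid

train_ds = Dataset.from_dict({
    "text": ["Who discovered penicillin?", "Why do leaves change colour in autumn?"],
    "label": [0, 1],
}).map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                           max_length=64), batched=True)

args = TrainingArguments(output_dir="distilbert-questions", num_train_epochs=1,
                         per_device_train_batch_size=2, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```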
5.2. Grammar-Based Model Performance for Multi-Class Classification (Baseline)
This subsection evaluates the performance of grammar-based models in a multi-class classification task, categorizing questions into six distinct classes: causal, choice, factoid, confirmation, hypothetical, and list.
Table 6 provides a detailed breakdown of model performance across these classes, including precision, recall, and F1-score per class, as well as overall accuracy, macro average, and weighted average scores.
Among the deep learning models, BiLSTM achieved the highest overall accuracy (0.90), along with the best macro average (0.61) and weighted average (0.90), indicating balanced performance across both frequent and infrequent classes. It demonstrated particularly strong results in the confirmation (F1 = 0.96) and factoid (F1 = 0.93) categories. ANN followed closely with an overall accuracy of 0.88 and strong performance in confirmation (F1 = 0.93) and list (F1 = 0.67) questions. CNN also performed robustly (accuracy = 0.87), though its macro average (0.47) suggests slightly less balanced effectiveness across all categories.
In contrast, MLP displayed the weakest performance among neural models. It yielded near-zero scores in several classes, including causal, choice, and hypothetical, resulting in an overall accuracy of 0.70 and a macro average of 0.23. These shortcomings are clearly visualized in Figure 4, where MLP’s F1-scores approach zero for these under-represented classes.
Among transformer-based models, DistilBERT delivered promising results, achieving excellent performance in the causal (F1 = 0.92), confirmation (F1 = 0.96), and factoid (F1 = 0.92) classes. However, it failed to identify any instances from the choice, hypothetical, or list classes, likely due to class imbalance. GPT-2 showed competitive performance (accuracy = 0.88), with strong results in confirmation (F1 = 0.94) and factoid (F1 = 0.92), and moderate success in causal (F1 = 0.67), but similarly struggled with the low-frequency classes.
ELECTRA and RoBERTa exhibited inconsistent performance. ELECTRA achieved a reasonable accuracy (0.84), but its class-wise scores fluctuated significantly, particularly in the hypothetical and list categories. BERT was the weakest performer overall, with an accuracy of just 0.21 and minimal contribution across all classes, indicating difficulties in adapting to grammar-based, structured input.
From a class-wise perspective, the confirmation class was the easiest to classify, with nearly all models achieving F1-scores above 0.90. The factoid class followed closely, with strong performance from most models except BERT and MLP. The causal class posed moderate difficulty; while some models achieved perfect precision (e.g., ANN, CNN, and GPT-2), only a few attained high recall.
In contrast, the choice and hypothetical classes were the most challenging, with no model successfully identifying examples from these categories. This underscores the impact of class imbalance in the dataset and highlights the need for enhanced augmentation or contextual modeling strategies.
The list class showed moderate success, with BiLSTM (F1 = 0.65), ANN (F1 = 0.67), and CNN (F1 = 0.59) performing best. Transformer-based models generally underperformed in this category.
In summary, deep learning models such as BiLSTM, ANN, and CNN demonstrated stronger and more consistent performance across classes. While DistilBERT and GPT-2 were competitive in specific categories, they struggled with under-represented ones. BERT and MLP were the least effective overall. These results indicate that grammar-based deep learning architectures remain competitive for multi-class classification, particularly in scenarios involving diverse and imbalanced data.
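To make the architecture behind these results more concrete, the sketch below outlines a BiLSTM classifier of the kind evaluated here, written with tf.keras; the vocabulary size, sequence length, embedding dimension, and layer sizes are assumptions for illustration, not the exact settings of our models.

```python
# Illustrative BiLSTM classifier for six question classes (assumed hyperparameters).
import tensorflow as tf

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 10_000, 40, 6

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),                # token embeddings
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # left and right context
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```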
5.3. Grammar-Based Model Performance for Multi-Class Classification (SMOTE-Enhanced)
This subsection investigates the effect of addressing class imbalance through the application of SMOTE (synthetic minority over-sampling technique) on grammar-based models for multi-class text classification. SMOTE generates synthetic examples for minority classes by interpolating between existing samples, effectively balancing class distribution without introducing noise. The aim is to enhance model performance across six question categories: causal, choice, factoid, confirmation, hypothetical, and list.
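A minimal sketch of this balancing step is given below, using imbalanced-learn's SMOTE on vectorized question features; the TF-IDF representation and the toy class distribution are assumptions standing in for the grammar-enriched inputs actually used.

```python
# Minimal SMOTE sketch on placeholder data (TF-IDF stands in for the real features).
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.feature_extraction.text import TfidfVectorizer

questions = ["Who invented the telescope?"] * 8 + [
    "Would life exist without the Moon?",
    "What would happen if the Internet disappeared?",
]
labels = ["factoid"] * 8 + ["hypothetical"] * 2

X = TfidfVectorizer().fit_transform(questions)
X_res, y_res = SMOTE(k_neighbors=1, random_state=42).fit_resample(X, labels)

print(Counter(labels), "->", Counter(y_res))  # minority class oversampled to parity
```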
Table 7 presents a comprehensive overview of precision, recall, and F1-score per class, alongside overall accuracy, macro average, and weighted average scores for each model. Following this, Figure 5 visualizes per-class performance metrics, highlighting the impact of class balancing on model robustness.
Among deep learning models, ANN and BiLSTM achieved the highest classification performance, both attaining an accuracy of 0.92. CNN followed closely with an accuracy of 0.90. These models consistently achieved strong F1-scores across all classes, indicating balanced precision and recall. For example, ANN handled the causal class well with an F1-score of 0.82, precision = 0.90, and recall = 0.75, while BiLSTM maintained good recall (0.75) and an F1-score of 0.77. CNN also performed strongly in this class with precision of 0.85 and an F1-score of 0.80. GPT-2 achieved perfect precision (1.00) for causal questions but exhibited low recall (0.50), leading to an F1-score of 0.67, indicating overconfidence with reduced sensitivity.
Although performance on the choice class improved after SMOTE was applied, it remained one of the most difficult categories. ANN performed best (F1 = 0.55), while most other models hovered around 0.50. BERT and ELECTRA achieved only 0.31, showing difficulty in learning distinguishing features for this category. GPT-2 also struggled (F1 = 0.33), pointing to the need for further fine-tuning or data augmentation.
The confirmation class showed the most consistent performance across all models. BiLSTM reached a near-perfect recall (0.98) and F1-score (0.96), closely followed by ANN and CNN. Transformer-based models such as ELECTRA (F1 = 0.91) and DistilBERT (F1 = 0.96) also excelled. Only BERT lagged behind significantly with an F1-score of 0.40.
Factoid classification remained a strong area for most models. ANN, CNN, and BiLSTM achieved F1-scores of 0.95, 0.92, and 0.93, respectively. Transformer models such as DistilBERT (F1 = 0.92), ELECTRA (F1 = 0.91), and GPT-2 (F1 = 0.92) also performed well. BERT again showed the weakest results with only 0.50.
Despite improvements from SMOTE, the hypothetical class continued to pose challenges. BiLSTM achieved the highest F1-score (0.67) through perfect recall (1.00) but had moderate precision (0.50). ANN and CNN followed with F1-scores of 0.58 and 0.50, respectively. Transformer models such as BERT, GPT-2, and ELECTRA scored below 0.35, reflecting difficulty in capturing abstract or less frequent hypothetical constructs.
For the list class, BiLSTM again led with an F1-score of 0.75, followed by CNN (0.73) and ANN (0.72). GPT-2 delivered moderate results (F1 = 0.63), while MLP, DistilBERT, BERT, and ELECTRA scored below 0.40. These findings suggest that sequential models like BiLSTM and CNN are better suited to this question type.
In conclusion, addressing class imbalance with SMOTE led to measurable improvements in model performance, particularly in macro and weighted averages, which reflect the treatment of under-represented classes.
Figure 6 presents a comparative bar chart summarizing the overall accuracy, macro-average F1-score, and weighted-average F1-score for each grammar-based model after SMOTE application. Deep learning models, especially ANN, BiLSTM, and CNN, consistently delivered high accuracy (>90%) and balanced performance across classes. While transformer models such as GPT-2 and DistilBERT demonstrated competitive accuracy, their macro-average scores were slightly lower, suggesting sensitivity to class imbalance. BERT, in contrast, lagged behind in all metrics, underscoring its relative weakness in handling grammatically enriched, imbalanced multi-class datasets. Overall, the chart visually confirms the effectiveness of SMOTE in enhancing fairness and generalization across diverse model architectures.
5.4. Overall Comparison with Baseline Models
This subsection presents a comprehensive comparison between grammar-enhanced models and their baseline counterparts that do not incorporate grammatical features. The performance is evaluated across three experimental settings: binary classification, multi-class classification, and SMOTE-enhanced multi-class classification. Each task is assessed using standard metrics such as accuracy, macro average, and weighted average scores.
Table 8, Table 9 and Table 10 summarize the baseline results for each task. The visual summary in Figure 7 further illustrates the accuracy differences across model types and configurations.
Table 8 presents the performance of baseline models across the factoid and non-factoid classes in the binary setting.
As shown, DistilBERT achieved the highest accuracy (0.90) and macro F1-score (0.89) among the baseline models, while MLP significantly underperformed (accuracy = 0.51). Transformer models generally outperformed deep learning models, although ANN and CNN also produced solid results. However, the lack of grammatical features limited every model's ability to distinguish non-factoid questions, as evidenced by low recall and F1-scores for this class.
Table 9 displays performance across the six-class setting without oversampling.
In this configuration, BiLSTM achieved the best macro F1-score (0.54), while MLP and BERT performed poorly across most classes. The choice and hypothetical categories remained unrecognized by all models, highlighting the impact of class imbalance. Overall, macro average scores ranged from 0.05 (BERT) to 0.54 (BiLSTM), indicating poor minority-class recognition without grammar support.
Table 10 presents the performance of models after applying SMOTE.
Performance generally improved after SMOTE. BiLSTM and ANN achieved the highest accuracy (0.85), and macro F1-scores increased across all models by approximately 10–20 percentage points. However, models like MLP and BERT still struggled to generalize across low-frequency classes, and gains were modest without structural linguistic input.
Figure 7 visually summarizes the accuracy improvements across tasks. Grammar-enhanced models (orange bars) consistently outperformed baseline models (yellow bars), particularly in multi-class and SMOTE settings. The improvements are most prominent in under-represented class handling, as reflected in macro F1-score trends.
In conclusion, grammar-based features consistently improved classification performance across all tasks. The most substantial gains were observed in macro-average scores, highlighting better treatment of minority categories. While baseline models performed adequately on dominant classes, they failed to generalize under syntactic complexity and class imbalance. These findings affirm the utility of grammar-aware NLP frameworks for more robust, equitable, and contextually grounded classification.
5.5. Discussion
The results across the experimental phases (baseline, class-balanced, and binary classification) demonstrate the substantial impact of data imbalance and the effectiveness of oversampling through SMOTE in grammar-based question classification.
Initially, in the baseline multi-class classification setting, deep learning models such as ANN and CNN demonstrated high precision but notably low recall for under-represented classes. For example, ANN recorded perfect precision of 1.00 for the causal class but recall of only 0.50, highlighting the tendency of the model to favor majority classes while neglecting minority ones. This trend was consistent across several low-frequency categories, such as hypothetical and choice, where many models failed to identify any instances. Such discrepancies reflect the skewed distribution of the dataset and the inherent challenges in learning from imbalanced data.
Following the introduction of SMOTE, all models showed marked improvements in minority-class recognition, particularly in F1-score, which balances precision and recall. ANN and CNN improved their F1-scores in the causal class to 0.82 and 0.80, respectively. BiLSTM also showed a notable gain, with an F1-score of 0.77, while even transformer models such as DistilBERT improved from complete failure to moderate performance in several classes. This shift illustrates SMOTE's ability to augment the feature space and alleviate class imbalance without sacrificing the quality of learned representations.
However, the degree of improvement varied across models. Traditional deep learning models, i.e., ANN, CNN, and BiLSTM, consistently outperformed their transformer-based counterparts, particularly in under-represented categories. For instance, while GPT-2 achieved perfect precision (1.00) for the causal class after balancing, its recall dropped to 0.50, resulting in a moderate F1-score of 0.67 and suggesting sensitivity to data augmentation and potential overfitting. Similarly, while BERT showed some improvement, it remained the weakest model overall, often underperforming in both precision and recall even after balancing. This highlights limitations of certain transformer architectures when applied to domain-specific or grammatically enriched classification tasks.
Additionally, the comparative analysis with baseline models, i.e., those that do not incorporate grammar-based features, further validates the effectiveness of grammar-informed representations. Across all tasks, grammar-enhanced models achieved consistently higher macro-average scores, particularly in SMOTE-enhanced settings, where macro F1 improvements ranged from 4% to 10% over the baseline equivalents. These gains were most pronounced in previously under-represented classes, such as hypothetical and choice. Transformer-based models like DistilBERT and RoBERTa, while strong in binary tasks, particularly benefited from grammar-based features in complex multi-class scenarios, where baseline models often failed to detect minority classes altogether.
The contrast between binary and multi-class classification performance further reinforces the importance of class balance. In binary settings, where class distribution is naturally less skewed, models generally achieved higher accuracy and balanced performance without requiring oversampling. This suggests that the complexity of the multi-class scenario, combined with data imbalance, significantly contributes to model degradation, and balancing techniques like SMOTE are essential for fair and accurate classification.
Beyond data augmentation, another key contributor to performance was the integration of grammatical and domain-specific features. These enriched representations enabled models to capture structural, syntactic, and semantic cues that are critical for fine-grained question classification. Even when model architecture was limited, such as in MLP or BERT, the addition of these features offered modest gains. For example, classification in the list and confirmation categories improved across nearly all models, demonstrating the discriminative power of grammar-informed features.
In summary, the findings underscore the importance of both data-centric and model-centric strategies in NLP classification tasks. Class balancing techniques such as SMOTE substantially mitigate the adverse effects of skewed distributions, while grammar-based feature engineering enhances the expressiveness of the input data. The comparison with baseline results further illustrates that grammar-aware architectures significantly enhance robustness and generalizability across both balanced and imbalanced scenarios. Together, these approaches contribute to improved generalization, fairer treatment of minority classes, and more reliable multi-class classification, paving the way for more interpretable and equitable NLP systems.
6. Conclusions and Future Work
This study explored a novel grammar-based classification framework that integrates grammatical and domain-specific features into both deep learning and transformer-based architectures. The framework was rigorously evaluated across binary and multi-class classification tasks using a syntactically rich question dataset. Three experimental settings were considered: baseline classification, class-balanced classification via SMOTE, and binary classification. The primary goals were to assess the utility of grammar-enhanced feature representations, compare the effectiveness of traditional and transformer models, and quantify the impact of class balancing techniques.
The experimental results revealed that integrating structured grammatical features significantly improves model performance, particularly when linguistic cues are critical for label discrimination. In the binary classification task, where the data distribution was naturally balanced, transformer-based models such as DistilBERT and BERT outperformed deep learning alternatives, achieving accuracies of 88–94% and macro F1-scores of 88–93%. In contrast, baseline models without grammatical features reached lower values (83–90% accuracy and 82–89% macro F1).
In the multi-class setting, grammar-enhanced models achieved accuracy scores of 70–90% and macro averages of 23–61%, whereas their baseline counterparts reached 63–83% and 21–54%, respectively. For SMOTE-enhanced multi-class classification, grammar-informed models reached up to 92% accuracy and macro averages of up to 76%, outperforming baseline models, which plateaued at 85% accuracy and a 69% macro average. These results confirm that grammar-based features provide a substantial boost in handling minority classes, as evidenced by consistent improvements in macro-average metrics across all configurations.
Additionally, the comparative analysis highlighted that grammar-based deep learning models remained highly competitive with transformer architectures, particularly in scenarios involving syntactically structured input. While transformer models excelled in capturing context, they often struggled with under-represented classes, even after oversampling. This suggests that transformer-based models may benefit from further adaptation to grammar-informed input formats or from hybrid approaches that incorporate both local structure and global context.
For future work, several promising directions may be pursued. First, exploring adaptive oversampling strategies or cost-sensitive loss functions could further address class imbalance [69]. Second, the integration of attention-based mechanisms over structured grammatical features may enhance interpretability and performance. Third, extending the grammar-based framework to other domains, such as legal, biomedical, or multilingual datasets, would test its generalizability and robustness. Finally, ensemble models that combine syntactic, semantic, and contextual representations may yield additional performance improvements and offer better generalization across diverse text types [70,71].
In conclusion, this work demonstrates the tangible benefits of integrating grammatical structure and domain knowledge into modern text classification pipelines. By bridging traditional feature engineering with state-of-the-art learning models, the proposed framework not only advances performance metrics but also enhances the interpretability and adaptability of NLP systems. These findings pave the way for future innovations in grammar-aware text analytics, offering a promising direction for more equitable, accurate, and linguistically informed natural language understanding.