1. Introduction
Language is a fundamental component of human intelligence and plays a crucial role in cognitive processes. Among its various forms, poetry stands out as a refined and sophisticated artistic expression. It transcends national boundaries and cultural variations, maintaining its popularity across generations and significantly influencing the development of human society [
1].
Poetry represents the earliest form of literary expression in the Arabic language, serving as a powerful medium for articulating Arab self-identity, collective memory, and aspirations. Arabic poetry is typically classified into classical and contemporary forms [
2]. Much of the research on Arabic prosody—also known as ‘arud’—has historically focused on morphology and phonetics, a trend that persisted for many years. Analyzing poetic meters is essential to determining whether a piece follows a consistent rhythmic pattern or contains metrical deviations [
3].
Modern and traditional Arabic differ in their usage of vowels: short vowels are indicated through diacritics in writing, while long vowels are explicitly represented [
4]. In the eighth century, the renowned philologist Al-Farahidi pioneered the study of ancient Arabic poetry by developing a system of poetic meters [
5]. The metric structure in Arabic poetry encompasses sixteen distinct meters. Several key terms are used in Arabic prosody, including:
Tafilah: A distinct metrical foot
Bayt: A poetic line composed of two half-verses
Sadr: The initial segment of a half-verse
Ajuz: The latter segment of a half-verse
Arud: The final part of the Sadr
Darb: The final part of the ‘Ajuz’
Due to the diversity and complexity of the Arabic language, categorizing Arabic poetry requires advanced natural language processing (NLP) techniques [
6]. The morphological, syntactic, and semantic intricacies of Arabic often exceed the capabilities of traditional rule-based and statistical approaches.
The recent advancements in artificial intelligence (AI)—particularly in deep learning (DL) and transformer-based models—have significantly improved outcomes in tasks such as text classification and sentiment analysis. Applying these cutting-edge methods to Arabic poetry classification offers a promising pathway toward improved understanding and model accuracy [
7,
8,
9].
However, the Arabic NLP landscape remains challenging due to the language’s structural complexity and regional variations. Nonetheless, ongoing research and the development of specialized tools and datasets have facilitated more robust and efficient NLP solutions [
10]. As the field continues to grow, the application of Arabic NLP is becoming more widespread across domains, such as healthcare, social media analysis, and even poetry generation [
11,
12].
Recent studies have increasingly adopted machine learning (ML) and deep learning (DL) techniques to tackle the complex task of poetic meter classification. Recurrent neural networks (RNNs) and various ML and DL models have been employed to classify poems based on their metrical patterns [
13,
14,
15].
For meter categorization, handcrafted features such as word length, phonological patterns, and n-grams have traditionally been used with classic ML techniques, including Support Vector Machines (SVMs), Random Forests, and Decision Trees [
16,
17,
18]. While these approaches are efficient, their adaptability across different languages and poetic forms is limited, as they depend heavily on manual feature engineering. In contrast, DL models like BiGRU and BiLSTM can automatically learn auditory and rhythmic patterns directly from raw text. These models have demonstrated strong performance across diverse literary traditions, effectively capturing both sequential and semantic patterns without requiring extensive preprocessing [
19,
20,
21].
Transformer models have revolutionized NLP since the introduction of the ‘Attention Is All You Need’ architecture in 2017 [
22]. These models process input data in parallel using self-attention mechanisms, enabling them to capture long-term dependencies and contextual relationships within text. As a result, transformers have become foundational in many advanced NLP applications, including text classification.
Recent advancements in NLP—particularly the emergence of transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) and its Arabic counterparts—have significantly enhanced Arabic text processing. Comparative analyses have shown improved accuracy in text classification across various domains with the adoption of updated BERT models [
23,
24].
In this study, we used a public dataset for Arabic meter classification, MetRec [
25], and evaluated it using several transformer and DL models. We employed the BERT base Arabic model (AraBERT), Arabic Efficiently Learning an Encoder that Classifies Token Replacements Accurately (AraELECTRA), Multi-dialect Arabic BERT (MARBERT), Modern Arabic BERT (ARBERT), Computational Approaches to Modeling Arabic BERT (CAMeLBERT), and Arabic-BERT for meter classification. In addition to transformer-based models, we used DL architectures such as Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Units (BiGRU). These networks are capable of capturing long-range dependencies in sequential data, allowing the classification model to learn both sequential and contextual features—an essential requirement for handling the intricacies of Arabic poetry.
We also assessed feature importance and model behavior using the Local Interpretable Model-agnostic Explanations (LIME) interpretability technique. The findings provide meaningful insights into the effectiveness of different architectures for Arabic poetry classification, supporting the development of more advanced NLP solutions for the Arabic language.
The major contributions of this study are as follows:
We compare and evaluate the performance of various pretrained transformer models—including Arabic-BERT, AraBERT, MARBERT, AraELECTRA, CAMeLBERT, and ARBERT—alongside BiLSTM and BiGRU DL models.
We evaluate the models on half-verse poems, tuning the DL models with different hidden layer configurations and the transformer models with different batch sizes.
We apply multiple encoding strategies, including pretrained tokenizers (WordPiece and SentencePiece) for transformer models and character-level encoding for DL models.
We investigate model behavior and feature significance using LIME to gain insights into the model’s decision-making processes.
We contribute to the expanding field of Arabic NLP by assessing the interpretability and applicability of different modeling strategies for poetry classification.
The remainder of this paper is structured as follows:
Section 2 presents a comprehensive literature review of transformer models and DL methodologies.
Section 3 outlines the proposed techniques and model architecture.
Section 4 describes the experimental results, followed by an in-depth discussion in
Section 5, which also highlights potential avenues for further research. Finally,
Section 6 concludes this study.
2. Literature Review
Deep learning methods have been extensively utilized in numerous studies to address Arabic meter classification. Notably, the work by Al-shaibani et al. [
26] demonstrated significant progress by modeling the sequential nature of poetic lines using bidirectional recurrent neural networks (RNNs). Their approach effectively captured rhythmic patterns in Arabic poetry, providing a strong foundation for further exploration using advanced models such as Transformers.
Several efforts have aimed to develop algorithms for identifying meters in contemporary Arabic poetry [
4]. Some approaches focus on detecting features of classical poetry, including rhythm, punctuation, and alignment; however, they merely differentiate poetic from non-poetic texts [
27]. Other techniques attempt to identify specific meters by converting verses into ‘Arud’ script using Mel Frequency Cepstral Coefficient and Linear Predictive Cepstral Coefficient features [
28].
Achieving performance levels of up to 75% accuracy, Berkani et al. [
29] emphasized that automated meter recognition requires intricate data processing. Their method involves syllable segmentation, phoneme conversion, and the alignment of the resulting patterns with predefined meter templates. Building on El-Katib’s approach, Khalaf et al. [30] made noteworthy progress in automating Arabic meter classification. Their method supports educators and students in understanding prosody and accurately identifying poetic meters, even in complex scenarios.
Albaddawi et al. [
31] achieved high accuracy in meter detection by employing a model architecture comprising an embedding layer, five Bi-LSTM layers, a softmax activation function, and an output layer. Similarly, Abandah et al. [
32] developed a neural network focused on enhancing diacritical understanding, which is crucial for improving both comprehension and meter recognition accuracy.
Talghalit et al. [
33] introduced a robust method that leverages contextual semantic embeddings and transformer models for Arabic biomedical search applications. Their methodology involved preprocessing biomedical text, generating vector representations, and fine-tuning several transformer variants—including BERT, AraBERT, Biomedical BERT, RoBERTa, and Distilled—using a specialized Arabic biomedical dataset. Their model achieved an F1-score of 93.35% and an accuracy of 93.31%. Despite this strong performance, the authors identified key limitations, including Arabic’s complex morphology and the scarcity of domain-specific datasets. They emphasized the necessity of enriched contextual embeddings and expanded datasets to further improve transformer model performance in Arabic biomedical NLP applications.
Combining AraBERT’s contextual comprehension with Long Short-Term Memory (LSTM) for sequence modeling, Alosaimi et al. [
34] proposed a sentiment analysis model. Its performance was benchmarked against traditional ML and DL models using various vectorization strategies across four Arabic benchmark datasets. Al-Onazi et al. [
13] applied LSTM and convolutional neural network models, optimized using the Hawks optimization method, to the MetRec dataset for meter classification, achieving an accuracy of 98% with precision and recall scores of 86.5%.
In sentiment analysis, ensemble learning—where multiple models are combined to enhance classification accuracy—has proven highly effective. Studies show that integrating multiple deep learning models can outperform individual architectures in Arabic sentiment analysis tasks [
35,
36]. For instance, Zhou et al. [
37] employed a hybrid model combining convolutional and LSTM networks, where the convolutional layers captured local features and LSTM layers captured global context.
AraGPT-2, the Arabic adaptation of Generative Pre-trained Transformer-2, aims to generate synthetic text to improve Arabic classification tasks. Refai et al. [
38] outlined a three-part methodology: applying similarity metrics to evaluate sentence quality, generating refined Arabic samples using AraGPT-2, and assessing sentiment classification using AraBERT. Their research also addressed class imbalance issues prevalent in Arabic sentiment datasets.
Al Deen et al. [
10] explored Arabic Natural Language Inference (NLI) using transformer models such as AraBERT and RoBERTa. The study introduced a novel pretraining strategy combining Named Entity Recognition, allowing for better identification of conflicts and entailments. Using a custom dataset derived from publicly available resources and linguistically enriched pretraining, AraBERT achieved an accuracy of 88.1%, outperforming RoBERTa enhanced with language-specific features.
In another study, Qarah et al. [
39] developed a linguistic model for Arabic poetry analysis based on a BERT architecture pretrained from scratch using an extensive Arabic poetry corpus. This model incorporated 768 hidden units, 10 encoder layers, and 12 attention heads per layer, along with a vocabulary of 50,000 words, enabling it to capture the full breadth of poetic forms and idiomatic expressions in Arabic literature.
A hybrid methodology integrating deep learning with text representation techniques has been proposed to enhance the classification of Arabic news articles [
9]. After comprehensive preprocessing steps including text cleaning, tokenization, lemmatization, and data augmentation, a proprietary attention embedding layer was employed to capture contextual relationships in the text. With data augmentation, the model surpassed the state-of-the-art Arabic language model AraBERTv2, achieving a classification accuracy of 97.69% [
40].
In a study by Alshammari et al. [
41], the authors introduced a novel AI text classifier designed to address the specific challenges in detecting AI-generated Arabic texts. This approach optimized two transformer models: the Cross-lingual Language Model and AraELECTRA [
42]. Achieving an accuracy of 81%, the proposed models outperformed existing AI detectors. CAMeLBERT, when trained on an Arabic poetry dataset, achieved a performance accuracy of 80.9% [
43].
Alzaidi et al. [
44] enhanced Arabic text representation by utilizing FastText word embeddings, followed by an attention-based BiGRU classification model. This approach included a complete preprocessing pipeline and incorporated hyperparameter tuning to efficiently explore the search space and improve model performance. Similarly, Badri et al. [
45] applied NLP techniques to detect hate speech in Arabic texts across various dialects, leveraging BERT-based pretrained models to capture the linguistic complexity of the Arabic language.
Al-Shaibani and Ahmad [
46] proposed a dotless Arabic text representation to reduce vocabulary size while maintaining comparable NLP performance. Their approach demonstrated up to a 50% reduction in vocabulary, with successful results across tasks such as language modeling and translation. Recent efforts in Arabic text-to-speech synthesis have also focused on enhancing emotional expressiveness, particularly in applications designed to assist visually impaired people [
47].
Transformer-based architectures have also been adapted for domain-specific applications. For instance, Wang et al. [
48] developed a graph transformer model for detecting fraudulent financial transactions. While their work targets the financial sector, it highlights the increasing trend of tailoring transformer models for intricate and domain-specific frameworks. Mustafa et al. [
49] explored the use of explainable AI methods—such as LIME and SHapley Additive exPlanations (SHAP)—to clarify model decision-making in Arabic linguistic applications. The integration of explainability tools is considered essential for broader adoption and trust in transformer-based models within domain-specific contexts.
CAMeLBERT, designed specifically for Arabic NLP, is capable of handling linguistic variations, including Classical Arabic (CA), Dialectal Arabic (DA), and Modern Standard Arabic (MSA). It has demonstrated strong performance in MSA tasks, particularly in emotion analysis and named entity recognition. Studies indicate that when paired with DL techniques, fine-tuned CAMeLBERT models outperform other models in sentiment analysis [
50], achieving high accuracy, recall, and F1-scores in named entity recognition [
51,
52]. Its capacity to manage morphological and linguistic diversity also makes it suitable for dialect detection, translation, and the processing of the CA literature—an essential step in applying NLP techniques to historical texts [
53]. Although BERT has gained prominence in NLP, studies [
54,
55,
56] suggest that its application to Arabic text classification remains limited. However, transformer models specifically trained on extensive Arabic corpora—such as CAMeLBERT, AraBERT, and MARBERT—have shown superior performance. This growing body of research not only reveals promising avenues for addressing current challenges but also highlights the potential for expanding the efficacy of Arabic language applications through the use of specialized transformer architectures.
3. Materials and Methods
The proposed study comprises several phases: data preprocessing, data splitting into training and testing sets, model implementation, evaluation using various metrics, model testing, and interpretability analysis using the LIME technique. The workflow is illustrated in
Figure 1.
3.1. Dataset and Train-Test Split
This study utilizes the MetRec Arabic poetry dataset, which comprises 55,440 verses classified across 14 meters [
26]. The symbol ‘#’ denotes the separation between the left and right parts of each verse. These halves are vertically concatenated to generate half-verse data, resulting in a total of 110,880 half-verses, as depicted in
Figure 2.
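As a concrete illustration, a minimal sketch of this half-verse construction is given below, assuming the dataset is loaded as a table with hypothetical 'verse' and 'meter' columns and a hypothetical file name (the actual MetRec release may use different names):

```python
import pandas as pd

# Hypothetical file and column names for illustration only.
df = pd.read_csv("metrec.csv")  # columns: 'verse', 'meter'

# Each verse contains '#' separating its two half-verses (sadr and ajuz).
halves = df["verse"].str.split("#", n=1, expand=True)

# Stack the two halves vertically so every half-verse becomes one sample,
# duplicating the meter label for each half.
half_df = pd.DataFrame({
    "text": pd.concat([halves[0], halves[1]], ignore_index=True).str.strip(),
    "meter": pd.concat([df["meter"], df["meter"]], ignore_index=True),
})
print(len(df), "verses ->", len(half_df), "half-verses")  # 55,440 -> 110,880
```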
The dataset exhibits moderate class imbalance: 10,000 half-verses are associated with each of the saree, kamel, mutakareb, ramal, baseet, taweel, wafer, and rajaz meters; 2524 with mutadarak; 8896 with munsareh; 2812 with madeed; 4244 with mujtath; 10,002 with khafeef; and 2402 with the hazaj meter.
Figure 3 presents a word cloud generated from the dataset, while
Figure 4 illustrates sample verses along with their associated meter labels.
After removing unwanted characters, non-Arabic letters, and punctuation, the cleaned data are split into training, validation, and testing sets using a 60:20:20 ratio. The training set contains 66,528 half-verses, while the validation and test sets each contain 22,176. The meter-wise distribution across these splits is shown in
Figure 5.
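A minimal sketch of such a 60:20:20 split with scikit-learn is shown below; the stratification and random seed are assumptions made so that each split preserves the meter distribution, and half_df refers to the half-verse table from the earlier sketch:

```python
from sklearn.model_selection import train_test_split

texts = half_df["text"].tolist()    # half-verse strings
labels = half_df["meter"].tolist()  # meter names

# Hold out 20% for testing, then split the remaining 80% into 60%/20% of the
# original data (0.25 of the remainder equals 20% overall).
train_val_texts, test_texts, train_val_labels, test_labels = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=42)
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_val_texts, train_val_labels, test_size=0.25,
    stratify=train_val_labels, random_state=42)
print(len(train_texts), len(val_texts), len(test_texts))  # 66528 22176 22176
```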
3.2. Data Preprocessing
The performance of the proposed models depends significantly on effective data preprocessing. Given the linguistic complexity of Arabic and differing model architectures, two preprocessing strategies were employed: character-level encoding for deep learning (DL) models and tokenization for transformer-based models. In both approaches, label encoding was applied to the target categories.
3.2.1. Tokenization
In natural language processing (NLP), tokenization—the process of breaking text into smaller components—is a crucial first step. Depending on the method, tokens may represent characters, subwords, or words. Among the various tokenization techniques, WordPiece and SentencePiece have become particularly popular for their adaptability to multiple languages and applications [
57].
WordPiece, developed for the BERT architecture, is a subword tokenization method [
58,
59]. It constructs a vocabulary from the characters in the text and identifies frequently co-occurring token pairs to merge into new tokens. This iterative process continues until either no additional pairs can be merged or a pre-defined vocabulary size is achieved, maximizing the likelihood of representing the training data efficiently.
SentencePiece, developed by Google, is another subword tokenization tool. Unlike WordPiece, it applies a data-driven, language-independent strategy by treating the input as a series of characters. SentencePiece supports Byte Pair Encoding (BPE) and Unigram Language Models to produce subword tokens suitable for various NLP applications [
57].
Table 1 summarizes the tokenization methods employed by the pretrained transformer models used in this study.
For the proposed study, we employed pretrained tokenizers alongside the corresponding models. The code snippet for invoking the tokenizer is shown in
Figure 6. The maximum sequence length was set to 60 and padding was enabled.
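A minimal sketch of such a tokenizer call is given below, assuming the Hugging Face AutoTokenizer API, with AraBERT’s aubmindlab/bert-base-arabertv2 checkpoint used purely as an example; each model in Table 1 ships its own pretrained tokenizer.

```python
from transformers import AutoTokenizer

# Example checkpoint; the corresponding tokenizer is loaded for each model.
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")

encodings = tokenizer(
    train_texts,           # list of half-verse strings
    max_length=60,         # maximum sequence length used in this study
    padding="max_length",  # pad every sequence to 60 tokens
    truncation=True,
    return_tensors="tf",
)
print(encodings["input_ids"].shape)  # (num_samples, 60)
```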
3.2.2. Character Encoding
Character encoding is the process of converting characters into a format that computers can easily access and manipulate. In this study, we addressed the complexity of the Arabic language—rich in morphological structures—by implementing character-level encoding. After removing special characters and spaces from the dataset, a vocabulary was created based on unique characters [
26]. Each character was then assigned an index to form a mapping dictionary. This mapping translates every text input into a sequence of numerical indices, thus preserving distinct characters. The maximum sequence length was derived from the longest text in the dataset to ensure consistency in input size through sequence padding. This encoding method is particularly effective for managing spelling variations and noisy text, as it retains character-level patterns. The code snippet for character encoding is depicted in
Figure 7.
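A minimal sketch of this character-level encoding is shown below, assuming the cleaned half-verse strings from the split above; reserving index 0 for padding is an implementation choice made here for illustration.

```python
import numpy as np

def build_char_vocab(texts):
    # Vocabulary of unique characters; index 0 is reserved for padding.
    chars = sorted(set("".join(texts)))
    char2idx = {ch: i + 1 for i, ch in enumerate(chars)}
    max_len = max(len(t) for t in texts)  # longest text fixes the sequence length
    return char2idx, max_len

def encode_chars(texts, char2idx, max_len):
    out = np.zeros((len(texts), max_len), dtype=np.int32)  # zero-padded sequences
    for row, text in enumerate(texts):
        for col, ch in enumerate(text[:max_len]):
            out[row, col] = char2idx.get(ch, 0)  # unseen characters map to padding
    return out

char2idx, max_len = build_char_vocab(train_texts)
X_train = encode_chars(train_texts, char2idx, max_len)
X_val = encode_chars(val_texts, char2idx, max_len)
X_test = encode_chars(test_texts, char2idx, max_len)
```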
3.2.3. Label Encoding
Label encoding is especially efficient when the target variable comprises discrete classes, as it assigns a unique integer to each category. In this study, we used label encoding to map each of the 14 meter classes to an integer ranging from 0 to 13, enabling faster processing during model training.
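One common way to perform this mapping is scikit-learn’s LabelEncoder, sketched below; whether this particular utility was used in the original implementation is not stated.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_labels)  # meter names -> integers 0..13
y_val = label_encoder.transform(val_labels)
y_test = label_encoder.transform(test_labels)
print(list(label_encoder.classes_))  # the 14 meter names
```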
3.3. Model Architectures
Several models were used for classifying Arabic meters, including ARBERT, MARBERT, CAMeLBERT, Arabic-BERT, AraELECTRA, BiGRU, and BiLSTM.
3.3.1. Transformer Models
Transformers are well suited to large-scale, high-performance tasks because they can process large volumes of data and model complex structures efficiently. The workflow of the transformer model used in this study is shown in
Figure 8.
AraBERT is a transformer model pretrained specifically for Arabic [
60]. It was developed to address the challenges of Arabic NLP, particularly the language’s intricate morphology and dialectal variations. AraBERT has demonstrated outstanding performance in various tasks, including sentiment analysis [
61]. Built on the original BERT architecture, AraBERT significantly enhances the precision and efficacy of Arabic NLP applications.
AraELECTRA is another transformer model adapted for the Arabic language [
42], extending the ELECTRA architecture. Unlike conventional masked language models, ELECTRA employs a discriminator network that distinguishes between real and replaced tokens generated by a generator. This approach improves training efficiency while maintaining performance.
CAMeLBERT is a robust Arabic transformer model designed to process different Arabic language forms, including dialectal Arabic, Classical Arabic, and Modern Standard Arabic (MSA) [
43]. Unlike AraBERT, CAMeLBERT is trained on a more diverse corpus, enabling it to capture stylistic nuances in poetry. Its BERT-based architecture is enhanced through extensive fine-tuning for better comprehension of poetic forms and morphological patterns.
MARBERT and ARBERT are advanced transformer models tailored for Arabic NLP tasks [
62]. These models extend the BERT architecture with a bidirectional attention mechanism, allowing them to evaluate context by examining both left and right surroundings. MARBERT was trained on a large corpus comprising various Arabic dialects and MSA, while ARBERT focused primarily on MSA.
Arabic-BERT further expands the BERT architecture by focusing on Arabic-specific linguistic features, particularly those of MSA [
63]. All the pretrained transformer models used in this study consist of 12 hidden layers and 12 attention heads in their architecture.
3.3.2. Deep Learning Models
Deep learning architectures such as BiLSTM and BiGRU are commonly used for handling sequential data, including natural language processing, time-series forecasting, and speech recognition. Both methods effectively capture long-term dependencies and mitigate the vanishing gradient issue inherent in conventional RNNs.
BiLSTM (Bidirectional Long Short-Term Memory) enhances contextual understanding by processing input sequences in both forward and reverse directions [
14,
64]. The LSTM model includes three main gates: input, output, and forget gates.
$$f_t = \sigma(v_f I_t + w_f L_{t-1} + a_f) \tag{1}$$
where $f_t$ is the forget gate at time step $t$, $\sigma$ is the sigmoid activation function, $v_f$ and $w_f$ denote the weight matrices for $f_t$, $I_t$ signifies the data input, $a_f$ is the error vector, and $L_{t-1}$ denotes the hidden state from the previous time step.
$$i_t = \sigma(v_i I_t + w_i L_{t-1} + a_i) \tag{2}$$
where $i_t$ is the input gate, $v_i$ and $w_i$ denote the weight matrices for $i_t$, and $a_i$ is the error vector.
$$\tilde{C}_t = \tanh(v_c I_t + w_c L_{t-1} + a_c) \tag{3}$$
where $\tilde{C}_t$ is the candidate memory cell, $a_c$ is the error vector, and $v_c$ and $w_c$ are the associated weight matrices.
$$CS_t = f_t \odot CS_{t-1} + i_t \odot \tilde{C}_t \tag{4}$$
where $CS_t$ is the current cell state, $CS_{t-1}$ represents the memory state from the previous time step, and $f_t$, $i_t$, and $\tilde{C}_t$ are calculated from Equations (1), (2), and (3), respectively.
$$O_t = \sigma(v_o I_t + w_o L_{t-1} + a_o) \tag{5}$$
where $O_t$ is the output gate, $v_o$ and $w_o$ denote the weight matrices for $O_t$, and $a_o$ is the error vector.
$$L_t = O_t \odot \tanh(CS_t) \tag{6}$$
where $L_t$ is the hidden state at time $t$, $O_t$ is calculated from Equation (5), and $CS_t$ from Equation (4). The BiLSTM analyzes the input sequence in both directions:
$$L_t = \overrightarrow{L_t} \circledcirc \overleftarrow{L_t} \tag{7}$$
where $\overrightarrow{L_t}$ is the forward output and $\overleftarrow{L_t}$ is the backward output of the LSTM. The symbol $\circledcirc$ denotes an operation such as multiplication, averaging, summation, or concatenation.
In contrast, the BiGRU is a simplified variant of LSTM, with fewer parameters and enhanced computational efficiency. Like the BiLSTM, it processes sequences bidirectionally [
26]. It includes two gates: the reset gate and the update gate.
$$RS_t = \sigma(v_r I_t + w_r G_{t-1} + a_r) \tag{8}$$
where $RS_t$ is the reset gate, $v_r$ and $w_r$ are the weight matrices for $RS_t$, $G_{t-1}$ is the hidden state from the previous time step, and $a_r$ is the bias vector.
$$P_t = \sigma(v_p I_t + w_p G_{t-1} + a_p) \tag{9}$$
where $P_t$ is the update gate, $v_p$ and $w_p$ denote the weight matrices, and $a_p$ is the error vector.
$$\tilde{G}_t = \tanh\big(v_h I_t + w_h (RS_t \odot G_{t-1}) + a_h\big) \tag{10}$$
where $\tilde{G}_t$ is the new (candidate) hidden state, $a_h$ is the error vector, and $v_h$ and $w_h$ are the weight matrices. $RS_t$ is obtained from Equation (8).
$$G_t = (1 - P_t) \odot G_{t-1} + P_t \odot \tilde{G}_t \tag{11}$$
where $G_t$ is the final updated hidden state, $P_t$ is calculated from Equation (9), and $\tilde{G}_t$ is from Equation (10).
$$G_t = \overrightarrow{G_t} \circledcirc \overleftarrow{G_t} \tag{12}$$
where $\overrightarrow{G_t}$ is the forward output and $\overleftarrow{G_t}$ is the backward output of the GRU. The symbol $\circledcirc$ denotes an operation such as multiplication, averaging, summation, or concatenation.
The BiLSTM and BiGRU model architecture is illustrated in
Figure 9. The approach begins by converting a character sequence into integer indices using the character encoder, where each index corresponds to a specific vocabulary character. The input layer is shaped according to the training dataset. The embedding layer transforms input tokens into dense vectors of a predetermined size, set to 128 in this study. The BiGRU or BiLSTM layer then processes the sequence bidirectionally, capturing dependencies from past and future tokens. If the number of layers exceeds one, the return_sequences parameter is set to True. The dimensionality of the bidirectional layer is fixed at 256. To mitigate overfitting, a dropout layer with a rate of 0.2 is employed. A fully connected (dense) layer follows, mapping to the number of output classes—14 in this case. The Softmax function is used as the activation function, and the final prediction corresponds to the category with the highest probability in the Softmax output.
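A minimal Keras sketch of this architecture follows; the 128 recurrent units per direction are an assumption chosen so that the bidirectional output is 256-dimensional, as stated above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_rnn_model(vocab_size, max_len, rnn="bilstm", num_layers=2):
    # Character indices -> embedding -> bidirectional recurrent layers -> dropout -> softmax.
    inputs = layers.Input(shape=(max_len,))
    x = layers.Embedding(input_dim=vocab_size + 1, output_dim=128)(inputs)
    cell = layers.LSTM if rnn == "bilstm" else layers.GRU
    for i in range(num_layers):
        # Intermediate layers return full sequences; 128 units per direction
        # gives a 256-dimensional bidirectional output.
        x = layers.Bidirectional(
            cell(128, return_sequences=(i < num_layers - 1)))(x)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(14, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_rnn_model(vocab_size=len(char2idx), max_len=max_len, rnn="bilstm")
```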
3.4. Model Evaluation and Parameter Tuning
The system environment used in this study is as follows:
Windows 10 operating system
16 GB RAM
Intel Core i7 processor
GPUs—Nvidia GeForce GTX 1080 Ti
Deep learning libraries: TensorFlow 2.7, Transformers 4.48.1
Additional evaluation libraries: Scikit-learn 1.0, PyArabic 0.6.14, lime 0.2.0.1
The models were compiled prior to training. The Adaptive Moment Estimation (Adam) optimizer was employed due to its memory efficiency and notable effectiveness [
65]. The loss function used was sparse categorical cross-entropy, which is appropriate for integer-encoded class labels.
The tuning parameters were model-specific. Deep learning (DL) models were optimized by varying the number of hidden layers, whereas the transformer models were tuned based on batch size. This study employed EarlyStopping and ReduceLROnPlateau as regularization strategies to mitigate overfitting. Training ceased when the model achieved optimal performance based on the monitored parameters, before overfitting could occur.
The EarlyStopping callback monitored the validation loss and stopped training if the loss remained constant or increased over eight consecutive epochs. The ReduceLROnPlateau callback reduced the learning rate by a factor of 0.1 if the loss value remained unchanged over two epochs.
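A minimal sketch of this training setup is shown below for the character-encoded DL models; the transformer models are compiled and fitted analogously with their tokenized inputs and a model-specific batch size. The epoch count, batch size, and restore_best_weights flag are illustrative assumptions (the actual values appear in Table 2).

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

model.compile(
    optimizer="adam",                        # Adaptive Moment Estimation
    loss="sparse_categorical_crossentropy",  # integer labels 0..13
    metrics=["accuracy"],
)

callbacks = [
    # Stop if the validation loss fails to improve for 8 consecutive epochs.
    EarlyStopping(monitor="val_loss", patience=8, restore_best_weights=True),
    # Multiply the learning rate by 0.1 if the loss plateaus for 2 epochs.
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=2),
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=50, batch_size=64,  # illustrative values; see Table 2
    callbacks=callbacks,
)
```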
Table 2 presents the parameters used in this study.
3.5. Performance Evaluation
The models were evaluated using the test dataset. Performance assessment included analyzing the confusion matrix and classification report for each model. For multi-class classification, this study evaluated standard metrics such as accuracy, F1-score, recall, and precision. The formulas for each metric are provided below.
A classification is considered accurate when a verse is correctly categorized into its respective meter (true positive, Tpos). For instance, if a verse written in the saree meter is correctly predicted as saree, it constitutes a true positive. A verse that deviates from a specified meter and is accurately recognized as not conforming to that meter is a true negative (Tneg). Conversely, if a verse from one meter is misclassified as belonging to another, this results in a false positive (Fpos) for the predicted meter and a false negative (Fneg) for the actual meter. For instance, if a kamel meter verse is classified as madeed, it counts as an Fpos for madeed and an Fneg for kamel.
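In this notation, the metrics take their standard forms:
$$\mathrm{Accuracy} = \frac{T_{pos} + T_{neg}}{T_{pos} + T_{neg} + F_{pos} + F_{neg}}$$
$$\mathrm{Precision} = \frac{T_{pos}}{T_{pos} + F_{pos}} \qquad \mathrm{Recall} = \frac{T_{pos}}{T_{pos} + F_{neg}}$$
$$\text{F1-score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$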
3.6. Explainability with LIME
LIME (Local Interpretable Model-Agnostic Explanations) is an interpretability technique that uses a locally approximated interpretable model to explain the predictions of complex models [
66]. It can highlight the words or characters that significantly influence the categorization decision in the classification model. LIME approximates the model’s decision in a relatively small area surrounding a specific instance by perturbing the input data and tracking how the model’s prediction changes.
In the context of Arabic meter classification, LIME facilitates the identification of specific words or characters that strongly influence the model’s prediction. This helps to interpret and validate the patterns recognized by the transformer models. The final training parameters for the model are described in
Table 3. The tuning parameters are mentioned in
Table 2.
5. Discussion
This study analyzed various models, including pretrained transformer models and bidirectional deep learning (DL) models, using half-verse data comprising 14 meters. Among the tested models, CAMeLBERT achieved the highest accuracy of 90.62%, marginally outperforming the BiLSTM model, which reached 90.53%.
Tokenizers were also evaluated using the BiLSTM model. Specifically, SentencePiece (AraBERT) and WordPiece (CAMeLBERT) tokenizers were assessed. The model was tested with two layers, as depicted in
Table 8. However, transformer tokenizers did not yield significant performance improvements with BiLSTM. Consequently, character encoding was preferred for the DL models.
BiLSTM leverages the benefits of long-term memory and enhances traditional LSTM by processing sequences bidirectionally [
67]. Compared to GRUs, LSTMs with memory cells are more effective at capturing long-range dependencies. Al-Shathry [
68] conducted further research using a balanced dataset of full-verse poems. By randomly selecting 1000 verses for each of the 14 m, their BiGRU model achieved 98.6% accuracy and scores of 0.90 for recall, precision, and F1-score.
Transformer models have notably advanced NLP tasks, including Arabic meter categorization. These models use deep contextual embeddings to identify intricate linguistic patterns specific to Arabic [
34,
69]. Although CAMeLBERT was trained on diverse datasets—including one related to poetry—it achieved only 80.9% accuracy on that set [
43]. The exceptional performance of transformer models lies in their self-attention mechanism, which allows for more efficient modeling of long-range dependencies, outperforming recurrent networks such as BiLSTM and BiGRU. They are particularly adept at capturing complex linguistic characteristics [
55].
Arabic poetry is often governed by systematic metrical structures, which can be effectively represented using half-verses. Analyzing half-verses preserves essential rhythmic and structural features.
Table 9 presents a comparative analysis of this study with previous research using the same half-verse dataset.
Model explainability using LIME is visualized in
Figure 12,
Figure 13 and
Figure 14 for three representative sample texts. LIME highlights the words that most influence the model’s prediction by underlining them in different colors. These visualizations demonstrate how particular syllables or phrases contribute to the identification of a specific meter, along with the associated probabilities for the top predicted meters.
5.1. Practical Implications
The findings of this study offer valuable practical applications for Arabic poetry analysis and AI-driven literary tools. The high accuracy achieved using half-verse data shows that meter classification can be effectively automated without requiring full verses, enhancing its real-world applicability. Moreover, the ability of DL and transformer models to learn metrical patterns supports the development of educational technologies that allow researchers and students to rapidly analyze and classify Arabic poetic meters.
This research contributes to Arabic NLP, particularly in tasks that demand rhythmic-sensitive analysis. Demonstrating that half-verses suffice for precise classification reduces processing demands and broadens access to AI-driven solutions in Arabic literary and language technologies.
5.2. Limitations and Future Work
While the proposed study employs half-verse data to achieve a high level of accuracy in Arabic meter classification, it has certain limitations. One restriction is its reliance on a specific dataset, which may not adequately represent the diversity of Arabic poetry. Future research should expand the dataset to encompass a wider diversity of poetic styles and emotions to assess the generalizability of the models.
Additionally, exploring multimodal approaches that combine textual and audio features could provide deeper insights into Arabic poetry. Such interdisciplinary research would bridge computational linguistics, phonetics, and literary analysis, potentially leading to more nuanced and human-like poetic classification systems.
6. Conclusions
This study explored Arabic meter categorization using various deep learning and transformer models, including AraBERT, AraELECTRA, Arabic-BERT, ARBERT, MARBERT, CAMeLBERT, BiGRU, and BiLSTM models. To the best of our knowledge, no prior research has evaluated this combination of methodologies, making our contribution novel in the field.
By applying these methodologies, we introduced new perspectives on processing and categorizing Arabic poetic meters. CAMeLBERT achieved the highest accuracy (90.62%), closely followed by BiLSTM (90.53%). These results underscore the effectiveness of transformer models in capturing the complex metrical structures in Arabic poetry, marking a significant advancement and opening new avenues for Arabic NLP research.