Bangla Character Detection Using Enhanced YOLOv11 Models: A Deep Learning Approach

Aktar, Mahbuba; Islam, Nur; Yang, Chaoyu

doi:10.3390/app15116326

Open AccessArticle

Bangla Character Detection Using Enhanced YOLOv11 Models: A Deep Learning Approach

by

Mahbuba Aktar

¹

,

Nur Islam

²

and

Chaoyu Yang

^1,*

¹

School of Artificial Intelligence, Anhui University of Science and Technology, Huainan 232001, China

²

Information and Communication Engineering, Hohai University, Changzhou 213200, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(11), 6326; https://doi.org/10.3390/app15116326

Submission received: 24 April 2025 / Revised: 24 May 2025 / Accepted: 29 May 2025 / Published: 4 June 2025

Download

Browse Figures

Versions Notes

Abstract

Recognising the Bangla alphabet remains a significant challenge within the fields of computational linguistics and artificial intelligence, primarily due to the script’s inherent structural complexity and wide variability in writing styles. The Bangla script is characterised by intricate ligatures, overlapping diacritics, and visually similar graphemes, all of which complicate automated recognition tasks. Despite ongoing advancements in deep learning (DL), machine learning (ML), and image processing (IP), accurately identifying Bangla characters continues to be a demanding and unresolved issue. A key limitation lies in the absence of robust detection frameworks capable of accommodating the script’s complex visual patterns and nuances. To address this gap, we propose an enhanced object detection model based on the YOLOv11 architecture, incorporating a ResNet50 backbone for improved feature extraction. The YOLOv11 framework is particularly effective in capturing discriminative features from input images, enabling real-time detection with high precision. This is especially beneficial in overcoming challenges such as character overlap and stylistic diversity, which often hinder conventional recognition techniques. Our approach was evaluated on a custom dataset comprising 50 primary Bangla characters (including vowels and consonants) along with 10 numerical digits. The proposed model achieved a recognition confidence of 99.9%, markedly outperforming existing methods in terms of accuracy and robustness. This work underscores the potential of single-shot detection models for the recognition of complex scripts such as Bangla. Beyond its technical contributions, the model has practical implications in areas including the digitisation of historical documents, the development of educational tools, and the advancement of inclusive multilingual technologies. By effectively addressing the unique challenges posed by the Bangla script, this research contributes meaningfully to both computational linguistics and the preservation of linguistic heritage.

Keywords:

Bangla alphabet; Bangla character; YOLOv11; deep learning; ResNet50

1. Introduction

The Bangla characters, also known as the Bengali script (Bengali: Bangla bôrṇômala), represent a writing system of profound cultural and historical importance. Spoken by over 278 million people, approximately

3.4 %

of the world’s largest population, it is ranked seventh globally as of 2022 [1]. Bengali is the primary language in Bangladesh and is widely spoken in the Indian states of West Bengal, Assam, and Tripura [2]. The script traces its origins to the ancient Brahmic family of scripts, from which many South Asian writing systems, including Devanagari (used for Sanskrit and Hindi), evolved. Over the centuries, the Bengali script developed into a distinctive system, diverging from closely related scripts such as Oriya and Assamese through unique stylistic and structural adaptations.

Due to its flexibility and expressive capacity, the Bengali script has historically been employed not only for writing the Bengali language but also for transcribing Sanskrit texts in the Bengal region. As one of the most widely used writing systems globally, it plays a vital role in regional communication, education, and the preservation of cultural heritage.

The script comprises vowels, consonants, diacritical marks, conjunct consonants, numerals, and punctuation. It includes eleven vowel graphemes (swôrôbôrnô), which represent six basic vowel sounds and two diphthongs, used in both Bengali and Assamese. One of the key complexities of the script lies in conjunct consonants (juktakkhôr), where clusters of up to four consonants merge to form visually complex shapes, significantly complicating segmentation and recognition. The Bengali numeral system is based on a positional base-10 structure and uses 10 distinct digits (0–9) for numeric representation. Accurate identification and classification of these diverse components require sophisticated recognition systems capable of handling the script’s high visual density and structural complexity.

Deep learning-based character recognition in Bengali presents unique challenges due to the script’s structural complexity and distinctive visual features. While traditional optical character recognition (OCR) techniques have proven effective for scripts with simpler character sets, they often perform suboptimally when applied to the Bengali script. This limitation arises from the presence of visually similar graphemes with subtle structural differences, as well as complex conjunct consonants and overlapping diacritical marks. These features complicate segmentation and accurate classification in conventional OCR systems.

The goal of this research is to develop a robust deep learning model that can effectively recognize and classify individual printed Bangla characters. Rather than relying on the conventional pipeline involving separate stages like segmentation, handcrafted feature extraction, and classification, this study adopts a more streamlined approach using a single-shot object detection method based on the YOLO (You Only Look Once) framework.

We specifically focus on an improved version of YOLOv11, which is designed for high-speed inference and strong recognition accuracy. This upgraded architecture is tailored to handle the unique visual complexity of Bangla characters, many of which share subtle structural similarities. The system processes the entire image in one forward pass, enabling real-time recognition with lower computational cost.

The architecture of our proposed model, as depicted in the diagram, is organized into three main sections: the backbone, neck, and detection head.

At the core of the backbone is ResNet50, a deep residual network responsible for extracting rich feature representations from input images. This backbone is particularly effective in maintaining performance across deeper layers by mitigating the vanishing gradient problem through shortcut connections.

Following the backbone, a Spatial Pyramid Pooling Fast (SPPF) module is used to enhance receptive fields and capture features at multiple scales. This is further augmented by the C2PSA block, which integrates self-attention mechanisms to strengthen both spatial and channel-wise feature learning. As shown in the left portion of the diagram, the C2PSA module uses a series of convolutional layers and PSA blocks, followed by concatenation, to enrich the feature maps before passing them forward.

The neck of the network utilizes several C3k2 modules, which are variations of convolutional blocks designed to either use grouped convolutions (C3k = True) or bottleneck structures (C3k = False), depending on the configuration. These are detailed in the top-right section of the diagram. These blocks help in refining and combining features across different scales through concatenation and upsampling layers.

Finally, the model outputs predictions through the 11 Detect head, which operates on three different scales. As illustrated in the bottom-right corner of the diagram, the detection head includes a series of convolutional and depthwise convolution layers with different kernel sizes (3 × 3 and 1 × 1). These layers handle the tasks of bounding box regression, object classification, and loss calculation for each predicted region.

To validate the effectiveness of our model, we conduct a comparative evaluation against YOLOv5 using a custom dataset of annotated Bangla characters. The results focus on detection accuracy, precision, recall, and inference speed.

Overall, this study shows that the enhanced YOLOv11 model, with its modified backbone, attention mechanisms, and multi-scale detection heads, is well suited for recognizing Bangla script. The proposed approach offers a fast and accurate solution, which could be valuable for applications like educational tools, digital archiving, and broader multilingual OCR systems.

2. Related Work

YOLO has proven efficient in several real-time object identification applications, including word and character recognition, owing to its high speed and precision. It has been frequently utilized in languages that pose issues comparable to Bangla, including Arabic, Chinese, and Hindi.

2.1. Character Recognition in Different Characters Using YOLO

2.1.1. YOLO for Chinese Character Recognition in Ancient Books

Recognising rare handwritten Chinese characters in historical manuscripts presents considerable challenges due to the complexity of ancient scripts and the scarcity of annotated data. A recent study addressed this by constructing a custom dataset containing 290 character classes and 71,277 annotated instances, sourced from the Guofeng chapter of the Book of Songs [3]. The YOLOv7 model was employed, with performance enhancements achieved through preprocessing techniques such as grayscale conversion, binarization, and refinements to bounding box annotation layouts. Additionally, an Efficient Channel Attention (ECA) module was integrated into the YOLOv7 architecture to reduce computational complexity without sacrificing accuracy. Experimental results demonstrated that YOLOv7 effectively manages the nuanced visual structures of ancient Chinese scripts. Future research directions include expanding the dataset, enhancing real-time detection capabilities, and exploring cross-linguistic applications to further improve model adaptability and performance.

2.1.2. YOLO for Handwritten Bangla Character Recognition

This line of research focuses on the recognition of Bangla characters using the BanglaLekha dataset, which comprises 84 distinct classes, including 50 basic characters, 10 digits, and 23 compound characters. A YOLOv5-based convolutional neural network (CNN) model was trained to perform character detection and recognition with high accuracy. The methodology encompasses data preprocessing, model training, and classification. Prior studies have shown substantial progress in Bangla handwriting recognition: ref. [4] achieved 98.8% accuracy in numeral recognition, ref. [5] examined Bangla handwritten characters and showed recognition accuracies of 99.50% for Bangla digits, 93.18% for vowels, 90.00% for consonants, and 92.25% for combined classes, and ref. [6] attained 98% for digits. CNN architectures consistently outperform traditional approaches, and recent developments such as 3D CNNs and residual networks have further enhanced recognition capabilities. Additionally, knowledge distillation techniques have been explored to improve accuracy, particularly for ambiguous or poorly written characters. The overarching goal is to develop robust systems capable of converting handwritten Bangla text into digital form, thereby advancing automation in handwriting recognition and document digitization.

2.1.3. YOLO for Greek Papyri Character Recognition

This study advances the recognition of characters in Greek papyri manuscripts by integrating modern object detection and representation learning methods. The authors utilize a dataset of 153 annotated training images sourced from 108 manuscripts, along with 34 external test images from the Oxyrhynchus Papyri and AL-PUB collections. The detection framework is based on YOLOv8 for real-time recognition, while prediction refinement is achieved using a DeiT (Data-efficient Image Transformer) model. A ResNet-50 network, pre-trained with SimCLR, extracts high-quality feature representations from limited labeled data. To improve localization accuracy, the authors employ Weighted Box Fusion (WBF) with hard majority voting. Building on prior work in historical document analysis [7,8,9], the study proposes a robust pipeline for both character recognition and author identification. From this research, we identify several areas for future investigation, including improving recognition accuracy for deteriorated texts, refining paleographic dating methods, and enabling scalable transcription. Key areas of emphasis include expanding the dataset to encompass a broader variety of manuscript types, fostering collaboration with historians to improve transcription precision, and integrating linguistic metadata to aid in long-term preservation and support extensive scholarly analysis.

2.1.4. YOLO for Kuzushiji and Old Japanese Script Recognition

This research applies cutting-edge machine learning methods to recognize Kuzushiji, an archaic Japanese cursive script used in historical texts. Due to limited annotated data, the study leverages Generative Adversarial Networks (GANs) to augment the dataset with synthetic samples, significantly enhancing model performance. A decoupled YOLOv4 architecture is employed for detection, with non-maximum suppression (NMS) used to improve prediction accuracy. Training strategies are optimized to ensure class balance and effective learning. The study also compares different GAN variants DCGAN, LSGAN, and WGAN for their effectiveness in data augmentation. Results indicate that GAN-based augmentation substantially improves detection and classification outcomes in the context of complex scripts. This work builds on earlier applications of YOLO for historical text analysis and underscores the synergy of GANs and YOLO in tackling low-resource and high-complexity character recognition challenges [10].

Although YOLO-based methods have been successfully applied to character recognition tasks across a variety of languages, including Chinese, Bangla, Greek, and Japanese, many of these approaches face similar challenges. These include difficulties in handling the complex shapes and variations of different scripts, limited performance on noisy or low-quality images, and a strong reliance on large amounts of annotated training data. In many cases, the models used are off-the-shelf YOLO variants with minimal adjustments for the unique traits of non-Latin scripts. For Bangla, in particular, the presence of visually similar letters and compound characters adds an extra layer of difficulty. To address these issues, we propose a refined YOLOv11 architecture incorporating compound scaling, C3k2 and C2PSA modules, and a training pipeline tailored to the intricacies of Bangla script. These improvements significantly boost recognition accuracy, especially for complex and visually similar characters, across a range of testing scenarios.

2.2. Different Bangla Alphabet or Character Datasets Using Different Models

Among the various techniques employed for Bangla alphabet recognition shown as Table 1, our approach, utilizing ResNet-50 with transfer learning, demonstrates exceptional performance in terms of both accuracy and versatility. Achieving an accuracy of 96.12% on the Banglalekha-isolated dataset, which includes 84 distinct classes, our model significantly outperforms other methods that either use fewer classes or report lower testing accuracies.

For instance, although Sifat et al. [15] achieved a 99% accuracy, their model was tested on only 50 classes, thus reducing the complexity of the classification task. Similarly, while Promila et al. [16] reported high accuracy (96% for character recognition and 91% for word recognition), their results were split across different recognition tasks, making direct comparisons less straightforward. The integration of ResNet-50 with transfer learning in our model facilitates more effective feature extraction and enhances its ability to generalize across datasets with diverse character variations. As such, our model stands as a more robust and effective solution for Bangla character recognition compared to the other approaches examined.

3. Methodology of the Proposed Bangla Alphabet Detection Algorithm

The Bangla script presents significant challenges for visual recognition tasks due to its intricate character structures, including ornate curves and distinctive diacritical marks such as the matra, a horizontal line that frequently appears atop many characters. These stylistic complexities often result in visual similarities between characters, making accurate recognition difficult, even for human observers. To address these challenges, this study introduces an enhanced version of the YOLOv11 object detection framework, which incorporates a ResNet50 backbone and an attention mechanism to improve feature extraction and recognition accuracy.

3.1. Overview of YOLOv11 and ResNet50

3.1.1. YOLOv11

YOLO (You Only Look Once) is a single-stage object detection algorithm that reformulates object localization and classification as a unified regression problem. Unlike two-stage detectors, YOLO performs detection in a single forward pass, enabling high-speed, real-time inference. The YOLOv11 model used in this research represents an evolved and customized version of the original framework, adapted explicitly for fine-grained character detection.

Model Architecture: YOLOv11 segments the input image into a

S \times S

grid, with each grid cell predicting bounding boxes and associated class probabilities. Each cell is responsible for detecting objects whose centers fall within its region. This design facilitates rapid and efficient detection, which is particularly important for real-time applications such as live Bangla script recognition.

Feature Extraction and Enhancements: The architecture incorporates multi-scale feature maps to effectively detect characters of various sizes and orientations. Improvements over previous YOLO versions include enhanced anchor box prediction mechanisms and refined feature aggregation strategies. These enhancements significantly boost performance in detecting small, overlapping, and visually similar characters, which are common in the Bangla script.

3.1.2. ResNet50 Integration

ResNet50 is integrated as the backbone of the YOLOv11 framework to further strengthen the model’s capacity to recognize complex visual features. ResNet50 employs residual learning to address the vanishing gradient problem, allowing for deeper architectures without compromising gradient flow.

Residual Learning and Spatial Hierarchy: The ResNet50 architecture consists of multiple convolutional and identity blocks interconnected through skip connections. This structure preserves spatial hierarchies and facilitates the extraction of fine-grained visual patterns essential for distinguishing nuanced character shapes.

Transfer Learning: The model is initialized using ResNet50 pre-trained on the ImageNet dataset. These pre-trained weights are fine-tuned on our custom Bangla alphabet dataset, enabling the model to leverage robust low- and mid-level visual features. This transfer learning strategy accelerates convergence and improves recognition accuracy, particularly when training data are limited.

3.2. Improvements in YOLOv11 Network Structure

As illustrated in Figure 1, the proposed YOLOv11 architecture incorporates an attention mechanism in conjunction with the ResNet50 backbone. The attention module enhances the model’s ability to selectively focus on salient regions of interest within the input image, thereby refining feature maps before object detection. This is particularly beneficial in distinguishing visually similar Bangla characters with subtle variations.

The resulting hybrid model offers an optimal trade-off between detection accuracy and computational efficiency. Its robust architecture enables real-time performance while maintaining high precision, making it well-suited for practical Bangla character recognition tasks in educational, archival, and OCR-related applications.

3.3. Mathematical Formulation

The object detection problem in our proposed framework is formulated as a regression task, where the goal is to predict both the location and category of objects within an input image I. The image is divided into an

S \times S

grid, and each grid cell

(i, j)

is responsible for detecting objects whose centers fall within that cell.

For each grid cell, the model predicts a bounding box and an associated confidence score. The bounding box is parameterized by its center coordinates

(x, y)

, width w, and height h. At the same time, the confidence score reflects the probability of object presence and the quality of the predicted bounding box:

{\hat{B}}_{i, j} = (x_{i, j}, y_{i, j}, w_{i, j}, h_{i, j}), C_{i, j} = P_{i, j} \cdot IoU (B_{i, j}, GT)

Here,

${\hat{B}}_{i, j}$ is the predicted bounding box for cell $(i, j)$ .
$P_{i, j}$ is the predicted probability that an object exists in a cell $(i, j)$ .
$IoU (B_{i, j}, GT)$ is the Intersection over Union between the predicted box $B_{i, j}$ and the ground-truth (GT) box.

Loss Function: The total loss function used to train the model combines both localization and classification components:

L = L_{b b o x} + L_{c l a s s}

where

$L_{b b o x}$ is the localization loss, typically calculated using Mean Squared Error (MSE), which penalizes deviations in the predicted bounding box coordinates.
$L_{c l a s s}$ is the classification loss, typically computed using Binary Cross-Entropy (BCE), which penalizes incorrect class predictions.

Residual Learning in ResNet50: ResNet50 improves feature extraction through residual learning. Each residual block is formulated

F (x) = H (x) + x

where

x is the input to the residual block.
$H (x)$ represents the transformation learned by a series of convolutional layers.
$F (x)$ is the output after adding the residual connection.

This identity mapping ensures better gradient flow and stabilizes training in deeper networks, making it especially effective for capturing the fine-grained structures present in Bangla characters.

4. Analysis and Findings from Experiments

4.1. Experimental Setup and Parameter Configuration

The hardware and software environment used for training and evaluating the proposed Bangla alphabet detection model is outlined in Table 2. The setup is optimized for deep learning tasks, ensuring efficient training and inference.

4.2. Building the Dataset

A comprehensive printed image dataset was developed for Bangla character detection, including a total of 60 unique classes: 11 vowels (Table 3), 39 consonants (Table 4), 10 numeric digits (Table 5), and 24 compound characters. Each class consists of 500 annotated samples, resulting in a total of 35,265 labeled images, which are divided into training (70%), validation (20%), and testing (10%) subsets.

The dataset includes variations in font style and visual complexity, as illustrated in Figure 2. Several characters are visually similar, which poses a challenge for detection models; examples of such cases are listed in Table 6.

Manual annotation was performed using the LabelImg tool, with each character assigned a unique class ID to ensure precise localization and classification. To enhance generalization, various data augmentation techniques were applied, including brightness adjustment, horizontal flipping, scaling, and rotation. These augmentations expose the model to a wider range of visual variations, thereby improving robustness and accuracy.

Training was conducted using the default YOLO parameters, such as input resolution and anchor box dimensions. This ensured consistent performance evaluation while maintaining a balance between training time and detection accuracy, ultimately supporting reliable real-world deployment.

4.3. Evaluation Overview

Preprocessing transforms raw images into a standardized format, ensuring compatibility with the input requirements of the detection model. Due to varying font styles, the size and dimensions of characters differ across images [18].

The proposed detection system performs dual tasks:

Localizing objects by predicting bounding boxes.
Classifying them into appropriate categories.

To evaluate the combined effectiveness of these tasks, we employ Mean Average Precision (mAP), a widely used metric in object detection that integrates both localization and classification performance.

The Intersection over Union (IoU) metric assesses the overlap between the predicted bounding box and the ground-truth bounding box. It is defined as the ratio of the intersection area to the union area of the two boxes:

IoU (P, Q) = \frac{Area (P \cap Q)}{Area (P \cup Q)}

(1)

A prediction is considered a true positive (TP) if its IoU with the ground-truth box exceeds a predefined threshold (0.5); otherwise, it is labeled a false positive (FP). Using this criterion, precision is computed as follows:

Precision = \frac{TP}{TP + FP}

(2)

To obtain a comprehensive evaluation, we calculate AP at multiple IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. The mean Average Precision (mAP) over all C classes is then calculated:

mAP = \frac{1}{C} \sum_{i = 1}^{C} A P_{i}

(3)

4.3.1. Additional Metrics

To supplement mAP, we also report the following:

Accuracy: the ratio of correct predictions to total predictions.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(4)

Recall: the proportion of actual positives correctly identified by the model.

Recall = \frac{TP}{TP + FN}

(5)

F1 Score: the harmonic mean of precision and recall, providing a balanced measure.

F 1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(6)

4.3.2. Average Precision (AP)

AP is computed as the area under the precision–recall (P-R) curve. The interpolated precision is used to ensure that the precision at each recall level is maximized over all higher recall levels. The AP calculation is given as follows:

AP = \sum_{r = 0}^{1} (r_{n + 1} - r_{n}) \cdot P_{interp} (r_{n + 1})

(7)

where

r_{n}

and

r_{n + 1}

are consecutive recall values, and

P_{interp} (r_{n + 1})

is defined as

P_{interp} (r_{n + 1}) = max_{\tilde{r} \geq r_{n + 1}} P (\tilde{r})

(8)

This formulation ensures that the precision–recall curve is non-increasing, improving robustness in performance measurement.

4.4. Result Analysis

This application is designed to recognize significant Bangla letters and common phrases encountered in daily life. It utilizes both the original YOLOv11 and an improved YOLOv11 model. The enhanced YOLOv11 version was employed to train the algorithm for 150 epochs using annotations formatted according to the YOLO standard. The model’s performance was evaluated using precision (P), recall (R), and mean average precision (mAP), with the Intersection over Union (IoU) metric ranging from 0.5 to 0.95.

In typical object detection training, two key inputs are provided: the ground-truth bounding boxes, which define the exact locations of the objects within the image, and the predicted bounding boxes generated by the model. Due to inherent model constraints and data variability, some discrepancies between the predicted and actual bounding boxes are to be expected. Evaluating the model’s performance based on predicted bounding boxes can be complex, as the degree of alignment between predicted and ground-truth boxes may vary across different observations. The coordinates of its top-left corner define each bounding box and include the object’s width and height dimensions along the x and y axes.

Otherwise, the class is not considered positive. As the training progresses, metric graphs clearly depict slopes that illustrate the effectiveness of the proposed method in achieving accurate alphabet prediction. Table 7 shows the values of recall, accuracy, and F1 score. Furthermore, we add that for IoU thresholds of 0.5 and 0.95, the mAP scores were recorded at 0.998 and 0.989, respectively, as presented in Figure 3g,i. These results underscore the efficacy of our approach in accurately predicting the Bangla letters.

In the F1 score graph, both IoU measurements at 0.5 and 0.95 yield similarly relevant outcomes. The confidence value that optimizes recall (R), as illustrated in Figure 3h, and accuracy, as depicted in Figure 3f, is 0.999. This is comparable to the best F1 score of 0.934, as shown in Figure 3e. It is generally recommended to prioritize a higher F1 score alongside a more confident prediction. The F1 score curve can be a useful tool to strike the optimal balance between precision and recall, ensuring both efficiency and accuracy.

The graph in Figure 3i shows the loss, precision, recall, and mAP functions for both the training and validation sets. The first three graphs in the top-left corner represent the training loss functions, all of which consistently decrease over time, indicating steady improvement in model performance. Similarly, the three bottom-left graphs show the validation loss functions, which also exhibit a downward trend, confirming that the model’s generalization is improving as training progresses. Moreover, Figure 3a–d provides a detailed view of class-level prediction behavior. The confusion matrices highlight accurate predictions and occasional misclassifications, while the class labels and correlogram curves further validate the model’s robustness across different Bangla characters.

Notably, the object loss function experiences a slight peak around the 20th epoch, but this is followed by a resumption of the declining trend after a few epochs, suggesting that the model is adapting to the complexity of the data. The accuracy and recall functions show a continuous upward trajectory, further supporting the conclusion that the model is learning effectively.

The mAP50 metric steadily approaches a value of one by the 150th epoch, which signifies that the model achieves near-perfect detection performance at a 50% Intersection over Union (IoU) threshold. This indicates that as over-detection occurs, the model’s detection capabilities improve, resulting in higher mAP scores.

We further investigate the performance of Bangla character recognition using a variety of advanced CNN models. In this study, we employed models such as BanglaNet, ResNet-50, VGG-16, CNN, DenseNet, DenseNet121, Vision Transformer, Inception V3, and our method on a custom dataset. The performance results for classification are presented in Figure 4.

As anticipated, DenseNet121 and ResNet-50 exhibit lower accuracy performance compared to other models, especially during testing for DenseNet121 and training for ResNet-50. However, YOLOv11 outperforms all other models, providing the best accuracy, as reflected in both training and testing performance. These results highlight the efficiency and robustness of the YOLOv11 model for Bangla character recognition tasks.

4.4.1. Classification of Letters or Characters

After completing model training, we employed the trained model to perform character prediction. The model’s primary task is to accurately identify letters within an image. Once characters are detected, the model generates predictions in the form of test words and images based on the recognized letters. During the detection phase, predicted bounding boxes are overlaid onto the input image, visually highlighting the identified characters.

Initially, the model was trained using a classification-based approach on individual characters. Following this, various test images were used to evaluate its performance. The model generates output based on the detected characters, and these are visually marked within the image using bounding boxes. The resulting annotated images are saved in the same directory as the training outputs. Additionally, the model’s weight parameters are automatically stored in the train folder after each training session. Each time the model is retrained, a new “exp” folder is created to log the corresponding results.

For testing, either a single image or a folder containing multiple images can be provided as input. If a folder path is specified, the model processes and predicts all images within that directory. By default, the output is stored in a designated results folder. This setup supports character predictions on both static images and video frames. Notably, the model achieved a character recognition accuracy of

99 %

on the test images.

To facilitate accurate predictions, each Bangla character was labeled to form distinct classes. This labeling allows the model to effectively identify characters in both images and videos during inference. As shown in Figure 5, Figure 6 and Figure 7, a new detection folder was created to store the model’s outputs, including predicted labels and their associated confidence scores. Once the model successfully recognizes a character, it annotates the corresponding image and displays the output accordingly.

4.4.2. False Detection Using the Original YOLOv11

The original YOLOv11 model demonstrated several false detections when recognizing Bangla characters, as illustrated in Figure 8. Despite reporting high confidence levels (up to 0.99 for each class), some predictions did not match the actual characters present in the image. These errors highlight the model’s difficulty in distinguishing visually similar letters, often due to overlapping structural features among certain Bangla characters.

These misclassifications can be attributed to several factors, including class imbalance, suboptimal feature extraction, and limited ability to differentiate fine-grained visual details. Expanding the dataset, particularly with samples of visually ambiguous characters, is critical for improving model performance. In addition, applying advanced data augmentation methods and fine-tuning the training parameters can further reduce false detections and enhance overall detection accuracy.

4.4.3. Correct Detection Using the Improved YOLOv11 Algorithm

After improving the YOLOv11 model, a noticeable reduction in false positives was observed, as presented in Figure 9. The enhanced model shows a significantly better ability to differentiate visually similar characters. These improvements were achieved through a combination of advanced feature extraction techniques, optimized training procedures, and more effective data augmentation strategies.

Although the model consistently reports high confidence scores (approximately 0.99 for each class), occasional misclassifications persist, indicating the need for further refinement. Notable enhancements include the integration of label smoothing to mitigate overconfidence, the use of GIoU and DIoU loss functions to improve bounding box localization, and the implementation of refined attention mechanisms (such as C2PSA) to better capture subtle visual distinctions. Moreover, addressing class imbalances through dataset balancing and targeted augmentation significantly boosted the model’s generalization capabilities, leading to more robust and accurate Bangla character recognition.

5. Conclusions

This study introduces an enhanced version of the YOLOv11 architecture, built upon the modified YOLOv5 framework, for the effective detection and classification of printed Bangla alphabet characters. The proposed model incorporates a ResNet50 backbone and attention mechanisms to enhance feature representation and detection accuracy. Experimental results demonstrate that our approach significantly outperforms existing models in terms of mean average precision (mAP), precision, recall, and F1 score, based on a custom-labeled Bangla alphabet dataset.

One of the most challenging aspects of Bangla alphabet recognition is distinguishing visually similar characters and conjunct consonants. Our model exhibits strong resilience in such cases, achieving an accuracy of up to 99.9% under optimal training conditions. We found that training for 150 epochs provided the best balance between accuracy and overfitting, making it the most effective configuration in our experiments.

LetterSpace=-2.0 Despite these promising results, there are a few limitations. The model has been tested exclusively on printed Bangla characters, and its performance on handwritten or cursive text remains unverified. Additionally, the current dataset does not include noisy backgrounds or real-world scenes, which may affect the model’s ability to generalize effectively.

Looking ahead, we aim to extend this framework to handle handwritten Bangla characters and adapt to more complex image conditions, such as varying lighting, occlusion, and background noise. We also plan to explore real-time Bangla script recognition from video streams and mobile-captured images. These future directions will further test the robustness and applicability of our proposed model in practical OCR systems for low-resource languages.

Author Contributions

Conceptualization, M.A. and N.I.; methodology, N.I.; software, N.I.; validation, N.I. and M.A.; formal analysis, N.I.; investigation, N.I.; resources, M.A.; data curation, M.A.; writing—original draft preparation, M.A.; writing—review and editing, C.Y.; supervision, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

National Natural Science Foundation of China [Grant Number: 52227901].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ethnologue. What Are the Top 200 Most Spoken Languages? Ethnologue: Dallas, TX, USA, 2024. [Google Scholar]
The World Factbook. The World Population as of 2022 Speaks Bengali; The World Factbook: Langley, VA, USA, 2022. [Google Scholar]
Gong, Y.; Tian, Y.; Wang, R.; Luo, J. Handwritten Chinese Character Recognition in Ancient Books Based on Improved YOLOv7. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 29–31 March 2024; Volume 3, pp. 1526–1532. [Google Scholar] [CrossRef]
Swetha, P.; Vidya, A.; Muthupriya, J.; Maheswari, M. Handwritten Digit Recognition Using Deep Learning. Int. J. Adv. Res. Comput. Commun. Eng. 2023, 12, 1–5. [Google Scholar] [CrossRef]
Das, T.R.; Hasan, S.M.F.; Jani, M.R.; Tabassum, F.; Islam, M.I. Bangla Handwritten Character Recognition Using Extended Convolutional Neural Network. J. Comput. Commun. 2021, 9, 158–171. [Google Scholar] [CrossRef]
Chowdhury, R.R.; Hossain, M.S.; ul Islam, R.; Andersson, K.; Hossain, S. Bangla Handwritten Character Recognition using Convolutional Neural Network with Data Augmentation. In Proceedings of the 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Spokane, WA, USA, 30 May–2 June 2019; pp. 318–323. [Google Scholar] [CrossRef]
Swindall, M.I.; Croisdale, G.; Hunter, C.C.; Keener, B.; Williams, A.C.; Brusuelas, J.H.; Krevans, N.; Sellew, M.; Fortson, L.; Wallin, J.F. Exploring Learning Approaches for Ancient Greek Character Recognition with Citizen Science Data. In Proceedings of the 2021 IEEE 17th International Conference on eScience (eScience), Virtual, 20–23 September 2021; pp. 128–137. [Google Scholar] [CrossRef]
Turnbull, R.; Mannix, E. Detecting and recognizing characters in Greek papyri with YOLOv8, DeiT and SimCLR. Int. J. Doc. Anal. Recognit. (IJDAR) 2024, 28, 277–285. [Google Scholar] [CrossRef]
Johnson, W. Bookrolls and Scribes in Oxyrhynchus; Studies in Book and Print Culture; University of Toronto Press: Toronto, ON, Canada, 2004. [Google Scholar]
Li, M.; Yue, X.; Meng, L. YOLOv4 for Kuzushiji Recognition with Synthetic Training Data Generated by GAN. CEUR Workshop Proc. 2023, 3459, 37–47. [Google Scholar]
Purkaystha, B.; Datta, T.; Islam, M.S. Bengali handwritten character recognition using deep convolutional neural network. In Proceedings of the 2017 20th International Conference of Computer and Information Technology (ICCIT), Dhaka, Bangladesh, 22–24 December 2017; pp. 1–5. [Google Scholar] [CrossRef]
Karim, M.A.; Rafiuddin, S.M.; Islam Razin, M.J.; Alam, T. Isolated Bangla Handwritten Character Classification using Transfer Learning. In Proceedings of the 2nd International Conference on Computing Advancements, New York, NY, USA, 10–12 March 2022; pp. 11–17. [Google Scholar] [CrossRef]
Bhowmik, T.; Ghanty, P.; Roy, A.; Parui, S. SVM-based hierarchical architectures for handwritten Bangla character recognition. Doc. Anal. Recognit. 2009, 12, 97–108. [Google Scholar] [CrossRef]
Al Rabbani Alif, M.; Ahmed, S.; Hasan, M.A. Isolated Bangla handwritten character recognition with convolutional neural network. In Proceedings of the 2017 20th International Conference of Computer and Information Technology (ICCIT), Dhaka, Bangladesh, 22–24 December 2017; pp. 1–6. [Google Scholar] [CrossRef]
Ahmed, S.; Tabsun, F.; Reyadh, A.S.; Shaafi, A.I.; Shah, F.M. Bengali Handwritten Alphabet Recognition using Deep Convolutional Neural Network. In Proceedings of the 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2), Rajshahi, Bangladesh, 11–12 July 2019; pp. 1–4. [Google Scholar] [CrossRef]
Haque, P.; Salma, U.; Chowdhury, R. Bangla Handwritten Character and Words Recognition-based on the YOLOv5 Algorithm. In Proceedings of the 2022 IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering (WIECON-ECE), Naya Raipur, India, 30–31 December 2022; pp. 24–29. [Google Scholar] [CrossRef]
Chatterjee, S.; Dutta, R.K.; Ganguly, D.; Chatterjee, K.; Roy, S. Bengali Handwritten Character Classification using Transfer Learning on Deep Convolutional Neural Network. arXiv 2019, arXiv:1902.11133. [Google Scholar]
Rahman, M.; Akhand, M.A.H.; Islam, S.; Shill, P.; Rahman, M.M. Bangla Handwritten Character Recognition using Convolutional Neural Network. Int. J. Image, Graph. Signal Process. (IJIGSP) 2015, 7, 42–49. [Google Scholar] [CrossRef]

Figure 1. Framework of the improved YOLOv11 integrated with ResNet50 as the backbone.

Figure 2. Different font style alphabet.

Figure 3. Performance metrics and visualizations for the improved model.

Figure 4. Training and test accuracy curve for different backbones.

Figure 5. Bengali letter or character (vowel) detection using an improved YOLOv11.

Figure 6. Bengali letter or character (consonant) detection using an improved YOLOv11.

Figure 7. Bengali letter (number) detection using an improved YOLOv11.

Figure 8. False detection using the original YOLOv11.

Figure 9. Correct detection using improved YOLOv11.

Table 1. Different Bangla alphabet or character datasets using different models.

Author	Models	Datasets	Classes	Accuracy
Bishwajit et al. [11]	CNN	Banglalekha-isolated	50	91.23%
Karim et al. [12]	Transfer learning, 3DCNN	Banglalekha-isolated	84	87.45% training, 76.80% testing
Bhowmik et al. [13]	SVM	27,000 samples	50	82.44% valid, 81.84% test
Alif et al. [14]	ResNet-18	Banglalekha-isolated, CMATERdb	84	94.52%
Sifat et al. [15]	DCNN	76,000 images	50	99%
Promila et al. [16]	YOLOv5	Banglalekha-isolated	83	96% in character and 91% in word recognition
Chatterjee et al. [17]	ResNet-50 with transfer learning	Banglalekha-isolated	84	96.12%

Table 2. Configuring the software and hardware environment table.

	Specific Configuration Information
Hardware	System: Windows 10
	CPU: Intel(R) Core (TM) i5-10400F@2.90GHz
	RAM: 16GB
	Graphics Card: NVIDIA GeForce RTX 3090
Software Environment	OpenCV 4.10, CUDA 12.4, PyTorch 2.5.1.
Training Parameters	Epoch: 150, Batch: 16, Learning Rate: 0.001, Optimizer: Adam.

Table 3. The alphabet of vowels.

Vowels
অ	আ	ই	ঈ
উ	ঊ	ঋ
এ	ঐ	ও	ঔ

Note: The symbols shown are the vowels in the Bengali script.

Table 4. The alphabet of consonants.

Consonants
ক	খ	গ	ঘ	ঙ	চ	ছ	জ	ঝ	ঞ
ট	ঠ	ড	ঢ	ণ	ত	থ	দ	ধ	ন
প	ফ	ব	ভ	ম	য	র	ল	শ	ষ
স	হ	ড়	ঢ়	য়	ৎ	ং	ঃ	ঁ

Note: The symbols shown are the conso in the Bengali script.

Table 5. The alphabet of numbers.

Numbers
০	১	২	৩	৪
৫	৬	৭	৮	৯

Note: The symbols shown are the numbers in the Bengali script.

Table 6. The alphabet of similar characters.

Similar Characters
অ	আ	উ	ঊ	এ	ঐ	ও	ঔ
ড	ড়	ঢ	ঢ়	ণ	ন	ব	র
য	ষ	য়

Note: The symbols shown are the similar characters in the Bengali script.

Table 7. Performance comparison between original and improved YOLOv11.

Contrast	Original YOLOv11	Our
Precision (P)	98.2%	99.9%
Recall (R)	97.5%	1.0%
F1	90.3%	93.4%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Aktar, M.; Islam, N.; Yang, C. Bangla Character Detection Using Enhanced YOLOv11 Models: A Deep Learning Approach. Appl. Sci. 2025, 15, 6326. https://doi.org/10.3390/app15116326

AMA Style

Aktar M, Islam N, Yang C. Bangla Character Detection Using Enhanced YOLOv11 Models: A Deep Learning Approach. Applied Sciences. 2025; 15(11):6326. https://doi.org/10.3390/app15116326

Chicago/Turabian Style

Aktar, Mahbuba, Nur Islam, and Chaoyu Yang. 2025. "Bangla Character Detection Using Enhanced YOLOv11 Models: A Deep Learning Approach" Applied Sciences 15, no. 11: 6326. https://doi.org/10.3390/app15116326

APA Style

Aktar, M., Islam, N., & Yang, C. (2025). Bangla Character Detection Using Enhanced YOLOv11 Models: A Deep Learning Approach. Applied Sciences, 15(11), 6326. https://doi.org/10.3390/app15116326

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Consonants
ক	খ	গ	ঘ	ঙ	চ	ছ	জ	ঝ	ঞ
ট	ঠ	ড	ঢ	ণ	ত	থ	দ	ধ	ন
প	ফ	ব	ভ	ম	য	র	ল	শ	ষ
স	হ	ড়	ঢ়	য়	ৎ	ং	ঃ	ঁ

Consonants
ক	খ	গ	ঘ	ঙ	চ	ছ	জ	ঝ	ঞ
ট	ঠ	ড	ঢ	ণ	ত	থ	দ	ধ	ন
প	ফ	ব	ভ	ম	য	র	ল	শ	ষ
স	হ	ড়	ঢ়	য়	ৎ	ং	ঃ	ঁ

Article Menu

Bangla Character Detection Using Enhanced YOLOv11 Models: A Deep Learning Approach

Abstract

1. Introduction

2. Related Work

2.1. Character Recognition in Different Characters Using YOLO

2.1.1. YOLO for Chinese Character Recognition in Ancient Books

2.1.2. YOLO for Handwritten Bangla Character Recognition

2.1.3. YOLO for Greek Papyri Character Recognition

2.1.4. YOLO for Kuzushiji and Old Japanese Script Recognition

2.2. Different Bangla Alphabet or Character Datasets Using Different Models

3. Methodology of the Proposed Bangla Alphabet Detection Algorithm

3.1. Overview of YOLOv11 and ResNet50

3.1.1. YOLOv11

3.1.2. ResNet50 Integration

3.2. Improvements in YOLOv11 Network Structure

3.3. Mathematical Formulation

4. Analysis and Findings from Experiments

4.1. Experimental Setup and Parameter Configuration

4.2. Building the Dataset

4.3. Evaluation Overview

4.3.1. Additional Metrics

4.3.2. Average Precision (AP)

4.4. Result Analysis

4.4.1. Classification of Letters or Characters

4.4.2. False Detection Using the Original YOLOv11

4.4.3. Correct Detection Using the Improved YOLOv11 Algorithm

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Consonants
ক	খ	গ	ঘ	ঙ	চ	ছ	জ	ঝ	ঞ
ট	ঠ	ড	ঢ	ণ	ত	থ	দ	ধ	ন
প	ফ	ব	ভ	ম	য	র	ল	শ	ষ
স	হ	ড়	ঢ়	য়	ৎ	ং	ঃ	ঁ