The proposed methodology aims to design and implement a reconfigurable 2D-CNN hardware architecture on FPGA for real-time multi-class brain tumor classification. The design addresses both algorithmic accuracy and hardware efficiency, ensuring compatibility with resource-constrained platforms. The workflow consists of three main stages: data preprocessing, CNN model design, and FPGA implementation with reconfigurable IP cores. The following subsection provides an overview of the proposed AI-based approach for multi-class brain cancer classification, including details on the dataset, preprocessing techniques, and model training parameters.
3.1. Proposed Brain Cancer Multi-Classification
3.1.1. Methodology
The proposed approach for multi-class brain cancer classification leverages a Convolutional Neural Network (CNN) designed to automatically detect and categorize brain MRI images into four classes: glioma, meningioma, pituitary tumor, and healthy (no tumor). This model aims to extract relevant features from MRI scans while maintaining high classification accuracy and efficiency, making it suitable for integration into real-time diagnostic systems.
The methodology consists of three main stages. In the data preprocessing stage, MRI images are enhanced, normalized, and resized to ensure uniform input quality and reduce noise. The CNN model design stage focuses on defining the network architecture, selecting appropriate activation functions, and applying optimization techniques to achieve a balance between accuracy and computational efficiency. Finally, in the training and evaluation stage, the model is trained on labeled datasets and evaluated using metrics such as accuracy, precision, recall, and overall performance.
Figure 1 illustrates the workflow of the proposed AI-based classification system, from image preprocessing to the final tumor classification.
The learning process employed in this study follows a supervised learning paradigm, which is well-suited for multi-class classification tasks. The proposed CNN architecture is designed to automatically extract discriminative features from brain MRI images, eliminating the need for manual feature engineering. During training, the network processes labeled MRI scans and iteratively learns to differentiate between various tumor types through backpropagation and weight optimization.
To improve training stability and accelerate convergence, the Adam optimizer is employed due to its adaptive learning rate and effectiveness in handling sparse gradients. The model is trained using categorical cross-entropy as the loss function, which is tailored for multi-class classification. Training is performed over multiple epochs, with performance continuously monitored on a separate validation set to track learning progress and prevent overfitting.
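As a minimal illustration of the loss named above, the following sketch computes a softmax over four class scores and the categorical cross-entropy for one sample. The logits, the class ordering, and the function names are hypothetical and serve only to show the arithmetic; they are not taken from the trained model.

```python
import math

CLASSES = ["glioma", "meningioma", "pituitary", "no_tumor"]  # assumed ordering

def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: -sum_k y_k * log(p_k)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))

logits = [2.0, 0.5, 0.1, -1.0]   # hypothetical network outputs
target = [1, 0, 0, 0]            # one-hot label: glioma
probs = softmax(logits)
loss = categorical_cross_entropy(target, probs)
```

With a one-hot target, the loss reduces to the negative log-probability assigned to the true class, so it falls toward zero as the model becomes confident in the correct label.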
The proposed CNN model is evaluated using key performance metrics, including accuracy, precision, recall, and F1-score, providing a comprehensive assessment of its classification capability. Additionally, confusion matrices are generated to analyze misclassifications and gain deeper insights into the model’s decision-making process across different tumor categories. These evaluation strategies collectively ensure the reliability, robustness, and effectiveness of the proposed approach for multi-class brain tumor classification.
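The metrics listed above can all be derived from a confusion matrix. The sketch below shows one way to do so; the 4 × 4 matrix is a made-up example, not a result from this study, and the function name is an assumption.

```python
def per_class_metrics(cm):
    """cm[t][p] = number of samples of true class t predicted as class p.

    Returns overall accuracy and a list of (precision, recall, f1) per class.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    report = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted otherwise
        fp = sum(cm[t][k] for t in range(n)) - tp   # predicted k, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report.append((precision, recall, f1))
    return accuracy, report

# Hypothetical confusion matrix (rows: true class, columns: predicted class)
cm = [
    [8, 1, 1, 0],
    [2, 7, 0, 1],
    [0, 0, 10, 0],
    [0, 1, 0, 9],
]
accuracy, report = per_class_metrics(cm)
```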
3.1.2. Dataset and Preprocessing
The dataset used in this study consists of 7023 MRI images, categorized into four classes: glioma (1621 images), meningioma (1645 images), pituitary tumor (1757 images), and no tumor (2000 images). These images were obtained from a publicly available MRI dataset, Msoud, sourced from the Kaggle platform. The Msoud [26] dataset is an integration of three widely used MRI datasets: Figshare [27], SARTAJ [28], and BR35H [29], ensuring a diverse and well-balanced representation of tumor types (Figure 2).
To ensure effective model training and evaluation, the dataset was split into training (80%), validation (10%), and test (10%) sets, maintaining a balanced distribution of each class across all subsets. The inclusion of a validation set allows continuous monitoring of the model’s performance during training, helping to prevent overfitting and tune hyperparameters effectively. Data augmentation techniques such as rotation, flipping, zooming, and brightness adjustment were applied to further enhance the diversity of the training set and improve generalization.
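A stratified 80/10/10 split of this kind can be sketched as follows. The per-class counts come from the dataset description above; the seed, the rounding rule (floor for training and validation, remainder to test), and the function name are assumptions made for illustration.

```python
import random

# Per-class image counts from the dataset description
CLASS_COUNTS = {"glioma": 1621, "meningioma": 1645, "pituitary": 1757, "no_tumor": 2000}

def stratified_split(class_counts, train=0.8, val=0.1, seed=0):
    """Split each class separately so every subset keeps the class proportions."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, count in class_counts.items():
        indices = list(range(count))
        rng.shuffle(indices)
        n_train = int(count * train)
        n_val = int(count * val)
        splits["train"] += [(label, i) for i in indices[:n_train]]
        splits["val"] += [(label, i) for i in indices[n_train:n_train + n_val]]
        splits["test"] += [(label, i) for i in indices[n_train + n_val:]]  # remainder
    return splits

splits = stratified_split(CLASS_COUNTS)
sizes = {name: len(items) for name, items in splits.items()}
```

Splitting per class rather than over the pooled dataset is what keeps the class distribution balanced across the three subsets.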
Before training the model, several preprocessing steps are applied to improve image quality and optimize the learning process. The preprocessing pipeline consists of image resizing and data augmentation, both essential for enhancing model performance and preventing overfitting. All MRI images are resized to 256 × 256 pixels, ensuring uniform input dimensions across the dataset. This uniformity facilitates stable feature extraction and improves computational efficiency during training.
To improve the performance and robustness of the model, a set of data augmentation techniques is applied. These include rotation, flipping, zooming, and brightness adjustment, each introducing controlled variations that help the model generalize to real-world imaging conditions. Rotation at 90°, 180°, and 270° simulates differences in patient positioning. Flipping, both horizontal and vertical, provides additional diversity in orientations, allowing the model to better recognize tumors from different perspectives. Zooming, within a range of 0.7× to 1.3×, allows the model to learn tumor patterns at multiple scales. Brightness adjustment of ±30% compensates for variations in scanner settings and MRI contrast levels, improving model robustness to contrast differences across MRI scans.
These augmentation techniques are applied dynamically during training, effectively expanding the variability of the dataset without requiring additional MRI samples. This approach significantly improves the model’s generalization to unseen data and contributes to higher classification accuracy.
Table 1 summarizes the augmentation strategies and their corresponding parameters used in the proposed brain tumor classification system.
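The following sketch illustrates how augmentation parameters in the ranges given above could be sampled on the fly for each training image. It is a simplified illustration under several assumptions: images are 2D lists with intensities normalized to [0, 1], an unrotated case (zero quarter turns) is included, flips are applied with probability 0.5, and the zoom factor is only sampled, not applied, since resampling would require an interpolation routine.

```python
import random

def rotate90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]

def flip_horizontal(img):
    return [row[::-1] for row in img]

def flip_vertical(img):
    return [list(row) for row in img[::-1]]

def adjust_brightness(img, factor):
    """Scale intensities and clip to the assumed [0, 1] range."""
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in img]

def sample_params(rng):
    """Draw one random augmentation configuration."""
    return {
        "quarter_turns": rng.choice([0, 1, 2, 3]),   # 0, 90, 180, or 270 degrees
        "flip_h": rng.random() < 0.5,
        "flip_v": rng.random() < 0.5,
        "zoom": rng.uniform(0.7, 1.3),               # sampled only, not applied here
        "brightness": rng.uniform(0.7, 1.3),         # +/-30 %
    }

def augment(img, params):
    for _ in range(params["quarter_turns"]):
        img = rotate90(img)
    if params["flip_h"]:
        img = flip_horizontal(img)
    if params["flip_v"]:
        img = flip_vertical(img)
    return adjust_brightness(img, params["brightness"])

rng = random.Random(0)
params = sample_params(rng)
image = [[0.1, 0.2], [0.3, 0.4]]
augmented = augment(image, params)
```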
3.1.3. Proposed CNN Architecture for Brain Cancer Multi-Classification
The proposed CNN architecture (Figure 3) is designed to efficiently extract spatial and structural features from brain MRI images, enabling accurate multi-class tumor classification. The proposed neural network integrates five convolutional layers, each followed by a MaxPooling2D layer, allowing progressive reduction in spatial resolution while preserving essential discriminatory features. These convolutional layers capture increasingly complex patterns related to tumor shape, texture, and boundaries, improving the model’s ability to differentiate between tumor types. The final feature maps are passed through fully connected layers to generate the final four-class output.
Table 2 provides a detailed summary of the proposed CNN architecture.
The proposed CNN architecture for brain cancer classification processes grayscale MRI images of size 256 × 256 × 1, which the input layer receives as single-channel images. The first convolutional layer applies 32 filters (3 × 3 kernel), producing an output of 254 × 254 × 32 that captures basic features such as edges and low-level textures. It is followed by a MaxPooling2D layer that reduces the spatial dimensions to 127 × 127 × 32, decreasing the computational load while retaining important features. The second convolutional layer uses 64 filters, producing a feature map of 125 × 125 × 64, followed by a MaxPooling2D layer that reduces the spatial size to 62 × 62 × 64. The third convolutional layer applies 128 filters, generating 60 × 60 × 128, and a MaxPooling2D layer reduces the spatial dimensions to 30 × 30 × 128. The fourth convolutional layer uses 256 filters, producing an output of 28 × 28 × 256, followed by a MaxPooling2D layer that reduces it to 14 × 14 × 256. The fifth convolutional layer applies 512 filters, generating an output of 12 × 12 × 512, followed by a MaxPooling2D layer that reduces the shape to 6 × 6 × 512.
The final pooling output is flattened into an 18,432-element vector, which is fed into a Dense layer with 512 units to learn high-level abstract representations. The final classification layer contains 4 neurons with a softmax activation function corresponding to the four categories: glioma, meningioma, pituitary tumor, and no tumor. This architecture achieves a balance between computational efficiency and classification performance, making it suitable for deployment on hardware-constrained systems such as FPGAs.
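The layer dimensions quoted above follow from two rules: an unpadded 3 × 3 convolution shrinks each spatial dimension by 2, and a 2 × 2 pooling with stride 2 halves it with integer (floor) division. The short sketch below traces the shapes through the five stages under those assumptions; the helper names are illustrative.

```python
def conv_out(size, kernel=3):
    """Output size of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size, stride=2):
    """Output size of a pooling layer whose window equals its stride."""
    return size // stride

def trace_shapes(input_size=256, filters=(32, 64, 128, 256, 512)):
    """Return (layer kind, height, width, channels) after every conv and pool."""
    shapes = []
    size = input_size
    for f in filters:
        size = conv_out(size)
        shapes.append(("conv", size, size, f))
        size = pool_out(size)
        shapes.append(("pool", size, size, f))
    return shapes

shapes = trace_shapes()
_, height, width, channels = shapes[-1]
flattened = height * width * channels   # length of the vector fed to the Dense layer
```

The floor division matters at the second stage, where a 125 × 125 map is pooled to 62 × 62 and the last row and column are dropped.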
The model is trained following a structured supervised learning procedure. It is compiled using the Adam optimizer, selected for its adaptive learning rate and stable convergence, and the categorical cross-entropy loss function, appropriate for multi-class classification. The learning rate is set to 0.001, with a batch size of 32 and 50 epochs used for training. To prevent overfitting, early stopping is implemented, halting training if validation performance stops improving. The proposed neural network is trained on an augmented dataset, enhancing robustness and generalization. A separate validation set is employed to monitor training progress, evaluate convergence, and tune hyperparameters.
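The early-stopping rule described above can be sketched as follows. The learning rate, batch size, and epoch limit are taken from the text; the patience value, the monitored quantity (validation loss), and the loss sequence are assumptions, since the paper does not specify them.

```python
LEARNING_RATE = 0.001   # from the text
BATCH_SIZE = 32         # from the text
MAX_EPOCHS = 50         # from the text

class EarlyStopping:
    """Stop when the monitored validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):   # patience is an assumed value
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # improvement: remember it and reset the counter
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience

# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses[:MAX_EPOCHS], start=1):
    if stopper.should_stop(val_loss):
        stopped_at = epoch
        break
```

In this made-up sequence the loss stops improving after the third epoch, so training halts at epoch 6 and the later value of 0.50 is never reached, which shows the trade-off a small patience value introduces.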
After training, the model is assessed on an independent test set, unseen during training and validation. This final evaluation provides an unbiased measure of generalization ability, with performance metrics such as accuracy, precision, recall, F1-score, and confusion matrices used to assess classification quality across all tumor classes.
3.2. Proposed Hardware Acceleration
To improve the computational efficiency of the proposed brain cancer classification model, a dedicated hardware-accelerated implementation is introduced using FPGA technology. The primary objective of this hardware acceleration is to optimize the execution of the Convolutional Neural Network (CNN), with particular emphasis on the most computationally demanding layers such as convolution, activation, and pooling operations. By mapping these operations onto specialized FPGA hardware kernels, the design achieves substantial gains in execution time, throughput, and energy efficiency compared to conventional CPU- or GPU-based software implementations. This approach leverages the inherent parallelism, reconfigurability, and low-power characteristics of FPGAs, enabling real-time inference while operating under strict resource constraints typically associated with embedded medical imaging systems.
The proposed hardware-acceleration strategy focuses on deploying and optimizing the complete CNN-based brain cancer classification model on an FPGA platform, while respecting strict constraints on available logic resources, BRAM, and DSP slices. To achieve this, three core operations, namely Convolution (Conv2D), MaxPooling2D, and ReLU Activation, were implemented as modular IPs with parameterizable configurations. Each algorithm is optimized for parallel execution and pipelining using High-Level Synthesis (HLS) pragmas to maximize throughput while minimizing latency and resource usage (Figure 4). Here, we detail the key layers in the proposed CNN (Conv2D, MaxPooling, and ReLU) and their corresponding equations for hardware implementation.
The Conv2D layer is a core component of the CNN: it applies a series of filters (kernels) to extract feature maps from the input image. For each filter, the convolution operation slides over the image and computes the dot product of the filter and the receptive field of the input image. The output of the convolution is the feature map, which is then passed to the next layer. The convolution operation for a single output pixel is described by (1):
$$Y(i, j, k) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \sum_{c=0}^{C-1} X(i + m, j + n, c)\, W(m, n, c, k) + B(k) \tag{1}$$
where
Y(i, j, k) is the output feature map at position (i, j) for filter k.
X(i + m, j + n, c) is the input image at position (i + m, j + n) for channel c.
W(m, n, c, k) is the filter weights for filter k.
B(k) is the bias term for filter k.
M and N are the dimensions of the filter.
C is the number of input channels (e.g., 1 for grayscale).
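A direct software reference for the quantities Y, X, W, and B defined above can help when checking a hardware implementation against known values. The sketch below is a plain-Python reference under the assumptions of no padding and stride 1; the function name and the small test input are illustrative only.

```python
def conv2d_valid(X, W, B):
    """Reference convolution without padding, stride 1.

    X[row][col][c]  : input feature map
    W[m][n][c][k]   : filter weights
    B[k]            : bias per filter
    Returns Y[i][j][k].
    """
    rows, cols, channels = len(X), len(X[0]), len(X[0][0])
    M, N, filters = len(W), len(W[0]), len(B)
    out_rows, out_cols = rows - M + 1, cols - N + 1
    Y = [[[0.0] * filters for _ in range(out_cols)] for _ in range(out_rows)]
    for i in range(out_rows):
        for j in range(out_cols):
            for k in range(filters):
                acc = B[k]   # start from the bias, then accumulate products
                for m in range(M):
                    for n in range(N):
                        for c in range(channels):
                            acc += X[i + m][j + n][c] * W[m][n][c][k]
                Y[i][j][k] = acc
    return Y

# 3 x 3 single-channel input holding 1..9, one 2 x 2 filter of ones, bias 0.5
X = [[[float(r * 3 + c + 1)] for c in range(3)] for r in range(3)]
W = [[[[1.0]] for _ in range(2)] for _ in range(2)]
B = [0.5]
Y = conv2d_valid(X, W, B)
```

Each output value here is the sum of one 2 × 2 patch plus the bias, for example 1 + 2 + 4 + 5 + 0.5 for the top-left position.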
On the FPGA, the proposed convolution IP performs multiply-accumulate operations across input feature maps and filter weights to extract spatial and structural features. The algorithm accepts a generic input feature map of size M × N × H and a weight tensor of F filters of size K × K × H, generating an output of dimensions (M − K + 1) × (N − K + 1) × F. Note that here M and N denote the height and width of the input feature map, whereas in (1) they denote the filter dimensions. Nested loops iterate over the output channels, rows, columns, and kernel dimensions, while #pragma HLS UNROLL directives are applied to the innermost loops to exploit parallelism. Bias values are incorporated at the start of each convolution operation to enhance flexibility and support different activation thresholds. Algorithm 1 presents the proposed convolution IP for FPGA acceleration with generic parameters.
| Algorithm 1: Proposed Conv2D for FPGA Acceleration with Generic Parameters and Maximal Supported Configuration (M = 256, N = 256, H = 3, F = 512, K = 3) |
| Inputs: |
| 1 | I ← Input feature map (height M, width N, depth H) [M][N][H] |
| 2 | W ← Weight tensor (F filters, each K × K × H) [F][K][K][H] |
| 3 | B ← Bias vector for each filter [F] |
| Outputs: |
| 4 | O ← Output feature map [C][L][F] |
| 5 | Output height: C ← M − K + 1 |
| 6 | Output width: L ← N − K + 1 |
| 7 | Initialize output feature map: O ← zeros (C × L × F) |
| 8 | For fi = 0 to F − 1 do |
| 9 | #pragma HLS UNROLL factor = 8 |
| 10 | | For ci = 0 to C − 1 do |
| 11 | | #pragma HLS UNROLL factor = 8 |
| 12 | | | For li = 0 to L − 1 do |
| 13 | | | | accumulator ← B[fi] |
| 14 | | | | For i = 0 to K − 1 do |
| 15 | | | | #pragma HLS UNROLL factor = 8 |
| 16 | | | | | For j = 0 to K − 1 do |
| 17 | | | | | #pragma HLS UNROLL factor = 8 |
| 18 | | | | | | For hi = 0 to H − 1 do |
| 19 | | | | | | | accumulator ← I[ci + i][li + j][hi] * W[fi][i][j][hi] + accumulator |
| 20 | | | | | | end for |
| 21 | | | | | end for |
| 22 | | | | end for |
| 23 | | | | O[ci][li][fi] ← accumulator |
| 24 | | | end for |
| 25 | | end for |
| 26 | end for |
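The loop nest of Algorithm 1 performs one multiply-accumulate (MAC) per innermost iteration, so the work per layer is C × L × F × K × K × H. The sketch below applies this count to the five convolutional stages of the network, using the layer sizes given in Section 3.1.3. It is an operation count only and says nothing about achieved latency; the function and variable names are illustrative.

```python
def conv_macs(in_size, in_depth, filters, kernel=3):
    """MAC count of one unpadded, stride-1 convolution layer."""
    out_size = in_size - kernel + 1
    return out_size * out_size * filters * kernel * kernel * in_depth

# (input height/width, input depth H, number of filters F) for the five stages
STAGES = [
    (256, 1, 32),
    (127, 32, 64),
    (62, 64, 128),
    (30, 128, 256),
    (14, 256, 512),
]

macs_per_layer = [conv_macs(size, depth, f) for size, depth, f in STAGES]
total_macs = sum(macs_per_layer)
```

Under this count the first layer is by far the cheapest despite having the largest feature map, because it has a single input channel.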
The MaxPooling layer reduces the spatial dimensions of the feature map by selecting the maximum value from a specific region of the feature map. This operation not only reduces the size but also helps make the representation invariant to small translations and distortions in the image. The MaxPooling operation is defined as:
$$Y(i, j, k) = \max_{m,\, n} X(i + m, j + n, k) \tag{2}$$
where
X(i + m, j + n, k) is the input to the pooling layer at position (i + m, j + n) for feature map k.
Y(i, j, k) is the output after max pooling at position (i, j) for feature map k.
m and n index the positions inside the pooling window.
The proposed pooling IP reduces the spatial dimensions of feature maps while retaining key features, supporting both stride and kernel size as configurable parameters. For a given input feature map of size M × N × H, the algorithm computes the maximum value within a sliding K × K window across each depth channel, producing a downsampled output of size ⌊M/S⌋ × ⌊N/S⌋ × H, where S is the stride. The division is an integer division, so a 125 × 125 input with S = 2 yields a 62 × 62 output. Loop unrolling is applied to the channel dimension to accelerate computation, and intermediate max values are updated efficiently within the kernel window. Algorithm 2 presents the proposed pooling IP for FPGA acceleration with generic parameters.
| Algorithm 2: Proposed Pooling IP (MaxPooling2D) for FPGA Acceleration with Generic Parameters and Maximal Supported Configuration (M = 256, N = 256, H = 512, K = 2, S = 2) |
| Inputs: |
| 1 | I ← Input feature map (height M, width N, depth H) [M][N][H] |
| Outputs: |
| 2 | O ← Output feature map [C][L][H] |
| |
| 3 | Pool kernel size: K ← 2 |
| 4 | Pool stride: S ← 2 |
| 5 | Output height: C ← M/S |
| 6 | Output width: L ← N/S |
| 7 | Initialize output feature map: O ← zeros(C × L × H) |
| 8 | For hi = 0 to H − 1 do |
| 9 | #pragma HLS UNROLL factor = 8 |
| 10 | | For ci = 0 to C − 1 do |
| 11 | | | For li = 0 to L − 1 do |
| 12 | | | | max_val ← I[ci*S][li*S][hi] |
| 13 | | | | For i = 0 to K − 1 do |
| 14 | | | | #pragma HLS UNROLL factor = 8 |
| 15 | | | | | For j = 0 to K − 1 do |
| 16 | | | | | | val ← I[ci*S + i][li*S + j][hi] |
| 17 | | | | | | if (val > max_val) then |
| 18 | | | | | | | max_val ← val |
| 19 | | | | | | end if |
| 20 | | | | | end for |
| 21 | | | | end for |
| 22 | | | | O[ci][li][hi] ← max_val |
| 23 | | | end for |
| 24 | | end for |
| 25 | end for |
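A software reference model of the pooling loop nest can be used to check the IP output on small inputs. The sketch below mirrors the loop order of Algorithm 2 in plain Python. It assumes K = S, as in the configuration used here, so that every window stays inside the input, and it resets the running maximum to the first element of each window; the function name and test input are illustrative.

```python
def maxpool2d(I, K=2, S=2):
    """Reference max pooling on I[row][col][channel]; assumes K == S."""
    M, N, H = len(I), len(I[0]), len(I[0][0])
    C, L = M // S, N // S   # integer division drops incomplete windows
    O = [[[0] * H for _ in range(L)] for _ in range(C)]
    for hi in range(H):
        for ci in range(C):
            for li in range(L):
                max_val = I[ci * S][li * S][hi]   # reset for every window
                for i in range(K):
                    for j in range(K):
                        val = I[ci * S + i][li * S + j][hi]
                        if val > max_val:
                            max_val = val
                O[ci][li][hi] = max_val
    return O

# 5 x 5 single-channel input holding 0..24 in row-major order
I = [[[r * 5 + c] for c in range(5)] for r in range(5)]
O = maxpool2d(I)
```

For this odd-sized input the fifth row and column fall outside every window, which is the same effect that turns the 125 × 125 feature map into 62 × 62 in the network.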
The ReLU (Rectified Linear Unit) activation function is applied element-wise to the feature map to introduce non-linearity into the model. This is crucial because it allows the network to learn complex patterns and representations that a linear model could not. ReLU sets all negative values in the feature map to zero while leaving positive values unchanged. The ReLU activation is mathematically defined as:
$$\mathrm{ReLU}(x) = \max(0, x) \tag{3}$$
ReLU is simple and computationally efficient, making it ideal for FPGA implementation. Each element of the feature map is processed in parallel, and negative values are effectively discarded, enabling the model to focus on positive activations. Algorithm 3 presents the proposed ReLU Activation for FPGA acceleration with generic parameters.
| Algorithm 3: Proposed ReLU Activation for FPGA Acceleration with Generic Parameters and Maximal Supported Configuration (M = 256, N = 256, H = 512) |
| Inputs: |
| 1 | I ← Input feature map (height M, width N, depth H) [M][N][H] |
| Outputs: |
| 2 | O ← Output feature map [M][N][H] |
| |
| 3 | Initialize output feature map: O ← zeros (M × N × H) |
| 4 | For hi = 0 to H − 1 do |
| 5 | #pragma HLS UNROLL factor = 8 |
| 6 | | For i = 0 to M − 1 do |
| 7 | | | For j = 0 to N − 1 do |
| 8 | | | | if I[i][j][hi] > 0 then |
| 9 | | | | | O[i][j][hi] ← I[i][j][hi] |
| 10 | | | | else |
| 11 | | | | | O[i][j][hi] ← 0 |
| 12 | | | | end if |
| 13 | | | end for |
| 14 | | end for |
| 15 | end for |
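For testbench purposes, the element-wise rule of Algorithm 3 can be expressed in a few lines of Python. The sketch also checks one property that follows from ReLU being non-decreasing: applying ReLU to the maximum of a window gives the same result as taking the maximum of the ReLU outputs. The names and sample values are illustrative, and the property is offered as a possible verification check, not as a description of the implemented dataflow.

```python
def relu(x):
    """ReLU for a single value."""
    return x if x > 0 else 0

def relu3d(I):
    """Element-wise ReLU on a feature map I[row][col][channel]."""
    return [[[relu(v) for v in pixel] for pixel in row] for row in I]

feature_map = [[[-1.5, 2.0], [0.0, -3.0]],
               [[4.0, -0.5], [1.0, 7.0]]]
activated = relu3d(feature_map)

# Monotonicity check on one pooling window of sample values
window = [-3, 5, -1, 2]
relu_then_max = max(relu(v) for v in window)
max_then_relu = relu(max(window))
```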
Collectively, these IP cores form the computational backbone of the FPGA-accelerated CNN, supporting reconfigurability for different kernel sizes, feature map dimensions, and depth channels. The design leverages local buffering in on-chip BRAMs to minimize external memory access, exploits FPGA DSP slices for multiply-accumulate operations, and pipelines dataflow across layers to maximize throughput. This modular, parameterized approach enables the deployment of CNNs for real-time multi-class brain tumor classification on resource-constrained FPGA platforms.
The proposed hardware acceleration for the brain cancer classification system is designed to efficiently execute the three core CNN operations, Conv2D, MaxPooling2D, and ReLU, using custom FPGA IP cores configured for high performance. These operations are executed iteratively across the five convolutional stages of the network (i.e., n = 5), enabling a fully pipelined and optimized hardware workflow. Each IP is implemented with its maximum supported configuration, ensuring that the FPGA can process the largest feature maps and filter dimensions required by the model. For instance, the Conv2D IP is capable of handling feature maps up to 254 × 254 pixels with 512 filters, the MaxPooling2D IP operates on 127 × 127 × 512 feature maps, and the ReLU IP is applied after each convolution stage to introduce non-linearity at minimal computational overhead.
Figure 5 presents a detailed overview of the proposed hardware implementation approach for brain cancer multi-classification.
Key optimization strategies are applied to enhance the performance of the proposed hardware accelerator. Loop unrolling with a factor of 8 is employed to increase parallelism, allowing multiple computations to be executed simultaneously and significantly reducing the latency of each operation. Furthermore, pipelining is introduced to overlap data processing and memory access, thereby improving throughput and ensuring continuous data flow across the Conv2D, MaxPooling, and ReLU IPs. The proposed CNN layers are synthesized and validated on a Xilinx FPGA platform, where critical performance indicators including execution time, throughput, resource utilization, and power consumption are systematically monitored. The resulting hardware acceleration achieves substantial reductions in processing time while maintaining efficient resource usage, making it well-suited for real-time brain tumor classification from MRI scans.
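To give a rough sense of what an unroll factor of 8 means for loop trip counts, the sketch below computes the number of sequential iterations left after partial unrolling. This is an idealized model under strong assumptions: it ignores memory-port limits, data dependencies, and the initiation interval chosen by the HLS tool, so it is not a latency estimate for the synthesized design. The loop bounds used are the filter count (512) and the output height of the first layer (254) quoted in the text.

```python
import math

UNROLL_FACTOR = 8   # unroll factor stated in the text

def iterations_after_unroll(trip_count, factor):
    """Sequential iterations left when `factor` loop bodies execute per step (idealized)."""
    return math.ceil(trip_count / factor)

filter_loop = iterations_after_unroll(512, UNROLL_FACTOR)   # loop over F = 512 filters
row_loop = iterations_after_unroll(254, UNROLL_FACTOR)      # loop over C = 254 output rows
```

When the trip count is not a multiple of the factor, as with 254 rows, the last step is only partially filled, which is why the ceiling is used.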
Figure 6 presents the proposed hardware design for brain cancer multi-classification.
The proposed hardware-accelerated approach substantially reduces computational latency compared to traditional software-based CNN implementations, delivering a highly efficient solution for large-scale medical image classification. By leveraging FPGA-optimized IP cores, the system achieves real-time processing capability, making it well-suited for deployment in resource-constrained and time-critical clinical environments.
The next section presents a detailed discussion of the experimental results, comparing the software and hardware implementations. The benefits of FPGA acceleration, trade-offs in terms of power and execution time, and comparisons with existing state-of-the-art approaches are analyzed to highlight the advantages of the proposed hardware implementation.