Impact of H&E Stain Normalization on Deep Learning Models in Cancer Image Classification: Performance, Complexity, and Trade-Offs

Simple Summary

This study investigates the impact of stain normalization on deep learning models for cancer image classification by evaluating model performance, complexity, and trade-offs. The primary objective is to assess how standardizing the visual appearance of histopathology images through stain normalization, alongside batch size and image size optimization, improves the accuracy, performance, and resource utilization of deep learning models. The findings provide valuable insights for selecting appropriate deep learning models for precise cancer classification, considering the effects of H&E stain normalization and the availability of computational resources. This study contributes to the existing knowledge on the performance, complexity, and trade-offs associated with applying deep learning models to cancer image classification tasks.

Abstract

Accurate classification of cancer images plays a crucial role in diagnosis and treatment planning. Deep learning (DL) models have shown promise in achieving high accuracy, but their performance can be influenced by variations in Hematoxylin and Eosin (H&E) staining techniques. In this study, we investigate the impact of H&E stain normalization on the performance of DL models in cancer image classification. We evaluate the performance of VGG19, VGG16, ResNet50, MobileNet, Xception, and InceptionV3 on a dataset of H&E-stained cancer images. Our findings reveal that while VGG16 exhibits strong performance, VGG19 and ResNet50 demonstrate limitations in this context. Notably, stain normalization techniques significantly improve the performance of less complex models such as MobileNet and Xception, which emerge as competitive alternatives with lower computational complexity, lower resource requirements, and high computational efficiency. The results highlight the importance of optimizing less complex models through stain normalization to achieve accurate and reliable cancer image classification. This research holds great potential for advancing the development of computationally efficient cancer classification systems, ultimately benefiting cancer diagnosis and treatment.


Introduction
Histopathology image analysis plays a crucial role in cancer diagnosis and treatment. With the advancements in deep learning (DL) techniques, the use of DL models for cancer image classification has shown promising results [1]. However, the performance and reliability of these models heavily rely on the quality and consistency of input data. Histology images are commonly stained using Hematoxylin and Eosin (H&E) to enhance tissue contrast and aid in visual interpretation. However, variations in staining protocols and equipment can introduce visual inconsistencies among images, potentially affecting the performance of DL models [2,3].
Histology image stain normalization techniques have emerged as a means to address these visual inconsistencies by standardizing the appearance of images. By applying stain normalization methods, it is possible to remove or reduce staining variations and ensure a consistent visual representation of the underlying tissue structures. This normalization process holds the potential to improve the accuracy, reliability, and resource utilization of DL models for cancer image classification tasks.
In this research study, we investigate the impact of histology image stain normalization on the performance of DL models in cancer image classification [4,5]. We conduct a comprehensive analysis using Generative Adversarial Network (GAN)-based stain normalization and evaluate its impact on DL model performance, complexity, and trade-offs within the context of cancer classification tasks [6,7].
Furthermore, the study explores the optimization of batch size and image size, two important parameters in DL model training, to maximize the benefits of stain normalization in less complex models. By finding the optimal combination of these parameters, it aims to enhance the overall performance of DL models in cancer image classification.

Dataset
This research utilizes two publicly available breast cancer datasets for training the GAN models and evaluating the performance of DL models in multiclass breast cancer classification. The following provides a description of the datasets used. The CAMELYON16 Challenge Dataset was utilized to train the GAN models for stain normalization in two domains: Aperio and Hamamatsu. These two domains represent different imaging scanners commonly used in histopathology. This dataset consists of 400 whole-slide images (WSIs) of sentinel lymph nodes, obtained from two distinct datasets collected at Radboud University Medical Center (Nijmegen, The Netherlands) and the University Medical Center Utrecht (Utrecht, The Netherlands). The training dataset consists of 170 WSIs of lymph nodes, with 100 of them being normal slides and 70 containing metastases [8]. Additionally, there is a second training dataset consisting of 100 WSIs, including 60 normal slides and 40 slides containing metastases. The test dataset consists of 130 WSIs collected from both universities. Figure 1 shows histopathology images from the same stained slide captured using the Aperio Scanscope XT scanner and the Hamamatsu Nanozoomer 2.0-HT scanner.
ICIAR 2018 Breast Cancer Histology (BACH) Grand Challenge Dataset: This dataset consists of 400 training and 100 test H&E-stained microscopy images with a resolution of 2048 × 1536 pixels. The images were scanned using a Leica DM 2000 LED microscope with a pixel resolution of 0.42 × 0.42 µm. Two expert pathologists labeled the images into four classes. While the labels of the training images are available, the labels of the test images are withheld [9]. This dataset exhibits significant color variability, making it suitable for color normalization tasks and for evaluating the performance of automated cancer diagnostic systems. In this research, the dataset was used for multiclass classification of breast histopathology images, specifically classifying them into normal, benign, in situ, and invasive carcinoma classes. Figure 2 shows microscopy images labeled with the predominant cancer type present in each image. The images showcase different cancer types, providing valuable insights into the variations in staining patterns and characteristics within the dataset.

Data Preprocessing
The primary objective of data preprocessing is to convert the original whole-slide images (WSIs) into manageable patch images of sizes 128 × 128, 256 × 256, and 512 × 512, which are suitable for subsequent tasks such as stain normalization and classification. The ICIAR 2018 dataset comprised 2048 × 1536 Tag Image File Format (TIFF) images [10].
For the ICIAR 2018 dataset, the preprocessing phase involved the generation of image patches. This process commenced by applying Otsu thresholding to remove the background from the images, effectively separating the foreground (tissue) from the background and improving the subsequent patch generation process. After the removal of the background, patches were generated at ×40 magnification, resulting in Portable Network Graphics (PNG) patch images with dimensions of 128 × 128, 256 × 256, and 512 × 512 pixels. Multiple patch sizes were generated in order to explore the impact of image size variation on subsequent tasks, including stain normalization and classification.
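The patch-generation step described above can be sketched in Python. This is a minimal illustration rather than the authors' actual pipeline: it assumes a grayscale uint8 image, computes an Otsu threshold in pure NumPy, and tiles the image into non-overlapping patches, keeping only those with sufficient foreground. (For real H&E slides, tissue is typically darker than the bright background, so the mask polarity would be inverted.)

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    w0 = sum0 = 0.0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def extract_tissue_patches(gray, patch_size, min_tissue_frac=0.5):
    """Tile the image into non-overlapping patches, keeping those with
    enough foreground (here taken as pixels above the Otsu threshold)."""
    mask = gray > otsu_threshold(gray)
    h, w = gray.shape
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            m = mask[y:y + patch_size, x:x + patch_size]
            if m.mean() >= min_tissue_frac:
                patches.append(gray[y:y + patch_size, x:x + patch_size])
    return patches
```

For a 2048 × 1536 BACH image and 512-pixel patches, this tiling yields at most 4 × 3 = 12 patches before the foreground filter is applied.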


Stain Normalization
Stain normalization in histopathological images aims to standardize their appearance and address color inconsistencies caused by staining protocols, slide preparation techniques, and imaging conditions. This process adjusts the color properties of stained images to achieve uniformity across diverse samples. Evaluating the effect of stain normalization techniques on the performance of DL models for cancer image classification is crucial for enhancing classification performance and developing efficient and accurate systems.
Figure 3 provides a visual representation of the stain normalization results using different GAN models. This visualization offers valuable insights into the impact of these models on enhancing image quality for histopathological analysis. It demonstrates the effectiveness of the GANs in normalizing stained images and improving their visual quality, thereby contributing to more accurate and reliable histopathological analysis.


Generative Adversarial Networks (GANs) Performance Evaluation Metrics
In order to proceed with the performance evaluation of the DL models, we conducted an evaluation of the stain normalization processes to select the most suitable stain normalization GAN. This evaluation encompassed assessing the performance of the different stain normalization GANs and subsequently evaluating the quality of the generated images. To ensure a comprehensive evaluation, appropriate metrics were employed, as follows.
The Structural Similarity Index (SSIM) is a widely used metric for assessing the similarity between two images. It takes into account three components: luminance, contrast, and structure. The SSIM ranges between −1 and 1, where a value of 1 indicates a perfect match [20,21].
The equation for SSIM is as follows:

SSIM(x, y) = [(2µ_x µ_y + c_1)(2σ_xy + c_2)] / [(µ_x² + µ_y² + c_1)(σ_x² + σ_y² + c_2)]

In this equation, µ_x, µ_y are the means and σ_x, σ_y are the standard deviations of the intensity values present in the two images, respectively, and σ_xy is the covariance between the two images' intensities. The constants c_1, c_2 are used to negate the weak-denominator effect.
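As an illustration, the SSIM equation can be evaluated in a simplified, single-window form where the statistics are taken over the whole image. This is only a sketch: practical implementations (e.g., scikit-image's `structural_similarity`) compute the statistics over local sliding windows and average the result.

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Simplified single-window SSIM: means, variances, and covariance
    are computed over the entire image instead of local windows."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * data_range) ** 2  # standard stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Identical images give a value of exactly 1, matching the "perfect match" interpretation above.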
The Fréchet Inception Distance (FID) is another metric used to evaluate the quality and diversity of generated images. It measures the similarity between the distribution of real images and the distribution of generated images in feature space, as captured by a pre-trained Inception model. A lower FID indicates better image quality and diversity [22]. Mathematically, the Fréchet Distance is used to compute the distance between two multivariate normal distributions. For a univariate normal distribution, the Fréchet Distance is given as

d(X, Y)² = (µ_X − µ_Y)² + (σ_X − σ_Y)²

where µ and σ are the mean and standard deviation of the normal distributions, and X and Y are the two normal distributions.
In the context of GAN evaluation, the FID utilizes feature distances calculated with a pre-trained Inception V3 model. The activations from the Inception V3 model are used to summarize each image, and these summaries yield the FID value.
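A minimal NumPy sketch of this computation, assuming the Inception embeddings have already been extracted as two feature matrices (one row per image). It avoids a general matrix square root by using the identity Tr((Σ_X Σ_Y)^(1/2)) = Tr((Σ_X^(1/2) Σ_Y Σ_X^(1/2))^(1/2)), so only symmetric matrices are decomposed:

```python
import numpy as np

def _sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_inception_distance(feats_real, feats_fake):
    """FID between two sets of feature embeddings (n_images x n_features)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    s = _sqrtm_psd(cov_r)
    tr_covmean = np.trace(_sqrtm_psd(s @ cov_f @ s))
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f)
                 - 2.0 * tr_covmean)
```

Two identical feature sets give an FID of (numerically) zero, and the value grows as the means and covariances of the two embedding distributions diverge.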
The Fréchet Inception Distance for multivariate normal distributions is given by

FID = ||µ_X − µ_Y||² + Tr(Σ_X + Σ_Y − 2(Σ_X Σ_Y)^(1/2))

where X and Y are the real and generated embeddings (activations from the Inception model), assumed to follow two multivariate normal distributions, µ_X and µ_Y are the mean vectors of X and Y, Tr is the trace of the matrix, and Σ_X and Σ_Y are the covariance matrices of the vectors.

The Inception Score (IS) is a metric used to evaluate the quality and diversity of generated images. It measures how well the generated images fool a pre-trained Inception model. A higher IS indicates better image quality and diversity [23].
The equation for IS is as follows:

IS = exp( E_{x∼p_g} [ D_KL( p(y|x) ∥ p(y) ) ] )

where x ∼ p_g indicates that x is an image sampled from the generator distribution p_g, D_KL(p ∥ q) is the KL-divergence between the distributions p and q, p(y|x) is the conditional class distribution, and p(y) = ∫_x p(y|x) p_g(x) dx is the marginal class distribution.

Additionally, other commonly used metrics for evaluating stain normalization techniques include Mean Squared Error (MSE) and Peak Signal-to-Noise Ratio (PSNR). MSE quantifies the average squared difference between pixel values of the generated and reference images, with lower MSE values indicating better stain normalization. PSNR measures the ratio between the maximum possible image power and the noise power, providing insights into image quality.
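The IS formula above can be sketched directly from a matrix of softmax class probabilities. This assumes the classifier outputs are already available; the real metric uses Inception V3 predictions, typically averaged over several splits of the generated set.

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """IS from an (n_images x n_classes) matrix of softmax outputs."""
    p_y = p_yx.mean(axis=0)                      # marginal class distribution
    kl = p_yx * (np.log(p_yx + eps) - np.log(p_y + eps))
    return float(np.exp(kl.sum(axis=1).mean()))  # exp of mean per-image KL
```

Uniform, uninformative predictions give an IS of 1, while confident predictions spread evenly across the classes push the score toward the number of classes.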
The PSNR computes the peak signal-to-noise ratio, in decibels, between two images. This ratio is used as a quality measurement between the generated image and a target image: the higher the PSNR, the better the quality of the generated image. The MSE and the PSNR are used together to compare image quality.
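Both quantities are straightforward to compute; a small sketch, assuming 8-bit images so the peak value is 255:

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two images."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    m = mse(a, b)
    return float("inf") if m == 0 else float(10.0 * np.log10(max_val ** 2 / m))
```

For example, a uniform pixel error of 16 gray levels gives an MSE of 256 and a PSNR of about 24 dB.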

Image Classification
After generating stain-normalized images using stain normalization GANs, the subsequent step involves performing image classification. In this section, the focus lies in employing deep learning (DL) models to classify stained histopathological cancer images into distinct categories, including benign, in situ, invasive, and normal. The objective is to evaluate the efficacy of the stain normalization techniques in enhancing the classification accuracy of DL models [24,25]. By assessing the performance of DL models on the stain-normalized images, the study aims to determine the impact of stain normalization on the accuracy and reliability of cancer image classification.
The ICIAR 2018 Breast Cancer Histology dataset is used for the image classification process. This dataset consists of stain-normalized images that have undergone the previously explained stain normalization techniques. The dataset is divided into training, validation, and testing sets to ensure proper evaluation of the models' performance, and the same split of images is used for training, validation, and testing across the different DL models [26-28]. Six DL models are employed for image classification, ranging from less complex to highly complex: MobileNet, XceptionNet, InceptionV3, ResNet50, VGG16, and VGG19, chosen based on their proven performance in image classification tasks and their compatibility with the stained histopathological images [29-32]. Table 1 provides an overview of the DL models used for cancer image classification, including their model size, parameter count, and depth. In training the DL models, stain-unnormalized and stain-normalized image patches were used separately. Both the stain-normalized and stain-unnormalized datasets consisted of image patches of three distinct sizes: 128 × 128, 256 × 256, and 512 × 512. This selection enabled an examination of how different image sizes affected the performance of the DL models when trained on unnormalized datasets.
By incorporating both stain-unnormalized and stain-normalized datasets, the intention was to compare the performance of the DL models on stain-unnormalized images with their performance on images that underwent stain normalization.Through this analysis, the effectiveness of stain normalization in enhancing the models' classification performance could be assessed.
Furthermore, the utilization of varying image patch sizes and batch sizes allowed for an evaluation of the impact of image resolution on the performance, efficiency, resource utilization, and trade-offs of the DL models. This facilitated the identification of the optimal image size and batch size that yielded the most favorable classification results. By systematically exploring different image resolutions, the study aimed to determine the resolution that strikes the best balance between accuracy and computational efficiency, providing insights into the optimal image size for the classification task.
Figure 4 illustrates the workflow of stain-normalized image classification using a variety of deep learning models. The figure depicts the sequential steps involved in the classification process, highlighting the key stages and interactions between different components.
The evaluation of the image classification results involves analyzing metrics such as accuracy, precision, recall, and F1-score. These metrics provide quantitative measures of the models' performance in correctly classifying the stained histopathological images into their respective categories. Additionally, the performance of the image classification models is compared with and without the application of stain normalization techniques to assess the impact of the normalization process on classification accuracy.
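For a four-class problem such as this one, all four metrics can be derived from the confusion matrix. A minimal macro-averaged sketch (an illustration, not the authors' evaluation code):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes=4):
    """Accuracy plus macro-averaged precision, recall, and F1-score
    computed from integer label sequences."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                      # rows: true class, cols: predicted
    tp = np.diag(cm).astype(float)
    col = cm.sum(axis=0)                   # predicted counts per class
    row = cm.sum(axis=1)                   # true counts per class
    precision = np.divide(tp, col, out=np.zeros_like(tp), where=col > 0)
    recall = np.divide(tp, row, out=np.zeros_like(tp), where=row > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    return {"accuracy": tp.sum() / cm.sum(),
            "precision": precision.mean(),
            "recall": recall.mean(),
            "f1": f1.mean()}
```

Macro averaging weights the four classes (normal, benign, in situ, invasive) equally, which is the usual choice when class sizes are balanced, as in the BACH training set.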

GAN Models Selection
GANs' performance evaluation metrics collectively provide quantitative measures to assess the quality, diversity, and visual fidelity of stain-normalized images. By considering both perceptual and statistical aspects, these metrics contribute to a comprehensive assessment of the generated image quality, enabling informed decision-making in stain normalization research. Table 2 presents a comprehensive summary of the evaluation of the various stain normalization methods, using a range of quantitative metrics.

These graphs offer a visual depiction of the dynamic changes in metric values throughout the training process, providing valuable insights into the performance and convergence of the GANs. By observing the trends and fluctuations in the evaluation metrics, researchers can assess the progress and effectiveness of the GAN models and make informed decisions regarding their training and optimization.

The evaluation metrics used in this study provide valuable insights into the performance of each stain normalization method. Through comprehensive analysis, the results clearly indicate that StainGAN exhibits superior performance compared to the other stain normalization GANs. The evaluation metrics highlight the effectiveness of StainGAN in achieving accurate and consistent stain normalization, making it a promising choice for enhancing image quality and standardization in histopathological analysis.

Deep Learning Model Performance in Cancer Classification
In the image classification phase, various DL models, including MobileNet, XceptionNet, InceptionV3, ResNet50, VGG16, and VGG19, were employed to classify the stain-normalized histopathological images into different cancer categories. The models' performance was evaluated across the different image sizes using metrics such as accuracy, precision, recall, and F1-score [8,23,33,34].
Table 3 summarizes the classification performance of the different DL models on the dataset with varying image sizes (128 × 128, 256 × 256, and 512 × 512) without stain normalization. The primary objective of this table is to demonstrate the models' effectiveness in cancer classification when stain normalization is not employed.

Computational Complexity Analysis
The computational complexity analysis aimed to investigate the impact of input image sizes and batch sizes on the resource utilization of the DL models used in cancer image classification. The analysis involved a comprehensive comparison of various performance metrics, including the number of parameters and image size in the DL models, processing speed in relation to both image size and batch size, FLOPs (floating-point operations) relative to image size, and the correlation of image size and batch size with GPU usage. This evaluation encompassed diverse DL models, each employing different input image sizes and batch sizes.
Table 5 provides information on the different models, including their respective image sizes, number of parameters, and FLOPs measured in millions. The FLOPs served as a measure of computational complexity, providing insights into the computational demands of the models. Through meticulous analysis of these performance metrics, this investigation yielded invaluable insights into the trade-offs, complexities, and resource requirements associated with DL models deployed in breast cancer image classification tasks encompassing diverse input image sizes and batch sizes.
The experiments were conducted on a computer system specified to ensure efficient execution of the DL experiments: an Intel Core i5 processor running at 3.5 GHz, 64 GB of RAM, and a high-performance NVIDIA GeForce RTX 4090 graphics card.
The choice of this computer system was driven by the need for substantial computational power to handle the large-scale DL tasks involved in training and evaluating the models.The inclusion of the NVIDIA GeForce RTX 4090 graphics card ensured accelerated training and inference processes, leveraging the card's parallel computing capabilities.
Figure 6 illustrates the relationship between the number of FLOPs and the number of parameters in different DL models. The FLOPs metric provides insights into the computational complexity of the models, reflecting the number of arithmetic operations required for processing the input data. In Figure 6, the size of each plot marker indicates the size of the input image, showing how the number of FLOPs changes as the number of parameters varies across different models. Each point on the graph represents a specific model configuration, with the x-axis denoting the number of parameters and the y-axis representing the corresponding number of FLOPs.
In the case of the models MobileNet, Xception, InceptionV3, and ResNet50, the number of training parameters remained constant regardless of the input image size. However, the number of floating-point operations (FLOPs) performed during model inference varied based on the input image size: as the size of the input image increased, the computational workload in terms of FLOPs also increased. This insight is valuable for optimizing computational efficiency and resource allocation when utilizing these models, as it allows for a better understanding of the computational requirements associated with different input sizes.
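This observation follows directly from how convolutions work: the weight count of a convolutional layer is independent of the spatial input size, while its FLOPs scale with the output resolution. A back-of-the-envelope sketch, assuming 'same' padding and counting a multiply-add as two operations:

```python
def conv2d_params(k, c_in, c_out):
    """Weights + biases of a k x k convolution: independent of input size."""
    return k * k * c_in * c_out + c_out

def conv2d_flops(h, w, k, c_in, c_out, stride=1):
    """Approximate FLOPs of a 'same'-padded convolution over an h x w input
    (2 ops per multiply-accumulate); scales with the output resolution."""
    h_out, w_out = h // stride, w // stride
    return 2 * h_out * w_out * k * k * c_in * c_out
```

For example, `conv2d_params(3, 3, 64)` gives 1,792, the familiar first-layer parameter count of VGG16, whether the input is 128 × 128 or 512 × 512, while doubling the input side length quadruples the layer's FLOPs.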
To evaluate the effect of increased image sizes on classification performance, the accuracy of each model was measured using the test dataset. Additionally, the processing speed, quantified as the number of images processed per second (IPS), was examined to identify disparities in computational efficiency. The findings of this investigation are summarized in Table 6, which offers a comprehensive overview of the diverse DL models, presenting image sizes, the number of images processed per second, and the corresponding batch sizes. By comparing processing speeds across different image sizes and batch sizes, valuable insights were obtained concerning the computational efficiency of each model and its capability to handle varying workloads.
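Throughput figures like those in Table 6 can be collected with a simple timing harness. The following framework-agnostic sketch uses `predict_fn` as a hypothetical placeholder for any model's inference call (it is not an API from the paper):

```python
import time
import numpy as np

def images_per_second(predict_fn, image_size, batch_size, n_batches=5):
    """Benchmark inference throughput (images/s) for a given image/batch size.
    `predict_fn` is any callable taking a (batch, H, W, 3) float array."""
    batch = np.zeros((batch_size, image_size, image_size, 3), dtype=np.float32)
    predict_fn(batch)                      # warm-up call (excluded from timing)
    start = time.perf_counter()
    for _ in range(n_batches):
        predict_fn(batch)
    elapsed = time.perf_counter() - start
    return n_batches * batch_size / elapsed
```

Sweeping `image_size` over 128, 256, and 512 and `batch_size` over the values used in Table 6 reproduces the kind of image-size/batch-size throughput grid analyzed in this section.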

MobileNet
The results presented in Tables 3 and 4 provide evidence that increasing the image size has a positive impact on the cancer classification performance of the DL models. Larger image sizes allow for capturing more detailed information, leading to improved classification accuracy. However, it is important to consider the potential challenges associated with increasing image size, such as computational complexity and increased training and inference time.
To further investigate the impact on processing speed, we examined the relationship between batch size and image size on DL model performance.We explored how varying these parameters influenced the computational demands of the models during training and inference.By analyzing the processing speed, we aimed to identify an optimal balance between image size and batch size that maximizes both classification accuracy and computational efficiency.
Figure 7 illustrates the relationship between IPS and batch size in various DL models as the image size is changed. The IPS metric serves as a valuable indicator of the computational efficiency of the models, quantifying the number of images processed per second across different image sizes and batch sizes.
Analyzing the results, it is evident that increasing the image size generally leads to a decrease in processing speed across all models.This is expected since larger images contain more pixels and require more computational resources, resulting in a reduced number of images processed per second.
Furthermore, Table 6 shows that the batch size also influences the processing speed. Generally, as the batch size increases, the processing speed improves, indicating better utilization of parallel processing capabilities. However, there are diminishing returns in speed improvements beyond a certain batch size, as the models may run into memory or computational capacity limitations.
Considering specific models, MobileNet generally achieves the highest processing speeds, particularly at larger image sizes and batch sizes. InceptionV3 also consistently demonstrates fast processing speeds across different image sizes and batch sizes.
This relationship provides meaningful insights into the models' ability to handle larger workloads and deliver faster processing speeds, which are crucial considerations for optimizing DL model performance in real-world applications.
Furthermore, our analysis included an examination of memory utilization to investigate the impact of larger input image sizes and varying batch sizes on Graphics Processing Unit (GPU) memory usage. Table 7 provides insights into GPU memory utilization in DL models, highlighting the impact of image and batch sizes on resource demands.
Among the DL models analyzed, MobileNet consistently demonstrates lower GPU memory requirements than the others, irrespective of image size or batch size, occasionally showing slightly higher requirements than Xception. ResNet50 generally demands more GPU memory than MobileNet, Xception, and InceptionV3. Notably, VGG16 and VGG19 exhibit the highest GPU memory usage across the analyzed models, even with smaller image sizes and batch sizes. These findings emphasize the importance of considering GPU memory limitations when selecting a DL model, as higher memory requirements may restrict the feasible batch size or image size. Employing optimization techniques such as model pruning can help reduce the GPU memory footprint.
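GPU memory usage of this kind can be sampled during training with NVIDIA's `nvidia-smi` tool. The sketch below parses the output of its standard CSV query; the two-GPU sample string is a hypothetical reading used only for illustration.

```python
import subprocess

# Standard nvidia-smi query: per-GPU used/total memory in MiB, no header or units.
QUERY = ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_nvidia_smi_csv(output):
    """Parse nvidia-smi CSV output: one `used, total` MiB pair per GPU line."""
    readings = []
    for line in output.strip().splitlines():
        used, total = (int(v) for v in line.split(","))
        readings.append({"used_mib": used, "total_mib": total,
                         "utilization_pct": round(100 * used / total, 1)})
    return readings

def sample_gpu_memory():
    """Take one live reading (requires an NVIDIA GPU and driver)."""
    return parse_nvidia_smi_csv(subprocess.check_output(QUERY, text=True))

# Parsing a captured sample from two hypothetical 16 GiB GPUs:
print(parse_nvidia_smi_csv("10240, 16384\n2048, 16384"))
```

Calling `sample_gpu_memory()` at intervals during training yields a per-configuration memory profile like the one summarized in Table 7.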
A subset of data corresponding to specific batch sizes is absent at the 256 × 256 and 512 × 512 image resolutions for certain DL models. This gap arises from the limited GPU memory of the test machine: at these higher resolutions, the GPU could not hold the data required to process those batch sizes, so the corresponding data points could not be collected.
Figure 8 illustrates the relationship between GPU memory utilization and batch size in different DL models as the image size is changed. The GPU memory utilization metric serves as a valuable indicator of the computational resource demands of the models, quantifying the amount of GPU memory used during training for different image sizes and batch sizes. As noted above, data are missing for certain batch sizes at the 256 × 256 and 512 × 512 image sizes because the corresponding runs could not be completed within the available GPU memory: these larger images require a substantial amount of memory, and when the allocated GPU memory falls short, the task cannot be executed. This underscores the importance of effective memory management and of considering hardware limitations when working with DL models on large image and batch sizes. Memory optimization techniques and memory-efficient architectures can help alleviate these constraints and enable the processing of larger images within the available GPU memory.
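In practice, the largest feasible batch size for a given image resolution can be found empirically by attempting progressively larger batches and catching the out-of-memory error. The sketch below shows the search logic against a simulated memory budget; the budget, the 512 × 512 float32 sizing, and the `simulated_train_step` helper are all assumptions for illustration (a real run would catch the framework's own OOM exception instead of `MemoryError`).

```python
def max_feasible_batch(try_batch, candidates):
    """Return the largest candidate batch size that does not exhaust memory."""
    best = None
    for b in sorted(candidates):
        try:
            try_batch(b)
            best = b
        except MemoryError:
            break  # larger batches will not fit either
    return best

# Simulated budget: room for 32 images of 512x512x3 float32 (4 bytes/value).
BUDGET_BYTES = 32 * 512 * 512 * 3 * 4

def simulated_train_step(batch_size):
    needed = batch_size * 512 * 512 * 3 * 4
    if needed > BUDGET_BYTES:
        raise MemoryError("out of GPU memory")

print(max_feasible_batch(simulated_train_step, [8, 16, 32, 64, 128]))  # -> 32
```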
These insights played a crucial role in identifying potential challenges and limitations related to GPU memory capacity, providing invaluable guidance for selecting image sizes and batch sizes. The goal of these optimization efforts was to maximize the effective utilization of the available computational resources and ensure the smooth execution of the DL models, striking a balance between resource efficiency and computational performance.


Discussion
The findings of our study highlight the effectiveness of DL models for cancer classification and the importance of stain normalization techniques in improving their performance. Our evaluation revealed that VGG16 emerged as the best model for cancer classification, thanks to its deep architecture and large number of parameters. However, it is worth noting that other models such as MobileNet and Xception can also deliver competitive performance when stain normalization techniques are properly applied.
Stain normalization plays a critical role in addressing the challenge of staining variations, which can introduce discrepancies in image appearance and hinder accurate classification. By applying H&E stain normalization to histopathology images, these variations can be mitigated, allowing the models to focus on relevant features and patterns. The significance of stain normalization is particularly pronounced for less complex models such as MobileNet, Xception, and InceptionV3, as their fewer parameters may limit their ability to capture subtle differences caused by staining variations.
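This study normalized stains with GAN-based methods (CycleGAN, MultipathGAN, StainGAN). As a simpler point of reference, classical stain normalization instead matches the color statistics of each image to a chosen target image. The sketch below applies Reinhard-style mean/std matching, simplified here to RGB space for brevity (the classical method operates in LAB space, so this is an illustrative assumption, not the study's method):

```python
import numpy as np

def match_stain_statistics(source, target):
    """Shift each channel of `source` to the per-channel mean/std of `target`.

    Reinhard-style statistics matching, done in RGB for brevity; the classical
    method works in LAB space, and GAN-based normalization was used in the study.
    """
    src = source.astype(np.float32)
    tgt = target.astype(np.float32)
    out = np.empty_like(src)
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-8
        t_mean, t_std = tgt[..., c].mean(), tgt[..., c].std()
        # Standardize the source channel, then rescale to the target statistics.
        out[..., c] = (src[..., c] - s_mean) / s_std * t_std + t_mean
    return np.clip(out, 0, 255).astype(np.uint8)
```

After this transform, every slide in a batch shares the target's overall color profile, which is the same goal the GAN-based normalizers pursue with learned, content-aware mappings.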
Our results indicate that stain normalization greatly improves the performance of these less complex models, enabling them to achieve higher accuracy and efficiency in stain-normalized cancer image classification tasks. This finding emphasizes the need to consider both the model architecture and the implementation of stain normalization techniques when developing cancer classification models. By incorporating stain normalization into the workflow, researchers and developers can enhance the performance of less complex models and achieve results comparable to more complex models such as VGG16.
Furthermore, the efficiency and low computational complexity of MobileNet and Xception make them attractive candidates for practical applications. These models offer a good balance between performance and resource requirements, making them suitable for deployment in real-world scenarios where computational resources may be limited.
Although our evaluation showcased the impressive performance of several DL models for cancer classification, it is important to acknowledge the limitations of others. In our study, the InceptionV3, ResNet50, and VGG19 models exhibited subpar performance compared to the remaining models, highlighting their limitations in accurately classifying cancer images.
VGG19, an extension of the VGG16 model, features a deeper architecture with 19 layers. While the increased depth allows more complex representations to be learned, it also introduces challenges such as vanishing gradients and increased computational requirements. In our study, we found that despite its increased depth, VGG19 did not yield superior performance compared to VGG16. This suggests that the additional layers may not have provided substantial benefits for cancer classification, potentially due to diminishing returns or the risk of overfitting the data.
ResNet50, on the other hand, is a popular residual network architecture that addresses the vanishing gradient problem by introducing residual connections. These connections enable the gradient to flow directly from earlier layers to later layers, facilitating the training of deeper networks. However, in our experiments, ResNet50 did not perform as well as some of the other models, including VGG16. This could be attributed to several factors, such as the complexity of cancer images and the specific dataset used in our study. It is possible that ResNet50's architecture was not well suited to capturing the relevant features and patterns in stain-normalized cancer images.
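The residual connection described above can be written as y = x + F(x): because the input is added back unchanged, the gradient has an identity path around the learned transform. A minimal numpy sketch (the transform `f` stands in for ResNet50's convolutional stack and is an assumption for illustration):

```python
import numpy as np

def residual_block(x, transform):
    """y = x + F(x): the skip connection passes x (and its gradient) through unchanged."""
    return x + transform(x)

x = np.ones(4)
f = lambda v: 0.1 * v          # stand-in for a conv-BN-ReLU stack (illustrative)
print(residual_block(x, f))    # -> [1.1 1.1 1.1 1.1]
```

Even if `transform` learns nothing (F(x) ≈ 0), the block reduces to the identity, which is what makes very deep stacks of such blocks trainable.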
The low performance of complex DL models such as InceptionV3 and ResNet50 in our study highlights the importance of model selection and the need for careful consideration when choosing an architecture for stain-normalized cancer image classification. It is crucial to assess the suitability of a model's architecture, complexity, efficiency, and ability to capture relevant features within the context of the specific dataset and task at hand. While InceptionV3 and ResNet50 have demonstrated strong performance in other domains and with other datasets, their limitations in our study underscore the need for thorough evaluation and model selection in the context of stain-normalized cancer classification.

Conclusions
In conclusion, our study highlights the significant impact of H&E stain normalization on the selection of DL models for cancer image classification. While VGG16 exhibited strong performance, InceptionV3 and ResNet50 faced limitations in this context. Notably, stain normalization techniques greatly enhanced the performance of less complex models such as MobileNet and Xception. These models emerged as competitive alternatives with lower computational complexity, improved computational efficiency, and reduced resource requirements. The findings underscore the importance of optimizing less complex models through stain normalization, achieving accurate and reliable cancer image classification while balancing performance, complexity, efficiency, and trade-offs. This research holds tremendous potential for advancing the development of computationally efficient cancer classification systems.

Figure 2.
Figure 2. Microscopy images labeled with the predominant cancer type in each image: (a) normal, (b) benign, (c) in situ carcinoma, and (d) invasive carcinoma. Scale bar, 164 µm.


Figure 3.
Figure 3. CycleGAN, MultipathGAN, and StainGAN normalized sample images from the dataset, labeled with the predominant cancer type in each image: (a) benign, (b) in situ carcinoma, (c) invasive carcinoma, and (d) normal. Scale bar, 31.5 µm.

Figure 4.
Figure 4. Workflow of stain-normalized image classification using diverse deep learning models. Scale bar, 11,698.35 µm.

Figure 5
Figure 5 represents the variation of evaluation metrics (SSIM, FID, and IS) with respect to the iteration number for GANs evaluation.


Figure 5.
Figure 5. Variation of evaluation metrics (SSIM, FID, and IS) with respect to the iteration number for GANs evaluation: (a) variation of SSIM values; (b) variation of FID values; (c) variation of IS score.

Figure 6.
Figure 6. Relationship between model complexity and computational efficiency in deep learning: FLOPs versus parameters.

Figure 7.
Figure 7. Relationship between model computational efficiency and batch size in different input image sizes in deep learning models: (a) Image size 128 × 128, (b) image size 256 × 256, and (c) image size 512 × 512.


Figure 8.
Figure 8. Relationship between GPU memory utilization and batch size in different input image sizes in deep learning models: (a) image size 128 × 128, (b) image size 256 × 256, and (c) image size 512 × 512.


Table 1.
Deep learning models used for cancer image classification.

Table 2.
Comparative evaluation of stain normalization methods based on image quality metrics.


Table 3.
Classification performance of deep learning models on dataset without stain normalization and with different image sizes.

Table 5.
Comparison of deep learning models: image size, number of parameters, and FLOPs.

Table 6.
Processing speed (images per second) of different image and batch sizes in deep learning models.

Table 7.
GPU memory utilization in different deep learning models with image and batch sizes.