Dataset of Registered Hematoxylin–Eosin and Ki67 Histopathological Image Pairs Complemented by a Registration Algorithm

In this work, we describe a dataset suitable for analyzing the extent to which hematoxylin–eosin (HE)-stained tissue contains information about the expression of Ki67 in immunohistochemistry staining. The dataset provides images of corresponding pairs of HE and Ki67 stainings and is complemented by algorithms for computing the Ki67 index. We introduce a dataset of high-resolution histological images of testicular seminoma tissue. The dataset comprises digitized histology slides from 77 conventional testicular seminoma patients, obtained via surgical resection. For each patient, two physically adjacent tissue sections are stained: one with hematoxylin and eosin, and one with Ki67 immunohistochemistry staining. This results in a total of 154 high-resolution images. The images are provided in PNG format, facilitating ease of use for image analysis compared to the original scanner output formats. Each image contains enough tissue to generate thousands of non-overlapping 224 × 224 pixel patches. This gives the potential to generate more than 50,000 pairs of patches, each consisting of an HE patch and a corresponding Ki67 patch depicting a very similar part of the tissue. Finally, we present the results of applying a ResNet neural network for the classification of HE patches into categories according to their Ki67 label.


Summary
Deep learning has revolutionized digital pathology, offering robust tools for the automated analysis of histopathological images. By leveraging convolutional neural networks, researchers can achieve high accuracy in tasks such as tumor detection, tissue segmentation, and cellular classification. These models excel in recognizing complex patterns within whole slide images (WSIs), significantly reducing the manual workload of pathologists and enhancing diagnostic consistency. Advanced architectures, such as ResNet [1], U-Net, and Transformer-based models, have been particularly effective in improving the feature extraction and interpretation of histological features [2][3][4][5]. Additionally, deep learning techniques have been applied to predict molecular phenotypes and patient outcomes from morphological data, bridging the gap between histology and genomics [6]. Despite these advancements, challenges remain, including the need for large annotated datasets, computational resource demands, and ensuring model generalizability across diverse populations and staining variations [3]. Ongoing research is focused on addressing these issues, as well as integrating deep learning with other artificial intelligence techniques to further enhance the capabilities of digital pathology.
Hematoxylin–eosin (HE) staining, as shown in Figure 1a, is widely used as a universal basic tissue stain. It is the first step in evaluating various cancer types and is extensively used for primary diagnosis due to its simplicity and cost-effectiveness. HE staining provides basic morphological information, such as the shape of cells and tissues. In clinical practice, immunohistochemical (IHC) staining, illustrated in Figure 1b, is frequently employed to obtain the protein expression status for diagnosis confirmation and subtyping. IHC staining visualizes the expression of various proteins (e.g., Ki67, estrogen receptor) on the cell membrane or nucleus. It is often necessary to perform several different IHC stains to conduct a differential diagnosis and determine attributes such as histogenesis, molecular subtype, or proliferation rate. Despite being a standard procedure, IHC staining has several limitations. It is expensive and highly dependent on tissue handling protocols, as the results are expressed through stain intensity, presence/absence of a stain, localization of staining, or the percentage of cells showing detectable stain intensity. Additionally, the interpretation of IHC results is visual and relies on the subjective assessment of pathologists, leading to inter-observer variability. Recent studies have shown a correlation between HE- and IHC-stained slides from the same region [7][8][9]. Consequently, it should be possible to model the relationship between the morphological information in HE slides and IHC information, predicting the expression of specific proteins directly from HE-stained slides without the additional IHC staining process [10]. This approach could prove to be time- and cost-efficient, or it could provide a second opinion in assessing IHC staining.
In this work, we provide a dataset of high-resolution (35,000 × 35,000 pixels on average) histological images suitable for the application of various machine learning (ML) methods, especially convolutional neural networks. The dataset consists of histology samples from 77 conventional testicular seminoma patients, obtained from the surgical resection of the patients' tumors. The series offers pairs of images of HE-stained tissue and Ki67 IHC-stained tissue, showing adjacent sections of the tissue, creating a total of 154 images. The high resolution of the images makes it possible to generate tens of thousands of patches of different sizes, which can form a large dataset. Although the sections are not uniform at the cellular level, their spatial adjacency ensures that the tissue slides exhibit the same properties and characteristics. Another advantage of the dataset provided is the fact that all images are already converted to PNG format, which is very easy to work with and can be used for many image analyses, unlike the original scanner output formats. In addition to the image pairs themselves, supplementary data such as age, tumor stage, rete testis invasion, and lymphocytic infiltration are also recorded for each patient.
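As an illustration of the patch counts mentioned above, the following minimal sketch (the function name and plain grid logic are ours, not part of the dataset tooling) enumerates the top-left corners of all non-overlapping patches that fit inside an image:

```python
def patch_grid(width, height, size=224):
    """Top-left corners of all non-overlapping size x size patches
    that fit inside a width x height image (plain grid, no tissue mask)."""
    return [(x, y)
            for y in range(0, height - size + 1, size)
            for x in range(0, width - size + 1, size)]
```

For an average 35,000 × 35,000 pixel image this yields 156 × 156 = 24,336 candidate patch positions per slide, before any filtering of background-only patches.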

Classification of Histological Images
In recent years, deep neural networks have been increasingly utilized for various medical image analysis tasks [2]. This trend extends to the histological domain, where they are employed for tasks such as classifying tumor tissues and evaluating biomarkers to inform treatment planning.
In [11], the researchers introduced HE-HER2Net, an enhanced Xception network, by incorporating global average pooling, batch normalization, dropout, and dense layers with the Swish activation function. This network was designed to classify HE images into four categories of human epidermal growth factor receptor 2 (HER2) positivity, ranging from 0 to 3+. Beyond the standard model evaluation, the researchers compared HE-HER2Net to other existing architectures, reporting that it outperformed all in accuracy, precision, recall, and AUC score. The authors of [9] advanced the classification of breast cancer molecular status (estrogen receptor, progesterone receptor, HER2) directly from HE-stained histopathology images using deep learning techniques. They developed an innovative approach that combines neural networks with neural style transfer techniques to generate tissue "fingerprints". These fingerprints are unique, high-dimensional representations of tissue images that maintain crucial morphological features despite variations in staining styles. The authors demonstrated that their method significantly improves the accuracy of predicting the ER, PR, and HER2 status of breast cancer tissues compared to traditional methods. Ref. [8] demonstrates that machine learning can accurately determine the molecular marker status from cellular morphology alone. The scholars developed a multiple-instance learning-based deep neural network to identify estrogen receptor status from HE-stained WSIs. In [12], the researchers proposed a three-step method for classifying HER2 status in breast cancer tissues. Initially, they utilized a pre-trained UNet-based nucleus detector [13] to generate patches. Next, they trained a CNN to detect tumor nuclei and subsequently classify them as HER2-positive or HER2-negative.

Prediction of Ki67 Expression from HE Images
The dataset presented here is part of research aimed at predicting Ki67 expression directly from HE images, eliminating the need for additional IHC staining. Ki67 is a nuclear protein present in cancer cells and detectable only in actively proliferating cells [14]. This protein is absent in cells during their resting phase, indicating they are not growing. Consequently, elevated levels of Ki67 serve as an indicator of rapid cancer cell growth and division, making it a good marker of proliferation (rapid increase in the number of cells) [15].
In [10], the authors addressed the problem of determining the number of Ki67-positive cells from HE images for the treatment of several cancer types. The ResNet model was trained to differentiate between negative and positive cells in homogeneous regions, effectively classifying tissues as having either 0% or 100% positivity. In contrast, we aim to train the model to classify tissues into various positivity ratio intervals. This approach involves analyzing patches that encompass heterogeneous tissue regions containing both positive and negative cells. In seminomas, the Ki67 index typically exceeds 50%, although values below 20% have also been observed [16]. Notably, a high proliferation index in seminomas does not show a clear correlation with the clinical stage or the presence of distant metastases [17]. However, a specific study identified a significant inverse relationship between Ki67 expression exceeding 50% and rete testis invasion [18]. To facilitate machine learning on 77 pairs of HE- and Ki67-stained slides containing testicular seminoma samples, we established three categories for Ki67 expression: below 20%, 20-50%, and above 50%. The method employed for obtaining Ki67 annotations for HE patches is described in more detail in [19]. Applying clustering, the Ki67 scans were recolored into three dominant colors: brown, blue, and white. Then, the Ki67 positivity ratio was estimated from the number of pixels belonging to the colors mentioned above. In [20], we employed the presented dataset to train the ResNet18 model on both binary and multiclass classification tasks. To evaluate the model performance, we divided the dataset into a training set and a validation set, the latter comprising 10% of the extracted patches from the training set, allowing us to monitor training progress and validate the model on previously unseen data originating from tissues familiar to the model. The model achieved good performance in classifying HE patches into Ki67 index categories on both the binary and multiclass tasks, with accuracies of 0.775 and 0.789, respectively.
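The pixel-color-based Ki67 estimate summarized above can be sketched as follows. This is a simplified stand-in for the clustering procedure of [19]: the RGB centroids below are illustrative assumptions, whereas [19] derives the three dominant colors (brown for DAB-positive nuclei, blue for hematoxylin-stained negative nuclei, white for background) from the scans themselves by clustering.

```python
import numpy as np

# Illustrative RGB centroids (assumed values, not those computed in [19]):
CENTROIDS = np.array([[120.0, 80.0, 50.0],     # brown: Ki67-positive nuclei
                      [70.0, 90.0, 160.0],     # blue:  Ki67-negative nuclei
                      [245.0, 245.0, 245.0]])  # white: background

def ki67_positivity(rgb_patch):
    """Assign each pixel to the nearest centroid and return
    brown / (brown + blue); white background pixels are ignored."""
    pixels = np.asarray(rgb_patch, dtype=float).reshape(-1, 3)
    dists = np.linalg.norm(pixels[:, None, :] - CENTROIDS[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # 0 = brown, 1 = blue, 2 = white
    brown = np.count_nonzero(labels == 0)
    blue = np.count_nonzero(labels == 1)
    return brown / (brown + blue) if brown + blue else 0.0

def ki67_category(ratio):
    """Map a positivity ratio to the three classes used for this dataset."""
    return "<20%" if ratio < 0.2 else ("20-50%" if ratio <= 0.5 else ">50%")
```

A patch's category label is then simply `ki67_category(ki67_positivity(patch))`, computed on the Ki67 patch and assigned to the registered HE patch.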

Additional Data
Additional data include six columns for every pair of samples and are publicly available as Additional_data.xlsx; see the Supplementary Materials section.
Testicular tumors included in the dataset were radical orchiectomy resection specimens from male patients aged 27 to 61 years, with a median of 39.5 years and a multimodal distribution (modes 29, 46). The pTNM stage was assessed by a pathologist specialized in urogenital pathology according to valid guidelines. Tumors limited to the testis and epididymis without lymphovascular invasion were evaluated as pT1. The pT2 category included tumors defined the same way as pT1 but with lymphovascular invasion, or tumors extending through the tunica albuginea with involvement of the tunica vaginalis. Tumors infiltrating the spermatic cord, regardless of lymphovascular invasion, were placed into category pT3. Tumors extending into the scrotum were classified as pT4 [21]. Orchiectomy specimens in our cohort did not include lymph node biopsies; therefore, the N and M stages were marked as "NxMx" in every sample.
Rete testis invasion in the column "infiltration of rete testis" was evaluated independently of the tumor stage as an adverse prognostic factor, which may account for higher rates of recurrence and distant metastasis even in early-stage disease [22,23]. Lymphocytic infiltration (column "intensity of lymphocyte inflammatory reaction") represents a characteristic histological feature of seminoma and was classified according to density into three categories: strong, moderate, and weak. No association was discovered between the density of the inflammatory reaction and tumor stage or rete testis invasion.
In the column "Ki67 proliferation index (eyeballing method)", we report the proliferation activity for samples evaluated by pathologists within areas of the highest density of positive staining (so-called hot spots).The column "laterality" refers to the laterality of the testis.

Image Acquisition
Seventy-seven testicular seminoma samples were sectioned into parallel formalin-fixed paraffin-embedded sections with a thickness of 3-4 micrometers. Hematoxylin–eosin (HE) staining was conducted using the Tissue-Tek Prisma® Plus Automated Slide Stainer (Sakura Finetek Japan Co., Ltd., Tokyo, Japan). The deparaffinized sections were stained with Weigert hematoxylin, followed by washing and differentiation with low pH alcohol, additional washing, eosin staining, dehydration, clearing with carboxylole and xylene, and coverslipping with the Tissue-Tek Film® Automated Coverslipper (Sakura Finetek Japan Co., Ltd.). The immunohistochemical analysis employed the monoclonal mouse antibody clone MIB-1 (FLEX, Dako) on the automated PT Link platform (Dako, Denmark A/S). Visualization utilized EnVision FLEX/HRP (Dako), DAB (EnVision FLEX, Dako), and contrast hematoxylin staining. Whole slide images of HE- and Ki67-stained sections from the same cases were sequentially ordered, anonymized, and scanned using the 3D Histech PANNORAMIC® 250 Flash III 3.0.3 in BrightField Default mode at 20× magnification. HE and Ki67 staining were performed on adjacent tissue sections to ensure tissue similarity.

Data Preprocessing
Scanned whole slide images (WSIs) were stored in MRXS format, with each file approximately 1 GB in size. An MRXS file contains images of multiple specimen samples on a single digitized virtual slide, captured at multiple levels with varying resolutions. Since only a limited set of operations can be performed on MRXS scans via Python libraries, we converted the MRXS files to PNG format for image analysis using the OpenSlide library in Python. The scans included images at 8 levels, with lower levels containing higher-resolution images. To manage the substantial memory demands of the top-resolution images (approximately 6 GB per image at level 0), we opted to process the images at the second-highest resolution, level 1. This level retains sufficient detail without compromising information integrity, thereby alleviating memory-related challenges. HE scans included two tissue sections; thus, we extracted super patches containing a single tissue section from the original scans. For the HE scans, we selected the one tissue out of the two that was the most complete or most similar in shape to the Ki67 tissue. This procedure was similarly applied to the Ki67 scans, significantly reducing the size of the resulting PNG images. Images were extracted from the WSIs based on exported annotations created in SlideViewer.
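The MRXS-to-PNG conversion step can be sketched with OpenSlide as follows. The file paths are hypothetical, and the small helper simply selects level 1 (the second-highest resolution) when the slide has more than one level; the actual preprocessing additionally cropped super patches from annotated regions.

```python
def level1_size(level_dimensions):
    """Pick the (width, height) of level 1 (second-highest resolution),
    falling back to level 0 for single-level slides."""
    return level_dimensions[1] if len(level_dimensions) > 1 else level_dimensions[0]

def convert_mrxs_to_png(mrxs_path, png_path):
    """Read level 1 of an MRXS slide with OpenSlide and save it as PNG."""
    import openslide  # requires the OpenSlide C library and Python bindings
    slide = openslide.OpenSlide(mrxs_path)
    level = 1 if slide.level_count > 1 else 0
    w, h = level1_size(slide.level_dimensions)
    # read_region expects the top-left corner in level-0 coordinates
    region = slide.read_region((0, 0), level, (w, h)).convert("RGB")
    region.save(png_path)

# Usage (hypothetical paths):
#   convert_mrxs_to_png("scan_001.mrxs", "scan_001.png")
```

Converting to RGB before saving drops the alpha channel that `read_region` returns, which keeps the resulting PNG compatible with most image-analysis pipelines.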

Tissue Registration
To create pairs of patches, it is essential to ensure that patches from the same region of the HE- and Ki67-stained images correspond to the same tissue, or tissue from the same region of the slide. In other words, if we place the images on top of each other, the tissues should overlap. Due to rotation and displacement differences in the converted PNG images, an alignment of the two images was required. Since both sections were scanned at the same resolution, the registration does not require scaling; thus, we consider an affine transformation comprising only translation and rotation. This requires determining three transformation parameters: a rotation angle and two translation offsets.
It is important to note that the tissue pairs are not identical, preventing cell-level matching. Although the HE and IHC sections were adjacent and sequential, the same cells were not present in both sections. Therefore, the matching was based on the similarity of tissue regions, such as shape. For predicting proliferation with neural networks, the lack of cell-to-cell correspondence may seem problematic. However, since the goal is not to teach the model to distinguish individual positive and negative cells, but only to recognize patches belonging to a certain degree of proliferation, the HE and Ki67 patches can be used even if they do not contain identical cells or the same number of cells, operating under the assumption that the tissue structure remains relatively preserved in a given area.
To validate this assumption, patches measuring 224 × 224 and 512 × 512 pixels were generated, the Ki67-positivity ratio was quantified in accordance with [19], and all patches from the image were visualized in a heatmap, colored based on the degree of Ki67 positivity, as shown in Figure 2. Due to the low Ki67 positivity of tissue patches, negative values were assigned to patches without tissue to distinctly differentiate the tissue-containing areas from the background.
The objective was to confirm that, although the tissues may not be identical, the Ki67 expression in a given region is influenced by the existing tissue structure (aggressive or non-aggressive tumor), which persists in three-dimensional space. Therefore, two adjacent sections will retain this structure and its properties (such as the degree of proliferation) despite discrepancies in individual cells. This implies the presence of regions with uniform Ki67 positivity from which patches can be derived. Examination of the heatmaps supports this assumption, demonstrating homogeneous regions with multiple patches exhibiting the same or similar Ki67 expression, indicating that these values are not randomly distributed across the tissue patches. In [24], we introduced a semi-automated registration approach based on keypoints and optimization methods. For each HE and Ki67 scan pair, we manually defined pairs of keypoints and used an optimization technique to determine the best transformation parameters between them. Keypoint definition was conducted using SlideViewer, where small square annotations were created for individual keypoints. These annotations were then exported to XML files via SlideMaster. SlideViewer enabled simultaneous viewing and annotation of multiple scans side by side, which expedited the process of annotating keypoints on both HE and Ki67 scans. Corresponding keypoints were given identical annotation names for straightforward identification during subsequent steps. Five square keypoints were annotated for each slide. The XML annotation files recorded the coordinates of each annotation's top-left corner relative to the top-left corner of the scan, along with the annotation's width and height. All dimensions were specified for the original scan size (layer 0) and were scaled by dividing by 2^layer, where layer is the layer from which the PNG image was later extracted. Initially, keypoints were defined as squares rather than points. In the next phase, we determined the centers of these squares and recalculated their coordinates relative to the large bounding box annotation (super patch annotation). These adjusted coordinates in the scaled image slices served as input for the optimization algorithm.
The transformation between the two sets of keypoints was defined as rotation and translation. To simplify the subsequent image transformations, we performed rotations around the center of the image. The rotation matrix is typically defined for rotation around the origin [0, 0]. Therefore, within the transformation function, we first translated all points by the vector (−width/2, −height/2), applied the rotation, and then translated them back to their original positions. After the rotation, we applied a tissue shift via translation. By expanding the space from 2 × 2 to 3 × 3 (homogeneous coordinates), the entire transformation can be expressed using matrices as follows:

$$
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & p_x \\ 0 & 1 & p_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & \tfrac{w}{2} \\ 0 & 1 & \tfrac{h}{2} \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & -\tfrac{w}{2} \\ 0 & 1 & -\tfrac{h}{2} \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}
\tag{1}
$$

where x, y are input keypoint coordinates, x′, y′ are the transformed output coordinates, α is the rotation angle in radians, p_x, p_y are the coordinates of the final translation, and w, h denote the image width and height.
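The rotation-about-center plus translation described above can be sketched as follows (the function name is ours); the three composed translations and the rotation are built as homogeneous 3 × 3 matrices and multiplied together:

```python
import numpy as np

def transform_points(points, alpha, px, py, width, height):
    """Rotate points about the image center by alpha (radians),
    then translate by (px, py), via homogeneous 3x3 matrices."""
    cx, cy = width / 2.0, height / 2.0
    to_origin = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1]], float)
    rot = np.array([[np.cos(alpha), -np.sin(alpha), 0],
                    [np.sin(alpha),  np.cos(alpha), 0],
                    [0,              0,             1]], float)
    back = np.array([[1, 0, cx + px], [0, 1, cy + py], [0, 0, 1]], float)
    M = back @ rot @ to_origin
    pts = np.c_[points, np.ones(len(points))]  # to homogeneous coordinates
    return (M @ pts.T).T[:, :2]
```

Note that the image center is a fixed point of the rotation, so it is displaced only by the final translation (px, py).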
To identify the optimal transformation parameters that minimized the distance between the original HE keypoints and the transformed Ki67 keypoints, we employed an optimization method. The objective function L can be expressed as the sum of the Euclidean distances between the original HE keypoints (x_HE, y_HE) and the Ki67 keypoints transformed via the optimized parameters in (1), (x′_Ki67, y′_Ki67), as follows:

$$
L(\alpha, p_x, p_y) = \sum_{i} \sqrt{\left(x_{HE,i} - x'_{Ki67,i}\right)^2 + \left(y_{HE,i} - y'_{Ki67,i}\right)^2}
\tag{2}
$$

We utilized the scipy.optimize library in Python, specifically the minimize method, which allows for the selection of various solvers (algorithms) depending on whether the optimization problem includes constraints or bounds. Since our problem had neither, we opted for the default solver. The parameter for this method is a function that computes the objective function L for the optimized parameters α, px, and py.
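A minimal sketch of this fit (function names ours; the transformation is reimplemented inline for self-containment):

```python
import numpy as np
from scipy.optimize import minimize

def fit_transform(he_pts, ki67_pts, width, height):
    """Recover (alpha, px, py) by minimizing L: the summed Euclidean
    distance between HE keypoints and transformed Ki67 keypoints."""
    he = np.asarray(he_pts, float)
    ki = np.asarray(ki67_pts, float)
    cx, cy = width / 2.0, height / 2.0

    def apply(params, pts):
        a, px, py = params
        c, s = np.cos(a), np.sin(a)
        x, y = pts[:, 0] - cx, pts[:, 1] - cy     # shift center to origin
        return np.c_[c * x - s * y + cx + px,     # rotate, shift back,
                     s * x + c * y + cy + py]     # then translate

    def L(params):
        return np.linalg.norm(he - apply(params, ki), axis=1).sum()

    # default solver (BFGS for an unconstrained problem, as in the text)
    return minimize(L, x0=np.zeros(3)).x
```

With five manually annotated keypoint pairs per slide, this is a well-determined fit: three parameters against ten coordinate residuals.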
After obtaining the transformation parameters and preprocessing both scans, we compared their dimensions and added white pixels in both directions so that they were the same size. This ensured that the upper left corner of the original image remained in the same position in the new image, preserving the coordinates of keypoints. Here is a summary of the procedure:
1. Rotate the Ki67 image around its center with the "expand" option enabled, ensuring the resulting image is large enough to contain the entire rotated IHC image, with additional white pixels as padding;
2. Create a white image of the same dimensions as the rotated Ki67 image;
3. Calculate the translation vector v for the HE image relative to the white image, ensuring that, when placed with its top-left corner at the origin and then shifted, it is centered;
4. Adjust the translation vector v by subtracting the shift parameters obtained from the optimization;
5. Copy each pixel of the HE image to the corresponding coordinates in the white image, adjusted by the translation vector v.
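The five steps above can be sketched with Pillow as follows (function name ours; depending on the fitted sign convention for α, the rotation angle may need negating, since PIL rotates counterclockwise in image coordinates):

```python
import math
from PIL import Image

def align_pair(he_img, ki67_img, alpha, px, py):
    """Sketch of the five-step alignment described above."""
    # 1. rotate Ki67 about its center with expand=True; pad with white
    ki_rot = ki67_img.rotate(math.degrees(alpha), expand=True, fillcolor="white")
    # 2. white canvas with the same dimensions as the rotated Ki67 image
    canvas = Image.new("RGB", ki_rot.size, "white")
    # 3. translation vector v that centers the HE image on the canvas
    vx = (ki_rot.width - he_img.width) // 2
    vy = (ki_rot.height - he_img.height) // 2
    # 4. adjust v by the shift parameters from the optimization
    vx -= int(round(px))
    vy -= int(round(py))
    # 5. copy the HE pixels onto the canvas at the adjusted position
    canvas.paste(he_img, (vx, vy))
    return canvas, ki_rot
```

After this step, `canvas` and `ki_rot` have identical dimensions, so identically positioned crops from the two images form a registered HE/Ki67 patch pair.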
An example of successful registration is shown in Figure 3, where keypoints are highlighted as blue rectangles in the HE images and orange rectangles in the IHC images.

Validation with Convolutional Network Model
The correctness of the registration was validated by assessing the accuracy of a classification model in predicting Ki67 expression directly from HE images, as described further in [20]. We compared the accuracy of two models: the first trained on the dataset as provided, and the second on a dataset with randomly assigned labels. A model trained on randomly labeled data should exhibit lower validation accuracy, as the annotations in the validation data are assigned randomly, preventing the neural network from modeling this randomness effectively. From the results presented in [20], it is clear that the first model achieved high accuracy, while the second model's accuracy was equivalent to random guessing. The significantly lower accuracy of the second model provides sufficient evidence to confirm the correctness of our registration procedure.

Dataset Limitations and Considerations
While the dataset was carefully created, several limitations were identified that should be considered for further data processing and analysis. Firstly, the sample size is limited to 77 participants. Although the morphological spectrum of germ cell tumors is broad, the histomorphology of "conventional seminoma" is uniform across different individuals, representing a diagnostically specific morphological entity. Therefore, while a series of 77 tumors may be sufficiently representative for many research tasks, application-specific data requirements should be considered. Additionally, the dataset exclusively comprises male tissue samples, since seminoma, by definition, is a germ cell tumor arising only in testicular tissue.
Nevertheless, despite these limitations, the dataset significantly enhances opportunities for machine learning applications in digital pathology.

Conclusions
This dataset is intended for AI models that predict the Ki67 index or even generate Ki67 staining. Since each pair of HE and Ki67 images contains two physically different (although neighboring) tissue sections, there is no one-to-one correspondence at the cellular level. However, patches created at the same locations in the images of one pair contain similar quantitative characteristics, such as the number of cells, the number of Ki67-positive cells, and the average cell size. In [20,24], we elaborate on the usage of the dataset by computing the Ki67 index of HE patches evaluated from the corresponding Ki67 patch.

Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/data9080100/s1. Supplementary data include a single spreadsheet file, Additional_data.xlsx; see Section 2.1 for details.

Institutional Review Board Statement: Ethical review and approval were waived for this study due to the retrospective analysis of the fully anonymized data used in the study. The consent to use biological material for diagnostic and research purposes was included during admittance to the healthcare facility.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The original data presented in the study are openly available at https://doi.org/10.5281/zenodo.11218961

Figure 3. Example of successful semi-automated registration for two pairs. (a) HE tissue on the left, (b) Ki67 tissue in the middle, (c) overlay of transformed HE and Ki67 tissue on the right.