The fundamental novelty of the TPB + TEB fusion strategy lies in its hierarchical, multi-scale integration mechanism. Unlike conventional methods that fuse metadata once (e.g., via channel concatenation at the bottleneck), our framework injects BERT-processed metadata at three distinct resolution levels through the TEB modules. Each TEB performs channel-wise concatenation followed by residual refinement, enabling the network to dynamically modulate visual features using clinical context appropriate to each scale. This design ensures that low-level texture details benefit from metadata-guided attention just as high-level semantic decisions do—something unattainable with flat fusion schemes.
MMY-Net adopts a Y-shaped encoder–decoder structure with two parallel encoders (visual and textual) and a shared decoder.
3.1. Multimodal Image Segmentation Datasets
For simple structured metadata like age and gender used in our experiments, we employ a context-aware tokenization strategy rather than treating them as isolated features. Age values are converted to descriptive phrases (e.g., “52 years old patient”), while gender is represented as “male patient” or “female patient”. These phrases are then processed through the BERT tokenizer with special tokens ([CLS] and [SEP]) to maintain textual structure. Though our current experiments use standard BERT rather than ClinicalBERT, we conducted ablation studies comparing domain-adapted language models. ClinicalBERT showed only marginal improvements (0.8% higher Dice score) despite requiring 3× longer fine-tuning time, suggesting that for simple structured metadata, domain adaptation provides diminishing returns.
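As an illustration, the phrase construction described above might be sketched as follows; the helper name is ours, and the tokenizer call shown in the comment assumes the standard Hugging Face `BertTokenizer` rather than any code released with this work:

```python
def metadata_to_phrases(age: int, gender: str) -> list:
    """Convert structured metadata into descriptive phrases for BERT."""
    return [f"{age} years old patient", f"{gender} patient"]

# Hypothetical use with the Hugging Face tokenizer (not bundled here);
# the tokenizer adds the [CLS] and [SEP] special tokens automatically:
#   from transformers import BertTokenizer
#   tok = BertTokenizer.from_pretrained("bert-base-uncased")
#   ids = tok(metadata_to_phrases(52, "male"), padding=True)
```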
Regarding embedding alternatives, we evaluated simpler approaches including:
One-hot encoding + MLP (baseline);
Learned embeddings + MLP;
ClinicalBERT feature extraction.
To prevent data leakage, we first split the WSIs at the patient level using a 7:2:1 ratio (train:validation:test), ensuring no patient appears across multiple sets. Only after this patient-level split did we extract patches from each WSI subset. This approach guarantees that all patches from the same WSI share identical metadata but never appear across training, validation, and testing sets.
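The patient-level split can be sketched as below; this is a minimal illustration of the 7:2:1 protocol, assuming patient identifiers are available before any patch extraction (function and seed are ours):

```python
import random

def split_patients(patient_ids, seed=0):
    """Split patients 7:2:1 (train:val:test) so no patient spans sets.

    Patches are extracted only AFTER this split, per WSI subset,
    preventing leakage of patient metadata across sets.
    """
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```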
As shown in Table 10, BERT outperformed the simpler approaches by 2.1–3.7% in mDice, demonstrating that its contextual understanding of metadata relationships provides meaningful signal even for structured data. While an MLP embedding would be computationally efficient, it fails to capture implicit relationships between clinical variables that BERT learns from its pre-training corpus. For example, BERT naturally encodes relationships between age groups and disease prevalence without explicit supervision.
We evaluate MMY-Net on three datasets:
Dataset 1: Open/public eyelid tumor dataset contains 112 whole-slide images (WSIs)—65 seborrheic keratosis (SK) and 47 basal cell carcinoma (BCC)—with patient age and gender. A total of 7989 patches were extracted and split 7:2:1.
Dataset 2: Open/public test set (65 BCC + 42 SK WSIs) was used solely for external validation. Annotations are fine-grained, unlike the coarse labels in Dataset 1 (see Figure 9).
Dataset 3: Open/public gland segmentation challenge dataset comprises 85 training and 80 test H&E images with metadata on malignancy (benign/malignant) and differentiation grade.
Before experiments, Openslide was used to extract 1024 × 1024 pixel patch-level pathological slices and the corresponding segmentation masks from these WSIs using Algorithm 1. The data was randomly divided into training, validation, and test sets in a 7:2:1 ratio. All images underwent Vahadane stain normalization.
Figure 5 shows example patch-level images of basal cell carcinoma and seborrheic keratosis after stain normalization.
The demographic statistics of the patients are illustrated in Figure 6 and Figure 7, which show the gender and age distributions of the 112 patients corresponding to the WSIs. BCC denotes basal cell carcinoma, and SK denotes seborrheic keratosis. The distributions indicate that the average age of basal cell carcinoma patients is higher than that of seborrheic keratosis patients, and that female patients outnumber male patients for both diseases.
Example pathological samples from the gland dataset are shown in Figure 8.
The patch-level pathological image distribution corresponding to Dataset 1 and Dataset 2 is summarized in Table 2.
Gland Dataset (Dataset 3):
Provided in the 2015 MICCAI Gland Segmentation Challenge.
Contains 165 Hematoxylin and Eosin (H&E)-stained pathological images from 16 patients, covering glandular tissue.
Training set: 85 images (37 benign, 48 malignant) from 15 patients.
Test set: 80 images (33 benign, 27 malignant) from 12 patients.
Most images have a resolution of 775 × 522, containing rich tissue structures and gland distribution information.
Figure 8 shows sample images from Dataset 3. This dataset is a widely used benchmark for developing and evaluating gland segmentation algorithms.
In Dataset 1 and Dataset 2, each WSI corresponds to one patient; thus, each patch-level image is associated with two items of patient metadata: gender (male/female) and age (e.g., 52).
Annotation differences between Dataset 1 and Dataset 2 are illustrated in Figure 9.
In Dataset 3, metadata includes two parts: gland malignancy and differentiation degree, both retrievable via the corresponding patient ID. Gland malignancy has two categories: benign and malignant. Differentiation degree has five categories: healthy, adenoma, moderate differentiation, moderate-to-low differentiation, and low differentiation. Their semantic similarity can be referenced in Table 3; the cosine distances indicate that the descriptions of differentiation degree form a graded distance relationship with final tumor malignancy.
It should be noted that, although Dataset 1 and Dataset 2 both cover basal cell carcinoma and seborrheic keratosis, they come from different hospitals and were annotated by different pathologists, resulting in different annotation styles. Specifically, Dataset 2 annotations are fine-grained, while Dataset 1 annotations are coarse-grained, as illustrated in Figure 9 (left: Dataset 1 example; right: Dataset 2 example).
3.2. MMY-Net-Based Multimodal Segmentation Framework
The previous sections described the internal structure of MMY-Net. This section outlines the framework for metadata-based multimodal eyelid tumor segmentation using MMY-Net, as shown in Figure 10, which is divided into two stages: training and inference.
Figure 10.
Framework for MMY-Net-based multimodal eyelid tumor segmentation.
Training Stage:
First, preprocess the entire dataset by extracting each pathological image slice and its corresponding segmentation annotation. These slices serve as input data for training MMY-Net.
During extraction, the corresponding segmentation annotations must also be extracted to provide supervision signals for network training.
After preparing the patch-level pathological slice dataset, input it along with corresponding patient metadata into the network for training. During training, the network uses metadata and the corresponding pathological slices to learn features related to tumor segmentation, which are used to predict segmentation results for each slice.
Inference Stage:
After network training, perform segmentation prediction on whole-slide pathological images.
Use a sliding window to extract patches from the entire whole-slide image; patches are non-overlapping and fully cover all regions, so no potential tumor area is missed.
Record the coordinate position of each patch image on the original whole-slide pathological image.
For Dataset 3, which involves only single-gland semantic segmentation, training uses an equally weighted combination of cross-entropy and Dice loss functions.
After inputting the slices into the trained MMY-Net for inference and obtaining segmentation results, use the position coordinates to stitch the results back to the original image. Specifically, place each slice’s segmentation result at its corresponding position in the original image. The final stitched image is the tumor segmentation result for the entire slide.
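The non-overlapping tiling and coordinate-based stitching described above can be sketched as follows; the function names and NumPy representation are ours, and per-patch inference is omitted:

```python
import numpy as np

def tile_coords(h, w, patch=1024):
    """Top-left coordinates of non-overlapping patches covering the slide."""
    return [(y, x) for y in range(0, h, patch) for x in range(0, w, patch)]

def stitch(pred_patches, coords, h, w):
    """Place each patch's segmentation result back at its recorded position."""
    out = np.zeros((h, w), dtype=np.uint8)
    for pred, (y, x) in zip(pred_patches, coords):
        ph, pw = pred.shape
        out[y:y + ph, x:x + pw] = pred
    return out
```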
MMY-Net’s multi-class segmentation essentially classifies each pixel. Thus, for Dataset 1 and Dataset 2, tumor classification results are also obtained. However, the stitched image might show multiple tumor types in one image, which is not realistic. Therefore, a normalization operation, defined in Equation (6), is designed to obtain the final classification result, where C0 represents basal cell carcinoma, C1 represents seborrheic keratosis, n0 is the number of pixels predicted as basal cell carcinoma in the WSI, and n1 is the number predicted as seborrheic keratosis. After normalization, each WSI corresponds to only one tumor type: either basal cell carcinoma (malignant) or seborrheic keratosis (benign).
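Equation (6) is not reproduced here; reading the normalization as a pixel-count majority vote over n0 and n1 (our assumption), it could be sketched as:

```python
import numpy as np

def normalize_wsi_class(stitched_mask):
    """Assign the whole WSI to the tumor class with more predicted pixels.

    Pixel labels follow Section 3.3: 1 = BCC (C0), 2 = SK (C1), 0 = background.
    The tie-breaking rule here is our assumption.
    """
    n0 = int(np.sum(stitched_mask == 1))  # pixels predicted as BCC
    n1 = int(np.sum(stitched_mask == 2))  # pixels predicted as SK
    return "BCC" if n0 >= n1 else "SK"
```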
3.3. Evaluation Metrics for Multimodal Image Segmentation
In experiments on Dataset 1/Dataset 2, the basal cell carcinoma region in each WSI is defined as pixel label “1”, seborrheic keratosis region as label “2”, and other regions as background “0”. Since basal cell carcinoma and seborrheic keratosis cannot coexist in the same image, segmentation metrics are calculated separately for each disease.
Specifically, three metrics (CPA, Dice, IoU) are used to evaluate tumor segmentation performance:
Class Pixel Accuracy (CPA): Measures pixel-level accuracy for each class, crucial for evaluating disease recognition.
Dice and IoU: Evaluate similarity between predicted and ground truth images based on area overlap.
These are standard methods for evaluating image segmentation, effectively reflecting model performance and enabling comparison between models. Multiscale structural similarity (MS-SSIM) is also widely used for evaluating perceptual quality and structural similarity in medical image reconstruction and segmentation tasks [37].
Mathematically, these metrics are defined in Equations (7)–(12), where TP, FP, TN, and FN are pixel-level true positives, false positives, true negatives, and false negatives. Calculations are performed per class, with final results provided as class averages.
Since Dataset 1 and Dataset 2 experiments involve multi-class segmentation, CPA, IoU, and Dice are calculated for each class. Here, mPA (Equation (10)), mIoU (Equation (11)), and mDice (Equation (12)) represent the average values across all classes. For Dataset 3 experiments, being a binary semantic segmentation task, only Dice and IoU (Equations (8) and (9)) are calculated.
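The per-class Dice and IoU computation from pixel-level counts can be sketched as below; this is a minimal NumPy illustration of Equations (8) and (9) (the empty-class convention of returning 1.0 is our assumption):

```python
import numpy as np

def dice_iou(pred, gt, cls):
    """Per-class Dice and IoU from pixel-level TP/FP/FN counts."""
    p, g = (pred == cls), (gt == cls)
    tp = int(np.sum(p & g))   # true positive pixels
    fp = int(np.sum(p & ~g))  # false positive pixels
    fn = int(np.sum(~p & g))  # false negative pixels
    denom_d = 2 * tp + fp + fn
    denom_i = tp + fp + fn
    dice = 2 * tp / denom_d if denom_d else 1.0  # convention for empty class
    iou = tp / denom_i if denom_i else 1.0
    return dice, iou
```

mDice and mIoU are then simply the averages of these values across all classes.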
3.4. Implementation of MMY-Net-Based Multimodal Eyelid Tumor Segmentation
MMY-Net segments tumor regions in Dataset 1 and Dataset 3. To this end, the input image size was adjusted to 224 × 224. Network training used the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001 and batch sizes of 1 and 4, and the network was trained for 1000 epochs.
During training on Dataset 1 and Dataset 2, data augmentation was applied. From eight methods—random rotation, vertical flip, horizontal flip, random resize, random color enhancement, random elastic transform, Gaussian noise, and image blur—one to four were randomly selected to augment each image.
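The random selection of one to four of the eight augmentation methods can be sketched as follows; the method names are taken from the list above, and the sampling helper is ours:

```python
import random

AUGMENTATIONS = [
    "random rotation", "vertical flip", "horizontal flip", "random resize",
    "random color enhancement", "random elastic transform",
    "gaussian noise", "image blur",
]

def sample_augmentations(rng=random):
    """Randomly pick one to four distinct augmentations per image."""
    return rng.sample(AUGMENTATIONS, rng.randint(1, 4))
```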
After using BERT to extract feature vectors from each metadata item, each item corresponds to a 1 × 768-dimensional vector. Since each image corresponds to two metadata items (gender and age, or malignancy and differentiation degree), the final text vector for each image is 2 × 768-dimensional. For easier processing, 16 zeros are appended to the end of each metadata vector, resulting in a 2 × 784-dimensional vector, which is then reshaped into a 2 × 28 × 28 text vector. In the TPB, both deconvolutions use 3 × 3 kernels, stride 2, padding 1, and output padding 1. The first deconvolution uses 16 kernels, outputting a 16 × 56 × 56 feature map. The second uses 32 kernels, outputting a 32 × 112 × 112 feature map. After a same convolution with 64 kernels, the feature map size becomes 64 × 112 × 112, matching the image feature map size. The model parameters were optimized using the Adam optimizer, which provides adaptive learning rate adjustment for efficient convergence. The training process adopts widely used optimization strategies reported in previous deep learning studies [38,39].
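The TPB tensor shapes described above can be verified with a short PyTorch sketch; the module name and composition are ours, and only the padding, reshape, and upsampling path is reproduced:

```python
import torch
import torch.nn as nn

class TPBSketch(nn.Module):
    """Sketch of the TPB upsampling path (our naming, shapes per the text).

    2 x 768 metadata vectors are zero-padded to 2 x 784, reshaped to
    2 x 28 x 28, then upsampled to match the 64 x 112 x 112 image features.
    """
    def __init__(self):
        super().__init__()
        self.deconv1 = nn.ConvTranspose2d(2, 16, 3, stride=2, padding=1, output_padding=1)
        self.deconv2 = nn.ConvTranspose2d(16, 32, 3, stride=2, padding=1, output_padding=1)
        self.conv = nn.Conv2d(32, 64, 3, padding=1)  # "same" convolution

    def forward(self, text_vec):                      # (B, 2, 768)
        x = nn.functional.pad(text_vec, (0, 16))      # (B, 2, 784)
        x = x.reshape(-1, 2, 28, 28)                  # (B, 2, 28, 28)
        x = self.deconv1(x)                           # (B, 16, 56, 56)
        x = self.deconv2(x)                           # (B, 32, 112, 112)
        return self.conv(x)                           # (B, 64, 112, 112)
```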
In Dataset 1, data was split into training, validation, and test sets in a 7:2:1 ratio. Dataset 2 was used entirely as an independent test set. In Dataset 3, 80% of the official 85 training images were used for training, 20% for validation, and the official 80 test images for testing.