Sensory Representation of Neural Networks Using Sound and Color for Medical Imaging Segmentation

Irenel Lopo Da Silva; Nicolas Francisco Lori; José Manuel Ferreira Machado

doi:10.3390/jimaging11120449

,

and

¹

IT Department, Computer Engineering, School of Engineering, University of Minho, 4704-553 Braga, Portugal

²

Algoritmi/LASI, University of Minho, 4704-553 Braga, Portugal

³

Faculty of Sciences and Technology, University of Azores, 9500-321 Ponta Delgada, Portugal

^*

Author to whom correspondence should be addressed.

J. Imaging2025, 11(12), 449;https://doi.org/10.3390/jimaging11120449

This article belongs to the Section Medical Imaging

Version Notes

Order Reprints

Review Reports

Abstract

This paper introduces a novel framework for sensory representation of brain imaging data, combining deep learning-based segmentation with multimodal visual and auditory outputs. Structural magnetic resonance imaging (MRI) predictions are converted into color-coded maps and stereophonic/MIDI sonifications, enabling intuitive interpretation of cortical activation patterns. High-precision U-Net models efficiently generate these outputs, supporting clinical decision-making, cognitive research, and creative applications. Spatial, intensity, and anomalous features are encoded into perceivable visual and auditory cues, facilitating early detection and introducing the concept of “auditory biomarkers” for potential pathological identification. Despite current limitations, including dataset size, absence of clinical validation, and heuristic-based sonification, the pipeline demonstrates technical feasibility and robustness. Future work will focus on clinical user studies, the application of functional MRI (fMRI) time-series for dynamic sonification, and the integration of real-time emotional feedback in cinematic contexts. This multisensory approach offers a promising avenue for enhancing the interpretability of complex neuroimaging data across medical, research, and artistic domains.

Keywords:

sparse graph neural networks; image segmentation; deep learning; fMRI; sensory representation

1. Introduction

Advances in medical image segmentation have substantially improved the accuracy of identifying brain structures and delineating pathological regions. Despite these technical improvements, interpreting segmentation outputs remains challenging, especially for clinicians and researchers without specialized expertise in artificial intelligence (AI) or computational visualization. Traditional visualization techniques—such as grayscale images or heatmaps—often fail to convey the multidimensional nature of neuroimaging data intuitively. This limits both accessibility and practical usability, preventing the full exploitation of AI-based neuroimaging tools in clinical and interdisciplinary applications.

Effective interpretation of complex neuroimaging data requires not only technical precision but also cognitive accessibility. Static visualizations may impose a high cognitive load, making it difficult for non-specialist users to understand brain activation patterns or morphological variations. Multimodal visualization approaches, integrating multiple perceptual channels such as color and sound, provide an opportunity to translate abstract computational results into perceptually intuitive representations. This approach aligns with human sensory integration, where the brain fuses visual, auditory, and somatosensory inputs to form coherent, context-dependent representations [1]. However, there remains a practical and scientific gap in developing frameworks that enable intuitive, multimodal interpretability of AI-based segmentation outputs, particularly in neuroimaging.

Several empirical studies confirm that non-expert users, including clinicians without radiological specialization, experience significant difficulty interpreting complex imaging outputs or AI-based heatmaps [2,3,4]. These challenges often stem from the lack of perceptual cues and cognitive overload associated with multidimensional medical data [2]. Moreover, recent research highlights that multimodal cues, such as color-coded visualizations and auditory sonification, can improve pattern recognition, enhance interpretability, and reduce decision-making time for non-technical audiences [5,6]. These findings reinforce the necessity for perceptually grounded, accessible frameworks that enable intuitive interaction with medical images.

To address this limitation, we introduce a novel sensory representation framework that translates segmentation predictions into auditory and visual modalities. The primary objective is to enhance interpretability, accessibility, and engagement with medical imaging data, allowing users from diverse backgrounds including clinicians, neuroscientists, and creative professionals to interact with complex brain representations in a perceptually intuitive manner. This framework builds on recent advances in multimodal neuroimaging fusion, which show that integrating multiple sensory modalities can reveal complementary aspects of neural dynamics and facilitate human–AI interpretability [7,8,9].

From a neuroscientific perspective, human perception relies heavily on prior experience and multisensory integration to interpret complex stimuli. Auditory and visual information is dynamically combined within temporal and parietal cortical regions, supporting cognition, language, emotion, and memory [10,11,12]. Functional neuroimaging studies using fMRI reveal how distributed cortical networks synchronize to represent sensory and affective experiences [9]. These paradigms extend beyond controlled laboratory conditions to naturalistic stimuli—such as films or music—enabling the investigation of brain function in ecologically valid contexts [13,14,15]. Such multimodal and dynamic scenarios highlight the importance of perceptual integration in understanding neural activity.

In computational neuroscience, sensory representation translates abstract neural activations into interpretable perceptual forms. Models leverage top-down modulation mechanisms to enhance categorization and perceptual fidelity [16,17]. Features extracted from neural networks—such as intensity, spatial extent, and anomalous activity—can be encoded into perceivable cues, including color-coded visuals and auditory signals, reflecting meaningful brain patterns. This facilitates early detection of abnormalities and intuitive comprehension of complex neurobiological structures [18,19,20].

Accordingly, the present work introduces a multimodal sensory framework that integrates deep learning-based segmentation with perceptual translation through sound and color. By bridging machine perception with human intuition, it contributes to the emerging field of interpretable AI in medical imaging.

This work makes the following key contributions:

1.: We introduce a unified framework that converts deep learning-based segmentation outputs into perceptual sound (MIDI/sonification) and visual (color-coded) modalities, supporting sensory-based data interpretation.
2.: We demonstrate how multimodal mappings enhance interpretability and accessibility in neuroimaging, fostering cross-disciplinary understanding among clinicians, neuroscientists, and creative professionals.
3.: We propose the concept of “auditory biomarkers”, a perceptual mechanism for potential pathological identification linking sensory cues to early diagnostic features.

2. Related Work

2.1. Sensory Representation Through Color

Color is a fundamental channel for sensory representation in neuroimaging and computational models. The RGB (red, green, blue) model is widely used, with intensities ranging from 0 to 255 for each channel. Human perception integrates these channels to produce a rich spectrum of shades, processed in combination within the primary visual cortex. This principle is exploited in image recognition, brain–computer interfaces, and multimodal visualization frameworks, enabling the classification of targets with similar shapes but varying colors [21]. Color-based encoding enhances the interpretability of segmentation outputs, particularly for highlighting anomalies or subtle structural variations.

2.2. Sensory Representation Through Sound

Auditory perception combines elementary sound events into coherent percepts. The brain synthesizes auditory stimuli approximately every 200 ms, allowing recognition even in the presence of interference, such as the auditory continuity illusion [22,23]. Subjective experiences of pitch, timbre, and volume depend on both waveform characteristics and individual listener factors, including personality and prior auditory experience [24]. Sonification techniques in neuroimaging leverage these perceptual mechanisms to translate neural activity into intuitive auditory representations, supporting more immediate and accessible interpretation of complex data.

2.3. Magnetic Resonance Imaging in Brain Neuroscience

Functional and structural MRI are pivotal tools for understanding brain anatomy and activity. fMRI captures neural dynamics by detecting subtle changes in blood flow, enabling the study of functional networks across subcortical and cortical regions [7,25]. Brain responses vary depending on stimulus type; for example, naturalistic stimuli such as films simultaneously engage visual and emotion-processing regions [14]. High-resolution neuroimaging supports early detection of pathologies, including brain tumors, and enhances classification accuracy through computer-aided diagnostic tools [26].

2.4. Magnetic Resonance Imaging in Cinematic Paradigms

Films and other naturalistic stimuli are increasingly used in neuroscience to study brain function and emotional processing [13,27,28,29]. fMRI studies using film clips reveal dynamic patterns of functional connectivity, highlighting temporal variability in emotional responses. These paradigms provide ecologically valid insights into cognition and affect, contributing to mental health applications and audience emotion modulation [13,14,15]. They also support multimodal approaches by linking visual and auditory representations to brain activity in real-world scenarios, aligning with the concept of sensory-based interpretation of neuroimaging data.

Previous research has explored various approaches to medical image sonification and multimodal representation. Roginska et al. [30] introduced immersive auditory display methods for neuroimaging visualization, while Cadiz et al. [31] proposed statistical sonification based on image intensity descriptors. Chiroiu et al. [32] demonstrated that tonal sonification enhances the detection of subtle image details, and Schütz et al. [6] combined visual and auditory feedback in multimodal interaction frameworks.

However, none of these studies integrated deep learning–based segmentation with perceptually optimized auditory and visual encoding designed for both clinical and non-expert users.

As summarized in Table 1, our approach differs from prior work by integrating deep learning segmentation and perceptual sonification into a unified pipeline, enabling interpretable and multimodal neural activity representation.

Table 1. Comparison of multimodal medical image representation methods.

3. Materials and Methods

This study proposes a multimodal sensory pipeline for brain tumor analysis that extends beyond conventional segmentation methods. By leveraging outputs from deep learning segmentation models, our pipeline generates interpretable visual, auditory, and musical representations of brain MRI scans, facilitating anomaly detection and pattern recognition in clinical contexts.

3.1. Dataset

We used the Low-Grade Glioma (LGG) Segmentation Dataset, containing MRI scans with manually annotated masks highlighting abnormalities in the FLAIR sequence. The dataset includes 110 patients from Cancer Genome Atlas (TCGA) LGG collection, available via Cancer Imaging Archive (TCIA) [34,35].

For implementation, we accessed the curated Kaggle version (https://www.kaggle.com/datasets/mateuszbuda/lgg-mri-segmentation (accessed on 3 March 2024) reorganized into patient-wise folders pairing MRI slices and segmentation masks. Image paths were stored in arrays, and masks were filtered via the _mask suffix using glob and pandas.

3.2. Patient-Wise Splitting and Preprocessing

To avoid slice-level data leakage, patient IDs were randomly assigned into training (90%), validation (5%), and test (5%) sets using fixed seeds NumPy = 2.0.2, TensorFlow = 2.19.0, Python = 3.12.12.

The preprocessing pipeline included the following:

1.: listing and loading: paired images and masks indexed by patient ID.
2.: resampling and resizing: isotropic resolution and cropping/padding to $256 \times 256$ pixels.
3.: normalization: intensity scaled to $[0, 1]$ .
4.: data augmentation (train only): random flips (p = 0.5), rotations ( $\pm 7^{\circ}$ ), elastic deformations, Gaussian noise, intensity shifts ( $\pm 10 %$ ), and motion blur with random linear kernels (size 3–7 pixels, directions: horizontal, vertical, diagonal) to simulate patient motion artifacts. Validation and test sets were unaugmented.

3.3. Proposed Model Architectures

This section provides a clear and technically detailed description of the four neural architectures evaluated in this study: U-Net, DeepLab, Graph Attention Network (GAT), and Sparse Graph Neural Network (SGNN). The aim is to balance scientific rigor with an accessible narrative while offering explicit component-level details that correspond directly to the block diagrams presented. Each model is described following the same structure: (i) input representation, (ii) core architectural components, (iii) decoding or reconstruction strategy, and (iv) segmentation output.

Overall, the four architectures share a common preprocessing pipeline (normalization and skull-stripping) and generate a binary segmentation mask, which is subsequently transformed into a visual color map and an auditory sonification output (Figure 1, Figure 2, Figure 3 and Figure 4).

Figure 1. Pipeline overview: Each model (U-Net, DeepLab, GAT, SGNN) is run individually, producing a segmentation mask that is converted into a color map and sent to the sonification module.

Figure 2. Diagrams of the U-Net, DeepLab, GAT, and SGNN architectures, scaled to fit within the MDPI page layout limits. (a) U-Net: encoder–decoder structure with symmetric skip connections. (b) DeepLab: Atrous Spatial Pyramid Pooling (ASPP) module. (c) GAT: message passing with learned attention coefficients. (d) SGNN: patch-based graph construction followed by a spectral GNN leading to the final segmentation output.

Figure 3. Sonification workflow: extraction of mask-derived features, auditory mapping, and synthesis.

Figure 4. Color map generation pipeline: normalization → colormap function → RGB visualization.

3.3.1. U-Net

U-Net serves as the convolutional baseline, chosen for its robustness in biomedical segmentation. The network adopts a symmetric encoder–decoder structure, allowing the model to integrate global context while preserving spatial resolution through skip connections.

Encoder: Four downsampling stages, each consisting of two

3 \times 3

convolutions (BatchNorm + ReLU), followed by a

2 \times 2

max-pooling layer. Bottleneck: Two convolutions with increased channel depth to capture high-level features. Decoder: Four upsampling stages using transposed convolutions, each concatenated with the corresponding encoder feature maps; every stage includes two convolutions (BN + ReLU). Output Layer: A final

1 \times 1

convolution with sigmoid activation produces the binary mask. This architecture corresponds to the block diagram in Figure 2a.

3.3.2. DeepLab

The modified DeepLab incorporates multi-scale feature extraction and an attention-augmented encoder to strengthen discrimination of tumor boundaries.

Residual Encoder: Four residual blocks, each composed of two

3 \times 3

convolutions (BN + ReLU) and a skip connection. A global channel-attention mechanism (global average pooling + dense layers) modulates feature importance. ASPP Module: Atrous Spatial Pyramid Pooling with dilation rates

{6, 12, 18}

extracts contextual information at multiple scales. Decoder: Transposed convolutions progressively restore spatial resolution. Output Layer: A sigmoid-activated

1 \times 1

convolution. The corresponding diagram is shown in Figure 2b.

3.3.3. Graph Attention Network (GAT)

The GAT-based architecture reformulates segmentation as a message-passing process over a graph derived from intermediate feature maps, capturing spatial relationships not easily learned by convolutions.

Graph Construction: Feature maps are reshaped into node sequences, with edges defined via cosine similarity between node features, creating a graph where nodes correspond to spatial locations and edges encode feature similarity.

Attention Layer: Multi-head attention (eight heads) computes attention coefficients

α_{i j}

that quantify the influence of node j on node i,

α_{i j} = {softmax}_{j} (LeakyReLU (a^{⊤} [W h_{i} ∥ W h_{j}])),

(1)

with the following:

$h_{i}, h_{j} \in R^{F}$ : node feature vectors.
$W \in R^{F^{'} \times F}$ : learnable linear transformation.
$[\cdot ∥ \cdot]$ : concatenation.
$a \in R^{2 F^{'}}$ : learnable vector for attention scoring.
$LeakyReLU (\cdot)$ : nonlinear activation.
${softmax}_{j} (\cdot)$ : normalizes over neighbors $\sum_{j} α_{i j} = 1$ .

For multi-head attention with K heads, updated node features are

h_{i}^{'} = ∥_{k = 1}^{K} σ (\sum_{j \in N (i)} α_{i j}^{(k)} W^{(k)} h_{j}),

(2)

where

N (i)

is the set of neighbors of i,

α_{i j}^{(k)}

and

W^{(k)}

are the head-specific attention coefficients and weights, and

σ (\cdot)

is a nonlinear activation. The outputs of all heads are concatenated (‖) to form the updated node embedding.

Decoder: Node embeddings are reshaped into spatial grids and refined via transposed convolutions to reconstruct the segmentation map. Figure 2c illustrates the conceptual graph structure and attention flow.

3.3.4. Sparse Graph Neural Network (SGNN)

The SGNN extends the GAT design by enforcing sparsity in edge connections, making computation more efficient while retaining key relational information.

Sparse Graph Definition: For each node, only the top-k most similar neighbors (cosine similarity) are preserved. Attention Mechanism: The same multi-head attention formulation as the GAT, applied on a sparsified adjacency matrix.

Graph Decoder: Embeddings are projected into spatial tensor form and decoded via convolutional upsampling. The architecture is depicted in Figure 2d.

Overall, the four architectures represent complementary modeling strategies: convolutional (U-Net), multi-scale contextual (DeepLab), relational (GAT), and sparse-relational (SGNN). Their structural differences are made explicit both in this section and in the block diagrams.

3.4. Comparison with Existing Methods

Conventional segmentation methods such as U-Net and DeepLab provide accurate visual outputs but lack multimodal interpretability. Our approach introduces the following two key innovations:

Graph-based spatial modeling: GAT and SGNN architectures capture non-local relationships between brain regions, improving boundary reconstruction and anomaly localization.
Multimodal sensory representation: model predictions are converted into auditory (stereo audio) and musical (MIDI) outputs, encoding intensity, texture, and anomaly information for intuitive interpretation.

This approach enables not only accurate segmentation but also synesthetic perception of pathological patterns, which is absent in traditional pipelines.

3.5. Advanced Brain Sonification Generation

The key steps of the multimodal representation pipeline are as follows:

Z-score and morphological analysis to identify anomalies relative to normal brain regions.
create _advanced_brain_sonification generates stereo audio incorporating melody, harmonics, rhythm, and dissonance to reflect tumor morphology and anomalies.
create_advanced_midi_from_brain converts image features into symbolic MIDI for pitch, duration, and rhythm, supporting interactive auditory exploration.

Figure 5 and Figure 6 illustrate the full pipeline, integrating segmentation, graph-based modeling, and auditory/musical outputs.

Figure 5. Overview of the multimodal sensory pipeline for brain MRI segmentation. The pipeline integrates deep learning-based segmentation, anomaly detection, and visual (color-coded maps) and auditory (stereo audio and MIDI) representations.

Figure 6. Revised sensory processing pipeline showing the integration of segmentation, graph-based modeling, and multimodal outputs is presented in Appendix A. Each stage transforms MRI data into interpretable visual and auditory representations.

3.6. Training Protocol and Hyperparameters

All models were trained under identical configurations for fair comparison, as follows:

Batch size: 32; learning rate: $1 \times 10^{- 4}$ (Adam optimizer).
Input: $256 \times 256 \times 3$ ; epochs: maximum 150 (50 per session).
Loss: dice coefficient (primary) and Jaccard distance (auxiliary).
Scheduler: ReduceLROnPlateau (factor 0.2, patience 5). Early stopping: patience 20.
Checkpointing: best validation Dice saved; periodic checkpoint every 5 epochs.

3.7. Evaluation Metrics and Runtime Analysis

Segmentation performance was assessed using the Dice coefficient, Intersection over Union (IoU), and Jaccard distance (smooth factor = 100). Computational metrics included inference time, GPU/CPU usage, VRAM, FLOPs, and trainable parameters. Experiments were conducted on GPU-enabled Google Colab. Summary results are shown in Table 2 and Table 3.

Table 2. Computational complexity of segmentation models. GFLOPs = giga floating-point operations; Parameters in millions.

Table 3. Inference performance comparison on CPU and GPU. FPS = frames per second; Acceleration factor = CPU time/GPU time.

4. Results

We present a multisensory support tool for brain anomaly segmentation. Performance was evaluated using Dice, IoU, and binary accuracy. Binary accuracy is less informative due to class imbalance; Dice and IoU better reflect segmentation quality.

4.1. Comparative Analysis of Architectures

U-Net achieved the highest performance (Dice = 93.25%, IoU = 87.43%, Figure 7) as reported in Table 4, due to skip connections that preserve multi-scale spatial features, enabling precise tumor boundaries. IoU is lower than Dice because it penalizes small boundary errors more heavily: e.g., 93 overlapping pixels with 5 extra false positives and 5 missed pixels yield Dice 93% but IoU 87%. DeepLab captures global context but loses fine details, explaining its slightly lower Dice/IoU.

Figure 7. Comparative performance of segmentation models on the test set, showing Dice coefficient, IoU, and binary accuracy. U-Net achieves the highest spatial coherence for tumor delineation.

Table 4. Segmentation model performance on the test set.

SGNN and GAT underperform due to sparse graph connectivity, diluted attention signals, and reduced ability to capture fine structures (Figure 8). High binary accuracy (>98%) across all models reflects class imbalance, not segmentation quality.

Figure 8. Conceptual comparison of CNN versus GAT/SGNN for dense image segmentation. CNN preserves local neighborhoods leading to accurate and spatially coherent segmentation, whereas GAT/SGNN may lose fine spatial information due to sparse edges and attention mechanisms.

4.2. Benchmarking with Recent Literature

Our U-Net slightly outperforms prior studies, confirming robustness for LGG segmentation. Differences in Dice/IoU are due to dataset variations, preprocessing, and evaluation methods. Figure 7 and Figure 8 visually reinforce that U-Net provides spatially coherent segmentation, while graph-based methods struggle with dense images due to local information loss and attention dilution as reported in Table 5.

Table 5. Benchmarking segmentation performance from the recent literature.

4.3. GAT and SGNN Performance Degradation

GAT and SGNN underperform in dense image segmentation due to the following limitations in capturing local spatial structures as reported in Table 6:

Table 6. Factors affecting GAT/SGNN performance vs. CNNs.

Sparse connectivity leads to loss of critical local information.
Cosine-similarity edges may introduce noisy connections.
Multi-head attention can dilute informative signals.
Dropout of nodes or edges disrupts information flow.
Graph abstraction reduces fine-structure learning capability.

4.4. Advanced Sonification and Multimodal Representation

Neuroimaging features were translated into coherent auditory structures to enhance interpretability. Using region_to_musical_scale mapping, high-intensity regions corresponded to major scales, low-intensity to minor scales, large areas to pentatonic scales, and small/complex regions to blues scales. The anomaly_to_dissonance function encoded pathological areas with tritones (high-intensity anomalies), minor sevenths (moderate), and amplitude modulation to reflect spatial localization and anomaly severity. The integrated visual–auditory representations are summarized in Table 7 and Table 8. and illustrated in Figure 9, Figure 10, Figure 11, Figure 12 and Figure 13.

Table 7. Comparison of visual versus auditory representation for brain image sonification. Panel A = visual segmentation; Panel B = spectrogram; Panel C = MIDI score.

Table 8. Estimated clinical endpoints with multimodal representation (color + sound). Interpretation time in seconds; confidence rated 1–5; NASA-TLX = subjective workload (0–100).

Figure 9. Advanced segmentation output highlighting abnormal regions (tumor) in brain MRI. Color intensity corresponds to predicted probability of abnormal tissue.

Figure 10. Advanced segmentation output for healthy brain regions. Color-coded mask reflects predicted normal tissue, supporting contrast with anomalous regions.

Figure 11. Visual representation of segmented tumor regions. Provides spatial context for abnormal areas detected in the MRI.

Figure 12. Auditory representation (spectrogram) of brain segmentation. Frequency and amplitude variations correspond to regional intensity and morphological patterns, enabling perceptual exploration of anomalies.

Figure 13. Auditory representation (PianoRoll) of brain segmentation. Pitch and Velocity variations correspond to regional intensity and morphological patterns, enabling perceptual exploration of anomalies.

The dataset including WAV/MIDI files, JSON metadata, and processing documentation has been made publicly available via Zenodo [39], DOI: https://doi.org/10.5281/zenodo.17116504.

4.5. User Accuracy and Learnability

U-Net’s superior Dice and IoU translate to clearer, spatially coherent segmentations, expected to improve radiologist accuracy. Sonification can reinforce ambiguous boundaries, potentially reducing false negatives/positives. Users with prior medical imaging experience are expected to learn the auditory cues rapidly; novice users may require longer exposure. Formal usability testing is planned for future work.

4.6. Discussion of Results

CNN-based models (U-Net, DeepLab) are preferred for dense segmentation tasks due to high spatial precision. Graph-based models (GAT, SGNN) may be suitable for low-latency or adaptive applications but underperform on fine spatial details. Multimodal outputs (color + sonification) enhance interpretability, potentially improving clinical decision-making.

4.7. Clinical and Translational Implications

Multimodal feedback improves interpretability, diagnostic accuracy, and efficiency. Applications extend beyond clinical settings to immersive audiovisual media and brain–computer interface systems.

Table 9 highlights trade-offs between segmentation accuracy, inference speed, and projected user-centered endpoints. Multimodal feedback is expected to increase diagnostic accuracy, reduce interpretation time, enhance confidence, and decrease cognitive workload (NASA-TLX).

Table 9. Algorithmic performance vs. estimated clinical endpoints.

5. Discussion

This study demonstrates the potential of translating cortical activation patterns into interpretable multimodal sensory representations, with applications spanning clinical diagnostics, cognitive neuroscience, and immersive audiovisual experiences. In clinical contexts, combining visual and auditory representations enhances interpretability, supports early detection of abnormalities, and may reduce cognitive fatigue during neuroimaging analysis.

CNN-based architectures, particularly U-Net, proved effective in generating high-fidelity sensory representations due to their ability to preserve spatial coherence across brain regions. Graph-based models, such as SGNN, offer low-latency advantages suitable for real-time or immersive audiovisual applications, albeit at some cost in spatial precision.

5.1. Clinical Rationale for Auditory Biomarkers

The concept of auditory biomarkers leverages human auditory perception to complement traditional visual inspection of neuroimaging data. High-intensity or anomalous regions in MRI scans can be mapped to distinctive sound patterns (e.g., dissonance, pitch modulation), enabling clinicians to detect subtle pathological changes that may be less apparent visually. This multisensory integration is grounded in cognitive neuroscience principles, where combining auditory and visual information facilitates pattern recognition, memory encoding, and anomaly detection [1,10]. Preliminary user feedback indicates that auditory cues can reinforce ambiguous boundaries and improve recognition speed, though formal usability studies are required to validate these observations in clinical practice.

5.2. Translational and Interdisciplinary Implications

Analogous to signal transformation in other high-dimensional systems—such as superconducting qubits encoding quantum states into phonons [40]—multimodal representations convert complex neural patterns into perceptually intuitive forms. This approach not only benefits clinical diagnostics but also enables applications in neurofeedback, education, and creative industries, including film and interactive media.

5.3. Deployment Considerations and Limitations

For real-world deployment, adherence to regulatory and ethical frameworks (e.g., EU Artificial Intelligence Act, 2024) is critical. Trustworthy sensory AI pipelines require interpretability, transparency, and human oversight to ensure safety and clinical reliability.

Current limitations include the following:

Analysis limited to 2D MRI slices; volumetric 3D data may enhance both segmentation and sonification fidelity.
GAT and SGNN architectures underperformed in dense segmentation due to sparse connectivity and attention dilution.
User-centered metrics reported here are projected estimates; formal clinical validation is needed.
Preliminary usability observations suggest rapid learnability for clinicians with prior imaging experience, but structured user studies are necessary.

5.4. Future Directions

Future work should explore the following:

Extension to 3D and multimodal neuroimaging for richer sensory representations.
Real-time adaptive sonification integrated with AI-assisted diagnostic tools and neurofeedback systems.
Formal user studies to quantify learning curves, interpretability gains, and diagnostic impact.
Optimization of graph-based models to improve spatial fidelity while maintaining low-latency performance.
Broader applications in educational and creative contexts, enabling cross-disciplinary engagement with neuroimaging data.

Overall, the proposed multimodal sensory framework demonstrates that integrating auditory and visual channels can enhance the interpretability and clinical relevance of AI-based neuroimaging analyses while providing avenues for interdisciplinary applications.

6. Conclusions

This study presents a validated pipeline for multimodal sensory representation of brain imaging data, integrating deep learning-based segmentation with complementary visual and auditory outputs. Structural MRI predictions were transformed into color-coded visual maps and stereophonic/MIDI sonifications, enabling intuitive, synesthetic interpretation of neural activity. Among evaluated architectures, the U-Net model achieved high-precision segmentation, supporting efficient generation of bimodal outputs.

The framework establishes a proof of concept linking computational image analysis with perceptual interpretation and introduces the notion of auditory biomarkers, which may facilitate early detection of pathological patterns through sound.

Key limitations include the modest dataset size, lack of formal clinical end-user validation, and heuristic sonification rules that do not yet incorporate systematic emotional or cognitive mappings. Future work should address these gaps as follows:

Conducting clinical validation with radiologists to quantify diagnostic accuracy, confidence, and interpretation speed.
Extending the approach to dynamic fMRI data, enabling time-resolved audiovisual representations of cognitive and emotional processes.
Integrating real-time feedback for immersive applications in film, neurofeedback, and human–machine interaction.
Optimizing sonification algorithms to incorporate perceptually grounded mappings, enhancing both clinical interpretability and educational utility.

Overall, this multisensory framework demonstrates the feasibility of translating neural activity into perceivable visual and auditory formats, providing a foundation for clinical, research, and artistic applications while promoting more intuitive understanding of brain function.

Author Contributions

Conceptualization, I.L.D.S., N.F.L., and J.M.F.M.; Methodology, I.L.D.S., N.F.L., and J.M.F.M.; Software, I.L.D.S.; Validation, N.F.L., and J.M.F.M.; Formal analysis, I.L.D.S., N.F.L., and J.M.F.M.; Investigation, I.L.D.S., and N.F.L.; Resources, I.L.D.S., N.F.L., and J.M.F.M.; Data curation, I.L.D.S.; Writing—original draft preparation, I.L.D.S., N.F.L., and J.M.F.M.; Writing—review and editing, I.L.D.S., N.F.L., and J.M.F.M.; Visualization, I.L.D.S., N.F.L., and J.M.F.M.; Supervision, N.F.L., and J.M.F.M.; Project administration, N.F.L., and J.M.F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset supporting the results of this study, including WAV/MIDI files, JSON metadata, and processing documentation, has been made publicly available via Zenodo [39], DOI: https://doi.org/10.5281/zenodo.17116504 (accessed on 5 October 2025).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Attribution of Icons Used in the Figures

Table A1. Attribution of icons used in the figures. All icons were obtained from Flaticon.com under a free license with attribution, accessed on 2 November 2025.

No.	Icon/Concept	Author/Designer	Direct Link	Access Date	License
1	Brain	Freepik	https://www.flaticon.com/free-icon/brain_6969711	2 November 2025	Free with attribution
2	Data Load	Ida Desi Mariana	https://www.flaticon.com/free-icon/loading_12663288	2 November 2025	Free with attribution
3	Notebook	Good Ware	https://www.flaticon.com/free-icon/diary_4533668	2 November 2025	Free with attribution
4	Python	Freepik	https://www.flaticon.com/free-icon/python_1822899	2 November 2025	Free with attribution
5	RMI (Music/Stimulus)	IconsMind	https://www.flaticon.com/free-icon/music_16735082	2 November 2025	Free with attribution
6	Pre-processing	Uniconlabs	https://www.flaticon.com/free-icon/data-processing_14247619	2 November 2025	Free with attribution
7	Neural Network	IconPro86	https://www.flaticon.com/free-icon/neural-network_10321502	2 November 2025	Free with attribution
8	Neural Network (alt.)	Pojok d	https://www.flaticon.com/free-icon/neural-network_13320564	2 November 2025	Free with attribution
9	Loss Function	Paul J.	https://www.flaticon.com/free-icon/warning_11371561	2 November 2025	Free with attribution
10	Optimization	Syafii5758	https://www.flaticon.com/free-icon/optimization_12001646	2 November 2025	Free with attribution
11	Predict Model	Freepik	https://www.flaticon.com/free-icon/data-model_12538663	2 November 2025	Free with attribution
12	Sound Save	Boris Faria	https://www.flaticon.com/free-icon/arrow_10923526	2 November 2025	Free with attribution
13	Sound Generation	Fach	https://www.flaticon.com/free-icon/ai-technology_17030840	2 November 2025	Free with attribution
14	Sound Effects	Stockes Design	https://www.flaticon.com/free-icon/sound-setting_4300505	2 November 2025	Free with attribution
15	Sound Combinations	Eucalyp	https://www.flaticon.com/free-icon/sound-control_2637807	2 November 2025	Free with attribution
16	Sound Play	Smashicons	https://www.flaticon.com/free-icon/play-button_660559	2 November 2025	Free with attribution
17	HSV Color Wheel	Creator	https://www.flaticon.com/free-icon/color-wheel_11460836	2 November 2025	Free with attribution
18	RGB	Freepik	https://www.flaticon.com/free-icon/rgb_1215827	2 November 2025	Free with attribution
19	Save Image	Pocike	https://www.flaticon.com/free-icon/photo_15473970	2 November 2025	Free with attribution
20	Visual (Display)	Smashicons	https://www.flaticon.com/free-icon/computer-monitor_4617970	2 November 2025	Free with attribution
21	Sound Wave	Freepik	https://www.flaticon.com/free-icon/music-wave_4020749	2 November 2025	Free with attribution

Note: All icons were obtained from Flaticon.com under the free license with attribution. Individual designers are acknowledged as listed above.

References

Mantini, D.; Marzetti, L.; Corbetta, M.; Romani, G.L.; Del Gratta, C. Multimodal integration of fMRI and EEG data for high spatial and temporal resolution analysis of brain networks. Brain Topogr. 2010, 23, 150–158. [Google Scholar] [CrossRef] [PubMed]
Lundervold, A.S.; Lundervold, A. An overview of deep learning in medical imaging focusing on MRI. Z. Med. Phys. 2019, 29, 102–127. [Google Scholar] [CrossRef] [PubMed]
Kwee, T.C.; Kwee, R.M. Diagnostic radiology and its future: What do clinicians need. Eur. Radiol. 2023, 33, 6322–6333. [Google Scholar] [CrossRef]
Lim, S.S.; Phan, T.D.; Law, M.; Goh, G.S.; Moriarty, H.K.; Lukies, M.W.; Joseph, T.; Clements, W. Non-radiologist perception of the use of artificial intelligence (AI) in diagnostic medical imaging reports. J. Med. Imaging Radiat. Oncol. 2022, 66, 1029–1034. [Google Scholar] [CrossRef]
Chen, J.; Ma, J.; Huai, X. Qualitative studies: Designing a multimodal medical visualization tool for helping patients interpret 3D medical images. Front. Physiol. 2025, 16, 1559801. [Google Scholar] [CrossRef]
Schütz, L.; Matinfar, S.; Schafroth, G.; Navab, N.; Fairhurst, M.; Wagner, A.; Wiestler, B.; Eck, U.; Navab, N. A Framework for Multimodal Medical Image Interaction. arXiv 2024, arXiv:2407.07015. [Google Scholar] [CrossRef] [PubMed]
Shoeibi, A.; Khodatars, M.; Jafari, M.; Ghassemi, N.; Alizadehsani, R.; Nahavandi, S.; Gorriz, J.M.; Moradi, M.; Acharya, U.R. Diagnosis of brain diseases in fusion of neuroimaging modalities using deep learning: A review. Inf. Fus. 2023, 93, 3–25. [Google Scholar] [CrossRef]
Xu, M.; Zhang, J.; Li, W.; Chen, Y.; Zhou, H.; Wang, P.; Zhang, L. Deep learning aided neuroimaging and brain regulation. Sensors 2023, 23, 4993. [Google Scholar] [CrossRef]
Hansen, J.Y.; Cauzzo, S.; Singh, K.; García-Gomar, M.G.; Shine, J.M.; Bianciardi, M.; Misic, B. Integrating brainstem and cortical functional architectures. Nat. Neurosci. 2024, 27, 2500–2511. [Google Scholar] [CrossRef]
Richter, D.; Heilbron, M.; de Lange, F.P. Dampened sensory representations for expected input across the ventral visual stream. Oxf. Open Neurosci. 2022, 1, kvac013. [Google Scholar] [CrossRef]
Pérez-Bellido, A.; Barnes, K.A.; Crommett, L.E.; Yau, J.M. Auditory frequency representations in human somatosensory cortex. Cereb. Cortex 2018, 28, 3908–3921. [Google Scholar] [CrossRef]
Eickhoff, S.B.; Milham, M.; Vanderwal, T. Towards clinical applications of movie fMRI. NeuroImage 2020, 220, 116860. [Google Scholar] [CrossRef]
Schaefer, S.; Rotte, M. How movies move us: Movie preferences are linked to differences in brain reactivity. Front. Behav. Neurosci. 2024, 18, 1396811. [Google Scholar] [CrossRef]
Mahrukh, R.; Shakil, S.; Malik, A.S. Sentiment analysis of fMRI using automatically generated stimuli labels under naturalistic paradigm. Sci. Rep. 2023, 13, 7267. [Google Scholar] [CrossRef] [PubMed]
Saarimäki, H. Naturalistic stimuli in affective neuroimaging: A review. Front. Hum. Neurosci. 2021, 15, 675068. [Google Scholar] [CrossRef] [PubMed]
Zhang, N.; Xu, N.-L. Reshaping sensory representations by task-specific brain states: Toward cortical circuit mechanisms. Curr. Opin. Neurobiol. 2022, 77, 102628. [Google Scholar] [CrossRef] [PubMed]
van Hulle, M.M.; Orban, G.A. Entropy driven artificial neuronal networks and sensorial representation: A proposal. J. Parallel Distrib. Comput. 1989, 6, 264–290. [Google Scholar] [CrossRef]
Mazzola, G.; Göller, S. Performance and interpretation. J. New Music Res. 2002, 31, 221–232. [Google Scholar] [CrossRef]
Zifko, U.A.; Slomka, P.J.; Reid, R.H.; Young, G.B.; Remtulla, H.; Bolton, C.F. The cortical representation of somatosensory evoked potentials of the phrenic nerve. J. Neurol. Sci. 1996, 139, 197–202. [Google Scholar] [CrossRef]
Bittar, R.G.; Ptito, A.; Reutens, D.C. Somatosensory representation in patients who have undergone hemispherectomy: A functional magnetic resonance imaging study. J. Neurosurg. 2000, 92, 45–51. [Google Scholar] [CrossRef]
Wu, Y.; Mao, Y.; Feng, K.; Wei, D.; Song, L. Decoding of the neural representation of the visual RGB color model. PeerJ Comput. Sci. 2023, 9, e1376. [Google Scholar] [CrossRef] [PubMed]
Tamakoshi, S.; Minoura, N.; Katayama, J.; Yagi, A. Entire sound representations are time-compressed in sensory memory: Evidence from MMN. Front. Neurosci. 2016, 10, 347. [Google Scholar] [CrossRef] [PubMed]
Riecke, L.; van Opstal, A.J.; Goebel, R.; Formisano, E. Hearing illusory sounds in noise: Sensory-perceptual transformations in primary auditory cortex. J. Neurosci. 2007, 27, 12684–12689. [Google Scholar] [CrossRef]
Melchiorre, A.B.; Schedl, M. Personality correlates of music audio preferences for modelling music listeners. Front. Psychol. 2020, 11, 760. [Google Scholar] [CrossRef]
Khan, P.; Kader, M.F.; Islam, S.M.R.; Rahman, A.B.; Kamal, M.S.; Toha, M.U.; Kwak, K.-S. Machine learning and deep learning approaches for brain disease diagnosis: Principles and recent advances. IEEE Access 2021, 9, 37622–37655. [Google Scholar] [CrossRef]
Samee, N.A.; Mahmoud, N.F.; Atteia, G.; Abdallah, H.A.; Alabdulhafith, M.; Al-Gaashani, M.S.A.M.; Ahmad, S.; Muthanna, M.S.A. Classification Framework for Medical Diagnosis of Brain Tumor with an Effective Hybrid Transfer Learning Model. Diagnostics 2022, 12, 2541. [Google Scholar] [CrossRef]
Dudai, Y. The cinema-cognition dialogue: A match made in brain. Front. Hum. Neurosci. 2012, 6, 248. [Google Scholar] [CrossRef]
Vanderwal, T.; Eilbott, J.; Castellanos, F.X. Movies in the magnet: Naturalistic paradigms in developmental functional neuroimaging. Dev. Cogn. Neurosci. 2019, 36, 100600. [Google Scholar] [CrossRef]
Zhang, Z.; Xu, S.; Li, L.; Zhou, Y.; Lin, D.; Zhang, M.; Zhang, L.; Huang, G.; Liu, X.; Becker, B.; et al. Functional connectivity profiles of the default mode and visual networks reflect temporal accumulative effects of sustained naturalistic emotional experience. NeuroImage 2023, 273, 119941. [Google Scholar] [CrossRef]
Roginska, A.; Mohanraj, H.; Ballora, M.; Friedman, K. Immersive Sonification for Displaying Brain Scan Data. In Proceedings of the International Conference on Health Informatics (HEALTHINF-2013), SCITEPRESS, Philadelphia, PA, USA, 9–11 September 2013; pp. 24–33. Available online: https://www.scitepress.org/Papers/2013/42029/42029.pdf (accessed on 9 November 2025). [CrossRef]
Cadiz, R.F.; de la Cuadra, P.; Montoya, A.; Marín, V.; Andia, M.E.; Tejos, C.; Irarrazaval, P. Sonification of Medical Images Based on Statistical Descriptors. In Proceedings of the International Computer Music Conference (ICMC), Denton, TX, USA, 25 September–1 October 2015; Available online: https://quod.lib.umich.edu/cgi/p/pod/dod-idx/sonification-of-medical-images-based-on-statistical.pdf (accessed on 9 November 2025).
Chiroiu, V.; Tarnita, D.; Popa, D. Using the Sonification for Hardly Detectable Details in Medical Images. Sci. Rep. 2019, 9, 17711. [Google Scholar] [CrossRef]
Matinfar, S.; Dehghani, S.; Salehi, M.; Sommersperger, M.; Faridpooya, K.; Fairhurst, M.; Navab, N. From Tissue to Sound: A New Paradigm for Medical Sonic Interaction Design. Med. Image Anal. 2025, 103, 103571. [Google Scholar] [CrossRef] [PubMed]
Clark, K.; Vendt, B.; Smith, K.; Freymann, J.; Kirby, J.; Koppel, P.; Moore, S.; Phillips, S.; Maffitt, D.; Pringle, M.; et al. The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository. J. Digit. Imaging 2013, 26, 1045–1057. [Google Scholar] [CrossRef] [PubMed]
Zakharova, G.; Efimov, V.; Raevskiy, M.; Rumiantsev, P.; Gudkov, A.; Belogurova-Ovchinnikova, O.; Sorokin, M.; Buzdin, A. Reclassification of TCGA diffuse glioma profiles linked to transcriptomic, epigenetic, genomic and clinical data, according to the 2021 WHO CNS tumor classification. Int. J. Mol. Sci. 2023, 24, 157. [Google Scholar] [CrossRef] [PubMed]
Li, Z.; Zhang, X.; Huang, J. Automatic brain MRI tumour segmentation based on deep fusion of feature maps. Artif. Intell. Rev. 2025, 588, 154. [Google Scholar] [CrossRef]
Avazov, K.; Mirzakhalilov, S.; Umirzakova, S.; Abdusalomov, A.; Cho, Y.I. Dynamic Focus on Tumor Boundaries: A Lightweight U-Net for MRI Brain Tumor Segmentation. Bioengineering 2024, 11, 1302. [Google Scholar] [CrossRef]
Liu, G.; Zhang, H.; Yang, Y. mResU-Net: Multi-scale residual U-Net-based brain tumour segmentation. IEEE J. Biomed. Health Inform. 2023, 27, 3349–3359. [Google Scholar] [CrossRef]
da Silva, I.L.; Lori, N.F.; Machado, J.M.F. Advanced Brain Sonification Documentation (Version 1). Available online: https://doi.org/10.5281/zenodo.17116504 (accessed on 5 October 2025).
Bozkurt, A.B.; Golami, O.; Yu, Y.; Tian, H.; Mirhosseini, M. A mechanical quantum memory for microwave photons. Nat. Phys. 2025, 21, 1469–1474. [Google Scholar] [CrossRef]

Figure 1. Pipeline overview: Each model (U-Net, DeepLab, GAT, SGNN) is run individually, producing a segmentation mask that is converted into a color map and sent to the sonification module.

Figure 2. Diagrams of the U-Net, DeepLab, GAT, and SGNN architectures, scaled to fit within the MDPI page layout limits. (a) U-Net: encoder–decoder structure with symmetric skip connections. (b) DeepLab: Atrous Spatial Pyramid Pooling (ASPP) module. (c) GAT: message passing with learned attention coefficients. (d) SGNN: patch-based graph construction followed by a spectral GNN leading to the final segmentation output.

Figure 3. Sonification workflow: extraction of mask-derived features, auditory mapping, and synthesis.

Figure 4. Color map generation pipeline: normalization → colormap function → RGB visualization.

Figure 5. Overview of the multimodal sensory pipeline for brain MRI segmentation. The pipeline integrates deep learning-based segmentation, anomaly detection, and visual (color-coded maps) and auditory (stereo audio and MIDI) representations.

Figure 6. Revised sensory processing pipeline showing the integration of segmentation, graph-based modeling, and multimodal outputs is presented in Appendix A. Each stage transforms MRI data into interpretable visual and auditory representations.

Figure 7. Comparative performance of segmentation models on the test set, showing Dice coefficient, IoU, and binary accuracy. U-Net achieves the highest spatial coherence for tumor delineation.

Figure 8. Conceptual comparison of CNN versus GAT/SGNN for dense image segmentation. CNN preserves local neighborhoods leading to accurate and spatially coherent segmentation, whereas GAT/SGNN may lose fine spatial information due to sparse edges and attention mechanisms.

Figure 9. Advanced segmentation output highlighting abnormal regions (tumor) in brain MRI. Color intensity corresponds to predicted probability of abnormal tissue.

Figure 10. Advanced segmentation output for healthy brain regions. Color-coded mask reflects predicted normal tissue, supporting contrast with anomalous regions.

Figure 11. Visual representation of segmented tumor regions. Provides spatial context for abnormal areas detected in the MRI.

Figure 12. Auditory representation (spectrogram) of brain segmentation. Frequency and amplitude variations correspond to regional intensity and morphological patterns, enabling perceptual exploration of anomalies.

Figure 13. Auditory representation (PianoRoll) of brain segmentation. Pitch and Velocity variations correspond to regional intensity and morphological patterns, enabling perceptual exploration of anomalies.

Table 1. Comparison of multimodal medical image representation methods.

Study/Year	Segmentation Method	Sonification/Multimodal Method	Target Users	Key Contribution/Novelty
Roginska et al. (2013) [30]	Manual region selection	Immersive 3D sonification (spatial audio rendering)	Neuroscientists/educators	Introduced immersive auditory display for brain scan visualization; exploratory framework, not automated or clinical.
Cadiz et al. (2015) [31]	MRI intensity statistics	Statistical-based sound synthesis (spectral descriptors)	Researchers/designers	Used statistical descriptors from medical images to generate sound; proof-of-concept without deep learning.
Chiroiu et al. (2019) [32]	Manual segmentation/image analysis	Tonal sonification for subtle detail detection	Radiologists	Demonstrated that sonification helps detect faint or small features in CT/MRI; not data-driven.
Schütz et al. (2024) [6]	Deep neural segmentation (prototype)	Interactive multimodal interface (visual + audio)	Clinicians/HCI researchers	Proposed framework for real-time multimodal interaction combining segmentation and auditory feedback; still at prototype level.
Matinfar et al. (2025) [33]	CNN/U-Net variants	Sonic interaction model (tissue-based mapping)	Radiologists/medical designers	Introduced “tissue-to-sound” paradigm linking image modality to perceptual audio cues; formal perceptual evaluation included.
This Work (2025)	U-Net/DeepLab segmentation	Multimodal: perceptual color mapping + MIDI/3D stereophonic sonification	Clinicians/non-experts	Combines high-precision deep learning segmentation with perceptually optimized visual–auditory fusion; introduces auditory biomarkers and synesthetic encoding for intensity, anomaly size, and region confidence.

Table 2. Computational complexity of segmentation models. GFLOPs = giga floating-point operations; Parameters in millions.

Model	GFLOPs	Parameters (M)
U-Net	96.41	31.04
DeepLab	96.48	31.49
GNN	0.39	0.28
GAT	1.35	0.15

Table 3. Inference performance comparison on CPU and GPU. FPS = frames per second; Acceleration factor = CPU time/GPU time.

Model	CPU Time (ms)	GPU Time (ms)	FPS (GPU)	Acceleration Factor (×)
U-Net	2731.99	104.14	9.60	26.23
DeepLab	3019.07	104.99	9.52	28.75
GNN	125.21	64.00	15.63	1.96
GAT	577.79	86.41	11.57	6.69

Table 4. Segmentation model performance on the test set.

Model	Dice (%)	IoU (%)	Binary Accuracy (%)
U-Net	93.25	87.43	99.89
DeepLab	92.38	85.92	99.84
SGNN	76.40	62.31	99.58
GAT	39.97	25.88	98.96

Table 5. Benchmarking segmentation performance from the recent literature.

Study (Year)	Model/Dataset	Dice (%)	IoU (%)
Li et al. (2025) [36]	AS-WEC on LGG MRI	92.96	93.12
Lightweight U-Net (2024) [37]	Modified U-Net + attention	93.00	–
mResU-Net (2023) [38]	mResU-Net on BraTS2021	92.89	–
This work (U-Net)	U-Net on TCGA-LGG	93.25	87.43
This work (DeepLab)	DeepLab variant	92.38	85.92

Table 6. Factors affecting GAT/SGNN performance vs. CNNs.

Factor	GAT/SGNN Effect	CNN Comparison
Sparsity	Loss of local connections	Preserves neighboring pixels
Node degree	Over-smoothing/info loss	Local neighborhoods captured
Graph construction	Noisy edges	Deterministic convolution
Multi-head attention	Signal dilution	Feature aggregation deterministic
Dropout	Critical edges/nodes removed	Minor impact on spatial flow

Table 7. Comparison of visual versus auditory representation for brain image sonification. Panel A = visual segmentation; Panel B = spectrogram; Panel C = MIDI score.

Panel	Content	Objective
A—Segmentation Image	Brain image with highlighted segmented regions and anomalies	Visualize regions of interest and spatial patterns in the neuroimaging data.
B—Spectrogram/Sonification	Spectrogram of the audio (WAV) generated from the brain image	Reveal temporal and frequency patterns derived from the sonification, illustrating auditory representation of brain features.
C—MIDI Score	MIDI score generated from the sonification, showing notes corresponding to brain regions	Translate intensity, area, and texture into musical notation, allowing perceptual and auditory interpretation of neuroimaging data.

Table 8. Estimated clinical endpoints with multimodal representation (color + sound). Interpretation time in seconds; confidence rated 1–5; NASA-TLX = subjective workload (0–100).

Condition	Diagnostic Accuracy (%)	Interpretation Time (s)	Confidence (1–5)	NASA-TLX Score (0–100)
Visual only (baseline)	85.0	12.5	3.4	62
Color + Sound (proposed)	91.5	9.8	4.3	48

Table 9. Algorithmic performance vs. estimated clinical endpoints.

Model	Dice (%)	IoU (%)	Inf. Time (s)	Diagnostic Accuracy (%)	Confidence	Interp. Time (s)	NASA-TLX
U-Net	91.5	85.2	0.12	90.0	4.1	11.2	55
DeepLabV3+	92.5	86.8	0.20	91.0	4.2	10.9	52
SGNN (Sparse)	90.1	84.0	0.40	89.0	3.9	12.0	58
GAT (Graph)	89.5	83.2	0.45	88.5	3.8	12.3	59

Note: All user-centered metrics are projected estimates; formal clinical validation is required in future work.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Sensory Representation of Neural Networks Using Sound and Color for Medical Imaging Segmentation

Abstract

1. Introduction

2. Related Work

2.1. Sensory Representation Through Color

2.2. Sensory Representation Through Sound

2.3. Magnetic Resonance Imaging in Brain Neuroscience

2.4. Magnetic Resonance Imaging in Cinematic Paradigms

3. Materials and Methods

3.1. Dataset

3.2. Patient-Wise Splitting and Preprocessing

3.3. Proposed Model Architectures

3.3.1. U-Net

3.3.2. DeepLab

3.3.3. Graph Attention Network (GAT)

3.3.4. Sparse Graph Neural Network (SGNN)

3.4. Comparison with Existing Methods

3.5. Advanced Brain Sonification Generation

3.6. Training Protocol and Hyperparameters

3.7. Evaluation Metrics and Runtime Analysis

4. Results

4.1. Comparative Analysis of Architectures

4.2. Benchmarking with Recent Literature

4.3. GAT and SGNN Performance Degradation

4.4. Advanced Sonification and Multimodal Representation

4.5. User Accuracy and Learnability

4.6. Discussion of Results

4.7. Clinical and Translational Implications

5. Discussion

5.1. Clinical Rationale for Auditory Biomarkers

5.2. Translational and Interdisciplinary Implications

5.3. Deployment Considerations and Limitations

5.4. Future Directions

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Attribution of Icons Used in the Figures

References

Article Metrics

Citations

Article Access Statistics