Visual-to-Tactile Cross-Modal Generation Using a Class-Conditional GAN with Multi-Scale Discriminator and Hybrid Loss
Abstract
1. Introduction
- A class-conditional GAN architecture is proposed for generating vibrotactile spectrograms from texture images of material surfaces. The generator is adapted from pix2pix [18], while the discriminator follows the multi-scale structure of pix2pixHD [19]. An optimal configuration with three discriminators and three downsampling layers was determined through grid search. Class conditioning is implemented via Conditional Batch Normalization (CBN) [24] at the generator’s bottleneck, based on the material label predicted by a separately trained classifier (a minimal sketch of this conditioning mechanism is given after this list). The effectiveness of this approach is further illustrated using Grad-CAM [26] visualizations.
- A hybrid loss function is used for training, combining L1 loss, Feature Matching (FM) loss, and adversarial loss components to better guide the generator. This hybrid loss improves both the structural accuracy and the perceptual quality of the generated spectrograms.
- An extensive evaluation was conducted on samples from 9 materials in the LMT-108 Surface-Materials dataset [25], including qualitative inspection of real vs. generated spectrograms and quantitative assessment using the LPIPS and FID metrics. The results demonstrate superior performance compared to three baseline models (pix2pix, pix2pixHD, and Residue-Fusion GAN), as well as several ablated variants. Ablation studies focus on the impact of removing the L1 loss, removing the FM loss, and using a single discriminator, highlighting the contribution of each component to the final performance. Each experiment was repeated 10 times to ensure robustness. Two-sided paired t-tests comparing the proposed method to each baseline and ablated variant confirmed the statistical significance of the improvements. Representative failure cases are also discussed to highlight remaining challenges.
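As referenced in the first highlight, the following is a minimal sketch of CBN-based class conditioning at the generator bottleneck. It assumes an embedding-based parameterization (a per-class scale and shift applied to a non-affine BatchNorm); the feature and class counts follow the architecture tables below, but the exact CBN variant (and hence its parameter count) in the actual implementation may differ.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """Embedding-based CBN sketch: the class label selects a per-class
    (gamma, beta) pair that modulates a BatchNorm without its own affine
    parameters."""
    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.embed = nn.Embedding(num_classes, num_features * 2)
        # Start as plain BatchNorm: gamma = 1, beta = 0 for every class.
        self.embed.weight.data[:, :num_features].fill_(1.0)
        self.embed.weight.data[:, num_features:].zero_()

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = self.bn(x)
        gamma, beta = self.embed(y).chunk(2, dim=1)
        c = x.size(1)
        return gamma.view(-1, c, 1, 1) * out + beta.view(-1, c, 1, 1)

# Bottleneck usage: 512-channel features, 9 material classes (per the tables).
cbn = ConditionalBatchNorm2d(num_features=512, num_classes=9)
feat = torch.randn(4, 512, 1, 1)        # generator bottleneck activations
label = torch.randint(0, 9, (4,))       # label predicted by the classifier
out = cbn(feat, label)                  # -> [4, 512, 1, 1]
```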
2. Related Works
3. Proposed Method
3.1. Block Diagram
3.2. Implementation Details
4. Experiments
4.1. Dataset and Experimental Setup
4.2. Discriminator Configuration Selection
4.3. Comparison Analysis
4.4. Impact of Class Conditioning
4.5. Failure Cases
5. Limitations of the Presented Work
- Need for Class-Labeled Data: The method relies on the availability of well-labeled data, which may not be feasible in some practical scenarios. To make the method more broadly applicable, future work could focus on learning from partially labeled or unlabeled data using semi-supervised or domain adaptation techniques.
- Training with Weakly Paired Data: The same texture may correspond to multiple spectrograms within a class, which makes it difficult to reconstruct a specific target spectrogram precisely. While this introduces some imprecision into the generated spectrograms, it is not a significant issue in practice: the vibrotactile sensation for materials within a given class should remain similar, allowing a person to experience a consistent tactile sensation when touching the material, even if the exact spectrogram varies.
- Failure in Fine-Grained Details: In some cases, the generated spectrograms exhibit deviations from the real ones in finer details, such as missing structural features or misplaced high-energy regions. In future work, expanding the architecture to incorporate bidirectional approaches, such as CycleGAN [20] or DiscoGAN [21], could help address these challenges. While these models are primarily designed for unpaired domain translation and do not exploit paired supervision directly, their cycle-consistency constraints can promote structural fidelity and preserve information that may be lost in single-pass mappings. Integrating such properties into a supervised setting, or combining them with more advanced loss functions, could improve the model’s sensitivity to fine-grained features and lead to more accurate spectrograms.
- Limited Dataset Scope: The method has been tested on a dataset with only 9 classes. This limited scope may affect its ability to generalize to other datasets or classes not represented in the current study. Expanding the dataset to include more classes or applying the method to different datasets would help assess the model’s generalization capabilities and identify potential areas for improvement.
- Absence of Time-Domain Evaluation: The current approach focuses on spectrogram generation, and while time-domain vibrotactile signals can be reconstructed using algorithms such as Griffin–Lim [23] (a reconstruction sketch is given after this list), this step is not evaluated in the present study. Because Griffin–Lim may not accurately recover perceptually faithful vibrations, it remains unclear whether the reconstructed signals would elicit a realistic tactile sensation for end users. A dedicated user study would be necessary in future work to assess the perceptual quality and effectiveness of the reconstructed signals in actual haptic applications.
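For completeness, a minimal reconstruction sketch using librosa’s Griffin–Lim implementation follows. The STFT settings and the [-1, 1] log-magnitude normalization with an 80 dB dynamic range are placeholder assumptions, not the paper’s preprocessing; n_fft = 510 is chosen only so that the 256 frequency bins match the 256 × 256 spectrogram images.

```python
import numpy as np
import librosa

# Placeholder STFT settings (assumptions, not the paper's values):
# n_fft = 510 yields 256 frequency bins, matching the 256 x 256 images.
N_FFT, HOP = 510, 128

def spectrogram_to_signal(gen_img: np.ndarray) -> np.ndarray:
    """Reconstruct a time-domain vibrotactile signal from a generated
    spectrogram image via Griffin-Lim phase estimation.

    gen_img: [256, 256] array in [-1, 1], assumed to encode log magnitude
    over an assumed 80 dB dynamic range.
    """
    # Undo the assumed [-1, 1] normalisation back to dB, then to amplitude.
    log_mag = (gen_img + 1.0) / 2.0 * 80.0 - 80.0
    mag = librosa.db_to_amplitude(log_mag)
    # Iteratively estimate a phase consistent with the magnitude spectrogram.
    return librosa.griffinlim(mag, n_iter=64, n_fft=N_FFT, hop_length=HOP)

signal = spectrogram_to_signal(np.random.uniform(-1, 1, (256, 256)))
```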
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| AR | Augmented Reality |
| AvgPool | Average Pooling |
| CBN | Conditional Batch Normalization |
| cGAN | conditional Generative Adversarial Network |
| CM-AVAE | Cross-Modal Adversarial Variational Autoencoder |
| FC | Fully Connected |
| FID | Fréchet Inception Distance |
| FM | Feature Matching |
| GAN | Generative Adversarial Network |
| LPIPS | Learned Perceptual Image Patch Similarity |
| SPADE | SPatially-Adaptive (DE)normalization |
| STFT | Short-Time Fourier Transform |
| VAE | Variational Autoencoder |
| VR | Virtual Reality |
| WGAN-GP | Wasserstein Generative Adversarial Network with Gradient Penalty |
References
- Zhang, D.; Tron, R.; Khurshid, R.P. Haptic feedback improves human-robot agreement and user satisfaction in shared-autonomy teleoperation. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3306–3312. [Google Scholar]
- Gani, A.; Pickering, O.; Ellis, C.; Sabri, O.; Pucher, P. Impact of haptic feedback on surgical training outcomes: A randomised controlled trial of haptic versus non-haptic immersive virtual reality training. Ann. Med. Surg. 2022, 83, 104734. [Google Scholar] [CrossRef]
- Hiemstra, E.; Terveer, E.M.; Chmarra, M.K.; Dankelman, J.; Jansen, F.W. Virtual reality in laparoscopic skills training: Is haptic feedback replaceable? Minim. Invasive Ther. Allied Technol. 2011, 20, 179–184. [Google Scholar] [CrossRef] [PubMed]
- Gayathri, R.; Nam, S. Enhancing User Experience in Virtual Museums: Impact of Finger Vibrotactile Feedback. Appl. Sci. 2024, 14, 6593. [Google Scholar] [CrossRef]
- Li, D.; Xiong, Q.; Zhou, X.; Yeow, R.C.H. A Novel Kinesthetic Haptic Feedback Device Driven by Soft Electrohydraulic Actuators. arXiv 2024, arXiv:2411.18387. [Google Scholar] [CrossRef]
- Li, X.; Liu, H.; Zhou, J.; Sun, F. Learning cross-modal visual-tactile representation using ensembled generative adversarial networks. Cogn. Comput. Syst. 2019, 1, 40–44. [Google Scholar] [CrossRef]
- Li, Y.; Zhu, J.Y.; Tedrake, R.; Torralba, A. Connecting touch and vision via cross-modal prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10609–10618. [Google Scholar]
- Zhong, S.; Albini, A.; Jones, O.P.; Maiolino, P.; Posner, I. Touching a nerf: Leveraging neural radiance fields for tactile sensory data generation. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022; PMLR: Cambridge, MA, USA, 2023; pp. 1618–1628. [Google Scholar]
- Yang, F.; Ma, C.; Zhang, J.; Zhu, J.; Yuan, W.; Owens, A. Touch and go: Learning from human-collected vision and touch. arXiv 2022, arXiv:2211.12498. [Google Scholar] [CrossRef]
- Luo, S.; Yuan, W.; Adelson, E.; Cohn, A.G.; Fuentes, R. Vitac: Feature sharing between vision and tactile sensing for cloth texture recognition. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2722–2727. [Google Scholar]
- Lee, J.T.; Bollegala, D.; Luo, S. “Touching to see” and “seeing to feel”: Robotic cross-modal sensory data generation for visual-tactile perception. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4276–4282. [Google Scholar]
- Chen, J.; Zhou, S. Vision2Touch: Imaging Estimation of Surface Tactile Physical Properties. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar]
- Liu, H.; Guo, D.; Zhang, X.; Zhu, W.; Fang, B.; Sun, F. Toward image-to-tactile cross-modal perception for visually impaired people. IEEE Trans. Autom. Sci. Eng. 2020, 18, 521–529. [Google Scholar] [CrossRef]
- Su, Z.; Huang, B.; Miao, J.; Wang, W.; Lin, X. Configurable Performance-Communication Trade-Off for Quaternion-Based AUVs: A Partitioned Hybrid Event-Triggered Approach. IEEE Trans. Veh. Technol. 2025, early access. [Google Scholar] [CrossRef]
- Huang, B.; Song, Y.; Qin, H.; Miao, J.; Zhu, C. Safety-enhanced formation maneuver control for electric vehicle with edge-weighted topology and reinforcement learning strategy. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 14716–14731. [Google Scholar] [CrossRef]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 2, pp. 2672–2680. [Google Scholar]
- Li, Y.; Zhao, H.; Liu, H.; Lu, S.; Hou, Y. Research on visual-tactile cross-modality based on generative adversarial network. Cogn. Comput. Syst. 2021, 3, 131–141. [Google Scholar]
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8798–8807. [Google Scholar]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
- Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; PMLR: Cambridge, MA, USA, 2017; pp. 1857–1865. [Google Scholar]
- Park, T.; Liu, M.Y.; Wang, T.C.; Zhu, J.Y. Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2337–2346. [Google Scholar]
- Griffin, D.; Lim, J. Signal estimation from modified short-time Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 1984, 32, 236–243. [Google Scholar]
- De Vries, H.; Strub, F.; Mary, J.; Larochelle, H.; Pietquin, O.; Courville, A.C. Modulating early visual processing by language. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6597–6607. [Google Scholar]
- Strese, M.; Schuwerk, C.; Iepure, A.; Steinbach, E. Multimodal feature-based surface material classification. IEEE Trans. Haptics 2016, 10, 226–239. [Google Scholar] [CrossRef]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar] [CrossRef]
- Verma, P.; Chafe, C. A generative model for raw audio using transformer architectures. In Proceedings of the 2021 24th International Conference on Digital Audio Effects (DAFx), Vienna, Austria, 8–10 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 230–237. [Google Scholar]
- Zhu, H.; Luo, M.D.; Wang, R.; Zheng, A.H.; He, R. Deep audio-visual learning: A survey. Int. J. Autom. Comput. 2021, 18, 351–376. [Google Scholar] [CrossRef]
- Sung-Bin, K.; Senocak, A.; Ha, H.; Owens, A.; Oh, T.H. Sound to visual scene generation by audio-to-visual latent alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6430–6440. [Google Scholar]
- Ujitoko, Y.; Ban, Y. Vibrotactile signal generation from texture images or attributes using generative adversarial network. In Proceedings of the Haptics: Science, Technology, and Applications: 11th International Conference, EuroHaptics 2018, Pisa, Italy, 13–16 June 2018; Proceedings, Part II 11. Springer: Berlin/Heidelberg, Germany, 2018; pp. 25–36. [Google Scholar]
- Ban, Y.; Ujitoko, Y. TactGAN: Vibrotactile designing driven by GAN-based automatic generation. In Proceedings of the SIGGRAPH Asia 2018 Emerging Technologies, Tokyo, Japan, 4–7 December 2018; pp. 1–2. [Google Scholar] [CrossRef]
- Cai, S.; Ban, Y.; Narumi, T.; Zhu, K. FrictGAN: Frictional Signal Generation from Fabric Texture Images using Generative Adversarial Network. In Proceedings of the ICAT-EGVE, Virtual, 2–4 December 2020; pp. 11–15. [Google Scholar]
- Cai, S.; Zhao, L.; Ban, Y.; Narumi, T.; Liu, Y.; Zhu, K. GAN-based image-to-friction generation for tactile simulation of fabric material. Comput. Graph. 2022, 102, 460–473. [Google Scholar]
- Cai, S.; Zhu, K.; Ban, Y.; Narumi, T. Visual-tactile cross-modal data generation using residue-fusion gan with feature-matching and perceptual losses. IEEE Robot. Autom. Lett. 2021, 6, 7525–7532. [Google Scholar]
- Xi, Q.; Wang, F.; Tao, L.; Zhang, H.; Jiang, X.; Wu, J. CM-AVAE: Cross-Modal Adversarial Variational Autoencoder for Visual-to-Tactile Data Generation. IEEE Robot. Autom. Lett. 2024, 9, 5214–5221. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Agatsuma, S.; Kurogi, J.; Saga, S.; Vasilache, S.; Takahashi, S. Simple Generative Adversarial Network to Generate Three-axis Time-series Data for Vibrotactile Displays. In Proceedings of the International Conference on Advances in Computer-Human Interactions, ACHI 2020, Valencia, Spain, 21–25 November 2020; pp. 19–24. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; PMLR: Cambridge, MA, USA, 2015; pp. 2256–2265. [Google Scholar]
- Corvi, R.; Cozzolino, D.; Poggi, G.; Nagano, K.; Verdoliva, L. Intriguing properties of synthetic images: From generative adversarial networks to diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 973–982. [Google Scholar]
- Chen, M.; Mei, S.; Fan, J.; Wang, M. Opportunities and challenges of diffusion models for generative AI. Natl. Sci. Rev. 2024, 11, nwae348. [Google Scholar] [CrossRef]
- Chen, C.; Ding, H.; Sisman, B.; Xu, Y.; Xie, O.; Yao, B.Z.; Tran, S.D.; Zeng, B. Diffusion models for multi-task generative modeling. In Proceedings of The Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
- Lin, X.; Xu, W.; Mao, Y.; Wang, J.; Lv, M.; Liu, L.; Luo, X.; Li, X. Vision-based Tactile Image Generation via Contact Condition-guided Diffusion Model. arXiv 2024, arXiv:2412.01639. [Google Scholar]
- Gu, C.; Gromov, M. Unpaired Image-To-Image Translation Using Transformer-Based CycleGAN. In Proceedings of the International Conference on Software Testing, Machine Learning and Complex Process Analysis, Tomsk, Russia, 25–27 November 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 75–82. [Google Scholar]
- Dubey, S.R.; Singh, S.K. Transformer-based generative adversarial networks in computer vision: A comprehensive survey. IEEE Trans. Artif. Intell. 2024, 5, 4851–4867. [Google Scholar] [CrossRef]
- Dou, Y.; Yang, F.; Liu, Y.; Loquercio, A.; Owens, A. Tactile-augmented radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26529–26539. [Google Scholar]
- Yang, F.; Zhang, J.; Owens, A. Generating visual scenes from touch. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 22070–22080. [Google Scholar]
- Jiang, S.; Zhao, S.; Fan, Y.; Yin, P. GelFusion: Enhancing Robotic Manipulation under Visual Constraints via Visuotactile Fusion. arXiv 2025, arXiv:2505.07455. [Google Scholar] [CrossRef]
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
- Zhang, M.; Terui, S.; Makino, Y.; Shinoda, H. TexSenseGAN: A User-Guided System for Optimizing Texture-Related Vibrotactile Feedback Using Generative Adversarial Network. IEEE Trans. Haptics 2025, 18, 325–339. [Google Scholar]
- pytorch.org. Installation of PyTorch v1.12.1. Available online: https://pytorch.org/get-started/previous-versions/ (accessed on 1 September 2025).
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Zheng, W.; Liu, H.; Wang, B.; Sun, F. Cross-modal learning for material perception using deep extreme learning machine. Int. J. Mach. Learn. Cybern. 2020, 11, 813–823. [Google Scholar]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
- Yu, Y.; Zhang, W.; Deng, Y. Fréchet Inception Distance (FID) for evaluating GANs. China Univ. Min. Technol. Beijing Grad. Sch. 2021, 3, 1–7. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
Generator architecture (K: kernel size, S: stride, P: padding). Decoder output shapes listed after each ReLU include the channel doubling from the U-Net skip-connection concatenation.
| Component | K | S | P | Output Shape | #Params |
|---|---|---|---|---|---|
| Input: Texture + Label | – | – | – | [1, 256, 256] + 9 | 0 |
| Conv2d | 4 × 4 | 2 | 1 | [64, 128, 128] | 1024 |
| LeakyReLU | – | – | – | [64, 128, 128] | 0 |
| Conv2d | 4 × 4 | 2 | 1 | [128, 64, 64] | 131,072 |
| BatchNorm2d | – | – | – | [128, 64, 64] | 256 |
| LeakyReLU | – | – | – | [128, 64, 64] | 0 |
| Conv2d | 4 × 4 | 2 | 1 | [256, 32, 32] | 524,288 |
| BatchNorm2d | – | – | – | [256, 32, 32] | 512 |
| LeakyReLU | – | – | – | [256, 32, 32] | 0 |
| Conv2d | 4 × 4 | 2 | 1 | [512, 16, 16] | 2,097,152 |
| BatchNorm2d | – | – | – | [512, 16, 16] | 1024 |
| LeakyReLU | – | – | – | [512, 16, 16] | 0 |
| Conv2d | 4 × 4 | 2 | 1 | [512, 8, 8] | 4,194,304 |
| BatchNorm2d | – | – | – | [512, 8, 8] | 1024 |
| LeakyReLU | – | – | – | [512, 8, 8] | 0 |
| Conv2d | 4 × 4 | 2 | 1 | [512, 4, 4] | 4,194,304 |
| BatchNorm2d | – | – | – | [512, 4, 4] | 1024 |
| LeakyReLU | – | – | – | [512, 4, 4] | 0 |
| Conv2d | 4 × 4 | 2 | 1 | [512, 2, 2] | 4,194,304 |
| BatchNorm2d | – | – | – | [512, 2, 2] | 1024 |
| LeakyReLU | – | – | – | [512, 2, 2] | 0 |
| Conv2d | 4 × 4 | 2 | 1 | [512, 1, 1] | 4,194,304 |
| CBN | – | – | – | [512, 1, 1] | 10,240 |
| ReLU | – | – | – | [512, 1, 1] | 0 |
| ConvTranspose2d | 4 × 4 | 2 | 1 | [512, 2, 2] | 4,194,304 |
| BatchNorm2d | – | – | – | [512, 2, 2] | 1024 |
| ReLU | – | – | – | [1024, 2, 2] | 0 |
| ConvTranspose2d | 4 × 4 | 2 | 1 | [512, 4, 4] | 8,388,608 |
| BatchNorm2d | – | – | – | [512, 4, 4] | 1024 |
| ReLU | – | – | – | [1024, 4, 4] | 0 |
| ConvTranspose2d | 4 × 4 | 2 | 1 | [512, 8, 8] | 8,388,608 |
| BatchNorm2d | – | – | – | [512, 8, 8] | 1024 |
| ReLU | – | – | – | [1024, 8, 8] | 0 |
| ConvTranspose2d | 4 × 4 | 2 | 1 | [512, 16, 16] | 8,388,608 |
| BatchNorm2d | – | – | – | [512, 16, 16] | 1024 |
| ReLU | – | – | – | [1024, 16, 16] | 0 |
| ConvTranspose2d | 4 × 4 | 2 | 1 | [256, 32, 32] | 4,194,304 |
| BatchNorm2d | – | – | – | [256, 32, 32] | 512 |
| ReLU | – | – | – | [512, 32, 32] | 0 |
| ConvTranspose2d | 4 × 4 | 2 | 1 | [128, 64, 64] | 1,048,576 |
| BatchNorm2d | – | – | – | [128, 64, 64] | 256 |
| ReLU | – | – | – | [256, 64, 64] | 0 |
| ConvTranspose2d | 4 × 4 | 2 | 1 | [64, 128, 128] | 262,144 |
| BatchNorm2d | – | – | – | [64, 128, 128] | 128 |
| ReLU | – | – | – | [128, 128, 128] | 0 |
| ConvTranspose2d | 4 × 4 | 2 | 1 | [1, 256, 256] | 2049 |
| Output: Tanh | – | – | – | [1, 256, 256] | 0 |
| Total Trainable Parameters | 54,418,049 | ||||
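The generator summarized above is trained with the hybrid objective described in the highlights (adversarial + L1 + FM). Below is a minimal sketch, assuming pix2pixHD-style feature matching over the multi-scale discriminator features and an LSGAN-style adversarial term; the loss weights (100 for L1, 10 for FM) are common defaults, not values reported by the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_generator_loss(feats_fake, feats_real, fake, real,
                          lambda_l1=100.0, lambda_fm=10.0):
    """Hybrid objective sketch: adversarial + L1 + feature matching (FM).

    feats_fake / feats_real: one list per discriminator scale, each holding
    the intermediate feature maps followed by the final prediction map.
    """
    adv = fake.new_zeros(())
    fm = fake.new_zeros(())
    for f_fake, f_real in zip(feats_fake, feats_real):
        # Adversarial term on the final patch prediction map.
        adv = adv + F.mse_loss(f_fake[-1], torch.ones_like(f_fake[-1]))
        # FM term: match intermediate discriminator features on real vs. fake.
        for f, r in zip(f_fake[:-1], f_real[:-1]):
            fm = fm + F.l1_loss(f, r.detach())
    l1 = F.l1_loss(fake, real)  # pixel-wise reconstruction term
    return adv + lambda_l1 * l1 + lambda_fm * fm
```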
Multi-scale discriminator architecture (K: kernel size, S: stride, P: padding); three PatchGAN branches, with each coarser branch operating on the input pair downsampled by AvgPool2d.
| Component | K | S | P | Output Shape | #Params |
|---|---|---|---|---|---|
| Input: Texture + Spectrogram | – | – | – | [1, 256, 256] + [1, 256, 256] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [64, 129, 129] | 2112 |
| LeakyReLU | – | – | – | [64, 129, 129] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [128, 65, 65] | 131,200 |
| BatchNorm2d | – | – | – | [128, 65, 65] | 256 |
| LeakyReLU | – | – | – | [128, 65, 65] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [256, 33, 33] | 524,544 |
| BatchNorm2d | – | – | – | [256, 33, 33] | 512 |
| LeakyReLU | – | – | – | [256, 33, 33] | 0 |
| Conv2d | 4 × 4 | 1 | 2 | [512, 34, 34] | 2,097,664 |
| BatchNorm2d | – | – | – | [512, 34, 34] | 1024 |
| LeakyReLU | – | – | – | [512, 34, 34] | 0 |
| Conv2d | 4 × 4 | 1 | 2 | [1, 35, 35] | 8193 |
| AvgPool2d | 3 × 3 | 2 | 1 | [2, 128, 128] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [64, 65, 65] | 2112 |
| LeakyReLU | – | – | – | [64, 65, 65] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [128, 33, 33] | 131,200 |
| BatchNorm2d | – | – | – | [128, 33, 33] | 256 |
| LeakyReLU | – | – | – | [128, 33, 33] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [256, 17, 17] | 524,544 |
| BatchNorm2d | – | – | – | [256, 17, 17] | 512 |
| LeakyReLU | – | – | – | [256, 17, 17] | 0 |
| Conv2d | 4 × 4 | 1 | 2 | [512, 18, 18] | 2,097,664 |
| BatchNorm2d | – | – | – | [512, 18, 18] | 1024 |
| LeakyReLU | – | – | – | [512, 18, 18] | 0 |
| Conv2d | 4 × 4 | 1 | 2 | [1, 19, 19] | 8193 |
| AvgPool2d | 3 × 3 | 2 | 1 | [2, 64, 64] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [64, 33, 33] | 2112 |
| LeakyReLU | – | – | – | [64, 33, 33] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [128, 17, 17] | 131,200 |
| BatchNorm2d | – | – | – | [128, 17, 17] | 256 |
| LeakyReLU | – | – | – | [128, 17, 17] | 0 |
| Conv2d | 4 × 4 | 2 | 2 | [256, 9, 9] | 524,544 |
| BatchNorm2d | – | – | – | [256, 9, 9] | 512 |
| LeakyReLU | – | – | – | [256, 9, 9] | 0 |
| Conv2d | 4 × 4 | 1 | 2 | [512, 10, 10] | 2,097,664 |
| BatchNorm2d | – | – | – | [512, 10, 10] | 1024 |
| LeakyReLU | – | – | – | [512, 10, 10] | 0 |
| Conv2d | 4 × 4 | 1 | 2 | [1, 11, 11] | 8193 |
| Total Trainable Parameters | 8,296,515 | ||||
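A minimal sketch of the multi-scale discriminator tabulated above, assuming a pix2pixHD-style arrangement: three identical PatchGAN branches, with AvgPool2d(3, stride=2, padding=1) downsampling the concatenated texture/spectrogram pair before each coarser branch. Layer sizes follow the table; this is a reconstruction, not the authors’ released code.

```python
import torch
import torch.nn as nn

def patch_branch(in_ch: int = 2, n_down: int = 3) -> nn.Sequential:
    """One PatchGAN branch following the table: n_down stride-2 convs,
    then two stride-1 convs down to a 1-channel prediction map."""
    layers, ch = [nn.Conv2d(in_ch, 64, 4, stride=2, padding=2),
                  nn.LeakyReLU(0.2)], 64
    for _ in range(n_down - 1):
        layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=2),
                   nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2)]
        ch *= 2
    layers += [nn.Conv2d(ch, ch * 2, 4, stride=1, padding=2),
               nn.BatchNorm2d(ch * 2), nn.LeakyReLU(0.2),
               nn.Conv2d(ch * 2, 1, 4, stride=1, padding=2)]
    return nn.Sequential(*layers)

class MultiScaleDiscriminator(nn.Module):
    """Three identical branches; AvgPool2d(3, stride=2, padding=1) feeds
    each coarser branch, as tabulated above."""
    def __init__(self, num_d: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([patch_branch() for _ in range(num_d)])
        self.down = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, texture: torch.Tensor, spectrogram: torch.Tensor):
        x = torch.cat([texture, spectrogram], dim=1)  # [B, 2, 256, 256]
        outs = []
        for branch in self.branches:
            outs.append(branch(x))
            x = self.down(x)  # halve the resolution for the next branch
        return outs

# Prediction maps of [1,1,35,35], [1,1,19,19], [1,1,11,11], matching the table.
maps = MultiScaleDiscriminator()(torch.randn(1, 1, 256, 256),
                                 torch.randn(1, 1, 256, 256))
```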
LPIPS (mean ± std; lower is better) from the grid search over the number of discriminators (D) and the number of downsampling layers (L). The best configuration (D = 3, L = 3) is used in the full model.
| D \ L | L = 1 | L = 2 | L = 3 | L = 4 |
|---|---|---|---|---|
| D = 1 | 0.3442 ± 0.0106 | 0.3486 ± 0.0099 | 0.3458 ± 0.0080 | 0.3337 ± 0.0084 |
| D = 2 | 0.3254 ± 0.0078 | 0.3328 ± 0.0090 | 0.3294 ± 0.0084 | 0.3258 ± 0.0100 |
| D = 3 | 0.3291 ± 0.0102 | 0.3308 ± 0.0096 | 0.3113 ± 0.0087 | 0.3202 ± 0.0073 |
| D = 4 | 0.3200 ± 0.0079 | 0.3274 ± 0.0072 | 0.3194 ± 0.0057 | 0.3205 ± 0.0101 |
Quantitative comparison with baseline models and ablated variants (mean ± std over 10 runs; lower is better for both metrics).
| Method | LPIPS ↓ | FID ↓ |
|---|---|---|
| pix2pix | 0.3513 ± 0.0089 | 52.23 ± 2.01 |
| pix2pixHD | 0.3345 ± 0.0105 | 43.60 ± 1.88 |
| Residue-Fusion GAN | 0.3198 ± 0.0075 | 40.76 ± 1.83 |
| W/o CBN | 0.3439 ± 0.0093 | 45.42 ± 2.10 |
| W/o L1 loss | 0.3278 ± 0.0087 | 41.67 ± 2.14 |
| W/o FM loss | 0.3247 ± 0.0068 | 41.29 ± 1.98 |
| Single Discriminator | 0.3458 ± 0.0080 | 47.60 ± 1.91 |
| Full Model (Proposed) | 0.3113 ± 0.0087 | 38.77 ± 1.92 |
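For reference, a minimal sketch of how the reported metrics and significance tests can be computed, assuming the lpips and pytorch-fid packages and SciPy. The channel tiling for single-channel spectrograms and the randomly drawn per-run scores are assumptions of this sketch, not details from the paper.

```python
import numpy as np
import torch
import lpips
from scipy.stats import ttest_rel

# --- LPIPS (AlexNet backbone; inputs are [N, 3, H, W] in [-1, 1]) --------
# Single-channel spectrograms are tiled to 3 channels here (an assumption).
loss_fn = lpips.LPIPS(net='alex')
real = torch.rand(4, 1, 256, 256) * 2 - 1      # stand-in real spectrograms
fake = torch.rand(4, 1, 256, 256) * 2 - 1      # stand-in generated ones
dist = loss_fn(real.repeat(1, 3, 1, 1), fake.repeat(1, 3, 1, 1))
print('mean LPIPS:', dist.mean().item())

# --- FID (pytorch-fid package) between image folders, e.g.: --------------
#   from pytorch_fid.fid_score import calculate_fid_given_paths
#   fid = calculate_fid_given_paths(['real_dir', 'fake_dir'], 50, 'cpu', 2048)

# --- Two-sided paired t-test over the 10 repeated runs -------------------
rng = np.random.default_rng(0)
proposed = rng.normal(0.3113, 0.0087, size=10)  # placeholder per-run scores
baseline = rng.normal(0.3198, 0.0075, size=10)  # (means/stds from the table)
t_stat, p_value = ttest_rel(proposed, baseline)
print(f't = {t_stat:.3f}, p = {p_value:.4f}')
```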