How to Talk to Your Classifier: Conditional Text Generation with Radar–Visual Latent Space
Abstract
1. Introduction
- We propose a training method that combines a visual classifier with a denoising adversarial autoencoder (DAAE), in which the DAAE is tasked with reconstructing the text captions corresponding to the radar images (a minimal sketch of this setup follows this list).
- We show that the presented method can generate radar image descriptions from the classifier’s latent representation, thereby enhancing interpretability and ensuring alignment with the classification outcome.
- We confirm, through an ablation study, that our text generation does not interfere with classification performance, even when the weight of the Gaussian constraint on the latent space is increased.
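The following is a minimal PyTorch sketch of the training idea described above: a radar classifier and a caption decoder share one latent vector, and the total loss combines classification, caption reconstruction, and a weighted Gaussian constraint. All module names, dimensions, the weight `lam`, and the moment-based stand-in for the DAAE’s adversarial regularizer are illustrative assumptions, not the authors’ implementation.

```python
# Minimal sketch (not the authors' code): a radar classifier whose latent also
# conditions a text decoder that reconstructs the scene caption.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM, VOCAB_SIZE, NUM_CLASSES = 128, 1000, 6  # illustrative sizes

class RadarClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, LATENT_DIM),
        )
        self.head = nn.Linear(LATENT_DIM, NUM_CLASSES)

    def forward(self, x):
        z = self.features(x)          # shared latent representation
        return z, self.head(z)        # latent + class logits

class TextDecoder(nn.Module):
    """GRU decoder that reconstructs a caption conditioned on the latent z."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, LATENT_DIM)
        self.rnn = nn.GRU(LATENT_DIM, LATENT_DIM, batch_first=True)
        self.out = nn.Linear(LATENT_DIM, VOCAB_SIZE)

    def forward(self, z, tokens):
        h0 = z.unsqueeze(0)               # latent initializes the GRU state
        emb = self.embed(tokens[:, :-1])  # teacher forcing on the shifted caption
        hidden, _ = self.rnn(emb, h0)
        return self.out(hidden)           # next-token logits

def training_step(clf, dec, radar_img, label, caption, lam=0.1):
    z, logits = clf(radar_img)
    cls_loss = F.cross_entropy(logits, label)
    rec_logits = dec(z, caption)
    rec_loss = F.cross_entropy(rec_logits.reshape(-1, VOCAB_SIZE),
                               caption[:, 1:].reshape(-1))
    # Stand-in for the DAAE's adversarial Gaussian constraint: a simple
    # moment penalty pulling z toward N(0, I), weighted by lam.
    prior_loss = z.mean(0).pow(2).mean() + (z.var(0) - 1).pow(2).mean()
    return cls_loss + rec_loss + lam * prior_loss
```

Because the decoder sees only the classifier’s latent vector at inference time, any caption it produces reflects exactly the representation on which the class decision is based.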
2. Related Work
2.1. Explainability of Neural Networks
2.2. Visual–Semantic Learning
2.3. Latent Space Modeling
3. Approach
3.1. Radar Signal Processing
3.2. Denoising Adversarial Autoencoder
3.3. Decoding Text from Image Classifier
4. Experiments
4.1. Implementation Settings
4.2. Results
4.2.1. Classification
4.2.2. Decoded Scene Descriptions
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Image Evaluation
Appendix A.1. Generating Image Captions
Appendix A.2. Training Setup
| Models | CIFAR-10 | CIFAR-100 |
|---|---|---|
| Cross-entropy | | |
| Cross-entropy with Gaussian latent | | |
| SupAText (GIT-Base) | | |
| SupAText (GIT-Large) | | |
| SupAText (BLIP-Base) | | |
| SupAText (BLIP-Large) | | |
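The SupAText variants in the table above rely on captions produced by off-the-shelf GIT and BLIP captioners. The snippet below is a hedged sketch of how such a caption can be generated with the Hugging Face transformers library; the checkpoint name, input file, and generation length are assumptions, not necessarily the configuration used in the paper.

```python
# Sketch: generating a caption with an off-the-shelf GIT model via
# Hugging Face transformers (checkpoint and settings are assumptions).
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("cifar_example.png").convert("RGB")   # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=20)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```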
| Symbol | Parameter | Value |
|---|---|---|
| | Center frequency | 60 GHz |
| B | Bandwidth | (–) GHz |
| | Number of samples per chirp | 128 |
| | Number of chirps | 64 |
| | ADC sampling frequency | 2 MHz |
| | Chirp time duration | 390 μs |
| | Frame repetition time | s |
| | Number of receiving antennas | 3 |
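For orientation, standard FMCW relations turn the chirp parameters above into range and velocity resolution. The short sketch below computes both; the bandwidth value is an assumed placeholder, since the exact figure is not legible in the extracted table, while the other constants are taken from the table.

```python
# Derived FMCW quantities from the configuration above (standard relations,
# not taken from the paper). The bandwidth is an ASSUMED placeholder value.
C = 3e8                    # speed of light in m/s

f_c = 60e9                 # center frequency: 60 GHz
bandwidth = 1e9            # ASSUMED 1 GHz sweep bandwidth (placeholder)
n_chirps = 64              # chirps per frame
chirp_time = 390e-6        # chirp duration: 390 microseconds

range_resolution = C / (2 * bandwidth)                          # dR = c / (2B)
wavelength = C / f_c
velocity_resolution = wavelength / (2 * n_chirps * chirp_time)  # dv = lambda / (2 * N_c * T_c)

print(f"range resolution   : {range_resolution:.3f} m")
print(f"velocity resolution: {velocity_resolution:.3f} m/s")
```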
| Model | Test Accuracy (%) | ROUGE-L |
|---|---|---|
| Classifier (baseline) | 98.31 | – |
| Classifier + DAAE () | 98.3 | 30.1 |
| Classifier + DAAE () | 98.3 | 30.8 |
| Classifier + DAAE () | 98.23 | 26.64 |
| Classifier + DAAE () | 98.02 | 30.57 |
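ROUGE-L in the table above scores the longest common subsequence (LCS) between a reconstructed caption and its ground truth. The snippet below is a minimal pure-Python sketch of the ROUGE-L F-measure for a single caption pair; it is illustrative and not the paper’s evaluation script.

```python
# Minimal ROUGE-L F-measure for one caption pair (illustrative sketch).
def lcs_length(a, b):
    # Dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference: str, candidate: str, beta: float = 1.0) -> float:
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)

# Example using one row of the decoded scene descriptions below:
print(rouge_l("three people sitting", "three sitting"))  # -> 0.8
```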
| Ground Truth Text | Reconstructed Text | Classifier Prediction |
|---|---|---|
| “three people sitting” | “four walking” | 4 |
| “three people sitting” | “three sitting” | 4 |
| “three people sitting” | “four walking” | 5 |
| “three people walking” | “five walking” | 5 |
| “three people sitting” | “five sitting” | 5 |
| “three people sitting” | “person sitting and walking” | 5 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ott, J.; Sun, H.; Servadei, L.; Wille, R. How to Talk to Your Classifier: Conditional Text Generation with Radar–Visual Latent Space. Sensors 2025, 25, 4467. https://doi.org/10.3390/s25144467