Article

How to Talk to Your Classifier: Conditional Text Generation with Radar–Visual Latent Space

Julius Ott, Huawei Sun, Lorenzo Servadei and Robert Wille
1 Infineon Technologies AG, 85579 Neubiberg, Germany
2 TUM School of Computation, Information and Technology, Technical University Munich, 80333 Munich, Germany
* Author to whom correspondence should be addressed.
Sensors 2025, 25(14), 4467; https://doi.org/10.3390/s25144467
Submission received: 5 June 2025 / Revised: 11 July 2025 / Accepted: 16 July 2025 / Published: 17 July 2025

Abstract

Many radar applications rely primarily on visual classification for their evaluations. However, new research is integrating textual descriptions alongside visual input and showing that such multimodal fusion improves contextual understanding. A critical issue in this area is the effective alignment of coded text with corresponding images. To this end, our paper presents an adversarial training framework that generates descriptive text from the latent space of a visual radar classifier. Our quantitative evaluations show that this dual-task approach maintains a robust classification accuracy of 98.3% despite the inclusion of Gaussian-distributed latent spaces. Beyond these numerical validations, we conduct a qualitative study of the text output in relation to the classifier’s predictions. This analysis highlights the correlation between the generated descriptions and the assigned categories and provides insight into the classifier’s visual interpretation processes, particularly in the context of normally uninterpretable radar data.


1. Introduction

Sensors and devices tailored to the Internet of Things (IoT) are increasingly integrated into everyday life. Among the countless applications, real-time estimation of the number of people in a room is a particularly common and challenging problem. In private environments, such a feature can significantly improve intelligent energy and climate management by monitoring the occupancy of a specific space. During a pandemic, these technologies play a critical role in managing crowds and raising alarms to prevent the spread of infectious diseases. Different sensors, including Wi-Fi/Bluetooth [1], thermal [2], lidar [3], cameras [4], and radar, offer different approaches to accomplishing this task, each with its own limitations. For example, thermal sensors are susceptible to interference from sunlight, while Wi-Fi sensors, although designed primarily for communication, require extensive data processing to obtain physical measurements. Lidar sensors provide a high level of detail but are prone to problems in changing lighting conditions. Cameras, which arguably perform better for indoor surveillance, pose significant privacy concerns, preventing their use in smart home applications. Therefore, our focus is on radars, particularly frequency-modulated continuous wave (FMCW) radars, which are known for their resilience to weather and illumination as well as their ability to keep private data confidential. Additionally, their economical processing requirements enable the detection of a person's distance, angle, and speed, although the resolution of the radar limits the number of people that can be detected. In this study, we utilize a cost-optimized radar with just three receiving antennas, which can be used for elevator surveillance or doorbells. Such systems have to run continuously; thus, keeping power consumption low is essential. Related works on radar focus on high-resolution tasks with more antennas, such as pose estimation [5,6,7], or on automotive long-range detection with higher frequencies [8].
The data collected by radars are often converted into an image format, a representation for which neural networks have demonstrated outstanding capabilities [9]. Recent breakthroughs in both processing chains and neural network design have demonstrated the effectiveness of FMCW radars for counting up to six people. To ensure the viability of these systems in real-world environments, it is critical to consider a variety of scenarios and environmental conditions during development. Factors such as individual movement patterns and radar mounting height are crucial. For example, the Range-Doppler image differs depending on whether people are moving or standing. If there are multiple people in the radar's range, the system must detect whether they are stationary or moving. Studies have taken these variables into account, training and testing algorithms in controlled environments [10] and then adapting them to work properly in different environments [11].
However, for systems to be truly effective, they must also be interpretable. Conventional methods strive to explain the decisions of neural networks at the input level through attention maps [12]. These techniques are too computationally intensive, and radar images are inherently complex, making attention maps indecipherable to an untrained user.
To close this gap, a system must be developed that provides understandable interpretations to any user in real time. Describing the neural network's interpretation through text, similar to image captioning, yields a generally understandable explanation. The challenge lies in combining radar data with text descriptions.
A typical approach involves training an image-encoding network together with a text generation network, where popular image-to-text models use encoder and decoder Transformer structures to learn shared representations [13,14]. Despite advances in text generation and image captioning, Transformers have not surpassed Convolutional Neural Networks (CNNs) in processing sensor data because CNNs require less training data. In [15], radar and image embeddings are aligned for text generation under conditions that are difficult for cameras, namely darkness. Like many other radar studies, that work uses radars that consume too much power for consumer applications: two radars with 12 antennas each. Similarly, our research focuses on converting the latent representation of a CNN-based classifier into text to improve the interpretation of the classifier's decision. However, simply attaching an LSTM for text creation does not meet the explainability requirements. We aim for a system in which small changes in the latent space correspond to small changes in the resulting text, which should also correlate with the exact or expected number of people.
The solution to this problem lies in the domain of the Variational Autoencoder (VAE), which maps inputs into a latent Gaussian space for reconstructing initial data. By utilizing this architecture, we can harness the latent classifier space for projections into the Gaussian domain and generate text. Such an adaptation involves substituting the visual encoder within the VAE, ensuring that no gradient from text reconstruction permeates the classifier. However, as the literature indicates [16,17], the VAE has limitations, namely, overlooking the latent representation and lacking encoded structure—where a structured latent space should align with the nuances of the reconstructed text. The incorporation of a denoising target in the Denoising Adversarial Autoencoder (DAAE) introduces structure by reconstructing unaltered sentences from their perturbed counterparts, while adversarial training schemes emphasize the Gaussian latent representation for reconstruction purposes. The application of such adversarial training in the context of classification, particularly for radar sensors, has yet to be explored. Moreover, the potential limitations on the classifier’s performance posed by the latent space and scalability considerations are also subjects of our ablation study. The generated descriptions are evaluated using the ROUGE-L score [18], which measures the overlapping sequence length between reference and generated descriptions.
To summarize, the goal of this work is to show how text generation can be aligned with a classifier without losing accuracy, and the detailed contributions of this work are as follows:
  • We propose a training method, combining a visual classifier with a DAAE, where the DAAE is tasked with reconstructing text captions corresponding to radar images.
  • We show the capacity of the presented method to generate radar image descriptions via the classifier’s latent representation, thereby enhancing interpretability and ensuring alignment with the classified outcome.
  • We confirm, through an ablation study, that our method of text generation does not interfere with classification efficacy (98.3%), even when the weight of the Gaussian constraint is increased.
Our experiment with an industrial radar dataset produced promising results, indicating that it is possible to generate reliable text that reflects visual interpretations while preserving the integrity of classification performance. This framework significantly improves the usability of neural networks in real-world scenarios, especially when combined with sensor data that would otherwise be cryptic to human interpretation.
This manuscript addresses the existing discourse on the explainability of neural networks, the synergy of image–text learning, and latent space modeling, and then, it explains the DAAE approach and the mechanisms for generating text descriptions from visual latent representations. In Section 2, seminal work on text generation and latent space modeling is reviewed. Afterward, the building blocks for this work are presented in Section 3, such as dataset generation in Section 3.1, the DAAE in Section 3.2, and the model’s design for text reconstruction in Section 3.3. The method is evaluated on an industrial radar dataset with respect to classification accuracy and text quality in Section 4 and then summarized in Section 5.

2. Related Work

In this section, we address the basic research underlying the three core aspects of our proposed approach: the explainability of neural networks, the integration of visual and semantic information, and the manipulation of latent space to control text generation.

2.1. Explainability of Neural Networks

The rise of complex neural network (NN) models in areas such as computer vision (CV) and natural language processing (NLP) has created an urgent need for explainable artificial intelligence (XAI). XAI is critical for detecting biases within datasets and ensuring that algorithms are fair and transparent. Various methods have been developed to decipher how neural networks work. One notable approach is to leverage a network's internal processing to shed light on how particular inputs produce their respective outputs. One technique, Local Interpretable Model-Agnostic Explanations (LIME) [19], approximates the complex model with a more interpretable surrogate model that can analyze the meaning of each input feature. Similarly, SHapley Additive exPlanations (SHAP) [20] calculates the contribution of each input feature to the output of the network. Other methods, such as Class Activation Mapping (CAM) [21] and Gradient-Weighted Class Activation Mapping (Grad-CAM) [22], use neural sensitivity to generate saliency maps that express feature relevance in high-dimensional spaces. However, the complexity of these feature spaces makes it difficult to interpret networks in real time. Meanwhile, transformer networks, with their innate attention mechanisms, provide insight into their decision-making processes [23].
Besides finding the relevant input features, explainability can be obtained from the representations of the network. Techniques like Principal Component Analysis (PCA) [24] or Independent Component Analysis (ICA) [25] extract low-dimensional representations from the neural network’s embedding space to show which inputs are entangled in the latent space. The desired class separation can also be learned. In InfoGAN [26], an adversarial training scheme aims to reduce the entanglement of representations. This effect can also be achieved with specific loss functions [27].
However, these explanatory techniques are most useful when dealing with human-interpretable inputs such as images or text. These methods fail when it comes to data that are not immediately understandable to humans—such as frequency signatures from WiFi, radar, or ultrasound sensors. Therefore, to explain such data, we try to use multimodal approaches, such as visual–semantic learning.

2.2. Visual–Semantic Learning

The goal of visual–semantic learning is to create a harmonious relationship between images and their textual descriptions. Research in this area has made significant progress, and this is exemplified by studies such as [28], which label visual entities and map them into a shared latent space with text using dual networks called projectors. CLIP [14] further advances this concept by combining images with text in a common embedding space trained on large-scale image–text pairs sourced from the Internet. Recently, this idea has also been extended to radar data, learning alignments between radar point cloud and text descriptions [29]. This methodology has been applied to a variety of tasks, including image captioning [13,30], image–text retrieval [31], and even generative image creation tasks based on text input [32].
Image captioning models are the closest to our vision of a language-explained visual model, as they infer text descriptions based on visual signals. These approaches differ in the integration of image and text information—some use soft prompts [33], and others use a trainable text generation network [13]. However, the usefulness of these models depends on the accuracy of the image encoder. Inaccuracies in the visual interpretation make the captions nonsensical. Here, the role of latent space is crucial: it must maintain its semantic integrity regardless of the changes within it.

2.3. Latent Space Modeling

Text generation methods such as [34] can create modified text by manipulating the latent representation of the language model. This allows for advanced grammatical changes or linguistic style transfer, as shown in [35]. These methods learn specific mappings for latent space manipulation or use keyword searches to change the value vectors of specific transformer layers. For random text generation, ref. [36] proposed an invertible mapping from the latent distribution to a Gaussian prior. In this way, perturbations in the prior space are mapped to the latent distribution and allow flexible control of the generated text. Similarly, VAEs are used in [37] to estimate the latent space of a BERT–GPT-2 [38,39] encoder–decoder model with a Gaussian prior. This highlights that Gaussian priors are promising for text manipulation in latent representations. However, these seminal works focus mostly on transformer-based text manipulation, and VAEs suffer from ignoring the latent prior [16].
Concurrent work proposed the DAAE, which is trained in an unsupervised fashion by reconstructing perturbed sentences [40]. The DAAE is conditioned on a Gaussian prior distribution and assumes noisy inputs. Thus, the DAAE fulfills three important criteria for enabling the interpretation of a visual classifier: (1) The classifier's latent space can be modeled as a Gaussian distribution from which the decoder of the DAAE can generate text. (2) Given a description of the input, the DAAE learns to reconstruct the caption by focusing on the latent representation of the classifier through the adversarial training objective. (3) The DAAE interprets the visual latent space as a noisy version of the true latent representation, compensating for an imperfect classifier.

3. Approach

The approach is divided into three parts. First, we describe the radar dataset. Afterward, we introduce the DAAE, and then we demonstrate how it can be used to generate text from a visual classifier embedding. In this way, the text explains the visual interpretation of the radar input.

3.1. Radar Signal Processing

The proposed method uses Infineon's BGT60TR13C FMCW radar chipset (Infineon Technologies AG, Neubiberg, Germany), as shown in Figure 1. This chipset operates at a base frequency of 60 GHz and has one transmit antenna (TX) and three receive antennas (RX). With an ultra-wide bandwidth of 5.5 GHz, it offers a very high range resolution of up to 3 cm, and its fast ramp speed of 400 MHz/μs enables a higher maximum Doppler velocity. In addition, the high signal-to-noise ratio (SNR) ensures the detection of people at distances of up to 15 m in front of the sensor, while its high sensitivity enables the detection of sub-millimeter movements. Thanks to optimized performance modes that briefly interrupt sensor operation, a power consumption of less than 5 mW can be guaranteed; these interruptions save energy and facilitate data preprocessing. The collected data are digitized to 12 bits by the analog-to-digital converter (ADC) and then forwarded from the evaluation board via USB to the PC for further examination. From the radar signal, we can calculate the range of individual targets. The range resolution ($\Delta R$) and maximum detectable range ($R_{max}$) can be calculated using the following formulas [41]:
$\Delta R = \frac{c}{2B},$ (1)
$R_{max} = \frac{\Delta R \cdot N_s}{2},$ (2)
where $c$ denotes the speed of light, $B$ denotes the bandwidth, and $N_s$ denotes the number of samples per chirp. In this setting, we use a bandwidth of 1 GHz, which results in a range resolution of 15 cm. In combination with 128 samples per chirp, a maximum range of approximately 10 m is achieved, which is particularly suitable for indoor observations. To measure the velocity of the targets, we use the Doppler frequency along with the number of chirps. The maximum detectable velocity ($V_{max}$) and the velocity resolution ($\Delta V$) are computed as follows:
$V_{max} = \frac{c}{4 f_0 t_c},$ (3)
$\Delta V = \frac{2 V_{max}}{N_c},$ (4)
where $f_0$ denotes the center frequency, $t_c$ denotes the chirp time duration, $N_c$ denotes the number of chirps, and $c$ denotes the speed of light [41].
We set the chirp time to 391 μs; thus, we can detect up to 3.2 m/s and resolve 0.1 m/s. With an average human walking speed of 1.42 m/s, we can detect all sorts of human motion. The configuration of the used radar and the relevant parameters are detailed in Table 1.
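For concreteness, the following minimal Python sketch (not part of the authors' code) evaluates Equations (1)–(4) with the configuration stated in the text and Table 1; the printed values reproduce the quoted 15 cm, roughly 10 m, 3.2 m/s, and 0.1 m/s.

```python
# Minimal sketch: range/velocity parameters of the FMCW configuration from Equations (1)-(4).
C = 3e8  # speed of light in m/s

def range_params(bandwidth_hz: float, n_samples: int):
    delta_r = C / (2 * bandwidth_hz)        # Equation (1): range resolution
    r_max = delta_r * n_samples / 2         # Equation (2): maximum detectable range
    return delta_r, r_max

def velocity_params(f0_hz: float, chirp_time_s: float, n_chirps: int):
    v_max = C / (4 * f0_hz * chirp_time_s)  # Equation (3): maximum detectable velocity
    delta_v = 2 * v_max / n_chirps          # Equation (4): velocity resolution
    return v_max, delta_v

dr, rmax = range_params(bandwidth_hz=1e9, n_samples=128)
vmax, dv = velocity_params(f0_hz=60e9, chirp_time_s=391e-6, n_chirps=64)
print(f"range resolution {dr:.2f} m, max range {rmax:.1f} m")            # 0.15 m, 9.6 m
print(f"max velocity {vmax:.2f} m/s, velocity resolution {dv:.2f} m/s")  # 3.20 m/s, 0.10 m/s
```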
The received frame in each interval takes the form of a three-dimensional array whose dimensions are given by the number of chirps ($N_c$), the number of samples per chirp ($N_s$), and the number of receiving antennas ($N_{rx}$). The axis along the chirps represents slow time, while the axis along the samples represents fast time. To avoid leakage from the transmit/receive antennas, we subtract the average along the fast time axis. Signal processing then occurs in two separate chains. In a setup designed to sense human presence, a person can display significant movements, such as walking or running (macro-movements), and subtle movements, such as breathing or slight body movements while standing still (micro-movements). First, we use the coherent integration of $N_c$ consecutive frames to create a virtual frame that increases the SNR of micro-motions (breathing) [42]. At the same time, macro-movements are captured in the latest frame. The following processing steps include a moving target indicator (MTI) to remove reflections from completely static targets, followed by a 2D fast Fourier transform (FFT). For the MTI, we subtract the average along the slow time axis, as applied in [42]. After these processing steps, we obtain two Range-Doppler images (RDIs) for macro- and micro-motion and stack them to obtain a radar input with two channels. The macro- and micro-frame processing steps for each antenna are shown in Figure 2.
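A NumPy sketch of one plausible realization of this macro/micro chain for a single antenna is given below; the buffer length, array shapes, and the exact ordering of the steps are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def frame_to_rdi(frame: np.ndarray) -> np.ndarray:
    """frame: (n_chirps, n_samples) raw ADC data of one antenna -> magnitude RDI."""
    frame = frame - frame.mean(axis=1, keepdims=True)        # remove TX/RX leakage (fast time mean)
    frame = frame - frame.mean(axis=0, keepdims=True)        # MTI: remove static reflections (slow time mean)
    rdi = np.fft.fft(frame, axis=1)                          # range FFT along fast time
    rdi = np.fft.fftshift(np.fft.fft(rdi, axis=0), axes=0)   # Doppler FFT along slow time
    return np.abs(rdi)

def macro_micro_input(frame_buffer: np.ndarray) -> np.ndarray:
    """frame_buffer: (n_frames, n_chirps, n_samples) -> stacked (2, n_chirps, n_samples) radar input."""
    macro = frame_to_rdi(frame_buffer[-1])                   # latest frame captures macro-motion
    micro = frame_to_rdi(frame_buffer.mean(axis=0))          # coherent integration boosts micro-motion SNR
    return np.stack([macro, micro], axis=0)

# Example with random data: a buffer of 64 frames, each with 64 chirps x 128 samples.
print(macro_micro_input(np.random.randn(64, 64, 128)).shape)  # (2, 64, 128)
```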
The obtained radar data capture scenes of up to five people performing three different movements, namely walking, sitting, and standing, in six different office rooms. The data are randomly split by recording into two datasets: one for training and one for testing. The training data contain 300,000 frames, corresponding to 8 h of recorded radar data at a frame rate of 10 Hz. The training classes are distributed as follows: 27,181 frames with zero people, 53,391 with one person, 44,701 with two people, 71,824 with three people, 61,186 with four people, and 41,717 with five people. The test dataset contains 90,000 frames. The test classes are distributed as follows: 10,296 samples with zero people, 15,268 with one person, 15,623 with two people, 19,763 with three people, 17,215 with four people, and 11,835 with five people. In each of the training recordings, 20% of the frames are used for validation, and the remaining 80% are used for training. Afterward, the presented preprocessing steps are applied to prevent any information leakage through the micro-frame buffer.
In addition, each recording has a caption that describes the motion of the individuals, e.g., "three people walking and two people sitting". The considered movements are walking, sitting, and standing.

3.2. Denoising Adversarial Autoencoder

The DAAE of [40] extends the concept of the Adversarial Autoencoder (AAE) [17,43] by applying a local perturbation process $C$ to the input data during training. These perturbations corrupt the original input data $x \in X$ to $\tilde{x} \in X$. Let the joint probability of the data be $p(x, \tilde{x}) = p_{data}(x)\, p_C(\tilde{x} \mid x)$ and the marginal probability be $p(\tilde{x}) = \sum_{x} p(x, \tilde{x})$. The model's task is to reconstruct the original data from the perturbed version. The individual building blocks are an LSTM for text encoding $E_{text}$, an LSTM for text generation $G$, and a single-layer discriminator $D$. The encoder $E_{text}$ is later replaced by the visual encoder. Training is performed with the following objective:
$\min_{E, G} \max_{D} \; \mathcal{L}_{rec}(\theta_E, \theta_G) - \lambda \mathcal{L}_{adv}(\theta_E, \theta_D),$ (5)
with
$\mathcal{L}_{rec}(\theta_E, \theta_G) = \mathbb{E}_{p(x, \tilde{x})}\left[ -\log p_G\left( x \mid E_{text}(\tilde{x}) \right) \right]$ (6)
$\mathcal{L}_{adv}(\theta_E, \theta_D) = \mathbb{E}_{p(z)}\left[ -\log D(z) \right] + \mathbb{E}_{p(\tilde{x})}\left[ -\log\left( 1 - D(E_{text}(\tilde{x})) \right) \right],$ (7)
where $\mathcal{L}_{rec}$ is the reconstruction loss of the original input $x$ given the encoded perturbed input $\tilde{x}$, and $\mathcal{L}_{adv}$ is the adversarial loss evaluating the latent distribution of the encoded $\tilde{x}$. This setup follows a typical Generative Adversarial Network (GAN) training scheme, where generator and discriminator interplay in an actor–critic fashion. However, the adversarial game acts on the latent space to enforce the Gaussian distribution, as in the VAE, while the reconstruction loss is needed for high text quality. This approach encourages the network to map sequences to appropriate latent representations without the need for additional training objectives or reparameterization-style tricks as in the VAE. Furthermore, the DAAE overcomes the VAE's tendency to neglect the latent information in combination with noisy inputs, as shown in [16]. In addition, the encoder and decoder are LSTM blocks rather than Transformers; better performance with LSTM blocks was shown in [40]. Although LSTMs are known to perform poorly on very long text sequences, this issue is not relevant in this particular use case. In the following, we replace the encoder with a visual classifier that provides a noisy view of the original input.
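As an illustration of how this objective can be optimized, the following PyTorch sketch computes the discriminator loss and the encoder/generator loss separately; the encoder, generator, and discriminator modules and their interfaces are placeholders, not the DAAE code of [40]. The discriminator is assumed to output a probability of shape (batch, 1).

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, z_prior, z_enc):
    """D learns to assign 1 to prior samples z ~ N(0, I) and 0 to encoded perturbed sentences."""
    real = F.binary_cross_entropy(discriminator(z_prior), torch.ones_like(z_prior[:, :1]))
    fake = F.binary_cross_entropy(discriminator(z_enc.detach()), torch.zeros_like(z_enc[:, :1]))
    return real + fake

def encoder_generator_loss(generator, discriminator, z_enc, x, lambda_adv=10.0):
    """E and G minimize the NLL of the clean sentence x and try to fool D (non-saturating variant)."""
    logits = generator(z_enc, x)                                  # teacher-forced decoding, (B, T, vocab)
    rec = F.cross_entropy(logits.flatten(0, 1), x.flatten())      # reconstruction loss L_rec
    fool = F.binary_cross_entropy(discriminator(z_enc), torch.ones_like(z_enc[:, :1]))
    return rec + lambda_adv * fool

# One training step (shapes assumed): x, x_tilde are token-id tensors.
# z_enc = encoder(x_tilde); z_prior = torch.randn_like(z_enc)
# update D with discriminator_loss(...), then E and G with encoder_generator_loss(...)
```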

3.3. Decoding Text from Image Classifier

In this section, we show how text can be decoded based on the latent representation of the visual classifier. The focus is on training a classifier to count up to five people. At the same time, we want to generate text from the classifier embedding. In this way, the text describes the classifier’s interpretation of the radar data. The text generator will be trained independently of the classifier. The overall architecture of the visual classifier and text decoder is shown in Figure 3.
The visual classifier consists of a CNN-based encoder in which each block processes the macro- and micro-RDI individually and then applies a cross-convolution along the combined features. The radar image encoder is followed by two linear layers that encode the mean and standard deviation of the Gaussian latent representation. The visual encoder is denoted $E_v$.
With this design, the visual classifier has the same latent structure as the DAAE. Furthermore, the joint latent structure enables the reconstruction of radar scene descriptions from the visual latent space using the decoder of the pretrained DAAE. Finally, a classification layer predicts the number of people in front of the radar sensor.
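A minimal sketch of this latent head is shown below, assuming a generic backbone that maps the two-channel radar input to a flat feature vector; the layer sizes are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class RadarClassifierHead(nn.Module):
    """CNN backbone -> Gaussian latent (mean, std) -> reparameterized sample z -> class logits."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 256, latent_dim: int = 128, n_classes: int = 6):
        super().__init__()
        self.backbone = backbone                        # e.g., cross-convolution blocks over macro/micro RDIs
        self.fc_mu = nn.Linear(feat_dim, latent_dim)
        self.fc_logstd = nn.Linear(feat_dim, latent_dim)
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, x):
        h = self.backbone(x)                            # (B, feat_dim)
        mu, logstd = self.fc_mu(h), self.fc_logstd(h)
        z = mu + torch.randn_like(mu) * logstd.exp()    # reparameterized Gaussian latent E_v(x)
        return self.classifier(z), z                    # class logits and latent for text decoding
```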
Given the stochastic nature of classifier training and imperfect classification performance, the visual latent representation is a noisy version of the original input representation. The training goal is to match the semantic description $\tilde{x}$ with the latent interpretation $z$ of the radar input. This is achieved with the following loss functions:
$\min_{E_v, G} \max_{D} \; \mathcal{L}_{cls}(\theta_{E_v}) + \mathcal{L}_{rec}(\theta_{E_v}, \theta_G) - \lambda \mathcal{L}_{adv}(\theta_{E_v}, \theta_D),$ (8)
with
$\mathcal{L}_{cls}(\theta_{E_v}) = \mathbb{E}_{p(x)}\left[ -\log p_{cls}\left( c \mid E_v(x) \right) \right]$ (9)
$\mathcal{L}_{rec}(\theta_G) = \mathbb{E}_{p(x, \tilde{x})}\left[ -\log p_G\left( \tilde{x} \mid \bar{E}_v(x) \right) \right]$ (10)
$\mathcal{L}_{adv}(\theta_{E_v}, \theta_D) = \mathbb{E}_{p(z)}\left[ -\log D(z) \right] + \mathbb{E}_{p(\tilde{x})}\left[ -\log\left( 1 - D(E_v(x)) \right) \right],$ (11)
where $\bar{E}_v$ denotes the encoder with the stop-gradient operator applied. The proposed approach trains the vision encoder with the classification loss ($\mathcal{L}_{cls}$) and enforces the Gaussian latent structure with adversarial training ($\mathcal{L}_{adv}$) using the pretrained discriminator from the DAAE. The reconstruction loss can be interpreted as the alignment between radar and text features; however, it is only used to fine-tune the text decoder without training the classifier. Since the focus is on the interpretation of the visual classifier, the classifier should not be influenced by the text generator. Only the discriminator's loss is necessary to enforce a common latent space structure. The Gaussian constraint must be carefully tuned, as we expect it to reduce classification performance.
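A hedged PyTorch sketch of this joint objective is given below; the stop gradient is realized with detach(), the generator and discriminator stand in for the pretrained DAAE modules from Section 3.2, and all interfaces as well as the choice of λ are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def joint_losses(model, generator, discriminator, radar, labels, captions, lambda_adv=1.0):
    """model returns (class logits, latent z); D is assumed to output probabilities of shape (B, 1)."""
    logits, z = model(radar)
    cls_loss = F.cross_entropy(logits, labels)                       # L_cls trains the encoder/classifier

    # L_rec on the detached latent: fine-tunes the text decoder only (stop gradient on E_v).
    cap_logits = generator(z.detach(), captions)                     # (B, T, vocab)
    rec_loss = F.cross_entropy(cap_logits.flatten(0, 1), captions.flatten())

    # L_adv: the encoder tries to make its latent look like a prior sample to the pretrained D.
    adv_loss = F.binary_cross_entropy(discriminator(z), torch.ones_like(z[:, :1]))

    return cls_loss + rec_loss + lambda_adv * adv_loss
```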

4. Experiments

In this section, we first review the implementation settings. Afterward, we quantitatively benchmark the joint vision–language training with generated captions on an industrial radar dataset. Finally, we demonstrate the descriptive capabilities of the generated text.

4.1. Implementation Settings

In the implementation, we used PyTorch-GPU v2.4.0 with the CUDA® Toolkit v11.8.0 and cuDNN v8.5.0. As hardware, we used a single Nvidia® Tesla® P40 GPU, an Intel® Core i7-8700K CPU, and a 16 GB DDR4-3000 DIMM of RAM.
The DAAE is pre-trained on a review dataset from websites like Yelp and TripAdvisor [44]. The configuration and dataset are the same as those described in [40]. For the visual classifier, we used three cross-convolution blocks described in [45] as the backbone, with two connected dense layers for the mean and variance of the Gaussian latent representation and a final dense layer for the classification.
For joint training, we keep the vocabulary of the pre-trained DAAE, comprising 10,000 words; captions may therefore contain words that are not part of the vocabulary, and such words are mapped to related in-vocabulary words. The joint architecture is trained with the Stochastic Gradient Descent (SGD) optimizer and an initial learning rate of 0.05. The learning rate decays after 5, 15, and 25 epochs with a decay rate of 0.1. The networks are trained with 64 samples per batch on a single GPU for 50 epochs. The classification results are averaged over five different seeds.
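The stated schedule corresponds to a standard step decay; the sketch below is runnable with a stand-in model, and only the optimizer and scheduler settings are taken from the text.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 6)  # stand-in for the radar classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[5, 15, 25], gamma=0.1)

for epoch in range(50):
    # ... one pass over the training set with batches of 64 samples ...
    scheduler.step()
    print(epoch + 1, scheduler.get_last_lr())   # learning rate: 0.05 -> 0.005 -> 5e-4 -> 5e-5
```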
The implementation is published on GitHub, together with a reproducible version of our method on publicly available image classification datasets and automatically generated captions: https://github.com/juliusott/talk-to-your-classifier (accessed on 12 June 2025).

4.2. Results

This section presents the experimental results of the proposed joint vision–language model. First, the classification results of the proposed method on the industrial radar dataset (see Figure 4) are shown. In addition, we demonstrate the effects of different scaling factors of the adversarial training objective that enforces the embedding space constraint. Second, we present the decoded text from the classifier embeddings, with a focus on cases where the classifier prediction is wrong.

4.2.1. Classification

In sensor applications, the emphasis is on performance. As shown in Equation (8), the Gaussian constraint is enforced by the adversarial loss with scaling factor $\lambda_{adv}$. The ablation study compares scaling factors $\lambda_{adv} = 1, 10, 20$ to measure the interplay between the Gaussian constraint and text reconstruction. As a baseline, we use an end-to-end classifier of the same architecture without the Gaussian latent space. Note that the classifier is not influenced by the text reconstruction loss; hence, we only ablate the weight of the Gaussian constraint. The results illustrate that the classifier performance is minimally affected by the Gaussian latent space constraint and achieves the same accuracy as the baseline. However, giving more importance to the Gaussian structure results in slower convergence.
A similar conclusion can be drawn from Table 2, which shows high overall performance that drops when the focus on the Gaussian constraint is too high. At the same time, we calculate the ROUGE score, which assesses how often words appear in both the reference and the model output. In particular, we consider the ROUGE-L score, since it measures the longest common subsequence between generated texts and references.
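For illustration, a minimal implementation of the ROUGE-L idea (in its F1 form, not the exact evaluation script used for Table 2) is shown below.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence between two word lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a, 1):
        for j, wb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if wa == wb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference: str, generated: str) -> float:
    ref, gen = reference.split(), generated.split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(gen)
    return 2 * recall * precision / (recall + precision)

print(rouge_l("three people sitting", "three sitting"))  # 0.8
```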
Table 2 shows how $\lambda_{adv}$ trades off classification accuracy and text quality. The best classification accuracy is achieved by reducing the Gaussian constraint. However, the ROUGE-L score shows how classification performance influences text quality, which is reasonable since ambiguous features lead to poor text descriptions. The ROUGE-L score is highest for $\lambda_{adv} = 1$ and drops when $\lambda_{adv}$ is decreased or increased: it decreases for $\lambda_{adv} = 10$ and rises again for $\lambda_{adv} = 20$. The latter can be explained by the neglect of the classifier and the heavy focus on the discriminator's loss; the high score for the largest $\lambda_{adv}$ is attributed to simple words like "people", which appear in all text descriptions.
The best $\lambda_{adv}$ depends on the application; in our case, $\lambda_{adv} = 1$ achieves the highest classification performance and the best ROUGE-L score.

4.2.2. Decoded Scene Descriptions

Besides the classification performance, the main focus of this work is to provide a method for generating text descriptions that increase the interpretability of the classifier.
High-dimensional classifier embeddings are projected via t-distributed stochastic neighbor embedding (TSNE) [46]. In Figure 5, we examine the first and second TSNE components of the predicted latent representation for three-person radar images, with $\lambda_{adv} = 1$ during training. The visualization focuses on the different movements that the three targets perform, namely, "walking," "sitting," and "sitting and walking." As expected, the sitting and the walking people are well separated, while the combination of walking and sitting people overlaps with the others. However, we also observe that different movements are sometimes very close in the classifier's embedding space. This finding underlines that certain movements can have overlapping frequency signatures. To evaluate the descriptions, we consider examples where the classifier predicted the wrong number of people as well as the correct number of people. Figure 5 shows four predicted targets in red and five predicted targets in blue, while the actual number of people is still three.
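A sketch of how such a projection can be produced is given below; the latent vectors and predictions are random stand-ins for the classifier outputs.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

latents = np.random.randn(500, 128)          # stand-in for classifier latents E_v(x) of three-person frames
preds = np.random.randint(3, 6, size=500)    # stand-in for the predicted person counts

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(latents)

for c in np.unique(preds):
    mask = preds == c
    plt.scatter(emb[mask, 0], emb[mask, 1], s=5, label=f"predicted {c}")
plt.xlabel("TSNE component 1")
plt.ylabel("TSNE component 2")
plt.legend()
plt.show()
```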
Table 3 shows the different types of text descriptions that were observed. We compare the ground truth description with the reconstructed description and the predicted label. The main observation is that the reconstructed descriptions are aligned with the classifier’s prediction, which allows the interpretation of the classifier’s reasoning. There are also cases where the text description and classifier do not match.
These scenarios can be summarized into two cases: First, the text description is correct with respect to the ground truth, but the classifier is wrong; second, both predict wrong labels. In the latter case, the text description does not mention the number of people but describes the movements in the scene. The few mismatches can be explained by the interpretation of the latent space and the separation between classification and text reconstruction loss functions.
Overall, the different text descriptions provide a meaningful visual interpretation. To further illustrate the capabilities of the presented approach, the experiments are extended to image datasets in Appendix A.

5. Conclusions

In this article, we have shown how a DAAE can be used in combination with a visual classifier to provide instance-level interpretability while maintaining a classification performance of 98.3%. To achieve this, we recorded a large industrial radar dataset that takes different movements of the targets into account. Radar data particularly benefit from latent space interpretability because the processed radar data are not human-interpretable. The presented results show that the joint training goal of text reconstruction and classification not only preserves classification performance but also makes the classifier's reasoning interpretable. In addition, the generated text matches the predicted class and provides meaningful descriptions of the radar input. Furthermore, this method can be generalized across multi-domain tasks that may include video information as well. Finally, future research will focus on unreliable state detection.

Author Contributions

Methodology, J.O. and H.S.; Software, J.O.; Validation, H.S.; Writing—original draft, J.O.; Writing—review and editing, H.S.; Supervision, L.S. and R.W.; Project administration, R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets presented in this article are not available because of company policies.

Conflicts of Interest

Julius Ott was employed by Infineon Technologies AG. Huawei Sun was employed by Infineon Technologies AG. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Image Evaluation

Appendix A.1. Generating Image Captions

Image classification datasets do not normally include image-wise semantic descriptions. The first option would be to use the class names and insert them into a prompt like “The photograph shows a (class)”, as is carried out in [14]. However, these prompts do not cover much information about the image. Thus, we consider image captioning models such as BLIP [13] and GIT [47], which are pre-trained on a corpus of four million and fourteen million image–text pairs, respectively. Both models have a “base” and “large” configuration with regard to the number of parameters, and they are publicly available in [48]. Although such models are not specifically trained on CIFAR 10/100 images, the captions describe the images with high detail, as revealed in Figure A1.
The captions contain important information about the image even though the actual class tag is missing. Notably, the detail varies between the captioning models, which will be considered in the experiments.
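A hedged sketch of how such pseudo-captions can be generated with the Hugging Face pipeline API [48] is shown below; the checkpoint identifiers ("Salesforce/blip-image-captioning-base", "microsoft/git-base") are assumptions about the public hub names rather than details taken from the paper.

```python
from transformers import pipeline
from torchvision.datasets import CIFAR10

# "image-to-text" pipeline; swap the model id for "microsoft/git-base" to use GIT instead of BLIP.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
dataset = CIFAR10(root="./data", train=True, download=True)

img, label = dataset[0]                          # PIL image and class index
caption = captioner(img)[0]["generated_text"]    # free-form pseudo-caption for this image
print(label, caption)
```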
Figure A1. Examples of generated captions for CIFAR 10 images for different image captioning models. The description is more detailed than the classes "dog" and "airplane" but varies for different models.

Appendix A.2. Training Setup

For the visual classifier, we used a ResNet50 [49] backbone with two dense layers of 128 units each for the mean and standard deviation of the Gaussian latent representation. A final dense layer with softmax activation carries out the classification. The joint architecture is trained with the SGD optimizer and an initial learning rate of 0.05. The learning rate decays after 350, 400, and 450 epochs with a decay rate of 0.1. The networks are trained with 256 samples per batch on a single GPU for 500 epochs. The classification results are averaged over three different seeds.
The explanation of the classification from the decoded text should not only give accurate text for correctly predicted tasks but also provide background information or descriptions of the object. In Figure A2, we present examples of the generated text for the CIFAR 10/100 test dataset at different epochs in the training stage. In most of the test instances, the generated text aligns with the predicted class and provides further information on the classification process. The second example of Figure A2 shows the remarkable value that SupAText creates in the case of misclassification: After 50 epochs, we can deduce that “deer” and “horse” are close in the latent representation, since these are predictions from the same visual latent vector. Afterward, the text and the classifier prediction align, while the generated prediction provides further background information, which allows further reasoning about the classification process. In Figure A3, the same evaluation is performed on CIFAR 100 examples. The text provides insights about the focus of the classifier, such as the color of the rose and the red-white design of the Coca-Cola can. The central image of the butterfly is an example of how the text aligns with the classifier predictions, where features between the cat and possum such as black and white skin match, although the species is incorrect.
Regarding the generation of text from different latent positions, Figure A4 illustrates the first and second components of TSNE [46] for the aquarium fish class and the generated text, respectively. The embeddings with a similar first component in the top left and top right corner generate a similar text description that aligns with the predicted class. On the opposite end, the embeddings in the bottom left and bottom right corners show a misalignment between the predicted class and the generated text. In particular, the generated text in the bottom right corner is at the edge of the cluster, underlining the interpretation capabilities of the generated text.
Figure A2. Evolution of generated text for three examples of the CIFAR 10 test set after 50, 150, and 200 epochs. Predicted classes highlighted in red and green refer to incorrect and correct predictions, respectively. The correct class is shown in blue on the left side of the input image.
Figure A3. Text generated for three examples of the CIFAR 100 test set.
Table A1. Classification performance of the proposed supervised method with text. The results are evaluated on the CIFAR 10 and CIFAR 100 image classification benchmarks with different image captioning models for the pseudo-label generation.

Models | CIFAR 10 | CIFAR 100
Cross-entropy | 95% | 75.3%
Cross-entropy with Gaussian latent | 92.7% | 69.1%
SupAText (GIT-Base) | 94.78% | 75.33%
SupAText (GIT-Large) | 95.23% | 75.23%
SupAText (BLIP-Base) | 95.32% | 74.89%
SupAText (BLIP-Large) | 95.2% | 75.5%
Figure A4. TSNE embedding of class "aquarium fish" and generated descriptions with reference captions from BLIP-Large.

References

  1. Adib, F.; Katabi, D. See through Walls with WiFi! In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, SIGCOMM ’13, New York, NY, USA, 27 October–1 November 2013; pp. 75–86. [Google Scholar]
  2. Sirmacek, B.; Riveiro, M. Occupancy Prediction Using Low-Cost and Low-Resolution Heat Sensors for Smart Offices. Sensors 2020, 20, 5497. [Google Scholar] [CrossRef] [PubMed]
  3. Günter, A.; Böker, S.; König, M.; Hoffmann, M. Privacy-preserving people detection enabled by solid state LiDAR. In Proceedings of the 2020 16th International Conference on Intelligent Environments (IE), Madrid, Spain, 20–23 July 2020; pp. 1–4. [Google Scholar]
  4. Wang, C.; Zhang, H.; Yang, L.; Liu, S.; Cao, X. Deep people counting in extremely dense crowds. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 1299–1302. [Google Scholar]
  5. Rahman, M.M.; Yataka, R.; Kato, S.; Wang, P.; Li, P.; Cardace, A.; Boufounos, P. MMVR: Millimeter-Wave Multi-view Radar Dataset and Benchmark for Indoor Perception. In Computer Vision—ECCV 2024, 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXXIX; Springer: Cham, Switzerland, 2024; pp. 306–322. [Google Scholar]
  6. Wu, Z.; Zhang, D.; Xie, C.; Yu, C.; Chen, J.; Hu, Y.; Chen, Y. RFMask: A simple baseline for human silhouette segmentation with radio signals. IEEE Trans. Multimed. 2022, 25, 4730–4741. [Google Scholar] [CrossRef]
  7. Lee, S.P.; Kini, N.P.; Peng, W.H.; Ma, C.W.; Hwang, J.N. Hupr: A benchmark for human pose estimation using millimeter wave radar. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 5715–5724. [Google Scholar]
  8. Paek, D.H.; Kong, S.H.; Wijaya, K.T. K-radar: 4d radar object detection for autonomous driving in various weather conditions. In Proceedings of the 35th Conference on Neural Information Processing Systems (NIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; pp. 3819–3829. [Google Scholar]
  9. Egmont-Petersen, M.; de Ridder, D.; Handels, H. Image processing with neural networks—A review. Pattern Recognit. 2002, 35, 2279–2301. [Google Scholar] [CrossRef]
  10. Stephan, M.; Hazra, S.; Santra, A.; Weigel, R.; Fischer, G. People Counting Solution Using an FMCW Radar with Knowledge Distillation From Camera Data. In Proceedings of the 2021 IEEE Sensors, Virtual, 31 October–4 November 2021; pp. 1–4. [Google Scholar] [CrossRef]
  11. Mauro, G.; Martinez-Rodriguez, I.; Ott, J.; Servadei, L.; Wille, R.; Cuellar, M.P.; Morales-Santos, D. Context-adaptable radar-based people counting via few-shot learning. Appl. Intell. 2023, 53, 25359–25387. [Google Scholar] [CrossRef]
  12. Sun, H.; Servadei, L.; Feng, H.; Stephan, M.; Santra, A.; Wille, R. Utilizing explainable ai for improving the performance of neural networks. In Proceedings of the 2022 21st IEEE International Conference on Machine Learning and Applications (ICMLA), Nassau, Bahamas, 12–14 December 2022; pp. 1775–1782. [Google Scholar]
  13. Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
  14. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  15. Fan, L.; Li, T.; Yuan, Y.; Katabi, D. In-home daily-life captioning using radio signals. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020, Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2020; pp. 105–123. [Google Scholar]
  16. He, J.; Spokoyny, D.; Neubig, G.; Berg-Kirkpatrick, T. Lagging inference networks and posterior collapse in variational autoencoders. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019. [Google Scholar]
  17. Zhao, J.; Kim, Y.; Zhang, K.; Rush, A.; LeCun, Y. Adversarially regularized autoencoders. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 5902–5911. [Google Scholar]
  18. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  19. Ribeiro, M.T.; Singh, S.; Guestrin, C. Why should I trust you? Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
  20. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  21. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929. [Google Scholar]
  22. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2–29 October 2017; pp. 618–626. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  24. Jolliffe, I.T. Principal Component Analysis and Factor Analysis. In Principal Component Analysis; Springer: New York, NY, USA, 1986; pp. 115–128. [Google Scholar] [CrossRef]
  25. Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 2000, 13, 411–430. [Google Scholar] [CrossRef] [PubMed]
  26. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Proceedings of the 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 5–6 December 2016. [Google Scholar]
  27. Zhang, Q.; Wu, Y.N.; Zhu, S.C. Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8827–8836. [Google Scholar]
  28. Socher, R.; Ganjoo, M.; Manning, C.D.; Ng, A. Zero-shot learning through cross-modal transfer. In Proceedings of the 27th Conference on Neural Information Processing Systems (NIPS 2013), Lake Tahoe, NV, USA, 5–8 December 2013. [Google Scholar]
  29. Lai, Z.; Yang, J.; Xia, S.; Lin, L.; Sun, L.; Wang, R.; Liu, J.; Wu, Q.; Pei, L. RadarLLM: Empowering Large Language Models to Understand Human Motion from Millimeter-wave Point Cloud Sequence. arXiv 2025, arXiv:2504.09862. [Google Scholar]
  30. Li, C.; Xu, H.; Tian, J.; Wang, W.; Yan, M.; Bi, B.; Ye, J.; Chen, H.; Xu, G.; Cao, Z.; et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv 2022, arXiv:2205.12005. [Google Scholar]
  31. Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
  32. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  33. Tsimpoukelli, M.; Menick, J.L.; Cabi, S.; Eslami, S.M.; Vinyals, O.; Hill, F. Multimodal few-shot learning with frozen language models. Adv. Neural Inf. Process. Syst. 2021, 34, 200–212. [Google Scholar]
  34. Geva, M.; Caciularu, A.; Dar, G.; Roit, P.; Sadde, S.; Shlain, M.; Tamir, B.; Goldberg, Y. Lm-debugger: An interactive tool for inspection and intervention in transformer-based language models. arXiv 2022, arXiv:2204.12130. [Google Scholar]
  35. Grondahl, T.; Asokan, N. EAT2seq: A generic framework for controlled sentence transformation without task-specific training. arXiv 2019, arXiv:1902.09381. [Google Scholar]
  36. Gu, Y.; Feng, X.; Ma, S.; Zhang, L.; Gong, H.; Zhong, W.; Qin, B. Controllable Text Generation via Probability Density Estimation in the Latent Space. arXiv 2022, arXiv:2212.08307. [Google Scholar]
  37. Li, C.; Gao, X.; Li, Y.; Peng, B.; Li, X.; Zhang, Y.; Gao, J. Optimus: Organizing sentences via pre-trained modeling of a latent space. arXiv 2020, arXiv:2004.04092. [Google Scholar]
  38. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Tech Rep. 2019, 1, 9. [Google Scholar]
  39. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  40. Shen, T.; Mueller, J.; Barzilay, R.; Jaakkola, T. Educating text autoencoders: Latent representation guidance via denoising. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 8719–8729. [Google Scholar]
  41. Milovanović, V. On fundamental operating principles and range-doppler estimation in monolithic frequency-modulated continuous-wave radar sensors. Facta Univ. Ser. Electron. Energetics 2018, 31, 547–570. [Google Scholar] [CrossRef]
  42. Santra, A.; Vagarappan Ulaganathan, R.; Finke, T. Short-Range Millimetric-Wave Radar System for Occupancy Sensing Application. IEEE Sens. Lett. 2018, 2, 7000704. [Google Scholar] [CrossRef]
  43. Creswell, A.; Bharath, A.A. Denoising adversarial autoencoders. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 968–984. [Google Scholar] [CrossRef] [PubMed]
  44. Asghar, N. Yelp dataset challenge: Review rating prediction. arXiv 2016, arXiv:1605.05362. [Google Scholar]
  45. Servadei, L.; Sun, H.; Ott, J.; Stephan, M.; Hazra, S.; Stadelmayer, T.; Lopera, D.S.; Wille, R.; Santra, A. Label-Aware Ranked Loss for robust People Counting using Automotive in-cabin Radar. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 3883–3887. [Google Scholar]
  46. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  47. Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. Git: A generative image-to-text transformer for vision and language. arXiv 2022, arXiv:2205.14100. [Google Scholar]
  48. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Figure 1. Infineon's FMCW radar system BGT60TR13C with a size of 6.5 × 5.0 × 0.9 mm³. The system is based on a microcontroller that filters the radar signal, digitizes it, and transmits it to a computer via micro-USB.
Figure 2. Radar processing steps from recorded radar frames to final macro–micro-RDIs. The steps in blue refer to the micro-movements and the steps in yellow refer to the macro-movements. The changing data form is mentioned after each processing step.
Figure 3. Joint architecture for text generation from a visual latent space. The vision encoder generates the mean and standard deviation of the Gaussian latent space. After reparametrization, the embedding is used for classification and text generation, while the discriminator makes sure the latent space is from a standard normal distribution. The yellow building blocks are pretrained parts from the DAAE.
Figure 4. Illustration of the recording setup. The recorded scenes capture different numbers of people in an office environment. The sensor is mounted in front of the camera. Below the camera images, we show the corresponding Range-Doppler image and the generated caption.
Figure 5. Visualization of the first two TSNE components of the validation dataset where three people are present in a scene.
Table 1. Detailed radar configuration used throughout this work.

Symbol | Parameter | Value
$f_0$ | Center frequency | 60 GHz
$B$ | Bandwidth | 60.5–61.5 GHz
$N_s$ | Number of samples per chirp | 128
$N_c$ | Number of chirps | 64
$f_s$ | ADC sampling frequency | 2 MHz
$t_c$ | Chirp time duration | 390 μs
$t_s$ | Frame repetition time | 0.05 s
$N_{rx}$ | Number of receiving antennas | 3
Table 2. Test accuracy for different adversarial scaling factors and the respective ROUGE-L scores for text evaluation.

Model | Test Accuracy (%) | ROUGE-L
Classifier (baseline) | 98.31 | -
Classifier + DAAE ($\lambda_{adv} = 0.1$) | 98.33 | 30.1
Classifier + DAAE ($\lambda_{adv} = 1$) | 98.33 | 30.8
Classifier + DAAE ($\lambda_{adv} = 10$) | 98.23 | 26.64
Classifier + DAAE ($\lambda_{adv} = 20$) | 98.02 | 30.57
Table 3. Comparison of ground truth text, reconstructed text, and the predicted label.

Ground Truth Text | Reconstructed | Classifier Prediction
"three people sitting" | "four walking" | 4
"three people sitting" | "three sitting" | 4
"three people sitting" | "four walking" | 5
"three people walking" | "five walking" | 5
"three people sitting" | "five sitting" | 5
"three people sitting" | "person sitting and walking" | 5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
