Article
Peer-Review Record

From Questionnaires to Heatmaps: Visual Classification and Interpretation of Quantitative Response Data Using Convolutional Neural Networks

Appl. Sci. 2025, 15(19), 10642; https://doi.org/10.3390/app151910642
by Michael Woelk 1,2,†, Modelice Nam 3,*,†, Björn Häckel 4 and Matthias Spörrle 3,5,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 31 August 2025 / Revised: 26 September 2025 / Accepted: 30 September 2025 / Published: 1 October 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes to classify structured tabular data by first transforming them into graphical representations and then performing image-level mappings with a CNN. The idea is original and interesting, and it paves a new way for analyzing and understanding quantitative tabular data. Experiments are conducted on a real-world dataset to verify the effectiveness of the proposed method as well as its superiority over the traditional logistic regression approach. The proposed method achieves promising results under the visual forms of both bar charts and pie charts, and shows good interpretability and visualization quality. However, I still have some concerns about the rationale of the designed technique and the experimental results. Below are the detailed comments and suggestions.

-As far as I am concerned, the transformation of tabular data into visual images will introduce redundant and irrelevant information, and thus may increase data complexity. Although the authors state that this transformation may improve the visualization quality and explainability of the model, I am not fully convinced of this key innovation. Logistic regression cannot perform nonlinear mappings between the tabular data and binary classification results, but there are still many models, such as one-dimensional CNNs and Transformers, that can take tabular data as input and output binary classifications through nonlinear mappings and global modeling. This would keep data complexity low while maintaining high classification performance. The authors should further clarify the rationale of their proposed idea.

-Why are the activation heat maps in Figure 7 and Figure 8 so different given the same bar chart? Does this mean that the 1×Conv and 2×Conv models provide totally different explanations for the same input bar chart? If so, I doubt the stability of CNNs for explaining the bar charts.

-The loss and accuracy curves in Figure 9 look somewhat strange. They seem to remain nearly constant for most of the training and show a sudden drop or jump near the 200th epoch. Why does this kind of discontinuity appear? Does it mean that the training process of the designed network is unstable and difficult to converge?

-The mathematical meaning of each column in Table 2 should be explained in detail.

-There are some typos in the manuscript which should be corrected. For example, ‘highlight-ed’ on page 16 and ‘ist’ on page 18.

Author Response

Letter to the Editor

 

Dear Editorial Board of Applied Sciences,

Dear Ms. Su,

 

We would like to thank you very much for the opportunity to revise our manuscript.

We have addressed each of the comments made by the reviewers, and we are confident that this has substantially improved the manuscript.

 

Below, we describe how we have addressed each comment and how this has shaped the manuscript. The main changes include the following:

  1. Clearer statement of novelty and contribution in the Abstract, Introduction, and Conclusion sections,
  2. Expanded methodological details: dataset source and size, class distribution, preprocessing, train/test splits, visual encoding process, hyperparameters, and CAM integration,
  3. Addition of supplementary evaluation metrics (precision, recall, macro-F1, Brier score, ECE),
  4. Clarifications on interpretability: CAMs described in detail and interpretability claims softened where appropriate,
  5. Robustness analysis (randomized item order),
  6. Improved presentation: actual accuracy values reported, Appendix A1 added with code–survey item mapping and updated table captions,
  7. Limitations expanded to acknowledge visualisation design choices, computational cost, figure resolution constraints, and applicability boundaries.

 

We would like to thank you and the reviewers for providing us with valuable feedback.

We are looking forward to hearing back from you.

 

With kind regards

Michael Woelk

Modelice Nam

Björn Häckel

Matthias Spörrle

Comments to Reviewer #1

 

This paper proposes to classify structured tabular data by first transforming them into graphical representations and then performing image-level mappings with a CNN. The idea is original and interesting, and it paves a new way for analyzing and understanding quantitative tabular data. Experiments are conducted on a real-world dataset to verify the effectiveness of the proposed method as well as its superiority over the traditional logistic regression approach. The proposed method achieves promising results under the visual forms of both bar charts and pie charts, and shows good interpretability and visualization quality.

Authors: We would like to thank you for taking the time to review our manuscript and for your comments. We have addressed each of your comments and believe that following your and the other reviewers’ suggestions has substantially improved the manuscript.

 

However, I still have some concerns about the rationale of the designed technique and the experimental results. Below are the detailed comments and suggestions:

-          As far as I am concerned, the transformation of tabular data into visual images will introduce redundant and irrelevant information, and thus may increase data complexity. Although the authors state that this transformation may improve the visualization quality and explainability of the model, I am not fully convinced of this key innovation. Logistic regression cannot perform nonlinear mappings between the tabular data and binary classification results, but there are still many models, such as one-dimensional CNNs and Transformers, that can take tabular data as input and output binary classifications through nonlinear mappings and global modelling. This would keep data complexity low while maintaining high classification performance. The authors should further clarify the rationale of their proposed idea.

Authors: We acknowledge this important point. While direct tabular models (e.g. Transformers) are efficient and powerful, our approach addresses a different need: an intuitive visual aid that contributes to interpretability. By transforming each record into a chart and applying CNN–CAM, we obtained per-respondent heatmaps that highlighted which dimensions influenced the prediction. This provides an intuitively accessible explanation that tabular models typically cannot offer. Moreover, our results show that CNN–CAM models achieve classification accuracy comparable to logistic regression (93.16% vs. 93.19%), demonstrating that visual transformation and resulting explainability do not compromise predictive performance. Thus, the additional complexity yields a unique advantage of bridging predictive performance with interpretability.

In the Introduction (Section 1), we have added the following text to make this clearer: “While existing methods such as logistic regression, decision trees with SHAP or LIME, and 1D-CNNs or transformer models can already be used for tabular data, our approach differs in its combination of deterministic visual coding and simultaneous explainable classification at the individual response level. 1D CNNs or Transformer-based models can also process tabular inputs and capture nonlinear relationships (Ullah et al., 2022; Thielmann et al., 2025); however, they typically do not provide outputs that are readily interpretable for non-specialists.

Our contribution is a deterministic one-record-to-one-chart encoding combined with CNN–CAM analysis, producing per-respondent heatmaps within a lightweight, end-to-end workflow. This enables accessible, instance-level explanations while maintaining the predictive performance. Recent studies on image-based encodings of tabular data (e.g., Mamdouh et al., 2025; Alenizy & Berri, 2025) demonstrate the potential of this paradigm; however, our approach is distinct in that it integrates intuitive visual aids directly into the classification pipeline.”

In the Discussion (Section 5.2), we have now also clarified that “our approach is comparable in accuracy to logistic regression but adds value by integrating visual encoding with CNN-CAM to provide intuitive visual assistance. This extends recent work on tabular-to-image transformations with integrated explanatory power”.

 

-          Why are the activation heat maps in Figure 7 and Figure 8 so different given the same bar chart? Does this mean that the 1×Conv and 2×Conv models provide totally different explanations for the same input bar chart? If so, I doubt the stability of CNNs for explaining the bar charts.

Authors: The observed differences resulted from the architectural depth, not instability. The 1×Conv model emphasises localised bar length and colour features, whereas the 2×Conv model integrates broader spatial patterns. Both produce stable and reproducible explanations, albeit at different levels of abstraction.

We have clarified this now in the Materials and Methods section (Section 3.3): “The 1×Conv model was chosen as an efficient, lean baseline, whereas the 2×Conv model produced more robust results, especially with pie charts. This illustrates the trade-off between computational effort and accuracy.”

We have specified this in the Results (Section 4.1): “The observed differences in the activation maps reflect the representational depth of the respective model architectures. The 1×Conv model assigns saliency at the level of individual bars, thereby emphasizing localized features, whereas the 2×Conv model covers broader patterns. Despite these differences in granularity, both models produced identical predicted labels and exhibited highly similar confidence values for the illustrated case. This consistency indicates that the models arrive at convergent decisions while relying on distinct representational strategies. Accordingly, we interpret the resulting CAMs as complementary perspectives on the same underlying evidence, rather than as conflicting explanations. These variations mirror differences in abstraction inherent to the architectures and should not be construed as reflecting gains in predictive accuracy.”

 

-          The loss and accuracy curves in Figure 9 look somewhat strange. They seem to remain nearly constant for most of the training and show a sudden drop or jump near the 200th epoch. Why does this kind of discontinuity appear? Does it mean that the training process of the designed network is unstable and difficult to converge?

Authors: Thank you; the graph is correct but required further clarification. The sharp change in the later epochs is due to the greater complexity of the pie charts, indicating later convergence that nonetheless leads to stable final results.

We have added this explanation in the Results (Section 4.2): “The late increase in performance from approximately epoch 190 onward can be attributed to the greater structural complexity of the pie chart representations. In contrast to the linear ordering of bar charts, pie charts encode information in a circular layout, requiring the network to learn global radial dependencies. This complexity delays convergence until the later stages of training. While the learning curves of both pie chart models appear less smooth compared to those of the bar chart models, the overall training process was stable and ultimately convergent, yielding robust final results (87.88% with 1×Conv and 90.51% with 2×Conv).”

 

-          The mathematical meaning of each column in Table 2 should be explained in detail.

Authors: We followed your suggestion and provided further explanation, expanding the caption of Table 2 to explain the meaning of each statistic.

“Note: Logistic regression with β = regression coefficient, Std. Error = standard error of the coefficient, z = Wald statistic, p-value = coefficient significance, 95% CI = confidence interval for β. Target variable: recommendation value (no = 0; yes = 1).”

“[…] The regression coefficients (Table 2) indicate the relative importance of each category in predicting employer recommendations. Strong contributions were observed for variables such as working atmosphere (β = 0.57, p < 0.001), image (β = 0.49, p < 0.001), and supervisor behaviour (β = 0.46, p < 0.001). These results suggest that employees who perceive a supportive atmosphere, positive organizational image, and competent leadership are substantially more likely to recommend their employer.
In contrast, team spirit (β = –0.04, p = 0.073), equality (β = –0.01, p = 0.724), and treatment of older colleagues (β = –0.02, p = 0.423) were not significant predictors in this model, indicating that their contributions to employer recommendation were statistically negligible once other factors were accounted for. Although these coefficients provide statistically robust insights, their interpretation requires methodological expertise, making them less accessible to practitioners without a statistical background.”
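For readers who wish to reproduce a coefficient table with exactly these columns (β, standard error, Wald z, p-value, 95% CI), a minimal sketch using Python's statsmodels is given below; the predictors and data are invented stand-ins, not the actual survey dataset described in Section 3.1:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(55)

# Invented stand-in data: three Likert-scaled predictors (1-5) and a
# binary recommendation target, mimicking the structure of the survey.
X = pd.DataFrame(rng.integers(1, 6, size=(1000, 3)),
                 columns=["working_atmosphere", "image", "supervisor_behaviour"])
y = ((X.mean(axis=1) + rng.normal(0, 1, 1000)) > 3).astype(int)

model = sm.Logit(y, sm.add_constant(X)).fit(disp=0)

# Assemble the columns of Table 2: beta, Std. Error, z (Wald), p-value, 95% CI.
table = pd.DataFrame({
    "beta": model.params,
    "std_error": model.bse,
    "z": model.tvalues,
    "p_value": model.pvalues,
}).join(model.conf_int().rename(columns={0: "ci_lower", 1: "ci_upper"}))
print(table.round(3))
```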

 

-          There are some typos in the manuscript which should be corrected. For example, ‘highlight-ed’ on page 16 and ‘ist’ on page 18.

Authors: We have corrected these issues and carefully proofread the entire manuscript.

 

We would like to thank you again for all your valuable comments. We believe that incorporating your comments into our manuscript has substantially improved it.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper addresses an important problem: making the results of quantitative survey data more interpretable for non-specialists through visual transformation and CNN-based classification. The integration of Convolutional Neural Networks (CNNs) with Class Activation Maps (CAMs) is presented as a novel, domain-agnostic method that could bridge the gap between predictive accuracy and explainability. The topic is timely and relevant, particularly given the growing interest in explainable artificial intelligence. However, the paper currently lacks sufficient clarity in its motivation, methodological detail, and comparative evaluation. Significant revisions are needed to strengthen the contribution and improve the reproducibility and transparency of the work.

I suggest major revision based on my comments:

  • The novelty and added value of the work compared to existing methods (e.g., logistic regression, decision trees with SHAP, or LIME) are not well emphasized.
  • The motivation, challenges addressed, objectives, and implications are not clearly outlined, leaving the reader uncertain about the paper’s unique contribution.
  • The dataset’s source, characteristics, and class distribution are not described. Without this, the reproducibility of the study is severely limited.
  • The transformation from numerical responses to visual representations (bar charts, pie charts) is not described in sufficient detail. The labeling process is also omitted.
  • Information about training/test splits, sample size, and hyperparameter tuning is missing, making it impossible to evaluate the rigor of the approach.
  • The CAM methodology, its integration into the CNN, and the analysis of generated heatmaps are only superficially described. These are central to the paper but currently underdeveloped.
  • Without details of the dataset and extracted knowledge, the results primarily demonstrate CNN classification ability—already well-established in literature—rather than the claimed interpretability benefits.
  • Clearly state what is novel about the approach and how it advances beyond existing XAI techniques. Explicitly compare with other interpretability frameworks (e.g., SHAP, LIME, Grad-CAM applied to structured data).
  • Provide a structured outline of the problem, research gap, proposed solution, objectives, and expected significance. Explain why visual representations are a promising alternative to standard numerical approaches.
  • Specify whether the dataset was collected by the authors or obtained from a public repository (with proper citation). Include the number of records, predictor variables, target variable, class distribution, and preprocessing steps.
  • Detail the methodology: Explain the process of transforming responses into bar charts and pie charts: how many charts were created, what dimensions were chosen, and why. Describe the labeling procedure used for classification. Report the sample size for training, validation, and testing. Indicate whether hyperparameter optimization was performed.
  • Section 3.4 (CAM integration) and Section 4.4 (CAM-based analysis) should be expanded. Include technical details on how CAMs were overlaid on the generated images, and provide illustrative examples that demonstrate how practitioners can extract knowledge from these visualizations.
  • Provide a benchmark comparison not only with logistic regression but also with other explainable methods suitable for structured data. Discuss performance differences and trade-offs in interpretability.
  • Explicitly show how CNN-based classification contributes to extracting meaningful insights from questionnaire data, beyond simple predictive accuracy.
  • Transforming numerical responses into images may obscure fine-grained statistical relationships present in the original data.
  • The choice of visual representation (bar chart vs. pie chart) may affect model performance, introducing bias unrelated to the underlying data.
  • For very large surveys with many variables, creating image-based representations could be computationally expensive and inefficient compared to direct analysis of tabular data.
  • While CAMs highlight regions of an image, the connection between highlighted chart areas and the underlying survey constructs may remain abstract for domain experts.
  • The method’s effectiveness may vary depending on the type of survey and scale used, limiting its applicability across all quantitative response datasets.
  • Since CNNs are already known to perform well in classification tasks, the real contribution lies in interpretability; yet, this aspect remains underexplored.

Author Response

Comments to Reviewer #2

 

This paper addresses an important problem: making the results of quantitative survey data more interpretable for non-specialists through visual transformation and CNN-based classification. The integration of Convolutional Neural Networks (CNNs) with Class Activation Maps (CAMs) is presented as a novel, domain-agnostic method that could bridge the gap between predictive accuracy and explainability. The topic is timely and relevant, particularly given the growing interest in explainable artificial intelligence.

Authors: We thank the reviewer for the constructive and thoughtful comments, which have helped us improve the clarity and rigor of the manuscript. Below, we provide point-by-point responses and indicate the corresponding revisions.

 

However, the paper currently lacks sufficient clarity in its motivation, methodological detail, and comparative evaluation. Significant revisions are needed to strengthen the contribution and improve the reproducibility and transparency of the work. I suggest major revision based on my comments:

 

-          The novelty and added value of the work compared to existing methods (e.g., logistic regression, decision trees with SHAP, or LIME) are not well emphasized.

Authors: We agree that the original version did not clearly highlight our contribution. Our novelty lies in the deterministic one-record-to-one-chart encoding combined with per-respondent CAM explanations, which enables an intuitive visual aid that contributes to the interpretability of tabular data. Unlike SHAP or LIME, which provide post-hoc numerical explanations, our approach offers directly interpretable visual heatmaps aligned with widely understood chart formats (bar and pie charts).

In the Introduction (Section 1), we clarify that while existing methods such as logistic regression, SHAP or LIME, 1D-CNNs, and transformer models can be applied to tabular data, they either provide abstract statistical outputs or post-hoc explanations. In contrast, our approach integrates deterministic visual encoding with CNN–CAM analysis to deliver per-respondent, visually accessible explanations within a single workflow.

Moreover, we emphasised the novelty statement in the Discussion (Section 5.2), explicitly positioning our contribution relative to SHAP, LIME, and logistic regression: “Beyond predictive accuracy, the central novelty of our approach lies in combining deterministic one-record-to-one-chart encoding with per-respondent CAM explanations. Unlike logistic regression, which yields coefficients, or SHAP and LIME, which provide post-hoc feature importance values, our framework directly produces visual explanations that are accessible even to non-specialists. Compared with numerical models such as logistic regression, our approach does not aim to achieve superior accuracy. CNN–CAM (93.16%) and logistic regression (93.19%) performed similarly. Its advantages lie in the combination of visual encoding and instance-level interpretability. This positions our method alongside emerging research on tabular-to-image transformations (Mamdouh et al., 2025; Alenizy & Berri, 2025) while extending the contribution by offering integrated interpretability through CAMs.”

 

-          The motivation, challenges addressed, objectives, and implications are not clearly outlined, leaving the reader uncertain about the paper’s unique contribution. Furthermore, the manuscript suffers from several problems.

Authors: Thank you. In the revised manuscript, we clearly state that our work addresses the gap between predictive performance and interpretability in survey-based decision contexts, where stakeholders often lack statistical expertise.

We have included this in the Introduction (Section 1): “This addresses the challenge that non-specialists often cannot translate statistical coefficients or feature importance values into actionable insights, particularly in applied fields such as human resource management and healthcare. […] The implications of this contribution extend beyond predictive performance: the method aims to strengthen trust, transparency, and usability of AI systems in decision contexts where interpretability is essential.”

 

-          The dataset’s source, characteristics, and class distribution are not described. Without this, the reproducibility of the study is severely limited.

Authors: Thank you for highlighting this aspect. We have revised the Materials and Methods section (Section 3.1) to include the following details: “The dataset was acquired from a publicly accessible employer review platform and contained 65,563 records. Each record included thirteen features measured on a 5-point Likert scale, reflecting aspects of the work environment. […] The outcome variable is binary, indicating whether the employee recommends the employer (“no” = 0, “yes” = 1). […] Following the preprocessing and imputation of missing values, the class distribution was approximately balanced, ensuring that both outcome categories were adequately represented.”

 

-          The transformation from numerical responses to visual representations (bar charts, pie charts) is not described in sufficient detail. The labeling process is also omitted.

Authors: We have now provided more detail in the Materials and Methods (Section 3.2): “The main idea is to convert the ordinal property values into a graphical representation. In this study, two visualisation strategies were implemented and evaluated for comparative analysis.”

Bar charts. “This colour gradient follows the theoretical logic of a Likert scale, associating lower values with negative evaluations and higher values with positive evaluations in a way that is intuitive for both humans and CNNs. Alternative scales (e.g. blue to orange) were considered, but no improvement in model accuracy was observed.
[…] The image size of 65 × 65 pixels was chosen to ensure sufficient visual resolution to clearly recognise differences in bar length and colour intensity while keeping memory and computing requirements for CNN models low. Preliminary tests with larger formats (e.g. 128 × 128 pixels) showed no significant improvement in accuracy but resulted in significantly longer training times. The selected size therefore strikes a balance between an intuitive visual aid that contributes to interpretability and the computational effort.”

Pie charts. “[…] respective scale values. Fixed RGB assignments were used for each category to ensure consistency throughout the dataset. This representation emphasized proportional relationships but also introduced rotational symmetry, which we expected to challenge convergence during training.”

Labelling process. “Each generated chart was assigned the binary recommendation label directly from the target variable “employer recommendation” (“no” = 0, “yes” = 1). This ensured that both the bar and pie representations of the same record inherited the same outcome label.”
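To make the deterministic one-record-to-one-chart encoding concrete, the sketch below renders one invented respondent as both chart types using matplotlib. The red-to-green gradient, the fixed item order, and the 65 × 65 pixel output come from the text above; the exact hex palette and the per-item segment colours are our illustrative assumptions:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Illustrative red-to-green gradient for Likert values 1..5 (Section 3.2).
LIKERT_COLORS = {1: "#d73027", 2: "#fc8d59", 3: "#fee08b", 4: "#91cf60", 5: "#1a9850"}

def record_to_bar_chart(values, path):
    """Render one survey record (1-5 ratings in a fixed item order)
    as a 65x65 pixel bar chart without axes or labels."""
    fig, ax = plt.subplots(figsize=(1, 1), dpi=65)  # 1 inch * 65 dpi = 65 px
    ax.barh(range(len(values)), values,
            color=[LIKERT_COLORS[v] for v in values])
    ax.set_xlim(0, 5)
    ax.axis("off")
    fig.savefig(path, dpi=65)
    plt.close(fig)

def record_to_pie_chart(values, path):
    """Render the same record as a pie chart: one segment per item,
    sized by its scale value, with a fixed per-item colour."""
    fig, ax = plt.subplots(figsize=(1, 1), dpi=65)
    ax.pie(values, colors=plt.cm.tab20.colors[:len(values)])
    fig.savefig(path, dpi=65)
    plt.close(fig)

# One invented respondent with 13 Likert answers; the image label is taken
# from the "employer recommendation" field of the same record.
record = [4, 3, 5, 4, 2, 3, 4, 5, 3, 4, 4, 3, 2]
record_to_bar_chart(record, "respondent_0001_bar.png")
record_to_pie_chart(record, "respondent_0001_pie.png")
```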

 

-          Information about training/test splits, sample size, and hyperparameter tuning is missing, making it impossible to evaluate the rigor of the approach.

Authors: Thank you; we have addressed this issue in the Materials and Methods section (Section 3.3): “Two CNN architectures were implemented to classify the visually encoded survey data. The first, referred to as the 1×Conv model, comprised a single convolutional layer designed to capture local patterns, such as bar lengths and colour gradients, with minimal computational cost. The second, referred to as the 2×Conv model, incorporated two convolutional layers, enabling the extraction of more complex spatial relationships at the expense of increased training time. Each record was transformed into a bar chart and a pie chart. The bar charts were generated with a fixed vertical order of categories; the bar length corresponded to the respective Likert value, coloured in a gradient from red (1) to green (5). The pie charts were coded as segments, with each category assigned a fixed RGB colour. All images had a resolution of 65 × 65 pixels and were normalised in RGB. Each image directly adopted the label of the target variable “employer recommendation” (“no” = 0, “yes” = 1). The dataset was divided into 45,894 training images and 19,669 test images with a fixed random seed in a ratio of 70:30. Both architectures were trained using the Adam optimiser with cross-entropy loss, a batch size of 1,000, and 200 training epochs. The target variable was one-hot-encoded. The hyperparameters were set heuristically, following widely used defaults in CNN research, and were not systematically optimised. This limitation has been explicitly acknowledged in the Discussion (Section 5.5).

Data splitting was performed strictly at the level of individual respondents before image generation to ensure that no record appeared in both the training and testing sets. A fixed random seed (55) was applied to all the training runs to guarantee reproducibility. Training time varied substantially across architectures and visual encodings: while bar chart models converged within a few hours, the pie chart models required up to 61 hours, reflecting the higher complexity introduced by their radial symmetry.”
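For concreteness, the quoted training configuration (fixed seed 55, Adam, cross-entropy, batch size 1,000, 200 epochs) could be set up as in the following sketch. The deep-learning framework (PyTorch here) and the exact layer sizes are our assumptions for illustration; the manuscript excerpt does not prescribe them:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(55)  # fixed random seed, as reported in Section 3.3

# Illustrative 1xConv baseline: one convolutional layer over 65x65 RGB charts.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 2),  # two classes: recommendation no/yes
)

# Random stand-ins for the chart images; the respondent-level split is
# performed *before* image generation to avoid leakage (Section 3.3).
images = torch.rand(2000, 3, 65, 65)
labels = torch.randint(0, 2, (2000,))
loader = DataLoader(TensorDataset(images, labels), batch_size=1000, shuffle=True)

# CrossEntropyLoss with integer class indices is equivalent to training
# against the one-hot-encoded target described in the manuscript.
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

for epoch in range(200):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```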

We have also revised the Discussion (Section 5.5) by adding the following: “The selection of hyperparameters (e.g. learning rate, batch size, and number of epochs) was heuristic rather than systematically optimised. While these settings yielded stable results, more sophisticated approaches such as grid search, Bayesian optimization, or adaptive learning strategies could further improve performance and efficiency.”

 

-          The CAM methodology, its integration into the CNN, and the analysis of generated heatmaps are only superficially described. These are central to the paper but currently underdeveloped.

Authors: To resolve this important issue, the following revision has been incorporated into the Materials and Methods (Section 3.4): “For each prediction, the feature maps from this layer were combined with the corresponding classification weights to generate class-specific activations. The resulting activation maps were normalised to the interval [0,1], bilinearly upsampled to the original 65 × 65 pixel resolution, and overlaid as a heat map on the corresponding bar or pie chart using a jet colour scale. This procedure created intuitive heatmaps that highlighted the image regions most influential for the model’s decision, thereby enabling practitioners to identify, for example, which aspects (e.g. salary/social benefits, supervisor behaviour, work-life balance) contributed to a positive or negative recommendation. The CAM procedure was fully integrated into the pipeline, so that each prediction was accompanied by a respondent-level visual explanation.”
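A sketch of this procedure is given below, assuming a PyTorch model such as the one in the training sketch above. The hook-based Grad-CAM implementation is illustrative rather than the study's exact code; only the normalisation to [0, 1], the bilinear upsampling to 65 × 65 pixels, and the jet overlay come from the quoted passage:

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def grad_cam(model, conv_layer, image, target_class):
    """Grad-CAM for one chart image: returns a [0, 1] heatmap
    bilinearly upsampled to the 65x65 input resolution."""
    feats, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(image.unsqueeze(0))
    logits[0, target_class].backward()
    h1.remove(); h2.remove()

    # Weight each feature map by its average gradient, then keep positive evidence.
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))

    # Normalise to [0, 1] and bilinearly upsample to the input size.
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    cam = F.interpolate(cam, size=(65, 65), mode="bilinear", align_corners=False)
    return cam[0, 0].detach()

# Overlay on the chart with a jet colour scale (model/image from the sketch above):
# heat = grad_cam(model, model[0], image, target_class=1)
# plt.imshow(image.permute(1, 2, 0)); plt.imshow(heat, cmap="jet", alpha=0.4)
```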

 

-          Without details of the dataset and extracted knowledge, the results primarily demonstrate CNN classification ability—already well-established in literature—rather than the claimed interpretability benefits.

Authors: We appreciate this observation. We have updated the Results (from Section 4.1 to Section 4.4): “The colour scale, ranging from blue (low activation) to red (high activation), provides an intuitive means of identifying which scale values contributed most strongly to the decision.

Notably, the model does not limit its focus to extreme values but also emphasises categories with average or slightly deviating ratings – for example, in dimensions such as image (IMG), work-life balance (WLB), and equality (GBR). For users, this creates a tangible link between input structures and model decisions, which is a connection not available in this form with conventional approaches, such as logistic regression.

For example, for bar charts, in cases where employees gave low ratings for salary/social benefits (GSL) and supervisor behaviour (VGV), the CAMs clearly highlighted these bars as the decisive features driving the “non-recommendation” prediction. Conversely, when high ratings were present for the working atmosphere (AAP) and communication (KOM), CAMs emphasised these categories in positive classifications. These examples illustrate how the proposed approach extracts actionable knowledge from survey data by visually linking the model predictions to the specific dimensions of the questionnaire. This provides a layer of intuitive visual assistance that goes beyond reporting predictive accuracy and directly supports practitioners’ understanding.

Nevertheless, CAMs should be interpreted as evidence-based visual aids rather than as causal explanations. While they reliably show where the model focused, the link to underlying psychological or organizational constructs remains partly subjective and should be interpreted in light of domain knowledge (see Section 5.5).”

Discussion (Section 5.4): “This feature is particularly relevant in practice. For example, in human resource management, a heatmap that visually emphasises salary/social benefits, working atmosphere (AAP), or supervisor behaviour (VGV) as key drivers of a prediction can be more readily understood and communicated to decision-makers than regression coefficients or numerical importance values. In this way, the CNN–CAM framework does not merely replicate known CNN classification capabilities but adds an intuitive visual aid that contributes to interpretability by showing which exact survey dimensions underpin each individual classification. This bridges the gap between predictive modelling and practical insight, which is crucial for applied domains that require transparent and actionable outputs.”

Discussion (Section 5.5): “Third, although CAMs provide visually intuitive explanations, their interpretation remains partially subjective. While they reliably show where the model focuses, the link to the underlying psychological or organizational constructs cannot be inferred directly. A highlighted region in a heatmap indicates saliency for the CNN, but it does not prove that the corresponding survey dimension was causally decisive for the outcome. This distinction is particularly important in applied contexts, such as human resource management or healthcare, where stakeholders may be tempted to equate visual evidence with causal mechanisms. Consequently, CAM-based explanations should always be interpreted in conjunction with domain expertise.”

We also added the following in the Conclusion (Section 6): “The approach achieves high classification accuracy while also producing interpretable decision bases that are accessible to non-technical users, which is a crucial requirement in many real-world applications. More broadly, these qualities position this approach as a promising tool for data-driven decision-making in contexts where transparency, reliability, and efficiency are essential.

Importantly, the method does not only replicate known CNN classification capabilities but also extracts actionable knowledge from survey data by visually linking predictions to specific questionnaire dimensions, thereby directly supporting practitioner interpretation and communication.”

 

-          Clearly state what is novel about the approach and how it advances beyond existing XAI techniques. Explicitly compare with other interpretability frameworks (e.g., SHAP, LIME, Grad-CAM applied to structured data).

Authors: Thank you for your feedback. The novelty of our method lies in integrating visual encoding, CNN classification, and CAM-based explanations applied to structured quantitative data into a deterministic pipeline. Thus, we have added a comparison in the Background and Related Works (Section 2.4): “In addition, various explainable methods exist for tabular data, such as SHAP and LIME. However, these are usually employed as additional layers afterwards. Our approach differs in that explainability is integrated directly into the visual classification process, in a manner similar to Grad-CAM in the context of images.”

We have also included in the Materials and Methods (Section 3.4) that “The Grad-CAM variant was used in the procedure, with the last convolutional layer serving as the target layer. The resulting activation maps were interpolated and normalised on the input images before being overlaid with a Jet colour scale.”

Moreover, the following has been added to the Discussion (Section 5.2): “Previous studies have demonstrated that post-hoc explanation frameworks, such as SHAP and LIME, can assign importance scores to features in structured datasets [26]. Although these methods are powerful, they generate numerical or abstract outputs that require statistical expertise for interpretation. For example, SHAP provides additive contribution values, and LIME produces local surrogate models, both of which are informative but not directly intuitive for non-specialists. Logistic regression offers another form of interpretability through model coefficients; however, these require methodological expertise and assume linear relationships between features and outcomes.

In contrast, the CNN–CAM framework presented here provides direct visual, per-respondent explanations by overlaying heatmaps on familiar chart formats. This enables end users to see at a glance which survey dimensions influenced a given prediction, bridging the gap between statistical modelling and intuitive understanding.”

 

-          Provide a structured outline of the problem, research gap, proposed solution, objectives, and expected significance. Explain why visual representations are a promising alternative to standard numerical approaches.

Authors: Thank you for highlighting this. We have revised the Introduction (Section 1) and integrated the following: “Structured quantitative data, such as employee or customer surveys, are commonly analysed using traditional statistical or machine learning methods (e.g. logistic regression). The central problem is that, while these techniques often deliver accurate predictions, their outputs (coefficients and importance scores) are usually abstract and difficult for non-specialists to interpret. This creates a gap between predictive performance and stakeholder comprehensibility.

The research gap lies in the limited availability of domain-independent approaches that combine strong predictive ability and intuitive interpretability. Existing XAI techniques, such as SHAP and LIME, improve post-hoc interpretability but still rely on abstract numerical measures. Similarly, recent studies on tabular-to-image encodings have shown that CNNs can classify structured data; however, these approaches have not integrated per-instance interpretability into a lightweight, reproducible pipeline.

To address this gap, we propose a deterministic one-record-to-one-chart encoding, in which each survey response is transformed into a chart (bar or pie) and classified using CNN. Class Activation Maps (CAMs) are then applied to generate per-respondent heatmaps that directly highlight which survey dimensions influenced the classification.

The objectives of this study are threefold: (i) to evaluate whether CNN–CAM models achieve predictive performance comparable to logistic regression, (ii) to examine whether visual encodings provide stable and meaningful interpretability, and (iii) to demonstrate the practical applicability of this method as a transparent, domain-independent workflow.

The significance of this contribution lies in bridging the gap between predictive accuracy and practitioner accessibility. By combining visual representations, CNN classification, and CAM explanations, the approach provides intuitive, human-readable insights that are promising for applied decision contexts where trust and transparency are critical.”

 

-          Specify whether the dataset was collected by the authors or obtained from a public repository (with proper citation). Include the number of records, predictor variables, target variable, class distribution, and preprocessing steps.

Authors: We have now provided full details. In total, there were 44,417 missing responses in the 13 predictor variables, corresponding to 5.2% of the total values. The target variable (“employer recommendation”) had no missing values.

The following update has been incorporated in the Materials and Methods (Section 3.1): “In total, 44,417 missing values were found across the 13 predictor variables, accounting for 5.2% of all possible responses (see Table A1). The target variable had no missing values. For missing values, an iterative imputation method based on multiple regression estimates was used. The estimated values were then rounded to the nearest whole number on a scale from 1 to 5 to preserve the discrete numerical structure. We employed a stratified 70:30 split at the respondent level to form the training and test datasets. This approach ensures that the bar chart and the pie chart for the same respondent are never placed in different data splits, thereby effectively preventing data leakage.”
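A minimal sketch of this imputation step, assuming scikit-learn's IterativeImputer as the regression-based iterative method (the manuscript names the approach but not the implementation):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Illustrative stand-in: Likert responses (1-5) with missing entries as NaN.
X = np.array([[4.0, np.nan, 5.0],
              [3.0, 2.0, np.nan],
              [np.nan, 4.0, 4.0],
              [5.0, 5.0, 4.0]])

imputer = IterativeImputer(random_state=55)
X_imputed = imputer.fit_transform(X)

# Round to the nearest whole number and clip to the 1-5 scale,
# preserving the discrete structure described in Section 3.1.
X_discrete = np.clip(np.rint(X_imputed), 1, 5).astype(int)
print(X_discrete)
```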

An overview of the missing values is presented in Table A1 in Appendix A.1.

Table A1. Overview of missing values.

Items                             Submitted responses    Missing responses
Employer Recommendation                        65,563                    0
Salary/social benefits                         62,416                3,147
Image                                          61,616                3,947
Career/training                                61,602                3,961
Working atmosphere                             65,134                  429
Communication                                  62,736                2,827
Team spirit                                    62,703                2,860
Work-life balance                              62,341                3,222
Supervisor behaviour                           62,723                2,840
Interesting tasks                              62,420                3,143
Working conditions                             62,276                3,287
Environmental/social awareness                 60,674                4,889
Equality                                       61,144                4,419
Treatment of older colleagues                  60,117                5,446

Across the 13 predictor items, 44,417 of 852,319 possible responses (5.2%) were missing.

 

-          Detail the methodology: Explain the process of transforming responses into bar charts and pie charts: how many charts were created, what dimensions were chosen, and why. Describe the labeling procedure used for classification. Report the sample size for training, validation, and testing. Indicate whether hyperparameter optimization was performed.

Authors: We have substantially expanded this in the Materials and Methods section (Section 3.2). In addition to the aforementioned transformation from numerical responses to visual representations and the labelling procedure used for classification, the following information was added:

“For model training, each of the 45,894 records in the training set was converted into two visual representations (one bar chart and one pie chart), resulting in a total of 91,788 training images. These images were used to train both the 1×Conv and 2×Conv architectures. For testing, each of the 19,669 records was represented four times: as a bar chart evaluated with the 1×Conv model, as a bar chart evaluated with the 2×Conv model, as a pie chart evaluated with the 1×Conv model, and as a pie chart evaluated with the 2×Conv model. This procedure yielded 78,676 labelled test images (19,669 × 4), ensuring that every record was consistently evaluated across both the visual encodings and model architectures.”

We have also revised the Discussion (Section 5.5): “Parameter selection for the visualisation (e.g. colour palette, image resolution) was heuristic rather than systematically optimised. The generalisability of the approach could be strengthened by developing systematic procedures for selecting the visualisation parameters. Future research should also examine scalability strategies (e.g., parallelisation or model compression), multimodal extensions (e.g., combining survey data with free-text responses), and validation of CAM interpretability against expert judgment.”

 

-          Section 3.4 (CAM integration) and Section 4.4 (CAM-based analysis) should be expanded. Include technical details on how CAMs were overlaid on the generated images, and provide illustrative examples that demonstrate how practitioners can extract knowledge from these visualizations.

Authors: We have specified that a Grad-CAM variant was applied to the final convolutional layer, with activation maps normalised, upsampled to the original 65×65 resolution, and overlaid on the bar or pie charts. This produced intuitive heatmaps that highlighted the survey dimensions that were most influential for each prediction, enabling per-respondent visual explanations (e.g. salary/social benefits, supervisor behaviour, work-life balance).

To demonstrate how practitioners can extract knowledge from these visualisations, we have added the following update in the Results (Section 4.4): “For practitioners such as human resources managers, this means that CAM heat maps can be used as a diagnostic tool. For instance, a case with low scores for “communication” and “supervisor behaviour” revealed that these specific areas were strongly activated by the model, which explained the negative recommendation. In another case, where the scores were high for “working atmosphere” and “career/training”, the heat map clearly marked these segments, making the positive classification plausible. Such visual traceability transforms predictions into actionable insights.”

 

-          Provide a benchmark comparison not only with logistic regression but also with other explainable methods suitable for structured data. Discuss performance differences and trade-offs in interpretability.

Authors: Although implementing full SHAP/LIME pipelines was beyond the scope of this study, we added a comparative discussion. In the Discussion (Section 5.2), we now emphasise that logistic regression served as a robust and transparent empirical baseline in this study. Although it achieved accuracy comparable to the CNN–CAM models, its interpretability relies on statistical coefficients that remain abstract and less accessible to non-specialists. In contrast, our approach integrates classification and explanation within a unified workflow, providing per-respondent visual explanations rather than relying on post-hoc methods, such as SHAP or LIME.

We also added that “Finally, while alternative architectures, such as 1D-CNNs or Transformers, are well suited for tabular data, they are typically optimised for efficiency or scalability rather than intuitive interpretability. Our focus is on the visual pipeline because, in addition to comparable accuracy, it offers an intuitive visual aid that contributes to explainability. Our aim was not to surpass conventional methods in raw accuracy but to offer an interpretation format that is immediately legible to non-specialists. The trade-off of CNN-CAM is that it requires image encoding and CNN training (higher compute), and the explanations are faithful saliency cues rather than causal proofs. Moreover, the sensitivity to visualisation design (e.g., bar vs. pie) must be acknowledged.”

In the Discussion (Section 5.5), we acknowledge that “Another limitation is the absence of comparison with alternative frameworks. A valuable extension of this study would involve systematically comparing image-based CNN-CAM approaches with alternative model classes, such as 1D-CNNs or Transformers, that directly process tabular data. Such comparisons would allow for a more precise assessment of the practical advantages of the visual encoding pathway.”

 


-          Explicitly show how CNN-based classification contributes to extracting meaningful insights from questionnaire data, beyond simple predictive accuracy.

Authors: Thank you for your valuable input. We have emphasised in the Results (Section 4.4) that CAMs highlight which categories (e.g. salary/social benefits, supervisor behaviour, working atmosphere) drive predictions, enabling practitioners to connect outcomes with actionable survey dimensions. Additionally, we have revised the Discussion (Section 5.4):

“Accordingly, CNN-based classification contributes beyond predictive capability by enabling knowledge discovery through linking outcomes to specific organisational factors. The identified survey dimensions correspond to constructs well-known to human resource practitioners and managers, thereby converting otherwise obscure model outputs into actionable insights that can guide policy development and targeted interventions. For example, the identification of supervisor behaviour and communication quality as decisive determinants of employee recommendations illustrates how the method advances from mere classification to substantive knowledge generation. In this way, the approach positions itself as a bridge between predictive modelling and practical decision support.”

 

-          Transforming numerical responses into images may obscure fine-grained statistical relationships present in the original data.

Authors: We appreciate your feedback and have expanded the limitations section accordingly. In the revised manuscript, we explicitly acknowledge the following limitation in the Discussion (Section 5.5): “Additionally, a limitation of the proposed approach is that transforming numerical responses into image-based representations may obscure fine-grained statistical relationships that are directly accessible in the raw tabular format. For example, subtle linear effects or interaction terms between survey dimensions are not explicitly modelled when values are converted into graphical features, such as bar length or segment size. Although CNNs can still learn from visual cues, the abstraction into images necessarily reduces the precision of the original numerical structure.

This trade-off highlights the complementary nature of our method: CNN–CAM adds intuitive visual aids but cannot fully replace statistical models when precise coefficient estimates or formal hypothesis testing are required.”

 

-          The choice of visual representation (bar chart vs. pie chart) may affect model performance, introducing bias unrelated to the underlying data.

Authors: We agree with your valuable feedback and have subsequently enhanced the limitations section to address the points you raised. In the revised manuscript, we explicitly acknowledge the following limitation in the Discussion (Section 5.5):

“A further limitation is that the choice of visual representation itself influences performance. In our study, bar charts consistently yielded higher accuracies and clearer CAM overlays than pie charts, indicating that performance partly depends on representational characteristics (e.g. linear alignment, proportional scaling, and absence of rotational symmetry) rather than solely on the underlying data. This highlights that visualisation design decisions are not neutral; they can introduce systematic bias by making models appear stronger or weaker, depending on the encoding used. Although bar charts were effective in the present study, other contexts may benefit from alternative encodings. Future research should therefore systematically examine how different visualization strategies affect both predictive accuracy and interpretability.”

 

-          For very large surveys with many variables, creating image-based representations could be computationally expensive and inefficient compared to direct analysis of tabular data.

Authors: We have expanded the limitations section to address your concerns. Thus, we explicitly expanded the following in the Discussion (Section 5.5): “Another avenue for future research concerns scalability. The generation of image-based representations and subsequent CNN training can be computationally demanding for large-scale surveys, particularly when involving hundreds of variables or millions of respondents. While direct analysis of tabular data bypasses these steps and may therefore be more efficient in high-dimensional contexts, our study remained tractable due to the relatively small number of predictor variables (13). Nevertheless, scaling the method to more complex datasets may introduce practical bottlenecks in both the preprocessing and model training. To address this, future studies should investigate optimisation strategies, such as dimensionality reduction prior to visualisation, parallelised image generation, or model compression techniques. Moreover, many application domains combine ordinal ratings with free text responses or numerical indicators. Extending the approach to multimodal architectures that integrate visual encodings with text mining or numerical analysis could provide richer insights and enhance decision-support capacity.”

 

-          While CAMs highlight regions of an image, the connection between highlighted chart areas and the underlying survey constructs may remain abstract for domain experts.

Authors: We are grateful for your valuable feedback and have revised the limitations section to comprehensively address the concerns you highlighted (Section 5.5): “Finally, a further limitation concerns the interpretability of CAMs themselves. Although the overlays consistently highlight regions of an image that influence the model’s decision, the connection between these highlighted areas and the underlying survey constructs is not always straightforward. For example, a bar segment may receive strong activation; however, translating this into a clear statement about the causal role of the corresponding survey item requires domain expertise. Therefore, CAMs should be regarded as visual aids, as they provide intuitive cues about which dimensions shape a prediction; however, they cannot replace expert interpretation.

The lack of standardisation further compounds this limitation. Currently, no formal evaluation metrics exist for assessing the validity of CAM-based explanations. Establishing systematic validation procedures (such as comparing CAM outputs with expert assessments) would strengthen the robustness and credibility of this approach. This issue is particularly important in sensitive domains, where over-interpreting highlighted regions without appropriate contextual knowledge could lead to misleading or even harmful conclusions.”

 

-          The method’s effectiveness may vary depending on the type of survey and scale used, limiting its applicability across all quantitative response datasets.

Authors: We have subsequently elaborated on the limitations section in the Discussion (Section 5.5): “The applicability of the proposed approach may depend on the type of survey and the response scale employed. Our study used a 5-point Likert scale with thirteen items, which lends itself well to visual encodings, such as bars and pie segments. In surveys with very different scales (e.g., continuous ratings, multi-choice items) or substantially larger numbers of questions, the transformation into compact, interpretable charts may be less effective.

This limitation implies that CNN–CAM is not a universal solution for all quantitative-response datasets. Its value is strongest in cases where responses are structured and of manageable dimensionality. As an area for future research, broader validation across diverse survey types and scales is needed to assess generalisability and to adapt the method to contexts where visual encodings are less natural.”

 

-          Since CNNs are already known to perform well in classification tasks, the real contribution lies in interpretability; yet, this aspect remains underexplored.

Authors: Following your comment, we have adjusted our contribution to underscore how the intuitive visual aid contributed to interpretability in the Introduction (Section 1): “To our knowledge, this study is the first to present a complete pipeline that ranges from visual transformation to CNN-based classification to CAM-supported explanation. This integrated approach has not previously been reported for ordinally scaled survey data.”

In addition, the Discussion (Section 5.4) has been expanded as follows: “CNNs are well established as powerful classifiers, and our results confirm that they perform comparably to logistic regression on structured survey data. Accordingly, the contribution of this study does not lie in achieving higher accuracy but in providing interpretable per-respondent explanations through CAM overlays.

This focus on interpretability positions CNN–CAM not simply as another classification technique but as a practical tool for knowledge extraction and communication. By linking predictions to familiar survey items in a visual format, the method makes AI-driven analysis more transparent and accessible to practitioners.”

The Conclusion (Section 6) has also been updated as follows: “This study provides the first consistent pipeline that leads from ordinal survey data to CNN-based classification and CAM-supported explanation via deterministic visual coding. This clearly positions it as a new methodological approach in the field of explainable AI.”

 

We would like to thank you again for your valuable and detailed comments. We believe that incorporating your comments into our manuscript has substantially improved it.

Reviewer 3 Report

Comments and Suggestions for Authors

The paper proposes a pipeline for modeling and explaining survey outcomes. Each respondent’s ordinal answers are rendered as a single image (bar or pie chart), a compact CNN predicts a target label (e.g., “recommendation”), and a class-activation map highlights the chart regions that most influenced the decision. The stated contribution is a deterministic, one-record-to-one-chart transformation coupled with per-respondent visual explanations, intended to make survey modeling both accurate and easy to communicate to non-specialists. Results on one HR-style dataset show accuracy comparable to a logistic-regression baseline, with CAM heatmaps offered as the interpretability layer.
I think the work is close to publishable. What I ask below does not change your study design and should be quick to add.
Please make the split protocol explicit in one short paragraph. Just confirm that train/validation/test partitioning was done at the respondent level before any images were generated, and say if the split was stratified. Because each record produces both a bar and a pie, this one sentence removes any concern about accidental leakage. If you can also report averages over a few random seeds with a simple confidence interval, even better—it gives readers immediate confidence in the numbers without extra complexity.
Your methods section already lists the metrics you compute; in the results, it would help to show them in a compact table. Alongside accuracy and the confusion matrix, please include per-class precision, recall, and F1 with macro-F1. If you already output probabilities, adding one calibration number (Brier score or ECE) is enough. I am not asking for many new baselines; if time is short, it is fine to keep logistic regression as the main reference and simply present the fuller metrics you already compute.
For the interpretability part, a little more precision will help. Please name the exact CAM variant you use (e.g., Grad-CAM or Grad-CAM++), indicate the layer you target, and note briefly how the heatmap is normalized or upsampled. If feasible, include a single faithfulness check: occlude the region the CAM marks as most important and show the expected drop in model confidence. One figure like this makes the explanation claim solid. If you cannot run this now, it is okay to soften the language to “an intuitive visual aid,” and keep the stronger claim for future work.
The paper already compares bar and pie. Since you also note that chart design can influence both accuracy and interpretation, a very small robustness note would close the loop. If you can run one quick alternative—either grayscale charts or randomized item order—and report how the metrics change, that will reassure readers the model is not relying mainly on fixed positions or colors. Even a short sentence in the results is enough.
Two small presentation points will improve readability. In the abstract and conclusion, please use the actual numbers (and if you have them, simple intervals) rather than “high accuracy,” and state plainly that the logistic baseline is competitive. And because you use coded item names, a small table mapping each code to the full question text (referenced in the figure captions) will make the figures easier to read.
Finally, in the introduction, one or two sentences that say exactly what is new here will help position the paper. I would write it simply: “Our contribution is a deterministic one-record-to-one-chart encoding with per-respondent CAM explanations in a lightweight end-to-end workflow,” and then acknowledge nearby lines of work on tabular-to-image encodings and chart-reading CNNs. This is only wording; no extra experiments are needed.
With these light edits and clarifications, the paper reads as a clean, reproducible recipe: what the data are, how the split is kept safe, how the model performs on standard metrics, why the heatmaps can be trusted at a basic level, and how sensitive the approach is to a simple chart tweak. I believe these changes are fully achievable within a minor-revision window, and I would be happy to support acceptance after they are made.

Author Response

Comments to Reviewer #3

 

The paper proposes a pipeline for modelling and explaining survey outcomes. Each respondent’s ordinal answers are rendered as a single image (bar or pie chart), a compact CNN predicts a target label (e.g., “recommendation”), and a class-activation map highlights the chart regions that most influenced the decision. The stated contribution is a deterministic, one-record-to-one-chart transformation coupled with per-respondent visual explanations, intended to make survey modelling both accurate and easy to communicate to non-specialists. Results on one HR-style dataset show accuracy comparable to a logistic-regression baseline, with CAM heatmaps offered as the interpretability layer.

Authors: We sincerely thank the reviewer for the supportive and constructive comments, which helped us refine the manuscript for clarity, reproducibility, and interpretability. Below, we provide detailed responses to each of these points.

 

I think the work is close to publishable. What I ask below does not change your study design and should be quick to add.

-          Please make the split protocol explicit in one short paragraph. Just confirm that train/validation/test partitioning was done at the respondent level before any images were generated, and say if the split was stratified. Because each record produces both a bar and a pie, this one sentence removes any concern about accidental leakage. If you can also report averages over a few random seeds with a simple confidence interval, even better—it gives readers immediate confidence in the numbers without extra complexity.

 

Authors: We thank the reviewer for raising this important point. In the Materials and Methods section (Sections 3.1 and 3.3), we now explicitly confirm that the dataset was partitioned at the respondent level prior to image generation, thereby ensuring that no leakage occurred between the bar and pie chart variants of the same record. A stratified 70:30 split was employed to maintain class balance, and a fixed random seed (55) was used to guarantee that the results can be reproduced.

We thank the reviewer again for the valuable suggestion regarding multiple random seeds. Unfortunately, we could not rerun the full training pipeline because of computational constraints; the pie chart models, in particular, required over 60 hours of training. Instead, we used a fixed random seed (55) to guarantee that the results can be reproduced exactly. This limitation is now explicitly acknowledged in the Discussion (Section 5.5): “A further limitation concerns reproducibility across random seeds. Due to the high computational cost of training (with pie chart models requiring more than 60 hours per run), we were unable to average the results over multiple seeds. Instead, we used a fixed random seed (55) to ensure the exact reproducibility of the reported results. While this provides stable benchmarks for replication, future work should explore robustness across different seeds and training initialisations.”
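For replication purposes, the following minimal sketch (Python; the DataFrame name respondents_df and the label column "recommendation" are assumed placeholders, not the authors' actual identifiers) illustrates the described protocol: a stratified respondent-level split with the fixed seed, performed before any chart images are rendered.

    from sklearn.model_selection import train_test_split

    # Split at the respondent level BEFORE generating any images, so the
    # bar and pie variants of one respondent can never straddle the split.
    train_df, test_df = train_test_split(
        respondents_df,                              # one row per respondent
        test_size=0.30,                              # 70:30 split
        stratify=respondents_df["recommendation"],   # preserve class balance
        random_state=55,                             # fixed seed from the paper
    )
    # Bar and pie charts are then rendered separately for each partition.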

 

-          Your methods section already lists the metrics you compute; in the results, it would help to show them in a compact table. Alongside accuracy and the confusion matrix, please include per-class precision, recall, and F1 with macro-F1. If you already output probabilities, adding one calibration number (Brier score or ECE) is enough. I am not asking for many new baselines; if time is short, it is fine to keep logistic regression as the main reference and simply present the fuller metrics you already compute.

 

Authors: We agree and have expanded the evaluation accordingly in the revised manuscript. In the Materials and Methods section (Section 3.5), we now state, “In addition to accuracy, precision, recall, and F1 score (macro-F1) were computed for both classes to obtain a more detailed picture of the classification performance. […] To assess calibration of predicted probabilities, we computed the Brier score and the Expected Calibration Error (ECE).”

In the Results (Section 4), we report that “These findings are confirmed by the supplementary metrics. For both CNN models, the precision, recall, and macro-F1 were all above 0.91 (1×Conv: Precision = 0.915, Recall = 0.917, Macro-F1 = 0.916; 2×Conv: Precision = 0.918, Recall = 0.916, Macro-F1 = 0.917). […] In addition, the calibration was checked using the Brier score and expected calibration error (ECE). Both values were in the low range (< 0.04), confirming a reliable probability estimate.”
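As an illustration of how these supplementary metrics can be obtained, the sketch below (Python; y_true and y_prob are assumed arrays of binary labels and predicted probabilities, and the equal-width-bin ECE shown is one common formulation) is a hypothetical reconstruction, not the authors' evaluation code.

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support, brier_score_loss

    def report_metrics(y_true, y_prob, threshold=0.5, n_bins=10):
        y_pred = (y_prob >= threshold).astype(int)
        # Macro-averaged precision, recall, and F1 over both classes.
        prec, rec, f1, _ = precision_recall_fscore_support(
            y_true, y_pred, average="macro"
        )
        brier = brier_score_loss(y_true, y_prob)
        # Expected Calibration Error: bin predictions by confidence and
        # average the |accuracy - confidence| gap, weighted by bin size.
        conf = np.where(y_pred == 1, y_prob, 1.0 - y_prob)
        correct = (y_pred == y_true).astype(float)
        bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
        ece = sum(
            (bins == b).mean() * abs(correct[bins == b].mean() - conf[bins == b].mean())
            for b in range(n_bins) if (bins == b).any()
        )
        return {"macro_precision": prec, "macro_recall": rec,
                "macro_f1": f1, "brier": brier, "ece": ece}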

 

-          For the interpretability part, a little more precision will help. Please name the exact CAM variant you use (e.g., Grad-CAM or Grad-CAM++), indicate the layer you target, and note briefly how the heatmap is normalized or upsampled. If feasible, include a single faithfulness check: occlude the region the CAM marks as most important and show the expected drop in model confidence. One figure like this makes the explanation claim solid. If you cannot run this now, it is okay to soften the language to “an intuitive visual aid,” and keep the stronger claim for future work.

Authors: We thank the reviewer for this valuable suggestion. We have now specified in Materials and Methods (Section 3.4) that we used the Grad-CAM variant, applied to the final convolutional layer, with activation maps normalised and upsampled to the input resolution.

Owing to time and computational constraints, we were unable to include a formal faithfulness check (e.g. occlusion-based validation). To reflect this, we have softened the interpretability claims throughout this manuscript. CAMs are now described as providing an intuitive visual aid that highlights influential regions rather than as fully validated causal explanations.
We also note explicitly in the Discussion (Section 5.5) that “a systematic validation of faithfulness (e.g., by occluding the marked regions and measuring the decline in model confidence) is reserved for future work.”
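For concreteness, a minimal Grad-CAM sketch over the final convolutional layer is given below (TensorFlow/Keras; the function and layer names are our own illustrative assumptions). It follows the steps named above (gradient pooling, ReLU, normalisation to [0, 1], and upsampling to the 65 × 65 input), but it is a sketch of the standard technique rather than the authors' implementation.

    import numpy as np
    import tensorflow as tf

    def grad_cam(model, image, conv_layer_name, class_index):
        # Model that exposes both the target conv activations and the output.
        grad_model = tf.keras.Model(
            model.inputs,
            [model.get_layer(conv_layer_name).output, model.output],
        )
        with tf.GradientTape() as tape:
            conv_out, preds = grad_model(image[np.newaxis, ...])
            class_score = preds[:, class_index]
        # Channel weights: gradients of the class score, pooled over space.
        grads = tape.gradient(class_score, conv_out)
        weights = tf.reduce_mean(grads, axis=(1, 2))
        cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
        cam = tf.nn.relu(cam)                    # keep positive evidence only
        cam = cam / (tf.reduce_max(cam) + 1e-8)  # normalise to [0, 1]
        # Upsample to the 65 x 65 input resolution for overlaying.
        cam = tf.image.resize(cam[..., tf.newaxis], (65, 65))
        return cam.numpy().squeeze()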

 

-          The paper already compares bar and pie. Since you also note that chart design can influence both accuracy and interpretation, a very small robustness note would close the loop. If you can run one quick alternative—either grayscale charts or randomized item order—and report how the metrics change, that will reassure readers the model is not relying mainly on fixed positions or colors. Even a short sentence in the results is enough.

Authors: We thank the reviewer for this suggestion and have conducted a robustness test by rearranging the order of the survey items. The following sentences summarising this robustness test have been added to the Results (Section 4.1): “To verify robustness, the fixed vertical order of items in the bar charts was replaced with an alternative ordering based on the logistic regression coefficient values. The CNN models showed only a slight decrease in accuracy compared to the original order (–0.04% with 1×Conv and –0.17% with 2×Conv), and the intuitive visual aid of the CAM heatmaps remained consistent. This confirms that the models relied primarily on value patterns rather than on the original item order.”
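A compact sketch of this kind of reordering check could look as follows (Python; logreg, item_codes, and the subsequent retraining step are assumed placeholders, and the actual experiment may differ in detail).

    import numpy as np

    # Derive an alternative fixed item order from the logistic-regression
    # coefficients, then regenerate the bar charts in that order and
    # re-evaluate the CNN on the new images.
    coef = logreg.coef_.ravel()          # fitted sklearn LogisticRegression
    order = np.argsort(-np.abs(coef))    # most influential items first
    items_reordered = [item_codes[i] for i in order]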

We have also considered the option you mentioned as a potential avenue for future research in the Discussion (Section 5.5): “Another possible robustness test would be to display the charts in greyscale or further randomise the elements. Since such an analysis was not implemented in this study, we see this as a useful extension for future work to systematically quantify the influence of colouring and order.”

In the course of this revision, we have also added the following to the Discussion (Section 5.3): “In addition, the choice of chart type can introduce unwanted distortions. While bar charts have a stable order, the rotational symmetry of pie charts can cause diffuse activation patterns. This suggests that visual coding itself is an important component of model success.”

 

-          Two small presentation points will improve readability. In the abstract and conclusion, please use the actual numbers (and if you have them, simple intervals) rather than “high accuracy,” and state plainly that the logistic baseline is competitive. And because you use coded item names, a small table mapping each code to the full question text (referenced in the figure captions) will make the figures easier to read.

Authors: We thank the reviewer for these valuable suggestions, which have improved the clarity of the presentation. We have added the following to the Abstract: “Our evaluation found that CNN models with bar chart coding achieved an accuracy of between 93.05% and 93.16%, comparable to the 93.19% achieved by logistic regression.”

The Conclusion now also states the actual performance values: “The analyses performed showed that CNN models with bar chart encoding achieved accuracies of between 93.05% and 93.16%. This was comparable to the accuracy of logistic regression, which was 93.19%.”

A complete overview of all abbreviations and their corresponding question formulations can be found in Table A2 in Appendix A.2, which improves figure readability.

Table A2. Mapping of variable codes to the full survey items.

    Code   Full Survey Item
    WEW    Employer Recommendation
    GSL    Salary/social benefits
    IMG    Image
    KWB    Career/training
    AAP    Working atmosphere
    KOM    Communication
    KZH    Team spirit
    WLB    Work-life balance
    VGV    Supervisor behaviour
    IAG    Interesting tasks
    ABD    Working conditions
    USB    Environmental/social awareness
    GBR    Equality
    UÄK    Treatment of older colleagues

Note: These codes are used in Figures 7–12 to label survey dimensions. For clarity, each figure caption now refers to Table A2.

  • Figure 7. Bar charts – prediction with 1×Conv. (a) evaluated instance; (b) corresponding CNN activation heatmap. See Table A2 for the mapping of variable codes (e.g. AAP, KOM, and WLB) to the full survey items.
  • Figure 8. Bar charts – prediction with 2×Conv. (a) evaluated instance; (b) corresponding CNN activation heatmap. The variable codes are explained in Table A2.
  • Figure 11. Pie charts – prediction with 1×Conv. (a) evaluated instance; (b) corresponding CNN activation heatmap. The code definitions are provided in Table A2.
  • Figure 12. Pie charts – prediction with 2×Conv. (a) evaluated instance; (b) corresponding CNN activation heatmap. The variable codes correspond to the survey items listed in Table A2.

 

 

-          Finally, in the introduction, one or two sentences that say exactly what is new here will help position the paper. I would write it simply: “Our contribution is a deterministic one-record-to-one-chart encoding with per-respondent CAM explanations in a lightweight end-to-end workflow,” and then acknowledge nearby lines of work on tabular-to-image encodings and chart-reading CNNs. This is only wording; no extra experiments are needed.

Authors: We agree and have revised the introduction to further acknowledge nearby lines of work (for example, CNNs and Transformers) and explicitly highlight that our contribution lies in a deterministic one-record-to-one-chart encoding combined with CNN–CAM analysis, yielding per-respondent visual explanations within a unified workflow.

 

With these light edits and clarifications, the paper reads as a clean, reproducible recipe: what the data are, how the split is kept safe, how the model performs on standard metrics, why the heatmaps can be trusted at a basic level, and how sensitive the approach is to a simple chart tweak. I believe these changes are fully achievable within a minor-revision window, and I would be happy to support acceptance after they are made.

Authors: We would like to thank you again for all your valuable and detailed comments. We believe that incorporating your comments into our manuscript has substantially improved it.

Reviewer 4 Report

Comments and Suggestions for Authors

In this manuscript, the authors use convolutional neural networks (CNNs) to classify and interpret quantitative response data. Overall, the core contribution of the manuscript lies in the interpretable application of existing tools. Specific comments are as follows:

  1. The methods applied in the manuscript are all well-established. I recommend that the authors provide further clarification in the introduction to emphasize the essential differences between the proposed approach and existing methods.
  2. In the experimental section, the selection of some key parameters lacks theoretical or experimental justification, such as image size or color mapping. The authors should provide more detailed explanations in the experiments.
  3. The baseline models chosen by the authors are overly simplistic and fail to adequately demonstrate the advantages of CNNs and image-based approaches. It is necessary to select more competitive comparison models.
  4. In terms of experimental results, the accuracy of the CNN is very close to that of logistic regression. How does this demonstrate the superiority of the proposed method?
  5. The interpretation and generalization of the experimental results should be more cautious. Some of the explanations in the manuscript appear to be quite subjective.
  6. The quality of the figures and tables in the manuscript is mediocre. I strongly recommend that the authors improve them.
  7. There are some grammatical errors that the authors should carefully revise.

Author Response

Comments to Reviewer #4

 

In this manuscript, the authors use convolutional neural networks (CNNs) to classify and interpret quantitative response data. Overall, the core contribution of the manuscript lies in the interpretable application of existing tools.

Authors: We thank the reviewer for carefully reading our manuscript and for the constructive suggestions. Below, we address each comment point-by-point and explain the corresponding revisions made to improve clarity, rigor, and presentation.

 

Specific comments are as follows:

  1. The methods applied in the manuscript are all well-established. I recommend that the authors provide further clarification in the introduction to emphasize the essential differences between the proposed approach and existing methods.

Authors: We appreciate your valuable feedback. In addition to acknowledging related work, we revised the Introduction to emphasise our unique contribution: a deterministic one-record-to-one-chart encoding combined with per-respondent CAM explanations, providing interpretable outputs at the individual record level. Although CNNs and CAMs are established tools, their integration into a lightweight, survey-oriented pipeline is novel.

 

  2. In the experimental section, the selection of some key parameters lacks theoretical or experimental justification, such as image size or color mapping. The authors should provide more detailed explanations in the experiments.

Authors: We acknowledge that the initial manuscript did not sufficiently justify these design choices. A resolution of 65 × 65 pixels was selected to balance the computational efficiency and visual fidelity. Smaller images risked losing fine-grained differences in bar lengths or pie segment sizes, whereas larger images substantially increased the training time without noticeable improvements in accuracy. The red–green colour gradient for the bar charts was chosen to reflect the ordinal semantics of the Likert scales, intuitively associating low values with red (negative evaluations) and high values with green (positive evaluations). For the pie charts, fixed RGB assignments ensured consistent mapping across categories.

The following justification has been added to the Methods section (Section 3.2): “The colour mapping from red (=1) to green (=5) follows the theoretical logic of a Likert scale, in which low values are typically interpreted as negative and high values as positive. This intuitive scheme not only facilitates human readability but also supports consistency between visual encoding and content meaning. Alternative scales (e.g. blue-orange) were explored but showed no improvements in model accuracy. An image size of 65 × 65 pixels was chosen because it ensured sufficient visual resolution to clearly recognise differences in bar length and colour intensity while keeping the memory and computing requirements for the CNN models low. Preliminary tests with larger formats (e.g., 128 × 128 pixels) showed no significant gains in accuracy but resulted in significantly longer training times. The chosen size therefore represents a pragmatic compromise between visual interpretability and computational effort.”
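To make the described encoding concrete, the following small sketch (Python/matplotlib; the function names, figure settings, and the illustrative responses are assumptions rather than the authors' rendering code) maps Likert values onto the red-to-green gradient and saves a deterministic 65 × 65 pixel bar chart per record.

    import matplotlib.pyplot as plt

    def likert_colour(value, v_min=1, v_max=5):
        # Linear interpolation from red (value 1) to green (value 5).
        t = (value - v_min) / (v_max - v_min)
        return (1.0 - t, t, 0.0)  # (R, G, B) in [0, 1]

    def render_bar_chart(responses, path, px=65, dpi=65):
        # responses: dict mapping item code (see Table A2) -> Likert value 1-5
        fig, ax = plt.subplots(figsize=(px / dpi, px / dpi), dpi=dpi)
        items = list(responses)
        values = [responses[k] for k in items]
        ax.barh(items, values, color=[likert_colour(v) for v in values])
        ax.set_xlim(0, 5)
        ax.axis("off")  # the CNN consumes raw pixels; labels are not drawn
        fig.savefig(path)
        plt.close(fig)

    # e.g. render_bar_chart({"AAP": 4, "KOM": 2, "WLB": 5}, "resp_001.png")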

We also explicitly acknowledge in the Discussion (Section 5.5) that these parameter choices are heuristic and that the systematic optimisation of visualisation parameters (e.g. resolution and colour palettes) is an important direction for future work.

 

  3. The baseline models chosen by the authors are overly simplistic and fail to adequately demonstrate the advantages of CNNs and image-based approaches. It is necessary to select more competitive comparison models.

Authors: We agree that a broader baseline would provide additional context. Owing to computational and space constraints, we limited the empirical baselines to logistic regression but expanded the discussion. In the Discussion (Section 5.2), we have briefly added a comparative discussion highlighting that “unlike logistic regression, SHAP, or LIME, our CNN–CAM approach integrates classification and explanation in a single workflow, providing per-respondent visual explanations that are more accessible to non-specialists, while acknowledging trade-offs in computational cost, design sensitivity, and the non-causal nature of saliency maps. While our empirical baseline was limited to logistic regression for reasons of scope and computational efficiency, we acknowledge that including further empirical baselines, such as decision trees with SHAP or models explained with LIME, would provide additional points of comparison. Therefore, we identified the absence of direct comparisons with alternative approaches as a direction for future work.”

  4. In terms of experimental results, the accuracy of the CNN is very close to that of logistic regression. How does this demonstrate the superiority of the proposed method?

Authors: We agree that classification accuracy alone does not establish superiority.

The key contribution of our work is not a higher accuracy but enhanced interpretability. Logistic regression yields statistical coefficients that require methodological expertise, whereas our approach is a deterministic one-record-to-one-chart encoding with per-respondent CAM explanations in a lightweight end-to-end workflow. This provides an intuitive visual aid and makes the results accessible to non-specialists. This reframing is emphasised in the Abstract, Introduction, Discussion (Section 5.2), and Conclusions.

 

  5. The interpretation and generalization of the experimental results should be more cautious. Some of the explanations in the manuscript appear to be quite subjective.

Authors: We appreciate your comment. We have revised our use of language, particularly with regard to interpretation and generalisation, and adjusted the wording where necessary to soften overly strong claims.

 

  6. The quality of the figures and tables in the manuscript is mediocre. I strongly recommend that the authors improve them.

Authors: We appreciate the reviewer’s comments regarding the presentation quality. Owing to the fixed resolution used in CNNs (65 × 65 pixel input images), we were unable to substantially increase the figure resolution without altering the study setup. However, we implemented several improvements within these constraints:

  • Figure captions have been expanded and now reference Table A2, which provides full definitions of all variable codes, improving readability for readers.
  • We have formatted the tables in accordance with the journal's guidelines and have expanded the caption of Table 2 to elucidate each column (β, Std. Error, z, p-value, CI).
  • Figures with higher dpi were employed to ensure clarity.

We acknowledge that figure resolution remains limited by the design of the study and explicitly note this as a presentation-related limitation in the Discussion (Section 5.5): “Finally, the visual quality of figures is constrained by the fixed 65 × 65 pixel resolution required for CNN compatibility; while this ensured consistency during training, it inevitably limits the aesthetic resolution of the presented charts and heatmaps, a trade-off we acknowledge as a presentation-related limitation.”

 

  7. There are some grammatical errors that the authors should carefully revise.

Authors: We thank the reviewer for highlighting this issue. The manuscript has been carefully proofread, and errors such as “highlight-ed” and “ist” have been corrected.

 

We would like to thank you again for your valuable and detailed comments. We believe that incorporating your comments into our manuscript has substantially improved it.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

No further comment.

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript has been thoroughly revised in line with the reviewer’s feedback. The adjustments made have strengthened both the clarity of exposition—making the text easier to follow for non-specialist readers—and the depth of the arguments, which now come across as more precise and impactful.

Reviewer 4 Report

Comments and Suggestions for Authors

ACCEPT
