Peer-Review Record

Transformer-Based Cascading Reconstruction Network for Video Snapshot Compressive Imaging

Appl. Sci. 2023, 13(10), 5922; https://doi.org/10.3390/app13105922
by Jiaxuan Wen 1,*, Junru Huang 1, Xunhao Chen 2, Kaixuan Huang 1 and Yubao Sun 1
Submission received: 16 March 2023 / Revised: 5 May 2023 / Accepted: 9 May 2023 / Published: 11 May 2023

Round 1

Reviewer 1 Report

In this paper, a transformer-based cascading reconstruction network for video snapshot compressive imaging is proposed. The presented method shows a significant enhancement of video quality.
I have minor reservations about the graphic editing of the tables and images: some are out of alignment with the text.

Author Response

Point 1: In this paper, a transformer-based cascading reconstruction network for video snapshot compressive imaging is proposed. The presented method shows a significant enhancement of video quality. I have minor reservations about the graphic editing of the tables and images: some are out of alignment with the text.

Response 1:

Thank you for your careful review of our manuscript. We greatly appreciate your feedback.

As a result, we have made several corrections in the text to address inconsistencies between the figures and their corresponding descriptions. In particular, we have revised Figure 4 on page 8, updating the input notation from "xlo'" to "Xlo'".

Thank you again for your time and effort in reviewing our manuscript, which has helped to improve the quality of our work.

Author Response File: Author Response.docx

Reviewer 2 Report

Presentation flow is good.

How do the authors derive the formula used for calculating the reconstruction error? A justification is required.

The authors have not compared with or described autoencoders, a neural network architecture commonly used for data compression. It would be good if the authors justified the proposed architecture's performance against that of autoencoders.

Author Response

Point 1: How do the authors derive the formula used for calculating the reconstruction error? A justification is required.

Response 1:

Thank you for your constructive feedback on our manuscript. We have carefully reviewed your comments and have made the necessary revisions to address the issues raised.

In particular, we have added a detailed description of the process for constructing the loss function in Chapter 4 on page 9, as per your suggestion, and have explained each component of the loss function to ensure clarity and understanding.
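Since the revised Chapter 4 is not reproduced in this record, the following is only a hypothetical sketch of what a multi-component loss for a cascading reconstruction network commonly looks like: a weighted sum of per-stage mean-squared-error terms. The function name, the stage structure, and the weights are illustrative assumptions, not the manuscript's actual formulation.

```python
import torch
import torch.nn.functional as F

def cascade_loss(stage_outputs: list[torch.Tensor],
                 target: torch.Tensor,
                 weights: list[float]) -> torch.Tensor:
    """Weighted sum of per-stage MSE terms against the ground-truth video."""
    return sum(w * F.mse_loss(out, target)
               for w, out in zip(weights, stage_outputs))
```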

Thank you again for your valuable input, which has helped to improve the quality of our work.

Point 2: The authors have not compared with or described autoencoders, a neural network architecture commonly used for data compression. It would be good if the authors justified the proposed architecture's performance against that of autoencoders.

Response 2:

Thank you for your valuable feedback on our manuscript. We appreciate your suggestions, which have led us to identify areas for improvement in our experimental design. To better demonstrate the effectiveness of our proposed network, we have included a comparative experiment as per your suggestion.

Given the widespread use of autoencoder neural networks in data compression, we have selected the masked autoencoder (MAE) algorithm proposed in Reference 1 as the comparative algorithm. This algorithm uses masking and an asymmetric encoder–decoder structure to restore masked low-quality images, which is similar to the video snapshot compressive imaging reconstruction task. We trained the MAE algorithm on the same dataset as our proposed network, and we have listed a detailed table in the attachment. As shown in Table 1, the reconstructed PSNR of this network on the test dataset was only 24.1 dB, indicating poor reconstruction ability. Therefore, we have decided not to include this algorithm in the main text of the manuscript.
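For context, PSNR figures such as the 24.1 dB above are computed from the mean squared error between ground-truth and reconstructed frames; a minimal sketch follows (the `peak` default of 255 assumes 8-bit frames):

```python
import numpy as np

def psnr(reference: np.ndarray, reconstruction: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio (dB) between a reference frame and its reconstruction."""
    mse = np.mean((reference.astype(np.float64) - reconstruction.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```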

Thank you again for your insightful suggestions, which have inspired us to seek out and study other outstanding algorithms for video snapshot compressive imaging task in future research.

Author Response File: Author Response.docx

Reviewer 3 Report

The authors propose a transformer-based artificial neural network for video encoding. The paper is reasonably well written, apart from a few minor typos. You can see my corrections in the attached PDF.

The performance of the proposal seems reasonable. However (and this is why I say it "seems"), it is not possible to establish a faithful comparison, since the volume of data required to carry out the reconstructions is not reported. Nor is a comparison made, in terms of rate/distortion, with other standard compression algorithms that could situate your proposal and the methods it is compared against (all based on artificial neural networks). This information is essential.

Comments for author File: Comments.pdf

Author Response

Point 1: The paper is reasonably well written, apart from a few minor typos. You can see my corrections in the attached PDF.

Response 1:

We would like to express our sincere gratitude for your diligent review of our manuscript. We have carefully considered your comments and have made revisions accordingly.

Specifically, we have revised our manuscript according to the PDF version you provided, correcting several spelling and punctuation errors. We have also further modified the original Equation (18) on page 8 and clarified the calculation process for the first branch input by adding Equation (15). Furthermore, we have added specific details about the citation in the first line of Section 5.1 on page 9 and have incorporated the better expression you provided into the first line of Section 5.4 on page 12.

Thank you again for your valuable feedback, which has greatly improved the quality of our manuscript. We appreciate your time and effort in reviewing our work.

Point 2: The performance of the proposal seems reasonable. However (and this is why I say it "seems"), it is not possible to establish a faithful comparison, since the volume of data required to carry out the reconstructions is not reported.

Response 2:

We would like to thank you for your thorough review of our manuscript. Data volume is an important aspect to estimate the model effectiveness and complexity. In response to your suggestion, we have made further revisions to our manuscript.

Specifically, we have added a comparison of the parameter counts of the deep network-based models during the reconstruction process in Table 3 of Section 5.5 on page 13, and we have listed a detailed table in the attachment. Comparing the parameter counts and reconstruction times of the algorithms, we found that although the use of the Transformer in our proposed network inevitably leads to a longer reconstruction time, its parameter count is lower than that of RE2-Net. Our proposed network can therefore achieve better reconstruction performance while balancing model complexity and effectiveness, yielding improved reconstruction quality at a reasonable computational cost.
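As a rough illustration of how such a comparison can be produced, the sketch below counts trainable parameters and times a single forward reconstruction pass in PyTorch; `model` and `dummy_input` are placeholders standing in for each reconstruction network and a test measurement, not objects from the manuscript.

```python
import time
import torch

def profile_model(model: torch.nn.Module, dummy_input: torch.Tensor):
    """Return (trainable parameter count, single-pass reconstruction time in seconds)."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        model(dummy_input)
        elapsed = time.perf_counter() - start
    return n_params, elapsed
```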

Thank you again for your valuable feedback, which has helped to improve the quality of our manuscript.

Point 3: Nor is a comparison made, in terms of rate/distortion, with other standard compression algorithms that could situate your proposal and the methods it is compared against (all based on artificial neural networks). This information is essential.

Response 3:

Thank you for your valuable suggestions. Rate/distortion is a critical factor in compressive sensing tasks. Based on your comments, we have identified areas for improvement in our experiments. Specifically, we have added a description of the compression ratio calculation in Section 5.1 on page 10, which indicates that the compression ratio for a single frame is 50%, while the overall video compression ratio is 1/8.
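The two ratios quoted above are consistent with a standard CACTI setup; the worked arithmetic below assumes 8 frames collapsed into one snapshot and a binary mask active at roughly half its entries (both illustrative assumptions, not values taken from the manuscript itself).

```python
# Worked arithmetic for the compression ratios quoted above. B = 8 frames per
# snapshot and a 50%-dense binary mask are illustrative assumptions.
B = 8                  # video frames collapsed into a single measurement
mask_density = 0.5     # fraction of active (sampling) mask entries per frame

per_frame_ratio = mask_density   # each frame is sampled at 50%
overall_ratio = 1 / B            # 8 frames -> 1 snapshot, i.e., 1/8

print(f"per-frame sampling ratio: {per_frame_ratio:.0%}")        # 50%
print(f"overall compression ratio: 1/{B} = {overall_ratio:.3f}")
```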

In compressive sensing tasks, images are often compressed at different compression ratios and then reconstructed to verify the effectiveness of an algorithm. In contrast to traditional compressive sensing tasks, a given CACTI system has a fixed mask. Therefore, in video snapshot compressive imaging, researchers typically focus on improving the reconstruction quality of the same system. In our manuscript, our method and the comparative methods are based on the same CACTI system and use the same mask during reconstruction. We are therefore more concerned with the overall quality of the reconstructed video than with the compression sampling rate of the system.

Based on your suggestion, we will focus on the differences in mask compression rates between our system and other systems in future research. Once again, we thank you for your valuable feedback.

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

The authors have successfully addressed my comments.
