Prex-NetII: Attention-Based Back-Projection Network for Light Field Reconstruction
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors proposed a new method, "Prex-NetII: Attention-Based Back-Projection Network for Light Field Reconstruction", in this paper.
After reviewing the paper, I have the following comments.
[1] The authors could summarize the literature survey on related work in the form of a table showing the relative merits of the different methods, with reference numbers.
[2] How many "up-projection + down-projection" layers in Figure 1 should be used in a typical case for optimal performance?
[3] The fonts in Fig. 1 should be enlarged for better presentation.
[4] Why is a 7×7 conv function block used in Figure 2?
[5] In the implementation of the up-projection block, why should the 1×1 conv (49) be coupled with the conv (96) in the next block in Figure 3?
[6] The concatenated output of the down-projection blocks is expressed in Eq. (1); how is the concatenation implemented as an effective operation?
[7] The authors could consider adding a flowchart to illustrate the logic flow of the proposed method.
[8] Why is a loss function of the form of Eq. (5) used in this study?
[9] From line 218 to 225 and also in Table 1, please enter the correct reference number for each method.
[10] The scale of Figure 6 is too small to read the details; the authors could show one sample at an enlarged scale to demonstrate the effectiveness of the proposed method.
[11] Is there any other performance index for comparison besides PSNR?
[12] The conclusion should be extended to describe more original contributions of the paper.
Author Response
Comment: The authors could summarize the literature survey on related work in the form of a table showing the relative merits of the different methods, with reference numbers. Response: Thank you for your thoughtful comment. Summarizing the relative strengths and weaknesses of existing studies in tabular form could indeed help readers better grasp the research trends. However, recent light field reconstruction papers commonly describe each approach narratively rather than in tabular form. Therefore, we chose to maintain the descriptive style in order to convey the conceptual differences among the methods more clearly.
Comment: How many "up-projection + down-projection" layers in Figure 1 should be used in a typical case for optimal performance? Response: Thank you for your helpful suggestion. We had already conducted experiments to determine the optimal number of "up-projection + down-projection" layers, but the results were not included in the original submission. Based on these results, using 12 pairs of blocks offers the best balance between performance and model complexity. The explanation has been added in lines 242–250 of the revised manuscript.
Comment: The fonts in Fig. 1 should be enlarged for better presentation. Response: Thank you for your comment. As suggested, the font size in Figure 1 has been increased to improve readability.
Comment: Why is a 7×7 conv function block used in Figure 2? Response: Thank you for your question. The 7×7 convolution block in Figure 2 is used within the spatial attention module to enlarge the receptive field and capture long-range spatial correlations. This wider kernel helps the network extract more informative spatial features in the early stage of feature extraction. The explanation has been revised in lines 139–140 of the manuscript.
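To make the role of the wide kernel concrete, below is a minimal sketch of a CBAM-style spatial attention module; the class name and exact layer arrangement are illustrative assumptions, not the paper's verbatim implementation:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: pool along channels, then a wide conv."""
    def __init__(self, kernel_size=7):
        super().__init__()
        # The 7x7 kernel enlarges the receptive field over the 2-channel pooled map.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (B, C, H, W)
        avg_pool = x.mean(dim=1, keepdim=True)  # (B, 1, H, W)
        max_pool = x.max(dim=1, keepdim=True).values
        attn = self.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn                         # reweight spatial positions
```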
Comment: In the implementation of the up-projection block, why should the 1×1 conv (49) be coupled with the conv (96) in the next block in Figure 3?
Comment: The concatenated output of the down-projection blocks is expressed in Eq. (1); how is the concatenation implemented as an effective operation? Response: Thank you for your question. In Equation (1), the outputs of the down-projection blocks are concatenated along the channel dimension to form the input feature map for the next block. This operation aggregates multi-level features while preserving the same spatial resolution. In implementation, the concatenation is performed with PyTorch's channel-wise operation (i.e., torch.cat([...], dim=1)).
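A minimal sketch of this channel-wise concatenation follows; the tensor shapes (49 channels per block output) are hypothetical, chosen only to echo the channel count mentioned for Figure 3:

```python
import torch

# Hypothetical shapes: each down-projection block emits a (B, 49, H, W) map.
d1 = torch.randn(2, 49, 64, 64)
d2 = torch.randn(2, 49, 64, 64)
d3 = torch.randn(2, 49, 64, 64)

# Channel-wise concatenation as in Eq. (1): spatial size is preserved,
# channels are stacked, so the next block sees all multi-level features.
fused = torch.cat([d1, d2, d3], dim=1)
print(fused.shape)  # torch.Size([2, 147, 64, 64])
```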
Comment: The authors could consider adding a flowchart to illustrate the logic flow of the proposed method. Response: Thank you for your suggestion. Rather than adding a separate flowchart, we clarified the overall workflow of the network at the beginning of the Method section and described each component in detail in the following paragraphs. This explanation has been added to lines 112–121 in the revised manuscript.
Comment: Why is a loss function of the form of Eq. (5) used in this study? Response: Thank you for your valuable comment. We have added an explanation in the manuscript to clarify the reason for using the loss function in Equation (5). Specifically, the Charbonnier loss was adopted for its robustness to outliers and its smooth approximation of the L1 norm. This explanation has been included in lines 229–230 of the revised manuscript.
Comment: From line 218 to 225 and also in Table 1, please enter the correct reference number for each method. Response: Thank you for your comment. The reference numbers for each method in lines 218–225 (revised lines 235–237) and Table 1 have been corrected in the revised manuscript.
Comment: The scale of Figure 6 is too small to read the details; the authors could show one sample at an enlarged scale to demonstrate the effectiveness of the proposed method. Response: Thank you for your comment. We understand the reviewer's concern about the small scale of Figure 6. The cropped images have a resolution of only 35 × 25 pixels, and they are already shown in enlarged form in the manuscript. We kindly ask for the reviewer's understanding that, due to this limitation, further enlargement would not reveal additional visual detail. Nevertheless, we will consider possible ways to improve the visual presentation in future work or supplementary materials.
Comment: Is there any other performance index for comparison besides PSNR? Response: Thank you for your comment. In addition to PSNR, we have included the SSIM metric for performance comparison. The results have been added to Table 1 in the revised manuscript.
Comment: The conclusion should be extended to describe more original contributions of the paper. Response: Thank you for your valuable suggestion. We have expanded the conclusion section to describe the original contributions of this study in more detail. The revised content can be found in the conclusion section of the manuscript.
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
I would like to thank the authors for submitting their manuscript entitled "Prex-NetII: Attention-Based Back-Projection Network for Light Field Reconstruction". The study proposes an efficient method for improving the angular resolution of light field images, which capture spatial and angular information for 3D reconstruction. The system uses a convolutional neural network that integrates multiple input views using a pixel blending technique and incorporates skip connections to improve stability and performance.
The authors' idea is clearly explained and the experimental results are promising. However, there are several aspects that would benefit from further clarification and refinement. Addressing the following comments will strengthen the manuscript and improve its overall quality.
Comments and suggestions to the authors
- All cross-references to bibliography, figures, and tables (identified as “Table ??,” “Figure ??,” “[?],” or “[? ?]”) should be reviewed to ensure that each corresponds to the appropriate element in the manuscript.
- The inclusion of more current references (2023-2025) in the state of the art would be beneficial in highlighting the novelty of the work. It is therefore recommended to improve the “2. Related Work” section with other novel works in the field, comparing them with the proposed method.
- The conclusions state that the proposed network outperforms existing methods in execution time, but no quantitative data is presented to support this claim in the results section. A quantitative evaluation of execution time should be added.
- Reinforcing in the Introduction how the key features of Prex-Net directly and concisely address the challenges of existing methods would help the reader to better situate themselves for the rest of the paper, with a better understanding of the principles of the work.
- It would be appropriate to improve the figure captions to ensure that they are clear, concise, and self-explanatory, complementing the textual description in the body of the manuscript. Currently, they are too brief.
Author Response
Comment: All cross-references to bibliography, figures, and tables (identified as “Table ??,” “Figure ??,” “[?],” or “[? ?]”) should be reviewed to ensure that each corresponds to the appropriate element in the manuscript. Response: We appreciate the reviewer’s observation. The missing cross-references (e.g., “Table ??,” “Figure ??,” “[?]”) likely resulted from a LaTeX compilation issue. To prevent this problem, we will submit the correctly compiled PDF together with the revised manuscript so that all figures, tables, and references display properly.
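For reference, "??" and "[?]" are what LaTeX prints when a \ref or \cite target is unresolved, and a full compile sequence fixes them. A minimal illustrative fragment (the caption text and label key are hypothetical):

```latex
% "Table ??" or "[?]" appears when a \ref or \cite target is unresolved;
% running pdflatex -> bibtex -> pdflatex -> pdflatex resolves them.
\documentclass{article}
\begin{document}
\begin{table}[t]
  \caption{Quantitative comparison of reconstruction quality.}
  \label{tab:comparison}
\end{table}
As shown in Table~\ref{tab:comparison}, the proposed method performs best.
\end{document}
```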
Comment: The inclusion of more current references (2023-2025) in the state of the art would be beneficial in highlighting the novelty of the work. It is therefore recommended to improve the "2. Related Work" section with other novel works in the field, comparing them with the proposed method. Response: Thank you for your suggestion. Light field reconstruction research can generally be divided into two directions: spatial resolution enhancement and angular resolution enhancement. Since our work focuses on the latter, we have carefully selected recent papers that represent major advances in this area. We therefore believe that the current references sufficiently capture recent research trends.
Comment: The conclusions state that the proposed network outperforms existing methods in execution time, but no quantitative data is presented to support this claim in the results section. A quantitative evaluation of execution time should be added. Response: We sincerely appreciate the reviewer’s careful observation. The statement in the conclusion that described our method as faster than existing approaches was an unintentional mistake. We have carefully revised this part to correct the error, and the updated content can be found in lines 288–289 of the revised manuscript.
Comment: Reinforcing in the Introduction how the key features of Prex-Net directly and concisely address the challenges of existing methods would help the reader to better situate themselves for the rest of the paper, with a better understanding of the principles of the work. Response: Thank you for your valuable suggestion. The introduction has been revised to more clearly explain how the main features of the proposed Prex-Net directly address the limitations of previous methods. The revised text appears in lines 42–52.
Comment: It would be appropriate to improve the figure captions to ensure that they are clear, concise, and self-explanatory, complementing the textual description in the body of the manuscript. Currently, they are too brief. Response: Thank you for your helpful comment. All figure captions have been revised to be clearer, more concise, and self-explanatory. These updates appear throughout the revised manuscript.
Author Response File: Author Response.docx
Reviewer 3 Report
Comments and Suggestions for Authors
This manuscript proposes an improved light-field reconstruction network, Prex-NetII, which employs pixel shuffle for efficient feature extraction, introduces skip connections to enhance training stability, and integrates spatial and channel attention mechanisms within the network structure. Extensive experiments on several public datasets demonstrate that the method outperforms existing approaches in terms of PSNR. The topic is timely, the overall structure is well organized, and the writing is clear. The experimental results support the main conclusions and show promising academic and practical value. However, several issues must be addressed before the manuscript can be considered for publication.
1. Although pixel shuffle and attention mechanisms are incorporated, the distinction from the authors' previous work (Prex-Net) is not sufficiently clear. A detailed quantitative comparison with the earlier model—including network architecture, parameter count, and performance—would better highlight the unique contributions of this study.
2. The manuscript notes that the proposed method increases the number of parameters compared to the previous version, but provides no concrete analysis of complexity such as FLOPs, inference time, or memory usage. Please add detailed computational and efficiency comparisons so that readers can fully assess practical feasibility.
3. Several figures (e.g., Figures 1–6) have low resolution or unclear numbering and captions. Please reformat and provide complete, high-quality figure legends to ensure readability. The description of dataset splits and preprocessing (including cropping and augmentation) is also too brief. Provide explicit training/testing details to improve reproducibility. In addition, many placeholder citations (e.g., "[? ?]") remain; these must be corrected and matched to the reference list.
4. The opening sentence of the Abstract could more directly emphasize the core contribution. In Section 3, some variables (e.g., r, F_init) are not defined when first introduced and should be clearly explained. The Conclusion could further discuss potential applications such as real-time VR/AR or microscopic imaging to broaden the practical outlook.
5. Inconsistent symbol definitions: symbols in Equations (1)–(4), such as U_i, lack complete dimensional descriptions and a clear physical or mathematical meaning when first introduced. A table summarizing all variables, their dimensions, and their correspondence to light-field data would greatly improve clarity.
6. In the loss function (Equation 5), ε is described only as "a small constant," without specifying its value or rationale. Please provide the chosen value and, if possible, a sensitivity analysis to ensure reproducibility.
7. Derivation and implementation details: Equations (2) and (4) involve multiple convolutions, pixel shuffle, and residual operations, but the text gives only a brief description. Add step-by-step derivations or pseudocode and a simple computation flow diagram to help readers follow the implementation.
8. Provide a more systematic ablation study to evaluate the independent contribution of pixel shuffle and spatial attention—e.g., results with only pixel shuffle, only spatial attention, and their combination.
9. The up/down-projection blocks contain many parameters, yet the manuscript lacks discussion of parameter sharing or model-lightening strategies. Consider experiments with grouped or depthwise separable convolutions, or low-rank decomposition, to assess potential reductions in complexity while maintaining performance.
10. Although the Charbonnier loss is used, there is no comparison with other losses (e.g., L1, SSIM, perceptual loss). Including such comparisons would clarify the impact of loss design on reconstruction quality.
11. Because the network stacks multiple up/down-projection blocks, provide experiments showing how the number of blocks affects performance, to justify the chosen depth and to explore the trade-off between model size and accuracy.
12. Current results rely mainly on mean PSNR without variance or confidence intervals. Report standard deviations or conduct significance tests from multiple independent runs to demonstrate that the observed improvements are statistically meaningful and robust to random initialization.
13. Beyond PSNR, include other perceptual quality metrics such as SSIM and LPIPS, as well as inference speed (FPS) and GPU memory usage, to give a more comprehensive evaluation of visual quality and practical applicability.
This work presents a promising contribution to light-field reconstruction, but it requires substantial improvements in the clarity of its novelty, computational-complexity analysis, figure quality, and completeness of references. I recommend Major Revision.
Author Response
Comment: Although pixel shuffle and attention mechanisms are incorporated, the distinction from the authors’ previous work (Prex-Net) is not sufficiently clear. A detailed quantitative comparison with the earlier model—including network architecture, parameter count, and performance—would better highlight the unique contributions of this study. Response: We thank the reviewer for raising the issue of distinguishing this work from our previous model, Prex-Net. To address this, we included Table 2, which provides a quantitative comparison of the two models in terms of parameter count and performance. The results show that the proposed model achieves better performance while reducing the number of parameters.
Comment: The manuscript notes that the proposed method increases the number of parameters compared to the previous version, but provides no concrete analysis of complexity such as FLOPs, inference time, or memory usage. Please add detailed computational and efficiency comparisons so that readers can fully assess practical feasibility. Response: We appreciate the reviewer's comment regarding the computational complexity analysis. The earlier statement that the proposed method increases the number of parameters was due to an incorrect comparison with a previous version that used nine projection groups. After correction, we confirmed that the proposed network actually reduces the number of parameters when using the same number of projection groups. The revised explanation has been added in lines 242–250, and the memory usage analysis has been included in Table 2.
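As a side note, parameter and memory figures like those in Table 2 are commonly obtained as sketched below (PyTorch; the helper names are ours, not from the manuscript):

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Total trainable parameters, the figure usually quoted in such tables."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def peak_gpu_memory_mb(model: torch.nn.Module, dummy_input: torch.Tensor) -> float:
    """Peak GPU memory (MB) for one forward pass; model and input must be on CUDA."""
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        model(dummy_input)
    return torch.cuda.max_memory_allocated() / 1024 ** 2
```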
Comment: Several figures (e.g., Figures 1–6) have low resolution or unclear numbering and captions. Please reformat and provide complete, high-quality figure legends to ensure readability. The description of dataset splits and preprocessing (including cropping and augmentation) is also too brief. Provide explicit training/testing details to improve reproducibility. In addition, many placeholder citations (e.g., “[? ?]”) remain; these must be corrected and matched to the reference list. Response: We appreciate the reviewer’s valuable comments. All figures (Figures 1–6) have been reformatted in high resolution, and their captions have been revised for clarity and completeness. Details regarding dataset division and preprocessing (including cropping and augmentation) are now provided through the released training code. The placeholder citations such as “[? ?]” were caused by LaTeX compilation errors; this issue has been corrected, and a properly rendered PDF file has been uploaded.
Comment: The opening sentence of the Abstract could more directly emphasize the core contribution. In Section 3, some variables (e.g., r, F_init) are not defined when first introduced and should be clearly explained. The Conclusion could further discuss potential applications such as real-time VR/AR or microscopic imaging to broaden the practical outlook. Response: We appreciate the reviewer's helpful comments. The abstract has been revised to highlight the main contributions more clearly. The variables r and F_init are already defined in the main text. The conclusion has also been updated to include potential applications.
Comment: Inconsistent symbol definitions: symbols in Equations (1)–(4), such as U_i, lack complete dimensional descriptions and a clear physical or mathematical meaning when first introduced. A table summarizing all variables, their dimensions, and their correspondence to light-field data would greatly improve clarity. Response: We thank the reviewer for the helpful comments. The error in Equation (2) has been corrected. Regarding the correspondence between U_i and the light field data, U_i represents an intermediate feature map within the network rather than a directly observable light field representation. For this reason, a one-to-one correspondence with the light field input cannot be explicitly defined.
Comment: In the loss function (Equation 5), ε is described only as "a small constant," without specifying its value or rationale. Please provide the chosen value and, if possible, a sensitivity analysis to ensure reproducibility. Response: We appreciate the reviewer's valuable comment. To verify the effectiveness of the Charbonnier loss, we conducted a comparison with the L1 loss. The constant ε was set to 1×10⁻³, a commonly used value in previous studies that ensures stable convergence and prevents numerical instability. This explanation has been added in lines 232–233 of the revised manuscript.
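For concreteness, a standard Charbonnier loss with this ε is sketched below; this follows the common formulation in the super-resolution literature and is not copied from the authors' code:

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: a smooth, outlier-robust approximation of the L1 norm.
    eps = 1e-3 keeps the gradient finite near zero error."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```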
Comment: Derivation and implementation details: Equations (2) and (4) involve multiple convolutions, pixel shuffle, and residual operations, but the text gives only a brief description. Add step-by-step derivations or pseudocode and a simple computation flow diagram to help readers follow the implementation. Response: We appreciate the reviewer's helpful suggestion. Equations (2) and (4) have been revised to describe the convolution, pixel shuffle, and residual operations more clearly. These revisions, which clarify the computational process, can be found in lines 178–180 and 209 of the revised manuscript.
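Since Equations (2) and (4) are not reproduced in this record, the sketch below only illustrates the generic pixel shuffle + residual pattern that such up-projection blocks use; the channel counts and layer order are assumptions, not the paper's exact definitions:

```python
import torch
import torch.nn as nn

class UpProjectionSketch(nn.Module):
    """Generic pixel shuffle + residual pattern (illustrative only)."""
    def __init__(self, channels=49, r=2):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * r * r, 3, padding=1)
        self.shuffle = nn.PixelShuffle(r)   # (B, C*r^2, H, W) -> (B, C, r*H, r*W)
        self.refine = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        up = self.shuffle(self.expand(x))   # upsample by rearranging channels
        return up + self.refine(up)         # residual connection around refinement
```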
Comment: Provide a more systematic ablation study to evaluate the independent contribution of pixel shuffle and spatial attention—e.g., results with only pixel shuffle, only spatial attention, and their combination. Response: We appreciate the reviewer’s thoughtful suggestion. In the proposed model, the pixel shuffle operation is essential for constructing the initial feature map. Unlike our previous work (Prex-Net), which extracted features using 3D convolutions, the proposed network relies on pixel shuffle to integrate multiple sub-aperture images into a unified representation. Removing the pixel shuffle operation causes the network to fail in generating valid outputs, as it can no longer synthesize views properly from the input images. Therefore, an independent ablation of pixel shuffle cannot be meaningfully applied in this framework. However, the effect of the spatial attention module has been quantitatively analyzed, as shown in the table within the manuscript, which sufficiently demonstrates its contribution to the overall performance.
Comment: The up/down-projection blocks contain many parameters, yet the manuscript lacks discussion of parameter sharing or model-lightening strategies. Consider experiments with grouped or depthwise separable convolutions, or low-rank decomposition, to assess potential reductions in complexity while maintaining performance. Response: We appreciate the reviewer’s valuable suggestion regarding model efficiency. Following the comment, we conducted additional experiments applying grouped convolution, depthwise separable convolution, and weight sharing within the projection block. As shown in the table below, these approaches significantly reduced the number of parameters; however, they also led to a noticeable degradation in reconstruction performance. All experiments were performed using the Prex-NetII configuration with 9 projection groups.
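For intuition on why these variants shrink the model, compare the weight counts of a standard 3×3 convolution and a depthwise-separable replacement; the 96-channel width is hypothetical, chosen only to echo the conv (96) mentioned by Reviewer 1:

```python
import torch.nn as nn

# Standard 3x3 conv vs. a depthwise-separable replacement (hypothetical widths).
std = nn.Conv2d(96, 96, 3, padding=1)            # 96 * 96 * 9 = 82,944 weights
dws = nn.Sequential(
    nn.Conv2d(96, 96, 3, padding=1, groups=96),  # depthwise: 96 * 9 = 864
    nn.Conv2d(96, 96, 1),                        # pointwise: 96 * 96 = 9,216
)
# ~8x fewer weights (10,080 vs. 82,944), consistent with the reported parameter
# reduction, but at some cost in reconstruction quality per the authors' table.
```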
Comment: Although the Charbonnier loss is used, there is no comparison with other losses (e.g., L1, SSIM, perceptual loss). Including such comparisons would clarify the impact of loss design on reconstruction quality. Response: We appreciate the reviewer’s valuable comment. An additional experiment using the L1 loss was conducted, and the performance was found to be lower than that of the Charbonnier loss. The results of the L1-based experiment are provided below, and all experiments were performed using the Prex-NetII configuration with 9 projection groups.
Comment: Because the network stacks multiple up/down-projection blocks, provide experiments showing how the number of blocks affects performance, to justify the chosen depth and to explore the trade-off between model size and accuracy. Response: We appreciate the reviewer’s insightful suggestion. To examine the effect of network depth, we conducted an additional experiment by varying the number of projection blocks. The results, summarized in the newly added Table 2, demonstrate the trade-off between model size and reconstruction performance and justify the selected network depth.
Comment: Current results rely mainly on mean PSNR without variance or confidence intervals. Report standard deviations or conduct significance tests from multiple independent runs to demonstrate that the observed improvements are statistically meaningful and robust to random initialization. Response: We thank the reviewer for this valuable comment. While we did not include statistical indicators such as variance or confidence intervals, we ensured reproducibility by training all models under identical conditions and making the complete implementation publicly available on GitHub for independent verification.
Comment: Beyond PSNR, include other perceptual quality metrics such as SSIM and LPIPS, as well as inference speed (FPS) and GPU memory usage, to give a more comprehensive evaluation of visual quality and practical applicability. Response: We appreciate the reviewer's valuable comment. To provide a more comprehensive evaluation of perceptual quality and model efficiency, we have added the SSIM metric and the number of model parameters to the experimental results. These additions can be found in Table 1.
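For readers reproducing the evaluation, PSNR and SSIM can be computed with scikit-image as sketched below; the arrays and value range are hypothetical, and this is not the authors' evaluation script:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Hypothetical ground-truth and reconstructed views, H x W x 3 floats in [0, 1].
gt = np.random.rand(256, 256, 3)
pred = np.clip(gt + 0.01 * np.random.randn(256, 256, 3), 0, 1)

psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
# channel_axis requires scikit-image >= 0.19 (older versions used multichannel=True).
ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```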
Author Response File: Author Response.docx
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors' work has resulted in an overall improvement in the quality of the manuscript.
To optimize the strength and currency of the bibliographic base, I recommend including some more recent references. This would complement the citations already considered sufficient by the authors and ensure that the research is fully in line with the current state of knowledge.
Author Response
Comment: The authors' work has resulted in an overall improvement in the quality of the manuscript.
To optimize the strength and currency of the bibliographic base, I recommend including some more recent references. This would complement the citations already considered sufficient by the authors and ensure that the research is fully in line with the current state of knowledge.
Response: We appreciate the reviewer’s suggestion to strengthen the bibliographic foundation with more recent references. Accordingly, two recent papers have been added to the manuscript (Lines 85–87 and 105–108). The details of the added references are as follows:
[19] Chen, Yilei, et al. "Enhanced light field reconstruction by combining disparity and texture information in PSVs via disparity-guided fusion." IEEE Transactions on Computational Imaging 9 (2023): 665-677.
[25] Fang, Li, Qian Wang, and Long Ye. "GLGNet: light field angular superresolution with arbitrary interpolation rates." Visual Intelligence 2.1 (2024): 6.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have largely made modifications in accordance with my review comments. Although not every requirement has been fully met, substantial work has been done. On this basis, the manuscript should be acceptable.
Author Response
Comment: The authors have largely made modifications in accordance with my review comments. Although not every requirement has been fully met, substantial work has been done. On this basis, the manuscript should be acceptable.
Response: We sincerely appreciate the reviewer’s recognition of our revisions and the helpful feedback provided throughout the review process. While it was not possible to address every point in full, we carefully revised the manuscript based on the reviewer’s suggestions and made several improvements. We also reread the entire paper to refine the writing and ensure consistency, and the revised parts are highlighted in the resubmitted version.
Author Response File: Author Response.pdf