GPU-Accelerated PSO for High-Performance American Option Valuation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- In the “INTRODUCTION” section, it is suggested that the description of the limitations of existing GPU-PSO in option pricing could be more specific.
- In the "GPU ACCELERATION OF PSO AND OPTION PRICING" section, there is a lack of a comprehensive review of related research by scholars in the past five years. It is recommended to provide a more detailed description of previous studies to better highlight the innovation of this research.
- On page 2, lines 55-56, the equation should be placed on a separate line, and necessary explanations should be provided for each letter in the equation.
- All figures and tables in the text should have titles and be numbered. They also need to be properly explained and interpreted in the text.
- A detailed comparative analysis of Figure 2 is required.
- The performance improvement data for each stage of the optimization process from the CPU baseline to the final fused float4 kernel is detailed (Table 1 and Figure 1). The technical details of this part need to be supplemented.
- On page 7, Table 2 lists the techniques of each kernel file, but lacks code snippets or pseudo-code examples.
- Pages 5-8 show significant performance improvements (from 36.7s to 0.246s). Page 5 mentions "exceeding existing GPU implementations", but does not provide detailed analysis with comparison data from specific literature. Page 7 mentions "reaching hardware limits", and it is suggested to analyze the GPU occupancy rate or memory bandwidth utilization.
- Pages 9-10, the sensitivity analysis and risk assessment sections lack validation with actual cases. The discussion on future hardware scalability is rather general. It is recommended to supplement the potential for adaptation to new AMD/NVIDIA architectures.
- It is suggested that the conclusion section be consolidated into one paragraph. This section should summarize this research, present the main conclusions of this article, and look forward to future work.
- The format of the references needs to be organized according to the requirements of the journal.
- English grammar and format need to be further improved.
Author Response
1. In the “INTRODUCTION” section, it is suggested that the description of the limitations of existing GPU-PSO in option pricing could be more specific.
We rewrote the INTRODUCTION section; the limitations of existing GPU-PSO are now explicitly stated (see the paragraph starting with “Our main contribution…”).
2. In the "GPU ACCELERATION OF PSO AND OPTION PRICING" section, there is a lack of a comprehensive review of related research by scholars in the past five years. It is recommended to provide a more detailed description of previous studies to better highlight the innovation of this research.
This section has been split into two new sections: PARTICLE SWARM OPTIMIZATION FOR DERIVATIVE PRICING and GPU ACCELERATION OF PSO AND OPTION PRICING. Additional discussion of recent studies has been added to address the referee's concern.
3. On page 2, lines 55-56, the equation should be placed on a separate line, and necessary explanations should be provided for each letter in the equation.
Done. The equation is now displayed on a separate line, and every symbol in it is defined.
4. All figures and tables in the text should have titles and be numbered. They also need to be properly explained and interpreted in the text.
All tables and figures now have titles and are numbered.
5. A detailed comparative analysis of Figure 2 is required.
In the new version, lines 448-452 provide a detailed comparative analysis of Figure 2.
6. The performance improvement data for each stage of the optimization process from the CPU baseline to the final fused float4 kernel is detailed (Table 1 and Figure 1). The technical details of this part need to be supplemented.
Done. Please see the newly added paragraphs immediately before Table 1 and after Figure 1.
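As a reading aid for the kernel-fusion stage discussed here, the idea can be sketched in plain NumPy. This is an illustrative analogy, not the authors' OpenCL kernels, and all function names are hypothetical: fusing the payoff and discounting passes removes an intermediate array round-trip through memory, which is the motivation behind fusing kernels on the GPU.

```python
import numpy as np

def price_paths_unfused(S_T, K, r, T):
    """Two separate passes: a payoff 'kernel', then a discounting 'kernel'."""
    payoff = np.maximum(K - S_T, 0.0)   # pass 1: intrinsic put value per path
    return np.exp(-r * T) * payoff      # pass 2: discount the stored array

def price_paths_fused(S_T, K, r, T):
    """One fused pass: payoff and discounting in a single expression,
    avoiding the intermediate payoff array."""
    disc = np.exp(-r * T)
    return disc * np.maximum(K - S_T, 0.0)

rng = np.random.default_rng(0)
S_T = rng.lognormal(mean=np.log(100.0), sigma=0.2, size=1_000_000)
a = price_paths_unfused(S_T, 100.0, 0.05, 1.0)
b = price_paths_fused(S_T, 100.0, 0.05, 1.0)
assert np.allclose(a, b)  # fusion changes memory traffic, not results
```

On a GPU the win comes from eliminating the global-memory write/read of the intermediate array, which NumPy only mimics loosely.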
7. On page 7, Table 2 lists the techniques of each kernel file, but lacks code snippets or pseudo-code examples.
Line 295 (new version) gives pseudo-code for the core fitness function, i.e. the particle-wise American option evaluation. We do not list every code snippet, to avoid excessive duplication.
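The actual pseudo-code appears in the manuscript; the sketch below is one plausible NumPy rendering of a particle-wise fitness, under the assumption that each particle encodes a per-time-step exercise boundary for an American put. The function and variable names here are ours, not the paper's.

```python
import numpy as np

def fitness(boundary, paths, K, r, dt):
    """Hypothetical particle-wise fitness: value an American put by
    exercising each simulated path the first time the asset price
    falls to or below the particle's exercise boundary at that step.

    boundary : (n_steps,) exercise boundary encoded by one particle
    paths    : (n_paths, n_steps) simulated asset prices
    """
    n_paths, n_steps = paths.shape
    value = np.zeros(n_paths)
    exercised = np.zeros(n_paths, dtype=bool)
    for t in range(n_steps):
        hit = (~exercised) & (paths[:, t] <= boundary[t])
        value[hit] = np.exp(-r * dt * (t + 1)) * np.maximum(K - paths[hit, t], 0.0)
        exercised |= hit
    return float(value.mean())

# Illustrative GBM paths under the risk-neutral measure (parameters are ours)
rng = np.random.default_rng(1)
n_paths, n_steps = 20_000, 50
S0, r, sigma, K = 100.0, 0.05, 0.2, 100.0
dt = 1.0 / n_steps
z = rng.standard_normal((n_paths, n_steps))
log_paths = np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1)
paths = S0 * np.exp(log_paths)
val = fitness(np.full(n_steps, 90.0), paths, K, r, dt)
```

In the GPU version each particle's evaluation of this loop maps naturally onto one work-item, which is what makes the problem embarrassingly parallel.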
8. Pages 5-8 show significant performance improvements (from 36.7s to 0.246s). Page 5 mentions "exceeding existing GPU implementations", but does not provide detailed analysis with comparison data from specific literature. Page 7 mentions "reaching hardware limits", and it is suggested to analyze the GPU occupancy rate or memory bandwidth utilization.
Unlike vendor-specific APIs such as NVIDIA's CUPTI or AMD's ROCm counters, Apple has not exposed low-level profiling interfaces for OpenCL. Instead, its tooling is fully integrated with Metal, reinforcing Metal as the primary GPU infrastructure for performance analysis.
We therefore did not have low-level GPU profiling (occupancy or bandwidth counters) available on the M3 at the time of writing. Apple's GPU profiling ecosystem is built around Metal, not OpenCL. While OpenCL code runs on Apple Silicon, Metal is the only API supported with deep hardware performance tracking and instrumentation tools, so the only way to access real hardware counters (e.g. memory bandwidth, occupancy) is to port or wrap the kernels as Metal shaders.
Porting the OpenCL code to Metal is out of the scope of this paper; we deliberately use an OpenCL implementation for general-purpose GPU programming with portability.
9. Pages 9-10, the sensitivity analysis and risk assessment sections lack validation with actual cases. The discussion on future hardware scalability is rather general. It is recommended to supplement the potential for adaptation to new AMD/NVIDIA architectures.
We removed the sensitivity section and leave it for future work (as explained in the Conclusion section). For scalability, we enriched Section 5.7 with a function-wise scalability analysis and tied it to Section 7.
10. It is suggested that the conclusion section be consolidated into one paragraph. This section should summarize this research, present the main conclusions of this article, and look forward to future work.
Merged into one paragraph. We also added a couple of short paragraphs on future work: other potential option types, Greeks calculations, and sensitivity analyses that can use our methodology.
11. The format of the references needs to be organized according to the requirements of the journal.
Done. Now the format complies with the Journal template.
12. English grammar and format need to be further improved.
The document is now in the Journal's standard format. We also used an LLM to check the grammar throughout the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript requires revision due to methodological gaps and insufficient validation. First, the claimed novelty of combining standard OpenCL optimizations (coalescing, vectorization, fusion) lacks differentiation from prior GPU-PSO literature (e.g., Sharma et al. 2013, Chen et al. 2021).
The core algorithmic contribution—applying these techniques to American options—overlaps substantially with the authors' own prior work on GPU-accelerated Longstaff-Schwartz (Li & Chen 2023; Li et al. 2024), without clarifying how PSO-specific challenges justify new publication. Second, experimental design undermines performance claims:
(1) The baseline comparison uses a single-threaded NumPy implementation rather than multi-threaded CPU or optimized CUDA benchmarks, exaggerating speedups;
(2) Hardware choice (Apple M3 Max) is atypical for HPC finance, and results lack cross-platform validation;
(3) Kernel timings (Table 1) omit profiling metrics (occupancy, memory bandwidth utilization) to substantiate optimization efficacy. Additionally, financial validation is inadequate: The PSO price (10.657) diverges from binomial (10.602) and Longstaff-Schwartz (10.665) benchmarks without error analysis.
No sensitivity tests (e.g., varying strike/maturity) or comparison to industry standards (QuantLib) are provided. The "Sensitivity Analysis" section (Sec. 6) is purely speculative—no Greeks are computed—while scalability claims (Sec. 7) lack empirical data.
Finally, the model’s applicability to path-dependent options (e.g., Bermudans, barriers) remains unverified despite being a stated motivation.
Author Response
This manuscript requires revision due to methodological gaps and insufficient validation. First, the claimed novelty of combining standard OpenCL optimizations (coalescing, vectorization, fusion) lacks differentiation from prior GPU-PSO literature (e.g., Sharma et al. 2013, Chen et al. 2021).
Our study takes an engineering perspective: it decomposes the techniques and analyzes how each one contributes marginally to the performance gains. This is particularly valuable for practical applications, which usually combine and ensemble different techniques with fine-tuning on a specific infrastructure platform.
The core algorithmic contribution—applying these techniques to American options—overlaps substantially with the authors' own prior work on GPU-accelerated Longstaff-Schwartz (Li & Chen 2023; Li et al. 2024), without clarifying how PSO-specific challenges justify new publication. Second, experimental design undermines performance claims:
- The baseline comparison uses a single-threaded NumPy implementation rather than multi-threaded CPU or optimized CUDA benchmarks, exaggerating speedups;
Lines 318-324 (new version) explain that this choice was deliberate.
- Hardware choice (Apple M3 Max) is atypical for HPC finance, and results lack cross-platform validation.
We do not have access to lab equipment; the study was conducted entirely on a personal computer, which happens to be an Apple device. However, OpenCL code is known for its high portability, so the results should be readily replicable on other platforms such as NVIDIA (CUDA) or AMD hardware.
- Kernel timings (Table 1) omit profiling metrics (occupancy, memory bandwidth utilization) to substantiate optimization efficacy. Additionally, financial validation is inadequate: The PSO price (10.657) diverges from binomial (10.602) and Longstaff-Schwartz (10.665) benchmarks without error analysis.
An error analysis is now included in lines 356-364.
No sensitivity tests (e.g., varying strike/maturity) or comparison to industry standards (QuantLib) are provided. The "Sensitivity Analysis" section (Sec. 6) is purely speculative—no Greeks are computed—while scalability claims (Sec. 7) lack empirical data.
We fully acknowledge the comment by the referee. The section is now removed.
Finally, the model’s applicability to path-dependent options (e.g., Bermudans, barriers) remains unverified despite being a stated motivation.
This is now addressed explicitly in the Conclusion section. We also added further applications to Greeks and stress-testing calculations, which are important in risk management.
Reviewer 3 Report
Comments and Suggestions for Authors
The paper is a methodology-driven exposition of enhancing Particle Swarm Optimization (PSO) for American options pricing on OpenCL on GPUs. The paper outlines a series of performance optimizations from a baseline using NumPy to an end product significantly reducing execution time. All the steps in the optimizations, like loop restructuring, kernel fusion, and SIMD vectorization, are outlined by the authors, and the respective performance measurements at each step are reported. PyOpenCL on Apple M3 Max hardware offers practical insight. Overall, the paper offers decent critique of GPU-acceleration approaches to computational finance but within the scope of a specific hardware-software configuration.
However, the following elements need consideration in order to improve the quality of the manuscript:
- In the Introduction section, research contributions are missing. The research contributions should be highlighted (for instance using bullet points) for the readers to understand the overall contribution to the existing body of knowledge.
- An explicit novelty statement is missing from the introduction section. An explicit novelty statement in clear and concise manner help the readers understand the novel element of the research.
- A well-elaborated separate section named as Related Works or Literature Review is needed for better understanding of the readers.
- As PSO is stochastic and sensitive to initialization, how can authors be sure that the final price (i.e., 10.657446) is not a local optimum? Have authors quantified the convergence stability of their method across several runs or seeds, and how does it statistically stand in comparison with state-of-the-art techniques?
- Why do authors choose PSO to price rather than to calibrate or approximate boundaries, where it is more common? Given the path dependence of American options, how do authors account for PSO's competitiveness relative to model-based methods like Longstaff-Schwartz or PDE solvers in reliability of convergence, interpretability, or robustness?
- As float4 vectorization uses single-precision arithmetic, how did authors control that rounding errors over time do not distort payoffs or lead to incorrect early-exercise decisions, especially for large time steps or highly volatile assets?
- Which of authors’ optimizations (e.g., float4, loop reversal, fusion) would be portable to other GPU architectures (e.g., NVIDIA CUDA, AMD ROCm)? Is authors’ performance gain mostly due to architectural alignment with float4 SIMD on Apple hardware, and how would the benefits translate to more common platforms in finance?
- How does authors’ pipeline handle more realistic contract features such as path-dependent payoffs (e.g., Asian features), discrete dividends, or early-exercise penalties? Would the current vectorization and lockstep approach extend if authors include stochastic volatility or jump-diffusion?
- Did authors profile memory bottlenecks (e.g., shared memory contention, register pressure) at scale? At what size do authors observe GPU saturation or performance degradation, and how does that influence convergence behavior or accuracy?
Author Response
The paper is a methodology-driven exposition of enhancing Particle Swarm Optimization (PSO) for American options pricing on OpenCL on GPUs. The paper outlines a series of performance optimizations from a baseline using NumPy to an end product significantly reducing execution time. All the steps in the optimizations, like loop restructuring, kernel fusion, and SIMD vectorization, are outlined by the authors, and the respective performance measurements at each step are reported. PyOpenCL on Apple M3 Max hardware offers practical insight. Overall, the paper offers decent critique of GPU-acceleration approaches to computational finance but within the scope of a specific hardware-software configuration.
However, the following elements need consideration in order to improve the quality of the manuscript:
- In the Introduction section, research contributions are missing. The research contributions should be highlighted (for instance using bullet points) for the readers to understand the overall contribution to the existing body of knowledge.
We rewrote the Introduction section; the limitations of existing GPU-PSO are now explicitly stated (see the paragraph starting with “Our main contribution…”).
- An explicit novelty statement is missing from the introduction section. An explicit novelty statement in clear and concise manner help the readers understand the novel element of the research.
We rewrote the Introduction section; the limitations of existing GPU-PSO are now explicitly stated (see the paragraph starting with “Our main contribution…”).
- A well-elaborated separate section named as Related Works or Literature Review is needed for better understanding of the readers.
A new Section 6 Related Work and Discussion is added.
- As PSO is stochastic and sensitive to initialization, how can authors be sure that the final price (i.e., 10.657446) is not a local optimum? Have authors quantified the convergence stability of their method across several runs or seeds, and how does it statistically stand in comparison with state-of-the-art techniques?
Performance metrics and reproducibility are discussed in lines 335-343.
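A multi-seed stability check of the kind the referee asks about can be sketched as follows. This is a minimal, generic PSO on a toy objective with textbook parameters, not the paper's implementation or its option-pricing fitness; it only illustrates how running across seeds quantifies convergence stability via the spread of final values.

```python
import numpy as np

def pso_min(f, dim, seed, n_particles=30, iters=200):
    """Minimal PSO minimizer, used only to illustrate a multi-seed check."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        # inertia 0.7 and cognitive/social weights 1.5 are common defaults
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (g - x)
        x = x + v
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[pbest_f.argmin()].copy()
    return f(g)

sphere = lambda p: float(np.sum(p * p))
vals = [pso_min(sphere, dim=4, seed=s) for s in range(10)]
print(np.mean(vals), np.std(vals))  # small std across seeds = stable convergence
```

The same harness applied to the pricing fitness would report the mean and standard deviation of the PSO price across seeds, directly answering the local-optimum concern.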
- Why do authors choose PSO to price rather than to calibrate or approximate boundaries, where it is more common? Given the path dependence of American options, how do authors account for PSO's competitiveness relative to model-based methods like Longstaff-Schwartz or PDE solvers in reliability of convergence, interpretability, or robustness?
The reviewer's comment is correct: LSMC is common, yet it has drawbacks such as the curse of dimensionality and matrix-singularity issues that can introduce unstable calculation results, as mentioned in Li et al. (2024). The PSO method does not have these limitations.
- As float4 vectorization uses single-precision arithmetic, how did authors control that rounding errors over time do not distort payoffs or lead to incorrect early-exercise decisions, especially for large time steps or highly volatile assets?
Single precision is one limitation of GPUs. Some hardware platforms, such as NVIDIA chips with newer versions of the CUDA SDK, provide double precision, but at a significant performance cost; ultimately, the accuracy/performance trade-off is a design choice. On our research equipment, an Apple MacBook Pro with an M3 Max chip, the supported OpenCL 1.2 implementation provides single precision only. We rewrote and explain this in lines 325-334.
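One standard mitigation for single-precision rounding drift, shown here in Python purely for illustration (it is a general technique sketch, not code from the manuscript), is compensated (Kahan) summation, which tracks the low-order bits that plain float32 accumulation discards.

```python
import numpy as np

def naive_sum_f32(xs):
    """Straight float32 accumulation, as a plain single-precision loop does."""
    s = np.float32(0.0)
    for x in xs:
        s = np.float32(s + x)
    return s

def kahan_sum_f32(xs):
    """Compensated (Kahan) summation in float32: a cheap mitigation for
    rounding drift when double precision is unavailable."""
    s = np.float32(0.0)
    c = np.float32(0.0)  # running compensation for lost low-order bits
    for x in xs:
        y = np.float32(x - c)
        t = np.float32(s + y)
        c = np.float32((t - s) - y)
        s = t
    return s

xs = np.full(100_000, 0.1, dtype=np.float32)
exact = 100_000 * float(np.float32(0.1))  # exact sum of the stored values
err_naive = abs(float(naive_sum_f32(xs)) - exact)
err_kahan = abs(float(kahan_sum_f32(xs)) - exact)
print(err_naive, err_kahan)
```

The compensated error stays near one unit in the last place of the result, which is usually ample for discounted-payoff accumulation at float32.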
- Which of authors’ optimizations (e.g., float4, loop reversal, fusion) would be portable to other GPU architectures (e.g., NVIDIA CUDA, AMD ROCm)? Is authors’ performance gain mostly due to architectural alignment with float4 SIMD on Apple hardware, and how would the benefits translate to more common platforms in finance?
All optimizations are general GPU programming techniques, and their OpenCL-based implementations are portable to other GPU architectures; it is worth noting that the effectiveness of kernel fusion and floatN SIMD is ultimately determined by the underlying hardware. We discuss portability in lines 298-307.
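The float4 idea itself is hardware-agnostic: each work-item advances four paths at once through identical arithmetic. A loose NumPy analogy (ours, not the paper's kernel) packs paths into lanes of 4 and verifies the lane-wise update matches the scalar one:

```python
import numpy as np

def gbm_step_scalar(S, r, sigma, dt, z):
    """Per-path GBM update (one 'work-item' per path)."""
    return S * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)

def gbm_step_float4_style(S, r, sigma, dt, z):
    """Same update on lanes of 4, mirroring an OpenCL float4 kernel:
    each 'work-item' advances four paths in lockstep."""
    S4 = S.reshape(-1, 4)  # pack paths into float4-like lanes
    z4 = z.reshape(-1, 4)
    out = S4 * np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z4)
    return out.reshape(-1)

rng = np.random.default_rng(2)
S = np.full(1024, 100.0, dtype=np.float32)
z = rng.standard_normal(1024).astype(np.float32)
a = gbm_step_scalar(S, 0.05, 0.2, 0.01, z)
b = gbm_step_float4_style(S, 0.05, 0.2, 0.01, z)
assert np.allclose(a, b)
```

Whether the 4-wide form actually runs faster on a given chip depends on its SIMD width and compiler, which is the hardware dependence noted above.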
- How does authors’ pipeline handle more realistic contract features such as path-dependent payoffs (e.g., Asian features), discrete dividends, or early-exercise penalties? Would the current vectorization and lockstep approach extend if authors include stochastic volatility or jump-diffusion?
We have a revised Conclusion section that addresses this issue (last two paragraphs).
- Did authors profile memory bottlenecks (e.g., shared memory contention, register pressure) at scale? At what size do authors observe GPU saturation or performance degradation, and how does that influence convergence behavior or accuracy?
Unlike vendor-specific APIs such as NVIDIA's CUPTI or AMD's ROCm counters, Apple has not exposed low-level profiling interfaces for OpenCL. Instead, its tooling is fully integrated with Metal, reinforcing Metal as the primary GPU infrastructure for performance analysis.
We therefore did not have low-level GPU profiling (occupancy or bandwidth counters) available on the M3 at the time of writing. Apple's GPU profiling ecosystem is built around Metal, not OpenCL. While OpenCL code runs on Apple Silicon, Metal is the only API supported with deep hardware performance tracking and instrumentation tools, so the only way to access real hardware counters (e.g. memory bandwidth, occupancy) is to port or wrap the kernels as Metal shaders. Porting the OpenCL code to Metal is out of the scope of this paper; we deliberately use an OpenCL implementation for general-purpose GPU programming with portability.
Reviewer 4 Report
Comments and Suggestions for Authors
General description. The work is devoted to the high-performance implementation of American option valuation using the PSO method on a GPU with OpenCL. The authors consistently optimize the kernels (coalesced access, branch elimination, kernel fusion, SIMD vectorization with float4) and report a total time of 0.246 s and ≈150× acceleration relative to the CPU baseline of 36.7 s. The techniques used are in line with the recommendations of leading vendors (NVIDIA/AMD) regarding memory coalescing, avoiding divergence, and using vector types (float4). The authors' code and implementations are available in an open repository (MIT), which enhances reproducibility.
Comments and shortcomings:
• The authors do not directly compare the performance of their method with other modern implementations (e.g., other GPU methods such as LSMC or CUDA).
• The reported 0.246 s is useful, but there is a lack of “apples to apples” comparisons with modern GPU implementations on the same hardware and with fixed paths × steps × particles parameters. For reference, LSMC on Tesla V100 reports 17 ms (1 million paths, 100 steps) and is up to 2.5× faster than the manually optimized CUDA version; this is a different algorithm and different hardware, but it sets the performance bar — standardized benchmarks and a configuration table should be added.
• There is no detailed information about the stability and accuracy of the results when changing the algorithm parameters. Generalized error metrics (to the binomial/analytical reference price) on time grids, sensitivity to PSO hyperparameters, and repeatability with different seeds are needed. For the related GPU LSMC, it has been shown that the choice of basis functions can cause instability, which must be specifically addressed; it is desirable to highlight the risks for the PSO approach in a similar way.
• The paper does not address the impact of double precision on the accuracy of financial calculations.
• There is no demonstration of the algorithm's performance on more complex types of options (e.g., multidimensional or path-dependent).
Overall conclusion. The work is interesting and shows good results in optimizing the speed of option price calculation, but requires additional comparisons with modern analogues for a more complete analysis. It is also recommended to supplement the research with more detailed accuracy tests and open access to the source code. This will allow for a better assessment of the practical significance of the presented results.
Author Response
Comments and shortcomings.
- The authors do not directly compare the performance of their method with other modern implementations (e.g., other GPU methods such as LSMC or CUDA).
LSMC is another method that we studied in Li et al. (2024, JoD). This paper deliberately focuses on how incremental optimization techniques allow the PSO method to achieve a significant performance uplift. Apart from the fact that we did not have access to an NVIDIA computing device, a second focus is the portability of OpenCL: CUDA works only on NVIDIA chips, and Metal works only on Apple M-series chips. We discuss the different hardware platforms in lines 325-334; our future work may include benchmarks and comparisons on other hardware.
- The reported 0.246 s is useful, but there is a lack of “apples to apples” comparisons with modern GPU implementations on the same hardware and with fixed paths × steps × particles parameters. For reference, LSMC on Tesla V100 reports 17 ms (1 million paths, 100 steps) and is up to 2.5× faster than the manually optimized CUDA version; this is a different algorithm and different hardware, but it sets the performance bar — standardized benchmarks and a configuration table should be added.
Performance metrics are now given in lines 335-343 of the new version.
- There is no detailed information about the stability and accuracy of the results when changing the algorithm parameters. Generalized error metrics (to the binomial/analytical reference price) on time grids, sensitivity to PSO hyperparameters, and repeatability with different seeds are needed. For the related GPU LSMC, it has been shown that the choice of basis functions can cause instability, which must be specifically addressed; it is desirable to highlight the risks for the PSO approach in a similar way.
Unlike LSMC, which is a regression-based approach known for matrix-singularity issues that make it prone to instability, PSO is an optimization approach and does not have the matrix-singularity issue.
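The singularity concern can be illustrated with a small sketch (ours, for illustration only): the normal-equations matrix of an LSMC-style polynomial basis regression becomes severely ill-conditioned as the basis degree grows, whereas PSO performs no linear solve at all.

```python
import numpy as np

# Mock cross-section of simulated prices at one exercise date
rng = np.random.default_rng(3)
S = rng.lognormal(np.log(100.0), 0.2, size=5000)

conds = []
for degree in (2, 5, 10):
    X = np.vander(S / S.mean(), degree + 1)  # monomial regression basis, rescaled
    conds.append(np.linalg.cond(X.T @ X))    # conditioning of the normal equations
    print(degree, f"{conds[-1]:.2e}")
```

The condition number explodes with the degree, which is the instability channel discussed for GPU LSMC; a basis-free optimizer such as PSO sidesteps it by construction.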
We removed the sensitivity section and leave it for future work. For scalability, we enriched Section 5.7 with a function-wise scalability analysis and tied it to Section 7.
- The paper does not address the impact of double precision on the accuracy of financial calculations.
Single precision is one limitation of GPUs. Some hardware platforms, such as NVIDIA chips with newer versions of the CUDA SDK, provide double precision, but at a significant performance cost; ultimately, the accuracy/performance trade-off is a design choice. On our research equipment, an Apple MacBook Pro with an M3 Max chip, the supported OpenCL 1.2 implementation provides single precision only. We rewrote and explain this in lines 325-334.
- There is no demonstration of the algorithm's performance on more complex types of options (e.g., multidimensional or path-dependent).
We have a newly re-written Conclusion section to address this issue.
Reviewer 5 Report
Comments and Suggestions for Authors
The critical analysis of the article “GPU-Accelerated PSO for High-Performance American Option Valuation” reveals an interesting proposal with relevant potential in the field of computational finance. However, the article presents notable weaknesses, especially concerning the methodological structure and the clarity of the scientific design. The following in-depth evaluation highlights several aspects that should be corrected before a new revision is considered.
The main research question addressed by the article is how to apply low-level optimization techniques (such as SIMD vectorization, kernel fusion, and memory coalescing) to accelerate the valuation of American options using Particle Swarm Optimization (PSO) on GPUs via OpenCL. Although this question is implicitly suggested in the abstract (lines 8–20), the authors never explicitly formulate it as a research question. I recommend presenting this formulation in the Introduction to make the research objective clearer and more evident.
While the research topic is relevant—considering the increasing use of GPUs in quantitative finance and the computational complexity of American option valuation—the originality lies more in the combination of well-known techniques than in any deep methodological or conceptual innovation. Most of the strategies applied (lines 97–131), such as kernel fusion and vectorization, are well-documented in the NVIDIA and AMD guides, correctly cited in the manuscript, but they are not novel. Therefore, the article only partially fills a research gap: it contributes more as a technical case study with superior results than as a conceptual advancement in the use of PSO in finance. I suggest revisiting and clarifying the actual contributions of the research.
The most significant issue lies in the methodology, which is neither presented in a structured way nor with the scientific rigor expected. The article lacks a section titled “Methodology” and does not provide a clear breakdown of the steps followed, as would be required in a replicable protocol. For instance, in lines 132–143, the authors begin describing the implementation, but they fail to present a methodological framework—such as a definition of variables, the number of experiment repetitions, performance evaluation metrics beyond runtime, or even an explicit definition of the fitness function used by PSO. This omission undermines transparency and reproducibility. I strongly recommend including a structured methodology section that formally describes the computational problem, input and output parameters, experimental design, and the rationale behind each decision. A figure illustrating all the macro-level stages of the research would greatly aid in understanding and tracking the proposed process.
Another critical issue is the lack of justification for the chosen pricing model. Although the authors state that PSO is directly used to price American options, the manuscript does not clearly explain how the option value is actually obtained (i.e., which stochastic model is simulated, whether Monte Carlo methods are used, or whether early exercise is handled via boundary approximation, etc.). This theoretical gap is evident in Section 2 (lines 51–75), which summarizes related work but fails to justify or detail the authors’ specific modeling approach.
The conclusion (lines 353–363) is consistent with the results presented but lacks a critical discussion of the study’s limitations. For example, the article does not mention the potential impact of overfitting when using PSO with many parameters, nor does it address the risk that performance gains may not translate into significant improvements in real financial environments due to integration complexity (although briefly touched on in Section 8). Additionally, the suggestion to generalize the results to other GPU architectures (line 322) is neither tested nor discussed in depth, which limits the external validity of the findings.
The figures and tables are well-organized and informative. Table 1 (line 144) clearly summarizes the performance gains, and Table 2 (line 216) is helpful for linking optimization techniques to specific kernel files. Figure 2 (line 232) offers valuable insights into scalability. However, the article would benefit from including more statistical information, such as standard deviations or variance analyses, to ensure the robustness of the reported execution times.
Moreover, the figures and tables should include the number of repetitions per measurement, the full hardware specifications (beyond just the GPU), and any notes on background processes that might have affected performance—details that are crucial for benchmarking experiments.
Regarding the references, the article is technically well-grounded and includes classic sources on PSO, OpenCL, and option pricing methods. However, most references are technical or application-oriented, with little engagement with broader theoretical literature on metaheuristic methods in finance or option pricing algorithms in general. I suggest including citations with a more comprehensive theoretical foundation and more recent academic reviews on PSO applications in financial modeling.
The formatting also appears to be inconsistent with the journal's template and should be revised.
Author Response
The main research question addressed by the article is how to apply low-level optimization techniques (such as SIMD vectorization, kernel fusion, and memory coalescing) to accelerate the valuation of American options using Particle Swarm Optimization (PSO) on GPUs via OpenCL. Although this question is implicitly suggested in the abstract (lines 8–20), the authors never explicitly formulate it as a research question. I recommend presenting this formulation in the Introduction to make the research objective clearer and more evident.
Both the ABSTRACT and especially the INTRODUCTION have been rewritten. Our objective and the contribution of the paper are now more clearly stated, which we hope addresses the referee's comment.
While the research topic is relevant—considering the increasing use of GPUs in quantitative finance and the computational complexity of American option valuation—the originality lies more in the combination of well-known techniques than in any deep methodological or conceptual innovation. Most of the strategies applied (lines 97–131), such as kernel fusion and vectorization, are well-documented in the NVIDIA and AMD guides, correctly cited in the manuscript, but they are not novel. Therefore, the article only partially fills a research gap: it contributes more as a technical case study with superior results than as a conceptual advancement in the use of PSO in finance. I suggest revisiting and clarifying the actual contributions of the research.
We rewrote this part and clarified our contributions in lines 67–87.
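For readers unfamiliar with the technique named above, kernel fusion is easy to illustrate. The sketch below is plain Python, not the paper's OpenCL code, and the names `two_pass`/`one_pass` are hypothetical; it only shows the idea of merging two passes over the data into one, which removes the intermediate buffer and the second launch:

```python
import math

# Unfused: two passes over the data with an intermediate list that is
# written out and read back (analogous to launching two separate GPU
# kernels with a buffer round-trip between them).
def two_pass(prices, strike, r, t):
    payoffs = [max(strike - s, 0.0) for s in prices]   # pass 1: put payoff
    disc = math.exp(-r * t)
    return [p * disc for p in payoffs]                 # pass 2: discounting

# Fused: one pass, no intermediate storage and no second launch
# (analogous to a single fused kernel).
def one_pass(prices, strike, r, t):
    disc = math.exp(-r * t)
    return [max(strike - s, 0.0) * disc for s in prices]
```

On a GPU the fused form also saves global-memory bandwidth, which is typically what such kernels are bound by.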
The most significant issue lies in the methodology, which is neither presented in a structured way nor with the scientific rigor expected. The article lacks a section titled “Methodology” and does not provide a clear breakdown of the steps followed, as would be required in a replicable protocol. For instance, in lines 132–143, the authors begin describing the implementation, but they fail to present a methodological framework—such as a definition of variables, the number of experiment repetitions, performance evaluation metrics beyond runtime, or even an explicit definition of the fitness function used by PSO. This omission undermines transparency and reproducibility. I strongly recommend including a structured methodology section that formally describes the computational problem, input and output parameters, experimental design, and the rationale behind each decision. A figure illustrating all the macro-level stages of the research would greatly aid in understanding and tracking the proposed process.
We rewrote Section 4 as "OpenCL Optimization Methodologies and Strategies". Section 5 now covers the definition of variables, stage-by-stage results, performance metrics, and reproducibility.
Another critical issue is the lack of justification for the chosen pricing model. Although the authors state that PSO is directly used to price American options, the manuscript does not clearly explain how the option value is actually obtained (i.e., which stochastic model is simulated, whether Monte Carlo methods are used, or whether early exercise is handled via boundary approximation, etc.). This theoretical gap is evident in Section 2 (lines 51–75), which summarizes but fails to justify or detail the authors’ specific modeling approach.
We added an Appendix to address this issue.
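To make the appendix's subject concrete, one common boundary-based formulation values an American put by simulating GBM paths and exercising the first time the price crosses a candidate early-exercise boundary. This is a hedged sketch of that general approach, not necessarily the authors' exact model; `american_put_mc` and its parameters are illustrative:

```python
import math
import random

def american_put_mc(boundary, S0, K, r, sigma, T, paths=5000, seed=7):
    """Monte Carlo value of an American put under GBM, exercising the
    first time the simulated price drops to the per-step boundary.
    For a put the boundary should lie below the strike K."""
    rng = random.Random(seed)
    steps = len(boundary)
    dt = T / steps
    drift = (r - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    total = 0.0
    for _ in range(paths):
        S = S0
        for i in range(steps):
            S *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            if S <= boundary[i]:                        # early exercise
                total += math.exp(-r * (i + 1) * dt) * (K - S)
                break
        else:                                           # held to maturity
            total += math.exp(-r * T) * max(K - S, 0.0)
    return total / paths

# Example: a flat boundary at 85 for a 1-year, 50-step at-the-money put.
price = american_put_mc([85.0] * 50, S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0)
```

The option value then becomes a function of the boundary parameters, which is exactly the kind of objective an optimizer such as PSO can search over.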
The conclusion (lines 353–363) is consistent with the results presented but lacks a critical discussion of the study’s limitations. For example, the article does not mention the potential impact of overfitting when using PSO with many parameters, nor does it address the risk that performance gains may not translate into significant improvements in real financial environments due to integration complexity (although briefly touched on in Section 8). Additionally, the suggestion to generalize the results to other GPU architectures (line 322) is neither tested nor discussed in depth, which limits the external validity of the findings.
We now address this issue in Section 8 (newly renamed "Limitations, Integration with Financial Workflows"). Using swarm intelligence to find the optimal exercise boundary does not pose an overfitting problem (since there is no prediction step) or other such risks, because we use it purely as an optimization tool, just like any other AI/ML tool. The easiest way to understand PSO is to view it as a smart grid search: instead of plowing through every node, PSO's particles "intelligently" pick the relevant nodes.
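The "smart grid search" view can be sketched in a few lines. This is a generic, minimal PSO (not the paper's GPU kernels; `pso_minimize` and its hyperparameters are illustrative defaults):

```python
import random

def pso_minimize(f, dim, lo, hi, particles=30, iters=200, seed=1,
                 w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO: each particle's velocity blends inertia, attraction
    to its own best-seen position, and attraction to the swarm's best."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(particles)]
    vel = [[0.0] * dim for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# The swarm homes in on the minimum of a toy objective (the sphere
# function) without visiting every grid node.
best, best_val = pso_minimize(lambda x: sum(v * v for v in x),
                              dim=2, lo=-5.0, hi=5.0)
```

In the pricing context, `f` would be the negative of the Monte Carlo option value as a function of the boundary parameters, so minimizing `f` maximizes the holder's value; each particle's fitness evaluation is what the GPU kernels parallelize.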
The figures and tables are well-organized and informative. Table 1 (line 144) clearly summarizes the performance gains, and Table 2 (line 216) is helpful for linking optimization techniques to specific kernel files. Figure 2 (line 232) offers valuable insights into scalability. However, the article would benefit from including more statistical information, such as standard deviations or variance analyses, to ensure the robustness of the reported execution times.
We added to Table 1 the standard deviation, in seconds, of each kernel's execution time.
Moreover, the figures and tables should include the number of repetitions per measurement, the full hardware specifications (beyond just the GPU), and any notes on background processes that might have affected performance—details that are crucial for benchmarking experiments.
We noted in the caption of Table 1 (line 350) that the results were obtained over 100 repetitions.
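A repetition loop of the sort these measurements imply can be sketched as follows; this is a generic benchmarking pattern, and `benchmark` is a hypothetical helper, not the paper's actual harness:

```python
import statistics
import time

def benchmark(fn, reps=100):
    """Run fn reps times and return (mean, sample standard deviation)
    of its wall-clock time in seconds -- the kind of statistics a
    benchmark table can report alongside each kernel's runtime."""
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)
```

For GPU work, `fn` would need to include a blocking synchronization (e.g. waiting on the OpenCL event or queue) so that the host timer actually captures the kernel's completion.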
Regarding the references, the article is technically well-grounded and includes classic sources on PSO, OpenCL, and option pricing methods. However, most references are technical or application-oriented, with little engagement with broader theoretical literature on metaheuristic methods in finance or option pricing algorithms in general. I suggest including citations with a more comprehensive theoretical foundation and more recent academic reviews on PSO applications in financial modeling.
An appendix is now provided that explains how American options can be priced with Monte Carlo methods, together with the relevant citations.
The formatting also appears to be inconsistent with the journal's template and should be revised.
Done. The new version now complies with the journal's template.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have addressed the provided comments. The paper can be accepted.
Author Response
Please see file attached.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed all the comments.
Author Response
Please see file attached.
Author Response File: Author Response.pdf
Reviewer 5 Report
Comments and Suggestions for Authors
I appreciate the revisions made to the manuscript, particularly the effort to rewrite the Abstract and Introduction for greater clarity, the inclusion of standard deviations and 100 repetitions in Table 1, and the addition of an Appendix reviewing Monte Carlo-based American option pricing. These changes improve transparency and presentation. Nevertheless, several of my earlier concerns remain insufficiently addressed. First, the research question continues to be only implied rather than explicitly stated in the Introduction, which weakens the framing of your study. Second, although Sections 4 and 5 now contain more implementation details, there is still no structured "Methodology" section that defines variables, inputs and outputs, experimental design, and a replicable framework—elements that are essential for reproducibility. Third, the justification for the chosen pricing model is now in the Appendix, but it does not provide a critical rationale for selecting PSO over alternative methods. Fourth, while Section 8 was expanded, the discussion of limitations is mostly defensive, focusing on dismissing overfitting risks rather than critically examining broader issues such as integration in financial workflows, generalization to different GPU architectures, and potential trade-offs in real-world implementation. Finally, although the references have been expanded, the literature review remains heavily technical and does not sufficiently engage with broader theoretical studies on metaheuristics in finance. In sum, while the manuscript is improved, it still requires a more explicit research framing, a properly structured methodology, a stronger theoretical justification of modeling choices, and a more critical discussion of limitations in order to reach the level of rigor expected.
Author Response
Please see file attached.
Author Response File: Author Response.pdf