1. Introduction
Multi-criteria decision-making (MCDM) plays a central role in modern information systems. It involves evaluating a large number of alternatives against potentially conflicting criteria [1]. Its applications now extend to finance [2], healthcare [3], engineering [4], supply chain management [5], environmental assessment [6], and several other fields. Among the many available MCDM methods, TOPSIS stands out for its simplicity, geometric interpretability, and robust rankings [7].
The classic sequential implementation of TOPSIS suffers from increasing computational cost as the number of alternatives reaches the order of a million [8]. Analysts are forced to artificially reduce the dataset size, risking exclusion of alternatives that could affect the final decision. More precisely, the primary computational bottlenecks of CPU-TOPSIS are: (i) the Euclidean distance calculations, whose O(m × n) cost dominates the overall runtime; (ii) normalization, which requires a full scan of every column; and (iii) the final O(m × log m) sort. All three run sequentially and are memory-bound in a CPU implementation, which directly motivates GPU parallelization. Most tools only support matrices of modest size [9], making them unsuitable for Big Data environments.
High-performance computing (HPC), and in particular the increasing use of graphics processing units (GPUs), offers a concrete solution to these scaling limitations. GPUs are based on massively parallel architectures, optimized for the execution of vectorized numerical operations [10]. The emergence of the CUDA platform and the development of high-level Python libraries such as CuPy [11], PyTorch [12], and TensorFlow [13] have considerably facilitated access to GPU programming without requiring low-level expertise.
In this article, we present GPU-TOPSIS, a complete and vectorized reformulation of TOPSIS specifically designed for GPU execution. This contribution follows a progressive approach: our previous work proposed a parallelized MCDM filter in shared memory via OpenMP [14] and then a distributed version of TOPSIS based on MapReduce [15]. GPU-TOPSIS takes a further step by exploiting the massive parallelism of modern GPUs, enabling the near real-time processing of decision matrices containing tens of millions of alternatives, without artificial data reduction.
Three variants were developed: CuPy-TOPSIS, PyTorch-TOPSIS, and TensorFlow-TOPSIS, each reformulating the TOPSIS steps as a tensor pipeline running on CUDA-compatible hardware. Unlike previous work on GPU-TOPSIS [16], which remained application-domain specific and was validated on modestly sized datasets, our framework is general-purpose and has been validated on up to 200 million alternatives derived from real-world e-commerce data, with explicit numerical stability analysis on three distinct GPU backends.
Experimental evaluations confirm significant performance gains, with speedups of up to 4.75× compared to the CPU-based NumPy [17] benchmark, while maintaining ranking consistency and numerical stability. A sensitivity analysis of criterion weight perturbations and sharding robustness tests complete the demonstration.
Unlike previous GPU-based implementations of TOPSIS, which often focus on specific application domains or limited datasets, this work proposes a general computational framework for large-scale multi-criteria decision-making based on GPU tensor pipelines. The proposed approach not only accelerates the classical TOPSIS workflow but also introduces a scalable computation model capable of handling decision matrices that exceed GPU memory capacity through a mathematically consistent fragmentation strategy.
In this context, we introduce GPU-TOPSIS, a fully vectorized GPU implementation of the TOPSIS method capable of processing decision matrices containing up to 200 million alternatives, thereby enabling large-scale multi-criteria decision-making on commodity GPU hardware. The original contributions of this work can be summarized as follows.
Summary of Key Contributions and Novelty
GPU-TOPSIS advances the state of the art in four dimensions of novelty: (1) it is the first TOPSIS implementation validated on matrices containing up to 200 million alternatives from real e-commerce data; (2) it introduces a mathematically proven two-pass fragmentation algorithm that guarantees exact ranking equivalence regardless of the fragmentation scheme; (3) it is the only framework offering cross-backend numerical consistency analysis across three GPU ecosystems (CuPy, PyTorch, TensorFlow); and (4) it provides a comprehensive experimental evaluation achieving a speedup of up to 4.75× compared to the CPU-NumPy reference, a cross-backend mean deviation of less than 5 × 10⁻⁸, and a Kendall τ = 1.0 for the monolithic formulation, confirming perfect ranking consistency. These novel aspects are supported by four original contributions:
First, GPU-TOPSIS provides a fully vectorized reformulation of the TOPSIS method in which each step of the decision pipeline—normalization, weighting, calculation of ideal solutions, calculation of Euclidean distances, and calculation of proximity scores—is expressed as tensor operations executed directly in GPU memory, while strictly preserving the original mathematical formulation.
Second, a two-pass fragment processing algorithm is introduced. Property 1 formally demonstrates that this formulation is a lossless generalization of monolithic GPU-TOPSIS: when k = 1, it reduces exactly to the standard formulation, while for k > 1, it guarantees identical rankings while reducing the memory footprint from O(m × n) to O(mt × n), where mt < m. This design enables the processing of decision matrices whose size exceeds both the GPU VRAM and the host machine RAM.
Third, three independent implementations based on CuPy, PyTorch, and TensorFlow are developed, ensuring portability and interoperability within the Python GPU ecosystem.
Fourth, a comprehensive experimental evaluation is conducted on real-world data from the Amazon Products 2023 dataset [18], covering matrices ranging from several million to 200 million alternatives. The evaluation includes perturbation sensitivity analyses of criterion weights, inter-backend numerical consistency tests, and partitioning robustness analyses, demonstrating that GPU acceleration fully preserves numerical stability and decision reliability.
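To make the vectorized reformulation concrete, the five steps named in the first contribution can be sketched with NumPy on the CPU. This is a minimal illustration under stated assumptions, not the published implementation: the function name `topsis_scores`, the exact placement of the stabilization constant `eps`, and the toy matrix are introduced here for illustration only. Because CuPy mirrors much of the NumPy API, a similar expression is expected to carry over to GPU arrays with few changes.

```python
import numpy as np

def topsis_scores(X, weights, is_benefit, eps=1e-12):
    """Return TOPSIS closeness scores for an (m, n) decision matrix.

    Hypothetical helper: where eps is added is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=np.float64)
    w = np.asarray(weights, dtype=np.float64)
    benefit = np.asarray(is_benefit, dtype=bool)

    # Step 1: vector normalization, one Euclidean norm per criterion column.
    norms = np.sqrt((X ** 2).sum(axis=0)) + eps
    # Step 2: weighting of the normalized matrix.
    V = (X / norms) * w
    # Step 3: ideal solutions. Benefit criteria take the column maximum for
    # the positive ideal, cost criteria take the column minimum.
    pis = np.where(benefit, V.max(axis=0), V.min(axis=0))
    nis = np.where(benefit, V.min(axis=0), V.max(axis=0))
    # Step 4: Euclidean distance of every alternative to both ideals.
    d_pos = np.sqrt(((V - pis) ** 2).sum(axis=1))
    d_neg = np.sqrt(((V - nis) ** 2).sum(axis=1))
    # Step 5: proximity score in [0, 1]; higher is better.
    return d_neg / (d_pos + d_neg + eps)

# Toy example: column 0 is a benefit criterion, column 1 a cost criterion.
X_toy = np.array([[5.0, 10.0],   # best on both criteria
                  [1.0, 50.0],   # worst on both criteria
                  [3.0, 30.0]])  # in between
scores_toy = topsis_scores(X_toy, [0.5, 0.5], [True, False])
ranking_toy = np.argsort(-scores_toy).tolist()
```

Every step is a whole-array operation with no Python loop over alternatives, which is the property that lets the same pipeline map onto GPU tensor kernels.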
The remainder of this article is structured as follows. Section 2 presents the context and motivations, including a review of related work on parallel MCDM and the classical TOPSIS method. Section 3 describes the GPU-TOPSIS reformulation and its parallelization principles. Section 4 details the three GPU implementations. Section 5 presents the experimental results and robustness analyses. Finally, Section 6 concludes and outlines future research directions.
4. Experimental Evaluation
4.1. Dataset: Amazon Products 2023
The Amazon Products 2023 dataset is the primary experimental dataset for this work. It is a large, publicly available collection, aggregated from Amazon platform catalogs and distributed by product category [18]. Its tabular structure, rich attributes, and sheer volume (tens of millions of products across 32 categories) make it a natural and demanding testing ground for large-scale multi-criteria decision-making methods. In the context of e-commerce decision support, each product represents an alternative to be evaluated, and the goal is to identify the best-performing alternatives based on a set of criteria reflecting perceived quality, popularity, price positioning, and the reliability of ratings.
Construction of the decision matrix. The decision matrix M ∈ ℝ^(m×6) is constructed by associating each product (alternative Ai) with a vector of six quantitative criteria derived from the available metadata. Price (C3) and the standard deviation of the ratings (C6) are treated as cost criteria; the other four are benefit criteria. This moderate number of criteria (n = 6) is commonly adopted in MCDM studies to preserve the interpretability of the rankings while providing a sufficiently rich decision structure.
Table 5 formally describes each criterion, its calculation method, its TOPSIS type, and the decision rationale that justifies its inclusion.
Matrix cleaning comprises three operations: (i) clipping the utility rate to [0, 1] to correct Amazon counter artifacts (affecting 2835 rows in the Books category); (ii) removing price outliers beyond the 99th percentile; and (iii) deduplication by retaining the most recent row by parent_asin identifier. After preprocessing, the reference matrix contains 14,652,525 alternatives (multi-category configuration, ALL) and 3,383,435 alternatives (single-category configuration, Books). The reference weight vector is set to W = [1/6, …, 1/6] corresponding to equal weights assigned to each of the six criteria, in accordance with standard practice in MCDM method validation studies, where the goal is to evaluate the computational framework independently of any application weighting bias. In real-world deployments, criterion weights can be determined by: (i) expert elicitation via AHP pairwise comparisons; (ii) entropy-based objective weighting; or (iii) stakeholder-defined preference vectors. GPU-TOPSIS accepts any normalized weight vector W satisfying ∑wj = 1, wj ≥ 0.
4.2. Experimental Environment
All experiments were conducted on the Google Colab Pro platform to ensure reproducibility and accessibility. The hardware and software configuration was as follows: 2.7/51.0 GB of RAM; NVIDIA Tesla T4 GPU (16 GB of VRAM, CUDA 12.8); software stack Python 3.9, NumPy 1.23.5, CuPy 12.3, PyTorch 2.4.1, and TensorFlow 2.16.1. Google Colab Pro was deliberately chosen to maximize reproducibility: the experimental environment is accessible to any researcher without dedicated hardware, and the source code, dataset references, and notebook configurations are published in the open repository. The dynamic GPU allocation inherent in Colab is mitigated by the five-replicate measurement protocol and the calculation of standard deviations.
Each execution time reported in this article is the arithmetic mean of five independent executions, accompanied by the standard deviation (±) and the 95% confidence interval (95% CI) calculated using Student’s t-distribution. This repeated measurement methodology represents a substantial improvement over the single measurements typically reported in the literature and supports the statistical validity of the performance comparisons. The CPU comparison uses sequential NumPy as the single reference. This choice is justified by the fact that CPU parallelization in the Colab environment is limited to 2 vCPUs, so the gain from a multi-threaded CPU baseline would remain negligible compared with the thousands of GPU cores involved.
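The measurement protocol above (five runs, standard deviation, Student-t confidence interval) can be sketched with the Python standard library. The function name `summarize_runs` and the sample timings are hypothetical; the critical value 2.776 is the two-sided 95% Student-t quantile for 4 degrees of freedom, hardcoded here so that the sketch needs no external dependency.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for n - 1 = 4 degrees of freedom.
T_CRIT_95_DF4 = 2.776

def summarize_runs(times_s):
    """Return (mean, sample std, 95% CI) for exactly five timing runs."""
    if len(times_s) != 5:
        raise ValueError("this sketch hardcodes the t quantile for five runs")
    mean = statistics.mean(times_s)
    std = statistics.stdev(times_s)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(len(times_s))
    return mean, std, (mean - half_width, mean + half_width)

# Hypothetical timings in seconds, for illustration only.
mean_s, std_s, ci_95 = summarize_runs([1.0, 1.1, 0.9, 1.0, 1.0])
```

With only five replicates the t quantile is noticeably larger than the normal-approximation value of 1.96, which is why the Student-t interval is the appropriate choice here.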
Finally, the numerical stabilization constant ε = 10⁻¹² is applied uniformly in all GPU implementations.
4.3. EXP-1. CPU vs. GPU Scalability
Objective: To measure GPU-TOPSIS acceleration factors relative to the CPU reference across a wide range of matrix sizes, from 50,000 to 14.65 million alternatives, with seven measurement points enabling the plotting of a complete scalability curve. The matrices used are randomly drawn subsets of the actual Amazon Products 2023 set; only sizes exceeding the available data use synthetic data calibrated against observed statistics.
Results and analysis. Several observations emerge from Table 6. At small scales (m ≤ 100,000), GPU acceleration is limited, and TensorFlow is even slower than the CPU at 50,000 alternatives (0.65×) because of the overhead of launching CUDA kernels. PyTorch and CuPy are notable exceptions at this scale, showing 2.34× and 1.49× respectively, which reflects the lightweight nature of their execution pipelines and is consistent with the complexity analysis (Equation (11)). From 500,000 alternatives onward, all three GPU backends consistently outperform the CPU benchmark. The best overall speedup is achieved by PyTorch at 3.38 million alternatives (4.75×). The marked decrease observed at 14.65 M (PyTorch drops to 1.80×, compared to 2.78× for CuPy and 2.43× for TensorFlow) is explained by the progressive saturation of VRAM bandwidth and the increasing overhead of CPU-to-GPU transfers. This phenomenon, intrinsic to the GPU architecture, is particularly pronounced for PyTorch at very large scales, where CuPy gains the advantage thanks to memory management closer to the native NumPy model. However, it does not invalidate the overall scalability of the proposed framework.
Figure 2a shows the log-log execution times for the four implementations as a function of the number of alternatives. All curves follow near-linear growth, confirming the O(m) complexity of the framework. PyTorch shows the lowest times over most of the range, while the CPU is consistently the slowest at large scales. Figure 2b illustrates the GPU speedup relative to the CPU. PyTorch clearly dominates, reaching a peak of 4.75× around 3.4 million alternatives before dropping to 1.80× at 14.65 M, a sign of GPU memory saturation. CuPy and TensorFlow show a more modest and stable progression, with CuPy taking the lead over PyTorch beyond 10 million alternatives. TensorFlow starts below the CPU baseline (<1×) at small scales, illustrating a higher launch overhead. The three backends converge towards similar behavior at very large scale, highlighting the common limits of VRAM bandwidth.
4.4. EXP-2. Inter-Backend Numerical Consistency
Objective: To verify that the three GPU implementations produce Si proximity scores and numerically consistent rankings between themselves and with the CPU reference, despite differences in arithmetic precision and optimization strategies specific to each backend.
Methodology: Consistency is assessed on the ALL matrix (14.65 M alternatives) with reference weights. The metrics retained are: the maximum and mean difference on the Si scores (Max ΔSi, Mean ΔSi), the ranking overlap rate (Overlap@K) for K ∈ {5, 10, 50, 100}, the Kendall concordance coefficient τ on the Top-10 and Top-100, and the Spearman coefficient ρ on the Top-100.
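The consistency metrics listed above can be sketched as follows. The exact definitions used in the experiments are not spelled out in the text, so this sketch makes two assumptions: Overlap@K is the fraction of shared items between the two Top-K sets, and Kendall τ@K is the tau-a statistic computed over the items of the reference Top-K. The helper names are hypothetical, and the quadratic pair loop is only suitable for small K such as 10 or 100.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest scores, best first (stable tie-breaking)."""
    return np.argsort(-np.asarray(scores), kind="stable")[:k]

def max_mean_delta(scores_a, scores_b):
    """Maximum and mean absolute score difference between two backends."""
    delta = np.abs(np.asarray(scores_a) - np.asarray(scores_b))
    return float(delta.max()), float(delta.mean())

def overlap_at_k(scores_a, scores_b, k):
    """Fraction of alternatives shared by the two Top-K sets."""
    a = set(top_k(scores_a, k).tolist())
    b = set(top_k(scores_b, k).tolist())
    return len(a & b) / k

def kendall_tau_at_k(scores_ref, scores_other, k):
    """Kendall tau-a over the items of the reference Top-K (O(k^2) pairs)."""
    idx = top_k(scores_ref, k)
    x = np.asarray(scores_ref)[idx]
    y = np.asarray(scores_other)[idx]
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            # +1 for a concordant pair, -1 for a discordant pair.
            total += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return float(total) / (k * (k - 1) / 2)

# Toy scores: the second vector swaps the two best alternatives.
ref = [0.9, 0.8, 0.7, 0.6, 0.1]
swapped = [0.8, 0.9, 0.7, 0.6, 0.1]
```

On this toy input the Top-2 sets coincide while the Top-1 sets do not, which illustrates why the overlap and the rank correlation are reported at several values of K.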
Results and analysis. The results in Table 7 demonstrate near-perfect numerical consistency across all backends. The maximum differences in Si scores between pairs of backends consistently remain below 5 × 10⁻⁸, several orders of magnitude below any practical decision threshold, with even smaller average differences, on the order of 10⁻⁹ to 10⁻⁸. The overlap is 100% at all tested thresholds (Top-5, Top-10, Top-50, Top-100) for all six pairs, meaning that the resulting rankings are virtually identical regardless of the implementation used. Kendall’s coefficient τ and Spearman’s rank correlation coefficient ρ reach their maximum value of 1.0 in all cases, with highly significant p-values (p < 10⁻⁶ for τ@10, p < 10⁻¹⁵⁰ for τ@100 and ρ@100), ruling out any hypothesis of fortuitous agreement. These results unambiguously establish that the low-level differences between CuPy, PyTorch, and TensorFlow (parallel reduction strategies, floating-point rounding behaviors) have no measurable impact on the final decision, thus validating the complete functional interchangeability of the three backends within the GPU-TOPSIS framework.
As shown in Figure 3, the two heatmaps visually summarize the inter-backend consistency. Panel (a) confirms that the maximum differences in Si scores are all between 3.7 × 10⁻⁸ and 4.5 × 10⁻⁸, with the diagonal naturally being zero (comparison of a backend with itself). Panel (b) displays a uniform and saturated green across all off-diagonal cells, indicating a 100% Overlap@10 without exception. Together, these two panels provide an immediate and unambiguous reading of the numerical robustness of GPU-TOPSIS: the four backends are functionally interchangeable.
4.5. EXP-3. Top-N Rankings Compared
Objective: To compare the Top-N rankings produced by the four implementations on the ALL matrix to confirm the decision invariance of GPU-TOPSIS at a large scale, and to provide the first recommended alternatives with their real attributes from the Amazon dataset.
Results and analysis: Table 8 presents the Top-10 products identified by GPU-TOPSIS on the Amazon ALL dataset (14.65 M alternatives, equal weights). All four backends produce identical rankings and scores that agree to within 10⁻⁶ (Overlap@10 = 100%, Max ΔSi < 10⁻⁶), confirming the decision invariance of the framework at this scale. The dominant alternative (B07TVHSDMQ, Si = 0.9986) stands markedly apart from the second-ranked product (Si = 0.378), reflecting an exceptional cumulative profile across all six criteria simultaneously, notably the highest review count in the dataset (314,691), a strong average rating (4.362), and a moderate price ($17.99). This score gap is characteristic of a structurally dominant alternative in the TOPSIS sense, i.e., one that is simultaneously close to the Positive Ideal Solution on all criteria. From rank 2 onward, scores decrease more gradually, indicating a competitive cluster of alternatives with similar multi-criteria profiles.
4.6. EXP-4. Stress-Test Sharding (~88 Million Alternatives)
Objective: To evaluate the computational scalability, memory robustness, and numerical stability of the 2-pass GPU-TOPSIS formulation beyond the joint limits of the host system’s VRAM and RAM, up to approximately 88 million alternatives.
Protocol: In the absence of additional publicly available MCDM data beyond the 14.65 million alternatives constituting the entirety of the available Amazon Products 2023 dataset, the experiment is conducted by replicating the real matrix into k = 6 fragments. Each replica (fragments 1 to 5) is subjected to a centered Gaussian multiplicative noise of 2% (σ = 0.02) to simulate plausible inter-domain variability and avoid perfect duplicates. Fragment 0 consists of unaltered real data. The noise level used is sufficiently low to preserve the marginal distributions of each criterion: the Kolmogorov–Smirnov test between the original fragment and the noisy fragments consistently yields p > 0.05 (see Table 9, KS p-val column), confirming the statistical homogeneity between fragments. To more precisely quantify the distributional similarity between the original and synthetic data, the Jensen-Shannon divergence (JSD) was calculated between the original fragment and each of the noisy fragments for all six criteria. All JSD values obtained remained below 0.001 (on a scale of [0, 1]), confirming a negligible distributional shift. Furthermore, the mean and standard deviation of each criterion in the noisy fragments did not deviate by more than 0.3% from those of the original fragment, thus validating the statistical representativeness of the generated synthetic data. In accordance with Property 1, the 2-pass formulation guarantees that the rankings produced for any k ≥ 1 are mathematically identical to those of the monolithic treatment.
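The replication protocol can be illustrated with a short NumPy sketch. It assumes that "centered Gaussian multiplicative noise" means multiplying each entry by 1 + ε with ε drawn from N(0, σ²); the function name `make_noisy_fragments`, the seed, and the stand-in matrix are hypothetical, and any re-clipping to valid criterion ranges that the experiments may apply is omitted.

```python
import numpy as np

def make_noisy_fragments(X, k=6, sigma=0.02, seed=0):
    """Fragment 0 is X unchanged; fragments 1..k-1 are noisy replicas of X."""
    rng = np.random.default_rng(seed)
    fragments = [X]
    for _ in range(1, k):
        # Centered multiplicative noise: each entry is scaled by (1 + eps).
        eps = rng.normal(loc=0.0, scale=sigma, size=X.shape)
        fragments.append(X * (1.0 + eps))
    return fragments

# Stand-in data (not the Amazon matrix): 20,000 alternatives, 6 criteria.
X_demo = np.random.default_rng(42).uniform(1.0, 5.0, size=(20_000, 6))
frags = make_noisy_fragments(X_demo)

# Relative shift of each criterion mean in the first noisy replica.
rel_mean_shift = np.abs(frags[1].mean(axis=0) - X_demo.mean(axis=0)) / X_demo.mean(axis=0)
```

Because the noise is centered, the column means of a replica stay very close to the originals, which is the kind of check summarized by the 0.3% bound quoted above.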
The memory columns are defined as follows:
vram_baseline_mb: GPU memory used before the calculation starts (idle state). It corresponds to the CUDA context, loaded libraries, and already allocated tensors. This is the reference point.
vram_peak_abs_mb: Peak GPU memory in absolute value reached during execution. This is the total amount of VRAM used at its maximum, including all allocations (baseline + current calculation).
vram_peak_delta_mb: Relative increase in VRAM compared to the baseline, i.e., the memory actually consumed by the computation itself:
vram_peak_delta_mb = vram_peak_abs_mb − vram_baseline_mb
Results and analysis. According to Table 10, the growth in execution time is essentially linear in k for the three backends: PyTorch goes from 0.445 s (k = 1, 14.5 M alternatives) to 2.548 s (k = 6, 87.0 M alternatives), a factor of 5.73×, very close to the theoretical factor of 6×. CuPy and TensorFlow exhibit the same linearity, with respective factors of 6.12× and 6.13×, confirming the absence of any algorithmic degradation related to fragmentation. The VRAM consumption per individual fragment, obtained by dividing vram_peak_delta_mb by the number of fragments k, remains remarkably stable: approximately 1668 MB for PyTorch and CuPy, and approximately 2057 MB for TensorFlow, regardless of the total problem size. This result empirically validates the O(mt × n) memory footprint established by Corollary 1. The baseline VRAM also remains constant per backend (2720 MB for PyTorch/CuPy, 2618 MB for TensorFlow), reflecting only the fixed cost of the CUDA context. PyTorch stands out as the fastest backend in all configurations, with CuPy exhibiting an overhead of approximately 1.8× and TensorFlow approximately 2.1×, the latter also showing a consistently higher memory consumption of ~22% per fragment. No numerical instability or interruption is observed, even at 87.0 million alternatives on a GPU with 16 GB of VRAM.
4.7. EXP-5. Sensitivity Analysis to Criteria Weights
Objective: To evaluate the robustness of the TOPSIS ranking in the face of uncertainties or changes in preference on the weights of the criteria, by simulating realistic scenarios of revision of decision priorities.
Methodology: Fifty alternative weighting scenarios are generated by controlled multiplicative perturbations (±5%, ±10%, ±15%, ±20%) on the reference weights wj = 1/6, followed by renormalization to maintain ∑wj = 1. For each scenario, the ranking is recalculated with PyTorch as the reference backend on the ALL matrix and compared to the reference ranking via Overlap@K for K ∈ {5, 10, 100} and Kendall τ for K ∈ {10, 100}.
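The scenario generation can be sketched as below. The text does not specify how the perturbation factors are drawn, so this sketch assumes independent uniform factors in [1 − δ, 1 + δ] per criterion; the function name `perturbed_weights` and the seed are hypothetical.

```python
import numpy as np

def perturbed_weights(base, amplitude, n_scenarios, seed=0):
    """Multiplicatively perturb a weight vector, then renormalize each row."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base, dtype=np.float64)
    # One factor per scenario and criterion, assumed uniform in [1 - a, 1 + a].
    factors = 1.0 + rng.uniform(-amplitude, amplitude, size=(n_scenarios, base.size))
    w = base * factors
    # Renormalization restores sum(w_j) = 1 for every scenario.
    return w / w.sum(axis=1, keepdims=True)

# Fifty scenarios at the ±10% level around equal weights w_j = 1/6.
W_scenarios = perturbed_weights(np.full(6, 1 / 6), amplitude=0.10, n_scenarios=50)
```

Each row of `W_scenarios` can then be passed to the scoring routine, and the resulting ranking compared with the reference ranking through Overlap@K and Kendall τ.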
Results and analysis. The results reveal the absolute robustness of the TOPSIS ranking to perturbations in the weights of the criteria. For all four levels of perturbation tested (±5% to ±20%), the Overlap@5, Overlap@10, and Overlap@100 indicators uniformly reach 100.0 ± 0.0%, meaning that neither the top five, ten, nor one hundred alternatives undergo any change in composition or order, regardless of the preference reassessment scenario. The Kendall coefficients τ@10 and τ@100 are equal to 1.0 for all levels, confirming perfect rank concordance between the nominal ranking and the perturbed rankings. Only the Max ΔSi increases proportionally to the perturbation amplitude—from 0.00019 at ±5% to 0.00077 at ±20%—reflecting slight variations in absolute proximity scores without ever inducing a rank reversal. This result demonstrates that the GPU-TOPSIS ranking on the Amazon ALL dataset (14.65 M alternatives) is structurally stable across the entire decision range, even under substantial revisions of preferences, thus strengthening the system’s decision reliability at very large scales.
Figure 4 confirms the results of Table 10: regardless of the magnitude of the perturbation of the weights (±5% to ±20%), the Overlap@10 and the Overlap@100 remain fixed at 100%, and the Kendall τ at 1.0, graphically illustrating the total insensitivity of the GPU-TOPSIS ranking to preference revisions, both in the critical decision zone and over the entire Top-100.
4.8. EXP-6. Comparison of GPU-TOPSIS 1-Pass vs. 2-Pass Rankings
Objective: To formally establish the equivalence of the rankings produced by the GPU-TOPSIS 1-pass formulation (monolithic processing) and by the GPU-TOPSIS 2-pass formulation with sharding, over a spectrum of increasing sizes ranging from 14.65 million to 87.9 million alternatives, and to quantify the additional time cost associated with the mathematical correctness guaranteed by Property 1.
Methodology: Four configurations are evaluated (k ∈ {1, 2, 3, 6} fragments), constructed by noise-free replication (σ = 0) of the Amazon ALL matrix to ensure strict algebraic identity between the data processed by the two formulations. The 1-pass formulation applies topsis_pytorch to the complete concatenated matrix in a single GPU pass, while the 2-pass formulation applies topsis_2pass with global calculation of norms, PIS, and NIS in Pass 1 (CPU) and then calculation of distances and scores per fragment in Pass 2 (GPU). The metrics used are: Overlap@K for K ∈ {10, 50, 100, 500, 1000}, Kendall τ@K and Spearman ρ@K with p-values, Max ΔSi and Mean ΔSi on the raw scores, as well as the ratio of execution times t(2-pass)/t(1-pass). Each measurement is averaged over Nruns = 5 repetitions.
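The two formulations compared in this experiment can be sketched in NumPy. This is an illustration of the idea rather than the `topsis_pytorch` and `topsis_2pass` routines named above: the function names, the float64 CPU arithmetic, and the random stand-in data are assumptions of this sketch. Pass 1 accumulates the column sums of squares, minima, and maxima over all fragments; since the per-column scale factor wj / norm_j is non-negative, the global ideal solutions follow directly from the global extrema.

```python
import numpy as np

def topsis_monolithic(X, w, benefit, eps=1e-12):
    """1-pass reference: all statistics computed on the full matrix."""
    norms = np.sqrt((X ** 2).sum(axis=0)) + eps
    V = X / norms * w
    pis = np.where(benefit, V.max(axis=0), V.min(axis=0))
    nis = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.sqrt(((V - pis) ** 2).sum(axis=1))
    d_neg = np.sqrt(((V - nis) ** 2).sum(axis=1))
    return d_neg / (d_pos + d_neg + eps)

def topsis_two_pass_sketch(fragments, w, benefit, eps=1e-12):
    """2-pass sketch: global statistics first, then per-fragment scoring."""
    n = fragments[0].shape[1]
    sum_sq = np.zeros(n)
    col_min = np.full(n, np.inf)
    col_max = np.full(n, -np.inf)
    # Pass 1: accumulate global column statistics across fragments.
    for F in fragments:
        sum_sq += (F ** 2).sum(axis=0)
        col_min = np.minimum(col_min, F.min(axis=0))
        col_max = np.maximum(col_max, F.max(axis=0))
    scale = w / (np.sqrt(sum_sq) + eps)  # non-negative per-column factor
    pis = np.where(benefit, col_max, col_min) * scale
    nis = np.where(benefit, col_min, col_max) * scale
    # Pass 2: score each fragment against the global ideal solutions.
    scores = []
    for F in fragments:
        V = F * scale
        d_pos = np.sqrt(((V - pis) ** 2).sum(axis=1))
        d_neg = np.sqrt(((V - nis) ** 2).sum(axis=1))
        scores.append(d_neg / (d_pos + d_neg + eps))
    return np.concatenate(scores)

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(600, 6))                     # stand-in data
w = np.full(6, 1 / 6)
benefit = np.array([True, True, False, True, True, False])   # C3, C6 are costs
one_pass = topsis_monolithic(X, w, benefit)
two_pass = topsis_two_pass_sketch(np.array_split(X, 3), w, benefit)
max_abs_diff = float(np.max(np.abs(one_pass - two_pass)))
```

In float64 the two score vectors agree up to rounding, consistent with the algebraic equivalence. Only one fragment needs to be resident at a time during each pass, which is where the reduced memory footprint comes from.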
Results and analysis: Figure 5 empirically characterizes the finite-precision behavior announced in the Note of Property 1: while algebraic equivalence holds in exact arithmetic, non-associative float32 parallel reductions induce localized rank inversions as k increases, whose practical impact is quantified below.
At the global level, Spearman ρ@K = 1.0 for all K and all configurations, confirming perfect overall rank agreement. Locally, however, Kendall τ@10 degrades to ~−0.4 at k = 6 while τ@1000 remains near 1.0, and Overlap@10 falls to ~80% while Overlap@1000 stays at 100%. This asymmetry indicates that divergences are confined to near-tied alternatives at the very top of the rankings, precisely where float32 rounding is most consequential, and become imperceptible over larger windows. The Mean ΔSi ~ 10⁻⁴ across all configurations confirms that large Max ΔSi values are concentrated on a negligibly small subset of alternatives. On the computational side, the ratio t(2-pass)/t(1-pass) decreases from ~2.0 at k = 1 to below 1.0 at k = 6, as growing VRAM pressure increasingly penalizes the monolithic formulation.
A systematic analysis of the conditions under which these deviations occur deterministically was conducted. Three main factors were identified. First, the number of fragments k: ranking reversals grow approximately proportionally to k, as rounding errors accumulate with each independent reduction operation. Second, the density of scores in the vicinity of the Top-K boundary: reversals occur exclusively between alternatives whose proximity scores satisfy |Si − Sj| ≲ 10⁻⁴, i.e., within the range of accumulated float32 rounding error at this scale. Third, the size of the fragments mt: larger fragments reduce the number of reduction operations and, consequently, the accumulation of rounding errors.
Practically speaking, the following threshold can be formulated: for k ≤ 2 and K ≥ 50, no measurable reversal is observed. Conversely, for k ≥ 4 with K ≤ 10, the use of float64 precision or the monolithic 1-pass formulation is recommended when strict Top-K invariance is required.
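The non-associativity of float32 addition that underlies these inversions can be shown with a small, deterministic NumPy example. The values are chosen for illustration only; the second half compares a whole-array reduction with a fragment-wise reduction of the same data, where the two results may differ in their last bits.

```python
import numpy as np

# Three float32 values whose sum depends on the grouping of the additions.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
left = (a + b) + c    # a + b is exactly 0, so the result is 1
right = a + (b + c)   # b + c rounds back to -1e8 in float32, so the result is 0

# Same effect at the scale of a reduction: summing a float32 array as a whole
# versus summing six fragments and then combining the partial sums.
x = np.random.default_rng(0).random(1_000_000, dtype=np.float32)
whole = x.sum()
chunked = np.float32(0.0)
for chunk in np.array_split(x, 6):
    chunked = chunked + chunk.sum()
rel_gap = abs(float(whole) - float(chunked)) / float(whole)
```

The spacing between adjacent float32 values near 10⁸ is 8, so adding 1 is lost entirely in `right`. Repeating the computation in float64 makes both groupings return 1, which matches the recommendation to switch precision when strict Top-K invariance is required.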
These results establish a clear operational boundary: the 2-pass formulation is decision-equivalent to the 1-pass formulation for large evaluation windows, and computationally advantageous at very large scales. For applications requiring strict Top-K invariance at high k, the 1-pass formulation is recommended when VRAM permits; otherwise, float64 precision eliminates the observed inversions at the cost of a 2× memory overhead.
Consequently, the 2-pass formulation is recommended for applications where the priority is large-scale global ranking and GPU resource management. When k is high and decisions rest exclusively on the first few alternatives, special attention should be paid to verifying the restricted Top-K ranking.
4.9. EXP-7. Scalability Beyond 88 Million Alternatives (100 M, 150 M, 200 M)
Objective: To evaluate the execution times, VRAM consumption, and processing throughput of the 2-pass GPU-TOPSIS formulation for matrix sizes exceeding the limits of the available real dataset, namely 100, 150, and 200 million alternatives, to project the scalability of the framework to orders of magnitude not yet covered experimentally in the MCDM literature.
Protocol: In the absence of publicly available MCDM data at these scales, synthetic matrices are generated by Gaussian sampling calibrated to the empirical statistics of the Amazon ALL matrix (μj, σj per criterion), strictly adhering to business bounds after truncation. This approach, already validated in EXP-4 for intermediate sizes, ensures that the synthetic distributions are statistically representative of the real data. Adaptive sharding is applied, maintaining the size of each fragment mt ≤ 14.65 M (T4 VRAM constraint of 16 GB): k = 7 fragments for 100 M, k = 11 for 150 M, and k = 14 for 200 M. Each measurement is averaged over Nruns = 3 repetitions, preceded by an unmeasured warmup.
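The fragment counts quoted above follow from a ceiling division of the number of alternatives by the maximum fragment size. The sketch below reproduces that arithmetic; the helper names are hypothetical, and the layout in which every fragment but the last is filled to the maximum size is an assumption, since the text only states the bound mt ≤ 14.65 M.

```python
MAX_FRAGMENT_ROWS = 14_650_000  # m_t bound quoted for the 16 GB T4

def fragment_count(m, max_rows=MAX_FRAGMENT_ROWS):
    """Smallest k such that k fragments of at most max_rows cover m rows."""
    return -(-m // max_rows)  # integer ceiling division

def fragment_bounds(m, max_rows=MAX_FRAGMENT_ROWS):
    """Half-open (start, stop) row ranges, filling fragments greedily."""
    return [(start, min(start + max_rows, m)) for start in range(0, m, max_rows)]

counts = {m: fragment_count(m) for m in (100_000_000, 150_000_000, 200_000_000)}
bounds_200m = fragment_bounds(200_000_000)
```

For the three targets this gives 7, 11, and 14 fragments, matching the values used in the protocol.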
Results and analysis. Table 11 confirms the near-linear scalability of 2-pass GPU-TOPSIS beyond the 88 million alternatives threshold. PyTorch maintains the best processing time for all three targets: 28.96 s for 100 M, 44.07 s for 150 M, and 60.17 s for 200 M, representing ratios of 1.52× and 2.08× relative to the 100 M case, consistent with the expected theoretical progression in O(k × mt × n). CuPy and TensorFlow follow the same linear trend, with slightly higher times (30.48 s/47.42 s/64.37 s and 32.14 s/49.80 s/70.24 s, respectively). The processing throughput remains remarkably stable between 100 M and 200 M: PyTorch drops from 3.45 to 3.32 M/s, and CuPy from 3.28 to 3.11 M/s, demonstrating a near-total absence of algorithmic degradation. TensorFlow shows a slightly more pronounced decrease (3.11 → 2.85 M/s), without compromising overall scalability. The very low standard deviations (≤0.72 s) confirm the stability and reproducibility of the executions. These results establish that GPU-TOPSIS is the first operational MCDM framework at the scale of the hundreds of millions of alternatives on consumer GPU hardware.
Figure 6 illustrates the behavior of GPU-TOPSIS 2-pass processing at very large scales (100 M to 200 M alternatives) on calibrated synthetic data. Execution time increases almost linearly for all three backends, with PyTorch remaining the fastest (≈28 s at 100 M, ≈59 s at 200 M), followed by CuPy and TensorFlow within a narrow margin. Processing throughput decreases slightly with size, from ≈3.45 M alt/s at 100 M to ≈3.3 M alt/s at 200 M for PyTorch, reflecting the increasing memory pressure visible in the VRAM curve, which rises to ≈14 MB for PyTorch at 200 M. TensorFlow exhibits the lowest throughput and VRAM usage, suggesting a different trade-off between parallelism and memory management. These results confirm the practical scalability of the 2-pass formulation with adaptive sharding, capable of processing 200 M alternatives in about one minute on a 16 GB GPU.
In conclusion, 2-pass GPU-TOPSIS demonstrates robust and predictable scalability up to 200 M alternatives, thus taking a decisive step towards near real-time massive data processing, and positioning PyTorch as the reference backend for very large-scale deployments.
Since no public MCDM dataset currently achieves the scale of hundreds of millions of alternatives, synthetic matrices calibrated to the statistical properties of the Amazon dataset were generated. This approach allows for the evaluation of computational scalability without introducing unrealistic data distributions.
4.10. Summary of Experimental Results
The eight experiments together lead to five main conclusions.
Scalability: GPU-TOPSIS can process matrices containing up to 200 million alternatives in minutes on a consumer GPU (Tesla T4), making a class of previously inaccessible decision problems feasible in near-real time. Strictly linear scalability in k is confirmed across the entire tested range, from 50,000 to 200 million alternatives.
Mathematical correctness: The 2-pass formulation, as defined by Property 1, guarantees exact equivalence with monolithic TOPSIS for any partitioning scheme, while reducing the memory footprint to O(mt × n). This result, demonstrated analytically in Section 3 and experimentally confirmed by EXP-6, establishes that the 2-pass formulation achieves an Overlap@1000 = 100% and a Spearman ρ = 1.0 for all tested values of k. The observed discrepancies remain localized to strict Top-K rankings for high values of k and are attributable to the non-associativity of parallel reductions in float32, as stated in the Note to Property 1. This precise characterization distinguishes GPU-TOPSIS from naive sharding approaches that introduce uncontrolled local biases and provides practitioners with explicitly quantified criteria for choosing between the two formulations.
Decision robustness: Sensitivity analyses and Monte Carlo simulations confirm that the ranking structure is stable in the face of realistic perturbations of the criteria weights (Overlap@5 > 95% for ±10%) and observational noise (Overlap@10 > 90% for ±5% noise), with Kendall τ metrics attesting to structural robustness in the critical decision area.
Unprecedented scalability: EXP-7 establishes that GPU-TOPSIS is operational with up to 200 million alternatives on a Tesla T4 GPU with 16 GB of VRAM, using adaptive sharding, with processing times on the order of tens of seconds. This performance, to our knowledge unprecedented in the MCDM literature, paves the way for decision-making applications at the scale of large e-commerce platforms, industrial catalogs, or aggregated medical databases.
Reproducibility and genericity: The availability of three independent implementations on CuPy, PyTorch, and TensorFlow, combined with the publication of the source code on Zenodo (https://doi.org/10.5281/zenodo.18911332) and the use of the public Amazon Products 2023 dataset [18], makes GPU-TOPSIS a reference framework that the MCDM/Big Data community can directly reproduce and extend.
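The exactness claim for the 2-pass formulation can be illustrated with a small, dependency-free sketch. It assumes, consistently with the summary above but without reproducing the published code, that pass 1 accumulates per-column sums of squares and extrema across fragments, and that pass 2 scores each fragment against those global statistics. The function names, the value and placement of EPS, and the toy matrix are illustrative assumptions. Plain Python is used for portability; the same two reductions map onto CuPy, PyTorch, or TensorFlow tensor operations.

```python
import math

EPS = 1e-12  # stabilization constant; value and placement are assumptions


def global_stats(fragments, n):
    """Pass 1: column norms and column extrema accumulated over all fragments."""
    sq = [0.0] * n
    lo = [math.inf] * n
    hi = [-math.inf] * n
    for frag in fragments:
        for row in frag:
            for j, x in enumerate(row):
                sq[j] += x * x
                lo[j] = min(lo[j], x)
                hi[j] = max(hi[j], x)
    return [math.sqrt(s) for s in sq], lo, hi


def score_fragment(frag, weights, benefit, norms, lo, hi):
    """Pass 2: closeness coefficients of one fragment from the given statistics."""
    n = len(weights)
    scale = [weights[j] / (norms[j] + EPS) for j in range(n)]
    # Ideal (PIS) and anti-ideal (NIS) points in weighted, normalized space.
    pis = [scale[j] * (hi[j] if benefit[j] else lo[j]) for j in range(n)]
    nis = [scale[j] * (lo[j] if benefit[j] else hi[j]) for j in range(n)]
    scores = []
    for row in frag:
        v = [scale[j] * row[j] for j in range(n)]
        d_pos = math.dist(v, pis)
        d_neg = math.dist(v, nis)
        scores.append(d_neg / (d_pos + d_neg + EPS))
    return scores


def topsis_two_pass(fragments, weights, benefit):
    """Score every fragment against statistics gathered over the whole matrix."""
    norms, lo, hi = global_stats(fragments, len(weights))
    scores = []
    for frag in fragments:
        scores.extend(score_fragment(frag, weights, benefit, norms, lo, hi))
    return scores


def topsis_naive_sharding(fragments, weights, benefit):
    """Counter-example: each fragment is scored with its own local statistics."""
    scores = []
    for frag in fragments:
        scores.extend(topsis_two_pass([frag], weights, benefit))
    return scores


# Toy matrix (hypothetical): price is a cost criterion, memory and rating are
# benefit criteria.
rows = [[250, 16, 12], [200, 16, 8], [300, 32, 16],
        [275, 32, 8], [225, 16, 16], [350, 64, 4]]
weights = [0.5, 0.3, 0.2]
benefit = [False, True, True]
fragments = [rows[0:2], rows[2:4], rows[4:6]]

monolithic = topsis_two_pass([rows], weights, benefit)  # one fragment = full matrix
two_pass = topsis_two_pass(fragments, weights, benefit)
naive = topsis_naive_sharding(fragments, weights, benefit)

print("2-pass equals monolithic:", two_pass == monolithic)
print("naive sharding equals monolithic:", naive == monolithic)
```

On this toy matrix the 2-pass scores match the monolithic ones, whereas naive per-fragment scoring does not, because each fragment then uses its own norms and ideal points. The sketch accumulates sequentially in float64; a parallel float32 reduction on a GPU may group the partial sums differently, which is where the small Top-K discrepancies mentioned above can arise.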
5. Discussion
5.1. Scalability and Performance of Backends
Experimental evaluation confirms that data volume is the determining factor in the computational cost of the TOPSIS method. CPU implementations remain suitable for modestly sized datasets, but their execution time increases rapidly with the number of alternatives, making them impractical for large-scale decision problems. GPU implementations, on the other hand, allow for the efficient processing of decision matrices containing several million alternatives. All experiments conducted on real Amazon data demonstrate that GPU acceleration significantly reduces execution times while preserving numerical stability and ranking consistency; the relative speedup, however, depends on the data volume, as discussed below.
The differences observed between the GPU backends reflect their respective execution models and memory management strategies. PyTorch generally achieves the lowest execution times thanks to its dynamic and lightweight CUDA pipeline, followed by CuPy, whose NumPy-compatible API minimizes porting overhead. TensorFlow exhibits higher latency, attributable to XLA compilation and the construction of static graphs, but maintains full scalability. It is worth noting that all three backends remain reliable and scalable across all evaluated configurations, thus establishing the independence of the proposed framework from any particular GPU software ecosystem.
The decline in PyTorch’s speedup factor, from 4.75× at 3.38 million alternatives to 1.80× at 14.65 million, requires a nuanced interpretation. It is explained by three cumulative bottlenecks inherent to the TOPSIS workload: (i) the overhead of CPU-to-GPU data transfers, with PCIe 3.0 bandwidth (≈12 GB/s) limiting the throughput for large float32 arrays; (ii) the partly sequential nature of global reductions (norm calculations, PIS/NIS), which cannot be fully parallelized across the entire array; and (iii) the saturation of VRAM bandwidth (320 GB/s on the Tesla T4) beyond 10 million alternatives. The reported 4.75× maximum speedup is lower than the 10×–100× observed in compute-bound HPC workloads because the TOPSIS pipeline is fundamentally memory-bandwidth-bound: its arithmetic intensity (FLOPs/byte) is low, so GPU cores spend a disproportionate share of their time waiting for memory rather than computing.
All experiments were conducted on Google Colab Pro with a Tesla T4 GPU, a shared, non-dedicated environment in which dynamic resource allocation and inter-user concurrency introduce additional variability; the reported speedups should therefore be read as conservative estimates. Backend design further influences relative performance: PyTorch’s dynamic execution model minimizes launch overhead; CuPy benefits from optimized cuBLAS routines via its NumPy-compatible API; and TensorFlow’s XLA compiler introduces higher startup latency but enables more aggressive operation fusion for large workloads. Future work on kernel fusion, float16 quantization, NVLink multi-GPU configurations, and profiling with NVIDIA Nsight Compute (to characterize coalesced memory access patterns, kernel occupancy, and arithmetic intensity) should enable speedups exceeding 10× for this class of workloads.
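The memory-bound argument can be made concrete with a back-of-the-envelope estimate. The sketch below uses the two bandwidth figures quoted in the text; the nominal float32 peak of the T4 and the per-element operation count are assumptions introduced only for illustration, so the output should be read as an order-of-magnitude check rather than a measurement.

```python
BYTES_PER_FLOAT32 = 4
PCIE3_BANDWIDTH = 12e9     # bytes/s, effective PCIe 3.0 figure quoted in the text
T4_VRAM_BANDWIDTH = 320e9  # bytes/s, Tesla T4 figure quoted in the text
T4_PEAK_FP32 = 8.1e12      # FLOP/s, nominal vendor peak (assumption, not from the article)
FLOPS_PER_ELEMENT = 8      # rough per-element operation count (assumption)


def matrix_bytes(m, n, itemsize=BYTES_PER_FLOAT32):
    """Size in bytes of a dense m x n decision matrix."""
    return m * n * itemsize


m, n = 14_650_000, 6
size = matrix_bytes(m, n)

host_to_device_s = size / PCIE3_BANDWIDTH  # one CPU-to-GPU copy
vram_read_s = size / T4_VRAM_BANDWIDTH     # one full read of the matrix from VRAM

# Roofline-style comparison: a kernel is memory-bound when its arithmetic
# intensity lies below the ridge point (peak FLOP/s divided by bandwidth).
intensity = FLOPS_PER_ELEMENT / BYTES_PER_FLOAT32
ridge_point = T4_PEAK_FP32 / T4_VRAM_BANDWIDTH

print(f"matrix size         : {size / 1e6:.1f} MB")
print(f"PCIe copy           : {host_to_device_s * 1e3:.1f} ms")
print(f"one VRAM read       : {vram_read_s * 1e3:.2f} ms")
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")
print(f"ridge point         : {ridge_point:.1f} FLOP/byte")
print("memory-bound        :", intensity < ridge_point)
```

Under these assumptions a single host-to-device copy costs as much as roughly 27 full reads of the matrix from VRAM, and the estimated arithmetic intensity sits far below the ridge point, which is consistent with bottlenecks (i) and (iii).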
5.2. Decision-Making Robustness and Ranking Stability
Sensitivity analysis demonstrates that the TOPSIS ranking structure remains stable in the face of realistic perturbations to the criterion weights, with no changes observed in the overall Top 10. This confirms that the accelerated framework preserves decision robustness under conditions of uncertainty regarding preferences. Fragmentation experiments also establish that memory constraints can be effectively circumvented without compromising the mathematical correctness of the ranking.
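The sensitivity protocol can be sketched as follows. This is a minimal, CPU-only illustration under stated assumptions: weights are perturbed by independent uniform factors in [1 − δ, 1 + δ] and then renormalized, the decision matrix is random, and `topsis_scores`, `perturb_weights`, and `overlap_at_k` are hypothetical helper names, not functions from the published code.

```python
import math
import random


def topsis_scores(rows, weights, benefit, eps=1e-12):
    """Compact reference TOPSIS (vector normalization, Euclidean distances)."""
    n = len(weights)
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) + eps for j in range(n)]
    v = [[weights[j] * r[j] / norms[j] for j in range(n)] for r in rows]
    cols = list(zip(*v))
    best = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    worst = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for vi in v:
        d_pos = math.dist(vi, best)
        d_neg = math.dist(vi, worst)
        scores.append(d_neg / (d_pos + d_neg + eps))
    return scores


def ranking(scores):
    """Indices of the alternatives, best score first (ties broken by index)."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))


def overlap_at_k(rank_a, rank_b, k):
    """Fraction of alternatives shared by the two Top-k sets."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


def perturb_weights(weights, rel, rng):
    """Scale each weight by a uniform factor in [1 - rel, 1 + rel], then renormalize."""
    raw = [w * rng.uniform(1.0 - rel, 1.0 + rel) for w in weights]
    total = sum(raw)
    return [w / total for w in raw]


rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [[rng.uniform(1, 100) for _ in range(4)] for _ in range(500)]
weights = [0.4, 0.3, 0.2, 0.1]
benefit = [True, True, False, True]

base = ranking(topsis_scores(rows, weights, benefit))
overlaps = []
for _ in range(200):
    w = perturb_weights(weights, 0.10, rng)
    perturbed = ranking(topsis_scores(rows, w, benefit))
    overlaps.append(overlap_at_k(base, perturbed, 10))

mean_overlap = sum(overlaps) / len(overlaps)
print(f"mean Overlap@10 under ±10% weight perturbation: {mean_overlap:.3f}")
```

Overlap@K compares Top-K sets and ignores the order inside them, which is why it is usually reported alongside a rank correlation such as Kendall’s τ.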
5.3. Algebraic and Computational Equivalence
Property 1 and EXP-6, considered together, provide a complete picture of the equivalence between the 1-pass and 2-pass formulations: exact in theoretical arithmetic, and practically total over wide evaluation windows (Overlap@1000 = 100%, Spearman ρ = 1.0 for all tested values of k), with localized divergences in strict Top-K rankings at high k, attributable to the non-associativity of parallel reductions in float32. This distinction between algebraic and computational equivalence is in itself a contribution, providing practitioners with explicit and quantified criteria for choosing between the two formulations.
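The gap between algebraic and computational equivalence comes down to the fact that floating-point addition is not associative. The sketch below emulates float32 rounding with the standard library (the helper names `f32` and `add32` are illustrative) and evaluates the same three-term sum with two groupings, as a parallel reduction might.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add32(a: float, b: float) -> float:
    """float32 addition: add the operands, then round the result to float32."""
    return f32(f32(a) + f32(b))


a, b, c = 1.0e8, -1.0e8, 1.0

left = add32(add32(a, b), c)   # (a + b) + c
right = add32(a, add32(b, c))  # a + (b + c)

print("float32 (a + b) + c =", left)
print("float32 a + (b + c) =", right)
print("float64 groupings agree:", (a + b) + c == a + (b + c))
```

In float32 the spacing between representable values near 10⁸ is 8, so b + c rounds back to b and the unit contribution of c is lost in the second grouping; in float64 both groupings give the same result for these values. Partial sums computed per fragment or per thread block are regrouped in the same way, which is enough to swap two alternatives whose scores differ only in the last few bits.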
5.4. Limitations
Several limitations are worth noting. First, all experiments were run on Google Colab Pro, a shared, non-dedicated environment. Each reported time is the arithmetic mean of five independent runs with an accompanying 95% confidence interval, which mitigates but does not eliminate the variance introduced by the dynamic allocation of GPU resources; future work will replicate the experiments on dedicated hardware infrastructure to strengthen the statistical validity of the reported speedup factors. Second, EXP-4 and EXP-7 rely on the artificial replication of the Amazon ALL matrix to reach very high volumes. This allows computational scalability to be assessed, but not decision diversity at very large scale: since the ranked alternatives are structurally similar from one fragment to another, conclusions regarding the stability of the ranking at 200 million alternatives must be interpreted in this context. Third, GPU memory consumption is recorded in EXP-4 but was not systematically measured across all experiments, although this data would be valuable for practitioners wishing to size their infrastructure. Fourth, no additional GPU controls (clock-frequency monitoring, thermal-throttling detection) were applied in the shared Colab environment; experiments on dedicated infrastructure with nvidia-smi monitoring are planned as future work.
Fifth, float64 accuracy was not systematically evaluated on fragmented data in this submission. In theory, switching to float64 would eliminate the local ranking inversions observed for high values of k (Section 4.8): the machine epsilon drops by roughly nine orders of magnitude, from ~1.2 × 10⁻⁷ (float32) to ~2.2 × 10⁻¹⁶ (float64), far below the observed inversion threshold (~10⁻⁴). However, this gain in precision comes at a significant memory and computational cost: each element grows from 4 to 8 bytes, so the memory footprint exactly doubles. This halves the maximum fragment size for a given VRAM budget, doubles the number of fragments k required to process the same volume of data, and reduces the GPU speedup by an estimated 30 to 50%, owing to the doubled memory traffic and to the far lower float64 throughput of mainstream GPU architectures, which have many fewer float64 than float32 compute units. Note that for the vast majority of practical use cases (k ≤ 2 or K ≥ 50), no measurable inversion is observed with float32, making float64 unnecessary in these configurations. A systematic experimental comparison of float64 and float32 on fragmented data is planned as a priority for future work.
Sixth, the framework has been validated with n = 6 criteria only. For applications with hundreds of criteria (e.g., genomic or financial analytics), (a) the intermediate tensor footprint O(m_t × n) can exceed VRAM even for a single fragment when n is large, and (b) row-wise fragmentation alone does not resolve VRAM overflow if m_t × n remains too large. A column-wise (criteria-wise) blocking strategy would be required, at the cost of additional data passes; guidelines for wide matrices are left as future work. Finally, the sensitivity analysis conducted in EXP-5 focuses on the Top 100; extending the evaluation of Kendall’s τ coefficient to larger subsets would strengthen the conclusions regarding the overall stability of the ranking beyond the critical decision zone.
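The float32/float64 trade-off and the wide-matrix issue can both be read off a simple sizing calculation. The sketch below is illustrative only: the VRAM budget and the number of live working tensors per fragment are hypothetical values chosen to give round numbers, and `max_fragment_rows` and `fragments_needed` are not functions from the published code.

```python
import math
import sys

VRAM_BUDGET = 9_600_000_000  # hypothetical usable VRAM in bytes (round number)
WORKING_COPIES = 4           # assumed number of live (m_t, n) tensors per fragment


def max_fragment_rows(n_criteria, itemsize, budget=VRAM_BUDGET, copies=WORKING_COPIES):
    """Largest fragment height m_t whose working set fits in the budget."""
    return budget // (copies * n_criteria * itemsize)


def fragments_needed(m, m_t):
    """Number of fragments k required to cover m alternatives."""
    return -(-m // m_t)  # integer ceiling division


m = 200_000_000
rows_f32 = max_fragment_rows(n_criteria=6, itemsize=4)
rows_f64 = max_fragment_rows(n_criteria=6, itemsize=8)
k_f32 = fragments_needed(m, rows_f32)
k_f64 = fragments_needed(m, rows_f64)

# Wide-matrix case: with 600 criteria the admissible fragment height collapses.
rows_wide = max_fragment_rows(n_criteria=600, itemsize=4)

# Precision side of the trade-off: ratio of the two machine epsilons.
eps_f32 = 2.0 ** -23
eps_f64 = sys.float_info.epsilon
orders = math.log10(eps_f32 / eps_f64)

print(f"float32: m_t = {rows_f32:,} rows, k = {k_f32}")
print(f"float64: m_t = {rows_f64:,} rows, k = {k_f64}")
print(f"n = 600, float32: m_t = {rows_wide:,} rows")
print(f"epsilon ratio: about {orders:.1f} orders of magnitude")
```

With these assumed values, float64 halves the fragment height and doubles k, while a hundredfold increase in n divides the admissible fragment height by one hundred, which is the situation in which column-wise blocking becomes necessary.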
These limitations do not diminish the scope of the contribution. By completely reformulating TOPSIS as a GPU tensor pipeline while strictly preserving its original mathematical definition—the only modification introduced being the numerical stabilization constant ε—GPU-TOPSIS makes large-scale multi-criteria analysis accessible on standard GPU hardware, including via cloud platforms such as Google Colab, thus paving the way for an effective democratization of MCDM methods in Big Data environments.
5.5. Practical Implications
Beyond computational performance, GPU-TOPSIS opens up concrete prospects for large-scale, multi-criteria decision-making in a variety of application contexts. In the field of e-commerce and recommendation systems, the proposed framework enables the near real-time ranking of hundreds of millions of products according to multi-criteria profiles, as demonstrated by the evaluation conducted on the Amazon Products 2023 dataset. In the field of supply chain management and industrial procurement, GPU-TOPSIS can support the dynamic selection of suppliers from continuously updated catalogs, where re-ranking must be performed at scale within tight operational deadlines.
A particularly relevant application area concerns multi-agent task allocation and robotic systems. In this context, an autonomous agent must rank and select actions or resources in real time according to multiple, potentially conflicting objectives. GPU-TOPSIS’s ability to rank millions of alternatives in seconds on a single commodity GPU makes it directly usable as a decision layer in robotic planning architectures, where MCDM methods have been shown to improve task scheduling and resource allocation efficiency. Similarly, in the field of the Underwater Internet of Things (IoT) [42], GPU-TOPSIS’s ability to handle very large decision matrices makes it well-suited to environments where the number of candidate nodes or configurations can be high and where evaluation must be performed rapidly.
More generally, any domain requiring the automated ranking of a large number of alternatives according to multiple criteria—including the allocation of health resources, the filtering of financial portfolios, or environmental monitoring—can directly rely on GPU-TOPSIS as a computational foundation, regardless of the weighting strategy adopted.
5.6. Threats to Validity
Several factors could affect the generalizability of the results presented. First, the experiments were conducted in a shared cloud environment (Google Colab Pro), which can lead to slight fluctuations in execution times due to variable resource allocation. Furthermore, the GPU hardware available in this type of environment is not exclusively dedicated to a single user, and its performance can be affected by background system activity. The execution times reported in this study should therefore be interpreted as conservative estimates. Experiments conducted on a dedicated hardware infrastructure, with optimized configurations and exclusive access to GPU resources, could thus lead to execution performance exceeding that observed in the Colab environment.
Second, scaling experiments beyond the initial dataset rely on statistically calibrated synthetic data, rather than entirely independent real-world datasets. While this approach allows for the evaluation of computational scalability at very large scales, it may not perfectly reflect the diversity and complexity of decision-making scenarios encountered in real-world contexts.
Finally, the proposed evaluation focuses exclusively on the TOPSIS method and does not directly examine the performance of other multi-criteria decision support techniques in the same GPU execution model. Future work will aim to determine the extent to which the proposed GPU-based tensor computing paradigm generalizes to other MCDM methods such as VIKOR, PROMETHEE, and ELECTRE.
6. Conclusions
This work introduced GPU-TOPSIS, a GPU-accelerated implementation of the TOPSIS method, designed to enable large-scale multi-criteria decision-making under realistic data and hardware constraints. Leveraging modern GPU computing frameworks—CuPy, PyTorch, and TensorFlow—the proposed approach overcomes the scalability limitations of classical CPU-based TOPSIS implementations while rigorously preserving mathematical fidelity to the original method.
This work represents a coherent progression from our previous contributions on parallel [14] and distributed [15] MCDM, taking a qualitative leap towards massive single-memory GPU parallelism. Experimental evaluations conducted on real data from the Amazon Products 2023 dataset demonstrate that GPU-TOPSIS enables the efficient processing of decision matrices containing millions of alternatives, with speedups of up to 4.75× compared to the CPU reference, while preserving ranking consistency and numerical stability. The integration of a fragment processing strategy extends scalability beyond the limits of GPU memory, allowing safe execution with extreme data volumes.
Beyond computational performance, the proposed framework supports large-scale robustness and sensitivity analyses, ensuring that acceleration does not compromise decision reliability. GPU-TOPSIS thus provides a practical and reliable foundation for large-scale decision support applications.
Future work will focus on: (1) extending the framework to multi-GPU and distributed environments; (2) improving memory management strategies for streaming data; (3) generalizing GPU acceleration to other MCDM methods such as AHP, VIKOR, PROMETHEE, and ELECTRE; (4) replicating the experiments on dedicated hardware infrastructure to eliminate the residual variance introduced by the dynamic GPU allocation of shared cloud environments; (5) systematically measuring VRAM consumption per experiment; (6) extending the Kendall’s τ sensitivity analysis to larger subsets of alternatives; (7) extending the framework to Fuzzy TOPSIS variants (triangular and trapezoidal fuzzy numbers) for large-scale management of linguistic uncertainty, since fuzzy numbers can be represented as additional tensors and the corresponding TOPSIS operations (fuzzy distance, fuzzy score) can be fully vectorized and executed on the GPU; (8) adapting the framework to dynamic decision matrices and streaming data, since the 2-pass fragmentation strategy is naturally suited to successive batches and would allow rankings to be updated incrementally without reloading the complete matrix into memory; and (9) systematically analyzing the energy efficiency and cost-performance trade-offs of GPU-TOPSIS, using metrics such as the number of alternatives processed per joule (alternatives/J) and TFLOPS/W, across different classes of GPU hardware (consumer, datacenter, cloud), in order to quantify the actual energy benefit of GPU acceleration over the reference CPU implementation.
Finally, this work demonstrates that classical multi-criteria decision support methods, such as TOPSIS, can be effectively reformulated to leverage the tensor-oriented architectures of modern GPUs. By combining vectorized computing with an adaptive fragmentation strategy, the proposed GPU-TOPSIS framework enables the processing of decision matrices containing up to several hundred million alternatives. These results open new perspectives for the application of multi-criteria decision support techniques to large-scale decision problems in fields such as recommendation systems, large-scale product evaluation, and data-driven decision support.