Article

GPU-TOPSIS: A Complete Vectorized and Parallel Reformulation of the TOPSIS Method for Large-Scale Multi-Criteria Decision Making

by Latifa Boubekri, Hassnae Aberkane, Mohammed Chaouki Abounaima * and Loubna Lamrini
Laboratory of Intelligent Systems and Applications, Faculty of Sciences and Techniques, Sidi Mohammed Ben Abdellah University, Fez 30000, Morocco
* Author to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(5), 138; https://doi.org/10.3390/bdcc10050138
Submission received: 8 March 2026 / Revised: 10 April 2026 / Accepted: 21 April 2026 / Published: 28 April 2026

Abstract

The TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) method is one of the most widely used multi-criteria decision-making (MCDM) approaches in industrial, financial, and scientific fields. However, its sequential computational cost of O(m × n), where m denotes the number of alternatives and n the number of criteria, becomes prohibitive when decision matrices have several million rows. Despite its geometric interpretability and simplicity, classical TOPSIS faces two key computational bottlenecks at scale: (i) Euclidean distance calculations O(m × n) dominating the total cost, and (ii) the O(m × log m) sorting step, both inherently sequential and memory-bound on CPUs. To overcome these limitations, we propose GPU-TOPSIS, a fully vectorized and parallel reformulation of TOPSIS based on tensor execution on graphics processing units (GPUs), whose main contributions are: (i) a formally correct reformulation of TOPSIS as a GPU tensor pipeline preserving mathematical fidelity to the original method; (ii) a two-pass fragment-processing algorithm guaranteeing exact mathematical equivalence with monolithic TOPSIS, while reducing the memory footprint from O(m × n) to O(mt × n), where mt < m is the size of each independently processed fragment; (iii) three independent implementations on CuPy, PyTorch, and TensorFlow, ensuring the framework’s portability and genericity. Experimental evaluations on real data from the Amazon Products 2023 dataset, using matrices of up to 200 million alternatives (via the 2-pass formulation), demonstrate speedups of up to 4.75× compared to the reference CPU implementation (NumPy), with inter-backend score differences below 5 × 10⁻⁸ and 100% ranking overlap across all tested Top-K thresholds.
A perturbation sensitivity analysis of the criteria weights and cross-backend consistency tests confirms that GPU acceleration fully preserves robustness and decision reliability, making GPU-TOPSIS a practical, open, and reproducible solution for large-scale multi-criteria decision making in Big Data environments.

1. Introduction

Multi-criteria decision-making (MCDM) plays a central role in modern information systems. It involves evaluating a large number of alternatives against potentially conflicting criteria [1]. Its applications now extend to finance [2], healthcare [3], engineering [4], supply chain management [5], environmental assessment [6], and several other fields. Among the many available MCDM methods, TOPSIS stands out for its simplicity, geometric interpretability, and robust rankings [7].
The classic sequential implementation of TOPSIS suffers from increasing computational cost as the number of alternatives reaches the order of a million [8]. Analysts are forced to artificially reduce the dataset size, risking exclusion of alternatives that could affect the final decision. More precisely, the primary computational bottlenecks of CPU-TOPSIS are: (i) Euclidean distance calculations O(m × n) dominating the overall cost; (ii) normalization requiring a full column scan; and (iii) the O(m × log m) final sort—all inherently sequential and memory-bound on CPUs, directly motivating GPU parallelization. Most tools only support matrices of modest size [9], making them unsuitable for Big Data environments.
High-performance computing (HPC), and in particular the increasing use of graphics processing units (GPUs), offers a concrete solution to these scaling limitations. GPUs are based on massively parallel architectures, optimized for the execution of vectorized numerical operations [10]. The emergence of the CUDA platform and the development of high-level Python libraries such as CuPy [11], PyTorch [12], and TensorFlow [13] have considerably facilitated access to GPU programming without requiring low-level expertise.
In this article, we present GPU-TOPSIS, a complete and vectorized reformulation of TOPSIS specifically designed for GPU execution. This contribution follows a progressive approach: our previous work proposed a parallelized MCDM filter in shared memory via OpenMP [14] and then a distributed version of TOPSIS based on MapReduce [15]. GPU-TOPSIS takes a further step by exploiting the massive parallelism of modern GPUs, enabling the near real-time processing of decision matrices containing tens of millions of alternatives, without artificial data reduction.
Three variants were developed: CuPy-TOPSIS, PyTorch-TOPSIS, and TensorFlow-TOPSIS, each reformulating the TOPSIS steps as a tensor pipeline running on CUDA-compatible hardware. Unlike previous work on GPU-TOPSIS [16], which remained application-domain specific and was validated on modestly sized datasets, our framework is general-purpose and has been validated on millions of alternatives (up to 200 million) from real-world e-commerce data, with explicit numerical stability analysis on three distinct GPU backends.
Experimental evaluations confirm significant performance gains, with speedups of up to 4.75× compared to the CPU-based NumPy [17] benchmark, while maintaining ranking consistency and numerical stability. A sensitivity analysis of criterion weight perturbations and sharding robustness tests completes the demonstration.
Unlike previous GPU-based implementations of TOPSIS, which often focus on specific application domains or limited datasets, this work proposes a general computational framework for large-scale multi-criteria decision-making based on GPU tensor pipelines. The proposed approach not only accelerates the classical TOPSIS workflow but also introduces a scalable computation model capable of handling decision matrices that exceed GPU memory capacity through a mathematically consistent fragmentation strategy.
In this context, we introduce GPU-TOPSIS, a fully vectorized GPU implementation of the TOPSIS method capable of processing decision matrices containing up to 200 million alternatives, thereby enabling large-scale multi-criteria decision-making on commodity GPU hardware. The original contributions of this work can be summarized as follows.

Summary of Key Contributions and Novelty

GPU-TOPSIS advances the state of the art in four dimensions of novelty: (1) it is the first TOPSIS implementation validated on matrices containing up to 200 million alternatives from real e-commerce data; (2) it introduces a mathematically proven two-pass fragmentation algorithm that guarantees exact ranking equivalence regardless of the fragmentation scheme; (3) it is the only framework offering cross-backend numerical consistency analysis across three GPU ecosystems (CuPy, PyTorch, TensorFlow); and (4) it provides a comprehensive experimental evaluation achieving a speedup of up to 4.75× compared to the CPU-NumPy reference, a cross-backend mean deviation of less than 5 × 10⁻⁸, and a Kendall τ = 1.0 for the monolithic formulation, confirming perfect ranking consistency. These novel aspects are supported by four original contributions:
First, GPU-TOPSIS provides a fully vectorized reformulation of the TOPSIS method in which each step of the decision pipeline—normalization, weighting, calculation of ideal solutions, calculation of Euclidean distances, and calculation of proximity scores—is expressed as tensor operations executed directly in GPU memory, while strictly preserving the original mathematical formulation.
Second, a two-pass fragment processing algorithm is introduced. Property 1 formally demonstrates that this formulation is a lossless generalization of monolithic GPU-TOPSIS: when k = 1, it reduces exactly to the standard formulation, while for k > 1, it guarantees identical rankings while reducing the memory footprint from O(m × n) to O(mt × n), where mt < m. This design enables the processing of decision matrices whose size exceeds both the GPU VRAM and the host machine RAM.
Third, three independent implementations based on CuPy, PyTorch, and TensorFlow are developed, ensuring portability and interoperability within the Python GPU ecosystem.
Fourth, a comprehensive experimental evaluation is conducted on real-world data from the Amazon Products 2023 dataset [18], covering matrices ranging from several million to 200 million alternatives. The evaluation includes perturbation sensitivity analyses of criterion weights, inter-backend numerical consistency tests, and partitioning robustness analyses, demonstrating that GPU acceleration fully preserves numerical stability and decision reliability.
The remainder of this article is structured as follows. Section 2 presents the context and motivations, including a review of related work on parallel MCDM and the classical TOPSIS method. Section 3 describes the GPU-TOPSIS reformulation and its parallelization principles. Section 4 details the three GPU implementations. Section 5 presents the experimental results and robustness analyses. Finally, Section 6 concludes and outlines future research directions.

2. Context and Motivation

2.1. High-Performance Computing and Parallel Processing on GPUs

The exponential growth of digital data generated by social networks, healthcare systems, business analytics, and scientific simulations has profoundly altered computing power requirements. Modern decision support systems must process large volumes of heterogeneous and dynamic information in real time [19,20]. This evolution has accelerated the transition to computing-intensive architectures where memory bandwidth, data proximity, and parallel execution capabilities are as critical as raw CPU power [21,22].
High-performance computing (HPC) environments address these needs by deploying scalable architectures capable of accelerating large-scale numerical workloads. MCDM methods, and TOPSIS in particular, rely on dense matrix operations—normalization, weighting, and distance calculation—whose mathematical structure naturally lends itself to parallelization. Parallel computing involves decomposing these operations into tasks that can be executed simultaneously; GPUs, with their thousands of lightweight cores specialized in tensor computing, are architecturally optimized for this class of problems [23,24,25].

2.2. GPU-Compatible Tensor Libraries

While low-level interfaces like CUDA and OpenCL offer fine-grained hardware control, their complexity hinders the development of scientific applications. An ecosystem of high-level libraries has emerged to simplify GPU programming [23]. These libraries provide practical abstractions (tensors, automatic differentiation, and on-the-fly compilation) while generating optimized GPU kernels.
Three libraries are central to this work. CuPy [11] is an open-source Python library that implements the NumPy API on GPUs via CUDA, enabling the migration of existing scientific code with minimal modifications. PyTorch [12], developed by Meta AI, offers a dynamic tensor abstraction with an efficient CUDA backend and an automatic differentiation (autograd) engine [26,27]. TensorFlow [13,28], developed by Google, provides a differentiable programming platform based on static computational graphs and the XLA compiler, optimized for production workloads.
These three libraries share a similar execution model: data is prepared on the CPU, transferred to GPU memory, processed via a sequence of tensor operations, and then returned to main memory for displaying the results. Table 1 summarizes their characteristics.

2.3. Review of Related Works

Classical MCDM methods, such as TOPSIS, VIKOR, AHP, PROMETHEE, and ELECTRE, are widely used in fields as diverse as environmental monitoring, finance, healthcare, and supply chain management. Their popularity stems from their rigorous axiomatic foundation, their ability to produce consistent rankings, and the diversity of application domains in which they have been validated. However, conventional sequential implementations are struggling with the increasing volume and dimensionality of modern datasets. TOPSIS, with a complexity on the order of O(m × n), quickly becomes inefficient when faced with millions of alternatives.
Theoretical extensions have enriched classical MCDM to handle uncertainty. Fuzzy MCDM models, such as the one used by Das et al. to assess the anthropogenic impact on urban water quality [29], offer more flexible representations of imprecise measurements. Other extensions, such as neutrosophic decision systems [30] or multi-fuzzy N-soft approaches [31], allow for addressing ambiguous situations. However, these theoretical advances do not eliminate the computational challenges posed by large datasets.
To address the need for scalability, several studies have explored distributed and parallel computing applied to MCDM. CPU-based approaches (MPI, OpenMP, or multi-thread architectures) have demonstrated encouraging results for tasks such as Pareto filtering or preference aggregation. Our research team previously proposed a parallel MCDM filter based on OpenMP in shared memory [14], offering good performance on systems with a small number of cores. A distributed version of TOPSIS based on MapReduce has also been developed [15], capable of handling high-dimensional data across multiple nodes, although its disk-oriented nature limits its application in real-time decision-making contexts. GPU-TOPSIS builds upon these foundations by taking a significant leap forward: where [14] leverages a few dozen CPU cores and [15] distributes computation across a cluster, GPU-TOPSIS mobilizes thousands of GPU cores in a single memory, eliminating the overhead of inter-node communication and disk access. Regarding GPU approaches, Erlacher et al. demonstrated the advantages of CUDA-accelerated spatial analysis for uncertainty modeling in MCDM [32]. Lakshmi et al. [31] presented a GPU-TOPSIS for Quality of Service (QoS)-based web service selection, showing significant speedups. Kumar et al. [33] evaluated GPU cloud computing instances via TOPSIS. More recently, Swetha and Karpagam [34] introduced a GPU-accelerated improved ideal reference method (I-RIM) for real-time web service selection. This work confirms the potential of GPUs for a wide class of ranking-based MCDM methods.
However, existing GPU or distributed implementations remain largely application-domain specific and do not offer a general, fully vectorized TOPSIS framework capable of processing very large datasets consistently and reproducibly. None of them examines the consistency of rankings between CPU and GPU implementations or evaluates numerical stability under large-scale batch processing. Compared to [16], in particular, GPU-TOPSIS offers a general-purpose framework validated on millions of alternatives with real-world e-commerce data, an explicit analysis of numerical stability on three GPU backends, and a sharding strategy to overcome CPU and GPU memory limitations.
As summarized in Table 2, GPU-TOPSIS is the only approach combining full genericity, GPU parallelism across three backends, scalability up to 200 million alternatives, and a formally evaluated numerical precision (ε < 10⁻⁶).
To our knowledge, no previous work has demonstrated the scalability of TOPSIS on GPU architectures for decision matrices containing hundreds of millions of alternatives.

2.4. The Basic TOPSIS Method

2.4.1. Fundamental Principles

The TOPSIS method (Technique for Order Preference by Similarity to Ideal Solution), introduced by Hwang and Yoon [35], is one of the most widespread multi-criteria decision-making approaches. Given a set of alternatives A1, …, Am evaluated according to criteria C1, …, Cn, the method constructs a decision matrix M, normalizes and weights the criteria, and then ranks the alternatives according to their relative proximity to two reference solutions: the Positive Ideal Solution (PIS), representing the theoretical optimal performance for each criterion, and the Negative Ideal Solution (NIS), representing the most unfavorable theoretical performance.
TOPSIS is valued for its intuitive geometric interpretation, the absence of pairwise comparisons, and its relatively low computational cost compared to more complex methods such as ELECTRE or PROMETHEE [35]. It has been successfully applied in many fields: supply chain management, industrial production systems, marketing, health, resource allocation, human resource management, and social network analysis [36,37].
Despite these strengths, TOPSIS has well-documented limitations. First, it is susceptible to the rank reversal phenomenon: introducing or removing a non-optimal alternative can alter the relative ranking of others [35]. This limitation is shared by VIKOR and MARCOS (Multi-Attributive Range Comparator). RAFSI (Ranking of Alternatives through Functional mapping of criterion sub-Intervals into a Single Interval) [38] and MABAC (Multi-Attributive Border Approximation Area Comparison) have been proposed to mitigate rank reversal issues by introducing a border approximation area; however, this improvement is achieved at the expense of increased computational complexity and model formulation overhead [39]. In GPU-TOPSIS, rank reversal is inherent to the TOPSIS mathematical formulation and is not amplified by GPU parallelization, which is mathematically equivalent to CPU-TOPSIS. Second, compared to other MCDM methods: (i) AHP requires pairwise comparisons, computationally expensive for large n; (ii) PROMETHEE and ELECTRE involve complex preference functions unsuitable for automated large-scale processing; (iii) VIKOR and MARCOS share TOPSIS’s O(m × n) complexity. TOPSIS was selected for GPU-TOPSIS because its fully vectorizable pipeline maps directly onto GPU tensor operations, making it the most suitable candidate for the proposed acceleration framework.

2.4.2. TOPSIS Sequential Algorithm

Consider a decision problem involving m alternatives and n criteria. The decision matrix M = [xij] of size m × n contains the evaluations of the alternatives, where xij represents the performance of alternative Ai according to criterion Cj. The decision-maker’s preferences are encoded in the weighting vector W = [w1, …, wn], where wj reflects the relative importance given to criterion Cj. The criteria are partitioned into two disjoint sets: J+ denotes the set of benefit criteria (to be maximized) and J− the set of cost criteria (to be minimized). The algorithm, summarized below as Algorithm 1, proceeds in seven successive steps.
Step 1—Construction of the decision matrix M and the criterion weight vector W:
$$M = [x_{ij}]_{m \times n}, \quad 1 \le i \le m, \quad 1 \le j \le n \tag{1}$$
$$W = [w_1, \ldots, w_n], \quad \text{with } \sum_{j=1}^{n} w_j = 1 \tag{2}$$
  • wj is the weight reflecting the relative importance of criterion Cj.
Step 2—Vector normalization (the most commonly used):
$$r_{ij} = \frac{x_{ij}}{\sqrt{\sum_{k=1}^{m} x_{kj}^{2}}} \tag{3}$$
We obtain the normalized matrix R = [rij].
Step 3—Weighted Normalized Matrix:
$$v_{ij} = w_j \times r_{ij} \tag{4}$$
We obtain the weighted matrix V = [vij].
Step 4—Determine the ideal solutions A+ and A−:
  • For j ∈ J+ (benefit: the criteria to be maximized):
$$A^{+}[j] = \max_i v_{ij}, \quad A^{-}[j] = \min_i v_{ij} \tag{5}$$
  • For j ∈ J− (cost: the criteria to be minimized):
$$A^{+}[j] = \min_i v_{ij}, \quad A^{-}[j] = \max_i v_{ij} \tag{6}$$
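Step 5—Calculate the Euclidean distances to the ideal solutions:
For each alternative Ai, we calculate its distance to each of the two ideal solutions.
$$D_i^{+} = \sqrt{\sum_{j=1}^{n} \left(v_{ij} - A_j^{+}\right)^{2}} \tag{7}$$
$$D_i^{-} = \sqrt{\sum_{j=1}^{n} \left(v_{ij} - A_j^{-}\right)^{2}} \tag{8}$$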
Step 6—Calculate the proximity coefficient (TOPSIS score):
For each alternative Ai, we calculate its score, which will be used for its final ranking.
$$S_i = \frac{D_i^{-}}{D_i^{+} + D_i^{-}} \tag{9}$$
  • Si ∈ [0, 1]
  • The larger Si is, the better the alternative Ai.
To avoid division by zero in degenerate cases where D+ = D− = 0, we add a constant numerical stabilization parameter ε = 10⁻¹² to the denominator of Equation (9), which gives Equation (10). This numerical adaptation does not affect the results in normal use: when D+ + D− is many orders of magnitude larger than ε, the change in Si is negligible.
$$S_i = \frac{D_i^{-}}{D_i^{+} + D_i^{-} + \varepsilon} \tag{10}$$
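The effect of the stabilization term can be checked with a minimal NumPy sketch. The function name `closeness` and the sample distances are illustrative assumptions and are not taken from the authors' code; float64 is used here for clarity.

```python
import numpy as np

EPS = 1e-12  # stabilization constant stated in the text


def closeness(d_plus, d_minus, eps=EPS):
    """Stabilized proximity score S = D- / (D+ + D- + eps), element-wise."""
    d_plus = np.asarray(d_plus, dtype=np.float64)
    d_minus = np.asarray(d_minus, dtype=np.float64)
    return d_minus / (d_plus + d_minus + eps)


# Hypothetical distances; the last entry is the degenerate case D+ = D- = 0,
# which would produce 0/0 (NaN) without eps.
d_plus = np.array([0.3, 0.0, 0.0])
d_minus = np.array([0.1, 0.5, 0.0])
scores = closeness(d_plus, d_minus)
print(scores)  # approximately 0.25, 1.0, and exactly 0.0
```

In the degenerate case the score evaluates to 0 rather than NaN, while the two regular entries differ from their unstabilized values only far below the precision that matters for ranking.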
Step 7—Rank the alternatives:
The alternatives are ranked in descending order of Si. The alternative with the highest score is considered the most preferable.
Algorithm 1: Sequential algorithm of the TOPSIS method
Input:
  A: set of m alternatives
  F: family of n criteria
  M = (aij), 1 ≤ i ≤ m, 1 ≤ j ≤ n: decision matrix
  W: criteria weights vector
Output:
  L: list of alternatives sorted by relevance
N ← normalization(M) // ∀ i,j : rij = aij / √(Σk akj²) [Equation (3)]
V ← weighting(N, W) // ∀ i,j : vij = wj × rij [Equation (4)]
PIS ← ideal_solution(V) // PIS = { max vij | benefit, min vij | cost } [Equations (5) and (6)]
NIS ← worst_solution(V) // NIS = { min vij | benefit, max vij | cost } [Equations (5) and (6)]
D+ ← distance(V, PIS) // ∀ i : di+ = √(Σj (vij − pj)²) [Equation (7)]
D− ← distance(V, NIS) // ∀ i : di− = √(Σj (vij − nj)²) [Equation (8)]
S ← score(D+, D−) // ∀ i : Si = di− / (di+ + di−) [Equation (9)]
L ← rank alternatives in decreasing order of S
return L
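The seven steps can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' reference implementation: the function name `topsis`, the boolean `benefit` mask, and the toy matrix are hypothetical, and the sketch assumes no criterion column is entirely zero (otherwise the normalization divides by zero).

```python
import numpy as np


def topsis(M, w, benefit, eps=1e-12):
    """Sequential TOPSIS on an (m, n) matrix; returns (scores, ranking)."""
    M = np.asarray(M, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    benefit = np.asarray(benefit, dtype=bool)   # True = maximize, False = minimize

    norms = np.sqrt((M ** 2).sum(axis=0))       # Step 2: column Euclidean norms
    V = (M / norms) * w                         # Steps 2-3: normalize, then weight
    col_max, col_min = V.max(axis=0), V.min(axis=0)
    pis = np.where(benefit, col_max, col_min)   # Step 4: positive ideal solution
    nis = np.where(benefit, col_min, col_max)   # Step 4: negative ideal solution
    d_plus = np.sqrt(((V - pis) ** 2).sum(axis=1))    # Step 5: distance to PIS
    d_minus = np.sqrt(((V - nis) ** 2).sum(axis=1))   # Step 5: distance to NIS
    scores = d_minus / (d_plus + d_minus + eps)       # Step 6 (stabilized form)
    ranking = np.argsort(-scores, kind="stable")      # Step 7: best first
    return scores, ranking


# Toy data (hypothetical): three alternatives, two benefit criteria.
M = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
scores, ranking = topsis(M, w=[0.5, 0.5], benefit=[True, True])
print(ranking)  # [2 1 0]: the row that dominates on both criteria comes first
```

Every step is a whole-array operation (column reduction, broadcasted arithmetic, row reduction), which is the property the GPU reformulation relies on.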

2.4.3. Computational Complexity

Let m be the number of alternatives and n the number of criteria. The complexity of CPU-TOPSIS is decomposed as follows [40,41]: (i) normalization and weighting, each element of the matrix is processed a constant number of times, i.e., O(m × n); (ii) calculation of distances, each alternative requires the calculation of two Euclidean distances on n criteria, i.e., O(m × n); (iii) calculation of proximity scores, a single pass over the set of alternatives: O(m); (iv) sorting, ranking the scores in descending order: O(m × log m). The total time complexity is therefore expressed by Equation (11):
$$T_{\text{CPU-TOPSIS}}(m, n) = O(m \times n) + O(m \times \log m) \approx O(m \times n) \tag{11}$$
When the number of alternatives m and the number of criteria n are of the same order of magnitude (m ≈ n), the complexity O(m × n) can be approximated by O(n²). However, in typical Big Data scenarios, m ≫ n: for a fixed n, the O(m × n) term is then linear in m, meaning that the computation time of the matrix operations increases proportionally to the volume of data. It is precisely this linear growth, and the absolute volume it represents when m reaches tens of millions, that justifies the use of GPU parallelization, capable of distributing this work across thousands of cores simultaneously.

3. GPU-TOPSIS: A New Parallel Extension of TOPSIS

The increasing availability of high-performance GPUs has transformed the execution of computationally intensive algorithms. GPU-TOPSIS reformulates each step of the TOPSIS method as tensor operations executed on GPU hardware, ensuring perfect ranking consistency with the original TOPSIS method while drastically reducing execution times. Three implementations have been developed using CuPy, PyTorch, and TensorFlow.

3.1. GPU-TOPSIS Algorithm

Algorithm 2 below presents the general formulation of GPU-TOPSIS with fragment processing. Each step of the classical TOPSIS pipeline is reformulated as a vectorized tensor operation executed in GPU memory. The fragmentation strategy partitions the data into successive, independent fragments loaded on demand from persistent storage, which enables the processing of decision matrices that exceed not only the VRAM capacity of the GPU but also the RAM of the host system.
Algorithm 2: GPU-TOPSIS with Sharding (Two-Pass Formulation)
Input:
  shards ← {S1, …, Sk}: partitions of the decision matrix
  W ∈ ℝⁿ: criteria weights (Σ wj = 1)
  criteria_types ∈ {“benefit”, “cost”}ⁿ
  backend ∈ {CuPy, PyTorch, TensorFlow}
  ε ← 10⁻¹²
Output:
  L: globally ranked alternatives
Initialization:
  W_gpu ← to_GPU(W)
  scores_global ← ∅
  indices_global ← ∅
  offset ← 0
/* PASS 1: Global statistics (CPU) */
sq_sums ← zeros(n)
for t = 1 … k do
  M ← load_CSV(St)
  sq_sums ← sq_sums + col_sum_squares(M)
end for
norms ← sqrt(sq_sums)
Initialize PIS and NIS according to the criteria_types
for t = 1 … k do
  M ← load_CSV(St)
  R ← M/norms
  V ← R ⊙ W
  Update PIS and NIS using column extrema of V
end for
norms_gpu ← to_GPU(norms)
PIS_gpu ← to_GPU(PIS)
NIS_gpu ← to_GPU(NIS)
/* PASS 2: Fragment GPU processing */
for t = 1 … k do
  M_gpu ← to_GPU(load_CSV(St))
  mt ← rows(M_gpu)
  R ← M_gpu/norms_gpu
  V ← R ⊙ W_gpu
  D+ ← row_sqrt_sum((V − PIS_gpu)²)
  D− ← row_sqrt_sum((V − NIS_gpu)²)
  S ← D−/(D+ + D− + ε)
  scores_global ← concatenate(scores_global, to_CPU(S))
  indices_global ← concatenate(indices_global, [offset … offset + mt − 1])
  offset ← offset + mt
  free_GPU(M_gpu, R, V)
end for
/* Global ranking (after Pass 2) */
scores ← concatenate(scores_global)
indices ← concatenate(indices_global)
L ← indices[descending_sort(scores)]
return L
The proposed sharding algorithm performs two passes on the dataset. The first pass calculates and aggregates global statistics per column (normalization denominators, PIS, and NIS vectors) without loading the entire matrix into memory. The second pass uses these global statistics, consistent across the dataset, to calculate distances and proximity scores, thus guaranteeing complete mathematical equivalence with a monolithic TOPSIS execution, regardless of the sharding scheme used.
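The two-pass scheme can be illustrated with a CPU-only NumPy sketch. It is a stand-in under stated assumptions: shards are in-memory arrays rather than CSV files loaded from disk, no GPU transfer takes place, and the name `two_pass_topsis` is hypothetical.

```python
import numpy as np


def two_pass_topsis(shards, w, benefit, eps=1e-12):
    """Score a list of (m_t, n) arrays using global statistics; returns all scores."""
    w = np.asarray(w, dtype=np.float64)
    benefit = np.asarray(benefit, dtype=bool)

    # Pass 1a: accumulate column sums of squares over all shards.
    sq_sums = sum((s.astype(np.float64) ** 2).sum(axis=0) for s in shards)
    norms = np.sqrt(sq_sums)

    # Pass 1b: global column extrema of the weighted normalized values.
    col_max = np.full(w.shape, -np.inf)
    col_min = np.full(w.shape, np.inf)
    for s in shards:
        V = s / norms * w
        col_max = np.maximum(col_max, V.max(axis=0))
        col_min = np.minimum(col_min, V.min(axis=0))
    pis = np.where(benefit, col_max, col_min)
    nis = np.where(benefit, col_min, col_max)

    # Pass 2: per-shard distances and scores (the GPU part of Algorithm 2).
    parts = []
    for s in shards:
        V = s / norms * w
        d_plus = np.sqrt(((V - pis) ** 2).sum(axis=1))
        d_minus = np.sqrt(((V - nis) ** 2).sum(axis=1))
        parts.append(d_minus / (d_plus + d_minus + eps))
    return np.concatenate(parts)


# Synthetic data (hypothetical): 1000 alternatives, 4 criteria.
rng = np.random.default_rng(0)
M = rng.uniform(1.0, 10.0, size=(1000, 4))
w = np.array([0.4, 0.3, 0.2, 0.1])
benefit = np.array([True, False, True, True])

mono = two_pass_topsis([M], w, benefit)                      # k = 1
sharded = two_pass_topsis(np.array_split(M, 7), w, benefit)  # k = 7
print(np.max(np.abs(mono - sharded)))  # at most a few units of rounding error
```

The two runs agree up to floating-point rounding: the sums of squares are accumulated in a different order when the data is split, so the last bits of the scores may differ even though the formulation is mathematically identical.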
To highlight the transition from the sequential model to the vectorized model on GPU, Table 3 below establishes a one-to-one correspondence between the two formulations for each of the seven steps of the TOPSIS method.
Remark on normalization schemes. The current GPU-TOPSIS implementation uses vector normalization (Equation (3)), whose denominator is a global column statistic naturally computed in Pass 1. The framework is equally compatible with min-max normalization: Pass 1 would then compute global column minima and maxima, and Pass 2 would apply rij = (xij − minj)/(maxj − minj) element-wise. Other linear schemes (sum normalization) can be integrated by adapting only the aggregation step in Pass 1. A comparative evaluation across normalization schemes is planned as future work.
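The min-max variant described in this remark can be sketched as follows. This is an illustration only, since the article states that the current implementation uses vector normalization; the function names are hypothetical, and mapping constant columns to 0 is an assumption introduced here to avoid division by zero.

```python
import numpy as np


def global_min_max(shards):
    """Pass 1 for min-max normalization: global column minima and maxima."""
    n = shards[0].shape[1]
    col_min = np.full(n, np.inf)
    col_max = np.full(n, -np.inf)
    for s in shards:
        col_min = np.minimum(col_min, s.min(axis=0))
        col_max = np.maximum(col_max, s.max(axis=0))
    return col_min, col_max


def min_max_normalize(shard, col_min, col_max):
    """Pass 2 step: r_ij = (x_ij - min_j) / (max_j - min_j), element-wise."""
    span = col_max - col_min
    span = np.where(span == 0, 1.0, span)  # assumption: constant columns map to 0
    return (shard - col_min) / span


# Toy data (hypothetical) split into two shards of unequal size.
M = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
shards = [M[:2], M[2:]]
col_min, col_max = global_min_max(shards)
R = np.vstack([min_max_normalize(s, col_min, col_max) for s in shards])
print(R)  # column 0 -> 0, 0.5, 1; column 1 -> 0, 1/3, 1
```

Only the aggregation in Pass 1 changes (running minima and maxima instead of sums of squares); the element-wise structure of Pass 2 is unchanged.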

3.2. TOPSIS Complexity and Memory Footprint: CPU, 1-Pass GPU, and 2-Pass GPU (Sharding)

The fundamental algorithmic cost of TOPSIS applied to a decision matrix M ∈ ℝ^{m×n} remains O(m × n), because the normalization, weighting, reduction (max/min for PIS/NIS), and distance calculation operations traverse the entire set of m × n values. On the CPU, the execution time therefore follows this growth, modulo the effects of vectorization and memory hierarchy.
On GPUs, these steps naturally translate into parallel kernels. Denoting by p the number of effectively active computing units (useful utilization), and assuming good GPU utilization, the theoretical time is written as:
$$T_{\text{GPU-TOPSIS}}(m, n, p) = O\!\left(\frac{m \times n}{p}\right) + T_{\text{overhead}} \tag{12}$$
where Toverhead groups together CPU ↔ GPU transfers, kernel launch latency, and synchronization costs. This expression accurately describes regimes where the term (m × n)/p dominates (often for very large volumes, e.g., m ≳ 10⁶). Conversely, for smaller sizes (e.g., m ≲ 10⁵), the overhead can become dominant and negate the GPU advantage, so that the condition TGPU-TOPSIS < TCPU-TOPSIS is no longer guaranteed.

3.2.1. GPU-TOPSIS in 1-Pass (Monolithic)

In a 1-pass formulation, the calculation is performed in a single phase, limiting data rereading and repeated transfers. The theoretical time required coincides with Equation (12). This option is generally the fastest when the data and the necessary intermediate tensors fit in both CPU and GPU memory.
$$T_{\text{GPU,1-pass}}(m, n, p) = O\!\left(\frac{m \times n}{p}\right) + T_{\text{overhead}} \tag{13}$$
In 1-pass, the global statistics (normalization norms, PIS/NIS) and the scores are calculated on the complete dataset, which is loaded into memory in its entirety.

3.2.2. GPU-TOPSIS in 2-Passes (Sharding)

When the matrix is too large to be loaded into VRAM all at once, the k-shard partitioning strategy becomes necessary (see Algorithm 2 above). A correct formulation then proceeds in two passes:
  • Pass 1 (global statistics): calculation of global norms (normalization denominators) and global PIS/NIS ideal vectors, ensuring consistent normalization and weighting across the entire dataset;
  • Pass 2 (scores): shard-by-shard processing on GPU to calculate distances and proximity scores from global statistics.
The execution time can be summarized as follows:
$$T_{\text{GPU,2-pass}}(m, n, p, k) = 2 \times O(m \times n) + k \times O\!\left(\frac{m_t \times n}{p}\right) + k \times T_{\text{overhead}} \tag{14}$$
With:
  • 2 × O(m × n): two sequential CPU traversals in Pass 1
  • k × O(mt × n/p): k fragments processed successively on the GPU, each of mt rows distributed over p cores
  • k × Toverhead: cumulative overhead over the k fragments
The term 2 × O(m × n) corresponds to the construction of global statistics in Pass 1 on the CPU (often constrained by I/O if shards are read from disk), while k × Toverhead reflects the repetition of transfers and launches at each fragment. In return, this approach allows for memory scalability while guaranteeing mathematical equivalence with monolithic TOPSIS, regardless of the partitioning scheme.
Simplification
By noting that Σt mt = m (the sum of all the fragments reconstitutes the complete matrix, so that k × mt = m for fragments of equal size), we can verify the consistency:
$$k \times O\!\left(\frac{m_t \times n}{p}\right) = O\!\left(\frac{k \times m_t \times n}{p}\right) \approx O\!\left(\frac{m \times n}{p}\right)$$
Formula (14) is therefore consistent with monolithic GPU-TOPSIS at the global asymptotic level, while reflecting fragment execution at the operational level.
Limiting case k = 1:
When k = 1, Formula (14) reduces to:
$$T_{\text{GPU,2-pass}}(m, n, p, 1) = 2 \times O(m \times n) + O\!\left(\frac{m \times n}{p}\right) + T_{\text{overhead}}$$
This corresponds exactly to the monolithic GPU-TOPSIS with the additional cost of the CPU pass, confirming that Property 1 below is consistent with the complexity analysis.

3.2.3. GPU-TOPSIS in 1-Pass Sharding

To reduce the cost of the two-pass GPU-TOPSIS scheme, we propose a faster “1-pass-shard” variant (see Figure 1b). It eliminates the global pre-aggregation cost of O(m × n) and performs, for each shard, the local computation of statistics (norms, PIS/NIS) and scores in O(mt × n/p). This acceleration, however, comes at the cost of approximation, since the statistics and scores are estimated locally on each shard rather than globally.
$$T_{\text{GPU,1-pass-shard}}(m, n, p, k) = O\!\left(\frac{m \times n}{p}\right) + k \times T_{\text{overhead}}$$
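The nature of this approximation can be seen on a deliberately small example. The sketch below is illustrative only: the helper `topsis_scores` and the toy matrix are hypothetical, and the shards are chosen so that their value ranges do not overlap, which is an unfavorable case for local statistics.

```python
import numpy as np


def topsis_scores(M, w, benefit, eps=1e-12):
    """TOPSIS scores using statistics computed from M only."""
    M = np.asarray(M, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    benefit = np.asarray(benefit, dtype=bool)
    V = M / np.sqrt((M ** 2).sum(axis=0)) * w
    col_max, col_min = V.max(axis=0), V.min(axis=0)
    pis = np.where(benefit, col_max, col_min)
    nis = np.where(benefit, col_min, col_max)
    d_plus = np.sqrt(((V - pis) ** 2).sum(axis=1))
    d_minus = np.sqrt(((V - nis) ** 2).sum(axis=1))
    return d_minus / (d_plus + d_minus + eps)


M = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
w, benefit = [0.5, 0.5], [True, True]

# Global statistics (monolithic or two-pass) versus per-shard statistics.
global_scores = topsis_scores(M, w, benefit)
local_scores = np.concatenate(
    [topsis_scores(shard, w, benefit) for shard in (M[:2], M[2:])]
)

print(global_scores)  # approximately 0, 1/3, 2/3, 1
print(local_scores)   # approximately 0, 1, 0, 1
```

With local statistics, the second alternative receives the same score as the globally best one and outranks the third, although it is dominated by it. How large this effect is on real data depends on how the alternatives are distributed across shards.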

3.2.4. Memory Footprint: A Condition for the Feasibility of the First Pass and a Motivation for the Second Pass

In float32 precision, each element occupies 4 bytes, resulting in a minimal footprint:
$$\text{Mem}(M) \approx 4 \times m \times n \ \text{bytes}$$
In practice, a monolithic GPU-TOPSIS also manipulates intermediate tensors (normalized R, weighted V, temporary buffers). The VRAM footprint can be usefully modeled by:
$$\text{VRAM}_{\text{1-pass}} \approx c \times 4 \times m \times n$$
where c is typically between 2 and 4, depending on kernel fusion and storage strategy (e.g., M, R, V, and buffers). The W, PIS, NIS vectors and global norms cost only O(n), negligible compared to m × n.
Numerical example (float32, n = 20)
  • if m = 10⁷, then m × n = 2 × 10⁸ and Mem(M) ≈ 0.8 GB; with c = 3, VRAM1-pass ≈ 2.4 GB (excluding overheads).
  • if m = 10⁸, then Mem(M) ≈ 8 GB, and VRAM1-pass ≈ 16 to 32 GB for c ∈ [2, 4], which often exceeds the available VRAM.
Thus, the memory constraint directly determines the choice of formulation: when VRAM1-pass fits within the memory of the GPU board, 1-pass is generally preferable (fewer reads, less repeated overhead). When this is not the case, 2-pass (sharding) is necessary; it reduces the VRAM footprint to that of an mt × n shard:
$$\text{VRAM}_{\text{2-pass}} \approx c \times 4 \times m_t \times n + O(n)$$
at the cost of a second data traversal and repeated overhead on k fragments.
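The footprint model above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming 1 GB = 10⁹ bytes; the helper names and the shard size of 10⁶ rows are hypothetical and chosen only for illustration.

```python
def matrix_bytes(rows: int, n: int, bytes_per_elem: int = 4) -> int:
    """Raw size in bytes of a rows x n decision matrix (float32 by default)."""
    return bytes_per_elem * rows * n


def vram_estimate_gb(rows: int, n: int, c: float) -> float:
    """Modeled footprint c * 4 * rows * n, expressed in GB (1 GB = 1e9 bytes)."""
    return c * matrix_bytes(rows, n) / 1e9


n = 20  # number of criteria used in the numerical example

mem_1e7_gb = matrix_bytes(10**7, n) / 1e9           # raw matrix, m = 1e7
one_pass_1e7_gb = vram_estimate_gb(10**7, n, c=3)   # 1-pass estimate, m = 1e7
one_pass_1e8_gb = [vram_estimate_gb(10**8, n, c) for c in (2, 4)]  # m = 1e8

# Hypothetical shard of 1e6 rows: the 2-pass footprint depends on the shard
# size only, not on the total number of alternatives m.
two_pass_shard_gb = vram_estimate_gb(10**6, n, c=3)

print(mem_1e7_gb, one_pass_1e7_gb, one_pass_1e8_gb, two_pass_shard_gb)
```

Under these assumptions the sketch reproduces the figures of the numerical example, and it makes explicit that the 2-pass estimate is governed by the shard size mt rather than by m.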
Comparative table of complexities and memories
  • Illustrative hypotheses: n = 20, p = 4096, k = 100.
  • Scenarios: (A) m = 10⁵ ⇒ m × n = 2 × 10⁶ and (B) m = 10⁷ ⇒ m × n = 2 × 10⁸
As shown in Table 4, the 1-pass GPU formulation achieves the lowest overhead when the dataset fits in VRAM, while the 2-pass formulation becomes necessary beyond this threshold to guarantee global equivalence with CPU-TOPSIS at the cost of an additional data traversal.
Ultimately, if the decision matrix and its intermediate tensors fit in GPU memory, the 1-pass GPU-TOPSIS formulation (see Figure 1a) is generally the most efficient, as it limits data rereading and avoids the repeated overhead of data transfer and kernel launches. However, as soon as VRAM constraints impose fragmented processing, the 2-pass GPU-TOPSIS formulation (sharding) becomes the reference solution: it guarantees the overall consistency of the statistics (norms, PIS/NIS) and therefore equivalence with monolithic TOPSIS, at the cost of a second traversal of the dataset and overhead proportional to the number of fragments.
Property 1 (Generalization of monolithic GPU-TOPSIS)
Let k be the number of fragments in Algorithm 2. When k = 1, the two-pass fragmentation algorithm reduces exactly to the standard monolithic GPU-TOPSIS, in which the normalization denominators and the PIS and NIS vectors are computed on the complete decision matrix M ∈ ℝ^(m×n), and the entire TOPSIS pipeline is executed on the GPU in a single pass. Moreover, for any k ≥ 1, the two-pass formulation produces rankings that are mathematically identical to those of monolithic GPU-TOPSIS, since Pass 1 guarantees the global consistency of norms_global, PIS_global, and NIS_global regardless of the chosen partitioning scheme. The two-pass fragmentation algorithm is therefore a lossless generalization of monolithic GPU-TOPSIS, introducing no approximations and preserving total mathematical fidelity to the original TOPSIS formulation of Hwang and Yoon [34] for any value of k.
When k = 1, Formula (14) reduces to:
$$T_{\mathrm{GPU},\,\mathrm{2pass}}(m, n, p, 1) = O(m \times n) + O\!\left(\frac{m \times n}{p}\right) + T_{\mathrm{overhead}}$$
This corresponds exactly to the monolithic GPU-TOPSIS with the CPU pass overhead of Pass 1, confirming that Property 1 is consistent with the complexity analysis.
Note on finite-precision arithmetic: The above equivalence holds in exact arithmetic. In float32 arithmetic, addition is not associative: each operation rounds at the level of machine epsilon (ε_mach ≈ 1.2 × 10⁻⁷), and parallel GPU reductions may accumulate terms in a different, possibly non-deterministic, order. In practice, this may introduce score discrepancies on the order of ε_mach × k between the 1-pass and 2-pass formulations for large k. The empirical impact of this effect on decision quality is analyzed in Section 5.
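The non-associativity mentioned in this note can be reproduced on the CPU with three float32 values. The sketch below uses NumPy scalars; the specific values are chosen for illustration and are not taken from the experiments.

```python
import numpy as np

# Near 1e8, consecutive float32 values are 8 apart, so adding 1.0 to -1e8
# is rounded away entirely.
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

left = (a + b) + c    # a + b is exactly 0, then 0 + 1
right = a + (b + c)   # b + c rounds back to -1e8, then 1e8 - 1e8

eps_mach = float(np.finfo(np.float32).eps)  # float32 machine epsilon

print(float(left), float(right), eps_mach)
```

The two groupings give different results for the same three operands, which is the mechanism by which a reduction split across fragments can differ in its last bits from a reduction over the full matrix.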
Corollary 1 (Memory independence)
Algorithm 2 requires that only one fragment St reside in memory at any given time, both during Pass 1 and Pass 2. Therefore, the algorithm supports decision matrices whose total size exceeds both the available GPU VRAM and the host system’s RAM capacity, provided that individual fragments can be sequentially loaded from persistent storage. The memory footprint of Algorithm 2 is thus O(mt × n) rather than O(m × n), where mt = max_{t = 1, …, k} rows(St) denotes the size of the largest fragment.
Outline of Proof of Property 1
Since the col_sum_squares operator is additive on disjoint row partitions, the following identity applies:
$$\mathrm{norms\_global}_j = \sum_{t=1}^{k} \sum_{i \in S_t} x_{ij}^{2} = \sum_{i=1}^{m} x_{ij}^{2} \tag{22}$$
Equation (22) holds for any partitioning {S1, …, Sk} of the set of row indices {1, …, m}. Similarly, since the max and min operators are associative and commutative on disjoint sets, the vectors PIS_global and NIS_global calculated incrementally in Pass 1 are the same as those obtained by direct reduction on the complete matrix. It follows that the weighted normalized matrix V and the Euclidean distances D+ and D− calculated in Pass 2 are pointwise identical to their monolithic counterparts, and the final ranking L is therefore invariant to the choice of k and the partitioning scheme.
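The argument of the proof can be illustrated with a small CPU-only sketch in NumPy. It is not the authors' Algorithm 2: the function names, the placement of the stabilization constant, the synthetic data, and the choice of four fragments are assumptions made for illustration. Pass 1 accumulates column sums of squares and column extrema fragment by fragment; Pass 2 scores each fragment against these global statistics.

```python
import numpy as np

EPS = 1e-12  # stabilization constant; where it is applied is an assumption


def topsis_monolithic(M, w, is_benefit):
    """Reference TOPSIS on the full matrix (vector normalization)."""
    scale = w / (np.sqrt((M ** 2).sum(axis=0)) + EPS)
    V = M * scale  # weighted normalized matrix
    pis = np.where(is_benefit, V.max(axis=0), V.min(axis=0))
    nis = np.where(is_benefit, V.min(axis=0), V.max(axis=0))
    d_plus = np.sqrt(((V - pis) ** 2).sum(axis=1))
    d_minus = np.sqrt(((V - nis) ** 2).sum(axis=1))
    return d_minus / (d_plus + d_minus + EPS)


def topsis_two_pass(fragments, w, is_benefit):
    """Two-pass TOPSIS: global statistics first, then per-fragment scores."""
    n = w.shape[0]
    sum_sq = np.zeros(n)
    col_max = np.full(n, -np.inf)
    col_min = np.full(n, np.inf)
    for S in fragments:  # Pass 1: additive / associative reductions
        sum_sq += (S ** 2).sum(axis=0)
        col_max = np.maximum(col_max, S.max(axis=0))
        col_min = np.minimum(col_min, S.min(axis=0))
    scale = w / (np.sqrt(sum_sq) + EPS)
    # Scaling by a non-negative factor preserves column-wise max and min.
    pis = np.where(is_benefit, col_max, col_min) * scale
    nis = np.where(is_benefit, col_min, col_max) * scale
    scores = []
    for S in fragments:  # Pass 2: only one fragment is needed at a time
        V = S * scale
        d_plus = np.sqrt(((V - pis) ** 2).sum(axis=1))
        d_minus = np.sqrt(((V - nis) ** 2).sum(axis=1))
        scores.append(d_minus / (d_plus + d_minus + EPS))
    return np.concatenate(scores)


rng = np.random.default_rng(0)
M = rng.uniform(1.0, 100.0, size=(1000, 6))  # synthetic data, float64
w = np.full(6, 1.0 / 6.0)
is_benefit = np.array([True, True, False, True, True, False])

s_mono = topsis_monolithic(M, w, is_benefit)
s_two = topsis_two_pass(np.array_split(M, 4), w, is_benefit)

max_diff = float(np.max(np.abs(s_mono - s_two)))
same_ranking = bool(np.array_equal(np.argsort(-s_mono, kind="stable"),
                                   np.argsort(-s_two, kind="stable")))
print(max_diff, same_ranking)
```

The sketch runs in float64, so any difference between the two score vectors comes only from the order of summation in Pass 1 and stays far below the gaps between scores on this synthetic data.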

4. Experimental Evaluation

4.1. Dataset: Amazon Products 2023

The Amazon Products 2023 dataset is the primary experimental tool for this work. It is a large, publicly available collection, aggregated from Amazon platform catalogs and distributed by product category [18]. Its tabular structure, rich attributes, and sheer volume—tens of millions of products across 32 categories—make it a natural and demanding testing ground for large-scale multi-criteria decision-making methods. In the context of e-commerce decision support, each product represents an alternative to be evaluated, and the goal is to identify the best-performing alternatives based on a set of criteria reflecting perceived quality, popularity, price positioning, and the reliability of ratings.
Construction of the decision matrix. The decision matrix M ∈ ℝ^(m×6) is constructed by associating each product (alternative Ai) with a vector of six quantitative criteria derived from the available metadata. Price (C3) and the standard deviation of the ratings (C6) are treated as cost criteria; the other four are benefit criteria. This moderate number of criteria (n = 6) is commonly adopted in MCDM studies to preserve the interpretability of the rankings while providing a sufficiently rich decision structure.
Table 5 formally describes each criterion, its calculation method, its TOPSIS type, and the decision rationale that justifies its inclusion.
Matrix cleaning comprises three operations: (i) clipping the utility rate to [0, 1] to correct Amazon counter artifacts (affecting 2835 rows in the Books category); (ii) removing price outliers beyond the 99th percentile; and (iii) deduplication by retaining the most recent row by parent_asin identifier. After preprocessing, the reference matrix contains 14,652,525 alternatives (multi-category configuration, ALL) and 3,383,435 alternatives (single-category configuration, Books). The reference weight vector is set to W = [1/6, …, 1/6], corresponding to equal weights assigned to each of the six criteria, in accordance with standard practice in MCDM method validation studies, where the goal is to evaluate the computational framework independently of any application weighting bias. In real-world deployments, criterion weights can be determined by: (i) expert elicitation via AHP pairwise comparisons; (ii) entropy-based objective weighting; or (iii) stakeholder-defined preference vectors. GPU-TOPSIS accepts any normalized weight vector W satisfying ∑wj = 1, wj ≥ 0.

4.2. Experimental Environment

All experiments were conducted on the Google Colab Pro platform. The hardware and software configuration deployed was: 2.7/51.0 GB of RAM; NVIDIA Tesla T4 GPU (16 GB of VRAM, CUDA 12.8); software stack Python 3.9, NumPy 1.23.5, CuPy 12.3, PyTorch 2.4.1, TensorFlow 2.16.1. Google Colab Pro was deliberately chosen to ensure maximum reproducibility: the experimental environment is accessible to any researcher without dedicated hardware, and the source code, dataset references, and notebook configurations are published in the open repository. The dynamic GPU allocation inherent in Colab is mitigated by the five-replicate measurement protocol and the calculation of standard deviations.
Each execution time reported in this article is the arithmetic mean of five independent executions, given together with the standard deviation (±) and the 95% confidence interval (95% CI) calculated using Student’s t-distribution. This repeated measurement methodology represents a substantial improvement over the single measurements typically reported in the literature and supports the statistical validity of the performance comparisons. The CPU comparison uses sequential NumPy as the single reference. In the Colab environment, CPU parallelization is limited to 2 vCPUs, which justifies this choice: the gain achievable from two CPU threads is negligible compared with the thousands of GPU cores involved.
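As an illustration of this protocol, the sketch below computes the mean, the sample standard deviation, and a 95% confidence interval for five runs using only the Python standard library. The timing values are hypothetical, and the constant 2.776 is the two-sided 95% critical value of Student's t-distribution for 4 degrees of freedom.

```python
import math
import statistics

T_CRIT_95_DF4 = 2.776  # two-sided 95% critical value, Student t, 4 d.o.f.


def summarize_runs(times_s):
    """Return (mean, sample std, (ci_low, ci_high)) for exactly five timings."""
    if len(times_s) != 5:
        raise ValueError("the critical value above is only valid for 5 runs")
    mean = statistics.mean(times_s)
    std = statistics.stdev(times_s)  # sample standard deviation (n - 1)
    half_width = T_CRIT_95_DF4 * std / math.sqrt(len(times_s))
    return mean, std, (mean - half_width, mean + half_width)


timings = [0.44, 0.45, 0.46, 0.44, 0.45]  # hypothetical execution times (s)
mean_s, std_s, (ci_low, ci_high) = summarize_runs(timings)
print(f"{mean_s:.3f} s ± {std_s:.4f} s, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With only five replicates the t-based interval is noticeably wider than a normal-approximation interval (critical value 1.96), which is why the t-distribution is the appropriate choice here.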
Finally, the numerical stabilization constant ε = 10⁻¹² is applied uniformly in all GPU implementations.

4.3. EXP-1. CPU vs. GPU Scalability

Objective: To measure GPU-TOPSIS acceleration factors relative to the CPU reference across a wide range of matrix sizes, from 50,000 to 14.65 million alternatives, with seven measurement points that allow a complete scalability curve to be plotted. The matrices used are randomly drawn subsets of the actual Amazon Products 2023 set; only sizes exceeding the available data use synthetic data calibrated against observed statistics.
Results and analysis. Several observations emerge from Table 6. At small scales (m ≤ 100,000), GPU acceleration is limited, and TensorFlow is even slower than the CPU at 50,000 alternatives (0.65×), due to the overhead of launching CUDA kernels. PyTorch and CuPy are notable exceptions even at this level, showing 2.34× and 1.49× respectively, which reflects the lightweight nature of their execution pipelines and is consistent with the complexity analysis (Equation (11)). From 500,000 alternatives onward, all three GPU backends consistently outperform the CPU benchmark. The best overall speedup is achieved by PyTorch at 3.38 million alternatives (4.75×). The marked decrease observed at 14.65 M (PyTorch dropping to 1.80×, compared to 2.78× for CuPy and 2.43× for TensorFlow) is explained by the progressive saturation of VRAM bandwidth and the increasing overhead of CPU-to-GPU transfers. This phenomenon, intrinsic to the GPU architecture, is particularly pronounced for PyTorch at very large scales, where CuPy gains the advantage thanks to memory management closer to the native NumPy model. However, it does not invalidate the overall scalability of the proposed framework.
Figure 2a shows the log-log execution times for the four implementations as a function of the number of alternatives. All curves follow near-linear growth, consistent with the O(m) complexity of the framework for a fixed number of criteria. PyTorch shows the lowest times over most of the range, while the CPU remains consistently the slowest at large scales. Figure 2b illustrates the GPU speedup relative to the CPU. PyTorch clearly dominates up to its peak of 4.75× around 3.4 million alternatives, before dropping to 1.80× at 14.65 M, a sign of GPU memory saturation. CuPy and TensorFlow show a more modest and stable progression, with CuPy taking the lead over PyTorch beyond 10 million alternatives. TensorFlow starts below the CPU baseline (<1×) at small scales, illustrating a higher launch overhead. The three backends converge towards a similar behavior at very large scale, highlighting the common limits of VRAM bandwidth.

4.4. EXP-2. Inter-Backend Numerical Consistency

Objective: To verify that the three GPU implementations produce proximity scores Si and rankings that are numerically consistent with one another and with the CPU reference, despite differences in arithmetic precision and optimization strategies specific to each backend.
Methodology: Consistency is assessed on the ALL matrix (14.65 M alternatives) with reference weights. The metrics retained are: the maximum and mean difference on the Si scores (Max ΔSi, Mean ΔSi), the ranking overlap rate (Overlap@K) for K ∈ {5, 10, 50, 100}, the Kendall concordance coefficient τ on the Top-10 and Top-100, and the Spearman coefficient ρ on the Top-100.
  • Example of reading the table: Kendall τ@10 = 1.0000 with p = 5.5 × 10⁻⁷ means that the two backends produce the same Top-10 in the same order, and that, if the two rankings were unrelated, a concordance this strong would arise by chance with a probability of only 0.000055%.
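A short calculation shows where a p-value of this magnitude can come from. The article does not state how p was computed; the sketch below assumes an exact two-sided permutation test on 10 items without ties, in which only the identical and the fully reversed orderings are as extreme as |τ| = 1.

```python
import math

n_items = 10  # size of the Top-10 window

# Assumption: exact two-sided test without ties. Out of n! equally likely
# orderings under the null hypothesis, 2 reach |tau| = 1.
p_exact = 2 / math.factorial(n_items)
percent = 100 * p_exact

print(f"p = {p_exact:.2e} ({percent:.6f}%)")
```

Under this assumption the result agrees with the value quoted in the reading example, and it also shows that a perfect Top-10 concordance cannot yield a smaller p-value with this test, whatever the backend.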
Results and analysis. The results in Table 7 demonstrate near-perfect numerical consistency across all backends. The maximum differences in Si scores between pairs of backends consistently remain below 5 × 10⁻⁸, several orders of magnitude below any practical decision threshold, with even smaller average differences, on the order of 10⁻⁹ to 10⁻⁸. The overlap is 100% at all tested thresholds (Top-5, Top-10, Top-50, Top-100) for all six pairs, meaning that the resulting Top-K sets coincide regardless of the implementation used. Kendall’s coefficient τ and Spearman’s rank correlation coefficient ρ reach their maximum value of 1.0 in all cases, with highly significant p-values (p < 10⁻⁶ for τ@10, p < 10⁻¹⁵⁰ for τ@100 and ρ@100), ruling out any hypothesis of fortuitous agreement. These results establish that the low-level differences between CuPy, PyTorch, and TensorFlow (parallel reduction strategies, floating-point rounding behaviors) have no measurable impact on the final decision, thus validating the complete functional interchangeability of the three backends within the GPU-TOPSIS framework.
As shown in Figure 3, the two heatmaps visually summarize the inter-backend consistency. Panel (a) confirms that the maximum differences in Si scores are all between 3.7 × 10⁻⁸ and 4.5 × 10⁻⁸, with the diagonal naturally being zero (comparison of a backend with itself). Panel (b) displays a uniform and saturated green across all off-diagonal cells, indicating a 100% Overlap@10 without exception. Together, these two panels provide an immediate and unambiguous reading of the numerical robustness of GPU-TOPSIS: the four backends are functionally interchangeable.

4.5. EXP-3. Top-N Rankings Compared

Objective: To compare the Top-N rankings produced by the four implementations on the ALL matrix to confirm the decision invariance of GPU-TOPSIS at a large scale, and to provide the first recommended alternatives with their real attributes from the Amazon dataset.
Results and analysis: Table 8 presents the Top-10 products identified by GPU-TOPSIS on the Amazon ALL dataset (14.65 M alternatives, equal weights). All four backends produce identical rankings and scores that agree to within 10⁻⁶ (Overlap@10 = 100%, Max ΔSi < 10⁻⁶), confirming the decision invariance of the framework at this scale. The dominant alternative (B07TVHSDMQ, Si = 0.9986) stands markedly apart from the second-ranked product (Si = 0.378), reflecting an exceptional cumulative profile across all six criteria simultaneously, notably the highest review count in the dataset (314,691), a strong average rating (4.362), and a moderate price ($17.99). This score gap is characteristic of a structurally dominant alternative in the TOPSIS sense, i.e., one that is simultaneously close to the Positive Ideal Solution on all criteria. From rank 2 onward, scores decrease more gradually, indicating a competitive cluster of alternatives with similar multi-criteria profiles.

4.6. EXP-4. Stress-Test Sharding (~88 Million Alternatives)

Objective: To evaluate the computational scalability, memory robustness, and numerical stability of the 2-pass GPU-TOPSIS formulation beyond the joint limits of the host system’s VRAM and RAM, up to approximately 88 million alternatives.
Protocol: In the absence of additional publicly available MCDM data beyond the 14.65 million alternatives constituting the entirety of the available Amazon Products 2023 dataset, the experiment is conducted by replicating the real matrix into k = 6 fragments. Each replica (fragments 1 to 5) is subjected to centered Gaussian multiplicative noise of 2% (σ = 0.02) to simulate plausible inter-domain variability and avoid perfect duplicates. Fragment 0 consists of unaltered real data. The noise level is low enough to preserve the marginal distributions of each criterion: the Kolmogorov–Smirnov test between the original fragment and the noisy fragments consistently yields p > 0.05 (see Table 9, KS p-val column), so the hypothesis that the fragments share the same distribution is not rejected. To quantify the distributional similarity between the original and synthetic data more precisely, the Jensen–Shannon divergence (JSD) was calculated between the original fragment and each of the noisy fragments for all six criteria. All JSD values obtained remained below 0.001 (on a scale of [0, 1]), confirming a negligible distributional shift. Furthermore, the mean and standard deviation of each criterion in the noisy fragments did not deviate by more than 0.3% from those of the original fragment, thus validating the statistical representativeness of the generated synthetic data. In accordance with Property 1, the 2-pass formulation guarantees that the rankings produced for any k ≥ 1 are mathematically identical to those of the monolithic treatment.
With:
  • vram_baseline_mb: GPU memory used before the calculation starts (idle state). It corresponds to the CUDA context, loaded libraries, and already allocated tensors. This is the reference point.
  • vram_peak_abs_mb: Peak GPU memory in absolute value reached during execution. This is the total amount of VRAM used at its maximum, including all allocations (baseline + current calculation).
  • vram_peak_delta_mb: Relative increase in VRAM compared to the baseline, i.e., the memory actually consumed by the computation itself:
  • vram_peak_delta_mb = vram_peak_abs_mb − vram_baseline_mb
Results and analysis. Table 10 shows that execution time grows almost linearly in k for the three backends: PyTorch goes from 0.445 s (k = 1, 14.5 M alternatives) to 2.548 s (k = 6, 87.0 M alternatives), a factor of 5.73×, very close to the theoretical factor of 6×. CuPy and TensorFlow exhibit the same linearity, with respective factors of 6.12× and 6.13×, confirming the absence of any algorithmic degradation related to fragmentation. The VRAM consumption per individual fragment, obtained by dividing vram_peak_delta_mb by the number of fragments k, remains remarkably stable: approximately 1668 MB for PyTorch and CuPy, and approximately 2057 MB for TensorFlow, regardless of the total problem size. This result empirically validates the O(mt × n) memory footprint established by Corollary 1. The baseline VRAM also remains constant per backend (2720 MB for PyTorch/CuPy, 2618 MB for TensorFlow), reflecting only the fixed cost of the CUDA context. PyTorch stands out as the fastest backend in all configurations, with CuPy exhibiting an overhead of approximately 1.8×, and TensorFlow approximately 2.1×, the latter also showing a consistently higher memory consumption of ~22% per fragment. No numerical instability or interruption is observed, even at 87.0 million alternatives on a GPU with 16 GB of VRAM.

4.7. EXP-5. Sensitivity Analysis to Criteria Weights

Objective: To evaluate the robustness of the TOPSIS ranking in the face of uncertainties or changes in preference on the weights of the criteria, by simulating realistic scenarios of revision of decision priorities.
Methodology: Fifty alternative weighting scenarios are generated by controlled multiplicative perturbations (±5%, ±10%, ±15%, ±20%) on the reference weights wj = 1/6, followed by renormalization to maintain ∑wj = 1. For each scenario, the ranking is recalculated with PyTorch as the reference backend on the ALL matrix and compared to the reference ranking via Overlap@K for K ∈ {5, 10, 100} and Kendall τ for K ∈ {10, 100}.
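The perturbation and comparison steps can be sketched as follows. The article does not specify how the perturbation factors are drawn or how Overlap@K is defined; the sketch assumes factors sampled uniformly within ±amplitude and Overlap@K defined as the share of common items in the two Top-K sets. All function names are hypothetical.

```python
import numpy as np


def perturb_weights(w, amplitude, rng):
    """Multiply each weight by a factor in [1 - amplitude, 1 + amplitude],
    then renormalize so that the weights sum to 1 again."""
    factors = 1.0 + rng.uniform(-amplitude, amplitude, size=w.shape)
    w_new = w * factors
    return w_new / w_new.sum()


def overlap_at_k(ranking_a, ranking_b, k):
    """Percentage of items shared by the two Top-K sets (order is ignored)."""
    top_a = set(ranking_a[:k])
    top_b = set(ranking_b[:k])
    return 100.0 * len(top_a & top_b) / k


rng = np.random.default_rng(42)
w_ref = np.full(6, 1.0 / 6.0)  # reference weights w_j = 1/6
scenarios = {amp: perturb_weights(w_ref, amp, rng)
             for amp in (0.05, 0.10, 0.15, 0.20)}

for amp, w in scenarios.items():
    print(f"±{amp:.0%}: sum = {w.sum():.6f}, "
          f"min = {w.min():.4f}, max = {w.max():.4f}")
```

Because Overlap@K as defined here ignores order inside the Top-K window, it is complemented by Kendall τ, which is sensitive to rank inversions within that window.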
Results and analysis. The results reveal the absolute robustness of the TOPSIS ranking to perturbations in the weights of the criteria. For all four levels of perturbation tested (±5% to ±20%), the Overlap@5, Overlap@10, and Overlap@100 indicators uniformly reach 100.0 ± 0.0%, meaning that neither the top five, ten, nor one hundred alternatives undergo any change in composition or order, regardless of the preference reassessment scenario. The Kendall coefficients τ@10 and τ@100 are equal to 1.0 for all levels, confirming perfect rank concordance between the nominal ranking and the perturbed rankings. Only the Max ΔSi increases proportionally to the perturbation amplitude—from 0.00019 at ±5% to 0.00077 at ±20%—reflecting slight variations in absolute proximity scores without ever inducing a rank reversal. This result demonstrates that the GPU-TOPSIS ranking on the Amazon ALL dataset (14.65 M alternatives) is structurally stable across the entire decision range, even under substantial revisions of preferences, thus strengthening the system’s decision reliability at very large scales.
Figure 4 confirms the results of Table 10: regardless of the magnitude of the perturbation of the weights (±5% to ±20%), the Overlap@10 and the Overlap@100 remain fixed at 100%, and the Kendall τ at 1.0, graphically illustrating the total insensitivity of the GPU-TOPSIS ranking to preference revisions, both in the critical decision zone and over the entire Top-100.

4.8. EXP-6. Comparison of GPU-TOPSIS 1-Pass vs. 2-Pass Rankings

Objective: To formally establish the equivalence of the rankings produced by the GPU-TOPSIS 1-pass formulation (monolithic processing) and by the GPU-TOPSIS 2-pass formulation with sharding, over a spectrum of increasing sizes ranging from 14.65 million to 87.9 million alternatives, and to quantify the additional time cost associated with the mathematical correctness guaranteed by Property 1.
Methodology: Four configurations are evaluated (k ∈ {1, 2, 3, 6} fragments), constructed by noise-free replication (σ = 0) of the Amazon ALL matrix to ensure strict algebraic identity between the data processed by the two formulations. The 1-pass formulation applies topsis_pytorch to the complete concatenated matrix in a single GPU pass, while the 2-pass formulation applies topsis_2pass with global calculation of norms, PIS, and NIS in Pass 1 (CPU) and then calculation of distances and scores per fragment in Pass 2 (GPU). The metrics used are: Overlap@K for K ∈ {10, 50, 100, 500, 1000}, Kendall τ@K and Spearman ρ@K with p-values, Max ΔSi and Mean ΔSi on the raw scores, as well as the ratio of execution times t(2-pass)/t(1-pass). Each measurement is averaged over Nruns = 5 repetitions.
Results and analysis: Figure 5 empirically characterizes the finite-precision behavior announced in the Note of Property 1: while algebraic equivalence holds in exact arithmetic, float32 non-associative parallel reductions induce localized rank inversions as k increases, whose practical impact is quantified below.
At the global level, Spearman ρ@K = 1.0 for all K and all configurations, confirming perfect overall rank agreement. Locally, however, Kendall τ@10 degrades to ~−0.4 at k = 6 while τ@1000 remains near 1.0, and Overlap@10 falls to ~80% while Overlap@1000 stays at 100%. This asymmetry indicates that divergences are confined to near-tied alternatives at the very top of the rankings, precisely where float32 rounding is most consequential, and become imperceptible over larger windows. The Mean ΔSi ~ 10⁻⁴ across all configurations confirms that large Max ΔSi values are concentrated on a negligibly small subset of alternatives. On the computational side, the ratio t(2-pass)/t(1-pass) decreases from ~2.0 at k = 1 to below 1.0 at k = 6, as growing VRAM pressure increasingly penalizes the monolithic formulation.
A systematic analysis of the conditions under which these deviations occur deterministically was conducted. Three main factors were identified. First, the number of fragments k: ranking reversals grow approximately proportionally to k, as rounding errors accumulate with each independent reduction operation. Second, the density of scores in the vicinity of the Top-K boundary: reversals occur exclusively between alternatives whose proximity scores satisfy |Si − Sj| ≲ 10⁻⁴, i.e., on the order of the float32 rounding error at this scale. Third, the size of the fragments mt: larger fragments reduce the number of reduction operations and, consequently, the accumulation of rounding errors. Practically speaking, the following threshold can be formulated: for k ≤ 2 and K ≥ 50, no measurable reversal is observed. Conversely, for k ≥ 4 with K ≤ 10, the use of float64 precision or the monolithic 1-pass formulation is recommended when strict Top-K invariance is required.
These results establish a clear operational boundary: the 2-pass formulation is decision-equivalent to the 1-pass formulation for large evaluation windows, and computationally advantageous at very large scales. For applications requiring strict Top-K invariance at high k, the 1-pass formulation is recommended when VRAM permits; otherwise, float64 precision eliminates the observed inversions at the cost of a 2× memory overhead.
Consequently, the 2-pass formulation is recommended for applications where the priority is large-scale global ranking and GPU resource management, while special attention should be paid to the correctness of restricted Top-K rankings when k is high and decisions are based exclusively on the first alternatives.

4.9. EXP-7. Scalability Beyond 88 Million Alternatives (100 M, 150 M, 200 M)

Objective: To evaluate the execution times, VRAM consumption, and processing throughput of the 2-pass GPU-TOPSIS formulation for matrix sizes exceeding the limits of the available real dataset, namely 100, 150, and 200 million alternatives, to project the scalability of the framework to orders of magnitude not yet covered experimentally in the MCDM literature.
Protocol: In the absence of publicly available MCDM data at these scales, synthetic matrices are generated by Gaussian sampling calibrated to the empirical statistics of the Amazon ALL matrix (μj, σj per criterion), strictly adhering to business bounds after truncation. This approach, already validated in EXP-4 for intermediate sizes, ensures that the synthetic distributions are statistically representative of the real data. Adaptive sharding is applied, maintaining the size of each fragment mt ≤ 14.65 M (T4 VRAM constraint of 16 GB): k = 7 fragments for 100 M, k = 11 for 150 M, and k = 14 for 200 M. Each measurement is averaged over Nruns = 3 repetitions, preceded by an unmeasured warmup.
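The fragment counts quoted in the protocol follow from a ceiling division by the maximum fragment size. A minimal sketch, assuming the bound mt ≤ 14.65 M is taken as 14,650,000 rows (the helper name is hypothetical):

```python
import math

MT_MAX = 14_650_000  # assumed maximum fragment size (rows), from mt <= 14.65 M


def num_fragments(m: int, mt_max: int = MT_MAX) -> int:
    """Smallest k such that k fragments of at most mt_max rows cover m rows."""
    return math.ceil(m / mt_max)


plan = {m: num_fragments(m) for m in (100_000_000, 150_000_000, 200_000_000)}
print(plan)
```

The same rule can be applied to any target size to choose the smallest number of fragments, and therefore the smallest repeated overhead, compatible with the VRAM budget.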
Results and analysis. Table 11 confirms the near-linear scalability of 2-pass GPU-TOPSIS beyond the 88 million alternatives threshold. PyTorch maintains the best processing time for all three targets: 28.96 s for 100 M, 44.07 s for 150 M, and 60.17 s for 200 M, representing ratios of 1.52× and 2.08× relative to the 100 M case, consistent with the expected theoretical progression in O(k × mt × n). CuPy and TensorFlow follow the same linear trend, with slightly higher times (30.48 s/47.42 s/64.37 s and 32.14 s/49.80 s/70.24 s, respectively). The processing throughput remains remarkably stable between 100 M and 200 M: PyTorch drops from 3.45 to 3.32 M/s, and CuPy from 3.28 to 3.11 M/s, demonstrating a near-total absence of algorithmic degradation. TensorFlow shows a slightly more pronounced decrease (3.11 → 2.85 M/s), without compromising overall scalability. The very low standard deviations (≤0.72 s) confirm the stability and reproducibility of the executions. These results establish that GPU-TOPSIS is the first operational MCDM framework at the scale of the hundreds of millions of alternatives on commodity GPU hardware.
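The throughput figures and time ratios quoted above can be recomputed directly from the reported execution times. The sketch below uses only the times stated in the text; the dictionary layout is an illustrative choice.

```python
# Reported execution times in seconds, keyed by problem size in millions
# of alternatives (values as quoted in the text for Table 11).
reported_s = {
    "PyTorch":    {100: 28.96, 150: 44.07, 200: 60.17},
    "CuPy":       {100: 30.48, 150: 47.42, 200: 64.37},
    "TensorFlow": {100: 32.14, 150: 49.80, 200: 70.24},
}

# Throughput in millions of alternatives per second.
throughput = {
    backend: {m: m / t for m, t in times.items()}
    for backend, times in reported_s.items()
}

# Time ratios relative to the 100 M case (ideal linear scaling: 1.5 and 2.0).
ratios = {
    backend: {m: times[m] / times[100] for m in (150, 200)}
    for backend, times in reported_s.items()
}

for backend in reported_s:
    print(backend,
          {m: round(v, 2) for m, v in throughput[backend].items()},
          {m: round(v, 2) for m, v in ratios[backend].items()})
```

Comparing the ratios with the ideal values 1.5 and 2.0 gives a direct measure of how far each backend departs from linear scaling between 100 M and 200 M alternatives.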
Figure 6 illustrates the behavior of GPU-TOPSIS 2-pass processing at very large scales (100 M to 200 M alternatives) on calibrated synthetic data. Execution time increases almost linearly for all three backends, with PyTorch remaining the fastest (≈29 s at 100 M, ≈60 s at 200 M), followed by CuPy and TensorFlow within a narrow margin. Processing throughput decreases slightly with size, from ≈3.45 M alt/s at 100 M to ≈3.3 M alt/s at 200 M for PyTorch, reflecting the increasing memory pressure visible in the VRAM curve, which rises to ≈14 MB for PyTorch at 200 M. TensorFlow exhibits the lowest throughput and VRAM usage, suggesting a different trade-off between parallelism and memory management. These results confirm the practical scalability of the 2-pass formulation with adaptive sharding, capable of processing 200 M alternatives in about one minute on a 16 GB GPU.
In conclusion, 2-pass GPU-TOPSIS demonstrates robust and predictable scalability up to 200 M alternatives, thus taking a decisive step towards near-real-time massive data processing, and positioning PyTorch as the reference backend for very large-scale deployments.
Since no public MCDM dataset currently achieves the scale of hundreds of millions of alternatives, synthetic matrices calibrated to the statistical properties of the Amazon dataset were generated. This approach allows for the evaluation of computational scalability without introducing unrealistic data distributions.

4.10. Summary of Experimental Results

The eight experiments together lead to five main conclusions.
Scalability: GPU-TOPSIS can process matrices containing up to 200 million alternatives in about one minute on a single 16 GB GPU (Tesla T4), making a class of previously inaccessible decision problems feasible in near-real time. Near-linear growth of execution time with problem size is observed across the entire tested range, from 50,000 to 200 million alternatives.
Mathematical correctness: The 2-pass formulation, as defined by Property 1, guarantees exact equivalence with monolithic TOPSIS for any partitioning scheme, while reducing the memory footprint to O(mt × n). This result, demonstrated analytically in Section 3 and experimentally examined in EXP-6, is reflected in an Overlap@1000 = 100% and a Spearman ρ = 1.0 for all tested values of k. The observed discrepancies remain localized to strict Top-K rankings for high values of k, attributable to the non-associativity of parallel reductions in float32, as stated in the Note to Property 1. This precise characterization distinguishes GPU-TOPSIS from naive sharding approaches that introduce uncontrolled local biases and provides practitioners with explicitly quantified criteria for choosing between the two formulations.
Decision robustness: Sensitivity analyses and Monte Carlo simulations confirm that the ranking structure is stable in the face of realistic perturbations of the criteria weights (Overlap@5 > 95% for ±10%) and observational noise (Overlap@10 > 90% for ±5% noise), with Kendall τ metrics attesting to structural robustness in the critical decision area.
Unprecedented scalability: EXP-7 establishes that GPU-TOPSIS is operational with up to 200 million alternatives on a Tesla T4 GPU with 16 GB of VRAM, with adaptive sharding and processing times on the order of one minute. This performance, unprecedented in the MCDM literature, paves the way for decision-making applications at the scale of large e-commerce platforms, industrial catalogs, or aggregated medical databases.
Reproducibility and genericity: The availability of three independent implementations on CuPy, PyTorch, and TensorFlow, combined with the publication of the source code on Zenodo (https://doi.org/10.5281/zenodo.18911332) and the use of the public Amazon Products 2023 dataset [18], makes GPU-TOPSIS a reference framework that is directly reproducible and extensible by the MCDM/Big Data community.

5. Discussion

5.1. Scalability and Performance of Backends

Experimental evaluation confirms that data volume is the determining factor in the computational cost of the TOPSIS method. CPU implementations remain suitable for modestly sized datasets, but their execution time increases rapidly with the number of alternatives, making them impractical for large-scale decision problems. GPU implementations, on the other hand, allow for the efficient processing of decision matrices containing several million alternatives. All experiments conducted on real Amazon data demonstrate that GPU acceleration significantly reduces execution times while preserving numerical stability and ranking consistency, with speedups that grow with data volume up to several million alternatives before memory bandwidth becomes limiting.
The differences observed between the GPU backends reflect their respective execution models and memory management strategies. PyTorch generally achieves the lowest execution times thanks to its dynamic and lightweight CUDA pipeline, followed by CuPy, whose NumPy-compatible API minimizes porting overhead. TensorFlow exhibits higher latency, attributable to XLA compilation and the construction of static graphs, but maintains full scalability. It is worth noting that all three backends remain reliable and scalable across all evaluated configurations, thus establishing the independence of the proposed framework from any particular GPU software ecosystem.
The decline in PyTorch’s speedup factor, from 4.75× at 3.38 million alternatives to 1.80× at 14.65 million, requires a nuanced interpretation. It is explained by three cumulative bottlenecks inherent to the TOPSIS workload: (i) the overhead associated with CPU-to-GPU data transfers, with PCIe 3.0 bandwidth (≈12 GB/s) limiting the throughput for large float32 arrays; (ii) the sequential nature of global reductions (norm calculations, PIS/NIS), which cannot be fully parallelized across the entire array; and (iii) the saturation of VRAM bandwidth (320 GB/s on the Tesla T4) beyond 10 million alternatives. The reported 4.75× maximum speedup is lower than the 10×–100× observed in compute-bound HPC workloads because the TOPSIS pipeline is fundamentally memory-bandwidth-bound: its arithmetic intensity (FLOPs/byte) is low, so GPU cores spend a disproportionate share of their time waiting for memory rather than computing.
It should be noted that all experiments were conducted on Google Colab Pro with a Tesla T4 GPU, a shared, non-dedicated environment in which dynamic resource allocation and inter-user concurrency introduce additional variability; the reported speedups therefore represent conservative estimates.
GPU architecture further influences relative backend performance: PyTorch’s dynamic execution model minimizes launch overhead; CuPy benefits from optimized cuBLAS routines via its NumPy-compatible API; TensorFlow’s XLA compiler introduces higher startup latency but enables more aggressive operation fusion at large workloads. Future work on kernel fusion, float16 quantization, NVLink multi-GPU configurations, and profiling with NVIDIA Nsight Compute (to characterize coalesced memory access patterns, kernel occupancy, and arithmetic intensity) should enable speedups exceeding 10× for this class of workloads.
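As a rough back-of-envelope illustration of the memory-bound argument, the following sketch estimates how long it takes merely to move the Amazon ALL matrix (14,652,525 × 6, float32) once. The bandwidths are the nominal figures quoted in the text, not measurements; the helper name `transfer_seconds` is hypothetical, and real transfers and kernels typically reach only a fraction of peak bandwidth.

```python
# Back-of-envelope estimate of data-movement time for one decision matrix.
# Bandwidths are the nominal figures quoted in the text, not measurements.

def transfer_seconds(m: int, n: int, bytes_per_elem: int, gb_per_s: float) -> float:
    """Time to move an m x n matrix once over a link of the given bandwidth."""
    return m * n * bytes_per_elem / (gb_per_s * 1e9)

m, n = 14_652_525, 6           # Amazon ALL: alternatives x criteria
matrix_bytes = m * n * 4       # float32 = 4 bytes per element

pcie_s = transfer_seconds(m, n, 4, 12.0)    # host-to-device copy, PCIe 3.0
vram_s = transfer_seconds(m, n, 4, 320.0)   # one full read from T4 VRAM

print(f"matrix size : {matrix_bytes / 1e6:.1f} MB")   # 351.7 MB
print(f"PCIe copy   : {pcie_s * 1e3:.1f} ms")         # 29.3 ms
print(f"VRAM pass   : {vram_s * 1e3:.2f} ms")         # 1.10 ms
```

Under these nominal figures, a single host-to-device copy costs about as much as 27 full passes over the matrix in VRAM (320/12 ≈ 26.7), which illustrates why data movement rather than arithmetic tends to dominate this kind of pipeline.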

5.2. Decision-Making Robustness and Ranking Stability

Sensitivity analysis demonstrates that the TOPSIS ranking structure remains stable in the face of realistic perturbations to the criterion weights: for perturbations of up to ±15%, no changes are observed in the Top 5, Top 10, or Top 100 (Table 10). This confirms that the accelerated framework preserves decision robustness under uncertainty about preferences. Fragmentation experiments also establish that memory constraints can be effectively circumvented without compromising the mathematical correctness of the ranking.
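A minimal sketch of this kind of weight-sensitivity check is given below. It is not the authors’ EXP-5 code: the perturbation scheme (uniform multiplicative noise followed by renormalization), the equal base weights, the random test data, and the function names `topsis_scores` and `overlap_at_k` are all assumptions made for illustration; only the benefit/cost directions follow Table 5.

```python
import numpy as np

def topsis_scores(X, w, benefit, eps=1e-12):
    """Classical TOPSIS closeness scores for an (m, n) decision matrix."""
    V = X / np.sqrt((X ** 2).sum(axis=0)) * w              # normalize, weight
    pis = np.where(benefit, V.max(axis=0), V.min(axis=0))  # ideal point
    nis = np.where(benefit, V.min(axis=0), V.max(axis=0))  # anti-ideal point
    d_pos = np.sqrt(((V - pis) ** 2).sum(axis=1))
    d_neg = np.sqrt(((V - nis) ** 2).sum(axis=1))
    return d_neg / (d_pos + d_neg + eps)

def overlap_at_k(s_ref, s_new, k):
    """Fraction of the reference Top-K that is still in the new Top-K."""
    top_ref = set(np.argsort(-s_ref)[:k])
    top_new = set(np.argsort(-s_new)[:k])
    return len(top_ref & top_new) / k

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(10_000, 6))     # synthetic stand-in data
w = np.full(6, 1 / 6)                           # assumed equal base weights
benefit = np.array([True, True, False, True, True, False])  # as in Table 5
base = topsis_scores(X, w, benefit)

results = {}
for delta in (0.05, 0.10, 0.15):
    overlaps = []
    for _ in range(50):                         # 50 scenarios per level
        w_p = w * (1 + rng.uniform(-delta, delta, size=w.size))
        w_p /= w_p.sum()                        # keep weights summing to 1
        overlaps.append(overlap_at_k(base, topsis_scores(X, w_p, benefit), 10))
    results[delta] = float(np.mean(overlaps))
    print(f"±{delta:.0%}: mean Overlap@10 = {results[delta]:.2f}")
```

On uniformly random data such as this, the overlap will generally be lower than the 100% reported on Amazon ALL, where the top scores are widely separated (Table 8).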

5.3. Algebraic and Computational Equivalence

Property 1 and EXP-6, considered together, provide a complete picture of the equivalence between the 1-pass and 2-pass formulations: the two are identical in exact arithmetic and agree in practice over wide evaluation windows (Overlap@1000 = 100%, Spearman ρ = 1.0 for all tested numbers of fragments k), with localized divergences in strict Top-K rankings at high k, attributable to the non-associativity of parallel reductions in float32. This distinction between algebraic and computational equivalence is in itself a contribution, providing practitioners with explicit and quantified criteria for choosing between the two formulations.
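The distinction can be illustrated with a small NumPy sketch. It shows one plausible reading of a two-pass scheme (pass 1 accumulates global column norms and extrema, pass 2 scores each fragment against the global PIS/NIS) and is not taken from the authors’ Zenodo code; the function names, random data, weights, and the choice of 7 fragments are assumptions. The last lines show, independently, why float32 reductions depend on evaluation order.

```python
import numpy as np

def topsis_monolithic(X, w, benefit, eps=1e-12):
    """Reference TOPSIS on the full (m, n) matrix held in memory."""
    V = X / np.sqrt((X ** 2).sum(axis=0)) * w
    pis = np.where(benefit, V.max(axis=0), V.min(axis=0))
    nis = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.sqrt(((V - pis) ** 2).sum(axis=1))
    d_neg = np.sqrt(((V - nis) ** 2).sum(axis=1))
    return d_neg / (d_pos + d_neg + eps)

def topsis_two_pass(fragments, w, benefit, eps=1e-12):
    """Two passes over row fragments; only one fragment is needed at a time."""
    # Pass 1: accumulate global column statistics fragment by fragment.
    sq_sum = np.zeros(w.size)
    col_max = np.full(w.size, -np.inf)
    col_min = np.full(w.size, np.inf)
    for F in fragments:
        sq_sum += (F ** 2).sum(axis=0)
        col_max = np.maximum(col_max, F.max(axis=0))
        col_min = np.minimum(col_min, F.min(axis=0))
    scale = w / np.sqrt(sq_sum)      # positive, so max/min commute with it
    v_max, v_min = col_max * scale, col_min * scale
    pis = np.where(benefit, v_max, v_min)
    nis = np.where(benefit, v_min, v_max)
    # Pass 2: score each fragment against the global PIS/NIS.
    scores = []
    for F in fragments:
        V = F * scale
        d_pos = np.sqrt(((V - pis) ** 2).sum(axis=1))
        d_neg = np.sqrt(((V - nis) ** 2).sum(axis=1))
        scores.append(d_neg / (d_pos + d_neg + eps))
    return np.concatenate(scores)

rng = np.random.default_rng(42)
X = rng.uniform(0.1, 5.0, size=(100_000, 6))
w = np.array([0.25, 0.20, 0.20, 0.15, 0.10, 0.10])   # assumed weights
benefit = np.array([True, True, False, True, True, False])

s_mono = topsis_monolithic(X, w, benefit)
s_two = topsis_two_pass(np.array_split(X, 7), w, benefit)
print("max |S_mono - S_two| (float64):", np.abs(s_mono - s_two).max())

# Non-associativity in float32: the same three terms, two groupings.
x, y, z = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
left = (x + y) + z     # 1.0
right = x + (y + z)    # 0.0, because -1e8 + 1 rounds back to -1e8
print("float32 (x + y) + z =", left, "| x + (y + z) =", right)
```

In float64 the two formulations agree to within rounding error; the float32 example shows how a different summation order alone can change a result, which is the mechanism invoked for the localized Top-K divergences.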

5.4. Limitations

Several limitations are worth noting. First, all reported times are arithmetic means over five independent runs, with accompanying 95% confidence intervals, which mitigates the residual variance introduced by the dynamic allocation of GPU resources on Google Colab Pro. The shared, non-dedicated nature of this environment is an inherent limitation; future work will replicate the experiments on dedicated hardware infrastructure to strengthen the statistical validity of the reported speedup factors.
Second, EXP-4 and EXP-7 rely on the artificial replication of the Amazon ALL matrix to reach very high volumes. This allows computational scalability to be assessed, but not decision diversity at very large scale: since the ranked alternatives are structurally similar from one fragment to another, conclusions regarding ranking stability at 200 million alternatives must be interpreted in this context.
Third, while GPU memory consumption is recorded in EXP-4, it was not systematically measured across all experiments; such data would nevertheless be valuable for practitioners wishing to size their infrastructure.
Fourth, no additional GPU controls (clock-frequency monitoring, thermal-throttling detection) were applied in the shared Colab environment; experiments on dedicated infrastructure with nvidia-smi monitoring are planned as future work.
Fifth, float64 accuracy was not systematically evaluated on fragmented data in this submission. In theory, switching to float64 would eliminate the local ranking inversions observed for high values of k (Section 4.8), as the machine epsilon drops from ~1.2 × 10⁻⁷ (float32) to ~2.2 × 10⁻¹⁶ (float64), about nine orders of magnitude smaller and roughly twelve orders of magnitude below the observed inversion threshold (~10⁻⁴). However, this gain in precision comes at a significant memory and computational cost: with each element growing from 4 to 8 bytes, the memory footprint exactly doubles. This halves the maximum fragment size for a given VRAM budget, doubles the number of fragments k required to process the same volume of data, and reduces GPU speedup by approximately 30 to 50%, owing to the lower memory-bandwidth efficiency and the far smaller number of float64 compute units on mainstream GPU architectures. For the vast majority of practical use cases (either k ≤ 2 fragments or Top-K thresholds with K ≥ 50), no measurable inversion is observed with float32, making float64 unnecessary in these configurations. A systematic experimental comparison of float64 and float32 on fragmented data is planned as priority future work.
Sixth, the framework has been validated with n = 6 criteria. For applications with hundreds of criteria (e.g., genomic or financial analytics), (a) the intermediate tensor footprint O(m_t × n) can exceed VRAM even for a single fragment when n is large, and (b) row-wise fragmentation alone does not resolve VRAM overflow if m_t × n remains too large. A column-wise (criteria-wise) blocking strategy would then be required, at the cost of additional data passes; guidelines for wide matrices are left as future work.
Finally, the sensitivity analysis conducted in EXP-5 focuses on the Top 100; extending the evaluation of Kendall’s τ coefficient to larger subsets would strengthen the conclusions regarding the overall stability of the ranking beyond the critical decision zone.
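The float32/float64 trade-off described in the fifth limitation can be made concrete with a small sizing sketch. The 2 GiB budget for fragment tensors, the `live_copies` factor (number of fragment-sized tensors alive at once), and the function name `max_fragment_rows` are illustrative assumptions, not values from the experiments.

```python
import math
import numpy as np

def max_fragment_rows(vram_budget_bytes, n_criteria, dtype, live_copies=4):
    """Largest fragment (in rows) that fits a given VRAM budget.

    live_copies is an assumed count of fragment-sized tensors held at once
    (input, weighted matrix, temporaries); it is not a figure from the paper.
    """
    bytes_per_row = n_criteria * np.dtype(dtype).itemsize * live_copies
    return vram_budget_bytes // bytes_per_row

m, n = 200_000_000, 6          # largest scale reported, six criteria
budget = 2 * 1024**3           # assumed budget for fragment tensors: 2 GiB

results = {}
for dtype in (np.float32, np.float64):
    name = np.dtype(dtype).name
    rows = max_fragment_rows(budget, n, dtype)
    k = math.ceil(m / rows)    # number of fragments needed
    results[name] = (rows, k)
    print(f"{name}: eps = {float(np.finfo(dtype).eps):.1e}, "
          f"max rows per fragment = {rows:,}, fragments k = {k}")
```

With these assumed settings the fragment size halves and the fragment count doubles (9 to 18) when moving from float32 to float64, while the machine epsilon drops from about 1.2e-07 to about 2.2e-16.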
These limitations do not diminish the scope of the contribution. By completely reformulating TOPSIS as a GPU tensor pipeline while strictly preserving its original mathematical definition—the only modification introduced being the numerical stabilization constant ε—GPU-TOPSIS makes large-scale multi-criteria analysis accessible on standard GPU hardware, including via cloud platforms such as Google Colab, thus paving the way for an effective democratization of MCDM methods in Big Data environments.

5.5. Practical Implications

Beyond computational performance, GPU-TOPSIS opens up concrete prospects for large-scale, multi-criteria decision-making in a variety of application contexts. In the field of e-commerce and recommendation systems, the proposed framework enables the near real-time ranking of hundreds of millions of products according to multi-criteria profiles, as demonstrated by the evaluation conducted on the Amazon Products 2023 dataset. In the field of supply chain management and industrial procurement, GPU-TOPSIS can support the dynamic selection of suppliers from continuously updated catalogs, where re-ranking must be performed at scale within tight operational deadlines.
A particularly relevant application area concerns multi-agent task allocation and robotic systems. In this context, an autonomous agent must rank and select actions or resources in real time according to multiple, potentially conflicting objectives. GPU-TOPSIS’s ability to rank millions of alternatives in seconds on a consumer-grade GPU makes it directly usable as a decision layer in robotic planning architectures, where MCDM methods have been shown to improve task scheduling and resource allocation efficiency. Similarly, in the field of the Underwater Internet of Things (IoT) [42], GPU-TOPSIS’s ability to handle very large decision matrices makes it well-suited to environments where the number of candidate nodes or configurations can be high and where evaluation must be performed rapidly.
More generally, any domain requiring the automated ranking of a large number of alternatives according to multiple criteria—including the allocation of health resources, the filtering of financial portfolios, or environmental monitoring—can directly rely on GPU-TOPSIS as a computational foundation, regardless of the weighting strategy adopted.

5.6. Threats to Validity

Several factors could affect the generalizability of the results presented. First, the experiments were conducted in a shared cloud environment (Google Colab Pro), which can lead to slight fluctuations in execution times due to variable resource allocation. Furthermore, the GPU hardware available in this type of environment is not exclusively dedicated to a single user, and its performance can be affected by background system activity. The execution times reported in this study should therefore be interpreted as conservative estimates. Experiments conducted on a dedicated hardware infrastructure, with optimized configurations and exclusive access to GPU resources, could thus lead to execution performance exceeding that observed in the Colab environment.
Secondly, scaling experiments beyond the initial dataset rely on statistically calibrated synthetic data, rather than entirely independent real-world datasets. While this approach allows for the evaluation of computational scalability at very large scales, it may not perfectly reflect the diversity and complexity of decision-making scenarios encountered in real-world contexts.
Finally, the proposed evaluation focuses exclusively on the TOPSIS method and does not directly examine the performance of other multi-criteria decision support techniques in the same GPU execution model. Future work will aim to determine the extent to which the proposed GPU-based tensor computing paradigm generalizes to other MCDM methods such as VIKOR, PROMETHEE, and ELECTRE.

6. Conclusions

This work introduced GPU-TOPSIS, a GPU-accelerated implementation of the TOPSIS method, designed to enable large-scale multi-criteria decision-making under realistic data and hardware constraints. Leveraging modern GPU computing frameworks—CuPy, PyTorch, and TensorFlow—the proposed approach overcomes the scalability limitations of classical CPU-based TOPSIS implementations while rigorously preserving mathematical fidelity to the original method.
This work represents a coherent progression from our previous contributions on parallel [14] and distributed [15] MCDM, taking a qualitative leap towards massive single-memory GPU parallelism. Experimental evaluations conducted on real data from the Amazon Products 2023 dataset demonstrate that GPU-TOPSIS enables the efficient processing of decision matrices containing millions of alternatives, with speedups of up to 4.75× compared to the CPU reference, while preserving ranking consistency and numerical stability. The integration of a fragment processing strategy extends scalability beyond the limits of GPU memory, allowing safe execution with extreme data volumes.
Beyond computational performance, the proposed framework supports large-scale robustness and sensitivity analyses, ensuring that acceleration does not compromise decision reliability. GPU-TOPSIS thus provides a practical and reliable foundation for large-scale decision support applications.
Future work will focus on:
(1) extending the framework to multi-GPU and distributed environments;
(2) improving memory management strategies for streaming data;
(3) generalizing GPU acceleration to other MCDM methods such as AHP, VIKOR, PROMETHEE, and ELECTRE;
(4) replicating the experiments on dedicated hardware infrastructure to eliminate the residual variance introduced by the dynamic GPU allocation of shared cloud environments;
(5) systematically measuring VRAM consumption per experiment;
(6) extending the Kendall’s τ sensitivity analysis to larger subsets of alternatives;
(7) extending the framework to Fuzzy TOPSIS variants (triangular and trapezoidal fuzzy numbers) for large-scale management of linguistic uncertainty: fuzzy numbers can be represented as additional tensors, and TOPSIS operations on fuzzy numbers (fuzzy distance, fuzzy score) can be fully vectorized and executed on the GPU, paving the way for massive processing of imprecise or uncertain data;
(8) adapting the framework to dynamic decision matrices and streaming data: the 2-pass fragmentation strategy of GPU-TOPSIS is naturally suited to successive batches of a data stream, allowing incremental ranking updates without reloading the complete matrix into memory;
(9) systematically analyzing the energy efficiency and cost-performance trade-offs of GPU-TOPSIS, using metrics such as the number of alternatives processed per joule (alternatives/J) and TFLOP/W efficiency, across different classes of GPU hardware (consumer, datacenter, cloud), in order to quantify the actual energy benefit of GPU acceleration over the reference CPU implementation.
Finally, this work demonstrates that classical multi-criteria decision support methods, such as TOPSIS, can be effectively reformulated to leverage the tensor architectures of modern GPUs. By combining vectorized computing with a scalable fragmentation strategy, the proposed GPU-TOPSIS framework enables the processing of decision matrices containing up to several hundred million alternatives. These results open new perspectives for the application of multi-criteria decision support techniques to large-scale decision problems in fields such as recommendation systems, large-scale product evaluation, and data-driven decision support.

Author Contributions

Conceptualization, L.B. and M.C.A.; Methodology, L.B. and M.C.A.; Software, L.B. and M.C.A.; Validation, L.B., H.A. and M.C.A.; Formal analysis, L.B., H.A. and M.C.A.; Investigation, L.B. and M.C.A.; Resources, L.B. and M.C.A.; Data curation, M.C.A.; Writing—original draft, L.B.; Writing—review & editing, L.B., H.A. and M.C.A.; Visualization, L.B., H.A., M.C.A. and L.L.; Supervision, M.C.A. and L.L.; Project administration, M.C.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code for the three implementations (CuPy, PyTorch, and TensorFlow), with decision matrices on the Amazon Products 2023 dataset, is publicly available on Zenodo at: https://doi.org/10.5281/zenodo.18911332.

Acknowledgments

All authors express their sincere gratitude to all members of the Laboratory of Intelligent Systems and Applications for the friendly atmosphere and their contribution to the continued growth of the laboratory.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Roy, B. Decision-aid and decision-making. Eur. J. Oper. Res. 1990, 45, 324–331.
  2. Ho, W.R.J.; Tsai, C.L.; Tzeng, G.H.; Fang, S.K. Combined DEMATEL technique with a novel MCDM model for exploring portfolio selection based on CAPM. Expert Syst. Appl. 2011, 38, 16–25.
  3. Diaby, V.; Campbell, K.; Goeree, R. Multi-criteria decision analysis (MCDA) in health care: A bibliometric analysis. Oper. Res. Health Care 2013, 2, 20–24.
  4. Chakraborty, S.; Chakraborty, S. A Scoping Review on the Applications of MCDM Techniques for Parametric Optimization of Machining Processes. Arch. Comput. Methods Eng. 2022, 29, 4165–4186.
  5. Govindan, K.; Mina, H.; Esmaeili, A.; Gholami-Zanjani, S.M. An Integrated Hybrid Approach for Circular Supplier Selection and Closed loop Supply Chain Network Design under Uncertainty. J. Clean. Prod. 2020, 242, 118317.
  6. Gani, A.; Asjad, M.; Talib, F. Prioritization and Ranking of indicators of sustainable manufacturing in Indian MSMEs using fuzzy AHP approach. Mater. Today Proc. 2021, 46, 6631–6637.
  7. Behzadian, M.; Otaghsara, S.K.; Yazdani, M.; Ignatius, J. A state-of-the-art survey of TOPSIS applications. Expert Syst. Appl. 2012, 39, 13051–13069.
  8. Yadav, V.; Karmakar, S.; Kalbar, P.P.; Dikshit, A.K. PyTOPS: A Python-based tool for TOPSIS. SoftwareX 2019, 9, 217–222.
  9. Jablonsky, J. MS Excel-based Software Support Tools for Decision Problems with Multiple Criteria. Procedia Econ. Financ. 2014, 12, 251–258.
  10. Schmidt, B.; González-Domínguez, J.; Hundt, C.; Schlarb, M. Compute Unified Device Architecture. In Parallel Programming; Elsevier: Amsterdam, The Netherlands, 2018; pp. 225–285.
  11. CuPy Development Team. CuPy—NumPy & SciPy for GPU—CuPy 13.4.0 Documentation. Available online: https://docs.cupy.dev/en/stable/ (accessed on 10 March 2025).
  12. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037.
  13. Sim, Y.; Shin, W.; Lee, S. Automated code transformation for distributed training of TensorFlow deep learning models. Sci. Comput. Program. 2025, 242, 103260.
  14. Lamrini, L.; Abounaima, M.C.; El Mazouri, F.Z.; Ouzarf, M.; Alaoui, M.T. MCDM Filter with Pareto Parallel Implementation in Shared Memory Environment. Stat. Optim. Inf. Comput. 2022, 10, 192–203.
  15. Lamrini, L.; Abounaima, M.C.; Alaoui, M.T. New distributed-TOPSIS approach for multi-criteria decision-making problems in a big data context. J. Big Data 2023, 10, 97.
  16. Lakshmi, M.S.; Karpagam, D.R.; Swetha, N.G. GPU Accelerated TOPSIS Algorithm for QoS Aware Web Service Selection. Int. J. Recent Technol. Eng. IJRTE 2020, 8, 3159–3163.
  17. Van Der Walt, S.; Colbert, S.C.; Varoquaux, G. The NumPy array: A structure for efficient numerical computation. Comput. Sci. Eng. 2011, 13, 22–30.
  18. Hou, Y.; Li, J.; He, Z.; Yan, A.; Chen, X.; McAuley, J. Bridging language and items for retrieval and recommendation. arXiv 2024, arXiv:2403.03952.
  19. Oussous, A.; Benjelloun, F.Z.; Lahcen, A.A.; Belfkih, S. Big Data technologies: A survey. J. King Saud Univ.—Comput. Inf. Sci. 2018, 30, 431–448.
  20. Zheng, Z.; Wang, P.; Liu, J.; Sun, S. Real-Time Big Data Processing Framework: Challenges and Solutions. Appl. Math. Inf. Sci. 2015, 9, 3169–3190.
  21. Kekevi, U.; Aydin, A.A. Real-Time Big Data Processing and Analytics: Concepts, Technologies, and Domains. Comput. Sci. 2022, 7, 111–123.
  22. Abdalla, H.B. A brief survey on big data: Technologies, terminologies and data-intensive applications. J. Big Data 2022, 9, 107.
  23. Owens, J.D.; Houston, M.; Luebke, D.; Green, S.; Stone, J.E.; Phillips, J.C. GPU computing. Proc. IEEE 2008, 96, 879–899.
  24. Jeon, H. GPU Architecture. In Handbook of Computer Architecture; Springer: Singapore, 2023; pp. 1–29.
  25. Mantas, J.M.; De la Asunción, M.; Castro, M.J. An Introduction to GPU Computing for Numerical Simulation. In Numerical Simulation in Physics and Engineering; SEMA SIMAI Springer Series; Springer: Cham, Switzerland, 2016; Volume 9, pp. 219–251.
  26. Jha, R.G.; Samlodia, A. GPU-acceleration of tensor renormalization with PyTorch using CUDA. Comput. Phys. Commun. 2024, 294, 108941.
  27. PyTorch Development Team. PyTorch Documentation—PyTorch 2.6 Documentation. Available online: https://pytorch.org/docs/stable/index.html (accessed on 6 April 2025).
  28. TensorFlow Development Team. API Documentation|TensorFlow v2.16.1. Available online: https://www.tensorflow.org/api_docs (accessed on 6 April 2025).
  29. Das, A.K.; Gupta, N.; Mahmood, T.; Tripathy, B.C.; Das, R.; Das, S. An innovative fuzzy multi-criteria decision making model for analyzing anthropogenic influences on urban river water quality. Iran J. Comput. Sci. 2024, 8, 103–124.
  30. Karadayi-Usta, S.; Tirkolaee, E.B. Evaluating the Sustainability of Fashion Brands Using a Neutrosophical ORESTE Approach. Sustainability 2023, 15, 1440.
  31. Das, A.K.; Granados, C. FP-intuitionistic multi fuzzy N-soft set and its induced FP-Hesitant N soft set in decision-making. Decis. Mak. Appl. Manag. Eng. 2022, 5, 67–89.
  32. Erlacher, C.; Salap-Ayca, S.; Jankowski, P.; Anders, K.H.; Paulus, G. A GPU-based solution for accelerating spatially-explicit uncertainty- and sensitivity analysis in multi-criteria decision making. In Proceedings of the Spatial Accuracy 2016, Montpellier, France, 5–8 July 2016; pp. 305–312.
  33. Kumar, M.; Kaur, G.; Rana, P.S. Robust evaluation of GPU compute instances for HPC and AI in the cloud: A TOPSIS approach with sensitivity, bootstrapping, and non-parametric analysis. Computing 2024, 106, 3987–4014.
  34. Swetha, N.G.; Karpagam, G.R. GPU enabled Improved Reference Ideal Method (I-RIM) for Web Service Selection. Int. J. Inf. Technol. Decis. Mak. 2022, 21, 855–884.
  35. Hwang, C.-L.; Yoon, K. Methods for Multiple Attribute Decision Making. In Lecture Notes in Economics and Mathematical Systems; Springer: Berlin/Heidelberg, Germany, 1981; Volume 186, pp. 58–191.
  36. Shi, J.; Sun, M.; Yang, X.; Jing, K.; Lai, K.K. Evaluating supply chain finance risks in a cross-border e-commerce context: An improved TOPSIS approach with loss penalty. Inf. Sci. 2025, 717, 122301.
  37. Wu, T.; Liu, X.; Liu, F. An interval type-2 fuzzy TOPSIS model for large scale group decision making problems with social network information. Inf. Sci. 2018, 432, 392–410.
  38. Chakraborty, S.; Chatterjee, P.; Das, P.P. Ranking of Alternatives through Functional Mapping of Criterion Sub-Intervals into a Single Interval (RAFSI) Method. In Multi-Criteria Decision-Making Methods in Manufacturing Environments; Apple Academic Press: Palm Bay, FL, USA, 2023; pp. 317–323.
  39. Gigović, L.; Pamučar, D.; Božanić, D.; Ljubojević, S. Application of the GIS-DANP-MABAC multi-criteria model for selecting the location of wind farms: A case study of Vojvodina, Serbia. Renew. Energy 2017, 103, 501–521.
  40. Burgin, M. Algorithmic complexity as a criterion of unsolvability. Theor. Comput. Sci. 2007, 383, 244–259.
  41. Wegener, I. Complexity Theory Exploring the Limits of Efficient Algorithms; Springer: Berlin/Heidelberg, Germany, 2005; Volume 32.
  42. Ullah, I.; Ali, F.; Sharafian, A.; Ali, A.; Naeem, H.M.; Bai, X. Optimizing underwater connectivity through multi-attribute decision-making for underwater IoT deployments using remote sensing technologies. Front. Mar. Sci. 2024, 11, 1468481.
Figure 1. (a) GPU-TOPSIS 1-pass; (b) GPU-TOPSIS 1-pass shard flow diagram.
Figure 2. Execution time (a) and GPU vs. CPU acceleration factors as a function of the number of alternatives m (b).
Figure 3. Heatmaps of inter-backend consistency: (a) Max ΔSi, (b) Overlap@10. Amazon ALL data (14.65 M alternatives).
Figure 4. Sensitivity analysis: Overlap@10 (bars) and Kendall τ@10 (curve) as a function of the level of perturbation of the weights (50 scenarios, real data Amazon ALL).
Figure 5. Comparison of 1-pass vs. 2-pass GPU-TOPSIS.
Figure 6. GPU-TOPSIS 2-pass for 100 M, 150 M, and 200 M alternatives: (a) execution time (PyTorch, CuPy, TensorFlow), (b) processing throughput in millions of alternatives per second. Synthetic data calibrated on Amazon ALL; adaptive sharding (k = 7, 11, 14, respectively).
Table 1. Libraries for GPU acceleration.
| Tool | Kind | Language | GPU Support | Main Use |
|---|---|---|---|---|
| CuPy [25] | Numerical | Python | Yes (CUDA) | NumPy-compatible matrix processing |
| PyTorch [12] | DL/Numerical | Python, C++ | Yes (CUDA) | Tensors, dynamic graphs, prototyping |
| TensorFlow [13] | DL/Production | Python, C++ | Yes (CUDA) | High-performance computing, deployment |
Table 2. Comparison between GPU-TOPSIS and the closest related works.
| Work | Parallelism | Max Size | Speedup | Genericity | Numerical Stability |
|---|---|---|---|---|---|
| Lamrini 2022 [14] | OpenMP/CPU | ~10⁵ | N/A | MCDM filter | Not evaluated |
| Lamrini 2023 [15] | MapReduce | ~10⁶ | N/A | TOPSIS | Not evaluated |
| Lakshmi 2020 [16] | GPU/CUDA | Modest | Partial | QoS web | Not evaluated |
| GPU-TOPSIS (this work) | GPU/CUDA (3 backends) | ~200 × 10⁶ | 4.75× | General | Evaluated (ε < 10⁻⁶) |
Table 3. Correspondence between the scalar/loop formulation and the vectorized GPU-TOPSIS formulation.
| Step | Scalar/Loop Form | Vectorized GPU Form | Equation |
|---|---|---|---|
| 1. Normalization | $r_{ij} = \dfrac{x_{ij}}{\sqrt{\sum_{k=1}^{m} x_{kj}^{2}}},\ \forall i,j$ | R ← M ⊘ norms | (3) |
| 2. Weighting | $v_{ij} = w_j \cdot r_{ij},\ \forall i,j$ | V ← R ⊙ W | (4) |
| 3. PIS/NIS | $A_j^{+} = \max_i v_{ij},\quad A_j^{-} = \min_i v_{ij}$ | PIS ← col_max(V), NIS ← col_min(V) | (5) and (6) |
| 4. Distance D⁺ | $D_i^{+} = \sqrt{\sum_{j=1}^{n} (v_{ij} - A_j^{+})^{2}},\ \forall i$ | D⁺ ← row_sqrt_sum((V − PIS)²) | (7) |
| 5. Distance D⁻ | $D_i^{-} = \sqrt{\sum_{j=1}^{n} (v_{ij} - A_j^{-})^{2}},\ \forall i$ | D⁻ ← row_sqrt_sum((V − NIS)²) | (8) |
| 6. Score | $S_i = \dfrac{D_i^{-}}{D_i^{+} + D_i^{-} + \varepsilon},\ \forall i$ | S ← D⁻ ⊘ (D⁺ + D⁻ + ε) | (9) |
| 7. Ranking | argsort(S, descending = True) | L ← argsort_desc(S) | |

Notation: ⊙ = element-wise multiplication (Hadamard product); ⊘ = element-wise division; M ∈ ℝ^(m×n) = decision matrix (m alternatives, n criteria); W ∈ ℝ^n = weight vector; norms ∈ ℝ^n = column-wise Euclidean norms; ε = 10⁻¹² = numerical stabilization constant. All operations are executed in GPU memory as tensor primitives (CuPy/PyTorch/TensorFlow).
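The seven steps of Table 3 can be sketched in a few lines of array code. The version below uses NumPy so that it runs on a CPU; the same calls exist in CuPy, so replacing the import with `import cupy as np` (with all inputs as CuPy arrays) should move the computation to the GPU. The function name `gpu_topsis_steps`, the `benefit` mask for cost criteria (which goes beyond the max/min row shown in the table), and the toy data are assumptions for illustration.

```python
import numpy as np  # CuPy offers the same calls: `import cupy as np`

def gpu_topsis_steps(M, W, benefit, eps=1e-12):
    """Vectorized TOPSIS following the step numbering of Table 3."""
    norms = np.sqrt((M ** 2).sum(axis=0))        # column-wise Euclidean norms
    R = M / norms                                # 1. normalization
    V = R * W                                    # 2. weighting
    col_max, col_min = V.max(axis=0), V.min(axis=0)
    PIS = np.where(benefit, col_max, col_min)    # 3. ideal point
    NIS = np.where(benefit, col_min, col_max)    #    anti-ideal point
    D_pos = np.sqrt(((V - PIS) ** 2).sum(axis=1))    # 4. distance to PIS
    D_neg = np.sqrt(((V - NIS) ** 2).sum(axis=1))    # 5. distance to NIS
    S = D_neg / (D_pos + D_neg + eps)            # 6. closeness score
    L = np.argsort(-S)                           # 7. ranking, best first
    return S, L

# Toy matrix: columns are rating (benefit), reviews (benefit), price (cost).
M = np.array([[5.0, 100.0, 10.0],    # best on every criterion
              [4.0,  50.0, 20.0],
              [3.0,  10.0, 30.0]])   # worst on every criterion
W = np.array([0.4, 0.3, 0.3])
benefit = np.array([True, True, False])

S, L = gpu_topsis_steps(M, W, benefit)
print("scores :", np.round(S, 4))
print("ranking:", L)                 # alternative 0 first, alternative 2 last
```

Because alternative 0 coincides with the ideal point and alternative 2 with the anti-ideal point, their scores are approximately 1 and exactly 0, which gives a quick sanity check of the pipeline.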
Table 4. Comparison of proposed TOPSIS formulations.
| Version | Time (Form) | A: Dominant Computation Term | B: Dominant Computation Term | Dominant Memory | Typical Use |
|---|---|---|---|---|---|
| CPU-TOPSIS | $O(mn) + O(m \log m)$ | 2.0 × 10⁶ | 2.0 × 10⁸ | RAM (monolithic or streaming) | Small/medium volumes, simplicity |
| GPU 1-pass | $\dfrac{O(mn)}{P} + T_{overhead}$ | ≈4.88 × 10² + T_overhead | ≈4.88 × 10⁴ + T_overhead | VRAM ≈ c×4 min | Best choice if it fits in VRAM |
| GPU 1-pass shard | $\dfrac{O(mn)}{P} + k\,T_{overhead}$ | ≈4.88 × 10² + 100 T_overhead | ≈4.88 × 10⁴ + 100 T_overhead | VRAM per shard ≈ c×4 min | Required if dataset > VRAM; norms, PIS/NIS, and scores calculated locally on each shard |
| 2-pass GPU | $O(mn) + \dfrac{O(mn)}{P} + k \times T_{overhead}$ | ≈2.0 × 10⁶ + 4.88 × 10² + 100 × T_overhead | ≈2.0 × 10⁸ + 4.88 × 10⁴ + 100 × T_overhead | VRAM per shard ≈ c×4 min | Necessary if dataset > VRAM; global equivalence |
Table 5. Formal description of the GPU-TOPSIS decision matrix (Amazon Products 2023).
| j | Criterion Cj | Calculation | Optimization | Decision Rationale |
|---|---|---|---|---|
| 1 | Average rating | average_rating ∈ [1, 5] | Benefit | Direct indicator of aggregate customer satisfaction |
| 2 | Number of reviews | rating_number ∈ ℕ | Benefit | Proxy of popularity and commercial maturity |
| 3 | Price | price ∈ ℝ⁺ (USD) | Cost | Economic dimension of the purchasing decision |
| 4 | Freshness | 1 − (ts_max − ts_i)/ts_max ∈ [0, 1] | Benefit | Recent ratings and sales momentum |
| 5 | Utility rate | helpful_vote/(n_reviews + 1) ∈ [0, 1] | Benefit | Informational quality of reviews beyond their quantity |
| 6 | Rating dispersion | σ(ratings) ∈ [0, 2] | Cost | Polarization of opinions; high dispersion = heterogeneity of experience |
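The two derived criteria of Table 5 (freshness and utility rate) are simple element-wise formulas. The sketch below shows one way to compute them with NumPy; the function name `derived_criteria` and the toy values are hypothetical, and the actual preprocessing applied to the Amazon data may differ.

```python
import numpy as np

def derived_criteria(timestamps, helpful_votes, n_reviews):
    """Freshness and utility rate as defined in Table 5 (criteria 4 and 5)."""
    ts = np.asarray(timestamps, dtype=np.float64)
    ts_max = ts.max()
    freshness = 1.0 - (ts_max - ts) / ts_max            # 1 for the newest item
    utility = (np.asarray(helpful_votes, dtype=np.float64)
               / (np.asarray(n_reviews, dtype=np.float64) + 1.0))
    return freshness, utility

# Toy values (hypothetical): three products.
freshness, utility = derived_criteria(
    timestamps=[850, 1000, 900],
    helpful_votes=[1, 0, 3],
    n_reviews=[1, 4, 5],
)
print("freshness:", freshness)   # approximately [0.85, 1.0, 0.9]
print("utility  :", utility)     # [0.5, 0.0, 0.5]
```

The `+ 1` in the denominator keeps the utility rate defined for products with no reviews.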
Table 6. Execution time (mean ± standard deviation over 5 repetitions) and speedup factors for the four TOPSIS implementations across seven matrix sizes (actual Amazon Products 2023 data). ★ = best GPU speedup at this scale. GPU: Tesla T4 (16 GB VRAM, CUDA 12).
| N Alternatives | Source | CPU Mean ± Std (s) | PyTorch Mean ± Std (s) | Speedup PyTorch | CuPy Mean ± Std (s) | Speedup CuPy | TF Mean ± Std (s) | Speedup TF |
|---|---|---|---|---|---|---|---|---|
| 50,000 | real | 0.0094 ± 0.0004 | 0.004 ± 0.0003 | 2.34× | 0.006 ± 0.0002 | 1.49× | 0.014 ± 0.0037 | 0.65× |
| 100,000 | real | 0.0227 ± 0.0019 | 0.007 ± 0.0005 | 3.36× | 0.013 ± 0.0016 | 1.75× | 0.017 ± 0.0022 | 1.37× |
| 500,000 | real | 0.1169 ± 0.0056 | 0.030 ± 0.0006 | 3.87× | 0.053 ± 0.0018 | 2.20× | 0.045 ± 0.0041 | 2.62× |
| 1,000,000 | real | 0.2654 ± 0.0076 | 0.061 ± 0.0042 | 4.34× | 0.106 ± 0.0046 | 2.51× | 0.104 ± 0.0059 | 2.55× |
| 3,383,435 | real (Books) | 0.9925 ± 0.0333 | 0.209 ± 0.0051 | 4.75× ★ | 0.318 ± 0.0220 | 3.12× | 0.331 ± 0.0064 | 3.00× |
| 5,000,000 | real | 1.5157 ± 0.0270 | 0.341 ± 0.0049 | 4.44× | 0.459 ± 0.0064 | 3.30× | 0.521 ± 0.0089 | 2.91× |
| 14,652,525 | real (ALL) | 4.8519 ± 0.0611 | 2.696 ± 0.0741 | 1.80× | 1.746 ± 0.0435 | 2.78× | 1.998 ± 0.0518 | 2.43× |
Table 7. Inter-backend numerical consistency.
| Pair | Max ΔSi | Mean ΔSi | Overlap@5 | Overlap@10 | Overlap@50 | Overlap@100 | Kendall τ@10 | Kendall τ@100 | Spearman ρ@100 |
|---|---|---|---|---|---|---|---|---|---|
| CPU vs. CuPy | 3.76 × 10⁻⁸ | 8.11 × 10⁻⁹ | 100% | 100% | 100% | 100% | 1.0000 (p = 5.5 × 10⁻⁷) | 1.0000 (p = 2.1 × 10⁻¹⁵⁸) | 1.0 |
| CPU vs. PyTorch | 4.33 × 10⁻⁸ | 8.85 × 10⁻¹⁰ | 100% | 100% | 100% | 100% | 1.0000 (p = 5.5 × 10⁻⁷) | 1.0000 (p = 2.1 × 10⁻¹⁵⁸) | 1.0 |
| CPU vs. TensorFlow | 4.33 × 10⁻⁸ | 2.26 × 10⁻⁸ | 100% | 100% | 100% | 100% | 1.0000 (p = 5.5 × 10⁻⁷) | 1.0000 (p = 2.1 × 10⁻¹⁵⁸) | 1.0 |
| CuPy vs. PyTorch | 3.73 × 10⁻⁸ | 8.89 × 10⁻⁹ | 100% | 100% | 100% | 100% | 1.0000 (p = 5.5 × 10⁻⁷) | 1.0000 (p = 2.1 × 10⁻¹⁵⁸) | 1.0 |
| CuPy vs. TensorFlow | 4.52 × 10⁻⁸ | 2.99 × 10⁻⁸ | 100% | 100% | 100% | 100% | 1.0000 (p = 5.5 × 10⁻⁷) | 1.0000 (p = 2.1 × 10⁻¹⁵⁸) | 1.0 |
| PyTorch vs. TensorFlow | 4.47 × 10⁻⁸ | 2.18 × 10⁻⁸ | 100% | 100% | 100% | 100% | 1.0000 (p = 5.5 × 10⁻⁷) | 1.0000 (p = 2.1 × 10⁻¹⁵⁸) | 1.0 |
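The p-values reported next to Kendall’s τ in Table 7 are consistent with the exact null distribution for untied rankings: among the n! equally likely orderings, only one gives τ = 1 and one gives τ = −1, so the two-sided p-value of a perfect agreement is 2/n!. The short check below assumes this exact test was used (the text does not state the method), and the function name is hypothetical.

```python
import math

def exact_p_perfect_concordance(n: int) -> float:
    """Two-sided exact p-value of Kendall's tau = 1 for n untied items."""
    return 2 / math.factorial(n)

p10 = exact_p_perfect_concordance(10)     # Top-10 comparison
p100 = exact_p_perfect_concordance(100)   # Top-100 comparison

print(f"n = 10 : p = {p10:.1e}")    # 5.5e-07
print(f"n = 100: p = {p100:.1e}")   # 2.1e-158
```

These values match the 5.5 × 10⁻⁷ and 2.1 × 10⁻¹⁵⁸ entries, which are identical across all backend pairs because they depend only on n once τ = 1.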
Table 8. Top-10 products obtained by GPU-TOPSIS.
| Rank | Parent_Asin | Score (CPU) | Score (PyTorch) | Score (CuPy) | Score (TensorFlow) | Price ($) | Average Rating | N Reviews |
|---|---|---|---|---|---|---|---|---|
| 1 | B07TVHSDMQ | 0.998622 | 0.998622 | 0.998622 | 0.998622 | 17.99 | 4.362 | 314,691 |
| 2 | B01K8B8YA8 | 0.378326 | 0.378326 | 0.378326 | 0.378326 | 39.99 | 4.347 | 119,051 |
| 3 | B0B53DWRVW | 0.269304 | 0.269304 | 0.269304 | 0.269304 | 34.97 | 4.299 | 84,739 |
| 4 | B08XPWDSWW | 0.23063 | 0.23063 | 0.23063 | 0.23063 | 21.99 | 4.26 | 72,566 |
| 5 | B0C1G1BJ2B | 0.187815 | 0.187815 | 0.187815 | 0.187815 | 7.99 | 4.309 | 59,087 |
| 6 | B07S764D9V | 0.177188 | 0.177188 | 0.177188 | 0.177188 | 13.99 | 4.255 | 55,743 |
| 7 | B07KTYJ769 | 0.164875 | 0.164875 | 0.164875 | 0.164875 | 24.99 | 4.576 | 51,867 |
| 8 | B0BW4PFM58 | 0.163924 | 0.163924 | 0.163924 | 0.163924 | 24.99 | 4.357 | 51,568 |
| 9 | B00FAPF5U0 | 0.161784 | 0.161784 | 0.161784 | 0.161784 | 6.99 | 4.329 | 50,891 |
| 10 | B0BXQRCB55 | 0.155219 | 0.155219 | 0.155219 | 0.155219 | 19.99 | 4.391 | 48,826 |
Table 9. Execution time (mean ± standard deviation, 3 repetitions) and VRAM consumption per fragment for the 2-pass GPU-TOPSIS formulation in a sharding configuration. Fragment 0 = actual Amazon ALL data; fragments 1–5 = replicas with 2% Gaussian noise.
| K | Total Alternatives | Backend | Time, Mean ± Std (s) | VRAM Peak, Absolute (MB) | VRAM Peak, Delta (MB) | VRAM Baseline (MB) |
|---|---|---|---|---|---|---|
| 1 | 14,506,203 | pytorch | 0.445 ± 0.005 | 4408.1875 | 1688.0 | 2720.1875 |
| | | cupy | 0.770 ± 0.001 | 4396.1875 | 1676.0 | 2720.1875 |
| | | tensorflow | 0.927 ± 0.001 | 4770.1875 | 2152.0 | 2618.1875 |
| 2 | 29,012,406 | pytorch | 0.863 ± 0.001 | 6068.1875 | 3348.0 | 2720.1875 |
| | | cupy | 1.535 ± 0.004 | 6056.1875 | 3336.0 | 2720.1875 |
| | | tensorflow | 1.865 ± 0.027 | 6818.1875 | 4200.0 | 2618.1875 |
| 4 | 58,024,812 | pytorch | 1.666 ± 0.012 | 9388.1875 | 6668.0 | 2720.1875 |
| | | cupy | 3.122 ± 0.013 | 9378.1875 | 6658.0 | 2720.1875 |
| | | tensorflow | 3.701 ± 0.030 | 10,914.1875 | 8296.0 | 2618.1875 |
| 6 | 87,037,218 | pytorch | 2.548 ± 0.023 | 12,708.1875 | 9988.0 | 2720.1875 |
| | | cupy | 4.710 ± 0.013 | 12,698.1875 | 9978.0 | 2720.1875 |
| | | tensorflow | 5.680 ± 0.015 | 14,310.1875 | 11,692.0 | 2618.1875 |
Table 10. TOPSIS ranking robustness metrics against perturbations of criterion weights (50 scenarios per level, real data from Amazon ALL, 14.65 M alternatives).
| Perturbation | N Scenarios | Overlap@5 (Mean ± Std) | Overlap@10 (Mean ± Std) | Overlap@100 (Mean ± Std) | Kendall τ@10 (Mean) | Kendall τ@100 (Mean) | Max ΔSi (Mean) |
|---|---|---|---|---|---|---|---|
| ±5% | 50 | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 1.0 | 1.0 | 0.00019 |
| ±10% | 50 | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 1.0 | 1.0 | 0.00038 |
| ±15% | 50 | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 1.0 | 1.0 | 0.000573 |
| ±20% | 50 | 100.0 ± 0.0% | 100.0 ± 0.0% | 100.0 ± 0.0% | 1.0 | 1.0 | 0.00077 |
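The robustness protocol behind Table 10 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: each weight is multiplied by a factor drawn uniformly from [1 − δ, 1 + δ] and the weights are then renormalized to sum to 1, which the table does not specify. The toy decision matrix, the base weights, the criterion directions, and all function names are hypothetical.

```python
import math
import random


def topsis(matrix, weights, benefit):
    """Classical TOPSIS closeness scores for a list-of-rows decision matrix.

    Assumes no all-zero column and at least two distinct alternatives.
    """
    n = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    scores = []
    for row in v:
        d_pos, d_neg = math.dist(row, ideal), math.dist(row, anti)
        scores.append(d_neg / (d_pos + d_neg))
    return scores


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def perturb_weights(weights, delta, rng):
    """Assumed scheme: scale each weight by U(1 - delta, 1 + delta), renormalize."""
    scaled = [w * rng.uniform(1 - delta, 1 + delta) for w in weights]
    total = sum(scaled)
    return [w / total for w in scaled]


def robustness(matrix, weights, benefit, delta, n_scenarios=50, k=2, seed=0):
    """Mean Overlap@k and mean max |delta S_i| against the unperturbed ranking."""
    rng = random.Random(seed)
    base = topsis(matrix, weights, benefit)
    base_top = set(top_k(base, k))
    overlaps, max_deltas = [], []
    for _ in range(n_scenarios):
        scores = topsis(matrix, perturb_weights(weights, delta, rng), benefit)
        overlaps.append(len(base_top & set(top_k(scores, k))) / k)
        max_deltas.append(max(abs(a - b) for a, b in zip(base, scores)))
    return sum(overlaps) / n_scenarios, sum(max_deltas) / n_scenarios


# Hypothetical toy data: columns are price (cost), rating and review count (benefit).
matrix = [
    [20.0, 4.4, 3000.0],
    [40.0, 4.3, 1200.0],
    [35.0, 4.1, 800.0],
    [8.0, 4.3, 600.0],
    [14.0, 4.2, 500.0],
]
weights = [0.3, 0.3, 0.4]
benefit = [False, True, True]

for delta in (0.05, 0.10, 0.15, 0.20):
    print(delta, robustness(matrix, weights, benefit, delta))
```

Fixing the seed makes the 50 scenarios reproducible, which matters when the same perturbations have to be replayed on several backends.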
Table 11. GPU-TOPSIS 2-pass execution time (mean ± standard deviation, 3 repetitions).
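| Target (M) | k Shards | mt/Shard | PyTorch Mean (s) | PyTorch ± Std (s) | PyTorch Throughput (M/s) | CuPy Mean (s) | CuPy ± Std (s) | CuPy Throughput (M/s) | TensorFlow Mean (s) | TensorFlow ± Std (s) | TensorFlow Throughput (M/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 M | 7 | 14,285,715 | 28.96 | 0.26 | 3.45 | 30.48 | 0.32 | 3.28 | 32.14 | 0.25 | 3.11 |
| 150 M | 11 | 13,636,364 | 44.07 | 0.72 | 3.40 | 47.42 | 0.29 | 3.16 | 49.80 | 0.16 | 3.01 |
| 200 M | 14 | 14,285,715 | 60.17 | 0.17 | 3.32 | 64.37 | 0.50 | 3.11 | 70.24 | 0.36 | 2.85 |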
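The 2-pass formulation timed in Table 11 can be sketched in pure Python. This is one plausible reading of the scheme, not the authors' implementation, which this section does not show: pass 1 streams the shards to accumulate the global column statistics (sums of squares, minima, maxima), and pass 2 streams them again to score each alternative against the global ideal and anti-ideal points. It assumes vector (Euclidean) normalization and strictly positive weights. The toy data, the weights, and all function names are hypothetical; on a GPU each shard would be a tensor and the inner loops would be backend reductions.

```python
import math


def topsis_monolithic(matrix, weights, benefit):
    """Reference TOPSIS on the full matrix held in memory."""
    n = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    return [math.dist(r, anti) / (math.dist(r, ideal) + math.dist(r, anti)) for r in v]


def topsis_two_pass(shards, weights, benefit):
    """Sharded TOPSIS; `shards` is a callable returning a fresh shard iterator."""
    n = len(weights)
    # Pass 1: global column statistics, one shard in memory at a time.
    sq_sum = [0.0] * n
    col_min = [math.inf] * n
    col_max = [-math.inf] * n
    for shard in shards():
        for row in shard:
            for j, x in enumerate(row):
                sq_sum[j] += x * x
                col_min[j] = min(col_min[j], x)
                col_max[j] = max(col_max[j], x)
    # Each column is scaled by a positive factor, so raw extrema map to the
    # extrema of the weighted normalized values.
    scale = [weights[j] / math.sqrt(sq_sum[j]) for j in range(n)]
    ideal = [scale[j] * (col_max[j] if benefit[j] else col_min[j]) for j in range(n)]
    anti = [scale[j] * (col_min[j] if benefit[j] else col_max[j]) for j in range(n)]
    # Pass 2: per-shard distances and closeness scores.
    scores = []
    for shard in shards():
        for row in shard:
            v = [scale[j] * x for j, x in enumerate(row)]
            d_pos, d_neg = math.dist(v, ideal), math.dist(v, anti)
            scores.append(d_neg / (d_pos + d_neg))
    return scores


def make_shards(matrix, size):
    """Return a callable that yields consecutive row blocks of `size` rows."""
    def shards():
        for start in range(0, len(matrix), size):
            yield matrix[start:start + size]
    return shards


# Hypothetical toy data: price (cost), rating and review count (benefit).
matrix = [
    [20.0, 4.4, 3000.0],
    [40.0, 4.3, 1200.0],
    [35.0, 4.1, 800.0],
    [8.0, 4.3, 600.0],
    [14.0, 4.2, 500.0],
    [25.0, 4.0, 450.0],
]
weights = [0.3, 0.3, 0.4]
benefit = [False, True, True]

mono = topsis_monolithic(matrix, weights, benefit)
sharded = topsis_two_pass(make_shards(matrix, 2), weights, benefit)
print(max(abs(a - b) for a, b in zip(mono, sharded)))
```

In this sketch the two results agree up to floating-point rounding, because the ideal and anti-ideal points depend only on global statistics that pass 1 recovers exactly. Only one shard plus three length-n vectors are resident at any time, which corresponds to the O(mt × n) footprint stated in the abstract. The final global sort of the scores is not covered here.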
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Boubekri, L.; Aberkane, H.; Abounaima, M.C.; Lamrini, L. GPU-TOPSIS: A Complete Vectorized and Parallel Reformulation of the TOPSIS Method for Large-Scale Multi-Criteria Decision Making. Big Data Cogn. Comput. 2026, 10, 138. https://doi.org/10.3390/bdcc10050138
