1. Background
The enormous amount of data produced by Next-Generation Sequencing (NGS) technologies opens the way to a more comprehensive characterization of the molecular-level mechanisms that underlie cellular life and may play a role in the occurrence and progression of disorders and diseases. This may help to approach fundamental questions in biological and clinical research, such as how the interactions between cellular components and chromatin structure may affect gene activity, or to what extent complex diseases such as diabetes or cancer may involve specific (epi)genomic traits. Indexing NGS data is an important problem in this context [1].
In particular, an index is a data structure that enables efficient retrieval of stored objects. Indexing strategies used in NGS allow space-efficient storage of biological sequences in a full-text index that facilitates fast querying, in order to return exact or approximate string matches. Popular full-text index data structures include variants of suffix arrays [2], the FM-index, based on the Burrows–Wheeler transform (BWT) and some auxiliary tables [3], and hash tables [4]. The choice of a specific index structure is often a trade-off between query speed and memory consumption. For example, hash tables can be very fast, but their memory footprint is sometimes prohibitive for large string collections [5].
Here, we address the problem of computing the BWT in a distributed environment, relying on Big Data technologies such as Apache Spark [6] and Hadoop [7]. To the best of our knowledge, this is the first attempt to combine Spark and Hadoop in order to improve performance through both memory and cloud optimization.
Previous approaches to BWT computation based on Apache Hadoop have been proposed in the literature, such as those presented in [8,9] (BigBWA). While one of the algorithms proposed in [8] computes the BWT in a MapReduce [10] fashion, in [9] parallelism is used only to split the input sequences, which are then aligned via BWT by applying an existing framework, i.e., BWA [11]. Another related approach from the literature is RopeBWT2 [12]. However, the latter is a single-node FM-index constructor, optimized for high-end shared-memory systems, whereas our purpose is to scale BWT computation on commodity clusters, using distributed-memory paradigms. Differently from the approach proposed here, the performance of most previous methods is closely linked to hardware assumptions, rather than to tasks beyond the BWT construction itself; therefore, a direct comparison with them is not meaningful. The only algorithm comparable with our approach is the MapReduce one proposed in [8].
We propose an approach for BWT computation that combines the advantages of the MapReduce paradigm [10] and of the Spark Resilient Distributed Datasets (RDDs) [6] (a preliminary version of it appeared in [13]). In particular, our strategy is based on the computation of Suffix Arrays in a distributed environment, revisiting the idea of prefix doubling presented in [14]. We have also implemented in Apache Spark the naive MapReduce algorithm presented in [8], according to two different variants, and compared them with the novel one proposed here.
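To fix ideas, the prefix-doubling scheme can be sketched on a single machine as follows. This is a minimal Python illustration of the idea in [14], not our distributed Spark implementation: ranks computed on the first k characters of each suffix are refined to ranks on 2k characters at each round, so only O(log n) sorting rounds are needed.

```python
def suffix_array_prefix_doubling(s):
    """Suffix array by prefix doubling: ranks over the first k characters
    are refined to 2k characters at each round (O(log n) rounds)."""
    n = len(s)
    if n < 2:
        return list(range(n))
    rank = [ord(c) for c in s]              # round 0: rank = character code
    sa = list(range(n))
    k = 1
    while True:
        # Sort suffix start positions by (rank of first k chars, rank of next k).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share the same new rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:           # all ranks distinct: sa is final
            return sa
        k *= 2
```

In the distributed setting, each round becomes a sort over compact integer tuples (current rank, rank k positions ahead, suffix position) rather than over full suffix strings, which is what keeps the per-round shuffle volume small.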
Validation results obtained on real biological datasets, including genomic and proteomic data, show that our approach improves the performance for BWT computation as compared to its competitors.
1.1. Preliminaries
Let S be a string of n characters defined on the alphabet Σ. We denote by S[i] the i-th character in S and by S_i its i-th suffix. We recall the following basic notions.
1.1.1. BWT
The Burrows–Wheeler transform of S is useful in order to rearrange it into runs of similar characters. This may have advantages both for indexing S and for compressing it more efficiently. The BWT applied to S returns:
a permutation of S, obtained by sorting all its circular shifts in lexicographic order, and then extracting the last column;
the index (0-based) I of the row containing the original string S.
Among the most important properties of the BWT is that it is reversible.
Figure 1 shows an example of BWT for a sample string S, reporting both the resulting permutation and the index I.
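As a concrete illustration of the definition above, the BWT can be computed naively in a few lines of Python by sorting all circular shifts (quadratic in space, so suitable only for small examples; the terminal sentinel character "$" is an assumption made here to keep all rotations distinct):

```python
def bwt(s, sentinel="$"):
    """Naive BWT: sort all circular shifts of s and take the last column.

    Returns (last_column, index), where index is the 0-based row of the
    sorted rotation matrix that holds the original string.
    """
    s = s + sentinel                       # sentinel makes the smallest rotation unique
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(r[-1] for r in rotations)
    index = rotations.index(s)             # row containing the original string
    return last_column, index
```

For instance, `bwt("banana")` returns `("annb$aa", 4)`: the transformed string groups the three occurrences of `a` and the two occurrences of `n` into runs.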
1.1.2. Suffix Array
The suffix array SA of S is defined as an array of integers providing the starting positions of the suffixes of S in lexicographical order. Therefore, an entry SA[i] contains the starting position of the i-th suffix of S among those in lexicographic order.
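The definition translates directly into a naive one-line Python sketch (again only for illustration; practical constructions avoid materializing the suffixes):

```python
def suffix_array(s):
    """Starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])
```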
Figure 2 shows the Suffix Array for the same string used in the BWT example.
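The link between the two structures, which suffix-array-based BWT construction exploits, is the standard identity BWT[i] = S[SA[i] - 1] (wrapping around at position 0): row i of the sorted rotation matrix ends with the character that precedes the suffix starting at SA[i]. A minimal sketch:

```python
def bwt_from_sa(s, sa):
    """BWT from the suffix array: row i of the sorted rotations ends with
    the character preceding the suffix at sa[i] (wrapping at position 0)."""
    return "".join(s[i - 1] for i in sa)   # Python's s[-1] wraps to the last character
```

With S = "banana$" and SA = [6, 5, 3, 1, 0, 4, 2], this yields "annb$aa", matching the naive rotation-sorting construction.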
1.1.3. Inverse Suffix Array
The Inverse Suffix Array ISA of S is defined such that ISA[i] = j means that the rank of the suffix i is j, i.e., SA[j] = i.
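Inverting a suffix array is a single pass over its entries; a minimal sketch:

```python
def inverse_suffix_array(sa):
    """ISA[i] = j  <=>  SA[j] = i: the rank of suffix i in sorted order."""
    isa = [0] * len(sa)
    for j, i in enumerate(sa):
        isa[i] = j
    return isa
```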
3. Results and Discussion
The presented algorithms have been evaluated on real datasets taken from the Pizza&Chili website [16], where a set of text collections of various types and sizes is available to experimentally test compressed indexes. In particular, the text collections stored on this website have been selected to form a representative sample of different applications where indexed text searching might be useful. From this collection, we have chosen the following three datasets:
PROTEINS, containing a sequence of newline-separated protein sequences obtained from the Swissprot database.
DNA, a sequence of newline-separated gene DNA sequences obtained from files of the Gutenberg Project.
ENGLISH, the concatenation of English text files selected from collections of the Gutenberg Project.
We have implemented in Apache Spark all the algorithms considered here. The experimental evaluations have been performed on a cluster from the GARR Cloud Platform, configured with 1 master and 48 slave nodes, each node with 6 VCores, 24 GB of RAM, and 1 TB of disk. Apache Hadoop and Spark have been deployed on this cluster.
For the PROTEINS and DNA datasets, we considered the first 50 MB, the first 100 MB, and the entire 200 MB dataset.
Figure 3 focuses on the performance of the compared algorithms, showing running times for all successful executions (executions that failed or required more than 10 h to complete are omitted).
Figure 3 reports the end-to-end runtime of the three BWT implementations, namely the two SMR variants (based on merge sort and on radix sort, respectively) and PDA, across datasets ranging from 50 MB to 1 GB. The results are shown on a logarithmic scale to capture runtime differences exceeding two orders of magnitude. The observed trends reflect both the algorithmic structure of each variant and the way they interact with Spark's distributed execution model.
The merge-sort-based SMR variant achieves the lowest runtime on all datasets up to 200 MB, completing in approximately 0.4–2 min regardless of data type. Although algorithmically simple, it maps extremely well to Spark: it performs a single large distributed sort on the full set of suffixes/rotations and avoids multiple rounds of shuffle, intermediate data materialization, and JNI transitions. The entire computation remains within the JVM, minimizing serialization overhead and garbage-collection pressure. Its high memory footprint (up to 1 GB for 200 MB inputs) does not significantly impact performance at this scale, because executors can retain the entire working set in memory. However, this prevents the variant from scaling to large inputs, due to the quadratic space requirements of the full suffix representation; therefore, it is not able to process the ENGLISH dataset.
The radix-sort-based SMR variant is consistently the slowest implementation and exhibits poor scalability. While competitive on the 50 MB datasets, its runtime grows steeply with input size, from about 12 min on 100 MB DNA up to several hours on the 200 MB PROTEINS dataset. This behavior stems from its hybrid design: radix sorting is delegated to C++ via JNI calls. This requires the executor to repeatedly marshal data into a native format, invoke the native routine, and then reconstruct the JVM objects afterwards. The associated JNI overhead grows with the number of partitions and becomes dominant for large datasets. Moreover, the interaction between radix partitioning and the highly repetitive structure of biological sequences leads to partition skew, excessive shuffle volume, and frequent spill-to-disk events, all of which significantly degrade performance. Therefore, the theoretically linear complexity of radix sorting does not materialize in practice within Spark, resulting in runtimes far worse than those of both the merge-sort variant and PDA.
In contrast, PDA exhibits the most stable and predictable scaling behavior. Runtime increases gradually from 4–6 min for the 50 MB datasets to about 1.06 h for the ENGLISH dataset. This algorithm follows a prefix-doubling strategy and refines suffix ranks over multiple iterations. Each iteration triggers a distributed sort on compact integer tuples, rather than on full suffix strings, significantly reducing memory traffic. Although PDA performs several shuffle stages, the amount of data exchanged in each step is small and regular, generating well-balanced partitions and keeping executor memory pressure low (typically 150–400 MB). As a result, PDA is the only variant able to handle the 1 GB dataset reliably, confirming that a multi-pass approach with lightweight intermediate structures is more compatible with Spark’s execution model than a single-pass strategy involving heavy per-record state. Overall, the runtime results highlight that achieving optimal performance for an algorithm on a single machine does not necessarily translate into efficient execution in a distributed environment.
The merge-sort variant benefits from a single well-optimized global sort, PDA benefits from predictable multi-stage refinement with compact state, and the radix-sort variant suffers from JNI overhead, skew-induced imbalance, and increased shuffle costs. These systems-level effects ultimately shape the performance hierarchy observed in Figure 3.
Figure 4 shows the throughput achieved by the three BWT implementations across all considered datasets. Throughput is defined as the amount of input processed per second (MB/s), and therefore higher values correspond to better performance. The results highlight three distinct behavioral patterns.
First, the merge-sort-based SMR variant achieves the highest throughput across all datasets. This behavior is largely attributable to its low computational overhead: it performs significantly fewer intermediate operations than the other two implementations, and thus benefits from reduced communication and shuffle overhead in Spark. However, this performance advantage is deceptive, as the variant does not scale to larger datasets and lacks robustness when the amount of data increases or its distribution becomes more complex. Its superior throughput should, therefore, be interpreted as a side-effect of algorithmic simplicity, rather than genuine efficiency or scalability.
In contrast, PDA exhibits stable and predictable throughput, remaining within a narrow range of values across all datasets and data types, including the largest ENGLISH text corpus. This stability indicates that the prefix-doubling approach maintains a balanced computation–communication profile, with resource consumption that grows proportionally to the input size. As a result, PDA is the most scalable and reliable implementation among the three: it consistently delivers similar throughput independent of the dataset family (DNA, PROTEINS, ENGLISH) and maintains strong performance even on large-scale inputs.
Finally, the radix-sort-based SMR variant shows the lowest throughput, with values dropping sharply on the larger datasets. This degradation is expected: radix-based sorting techniques require substantial initialization and partitioning overhead in distributed environments, which becomes progressively more expensive as the input grows. The heavy use of bucket-sorting structures and the associated data-shuffling operations significantly penalize performance, particularly on datasets above 100 MB. Notably, this variant could not be executed (did not complete successfully) on the 1 GB ENGLISH corpus, which further reinforces its limited suitability for large-scale distributed settings.
Overall, the throughput analysis reveals that, while the merge-sort variant appears faster on small datasets, PDA is the only variant that maintains consistent and scalable performance across all dataset families and sizes, making it the most appropriate choice for large and heterogeneous workloads. The radix-sort variant, although theoretically efficient on uniform data distributions, proves unsuitable for distributed BWT construction at scale, due to its substantial communication and initialization costs.
Figure 5 shows the peak executor-side memory usage observed during the execution of the three BWT construction strategies. The results reveal key architectural differences in how each method interacts with Spark’s execution model.
The merge-sort-based SMR variant incurs the highest memory pressure, exceeding 1 GB for medium-sized inputs (200 MB). This behavior is consistent with its algorithmic structure: it explicitly materializes the full set of string rotations and sorts them lexicographically. In Spark, this results in large wide dependencies and multi-stage shuffles, where each executor must temporarily hold sizable segments of the rotation matrix. Because each rotation has the same length as the input, and rotations are represented as full strings or large byte arrays, the RDDs expand quadratically and stress both JVM heap and off-heap memory. The massive shuffle map output also triggers spill events and the memory amplification typical of sort-based shuffle operations.
The radix-sort-based variant reduces the cost of comparing large rotations by performing digit-wise comparisons on fixed-size segments, but it still materializes multiple rounds of key–value buffers during each radix iteration. Since these buffers must be fully materialized before each partition-level sort, executor memory remains high and relatively invariant to input size (around 700 MB). This plateau is a consequence of radix grouping: the memory footprint scales primarily with the number of radix passes and the alphabet characteristics, not with the input length. Additionally, the use of wide transformations in each digit pass forces Spark to retain intermediate broadcasted ranks and local buckets in memory, further increasing peak usage.
In contrast, PDA exhibits the most favorable memory profile, maintaining a stable footprint between 150 and 400 MB across all datasets. This is directly attributable to the prefix-doubling approach. Rather than generating the full rotation matrix, the algorithm stores compact rank pairs and refines them iteratively. At each iteration, only small fixed-size records are exchanged across the cluster, leading to lightweight shuffles and minimal executor memory residency. Spark is able to pipeline transformations efficiently, because the intermediate state is small enough to avoid both spilling and repeated materialization. Crucially, this compact representation enables PDA to handle the ENGLISH dataset, where both SMR variants would exceed executor memory, while PDA requires only 157 MB at peak.
The obtained results demonstrate that the scalability of BWT construction on distributed systems is strongly influenced by data representation and shuffle patterns, rather than by input size alone. Methods that minimize wide dependencies and materialized intermediate state, such as the prefix-doubling strategy implemented in PDA, achieve significantly better memory locality, predictable peak usage, and overall robustness at scale.
To assess the scalability of the proposed distributed BWT, we measured the runtime of PDA on the ENGLISH dataset while increasing the number of worker nodes. As shown in
Figure 6, execution time decreases substantially from 7400 s (24 nodes) to 5050 s (36 nodes) and 4000 s (48 nodes), confirming that the algorithm benefits from additional parallel resources.
The observed speedup, however, is sublinear, reaching around 92% parallel efficiency. The parallel efficiency (E) when scaling from 24 to 48 worker nodes is computed as E = T_24 / (2 × T_48) = 7400 s / (2 × 4000 s) ≈ 0.92.
This behavior is expected for distributed sorting-based workloads, and it is mainly due to Spark-specific overheads, including shuffle traffic, synchronization barriers between iterations, partition skew, and spill-related disk I/O. Despite these bottlenecks, the iterative BWT maintains good large-scale scalability, successfully processing the 1 GB ENGLISH dataset while continuing to exhibit runtime reductions as resources increase. The results indicate that the iterative variant is the most robust and scalable implementation among those evaluated.
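The efficiency figure above can be reproduced with a small helper (illustrative only; the runtimes are those reported for Figure 6):

```python
def parallel_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = t_base / t_scaled          # e.g., 7400 s / 4000 s = 1.85
    ideal = n_scaled / n_base            # e.g., 48 / 24 = 2.0
    return speedup / ideal
```

For the 24-to-48-node scaling step, `parallel_efficiency(7400, 24, 4000, 48)` gives 0.925, i.e., the roughly 92% efficiency quoted above.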
While the empirical evaluation provides meaningful insights into the behavior of the three BWT construction strategies in Spark, several limitations must be acknowledged. First, the merge-sort-based SMR variant exhibits the highest throughput and the lowest runtime on small datasets. However, this advantage is largely superficial, as its algorithmic structure is inherently non-scalable: because it materializes the full rotation matrix, its memory footprint grows quadratically with input size, preventing execution on large datasets such as the 1 GB ENGLISH corpus. Consequently, large-scale measurements for this variant are not available, limiting the ability to perform uniform comparisons across the full range of dataset sizes.
The radix-sort-based variant faces even more severe limitations, tied to its algorithmic structure and its interaction with Spark's distributed execution model. Experimentally, it presents very long and often impractical processing times, even on moderately sized inputs. Although the implementation is written in C++ to maximize performance, the cost remains prohibitive. From a theoretical perspective, this behavior is expected: the computational complexity of the variant depends on the maximum distance between any two indices in the partition p under consideration, including the final index. When partition indices are not uniformly distributed, this distance can become very large, leading to substantial slowdowns. In distributed settings, these theoretical issues are compounded by the practical overheads of JNI transitions, repeated bucket initialization, and large-volume shuffle operations. As a consequence, the variant exhibits non-linear performance degradation with increasing input size, and it could not be executed on the largest dataset (1 GB), preventing a complete three-way comparison at scale.
The merge-sort-based SMR variant, although generally faster than the radix-sort-based one, exhibits analogous scalability limitations. It benefits from a more favorable comparison cost, but still cannot process very large inputs, because the underlying representation of full rotations imposes excessive memory demands on executors. Thus, both SMR variants encounter fundamental limitations that hinder their applicability to large-scale distributed BWT construction.
In contrast, PDA is the only method capable of handling all dataset sizes. Despite performing multiple prefix-doubling iterations, each involving a global distributed sort, PDA maintains predictable memory usage and stable runtime behavior. Its compact representation of intermediate state minimizes shuffle volume and avoids the quadratic blow-up observed in the merge-sort variant, while also preventing the partition-skew amplification that degrades the radix-sort variant. As a result, PDA fully exploits Spark's distributed parallelism and remains robust even under varying cluster conditions. This aligns with theoretical expectations: by distributing the refinement of suffix ranks across lightweight iterations, PDA achieves consistent scalability and emerges as the only practical choice for large and heterogeneous workloads.