1. Background
The enormous amount of data produced by Next-Generation Sequencing (NGS) technologies opens the way to a more comprehensive characterization of the molecular-level mechanisms that underlie cellular life and may play a role in the occurrence and progression of disorders and diseases. This may help to approach fundamental questions in biological and clinical research, such as how the interactions between cellular components and chromatin structure may affect gene activity, or to what extent complex diseases such as diabetes or cancer may involve specific (epi)genomic traits. Indexing NGS data is an important problem in this context [1].
In particular, an index is a data structure that enables efficient retrieval of stored objects. Indexing strategies used in NGS allow space-efficient storage of biological sequences in a full-text index that facilitates fast querying, in order to return exact or approximate string matches. Popular full-text index data structures include variants of suffix arrays [2], the FM-index, based on the Burrows–Wheeler transform (BWT) and some auxiliary tables [3], and hash tables [4]. The choice of a specific index structure is often a trade-off between query speed and memory consumption. For example, hash tables can be very fast, but their memory footprint is sometimes prohibitive for large string collections [5].
Here, we address the problem of computing the BWT in a distributed environment, relying on Big Data technologies such as Apache Spark [6] and Hadoop [7]. To the best of our knowledge, this is the first attempt to combine Spark and Hadoop in order to improve performance through both memory and cloud optimization.
Previous approaches to BWT computation based on Apache Hadoop have been proposed in the literature, such as those presented in [8,9] (BigBWA). While one of the algorithms proposed in [8] computes the BWT in a MapReduce [10] fashion, in [9] parallelism is used only to split the input sequences, which are then aligned via BWT by applying an existing framework, i.e., BWA [11]. Another related approach from the literature is RopeBWT2 [12]. However, the latter is a single-node FM-index constructor, optimized for high-end shared-memory systems, whereas our purpose is to scale BWT computation on commodity clusters, using distributed-memory paradigms. Differently from the approach proposed here, the performance of most previous methods is closely linked to hardware assumptions, rather than to tasks beyond the BWT construction itself; therefore, a direct comparison with them is not meaningful. The only algorithm comparable with our approach is the MapReduce one proposed in [8].
We propose an approach for BWT computation that combines the advantages of the MapReduce paradigm [10] and of the Spark Resilient Distributed Datasets (RDDs) [6] (a preliminary version of it appeared in [13]). In particular, our strategy is based on the computation of Suffix Arrays in a distributed environment, revisiting the idea of prefix doubling presented in [14]. We have also implemented in Apache Spark the naive MapReduce algorithm presented in [8], according to two different variants, and compared them with the novel one proposed here.
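To fix ideas, the prefix-doubling scheme can be sketched on a single machine as follows. This is a minimal Python illustration of the idea in [14], not our distributed Spark implementation: ranks computed on the first k characters of each suffix are refined to ranks on 2k characters at each round, so only O(log n) sorting rounds are needed.

```python
def suffix_array_prefix_doubling(s):
    """Suffix array by prefix doubling: ranks over the first k characters
    are refined to 2k characters at each round (O(log n) rounds)."""
    n = len(s)
    if n < 2:
        return list(range(n))
    rank = [ord(c) for c in s]              # round 0: rank = character code
    sa = list(range(n))
    k = 1
    while True:
        # Sort suffix start positions by (rank of first k chars, rank of next k).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        # Re-rank: suffixes with equal keys share the same new rank.
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:           # all ranks distinct: sa is final
            return sa
        k *= 2
```

In the distributed setting, each round becomes a sort over compact integer tuples (current rank, rank k positions ahead, suffix position) rather than over full suffix strings, which is what keeps the per-round shuffle volume small.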
Validation results obtained on real biological datasets, including genomic and proteomic data, show that our approach improves the performance for BWT computation as compared to its competitors.
1.1. Preliminaries
Let S be a string of n characters defined on the alphabet Σ. We denote by S[i] the i-th character in S and by S_i its i-th suffix. We recall the following basic notions.
1.1.1. BWT
The Burrows–Wheeler transform of S is useful in order to rearrange it into runs of similar characters. This may have advantages both for indexing S and for compressing it more efficiently. The BWT applied to S returns:
a permutation of S, obtained by sorting all its circular shifts in lexicographic order, and then extracting the last column;
the index (0-based) I of the row containing the original string S.
Among the most important properties of the BWT is that it is reversible.
Figure 1 shows an example of BWT for a sample string S, reporting both the resulting permutation and the index I.
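As a concrete illustration of the definition above, the BWT can be computed naively in a few lines of Python by sorting all circular shifts (quadratic in space, so suitable only for small examples; the terminal sentinel character "$" is an assumption made here to keep all rotations distinct):

```python
def bwt(s, sentinel="$"):
    """Naive BWT: sort all circular shifts of s and take the last column.

    Returns (last_column, index), where index is the 0-based row of the
    sorted rotation matrix that holds the original string.
    """
    s = s + sentinel                       # sentinel makes the smallest rotation unique
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(r[-1] for r in rotations)
    index = rotations.index(s)             # row containing the original string
    return last_column, index
```

For instance, `bwt("banana")` returns `("annb$aa", 4)`: the transformed string groups the three occurrences of `a` and the two occurrences of `n` into runs.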
1.1.2. Suffix Array
The suffix array SA of S is defined as an array of integers providing the starting positions of the suffixes of S in lexicographical order. Therefore, an entry SA[i] contains the starting position of the i-th suffix of S among those in lexicographic order.
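The definition translates directly into a naive one-line Python sketch (again only for illustration; practical constructions avoid materializing the suffixes):

```python
def suffix_array(s):
    """Starting positions of the suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])
```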
Figure 2 shows the Suffix Array for the same string used in the BWT example.
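The link between the two structures, which suffix-array-based BWT construction exploits, is the standard identity BWT[i] = S[SA[i] - 1] (wrapping around at position 0): row i of the sorted rotation matrix ends with the character that precedes the suffix starting at SA[i]. A minimal sketch:

```python
def bwt_from_sa(s, sa):
    """BWT from the suffix array: row i of the sorted rotations ends with
    the character preceding the suffix at sa[i] (wrapping at position 0)."""
    return "".join(s[i - 1] for i in sa)   # Python's s[-1] wraps to the last character
```

With S = "banana$" and SA = [6, 5, 3, 1, 0, 4, 2], this yields "annb$aa", matching the naive rotation-sorting construction.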
1.1.3. Inverse Suffix Array
The Inverse Suffix Array ISA of S is defined such that ISA[i] = j means that the rank of the suffix i is j, i.e., SA[j] = i.
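Inverting a suffix array is a single pass over its entries; a minimal sketch:

```python
def inverse_suffix_array(sa):
    """ISA[i] = j  <=>  SA[j] = i: the rank of suffix i in sorted order."""
    isa = [0] * len(sa)
    for j, i in enumerate(sa):
        isa[i] = j
    return isa
```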
3. Results and Discussion
The presented algorithms have been evaluated on real datasets taken from the Pizza&Chili website [16], where a set of text collections of various types and sizes is available to experimentally test compressed indexes. In particular, the text collections stored on this website have been selected to form a representative sample of different applications where indexed text searching might be useful. From this collection, we have chosen the following three datasets:
PROTEINS, containing a sequence of newline-separated protein sequences obtained from the Swissprot database.
DNA, a sequence of newline-separated gene DNA sequences obtained from files of the Gutenberg Project.
ENGLISH, the concatenation of English text files selected from collections of the Gutenberg Project.
We have implemented in Apache Spark all the algorithms considered here. The experimental evaluations have been performed on a cluster from the GARR Cloud Platform, configured with 1 master and 48 slave nodes, each node with 6 VCores, 24 GB of RAM, and 1 TB of disk. Apache Hadoop and Spark have been deployed on this cluster.
For the PROTEINS and DNA datasets, we considered the first 50 MB, the first 100 MB, and the entire 200 MB dataset.
Figure 3 focuses on the performance of the compared algorithms, showing running times for all successful executions (executions that failed or required more than 10 h to complete are omitted).
Figure 3 reports the end-to-end runtime of the three BWT implementations, namely the two SMR variants (based on merge sort and on radix sort, respectively) and PDA, across datasets ranging from 50 MB to 1 GB. The results are shown on a logarithmic scale to capture runtime differences exceeding two orders of magnitude. The observed trends reflect both the algorithmic structure of each variant and the way they interact with Spark's distributed execution model.
The merge-sort-based SMR variant achieves the lowest runtime on all datasets up to 200 MB, completing in approximately 0.4–2 min regardless of data type. Although algorithmically simple, it maps extremely well to Spark: it performs a single large distributed sort on the full set of suffixes/rotations and avoids multiple rounds of shuffle, intermediate data materialization, and JNI transitions. The entire computation remains within the JVM, minimizing serialization overhead and garbage-collection pressure. Its high memory footprint (up to 1 GB for 200 MB inputs) does not significantly impact performance at this scale, because executors can retain the entire working set in memory. However, this prevents the variant from scaling to large inputs, due to the quadratic space requirements of the full suffix representation; therefore, it is not able to process the ENGLISH dataset.
The radix-sort-based SMR variant is consistently the slowest implementation and exhibits poor scalability. While competitive on the 50 MB datasets, its runtime grows steeply with input size, from about 12 min on 100 MB DNA up to several hours on the 200 MB PROTEINS dataset. This behavior stems from its hybrid design: radix sorting is delegated to C++ via JNI calls. This requires the executor to repeatedly marshal data into a native format, invoke the native routine, and then reconstruct the JVM objects afterwards. The associated JNI overhead grows with the number of partitions and becomes dominant for large datasets. Moreover, the interaction between radix partitioning and the highly repetitive structure of biological sequences leads to partition skew, excessive shuffle volume, and frequent spill-to-disk events, all of which significantly degrade performance. Therefore, the theoretically linear complexity of radix sorting does not materialize in practice within Spark, resulting in runtimes far worse than those of both the merge-sort variant and PDA.
In contrast, PDA exhibits the most stable and predictable scaling behavior. Runtime increases gradually from 4–6 min for the 50 MB datasets to about 1.06 h for the ENGLISH dataset. This algorithm follows a prefix-doubling strategy and refines suffix ranks over multiple iterations. Each iteration triggers a distributed sort on compact integer tuples, rather than on full suffix strings, significantly reducing memory traffic. Although PDA performs several shuffle stages, the amount of data exchanged in each step is small and regular, generating well-balanced partitions and keeping executor memory pressure low (typically 150–400 MB). As a result, PDA is the only variant able to handle the 1 GB dataset reliably, confirming that a multi-pass approach with lightweight intermediate structures is more compatible with Spark’s execution model than a single-pass strategy involving heavy per-record state. Overall, the runtime results highlight that achieving optimal performance for an algorithm on a single machine does not necessarily translate into efficient execution in a distributed environment.
The merge-sort variant benefits from a single well-optimized global sort, PDA benefits from predictable multi-stage refinement with compact state, and the radix-sort variant suffers from JNI overhead, skew-induced imbalance, and increased shuffle costs. These systems-level effects ultimately shape the performance hierarchy observed in Figure 3.
Figure 4 shows the throughput achieved by the three BWT implementations across all considered datasets. Throughput is defined as the amount of input processed per second (MB/s), and therefore higher values correspond to better performance. The results highlight three distinct behavioral patterns.
First, the merge-sort-based SMR variant achieves the highest throughput across all datasets. This behavior is largely attributable to its low computational overhead: it performs significantly fewer intermediate operations than the other two implementations, and thus benefits from reduced communication and shuffle overhead in Spark. However, this performance advantage is deceptive, as the variant does not scale to larger datasets and lacks robustness when the amount of data increases or its distribution becomes more complex. Its superior throughput should, therefore, be interpreted as a side-effect of algorithmic simplicity, rather than genuine efficiency or scalability.
In contrast, PDA exhibits stable and predictable throughput, remaining within a narrow range of values across all datasets and data types, including the largest ENGLISH text corpus. This stability indicates that the prefix-doubling approach maintains a balanced computation–communication profile, with resource consumption that grows proportionally to the input size. As a result, PDA is the most scalable and reliable implementation among the three: it consistently delivers similar throughput independent of the dataset family (DNA, PROTEINS, ENGLISH) and maintains strong performance even on large-scale inputs.
Finally, the radix-sort-based SMR variant shows the lowest throughput, with values dropping sharply on the larger datasets. This degradation is expected: radix-based sorting techniques require substantial initialization and partitioning overhead in distributed environments, which becomes progressively more expensive as the input grows. The heavy use of bucket-sorting structures and the associated data-shuffling operations significantly penalize performance, particularly on datasets above 100 MB. Notably, this variant could not be executed (did not complete successfully) on the 1 GB ENGLISH corpus, which further reinforces its limited suitability for large-scale distributed settings.
Overall, the throughput analysis reveals that, while the merge-sort variant appears faster on small datasets, PDA is the only variant that maintains consistent and scalable performance across all dataset families and sizes, making it the most appropriate choice for large and heterogeneous workloads. The radix-sort variant, although theoretically efficient on uniform data distributions, proves unsuitable for distributed BWT construction at scale, due to its substantial communication and initialization costs.
Figure 5 shows the peak executor-side memory usage observed during the execution of the three BWT construction strategies. The results reveal key architectural differences in how each method interacts with Spark’s execution model.
The merge-sort-based SMR variant incurs the highest memory pressure, exceeding 1 GB for medium-sized inputs (200 MB). This behavior is consistent with its algorithmic structure: it explicitly materializes the full set of string rotations and sorts them lexicographically. In Spark, this results in large wide dependencies and multi-stage shuffles, where each executor must temporarily hold sizable segments of the rotation matrix. Because each rotation has the same length as the input, and rotations are represented as full strings or large byte arrays, the RDDs expand quadratically and stress both JVM heap and off-heap memory. The massive shuffle map output also triggers spill events and the memory amplification typical of sort-based shuffle operations.
The radix-sort-based variant reduces the cost of comparing large rotations by performing digit-wise comparisons on fixed-size segments, but it still materializes multiple rounds of key–value buffers during each radix iteration. Since these buffers must be fully materialized before each partition-level sort, executor memory remains high and relatively invariant to input size (around 700 MB). This plateau is a consequence of radix grouping: the memory footprint scales primarily with the number of radix passes and the alphabet characteristics, not with the input length. Additionally, the use of wide transformations in each digit pass forces Spark to retain intermediate broadcasted ranks and local buckets in memory, further increasing peak usage.
In contrast, PDA exhibits the most favorable memory profile, maintaining a stable footprint between 150 and 400 MB across all datasets. This is directly attributable to the prefix-doubling approach. Rather than generating the full rotation matrix, the algorithm stores compact rank pairs and refines them iteratively. At each iteration, only small fixed-size records are exchanged across the cluster, leading to lightweight shuffles and minimal executor memory residency. Spark is able to pipeline transformations efficiently, because the intermediate state is small enough to avoid both spilling and repeated materialization. Crucially, this compact representation enables PDA to handle the ENGLISH dataset, where both SMR variants would exceed executor memory, while PDA requires only 157 MB at peak.
The obtained results demonstrate that the scalability of BWT construction on distributed systems is strongly influenced by data representation and shuffle patterns, rather than by input size alone. Methods that minimize wide dependencies and materialized intermediate state, such as the prefix-doubling strategy implemented in PDA, achieve significantly better memory locality, predictable peak usage, and overall robustness at scale.
To assess the scalability of the proposed distributed BWT, we measured the runtime of PDA on the ENGLISH dataset while increasing the number of worker nodes. As shown in
Figure 6, execution time decreases substantially from 7400 s (24 nodes) to 5050 s (36 nodes) and 4000 s (48 nodes), confirming that the algorithm benefits from additional parallel resources.
The observed speedup, however, is sublinear, reaching around 92% parallel efficiency. The parallel efficiency (E) when scaling from 24 to 48 worker nodes is computed as E = T_24 / (2 × T_48) = 7400 s / (2 × 4000 s) ≈ 0.92.
This behavior is expected for distributed sorting-based workloads, and it is mainly due to Spark-specific overheads, including shuffle traffic, synchronization barriers between iterations, partition skew, and spill-related disk I/O. Despite these bottlenecks, the iterative BWT maintains good large-scale scalability, successfully processing the 1 GB ENGLISH dataset while continuing to exhibit runtime reductions as resources increase. The results indicate that the iterative variant is the most robust and scalable implementation among those evaluated.
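The efficiency figure above can be reproduced with a small helper (illustrative only; the runtimes are those reported for Figure 6):

```python
def parallel_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Measured speedup divided by the ideal (linear) speedup."""
    speedup = t_base / t_scaled          # e.g., 7400 s / 4000 s = 1.85
    ideal = n_scaled / n_base            # e.g., 48 / 24 = 2.0
    return speedup / ideal
```

For the 24-to-48-node scaling step, `parallel_efficiency(7400, 24, 4000, 48)` gives 0.925, i.e., the roughly 92% efficiency quoted above.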
While the empirical evaluation provides meaningful insights into the behavior of the three BWT construction strategies in Spark, several limitations must be acknowledged. First, the merge-sort-based SMR variant exhibits the highest throughput and the lowest runtime on small datasets. However, this advantage is largely superficial, as its algorithmic structure is inherently non-scalable: because it materializes the full rotation matrix, its memory footprint grows quadratically with input size, preventing execution on large datasets such as the 1 GB ENGLISH corpus. Consequently, large-scale measurements for this variant are not available, limiting the ability to perform uniform comparisons across the full range of dataset sizes.
The radix-sort-based variant faces even more severe limitations, tied to its algorithmic structure and its interaction with Spark's distributed execution model. Experimentally, it presents very long and often impractical processing times, even on moderately sized inputs. Although the implementation is written in C++ to maximize performance, the cost remains prohibitive. From a theoretical perspective, this behavior is expected: the computational complexity of the variant depends on the maximum distance between any two indices in the partition p under consideration, including the final index. When partition indices are not uniformly distributed, this distance can become very large, leading to substantial slowdowns. In distributed settings, these theoretical issues are compounded by the practical overheads of JNI transitions, repeated bucket initialization, and large-volume shuffle operations. As a consequence, the variant exhibits non-linear performance degradation with increasing input size, and it could not be executed on the largest dataset (1 GB), preventing a complete three-way comparison at scale.
The merge-sort-based SMR variant, although generally faster than the radix-sort-based one, exhibits analogous scalability limitations. It benefits from a more favorable comparison cost, but still cannot process very large inputs, because the underlying representation of full rotations imposes excessive memory demands on executors. Thus, both SMR variants encounter fundamental limitations that hinder their applicability to large-scale distributed BWT construction.
In contrast, PDA is the only method capable of handling all dataset sizes. Despite performing multiple prefix-doubling iterations, each involving a global distributed sort, PDA maintains predictable memory usage and stable runtime behavior. Its compact representation of intermediate state minimizes shuffle volume and avoids the quadratic blow-up observed in the merge-sort variant, while also preventing the partition-skew amplification that degrades the radix-sort variant. As a result, PDA fully exploits Spark's distributed parallelism and remains robust even under varying cluster conditions. This aligns with theoretical expectations: by distributing the refinement of suffix ranks across lightweight iterations, PDA achieves consistent scalability and emerges as the only practical choice for large and heterogeneous workloads.