Optimizing Data Parallelism for FM-Based Short-Read Alignment on the Heterogeneous Non-Uniform Memory Access Architectures

Abstract: Sequence alignment is a critical step in the variant analysis of genomic research. Since the FM (Ferragina–Manzini) index was developed, it has proven to be a compact model with efficient pattern matching and high-speed query searching, which has attracted much research interest in the field of sequence alignment. These characteristics make it a convenient tool for handling large-scale sequence alignment projects executed with a small memory footprint. In bioinformatics, the massive success of next-generation sequencing technology has led to exponential growth in genomic data, presenting a computational challenge for sequence alignment. In addition, the use of heterogeneous computing systems, composed of various types of nodes, is prevalent in the field of HPC (high-performance computing) and presents a promising solution for sequence alignment. However, conventional short-read alignment methodologies are limited in performance on current heterogeneous computing infrastructures. We therefore developed a parallel sequence aligner to investigate the applicability of this approach on NUMA-based (Non-Uniform Memory Access) heterogeneous architectures against traditional alignment algorithms. The proposed work combines the LF (longest-first) distribution policy with the EP (enhanced partitioning) strategy for effective load balancing and efficient parallelization across heterogeneous architectures. The new LF-EP-based FM aligner shows excellent efficiency and a significant improvement on NUMA-based heterogeneous computing platforms, outperforming several popular FM aligners along many dimensions, such as read length, sequence number, sequence distance, alignment speedup, and result quality. The resulting evaluation metrics cover the quality assessment, complexity analysis, and speedup evaluation of our approach.
By utilizing the capabilities of NUMA-based heterogeneous computing architectures, our approach provides an effective and convenient solution for large-scale short-read alignment in heterogeneous systems.


Introduction
Advances in next-generation sequencing can be seen in the rapid decrease in sequencing costs: the cost of sequencing a genome dropped from USD 1 million in 2007 to USD 600 in 2021 [1]. This drop has led to the exponential growth of sequencing data, which ranges from gigabytes to terabytes and even petabytes. The fact that the volume of sequence data doubles every 12 months while processor capacity doubles only every 18 months poses a significant challenge. Consequently, more powerful sequence processing programs are required to manage the data deluge [2].
Sequence alignment is the most fundamental process in the analysis of genomic data. Across most variation detection and annotation processes, it is the starting point, leading to biologically significant consequences. It is also a highly computing-intensive task; accordingly, this work looks into the parallelization of FM aligners and into future study directions to improve their efficiency in heterogeneous architecture systems.
The following is a summary of the contributions of this investigation: (1) We developed a convenient parallelism technique for NUMA systems that couples the longest-first distribution strategy with enhanced partitioning. This approach attains effective load balancing and optimal parallelization across heterogeneous NUMA processors on HPC systems. This contribution focuses primarily on the effective and rapid deployment of bioinformatics algorithms on modern heterogeneous architectures. (2) We introduced the LF-EP-based FM aligner, which utilizes the FM index for short-read sequence alignment with a novel implementation tailored for popular heterogeneous computing platforms built from NUMA nodes. The proposed aligner shows significant performance improvements in scalability, speedup, and quality, which is convenient for genomic research where massive data volumes are processed rapidly. (3) We conducted a comprehensive evaluation on popular NUMA infrastructures.
Our paper demonstrates improvements in parallelism through a thorough review of both main processor platforms. We tested our approach on a variety of NUMA architectures, including Intel Xeon and AMD Opteron. (4) We compared our aligner with prevailing FM aligners (BWA, SOAP2, BOWTIE2, and GEM) on various sequence benchmarks (BALIBASE, OXBENCH, SABRE, and PREFAB). These comprehensive aligners and benchmarks not only highlight the practical advantages but also establish a baseline for future studies to compare against.
The rest of the paper is organized as follows: a critical survey of related studies in the literature is performed in Section 2. Basic concepts of sequence alignment, an in-depth review of the FM aligner, and details of heterogeneous architectures are presented in Section 3. Our proposed algorithm, longest-first combined with enhanced partitioning, is provided in Section 4. The experimental results and performance evaluation of the proposed algorithm are presented in Section 5. The experimental findings are summarized in Section 6, along with a brief outline of prospective research directions.

Literature Study
The FM index finds a wide range of applications related to genomic studies, particularly in sequence alignment. It is a suitable approach for aligning short reads to large genome references with a small memory footprint [9]. Most efficient sequence aligners, including BOWTIE [10], BWA [11], HISAT [12], and SOAP2 [13], exploit the FM index-based mechanism. The FM index has two types of indices: a global FM index for the entire genome and many small FM indices for genome fragments that together span the whole genome [3].
Additionally, another area where the FM index is used in genomics studies is whole-genome alignment [14]. Moreover, hardware-accelerated exact and inexact alignment using the FM index has been studied for FPGA-based alignment architectures [15]. An evaluation study in the context of RNA alignment showed a significant reduction in the memory consumption and runtime of FM index-based aligners compared to hash table-based aligners at nearly identical precision and recall rates [16]. This test also showed the high tolerance of BOWTIE to indels and sequencing errors, making the more tolerant but inexact hash-based aligners unnecessary in such cases. While FM-based aligners have been widely adopted as the data structure of choice for the fast alignment of sequences for several years, some performance limitations have been observed in memory-constrained systems when using FM-based aligners compared to hash-based aligners [17].
Sequence alignment is a collaborative research area in which heterogeneous architectures have been widely used. NUMA has emerged as one of the most widely utilized computing infrastructures [18]. Research that treats sequence alignment as a trade-off between data locality and data dispersion in NUMA-based systems highlights an interesting topic on NUMA-based heterogeneous architectures [16]. Several studies [19][20][21] have quantitatively investigated the performance of various memory allocation policies and data partitioning strategies in NUMA architectures. Many algorithms and tools have been developed, but they do not consider the impact of the computing structure on the overall performance of sequence alignment. For instance, BarraCUDA is a short-read sequence aligner that rapidly aligns sequences using GPUs [22]. It uses a suffix-tree method to locate exact matches in the reference sequence. GSWABE provides all-to-all pairwise global, semi-global, and local alignment through a tile-based method that leverages the computational capability of CUDA-enabled GPUs [23]. However, these new hardware techniques are not aimed at improving the utilization efficiency of existing infrastructures.
Based on the above discussion, using NUMA-based heterogeneous architectures in HPC systems can generally improve sequence alignment algorithms [24]. Processors with NUMA architectures, combined with parallelization techniques, enhance the speed of the alignment process. It is important to note that the size of biological databases is continuously increasing, which makes the data more complex to align and demands more computation time [25]. Additionally, alignment algorithms must be able to handle datasets in which sequences have varying lengths, which is a must for applications requiring high precision [26]. However, much more research is needed on the interplay between the stages of alignment and the processor architecture to enhance the performance of sequence alignment algorithms on NUMA-based heterogeneous architectures. This research aims to address these challenges and contribute better and more effective alignment algorithms suitable for heterogeneous datasets processed on the heterogeneous computing architectures of HPC systems.

FM-Based Mechanism in Sequence Alignment
In the field of bioinformatics, deciphering genetic information is one of the most important tasks for sequence aligners, exposing hidden patterns within DNA, RNA, and protein sequences. The criteria for selecting a sequence aligner depend on the properties of the data, the required level of accuracy, and the computational performance. These aligners are implemented using computational methods, such as heuristics and machine learning algorithms, combined with statistical approaches to increase precision and effectiveness. Notable alignment tools in bioinformatics include MAQ [27], BOWTIE [10], BWA [11], and HISAT [12].
FM index-based sequence alignment is a widely employed tool in genomics research. It generates a compact index for the reference genome, enabling the alignment to be conducted in fixed-size segments using a round-robin pattern [28]. The aligner processes data segments individually, thereby avoiding interdependence among them. FM aligners permit predefined scores for mismatches and gaps, thus increasing the probability of aligning short reads to long reference sequences. The compressed genome reference simplifies the search of sequencing data, and the BWT data structure is applied to optimize pattern matching [29]. Therefore, an FM-based sequence aligner generally comprises two processes: index building and sequence searching. Index building is the process through which the genome reference is preprocessed. The alignment of short sequences, such as DNA or RNA segments, against the reference genome is called sequence searching. Due to these techniques, FM-based sequence aligners take less time and use fewer processing resources.
The FM alignment approach combines the features of the suffix array (SA) and the BWT, optimizing sequence alignment through their structural efficiencies. This combination allows for the rapid location of all instances of a given pattern P within a large genome reference (GR). Unlike traditional alignment methods, an FM-based sequence aligner is substantially faster than other popular algorithms, and its computational complexity depends mainly on the length of P. Thus, it is especially useful for large genomes. The detailed computation process is as follows.
(1) SA construction. The creation of suffixes is conducted first. For a given genome reference GR, all possible suffixes of GR are created after appending a unique character $ to the end of GR, which marks the end of the string and has the minimal lexicographical value. This is followed by sorting the suffixes in lexicographical order. The SA is an array whose elements indicate the starting positions of the sorted suffixes in the original string.
(2) BWT computation. Rotations and sorting are performed in this procedure. Each rotation of the string GR is formed through a circular shift of the input string. The set of rotations is sorted lexicographically, followed by BWT formation: the BWT is constructed by taking the last character from each of the sorted rotations. The reasoning behind using these characters is that, when concatenated, they reveal the repetitive nature of the original string, aiding effective compression and transformation.
Table 1 illustrates the calculation of the SA and BWT of the string GR = GTGTAC. Each row indicates a rotation of the sequence in sorted order. The SA index column contains the original indices of the suffixes before sorting. The rotations column displays every possible rotation of GR. The Sorted-BWT column lists the rotations in sorted order, which directly determines the construction of the BWT. The sorted index represents the SA values, i.e., the starting positions of the sorted suffixes in the original string. Thus, the BWT, represented as CTAT$GG, is derived from the last characters of the sorted rotations in the Sorted-BWT column. This BWT sequence is essential for the reconstruction of the original sequence and facilitates effective pattern searching.

SA Index | Rotations | Sorted-BWT | BWT | Sorted Index
0 | GTGTAC$ | $GTGTAC | C | 6
1 | TGTAC$G | AC$GTGT | T | 4
2 | GTAC$GT | C$GTGTA | A | 5
3 | TAC$GTG | GTAC$GT | T | 2
4 | AC$GTGT | GTGTAC$ | $ | 0
5 | C$GTGTA | TAC$GTG | G | 3
6 | $GTGTAC | TGTAC$G | G | 1
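The construction in Table 1 can be reproduced in a short script. The sketch below (plain Python; the function names are ours, and the rank query is a naive O(n) scan rather than a compressed occurrence table) builds the SA and BWT for GR = GTGTAC and then counts pattern occurrences with FM-style backward search:

```python
from collections import Counter

def bwt_and_sa(text):
    """Build the BWT and suffix array of text via sorted rotations."""
    s = text + "$"                        # unique terminator, smallest character
    n = len(s)
    # Sorting rotation start positions by rotation string; with a unique '$'
    # this order equals the suffix-array order.
    sa = sorted(range(n), key=lambda i: s[i:] + s[:i])
    bwt = "".join(s[i - 1] for i in sa)   # last column of the sorted rotations
    return bwt, sa

def backward_search(bwt, pattern):
    """Count occurrences of pattern using FM backward search on the BWT."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):             # C[ch] = #chars lexicographically < ch
        c_table[ch] = total
        total += counts[ch]

    def occ(ch, i):                       # occurrences of ch in bwt[:i]
        return bwt[:i].count(ch)

    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):          # extend the match one character at a time
        lo = c_table.get(ch, 0) + occ(ch, lo)
        hi = c_table.get(ch, 0) + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo

bwt, sa = bwt_and_sa("GTGTAC")
print(bwt)                          # CTAT$GG, matching Table 1
print(sa)                           # [6, 4, 5, 2, 0, 3, 1]
print(backward_search(bwt, "GT"))   # GT occurs at positions 0 and 2 -> 2
```

The loop over the reversed pattern is why the query cost depends mainly on the length of P, as noted above: each character of P narrows the [lo, hi) suffix-array interval once.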

Heterogeneity in the Nodes of the HPC Field
Heterogeneity within the HPC field means that the field incorporates various computing infrastructures and architectures, making core configurations much more complex. However, this allows the end user to conveniently utilize diverse resources and enhances the overall computing power of the system. Heterogeneous architectures are significant to the scientific research community as they offer various computing resources that enhance the efficiency, scalability, and resilience of the system [30]. In addition, heterogeneous architectures comprise different kinds of computing nodes with various processor architectures. NUMA processors, such as Intel Xeon and AMD Opteron, are widely used in HPC platforms [31]. These NUMA processors have dedicated local memory banks and local cores, allowing for significant performance improvements in applications like sequence alignment. The architectural features of NUMA processors, when employed in an HPC system, can lead to distinct performance advantages as well as various bottlenecks.
The architecture of the processors is illustrated in Figure 1, showcasing two commonly utilized NUMA nodes as examples: the Intel Xeon E5-4620 and the AMD Opteron-6376. These NUMA systems can accommodate up to four processors, totaling 32 cores, 64 threads, and a memory capacity of 128 GB. Although both systems have four NUMA nodes, they differ in the interconnections between the processors: while the AMD Opteron uses a fully connected network structure, the Intel Xeon uses a ring structure with a circular connection.
Future Internet 2024, 16, x FOR PEER REVIEW 6 of 18
Figure 2 demonstrates the critical principle of NUMA architecture: access to local memory is fast. If a processor requests memory that is not local, it is ultimately provided with access to that memory; however, latency is introduced compared to local memory. In NUMA architecture, local access generally refers to a processor accessing memory that is directly attached to that processor or co-located in the same node. Remote access means that the cores on Processor 1 access memory that physically resides on another processor, Processor 2. In practice, this difference in memory access in NUMA architecture is essential for many research programs and hardware configuration optimizations in performance, but it is often ignored in HPC.


The Proposed Algorithm
This research takes into consideration various NUMA computing architectures, which have been used in most of the latest developments in the HPC field. Heterogeneous clusters have become the first preference of most of the research community today because they can form a versatile architecture for computing systems [32]. It is usually much easier to optimize load balance in homogeneous nodes, as their similar designs allow for more coherent interaction. Therefore, our proposed scheme primarily focuses on addressing load balancing in heterogeneous systems comprising different types of NUMA architectures. Furthermore, we focus on the homogeneous architecture within each node to exploit the potential parallelism within the processors. Our system comprises two phases, discussed as follows.

1. Heterogeneous phase: Load balancing across all the nodes of the cluster is the primary purpose of this phase. We use a longest-first (LF) approach to achieve load balancing among the heterogeneous architectures, which enables us to distribute the workload across many nodes.
2. Homogeneous phase: The fundamental aim of this phase is to maximize data parallelism on each homogeneous processor within the heterogeneous architectures. To achieve this objective, we developed an enhanced partitioning (EP) technique that exploits the available computational resources according to the maximum potential benefit of each node.

Longest-First Algorithm in the Heterogeneous Phase
The algorithm introduced in this paper proposes a parallel methodology for partitioning a set of sequence reads, incorporating both the k-mer distance and the maximum length of the sequence read datasets. The k-mer distance is a measure of dissimilarity between genomic sequences with respect to the frequencies of subsequences of length k. The goal is to execute multiple jobs simultaneously with an optimal balance in a NUMA node-based heterogeneous computing environment.
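As a concrete illustration of the k-mer distance described above, the sketch below compares two reads via their k-mer frequency profiles. The profile-difference (L1) metric and the k = 3 default are our own illustrative choices; the paper does not fix a specific formula:

```python
from collections import Counter

def kmer_profile(seq, k=3):
    """Frequency table of all length-k subsequences of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=3):
    """Dissimilarity of two reads as the L1 distance between k-mer profiles."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    return sum(abs(pa[m] - pb[m]) for m in set(pa) | set(pb))

print(kmer_distance("GTGTAC", "GTGTAC"))  # 0: identical reads
print(kmer_distance("GTGTAC", "ACGTAC"))  # 4: two shared 3-mers, four unshared
```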

Algorithm 1. LF algorithm
Input: (1) A sequence reads set S containing s sequence reads. (2) The k-mer distance set K of S. (3) The longest length of a sequence read set S is l(S). (4) The number j of partitioned sequence read sets.
Output: The j-part partitioned sequence reads set.
Steps:
1. Calculate the k-mer distance set K of S in parallel.
2. Rank sequence reads set S in decreasing order K_i ≥ K_{i+1} based on the values of the k-mer distance set K.
3. Divide S into j subread sets S′.
4. Sort S′ in decreasing order S′_i ≥ S′_{i+1} based on the longest length of the subsequence reads set S′.
5. For k = 1 to d, make sets S′′_i ≠ ∅, S′′_i ⊂ S′, where L_i is the total sum of complementation time in S′′_i, 1 ≤ i ≤ d.
6. Sort {S′′_1, ..., S′′_d} in increasing order of L_i.
7. Merge S′_k into the final set with the smallest load and update L_k = L_k + l(S′_k).
8. End for.
The following sequence depicts the parallel workflow of the longest-first algorithm in the heterogeneous phase.

1. Calculating the k-mer distance: For all the sequence reads in the input set, the first operation calculates the k-mer distance in parallel. The k-mer distance is calculated for any two sequence reads and measures the dissimilarity between them.
2. Ranking sequence reads: The sequence reads are ordered from the highest to the lowest k-mer distance. The sequence read with the highest k-mer distance is ranked at the top, the second highest follows, and so on.
3. Partitioning sequence reads: The ranked sequence reads are partitioned into several sets. The number of sets depends on the value of j, where j is the number of partitioned sequence read sets. All the sets are of almost equal size except for the last set, whose size depends on the total number of sequence reads and the value of j.
4. Arranging subread sets: The subread sets are arranged in decreasing order of the longest sequence read in each subset. The subread set containing the longest sequence read has the highest priority, the subread set with the second longest sequence read has the second highest priority, and so on.
5. Merging partitions: The algorithm enters a loop that runs d times, where d is the final number of partitions. In each of these d iterations, the partitioned sets are sorted in ascending order of their total complementation time, and the subread set is merged into the partition with the least complementation time, thereby updating that partition's sum of complementation time.
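The ranking, partitioning, and merging steps above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function name, the assumption that per-read scores (average k-mer distances) are already computed, and the use of read length as a proxy for complementation time are our own:

```python
def lf_distribute(reads, scores, j, d):
    """Longest-first distribution: rank reads, partition into j subsets,
    order subsets by their longest read, then merge into d balanced sets."""
    # Step 2: rank reads by k-mer distance score, highest first.
    ranked = [r for _, r in sorted(zip(scores, reads), key=lambda p: -p[0])]
    # Step 3: divide into nearly equal subread sets of floor(s/j) + 1 reads;
    # the last slice absorbs the remainder (and is dropped if empty).
    size = len(reads) // j + 1
    subsets = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    # Step 4: order subsets by their longest read, longest first.
    subsets.sort(key=lambda sub: max(len(r) for r in sub), reverse=True)
    # Step 5: merge each subset into the final set with the smallest
    # accumulated complementation time (read length as a stand-in).
    finals, loads = [[] for _ in range(d)], [0] * d
    for sub in subsets:
        tgt = loads.index(min(loads))
        finals[tgt].extend(sub)
        loads[tgt] += max(len(r) for r in sub)
    return finals, loads

reads = ["GTGTACGT", "ACGT", "TTTTTT", "GCGC", "AAAA", "CGTACG"]
scores = [5, 2, 4, 1, 0, 3]
finals, loads = lf_distribute(reads, scores, j=3, d=2)
print([len(f) for f in finals])   # every read lands in exactly one final set
```

Merging into the least-loaded set first is the classic longest-processing-time heuristic, which keeps the d final sets close in total work.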
Several crucial steps of Algorithm 1, namely steps 3-8, require further explanation. Step 3 describes how the sequence reads are partitioned into subsets, ensuring that they are divided as evenly as possible.
The components of Step 3 are S, the set of all sequence reads; S′, the set of partitioned subread sets; j, the number of divided sequence read sets; s, the total number of sequence reads; and S′_i, the i-th subread set.

One part of the formula, {S′_i | 1 ≤ i ≤ j − 1, |S′_i| = ⌊s/j⌋ + 1}, describes the first j − 1 subread sets. Each of these sets S′_i (for 1 ≤ i ≤ j − 1) contains ⌊s/j⌋ + 1 sequence reads; the floor function, with 1 added to each subset, effectively balances the distribution. The other part of the formula, {S′_i | i = j, |S′_j| = s − (j − 1)(⌊s/j⌋ + 1)}, describes the last subread set S′_j. The size of this last set is adjusted to ensure that the total number of sequence reads is accounted for: it is calculated as s minus the total number of reads allocated to the first j − 1 sets, which is (j − 1)(⌊s/j⌋ + 1).
Step 3 has two main characteristics. One is balancing subsets: the formula ensures that the first j − 1 subsets are filled with a nearly equal number of sequence reads, each receiving ⌊s/j⌋ + 1 reads, where ⌊s/j⌋ is the integer division of s by j. The other is adjusting the last subset: the last subset S′_j contains the remaining reads, which may not divide evenly among the other subsets. This ensures that all reads are included in the partitioning without any being left out.
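The two parts of the Step 3 formula can be checked numerically. A minimal sketch (the helper name is ours):

```python
def subread_sizes(s, j):
    """Sizes |S'_i| produced by Step 3: the first j - 1 subsets get
    floor(s/j) + 1 reads each; the last absorbs the remainder."""
    base = s // j + 1                       # floor(s/j) + 1
    return [base] * (j - 1) + [s - (j - 1) * base]

print(subread_sizes(10, 3))  # [4, 4, 2]: all 10 reads are accounted for
```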
Step 4 involves sorting the partitioned sets of sequence reads based on the longest length of the subsequence reads within each set. We sort S′ in decreasing order: the subsets S′_i are ordered such that S′_i ≥ S′_{i+1} based on the length of the longest sequence read in each subset. This means the subset with the longest sequence read comes first, the subset with the next longest sequence read comes second, and so on. Let l(S′_i) denote the length of the longest sequence read in subset S′_i; the sorted subsets then satisfy l(S′_1) ≥ l(S′_2) ≥ … ≥ l(S′_j).
Steps 5-8 describe the final steps of the algorithm for partitioning and balancing the sequence reads based on their complementation times. These steps involve iterating through the final sets, sorting them, and merging them to ensure a balanced load distribution. The steps first perform an initialization: we ensure that each final set S′′_i is not empty and is a subset of the sorted sets S′. This is followed by the iterative merging and sorting of the final sets. In the iteration process, we sort the final sets {S′′_1, S′′_2, …, S′′_{d−1}, S′′_d} in increasing order based on the longest sequence read length l(S′′_k), ensuring that the total complementation times are in increasing order: L_1 ≤ L_2 ≤ … ≤ L_{d−1} ≤ L_d. Then, we merge sequence reads from {S′_k} into {S′′_k}: S′′_k = S′′_k ∪ S′_k. Finally, we update the total complementation time by L_k = L_k + l(S′_k). These steps ensure that the sequence reads are efficiently and effectively balanced across the final sets, minimizing the total complementation time and achieving an optimal distribution of the computational load. This approach maximizes the performance and efficiency of sequence alignment tasks in high-performance computing environments.
According to the discussions above, the proposed LF method is a parallel approach for sequence read partitioning based on the average k-mer distance and the longest length of the obtained sequence read sets. The implementation targets heterogeneous computing environments where numerous tasks can be performed in parallel. Algorithm 1 starts by calculating the k-mer distances between sequences in parallel and then ranking the sequences in decreasing order. The sequence reads are then partitioned into subread sets according to the value of j. Finally, the subread sets are sorted in descending order by sequence read length, and the final partitioning loop runs d times to ensure that the tasks are equally distributed among the processing units. These configurations ensure the efficiency and synchronization of the implementation process.

Enhanced Partitioning Algorithm in the Homogeneous Phase
As shown in Figure 3, only two genome datasets are involved in the sequence alignment procedure: S represents a short-read dataset and G represents a long genome reference. S_x is the individual dataset of short reads at node x. S_xy corresponds to a particular block of the short reads S_x, and s is the total number of blocks of S_x. G_x is the corresponding genome reference at node x. G_xy corresponds to a particular replica of the genome reference G_x, and g is the number of replicas of G_x. Therefore, the short-read subsets in node x are referred to as S_x1, S_x2, …, S_xy, …, S_xs, and the genome reference replicas in node x are referred to as G_x1, G_x2, …, G_xy, …, G_xg in Figure 3. Within the sequence alignment procedure, three pivotal steps can be identified: establishing, mapping, and gathering. These steps are summarized as follows.

The majority of the total time in sequence alignment is typically allocated to the establishing and mapping stages. It is noteworthy that the mapping stage necessitates a substantial allocation of computational resources and entails the most significant computing cost. The generation and dissemination of the index file can be performed in advance and shared among multiple instances during the experiments. Thus, the overall execution time of the three stages, denoted as T = T(G_xy) + ∑ T(R_xy) + T(R), can be simplified to T = ∑ T(R_xy) in general scenarios.
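The simplification of the execution-time model can be stated as a short numerical check. In this sketch, T(G_xy) is the establishing (index-building) time, the T(R_xy) are the per-block mapping times, and T(R) is the gathering time; the gathering term is folded in with the establishing term since both drop out in the simplified model, and all timing figures are invented for illustration:

```python
def total_time(t_index, t_map_blocks, t_gather, index_prebuilt):
    """T = T(G_xy) + sum of T(R_xy) + T(R); with a prebuilt, shared index
    the establishing and gathering terms drop, leaving T = sum of T(R_xy)."""
    t = sum(t_map_blocks)
    if not index_prebuilt:
        t += t_index + t_gather
    return t

t_map = [12.0, 11.5, 12.4, 11.9]   # seconds per read block (illustrative)
print(total_time(300.0, t_map, 2.0, index_prebuilt=False))  # full pipeline
print(total_time(300.0, t_map, 2.0, index_prebuilt=True))   # simplified T
```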
A range of s and g values is taken into consideration for a comprehensive view, which is detailed further in the following section.
(1) Case 1: s = g = 1. This case covers the native implementation of the original structure of a sequence aligner built for mapping complete reads to a single genome reference.
(2) Case 2: s = 1, g > 1. This case conducts the alignment of the shared complete reads against a set of g replicas of the reference genome. The use of many replicas of the genome reference inside a NUMA architecture system offers benefits in terms of system architecture, as this design incorporates a large number of memory blocks.
(3) Case 3: s = 1, 0 < g < 1. This case maps the complete reads to a subsequence of the genome reference, specifically 1/g of the reference sequence. The genome reference is partitioned into 1/g segments, which are separately aligned with the whole set of read data.
(4) Case 4: s > 1, g = 1. This case is implemented by dividing the reads into s subsets, which are mapped against a complete replica of the genome reference. The dataset is partitioned into s segments, and a complete genome reference is distributed across the threads residing in memory.
(5) Case 5: s > 1, g > 1. This case is implemented with s subsets of reads being mapped to g replicas of the genome reference. The short reads and the genome reference are divided or duplicated into s segments and g replicas, respectively. Each partitioned subread is mapped against a replicated genome reference, resulting in the configuration s = g. Thus, this case is suitable for large-memory systems.
(6) Case 6: s > 1, 0 < g < 1. This case is implemented by mapping s subsets of the reads to 1/g subsequences of the genome reference. The short reads and the genome reference are divided into s and 1/g segments, respectively. Each partitioned subread is mapped against a genome segment, resulting in the configuration s = 1/g. Therefore, this case is suitable for memory-sensitive systems.
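The six (s, g) configurations above can be summarized in a small dispatch helper. The sketch below is illustrative; `plan_alignment` is a hypothetical name for this description, not part of the published implementation.

```python
def plan_alignment(s, g):
    """Classify an (s, g) configuration into one of the six cases described
    above. s is the number of read subsets; g is the number of genome replicas
    (g > 1) or the genome fraction (0 < g < 1)."""
    if s == 1 and g == 1:
        return "case 1: native single-aligner run"
    if s == 1 and g > 1:
        return f"case 2: whole reads vs {int(g)} genome replicas"
    if s == 1 and 0 < g < 1:
        return f"case 3: whole reads vs {round(1 / g)} genome segments"
    if s > 1 and g == 1:
        return f"case 4: {s} read subsets vs one genome replica"
    if s > 1 and g > 1:
        return f"case 5: {s} read subsets vs {int(g)} replicas (large-memory systems)"
    if s > 1 and 0 < g < 1:
        return f"case 6: {s} read subsets vs {round(1 / g)} segments (memory-sensitive systems)"
    raise ValueError("unsupported (s, g) configuration")
```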
Table 2 presents a taxonomy of the alignment approaches within the context of the EP algorithm, categorized by the number of read subsets and genome replicas involved. Additionally, Table 2 displays operational descriptions and illustrative applications in genomics. The examples in Cases 1-4 provide a comprehensive overview of the steps in the EP method evaluation and illustrate its effectiveness in performance. They also establish a baseline for more detailed comparison and analysis in the experiments. Cases 5 and 6 are discussed to illustrate the best-case scenarios for the EP approach in terms of its implementation quality in a heterogeneous system with distinctive architectural features. The pseudocode of the EP algorithm is provided in Algorithm 2. Such an algorithmic framework is typical in bioinformatics because it facilitates the efficient processing of large-scale sets of sequences. It is a scalable and parallelizable methodology applicable to the complex datasets commonly associated with genomic research.
Algorithm 2. The EP algorithm.
Input:
  1. reads: the set of reads to be aligned.
  2. genome_reference: the genome reference used for alignment.
  3. num_genome_partitions g: the number of replicas/partitions for the genome reference.
Output:
  alignment_results: the alignment results for the reads against the genome reference.
Steps:
  a. Initialize an array alignment_results for storing results.
  b. Directly align reads to the genome using a standard alignment function.
  c. Depending on the value of num_genome_partitions, either replicate the genome using replicateGenome() or partition it using partitionGenome().
  d. Loop through each combination of reads and genome_partitions.
  e. Store the results in alignment_results.
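The steps above can be sketched in Python as follows. Here `align`, `replicate_genome`/`partition_genome` (standing in for the replicateGenome()/partitionGenome() calls), and `partition_reads` are caller-supplied helpers, and the extra `num_read_partitions` parameter for s is our addition for completeness, since the source does not specify these details.

```python
def ep_align(reads, genome_reference, num_read_partitions, num_genome_partitions,
             align, replicate_genome, partition_genome, partition_reads):
    """Illustrative sketch of the EP algorithm (Algorithm 2)."""
    alignment_results = []
    s, g = num_read_partitions, num_genome_partitions
    if s == 1 and g == 1:
        # Case 1: align the whole read set against the whole reference.
        return [align(reads, genome_reference)]
    # Replicate (g > 1) or partition (0 < g < 1) the genome reference.
    if g >= 1:
        genome_parts = replicate_genome(genome_reference, int(g))
    else:
        genome_parts = partition_genome(genome_reference, round(1 / g))
    read_subsets = partition_reads(reads, s)
    # Map every read subset against every genome replica/segment and
    # store each sub-result in alignment_results.
    for subset in read_subsets:
        for part in genome_parts:
            alignment_results.append(align(subset, part))
    return alignment_results
```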

Experimental Environment
In general, the data utilized in our research are derived from the international 1000 Genomes Project, which commenced in January 2008. We specifically selected the human genome (ID: HS37D5) as our reference sequence, which can reach a size of up to three gigabytes. For one of our natural short-read datasets, we employed the SRR766060 paired-end genome, with each segment measuring approximately 8.7 GB in size. Furthermore, we incorporated the latest versions of four widely recognized benchmark datasets for alignment quality evaluation: BALIBASE, OXBENCH, SABRE, and PREFAB. Additionally, for the speedup evaluation and sequence complexity experiments, we utilized sequence datasets obtained from several widely used read simulation tools in bioinformatics, such as randDNA and wgsim. These datasets contributed significantly to obtaining reliable outcomes throughout the experiments we conducted.
An overview of the experimental environment and configurations is shown in Table 3. Several additional widely used FM algorithms, SOAP2, BOWTIE2, BWA, and GEM3, were used in this research to enable a comprehensive comparison across several significant dimensions. The experiments were conducted on two different NUMA nodes, Intel Xeon and AMD Opteron, which are commonly used in heterogeneous systems. These NUMA nodes support four processors with 32 cores, 64 threads, and 128 GB of memory. Although each processor has four NUMA nodes, the cores are interconnected differently. The AMD Opteron 6376 utilizes a fully connected core network, while the Intel Xeon E5-4620 employs a ring structure with circular connections among its cores, as shown in Figure 1.
Table 3. Details of the configurations in the experiments.

Components                  Details
Genome Reference            Human genome (ID: HS37D5) from the international 1000 Genomes Project; size up to 3 GB.
Short-Read Data             SRR766060 paired-end genome; each segment is 8.7 GB in size.
Sequence Simulation Tools   randDNA and wgsim.
Experimental Setup          Two NUMA nodes: Intel Xeon and AMD Opteron.
Processor Details           AMD Opteron 6376: fully connected core network. Intel Xeon E5-4620: ring with circular connections.

Quality Assessment of the Alignment Results
In this section, we compare the LF-EP-based FM aligner's performance with the original FM aligner and other state-of-the-art FM algorithms such as SOAP2, BOWTIE2, BWA, and GEM3. We use the BALIBASE, OXBENCH, SABRE, and PREFAB benchmarks to evaluate alignment quality. These benchmarks are highly respected in sequence alignment because they effectively evaluate aligner performance with respect to different features of alignment accuracy. Our aim is to assess aligner effectiveness across multiple benchmarks under various scenarios and sequence attributes. The benchmarks have been converted to the universally accepted FASTA format (as short-read sequences), which facilitates the subsequent analytical processes.
In our evaluation, we use two accuracy metrics: the Quality Score (QS) and the Total Column Score (TCS). The QS measures the fraction of correctly aligned residue pairs relative to the total number of residue pairs in the alignment. The TCS measures the fraction of correctly aligned columns relative to the total number of columns in the alignment, reflecting how well the aligners capture the structural features of the sequences. The algorithms and benchmarks were selected to represent a broad spectrum of sequence alignment methodologies, allowing for a global assessment of the aligners' performance. Since all tested aligners are based on the same underlying FM algorithm, they might be expected to show only slight differences in result quality. The absolute QS and TCS values are dimensionless; only their relative ratios are of interest.
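As a concrete reading of these definitions, the sketch below computes QS and TCS from sets of residue pairs and alignment columns. The function names and the data representation are our own illustrative assumptions, not the scoring code of any benchmark suite.

```python
def quality_score(predicted_pairs, reference_pairs):
    """QS: fraction of the reference residue pairs that the predicted
    alignment recovers. Each pair is a hashable tuple, e.g.
    ((seq_i, pos_i), (seq_j, pos_j))."""
    if not reference_pairs:
        return 0.0
    return len(set(predicted_pairs) & set(reference_pairs)) / len(reference_pairs)

def total_column_score(predicted_columns, reference_columns):
    """TCS: fraction of the reference columns reproduced exactly by the
    predicted alignment. Each column is a sequence of residues."""
    if not reference_columns:
        return 0.0
    ref = set(map(tuple, reference_columns))
    hits = sum(1 for col in predicted_columns if tuple(col) in ref)
    return hits / len(reference_columns)
```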
Table 4 illustrates the QS and TCS values computed for the different benchmarks employed in the present study. Since the QS and TCS values are dataset-dependent, we assess their relative values to provide a comparative assessment of the effectiveness of the different sequence aligners. From the analysis in Table 4, the proposed LF-EP-based FM aligner surpassed the rest, with the highest scores in both QS and TCS. The evaluation results reveal that the LF-EP-based FM aligner is among the most reliable and accurate FM aligners compared with previous aligners, signifying the benefit of integrating the LF-EP algorithm into the FM aligner framework. Furthermore, SOAP2 was observed to perform at almost the same level as the LF-EP-based FM aligner, albeit slightly lower. The QS and TCS scores of SOAP2 are the second highest across most benchmarks, indicating that this tool is quite promising and could improve its alignment performance with further development. Additionally, BOWTIE2 shows a small decrease in quality compared to SOAP2, suggesting it might be suboptimal under strict quality standards. BWA shows a similar drop in performance to BOWTIE2 but maintains impressive QS and TCS scores consistently throughout the evaluations. This finding positions BWA as a practical choice depending on particular task needs and circumstances. Conversely, GEM3 displays the least impressive performance among the evaluated FM aligners. Although adequate for some tasks, its effectiveness is restricted in situations where top-tier accuracy is essential.
Therefore, the LF-EP-based FM aligner introduced in this research outperforms the other popular FM aligners across the four benchmarks, demonstrating its excellent reliability and precision. This finding substantiates the efficacy of implementing the LF-EP algorithm. Although SOAP2 demonstrates robustness and dependability, there is still room for improvement. Furthermore, the evaluation indicates that the performance quality of BOWTIE2, BWA, and GEM3 decreases progressively. Therefore, their suitability may fluctuate with the changing needs of specific tasks, ranging from high to lower quality levels.

Sequence Complexity Analysis
In biological sequences, including DNA, RNA, and protein sequences, the average pairwise distance refers to the average genetic variation among all possible pairs within a particular set of sequences. It is used to show similarities or differences among a group of sequences. This complexity is examined by testing datasets of 1000 sequences generated using the randDNA program, with the average pairwise distance between sequences varied from 50 to 500. The experiments are conducted on the selected datasets using the proposed algorithm along with other popular FM aligners, namely BWA, SOAP2, BOWTIE2, and GEM3. The speedup factor is determined with reference to the performance of the BWA aligner in processing sequences with an average pairwise distance of 50.
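For intuition, the following minimal sketch computes an average pairwise distance over a sequence set, using the Hamming distance between equal-length reads as a stand-in; the exact distance measure used in the randDNA-generated experiments may differ.

```python
from itertools import combinations

def hamming(a, b):
    """Hamming distance for equal-length sequences (a simple stand-in
    for the pairwise genetic distance used in the experiments)."""
    return sum(x != y for x, y in zip(a, b))

def average_pairwise_distance(seqs):
    """Mean distance over all unordered pairs in the sequence set."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)
```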
The runtimes of the different FM aligners on sequence datasets in the heterogeneous system of Intel Xeon and Intel Xeon Phi are depicted in Figure 4. As the sequence distance increases, the speedup factor decreases for all aligners; the acceleration effect weakens as the sequence distance grows. The LF-EP-based FM aligner exhibits the largest speedup factor, reaching 4.4× at a sequence distance of 50, which demonstrates that the LF-EP-based FM aligner has a substantial accelerating effect on short-distance sequences. Additionally, the LF-EP-based FM aligner performs proficiently even at longer distances such as 300, 400, and 500. It consistently displays the highest speedup factor across all sequence distances, confirming its suitability for managing high-complexity data. Overall, GEM3 and BOWTIE2 demonstrate lower speedup factors than the LF-EP-based FM aligner. Most of their speedup factors are below 3, indicating less significant speed improvements. Therefore, GEM3 and BOWTIE2 may require further optimization to improve their speedup factors on high-complexity data. At shorter sequence distances (50 and 100), SOAP2 and BWA have speedup factors below 1, indicating that their speed is comparable to the original speed. Nonetheless, their speedup factors decrease as the sequence distance increases. Thus, SOAP2 and BWA are appropriate for shorter sequence distances, but their performance may face speedup limitations as the distance increases.
In conclusion, the research findings indicate that different aligners respond differently to variations in sequence distance, which affects their performance capabilities. The LF-EP-based FM aligner has been proven to be more efficient than SOAP2 in aligning sequences of both short and long distances. SOAP2, however, consistently demonstrates the least effective performance across all sequence distances. The BOWTIE2 algorithm exhibits a pronounced decline in its speedup as the sequence distance increases. The LF-EP-based FM aligner is the optimal tool for conducting complex sequence alignments on the Xeon-Xeon Phi heterogeneous system. It consistently demonstrates superior acceleration across various read lengths and is particularly well suited for large-scale datasets encompassing a mix of short and long sequences within a heterogeneous computing environment. The LF-EP-based FM aligner is an optimal solution for the management of a diverse range of sequence data due to its remarkable scalability.

Speedup Evaluation of FM Aligners
To accurately assess the speedup provided by the LF-EP-based FM aligner, we analyzed several sequence datasets. Sequence simulation tools, such as randDNA and wgsim, were used to generate these artificial datasets. To provide a complete picture of the speedup evaluation of these aligners, we included sequences with average lengths ranging from 50 to 500, as well as a varied number of sequences ranging from 100 to 10,000.
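To illustrate how such artificial datasets can be produced, the snippet below generates random DNA reads around a target average length. It is a toy stand-in written for this description, not the actual behavior or output format of randDNA or wgsim.

```python
import random

def rand_dna(num_seqs, avg_len, seed=0):
    """Generate a toy dataset of random DNA reads whose lengths vary
    around avg_len (illustrative stand-in for a read simulator)."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    reads = []
    for _ in range(num_seqs):
        length = max(1, int(rng.gauss(avg_len, avg_len * 0.1)))
        reads.append("".join(rng.choice("ACGT") for _ in range(length)))
    return reads
```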
In Figure 5, we provide a detailed comparison of the speedup achieved with the LF-EP-based FM aligner versus the original FM aligner, which was implemented in native mode. In each instance, we calculate the speedup as the runtime of the original FM aligner divided by that of the LF-EP-based FM aligner. Our performance analysis of the proposed algorithm shows an excellent speedup of up to 24.2×. The speedup decreases as the average sequence length increases because the computational complexity of the FM aligner depends on sequence length. As the analyzed sequences become longer, the demand for physical memory and computing power grows, resulting in a performance bottleneck for the FM aligner implementation in such a heterogeneous architecture. Furthermore, our study investigates the performance differences between the LF-EP-based FM aligner and the multi-threading FM aligner. This further analysis enables us to understand the relative strengths and limitations of each approach. Figure 6 gives a valuable overview of the speedup achieved by the LF-EP-based FM aligner relative to the multi-threading FM aligner variant; in both cases, the speedup is measured against the performance of the original FM aligner.
Three sequence datasets were generated using sequence simulation tools in the experiments, each varying in average sequence length and number of sequences. The first dataset had an average length of 50, the second an average length of 200, and the third an average length of 500. Each dataset contained between 100 and 10,000 sequences. The results shown in Figure 6 indicate that both the LF-EP-based FM aligner and the multi-threading FM aligner demonstrate remarkable speedup for the dataset with an average sequence length of 50. A 24.2× speedup is obtained for the LF-EP-based aligner in the heterogeneous system, while the multi-threading FM aligner achieves 16.3×. These results indicate substantial benefits of the LF-EP-based aligner, leading to a greatly reduced execution time compared to the multi-threading alignment.
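The speedup figures quoted above follow the usual definition of baseline runtime divided by optimized runtime, as in this small helper; the runtimes in the test are illustrative values, not measurements from the experiments.

```python
def speedup(baseline_seconds, optimized_seconds):
    """Speedup factor: baseline runtime divided by optimized runtime,
    as used for the original-vs-LF-EP comparisons above."""
    if optimized_seconds <= 0:
        raise ValueError("optimized runtime must be positive")
    return baseline_seconds / optimized_seconds
```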
Performance differs more markedly when analyzing the other datasets, whose average sequence lengths are 200 and 500. In these cases, although the speedup remains impressive, the LF-EP-based FM aligner attains a lower value of 8.9×. The drop in speedup may be attributed to various factors, particularly the complexity of longer sequences and the corresponding increase in computational demands. The efficacy of our parallel sequence alignment approach is apparent in its enhanced performance on sequences of moderate length, such as 50 or 200, but intrinsic constraints emerge when examining longer and more intricate data. The approach demonstrates significant improvements in computational efficiency when tested on artificially generated datasets with average sequence lengths from 50 to 200 and sequence counts from 100 to 1000. However, when the datasets exceed 200 in average length or 1000 in number, the expected improvement in computational performance becomes less significant, although our approach still shows some potential on larger datasets. In summary, our parallel approach demonstrates significant potential under specific circumstances but requires further refinement to be applicable to a wide range of sequence lengths and varied numbers of sequences.

Conclusions and Future Works
In this research, we developed and assessed an optimized approach to FM-based short-read alignment on heterogeneous NUMA architectures. The novel integration of the longest-first algorithm and the enhanced partitioning technique significantly enhances load balancing and data parallelism, addressing the computational challenges of large-scale sequence alignment.

Figure 1. Two common NUMA architectures employed by Intel Xeon and AMD Opteron processors.

Figure 2. Two memory access modes in NUMA architectures.



Figure 3. The brief workflow of sequence alignment. (1) Establishing stage: the aligner builds the index file for the genome reference G_xy; the wall time of this stage is denoted as T(G_xy). (2) Mapping stage: the aligner obtains an alignment sub-result R_xy = map(S_xy, G_xy) by mapping short read S_xy against genome reference G_xy; the wall time of this stage is denoted as T(R_xy). (3) Gathering stage: the aligner collects the sub-results from stage 2 and performs the necessary configuration to merge the final output R = ∑ R_xy; the wall time of this stage is denoted as T(R).

Figure 5. Performance comparison between the original FM aligner and the LF-EP-based FM aligner.


Future Internet 2024, 16, x FOR PEER REVIEW

Table 1. An illustration of SA and BWT computations based on a single GR sequence.

Table 2. Summary of alignment cases based on different configurations of s and g.

Table 4. Quality of various FM aligners.