An O(log2 N) Fully-Balanced Resampling Algorithm for Particle Filters on Distributed Memory Architectures

Resampling is a well-known statistical algorithm that is commonly applied in the context of Particle Filters (PFs) in order to perform state estimation for non-linear non-Gaussian dynamic models. As the models become more complex and accurate, the run-time of PF applications becomes increasingly slow. Parallel computing can help to address this. However, resampling (and, hence, PFs as well) necessarily involves a bottleneck, the redistribution step, which is notoriously challenging to parallelize if using textbook parallel computing techniques. A state-of-the-art redistribution takes O((log2 N)^2) computations on Distributed Memory (DM) architectures, which most supercomputers adopt, whereas redistribution can be performed in O(log2 N) on Shared Memory (SM) architectures, such as GPUs or mainstream CPUs. In this paper, we propose a novel parallel redistribution for DM that achieves an O(log2 N) time complexity. We also present empirical results that indicate that our novel approach outperforms the O((log2 N)^2) approach.


Introduction

Motivation
In several modern applications, it is often necessary to estimate the state of a system, given a mathematical model for the system and a stream of noisy observations. Particle Filters (PFs) are typically used in this context. The key idea is to sample N hypotheses (i.e., particles) from an arbitrary proposal distribution to approximate the probability density function (pdf) of the true state. However, at some point, the particles experience a numerical error, called particle degeneracy, which makes the estimates diverge from the true state. A resampling algorithm is then applied to correct for degeneracy by replacing the particles that are diverging from the true state with copies of the particles that are not doing so [1]. This sampling-resampling approach is highly flexible, such that PFs find application in a wide range of fields, ranging from machine learning [2] to medical research [3], fault prediction [4], weather forecasting [5], tracking [6] or, broadly speaking, any domain involving decision making in response to streaming data. Resampling is also used in other Monte Carlo methods, such as sequential Monte Carlo samplers [7,8] and PHD filters [9]. For brevity, this paper focuses exclusively on PF contexts.
Modern efforts to make models more detailed have translated into an increasing demand for more accurate PFs. This demand can be satisfied in several ways, ranging from applying better proposal distributions [10] to collecting more measurements [11] and using more particles [12,13]. Using more particles is especially important in settings where we are more interested in computing the probability that the true state falls within a certain state-space region, rather than simply estimating its mean [14]. However, the likely side-effect of any of these approaches is a significant increase in run-time, which may be critical, especially in real-time applications [12]. Parallel computing becomes necessary in order to compensate for this side-effect.

Problem Definition and Related Work
Although the particles can be sampled in an embarrassingly parallel fashion, resampling is hard to parallelize globally, i.e., in such a way that the result for P > 1 processing units (or cores, as referred to in this paper) is identical to that achieved using a single core. This is because of the difficulties in parallelizing the constituent redistribution step, whose textbook implementation achieves an O(N) time complexity on one core. In order to bypass the need to parallelize resampling, one could use multiple PFs in parallel, each performing resampling locally, i.e., with each core performing resampling independently without considering the content of the other cores' particles. However, this approach has shown accuracy, scalability and model-dependent applicability issues [15-17].
On Shared Memory (SM) architectures, it has been shown that a global redistribution of the particles takes O(log2 N) computations by using a static load balancing approach, in which all cores independently perform up to N binary searches in order to achieve a perfect workload balance. Examples are found in [12,13,18,19] on GPUs and mainstream CPUs. High Performance Computing (HPC) applications, however, need to use Distributed Memory (DM) architectures to overcome the limitations of modern SM, namely low memory capacity and a limited Degree Of Parallelism (DOP).
On DM, parallelization is more complicated, as the cores cannot directly access the other cores' memory without exchanging messages. Three master-worker solutions for DM (along with mixed versions of them) are presented in [20]: Centralized Resampling (C-R), Resampling with Proportional Allocation (RPA) and Resampling with Non-proportional Allocation (RNA). C-R performs resampling globally, but here a central unit gathers the particles from all cores, performs redistribution sequentially and scatters the result back to the cores, making this algorithm scale as O(N). RPA also performs global resampling, but the network topology is randomized, as the redistribution is partly or potentially entirely (worst-case) centralized to one or a few cores, leading to a strongly data-dependent run-time and an O(N) time complexity in the worst case. RNA has a simpler central unit and communication pattern than RPA, but sacrifices accuracy, as local resampling is performed. For the routing, RNA could use RPA or a ring topology, where the cores cyclically exchange a user-defined number of particles with their neighbors, though such an approach risks redundant communication. These master-worker approaches have been used, re-interpreted or mixed in recent work, such as [17,21-24]. In [17,24,25], it is shown that such strategies may have accuracy or scalability issues, especially for a highly unbalanced workload, large N or high DOP. In this paper, we therefore consider only fully-balanced solutions, which we define as follows.
Definition 1. A fully-balanced redistribution meets the following requirements:

• All cores perform the same pre-agreed tasks (i.e., no central unit(s) are involved) to balance the workload evenly;
• The number of messages for the load balancing is data-independent in order to guarantee a stable run-time, as often required in real-time applications;
• The redistribution of the particles is performed globally in order to ensure the same output as sequential redistribution and that no speed-accuracy trade-off is made when the DOP increases.
In [26], it has been shown that redistribution can be parallelized in a fully-balanced fashion on DM by using a divide-and-conquer approach that recursively sorts and splits the particles. Since Bitonic Sort [27-29] is performed O(log2 N) times, the achieved time complexity is O((log2 N)^3). In [17], sort is also employed recursively in a dynamic scheduler for RPA/RNA. In [30], the time complexity is reduced to O((log2 N)^2) by showing that Bitonic Sort is only needed once. In [25], this new idea is ported from MapReduce to the Message Passing Interface (MPI) and then further optimized in [7]. However, the data movement in this fully-balanced redistribution is still the bottleneck, especially for high DOP.

Our Results
This paper proposes a novel fully-balanced approach for DM that achieves an O(log2 N) time complexity and improves on the redistribution algorithms described in [7,25]. Our experimental results demonstrate that our novel redistribution is approximately eight times faster than the O((log2 N)^2) ones on a cluster of 256 cores.
The rest of the paper is organized as follows: in Section 2, we briefly describe PFs, with a particular emphasis on resampling (and redistribution). In Section 3, we give brief information about DM and MPI. In Section 4, we describe our novel parallel redistribution in detail and include proof of its algorithmic complexity. In Section 5, we show the numerical results for redistribution first, and then for a PF example. In Section 6, we outline our conclusions and give recommendations for future work.

Sequential Importance Resampling
In this section, we briefly describe Sequential Importance Resampling (SIR), one of the most common variants of PFs [1].
PFs employ the Importance Sampling (IS) principle to estimate the true state X_t ∈ R^M of a system. IS uses an arbitrary proposal distribution q(X_t|X_t−1) to randomly generate x_t ∈ R^(N×M), a population of N statistically independent hypotheses for the state, called particles. Each particle x_t^i is then weighted by an unnormalized importance weight w_t^i ∈ R, ∀i = 0, 1, ..., N − 1, which is computed based on the resemblance between x_t^i and X_t. In this way, x_t approximates the posterior of the true state.
To weight each particle correctly, at every time step t, we collect Y_t ∈ R^(M_y), a set of measurable quantities that are related to X_t by some pdf. At the initial time t = 0, no measurement has been collected, so the particles are initially drawn from the initial distribution q_0(X_0) = p_0(X_0) and weighted equally as 1/N. Then, for any time step t = 1, 2, ..., T_PF, measurements are collected, and each particle is drawn from the proposal distribution as follows:

x_t^i ∼ q(X_t|x_t−1^i) (1)

and weighted by

w_t^i = w_t−1^i · p(Y_t|x_t^i) p(x_t^i|x_t−1^i) / q(x_t^i|x_t−1^i) (2)

where p(x_t^i|x_t−1^i) and p(Y_t|x_t^i) are known from the model. The IS step performs (1) and (2) in sequence. Then, to effectively represent the pdf of the state, the weights are normalized as follows:

w̃_t^i = w_t^i / Σ_{j=0}^{N−1} w_t^j (3)

However, the side effect of using IS is degeneracy, a phenomenon that (within a few iterations) makes all weights but one decrease towards 0. The variance of the weights is indeed proven to increase at every t [1]. The most popular strategy for tackling degeneracy is to perform resampling, a task that regenerates the particle population by deleting the particles with low weights and duplicating those with high weights. In SIR, resampling is performed if the Effective Sample Size (ESS)

ESS_t = 1 / Σ_{i=0}^{N−1} (w̃_t^i)^2 (4)

decreases below an arbitrary threshold, which is commonly set to N/2.
Several resampling schemes exist [31] and can be described as comprising three steps. The first step is to process the normalized weights w̃ to generate ncopies ∈ Z^N, whose i-th element, ncopies_i, defines how many times the i-th particle must be copied, such that

Σ_{i=0}^{N−1} ncopies_i = N (5)

The second step is redistribution, which involves duplicating each particle x_i as many times as ncopies_i. A textbook redistribution is described by Algorithm 1, which we refer to as Sequential Redistribution (S-R). This algorithm achieves an O(N) time complexity on a single core as (5) holds. A practical example for N = 8 particles is shown in Figure 1. In the final step of resampling, all weights are reset to 1/N.

Algorithm 1 Sequential Redistribution (S-R)
Input: x, ncopies, N
Output: x_new
1: z ← 0
2: for j ← 0; j < N; j ← j + 1 do
3:   for k ← 0; k < ncopies_j; k ← k + 1 do
4:     x_new_z ← x_j, z ← z + 1
5:   end for
6: end for

Figure 1. S-R example for N = 8. Each x_i is actually a real vector, but is marked with a capital letter for brevity.
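As a concrete illustration, the S-R loop can be sketched in Python (a single-core sketch; the function name and example values are ours, not from the paper):

```python
def sequential_redistribution(x, ncopies):
    """Textbook O(N) redistribution: copy particle x[j] ncopies[j] times."""
    n = len(x)
    x_new = [None] * n
    z = 0
    for j in range(n):
        for _ in range(ncopies[j]):
            x_new[z] = x[j]
            z += 1
    return x_new

# Example for N = 8 (particles marked with letters, as in Figure 1):
x = ["A", "B", "C", "D", "E", "F", "G", "H"]
ncopies = [0, 3, 1, 0, 0, 2, 0, 2]  # sums to N = 8, as (5) requires
print(sequential_redistribution(x, ncopies))
# ['B', 'B', 'B', 'C', 'F', 'F', 'H', 'H']
```

Note that the output size equals the input size precisely because the copy counts sum to N.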
We note that all resampling schemes require redistribution, independently of the chosen strategy to perform the first step. Since we focus on proposing a novel fully-balanced redistribution, to perform the first step, we focus on using systematic resampling [31], which is also known as Minimum Variance Resampling (MVR) [7,25,26,30]. The key idea of MVR is to first compute the cumulative density function of w̃_t, cdf ∈ R^(N+1), as follows:

cdf_0 = 0, cdf_{i+1} = cdf_i + w̃_t^i, ∀i = 0, 1, ..., N − 1 (6)

A random variable u ∼ Uniform[0, 1) is then sampled, such that each ncopies_i can then be calculated as follows:

ncopies_i = ⌈N · cdf_{i+1} − u⌉ − ⌈N · cdf_i − u⌉ (7)

where the brackets represent the ceiling function (e.g., ⌈3.3⌉ = 4). After resampling, a new estimate of X_t is computed as

X̂_t = (1/N) Σ_{i=0}^{N−1} x_t^i (8)
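A minimal single-core sketch of MVR, assuming the common formulation ncopies_i = ⌈N·cdf_{i+1} − u⌉ − ⌈N·cdf_i − u⌉ (function and variable names are ours):

```python
import math
import random

def minimum_variance_resampling(weights, u=None):
    """Systematic/MVR resampling: map normalized weights to integer
    copy counts that sum to N. One common formulation; notation ours."""
    n = len(weights)
    if u is None:
        u = random.random()  # u ~ Uniform[0, 1)
    # cdf[i] = sum of the first i weights, so cdf[0] = 0 and cdf[n] = 1
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w)
    return [math.ceil(n * cdf[i + 1] - u) - math.ceil(n * cdf[i] - u)
            for i in range(n)]

w = [0.05, 0.40, 0.05, 0.30, 0.10, 0.10]
ncopies = minimum_variance_resampling(w, u=0.5)
print(ncopies, sum(ncopies))  # [0, 3, 0, 2, 0, 1] 6
```

The telescoping difference of ceilings guarantees that the counts always sum to N, which is exactly property (5).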

Distributed Memory Architectures
DM is a type of parallel system that is inherently different to SM. In this environment, the memory is distributed over the cores, and each core can only directly access its own private memory. The exchange of information stored in the memory of the other cores is achieved by sending/receiving explicit messages through a common communication network (see Figure 2). DM provides several advantages over SM, such as: scalable and larger DOP; scalable and larger memory; memory contention (and other issues that stem from multiple cores accessing the same addresses in memory) not being a concern on DM. The main disadvantage of DM is the cost of communication and the consequent data movement. This can affect the speed-up relative to a single-core implementation.

Any Application Programming Interface (API) for DM could be used. We choose MPI, since its intuitive syntax means it is arguably the most popular API for DM. MPI is also used in several referenced works on parallel redistribution for DM [7,17,22,23,25]. In MPI, the P cores are uniquely identified by a rank p = 0, 1, ..., P − 1, connected via communicators, and use explicit communication routines, e.g., MPI_Sendrecv, to exchange messages of arbitrary size.
Most algorithms that we implement in this paper use the divide-and-conquer paradigm. Therefore, we recommend using a power-of-two number of cores to balance the communication between them. For the same reason, the number of particles N should also be a power-of-two. In this case, the particles x and every array related to them, e.g., w, ncopies etc., can then be evenly split between the cores. Every core then always owns exactly n = N/P elements of all arrays, whose global indexes are spread over the cores in increasing order. This means that, given a certain N, P pair, the i-th particle (where i ∈ [0, N − 1]) will always belong to the same core with rank p = ⌊i/n⌋. More precisely, while all equations in this paper use global indexes for simplicity, each core p actually uses a local index j ∈ [0, n − 1] to address arrays and achieve parallelism, knowing that each j corresponds to global index i = j + pn. The space complexity is then O(N/P) for any P ≤ N.
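The global-to-local index mapping can be sketched as follows (a small illustration; function names and the N, P values are ours):

```python
# Even partition of N particles over P cores (both powers of two):
# core p owns local indices j = 0..n-1, and global index i = j + p*n.
N, P = 16, 4
n = N // P

def owner_of(i, n):
    """Rank of the core holding global index i."""
    return i // n

def to_local(i, n):
    """(rank, local index) pair for global index i."""
    return i // n, i % n

print(owner_of(9, n))   # 2: global particle 9 lives on core 2
print(to_local(9, n))   # (2, 1): local index 1 on core 2
```

Because the mapping depends only on N and P, every core can compute any particle's owner without communication.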

Novel O(log2 N) Fully-Balanced Redistribution
In this section, we prove how it is possible to redistribute N particles in O(log2 N) parallel time on DM by using a novel fully-balanced redistribution algorithm, which we name Rotational Nearly Sort and Split (RoSS) redistribution. We also provide details on how to implement RoSS on MPI. We refer any reader who is interested in the description of the resulting algorithms to Algorithms 2-5.

Algorithm 2 Rotational Nearly Sort
Input: x, ncopies, N, P, n = N/P, p
Output: x, ncopies
1: Nearly sort x and ncopies locally and count the local zeros (S-NS, Algorithm 3)
2: shifts_p ← exclusive cumulative sum over zeros
3: if P < N and the bitwise AND of shifts_p and n − 1 is positive then (leaf stage)
4:   Send x_j, ncopies_j to partner, ncopies_j ← 0
5:   Accept or reject the received particles and shifts
6: end if
7: for k ← 1; k ≤ log2 P; k ← k + 1 do (tree stages)
8:   if the bitwise AND of shifts_p and n2^(k−1) is positive then
9:     Send x_j, ncopies_j to partner, ncopies_j ← 0
10:  end if
11:  Accept or reject the received particles and shifts
12: end for

Algorithm 3 Sequential Nearly Sort (S-NS)
Input: x, ncopies, n
Output: x, ncopies, zeros
1: l ← 0, r ← n − 1
2: for i ← 0; i < n; i ← i + 1 do
3:   if ncopies_i > 0 then
4:     x_new_l ← x_i, ncopies_new_l ← ncopies_i, l ← l + 1
5:   else
6:     x_new_r ← x_i, ncopies_new_r ← 0, r ← r − 1
7:   end if
8: end for
9: zeros ← n − l

General Overview
The algorithm consists of two phases. In the first phase, we want the elements in ncopies to be nearly sorted, a property which is defined as follows.
Definition 2. A sequence of N non-negative integers, ncopies, is nearly sorted in descending order when it has the following shape:

ncopies = [λ_0, λ_1, ..., λ_{m−1}, 0, 0, ..., 0] (9)

where ncopies_i > 0 (marked with λs in (9)) ∀i = 0, 1, ..., m − 1 and 0 ≤ m ≤ N. On the other hand, ncopies is an ascending nearly sorted sequence if the last m elements are positive and the first are 0.
In this paper, ncopies is nearly sorted in descending order. We also note that, as the elements in ncopies are progressively shifted to achieve (9), the related particles in x are consequently also shifted. The main purpose of this phase is to separate all particles that must be duplicated (i.e., those for which ncopies_i > 0) from those that must be deleted. Here, we prove that (9) can be achieved by using Rotational Nearly Sort, an O(log2 N) alternative to the O((log2 N)^2) Nearly Sort step in [7] (as also described in Appendix A). We note that (9) can also be achieved with sort, as carried out in [25] by using Bitonic Sort. In theory, one could also employ O(log2 N) sorting networks [32,33]. However, the constant factor of these networks is notoriously prohibitive in practice, even with some of the most recent advances [34]. Our constant factor is significantly smaller; because of that, we do not consider AKS-like sorting networks as a practical approach for (9).
In the second phase, we want to achieve two goals: the first is to make room on the right of each particle that has to be copied; the second is for the P cores to have the same number of particles in their private memory. The first goal easily translates to shifting the particles to the right until ncopies has the following new shape:

ncopies = [λ_0, 0, ..., 0, λ_1, 0, ..., 0, λ_{m−1}, 0, ..., 0] (10)

where, for each ncopies_i > 0 (again marked with λs in (10)), ncopies_i − 1 zeros follow.
The second goal can be expressed as follows:

Σ_{i=pn}^{(p+1)n−1} ncopies_i = N/P, ∀p = 0, 1, ..., P − 1 (11)

which is essentially (5) applied locally. In the next section, we prove it is possible to achieve both (10) and (11) in O(log2 N) parallel time by using a single algorithm, which we refer to as Rotational Split. After that, the task is completed by each core using S-R independently to redistribute the particles within its private memory.

Algorithmic Details and Theorems
In this section, we give details for a novel implementation of parallel redistribution and prove that it scales as O(log2 N). The reader is referred to Figure 3 for a worked example. Figure 3. RoSS redistribution example for N = 8 and P = 4. Each x_i is actually a real vector, but marked with a letter for brevity.

Rotational Nearly Sort
Theorem 1. Given an array of N particles, x, and their copies to be created, ncopies, whose elements are evenly distributed across the P cores of a DM, Algorithm 2 (performed by each core p, ∀p = 0, 1, ..., P − 1) describes the steps to safely shift the elements in x and ncopies to achieve property (9), and performs that in O(log2 N) parallel time for P = N.
Proof of Theorem 1. The first process to carry out is to nearly sort the particles locally by calling Sequential Nearly Sort (S-NS) (see Algorithm 3), which iteratively moves the i-th particle to the left/right side of the output array if ncopies_i is positive/zero. We note that this routine only takes O(N/P) iterations, since every DM core owns N/P particles. This will start moving the particles locally to the left before doing so globally.
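This one-pass partitioning can be sketched in Python (a single-core sketch; the function name and example values are ours):

```python
def sequential_nearly_sort(x, ncopies):
    """Move particles with positive copy counts to the front and the
    rest to the back in one O(n) pass; also return the local zero count."""
    n = len(x)
    x_out = [None] * n
    c_out = [0] * n
    l, r = 0, n - 1
    for i in range(n):
        if ncopies[i] > 0:
            x_out[l], c_out[l] = x[i], ncopies[i]
            l += 1
        else:
            x_out[r] = x[i]  # rejected particle, copy count stays 0
            r -= 1
    zeros = n - l
    return x_out, c_out, zeros

x = ["A", "B", "C", "D"]
ncopies = [0, 2, 0, 2]
print(sequential_nearly_sort(x, ncopies))
# (['B', 'D', 'C', 'A'], [2, 2, 0, 0], 2)
```

The returned zero count is exactly what each core contributes to the exclusive cumulative sum in the next step of the proof.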
The particles within core p must now shift to the left by as many positions as the number of zero elements in ncopies owned by the cores with a lower rank. Let zeros ∈ Z^P be the array that counts the number of ncopies_i = 0 within each core; each element of shifts ∈ Z^P (the array that keeps track of the remaining shifts) can be initialized as follows:

shifts_p = Σ_{q=0}^{p−1} zeros_q (12)

Equation (12) can be parallelized by using a parallel exclusive cumulative sum (after each core p has initialized zeros_p to the sum of zeros within its memory, at the end of S-NS). It is well covered in the literature that the cumulative sum (also known as prefix sum or scan) takes O(N/P + log2 P) computations [35,36]. We now want to express shifts_p in binary notation and shift the particles by increasing power-of-two numbers of positions, depending on the bits of shifts_p, from the Least Significant Bit (LSB) to the Most Significant Bit (MSB). This translates to using a bottom-up binary tree structure that will complete the task in O(log2 N) time.
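What the exclusive cumulative sum computes can be sketched on a single process (in the parallel version, MPI_Exscan would deliver shifts_p to each rank; names and values here are ours):

```python
def exclusive_cumsum(values):
    """Exclusive prefix sum: out[p] = sum(values[:p]). This is what
    MPI_Exscan would compute across ranks, simulated on one process."""
    out, acc = [], 0
    for v in values:
        out.append(acc)
        acc += v
    return out

# zeros[p] = number of ncopies_i == 0 held by core p after S-NS
zeros = [1, 3, 0, 2]
shifts = exclusive_cumsum(zeros)
print(shifts)  # [0, 1, 4, 4]: core p shifts left past lower-rank zeros
```

Note that core 0 never shifts, since no zeros precede it.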
If P < N, then we first need to perform an extra leaf stage, in which the cores send all of their particles to their neighbor if the bitwise AND of shifts_p and N/P − 1 is positive. In simple terms, the leaf stage masks the log2(N/P) LSBs of shifts_p and performs the rotations referring to those bits in one operation per particle. Hence, during this step, each core sends and receives up to N/P particles, meaning that the leaf stage takes O(N/P).
After the leaf stage, the actual tree-structure routine can start. At every k-th stage of the tree (for k = 1, 2, ..., log2 P), any core p will send to its partner p − 2^(k−1) all of its particles (i.e., x and ncopies) and shifts_p − (N/P)·2^(k−1) (i.e., the number of remaining shifts after the current rotation) if the bitwise AND of shifts_p and (N/P)·2^(k−1) is positive; this corresponds to checking a new bit of shifts_p, precisely the one which is significant at this stage. We note that, in order to balance the communication at all stages, the cores are set to send particles to reject (i.e., having ncopies_i = 0) if the checked bits are 0. Hence, this phase takes O((N/P)·log2 P) because each core sends and receives N/P particles log2 P times. Before the leaf stage, the particles in the memory of core p must shift at most from one end to the other end, i.e., by shifts_p ≤ N − 1 positions. Any non-negative integer shifts_p ≤ N − 1 can always be represented in base-2 by log2 N bits. Therefore, each particle will be shifted O(log2 N) times, each time by an increasing power-of-two number of positions, and, since shifts gets updated in O(1), Algorithm 2 achieves an O(log2 N) time complexity. Furthermore, as long as no particle to be copied collides with or gets past another one, Algorithm 2 always achieves (9), since the shifts to perform will progressively decrease, whereas (12) will always hold. Lemma 1 and Corollary 1 prove that collisions and overtaking (defined in the following definitions) can never occur.
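The LSB-to-MSB decomposition is the same trick used to rotate an array by s positions with only one power-of-two rotation per set bit of s. A single-array sketch (names ours; the real algorithm rotates across cores rather than within one array):

```python
def rotate_left(a, k):
    """Cyclically rotate list a to the left by k positions."""
    k %= len(a)
    return a[k:] + a[:k]

def rotate_by_bits(a, s):
    """Rotate left by s using one power-of-two rotation per set bit of
    s, LSB to MSB -- mirroring the binary-tree communication pattern."""
    step = 1
    while s:
        if s & 1:
            a = rotate_left(a, step)
        s >>= 1
        step *= 2  # the next stage rotates by twice as many positions
    return a

a = list("ABCDEFGH")
# s = 5 = (101)2: one rotation by 1, then one by 4 -- 2 steps, not 5
print(rotate_by_bits(a, 5) == rotate_left(a, 5))  # True
```

Since any s ≤ N − 1 has at most log2 N bits, at most log2 N such rotations are ever needed, which is the crux of the complexity argument above.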
Definition 3 (Collisions). Let x_i be a particle having ncopies_i ≥ 1, and x_j be a particle having ncopies_j ≥ 1, with j = i + dist, where 0 < dist < N − 1. A collision would occur if dist is a power-of-two number, x_j is rotating to the left by dist and x_i stays where it is. More formally, a collision occurs if the total number of rotations that x_j must perform has significant bit (i.e., the bit corresponding to dist rotations) equal to 1, while the same bit of the number of rotations that x_i must perform is 0. The same definition applies to collisions when rotations are performed to the right, if x_i is rotating to the right (and, hence, has significant bit equal to 1) and x_j is not rotating (and, hence, has significant bit equal to 0).

Definition 4 (Overtaking). Let x_i again be a particle having ncopies_i ≥ 1, and x_j again be a particle having ncopies_j ≥ 1, with j = i + dist, where 0 < dist < N − 1. The particle x_j can overtake x_i if it is rotating to the left by a power-of-two number greater than dist while x_i stays where it is. The same problem occurs when rotations are performed to the right, if x_i is rotating to the right by a power-of-two number greater than dist while x_j stays where it is.

Lemma 1. During the k-th iteration of Rotational Nearly Sort, ∀k = 1, 2, ..., log2 P, a particle x_j, having ncopies_j ≥ 1 and rotating to the left by (N/P)·2^(k−1) positions, can never collide with a particle x_i, having ncopies_i ≥ 1 and j = i + (N/P)·2^(k−1).
Proof of Lemma 1. At the k-th iteration, particle x_i has shifts_i remaining rotations to the left, whereas x_j has shifts_j. Therefore, the necessary and sufficient condition for collisions, defined in Definition 3, can be restated for Rotational Nearly Sort by checking whether the significant bit of shifts_i is 0 while the significant bit of shifts_j is 1 (condition (13)). This condition can be rearranged as (14) by noting that shifts_j = shifts_i + zeros_{i+1:j−1}, where zeros_{i+1:j−1} is the number of 0s in ncopies between positions i and j excluded. Since Rotational Nearly Sort performs rotations using an LSB-to-MSB strategy, it is easy to infer that the bits to the right of the significant one at this iteration (i.e., bit k − 1) are all 0. This means that, if the significant bit of shifts_i is 0, the only condition that would make (14) true would be zeros_{i+1:j−1} = (N/P)·2^(k−1). That is, however, impossible, because there are only (N/P)·2^(k−1) − 1 memory slots between i and j, which means that zeros_{i+1:j−1} ≤ (N/P)·2^(k−1) − 1.

Corollary 1. During the k-th iteration of Rotational Nearly Sort, ∀k = 1, 2, ..., log2 P, a particle x_j, having ncopies_j ≥ 1 and rotating to the left by (N/P)·2^(k−1) positions, can never overtake a particle x_i, having ncopies_i ≥ 1 and i < j < i + (N/P)·2^(k−1).

Rotational Split
Theorem 2. Given an array of N particles, x, and their copies to be created, ncopies, whose elements are nearly sorted according to (9), and evenly distributed across the P cores of a DM, Algorithm 4 (performed by each core p, ∀p = 0, 1, ..., P − 1) describes the steps to shift and/or safely split the elements in x and ncopies to achieve properties (10) and (11), and performs that in O(log2 N) parallel time for P = N.
Proof of Theorem 2. Let csum ∈ Z^N be the inclusive cumulative sum of ncopies. As mentioned in the previous section, the cumulative sum achieves an O(N/P + log2 P) time complexity for any P ≤ N.
To achieve (10), we want to move the particles to the right so as to have ncopies_i − 1 gaps after each index i such that ncopies_i > 0. It can be inferred that the minimum required number of shifts to the right that the i-th particle must perform is:

min_shifts_i = csum_i − ncopies_i − i (17)

However, (10) alone does not guarantee (11). This is because, for each particle x_i that must be copied more than once, we are rotating all of its copies by the same minimum number of positions, and we are not considering that some of these copies could be split and rotated further to also fill in the gaps that a strategy for (10) alone would create. Therefore, we also consider the maximum number of rotations that any copy of x_i has to perform, without causing collisions. Since (10) aims to create ncopies_i − 1 gaps for each i such that ncopies_i > 0, that number is:

max_shifts_i = csum_i − 1 − i (18)

Therefore, the key idea is to consider min_shifts_i and max_shifts_i in binary notation and use their bits to rotate the particles accordingly, from the MSB to the LSB. This corresponds to using a top-down binary tree structure of log2 P stages. At each stage k of the tree (for k = 1, 2, ..., log2 P), any core with rank p sends particles N·2^(−k) positions ahead to its partner with rank p + P·2^(−k). At each k-th stage, we check the MSB of both min_shifts_i and max_shifts_i to infer copies_to_send_i, the number of copies of x_i that must rotate by N·2^(−k) positions. For each x_i, three possible scenarios may occur:

• None of its copies must move;
• All of them must rotate;
• Some must split and shift, and the others must not move.
Trivially, the copies will not move if the MSB of max_shifts_i is 0, which also implies that the MSB of min_shifts_i is 0, since min_shifts_i ≤ max_shifts_i at all stages. If both MSBs are 1, we send copies_to_send_i = ncopies_i copies of x_i. However, if only the MSB of max_shifts_i is equal to 1, copies_to_send_i < ncopies_i. The number of copies to split off is equal to how many of them must be placed from position i + N·2^(−k) onwards to achieve a perfect workload balance. Therefore, if only the MSB of max_shifts_i is equal to 1, we send that many copies of x_i and keep the remaining ones where they were. Here, as in Section 4.2.1, particles to be rejected are sent if the MSB of max_shifts_i is 0 in order to keep all cores equally busy. This phase takes an O((N/P)·log2 P) time complexity, because each core sends or receives N/P particles at every stage. However, to ensure a logarithmic time complexity, one needs to update csum, min_shifts and max_shifts in O(N/P). This can be achieved if the cores also send a starter value, namely the cumulative count preceding the first particle to send, where s is the index of that particle, having ncopies_s > 0. As long as no particle overwrites or moves past another one (which is proven to be impossible by Lemma 2), each core p can safely use the received starter to re-initialize csum and update it in O(N/P) as csum_i = csum_{i−1} + ncopies_i, ∀i = pN/P + 1, ..., (p + 1)N/P − 1. This strategy guarantees csum is always correct for at least any index i such that ncopies_i > 0, i.e., those indexes we require to be correct. Once csum is updated, Equations (17) and (18) are embarrassingly parallel.
If P < N, after log2 P stages, the cores perform a leaf stage. Here, for each particle, we perform inter-core shifts or splits and shifts only if there is any copy to be sent to the neighbor; this is equivalent to checking whether csum_i > (p + 1)N/P, where (p + 1)N/P is the first global index in the neighbor's memory. Arrays of 0s are again sent when no inter-core shifts are needed. We also consider internal shifts only if min_shifts_i > 0, in order to both make room to receive particles from the neighbor, p − 1, and guarantee (10). This leaf stage takes O(N/P) because N/P particles are sent or received. It is easy to infer that min_shifts_i < N − 1 and max_shifts_i ≤ N − 1, as a particle copy could at most be shifted, or split and shifted, from the first to the last position. Both min_shifts_i and max_shifts_i can be represented in base-2 by log2 N bits. Therefore, as also anticipated above, only up to log2 N messages are required, which means that the achieved time complexity of Algorithm 4 is O(log2 N). Furthermore, the particles are progressively split according to the MSBs of min_shifts_i and max_shifts_i, until they are split according to (21). Hence, after the final leaf stage, the first and last elements of csum in the memory of each core will necessarily meet the two requirements csum_{(p+1)N/P−1} = (p + 1)N/P and csum_{pN/P} − ncopies_{pN/P} = pN/P which, if subtracted, automatically prove (11).
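The shift bounds can be sketched as follows, assuming an inclusive cumulative sum csum and our inferred formulas min_shifts_i = csum_i − ncopies_i − i and max_shifts_i = csum_i − 1 − i (a single-array illustration; names and example values are ours):

```python
def shift_bounds(ncopies):
    """For a nearly sorted ncopies, compute per-particle right-shift
    bounds: under layout (10), the first copy of x_i lands at position
    csum_i - ncopies_i and the last at csum_i - 1 (inclusive cumsum;
    formulas are our reading of the derivation)."""
    csum, acc = [], 0
    for c in ncopies:
        acc += c
        csum.append(acc)
    min_s = [csum[i] - ncopies[i] - i if ncopies[i] else 0
             for i in range(len(ncopies))]
    max_s = [csum[i] - 1 - i if ncopies[i] else 0
             for i in range(len(ncopies))]
    return min_s, max_s

# Nearly sorted example, N = 8: copy counts [3, 1, 2, 2], then zeros.
# Target layout (10) is [A, _, _, B, C, _, D, _].
ncopies = [3, 1, 2, 2, 0, 0, 0, 0]
print(shift_bounds(ncopies))
# ([0, 2, 2, 3, 0, 0, 0, 0], [2, 2, 3, 4, 0, 0, 0, 0])
```

For instance, particle 1 ("B") must jump past the two gaps reserved for the extra copies of particle 0, so its minimum shift is 2, while the last copy of particle 3 ("D") travels from index 3 to index 7, a maximum shift of 4.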
Lemma 2. Given a nearly sorted input ncopies, at the k-th iteration of Rotational Split, ∀k = 1, 2, ..., log2 P, a particle x_i, having ncopies_i ≥ 1 and rotating to the right by N·2^(−k) positions, can never collide with or get past a particle x_j, having ncopies_j ≥ 1 and j = i + dist, with 0 < dist ≤ N·2^(−k).

Proof of Lemma 2. This lemma can be proved in two possible complementary cases:

1. There are one or more zeros between i and j;
2. There are no zeros between i and j.
Case 1. Since the particles are initially nearly sorted, at the beginning there are no zeros in between any pair of particles in positions i and j. At the k-th iteration, if one or more zeros are found between x_i and x_j, it necessarily means that dist > max_shifts_i ≥ N·2^(−k). This is because of two reasons. First, zeros between two particles x_i and x_j can only be created if the MSB of min_shifts_i is 0 and the MSB of max_shifts_j is 1. Second, for any binary number, its MSB, if equal to 1, represents a greater number than the one represented by any disposition of all remaining LSBs (e.g., (1000)_2 = 8 > (0111)_2 = 7). Hence, if there are any zeros between x_i and x_j, it is because, during at least one of the previous stages, x_j rotated by an MSB and x_i did not, such that x_j is now beyond reach of possible collisions with x_i.
Case 2. In this case, all particles between i and j are still nearly sorted. Therefore, x_i would collide with x_j (when dist = N·2^(−k)) or get past it (when dist < N·2^(−k)) if (22) is true, which corresponds to checking whether the MSB of max_shifts_i is 1 (which also includes those cases where the MSB of min_shifts_i is 1) and the MSB of min_shifts_j is 0.
In other words, (22) can be simplified to checking whether max_shifts_i > min_shifts_j. However, for a pair of particles x_i and x_j within a nearly sorted group of particles, this is impossible, because ncopies_k ≥ 1 for every index between i and j, which implies min_shifts_j ≥ max_shifts_i.

Rotational Nearly Sort and Split Redistribution
Theorem 3. Given an array of N particles, x, and their copies to be created, ncopies, whose elements are evenly distributed across the P cores of a DM, Algorithm 5 (performed by each core p, ∀p = 0, 1, ..., P − 1) redistributes all particles in x for which ncopies_i > 0, and performs that in O(log2 N) parallel time for P = N.
Proof of Theorem 3. After the cores perform Rotational Nearly Sort first, and then Rotational Split, Equation (11) holds, as proven by Theorem 2. Therefore, the cores can independently use S-R to redistribute their N/P particle copies in their private memory in O(N/P) iterations. As proven by Theorems 1 and 2, Algorithms 2 and 4 complete their task in O(log2 N) parallel time, which means that the time complexity achieved by Algorithm 5 is then O(N) for P = 1, O(log2 N) for P = N cores and, for any 1 ≤ P ≤ N, is:

O(N/P + (N/P)·log2 P) (23)

The first term in (23) represents S-R, which is always performed, and all of the steps that are only ever called once for any P > 1 (e.g., S-NS). The second term in (23) describes the log2 P stages of Algorithms 2 and 4, during which we update, send and receive up to N/P particles.
With the results of Theorem 3 in view, we conclude that the PF tasks now take either O(N/P), O(N/P + log_2 P) or O(N/P + (N/P) log_2 P) time. This is because (3), (4) and (8) require reduction (which notoriously scales as O(N/P + log_2 P) [36]), the IS step is embarrassingly parallel and (in the resampling algorithm) Equation (6) requires an exclusive cumulative sum; in addition, Equation (7) and resetting the weights to 1/N are also embarrassingly parallel tasks. Table 1 summarizes this conclusion. This means that, even if we had an embarrassingly parallel fully-balanced redistribution, the time complexity of PFs would still be no less than O(N/P + log_2 P), since reduction and the cumulative sum are required elsewhere.
Table 1. Time complexity of each task of PFs on DM.

Implementation on MPI
In this section, we briefly describe which MPI routines are needed to implement RoSS and the rest of SIR on MPI.
The exclusive cumulative sum that is required before the leaf stage in Algorithm 2 is parallelized on MPI by calling MPI_Exscan [37]. On the other hand, MPI_Scan is used to parallelize the inclusive cumulative sum of ncopies at the start of Algorithm 4.
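For reference, the semantics of the two scan flavours can be illustrated sequentially (this sketch shows what the collectives compute over the ranks' values, not their parallel implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// MPI_Exscan computes an exclusive prefix sum: rank p receives the sum of
// the values held by ranks 0..p-1 (undefined on rank 0; shown here as 0).
std::vector<int> exclusive_scan(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    int acc = 0;
    for (size_t i = 0; i < v.size(); ++i) { out[i] = acc; acc += v[i]; }
    return out;
}

// MPI_Scan computes an inclusive prefix sum: rank p receives the sum of
// the values held by ranks 0..p.
std::vector<int> inclusive_scan(const std::vector<int>& v) {
    std::vector<int> out(v.size());
    int acc = 0;
    for (size_t i = 0; i < v.size(); ++i) { acc += v[i]; out[i] = acc; }
    return out;
}
```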
During the binary tree and the leaf stages in Algorithms 2 and 4, the cores send and receive N/P particles each time. On MPI, MPI_Sendrecv is ideal for these messages, since it takes both the rank to send to and the rank to receive from. Temporary arrays should be used on both communication ends to ensure data coherency before accepting or rejecting the received content.
All of the other operations or algorithms within RoSS, such as (17) or S-NS, are performed locally by each core and, therefore, do not need to call MPI routines.
For completeness, we point out that, in the other PF tasks, the reduction operation in (3), (4) and (8) is parallelized on MPI by calling MPI_Allreduce, whereas (6) needs MPI_Exscan.

Experimental Results
In this section, we first show the numerical results for redistribution and then the results for a PF example. In the redistribution experiment, we compare RoSS, the novel fully-balanced algorithm presented in this paper, with N-R and B-R, two fully-balanced redistributions that take O((log_2 N)^2) steps (see Appendix A). These algorithms are compared by passing input arrays with the same values to all algorithms: ncopies and x with M = 1. To guarantee (5), ncopies is generated randomly by using MVR (see Equation (7)) with a log-normally distributed input w. For the PF experiment, we consider three versions of a SIR PF that only differ in the constituent redistribution used. Each PF test is run for T_PF = 100 iterations. Resampling is computed at every iteration to ensure that the frequency of redistribution is the same. The model we consider is the stochastic volatility example described in Appendix B.
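The generation of a valid ncopies from log-normal weights can be sketched as follows; this is a hedged illustration of a minimum-variance (systematic) resampler, not necessarily the exact MVR variant of Equation (7):

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Hedged sketch: with a single uniform draw u in [0, 1),
//   ncopies[i] = floor(N*cdf[i] + u) - floor(N*cdf[i-1] + u),
// where cdf is the normalized inclusive cumulative sum of the weights.
// By telescoping, the copies sum exactly to N, as required by (5).
std::vector<int> mvr_ncopies(const std::vector<double>& w, double u) {
    const int N = static_cast<int>(w.size());
    double total = 0.0;
    for (double wi : w) total += wi;
    std::vector<int> ncopies(N);
    double cdf = 0.0;
    long prev = static_cast<long>(std::floor(u));  // floor(N*0 + u) = 0
    for (int i = 0; i < N; ++i) {
        cdf += w[i] / total;                       // normalized inclusive cdf
        long cur = static_cast<long>(std::floor(N * cdf + u));
        ncopies[i] = static_cast<int>(cur - prev);
        prev = cur;
    }
    return ncopies;
}
```

As in the experiments, the input weights can be drawn from a log-normal distribution (e.g., std::lognormal_distribution), which yields the heavily skewed ncopies that stress the redistribution step.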
All experiments are conducted for N ∈ {2^16, 2^20, 2^24} particles and for up to P = 256 cores. Each reported run-time is the median of 20 runs collected for each (N, P) pair.
All algorithms in this paper are coded in C++ with OpenMPI-1.10.7 and compiled with the -O3 optimization flag; the double and unsigned int data types are, respectively, used for real and integer numbers. Combinations of MPI_Barrier and MPI_Wtime are used to collect the run-times for Figures 4 and 5, while TAU Performance System 2.7 is used for the profiling in Figure 6. The cluster for the experiments consists of eight machines, interconnected with 100 Gbps InfiniBand, each mounting two Xeon Gold 6138 CPUs that together provide 40 cores running at 2 GHz. We note that the results for up to P = 32 DM cores have been collected by requesting (at scheduling time) cores from the same node. For P ≥ 64, we have requested 32 DM cores from each node. While we cannot identify any resulting artifacts in the results, we acknowledge that this feature of the hardware may influence our results when comparing run-times for P ≤ 32 with those for P ≥ 64. We emphasize that, to ensure the experimental results allow for a meaningful comparison between algorithms, all algorithms were assessed on the same hardware for each P. Future work would sensibly investigate these performance differences as part of a broader study on how to maximize performance as a function of hardware configuration.

RoSS vs. B-R and N-R
In Figure 4, we can see that the gap between the proposed approach and the other algorithms increases with P, particularly for large N. For P ≤ 4 and all three values of N considered, RoSS is comparable with N-R and roughly four times as fast as B-R. However, for P = 256, RoSS is up to eight times faster than both N-R (which is slightly faster than B-R, as shown in [7]) and B-R. We also note that these fully-balanced approaches may stop scaling when the computation shrinks and the communication becomes too intensive: e.g., here, RoSS stops scaling at N = 2^16 and P = 64.

Stochastic Volatility
Figure 5 shows that the PF using RoSS is four to six times faster than a PF using N-R/B-R. The maximum speed-up for a PF with RoSS is approximately 125 for P = 256 cores. Furthermore, the gap between a PF with RoSS and a PF with any other considered redistribution also increases with P, in line with the results in Section 5.1. Figure 6 reports profiling information for the three PFs with N = 2^24. In Figure 6a, it is interesting to see that RoSS is the only redistribution variant that runs faster than IS for every P ≤ 256, while B-R and N-R emerge as the bottleneck for a relatively low P, i.e., for 2 ≤ P ≤ 16. This is mostly due to the larger volume of communication in B-R/N-R, as shown in Figure 6b. In Figure 6b, we have considered the average time spent on all MPI calls per core as the communication time. Since N ≫ P in this experiment, the time spent on the MPI_Sendrecv calls in the redistribution step dominates the time spent on all of the other MPI routines, such as MPI_Allreduce or MPI_Exscan, which only consist of log_2 P single-scalar messages and up to 2 log_2 P arithmetical operations (in this case, sums). Therefore, we can state that the reported average communication is equivalent (with a good degree of confidence) to the percentage of time for which the cores are not used for any computation.

Conclusions
In this paper, we present RoSS, a novel fully-balanced redistribution for global resampling in SIR PFs on distributed memory environments. The algorithm has been implemented on MPI. The baselines for comparison are B-R and N-R, two similar state-of-the-art fully-balanced redistribution algorithms that both achieve an O((log_2 N)^2) time complexity and whose implementations are already available on MPI. We prove in this paper that RoSS redistribution achieves an O(log_2 N) time complexity.
The empirical results show that RoSS redistribution is almost an order of magnitude faster than B-R and N-R for up to P = 256 cores. Similar results are observed in the context of an exemplar PF. For the same level of parallelism, a PF using RoSS is up to six times faster than a PF using B-R/N-R and provides a maximum speed-up of 125.

Appendix A

Since the cumulative sum, reduction and rotational shifts are performed log_2 P times each, B-R achieves an O((log_2 N)^2) time complexity for P = N, or a complexity equal to (A1) for any P ≤ N. As mentioned in Section 4.1, it would be theoretically possible to replace Bitonic Sort with AKS sort in order to make the asymptotic time complexity of the sorting phase equal to O(log_2 N), albeit with a large constant. However, since B-R needs a cumulative sum, a sum and rotational shifts up to log_2 N times, the overall computational complexity that would result from such an approach would still be O((log_2 N)^2). In [30], B-R has been implemented on MapReduce. In [25], B-R has been ported to MPI. A small change to B-R can result in a further 25% improvement, as the results in [7] and Section 5.1 show. This is achieved by substituting Bitonic Sort with an alternative algorithm, Nearly Sort. One does not actually need to perfectly sort the particles in order to divide the workload afterwards, but only needs to guarantee that ncopies has shape (9). To achieve this property, the particles are first nearly sorted locally by calling Algorithm 3.
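The local nearly sorting idea can be sketched sequentially as a stable partition (a hedged illustration; Algorithm 3 in the paper may differ in its details):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of (sequential) nearly sorting: rather than fully sorting by
// ncopies, it suffices to move the particles with ncopies > 0 to the
// front (preserving their order) and the zeros to the back. Particles
// past dst keep stale values, but their ncopies is 0, so they are never
// copied by the subsequent redistribution.
void nearly_sort(std::vector<double>& x, std::vector<int>& ncopies) {
    size_t dst = 0;
    for (size_t i = 0; i < x.size(); ++i) {
        if (ncopies[i] > 0) {        // dst <= i, so in-place moves are safe
            x[dst] = x[i];
            ncopies[dst] = ncopies[i];
            ++dst;
        }
    }
    for (; dst < x.size(); ++dst) ncopies[dst] = 0;  // trailing zeros
}
```

This stable partition is cheaper than a full sort of the particles by ncopies, which is why substituting Bitonic Sort with Nearly Sort pays off.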
We emphasize that doing so is faster than sorting the particles according to ncopies. Then, the particles are recursively merged, as in Bitonic Sort, by using the same sorting network illustrated in Figure A1. Since S-NS takes O(N/P) and the number of sent/received messages per core equals log_2 P, we can infer that Nearly Sort costs O(N/P + (N/P) log_2 P). Here, as in [7], we refer to a B-R parallelization that uses Nearly Sort instead of Bitonic Sort as Nearly-Sort-Based Redistribution (N-R). Algorithm A1 summarizes both N-R and B-R, which achieve an O((log_2 N)^2) time complexity. Figure A2 shows an example of N-R for N = 16 and P = 8; a figure for B-R is omitted due to its similarities with N-R.

In the stochastic volatility model, the initial state is sampled as X_0 ∼ N(0, σ^2/(1 − φ^2)). The particles are initially drawn from p_0(X_0) and then from the dynamic model. Hence, (2) simplifies to w^i_t = w^i_{t−1} p(Y_t | x^i_t).


Figure 4. Redistribution results for increasing N and P.

Figure 6. Average computation vs. average communication for increasing P, N = 2^24 and T_PF = 100.

Figure A1. Bitonic/Nearly Sort for N = 32 and P = 8. The arrows represent inter-core messages (e.g., MPI_Sendrecv) and swaps which, for Nearly Sort, only apply to a zero-positive pair, since pairs of positive keys or zeros are already nearly sorted. After each stage, results are in blue/red for Bitonic/Nearly Sort.

Figure A2. N-R example for N = 16 and P = 8. Each x_i is actually a real vector, but is marked with a capital letter for brevity.