Abstract
In this paper, we consider the problem of estimating the time-dependent ability of workers participating in distributed matrix-vector multiplication over heterogeneous clusters. Specifically, we model each worker’s ability as a latent variable and introduce a log-normally distributed working rate as a function of that latent variable, parameterized so that the working rate increases with the latent ability and takes only positive values. This modeling is motivated by the need to reflect the impact of time-dependent external factors on workers’ performance. We estimate the latent variables and parameters using the expectation-maximization (EM) algorithm combined with a particle method. The resulting estimates and inferences of the working rates are then used to allocate tasks to the workers so as to reduce the expected latency. Simulations show that our estimation and inference of the working rates are effective in reducing the expected latency.
1. Introduction
In order to meet the ever-growing demand for massive-scale computations over large amounts of data, distributed computing systems have become a central part of many machine learning algorithms [1]. However, such systems often face the problem of waiting for slow workers, called stragglers [2]. To overcome this issue and provide straggler tolerance, Lee et al. introduced redundancy into distributed matrix multiplication [3]. Specifically, the authors showed that computational latency can be reduced using a maximum distance separable (MDS) code, since collecting the results of any k out of the n workers completes the assigned task.
Building on the idea proposed by Lee et al. [3], a growing body of work has used coding to mitigate the straggler problem in homogeneous clusters. This includes coding for high-dimensional matrix multiplication [4,5], matrix multiplication that accounts for the architecture of practical systems [6], gradient descent [7,8,9], convolution [10], and non-linear computation using a learning-based approach [11]. Mitigating the effects of stragglers is not limited to computing clusters; it is also relevant to many other distributed computing architectures that may suffer from stragglers or failures. Examples include fog computing, computation with a deadline, mobile edge computing, the layered architecture of edge-assisted Internet of Things, and federated learning [10,12,13].
System designs that assume homogeneity can significantly degrade overall performance when the cluster is in fact heterogeneous [14]. For example, suppose that the same amount of work is assigned to two workers with different computing capabilities, and consider the time for both workers to finish. On average, the worker with larger computing capability completes its task sooner than the worker with smaller computing capability. Thus, the expected time to complete the entire task is dominated by the slower worker. In other words, uniform load allocation that ignores the heterogeneous computing capabilities of the workers can degrade the performance of the computing system. Reflecting these points, research has been conducted under the assumption of heterogeneous clusters [15,16]. In [15], the authors present an asymptotically optimal load allocation that minimizes the expected task completion delay for fully heterogeneous clusters. Similarly, in [16], an optimal load allocation is proposed to minimize the expected execution time under the assumption of group heterogeneity, meaning that workers in different groups have different latency statistics.
In this work, we consider a scenario where the ability of each heterogeneous worker changes over time. While [15,16] cover distributed computation in a heterogeneous environment, they assume that each worker’s ability is a fixed, known constant. In reality, however, even workers designed to maintain a constant level of ability can exhibit time-dependent abilities due to external factors such as changes in temperature and humidity [17]. Without estimating workers’ abilities, the default load allocation method is uniform load allocation, which assigns the same amount of work to each worker. As mentioned earlier, this can degrade system performance. Thus, estimating workers’ abilities is essential for improving system latency.
We model the aggregated impact of such factors on a worker’s ability as a normal distribution and treat the worker’s ability as a latent variable. As in several prior works [3,4,6,15,16], we adopt the widely used assumption that the runtime of a worker follows an exponential distribution. We define the working rate of a worker as the rate of this exponentially distributed runtime and assume that the working rate is log-normally distributed with two parameters: the mean and standard deviation of the logarithm of the working rate. The log-normal distribution is a good fit for the working rate since it takes only positive values and increases with the latent ability of the worker. We aim to estimate the latent variables and parameters associated with the ability/working rate of each worker so that the workload can be properly allocated to the workers in distributed computing. A large body of research has analyzed time-series data with latent variables from different points of view [18,19,20,21]. In this paper, we employ particle methods to handle both the estimation of parameters and the inference of latent variables.
Contribution: The key contributions of this work are summarized as follows.
- To the best of our knowledge, this is the first work to model workers’ time-varying abilities with latent variables.
- We present an algorithm for estimating the parameters and latent variables of the workers’ abilities using the expectation-maximization (EM) algorithm combined with a particle method.
- We verify the validity of the presented algorithm with Monte Carlo simulations.
- We confirm the validity of our inference by verifying that the load allocation based on the estimated workers’ ability achieves the lower bound of the expected execution time.
2. Preliminaries
In this section, we describe our system model and model assumptions. Then an optimal load allocation is introduced according to the system model and the model assumptions.
2.1. System Model
We focus on the distributed matrix-vector multiplication over a master–worker setup in time-dependent heterogeneous clusters (described in Figure 1). At time t, we assign a task to compute the multiplication of and to the N distributed workers with different runtimes for a given matrix and the input vector . We assume that N workers are divided into G groups. Each group has a different number of workers and a different working rate. The workers in each group have the same working rate. We denote the number of workers in group i as , which implies that
The tasks are successively assigned to the workers at time points . To demonstrate our computation model, we present the uncoded and coded computations for a given task computing the multiplication of and assigned at time t as follows.
Figure 1.
Illustration of the time-dependent heterogeneous cluster (task assigned at time t). The master sends the input vector to the N workers, each of which stores and in uncoded and coded computations, respectively, for and ; (a) worker j in group i computes the multiplication of and with the working rate , and sends back the computation results to the master; (b) worker j in group i computes the multiplication of and with the working rate , and sends back the results to the master.
2.1.1. Uncoded Computation
The rows of are divided into G disjoint submatrices as
where
and
Here, is a submatrix of allocated to worker j in group i for and . Worker j in group i is assigned a subtask to compute the multiplication of and after receiving the input vector from the master. Then, worker j in group i computes the multiplication of and and sends back the result to the master. After the master collects the computation results from all the workers, the desired computation result can be obtained.
2.1.2. Coded Computation
We apply an MDS code to the rows of to obtain the coded matrix . Afterward, the rows of are grouped into N submatrices as
where
is the coded data matrix allocated to the workers in group i and
Here, worker j in group i is assigned a subtask to compute the multiplication of and for and . Worker j in group i sends the computation result to the master after finishing the subtask multiplying and . The master can retrieve the desired computation result by combining the inner products of any coded rows with from the MDS property.
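To make the coded computation concrete, the following sketch shows one possible MDS construction (a real-valued Vandermonde code); the function names and the one-row-per-block granularity are our own illustrative choices rather than the paper’s exact scheme. The master encodes the rows, each worker computes its local product, and any k responses suffice to decode.

```python
import numpy as np

def encode_mds(A, n):
    """Encode the k rows of A into n coded rows with a real Vandermonde
    generator; any k of the n coded rows suffice to decode (MDS property)."""
    k = A.shape[0]
    nodes = np.arange(1, n + 1, dtype=float)
    G = np.vander(nodes, k, increasing=True)   # n x k; any k rows are invertible
    return G @ A, G

def decode_mds(partial_results, indices, G):
    """Recover A @ x from any k coded inner products."""
    Gk = G[indices, :]                         # k x k Vandermonde submatrix
    return np.linalg.solve(Gk, partial_results)

# toy example: k original rows, n workers, master waits for the fastest k
k, n = 4, 6
A = np.random.randn(k, 8)
x = np.random.randn(8)
A_coded, G = encode_mds(A, n)
worker_outputs = A_coded @ x                   # in reality, computed by n workers
fastest = np.random.choice(n, size=k, replace=False)   # any k responses
recovered = decode_mds(worker_outputs[fastest], fastest, G)
assert np.allclose(recovered, A @ x)
```

In the paper’s setup, the coded blocks would be distributed across the G groups according to the load allocation of Section 2.3; the sketch only illustrates the encode/collect/decode cycle.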
For efficient load allocation to the distributed workers, we need to estimate the workers’ abilities, which are assumed to change over time. To do so, for the first T time points , we assign the matrix-vector multiplication tasks using the uncoded computation framework and measure the runtimes of all the workers. Here, every worker is assigned the same amount of work, i.e., we set for , , and . Based on the observed runtimes, we estimate the parameters and latent variables of the working rates. Afterward (), coded computation is applied using the load allocation based on the estimated working rates. The latent variable is estimated at each time step to properly allocate loads to the workers. The detailed process of estimating the working rates is described in Section 3.
2.2. Model Assumptions
For the task assigned at time t, we assume that the time taken to calculate an inner product of a row of and the input vector follows the shifted exponential distribution with rate , which is defined as the working rate of workers in group i. It is assumed that workers having the same working rate form a group; in other words, workers in a group have the same computing capabilities.
We model the working rate as
using a latent variable Here, denotes the ability of workers in group i for a subtask assigned at time t.
For workers in group i, it is assumed that the initial (time-zero) state follows the normal distribution , where m and are model parameters. Let us denote . We write as the probability density function of given the parameters,
The ability of workers may vary over time, which is formulated by
where is a white noise that is independent across group i and time t and follows the normal distribution with zero mean and variance . Here, represents the change in the worker ability as the time elapses from to t. In this work, we assume is given. For a precise formulation, we denote the distribution of given by
which is the probability density function of the normal distribution with mean and variance .
We assume that is a sufficient time interval for the workers to carry out their subtasks allocated at time . We further assume that a task is assigned at successive time points spaced at uniform intervals, i.e., for all . Then, can be rewritten as for notational convenience.
Let be the observed variable that indicates the runtime of worker in group for a given data matrix
with the following probability density function:
Since the ’s are independent, we obtain the probability density function g of the observations, given by
for , , and . It follows from (5) that we have the probability density function of (observations for workers in group i through the time points ) as follows:
where and denote the number of rows of the data matrix assigned to workers in group i and the ability of workers in group i at time , respectively. Then, for all of the workers,
where , , and represent the overall observations, the workers’ ability, and the number of rows of the assigned matrices for workers in group at time , respectively.
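To tie the assumptions together, the following sketch generates the latent abilities, the log-normal working rates, and one runtime observation per time point for a single group. It is illustrative only; the exact shift and per-row scaling of the shifted-exponential runtime are our assumptions, since the displayed densities are not reproduced here.

```python
import numpy as np

def simulate_group(T, m, sigma0, sigma_w, loads, shift=0.0, seed=0):
    """Generate abilities x_t (Gaussian random walk), working rates
    lambda_t = exp(x_t) (hence log-normal), and one runtime observation per
    time point for a single worker group.  The runtime for loads[t] rows is
    modeled as shift*loads[t] + Exp(scale=loads[t]/lambda_t); this exact
    shift/scaling is an assumption."""
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    x[0] = rng.normal(m, sigma0)                      # initial ability x_0
    for t in range(1, T + 1):
        x[t] = x[t - 1] + rng.normal(0.0, sigma_w)    # random-walk evolution
    lam = np.exp(x)                                   # working rate, always > 0
    runtimes = shift * loads + rng.exponential(loads / lam[1:])
    return x, lam, runtimes

x, lam, y = simulate_group(T=30, m=1.0, sigma0=1.0, sigma_w=0.1,
                           loads=np.full(30, 10.0))
```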
2.3. Optimal Load Allocation
In this subsection, we analyze the load allocation given , for at time t. The previous work [16] presents the optimal load allocation that minimizes the expected task runtime. We restate the results of [16] with definitions adapted to our system model.
Let denote the task runtime taken to calculate the inner products of rows of with at a worker in group i at time t. We assume that the ’s are independent random variables following the shifted exponential distribution
for and . Here, it is assumed that the cumulative distribution function of the task runtime for a worker in group i to calculate one inner product is
for
Let denote the -th order statistics of random variables following the distribution in (6). We define
as the task runtime for the master to finish the given task.
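Since the analysis relies on order statistics, a quick Monte Carlo check can be useful. The sketch below is our own illustration: it assumes each worker’s runtime for l assigned rows is a·l plus an exponential with scale l divided by its working rate (the precise form is elided in the text), and declares the task finished once any q coded rows have been returned, per the MDS property.

```python
import numpy as np

def expected_runtime_mc(loads, rates, q, a=0.0, reps=10_000, seed=0):
    """Monte Carlo estimate of the expected task runtime: worker w is assigned
    loads[w] coded rows at working rate rates[w], and the master finishes once
    q coded rows in total have been collected (q must not exceed sum(loads)).
    Runtime of a worker with load l: a*l + Exp(scale=l/rate) (an assumption)."""
    rng = np.random.default_rng(seed)
    loads, rates = np.asarray(loads, float), np.asarray(rates, float)
    total = 0.0
    for _ in range(reps):
        t = a * loads + rng.exponential(loads / rates)   # one runtime per worker
        order = np.argsort(t)
        collected = np.cumsum(loads[order])              # rows gathered over time
        total += t[order[np.searchsorted(collected, q)]]
    return total / reps

# two groups (slow/fast): master needs q = 40 of the 60 coded rows
print(expected_runtime_mc(loads=[10, 10, 20, 20], rates=[1.0, 1.0, 3.0, 3.0], q=40))
```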
Theorem 1
(Theorem 2 in [16]). The optimal load allocation to achieve the minimum of
denoted by , is determined as follows:
for , where
and
Here, is the lower branch of the Lambert W function. ( denotes the branch satisfying and ). Then the minimum expected execution time, , is represented as
Then, we have
The first inequality follows from the definition of in (7) and the last inequality comes from Theorem 1.
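Theorem 1 expresses the optimal load in terms of the lower branch of the Lambert W function; the exact closed form is stated in [16] and omitted here. A minimal check of how that branch can be evaluated numerically, assuming SciPy is available:

```python
import numpy as np
from scipy.special import lambertw

z = -0.2                                   # any z in (-1/e, 0)
w = lambertw(z, k=-1).real                 # lower branch, W_{-1}(z) <= -1
assert w <= -1 and np.isclose(w * np.exp(w), z)   # defining identity W e^W = z
```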
Next, we introduce the asymptotic behavior of the expected task runtime for the master to finish the given task as follows.
Theorem 2
(Theorem 3 in [16]). For the given optimal load allocation , is asymptotically equivalent to for a sufficiently large N.
As stated in Theorem 1, the optimal load allocation achieves the minimum of
denoted by . For the given and , the constant is a theoretical limit of . From Theorem 2, we conclude that the load allocation given by (8) is also optimal for from an asymptotic perspective when N is sufficiently large.
Remark 1.
For a fixed time t, the working rate of workers in group i is the rate parameter of the shifted exponential distribution. This implies that the expected execution time for a worker in group i to finish the given task is a linear function with respect to the reciprocal of the working rate. A large indicates that workers in group i have a relatively good ability, and the expected execution time for workers in group i tends to be relatively small. As given by Equation (1), a worker’s ability in group i can be represented by , where the log-normal distribution is widely used to describe the distribution of positive random variables.
3. Estimation of Latent Variable and Parameters
For the tasks allocated at time , the master collects the computation result , as well as the observed data , which denotes the runtime of worker in group for the task assigned at t. After obtaining at , we estimate the parameter and the latent variable using the EM algorithm combined with filtering and smoothing. For , we obtain the estimate of based on the particle filtering algorithm and the estimated parameter .
3.1. EM Algorithm
We define the complete-data likelihood functions as follows:
Let
be the complete-data log-likelihood function where log represents the natural logarithm.
The EM algorithm estimates the latent variable and the parameter simultaneously and consists of the following two steps. The first step, called the E-step, calculates the expected value of the complete-data log-likelihood function , defined as
where the expectation is over , more precisely, . Here, is the v-th parameter estimate in the procedure of the EM algorithm. Then we have
Hence, (10) implies that the distribution of given , , and is required to evaluate the Q-function value at each iteration. The methodology for obtaining this distribution is given in the next subsection. The M-step finds the parameter for the next iteration using
The aforementioned EM algorithm is summarized in Algorithm 1.
| Algorithm 1 EM Algorithm. |
|
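A high-level skeleton of Algorithm 1 may help fix ideas. The helper routines `smooth_particles` and `maximize_Q` below are placeholders for the particle smoothing of Section 3.2 and the M-step maximization, i.e., our own abstractions rather than the paper’s exact procedures.

```python
import numpy as np

def em_estimate(y, theta_init, smooth_particles, maximize_Q,
                max_iter=50, tol=1e-4):
    """Generic EM loop: the E-step approximates Q(theta, theta_v) with smoothed
    particles of the latent abilities, and the M-step maximizes it over theta."""
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        particles = smooth_particles(y, theta)                  # E-step (Algorithms 2 and 3)
        theta_new = np.asarray(maximize_Q(particles, y), float)  # M-step
        if np.linalg.norm(theta_new - theta) < tol:             # stop when estimates converge
            return theta_new
        theta = theta_new
    return theta
```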
3.2. Filtering and Smoothing
We resort to a sampling method to approximate the function , since the integral in (10) is analytically intractable. Our goal is to find an approximation of
Since the distribution is independent across groups, we only need to infer
for each group i, which is the so-called smoothed estimate. To obtain the smoothed estimate, we use the particle method [22], since it is efficient in high-dimensional sampling and more suitable for filtering than competing schemes [23,24,25]. Specifically, we choose sequential Monte Carlo (SMC) as the filtering algorithm [26]. The algorithm is based on sequential importance sampling with resampling, which has advantages in controlling the variance of the estimates [27]. Note that the SMC algorithm is based on importance sampling, and the sampled object is called a particle (hence the name).
For workers in group i and the task assigned at t, the number of particle sequences is denoted by L and the -th particle is denoted by . We need to obtain the proposal (importance) distributions and . The optimal choices of the proposal distribution and proposal function are just the distributions of themselves, i.e.,
and
However, evaluating these distributions is intractable in our model. Although there are several possible methods to obtain a proposal distribution, we choose the following, which are known to work well in practice:
and
for . Here, is an initial guess of . These choices ease the computation of importance weights, which are given as follows for the l-th particle sequence:
and
Note that we can skip the resampling step for since the importance weights for are independent of the particles , which means all particles have the same weights. Therefore, has practically no effect on the algorithm.
The algorithm for SMC filtering on workers in group i is summarized in Algorithm 2. One additional shorthand notation
is used in describing Algorithm 2.
| Algorithm 2 The SMC for filtering on workers in group i. |
Input: , , . |
|
|
Output: |
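For intuition, a simplified stand-in for Algorithm 2 is sketched below. It uses the state-transition density as the proposal (a bootstrap filter), whereas the paper’s Algorithm 2 uses the proposals in (11) and (12); the shifted-exponential likelihood with shift a and load loads[t] is likewise our assumption about the elided observation density.

```python
import numpy as np

def particle_filter(y, loads, m, sigma0, sigma_w, a=0.0, L=1000, seed=0):
    """Bootstrap-style SMC for one group: returns the filtered particle sets
    and filtered means of the latent ability x_t given observations y_1..y_T."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(m, sigma0, size=L)             # draws of x_0
    filtered, means = [], []
    for t in range(len(y)):
        particles = particles + rng.normal(0.0, sigma_w, size=L)  # propagate x_t
        rate = np.exp(particles) / loads[t]               # per-worker completion rate
        resid = y[t] - a * loads[t]                       # time beyond the shift
        if resid <= 0:
            raise ValueError("observation below the deterministic shift")
        logw = np.log(rate) - rate * resid                # (assumed) exponential likelihood
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(L, size=L, p=w)                  # multinomial resampling
        particles = particles[idx]
        filtered.append(particles.copy())
        means.append(particles.mean())
    return filtered, np.array(means)
```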
We can effectively sample from the proposal distribution using adaptive rejection sampling (ARS) [28], since the proposal distribution is log-concave, as shown in Proposition 1.
Proposition 1.
The proposal distributions and are log-concave functions with respect to and , respectively.
Proof.
By (11), we have
The second-order partial derivative of the above function with respect to evaluates to
which is negative for all real values . Since the logarithm of the function is concave, the proposal distribution is log-concave in .
Similarly, (12) implies
The second-order partial derivative with respect to gives the following negative value for all real :
Therefore, the proposal distribution is log-concave in . □
We resample at every time step so that the particle sequences have equally weighted samples. In this paper, we use the multinomial resampling [22], which involves sampling from the multinomial distribution. The sampling probabilities are given by the weights of the corresponding original weighted samples.
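Multinomial resampling itself is a one-liner: draw particle indices with probabilities equal to the normalized importance weights, after which all particles carry equal weight. A minimal sketch:

```python
import numpy as np

def multinomial_resample(particles, weights, rng=np.random.default_rng()):
    """Resample particles with replacement according to their normalized weights."""
    w = np.asarray(weights, float)
    idx = rng.choice(len(particles), size=len(particles), p=w / w.sum())
    return np.asarray(particles)[idx]        # equally weighted after resampling
```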
Then, we can calculate
the approximation of the desired distribution by employing Algorithm 2. However, note that we cannot directly feed the output
into the EM algorithm, because the particle filtering method often suffers from sample impoverishment, known as the degeneracy problem. When the number of distinct particles
is too small for , the estimate
is not reliable for . Therefore, from Algorithm 2, we can take the filtered marginal distribution
of the distribution
at time , when we are interested in the online estimation of state variables. We have
for , where is the Dirac delta mass located at . Note that the weights are the same as due to resampling at the last step.
To obtain the smoothed estimate
for the EM algorithm, we rely on the forward-filtering backward-sampling (FFBSa) algorithm using the filtered particles , , in (13). The detailed process is given in Algorithm 3.
| Algorithm 3 FFBSa on workers in group i. |
Input: , , . |
|
Output: |
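A compact rendering of the backward-sampling pass is given below. It is our own simplification of FFBSa: it assumes the Gaussian random-walk transition in (3) and equally weighted filtered particles, e.g., the `filtered` list returned by the filter sketch above.

```python
import numpy as np

def ffbsa(filtered, sigma_w, n_paths, rng=np.random.default_rng()):
    """Forward-filtering backward-sampling: draw smoothed trajectories of the
    latent ability from the filtered particle sets filtered[t], t = 0..T-1."""
    T = len(filtered)
    paths = np.empty((n_paths, T))
    for j in range(n_paths):
        x_next = rng.choice(filtered[T - 1])         # start from the final filter
        paths[j, T - 1] = x_next
        for t in range(T - 2, -1, -1):
            # reweight time-t particles by the Gaussian transition to x_{t+1}
            logw = -0.5 * ((x_next - filtered[t]) / sigma_w) ** 2
            w = np.exp(logw - logw.max())
            x_next = rng.choice(filtered[t], p=w / w.sum())
            paths[j, t] = x_next
    return paths
```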
By Algorithm 3, an estimate of the smoothed marginal distribution of is represented as
for . Now we have the particles from the distribution
Since all particle sequences have the same weights due to resampling, we can approximate the function as follows:
We can run the EM algorithm given in Algorithm 1 by using the particles obtained from Algorithm 3. We begin with the initial guess . In the first iteration, we compute and find
We then run iterations until converges and obtain the converged parameter estimate .
3.3. Inference of
We now have the estimated parameter from the EM algorithm; hereafter, the parameter and the given values are omitted for brevity. Once a point estimate of the ability is given, we can directly infer the working rates via (1). In the following subsections, we consider point estimates based on the particle filtering and smoothing algorithms.
3.3.1. Offline Inference
We consider estimating when , where we have observations until time . Then the smoothed estimates can be exploited for this purpose. For example, offline estimation is used in the estimation of the model parameters in Algorithm 1 to avoid the particle degeneracy problem [23]. Given the observations up to time s, we denote the mean of as
For a point estimate of , we take the smoother mean . To calculate it, the smoothed estimate
is obtained from Algorithm 3 using the observed data . Then, is represented as
where are the smoothed samples from Algorithm 3.
3.3.2. Online Inference
This subsection considers the case of ; we need a point estimate on for the load allocation at time . The filtered mean can be obtained from Algorithm 2 as soon as we have observations at time . Using the observed data , is expressed as the filtered mean
where are the filtered samples from Algorithm 2. Unlike offline inference, online inference is free from the particle degeneracy problem. We can employ the online inference procedure to obtain the ability estimates simultaneously with the entry of observations.
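Both point estimates reduce to particle averages, and either one yields a working-rate estimate through (1). A small sketch follows (the argument names refer to the illustrative filter/smoother functions above and are assumptions):

```python
import numpy as np

def point_estimates(paths, filtered, t):
    """paths: smoothed trajectories from FFBSa (n_paths x T);
    filtered: list of filtered particle arrays; t: 0-based time index."""
    x_hat_smooth = paths[:, t].mean()        # offline (smoother) mean
    x_hat_filter = filtered[t].mean()        # online (filtered) mean
    return np.exp(x_hat_smooth), np.exp(x_hat_filter)   # working rates via (1)
```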
4. Performance Evaluation
We perform a simulation study in order to validate the suggested inference methods. We fix the true parameters and so that the initial ability of groups has an average of 1 and a standard deviation of 1. We allow a minor degree of ability changes over time by setting . We consider timepoints after the initial time and groups of workers. The number of particles is set to . For simplicity, we assume that all workers are assigned unit loads over time so that for , , and , and each group has the same number of workers so that . We vary the number of workers, , to check that the algorithm works for various situations and to see how the number of workers affects the performance of the proposed algorithms.
We generate the ability of workers using (2) and (3) and use them to generate the observation data that follows the distribution given in (4). This process is repeated 30 times for each case . Then we apply the inference algorithm to estimate the model parameters and the ability of workers.
Table 1 presents the parameter estimates over 30 repeated datasets for the cases . The estimates are close to the true parameters and in all cases of , suggesting that the algorithm recovers the true parameters well. Moreover, the standard deviations of all the estimates over the 30 repetitions tend to decrease as increases, implying that a larger number of workers in each group tends to yield more accurate estimates.
Table 1.
Parameter estimates for the datasets with the cases . The true parameter is set as , . We present the mean and standard deviation of 30 repetitions for each case.
We now study the accuracy of the ability estimate using the smoothed estimates given in (14). We use three criteria, the Pearson correlation coefficient (COR), the mean squared error (MSE), and the mean absolute error (MAE) to evaluate the performance. The measure COR is the correlation coefficient between the estimated and the true worker abilities over all the groups and timepoints , given by
where and are the means of the estimated and true abilities, respectively. The other two measures, MSE and MAE, are based on the differences between the true and estimated abilities over all the groups and timepoints , given by
Table 2 shows the COR, MSE, and MAE results across the 30 datasets. The average COR values are almost equal to 1, which suggests that the proposed algorithm properly estimates the true abilities of the worker groups. The average MSE and MAE values are close to 0, which indicates that the estimated abilities closely match the true ones. It is reasonable to have smaller discrepancies between the true and estimated values for large , since we have more observations.
Table 2.
COR, MSE, and MAE results for the datasets with the cases . We present the mean and standard deviations of 30 repetitions for each case.
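For reference, the three criteria can be computed directly from the estimated and true ability arrays; a minimal sketch (the G × T array shape is our assumption):

```python
import numpy as np

def evaluate(x_hat, x_true):
    """COR, MSE, and MAE between estimated and true abilities over all
    groups and timepoints; both inputs are arrays of shape (G, T)."""
    cor = np.corrcoef(x_hat.ravel(), x_true.ravel())[0, 1]
    mse = np.mean((x_hat - x_true) ** 2)
    mae = np.mean(np.abs(x_hat - x_true))
    return cor, mse, mae
```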
We also depict the three measures in Figure 2, where they are computed at each time . The at each time is given by
where and are the means of the estimated and true abilities at each time, respectively. The other two measures are represented by
at each time . We can see that the estimation errors tend to be uniform over time, implying that the ability estimates are reliable at any time point.
Figure 2.
, , and results for the datasets with the cases for each time .
Figure 3 presents the true and estimated abilities for each group in the first dataset of the case . The true abilities tend to lie within the standard deviation of the particle sequences, implying that the presented algorithm estimates the true abilities very well. The results for groups 1 and 6 indicate that the estimates can also capture increasing or decreasing patterns. The right-hand side of Figure 3 shows the analogous plot for the working rate given in (1), suggesting that the ability estimates are useful for estimating the working rates. Note that the estimated working rates are quite accurate over a broad range of working rates. In conclusion, the simulation results demonstrate that the estimation/inference works well for both the model parameters and the workers’ abilities.
Figure 3.
The ability (left) and working rate (right) estimates for all of the groups of the first dataset in the case of . The x-axis represents time. The true ability and working rate for each group are represented as black solid lines with markers. The estimated ability and working rate values are represented as blue solid lines. The dashed line represents the standard deviation of particle sequences for each group.
In Figure 4, we compare the expected execution time of the uncoded scheme, the uniform load allocation with code rate , the load allocation in (8) with the true working rates, and the load allocation in (8) with estimated working rates. For performance comparisons, the following simulations are conducted with the fixed parameters and , . Moreover, we set , , , and the true working rate at time . At time T, the true working rate varies to . We obtain the estimated working rates of the groups using the proposed algorithms. This simulation shows that the load allocation in (8) with the estimated working rates can achieve the same performance as the optimal load allocation in (8) with the true working rates. It is also observed that the load allocation in (8) with the estimated working rates shows 54% and 14% reductions in the expected execution times compared to the uncoded scheme and the uniform load allocation with code rate , respectively. This result demonstrates that our estimation and the inference of the latent variable are valid in terms of the expected execution time.
5. Concluding Remarks
In this paper, we model the time-varying ability of workers in heterogeneous distributed computing as latent variables. Since we allow the ability to change over time, we employ a particle method to infer the latent variables. We present a method for estimating the parameters of the working rate with latent variables using the EM algorithm combined with sequential Monte Carlo (SMC) filtering and FFBSa. Monte Carlo simulations verify that the proposed algorithm works reasonably well in estimating the workers’ abilities as well as the model parameters. In particular, the estimation of workers’ abilities shows strong performance in terms of two measures: the COR and MSE between the true and estimated abilities.
Exploiting the proposed estimation of the workers’ abilities, one can devise an optimal load allocation that minimizes latency in heterogeneous distributed computation with workers having time-varying abilities. Numerical simulations show that the load allocation with the estimated working rates achieves the theoretical limit of the expected execution time and reduces the expected execution time by up to 54% compared to existing schemes.
Author Contributions
Conceptualization, D.K. and H.J.; methodology, D.K. and H.J.; software, D.K., S.L. and H.J.; validation, D.K. and H.J.; formal analysis, D.K. and H.J.; writing—original draft preparation, D.K., S.L. and H.J.; writing—review and editing, D.K. and H.J.; supervision, D.K. and H.J.; project administration, D.K. and H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.
Funding
This research is supported in part by a National Research Foundation of Korea (NRF) grant funded by the Korean government (no. 2021R1G1A109410312).
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
Notations
| Symbol | Description |
| N | The number of workers participating in the matrix multiplication task |
| | The number of workers in group i |
| G | The number of groups |
| () | (Coded) data matrix in at time t |
| () | (Coded) data matrix in at time t allocated to worker j in group i |
| () | (Coded) data matrix in at time t allocated to the workers in group i, i.e., () |
| | Input vector in at time t |
| | The number of rows of the matrix allocated to workers in group i at time t |
| | The ability of workers in group i for a subtask assigned at time t |
| | Working rate of workers in group i at time t |
| | Observed variable that indicates the runtime of worker j in group i at time t |
| | Normal distribution with mean m and variance |
| | Model parameters, i.e., |
| | Probability density function (pdf) of the initial worker’s ability in group i, where follows a normal distribution |
| | pdf of the worker’s ability in group i at time t given the worker’s ability at the previous time |
| | pdf of the runtime of worker j in group i at time t, given the workers’ ability and load allocation |
| | White noise that is independent across group i and time t and follows the normal distribution |
| | Task runtime taken to calculate the multiplication of and at a worker in group i at time t |
| | Task runtime for the master to finish the given task |
References
- Dean, J.; Corrado, G.; Monga, R.; Chen, K.; Devin, M.; Mao, M.; Senior, A.; Tucker, P.; Yang, K.; Le, Q.V. Large scale distributed deep networks. Proc. Adv. Neural Inform. Process. Syst. (NIPS) 2012, 1, 1223–1231. [Google Scholar]
- Dean, J.; Barroso, L.A. The tail at scale. Commun. ACM 2013, 56, 74–80. [Google Scholar] [CrossRef]
- Lee, K.; Lam, M.; Pedarsani, R.; Papailiopoulos, D.; Ramchandran, K. Speeding up distributed machine learning using codes. IEEE Trans. Inf. Theory 2018, 64, 1514–1529. [Google Scholar] [CrossRef]
- Lee, K.; Suh, C.; Ramchandran, K. High-dimensional coded matrix multiplication. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 2418–2422. [Google Scholar]
- Yu, Q.; Maddah-Ali, M.; Avestimehr, S. Polynomial codes: An optimal design for high-dimensional coded matrix multiplication. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Los Angeles, CA, USA, 4–9 December 2017; pp. 4403–4413. [Google Scholar]
- Park, H.; Lee, K.; Sohn, J.-Y.; Suh, C.; Moon, J. Hierarchical coding for distributed computing. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 1630–1634. [Google Scholar]
- Tandon, R.; Lei, Q.; Dimakis, A.G.; Karampatziakis, N. Gradient coding: Avoiding stragglers in distributed learning. In Proceedings of the International Conference Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3368–3376. [Google Scholar]
- Raviv, N.; Tandon, R.; Dimakis, A.; Tamo, I. Gradient coding from cyclic MDS codes and expander graphs. In Proceedings of the International Conference Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4305–4313. [Google Scholar]
- Ozfatura, E.; Gündüz, D.; Ulukus, S. Speeding up distributed gradient descent by utilizing non-persistent stragglers. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019. [Google Scholar]
- Dutta, S.; Cadambe, V.; Grover, P. Coded convolution for parallel and distributed computing within a deadline. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 2403–2407. [Google Scholar]
- Kosaian, J.; Rashmi, K.V.; Venkataraman, S. Learning-Based Coded Computation. IEEE J. Sel. Areas Commun. 2020, 1, 227–236. [Google Scholar] [CrossRef]
- Li, S.; Maddah-Ali, M.A.; Avestimehr, A.S. Coding for distributed fog computing. IEEE Commun. Mag. 2017, 55, 34–40. [Google Scholar] [CrossRef]
- Fu, X.; Wang, Y.; Yang, Y.; Postolache, O. Analysis on cascading reliability of edge-assisted Internet of Things. Reliab. Eng. Syst. Saf. 2022, 223, 108463. [Google Scholar] [CrossRef]
- Zaharia, M.; Konwinski, A.; Joseph, A.D.; Katz, R.H.; Stoica, I. Improving mapreduce performance in heterogeneous environments. In Proceedings of the USENIX Symposium on Operating Systems Design Implement (OSDI), San Diego, CA, USA, 8–10 December 2008; pp. 29–42. [Google Scholar]
- Reisizadeh, A.; Prakash, S.; Pedarsani, R.; Avestimehr, S. Coded computation over heterogeneous clusters. IEEE Trans. Inf. Theory. 2019, 65, 4227–4242. [Google Scholar] [CrossRef]
- Kim, D.; Park, H.; Choi, J.K. Optimal load allocation for coded distributed computation in heterogeneous clusters. IEEE Trans. Commun. 2021, 69, 44–58. [Google Scholar] [CrossRef]
- Gao, J. Machine Learning Applications for Data Center Optimization. Google White Pap. Available online: https://research.google/pubs/pub42542/ (accessed on 6 April 2023).
- Tang, B.; Matteson, D.S. Probabilistic transformer for time series analysis. Adv. Neural Inf. Process. Syst. 2021, 34, 23592–23608. [Google Scholar]
- Lin, Y.; Koprinska, I.; Rana, M. SSDNet: State space decomposition neural network for time series forecasting. In Proceedings of the 2021 IEEE International Conference on Data Mining (ICDM), Auckland, New Zealand, 7–10 December 2021; pp. 370–378. [Google Scholar]
- Hu, Y.; Jia, X.; Tomizuka, M.; Zhan, W. Causal-based time series domain generalization for vehicle intention prediction. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 7806–7813. [Google Scholar]
- Jung, H.; Lee, J.G.; Kim, S.H. On the analysis of fitness change: Fitness-popularity dynamic network model with varying fitness. J. Stat. Mech. Theory Exp. 2020, 4, 043407. [Google Scholar] [CrossRef]
- Kitagawa, G. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. J. Comput. Graph. Stat. 1996, 5, 1–25. [Google Scholar]
- Kantas, N.; Doucet, A.; Singh, S.S.; Maciejowski, J.; Chopin, N. On particle methods for parameter estimation in state-space models. Stat. Sci. 2015, 30, 328–351. [Google Scholar] [CrossRef]
- Pitt, M.K.; Shephard, N. Filtering via simulation: Auxiliary particle filters. J. Am. Stat. Assoc. 1999, 94, 590–599. [Google Scholar] [CrossRef]
- Carpenter, J.; Clifford, P.; Fearnhead, P. Improved particle filter for nonlinear problems. IEE Proc. Radar Sonar Navig. 1999, 146, 2–7. [Google Scholar] [CrossRef]
- Doucet, A.; De Freitas, N.; Gordon, N. An introduction to sequential Monte Carlo methods. Seq. Monte Carlo Methods Pract. 2001, 3–14. [Google Scholar] [CrossRef]
- Doucet, A.; Johansen, A.M. A tutorial on particle filtering and smoothing: Fifteen years later. Handb. Nonlinear Filter 2009, 12, 656–704. [Google Scholar]
- Gilks, W.R.; Wild, P. Adaptive rejection sampling for Gibbs sampling. Appl. Statist. 1992, 41, 337–348. [Google Scholar] [CrossRef]