Towards Collaborative Edge Intelligence: Blockchain-Based Data Valuation and Scheduling for Improved Quality of Service †

Abstract: Collaborative edge intelligence, a distributed computing paradigm, refers to a system where multiple edge devices work together to process data and perform distributed machine learning (DML) tasks locally. Decentralized Internet of Things (IoT) devices share knowledge and resources to improve the quality of service (QoS) of the system with reduced reliance on centralized cloud infrastructure. However, the paradigm is vulnerable to free-riding attacks, where some devices benefit from the collective intelligence without contributing their fair share, potentially disincentivizing collaboration and undermining the system’s effectiveness. Moreover, data collected from heterogeneous IoT devices may contain biased information that decreases the prediction accuracy of DML models. To address these challenges, we propose a novel incentive mechanism that relies on time-dependent blockchain records and multi-access edge computing (MEC). We formulate the QoS problem as an unbounded multiple knapsack problem at the network edge. Furthermore, a decentralized valuation protocol is introduced atop blockchain to incentivize contributors and disincentivize free-riders. To improve model prediction accuracy within latency requirements, a data scheduling algorithm is given based on a curriculum learning framework. Based on our computer simulations using heterogeneous datasets, we identify two critical factors for enhancing the QoS in collaborative edge intelligence systems: (1) mitigating the impact of information loss and free-riders via decentralized data valuation and (2) optimizing the marginal utility of individual data samples by adaptive data scheduling.


Introduction
In machine learning (ML), data and the model are two fundamental elements. Traditional model-centric approaches focus on improving the prediction accuracy of artificial intelligence (AI) by using larger and more elaborate models. For example, wider and deeper Transformer models perform better than smaller ones in AI training [1]. However, Internet of Things (IoT) devices are predominantly engineered with stringent constraints on physical size and power consumption, prioritizing portability and energy efficiency. Consequently, these resource-constrained devices often lack the computational power and energy capacity necessary for training large AI models and running inference on them effectively. Nevertheless, edge intelligence in the future internet [2] requires even more stringent quality of service (QoS).
For years, a fixation on models has led to difficulties with understanding and reproducing ML results [3]. Therefore, the severe challenge of QoS at the network edge raises a fundamental question: Apart from model design, can the QoS of edge intelligence be improved by device collaboration and data conditioning at the network edge?
In this work, we focus on the exploration of an affirmative answer to the above question. We aim to give a simple yet effective framework to motivate collaboration for improved QoS at the network edge. Fortunately, distributed machine learning (DML) provides an opportunity to realize collaborative edge intelligence. DML is a paradigm that enables the training of ML models across multiple decentralized edge devices without exchanging private local data samples directly. Federated learning (FL) [4] is a well-known example of collaborative DML. In the context of FL, private and distributed data are not allowed to be shared for privacy reasons, as the model is trained locally by mini-batch stochastic gradient descent (SGD) on each participant's private data, and only model weights are communicated to a central server for global model aggregation. FL is especially beneficial for AI to learn private information under data regulations [5]. For instance, smart health applications rely on IoT data, especially private health data [4]. Instead of uploading IoT data, users transmit model weights trained from health data for prediction. Despite its privacy preservation feature, FL suffers from free-riding attacks [6][7][8], wherein malicious participants exploit the system by benefiting from the global model without meaningful contributions, thereby compromising the efficiency and fairness of the collaborative learning process. Moreover, FL requires a centralized parameter server, which may constitute a single point of failure (SPOF). Therefore, a decentralized approach is needed to enhance system security and reliability. As a distributed ledger technology with consecutive timestamps, blockchain offers a decentralized approach to address the above issues.
Additionally, data-centric methods have recently emerged for understanding and conditioning data in the context of centralized ML [9][10][11]. Compared with large datasets containing biased information, a small but unbiased dataset tends to have more useful information [9]. Therefore, data valuation and scheduling are needed before feeding data into an ML model such as a deep neural network (DNN). Intuitively, the more useful information that a DNN can obtain from data, the more accurate prediction results we can achieve. However, IoT data are often heterogeneous at the network edge. User data depend on living habits, preferences and locations. They are often not independent and identically distributed (non-IID). Yet DML relies on a training method named mini-batch SGD, which is designed under the assumption that data are IID [4,5]. A mismatch between the model and real data decreases a model's prediction accuracy. In this paper, we consider two factors that violate the IID property of data: label distribution skew (i.e., users produce biased data) and data quantity skew (i.e., users produce different amounts of data). These factors are particularly significant in DML scenarios where, due to beneficial data protection regulations, raw data are not directly accessible during model training. Without knowledge of data valuation, it is difficult to apply data-centric methods at the network edge.
To solve the above challenges, we propose a blockchain-based incentive mechanism to diversify rewards received by different contributors according to decentralized data valuation. Intuitively, an up-to-date model is of greater utility. Therefore, we aim to ensure every participant will leave the system with a global model reflecting its incremental contributions in the form of data valuation. Data valuation can be recorded on the blockchain, which is maintained by a group of entities instead of a centralized server. Therefore, data valuation becomes a consensus among all entities. By replacing one-server decisions with a group consensus, the evaluation of data valuation is decentralized. However, a conventional blockchain consensus mechanism, e.g., proof-of-work (PoW), is not suitable for data valuation recording because it requires extensive computing resources, which delays the training process of DML [12].
In this paper, we investigate a blockchain based on multi-access edge computing (MEC) servers, where delegated proof-of-stake (DPoS) [13] is adopted as the blockchain consensus mechanism. We refer to the MEC server simply as a server because there is only one kind of server in this paper. As a server is located closer to users than a cloud computing center, communication latency in DML is further reduced [12]. Instead of uploading local data distributions [14], valuations of decentralized data are based on private validation datasets. As private information is not included in blockchain records, data valuation is evaluated privately.
To the best of our knowledge, this is the first work that explores blockchain's potential in diversifying model rewards by decentralized data valuation and data scheduling. Our major contributions are summarized as follows:
• A time-dependent incentive mechanism is proposed atop blockchain to diversify model rewards. It improves the QoS of collaborative edge intelligence while preventing free-riders from using the system.
• We propose a decentralized data valuation method combining cross-validation and DPoS consensus to mitigate information loss of DML on heterogeneous data. The valuation function achieves group rationality, fairness and additivity at the network edge.
• To maximize the marginal utility of data samples, a curriculum data scheduling approach is designed. With an adaptive moving window, the efficiency of data scheduling is improved with reduced latency.
The rest of this paper is organized as follows. Related works are summarized in Section 2. The system model and problem formulation are described in Section 3. Our proposed blockchain-based incentive mechanism is given in Section 4 to defend against free-rider attacks. An adaptive curriculum learning method is further proposed in Section 5 to improve the QoS of each user. We further propose our algorithms in Section 6 as an optimized solution to the formulated QoS problem. Performance evaluation results on non-IID datasets are provided in Section 7. Finally, we conclude this paper in Section 8.

The Integration of Blockchain and Edge Intelligence
In the next generation of IoT, a massive number of diverse devices will be connected [15]. To enhance locally trained models with knowledge gained at other nodes, participants must securely exchange model parameters on a peer-to-peer basis. Blockchains offer this secure sharing capability without requiring mutual trust between nodes or reliance on a trusted third party [16]. Once local model parameters are recorded in a block, they become traceable, immutable and irrevocable, ensuring data integrity throughout the collaborative process [17]. As edge nodes increasingly support DML and provide AI models for rapid decision-making, their computational resources may become strained and insufficient. To address this challenge, it may be beneficial for neighboring edge nodes to share their computational resources. To encourage such sharing, an incentive mechanism offering rewards can be implemented [18]. Blockchain technology provides an ideal platform for deploying such a mechanism in a distributed manner, ensuring transparency and trust among participating nodes. Specifically, an AI-Chain has been proposed with a proof-of-learning protocol to unlock the sharing of more advanced intelligence among edges [19]. A blockchain-enabled decentralization approach was given in [20] to optimize the QoS of industrial IoT. Furthermore, smart contracts have been designed and applied to FL for personalized edge intelligence service in IoT systems [21]. In this paper, we discuss a novel integration of blockchain and edge intelligence via decentralized data valuation to improve QoS.

Data Valuation for Collaborative Edge Intelligence
The data valuation problem has been a central topic in the realm of collaborative ML for a long time. Fairness of data valuation has been recognized as the key to trustworthy AI [22]. As a line of research, Shapley values have been extensively investigated as an approach to data valuation [23]. Specifically, the data Shapley value measures the contribution of a single data point to a learning task [24]. The distributional Shapley value extends the concept to arbitrary data distributions and provides stability guarantees [25]. The contribution index extends the data Shapley value to FL by gradient-based model reconstructions [26]. The authors of [26] investigated the profit allocation problem for FL, wherein a diminishing valuation factor was introduced to encourage participants in early rounds. The federated Shapley value is determined from local model updates in each training iteration, avoiding the need for retraining the model [27]. Proof-of-Shapley opens opportunities for Shapley-value-based blockchain consensus without additional need for trust [28]. However, it is often challenging to identify a viable, well-defined and regulated source of financial compensation for data contributors in the public domain. Apart from the Shapley value, self-reported valuation requests the data owners to submit their bids, which consist of information about the combination of resources, local accuracy and costs [29]. Additionally, permutation-based data valuation identifies the training points (labeled data samples) that are most responsible for a given prediction [30]. The method depends on computing influence functions of data samples. Furthermore, the reinforcement-learning-based method adaptively learns the contribution of each data point towards the learned predictor model [31]. Unlike the existing literature, we aim to explore blockchain's potential for improving the QoS of edge intelligence via decentralized, time-dependent and token-independent data valuation.

System Model and Problem Formulation
In this section, we describe the network model, task model, training model, threat model and blockchain incentive model.

Network Model
We consider an IoT network as illustrated in Figure 1, where $M$ servers serve $N$ smartphones. Smartphones collect health data with $L$ possible labels and execute DML training for local model weights. We assume each smartphone uploads local weights to its nearest server. As raw IoT data never leave the corresponding smartphones in DML training, privacy leakage is reduced [12]. To further secure DML for smart health, a permissioned blockchain is used [32]. Only verified entities can join the proposed DML system. We assume servers are honest and do not launch attacks. Servers store complete records of all blocks, while smartphones only store block headers to verify and download global weights [33]. However, smartphones may become free-riders by contributing very little while still downloading the global model [8]. A detailed threat model is introduced later in this section.

Task Model
We consider any smartphone $n \in \mathcal{N} = \{1, 2, \dots, N\}$, which aims to obtain a well-trained model with the maximal utility (i.e., model test accuracy), denoted by $U_n$, within a latency threshold, denoted by $T_n$.
A DML model is trained for the detection of potential health risks. Specifically, the goal of DML is to train a model with $L$ possible outputs for health data classification with $L$ data labels through $R$ rounds of collaboration. For example, data label $l \in \mathcal{L} = \{1, 2, \dots, L\}$ could be "high blood pressure". In round $r \in \mathcal{R} = \{1, 2, \dots, R\}$, let $p_{n,r} \in P_r = \{p_{1,r}, p_{2,r}, \dots, p_{N,r}\}$ denote the observed categorical distribution of the data labels on smartphone $n$. Denote $q$ as the constant IID categorical distribution of $L$ labels throughout $R$ rounds. Then, $p_{n,r} = \{p_{1,n,r}, p_{2,n,r}, \dots, p_{L,n,r}\}$ and $q = \{q_1, q_2, \dots, q_L\}$.
For example, a model is trained for detecting blood pressure risk. Collected health data have three labels: "high", "low" and "normal". In round 3, smartphone 1 owns 10 health data samples, including 2 samples of high blood pressure, 1 low blood pressure sample and 7 samples of normal blood pressure. Then, $p_{1,3} = \{\frac{2}{10}, \frac{1}{10}, \frac{7}{10}\}$, and $q$ can be a uniform distribution or another constant distribution, e.g., $q = \{\frac{1}{100}, \frac{1}{100}, \frac{49}{50}\}$. However, $q$ can be largely unknown to smartphones and servers during DML. In this paper, we do not assume prior knowledge of $q$ at the network edge.
To achieve the goal of DML, any smartphone $n$ should decide when to participate in DML and how much effort it should contribute to DML. Let $a_{n,r}$, $n \in \mathcal{N}$, $r \in \mathcal{R}$, be a decision variable such that $a_{n,r} = 1$ indicates that smartphone $n$ participates in round $r$; otherwise, $a_{n,r} = 0$. That is,
$$a_{n,r} \in \{0, 1\}, \quad \forall n \in \mathcal{N}, r \in \mathcal{R}. \tag{1}$$
In practice, any smartphone $n$ owns a limited amount of private data. Let $\bar{d}_n$ and $d_{n,r}$ denote the total data size and the training dataset actually used in round $r$, respectively. Let $d_n$ be the local training dataset owned by smartphone $n$, $d_n = \{(x_{n,1}, y_{n,1}), \dots, (x_{n,\bar{d}_n}, y_{n,\bar{d}_n})\}$, where $(x, y)$ denotes one instance that includes the raw data and label. Then,
$$\bigcup_{r \in \mathcal{R}} a_{n,r} d_{n,r} \subseteq d_n, \quad \forall n \in \mathcal{N}, r \in \mathcal{R}. \tag{2}$$
Let $t_r$ be the time consumed for DML training round $r$, including the time for local training, model aggregation and blockchain consensus. The latency constraint for smartphone $n$ is described as
$$\sum_{r \in \mathcal{R}} a_{n,r} t_r \le T_n, \quad \forall n \in \mathcal{N}. \tag{3}$$
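The three constraints of the task model can be checked mechanically for a candidate participation plan. The following is a minimal sketch under stated assumptions: samples are identified by integer indices, the latency constraint is read as "the total time of the rounds in which the smartphone participates must not exceed $T_n$", and the function name `is_feasible` and its parameters are hypothetical names introduced only for illustration.

```python
def is_feasible(participation, round_data, local_data, round_times, latency_limit):
    """Check one smartphone's plan against the task-model constraints.

    participation : list of a_{n,r} values, one per round
    round_data    : list of sets of sample indices used in each round (d_{n,r})
    local_data    : set of sample indices owned by the smartphone (d_n)
    round_times   : list of per-round durations t_r
    latency_limit : latency threshold T_n (assumed reading of the constraint)
    """
    # Participation decisions must be binary.
    if any(a not in (0, 1) for a in participation):
        return False

    # Data used in participated rounds must come from the local dataset.
    used = set()
    for a, d in zip(participation, round_data):
        if a == 1:
            used |= set(d)
    if not used <= set(local_data):
        return False

    # Time spent in participated rounds must stay within the threshold.
    elapsed = sum(t for a, t in zip(participation, round_times) if a == 1)
    return elapsed <= latency_limit
```

For instance, a plan that joins rounds 1 and 3 with durations 2.0 and 2.5 is feasible under a threshold of 5.0 but not under 4.0.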

Training Model
As there is only one type of round in this paper, we refer to a DML training round simply as a round. In round $r$, smartphone $n$ produces one set of local weights, denoted by $w_{n,r} \in W_r = \{w_{1,r}, w_{2,r}, \dots, w_{N,r}\}$. Let $d_{n,r} \in D_r = \{d_{1,r}, d_{2,r}, \dots, d_{N,r}\}$ and $C_{n,r}$ denote the number of data samples and the actual contribution produced by smartphone $n$ in round $r$, respectively. Define $f_n(w_{n,r})$ as the local loss function used by smartphone $n$. Popular loss functions include cross-entropy loss and mean squared error [4]. Let $C_r = \sum_{n \in \mathcal{N}} C_{n,r}$ denote the total amount of actual contributions produced by $N$ smartphones. We define the global loss function on all distributed health data as
$$F(w_r) = \sum_{n \in \mathcal{N}} \frac{C_{n,r}}{C_r} f_n(w_{n,r}). \tag{4}$$
$F(w_r)$ cannot be directly computed without sharing $C_r$ and $W_r$ among smartphones and servers. Note that FedAvg [5] uses $D_r$ to approximate $C_r$. However, this approximation may not always accurately reflect the true contribution or value of data samples to global model performance. Let $\eta_r$, $g_{n,r}$ and $d_r = \sum_{n=1}^{N} d_{n,r}$ denote the learning rate, gradients and total number of data samples used in round $r$, respectively. For FedAvg, weight aggregation is described as
$$w_r = w_{r-1} - \eta_r \sum_{n=1}^{N} \frac{d_{n,r}}{d_r} g_{n,r}. \tag{5}$$
In DML, the marginal utility of data samples diminishes as the quantity of data (i.e., $d_{n,r}$) increases [34]. This principle of diminishing returns suggests that each additional data sample contributes less to model improvement than previous data samples with the same quality. However, (5) does not account for the diminishing marginal utility. In Sections 4.3 and 5.1, we propose and discuss alternative methods to approximate $C_{n,r}$ in (4) that incorporate the concept of diminishing marginal utility, aiming to reflect the true value of additional data samples more accurately. In this paper, the objective of DML is to determine the optimal set of global weights $w_r^*$ for round $r$ that minimizes the global loss function $F(w_r)$. The optimization problem is
$$w_r^* = \arg\min_{w_r} F(w_r). \tag{6}$$
The above problem is often solved by mini-batch SGD [4] due to its NP-hardness [35].
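To make the FedAvg weighting concrete, the sketch below averages local weight vectors with each smartphone's share of the round's data, $d_{n,r}/d_r$, as the coefficient. It is an illustration rather than the aggregation code of this paper: it works on already-updated local weights (equivalent to the gradient form when each local weight vector is one gradient step away from the previous global weights), and the function name `fedavg_aggregate` is a hypothetical name chosen for this example.

```python
def fedavg_aggregate(local_weights, sample_counts):
    """Data-size-weighted average of local weight vectors (FedAvg-style).

    local_weights : list of equal-length lists of floats, one per smartphone
    sample_counts : list of d_{n,r}, the number of samples each smartphone used
    """
    if not local_weights or len(local_weights) != len(sample_counts):
        raise ValueError("need one sample count per local weight vector")
    total = sum(sample_counts)  # d_r
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    dim = len(local_weights[0])
    global_weights = [0.0] * dim
    for weights, count in zip(local_weights, sample_counts):
        if len(weights) != dim:
            raise ValueError("all weight vectors must have the same length")
        share = count / total  # d_{n,r} / d_r
        for k in range(dim):
            global_weights[k] += share * weights[k]
    return global_weights
```

Because the coefficient depends only on data quantity, a smartphone that submits many low-value samples receives the same influence as one that submits the same number of informative samples, which is the limitation the text attributes to approximating $C_r$ by $D_r$.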
Training a DNN at the network edge is challenging due to resource constraints: IoT devices often lack the computational power required to efficiently perform SGD and backpropagation on large models. To solve this problem, we introduce a pre-trained model framework in Figure 2 for efficient feature extraction [36]. Neurons in gray color are frozen during training and inference. Only the weights of model heads are trainable, i.e., global and local weights in this paper. Instead of training and uploading the whole model, any smartphone $n$ can train and upload the weights of the model head. Utilizing pre-trained models for feature extraction is not novel. As it can significantly reduce computational and communication overheads, we would like to highlight this method for improved QoS at the network edge.

Threat Model
Despite the benefits of collaboration among edge devices, there are critical threats and trust issues. One major problem is how to defend against free-rider attacks. To be specific, we consider two categories of free-riders: (1) Fake contributors: These free-riders typically deceive the system by submitting fabricated local model weights. Fake contributions can include random model parameters, a combination of existing model weights and existing model weights altered with additive noise. (2) Weak contributors: Each free-rider trains a local model based on a private local dataset. However, the submitted local model barely improves the performance of the global model. Weak contributors may use noisy or very small amounts of data during training. Free-riders contribute very little while still reaping the benefit of the aggregated global model in every round.
To better clarify the impact of free-riding attacks, we classify possible consequences into three categories: unfair reward allocation, delay of service and poor performance of the global model. Unfair reward allocation arises because all honest contributors and free-riders receive the same up-to-date global model, while their contributions are distinct. Delay of service arises because the inclusion of free-riders consumes system resources such as communication bandwidth and consensus overhead. Furthermore, it is catastrophic to aggregate local models submitted by free-riders into the global model. The convergence speed and test accuracy of the global model can be degraded significantly [7]. Therefore, eliminating free-riders is a critical need for a robust and fair DML system.
Existing defensive methods [6][7][8] assume a trusted model aggregator to conduct free-riding detection, which is not applicable to DML. Detecting free-riders in a peer-to-peer network is challenging because each participant has limited information about the true positive contributions of others, which are based solely on their private local datasets. To solve the problem, we aim to use a decentralized approach to defend against free-rider attacks. We aim to develop an incentive mechanism that ensures every participant leaves the system with a global model reflecting their incremental contributions to DML.

Blockchain Incentive Model
Blockchain is a record list chained with consecutive timestamps. As training continues, the performance of the global model improves. Therefore, timestamped records are well suited to differentiating the value of global models. Our proposed incentive model is illustrated in Figure 3.
In this paper, we consider synchronized communication during DML, as illustrated in Figure 1. However, smartphones may only be allowed to continue synchronized DML until a specific round based on their data valuations. Any smartphone $n$ can download the up-to-date global model only while it remains in the system. At the end of DML, smartphone $n$ leaves the system with a global model reflecting its contribution to DML. All smartphones are required to contribute continually such that they can obtain the up-to-date global model as the model reward. As illustrated in Figure 3, any contributor, including free-riders, may access a set of global models. However, the length of the accessible model chain depends on the individual incremental contribution. Free-riders cannot access specific models that are beyond their access length. For any model $w$ recorded on the blockchain, the length of the accessible model chain determines whether access to $w$ is denied or permitted. In this context, weak contributors can still access a subset of on-chain models, while fake contributors may not have access to any model on the blockchain. As expected, weak contributors are not penalized at the same level as fake contributors.

[Figure 3. Diagram labels: Time, Global Model, Blockchain, Raw Data, Rewards, Attacker, Noisy Data.]
Let $V$ and $R_n$ be the valuation function in the system and the exit round of smartphone $n$, respectively. Let $\beta$ be a system parameter reflecting the valuation rate. For any smartphone $n$, we formally introduce the model reward constraint, referred to as (7) throughout this paper. The above constraint ensures that every model reward is based on the data valuations contributed by the corresponding smartphone. Let $V_{n,r} \in V_r = \{V_{1,r}, V_{2,r}, \dots, V_{N,r}\}$ denote the valuation of data contributed from smartphone $n$ in round $r$. Refer to Section 4 for the calculation of $V_{n,r}$. To estimate the actual data contributions $C_r$ in round $r$, the data valuations of $N$ distributed datasets are recorded on the blockchain. Our proposed blockchain structure is shown in Figure 4. Subscripts represent different rounds in DML. Let $w_0$ denote the weights of the initial model for DML. Public keys are exchanged among verified smartphones and servers. Note that blockchain-enabled DML for smart health needs to follow data regulations for data safety protection. Therefore, a DPoS [13] is used in our system to ensure the security and integrity of data in distributed ledgers. We assume stakeholders stake tokens for votes. Stakeholders (i.e., smartphones) vote for servers with their stakes. Let $v_{m,r}$

Problem Formulation
In a typical decentralized edge system, each smartphone prioritizes its own QoS without considering the impact on others. To improve the QoS of collaborative edge intelligence, for any user $n$, we focus on maximizing the utility $U_n$ of the downloaded global model within the latency requirement. More formally, considering constraints (1)-(3) and (7), we formulate the QoS optimization problem for any user $n \in \mathcal{N}$ as follows:
$$\max \; U_n \tag{8}$$
s.t.: (1)-(3) and (7).
Note that (8) is a variant of the unbounded multiple knapsack problem (UMKP), a classic NP-hard combinatorial optimization problem [37], but differs from UMKP due to the additional constraints (3) for a QoS guarantee and (7) for fair reward allocation. It is very challenging to achieve the optimal solution to (8) in DML, as true data distributions in heterogeneous edge networks are largely unpredictable. According to inequality (7), (8) can be relaxed to a data valuation maximization problem, denoted by (9), subject to constraints (1)-(3).
We propose V based on decentralized calculations in Section 4 and discuss an optimized solution to (8) in Section 5.

Decentralized Data Valuation
In this section, we demonstrate a collaborative method to evaluate data value for improving the QoS of every smartphone. We first describe the workflow of block generation for a decentralized assessment of data contribution. We further identify the IID property of data as a criterion in distributed data evaluation. Then, we propose our novel data valuation approach for our decentralized system.

Block Generation Workflow
In our DPoS blockchain, the $M_D$ servers with the most votes join the consensus group and become delegated servers. Servers with the same number of votes are sorted by ascending hash values of their public keys. To attract enough votes, servers need to build strong and positive reputations by following data regulations in DML.
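The delegate selection rule above (most votes first, ties broken by ascending public-key hash) can be sketched in a few lines. This is an illustration under assumptions: the paper does not name the hash function, so SHA-256 is used here as a stand-in, and `select_delegates`, `votes` and `public_keys` are hypothetical names.

```python
import hashlib


def select_delegates(votes, public_keys, m_d):
    """Return the m_d server ids that form the delegated consensus group.

    votes       : dict mapping server id -> number of votes received
    public_keys : dict mapping server id -> public key as bytes
    m_d         : number of delegated servers (M_D)
    """
    def key_hash(server_id):
        # Assumption: SHA-256 hex digest of the raw public-key bytes.
        return hashlib.sha256(public_keys[server_id]).hexdigest()

    # Sort by descending votes; equal votes fall back to ascending key hash.
    ranked = sorted(votes, key=lambda sid: (-votes[sid], key_hash(sid)))
    return ranked[:m_d]
```

Sorting on the pair (negative vote count, key hash) keeps the ordering deterministic, so every server that sees the same votes and public keys derives the same consensus group.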
For round $r$, a leading server $m_r^*$ is selected according to (10). Then, delegated server $m_r^*$ collects transactions, aggregates local weights and generates a new block. Note that $L$ equals the number of output units of the initial model with weights $w_0$.
As an additional clarification of Figure 4, we first show genesis block generation at the start of DML:
Genesis Block Generation (round 0):
(1) Blockchain stakeholders deploy initial weights $w_0$ with total round number $R$ and number of delegated servers $M_D$;
(2) verified servers and smartphones register in DML by submitting their public keys;
(3) smartphones submit votes $V_0$ to select $M_D$ servers;
(4) a server is selected by (10) to create a genesis block.
We further clarify the workflow of one round as follows:
Regular Block Generation (round $r$, $r \in \{1, 2, \dots, R\}$):
(1) Any smartphone $n \in \mathcal{N}$ collects health data to train a set of local weights $w_{n,r}$;
(2) after exchanging $w_{n,r}$ in a peer-to-peer network, each smartphone $n$ calculates and broadcasts other smartphones' data contributions and determines $d_{n,r}$ based on received local weights and its local health data, respectively;
(3) any smartphone $n$ submits $w_{n,r}$, $V_{n,r}$ and $d_{n,r}$ to servers by launching a blockchain transaction signed by its private key;
(4) each server receives and verifies $N$ transactions by the public key set recorded on the genesis block, and then, a server is selected by DPoS to aggregate model weights;
(5) the selected delegated server signs $w_r$ with its private key and generates a new block, and then, the block is propagated to all servers for verification;
(6) once $w_r$ is recorded on the blockchain, permitted smartphones can download $w_r$;
(7) any smartphone can verify $w_r$ by the server's public key.

KL Divergence in Distributed Data
The Kullback-Leibler (KL) divergence of probability distribution $p$ diverging from the referenced probability distribution $q$ is denoted as $D_{\mathrm{KL}}(p \| q)$ [38]. In a general interpretation, KL divergence quantifies the expected information loss because of using distribution $q$ to approximate the actual data distribution $p$.
In DML, training a deep network relies on the assumption that a stochastic gradient on distributed data is an unbiased estimate of the full gradient on the entire dataset (i.e., distributed health data are assumed to be IID) [5]. However, as introduced in Section 1, real datasets are non-IID. Therefore, a decrease in DML performance is inevitable. We use KL divergence as a metric to evaluate the information loss when applying SGD, which was originally designed for IID training data, in non-IID cases.
For smartphone $n$ in round $r$, the KL divergence between the real data distribution $p_{n,r}$ and IID data distribution $q$ is
$$D_{\mathrm{KL}}(p_{n,r} \| q) = \sum_{l \in \mathcal{L}} p_{l,n,r} \ln\left(\frac{p_{l,n,r}}{q_l}\right), \tag{11}$$
where $q_l$ is unknown. Note that KL divergence is asymmetric (i.e., $D_{\mathrm{KL}}(p_{n,r} \| q) \neq D_{\mathrm{KL}}(q \| p_{n,r})$) in most cases [38]. According to (11), we can calculate the information loss of applying SGD on $p_{n,r}$ instead of the unknown $q$.
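As a worked illustration of the KL divergence, the sketch below evaluates it for the blood pressure example of Section 3, where smartphone 1 holds 2 "high", 1 "low" and 7 "normal" samples in round 3. Note the assumption: in the proposed system $q$ is unknown, so the reference distributions used here (uniform, and the skewed example from the task model) are supplied purely for demonstration. The function name `kl_divergence` is hypothetical.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(p || q) in nats for two categorical distributions.

    Terms with p_l = 0 contribute zero by the usual convention; if q_l = 0
    while p_l > 0, the divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of labels")
    total = 0.0
    for p_l, q_l in zip(p, q):
        if p_l == 0:
            continue
        if q_l == 0:
            return math.inf
        total += p_l * math.log(p_l / q_l)
    return total


# Label distribution of smartphone 1 in round 3 (high, low, normal).
p_13 = [2 / 10, 1 / 10, 7 / 10]

# Reference distributions assumed known only for this illustration.
q_uniform = [1 / 3, 1 / 3, 1 / 3]
q_skewed = [1 / 100, 1 / 100, 49 / 50]

loss_vs_uniform = kl_divergence(p_13, q_uniform)
loss_vs_skewed = kl_divergence(p_13, q_skewed)
reverse_vs_uniform = kl_divergence(q_uniform, p_13)  # differs: KL is asymmetric
```

Comparing `loss_vs_uniform` with `reverse_vs_uniform` shows the asymmetry noted in the text: swapping the arguments changes the value.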
Let $Q_{n,r}$ be the data quality of $d_{n,r}$ for $N$ smartphones in round $r$. Suppose the data quantity $d_{n,r}$ is the same across $N$ smartphones; then $Q_{n,r}$ is given by (12). $Q_{n,r}$ can be used to indicate the true quality of distributed data in DML [39]. However, it is very challenging to compute $Q_{n,r}$ due to the uncertainty of $q$ and data quantity skews. In this paper, we investigate an alternative method to evaluate the valuation of any $d_{n,r}$.

Decentralized Calculation of Data Valuation
One approach to approximate $q$ is via decentralized cross-validation in distributed data networks. We define the utility of a global model as the validation accuracy based on private local data. Let $d_n^{\mathrm{val}}$ be the validation dataset of smartphone $n$. The validation accuracy of any model $w$, denoted by $A(w, d_n^{\mathrm{val}})$, is calculated as the ratio of matching predictions to the total number of validation samples. More formally, for any smartphone $n$, the utility of the global model in round $r$ is calculated by
$$U_n(w_r) = A(w_r, d_n^{\mathrm{val}}). \tag{13}$$
Let $\Phi_{i,n,r}(d_{n,r}) = U_i(w_{n,r}) - U_i(w_{r-1})$ be the contribution of $d_{n,r}$ to smartphone $i$, $i, n \in \mathcal{N}$. Then, $\Phi_{i,n,r}(d_{n,r})$ is defined by
$$\Phi_{i,n,r}(d_{n,r}) = A(w_{n,r}, d_i^{\mathrm{val}}) - A(w_{r-1}, d_i^{\mathrm{val}}) \tag{14}$$
because datasets from individual smartphones remain localized and are never transmitted beyond their respective devices. Then, the valuation of data contribution from any smartphone $n$ is determined by all smartphones except $n$. That is, $V(d_{n,r})$ is obtained in (15) from the contributions $\Phi_{i,n,r}(d_{n,r})$ evaluated by every smartphone $i \in \mathcal{N} \setminus \{n\}$. Equation (15) describes a cross-validation of contributions. Every smartphone's data valuation is based on its uploaded model weights and is determined by other smartphones. Therefore, the calculation of $V_{n,r}$ (i.e., $V(d_{n,r})$) is fully decentralized, relying solely on the blockchain records of model weights. To solve (8) and maximize $U_n$, we argue that decentralized calculation of $V_{n,r}$ is a critical step to minimize $D_{\mathrm{KL}}(p_{n,r} \| q)$ at the network edge, especially when $q$ is unpredictable. By examining (14) and (15), our data valuation function $V$ has the following properties: (1) Group rationality: The valuation of the per-round data contribution is completely distributed among all data contributors, i.e., $V_r = \sum_{n \in \mathcal{N}} V(d_{n,r})$. Furthermore, we present the following theorem to elucidate how decentralized data valuation benefits collaborative edge intelligence.
Theorem 1. Cross-validation of data contributions using (15) can prevent free-riders.
Proof of Theorem 1. We prove this by contradiction.
Suppose that, in any round $r \in \mathcal{R}$, a free-rider $j \in \mathcal{N}$ has a data valuation of $V(d_{j,r})$. An honest data contributor $k \in \mathcal{N}$, $k \neq j$, has $V(d_{k,r})$ as the lowest data contribution among the honest contributors.
Suppose, for the sake of contradiction, that the free-rider, smartphone $j$, can mimic the honest data contributor, smartphone $k$; that is, $V(d_{j,r}) \geq V(d_{k,r})$. Expanding both valuations with (15) and examining (13) and (14), we can conclude that the free-rider achieves an improvement in validation accuracy on other smartphones that is at least as great as that of the honest contributor when the improvement is measured against the previous global model $w_{r-1}$.
Therefore, the free-rider must improve $w_r$ at least as much as the honest contributor, which is impossible without a true data contribution. We have arrived at a contradiction. Thus, our assumption that free-rider $j$ can mimic honest contributor $k$ must be false. No free-rider can mimic an honest data contributor without a true contribution to the global model.
Therefore, (15) makes free-riders unable to mimic honest contributors and can block free-riders from our system. This completes the proof.
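The cross-validation idea behind (13)-(15) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function and variable names are hypothetical, the models are toy stand-ins, and the way the per-validator contributions are combined (a plain sum over all other devices) is an assumption made only for illustration.

```python
from typing import Callable, Dict, List

Model = List[float]  # toy stand-in for uploaded model weights


def cross_validated_valuations(
    uploaded: Dict[str, Model],
    previous_global: Model,
    validators: Dict[str, Callable[[Model], float]],
) -> Dict[str, float]:
    """Value each upload by the accuracy gain it yields on the OTHER devices.

    ASSUMPTION: the valuation is the plain sum of per-validator gains.
    """
    # Accuracy of the previous global model on every device's validation data.
    baseline = {i: acc(previous_global) for i, acc in validators.items()}
    valuations: Dict[str, float] = {}
    for n, weights in uploaded.items():
        valuations[n] = sum(
            acc(weights) - baseline[i]      # gain seen by validator i
            for i, acc in validators.items()
            if i != n                       # a device never scores itself
        )
    return valuations


def toy_accuracy(model: Model) -> float:
    """Hypothetical validator: reads the 'accuracy' straight from the toy model."""
    return model[0]


validators = {name: toy_accuracy for name in ("a", "b", "free")}
previous_global = [0.50]
# The free-rider re-uploads the previous global model unchanged.
uploaded = {"a": [0.60], "b": [0.70], "free": [0.50]}
valuations = cross_validated_valuations(uploaded, previous_global, validators)
```

In this toy setting the device that re-uploads the previous global model receives no valuation, while the two honest devices are ranked by the accuracy gain they provide to their peers.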

Curriculum Data Scheduling
In this section, we introduce a novel method for smartphones to schedule their limited private data across multiple rounds for better QoS. We first analyze the marginal utility of data samples. Next, principles of curriculum learning [40] are introduced, with a focus on local data scheduling at the network edge. Finally, a novel data scheduling method is proposed for every smartphone to improve the utility of the downloaded model.

Marginal Utility of Data Samples
Besides data valuation, the performance of a trained model also relies on the quantity of data used in DML. However, the test accuracy of a trained model does not increase linearly with the number of data samples [34]. In fact, as data quantity $d_{n,r}$ grows larger, model test accuracy increases more slowly. In other words, the marginal utility of data samples diminishes with data quantity in terms of test accuracy. Specifically, the term marginal refers to one health data sample in this paper. In round $r$, the utility of data samples, denoted by $U$, is used to measure the contribution $C_{n,r}$ to improving the test accuracy of global model $w_r$. We define the marginal utility of data samples as the ratio $\Delta C_{n,r} / \Delta d_{n,r}$, where $\Delta C_{n,r}$ represents the change of contribution by adding one data sample, and $\Delta d_{n,r} = 1$.
To demonstrate the diminishing marginal utility in DML, we model the marginal utility of data samples as the reciprocal of $d_{n,r}$ multiplied by a system parameter $\lambda$, i.e., $\Delta C_{n,r} / \Delta d_{n,r} = \lambda / d_{n,r}$. Therefore, $C_{n,r}$ is calculated by (19), where $d_{n,r} > 1$, meaning at least two samples are required for the calculation of the marginal utility. For a finite set of data samples $d_{n,r}$, maximizing the marginal utility of each sample results in the highest possible value for $C_{n,r}$. In this paper, we propose a novel solution to maximize data utility by optimizing the marginal value of each sample.
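The diminishing effect can be made concrete with a short numerical sketch. The marginal utility $\lambda / d_{n,r}$ follows the text; the accumulation rule used below (a discrete sum of the marginal utilities of the 1st to the $d$-th sample) is an assumption for illustration only, and the function names are hypothetical.

```python
def marginal_utility(num_samples: int, lam: float = 1.0) -> float:
    """Marginal utility of the num_samples-th sample: lam / num_samples."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return lam / num_samples


def accumulated_contribution(num_samples: int, lam: float = 1.0) -> float:
    """ASSUMPTION: contribution is the running sum of marginal utilities."""
    return sum(marginal_utility(k, lam) for k in range(1, num_samples + 1))


# Each extra sample helps less than the previous one.
first_ten = accumulated_contribution(10)
next_ninety = accumulated_contribution(100) - accumulated_contribution(10)
```

Under this assumption, the first 10 samples contribute more than the following 90 samples together, which is the behaviour the scheduling method tries to exploit.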

Principles of Curriculum Learning at the Network Edge
Curriculum, a learning strategy, refers to a structured set of content and learning experiences. In the realm of ML, curriculum learning is a concept inspired by the way humans learn new information progressively, from simpler instances to more complex ones. In the context of DML, we aim to design a training process that starts with easier tasks and gradually moves to more difficult tasks. This learning approach has been proven to be helpful for improving the utility of global models [41].
To apply curriculum learning at the network edge, for any smartphone $n$, we summarize the following key principles:
(2) Pacing functions should be monotonically increasing: The number of data samples scheduled per round is determined by a pacing function. Intuitively, it becomes more difficult to improve the performance as the global model converges. Therefore, more data should be scheduled for the later rounds to push the performance of the global model.
(3) The difficulty level should be progressively increased: As model training progresses through successive rounds, the average difficulty level of $d_{n,r}$ is expected to increase with the round number $r$. Furthermore, $d_{n,r}$ should be sorted such that the difficulty level of instances is also progressively increasing during local training.
(4) The amount of data learned per round should be optimized: In a typical edge system, latency should be considered a key QoS factor. Therefore, data samples processed per round should be controlled and optimized; otherwise, curriculum learning may not be practical for the network edge.

Optimized Data Scheduling with an Adaptive Window
Next, we introduce our approach based on the above principles. To begin with, we define our score function for reflecting the difficulty of any instance. Motivated by [41], we choose a loss-based measure as the score. Let $y_{n,i,l} \in y_{n,i} = \{y_{n,i,1}, \cdots, y_{n,i,L}\}$ denote the true label indicator of the $i$th instance in $d_n$: $y_{n,i,l} = 1$ if the ground-truth class of the instance is $l$; otherwise, $y_{n,i,l} = 0$. Let $\hat{y}_{n,i,l} \in \hat{y}_{n,i} = \{\hat{y}_{n,i,1}, \cdots, \hat{y}_{n,i,L}\}$ be the probability value predicted by the softmax layer of global model $w_r$. In any round $r$, the difficulty score of the $i$th instance in $d_n$, denoted by $S_{i,n,r} \in S_{n,r} = \{S_{1,n,r}, \cdots, S_{d_n,n,r}\}$, is defined as the loss between $y_{n,i}$ and $\hat{y}_{n,i}$ in (20). A greater $S_{i,n,r}$ indicates a higher difficulty level.

Once $d_n$ is sorted in ascending order according to $S_{n,r}$, the next step is to use a proper pacing function to balance the trade-off between effectiveness and latency. Let $\tilde{d}_{n,r}$ denote the sorted $d_n$ in round $r$. For any round $r$ and smartphone $n$, we assume that $t_r$ is a system parameter and that the latency requirement $T_n$ is known before training. By examining (3), the upper bound of the number of rounds that smartphone $n$ can participate in, denoted by $R_{n,\max}$, is calculated from $T_n$ and $t_r$.

Unlike the pacing functions described in [41], we use a moving window that gradually increases and selects more difficult instances for training. Let $W^{initial}_n$, $W^{start}_{n,r}$ and $W^{end}_{n,r}$ be the initial number of data samples and the start and end indices of the sorted $d_n$ in round $r$, respectively. Considering the marginal utility of data samples, as described in Section 5.1, we propose a moving window, given by Equations (22)-(24), as our pacing function for any round $r$. Note that only a subset of private local data is selected per round, i.e., $d_{n,r} \subset d_n$. Our approach is shown to be effective in Section 7 with improved QoS. Then, the optimized dataset $d^*_{n,r}$ consists of instances selected from $\tilde{d}_{n,r}$, which has been sorted based on difficulty. These instances are chosen from within the moving window defined by Equations (22)-(24).
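The scoring and sorting steps can be sketched as follows. The text only states that the score is loss-based; the sketch assumes the standard cross-entropy between the one-hot label indicator and the softmax output, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def difficulty_score(one_hot: Sequence[int], probs: Sequence[float],
                     eps: float = 1e-12) -> float:
    """ASSUMPTION: cross-entropy loss between label indicator and softmax output."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


def sort_by_difficulty(labels: Sequence[Sequence[int]],
                       predictions: Sequence[Sequence[float]]
                       ) -> Tuple[List[int], List[float]]:
    """Return instance indices ordered from easiest to hardest, plus the scores."""
    scores = [difficulty_score(y, p) for y, p in zip(labels, predictions)]
    order = sorted(range(len(scores)), key=scores.__getitem__)
    return order, scores


# Three toy instances with two classes; predictions come from the global model.
labels = [[1, 0], [0, 1], [1, 0]]
predictions = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
order, scores = sort_by_difficulty(labels, predictions)
```

Here the third instance is predicted confidently and correctly, so it is scheduled first, while the misclassified first instance is treated as the most difficult.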

Algorithm Design for Improved QoS
In this section, we propose our solution to solve (8). We first analyze an optimized participation strategy for every smartphone and then summarize the model aggregation algorithm followed by a curriculum data scheduling algorithm.

An Optimized Participation Strategy at the Network Edge
For any smartphone $n$ that aims to optimize (8), $U_n(w_{R_n})$ is required to be maximized within a limited number of rounds due to (3). As (8) is equivalent to (9), smartphone $n$ can target maximizing the valuation of its data contribution within $T_n$.
In DML with synchronized model aggregation, being denied access to the up-to-date global model even once can be catastrophic. Local training starting from a stale model results in a reduced contribution to the overall model utility [42]. Therefore, any smartphone $n$ should make sure that the accumulated valuation is large enough to access the up-to-date global model; otherwise, $\sum_{r \in \mathcal{R}} V(a_{n,r} d_{n,r})$ will be reduced, since local training based on a stale model is inevitable. As time is limited, every participation opportunity is valuable. Therefore, an optimized strategy is to contribute in every round to accumulate valuations for better model rewards; that is, $a_{n,r} = 1$ in every round $r$.

Private data of any smartphone $n$ are limited. An optimized approach to data usage in DML is to reuse each instance as frequently as possible to maximize its value. However, the reuse of instances might lead to model overfitting on $d_n$ and further reduce the valuation calculated by other smartphones. Therefore, we can use the curriculum data scheduling proposed in Section 5.3 for greater data valuation while not reducing global model utility. More formally, the optimized data scheduling for any smartphone $n$ is described by (26). Intuitively, the proposed moving window ensures that easier instances are only used in the early rounds to help the global model converge fast. In contrast, more difficult instances are used in later rounds to push the near-converged model to better performance.
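The intuition of the moving window can be sketched in a few lines of Python. The window rule below is hypothetical and is not the rule of Equations (22)-(24): it only assumes that the window size grows linearly by `growth` samples per round and that the window slides linearly from the easiest to the hardest end of the sorted data.

```python
from typing import Tuple


def moving_window(num_samples: int, initial_size: int, round_idx: int,
                  max_rounds: int, growth: int = 0) -> Tuple[int, int]:
    """Return (start, end) indices into the difficulty-sorted local dataset.

    HYPOTHETICAL pacing rule: linear growth of the window size and a linear
    slide of the window towards the most difficult instances.
    """
    if not 1 <= round_idx <= max_rounds:
        raise ValueError("round_idx must lie in [1, max_rounds]")
    # 0.0 in the first round, 1.0 in the last round.
    progress = (round_idx - 1) / max(1, max_rounds - 1)
    size = min(num_samples, initial_size + growth * (round_idx - 1))
    end = size + int(progress * (num_samples - size))
    return end - size, end


# 1000 sorted samples, 100 samples in round 1, one extra sample per round.
windows = [moving_window(1000, 100, r, 200, growth=1) for r in range(1, 201)]
```

With these toy settings the first round trains on the 100 easiest instances, and the last round trains on the 299 most difficult ones, so no round processes the whole local dataset.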

Decentralized Algorithm Design
Based on the above analysis, we propose (1) a data scheduling algorithm that helps maximize model utility within a latency constraint (curriculum learning is used in the algorithm with an adaptive moving window) and (2) an aggregation algorithm that better reflects the diminishing marginal contribution of data quantity, to replace the one proposed by FedAvg. Both the valuation and marginal utility of data samples are considered in order to improve the test accuracy of the aggregated model.
We first describe our curriculum data scheduling algorithm as the pseudo-code in Algorithm 1.
Note that calculating $S_{i,n,r}$ is essentially model inference on $d_n$, so its time complexity varies depending on the model structure. The time complexity of the sorting operation for smartphone $n$ is $O(d_n \log d_n)$.
After local training on curriculum data, model weights $W_r$ are aggregated into a global model. Let $w^U_r$ and $w^V_r$ denote the weights aggregated by considering marginal utility and data valuation (which accounts for information loss), respectively. We provide the pseudo-code for weight aggregation using blockchain-based data valuation in Algorithm 2.

1: for instance $(x_{n,i}, y_{n,i}) \in d_n$ do // obtain difficulty score.
2:   Calculate $S_{i,n,r}$ by (20);
3: end for
4: Sort $S_{n,r}$ in ascending order.
5: Obtain $\tilde{d}_{n,r}$ by sorting $d_n$ accordingly.
6: Obtain $d_{n,r}$ according to (26).
Algorithm 2 Proposed aggregation algorithm for round $r$.
Input: Model weights $w_{r-1}$, $W_r$ and data quantity $D_r$. Output: Global weights $w_r$.
1: for smartphone $n \in \mathcal{N}$ do // decentralized data valuation.
2:   Calculate $\Phi_{i,n,r}$ by (…
[…]
   end for
5: end for
6: for server $m \in \mathcal{M}_D$ do // delegated servers conduct model aggregation.
7:   if $m = r + L \pmod{M_D}$ then // a round-robin leader conducts aggregation.
[…]
12:     $V_r \leftarrow \sum_{n \in \mathcal{N}} V_{n,r}$;
13:     $C_r \leftarrow \sum_{n \in \mathcal{N}} C_{n,r}$;
14:     […] $V_r$ $g_{n,r}$; // aggregation by valuation of data samples.
15:     […] $C_r$ $g_{n,r}$; // aggregation by data marginal utility.
[…]

We choose to apply two aggregation protocols separately in every round. Global weights $w_r$ are obtained by comparing the utilities of aggregated weights $w^U_r$ and $w^V_r$, i.e., $\sum_{n \in \mathcal{N}} A(w^U_r, d^{val}_n)$ and $\sum_{n \in \mathcal{N}} A(w^V_r, d^{val}_n)$. Our scheme is shown to be effective in Section 7.
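The selection between the two aggregation protocols can be sketched as below. This is an illustrative reading of the surviving steps of Algorithm 2, not the authors' code: it assumes that each aggregate is a normalized weighted average of the local updates, that non-positive valuations are clipped to zero, and that ties favor the valuation-based aggregate. All names are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def weighted_average(updates: Sequence[Vector], coeffs: Sequence[float]) -> Vector:
    """Normalized weighted average of local updates."""
    total = sum(coeffs)
    if total <= 0:
        raise ValueError("coefficients must have a positive sum")
    dim = len(updates[0])
    return [sum(c * u[k] for c, u in zip(coeffs, updates)) / total
            for k in range(dim)]


def aggregate_round(updates: Sequence[Vector],
                    valuations: Sequence[float],
                    contributions: Sequence[float],
                    validators: Sequence[Callable[[Vector], float]]) -> Vector:
    """Build both aggregates and keep the one with higher summed accuracy."""
    by_utility = weighted_average(updates, contributions)
    clipped = [max(0.0, v) for v in valuations]  # ASSUMPTION: no negative weights
    if sum(clipped) == 0:
        return by_utility                        # nothing was valued this round
    by_valuation = weighted_average(updates, clipped)

    def total_accuracy(weights: Vector) -> float:
        return sum(acc(weights) for acc in validators)

    if total_accuracy(by_valuation) >= total_accuracy(by_utility):
        return by_valuation
    return by_utility


updates = [[1.0, 0.0], [0.0, 1.0]]
validators = [lambda w: w[0]]                    # toy validation accuracy
chosen = aggregate_round(updates, [3.0, 1.0], [1.0, 1.0], validators)
no_free_ride = aggregate_round(updates, [1.0, 0.0], [1.0, 1.0], validators)
```

In the second call the update with zero valuation has no influence on the chosen aggregate, which mirrors how low-valued uploads are suppressed during aggregation.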

Performance Evaluations
In this section, we evaluate the proposed incentive mechanism against free-riders. Additionally, curriculum data scheduling is tested on two datasets: (1) CIFAR-10 images with 10 classes and (2) CIFAR-100 images with 100 classes [43]. We also show the information loss when training a DNN on non-IID human activity recognition (HAR) signals [44]. We aim to show that data valuation and scheduling work on datasets with different scales and distributions.

Experiment Settings and Benchmarks
To simulate real-world data distributions, two kinds of data distribution skews are applied to local datasets. Detailed settings are described as follows:
(1) Label distribution skew: For the HAR dataset, we assume each smartphone owns the same number of samples. Six human activities are labeled: sitting, lying down, walking, going upstairs, going downstairs and standing. A total of 34,440 data samples are assigned to $N = 10$ smartphones according to a Dirichlet distribution [45]. To match real-world data distributions, we set the concentration parameter $\alpha$ to 0.5 [46].
(2) Label and data quantity skew: For the CIFAR-10 and CIFAR-100 datasets, we consider that each smartphone owns at most 4 out of 10 labels for CIFAR-10 and at most 8 out of 100 labels for CIFAR-100.
Note that we do not consider simple cases wherein labels or the quantity of training data are uniformly distributed. We process CIFAR-10 data samples (10 classes, with 6000 images per class) and CIFAR-100 data samples (100 classes, with 600 images per class) to form our synthetic non-IID datasets based on [46]. To be specific, we simulate label skew by assigning a random subset of classes to each smartphone. The number of classes per phone is generated using the function random.randint(). Then, we create quantity skew by using a Dirichlet distribution to allocate different amounts of data for each class to different smartphones. The use of the function numpy.random.dirichlet() results in non-uniform data quantities across smartphones (https://github.com/IBM/probabilistic-federated-neural-matching/blob/master/experiment.py accessed on 30 June 2024). To simulate different degrees of skew, we set $\alpha = 10$ and $\alpha = 0.5$ for CIFAR-10 and CIFAR-100, respectively.
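A minimal sketch of the label and quantity skew described above is given next. It is not the referenced partitioning script: it uses only the Python standard library, draws Dirichlet shares from normalized gamma variates instead of calling numpy.random.dirichlet(), and all function names and the toy label list are assumptions for illustration.

```python
import random
from typing import Dict, List, Set, Tuple


def dirichlet(alpha: float, size: int, rng: random.Random) -> List[float]:
    """Symmetric Dirichlet sample built from normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def partition(labels: List[int], num_devices: int, num_classes: int,
              max_classes: int, alpha: float, seed: int = 0
              ) -> Tuple[List[List[int]], List[Set[int]]]:
    """Split sample indices across devices with label and quantity skew."""
    rng = random.Random(seed)
    # Label skew: each device owns a random subset of at most max_classes labels.
    owned = [set(rng.sample(range(num_classes), rng.randint(1, max_classes)))
             for _ in range(num_devices)]
    by_class: Dict[int, List[int]] = {c: [] for c in range(num_classes)}
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    shards: List[List[int]] = [[] for _ in range(num_devices)]
    for c, indices in by_class.items():
        owners = [n for n in range(num_devices) if c in owned[n]]
        if not owners:
            continue  # no device drew this class in the toy split
        rng.shuffle(indices)
        # Quantity skew: Dirichlet shares decide how much each owner receives.
        shares = dirichlet(alpha, len(owners), rng)
        start, cumulative = 0, 0.0
        for k, (owner, share) in enumerate(zip(owners, shares)):
            cumulative += share
            last = k == len(owners) - 1
            end = len(indices) if last else int(cumulative * len(indices))
            shards[owner].extend(indices[start:end])
            start = end
    return shards, owned


toy_labels = [i % 10 for i in range(1000)]  # 10 balanced toy classes
shards, owned = partition(toy_labels, num_devices=10, num_classes=10,
                          max_classes=4, alpha=0.5, seed=1)
```

A smaller $\alpha$ concentrates each class on fewer devices, so the local datasets become more heterogeneous in size as well as in label coverage.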
We consider a fixed $T_n$ for smartphones (i.e., $R_{n,\max} = 200$) and $\lambda = 1$ in experiments. For each smartphone, data samples are shuffled according to a discrete uniform distribution. Then, 70% of the 60,000 samples are used for training, and 15% each are used for validation and testing. For benchmarks and model training, we follow similar parameter settings as the open-source code base (https://github.com/CharlieDinh/pFedMe accessed on 30 June 2024). To be specific, we set the local learning rate $\eta_r$ to 0.005 for mini-batch SGD. The batch size and local iterations are set to 64 and 10, respectively.
In our experiments, we use a DNN for classification. To be specific, the DNN defined in [47] is used for the HAR dataset. For the CIFAR datasets, we use Swin Transformer V2 [48] trained on ImageNet as the pre-trained model. Model heads are three-layer DNNs with input size 768, middle dimensions 500 and 100, and output size 10 for CIFAR-10 and 100 for CIFAR-100. We perform DML using PyTorch [49] version 2.3.0+computecanada. To run the simulation, an NVIDIA V100L GPU is used, and 24 CPU cores and 180 gigabytes of RAM are allocated.
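As a rough worked example of why only the model head is exchanged, the number of trainable head parameters can be counted from the stated layer sizes. The sketch assumes three fully connected layers with bias terms and nothing else trainable in the head; the function names are hypothetical.

```python
from typing import Sequence


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer (ASSUMED to have a bias)."""
    return in_features * out_features + out_features


def head_params(num_classes: int, in_features: int = 768,
                hidden: Sequence[int] = (500, 100)) -> int:
    """Trainable parameters of a 768 -> 500 -> 100 -> num_classes head."""
    dims = [in_features, *hidden, num_classes]
    return sum(linear_params(a, b) for a, b in zip(dims, dims[1:]))


cifar10_head = head_params(10)    # 435,610 parameters under these assumptions
cifar100_head = head_params(100)  # 444,700 parameters under these assumptions
```

Under these assumptions each smartphone uploads fewer than half a million parameters per round, regardless of the size of the frozen feature extractor.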
To assess our approaches on heterogeneous datasets, seven benchmarks are considered in our experiments: namely, FedProx [50], pFedMe [51], PerAvg [52], Curricula [41], Anti-Curricula, Random Weights and Additive Noise [7,8]. FedProx uses a regularization term between local and global models to mitigate deviations. Similarly, pFedMe uses Moreau envelopes as regularized loss functions. PerAvg is a meta-learning approach to handle the data heterogeneity problem. The superiority of our data scheduling is shown by comparing it with the state-of-the-art (SOTA) Curricula, for which our proposed adaptive window is missing. Curricula uses 20% of $d_n$ in round 1, linearly increases to 100% at round $0.8 R_{n,\max}$, and maintains the quantity thereafter. Additionally, we use Anti-Curricula as a benchmark, wherein instances are learned from difficult ones to easier ones. To ensure consistency across experiments, FedAvg [53] is employed as the standard aggregation protocol for both the proposed data scheduling method and the benchmark scheduling algorithms.
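The pacing of the Curricula benchmark described above can be written down directly. The sketch below follows the stated schedule (20% in round 1, linear growth to 100% at round $0.8 R_{n,\max}$, constant afterwards); the function name and parameter names are hypothetical.

```python
def curricula_fraction(round_idx: int, max_rounds: int,
                       start_fraction: float = 0.2,
                       ramp_end: float = 0.8) -> float:
    """Fraction of the sorted local dataset used by the benchmark in a round."""
    if round_idx < 1:
        raise ValueError("rounds are numbered from 1")
    ramp_round = ramp_end * max_rounds
    if round_idx >= ramp_round:
        return 1.0                       # full dataset after the ramp
    progress = (round_idx - 1) / (ramp_round - 1)
    return start_fraction + (1.0 - start_fraction) * progress


# With R_max = 200, the ramp ends at round 160.
schedule = [curricula_fraction(r, 200) for r in range(1, 201)]
```

Because this schedule reaches the full local dataset at 80% of the rounds, every later round processes all samples, which is the source of the extra per-round latency discussed in the results.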
Finally, to show the effectiveness of our incentive mechanism as a defense against free-riding attacks, we use random weights and local weights from other models with additive Gaussian noise to simulate attackers. A total of 10% of the smartphones are considered as free-riders in our system. Specifically, the Random Weights attack generates a new model with weights sampled from a normal distribution based on the statistics of the current global model weights. The mean of each new weight is set to the mean of the corresponding global weights, while the standard deviation is scaled to 1% of the original standard deviation. The Additive Noise attack adds Gaussian noise to the aggregated model of the local models submitted by others in the current round, where the noise is drawn from a normal distribution with mean 0 and a standard deviation set to 10% of the parameter's current standard deviation. Both attacks aim to simulate different strategies that malicious participants might employ to free-ride in the DML system without meaningful contributions.
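The two simulated attacks can be sketched with the standard library as follows. This is an illustration under stated assumptions rather than the experimental code: a model is a flat list of floats treated as a single parameter tensor, the aggregated model of the other devices is their plain element-wise average, and the function names are hypothetical.

```python
import random
import statistics
from typing import List, Optional, Sequence


def random_weights_attack(global_weights: Sequence[float],
                          std_scale: float = 0.01,
                          rng: Optional[random.Random] = None) -> List[float]:
    """Sample fake weights around the mean of the current global weights."""
    rng = rng or random.Random()
    mean = statistics.fmean(global_weights)
    std = statistics.pstdev(global_weights)
    # Standard deviation is scaled to 1% of the original by default.
    return [rng.gauss(mean, std_scale * std) for _ in global_weights]


def additive_noise_attack(others: Sequence[Sequence[float]],
                          std_scale: float = 0.10,
                          rng: Optional[random.Random] = None) -> List[float]:
    """Average the other devices' uploads and add zero-mean Gaussian noise."""
    rng = rng or random.Random()
    dim = len(others[0])
    # ASSUMPTION: the aggregated model is a plain element-wise average.
    averaged = [statistics.fmean(w[k] for w in others) for k in range(dim)]
    std = statistics.pstdev(averaged)
    # Noise standard deviation is 10% of the parameters' spread by default.
    return [v + rng.gauss(0.0, std_scale * std) for v in averaged]


rng = random.Random(0)
fake_random = random_weights_attack([0.1, -0.3, 0.5, 0.2], rng=rng)
fake_noisy = additive_noise_attack([[1.0, 3.0], [3.0, 5.0]], rng=rng)
```

Neither attack uses any local training data, so the uploaded weights are not expected to raise the validation accuracy observed by honest devices.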

Results and Discussion
We evaluate our blockchain-based approaches from different perspectives, with a focus on heterogeneous data environments at the network edge.

QoS Improvement with Heterogeneous Data
In an edge intelligence system utilizing DML, QoS is primarily determined by two factors: training latency and model test accuracy. An improvement in QoS is achieved either by reducing the time required to produce a deployable global model or by attaining a global model with higher test accuracy within the specified latency constraints.
Figure 5 shows a training latency reduction by the proposed Algorithms 1 and 2. The test accuracy is calculated by averaging global model test accuracy values across 20 smartphones. An observation related to the CIFAR-10 dataset is that the global model converges smoothly and quickly to 90% accuracy using our approach. Latency is reduced by more than 25% compared with benchmarks. However, an additional observation is that the test accuracy of our proposed protocol and FedProx is relatively close when $\alpha = 10$. Therefore, our solution may not boost test accuracy significantly when local data distributions are more homogeneous. Figure 6 demonstrates a more challenging scenario using the CIFAR-100 dataset, where the global model is trained within a time constraint $T_n$ to classify testing instances into 100 distinct categories. A notable improvement in test accuracy is achieved by our approach, while the benchmarks are below 50% accuracy and are thus not usable. Therefore, the model utility is shown to be enhanced. We also remind the reader that it is very challenging to reach a test accuracy close to 100% on CIFAR-100 in heterogeneous networks where $\alpha = 0.5$. Although model design is not the focus of this paper, an advanced model structure can be considered atop our solution for improved model utility.
Compared with the benchmarks, our proposed Algorithm 2 utilizes decentralized data valuation to assess actual contributions. Therefore, local weights trained from a dataset that closely resembles an IID dataset have larger summation weights in aggregation. As mini-batch SGD is designed for training on IID data, our proposed scheme can make mini-batch SGD function well under non-IID data settings.

Effectiveness of Data Scheduling at the Network Edge
To evaluate data scheduling at the network edge, we first illustrate the information loss and marginal utility of non-IID HAR data samples in Figure 7. As class concentration parameter $\alpha$ increases, local data distributions become more homogeneous. As mini-batch SGD performs well on IID datasets, data quality $Q_{n,r}$ increases in Figure 7a. Furthermore, the diminishing effect of data marginal utility is shown in Figure 7b. An interesting observation is that the prediction accuracy curves in Section 7.2.1 reflect the marginal utility. Therefore, (19) as derived can approximate the real diminishing formula of the data marginal utility effectively.
Figure 8 shows that our proposed data scheduling can achieve slightly better performance than SOTA Curricula when all three methods use FedAvg as the model aggregation protocol. Algorithm 1 and Curricula achieve 90% test accuracy on CIFAR-10 images in similar time, while Anti-Curricula fails to reach the desired 90% accuracy threshold. Although Curricula achieves similar performance to our design, it uses a significantly greater number of data samples per round, thereby increasing training latency. Figure 9 shows the average training latency of each smartphone during DML. Algorithm 1 is observed to reduce the overall training latency by 30% and achieve approximately 50% latency reduction in later rounds. Figures 8 and 9 jointly show the superiority of our data scheduling algorithm.

Robustness against Free-Riding Attacks
Our proposed time-dependent incentive mechanism aims to track the incremental contribution of each smartphone. Note that Algorithm 2 takes data valuation as the factor in model aggregation. With our design, contributors with low or zero data valuations will have minimal impact on the aggregated global model. We further illustrate the incremental valuation of honest contributors and attackers to show the superiority of our decentralized data valuation.
Figure 10 shows that the QoS of classifying CIFAR-10 images is not changed with or without free-riding attacks. The resilience against free-riders is a desired security feature in collaborative edge intelligence, where fairness must be guaranteed. Furthermore, Figure 11 shows the cumulative data valuation of honest contributors without an attack and with free-riders conducting an attack. As attackers do not contribute to the validation accuracy of honest contributors, the data valuation of attackers remains zero. In Figure 11, free-riders try to attack in every round. As the cumulative data valuation of honest contributors increases, it becomes increasingly difficult to mimic an honest contributor and achieve a successful attack. The observation and analysis jointly explain why the QoS of our system does not change under free-riding attacks.

Conclusions
In this paper, a time-dependent and decentralized data valuation approach has been proposed to improve the QoS of collaborative edge intelligence and defend against free-riding attacks. By considering information loss and diminishing marginal utility of data, a robust aggregation algorithm has been proposed for improving DML on non-IID data. Based on experimental results, we have shown that the QoS of edge intelligence can be improved by decentralized device collaboration and curriculum data scheduling at the network edge. We have improved DML by evaluating and understanding data rather than designing finely tuned models. Our blockchain-enabled data-centric method has been shown to be simple yet effective at improving the fairness and performance of DML. Based on experiments and observations, we conclude that decentralization of data valuation with scheduling is a promising approach towards collaborative edge intelligence.
Despite the advantages of our approach, we remind the reader that maintaining local validation data on every smartphone is still required for cross-validation. For future work, we will study a data-free valuation framework. Additionally, the privacy issue of uploading model weights is not the focus of this paper. Therefore, we will also study how to improve the differential privacy of honest contributors in a blockchain network.
r }. $N$ sets of local weights are uploaded to $M$ servers for model aggregation. After receiving $N$ sets of local weights, server $m$, $m \in \mathcal{M} = \{1, 2, \cdots, M\}$, executes model aggregation to obtain global weights, represented by $w_r$.

Figure 2. Pre-trained model for feature extraction: only parameters for the model head are trainable and exchangeable in order to reduce the computational and communication overheads at the network edge.

Figure 3. Proposed time-dependent incentive mechanism atop blockchain: users gain access to varying sets of global models as rewards in each round, depending on their diverse contributions; free-riders have very limited or no access to global models recorded on the blockchain.
} denote votes received by server $m$ in round $r$. Let $M_D$ denote the number of delegated servers in DML. A detailed block generation workflow is described in Section 4. Note that the recorded data valuation can be verified by evaluating the prediction accuracy of local weights. Further discussions on blockchain-based data valuation are given in Section 4.

(2) Fairness: Two data contributors with identical data contributions should have the same valuation, i.e., $V(d_{i,r}) = V(d_{j,r})$ if datasets $d_{i,r}$ and $d_{j,r}$ are identical; a free-rider $n$ with zero $\Phi_{i,n,r}$ for all other $N - 1$ smartphones has zero valuation, i.e., $V(d_{n,r}) = V(\emptyset) = 0$.
(3) Additivity: In any round $r$, the data valuation of multiple data contributors equals the sum of the data valuations of individual data contributors, i.e., $V(d_{1,r}) + V(d_{2,r}) = V(d_{1,r} + d_{2,r})$.

(1) Score functions should depend on the global model: Any instance in $d_n$ is mapped to a numerical value by a score function. As (8) aims to maximize the utility of the global model, this score function should solely rely on the global model.

Algorithm 1 Proposed curriculum data scheduling for any smartphone $n$ in round $r$.
Input: Global model weights $w_r$, private local dataset $d_n$ and latency requirement $T_n$. Output: Local training dataset for the current round $d_{n,r}$.

Figure 5. Training latency reduction: proposed approach shows more than 25% faster convergence speed to reach 90% test accuracy on CIFAR-10 dataset.

Figure 6. Model utility enhancement: proposed approach shows more than 10% improvement in test accuracy on CIFAR-100 dataset.

Figure 7. Information loss and marginal utility of heterogeneous HAR datasets: (a) information loss of label skews; (b) marginal utility as a function of data quantity.

Figure 8. Similar performance with SOTA Curricula: Algorithm 1 and Curricula reach 90% test accuracy faster than Anti-Curricula.

Figure 9. Reduced per-round training latency: our adaptive moving window is applied to both Algorithm 1 and Anti-Curricula to reduce overall training latency by about 30% during DML.

Figure 10. Superior resilience against free-riding attacks: free-riders that conduct random weights and additive noise attacks do not harm the QoS of collaborative edge intelligence.

Figure 11.Distinct data valuation helps discriminate attackers: decentralized valuation of data contribution shows superior discriminative performance.