Article

CIM-LP: A Credibility-Aware Incentive Mechanism Based on Long Short-Term Memory and Proximal Policy Optimization for Mobile Crowdsensing

by Sijia Mu * and Huahong Ma
School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(16), 3233; https://doi.org/10.3390/electronics14163233
Submission received: 20 June 2025 / Revised: 10 August 2025 / Accepted: 12 August 2025 / Published: 14 August 2025

Abstract

In the field of mobile crowdsensing (MCS), a large number of tasks rely on the participation of ordinary mobile device users for data collection and processing. This model has shown great potential for applications in environmental monitoring, traffic management, public safety, and other areas. However, the enthusiasm of participants and the quality of uploaded data directly affect the reliability and practical value of the sensing results. Therefore, the design of incentive mechanisms has become a core issue in driving the healthy operation of MCS. The existing research, when optimizing long-term utility rewards for participants, has often failed to fully consider dynamic changes in trustworthiness. It has typically relied on historical data from a single point in time, overlooking the long-term dependencies in the time series, which results in suboptimal decision-making and limits the overall efficiency and fairness of sensing tasks. To address this issue, a credibility-aware incentive mechanism based on long short-term memory and proximal policy optimization (CIM-LP) is proposed. The mechanism employs a Markov decision process (MDP) model to describe the decision-making process of the participants. Without access to global information, an incentive model combining long short-term memory (LSTM) networks and proximal policy optimization (PPO), collectively referred to as LSTM-PPO, is utilized to formulate the most reasonable and effective sensing duration strategy for each participant, aiming to maximize the utility reward. After task completion, the participants’ credibility is dynamically updated by evaluating the quality of the uploaded data, which then adjusts their utility rewards for the next phase. Simulation results based on real datasets show that compared with several existing incentive algorithms, the CIM-LP mechanism increases the average utility of the participants by 6.56% to 112.76% and the task completion rate by 16.25% to 128.71%, demonstrating its significant advantages in improving data quality and task completion efficiency.

1. Introduction

Mobile crowdsensing (MCS) is an emerging paradigm that leverages smart mobile devices spread across large areas to perform large-scale sensing in dynamic and complex physical environments [1]. MCS enables the intelligent analysis of massive crowdsourced data and provides real-time feedback to participants. This facilitates the continuous development of collective intelligence and supports integrated decision-making in diverse scenarios. Compared with traditional methods, MCS systems exhibit advantages such as multi-source heterogeneous sensing data, extensive and uniform coverage, powerful scalability, and multifunctionality. These characteristics have contributed to the widespread application of MCS in various fields, including smart city construction, intelligent traffic management [2], environmental monitoring [3], noise detection, infrastructure development, and public safety [4]. Given its broad applicability and real-time responsiveness, MCS has become a key technology for supporting modern urban systems and data-driven services.
Although MCS systems exhibit considerable potential for future development, their practical effectiveness depends on the availability of high-quality sensing data from participants. However, in the absence of adequate incentive mechanisms, participants are often reluctant to submit high-quality data. This reluctance arises from the fact that, on the one hand, participants need to invest considerable time and physical effort into the sensing process; on the other hand, such tasks also lead to additional energy consumption on their smart devices. Therefore, a key challenge lies in designing incentive mechanisms that effectively optimize participants’ utility, thereby promoting sustained participation and enhancing the quality of the collected data.
Within the study of utility-optimized incentive mechanisms, existing research has indicated that participants can derive utility from sensing data contributed by others. This indirect benefit is regarded as social utility, which enhances participants’ willingness to perform sensing tasks [5]. For example, on an intelligent transportation platform, vehicles have the opportunity to form virtual vehicle networks [6]. Drivers can automatically join the virtual network and obtain traffic information collected by others, thereby improving their driving routes and avoiding traffic congestion. This peer-assisted mode based on social networks enhances drivers’ willingness to share, thereby creating effective “social network utility” [7]. Therefore, this paper explores improving utility rewards through social utility to maintain participant engagement and ensure high-quality data submission.
Moreover, the maximization of participants’ utility rewards depends on their sensing strategies, i.e., the time participants spend on sensing tasks. Thus, the maximization of utility rewards can be reduced to a series of strategic decisions made by participants regarding sensing duration. Previous optimization problems have typically used game theory models to refine the participants’ strategies and encourage their active participation. For example, in the case of fixed rewards and individual resource constraints, the incentive problem of maximizing the participants’ utility rewards has been modeled as a non-cooperative game [8,9]. However, as the decision environment becomes more complex, the strategy space may expand rapidly, making it difficult to find feasible solutions. To address this issue, some studies have started to use deep reinforcement learning (DRL) techniques to optimize decisions [10]. Since DRL does not require global information, it alleviates concerns about personal privacy leakages to some extent. However, existing DRL studies have often overlooked the long-term dependencies in time-series data during the optimization process. This can lead to suboptimal decision-making, affecting the participants’ utility rewards and reducing their motivation to participate.
Furthermore, to ensure high-quality sensing data and effectively incentivize user participation, several credibility-based incentive mechanisms have been proposed. For instance, a staged incentive and punishment mechanism updates credibility scores based on the utility value, thereby encouraging participants to submit high-quality data [11]. Another mechanism introduces a 'participant performance index' to evaluate participants' task execution capabilities more clearly; this index is then applied to the credibility feedback mechanism [12]. Additionally, credibility-based incentive mechanisms comprehensively consider factors such as participation willingness, data quality, and rewards to assess the likelihood of high-quality data submissions [13]. However, these mechanisms fail to adequately account for dynamic changes in credibility over time in the long-term incentive process. Specifically, participants' credibility may change over time, but this change has not been sufficiently incorporated into the optimization of the utility rewards.
Although existing incentive mechanisms provide a foundation for enhancing participant motivation and data quality, they often overlook the dynamic nature of credibility and the long-term dependencies in time-series data. This leads to challenges in effectively motivating participants while optimizing long-term utility rewards. These limitations make it difficult for the current mechanisms to adapt to complex and ever-changing environments, thus impacting data quality and the long-term sustainability of the system. To address this issue, we propose a credibility-aware incentive mechanism based on long short-term memory and proximal policy optimization (CIM-LP). This mechanism continuously updates participants’ credibility by accumulating quality detection results and incorporates it as one of the indicators for utility rewards. Based on this, CIM-LP uses long short-term memory (LSTM) networks to capture the long-term dependencies and, using proximal policy optimization (PPO), derives the optimal sensing duration strategy for each participant, thereby effectively improving their utility rewards and task completion rates.
In summary, the main contributions of this paper are as follows:
  • A credibility update method incorporating truth discovery algorithms is proposed to quantitatively assess the reliability of uploaded data and feed back the evaluation results into participants’ statuses and reward allocation. In experiments with real datasets, the cumulative utility rewards of the group with high initial credibility reached 550.25 by the 10th round, which was 481.85 higher than those of the low-credibility group, and the average credibility improved by 0.282, effectively guiding the participants to continuously provide high-quality data.
  • A reinforcement learning model combining LSTM and PPO is proposed to enable the participants to learn adaptive sensing strategies in partially observable environments, making full use of historical time-series data for decision optimization. After 500 training rounds, the average utility of CIM-LP converged to approximately 4.50, surpassing that of PPO-DSIM (3.65) and RLPM (2.75) by 23.29% and 63.64%, respectively. This advantage stems from LSTM’s ability to capture time-dependent features and PPO’s stability in policy updates.
  • The credibility update mechanism is integrated with the LSTM-PPO model into a unified CIM-LP framework, enabling long-term utility maximization in complex dynamic environments. In experiments with varying participant numbers, CIM-LP’s average utility was higher by 0.75, 1.86, and 2.45 compared to that of PPO-DSIM, RLPM, and GSIM-SPD, respectively. The task completion rate reached a maximum of 91.13%, which was a 5.2% improvement over the second best method. In energy consumption performance tests, CIM-LP retained usable energy after 50 time slots, while GSIM-SPD depleted its energy after 30 time slots, highlighting its stability and efficiency in energy management.
The subsequent sections of this paper are organized as follows. Section 2 reviews the related work on existing incentive mechanisms, Section 3 presents the model and assumptions, and Section 4 describes the CIM-LP mechanism in detail. Section 5 shows the simulation experiments and the analysis of the results. Finally, Section 6 provides a summary of this paper.

2. Related Work

In mobile crowdsensing, a well-designed incentive mechanism is key to ensuring the efficient and reliable completion of sensing tasks. With the continuous advancement of technology and the deepening of research, various incentive strategies have been proposed to stimulate participants’ enthusiasm and ensure the quality and privacy security of data. This section provides a brief overview of the relevant work on existing incentive mechanisms.
Existing incentive studies have generally assumed that high-credibility participants will provide high-quality sensing data. For example, Liu et al. [14] proposed an incentive mechanism based on blockchain smart contracts and credibility assessments which ensures the quality of data resources by comprehensively evaluating the static credibility, dynamic credibility, and incentive credibility of nodes. Similarly, Fu et al. [15] have developed a blockchain-based framework that integrates credibility assessments and smart contracts to dynamically select credible participants, thereby ensuring data quality in crowdsourcing systems and effectively preventing malicious behavior. Zhang et al. [13] have designed a credibility-based incentive mechanism that determines the credibility value of participants by evaluating their willingness to participate, the quality of their data, and the rewards provided, thus reflecting their potential to submit high-quality sensing data. However, while these studies aim to improve data quality through credibility-based incentives, they typically do not assess the credibility value by directly measuring data quality outcomes, which may lead to the credibility-based data quality incentives being less reliable. Additionally, Sun et al. [16] have proposed a contract-based, personalized, privacy-preserving incentive mechanism called Paris-TD for truth value discovery in MCS systems. Wang et al. [17] have introduced a reverse-auction-based incentive mechanism (TVD-RA) that enhances the data quality by dynamically updating participants’ trust levels. While some incentive mechanisms employ truth discovery technology to detect data quality and aim to achieve more effective incentive effects, they fail to effectively combine credibility and quality detection and do not consider the impact of dynamic changes in credibility on the utility rewards.
In mobile crowdsensing systems, social networks have played a crucial role in the design of incentive mechanisms. By promoting information sharing and interaction among participants, social networks have not only significantly increased participants' willingness to engage but also improved data quality. This phenomenon is referred to as the social network effect, and it has provided new opportunities for designing incentive mechanisms. For example, Shi et al. [18] have shown that leveraging social influences in social networks can significantly enhance participant engagement and data quality while simultaneously reducing the platform's incentive costs. Guo et al. [19] have proposed a multi-task diffusion model that analyzes information propagation in social networks, demonstrating how strategic diffusion can increase task participation. Similarly, Wang et al. [20] have addressed the cold-start problem through social networks, showing that recruiting users via social ties can effectively optimize task execution and resource allocation. Li et al. [21] have proposed a Stackelberg-game-based framework that optimizes social reciprocity, multi-mechanism data quality control, and delay-aware resource allocation, thereby enhancing user engagement and requester utility. These studies demonstrate that social networks have a significant impact on improving user engagement and data quality, reducing costs, and increasing efficiency. Therefore, this paper incorporates the concept of social networks into the design of the incentive mechanism to further enhance participant engagement and data quality. However, while social networks bring many benefits, they also raise concerns regarding breaches of participant privacy. Hence, future incentive mechanism designs must carefully consider how to balance social network benefits with privacy protection.
With the advancement of deep reinforcement learning (DRL), it has become a focal point of research. DRL and its derivative techniques have proven to be effective in protecting participant privacy and optimizing decision-making strategies without requiring complete information about the participants. For instance, Zhan et al. [10] explored how DRL can be used to determine the optimal pricing and training strategies in situations where edge node information is incomplete. Xu et al. [22] have proposed a DRL-based Stackelberg game mechanism that optimizes both the data freshness (through Age of Information guarantees) and social welfare without prior knowledge of the utility parameters, employing TD3 to learn near-optimal strategies. Similarly, Liu et al. [23] have described how to model a multi-leader, multi-follower Stackelberg game among users with a dual deep Q-network (DDDQN) to learn the optimal payment and sensing contribution strategies. Zhao et al. [24] have proposed a DRL-based, socially aware incentive mechanism for vehicular crowdsensing that optimizes the sensing strategies while accounting for social dependencies, demonstrating significant utility improvements in dynamic environments. Furthermore, Zhao et al. [25] have designed a deep reinforcement learning socially aware incentive mechanism (DRL-SIM) in vehicular social networks which simultaneously protects privacy and captures social dependencies, enabling the derivation of near-optimal sensing strategies. These studies have demonstrated the effectiveness of DRL in protecting privacy and optimizing the incentive strategies. They also indicate that DRL has played and will continue to play an increasingly important role in the design and implementation of future incentive mechanisms.
However, the existing incentive mechanisms have several limitations and challenges, such as relying solely on information from the previous time step and fixed credibility, which cannot adapt to dynamic changes in factors such as the task rewards, resource availability, and credibility. This results in poor decision-making optimization. Therefore, this paper proposes the CIM-LP mechanism, which improves decision-making processes to enhance the participants’ utility rewards.

3. Models and Assumptions

This section will provide a detailed description of the system model, the utility model, the problem formulation, and the assumptions proposed in this paper.

3.1. The System Model

The mobile crowdsensing (MCS) system consists of two main components: the sensing platform and the user platform. The requester on the user platform submits specific service requirements to the sensing platform. Based on the available resources and task requirements, the sensing platform initiates a recruitment process to gather appropriate sensing data from participants on the user platform. Once the participants upload their data, the sensing platform provides rewards corresponding to their efforts. A model of the MCS system covered in this paper is shown in Figure 1, and the specific processes are as follows:
(1)
The requester posts a task on the sensing platform;
(2)
The sensing platform distributes tasks to potential participants in the social network;
(3)
Participants will actively decide whether to accept the sensing task, and those who do will upload their willingness to participate;
(4)
The sensing platform selects an optimal group of participants and provides supporting information to help them complete the task more effectively;
(5)
Participants collect and upload sensing data;
(6)
The sensing platform distributes rewards based on the participants’ effort levels;
(7)
The sensing platform integrates sensing data and sends it to the requester.
For each time period $t \in \{1, 2, \ldots, T\}$, the sensing platform publishes a sensing task $j$ with a fixed reward budget $r_j^t$, and the participants share the rewards of the task. To ensure data quality, each task $j$ has a quality detection threshold $\vartheta_j^t$, which is used to check whether the data quality provided by a participant meets the required standard and to update the participant's credibility. Over the whole sensing process $T$, suppose the platform has a collection of participants involved in the sensing task, denoted as $A = \{a_1, \ldots, a_i, \ldots, a_n\}$. At each time period $t$, each participant $a_i$ has a credibility value $q_i^t$ that reflects the likelihood of their submitting high-quality data and affects their own utility benefits. Before participating in sensing task $j$, each participant $a_i$ autonomously chooses a sensing strategy that maximizes their utility based on the reward for period $t$. In this paper, the sensing strategy is simplified as the duration of time that participant $a_i$ spends collecting sensing data, denoted as $x_i^t$. Additionally, the set of sensing durations of all participants is defined as $X^t = \{x_1^t, \ldots, x_i^t, \ldots, x_n^t\}$, and the set $X_{-i}^t$ represents the sensing durations of all participants except $a_i$.
Relationships in social networks are represented in this paper as undirected graphs, and the social closeness between participants is represented by a matrix $G = [g_{ik}]_{n \times n}$, where $n$ denotes the number of participants and $g_{ik} \in (0, 1)$ indicates the social closeness between participant $a_i$ and participant $a_k$. Social closeness is computed using the Jaccard similarity [26], as shown in Equation (1):
$$g_{ik} = \frac{|Set_i \cap Set_k|}{|Set_i \cup Set_k|}$$
Here, $Set_i$ and $Set_k$ represent the sets of neighbors connected to participants $a_i$ and $a_k$, respectively; $|Set_i \cap Set_k|$ is the size of their intersection, and $|Set_i \cup Set_k|$ is the size of their union. The average value of all elements in the matrix $G$ is defined as $\mu$, which measures the average social closeness across the entire network. A larger value of $\mu$ indicates tighter connections between participants in the graph.
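As a concrete illustration, the following minimal Python sketch (not part of the original paper; the helper name and toy data are hypothetical) computes the Jaccard-based social closeness matrix $G$ of Equation (1) and its average $\mu$ from an adjacency structure such as the one derived from the Gowalla edge list in Section 5.1:

```python
import numpy as np

def social_closeness_matrix(neighbors):
    """Compute G = [g_ik] from a dict: participant id -> set of neighbor ids (Equation (1))."""
    ids = sorted(neighbors)
    n = len(ids)
    G = np.zeros((n, n))
    for i, a in enumerate(ids):
        for k, b in enumerate(ids):
            if i == k:
                continue
            union = neighbors[a] | neighbors[b]
            if union:
                G[i, k] = len(neighbors[a] & neighbors[b]) / len(union)
    return G

# Toy example with four participants and hypothetical friendship ties
neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
G = social_closeness_matrix(neighbors)
mu = G[~np.eye(len(G), dtype=bool)].mean()  # average social closeness mu over off-diagonal entries
print(G.round(2), round(mu, 2))
```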
Before providing a detailed description, the main symbols used in this paper are presented in Table 1.

3.2. The Utility Model and Problem Formulation

In this paper, the utility of a participant is composed of monetary utility $u_{i,t}^{mon}$ and social utility $u_{i,t}^{soc}$.

3.2.1. Monetary Utility

When participant $a_i$ receives a sensing task $j$ during time period $t$, he/she determines a sensing duration $x_i^t$ in which to complete the sensing task. Throughout the sensing process, the participant inevitably incurs certain costs, such as energy consumption. In this paper, the energy consumption cost $c(x_i^t)$ of participant $a_i$ is defined as a strongly convex quadratic function [27], as shown in Equation (2):
$$c(x_i^t) = \eta_{i,1}(x_i^t)^2 + \eta_{i,2} x_i^t + \eta_{i,3}$$
Here, $\eta_{i,1}$ ($\eta_{i,1} \geq 0$), $\eta_{i,2}$ ($\eta_{i,2} > 0$), and $\eta_{i,3}$ ($\eta_{i,3} > 0$) are predefined parameters associated with the participant and the sensing task. They represent, respectively, the rate at which the energy cost grows with the sensing duration, the energy cost directly proportional to the sensing duration, and the basic energy cost.
Participant $a_i$ is rewarded by the platform after spending time $x_i^t$ completing the sensing task and uploading data to the sensing platform. The monetary reward is estimated based on the participant's credibility and their proportion of the total sensing duration. Thus, the reward obtained by participant $a_i$ for completing sensing task $j$ in time period $t$ is given by Equation (3):
$$I(x_i^t, X_{-i}^t) = \frac{x_i^t q_i^t}{\sum_{k=1}^{n} x_k^t q_k^t}\, r_j^t$$
It is noteworthy that sensing tasks need to be completed collaboratively by multiple participants, and the rewards are shared among all participants completing the same task, distributed according to each individual's sensing duration and data quality. In particular, since the data quality is usually unknown in advance, quality-detection-oriented credibility is used to estimate it before a task starts; this credibility is updated in real time based on the quality of each participant's sensing data in each round and serves as a reference for the next round.
Thus, the monetary utility of a participant comprises the monetary reward received and the cost incurred by the participant, as shown in Equation (4):
$$u_{i,t}^{mon}(x_i^t, X_{-i}^t) = I(x_i^t, X_{-i}^t) - c(x_i^t)$$

3.2.2. Social Utility

The social utility of participant $a_i$ depends on the closeness of the ties between participants in the social network, as shown in Equation (5):
$$u_{i,t}^{soc}(x_i^t, X_{-i}^t) = \sum_{k=1}^{n} x_i^t g_{ik} x_k^t$$
Here, $g_{ik}$ denotes the social closeness between participants $a_i$ and $a_k$, while $x_i^t$ and $x_k^t$ represent their respective sensing durations.

3.2.3. Participant Utility

In summary, the utility of participant $a_i$ in completing sensing task $j$ during time period $t$ is expressed by Equation (6):
$$u_i^t(x_i^t, X_{-i}^t) = u_{i,t}^{mon}(x_i^t, X_{-i}^t) + u_{i,t}^{soc}(x_i^t, X_{-i}^t)$$
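To make the utility model concrete, the following minimal Python sketch (an illustration only, with hypothetical values for the cost coefficients $\eta_{i,1}$, $\eta_{i,2}$, $\eta_{i,3}$, the reward budget $r_j^t$, and the closeness matrix $G$) evaluates Equations (2)-(6) for one participant:

```python
import numpy as np

def monetary_utility(i, x, q, r_task, eta):
    """Equation (4): reward share (Eq. (3)) minus quadratic energy cost (Eq. (2))."""
    reward = (x[i] * q[i]) / np.sum(x * q) * r_task        # Eq. (3)
    cost = eta[0] * x[i] ** 2 + eta[1] * x[i] + eta[2]      # Eq. (2)
    return reward - cost

def social_utility(i, x, G):
    """Equation (5): closeness-weighted interaction with the other participants."""
    return np.sum(x[i] * G[i] * x)

def total_utility(i, x, q, r_task, eta, G):
    """Equation (6): monetary plus social utility of participant i."""
    return monetary_utility(i, x, q, r_task, eta) + social_utility(i, x, G)

# Hypothetical example: three participants sharing one task with reward budget 10
x = np.array([2.0, 1.5, 3.0])        # sensing durations x_i^t
q = np.array([0.8, 0.5, 0.9])        # credibility values q_i^t
G = np.array([[0.0, 0.4, 0.1],
              [0.4, 0.0, 0.3],
              [0.1, 0.3, 0.0]])       # social closeness matrix (zero diagonal)
print(total_utility(0, x, q, r_task=10.0, eta=(0.1, 0.2, 0.05), G=G))
```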

3.2.4. Problem Formulation

To maximize the utility of the participants, this paper models the participants' decision-making process as an optimization problem. Specifically, by optimizing each participant's sensing strategy at each time slot, each participant's individual utility is maximized. The decisions made by each participant are influenced not only by their own state but also by the other participants. In time slot $t$, this optimization problem can be represented as a three-dimensional tuple $\varpi^t = \{n, X^t, U^t\}$, where $n$ represents the total number of participants, $X^t = \{x_1^t, x_2^t, \ldots, x_n^t\}$ represents the set of sensing durations of all participants, and $U^t = \{u_1^t, u_2^t, \ldots, u_n^t\}$ represents the set of utilities of all participants. In summary, given the set $X_{-i}^t$ of sensing durations of the other participants during time slot $t$, participant $a_i$ maximizes their utility by selecting a strategy $x_i^t$, as shown in Equation (7):
$$\max_{x_i^t \in X^t} u_i^t(x_i^t, X_{-i}^t)$$

3.3. Assumption

To ensure the effectiveness of the incentive mechanism, the following assumptions are proposed in this paper:
(1)
All participants are rational, actively engage in the sensing tasks, and choose strategies to maximize their utility rewards. This assumption is fundamental to the design of the incentive mechanism, ensuring that the participants’ decision-making processes can be modeled as an optimization problem.
(2)
The quality of the data submitted by the participants is proportional to their credibility, ensuring that participants submit high-quality data by improving their credibility. This assumption simplifies the relationship between data quality and the rewards.
(3)
Participants exchange information through wireless communication channels within the communication range, which helps enhance cooperation and the optimization of sensing strategies. However, in reality, there may be communication limitations and privacy issues, and future research will explore how to adjust this assumption within these constraints.

4. The CIM-LP Mechanism

To address the aforementioned issues in existing incentive mechanisms, the CIM-LP mechanism is proposed in this paper. We elaborate on it in the following aspects. First, the definition of participants' credibility and its updating algorithm are introduced, explaining how credibility is dynamically adjusted based on data quality evaluation. Second, the Markov decision process (MDP) is described in detail, showing how the CIM-LP mechanism learns and optimizes the optimal strategy. Finally, the LSTM-PPO incentive model and the specific implementation process of the CIM-LP mechanism are presented, aiming to enhance participants' motivation and improve data quality.

4.1. Designing for Credibility

The sensing task relies on a large number of participants providing sensing data. However, the quality of the data submitted by different participants may vary due to factors such as device performance, sensing effort, and behavioral habits, which in turn affect the results of the sensing task and the reward received for it. Existing studies usually rely on user engagement and task completion when assessing credibility while ignoring credibility assessments based on quality testing. This limitation leads to low-efficiency incentive mechanisms. Therefore, a credibility update algorithm based on quality detection is proposed in this article to improve the efficiency and fairness of the incentive mechanisms.

4.1.1. Definition of Credibility

In this study, credibility reflects the degree to which participants submit high-quality data and directly affects their reward distribution. By dynamically assessing credibility through data quality perception, participants are incentivized to ensure data quality. Since the Beta distribution is defined on the interval [0, 1], its flexible probability density function can intuitively represent the continuous variation process from complete distrust (0) to complete trust (1) [28]. Therefore, this study models participant credibility using the beta distribution, as shown in Equation (8):
$$P(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}$$
Here, $\alpha$ represents the number of high-quality data submissions, and $\beta$ represents the number of low-quality data submissions. During the initialization phase of the sensing platform, due to the lack of historical data on the participants, the parameters are set as $\alpha_0 = \beta_0 = 1$, causing the Beta distribution to degenerate into a uniform distribution over the interval [0, 1] ($Beta(1, 1)$). On the one hand, the expected value $E[X] = \frac{\alpha}{\alpha + \beta} = 0.5$ lies at the midpoint of the credibility interval, avoiding any inherent bias toward high or low credibility. On the other hand, the uniform distribution strictly reflects a state of zero prior knowledge, which aligns with the fairness requirements of cold-start scenarios. Furthermore, this choice has a mathematical advantage: since the Beta distribution is a conjugate prior of the binomial distribution, with $\alpha_0 = \beta_0 = 1$ the posterior update simplifies to $Beta(\alpha_0 + k, \beta_0 + n - k)$ (where $k$ is the number of high-quality data submissions and $n - k$ is the number of low-quality submissions), significantly reducing the computational complexity.
If the initial values are instead set as $\alpha_0 = \beta_0 = 0.1$ (which gives the same expected value of 0.5), the model requires approximately 100 data updates to converge to a stable state (standard deviation < 0.05). In contrast, the standard parameters ($\alpha_0 = \beta_0 = 1$) require only about 20 updates to achieve the same level of stability. More extreme initialization schemes (such as $\alpha_0 = \beta_0 = 10$), while maintaining an expected value of 0.5, compress the credibility distribution in advance (reducing the variance by 67%), so that new participants require roughly 40% more data to correct the initial bias. The experimental results confirm the optimality of $Beta(1, 1)$ in terms of convergence speed and robustness against disturbances.
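As a small illustration of the Beta-based credibility model (a sketch under the Beta(1, 1) prior discussed above, not the authors' implementation), the conjugate posterior update and its expected value can be written as follows; the update rule anticipates Equation (15) in Section 4.1.2:

```python
from dataclasses import dataclass

@dataclass
class Credibility:
    alpha: float = 1.0   # Beta(1, 1) prior: zero prior knowledge, E[q] = 0.5
    beta: float = 1.0

    def update(self, passed: int, total: int) -> float:
        """Posterior update after `total` quality tests, `passed` of which were qualified."""
        self.alpha += passed
        self.beta += total - passed
        return self.expected()

    def expected(self) -> float:
        """Expected credibility E[q] = alpha / (alpha + beta)."""
        return self.alpha / (self.alpha + self.beta)

cred = Credibility()
print(cred.expected())                    # 0.5 before any evidence
print(cred.update(passed=8, total=10))    # rises after mostly qualified submissions
print(cred.update(passed=1, total=5))     # falls again after low-quality submissions
```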

4.1.2. Updating Credibility

To accurately update participants' credibility, sensing data must be compared with the true value of a task. However, since the true value of a sensing task is often unknown, it must be estimated from the data submitted by the participants. Accordingly, a truth discovery algorithm is employed to estimate a task's true value, which serves as a quality benchmark [29]. Based on this benchmark, the submitted sensing data is compared, and the participants' credibility is subsequently updated.
For sensing task $j$, the participants submit a sensing dataset $D = \{d_1, \ldots, d_i, \ldots, d_n\}$. Since the true value is unknown, the sensing platform uses the participants' sensing data to predict the true value $d_j^*$ of the sensing task. The predicted true value $d_j^*$ is defined as the value that minimizes the weighted distance between itself and all other data, as shown in Equation (9):
$$d_j^* = \arg\min_{d_j^*} \sum_{d_i \in D} dist^2(d_j^*, d_i) \times \omega_i^j$$
Here, $dist^2(d_j^*, d_i)$ represents the squared distance between a sensing data point and the predicted true value, and $\omega_i^j$ denotes the weight of each data point. In the initial phase of the algorithm, the predicted true value $d_j^*$ is initialized randomly. As the process progresses, $d_j^*$ is iteratively adjusted, and the optimization procedure gradually minimizes the weighted distance between the true value and the other data points, ultimately converging to the true value.
The weight $\omega_i^j$ represents the importance of a sensing data point: the smaller the distance between a data value and the predicted true value, the larger its weight should be. The weights are updated in real time using Equations (10) and (11):
$$J = \sum_{d_i \in D} dist^2(d_j^*, d_i)$$
$$\omega_i^j = \frac{1 - \frac{dist^2(d_j^*, d_i)}{J}}{\sum_{d_i \in D} \left(1 - \frac{dist^2(d_j^*, d_i)}{J}\right)}$$
Here, $\omega_i^j \in (0, 1)$ and $\sum_{d_i \in D} \omega_i^j = 1$. Since the weight of each sensing data point is unknown at the beginning, the weight of every data point is initialized to $1/|D|$, and the algorithm then iterates between predicting the true value and updating the weights.
The iteration stops and the true value $d_j^*$ of the task is obtained when the weight change $|\omega_i^j(t) - \omega_i^j(t-1)|$ between two consecutive iterations is smaller than the threshold $\tau$. Through extensive experimentation, the weight update threshold $\tau$ is set to 0.05. Specifically, when the following condition is satisfied,
$$\max_{d_i \in D} \frac{\left|\omega_i^j(t) - \omega_i^j(t-1)\right|}{\omega_i^j(t-1)} < 0.05$$
the system determines that the relative weight change for all data points between two iterations is below 5%, and the weight update process terminates. This threshold is set based on the following consideration: if the weight change between consecutive iterations is very small, the model has converged, and continuing to update it would not only increase the computational overhead but could also amplify ineffective adjustments caused by noise or random fluctuations, thereby affecting the model's stability and robustness. Setting this convergence condition reduces unnecessary iterations and improves the computational efficiency while preserving accuracy.
According to the true value $d_j^*$, the quality of the sensing data $d_i$ submitted by participant $a_i$ is computed as shown in Equation (13):
$$\iota_i = e^{-dist^2(d_j^*, d_i)}$$
Here, $\iota_i = 1$ indicates that the sensing data $d_i$ of $a_i$ agrees exactly with the true value $d_j^*$ of the task, while $\iota_i \to 0$ signifies that the sensing data $d_i$ of $a_i$ deviates significantly from it.
Whether the quality test is passed is determined by the current data quality $\iota_i$ of participant $a_i$. Here, 'qualified' refers to data that meets the predefined quality standard and is thus considered high-quality; such data is closer to the true value of the task than data that does not meet the standard, so its quality value is closer to 1. The quality test result is denoted as $\zeta_i$, as shown in Equation (14):
$$\zeta_i = \begin{cases} 1, & \text{if } \iota_i \geq \vartheta_j^t \\ 0, & \text{otherwise} \end{cases}$$
Here, $\vartheta_j^t$ represents the quality threshold of the sensing task. When the data quality $\iota_i$ of the participant is greater than or equal to the threshold, the data submitted by the participant is considered qualified; otherwise, the data quality is not up to standard.
Next, participant credibility is updated based on the quality test results. Assume that participant $a_i$ has undergone $Sum$ quality tests, of which the number passed is $y_i = \sum_{k=1}^{Sum} \zeta_i$. The credibility of participant $a_i$ is then updated according to the Beta distribution $q_i \sim Beta(\alpha_i + y_i, \beta_i + Sum - y_i)$, and the expected value of the credibility $q_i$ is shown in Equation (15):
$$E[q_i] = \frac{\alpha_i + y_i}{\alpha_i + \beta_i + Sum}$$
Notably, the updated credibility will be used as an evaluation criterion for the next round of optimized utility rewards, in order to continuously motivate the participants to submit high-quality sensing data.
The pseudo-code for the credibility update is shown in Algorithm 1. The core of the credibility updating process is to iteratively predict the true value of the task and update the weight of each data point. Specifically, in each iteration, the distance to the predicted value must be calculated and the weight updated for every data point in the dataset $D$, so the time complexity of this part is $O(|D|)$. If convergence requires $M$ iterations, the overall complexity of the iterative part is $O(M \times |D|)$. Compared with the iterative part, the complexity of traversing $D$ to judge data quality and compute the statistics is negligible. Therefore, the computational complexity of the credibility update is $O(M \times |D|)$.
Algorithm 1 Credibility update
1: Input: Participant set $A$, dataset $D$, quality threshold $\vartheta_j^t$, weight update threshold $\tau$;
2: Output: Updated participant credibility $q_i$;
3: Initialize the data weight $\omega_i^j = 1/|D|$ of each participant $a_i$ for task $j$;
4: while weight update change > $\tau$ do
5:    Predict the true value $d_j^*$ of the task;
6:    for $d_i$ in $D$ do
7:       Calculate the distance between $d_i$ and $d_j^*$ based on $dist^2(d_j^*, d_i)$;
8:    end for
9:    Calculate the sum of the distances between the sensing data and the true value;
10:   for $d_i$ in $D$ do
11:      Update the weight $\omega_i^j$;
12:   end for
13:   If the maximum weight change is less than $\tau$, stop iterating;
14: end while
15: Obtain the true value $d_j^*$ of the task;
16: Derive the data quality $\iota_i$ of participant $a_i$;
17: Determine whether the data quality is qualified based on the threshold $\vartheta_j^t$;
18: Count the number of tests and qualified tests for participant $a_i$;
19: Calculate the updated credibility $q_i$;
20: Return $q_i$;
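For concreteness, the sketch below gives one possible Python rendering of Algorithm 1 under simplifying assumptions (scalar sensing values, squared-difference distance, one quality test per round, and a Beta(1, 1) prior); the function and variable names are illustrative rather than taken from the paper:

```python
import numpy as np

def truth_discovery(data, tau=0.05, max_iter=100):
    """Estimate the task's true value d_j* and the data weights (Eqs. (9)-(12))."""
    data = np.asarray(data, dtype=float)
    w = np.full(len(data), 1.0 / len(data))       # initial weights 1/|D|
    for _ in range(max_iter):
        # With squared distance, the weighted minimizer of Eq. (9) is the weighted mean
        d_star = np.sum(w * data) / np.sum(w)
        dist = (data - d_star) ** 2               # dist^2(d_j*, d_i)
        J = dist.sum()                            # Eq. (10)
        new_w = 1.0 - dist / J
        new_w /= new_w.sum()                      # Eq. (11)
        converged = np.max(np.abs(new_w - w) / np.maximum(w, 1e-12)) < tau  # Eq. (12)
        w = new_w
        if converged:
            break
    return d_star, w

def update_credibility(data, alpha, beta, threshold):
    """Algorithm 1: quality test against the estimated truth, then Beta posterior mean (Eq. (15))."""
    d_star, _ = truth_discovery(data)
    quality = np.exp(-(np.asarray(data, dtype=float) - d_star) ** 2)   # Eq. (13)
    passed = quality >= threshold                                      # Eq. (14)
    return (alpha + passed) / (alpha + beta + 1.0)   # Sum = 1 quality test in this round

readings = [21.2, 20.8, 21.0, 35.0]   # hypothetical readings; the last value is an outlier
print(update_credibility(readings, alpha=np.ones(4), beta=np.ones(4), threshold=0.8))
```

In this toy example the outlier receives a small weight during truth discovery, fails the quality test, and its owner's expected credibility drops, while the other participants' credibility rises.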

4.2. The Markov Decision Process

In the decision optimization process, participants need to adjust their sensing duration strategies according to a dynamically changing environment. To model this dynamic optimization process effectively, a Markov decision process (MDP) is employed, formally defined as a tuple of states, actions, and rewards, $MDP = \langle S, A, U \rangle$. In deep reinforcement learning, the agent acts as the decision-maker. It first observes the current state of the environment and selects an action according to its current policy; it then observes the impact of the action on the environment and derives the next state from this impact and the reward provided by the environment; finally, the policy is updated based on the next state and the received reward. In this study, participants are modeled as agents within the Markov decision-making framework. They determine the optimal sensing duration based on past experience, calculate the corresponding utility, and iteratively update their strategies to achieve improved outcomes.
State Space:
At time period $t$, participant $a_i$ determines the optimal sensing duration based on the current state $s_i^t$. The state space consists of time-series data together with several discrete quantities, denoted as $s_i^t = \{s_1^t, s_2^t\} \in S$.
Time-series data: To obtain historical data about other participants, the platform ensures that each participant can share their own sensing data history through a distributed architecture. $s_1^t = (X_{-i}^{t-L}, X_{-i}^{t-L+1}, \ldots, X_{-i}^{t-1})$ denotes the set of historical sensing durations of the other participants over the previous $L$ time points. These data are processed through an LSTM network, which captures temporal dependencies and improves decision-making efficiency. If a participant has been involved in a platform task before, their status information at different points in time is recorded and summarized to form $s_1^t$; for users who have never participated, the system defaults their historical status information to a null value.
Discrete data: $s_2^t = \{q_i^t, e_i^t, \vartheta_j^t, r_j^t\}$ represents the data relevant to decision-making. Here, $q_i^t$ denotes the participant's credibility value, $e_i^t$ represents the remaining energy of the device, $\vartheta_j^t$ is the task quality threshold, and $r_j^t$ indicates the task reward.
Notably, all data within the state space are taken into account when selecting the sensing duration.
Action Space:
The sensing duration of each participant is constrained not only by the device's energy but also by the current state information. Therefore, the action space of each participant can be represented as $A = [0, \bar{x}]$, where $\bar{x}$ denotes the maximum sensing duration. In this paper, the action of each participant $a_i$ at time period $t$ is defined as the corresponding sensing duration $x_i^t \in A$ used to collect sensing data.
Reward Function:
The reward function reflects the feedback the environment provides to the agent after executing action $x_i^t$. This reward corresponds to the utility obtained by the participant upon completing the sensing task. Consequently, in the context of the Markov decision process, the reward function is defined as the participant's utility, as shown in Equation (16):
$$u_i^t(x_i^t, X_{-i}^t) \in U$$
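A minimal sketch of how this MDP state and action could be represented for a single participant is given below (the field names and the example values are illustrative assumptions; only the structure follows the description above):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SensingState:
    """State s_i^t = {s_1^t, s_2^t} of participant a_i at time period t."""
    history: np.ndarray       # s_1^t: others' sensing durations over the last L periods, shape (L, n-1)
    credibility: float        # q_i^t
    energy: float             # e_i^t: remaining device energy
    quality_threshold: float  # task quality threshold
    reward_budget: float      # r_j^t

    def discrete_vector(self) -> np.ndarray:
        """Flatten s_2^t into the feature vector fed to the feature network."""
        return np.array([self.credibility, self.energy,
                         self.quality_threshold, self.reward_budget], dtype=np.float32)

# Example: history window L = 5 with three other participants; the action is a duration in A = [0, x_max]
state = SensingState(history=np.zeros((5, 3), dtype=np.float32),
                     credibility=0.5, energy=100.0,
                     quality_threshold=0.8, reward_budget=10.0)
x_max = 10.0
action = float(np.clip(np.random.uniform(0, x_max), 0, x_max))  # continuous sensing duration x_i^t
print(state.discrete_vector(), action)
```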

4.3. The Architecture of the LSTM-PPO Model

As shown in Figure 2, the LSTM-PPO model is composed of four key components: a feature extraction network $\varphi_i$, a long short-term memory network $\lambda_i$, a policy network $\pi_{\theta_i}$ (parameterized by $\theta_i$), and a value network $\nu_{\omega_i}$ (parameterized by $\omega_i$).
The feature network $\varphi_i$ is designed to extract effective features from the state space of the Markov decision process. Its inputs are the time-series data $s_1^t$ and the discrete data $s_2^t$, and its outputs are the feature vectors $\varphi_i(s_1^t)$ and $\varphi_i(s_2^t)$ extracted through a fully connected layer.
The long short-term memory network $\lambda_i$ is specialized in processing sequence data and capturing temporal dependencies. Its input is the feature-extracted time-series data $\varphi_i(s_1^t)$, and its output is the hidden state $h^t$ of the final time step, which is then used, via the feature network, to predict the current state feature $\bar{X}_{-i}^t$.
The policy network $\pi_{\theta_i}$, referred to as the actor, aims to derive sensing durations close to the optimal values so as to maximize the long-term cumulative reward. It takes the comprehensive feature representation $\phi_i^t$ as input and outputs a probability distribution $\pi_{\theta_i}(\phi_i^t)$ over the action space. Participants choose actions in accordance with this distribution, with higher-probability actions being preferred.
The value network $\nu_{\omega_i}$, known as the critic, estimates the expected cumulative reward that can be achieved by following the current policy from a given state. In addition, the value network is responsible for computing the advantage function, which evaluates the additional value of performing a given action in a given state compared with the average action under the policy, and thus guides the optimization of the policy network $\pi_{\theta_i}$. The input of the value network is the comprehensive feature representation $\phi_i^t$, and its output is a scalar value $\nu_i^t = \nu_{\omega_i}(\phi_i^t) = \nu_{\omega_i}(s_i^t)$.
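The following PyTorch sketch outlines one plausible realization of this architecture (the 128 hidden units and sequence length of 5 follow the settings reported in Section 5.1, while the Gaussian action head and layer sizes are illustrative assumptions rather than the authors' exact design):

```python
import torch
import torch.nn as nn

class LSTMPPONet(nn.Module):
    """Feature network + LSTM + actor (policy) and critic (value) heads."""
    def __init__(self, n_others, n_discrete=4, hidden=128):
        super().__init__()
        self.seq_feat = nn.Linear(n_others, hidden)        # feature network for s_1^t
        self.disc_feat = nn.Linear(n_discrete, hidden)      # feature network for s_2^t
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.fuse = nn.Linear(2 * hidden, hidden)            # fusion layer producing phi_i^t
        self.actor_mu = nn.Linear(hidden, 1)                 # mean of the sensing-duration distribution
        self.actor_logstd = nn.Parameter(torch.zeros(1))     # learned log std (assumed Gaussian policy)
        self.critic = nn.Linear(hidden, 1)                   # value head nu_omega

    def forward(self, history, discrete):
        # history: (batch, seq_len, n_others); discrete: (batch, n_discrete)
        seq = torch.relu(self.seq_feat(history))
        _, (h_t, _) = self.lstm(seq)                          # hidden state of the final time step
        disc = torch.relu(self.disc_feat(discrete))
        fused = torch.relu(self.fuse(torch.cat([h_t[-1], disc], dim=-1)))
        mu = torch.nn.functional.softplus(self.actor_mu(fused))   # keep durations non-negative
        dist = torch.distributions.Normal(mu, self.actor_logstd.exp())
        value = self.critic(fused).squeeze(-1)
        return dist, value

net = LSTMPPONet(n_others=3)
dist, value = net(torch.zeros(1, 5, 3), torch.zeros(1, 4))
action = dist.sample().clamp(0.0, 10.0)   # sensing duration drawn from the policy, within [0, x_max]
```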

4.4. Implementation of the CIM-LP Mechanism

The CIM-LP mechanism is composed of two major components: the sensing decision-making process and the policy update process. The sensing decision-making process focuses on the participants selecting the optimal sensing duration strategy through deep reinforcement learning based on the state information provided by the sensing platform. This process aims to maximize their utility rewards. Meanwhile, the policy update process optimizes the policy network and the value network through the PPO algorithm, ensuring that the participants’ decisions continue to improve with experience, thus enhancing the overall quality of the sensing task and the motivation of the participants. It is important to note that the deployment of the CIM-LP mechanism is not centralized on a single central server but rather is distributed across each participant, utilizing a cloud computing platform to provide computational support to the participants. The specific implementation process for the CIM-LP mechanism is shown in Figure 3.

4.4.1. The Sensing Decision-Making Process

The sensing platform provides the participant with the current state information, including the time-series data $s_1^t$ and the discrete data $s_2^t$. The participant processes the time-series data through the long short-term memory network, capturing temporal dependencies and extracting the hidden state $h^t$; the current state $\bar{X}_{-i}^t$ is then predicted through the feature network. Meanwhile, the discrete data undergoes feature extraction $\varphi_i(s_2^t)$ through the feature network and is combined with the predicted current state $\bar{X}_{-i}^t$ in the fusion layer to form the comprehensive feature representation $\phi_i^t$. Based on the combined features, the participant generates the probability distribution over sensing duration strategies through the deep reinforcement learning policy network and selects the strategy $x_i^t$ according to this distribution. In addition, the value network evaluates the current state and outputs the expected cumulative reward $\nu_{\omega_i}(s_i^t)$, which is used to guide strategy updates.
After the sensing task is completed, the sensing data uploaded by the participant is tested for quality, and the results are used to update the participant's credibility and influence their utility rewards in subsequent tasks. The relevant information $(s_i^t, x_i^t, u_i^t, \nu_i^t, s_i^{t+1})$ about the task is stored as experience in the buffer $D_i$ and used for policy updates in the PPO algorithm, ensuring that participants continuously optimize their decision-making in subsequent sensing tasks to maximize their long-term utility rewards.

4.4.2. The Policy Update Process

In the policy update process, the goal of the actor–critic network is to gradually adjust the policy by maximizing the cumulative reward, thereby improving the participant's decision-making. To optimize the participant's policy further, CIM-LP samples the stored experience data, computes the gradients of the policy network and the value network, and updates their parameters. Specifically, CIM-LP extracts a mini-batch of size $B$ from the buffer $D_i$ to compute the gradient $\nabla J(\theta_i)$ of the actor network and the gradient $\nabla L(\omega_i)$ of the critic network, whose parameters are $\theta_i$ and $\omega_i$, respectively. The parameter update steps are as follows.
Firstly, the cumulative utility reward $U_i^t$ of each participant is calculated as shown in Equation (17):
$$U_i^t = u_i^t + \gamma u_i^{t+1} + \cdots + \gamma^{T-t-1} u_i^{T-1} + \gamma^{T-t} \nu_{\omega_i}(s_i^T)$$
Here, $\gamma \in [0, 1]$ is a discount factor that determines how far into the future rewards are taken into account. When $\gamma = 0$, $U_i^t = u_i^t$, indicating that the focus is solely on the current utility rather than long-term benefits. In contrast, when $\gamma = 1$, the emphasis shifts to the total cumulative utility from time $t$ to $T$. It is important to note that $\nu_{\omega_i}(s_i^T)$ is used instead of $u_i^T$, as $\nu_{\omega_i}(s_i^T)$ is the value estimate of the final state $s_i^T$ and thus reflects the long-term return of that state.
Secondly, for the policy network, the importance sampling ratio $\rho_i^b$ is used to measure the relative performance of the new policy against the old one. It adjusts the magnitude of policy network updates to ensure the stability of policy updates, as shown in Equation (18):
$$\rho_i^b = \frac{\pi_{\theta_i}(x_i^b \mid s_i^b)}{\pi_{\theta_i^{old}}(x_i^b \mid s_i^b)}$$
Here, $b$ is the mini-batch index. The numerator and the denominator represent the probabilities of the participant taking action $x_i^b$ in state $s_i^b$ under the new and old policies, respectively.
To avoid the instability caused by excessively large policy updates, a clipping technique is used to limit $\rho_i^b$ to the interval $[1 - \varepsilon, 1 + \varepsilon]$. The clipped objective function $J^{clip}(\theta_i)$ is shown in Equation (19):
$$J^{clip}(\theta_i) = \frac{1}{B} \sum_{b=1}^{B} \min\left(\rho_i^b A_i^b,\; \mathrm{clip}\left(\rho_i^b, 1 - \varepsilon, 1 + \varepsilon\right) A_i^b\right)$$
Here, the default value of the hyperparameter $\varepsilon$ is 0.1. $A_i^b$ denotes the advantage function, which measures how much better the current policy $\pi_{\theta_i}$ is than the old one, and is defined as $A_i^b = U_i^b - \nu_i^b$.
Based on the objective function of the policy network, we can calculate the gradient $\nabla J(\theta_i)$ of the policy network, as shown in Equation (20):
$$\nabla J(\theta_i) = \frac{1}{B} \sum_{b=1}^{B} \nabla_{\theta_i} \log \pi_{\theta_i}(x_i^b \mid s_i^b)\, \min\left(\rho_i^b,\; \mathrm{clip}\left(\rho_i^b, 1 - \varepsilon, 1 + \varepsilon\right)\right) A_i^b$$
Here, $\nabla_{\theta_i} \log \pi_{\theta_i}(x_i^b \mid s_i^b)$ denotes the gradient of the log-probability of taking action $x_i^b$ in state $s_i^b$ under the new policy.
For the value network, the optimization objective is to minimize the error between the value prediction and the actual cumulative utility reward, and its loss function $L(\omega_i)$ takes the form of the mean squared error, as shown in Equation (21):
$$L(\omega_i) = \frac{1}{B} \sum_{b=1}^{B} \left(\nu_i^b - U_i^b\right)^2$$
The gradient $\nabla L(\omega_i)$ of the value network can be calculated from the loss function, as shown in Equation (22):
$$\nabla L(\omega_i) = \frac{1}{B} \sum_{b=1}^{B} \left(\nu_i^b - U_i^b\right) \nabla_{\omega_i} \nu_i^b$$
Here, $\nabla_{\omega_i} \nu_i^b$ denotes the partial derivative of the value network output with respect to the parameters $\omega_i$.
Finally, the respective parameters are updated using the computed gradients of the policy network and the value network. The policy network $\pi_{\theta_i}$ is updated using mini-batch stochastic gradient ascent, while the value network $\nu_{\omega_i}$ is updated using mini-batch stochastic gradient descent, as shown in Equations (23) and (24):
$$\theta_i \leftarrow \theta_i + l_{i,1} \nabla J(\theta_i)$$
$$\omega_i \leftarrow \omega_i - l_{i,2} \nabla L(\omega_i)$$
Here, $l_{i,1}$ and $l_{i,2}$ are the learning rates of the policy network and the value network, respectively.
Through the above steps, each round of policy updates ensures the participants’ decision-making abilities are optimized while maintaining the stability of the updates, gradually enhancing the long-term utility rewards.
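A condensed, self-contained sketch of one such update step is given below (a toy PyTorch illustration of the clipped objective in Equation (19) and the value loss in Equation (21), not the authors' training code):

```python
import torch

def ppo_update(new_logp, old_logp, returns, values, eps=0.1):
    """One clipped PPO update step on a mini-batch.

    new_logp, old_logp: log pi(x|s) under the new and old policies
    returns:            cumulative utility rewards U_i^b (Eq. (17))
    values:             critic estimates nu_i^b for the same states
    """
    advantage = (returns - values).detach()            # A_i^b = U_i^b - nu_i^b
    ratio = torch.exp(new_logp - old_logp.detach())    # importance ratio rho_i^b (Eq. (18))
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_objective = torch.min(ratio * advantage, clipped * advantage).mean()  # Eq. (19)
    value_loss = ((values - returns.detach()) ** 2).mean()                        # Eq. (21)
    # Gradient ascent on the policy objective, descent on the value loss (Eqs. (23) and (24))
    return -policy_objective + value_loss

# Toy mini-batch of size B = 4 with hypothetical numbers
new_logp = torch.randn(4, requires_grad=True)
old_logp = new_logp.detach() + 0.05 * torch.randn(4)
returns = torch.tensor([4.0, 3.5, 5.0, 2.0])
values = torch.tensor([3.8, 3.0, 4.5, 2.5], requires_grad=True)
loss = ppo_update(new_logp, old_logp, returns, values)
loss.backward()   # the gradients would then be applied with the learning rates l_{i,1} and l_{i,2}
```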
The pseudo-code for the CIM-LP mechanism is shown in Algorithm 2. The hyperparameters in CIM-LP are periodically saved for testing purposes. In this paper, we utilize the trained feature network $\varphi_i$, long short-term memory network $\lambda_i$, and policy network $\pi_{\theta_i}$ of each participant $a_i$ to determine their sensing duration. As CIM-LP is a distributed control system, each participant uses its own LSTM-PPO model to optimize its sensing strategy. Specifically, fully connected neural networks are employed in the model, with a computational complexity of $O\big(N \times \big(\sum_{l=1}^{L_\varphi} n_{l-1} n_l + \sum_{l=1}^{L_\lambda} n_{l-1} n_l + \sum_{l=1}^{L_\pi} n_{l-1} n_l + \sum_{l=1}^{L_\nu} n_{l-1} n_l\big)\big)$, where $n_l$ is the number of neurons in layer $l$; $L_\varphi$, $L_\lambda$, $L_\pi$, and $L_\nu$ are the numbers of fully connected layers in each subnetwork; and $N$ is the number of participants. When the total number of layers in each participant's model is $L$ and the number of neurons in each layer is $n$, the main computation of each model can be approximated as the matrix multiplication of each layer, that is, $O(L \times n^2)$. Considering that the $N$ participants run in parallel, the overall computational complexity simplifies to $O(N \times L \times n^2)$.
Algorithm 2 CIM-LP
1: Input: The state space $s_i^t = \{s_1^t, s_2^t\}$;
2: Output: The sensing duration $x_i^t$;
3: Initialize the network parameters $\varphi_i$, $\lambda_i$, $\pi_{\theta_i}$, $\nu_{\omega_i}$;
4: for episode in $1, 2, \ldots, 1000$ do
5:    Initialize the replay cache $D_i$;
6:    for $t$ in $1, 2, \ldots, T$ do
7:       Predict the sensing duration strategy $\bar{X}_{-i}^t$;
8:       Obtain the feature vector $\varphi_i$;
9:       Further obtain the comprehensive feature $\phi_i^t$;
10:      Derive the value estimate $\nu_{\omega_i}(\phi_i^t)$ based on the value network $\nu_{\omega_i}$;
11:      Derive the probability distribution $\pi_{\theta_i}(\phi_i^t)$ of the sensing duration strategy;
12:      Sample an action $x_i^t$ from the probability distribution;
13:      Collect and upload sensing data based on $x_i^t$;
14:      Acquire the utility reward $u_i^t$;
15:      Update the credibility $q_i^{t+1} = q_i$ according to Algorithm 1;
16:      Update the state space $s_i^{t+1} = \{s_1^{t+1}, q_i^{t+1}, e_i^{t+1}, \vartheta_j^{t+1}, r_j^{t+1}\}$;
17:      Store the experience $(s_i^t, x_i^t, u_i^t, \nu_i^t, s_i^{t+1})$ in the cache $D_i$;
18:    end for
19:    $\pi_{\theta_i^{old}} \leftarrow \pi_{\theta_i}$;
20:    for $m$ in $1, 2, \ldots, M$ do
21:       Sample a small batch $B$ from the cache $D_i$;
22:       Calculate the policy gradient $\nabla J(\theta_i)$ and update the parameters;
23:       Calculate the value gradient $\nabla L(\omega_i)$ and update the parameters;
24:    end for
25: end for
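The overall loop of Algorithm 2 can be summarized by the following skeleton, in which `decide`, `complete_task`, and `ppo_update` are hypothetical stand-ins for the decision, quality-detection, and policy-update components described above (a structural sketch only, not a faithful implementation):

```python
import random

# Hypothetical stubs: decide(state) returns a sensing duration, complete_task(x) returns
# (utility, next_state, quality_ok), and ppo_update(buffer) performs the update of Section 4.4.2.
def decide(state): return random.uniform(0.0, 10.0)
def complete_task(x): return 1.0 + 0.5 * x - 0.01 * x * x, {}, random.random() > 0.2
def ppo_update(buffer): pass

def run_cim_lp(episodes=3, T=5):
    alpha, beta = 1.0, 1.0                      # Beta(1, 1) credibility prior
    state = {"q": 0.5}
    for _ in range(episodes):                   # outer loop of Algorithm 2
        buffer = []
        for _ in range(T):                      # sensing decision-making process
            x = decide(state)
            utility, next_state, qualified = complete_task(x)
            alpha, beta = alpha + qualified, beta + (not qualified)   # quality test outcome
            next_state["q"] = alpha / (alpha + beta)                  # updated credibility
            buffer.append((state, x, utility, next_state))
            state = next_state
        ppo_update(buffer)                      # policy update process (Eqs. (17)-(24))
    return state["q"]

print(run_cim_lp())
```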

5. Simulation and Result Analysis

In order to evaluate the performance of the CIM-LP mechanism, this section compares the CIM-LP mechanism proposed in this paper with three existing incentive mechanisms through simulations and provides a detailed introduction to the simulation settings, the comparative mechanism, the performance metrics, and a performance comparison and analysis.

5.1. Simulation Environment Settings

This simulation is based on Python version 3.9 and utilizes the PyTorch 2.1 framework for model construction and training. For the dataset selection, the Gowalla dataset is used to model the social relationships among participants in the real world. Gowalla is a location-based social networking service provider that allows users to check in at specific locations and share location information, thereby forming a social network through natural interactions among users [30]. The dataset includes users’ geographical check-in information and activities across different times and locations and a social network consisting of 196,591 user nodes and 950,327 edges. As shown in Equation (1), Jaccard similarity is used to calculate the social closeness between each pair of users in the Gowalla dataset, generating a social closeness matrix [26]. Based on this matrix, subsets of small-scale social networks with a specific average social closeness level can be extracted, thereby constructing a corresponding social closeness matrix. Instead of the social network constructed using a normal distribution in [31], we used a real dataset, making it more relevant to the real world.
The experiment was simulated using social networks of varying sizes, with the participant counts ranging from 100 to 500, in increments of 100, to evaluate the model’s performance across different scale scenarios. Additionally, the influence of varying average social closeness between the participants (ranging from 0.1 to 0.9, with step sizes of 0.2) was also considered.
For the time-series data modeling problem, this study designs a prediction module based on a single-layer long short-term memory network. Drawing on the application experience with an LSTM structure in mobile crowdsensing from reference [13], the initial configuration includes 128 hidden units, a sequence length of 5, and a Dropout probability of 0.2 to mitigate overfitting. To validate the rationality of the hyperparameter selection, sensitivity experiments were conducted, comparing the impact of different numbers of hidden units (64, 128, 256) and different Dropout rates (0.1, 0.2, 0.3) on the model performance. The results indicate that a combination of 128 units and a 0.2 Dropout rate strikes a good balance between accuracy and stability.
In the strategy optimization section, the proximal policy optimization algorithm was selected, with the learning rate set to 0.0003, a discount factor of 0.99, and a clipping parameter of 0.1. These configurations are based on the findings in reference [13], which highlight the strong performance of PPO in dynamic environments. During training, the model employs a mini-batch update method, sampling 125 trajectory samples from the experience buffer for each training step. To further demonstrate the stability of the hyperparameter settings, experiments were conducted testing different combinations of learning rates (0.0001, 0.0003, 0.001) and clipping ranges (0.05, 0.1, 0.2). The results show that the current configuration achieves a better performance in terms of the convergence speed and policy generalization.
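For reference, the training configuration described in this subsection can be gathered into a single dictionary (the values are those reported above; the key names themselves are illustrative):

```python
ppo_config = {
    "learning_rate": 3e-4,      # PPO learning rate
    "discount_factor": 0.99,    # gamma
    "clip_range": 0.1,          # epsilon in the clipped objective
    "mini_batch_size": 125,     # trajectory samples per update
    "lstm_hidden_units": 128,   # single-layer LSTM
    "sequence_length": 5,       # history window L
    "dropout": 0.2,             # regularization against overfitting
}
```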
The specific parameters used in the simulation experiments are provided in Table 2.

5.2. The Comparative Mechanism

From the perspective of optimizing the participants’ utility rewards, PPO-DSIM [13], RLPM [32], and GSIM-SPD [9] will be used as comparison mechanisms.
PPO-DSIM: This mechanism primarily utilizes the proximal policy optimization (PPO) algorithm from deep reinforcement learning to maximize the utility rewards, thereby motivating participants to actively engage in sensing tasks. Based on [13], we introduce PPO-DSIM as a comparative mechanism to demonstrate the performance of our proposed approach, which integrates an LSTM network.
RLPM: This mechanism is specifically designed to solve utility maximization problems, primarily addressing decision-making issues with discrete action spaces. According to [32], we introduce RLPM as a comparative mechanism to showcase the superiority of CIM-LP in handling continuous action spaces.
GSIM-SPD: This mechanism is used to solve optimal decision-making problems. Based on [9], we introduce GSIM-SPD as a benchmark mechanism to highlight the advantages of CIM-LP, which is based on deep reinforcement learning, in addressing complex decision-making tasks.

5.3. Performance Metrics

In this section, evaluation metrics including the average participant utility, average monetary utility, average social utility, and task completion rate are introduced to assess the performance of each mechanism.
1. Average participant utility: This represents the average utility of all participants over all time periods, as shown in Equation (25):
$$u = \frac{1}{n \times T} \sum_{i=1}^{n} \sum_{t=1}^{T} u_i^t\left(x_i^t, X_{-i}^t\right) \qquad (25)$$
Here, n represents the total number of participants, T represents the total number of time periods, and $u_i^t\left(x_i^t, X_{-i}^t\right)$ represents the utility of participant $a_i$ in period t.
2. Average monetary utility: This represents the average monetary utility of all participants over all time periods, as shown in Equation (26):
$$u^{mon} = \frac{1}{n \times T} \sum_{i=1}^{n} \sum_{t=1}^{T} u_{i,t}^{mon}\left(x_i^t, X_{-i}^t\right) \qquad (26)$$
Here, $u_{i,t}^{mon}\left(x_i^t, X_{-i}^t\right)$ represents the monetary utility of participant $a_i$ in different periods.
3. Average social utility: This represents the average social utility of all participants over all time periods, as shown in Equation (27):
$$u^{soc} = \frac{1}{n \times T} \sum_{i=1}^{n} \sum_{t=1}^{T} u_{i,t}^{soc}\left(x_i^t, X_{-i}^t\right) \qquad (27)$$
Here, $u_{i,t}^{soc}\left(x_i^t, X_{-i}^t\right)$ represents the social utility of participant $a_i$ in different time periods.
4. The task completion rate: To evaluate the effectiveness of various mechanisms in task completion during the sensing process, the average task completion rate throughout the sensing process T is calculated as shown in Equation (28):
$$\varsigma = \frac{1}{T} \sum_{t=1}^{T} f_t\left(x_i^t\right) \qquad (28)$$
Here, the function $f_t\left(x_i^t\right)$ is defined as shown in Equation (29):
$$f_t\left(x_i^t\right) = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} x_i^t \geq \psi \\ 0, & \text{otherwise} \end{cases} \qquad (29)$$
This function indicates whether the total sensing duration in time period t satisfies the threshold ψ required by the task. When the total sensing duration is greater than or equal to the task threshold, the task is considered completed; otherwise, it is not.
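As a compact sketch, the metrics above can be computed directly from the logged utilities and sensing durations; the array names and toy numbers below are purely illustrative, and the average monetary and social utilities follow the same pattern as Equation (25) using the corresponding utility components.

```python
import numpy as np


def evaluate(utilities, durations, psi):
    """Average participant utility (Eq. 25) and task completion rate (Eqs. 28-29).

    utilities : (n, T) array of u_i^t for every participant and time period
    durations : (n, T) array of sensing durations x_i^t
    psi       : task threshold (psi)
    """
    avg_utility = utilities.mean()            # mean over participants and periods
    completed = durations.sum(axis=0) >= psi  # per-period indicator f_t
    return float(avg_utility), float(completed.mean())


# Toy example: 3 participants, 4 time periods, threshold psi = 5.
u = np.array([[1.0, 2.0, 1.5, 0.5],
              [2.0, 1.0, 1.0, 1.0],
              [0.5, 1.5, 2.0, 1.0]])
x = np.array([[2, 1, 3, 1],
              [2, 2, 1, 1],
              [2, 2, 2, 1]])
print(evaluate(u, x, psi=5))   # (1.25, 0.75)
```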

5.4. Performance Comparison and Analysis

The convergence of the average participant utility obtained using the various comparison mechanisms under the conditions n = 100 and μ = 0.9 is illustrated in Figure 4. The experimental results show that the utility achieved by CIM-LP converges to approximately 4.50 after 500 training rounds and remains stable thereafter, whereas the benchmark mechanisms PPO-DSIM and RLPM converge to only approximately 3.65 and 2.75 after about 450 and 600 training iterations, respectively. The CIM-LP mechanism thus enables the participants to achieve a higher utility. Compared to PPO-DSIM, CIM-LP introduces a long short-term memory network on top of deep reinforcement learning, which improves the network's ability to process sequential, time-dependent data and thereby increases the accuracy of decision-making. Compared to RLPM, the results demonstrate the advantage of CIM-LP in handling continuous action spaces. Furthermore, the deep-reinforcement-learning-based incentive mechanisms achieve higher utility than non-learning approaches such as GSIM-SPD. This is because GSIM-SPD struggles to cope with complex, dynamic decision environments and lacks the ability to learn and optimize online, which leads to lower efficiency in practice.
Figure 5a–d show the effects of changes in the number of participants on the average participant utility, average monetary utility, average social utility, and task completion rate at a social closeness μ = 0.9, respectively. As can be seen from Figure 5a, regardless of the number of participants, CIM-LP consistently achieves a higher average utility. For CIM-LP, when n = 100, the average utility is higher than that for PPO-DSIM, RLPM, and GSIM-SPD by 0.75, 1.86, and 2.45, respectively. This demonstrates that CIM-LP can dynamically adjust the sensing duration based on factors such as credibility, resource status, and task rewards. Additionally, compared to PPO-DSIM, the CIM-LP mechanism incorporates an LSTM network, which enables a large amount of critical temporal data to be included in the decision-making process, thereby improving the utility reward.
As shown in Figure 5b, the average monetary utility of the participants decreases as the number of participants increases. For example, when the number of participants is 100, the CIM-LP mechanism demonstrates a relatively high average monetary utility; when the number of participants increases to 300, the average monetary utility of CIM-LP drops to 0.59, which is 1.38 lower than that at 100 participants. This suggests that as the number of participants increases, the competition for resources intensifies, and the reward that each participant can obtain gradually decreases. Despite this, the CIM-LP mechanism consistently maintains a high level of average monetary utility, primarily because CIM-LP, through LSTM-PPO, dynamically adjusts the reward distribution and sensing duration strategies, ensuring that each participant receives a reasonable monetary utility even in larger-scale environments. In contrast, the PPO-DSIM, RLPM, and GSIM-SPD mechanisms cannot adjust their strategies and reward distributions as flexibly as CIM-LP when dealing with a large number of participants. Because they do not process the other participants' historical temporal data, these mechanisms cannot maintain a comparable average monetary utility in more complex resource competition scenarios.
As shown in Figure 5c, the average social utility of the participants increases with the number of participants. For instance, for CIM-LP, when the number of participants is 300, the average social utility is 9.86, which is 7.33 higher than when the number of participants is 100. This is because a larger number of participants forms a broader social network, which provides stronger social incentives for each participant's sensing duration strategy and results in a higher average social utility. Additionally, for RLPM, as the number of participants increases from 300 to 500, there is a significant increase in the participants' average social utility. This arises because RLPM enhances the social utility by extending the participants' sensing durations; however, this strategy leads to excessive energy consumption, which negatively impacts the task completion rate.
As shown in Figure 5d, for the three deep-reinforcement-learning-based mechanisms (CIM-LP, PPO-DSIM, and RLPM), the task completion rate decreases at a progressively faster rate as the number of participants increases. Taking CIM-LP as an example, when n = 300, the task completion rate reaches 90.20%, which is 0.3% lower than at n = 100 and 5.2% higher than at n = 500. This is because as the number of participants increases from 100 to 300, the amount of data collected from the participants increases, which facilitates smooth completion of the task. However, as the number of participants continues to increase, the social utility becomes more significant, and the participants tend to seek more utility rewards by increasing their sensing durations. Since mobile devices have limited energy, indiscriminately extending the sensing duration can cause devices to run out of power before tasks are completed, thereby reducing the task completion rate.
Figure 6a–d show the impact of changes in the average social closeness μ on the average participant utility, average monetary utility, average social utility, and task completion rate at a participant number n = 200, respectively. As shown in Figure 6a, CIM-LP achieves the best sensing strategy and the highest utility reward. For instance, when μ = 0.9, the average participant utility for CIM-LP is 7.73, surpassing PPO-DSIM, RLPM, and GSIM-SPD by 0.45, 2.61, and 4.35, respectively. Through its deep reinforcement learning mechanism, CIM-LP is able to adjust its sensing strategy effectively in complex environments, enabling the participants to achieve a higher utility; this reflects the mechanism's comprehensive consideration of multiple factors during decision-making optimization. Furthermore, as shown in Figure 6b, CIM-LP demonstrates a clear advantage in average monetary utility over the other mechanisms. This is because the mechanism takes into account factors such as the historical data, the participants' individual conditions, and device energy when optimizing decisions, thereby effectively compensating for shortcomings in the sensing duration decision-making.
As shown in Figure 6c, for all mechanisms, the average social utility of the participants increases with the growth of μ. For example, in the case of GSIM-SPD, the average social utility is 3.37 when μ = 0.9 and 2.11 when μ = 0.5, i.e., the former is 1.26 higher than the latter. Additionally, combining Figure 6b and Figure 6c shows that GSIM-SPD, with its simple sensing strategy, is the mechanism most affected by the average social closeness. However, once limited energy resources and task rewards are taken into account, the sensing strategy must be reasonably optimized over the long-term sensing process. The CIM-LP mechanism guides the participants to adopt appropriate sensing strategies, thereby maximizing the monetary and social utility rewards and achieving higher efficiency in the long term.
As shown in Figure 6c,d, it can be seen that the average social utility increases with the growth of μ for two reasons. For GSIM-SPD, the increase in the average social utility is primarily due to the enhancement of the sensing duration, which is evidenced by a reduction in the task completion rate. This is because GSIM-SPD lacks foresight, indiscriminately guiding the participants to extend their sensing durations to gain more immediate social utility, leading to excessive energy consumption and affecting the completion of subsequent tasks. In contrast, for CIM-LP, PPO-DSIM, and RLPM, the increase in average social utility is mainly caused by the increase in average social closeness, which can be confirmed by the almost constant task completion rate. In addition, it is noteworthy that when μ = 0.9 , the task completion rate with CIM-LP reaches 91.13%. This advantage lies in the fact that CIM-LP aims to maximize the overall utility of all sensing tasks by guiding the participants to save device energy in a reasonable way, achieving greater utility in the future. Meanwhile, by integrating LSTM networks, the prediction of the sensing durations for the other participants has been optimized, enabling them to intelligently adjust their strategy execution, effectively manage their energy consumption, and ensure efficient completion of a large number of tasks.
CIM-LP demonstrates significant advantages in its experimental results, effectively enhancing participant utility and the task completion rate. However, this advantage comes with a computational complexity that increases with the number of participants. According to the computational complexity analysis, CIM-LP’s complexity is O ( N × L × n 2 ) , where N is the number of participants, L is the number of network layers, and n is the number of neurons per layer. In particular, CIM-LP performs decision optimization through deep reinforcement learning (LSTM-PPO), which requires significant computational resources, especially when handling participant decisions. The behavior and decision-making processes for each participant involve complex network computations. In contrast, the computational complexity of PPO-DSIM, RLPM, and GSIM-SPD is lower, primarily due to the simplified decision-making processes in these mechanisms. Specifically, PPO-DSIM utilizes the PPO algorithm within deep reinforcement learning, RLPM relies on Q-learning, and GSIM-SPD adjusts the decisions through a policy optimization model, resulting in a much lower computational overhead compared to that of CIM-LP.
For the CIM-LP mechanism, participants with different levels of credibility receive varying utility rewards at each moment, and these differences ultimately influence their cumulative utility rewards. Figure 7a illustrates the effect of varying the initial credibility levels within the same group on the cumulative utility rewards when n = 100 and μ = 0.9. The experimental results indicate that participants with a higher initial credibility achieve faster growth in their cumulative utility rewards. For instance, after 10 experimental rounds, the cumulative utility reward for participants with a credibility of q = 0.8 reaches 550.25, which is 481.85 higher than that for participants with a credibility of q = 0.2. This demonstrates that participants with higher credibility are more likely to receive rewards under the CIM-LP mechanism, motivating them to consistently provide high-quality data in future tasks. Figure 7b further shows the trend in the average credibility across experimental rounds for groups with different initial credibilities, where each group consists of 100 participants. The experimental results indicate that the CIM-LP mechanism can effectively guide different groups to improve their data quality through a dynamic reward system. Notably, groups with a high initial credibility maintain a relatively high average credibility early in the experiment, with their credibility stabilizing over the course of the task. In contrast, groups with a low initial credibility gradually improve their average credibility throughout the experimental rounds. For example, for an initial average credibility of q = 0.2, the value reaches 0.482 in the 10th round, an increase of 0.282 compared to that in the first round. This dynamic trend demonstrates that the CIM-LP mechanism can enhance the overall credibility levels while reducing fluctuations in the data quality.
Figure 8 illustrates the energy consumption performance of the four mechanisms across time slots. Comparing the remaining device energy across time slots reveals the strengths and weaknesses of each mechanism in energy management. The experimental parameters are the same as those in Figure 5a, with 100 participants, and the initial energy of each device is 50. The CIM-LP mechanism shows the best performance, with the slowest rate of energy depletion and residual energy still available after 50 time slots, underscoring its efficiency and long-term stability in managing energy consumption. By prioritizing high-reward tasks and minimizing the energy spent on low-reward ones, CIM-LP demonstrates its capability to optimize the sensing strategies. In contrast, the GSIM-SPD mechanism exhibits rapid early energy consumption, depleting device energy after approximately 30 time slots and rendering it incapable of completing subsequent tasks, which reflects its inefficiency. The PPO-DSIM and RLPM mechanisms achieve a balance between short-term and long-term efficiency through reinforcement learning, but their energy management remains inferior to that of CIM-LP.

5.5. Analysis of Key Parameters

To evaluate the performance stability and adaptability of the CIM-LP mechanism under different parameter settings, this study conducts sensitivity experiments and robustness tests on key parameters, including the LSTM structure parameters, the beta distribution initialization parameters, and the PPO hyperparameters (such as the learning rate, clipping range, and discount factor γ ). The focus is on examining the impact of these parameters on the average participant utility, task completion rate, and model convergence speed.
The number of hidden units in the LSTM network is set to 64, 128, and 256, respectively, to compare their effects on model utility and convergence speed. The results are shown in Table 3.
It can be observed that when the number of hidden units is 128, the model achieves an optimal balance between utility improvements and convergence speed. The initial parameters of the beta distribution are set as follows:
$$(\alpha_0, \beta_0) \in \{(1, 1), (2, 2), (1, 5), (5, 1)\}$$
The impact on the credibility convergence and long-term rewards is tested, and the results are shown in Table 4.
The experiment shows that the initial credibility setting affects the early-stage strategy formation and reward accumulation. Initial values that are too high or too low may lead to bias, while (1,1) or (2,2) settings exhibit greater generality and robustness.
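The exact credibility update rule is not restated here; as one common Beta-Bernoulli reading of such a score, the sketch below assumes credibility is estimated as α/(α+β) and that each quality evaluation increments α on a pass and β on a failure. The class name and the unit increments are assumptions for illustration only.

```python
class BetaCredibility:
    """Hypothetical Beta-Bernoulli credibility tracker: q = alpha / (alpha + beta)."""

    def __init__(self, alpha0=1.0, beta0=1.0):
        self.alpha, self.beta = alpha0, beta0

    @property
    def q(self):
        return self.alpha / (self.alpha + self.beta)

    def update(self, passed_quality_check: bool):
        if passed_quality_check:
            self.alpha += 1.0    # high-quality data raises credibility
        else:
            self.beta += 1.0     # low-quality data lowers credibility


cred = BetaCredibility(alpha0=2.0, beta0=2.0)      # initial credibility 0.50, as in Table 4
for outcome in (True, True, False, True):
    cred.update(outcome)
print(round(cred.q, 3))                             # 0.625 after three passes and one failure
```

Under this reading, the (1, 5) and (5, 1) initializations in Table 4 correspond to pessimistic and optimistic priors, which is consistent with the early-stage bias observed above.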
Under the condition that the other settings remain unchanged, the learning rate, the clipping parameter (clip), and the discount factor γ are individually adjusted to test their impact on the model performance. The results are shown in Table 5.
As can be seen from the table above, the default settings (learning rate of 0.0003, γ = 0.99, clip = 0.1) achieve an optimal balance between utility and stability. A larger γ value helps guide long-term strategy optimization, while an excessively large clip parameter weakens the constraint on the policy updates, leading to instability.
The above experimental results indicate that the CIM-LP mechanism demonstrates good robustness to the key parameters. Within a moderate range of parameter perturbations, the system performance remains stable, with the average utility and task completion rate maintaining high levels. Among these parameters, the LSTM structure settings, the beta distribution's initial parameters, and the clip parameter and discount factor γ in PPO have a significant impact on the model's performance and thus deserve attention and fine-tuning during actual deployment.

6. Conclusions

To address the problem that inadequate decision optimization in mobile crowdsensing (MCS) undermines participants' utility rewards, a credibility-aware incentive mechanism based on LSTM-PPO (CIM-LP) is proposed in this paper. Leveraging the LSTM-PPO incentive model, factors such as the task rewards, participant resources, and credibility are incorporated to determine the optimal sensing duration strategy for each participant, thereby maximizing the utility rewards. Meanwhile, the participants' credibility is updated based on the results of the quality evaluation, which subsequently influences their utility rewards in the next round. CIM-LP effectively mitigates the challenges of low participation and poor data quality caused by insufficient utility rewards in MCS. The experimental results demonstrate that CIM-LP offers significant advantages over the three other comparison mechanisms.
Future research will focus on optimizing the energy consumption model to reduce errors due to device performance and environmental factors. We will also explore methods for improving the data transmission efficiency and reducing the communication overhead. Additionally, privacy protection mechanisms compliant with the data protection regulations will be designed to ensure data security. Finally, future research will aim to optimize the quality assessment algorithms and enhance the mechanism’s adaptability and robustness in dynamic environments.

Author Contributions

Conceptualization: S.M. and H.M.; methodology: S.M. and H.M.; investigation: S.M. and H.M.; supervision: H.M.; writing—original draft preparation: S.M.; writing—review and editing: S.M. and H.M.; funding acquisition: H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Henan Science Foundation for Distinguished Young Scholars (222300420006), the Henan Support Plan for Science and Technology Innovation Team of Universities (21IRTSTHN015), and the Leading Talent in Scientific and Technological Innovation in Zhongyuan (234200510018).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Karaliopoulos, M.; Bakali, E. Optimizing mobile crowdsensing platforms for boundedly rational users. IEEE Trans. Mob. Comput. 2020, 21, 1305–1318.
2. Ali, A.; Qureshi, M.A.; Shiraz, M.; Shamim, A. Mobile crowd sensing based dynamic traffic efficiency framework for urban traffic congestion control. Sustain. Comput. Inform. Syst. 2021, 32, 100608.
3. El Hafyani, H.; Abboud, M.; Zuo, J.; Zeitouni, K.; Taher, Y.; Chaix, B.; Wang, L. Learning the micro-environment from rich trajectories in the context of mobile crowd sensing: Application to air quality monitoring. Geoinformatica 2024, 28, 177–220.
4. Saafi, S.; Hosek, J.; Kolackova, A. Cellular-enabled wearables in public safety networks: State of the art and performance evaluation. In Proceedings of the 2020 12th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), Brno, Czech Republic, 5–7 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 201–207.
5. Nie, J.; Luo, J.; Xiong, Z.; Niyato, D.; Wang, P.; Guizani, M. An incentive mechanism design for socially aware crowdsensing services with incomplete information. IEEE Commun. Mag. 2019, 57, 74–80.
6. Wang, Z.; Lv, C.; Wang, F.Y. A new era of intelligent vehicles and intelligent transportation systems: Digital twins and parallel intelligence. IEEE Trans. Intell. Veh. 2023, 8, 2619–2627.
7. Nie, J.; Luo, J.; Xiong, Z.; Niyato, D.; Wang, P. A stackelberg game approach toward socially-aware incentive mechanisms for mobile crowdsensing. IEEE Trans. Wirel. Commun. 2018, 18, 724–738.
8. Zhan, Y.; Liu, C.H.; Zhao, Y.; Zhang, J.; Tang, J. Free market of multi-leader multi-follower mobile crowdsensing: An incentive mechanism design by deep reinforcement learning. IEEE Trans. Mob. Comput. 2019, 19, 2316–2329.
9. Lu, J.; Zhang, Z.; Wang, J.; Li, R.; Wan, S. A green stackelberg-game incentive mechanism for multi-service exchange in mobile crowdsensing. ACM Trans. Internet Technol. (TOIT) 2021, 22, 1–29.
10. Zhan, Y.; Li, P.; Qu, Z.; Zeng, D.; Guo, S. A learning-based incentive mechanism for federated learning. IEEE Internet Things J. 2020, 7, 6360–6368.
11. Tao, D.; Zhong, S.; Luo, H. Staged incentive and punishment mechanism for mobile crowd sensing. Sensors 2018, 18, 2391.
12. Wang, W.; Gao, H.; Liu, C.H.; Leung, K.K. Credible and energy-aware participant selection with limited task budget for mobile crowd sensing. Ad Hoc Netw. 2016, 43, 56–70.
13. Zhang, J.; Li, X.; Shi, Z.; Zhu, C. A reputation-based and privacy-preserving incentive scheme for mobile crowd sensing: A deep reinforcement learning approach. Wirel. Netw. 2024, 30, 4685–4698.
14. Liu, S.; Liu, Z.; Chen, B.; Pan, X. Construction and Application of Online Learning Resource Incentive Mechanism Driven by Smart Contract. IEEE Access 2024, 2, 37080–37092.
15. Fu, S.; Huang, X.; Liu, L.; Luo, Y. BFCRI: A blockchain-based framework for crowdsourcing with reputation and incentive. IEEE Trans. Cloud Comput. 2022, 11, 2158–2174.
16. Sun, P.; Wang, Z.; Wu, L.; Feng, Y.; Pang, X.; Qi, H.; Wang, Z. Towards personalized privacy-preserving incentive for truth discovery in mobile crowdsensing systems. IEEE Trans. Mob. Comput. 2020, 21, 352–365.
17. Wang, H.; Liu, A.; Xiong, N.N.; Zhang, S.; Wang, T. TVD-RA: A truthful data value discovery-based reverse auction incentive system for mobile crowdsensing. IEEE Internet Things J. 2023, 11, 5826–5839.
18. Shi, Z.; Yang, G.; Gong, X.; He, S.; Chen, J. Quality-aware incentive mechanisms under social influences in data crowdsourcing. IEEE/ACM Trans. Netw. 2021, 30, 176–189.
19. Guo, J.; Ni, Q.; Wu, W.; Du, D.Z. Multi-task diffusion incentive design for mobile crowdsourcing in social networks. IEEE Trans. Mob. Comput. 2023, 23, 5740–5754.
20. Wang, P.; Li, Z.; Long, S.; Wang, J.; Tan, Z.; Liu, H. Recruitment from social networks for the cold start problem in mobile crowdsourcing. IEEE Internet Things J. 2024, 11, 30536–30550.
21. Li, M.; Ma, M.; Wang, L.; Yang, B. Quality-improved and delay-aware incentive mechanism for mobile crowdsensing with social concerns: A stackelberg game approach. IEEE Trans. Comput. Soc. Syst. 2024, 11, 7618–7633.
22. Xu, Y.; Xiao, M.; Zhu, Y.; Wu, J.; Zhang, S.; Zhou, J. AoI-guaranteed incentive mechanism for mobile crowdsensing with freshness concerns. IEEE Trans. Mob. Comput. 2023, 23, 4107–4125.
23. Liu, Y.; Wang, H.; Peng, M.; Guan, J.; Wang, Y. An incentive mechanism for privacy-preserving crowdsensing via deep reinforcement learning. IEEE Internet Things J. 2020, 8, 8616–8631.
24. Zhao, N.; Sun, Y.; Pei, Y.; Niyato, D. Joint sensing and computation incentive mechanism for mobile crowdsensing networks: A multi-agent reinforcement learning approach. IEEE Internet Things J. 2024, 12, 13033–13046.
25. Zhao, Y.; Liu, C.H. Social-aware incentive mechanism for vehicular crowdsensing by deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2020, 22, 2314–2325.
26. Elkabani, I.; Aboo Khachfeh, R.A. Homophily-based link prediction in the facebook online social network: A rough sets approach. J. Intell. Syst. 2015, 24, 491–503.
27. Duan, X.; Zhao, C.; He, S.; Cheng, P.; Zhang, J. Distributed algorithms to compute Walrasian equilibrium in mobile crowdsensing. IEEE Trans. Ind. Electron. 2016, 64, 4048–4057.
28. Abdelhamid, S.; Hassanein, H.S.; Takahara, G. Reputation-aware, trajectory-based recruitment of smart vehicles for public sensing. IEEE Trans. Intell. Transp. Syst. 2017, 19, 1387–1400.
29. Yang, S.; Wu, F.; Tang, S.; Gao, X.; Yang, B.; Chen, G. On designing data quality-aware truth estimation and surplus sharing method for mobile crowdsensing. IEEE J. Sel. Areas Commun. 2017, 35, 832–847.
30. Nguyen, T.; Chen, M.; Szymanski, B.K. Analyzing the proximity and interactions of friends in communities in Gowalla. In Proceedings of the 2013 IEEE 13th International Conference on Data Mining Workshops, Washington, DC, USA, 7–10 December 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1036–1044.
31. Nie, J.; Xiong, Z.; Niyato, D.; Wang, P.; Luo, J. A socially-aware incentive mechanism for mobile crowdsensing service market. In Proceedings of the 2018 IEEE Global Communications Conference (GLOBECOM), Abu Dhabi, United Arab Emirates, 9–13 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–7.
32. Xu, H.; Qiu, X.; Zhang, W.; Liu, K.; Liu, S.; Chen, W. Privacy-preserving incentive mechanism for multi-leader multi-follower IoT-edge computing market: A reinforcement learning approach. J. Syst. Archit. 2021, 114, 101932.
Figure 1. Model of mobile crowdsensing system.
Figure 2. Architecture of LSTM-PPO model.
Figure 3. Specific implementation of CIM-LP mechanism.
Figure 4. Comparison of utility convergence.
Figure 5. The impact of the number of participants on the average participant utility, average monetary utility, average social utility, and task completion rate.
Figure 6. Impact of average social closeness μ on average participant utility, average monetary utility, average social utility, and task completion rate.
Figure 7. An analysis of the impact of and the trend in credibility in the CIM-LP mechanism.
Figure 8. Comparison of energy consumption among different mechanisms across time slots.
Table 1. Main notations.

Notation | Explanation
t, T | Time index and total time duration
$r_j^t$ | Total reward of task j at time t
$a_i$, A | Participant $a_i$ and the set of participants
$x_i^t$, $X^t$ | The sensing strategy of participant $a_i$ at time t and the set of all sensing strategies at time t
$X_{-i}^t$ | The sensing strategy set at time t of all participants except participant $a_i$
$\bar{X}_{-i}^t$ | The predicted sensing strategy set of all participants except participant $a_i$ during time period t
$q_i^t$ | Credibility of participant $a_i$ at time t
$\cdot_j^t$ | Quality detection threshold for task j at time t
$g_{ik}$, G | The social closeness between participant $a_i$ and participant $a_k$ and the social closeness matrix between participants
$u_i^t$, $u_{i,t}^{mon}$, $u_{i,t}^{soc}$ | The total utility, monetary utility, and social utility obtained by participant $a_i$ during time period t
Table 2. Simulation parameters.

Parameter | Value
Total number of participants n | [100, 500]
Average social closeness μ | [0.1, 0.9]
Total time periods T | 50
Number of batches M | 4
Mini-batch size B | 125
Discount factor γ | 0.99
Clip parameter ε | 0.1
Learning rate | 0.0003
Exploration steps | 50
Entropy coefficient | 0.01
Value coefficient | 0.1
LSTM hidden size | 128
LSTM time sequence length | 5
Dropout | 0.2
Table 3. The impact of the number of LSTM hidden units on performance metrics.

Hidden Units | Average Participant Utility | Convergence Epochs | Task Completion Rate (%)
64 | 3.94 | 580 | 88.63
128 | 4.50 | 500 | 90.20
256 | 4.52 | 510 | 89.85
Table 4. The impact of Beta distribution initialization on cumulative utility (first 20 epochs).

Initial Parameters $(\alpha_0, \beta_0)$ | Initial Credibility | Average Cumulative Utility Reward | Average Credibility Variation Trend
(1, 1) | 0.50 | 1985.2 | Steady increase
(2, 2) | 0.50 | 1992.8 | Stable
(1, 5) | 0.17 | 1831.6 | Large fluctuations in the early stage, stabilizing later
(5, 1) | 0.83 | 2088.9 | Leading in the early stage, leveling off in the later stage
Table 5. The impact of key PPO hyperparameters on CIM-LP performance.

Parameter Type | Value | Average Utility | Convergence Epochs | Task Completion Rate (%)
Learning rate | 0.0001 | 4.16 | 630 | 88.45
Learning rate | 0.0003 (default) | 4.50 | 500 | 90.20
Learning rate | 0.0005 | 4.47 | 480 | 89.77
Discount factor γ | 0.90 | 4.01 | 460 | 86.93
Discount factor γ | 0.99 (default) | 4.50 | 500 | 90.20
Clip | 0.05 | 4.33 | 520 | 88.76
Clip | 0.1 (default) | 4.50 | 500 | 90.20
Clip | 0.2 | 4.28 | 510 | 89.34