Article

Crowdsourcing Framework for Security Testing and Verification of Industrial Cyber-Physical Systems

1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
2 Postdoctoral Research Station in Instrument Science and Technology, Guilin University of Electronic Technology, Guilin 541004, China
3 Guangxi Academy of Artificial Intelligence, Nanning 530028, China
4 School of Mathematics and Computing Science, Guilin University of Electronic Technology, Guilin 541004, China
5 China Industrial Control Systems Cyber Emergency Response Team, Beijing 100040, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(1), 79; https://doi.org/10.3390/s26010079
Submission received: 17 November 2025 / Revised: 10 December 2025 / Accepted: 18 December 2025 / Published: 22 December 2025

Abstract

With the widespread deployment of Industrial Cyber-Physical Systems (ICPS), their inherent vulnerabilities have increasingly exposed them to sophisticated cybersecurity threats. Although existing protective mechanisms can block attacks at runtime, the risk of defense failure remains. To proactively evaluate and harden ICPS security, we design a distributed crowdsourced testing platform tailored to the four-layer cloud ICPS architecture—spanning the workshop, factory, enterprise, and external network layers. Building on this architecture, we develop a Distributed Input–Output Testing and Verification Framework (DIOTVF) that models ICPS as systems with spatially separated injection and observation points, and supports controllable communication delays and multithreaded parallel execution. The framework incorporates a dynamic test–task management model, an asynchronous concurrent testing mechanism, and an optional LLM-assisted thread controller, enabling efficient scheduling of large testing workloads under asynchronous network conditions. We implement the proposed framework in a prototype platform and deploy it on a virtualized ICPS testbed with configurable delay characteristics. Through a series of experimental validations, we demonstrate that the proposed framework can improve testing and verification speed by approximately 2.6 times compared to Apache JMeter.

1. Introduction

With the rise of Industry 4.0, Industrial Cyber-Physical Systems (ICPS) are being rapidly adopted by enterprises to enable intelligent, automated, and data-driven manufacturing. By tightly integrating computing resources, network communication, and physical processes, ICPS can monitor and adjust production activities in real time, reducing manual intervention while improving efficiency, resource utilization, and production flexibility [1]. These systems have become a key enabling technology for smart factories and digital transformation [2]. The global ICPS market is projected to reach 177.5 billion USD by 2030, reflecting its accelerating adoption across sectors such as energy, transportation, logistics, and critical infrastructure [3]. However, as ICPS become increasingly interconnected and shift from isolated internal networks to cloud-connected architectures, they also face increased exposure to cyber threats [4,5]. A notable turning point was the Stuxnet attack, which demonstrated that malware could infiltrate industrial control systems, manipulate cyber-physical operations, and cause real physical damage [6,7]. Since then, securing ICPS has become a priority concern for both industry and governments worldwide.
Despite the adoption of many security mechanisms, existing defenses for ICPS remain largely reactive and insufficient in the face of increasingly sophisticated cyberattacks. As ordinary enterprises connect ICPS to the Internet and cloud platforms, incidents such as the Colonial Pipeline ransomware attack in May 2021 have shown how a single breach can disrupt critical operations and cause significant economic loss [8]. The multilayered architecture of ICPS, spanning hardware, software, edge devices, and networks, introduces numerous potential attack surfaces, especially when combined with cloud services. From a risk perspective, ICPS are exposed to multiple categories of threats, including: network- and communication-layer risks such as eavesdropping, replay, and man-in-the-middle attacks; control-logic and protocol-implementation risks such as malformed command injection and logic bypass; data integrity and availability risks such as false data injection and denial-of-service; and configuration and access-control risks such as weak authentication and privilege abuse. Many ICPS still operate with unauthenticated or unencrypted communication protocols to prioritize performance, making them particularly vulnerable to these threats [9]. Moreover, long-term deployment within internal edge networks often results in overlooked vulnerabilities and insufficient proactive security checks. Attackers can exploit these weaknesses, chaining multiple vulnerabilities together to send incorrect control signals, tamper with physical processes, or bring production to a halt [10]. Therefore, beyond traditional perimeter-based defense, there is an urgent need for proactive, vulnerability-aware security solutions that can detect risks early and prevent attacks before they cause physical or operational damage.
To overcome the lack of proactive security verification mechanisms in industrial cyber-physical systems, this paper develops a distributed crowdsourced testing platform specifically designed for cloud-based ICPS environments. Building upon a four-layer architectural model that spans workshop, factory, corporation, and external network layers, this platform enables security testing across heterogeneous hardware, software, and network environments while simulating realistic production conditions. To address core challenges in distributed ICPS testing, particularly the difficulty of post-test result verification due to asynchronous network delays, we propose a Distributed Input–Output Testing and Verification Framework (DIOTVF) that supports controllable communication delays and multithreaded parallel testing. This framework enables testers to generate test events with precise timing while improving verification accuracy and reproducibility. Based on the framework, we further design and implement a multithreaded testing architecture capable of adjustable delay control, significantly improving test verification speed over existing tools such as Apache JMeter. The proposed platform allows enterprises to proactively detect potential vulnerabilities before system deployment or during continuous operations, making it a practical tool for security assurance in industrial intelligence scenarios.
In addition, this paper uses the term crowdsourced testing in a practical engineering sense. The platform is designed to coordinate a large number of geographically distributed testing agents (e.g., in-house QA teams from different plants, external security testers under contract, and automated test nodes deployed in different networks) that pull test tasks from a shared task pool and report their results back to the platform, instead of relying on a single centralized executor. We do not address incentive schemes or workforce management; rather, we abstract the “crowd” as a set of authenticated but mutually untrusted test nodes that can be dynamically added or removed. Under this abstraction, crowdsourcing is necessary mainly for scaling ICPS security testing across many factories, configurations, and network locations, and for increasing test diversity beyond what a single testbed can provide. Moreover, we use “testing” to refer to the process of injecting stimuli into the system under test, and “verification” in a narrower, runtime sense, namely checking whether the responses observed on distributed outputs conform to the expectations encoded in the test templates. We do not claim formal verification of all system requirements. Within this architectural view, the DIOTVF presented in the remainder of this paper should be understood as the core testing engine of the crowdsourced platform: each registered agent runs instances of the sender and verifier components, while the central service maintains the shared test object pool and orchestrates task assignment and aggregation of verification statistics.
The main contributions of this paper are as follows:
  • A distributed ICPS testing and verification architecture tailored for crowdsourced execution. We design a testing and verification architecture for a four-layer cloud-based ICPS that decouples the system under test from the testing crowd via standardized test and verification ports, a shared test object model, and a dynamic test–task management mechanism. This allows heterogeneous testing agents deployed in different networks to exercise the same ICPS deployment without requiring intrusive instrumentation of industrial controllers.
  • A delay-aware input–output testing and verification framework for distributed ICPS. We propose a protocol-agnostic, template-driven testing and verification framework that explicitly models spatially distributed inputs and outputs, maintains a hash-based pool of in-flight test objects, and tolerates random communication delays and out-of-order responses. This design ensures that each sent request can still be matched to its corresponding response and checked even under asynchronous conditions, thereby achieving a 100% valid verification rate in our experiments.
  • A concurrent testing architecture with optional LLM-assisted thread control. We implement asynchronous sender and verifier groups with bucket-locked access to the test object pool, together with an optional LLM-assisted thread controller that adapts the number of sending threads based on observed throughput. On an ICPS-like workload, the resulting framework improves effective verification throughput by up to 2.6 times compared with Apache JMeter, while maintaining the same or higher valid verification rate.
The remaining sections of this paper are organized as follows. Section 2 discusses related work. Section 3 describes the architecture of industrial cyber-physical systems, and Section 4 presents the detailed algorithms designed to implement this architecture. Section 5 evaluates the performance of the DIOTVF, and Section 6 concludes the paper.

2. Related Work

As industrial networked systems grow increasingly complex, ensuring the correctness, reliability, and scalability of testing methodologies has become a critical research challenge. This section reviews the relevant literature from three perspectives: distributed input–output testing, asynchronous concurrent testing, and LLM-assisted software testing [11]. The goal is to establish the necessary technical background and clarify how these areas of research motivate the DIOTVF proposed in this paper. These three perspectives directly correspond to the core technical challenges addressed in this paper: modelling and managing spatially distributed input/output paths, coping with random delays and out-of-order responses in concurrent testing, and exploring how large language models can help coordinate complex test tasks and resource scheduling.

2.1. Distributed Input–Output Testing

Research on distributed input–output testing has advanced rapidly alongside the growth of large-scale distributed services and industrial control systems. DistFuzz [12] introduces a feedback-guided blackbox fuzzing framework that treats requests, injected faults, and event timing as a multi-dimensional input space, using pruned network message sequences as feedback to uncover deep bugs. However, it still depends on carefully designed abstractions of events and messages, which are difficult to generalize across heterogeneous industrial protocols. Mallory [13] applies graybox fuzzing with multi-node execution feedback to perturb message sequences and fault injections, but its reliance on manual annotations and instrumentation increases adoption costs in production environments. MONARCH [14] proposes a scalable multi-node semantic coverage model for distributed file systems, enabling detection of cross-node inconsistencies, though its abstraction is tightly coupled to storage semantics and less suitable for general distributed I/O flows. DiamonT [15] enhances runtime observability by modelling alternative event orderings in asynchronous programs, offering strong diagnostic capabilities but not a systematic active testing workflow.
In distributed learning workloads, D3 [16] performs differential testing by generating model variants and inputs to surface inconsistencies across back-ends and configurations. While effective for deep-learning pipelines, it does not generalize to message-driven industrial protocols. Complementing fuzzing, Mahe et al. [17] present an interaction-based offline runtime verification approach using multi-trace lifeline removal to handle partial observations, which is well suited for log-based analysis but not for coverage-guided active testing of live systems. FieldFuzz [18] focuses on industrial automation runtimes by reverse-engineering the Codesys environment and combining network fuzzing with on-device tracing to uncover critical vulnerabilities. Although it demonstrates the practicality of blackbox distributed I/O fuzzing in industrial settings [4,19], it remains tightly tailored to the Codesys ecosystem and provides limited abstractions for orchestrating distributed I/O testing across diverse industrial networks. Beyond single-cluster fuzzing setups, Jang et al. [20] propose Fuzzing@Home, a distributed fuzzing framework that leverages untrusted heterogeneous clients at scale, while Chen et al. [21] redesign parallel fuzzing as a microservice-based architecture in µFUZZ to decouple coordination and execution; both works illustrate the scalability benefits of distributed test input generation, but they focus on general-purpose binaries rather than industrial protocols with spatially separated I/O. From the runtime-verification perspective, Audrito et al. [22] formalize distributed runtime verification using past-CTL and the field calculus, and Momtaz et al. [23] study monitoring of Signal Temporal Logic properties in distributed cyber-physical systems, showing how formal specifications can be evaluated across distributed inputs and outputs, although these efforts primarily target property checking rather than high-throughput active fuzzing under asynchronous network delays.

2.2. Asynchronous Concurrent Test

Asynchronous and concurrent behavior is a major source of complexity and flakiness in large-scale systems, motivating extensive work on specialized testing and runtime verification. Wolff et al. [24] apply graybox fuzzing to systematically explore thread interleavings with coverage-guided mutation and concurrency-aware feedback, exposing data races and atomicity violations. Although effective for shared-memory programs, it does not address long-latency, message-driven industrial workloads. Zhao et al. [25] propose selectively uniform concurrency testing (SURW), which explores representative equivalence classes of schedules to reduce redundant interleavings, but its abstractions remain tied to thread-level scheduling rather than distributed I/O tasks.
From a runtime verification viewpoint, Ang et al. [26] introduce predictive monitoring for pattern regular languages, enabling online detection of complex concurrent behaviors with controlled overhead. This provides expressive specifications but assumes integrated monitors and does not consider high-rate asynchronous test injection. Ganguly et al. [27] verify metric temporal properties under partial synchrony using bounded clock skew and SMT-based progression, showing how timing constraints can be enforced across nodes, yet the method targets runtime traces rather than orchestrated concurrent testing. Bonakdarpour et al. [28] develop decentralized, crash-resilient runtime verification that tolerates failures and delays, improving robustness in asynchronous settings, though the focus is on monitor reliability rather than scalable test generation. At the test-management level, Tahir et al. [29] provide a multivocal review of test flakiness, identifying concurrency-induced nondeterminism as a dominant cause and noting that industry still relies heavily on reruns instead of systematic modelling. Parry et al. [30] enhance rerun-based flaky-test detection using machine learning (CANNIER), yet this work continues to treat the executor as a black box without explicit control over asynchronous send/receive operations or large pools of concurrent test tasks. Therefore, recent research has started to direct concurrency fuzzing towards specific bug classes. Yuan et al. [31] present DDRace, a directed graybox fuzzer for discovering concurrency use-after-free vulnerabilities in Linux device drivers, Ito et al. [32] design Schfuzz to detect concurrency bugs via feedback-guided exploration of thread interleavings, and Ito et al. [33] further extend this line with race-directed fuzzing to focus testing effort on suspected race locations. Jiang et al. [34] complement these approaches with CONZZER, a context-sensitive and directional concurrency fuzzer specialized for device drivers, illustrating how schedule-sensitive exploration strategies can systematically expose concurrency defects but still operate at the program level rather than coordinating distributed I/O endpoints.

2.3. LLM-Assisted Software Testing

With the rapid advancement of large language models, extensive research now examines how LLMs can support or automate software testing. Wang et al. [35] survey over one hundred studies across testing tasks, model types, prompting strategies, and auxiliary techniques, highlighting persistent challenges in reliability, controllability, and security of LLM-generated tests. Li et al. [36] benchmark multiple LLMs on diverse testing tasks and show that, while LLMs can outperform traditional tools in certain cases, they still exhibit hallucinated assertions and unstable performance across projects, limiting their use in safety-critical contexts. Rehan et al. [37] propose an LLM-based pipeline that generates scalable test suites using prompt engineering and post-processing, demonstrating good results for conventional application code, though the approach remains focused on code-centric unit and functional testing rather than distributed or timing-sensitive interactions. Celik and Mahmoud [38] review LLM-driven test generation methods and identify research gaps such as tighter integration with runtime feedback and domain-specific models, noting that structured test representations could significantly improve LLM reliability.
Yuan et al. [39] present ChatTester, which iteratively refines LLM-generated unit tests to improve compilation and assertion correctness, but its design targets single-function testing and does not address concurrent or distributed behaviors. Li et al. [40] study LLM-generated web-form tests across 146 forms, showing that richer context and carefully crafted prompts substantially improve submission success rates, though the domain remains confined to web front ends. Expanding beyond task-specific work, Zhao et al. [41] introduce an AI-augmented QA framework that combines NLP-based requirement analysis, ML-based test generation and prioritization, and deep-learning-based anomaly detection within a PyTest-BDD workflow, illustrating how AI components can be integrated throughout a unified testing pipeline. Building on unit-level evaluations, Schäfer et al. [42] conduct a large-scale empirical study of LLM-based unit test generation for JavaScript libraries, while Dakhel et al. [43] introduce MuTAP, which augments LLM-generated tests with mutation testing to improve fault-detection effectiveness. Kang et al. [44] propose Libro, demonstrating that LLMs can act as few-shot testers that synthesize bug-reproducing tests directly from natural-language bug reports, and Meng et al. [45] integrate LLMs into protocol fuzzing by extracting protocol grammars and stateful message sequences for coverage-guided exploration of network services. Readers can refer to these works [46,47] for more discussion of the risks caused by LLM-driven methods.
Across the above lines of work, there is room for improvement in both specificity and universality. Approaches such as DistFuzz, Mallory, MONARCH, and FieldFuzz achieve strong bug-finding capability in their targeted ecosystems, but they are often tightly coupled to particular abstractions, protocol stacks, or runtime environments and may require substantial manual annotations or instrumentation effort to deploy in new industrial settings. On the other hand, log-based analysis and formal runtime verification frameworks (e.g., DiamonT, Mahe et al., Audrito et al., Momtaz et al.) offer powerful post-hoc guarantees, yet they do not provide a high-throughput, active testing workflow for live, delay-prone industrial networks. Our work complements these directions by focusing on a protocol-agnostic, template-driven active testing and verification framework that treats most of the system under test as a black box apart from configurable test and verification ports, explicitly models spatially distributed input/output paths via a shared test object pool, and scales to multithreaded crowdsourced execution without requiring special-purpose instrumentation of industrial controllers or industrial middleware.

3. The Design

3.1. System Architecture

The test architecture of the crowdsourced testing platform is aligned with the layered structure of cloud-based industrial cyber-physical systems, which is shown in Figure 1. It consists of four interconnected layers: the workshop layer, factory layer, corporation layer, and external region layer. These layers reflect different enterprise functions and jointly integrate physical devices, software systems, network communication, and cloud services. Moreover, each layer interacts with the others to emulate realistic operational workflows: the workshop layer connects to physical equipment and collects data; the factory layer coordinates device operation and scheduling; the corporation layer manages resources across multiple factories; and the external region layer supports cross-enterprise data exchange through cloud platforms.
In a crowdsourced deployment, each workshop, factory, or external site can host one or more testing agents that register with the platform and periodically pull test objects from a shared test object pool. The proposed dynamic test–task management model and asynchronous concurrent testing model then execute these tasks across all agents, while the central platform aggregates testing and verification results for further analysis. In other words, the crowdsourced platform provides global coordination, task distribution, and result collection, whereas the DIOTVF described below specifies the concrete mechanisms used by agents and the central service to generate, send, match, and verify test payloads.
From the perspective of the crowdsourced testing platform, industrial systems often exhibit mismatches between the locations where input data are injected and where output data are collected, due to the distributed and complex nature of their operational processes. To address the testing challenges arising from this input–output asymmetry, we propose a distributed input–output testing and verification framework, illustrated in Figure 2. In this architecture, the system under test is assumed to have geographically or logically separated input and output points, with internal devices interconnected through the industrial network to form a complete data-flow path. During testing, the framework sends test payloads over the Internet to designated input-side devices within the industrial system, while continuously retrieving output data from devices located elsewhere in the network. After collecting the output, the testing tool validates the returned data against the original inputs and performs statistical analysis to evaluate correctness and performance. To support this process, we introduce two key models: a dynamic test–task management model and an asynchronous concurrent testing model. Together, these models enable reliable dispatching of test data and accurate verification of results across distributed locations, ensuring that input–output testing remains effective in complex, heterogeneous industrial environments.

3.2. Dynamic Test–Task Management Model

To effectively manage distributed testing tasks, this paper proposes a dynamic test–task management model, which is used to implement test payload generation, the addition of test–task objects, and the dynamic management of the test object pool. The model’s workflows are shown in Figure 3. The dynamic test–task management model consists of three components: the test–task constructor, the test object pool, and the timeout checker. By integrating these three functions within the dynamic test–task management model, the entire execution of the testing tasks can be effectively supported.

3.2.1. Test Payload Generator

The test–task constructor includes several key functions: data parsing for filling the test template, generation of partial test payloads for test mutations, merging mutated payloads into new test objects, and setting the basic attributes of the new test objects, such as the test targets, identifiers, and the last operation time. To support the proper functioning of the test–task constructor, this paper defines the test template, denoted as tmpl, which can be expressed as Equation (1).
tmpl = { head_tmpl, (rep_1, repfill_1), ..., (rep_n, repfill_n), rear_tmpl }        (1)
where head_tmpl represents the header data of the template, with a length greater than or equal to 0; rear_tmpl represents the rear data of the template, with a length greater than or equal to 0; rep_i denotes the data in the template that needs to be mutated and replaced; and repfill_i represents the payload data after the mutation replacement in the template. To ensure the proper operation of the template replacement mechanism, i should be at least 1, and the length of repfill_i should be greater than or equal to 0. Additionally, for effective marker replacement, rep_i should be longer than the replacement marker, and rep_i can be expressed as Equation (2).
rep_i = { repMark, repID, repLen, repMark }        (2)
where repMark represents the identifier region to be replaced, used to mark its start and end; repID is the identifier for the replacement region; and repLen is the length of the data to be generated for the replacement.
After the test user completes the test template creation, the test payload constructor needs to parse the information marked by rep_i. For this purpose, this paper employs a non-deterministic finite automaton (NFA) approach for template matching and positioning. This model efficiently identifies and replaces the necessary parts of the template, ensuring that dynamically mutated test payloads are correctly processed during the testing process. A non-deterministic finite automaton is typically defined by the five-tuple shown in Equation (3).
A = (Q, Σ, δ, q_0, F)        (3)
where Q is the set of states (including states for matching head_tmpl, rep_i, and rear_tmpl); Σ is the input alphabet (the set of possible characters or symbols in the template); δ is the transition function, which defines how the NFA moves between states based on the input symbols; q_0 is the initial state, where the matching begins (specifically, matching head_tmpl); and F is the set of final states, indicating a successful match of the entire template. Thus, the matching process can be formally expressed as Equation (4).
Match(tmpl) = NFA(repMark, repID, repLen, repMark)        (4)
In this expression, the process first matches repMark, followed by repID and repLen, and finally matches repMark again. During this matching process, the NFA transitions between different states, ultimately reaching a final state when the match is successful. Once the NFA successfully identifies the entire template in the input data, the next step is to perform the replacement. Before the template mutation payload is replaced, the test data to be substituted must first be generated. During the generation process, the replacement data, denoted as td, must satisfy Equation (5).
td = Gen(repLen)        (5)
where Gen denotes the data generation function. In practical use, the data are generated according to the repLen value extracted from the NFA matching results, ensuring that test data of the required length are produced. After generating td, it is substituted into the corresponding position of tmpl, resulting in the mutated test payload, denoted as pl, which must satisfy Equation (6).
pl = { head_tmpl, (td_1, repfill_1), ..., (td_n, repfill_n), rear_tmpl }        (6)
where head_tmpl represents the header data of the template; rear_tmpl represents the rear data of the template; repfill_i is the original padding data for the corresponding replacement section; and td_i is the newly generated mutated replacement data.
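As a concrete illustration, the following Python sketch shows one way the template mechanism could be realized. It approximates the NFA-based matcher with a compiled regular expression (itself executed by a finite automaton) and assumes a hypothetical marker syntax ##repID:repLen## that is not prescribed by the paper.
```python
import random
import re
import string

# Hypothetical marker syntax: ##repID:repLen## stands in for
# {repMark, repID, repLen, repMark} from Equation (2).
REP_PATTERN = re.compile(r"##(?P<rep_id>\w+):(?P<rep_len>\d+)##")

def gen(rep_len: int) -> str:
    """Gen(repLen): produce random replacement data of the requested length."""
    return "".join(random.choices(string.ascii_letters + string.digits, k=rep_len))

def build_payload(tmpl: str) -> str:
    """Replace every rep_i region with freshly generated data while leaving
    head_tmpl, repfill_i, and rear_tmpl untouched (Equations (1) and (6))."""
    return REP_PATTERN.sub(lambda m: gen(int(m.group("rep_len"))), tmpl)

if __name__ == "__main__":
    template = "SET reg=##r1:8##;VAL=##v1:16##;END"
    print(build_payload(template))   # e.g. SET reg=aB3xQ9Lm;VAL=...;END
```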

3.2.2. New Test Object

After generating the test payload, a test object needs to be constructed to support the execution of the testing tasks. The test object is denoted as to and should satisfy Equation (7).
to = { toid, time, target, pl }        (7)
where toid is the identifier of the test object, time represents the last operation time of the test object, target refers to the test target defined for the test object, and pl is the test payload. The identifier of the test object, toid, must satisfy Equation (8).
toid = ltoid + 1        (8)
where ltoid represents the ID of the last constructed test object, which should be of a numeric type that supports atomic operations to avoid the same toid being generated during concurrent multithreaded operations. Additionally, the test target defined for the test object, target, must satisfy Equation (9).
target = { testTarget, verTarget }        (9)
where testTarget refers to the location information to which the test payload needs to be sent, while verTarget refers to the test location information for retrieving the test results.
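A minimal sketch of the test object structure and atomic identifier allocation is shown below; the Python class and field names are hypothetical, and only the abstract fields of Equations (7) to (9) come from the paper.
```python
import itertools
import threading
import time
from dataclasses import dataclass, field

@dataclass
class TestTarget:
    test_target: str   # testTarget: where the payload is injected
    ver_target: str    # verTarget: where the response is collected

@dataclass
class TestObject:
    toid: int                # unique identifier (Equation (8))
    target: TestTarget       # {testTarget, verTarget} (Equation (9))
    payload: bytes           # pl, the mutated test payload
    last_op_time: float = field(default_factory=time.time)   # the "time" field

class TestObjectFactory:
    """Allocates toid = ltoid + 1 under a lock so that concurrent threads
    never produce the same identifier."""
    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def new_object(self, target: TestTarget, payload: bytes) -> TestObject:
        with self._lock:                 # atomic increment of the shared counter
            toid = next(self._counter)
        return TestObject(toid=toid, target=target, payload=payload)
```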

3.2.3. Test Object Pool Addition

In the dynamic management of the test object pool, the pool contains the set of test objects that are currently being processed. Therefore, in this paper, the test object pool is denoted as top, which consists of a set of test objects to and can be expressed as Equation (10).
top = { to_1, to_2, ..., to_n }        (10)
where n represents the number of test objects currently present in the test object pool. Since the test object pool can be operated on by multiple threads concurrently, it must support thread-safe operations. To improve the performance of multithreaded operations, and given that the test object identifier is used as the primary key, the pool is implemented using a basic bucket-lock scheme. In this case, the set of bucket locks can be defined as Equation (11).
bucket = diff({ H(to_1), H(to_2), ..., H(to_n) })        (11)
where H is the hash function used to obtain the hash value of the operated object, and diff merges identical elements within the set (i.e., removes duplicate hash values). Therefore, the total number of buckets is always less than or equal to the number of elements in the test object pool. Once the bucket locks are constructed, each thread-safe addition operation locks the corresponding bucket and releases the lock after the addition completes. Read operations, in contrast, use a lock-free approach.
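The bucket-lock scheme could be sketched as follows; the bucket count, lock granularity, and the TestObject class from the previous sketch are illustrative assumptions.
```python
import threading
from typing import Dict, List, Optional

class TestObjectPool:
    """Hash-based pool with bucket locks: a writer locks only the bucket
    selected by hash(toid) % num_buckets, so operations on different buckets
    do not contend (Equations (10) and (11))."""

    def __init__(self, num_buckets: int = 64) -> None:
        self._buckets: List[Dict[int, "TestObject"]] = [dict() for _ in range(num_buckets)]
        self._locks = [threading.Lock() for _ in range(num_buckets)]

    def _index(self, toid: int) -> int:
        return hash(toid) % len(self._buckets)

    def add(self, obj: "TestObject") -> None:
        i = self._index(obj.toid)
        with self._locks[i]:
            self._buckets[i][obj.toid] = obj

    def pop(self, toid: int) -> Optional["TestObject"]:
        """Atomically remove and return the object so it is matched only once."""
        i = self._index(toid)
        with self._locks[i]:
            return self._buckets[i].pop(toid, None)

    def get(self, toid: int) -> Optional["TestObject"]:
        # Read path stays lock-free, mirroring the design in the text.
        return self._buckets[self._index(toid)].get(toid)
```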

3.2.4. Remove Expired Test Objects

In the process of removing expired test objects, the timeout checker dynamically scans the elements in the test object pool and removes test objects that have not received matching response data within a prolonged period after being sent. By periodically removing expired test objects, the storage pressure on the test object pool is reduced and its retrieval efficiency is improved. After the timeout checker completes a check, expired test objects are removed based on the time difference. The set of removed test objects, denoted as RTO, is defined as Equation (12).
RTO = { rto_1, rto_2, ..., rto_n }        (12)
where each rto ∈ RTO satisfies st − rto.time > timeoutSpan; that is, every removed test object has exceeded the defined timeout period relative to the current check time st, and n is the number of expired test objects that were removed.
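A possible realization of the timeout checker is sketched below; it reuses the pool sketch above and reaches into its buckets directly for brevity (in a real implementation this would more naturally be a method of the pool).
```python
import time
from typing import List

def remove_expired(pool: "TestObjectPool", timeout_span: float) -> List["TestObject"]:
    """Drop every object whose last operation is older than timeoutSpan,
    i.e. st - rto.time > timeoutSpan (Equation (12))."""
    st = time.time()                             # current check time
    removed = []
    for bucket, lock in zip(pool._buckets, pool._locks):
        with lock:                               # lock one bucket at a time
            expired = [toid for toid, obj in bucket.items()
                       if st - obj.last_op_time > timeout_span]
            for toid in expired:
                removed.append(bucket.pop(toid))
    return removed
```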

3.3. Asynchronous Concurrent Testing Model

After constructing the test tasks and adding them to the test object pool, the DIOTVF needs to implement the specific testing tasks using asynchronous concurrent testing. Therefore, this paper proposes the asynchronous concurrent testing model, with the overall working principle illustrated in Figure 4. In the asynchronous concurrent testing model, based on the test object pool provided by the dynamic test–task management model, a test sender group, a test verifier group, and a large language model-assisted (LLM-assisted) thread controller are introduced [46]. By integrating these key components, the model can effectively perform testing on distributed input–output objects.

3.3.1. Test Sender Group

The test sender group is responsible for concurrently executing test tasks according to the testing configurations defined within the test object pool. In this paper, the test sender group is denoted as tsg, which consists of a set of test senders, defined as Equation (13).
tsg = { ts_1, ts_2, ..., ts_n }        (13)
where ts_i represents a single test sender, and n denotes the total number of test senders, which also corresponds to the number of sending threads in the implementation. After the test sender group is initialized, each test sender extracts a test object from the test object pool to initiate its testing operation. Since the sending process is performed asynchronously and concurrently, the test senders do not follow a fixed order when accessing test objects. To describe this non-sequential access relationship, a mapping is defined as Equation (14).
f : tsg → top        (14)
where f(ts_i) = to_j indicates that the test sender ts_i retrieves the test object to_j from the test object pool for testing. This mapping satisfies the uniqueness constraint shown in Equation (15).
∀ i ≠ k, f(ts_i) ≠ f(ts_k)        (15)
which ensures that no two test senders retrieve the same test object. In addition, the access order of test objects is not constrained and can be expressed as Equation (16).
f(ts_i) = to_π(i), π ∈ S_m        (16)
where π is a permutation on {1, 2, ..., m} representing the non-sequential assignment of test objects. Therefore, the access behavior of the test sender group exhibits both non-sequentiality and uniqueness. After extracting a test object, each test sender reads the corresponding test payload and configures the transmission target according to the defined testing objectives. Let the payload of the test object to_j be denoted as pl_j and its transmission target as target_j. The transmission target configuration is then expressed as Equation (17).
g(ts_i) = pl_j, tgt(ts_i) = target_j, where f(ts_i) = to_j        (17)
Once payload retrieval and target configuration are completed, the test sender performs the sending operation and collects statistical information regarding the transmission. Let σ(·) denote the sending statistics function, which records parameters such as transmission time, data size, and send count. The sending statistics can then be expressed as Equation (18).
stat(ts_i) = σ(pl_j, target_j)        (18)
Hence, based on the non-sequential and unique extraction of test objects, each test sender completes a full operational cycle of payload reading, target configuration, and transmission statistics collection, providing quantitative support for subsequent verification and performance analysis.
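The sender group could be sketched as below; the TCP transport, the "host:port" encoding of testTarget, and the statistics fields are illustrative assumptions rather than the paper's concrete implementation, and the sketch builds on the TestObject and TestTarget classes above.
```python
import concurrent.futures
import random
import socket
import time
from typing import Dict, List

def send_payload(payload: bytes, target: "TestTarget") -> Dict[str, float]:
    """Transmit one payload to testTarget (assumed "host:port") and return
    per-send statistics, i.e. sigma(pl_j, target_j) from Equation (18)."""
    host, port = target.test_target.split(":")
    start = time.time()
    with socket.create_connection((host, int(port)), timeout=5) as sock:
        sock.sendall(payload)
    return {"bytes": float(len(payload)), "elapsed": time.time() - start}

def run_sender_group(objects: List["TestObject"], num_threads: int) -> List[Dict[str, float]]:
    """Each sender thread takes a distinct test object in no fixed order
    (the permutation of Equations (14)-(16)) and records its statistics."""
    random.shuffle(objects)                  # non-sequential but unique assignment
    stats: List[Dict[str, float]] = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(send_payload, o.payload, o.target) for o in objects]
        for fut in concurrent.futures.as_completed(futures):
            stats.append(fut.result())
    return stats
```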

3.3.2. Test Verifier Group

The test verifier group is responsible for receiving, matching, and validating the response data triggered by the payloads that the test sender group transmits, thereby determining the execution results of the test tasks. In this paper, the test verifier group is denoted as tvg, which consists of a set of test verifiers defined as Equation (19).
tvg = { tv_1, tv_2, ..., tv_n }        (19)
where tv_i represents an individual test verifier, and n denotes the number of verifiers, corresponding to the number of concurrent verification threads in the system. Once the test verifier group is activated, each test verifier receives response data from the system under test through the verification port and matches it with the corresponding test objects stored in the test object pool. Let the set of received response data be represented as rd = { rd_1, rd_2, ..., rd_m }; the verification mapping can then be defined as Equation (20).
h : tvg → rd        (20)
where h(tv_i) = rd_j indicates that the test verifier tv_i is responsible for validating the response data rd_j. To correctly associate each received response with its corresponding test object, a data matching function is defined as Equation (21).
match(rd_j) = to_k, where h(tv_i) = rd_j        (21)
This means that the verifier establishes the correspondence between the received response data and its associated test object through the matching function in order to complete the validation task. After a successful match, the verifier performs object locking and removal operations to prevent duplicate verification. Let the locking and removal operations be denoted by lock(·) and remove(·), respectively, as shown in Equation (22).
lock(to_k), remove(to_k), to_k ∈ top        (22)
where lock(·) marks the test object currently under verification, and remove(·) indicates that the verified test object is removed from the test object pool after the verification process completes. Finally, to quantitatively analyze the verification performance, let ρ(·) denote the verification statistics function, which records metrics such as verification count, matching success rate, and data loss rate. The statistical relation can thus be expressed as Equation (23).
stat(tv_i) = ρ(rd_j, to_k)        (23)
Accordingly, the test verifier group completes the entire verification workflow, including response reception, data matching, object locking, and statistical analysis, thereby enabling asynchronous validation and performance evaluation of test–task execution results.
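A simplified verifier loop might look as follows; the response framing (an 8-byte identifier prefix) and the echo-style oracle are assumptions used only to make the matching and lock-and-remove steps concrete.
```python
import socket
import threading
from typing import Dict, Tuple

def parse_response(data: bytes) -> Tuple[int, bytes]:
    """Assumed framing: an 8-byte big-endian toid followed by the response body."""
    return int.from_bytes(data[:8], "big"), data[8:]

def expected_output(obj: "TestObject") -> bytes:
    """Placeholder oracle; in DIOTVF the expectation is encoded in the template."""
    return obj.payload

def verifier_worker(pool: "TestObjectPool", ver_sock: socket.socket,
                    stats: Dict[str, int], stats_lock: threading.Lock) -> None:
    """Receive responses on the verification port, match each one to its test
    object via the toid key, and remove the object so it is verified only once."""
    while True:
        data = ver_sock.recv(65536)
        if not data:
            break
        toid, response = parse_response(data)
        obj = pool.pop(toid)                     # lock-and-remove in one step
        with stats_lock:
            if obj is not None and response == expected_output(obj):
                stats["verified"] = stats.get("verified", 0) + 1
            else:
                stats["mismatched"] = stats.get("mismatched", 0) + 1
```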

3.3.3. LLM-Assisted Thread Controller

The LLM-assisted thread controller is designed as an optional auxiliary module that dynamically adjusts the number of concurrent threads in the test sender group based on real-time performance data collected during the testing process. Its primary objective is to balance system load and testing throughput while maintaining test accuracy, thereby improving overall efficiency and resource utilization. In the asynchronous concurrent testing model, the execution speeds of the test sender group and the test verifier group may differ due to factors such as network latency, device response time, and system resource consumption. To address these variations, an LLM-based thread control mechanism is introduced to perform adaptive analysis and adjustment of the system state. Similar to Deng et al.'s work [48], we introduce a data selection technique to make sure the LLM-based mechanism is trustworthy. Let the test speed data set be denoted as spd = { s_1, s_2, ..., s_t }, and the thread control function as θ(·). The mapping relationship for thread adjustment can be defined as Equation (24).
θ : spd → n        (24)
where n represents the adjusted number of testing threads, and the adjustment strategy of θ(·) is inferred by the LLM based on performance prompts provided as input. The controller first collects statistical data such as testing throughput and resource utilization, then converts these indicators into prompt information and submits them to the LLM endpoint. The LLM analyzes the current system load and performance state and generates thread adjustment recommendations. The controller then updates the number of threads in the test sender group according to these recommendations, achieving adaptive scheduling of testing workloads. Let φ(·) denote the thread adjustment recommendation function generated by the LLM. The final adjustment relationship can be expressed as Equation (25).
n = φ(θ(spd))        (25)
Through this mechanism, the system can automatically optimize the concurrency scale based on model inference results under scenarios with varying testing speeds or complex resource competition, thereby enabling dynamic self-scheduling and intelligent optimization of testing tasks. It should be noted that the LLM-assisted thread controller is an optional component and may be disabled in environments with sufficient resources or small-scale testing tasks to reduce computational overhead.
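The control loop around the LLM could be sketched as below. The endpoint URL, request and response JSON fields, and the default bounds are hypothetical; the clamping and step-size rules reflect the safety rules described later in Section 4.2.
```python
import json
import urllib.request

SYSTEM_INSTRUCTION = (
    "You tune the number of sender threads of a load-testing tool. "
    "Given the current thread count and throughput, reply with a single integer."
)

def ask_llm(endpoint: str, thread_count: int, throughput: float) -> int:
    """Send one performance prompt to a hypothetical JSON endpoint and read back
    the recommended thread count."""
    prompt = f"Current thread count = {thread_count}, current throughput = {throughput:.2f}."
    body = json.dumps({"system": SYSTEM_INSTRUCTION, "prompt": prompt}).encode()
    req = urllib.request.Request(endpoint, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return int(json.loads(resp.read())["recommended_threads"])

def adjust_threads(endpoint: str, n: int, throughput: float,
                   n_min: int = 2, n_max: int = 64, max_step: int = 4) -> int:
    """One control step: query the LLM, clamp the recommendation to the allowed
    range, and limit the per-window change to max_step threads."""
    try:
        recommended = ask_llm(endpoint, n, throughput)
    except (OSError, ValueError, KeyError):
        return n                              # degrade to fixed-thread mode
    recommended = max(n_min, min(n_max, recommended))
    if recommended > n + max_step:
        recommended = n + max_step
    elif recommended < n - max_step:
        recommended = n - max_step
    return recommended
```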

4. The Implementation

To realize the practical implementation of the DIOTVF described in the previous section, this paper proposes a systematic method that translates the conceptual models into executable algorithmic procedures. The proposed method focuses on enabling automated, scalable, and concurrent testing of industrial systems with distributed data input and output positions. It integrates the dynamic test–task management model and the asynchronous concurrent testing model into a cohesive operational workflow, covering test payload generation, concurrent sending, timeout management, and asynchronous verification. Each algorithm is designed to handle a specific stage of the distributed testing process, while collectively ensuring correctness, efficiency, and adaptability. The overall methodology aims to support large-scale industrial software testing where spatially separated devices and asynchronous data flows can be tested under realistic network conditions. The detailed algorithmic procedures are presented as follows.

4.1. Test Payload Generation

To enable large-scale and automated testing of distributed input–output systems, this paper first proposes a template-driven test payload generation algorithm, which is shown in Algorithm 1. The approach employs a non-deterministic finite automaton (NFA) to locate and replace variable regions within the test template, ensuring that the generated payloads conform to the expected structural and semantic rules of the target industrial protocol. Through dynamic filling and mutation of template parameters, diversified test payloads are generated to comprehensively stimulate different states of the system under test. After each payload is constructed, the test–task constructor encapsulates it into a test object, assigns a unique identifier, and inserts it into the test object pool in a thread-safe manner for subsequent concurrent testing.
Algorithm 1 Test payload generating
Input: Test template tmpl
Output: Test payload pl and constructed test object to
1: Initialize non-deterministic finite automaton A = (Q, Σ, δ, q_0, F) for template parsing
2: for each replacement region rep_i in tmpl do
3:     Locate repMark, repID, repLen via A transitions
4:     Generate replacement data td_i = Gen(repLen)
5:     Substitute rep_i with td_i and append padding repfill_i
6: Assemble final payload pl
7: Construct new test object to with atomic toid increment
8: Add to into test object pool top through thread-safe addition
9: return to

4.2. Concurrent Test Payload Sending

After the test payloads are generated and stored in the test object pool, the system proceeds to concurrent test execution. The concurrent test payload sending algorithm is illustrated in Algorithm 2. The proposed asynchronous concurrent testing model enables multiple sender threads to operate in parallel, where each test sender retrieves a unique test object from the pool without a fixed access order. This non-sequential and thread-safe mechanism ensures that no two senders process the same object simultaneously. Each sender reads the corresponding payload and target configuration, performs transmission through the testing port, and logs transmission statistics for further analysis.
To maintain an optimal balance between throughput and system load, an optional LLM-assisted thread controller is employed. Instead of relying on fixed hand-crafted rules, the controller uses the current throughput and thread count as feedback to decide whether the number of sending threads should be increased, decreased, or kept unchanged. The prompt template used in the LLM-assisted thread controller is shown in Figure 5. In practice, the sending procedure is organized into consecutive time windows. At the end of each window, the framework aggregates the transmission statistics from all sender threads, computes the current throughput S for that window, and invokes the LLM controller once. All invocations are issued within the same LLM session so that the model can see the sequence of previous “Current thread count = {n}, current throughput = {S}.” prompts as the historical performance records mentioned in the system instruction. For each window, the controller forms a short user prompt that encodes the current thread count and the measured throughput, sends it together with the fixed system instruction, and receives a recommended new thread count for the next window. Before applying this recommendation, the framework enforces several simple safety rules: the new thread count must stay within a configured minimum and maximum range; the difference between the previous and the new value is limited to a small step size to avoid oscillation. To avoid blocking normal packet transmission, the LLM controller is only triggered after a window completes, and the framework degrades gracefully to a fixed-thread mode when the LLM endpoint is unavailable.
Algorithm 2 Concurrent test payload sending
Input: Test object pool top = {to_1, to_2, ..., to_m}, sender group tsg = {ts_1, ts_2, ..., ts_n}
Output: Transmission statistics set σ = {σ_1, σ_2, ..., σ_n}
1: for all senders ts_i in tsg in parallel do
2:     Randomly select to_j from top where f(ts_i) = to_j and to_j is not locked
3:     Extract payload pl_j and transmission target target_j
4:     Perform transmission Transmit(pl_j, target_j) via the transmitter
5:     Record send statistics σ_i = σ(pl_j, target_j)
6: Aggregate per-sender statistics over the current time window to compute the current throughput S
7: if the LLM controller is enabled and the adjustment condition is satisfied then
8:     Build the user prompt: "Current thread count = {n}, current throughput = {S}."
9:     Send the fixed system instruction and the user prompt to the LLM service
10:    Receive the recommended thread count n′
11:    Clamp n′ into the configured minimum and maximum range
12:    Limit the difference between n and n′ to a small step size
13:    Resize sender group tsg to use n′ threads in the next window
14:    Update n ← n′
15: return σ
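Building on the sender and controller sketches above, the window-based invocation of the LLM controller in Algorithm 2 might be organized as follows; the window size and the byte-based throughput definition are illustrative choices.
```python
import time
from typing import List

def run_windows(objects: List["TestObject"], endpoint: str,
                window_size: int = 500, n: int = 4) -> None:
    """Send test objects in consecutive windows (Algorithm 2): after each window,
    aggregate throughput and let the LLM controller choose the next thread count."""
    for start in range(0, len(objects), window_size):
        window = objects[start:start + window_size]
        t0 = time.time()
        stats = run_sender_group(window, num_threads=n)          # sender sketch above
        elapsed = max(time.time() - t0, 1e-6)
        throughput = sum(s["bytes"] for s in stats) / elapsed     # bytes per second
        n = adjust_threads(endpoint, n, throughput)               # clamped, step-limited
```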

4.3. Timeout Test Payload Management

During large-scale testing, some payloads may remain unmatched for extended periods due to network latency, device response failures, or transmission errors. Timeout test payload management is therefore necessary; the corresponding procedure is shown in Algorithm 3. To prevent memory overflow and reduce retrieval overhead, a timeout checker continuously monitors the test object pool. If the elapsed time since an object's last operation exceeds the predefined timeout threshold, that object is marked as expired and safely removed. This mechanism ensures efficient memory utilization and maintains high system throughput under long-duration distributed testing conditions.
Algorithm 3 Timeout test payload management
Input: Test object pool top, system time st, timeout threshold timeoutSpan
Output: Removed expired test objects RTO = {rto_1, rto_2, ..., rto_k}
1: Initialize RTO ← ∅
2: for each to_i in top do
3:     if st − to_i.time > timeoutSpan then
4:         Lock to_i
5:         Remove to_i from top
6:         Append to_i to RTO
7: Update pool metrics to maintain load balance
8: return RTO

4.4. Concurrent Retrieval and Asynchronous Analysis of Test Results

Once the test payloads have been transmitted, the verification layer asynchronously collects response data returned from different locations of the industrial system. Therefore, concurrent retrieval and asynchronous analysis of the test results are required, with the algorithm shown in Algorithm 4. Each test verifier thread retrieves one piece of response data, performs matching with the corresponding test object, and verifies whether the returned data satisfies the expected behavior. Upon successful validation, the verifier locks and removes the associated object from the pool to avoid duplicate matching and then updates the verification statistics. This asynchronous design enables the real-time evaluation of distributed test results while maintaining high throughput and low coupling between sending and verification threads. Aggregated performance indicators—such as matching success rate, latency, and data loss ratio—are computed to provide quantitative metrics for subsequent performance analysis.
Algorithm 4 Concurrent retrieval and asynchronous analysis of test results
Input: Response data set rd = {rd_1, rd_2, ..., rd_m}, verifier group tvg = {tv_1, tv_2, ..., tv_n}, test object pool top
Output: Verification statistics set ρ = {ρ_1, ρ_2, ..., ρ_n}
1: for all verifiers tv_i in tvg in parallel do
2:     Receive response rd_j from the verification port
3:     Find the matched object to_k using match(rd_j) = to_k
4:     if the match succeeds then
5:         lock(to_k); perform validation Verify(rd_j, to_k)
6:         remove(to_k) from top
7:         Record statistics ρ_i = ρ(rd_j, to_k)
8:     else
9:         Increment the mismatch or loss counter
10: Compute aggregated verification metrics (success rate, latency, data loss)
11: return ρ

5. Evaluation

In this section, we evaluate the proposed DIOTVF from three perspectives: data exchange speed during testing and verification, the number of successful verifications, and the overall valid verification rate. Here, the valid verification rate is a low-level, per-request metric: the fraction of test objects in the test object pool whose responses can be correctly matched and checked against their originating payloads under asynchronous delays. A 100% valid verification rate therefore means that every test request issued during an experiment was successfully matched with a corresponding response and subjected to the expected output check. We do not claim that the test suite covers all functional requirements or execution paths of the system under test; coverage and requirement engineering are orthogonal to the framework proposed in this paper.

5.1. System Configuration

Multiple virtual machines were deployed on the virtualization platform for the experiment, each configured with 8 vCPUs (Intel(R) Xeon(R) Gold 6128 at 3.40 GHz) and 16 GB of memory. To simulate cross-network communication, random transmission delays were injected between the testing and verification interfaces. Accordingly, the experiments included both zero-delay tests and tests with randomly added delays ranging from 1 ms to 10 ms. Performance was measured under thread counts of 2 and 4. The proposed method requires a sender and a receiver to operate concurrently, each running in its own thread; the total number of threads must therefore be even, making single-threaded testing infeasible, so no single-thread results are reported for our method. For comparison, we selected three additional tools: Apache JMeter, Locust, and a baseline implementation of our approach, written in the same programming language, that emulates JMeter's behavior (denoted as baseline). Unlike our framework, these control-group tools execute a send-then-receive sequence in a single thread, allowing single-thread tests. To fairly evaluate sequential execution behavior, the test target was configured to return exactly one test payload requiring verification per request.
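For illustration, random 1-10 ms forwarding delays between the testing and verification interfaces could be emulated in software with a simple relay such as the sketch below; the actual testbed may instead configure delays at the hypervisor or network level, and the relay here forwards only the client-to-target direction for brevity.
```python
import random
import socket
import threading
import time

def delayed_forwarder(listen_port: int, target: tuple,
                      min_delay: float = 0.001, max_delay: float = 0.010) -> None:
    """Accept connections and forward client data to the target, sleeping a
    random 1-10 ms before each forwarded chunk."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", listen_port))
    srv.listen()
    while True:
        client, _ = srv.accept()
        threading.Thread(target=_relay, args=(client, target, min_delay, max_delay),
                         daemon=True).start()

def _relay(client: socket.socket, target: tuple,
           min_delay: float, max_delay: float) -> None:
    with client, socket.create_connection(target) as upstream:
        data = client.recv(65536)
        while data:
            time.sleep(random.uniform(min_delay, max_delay))   # injected jitter
            upstream.sendall(data)
            data = client.recv(65536)
```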

5.2. Evaluation of Valid Testing and Verification Rate

The main contribution of this paper is a DIOTVF that tolerates distributed, unpredictable delays. To assess the success rate of testing under such delays, we evaluated the valid testing and verification rate; the results are reported in Table 1. They show that, under multithreaded contention and varying delays, the proposed method uses the test object pool to look up and verify previously sent test payloads even when responses arrive out of order, so every test payload sent in the experiment can still be matched with its corresponding response and checked. With one thread and no unpredictable delays, the other methods also achieved 100% verification because execution is strictly sequential. With more than one thread, however, random competition and querying reduced their valid verification rates to as low as 54% (e.g., with four threads and no unpredictable delays). Under unpredictable delays, the valid verification rates of Apache JMeter and the baseline method (implemented following JMeter principles) are not applicable, because their test–task scheduling verifies a payload immediately after it is sent, so each thread almost always ends up verifying a payload that was sent earlier by another thread. Locust, due to its sending mechanism, achieved a verification rate of approximately 66.67%. Overall, these results indicate that the proposed framework maintains a consistently high valid verification rate even in the presence of asynchronous delays, which is essential for trustworthy testing of ICPS-like workloads.

5.3. Evaluation of Successful Verification Count

The experimental results for the successful verification count are shown in Figure 6. The proposed method outperforms all competitors across the tested delay ranges. Specifically, with no additional unpredictable delays, four threads, and a data length of 8000 bytes, the proposed method achieved a 0.87-fold increase in verification count over the baseline group, a 2.6-fold increase over Apache JMeter, and a 12.24-fold increase over Locust. Under unpredictable delays, both JMeter and the baseline method produced zero valid verifications under our definition, so their valid verification rate is considered not applicable in this setting. Compared to Locust, which had a non-zero verification rate, the proposed method achieved a 15.63-fold increase in verification count at a test payload size of 10,000 bytes. These observations confirm that the DIOTVF can sustain a high volume of correctly matched verifications in the presence of concurrent traffic and non-deterministic network delays.

5.4. Evaluation of Data Verification Speed

The experimental results of data exchange speed are shown in Figure 7. As can be seen, the proposed method yielded higher data verification speeds with two threads compared to the other methods, even when they used four threads, regardless of the delay range. Since sending speed and data packet size are proportionally related, performance improvement rates are consistent with successful verification counts. With four threads and a 10,000-byte payload, the proposed method achieved an additional 23.53 MB/s in data verification speed compared to JMeter without unpredictable delays, and an additional 30.53 MB/s in data verification speed compared to Locust with unpredictable delays. In summary, the proposed framework not only improves the number of successful verifications but also significantly increases effective throughput, which is crucial for large-scale ICPS testing campaigns.

5.5. Discussion of ICPS-Specific Metrics

Although the experiments are conducted on an ICPS-like protocol, the evaluation configuration is designed to approximate several key ICPS characteristics while preserving accuracy and reproducibility. First, the injected delay range of 1–10 ms reflects the network latency and jitter typical of many industrial deployments, where control messages are expected to complete within tens of milliseconds. Under these conditions, the framework maintains a 100% valid verification rate, suggesting that it can preserve correct input–output correspondence without violating soft real-time constraints in the tested regime. Each request/response pair emulates a control-cycle update, so the successful verification count and valid verification rate together indicate how many control cycles can be reliably exercised and checked under concurrent test traffic. By separating senders from verifiers and explicitly modelling out-of-order responses, the framework can also respect protocol-level requirements such as strict ordering between logically related messages, even when the physical delivery order is perturbed by delays.
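As a rough illustration of this configuration, the asyncio sketch below emulates a system under test that answers each "control-cycle" request after a uniformly random 1–10 ms delay; the echo behaviour and function names are our assumptions, standing in for the real protocol handler, and serve only to show how the injected jitter can reorder response completion.

```python
import asyncio
import random


async def delayed_echo_sut(request: bytes, delay_range_ms=(1, 10)) -> bytes:
    """Placeholder system under test: echoes each request after a uniformly
    random delay, emulating per-control-cycle network latency and jitter."""
    delay_s = random.uniform(*delay_range_ms) / 1000.0
    await asyncio.sleep(delay_s)
    return request  # a real SUT would return a protocol-specific response


async def main():
    # Fire several "control-cycle" requests concurrently; completion order
    # may differ from send order because of the injected jitter.
    tasks = [asyncio.create_task(delayed_echo_sut(f"cycle-{i}".encode()))
             for i in range(5)]
    for finished in asyncio.as_completed(tasks):
        print(await finished)


asyncio.run(main())
```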
From the viewpoint of experimental accuracy and reproducibility, the explicit modelling of test objects and their life cycle further helps to tame nondeterminism in distributed experiments. Because each request is tagged with a unique identifier and can only be matched once by the verifier group, the framework enforces a one-to-one correspondence between sent payloads and verified responses even under asynchronous delays and multithreaded contention. For a fixed configuration (number of threads, delay distribution, and payload templates), the set of successfully verified test objects is therefore determined by the pseudo-random generator used to construct payloads, which makes comparative experiments repeatable on the same testbed. At the same time, we acknowledge that our current evaluation is limited to a single virtualized environment. A more fine-grained analysis involving hardware-in-the-loop PLCs, detailed scan-cycle measurements, and protocol-specific timing profiles, as well as extending the study to larger heterogeneous deployments and long-running crowdsourced campaigns where node churn and scheduler interference are more pronounced, remains important future work for fully characterizing ICPS relevance and reproducibility at scale.
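The reproducibility argument can be illustrated with a small sketch of seeded payload construction; the template format, field names, and seed value below are hypothetical, but the pattern of deriving all payloads from a per-run pseudo-random generator matches the idea described above.

```python
import random


def build_payloads(seed: int, template: str, count: int) -> list[tuple[int, bytes]]:
    """Deterministically generate (test_id, payload) pairs from a fixed seed,
    so repeated runs of the same configuration exercise identical inputs."""
    rng = random.Random(seed)  # per-run generator, independent of global state
    payloads = []
    for test_id in range(count):
        filler = "".join(rng.choices("0123456789abcdef", k=16))
        payloads.append((test_id, template.format(id=test_id, data=filler).encode()))
    return payloads


# Same seed and configuration -> byte-identical payload sets across runs.
run_a = build_payloads(42, "ID:{id};DATA:{data}", 3)
run_b = build_payloads(42, "ID:{id};DATA:{data}", 3)
assert run_a == run_b
```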

5.6. Scalability and Complexity Discussion

From an algorithmic perspective, the main overhead of the proposed framework comes from maintaining the test object pool, matching responses to test objects, and parsing and mutating protocol payload templates. Let N denote the number of active test objects in the pool. The pool is implemented as a hash-based data structure with bucket locks, so insertion, lookup, and removal of test objects are O(1) on average. Response matching is also O(1) with respect to N, because each response carries an identifier that is used as a hash key to locate its corresponding test object.
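A minimal sketch of such a bucket-locked pool is given below; the bucket count, method names, and stored fields are our assumptions rather than the framework's actual API, but the per-bucket locking and hash-keyed lookup illustrate why insertion, matching, and removal stay O(1) on average while allowing concurrent access from multiple threads.

```python
import threading


class BucketedTestObjectPool:
    """Sketch of a hash-based pool with per-bucket locks: insertion, lookup by
    identifier, and removal are O(1) on average with respect to pool size."""

    def __init__(self, num_buckets: int = 64):
        self._buckets = [dict() for _ in range(num_buckets)]
        self._locks = [threading.Lock() for _ in range(num_buckets)]

    def _index(self, test_id) -> int:
        # The identifier is used as a hash key to select a bucket.
        return hash(test_id) % len(self._buckets)

    def insert(self, test_id, test_object) -> None:
        i = self._index(test_id)
        with self._locks[i]:
            self._buckets[i][test_id] = test_object

    def pop(self, test_id):
        """Remove and return the matching test object, or None if it is absent
        (e.g., already verified or expired by timeout handling)."""
        i = self._index(test_id)
        with self._locks[i]:
            return self._buckets[i].pop(test_id, None)


pool = BucketedTestObjectPool()
pool.insert("req-001", {"payload": b"...", "deadline_s": 1.5})
print(pool.pop("req-001"))  # matched exactly once
print(pool.pop("req-001"))  # None: a second match is impossible
```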
Moreover, for the ICPS-like protocol used in our evaluation, this per-test-object cost is dominated by network latency and the processing time of the system under test. As the size of the ICPS and the complexity of its protocols grow, the dominant factor in overall runtime is therefore the volume of test traffic and the number of active test objects, rather than the asymptotic complexity of our framework.
At the same time, the memory footprint of the test object pool grows linearly with the number of in-flight test objects. Very large-scale ICPS deployments with thousands of concurrent test threads or extremely large payloads would require careful resource provisioning. A more detailed scalability study that covers heterogeneous industrial protocols, deeper protocol stacks, and multi-site deployments is left as future work.

6. Conclusions

In summary, this work presents a DIOTVF within a distributed crowdsourced platform for ICPS security testing and makes three main contributions. The crowdsourced platform provides the architectural context in which heterogeneous testing agents are coordinated, while the focus of this paper is the design, implementation, and evaluation of the distributed input–output framework that serves as the platform's core testing engine. First, built on a four-layer cloud-based ICPS architecture, the proposed distributed crowdsourced testing architecture decouples heterogeneous testing agents from the system under test via a shared test object model and standardized test/verification ports, thereby supporting proactive, cross-network, cross-device, and cross-layer security testing under realistic deployment conditions. Second, on top of this architecture, we design a delay-aware input–output testing and verification framework that uses a hash-based test object pool and explicit timeout management to maintain correct request–response matching under unpredictable communication delays. Third, we combine a dynamic test–task management model with an asynchronous multithreaded testing architecture, optionally enhanced by an LLM-assisted thread controller, to support complete test verification even in distributed environments where delays arise from data processing or network instability between test and verification ports. Experimental results show that, compared with Apache JMeter, Locust, and a baseline implementation, the proposed framework achieves up to a 2.6-times improvement in verification efficiency while maintaining a 100% valid verification rate in the sense defined in Section 5, i.e., every test request is matched to a response and successfully checked.
In future work, we plan to extend the proposed framework along three directions. First, we will further enrich ICPS-specific evaluation, including tighter integration with hardware-in-the-loop PLCs, scan-cycle and jitter measurements, and protocol-aware behavioral metrics. Second, we aim to explore dynamic test templates and adaptive construction of test payloads based on intermediate verification results, in order to support fuzzing-style exploration of disordered responses and corner-case behaviors in industrial cyber-physical systems. Third, we will investigate deeper integration with formal specification and verification techniques, so that property-driven test generation and distributed runtime monitoring can be combined with the proposed crowdsourced testing platform to provide stronger assurance for ICPS security and reliability. In parallel, we plan to refine the risk classification introduced in the Introduction based on feedback from real deployments, so that future test campaigns can be systematically aligned with the most critical ICPS risk categories.

Author Contributions

Conceptualization, Z.L.; Methodology, Z.L.; Software, Z.L.; Validation, Z.L.; Investigation, Z.L.; Resources, Y.D.; Writing—original draft, Z.L.; Writing—review and editing, Y.D., R.Z., S.W. and J.L.; Supervision, Y.D.; Project administration, Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This article is supported by the National Key R&D Program of China (2023YFB3107300), the Guangxi Natural Science Foundation (2025GXNSFGA069004), the National Natural Science Foundation of China (12561054, 62172119), and the Innovation Project of Guangxi Graduate Education (YCBZ2024163).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Test architecture of crowdsourced testing platform with industrial cyber-physical systems.
Figure 2. Test architecture of crowdsourced testing platform with distributed input–output testing and verification framework.
Figure 3. Dynamic test–task management model.
Figure 4. Asynchronous concurrent testing model.
Figure 5. Prompt template used in the LLM-assisted thread controller.
Figure 6. Comparison of successful verification count. (a) Data return delay range = 0 ms; (b) data return delay range = 1–10 ms.
Figure 7. Comparison of data verification speed. (a) Data return delay range = 0 ms; (b) data return delay range = 1–10 ms.
Table 1. Valid verification rate.

Method          Thread Count    Valid Verification Rate (Delay = 0 ms)    Valid Verification Rate (Delay = 1–10 ms)
Baseline        1               100%                                       N/A
Baseline        2               73.91%                                     N/A
Baseline        4               54.05%                                     N/A
Apache JMeter   1               100%                                       N/A
Apache JMeter   2               87.72%                                     N/A
Apache JMeter   4               63.19%                                     N/A
Locust          1               100%                                       66.67%
Locust          2               99.94%                                     66.67%
Locust          4               98.86%                                     66.85%
Our Method      2               100%                                       100%
Our Method      4               100%                                       100%
