1. Introduction
According to the World Health Organization (WHO), between 2020 and 2030 about 500 million people are predicted to develop heart disease, obesity, diabetes, or other non-communicable diseases due to physical inactivity [1,2]. This scenario highlights the urgent need to develop solutions that address these problems while promoting health and autonomy in the general population, particularly among older adults. Considering that a large part of the population is now over 65 years old and needs long-term assistance in later life, it is clear that planning for intelligent long-term care systems (in financial, management, and technological terms) is more urgent than ever.
In Internet of Things (IoT)-based Ambient Assisted Living (AAL) systems, activity detection commonly relies on sensors, including audiovisual devices combined with computer vision (CV) and artificial intelligence (AI) techniques, to recognize domestic activities and enhance user safety. Distributed collaboration among devices improves the detection of complex activities and supports privacy-preserving processing. However, challenges remain in ensuring privacy, reliability, and adaptability to diverse user behaviors. Collaborative processing helps mitigate these issues by reducing false positives and improving detection accuracy.
Considering these issues, this article presents a low-cost implementation that can contribute to addressing these challenges in the AAL area, using RGB video cameras and microphones integrated into smart devices.
Some of the scenarios that monitoring systems aim to address are detailed below:
Fall detection improves confidence in fall recognition through multi-sensor fusion across rooms, reducing false alarms and supporting timely alerts to caregivers or emergency services.
Security and safety monitoring uses distributed sensing to flag suspicious activity or anomalous events and trigger notifications or alarms.
Energy-consumption analysis aggregates device-level measurements to identify consumption patterns and opportunities for efficiency improvements.
Personal assistance and home automation uses context-aware sensing to support adaptive control (e.g., lighting, temperature) and reminders, improving daily autonomy.
Pet monitoring uses remote observation of activity patterns to detect abnormal behavior and support wellbeing.
Sleep and behavior analysis uses long-term sensing to extract routines and detect deviations that may indicate risks or health-related changes.
1.1. A Reference Architecture for Smart Home Intelligent Systems
Figure 1 shows an architecture commonly implemented in smart home monitoring systems, described here as a layered network architecture. This approach divides the system into different layers, each with a specific function, to better organize components and facilitate scalability and maintenance.
In the initial (physical) layer of the architecture, sensors capture data. These sensors can be integrated into low-cost devices, such as Raspberry Pi or Jetson Nano, whose primary function is to acquire information from the environment and forward it to the subsequent layer. The second layer is the network layer, where data collected by sensors is sent to a central device, such as a hub or controller, through communication technologies such as Wi-Fi, Bluetooth, or ZigBee, ensuring connectivity between devices and the central monitoring system. The next layer is the analysis and processing layer, in which data is processed and analyzed to extract relevant information. Algorithms and AI techniques can be applied at this stage to identify patterns, make predictions, and take automated decisions based on the collected data. The fourth layer is the decision layer, where decisions derived from data analysis are converted into actions; for example, if the room temperature is above a set threshold, the system can automatically activate the air conditioning to regulate it. This layer enables automation and remote control of devices in the smart home. Finally, the fifth layer is the application layer, which provides intuitive and user-friendly interfaces (e.g., smartphone apps) that allow residents to interact with the monitoring system and control devices conveniently.
1.2. The DistSense System
This article proposes the DistSense system. In essence, DistSense is a distributed network of sensing nodes that detect and classify activities of daily living in smart homes from audio and RGB video streams, while maintaining privacy and security as key requirements. DistSense is defined in Definition 1.
Definition 1 (DistSense). DistSense (Distributed Activity Recognition System) is defined as a set of interconnected sensing nodes, each capable of local perception and inference, that collaboratively detect and classify activities by exchanging high-level semantic information rather than raw sensor data.
To achieve this goal, a P2P network architecture was implemented to process captured data at each node, with customized neural network models trained for audio and video inference. This design enables local and collaborative processing of audiovisual activity cues, while sharing only relevant high-level events with other nodes or the user, in line with Definition 2.
Definition 2 (Privacy-Preserving Collaborative Inference). Privacy-preserving collaborative inference refers to a distributed decision-making process in which multiple nodes contribute to a shared inference outcome while ensuring that sensitive raw data remain locally processed and are never transmitted or centrally stored.
In addition, the DistSense system integrates a component aimed at improving accuracy in activity detection in domestic environments. This improvement is achieved through collaboration among devices in the network, supported by the integration of distributed ledger technology. When a device cannot detect an activity reliably, it can consult distributed records, enabling more robust decision-making in activity classification and reducing false positives and false negatives. Collaboration also enables the sharing of relevant information among devices: when a device detects a specific activity, it can notify other devices on the same network, enabling mutual validation of detected events.
Over the past few decades, computing paradigms have evolved, and the discussion around IoT computing continues to develop [3]. In this work, the proposed system design is guided by four constraints:
Data Privacy and Security requires protecting sensitive data when stored, transferred, or processed, in accordance with European data protection rules [4].
Cost must be considered when choosing approaches and technologies, as the solution must ensure scalability for different scenarios.
Network Connection requires considering the communication type and protocols that must be supported, since the solution must ensure communication between all devices.
Distributed Processing can make scalability in processing, storage, and communication challenging, even in popular open-source home automation solutions such as Home Assistant [5].
To enable real-world deployment, we propose a low-cost distributed monitoring system based on audiovisual sensors integrated into smart devices, leveraging platforms such as Jetson Nano [6] and Raspberry Pi [7]. The system is evaluated through a combined experimental and case study methodology, involving controlled and real-world scenarios to assess performance, robustness, and usability. Simulation in controlled environments supports testing under diverse conditions, while real-world experiments validate practical applicability. Data collection follows ethical guidelines, considering the system’s social impact [8].
In the IoT context, edge intelligence enables local and autonomous data processing, reducing reliance on centralized resources [9]. However, practical implementations remain limited. To address this gap, we propose an edge AI-based distributed architecture focused on privacy and security. The system design specifies network modules and communication mechanisms, and its effectiveness is evaluated through representative application scenarios. The main contributions are:
Implementation of a P2P architecture of intelligent audiovisual sensors in domestic environments (DistSense System).
Collaboration between smart sensors in the detection and analysis of domestic activities to disambiguate situations with uncertain information.
Development of a solution that guarantees the integrity, security, and privacy of user data through local processing and storage of captured information.
Specification and implementation of the DistSense system for both simulated and real deployments using containerized components.
Use of a blockchain (BC)-based system to enable strong forensic auditing and guarantee immutability of detected events.
Beyond system integration, the scientific contribution of DistSense is positioned at the level of architecture-method coupling for privacy-preserving collaborative inference in domestic environments. Concretely, our work contributes: (i) an explicit uncertainty-handling workflow that combines local multimodal confidence with distributed event-history consultation; (ii) a reproducible communication and orchestration design that separates local inference, distributed validation, and external notification; and (iii) an implementation-oriented analysis of design trade-offs (privacy, traceability, coordination overhead) under edge constraints.
Essentially, this work was guided by three research questions (RQs): (1) Can distributed peer collaboration improve activity inference in ambiguous domestic scenes without transmitting raw audiovisual data? (2) Can a ledger-backed event history improve traceability and support uncertainty resolution in edge inference workflows? (3) Can the proposed architecture maintain practical real-time operation on low-cost edge hardware? From these, we evaluate the following hypotheses (Hs) in the presented case studies: (1) collaborative consultation of recent peer events improves decision confidence when local confidence is low; (2) high-level event exchange (instead of raw data transfer) preserves functionality while improving privacy; (3) the elected-coordinator workflow remains operationally lightweight for the studied deployment scale.
In this work, privacy-preserving collaborative inference is operationalized as follows: each node performs local audiovisual inference, shares only high-level semantic events (never raw audio/video streams), and consults recent distributed event history only when local confidence is below the global threshold. This operational definition directly guides the communication, orchestration, and ledger design choices described in Section 3.
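The operational definition above can be illustrated with a minimal sketch. It is not the DistSense implementation: the `Event` type, the `resolve` function, the threshold of 0.6, the 30 s history window, and the confidence-weighted vote are all assumptions introduced here for illustration, since the text specifies only that peer history is consulted when local confidence falls below a global threshold.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """High-level semantic event; raw audio/video is never included."""
    event_type: str
    location: str
    confidence: float
    timestamp: float  # seconds


def resolve(local: Event, peer_history: list[Event],
            threshold: float = 0.6, window_s: float = 30.0) -> str:
    """Return the final label for a local detection.

    threshold and window_s are assumed values, not taken from the paper.
    """
    # Confident local detections are accepted without consulting peers.
    if local.confidence >= threshold:
        return local.event_type
    # Otherwise consider only recent events reported by other nodes.
    recent = [e for e in peer_history
              if abs(local.timestamp - e.timestamp) <= window_s]
    if not recent:
        return "uncertain"
    # Assumed fusion rule: confidence-weighted vote over peer labels
    # plus the low-confidence local guess.
    votes = Counter()
    votes[local.event_type] += local.confidence
    for e in recent:
        votes[e.event_type] += e.confidence
    return votes.most_common(1)[0][0]
```

In this sketch only `Event` descriptors cross node boundaries, which mirrors the semantic-only exchange described above; the fusion rule itself could be replaced by any other policy without changing that property.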
2. Related Work in Smart Home Monitoring Systems
As smart-home sensors capture users’ routines and behaviors, concerns arise regarding data security and privacy. The transmission and storage of sensitive information, including biometric and behavioral data, expose systems to potential misuse and cyberattacks. Additionally, user acceptance remains a challenge, as intrusive monitoring may be perceived as invasive despite its benefits in automation and personalization.
This trade-off has led to the distinction between intrusive and non-intrusive sensing, as well as hybrid approaches that balance accuracy and user comfort. In parallel, cloud, edge, and fog computing paradigms play a central role in data processing [10]. Edge computing enables low-latency, local processing and reduces the exposure of sensitive data, while fog computing distributes computation across intermediate nodes to improve scalability.
Smart home monitoring systems are widely applied in domains such as healthcare, energy, and safety. In e-Health, they support continuous monitoring of vital signs, enabling remote care and improving quality of life.
In [11], the authors propose SCAMPI (Self-Care Advice, Monitoring, Planning, and Intervention), a framework based on low-cost sensors to support autonomous monitoring of daily activities in patients with dementia or Parkinson’s disease. Data are locally collected and analyzed to identify patterns related to disease progression. The system integrates multiple communication protocols (Wi-Fi, Bluetooth, ZigBee, Z-Wave) through a Raspberry Pi acting as a central hub, running Home Assistant (HA) [5,12]. Evaluation included laboratory and real-world studies, showing feasibility, ease of deployment, and minimal disruption to daily activities. However, the use of non-intrusive sensors limits activity recognition accuracy compared to more invasive approaches, such as audiovisual sensing.
In [13], the authors propose a modular healthcare monitoring system based on wearable sensors and a myRio processor, responsible for data acquisition and transmission via Wi-Fi. Data are visualized and stored using the EVOTHINGS platform, enabling continuous monitoring of physiological signals.
A three-layer DNN is used for feature extraction and prediction, achieving an accuracy of approximately 97.2%. The results demonstrate the system’s effectiveness in monitoring and predicting physiological signals. Beyond vital signs, such systems also support the analysis of users’ daily physical activities, enabling more comprehensive behavior assessment.
In [14], the authors propose an IoT-based system for real-time monitoring of older adults’ physical activities using a low-cost wearable accelerometer. Data are transmitted via Wi-Fi (IEEE 802.11b/g/n) [15] to a central node, which applies supervised ML for activity recognition. The architecture includes a wearable device, a main node, a cloud server, and a mobile application, with communication supported by TCP and MQTT protocols.
The system classifies activities such as walking, sitting, sleeping, and standing using a decision tree model, achieving over 80% accuracy. Results highlight the feasibility of low-cost, real-time monitoring, while emphasizing the importance of reliable connectivity and hardware selection.
In addition, another area of interest in the monitoring of smart environments is energy efficiency, where centralized systems allow the intelligent management and control of energy consumption, optimizing its use and reducing costs.
In [16], the authors propose an intrusive IoT-based architecture for energy consumption monitoring and activity recognition, structured in five layers: device, perception, communication, middleware, and application. ML models (FFNN, LSTM, SVM) are applied using the UK-DALE dataset [17], achieving over 90% accuracy for FFNN and LSTM, and around 80% for SVM. Results show that LSTM benefits from larger data groupings, and that retraining with updated data improves performance. Although the system is not real-time, it can identify concurrent activities from feature sequences. The study also highlights increasing concerns regarding privacy and user acceptance due to the use of intrusive sensing technologies in smart home environments.
In [18], the authors propose a distributed smart camera-based system for privacy-aware elderly monitoring, combining image processing and smart cameras to detect abnormal events and track user activity. The system incorporates security and anonymity mechanisms to preserve privacy while enabling accurate recognition of faces, objects, and behaviors, and can be extended with additional sensors. Despite ethical concerns related to surveillance, the authors argue that privacy risks can be mitigated through safeguards, transparency, and user consent. Experimental results in controlled and real environments demonstrate accurate event detection, effective activity tracking, and good user acceptance.
In [19], the authors propose a distributed intrusion detection system (DIDS) for IoT environments, leveraging a P2P architecture and a Distributed Hash Table (DHT)-based reputation mechanism. Smart devices collaboratively detect malicious behavior using ML-based classification across kernel, network, and DHT data, with Kademlia ensuring efficient communication.
Similarly, ref. [20] presents a distributed multi-camera system for activity recognition based on consensus matrix completion and convex optimization. Evaluated on IXMAS and MuHAVi, it achieves 85.9% accuracy, demonstrating robustness to noise and variability in human activities.
In AAL contexts, [21] proposes a distributed fog-based platform for acoustic event detection, combining ANN-based real-time detection with CBR refinement. The system achieves up to 94.6% accuracy and 90.58% F1-score, supporting scalable monitoring with acceptable computational overhead. Despite promising results, improvements in data quality and model adaptation are required for enhanced performance in real-world deployments.
The studies explored highlight the potential of centralized and/or distributed monitoring systems in smart environments to improve users’ health, energy efficiency, and comfort, enabling more effective monitoring and improved interaction with the smart space. In Table 1, we present a critical analysis to examine and contrast the main monitoring systems investigated in the literature, considering fundamental criteria such as the architectural model, ML model, computation model, sensors, and hubs used. In addition, special attention is paid to data security and privacy, given the increasingly sensitive and regulated context surrounding the protection of users’ personal information.
Regarding the architectural models explored, systems that rely on intrusive sensors such as audio and/or video tend to adopt a distributed approach. This choice can be attributed to the advantages of decentralizing processing, safeguarding data privacy, and minimizing critical points of failure. Distributed systems have greater redundancy, which contributes to higher overall reliability. However, the centralized systems examined in this analysis may offer simpler management and control, although they may demonstrate less scalability in high-demand scenarios. This limitation is particularly evident when a central hub coordinates communication between devices, reducing tolerance to failures.
ML models play a key role in data analysis and decision-making in monitoring systems in smart homes. The comparative analysis reveals that, although the investigated studies apply ML algorithms for activity detection and process optimization, there is still a lack of standardization in the choice of specific algorithms. This lack of uniformity suggests an opportunity for future research that can identify the most effective ML models for specific applications.
Privacy is a critical aspect when it comes to monitoring smart home environments. In this context, it is possible to see that most distributed systems show concern for protecting users’ privacy, using security mechanisms or requesting user consent.
Through the comparative analysis of the investigated systems, two fundamental aspects considered essential for developing the DistSense system are emphasized: user privacy and the adoption of a collaborative approach among smart devices through a distributed architecture.
To avoid ambiguity in Table 1, we stress that labels such as “Multiple Approaches” summarize heterogeneous model families reported by the original works, and do not imply methodological equivalence. Likewise, “Video” in the sensor column does not by itself indicate whether raw streams are transmitted or whether only locally extracted representations are exchanged. DistSense addresses this distinction explicitly by enforcing local processing with semantic-only event sharing.
From a design-trade-off perspective, DistSense prioritizes edge/distributed operation to reduce raw-data exposure and improve responsiveness, while accepting additional orchestration complexity compared with centralized cloud pipelines.
When intrusive sensors are used, a distributed approach is especially important because combining information from multiple sources enables a more comprehensive and accurate view of the environment.
To clarify novelty relative to prior studies, DistSense does not claim a new standalone ML model or a new consensus algorithm. Instead, novelty lies in the end-to-end methodological composition for smart-home monitoring: collaborative multimodal inference under uncertainty, strict high-level semantic exchange (no raw stream sharing), and ledger-backed cross-node traceability integrated with practical home-automation delivery. This combination, implemented and validated in both controlled and real deployments, is a core differentiating contribution of this work.
3. Design and Implementation of DistSense
The DistSense distributed system follows a decentralized peer-to-peer architecture composed of self-organizing edge nodes. At any given time, exactly one node acts as coordinator, and this role may change dynamically because of network reconfiguration events (e.g., node joins or departures). The system architecture, computational logic, and runtime behavior of the nodes form the foundation of DistSense (available at https://github.com/josemctorres/DistSense-Smart-Environments (accessed on 12 February 2026)).
3.1. Edge Node Logic and Architecture
The DistSense system targets the continuous acquisition and analysis of audiovisual data for smart-environment monitoring. The extracted high-level information enables the detection of trends, anomalies, and activity patterns, supporting informed decisions. The system operates in a distributed manner: nodes collaborate and synchronize to increase confidence in event detection and classification. Each edge node is composed of four modules/layers that support efficient monitoring of the environment. The lowest layer is Communication, followed by Discovery; the higher layers are Knowledge Representation and Processing, and Machine Learning (ML), which support semantic inference and decision support.
When a peer node is initialized, it discovers other peers on the local network (via the discovery service described in Section 3.2) and establishes connections accordingly. After peer discovery, an election procedure selects the coordinator, which assumes specific orchestration responsibilities in the distributed system. The election procedure ensures that, at any given time, there is a single leader among the active nodes. Election can be based on different criteria, such as processing power, availability, or distributed consensus mechanisms [22]. In this way, the system maintains a reliable coordinator and supports proper task coordination and management.
After the election process is completed, or once enough nodes are available in the network, the peers begin communicating with one another. Communication between nodes is performed through secure and encrypted channels to ensure the confidentiality and integrity of the transmitted data. This approach ensures that only authorized peers can access and interpret information sent by other peers. After the initial network configuration, each node captures and processes audiovisual information efficiently, enabling intelligent real-time analysis of events occurring in the monitored space.
Figure 2 illustrates the data flow of each device in the DistSense system; the first stage consists of capturing audiovisual data. At this stage, carefully selected sensors and devices collect relevant information from the observed environment.
In the next step, the captured signals undergo preprocessing to improve data quality and reliability before inference. This stage typically includes normalization, denoising/cleaning, and transformations that produce a consistent representation suitable for model input. After local preprocessing, the data is forwarded to the inference model. The model performs audiovisual event recognition and outputs high-level predictions that are used for decision-making and monitoring. Additionally, DistSense records detected events in a distributed ledger (blockchain) [23], as shown in Section 3.5.1.
The P2P architecture provides several advantages. First, it eliminates single points of failure because the system does not rely on a central server. Second, decentralization improves scalability, as new peers can be added to the network without significantly affecting overall performance.
In this context, given privacy concerns associated with intrusive sensors (e.g., cameras and microphones), the ledger is used to improve traceability and tamper-evidence for events shared among peers. In DistSense, the blockchain acts as an append-only log that links events in chronological order; blocks are replicated across the P2P network to avoid reliance on a single storage authority. This design provides integrity guarantees (tamper-evidence) and improves resilience to single-node compromise; it does not, by itself, prevent all attacks and therefore complements (rather than replaces) conventional security controls.
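The tamper-evidence property of an append-only, hash-linked log can be illustrated with a short sketch. This is a generic hash chain under stated assumptions, not the DistSense ledger: the block layout, the use of SHA-256 over canonical JSON, and the function names `block_hash`, `append_event`, and `is_valid` are hypothetical, and replication, consensus, and signatures are omitted.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # assumed placeholder for the first block's predecessor


def block_hash(index: int, prev_hash: str, event: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block contents."""
    body = json.dumps({"index": index, "prev": prev_hash, "event": event},
                      sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_event(chain: list[dict], event: dict) -> None:
    """Append a high-level event, linking it to the previous block."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    chain.append({"index": index, "prev": prev_hash, "event": event,
                  "hash": block_hash(index, prev_hash, event)})


def is_valid(chain: list[dict]) -> bool:
    """Recompute every link; an edited or reordered block breaks the chain."""
    prev_hash = GENESIS_PREV
    for i, block in enumerate(chain):
        if block["index"] != i or block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(i, prev_hash, block["event"]):
            return False
        prev_hash = block["hash"]
    return True
```

Because each block commits to the hash of its predecessor, changing a stored event invalidates that block and every later link, so any peer holding a replica can detect the modification. As noted above, this yields tamper-evidence rather than protection against all attacks.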
Data confidentiality remains a key concern, especially with audiovisual sensing. In DistSense, confidentiality is addressed by (i) processing raw data locally at the edge, (ii) sharing only high-level event descriptors, and (iii) protecting ledger participation and access through standard cryptographic mechanisms (node identities/keys and access control policies). Smart contracts can be used to encode authorization rules when applicable [24]; however, sensitive content is not intended to be publicly exposed on-chain. The resulting event history can be analyzed to characterize routines and detect deviations while minimizing disclosure of raw sensor data.
After user behavioral patterns are detected through data processing, the corresponding high-level information is sent by the elected coordinator node to the hub. Using an IoT hub facilitates communication between the system and the user. One of the most widely adopted IoT hubs is Home Assistant (HA), an open-source home automation platform designed to control smart devices in residential environments. HA provides a centralized platform for integrating and controlling devices from different manufacturers, enabling custom rules, task automation, and interaction with virtual assistants. According to [25], HA can be a cost-effective solution because of its open-source nature and the ease with which features can be adapted to the user’s context. This approach enables an efficient and customizable monitoring system that combines the detection of behavioral patterns with the automation and control capabilities provided by HA. By using low-cost devices and integrating them with HA, it is possible to obtain an affordable, flexible, and easy-to-deploy monitoring system for smart home environments.
Collaboration among devices in the network is intended to increase system accuracy and reliability when monitoring complex and dynamic environments in real time.
3.2. Node Discovery and Coordinator Election
System bootstrapping in a distributed environment is a critical step for stabilizing the network and ensuring that nodes connect and become active correctly. A well-defined initialization process is therefore essential to preserve scalability and reliability while allowing new nodes to be added and integrated seamlessly.
Before peer discovery takes place, an initial network bootstrap step is performed. In this step, each device obtains its IP address via the Dynamic Host Configuration Protocol (DHCP), which is used only for Layer-3 addressing configuration (IP, gateway, DNS) and conflict avoidance. Service-level peer discovery is then performed by mDNS/DNS-SD; i.e., DHCP handles network attachment, while mDNS/DNS-SD handles decentralized service discovery. Initialization of the DistSense system is then divided into two fundamental steps: discovery of devices in the network and coordinator election.
3.2.1. Discovery Service
DistSense performs local device discovery using mDNS/DNS-SD [3,26] to enable zero-configuration operation in typical home networks. Each node advertises its presence (e.g., UUID, IP, port, and location tag) and listens for announcements to maintain an updated peer node table. In the prototype, discovery is implemented in Python 3.11 using the zeroconf library, providing dynamic join/leave behavior and the input set for the coordinator-election step. This whole process is essential for establishing a cohesive environment that can seamlessly integrate new nodes as they are added.
As with standard mDNS/DNS-SD behavior, discovery is most direct within a local broadcast/multicast domain. Scenarios with subnet segmentation, aggressive device sleep cycles, or filtered multicast may require auxiliary gateway/reflector configuration and/or heartbeat-driven re-registration policies, which are outside the baseline prototype configuration and are treated as deployment considerations.
3.2.2. Coordinator Election Process
Once discovery is complete, the system executes an election algorithm among the participating nodes. The purpose of this algorithm is to select a node to assume a specific coordination role within the network. The elected node becomes responsible for coordinating operations, collecting aggregated information, or performing other tasks relevant to system operation.
DistSense uses the Bully algorithm, one of the classic approaches to coordinator election in distributed systems, chosen over other available alternatives mainly because of its efficiency in failure and recovery situations. If the coordinator node fails, the remaining nodes can promptly detect the failure and start a new election directly. This rapid transition reduces the time during which the system may operate without a functional coordinator, thereby improving continuity of service and minimizing negative impacts [27,28].
The coordinator plays a crucial role in blockchain event management by taking responsibility for validating and adding new blocks to the blockchain. This function helps ensure transaction integrity and security, contributing to data immutability and the overall consistency of the distributed system. In addition, the coordinator establishes two-way communication with HA through the MQTT protocol [29]. This interaction allows the DistSense system to send locally processed high-level events to the user as a usable representation of system knowledge. Through this functionality, HA acts as a valuable user interface, providing relevant information and allowing interaction with the developed system in a simple and efficient way.
In the current implementation, UUID ordering is used as a deterministic tie-break criterion, not as a proxy for computational performance. This choice simplifies reproducibility and prevents election ambiguity under heterogeneous network conditions. However, UUID ranking does not encode compute capacity, stability, or energy state; therefore, a weighted election criterion (e.g., availability score, CPU headroom, or battery state) is a planned extension.
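One Bully round with UUID ordering can be sketched as follows. The sketch assumes that the lexicographically highest UUID wins (the text states only that UUID ordering is the deterministic criterion) and that every peer in `alive` answers in time; the function name `start_election` and the ELECTION message name are illustrative, and timeouts and message transport are omitted.

```python
def start_election(my_uuid: str, alive: set[str]) -> tuple[list[str], str]:
    """One Bully round from the point of view of the node `my_uuid`.

    `alive` holds the UUIDs of the other reachable peers. Returns the peers
    that would receive an ELECTION message and the UUID that ends up as
    coordinator, assuming all peers in `alive` respond.
    """
    # ELECTION is sent only to peers that outrank this node.
    higher = sorted(u for u in alive if u > my_uuid)
    if not higher:
        # Nobody outranks this node: it announces itself as coordinator.
        return [], my_uuid
    # A higher-ranked peer answers and takes over; the election cascades
    # upward until the highest-ranked alive peer wins.
    return higher, higher[-1]
```

With two nodes, as in the evaluated deployment, the surviving node finds no higher-ranked peer after a coordinator failure and becomes coordinator immediately, which is consistent with the small number of control messages expected at that scale.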
For the evaluated two-node real deployment, failover was measured from coordinator-timeout detection to COORDINATOR_ANNOUNCE reception. Across repeated runs, re-election completed in the sub-second range, which is consistent with the low message complexity at this scale and supports practical continuity of service.
3.3. Node Communication
DistSense adopts a P2P communication model in which each node acts as both client and server. Nodes exchange control messages and inference/validation information using JSON-encoded payloads transported over authenticated and encrypted channels (TLS), ensuring confidentiality and integrity [30]. This layer supports (i) discovery and election coordination, (ii) propagation of event/transaction data for distributed validation, and (iii) periodic liveness monitoring (e.g., heartbeats/keep-alive) to detect failures and trigger reconfiguration (e.g., re-election).
Messages are represented as JSON objects with a small fixed header (message type, sender UUID, timestamp) and a payload (e.g., inference event, transaction, or coordination command). The minimal mandatory exchanged fields are: MSG_TYPE, SENDER_UUID, TIMESTAMP, EVENT_TYPE, LOCATION, and CONFIDENCE. Optional fields (e.g., MODEL_ID, SIGNATURE) are included only when required by the specific message type. Each node implements a receive–dispatch loop that routes incoming messages to the appropriate handler.
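As an illustration of the message layout and the receive-dispatch loop, a minimal sketch is given below. The field names are those listed above; the function names and the example message type "EVENT" are hypothetical and do not reproduce the prototype code.

```python
import json
import time

# Mandatory fields listed in the text; optional ones depend on the message type.
MANDATORY_FIELDS = ("MSG_TYPE", "SENDER_UUID", "TIMESTAMP",
                    "EVENT_TYPE", "LOCATION", "CONFIDENCE")


def build_message(msg_type, sender_uuid, event_type, location, confidence,
                  timestamp=None, **optional):
    """Return a JSON string with the mandatory fields plus any optional ones."""
    message = {
        "MSG_TYPE": msg_type,
        "SENDER_UUID": sender_uuid,
        "TIMESTAMP": time.time() if timestamp is None else timestamp,
        "EVENT_TYPE": event_type,
        "LOCATION": location,
        "CONFIDENCE": confidence,
    }
    message.update(optional)  # e.g., MODEL_ID or SIGNATURE
    return json.dumps(message, sort_keys=True)


def parse_message(raw):
    """Decode a JSON message and reject it if a mandatory field is missing."""
    message = json.loads(raw)
    missing = [f for f in MANDATORY_FIELDS if f not in message]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    return message


def dispatch(raw, handlers):
    """Route a decoded message to the handler registered for its MSG_TYPE."""
    message = parse_message(raw)
    handler = handlers.get(message["MSG_TYPE"])
    if handler is None:
        raise KeyError(f"no handler for {message['MSG_TYPE']}")
    return handler(message)
```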
Failure detection and node inactivity marking are based on heartbeat timeout. In the current prototype, heartbeats are emitted every 2 s; a node is marked inactive after 3 consecutive missed heartbeats (6 s timeout window). To improve robustness under transient network issues, control messages use retry with exponential backoff (up to 3 retries), and duplicate messages are discarded by the pair (SENDER_UUID, TIMESTAMP) plus message-type checks.
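The timing rules above can be sketched as follows. The 2 s interval, the limit of 3 missed heartbeats, and the 3 retries come from the text; the base backoff delay of 0.5 s, the class names, and the choice to mark a node inactive once at least 6 s have elapsed are assumptions made for illustration.

```python
HEARTBEAT_INTERVAL_S = 2.0  # from the text
MISSED_LIMIT = 3            # from the text, giving a 6 s timeout window
MAX_RETRIES = 3             # from the text


class LivenessTable:
    """Tracks the time of the last heartbeat received from each peer."""

    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, node_uuid, now):
        self.last_seen[node_uuid] = now

    def inactive_nodes(self, now):
        """Peers whose last heartbeat is at least one timeout window old."""
        timeout = HEARTBEAT_INTERVAL_S * MISSED_LIMIT
        return sorted(n for n, t in self.last_seen.items() if now - t >= timeout)


def backoff_delays(base_s=0.5, retries=MAX_RETRIES):
    """Delays before each retry; base_s is an assumed value, doubled each time."""
    return [base_s * (2 ** attempt) for attempt in range(retries)]


class Deduplicator:
    """Discards messages already seen, keyed by sender, timestamp, and type."""

    def __init__(self):
        self.seen = set()

    def is_duplicate(self, message):
        key = (message["SENDER_UUID"], message["TIMESTAMP"], message["MSG_TYPE"])
        if key in self.seen:
            return True
        self.seen.add(key)
        return False
```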
In terms of communication overhead, discovery traffic is periodic and small (service advertisements and keep-alive messages). For election, the message complexity follows the expected behavior of Bully-like workflows (best-case near-linear, worst-case quadratic in node count). At the tested scale (2 nodes), an election cycle requires a small constant number of control messages and remains below 1 s, as reported above.
3.4. Activity Recognition Using Machine Learning
Each DistSense node performs local audiovisual inference using lightweight deep models at the edge. For audio, we use YAMNet [31]; for video, we use MoViNet, pre-trained on Kinetics-600 [32] and fine-tuned for the target smart-home classes. Training data are drawn from standard public datasets (e.g., FSD50k [33] and ESC-50 [34] for audio, and Toyota Smart Home [35] and Charades [36] for video) after class selection and balancing. Both models were implemented using the TensorFlow platform [37]. For deployment on resource-constrained devices, the resulting models are converted to TFLite, enabling low-latency inference and real-time mapping from audio/video inputs to activity labels with associated confidence scores [38]. For reproducibility, we provide the model families, dataset sources, preprocessing workflow, and deployment conversion path used in the prototype. We emphasize that this work reports deployment-oriented model behavior and does not claim a full benchmark study with exhaustive per-class statistical reporting; such extended benchmarking is left for future work.
After detecting audio and video events with confidences $c_a$ and $c_v$, a fusion mechanism is applied to calculate the node-level event confidence $c_{node}$ according to Equation (1):
$$c_{node} = \alpha \, c_a + (1 - \alpha) \, c_v \qquad (1)$$
In the experiments reported in this work, we use a fixed value $\alpha = 0.5$, giving equal weight to audio and video confidences. This choice was adopted to keep the fusion rule simple, transparent, and reproducible under limited labeled data for modality-specific calibration.
Before fusion, confidence outputs are mapped to $[0, 1]$ and interpreted as model confidence scores on the same numeric scale. We emphasize that this is a pragmatic score-level normalization (not full probabilistic calibration such as temperature scaling). A comparative study of non-linear or learned fusion strategies is outside the scope of this work and is left for future work.
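The equal-weight fusion rule can be sketched in a few lines. The sketch assumes a linear weighted average of the two confidences and a common [0, 1] scale obtained by clipping; the function names and the clipping step are illustrative assumptions, not the prototype code.

```python
def clip01(x):
    """Map a raw score onto the [0, 1] scale by clipping (assumed normalization)."""
    return max(0.0, min(1.0, float(x)))


def fuse_confidence(c_audio, c_video, alpha=0.5):
    """Weighted average of audio and video confidences; alpha=0.5 weighs both equally."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * clip01(c_audio) + (1.0 - alpha) * clip01(c_video)
```

For example, an audio confidence of 0.43 and a video confidence of 0.81 fuse to approximately 0.62 under equal weighting.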
3.4.1. Preprocessing the Audiovisual Dataset
Before training the models, both the audio and video datasets must be preprocessed. For audio, two main datasets were selected: FSD50k [33] and ESC-50 [34]. For video, we used Toyota Smart Home [35] and Charades [36]. These datasets were selected for the specific task of identifying and classifying relevant sounds and videos in the home context.
A crucial step in this procedure is selecting the most relevant classes from each dataset. This requires analyzing each audio clip and video sample individually to assess its quality and relevance to the target objective. During this process, sounds and/or videos that contain several mixed classes are filtered out, since the presence of multiple classes in a single file can hinder correct classification.
Another important preprocessing procedure is class balancing, which helps prevent the model from becoming biased toward classes with more examples and thus promotes more balanced learning across all classes.
In addition, in the audio domain, silence segments are removed from the sound files. This consists of eliminating low-decibel portions, which are usually located at the beginning, middle, or end of each recording. Removing these silence segments helps reduce unwanted noise and ensures that only relevant information is considered during model training. After this preprocessing phase, the curated dataset is ready to be used for model training.
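A minimal sketch of energy-based silence removal is shown below. It assumes samples normalized to [-1, 1], a frame length of 160 samples (10 ms at 16 kHz), and a threshold of -40 dB relative to full scale; these values and the function names are illustrative assumptions, since the text does not specify them.

```python
import math


def frame_db(frame):
    """RMS level of a frame in dB relative to full scale (samples in [-1, 1])."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 0 else float("-inf")


def remove_silence(samples, frame_len=160, threshold_db=-40.0):
    """Keep only the frames whose level exceeds threshold_db.

    Silent frames are dropped wherever they occur (beginning, middle, or end).
    """
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        if frame_db(frame) > threshold_db:
            kept.extend(frame)
    return kept
```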
3.4.2. Model Training
The transfer learning strategy has proven highly effective for optimizing the performance of neural network models such as MoViNet and YAMNet for specific tasks.
In the field of audio monitoring, extracting relevant information from acoustic signals remains challenging; therefore, the use of specialized models for audio analysis is of fundamental importance.
Accordingly, properly preparing the preprocessed audio data for model training is a crucial step. This preparation includes redefining critical variables, such as the number of channels and the sampling rate, which play a decisive role in ensuring the quality of the resulting model.
The extraction of embeddings from the original model plays a central role in building a simplified approach. Embeddings are compact, continuous vector representations of the input; reusing them allows the new classifier to learn features that are relevant to the monitoring context while reducing data complexity and preserving meaningful representations of the selected classes.
In the particular case of YAMNet, the model uses features with dimension 1024 to characterize each audio frame, corresponding to a 0.96-s time interval. During training, the input audio sequence must have a sampling frequency of 16 kHz and a single channel. To capture temporal aspects, the audio is partitioned into 0.96-s windows with a 0.48-s step, resulting in a sliding-window approach.
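The sliding-window partitioning can be made concrete with a short sketch. At 16 kHz, a 0.96 s window corresponds to 15,360 samples and a 0.48 s step to 7,680 samples. The function name is hypothetical, and the sketch simply drops a trailing partial window, which is a simplification.

```python
SAMPLE_RATE_HZ = 16_000  # mono input, as stated in the text
WINDOW_S = 0.96
HOP_S = 0.48


def frame_audio(samples):
    """Split a mono signal into 0.96 s windows taken every 0.48 s."""
    window = round(SAMPLE_RATE_HZ * WINDOW_S)  # 15360 samples
    hop = round(SAMPLE_RATE_HZ * HOP_S)        # 7680 samples
    last_start = len(samples) - window
    return [samples[i:i + window] for i in range(0, last_start + 1, hop)]
```

Under these assumptions, a 3 s clip (48,000 samples) yields five overlapping windows, each of which would be mapped to one 1024-dimensional embedding.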
For visual analysis, the training procedure of the MoViNet model also incorporates transfer-learning-based approaches after the audiovisual datasets have been preprocessed, curated, and balanced. This process properly prepares the datasets for real-time activity classification.
The MoViNet model stands out as a robust video classifier for streaming scenarios and real-time inference in action-recognition tasks. By contrast, models based on 2D frames, despite their efficiency in analyzing complete videos or individual frames in a streaming regime, show limitations in capturing temporal context, which can result in lower precision and inconsistencies between successive frames.
A more sophisticated approach uses 3D convolutional networks, which incorporate bidirectional temporal context and thus improve temporal accuracy and consistency. However, these networks may require more computational resources and are not ideal for processing continuous data streams because they depend on future information.
The distinctive architectural feature of the MoViNet model lies in its adoption of causal 3D convolutions along the time axis, resembling the layers.Conv1D configuration with the parameter padding='causal'. This design combines the advantages of previous approaches and enables effective analysis in streaming settings.
Additionally, the implemented MoViNet model requires a 5D RGB video tensor as input, with the structure [batch_size, num_frames, height_pixels, width_pixels, num_channels=3] (last value corresponding to 3 RGB channels). This configuration allows the model to analyze each frame within a broader context, thereby capturing the temporal and spatial relationships present in the video.
Causal convolution ensures that the output at time t is calculated only from inputs available up to that time. This streaming efficiency can be illustrated by analogy with RNNs, in which the state is propagated over time. In the context of MoViNet, this state is referred to as the stream buffer.
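The idea of a causal convolution with a propagated state can be illustrated in one dimension. The sketch below is a simplified analogy and not MoViNet code: it shows that an offline causal convolution and a step-by-step version with a small buffer of past inputs give identical outputs. All names are hypothetical.

```python
def causal_conv1d(signal, kernel):
    """Offline causal convolution: y[t] = sum_k kernel[k] * x[t - k].

    Inputs before the start of the signal are treated as zero (left padding),
    so y[t] never depends on future samples.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * signal[t - k]
        out.append(acc)
    return out


class StreamingCausalConv:
    """Processes one sample at a time, keeping past inputs in a buffer."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Past inputs, newest first; plays the role of a stream buffer.
        self.buffer = [0.0] * (len(self.kernel) - 1)

    def step(self, x):
        window = [x] + self.buffer
        y = sum(w * v for w, v in zip(self.kernel, window))
        self.buffer = window[:-1]  # drop the oldest input
        return y
```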
Training the MoViNet model through transfer learning is based on a previously curated and balanced dataset. It involves using MoViNet weights pre-trained on a broader dataset and then fine-tuning the model with the dataset specifically prepared for the task of action recognition in videos.
This process requires adapting the parameters to reflect the nuances and characteristics of the new dataset. The upper layers of the model are adjusted to suit the task at hand, while the deeper layers, which capture generic features, remain unchanged.
Figure 3 shows the accuracy progression during training of the audio and video models, using 20 epochs for the audio model and 10 epochs for the video model. The final confusion matrices are shown in Figure 4.
Training effectiveness is strongly influenced by the quality of the curated and balanced dataset. Selecting representative examples of each action class in sufficient quantity to avoid imbalance is fundamental to the success of transfer learning. This approach allows the model to generalize more accurately to previously unseen data.
After training the models, the next step is to convert them to a lighter format, such as TFLite. One common issue is operation mismatch, where certain operations present in the original models are not supported by the TFLite format, as occurs in the audio model with operations such as ComplexAbs and RFFT2D. The solution is to create equivalent models that use compatible operations. Conversion can also lead to a slight loss in inference quality, particularly when optimization techniques such as quantization are applied. Quantization represents floating-point numbers with fewer bits, reducing model size and inference cost while maintaining an acceptable level of precision, as described in [39].
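The trade-off introduced by quantization can be illustrated with a generic 8-bit affine scheme. This is a didactic sketch of the general principle, not the TFLite converter and not the procedure used in the prototype; the function names are hypothetical.

```python
def quantize_int8(values):
    """Affine quantization of floats to signed 8-bit integers.

    The represented range is widened to include 0.0 so that zero maps to an
    exact integer code. Returns the codes, the scale, and the zero point.
    """
    vals = list(values) + [0.0]
    lo, hi = min(vals), max(vals)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for constant input
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate float values from the integer codes."""
    return [(c - zero_point) * scale for c in codes]
```

Each value is stored in 8 bits instead of 32, and the reconstruction error stays within one quantization step (the scale), which is the sense in which precision remains acceptable.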
3.5. Knowledge Representation, Orchestration and High-Level Communication
DistSense is designed to avoid storing or transmitting raw sensitive audio/video data, given the growing concern about what sensitive information is collected and how it is used [40, 41]. Instead, nodes produce high-level events (type, confidence, timestamp, location, and node UUID). These events are (i) recorded in a distributed ledger for integrity and later cross-checking, and (ii) exposed to the user through the home automation hub (Home Assistant) via MQTT.
3.5.1. Distributed Ledger: Blockchain
The use of blockchain technology offers a secure and transparent approach to recording and validating the information collected by the system. Through its immutability and decentralization, it enables reliable storage of inferred activities and associated data while preserving information integrity and authenticity [42]. In addition, it allows the creation of tamper-evident records, increasing confidence in the activities inferred by the proposed system and providing an additional level of security.
Compared with lighter alternatives (e.g., centralized append-only logs or signed single-database journals), our choice of a replicated ledger is motivated by multi-node auditability and tamper-evidence without a single trust anchor. In DistSense, this design specifically supports (i) cross-node verification of high-level events, (ii) traceable historical queries for uncertainty resolution, and (iii) fault tolerance under node churn. This design choice comes with additional overhead and implementation complexity; therefore, DistSense stores only compact semantic events rather than raw streams.
Table 2 provides a structured comparison between the replicated ledger approach adopted in DistSense and a centralized append-only log, the most common simpler alternative.
The overhead difference is justified by the AAL use case: in a domestic monitoring context where the coordinator may be a low-cost device subject to failure, relying on a single trust anchor for the integrity of sensitive activity records is architecturally fragile. The measured propagation time (≤500 ms at 2-node scale) and payload size (≤∼1 kB) confirm that the additional overhead does not compromise near-real-time operation.
Although the architecture includes a coordinator role, the system is not centralized in data processing: inference remains fully local at each node, and event validation/storage is distributed across peers. The coordinator is an operational role for orchestration (MQTT publication and block-finalization workflow), dynamically elected and replaceable after failures, rather than a permanent central authority.
In this sense, the blockchain architecture is especially relevant to the DistSense system, as it offers a reliable and secure way to store and maintain a chronological record of captured activities. In addition, this architecture allows nodes to query the event log, which, in situations of uncertainty, facilitates more accurate inference based on information perceived by other nodes in the system.
This secure and decentralized technology is based on the structure of its blocks, which form the foundation of the chain. Each block in the blockchain follows a carefully defined structure, consisting of essential elements such as timestamp, hash, target, nonce, height, and transactions. The timestamp records the exact date and time of block creation and is essential for chronologically ordering events in the blockchain. This capability is crucial in monitoring scenarios in which the order of actions matters for data analysis. The hash, in turn, is a unique digital fingerprint derived from the block content using cryptographic hash algorithms. This compact and secure representation serves as an authentication and identification mechanism because, by linking each block to its predecessor through the hash, the blockchain builds an immutable chain of transactions and events [43]. The target parameter plays a key role in determining the complexity of the mining process. By establishing the degree of difficulty required to validate blocks, target regulates the frequency with which new blocks are added to the blockchain. In the context of monitoring, this parameterization is important for maintaining efficient and timely operation. Subsequently, the nonce, or “number only used once”, plays a crucial role in the PoW algorithm. This value is adjusted iteratively by miners until a valid hash is reached. The computational effort required by this process constitutes an additional security layer that is important for the confirmation and validation of transactions on the network [44].
Additionally, the height of a block indicates its position in the chain, defining the ordered sequence of recorded events. The concept of height provides a hierarchical perspective for data visualization and promotes understanding of the temporal evolution of monitored events.
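The block fields described above can be tied together in a minimal proof-of-work sketch. It assumes SHA-256 over a JSON serialization, an integer target (a hash is valid when its integer value is below the target), and an explicit previous_hash field to link blocks; these details and all function names are illustrative assumptions rather than the prototype implementation.

```python
import hashlib
import json

HASHED_FIELDS = ("timestamp", "height", "previous_hash", "target",
                 "nonce", "transactions")


def block_hash(block):
    """SHA-256 digest of the block content (the stored hash itself is excluded)."""
    payload = {k: block[k] for k in HASHED_FIELDS}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def mine_block(transactions, previous_hash, height, timestamp, target):
    """Increment the nonce until the block hash falls below the target."""
    block = {"timestamp": timestamp, "height": height,
             "previous_hash": previous_hash, "target": target,
             "nonce": 0, "transactions": transactions}
    while int(block_hash(block), 16) >= target:
        block["nonce"] += 1
    block["hash"] = block_hash(block)
    return block


def is_valid_link(block, previous_block):
    """Check linkage, height, hash consistency, and the proof-of-work condition."""
    return (block["previous_hash"] == previous_block["hash"]
            and block["height"] == previous_block["height"] + 1
            and block_hash(block) == block["hash"]
            and int(block["hash"], 16) < block["target"])
```

A larger target makes valid hashes easier to find and blocks more frequent; any later change to a recorded transaction changes the block hash and breaks the validity check.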
Transactions are essential components of the blockchain, recording transfers of information, goods, or contracts. In the DistSense system, transactions represent monitored data that are recorded immutably, providing a chronological sequence of events across the network and throughout the inference pipeline.
In this context, inference in the DistSense system consists of feeding preprocessed data into the models, enabling them to identify patterns, events, and anomalies. When the inference confidence exceeds a predefined critical threshold, the system records the identified event and its associated details, ensuring that only high-level information is stored.
When the inference confidence does not reach the predefined threshold, the node responsible for recording the activity consults the blockchain to obtain more precise information about recent events, restricted to a defined time interval and to the location where the activity was detected. In the current implementation, the same global threshold is used for all nodes and modalities. This procedure aims to improve detection quality and contextual understanding while still ensuring that only relevant high-level information is stored.
In addition, the event registration process compares each new event with the previously captured event. In the current implementation, an event is considered “significantly different” when at least one of the following changes occurs: event class, room/location tag, or confidence state crossing the configured decision threshold. Only such transitions are recorded, since they represent meaningful changes in monitored activity.
Thus, events identical to their predecessors are considered part of a continuous activity and are not stored, since they do not add relevant information about temporal evolution.
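The recording criterion can be expressed as a small predicate. The three conditions follow the text; the dictionary keys, the function name, and the use of "greater than or equal to" for the threshold comparison are assumptions, and the threshold is passed as a parameter because its value is not fixed here.

```python
def is_significant_change(previous, current, threshold):
    """Return True when the new event should be recorded.

    An event is recorded if there is no predecessor, or if the event class,
    the location tag, or the side of the decision threshold has changed.
    """
    if previous is None:
        return True
    crossed = ((previous["confidence"] >= threshold)
               != (current["confidence"] >= threshold))
    return (previous["event_type"] != current["event_type"]
            or previous["location"] != current["location"]
            or crossed)
```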
The identified event, together with details such as its type, date, location, and the UUID of the device that detected it, is integrated into a transaction and then transmitted to all nodes in the home network, as illustrated in Figure 5. Before sending the transaction, however, the system ensures the integrity and reliability of the data so that other devices can validate it.
To that end, the transaction is cryptographically protected: the message is hashed, and the resulting hash is digitally signed. The signed transaction is then sent to the other devices on the local network.
This procedure adds a layer of security by allowing receivers to verify that the data remain intact and originate from a legitimate node, while confidentiality during transmission is provided by the encrypted channels described in Section 3.3. The resulting cryptographic signature preserves transaction integrity as it is shared among network nodes.
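The hash-then-authenticate step can be sketched with the Python standard library. Because the standard library has no asymmetric signature primitive, the sketch uses HMAC-SHA256 with a shared key as a stand-in for the signature; this substitution, the shared-key assumption, and all names are illustrative and do not reflect the algorithm used in the prototype.

```python
import hashlib
import hmac
import json


def _digest(transaction):
    """SHA-256 digest of a canonical JSON serialization of the transaction."""
    body = json.dumps(transaction, sort_keys=True).encode()
    return hashlib.sha256(body).hexdigest()


def sign_transaction(transaction, key):
    """Hash the transaction, then authenticate the digest (HMAC stand-in)."""
    digest = _digest(transaction)
    tag = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return {"transaction": transaction, "digest": digest, "signature": tag}


def verify_transaction(envelope, key):
    """Recompute digest and tag; reject the envelope if either differs."""
    digest = _digest(envelope["transaction"])
    expected = hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()
    return (digest == envelope["digest"]
            and hmac.compare_digest(expected, envelope["signature"]))
```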
When the transaction reaches each node in the home network, it undergoes a rigorous validation process to ensure legitimacy and compliance with the rules of the DistSense system. If validation succeeds, the transaction is added to the list of pending transactions.
As the number of pending transactions grows and reaches the established limit, the coordinator node is notified, triggering the mining process. Mining involves solving computational challenges, culminating in the creation of a new block that groups previously validated transactions; the block is then incorporated into the blockchain, ensuring integrity in a secure and immutable manner.
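The pending-transaction mechanism can be sketched as a small pool that signals when mining should start. The class name, the validation callback, and the example limit are hypothetical, since the configured limit is not stated here.

```python
class PendingPool:
    """Holds validated transactions until the configured limit is reached."""

    def __init__(self, limit):
        self.limit = limit
        self.pending = []

    def add(self, transaction, is_valid):
        """Store a valid transaction; return True when mining should be triggered."""
        if not is_valid(transaction):
            return False  # invalid transactions are never queued
        self.pending.append(transaction)
        return len(self.pending) >= self.limit

    def drain(self):
        """Hand the queued transactions to the miner and empty the pool."""
        batch, self.pending = self.pending, []
        return batch
```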
For transparency regarding ledger overhead, transaction payloads exchanged in DistSense are compact JSON semantic events (typically ≤∼1 kB before transport/security headers). In our small-scale deployment, this overhead remained significantly lower than raw multimedia transmission and did not compromise near-real-time operation.
Thus, the DistSense system enables the acquisition, analysis, and storage of information originating from audio and video sources. Through ML algorithms, it can identify relevant events, such as anomalous sounds or activities of daily living. After processing, these events are immutably recorded on the blockchain, ensuring data authenticity and tamper-evidence.
3.5.2. MQTT Connection Between DistSense and Home Assistant (HA) Hub
The home automation platform HA was selected for its flexibility and adaptability. As an open-source solution, it can be tailored to specific smart home needs, enabling interface customization as well as the development of automations and integrations for a wide variety of devices from different sources. Its inherent modularity and extensibility help ensure a solution that can be adapted to individual preferences. The diversity of integrations offered is another crucial factor. HA is compatible with a wide range of devices and services, enabling the inclusion of most smart home elements in automation workflows. This attribute helps prevent vendor lock-in and supports smooth interconnection between heterogeneous devices. In addition, security is a key consideration underlying the preference for HA. The platform helps safeguard sensitive user data by maintaining local control over information and avoiding dependence on external servers for storage and processing, thereby reducing the risks associated with cyberattacks and information leaks. HA’s user-friendly interface makes it easy to interact with the platform, either through the mobile app or the web interface, as illustrated in Figure 6.
One of the essential elements of this implementation is the use of MQTT [29] for communication between the DistSense system and external consumers. In this system, the coordinator node is responsible for sending inferred events, encapsulated in JSON messages, to specific MQTT topics. In turn, HA acts as a subscriber, listening for messages on the relevant topics. This simplifies the transfer of information from the system to the user and enables efficient, continuous, real-time data sharing. In the present implementation, MQTT integration is evaluated functionally (correct delivery and triggering behavior). A dedicated stress characterization of end-to-end notification latency and reliability under heavy network load is left for future work.
This integration enables automated event visualization and downstream actions. Upon receiving an MQTT message, HA parses the JSON payload and triggers the corresponding automation (e.g., persistent notifications or rule-based actions), presenting the event type, location, and inference timestamp. Additional rules can be implemented in Python/YAML to support application-specific logic and tighter integration with other HA components.
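The JSON message published by the coordinator could take a form such as the one sketched below. The topic layout "distsense/events/<location>" and the payload keys are hypothetical; the sketch only builds the topic and payload and leaves publication to an MQTT client library.

```python
import json


def to_mqtt(event, base_topic="distsense/events"):
    """Build the (topic, payload) pair for a high-level event.

    Only semantic fields are included; no raw audio or video is transmitted.
    """
    topic = f"{base_topic}/{event['location']}"
    payload = json.dumps({
        "event_type": event["event_type"],
        "location": event["location"],
        "timestamp": event["timestamp"],
        "confidence": event["confidence"],
    }, sort_keys=True)
    # With an MQTT client such as paho-mqtt, the pair would then be sent
    # with client.publish(topic, payload).
    return topic, payload
```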
4. Experiments and Results
The analysis and evaluation of this distributed system is important to ensure not only its efficiency, but also its adaptability and reliability in a real operating environment. In this context, the evaluation is carried out in two distinct phases, to ensure that the system can address challenges both in a controlled and simulated environment (digital twinning) and in the complexities of the real world (real scenarios).
The first phase of testing focused on the individual evaluation of each module of the system, with the aim of ensuring that the implemented modules work together correctly and support the scalability and overall efficiency of the system, according to the research specifications. Also in this phase, the system was subjected to tests that simulated a specific use case: the detection of domestic hazards. This use case involves risky situations, such as water leaks or other common threats in the home environment. The simulated tests were conducted in controlled environments, where hazards were staged using videos and the system response was evaluated. It is crucial to verify that the system properly identifies simulated hazards and acts in accordance with established safety guidelines. This phase allows the evaluation of the algorithms and decision logic implemented to ensure accurate detection and effective performance of the system.
In the second phase, the system is subjected to an evaluation in a real environment, using two Jetson Nano devices (NVIDIA Corporation, Santa Clara, CA, USA). In this context, several types of noise and interference common in a domestic environment are introduced. These include background noise, variations in lighting, pet movement, and other real-world conditions. The goal is to evaluate the system’s performance under more challenging conditions, which can affect its ability to detect hazards and act appropriately.
Collaboration between devices to ensure more accurate and reliable event detection was evaluated under these more dynamic circumstances, and system scalability was analyzed as more devices were integrated into the local network.
4.1. DistSense Digital Twinning
Digital twinning offers significant advantages for simulating distributed sensor architectures and inference pipelines within smart homes, particularly when leveraging edge devices such as Jetson Nano. By creating a virtual replica of the system using Docker containers [45] and network emulation tools such as GNS3 [46], developers can model complex interactions between sensors, data flows, and AI inference processes before deploying them in real environments. This approach enables rigorous testing of scalability, resilience, and interoperability under diverse network and hardware conditions, while reducing the costs and risks associated with physical prototyping. Furthermore, integrating digital twins with containerized solutions enhances flexibility, reproducibility, and rapid iteration, ultimately accelerating innovation in smart home automation and intelligent edge computing.
4.1.1. DistSense Digital Twinning Implementation
Simulation is used as a cost-effective step prior to real deployment, enabling controlled evaluation of DistSense without dedicated hardware. We package the software stack into Docker containers and use a network-emulation environment (GNS3) to instantiate multi-node topologies and control connectivity between nodes. This setup supports repeatable experiments on node discovery, coordinator election, and message exchange under controlled network conditions before deploying the same containerized components on physical devices.
4.1.2. Case Study: Detection of Domestic Hazards
In the development of smart home systems, the unit-level evaluation of modules is an essential step to ensure effective and reliable operation. This approach proved to be particularly useful given the direct impact that the performance of each module has on the overall operation of the system. Functional testing by module focuses on evaluating the specific functionalities of each component of the system. This strategy involved applying test cases designed to verify the compliance of each module with the functional requirements established in the development process of the DistSense system.
Figure 7 demonstrates the interactions between the machine learning modules and knowledge-representation processing after a node is initialized and communication is established with other nodes in the network.
Operation of the DistSense modules starts with the network discovery module, which establishes initial connections. Each node in the network is equipped with audiovisual sensors that enable real-time detection of household activities. When a node detects an activity, two operating conditions arise, as illustrated in Figure 7.
First, if the confidence level for detecting the activity reaches a threshold predefined by the system, the captured event is recorded on the blockchain, along with crucial details such as the activity type, location, date, and time.
However, if the activity is detected below that confidence threshold, the system queries the blockchain to verify the last recorded event at the location where the activity is being detected, leveraging collaboration and distributed event logging. Events within a time window defined by the system for the same room are considered for collaborative validation, while older events are discarded to prevent outdated information from influencing inference. After the query, the flow leads to a new condition: whether the node can determine the event after consulting the blockchain. If it succeeds, the event is recorded on the blockchain, including all relevant details; otherwise, the flow terminates.
Once an event is registered on the blockchain, all other nodes validate the transaction. The coordinator node, elected dynamically among the available nodes, maintains exclusive communication with the Home Assistant (HA) platform. It publishes only aggregated, privacy-preserving events via MQTT, either periodically or when a hazard exceeds the threshold. This ensures timely and secure user notifications without exposing sensitive information or overloading the external integration.
This flowchart represents the generic operational process of the system, ensuring that information is stored securely on the blockchain and made available efficiently to users through HA. This approach was followed in the implemented and tested case.
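The decision flow described above can be summarized in a short sketch. The text does not specify how a node decides whether it can determine the event after consulting the ledger; the sketch assumes that a low-confidence detection is accepted only when it matches the most recent event recorded for the same room within the time window. The function name, dictionary keys, and parameters are hypothetical.

```python
def resolve_event(detection, ledger, threshold, window_s):
    """Return the event type to record, or None if the flow terminates.

    detection: dict with event_type, location, confidence, timestamp.
    ledger: iterable of previously recorded events with the same keys.
    """
    if detection["confidence"] >= threshold:
        return detection["event_type"]  # confident: record directly

    # Low confidence: consider only recent events for the same room.
    recent = [e for e in ledger
              if e["location"] == detection["location"]
              and 0 <= detection["timestamp"] - e["timestamp"] <= window_s]
    if not recent:
        return None
    latest = max(recent, key=lambda e: e["timestamp"])
    if latest["event_type"] == detection["event_type"]:
        return detection["event_type"]  # corroborated by another record
    return None
```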
Security plays an essential role in smart residential environments, and it is crucial to identify hazards such as floods and fires in a timely and reliable manner to safeguard residents. Implementing a distributed system that benefits from collaboration among nodes in the local network is a solid and efficient approach for early detection of domestic threats, reducing false alarms in cases of uncertainty. In this context, the ability of the DistSense system to recognize and alert users about detected risk situations was tested, contributing to improved safety in a smart home.
The tests for this use case were carried out in a simulation environment by creating two nodes, NODE-1 and NODE-2, and using video sequences depicting dangerous domestic situations, such as water leaking from a tap left open.
In this use case, the time period used to consider an action as a potential hazard was set to 30 s, assuming the user is not on site. This choice corresponds to the point at which the system stops its assessment after detecting a potentially dangerous situation. It reflects the need for the system to act within a realistic and practically useful time frame, ensuring an efficient response when necessary.
Videos with the same perspective but different levels of audio proximity were used to evaluate sound detection in depth. This approach ensured that, in ambiguous situations, the system could use other nodes to corroborate detections, reducing the likelihood of false positives and improving overall effectiveness. In this way, a balance was achieved between system sensitivity and the minimization of unwanted alarms in everyday life.
Careful consideration of false positives, together with a mechanism for querying other nodes in cases of uncertainty, demonstrates an effective approach to minimizing unnecessary alarms and avoiding alerts in everyday situations.
It was observed that NODE-1 did not reach the certainty threshold, achieving an audio-inference confidence of only 43%. However, through network collaboration and by querying the blockchain, it was able to obtain information about the last event recorded in that room within the time period defined by the system. This collaborative analysis confirmed that the activity in question was the sound of water from a tap.
The tasks and interactions between the modules follow the data flow shown in Figure 7. The main difference lies in HA automation: the user is notified immediately if a dangerous situation is detected, and only the coordinating node communicates with HA.
When the sensors identify anomalous concentrations in the domestic environment, an alert is issued to the user through the HA platform, demonstrating the system’s ability to identify emerging hazards and trigger preventive responses. To illustrate this, a rule was created in HA such that, when an activity considered dangerous remains active for more than 30 s, a notification is sent to the user via the HA platform GUI.
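The persistence rule can be expressed as a small timer. In the reported setup the rule is configured in HA; the Python sketch below only mirrors its logic under the 30 s setting, and the class and method names are hypothetical.

```python
HAZARD_PERSIST_S = 30.0  # persistence period stated in the text


class HazardTimer:
    """Signals a notification once a hazard has stayed active long enough."""

    def __init__(self, persist_s=HAZARD_PERSIST_S):
        self.persist_s = persist_s
        self.active_since = None
        self.notified = False

    def update(self, hazard_active, now):
        """Return True exactly once per episode, after more than persist_s."""
        if not hazard_active:
            self.active_since = None  # episode ended: reset the timer
            self.notified = False
            return False
        if self.active_since is None:
            self.active_since = now
        if not self.notified and now - self.active_since > self.persist_s:
            self.notified = True
            return True
        return False
```

Resetting the timer whenever the hazard disappears means that short, transient detections never reach the user, which is consistent with the goal of limiting unnecessary alarms.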
Timely detection of domestic threats is essential for safeguarding residents and property. Through collaboration among distributed nodes, the system can identify risk situations and promote immediate responses, minimizing potential losses.
In addition, integration with the HA platform and storage on the blockchain ensure that users stay informed about dangerous situations, even when they are away.
This use case highlights the importance of reducing false alarms through collaboration among system nodes in critical contexts. The ability to detect water leaks and proactively alert users can help protect lives and reduce property damage. The system’s distributed nature also increases resilience, as cooperation between nodes enables more accurate hazard detection.
4.2. Case Study: Detection of Domestic Activities with Variations in Audiovisual Noise
To evaluate the system in a real environment, a hardware setup was considered. The selected hardware comprised two Jetson Nano devices integrated with audiovisual sensors, which were installed and initialized in two strategic locations in the house, covering the areas where domestic activities are most frequent (the kitchen and the living room). Detailed specifications of these devices are provided in Table 3. To perform the tests, the container, which includes all features developed throughout the research, was executed on each Jetson Nano device. The devices initialized successfully and integrated into the local network without setbacks in this real-world setup.
Identifying users’ day-to-day activities plays an essential role in optimizing the smart home experience. This case study aims to demonstrate the operation and cooperation of the various modules of the DistSense system in a real environment. Based on collaboration among multiple nodes, the system enables accurate capture and interpretation of information related to residents’ usual activities, such as watching television, reading a book, eating meals, and performing household chores (e.g., washing dishes). Sensor data are fused, and advanced algorithms are applied to accurately infer user activities. In the implemented case studies, two typical activities of daily living were considered:
Washing/cleaning dishes.
Reading a book.
To determine and classify the user’s daily activities, the machine learning module plays a central role in the DistSense system. However, its effectiveness can be compromised by challenges in dynamic environments, such as occlusions, variations in lighting, and acoustic-noise interference.
Under these conditions, the system faces obstacles that can affect the accuracy of activity identification and classification. Occlusions, for example, can hide parts of the body, making action detection more difficult. Variations in lighting can hinder image interpretation and environmental analysis. In addition, acoustic noise can degrade the quality of the collected data, affecting voice and audio recognition accuracy.
Figure 8a,b illustrate two smart-home rooms (the kitchen and the living room, respectively) captured from different perspectives by two distinct nodes in the network. The sensors were distributed to cover as many angles as possible, thereby minimizing the impact of occlusion.
The reported values in this section are presented as representative operational outcomes from the implemented scenarios, not as full statistical performance claims over a large benchmark campaign. We therefore avoid over-generalization and use these results primarily to demonstrate system behavior under collaborative and non-collaborative conditions. In the first scenario, shown in Figure 8a, the system detected the Washing/cleaning dishes activity performed by the user with a confidence of approximately 78%. Because this value was above the established certainty threshold, the responsible node recorded the event on the blockchain.
Subsequently, these data were distributed to the node’s peers so that the event inserted on the blockchain could be validated by all nodes in the network. After validation, information about the detected event was sent to HA by the coordinator node to inform the user and store the data. This storage later enables more detailed analysis of the activities performed by the user throughout the day.
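The record-then-validate step can be sketched as a hash-linked list of events. This is an illustrative sketch only: the block fields, the SHA-256 linking, and the function names `append_event` and `validate` are assumptions and are not taken from the DistSense implementation. The 60% threshold is the value reported in Section 4.4.

```python
import hashlib
import json
from typing import Dict, List

THRESHOLD = 0.60          # certainty threshold reported in Section 4.4
GENESIS_HASH = "0" * 64   # assumed placeholder for the first block's predecessor


def block_hash(body: Dict) -> str:
    """Deterministic SHA-256 digest of a block body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_event(chain: List[Dict], node: str, room: str,
                 activity: str, confidence: float, timestamp: float) -> bool:
    """Append an event only if it was detected above the threshold."""
    if confidence < THRESHOLD:
        return False
    body = {
        "node": node, "room": room, "activity": activity,
        "confidence": confidence, "timestamp": timestamp,
        "prev": chain[-1]["hash"] if chain else GENESIS_HASH,
    }
    chain.append({**body, "hash": block_hash(body)})
    return True


def validate(chain: List[Dict]) -> bool:
    """Check that every peer would run: hashes match and blocks link correctly."""
    prev = GENESIS_HASH
    for block in chain:
        body = {k: v for k, v in block.items() if k != "hash"}
        if body["prev"] != prev or block_hash(body) != block["hash"]:
            return False
        prev = block["hash"]
    return True
```

In this sketch a peer that receives the chain calls `validate` before accepting it; any modification of a stored event changes its digest and causes validation to fail.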
In the second scenario, presented in Figure 8b (living room), one of the nodes responsible for detecting the activity (NODE-1) did not achieve the minimum reliability level for identifying the audiovisual activity, reaching a confidence of only approximately 24%. In this context, NODE-1 collaborated with another node (NODE-2) that observed the same scene from a different perspective and achieved a confidence of approximately 82%. Because NODE-1 did not reach the minimum certainty threshold, it queried the blockchain to obtain additional information.
This additional information, obtained by querying the blockchain, allows the node to more accurately identify the ongoing activity, even if it is not fully visible or clear to any single node. This ability to cooperate and consult external sources to improve decision-making in a smart residential context is particularly useful when the environment is constantly changing.
This process enriched understanding of the ongoing situation, based on records kept by other DistSense nodes, and increased the accuracy of activity detection.
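The fallback described above can be sketched as follows. The sketch assumes that ledger entries carry a room, a node identifier, and a timestamp, and that the "time period defined by the system" is a fixed window. The `Event` type, the function `resolve`, and the 60 s default window are hypothetical choices made for illustration.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    node: str
    room: str
    activity: str
    confidence: float
    timestamp: float  # seconds


def resolve(local: Event, ledger: List[Event],
            threshold: float = 0.60, window_s: float = 60.0) -> Optional[str]:
    """Return the activity label, consulting the ledger when local confidence is low.

    window_s is an assumed value; the text does not state the window length.
    """
    if local.confidence >= threshold:
        return local.activity
    # Keep only peer events from the same room that fall inside the time window.
    recent = [
        e for e in ledger
        if e.room == local.room
        and e.node != local.node
        and 0 <= local.timestamp - e.timestamp <= window_s
    ]
    if not recent:
        return None  # uncertainty remains unresolved
    return max(recent, key=lambda e: e.timestamp).activity
```

With the values from the second scenario (24% at NODE-1, 82% recorded by NODE-2), the function returns the label recorded by NODE-2; with an empty or stale ledger it returns `None`.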
Additionally, the DistSense system addresses acoustic noise. Beyond detecting audio-based activities (e.g., water running from a tap), it applies noise-filtering techniques to remove unwanted interference such as background noise and echoes, ensuring that only information relevant to the user’s activity is considered during inference.
Through collaboration among distributed nodes, the system can infer residents’ actions based on environmental changes captured by audio-visual sensors.
As the system continues to operate, information is stored sequentially on the blockchain and on the HA platform, as mentioned above, enabling data analysis and interpretation of activity patterns over time. Identifying patterns in daily activities can provide tangible benefits, resulting in personalized experiences in which individual preferences and routines are carefully considered.
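As a simple illustration of how stored events could support pattern analysis, the sketch below counts activities per hour of day. It assumes events are available as `(timestamp, activity)` pairs with UNIX timestamps interpreted in UTC; the function name `activity_by_hour` is hypothetical, and the analysis actually applied to the HA history is not specified in the text.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Iterable, Tuple


def activity_by_hour(events: Iterable[Tuple[float, str]]) -> Counter:
    """Count how often each activity was recorded in each hour of the day (UTC)."""
    counts: Counter = Counter()
    for timestamp, activity in events:
        hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
        counts[(hour, activity)] += 1
    return counts
```

Recurring peaks in such a table (for example, dish washing shortly after typical meal times) are the kind of routine that later personalization could build on.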
The results demonstrate the system’s ability to support device collaboration when determining user activities, reducing false positives and improving overall efficiency. The transition to a real residential environment added authenticity and complexity. Smart homes are dynamic and can involve a wide variety of activities performed by different users, with individual preferences increasing the heterogeneity of the sensing task and requiring an adaptable system.
4.3. Deployment Characterization
To characterise the practical deployment behaviour of DistSense, we measured key operational metrics on the two Jetson Nano devices used in the real-environment evaluation (Section 4.2). Table 4 summarises the results.
These measurements indicate that DistSense operates within near-real-time constraints on commodity edge hardware. The blockchain transaction overhead (a payload of roughly 1 kB or less, with propagation under 500 ms) is negligible relative to the raw audiovisual data that would otherwise need to be transmitted in a centralised alternative.
Table 5 provides a structured scalability analysis for each DistSense component, distinguishing empirically validated behaviour (at n = 2 nodes) from theoretical projections for larger deployments.
The physical validation performed was limited to n = 2 nodes. Scaling claims for n > 2 are theoretical projections and are explicitly framed as such. The Bully algorithm’s worst-case message complexity motivates investigation of alternative consensus strategies for larger deployments, identified as future work.
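To make the scaling concern concrete, the sketch below counts messages in a textbook Bully election in which the lowest-ID node starts the election and all n nodes are alive, which is the usual worst case and grows as O(n²). The accounting is an assumption: one ELECTION and one OK message per pair, plus one COORDINATOR broadcast. It is not derived from the DistSense implementation, and the function name is hypothetical.

```python
def bully_worst_case_messages(n: int) -> int:
    """Count messages when node 0 (lowest ID) starts an election among n live nodes."""
    if n < 1:
        raise ValueError("n must be at least 1")
    started = set()   # nodes that have already run their own election round
    pending = [0]     # nodes that still have to start an election
    messages = 0
    while pending:
        p = pending.pop()
        if p in started:
            continue
        started.add(p)
        for q in range(p + 1, n):
            messages += 2      # ELECTION from p to q, OK reply from q to p
            pending.append(q)  # q then starts its own election
    messages += n - 1          # COORDINATOR from the highest node to all others
    return messages
```

Under this accounting the total is n² − 1 messages: 3 for the validated two-node deployment, but 99 for ten nodes, which illustrates why alternative strategies become attractive as the network grows.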
4.4. Component Contribution Analysis
To assess the individual contribution of each architectural layer of DistSense, we evaluate three progressive system configurations across the two scenarios presented in Section 4.2:
- (a) Single-node, no collaboration: only the local inference model is used; no peer consultation or blockchain query is performed.
- (b) Multi-node, no blockchain: multiple nodes observe the same scene simultaneously, but no distributed ledger is used. Each node acts independently.
- (c) Full DistSense (multi-node + blockchain): the complete system as described in Section 3. When local confidence is below the certainty threshold of 60%, the node consults the blockchain for recent peer-recorded events in the same location.
Results are presented in Table 6. For each configuration and scenario, we report the detection confidence achieved by the primary node and whether the uncertain event was resolved.
The results demonstrate two distinct contributions. First, comparing (a) and (b) shows that multi-node coverage alone improves detection for occluded scenes: NODE-2’s unoccluded view achieves 82% confidence on the reading-book activity, which is immediately available to the network. Second, comparing (b) and (c) shows that blockchain-assisted uncertainty resolution is decisive for NODE-1: without the ledger query, NODE-1 cannot resolve the reading-book activity (24%, below threshold) and terminates without a classification; with the full system, it correctly classifies the activity by retrieving NODE-2’s blockchain record. These results are derived from representative operational runs in our prototype deployment, not from a large-scale statistical benchmark; claims are intentionally bounded to the demonstrated two-node scale.
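The difference between the three configurations can be expressed as a small decision function. The sketch below uses the confidence values quoted in the text for the reading-book scenario (24% for NODE-1, 82% for NODE-2) and the 60% threshold. The function `outcome`, the configuration labels, and the dictionary layout are hypothetical, and the sketch does not reproduce Table 6.

```python
from typing import Dict, Optional, Tuple

THRESHOLD = 0.60  # certainty threshold stated for configuration (c)

Observation = Tuple[str, float]  # (activity label, confidence)


def outcome(config: str, primary: str,
            observations: Dict[str, Observation]) -> Optional[str]:
    """Return the label the primary node ends up with, or None if unresolved.

    config is "a" (single node), "b" (multi-node, no ledger) or "c" (full system).
    """
    label, confidence = observations[primary]
    if confidence >= THRESHOLD:
        return label
    if config != "c":
        # (a) and (b): there is no ledger to consult, so the event stays unresolved.
        return None
    # (c): peers that detected the event above the threshold have recorded it.
    ledger = [
        (peer_label, peer_conf)
        for node, (peer_label, peer_conf) in observations.items()
        if node != primary and peer_conf >= THRESHOLD
    ]
    if not ledger:
        return None
    return max(ledger, key=lambda entry: entry[1])[0]
```

In configurations (a) and (b) the function returns `None` for NODE-1, whereas in configuration (c) it returns the label recorded by NODE-2, which matches the qualitative behaviour described above.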
4.5. Discussion
The evaluation of the modules that make up the DistSense system clarified the performance and effectiveness of its core components for activity recognition in a distributed smart home setting. The use cases were selected to reflect module integration and collaboration in contexts where DistSense is most useful to users. In the first use case, identifying users’ daily activities emerged as a critical facet for delivering personalized and efficient experiences in smart home environments. However, it is imperative to recognize the environmental complexity that characterizes domestic settings. These spaces are highly dynamic, and multiple factors, such as occluded viewing angles, substantially influence the interpretation of users’ actions. In addition, environmental variables, such as changes in lighting, background noise, and the presence of diverse objects, contribute to contextual complexity.
In addition, the variety of activities carried out within a home and individual user preferences further increase the heterogeneity of the task. In this scenario, system adaptability is of paramount importance. These challenges were mitigated through collaboration among system nodes, improving the accuracy of domestic activity detection.
Subsequently, by detecting activities over time, the system can adapt to user-specific patterns and preferences, which is critical for achieving smart home environments that respond effectively to individual needs.
In the second use case, focused on detecting domestic hazards, the system demonstrated effectiveness in the early identification of risk situations such as water leakage. However, the breadth of hazard scenarios introduced additional challenges because domestic hazards can vary considerably. Accurately detecting these hazards and issuing alerts without false positives or negatives posed a significant technical challenge. Integration with the HA platform and storage on the blockchain were crucial for protecting user data and ensuring effective communication of dangerous situations, even in the absence of users.
It is important to note that a fundamental requirement of the DistSense system is the protection of user data. To achieve this goal, data processing was carried out locally in a distributed manner, minimizing exposure of sensitive user information. In addition, storage on the blockchain and queries to the blockchain in cases of uncertainty in activity detection and classification reduced false positives and increased the reliability of the system in determining user activities.
In both use cases, the efficient interaction between system modules was evident. Collaboration among sensor nodes played a key role in identifying complex activities.
The implementation of machine learning algorithms also proved valuable, allowing the system to adapt and continuously improve its ability to detect and interpret patterns. This learning capability makes the DistSense system more resilient and accurate in detecting hazards in smart home environments.
Scope, Limitations, and Threats to Validity
In this study, the empirical validation is limited in scale and is primarily aimed at demonstrating functional feasibility and system behavior in representative scenarios, rather than establishing exhaustive statistical superiority over all alternative architectures. Therefore, claims are intentionally bounded to deployment-oriented evidence from the implemented prototype.
Three limitations are particularly relevant. First, the number of real-deployment scenarios and nodes is limited, which restricts external validity for larger and more heterogeneous households. Second, the current evaluation emphasizes integrated system behavior rather than a full ablation study isolating every design choice (e.g., collaborative query, election policy, ledger layer). Third, confidence fusion is intentionally simple and fixed-parameter in this version, prioritizing transparency and reproducibility over model-complexity exploration.
To mitigate over-interpretation, we report explicit operating parameters and measured prototype-level timings/overheads, and we frame broader generalization as future work.
From a security perspective, relevant residual risks include credential leakage, compromised edge nodes, and message replay attempts in misconfigured networks. Current mitigation relies on authenticated channels, signatures, and distributed validation; however, hardening measures such as key rotation automation, secure element integration, and formal threat-model validation are left for future work.
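As an illustration of how message replay can be rejected, the sketch below combines a message authentication tag with a timestamp and a nonce. It uses HMAC-SHA256 with a shared key as a stand-in: the signature scheme, key management, and message format used by DistSense are not specified here, so the functions `sign` and `accept`, the field names, and the 30 s freshness limit are all assumptions.

```python
import hashlib
import hmac
import json
from typing import Dict, Set

MAX_AGE_S = 30.0  # assumed freshness limit


def _tag(key: bytes, body: Dict) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the message body."""
    encoded = json.dumps(body, sort_keys=True).encode()
    return hmac.new(key, encoded, hashlib.sha256).hexdigest()


def sign(key: bytes, payload: Dict, nonce: str, timestamp: float) -> Dict:
    """Attach a nonce, a timestamp, and an authentication tag to a payload."""
    body = {"payload": payload, "nonce": nonce, "timestamp": timestamp}
    return {**body, "tag": _tag(key, body)}


def accept(key: bytes, message: Dict, seen_nonces: Set[str], now: float) -> bool:
    """Accept a message only if it is authentic, fresh, and not seen before."""
    body = {k: message[k] for k in ("payload", "nonce", "timestamp")}
    if not hmac.compare_digest(_tag(key, body), message["tag"]):
        return False  # forged or modified
    if abs(now - message["timestamp"]) > MAX_AGE_S:
        return False  # too old (or too far in the future)
    if message["nonce"] in seen_nonces:
        return False  # replayed
    seen_nonces.add(message["nonce"])
    return True
```

A receiver keeps `seen_nonces` per sender; because stale messages are rejected by the timestamp check, nonces older than `MAX_AGE_S` can be discarded to bound memory use.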
5. Conclusions
This paper presents and evaluates a design-science contribution for distributed monitoring in smart-home environments, instantiated in the DistSense system implementation. The system integrates modules for discovery, communication, machine learning, processing, and knowledge representation. During the development and integration of these modules, user data privacy and security were treated as central requirements. The need for automated device discovery motivated the use of encryption algorithms to ensure secure and reliable communication only among legitimate devices. The application of ML techniques to enhance analysis further highlighted the need to safeguard training data and protect models from potential malicious attacks. In addition, secure communication between system modules proved essential to prevent the interception of sensitive data and the falsification of messages. The use of intrusive sensors, such as cameras and microphones, requires additional care regarding privacy and data integrity. The DistSense system was designed around these requirements, ensuring that sensitive data such as images and audio are processed locally while only high-level data are stored. However, security is an ongoing effort, and the system must be continually evaluated and updated to remain resilient and ensure the safety of user data in the face of evolving threats.
The distributed architecture itself presented challenges during implementation. Coordinating actions between different nodes without compromising data integrity or confidentiality required robust security protocols and the adoption of cryptographic algorithms. The knowledge processing and representation module, meanwhile, highlighted the importance of ensuring data consistency and accuracy in dynamic environments. The use of ML techniques to improve detection in domestic environments also reinforces the relevance of this technology for optimizing the implemented audiovisual models. Although model training accuracy is satisfactory, challenges still arise due to variations in the smart-home environment. Collaboration and consultation of event history on the blockchain emerged as essential elements because they reduced false positives in situations in which noise fluctuations can be harmful.
Technologies for simulating intelligent environments played a key role throughout system development, allowing errors to be identified and system functionality to be optimized efficiently, thereby saving time compared with tests in a real environment during the development phase.
Although system validation was achieved through functional testing and the implementation of use cases, the distributed approach adopted in the DistSense system may also inform future projects focused on monitoring smart home environments in which spaces and conditions change continuously. However, as technology advances and cyber threats become more sophisticated, cross-device security and collaboration remain ongoing efforts; the system must be continually evaluated and updated to remain resilient and ensure the ongoing security of user data.
Several directions for future work are identified. First, we plan a dedicated ablation study to quantify the contribution of each DistSense component (e.g., multimodal inference, node collaboration, and blockchain-assisted validation) to overall detection performance and robustness. A second direction is the automatic detection of each node’s location within the system. This could be achieved through computer vision techniques, allowing each device to infer the specific room in which it is located. By analyzing similarities in images captured from different perspectives, it may be possible to infer that devices are co-located when the same elements are detected. This would avoid pre-assigning node locations, thereby simplifying configuration and optimizing system operation. Another direction is to improve accuracy through collaboration between nodes during activity detection, which is critical in dynamic and complex environments. This requires the development of more advanced computer vision and data-processing algorithms. In addition, in the context of home activity detection, the promising results obtained with the models can be further improved through a more diverse dataset encompassing a wider variety of smart home scenarios. Expanding the system’s ability to identify a broader range of risk situations, such as fires or intrusions, requires the integration of additional sensors and the development of specialized algorithms. Another aspect to explore is the long-term identification of complex activity patterns. For example, it may be possible to assess whether a user is adopting more sedentary or less sociable behaviors based on activity history. This information could help infer changes in emotional state, such as signs of depression, and trigger alerts for the user and their caregivers or family members about changes in activity patterns.
Finally, usability studies and evaluations of end-user expectations can contribute to a more satisfactory experience and a better understanding of user needs.