1. Introduction
Recent advances in sensing hardware, embedded computation, and intelligent data processing have created new opportunities for building autonomous monitoring systems. Yet, despite these strides, unifying heterogeneous data sources into a coherent sensing workflow remains a key challenge. Conventional Internet of Things (IoT) nodes are well suited to high-frequency scalar measurements (temperature, humidity, soil conditions, and other environmental indicators), but they deliver single-point signals with little spatial context. In contrast, contemporary machine vision systems offer rich visual information and detailed scene structure, yet they are usually deployed as standalone subsystems that do not interface with wide-scale IoT architectures. This separation fragments data streams and prevents monitoring systems from providing a unified view of complex conditions. To fill this gap, this paper proposes a systematic framework that integrates machine vision and IoT sensors in a single, end-to-end monitoring architecture. Instead of treating visual and scalar measurements as independent modalities, the framework captures, processes, and fuses them into a cohesive representation that supports context-aware reasoning and adaptive control. We leverage the visual information provided by imaging systems and the temporal precision offered by IoT networks to develop a more comprehensive and interpretable description of the field environment.
1.1. Machine Vision and IoT Sensing Integration Challenges
While each sensing technology has advanced rapidly, the interplay between machine vision and IoT networks remains poorly exploited. Environmental sensor arrays are widely used to measure temporal trends and local conditions accurately [1], but their single-parameter architecture limits how the system can reason about spatial patterns or visual cues. By contrast, deep learning-based machine vision models, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), effectively capture semantic structure, object appearance, and contextual features in images [2,3,4]. These vision pipelines, however, tend to operate in isolation and generate information streams that cannot interface with the environmental data produced by IoT sensors. Without coordinated processing, the system has no pathway for relating visual events to the surrounding scene conditions. In the absence of a fusion mechanism, for example, a visual anomaly cannot be immediately associated with a concurrent temperature or humidity fluctuation. Prior research has demonstrated the advantages of combining data types [5], but a scalable, systematically designed architecture that bridges vision and IoT sensing along the entire processing pipeline is still lacking. A comprehensive framework is needed to enable cross-modal reinforcement, enhance interpretability, and improve the robustness of automated monitoring systems.
1.2. Research Gaps and Proposed Framework
Current monitoring approaches tend to prioritize one modality over the other: IoT-centered systems emphasize energy efficiency and temporal measurements, while vision-centered systems focus on extracting semantic richness from images [6]. This split has left a distinct research gap: the lack of a structured, end-to-end framework that supports tightly coordinated convergence between vision and IoT data for cooperative, context-sensitive intelligence. Whereas most existing systems perform fusion at the application layer [7], our work proposes a structured pipeline composed of well-defined architectural components that integrate low-level synchronization, cross-modal representation learning, and edge–cloud co-inference. To close this gap, we present a unified architecture that combines machine vision and IoT sensing in a single model. The framework goes beyond parallel data collection by adding joint inference, multimodal representation learning, and adaptive response mechanisms suited to field applications. The major contributions of this work are summarized as follows:
Unified Integration Architecture: We develop a comprehensive architecture that defines the functional components and data-flow synchronization required to integrate visual and IoT sensing streams at scale, from the sensor level upwards, enabling temporally aligned joint inference.
Joint Multimodal Inference: We propose learning techniques that perform inference directly on fused data, using environmental conditions to enhance visual interpretation and vice versa, resulting in more robust and context-accurate predictions for the whole system.
Cross-Modal Knowledge Transfer: The framework includes a mechanism for transferring learned representations between vision and IoT models, where features from one modality regularize and strengthen learning in the other to improve generalization.
Adaptive Edge–Cloud Coordination: We design a closed-loop operating strategy where real-time inference is computed at the edge, while the cloud resources maintain fine-tuning that allows periodic optimization of the model in response to changing field contexts.
The remainder of this paper is structured as follows:
Section 2 reviews prior work on multimodal sensing and integration methodologies.
Section 3 presents the proposed architecture and describes its core components.
Section 4 provides experimental evaluation across multiple tasks.
Section 5 discusses the broader implications, limitations, and potential extensions of the framework, and
Section 6 concludes the study.
2. Related Works
The development of multimodal monitoring systems has largely proceeded along three separate paths: scalar sensing networks, imaging-based monitoring, and, more recently, AI-driven vision analytics. Although each paradigm has evolved substantially and informed environmental and agricultural monitoring, they continue to develop in isolation. This separation constrains the field's ability to establish coordinated designs that combine the spatial richness of visual data with the temporal precision of IoT measurements. In this section, we summarize these main pillars of work and identify the critical need for an integrated sensing paradigm that bridges machine vision and IoT-driven data acquisition.
2.1. Scalar Sensing Technologies for Monitoring Systems
Scalar sensors remain cornerstones of most IoT monitoring projects, providing continuous measurements of parameters such as temperature, humidity, and soil moisture [6]. Because they are inexpensive and reliable, they have become an accepted tool in environmental and agricultural applications. However, because these sensors are confined to fixed locations, their spatial coverage is limited: a single node can only characterize its immediate surroundings, even though environmental conditions may differ drastically over short distances [8]. Dense sensor placement is needed to achieve wider coverage, but it raises deployment costs, complicates system operation, and increases network maintenance effort. Finally, scalar sensors cannot detect structural or visual cues such as discoloration, surface texture changes, or morphological patterns, which are often used to identify environmental stress or abnormal behavior [5]. Although excellent in temporal and numerical accuracy, these tools provide little interpretive information for deep situational understanding. This limitation demonstrates the importance of employing complementary sensing modalities to capture spatial and contextual information [9].
2.2. Technologies for Machine Vision and Imaging
Machine vision approaches address the spatial constraints of scalar sensing. Ground-based cameras, aerial platforms, and multispectral or hyperspectral imaging systems provide detailed visual information at different scales [10]. For instance, UAVs provide widespread coverage and are used for plant investigations, anomaly detection, and resource assessment tasks. However, vision-led approaches have their own limitations. UAVs must comply with airspace regulations, are weather-sensitive, and face computational load and flight endurance challenges. Ground-based imaging devices are stable but cover only narrow monitoring areas and often do not integrate directly with larger sensing systems. As a result, because visual and scalar data typically reside in separate communication channels, unified analysis and collective decision support remain rare.
2.3. AI for Multimodal Monitoring
Artificial intelligence has greatly advanced vision- and IoT-based monitoring systems. Deep learning architectures, such as Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and hybrids, have demonstrated strong performance on classification, detection, and segmentation tasks [2,4,11]. CNNs excel at identifying local patterns, for instance, while transformer-based models capture long-range dependencies and contextual relationships [12]. Despite these strengths, the vast majority of AI-driven solutions are tailored to a single modality. They are typically trained on controlled data with limited environmental variation and may struggle to generalize across different operational domains [5]. Deployment at the edge is also constrained, because complex models are difficult to run efficiently under tight computational budgets. In addition, the interoperability of visual and scalar data streams is seldom considered in mainstream solutions, even though combining both would enable stronger, context-dependent monitoring [1,13]. This situation highlights the demand for adaptable, scalable hybrid architectures capable of leveraging the synergy between modalities.
2.4. Multimodal Fusion Frameworks
More recently, architectures that integrate visual data with heterogeneous sensing modalities have attracted growing attention. For example, ref. [7] explores a smart traffic management system combining video analytics and IoT-based road sensors. The study provides strong evidence of the benefit of combining visual and scalar observations, but it uses a late fusion scheme in which each modality operates independently before being merged at the decision layer, which limits synchronized feature interactions and low-level cross-modal alignment. Edge–cloud collaboration systems constitute another branch of this field. Sathupadi et al. [14] present an edge–cloud synergy framework for real-time processing of industrial sensor network data. The system distributes computation efficiently and minimizes latency, but it focuses on scalar sensor streams and does not integrate visual input or joint cross-modal representations. More broadly, recent surveys such as that of Mondal et al. [15] underscore the increasing interest in multimodal event detection and review the common data fusion strategies adopted across domains. Beyond dedicated environmental sensors, there is also growing interest in using the communication infrastructure itself as a sensing modality. The field of Integrated Sensing and Communications (ISAC), for instance, investigates how wireless channel data can be used for perception; ref. [16]'s work on ISAC imaging for 6G using ray tracing is a key example. While our work fuses data from dedicated cameras and IoT sensors, this illustrates the expanding universe of data sources available for environmental monitoring.
Nevertheless, most of the schemes in the literature operate at the application layer and tend to treat each modality as an independent source of evidence. They do not account for the need for fine temporal synchronization, shared feature spaces, or inter-modal knowledge transfer. Overall, existing work confirms the benefits of multimodal sensing, but little effort has been devoted to deep temporal fusion across visual and IoT data streams. Without a common architecture for synchronized data acquisition, joint representation learning, and adaptive edge–cloud coordination, systems remain fragmented and miss the full complementarity of vision and IoT sensing. This void motivates the integrated framework described in this work, which incorporates vision and IoT sensing within a common edge–cloud context to provide complementary contributions to environmental interpretation, model learning, and decision support. As shown in Table 1, the framework enables fused data processing, cross-modal representation transfer, and adaptive feedback tuned to field characteristics. In contrast to previous work, which treats visual and scalar sensing as distinct or loosely associated operations, our design prioritizes deep integration throughout the pipeline to allow scalable, context-aware, and self-improving monitoring across many applications.
3. Methodology
This section introduces the unified framework designed to combine machine vision and IoT sensing within a coordinated monitoring pipeline. The architecture establishes a continuous flow of information from heterogeneous data acquisition to multi-task deep learning inference and distributed edge–cloud computation. As illustrated in
Figure 1, the system is organized around three tightly connected layers that ensure consistent operation even under varying network conditions and environmental dynamics.
3.1. System Architecture Design
The proposed architecture follows a three-layer organization intended to support scalable, low-latency multimodal processing. The sensing layer forms the foundation of the system and includes fixed and mobile RGB cameras (12–16 MP, Sony IMX series sensors; Sony Semiconductor Solutions Corp., Tokyo, Japan), supported by synchronized environmental sensor modules, together with distributed IoT nodes measuring environmental variables such as temperature, humidity, and other scalar conditions. To ensure consistent temporal alignment across all sensing devices, the cameras and IoT nodes are synchronized using a lightweight IEEE 1588 Precision Time Protocol (PTP) service. This mechanism assigns a unified timestamp to each capture event, allowing visual frames and sensor packets to be matched even when network delays occur. A small buffer at the edge device automatically discards packets that exceed an allowable drift threshold (typically 30–50 ms), preventing delayed sensor readings from being misaligned with the corresponding image ROIs. This design ensures that downstream models receive tightly synchronized multimodal inputs.
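As a concrete illustration of this pairing step, the following minimal Python sketch matches PTP-timestamped frames to the nearest sensor packet and drops pairs whose residual drift exceeds the 30–50 ms window mentioned above. The names (`Frame`, `SensorPacket`, `pair_streams`) are hypothetical and not part of the described system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DRIFT_THRESHOLD_MS = 40.0  # illustrative value within the 30-50 ms window described above

@dataclass
class Frame:
    t_ms: float   # unified PTP timestamp of the capture event
    roi: object   # image ROI payload

@dataclass
class SensorPacket:
    t_ms: float   # unified PTP timestamp assigned at the IoT node
    values: dict  # e.g., {"temp_c": 24.1, "rh_pct": 61.0}

def pair_streams(frames: List[Frame],
                 packets: List[SensorPacket]) -> List[Tuple[Frame, SensorPacket]]:
    """Match each frame to its closest-in-time sensor packet and drop pairs whose
    residual drift exceeds the threshold (mirrors the edge buffer behaviour)."""
    pairs: List[Tuple[Frame, SensorPacket]] = []
    for f in frames:
        best: Optional[SensorPacket] = min(
            packets, key=lambda p: abs(p.t_ms - f.t_ms), default=None)
        if best is not None and abs(best.t_ms - f.t_ms) <= DRIFT_THRESHOLD_MS:
            pairs.append((f, best))
        # otherwise the frame is held back or discarded, never fused with stale data
    return pairs
```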
The edge fusion layer serves as the primary computation stage for real-time inference. Implemented on embedded devices such as the NVIDIA Jetson Nano and Google Coral TPU, this layer performs feature extraction, compression, and cross-modal fusion under strict latency and power constraints. Several optimization strategies are employed, including structured pruning, quantization-aware training, and lightweight knowledge distillation, collectively reducing computation by up to 70% relative to uncompressed baselines. These adjustments preserve predictive accuracy while enabling deployment in energy-constrained environments. Temporal alignment from the sensing layer is preserved here so that the edge device can conduct joint inference using synchronized visual and scalar features.
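A minimal sketch of the kind of edge-side compression described above is shown below. It uses PyTorch's built-in structured pruning utilities and, as a simpler stand-in for the quantization-aware training used in the paper, post-training dynamic quantization; the model, layer choices, and pruning amount are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Stand-in for the fused edge model; the real backbone is described in Section 3.3.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 8))

# Structured pruning: remove 30% of output channels (L2 criterion) from each conv layer.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        prune.ln_structured(m, name="weight", amount=0.3, n=2, dim=0)
        prune.remove(m, "weight")  # bake the pruning mask into the weights

# Post-training dynamic quantization of the linear head for int8 inference
# (a simpler stand-in for the quantization-aware training described above).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 8])
```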
The cloud intelligence layer manages global learning, long-horizon analytics, and system coordination. High-capacity models such as large-scale transformer architectures are trained and updated in this layer, using fused multimodal data aggregated over longer intervals. The cloud serves as a supervisory controller by pushing updated backbone weights to the edge at scheduled intervals (typically every 5–10 inference epochs or when performance drift is detected). This structure creates a digital replica of the monitored environment, allowing the cloud to refine anomaly detection, perform global trend analysis, and maintain system stability over time. Elastic computation ensures that the cloud scales according to data volume and model complexity while maintaining real-time feedback to the edge layer. The cloud-side model refinement employs a cosine embedding loss as the matching objective to align the feature representations from the vision and sensor modalities in a shared latent space.
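The following sketch illustrates how such a cosine embedding objective can be set up in PyTorch; the projection heads, feature dimensions, and margin are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

# Projection heads mapping each modality into the shared latent space
# (dimensions are illustrative; 1280 matches EfficientNet-B0's final feature size).
proj_vision = nn.Linear(1280, 256)
proj_sensor = nn.Linear(16, 256)
align_loss = nn.CosineEmbeddingLoss(margin=0.1)

f_v = torch.randn(32, 1280)   # batch of visual features
f_s = torch.randn(32, 16)     # time-matched sensor features from the same capture events
target = torch.ones(32)       # +1 labels: paired embeddings should point the same way

loss = align_loss(proj_vision(f_v), proj_sensor(f_s), target)
loss.backward()               # gradients flow into both projection heads
```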
3.2. Multimodal Data Preprocessing Pipeline
The preprocessing pipeline shown in
Figure 2 prepares heterogeneous data for unified analysis. Visual inputs are first normalized using color-constancy correction to reduce illumination variance, which is followed by a hybrid noise-reduction stage that applies median filtering for impulse noise and Gaussian smoothing for high-frequency artifacts. Essential edges and fine textures are retained to support downstream feature extraction. Region-of-interest (ROI) extraction is performed using YOLOv5 [
18]. Since the sensing layer ensures synchronized timestamps, ROIs inherit the same temporal markers as their paired IoT measurements. When network-induced delays cause certain sensor packets to arrive late, the associated ROIs are either aligned using interpolation or excluded based on a drift threshold to prevent supervising the model with misaligned labels. To prevent misalignment between visual and sensor inputs, any image–sensor pair whose timestamps differ by more than the drift threshold is automatically discarded or temporally realigned before fusion.
Equation (1) shows the fusion of normalized visual and scalar features:

$$Z \;=\; {\textstyle\int}\big(F_v \oplus F_s\big) + b \tag{1}$$

Here, $Z$ signifies the unified feature embedding formed by merging the visual and sensor branch outputs, $F_v$ represents the feature vector obtained from the image stream, and $F_s$ captures the contributions from the environmental sensor inputs. The bias $b$ describes the affine offset applied in fusion. The operator ⊕ specifies the particular technique that combines the two modalities, such as concatenation, weighted addition, or a gated mixing scheme. The integral-like symbol $\int$ does not denote a mathematical integral; it represents the mapping that projects the two feature sets into a shared representation space.
3.3. Multimodal AI Models and Integration Strategy
The integration strategy is based on a two-stream structure that processes visual and scalar data separately and then merges the two streams for joint inference (see
Figure 3):
Equation (2) describes how the fused representation $Z$ is formed by combining the processed outputs of the vision and sensor branches:

$$Z \;=\; \big(W_v F_v\big) \oplus \big(W_s F_s\big) + b \tag{2}$$

where $F_v$ and $F_s$ represent the feature vectors obtained from the image and sensor inputs, respectively. The matrices $W_v$ and $W_s$ apply learned transformations that adjust each feature set before they are merged, and the bias term $b$ completes the affine mapping. The operator ⊕ indicates the specific fusion rule used to bring the two modalities together; the motivation and structure of this fusion operator were explained for Equation (1).
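A minimal PyTorch sketch of this fusion rule is shown below; the `MultimodalFusion` module and its dimensions are hypothetical and simply illustrate the three operator choices (concatenation, weighted addition, gated mixing) named above.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch of Eq. (2): learned projections W_v, W_s followed by a fusion operator."""
    def __init__(self, d_v: int, d_s: int, d: int, mode: str = "gated"):
        super().__init__()
        self.W_v = nn.Linear(d_v, d, bias=False)   # visual projection
        self.W_s = nn.Linear(d_s, d, bias=False)   # sensor projection
        self.b = nn.Parameter(torch.zeros(d))      # shared bias term
        self.gate = nn.Linear(2 * d, d)            # used only by the gated variant
        self.mode = mode

    def forward(self, f_v: torch.Tensor, f_s: torch.Tensor) -> torch.Tensor:
        v, s = self.W_v(f_v), self.W_s(f_s)
        if self.mode == "concat":
            return torch.cat([v, s], dim=-1)       # concatenation (bias omitted, dim doubles)
        if self.mode == "add":
            return v + s + self.b                  # weighted addition via W_v, W_s
        g = torch.sigmoid(self.gate(torch.cat([v, s], dim=-1)))
        return g * v + (1.0 - g) * s + self.b      # gated mixing

fuse = MultimodalFusion(d_v=1280, d_s=16, d=256)
Z = fuse(torch.randn(4, 1280), torch.randn(4, 16))  # fused representation, shape (4, 256)
```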
To handle different learning requirements, several model families are incorporated. EfficientNet-B0 [
19] provides robust visual feature extraction and is adapted with feature-wise modulation conditioned on sensor signals. For cross-modal knowledge transfer, the pretrained convolutional backbone (blocks 1–7) of EfficientNet-B0 is used as a fixed feature extractor. The final classification head is replaced, and the extracted features are then modulated by the sensor data embeddings before being passed to the task-specific layers. ViT-Tiny [
2] models temporal dynamics by capturing long-range dependencies between image features and environmental trends. For pixel-level interpretation, U-Net with a ResNet-34 encoder [
3,
20] is used, where convolutional activations are modulated according to sensor context to improve segmentation consistency.
Joint optimization is formulated as shown below:

$$\mathcal{L}_{\text{total}} \;=\; \sum_{T \in \mathcal{T}} \lambda_T\,\mathcal{L}_T \;+\; \lambda_1 \lVert \theta \rVert_1 \;+\; \lambda_2 \lVert \theta \rVert_2^2$$

where $\lambda_T$ is dynamically adjusted for each task during training. Smaller $\lambda_T$ values are assigned when gradient magnitudes grow excessively, following a simplified GradNorm-style balancing rule; this stabilizes multi-task optimization while reducing competition between tasks. Specifically, the weight for task $T$ at training step $k$ is updated as shown below:

$$\lambda_T^{(k)} \;=\; \frac{\lambda_T^{(k-1)}}{1 + \gamma \,\big\lVert \nabla_{\theta}\,\mathcal{L}_T^{(k)} \big\rVert}$$

where $\gamma$ is a damping hyperparameter that ensures tasks with disproportionately large gradients are down-weighted to maintain stable multi-task optimization.
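A small sketch of this balancing behavior is given below; the exact update rule is not specified beyond the description above, so the damping form here is an assumption that simply down-weights tasks whose shared-parameter gradient norms exceed the batch average.

```python
import torch

def update_task_weights(task_losses: dict, shared_params: list,
                        weights: dict, gamma: float = 0.1) -> dict:
    """Damp the weight of any task whose shared-parameter gradient norm exceeds the mean."""
    grad_norms = {}
    for name, loss in task_losses.items():
        grads = torch.autograd.grad(loss, shared_params,
                                    retain_graph=True, allow_unused=True)
        grad_norms[name] = sum(g.norm() for g in grads if g is not None)
    mean_norm = sum(grad_norms.values()) / len(grad_norms)
    new_weights = {}
    for name, w in weights.items():
        ratio = float(grad_norms[name] / (mean_norm + 1e-8))
        excess = max(ratio - 1.0, 0.0)             # how far above the average this task sits
        new_weights[name] = w / (1.0 + gamma * excess)
    total = sum(new_weights.values())              # renormalize to keep a constant sum
    return {k: v * len(new_weights) / total for k, v in new_weights.items()}
```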
Classification tasks use label smoothing to prevent overconfident predictions:

$$\mathcal{L}_{\text{cls}} \;=\; -\sum_{c=1}^{C} \Big[(1-\varepsilon)\,y_c + \tfrac{\varepsilon}{C}\Big]\log p_c \;-\; \beta\, H(p) \tag{6}$$

where $\varepsilon$ is the smoothing parameter, $C$ is the number of classes, $H(p)$ is the entropy of the predicted distribution, and $\beta$ is the entropy regularization weight.
For segmentation, the model uses a hybrid loss that combines Dice similarity and cross-entropy terms:

$$\mathcal{L}_{\text{seg}} \;=\; \alpha\,\mathcal{L}_{\text{Dice}} + (1-\alpha)\,\mathcal{L}_{\text{CE}}, \qquad \mathcal{L}_{\text{Dice}} \;=\; 1 - \frac{2\sum_i p_i\, g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon} \tag{7}$$
This composite formulation improves spatial consistency and class balance, which is particularly beneficial for heterogeneous field imagery.
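A compact PyTorch sketch of such a Dice plus cross-entropy objective is shown below; the blend weight `alpha` is an illustrative assumption rather than a reported value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiceCELoss(nn.Module):
    """Blend of soft Dice and cross-entropy for multi-class segmentation."""
    def __init__(self, alpha: float = 0.5, eps: float = 1e-6):
        super().__init__()
        self.alpha, self.eps = alpha, eps
        self.ce = nn.CrossEntropyLoss()

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, C, H, W); target: (B, H, W) integer class map
        ce = self.ce(logits, target)
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(2, 3))
        union = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
        dice = 1.0 - ((2 * inter + self.eps) / (union + self.eps)).mean()
        return self.alpha * dice + (1.0 - self.alpha) * ce

loss = DiceCELoss()(torch.randn(2, 3, 64, 64), torch.randint(0, 3, (2, 64, 64)))
```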
For regression estimation tasks, a robust Huber-style function is applied to handle outliers effectively:

$$\mathcal{L}_{\text{reg}}(r) \;=\;\begin{cases}\tfrac{1}{2}\, r^{2}, & |r| \le \delta\\ \delta\big(|r| - \tfrac{1}{2}\delta\big), & |r| > \delta\end{cases}\qquad r = y - \hat{y} \tag{8}$$
This piecewise design ensures quadratic behavior for small residuals and a linear penalty for large deviations, improving robustness to noise and imperfect labels.
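For reference, this piecewise objective matches the standard Huber loss available in PyTorch; the `delta` value below is illustrative.

```python
import torch
import torch.nn as nn

huber = nn.HuberLoss(delta=1.0)            # quadratic for |error| <= delta, linear beyond
pred, target = torch.randn(16), torch.randn(16)
loss = huber(pred, target)
```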
In summary, classification tasks use label smoothing (Equation (6)) to prevent overconfident predictions, segmentation uses a Dice–cross-entropy hybrid loss (Equation (7)), and regression tasks rely on a robust Huber formulation (Equation (8)) to reduce sensitivity to outliers.
3.4. Optimization Framework and Training Methodology
The multimodal framework is trained in PyTorch 2.0 with Automatic Mixed Precision (AMP), which reduces memory use and accelerates computation without degrading numerical robustness. To suit the heterogeneous, multi-task character of the architecture, several optimization strategies were tested. Table 2 summarizes the optimizers evaluated and highlights their respective strengths for different parts of the pipeline. AdamW [21] is selected as the standard optimizer for the visual backbone, primarily because its decoupled weight decay yields stable, monotonic convergence during feature extraction. An adaptive-momentum variant (MADA) is adopted to facilitate cross-modal alignment: by regulating the momentum term with an adaptive coefficient, it reduces the mismatch between the rapidly changing gradients of the two modalities and promotes gradient-level convergence across them. The sigSignAdamW variant further enhances robustness for sensor-based tasks by adding sign-based variance normalization, which helps when gradient quality deteriorates due to sensor noise. Figure 4 visualizes the optimization dynamics, showing convergence trajectories and the accuracy gains obtained through multimodal integration.
The gradient update procedure begins by computing the composite gradient defined in Equation (9), which incorporates task-specific losses and both regularization terms:

$$g_t \;=\; \nabla_{\theta}\Big(\sum_{T \in \mathcal{T}} \lambda_T\,\mathcal{L}_T \;+\; \lambda_1 \lVert \theta \rVert_1 \;+\; \lambda_2 \lVert \theta \rVert_2^2\Big) \tag{9}$$

Momentum terms are then updated according to Equations (10) and (11), followed by standard bias correction:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t \tag{10}$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^{2} \tag{11}$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}$$

The updated parameters follow:

$$\theta_{t+1} \;=\; \theta_t \;-\; \eta_t\left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \;+\; \lambda_{\mathrm{wd}}\,\theta_t\right)$$

where the learning rate $\eta_t$ is scheduled using cosine annealing:

$$\eta_t \;=\; \eta_{\min} + \tfrac{1}{2}\big(\eta_{\max}-\eta_{\min}\big)\Big(1+\cos\!\big(\tfrac{t}{T_{\max}}\pi\big)\Big)$$
This schedule gradually decreases the learning rate over the epochs along a cosine curve, giving large exploratory updates in the early stages and smaller, controlled changes later as training stabilizes. The smooth decay helps avoid abrupt optimization changes, improves convergence in multi-task settings, and reduces the risk of becoming trapped in early local minima, which is crucial when visual and sensor branches need to be harmonized.
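The following training-loop skeleton sketches how AdamW, cosine annealing, and AMP fit together in PyTorch; the model, data, and hyperparameter values are placeholders rather than the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(256, 4).to(device)          # stand-in for the fused multi-task model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
loader = [(torch.randn(8, 256), torch.randint(0, 4, (8,))) for _ in range(10)]  # dummy batches

for epoch in range(100):
    for z, y in loader:                             # z: fused features, y: targets
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type=device, enabled=(device == "cuda")):
            loss = F.cross_entropy(model(z.to(device)), y.to(device))
        scaler.scale(loss).backward()               # AMP-scaled backward pass
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()                                # cosine decay applied once per epoch
```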
3.5. Closed-Loop Adaptation System
The framework operates as a closed loop, combining real-time edge inference with periodic refinement performed in the cloud, as summarized in
Figure 5. Edge devices continuously monitor prediction confidence and detect distribution shifts by tracking rolling statistics of feature activations and estimating uncertainty. When drift is detected or confidence drops below preset bounds, selected samples, together with their synchronized sensor traces, are sent asynchronously to the cloud. This closed-loop mechanism allows the system not only to remain stable over long-term operation but also to adapt to changing environmental conditions, evolving visual patterns, and shifts in sensor quality. The framework can therefore monitor distribution shifts continuously and adjust model parameters where required, remaining generalizable across tasks (classification, segmentation, anomaly detection, and regression) without manual recalibration. The result is a self-sustaining sensing system that scales well across diverse operational environments by balancing accuracy, responsiveness, and robustness.
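A minimal sketch of such an edge-side trigger is shown below; the `DriftMonitor` class, window size, and thresholds are hypothetical choices that merely illustrate rolling-statistics drift detection combined with a confidence floor.

```python
from collections import deque
import numpy as np

class DriftMonitor:
    """Rolling z-score on mean feature activation plus a confidence floor."""
    def __init__(self, window: int = 500, conf_floor: float = 0.6, z_limit: float = 3.0):
        self.history = deque(maxlen=window)   # rolling statistics of feature activations
        self.conf_floor = conf_floor
        self.z_limit = z_limit

    def should_upload(self, feature_vec: np.ndarray, confidence: float) -> bool:
        stat = float(feature_vec.mean())
        drifted = False
        if len(self.history) >= 50:           # wait for enough history to be meaningful
            mu = float(np.mean(self.history))
            sigma = float(np.std(self.history)) + 1e-8
            drifted = abs(stat - mu) / sigma > self.z_limit
        self.history.append(stat)
        return drifted or confidence < self.conf_floor

monitor = DriftMonitor()
# if monitor.should_upload(fused_features, max_softmax_prob):
#     queue the frame and its synchronized sensor trace for asynchronous cloud upload
```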
Cloud servers perform global refinement using Equation (16):

$$\mathcal{L}_{\text{refine}} \;=\; \alpha_v\,\mathcal{L}_{\text{vision}} \;+\; \alpha_s\,\mathcal{L}_{\text{sensor}} \;+\; \alpha_f\,\mathcal{L}_{\text{fusion}} \tag{16}$$

The coefficients $\alpha_v$, $\alpha_s$, and $\alpha_f$ are adapted based on the validation improvements observed in each modality. To prevent catastrophic forgetting while still improving performance, cloud-side refinement incorporates elastic weight consolidation and knowledge distillation. Updated backbones are then pushed to the edge according to the system's synchronization schedule, completing the adaptation loop.
The complete end-to-end workflow—summarized in Algorithm 1—spans each stage of the proposed framework, beginning with raw data capture, moving through preprocessing, feature extraction, multimodal fusion, multi-task inference, optimization, edge deployment, and finally cloud-based adaptation. All mathematical notation used throughout the algorithm is defined in
Table 3. Together, these components establish a unified process that supports synchronized multimodal learning within a distributed edge–cloud environment.
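For readability, a condensed Python-style sketch of this workflow is given below; every method name is a placeholder standing in for the components described in Sections 3.1–3.5 rather than an actual API of the system.

```python
def run_framework(sensing, pipeline, edge_model, monitor, cloud):
    """Condensed sketch of the loop in Algorithm 1; every attribute below is a placeholder."""
    while True:
        frame, packets = sensing.capture_synchronized()         # PTP-aligned acquisition
        roi, scalars = pipeline.preprocess(frame, packets)       # Section 3.2 preprocessing
        f_v = edge_model.vision_branch(roi)                      # visual features
        f_s = edge_model.sensor_branch(scalars)                  # sensor features
        z = edge_model.fuse(f_v, f_s)                            # Eq. (2) fusion
        outputs = {t: head(z) for t, head in edge_model.task_heads.items()}
        if monitor.should_upload(z, edge_model.confidence(outputs)):
            cloud.queue(frame, packets)                          # asynchronous upload
        if cloud.update_ready():                                 # scheduled refinement
            edge_model.load_weights(cloud.latest_backbone())
        yield outputs
```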
3.6. Comprehensive Evaluation Methodology
The evaluation methodology assesses both predictive performance and deployability. Classification tasks are evaluated using accuracy, precision, recall, and F1-score. Segmentation is measured using mean Intersection over Union, Dice coefficient, and boundary F-score. Regression outputs are evaluated using RMSE, MAE, and the coefficient of determination ($R^2$). Deployment metrics—including inference latency, throughput (frames per second), and memory footprint—capture the system’s suitability for real-time operation on constrained hardware. Energy consumption is measured through average power draw and energy per inference, while reliability is assessed through mean time between failures and recovery time. To ensure statistical robustness, all experiments include confidence intervals, hypothesis tests for significance, and cross-validation across multiple data splits. Scalability is analyzed by increasing sensor rates, image traffic, and task loads, enabling the characterization of system behavior under diverse real-world operating conditions.
Table 3.
Comprehensive mathematical notation and definitions.
| Symbol | Mathematical Definition and Context |
|---|---|
| Data Symbols | |
| $I$ | Input image tensor. |
| $S$ | IoT sensor data vector. |
| $M$ | Metadata vector. |
| $y$, $\hat{y}$ | Ground-truth labels and model predictions. |
| Processing Functions | |
| | Visual preprocessing: illumination normalization, denoising, ROI extraction. |
| | Sensor processing: synchronization, interpolation, normalization. |
| | Metadata normalization. |
| | Backbone networks for visual and sensor processing. |
| Feature Symbols | |
| $F_v$ | Extracted visual features. |
| $F_s$ | Extracted sensor features. |
| $Z$ | Fused multimodal representation. |
| Model Parameters | |
| $\theta$ | Full parameter set. |
| | Vision and sensor backbone parameters. |
| | Fusion parameters. |
| | Task-specific head parameters. |
| Loss Functions | |
| $\mathcal{L}_{\text{total}}$ | Total multi-task loss with regularization. |
| $\mathcal{L}_{\text{cls}}$ | Classification loss (e.g., smoothed cross-entropy). |
| $\mathcal{L}_{\text{seg}}$ | Dice + cross-entropy segmentation loss. |
| $\mathcal{L}_{\text{reg}}$ | Regression loss (e.g., Huber). |
| $\lambda_T$ | Task-specific weighting coefficients. |
| Optimization Symbols | |
| $g_t$ | Gradient at step $t$. |
| $m_t$, $v_t$ | First and second moments. |
| $\hat{m}_t$, $\hat{v}_t$ | Bias-corrected moments. |
| $\beta_1$, $\beta_2$ | Decay factors. |
| $\eta_t$ | Learning rate under cosine annealing. |
| Adaptation Symbols | |
| | Feature set at iteration $k$. |
| $U$ | Cloud update operator. |
| | Edge compression operator. |
| System Symbols | |
| $\mathcal{T}$ | Task set. |
| | Task-specific head. |
| | Logged inference data for drift detection. |
| | Batch selected for cloud retraining. |
Algorithm 1: Machine Vision and IoT Sensing Integration Framework
4. Experiments and Results
The framework proposed in this paper was evaluated to compare performance across a range of sensing conditions, environments, and hardware platforms. The experiments assessed both algorithmic efficiency and the practical applicability of the framework for agricultural monitoring.
4.1. Experimental Setup and Validation Methodology
A multifaceted validation approach was devised to investigate the benefit of the unified machine vision–IoT sensing approach. Agriculture was chosen as the target domain because its sensing requirements are complex, the surrounding environment changes naturally, and conventional single-modality monitoring methods are limited in this setting.
Datasets and Evaluation Protocol: Three datasets were used with standardized 70/15/15 train/validation/test splits:
Baseline comparisons included VGG16, ResNet50, U-Net, EfficientNet-B3 [
19], and DeepLabv3+ [
24]. All baselines were retrained under identical preprocessing and training protocols to ensure fairness.
Implementation Details: Experiments were implemented in PyTorch 2.0 using mixed-precision (AMP) training. Input sizes were standardized separately for the classification and segmentation tasks. Edge-device evaluations were conducted on NVIDIA Jetson Nano (4 GB) and Google Coral TPU units to validate throughput, latency, and power consumption under realistic deployment constraints. All compared methods received identical input ROI crops, preprocessing pipelines, and data augmentation to ensure fair comparison.
4.2. Multimodal Integration Performance Evaluation
Pattern Recognition and Classification: Disease detection was used in evaluating classification performance.
Table 4 shows that the integration-enhanced EfficientNet-B0 reached 94.8% accuracy, outperforming the strongest baseline, EfficientNet-B3, by 2.7% while using significantly fewer computational resources.
Figure 6 visualizes improvements across both classification and segmentation tasks.
Spatial Understanding and Segmentation
Segmentation performance was assessed on weed–crop discrimination tasks. The integrated U-Net with ResNet-34 encoder achieved 87.6% mIoU and a Dice score of 0.89 (
Table 5), outperforming DeepLabv3+ by 4.1%. Including soil-condition and microclimate information improved boundary separation in dense or occluded vegetation, especially under variable illumination.
4.3. Edge Deployment and Computational Efficiency
To assess real-time deployment feasibility, the integrated model was evaluated on compact embedded platforms. Both the Jetson Nano and the Coral TPU reached over 15 FPS while requiring less than 10 W, indicating that low-cost field operation is feasible.
Figure 7 shows latency–power trade-offs, and
Table 6 summarizes platform metrics.
4.4. Advantages of Multimodal Integration
Joint learning was shown to steadily enhance dependability and prediction stability. Combining growth-stage classification and height estimation in the unified framework yielded 89.1% accuracy and a mean absolute error of 2.4 cm. The results confirm that cross-modal knowledge sharing substantially improves generalization: transferring the convolutional backbone of EfficientNet-B0 across modalities increased feature shareability between the visual and sensor streams and produced a 36.8% performance increase.
4.5. Regularization and Ablation Analysis
The impact of the L1 and L2 regularization factors ($\lambda_1$, $\lambda_2$) was systematically explored to prevent overfitting. The optimal $\lambda_2$ (L2) and $\lambda_1$ (L1) values identified in Figure 8 provided the best trade-off between training and validation loss, decreasing overfitting by 34% compared with the unregularized baseline while maintaining model capacity for multi-task learning.
4.6. Qualitative Evaluation
Qualitative samples provide additional perspective on the system’s behavior.
Figure 9 illustrates that the model maintains high spatial precision under varying illumination, occlusion, and vegetation density, detecting disease regions and segmenting weeds with strong spatial consistency.
4.7. Comparison with State of the Art
Finally, comparisons of our method with state-of-the-art approaches (Table 7 and Table 8) show that while some standalone vision models perform competitively in controlled scenarios, our integrated design outperforms them while remaining fully deployable on edge hardware with strong real-world performance.
For the integration framework proposed in this study, the results demonstrate a strong balance between model fidelity, computational efficiency, and deployability. This approach presents a feasible pathway for large-scale agricultural and environmental monitoring where hardware constraints and fluctuating field conditions remain major challenges.
5. Discussion
This study shows that tightly coupling vision-based sensing with scalar environmental measurements can overcome fundamental shortcomings of environmental monitoring systems. Rather than viewing the modalities in isolation, the presented model highlights the synergistic benefits that arise when spatial richness and temporal precision are treated as two interacting information streams. This architectural approach goes beyond traditional multimodal research in that it specifically embeds synchronized acquisition and joint representation learning inside deployable edge computing paradigms. The primary insight from this work is that effective multimodal fusion demands a holistic systems architecture rather than post hoc integration. Older monitoring solutions generally focus on a single modality, which creates vulnerabilities when operating conditions vary. Our integration approach builds a cooperative pattern in which visual context enriches spatial understanding and sensor inputs provide stability when visual quality degrades, contributing strengths that cannot be achieved from a single stream.
5.1. Domain Transfer and Architectural Generality
The architecture-agnostic nature of the framework has been tested in agricultural research and is well suited to development in other contexts. For industrial IoT, the same concepts would allow equipment-health monitoring that combines thermal imaging with vibration and temperature sensors. In smart city deployments, traffic camera feeds coupled with air quality sensors could support real-time environmental correlation analysis. Structural inspection applications could combine drone imagery with embedded stress sensors for predictive maintenance. Across all of these domains, the key differentiator remains the focus on synchronized sensing and an edge-compatible design that supports scaling deployment across disparate devices with intermittent connectivity and power constraints.
5.2. Scalability and System-Level Issues
Although the framework is efficient on a single edge device or a small number of them, scaling to deployments with hundreds of nodes can be challenging. The current cloud synchronization mechanism can become a bottleneck when information must be pulled from many edge devices simultaneously. Future architectures could enable hierarchical aggregation or peer-to-peer model updates to share the computation burden and reduce dependency on a centralized cloud. Moreover, the reliance on co-located vision and sensor units might not suit highly distributed deployments, where more agile strategies for orchestrating spatial alignment are necessary.
5.3. Robustness Through Cross-Modal Regularization
The robustness of the framework can be traced to its natural cross-modal regularization effect. Vision-only models often degrade under challenging conditions, but incorporating time-aligned sensor metadata provides a stabilizing factor that reduces sensitivity to visual fluctuations. This redundancy elevates the multimodal combination from merely additive to genuinely complementary, leading to correct predictions across real-world variability. However, this advantage holds only to the extent that precise temporal matching is achieved, which shows that sound multimodal fusion depends as much on reliable low-level system integration as on algorithmic design.
5.4. Comparative Positioning in the Sensing Landscape
Table 9 situates our approach within the broader ecosystem of sensing systems. High-performance architectures usually demand high computational effort, while edge-optimized architectures typically sacrifice predictive ability. Our framework is positioned to balance inference quality against hardware constraints by combining multimodal feature fusion with architectural optimization in a practical manner, an important consideration given that most real-world applications are resource constrained.
5.5. Limitations and Future Research Directions
A number of limitations should be considered in future versions. The dependence on synchronized timestamps is critical: severe network degradation may exceed the normal drift-handling mechanisms. Furthermore, although the architecture is friendly to edge deployment, compute-intensive operations such as multi-class segmentation in complex scenes will still require hardware acceleration. These limitations point to clear research directions. Adaptive temporal fusion modules tuned dynamically to sensor arrival patterns could improve robustness. Federated learning offers a way to enable cross-site adaptation without transferring data to a central computing center, alleviating concerns related to privacy and scalability. Unsupervised domain adaptation would allow automatic adjustment to new environments without extensive relabeling. Finally, incorporating additional modalities such as hyperspectral imaging or distributed acoustic sensing could further enhance ecological understanding. The proposed integration framework thus not only delivers an effective multimodal sensing solution but also serves as an extensible basis for future intelligent monitoring systems across a multitude of application domains.
6. Conclusions
In this paper, a unified framework combining machine vision and IoT sensing was implemented to tackle long-standing issues in environmental monitoring. The framework moves away from conventional single-modality approaches toward a more mobile, resilient, and context-aware monitoring system by complementing high-spatial-resolution frames with temporally precise scalar measurements. The architecture illustrates that, through thoughtful co-design coupling multimodal fusion, efficient edge computation, and cloud-assisted adaptation, it is possible to bridge analytical sophistication and field deployability. This approach makes three main contributions. First, the integrated architectural methodology treats vision and sensor modalities as synergistic rather than independent, enabling deep environmental understanding without the fragmentation that usually characterizes traditional monitoring pipelines. Second, the framework uses cross-modal representation learning to improve generalization and reduce overlap across tasks, demonstrating that heterogeneous signals can reinforce one another. Third, the edge–cloud execution model shows that real-time sensing intelligence is achievable under practical constraints, enabling rapid, low-power inference on embedded platforms. The integrated approach has been empirically evaluated, notably in agricultural monitoring contexts, and maintained inference quality robustly under environmental fluctuation, changes in illumination, and varying resource availability. These results confirm that a well-conceived multimodal sensing architecture offers field-ready performance without the computational burden of high-capacity conventional models. The structure also provides a platform for further multimodal extension. Prospective studies could include other sensing methods (thermal, spectral, hyperspectral, etc.) for even earlier identification of anomalies and detection of environmental hazards. The use of self-supervised, continual, or federated learning in the adaptive learning cycle would reinforce long-term robustness and facilitate distributed deployments under privacy constraints. Research on lightweight fusion modules and distributed inference also shows promise. Overall, this work demonstrates that effective multimodal integration requires combining operational efficiency with predictive performance. The proposed framework aligns sensing diversity with computational feasibility, providing a scalable and adaptable mechanism for next-generation monitoring systems across agricultural, industrial, and environmental applications.
Author Contributions
Methodology, G.N.; Software, G.N.; Validation, G.N.; Formal analysis, H.P.; Resources, G.N.; Writing—original draft, G.N.; Writing—review & editing, H.P.; Visualization, H.P.; Supervision, G.N. and H.P.; Project administration, G.N. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Wolfert, S.; Ge, L.; Verdouw, C.; Bogaardt, M.J. Big Data in Smart Farming—A Review. Agric. Syst. 2017, 153, 69–80. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. Available online: https://arxiv.org/abs/2010.11929 (accessed on 17 September 2025).
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Kamilaris, A.; Prenafeta-Boldú, F.X. Deep Learning in Agriculture: A Survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
- Alahi, M.E.E.; Sukkuea, A.; Tina, F.W.; Nag, A.; Kurdthongmee, W.; Suwannarat, K.; Mukhopadhyay, S.C. Integration of IoT-Enabled Technologies and Artificial Intelligence for Smart Agriculture: Recent Advancements and Future Trends. Sensors 2023, 23, 5206. [Google Scholar] [CrossRef] [PubMed]
- Shafi, U.; Mumtaz, R.; García-Nieto, J.; Hassan, S.A.; Zaidi, S.A.R.; Iqbal, N. Precision Agriculture Techniques and Practices: From Considerations to Applications. Sensors 2019, 19, 3796. [Google Scholar] [CrossRef] [PubMed]
- Dadheech, A.; Bhavsar, M.; Verma, J.P.; Prasad, V.K. Deep Learning Based Smart Traffic Management Using Video Analytics and IoT Sensor Fusion. Soft Comput. 2024, 28, 13461–13476. [Google Scholar] [CrossRef]
- Mekonnen, Y.; Namuduri, S.; Burton, L.; Sarwat, A.; Bhansali, S. Machine Learning Techniques in Wireless Sensor Network Based Precision Agriculture. J. Electrochem. Soc. 2019, 167, 037522. [Google Scholar] [CrossRef]
- Ferrag, M.A.; Shu, L.; Yang, X.; Derhab, A.; Maglaras, L. Security and Privacy for Green IoT-Based Agriculture: Review, Blockchain Solutions, and Challenges. IEEE Access 2020, 8, 32031–32053. [Google Scholar] [CrossRef]
- Maes, W.H.; Steppe, K. Perspectives for Remote Sensing with Unmanned Aerial Vehicles in Precision Agriculture. Trends Plant Sci. 2019, 24, 152–164. [Google Scholar] [CrossRef] [PubMed]
- Ferentinos, K.P. Deep Learning Models for Plant Disease Detection and Diagnosis. Comput. Electron. Agric. 2018, 145, 311–318. [Google Scholar] [CrossRef]
- Zhang, J.; Huang, Y.; Pu, R.; Gonzalez-Moreno, P.; Yuan, L.; Wu, K.; Huang, W. Monitoring Plant Diseases and Insect Pests Through Remote Sensing Technology: A Review. Comput. Electron. Agric. 2020, 165, 104943. [Google Scholar] [CrossRef]
- Walter, A.; Finger, R.; Huber, R.; Buchmann, N. Smart Farming is Key to Developing Sustainable Agriculture. Proc. Natl. Acad. Sci. USA 2017, 114, 6148–6150. [Google Scholar] [CrossRef] [PubMed]
- Sathupadi, K.; Achar, S.; Bhaskaran, S.V.; Faruqui, N.; Abdullah-Al-Wadud, M.; Uddin, J. Edge–Cloud Synergy for AI-Enhanced Sensor Network Data: A Real-Time Predictive Maintenance Framework. Sensors 2024, 24, 7918. [Google Scholar] [CrossRef] [PubMed]
- Mondal, M.; Khayati, M.; Sandlin, H.; Cudré-Mauroux, P. A Survey of Multimodal Event Detection Based on Data Fusion. VLDB J. 2025, 34, 9. [Google Scholar] [CrossRef]
- Bazzi, A.; Ying, M.; Kanhere, O.; Rappaport, T.S.; Chafii, M. ISAC Imaging by Channel State Information Using Ray Tracing for Next Generation 6G. IEEE J. Sel. Top. Electromagn. Antennas Propag. 2025, in press. [Google Scholar] [CrossRef]
- Liu, J.; Wang, X. Plant Diseases and Pests Detection Based on Deep Learning: A Review. Plant Methods 2021, 17, 22. [Google Scholar] [CrossRef] [PubMed]
- Jocher, G. YOLOv5 by Ultralytics. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 17 September 2025).
- Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. Available online: https://arxiv.org/abs/1905.11946 (accessed on 17 September 2025).
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar] [CrossRef]
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019; Available online: https://arxiv.org/abs/1711.05101 (accessed on 17 September 2025).
- Appiah, O.; Oppong Hackman, K.; Diallo, B.A.; Ogunjobi, K.O.; Ouedraogo, V.; Bebe, M.; Son, D. TOM2024: Datasets of Tomato, Onion and Maize Images for Pest and Disease Detection. 2024. Available online: https://data.mendeley.com/datasets/3d4yg89rtr/1 (accessed on 17 September 2025).
- Orbibarobotics. Crop and Weed Dataset. 2024. Available online: https://universe.roboflow.com/orbibarobotics/crop-and-weed-datase (accessed on 17 September 2025).
- Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder–Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. arXiv 2021, arXiv:2105.05537. Available online: https://arxiv.org/abs/2105.05537 (accessed on 19 September 2025).
Figure 1.
Integrated machine vision–IoT architecture showing synchronized data capture, multimodal fusion, edge inference, and cloud coordination.
Figure 2.
Preprocessing pipeline illustrating synchronized visual and sensor data operations prior to multimodal fusion.
Figure 3.
Fusion strategy illustrating independent feature extraction from visual and sensor streams prior to joint multimodal inference.
Figure 4.
Convergence behavior showing (a) loss trajectories per optimizer and (b) accuracy gains due to multimodal integration.
Figure 5.
Closed-loop adaptation framework showing the sequence of edge inference, cloud retraining, validation, and redeployment.
Figure 6.
Comparative performance analysis showing (a) classification accuracy and (b) segmentation mIoU across different model families.
Figure 7.
Edge deployment evaluation showing (a) latency and throughput performance for Jetson Nano and Coral TPU, and (b) device-level power consumption for energy-efficient operation.
Figure 8.
Effect of regularization parameters on (a) validation accuracy and (b) training–validation loss gap.
Figure 9.
Qualitative examples: (top) tomato disease detection; (bottom) weed segmentation under occlusion using fused vision–sensor inputs.
Table 1.
Comparison of monitoring methods highlighting the fusion properties of the proposed integration framework.
| Approach | Sensing Modality | AI Inference Scope | System Integration | Adaptivity and Knowledge Reuse |
|---|---|---|---|---|
| Traditional Scalar Sensors [6,8] | Single-parameter IoT sensing | Single-variable analysis | Centralized, point-based deployment | Static models without cross-sensor learning |
| Machine Vision Systems [4,17] | Visual data only | Image-based classification and segmentation | Standalone vision pipelines | Visual domain adaptation only |
| AI-Enhanced Solutions [5,9] | Single-modality optimized | Domain-specific intelligence | Modality-specific processing | Only within-modality learning |
| Proposed Integration Framework | Fused vision and IoT sensing | Cross-modal joint inference | Unified edge–cloud coordination | Cross-modal transfer with continuous adaptation |
Table 2.
Comparison of optimization algorithms within the multimodal learning framework.
| Optimizer | Core Mechanism | Convergence Characteristics | Multimodal Suitability |
|---|---|---|---|
| AdamW | Decoupled weight decay | Smooth, monotonic descent | Highly stable for vision backbone training |
| MADA | Adaptive momentum | Fast convergence with small oscillations | Effective for cross-modal feature alignment |
| sigSignAdamW | Sign-variance control | Noise-resistant, steady under uncertainty | Suitable for sensor-stream optimization |
Table 4.
Classification performance on agricultural monitoring tasks.
| Model | Accuracy (%) | Precision | Recall | F1-Score |
|---|---|---|---|---|
| VGG16 | 86.2 | 0.84 | 0.83 | 0.83 |
| ResNet50 | 89.5 | 0.88 | 0.87 | 0.87 |
| EfficientNet-B3 | 92.1 | 0.91 | 0.91 | 0.91 |
| Ours (Integration Framework) | 94.8 | 0.94 | 0.95 | 0.94 |
Table 5.
Segmentation performance on spatial understanding tasks.
| Model | mIoU (%) | Dice Coefficient |
|---|---|---|
| U-Net | 79.4 | 0.81 |
| DeepLabv3+ | 83.5 | 0.85 |
| Ours (Integrated U-Net) | 87.6 | 0.89 |
Table 6.
Deployment performance across embedded platforms.
| Platform | Latency (ms) | Throughput (FPS) | Power (W) |
|---|---|---|---|
| Jetson Nano | 62 | 16.1 | 8.4 |
| Coral TPU | 54 | 18.3 | 6.7 |
Table 7.
Comparison with state-of-the-art classification approaches.
| Method | Accuracy (%) | Operational Characteristics |
|---|---|---|
| VGG16 | 86.2 | High complexity, limited deployability |
| ResNet50 | 89.5 | Balanced accuracy and efficiency |
| EfficientNet-B3 | 92.1 | High accuracy, higher compute cost |
| ViT-Tiny | 90.6 | Strong features, transformer overhead |
| Our Framework | 94.8 | Edge-deployable, optimized fusion |
Table 8.
Comparison with state-of-the-art segmentation approaches.
| Method | mIoU (%) | Dice | Deployment Feasibility |
|---|---|---|---|
| U-Net | 79.4 | 0.81 | Lightweight, reduced precision |
| DeepLabv3+ | 83.5 | 0.85 | Memory-intensive, GPU preferred |
| Swin-UNet [25] | 85.2 | 0.87 | High compute requirements |
| Our Framework | 87.6 | 0.89 | Edge-optimized and efficient |
Table 9.
Comparative analysis emphasizing performance–deployability trade-offs.
| Approach | Performance Metric | Deployment Profile | Operational Characteristics |
|---|---|---|---|
| EfficientNet-B3 | High accuracy | High-resource requirement | Accuracy-oriented, computationally intensive |
| DeepLabv3+ | Strong segmentation | GPU-dependent | Very heavy memory consumption, limited portability |
| Swin-UNet | Transformer-level quality | Specialized hardware required | Reduced scalability due to large model size |
| Standard U-Net | Moderate segmentation | Edge compatible | Lightweight but accuracy-limited |
| Proposed Framework | Strong multimodal performance | Edge optimized | Optimized efficiency and adaptability |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).