Article

Early-Stage Sensor Data Fusion Pipeline Exploration Framework for Agriculture and Animal Welfare

by Devon Martin 1, David L. Roberts 2 and Alper Bozkurt 1,*
1 Department of Electrical Engineering, North Carolina State University, 890 Oval Dr, Raleigh, NC 27695, USA
2 Department of Computer Science, North Carolina State University, Campus Box 8206, 890 Oval Drive, Raleigh, NC 27695, USA
* Author to whom correspondence should be addressed.
AgriEngineering 2025, 7(7), 215; https://doi.org/10.3390/agriengineering7070215
Submission received: 17 April 2025 / Revised: 13 June 2025 / Accepted: 18 June 2025 / Published: 3 July 2025

Abstract

Internet-of-Things (IoT) approaches are continually introducing new sensors into the fields of agriculture and animal welfare. The application of multi-sensor data fusion to these domains remains a complex and open-ended challenge that defies straightforward optimization, often requiring iterative testing and refinement. To respond to this need, we have created a new open-source framework as well as a corresponding Python tool which we call the “Data Fusion Explorer (DFE)”. We demonstrated and evaluated the effectiveness of our proposed framework using four early-stage datasets from diverse disciplines, including animal/environmental tracking, agrarian monitoring, and food quality assessment. These datasets span the common data formats, including singlet, array, and image data, as well as classification and regression tasks and temporal and spatial distributions. We compared various pipeline schemes, such as low-level against mid-level fusion and the placement of dimensional reduction. Based on their space and time complexities, we then highlighted how these pipelines may serve different purposes depending on the given problem. As an example, we observed that early feature extraction reduced time and space complexity in agrarian data. Additionally, independent component analysis slightly outperformed principal component analysis on a sweet potato imaging dataset. Lastly, we benchmarked the DFE tool against Vanilla Python3 implementations of our four datasets’ pipelines and observed a significant reduction, usually more than 50%, in the code users must write for almost every dataset, suggesting the usefulness of this package for interdisciplinary researchers in the field.

1. Introduction

This section provides an overview of the need for data fusion in precision agriculture and animal welfare research fields, categorizes common techniques and formats used in data fusion, and summarizes the four datasets to which we have access for proof-of-concept demonstrations.

1.1. The Need for Data Fusion in Precision Agriculture and Animal Welfare

The past few decades have seen remarkable achievements in the development of new sensors and improvements in their operational performance (e.g., sensitivity, specificity, power consumption), combined with novel network technologies that have enabled the Internet-of-Things (IoT) to become ubiquitous in every aspect of our lives. The information age has brought forth data centers, cloud computing, smart technologies, and distributed personal smart devices. As much as 95% of industry is investing in cloud solutions and embedded sensors in some way [1], and the market value of IoT systems in each of the healthcare, automotive, and consumer electronics industries nearly doubled from 2014 to 2020 to more than USD 1 trillion [2]. Precision agriculture and smart farming have attracted increased attention more recently due to increasing awareness of the challenges of feeding a growing global population.
Improvements in the accuracy, cost, and resolution of sensor systems and devices have also improved our understanding of the physical world in these domains. However, multiple sensors measuring various physical signals in an environment pose a significant challenge in interpreting these data streams due to the complexity and volume of data representing a diverse set of phenomena. There is a growing need for data fusion and fission tools to handle such large volumes of data acquired by multiple sensors simultaneously. Sensors work because physical phenomena corresponding to events propagate through a space and interact with the sensor. The splitting of information from the event into various observations in a one-to-many relationship defines data fission. Various sensors allow scientists to record these pieces of information from various points of view [3]. The methods used to fuse these data back together are problem-specific. In the last few decades, many successful attempts have been made to expand these techniques to precision agriculture domains.

1.2. The Data Flow in Precision Agriculture and Animal Welfare

The Dasarathy model groups data fusion techniques that can be used in precision agriculture and animal welfare applications by level of abstraction [3]. Information is evaluated by whether it is data (low level), features (mid level), or decisions (high level). Unfortunately, no single technique works for all problems; even the more advanced data fusion techniques perform worse in certain scenarios [4]. Instead, design iteration based on trial-and-error testing is necessary. Idiosyncrasies also exist within each data fusion problem, such as pre-processing needs, certain formatting corrections, etc., which in turn require unique solutions. Fortunately, there are common procedures that can be grouped. Data typically starts in one of three formats:
  • Singlets: information is low-dimensional, such as a temperature from a thermistor, or physical activity level from an inertial measurement unit.
  • Arrays: data is collected over a range. Examples include spectral data, soil moisture across a field, etc.
  • Images: pixel-based images, most commonly from camera-based sensors for computer vision-based analysis, such as of diseased crops and fruits.
Array-style data often requires dimensional reduction due to large redundancies in arrays. Image-style data also requires image processing before proper feature selection can take place. Temporal and spatial alignment is required early on to ensure proper comparison of various sensors (Figure 1). This is typically followed by feature extraction to reduce the size of the data while retaining the main information. Decision making, usually performed with a classifier or predefined ruleset, must be carried out before fusion from different sources can then take place. All of these different data formats emerging in the precision agriculture and animal welfare domains set the framework upon which a data fusion strategy can be built.
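To make the alignment step concrete, below is a minimal sketch of temporal alignment using pandas, with synthetic data standing in for two sensors sampled at different rates; the resampling choices here are illustrative assumptions, not a prescribed procedure.

```python
# Minimal sketch: aligning two sensors sampled at different rates onto a
# common time axis (synthetic stand-in data, not from the datasets below).
import numpy as np
import pandas as pd

# A 1 Hz thermistor and a 10 Hz accelerometer magnitude, both 60 s long.
temp = pd.Series(
    20 + np.random.randn(60).cumsum() * 0.05,
    index=pd.date_range("2025-01-01", periods=60, freq="1s"),
    name="temperature_C",
)
accel = pd.Series(
    np.abs(np.random.randn(600)),
    index=pd.date_range("2025-01-01", periods=600, freq="100ms"),
    name="accel_mag_g",
)

# Put both streams on a shared 1 s grid: interpolate the sparse stream,
# aggregate the dense one, then concatenate each sensor as a feature.
grid = pd.date_range("2025-01-01", periods=60, freq="1s")
aligned = pd.concat(
    [temp.reindex(grid).interpolate(), accel.resample("1s").mean()],
    axis=1,
)
print(aligned.head())
```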

1.3. The Focus and Contribution of This Paper

In addition to presenting our proposed data fusion pipeline methodology to the scientific community, here we demonstrate the feasibility and effectiveness of the framework in completing custom data fusion objectives with four datasets as application examples. Two of these datasets relate to crop health and the other two to animal monitoring. We also share unplanned sub-objectives that arose organically during data pipeline development. These were easy to implement with our corresponding toolkit despite not being pre-planned, indicating a degree of emergent flexibility in its design. This highlights the potential of the tool, as its modular setup allows for easy customization. The objectives we were able to complete include the following:
  • Determined when to perform a data fusion procedure in the data pipeline;
  • Compared low-level and mid-level fusion approaches for a dataset; and
  • Prototyped different pipeline events to determine an optimal ordering.
Overall, the pipeline methodology and the corresponding tool allow researchers to obtain a broad and high-level perspective on the structure of their data fusion processes from raw data to final inference without being burdened by the intricacies of data formatting or best coding practices.

2. Review of the State of the Practice

Data fusion methods have been extensively applied across various industries, especially with the rapid growth of IoT and sensor technologies. In the food and beverage industry, data fusion has increased classification reliability and foodstuff specifications [4]. In the sports and biomedical fields, where multiple features are often required to make accurate detections, fusing data from various sensors enabled more complex and robust analyses [5]. In precision farming, data fusion has led to reduced network traffic and the development of innovative methodologies such as virtual fencing [6]. Similarly, in smart homes, data fusion has shown potential in energy management, uncertainty reduction, and the handling of unstructured data [7].
The fundamental concept behind multi-sensor data fusion is to integrate data in a way that produces synergistically improved outcomes, such as results that are actionable, explainable, and personalized [7]. Other advantages of data fusion include compensating for inaccurate data, lowering energy consumption in wireless networks, enhancing reliability, reducing noise, increasing robustness to individual node failures, and improving precision [8]. Together, these benefits contribute to higher quality of service and more efficient power usage. Furthermore, such networks offer greater flexibility and can support localized decision making within smaller network subsets [3].
These are all broad, industry-wide applications of sensor data fusion. However, since the intended application area for our data fusion framework is agriculture and animal monitoring, we investigated common research topics in these fields that specifically use data fusion. We would like to emphasize that providing a comprehensive and systematic review is beyond the scope of this paper. Instead, we provide a compilation of exciting examples to offer a snapshot of the state-of-the-practice in sensor fusion to study animals and plants. Subsequently, we review the current potential data fusion Python software libraries and explain how our proposed software is unique.

2.1. Crop and Agriculture-Related Sensor Data Fusion

In agricultural applications focusing on crops, previous data fusion efforts fall broadly into three main categories: plant health ([9,10,11,12]), food source quality ([4,10,13,14,15,16]), and soil profiling ([17,18,19,20]). Many of these applications address key agricultural aspects. As the IoT continues to evolve, crops can now be evaluated at multiple levels: the high (field- or crop-) level, the medium (organ-) level, and the low (cellular-) level [21]. This section concludes with an overview of the constraints and opportunities for sensor fusion in farm environments.
Healthy plants are disease- and defect-free and well-hydrated. Data fusion techniques have been applied to these indicators of plant health. For example, apples were classified based on spoilage, ripeness, and flavor using electronic noses [10]. Plant hydration levels have been assessed using bioimpedance spectroscopy, where surface electrodes placed on Arabidopsis sprouts were used to track the impedance over time, enabling precision hydration assessments [12]. Various mosaic viruses and fungal infections in cucumbers were monitored over several weeks using a combination of temperature sensors, thermography, and fluorometry, with data fused using stepwise discriminant analysis [9]. In some cases, genetic diseases can be monitored using data fusion with unmanned aerial vehicles (UAVs) or drones [11]; the researchers in that study fused large-scale traditional, multispectral, and thermal images using extreme learning machines to predict chlorophyll a and b levels.
Ensuring food quality is critical for public health. Electronic noses, near-infrared spectroscopy (NIRS), and machine vision have been fused with decision trees to detect unripe apples [16]. Interactance spectroscopy has been used to identify internal defects in sweet potatoes [15]. Banned dyes, such as Sudan III and IV, have been successfully detected in culinary spices using a combination of nuclear magnetic resonance and ultraviolet–visible spectroscopy, followed by data fusion using concatenation and fuzzy logic, achieving over 97% accuracy [13]. Sensor fusion has also contributed to other culinary applications; for example, electronic noses and tongues have been combined using a fuzzy neural network to identify wine types [14]. A comprehensive review of other fusion methods applied to vegetables, fruits/juices, cheeses, and honey is provided in [4].
Data fusion techniques have also been applied to soil profiling. In one study, pH, organic carbon, nitrogen and phosphorus levels were combined with measurements such as soil texture, trunk diameter, and carbon dioxide efflux to study the relationship between soil quality and poplar tree growth [18]. To address soil heterogeneity in precision agriculture, gamma-ray spectrometry, visible and near-infrared spectrometry, and a galvanic soil sensor were used to determine soil electrical conductivity and radioisotope content [19]. The Land Surface Data Toolkit, a multi-model simulation tool [17], integrates data using artificial neural networks to support agricultural and water management practices.
Lastly, a broader scope of agricultural applications of sensor fusion is discussed in [6], covering aspects such as location and user awareness, network traffic reduction, and implementation of virtual fences. One common trade-off in these applications involves the exchange of processing power with high-level fusion schemes for reduced network traffic, a concept known as edge mining. Common hindrances to such agricultural applications include managing data distribution across large farming areas and dealing with poor communication environments. Sensors must also be robust to withstand harsh environmental conditions found in large agricultural settings, including significant temperature and humidity fluctuations, as well as contamination from dirt and other waste. Despite these challenges, there are opportunities that need to be considered. Farm environments, combined with edge computing, can reduce reliance on cloud infrastructure and improve data security. In addition, small-scale energy harvesting technologies, such as solar and wind energy, can significantly support these distributed systems.

2.2. Animal-Monitoring-Related Sensor Data Fusion

In animal-related applications, sensor data fusion has been primarily applied to animal classification ([22,23,24,25]) and gait analysis ([5,26,27,28,29]). When extended to humans, data fusion has also supported fall detection in the elderly ([28,30,31]), assisted living and smart home applications [32], and biometric health monitoring, such as in patient care, chronic disease management and stress detection ([5,28,33,34]). Since our platform is designed to support animal-focused fusion applications, we concentrate here on the most relevant domains, including animal classification, gait analysis, and biometric health monitoring.
Animal classification is widely used in wildlife tracking to understand migration patterns, species endangerment, and biodiversity. Data fusion methods vary depending on species and sensor types, which may range from satellite images to local acoustic sensors. For example, zebra and nyala were classified using a residual neural network-based computer vision system, achieving a top-1 accuracy of 72% for the zebras but only 4.7% for the nyalas; the latter's poor performance was attributed to their effective camouflage [25]. In marine environments, plankton were classified using two video cameras and eight acoustic receivers, with confidence-weighted fusion reducing classification error by over 50% [24]. Population estimates of 35 animal species have also been achieved using combined data from visual transects, dung transects, and camera traps [23]. More recently, satellite geomagnetic imaging has been proposed to complement traditional tracking techniques for studying migratory behavior [22].
Gait classification typically employs wearable sensors. In one study, high-dimensional raw sensor data was used to classify dog behaviors such as standing, walking, running, sitting, and resting, with quaternion-based models achieving 93% precision [26]. Other research has modeled gait at the physiological and biomechanical levels. For example, fuzzy logic and multiple surface electromyography (EMG) sensors were used to estimate skeletal muscle forces [29]. Additional studies combined inertial measurement units (IMUs) with EMG sensors to calculate energy expenditure and muscle activation [27]. Sensor fusion techniques have also been applied using pressure sensors in footwear and sports equipment, including basketballs, helmets, and hockey sticks, to analyze motion and biomechanics [5].
Although primarily applied to humans, many health biometrics fusion techniques are adaptable to animal monitoring. One study combined step count, calorie intake, and sleep quality data to predict heart disease, achieving 98% accuracy by focusing on the most predictive features [33]. Stress detection has also been explored using various biosignals, including electrocardiography (ECG), EMG, electrodermal activity (EDA), and electroencephalography (EEG); interestingly, ECG alone produced the most significant results [34]. Additionally, researchers have found that data fusion can help reduce false positive rates in medical testing [5]. A comprehensive overview of wearable health monitoring systems is available in [28].

2.3. Open-Source Tools

Considered the first open-source platform for the development of biomarkers, the Digital Biomarker Discovery Pipeline (DBDP) was designed to work with the developing field of mobile health (mHealth) to assist in the monitoring of chronic diseases [35]. As an open-source tool, its aim is to facilitate collaboration between various medical communities and to strongly encourage contributions from like-minded researchers in the community. Upon release, the DBDP provided three example modules: “Resting Heart Rate”, “Sleep and Circadian Rhythms”, and “Continuous Glucose Monitoring”. In 2024, the publicly available DBDP included common modules like “Preprocessing” and “Exploratory Data Analysis”, as well as specific task-oriented modules, such as “Glucose Variability”, “Heart Rate Variability”, etc. (https://github.com/DigitalBiomarkerDiscoveryPipeline/DBDP, accessed on 17 April 2025).
A very popular Python package for machine learning applications, Scikit-learn, was originally designed as a high-level interface for medium-scale machine learning problems [36]. Its aim is to bring machine learning algorithms to non-software specialists, especially in other scientific fields such as biology. The creators focused on making Scikit-learn interoperable with other common packages such as NumPy, scipy, and matplotlib. Despite the focus on a high-level interface with good computing efficiency, this tool was planned to scale to larger datasets in future research [36].
These state-of-the-practice packages and platforms can be compared to our tool as follows. Like them, our tool and our earlier work are released as open-source code in public GitHub repositories to encourage widespread adoption, and they build on common platforms such as NumPy, pandas, matplotlib, and Scikit-learn. All are designed as end-to-end platforms with the expectation that future modules can be implemented easily. The main difference is that the DBDP code bases were explicitly designed to give the medical community access for mHealth applications, while Scikit-learn makes machine learning accessible to the scientific community, mainly in biology and physics. Neither places a heavy emphasis on data fusion.
While similar in modularity and intention, our data fusion tool is uniquely designed to bring data fusion exploration distinctly to early-scale testing in animal and plant analytics. Its focus area is early-stage pipeline prototyping within the veterinary and agricultural application areas, and the codebase is more generalized for end-to-end pipeline development for researchers in these fields.

3. Methods

The overall schema for our data fusion framework was designed to align with the five fundamental elements of a typical data fusion process (Figure 1). In summary:
  • Data Alignment assists with temporal and spatial alignment between different sources;
  • Feature Extraction performs temporal feature extraction from a time series and includes feature selection strategies;
  • Dimensional Reduction reduces the set of features of a dataset;
  • Data Fusion incorporates data from multiple various sources to arrive at a decision;
  • Decision Making sets the decision space for the model using classification and regression models.
As a theoretical example, consider a network of motion sensors on a single animal harness, where we want to know how much movement takes place during demonstrated behaviors: running, walking, and eating. Figure 2 shows an overview of the following process. Data alignment would be necessary to account for sensors on different body parts; these would also have to be temporally aligned to ensure the proper timing of each behavior. We then perform feature extraction techniques, such as finding the average impulse of movements or integrating motion into temporal windows, to obtain the metrics of interest. If the dataset is too large or highly correlated, as can be expected of sensors on the same animal, we may then reduce the dimension size with principal component analysis (PCA). We fuse the data into one large dataset, using the raw data or the features extracted before. Lastly, a classifier would be trained to distinguish among the three behaviors.
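The five stages of this example can be sketched with plain scikit-learn; the data below is synthetic and the stage boundaries are ours for illustration, not DFE calls.

```python
# Sketch of the harness example: align -> extract features -> reduce ->
# fuse -> decide, on synthetic windows of motion data (not the DFE API).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_windows, n_sensors, window_len = 300, 4, 50

# Pretend each row is one temporally aligned window from four harness
# sensors; labels stand in for running / walking / eating.
raw = rng.normal(size=(n_windows, n_sensors, window_len))
labels = rng.integers(0, 3, size=n_windows)

# Feature extraction: mean absolute value and variance per sensor window.
feats = np.concatenate([np.abs(raw).mean(axis=2), raw.var(axis=2)], axis=1)

# Dimensional reduction, fusion, and decision making chained together.
model = make_pipeline(PCA(n_components=4), RandomForestClassifier(random_state=0))
X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.2, random_state=0)
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))
```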

3.1. Towards an Open Source Tool for Precision Agriculture and Animal Welfare Researchers

In this paper, we also present the “Data Fusion Explorer (DFE)”, a software toolbox for precision agriculture and animal welfare researchers to perform early-stage discovery without requiring a deep level of expertise in multi-sensor data fusion (Figure 1). This is open-source Python software that assists users in rapidly prototyping fusion schemes to gain valuable insights in even earlier stages of their research. We refer to each sequence, from raw data through processing to regression or classification outcomes, as an individual data pipeline. Due to the open-ended nature of data fusion and the multitude of possible sequence rearrangements, we find iterative prototyping to be more beneficial than intricate model design. This work uniquely presents a generalized data fusion platform designed for rapid experimental prototyping and enabled by a distinct design framework derived from common applications in the animal monitoring and growing precision agriculture spaces. The study presented here also demonstrates the applicability of this framework and its translation into a new open-source Python package.
The DFE allows users to work at the pipeline level in a plug-and-play manner. The purpose of this effort is to facilitate early exploration and development of a data pipeline, to organize the most common tasks within a global framework, and to be functionally useful for different data formats in the field of agricultural and animal monitoring. Following the development and operations (DevOps) strategy, we released a reasonably useful product before formal completion. This also includes updated documentation and 102 unit test cases that have been automated with PyTest using GitHub Actions [37] as well as multiple example cases that act as informal integration tests.
As indicated, the DFE was designed with data problems in agricultural and animal monitoring spaces in mind. These application areas span fields where data fusion is common, such as soil measurements [19], the food industry [4], determination of food quality, and animal welfare and monitoring. Datasets were also selected to vary in formatting between singlet, array-style, and image-style data.
Coded in Python3, using open-source modules like scipy, numpy, pandas, sklearn, etc., and accessible via the pip installer, the DFE can realize the corresponding data fusion pipelines. Users create a DFE_object, and all further calls are made through this object. An associated reference guide that describes all available methods as well as their inputs and outputs is included in the GitHub repository. Modifications to the software undergo automated checks in accordance with continuous integration and deployment (CI/CD) to ensure continued compatibility with past versions.

3.2. Dataset Descriptions

We had access to four diverse early-stage datasets (Table 1) that we used to demonstrate the usefulness of the presented data fusion framework for a wide variety of agricultural and animal applications. It should be noted that the analysis of these databases is beyond the scope of this paper and not the intention of the research presented here. Rather, the three common types of data formatting (singlets, arrays, and images) contained in these databases are used as a medium to compare and benchmark various features of pipelining, modularization, and rapid prototyping. The array-style data is separated from singlet-style data because array data requires a dimensional reduction step before further processing. This is to prevent over-analyzing highly redundant data.

3.3. Dog Collar Dataset: Singlets to Classification

The Dog Collar Dataset was acquired from a sensor system attached to the collars of guide dog puppies being trained for one of the largest guide dog organizations in the United States: Guiding Eyes for the Blind [38]. The collar sensor simultaneously collects data related to the local environment (i.e., ambient temperature, relative humidity, barometric pressure, ambient light level, ambient acoustic noise level) and dog behavior (i.e., inertial measurement units including accelerometers) (Table 1). This database is used to better understand and predict the performance of guide dog training. During guide dog puppy evaluation and testing, the dogs performed tasks corresponding to several dozen activity states.
For this dataset, we used our data fusion framework to classify the activity state of the dog. We compared three pipelines spanning low-level (data to decision, P1) and mid-level (feature to decision) fusion. The two mid-level pipelines contrast fusion and classification as separate steps, principal component analysis (PCA) followed by Random Forest (P2), against fusion and classification performed in the same step with linear discriminant analysis (LDA) (P3). The different pipelines are shown in Figure 3, and a sketch of the comparison follows below.
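As a minimal sketch of this comparison, the snippet below contrasts the two mid-level schemes on synthetic stand-in features (the collar data itself is not reproduced here); PCA followed by Random Forest keeps reduction and classification separate, while LDA performs both in one step.

```python
# P2 vs P3 on synthetic stand-in data: PCA + Random Forest (two steps)
# against LDA (reduction and classification in a single step).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

p2 = make_pipeline(PCA(n_components=5), RandomForestClassifier(random_state=0))
p3 = LinearDiscriminantAnalysis()  # fusion and classification together

for name, model in [("P2: PCA -> RF", p2), ("P3: LDA", p3)]:
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", round(model.score(X_te, y_te), 3))
```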

3.4. Moth Detector Dataset: Singlets to Regression

The second dataset was related to a sensing platform developed to count corn earworm moths emerging in agriculture fields [39]. It is a moth trap with an infrared-based counter implemented at the trap entrance. The system also measured local environmental data, including ambient temperature, relative humidity, barometric pressure, wind, rain, and ambient light levels (Table 1). These sensors were installed in farmland regions around North Carolina for several months.
The objective of using the dataset in this paper was to explore the relationship between moth emergence density and local environmental conditions. Using the presented data fusion framework, we compared three different pipeline schemas varying by window scheme (see Figure 4).
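Since the environmental data arrive roughly once per minute, the three window schemes correspond to about 5, 20, and 60 samples per window. Below is a minimal sketch of such windowed feature extraction with 50% overlap on a synthetic signal; the mean/standard-deviation features are assumptions for illustration.

```python
# Sliding-window temporal features with 50% overlap, for the three
# assumed window lengths (in samples, at ~1 sample per minute).
import numpy as np

def window_features(signal, win):
    """Mean and standard deviation per window; hop = win // 2 (50% overlap)."""
    hop = max(win // 2, 1)
    starts = range(0, len(signal) - win + 1, hop)
    return np.array([[signal[s:s + win].mean(), signal[s:s + win].std()]
                     for s in starts])

env = np.sin(np.linspace(0, 40, 2000)) + 0.1 * np.random.randn(2000)
for label, win in [("P1 short (5 min)", 5), ("P2 medium (20 min)", 20),
                   ("P3 long (1 h)", 60)]:
    print(label, "-> feature matrix shape:", window_features(env, win).shape)
```

Note how the shorter windows yield many more feature rows, which foreshadows the time and space differences reported in Section 4.2.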

3.5. Plant Impedance Dataset: Arrays to Regression

The third dataset came from a set of sensors placed on the leaves of maize plants to measure leaf impedance in parallel with the environmental conditions, in an effort to optimize irrigation and fertilizer applications. This plant leaf impedance dataset contained months of impedance spectroscopy data, with the sensor system producing a spectrum every ten minutes [12] (Table 1). Previous work with this dataset used a MATLAB-based (R2023b, MathWorks Inc., Natick, MA, USA) script to fit a double-shell equivalent circuit to the spectrum and monitored the corresponding parameters over days of collected data. The objective here was to extract drought-related model parameters from features of the impedance spectra. Three pipelines were performed with this dataset, varying according to the order of dimension reduction, feature extraction, and linear regression (see Figure 5). The two upper paths (P1 and P2) change the order of dimension reduction and feature extraction, while the bottom path (P3) merges dimension reduction and regression using the least absolute shrinkage and selection operator (LASSO).
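The merging step in P3 can be illustrated in a few lines: LASSO's L1 penalty drives uninformative coefficients to zero, so feature selection (a form of dimension reduction) and regression happen in one fit. The sketch below uses synthetic data standing in for a redundant impedance spectrum.

```python
# Sketch of the P3 idea: LASSO performs selection and regression jointly.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
# Stand-in for a redundant spectrum: 50 frequencies, few of which matter.
X = rng.normal(size=(200, 50))
y = 2.0 * X[:, 3] - 1.5 * X[:, 17] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients at frequency indices:",
      np.flatnonzero(lasso.coef_))
```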

3.6. Sweet Potato Dataset: Images to Classification

The final dataset consisted of image data captured in a sweet potato processing facility aimed at automating the high-throughput quality assessment of food resources [40]. The completed image processing steps and other details can be found in [40]. Starting with the features presented in that study, the aim was to classify each potato as either “U.S. No. 1” or “Cull” quality (Table 1). We achieved classification with two dimension reduction techniques followed by classification (Figure 6). The hypothesis was that independent component analysis (ICA) would perform better with a Naïve Bayes classifier, as this classifier assumes independence in the data, which aligns with ICA’s objective of maximizing independence.
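A minimal sketch of this two-pipeline comparison, on synthetic stand-in features rather than the actual sweet potato images, is shown below; the component count is an assumption for illustration.

```python
# P1 vs P2: ICA or PCA feeding the same Naive Bayes classifier.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FastICA
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=400, n_features=30, n_informative=8,
                           n_redundant=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

reducers = [("P1: ICA", FastICA(n_components=8, whiten="unit-variance",
                                random_state=0)),
            ("P2: PCA", PCA(n_components=8))]
for name, reducer in reducers:
    clf = make_pipeline(reducer, GaussianNB()).fit(X_tr, y_tr)
    print(name, "-> Naive Bayes accuracy:", round(clf.score(X_te, y_te), 3))
```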

3.7. Metrics of Comparison

For pipeline comparisons, we report different metrics based on the task type:
For classification tasks, we report the following metrics (all averaged over classes unless otherwise stated):
  • Precision measures the portion of correctly classified instances among all instances classified as a given class.
  • Recall measures the portion of actual instances of a class that were correctly identified.
  • F1-Score is the harmonic mean of precision and recall and a common metric for overall classification performance.
  • Accuracy is the overall proportion of correctly predicted instances across all classes.
For regression tasks, we report the following:
  • The Pearson Correlation Coefficient (R) measures the linear correlation between predicted values and ground truth.
  • Root Mean Square Error (RMSE) is the square root of the average squared difference between predicted and actual values.
  • Relative Root Mean Square Error (RelRMSE) is similar to RMSE, but normalized by the magnitude of predictions.
  • Mean Absolute Error (MAE) computes the average absolute difference between predictions and actual values, treating all errors equally.
  • Relative Absolute Error (RAE) is a normalized version of MAE, relative to the errors in a naive model or baseline.
We also report space and time complexity metrics for each pipeline to enable performance comparisons. Furthermore, to assess the usefulness of the corresponding DFE, we measured the number of lines of code saved when using the five DFE modules versus implementing the same functionality without them.
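For reference, one common set of definitions for the regression metrics above is sketched below; the normalizations for RelRMSE and RAE follow the descriptions given here, though other conventions exist in the literature.

```python
# Assumed definitions of the regression metrics described above.
import numpy as np

def regression_metrics(y_true, y_pred):
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    rel_rmse = rmse / np.mean(np.abs(y_pred))   # normalized by prediction magnitude
    rae = np.sum(np.abs(err)) / np.sum(np.abs(y_true - y_true.mean()))  # vs. mean baseline
    r = np.corrcoef(y_true, y_pred)[0, 1]       # Pearson correlation
    return {"R": r, "RMSE": rmse, "RelRMSE": rel_rmse, "MAE": mae, "RAE": rae}

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(regression_metrics(y_true, y_pred))
```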

4. Results

We calculated metrics to determine the best fusion pipeline for each dataset. With each model, we divided the data into an 80% training set and a 20% testing set. For classification models, the standard metrics were precision, recall, F1-score, and accuracy between estimated categories and true categories. For regressions, we calculated the correlation between the estimates and true values; this would ideally form a line of slope 1. We also included the standard error metrics RMSE and MAE along with their relative variants (RelRMSE and RAE).
The pipeline comparison methods were primarily relative, as the objective was to optimize the performance of a data fusion pipeline. The true values used for the evaluation were derived from the test scores, although these values were inherently subject to standard measurement errors. To illustrate specific examples, the performance of the Dog Collar dataset was impacted by poor sensor fitting and attachment, with ergonomic challenges exacerbated by the presence of fur [41]. In the Moth dataset [39], significant errors remained due to both missed detections and false positives, the latter often caused by moth fluttering behavior as they entered the trap. The weather sensor also contributed random noise, and spatial interpolation of the field data introduced additional modeling error. The Plant Impedance dataset [12] was highly sensitive to plant hydration levels and local weather conditions, and electrode orientation could subtly affect measurements. Additionally, the plant’s immune response may have increased impedance by covering or attacking the electrodes. Lastly, the Sweet Potato dataset relied heavily on image data, making it susceptible to image quality issues [40]; potato orientation and accurately determining the main axes posed further challenges during image analysis.
As is common in animal welfare and agriculture studies, our aim was to understand the trade-off between model fit and computational complexity. We monitored time and space complexity using the time and sys Python packages, respectively, and present these metrics as well.
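A minimal sketch of this kind of instrumentation is shown below; sys.getsizeof is a shallow measure, so a full accounting of nested arrays would need to walk the contained buffers as well.

```python
# Wall-clock timing with the time module and object size via sys.
import sys
import time

import numpy as np

start = time.perf_counter()
X = np.random.randn(100_000, 20)
features = X.mean(axis=1)            # stand-in for one pipeline stage
elapsed = time.perf_counter() - start

print(f"stage time:  {elapsed:.4f} s")
print(f"array bytes: {X.nbytes / 1e6:.1f} MB (numpy buffer)")
print(f"object size: {sys.getsizeof(features)} bytes (sys.getsizeof)")
```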
Lastly, we evaluated the usefulness of the DFE as a tool by how much it simplifies the user's work. We achieved this by estimating the number of lines of code required to complete each pipeline, comparing the lines required with and without the DFE tool. Lines beyond those required when using the DFE module were considered superfluous.

4.1. Dog Collar Dataset: Singlets to Classification

The Dog Collar dataset had three pipelines: one that performed low-level fusion (P1), another that performed mid-level fusion using dimension reduction followed by classification (P2), and a third that performed dimension reduction and classification simultaneously (P3) (Figure 3). A particularity of this dataset was the need for pre-alignment: with many modalities sampled at different rates, we interpolated the data points onto a common time axis and then concatenated each sensor as a separate feature. This was a classification model, and the model fit scores for each pipeline are presented in Figure 7. We see that P3 performs overwhelmingly better than P1 and P2 in all metrics, suggesting that LDA was the best choice for this classification dataset.
The computational complexity of time and space is shown in Figure 8a. Here, we see that all pipelines perform roughly the same in both time and space complexity. Given its significantly higher accuracy (Figure 7), there was no reason not to use P3.
Lastly, the level of code simplification provided by the DFE is shown in Figure 9. All pipelines showed a drastic, roughly three-fold reduction in the number of code lines required. The simpler design of P1 required only about 40% as much code as the other pipelines to begin with, but it was still dramatically improved with the use of the DFE.

4.2. Moth Detector Dataset: Singlets to Regression

The Moth Detector dataset had three pipelines, all of which varied in window scheme for temporal feature extraction (Figure 4). The window variation (5 min, 20 min, or 1 h) was used to test the temporal resolution of the data; if the resolution was poor, then longer window sizes should not have affected the model. All windowing had a 50% overlap, but P1 used a short window of 5 min, P2 a medium window of 20 min, and P3 a long window of 1 h. This was a regression model, and the model fit scores for each pipeline are presented in Figure 10. We see that all three pipelines performed approximately equally in both R and RAE. R values close to 1 indicated a strong correlation between the predictions and the test values. However, the RMSE and MAE were not informative in this case as they yielded zero for all pipelines. The relative metric RAE revealed overall similarity between the prediction plots, while RelRMSE showed that the P3 pipeline had a higher relative error compared to P1 and P2. This suggests that P3 may have included several outliers that contributed to the increased RelRMSE, an effect that was less apparent in the RAE results. Overall, this suggests that the window length was largely irrelevant, at least up to 1 h. This dataset was also unique for its use of Kernel Density Estimation (KDE) before temporal alignment.
The computational time and space complexity are shown in Figure 8b. Here, we see that pipeline P1 used about twice as much time and space as the other pipelines and that pipeline P2 used more time than P3. This is likely because the short windowing scheme in P1 generated the greatest number of features, followed by the windowing scheme in P2, then P3. P3 used slightly less space as expected. However, space usage was similar between P2 and P3. We hypothesize that this is due to a similar backend array initialization process that sets a standard minimal storage size. Since the accuracy of P3 is on par with P1, there is no reason not to use P3.
Lastly, the level of code simplification provided by the DFE is shown in Figure 9. All pipelines showed about a 5-fold reduction in lines of code compared to Vanilla Python. All three pipelines used the same amount of code because the window design was a simple change in a single for-loop parameter.

4.3. Plant Impedance Dataset: Arrays to Regression

The Plant Impedance dataset had three pipelines (P1, P2, and P3), varying the order of dimensional reduction, feature extraction, and linear regression (Figure 5). This was a regression model, and the model fit scores for each pipeline are presented in Figure 10. We saw that R values were close to 1 across all pipelines, indicating a strong correlation between the predictions and test values. The RMSE and MAE identified pipeline P2 as having the highest error, followed by P3, with P1 performing the best. In contrast, the relative metrics (RelRMSE and RAE) were not effective in this case, as they yielded zero for most pipelines. Overall, these results suggest that P1 achieved the best fit, indicating that reducing dimensionality prior to feature extraction was an effective approach for this dataset. Computational complexity (Figure 8c) shows many differences between these pipelines. All three performed within subseconds, but P1 was the fastest. This is likely because it began with dimensional reduction, lowering the number of later calculations. However, P1 also used about twice the space requirements of the other two pipelines. The level of code simplification is shown in Figure 9. All three pipelines had similar numbers of lines of code, and all three were reduced three-fold in comparison to Vanilla Python.

4.4. Sweet Potato Dataset: Images to Classification

The Sweet Potato dataset had two pipelines (P1, P2), varying the dimension reduction process (Figure 6). This was a classification model, and the model fit scores for each pipeline are presented in Figure 7. We see that the two perform equally well, suggesting that these differences were irrelevant for fit purposes. Computational complexity analysis (Figure 8d) showed that time usage was almost identical but that pipeline P2 used lower space requirements. The level of code simplification is shown in Figure 9. Both pipelines had similar user input requirements and produced a reduction of about 50%. The modest improvement was primarily due to the brevity of the script, as the initial dataset had already undergone extensive preprocessing. Temporal and spatial alignment were therefore not required.

5. Discussion

In the Dog Collar dataset, the three pipelines demonstrated significantly better results with mid-level LDA-based fusion compared to the other two schemes. The time and space requirements showed similar results within all pipelines tested, so the LDA-based fusion pipeline was overall better at extracting the information from the data. The DFE tool was able to quickly identify this.
The Moth Detector dataset evaluated the effects of temporal feature windowing on moth density prediction using environmental data. We tested window lengths of 5 min, 20 min, and 1 h, with environmental data collected approximately once per minute. We concluded that the longest window length (1 h) was the best pipeline due to its lowest time complexity. However, the 20-min window also performed similarly, so either setup is a reasonable choice. Additionally, this information could help adjust the environmental sampling rate to align more closely with the theoretical Nyquist sampling rate for weather data, potentially reducing upstream data-space requirements.
With the Plant Impedance dataset, we investigated the effects of the order of dimensional reduction, feature extraction, and classification. Starting with dimensional reduction (P1) performed best. However, pipeline P1 incurred approximately twice the space costs of the other pipelines, which is an important consideration for certain applications. For example, for a space-sensitive IoT device, pipeline P3 would be preferable.
In the Sweet Potato dataset, we compared the performance of ICA (pipeline P1) or PCA (pipeline P2) with a Naïve Bayes classifier. We hypothesized that ICA would yield better performance, as it produces statistically independent components, an assumption that aligns well with the core assumption of the Naïve Bayes model. In fact, the ICA dimensionality reduction pipeline outperformed PCA, achieving 69% accuracy, which was 6% higher than that of PCA. To assess whether this performance difference was statistically significant, we repeated each pipeline 30 times and applied a one-tailed paired t-test to the resulting accuracy scores. The analysis confirmed that ICA significantly outperformed PCA, with a p-value of 0.041. We attribute this improvement to the alignment between ICA’s independence objective and the Naïve Bayes assumption of feature independence. This suggests that pipeline P1 was likely a better engineered solution. Additionally, the Sweet Potato dataset contained high levels of redundancy, with many measurements collected along a central axis. It was plausible that PCA, focusing on variance rather than independence, discarded some seemingly redundant but informative components that ICA was able to retain.
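The significance test can be reproduced in outline as follows; the accuracy scores below are synthetic stand-ins for the 30 repeated runs, and scipy's alternative="greater" argument yields the one-tailed p-value directly.

```python
# One-tailed paired t-test over 30 paired accuracy scores (synthetic).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc_ica = 0.69 + 0.03 * rng.standard_normal(30)   # stand-in ICA accuracies
acc_pca = 0.63 + 0.03 * rng.standard_normal(30)   # stand-in PCA accuracies

t_stat, p_value = stats.ttest_rel(acc_ica, acc_pca, alternative="greater")
print(f"t = {t_stat:.2f}, one-tailed p = {p_value:.4f}")
```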
In almost every dataset, a reduction of at least 60% in code length was observed, with reductions of 70–80% being common. This indicates that the tool is a valuable module for setting up and evaluating data pipeline flows. This open-source Python package is expected to reduce the coding skills required of users, particularly researchers working on early-stage studies of animal and plant monitoring. Excluding the need for custom dataset formatting, the average number of lines of code for a pipeline using the DFE was 15, compared to 52 lines when the tool was not used.
These four scenarios were intentionally diverse, representing a broad range of tasks typically encountered by beginner modelers in the fields of agriculture or animal welfare. Our data fusion framework successfully accommodated all four use cases, offering multiple pipeline options for each. This demonstrated both the versatility and practical utility of the proposed pipeline methodology. In its current form, the DFE serves as a minimum viable product. It provides a solid foundation for future development as we continue to identify and integrate new, application-driven features. These four examples may not comprehensively cover the full spectrum of data fusion scenarios likely to emerge in the future, and generalizability limitations may exist. Nonetheless, our framework’s flexible and extensible architecture is well positioned for future enhancements and adaptations as new requirements arise.
Specifically for our DFE tool, we included a standard user guide for maintenance and quick start purposes. While the function names are designed to be intuitive, they do make assumptions and may require further study by researchers to ensure appropriate use depending on the specific application.
Designed for early-stage pipeline development, the DFE was not intended for large-scale projects, nor has it been evaluated on such scales. However, there are several ways to scale up. For example, when datasets involve multiple subjects or arrays, parallelization can be easily implemented. In a cluster, each subject could be assigned to its own node, where calculations and dimensionality reduction can be performed separately and then aggregated for larger datasets. Performing varying levels of reduction a priori can make the size of the aggregated dataset more manageable. Additionally, as demonstrated in the four test cases, users can identify pipelines that may not scale well in terms of time or space and explore alternative methods for these cases. Users can also conduct tests to determine whether scaling up is advisable. By combining data at lower levels, applying data reduction techniques, and then aggregating, users can confirm whether such groupings yield similar results. Alternatively, intermediate data can be extracted or processed manually for further scaling, while pipeline design may still provide valuable insights.
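As a sketch of the per-subject parallelization described above (with synthetic data and an arbitrary component count), each subject's block can be reduced on its own worker and the reduced blocks aggregated afterward:

```python
# Per-subject dimensionality reduction in parallel, then aggregation.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from sklearn.decomposition import PCA

def reduce_subject(data):
    """Reduce one subject's (windows x features) block to 5 components."""
    return PCA(n_components=5).fit_transform(data)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    subjects = [rng.normal(size=(200, 40)) for _ in range(8)]
    with ProcessPoolExecutor() as pool:
        reduced = list(pool.map(reduce_subject, subjects))
    fused = np.vstack(reduced)   # aggregate for downstream fusion
    print("aggregated shape:", fused.shape)
```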

6. Future Work and Conclusions

The development of user-friendly multisensor data fusion tools is a pressing need and an open challenge within the precision agriculture field, particularly when it involves plant and animal data. Due to the complexity of physical, environmental, and biological phenomena, data fusion approaches often benefit from iterative design processes. Since multisensor data fusion pipelines involve numerous operations and are highly versatile, modular frameworks that enable rapid prototyping of different pipeline configurations are essential. We proposed such a framework along with an open-source tool called the Data Fusion Explorer (DFE). This is the first generalized data fusion platform with a unique design framework specifically tailored for animal monitoring and agricultural applications. In this article, we introduced this framework to the community engaged in early-stage animal and plant monitoring research. We also demonstrated the applicability of this framework to four early-stage datasets, which span various disciplines, data types (singlet, array, image), objectives (classification or regression), and distributions (spatial or temporal). For each dataset, we derived comparison metrics for the customized pipelines, comparing these with traditional regression and categorical metrics. Using the DFE, we observed a significant reduction (averaging over 50% fewer written lines) in coding requirements for users across almost all datasets, highlighting the tool’s potential to simplify workflows for common users.
The data fusion framework is a work in progress, and we anticipate that future updates will introduce new functionalities as emerging needs are identified. The framework was evaluated using custom-designed pipelines applied to the four datasets available to the authors. As the framework is adopted more broadly, we anticipate that additional, commonly encountered use cases will emerge, informed by the evolving needs of the community. While future versions of the DFE tool may include new modules, the core framework is expected to remain consistent. Looking ahead, GitHub engagement metrics, such as stars and pull requests, will eventually provide insight into community interest and serve as proxies for the framework’s usefulness. However, since this is a newly released package, such metrics are not yet available.
Planned enhancements for the DFE tool include built-in space and time-complexity analysis using decorator functions and new feature extraction methods. Examples include Fourier and Wavelet transforms, as well as user-defined feature functions. Once an optimal pipeline is identified, parallelization strategies may be implemented to enable large-scale deployments and improve efficiency. We recognize the potential to further abstract the DFE’s pipelines into a unified pipeline object that accepts processing functions along with their parameters. For example, the short window Moth pipeline (P1) could be expressed as: P1 = pipeline(moth_dataset, [{temporal_features, 5, “min”}, {PCA, 20}, {linear_regression}]). Such an abstraction would significantly streamline the user experience and reduce code overhead, making the tool more intuitive and scalable for diverse applications. The DFE has also been designed with dependencies on well-established packages like NumPy, pandas, and Scikit-learn. Future enhancements could facilitate easier integration with these libraries along with easy exporting to common file formats. Lastly, we may expand support to include data input formats like streaming datasets or time series-specific data.
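A minimal sketch of what such a unified pipeline object could look like follows; every name and step function here is illustrative only, not the current DFE API.

```python
# Hypothetical pipeline abstraction: an ordered list of (function, *args)
# steps folded over the data, mirroring the P1 example above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pipeline(data, steps):
    for func, *args in steps:
        data = func(data, *args)
    return data

def temporal_features(xy, win):
    """Windowed means of X and y with 50% overlap (toy feature extractor)."""
    X, y = xy
    hop = max(win // 2, 1)
    idx = range(0, len(y) - win + 1, hop)
    return (np.array([X[i:i + win].mean(axis=0) for i in idx]),
            np.array([y[i:i + win].mean() for i in idx]))

def pca_step(xy, n):
    X, y = xy
    return PCA(n_components=n).fit_transform(X), y

def linear_regression(xy):
    return LinearRegression().fit(*xy)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(600, 30)), rng.normal(size=600)
model = pipeline((X, y),
                 [(temporal_features, 5), (pca_step, 20), (linear_regression,)])
print("fitted model:", model)
```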
A particularly interesting future application of the framework lies in its potential use with Rashomon sets [42]. These are collections of models that achieve near-equivalent prediction accuracy. When multiple distinct models yield similar results, the existence of a large Rashomon set becomes plausible [42]. In such cases, the likelihood of identifying a simpler and more interpretable model increases, aiding model transparency and usability. The DFE framework facilitates rapid prototyping across diverse learning strategies, enabling users to identify simpler models when performance is comparable. In such cases, the simpler model is preferred, or users may further explore even more minimal pipelines.

Author Contributions

D.M.: Conceptualization, methodology, software, validation, formal analysis, data curation, writing—original draft preparation, visualization. D.L.R.: methodology, writing—review and editing. A.B.: Conceptualization, resource generation, writing—review and editing, funding and project management. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the US National Science Foundation grant numbers IIS-1915599, IIS-2037328, EF-2319389, ITE-2344423 and EEC-1160483, the US Department of Agriculture and National Institute of Food and Agriculture grant numbers 2017-70006-27205, 2020-33522-32272, 2018-67007-28423 and 2023-67021-40547, by the North Carolina Soybean Producers Association and the US National Cancer Institute grant number 1R01CA297854-01.

Data Availability Statement

The corresponding code, the DFE module(s) and the code corresponding to the four integration test examples that were presented in this study are openly available in GitHub at https://github.com/ibionics/data_fusion_explorer, accessed on 17 April 2025.

Acknowledgments

We thank Edgar Lobaton of North Carolina State University for early discussions and suggestions. We want to thank Haque et al. for their sweet potato data, which was used in the Sweet Potato dataset [40].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Popescu, G.H.; Petreanu, S.; Alexandru, B.; Corpodean, H. Internet of Things-based Real-Time Production Logistics, Cyber-Physical Process Monitoring Systems, and Industrial Artificial Intelligence in Sustainable Smart Manufacturing. J. Self-Gov. Manag. Econ. 2021, 9, 52–62. [Google Scholar] [CrossRef]
  2. Watts-Schacter, E.; Kral, P. Mobile Health Applications, Smart Medical Devices, and Big Data Analytics Technologies. Am. J. Med. Res. 2019, 6, 19–24. [Google Scholar] [CrossRef]
  3. Dasarathy, B. Sensor fusion potential exploitation-innovative architectures and illustrative applications. Proc. IEEE 1997, 85, 24–38. [Google Scholar] [CrossRef]
  4. Borràs, E.; Ferré, J.; Boqué, R.; Mestres, M.; Aceña, L.; Busto, O. Data fusion methodologies for food and beverage authentication and quality assessment—A review. Anal. Chim. Acta 2015, 891, 1–14. [Google Scholar] [CrossRef]
  5. Mendes, J.J.A., Jr.; Vieira, M.E.M.; Pires, M.B.; Stevan, S.L., Jr. Sensor Fusion and Smart Sensor in Sports and Biomedical Applications. Sensors 2016, 16, 1569. [Google Scholar] [CrossRef]
  6. Ivanov, S.; Bhargava, K.; Donnelly, W. Precision Farming: Sensor Analytics. IEEE Intell. Syst. 2015, 30, 76–80. [Google Scholar] [CrossRef]
  7. Dasappa, N.S.; Kumar G, K.; Somu, N. Multi-sensor data fusion framework for energy optimization in smart homes. Renew. Sustain. Energy Rev. 2024, 193, 114235. [Google Scholar] [CrossRef]
  8. Kenyeres, M.; Kenyeres, J.; Hassankhani Dolatabadi, S. Distributed Consensus Gossip-Based Data Fusion for Suppressing Incorrect Sensor Readings in Wireless Sensor Networks. J. Low Power Electron. Appl. 2025, 15, 6. [Google Scholar] [CrossRef]
Figure 1. Overview of data fusion pipeline framework.
Figure 2. An example of a data fusion sequence. Steps that align with the data fusion framework are labeled with brackets ([]). We abbreviate the time column as “t” and aligned or corrected time as “t*”.
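To make the sequence concrete, the following minimal pandas sketch mirrors the steps Figure 2 depicts: two hypothetical sensor streams, each with its own raw time column t, are clock-corrected onto a shared t* and then fused by nearest-timestamp matching. The sensor names, sampling rates, and clock offsets are illustrative assumptions, not values drawn from the datasets.

```python
import pandas as pd

# Hypothetical streams: a fast accelerometer and a slow temperature sensor,
# each with its own uncorrected time column "t" (seconds).
accel = pd.DataFrame({"t": [0.00, 0.11, 0.19, 0.31], "ax": [0.1, 0.3, 0.2, 0.4]})
temp = pd.DataFrame({"t": [0.05, 1.02], "temp_c": [21.3, 21.4]})

# [Alignment] Correct each sensor's clock to a shared reference, yielding t*.
accel["t_star"] = accel["t"] - 0.02  # assumed per-device clock offsets
temp["t_star"] = temp["t"] + 0.01

# [Fusion] Join the slower stream onto the faster one at the nearest earlier t*.
fused = pd.merge_asof(
    accel[["t_star", "ax"]].sort_values("t_star"),
    temp[["t_star", "temp_c"]].sort_values("t_star"),
    on="t_star",
    direction="backward",
)
print(fused)
```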
Figure 3. Dog collar test data pipeline flowchart. The top pipeline (P1) is low-level fusion. The middle (P2) and bottom (P3) pipelines are different mid-level fusion techniques.
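The distinction the flowchart draws can be sketched in a few lines: low-level fusion (P1) concatenates the raw, time-aligned signals before any learning, while a mid-level scheme (in the spirit of P2/P3) first summarizes each sensor into features and fuses those. The array shapes, summary statistics, and classifier below are assumptions for illustration, not the configuration used on the dog collar data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical windows: 100 examples x 50 time steps per sensor.
inertial = rng.normal(size=(100, 50))
audio = rng.normal(size=(100, 50))
labels = rng.integers(0, 2, size=100)

# Low-level fusion (P1): concatenate raw streams, learn on the joint signal.
X_low = np.hstack([inertial, audio])

# Mid-level fusion (P2/P3 flavor): extract per-sensor features, then fuse.
def summarize(x):
    return np.column_stack([x.mean(axis=1), x.std(axis=1), x.min(axis=1), x.max(axis=1)])

X_mid = np.hstack([summarize(inertial), summarize(audio)])

for name, X in [("low-level", X_low), ("mid-level", X_mid)]:
    clf = RandomForestClassifier(random_state=0).fit(X, labels)
    print(name, X.shape)
```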
Figure 4. Moth dataset pipeline flowchart. The top pipeline (P1) uses short-length windowing, the middle pipeline (P2) mid-length windowing, and the bottom pipeline (P3) long-length windowing.
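Since the three pipelines differ only in window length, a small sliding-window sketch shows what is actually varied; the widths and step below are hypothetical, but the trade-off they expose is general: longer windows produce fewer, smoother feature vectors, shorter windows more, noisier ones.

```python
import numpy as np

def window_features(signal, width, step):
    """Summarize each sliding window of `width` samples with mean and std."""
    starts = range(0, len(signal) - width + 1, step)
    return np.array([[signal[s:s + width].mean(), signal[s:s + width].std()]
                     for s in starts])

rng = np.random.default_rng(1)
signal = np.sin(np.linspace(0, 20, 1000)) + rng.normal(0, 0.1, 1000)

# Illustrative short/mid/long widths standing in for P1/P2/P3.
for label, width in [("short", 25), ("mid", 100), ("long", 400)]:
    feats = window_features(signal, width, step=width // 2)
    print(f"{label:5s} window -> {feats.shape[0]} windows x {feats.shape[1]} features")
```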
Figure 5. Plant impedance pipeline flowchart. The top pipeline (P1) starts with dimensional reduction. The middle pipeline (P2) performs feature extraction first. The bottom pipeline (P3) combines dimension reduction and regression using LASSO.
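In scikit-learn terms, the contrast reduces to where reduction sits in the pipeline. The sketch below is a stand-in under assumed shapes and hyperparameters: P1 reduces dimensionality before regressing, and P3 lets LASSO perform selection and regression jointly; P2 would place a feature-extraction step where the PCA sits.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(2)
# Hypothetical impedance spectra: 200 plants x 64 frequency bins.
X = rng.normal(size=(200, 64))
y = X[:, :4].sum(axis=1) + rng.normal(0, 0.1, 200)

pipelines = {
    # P1: dimensional reduction first, then regression.
    "P1 reduce->regress": make_pipeline(StandardScaler(), PCA(n_components=8), LinearRegression()),
    # P3: LASSO folds feature selection into the regression itself.
    "P3 lasso": make_pipeline(StandardScaler(), Lasso(alpha=0.05)),
}
for name, pipe in pipelines.items():
    pipe.fit(X, y)
    print(name, round(pipe.score(X, y), 3))  # in-sample R^2, illustrative only
```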
Figure 6. Sweet potato pipeline flowchart. The top pipeline (P1) uses ICA and the bottom pipeline (P2) uses PCA for dimension reduction.
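A minimal sketch of the P1/P2 contrast, assuming flattened image vectors as input: PCA extracts orthogonal directions of maximal variance, whereas ICA searches for statistically independent components, one plausible reason it performed slightly better on this imaging data. The sample count and number of components are placeholders.

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(3)
# Hypothetical flattened scans: 150 samples x 1024 pixels each.
X = rng.normal(size=(150, 1024))

X_pca = PCA(n_components=10).fit_transform(X)                      # P2
X_ica = FastICA(n_components=10, random_state=0).fit_transform(X)  # P1
print(X_pca.shape, X_ica.shape)  # both (150, 10); the bases differ
```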
Figure 7. For each classification dataset, each pipeline’s precision, recall, F1-score, and accuracy are shown. P1…P3 = Pipeline 1…3. See Figure 3 and Figure 6 for pipeline details.
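The four scores all derive from confusion-matrix counts; the snippet below computes them with scikit-learn on made-up labels, purely to fix the definitions (these are not pipeline outputs).

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("f1       ", f1_score(y_true, y_pred))         # harmonic mean of the two
print("accuracy ", accuracy_score(y_true, y_pred))   # (TP + TN) / total
```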
Figure 8. Time and space complexity are shown for each pipeline in the (a) Dog Collar, (b) Moth Detector, (c) Plant Impedance, and (d) Sweet Potato datasets. Blue and left axes indicate time usage; orange and right axes indicate space usage. P1…3 = Pipeline 1…3.
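The article does not prescribe a profiling recipe at this point, but time and space usage of a pipeline run can be measured along the following lines; this helper, built on time.perf_counter and tracemalloc, is a hypothetical stand-in rather than the benchmarking code behind the reported numbers.

```python
import time
import tracemalloc
import numpy as np

def profile(pipeline_fn):
    """Return (elapsed seconds, peak bytes) for a single pipeline run."""
    tracemalloc.start()
    t0 = time.perf_counter()
    pipeline_fn()
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return elapsed, peak

# Example: profile an arbitrary numeric workload standing in for a pipeline.
seconds, peak = profile(lambda: np.linalg.svd(np.random.default_rng(4).normal(size=(500, 500))))
print(f"{seconds:.3f} s, {peak / 1e6:.1f} MB peak")
```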
Figure 9. Level of DFE tool simplification for each pipeline. Blue represents the lines of code required when using the DFE. Orange shows the additional lines of code required using Vanilla Python. DC = Dog Collar, M = Moth, PI = Plant Impedance, SP = Sweet Potato, P1…3 = Pipeline 1…3.
Figure 10. For each regression dataset, each pipeline’s R, RMSE, RelRMSE, MAE, and RAE are shown. P1…P3 = Pipeline 1…3. See Figure 4 and Figure 5 for pipeline details.
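RMSE, MAE, and Pearson's R are standard; for RelRMSE and RAE, the sketch below assumes the common convention of normalizing by a mean-predictor baseline, which the captioned figure does not itself spell out. The values are toy numbers used only to fix the formulas.

```python
import numpy as np

y_true = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
y_pred = np.array([2.2, 3.1, 4.4, 5.0, 6.8])

err = y_pred - y_true
rmse = np.sqrt(np.mean(err ** 2))
mae = np.mean(np.abs(err))
r = np.corrcoef(y_true, y_pred)[0, 1]  # Pearson correlation
# Assumed conventions: normalize against predicting the mean of y_true.
rel_rmse = rmse / np.sqrt(np.mean((y_true - y_true.mean()) ** 2))
rae = np.abs(err).sum() / np.abs(y_true - y_true.mean()).sum()
print(f"R={r:.3f} RMSE={rmse:.3f} RelRMSE={rel_rmse:.3f} MAE={mae:.3f} RAE={rae:.3f}")
```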
Table 1. Description of the datasets used for evaluating the data fusion pipeline framework.
Dataset           Dog Collar               Moth Detector              Plant Impedance      Sweet Potato
Research Area     Animal Monitoring        Environmental Monitoring   Agrarian Monitoring  Food Quality Monitoring
Dataset Type      Singlet                  Singlet                    Array                Image
Class/Reg         Classification           Regression                 Regression           Classification
Temporal/Spatial  Temporal                 Spatial                    Temporal             Spatial
Sensors           Envir., Audio, Inertial  Envir., Moth Trap          Impedance Spectrum   Camera
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
