1. Introduction
Managing honey bee (Apis mellifera) colonies requires that human beekeepers conduct regular hive inspections and treatments, and travel to distant apiaries. Sensor-based monitoring alleviates some well-known bottlenecks in human monitoring such as limited time, impatience, and vehicle wear and tear. It also provides consistent information on various aspects of the colony’s phenology and helps with the retroactive identification of performance bifurcation points. As off-the-shelf electronic sensors become more reliable and smaller in size, they will be increasingly integrated into multisensor hive monitoring systems [1,2,3,4]. Thus, stakeholders of multisensor hive monitors must choose not only which sensors to use and when, but also which software works best with their selection [5].
Two sensors that have played a critical role in continuous hive monitoring are electronic scales and in-hive temperature sensors [6,7]. Scales have become popular among precision apiculture researchers and practitioners because they generate regular weight observations around the clock, regardless of ambient weather, and do not interfere with hive inspections. Foraging activity is reflected by weight changes due to forager traffic, and foraging success can be estimated from increases in hive food stores [8]. However, scales require initial calibration and subsequent recalibration and are hard to move from hive to hive and apiary to apiary. In-hive temperature sensors, in contrast, are smaller, less expensive, and have smaller power footprints, which makes them more mobile and deployable than scales. In-hive temperature observations are indicative of a colony’s thermoregulatory behaviors, e.g., clustering and fanning, and can be correlated with the diurnal rhythms of metabolic rate and activity [9]. However, due to their small size, in-hive temperature sensors are easy to lose or damage during hive inspections or hive woodenware maintenance.
Hive weight has been used in precision apiculture studies to characterize different events in a colony’s biological life cycle (e.g., honey production [10], swarming and colony abandonment [11], overwintering [12], pesticide exposure [13], and proximity to conservation reserve lands [14]). In-hive temperature has also been used to investigate colony phenology (e.g., daily metabolic cycles [15], colony size [16], and thermoregulation as a function of subspecies [17] or genetic diversity [18]). These studies provided valuable insights into the relationship of a colony’s phenology to hive weight and in-hive temperature. However, because the employed techniques of regression and detrending are fundamentally descriptive, these studies formulated and tested specific hypotheses or classified static events, but did not attempt to forecast into the future. This gap, which has been recently noted in the precision apiculture literature [1,5], can be formulated as follows: computational models that forecast the status of managed honey bee colonies using affordable off-the-shelf sensors are few and far between.
A key objective of our investigation was to address this gap by studying time series forecasters of hive weight and in-hive temperature observations that researchers and practitioners can obtain with off-the-shelf sensors. Specifically, we explored and exploited the forecasting of hive weight and in-hive temperature with ARIMA and SARIMA over longer forecast horizons (several hundred hours). We are not aware of such investigations in the precision apiculture literature. Our contributions to precision apiculture and sensor-based hive monitoring reported in this article are as follows:
- (1) An exploration of the stationarity, trend, and seasonality of hive weight and in-hive temperature time series with the Ljung–Box White Noise test [19], the Augmented Dickey–Fuller test [20], the Kwiatkowski–Phillips–Schmidt–Shin test [21], and the Shapiro–Wilk Normality test [22];
- (2) A systematic exploitation of a curated dataset with a grid search of 1458 (729 weight and 729 temperature) ARIMA/SARIMA forecasters to answer three fundamental questions: (a) Does seasonality matter in the ARIMA forecasting of hive weight and in-hive temperature? (b) To what extent do the best forecasters of one hive generalize to other hives? and (c) Which time series type (i.e., hive weight or in-hive temperature) is more predictable?
- (3) Three principal findings: (a) seasonality matters, in that SARIMA forecasters outperformed their ARIMA counterparts; (b) the best hive weight and in-hive temperature forecasters of the ten monitored colonies appeared to be colony-specific; and (c) hive weight was more predictable than in-hive temperature.
- (4) Our grid search ARIMA/SARIMA source code and the top hive weight and in-hive temperature forecasters found with it, which researchers, practitioners, and citizen scientists can use in the longitudinal monitoring of managed honey bee colonies across diverse environments [23];
- (5) A curated dataset of 4320 timestamped observations (2160 hive weight and 2160 in-hive temperature) from ten managed honey bee colonies at a United States Department of Agriculture (USDA) research apiary [23].
The remainder of our article is organized as follows: In Section 2, we review related research. In Section 3, we describe our metadata, data, and methods. In Section 4, we report the results of our investigation. In Section 5, we discuss our results, and the implications and limitations of our findings. In Section 6, we offer our conclusions and outline the scope of our future work.
2. Related Work
Hive weight and in-hive temperature observations have been used in many studies of colony phenology. Thoenes and Buchmann [11] showed that colony weight is related to foraging, swarming, and hive abandonment. Marceau et al. [10] demonstrated a regressive, polynomial relationship between hive weight and colony growth and productivity. Meikle et al. [24] investigated how colony weight may be related to pesticide exposure. Stalidzans et al. [12] reported a relationship between colony weight and overwintering. Szabo et al. [15] reported that when in-hive temperature sensors were placed inside or close to the cluster of bees, i.e., the mass of bees at the core of the colony, the temperature readings were more affected by the cluster than by exterior conditions. The experiments by Southwick and Moritz [25] showed that daily cycles of in-hive temperature and metabolic activity may be driven by ambient weather. Meikle et al. [16] reported some evidence of the thermoregulation of colonies in the absence of brood, and that in-hive temperature may be affected by colony size and the location of the in-hive temperature sensor. Worswick [17] argued that a colony’s thermoregulation was a function of subspecies. Jones et al. [18] found evidence that thermoregulation may be related to within-colony genetic diversity. Stalidzans and Berzonis [26] offered some evidence that thermoregulation may be a function of a colony’s biological life cycles.
A notable recent trend in precision apiculture is predictive sensor-based monitoring of managed colonies. In a 3-year-long investigation of multiple colonies at different geographical locations, Braga et al. [1] used in-hive temperature, hive weight, ambient temperature, dew point, wind direction, wind speed, rainfall, and daylight in combination with weekly apiary inspection results. The team designed K-nearest neighbors (KNN) models, random forests, and artificial neural networks to assess colony health from in-hive temperature, hive weight, and ambient weather. On a curated dataset, the random forest turned out to be the best predictor of hive health, with an accuracy of 98%. Zaman and Dorin [5] proposed a comprehensive qualitative framework to assess predictive hive monitoring systems from the viewpoints of different stakeholders, e.g., research institutions, regulatory agencies, and commercial operations. Kulyukin et al. [27] compared non-seasonal ARIMA with artificial, convolutional, and long short-term memory neural networks as time series forecasters of hive weight, in-hive temperature, and omnidirectional forager traffic at the hive’s entrance over the short forecast horizons of 6, 12, 18, and 24 h. Non-seasonal ARIMA performed on par with its machine learning counterparts in predicting falling, rising, and unchanging trends.
There is evidence of predictive sensor-based modeling on the Internet sites of several commercial platforms. Thus, Arnia (https://arnia.co/ accessed on 5 May 2025) claims to provide predictive modeling on the basis of eight sensors: audio, temperature, humidity, weight, light, accelerometer, bee counter, and video; ApisProtect (https://pitchbook.com/profiles/company/182117-80 accessed on 5 May 2025) states that its hive monitoring tools use four sensors: temperature, humidity, audio, and accelerometer; IOBee (https://io-bee.eu/ accessed on 5 May 2025) also uses temperature, humidity, weight, and bee counts to assess colony status; Pollenity (https://www.pollenity.com/ accessed on 5 May 2025) proposes a hive monitoring system using temperature, humidity, weight, and acoustic sensors. The actual algorithms and datasets of these commercial platforms appear to be proprietary, which is understandable, because their business models likely depend on them.
3. Materials and Methods
3.1. Metadata and Data
The data were acquired from ten managed colonies at a research apiary of the Carl Hayden Bee Research Center of the U.S. Department of Agriculture Agricultural Research Service (USDA-ARS) in Tucson, AZ, USA (GPS coordinates: 32°13′18.274″ N, 110°55′35.324″ W) from June to October 2022. Each hive included a bottom board with a landing pad, two deep Langstroth boxes with ten frames in each, an inner hive cover, a box with an on-hive camera, and a hive cover with a cardboard box reinforced with all-weather duct tape to protect the camera against the elements. Each hive was mounted on an electronic scale, and an in-hive temperature sensor was placed at the top bar of the middle frame in the second deep super of each hive (cf. Figure A1).
For each hive, 2160 hive weight and 2160 in-hive temperature observations were timestamped and logged. No observations were missed due to sensor failures. The weight and temperature observations were smoothed by computing hourly means, and all hourly means were labeled with natural numbers to obtain a universal time axis for the time series analysis. The curated dataset is provided in the supplementary materials [23]. The hourly means were used as the ground truth observations of the hive weight time series and of the in-hive temperature time series; the temperature observations were logged in degrees Celsius (Table 1).
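As an illustration of this preprocessing step, the following sketch computes hourly means from a hypothetical raw sensor log with pandas and relabels them with a universal integer time axis; the file and column names are assumptions for illustration and do not reflect the exact schema of our curated dataset.

```python
import pandas as pd

# A hypothetical raw log with "timestamp", "weight_kg", and "temp_c" columns;
# the file and column names are illustrative, not the schema of the curated
# dataset in the supplementary materials.
raw = pd.read_csv("hive_2059_raw.csv", parse_dates=["timestamp"])

# Smooth the raw observations by computing hourly means.
hourly = (
    raw.set_index("timestamp")
       .resample("1h")[["weight_kg", "temp_c"]]
       .mean()
)

# Label the hourly means with natural numbers to obtain a universal
# integer time axis (0, 1, 2, ...) for the time series analysis.
hourly = hourly.reset_index(drop=True)
hourly.index.name = "hour"
hourly.to_csv("MWH2059_MTH2059_hourly.csv")
```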
Five hives (USDA-ARS IDs 2059, 2120, 2141, 2142, 2158) had Russian queens, and five hives (USDA-ARS IDs 2123, 2129, 2130, 2137, 2146) had Wooten queens. Hive evaluations were conducted on 21 June, 11 August, and 23 September 2022. Each evaluation included a visual queen status check (presence/absence) and the removal of queen supersedure cells. On 20 June 2022, hive 2141 was requeened; hive 2140 was requeened on 23 June 2022. On 22 July 2022, a supersedure queen cell was removed from hive 2059. On 26 July 2022, hives 2141 and 2059 were requeened. On 15 August 2022, a laying worker was detected in hive 2141. On 19 September 2022, the queens were removed from hives 2158 and 2120 for unrelated, scheduled experiments. The final hive evaluations were conducted on 23 September 2022.
Adult bee mass in kg was determined by subtracting the weight of the frames and woodenware from the total hive weight as measured the previous midnight. Digital frame photographs were taken during hive evaluations and analyzed for brood surface area. The areas of sealed brood per frame were summed across all frames of a hive to provide a brood surface area estimate for the colony. Thirty-second videos of forager traffic at the hive’s entrance were recorded every five minutes from 7:00 to 20:55 daily. The analysis of adult bee mass, frame photographs, and forager traffic videos is outside the scope of this investigation.
3.2. Time Series Forecasting
Auto-Regressive Integrated Moving Average (ARIMA) models were originally developed by Wiener in the 1940s for electrical engineering applications [28] and were applied by Box, Jenkins, and Reinsel to business and economic data in the 1970s [29]. ARIMA models a time series through self (auto)-regression, i.e., by looking at the previous values of the same time series. The prediction $\hat{Y}_t$ of a random variable $Y_t$ is the sum of an optional constant, a weighted sum of a finite number of recent values (the AR component), and a weighted sum of a finite number of recent errors (the MA component). Thus, $\hat{Y}_t = C + \sum_{i=1}^{p} \phi_i Y_{t-i} + \sum_{j=1}^{q} \theta_j e_{t-j}$, where C is an optional real constant, p is the number (a non-negative integer) of terms in the weighted sum of the recent values of the series, and q is the number (a non-negative integer) of terms in the weighted sum of the recent errors. The parameters p and q designate the numbers of autoregressive and moving average terms, respectively.
A time series is stationary if its standard statistical properties (e.g., mean, standard deviation, variance, etc.) stay more or less constant over time. In practice, stationary time series tend to be more predictable, because they exhibit little trend or heteroskedasticity (i.e., heterogeneity of variance). Differencing is a technique used to stationarize a series. A differenced series is called integrated (hence, incidentally, the I in ARIMA). The parameter d specifies the number of non-seasonal differences. If we denote the original time series as $Y_t$ and the differenced (i.e., stationarized) series as $y_t$, then, if $d = 0$ (i.e., no differencing), $y_t = Y_t$, i.e., the stationarized value is the value of the current measurement. If $d = 1$, $y_t = Y_t - Y_{t-1}$. If $d = 2$, $y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}$. Thus, the second difference is a measure of acceleration. The integers 1, 2, … in $Y_{t-1}$, $Y_{t-2}$, … are called lags and denote how many steps back are taken along the time axis. A complete ARIMA model is denoted as ARIMA(p, d, q), where p is the number of AR lags (i.e., the lagged values $Y_{t-1}, \ldots, Y_{t-p}$), q is the number of MA lags (i.e., the lagged errors $e_{t-1}, \ldots, e_{t-q}$), and d is the differencing factor. The construction of an ARIMA model begins by stationarizing the series, if necessary, through differencing and possibly averaging, logging, deflating, or other smoothing data transformations. Autocorrelations and partial autocorrelations are examined to determine whether lags of the stationarized series or lags of the forecast errors should be incorporated into the forecast. The model suggested by this analysis is fitted, and its residual diagnostics, i.e., the residual autocorrelation function (ACF) and partial autocorrelation function (PACF) plots, are examined to ensure that all coefficients are significant.
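As a minimal illustration of the above, the following sketch differences one hive’s hourly weight series and fits a non-seasonal ARIMA(p, d, q) model with statsmodels; the library, file name, and order (2, 1, 1) are illustrative assumptions (our own implementation used PyCaret), not the models selected later in this article.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# One hive's hourly weight means; the file and column names are illustrative.
y = pd.read_csv("MWH2059_hourly.csv", index_col="hour")["weight_kg"]

# First difference as an illustration of d = 1: y_t = Y_t - Y_{t-1}.
y_diff = y.diff().dropna()
print("variance before/after differencing:", y.var(), y_diff.var())

# Fit ARIMA(2, 1, 1): two AR lags, one non-seasonal difference, one MA lag.
fitted = ARIMA(y, order=(2, 1, 1)).fit()

# Forecast the next 24 hourly observations.
print(fitted.forecast(steps=24))
```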
Seasonal ARIMA (SARIMA) extends ARIMA to include seasonality in time series forecasting [30]. The seasonal component is characterized by three additional parameters: P (the number of seasonal autoregressive terms); D (the number of seasonal differences); and Q (the number of seasonal moving-average terms). A complete SARIMA model is given as SARIMA$(p, d, q)(P, D, Q)_S$, where p, d, q are the non-seasonal ARIMA parameters; P, D, Q are the seasonal equivalents of p, d, and q, respectively; and S is the seasonal period (e.g., $S = 24$ for hourly data with a daily cycle). Time series stationarity is achieved by combining the non-seasonal and seasonal differences. For example, if $d = 0$ and $D = 1$, then $y_t = Y_t - Y_{t-S}$, whereas if $d = 1$ and $D = 1$, then $y_t = (Y_t - Y_{t-1}) - (Y_{t-S} - Y_{t-S-1}) = Y_t - Y_{t-1} - Y_{t-S} + Y_{t-S-1}$. The introduction of the seasonal terms SAR(P) and SMA(Q) obviously adds complexity, insomuch as when fitting a model to a series with a pronounced seasonal pattern, various combinations of non-seasonal and seasonal differences must be tried and the ACF and PACF plots examined at multiples of S to determine the number of SAR and SMA terms in the forecasting equation.
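A minimal sketch of a seasonal fit with S = 24 follows, using the SARIMAX class from statsmodels; the order (1, 1, 1)(1, 1, 1)_24 and the file name are illustrative assumptions rather than a forecaster selected by our grid search.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# One hive's hourly temperature means; file and column names are illustrative.
y = pd.read_csv("MTH2059_hourly.csv", index_col="hour")["temp_c"]

# Seasonal period S = 24 captures the daily cycle in hourly data.
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
fitted = model.fit(disp=False)

# Forecast 432 hours ahead, matching the horizon used in this study.
forecast = fitted.forecast(steps=432)
print(forecast.tail())
```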
3.3. Exploration
Four time series tests were applied to explore the hive weight and in-hive temperature series and to inform the subsequent grid search of optimal ARIMA and SARIMA models: (1) Ljung–Box White Noise [19]; (2) Augmented Dickey–Fuller [20]; (3) Kwiatkowski–Phillips–Schmidt–Shin [21]; and (4) Shapiro–Wilk Normality [22]. All tests were executed with PyCaret (https://pycaret.org accessed on 17 June 2025), a low-code machine learning Python library.
The hypotheses of Ljung–Box White Noise are as follows: the null hypothesis (H0) is that the time series data are white noise (i.e., the data are random and there is no autocorrelation); the alternative hypothesis (H1) is that the time series data are not white noise. The hypotheses of Augmented Dickey–Fuller (ADF) are as follows: H0 is that the time series data are not difference-stationary; H1 is that the time series data are difference-stationary. The hypotheses of Kwiatkowski–Phillips–Schmidt–Shin (KPSS) are as follows: H0 is that the time series data are trend-stationary; H1 is that the time series data are not trend-stationary. The hypotheses of Shapiro–Wilk Normality are as follows: H0 is that the time series data are normally distributed; H1 is that the time series data are not normally distributed.
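For readers who wish to reproduce the exploration outside PyCaret, the following sketch runs the four tests with statsmodels and SciPy on one hive’s hourly weight means; the file and column names are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import adfuller, kpss

# One hive's hourly weight means; file and column names are illustrative.
y = pd.read_csv("MWH2059_hourly.csv", index_col="hour")["weight_kg"]

# Ljung-Box: H0 = the series is white noise (no autocorrelation).
print(acorr_ljungbox(y, lags=[24], return_df=True))

# Augmented Dickey-Fuller: H0 = the series is not difference-stationary.
print("ADF p-value:", adfuller(y)[1])

# KPSS: H0 = the series is trend-stationary.
print("KPSS p-value:", kpss(y, regression="ct", nlags="auto")[1])

# Shapiro-Wilk: H0 = the observations are normally distributed.
print("Shapiro-Wilk p-value:", shapiro(y).pvalue)
```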
3.4. Forecaster Grid Search and Generalizability
The parameter ranges in the grid search were informed by the findings in [31]. We implemented the grid search in Python 3.10.12 using PyCaret 3.3.2 and executed it on a Hewlett Packard Z220 workstation with Ubuntu 22.10.
Figure 1 gives the pseudocode of the grid search procedure. The grid search examined all SARIMA$(p, d, q)(P, D, Q)_S$ forecasters with $p, d, q, P, D, Q \in \{0, 1, 2\}$ and $S = 24$, i.e., the seasonality was fixed at 24 h. The CSV processing was performed with pandas 2.1.4 (https://pandas.pydata.org accessed on 17 June 2025). The plots were generated with matplotlib 3.7.5 (https://matplotlib.org/ accessed on 17 June 2025). RMSE and MAPE were computed with scikit-learn 1.4.2 (https://scikit-learn.org accessed on 17 June 2025) and numpy 1.26.4 (https://numpy.org accessed on 17 June 2025).
A total of 729 weight and 729 temperature forecasters were fitted and evaluated with RMSE and MAPE in the grid search. The number 729 comes from the parameter ranges $p, d, q, P, D, Q \in \{0, 1, 2\}$ used in the grid search. Each of the six parameters can take on the values of 0, 1, and 2; hence, there were $3^6 = 729$ possible forecasters. The supplementary materials [23] include plots of the second- and third-best native hive weight and in-hive temperature forecasters for each hive.
There were ten hive weight and ten in-hive temperature time series, one for each hive. In our implementation, we designated these as MWHID or MTHID, where MWH and MTH abbreviate mean weight of hive and mean temperature of hive, respectively, and ID designates the USDA-ARS hive ID in Tucson, AZ. For example, MWH2059 and MTH2059 are the hive weight and in-hive temperature time series for hive 2059. Each time series had 2160 observations and was split 80/20 into training (the first 1728 consecutive observations) and testing (the remaining 432 consecutive observations) datasets. All forecasters were fitted on the training dataset and tested on the testing dataset. The performance of each fitted forecaster on the testing dataset was evaluated with the root mean squared error (RMSE) and the mean absolute percentage error (MAPE).
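The following sketch outlines, under the stated assumptions, one way the grid search can be implemented: all parameter combinations in {0, 1, 2} with S = 24 are fitted on the 1728-hour training window and scored with RMSE and MAPE on the 432-hour test window. It uses statsmodels rather than our PyCaret-based implementation, and the file name is illustrative.

```python
import itertools

import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error
from statsmodels.tsa.statespace.sarimax import SARIMAX

# One hive's 2160 hourly weight means; file and column names are illustrative.
y = pd.read_csv("MWH2059_hourly.csv", index_col="hour")["weight_kg"]
train, test = y[:1728], y[1728:]  # 80/20 split: 1728 training, 432 testing hours

results = []
for p, d, q, P, D, Q in itertools.product(range(3), repeat=6):
    try:
        fitted = SARIMAX(train, order=(p, d, q),
                         seasonal_order=(P, D, Q, 24)).fit(disp=False)
        pred = fitted.forecast(steps=len(test))
        rmse = np.sqrt(mean_squared_error(test, pred))
        mape = mean_absolute_percentage_error(test, pred)
        results.append(((p, d, q, P, D, Q), rmse, mape))
    except Exception:
        continue  # skip parameter combinations that fail to fit or converge

# Rank the 3^6 = 729 candidates by RMSE (use r[2] instead to rank by MAPE).
results.sort(key=lambda r: r[1])
print("Best forecaster by RMSE:", results[0])
```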
In analyzing the performance of the top forecasters, we distinguished native and non-native forecasts. A forecast for hive H was native to a forecaster F if, and only if, F was fitted on the training data from H and evaluated with RMSE or MAPE on the test data from H. We also refer to H as the native hive of F. Thus, the native forecasts indicated how well a forecaster predicted a characteristic (weight or temperature) of the hive on whose training data it was fitted. We ranked all grid-searched forecasters by RMSE and MAPE on the test data of their native hives, from smallest to largest, to identify the top three forecasters for each hive. Our objective here was to compare RMSE and MAPE as forecaster selection metrics. We hereafter refer to a top forecaster found with RMSE as an RMSE forecaster and to a top forecaster found with MAPE as a MAPE forecaster. We applied the top native forecaster of each hive to the test data from the nine other hives in order to estimate the generalizability of the ARIMA/SARIMA forecasters. We call plots of non-native forecasts generalizability plots, because they reflect, to a degree, how well forecasters generalized to non-native hives. As we analyzed the forecast plots, we attempted to discover weight and temperature forecast patterns in order to characterize the forecasts qualitatively. After the patterns were identified, we counted their distributions in the native and non-native forecasts.
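One way such a non-native evaluation could be implemented is sketched below: the parameters of a forecaster fitted on its native hive are applied, without refitting, to another hive’s series before forecasting that hive’s test window. The hive IDs, model order, and use of statsmodels are illustrative assumptions rather than our exact procedure.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hourly weight means of a native hive (2059) and a non-native hive (2120);
# file names, column names, and the model order are illustrative.
y_native = pd.read_csv("MWH2059_hourly.csv", index_col="hour")["weight_kg"]
y_other = pd.read_csv("MWH2120_hourly.csv", index_col="hour")["weight_kg"]

# Fit the forecaster on the native hive's training window.
native_fit = SARIMAX(y_native[:1728], order=(1, 1, 1),
                     seasonal_order=(1, 1, 1, 24)).fit(disp=False)

# Apply the fitted parameters, without refitting, to the non-native hive's
# training window, then forecast that hive's 432-hour test window.
transferred = native_fit.apply(y_other[:1728])
pred = transferred.forecast(steps=432)
print("Non-native RMSE:", np.sqrt(mean_squared_error(y_other[1728:], pred)))
```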
5. Discussion
Our study had two main goals: (1) Explore the stationarity, trend, seasonality, and distribution of the hive weight and in-hive temperature of ten managed honey bee colonies at a research apiary, where the hives were monitored with electronic scales and in-hive temperature sensors from June to October 2022; (2) If the exploration showed that the series were not white noise and not normally distributed, perform a systematic autoregressive integrated moving average (ARIMA) time series analysis to answer three fundamental questions: (a) Does seasonality matter in the ARIMA forecasting of hive weight and in-hive temperature? (b) To what extent do the best forecasters of one hive generalize to other hives? and (c) Which time series type (i.e., hive weight or in-hive temperature) is more predictable?
The exploratory analysis indicated that the hive weight and in-hive temperature series were not white noise and not normally distributed. Both types of time series for most hives were not difference- or trend-stationary. Thus, the exploitation of the datasets with ARIMA and SARIMA time series forecasting was justified.
The grid search of 729 weight and 729 temperature forecasters (cf. Table 4 and Table 5) showed that seasonality mattered over a forecast horizon of 432 h: non-seasonal ARIMA, used as the comparison baseline, was outperformed by SARIMA over this horizon. In particular, Table 4 shows that only 2 of the top 60 weight forecasters (i.e., 3.3%) found in the grid search had all three seasonality parameters P, D, Q equal to 0. Of the top 60 temperature forecasters, only one (i.e., 1.7%) had all three seasonality parameters equal to 0. This result complements the findings in [27], which was Part I of our investigation, where non-seasonal ARIMA performed on par with artificial neural networks, convolutional neural networks, and long short-term memory models in predicting trends over shorter horizons of 6 to 24 h. In Part I, non-seasonal ARIMA was used as the comparison baseline and was not outperformed by the machine learning methods over those shorter horizons.
The distribution of patterns in the native and non-native weight and temperature forecasts (cf. Table 6 and Table 7) indicates that, on the curated dataset, the forecasters were better at predicting the weight and temperature of their native hives. Among the native weight forecasts, four of the best forecasters predicted the ground truth values and trends completely and one forecaster predicted them partially; five forecasters successfully predicted trends. There were no trend or forecast failures. The situation was reversed for the non-native weight forecasts, where no forecasts predicted the observations and trends completely or partially.
We observed a similar picture in the native temperature forecasts. Among the native forecasts, five predicted the ground truth observations and trends either completely or partially, while four successfully predicted the trends. There was also one instance of a moving average trend, and no forecast failures were observed. Among the non-native forecasts, the majority (69 forecasts, or 77.8%) accurately predicted trends; if the moving average trend patterns are also counted as predicting trends, the number of non-native forecasts predicting trends rises to 79 (87.8%).
Table 4 and Table 5 show that the RMSE and MAPE scores of the top three weight forecasters were lower than those of the top three temperature forecasters. This result indicates that weight may be more accurately predictable than temperature, which, in turn, may be explained by the fact that weight fluctuated less than temperature across all monitored hives.
The forecasters selected by RMSE and MAPE (cf. Figure A3 and Figure A4) showed very similar predictions over the entire forecast horizon. Thus, either metric can be used to select the best forecaster on the curated dataset. The results of the top three native hive weight forecasters in Table 4 indicate that practical applications of SARIMA to forecast native hive weight may use a threshold between 0.02 and 0.08 for RMSE and between 0.0015 and 0.015 for MAPE over a horizon of 432 h. The results of the top three native in-hive temperature forecasters in Table 5 indicate that practical applications of SARIMA to forecast native in-hive temperature may use a threshold between 0.12 and 0.89 for RMSE and between 0.002 and 0.02 for MAPE over the same horizon. Of course, these recommendations should be taken with caution, because ambient weather, exposure to pesticides, and queen lines may influence a colony’s weight and thermoregulation patterns. Further longitudinal research is required to find an optimal placement of in-hive temperature sensors in a Langstroth hive. As the reviewed hive monitoring literature suggests, in-hive temperature can be affected by ambient temperature, depending on where the temperature sensor is placed. Bees do not regulate temperature much in the outer frames, so there may be an interaction between ambient temperature and bee-controlled temperature. However, if temperature is monitored at the center of the brood cluster, the literature indicates very little impact of ambient temperature (see, e.g., [15]).
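As an illustration of how these threshold recommendations might be applied in practice, the following sketch accepts a fitted native weight forecaster only if its 432 h test errors fall below thresholds chosen from the suggested ranges; the chosen values and function name are illustrative.

```python
# Suggested acceptance thresholds for native weight forecasters, chosen from
# the ranges reported above; the values and function name are illustrative.
WEIGHT_RMSE_THRESHOLD = 0.08   # from the suggested 0.02-0.08 RMSE range
WEIGHT_MAPE_THRESHOLD = 0.015  # from the suggested 0.0015-0.015 MAPE range

def weight_forecaster_acceptable(rmse: float, mape: float) -> bool:
    """True if a native weight forecaster meets both thresholds over 432 h."""
    return rmse <= WEIGHT_RMSE_THRESHOLD and mape <= WEIGHT_MAPE_THRESHOLD

# Example: a forecaster with RMSE = 0.05 and MAPE = 0.004 would be accepted.
print(weight_forecaster_acceptable(rmse=0.05, mape=0.004))
```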
Our approach has several limitations. The weight monitoring was confined to hives with exactly two Langstroth supers. Consequently, hives with more than two supers may exhibit different weight patterns. The temperature monitoring was carried out with temperature sensors placed at the top bar of the middle frame in the second deep super of each hive. The in-hive temperature literature indicates that in-hive temperature may be affected by the location of the temperature sensor in the hive (see, e.g., [15,16]). Thus, a study where in-hive temperature sensors are placed in a different location may exhibit different (e.g., more or less predictable) time series patterns. Some investigations show that colony performance and health depend on ambient weather (see, e.g., [1,25]) and on geographical location, e.g., the foraging-distance proximity of apiaries to U.S. Conservation Reserve Program lands [14]. Thus, the predictability of hive weight and in-hive temperature may be affected by the geographical location of monitored apiaries.
While ambient weather, geographical location, pesticide exposure, and other factors influence a colony’s hive weight and in-hive temperature, they may not necessarily influence the predictability of the weight or temperature observations obtained from deployed sensors. A weight forecaster built on weight observations from a hive in an arid, dry climate where moisture does not impact the weight of the woodenware will necessarily be different from a weight forecaster built on the observations from a hive in a humid, rainy climate. Analogously, a temperature forecaster fitted on the observations from a hive in a warm climate will be different from a temperature forecaster trained on the observations from a hive in a colder climate. That said, all forecasters should display some degree of predictability, so long as the training data are not white noise. Therefore, precision pollination researchers and citizen scientists can use the presented methods to analyze hive weight and in-hive temperature observations. In particular, if exploration of the observations from a particular hive indicates that they are not random, our grid search method can be used to identify hive weight and in-hive temperature forecasters for monitored hives.
We expect our methods to work with any off-the-shelf electronic scale or in-hive temperature sensor capable of logging digital measurements at user-specified intervals (e.g., every 5 min). The data can then be easily converted into a CSV format similar to the format of our datasets in the supplementary materials. Once this is achieved, our grid search code can be used to find the best forecasters. Designers of electronic scales and in-hive temperature sensors can also benefit from our approach, provided their data loggers connect to a cloud computing and storage service, since real-time exploration and grid-search exploitation of the captured time series data enable the discovery of optimal forecasters.