Article

Temporal Encoding Strategies for YOLO-Based Detection of Honeybee Trophallaxis Behavior in Precision Livestock Systems

by Gabriela Vdoviak * and Tomyslav Sledevič
Department of Electronic Systems, Vilnius Gediminas Technical University, Saulėtekio Ave. 11, LT-10223 Vilnius, Lithuania
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(22), 2338; https://doi.org/10.3390/agriculture15222338
Submission received: 2 October 2025 / Revised: 8 November 2025 / Accepted: 9 November 2025 / Published: 11 November 2025

Abstract

Trophallaxis, a fundamental social behavior observed among honeybees, involves the redistribution of food and chemical signals. The automation of its detection under field-realistic conditions poses a significant challenge due to the presence of crowding, occlusions, and brief, fine-scale motions. In this study, we propose a markerless, deep learning-based approach that injects short- and mid-range temporal features into single-frame You Only Look Once (YOLO) detectors via temporal-to-RGB encodings. A new dataset for trophallaxis detection, captured under diverse illumination and density conditions, has been released. On an NVIDIA RTX 4080 graphics processing unit (GPU), temporal-to-RGB inputs consistently outperformed RGB-only baselines across YOLO families. The YOLOv8m model improved from 84.7% mean average precision (mAP50) with RGB inputs to 91.9% with stacked-grayscale encoding and to 95.5% with temporally encoded motion and averaging over a 1 s window (TEMA-1s). Similar improvements were observed for larger models, with best mAP50 values approaching 94–95%. On an NVIDIA Jetson AGX Orin embedded platform, TensorRT-optimized YOLO models sustained real-time throughput, reaching 30 frames per second (fps) for small and 23–25 fps for medium models with temporal-to-RGB inputs. The TEMA-1s-encoded YOLOv8m model achieved the highest mAP50 of 95.5% with real-time inference on both workstation and edge hardware. These findings indicate that temporal-to-RGB encodings provide an accurate and computationally efficient solution for markerless trophallaxis detection in field-realistic conditions. This approach can be further extended to multi-behavior recognition or integration of additional sensing modalities in precision beekeeping.

1. Introduction

The management of honeybee (Apis mellifera L.) colonies is a global challenge given their essential role in ecological stability and agricultural productivity [1]. The pollination services provided by honeybees are crucial to the global food system, plant biodiversity and associated ecosystem functions. However, habitat loss, diseases, pesticide exposure, climate change, and invasive predators contribute to persistent colony instability with implications for food security [2,3].
Continuous monitoring of honeybee colonies is increasingly vital for understanding colony health, behavioral dynamics, and their implications for ecosystem stability. Recent advances in robotic and AI-assisted observation have demonstrated that non-invasive, long-term monitoring enables quantification of queen behavior, brood development, and colony interactions without disrupting natural processes [4,5]. Such minimally invasive systems enhance behavioral research and support sustainable pollinator management by providing continuous, high-resolution data on colony functioning [6].
In response to these challenges, precision beekeeping has emerged, leveraging IoT sensors, machine learning, and advanced computer vision techniques to provide continuous, non-invasive monitoring of the colony state [7,8]. Early precision beekeeping systems effectively monitor macro-level metrics, such as colony traffic and aggregate activity at the hive entrance [8,9,10]. These systems often use multi-object tracking techniques that link per-frame detections across time using Kalman filtering and the Hungarian algorithm, yielding robust quantification of individual and collective movement [11]. While such high-fidelity tracking is valuable for quantifying movement, the focus must shift from tracking individual motion to recognizing specific, complex social behaviors and interactions. In dense, fast-moving groups, this remains challenging due to frequent occlusions, appearance similarity, and the subtle, transient kinematics that characterize social exchange [12,13].
Trophallaxis is defined as the process of mouth-to-mouth transfer of liquid food in honeybee social behavior [14]. It redistributes nutritional resources and water, disseminates pheromones and transmits diseases throughout the colony [15]. Changes in trophallaxis can reflect resource uncertainty or internal stress [16]. Studies indicate that trophallaxis exhibits a fine-grained temporal signature, characterized by brief solicitation contacts that can terminate within one second and liquid-transfer bouts that typically persist for several seconds [16]. Reliable detection therefore requires spatial localization of interacting bees and temporal modeling capable of resolving sub-second onsets and distinguishing sustained transfer from momentary contacts.
Recent advances in the study of trophallaxis detection have relied on a range of methodological approaches, from manual observation to automated tracking and deep learning [17,18,19,20]. Early systems mainly tracked individually barcoded bees, applied geometric filters to flag candidate interactions, and then verified contacts with computer-vision rules [18]. These methods enabled the large-scale collection of interaction data, making it possible to reconstruct detailed social networks of trophallactic exchanges, but remained restricted to controlled observation hives and dependent on physical tagging [21], highlighting the continuing challenge of efficient dataset annotation in behavioral detection tasks. Subsequent work integrated convolutional neural networks with barcode-based tracking to detect trophallaxis and infer donor and recipient roles, improving detection sensitivity [20]. Barcode-free approaches have demonstrated that machine learning and computer vision can classify trophallaxis alongside other social behaviors [8,10,19], using contexts such as face-to-face orientation, proximity, and contact duration. In addition to vision-based tools, some studies have used computational techniques, such as topological data analysis, to describe collective dynamics during trophallaxis [22]. Despite the improvements, many systems still depend on artificial hive setups, marked individuals, or controlled imaging, limiting scalability to field-realistic monitoring [17,18,20]. This highlights the need for more generalizable, markerless, and efficient approaches to detect trophallaxis under natural colony conditions, since synthetic datasets have shown potential to substitute or augment real-world training data when large-scale annotation is difficult [23].
To address these limitations, our study investigates a markerless, deep learning-based approach for trophallaxis detection directly at hive entrances under field-realistic conditions. In contrast to previous research, which often required artificial observation hives, barcodes, or controlled environments, our approach focuses on unmarked bees in natural settings, where challenges such as occlusion, variable lighting, and crowding are most pronounced. By introducing a new large-scale annotated dataset and evaluating modern object detection architectures together with temporal encoding strategies, our research aims to provide a more robust, scalable, and practical solution for automated recognition of trophallaxis in honeybee colonies. With this background, we conducted this study under the following objectives:
  • to collect, annotate, and publicly release a large-scale dataset of honeybee trophallaxis at the hive entrance, captured under diverse illumination, hive density, and occlusion conditions;
  • to evaluate You Only Look Once (YOLO) object detectors combined with different temporal-to-RGB encoding strategies for trophallaxis detection;
  • to assess the feasibility of real-time, markerless deployment on both workstation-class graphics processing units and embedded edge hardware such as the NVIDIA Jetson AGX Orin.

2. Related Works

Trophallaxis is the direct mouth-to-mouth exchange of liquid food between honeybees and is crucial for nutrient distribution, communication, and disease dynamics. Studies have investigated this behavior from diverse perspectives, using techniques ranging from manual annotation to advanced computer vision and deep learning. This section reviews key contributions relevant to our work, focusing first on automated detection of trophallaxis and its biological contexts, and second on temporal-to-RGB encoding strategies that embed motion information into image inputs for two-dimensional convolutional detectors.

2.1. Detection Methods and Behavioral Contexts for Trophallaxis

2.1.1. Automated Detection of Trophallaxis

Early computational efforts focused on barcode-based tracking systems. Gernat et al. [18] developed a high-throughput approach using barcoded bees and geometric filtering to identify potential trophallaxis events, later verified through computer vision algorithms that detected proboscis contact. This study revealed bursty temporal patterns in trophallaxis with implications for social spreading dynamics.
A more advanced system by Gernat et al. [20] integrated convolutional neural networks (CNNs) with barcode-guided region proposals to not only detect trophallaxis events but also infer donor–recipient roles. This two-stage model significantly improved sensitivity (+67%) and reduced error rates (−11%), enabling detailed social network analyses. However, both systems rely on barcoded individuals within controlled observation hives, limiting their scalability to field conditions.
Blut et al. [17] introduced the Bee Behavioral Annotation System (BBAS), which combined 2D barcode tracking with per-frame feature extraction and a machine learning classifier. Their system accurately detected various encounter behaviors, including trophallaxis, based on orientation, proximity, and interaction duration, achieving over 90% classification accuracy.
In contrast, Bernardes et al. [19] proposed Ethoflow, a field-adapted AI software platform that uses Mask R-CNN for object detection and CNNs for complex behavior recognition. Ethoflow does not require barcoded bees and was validated for detecting trophallaxis in heterogeneous environments, offering a flexible tool for real-time behavioral monitoring.

2.1.2. Biological and Ecological Contexts of Trophallaxis

Beyond its nutritional role, trophallaxis is tightly linked to colony health and exposure to stressors. As a conduit for pathogens and contaminants, it mediates the spread and regulation of disease within the colony. Geffre et al. [24] showed that Israeli Acute Paralysis Virus (IAPV) infection alters trophallactic frequency, suggesting an adaptive modulation of contact rates, while pesticide studies reported both reduced interaction rates after sublethal neonicotinoid–pyrethroid exposure in Melipona quadrifasciata [25] and increased trophallaxis and body temperature in Apis mellifera following S-dinotefuran exposure [26]. These results highlight trophallaxis as a sensitive behavioral endpoint for assessing colony-level stress and contaminant dynamics.
Trophallaxis also functions as an information channel within social communication. Farina and Wainselboim [27,28] demonstrated that nectar exchange conveys cues about food quality and foraging conditions, often complementing waggle dance communication, while Gil and De Marco [29] and Mc Cabe et al. [30] showed that scented food and associated vibratory signals can induce and enhance olfactory learning. Ramsey et al. [31] further argued that the “whooping signal” associated with trophallaxis serves multiple social functions, underscoring the behavioral richness of these interactions.
At the interface of species, trophallaxis can mediate both cooperative and parasitic relationships. Romero et al. [32] documented honeybee workers performing trophallaxis within Bombus terrestris colonies, indicating considerable behavioral plasticity, whereas Langlands et al. [33] and Neumann et al. [34] showed that the small hive beetle Aethina tumida exploits trophallaxis by mimicking solicitation signals to obtain carbohydrate- and protein-rich food via “hit-and-run” strategies. Collectively, these studies demonstrate that changes in the frequency and context of trophallactic contacts are informative about health, communication, and host–parasite dynamics, motivating scalable, markerless methods for quantifying trophallaxis in natural settings.

2.1.3. Collective Behavior and Group Dynamics

At a macro level, Gharooni-Fard et al. [22] applied topological data analysis (TDA) to analyze spatio-temporal patterns in honeybee aggregations during trophallaxis. Using persistent homology and CROCKER matrices, they detected distinct behavioral phases (e.g., aggregation and dispersal) and demonstrated the usefulness of TDA in understanding trophallaxis-driven group morphologies and transitions over time.
Weidenmüller and Tautz [35] provided behavioral evidence that pollen foragers adjust their in-hive tempo and trophallactic behavior based on colony pollen need. During high demand, shorter and more frequent trophallactic contacts occurred, suggesting an informational feedback loop to optimize foraging effort.
These studies establish trophallaxis as a multi-functional behavior with roles in nutrition, communication, learning, and disease dynamics. While previous work has advanced automated detection using barcoded individuals or static classifiers, our approach introduces YOLO-based object detection for real-time monitoring of trophallaxis in unmarked bees at natural hive entrances. This contributes to the development of scalable, non-invasive behavioral analysis tools for ecological monitoring and health assessment in honeybee colonies.

2.2. Temporal-to-RGB Encoding Strategies

Temporal encoding strategies that represent motion information through RGB or RGB-structured image inputs have emerged as an efficient solution for incorporating temporal context into convolutional neural networks. These approaches enable the use of temporally enriched visual data within detection frameworks that are primarily designed for spatial inference, such as YOLO. These methods have shown promising results in domains such as human action recognition [36,37] and small object detection [38]. However, their applicability to insect-related activity recognition remains unexplored, particularly in the context of detecting specific behaviors in honeybees.
A commonly used strategy is to stack consecutive video frames into the red, green, and blue channels of a single image, enabling static convolutional models to capture short-term temporal dynamics with minimal changes to network architecture or computational cost. Kim et al. [37] proposed two such lightweight encodings for two-dimensional CNNs: Time-Color (TC) Reordering, which constructs a pseudo-RGB image by sampling channels from different time steps, and Grayscale Short-Term Stacking (GrayST), which stacks three consecutive grayscale frames into RGB channels. Zhang et al. [39] introduced the Gray Temporal Model (GTM), which similarly stacks three grayscale frames as input channels but augments them with a 1D Identity Channel-wise Spatio-Temporal Convolution (1D-ICSC) module for lightweight temporal reasoning. Both works reported substantial gains over conventional RGB and optical-flow inputs on standard action-recognition benchmarks without increasing computational complexity. Building on stacking-based encodings in an object-detection context, van Leeuwen et al. [38] proposed the Temporal-YOLOv8 architecture for small object detection in aerial surveillance. Temporal information is introduced by stacking either RGB or grayscale frames along the channel dimension of the input tensor, allowing YOLOv8 to exploit temporal context without architectural modifications. Their Color-T-YOLO variant stacks RGB triplets from three distinct time points, while Manyframe-YOLO stacks eleven consecutive grayscale frames, and both variants achieved substantial improvements in mean average precision (mAP) compared with single-frame baselines.
Other temporal-to-RGB encodings compress longer sequences into one or a few images using rank pooling or motion fusion. Dynamic Image Networks [40] construct a single “dynamic image” from a video sequence by applying rank pooling to raw pixel intensities, providing a compact representation of spatial structure and temporal evolution and achieving strong performance on UCF101 and HMDB51. Wang et al. [36] extended this idea with dynamic flow images, which apply rank pooling to short windows of optical flow fields to summarize motion in a two-channel image; combining dynamic flow with RGB further improved accuracy. Mukherjee et al. [41] applied dynamic images separately to RGB and depth streams, adding gestalt-based preprocessing to suppress background motion, while Kopuklu et al. [42] proposed Motion Fused Frames (MFFs), which append multiple optical-flow frames to RGB as additional channels for hand-gesture recognition. These approaches offer temporally expressive representations and high accuracy but rely on computationally intensive preprocessing (dense optical flow, rank pooling), which can limit real-time or embedded deployment.
Temporal-to-RGB encoding strategies provide an effective approach to embedding motion information into object detection pipelines. These methods enable the reuse of well-established two-dimensional CNN architectures while capturing short- or long-term temporal patterns. However, none of the reviewed methods have been applied to the recognition of trophallaxis behavior in honeybees. This presents an opportunity to explore the integration of such encoding strategies into precision beekeeping systems for honeybee behavior recognition tasks.

3. Materials and Methods

3.1. Dataset

A dedicated dataset was established to support automated detection of trophallaxis behavior in honeybees at hive entrances. Recordings were conducted during the 2023–2025 beekeeping seasons at a local apiary in the Vilnius district. Stationary cameras were positioned approximately 30 cm above the hive landing boards, capturing footage at a resolution of 1920 × 1080 pixels and 30 fps. This resolution ensured sufficient visual granularity for identifying proboscis contact and subtle body orientation patterns, while maintaining a manageable processing load for both training and deployment scenarios. Recordings were made under diverse environmental conditions, including sunny, cloudy, and partially shaded scenes, to capture variability in illumination, contrast, and crowd density.
Although trophallaxis predominantly occurs inside the hive on comb surfaces, it can also be reliably observed at the hive entrance, particularly during periods of high forager traffic and food exchange. Previous studies have noted that returning foragers may engage in short mouth-to-mouth transfers with receiver bees at or near the entrance, especially under conditions of high nectar flow or resource uncertainty. In this work, the hive entrance is used purely as a convenient, non-intrusive observation site for automated behavioral detection. The focus is methodological rather than biological: our approach does not aim to redefine the established understanding of trophallaxis as primarily an in-hive behavior, but to demonstrate that such interactions can also be detected and analyzed under field-realistic, external conditions without disturbing the colony interior. All images in Figure 1, Figure 2 and Figure 3 were derived from the authors’ publicly available dataset.
From the raw recordings, individual frames were extracted for manual annotation. Using the LabelImg (https://github.com/tzutalin/labelImg, accessed on 15 September 2025) tool, trophallaxis instances were labeled with bounding boxes around interacting bees. The final dataset consists of 20,480 annotated frames collected from eight different beehives (Figure 1). Of these, 83% contain visible trophallaxis behavior, while 17% depict non-interacting bees. Across the dataset, 22,189 individual trophallaxis instances were identified, covering a wide range of body postures, orientations, occlusions, and background contexts. These annotations reflect the complexity of real-world hive entrance monitoring, where crowding, overlapping individuals, and variable viewing angles pose significant challenges for automated detection.
Figure 2 illustrates representative examples of annotated trophallaxis behavior across different hive entrance contexts. Frames (a–b) depict typical trophallaxis interactions involving two and three bees, respectively. Frames (c) show trophallaxis occurring within a crowded background, while frames (d) highlight self-occlusion due to the body orientation of the interacting bees. These examples underscore the variability of trophallaxis expression in natural colony conditions.
Figure 3 presents additional cases that emphasize challenging visual conditions. Frames (a) depict bees engaged in trophallaxis under partial occlusion, where only portions of the body are visible. Frames (b) illustrate trophallaxis interactions partially obscured by the metallic hive gate. Together, these scenarios demonstrate the dataset’s coverage of difficult visual situations that models must overcome to achieve robust detection performance.

3.2. Temporal-to-RGB Encoding and Dataset Generation

Figure 4 illustrates the four input representations used in this work for converting short temporal windows of video into three-channel images suitable for 2D single-shot detectors (YOLO variants). The central idea is to inject short-term temporal information into the three RGB channels so that standard 2D detection networks (which expect 3-channel input) can exploit motion information without changing network architectures. Below we describe each representation, the motivation behind it, and the exact procedure used to create encoded images and the corresponding datasets.
The source videos were recorded at 30 fps, corresponding to a frame interval of 33.3 ms between consecutive frames. Temporal encodings were constructed using neighboring frames sampled at this native rate. For short-term encodings (TSG and TEM), each encoded image incorporated three consecutive frames (covering 0.067 s total). For longer-term encodings (TEMA), we evaluated multiple temporal window sizes of 0.17 s, 0.3 s, 0.5 s, 1 s, and 2 s, corresponding to 5, 9, 15, 30, and 60 frames, respectively, at 30 fps.

3.2.1. Direct RGB

The baseline input is the standard color frame at time $n$: $I(n) \in \mathbb{R}^{H \times W \times 3}$. Direct RGB preserves full spatial and color information and therefore provides the detector with appearance features (body shape, color patterns, shadows) but contains no explicit temporal information. For each frame index $n$, extract the raw RGB image $I(n)$, optionally resize it to the model input resolution (we used 1024 × 576 in most experiments), and save it as the RGB dataset image. Annotations (bounding boxes) are copied verbatim, with coordinate scaling applied if resizing. In this work, direct RGB serves as a baseline and is useful when motion information is weak or when color and texture are highly informative.
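As an illustrative sketch only (assuming pixel-coordinate bounding boxes and OpenCV; the function and variable names are not part of the released tooling), this baseline preparation step can be expressed as follows:

```python
import cv2

def make_rgb_sample(frame, boxes, size=(1024, 576)):
    """Direct RGB baseline: resize the raw frame to the model input
    resolution and rescale pixel-coordinate bounding boxes accordingly.
    `boxes` is an iterable of (x, y, w, h) tuples in original-frame pixels."""
    h0, w0 = frame.shape[:2]
    sx, sy = size[0] / w0, size[1] / h0          # horizontal / vertical scale factors
    resized = cv2.resize(frame, size)            # size is (width, height) for OpenCV
    scaled = [(x * sx, y * sy, w * sx, h * sy) for (x, y, w, h) in boxes]
    return resized, scaled
```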

3.2.2. Temporally Stacked Grayscale (TSG)

This approach provides the detector with short-term temporal context by stacking three neighboring grayscale frames into the three channels. This is a low-cost way to make a 2D network temporally aware while leaving the architecture unchanged. Channel mapping is implemented as follows:
$$ B \leftarrow I_g(n-1), \qquad G \leftarrow I_g(n), \qquad R \leftarrow I_g(n+1), $$
where $I_g(\cdot)$ denotes the grayscale conversion of the RGB frame, computed with the standard luminosity equation:
$$ I_g(x, y) = 0.299\,R(x, y) + 0.587\,G(x, y) + 0.114\,B(x, y). $$
For each central index $n$: load frames $I(n-1)$, $I(n)$, and $I(n+1)$; convert each to grayscale $I_g(\cdot)$; stack them into a three-channel image as shown above; then resize and save. For boundary frames (start/end of a clip) we replicate the nearest available frame (e.g., $I(-1) \leftarrow I(0)$) so that every $n$ has a full triplet. Annotations are taken from the central frame $I(n)$. TSG is simple, inexpensive to compute, and preserves implicit motion information (differences across channels). It may, however, conflate appearance and motion signals and is less explicit about where motion occurs.
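A minimal Python/OpenCV sketch of this TSG assembly is shown below; the function name and the in-memory frame list are illustrative assumptions rather than the published dataset scripts:

```python
import cv2

def encode_tsg(frames, n, size=(1024, 576)):
    """Temporally Stacked Grayscale (TSG): grayscale frames at n-1, n, n+1
    are placed in the B, G, R channels of one image. `frames` is an
    in-memory list of BGR frames; clip boundaries are handled by
    replicating the nearest available frame."""
    def gray(i):
        i = min(max(i, 0), len(frames) - 1)                    # nearest-frame replication
        return cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY)     # 0.299R + 0.587G + 0.114B

    tsg = cv2.merge([gray(n - 1), gray(n), gray(n + 1)])       # OpenCV channel order: B, G, R
    return cv2.resize(tsg, size)
```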

3.2.3. Temporally Encoded Motion (TEM)

TEM makes motion explicit by replacing one or both channels with pixel-wise temporal differences. It emphasizes areas that changed between consecutive frames (motion saliency), suppressing static background. This is particularly useful for subtle motions such as proboscis contact and small abdominal movements in trophallaxis. Channel mapping is implemented as follows:
$$ B \leftarrow D^{-}(n) = \left| I_g(n) - I_g(n-1) \right|, \qquad G \leftarrow I_g(n), \qquad R \leftarrow D^{+}(n) = \left| I_g(n+1) - I_g(n) \right|, $$
where, for each $n$, the frames $I(n-1)$, $I(n)$, and $I(n+1)$ are taken and the backward and forward absolute differences $D^{-}(n)$ and $D^{+}(n)$ are calculated. Assemble the channels B/G/R as described above, clip or linearly scale the differences to the 0–255 range to preserve contrast, resize, and save. As with TSG, annotations come from frame $I(n)$. Boundaries are handled by frame replication. TEM explicitly highlights motion and is robust to static clutter, but single-frame differences are noisy (sensor noise, lighting flicker) and can produce false positives for global changes (shadows, illumination flicker).
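A corresponding sketch of the TEM assembly, under the same illustrative assumptions (OpenCV/NumPy, in-memory frames), is:

```python
import cv2
import numpy as np

def encode_tem(frames, n, size=(1024, 576)):
    """Temporally Encoded Motion (TEM): B <- |I_g(n) - I_g(n-1)|,
    G <- I_g(n), R <- |I_g(n+1) - I_g(n)|, with the differences linearly
    scaled to the 0-255 range to preserve contrast."""
    def gray(i):
        i = min(max(i, 0), len(frames) - 1)                    # nearest-frame replication
        return cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY).astype(np.int16)

    g_prev, g_cur, g_next = gray(n - 1), gray(n), gray(n + 1)
    d_back = cv2.normalize(np.abs(g_cur - g_prev), None, 0, 255, cv2.NORM_MINMAX)
    d_fwd = cv2.normalize(np.abs(g_next - g_cur), None, 0, 255, cv2.NORM_MINMAX)
    tem = cv2.merge([d_back.astype(np.uint8),                  # B: backward difference
                     g_cur.astype(np.uint8),                   # G: current grayscale frame
                     d_fwd.astype(np.uint8)])                  # R: forward difference
    return cv2.resize(tem, size)
```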

3.2.4. Temporally Encoded Motion and Average (TEMA)

TEMA augments TEM by adding a longer-term average to the channels. It combines a moving average image to represent the quasi-static appearance of the scene, the current grayscale frame, and a smoothed measure of motion obtained by averaging absolute frame-to-frame differences over a short window. This combination gives the detector both the stable appearance context and a smoothed motion energy map—a particularly effective indicator for slow or episodic interactions such as trophallaxis. TEMA variants are parameterized by the averaging window (e.g., 0.17 s, 0.3 s, 0.5 s, 1 s, 2 s). The best-performing variants in our experiments included TEMA-1s. Channel mapping is implemented as follows:
$$ R \leftarrow \mathrm{MA}(n), \qquad G \leftarrow I_g(n), \qquad B \leftarrow \mathrm{MAAD}(n), $$
where MA is the moving-average image and MAAD is the moving average of absolute differences (motion energy). This mapping matches the TEMA schematic in Figure 4. Let $N = \mathrm{round}(\mathrm{duration} \times \mathrm{fps})$ be the window length in frames (e.g., a 1 s window at 30 fps gives $N = 30$). Then, for a causal (past-only) average:
$$ \mathrm{MA}(n) = \frac{1}{N} \sum_{k=0}^{N-1} I_g(n-k), $$
$$ \mathrm{MAAD}(n) = \frac{1}{N-1} \sum_{k=1}^{N-1} \left| I_g(n-k+1) - I_g(n-k) \right|. $$
In this investigation, a centered temporal window was used for offline dataset creation to symmetrically aggregate past and future frames around each central frame index $n$; for real-time inference, causal windows should be used to avoid future look-ahead. For each central frame index $n$:
  • The $N$ grayscale frames $I_g(n-N+1), \ldots, I_g(n)$ were collected (for the causal MA), or symmetrically around $n$ if using a centered window for offline training data;
  • $\mathrm{MA}(n)$ and $\mathrm{MAAD}(n)$ were computed according to the formulas above;
  • The RGB image was assembled as $R = \mathrm{MA}(n)$, $G = I_g(n)$, $B = \mathrm{MAAD}(n)$;
  • The results were scaled or clipped to the 0–255 range, resized as required, and saved.
We created multiple TEMA datasets using different window durations (TEMA-0.17s, TEMA-0.3s, TEMA-0.5s, TEMA-1s, TEMA-2s) and observed consistent gains with larger windows up to 1 s (see Table 1 for quantitative results). TEMA combines long-term appearance context with smoothed motion energy, reducing spurious noise from single-frame differencing while still highlighting true sustained interactions. This makes it particularly effective for trophallaxis detection where contact and liquid transfer can be temporally spread and subtle.
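The following sketch outlines the causal TEMA computation under these definitions; the names and helper structure are illustrative assumptions rather than the published dataset scripts:

```python
import cv2
import numpy as np

def encode_tema(frames, n, window_s=1.0, fps=30, size=(1024, 576)):
    """Temporally Encoded Motion and Average (TEMA), causal variant:
    R <- MA(n)   moving average of the last N grayscale frames,
    G <- I_g(n)  current grayscale frame,
    B <- MAAD(n) moving average of absolute frame-to-frame differences."""
    N = max(2, round(window_s * fps))                          # e.g., 1 s at 30 fps -> N = 30
    idx = [min(max(i, 0), len(frames) - 1) for i in range(n - N + 1, n + 1)]
    grays = [cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY).astype(np.float32) for i in idx]

    ma = np.mean(grays, axis=0)                                # quasi-static appearance context
    maad = np.mean([np.abs(grays[k + 1] - grays[k]) for k in range(N - 1)], axis=0)
    maad = cv2.normalize(maad, None, 0, 255, cv2.NORM_MINMAX)  # smoothed motion energy, rescaled

    tema = cv2.merge([maad, grays[-1], ma]).astype(np.uint8)   # B = MAAD, G = current, R = MA
    return cv2.resize(tema, size)
```

For the centered windows used in offline dataset creation, the index range would instead be taken symmetrically around $n$.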
Figure 5 illustrates visual examples of the four temporal encoding strategies applied to the same hive entrance scene. Figure 5a shows the raw RGB representation, which preserves appearance detail but lacks explicit temporal information. Figure 5b presents the TSG encoding, where three consecutive grayscale frames are mapped into the RGB channels, thereby embedding short-term motion patterns as inter-channel variations. Figure 5c depicts the TEM approach, in which the blue and red channels represent backward and forward absolute differences, while the green channel contains the current grayscale frame; this highlights motion saliency relative to the present time step. Finally, Figure 5d shows the TEMA encoding, where the red channel contains a moving average of past frames, the green channel the current frame, and the blue channel the moving average of absolute differences. TEMA thus combines stable appearance context with smoothed motion energy, making it especially well suited for detecting subtle interactions such as trophallaxis under crowded and noisy hive entrance conditions.

3.3. Model Training and Optimization

To ensure consistency and reproducibility, all models were trained and optimized using the following hyperparameter settings. Standard augmentation included random translations up to ±10% of image width, random scaling with a gain of ±0.5, and horizontal flips with a probability of 0.5. Color variation was introduced via HSV adjustments (hsv_h = 0, hsv_s = 0, and hsv_v = 0.2). Mosaic augmentation was disabled for the final 10 epochs to stabilize convergence. Model optimization was performed with the AdamW optimizer [43], using a base learning rate of 0.001, momentum of 0.9, and weight decay regularization. Training was limited to 1000 epochs, with checkpoints stored every 10 epochs. An early-stopping strategy with a patience of 100 epochs was applied to avoid overfitting and unnecessary computation. Batch size varied dynamically between 6 and 32 depending on model size to balance GPU memory use and throughput. For most models, convergence to minimal validation loss occurred within 100–200 epochs.
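As an illustration, this configuration maps onto the Ultralytics training interface roughly as follows; the dataset YAML path and pretrained checkpoint name are assumptions, and the exact training scripts used in this study may differ:

```python
from ultralytics import YOLO

# Sketch of the training configuration described above, expressed through the
# Ultralytics API (v8.3.x).
model = YOLO("yolov8m.pt")
model.train(
    data="trophallaxis_tema1s.yaml",   # hypothetical dataset description file
    imgsz=1024,                        # long side of the 1024 x 576 input
    epochs=1000,
    patience=100,                      # early stopping after 100 stagnant epochs
    optimizer="AdamW",
    lr0=0.001,
    momentum=0.9,
    translate=0.1,                     # random translation up to +/-10% of image width
    scale=0.5,                         # random scaling gain of +/-0.5
    fliplr=0.5,                        # horizontal flip probability
    hsv_h=0.0, hsv_s=0.0, hsv_v=0.2,   # HSV color augmentation
    close_mosaic=10,                   # disable mosaic for the final 10 epochs
)
```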

4. Results and Discussion

All experiments were carried out on a workstation equipped with an NVIDIA GeForce RTX 4080 Super GPU with 16 GB VRAM. The software environment included Ultralytics YOLO (v8.3.80), Python 3.12.9, PyTorch 2.5.1, and CUDA 12.6. For deployment on the NVIDIA Jetson AGX Orin, PyTorch-trained weights were exported and converted into TensorRT-optimized engines using TensorRT v8.6.2. The input resolution was standardized to 1024 × 576 px for all experiments. The recordings originated from multiple hives (see Figure 1) that differed in landing board structure, background clutter, surface materials, and bee density, providing diverse visual contexts that support model generalization. The dataset was split into 80% training and 20% validation/testing, with the same split applied consistently across all models. The split was performed by hive rather than by random frame sampling. All frames from a given record were assigned exclusively to either the training or the test subset to avoid data leakage and to evaluate cross-hive generalization. The specific record assignments and corresponding trophallaxis duration statistics are listed in our publicly available dataset.
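For reference, a minimal export sketch using the Ultralytics API is shown below; the weight path is a placeholder, and INT8 export additionally requires calibration data:

```python
from ultralytics import YOLO

# Export sketch for Jetson AGX Orin deployment: convert trained PyTorch weights
# into a TensorRT engine (run on the target device). Paths are placeholders.
model = YOLO("runs/detect/train/weights/best.pt")
model.export(format="engine", imgsz=1024, half=True)    # FP16 TensorRT engine
# INT8 export is also available but needs a calibration dataset:
# model.export(format="engine", imgsz=1024, int8=True, data="trophallaxis_tema1s.yaml")
```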

4.1. Investigation of Precision vs. Inference Time

Figure 6 shows a clear and consistent pattern: temporally aware encodings (TSG, TEM and the moving-average TEMA variants) substantially improved detection precision for trophallaxis relative to plain RGB, and these precision gains came at small-to-moderate additional per-image cost depending on the model family and size.
Table 1 quantifies these effects. On the RTX 4080, the RGB baseline yielded mAP50 values in the high-70s to mid-80s across YOLO variants (e.g., YOLOv8m 84.7%), and switching to TSG or TEM raised mAP50 by multiple percentage points (for example, YOLOv8m: RGB 84.7% vs. TSG 91.9% and TEM 91.9%). The TEMA variants produced the largest gains: TEMA-1s reached the single-best mAP50 values across nearly all model sizes (e.g., YOLOv8m 95.5%, YOLOv8x 96.4%), at the cost of a moderate increase in per-image processing time relative to the simplest encodings.
An examination of speed–accuracy trade-offs across YOLO families showed predictable but informative differences. For matched model size (n, s, m, l, x):
  • YOLOv8 tended to be the fastest for a given size and achieved high accuracy when paired with temporal encodings (TSG/TEM/TEMA). Its per-image times were among the lowest reported in Table 1, making medium-sized YOLOv8 (m/l) attractive when both latency and accuracy matter.
  • YOLO11 generally improved accuracy slightly over YOLOv8 for the same size when using temporal encodings, but at a small increase in per-image time; for example, YOLO11m with TSG/TEMA reached very competitive mAP50 numbers while remaining in a similar latency class.
  • YOLO12 attained top-end accuracy in several TEMA variants, comparable to or slightly above YOLO11/YOLOv8 for the largest sizes, but it incurred the largest per-image time, making it better suited to offline/batch processing or when highest accuracy is essential.
The per-image time costs of the TEMA-1s encoding are summarized in Table 1, with values reported for three model input resolutions. At the highest resolution of 1024 × 576 px, per-image processing times ranged from ∼13 ms for YOLOv8n to nearly 30 ms for YOLO12x, including preprocessing, inference, and postprocessing. These values corresponded to throughput between ∼75 fps for the smallest models and ∼33 fps for the largest on an RTX GPU, positioning medium-sized YOLOv8 and YOLO11 variants as a favorable compromise between accuracy and efficiency. When the resolution was reduced to 640 × 384 px, per-image times decreased by roughly 25–30% across models (e.g., ∼10 ms for YOLOv8n, ∼26 ms for YOLO12x), enabling many medium configurations to sustain frame rates above 70 fps while preserving high detection accuracy. At the lowest tested resolution of 320 × 192 px, inference times fell further (e.g., ∼9 ms for YOLOv8n, ∼25 ms for YOLO12x), but this speed gain came at a significant cost in accuracy, with mAP50 values dropping below 75% for most models.
Practical takeaways from Figure 6 and Figure 7 and Table 1 can be summarized as follows. Temporal encodings consistently outperformed RGB, and TSG and TEM provided large, low-cost gains. TEMA used a longer-window moving average and smoother motion, yielding the highest precision; TEMA-1s was best in our tests and was the preferred choice when top accuracy was required and a modest per-image cost was acceptable. Model choice balanced final-stage accuracy and latency: medium or large YOLOv8 or YOLO11 combined with TEMA provided an optimal trade-off, whereas YOLO12x with TEMA gave only slight accuracy improvements at significantly higher latency.
The superior performance of the TEMA-1s variant can be explained by its temporal alignment with the natural dynamics of trophallaxis behavior. As described in Section 1, solicitation contacts in honeybees often terminate within one second, while liquid-transfer bouts typically persist for several seconds [16]. A one-second moving-average window therefore captured both the onset and full duration of these brief exchanges while suppressing frame-to-frame noise and illumination flicker. This window length thus provides an effective balance between temporal sensitivity and stability, which likely underlies its superior detection accuracy.
Figure 7 summarizes the net gains in mAP50 obtained by introducing temporal information into the input stream. All temporal encodings provided consistent improvements over the RGB baseline, with the TSG and the TEM producing modest but dependable uplifts across every YOLO family and size. The TEMA variants, which aggregate motion over a longer window and apply smoothing, delivered the largest average gains: short TEMA windows (0.17–0.5 s) already pushed most models several percentage points above TSG/TEM, and the 1 s TEMA configuration produced the biggest overall increases, with several model/configuration pairs showing double-digit percentage-point improvements in mAP50 relative to RGB. Increasing the window to 2 s yielded only marginal additional benefit in most cases, indicating diminishing returns beyond the 1 s scale. The pattern was stable across model families: smaller and medium-sized models tended to exhibit the largest relative gains because their RGB baselines were lower and thus temporal features yielded a proportionally larger effect, while the largest models also improved but by a smaller relative amount. These results demonstrated that adding temporal encodings was an effective and generally inexpensive way to raise detection precision for trophallaxis, and that a smoothed, longer-window temporal aggregation (TEMA-1s) was the most effective single choice when maximizing mAP50 was the goal.
To quantify the individual contributions of the moving-average (MA) and moving-average-absolute-difference (MAAD) components within the TEMA encoding, we conducted an ablation experiment in which each submodule was applied independently using the 1 s temporal window. The results (YOLO11m, 1024 × 576 px) showed mAP50 = 84.6% for the RGB baseline and 95.3% for the full TEMA-1s combination (Table 1). After ablation, the scores were 93.8% for MA-1s only and 93.6% for MAAD-1s only. Both submodules therefore contributed substantial accuracy gains relative to plain RGB input, indicating that long-term appearance averaging (MA) stabilized texture and lighting variation, while motion energy aggregation (MAAD) enhanced sensitivity to sustained inter-individual contact. Their joint use in TEMA yielded a synergistic effect that integrates stable context with smoothed motion signals, producing the highest overall detection precision.

4.2. Deployment on Jetson AGX Orin Platform

We evaluated end-to-end throughput on the Jetson AGX Orin; the effective frame rates are presented in Table 2, while Figure 8 illustrates comparative performance trends between the RTX and AGX platforms. On the AGX, TensorRT conversion (FP16/INT8) reduced inference time substantially and preserved most of the PyTorch accuracy in FP16; INT8 yielded the highest speeds but with a small mAP penalty for the smallest models.
Key deployment observations from Table 2 indicate that, on the Jetson AGX Orin, small models (nano, small) compiled with TensorRT in FP16 or INT8 achieved real-time frame rates: for example, YOLOv8n ran at 31 fps in FP16 and 32 fps in INT8, and YOLOv8s and YOLO11n reached roughly 30 fps in INT8, making these configurations suitable when throughput is critical and some mAP loss under INT8 quantization is acceptable. Medium and large models such as YOLOv8m/l and YOLO11m/l ran in the 22–25 fps range in FP16 or INT8 on the AGX and delivered substantially higher mAP when paired with TSG, TEM, or TEMA, which makes them a good middle ground for many field deployments and suggests medium-to-large sizes in FP16 for maximum precision on the AGX. Extra-large models like YOLO12x remained the slowest on the AGX, often below 20 fps even in INT8, and are therefore less suitable for strict real-time constraints unless frame skipping or input downscaling is applied.
Figure 6 and Figure 8 show that the relative ranking of model/encoding combinations was stable across platforms (RTX vs. AGX): encodings that improved accuracy on RTX also improved accuracy on AGX, and performance ordering of models was preserved, although absolute timings differ (AGX was slower). This made offline RTX experiments useful when selecting the best candidate for later Jetson deployment.
Deployment trade-offs followed the same themes as in the RTX experiments: when maximum mAP for trophallaxis is required, use a TEMA-1s encoding with a medium/large YOLO variant (YOLO11m/YOLOv8l) and accept the reduced fps on the AGX. To achieve sustained real-time throughput (≥30 fps), choose a nano/small model in INT8/FP16 with TSG or TEM to reclaim much of the temporal benefit while staying within the fps budget. The detailed fps summary (Table 2) and the accuracy summary (Table 1) together allow selecting the precise model/precision/encoding combination that meets a deployment’s accuracy and latency targets.

4.3. Visualizations

Figure 9 illustrates representative detections of trophallaxis behavior using YOLOv8m with the TEMA-1s temporal encoding approach. These visualizations highlight the model’s capability to consistently identify interacting bee pairs across a wide range of hive entrance conditions, including dense aggregations, variable illumination, and partial occlusions.
In scenes characterized by high bee density, the detector successfully identified trophallactic pairs with bounding boxes tightly localized around the head-to-head interaction zone. This demonstrated the model’s robustness in distinguishing true trophallaxis events from incidental contacts or overlapping body postures, a frequent source of ambiguity at hive entrances. The use of temporal averaging in TEMA enhanced these detections by stabilizing motion indicators across a longer time window, thereby amplifying subtle interaction signals such as sustained proboscis contact.
Even under challenging lighting conditions, including sharp contrasts from direct sunlight or shadowed regions of the hive gate, detections remained stable. The temporally encoded motion energy suppressed transient artifacts caused by illumination flicker or passing shadows, allowing the model to focus on localized, biologically relevant interactions. Similarly, in cases of partial occlusion, where only portions of the bee bodies were visible, the temporal aggregation emphasized the continuity of motion and contact, enabling the model to correctly infer trophallaxis despite incomplete visual information.
These qualitative examples reinforce the quantitative findings reported earlier: temporal encoding, and in particular the TEMA-1s strategy, provides a substantial advantage for behavior detection by integrating appearance and motion information. By reliably isolating trophallaxis in visually complex and dynamic hive entrance environments, these visualizations demonstrate the practical viability of temporally enriched YOLO detectors for real-world behavioral monitoring in apiculture.

5. Conclusions

This study demonstrates that temporal-to-RGB encoding strategies provide a simple yet highly effective means of enhancing YOLO-based detection of honeybee trophallaxis behavior under field-realistic conditions. By introducing a large-scale, publicly available dataset of annotated hive entrance recordings, we evaluated multiple temporal representations and showed that temporally enriched inputs consistently outperform standard RGB frames across YOLO model families. Among the tested approaches, the TEMA-1s encoding achieved the best balance between accuracy and efficiency, raising mAP50 values to over 95% in medium- and large-scale models while maintaining real-time throughput on both high-performance GPUs and edge devices such as the Jetson AGX Orin.
The findings highlight that even lightweight temporal representations, such as stacked grayscale frames or short-term differencing, yield substantial accuracy improvements at minimal computational cost, making them attractive for resource-constrained applications. At the same time, longer-window temporal averaging provides stability against noise, occlusion, and illumination variability, enabling robust behavioral detection in complex hive entrance environments.
Beyond methodological contributions, this work underscores the potential of markerless, deep learning-based monitoring as a scalable tool for precision beekeeping. Reliable detection of trophallaxis can provide novel behavioral indicators of colony health, resource dynamics, and stress responses, complementing existing macro-level monitoring approaches. The ability to deploy such systems on embedded hardware platforms supports their practical integration into continuous hive monitoring frameworks.
Future work should extend this approach to multi-behavior recognition, integrate complementary sensing modalities, and explore semi-supervised or synthetic data generation to reduce annotation demands. Temporal-to-RGB encodings thus offer a practical and efficient pathway for automated monitoring of honeybee social behavior in precision beekeeping.

Author Contributions

Conceptualization, G.V. and T.S.; methodology, T.S. and G.V.; software, G.V.; validation, G.V.; investigation and analysis, G.V. and T.S.; data curation, G.V. and T.S.; writing—original draft preparation, G.V. and T.S.; writing—review and editing, G.V. and T.S.; visualization, G.V. and T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available in Zenodo at https://doi.org/10.5281/zenodo.15813172.

Acknowledgments

The authors thank Vilnius Gediminas Technical University for providing computational resources for data analysis and model training.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Klein, A.M.; Vaissière, B.E.; Cane, J.H.; Steffan-Dewenter, I.; Cunningham, S.A.; Kremen, C.; Tscharntke, T. Importance of pollinators in changing landscapes for world crops. Proc. R. Soc. B Biol. Sci. 2007, 274, 303–313. [Google Scholar] [CrossRef]
  2. Smith, K.M.; Loh, E.H.; Rostal, M.K.; Zambrana-Torrelio, C.M.; Mendiola, L.; Daszak, P. Pathogens, pests, and economics: Drivers of honey bee colony declines and losses. EcoHealth 2013, 10, 434–445. [Google Scholar] [CrossRef] [PubMed]
  3. Osterman, J.; Aizen, M.A.; Biesmeijer, J.C.; Bosch, J.; Howlett, B.G.; Inouye, D.W.; Jung, C.; Martins, D.J.; Medel, R.; Pauw, A.; et al. Global trends in the number and diversity of managed pollinator species. Agric. Ecosyst. Environ. 2021, 322, 107653. [Google Scholar] [CrossRef]
  4. Ulrich, J.; Stefanec, M.; Rekabi-Bana, F.; Fedotoff, L.A.; Rouček, T.; Gündeğer, B.Y.; Saadat, M.; Blaha, J.; Janota, J.; Hofstadler, D.N.; et al. Autonomous tracking of honey bee behaviors over long-term periods with cooperating robots. Sci. Robot. 2024, 9, 1–16. [Google Scholar] [CrossRef] [PubMed]
  5. Janota, J.; Blaha, J.; Stefanec, M.; Rouček, T.; Ulrich, J.; Fedotoff, L.A.; Rekabi-Bana, F.; Arvin, F.; Schmickl, T.; Krajník, T. Non-invasive honeybee colony monitoring via robotic mapping of combs in observation hives. Comput. Electron. Agric. 2025, 239, 111031. [Google Scholar] [CrossRef]
  6. Stefanec, M.; Hofstadler, D.N.; Krajník, T.; Turgut, A.E.; Alemdar, H.; Lennox, B.; Şahin, E.; Arvin, F.; Schmickl, T. A minimally invasive approach towards “ecosystem hacking” with honeybees. Front. Robot. AI 2022, 9, 791921. [Google Scholar] [CrossRef]
  7. Turyagyenda, A.; Katumba, A.; Akol, R.; Nsabagwa, M.; Mkiramweni, M.E. IoT and Machine Learning Techniques for Precision Beekeeping: A Review. AI 2025, 6, 26. [Google Scholar] [CrossRef]
  8. Bilik, S.; Zemcik, T.; Kratochvila, L.; Ricanek, D.; Richter, M.; Zambanini, S.; Horak, K. Machine learning and computer vision techniques in continuous beehive monitoring applications: A survey. Comput. Electron. Agric. 2024, 217, 108560. [Google Scholar] [CrossRef]
  9. Jeong, K.; Oh, H.; Lee, Y.; Seo, H.; Jo, G.; Jeong, J.; Park, G.; Choi, J.; Seo, Y.D.; Jeong, J.H.; et al. IoT and AI systems for enhancing bee colony strength in precision beekeeping: A survey and future research directions. IEEE Internet Things J. 2024, 12, 362–389. [Google Scholar] [CrossRef]
  10. Šabić, J.; Perković, T.; Šolić, P.; Šerić, L. Buzzing with Intelligence: A Systematic Review of Smart Beehive Technologies. Sensors 2025, 25, 5359. [Google Scholar] [CrossRef]
  11. Kongsilp, P.; Taetragool, U.; Duangphakdee, O. Individual honey bee tracking in a beehive environment using deep learning and Kalman filter. Sci. Rep. 2024, 14, 1061. [Google Scholar] [CrossRef] [PubMed]
  12. Hu, J.; Chen, Y.; Zhang, H.; Zhang, Y.; Shi, Z.; Ren, J.; Ye, H.; Zuo, Z.; Luo, Z. A 3D group object tracking method for honeybees in open spaces. Comput. Electron. Agric. 2025, 237, 110535. [Google Scholar] [CrossRef]
  13. Matuzevičius, D.; Urbanavičius, V.; Miniotas, D.; Mikučionis, Š.; Laptik, R.; Ušinskas, A. Key-Point-Descriptor-Based Image Quality Evaluation in Photogrammetry Workflows. Electronics 2024, 13, 2112. [Google Scholar] [CrossRef]
  14. Crailsheim, K. Trophallactic interactions in the adult honeybee (Apis mellifera L.). Apidologie 1998, 29, 97–112. [Google Scholar] [CrossRef]
  15. Trhlin, M.; Rajchard, J. Chemical communication in the honeybee (Apis mellifera L.): A review. Vet. Med. 2011, 56, 265–273. [Google Scholar] [CrossRef]
  16. De Marco, R.; Farina, W. Trophallaxis in forager honeybees (Apis mellifera): Resource uncertainty enhances begging contacts? J. Comp. Physiol. A 2003, 189, 125–134. [Google Scholar] [CrossRef]
  17. Blut, C.; Crespi, A.; Mersch, D.; Keller, L.; Zhao, L.; Kollmann, M.; Schellscheidt, B.; Fülber, C.; Beye, M. Automated computer-based detection of encounter behaviours in groups of honeybees. Sci. Rep. 2017, 7, 17663. [Google Scholar] [CrossRef]
  18. Gernat, T.; Rao, V.D.; Middendorf, M.; Dankowicz, H.; Goldenfeld, N.; Robinson, G.E. Automated monitoring of behavior reveals bursty interaction patterns and rapid spreading dynamics in honeybee social networks. Proc. Natl. Acad. Sci. USA 2018, 115, 1433–1438. [Google Scholar] [CrossRef]
  19. Bernardes, R.C.; Lima, M.A.P.; Guedes, R.N.C.; da Silva, C.B.; Martins, G.F. Ethoflow: Computer vision and artificial intelligence-based software for automatic behavior analysis. Sensors 2021, 21, 3237. [Google Scholar] [CrossRef]
  20. Gernat, T.; Jagla, T.; Jones, B.M.; Middendorf, M.; Robinson, G.E. Automated monitoring of honey bees with barcodes and artificial intelligence reveals two distinct social networks from a single affiliative behavior. Sci. Rep. 2023, 13, 1541. [Google Scholar] [CrossRef]
  21. Matuzevičius, D. A Retrospective Analysis of Automated Image Labeling for Eyewear Detection Using Zero-Shot Object Detectors. Electronics 2024, 13, 4763. [Google Scholar] [CrossRef]
  22. Gharooni-Fard, G.; Byers, M.; Deshmukh, V.; Bradley, E.; Mayo, C.; Topaz, C.M.; Peleg, O. A computational topology-based spatiotemporal analysis technique for honeybee aggregation. npj Complex. 2024, 1, 3. [Google Scholar] [CrossRef]
  23. Matuzevičius, D. Synthetic Data Generation for the Development of 2D Gel Electrophoresis Protein Spot Models. Appl. Sci. 2022, 12, 4393. [Google Scholar] [CrossRef]
  24. Geffre, A.C.; Gernat, T.; Harwood, G.P.; Jones, B.M.; Morselli Gysi, D.; Hamilton, A.R.; Bonning, B.C.; Toth, A.L.; Robinson, G.E.; Dolezal, A.G. Honey bee virus causes context-dependent changes in host social behavior. Proc. Natl. Acad. Sci. USA 2020, 117, 10406–10413. [Google Scholar] [CrossRef]
  25. Boff, S.; Friedel, A.; Mussury, R.M.; Lenis, P.R.; Raizer, J. Changes in social behavior are induced by pesticide ingestion in a Neotropical stingless bee. Ecotoxicol. Environ. Saf. 2018, 164, 548–553. [Google Scholar] [CrossRef]
  26. Zhang, F.; Cao, W.; Zhang, Y.; Luo, J.; Hou, J.; Chen, L.; Yi, G.; Li, H.; Huang, M.; Dong, L.; et al. S-dinotefuran affects the social behavior of honeybees (Apis mellifera) and increases their risk in the colony. Pestic. Biochem. Physiol. 2023, 196, 105594. [Google Scholar] [CrossRef] [PubMed]
  27. Farina, W.M.; Wainselboim, A.J. Trophallaxis within the dancing context: A behavioral and thermographic analysis in honeybees (Apis mellifera). Apidologie 2005, 36, 43–47. [Google Scholar] [CrossRef]
  28. Wainselboim, A.J.; Farina, W.M. Trophallaxis in the honeybee Apis mellifera (L.): The interaction between flow of solution and sucrose concentration of the exploited food sources. Anim. Behav. 2000, 59, 1177–1185. [Google Scholar] [CrossRef] [PubMed]
  29. Gil, M.; De Marco, R.J. Olfactory learning by means of trophallaxis in Apis mellifera. J. Exp. Biol. 2005, 208, 671–680. [Google Scholar] [CrossRef]
  30. Mc Cabe, S.I.; Hrncir, M.; Farina, W.M. Vibrating donor-partners during trophallaxis modulate associative learning ability of food receivers in the stingless bee Melipona quadrifasciata. Learn. Motiv. 2015, 50, 11–21. [Google Scholar] [CrossRef]
  31. Ramsey, M.; Bencsik, M.; Newton, M.I. Long-term trends in the honeybee ‘whooping signal’ revealed by automated detection. PLoS ONE 2017, 12, e0171162. [Google Scholar]
  32. Romero-González, J.E.; Solvi, C.; Peng, F.; Chittka, L. Behaviour of honeybees integrated into bumblebee nests and the responses of their hosts. Apidologie 2024, 55, 50. [Google Scholar] [CrossRef]
  33. Langlands, Z.; du Rand, E.E.; Crailsheim, K.; Yusuf, A.A.; Pirk, C.W. Prisoners receive food fit for a queen: Honeybees feed small hive beetles protein-rich glandular secretions through trophallaxis. J. Exp. Biol. 2021, 224, 1–9. [Google Scholar] [CrossRef] [PubMed]
  34. Neumann, P.; Naef, J.; Crailsheim, K.; Crewe, R.M.; Pirk, C.W. Hit-and-run trophallaxis of small hive beetles. Ecol. Evol. 2015, 5, 5478–5486. [Google Scholar] [CrossRef]
  35. Weidenmüller, A.; Tautz, J. In-hive behavior of pollen foragers (Apis mellifera) in honey bee colonies under conditions of high and low pollen need. Ethology 2002, 108, 205–221. [Google Scholar] [CrossRef]
  36. Wang, J.; Cherian, A.; Porikli, F. Ordered pooling of optical flow sequences for action recognition. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 168–176. [Google Scholar]
  37. Kim, K.; Gowda, S.N.; Mac Aodha, O.; Sevilla-Lara, L. Capturing temporal information in a single frame: Channel sampling strategies for action recognition. arXiv 2022, arXiv:2201.10394. [Google Scholar] [CrossRef]
  38. van Leeuwen, M.C.; Fokkinga, E.P.; Huizinga, W.; Baan, J.; Heslinga, F.G. Toward versatile small object detection with Temporal-YOLOv8. Sensors 2024, 24, 7387. [Google Scholar] [CrossRef]
  39. Zhang, Y.; Yu, Y. Gtm: Gray temporal model for video recognition. arXiv 2021, arXiv:2110.10348. [Google Scholar] [CrossRef]
  40. Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A.; Gould, S. Dynamic image networks for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3034–3042. [Google Scholar]
  41. Mukherjee, S.; Anvitha, L.; Lahari, T.M. Human activity recognition in RGB-D videos by dynamic images. Multimed. Tools Appl. 2020, 79, 19787–19801. [Google Scholar] [CrossRef]
  42. Kopuklu, O.; Kose, N.; Rigoll, G. Motion fused frames: Data level fusion strategy for hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2103–2111. [Google Scholar]
  43. Loshchilov, I. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Figure 1. Annotated images from the publicly available dataset for detecting trophallaxis behavior at the hive entrance. The eight examples shown originate from the entrances of eight different beehives used in this study, illustrating the diversity in hive architecture, lighting conditions, and background complexity.
Figure 2. Examples of trophallaxis among honeybees in various hive locations. Frames (a,b) show typical trophallaxis behavior between two and three bees, respectively. Frames (c) depict trophallaxis against a crowded background, while frames (d) present trophallaxis with self-occlusion caused by the bee's body orientation.
Figure 3. Examples of different trophallaxis behaviors observed in bees. Frames (a) present bees engaged in trophallaxis with partial occlusion. Frames (b) illustrate trophallaxis behavior partially occluded by a metallic gate.
Figure 4. Direct RGB and three temporal encoding approaches for converting video frames into RGB representations: Temporally Stacked Grayscale (TSG), Temporally Encoded Motion (TEM), and Temporally Encoded Motion and Average (TEMA).
Figure 5. Visualization of input image variants for trophallaxis detection: standard RGB frame at time n (a); Temporally Stacked Grayscale (TSG) channels: Blue = I(n−1), Green = I(n), Red = I(n+1) (b); Temporally Encoded Motion (TEM) representation: Blue = |I(n) − I(n−1)|, Green = I(n), Red = |I(n) − I(n+1)| (c); Temporally Encoded Motion and Average (TEMA) approach: Red = moving average, Green = I(n), Blue = moving average of absolute differences (d).
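For readers who wish to reproduce these channel mappings, the following is a minimal NumPy sketch of the three encodings as defined in the caption of Figure 5. The function names, the assumption of 8-bit grayscale inputs, the OpenCV-style BGR channel order, and the example window length are illustrative choices, not the authors' implementation.

```python
import numpy as np

def tsg(prev_f, cur_f, next_f):
    # Temporally Stacked Grayscale: Blue = I(n-1), Green = I(n), Red = I(n+1)
    return np.dstack([prev_f, cur_f, next_f])               # BGR channel order

def tem(prev_f, cur_f, next_f):
    # Temporally Encoded Motion:
    # Blue = |I(n) - I(n-1)|, Green = I(n), Red = |I(n) - I(n+1)|
    back = np.abs(cur_f.astype(np.int16) - prev_f.astype(np.int16)).astype(np.uint8)
    fwd = np.abs(cur_f.astype(np.int16) - next_f.astype(np.int16)).astype(np.uint8)
    return np.dstack([back, cur_f, fwd])                    # BGR channel order

def tema(window, cur_f):
    # Temporally Encoded Motion and Average:
    # Red = moving average over the window, Green = I(n),
    # Blue = moving average of absolute frame-to-frame differences.
    # For TEMA-1s the window spans one second of frames (e.g., 30 frames at 30 fps).
    stack = np.stack(window).astype(np.float32)
    avg = stack.mean(axis=0)
    diff_avg = np.abs(np.diff(stack, axis=0)).mean(axis=0)
    return np.dstack([diff_avg, cur_f, avg]).astype(np.uint8)   # BGR channel order
```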
Figure 6. Accuracy–efficiency trade-off of YOLO models with different input encodings for trophallaxis detection on the RTX 4080 GPU. TSG encodes motion by stacking grayscale frames at times n − 1, n, and n + 1 into RGB channels. TEM highlights movement by combining the current frame with forward and backward pixel-wise differences. TEMA-1s applies temporally encoded motion and averaging over a 1 s window. The horizontal axis represents per-image time, including preprocessing, inference, and postprocessing.
Figure 7. Performance improvements of YOLO variants across input encoding approaches.
Figure 8. Comparison of YOLOv8 and YOLO11 performance across the RTX 4080 and Jetson AGX Orin. The blue and red curves correspond to the same PyTorch reference models as in Figure 6.
Figure 9. YOLOv8m detections of trophallaxis behavior among bees at hive entrances using the TEMA-1s temporal encoding approach.
Table 1. mAP50 performance of YOLOv8, YOLO11, and YOLO12 variants on RTX 4080 in detecting trophallaxis at the hive entrance. Per-image time includes preprocessing, inference, and postprocessing.
Model input resolution: 1024 × 576 px
Approach | YOLOv8 (n / s / m / l / x) | YOLO11 (n / s / m / l / x) | YOLO12 (n / s / m / l / x)
RGB, % | 78.8 / 84.5 / 84.7 / 85.8 / 85.8 | 76.2 / 84.4 / 84.6 / 86.2 / 86.4 | 80 / 82.1 / 85.2 / 85.7 / 86.4
TSG, % | 86.1 / 91.9 / 91.9 / 92.1 / 92.5 | 86 / 90.7 / 92.1 / 92.5 / 93.5 | 86.6 / 90.3 / 91.8 / 93.1 / 93.7
TEM, % | 85.4 / 88.9 / 91.9 / 92 / 92.7 | 86.7 / 89.3 / 91 / 92.6 / 92.9 | 85.4 / 90.5 / 92.6 / 92.8 / 93.5
TEMA-0.17s, % | 90.4 / 90.9 / 93.2 / 94.1 / 94.9 | 90.8 / 91.9 / 94.2 / 94.5 / 94.5 | 87.3 / 89.8 / 93.8 / 94.8 / 94.8
TEMA-0.3s, % | 91.5 / 94.3 / 94.9 / 95.1 / 95.2 | 90.9 / 93 / 93.6 / 94.6 / 95.1 | 89.8 / 93.6 / 94.8 / 95 / 95.8
TEMA-0.5s, % | 92 / 94.2 / 95.1 / 95.2 / 95.8 | 92.7 / 93.1 / 95.2 / 95.3 / 95.6 | 93.1 / 93.2 / 94.6 / 95.7 / 96
TEMA-1s, % | 93.3 / 95.3 / 95.5 / 95.6 / 96.4 | 94.2 / 94.5 / 95.3 / 96 / 96.2 | 92.4 / 94.6 / 95.4 / 96.4 / 96.4
TEMA-2s, % | 93.6 / 93.6 / 95.3 / 95.6 / 96.1 | 93.1 / 94.3 / 94.6 / 95.7 / 96.3 | 91.8 / 94.8 / 95.4 / 95.7 / 96.1
Time, ms | 13.3 / 13.6 / 15.3 / 17.3 / 25.2 | 15.4 / 15.7 / 17.7 / 16.2 / 22.9 | 19.3 / 19.5 / 20.3 / 28.8 / 29.6

Model input resolution: 640 × 384 px
TEMA-1s, % | 85.8 / 89.2 / 91.1 / 91.2 / 91.8 | 81.2 / 91.2 / 91.5 / 92.4 / 92.6 | 85.6 / 90.2 / 91.3 / 92.6 / 93.4
Time, ms | 9.9 / 10.1 / 11.9 / 13.9 / 14.2 | 12 / 12.2 / 14.2 / 19.5 / 20 | 16 / 16.2 / 16.9 / 25.2 / 25.8

Model input resolution: 320 × 192 px
TEMA-1s, % | 48.7 / 63.8 / 65.2 / 69.8 / 73.9 | 47.2 / 64.3 / 70 / 71.5 / 73.2 | 48.8 / 66.9 / 70.8 / 73.6 / 73.9
Time, ms | 8.9 / 9 / 10.9 / 12.4 / 12.7 | 10.8 / 11.1 / 13.2 / 18.4 / 18.7 | 14.9 / 15.1 / 16 / 24.3 / 24.9
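As an illustration of how the mAP50 values and the per-image timing breakdown reported in Table 1 can be obtained with the Ultralytics framework, a hedged sketch is given below. The weight and dataset file names are placeholders, and the authors' exact training and validation configuration is not reproduced here.

```python
from ultralytics import YOLO

# Placeholder paths; the actual trained weights and dataset YAML are not part of Table 1.
model = YOLO("yolov8m_tema1s_best.pt")

# Validate at the model input resolution used in Table 1 (imgsz is padded to the model stride).
metrics = model.val(data="trophallaxis.yaml", imgsz=1024, device=0)

print(f"mAP50: {metrics.box.map50:.3f}")
# Per-image preprocessing / inference / postprocessing times in milliseconds
print(metrics.speed)
```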
Table 2. Maximum frames per second achieved by YOLOv8, YOLO11, and YOLO12 models on Jetson AGX Orin with 1920 × 1080 px image resolution and 1024 × 576 px model input resolution.
Model | YOLOv8 (n / s / m / l / x) | YOLO11 (n / s / m / l / x) | YOLO12 (n / s / m / l / x)
PyTorch | 26 / 22 / 18 / 15 / 11 | 23 / 22 / 18 / 17 / 11 | 23 / 20 / 16 / 11 / 8
FP32 | 27 / 23 / 20 / 17 / 13 | 26 / 23 / 20 / 19 / 14 | 24 / 21 / 17 / 14 / 10
FP16 | 31 / 27 / 23 / 22 / 19 | 29 / 26 / 24 / 22 / 20 | 27 / 24 / 21 / 19 / 15
INT8 | 32 / 30 / 25 / 23 / 21 | 30 / 28 / 24 / 23 / 21 | 27 / 25 / 21 / 19 / 15
PyTorch (RTX 4080) | 75 / 74 / 65 / 58 / 40 | 65 / 64 / 56 / 44 / 43 | 52 / 51 / 49 / 35 / 34
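The TensorRT precisions in Table 2 (FP32, FP16, INT8) can be produced on the Jetson through the Ultralytics export interface. The sketch below is an assumption-based example, not the authors' deployment script: the weight file, calibration dataset YAML, sample image, and image size tuple are illustrative placeholders.

```python
from ultralytics import YOLO

model = YOLO("yolov8m_tema1s_best.pt")      # placeholder trained weights

# FP16 TensorRT engine at the 1024 x 576 model input resolution (height, width)
model.export(format="engine", imgsz=(576, 1024), half=True, device=0)

# INT8 engine; Ultralytics runs post-training calibration using the dataset YAML
model.export(format="engine", imgsz=(576, 1024), int8=True, data="trophallaxis.yaml", device=0)

# The exported .engine file can then be loaded like any other YOLO model:
trt_model = YOLO("yolov8m_tema1s_best.engine")
results = trt_model.predict("hive_entrance_tema.jpg", imgsz=(576, 1024))
```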