1. Introduction
In recent years, the dairy farming industry has faced workforce challenges due to aging populations and declining numbers of farm workers. The result is fewer farms managing larger herds, as evidenced by trends in Japan, where the number of dairy farming households has decreased while the average herd size per farm has increased. To maintain productivity with limited labor, farmers are turning to automation and smart agriculture technologies such as robotics and IoT. Key areas of interest include camera surveillance systems, automated individual animal identification, behavior classification, and tracking of livestock movements [1,2]. Among these, individual cow identification and tracking of their location (or staying frequency in certain areas of the barn) are fundamental for health monitoring, estrus detection, and behavior analysis [3,4]. Developing reliable and efficient methods for these tasks is therefore of high importance in precision livestock farming.
Current prevalent solutions for cow monitoring often rely on contact-based devices attached to each animal. For example, wireless RFID tags and sensors can log body weight, feed intake, or milk yield on a per-cow basis [5,6]. Activity monitors like pedometers or accelerometers are also used to gather movement and health data [7,8]. While effective, these approaches have notable drawbacks: each cow requires a dedicated device or tag, which increases cost and maintenance effort, and handling animals to attach or service devices can cause stress and added labor. The need for lower-cost, less intrusive monitoring has driven interest in non-contact methods that leverage cameras and computer vision.
Research on vision-based cattle identification has shown promising results [9,10,11,12,13]. Shen et al. (2020) [10] utilized fixed surveillance cameras in a barn to capture side-profile images of cows; by applying the YOLO object detector with a convolutional neural network (CNN) classifier, they achieved 96.65% accuracy in distinguishing individual dairy cows from such images. Phyo et al. (2018) [11] similarly obtained 96.3% identification accuracy using a neural network on top-down images of cows taken at a milking station. These studies underscore that deep learning models can recognize individual animals given clear images of each cow. Other efforts have combined identification with location tracking: for instance, Zin et al. (2020) [12] installed a camera aimed at a feeding area to capture multiple cows’ faces while eating. They first detected cow heads with a YOLO model, then read the ID numbers on each cow’s ear tag using a specialized CNN, thereby determining which cow was at each feeder station. Their system reported 100% success in detecting cow heads and 92.5% accuracy in recognizing the ear tag numbers, effectively tracking individual feeding locations.
While promising, existing vision-based approaches often have practical limitations. Many require images of one cow at a time or a small group under controlled angles. In a typical free-stall barn, however, dozens of cows roam and intermingle freely, making it difficult to consistently obtain clear views of each individual without significant infrastructure or manual effort. Fixed cameras only monitor specific zones, so an animal is observed only when it enters that camera’s field of view. Monitoring a single cow’s movement across the entire barn with fixed cameras would require installing many units to cover all areas, which is costly and logistically challenging. These constraints motivate a more flexible and cost-effective solution.
In this paper, we propose an IoT-enabled, contactless monitoring system that overcomes the above challenges by using a single Pan–Tilt–Zoom (PTZ) camera to cover a wide area. A PTZ camera can rotate horizontally (pan), vertically (tilt), and zoom in or out, allowing one device to surveil the entire barn from various angles. We pair the camera with a state-of-the-art YOLOv8 object detection model to perform real-time identification of a target cow within the herd. By automatically controlling the PTZ camera’s orientation and logging the camera’s viewing angle as metadata, our system knows which barn area each image represents. This enables it to estimate the target cow’s location in the barn for each detection and to aggregate the cow’s area-wise staying frequency over time.
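Because the camera’s viewing angle is logged as metadata with each image, the barn area an image represents can be recovered by a simple lookup. The sketch below illustrates one way such an angle-to-zone mapping could work; the zone boundaries, zone names, and function name are illustrative assumptions, not values from this study.

```python
# Hypothetical mapping from a logged PTZ pan angle (degrees) to a barn
# zone label. Boundaries and zone names are illustrative assumptions.
ZONE_BOUNDARIES = [
    (-90, -30, "feeding area"),
    (-30, 30, "resting stalls"),
    (30, 90, "water trough"),
]

def angle_to_zone(pan_deg: float) -> str:
    """Return the barn zone the camera covers at this pan angle."""
    for lo, hi, name in ZONE_BOUNDARIES:
        if lo <= pan_deg < hi:
            return name
    return "out of coverage"
```

With such a table, every positive detection can be stamped with a zone label at capture time, without any image-based localization.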
The key contribution of this work is a low-cost, vision-based proof-of-concept system demonstrating that a single network-connected PTZ camera can perform targeted cow tracking—identifying and following one specific individual within a herd—using only image data. This binary formulation (“target cow” vs. “not target”) serves as a preliminary step toward scalable multi-cow identification frameworks: individual identification is foundational for many herd management tasks, and the approach can be extended to multiple cows in future work. Our results demonstrate that the system can reliably monitor the target cow’s presence in different zones of the barn, which could be used to infer behavioral patterns or health issues. By covering the entire barn with one device, we obtain more comprehensive data on the cow’s activity than methods confined to a small area.
In typical dairy barn monitoring setups, complete spatial coverage using fixed cameras often requires between four and six units positioned at different angles to avoid blind spots, depending on barn size and layout. Each camera unit (hardware, installation, cabling, and maintenance) can cost approximately USD 300–500, leading to a total installation cost exceeding USD 2000 for a single barn. In contrast, our system achieves comparable spatial coverage using one PTZ camera costing roughly USD 400–600, with only a single network and power connection. This quantitative contrast highlights the potential of the proposed PTZ-based approach to significantly reduce both equipment and installation costs while maintaining comprehensive visual coverage.
While many studies have investigated cow identification or zone-specific monitoring using fixed cameras, there remains a clear research gap in developing a scalable, low-cost, multi-zone monitoring system that can cover the entire barn using only a single vision device. Existing systems typically require multiple fixed cameras or complex infrastructure, limiting their affordability and scalability for small- and medium-scale farms. Therefore, this study aims to address this gap by proposing an IoT-based PTZ camera framework that provides continuous coverage across multiple barn zones while maintaining minimal equipment cost and setup complexity.
The rest of this article is organized as follows. Section 2 (Related Work) reviews prior ICT applications in cattle management, contrasting contact-based and vision-based methods and positioning our approach among them. Section 3 (Methodology) describes the proposed system architecture, including the PTZ camera setup, data annotation, and YOLOv8-based identification model. Section 4 (Experimental Setup) details the implementation environment, dataset preparation, and evaluation metrics used. Section 5 (Results) presents experimental results for cow identification accuracy and location tracking performance, with comparative analysis against ground truth. Section 6 (Discussion) interprets the results, discusses the system’s practicality, and outlines limitations and future improvements. Finally, Section 7 (Conclusions) summarizes the findings and the contribution to smart agriculture IoT systems.
2. Related Work
Early adoption of ICT in cattle management has been dominated by wearable sensor systems. For example, RFID-based identification allows automatic logging of individual cow data such as weight, feed intake, and milk production by scanning an ear tag or collar sensor. Other studies attach accelerometers or pedometers to cows to monitor activity levels, feeding and drinking behavior, or even signs of lameness. Adrion et al. (2020) [14] developed an ultra-high-frequency RFID setup to track dairy cow feeding behavior, demonstrating the feasibility of monitoring individual visits to feed troughs via ear tags. While effective, these contact-type solutions share common drawbacks of cost and maintenance: each animal needs a device and regular upkeep (battery changes and repairs), and the initial deployment is expensive for large herds. There are also animal welfare considerations, as frequent close contact to attach or adjust devices can induce stress and affect natural behavior. These issues drive the exploration of non-contact methods that can passively observe animals without physical intervention.
Advancements in computer vision and deep learning have enabled visual identification of livestock using standard cameras [15,16]. As mentioned, Shen et al. [10] used convolutional neural networks to identify individual Holstein cows from barn images and achieved high accuracy (over 96%). Their approach underlined that coat patterns or physical features can distinguish one cow from another when the cow’s body is clearly visible. Phyo et al. [11] extended this concept to top-view images at milking stations, indicating that identification is possible even from partial views like a cow’s back. In addition to direct identification, vision systems have targeted related tasks such as tracking cow locations within facilities. Zin et al. [12] developed an automatic cow tracking system utilizing cameras at feed bunks: their method detected cow heads and recognized ear tag numbers to log which cow occupied each feeding area. This approach effectively integrated identification with spatial monitoring, providing farmers with real-time knowledge of feeding activity (with reported accuracies of 100% for detecting presence and 92.5% for reading tags).
Additionally, recent studies have explored multi-target tracking (MTT) techniques to simultaneously follow several animals in dynamic environments. DeepSORT and ByteTrack algorithms have been employed in livestock video analysis to maintain unique IDs across frames using motion and appearance features. Qiao et al. [17] proposed a unified architecture combining YOLO detection with deep re-identification for cattle, achieving over 90% tracking accuracy in controlled settings.
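The core idea behind such trackers—associating each new detection with an existing track so that IDs persist across frames—can be illustrated with a minimal sketch. The code below is a didactic simplification using only bounding-box overlap (IoU), not DeepSORT or ByteTrack themselves, which additionally use motion models and appearance embeddings; all names are our own.

```python
# Minimal IoU-based tracker sketch: assign each detection in a frame to the
# existing track with the highest bounding-box overlap, or start a new track.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0]) * (a[3]-a[1]) + (b[2]-b[0]) * (b[3]-b[1]) - inter
    return inter / union if union else 0.0

class SimpleTracker:
    def __init__(self, iou_thresh=0.3):
        self.tracks = {}       # track_id -> last seen box
        self.next_id = 0
        self.iou_thresh = iou_thresh

    def update(self, detections):
        """Assign each detection a persistent ID; returns [(id, box), ...]."""
        assigned, used = [], set()
        for box in detections:
            best_id, best_iou = None, self.iou_thresh
            for tid, prev in self.tracks.items():
                if tid in used:
                    continue
                overlap = iou(box, prev)
                if overlap > best_iou:
                    best_id, best_iou = tid, overlap
            if best_id is None:        # no sufficient overlap: new track
                best_id = self.next_id
                self.next_id += 1
            self.tracks[best_id] = box
            used.add(best_id)
            assigned.append((best_id, box))
        return assigned
```

Production trackers replace this greedy matching with Hungarian assignment and add Kalman-filter motion prediction and re-ID features, but the data flow is the same.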
However, most vision-based studies to date have limitations in scope or scalability. They often consider scenarios with a limited field of view or assume that the cow of interest is well-separated from others in the image. In practice, obtaining high-quality images of every individual in a group-housed setting is challenging. Cows can occlude each other or move unpredictably, and lighting or barn structures (e.g., stalls and fences) may obstruct views. Systems that rely on a fixed camera observing a fixed location (such as a water trough or a single walkway) only capture data when the target animal happens to visit that spot. This yields an incomplete picture of the animal’s overall activity. Covering the entire barn with enough fixed cameras to catch all movements would require a large number of devices installed at different angles (e.g., several cameras to cover one barn), which is impractical and costly for farms. Each additional camera increases installation and maintenance burden and generates more data to manage.
Our work differentiates itself by using a single moving camera to monitor a large area. A PTZ camera can be programmed to periodically scan across different sections of the barn, effectively acting as multiple cameras in one, as shown in Figure 1. Prior research has not extensively explored PTZ cameras for cattle monitoring, even though they offer a clear advantage in coverage flexibility. By capturing images from various angles and locations using one device, we reduce the infrastructure needs. Our approach also accepts that barn images will contain multiple cows simultaneously (reflecting reality) and focuses on robustly picking out the target individual among them. This is more challenging than scenarios where cows are imaged one at a time, but it increases the system’s practicality and applicability to real barns.
Importantly, our system performs not only identification but also continuous tracking of the cow’s position over time throughout the barn. In contrast to earlier systems that monitor a single behavior (like feeding at a station), we obtain a broader view of the cow’s daily activities by logging which zone of the barn it stays in and how often. This information can feed into higher-level analyses such as detecting changes in routine or early illness indicators. By using a networked camera and automated analytics, the system aligns well with IoT frameworks: data can be transmitted to cloud services or farm management software, enabling remote supervision and data-driven decision-making in smart agriculture.
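Aggregating the zone log into area-wise staying frequencies is straightforward once each positive detection carries a zone label. The sketch below shows one plausible form of that aggregation; the log schema and function name are illustrative assumptions, not the paper’s implementation.

```python
# Sketch: turn a detection log into area-wise staying frequencies.
# Each entry is (timestamp, zone) for a frame where the target cow was
# identified; the schema is an illustrative assumption.
from collections import Counter

def staying_frequency(detections):
    """Fraction of positive detections falling in each barn zone."""
    counts = Counter(zone for _, zone in detections)
    total = sum(counts.values())
    return {zone: n / total for zone, n in counts.items()}

log = [(0, "feeding"), (3, "feeding"), (6, "resting"), (9, "feeding")]
freq = staying_frequency(log)  # {"feeding": 0.75, "resting": 0.25}
```

Because the denominator is the number of detections rather than elapsed time, frequent misses in one zone would bias the estimate; in practice the PTZ scan gives each zone repeated, roughly equal observation opportunities.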
Overall, the proposed system builds upon the strengths of vision-based identification while addressing their limitations through a more dynamic imaging strategy. It aims to deliver a practical, scalable solution that farmers could deploy with minimal equipment—just one camera and an internet connection—to continuously monitor individual animals. The following sections detail the system design and demonstrate its effectiveness in a real-world barn setting.
6. Discussion
The experimental results indicate that our approach is viable for real-world smart farming applications. By using a PTZ camera and a deep learning model, we demonstrated continuous, contactless monitoring of individual animals. The system dramatically reduces the equipment needed to monitor cow behavior. Instead of outfitting each cow with a sensor or installing a network of cameras, a farmer can deploy a single PTZ camera to cover a large area. As analyzed in the related work, covering the same barn with fixed cameras might require 4–6 units from different angles, incurring high costs in purchase, installation, and maintenance. Our solution uses one device to achieve comparable coverage. Moreover, since the data needed is just images, the infrastructure is simplified—images can be sent over Wi-Fi or wired network to a central system, aligning with IoT architecture where each camera is a smart sensor node.
The contactless nature of our system means the cow does not need to wear any equipment. This eliminates the stress and potential injury that can occur when attaching devices to animals. It also reduces labor for farm staff, as they do not have to routinely check or fix sensors on the cows. The cows in our test barn were unaware of the monitoring process, continuing their normal routine. This suggests that behavior data collected (like area preferences, resting times, feeding times) are natural and not influenced by the monitoring method, which is ideal for animal welfare and for the validity of the observations.
Through our evaluation, we found that prioritizing Precision was essential for reliable monitoring. In many IoT sensing scenarios (e.g., health alerts), missing a few events might be acceptable, but false alarms can be problematic. In our context, if the system falsely identifies the cow in a zone where it is not, it could lead to incorrect conclusions (for example, thinking the cow visited the water trough when it did not). By tuning the model to be conservative (high confidence threshold), we ensured that when a detection is recorded, it is highly likely to be correct. The trade-off is that the system might momentarily lose track of the cow (e.g., if it does not detect the cow for a few seconds), but as discussed, the PTZ scanning mitigates this by giving repeated opportunities to catch the cow on subsequent passes. In practice, the cow’s movement speed is slow enough that missing one frame is not critical—it will still be in roughly the same area in the next frame or two.
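The Precision/Recall trade-off described above can be made concrete with a small worked example: raising the confidence threshold discards low-confidence detections, which tends to raise Precision at the cost of Recall. The scores and labels below are made up for illustration; only the metric definitions are standard.

```python
# Sketch of threshold tuning: compute Precision and Recall at a given
# confidence threshold over a list of (confidence, ground_truth) pairs.
def precision_recall(preds, threshold):
    """preds: list of (confidence, target_truly_present) tuples."""
    tp = sum(1 for conf, truth in preds if conf >= threshold and truth)
    fp = sum(1 for conf, truth in preds if conf >= threshold and not truth)
    fn = sum(1 for conf, truth in preds if conf < threshold and truth)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up detections: three true positives, two false candidates.
preds = [(0.95, True), (0.90, True), (0.60, False), (0.55, True), (0.40, False)]
p_low, r_low = precision_recall(preds, 0.5)    # permissive threshold
p_high, r_high = precision_recall(preds, 0.8)  # conservative threshold
```

Here the conservative threshold accepts only the two highest-confidence detections, eliminating the false alarm (Precision rises to 1.0) while dropping one genuine sighting (Recall falls), mirroring the behavior our tuning favored.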
The system we built can be considered an IoT node in a larger smart farm ecosystem. The PTZ camera with edge computing (Raspberry Pi) could run the detection model locally or send the images to a cloud service for processing. In our experiments, images were processed after collection, but an online deployment could use edge AI hardware to run YOLOv8 in real-time and stream results. The output—identified cow and location—can be transmitted to farm management software. For example, if the system is monitoring a cow that needs special attention (perhaps one that is sick or in estrus), alerts could be generated if the cow does not show up at the feeder for a certain period or if it remains in an unusual location. The data collected (stay frequencies) can feed into analytics for space utilization in barns, helping optimize barn design or stocking density by understanding how cows use the space.
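An alert rule of the kind sketched above (flagging a cow that has not visited the feeder for some period) reduces to a query over the same detection log. The window length, zone names, and function below are illustrative assumptions, not part of the deployed system.

```python
# Hypothetical absence-alert rule: fire if the target cow has not been
# detected in a given zone within the last `window_s` seconds.
def absence_alert(detections, zone, now, window_s=3600):
    """detections: list of (timestamp, zone). True if no recent sighting."""
    recent = [t for t, z in detections if z == zone and now - t <= window_s]
    return len(recent) == 0

log = [(100, "feeder"), (5000, "resting")]
# Last feeder visit was 5900 s ago, outside the 3600 s window -> alert.
needs_attention = absence_alert(log, "feeder", now=6000, window_s=3600)
```

Such a rule could run on the edge device or in farm management software, pushing a notification only when the condition holds rather than streaming raw detections.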
In terms of computational performance, inference using the fine-tuned YOLOv8x model on a GPU (NVIDIA T4, Google Colab) required approximately 35–40 ms per 640 × 640 image, corresponding to a processing speed of roughly 25 frames per second. On a Raspberry Pi 4 (8 GB RAM) without hardware acceleration, inference latency increased to about 1.2 s per frame. This implies that real-time performance is achievable with embedded AI accelerators such as NVIDIA Jetson Nano/Orin or Coral TPU modules. Since the PTZ camera captures one image every 3 s during rotation, the current system already meets the temporal requirement for sequential monitoring. Future deployment will incorporate on-device inference to minimize transmission delay and enable instant visualization of detections on the farm network.
Despite the positive results, our approach has some limitations. The system was evaluated for a single target cow. Scaling to identify and track all individuals in a herd would require either training multiple models (one per cow) or training one multi-class model that recognizes each cow as a distinct class. The former does not scale well as the number of cows grows, while the latter would require substantially more labeled data (each cow must be labeled, and model complexity increases). Alternatively, multi-object tracking or re-identification (re-ID) techniques could be integrated: the detection model could find all cows in an image, and a secondary re-ID network could then distinguish which one is the target or assign IDs to each. Our work focused on proving the concept for one cow; future work will explore multi-cow scenarios.
Another limitation is that YOLOv8, while powerful, can sometimes be confused by similar-looking animals or certain poses. For instance, if two cows with similar coat patterns stand close, the model might identify the wrong one. Ensuring a diverse training set (images of the target cow from different angles, in groups, at different positions) helped mitigate this. In practice, farmers might choose target cows that have distinctive visual features if they want to deploy such a system (e.g., a cow with unique markings could be easier to track). Otherwise, additional markers (like a colored collar visible to cameras) could be a practical compromise—still much simpler than an active sensor, just a visual aid for the algorithm.
The barn environment presents challenges such as variable lighting (day vs. night) and occlusions (other cows, posts, feeding racks). Our experiments were conducted in daytime; at night, additional lighting or IR-capable cameras would be needed, and the model might require adaptation to different light conditions. Occlusions remain a challenge: if the target cow is completely obscured behind others or lies in a hard-to-see corner, no vision system can identify it. The PTZ camera’s flexibility in angles can reduce occlusions (it can capture from different sides) but not eliminate them. In the future, combining this with a second PTZ camera on the opposite side could ensure that any cow hidden from one is visible to the other, still with fewer cameras than a fully fixed rig.

It is insightful to compare our results with a hypothetical multi-camera setup. The identification accuracy (Precision 86%; Recall 68%) is likely on par with what a fixed camera might achieve if it had a continuous view of the target, since it is largely a function of the model and image clarity. The Recall loss in our case partly comes from the cow sometimes not being in view (the camera is looking elsewhere). A fixed array of cameras might catch the cow more consistently (higher Recall) but at a much higher cost. Our single-camera Recall of 68% was sufficient to accurately gauge location usage, as evidenced by the match to ground-truth frequencies. This indicates a diminishing-returns scenario: doubling or tripling the number of cameras might yield higher raw Recall, but the improvement in practical insight (such as knowing where the cow spends time) might be marginal, not justifying the expense.
Because the proposed system relies on continuous network connectivity between the PTZ camera, local server, and potential cloud services, data security is a critical concern. All image streams should be transmitted through encrypted protocols (e.g., HTTPS or secure RTSP) and stored within protected local servers. Although the system captures only barn scenes without human subjects, compliance with farm data-management policies and national privacy regulations remains essential. Future implementations will integrate lightweight encryption and edge-storage options to ensure that sensitive operational data (such as animal IDs or farm layout) are safeguarded while maintaining system responsiveness.
Overall, this discussion underscores that the PTZ + YOLOv8 approach is a practical alternative for farms that cannot invest in extensive hardware. It brings IoT and AI to the barn in an accessible way. There are certainly scenarios where a hybrid approach (some critical points covered by fixed cameras, plus a PTZ for general coverage) could be optimal. The system could also be extended beyond identification: for example, once the cow is detected, additional analyses like body posture detection or gait assessment could be done on the image to infer health conditions. The modular nature of our pipeline allows for such extensions.
Additionally, although the system demonstrated satisfactory Precision (85.96%) and practical tracking capability, several limitations should be acknowledged. First, the current implementation focuses on a single target cow and daytime conditions; performance under low-light or crowded scenarios requires further validation. Second, the Recall (68%) indicates that occasional misses occurred when the cow was partially occluded or outside the current camera view. In comparison, fixed multi-camera systems reported Recalls above 85% (e.g., Zin et al. [12]; Shen et al. [10]), though at a significantly higher infrastructure cost. Our single-camera design sacrifices some temporal continuity for broader spatial coverage, reducing hardware by approximately 80%. Finally, the limited training dataset (510 images) restricts generalization; larger, multi-farm datasets will be collected to evaluate model robustness across environments. Despite these constraints, the achieved balance between cost, accuracy, and simplicity positions the PTZ + YOLOv8 framework as a practical foundation for scalable precision livestock monitoring.
In the next section, we conclude the paper by summarizing the achievements and outlining future work directions, including how this system can be expanded for broader use in precision livestock farming.