Deep Learning Classification of Canine Behavior Using a Single Collar-Mounted Accelerometer: Real-World Validation

Simple Summary
Collar-mounted activity monitors using battery-powered accelerometers can continuously and accurately analyze specific canine behaviors and activity levels. These include normal behaviors and those that are indicators of disease conditions such as scratching, inappetence, excessive weight, or osteoarthritis. Algorithms used to analyze activity data are validated by video recordings of specific canine behaviors, which were used to label accelerometer data. The study described here was noteworthy for the large volume of data collected from more than 2500 dogs in clinical and real-world home settings. The accelerometer data were analyzed by a machine learning methodology, whereby algorithms were continually updated as additional data were acquired. The study determined that algorithms from the accelerometer data detected eating and drinking behaviors with a high degree of accuracy. Accurate detection of other behaviors such as licking, petting, rubbing, scratching, and sniffing was also demonstrated. The study confirmed that activity monitors using validated algorithms can accurately detect important health-related canine behaviors via a collar-mounted accelerometer. The validated algorithms have widespread practical benefits when used in commercially available canine activity monitors.

Abstract
Collar-mounted canine activity monitors can use accelerometer data to estimate dog activity levels, step counts, and distance traveled. With recent advances in machine learning and embedded computing, much more nuanced and accurate behavior classification has become possible, giving these affordable consumer devices the potential to improve the efficiency and effectiveness of pet healthcare. Here, we describe a novel deep learning algorithm that classifies dog behavior at sub-second resolution using commercial pet activity monitors.
We built machine learning training databases from more than 5000 videos of more than 2500 dogs and ran the algorithms in production on more than 11 million days of device data. We then surveyed project participants representing 10,550 dogs, which provided 163,110 event responses to validate real-world detection of eating and drinking behavior. The resultant algorithm displayed a sensitivity and specificity for detecting drinking behavior (0.949 and 0.999, respectively) and eating behavior (0.988, 0.983). We also demonstrated detection of licking (0.772, 0.990), petting (0.305, 0.991), rubbing (0.729, 0.996), scratching (0.870, 0.997), and sniffing (0.610, 0.968). We show that the devices’ position on the collar had no measurable impact on performance. In production, users reported a true positive rate of 95.3% for eating (among 1514 users), and of 94.9% for drinking (among 1491 users). The study demonstrates the accurate detection of important health-related canine behaviors using a collar-mounted accelerometer. We trained and validated our algorithms on a large and realistic training dataset, and we assessed and confirmed accuracy in production via user validation.


Introduction
Much as recent progress in smartwatches has enabled new telehealth applications [1][2][3][4], recent progress in internet-connected pet wearables, such as collar-mounted activity monitors, has prompted interest in using these devices to improve the cost and efficacy of veterinary care [5]. Just as with smartwatches in human telehealth, accelerometer-based activity monitors have emerged as an inexpensive, low-power, and information-rich approach to pet health monitoring [6][7][8].
Accelerometer-based pet activity monitors analyze the moment-to-moment movement measured by a battery-powered accelerometer. They are typically attached to the pet via a collar, though attachment methods may be more elaborate in research settings. Using the device's accelerometer signal (sometimes in combination with gyroscope, magnetometer, GPS, or other sensor signals), collar-mounted activity monitors can accurately estimate pet activity levels [9][10][11][12][13][14], step count, and distance traveled [12].
In recent years, advances in machine learning have allowed pet activity monitors to move beyond estimating aggregate activity amounts to detecting when and for how long a pet performs common activities such as walking, running, lying down, or resting [15][16][17]. These biometric capabilities have progressed to include increasingly specific and varied activities such as drinking, eating, scratching, and head-shaking [8,16,18-21].
The benefits of accurate and quantitative behavior detection in pet health are extensive. Pet activity monitors have been shown to be useful in the detection and diagnosis of pruritus [22,23] and in potential early prediction of obesity [24]. They have also been used in monitoring response to treatments such as chemotherapy [25]. Furthermore, statistical analysis of activity and behavior monitoring on large numbers of pets can be an expedient approach to medical and demographic studies [24].
Although several studies have demonstrated and measured the accuracy of activity recognition algorithms [8,16,18-21], the datasets used to train and evaluate the algorithms are typically not representative of the broad range of challenging environments in which commercial pet activity monitors must function. For instance, most existing studies use exclusively healthy dogs and are often run in controlled environments that promote well-defined and easily detectable behaviors with a low risk of confounding activities.
Unfortunately, real-world algorithm performance often lags far behind the performance measured in controlled environments [26,27]. For instance, existing studies typically ensure careful installation of the device in a specific position on a properly adjusted collar. In real-world usage, collars vary in tightness and often rotate to arbitrary positions unless the activity monitor device is very heavy. Collar rotation and tightness [28], as well as the use of collar-attached leashes [29], can compromise performance. In our experience, confounding activities like riding in a car or playing with other pets can produce anomalous results if not adequately represented in training datasets. Finally, some studies use multiple accelerometers or harness-mounted devices [30], which limit applicability in many consumer settings.
The work described here was performed as part of the Pet Insight (PI) Project [31], a large pet health study to enable commercial pet activity monitors to better measure and predict changes in a pet's health by:
• Sourcing training data from project participants and external collaborators to build machine learning training databases and behavior detection models such as those described in this work.
• Combining activity data, electronic medical records, and feedback from more than 69,000 devices distributed to participants over 2-3 years to develop and validate proactive health tools.
• Using the resulting datasets, currently covering over 11 million days in dogs' lives, to enable insights that support pet wellness and improve veterinary care.
This work presents the results of the PI Project's efforts to develop and validate these behavior classification models [32], including an evaluation of model performance in a real-world context and addressing limitations from controlled research settings such as device fit and orientation.

Activity Monitor
Data were collected primarily via a lightweight canine activity monitor (Figure 1, Whistle FIT®, Mars Petcare, McLean, VA, USA), which was designed and produced specifically for this study. Smaller amounts of data were collected via the commercially available Whistle 3® and Whistle GO® canine activity monitors. All three devices used the same accelerometer. Unlike the Whistle FIT®, these latter devices are furnished with GPS receivers and cellular radios. However, in all cases, the behavior classification in this study is performed using only the output of the devices' 3-axis accelerometers.
Figure 1. Activity monitors used in this study. Most data in this study were acquired from Whistle FIT® activity monitors. Device dimensions are shown in (a), and a device in use is shown in (b). The device often rotates to different positions around each dog's collar. The device can attach to most dog collars up to 1" (25 mm). Attachment detail is shown in (c). The two other devices used in this study (the Whistle 3® and the Whistle GO®) are larger and heavier.

Accelerometry Data Collection
All monitoring devices acquired accelerometry data and uploaded it according to their usual operation. That is, the devices acquired 25-50 Hz 3-axis accelerometry data for at least several seconds whenever significant movement was detected. Data were compressed and annotated with timing data using a proprietary algorithm. Data were temporarily stored on-device and then uploaded at regular intervals when the devices were in Wi-Fi range. Uploads were processed, cataloged, and stored in cloud-hosted database services by Whistle servers. The compressed accelerometry data were retrieved on demand from the cloud database services in order to create the training, validation, and testing databases used in this study.

Animal Behavior Data Collection
Animal behavior data were collected (summarized in Table 1 and further described elsewhere in this report) and used to create two datasets used in model training and evaluation:
• Crowd-sourced (crowd) dataset. This dataset contained both (a) long (multi-hour) in-clinic recordings and (b) shorter recordings submitted by project participants. This large and diverse dataset was meant to reflect real-world usage as accurately as possible.
• Eating and drinking (eat/drink) dataset. This dataset consisted of research-grade sensor and video data collected using a protocol designed to represent EAT and DRINK behaviors. Other observed behaviors were incidental.
For brevity, we refer to these datasets simply as the crowd and eat/drink datasets.


Eat/Drink Study Protocol
This study was conducted using dogs owned by the WALTHAM Petcare Science Institute and housed in accordance with conditions stipulated under the UK Animals (Scientific Procedures) Act 1986. Briefly, the dogs were pair housed in environmentally enriched kennels designed to provide dogs free access to a temperature-controlled interior and an external pen at ambient temperature. Dogs were provided with sleeping platforms at night. The dogs had access to environmentally enriched paddocks for group socialization and received lead walks and off-lead exercise opportunities during the day. Water was freely available at all times and dogs were fed to maintain an ideal body condition score. The study was approved by the WALTHAM Animal Welfare and Ethical Review Body. One hundred and thirty-eight dogs across 5 different breeds (72 Labrador Retrievers, 18 Beagles, 17 Petit Basset Griffon Vendeens, 14 Norfolk Terriers and 17 Yorkshire Terriers) took part for two consecutive days each. Each dog was recorded once a day during its normal eating and drinking routine using a GoPro camera (GoPro, San Mateo, CA, USA).
In this study, either one (ventral only) or four (ventral, dorsal, left, and right) activity monitors were affixed to a collar. For each observation, the collar was removed from the dog, the correct number of activity monitors was attached, and the collar was then shaken sharply in view of the camera to provide a synchronization point that was identifiable in both the video and accelerometer signals (so that any time offset could be removed). The collar was then placed on the dog at a standardized tightness. The dogs were recorded from approximately one minute before feeding until approximately one minute after feeding. In order to increase the diversity of the dataset, collar tightness was varied between a two-finger gap and a four-finger gap, and food bowls were rotated between normal bowls and slow-feeder or puzzle-feeder bowls. For each data recording, researchers noted the date and time, device serial number(s), collar tightness, food amount and type, and various dog demographic data.

Crowd-Sourcing Protocol
Pet Insight participants were requested to use smartphones to video record their pets performing everyday activities while wearing activity monitors. The participants were told that the activity monitor should be worn on the collar but were not given any other instructions about how the collar or monitor should be worn. Participants were asked to prioritize recording health-related behaviors like scratching or vomiting, but to never induce these events and to never delay treatment in order to record the behaviors. As a participation incentive, for every crowd-sourced video used, the PI project donated one dollar to a pet-related charity.
After recording each video, participants logged into the PI crowd-sourcing website, provided informed consent, uploaded the recorded video, and completed a short questionnaire confirming which pet was recorded and whether certain behaviors were observed. The device automatically uploaded its accelerometry data to Whistle servers.

In-Clinic Observational Protocol
This study was conducted at several Banfield Pet Hospital (BPH) clinics. Its objective was to acquire long-duration (multi-hour) naturalistic recordings to augment the shorter crowd-sourced recordings, which were typically several minutes or less in duration.
Randomly selected BPH clients who chose to participate signed fully informed consent forms. Their dogs were outfitted with Velcro breakaway collars with one attached activity monitor device each. Collar tightness and orientation were not carefully controlled. Video was recorded via a 4-channel closed-circuit 720p digital video security system. Video cameras were ceiling- or wall-mounted and oriented towards the in-clinic kennels so that up to four dogs could be observed at a time. For each recording, researchers noted the date and time, the device serial number, and the dog/patient ID number.

Video Labeling
All uploaded videos were transcoded into a common format (H.264-encoded, 720p resolution, and up to 1.6 Mb/s) using Amazon's managed Elastic Transcoder service, and their audio was stripped for privacy. Video start times were extracted from the video metadata and video filenames. Matching device accelerometry data were downloaded from Whistle's databases, and automatic quality checks were performed.
Videos were then labeled by trained contractors using the open-source BORIS software application (Behavioral Observation Research Interactive Software, v. 7.9.8) [33]. The resulting event labels were imported and quality-checked using custom Python scripts running on one of the PI project's cloud-based web servers. Labels were stored alongside video and participant metadata in a PostgreSQL database.
All video labeling contractors were trained using a standardized training protocol, and inter-rater reliability analyses were performed during training to ensure consistent labeling. Videos were labeled according to a project ethogram [8,15,20]. This report describes several of these label categories.
Labelers divided each video into valid and invalid regions. Regions were only valid if the dog was clearly wearing an activity monitor and was fully and clearly visible in the video. Invalid regions were subsequently ignored. In each valid video region, the labeler recorded exactly one posture, and any number (0 or more) of applicable behaviors.
Postures (Table 2) reflect the approximate position and energy expenditure level of the pet, while behaviors (Table 3) characterize the pet's dominant behavior or activity in a given moment. For instance, during a meal, a dog might exhibit a STAND posture and an EAT behavior. While pausing afterwards, the same dog might exhibit a STAND posture and no behavior. Multiple simultaneous behaviors are rare but possible, such as simultaneous SCRATCH and SHAKE behaviors.

Table 2. Postures Ethogram.
Posture | Description
[…] | Purposeful walking from one point to another.
VIGOROUS | Catch-all for high-energy activities such as running, swimming, and playing.
MIXED | Default category for any other posture, for ambiguous postures, and for postures that are difficult to label due to rapid changes.

Table 3. Behaviors Ethogram.
Behavior | Description
DRINK | Drinking water.
EAT | Eating food, as in out of a bowl. Does not include chewing bones or toys.
LICKOBJECT | Licking an object other than self, such as a person or empty bowl.
LICKSELF | Self-licking, often due to pain, soreness, pruritus, or trying to clear a foreign object.
PETTING | Being petted by a human.
RUBBING | Rubbing face or body on an object or person due to pruritus.
SCRATCH | Scratching of the body, neck, or head with a hind leg.
SHAKE | Shaking head and body, as in when wet. Does not include head-shaking that is clearly due to ear discomfort, which is labeled separately and has not been included in this report.
SNIFF | Sniffing the ground, the air, a person, or other pet.
NONE | 'Default' class indicating that no labeled behavior is happening.

Training Data Preparation
Although accelerometer data and smartphone video data were both time-stamped using the devices' network-connected clocks, inaccuracies led to alignment errors of typically several seconds, and sometimes much longer. Short activities such as SHAKE, in particular, require more accurate alignment. We aligned approximately 1200 videos manually by matching peaks in accelerometer activity to labels for high-intensity behaviors like SHAKE and SCRATCH. We used these manual alignments to develop and validate an automatic alignment algorithm that aligned the remaining videos.
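The alignment algorithm itself is not described here, so the following is only a minimal sketch of the general idea: search for the time shift at which movement intensity in the accelerometer signal best overlaps the labeled high-intensity behaviors. The function name `estimate_offset`, the per-sample energy input, and the brute-force lag search are all illustrative assumptions, not the authors' method.

```python
def estimate_offset(accel_energy, label_active, max_lag):
    """Return the lag (in samples) that best aligns labels with accelerometer energy.

    accel_energy: per-sample movement intensity (assumed precomputed from the 3-axis signal).
    label_active: per-sample 0/1 flags, 1 where a high-intensity label (e.g. SHAKE) is active.
    A positive lag means the accelerometer activity occurs later than the labels.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, active in enumerate(label_active):
            j = i + lag
            if active and 0 <= j < len(accel_energy):
                score += accel_energy[j]
        # Strict comparison keeps the first (most negative) lag on ties.
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example: the labeled burst is at samples 9-11, the movement at samples 12-14.
energy = [0.0] * 20
for k in (12, 13, 14):
    energy[k] = 1.0
labels = [0] * 20
for k in (9, 10, 11):
    labels[k] = 1
lag = estimate_offset(energy, labels, max_lag=5)
```

In this toy case the labels would need to be shifted by `lag` samples to line up with the accelerometer burst; a production approach would likely need to handle noisy signals and ambiguous peaks, which this sketch does not.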
We created each of the two training datasets (crowd and eat/drink) by:
1. Selecting appropriate videos from our database.
2. Limiting the number of entries per dog to 30 (some dogs are overrepresented in our database).
3. Allocating all of each dog's data into one of 5 disjoint cross-validation folds.
4. Downloading each dataset and labeling each time-point with a posture and/or behavior(s).
The specific method of separating data into cross-validation folds (step 3 above) is critical [34]. Classifiers trained on individual dogs have been shown to over-perform on those dogs relative to others, even if those classifiers are trained and evaluated using separate experimental observations. Gerencsér et al. experienced an accuracy reduction from 91% for a single-subject classifier to 70-74% when generalizing to other dogs [35]. Consequently, we were careful to ensure that all of a dog's videos fall in a single fold, so that data from a single dog is never used to both train and evaluate a classifier.
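The per-dog cap and fold allocation described above can be sketched as follows. This is an illustration under stated assumptions: the function name `assign_folds`, the `(dog_id, video_id)` input format, and the use of a hash of the dog identifier to pick a fold are hypothetical choices, and the source does not say how dogs were assigned to folds.

```python
import hashlib
from collections import defaultdict


def assign_folds(entries, n_folds=5, max_per_dog=30):
    """Map each kept video to a fold so that every dog lives in exactly one fold.

    entries: iterable of (dog_id, video_id) pairs (hypothetical format).
    Returns a dict of video_id -> fold index in [0, n_folds).
    """
    by_dog = defaultdict(list)
    for dog_id, video_id in entries:
        by_dog[dog_id].append(video_id)

    folds = {}
    for dog_id, videos in by_dog.items():
        # Assumption: a stable hash of the dog ID gives a reproducible fold.
        digest = hashlib.sha256(str(dog_id).encode("utf-8")).hexdigest()
        fold = int(digest, 16) % n_folds
        # Cap the number of entries per dog to limit overrepresentation.
        for video_id in videos[:max_per_dog]:
            folds[video_id] = fold
    return folds


# Toy example: one overrepresented dog and one dog with two videos.
entries = [("dog_a", f"a{i}") for i in range(40)] + [("dog_b", "b0"), ("dog_b", "b1")]
folds = assign_folds(entries)
```

Because the fold is derived from the dog rather than from the video, no dog can appear on both sides of a train/test split, which is the property the paragraph above emphasizes.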
The overall data acquisition process, from video capture (red), accelerometer data (blue) to a completed dataset (purple), is shown in Figure 2.
Figure 2. Data acquisition flow. Dogs wearing collar-mounted activity monitors were video recorded performing behaviors of interest or performing everyday activities. Videos were uploaded and the behaviors exhibited in them were manually labeled (tagged). The devices automatically uploaded accelerometer (activity) data to cloud servers, and the device data were aligned with the video labels to remove any temporal offset. The aligned labels and accelerometer time series were combined into datasets suitable for training machine learning (ML) models.

Deep Learning Classifier
Our deep learning classifier is based on our FilterNet architecture [32]. We implemented the model in Python using PyTorch v1.0.1 [36] and the 2020.02 release of the Anaconda Python distribution (64-bit, Python 3.7.5). We trained and evaluated our models on p2.xlarge instances on Amazon Web Services [37] with 4 vCPUs (Intel Xeon E5-2686 v4), 61 GB RAM, and an NVIDIA Tesla K80 GPU with 12 GB RAM, running Ubuntu 18.04.4.
We used the crowd dataset for cross-validated training and evaluation (Figure 3). Specifically, we trained and evaluated five different models, using a different held-out fold as a test set for each model. We combined the models' predictions for each of the five test sets for model evaluation, as described below. We also generated behavior classifications for the eat/drink dataset using one of the models trained on the crowd dataset (that is, we did not use the eat/drink dataset for model training). There were no dogs in common between the crowd and eat/drink datasets, so cross-validation was not needed in this step.
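The cross-validation loop described above can be sketched as follows. This is not the FilterNet training code; `train_fn` stands in for any training routine, and the sample dictionary keys (`fold`, `x`, `label`) and the toy majority-class trainer are hypothetical.

```python
from collections import Counter


def cross_validated_predictions(samples, train_fn, n_folds=5):
    """Train one model per held-out fold and pool (label, prediction) pairs.

    samples: list of dicts with keys 'fold', 'x', and 'label' (hypothetical format).
    train_fn: callable taking training samples and returning a predict(x) function.
    """
    pooled = []
    for held_out in range(n_folds):
        train = [s for s in samples if s["fold"] != held_out]
        test = [s for s in samples if s["fold"] == held_out]
        predict = train_fn(train)
        # Each sample is predicted only by the model that never saw its fold.
        pooled.extend((s["label"], predict(s["x"])) for s in test)
    return pooled


def train_majority(train):
    """Toy stand-in for a real model: always predicts the most common training label."""
    most_common = Counter(s["label"] for s in train).most_common(1)[0][0]
    return lambda x: most_common


samples = [
    {"fold": i % 5, "x": i, "label": "EAT" if i % 2 == 0 else "NONE"}
    for i in range(10)
]
pooled = cross_validated_predictions(samples, train_majority)
```

Pooling the held-out predictions gives exactly one prediction per sample, so a single confusion matrix can be computed over the whole dataset.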

Figure 3. Model training and evaluation data flow. The crowd dataset consisted of naturalistic, highly diverse data divided by dog into five folds. The eat/drink dataset focused on high-quality eating and drinking data. Behavior classification models were trained and evaluated in a cross-validated fashion (where a given model i is trained on all folds of data except fold i) on the crowd dataset, and the first of these five models was also evaluated on the eat/drink dataset. Confusion matrices and classification metrics were produced for each dataset using the resulting predictions.

Evaluation
For evaluation, we modeled the task as two multi-class classification problems, one for behaviors and one for postures. At each time point in each video entry in a dataset we recorded the labeled behavior and posture, and every 320 ms we calculated the most likely predicted behavior and posture. We tallied the labeled and predicted pairs from all five test folds together using the PyCM multiclass confusion matrix library to create separate behavior and posture confusion matrices [38]. We used the PyCM package to calculate metrics derived from the confusion matrices [39].
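As an illustration of the tallying step, the sketch below counts pooled (labeled, predicted) pairs using only the Python standard library. It is a stand-in for the PyCM-based tally described above, not a reproduction of it, and the function names are hypothetical.

```python
from collections import Counter


def tally_confusion(pairs):
    """Count how often each (labeled, predicted) pair occurs across all test folds."""
    return Counter(pairs)


def as_matrix(counts, classes):
    """Arrange counts as rows of actual classes and columns of predicted classes."""
    return [[counts[(actual, predicted)] for predicted in classes] for actual in classes]


# Toy example with two classes.
pairs = [("EAT", "EAT"), ("EAT", "NONE"), ("NONE", "NONE"), ("NONE", "NONE")]
matrix = as_matrix(tally_confusion(pairs), ["EAT", "NONE"])
```

Keeping the tally as raw counts makes it straightforward to derive both normalized matrices and per-class metrics later.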
As the MIXED posture is used primarily for expediency in labeling, we dropped any time points with MIXED labels from the postures confusion matrix, and replaced any MIXED-class posture predictions with the next most likely prediction for that time point. We also excluded any time points with more than one simultaneous labeled behavior (about 3% of the data) from the behaviors confusion matrix.
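The MIXED handling for postures can be sketched as below, assuming the model exposes a score per posture class at each time point. The dictionary-of-scores format and the function names are hypothetical; only the class names STAND, VIGOROUS, and MIXED come from the ethogram.

```python
def best_posture(scores, excluded=("MIXED",)):
    """Return the highest-scoring posture, skipping excluded classes."""
    candidates = {name: score for name, score in scores.items() if name not in excluded}
    return max(candidates, key=candidates.get)


def posture_pairs(labels, score_rows):
    """Build (label, prediction) pairs, dropping time points labeled MIXED."""
    return [
        (label, best_posture(scores))
        for label, scores in zip(labels, score_rows)
        if label != "MIXED"
    ]


# Toy example: the model's top choice is MIXED, so the runner-up is used instead.
scores = {"MIXED": 0.5, "STAND": 0.3, "VIGOROUS": 0.2}
prediction = best_posture(scores)
pairs = posture_pairs(["STAND", "MIXED"], [scores, scores])
```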
Furthermore, following Uijl et al. [8], we excluded any time points within 1 s of a class transition in both classification problems. However, also similar to [8], we treated the SHAKE class differently due to its very short duration. For SHAKE, we only excluded the outer one-third second. In dropping these transition regions, we attempted to follow established convention for minimizing the effects of misalignment in labeling, and to make our reported results easier to compare to related works.
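One way to implement this exclusion is sketched below. It assumes evenly spaced time points (320 ms apart), places each transition midway between the two neighboring time points, and applies the shorter margin only to time points labeled SHAKE; these details are assumptions made for illustration and may differ from the procedure actually used.

```python
def keep_mask(labels, step_s=0.32, margin_s=1.0, shake_margin_s=1.0 / 3.0):
    """Return a list of booleans, True where a time point is far enough from any transition.

    labels: one class label per time point, spaced step_s seconds apart.
    """
    n = len(labels)
    # Assumed transition time: halfway between the two differing neighbors.
    boundaries = [(i - 0.5) * step_s for i in range(1, n) if labels[i] != labels[i - 1]]
    keep = []
    for j, label in enumerate(labels):
        margin = shake_margin_s if label == "SHAKE" else margin_s
        t = j * step_s
        keep.append(all(abs(t - b) >= margin for b in boundaries))
    return keep


# Toy examples: a single NONE -> EAT transition, and a short SHAKE segment.
keep = keep_mask(["NONE"] * 10 + ["EAT"] * 10)
shake_keep = keep_mask(["NONE"] * 5 + ["SHAKE"] * 5 + ["NONE"] * 5)
```

With a 1 s margin and 320 ms spacing, three time points on each side of a transition are dropped, whereas only the outermost time point on each end of the SHAKE segment is dropped.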
Performance of these models was evaluated using metrics widely used in the machine learning field, including F1 scores. These metrics can be expressed in terms of the numbers of true and false positive predictions (TP and FP) and the numbers of true and false negative predictions (TN and FN). They include precision, $\mathrm{TP}/(\mathrm{TP}+\mathrm{FP})$; sensitivity or recall, $\mathrm{TP}/(\mathrm{TP}+\mathrm{FN})$; and specificity, $\mathrm{TN}/(\mathrm{TN}+\mathrm{FP})$. The F1 score is the harmonic mean of precision and recall, $F_1 = 2\,\mathrm{TP}/(2\,\mathrm{TP}+\mathrm{FP}+\mathrm{FN})$, and so summarizes both in a single number.
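To make these definitions concrete, the sketch below computes one-vs-rest counts for a single class from pooled (labeled, predicted) pairs and derives the metrics from them. The function names and the toy counts are illustrative and are not taken from the study's results.

```python
def one_vs_rest_counts(pairs, cls):
    """Return (tp, fp, tn, fn) for one class, treating all other classes as negative."""
    tp = sum(1 for label, pred in pairs if label == cls and pred == cls)
    fp = sum(1 for label, pred in pairs if label != cls and pred == cls)
    fn = sum(1 for label, pred in pairs if label == cls and pred != cls)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp, fp, tn, fn):
    """Compute precision, sensitivity (recall), specificity, and F1 from raw counts."""
    return {
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Toy example with made-up counts for the EAT class.
pairs = (
    [("EAT", "EAT")] * 90
    + [("NONE", "EAT")] * 10
    + [("NONE", "NONE")] * 880
    + [("EAT", "NONE")] * 20
)
counts = one_vs_rest_counts(pairs, "EAT")
metrics = binary_metrics(*counts)
```

Note that sensitivity and specificity each depend on only one row of the actual-class split, which is why they do not change with class prevalence, whereas precision and F1 do.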

User Validation
Although the crowd dataset is meant to be representative of real-world data, it is subject to biases such as underrepresentation of behaviors that are unlikely to be video recorded, such as riding in cars or staying at home alone. Furthermore, it is impossible to anticipate all of the myriad situations that may serve as confounders. Consequently, we ran real-world user validation campaigns on the two behaviors that users are most likely to be aware of, EAT and DRINK behavior. We defined events as periods of relatively sustained, specific behaviors detected with high confidence, such as eating events (meals) consisting of several minutes of sustained eating behavior. We adapted our production system, which runs the models described in this study in near-real time on all PI project participants, to occasionally send validation emails to participants when an EAT or DRINK event had occurred within the past 15 min. Respondents categorized the event detection as correct ("Yes") or incorrect ("No") or indicated that they were not sure. Users were able to suggest what confounding event may have triggered any false predictions. We excluded any responses that arrived more than 60 min after an event's end, as well as any "Not Sure" responses.
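The response filtering described above can be sketched as follows. The response format (an `answer` string and a `delay_min` field giving the minutes between the event's end and the response) and the function name are assumptions for illustration.

```python
def confirmed_fraction(responses, max_delay_min=60):
    """Return the share of usable responses that confirmed the detected event.

    responses: list of dicts with 'answer' in {"Yes", "No", "Not Sure"} and
    'delay_min', the minutes elapsed between the event's end and the response.
    Returns None when no usable responses remain.
    """
    usable = [
        r for r in responses
        if r["answer"] != "Not Sure" and r["delay_min"] <= max_delay_min
    ]
    if not usable:
        return None
    confirmed = sum(1 for r in usable if r["answer"] == "Yes")
    return confirmed / len(usable)


# Toy example: one "Not Sure" and one late response are excluded.
responses = [
    {"answer": "Yes", "delay_min": 5},
    {"answer": "Yes", "delay_min": 10},
    {"answer": "No", "delay_min": 20},
    {"answer": "Not Sure", "delay_min": 5},
    {"answer": "Yes", "delay_min": 90},
]
rate = confirmed_fraction(responses)
```

Because only detected events trigger a validation email, this fraction measures how often detections were confirmed; it says nothing about events the system missed.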

Data Collected
After applying the steps described above, the crowd dataset contained data from 5063 videos representing 2217 subjects, and the eat/drink dataset contained data from 262 videos representing 149 unique dogs. The distribution of weights and ages represented in these datasets is shown in Figure 4, while a breed breakdown is given in Table 4.  These datasets also differed in the length and frequency of labeled events, as shown in Table 5. The crowd and eat/drink datasets contain 163.9 and 22.4 h of video data labeled as VALID, respectively.
The EAT class was highly represented in both the crowd dataset (because participants were specifically requested to submit videos of their dogs at mealtime, since it is an easily filmed and important behavior) and in the eat/drink dataset (due to study design). The eat/drink dataset included only small amounts of incidental LICKSELF, SCRATCH, PETTING, and SHAKE behavior, while the crowd dataset contained many of these events because participants were repeatedly reminded of their importance.
The distribution of lengths for each label class was highly skewed, with many short labels and a smaller number of longer labels (Figure 5). The distribution of SHAKE labels was less skewed, likely because it is typically a short behavior and less prone to interruption.

Classification Accuracy
Cross-validated classification metrics for the crowd dataset are given in Table 6, and classification metrics obtained from evaluating the eat/drink dataset using a model trained on the crowd dataset are given in Table 7. Subsequent sections may report only behaviors, because posture labels are less accurate and postures are typically used in an aggregate form in which individual misclassifications are less important. Of the metrics in Tables 6 and 7, only sensitivity and specificity are independent of class prevalence.
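The prevalence point can be checked with a small worked example. The sketch below uses made-up counts (not values from Tables 6 or 7): it holds sensitivity and specificity fixed and varies only the number of positive and negative samples, so any change in F1 is due to prevalence alone.

```python
def counts_at_prevalence(sensitivity, specificity, n_pos, n_neg):
    """Expected confusion counts for a classifier with fixed sensitivity/specificity."""
    tp = sensitivity * n_pos
    fn = n_pos - tp
    tn = specificity * n_neg
    fp = n_neg - tn
    return tp, fp, fn, tn


def binary_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, precision, and F1 from one-vs-rest counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


# Illustrative numbers only: same classifier, common class vs. rare class.
common = binary_metrics(*counts_at_prevalence(0.9, 0.99, n_pos=1000, n_neg=1000))
rare = binary_metrics(*counts_at_prevalence(0.9, 0.99, n_pos=100, n_neg=10000))
print(common["f1"], rare["f1"])  # F1 is lower for the rare class
```

Sensitivity and specificity are identical in both cases by construction, while precision and therefore F1 fall as the class becomes rarer.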
The "behaviors" confusion matrix for the crowd dataset is shown in Figure 6 in non-normalized and normalized forms. The non-normalized confusion matrix gives raw tallies (that is, the total number of one-third second time points) of predicted and labeled classes, and the normalized confusion matrix gives the percentage of each actual label classified by the algorithms as a given predicted label (so that the percentages in each row sum to 100%). The non-normalized matrix is dominated by correctly predicted NONE and EAT samples, due to their high prevalence and effective classification in this dataset. The normalized matrix suggests the reliable classification of DRINK, EAT, NONE, and SHAKE. The LICKSELF and SCRATCH classes are of moderate reliability, and the LICKOBJECT, PETTING, RUBBING, and SNIFF classes exhibit some systematic misclassification and are of lesser reliability.
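The relationship between the two forms of the confusion matrix can be written out directly. The following sketch assumes rows index the actual (labeled) class and columns the predicted class, as described above; the example counts are invented.

```python
def row_normalize(matrix):
    """Convert raw confusion tallies to row percentages.

    Each row (actual class) is divided by its total, so every non-empty
    row sums to 100%. Rows with no samples are returned as all zeros.
    """
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([100.0 * c / total if total else 0.0 for c in row])
    return normalized


# Invented two-class example: rows = actual, columns = predicted.
raw = [[8, 2],
       [1, 9]]
print(row_normalize(raw))
```

Because each row is scaled independently, a rare class contributes as much visual weight to the normalized matrix as a common one, which is why the normalized form is more informative for low-prevalence behaviors.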

Effect of Device Position on Performance
The system's classification performance, as measured by F1 score, shows no significant dependence on device position (Figure 7).

Figure 7. Classification performance as measured by F1 score. F1 scores measuring test accuracy are broken out by device position, for n = 48 videos from the eat/drink dataset where the dog's collar had exactly four attached devices with known orientation. Error bars are 95% confidence intervals on the mean, as determined by bootstrapping. The classification accuracy per class was similar between the four positions, indicating that system accuracy is not substantially affected by collar rotation.
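For readers unfamiliar with bootstrapped error bars, the sketch below shows one common variant, the percentile bootstrap, for a 95% confidence interval on a mean. The text does not specify which bootstrap variant or how many resamples were used, so those choices (and the example F1 values) are assumptions.

```python
import random
import statistics


def bootstrap_ci_mean(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented per-video F1 scores for one device position.
f1_scores = [0.90, 0.85, 0.95, 0.80, 0.88, 0.92, 0.70, 0.99]
print(bootstrap_ci_mean(f1_scores))
```

Overlapping intervals across the four positions would be consistent with the absence of a measurable position effect.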

User Validation
Participants responded far better than expected to user validation efforts. Users opened emails, clicked through to the web form, and submitted validation results for 55% of the EAT validation emails and 42% of the DRINK validation emails.
Responses are summarized in Table 8. As described above, we excluded any responses that arrived more than 60 min after an event's end, as well as any "Not Sure" responses. The positive ("Yes") validation rate was approximately 95% for both event types. As expected, the rate of users responding "Not Sure" was far greater for DRINK (12%) than for EAT (2%). As the production system generates candidate EAT and DRINK events, it calculates a confidence score (the mean algorithm confidence over the event's duration) that varies between 0 and 1.0, and drops any events with a score below a threshold of 0.3. Figure 8 shows how the percentage of "Yes" responses (the true positive rate) varied with this confidence score. For EAT events, the rate grew from 83% for the lowest-confidence bin (0.3-0.4) to 100% (201 out of 201) for the highest-confidence bin (0.9-1.0). Since users do not see the confidence score, this trend suggests that the EAT validation data are relatively reliable. The DRINK data show a less convincing trend, which is consistent with users' lower awareness of DRINK events.
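The confidence-score analysis can be sketched as follows. The event confidence is the mean per-sample confidence, events below 0.3 are dropped, and the remaining events are grouped into bins of width 0.1, as described above. The input layout (pairs of confidence and a yes/no flag) is a hypothetical simplification.

```python
THRESHOLD = 0.3  # events scoring below this are dropped
N_BINS = 10      # bins of width 0.1 covering 0.0-1.0


def event_confidence(per_sample_confidence):
    """Mean algorithm confidence over an event's duration."""
    return sum(per_sample_confidence) / len(per_sample_confidence)


def yes_rate_by_bin(events, threshold=THRESHOLD):
    """Map bin index (3 means 0.3-0.4, ..., 9 means 0.9-1.0) to the "Yes" fraction.

    `events` is an iterable of (confidence, answered_yes) pairs.
    """
    tallies = {}
    for confidence, answered_yes in events:
        if confidence < threshold:
            continue
        # Round before truncating to avoid floating-point edge effects;
        # a confidence of exactly 1.0 is placed in the top bin.
        idx = min(int(round(confidence * N_BINS, 9)), N_BINS - 1)
        yes, total = tallies.get(idx, (0, 0))
        tallies[idx] = (yes + int(answered_yes), total + 1)
    return {idx: yes / total for idx, (yes, total) in sorted(tallies.items())}
```

A rate that rises with the bin index, as reported for EAT, is what one would expect if both the confidence score and the user responses carry real information about event correctness.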

Comparison with Previous Work
We compared our dataset and results with several previous works (Table 9), and we tabulated several important qualitative differences between the datasets (Table 10). In comparing these results, it is important to account for:
• Class distribution. Each dataset exhibits a different distribution of behaviors. In general, classifiers exhibit better F1 scores for common behaviors than for rare behaviors. The classifier sensitivity and specificity are relatively insensitive to this distribution, so we recommend using these metrics for comparing performance across different datasets.
• Dataset collection methods. Classifiers are more accurate when applied to high-quality datasets collected under controlled conditions. Accuracy can drop substantially in naturalistic versus laboratory settings [26,27]. Classifiers benefit from consistent device position, device attachment, and collar tightness, and they also benefit when the labeled behaviors as well as the collection environment are consistent and well-defined.
Previous works have used relatively controlled and high-quality datasets, similar to the eat/drink dataset in this work [8,18,19,21]. As expected, our crowd-sourced dataset exhibits a far greater diversity of weights, ages, and breeds than our eat/drink dataset, since the eat/drink subjects are sampled from several relatively homogeneous subpopulations.
The classification performance of the classifier presented here on the EAT and DRINK classes in the eat/drink dataset advances the sensitivity, specificity, and F1 score for these classes. Sensitivity and specificity are independent of class prevalence. The balance between sensitivity and specificity is a design choice, so we have calibrated our algorithms to favor specificity in order to minimize false positives.
The classifiers' performance on SCRATCH in the challenging crowd dataset also advances the state of the art. Comparable detection of LICKOBJECT, LICKSELF, PETTING, RUBBING, and SNIFF has not been previously demonstrated to our knowledge. We note that SCRATCH, LICKSELF, and RUBBING behaviors are highly relevant to dermatological health and welfare applications [19], and that PETTING is an important confounder that can be easily misclassified as SCRATCH or LICKSELF in classifiers that are not exposed to this behavior. We have found the classifiers' detection of SHAKE to be highly accurate (though susceptible to temporal misalignment between device and video data, due to the short event lengths). It is difficult to compare the per-time-sample SHAKE classification metrics here to published per-event metrics due to differing methodologies [8,18].
The device position invariance demonstrated by our classifier is a key property that enables real-world performance to approach that of controlled studies, allowing accurate detection of our reported behaviors in home environments.

Challenges
In the Supplementary Materials, we include seven videos (Videos S1-S7) annotated with behavior classification predictions, as well as an explanatory figure (Figure S1) and table (Table S1), in order to demonstrate the system's operation. The system excels at certain clearly defined and easily recognizable activities, especially those with repetitive and universal movement patterns such as drinking (lapping), walking, running, shaking, and most eating behaviors. It also performs well on well-defined instances of scratching and self-licking.
Device positioning and collar tightness do not appear to have a strong effect on system accuracy, meaning that accurate behavior metrics can be acquired via normal activity monitor usage. An important feature of the devices described in this study is their insensitivity (invariance) to collar orientation or position (Figure 7). In real-world settings, and especially with lightweight devices such as the Whistle FIT, the device can be, and often is, rotated away from the conventional ventral (bottom) position at the lowest point of the collar.
The system appears to use the angle of a dog's neck (that is, whether the dog is looking up or down) as an important behavioral clue. Consequently, activities such as eating or drinking appear to be detected less accurately when raised dog bowls are used, and activities such as sniffing, scratching, and self-licking can go undetected if performed in unusual positions. Slow-feed food bowls, collars attached to taut leashes, and loose collars with other heavy attachments can also cause misclassifications, but are often classified correctly nonetheless.
The class distribution of both datasets is highly imbalanced, which presented a challenge for algorithm training. For instance, in the crowd dataset, which we used for training, the EAT class total duration is 117 times greater than that of SHAKE.
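The text does not state how the imbalance was handled during training. As an illustration only, one widely used remedy is to weight each class inversely to its total labeled duration; the sketch below is not the authors' method, and the durations are placeholders chosen to reproduce the 117:1 EAT to SHAKE ratio.

```python
def inverse_frequency_weights(durations):
    """Per-class weights proportional to 1 / total labeled duration.

    Weights are scaled so that the duration-weighted average weight is 1.
    """
    total = sum(durations.values())
    n_classes = len(durations)
    return {cls: total / (n_classes * d) for cls, d in durations.items()}


# Placeholder durations (arbitrary units) with EAT 117 times longer than SHAKE.
weights = inverse_frequency_weights({"EAT": 117.0, "SHAKE": 1.0})
print(weights)
```

With such weights, each sample of the rare class contributes proportionally more to the training loss, so that each class carries roughly equal total weight.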
It is important to note that the class balance (class prevalence) of these datasets is not representative of real-world canine behavior. As the videos are typically taken in stimulating or interesting situations, these datasets exhibit a lower relative prevalence of LIE DOWN and other low-energy postures. Furthermore, the datasets exhibit much higher levels of EAT, DRINK, and possibly other behaviors, due to either study design (in the eat/drink dataset) or because the PI project requested that participants film certain behaviors.
Other sets of activities simply present very similar accelerometer data, such as eating wet food, which can be confounded with drinking; or being petted by a human or riding in a moving vehicle, which can be confounded with scratching or self-licking; or even vigorous playing and 'tug-of-war', which can be confounded with shaking and other activities. These misclassifications become less common as the models improve, but in some cases confusion may be unavoidable. Some other activities are simply rare or unusual, for instance, drinking from a stream, drinking from a water bottle, or licking food off of a raised plate.
A different type of problem relates to activities that are ambiguous even to human labelers, such as the distinction between eating a small part of a meal versus eating a large treat. A similar problem is label fragmentation, in which a long stretch of the labeled activity is interrupted either by the dog temporarily pausing (for instance, lifting up its head to look around several times while drinking or while eating a meal) or by discontinuities in the labeling when the dog leaves the camera's field of view (since labelers only marked videos as VALID when the dog was fully and clearly visible). These types of labeling ambiguity can be very deleterious to certain classification metrics, even though it is questionable whether the system's usefulness or real-world accuracy is affected.
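To make label fragmentation concrete, the sketch below merges labeled intervals of a single class that are separated by short gaps. This is an illustrative post-processing step, not one described in this work; the 2 s gap tolerance and the example intervals are arbitrary assumptions.

```python
def merge_fragments(intervals, max_gap_s=2.0):
    """Merge (start, end) intervals, in seconds, separated by at most `max_gap_s`."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap_s:
            # Short pause: extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# A meal labeled as two fragments with a 1 s pause, followed by a separate later event.
print(merge_fragments([(0.0, 10.0), (11.0, 20.0), (30.0, 40.0)]))
```

Per-time-sample metrics penalize the classifier for predicting the activity during the short gap, whereas an event-level view after merging treats the fragments as one continuous meal.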
User Validation participant comments confirmed our expectation that users were less aware of DRINK behavior than of EAT behavior. This lack of awareness likely also contributed to the lower DRINK response rate. It is unfortunate that, of the behavior classes measured in this work, only EAT is likely to exhibit the level of user awareness required for validation using this method.

Conclusions
We advanced the science of wearables by developing novel machine learning algorithms and validating their sensitivity and specificity for detecting drinking and eating behavior. We also used a large real-world dataset of more than 2500 dogs to demonstrate detection of licking, petting, rubbing, scratching, and sniffing. To ensure that the wearables collect accurate data in real-world settings, we demonstrated that system performance is not sensitive to collar position. In production, users reported high rates of true positives, consistent with the metrics measured via cross-validation on the crowd training database. A subsequent survey representing 10,550 dogs was used to validate the detection of eating and drinking behavior, taking the results from the laboratory into the real world. These findings indicate that accelerometer data collected by wearables can provide valuable information for diagnosing and treating conditions. The systems described in this work can be further improved by incorporating additional training data and by improving the underlying algorithms. The foundational algorithms built on this large dataset open opportunities to further our understanding of animal behavior and to advance individualized veterinary care through the use of wearables.
Author Contributions: R.D.C., conceptualization, data curation, formal analysis, investigation, methodology, software, validation, visualization, writing-original draft preparation; N.C.Y., conceptualization, data curation, formal analysis, investigation, methodology, software, validation, writing-review and editing; A.B.C., conceptualization, investigation-in-clinic crowd sourcing, writing-review and editing; C.J., data curation, investigation, methodology, software, writing-review and editing; D.E.A., conceptualization, data curation, investigation, validation, project administration; L.M.P., software, writing-review and editing; S.B., conceptualization support-eating and drinking study, investigation-experimental work at the WALTHAM Petcare Science Institute, writing-review and editing; G.W., conceptualization, funding acquisition, methodology, project administration; K.L., conceptualization, funding acquisition, methodology, resources; S.L., conceptualization, writing-review and editing. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy restrictions.