Objective Video-Based Assessment of ADHD-Like Canine Behavior Using Machine Learning

Simple Summary This paper applies machine learning techniques to propose an objective video-based method for assessing the degree of canine ADHD-like behavior in the veterinary consultation room. The method is evaluated using clinical data of dog patients in a veterinary clinic, as well as with a focus group of experts.

Abstract Canine ADHD-like behavior is a behavioral problem that often compromises dogs' well-being, as well as the quality of life of their owners; early diagnosis and clinical intervention are often critical for successful treatment, which usually involves medication and/or behavioral modification. Diagnosis mainly relies on owner reports and assessment scales, which are prone to subjectivity. This study is the first to propose an objective method for the automated assessment of ADHD-like behavior based on video taken in a consultation room. We trained a machine learning classifier to differentiate between dogs clinically treated for ADHD-like behavior and a healthy control group with 81% accuracy; we then used its output to score the degree of exhibited ADHD-like behavior. In a preliminary evaluation in a clinical context, the H-score was reduced in 8 out of 11 patients receiving medical treatment for excessive ADHD-like behavior. We further discuss the potential applications of the provided artifacts in clinical settings, based on feedback on the H-score received from a focus group of four behavior experts.


Introduction
According to the American Psychiatric Association, Attention-Deficit/Hyperactivity Disorder (ADHD) is defined as persistent symptoms of inattention and/or hyperactivity-impulsivity which interfere with development and/or functioning. Recent surveys estimate the prevalence of ADHD among children at 1-12% (see, e.g., in [1,2]). ADHD is further often associated with abnormalities in social behavior [3]; enhanced aggression [4]; difficulties in adapting to norms [5]; and cognitive, language, motor, emotional, and learning impairments [6].
ADHD is commonly assessed and diagnosed by relying on information from interviews, observations, and ratings collected from multiple sources (parents, teachers, etc.). Such subjective measures are associated with the risk of informant biases [7] and often present inconsistencies [8]. There is, therefore, increasing interest in objective measures for the diagnosis and assessment of ADHD, in the form of neuropsychological tests [9] and direct measurement of movement [10].

Materials, Tools and Methods
Before conducting the study, we obtained the dog owners' explicit consent to participate in the study. The procedure was designed with, and approved by, two behavioral veterinarians, in line with published guidelines for the treatment of animals in behavioral research and teaching [37]. Recordings were made as part of regularly scheduled veterinary visits. Dogs were allowed to withdraw from participation at any moment and were not forced to engage.
For automatically tracking the dog's movement, we used the Blyzer system, which is described in further detail in Appendix A. On top of the tracking module of the system, we implemented a feature computation module (in Python), as explained in Section 2.2 below.

Data Collection
To address our research questions, we collected video data during behavioral consultations in two veterinary clinics of the 'Veterinar Toran' Hospital in Tel Aviv and Petach Tikva, Israel. The participating dogs were of two types: normal controls, who arrived at the clinic for standard checkup and/or vaccination procedures, and dogs with excessive ADHD-like behavior, who received medical treatment for this problem. The dogs were recorded in two situations:
• Exploration Trial: free exploration of the room. When entering the consultation room, the dog was released off leash and left to freely explore the room.
• Dog-Robot Interaction Trial: 20 min into the consultation, the dog was presented with a moving dog-like robot and left to freely interact with it.
The dogs treated for ADHD-like behavior were recorded at two points in time: at their first visit, and at a follow-up visit after receiving medical treatment. The control group dogs were recorded only once. The process of data collection is presented in Figure 1.
In what follows, we provide further details on the participants, location, stimulus (robot), and preprocessing of video recordings.

Location
The consultation rooms' floor sizes were 260 × 160 cm (Petach Tikva) and 340 × 220 cm (Tel Aviv). (To rule out confounding effects of the difference in floor size between the clinics, we checked whether there was any significant difference in any measured variable between recordings from the Tel Aviv (N = 28) and Petach Tikva (N = 10) clinics; a two-tailed Mann-Whitney U test found no significant difference for any of the variables (p > 0.05).) Video was captured by a web camera (Logitech HD Pro Webcam C920) fixed on the ceiling (see Figure 2) and connected to the vet's computer. During the recording, the vet and the dog's owner(s) sat at a fixed location outside of the captured frame.
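The clinic-comparison check above can be run directly with SciPy's Mann-Whitney U implementation. The sketch below is a minimal illustration; the arrays are placeholder values for one measured variable, not the study's data.

```python
from scipy.stats import mannwhitneyu

# Placeholder per-dog values of one measured variable for each clinic
# (illustrative only -- not the study's measurements).
tel_aviv = [3.1, 2.7, 4.0, 3.5, 2.9, 3.8]
petah_tikva = [3.3, 2.8, 3.6, 4.1]

# Two-tailed test: does the variable differ between the two clinics?
u_stat, p_value = mannwhitneyu(tel_aviv, petah_tikva, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.3f}")
```

A non-significant result (p > 0.05) for every variable justifies pooling the recordings from the two rooms.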

Robot
We used a simple commercial dog-shaped toy robot of size 10 cm × 14 cm × 6 cm (see Figure 3b), which made repeated circular movements and barking noises; the latter were disabled by removing the robot's vocalization mechanism. The robot was placed in a fixed location (marked by X in Figures 2 (right) and 3a) during the veterinary examination.


Participants
Table A1 in Appendix B presents the participants' demographic data and the information on their respective recorded trials, as well as their descriptive statistics.

Participants 1-19 formed the H-group, according to the following inclusion criteria:
1. Their first recorded visit was their first visit to the clinic in the context of ADHD-like behavior complaints.
2. The patient was diagnosed with excessive ADHD-like behavior by the consulting behavioral veterinarian.
3. The veterinarian prescribed a medical treatment (with or without the addition of behavior modification) for treating excessive ADHD-like behavior.
Participants 20-38 formed the C-group, which included dogs with no reported health issues, visiting the hospital for an annual checkup or vaccination. During their consultation, the behavioral vet ruled out any behavior-related disorder and other comorbidities.

Trial Protocols
As mentioned above, the participants had two trials: (i) exploration and (ii) interaction with a toy robot. For the first trial, the owner and dog entered the room together; the owner(s) took their place at a predefined spot in the room (fixed chairs), and the vet sat by his desk. The owners were requested not to interact or make eye contact with the dog during the experiment, regardless of what the dog was doing. The dog was then released off leash, and video recording was started. (Recording samples can be found here (exploration trial) and here (dog-robot interaction).) The vet and the owner(s) were outside of the camera's scope, so only the dog and the robot are visible in the recording. The dog was allowed to freely move around the room while the veterinarian interviewed the owner(s) and filled in information on his computer. The veterinarian and the owner(s) always remained in the same location, except for the moment at which the robot was introduced in the middle of the room. An owner with his dog is shown in Figure 3.
The second trial had the following structure (a similar protocol was used in an earlier work [39]). Introduction phase: about 20 min into the interview, the veterinarian placed the inactive robot in the center of the room and returned to his chair; the dog was recorded for three minutes. Testing phase: the veterinarian activated the robot and returned to his place; the interaction of the dog with the moving robot was recorded for three minutes. The veterinarian then deactivated the robot and put it away, and the dog was recorded for an additional 10 min after the end of the testing phase. The introduction phase was included in order to let the dog get acquainted with a strange object, thus preventing excessive stress in the patient dogs.

Video Recordings Processing
Automatic tracking. The automatic tracking module of Blyzer was run on the videos. The tracking method (neural networks) used the following elements (see also Figure A1 in Appendix A). For the exploration trial, we used a neural network based on the Faster R-CNN architecture [40], pretrained on the COCO and Pascal VOC datasets, in addition to 6000 annotated frames from our vet clinics dataset. Figure 4 shows example frames where the dog object is detected. For robot detection, we used the MobileNets SSD framework [41], pretrained on the COCO, KITTI, Open Images, AVA v2.1, iNaturalist, and Snapshot datasets, in addition to 550 annotated frames from the vet clinic dataset. Figure 5 shows example frames with dog and robot detection. Postprocessing operations supported by Blyzer (such as smoothing and extrapolation) were applied to remove noise and enhance detection quality.
Filtering of low-quality tracking. The following inclusion criteria for videos were defined for both types of trials: (i) the dog is present in at least 70% of the frames, and (ii) the dog and robot are identified with an average certainty above 70%. In Table A1, videos excluded under these criteria are marked with '-'.
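The two inclusion criteria can be expressed as a simple filter over per-frame tracking output. The sketch below assumes a minimal data layout (one detection confidence per frame, or None when no dog is detected); this layout is an assumption for illustration, not Blyzer's actual output format.

```python
def passes_inclusion_criteria(frames, presence_threshold=0.70, certainty_threshold=0.70):
    """Check a tracked video against the two inclusion criteria.

    `frames` is a list of per-frame detection results: None when no dog was
    detected, otherwise a detection confidence in [0, 1].
    """
    if not frames:
        return False
    detections = [c for c in frames if c is not None]
    # Criterion (i): dog present in at least 70% of the frames.
    presence = len(detections) / len(frames)
    if presence < presence_threshold:
        return False
    # Criterion (ii): average detection certainty above 70%.
    avg_certainty = sum(detections) / len(detections)
    return avg_certainty >= certainty_threshold

# Example: dog present in 8 of 10 frames with confidence 0.9 -> included.
video = [0.9] * 8 + [None, None]
print(passes_inclusion_criteria(video))  # True
```

Videos failing either check would be the ones marked with '-' in Table A1.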

Choice of Features
In the Blyzer architecture (see Figure A1 in Appendix A), the feature analysis module is responsible for extracting the values of higher-level features; in our context, these relate to the dog's movement trajectory and its interaction with the robot. Thus, we needed to add to the library the implementation of features relevant to our problem and domain.
A feature in machine learning (ML) is an individual measurable property of what is being observed [42]. Many different features can be extracted in this case; however, not all of them may be meaningful or relevant for our problem and domain. One possible way forward is using standard feature extraction and selection strategies [42]. Another is relying on expert knowledge to manually select promising features. Due to the exploratory nature of our study, we combined these approaches in the following way. First, we held in-depth interviews with experts and performed a literature review of metrics of animal movement trajectories. After compiling a list of potential features, we applied four different feature selection techniques, which yielded four different subsets of features suggested for use by classification algorithms. Below we describe this process and the obtained features in further detail.

1. Expert interviews. For elicitation of possible features from experts, we held in-depth semistructured interviews with four behavioral specialists. (One was a Dip. ECWABM, one an ECWABM resident, one a veterinary doctor consulting on behavior, and one a dog trainer and researcher (PhD) in dog behavior.) During the interviews, we first asked them to characterize (i) the free movement of a dog with excessive ADHD-like behavior, and (ii) the interaction of such a dog with a toy robot, as opposed to a dog with no such problem. Appendix C provides the details of the chosen features. Table A2 summarizes the behavioral notions mentioned by the experts and their characteristics for the two types of dogs, as well as their mapping to potential features. Table A3 presents a list of all the chosen features, which are also explained in further detail.

2. Animal movement metrics. The description of animal movement paths is also a cornerstone of movement ecology [43]. A common characteristic used to describe and analyze movement paths is tortuosity, or how tortuous and twisted a path is. We hypothesized that tortuosity could be related to the experts' highlighting of 'erratic movement' and 'turning around' (Table A2). Thus, we selected as features the following five movement indices, which have been linked to tortuosity in [44]: straightness, Mean Squared Displacement, Intensity of Use, Sinuosity, and Fractal D; Table A4 provides their mathematical definitions and references.
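As an illustration of this family of indices, the straightness index is the simplest: net displacement divided by total path length. Below is a minimal sketch (the exact definitions used in the study are those referenced in Table A4):

```python
import math

def straightness_index(points):
    """Straightness: net displacement divided by total path length.

    `points` is a list of (x, y) trajectory coordinates. Values near 1 indicate
    a straight path; values near 0 indicate a tortuous, twisted one.
    """
    path_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if path_length == 0:
        return 1.0  # a stationary "path" is trivially straight
    net_displacement = math.dist(points[0], points[-1])
    return net_displacement / path_length

# A straight walk scores 1.0; a back-and-forth walk scores much lower.
print(straightness_index([(0, 0), (1, 0), (2, 0)]))          # 1.0
print(straightness_index([(0, 0), (1, 0), (0, 0), (1, 0)]))
```

The other four indices (MSD, Intensity of Use, Sinuosity, Fractal D) are computed from the same (x, y) trajectory, each capturing a different aspect of tortuosity.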

3. Feature subset selection. Feature selection involves analyzing the relationship between the input variables and the target variable, selecting those input features that are most strongly related to the target. The two most commonly used types of feature selection methods are (i) filter-based methods, which select a subset of features based on their correlation with the target feature, and (ii) wrapper-based methods, which search for a well-performing subset of features [45][46][47]. We chose to apply three filter-based algorithms: Univariate Correlation (f-classif), Chi-squared (chi2), and Importance, and one wrapper-based algorithm: Recursive Feature Elimination (RFE). Table A5 presents the results of the selections made by each of these techniques for the two trials: E (exploration) and DR (dog-robot). (The reason we separated the two was that the set of dogs for whom both trials were available was smaller than the set of dogs with only the exploration trial.)
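All four selection techniques are available in scikit-learn. The sketch below runs them on a synthetic stand-in for the trial feature matrix, since the study's data is not included here; the 'Importance' filter is approximated by ranking random-forest feature importances, and RFE is wrapped around a logistic regression for simplicity, so the estimators used in the study may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the per-dog feature matrix (not the study's data).
X, y = make_classification(n_samples=40, n_features=10, random_state=0)

# Filter-based: univariate correlation (f_classif) and chi-squared.
# chi2 requires non-negative inputs, hence the shift by the global minimum.
top_f = SelectKBest(f_classif, k=4).fit(X, y).get_support(indices=True)
top_chi2 = SelectKBest(chi2, k=4).fit(X - X.min(), y).get_support(indices=True)

# Filter-based "importance": rank features by random-forest importance.
importances = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
top_imp = np.argsort(importances)[-4:]

# Wrapper-based: Recursive Feature Elimination around a simple estimator.
top_rfe = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=4).fit(X, y).get_support(indices=True)

print(sorted(top_f), sorted(top_chi2), sorted(top_imp.tolist()), sorted(top_rfe))
```

Each technique yields its own candidate subset, mirroring the four subsets reported in Table A5.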

Classification Models and the H-Score
We experimented with several well-known classification algorithms: stochastic gradient descent, random forest, k-nearest neighbors, Gaussian process, Gaussian naive Bayes, multinomial naive Bayes, Bernoulli naive Bayes, complement naive Bayes, and support vector machines [48]. Each of these algorithms was run with each of the subsets of features suggested in Table A5 in Appendix C.
We used leave-one-out cross-validation, which is a standard method for evaluating the performance of classification algorithms [49]. We further used the following classification accuracy metrics: precision, recall, F-measure, and ROC. Precision and recall use the notions of True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). TP and FP refer to correct/incorrect positive predictions (that the dog is hyperactive), while TN and FN refer to correct/incorrect negative predictions (that the dog is in the control group). Precision (P) and recall (R, also called sensitivity) are defined as follows: P = TP/(TP + FP), R = TP/(TP + FN). The F-measure (also called F1) represents the combination of precision and recall: F1 = 2PR/(P + R). ROC curve-based metrics provide a theoretically grounded alternative to precision and recall. The ROC model attempts to measure the extent to which an information filtering system can successfully distinguish between signal (relevance) and noise [50].
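The three count-based metrics follow directly from these definitions; the counts in the example below are illustrative only, not the study's confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the metrics defined in the text:
    precision P = TP / (TP + FP), recall R = TP / (TP + FN),
    and F1 = 2PR / (P + R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative confusion counts only (not the study's results).
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"P = {p:.3f}, R = {r:.3f}, F1 = {f1:.3f}")
```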
To provide an H-score assessing the level of ADHD-like behavior, we used the class probabilities produced by the different models.
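A minimal sketch of this idea, assuming a Random Forest model and synthetic stand-in data: under leave-one-out cross-validation, each dog's H-score is the positive-class probability assigned by a model trained on all the other dogs. The study's exact model configuration is reported in Appendix D and may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-in for the per-dog feature matrix
# (y: 1 = H-group, 0 = C-group), not the study's data.
X, y = make_classification(n_samples=30, n_features=6, random_state=0)

# Leave-one-out: each dog is scored by a model trained on all the others.
h_scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    # H-score = predicted probability of the positive (hyperactive) class.
    h_scores.append(model.predict_proba(X[test_idx])[0, 1])

print(len(h_scores), min(h_scores), max(h_scores))
```

Because the score is a probability, it lies in [0, 1] and can be compared across visits, which is how the pre-/post-treatment comparison is made later.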

Focus Group of Experts
To evaluate the H-score and the whole approach of objective assessment in a clinical context, we conducted a semistructured Focus Group Discussion (FGD) [51] to explore the perceived usefulness of the objective hyperactivity assessment, and elicit any further usability requirements.
As the quality of FGD data relies heavily on the selection of appropriate participants and targeted questions [52], with only a few focus groups typically sufficient to achieve data saturation [53], we opted for a maximum stratification approach by including experts from different (dog-related) backgrounds, with different levels of familiarity with computer-aided diagnostic systems. This led to a selection of four participants: three behavioral veterinarians, one of whom had prior experience with computational animal behavior analysis systems, and one animal behavior researcher with expertise in dog training.
The FGD was structured as follows:
• Participants were welcomed by the moderator, and the purpose of the FGD was explained.
• Participants were asked to discuss (i) the use of ML for objective behavior assessment, and (ii) the use of ML for assessment of ADHD-like behavior within their professional practice.
• Next, we showed: an example of an exploration trial of a normal dog and of a hyperactive dog (see the video here), together with their respective H-scores; and two examples of exploration trials of a hyperactive dog before and after clinical treatment (see the video here), together with their respective H-scores.
• We then asked participants to discuss: to what extent they felt the H-score was consistent with their own expert opinion on the watched video; to what extent they felt the H-score would support them in clinical practice, and how; and to what extent they felt the H-score would integrate well into clinical practice.
We used follow-up questions in order to elicit additional information, triggered by mentions relating to specific non-functional requirements such as the speed of analysis, security aspects, etc.
The FGD session was conducted over Zoom. We live transcribed and took notes during the session, which we then discussed and analyzed in order to determine key reactions from the FGD participants.

Hyperactivity Classification Results (RQ1)
Out of all the options we experimented with, the Random Forest classification algorithm achieved the best results (83.3% precision, 80% recall, 81% F1-score, and 81.6% ROC score). The details of the comparison, as well as the list of the most prevalent features, are presented in Appendix D.

H-Score Evaluation Results (RQ2)
The H-score was taken as the class probability of the classification model. Table 1 presents the H-scores of the H-group, together with information on the recommended treatment; the B.mod column indicates whether behavioral modification was also suggested. Eleven participants from the H-group also had a follow-up visit (after receiving medical treatment); the time between the visits (in months) appears in column TbV. For these, we compared the H-scores between the first and follow-up visits: as can be seen, in 8 out of 11 patients the H-score was reduced. The three dogs in which it was not reduced (but stayed the same or increased) were dogs who indeed had not shown sufficient progress in the vet's opinion, as further medication was prescribed at the follow-up visit. Table 2 further shows the H-scores of the C-group participants.
When comparing the H-scores of the first visit between C-group (N = 19) and H-group (N = 19), the C-group score was found to be significantly lower (median = 0.26) than that of the H-group (median = 0.96) (two-tailed Mann-Whitney U = 49.5, p < 0.00001).

The H-Metric in Clinical Context (RQ3)
When shown comparative recordings of pre- and post-treatment consultations, overlaid with H-scores, all focus group participants agreed with the observed difference in hyperactivity scores.
Based on our analysis of the focus group discussion, we conclude that the H-score is perceived by behavioral experts as a valuable tool for assessing symptoms of ADHD-like behavior in the context of clinical treatment. This is due to its complete objectivity, as opposed to all other assessment methods currently available for ADHD-like behavior. Yet, the experts noted that clinical diagnosis cannot be based solely on the H-score, and additional information is required. This also explains why the participants found the accuracy of the tool satisfactory, arguing that one should not expect higher accuracy from classification models that look only at the first three minutes of the dog's behavior with the present dataset.
The H-score is also perceived as useful for communicating treatment outcomes to the dog owner. Outside the clinical context, it was noted that the tool also has potential for preventive alerts to owners about potential ADHD-like behavior in their dogs, if in the future it is implemented as a tool for owners and not only for clinical experts.
Further details concerning the analysis of the focus group discussions can be found in Appendix E.

Discussion and Future Research
In this study, we introduced a novel method for assessing canine ADHD-like behavior using machine learning techniques. The method is completely objective: it analyzes the movement of dogs based on video footage, without relying on (potentially subjective) information from the owner or the vet. However, this is also in some sense a limitation of the method's ability to support diagnostic decision-making, as it may not take into account critical information which is not observable in the video.
We explored the potential of such an approach to classify excessive indications of ADHD-like behavior, and to quantify its degree. We found that the Random Forest classification algorithm reached the best performance (with 83.3% precision, 80% recall, 81% F1-score, and 81.6% ROC score). The most prevalent features were found to be total distance and average speed, reflecting the intuition of erratic movement around the room expressed in the expert interviews.
We further explored the perceptions of behavioral veterinarians on the usefulness and feasibility of this approach in clinical settings using a focus group. The experts agreed on the potential of a tool offering objective measurement of symptoms of ADHD-like behavior in the context of their clinical practice, and also agreed that much better performance perhaps cannot be achieved, due to the obvious lack of important information (such as background information about the dog, or its environment) in the short footage analyzed.
Due to the exploratory nature of this research, we faced some major challenges, and had to make concrete decisions related to the design of this study, and its potential threats to validity, which we discuss below.
Data collection in a consultation room of an animal hospital entails that the setting is not completely controlled. To mention just some aspects which may have an effect on the dog's behavior: scents and noises outside the consultation room, the time of the visit, and what the dog experienced prior to the visit. To mitigate these threats, we made sure that the places where the vet and owner(s) sat were always fixed, using markings on the floor. We also excluded from the dataset consultations in which another veterinarian entered the room and interrupted the standard protocol, or the owner went out, leaving the dog alone in the room.
The use of Blyzer's deep learning models for object detection made the processing of a whole consultation (approximately 40 min) infeasible in terms of processing times, so decisions about which fragments to analyze were also crucial. After consulting with several behavioral experts, it was decided that the first three minutes of the visit are of crucial importance: they introduce the dog to a novel environment, and its reaction in the first minutes is the most informative. Some participants of the focus group also remarked that including additional video footage, e.g., from the dog's home, could be important. However, this poses challenges due to the non-uniform shooting angle and room size, as well as the complete inability to control the dog's environment. Based on an earlier study [39], which used dog-robot interactions as a tool for eliciting reactions from dogs in the context of a behavioral problem, we decided to also add such a dimension to the protocol. However, the final model with the best performance did not make use of any of the features of the dog-robot interaction. This could indicate that the first three minutes are more informative in the context of ADHD-like behavior. However, note that the number of dogs examined in the context of dog-robot interactions was smaller than the overall number for technical reasons (low-quality videos being filtered out); this too could explain why these features ended up not being included, so this issue needs further examination with a larger dataset.
Reflecting further on practical aspects of using the suggested approach in clinical settings, it is important to note that in addition to the high processing time needed to produce the tracking data (which can be addressed by using stronger machines), another problematic aspect we faced in our study was data quality. This can be divided into two dimensions: (i) quality of detection when the dog is in frame, and (ii) quality of footage, with the dog going out of frame too frequently. Item (i) can be addressed by improving the tracking models, extending their training set to include more dogs of different sizes, colors, and breeds. Item (ii) was mainly due to privacy considerations, as the owner needed to be left out of frame. This could be partially addressed by using more sophisticated interpolation techniques, predicting the dog's movement even when it is not visible. However, it is clear that these considerations need to be taken into account when planning a tool that would provide a real-time (or near real-time) H-score in a consultation room and be integrated into the clinician's workflow.
Another limitation of this study is the rather limited number of dogs in our dataset. This is related to the fact that we decided to recruit participants who only exhibited pure ADHD-like symptoms without comorbidities. Re-examination of our results with a significantly larger dataset is a natural step for further research.
A further direction for future research is considering other behavioral disorders than ADHD-like behavior, as well as ADHD mixed with further comorbidities such as anxiety, depression, etc. These may call for changes in the selected features, which need to be elicited in further interviewing experts concerning the specific way in which these conditions are reflected in the dog's behavior and/or its interaction with humans or objects.
Based on the focus group findings, the suggested approach seems promising in the context of the clinical decision making of behavioral veterinarians, as well as for non-clinical behavior assessment by canine professionals, as it offers an objective tool, which is much appreciated in a field where assessment is usually based on subjective reports or owner-filled questionnaires. An important aspect for future research is the role social cues play in eliciting hyperactive behavior. Extending our approach with protocols that integrate social cues (such as hand gestures, looking at the dog, petting the dog, etc.) is an important direction for future research on the objective assessment of ADHD-like behavior.

Institutional Review Board Statement:
Ethical review and approval were waived for this study because the observations were recorded during ordinary vet clinic visits.

Informed Consent Statement:
Informed consent was obtained from owners of all subjects involved in the study.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. The Blyzer System
The Blyzer system [36,38,54] aims to provide automatic analysis of animal behavior with minimal restrictions on the animal's environment (unlike tracking systems designed for rodents, e.g., in [55], which are usually situated in a semi-controlled restricted setting), or camera setting (as opposed to, e.g., the works in [56,57] where a 3D Kinect camera is used).
Blyzer's input is video footage of a dog freely moving in a room and possibly interacting with objects, humans, or other animals. Its output includes measurements of specific parameters specified by the user, which then provide some form of quantification of behavioral parameters.
Blyzer has already been used for a number of animal behavior projects. One example is a multi-method study, combining fMRI, eye-tracking, and behavioral measures, where Blyzer was utilized for the latter purpose to explore the possibility of a neural attachment system in dogs. Full details are provided in [58].
Another example is the analysis of the time budget and sleeping patterns of breeding stock kenneled dogs as welfare indicators. The dogs, bred and maintained by the Animal Science Center in Brazil, were observed for eight consecutive months using simple security cameras installed in their kennels (with night vision at night). Blyzer was used to measure parameters such as total amount of sleep, sleep interval count, and sleep interval length; for further details, see [36].
Figure A1. BLYZER Architecture.
The most relevant use of Blyzer for our purposes is the study in [39], where a setting similar to ours (recording in the consultation room of a behavioral veterinarian) was used in the context of a different behavioral issue related to anxiety. That study makes use of the idea of using artificial agents as stimuli to elicit responses from a dog. In particular, various robots have been used for studying canine social behaviors. For example, Leaver and Reimchen [59] investigated the approach preference of dogs towards a dog-like robot with different tail sizes and movements. Gergely et al. [60] examined dogs' interactive behavior in a problem-solving task, in which the dog had no access to food, with three different social partners, two of which were simple robots (remotely controllable cars), and the third a human behaving in a robot-like manner. Dogs' interactions with more complex commercial robots, displaying a wide variety of (programmed) behavior and/or similarity to the target species, have also been explored [61]. Given that dogs exhibit social behaviors towards robots, the hypothesis of [39] was that canine behavioral disorders related to social fear may also be reflected in the way dogs interact with robots. Thus, the use of dog-robot interactions (DRIs) was examined as a tool for the assessment of canine behavioral disorders. An exploratory study recorded DRIs for a group of 20 dogs: 10 dogs diagnosed by a behavioral expert veterinarian with deprivation syndrome, a form of phobia/anxiety caused by inadequate development conditions, and 10 healthy control dogs. It was found that the pathological dogs moved significantly less than the control group during these interactions, thus confirming the hypothesis. This provided the inspiration for our study, where we also analyzed DRIs in the context of ADHD-like behavior.
Table A1 presents the details of participants from the two groups: The C-Group (N = 19, 8 males, 11 females) and the H-Group (N = 19, 10 males, 9 females).
Remark A2. Note that while Table A1

Appendix C. Feature Selection
Table A2 summarizes the behavioral notions mentioned by the experts during the interviews in relation to ADHD-like behavior and their characteristics, as well as their mapping to potential features. Table A3 presents a list of all the chosen features, which are also explained below in further detail, divided into the exploration trial and the dog-robot interaction trial.

Speed:
In the study in [39], which analyzed dog movement in the consultation room in the context of another behavioral problem (anxiety), average speed was used. We also added the standard descriptive statistics of pointwise characteristics: median, maximum, variance, and standard deviation (see, e.g., in [62]).

Number of turns:
To capture excessive turning around the room, we defined the parameter of the number of turns, dividing it into four types according to angle sharpness: between 30–60°, 60–90°, 90–120°, and above 120°. To calculate the angle, the spatiotemporal trajectory was divided into vectors, and the angle between consecutive vectors was calculated as the inverse cosine of their normalized dot product:

θ = arccos((v₁ · v₂) / (‖v₁‖ ‖v₂‖))

Area:
To calculate the area the dog covered during the consultation, the convex hull is first calculated. The convex hull is the smallest convex polygon that contains the whole trajectory traveled by the dog (see the black polygon in Figure A2). The area is the region occupied inside the convex hull polygon; given the ordered vertices (xᵢ, yᵢ) that engulf it, the Shoelace formula [63] is applied:

A = ½ |Σᵢ (xᵢ yᵢ₊₁ − xᵢ₊₁ yᵢ)|
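The two features above can be sketched in code. The following is a minimal, self-contained illustration (not the authors' implementation): it assumes the trajectory is a list of `(x, y)` pixel coordinates, computes turn angles via the arccosine of the normalized dot product of consecutive displacement vectors, bins them into the four sharpness classes, and computes polygon area with the Shoelace formula. The function names and bin labels are ours.

```python
import math

def turn_angles(points):
    """Angle in degrees between consecutive displacement vectors of a trajectory."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        v1 = (x1 - x0, y1 - y0)
        v2 = (x2 - x1, y2 - y1)
        n1, n2 = math.hypot(*v1), math.hypot(*v2)
        if n1 == 0 or n2 == 0:
            continue  # skip stationary steps
        cos_t = max(-1.0, min(1.0, (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)))
        angles.append(math.degrees(math.acos(cos_t)))
    return angles

def count_turns_by_sharpness(points):
    """Bin turns into the four sharpness classes used as features."""
    bins = {"30-60": 0, "60-90": 0, "90-120": 0, ">120": 0}
    for a in turn_angles(points):
        if 30 <= a < 60:
            bins["30-60"] += 1
        elif 60 <= a < 90:
            bins["60-90"] += 1
        elif 90 <= a < 120:
            bins["90-120"] += 1
        elif a >= 120:
            bins[">120"] += 1
    return bins

def shoelace_area(vertices):
    """Area of a polygon given its ordered vertices (Shoelace formula)."""
    n = len(vertices)
    s = sum(vertices[i][0] * vertices[(i + 1) % n][1]
            - vertices[(i + 1) % n][0] * vertices[i][1]
            for i in range(n))
    return abs(s) / 2.0
```

In practice, the vertices passed to `shoelace_area` would be the convex hull of the trajectory (e.g., as computed by `scipy.spatial.ConvexHull`).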

Number of points:
To capture the smoothness of the trajectory, we defined the parameter of the number of points found on the curve of the dog's trajectory, obtained by segmenting and smoothing it. Various filtering techniques can be used to smooth the trajectory (e.g., moving average [64]); we chose a variant of the Ramer-Douglas-Peucker curve approximation algorithm [65]. Intuitively, our variant reduces the number of points in a curved 'polyline' approximated by a series of points by defining a straight line between the first and last points of the set of points that form the curved line. It finds the point furthest from this line and checks whether it is closer than a given distance. If so, it removes all the interior points, keeping only the first and last. If not, the curve is split into two parts: (1) from the first point up to, and including, the outlier; (2) from the outlier to the last point. The process is applied recursively. Figure A2 shows an example of a dog's movement graph, where the points obtained by the above-mentioned segmentation are shown in gray.

Quadrant point count:
The consultation room was divided into four quadrants of equal size, numbered as in Figure A3. Let PQᵢ denote the number of points of a dog's trajectory belonging to quadrant i.
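Computing the per-quadrant counts is straightforward. The sketch below assumes a rectangular room of known pixel dimensions and a hypothetical quadrant numbering (the paper's actual layout is defined in Figure A3).

```python
def quadrant_counts(points, width, height):
    """Count trajectory points per quadrant of a width x height room.
    Quadrant numbering here is illustrative: 1 = top-left, 2 = top-right,
    3 = bottom-left, 4 = bottom-right (image coordinates, y grows down)."""
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    for x, y in points:
        right = x >= width / 2
        bottom = y >= height / 2
        q = 1 + (1 if right else 0) + (2 if bottom else 0)
        counts[q] += 1
    return counts
```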

Dog-Robot Interaction Trial Features:
Time until first contact with robot: TFC is defined as the time from the start of the DR-trial to the time when the dog comes into close proximity to the robot (using a predefined distance threshold).
Duration of first contact with robot: DFC is defined as the duration of contact between the dog and the robot (i.e., the dog being in proximity to the robot).
Trajectory length: TL is defined as the total distance in pixels the dog covered, including during the interaction with the robot.
Pace: Pace is defined as the ratio between the length of the trajectory until first contact and the time until first contact.

Table A4 presents the list of movement indices included in the feature list. Below we provide more detailed explanations of these indices.
Intensity of use: IU is defined as the ratio between the total movement and the square root of the area of movement [66]. Intensity of use is proportional to the active time spent per unit area, which should increase with the tortuosity of the path.
Straightness: The straightness ST (or linearity) index is defined as the Euclidean distance between the start and the final point, divided by the total length of the movement [67].
Sinuosity: The sinuosity, SI, assumes that paths are correlated random walks, and thus that they were produced by animals randomly searching a homogeneous environment [68].
Mean square displacement: The mean square displacement, MSD, is an important parameter used as an index of movement area or home range [69]. It is likely to be inversely related to path tortuosity, similarly to ST, as more tortuous paths take more time to leave a given area.
Fractal dimension: The fractal dimension of a path, D, is another measure of tortuosity that has been used [70], based on the theoretical framework of fractal geometry. The fractal D of a set of points (such as a curve) can be seen as a measure of its propensity to cover the plane, with a value of one for no plane coverage (a straight line, for example) and two for full coverage of some area in the plane. Generally, fractal D is correlated with path tortuosity, but it is more appropriately considered an area-filling index.

Table A4 summarizes the formulas of the movement indices:
Straightness (ST): ST(p1, p2) = dE / L, where dE is the Euclidean distance between two points p1, p2 ∈ P and L is the trajectory length between them [44,71].
Mean Squared Displacement (MSD): MSD = Var(X) + Var(Y), where X and Y are the Cartesian location coordinates around the movement group's centroid [44,69].
Intensity of Use (IU): IU = L / √A, where L is the total path length and A is the movement area [44,66].
Sinuosity (SI): SI = 2[p((1 − c² − s²)/((1 − c)² + s²) + b²)]^(−1/2), where p is the mean step length, c is the mean cosine of turning angles, s is the mean sine of turning angles, and b is the coefficient of variation of step length [44,68,72].
Fractal D (FD): FD = θ / (1 + log₂(cos θ + 1)), where θ is the turning angle between two step vectors [44,73].

Appendix D. Classification Algorithms Details

Table A7 presents a comparison of the considered classification algorithms in terms of precision, recall, F1-score, and ROC score (the ROC score is the area under the Receiver Operating Characteristic curve, a common metric in ML). Random Forest, combined with the feature list selected by the RFE method, had the best performance, with 83.3% precision, 80% recall, 81% F1-score, and 81.6% ROC score. Table A8 presents the count of each feature's appearances in the subsets selected by the different feature selection algorithms, providing an indication of the prevalence of the features in the classification.
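A pipeline of this kind (RFE feature selection feeding a Random Forest, evaluated with the metrics above) can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the authors' exact configuration: the sample size matches the study's 38 dogs, but the feature dimensions, hyperparameters, and cross-validation scheme are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Synthetic stand-in for the behavioral feature matrix (38 dogs;
# the number of features and labels here are illustrative).
X, y = make_classification(n_samples=38, n_features=20, n_informative=6,
                           random_state=0)

# Recursive Feature Elimination ranks features using a Random Forest's
# importances and keeps the top subset.
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
               n_features_to_select=8)
X_sel = selector.fit_transform(X, y)

# Evaluate via cross-validated predictions on the selected features.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
pred = cross_val_predict(clf, X_sel, y, cv=5)
print("precision", precision_score(y, pred))
print("recall   ", recall_score(y, pred))
print("F1       ", f1_score(y, pred))
print("ROC AUC  ", roc_auc_score(y, pred))
```

Note that with hard class predictions, `roc_auc_score` reduces to balanced accuracy; using predicted probabilities (`method="predict_proba"` in `cross_val_predict`) gives the usual AUC.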

Appendix E. Focus Group Discussion Analysis
Below we present results of our analysis of the focus group transcriptions, identifying some common emerging themes.
In the first part of the FGD, four videos of dogs were shown to the participants in the following order: Lichi (H-group), Dream (C-group), Sia (H-group), and Laila (C-group). The participants were asked to discuss which of the shown dogs belongs to which group. All participants of the FGD correctly identified the dogs' classification.
Here are some example quotes:
"/.../ he [Lichi] was zapping from a corner to a corner to a corner. And this could be seen like impulsive behavior. And compulsive behavior is the impossibility for the dog to stop. It is interesting to put them in front of something new, and then you see the impulsivity and compulsivity and everything. And he [Lichi] was not afraid, he was not aggressive, he was just, I will say "over" happy." (P3)
"/.../ I agree Lichi shows hyperactivity. But like in humans, in dogs ADHD is a spectrum, you have severe ADHD and the grey zone, where it's rather normal. For a better diagnosis, it's better to look at a dog for 1 h to see whether it is able to stop the impulsive behavior. That is why we always also have house information from the owner. So yes, we need a lot more information to characterize everything. From just looking at this movement we do not have the whole picture, that's for sure. Even the vet can't do it precisely, so of course the Blyzer cannot do it either." (P4)
The participants also expressed positive attitudes concerning the approach and its usefulness in clinical settings. Here are some example quotes:
"/.../ In behavior we like very much objective assessment. We use grades, we use scales... That's why it's a very good tool to confirm our decision making. I do not see it replacing us in our practice." (P4)
"/.../ It's a great tool. But for me it's not a tool to say the dog is hyperactive, but a tool that says that in this particular situation the dog is acting hyperactively. But I am searching for objective tools and in this context it's really a great start." (P1)
"/.../ Because we are taking the same 3 min from all dogs, it is comparable. It will work, if we have lots of data." (P4)
"/.../ It's a great tool to measure signs and symptoms objectively, and a small step towards the next level where we can make a diagnosis. It's like you take a stethoscope, put it on the heart and you hear murmur, but that is not sufficient information to make a diagnosis."
(P2)
"I think we all agree it's a great tool, but [specifically] a great tool to measure signs or symptoms to go to the next level and to say it can make a diagnosis." (P4)
Additional themes emerging from the discussion centered around:
• The potential of the approach for early detection: "/.../ you could also see this tool as a prevention... they can use your app on phone and like they film their dog quiet in the room when no-one is doing anything and maybe one day it will give them a score "your dog is hyper" if it has these symptoms and so they know it can happen /.../" (P3)
• The importance of further exploring the role of social cues in a protocol for ADHD-like behavior testing:
"/.../ The future protocols probably should not allow hand movement and petting the dog, to produce less social cues. " (P2) "/.../ for instance, for me I see in Lichie a dog who's moving faster maybe because of the social cues that are there... It also could be the case that he is also socially impaired', reacting to people. This should be taken into account." (P1) • The added value of the approach for communicating with owners: "/.../ I often talk to owners, explaining to them how the treatment will help. Having scores to show them would be good for the link with owners. So yes, it can be a great help for us...Of course, it's not only hyperactivity that we should measure, but it is a good start." (P4) I think it's interesting and important for owners to see objective data on their dogs and I think its interesting to maybe in (chatters?) or general consultations, I am very interested in this for all these reasons and I also think about something else; (P2)