Developing an Animal Welfare Assessment Protocol for Cows in Extensive Beef Cow-Calf Systems in New Zealand. Part 2: Categorisation and Scoring of Welfare Assessment Measures

Simple Summary
Animal welfare assessment protocols use different methods to categorise and score animal welfare. This study has demonstrated the feasibility of developing standards for a welfare assessment protocol of cow-calf farms in New Zealand by validating potential categorisation thresholds for measures of assessment on 25 beef farms. Imposed thresholds of categorisation and derived thresholds based upon the poorest 15% and best 50% of farms for each measure were compared to see which was the most appropriate to the range of observations and the significance of the welfare implications of the measure. For measures with significant welfare implications, the stricter threshold was retained, while derived thresholds appeared more appropriate for traits that occur commonly but are of less welfare importance for the production system at hand.

Abstract
The intention of this study was to develop standards for a welfare assessment protocol by validating potential categorisation thresholds for the assessment of beef farms in New Zealand. Thirty-two measures, based on the Welfare Quality and the University of California (UC) Davis Cow-Calf protocols, plus some indicators specific to New Zealand, that were assessed during routine yardings of 3366 cattle on 25 cow-calf beef farms in the Waikato region were categorised on a three-point welfare score, where 0 denotes good welfare, 1 marginal welfare, and 2 poor/unacceptable welfare. Initial categorisation of welfare thresholds was based upon the authors’ perception of acceptable welfare standards and the consensus of the literature, with subsequent derived thresholds being based upon the poorest 15% and best 50% of farms for each measure.
Imposed thresholds for lameness, dystocia, and mortality rate were retained in view of the significance of these conditions for the welfare of affected cattle, while higher derived thresholds appeared more appropriate for dirtiness and faecal staining, which were thought to have less significant welfare implications for cattle on pasture. Fearful/agitated and running behaviours were above expectations, probably due to the infrequent yarding of cows, and thus the derived thresholds were thought to be more appropriate. These thresholds provide indicators to farmers and farm advisors regarding the levels at which intervention and remediation are required for a range of welfare measures.


Protocol Used
The protocol was developed and trialled by Kaurivi [19] to create a robust, achievable suite of 32 measures which were usable on pasture-based extensive cow-calf beef farms in New Zealand [21]. Briefly, the protocol involved assessing measures combined from the Welfare Quality cattle protocol and the UC Davis Cow-Calf protocol, with additional New Zealand-specific measures. The assessment was trialled on 25 mixed sheep and cow-calf farms over 2 visits, with the first during routine yarding for pregnancy testing and the second using a questionnaire and observations of the herd at pasture during winter. In the first visit, a total of 4956 cows were presented for pregnancy diagnosis, with yard observations made on 3366 animals (see [19] for herd details). Observations were made of cows in the race regarding body condition, rumen-fill, behaviour and physical health. Stockpersonship was evaluated as cows entered and exited the race by observing how cows were handled. The yard design and handling facilities were also evaluated for ease of handling of cows. In the second visit, a farm resource evaluation and a questionnaire guided assessment of health and management of the herd was undertaken.

Categorisation of Measures
Categorisation of scores was based on the authors' experience and perception of what good welfare would be in extensive pasture-based beef cattle in New Zealand, together with a consensus developed from relevant literature. For each farm, the welfare impact of each of the 32 measures was categorised separately into 3 categories (Table 1). All discrete data were expressed as the proportion of cases and given an ordinal welfare score on that 3-point scale. For example, the mean percentages of poor body condition (BCS; thin cows) and poor rumen fill (RFS; hungry cows) were given an ordinal score to indicate acceptable and unacceptable welfare in the herds. For consistency, thresholds of measures across welfare principles were kept constant where reasonable. For example, for good health, the welfare categories for absence of injuries or physical impairment were kept at the same 2% threshold. Painful conditions (dystocia and short tail) were also given a similar category to the health issues. Ordinal data (e.g., age of castration) and subjective measures (e.g., handlers' noise) were similarly given a categorical score to indicate severity of marginal and unacceptable welfare in the herds. Finally, the categorisation of measures such as yarding frequency and health check frequency was moderated with respect to findings during the assessment visits. The categorisation and details on how each of these 32 measures were assessed are summarised in Tables 2-4.
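The mapping from a herd-level proportion to the three-point score can be sketched as follows. This is only an illustration under stated assumptions: the function name, the marginal cut-off of 1% and the example herd values are hypothetical, and only the 2% poor-welfare cut-off for health-type measures is taken from the text above.

```python
def welfare_score(percent_affected, marginal_threshold, poor_threshold):
    """Return 0 (good), 1 (marginal) or 2 (poor/unacceptable welfare).

    Assumes that a higher percentage of affected cows means worse welfare
    and that a herd must exceed a threshold to move into the next category.
    """
    if percent_affected > poor_threshold:
        return 2
    if percent_affected > marginal_threshold:
        return 1
    return 0


# Hypothetical herds scored for a health-type measure.
# poor_threshold=2.0 follows the 2% figure in the text;
# marginal_threshold=1.0 is an assumed value for illustration only.
for herd_percent in (0.0, 1.5, 4.6):
    print(herd_percent, welfare_score(herd_percent, 1.0, 2.0))
```

Keeping the two cut-offs as arguments means the same function can serve any proportion-based measure, with the actual values read from the categorisation tables.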

Data Analysis
Data were analysed using SPSS version 24 (IBM). Descriptive statistics for continuous measures were used to capture central tendency (mean and median), dispersion of data (standard deviation), range (minimum and maximum), variance and percentiles. Qualitative methods were used to analyse the frequency of ordinal measures. The Shapiro-Wilk test was used to test for normality, and log10(n + 1) was used to transform those variables that were not normally distributed. An alternative approach to applying pre-determined value judgements is to determine the thresholds from the data. Derived thresholds were therefore determined based on z scores to result in approximately 50% of farms falling into a good welfare band ('green') and an arbitrary 15% of farms into a poor welfare band ('red'). Farms not in the green or red band were classified as orange. The arbitrary 15% was chosen to fit with the '15% rule', where animals (in this case farms) below this point are considered as worse-off in terms of animal welfare compromise [12,18]. Farms were not given an 'overall' welfare score [9]. For each non-categorical measure, the derived red threshold and the imposed score 2 threshold were then compared by dividing the derived threshold by the imposed threshold.
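The z-score derivation described above can be sketched in a few lines. This is not the SPSS procedure used in the study; it is a minimal illustration that assumes higher values indicate worse welfare, uses the sample standard deviation, and back-transforms log10(n + 1) thresholds to the original scale. The function names and example data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def derived_thresholds(farm_values, log_transform=False,
                       good_fraction=0.50, poor_fraction=0.15):
    """Return (green, red) thresholds on the original measurement scale."""
    values = [math.log10(v + 1) for v in farm_values] if log_transform else list(farm_values)
    mu, sd = mean(values), stdev(values)
    z_green = NormalDist().inv_cdf(good_fraction)      # 0.0 when half the farms are 'green'
    z_red = NormalDist().inv_cdf(1.0 - poor_fraction)  # about 1.04 for the poorest 15%
    green, red = mu + z_green * sd, mu + z_red * sd
    if log_transform:
        # Undo log10(n + 1) so thresholds are in the measured units (assumed step).
        green, red = 10 ** green - 1, 10 ** red - 1
    return green, red


def band(value, green, red):
    """Classify one farm's value as 'green', 'orange' or 'red'."""
    if value <= green:
        return "green"
    if value >= red:
        return "red"
    return "orange"


# Hypothetical herd percentages for one measure on five farms.
green, red = derived_thresholds([0, 1, 2, 3, 4])
print(round(green, 2), round(red, 2), band(3, green, red))
```

With so few farms per measure, the bands will only approximate the intended 50% and 15% shares, which is consistent with the "approximately" in the description above.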

Welfare Assessments Summary Statistics
Consolidated data for the welfare observations made using the final protocol are shown in Tables 5 and 6. See Table 4 for a description of the assessments used to create these figures. For painful management procedures, castration was performed with rubber rings on 20 of the 25 farms (mode and median two months of age; range 1-4 months). Calves were disbudded only on two farms, at three and four months respectively. Ear-tagging was performed without the use of anaesthesia on all farms at median and mode of two months. None of the farms reported high levels of diseases in the last 12 months (based on 2017 herd size). Of the diseases reported, only lameness (which was reported on 18 out of 25 farms), abortion (13 out of 25 farms), and dystocia (21 out of 25 farms) had a mean recorded incidence across all farms of greater than 1% (1.1%, 1.5%, and 2.6%, respectively). On individual farms, the highest recorded incidences of lameness, abortion and dystocia per herd were 4.1% (five out of 155 cattle), 8.8% (20 out of 226 and seven out of 80 cows) and 16.7% (two out of 12 cows), respectively. The mean incidence of all other diseases was < 0.5%, and of those diseases, only eye cancer (three out of 125; 2.4%), theileriosis (10 out of 470; 2.1%), vaginal prolapse (one out of 35; 2.9%), and Mg deficiency (10 out of 470; 2.1%) had maximum recorded incidences in an individual herd of >2%. (See Appendix A for the main diseases recorded as per farmers' recollection in the questionnaire assessment at the 25 Waikato beef farms.)

Categorisation of Measures
Categorised observational data are illustrated in Figure 1. See Tables 2 and 3 for further information on how each measure was categorised into a score of 0, 1 or 2.
Animals 2020, 10, x 9 of 17
Most farms (17 out of 25) scored poorly for rumen fill, whereas no farms had poor welfare for distance to water. Poor welfare scores for short tails and dirtiness were reported at 14 out of 25 farms, whilst 22 out of 25 farms had faecal soiling ('diarrhoea'). All farms except one obtained a good welfare score for shade, and no farms had a good score for environmental hazards. Most farms scored poorly for mortality rate, followed by lameness. No farm was scored as having poor welfare associated with hair loss or abrasions. Only two farms disbudded calves and both did so after two months, which gave a poor welfare score. The rest had calves that were genetically polled. A poor welfare score was noted at five out of 25 farms for castration, and all farms received a marginal welfare score for ear tagging (tagging without local anaesthesia). No farm scored poorly for mis-catching, hitting cows or handling noise, whereas 10 and 11 out of 25 farms were placed in category 2 for equipment noise and dog noise, respectively. Most yarding frequency scores were in category 1, and cattle were generally fearful and agitated in the yards.
The accumulated welfare score according to the 3-point scores for each farm is shown in Figure 4, with farms ordered from most poor scores to fewest (range 14-3). The highest number of good scores was 22 out of 32 and lowest was six. For marginal scores, the range was 6-17.
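The accumulation of 3-point scores per farm, and the ordering of farms from most to fewest poor scores, can be sketched as below. The farm labels and score lists are hypothetical and shortened for illustration; in the study each farm would carry 32 scores.

```python
from collections import Counter


def tally_scores(farm_scores):
    """Count good (0), marginal (1) and poor (2) scores for each farm.

    farm_scores maps a farm label to its list of 0/1/2 scores.
    Returns (farm, good, marginal, poor) tuples, most poor scores first.
    """
    rows = []
    for farm, scores in farm_scores.items():
        counts = Counter(scores)
        rows.append((farm, counts[0], counts[1], counts[2]))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Hypothetical example with three farms and six measures each.
example = {
    "Farm A": [0, 0, 1, 2, 0, 1],
    "Farm B": [2, 2, 1, 2, 0, 1],
    "Farm C": [0, 0, 0, 1, 0, 0],
}
for farm, good, marginal, poor in tally_scores(example):
    print(farm, good, marginal, poor)
```

The counts are kept separate rather than summed, in line with the decision not to give farms an overall welfare score.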
[Figure caption, truncated in the source: ... beef farms, for which scores were assigned as either 0: good, 1: marginal, or 2: poor welfare. See Table 2 for further information on how each measure was categorised into a score of 0, 1 or 2.]

Refined Thresholds
Derived threshold values are shown in Table 7. Measures that were normally distributed were hungry cows, dirtiness, diarrhoea (faecal soiling), mortality rate and fearful/agitated cows. Seven measures had a derived red threshold that was >2 times the threshold imposed by categorisation: short tail, diarrhoea, lameness, dystocia, mortality rate, fearful/agitated, and cows running on exit.
Table 7. Normally distributed and log-transformed traits indicating multiples of the standard deviation with 15% cut-off thresholds at the 25 Waikato beef farms, showing thresholds of 50% of farms in "green" (good welfare), those in orange and 15% of farms in the "red" (poor welfare).

Discussion
Kaurivi [21] identified 32 measures of animal welfare that were feasible to assess during routine yarding of pasture-based beef cattle. This study categorised these animal welfare measures into scores that indicate a threshold of acceptable and unacceptable welfare, to provide guidance for when intervention was needed [8]. The thresholds that have been imposed or derived in this study are based on individual measures, rather than an aggregated 'score' for each farm [9].
Cattle were in good body condition at the time of assessment, with an average of 10.7% of cows having a BCS ≤ 4. The range of thin cows across farms was wide (0-60.7%). The imposed threshold for categorisation as poor welfare was 10% of the herd, whilst the derived threshold, based upon the poorest 15% of farms, was 19% of thin cows. In terms of identifying the need for nutritional intervention, the lower threshold seemed more appropriate, even though cows' productivity is not impaired in the short term at BCS 4 [23]. Other studies have suggested that a threshold for the proportion of thin cows that is deemed unacceptable could be set at 5-15% [24] and 6.7% [16]. The BCS data in the present study were largely correlated with rumen fill score (RFS) data. However, whilst a poor RFS can reflect long-term underfeeding, it can also occur during short-term feed deprivation [25], such as when cows are drafted a day before pregnancy testing. Hence, the derived threshold of ≤19% of the herd with a low RFS may not be more appropriate for the detection of poor nutrition than the original imposed figure of 50%.
Assessing the dirtiness of cattle was both difficult and unrewarding. Kaurivi [21] concluded that all sites of dirt (tail, hindquarters and flank) should be amalgamated to provide a single 'dirtiness' score. They also noted the confounding of faecal staining of the tail head as a sign of infectious/parasitic diarrhoea with its very common occurrence in normal cattle that are being fed lush pasture. Hence, whilst in housed cattle, dirtiness and diarrhoea are rightly interpreted as signs of poor housing and/or health control, these interpretations may not be relevant to the study population. Rather, dirtiness and diarrhoea probably reflect the degree of muddiness of the paddocks and/or the lushness of the pasture which, in turn, are largely dependent on the season of the year. A point may be reached when the level of dirt in a pasture-based system does represent a welfare compromise [26], so creating standards for interpretation of dirtiness is difficult [9,27]. The interpretation of dirtiness as a measure of welfare might require the setting of seasonal thresholds, e.g., finding dirty cows in January (summer lush pasture) is different to finding dirty cows in July (winter muddy terrain). Taken together, such considerations suggest that the derived threshold for red score of 36% of the herd being dirty seems more appropriate than the original imposed threshold of 20%. Likewise, the ubiquity of faecal staining due to the fluid nature of the cows' normal faeces means that the derived threshold of close to 60% is probably more realistic than the imposed threshold of 20%. It could in fact be argued that, whatever threshold is used for faecal staining, it may represent the imposition of a characterisation of a trait that is poorly related to welfare compromise; rather, it is merely a sign of cows having plenty of grass.
On the other hand, faecal soiling may contribute to a risk of disease [25], so perhaps adopting the re-categorised threshold of >60% (or 50%) may indeed provide a meaningful measure of welfare. Perhaps a qualitative determination may also need to be made of whether watery faeces are simply the result of the pasture diet or whether some identifiable disease process is causing abnormally loose faeces. Finally, there are economic implications associated with dirtiness in cases of cattle destined for slaughter [28] so, again, the scoring that is imposed might vary with the circumstances and/or purpose for which it is being undertaken.
Assessing the incidence of short tails may help to determine whether faeces on tails does, or does not, represent compromised welfare, given that the aetiology of short tails is, in most cases, constriction of blood supply to the tails by hardened faecal rings. Short tails were present in 4.6% of cows, which compares unfavourably with the imposed standard of >2% of affected cows representing poor welfare. It seems reasonable to assume that the condition is associated with a significant level of pain to the cow, probably like that associated with tail docking with a rubber ring [29]. This is a good example of setting thresholds based on what should be achieved on-farm and not based on the status quo, and the envisioned scale could be used as a tool to caution farmers about the state of tail soiling, so that remedial actions can be taken to curb or prevent the occurrence of this condition (i.e., washing off the dirt or clearing the hardened faecal balls before the tail sloughs or breaks off).
No farm in extensive hill or high-country in New Zealand is without any risk of hazards (i.e., steep hills, cliffs, streams, gullies and tomos). Farms (n = 8) that lost animals in tomos were considered as having major welfare compromise without considering the presence of the other hazardous terrains. Otherwise, the ranking of this measure was influenced by the prevailing conditions of the beef farms. Potential threats of the environment can never be eliminated, thus the application of strategies to minimise or bring accidents to tolerable levels would be more achievable [9]. The issue might be controlling the access of cows to these hazards rather than the presence of these hazards, so linking welfare compromise to good environmental management, such as preventing access to hazards, could provide a useful focus for reducing accidental death.
For most health-related measures, the welfare impacts were small on most farms. The exceptions were lameness, dystocia and mortality rate, for which the derived threshold was more than twice the imposed threshold. Importantly, whilst relatively low incidences of these conditions probably have relatively limited impact upon herd productivity per se, they have a very significant impact upon the individual cow. Thus, lameness is a critical welfare compromise indicator, as it is both a painful condition and affects productivity [30,31]. The mean incidence of lameness in this survey was 2.7% (range 0-11.5%), but the incidence on the worst 15% of farms (derived threshold) was ≥4.8%, indicating that it has the potential to be a significant welfare issue. Consequently, the original imposed threshold of 2% is probably more appropriate than the derived threshold, particularly as the lower threshold would have the benefit of increasing the awareness of farmers to the need for intervention [31]. Whether it is appropriate to use a single 'catch-all' criterion for lameness might be questioned [1,11]. For example, should lameness be differentiated into severe (non-weight-bearing) and non-severe lameness, with thresholds of 1% and 3% of the herd, respectively? On the other hand, in the circumstances in which observations were made in the present study, it was probably more accurate to use the catch-all than to try to differentiate between degrees of lameness. Dystocia similarly has a very significant impact upon the welfare of animals affected (and upon calves born/stillborn as a result of dystocia), and, at high incidences, can markedly impair the productivity of the farm [5]. Although the mean incidence, 2.6%, was close to the imposed limit, the derived limit was 4.9%, which indicates that dystocia is probably a relatively common trait on the beef farms.
Again, given the significance of the condition for affected individuals, the 2% threshold seems more appropriate than the derived 5% limit. A threshold of 2% could aid in benchmarking for monitoring and correction of this condition. Finally, the average mortality rate was 3.9%, which is rather higher than the New Zealand industry standard for beef cattle (2-3% [32]). Similar figures have also been reported by international studies of pasture-based cow-calf units [33-35]. The threshold for the worst 15% of farms was 6.3%: given that mortality represents the total economic loss of the cow, mortality has the potential to be both common and economically serious on beef farms. Hickson [36] found a death rate of 2.1% per year in New Zealand beef herds, which is close to the imposed 2% threshold of the present study. Therefore, the 2% categorisation threshold appears to be a rational figure to trigger investigation of underlying contributing factors to reduce mortality rate.
The age threshold above which performing painful management procedures (castration, removal of the horn bud) was considered unacceptable welfare was set at >2 months. New regulations in New Zealand prohibit disbudding/dehorning without local anaesthesia, whilst the New Zealand Veterinary Association [37] advocates that these procedures should be undertaken at 2-6 weeks of age, and in conjunction with the use of analgesia [38]. For castration, this painful procedure can be mitigated using analgesia [39] and animals which are castrated early cope and recover faster than if this is done at an older age [39,40]. Ear notching is more painful than tagging, but the adverse effects can be mitigated using vapo-coolant [41] which provides a local cooling of the skin and thereby reduces pain perception. Hence, performing notching without the use of any anaesthetic was deemed to be a significant welfare compromise.
Stockpersonship was categorised using ordinal measures related to the behaviour of the cattle in the yards and race, and categorical measures based upon observations of the stock handling. The ordinal measures 'fearful/agitated' and 'run' had derived thresholds that were more than twice the imposed threshold. Running was a common behaviour, for which the derived threshold (23.4%) was much higher than the imposed threshold (10%). Stumbling and falling were less common, with the derived and imposed thresholds being very similar at ~2%. Many of the stumbling cattle appeared to have been merely correcting their stance and hence might not warrant a stricter threshold. However, if extensiveness per se is the underlying cause, strategies such as more yarding events could be implemented to ascertain and prevent the welfare compromise. On the other hand, yarding is itself associated with stresses upon the cattle, so there are benefits to avoiding yarding cattle more often than is essential. In the present study, most farms (20/25) yarded the cattle 3-4 times per year, with the remainder of the farms yarding 5-6 times. A similar study of California ranches recorded an average of 3.4 yardings per year [42], but with a significant reduction in cattle vocalisation, stumbles and hitting with additional yardings per year [43]. This indicates an association of infrequent yarding and handling with difficult handling, restraining and fearfulness [44,45]. Concern around the infrequent yarding of cattle in extensive beef systems is supported by the finding in our study (Kaurivi Part 1) that yarding per year was correlated with fearful behaviour ( = 0.50).
In the present study, the derived threshold for fearful/agitated behaviour was 4.9%, versus the imposed threshold of 2%. It is likely that the commonness of fearful/agitated behaviour may primarily be an indication of the lack of familiarity of extensively managed cattle with yarding and handling; as also found by Simon [43]. Taken together, it appears that the benefits of more frequent yarding (>4 times per year, for example) may be more compatible with acceptable welfare when cows are handled than yarding <3 times per year. Fortunately, the proportion of cows that were mis-caught during restraint when gates were closed into or within the race was low: even setting the threshold at 1% of cows mis-caught, only 4/25 farms exceeded that threshold. One explanation is that beef farmers do not routinely use a head bail for mass management procedures including pregnancy diagnosis (except on 8 out of 25 farms where the first cow was caught in the single-file race). Another explanation for this could be awareness of welfare compromise of this practice at New Zealand beef farms; a lower risk of mis-catching was reported if farmers undergo training in cattle handling techniques [43].
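The comparison of derived and imposed thresholds discussed in this section can be reproduced from the figures quoted in the text, as in the sketch below. The dictionary keys and function name are illustrative, the faecal staining value uses the "close to 60%" figure as 60, and measures whose derived value is not quoted in the text (e.g., short tail) are left out.

```python
# measure -> (imposed poor-welfare threshold, derived 'red' threshold),
# both in percent, as quoted in the Discussion.
THRESHOLDS = {
    "thin cows (BCS)": (10.0, 19.0),
    "dirtiness": (20.0, 36.0),
    "faecal staining": (20.0, 60.0),  # derived value quoted as "close to 60%"
    "lameness": (2.0, 4.8),
    "dystocia": (2.0, 4.9),
    "mortality rate": (2.0, 6.3),
    "running on exit": (10.0, 23.4),
    "fearful/agitated": (2.0, 4.9),
}


def threshold_ratios(thresholds):
    """Divide each derived threshold by its imposed threshold."""
    return {measure: derived / imposed
            for measure, (imposed, derived) in thresholds.items()}


ratios = threshold_ratios(THRESHOLDS)
more_than_double = sorted(m for m, r in ratios.items() if r > 2)

# Print measures from the largest to the smallest ratio.
for measure, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{measure:20s} {ratio:.2f}")
```

A ratio well above 1 only shows that the observed distribution sits higher than the imposed standard; whether the imposed or derived value is retained remains a judgement about the welfare significance of the measure, as argued above.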
The frequency of health checks of cows by farmers during winter/pregnancy was based on the findings at the 25 beef farms, where health check frequency was regular with 20/25 farms inspecting at an interval of ≤1 week and 11/25 farms doing daily checks. Frequent health checks are expected to coincide with a good health status [43] and low mortality [46]. The limitation of health checks on extensive systems is that it is an overall inspection of cows at pasture, with rare close inspection to detect early health problems and injuries [47]. Thus, simply recording the frequency of health checks, without considering what the checks entailed (inspection of individual cows versus the whole herd), might influence the findings and hence the categorisation of this measure [48].
This study was undertaken in only one region (Waikato) of New Zealand, thus the derived thresholds may reflect beef farming in that region rather than across the country. We concluded in our previous study that before the protocol was used widely it needed further testing on more farms across New Zealand with more assessors. This process should also include the calculation of derived thresholds for each of the measures in the protocol where data were collected on a continuous basis. These thresholds should, ideally, be calculated at a regional rather than a national level, so that differences between regions, if present, can be highlighted. Only once this process is completed can we finalise the categorisation process started in this study. This finalisation process should involve beef farmers, beef exporters, animal welfare experts, veterinarians, consumers, and ideally, animal welfare advocacy groups.

Conclusions
This study has demonstrated the feasibility of developing standards for a welfare assessment protocol of cow-calf beef farms in New Zealand. Initial welfare thresholds were based upon the authors' perception of acceptable welfare standards and the literature, with subsequent derivation of thresholds based upon the poorest 15% and best 50% of farms for each category. Imposed and derived thresholds were compared to see which was the most appropriate to the range of observations and the significance of the welfare implications of the measure. Some of the derived thresholds were much higher than those originally imposed, with lameness, dystocia, and mortality rate being between two and three times higher than the imposed threshold. Nonetheless, in view of the significance of these conditions for the welfare of affected cattle, the original threshold appeared the more appropriate. The proportion of cows with low BCS or RFS evoked similar considerations. On the other hand, measures of dirtiness and faecal staining were more common, but less significant than originally envisaged, so the derived thresholds appeared more appropriate. Similarly, measures of cow behaviour during handling were above expectations. Again, due to the infrequent yardings that these animals experienced, the derived threshold appeared to be the more appropriate. Taken together, these thresholds provide indicators to farmers and farm advisors regarding the levels at which intervention and remediation are required. Findings during the assessments that were supported by national and international standards also rationalised the categorisation of measures such as yarding frequency/year and health checks frequency. Further data are required from more assessments across the country in order to finalise the categorisation process started in this study.