A Context-Aware Method for Cattle Vocal Classification for Livestock Monitoring in Smart Farms

This paper focuses on livestock monitoring in smart farms to improve animal well-being and production. The great potential for increased automation and technological innovation in agriculture helps livestock farmers monitor the welfare of their animals for precision livestock farming. A new acoustical method exploiting contextual information is introduced for cattle vocal classification. The proposed scheme takes raw recordings containing cattle sounds, from which a set of contextual acoustic features is constructed as input to a multiclass support vector machine (MSVM) classifier that tracks the types of cattle vocalizations. Noisy cattle calls are finally classified into four types of calls (i.e., food anticipation calls, estrus calls, cough sounds, and normal calls) with an overall classification accuracy of 84%, outperforming the results obtained using conventional MFCC features. We have used an open-access dataset consisting of 270 cattle vocalization records acquired using multiple sound sensors. The proposed method yields promising results for livestock monitoring, enabling farm owners to determine the status of their cattle.


Introduction
Automated acoustic monitoring could be a useful tool in precision livestock farming [1]. As farming systems become increasingly automated, it is possible to dynamically adjust the environment in which the animals are kept and automatically change the temperature, lighting, and ventilation. With the help of sensor and artificial intelligence (AI) technologies, farmers and farm owners can also detect diseases in animals and take immediate action accordingly. The implementation of smart technologies in livestock farming helps in gathering and processing real-time data related to animals' health and general behavior, including their feeding behavior, food and water quality, hygiene levels, and others. For example, the growing cattle population, the increasing number of dairy farms, and the rising adoption of livestock monitoring technology in developing countries create a strong demand for livestock monitoring.
In agriculture, there is nowadays a growing focus on livestock monitoring for smart farms. Various research works have been published in related contexts, such as animal welfare and disease detection. In [2], a deep-learning-based method is proposed for classification of cattle vocalizations for real-time livestock monitoring. That scheme consists of a 2D CNN (convolutional neural network) to classify cattle vocals and remove background noise using the STFT, and another CNN for behavior classification on existing datasets, achieving an overall accuracy of 81.96% using MFCC features. The work in [3] deals with classification of cattle cough sounds using a total of 205 min of sounds, resulting in 285 labelled calf coughs. The features are extracted by calculating the FFT of the input audio, removing the background noise, reducing the resolution of the spectrograms by summing the frequencies into 12 separate bands, and considering the duration of the cough. An example-based classifier is then used based on the Euclidean distance between the two rough spectrograms of the input audio data and the labelled data. The lower the distance, the more the input resembles its corresponding spectrogram, providing a 98% specificity rate (true negatives) and a 52% sensitivity rate (true positives). Despite the low sensitivity, the method is still able to identify periods of increased coughing, allowing farmers to administer treatment for the respiratory disorder. In [4], the feeding behaviour of cattle is studied using grazing sounds, by attaching microphones and cameras to a cow's forehead and exposing the cattle to different treatments that varied the plant species, two different sward heights, an increasing herbage mass, and the number of bites taken. The sounds are analysed by extracting the energy flux density, and it is found that the energy flux density relates linearly to the dry matter intake.
A scheme for monitoring positive and negative emotions in non-human animals is reported in [5] for goats, using emotion-linked vocalizations together with behavioural (looking) and physiological (cardiac) measures. It is found that during the habituation phase, goats gradually reduced the duration of looking towards the sound source, and their heart rate also decreased, suggesting that habituation to the valence of the stimuli occurred. Later, the occurrence of looking and the heart rate increased immediately after the second call of dishabituation, and decreased during rehabituation, suggesting that goats perceived the change in call valence. In [6], a cattle call monitor (CCM) system is developed using cattle vocalizations for automatic estrus detection. By matching the zero-crossing information (>5), frequency range (50-2000 Hz), and duration (0.5-10 s) of the windowed signals from the structure-borne and airborne sound microphones, the proposed algorithm is able to identify estrus calls with a sensitivity of 87% and a specificity of 94%. Another work on estrus detection for dairy cattle has been published in [7], monitoring cattle vocalization based on the vocalization rate (calls/hr) and the rates of harmonic moo and nonharmonic (noisy) bellow calls over the complete vocalizations.
In this paper, a new acoustical method is developed that utilizes contextual information for cattle vocal classification. The proposed scheme considers raw recordings containing various cattle sounds. A set of contextual acoustic features is then constructed as input to an MSVM classifier to track the types of cattle calls in the noisy raw input data.

Materials and Methods
The basic idea of the proposed approach lies in the integration of an auditory processing model and contextual information for extracting useful features. The method adopts a multiresolution framework. The general outline of the multiclass classification considered here is shown in Figure 1, which consists of a training stage followed by prediction.

Database Used
We have used an open-access dataset [8] containing 270 cattle vocalization records collected from 12 recording sensors (USB microphones, Shenzhen Kobeton Technology, Shenzhen, China; frequency response: 16 Hz-100 kHz; sensitivity: -47 dB ± 4 dB). The audio data were collected in three separate livestock facilities, with 4 microphones placed in each zone at a height of 3 m (see [2] for more details).

Data Preprocessing
Each vocalization recording is resampled (from 44,100 Hz to 16,000 Hz) and segmented into N-sample data blocks (N = 8192 samples here, corresponding to 0.512 s), followed by time windowing using N-sample Hamming windows [9]. Note that resampling is done here to reduce the computational complexity, while blocking is performed to save memory by compressing the signal without changing the spectral content [10].
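The preprocessing stage above can be sketched as follows. The text does not specify the resampling filter, so a simple linear-interpolation resampler is assumed here for illustration; the block length and Hamming windowing follow the description.

```python
import numpy as np

def preprocess(signal, fs_in=44100, fs_out=16000, block_len=8192):
    """Resample a recording to 16 kHz and split it into Hamming-windowed
    N-sample blocks (N = 8192 -> 0.512 s at 16 kHz)."""
    # Resample by interpolating onto the new time grid (illustrative
    # stand-in for a proper polyphase/anti-aliased resampler).
    t_in = np.arange(len(signal)) / fs_in
    t_out = np.arange(0, len(signal) / fs_in, 1.0 / fs_out)
    resampled = np.interp(t_out, t_in, signal)

    # Segment into N-sample blocks, dropping any trailing partial block.
    n_blocks = len(resampled) // block_len
    blocks = resampled[:n_blocks * block_len].reshape(n_blocks, block_len)

    # Apply an N-sample Hamming window to each block.
    return blocks * np.hamming(block_len)
```

A 2 s recording at 44.1 kHz thus yields three windowed blocks of 8192 samples each.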

Contextual Acoustic Features
We have introduced a set of contextual acoustic features, which encodes the multiresolution energy distributions in the time-frequency (T-F) plane based on the cochleagram representation of an input signal. We incorporate a number of cochleagrams at different resolutions to design the contextual feature set. The high-resolution cochleagram captures the local information, while the other, lower-resolution cochleagrams capture the contextual information at different scales. To compute the cochleagram, we first pass an input signal through a gammatone filter bank, where a particular gammatone filter has an impulse response given by

g(t) = t^(η−1) exp(−2π B_fc t) cos(2π f_c t), t ≥ 0,

where the parameter η is the order of the filter, f_c denotes the center frequency, and B_fc refers to the bandwidth given f_c. The gammatone filter function is used in models of the auditory periphery representing critical-band filters, where the center frequencies f_c are uniformly spaced on the equivalent rectangular bandwidth (ERB) scale. The relation between B_fc and f_c is given by

B_fc = 1.019 ERB(f_c), with ERB(f_c) = 24.7 (4.37 f_c/1000 + 1).

Each output signal from the gammatone filter bank is then divided into 20 ms frames with a 10 ms frame shift; the cochleagram is obtained by calculating the energy of each time frame at each frequency channel. Each T-F unit in the cochleagram contains only local information, which may not be sufficient to accommodate the diversity in the real-recorded input data. To compensate for this, the new contextual feature set provides contextual information by including the energy distribution in the neighborhood of each T-F unit. The steps for computing the features are as follows.
1. Given input data, compute the first 32-channel cochleagram (CB1), followed by a log operation applied to each T-F unit.
2. Compute the second cochleagram (CB2) in a similar way, with a frame length of 200 ms and a frame shift of 10 ms.
3. Derive the third cochleagram (CB3) by averaging CB1 using a rectangular window of size 5×5, covering 5 frequency channels and 5 time frames centered at a given T-F unit. If the window goes beyond the given cochleagram, the outside units take the value of zero (i.e., zero padding).
4. Compute the fourth cochleagram (CB4) in a similar way to CB3, except that a rectangular window of size 11×11 is used.
5. Concatenate CB1-CB4 to generate a feature matrix F and integrate it along the time frames to obtain a set of contextual features of dimension 128×1.
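The steps above can be sketched in Python. The truncated-FIR gammatone approximation, the log-spaced center frequencies (a stand-in for exact ERB-rate spacing), and the 50 Hz-7.2 kHz frequency range are illustrative assumptions not stated in the text.

```python
import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth as a function of center frequency.
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_fir(fc, fs, eta=4, dur=0.064):
    # Truncated gammatone impulse response (FIR approximation).
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * erb(fc)
    g = t ** (eta - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def cochleagram(x, fs=16000, n_chan=32, frame_len=0.020, frame_shift=0.010):
    # Log-spaced center frequencies (assumption, standing in for ERB spacing).
    fcs = np.geomspace(50.0, fs / 2 * 0.9, n_chan)
    flen, fshift = int(frame_len * fs), int(frame_shift * fs)
    n_frames = 1 + (len(x) - flen) // fshift
    cg = np.empty((n_chan, n_frames))
    for c, fc in enumerate(fcs):
        y = np.convolve(x, gammatone_fir(fc, fs), mode="same")
        for m in range(n_frames):
            seg = y[m * fshift : m * fshift + flen]
            cg[c, m] = np.sum(seg ** 2)  # frame energy per channel
    return np.log(cg + 1e-12)           # log compression of each T-F unit

def avg_pool(cg, k):
    # Average over a k x k neighborhood centered at each T-F unit,
    # with zero padding outside the cochleagram.
    h = k // 2
    p = np.pad(cg, h)
    out = np.empty_like(cg)
    for i in range(cg.shape[0]):
        for j in range(cg.shape[1]):
            out[i, j] = p[i:i + k, j:j + k].mean()
    return out

def contextual_features(x, fs=16000):
    cb1 = cochleagram(x, fs)                   # CB1: local, 20 ms frames
    cb2 = cochleagram(x, fs, frame_len=0.200)  # CB2: coarse, 200 ms frames
    cb3 = avg_pool(cb1, 5)                     # CB3: 5x5 neighborhood average
    cb4 = avg_pool(cb1, 11)                    # CB4: 11x11 neighborhood average
    # Integrate each cochleagram along time and concatenate -> (128, 1).
    feats = [cg.mean(axis=1) for cg in (cb1, cb2, cb3, cb4)]
    return np.concatenate(feats).reshape(-1, 1)
```

Since each cochleagram is integrated along time before concatenation, the four 32-channel maps yield the 128×1 feature vector regardless of the recording length.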

MSVM Classification
Separating various cattle calls is a multiclass classification based monitoring problem, which is solved here by considering the optimization formulation of the Crammer and Singer (CS) model [11] for a multiclass support vector machine (MSVM), providing fast convergence and high accuracy. In general, an MSVM classifier solves a d-class classification problem by constructing decision functions of the form

f(x) = argmax_{c ∈ {1, ..., d}} w_c^T φ(x),

given i.i.d. training data ((x_1, y_1), ..., (x_l, y_l)) ∈ (X × {1, ..., d})^l. Here, φ : X → H, φ(x) = k(x, ·), is a feature map into a reproducing kernel Hilbert space H with corresponding kernel k, w_1, ..., w_d ∈ H are class-wise weight vectors, and (·)^T stands for the transpose operator. The CS method is usually defined only for hypotheses without bias terms, i.e., b_c = 0. This CS-based MSVM classifier is trained by solving the primal problem

min_{w_1, ..., w_d, η} (1/2) Σ_{c=1}^{d} ||w_c||² + C Σ_{n=1}^{l} η_n

subject to ∀n ∈ {1, ..., l}, ∀c ∈ {1, ..., d}\{y_n} : (w_{y_n} − w_c)^T φ(x_n) ≥ 1 − η_n and η_n ≥ 0, where the η_n are 'slack' variables for each data item, in such a way that the margin between the correct class and the most confusing class is penalized. For learning structured data, CS's method is usually the MSVM algorithm of choice, taking all class relations into account at once to solve a single optimization problem with fewer slack variables.
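A minimal sketch of the CS primal is given below, trained by stochastic subgradient descent on the hinge term against the most confusing class. The paper's actual solver and kernelization are not specified, so the linear (rather than RBF-kernel) hypothesis, learning rate, and epoch count are illustrative assumptions.

```python
import numpy as np

def train_cs_msvm(X, y, n_classes, C=10.0, lr=0.01, epochs=200, seed=0):
    """Crammer-Singer multiclass SVM (linear, no bias terms), trained by
    stochastic subgradient descent on the primal objective
    0.5 * sum_c ||w_c||^2 + C * sum_n max(0, 1 - (w_{y_n} - w_c*)^T x_n),
    where c* is the most confusing competing class for sample n."""
    rng = np.random.default_rng(seed)
    l, dim = X.shape
    W = np.zeros((n_classes, dim))
    for _ in range(epochs):
        for n in rng.permutation(l):
            scores = W @ X[n]
            # Hinge argument 1 - (score_y - score_c) for every class c != y_n.
            margins = scores - scores[y[n]] + 1.0
            margins[y[n]] = -np.inf
            c = int(np.argmax(margins))      # most confusing class
            W *= (1.0 - lr / l)              # regularizer subgradient step
            if margins[c] > 0:               # margin violated: push classes apart
                W[y[n]] += lr * C * X[n]
                W[c] -= lr * C * X[n]
    return W

def predict(W, X):
    # Decision function: f(x) = argmax_c w_c^T x.
    return np.argmax(X @ W.T, axis=1)
```

On well-separated synthetic classes this sketch recovers near-perfect training accuracy; a production system would instead use a dedicated CS solver (e.g., a dual decomposition method) with the RBF kernel described in the results.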

Results and Discussion
Four types of cattle calls, namely food anticipation calls, estrus calls, cough sounds, and normal calls, have been used, as shown in Table 1. In Figure 2, the spectrograms of audio samples corresponding to food anticipation, estrus, cough, and normal vocals are displayed for illustration. In order to reduce redundancy while maintaining the variability of the contextual features, a decimated version of the feature set is considered as a form of feature selection. A decimation factor of 8 is used here, which reduces the length of the features from 128 to 16. For illustration, the average of all the 16×1 features for each type of cattle vocal is plotted in Figure 3. For the simulations, 70% of the total data samples are used for training, while the remaining 30% are used for prediction (testing). Each feature set is normalized to zero mean and unit standard deviation prior to classification. The results presented here are the averages over 50 different realizations; for each realization, the configurations of the training and testing sets are changed randomly. We have selected the default MSVM parameters C (regularization parameter) and γ (bandwidth parameter) of the radial Gaussian kernel k(x, x′) = exp(−γ||x − x′||²) as C = 10 and γ = 2 [12]. The average of the confusion matrices of the MSVM classification over all realizations using the contextual feature set is presented in Table 2, giving an average classification accuracy of 84%. Note that both the sensitivity and specificity are highest for the estrus class, which has the largest number of samples (i.e., vocalizations). The average classification accuracies (%) for various feature sizes M are listed in Table 3, where M = 16 gives the best result for the proposed scheme. Comparison results with MFCC (mel-frequency cepstral coefficient) [13] features using the same training setup are presented in Table 4.
The parameters for the MFCC are set as follows: MFCC window length = 20 ms (320 samples), number of MFCC features = 12, MFCC window overlap = 50%. The best average classification accuracy with the MFCC features is 60.81%. The proposed contextual features thus outperform the MFCC features (cf. Table 2 and Table 4).
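The feature reduction and normalization used in these experiments can be sketched as follows. Plain subsampling (keeping every 8th coefficient) and per-dimension z-scoring across samples are assumed here, since the exact decimation and normalization conventions are not stated in the text.

```python
import numpy as np

def decimate_and_normalize(F, factor=8):
    """Reduce 128-dim contextual features to 16 dims by keeping every
    `factor`-th coefficient, then z-score each dimension across samples.

    F: (n_samples, 128) feature matrix.
    """
    Fd = F[:, ::factor]                 # (n_samples, 16) after decimation
    mu = Fd.mean(axis=0)
    sigma = Fd.std(axis=0) + 1e-12      # guard against zero-variance dims
    return (Fd - mu) / sigma            # zero mean, unit std per dimension
```

In a train/test split, the mean and standard deviation would be estimated on the training portion only and reused on the test portion to avoid leakage.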

Conclusions
This paper introduces a new acoustical method for automatic livestock monitoring in smart farms. The proposed framework is found to be effective in classifying the various types of cattle sounds analyzed herein. Preliminary experimental results have shown improved performance of the contextual features over MFCC features. Future work includes using a larger dataset to improve performance, as well as analyzing other types of vocalizations (e.g., poultry, sheep), with the aim of delivering the highest levels of animal welfare for precision livestock farming.