Bag-of-Visual-Words for Cattle Identification from Muzzle Print Images

Cattle, buffalo and cow identification plays an influential role in cattle traceability from birth to slaughter, understanding disease trajectories and large-scale cattle ownership management. Muzzle print images are considered discriminating cattle biometric identifiers for biometric-based cattle identification and traceability. This paper explores the performance of the bag-of-visual-words (BoVW) approach in cattle identification using local invariant features extracted from a database of muzzle print images. Two local invariant feature detectors, namely speeded-up robust features (SURF) and maximally stable extremal regions (MSER), are used as feature extraction engines in the BoVW model. The performance evaluation criteria are the identification accuracy, the processing time and the number of features. The experimental work measures the performance of the BoVW model under a variable number of input muzzle print images in the training, validation, and testing phases. The identification accuracy values when utilizing the SURF feature detector and descriptor were 75%, 83%, 91%, and 93% when 30%, 45%, 60%, and 75% of the database was used in the training phase, respectively. In contrast, using MSER as a points-of-interest detector combined with the SURF descriptor achieved accuracies of 52%, 60%, 67%, and 67%, respectively, for the same training sizes. The research findings demonstrate the feasibility of deploying the BoVW paradigm in cattle identification using local invariant features extracted from muzzle print images.


Introduction
Cattle, buffalo and cows are the major sources of meat in the food supply chain and their protection has become a vital need. Cattle identification is the process of accurately recognizing individual animals (buffalo and cows) via a unique physical marker or biometric identifier. Cattle identification is beneficial to different stakeholders, including animal producers, food consumers and the food industry [1]. For instance, cattle identification systems contribute to limiting the spread of animal diseases by allowing a better understanding of disease trajectories and therefore effectively managing cattle vaccination programs. In addition, cattle identification helps in limiting cattle losses, reducing the costs of disease destruction, minimizing trade losses and facilitating cattle ownership management in large-scale farms [2,3].
Conventional buffalo and cow identification methods are divided into three groups: permanent, temporary and electrical identification methods [4]. Traditional identification methods, such as

Related Work
Cattle muzzle prints have received considerable research attention compared to the other animal biometric identifiers [26]. Several approaches have been used for feature extraction and matching of muzzle print images. A joint pixel approach of skin grooves was utilized by Minagawa et al. [19] as a key feature for muzzle print matching. This approach achieved matching scores of 60% and 12%. A database of 170 images collected from 43 animals was used, and 13 samples were excluded due to a feature extraction failure. The rest of the database was matched against itself. Twenty animals were correctly identified with 66.6% total accuracy. Noviyanto and Arymurthy [23] applied SURF and its variant, upright SURF (U-SURF), in cattle identification for extracting muzzle print image features. A database of 120 muzzle print images was collected from 8 animals (15 images of each animal). The main experimental scenario used 10 muzzle print images of each animal as training samples and the remaining 5 images as testing samples. This method achieved 90% identification accuracy under rotation conditions. Awad et al. [22] applied SIFT, followed by the random sample consensus (RANSAC) algorithm to improve the robustness of SIFT feature matching. The identification scenario considered a database of 105 muzzle print images collected from 15 cattle (7 muzzle print images from each animal). The 7 images of each animal were swapped between the enrollment and the identification phases and therefore a confusion matrix with a dimension of 105 × 105 was created from the calculated similarity scores. The proposed SIFT with the RANSAC method achieved an identification accuracy of 93.3%. In Reference [27], a cattle classification approach was proposed based on multiclass support vector machines (SVMs) and texture features extracted by a box-counting scheme.
Furthermore, a SIFT-based method combined with a matching refinement technique and orientation information was introduced by Noviyanto and Arymurthy [28]. The proposed refinement technique was evaluated against the SIFT features using a database of 160 muzzle prints collected from 20 cattle. The achieved accuracy was measured in terms of the equal error rate (EER), where SIFT achieved an EER of 0.0167, while the application of the refinement technique resulted in an EER of 0.0028. Gaber et al. [29] employed Weber's local descriptor (WLD) for feature extraction from muzzle print images combined with the AdaBoost classifier for developing a cattle identification system. The maximum obtained identification accuracy was 99% using a database of 217 muzzle print images from 31 animals.
Recently, convolutional neural networks (CNNs) and deep learning (DL) methods have been introduced and used in many computer vision-related applications, achieving the most success in object detection, auto-driving and text-processing applications [30,31,32,33]. Consequently, deep learning has gained attention in animal biometrics from some research groups [34,35]. For instance, Andrew et al. [36] applied deep learning methods to bovine identification. The authors showed that off-the-shelf networks are able to perform end-to-end individual identification from top-down images acquired by fixed cameras.
To address the problem of swapped and missed animals as well as false insurance claims, Kumar et al. [37] introduced a DL-based approach for cattle identification using the primary patterns of muzzle print images. In this method, the well-known stacked denoising autoencoder scheme was utilized for encoding the extracted features of the muzzle print images. In Reference [38], a neural network and rolling skew histogram were fused for cow identification in the rotary milking parlor. Zhangyong et al. [39] proposed another automated method based on CNNs for the precise identification of dairy cows. Through the cross-validation of a training set and a test set, the recognition accuracy could reach 87% for a single image. Other researchers, in Reference [40], used deep learning for cattle contour extraction and instance segmentation in a real cattle feedlot management environment.

BoVW-Based Cattle Identification
Generally, the bag-of-visual-words (BoVW) technique represents a given image as a collection of local features extracted from image patches or some points of interest in the image. In other words, it maps the image from a set of very high-dimensional features to a list of word numbers. Thus, it is logical to first discuss the motivation behind the local feature extractors utilized in the proposed approach and then explain how their outputs are converted into the visual word space.

Speeded-Up Robust Features (SURF)
The speeded-up robust features (SURF) descriptor [24] was developed as an alternative to the SIFT descriptor. Briefly, the SURF descriptor starts by constructing a square region around each point of interest, oriented along the point's main orientation. The size of this square is $20s$, where $s$ is the scale at which the point of interest is detected. The region inside the square is divided into 4 × 4 smaller subregions and the Haar wavelet responses in the horizontal ($d_x$) and vertical ($d_y$) directions are computed for each subregion at 5 × 5 sampled points, as shown in Figure 1.
To improve the robustness of the descriptor against localization errors and some geometric deformations, these responses are weighted with a Gaussian window. The wavelet responses $d_x$ and $d_y$ are summed for each subregion, which, with the sum of their absolute values, form entries of the feature vector $F_v$; that is,

$$F_v = \left( \sum d_x, \; \sum d_y, \; \sum |d_x|, \; \sum |d_y| \right).$$

This procedure is repeated for all the 4 × 4 subregions, resulting in a feature descriptor of 4 × 4 × 4 = 64 dimensions. To reduce illumination effects and make the descriptor invariant to region size, the feature descriptor is normalized to a unit vector. Applying restrictions (e.g., the number of divisions inside the square region) to the regular descriptor $F_v$ (i.e., SURF-64) results in several extended versions of SURF, such as SURF-36, SURF-128, and U-SURF [24,41]. In this work, the SURF feature descriptor (regular descriptor of length 64) is adopted to describe the image patches due to its balance of computing efficiency and representation capacity. It uses a 64-dimensional feature vector to describe the local features; in contrast, SIFT uses a 128-dimensional feature vector. Additionally, the SURF feature descriptor is more robust to various image perturbations than the SIFT local feature descriptor.
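The assembly of the 64-dimensional descriptor described above can be illustrated with a minimal sketch. It assumes that the Gaussian-weighted Haar responses have already been sampled on a 20 × 20 grid inside the oriented square region; the function name `surf64_descriptor` and the synthetic responses are hypothetical and serve only to show how the 4 × 4 subregions, the four sums per subregion and the final normalization fit together. It is not the implementation used in the experiments.

```python
import math

def surf64_descriptor(dx, dy):
    """Assemble a SURF-64-style vector from 20x20 grids of Haar responses.

    dx, dy: 20x20 nested lists holding the (assumed already Gaussian-weighted)
    horizontal and vertical responses inside the oriented square region.
    """
    assert len(dx) == len(dy) == 20 and all(len(r) == 20 for r in dx + dy)
    vec = []
    for by in range(4):                      # 4 x 4 grid of subregions
        for bx in range(4):
            s_dx = s_dy = s_adx = s_ady = 0.0
            for y in range(5 * by, 5 * by + 5):      # 5 x 5 samples per subregion
                for x in range(5 * bx, 5 * bx + 5):
                    s_dx += dx[y][x]
                    s_dy += dy[y][x]
                    s_adx += abs(dx[y][x])
                    s_ady += abs(dy[y][x])
            # Four entries per subregion: sum dx, sum dy, sum |dx|, sum |dy|
            vec.extend([s_dx, s_dy, s_adx, s_ady])
    norm = math.sqrt(sum(v * v for v in vec))
    # Normalize to unit length (left unchanged if all responses are zero)
    return [v / norm for v in vec] if norm > 0 else vec

# Synthetic responses (illustrative only): a vertical edge through the middle
dx = [[1.0 if x < 10 else -1.0 for x in range(20)] for _ in range(20)]
dy = [[0.0] * 20 for _ in range(20)]
desc = surf64_descriptor(dx, dy)
print(len(desc))  # 64
```

Keeping both the signed sums and the sums of absolute values lets the descriptor distinguish a subregion with a uniform gradient from one in which responses of opposite sign cancel.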

Maximally Stable Extremal Regions (MSER)
The maximally stable extremal regions technique [42] and its fast implementation [25,43] are widely used for detecting blobs in images by extracting a number of covariant regions called MSERs. In this algorithm, the term "extremal" refers to the property that all pixels in an MSER have either higher (i.e., brighter extremal regions) or lower (i.e., darker extremal regions) intensity than all other pixels outside the boundary of that MSER. The extremal regions have two main properties: (i) they are invariant to affine or projective transformations of the image and (ii) they are invariant to lighting variations. Thus, they are scale and rotation invariant as well. The MSER algorithm detects the regions using a connectivity analysis and by computing connected maximal- and minimal-intensity areas in the region and on its outer boundary. It should be noted that other feature descriptors discussed in Reference [44] can be utilized for feature extraction.
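The stability idea behind MSER can be sketched on a toy image. The following simplified example follows a single dark region grown from a seed pixel across all intensity thresholds and picks the threshold at which its area changes least. It is a sketch under stated assumptions: the function names, the `delta` step and the `max_area` limit are hypothetical, and a full MSER detector tracks every connected component of the image (not one seeded region) and reports all local minima of the area variation.

```python
from collections import deque

def region_area(img, seed, t):
    """Area of the 4-connected region of pixels with intensity <= t around seed."""
    h, w = len(img), len(img[0])
    if img[seed[0]][seed[1]] > t:
        return 0
    seen, queue = {seed}, deque([seed])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen
                    and img[ny][nx] <= t):
                seen.add((ny, nx))
                queue.append((ny, nx))
    return len(seen)

def most_stable_threshold(img, seed, delta=1, max_area=None):
    """Threshold at which the seeded dark region varies least in area."""
    h, w = len(img), len(img[0])
    if max_area is None:
        max_area = (h * w) // 2          # assumed cap to skip near-full-image regions
    areas = [region_area(img, seed, t) for t in range(256)]
    best_t, best_q = None, float("inf")
    for t in range(delta, 256 - delta):
        if areas[t - delta] == 0 or areas[t] > max_area:
            continue
        # Relative area change between thresholds t - delta and t + delta
        q = (areas[t + delta] - areas[t - delta]) / areas[t]
        if q < best_q:
            best_t, best_q = t, q
    return best_t, best_q

# Toy image: bright background (200) with a dark 2 x 2 blob (10)
img = [[200] * 5 for _ in range(5)]
for y, x in ((1, 1), (1, 2), (2, 1), (2, 2)):
    img[y][x] = 10
threshold, variation = most_stable_threshold(img, seed=(2, 2))
print(threshold, variation, region_area(img, (2, 2), threshold))
```

In this toy case the blob keeps the same four pixels over a wide range of thresholds, which is exactly the behaviour that makes a region "maximally stable".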

Bag-of-Visual-Words Representation
The BoVW technique has been shown to be successful for a wide range of computer vision applications, including image retrieval [45] and object classification [46,47] as well as action recognition [48,49] with outstanding performance and low storage requirements. Simply, in the basic BoVW model, some local features are extracted from an image using a feature extractor (e.g., SURF) and then the extracted local features are clustered into visual words. That is, the image is described by a histogram of visual word counts instead of low-level features. In this context, this visual vocabulary representation provides a global representation from the local features or a "mid-level" representation that can bridge the large semantic gap between the low-level features extracted from the image and the high-level concepts.
Suppose we have a sequence $X = (x_1, x_2, \ldots, x_n)$ of $d$-dimensional feature vectors obtained by a feature extractor from an image, where $x_i \in \mathbb{R}^d$. The main objective of the BoVW technique is to quantize each sequence $X$ based on a specific vocabulary dictionary $V = \{\nu_1, \nu_2, \ldots, \nu_N\} \subset \mathbb{R}^d$ of $N$ visual words. To achieve this objective, each sequence $X$ can be represented by a histogram of probabilities $p(\nu \mid x)$. In this way, the BoVW histogram $H$ summarizes the whole image by counting how many times each of the visual words occurs in that image:

$$H(\nu_j) = \sum_{i=1}^{n} p(\nu_j \mid x_i), \qquad j = 1, \ldots, N,$$

where, under hard assignment of each feature to its nearest visual word,

$$p(\nu_j \mid x_i) = \begin{cases} 1, & \text{if } j = \arg\min_{k} \lVert x_i - \nu_k \rVert, \\ 0, & \text{otherwise.} \end{cases}$$

The most well-known method for building the visual vocabulary is k-means clustering because of its simplicity and convergence speed. Other methods, such as hierarchical or spectral clustering, can also be used for this task [48]. In this case, the center of every cluster is used as a visual word. The clustering step quantizes the feature space into a small discrete number of visual words. It should be noted that the choice of data plays an important role in creating the visual vocabulary [50].

Figure 2 describes the proposed BoVW-based cattle identification approach. The local features of all the images in the database are extracted using SURF/MSER mechanisms. Then, a 500-word visual vocabulary is created by reducing the number of features via feature space quantization using k-means clustering. Finally, a classification mechanism is considered.
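The vocabulary construction and histogram steps can be sketched as follows. This is a minimal plain-Python illustration rather than the MATLAB pipeline used in this work: the function names are hypothetical, the k-means loop is a basic Lloyd iteration with random initialization from the data, and the toy example uses 2-dimensional features and a 2-word vocabulary instead of the 64-dimensional SURF descriptors and 500 words used in the experiments.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_word(x, vocab):
    """Index of the visual word (cluster center) closest to feature x."""
    return min(range(len(vocab)), key=lambda j: sq_dist(x, vocab[j]))

def build_vocabulary(features, n_words, n_iter=20, seed=0):
    """Basic k-means (Lloyd iterations); returns the cluster centers."""
    rng = random.Random(seed)
    vocab = [list(c) for c in rng.sample(features, n_words)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_words)]
        for x in features:
            clusters[nearest_word(x, vocab)].append(x)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                vocab[j] = [sum(col) / len(members) for col in zip(*members)]
    return vocab

def bovw_histogram(features, vocab):
    """Count how many features of one image fall on each visual word."""
    hist = [0] * len(vocab)
    for x in features:
        hist[nearest_word(x, vocab)] += 1
    return hist

# Toy training features pooled from all images (two obvious clusters)
train_features = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
                  (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
vocab = build_vocabulary(train_features, n_words=2)

# Features of one hypothetical image, quantized against the vocabulary
image_features = [(0.5, 0.5), (0.2, 0.1), (0.9, 0.3), (10.4, 10.6)]
hist = bovw_histogram(image_features, vocab)
print(sorted(hist))  # [1, 3]
```

Whatever the number of local features detected in an image, the resulting histogram has a fixed length equal to the vocabulary size, which is what allows a standard classifier to be applied in the next stage.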

Classification Stage
A kernel support vector machine (KSVM) is applied to the bag-of-visual-words representation to perform the classification task in the proposed cattle identification system. Each animal, represented by 7 muzzle print images, is treated as a separate class. The identification accuracy is measured using different database sizes for classifier training, validation and evaluation purposes.

Experimental Results
The experimental work in this study was conducted on a regular computer equipped with an Intel® Xeon® E5-2667 v2 CPU running at 3.30 GHz with 64 GB of RAM and a 64-bit Windows® operating system. To build a unified testing environment, MATLAB® R2016b was used for code development and execution. The performance evaluation was carried out using a nonstandard muzzle print database of 105 images. The database comprises 7 muzzle print images captured from each of 15 cattle [22]. Examples of the muzzle print images randomly selected from the database are shown in Figure 3.

Bag-of-Visual-Words with SURF Features
The empirical work starts by checking the performance of the BoVW using the SURF features. In this case, the SURF approach was used for both feature detection and feature description operations. To check the BoVW performance under several training conditions, four scenarios involving 30%, 45%, 60% and 75% of the whole database were used as training input. The rest of the database in every scenario was used for testing and validation purposes. A linear kernel support vector machine was used in the last classification stage. To avoid any bias in the results, the database was randomly partitioned in each scenario. The histogram of visual word occurrences was considered, and the total number of visual words was set to 500 words.
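The four training scenarios can be illustrated with a short sketch of a random partition. How the percentages were converted into whole images is not stated here, so the sketch assumes a stratified split in which the training fraction is applied to the 7 images of each animal and rounded to the nearest whole image; the function name and the image identifiers are hypothetical.

```python
import random

def split_per_class(images_by_animal, train_fraction, seed=0):
    """Randomly split each animal's images into training and test subsets."""
    rng = random.Random(seed)
    train, test = [], []
    for animal, images in images_by_animal.items():
        shuffled = images[:]
        rng.shuffle(shuffled)
        # Assumption: the fraction is rounded to whole images per animal
        n_train = round(train_fraction * len(shuffled))
        train += [(animal, im) for im in shuffled[:n_train]]
        test += [(animal, im) for im in shuffled[n_train:]]
    return train, test

# 15 animals with 7 images each (placeholder identifiers, 105 images in total)
database = {a: [f"animal{a:02d}_img{i}" for i in range(7)] for a in range(15)}

sizes = {}
for frac in (0.30, 0.45, 0.60, 0.75):
    train, test = split_per_class(database, frac)
    sizes[frac] = (len(train), len(test))
    print(frac, len(train), len(test))
```

Under this assumption the four scenarios correspond to 2, 3, 4 and 5 training images per animal, that is, 30, 45, 60 and 75 training images out of 105.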
The identification accuracy and the processing time were measured in each scenario and are recorded in Table 1. Both the identification accuracy and the processing time increase with the size of the training dataset. The maximum achieved accuracy is 93% using 75% (75 images of the 105 total images) of the data as the training dataset. This accuracy is acceptable; the remaining errors stem from the high similarity between the muzzle print images and hence between the visual words in the whole database.

Table 1. Performance of BoVW using SURF as the image feature detector and descriptor. The table represents the number of images used in the training process, while the rest of the database is used for the evaluation purpose. The processing time is measured in seconds.

The confusion matrices, shown in Figure 4, confirm the obtained identification accuracies. The figure is aligned with the accuracies in Table 1, where yellow represents the highest score.
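The link between a confusion matrix and the reported identification accuracy can be made explicit with a small worked example. The matrix below is made up for illustration (three classes, six test images) and does not reproduce any result of this study; the function name is hypothetical.

```python
def accuracy_from_confusion(cm):
    """Overall accuracy: correctly classified samples (diagonal) over all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

# Rows are true classes, columns are predicted classes (illustrative values only)
cm = [
    [2, 0, 0],
    [0, 1, 1],   # one image of class 2 is mistaken for class 3
    [0, 0, 2],
]
accuracy = accuracy_from_confusion(cm)
print(f"{accuracy:.3f}")  # 0.833
```

Off-diagonal entries show which animals are confused with each other, which is more informative than the single accuracy figure when muzzle patterns are highly similar.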

Bag-of-Visual-Words with MSER Features
The four aforementioned database scenarios used with the SURF mechanism were carried out again using MSER as a feature point detector. In these scenarios, the MSER detector was used to detect the points of interest, while SURF was employed as a feature descriptor for every detected point. The identification accuracy, processing time and error were calculated in every case and are reported in Table 2. Visual word histograms and confusion matrices were produced during the empirical work but are omitted from the paper to avoid figure redundancy. Table 2 illustrates the degradation in the obtained number of features and identification accuracy. It also shows much shorter processing times than Table 1, which is expected given the small number of extracted features.
In addition, the results obtained using both the SURF and MSER detectors in the BoVW paradigm were compared against other state-of-the-art methods in Table 3. The table confirms that BoVW performs similarly to our previous method, which uses only SIFT features to calculate a similarity score between all the database images. Based on the comparison results in Table 3, utilizing the BoVW technique in cattle identification is feasible in terms of accuracy.

Table 2. Performance of the BoVW using maximally stable extremal regions (MSER) for points-of-interest detection and the SURF descriptor for feature description at every detected point. The table represents the number of images used in the training process, while the rest of the database is used for evaluation. The processing time is measured in seconds.

Discussion
Traditional cattle identification methods such as ear tagging, branding and tattooing are vulnerable to losses, damages and fading. Electronic identification systems such as RFID-based systems involve many security and privacy challenges. Therefore, mapping biometric identifiers for animal identification has emerged as a hot research trend. Biometric-based cattle identification solves several problems in the conventional cattle identification methods [1,51,52].
Central to this study is the evaluation of the performance of the bag-of-visual-words (BoVW) model in cattle, buffalo and cow identification using muzzle print images. The study aims to investigate the feasibility of deploying the BoVW technique for a new biometric-based cattle identification system. To this end, two feature detection and description methods, namely SURF and MSER, were used as feature extraction engines at the heart of the BoVW technique. The study offers two feature extraction scenarios. In the first scenario, SURF is used as a local feature detector and descriptor. The second scenario considers MSER for point-of-interest detection, while SURF is used for feature description. The proposed BoVW-based system was evaluated using a muzzle print database of 7 images for each of 15 cattle. The evaluation database includes 105 images in total and was divided into training and evaluation subsets.
The main difference between using SURF and using MSER for feature detection lies in the identification accuracy. The maximum identification accuracy achieved using SURF is 93%, which is promising compared to other published methods. In contrast, using MSER achieved considerably lower identification accuracy, which is not comparable to the accuracy achieved by using SURF. The 7% error in accuracy results from the similarity between muzzle print images, which makes it difficult to create distinguishing vocabularies for each image. Since this study is the first attempt to use the BoVW approach in cattle identification, it was hard to find similar methods for comparison. Therefore, the comparison was performed with methods that extract local features from cattle muzzle print images.
Although the BoVW approach has achieved reasonable accuracy using the SURF feature detector and descriptor, the achieved results are limited to the small-sized database of 105 images. A standard muzzle print image database and benchmarks for identification accuracy are still missing in the cattle identification domain [1]. The empirical work performed in this study has shown that BoVW can be utilized in cattle identification; however, the BoVW feature extraction method deserves further consideration. Furthermore, this study opens the door for future investigations of cattle identification using BoVW combined with machine and deep learning techniques.
Despite the reasonable achievements by the proposed BoVW approach, the research field of cattle identification is still far from complete, especially in unconstrained environments. Thus, addressing real-world challenges such as occlusion, illumination and cattle viewing distances is a must. To this end, the following directions of future research are suggested: (1) using k-means clustering and well-defined distance measures could be helpful for further enhancing the performance of the BoVW approach; (2) exploring other feature extraction algorithms may help solve the lack of discriminative power in the BoVW model; (3) encoding schemes, pooling and normalization strategies, and fusion techniques are the main steps in any BoVW framework, so searching for new alternatives may improve the performance; (4) it is natural to consider combining recent techniques such as convolutional neural networks with the BoVW approach.

Conclusions
Cattle identification using animal biometric identifiers is still a challenging problem. A robust and accurate cattle identification mechanism is vital for protecting livestock, limiting livestock producers' losses to disease and facilitating cattle ownership management. This paper has explored the performance of the BoVW paradigm in cattle identification using SURF and MSER as engines for BoVW feature detection and description from cattle muzzle print images. The experiments have shown that the BoVW model can be applied in building a cattle identification system. In addition, the study has confirmed the superiority of using SURF for feature detection and description, with 93% identification accuracy compared to the 67% obtained by combining MSER and SURF for points-of-interest detection and description, respectively. The processing time differed markedly between SURF and MSER. The time required for processing 75 images, which corresponds to the maximum achieved accuracy, was measured as 89.5 s with SURF and 28.0 s with MSER. Although the empirical study has shown the feasibility of applying the BoVW approach in cattle identification using muzzle print images, special attention should be given to the quality of collected muzzle print images, the size of the training dataset and the feature extraction mechanisms. In future work, we will endeavor to build a large-scale muzzle print image dataset to further evaluate and enhance the performance of the proposed BoVW approach.