VEHICLE LOGO RECOGNITION WITH REDUCED-DIMENSION SIFT VECTORS USING AUTOENCODERS

Vehicle logo recognition has become an important part of object recognition in recent years because of its usage in surveillance applications. In order to achieve a higher recognition rates, several methods are proposed, such as Scale Invariant Feature Transform (SIFT), convolutional neural networks, bag-of-words and their variations. A fast logo recognition method based on reduced-dimension SIFT vectors using autoencoders is proposed in this paper. Computational load is decreased by applying dimensionality reduction to SIFT feature vectors. Feature vectors of size 128 are reduced to 64 and 32 by employing two layer neural nets called vanilla autoencoders. A dataset consisting of Medialab vehicle logo images [9] and other vehicle logo images obtained from the Internet, is used. The dataset may be reached at [10]. Results suggest that the proposed method needs less memory space than half of the original SIFT based method’s memory requirement with decreased processing time per image in return of a decrease in the accuracy less than 20%.


INTRODUCTION
The well-known feature extraction method SIFT [1] is used for large number of tasks in computer vision that requires object recognition and point matching between different scenes.SIFT descriptors are translation, rotation and scale invariant, robust to illumination variations and very useful on real-world tasks.But matching process is expensive because SIFT feature vectors are 128 dimensional and calculating distance between these vectors is very time consuming.Several methods are proposed to improve comparing process.
We propose a SIFT method which is more time and memory efficient than traditional SIFT method for vehicle logo recognition.

SIFT
SIFT (scale invariant feature transform) is a successful method to find interest points in an image.It basically consists of four steps:

Scale Space Extrema Detection
Laplacian of Gaussian (LoG) which is a blob detector can be used to obtain different sized blobs, via different sigma values.But LoG is a costly function, hence a similar function which is Difference of Gaussians(DoG) is used with different sigma values.The difference of The Gaussian blurred images with different sigma values are obtained in DoG process.These new images are formed a pyramid over scale space.In this pyramid, local extrema are marked as possible keypoints.

Keypoint Localization
The extrema points consist of strong interest points, edge points and points that have low contrast.In order to get rid of the low-contrast points, points are firstly localized with more precision and then the points whose intensity value are less than a threshold, are removed.In order to get rid of the edge points, an algorithm that is similar to Harris corner detector is used.

Orientation Assignment
In order to have rotation invariant keypoints, the orientations are defined for each keypoint according to their neighborhood.This process consists of determining neighborhood, computing orientation histogram which covers 360 degrees, weighting the histogram, then taking the highest peak and values more than 80% of it in the histogram and finally computing orientation.Hence different oriented keypoints at the same location and scale are obtained.

Keypoint Descriptor
At the previous steps, the keypoints that have scale, orientation and image location information were obtained.In other words, the keypoints are now invariant to these variables.In this step descriptor for the local image region which is invariant to remaining variables such as illumination change and local shape distortion too, is calculated.
The obtained descriptors are 128 dimensional vectors.Information in the keypoint descriptor comes from 16x16 neighborhood of the keypoint.8 bin orientation histogram value is computed for each 4x4 sized sub-blocks which are the pieces of 16x16 neighborhood.

AUTOENCODERS
Autoencoders are neural networks that tries to generate its input with minimum difference.In other words, if weight and bias sets are W and b and input is x, the autoencoder tries to map input x to output x' with hW,b(x) = x'.The goal is finding best set of parameters (W,b) that minimizes difference x-x'.Like other neural networks applications, trying to minimize x-x' is a convex optimization problem and solved with gradient based methods.
Autoencoder neural network is an unsupervised learning algorithm, it takes no label.When autoencoder is shallow and vanilla type, it learns similar to PCAs.Stacking more layers of neurons makes it easier to find correlation between different components because each layer adds a nonlinear operation with activation function.However, if capacity of autoencoder is too large, network just copies instead of finding useful features.
Generally, there are several error functions used in autoencoders.Eq. ( 1) and ( 2) are two frequently used error functions where (1) is L2 norm and (2) is cross entropy loss that is used when the input is bit probability, (1) where k is the index of hidden unit.In this study L2 norm as error function is used.There are several autoencoder types such as sparse autoencoders, denoising autoencoders, and variational autoencoders.In this study we used vanilla autoencoders.

METHOD
Our dataset consists of 90 cropped vehicle logo images which is obtained from 9 car brand [9].In [1], after SIFT vectors are obtained, to compare vectors and find matchings, cosine distances were calculated.Instead of calculating cosine distance between 128 dimensional vectors, we use cosine distance between 64 dimensional vectors which are reduced vectors by Autoencoders.
Figure 1 shows the Autoencoder architecture we implemented.This is a traditional autoencoder model which has 2 encoder and 2 decoder layers with sigmoid activation function.Left-most rectangle is input SIFT feature vector with 128 dimensions.Right-most rectangle represents generated SIFT feature vector.Encoder layers encodes 128 dimensional input vectors to 64 dimensional vectors while decoder layers decode this vector back to 128 dimensions.Cost function is squared distance between original input and generated output.Our purpose is to find suitable representation for these vectors with lower number of components.This is a symmetrical network, first and second encoder layers have 128 and 64 number of units, respectively.Encoder and decoder layers are symmetric; first decoder layer has 64 number of unit while second has 128.

EXPERIMENTAL RESULTS
We used two programming languages for different parts of project.Firstly, logo images are cropped manually from all vehicle images in the dataset.Then SIFT vectors are obtained with [8] in MATLAB.The autoencoder is implemented on python with TensorFlow [6] and reduced vectors are compared on MATLAB.
Implemented autoencoder architecture is trained with 5621 SIFT vectors through 30,000 iterations.Learning rate and batch size are set to 10 -4 and 256 respectively.Loss function is optimized with Adam [7] optimizer.Plot of loss obtained from Eq. (1) vs iteration number can be seen from Figure 2.During training process, a single CPU is used.

Figure 2. Loss results per iteration
The recognition performance of the proposed method is measured using the "accuracy" metric, A, defined by: where TM is the number of true matches and N is the number of all images in the data set.81% accuracy is achieved after training through 30,000 iterations while matching process with 64 dimensional vectors.In addition to this, 32 dimensional vectors are obtained and tested.Memory usage which is the occupied memory by vectors obtained from all images in data set and logo data set, is computed The results are shown in Table 1.

CONCLUSIONS AND FUTURE WORK
A reduced-dimension SIFT features based vehicle logo detection method is proposed using autoencoders.The dimension reduction of features is achieved with twolayer neural network structures called vanilla autoencoders.
Results indicate that by employing the proposed dimension reduction technique, an accuracy decrease of less than 20% yields a memory space saving of more than 100% along with a reduced processing time requirement per image.
Future work consists of 4,8 and 16 dimensional reduced vectors, quantization and binarization of these vectors and using different similarity measures, such as Jaccard and Manhattan measures.

Table 1 .
Results that are obtained with 128, 64 and 32 dimensional feature vectors.

Table 2 .
Examples of input logo images, corresponding true matches and algorithm results obtained utilizing (a.) 128, (b.) 64 and (c.) 32 dimensional SIFT feature vectors ).Using binarization and hashing for efficient SIFT matching.Journal of Visual Communication and Image Representation, 30, 86-93.