Classification of Non-Conventional Ships Using a Neural Bag-Of-Words Mechanism

The existing methods for monitoring vessels are mainly based on radar and automatic identification systems. Additional sensors that are used include video cameras. Such systems feature cameras that capture images and software that analyzes the selected video frames. Methods for the classification of non-conventional vessels are not widely known. These methods, based on image samples, can be considered difficult. This paper is intended to show an alternative way to approach image classification problems; not by classifying the entire input data, but smaller parts. The described solution is based on splitting the image of a ship into smaller parts and classifying them into vectors that can be identified as features using a convolutional neural network (CNN). This idea is a representation of a bag-of-words mechanism, where created feature vectors might be called words, and by using them a solution can assign images a specific class. As part of the experiment, the authors performed two tests. In the first, two classes were analyzed and the results obtained show great potential for application. In the second, the authors used much larger sets of images belonging to five vessel types. The proposed method indeed improved the results of classic approaches by 5%. The paper shows an alternative approach for the classification of non-conventional vessels to increase accuracy.


Introduction
Ship classification is an important process in practical applications in different places. In coastal cities, ships enter from the mouth of a river or moor at ports. This type of activity is quite often reported and recorded. However, for measurement, statistical, or even analytical purposes, it is often necessary to record vessels that arrive but do not report anywhere. To this end, the simplest solution is to create a monitoring system and analyze acquired images. This type of system architecture is based primarily on three main components: video recording, image processing, and classifying possible water vehicles.
While the solution itself seems simple, each component has its disadvantages, which also affect the others. First, the video recorder may be a simple camera, but often one needs to take good-quality photos for easier analysis. The second component is image processing. Image processing should consider the location of a possible ship on an image, or even perform some extraction of features. It is particularly important to remove unnecessary areas such as the background, houses, and even water. The third element is classifying these images, i.e., based on the obtained images, the algorithm should determine with some probability the type of ship.
In this paper, we considered the third aspect of such a system to model a solution enabling the most accurate classification of a given type of ship based on a photo entered into the system. In the analyzed system [1] an important element was the recording of information about passing vessels in

Bag-Of-Words
Bag-of-words is an abstract model used in the processing of text or graphics. It is the representation of data described in words, i.e., linguistic values. In the case of two-dimensional images, a word can describe a feature or fragment of an object. The idea of using a bag can help the classification process, because the input image will be decomposed into smaller fragments and classified according to certain linguistic values. These values can help in the classification of larger objects. This is especially desirable when analyzing similar objects that differ only in small features.
The proposed idea consists of extracting small fragments of the image with certain features. All points are divided according to a certain metric into smaller images containing a fragment of the object. Such images can represent anything, because the object can be on any background; for instance, a ship can be captured in the port or against a background of trees, in which case smaller images can even show some trees. Thus, the use of a classical approach, that is, the creation of a bag-of-words using an algorithm such as k-nearest neighbors, is not very effective. The reason is the lack of connection between the features (in smaller parts): the objects can be on different scales, rotated by a certain angle, or affected by noise, such as bad weather or additional objects. That is why we propose a bag-of-words model based on more complex structures, such as neural networks.

Feature Extraction
The main idea of this study was to extract features using one of the classic algorithms for obtaining keypoints, such as scale-invariant feature transform (SIFT) [21], speeded up robust features (SURF) [22], features from accelerated segment test (FAST) [23], or binary robust invariant scalable keypoints (BRISK) [24], and then create samples with the found features. It should be noted that if these algorithms processed the original image, the found points would probably cover the entire image. In the case of a simple image where a ship is at sea, all points could be placed on the ship, the water, or the waves, but an image may also contain additional background with many possible points. To remedy this, the image must be processed in the first step, which means using graphic filters to minimize elements such as edges or points. We used only two filters: gamma correction and blur.
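The preprocessing step can be illustrated with a minimal sketch. It assumes a grayscale image stored as rows of integers in the range 0 to 255; the function names, the gamma value, the box-blur kernel, and the order of the two filters are assumptions made for illustration and are not specified in the text.

```python
def gamma_correct(image, gamma):
    """Gamma correction for a grayscale image given as rows of 0-255 integers."""
    return [[round(255 * (p / 255) ** gamma) for p in row] for row in image]


def box_blur(image, radius=1):
    """Blur by averaging each pixel with its neighbors that lie inside the image."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [
                image[a][b]
                for a in range(max(0, i - radius), min(h, i + radius + 1))
                for b in range(max(0, j - radius), min(w, j + radius + 1))
            ]
            row.append(round(sum(window) / len(window)))
        out.append(row)
    return out


def preprocess(image, gamma=1.5, radius=1):
    """Assumed order: gamma correction first, then blur, before keypoint detection."""
    return box_blur(gamma_correct(image, gamma), radius)
```

In practice an image library would be used for both filters; the pure-Python version only makes the arithmetic explicit.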

Feature Extraction Based on Keypoints
Using the described algorithms, we obtained a set of keypoints, which we can describe as $A = \{(x_0, y_0), (x_1, y_1), \ldots, (x_{n-1}, y_{n-1})\}$. To minimize the number of points (because unnecessary elements of the image can be indicated), all points were checked against their neighbors. If the point had a neighbor within a certain distance $\alpha$, it remained in the set. Otherwise, the point was removed, and the cardinality was reduced by one. The distance between two points $p_i = (x_i, y_i)$ and $p_j = (x_j, y_j)$ was checked using one of the two classic metrics, Euclidean or river. The best known is the Euclidean, modeled as
$$d_{Euclidean}(p_i, p_j) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}.$$
A river metric is the distance between points but counted relative to a certain straight line between the points. For both points, a perpendicular projection onto this line is made, as a result of which an additional two points are obtained, $(x_o, y_o)$ and $(x_p, y_p)$. The distance in this metric is calculated as the sum of the distance of a given point to the straight line, the distance between these two points on the straight line, and the transition from the straight line to the second point. Formally, it can be stated as
$$d_{river}(p_i, p_j) = d_{Euclidean}(p_i, (x_o, y_o)) + d_{Euclidean}((x_o, y_o), (x_p, y_p)) + d_{Euclidean}((x_p, y_p), p_j).$$
Sensors 2020, 20, 1608
Depending on the given metric, all points are checked in this way, and points without a sufficiently close neighbor are removed. The next step is to divide the points into subsets $B_q$, where $q$ is the number of objects. It is not possible to adjust the value of $q$ without empirically checking and testing the data in the database. If this value were known, existing algorithms could be used to divide these points (for example, the k-nearest neighbors algorithm). However, this value is unknown, so another approach should be taken. For this purpose, one of the previously described metrics can be used.
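The two metrics and the neighbor check can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the reference line of the river metric is passed as coefficients (a, b, c) of ax + by + c = 0 with the default y = 0, and a point counts as having a neighbor when the distance is at most alpha. The text does not specify how the line is chosen.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def project(p, line):
    """Perpendicular projection of p onto the line a*x + b*y + c = 0."""
    a, b, c = line
    t = (a * p[0] + b * p[1] + c) / (a * a + b * b)
    return (p[0] - a * t, p[1] - b * t)


def river(p, q, line=(0.0, 1.0, 0.0)):
    """Point to line, along the line, then line to the second point."""
    fp, fq = project(p, line), project(q, line)
    return euclidean(p, fp) + euclidean(fp, fq) + euclidean(fq, q)


def keep_points_with_neighbors(points, alpha, metric=euclidean):
    """Keep only points that have at least one other point within alpha."""
    kept = []
    for i, p in enumerate(points):
        if any(metric(p, q) <= alpha for j, q in enumerate(points) if j != i):
            kept.append(p)
    return kept
```

For example, with the default line y = 0, the river distance between (0, 3) and (4, 2) is 3 + 4 + 2 = 9, whereas the Euclidean distance is about 4.12, so the river metric treats the same pair of points as much farther apart.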
For all points in a given set $A$, the average distance value is calculated as
$$\xi(A) = \frac{2}{n(n-1)} \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} d_{metric}(p_i, p_j).$$
With the average distance, the points are divided with respect to this value. The first subset is created by adding the first point to it, i.e., $(x_0, y_0) \in B_0$. Then, for each point $(x_r, y_r) \in A$, we check to see if the distance between this point and any point already in the subset, $(x_0, y_0) \in B_0$, is less than the average distance of the set, i.e.,
$$d_{metric}((x_r, y_r), (x_0, y_0)) < \xi(A).$$
If the above inequality is met for a point $(x_r, y_r)$, it is added to subset $B_0$ and removed from $A$. In the case where none of the points is added to a given subset, another subset, $B_1$, is created. Then, the first point from $A$ is added to subset $B_1$ and removed from $A$. In this way, the action is repeated until the stop condition is met, which is the emptiness of the set, $A = \emptyset$.
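The division procedure can be sketched as below. This is only an illustration under assumptions: the average is taken over all pairs of distinct points and computed once on the full set, the Euclidean metric is used by default, and the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def average_distance(points, metric=euclidean):
    """Mean distance over all pairs of distinct points (0.0 for fewer than two)."""
    pairs = list(combinations(points, 2))
    return sum(metric(p, q) for p, q in pairs) / len(pairs) if pairs else 0.0


def split_into_subsets(points, metric=euclidean):
    """Grow subsets B_0, B_1, ... until the working set A is empty."""
    remaining = list(points)                  # the set A
    xi = average_distance(remaining, metric)  # threshold, computed once
    subsets = []
    while remaining:
        current = [remaining.pop(0)]          # seed a new subset with the first point
        added = True
        while added:                          # repeat until no point joins the subset
            added = False
            for p in list(remaining):
                if any(metric(p, q) < xi for q in current):
                    current.append(p)
                    remaining.remove(p)
                    added = True
        subsets.append(current)
    return subsets
```

For two well-separated groups of points, such as three points near the origin and two points near (10, 10), the average distance is dominated by the cross-group pairs, so each group ends up in its own subset.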
As a result, subsets $B$ are generated, with each representing one feature. For each set, an image is created whose dimensions will depend on the subset. To find the dimensions, we look for the maximum and minimum values of both coordinates in a subset, which we can mark as $x_{max}$, $y_{max}$, $x_{min}$, and $y_{min}$. Hence the image size will be $(x_{max} - x_{min}) \times (y_{max} - y_{min})$. Then, the images are saved and each one represents a part of the image. The left part of Figure 1 shows this process of extracting smaller parts of the image. Figure 1 shows a graphic visualization of the proposed model.
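Cutting a segment out of the image from one subset of keypoints can be sketched as follows. The sketch assumes integer pixel coordinates, an image stored as a list of rows indexed by y, and hypothetical function names.

```python
def segment_box(subset):
    """Return (x_min, y_min, x_max, y_max) for a subset of (x, y) keypoints."""
    xs = [x for x, _ in subset]
    ys = [y for _, y in subset]
    return min(xs), min(ys), max(xs), max(ys)


def crop(image, subset):
    """Cut out a region of size (x_max - x_min) by (y_max - y_min)."""
    x_min, y_min, x_max, y_max = segment_box(subset)
    return [row[x_min:x_max] for row in image[y_min:y_max]]
```

A subset whose points share the same x or y coordinate yields an empty segment under this definition, so such subsets would need to be skipped or padded in practice.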

Classification with Bag-of-Words
Unfortunately, there was no unambiguous method to assign attributes to specific groups automatically. Therefore, we suggested creating groups at the initial stage of modeling the solution with the help of empirical division. In this way, a basic database of features was created, which later forms the bag-of-words.

Convolutional Neural Network
One of the most important branches of artificial intelligence methods is neural networks, which have been modeled for the needs of graphic image classification. Convolutional neural networks are models inspired by image processing by the cerebral cortex of cats. It is a mathematical structure built of three types of layers, where the layers between them are connected by synapses burdened with a certain weight. The weight is given randomly while creating the structure. Then, in the training process, the weights are modified to best match the training database.
One of the key layers of the network is the convolutional layer, which takes an input image with dimensions $w \times h \times d$, where $w$ and $h$ are the width and height of the image and $d$ is understood as depth and depends on the model. For color images saved in the red-green-blue (RGB) model, the depth will be 3 due to the number of components. Formally, each image is saved as a set of matrices, each of which describes the image values for a given component. The convolutional layer works on the principle of an image filter $f$ of size $k \times k$. This filter is a matrix with $k^2$ coefficients defined randomly and modified during the training process. This filter is moved over the image and changes the value of pixel $p$ on image $I$ at position $(i, j)$, which can be defined as
$$I'(i, j) = \frac{1}{K} \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} f(a, b) \cdot I\left(i + a - \lfloor k/2 \rfloor,\; j + b - \lfloor k/2 \rfloor\right),$$
where $I'(i, j)$ is the new pixel value, matrix $f$ is located over the image so that the central point of the matrix is over the pixel at position $(i, j)$, and $K$ is the sum of all weights of filter $f$. The main purpose of this layer is feature extraction and reduction of data redundancy in the image. Applying a filter to the image will change it; depending on the coefficients of the filter, some objects might be deleted or highlighted. The second type of layer is called pooling, which has only one purpose: to reduce the size of matrices. The reduction depends on some function $g(\cdot)$, which selects one pixel from each $m \times m$ square. The most commonly used function is $\max(\cdot)$ or $\min(\cdot)$.
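The two layer types can be illustrated on a single-channel image. This is a minimal sketch, assuming an odd filter size, no padding (the filter is applied only where it fits entirely inside the image), non-overlapping pooling squares, and hypothetical function names; real frameworks handle channels, strides, and padding in more general ways.

```python
def convolve(image, kernel):
    """Apply a k x k filter centered on each pixel where it fits (no padding)."""
    k = len(kernel)
    r = k // 2
    # K is the sum of the filter weights; fall back to 1 if the weights cancel out.
    weight_sum = sum(sum(row) for row in kernel) or 1
    out = []
    for i in range(r, len(image) - r):
        row = []
        for j in range(r, len(image[0]) - r):
            acc = sum(
                kernel[a][b] * image[i + a - r][j + b - r]
                for a in range(k)
                for b in range(k)
            )
            row.append(acc / weight_sum)
        out.append(row)
    return out


def max_pool(image, m):
    """Keep the maximum of each non-overlapping m x m square."""
    return [
        [
            max(image[i + a][j + b] for a in range(m) for b in range(m))
            for j in range(0, len(image[0]) - m + 1, m)
        ]
        for i in range(0, len(image) - m + 1, m)
    ]
```

With a 3 x 3 filter of ones, the convolution reduces to a local average, which shows how the normalization by the sum of the weights keeps the output in the same value range as the input.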
These two layers can be used alternately many times. In the end, there is the last layer, the fully connected layer, which is understood as a classical neural network. Each pixel from the last layer (pooling or convolutional) is input as a numerical value. This layer is composed of columns of neurons connected by synapses, each of which carries some weight. Each neuron gets a numerical value that is processed and sent to the next column. This operation can be described as
$$x_m^{t} = \sigma\left(\sum_{i} \omega_i x_i^{t-1}\right),$$
where $x_m^{t}$ is the output from neuron $m$ in layer $t$, $\omega_i$ is the weight on the connection between $x_m$ in layer $t$ and $x_i$ in layer $t-1$, and $\sigma$ is the activation function of the neuron. The number of columns and neurons depends on the modeled architecture. In the last column, there should be $k$ neurons (when the classification process is described as a $k$-class problem). The final calculation for an image in such a structure gives a vector of values that can be normalized into a probability distribution by a function such as softmax. These values are understood as the probabilities of belonging to each class. Unfortunately, all weights in this model are generated randomly at the beginning. To change these values, a training algorithm must be used. The main idea is to minimize the loss function over successive iterations. One such algorithm commonly used in convolutional networks is adaptive moment estimation [25]. The modification of weights is based on basic statistical coefficients, namely the corrected values of the mean $m$ and variance $v$:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t},$$
where $m_t$ and $v_t$ are the mean and variance values in the $t$th iteration. The formulas for calculating these can be presented as
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2,$$
where $\beta_1$, $\beta_2$ are distribution values and $g_t$ is the gradient of the loss function in the $t$th iteration. Those two statistical coefficients are used in the modification of a weight as
$$\omega_t = \omega_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$
where $\eta$ is the learning coefficient and $\epsilon \approx 0$ is a small constant that prevents division by 0.
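The softmax normalization and a single weight update of adaptive moment estimation can be sketched for one scalar weight. The default hyperparameters below are the values commonly used with this optimizer, not values reported in the text, and the quadratic loss in `minimize_quadratic` is a hypothetical toy problem chosen only to show the update in action.

```python
import math


def softmax(values):
    """Normalize raw outputs into a probability distribution."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update of weight w in iteration t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias-corrected mean
    v_hat = v / (1 - beta2 ** t)               # bias-corrected variance
    w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


def minimize_quadratic(steps=2000, eta=0.05):
    """Toy example: minimize (w - 3)^2 starting from w = 0."""
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2 * (w - 3)
        w, m, v = adam_step(w, grad, m, v, t, eta=eta)
    return w
```

In the first iteration the corrected mean equals the gradient and the corrected variance equals its square, so the weight moves by almost exactly the learning coefficient regardless of the gradient's magnitude.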

Bag-Of-Words
A trained classifier can be used to divide incoming images into selected elements in a bag. For each image, smaller images representing features are created. Each of these features is classified using the pretrained convolutional neural network. As a result, the network will return the probability of belonging to each word in the set (each single output from the network is interpreted as a word). Based on these probabilities and features, it is possible to assign these attributes to an object. The selection of features for an object works on the principle of determining conditional affiliation to another word in the bag. To prevent a whole object from being stored among its own characteristics, it is worth dividing the bag into two sets (or even two bags). The first bag will contain only features and the second full objects. For a better understanding of this idea, let us assume that the image presents a motorboat. The larger bag will contain classes of ships, such as motorboat, yacht, etc. The smaller bag (inside the larger one) will describe one ship. For a motorboat, these words would be, for example, "a man", "waves", and "no sails".
Each of these objects is defined as a numerical vector consisting of zeros and ones (a one means belonging to the given class). Each item in the vector is assigned to one feature from the bag-of-words, so the vector is created from the results returned by the classifier. It should be noted that for the many smaller segments of a basic image, there will be many classification results. These results are averaged over all decisions returned by the classifier.
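The averaging of segment-level results into one feature vector per image can be sketched as follows. The sketch assumes that each segment yields one list of per-word probabilities of equal length; the function name and the example numbers are hypothetical.

```python
def average_outputs(segment_outputs):
    """Average per-word probabilities over all segments of one image."""
    n = len(segment_outputs)
    return [sum(column) / n for column in zip(*segment_outputs)]


# Hypothetical outputs for three segments and two words (e.g., masts and sails).
outputs = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.0]]
feature_vector = average_outputs(outputs)
```

Here the averaged vector is close to [0.8, 0.2], which suggests that the first word is present in the image and the second is not.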
The assignment of a feature vector to an object is made by comparing these vectors. The simplest method is to round the values returned by the network to integers and compare them with the words in a bag. However, there may be a situation where the vector differs from the patterns in one position. To handle this, we suggest using the k-nearest neighbors algorithm, which allows the vector to be assigned to the closest object. The full process is shown in Figure 1.
The k-nearest neighbors algorithm consists of analyzing and assigning the sample to neighboring samples [26,27]. Suppose that the value $x_i$ has an assigned class $\mu_i$. In the case of the analyzed problem, $x_i$ will correspond to 1 and values of $\mu_i$ are the appropriate values representing the objects. The algorithm finds the nearest neighbor (value) $x_n \in \{x_0, x_1, \ldots, x_{n-1}\}$ for the given value $x$ according to the following equation:
$$d(x_n, x) = \min_{i = 0, \ldots, n-1} d(x_i, x),$$
where $d$ is the chosen distance between two vectors.
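Matching a feature vector against the word patterns can be sketched as below. The Hamming distance on rounded vectors, the function names, and the two example patterns are assumptions made for illustration; the patterns are hypothetical and are not the word definitions used in the experiments.

```python
from collections import Counter


def binarize(vector, threshold=0.5):
    """Round averaged network outputs to zeros and ones."""
    return [1 if value >= threshold else 0 for value in vector]


def hamming(u, v):
    """Number of positions at which two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def knn_classify(vector, patterns, k=1):
    """Majority label among the k patterns closest to the binarized vector."""
    query = binarize(vector)
    ranked = sorted(patterns, key=lambda item: hamming(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical word patterns over five features.
patterns = [
    ([1, 1, 1, 0, 0], "yacht"),
    ([0, 0, 0, 1, 1], "cargo"),
]
label = knn_classify([0.9, 0.8, 0.3, 0.1, 0.2], patterns)
```

The query rounds to [1, 1, 0, 0, 0], which differs from the first pattern in one position and from the second in four, so it is still assigned to the first pattern even though it matches neither exactly.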

Experiments
In our experiments, we tested two databases. The first one had two classes, sailing ship and others, and was used to create the first set of features and find the best combination of algorithms. The second database contained more classes and a larger number of samples to show the potential application of such an approach.

Classification for Two Classes of Ships
In these experiments, we tested the proposed solution to find the best combination of algorithms for our proposition. For this purpose, the database we used was very small. It contained two classes, sailing ship and others. A sailing ship should have sails, although they do not always have to be spread. Such an observation allows the creation of two features describing this object, i.e., masts and sails. In this way, a vector describing these two classes will be created: where individual values are understood as appropriate features, masts and sails. In these tests, a CNN architecture as described in Table 1 was used. In the experiments, we used a database containing 800 images (600 with sailing ships and 200 with other ships). In the training process, 75% of the samples selected randomly from each class were used, and the remaining 25% (150 and 50 images, respectively) were used for the validation process.
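The per-class random split can be sketched as follows. The function name, the fixed seed, and the use of integer identifiers in place of images are assumptions made for illustration.

```python
import random


def split_class(samples, train_fraction=0.75, seed=0):
    """Randomly split the samples of one class into training and validation parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer identifiers stand in for the 600 and 200 images of the two classes.
sailing_train, sailing_val = split_class(range(600))
other_train, other_val = split_class(range(200))
```

Splitting each class separately keeps the class proportions the same in both parts, which gives the 150 and 50 validation images stated above.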
For each sample, one of the keypoint algorithms was used, which allowed us to create a few smaller segments. We tested the algorithm for each segment, and the results of two selected metrics, Euclidean and river, are presented in Table 2. In the table, for each algorithm, there are two columns labeled "Object features" and "Background", which indicate whether or not the extracted segment describes an important feature of a ship. Quite a common problem was that a segment showed mostly background, i.e., an insignificant fragment of the ship and a large amount of sky or sea. The results shown are averaged over the entire database. It is easy to see that using the Euclidean metric generates many more features compared to the river metric. In both cases the ratio of images depicting features of the background exceeded 50%; however, the excess was smaller for the classic Euclidean metric. In our tests, we used the SIFT, SURF, BRISK, and FAST algorithms to find keypoints. After that, all found segments were resized to one size and classified using the CNN. The results obtained for each image were averaged and classified using the k-nearest neighbors algorithm (in this experiment, k = 2) and are presented in Tables 3 and 4 (the results in the second column represent classification of the whole image). Some examples of keypoint clustering are presented in Figure 2. The highest efficiency was obtained with the Euclidean metric using the SURF algorithm. For this combination, the classification accuracy was nearly 6% higher than with the convolutional network alone.
However, it is worth noting that the most significant difference between the results obtained is in the negative predictive value, which was almost twice as high when using the bag mechanism. This measure is the probability that a sample assigned to the negative class (in this case, not a sailing ship) truly belongs to it. The situation is similar for the other hybrids, where this value is always higher than 50%. A similar situation occurred with the F1 score, which is the harmonic mean of the precision and recall coefficients. This measure allows us to evaluate the classification when its components have different values. In each case, the statistical coefficients indicated a more accurate process when the proposed mechanism was taken into account.
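The statistical coefficients discussed here follow from the four entries of a binary confusion matrix. The sketch below uses the standard definitions; the function name is hypothetical, and the counts in the usage line are made-up numbers for a validation set of 150 positive and 50 negative samples, not results from the experiments.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures from true/false positive and negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        # Share of samples predicted negative that are truly negative.
        "npv": tn / (tn + fn),
        # Harmonic mean of precision and recall.
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=140, fp=20, tn=30, fn=10)
```

With these made-up counts the accuracy is 0.85 while the negative predictive value is only 0.75, which shows how an unbalanced set can hide weak performance on the smaller class behind a high overall accuracy.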
For a more detailed analysis, time measurements were also made for the image processing and training of a given architecture, as shown in Figure 3. The presented results are averaged data from 10 tests. In general, using the Euclidean metric takes approximately 10% less time than using the river metric. The tests showed that the longest processing time occurred using the FAST algorithm and the shortest with BRISK. For the SIFT and SURF algorithms, the measured times were at a similar level and fell in the middle.

Classification for Five Classes of Ships
Based on the previous results, the best accuracy was achieved with a combination of the SURF algorithm and CNN. We used this combination for classification of five classes: cargo (2120 images), military (1167 images), tanker (1217 images), yacht (688 images), and motorboat (512 images). For the first three classes, images were downloaded from a publicly available dataset from the Deep Learning Hackathon organized by Analytics Vidhya. Each class was divided randomly into two sets in a 75%:25% (training/validation) ratio, and for the training process the data were split in the same proportion. Using the training set, the SURF algorithm was used to create smaller parts, and based on the created sets, these samples were assigned to features, which can be described as the following vector: [mast, sail, people, color, simplyShape], where people means that some people can be found on deck, color means that a boat can have different colors (for a military ship, it is mainly gray), and simplyShape means that the ship can be recognized as a simple geometric figure, such as a rectangle. These features were chosen according to the database used and their possible location.
Using these features, words describing ship type were defined as follows: The training database contained 4278 images, which resulted in almost 26,000 smaller segments. Data were split into features based on color clustering using the k-means algorithm [28] and corrected empirically (especially for shape). We trained the classifiers with the architecture described in Table 1, but with five outputs in the final layer because of the five classes of features. The classifier was trained for five different numbers of iterations, $t \in \{20, 40, \ldots, 100\}$, and the accuracy is presented in Table 5. The best accuracy was reached using 80 iterations; accuracy did not improve with more iterations. The obtained accuracy is not very promising in such a classification problem. The main cause of this is the selection of features and the creation of sets for them. In the experiments, the dataset was so big that whether a sample belonged to a set was determined by the algorithm. Moreover, a feature such as shape is not the best choice for ships.
Despite these drawbacks, we conducted an additional experiment to check the classification result for this database in terms of hybrids. We trained a CNN in the classic way to classify full images. Next, we checked its effectiveness on the validation set. Then we combined the obtained results from this classification with the proposed solution. Our approach assigns an image to a given class from the bag, so we understand the assignment to this class as adding a constant value equal to 0.2 to the probability of assignment according to the classic classifier. This approach allows one value of the probability distribution to be changed by 20%. The results of such action are shown in Table 6. The table shows the exact numbers of correctly classified images from the validation set and the accuracy. These data show that our proposition can be used as an additional component and increase the classification accuracy by nearly 5.5%. This result is better, but more time is needed to train the networks and classify samples because many more operations are required. It is worth noting that values increased mainly for the military class, yachts, and motorboats. This is due to the good definitions of features, such as the ones we have for military, or people and colors for the other two classes. The main conclusion is that the most difficult task is to initially declare a bag-of-words describing these features. This solution can be used in practice, but there are some additional tasks during the modeling of this solution, such as overseeing the creation of small images representing features and assigning them to individual groups. Also, the declaration of characteristics involves allocating image segments to these classes and analyzing them before training the classifier.
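The combination of the two classifiers can be sketched as follows. The sketch assumes that the classic CNN returns one probability per class and that the bag-of-words mechanism returns a single class index; the function name and the example probabilities are hypothetical, and no renormalization is applied because the text does not mention one.

```python
def combine(cnn_probs, bag_class, bonus=0.2):
    """Add a constant bonus to the class chosen by the bag-of-words mechanism."""
    boosted = list(cnn_probs)
    boosted[bag_class] += bonus
    best = max(range(len(boosted)), key=boosted.__getitem__)
    return best, boosted


# Made-up probabilities for five classes; the bag mechanism votes for class 1.
cnn_probs = [0.40, 0.35, 0.10, 0.10, 0.05]
best, boosted = combine(cnn_probs, bag_class=1)
```

In this example the CNN alone would choose class 0, but the bonus raises class 1 to about 0.55, so the combined decision changes. When the CNN is confident, a bonus of 0.2 is too small to override it.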
We used other CNN architectures, including VGG16, Inception, and AlexNet, and compared the results with and without our approach, as shown in Figure 4. The obtained results show that classification accuracy increased for all tested architectures. The average increase for all architectures was around 6%, which was a good result given the small datasets (neural networks are data-hungry algorithms). However, it was noted that apart from VGG16, where the increase was close to 3%, the other architectures achieved an increase of 7%. This was a good result, which could be significant for more extensive classes in the analyzed database.

Conclusions
Image classification is a problem for which solutions are being developed all the time. In recent years, revolutionary neural networks have been developed that have enabled a huge leap forward. Unfortunately, this solution also has its problems, such as requiring a large number of samples in the database, or architecture modeling. In this paper, we focused on analyzing images of selected ships. As part of the research, we proposed a classification mechanism based on sample segments that were determined by algorithms searching for keypoints and then classified.
As part of our experiments, we performed two tests. In the first, we analyzed two classes and the results obtained showed great potential for practical applications. In the second, we used much larger sets of images of five types of ships. The proposed solution in itself showed many disadvantages, especially at the stage of determining features and assigning samples to them to train the classifier. However, we used this solution as an additional element of classification after using the classic approach, including transfer learning. As a result, we noticed that the average efficiency increased by approximately 5% in almost all cases compared to the currently used convolutional network architectures.
An analysis of the database using a feature vector, which can be treated as a bag of words, shows potential practical application, especially if the features of the objects are well described. In future research, we plan to focus on how to automatically analyze images to extract features from them, as well as automatically assign classes as an unsupervised technique.

Conflicts of Interest:
The authors declare no conflict of interest.