Chart Classification Using Siamese CNN

In recovering information from a chart image, the first step should be chart type classification. Over the years, many approaches have been proposed, some achieving better results than others. The latest articles use a Support Vector Machine (SVM) in combination with a Convolutional Neural Network (CNN), which achieves almost perfect results on datasets of a few thousand images per class. The datasets containing chart images are primarily synthetic and lack real-world examples. To overcome the problem of small datasets, to our knowledge, this is the first report of using the Siamese CNN architecture for chart type classification. Multiple network architectures are tested, and the results for different dataset sizes are compared. The network verification is conducted using Few-shot learning (FSL). Many of the described advantages of Siamese CNNs are shown in examples. In the end, we show that the Siamese CNN can work with one image per class, and a 100% average classification accuracy is achieved with 50 images per class, where the classic CNN achieves an average classification accuracy of only 43% on the same dataset.


Introduction
In today's world, everything can be measured and described by numbers, and the numbers can accumulate fast and create tabular data. Tabular data is often hard to read; it is difficult to notice important information and to present that information to others who may or may not have prior knowledge of it. Because of the problems mentioned above, people tend to use graphical representations of tabular data: data visualizations or chart images. A graphical representation also helps identify unusual results and compare different values, trends, and relations between different types of data. Today, the most common data visualizations (known as the line, bar, and pie chart) have been used since the 18th century [1]. The majority of the used data visualizations are "locked" inside documents, which can be digitized. These documents contain graphical and textual information linked together in one visual unit. Each year raises essential questions and issues in retrieving and storing these documents and the information that is "locked" inside them. The first challenge in retrieving information from a digitized data visualization image is classifying that image into one of many existing chart classes.
Chart type classification is a well-studied problem with a vast number of real-world applications dealing with chart text processing, chart data extraction, and chart description generation. Some of the existing applications include: automatic generation of a summary description of the presented chart image, exporting the original data table from the chart image, adding accessibility for various screen readers, etc. A chart image contains heterogeneous information: it is built using graphical (lines, marks, circles, rectangles, etc.) and textual (title, legend, description, etc.) components. These components are not strictly standardized, and not every component is required to be used. Designers have many choices and much freedom when designing a chart image, which often results in creating new chart classes or new chart sub-classes.

The advantages of Siamese CNN:
• Can generalize to inputs and outputs that have never been seen before: a network trained on approximately ten classes can also be used on any new class that the network has never seen before, without retraining or changing any parameters;
• Shared weights: two networks with the same configuration and with the same parameters;
• Explainable results: it is easy to notice why the network responded with a high or low similarity score;
• Less overfitting: the network can work with one image per class;
• Labeled data: before training, all data must be labeled and organized;
• Pairwise learning: what makes two inputs similar;
• In terms of dataset size: less is more.
The disadvantages of Siamese CNN:
• Computationally intensive: less data, but more data-pairs;
• Fine-tuning is necessary: the network layer architecture should be designed for solving a specific problem;
• Quality over quantity: the dataset must be carefully created and inspected;
• Choosing a loss function: available loss functions are contrastive loss, triplet loss, magnet loss, and center loss.
Many of the listed advantages and disadvantages will be experimentally proven in the following sections.

The Dataset
Before explaining the datasets, image pre-processing should be noted. The pre-processing used in the creation of the datasets is similar to the pre-processing used in [15]. The noted "Stage 3 image processing" is fine-tuned, and the number of details in the image is further reduced. The updated algorithm is presented in Figure 1. With this algorithm, the title, coordinate axes, legend, and any additional elements outside the chart graphics are removed. These elements are not crucial for chart type classification based only on the shape of the graphic objects used in chart creation. The images are scaled down with a preserved aspect ratio and are normalized to 105 × 105 pixels and black-and-white color space. All images are labeled and organized, as training a Siamese CNN requires true (e.g., bar and bar chart) and false (e.g., bar and line chart) image pairs.

In this research, three different datasets are used:

1. The dataset used in our previous research, which consists of 3002 images divided into ten classes, as shown in Figure 2 [15,29]. This dataset includes images collected from the Google Image search engine and the ReVision system [10] (further on in the text referred to as dataset 1).

2. The International Conference on Document Analysis and Recognition (ICDAR) 2019 synthetic chart dataset, which consists of 198,010 images divided into seven classes, as shown in Figure 3 [30] (further on in the text referred to as dataset 2).

3. The AT&T Database of Faces, which consists of 400 images divided into 40 classes [31].

Datasets 1 and 2 are fully pre-processed, while in dataset 3, the only applied pre-processing steps are image resolution normalization and image color space normalization. Dataset 3 is only used in Siamese CNN training because the additional 40 classes help the network learn similarities better. Instead of this dataset, any other labeled dataset can be used. However, if no additional datasets are used, the network over-fits and the loss value oscillates between two values. This phenomenon indicates that the model has not learned similarities.

Before training the network model, one validation set consisting of 20 images per class is excluded from datasets 1 and 2, as listed in Table 2. This set is used as a reference point, as the images never change, and the set is never seen in the training process.
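The pair-based organization described above can be sketched in a few lines of Python. The following is a minimal sketch, not the authors' code: it shows how an aspect-preserving target size for the 105 × 105 normalization could be computed, and how labeled true/false image pairs could be built for training. The function names and the 0/1 label convention are illustrative assumptions.

```python
import random

def fit_size(width, height, target=105):
    # Scale the longer side down to `target` while preserving the aspect
    # ratio; the resized image would then be padded to target x target
    # (padding not shown here).
    scale = target / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))

def make_pairs(dataset, n_pairs, seed=0):
    # dataset: dict mapping class name -> list of images (any objects).
    # Label 0 = true pair (same class), 1 = false pair (different classes),
    # matching the convention where a lower similarity score means "same".
    rng = random.Random(seed)
    classes = sorted(dataset)
    pairs = []
    for _ in range(n_pairs // 2):
        c = rng.choice(classes)                  # true pair from one class
        a, b = rng.sample(dataset[c], 2)
        pairs.append((a, b, 0))
        c1, c2 = rng.sample(classes, 2)          # false pair from two classes
        pairs.append((rng.choice(dataset[c1]), rng.choice(dataset[c2]), 1))
    rng.shuffle(pairs)
    return pairs
```

Even for a modest dataset such as dataset 1, the number of possible pairs far exceeds the number of images, which is the "less data, but more data-pairs" trade-off listed among the disadvantages.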

The Architecture
The model presented in Figure 4 consists of two inputs (two images), one for each CNN. The input images are pre-processed and handed to the CNNs. In this research, multiple CNN architectures are tested and compared. The used CNNs are:

• Simplified VGG: the same network architecture used in our previous research. The achieved average classification accuracy over ten classes ranges from 78% to 89%, depending on the used dataset [15,29]. The achieved results are for the classic CNN architecture.

• SigNet CNN: the network used for writer-independent offline signature verification. The authors report accuracy ranging from 76% to 100%, depending on the used dataset [26]. The achieved results are for the Siamese CNN architecture.

• Omniglot CNN: the network used on the Omniglot dataset for the validation of handwritten characters. The authors report accuracy ranging from 70% to 92%, depending on the used dataset [32]. The achieved results are for the Siamese CNN architecture.
All listed network architectures were remade according to the original papers, where the authors stated details about the network configuration. The input layer of each network is reconfigured to accept the new image datasets. The two Siamese CNNs are identical, with the same parameters, configuration, and shared weights. The parameters are mirrored and updated in both networks. Each network outputs a feature vector; if the same input image is handed to both networks, the feature vectors will be identical. The feature vectors are used in calculating the loss function (contrastive loss), which computes a similarity score using the Euclidean distance between the two vectors. Based on the resulting similarity score and a threshold value, it can be determined whether the two input images are similar and whether they belong to the same class. The used threshold values are 0.5, 0.75, and 1. In terms of the similarity score, a lower value (closer to 0) is better (same class).
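The scoring step described above can be made concrete. The snippet below is a minimal pure-Python sketch, not the authors' implementation: it computes the Euclidean distance between two feature vectors, the contrastive loss in the common Hadsell et al. form (assuming label 0 = similar, 1 = dissimilar), and the threshold decision. The margin value is an illustrative assumption.

```python
import math

def euclidean(u, v):
    # Euclidean distance between two feature vectors of equal length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(u, v, label, margin=1.0):
    # label 0 = same class (pull the vectors together), 1 = different class
    # (push them apart until the distance reaches the margin).
    d = euclidean(u, v)
    return (1 - label) * 0.5 * d ** 2 + label * 0.5 * max(0.0, margin - d) ** 2

def same_class(u, v, threshold=0.5):
    # A lower similarity score (distance) means "same class"; the thresholds
    # used in this research are 0.5, 0.75, and 1.
    return euclidean(u, v) < threshold
```

Note that identical inputs yield a zero loss for a true pair, and a false pair whose distance already exceeds the margin also contributes no loss.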

Experiment Setup
All CNN and Siamese CNN models were trained and tested on the Google Colab platform with the PyTorch deep learning framework and CUDA acceleration enabled.

Experiments
This section summarizes the findings and contributions made. The verification process of N-way one-shot learning is described. The results of the Simplified VGG, SigNet CNN, and Omniglot CNN are compared. Detailed network classification results are provided, as well as a confusion table. In the end, a comparison between the classic CNN and the Siamese CNN is given.

Verification
As seen from Table 1, the CNN architecture relies on substantial data for a good outcome. Thousands of images are required for training before a network can accurately assess a new chart image. Newly created chart classes lack datasets, and creating and labeling a dataset is a time-consuming and expensive task. When datasets are insufficient, a CNN cannot match images using learned features, but it can calculate similarity scores between different classes. To address this problem, FSL is used in conjunction with the Siamese CNN architecture. FSL has two main variations: Zero-shot and N-shot (or N-way-K-shot). Zero-shot learning refers to using a model to predict a class without being introduced to that class in the training process. On the other hand, N-way-K-shot is a broader concept used when the number of classes N and the number of samples K from each class are known.
All network models were trained from scratch using the datasets described in the previous section. The used verification method is N-way one-shot learning, introduced by Lake et al. [33]. The 10-way one-shot learning is explained in Table 3. The Siamese CNN requires two input images to generate a similarity score. The input on one side of the Siamese CNN is an image from the validation set (or any new image that was not used in the training process). The other CNN receives one random image from each class that was used in the training process. This creates ten image pairs for one image. Comparing each image pair, the Siamese CNN calculates a similarity score.

Table 3. Example of 10-way one-shot learning. The highest expected similarity score should be SS3.

[Table 3 layout: columns Image Pair (Class N), Input Image 1 (New Image), Input Image 2 (Known Image), and Similarity Score (SS); each of the ten rows pairs the new image with one known image per class.]
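The N-way one-shot procedure above reduces to a nearest-reference lookup over the similarity scores. The following is a minimal sketch with illustrative names, not the authors' code, assuming the trained network has already mapped each image to a feature vector:

```python
import math

def euclidean(u, v):
    # Euclidean distance, used here as the similarity score (lower = closer).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def n_way_one_shot(query, references):
    # references: dict mapping class name -> embedding of one known image
    # from the training process. Predict the class whose reference yields
    # the lowest similarity score.
    return min(references, key=lambda c: euclidean(query, references[c]))

def n_way_accuracy(trials, references):
    # trials: list of (query_embedding, true_class) from the validation set.
    hits = sum(1 for q, c in trials if n_way_one_shot(q, references) == c)
    return hits / len(trials)
```

Repeating this lookup over the whole validation set, with fresh random reference images per trial, yields the average classification accuracy reported in the results.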

Verification
As seen from Table 1, the CNN architecture relies on substantial data for a good outcome. Thousands of images are required for training before a network can accurately assess a new image of a chart. Newly created chart classes lack datasets, and creating and labeling a dataset is a time-consuming and expensive task. When datasets are inefficient, CNN cannot match images using learned features but can calculate similarity scores between different classes. To address this problem, FSL is used in conjunction with Siamese CNN architecture. The FSL has two main variations: Zero-shot and N-shot (or N-way-Kshot). Zero-shot learning refers to using a model to predict a class without being introduced to that class in the training process. On the other hand, N-way-K-shot is a broader concept used when the number of classes N and the number of samples K from each class is familiar.
All network models were trained from scratch using datasets described in the previous section. The used verification method is N-way one-shot learning, introduced by Lake et al. [33]. The 10-way one-shot learning is explained in Table 3. The Siamese CNN requires two input images to generate a similarity score. The input in one side of Siamese CNN is an image from the validation set (or any new image that was not used in the training process). The other CNN requires one random image from each class that was used in the training process. This creates ten image pairs for one image. Comparing each image pair, the Siamese CNN calculates a similarity score.

Verification
As seen from Table 1, the CNN architecture relies on substantial data for a good outcome. Thousands of images are required for training before a network can accurately assess a new image of a chart. Newly created chart classes lack datasets, and creating and labeling a dataset is a time-consuming and expensive task. When datasets are inefficient, CNN cannot match images using learned features but can calculate similarity scores between different classes. To address this problem, FSL is used in conjunction with Siamese CNN architecture. The FSL has two main variations: Zero-shot and N-shot (or N-way-Kshot). Zero-shot learning refers to using a model to predict a class without being introduced to that class in the training process. On the other hand, N-way-K-shot is a broader concept used when the number of classes N and the number of samples K from each class is familiar.
All network models were trained from scratch using datasets described in the previous section. The used verification method is N-way one-shot learning, introduced by Lake et al. [33]. The 10-way one-shot learning is explained in Table 3. The Siamese CNN requires two input images to generate a similarity score. The input in one side of Siamese CNN is an image from the validation set (or any new image that was not used in the training process). The other CNN requires one random image from each class that was used in the training process. This creates ten image pairs for one image. Comparing each image pair, the Siamese CNN calculates a similarity score.

The expected highest similarity, i.e., the similarity score closest to 0, according to Table 3, should be SS3. If SS3 is the lowest value in the group and within the set threshold value, this is treated as a correct classification (same class); otherwise, it is incorrect. Repeating the algorithm x times, the class accuracy CA is calculated as CA = (CC/x) × 100%, where, in Equation (1), CC represents the number of correct classifications within a class.
For verification, a set of 20 images per class (x = 20) is used. With this method, 200 image pairs are tested for one class or 2000 image pairs for ten chart types.
In this algorithm, the similarity score depends on two random variables: input image 1 (the new image) and input image 2 (the random training image). To eliminate one random variable (the random training image), the input image is tested against all trained images from each class. The highest-similarity image from each class is kept, and these new image pairs are used for verification. With this method, 4000 image pairs are tested for one class, or 40,000 image pairs for ten chart types.
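The two pieces above, Equation (1) and the exhaustive-pairing variant, can be sketched as follows. This is an illustrative sketch with hypothetical helper names, again using a placeholder `similarity` function and numeric stand-ins for images:

```python
def class_accuracy(correct, x):
    """Equation (1): CA = (CC / x) * 100, where CC is the number of
    correct classifications out of x verification trials."""
    return correct / x * 100.0

def best_pair_per_class(similarity, query_image, support_sets):
    """Exhaustive variant: compare the query with every training image
    of each class, keep only the highest-similarity (lowest-score) pair
    per class, and predict the class with the overall lowest score.
    This removes the random-support-image variable."""
    best = {}
    for cls, images in support_sets.items():
        best[cls] = min(similarity(query_image, img) for img in images)
    return min(best, key=best.get)  # predicted class

support = {"bar": [1.0, 1.2], "line": [5.0, 5.4], "pie": [9.0, 9.3]}
sim = lambda a, b: abs(a - b)
print(best_pair_per_class(sim, 5.2, support))  # line
print(class_accuracy(10, 20))                  # 50.0
```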

Results
To validate the performance of the proposed architecture, a set of experiments is conducted. The goal is to evaluate the performance of all three models and determine which achieves the highest average classification accuracy on chart images. Table 4 shows the 10-type average classification accuracy obtained using dataset 1. All three networks were trained from scratch. Planned comparisons revealed that the Simplified VGG outperforms the Omniglot CNN and SigNet CNN. When used as a Siamese CNN, the Simplified VGG achieves results similar to those of our previous work, where it was used as a classic CNN. It must be pointed out that these results are achieved using less than 10% of the total images from dataset 1. The two other networks achieve around 50% worse results than reported in their original papers. The reason lies in the network layer construction. The Simplified VGG is a network adapted specifically for chart-type classification: when the input image passes through the network layers, the image is segmented into smaller sub-images. The Omniglot CNN and SigNet CNN are specially designed for searching for and learning imperfections between two images on the pixel level, while the Simplified VGG observes the image as a whole. Although the input images are heavily pre-processed, they still contain image noise that the Omniglot CNN and SigNet CNN detect.
Table 4. Ten-type average classification accuracy. The testing of each architecture is conducted on the same validation set from Table 2. In terms of average accuracy and F-1 score, the Simplified VGG outperforms the other networks.

(Table 4 columns: Input vs. Random Train Image from Each Class; Input vs. Highest Similarity Image from Each Class.)
From the left side of Table 4, it can be seen that choosing a random train image for comparison can produce hit-or-miss results: if the system chooses a representative image, the classification result can be 100%, and if the chosen image is not similar, it can be 0%. This phenomenon shows that the quality of the dataset is more important than its quantity. On the right side of the table, 20 times more image pairs are used, which increases the average classification accuracy by 15%. The difference between the two approaches can also be seen in Figures 5 and 6. In both approaches, the largest share of correctly classified images has similarity scores between 0 and 0.5. This confirms that all three networks are correctly trained and are confident in the results they give. When using the higher number of image pairs, the third column (0.75 < x < 1) is eliminated, as shown in Figure 6.
For statistical comparison of the models, a statistical hypothesis test is conducted using McNemar's test. McNemar's test uses a contingency table, a 2 × 2 table containing binary outcomes (correct or incorrect). Each model's prediction on the same image is recorded as: both models correct, both incorrect, or only one model correct. The test calculates whether the two models disagree in the same way or not. In Table 5, the p-values for all model pairs are compared against a significance level of 0.05. In all cases, the p-value is less than 0.05, and the null hypothesis H0 is rejected. The rejected H0 shows a significant difference in the disagreements between the models, and we conclude that the models make considerably different predictions when introduced to the same images.
(Figure 6. Input vs. highest similarity image from each class: similarity score.)
Since slightly superior results are achieved with the Simplified VGG, additional information is presented in the confusion table, Table 6. The horizontal rows represent known (seen) classes, and the vertical columns represent predicted classes. The number of correct predictions is displayed on the green diagonal (the maximum is 20). Red-colored cells show the number of wrong predictions. The Siamese CNN can also be used to classify chart types that were not used in the training process and are therefore unknown to the network. To prove this statement, a box plot from dataset 2 is used. When the network chooses a random image pair, the results are slightly worse than for classes seen during training. When the network uses all available image pairs, the results are the same as for the seen classes.
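The McNemar procedure above can be reproduced with the standard formula. This is a sketch with hypothetical disagreement counts, not the paper's actual contingency tables; it uses the chi-squared statistic with continuity correction and the one-degree-of-freedom survival function:

```python
import math

def mcnemar(b, c):
    """McNemar's test on two models' predictions over the same images.
    From the 2x2 contingency table, only the disagreement cells matter:
    b = model A correct, model B wrong; c = model A wrong, model B correct.
    Returns the chi-squared statistic (with continuity correction) and
    its p-value for one degree of freedom; p < 0.05 rejects H0 (the
    models disagree in the same way)."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Chi-squared survival function with 1 dof: p = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

stat, p = mcnemar(b=25, c=5)  # hypothetical disagreement counts
print(p < 0.05)               # True: H0 rejected, models differ
```

Only the off-diagonal cells of the contingency table enter the statistic; images on which both models agree carry no information about how the models disagree.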
The average classification accuracy slightly decreases from 10-type to 11-type classification, which is expected as the number of types increases.
To compare the classic CNN with the Siamese CNN, additional tests are created for the Simplified VGG. The network is trained 16 times from scratch (eight times as a classic CNN and eight times as a Siamese CNN). The network configuration and the training parameters are always the same. The training is conducted using dataset 2 (7-type chart classification). The same batch of images is used when training the classic CNN and the Siamese CNN. The goal of training each network from scratch eight times is to find the minimal training dataset size needed to achieve state-of-the-art results. In Table 7, e.g., "t10" refers to a training dataset with ten images per class. Model verification always uses the same set of images, the validation set. Verification of the Siamese CNN is conducted by creating image pairs for 7-way one-shot learning. Comparing the results from Table 7 shows how the number of images and image pairs impacts classification accuracy and the required classification time. The classic CNN does not require pairing input images with training images, which makes it equally fast with any size of training dataset. However, even with 500 images per class, its average classification accuracy did not reach 100%. This type of CNN is not usable with small training datasets, and competitive results only start to appear when the number of images per class reaches 200 or more. On the other hand, the Siamese CNN can work with one image per class.
Competitive results are achieved with datasets of between 20 and 50 images per class, and state-of-the-art results are achieved with just 50 images per class. The average classification accuracy and F-1 score should increase steadily as the number of images increases. Although this holds in general, it fails when pairing the input image with a random train image. In Figure 7, the effect of the hit-or-miss random image can be seen between "t10" and "t20," where the average classification accuracy decreases.
Table 8. Between t10 and t20, the hit-or-miss effect can be seen. The state-of-the-art results for the Siamese CNN are achieved with 50 training images per class.
For statistical comparison, the same McNemar's test is conducted as for Table 5. When comparing the two Siamese CNNs, a significant difference can only be seen between "t5" and "t50," where H0 can be rejected. This is expected behavior, since one Siamese CNN uses random train images for generating similarity scores. When the Siamese CNNs are compared to the classic CNN, H0 can be rejected up to "t100," as shown in Table 8. This confirms that these models make considerably different predictions, in accordance with the average classification accuracy and F-1 score from Table 7.

Conclusions
This paper focuses on the classification of chart images using the Siamese CNN, which, to our knowledge, has not been done before. This work is motivated by the lack of publicly available datasets and a continually growing number of chart types. The conducted research proves that the Siamese CNN can be used for chart type classification. The results of the three tested Siamese CNN architectures show that the network layer construction impacts classification results. Regarding N-way one-shot learning, the choice of image pairs can have a hit-or-miss result, which indicates that the quality of the used dataset matters more than its quantity. Compared to a classic CNN, the Siamese CNN requires a much smaller image dataset and achieves a higher average classification accuracy and F-1 score. We have shown that the Siamese CNN can also generalize to inputs never seen before and achieve competitive results. When trained on seven chart types, the Siamese CNN achieved state-of-the-art results: 100% average classification accuracy and F-1 score.
In the future, other loss functions (triplet loss, magnet loss, center loss) will be tested and compared. The plan is also to increase the number of chart types to 20 or more. The image pre-processing algorithm can be further optimized, and the number of details in the image further decreased, potentially achieving 100% accuracy with an even lower number of images per class.
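Of the loss functions mentioned above, triplet loss is the most direct extension of pairwise similarity training. As a rough illustration (with hypothetical precomputed embedding distances, not actual network outputs), it penalizes an anchor-positive pair that is not closer than the anchor-negative pair by at least a margin:

```python
def triplet_loss(d_ap, d_an, margin=1.0):
    """Triplet loss on precomputed embedding distances: the
    anchor-positive distance d_ap should be smaller than the
    anchor-negative distance d_an by at least `margin`; any
    shortfall is penalized linearly."""
    return max(d_ap - d_an + margin, 0.0)

print(triplet_loss(0.2, 2.0))  # 0.0: positive already closer by > margin
print(triplet_loss(1.5, 1.0))  # 1.5: violation penalized
```

Unlike the pairwise similarity score used here, triplet loss compares same-class and different-class distances within a single training step, which is one reason it is a natural candidate for this comparison.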