Lifelong Machine Learning for Regional-Based Image Classiﬁcation in Open Datasets

: Deep Learning algorithms are becoming common in solving different supervised and unsupervised learning problems. Different deep learning algorithms were developed in last decade to solve different learning problems in different domains such as computer vision, speech recognition, machine translation, etc. In the research ﬁeld of computer vision, it is observed that deep learning has become overwhelmingly popular. In solving computer vision related problems, we ﬁrst take a CNN (Convolutional Neural Network) which is trained from scratch or some times a pre-trained model is taken and further ﬁne-tuned based on the dataset that is available. The problem of training the model from scratch on new datasets suffers from catastrophic forgetting . Which means that when a new dataset is used to train the model, it forgets the knowledge it has obtained from an existing dataset. In other words different datasets does not help the model to increase its knowledge. The problem with the pre-trained models is that mostly CNN models are trained on open datasets, where the data set contains instances from speciﬁc regions. This results into predicting disturbing labels when the same model is used for instances of datasets collected in a different region. Therefore, there is a need to ﬁnd a solution on how to reduce the gap of Geo-diversity in different computer vision problems in developing world. In this paper, we explore the problems of models that were trained from scratch along with models which are pre-trained on a large dataset, using a dataset speciﬁcally developed to understand the geo-diversity issues in open datasets. The dataset contains images of different wedding scenarios in South Asian countries. We developed a Lifelong CNN that can incrementally increase knowledge i.e., the CNN learns labels from the new dataset but includes the existing knowledge of open data sets. The proposed model demonstrates highest accuracy compared to models trained from scratch or pre-trained model.


Introduction
Over the last decade, it was observed that pattern recognitions systems perform more efficiently when we use Neural Network (NN) or particularly Deep Neural Networks (DNNs).Recent development in deep learning has revolutionized the field, particularly the developments in CNN.These techniques heavily depend on the data the models are trained.However, it has been observed by different researchers [1] that deep learning algorithms can give incorrect results or some times the results are disturbing when there is not enough representation of the data belonged to a particular class or region.For instance, irrigation-based techniques or equipments, work very well in certain crops Symmetry 2020, 12, 2094; doi:10.3390/sym12122094www.mdpi.com/journal/symmetryor areas but do not work well when the system is used in a different environment.Another example could be self-driving cars, where cars can drive very efficiently on asphalt roads, but the chances of failure or accidents increase when there are dirt or gravel.These deep learning based techniques are as good as the data they learn from.To build deep learning systems that can work efficiently for the whole world in different diverse regions, the training data must be designed that have representations from diverse regions or there need to be improvements in the algorithms of learning.CNN models have been efficiently and effectively used for solving problems in pattern recognition or image recognition such as classification of objects [2], face recognition [3], gesture recognition [4] and image captioning [5].ImageNet [6] and Open Images datasets are commonly used to develop systems for image classification or image recognition problems.ImageNet dataset consists of approximately 15 million images from 21,841 different categories.However, most of the research papers use 1.2 million images of 1000 different categories, and hence most researchers reports accuracy of networks where their models are trained with ImageNet consisting of 1.2 million images of 1000 classes.Images are associated with a human-verified single label.Open Images dataset contains around 9 million images annotated with labels spanning over 6000 categories.The creators of the dataset, have tried to make the dataset as practical as possible by making labels that cover more real-life entities than the 1000 ImageNet classes.These datasets contain images mostly from US, UK, Europe or Australia.There is not enough representation from developing world such as Ethiopia, India, Pakistan, Afghanistan, China, Saudi Arabia etc.The number of images in ImageNet and Open Images based on different regions/countries is analyzed in [1] and presented in Figure 1.It is clear that most of the images are taken from US, Canada, Europe and Australia.This can results into biasness in classification problems when used to make classification on images collected from a different region, because there is not enough representation from developing world as shown in Figure 2, where bridegroom from Pakistan and Ethiopia are not correctly classified, while bridegroom from US, Australia, Canada are easily classified.Humans and animals transfer the knowledge and skills learning throughout the lifespan.We have a complex neuro-cognitive system that help the sensorimotor skills to develop and specialize and at the same time develop long-term memory storage and retrieval of information [7].This ability of transferring the knowledge and skills throughout the lifespan is referred as lifelong learning.These skills are very much important for computers and intelligent systems working in real-world environment and process the flow of information.Lifelong Machine Learning (LML) [8] involves mechanisms that can learn different tasks from different domains over the span of its lifetime.The idea of LML is to retain learned knowledge and to selectively transfer knowledge when learning a new task so as to develop more accurate hypothesis or policies.This paper further emphasizes the need for Artificial Intelligence (AI) applications to not only learn based on existing dataset, but need to understand the systems that are able to learn over a lifetime.We are presenting an algorithm for CNN that is used for transferring knowledge from ImageNet and useful for the dataset containing images in different scenarios and in different regions.The results of this paper conclude that the next logical step in the domain of supervised learning is LML.The organization of the paper is as follows.In Section 2 we briefly explain the current popular techniques in literature.The proposed technique is explained at Section 3. We also discuss briefly the dataset specifically prepared for this paper and explanation of the current popular models typically used for classification of images and object recognition.We further explain an algorithm that can build the knowledge based on existing knowledge.Experiments collected from models pre-trained on large dataset and models trained on new dataset from scratch and Lifelong CNN is explained in Section 4. The conclusion of the paper along with possible future directions are given in Section 5.

Literature Study
An analysis of different Convolutional Neural Network (CNN) models for classification of Images is given in [9].The paper explains in detail the networks such as; LeNet5, AlexNet, ZFNet, VGGNet, GoogLeNet (also known as Inception), ResNet, DenseNet and CapsNet.The authors' claims are based on different experiments that GoogLeNet and ResNet obtained good performance rather than simply putting the building blocks of CNN in a sequential fashion.In [10] CNN is used to identify disease in plants by taking an image of the plant.To determine the health of plants through an image, it is very challenging as the crops have rich and complex environment.Authors have used AlexNet, DenseNet, Inception, ResNet, SqueezeNet and VGG.In [11] a CNN is trained to classify traffic signs.In the paper, authors have explored transfer learning techniques known as fine tuning technique to reuse layers pre-trained on ImageNet dataset.
In [12] an empirical analysis of performance of popular CNNs for identifying objects in real-time videos is made.The paper focuses on evaluating the performance of AlexNet, GoogLeNet and ResNet50 using datasets CIFAR10, CIFAR100 and ImageNet.The paper concludes that ResNet50 and GoogLeNet gives better performance on all three datasets.In [13], image recognition is performed using a CNN with (3 × 3) filters for convolution, and has demonstrated significant improvement by obtaining good accuracy.In [14] a CNN is used to make classification of 1.2 million images collected in the ImageNet to make classification in 1000 different classes.The performance on the test data is shown as 37.5% top-1 error rate which are considered improvement over previous popular techniques.In [15] multiple tricks in CNN for accuracy improvement in image classification are explained.The paper demonstrates that better accuracy in image classification makes transfer learning more efficient in different applications such as object detection or segmentation.
In [16] a CNN architecture that shares latent factorized representations in CNNs is developed.Deconvolutional factorization along with tensor contraction are used to make a transformation among multiple operations.They have tested two datasets i.e., CIFAR-10 and 100 and results demonstrate that the developed method has obtained good performance for difficult setting of lifelong learning, avoiding the problem of catastrophic forgetting and uses backward transfer to make improvement in the performance in learning by improving previously learned tasks from different experiments without requiring to retrain.In [17], authors are explaining that though computer vision technology is used by many people are around the world but these techniques (datasets) contain representation of only a few regions, and hence it is been reported that the computer vision models misbehaves in predicting labels that are offensive or low accuracy in unrepresented regions.They have analyzed that the datasets are typically manually annotated images or videos, and the label distribution is not fair.They have analyzed ImageNet and considered three factors for the person category; (i) the vocabulary used for annotation (ii) exhaustive illustration of all categories (iii) inequality of representation.They had taken a first step to eliminate unfairness in ImageNet.
A Tree-CNN model is proposed in [18] for data with low frequency.Three key insights are taken into consideration: (i) energy consumption in household is based on patterns based on time and this pattern can be captured by the different kernels in CNN (ii) the structural representation as a tree allows learning the structure of individual products and therefore the difference in magnitude is avoided while retaining the relationship between appliances (iii) known and unknown appliances are separated, and therefore the input time series is better used for reconstruction of time series for appliances.Authors claim that the performance of Tree-CNN-based model is better than current popular models in terms of lower prediction error and better performance in detection of states that are active in different appliances.In [19] a new CNN is developed that consists of clustering algorithm and Tree-CNN.The role of the clustering algorithm is to make classification in a high-level class.Tree-CNN consists of Trunk-CNN and Branch-CNN.The Trunk-CNN is used for coarse classification, and the role of the Branch-CNN is to make difference between groups of same category.The authors used Caltech101 and Caltech256 datasets testing the model and have demonstrated superior results by Tree-CNN.Some other related studies can be found in [20][21][22][23].
The techniques presented in literature are developed for different computer vision activities.Different CNN models are developed to solve problems in object recognition, object detection, segmentation, etc.These models suffer from catastrophic forgetting.Which means that the model is able to train an existing dataset with high accuracy, but when the model is presented with a new dataset the performance of the system is low.The model is not able to train itself on the new dataset while keeping the knowledge of the previous data set.To deal with this situation, the paper has presented a novel technique that is based on the hierarchical representation of objects.Using this model we are able to train on new dataset, while maintaining the knowledge of old datasets.The results have demonstrated that the model has achieved better performance accuracy compared to other CNN models.

Preprocessing Analysis
Here, we explain the dataset obtained specifically for this paper to demonstrate the geo-diversity issues in open datasets.The preprocessing on the dataset is also explained in this section.The dataset contains images collected mainly from different regions of Pakistan, India, and Afghanistan, where each image contains a wedding scenarios.For example, images containing Brides or Grooms or cars decorated for wedding etc.The dataset is called Wedding dataset.Three different annotators are used to annotate images and the annotation with highest accuracy percentage is shown as final value to be shown as result.The different classes used in the Wedding dataset is given in Table 1 and a sample of images from the dataset is illustrated in Figure 3.

Normalization
Colored images contain pixel values between 0 and 255.Therefore, some features in the CNN have small values in the range of [0, 10] while others have large values such [250,255].The ML algorithms have difficulty to learn in this situation.Therefore, the data is normalized in the range of [0, 1] to be efficiently processed by ML models.In this paper, all pixel values are divided by 255 to convert the values of pixels to be in the range 0 to 1.

Image Classification Models
This section gives a brief explanation of the most popular CNN architectures commonly used for problems involving computer vision.These three architectures are; LeNet-5, ResNet50 and Inception.This section also gives explanation of Tree-CNN, that is adapted in this paper to be used for classification of wedding scenarios in Wedding dataset.

LeNet, ResNet and Inception
LeNet-5 [24], presented by LeCun et.al. was the exceptionally to begin with CNN show used for acknowledgment of manually written and machine-printed characters.The model has two convolution and average pooling layers, taken after by two completely connected layers and a softmax classifier to form classification of 0-9 digits.A test demonstration of LeNet is given in Figure 4.

Full connection
Full connection
An important breakthrough in the domain of computer vision and deep learning is ResNet (also known as Residual Network) [25].To get better performance in ResNet hundreds of convolution and pooling layers are used.Getting better accuracy in deep neural network is difficult when a large number of layers are used mainly because of the vanishing gradient problem [26].Vanishing gradient means that when the gradient is computed at later layers of the ResNet and propagated back in the back propagation process, the repeated multiplication of small values of gradients makes it even more small, such that it is value is not significant to make any changes.In ResNet a novel technique known as "Identify Shortcut Connection" is used to deal with vanishing gradient problem.Using this skip connection one or more layers are skipped during the back propagation.A simple demonstration of ResNet with 34 layers is shown on the left side of Figure 5.
Another important milestone in the progress made in computer vision is the development of Inception network [2,27].The idea of Inception network is based on the observations that the important features in a given image can exist in different variation.Therefore, to perform convolution operation it is not feasible to use an optimal kernel size.Salient features distributed more globally are identified by larger kernels.Those features that are distributed locally are observed by smaller kernels.Different kernel sizes are used at a same level in Inception network.There are nine inception modules that are stacked linearly.There are 27 layers (including pooling layers) and average pooling is used.Auxiliary classifier are used to handle vanishing gradient problem in Inception network.A sample demonstration of the Inception network is given at the right hand side of Figure 5.

Tree-CNN
We implemented and tested most popular CNN models such as ResNet and Inception model on the Wedding dataset and as demonstrated in Section 4 we are not able to reduce the gap of geo-diversity in open datasets.A model trained on ImageNet is not able to learn and increase the knowledge space from Wedding dataset, and vice versa i.e., the models trained on Wedding dataset is not able to share information with models trained on ImageNet.In this section, we devised a technique based on Tree-CNN [28], where models are able to share information gained from different datasets and hence we are able to reduce the gap of geo-diversity in open datasets.Residual Network.Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version.The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3).When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions.This option introduces no extra parameter; (B) The projection shortcut in Eqn.( 2) is used to match dimensions (done by 1⇥1 convolutions).For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

Implementation
Our implementation for ImageNet follows the practice in [21,41].The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41].A 224⇥224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21].The standard color augmentation in [21] is used.We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16].We initialize the weights as in [13] and train all plain/residual nets from scratch.We use SGD with a mini-batch size of 256.The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 ⇥ 10 4 iterations.We use a weight decay of 0.0001 and a momentum of 0.9.We do not use dropout [14], following the practice in [16].
In testing, for comparison studies we adopt the standard 10-crop testing [21].For best results, we adopt the fullyconvolutional form as in [41,13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).

ImageNet Classification
We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes.The models are trained on the 1.28 million training images, and evaluated on the 50k validation images.We also obtain a final result on the 100k test images, reported by the test server.We evaluate both top-1 and top-5 error rates.
Plain Networks.We first evaluate 18-layer and 34-layer plain nets.The 34-layer plain net is in Fig. 3 (middle).The 18-layer plain net is of a similar form.See Table 1 for detailed architectures.
The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net.To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training pro- In the Tree-CNN model, different nodes are connected in the form of a tree.Every node has a CNN and is trained to make classification on the input into children nodes.The leaf nodes are the last step of classification.The intermediary nodes make a classification to coarse classes and leaf nodes make classification to exact classes.We took the ResNet model that is initially trained on a large image dataset known as ImageNet and is able to recognize 1000 classes.In Tree-CNN it is represented as a one root node and 1000 leaf nodes.We want to transfer the learning from ImageNet to Wedding datasets.In the Wedding datasets, we have eight classes; six classes belong to Person class and two classes belong to Car class.In ImageNet Person class has three different types i.e., scuba diver, bridegroom and baseball player.The car class has types Race car, car mirror, passenger car, beach wagon, freight car and bumper car.
The Tree-CNN will learn to make classification of the eight classes in the Wedding dataset while maintaining the knowledge of ImageNet.We start from the root node and provide the Wedding dataset.The output layer produce a 3-dimensional matrix represented by O k×M×I , where K represent the number of children of root node, the number of newly introduced classes i.e.  1).The result of the so f tmax is the probability matrix L K×M as shown in Equation ( 2).The probability matrix contains the probability of creating new node or merging two nodes.
An ordered list S is created from L K×M , which have the following characteristics: The sorting of S ensures that newly created classes with large probability are combined with the Tree-CNN.After S is constructed, we analyse the first element i.e., S [1] and take one of the 3 actions: 1.
Addition of newly created class to already present node: If v 1 is larger than v 2 by the threshold α, it shows a high correlation with that child node.Therefore, the newly created class is combined with the child node n 1 .

2.
Merging children nodes to create a new node and added the newly created class to the node: In case of larger than one children nodes where the newly created class have high probability for, we can combine them to form a new node.This is possible when where α and β are threshold values provided by user.

3.
Add newly created class as a new node: In case the newly created class does not have a probability that is larger than the other values by a threshold (v The values of α and β are defined by user.α denotes the difference of values for nodes n 1 and n 2 i.e., we make sure that both values are not the same and are different by a good margin.If we increase the margin values the difference between the two values increase.Similarly, β denotes the difference of values for nodes n 2 and n 3 .Values are top three average softmax O avg values for the class in a decreasing order.
An example of how CNN trained on ImageNet is adapted to accommodate new classes from the Wedding dataset is given in Figure 6.In the Wedding data set, there are labels that belong to Persons class of ImageNet.For example Not Groom, Formal, Not Formal are labels of Persons class.However, then there are two labels Bride Not Bride that do not make a group with other labels of the Persons class and hence two new branches are created as Male and Female and the new classes i.e., Bride, Not Bride are placed in the Female branch.The performance in terms of accuracy of the models is shown in Figure 8a.It is observed that ResNet and Inception have better test accuracy as compared to LeNet.However, Tree-CNN achieves a test accuracy up to 96%, and demonstrates how sharing knowledge of ImageNet helps the model to gain knowledge from Wedding dataset and adds to the knowledge base.Similarly precision, recall and F1-score achieved by ResNet and Inception are improved compared to LeNet as shown in Figure 8b.However, the precision, recall and F1-score in Tree-CNN is 0.95 and demonstrates higher accuracy in Tree-CNN as it shares knowledge space.

Confusion Matrix
Another performance estimation procedure in machine learning classification issues is known as Confusion Matrix.In binary classification, the size of the table is 2 × 2 showing true positive, true negative, false positive and false negatives.However, in multi-class classifications problems the size of the table rise to the square of the number of classes.For classification models LeNet, ResNet, Inception and Tree-CNN the confusion matrix in obtaining prediction of classes in the Wedding dataset is shown in Figure 9.The performance of the model is high when diagonal of the confusion matrix has large values.It can be observed in the Figure 9, for Inception and ResNet the values are large on diagonals, demonstrating that these models are more accurate than LeNet.The confusion matrix for Tree-CNN is even more better than ResNet and Inception, demonstrating that training from scratch does not mean improvement in performance, but sharing knowledge helps us to improve performance while at the same time reduces the gap of geo-diversity in open datasets.

Evaluation with ROC Curve
The performance in classification by CNN models is demonstrated by using ROC (Receiver Operating Characteristic) curve.True Positive Rate (TPR) and False Positive Rate (FPR) are focused on the ROC curve.The calculation of these parameters is given in Equation ( 4).The ROC curve for CNN models i.e., LeNet, ResNet, Inception and Tree-CNN is given in Figure 10.The accuracy of CNN models is checked in ROC curve by covering the area of lines on the left side of the diagonal.The larger the area the higher the accuracy.As it can be observed in the figures, the lines are more on the left side of the diagonal for ResNet and Inception model, demonstrating the better accuracy of these models compared to LeNet.However, the ROC for Tree-CNN shown in Figure 10d is more on the left side and making area under the curve from 0.94 to 0.99.This demonstrates that Tree-CNN is the most suitable model when you want to learn new features but at the same time retain the previous knowledge.

Prediction of Labels by Different Models
The classification of images taken from Wedding dataset and classified using CNN models pre-trained on ImageNet and CNN trained from scratch are shown in Table 2.The prediction of labels using ResNet and Inception pre-trained on ImageNet are shown in the second and third columns.It can be observed that most of the labels are not appropriate or disturbing.For example, for the image in the first row a bride is labelled as abaya, vestment, cloak and theater curtain which are not appropriate.In the image in the third row a man dressed in a regional formal dressed is labelled as bulletproof vest, windsor tie, gar or barracouto, which are disturbing.In the image in the fourth row, labels are predicted as military uniform, pickelhoube, fur coat, bow tie etc, which are not appropriate to be used for a South Asian groom.In the image in fifth row, labels are predicted as groom, feather bos, fountain, stole, sarang etc are used to describe a typical South Asian woman.In the image in the sixth row a typical South Asian dressed man is labelled as file, refrigerator, photocopier, Loafer etc, which are disturbing.In the image in the seventh row a formal dressed man is labelled as bow tie, microphone, mask, jersey, drumstick which are not appropriate.These inappropriate labelling is happening because the CNN models are pre-trained on ImageNet which contains images mostly from Europe, America and Australia.When images from South Asian regions are given to the model, the CNN model predicts inappropriate labels.
We trained LeNet, ResNet and Inception CNN models from scratch on the wedding dataset and the results are shown in the fourth, fifth and sixth columns respectively.These models achieved better accuracy than ResNet and Inception pre-trained models, but the problem is that these models trained from scratch suffer from catastrophic forgetting i.e., these models learn new labels, but do not keep the labels that are learned in ImageNet.To deal with this problem, we have taken TreeCNN pre-trained on ImageNet and then trained on Wedding dataset.This way it not only learns new labels in the Wedding data, but at the same time retains the knowledge gained from ImageNet.

Discussion
We developed a model based on Tree-CNN that is able to extend its knowledge that is learned from ImageNet to another dataset Wedding to learn new classes.For example, bridegroom is a class in ImageNet, but Wedding data set has classes "Bride", "Not Bride", "Groom", "Not Groom", etc.The model based on Tree-CNN is able to learn these new classes while keeping the knowledge learned from ImageNet dataset.If the model is further trained with a new dataset it will be able to extend its knowledge and learn the new classes in the new dataset.This way it is able to reduce the gap of geo-diversity for developing world.Which means, that although there is no or little representation for developing world in the ImageNet or any other standard image dataset, the proposed model can help to learn new classes from a newly developed dataset while keeping the existing knowledge.
Our results have demonstrated that the model proposed in this paper which is based on Tree-CNN is able to learn new classes, which was not able to better captured by popular CNN models trained from scratch.The results of pre-trained models on ImageNet dataset are also shown and have demonstrated that the accuracy is very low.The proposed model based on Tree-CNN has demonstrated an accuracy of 98%.This demonstrates that the model has learned new classes from the newly developed dataset and has extended its knowledge.

Conclusions
Computer vision problems are solved by using CNN trained on different publicly available datasets such as ImageNet or Open Images.However, most of the public datasets contain images from economically developed countries such as Europe, America and Australia and the developing worlds have not representation in the open datasets.The paper studies the problem of no classification without enough representation in detail, analyzing existing public datasets and the classification of newly created datasets based on popular CNN models pre-trained on open datasets.The paper demonstrates that the results of popular models which are trained from scratch on the dataset developed in this paper to demonstrate how classification accuracy is improved when there is representations of different regions.However, the training from scratch on new datasets creates another problem of catastrophic forgetting i.e., the model is not able to retain the knowledge it has gained from the previous dataset.The paper also use pre-trained models on open datasets and make classification of different images in Wedding dataset and demonstrate that the performance is not good and the labels are disturbing sometimes because of no representation from the developing world.This paper presents a novel technique of hierarchical CNN, where the knowledge gained from Open datasets are retained, and is able to learn new features from the new datasets.We demonstrated that Tree-CNN model is able to improve accuracy by more than 95% as compared to training currently popular models from scratch.

Figure 1 :
Figure 1: Fraction of Open Images and ImageNet images from each country.In both data sets, top represented locations include the US and Great Britain.Countries are represented by their two-letter ISO country codes.

Figure 2 :
Figure 2: Distribution of the geographically identifiable images in the Open Images data set, by country.Almost a third of the data in our sample was US-based, and 60% of the data was from the six most represented countries across North America and Europe.

Figure 1 .
Figure 1.Fraction of images in Open Images and ImageNet datasets collected in different regions.In both datasets, majority of the images are collected from US, UK, Canada and Australia.In the image the two-letter code represents country name (Source[1]).

Figure 5 :
Figure 5: Photos of bridegrooms from different countries aligned by the log-likelihood that the classifier trained on Open Images assigns to the bridegroom class.Images from Ethiopia and Pakistan are not classified as consistently as images from the United States and Australia.

Figure 2 .
Figure 2. Photos of bridegrooms from different countries aligned by the log-likelihood that the classifier trained on Open Images assigns to the bridegroom class.Images from Ethiopia and Pakistan are not classified as accurately as images from the regions such as US, UK, Canada, Australia, where the model is trained on (Source [1]).

Figure 3 .
Figure 3.A sample collection of images from the Wedding dataset.

Figure 3 :Figure 5 .
Figure 3: GoogLeNet network with all the bells and whistles , 8 is represented by M and the number of images in a given class is represented by I. O(k, m, i) shows the output from the kth neuron of the ith image that belongs to the mth class where K ∈ [1, K], m ∈ [1, M], and i ∈ [1, I].The values of outputs are averaged over I images and represented by O K×M avg and the softmax is computed as shown in Equation ( or all children nodes are full, Tree-CNN grows horizontally when new classes are added as a new child node.The node becomes a leaf node to make classification of class.

Figure 6 . 3 ) 4 . 4 .
Figure 6.An example demonstration of how Tree-CNN learns new classes from the Wedding data set and adds new classes along with the classes already learned in ImageNet.(a) Shows the Tree-CNN on ImageNet (b) Shows how the Tree-CNN learns new features from Wedding data set and adds with the labels already learned from ImageNet data set.
Train and test accuracy in LeNet, ResNet, Inception and Tree-CNN.Precision, Recall and F1-Score computed by LeNet, ResNet, Inception and Tree-CNN.

Figure 8 .
Figure 8. Accuracy, Precision, Recall and F1-Score computed by LeNet, ResNet, Inception and Tree-CNN in making predictions in Wedding dataset.Figure (a) shows the train and test accuracy in CNNs and Figure (b) shows the Precision, Recall, F1-Score by different CNNs.

Figure 9 .
Figure 9. Confusion Matrix computed by LeNet, ResNet, Inception and Tree-CNN in the prediction of Wedding dataset.

Figure 10 .
Figure 10.ROC curve obtained by LeNet, ResNet, Inception and Tree-CNN computed by making prediction in Wedding dataset.

Figure 11 .
Figure 11.Images of bridegrooms and bride taken from different regions and represented as a log-likelihood using model pre-trained with ImageNet and then learn new classes from the Wedding dataset.

Table 1 .
The categories in the Wedding dataset showing the distribution of images in train and test datasets.