Generative Enhancement of 3D Image Classifiers

In this paper, we propose a methodology for the generative enhancement of existing 3D image classifiers. The methodology combines the strengths of non-generative classifiers and generative modeling. Its purpose is to streamline the creation of new classifiers by embedding existing compatible classifiers in a generative network architecture. We demonstrate this process and evaluate its effects using a 3D convolutional classifier and its generative equivalent, a conditional generative adversarial network classifier. The results show that the generative model achieves greater classification performance, gaining a relative classification accuracy improvement of 7.43%. The accuracy improvement also holds when compared to a plain convolutional classifier trained on a dataset augmented with examples produced by a trained generator. This suggests that a desirable knowledge sharing takes place within the hybrid discriminator-classifier network.


Introduction
Generative modeling has been successfully applied to solve a wide range of tasks including realistic data generation (numeric, text, image, audio, video, etc.), extrapolation, image super-resolution, inpainting, and many others. Data classification is also one of those, even though it is usually not the main goal of designing a generative network.
Deep generative models have been proven to learn a salient representation of the data they were trained on, which is ideal for the classification task. This latent representation is perceptually significant in the generated output, as has been shown by various latent vector arithmetic experiments. [1,2] Lately, one of the most popular varieties of generative models is the Generative Adversarial Network (GAN) [3]. From the classification perspective, a GAN's ability to generate highly realistic data indicates a potential for high performance in classification tasks, especially where large training datasets are not available, which is not uncommon in the domain of 3D imaging. While the GAN was not originally designed for classification, simple modifications exist to enable this functionality. One of those is to use the discriminator itself as a classifier. This can be achieved by extending the existing discriminator output with additional units, one per class, and training the model in a (semi-)supervised manner. [4] While mostly used with 2D image data, a simple modification of the network (using 3D (de)convolution instead of 2D) enables the usage of 3D data in the form of voxel grids or point clouds.
State-of-the-art generative and non-generative classifiers compete very closely in terms of classification accuracy [5]. While models utilizing the two approaches differ significantly in structure, we see potential in combining them by enhancing non-generative classifiers with generative training. This idea is lightly supported in [6], where the authors observe beneficial interactions within the model when combining supervised and unsupervised losses while training a feature matching GAN classifier. This hints at a potential for such interactions within a generative classifier, which would allow further improvement of non-generative state-of-the-art models by embedding them within a generative network architecture.
For these reasons, we propose a general methodology of generative enhancement, which aims to enable any compatible classifier to utilize a generative modeling approach to improve its classification accuracy. We evaluate the methodology by applying it to a simple 3D voxel convolutional classifier, although it is not theoretically limited to this type of data. We investigate the reproducibility and measurability of the benefits of this transformation by comparing the classification performance of the resulting conditional GAN network to the original 3D Convolutional Neural Network (CNN) classifier, identical to the GAN discriminator. Experiments are also performed to investigate possible explanations such as the introduction of noise or data augmentation by the generator.

Related Work
As GAN training is notoriously difficult in general, a lot of research focuses on the training process itself and examines various tweaks that improve the convergence of the model during training. Many of these were utilized in our research. [6,7]
Generative networks have also been used very successfully as a tool for data augmentation, improving the performance of existing (vanilla) classifiers. The benefit of GAN-based data augmentation was shown to even exceed that of standard augmentation methods. This augmentation has proven helpful especially with small and/or unbalanced datasets. [8,9,10]
Conditional generation combined with a classifying discriminator (auxiliary classifier) has been shown to improve the quality of generated samples when trained on a labeled dataset using a model called Auxiliary Classifier GAN (ACGAN) [11]. This network includes separate discrimination and classification terms in the objective function. Another approach, also used in our models, is to combine the terms in such a way that a sample cannot be classified as both fake and of a particular class. This is achieved by creating a 'fake' class and discarding the fake-real discrimination function. Such an approach is also used in BAGAN [12], a generative network designed to restore the balance of an unbalanced dataset by generating samples of the minority classes. Additionally, this network tackles the mode collapse issue by initializing the parameters using an autoencoder.
Data classification is another domain where generative models have found their application. Many deep generative models, using both supervised and semi-supervised training, have been used for various learning and classification tasks. Unsupervised training of generative models is also viable, and various techniques exist to perform this task, such as the Output Distribution Matching cost function [13] or CatGAN [14].
A combination of the GAN architecture and variational autoencoders (VAE) has found its use in the classification and retrieval of 3D models, and is also able to convert 2D images to 3D models. [2] Similarly, these models are useful for tasks concerning other types of data, such as spoken language identification systems. [15] In our research, we focus mostly on the classification of 3D images. The ModelNet (a labeled dataset of 3D objects) website [5] publishes a benchmark leaderboard comparing the results of different 3D object classifiers trained on the dataset, which shows that generative models can achieve state-of-the-art performance and sometimes also exceed it. VRN Ensemble [16], a generative model based on voxel variational autoencoders, is one such example. The standard deconvolutional layer is not the only building block from which a generator can be constructed. A novel factorized generative model [17] generates 3D models by first assembling the shape from generic primitives and then refining it. The resulting representation was evaluated using a classifier, which achieved state-of-the-art results on ModelNet10 among the unsupervised models.

Dataset
The models presented in this paper have all been trained and evaluated on a 3D model dataset called ModelNet [5]. Two subsets of the dataset exist, ModelNet10 and ModelNet40, containing 10 and 40 classes of objects respectively. The orientation of the examples is normalized so that every model's gravity vector is aligned with a common vertical axis. Therefore, no manual adjustments of the mesh objects were necessary.

Preprocessing and Transformation
The ModelNet dataset contains examples represented as 3D CAD meshes. To be usable as input for 3D convolution-based networks, the meshes have to be converted to a voxel grid format. For this purpose, our data transformation pipeline [18] has been used.
The pipeline utilizes a voxelization tool called binvox by Patrick Min [19,20].
It then fills the hollow voxel model and scales it down to reduce the amount of data for performance reasons. There is no color information available in the dataset.
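The filling and downscaling steps can be sketched as follows (a minimal illustration assuming a binary occupancy grid already produced by binvox; the function names are ours and the actual pipeline [18] may differ):

```python
import numpy as np
from scipy import ndimage

def fill_hollow(grid):
    """Fill the interior of a hollow binary voxel model.

    Empty voxels connected to the grid boundary are 'outside' air;
    everything else (shell and enclosed interior) is marked solid.
    """
    empty = grid == 0
    labels, _ = ndimage.label(empty)
    # Labels of empty regions touching the boundary = outside air.
    border = np.unique(np.concatenate([
        labels[0].ravel(), labels[-1].ravel(),
        labels[:, 0].ravel(), labels[:, -1].ravel(),
        labels[:, :, 0].ravel(), labels[:, :, -1].ravel(),
    ]))
    outside = np.isin(labels, border[border != 0])
    return (~outside).astype(np.uint8)

def downscale(grid, factor):
    """Downscale a binary grid by max-pooling over factor^3 voxel blocks."""
    s = grid.shape[0] // factor
    blocks = grid[:s * factor, :s * factor, :s * factor].reshape(
        s, factor, s, factor, s, factor)
    return blocks.max(axis=(1, 3, 5))
```

Max-pooling during downscaling preserves thin structures that mean-pooling with thresholding could erase.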
For debugging and visualization of the dataset and generated examples, the voxel grids are drawn using our custom rendering pipeline [21]. The main parts of the pipeline are surface normal estimation using Sobel operators and texture mapping of an equirectangular lightmap to the voxels. The resulting RGBA-valued voxel grid is rendered using the PyQtGraph library [22].
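The surface normal estimation step can be illustrated roughly as follows (a sketch under our own assumptions, not the actual rendering pipeline [21]): the gradient of the occupancy field points from empty space into the solid, so the negated, normalized Sobel gradient approximates the outward surface normal.

```python
import numpy as np
from scipy import ndimage

def estimate_normals(grid):
    """Estimate per-voxel surface normals from a binary occupancy grid.

    Returns an (X, Y, Z, 3) array; zero vectors where no surface
    gradient exists (deep inside the solid or in empty space).
    """
    g = grid.astype(np.float32)
    # Sobel gradient along each spatial axis.
    grad = np.stack([ndimage.sobel(g, axis=a) for a in range(3)], axis=-1)
    norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    # Negate so the normal points out of the solid; avoid division by zero.
    return np.where(norm > 0, -grad / np.maximum(norm, 1e-9), 0.0)
```

The resulting normals can then be used to look up shading values in an equirectangular lightmap, as the pipeline description suggests.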

Generative Enhancement Methodology
We propose a methodology for constructing generatively enhanced classifiers by embedding existing well-performing neural network classifiers in a compatible generative network architecture. This improves the classification performance of the original classifier along with its tolerance for small training datasets. The methodology is designed to formalize this transformation process, and its creation is in line with one of the goals established in [23].
The modification process consists of several steps. First, a conditional generator network is designed with an output identical in format to the classified data (same number of dimensions, size, number of channels, etc.) and conditioned on the desired class. Then, the classifier is extended to perform the discrimination task (for example, by adding a 'fake' class to the output value set). Finally, the training algorithm is modified to include adversarial optimization of the generator.
Therefore, this can be conceptualized as a neural network transformation, which turns a traditional convolutional neural network classifier into a conditional GAN classifier. By applying this transformation to different classifiers, entirely new architectures can be produced with potentially improved properties.
The main constraint of this methodology is that the classifier must be fully differentiable from one end to the other, so that gradients can be calculated for all layers, including the input layer. This is required to optimize the generator, which is attached to the classifier via the input layer itself. To evaluate the methodology, three models were designed: two conventional convolutional classifiers and, as the third model, the aforementioned 3D conditional GAN network. A high-level diagram of the designed models is shown in Figure 1.
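The transformation steps can be sketched in PyTorch as follows (a minimal illustration; the function name `generatively_enhance`, the layer sizes, and the 32³ output grid are our own placeholder assumptions, and in practice one would replace the classifier's final layer rather than append a mapping):

```python
import torch
import torch.nn as nn

def generatively_enhance(classifier: nn.Module, num_classes: int,
                         latent_dim: int = 128):
    """Turn a differentiable classifier into a conditional-GAN classifier.

    Step 1: build a conditional generator whose output matches the
            classifier's input format.
    Step 2: extend the classifier with one extra 'fake' output class.
    Step 3: (done in the training loop) optimize the generator
            adversarially through the classifier.
    """
    # Step 1: conditional generator (noise + one-hot class -> voxel grid).
    generator = nn.Sequential(
        nn.Linear(latent_dim + num_classes, 256), nn.ReLU(),
        nn.Linear(256, 32 * 32 * 32), nn.Sigmoid(),  # placeholder output
    )
    # Step 2: stand-in for extending the output layer to C + 1 units
    # (index 0 will serve as the 'fake' class).
    discriminator = nn.Sequential(
        classifier, nn.Linear(num_classes, num_classes + 1))
    return generator, discriminator
```

Because the discriminator is differentiable end to end, generator gradients can flow through it via its input layer, which is exactly the constraint stated above.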

3D Convolutional Neural Network Classifiers
The first two models are identical deep 3D convolutional networks consisting of 3 convolutional layers. Stride has been used instead of pooling layers for downsampling.
Batch normalization and Leaky ReLU activation function have been used following each convolutional layer. The output is produced by a single fully-connected layer with a unit count of C, equal to the number of classes. A softmax function is used to convert the output logits to class probabilities.
The loss function used to train these classifiers is a categorical cross-entropy shown in (1). The optimizer used to train these classifiers was RMSprop.
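The described classifier can be sketched in PyTorch as follows (channel counts, kernel sizes, the 32³ input resolution, and the learning rate are illustrative assumptions, not the paper's exact hyperparameters):

```python
import torch
import torch.nn as nn

class VoxelCNN(nn.Module):
    """Three 3D convolutional layers; strided convolutions replace
    pooling, each followed by batch normalization and Leaky ReLU."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=4, stride=2, padding=1),   # 32 -> 16
            nn.BatchNorm3d(32), nn.LeakyReLU(0.2),
            nn.Conv3d(32, 64, kernel_size=4, stride=2, padding=1),  # 16 -> 8
            nn.BatchNorm3d(64), nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, kernel_size=4, stride=2, padding=1), # 8 -> 4
            nn.BatchNorm3d(128), nn.LeakyReLU(0.2),
        )
        # Single fully-connected layer with C output units.
        self.head = nn.Linear(128 * 4 * 4 * 4, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))  # logits

model = VoxelCNN(num_classes=10)
criterion = nn.CrossEntropyLoss()  # categorical cross-entropy over softmax
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
```

Note that `CrossEntropyLoss` applies the softmax internally, so the network outputs raw logits.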

3D Conditional Generative Adversarial Network
The final model is a 3D conditional generative adversarial network (3D CGAN), consisting of a generator-discriminator pair connected in sequence and trained in opposition to each other. The role of the generator is to transform a latent representation (vector) of an object to its visual representation (voxel grid), while the discriminator tries to determine whether the sample is real (from the training dataset) or fake (produced by the generator). The goal of the adversarial training is to achieve the equilibrium, where the generator produces samples that are realistic enough to confuse the discriminator [3]. A conditional GAN introduces a condition for the generator, specifying the desired output by providing a one-hot encoded class vector or another form of specification (eg. keywords or description text) [24]. The starting point for this architecture has been the deep convolutional 3D GAN model published in [2].
The main difference of our model from the standard GAN architecture is the discriminator, which is mostly identical to the plain and augmented CNN models mentioned above. The only difference between the discriminator and the two previously described convolutional classifiers is in the output layer, which, in addition to the class label, contains one extra output indicating whether the input is considered fake or not. Essentially, it distinguishes one additional "fake" class. The dense layer is also followed by ReLU and batch normalization functions.
The loss function chosen to train the discriminator, shown in (2), is a standard categorical cross-entropy.
To include both fake-real discrimination error and classification error, an extended target vector y * is constructed, as shown in (3).
For real samples, a single zero is prepended to the class label from the dataset.
In the case of generated samples, the target is a C + 1 long vector, with the first element (the "fake" class) set to one and the rest to zero.
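The construction of the extended target vector y* can be illustrated as follows (a small sketch; the function name is ours):

```python
import numpy as np

def extended_target(class_label, num_classes, fake):
    """Build the (C + 1)-long target y*: index 0 is the 'fake' class."""
    y = np.zeros(num_classes + 1, dtype=np.float32)
    if fake:
        y[0] = 1.0                 # generated sample: 'fake' class set
    else:
        y[1 + class_label] = 1.0   # real sample: zero prepended to one-hot
    return y
```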
For the generator, the loss function is chosen to minimize the "fake" output of the discriminator as well as to achieve the class output according to the conditioning vector provided to generate the sample. The generator loss function formula is essentially identical to the discriminator loss for a real example. The only difference is the true class vector, which is generated randomly instead of being sampled from the dataset, and is used both as the generator input (concatenated with the noise vector) and as the target vector compared with the discriminator output.
An alternative loss function we experimented with is a modified Wasserstein loss function [25] shown in (4) and (5).
In these formulas, M stands for the number of samples (in a minibatch or in total), C is the number of classes in the dataset, x is a vector of training data samples, y is a vector of one-hot encoded training data labels, ŷ is the classifier prediction, and γ is a hyperparameter used to tune the weight of the classification loss term during training. D and G are the discriminator and the generator functions respectively. The output of the discriminator function referred to as D is the logit value of the "fake" class output of the discriminator. This formula further decouples the discrimination and classification objectives and allows for more competition, which can be tweaked by the γ parameter. The ADAM optimizer [26] has been used to train both the generator and the discriminator.
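Since formulas (4) and (5) are not reproduced here, the following is only a loose sketch of the idea: a Wasserstein-style term on the "fake" logit combined with a γ-weighted classification term (the exact objective in the paper may differ):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(logits_real, logits_fake, y_real, gamma):
    """Index 0 of the logits is the 'fake' score; indices 1..C are the
    class logits. The critic term pushes the fake score up on generated
    samples and down on real ones; gamma weights the classification term."""
    critic = logits_real[:, 0].mean() - logits_fake[:, 0].mean()
    clf = F.cross_entropy(logits_real[:, 1:], y_real)
    return critic + gamma * clf

def generator_loss(logits_fake, y_cond, gamma):
    """The generator drives the fake score down on its samples while
    matching the conditioning class."""
    critic = logits_fake[:, 0].mean()
    clf = F.cross_entropy(logits_fake[:, 1:], y_cond)
    return critic + gamma * clf
```

Raising γ shifts the competition toward correct classification; lowering it emphasizes the fake-real game.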

Training
The training is performed for all three models concurrently. In each training epoch, the training data from the ModelNet10 dataset containing all 10 classes is shuffled and split into minibatches of 16 samples. Each minibatch is then fed into the plain and augmented CNN classifiers, as well as the discriminator. The discriminator is trained in two stages, first with the training data and then with generated 'fake' data. This separation of gradient calculation for real and fake data has been shown to improve training stability [7]. Experiments using both approaches (separating or mixing real and fake data) have been conducted; however, no significant difference has been observed. One-sided label smoothing is also utilized to improve resistance to overfitting.
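One discriminator/generator epoch of this scheme can be sketched in PyTorch as follows (a hedged illustration: the helper names, the smoothing value 0.9, and the optimizer handling are our own assumptions; `G` and `D` are the generator and the extended discriminator with the 'fake' class at index 0):

```python
import torch
import torch.nn.functional as F

def train_epoch(G, D, opt_g, opt_d, loader, num_classes, latent_dim,
                smooth=0.9):
    """Two-stage discriminator update (real batch, then fake batch)
    followed by a generator update; one-sided label smoothing scales
    the real targets by `smooth`."""
    for voxels, labels in loader:          # minibatches of samples
        b = voxels.size(0)

        # Stage 1: discriminator on real data (smoothed targets, class
        # index shifted by one to leave room for the 'fake' class).
        y_real = torch.zeros(b, num_classes + 1)
        y_real[torch.arange(b), labels + 1] = smooth
        opt_d.zero_grad()
        F.cross_entropy(D(voxels), y_real).backward()

        # Stage 2: discriminator on generated data, target = 'fake' class.
        cond = torch.eye(num_classes)[torch.randint(num_classes, (b,))]
        z = torch.cat([torch.randn(b, latent_dim), cond], dim=1)
        fake = G(z)
        F.cross_entropy(D(fake.detach()),
                        torch.zeros(b, dtype=torch.long)).backward()
        opt_d.step()

        # Generator update: match the conditioned class, not 'fake'.
        opt_g.zero_grad()
        target = torch.cat([torch.zeros(b, 1), cond], dim=1)
        F.cross_entropy(D(fake), target).backward()
        opt_g.step()
```

Detaching the fake batch in stage 2 keeps the two discriminator gradient computations separate from the generator update, matching the two-stage scheme described above.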

Results and Discussion
The main evaluation criterion for the designed models is the classification accuracy on the ModelNet10 dataset. We also examine the generative performance of the GAN model to verify that the performance benefit is not caused by some other effect unrelated to the learned latent representation.

Classification
All three models have been evaluated using the test data from the ModelNet10 dataset.
Classification accuracy, as well as mean average precision, have been calculated.
The results are shown in Table 1. Several other modern models utilizing generative modeling with published ModelNet10 results are also included in the table. The results show that generative enhancement of a very simple convolutional classifier (plain CNN) brings its average precision close to the state-of-the-art classifiers. Another comparison of classification performance of our three models is shown in the ROC plot in Figure 3 and some additional insight into classification performance is provided by the confusion matrix in Figure 4.
While the accuracy of our models does not exceed the state of the art, the results show that training an identical classifier inside a GAN model instead of conventionally, using just the training dataset, can produce a significant classification accuracy improvement. In our case, the relative improvement from plain CNN to the GAN-trained classifier was 7.43%.
Comparing the augmented CNN to the GAN classifier, the difference is smaller (a 5.47% relative improvement), however still significant. This suggests that the improvement is not caused merely by introducing noise into the dataset. It also means the generator is not just a data augmentation mechanism, because in that case the augmented CNN would exhibit a similar improvement.

Object Generation
We also examined the output of the generator to verify the model is working as expected. The quality of the generated examples was not our primary goal, however, it can be an indicator of errors within the architecture design or implementation.
Several generated examples are shown in Figure 5. Mode collapse, a common pitfall of generative models, has been a major problem during our experimentation. Several solutions have been tested, such as using the Wasserstein loss function; however, the most significant improvement of both mode collapse and the general convergence of the network was achieved by initializing the ADAM optimizer with a very low learning rate α ≈ 10⁻⁴. As can be seen, there is visible intra-class variance in the output of the generator. Some noise and artifacts are also produced; however, they apparently do not cause any adverse effects during classification aside from the lower visual quality of the output.

Performance and Timing
We also investigated the computational performance of the proposed models. As the training has been performed concurrently, we can compare the accuracy evolution of the individual models over the training time.

Conclusion and Future Work
In this paper, we proposed a methodology for generative enhancement of 3D image classifiers and experimentally verified its viability and evaluated its performance in a basic application scenario.
The results of the experiment support the initial idea that a conditional GAN network trained in a supervised manner (i.e. a generatively enhanced convolutional classifier) enables a knowledge transfer between the generator and the classifying discriminator that significantly improves its classification accuracy. The mechanism of these interactions is not clear, and it could be a subject of further theoretical research.
One possible explanation for the classification performance improvement is that the generator introduces noise into the training process of the discriminator/classifier and therefore makes it more robust. This explanation is not completely sufficient, as the same effect occurs in the augmented CNN model, which does not show the same improvement. The same applies to another explanation, which attributes the improvement to the data augmentation caused by the generator producing realistic but not identical training examples, thereby improving the robustness of the classifier.
This should also manifest in the augmented CNN model which, however, falls short of the CGAN classifier performance.
The results of this study suggest the potential of the generative enhancement methodology for improving other existing well-performing classifiers by embedding them within a generative architecture. Our next efforts will focus on exploring other candidates for this transformation and evaluating the resulting architectures. Additionally, we plan to perform a physical experiment to verify the real-life application potential of the newly created models. For this purpose, a custom rotary table 3D scanner is being designed to acquire scans of real-life objects, which will be used for the model evaluation. These efforts, including the existing hardware and software solutions, are part of the Technicom project [29].