3DMGNet: 3D Model Generation Network Based on Multi-Modal Data Constraints and Multi-Level Feature Fusion

Due to the limited information in a single image, it is very difficult to generate a high-precision 3D model from that image. The generation of 3D voxel models also suffers from problems such as information loss in the upper layers of a network. To solve these problems, we design a 3D model generation network based on multi-modal data constraints and multi-level feature fusion, named 3DMGNet. Moreover, 3DMGNet is trained in a self-supervised manner to generate a 3D voxel model from an image. An image feature extraction network (2DNet) and a 3D feature extraction network (3D auxiliary network) are used to extract the features of the image and the 3D voxel model, respectively. Then, feature fusion is used to integrate the low-level features into the high-level features in the 3D auxiliary network. To extract more effective features, the feature map in each layer of the feature extraction network is processed by an attention network. Finally, the extracted features generate a 3D model through a 3D deconvolution network. The feature extraction from the 3D model and the generation of its voxelization play an auxiliary role in training the whole image-based 3D model generation network. Additionally, a multi-view contour constraint method is proposed to enhance the effect of 3D model generation. In the experiments, the ShapeNet dataset is adopted to evaluate 3DMGNet, which verifies the robust performance of the proposed method.


Introduction
Generating the 3D digital model of an object from an image by 3D reconstruction is a challenging problem in the fields of computer vision and computer graphics. Rapid and automatic 3D model generation is the basis of, and an important part in, applications such as digital cities, virtual reality [1], and process simulation. In particular, the efficient generation of 3D models from images can greatly reduce the workload of designers and users, and increase the real-time ability of human-computer interaction. Thus, it is of great significance for virtual reality and software design to quickly generate the corresponding 3D model from an image captured by a mobile device.
The core of 3D model generation is the discriminativeness of the extracted features, which determines the accuracy of 3D model generation. At present, deep learning-based 3D model generation has achieved robust performance, e.g., ShapeNet [2], PrGan [3], Pixel2mesh [4], ICV [5], PGGAN [6], and 3D-VAE [7]. These methods obtain a feature vector that represents the 3D model of the object through the encoding of a convolutional network or manual design, and the feature vector is then used to generate a 3D model of the object. Networks for 3D model generation generally adopt 3D convolutions, extended from 2D convolutions, to construct 3D feature extraction networks (e.g., 3DCNN [8]). For example, the TL-Net [9] leverages a 2DCNN and a 3DCNN to encode the image and the 3D voxel model into the same latent space, respectively. Then the feature of the image is decoded to generate the 3D model. Although convolutional neural networks have achieved excellent results in image processing, such as VGG [10], ResNet [11], and GoogLeNet [12], the feature extraction of 3D models still faces some challenges. The commonly used 3D feature extraction networks adopt a coarse-to-fine feature extraction method in processing the 3D model. The high-level network layers obtain highly abstract information, but lose the original details of the 3D model. Thus, this loss of feature information becomes one of the factors that affect the accuracy of 3D model generation. To solve the above problems, Tchapmi et al. [13] propose the SegCloud network, in which fine-grained segmentation is obtained by adding residual modules to the network. Inspired by this idea, we propose a multi-level feature fusion method for robust 3D model feature extraction and representation. The backbone of our network is an extended version of the autoencoder [14].
To make full use of the features in the lower and upper layers of the network, the feature extraction part of the network is improved by introducing a residual module. The residual module is the key design of the residual network [11]. In this research, the residual module is sandwiched between two convolutional operations. As a bridge between high-level features and low-level features, the information of the feature map near the input layer is integrated into subsequent high-level feature maps, to improve the robustness of the final 3D voxel model features.
In order to reduce the redundancy and enhance the effectiveness of the features generated by the feature fusion, we introduce an attention mechanism based on SeNet [15], which is a channel-wise attention method. For an image feature, the feature of each channel is multiplied by a learned weight to suppress the features of the unimportant channels and enhance the features of the important channels, thereby improving the quality of the features. We apply this module in our 3DMGNet to process the features of each layer, which strengthens the expression of the effective feature information extracted from the 3D model and reduces the redundant feature information, to ultimately enhance the robustness of the 3D voxel model features.
Similar to the TL-Net [9], the proposed method adopts the idea of constraint modeling, which trains the network through data consistency constraints and feature constraints. The difference from the TL-Net is that we improve the image feature extraction network and the 3D auxiliary network respectively, under the premise of the constraint modeling idea that the 3D voxel features constrain the image features. Moreover, the loss function is also improved by introducing multi-view contour constraints.
The main contributions of this research are as follows: (1) An image feature extraction network is introduced in the 2DNet for 3D model generation. As shown in Figure 1a, Se-ResNet-101 is implemented to extract image features based on a transfer learning method. As shown in Figure 1b, SeNet [15] is embedded into the residual module of ResNet [11] to construct the basic architecture of Se-ResNet-101. (2) A new 3D voxel model feature extraction network is proposed. For this network, the features of the high-level feature maps and the low-level feature maps are merged to enhance the quality and robustness of the 3D voxel features, by adding skip connections to the 3D autoencoder structure. Besides, an attention mechanism is introduced to learn the weights of the fusion features in each channel, and to enhance the quality of the features by feature redirection. (3) We propose a novel 3D model generation network based on multi-modal data constraints and multi-level feature fusion, i.e., 3DMGNet. Notably, 3DMGNet incorporates a multi-view contour constraint into the construction of the loss function, which improves the effect of the 3D model generation network.

Volumetric 3D Modeling
Voxel is a kind of Euclidean structure data representation of 3D data [16]. Similar to the image, the 3D voxel model is composed of a regular grid, which is arranged in the 3D space. The existence or absence of the voxel grid in the 3D space is respectively represented by 1 or 0. Voxelization representation is widely used in 3D modeling tasks and other related 3D vision tasks. For example, VoxelNet [17], VConv-DAE [18], and LightNet [19][20][21].
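As a small illustration of this occupancy encoding (the sphere shape and the 32-voxel resolution are hypothetical choices for illustration, not taken from the paper), a shape can be voxelized into a binary grid as follows:

```python
import numpy as np

def voxelize_sphere(res=32, radius=0.4):
    """Voxelize a sphere into a res^3 binary occupancy grid.

    Grid cells whose centers fall inside the sphere are 1, the rest 0,
    matching the presence/absence encoding described above.
    """
    # Cell-center coordinates in [-0.5, 0.5] along each axis.
    coords = (np.arange(res) + 0.5) / res - 0.5
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    return (x**2 + y**2 + z**2 <= radius**2).astype(np.uint8)

grid = voxelize_sphere()
print(grid.shape, int(grid.sum()))
```

The occupied fraction of the grid approximates the sphere's volume fraction of the unit cube, so the representation is lossy only up to the grid resolution.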
The generation of 3D voxel models is one of the main research fields in 3D vision. The common characteristic of these networks is that the feature of the image is extracted by the network directly or indirectly, and is further used to reconstruct the 3D voxel model. The network involved in this process is also called an encoder-decoder network.
Recently, to solve the problem of 3D voxel model generation, many researchers have attempted to train a generation model by deep learning. For instance, Wu et al. [22] proposed ShapeNet, which expresses 3D data in a voxelized form for the first time, and indicates the presence or absence of each point in 3D space with a standard grid. Original 3D models in point cloud or mesh form are easily processed by a convolutional network after voxelization. ShapeNet expands the 2D convolutional network to 3D space, which inspires researchers to explore various vision tasks, including the generation of 3D models. The 3DGAN [7] uses random noise to generate a 3D voxel model. The authors extend the 2D GAN (Generative Adversarial Network) structure to 3D space. Furthermore, 3DGAN is composed of two parts: a generator network and a discriminator network. The generator network is used to generate a 3D model, and the discriminator network is adopted to determine the difference between the generated 3D model and the real 3D model. This network can generate a high-precision 3D model. However, the GAN is hard to train, which makes it challenging to obtain a robust model. Choy et al. [23] propose the 3D-R2N2 network, which designs an RNN (recurrent neural network) to learn the features of multiple images, to form the latent feature representation of the target object from one or more views, from which a 3D voxel model of the object is generated. However, this method does not perform well when reconstructing models with poor texture. PrGAN [3] is proposed to obtain the 3D voxel model and viewpoint information of an object through a 3D up-sampling network with an initialized vector. Contour images are projected from the generated 3D model and the raw 3D model, respectively. Finally, a discriminator is added to distinguish the generated contour images from the real contour images.
In fact, the authors adopt the idea of the GAN [24] to improve the generation effect through data constraints, while Wu et al. [25] learn the geometric information and structural information of objects by two branches of an RNN. The geometric and structural information is composed of two parts: the voxel information of each part in the 3D model, and the relationship between the bounding boxes of the paired 3D voxel models. In this method, the two kinds of information are merged to generate a higher-precision 3D model with more robust features. However, there is currently no guarantee that all generated shapes are of high quality.

Point Cloud Modeling
Compared with the voxel model, the unordered point cloud has the advantage of high storage efficiency and can express the fine details of a 3D object, while the point cloud is not easily processed by a convolutional network. There has been much 3D model generation work based on the point cloud. For example, Mandikal et al. [26] propose 3D-LMNet, in which a latent space is learnt by an auto-encoder network, and the image is encoded into this space. With a constraint on the difference between the feature of the image and the feature of the point cloud, a latent vector that can generate a 3D model from an image is obtained. Thus, the latent spatial representation is learnt in a probabilistic way, which can predict multiple 3D models from the image. Similar to the TL-Net [9], this network also adopts a multi-stage training method. Gadelha et al. [27] propose a tree-coding network to process point clouds, which can be combined with its tree-shaped decoding network to generate a 3D point cloud model. However, the proposed tree-coding network needs to represent the 3D model as a set of locality-preserving 1D ordered lists. Jiang et al. [28] propose a single-view 3D modeling network based on a conditional adversarial loss, which is similar to the 3DGAN [7]. Besides, Li et al. [29] propose the PC-GAN, which directly extends the GAN to the generation of 3D point clouds. Mandikal et al. [30] propose the Dense-PCR network, a deep pyramid network used for point cloud reconstruction. Starting from a low-resolution point cloud, it achieves grid deformation by aggregating local and global point features, to increase the resolution of the grid hierarchically. However, certain predictions have artifacts consisting of small clusters of points around some regions, due to outlier points in the sparse point cloud from which aggregated features are obtained in the dense reconstruction.

Mesh Modeling
The mesh model is a 3D data representation composed of a series of vertices and triangular patches. The mesh model has its own topological structure and cannot be easily processed by a convolutional network. Thus, a mesh is first converted into spherical parameters or geometry images before being processed by the convolutional network. Some work related to the generation of mesh models has also been proposed. For example, Sinha et al. [31] propose SurfNet, in which a geometry image of the mesh model, with its correspondence to the underlying mesh, is obtained through a spherical parameterization method, and a network generating the geometry image is designed. Because of the correlation between the geometry image and the raw mesh, the mesh model can easily be generated from the geometry image. However, SurfNet can only reconstruct the surface of a rigid object. Pumarola et al. [32] propose a geometry-aware surface generation method for 3D models, which locates the 2D mesh on the image through a 2D detection branch to detect the 2D positions and confidence map of the mesh. Then, the vertices of each 2D mesh are lifted to 3D through a 3D depth branch. Image cues are also used to further improve the quality of the 2D detection. Then, the reconstruction branch generates the surface of the 3D model through a perspective projection method. This method can estimate the 3D shape of a non-rigid surface from a single image. Groueix et al. [33] propose AtlasNet, inspired by the formal definition of a surface, which holds that a surface patch is locally similar to the topological space of the Euclidean plane. A latent variable embedding the shape of the object is learnt by a method of mapping squares to local surfaces of the object. Finally, a mesh 3D model of the object is generated through a decoding network, but AtlasNet cannot capture detailed information well.

The Proposed Method
Our network is composed of two parts: the 3D auxiliary network and the 3D model generation network. The 3D auxiliary network consists of two parts, i.e., 3D convolution and 3D deconvolution, as shown in Figure 2. The 3D model generation network consists of the 2DNet and the 3D deconvolution. The 2DNet is the image feature extraction network. The 3D auxiliary network is responsible for obtaining a robust feature vector that represents the 3D model, by reconstructing the 3D model using the labels in a self-supervised learning way. Multi-level feature fusion of the traditional 3D feature extraction is processed by the 3D convolution. In order to improve the expression ability of the extracted features, we introduce an attention mechanism, which adds SeNet into the 3D convolution. The 3D model generation network is responsible for extracting the features of the image by using transfer learning. The image features are used to generate the 3D model of the object through the 3D up-sampling network (in the 3D auxiliary network). Similar to the GAN [24], the reconstruction accuracy of the 3D auxiliary network is controlled by the reconstruction loss function loss_res. A constraint loss function loss_{2-3distance} is constructed from the L2 distance between the image feature in the 3D model generation network and the 3D model feature in the 3D auxiliary network. Additionally, four contour images of the generated voxel model and the original voxel model are selected to construct the multi-view contour loss l_contour, which imposes view consistency constraints to improve the training effect and enhance the performance of the model generation network. A training-in-stages strategy is adopted to optimize the loss functions designed in our network. Considering that the overall performance of the network is affected by the number of parameters and the amount of data, the 2D feature extraction network (2DNet) extracts the features of the image by transfer learning.
Three SeNet [15] modules are separately connected after the Conv3, Conv4, and Conv5 layers. The specific structure of SeNet is shown in the middle left of the figure. In the 3D DeConvolution module, there are four deconvolution layers (DeConv1-DeConv4). In the 2DNet module, the image feature is firstly extracted by the Se-ResNet-101 [15] network, and then two fully connected layers change the dimension of the feature to be the same as the dimension of the feature extracted by the 3D Convolution module. The multi-view consistency loss shown in the lower part of the figure is designed to optimize the 3D auxiliary network.

Image Feature Extraction
The Se-ResNet-101 network [15] is used in the 2DNet to extract the image features. Se-ResNet-101 is an extended version of ResNet [11], which has 101 layers. As shown in Figure 2, the output feature map of the Se-ResNet-101 C5 layer is taken as the input of the subsequent layers of the 2DNet. A 128-dimensional feature vector is obtained from the C5 layer by global average pooling and two fully connected operations. Here, we implement a transfer learning method, i.e., the weights trained on ImageNet [34] are used as the pre-training model. The 128-dimensional feature vector is used to generate a 3D model through the 3D deconvolution part of the 3D auxiliary network.
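The head of the 2DNet described above, i.e., global average pooling over the C5 feature map followed by two fully connected layers producing a 128-dimensional vector, can be sketched in NumPy. The C5 shape (2048 channels at 7×7) and the hidden width of 512 are illustrative assumptions, not values stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w, b):
    """Fully connected layer: x @ w + b."""
    return x @ w + b

# Hypothetical C5 output of Se-ResNet-101 for one image:
# 2048 channels over a 7x7 spatial map (sizes are illustrative).
c5 = rng.standard_normal((2048, 7, 7))

# Global average pooling collapses each channel to a scalar.
pooled = c5.mean(axis=(1, 2))            # shape (2048,)

# Two fully connected layers map the pooled vector to 128 dimensions,
# matching the feature dimension used by the 3D auxiliary network.
w1, b1 = rng.standard_normal((2048, 512)) * 0.01, np.zeros(512)
w2, b2 = rng.standard_normal((512, 128)) * 0.01, np.zeros(128)

hidden = np.maximum(fc(pooled, w1, b1), 0.0)   # ReLU
z_2d = fc(hidden, w2, b2)                      # 128-d image feature
print(z_2d.shape)
```

In the full network this 128-dimensional vector is the image feature that is constrained to match the 3D feature and fed to the 3D deconvolution part.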

Self-Supervised Learning and Multi-Level Feature Fusion Network for 3D Reconstruction
In order to solve the problem that the high-level features extracted from the 3D model lack detailed information, skip connections [13] are introduced to fuse the features of different levels, so that the fused features simultaneously contain both the local and the global information of the 3D model.
As shown in Figure 3, the residual module is added in the 3D convolutional part of the 3D auxiliary network. Moreover, the 3D convolutional part consists of convolution layers and max-pooling layers. In the 3D convolution network, three residual layers are designed, and the 3D convolution network extracts a 128-dimensional feature vector of the 3D model. The 3D deconvolution part consists of DeConv1-DeConv4, which generates the 3D model from the intermediate features.
For a network, the feature of the i-th layer is expressed as follows:

f_i = Relu( Σ_{n=1}^{N} f_{i-1} * w_i + bias_i ) (1)

where f_{i-1} is the feature of the (i-1)-th layer, and N is the channel number of this feature map. f_0 is the input data of the network. w_i is the convolution kernel. bias_i is the bias vector. The operator "*" represents the convolutional operation. Relu(·) is a nonlinear activation function. The calculation process of 3D deconvolution is the inverse process of 3D convolution in the forward and back propagation processes. The 3D deconvolution process is as follows:

f_i = Relu( Σ_{n=1}^{N} f_{i-1} * w_i + bias_i ) (2)

where f_{i-1} is the feature map of the (i-1)-th layer, and N represents the channel number of the feature map. The kernel size of w_i is f_n × f_n × f_n. bias_i is the bias vector. The w_i in the deconvolution is the inverse matrix of the kernel in the convolutional calculation [35]. As shown in Figure 3, the residual layer is sandwiched between multiple convolution layers at different depths from the input layer, to blend the features of the layer closer to the input layer with the layers farther away. The calculation process is divided into a direct mapping part and a residual part. The output of a residual unit can be expressed as:

H_i = H_{i-1} + F(H_{i-1} * W_{i-1}) (3)

where H_{i-1} represents the direct mapping part, which is the input of the residual layer. N represents the number of input feature maps. F(H_{i-1} * W_{i-1}) represents the residual part. W_{i-1} is the kernel of each convolution layer in the residual part, which is composed of two convolutional operations.
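The convolutional layer update and the residual unit H_i = H_{i-1} + F(H_{i-1}) described above can be sketched in NumPy. This is a minimal single-channel sketch; the kernel size, volume size, and the ReLU placement inside the residual branch are illustrative assumptions:

```python
import numpy as np

def conv3d(x, w, bias=0.0):
    """Naive single-channel 3D convolution with zero padding ('same' size)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, p)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            for l in range(x.shape[2]):
                out[i, j, l] = np.sum(xp[i:i+k, j:j+k, l:l+k] * w) + bias
    return out

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(h, w1, w2):
    """Residual unit: direct mapping plus a residual branch F composed of
    two convolutions, i.e., H_i = H_{i-1} + F(H_{i-1})."""
    return h + conv3d(relu(conv3d(h, w1)), w2)

rng = np.random.default_rng(0)
h = rng.standard_normal((8, 8, 8))        # toy 3D feature map
w1 = rng.standard_normal((3, 3, 3)) * 0.1
w2 = rng.standard_normal((3, 3, 3)) * 0.1
out = residual_unit(h, w1, w2)
print(out.shape)
```

Note that when the residual branch contributes nothing (e.g., a zero second kernel), the unit reduces to the identity mapping, which is the property that lets low-level detail survive to the high-level feature maps.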

The Attention Mechanism
To make the extracted fusion features more representative, an attention module (i.e., SeNet [15]) is added to further process the fusion features. The weights of each channel feature are redirected to suppress the unimportant parts of the features and enhance the discriminative parts. SeNet [15] is a channel attention mechanism: through the attention network structure, the output features are designed to pay more attention to the relationship among the channels of the learned features. It also redefines the importance of each channel feature according to the learned weights. The attention mechanism can usually reduce the redundant information and optimize the fusion features to obtain more robust features. As shown in Figure 4, SeNet includes three processes: F_sq obtains the global feature of each channel; F_ex(·, w) learns the feature weight of each channel; F_scale(·) redirects the weights to each channel to obtain the final optimized features with the learnt weights.
The designed network with the attention mechanism is shown in Figure 4. SeNet is added after the two-layer convolution of the residual module. To learn the weights of the important channel features and suppress the feature information of the unimportant channels, the input feature is compressed channel-wise through weight redirection operations to obtain the new weights of the model.

Figure 4. The attention mechanism based on SeNet [15]. The feature f_in is firstly input to the F_sq module to obtain the global feature of each channel. Then, the feature is input to the F_scale(·) module to calculate the weight value of each channel. Finally, the weight value of each channel is multiplied with the input feature f_in, and the result is concatenated with the feature from the residual layer.
To obtain the global feature of each channel, the feature map obtained by the fusion of global features and local features is compressed for each channel through a max-pooling operation. The weight values of each channel are learnt by two fully connected layers. The feature calculation through SeNet is as follows:

f_out = f_in * sigmoid(fc_2(fc_1(maxpool(f_in)))) (4)

where f_in is the input feature of the SeNet module. maxpool(·) represents the global max-pooling operation. fc_1 and fc_2 are fully connected operations. sigmoid(·) is the activation function.
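The SeNet computation in Equation (4) can be sketched in NumPy as follows. The channel reduction ratio, the ReLU between the two fully connected layers, and the tensor shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def senet(f_in, w1, w2):
    """Channel attention: f_out = f_in * sigmoid(fc2(fc1(maxpool(f_in)))).

    f_in: (C, D, H, W) 3D feature map; w1: (C, C//r); w2: (C//r, C).
    """
    # F_sq: global max pooling squeezes each channel to one scalar.
    squeezed = f_in.max(axis=(1, 2, 3))                      # shape (C,)
    # F_ex: two fully connected layers learn per-channel weights in (0, 1).
    weights = sigmoid(np.maximum(squeezed @ w1, 0.0) @ w2)   # shape (C,)
    # F_scale: re-weight each channel of the input feature.
    return f_in * weights[:, None, None, None]

rng = np.random.default_rng(0)
f_in = rng.standard_normal((16, 4, 4, 4))
w1 = rng.standard_normal((16, 4)) * 0.1   # reduction ratio r = 4 (assumed)
w2 = rng.standard_normal((4, 16)) * 0.1
f_out = senet(f_in, w1, w2)
print(f_out.shape)
```

Because the sigmoid gate lies in (0, 1), each channel can only be attenuated, never amplified, which is how unimportant channels are suppressed while important ones are (relatively) enhanced.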

Multi-View Contour Constraints
To make the generated 3D model more accurate, the multi-view contour projection method is used to construct the constraint. In this paper, three contours are selected along the 45°, 90°, and 135° rotation angles around the model, together with one contour at an elevation angle of 90° and a rotation angle of 90°. These four contours describe the outline of the object more effectively. Moreover, these four contour maps construct the corresponding multi-view map constraints, which drive the network to generate a 3D model similar to the corresponding original 3D model.
The projection contour image is calculated by the following steps: (i) the rotation matrix rotmatrix(θ, γ) is obtained from the parameters of the elevation angle θ and the rotation angle γ.
(ii) The rotated 3D model V_θ,γ(i, j, k) is obtained by coordinate transformation of the 3D voxel model V: each of its coordinate points c(i, j, k) is multiplied by the rotation matrix rotmatrix(θ, γ). (iii) Similar to orthographic projection, the projection contour image of the 3D voxel model is obtained by cumulatively summing the number of voxels along the viewing direction of the rotated model.
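Steps (i)-(iii) might be sketched as follows; the rotation convention (azimuth about z, then elevation about y) and the nearest-neighbor resampling are assumptions, since the paper does not fix them:

```python
import numpy as np

def rotmatrix(theta, gamma):
    """Rotation matrix from elevation theta and rotation gamma (degrees).
    The z-then-y composition is an assumed convention."""
    t, g = np.deg2rad(theta), np.deg2rad(gamma)
    rz = np.array([[np.cos(g), -np.sin(g), 0.0],
                   [np.sin(g),  np.cos(g), 0.0],
                   [0.0, 0.0, 1.0]])
    ry = np.array([[np.cos(t), 0.0, np.sin(t)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(t), 0.0, np.cos(t)]])
    return ry @ rz

def project_contour(voxels, theta, gamma):
    """Steps (i)-(iii): rotate the occupied voxel coordinates about the grid
    center, accumulate voxel counts along the viewing axis, and binarize."""
    n = voxels.shape[0]
    rot = rotmatrix(theta, gamma)
    centre = (n - 1) / 2.0
    rotated = np.zeros_like(voxels)
    for i, j, k in zip(*np.nonzero(voxels)):
        c = rot @ (np.array([i, j, k], dtype=float) - centre) + centre
        x, y, z = np.clip(np.rint(c).astype(int), 0, n - 1)
        rotated[x, y, z] = 1
    depth_sum = rotated.sum(axis=2)       # orthographic accumulation
    return (depth_sum > 0).astype(np.uint8)

v = np.zeros((32, 32, 32)); v[10:20, 10:20, 10:20] = 1   # toy cube
contour = project_contour(v, 0, 45)
```

The result is a binary 32 × 32 contour map of the rotated model, which can then be compared with the projection of the ground-truth voxel model.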

Loss Function
The loss function contains three parts: the contour loss (loss_contour), the reconstruction loss (loss_res), and the two-dimensional-three-dimensional feature distance loss (loss_2-3distance).
The output of 3DMGNet is the predicted probability value of every coordinate point in the voxel model. The reconstruction loss loss_res is calculated by cross-entropy, where p_i,j,k is the probability of a real voxel point and p̂_i,j,k is the probability that the point is predicted as a real voxel. The Euclidean distance is used in the loss functions loss_2-3distance and loss_contour to calculate the distance between the features extracted by 2DNet and by the 3D auxiliary network, and the distance between the generated 3D model contour map and the original 3D model contour map, respectively. In the loss_2-3distance function, z_2d and z_3d represent the intermediate feature vectors extracted from the image and the 3D model, respectively. In the loss_contour function, r_project and pred_project represent the projections of the real voxel model and the predicted 3D voxel model, respectively.
During training, the reconstruction accuracy is promoted by optimizing the loss functions loss_res and loss_contour. The distance between the features extracted from the image and the features extracted from the 3D model is minimized by optimizing the loss function loss_2-3distance. Thus, both the 3D auxiliary network and the 3D model generation network can produce 3D models with high accuracy. The overall loss function can be organized as follows: Loss_total = λ·loss_2-3distance + γ·loss_res + µ·loss_contour (8) where λ, γ, and µ represent the proportions of loss_2-3distance, loss_res, and loss_contour, respectively.
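A minimal sketch of Eq. (8), combining the three losses described above; the exact normalization of each term is an assumption, since the paper only names cross-entropy and Euclidean distance:

```python
import numpy as np

def total_loss(p, p_hat, z2d, z3d, r_proj, pred_proj,
               lam=1.0, gamma=0.9, mu=0.8, eps=1e-7):
    """Loss_total = lam * loss_2-3distance + gamma * loss_res + mu * loss_contour."""
    p_hat = np.clip(p_hat, eps, 1.0 - eps)
    # Voxel-wise binary cross-entropy reconstruction loss.
    loss_res = -np.mean(p * np.log(p_hat) + (1 - p) * np.log(1 - p_hat))
    # Euclidean distance between 2D and 3D intermediate feature vectors.
    loss_23 = np.linalg.norm(z2d - z3d)
    # Euclidean distance between real and predicted contour projections.
    loss_contour = np.linalg.norm(r_proj - pred_proj)
    return lam * loss_23 + gamma * loss_res + mu * loss_contour

# tiny sanity check: a near-perfect prediction yields a small total loss
p = np.array([1.0, 0.0]); p_hat = np.array([0.999, 0.001])
z = np.ones(4); proj = np.ones((2, 2))
loss = total_loss(p, p_hat, z, z, proj, proj)
```

When the features and projections match exactly, only the small cross-entropy term remains, which matches the intent that all three terms vanish for a perfect reconstruction.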

The Training and Test of the Model
The training of 3DMGNet is divided into three stages. In the first stage, the 3D auxiliary network is trained, using the reconstruction loss loss_res and the contour loss loss_contour to guide the auxiliary network to reconstruct a high-precision 3D model. In the second stage, the parameter updating of the 3D convolution is suppressed, and the loss function loss_2-3distance constrains the features of 2DNet against those of the 3D convolution. In the third stage, the whole 3DMGNet is trained until each loss function converges. We test the trained model on different datasets. The test network is composed of 2DNet and the 3D deconvolution network. The test result is the voxel coordinate positions together with the probability that each voxel coordinate point in the 3D model is a real voxel point. By setting different thresholds, the coordinate points whose probability value is lower than the threshold are eliminated. Finally, we obtain the 3D model under different thresholds.
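The test-time thresholding step can be sketched as follows (a minimal numpy sketch; `extract_voxels` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def extract_voxels(prob_volume, threshold):
    """Keep only the coordinates whose predicted probability reaches the
    threshold; points below the threshold are eliminated."""
    return np.argwhere(prob_volume >= threshold)   # (N, 3) coordinates

# toy 32^3 probability volume with two candidate voxels
probs = np.zeros((32, 32, 32))
probs[0, 0, 0] = 0.9
probs[1, 1, 1] = 0.2
```

Raising the threshold from 0.1 to 0.5 here drops the low-confidence point at (1, 1, 1), which is exactly why the reported IOU varies with the threshold.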

Results
In this section, we briefly introduce the experimental dataset, metrics, and implementation. Then, we describe the ablation study, experiment results analysis, and comparisons with other methods.

Dataset
The dataset that we used comes from ShapeNet [2]. Some classes in ShapeNet are chosen to build our experiment dataset, which includes plane, bench, chair, sofa, and monitor, etc. Table 1 shows the training and test data of each class. Each object datum is organized with two parts: the first part is the image of the object, which is obtained through 20 different views; the other part is the voxel model of the object.
Table 1. The training and test data of each class.
Category    Total  Train  Test
Plane       3641   2831   810
Bench       1635   1271   364
Cabinet     1415   1100   315
Car         6748   5247   1501
Chair       6101   4744   1357
Monitor     986    766    220
Lamp        2087   1622   465
Speaker     1457   1132   325
Rifle       2135   1660   475
Sofa        2857   2222   635
Table       7659   5956   1703
Telephone   947    736    211
Watercraft  1746   1357   389

Metrics
Intersection over union (IOU) is used as the metric to evaluate the 3D model generation quality. IOU is the intersection over union between the generated voxel model and the real voxel model. For a generated voxel model I, each 3D coordinate point (x, y, z) has a corresponding predicted probability, so different thresholds produce different IOU values:
IOU = Σ_{i,j,k} I(I(p̂_{i,j,k} > t) · I(p_{i,j,k})) / Σ_{i,j,k} I(I(p̂_{i,j,k} > t) + I(p_{i,j,k}))
where p̂_{i,j,k} and p_{i,j,k} represent the probability values of the predicted voxel and the real voxel, respectively. I(·) denotes the indicator function, and t represents the threshold value. The numerator is the number of intersection points, and the denominator is the number of union points.
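Under this definition, the metric can be computed directly from boolean masks; the following numpy sketch assumes binary ground-truth occupancies:

```python
import numpy as np

def iou(p_hat, p, t):
    """Intersection over union of the thresholded prediction against the
    ground-truth voxels, following the indicator-function definition."""
    pred = p_hat > t                        # I(p_hat > t)
    gt = p.astype(bool)                     # I(p)
    inter = np.logical_and(pred, gt).sum()  # intersection points
    union = np.logical_or(pred, gt).sum()   # union points
    return inter / union

# toy example: ground truth fills 2 slabs, prediction confidently fills 3
p = np.zeros((4, 4, 4));     p[:2] = 1.0        # 32 real voxels
p_hat = np.zeros((4, 4, 4)); p_hat[:3] = 0.8    # 48 predictions above 0.5
```

Here `iou(p_hat, p, 0.5)` is 32/48: the 32 true voxels are all recovered, but the 16 extra predicted voxels enlarge the union and lower the score.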

Implementation
In this research, the whole network is trained in stages, and the loss function is optimized by the momentum optimizer. The learning rates in the three stages are set as 0.0001, 0.0005, and 0.00005, respectively. The λ, γ, and µ are set as 1.0, 0.9, and 0.8, respectively. The batch size is set as 10, and the number of training epochs is set as 400. The input image size is 224 × 224 (the images are rendered from the 3D models), and the resolution of the input and generated voxel models is 32 × 32 × 32.
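The staged hyperparameters above might be organized as in the following sketch; the classical momentum update rule and the β value are assumptions, since the paper only names the momentum optimizer:

```python
import numpy as np

STAGE_LR = [0.0001, 0.0005, 0.00005]          # per-stage learning rates
LOSS_WEIGHTS = {"lam": 1.0, "gamma": 0.9, "mu": 0.8}
BATCH_SIZE, EPOCHS = 10, 400

def momentum_step(w, grad, velocity, lr, beta=0.9):
    """One classical momentum update (beta = 0.9 is an assumed value)."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

# minimise f(w) = w^2 for a few steps with the stage-1 learning rate
w, v = np.array([1.0]), np.zeros(1)
for _ in range(3):
    w, v = momentum_step(w, 2.0 * w, v, STAGE_LR[0])   # grad of w^2 is 2w
```

Each stage would swap in its own entry of `STAGE_LR` while reusing the same update rule.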
We implemented experiments on a PC with an RTX2070 GPU.

Ablation Studies and Comparisons
To adequately test the proposed 3DMGNet, we consider the following settings for the ablation experiment. (a) TL-Net: the TL-Net [9] method is a single-view 3D model generation method based on voxel data; we directly use the original network as the baseline in the experiment. To validate each module of the proposed method, we establish three groups of experiments to evaluate the effects of the 2DNet module, the multi-level feature fusion module, and the multi-view contour constraints.
As shown in Table 2, there are three experiments, including Voxel-Se-ResNet-101, Voxel-ResidualNet, and 3DMGNet. They are trained and tested on six categories of object models, i.e., plane, bench, sofa, monitor, speaker, and telephone. The IOU values of the generated 3D models are obtained by each improved method when the threshold values are 0.1, 0.3, 0.5, 0.7, and 0.9, respectively. To evaluate the effect of the generated 3D models intuitively, the 3D models generated by different methods are visualized. As shown in Figure 5, the first two rows are different aircraft, the third and fourth rows are different sofas, and the last two rows are different benches. In the generated 3D models, the probability value is stored in each voxel grid: yellow points represent a predicted probability close to 1, blue points a probability close to 0, and other colors intermediate probabilities.
It can be seen in Figure 5 that the 3D models generated by our method are closer to the ground truth (Figure 5b), while the compared methods generate many error points (e.g., the plane and bench in the second row and sixth row of Figure 5, respectively).
The results of the ablation experiment on different thresholds are shown in Table 3; we can obtain the following observations.
(a) The Voxel-Se-ResNet usually outperforms the baseline network TL-Net [9]. For plane, sofa, and bench, the IOU of Voxel-Se-ResNet-101 is at least 0.026 higher than that of TL-Net, as the improved 2DNet can extract better 2D image features through Se-ResNet-101, promoting the accuracy of the generated model.
(b) The Voxel-ResidualNet outperforms the baseline network (TL-Net [9]) and Voxel-Se-ResNet in most cases, as more robust 3D model features are obtained through the multi-level feature fusion and the attention mechanism. The 3D model generation ability of the 3D auxiliary network is improved, and the overall network performance is promoted to further constrain the image features. Thus, the accuracy of the image-based 3D model generation is improved.
(c) The 3DMGNet outperforms TL-Net [9], Voxel-Se-ResNet, and Voxel-ResidualNet. The main reason is that the multi-view contour constraint is added to the 3DMGNet, which shows that the reconstruction accuracy of the 3D auxiliary network is improved.
(d) When the threshold is 0.1, 0.3, 0.5, or 0.7, the IOU value of 3DMGNet does not change much. For example, the difference in IOU values of plane, sofa, and bench at different thresholds is less than 0.011. When the threshold is set to 0.3, our method usually achieves the best performance.
However, 3DMGNet acquires accuracy similar to Voxel-Se-ResNet-101 and Voxel-ResidualNet in some categories, e.g., plane. This shows that the improved reconstruction performance of the auxiliary network provides better supervision for the final 3D model generation, although there is always a certain spatial position offset between the generated 3D model and the original 3D model, which affects the accuracy of the 3D model generation.

Comparison with Other Methods
In order to verify the effectiveness and advantages of the proposed 3DMGNet, we compare 3DMGNet with state-of-the-art methods, i.e., 3D-R2N2 [23], OGN [36], DRC [37], Pix2Vox-F [38], and Pix2Vox++/F [39]. To unify the comparison conditions, we follow the same experiment settings as in Pix2Vox [38] and compare the IOU results when the threshold is 0.3. The comparison results are shown in Table 4, where the values of the compared methods come from Refs. [38,39].
From Table 4, we can obtain the following observations: (1) The 3DMGNet outperforms 3D-R2N2 [23], OGN [36], and DRC [37], obtaining the highest IOU value in every category except Watercraft. (2) The 3DMGNet achieves better generation accuracy than Pix2Vox-F [38] and Pix2Vox++/F [39] in most categories, i.e., Airplane, Bench, Chair, Display, Lamp, Rifle, Sofa, and Telephone. Pix2Vox-F and Pix2Vox++/F solve the single-view 3D model generation problem by spatial mapping, while 3DMGNet solves it from the perspective of multi-modal feature fusion. Although Pix2Vox++/F achieves the best average IOU, this is mainly because its IOU value for Speaker is obviously higher than that of 3DMGNet; 3DMGNet achieves the best performance in most categories of objects (at least nine categories).
In Table 4, although 3DMGNet does not achieve the best results in some categories, this does not diminish the effectiveness of the proposed method. The 3D model generation accuracy is severely affected by the limited image information, which is the main challenge of single-view 3D model generation. The multi-modal feature constraint strategy is adopted in our method to address this problem, but it is inevitable that the extracted features cannot be effective for all objects. Thus, the accuracy of 3DMGNet in some categories is slightly lower than that of the compared methods.

Multi-Category Joint Training
To verify the generalization ability of 3DMGNet, a multi-category joint training experiment is conducted to obtain the joint training model, i.e., Joint-3DMGNet. Models of multiple classes (plane, sofa, and bench) are randomly input to the 3DMGNet to train one model that can generate 3D models of multiple categories from an image. The experiment results are compared with the original TL-Net [9] and Direct-2D-3D (a directly trained model consisting only of 2DNet and the 3D deconvolution part). The structures of the 3DMGNet and Joint-3DMGNet are exactly the same. Considering that threshold values of 0.3, 0.5, and 0.7 obtain better performance of 3DMGNet in Table 4, we compare the results when the threshold value is respectively set as 0.3, 0.5, and 0.7 in this experiment.
We have the following observations in Table 5: (1) The Joint-3DMGNet outperforms the baseline network TL-Net [9], which proves that the combination of multi-modal feature fusion of 3D auxiliary network, multi-view contour constraint, and the improved image feature extraction network are effective for higher reconstruction performance. (2) The IOU of Joint-3DMGNet is at least 0.023 higher than Direct-2D-3D, mainly because the image lacks spatial information, and it is not easy to generate a 3D model with higher accuracy without an auxiliary network part. (3) For the multi-category joint training results, the best IOU results are achieved by Joint-3DMGNet in most cases, which illustrates that the Joint-3DMGNet can achieve better generalization performance than the compared methods.

Conclusions
We propose a 3D model generation method based on multi-modal data constraints and multi-level feature fusion. In this method, we design a self-supervised method to learn a robust 3D feature for the input voxel data. Specifically, the multi-level feature fusion is used to enhance the robustness of the extracted 3D features, and the attention mechanism is introduced to optimize the quality of the features. Therefore, the performance of the 3D model generation network is improved. To further improve the accuracy of the generated 3D model by the 3D auxiliary network, we also introduce a multi-view contour constraint to construct the constraint loss function. During the training stage, the similarity between the generated model and the original 3D model is constrained by the multi-view contour loss, which can effectively increase the accuracy of the generated 3D model.
In the future, we would like to further explore the application of feature fusion and attention mechanisms in 3D model generation, 3D shape synthesis, and other tasks. Besides, 3D model generation from the geometric and structural information, or from the global and local features of the 3D data, will be considered.
Author Contributions: E.W. and Y.L. designed the algorithm and experiment; L.X. and Y.L. analyzed the data and wrote the source code; Z.Z. helped with the study design, paper writing, and analysis of the results; X.H. helped with the data analysis and experimental analysis. All authors have read and agreed to the published version of the manuscript.