1 Introduction
Over the past several years, impressive strides have been made in the generative modeling of 3D objects. Much of this progress can be attributed to recent advances in artificial neural network research. Instead of the usual approach of representing 3D shapes with voxel occupancy vectors, promising previous work has turned to learning compact latent representations of such objects. Neural architectures that have been developed with this goal in mind include deep belief networks
[44], deep autoencoders
[48, 12, 31], and 3D convolutional networks [26, 47, 34, 5, 15]. The progress made so far with neural networks has also led to the creation of several large-scale 3D CAD model benchmarks, notably ModelNet [44] and ShapeNet [3]. However, despite this progress, one key weakness shared among all previous state-of-the-art approaches is that they have focused on learning a single (unified) vector representation of 3D shapes. These include recent and powerful models such as the autoencoder-like TL Network [12] and the probabilistic 3D Generative Adversarial Network (3D-GAN) [43], which shared its vector representation over multiple tasks. Other models [18, 17] further required additional supervision in the form of camera viewpoints, shape keypoints, and segmentations.
Trying to describe the input with only a single layer of latent variables may be too restrictive an assumption, hindering the expressiveness of the underlying generative model. A multi-level latent structure, on the other hand, would allow lower-level latent variables to focus on modeling features such as edges, while the upper levels learn to command those lower-level variables as to where to place the edges in order to form curves and shapes. This composition of latent (local) substructures would allow us to exploit the fact that most 3D shapes share similar structure. This is the essence of abstract representations (which can be viewed as a coarse-to-fine feature extraction process), which can be easily constructed in terms of less abstract ones
[2] – higher-level variables, or disentangled features, would model complex interactions of low-level patterns. Thus, to encourage and expedite the learning of hierarchical features, we incorporate this as a prior in our model through explicit architectural constraints. In this paper, motivated by the argument developed above and the promise shown in work such as that of [8], we show how to encourage a latent-variable generative model to learn a hierarchy of latent variables through the use of synaptic skip-connections. These skip-connections encourage each layer of latent variables to model exactly one level of abstraction of the data. To efficiently learn such a latent structure, we further exploit recent advances in approximate inference [21] to develop a variational learning procedure. Empirically, we show that the learned model, which we call the Variational Shape Learner, acquires rich representations of 3D shapes, leading to significantly improved performance across a multitude of 3D shape tasks.
In summary, the main contributions of this paper are as follows:

- We propose a novel latent-variable model, the Variational Shape Learner (VSL), which is capable of learning expressive features of 3D shapes.
- For both general 3D model building and single-image reconstruction, our model is fully unsupervised, requiring no extra human-generated information about segmentation, keypoints, or pose.
- Our model outperforms the current state-of-the-art in unsupervised (object) model classification while requiring significantly fewer learned feature extractors.
- In real-world image reconstruction, an extensive set of experiments shows that the proposed VSL surpasses the state-of-the-art in 8 of 10 classes, half of them by a large margin.
2 Related Work
3D object recognition is a well-studied problem in the computer vision literature. Early efforts [27, 22, 33] often combined simple image classification methods with hand-crafted shape descriptors, requiring intensive effort on the part of the human data annotator. However, ever since the ImageNet contest of 2012 [23], deep convolutional networks (ConvNets) [10, 24] have swept the vision industry, becoming nearly ubiquitous in countless applications. Research in learning probabilistic generative models has also benefited from the advances made by artificial neural networks. Generative Adversarial Networks (GANs), proposed in [13], and Variational Autoencoders (VAEs), proposed in [21, 32], are among the most popular and important frameworks to have emerged from these improvements in generative modeling. Successful adaptations of these frameworks range from natural language and speech processing [6, 35] to realistic image synthesis [14, 30, 28], yielding promising results. Nevertheless, very little work outside of [43, 12, 31] has focused on modeling 3D objects, where generative architectures can be used to learn probabilistic embeddings. The model proposed in this paper offers another step towards constructing powerful probabilistic generative models of 3D structures.
One study, amidst the rise of neural network-based approaches to 3D object recognition, most relevant to this paper is that of [44], which presented promising results and a useful benchmark for 3D model recognition: ModelNet. Following this key study, researchers have applied 3D ConvNets [26, 5, 41, 47], autoencoders [46, 48, 12, 31], and a variety of probabilistic neural generative models [43, 31] to the problem of 3D model recognition, with each study progressively advancing the state-of-the-art.
With respect to 3D object generation from 2D images, commonly used methods can be roughly grouped into two categories: 3D voxel prediction [44, 43, 12, 31, 5, 15] and mesh-based methods [11, 7]. The 3D-R2N2 model [5] represents a more recent approach to the task, training a recurrent neural network to predict 3D voxels from one or more 2D images. [31] also takes a recurrent network-based approach, but receives a depth image as input rather than ordinary 2D images. The learnable stereo system of [17] processes one or more camera views, along with camera pose information, to produce compelling 3D object samples. Many of the above methods require multiple images and/or additional human-provided information. Some approaches have attempted to minimize human involvement by developing weakly-supervised schemes, making use of image silhouettes to conduct 3D object reconstruction [47, 42]. Of the few unsupervised neural-based approaches that exist, the TL Network [12], which combines a convolutional autoencoder with an image regressor to encode a unified vector representation of a given 2D image, is quite important. However, one fundamental issue with the TL Network is its three-phase training procedure, necessary because jointly training the system components proves too difficult. The 3D-GAN [43] offers a way to train 3D object models probabilistically, employing an adversarial learning scheme. However, GANs are notoriously difficult to train [1], often due to ill-designed loss functions and the higher chance of zero gradients.
In contrast to this prior work, our approach, which is derived from an approximate inference view of learning, naturally allows for joint training of all model parameters. Furthermore, it makes use of a well-formulated loss function derived from a variational Bayesian perspective, circumventing the instability of adversarial learning while still producing higher-quality samples.
3 The Variational Shape Learner
In this section, we introduce our proposed model, the Variational Shape Learner (VSL), which builds on the ideas of the Neural Statistician [8] and the volumetric convolutional network [26], the parameters of which are learned under a variational inference scheme [21].
3.1 The Design Philosophy
It is well known that generative models learned through variational inference are excellent at reconstructing complex data but tend to produce blurry samples. This happens because of the uncertainty in the model’s predictions when we reconstruct the data from the latent space. As described above, previous approaches to 3D object modeling have focused on learning a single latent representation of the data. This simple latent structure, however, might hinder the underlying model’s ability to extract richer structure from the input distribution and thus lead to blurrier reconstructions.
To improve the quality of generated objects, we introduce a more complex internal variable structure, with the specific goal of encouraging the learning of a hierarchical arrangement of latent feature detectors. The motivation for a latent hierarchy comes from the observation that objects of the same category usually have similar geometric structure. As can be seen in Figure 2, we start from a global latent variable layer (depicted horizontally) that is hard-wired to a set of local latent variable layers (depicted vertically), each tasked with representing one level of feature abstraction. The skip-connections tie together the latent codes in a top-down directed fashion: local codes closer to the input will tend to represent lower-level features, while local codes farther from the input will tend towards higher-level features.
The global latent vector can be thought of as a large pool of command units that ensures each local code extracts information relative to its position in the hierarchy, forming an overall coherent structure. This explicit global-local form, and the way it constrains how information flows, lends itself to a straightforward parametrization of the generative model and furthermore ensures robustness, dramatically cutting down on overfitting. To ease training via stochastic backpropagation, the local codes are concatenated into a flattened structure when fed into the task-specific models, e.g., a shape classifier or a voxel reconstruction module. Ultimately, more realistic samples should be generated by an architecture supporting this kind of latent-variable design, since the local variable layers robustly encode hierarchical semantic cues in an unsupervised fashion.
3.2 Model Objective: Variational + Latent Loss
To learn the parameters of the VSL latent-variable model, we take a variational inference approach, where the goal is to learn a directed generative model $p_\theta(x \mid z)$, with generative parameters $\theta$, using a recognition model $q_\phi(z \mid x)$, with variational parameters $\phi$. The VSL’s learning objective contains a standard reconstruction loss term as well as a regularization penalty over the latent variables. Furthermore, the loss contains an additional term $\mathcal{L}_{latent}$ for the latent variables, which is particularly relevant and useful for the 3D model retrieval task of Section 4.5. This extra term is a simple penalty imposed on the difference between the learned features $\hat{z}$ of the image regressor and the true latent features $z = [z_0 \oplus z_1 \oplus \cdots \oplus z_n]$, where $\oplus$ denotes concatenation.
We assume a fixed, spherical unit Gaussian prior, $p(z_0) = \mathcal{N}(\mathbf{0}, \mathbf{I})$. The conditional distribution over each local latent code $z_i$ ($i \geq 2$) is defined as follows:

$$p_\theta(z_i \mid z_{i-1}, z_0) = \mathcal{N}\big(\mu_i(z_{i-1}, z_0),\, \sigma_i^2(z_{i-1}, z_0)\,\mathbf{I}\big) \quad (1)$$

where the first local code is simply:

$$p_\theta(z_1 \mid z_0) = \mathcal{N}\big(\mu_1(z_0),\, \sigma_1^2(z_0)\,\mathbf{I}\big) \quad (2)$$

Note that these conditionals are also spherical Gaussians and that $\theta$ contains the generative parameters.
Let the reconstructed voxel $\hat{x}$ be directly parametrized by occupancy probabilities. The loss for an input voxel $x$ of the VSL is then calculated as:

$$\mathcal{L} = \mathcal{L}_{rec} - \alpha\,\mathcal{L}_{KL} - \beta\,\mathcal{L}_{latent} \quad (4)$$

where each term in the equation above is defined as follows:

$$\mathcal{L}_{rec} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \quad (5)$$
$$\mathcal{L}_{KL} = D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big) \quad (6)$$
$$\mathcal{L}_{latent} = \big\| \hat{z} - [z_0 \oplus z_1 \oplus \cdots \oplus z_n] \big\|_2^2 \quad (7)$$

Note that $\alpha$ and $\beta$, which weigh the contributions of each term towards the overall cost, are tunable hyperparameters.
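To make the three terms concrete, the following NumPy sketch evaluates Equation 4 for a single voxel grid under stated assumptions (a Bernoulli reconstruction likelihood over occupancies and a diagonal Gaussian posterior); the array arguments are stand-ins for the actual VSL network outputs, not the paper's implementation:

```python
import numpy as np

def vsl_loss(x, x_prob, mu, log_var, z_hat, z, alpha=1.0, beta=0.1):
    """Sketch of the VSL objective (Eq. 4): reconstruction term, KL term
    against a unit Gaussian prior, and latent regression term.
    All inputs are placeholders for the actual network outputs."""
    eps = 1e-7
    # (5) Bernoulli log-likelihood of the voxel occupancies.
    l_rec = np.sum(x * np.log(x_prob + eps) + (1 - x) * np.log(1 - x_prob + eps))
    # (6) KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior, in closed form.
    l_kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))
    # (7) squared error between image-regressor features and latent codes.
    l_latent = np.sum((z_hat - z) ** 2)
    # Maximized objective: reconstruction minus weighted penalties.
    return l_rec - alpha * l_kl - beta * l_latent
```

The α and β defaults above are purely illustrative; Section 4.2 discusses how β is scheduled in practice.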
3.3 Encoder: 3DConvNet + SkipConnections
The global latent code $z_0$ is learned directly from the input voxel $x$ through three convolutional layers.
Each local latent code $z_i$ is conditioned on the global latent code $z_0$, the input voxel $x$, and the previous latent code $z_{i-1}$ (except for $z_1$, which has no previous latent code), using two fully-connected layers with 100 neurons each. These skip-connections between local codes help ease the learning of hierarchical features and force each local latent code to learn one level of abstraction.
The approximate posterior for a single voxel $x$ is then given by:

$$q_\phi(z \mid x) = q_\phi(z_0 \mid x)\, q_\phi(z_1 \mid z_0, x) \prod_{i=2}^{n} q_\phi(z_i \mid z_{i-1}, z_0, x) \quad (8)$$

where $\phi$, the variational parameters, are parametrized by neural networks, and $n$ denotes the number of local latent codes.
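A minimal sketch of this factorized, top-down sampling procedure follows; `gaussian_head` is a hypothetical stand-in for the convolutional and fully-connected blocks described above (its weights are random, not learned), and the default dimensionalities mirror the ModelNet40 configuration of Section 4.2:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_head(h, dim):
    """Hypothetical stand-in for a small fully-connected block that maps
    features h to the mean and log-variance of a diagonal Gaussian."""
    w_mu = rng.standard_normal((dim, h.size)) * 0.01
    w_lv = rng.standard_normal((dim, h.size)) * 0.01
    return w_mu @ h, w_lv @ h

def sample(mu, log_var):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.size)

def encode(x_feat, n_local=5, global_dim=20, local_dim=10):
    """Sketch of the factorized posterior (Eq. 8): the global code z_0 is
    drawn from the voxel features; each local code z_i is then conditioned
    on the features, z_0, and the previous local code (the skip-connections)."""
    z0 = sample(*gaussian_head(x_feat, global_dim))
    local_codes = []
    prev = np.zeros(local_dim)  # z_1 has no previous local code
    for _ in range(n_local):
        h = np.concatenate([x_feat, z0, prev])
        prev = sample(*gaussian_head(h, local_dim))
        local_codes.append(prev)
    return z0, local_codes
```

Downstream tasks would then consume the concatenation $[z_0 \oplus z_1 \oplus \cdots \oplus z_n]$, as described in Section 3.4.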
3.4 Decoder: 3DDeConvNet
After we learn the global and local latent codes $z_0, z_1, \ldots, z_n$, we concatenate them into a single vector $[z_0 \oplus z_1 \oplus \cdots \oplus z_n]$, as shown in Figure 2 in blue dashed lines.
A 3D deconvolutional neural network with dimensions symmetric to the encoder of Section 3.3 is used to decode the learned latent features into a voxel grid. An element-wise logistic sigmoid is applied to the output layer in order to convert the learned features into occupancy probabilities for each voxel cell.
3.5 Image Regressor: 2DConvNet
We use a standard 2D convolutional network to encode input RGB images into a feature space with the same dimension as the concatenation of the global and local latent codes. The network contains four convolutional layers; the last convolutional layer is flattened and fed into two fully-connected layers with 200 and 100 neurons, respectively. Unlike the encoder described in Section 3.3, we apply dropout [40] before the last fully-connected layer.
4 Experiments
To evaluate the quality of our proposed neural generative model for 3D shapes, we conduct several extensive experiments.
In Section 4.3, we investigate our model’s ability to generalize and synthesize through a shape interpolation experiment and a nearest-neighbors analysis of randomly generated samples from the VSL. Following this, in Section 4.4, we evaluate our model on the task of unsupervised shape classification by directly using the learned latent features on both the ModelNet10 and ModelNet40 datasets, and compare these results to previous supervised and unsupervised state-of-the-art methods. Next, we test our model’s ability to reconstruct real-world images in Section 4.5, comparing our results to 3D-R2N2 [5] and NRSfM [18]. Finally, we demonstrate the richness of the VSL’s learned semantic embeddings through vector arithmetic in Section 4.6, using the latent features trained on ModelNet40.
4.1 Datasets
ModelNet. There are two variants of the ModelNet dataset, ModelNet10 and ModelNet40, introduced in [44], with 10 and 40 target classes respectively. ModelNet10 contains 3D shapes which are pre-aligned with the same pose across all categories. In contrast, ModelNet40 (which includes the shapes found in ModelNet10) features a variety of poses. We voxelize both ModelNet10 and ModelNet40 at a fixed resolution. To test our model’s ability to handle 3D shapes of great variety and complexity, we use ModelNet40 for most of the experiments, especially those in Sections 4.3 and 4.6. Both ModelNet10 and ModelNet40 are used to conduct the shape classification experiments.
PASCAL 3D. The PASCAL 3D dataset is composed of the images from the PASCAL VOC 2012 dataset [9], augmented with 3D annotations using PASCAL 3D+ [45]. We voxelize the 3D CAD models at a fixed resolution and use the same training and testing splits of [18], which were also used in [5] to conduct real-world image reconstruction (on which the experiment in Section 4.5 is based). We use the bounding-box information provided in the dataset. Note that the only preprocessing we applied was image cropping and padding with 0-intensity pixels to create final samples of the fixed resolution required by our model.
4.2 Training Protocol
Training was the same across all experiments, with only minor task-dependent details. The architecture of the VSL experimented with in this paper consisted of 5 local latent codes, each made up of 10 variables for ModelNet40 and 5 for ModelNet10. For ModelNet40, the global latent code was set to a dimensionality of 20 variables, while for ModelNet10 it was set to 10 variables.
The hyperparameter $\alpha$ was held at the same fixed value across training on both ModelNet10 and ModelNet40. We optimize parameters by maximizing the objective defined in Equation 4 using the Adam adaptive learning rate procedure [20] with a fixed step size. For the experiments of Sections 4.3, 4.4, and 4.6, over 2500 epochs, parameter updates were calculated using minibatches of 200 samples on ModelNet40 and 100 samples on ModelNet10.
For the experiment in Section 4.5, we use 5 local latent codes (each with dimensionality of 5) and a global latent code of 20 variables for the jointly trained model. For the separately trained model, we use 3 local latent codes, each with dimensionality of 2, and a global latent code of dimensionality 5. Minibatches of 40 samples were used to compute gradients for the joint model, while 5 samples were used for the separately trained model. For both model variants, dropout [40] was used to control for overfitting, and early stopping was employed (resulting in only 150 epochs).
For Section 4.5, which involved image reconstruction and thus required the loss term $\mathcal{L}_{latent}$, instead of searching for an optimal value of the hyperparameter $\beta$ through cross-validation, we employed a “warming-up” schedule similar to that of [39]. “Warming-up” involves gradually increasing $\beta$ (on a log scale, as depicted in Figure 3), which controls the relative weighting of $\mathcal{L}_{latent}$ in Equation 4. The schedule is defined as follows,
$$\beta_t = \beta_{\min} \left( \frac{\beta_{\max}}{\beta_{\min}} \right)^{t/T}, \quad t = 0, 1, \ldots, T \quad (9)$$

where $t$ indexes the current epoch, $T$ is the length of the warming-up period, and $\beta_{\min}$ and $\beta_{\max}$ are the initial and final weights.
Figure 3 depicts, empirically, the benefits of employing a warmingup schedule over using a fixed, externally set coefficient for the term in our image reconstruction experiment. We remark that using a warmingup schedule plays an essential role in acquiring good performance on the image reconstruction task.
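The exact schedule constants are not reproduced here, but a log-scale warming-up consistent with the description (β growing geometrically from a small initial value to its final value over T epochs) can be sketched as follows; the values of `beta_min`, `beta_max`, and `total_epochs` are illustrative assumptions:

```python
def beta_schedule(epoch, total_epochs, beta_min=1e-4, beta_max=1.0):
    """Log-scale 'warming-up' for the latent-loss weight: beta increases
    geometrically from beta_min to beta_max over the first total_epochs
    epochs, then stays at beta_max. Constants are illustrative."""
    t = min(epoch, total_epochs)
    return beta_min * (beta_max / beta_min) ** (t / total_epochs)
```

Plotted on a log axis, this schedule is a straight line from `beta_min` up to `beta_max`, matching the "gradually increasing on a log scale" description.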
4.3 Shape Generation and Learning
[Figure 4: shape generation results and nearest-neighbor comparisons; generated categories include airplane, chair, toilet, bathtub, vase, desk, and sofa, with nearest neighbors shown for airplane, desk, sofa, and chair.]
To examine our model’s ability to generate high-resolution 3D shapes with realistic details, we design a task that involves shape generation and shape interpolation. We add Gaussian noise to the learned latent codes of test data taken from ModelNet40 and then use our model to generate “unseen” samples that are similar to the input voxel. In effect, we generate objects from our VSL model directly from vectors, without a reference object or image.
The results of our shape interpolation experiment, from both within-class and across-class perspectives, are presented in Figure 5. It can be observed that the proposed VSL transitions smoothly between two objects. Our results on shape generation are shown in Figure 4. Notably, in our visualizations, darker colors correspond to lower occupancy probability while lighter colors correspond to higher occupancy probability. We further compare to previous state-of-the-art results in shape generation, depicted in Figure 6.
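Mechanically, interpolation and noisy generation are both simple operations in the latent space; a sketch of the two (the decoder that maps codes back to voxels is omitted) might look like:

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=8):
    """Linearly interpolate between two latent codes z_a and z_b,
    producing a sequence of intermediate codes to be decoded into voxels."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * z_a + a * z_b for a in alphas]

def perturb(z, scale=0.1, rng=None):
    """Add Gaussian noise to a latent code to generate 'unseen' samples
    near a test shape, as in the shape-generation experiment.
    The noise scale is an illustrative assumption."""
    rng = rng or np.random.default_rng()
    return z + scale * rng.standard_normal(z.shape)
```

Within-class and across-class interpolation differ only in whether `z_a` and `z_b` come from shapes of the same category.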
4.4 Shape Classification
One way to test the expressiveness of our model is to conduct shape classification directly using the learned embeddings. We evaluate our learned features on the ModelNet dataset [44] by concatenating the global latent code with the local latent codes, creating a single feature vector $[z_0 \oplus z_1 \oplus \cdots \oplus z_n]$. We train a Support Vector Machine with an RBF kernel for classification using these “pretrained” embeddings.
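A hedged sketch of this pipeline: each shape's feature vector is built by concatenation, and an RBF kernel matrix is what an off-the-shelf SVM (e.g., `sklearn.svm.SVC` with `kernel='rbf'`) would consume. The `gamma` value here is illustrative:

```python
import numpy as np

def shape_feature(z0, local_codes):
    """Concatenate the global code with all local codes into the single
    'pretrained' embedding used for classification."""
    return np.concatenate([z0] + list(local_codes))

def rbf_kernel(X, Y, gamma=0.1):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2),
    as used by an RBF-kernel SVM."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))
```

For the ModelNet40 configuration (global code of 20, five local codes of 10), this yields a 70-dimensional feature per shape.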
| Supervision  | Method              | ModelNet10 | ModelNet40 |
|--------------|---------------------|------------|------------|
| Supervised   | 3D ShapeNets [44]   | 83.5%      | 77.3%      |
|              | DeepPano [37]       | 85.5%      | 77.6%      |
|              | Geometry Image [38] | 88.4%      | 83.9%      |
|              | VoxNet [26]         | 92.0%      | 83.0%      |
|              | PointNet [29]       | –          | 89.2%      |
|              | MVCNN [41]          | –          | 90.1%      |
|              | ORION [34]          | 93.8%      | –          |
| Unsupervised | SPH [19]            | 79.8%      | 68.2%      |
|              | LFD [4]             | 79.9%      | 75.5%      |
|              | TL Network [12]     | 74.4%      | –          |
|              | VConv-DAE [36]      | 80.5%      | 75.5%      |
|              | 3D-GAN [43]         | 91.0%      | 83.3%      |
|              | VSL (ours)          | 91.0%      | 84.5%      |
Table 1 shows the performance of previous state-of-the-art supervised and unsupervised methods in shape classification on both variants of the ModelNet dataset. Notably, the best unsupervised state-of-the-art results reported so far were from the 3D-GAN of [43], which used features drawn from 3 convolutional layers. This is a far larger feature space than that required by our model, which matches the 3D-GAN on ModelNet10 and surpasses it on ModelNet40 using only the compact concatenated latent code. The VSL also performs comparably to the supervised state-of-the-art, outperforming models such as 3D ShapeNets [44], DeepPano [37], and Geometry Image [38] by a large margin, and comes close to models such as VoxNet [26].
In order to visualize the learned feature embeddings, we employ t-SNE [25] to map our high-dimensional features onto a 2D plane. The visualization is shown in Figure 7.
4.5 Single Image 3D Model Retrieval
Real-world, single-image 3D model retrieval is another application of the proposed VSL. This is a challenging problem, forcing a model to deal with real-world 2D images under a variety of lighting conditions and resolutions. Furthermore, there are many instances of model occlusion as well as different color gradings.
To test our model on this application, we use the PASCAL 3D dataset [45] and the exact same training and testing splits from [18]. We compare our results with those reported for recent approaches, including the NRSfM [18] and 3D-R2N2 [5] models. Note that these baselines used the same experimental configuration as ours.
| Method                   | aero  | bike  | boat  | bus   | car   | chair | mbike | sofa  | train | tv    | mean  |
|--------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| NRSfM                    | 0.298 | 0.144 | 0.188 | 0.501 | 0.472 | 0.234 | 0.361 | 0.149 | 0.249 | 0.492 | 0.318 |
| 3D-R2N2 [LSTM-1]         | 0.472 | 0.330 | 0.466 | 0.677 | 0.579 | 0.203 | 0.474 | 0.251 | 0.518 | 0.438 | 0.456 |
| 3D-R2N2 [Res3D-GRU-3]    | 0.544 | 0.499 | 0.560 | 0.816 | 0.699 | 0.280 | 0.649 | 0.332 | 0.672 | 0.574 | 0.571 |
| VSL (jointly trained)    | 0.514 | 0.269 | 0.327 | 0.558 | 0.633 | 0.199 | 0.301 | 0.173 | 0.402 | 0.337 | 0.432 |
| VSL (separately trained) | 0.631 | 0.657 | 0.554 | 0.856 | 0.786 | 0.311 | 0.656 | 0.601 | 0.804 | 0.454 | 0.619 |
[Figure 8: qualitative reconstructions; columns show the input image, ground truth (GT), VSL, 3D-R2N2 [5], and NRSfM [18].]
For this task, we train our model in two different ways: 1) jointly on all categories, and 2) separately on each category. In Figure 8, we observe better reconstructions from the (separately trained) VSL when compared to previous work. Unlike the NRSfM [18], the VSL does not require any segmentation, pose information, or keypoints. In addition, the VSL is trained from scratch, whereas the 3D-R2N2 is pretrained on the ShapeNet dataset [3]. However, the jointly trained VSL did not outperform the 3D-R2N2, which is also jointly trained. This performance gap is due to the fact that the 3D-R2N2 is specifically designed for image reconstruction and employs a residual network [16] to help the model learn richer semantic features.
Quantitatively, we compare our VSL to the NRSfM [18] and to two versions of the 3D-R2N2 from [5], one with an LSTM structure and another with a deep residual network. Results (Intersection-over-Union) are shown in Table 2. Observe that our jointly trained model performs comparably to the 3D-R2N2 LSTM variant, while the separately trained version surpasses the 3D-R2N2 residual-network structure in 8 out of 10 categories, half of them by a wide margin. Note that our convolutional network components could be replaced with residual network components, an extension we leave as future work.
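For reference, the per-category Intersection-over-Union can be computed by thresholding the predicted occupancy probabilities against the binary ground truth; the 0.5 threshold below is an assumption for illustration, not a value taken from the experiments:

```python
import numpy as np

def voxel_iou(pred_prob, gt, threshold=0.5):
    """Voxel-wise Intersection-over-Union between a predicted occupancy
    grid (probabilities) and a binary ground-truth grid."""
    pred = pred_prob >= threshold
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Two empty grids count as a perfect match.
    return inter / union if union > 0 else 1.0
```

A per-category mean of these scores over the test set gives the entries of Table 2.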
4.6 Shape Arithmetic
Another way to explore the learned embeddings is to perform various vector operations in the latent space, much as was done in [43, 12]. We present some results of our shape arithmetic experiment in Figure 9. Different from previous results, all of our objects are sampled from model embeddings trained on the whole dataset of 40 classes. Furthermore, unlike the blurrier generations of [12], the VSL generates very interesting combinations of the input embeddings without the need for any matching to actual 3D shapes from the original dataset. The resultant objects appear to clearly embody the intuitive meaning of the vector operators.
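The arithmetic itself is ordinary vector addition and subtraction on the concatenated latent codes, with the result decoded back into occupancy probabilities; in this sketch, the tiny linear decoder is a hypothetical stand-in for the 3D deconvolutional decoder of Section 3.4:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def shape_arithmetic(z_a, z_b, z_c, decode=None):
    """Vector arithmetic in the learned embedding space (z_a - z_b + z_c);
    the resulting code is decoded into occupancy probabilities. `decode`
    is a stand-in for the VSL's 3D deconvolutional decoder."""
    z = z_a - z_b + z_c
    if decode is None:
        # Hypothetical linear decoder plus element-wise sigmoid, mirroring
        # the occupancy-probability output layer of Section 3.4.
        rng = np.random.default_rng(0)
        w = rng.standard_normal((2 * 2 * 2, z.size)) * 0.1
        decode = lambda v: sigmoid(w @ v).reshape(2, 2, 2)
    return z, decode(z)
```

In practice, `z_a`, `z_b`, and `z_c` would be latent codes inferred from three real shapes, and `decode` would be the trained decoder.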
5 Conclusion
In this paper, we proposed the Variational Shape Learner, a hierarchical latent-variable model for 3D shape modeling, learnable through variational inference. In particular, we demonstrated 3D shape generation results on a popular benchmark, the ModelNet dataset. We also used the learned embeddings of our model to obtain state-of-the-art results in unsupervised shape classification, and furthermore showed that we can generate unseen shapes through shape arithmetic. Future work will entail a more thorough investigation of the embeddings learned by our hierarchical latent-variable model, as well as the integration of better prior distributions into the framework.
References
 [1] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
 [2] Y. Bengio. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pages 1–37. Springer, 2013.
 [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

 [4] D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity based 3d model retrieval. In Computer Graphics Forum, volume 22, pages 223–232. Wiley Online Library, 2003.
 [5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3d object reconstruction. arXiv preprint arXiv:1604.00449, 2016.
 [6] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
 [7] A. Delaunoy, E. Prados, P. G. I. Piracés, J.-P. Pons, and P. Sturm. Minimizing the multi-view stereo reprojection error for triangular surface meshes. In BMVC 2008, British Machine Vision Conference, pages 1–10. BMVA, 2008.
 [8] H. Edwards and A. Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
 [9] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

 [10] K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1(2):119–130, 1988.
 [11] P. Gargallo, E. Prados, and P. Sturm. Minimizing the reprojection error in surface reconstruction from images. In Computer Vision, 2007. ICCV 2007. IEEE 11th International Conference on, pages 1–8. IEEE, 2007.
 [12] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. arXiv preprint arXiv:1603.08637, 2016.
 [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [14] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 [15] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. arXiv preprint arXiv:1704.00710, 2017.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [17] A. Kar, C. Häne, and J. Malik. Learning a multiview stereo machine. arXiv preprint arXiv:1708.05375, 2017.
 [18] A. Kar, S. Tulsiani, J. Carreira, and J. Malik. Categoryspecific object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1966–1974, 2015.
 [19] M. Kazhdan, T. Funkhouser, and S. Rusinkiewicz. Rotation invariant spherical harmonic representation of 3 d shape descriptors. In Symposium on geometry processing, volume 6, pages 156–164, 2003.
 [20] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [21] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [22] J. Knopp, M. Prasad, G. Willems, R. Timofte, and L. Van Gool. Hough transform and 3d surf for robust three dimensional classification. In European Conference on Computer Vision, pages 589–602. Springer, 2010.

 [23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [24] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

 [25] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 [26] D. Maturana and S. Scherer. VoxNet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
 [27] A. Patterson IV, P. Mordohai, and K. Daniilidis. Object detection from largescale 3d datasets using bottomup and topdown descriptors. In European Conference on Computer Vision, pages 553–566. Springer, 2008.
 [28] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. Variational autoencoder for deep learning of images, labels and captions. In Advances in Neural Information Processing Systems, pages 2352–2360, 2016.
 [29] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593, 2016.
 [30] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [31] D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems, pages 4996–5004, 2016.
 [32] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [33] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pages 3212–3217. IEEE, 2009.
 [34] N. Sedaghat, M. Zolfaghari, and T. Brox. Orientation-boosted voxel nets for 3d object recognition. arXiv preprint arXiv:1604.03351, 2016.

 [35] I. V. Serban, A. G. Ororbia II, J. Pineau, and A. Courville. Piecewise latent variables for neural variational text processing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 422–432, 2017.
 [36] A. Sharma, O. Grau, and M. Fritz. VConv-DAE: Deep volumetric shape learning without object labels. In Computer Vision–ECCV 2016 Workshops, pages 236–250. Springer, 2016.
 [37] B. Shi, S. Bai, Z. Zhou, and X. Bai. DeepPano: Deep panoramic representation for 3d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
 [38] A. Sinha, J. Bai, and K. Ramani. Deep learning 3d shape surfaces using geometry images. In European Conference on Computer Vision, pages 223–240. Springer, 2016.
 [39] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
 [40] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [41] H. Su, S. Maji, E. Kalogerakis, and E. LearnedMiller. Multiview convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 945–953, 2015.
 [42] O. Wiles and A. Zisserman. SilNet: Single- and multi-view reconstruction by learning from silhouettes. 2017.
 [43] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
 [44] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 [45] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2014.
 [46] J. Xie, Y. Fang, F. Zhu, and E. Wong. Deepshape: Deep learned shape descriptor for 3d shape matching and retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1275–1283, 2015.
 [47] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, pages 1696–1704, 2016.
 [48] Z. Zhu, X. Wang, S. Bai, C. Yao, and X. Bai. Deep learning representation using autoencoder for 3d shape retrieval. Neurocomputing, 204:41–50, 2016.