PointBLIP: Zero-Training Point Cloud Classification Network Based on BLIP-2 Model

Leveraging the open-world understanding capacity of large-scale visual-language pre-trained models has become an active research topic in point cloud classification. Recent approaches rely on transferable visual-language pre-trained models, classifying point clouds by projecting them into 2D images and evaluating consistency with textual prompts. These methods benefit from the robust open-world understanding capabilities of visual-language pre-trained models and require no additional training. However, they face several challenges, summarized as prompt ambiguity, image domain gap, view weight confusion, and feature deviation. In response to these challenges, we propose PointBLIP, a zero-training point cloud classification network based on the recently introduced BLIP-2 visual-language model. PointBLIP is adept at processing similarities between multiple images and multiple prompts. We introduce separate methods for zero-shot and few-shot point cloud classification, each of which compares multiple features to achieve effective classification. Simultaneously, we enhance the input data quality for both the image and text sides of PointBLIP. In point cloud zero-shot classification tasks, we outperform state-of-the-art methods on three benchmark datasets. For few-shot classification tasks, to the best of our knowledge, we present the first zero-training few-shot point cloud method, surpassing previous works under the same conditions and showcasing comparable performance to full-training methods.


Introduction
Point clouds are one of the most commonly used formats for 3D data, comprising a set of points sampled from a scene. In various 3D computer vision applications, point clouds serve as either the sole data source [1][2][3][4][5][6][7] or essential data [8,9] for understanding 3D scenes. Point cloud classification is a fundamental task in 3D scene understanding. At the same time, classifying point clouds in an open-world scenario or for unknown new categories is an active research topic [10,11]. Achieving this level of application requires a significant amount of manually annotated data for deployed 3D systems. Despite the increasing availability of point cloud data facilitated by the advancement of 3D scanning technologies, the volume of usable point cloud data remains significantly insufficient. In addition, annotating point cloud data is notably more challenging than annotating 2D image data due to its scattered and unordered nature [10], which makes it difficult to collect point cloud datasets for deep learning applications.
Visual-language pre-trained (VLP) models, learning from image-text pairs [12][13][14], have revolutionized 2D computer vision recognition over the last few years. Benefiting from existing large-scale pre-training data, these models exhibit exceptional understanding of the open world at the 2D image level [11,13,14]. Inspired by this, many downstream recognition tasks can be adapted to the VLP model, and this extension also applies to the domain of point cloud classification. Some recent works have explored how to transfer knowledge structures to point cloud classification tasks [11,15,16]. These transferred approaches utilizing VLP models follow a common process: (1) Encoding the projected point cloud images and the textual prompt separately, each as a single feature. (2) Aligning image-text pair features and determining the best-matching category by cosine similarity. Typically, the point cloud is projected into multi-view depth images, and all image features are aggregated into a single feature with predefined view weights.
We identify several limitations in those VLP-based methods: (1) Prompt Ambiguity: The choice of textual prompts for each category may involve predefined templates or generation from a large language model. However, the selection of specific textual prompts for classification relies on human expertise. (2) Image Domain Gap: VLP-based methods project point clouds as depth images. Nevertheless, depth images significantly differ from the image domain of the VLP model training dataset. (3) View Weight Confusion: Point clouds are often observed from multiple viewpoints during projection as images, and encoded image features are aggregated into a single feature through view weights. Predefined view weights require manual fine-tuning rather than automatic adjustment, making it challenging to determine which viewpoint is more beneficial for classification without prior knowledge. (4) Feature Deviation: Encoded features of objects with unique shapes may deviate from common shapes in the same category, as a single image encoder may not focus on their distinctive characteristics that distinguish them from other categories.
In this work, we introduce PointBLIP, a zero-training point cloud classification network. PointBLIP is built upon the BLIP-2 visual-language pre-trained model [14], enabling it to handle zero-shot and few-shot point cloud classification tasks. Unlike previous methods, PointBLIP proposes novel and improved approaches in input data construction and feature similarity comparison to address the aforementioned challenges.
To improve the quality of input data, we employ the ray tracing method to render point clouds into simulated images instead of projecting them into depth maps, thereby making the input images more closely aligned with the image domain of VLP models. Simultaneously, we utilize a large language model to generate shape-specific and more discriminative textual prompts. We treat multiple textual prompts for the same object category as a semantic description set for textual input, collectively enriching the descriptive features and eliminating the need for manual selection.
PointBLIP conducts comparisons between multiple image features and multiple text features. The various image features are derived from multiple projection perspectives of the point cloud and the encoding process of the BLIP-2 image encoder. Simultaneously, the multiple text features originate from the semantic description set on the textual side. Diverging from previous approaches that use predefined view weights to aggregate a single feature, we directly compare the similarities between multiple image features and multiple text features without any weight adjustments. We conceptualize the process of comparing multiple features as occurring microscopically in a feature grid. In order to measure reliable feature similarity from the feature grid, we employ distinct strategies tailored to zero-shot and few-shot classification tasks. The selection between these strategies depends on whether the object is compared with features that have an explicit semantic description.
PointBLIP improves on the baseline's performance and even surpasses state-of-the-art models. In zero-shot point cloud classification tasks, PointBLIP surpasses state-of-the-art methods by 1% to 3% on three benchmark datasets, including the synthetic dataset ModelNet and the real scanning dataset ScanObjectNN. In few-shot point cloud classification tasks, PointBLIP shows an enhancement of approximately 20% compared to other VLP-based methods under similar conditions. To the best of our knowledge, we propose the first zero-training few-shot point cloud classification network. It is worth noting that, as a zero-training model, PointBLIP remains comparable to full-training few-shot classification models on two standard datasets, ModelNet40-FS and ShapeNet70-FS.
Our contributions are summarized as follows:
• We introduce PointBLIP, a zero-training point cloud classification network based on a visual-language pre-trained model.
• We improve the input data quality for VLP-based methods. By employing ray tracing rendering, we address the gap between point cloud and image data. Additionally, we introduce a shape-specific textual prompt-generation method.
• We employ distinct feature-measurement strategies tailored to zero-shot and few-shot classification tasks. A Max-Max-Similarity strategy compares the similarities between images and prompts for zero-shot classification tasks, while a Max-Min-Similarity strategy compares the similarities between point cloud images and example images for few-shot classification tasks.
• Comprehensive experiments conducted on benchmark datasets demonstrate that PointBLIP surpasses state-of-the-art performance, outperforming previous work in both zero-shot and few-shot classification tasks. Moreover, in the few-shot classification task, PointBLIP remains comparable to full-training few-shot classification methods.

Vision-Language Pre-Training
The surge in interest in vision-language pre-training (VLP) has given rise to various model architectures specifically designed for multi-modal tasks. Diverse structures, such as dual-encoder [12,17], fusion-encoder [18], and encoder-decoder [19], have emerged to cater to various downstream objectives. Over time, pre-training objectives like image-text contrastive learning [12,20,21], image-text matching [21,22], and masked language modeling [13,23] have converged towards approaches trained on large-scale datasets. VLP models typically undergo end-to-end pre-training on extensive image-text pair datasets, with the "image-to-text" interface becoming a standardized input-output format. This standardization facilitates task-agnostic architectures for zero-shot transfer, eliminating the need for specialized outputs or dataset-specific customization. A widely adopted model, CLIP [12], harnesses VLP for cross-modal knowledge transfer, enabling natural language to comprehend visual concepts.

Zero-Shot Learning in Point Cloud
The objective of zero-shot learning is to identify objects not encountered during the training phase. While extensive attention has been given to 2D classification in zero-shot learning [24,25], few studies have explored its application in the 3D domain. Traditional methods for 3D open-world learning still necessitate 3D training data as a pretraining stage. Pioneering the exploration of zero-shot learning on point clouds, [26] partitions the 3D dataset into "seen" and "unseen" samples. It employs PointNet [27] to train on the former set and evaluates on the latter by measuring cosine similarities with category semantics. Building upon this foundation, [28] addresses the hubness problem [29] stemming from low-quality extracted 3D features, while [30] introduces a triplet loss for enhanced performance in transductive settings. This series of works trains zero-shot classifiers on "seen" 3D categories by maximizing inter-class divergence in the latent space, and subsequently tests on "unseen" ones.

Few-Shot Learning in Point Cloud
Few-shot learning (FSL) holds great promise in the realm of deep learning due to its ability to generalize well on new tasks despite having limited annotated data. In the customary N-way K-shot Q-query few-shot learning setting [31], the aim of FSL algorithms is to meta-train a predictor that can generalize to new unlabeled query examples from a few labeled support examples. Typically, existing FSL algorithms adopt a meta-learning framework and can be broadly categorized into metric-based and optimization-based methods. Metric-based approaches [32][33][34] center around learning an embedding space where similar sample pairs are brought closer together, or involve designing a metric function to assess the feature similarity between samples. Conversely, optimization-based methods [35][36][37] treat meta-learning as an optimization process.
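To make the N-way K-shot Q-query setting concrete, the following is a minimal sketch of how one evaluation episode could be drawn from a labeled dataset. The function name `sample_episode` and its arguments are hypothetical and chosen for illustration; they are not taken from any of the cited benchmarks.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, q_query, seed=None):
    """Sample one N-way K-shot Q-query episode (illustrative sketch).

    labels: list of class labels, one per sample (list index = sample id).
    Returns (support, query), each a list of (sample_index, class_label).
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    # Only classes with at least K + Q samples can fill an episode.
    eligible = sorted(c for c, idxs in by_class.items()
                      if len(idxs) >= k_shot + q_query)
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + q_query samples")

    support, query = [], []
    for c in rng.sample(eligible, n_way):
        picked = rng.sample(by_class[c], k_shot + q_query)
        support += [(i, c) for i in picked[:k_shot]]   # K labeled examples
        query += [(i, c) for i in picked[k_shot:]]     # Q unlabeled queries
    return support, query
```

Under this sketch, a 5-way 1-shot episode contains five labeled support samples, and the support and query sets never share a sample.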
Although most current FSL methods operate within the 2D image domain, their application in 3D perception is an under-explored area [10,38]. Three-dimensional few-shot learning poses greater challenges due to sparse information in point clouds and smaller-scale annotated data. Additionally, the diverse architectures and learning algorithms further complicate efficacy in the 3D domain. Recent efforts have combined 2D FSL with 3D backbone networks to benchmark few-shot point cloud classification. Refs. [10,38] present adapted 3D FSL point cloud classification methods derived from typical 2D FSL algorithms, implemented on various point cloud learning architectures.

VLP-Based Point Cloud Adapted Network
The use of VLP models for open-world point cloud recognition is an emerging research area still in its early stages. Current approaches often adopt the strategy of aligning image-text pair features and determining the best-matching category, predominantly relying on CLIP [12]. PointCLIP [15] pioneered this approach, aligning the depth maps of projected point clouds with object template sentences to identify the most similar category. However, projecting sparsely distributed points onto depth values results in scatter-style input images, which deviate significantly from real-world pre-training images in both appearance and semantics. Moreover, object template sentences are insufficient for fully describing 3D shapes and negatively impact pre-trained language-image alignment. To address the domain gap between 3D and images, CLIP2Point [16] enforces alignment between depth features and visual CLIP features through an image-depth contrastive learning method. Nevertheless, this process requires additional training and may risk overfitting to the image style of a particular dataset. In contrast, PointCLIPv2 [11] generates CLIP-preferred images through realistic projection, achieved by a series of enhanced operations, ensuring time efficiency and eliminating the need for additional pre-training. Additionally, PointCLIPv2 leverages a large-scale language model [39] to generate text with richer 3D semantics, enhancing the input for the text encoder. However, even though PointCLIPv2 has effectively enhanced the projection quality of point clouds, the resulting images are still evidently far from the real-world image domain.
These VLP-based point cloud adapted networks follow the common strategy of comparing image-text pair features. They encode all projected images of each point cloud into a single feature and the textual prompt of each category into a single feature. Without prior knowledge, this poses challenges in setting weights for projected viewpoints and selecting textual prompts.

Method
In Section 3.1, we revisit Bootstrapping Language-Image Pre-training with frozen unimodal models (BLIP-2) and present the essential components upon which PointBLIP relies. In Section 3.2, we delineate the methods employed for enhancing input data quality. In Section 3.3, we elucidate the procedures through which PointBLIP executes zero-shot point cloud classification. Finally, in Section 3.4, we elaborate on the methods employed by PointBLIP for conducting few-shot point cloud classification.

A Revisit of BLIP-2
BLIP-2 is a versatile and computationally efficient vision-language pre-training method that leverages off-the-shelf pre-trained vision and language models [14]. It comprises two stages: the vision-and-language representation stage and the vision-to-language generative stage. The former aligns image and text representations, while the latter generates linguistic interpretations for images. In this study, we primarily explore the cross-modal capabilities of the vision-and-language representation stage, which serves as a feature encoder.
We elucidate the feature-encoding process during inference in detail. For each single image, a fixed number of encoded features, rather than a single feature, are extracted from the BLIP-2 image encoder. The image encoder takes a set number of learnable query embeddings as input, which interact with image features from the frozen CLIP [12] image encoder through cross-attention layers. Subsequently, the image encoder produces a set of learned queries as the feature representation of an image. It is worth noting that, unlike traditional image encoders, BLIP-2 encodes an image into multiple features, with each feature representing a semantic aspect of the image. The image-encoding process can be formulated as
$$F = \mathrm{ImageEncoder}(I), \quad F \in \mathbb{R}^{n \times c}, \tag{1}$$
where $I$ is an input image, $n$ is the number of queries, and $c$ is the embedding dimension.
In contrast, the text encoder encodes all words into output embeddings but focuses solely on the [CLS] token as a single classification feature. The text-encoding process can be formulated as
$$W = \mathrm{TextEncoder}(T), \quad W \in \mathbb{R}^{1 \times c}, \tag{2}$$
where $T$ is a descriptive sentence for the corresponding image.
We construct the PointBLIP network using the image and text encoders in BLIP-2. The fundamental distinction between PointBLIP and previous work lies in encoding one image into multiple features, while previous work encodes one image into a single feature. We adopt this approach for the following reasons: (1) Enhancement in feature descriptive capabilities. Encoding into multiple features is advantageous for extracting more extensive and comprehensive information from an image. Multiple features imply that the encoder interprets the image from different semantic perspectives. (2) Advantageous for filtering out irrelevant information. Since the interpretations of multiple features differ, there is an opportunity to independently extract the features of interest while filtering out irrelevant ones. In contrast, the traditional image-encoding process encodes all image information, including noise, into a single feature.

Prompting for Image and Text
To address the domain gap between point clouds and images and to generate meaningful textual prompts, we introduce two novel approaches to constructing input images and textual prompts for our method.

Ray Tracing for Point Cloud
Despite the existence of methods to improve image quality [11], projecting point clouds as depth maps still leaves a domain gap between point cloud images and VLP model training images. Since VLP model training data predominantly originate in the real world, we contend that transforming point clouds into stereoscopic, clearly outlined 2D shapes is necessary. Therefore, we introduce the ray tracing method to render point clouds into simulated images.
In this process, each point in the point cloud is represented as a white sphere with a radius of r, and the surface of the sphere undergoes diffuse reflection of rays. We use parallel and inclined light sources to illuminate these spheres. To enhance the clarity of the complete outline, rays undergo a finite number of diffuse reflections on the spheres. For each point cloud object, we generate rendered images from four different perspectives around the object to obtain a comprehensive view. A comparison between the projection method and our rendering method is shown in Figure 1.

Comparative Textual Prompts
Taking inspiration from PointCLIPv2 [11], we utilize the large language model GPT-3 [39] to generate 3D-specific text with category-wise shape characteristics as textual input. Since original point clouds lack texture information, we argue that textual input should distinguish different categories based on shape. We apply two rules when composing commands to generate distinctive descriptive sentences: (1) Specify all categories when providing commands to GPT-3 as input. (2) Request GPT-3 to offer the most distinctive shape description. An example of generating a textual prompt for the airplane category in the ModelNet dataset is as follows:
Question: The following object categories: airplane, bathtub, bed... (list all category names in ModelNet). Describe the shape differences between the airplane and other categories in one short sentence.
GPT-3 Answering: The airplane stands out with its elongated, winged structure and tail, distinctly different from the predominantly static and boxy forms of the other categories.
In our work, we generate a set of descriptive sentences for each category as textual prompts, and all sentences are used for the classification of that category. Using "CLASS" as a placeholder for the name of the target category, we adopt the following three sentence-generation templates:
(1) CLASS.
(2) Answering from GPT-3: The following object categories: ... (list all category names). Describe the shape differences between the CLASS and other categories in one short sentence.
(3) Answering from GPT-3: The following object categories: ... (list all category names). Use one sentence to describe. In what aspects does CLASS look different from other categories in terms of shape?
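The three templates can be assembled programmatically before they are sent to a language model. The sketch below only builds the strings; the function name `build_prompt_requests` is hypothetical, and the call to the language model itself is deliberately left out because the text does not specify how GPT-3 was queried.

```python
def build_prompt_requests(class_name, all_classes):
    """Build the three textual inputs for one category (illustrative sketch).

    The first entry is used directly as a prompt (template 1). The other two
    are questions whose answers from the language model become the prompts
    (templates 2 and 3).
    """
    listing = ", ".join(all_classes)
    header = f"The following object categories: {listing}."
    return [
        class_name,
        f"{header} Describe the shape differences between the {class_name} "
        "and other categories in one short sentence.",
        f"{header} Use one sentence to describe. In what aspects does "
        f"{class_name} look different from other categories in terms of shape?",
    ]


# Example with a shortened, made-up category list.
requests = build_prompt_requests("airplane", ["airplane", "bathtub", "bed"])
```

Listing every category name in the question follows rule (1), and asking for shape differences follows rule (2).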

Zero-Shot Point Cloud Classification
The zero-shot point cloud classification process is illustrated in Figure 2. Following the data-prompting methods detailed in Section 3.2, we generate simulated point cloud images formed from V observation perspectives and textual prompts for L target categories.
The BLIP-2 image encoder produces n learned queries for each perspective image, and the image feature set $\{F_i\}_{i=1}^{V}$ (where $F_i \in \mathbb{R}^{n \times c}$) is extracted from one point cloud, where c is the embedding dimension. For the textual branch, assuming each category contains q textual prompts, the text feature set $\{W_j\}_{j=1}^{L}$ (where $W_j \in \mathbb{R}^{q \times c}$) is extracted from the textual prompts for all categories. Our objective is to determine the most likely category for the source point cloud. Previous methods typically compare the cosine similarity between single, aggregated features of images and text. For example, PointCLIPv2 calculates the final zero-shot classification logits by weighted summation of the multi-view image features into a single feature, formulated as
$$\mathrm{logits} = \omega \, f \, W_t^{\top}, \tag{3}$$
where $f \in \mathbb{R}^{V \times c}$ stacks the single feature of each view, $W_t \in \mathbb{R}^{L \times c}$ stacks the single text feature of each category, and $\omega \in \mathbb{R}^{1 \times V}$ represents the view weights. However, our approach contains an additional dimension for feature comparison. $\{F_i\}_{i=1}^{V}$ and $\{W_j\}_{j=1}^{L}$ cannot be directly used to calculate similarity following Equation (3), and $\omega$ may cause view weight confusion without prior knowledge.
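For contrast with the approach proposed here, the weighted-view baseline described above can be sketched in a few lines of NumPy. This is an illustration of the general idea (one feature per view, aggregated with predefined weights, then compared with one text feature per category), not a reproduction of the PointCLIPv2 code; the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def weighted_view_logits(view_feats, text_feats, view_weights):
    """Baseline sketch: aggregate per-view features with fixed weights.

    view_feats:   (V, c) one feature per projected view
    text_feats:   (L, c) one feature per category
    view_weights: (V,)   predefined view weights
    Returns (L,) cosine similarities between the aggregated image feature
    and each category's text feature.
    """
    view_feats = l2_normalize(np.asarray(view_feats, dtype=float))
    text_feats = l2_normalize(np.asarray(text_feats, dtype=float))
    weights = np.asarray(view_weights, dtype=float)
    aggregated = l2_normalize(weights @ view_feats)   # (c,)
    return text_feats @ aggregated                    # (L,)


# Two views that each match a different category: the prediction flips
# depending on which view is weighted, illustrating view weight confusion.
views = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
front_heavy = weighted_view_logits(views, texts, [1.0, 0.0])
side_heavy = weighted_view_logits(views, texts, [0.0, 1.0])
```

In this toy case the predicted category depends entirely on the chosen weights, which is the sensitivity that motivates dropping view weights altogether.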
We take two steps to address this issue. First, we establish a minimal unit, called a feature grid, between the image feature set $\{F_i\}_{i=1}^{V}$ and the text feature set $\{W_j\}_{j=1}^{L}$. For the zero-shot classification task, we define the feature grid as a similarity matrix holding the cosine similarities between the image features from one perspective and all text features of one category, formulated as
$$g_{i,j} = \Big[\cos\big(F_i^{(a)}, W_j^{(b)}\big)\Big]_{a=1,\dots,n;\; b=1,\dots,q} \in \mathbb{R}^{n \times q}, \tag{4}$$
where $F_i^{(a)}$ is the a-th feature of the i-th view and $W_j^{(b)}$ is the feature of the b-th prompt of the j-th category. For an L-category classification task, each point cloud generates V × L feature grids. We employ a strategy called Max-Max-Similarity to calculate a similarity from each feature grid. The process of the Max-Max-Similarity strategy is illustrated in Figure 3a. Max-Max-Similarity takes the maximum over both the rows and the columns of the feature grid matrix, and this value is treated as the basis for the next classification step. The aim of Max-Max-Similarity is to provide the maximum similarity level between a simulated point cloud image and a specific category. Second, we form a larger similarity matrix G from the similarity values of the feature grids to obtain the most relevant category, which can be formulated as
$$G = \Big[\max_{a,b}\, g_{i,j}^{(a,b)}\Big]_{i=1,\dots,V;\; j=1,\dots,L} \in \mathbb{R}^{V \times L}. \tag{5}$$
We search for the maximum value in matrix G and take the category corresponding to this maximum value as the classification result, formulated as
$$\hat{y} = \mathrm{SCM}(G), \tag{6}$$
where SCM(·) represents the function searching for the category index of the maximum value in the matrix.
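The Max-Max-Similarity procedure described in this section can be sketched with NumPy as follows. The function name `zero_shot_classify` and the array layout are assumptions made for illustration; the features would come from the BLIP-2 image and text encoders, which are not called here.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def zero_shot_classify(image_feats, text_feats):
    """Max-Max-Similarity sketch.

    image_feats: (V, n, c) n query features for each of V rendered views
    text_feats:  (L, q, c) q prompt features for each of L categories
    Returns (category_index, G), where G has shape (V, L).
    """
    F = l2_normalize(np.asarray(image_feats, dtype=float))
    W = l2_normalize(np.asarray(text_feats, dtype=float))

    # grids[i, j] is the (n, q) feature grid of cosine similarities
    # between view i and category j.
    grids = np.einsum("vnc,lqc->vlnq", F, W)

    # Max over rows and columns of every feature grid.
    G = grids.max(axis=(2, 3))

    # SCM: category (column) index of the largest entry of G.
    _, category = np.unravel_index(np.argmax(G), G.shape)
    return int(category), G
```

Because the maximum is taken over all views and all prompts, no view weights or manual prompt selection are involved: a single well-matched view and prompt pair is enough to decide the category.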

Few-Shot Point Cloud Classification
The process of few-shot point cloud classification is illustrated in Figure 4. In this scenario, a limited number of point clouds from "unseen" categories are labeled as a reference, and our method aims to recognize new, unlabeled query point clouds under this few-shot condition. The labeled example point clouds constitute the support set $S = \{P_i\}_{i=1}^{N \times K}$, encompassing N classes with K examples for each class, where $P_i$ represents an example point cloud. Our objective is to identify unlabeled point clouds based on the support set S.
For an unlabeled query point cloud, we generate simulated images from V observation perspectives of this point cloud and extract image features using the BLIP-2 image encoder as $\{F_i\}_{i=1}^{V}$ (where $F_i \in \mathbb{R}^{n \times c}$). Simultaneously, example images generated from the support set S are encoded into $\{Z_j\}_{j=1}^{N \times K \times V}$ (where $Z_j \in \mathbb{R}^{n \times c}$) following the same process. In the traditional few-shot classification procedure, we need to calculate the feature similarity between query images and example images for matching. However, the BLIP-2 image-encoding process introduces a challenge. Since each query and example image produces n features instead of a single feature, we need to measure a single similarity between two sets of image features. Some image features may describe irrelevant information, such as background and texture, which is not suitable for comparison. There is no explicit way to determine which feature represents category-discriminative characteristics and is beneficial for comparison. To address this issue, we also establish a feature grid to compare multiple image features. In the few-shot classification task, the construction of the feature grid and the measurement strategy differ from the zero-shot method in Section 3.3. This process is based on the following assumption: if two objects belong to the same category, the similarity of their most challenging-to-match feature will still be higher than for other categories.
Specifically, we define this feature grid as a similarity matrix holding the cosine similarities between query and example image features, formulated as
$$g_{i,j} = \Big[\cos\big(F_i^{(a)}, Z_j^{(b)}\big)\Big]_{a=1,\dots,n;\; b=1,\dots,n} \in \mathbb{R}^{n \times n}, \tag{7}$$
where $F_i^{(a)}$ is the a-th feature of the i-th query image and $Z_j^{(b)}$ is the b-th feature of the j-th example image. For each query point cloud in the classification process, V × N × K × V feature grids can be constructed. We calculate a similarity from each feature grid, representing the matching degree between the query image and the example image. We employ a Max-Min-Similarity strategy to calculate this similarity. The process of the Max-Min-Similarity strategy is illustrated in Figure 3b. We calculate the maximum value for each row in the feature grid, representing the maximum matching level between each query image feature and the example image features. Then, we compute the minimum value from this collection of maximum values, representing the maximum similarity level of the most challenging-to-match feature. We treat this value as the similarity between the query image and the example image.
Next, we utilize the similarities from the feature grids to form a matrix G reflecting the similarities between all query images and all example images. The whole process can be formulated as
$$G = \Big[\min_{a}\, \max_{b}\, g_{i,j}^{(a,b)}\Big]_{i=1,\dots,V;\; j=1,\dots,N \times K \times V} \in \mathbb{R}^{V \times (N \times K \times V)}. \tag{8}$$
Finally, we search for the maximum value in matrix G and take the category corresponding to this maximum value as the classification result, which can be formulated as
$$\hat{y} = \mathrm{SCM}(G), \tag{9}$$
where SCM(·) represents the function searching for the category index of the maximum value in G.
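The Max-Min-Similarity procedure can be sketched in the same way as the zero-shot case. The function name `few_shot_classify`, the flat layout of the example images, and the `support_labels` argument are assumptions made for illustration; the features would again come from the BLIP-2 image encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def few_shot_classify(query_feats, support_feats, support_labels):
    """Max-Min-Similarity sketch.

    query_feats:    (V, n, c) features of the V query images
    support_feats:  (M, n, c) features of the M = N*K*V example images
    support_labels: length-M sequence, class label of each example image
    Returns (predicted_label, G), where G has shape (V, M).
    """
    F = l2_normalize(np.asarray(query_feats, dtype=float))
    Z = l2_normalize(np.asarray(support_feats, dtype=float))

    # grids[i, j] is the (n, n) feature grid between query image i
    # and example image j.
    grids = np.einsum("vnc,mkc->vmnk", F, Z)

    # Row maximum: best match of each query feature among example features.
    row_max = grids.max(axis=3)          # (V, M, n)
    # Minimum over rows: the hardest-to-match query feature.
    G = row_max.min(axis=2)              # (V, M)

    # SCM: label of the example image holding the largest entry of G.
    _, best = np.unravel_index(np.argmax(G), G.shape)
    return support_labels[int(best)], G


# Toy case: the first example matches only one query feature, the second
# matches both, so only the second survives the minimum.
query = [[[1.0, 0.0], [0.0, 1.0]]]
examples = [[[1.0, 0.0], [1.0, 0.0]],
            [[0.0, 1.0], [1.0, 0.0]]]
label, G = few_shot_classify(query, examples, [0, 1])
```

In the toy case a plain maximum over the grid would tie both examples at 1, whereas the minimum over row maxima penalizes the example that leaves one query feature unmatched.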

Results
In this section, we first describe the implementation details of PointBLIP in Section 4.1 and the evaluation datasets in Section 4.2. Then, we present the performance of PointBLIP on zero-shot classification in Section 4.3 and few-shot classification in Section 4.4. Finally, we conduct an ablation experiment in Section 4.5.

Implementation Details
We utilize Mitsuba 3 software [40] for rendering point clouds into simulated images. Each point in the point cloud is represented as a sphere with a radius r, and the surface of the sphere is set as a white and diffuse surface. The value of r is determined based on the specific dataset. We render the point cloud from four views around an object, creating images with a resolution of 224 × 224. A directional light source is added for each perspective, and the light rays undergo three reflections on the surface of the sphere. Figure 1 shows some instances from different categories.

Evaluation Dataset
We evaluate the performance of PointBLIP on several widely used benchmark datasets, including synthetic and real scan datasets.
ModelNet dataset. ModelNet [41] is a large-scale 3D CAD dataset containing 12,311 CAD models from 40 categories. ModelNet includes two subsets, ModelNet10 and ModelNet40, for classification tasks. ModelNet10 contains 4899 CAD models from 10 categories, with 3991 for training and 908 for testing. ModelNet40 contains 12,311 CAD models from 40 categories, with 9843 for training and 2468 for testing. We only use its test set data since PointBLIP is a zero-training network.
ScanObjectNN dataset. ScanObjectNN [42] is a real-world dataset containing 2902 samples of point cloud data from 15 categories. Unlike clean CAD models in ModelNet, objects in ScanObjectNN are partially presented and attached with backgrounds. Thus, it is more challenging than ModelNet. We test PointBLIP on ScanObjectNN under three data splits: S-OBJ_ONLY includes only ground truth segmented objects extracted from the scene; S-OBJ_BG includes objects attached with backgrounds; S-PB_T50_RS, the hardest split, additionally applies rotation, scaling, perturbation, and shifting of the bounding box.
ModelNet40-FS dataset. ModelNet40-FS [38] is a new split of ModelNet40 [41], containing 30 training classes with 9204 examples and 10 disjoint testing classes with 3104 examples. This splitting of the raw dataset according to categories is done for few-shot classification evaluation.
ShapeNet70-FS dataset. ShapeNet70-FS [38] is adapted from ShapeNetCore and is larger than ModelNet40-FS, totaling 30,073 examples, with 50 classes having 21,722 samples for training and 20 classes with 8351 samples for testing. ShapeNet70-FS is a benchmark dataset for few-shot classification evaluation.

Zero-Shot Point Cloud Classification
We evaluate PointBLIP on three widely used benchmarks for zero-shot classification: the synthetic datasets ModelNet10 and ModelNet40 and the real scanning dataset ScanObjectNN. Two splits (ModelNet10, ModelNet40) in ModelNet and three splits (S-OBJ_ONLY, S-OBJ_BG, and S-PB_T50_RS) in ScanObjectNN are tested. Following the zero-shot principle, we directly test the classification performance on the full test set without learning from the training set. We render point clouds following the method in Section 3.2. In the ModelNet datasets, we set the value of r as the average minimum distance between points. In the ScanObjectNN dataset, we set the value of r as 2.5 times the average minimum distance between points. This choice of r is determined by the characteristics of the data distribution. The point distribution in ScanObjectNN is more sparse, while the point distribution in ModelNet is more balanced. Additionally, we generate textual prompts for target categories following the method outlined in Section 3.2. Each category includes three textual prompts.
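The sphere radius rule can be sketched as follows, interpreting "average minimum distance between points" as the mean distance from each point to its nearest neighbor; this interpretation, the function names, and the `scale` parameter are assumptions made for illustration.

```python
import numpy as np


def mean_nearest_neighbor_distance(points):
    """Mean distance from each point to its nearest other point.

    points: (P, 3) array of xyz coordinates.
    Brute-force O(P^2) memory; a KD-tree would be preferable for large clouds.
    """
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)       # ignore distance to itself
    return float(dist.min(axis=1).mean())


def sphere_radius(points, scale=1.0):
    """Radius r for rendering; scale=1.0 and scale=2.5 mirror the two
    dataset-specific settings described in the text."""
    return scale * mean_nearest_neighbor_distance(points)


# Toy cloud on a line: nearest-neighbor distances are 1, 1 and 2.
toy = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
r_dense = sphere_radius(toy)              # mean of (1, 1, 2)
r_sparse = sphere_radius(toy, scale=2.5)
```

Tying r to the point spacing keeps neighboring spheres touching or overlapping, so sparser clouds need a larger multiplier to render as a closed surface.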
We compare our method with several recent zero-shot point cloud classification methods, and the results are presented in Table 1. We use overall classification accuracy (%) as the experimental metric. PointCLIPv2 is currently the state-of-the-art method. We outperform PointCLIPv2 by 0.88%, 2.03%, 1.03%, 3.01%, and 3.54% in classification accuracy on the ModelNet10, ModelNet40, S-OBJ_ONLY, S-OBJ_BG, and S-PB_T50_RS datasets, respectively.

Few-Shot Point Cloud Classification
We first compare the zero-training PointBLIP with several full-training methods. Few-shot classification on 3D data remains relatively under-explored, whereas few-shot classification methods for 2D image tasks are well established and diverse [38]. Consequently, we assess the performance of point cloud adaptations of the current state-of-the-art methods in 2D image few-shot classification.
Following the experimental procedures outlined in [10], we substitute the backbone networks of these 2D few-shot classification methods with the mainstream DGCNN network [43] designed for processing point clouds. Subsequently, we assess their classification performance in comparison with PointBLIP on the ModelNet40-FS and ShapeNet70-FS datasets. We compare the performance of these methods under 5-way 1-shot and 5-way 5-shot settings (5-way denotes 5 classes in both the meta-training and meta-testing stages; 1-shot and 5-shot denote the number of labeled samples per class), and report the mean classification results over 700 episodes with 95% confidence intervals. The experimental results are shown in Table 2.
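A mean accuracy with a 95% confidence interval over episodes can be computed as in the sketch below. It assumes the common normal approximation (half-width of 1.96 standard errors); the text does not state which interval estimator was used, and the function name is hypothetical.

```python
import math


def mean_and_ci95(accuracies):
    """Mean episode accuracy and 95% confidence half-width.

    Uses the normal approximation: half_width = 1.96 * s / sqrt(n),
    where s is the sample standard deviation over episodes.
    """
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two episodes")
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    half_width = 1.96 * math.sqrt(variance / n)
    return mean, half_width


# Made-up per-episode accuracies, for illustration only.
episode_acc = [0.5, 0.7] * 350            # 700 episodes
mean_acc, half_width = mean_and_ci95(episode_acc)
```

The result would typically be reported as the mean plus or minus the half-width.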
As indicated in Table 2, our method consistently outperforms other approaches in the 5-way 1-shot setting on both the ModelNet40-FS and ShapeNet70-FS datasets. Notably, PointBLIP is zero-training, while the other methods undergo full training; despite this, PointBLIP achieves superior performance to these fully trained methods in this setting. However, in the 5-way 5-shot setting, although PointBLIP improves over its 5-way 1-shot results, its performance is relatively weaker than that of the other methods.
To the best of our knowledge, we are the first to introduce a zero-training point cloud few-shot classification network. Given the absence of similar work to serve as a reference, we conduct a comparative analysis with the most relevant methods, PointCLIP [15] and PointCLIPv2 [11], under identical conditions. PointCLIP and PointCLIPv2 tackle few-shot classification challenges by incorporating a trainable inter-view adapter, aiming to fine-tune the original output features after pre-training. To ensure a fair comparison, we eliminate the inter-view adapter module, ensuring that all methods are evaluated in an untrained state.
Following the experimental procedures outlined in PointCLIP and PointCLIPv2, we assess the K-shot classification performance on the ModelNet40 and ScanObjectNN (S-PB_T50_RS) datasets, where K ∈ {1, 2, 4, 8, 16}. As shown in Figure 5, PointBLIP clearly outperforms PointCLIP and PointCLIPv2 under zero-training conditions. On the ModelNet40 dataset, PointBLIP achieves an average increase of over 15% in classification accuracy across the different values of K. On the real-scan dataset ScanObjectNN, where the data are more complex than synthetic data, PointBLIP performs worse than on ModelNet40. However, it still maintains an average classification accuracy advantage of nearly 10% over PointCLIP and PointCLIPv2.
To assess the impact of our prompting approach on data quality, we perform an ablation study focusing on the prompting process of image and text generation. For image input, we substitute our ray tracing rendering process with the realistic projection method used in PointCLIPv2 [11] as a reference. The realistic projection method involves additional processes, such as densification and smoothing, during point cloud projection, resulting in depth maps from ten perspectives that already capture the object's outline. For text input, we replace our textual prompts with category names only.
We first conduct ablation experiments on zero-shot classification, involving both image and text generation. The evaluations are conducted on the ModelNet40 dataset, and the experimental results are presented in Table 3.
Table 3. Ablation study on ModelNet40 zero-shot classification (%) with variations in ray tracing rendering and textual prompts. "×" indicates the substitution with the realistic projection for image generation or the use of category names for textual prompts. "✓" signifies the inclusion of our data prompting.
As presented in Table 3, applying our data-prompting method on both the image and text sides raises classification accuracy from 52.84% to 66.25%, an improvement of 13.41%. However, when the prompting method is applied solely on the image side, the accuracy increases by a modest 4.13%. Deploying only the prompting method on the text side leaves the accuracy at essentially the same level, but a notable increase of almost 10% in accuracy is observed after incorporating image data prompting.
Next, we conduct ablation experiments on few-shot classification, which only involves image generation. We evaluate on the ModelNet40-FS dataset under the 5-way 1-shot setting, and the corresponding experimental results are presented in Table 4. Our image-prompting method improves few-shot classification accuracy by nearly 2.5%.

Measurement Strategy
To evaluate the impact of the feature grid-measurement strategy, we conduct an ablation study on both zero-shot and few-shot classification. As a reference, we replace the original strategy with one that calculates the average value in the feature grid as the final similarity. We conduct separate evaluations to assess the effects of the Max-Max-Similarity strategy in zero-shot classification and the Max-Min-Similarity strategy in few-shot classification.
We conduct additional zero-shot classification tests on all benchmark datasets mentioned in Section 4.3, and the corresponding experimental results are outlined in Table 5. Notably, Max-Max-Similarity consistently outperforms average similarity across all benchmark datasets, improving zero-shot classification accuracy by 3% to 6%.
We further validate few-shot classification on the benchmark datasets discussed in Section 4.4. The evaluations are performed under the 5-way 5-shot setting, and the results, along with 95% confidence intervals, are presented in Table 6. Max-Min-Similarity surpasses average similarity in few-shot classification, with a more pronounced gain on the ShapeNet70-FS dataset.

Discussion
The experimental results on benchmarks presented in Section 4 demonstrate that our approach achieves state-of-the-art performance in point cloud classification. Despite surpassing prior work, several issues merit discussion.

Backbone Network Differences
In comparison to closely related VLP-based methods, our backbone VLP network differs. Most current relevant works employ CLIP [12] as the backbone network, while we utilize BLIP-2 [14]. This raises the question of whether the improved performance of PointBLIP is attributable to a stronger feature-learning capability of the VLP backbone model. To investigate this, we refer to the experiments in Table 3. In the scenario presented in the second row of Table 3, we use the realistic projection method for image generation and generate textual prompts using GPT-3, which matches PointCLIPv2 [11] except for differences in the backbone network and feature-measurement strategy. In this scenario, PointBLIP achieves a zero-shot classification accuracy of 52.76%, while PointCLIPv2 achieves a higher classification accuracy of 64.22%. Here the feature-extraction capability of the base model is a determining factor for classification performance, yet BLIP-2 performs worse than CLIP. We therefore argue that the performance improvement of PointBLIP does not rely on the feature-extraction capability of the base VLP model.

Feature Grid Measurement
In both zero-shot and few-shot classification tasks, we establish a feature grid to measure feature similarity. The measurement strategies for the feature grid in different tasks are configured based on the comparison targets. In zero-shot classification, we compare point cloud images with textual prompts. Because a textual prompt explicitly describes object characteristics and is encoded as a single feature, we use Max-Max-Similarity to find the most straightforward feature similarity between image and textual prompts. In few-shot classification, where example images are encoded as multiple features, some image features may describe irrelevant information. To exclude noise interference, we use Max-Min-Similarity to find the similarity level between the most challenging-to-match features. From Tables 5 and 6, it can be observed that, compared to simple averaging, the similarity produced by our strategy in the feature grid is better at distinguishing categories.
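The three ways of reducing a feature grid to a single score can be sketched as below. This is an illustrative reading of the Figure 3 caption rather than the authors' implementation: the helper names are hypothetical, the grid is assumed to be a list of rows of cosine similarities, the choice of which input indexes the rows is an assumption, and the toy grid values are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def feature_grid(row_feats, col_feats):
    """Grid of pairwise cosine similarities; rows index one input's features,
    columns index the other's (an assumed layout)."""
    return [[cosine(r, c) for c in col_feats] for r in row_feats]


def max_max_similarity(grid):
    """Maximum over both rows and columns: the single best-matching pair."""
    return max(max(row) for row in grid)


def max_min_similarity(grid):
    """Minimum over rows of each row's maximum: the hardest-to-match row feature."""
    return min(max(row) for row in grid)


def average_similarity(grid):
    """Ablation reference: mean of all cells in the grid."""
    cells = [s for row in grid for s in row]
    return sum(cells) / len(cells)


# Toy grid with invented similarity values.
toy = [[0.9, 0.2],
       [0.4, 0.3]]
print(max_max_similarity(toy))   # 0.9
print(max_min_similarity(toy))   # 0.4
print(average_similarity(toy))
```

In the toy grid, averaging dilutes the strong 0.9 match with three weak cells, whereas Max-Max keeps it and Max-Min reports how well the worst-served row is matched.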

Viewpoint Weights
Another advantage of PointBLIP is that it requires no manually set weights for different viewpoints. We search the feature grids for the category with the maximum similarity, which is advantageous for identifying the most likely category. We posit that simulated point cloud images from some perspectives may not align well with the textual prompts or example images, introducing noise perturbations during any weighting process. Searching for the maximum similarity captures the best-matching view, thereby avoiding interference from other viewpoints on the overall confidence.
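The contrast between maximum-similarity selection and view weighting can be illustrated with a small sketch. The function names, the per-view score layout (one score per view per category, already reduced from a feature grid), and all numbers are assumptions made for the example; they are not taken from the paper.

```python
def classify_by_max(view_scores):
    """Pick the category whose best single view scores highest.

    view_scores: dict mapping category name -> list of per-view similarities.
    """
    best_per_category = {cat: max(scores) for cat, scores in view_scores.items()}
    return max(best_per_category, key=best_per_category.get)


def classify_by_weighted_sum(view_scores, weights):
    """Reference: combine views with fixed weights, as a view-weighting method would."""
    totals = {
        cat: sum(w * s for w, s in zip(weights, scores))
        for cat, scores in view_scores.items()
    }
    return max(totals, key=totals.get)


# Invented scores: one view matches "chair" strongly, the other two views are uninformative.
scores = {
    "chair": [0.82, 0.10, 0.15],
    "table": [0.45, 0.50, 0.48],
}

print(classify_by_max(scores))                          # chair
print(classify_by_weighted_sum(scores, [1 / 3] * 3))    # table
```

In this toy case the two uninformative views drag down the weighted total for the strongly matched category, while maximum selection is unaffected by them.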

Conclusions
We introduce PointBLIP, a zero-training and powerful point cloud classification network that achieves state-of-the-art performance in both zero-shot and few-shot classification tasks. Built upon the vision-language pre-training model BLIP-2 as a backbone network, PointBLIP directly compares similarity between multiple image features or multiple text features without the need for pre-setting weights for observed viewpoints. We establish a minimal feature-comparison unit called the feature grid and employ different feature-measurement strategies for zero-shot and few-shot classification tasks. Additionally, we enhance the input data quality by generating images through ray tracing and utilizing GPT-3 to generate comparative textual prompts. The innovations in PointBLIP address challenges such as prompt ambiguity, image domain gap, view weight confusion, and feature deviation observed in previous VLP-based classification methods, resulting in higher classification accuracy on benchmark datasets.

Figure 1. Visualization results comparing projection and ray tracing on the ModelNet dataset. The visualizations on the left, with white backgrounds, depict the outcomes obtained through realistic projection in PointCLIPv2, whereas those on the right showcase our visualizations utilizing ray tracing. The point cloud images generated through ray tracing more closely resemble the visual style of real-world scenes.

Figure 2. Overall architecture of PointBLIP for zero-shot classification. Each feature grid generates a similarity score by comparing a perspective image with all textual prompts corresponding to a specific category. The classification result is determined by selecting the category with the highest similarity score. Both image and text encoders employed in this architecture are derived from BLIP-2.

Figure 3. Different feature-measurement strategies in the feature grid. Each cube represents the cosine similarity between two features. (a) Max-Max-Similarity strategy. The output similarity is the maximum similarity for both rows and columns in the feature grid. (b) Max-Min-Similarity strategy. The output similarity is the minimum value among the maximum similarities in each row of the feature grid.

Figure 4. Overall architecture of PointBLIP for few-shot classification. Each feature grid contributes to a similarity score through the comparison of a query image with an example image. The category associated with the feature grid exhibiting the highest similarity is designated as the classification result. Both the image and text encoders incorporated in this structure are derived from BLIP-2.
For K-shot scenarios, we randomly sample K point clouds from each category in the training set, employing these point clouds as examples for classification in the testing set. The comparison results with PointCLIP and PointCLIPv2 under different K values are illustrated in Figure 5.
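The K-shot example selection described above can be sketched as follows. The function name `sample_k_shot`, the use of integer indices into a label list, and the fixed seed are assumptions for illustration; the paper does not describe its sampling code.

```python
import random
from collections import defaultdict


def sample_k_shot(train_labels, k, seed=0):
    """Randomly pick k training indices per category.

    train_labels: list where train_labels[i] is the category of training sample i.
    Returns a dict mapping category -> list of k distinct sample indices.
    Raises ValueError if a category has fewer than k samples.
    """
    indices_by_category = defaultdict(list)
    for idx, label in enumerate(train_labels):
        indices_by_category[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the example selection is repeatable
    return {
        category: rng.sample(indices, k)
        for category, indices in indices_by_category.items()
    }


# Toy label list standing in for a training split.
labels = ["chair", "chair", "chair", "table", "table", "table", "lamp", "lamp"]
for k in (1, 2):
    print(k, sample_k_shot(labels, k))
```

The selected point clouds then serve as the example set against which each test sample is compared, with no parameter updates involved.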

Figure 5. Comparison of zero-training few-shot classification performance among PointBLIP, PointCLIP, and PointCLIPv2 on the benchmark datasets ModelNet40 (left) and ScanObjectNN (right). The trainable inter-view adapter modules in PointCLIP and PointCLIPv2 were excluded for a fair evaluation.

Table 1. Zero-shot point cloud overall classification accuracy (%) for ModelNet and ScanObjectNN benchmark datasets. ModelNet10 and ModelNet40 are two data splits in ModelNet; S-OBJ_ONLY, S-OBJ_BG, and S-PB_T50_RS are three data splits in ScanObjectNN.

Table 2. Few-shot point cloud classification results with 95% confidence intervals on ModelNet40-FS and ShapeNet70-FS. Prior methods are trained with DGCNN as a backbone, while PointBLIP is zero-training.

Table 5. Comparison of average similarity and Max-Max-Similarity strategy on various benchmark datasets for zero-shot classification (%).

Table 6. Comparison of average similarity and Max-Min-Similarity strategy on various benchmark datasets for few-shot classification under 5-way 5-shot setting (95% confidence intervals).