CLIP-Driven Few-Shot Species-Recognition Method for Integrating Geographic Information

Abstract: Automatic recognition of species is important for the conservation and management of biodiversity. However, since closely related species are visually similar, it is difficult to distinguish them by images alone. In addition, traditional species-recognition models are limited by the size of the dataset and face the problem of poor generalization ability. Visual-language models such as Contrastive Language-Image Pretraining (CLIP), trained on large-scale datasets, have excellent visual representation learning ability and have demonstrated promising transfer ability in a variety of few-shot species recognition tasks. However, limited by the dataset on which it was trained, CLIP performs poorly when used directly for few-shot species recognition. To improve the performance of CLIP for few-shot species recognition, we proposed a few-shot species-recognition method incorporating geolocation information. First, we utilized the powerful feature extraction capability of CLIP to extract image features and text features. Second, a geographic feature extraction module was constructed to provide additional contextual information by converting structured geographic location information into geographic feature representations. Then, a multimodal feature fusion module was constructed to deeply interact geographic features with image features, yielding enhanced image features through a residual connection. Finally, the similarity between the enhanced image features and the text features was calculated to obtain the species recognition results. Extensive experiments on the iNaturalist 2021 dataset show that our proposed method significantly improves the few-shot species-recognition performance of CLIP. Under ViT-L/14 with 16 training samples per species, compared to Linear probe CLIP, our method achieved performance improvements of 6.22% (mammals), 13.77% (reptiles), and 16.82% (amphibians). Our work provides powerful evidence for integrating geolocation information into species-recognition models based on visual-language models.


Introduction
Biodiversity promotes ecosystem stability and is important for maintaining ecosystem function [1,2]. Species recognition allows for a better understanding and documentation of the presence and distribution of different species, thereby promoting the conservation and management of biodiversity. With the development of artificial intelligence, computer scientists and taxonomists have been working to develop tools for automated species recognition [3][4][5][6][7]. Gomez Villa et al. [8] achieved 35.4% Top-1 accuracy (on an unbalanced training dataset containing empty images) and 88.9% Top-1 accuracy (on a balanced dataset with images containing only foreground animals) using deep convolutional neural networks on the Snapshot Serengeti dataset. Xie et al. [9] utilized an improved YOLO V5s to recognize large mammal species in airborne thermal imagery, which improved the recognition accuracy. Recent work has also introduced geographical distribution knowledge in textual form to improve zero-shot species recognition accuracy. However, no studies have explored the potential effects of utilizing geographic information on few-shot species recognition based on large-scale visual-language models like CLIP. To fill this gap, we proposed a few-shot species-recognition method based on species geographic information and CLIP (SG-CLIP). First, we utilized the powerful image and text feature extraction capabilities of CLIP to extract species image features and species text features through the image encoder and text encoder, respectively. Then, geographic features were extracted by the geographic feature extraction module. Next, we designed an image and geographic feature fusion module to obtain enhanced image features. Finally, the matrix of species similarity between the enhanced image features and the text features was calculated to obtain the recognition results. Overall, our contributions can be summarized as follows:


• We proposed SG-CLIP, which integrates geographic information about species, to improve the performance of few-shot species recognition. To the best of our knowledge, this is the first work to exploit geographic information for few-shot species recognition with large vision-language models.

• We introduced the geographic feature extraction module to better process geographic location information. Meanwhile, we designed the image and geographic feature fusion module to enhance the image representation ability.

• We performed extensive experiments with SG-CLIP on the iNaturalist 2021 dataset to demonstrate its effectiveness and generalization. Under ViT-B/32 and the 16-shot training setup, compared to Linear probe CLIP, SG-CLIP improves the recognition accuracy by 15.12% on mammals, 17.51% on reptiles, and 17.65% on amphibians.

Methods
In this section, we introduce the proposed SG-CLIP for few-shot species recognition. The structure of SG-CLIP is shown in Figure 1. In Section 2.1, we first revisit CLIP-driven methods for image and text feature extraction. In Section 2.2, we introduce the geographic feature extraction module. In Section 2.3, we elaborate on the details of image and geographic feature fusion. In Section 2.4, the species prediction probability is calculated to obtain the species recognition results.

CLIP-Driven Image and Text Feature Extraction
In contrast to previous work on training visual models [32,33] for image classification, object detection, and semantic segmentation, or language models [34] for content understanding and generation, CLIP combines the visual and linguistic modalities: it uses large-scale image-text pairs collected from the Internet for contrastive learning, obtains transferable visual features, and has demonstrated inspiring performance in a variety of zero-shot image classification tasks [35][36][37]. CLIP contains two encoders, image and text, which encode an image and its corresponding description into visual and textual embeddings, respectively, and then computes the cosine similarity between the two. Specifically, given an N-class species dataset, where N denotes the number of species classes, for any image I ∈ R^{H×W×3} that depicts a certain species, where H and W denote the height and width of the image, the image encoder is used to obtain image features I_f ∈ R^D according to Equation (1), where D denotes the dimension of the image features:

I_f = ImageEncoder(I), (1)
where the image encoder uses the ResNet [32] or ViT [33] architecture.
For the N different classes, after generating K sentences T_k using the default prompt template "A photo of a [class]", the text encoder was used to obtain text features T_f ∈ R^{D×K} according to Equation (2):

T_f = TextEncoder(T_k), (2)
where the text encoder uses the Transformer [34] architecture.
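As an illustration, the two encoding steps can be reproduced with the open-source CLIP package (github.com/openai/CLIP); the paper does not name its CLIP implementation, and the species names and image path below are hypothetical placeholders.

```python
# Illustrative sketch of CLIP image and text feature extraction; assumes the
# open-source "clip" package, which may differ from the authors' setup.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["red fox", "gray wolf", "brown bear"]        # placeholder species
prompts = [f"A photo of a {name}" for name in class_names]  # default template

with torch.no_grad():
    # Text features T_f: one D-dimensional embedding per prompt (Equation (2))
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    # Image features I_f: a D-dimensional embedding for one image (Equation (1))
    image = preprocess(Image.open("observation.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
```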

Geographic Feature Extraction Module (GFEM)
Species observations in community science often come with location coordinates, such as longitude and latitude, and utilizing prior knowledge of the location can help with species recognition. To simplify the calculation and help the model learn better, we scaled latitude and longitude to the range [−1, 1], respectively, with reference to the settings of Mac Aodha et al. [27] and de Lutio et al. [29]. Then, we concatenated them into g(latitude, longitude) and mapped this onto the location input S according to Equation (3):

S = [sin(πg), cos(πg)], (3)

where sin(·) denotes the sine function and cos(·) denotes the cosine function.
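A minimal sketch of this coordinate encoding is shown below; dividing latitude by 90° and longitude by 180° to reach [−1, 1] is our assumption, since the paper does not state the exact scaling.

```python
import math
import torch

def encode_location(lat_deg: float, lon_deg: float) -> torch.Tensor:
    """Map raw coordinates to the location input S of Equation (3).

    Latitude and longitude are first scaled to [-1, 1], concatenated into g,
    and then wrapped with sine/cosine so that longitudes near ±180° map to
    nearby points (following Mac Aodha et al. [27])."""
    g = torch.tensor([lat_deg / 90.0, lon_deg / 180.0])  # scale to [-1, 1]
    return torch.cat([torch.sin(math.pi * g), torch.cos(math.pi * g)])  # S ∈ R^4
```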
Inspired by Mac Aodha et al. [27], who proposed geographical priors for species recognition, we constructed a geographic feature extraction module (GFEM) to process location information. The structure of GFEM is shown in Figure 2. First, to provide a richer representation of geographic features, we started with a fully connected layer to transform the dimension of the location input S to a larger dimension. To acquire nonlinear feature extraction capability, we used the ReLU activation function. Then, we stacked four FCResLayer blocks to enhance the geographic feature representation. Every FCResLayer block contains two fully connected layers, two ReLU activation functions, a Dropout layer, and a skip connection. Finally, we obtained the geographic features. Given the location input S, the geographic feature S_f was extracted by GFEM.
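The GFEM described above can be sketched in PyTorch as follows; the hidden dimension (256) and dropout rate are assumptions, since the paper does not report them.

```python
import torch.nn as nn

class FCResLayer(nn.Module):
    """One FCResLayer: two FC layers, two ReLUs, a Dropout layer, and a skip."""
    def __init__(self, dim: int, p_drop: float = 0.5):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Dropout(p_drop),
            nn.Linear(dim, dim), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection

class GFEM(nn.Module):
    """Geographic feature extraction: FC + ReLU, then four stacked FCResLayers."""
    def __init__(self, in_dim: int = 4, hidden_dim: int = 256):
        super().__init__()
        self.inflate = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(*[FCResLayer(hidden_dim) for _ in range(4)])

    def forward(self, s):  # s: location input S, shape (B, 4)
        return self.blocks(self.inflate(s))  # geographic feature S_f, shape (B, hidden_dim)
```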

Image and Geographic Feature Fusion Module (IGFFM)
After acquiring the image and geographic features, inspired by Yang et al. [30], who proposed Dynamic MLP, we constructed an image and geographic feature fusion module (IGFFM) to fuse the multimodal features. The structure of the IGFFM is shown in Figure 3. Since the image feature dimension is larger than the geographic feature dimension, we first decreased the image feature dimension to the same dimension as the geographic feature in order to reduce the memory cost. Then, we constructed a dynamic fusion block (DFB) to dynamically fuse image features and geographic features. In the DFB, the one-dimensional geographic features were converted into the same shape as the image features, the deep interaction between image and geographic information was performed by matrix multiplication, and the enhanced image features were obtained after a LayerNorm layer and the ReLU activation function. Because the impact of geographic information differs across species, the number of stacked DFB modules differs across species datasets. Finally, we extended the dimensions of the enhanced image features to match the dimensions of the input image features and further enhanced them with a skip connection. To avoid forgetting the powerful image representation learning ability of CLIP, according to Equation (4), we used a residual connection to obtain the final image features by combining the enhanced image features with the original image features:

I_f' = α · I_f^e + (1 − α) · I_f, (4)
where I_f, I_f^e, I_f', and α represent the original image features, enhanced image features, final image features, and the residual ratio, respectively.
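To make the fusion concrete, below is a minimal PyTorch sketch of one possible DFB/IGFFM implementation. The (d × d) dynamic-weight construction, the dimensions (512 for ViT-B/32 image features, 256 for geographic features), and the exact placement of the skip connection are our reading of Figure 3 and the description above, not the authors' released code.

```python
import torch
import torch.nn as nn

class DynamicFusionBlock(nn.Module):
    """One DFB: the geographic feature is projected into a (d x d) matrix that
    multiplies the image feature (the 'deep interaction'), followed by
    LayerNorm and ReLU."""
    def __init__(self, d: int):
        super().__init__()
        self.d = d
        self.to_weight = nn.Linear(d, d * d)  # geo feature -> dynamic weight matrix
        self.norm = nn.LayerNorm(d)
        self.act = nn.ReLU(inplace=True)

    def forward(self, img: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        w = self.to_weight(geo).view(-1, self.d, self.d)  # (B, d, d)
        out = torch.bmm(img.unsqueeze(1), w).squeeze(1)   # matrix multiplication
        return self.act(self.norm(out))

class IGFFM(nn.Module):
    """Reduce image dim, apply N stacked DFBs, expand back, add the skip
    connection, then apply the residual mix of Equation (4)."""
    def __init__(self, img_dim: int = 512, geo_dim: int = 256,
                 n_blocks: int = 2, alpha: float = 0.8):
        super().__init__()
        self.reduce = nn.Linear(img_dim, geo_dim)  # shrink image feature
        self.blocks = nn.ModuleList(DynamicFusionBlock(geo_dim) for _ in range(n_blocks))
        self.expand = nn.Linear(geo_dim, img_dim)  # restore original dimension
        self.alpha = alpha

    def forward(self, img_f: torch.Tensor, geo_f: torch.Tensor) -> torch.Tensor:
        x = self.reduce(img_f)
        for blk in self.blocks:
            x = blk(x, geo_f)
        enhanced = img_f + self.expand(x)                        # skip connection
        return self.alpha * enhanced + (1 - self.alpha) * img_f  # Equation (4)
```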

Species Prediction Probability Calculation
After acquiring the fused image features, we multiplied the final image features I_f' by the text features T_f generated by CLIP's text encoder and applied a Softmax function to calculate the predicted probability of each species, according to Equation (5):

p_i = Softmax(I_f'^T T_f)_i, (5)
where p_i denotes the prediction probability for the i-th species category and Softmax(·) represents the Softmax function.
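As a sketch, Equation (5) amounts to a matrix product followed by Softmax; CLIP's learned temperature (logit scale) is omitted here for brevity.

```python
# Minimal sketch of Equation (5): similarity between the final image features
# and the K text features, turned into per-class probabilities.
import torch

def predict_probabilities(final_image_feats: torch.Tensor,  # (B, D), from IGFFM
                          text_feats: torch.Tensor          # (K, D), from text encoder
                          ) -> torch.Tensor:
    logits = final_image_feats @ text_feats.t()  # (B, K) similarity matrix
    return logits.softmax(dim=-1)                # p_i for each species class
```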

Experiments
In this section, we present the dataset, experimental environment, implementation details, and evaluation metrics.

Datasets
The iNaturalist 2021 dataset [38] is a public species dataset containing 10,000 species in 11 supercategories. The dataset was collected and labeled by community scientists, and each image contains the latitude and longitude at the time it was taken. We produced three few-shot datasets from the training and validation sets of this dataset by selecting three supercategories, namely Mammals, Reptiles, and Amphibians. Of these, Mammals contains 246 species, Reptiles contains 313 species, and Amphibians contains 170 species. Detailed descriptions of the three few-shot datasets are shown in Table 1. We combined all the geographic locations of the training and testing sets for each dataset and created heat maps of the corresponding geographic distributions for the three datasets. As can be seen in Figure 4, these species are distributed globally and occur with different frequencies at different locations.
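A hypothetical sketch of how a K-shot training subset (K images per species) could be drawn from an iNaturalist-style annotation list is shown below; the field names are illustrative, not the actual iNaturalist 2021 schema.

```python
import random
from collections import defaultdict

def make_k_shot(annotations: list[dict], k: int, seed: int = 0) -> list[dict]:
    """annotations: dicts with at least 'species' and 'image_path' keys."""
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for ann in annotations:
        by_species[ann["species"]].append(ann)
    subset = []
    for species, items in by_species.items():
        subset.extend(rng.sample(items, min(k, len(items))))  # k images per class
    return subset
```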

Experimental Setup, Implementation Details, and Evaluation Metrics
We implemented all the methods using PyTorch on an NVIDIA GeForce RTX 3090 (NVIDIA, Santa Clara, CA, USA). In our experiments, all input images were resized to 256 × 256 for training and testing. We trained our SG-CLIP using 1, 2, 4, 8, and 16 shots.
During training, we used the Stochastic Gradient Descent (SGD) algorithm to optimize our proposed model, where the initial learning rate was set to 0.04, momentum was set to 0.9, and weight decay was set to 1 × 10^−4. The CrossEntropyLoss function was used as the loss function, the batch size was set to 32, and the number of training epochs was set to 200. During testing, we used all the test images. We used Top-1 accuracy and Top-5 accuracy as evaluation metrics to validate our proposed method. For Tip-Adapter and Tip-Adapter-F, the same parameter settings were adopted, following Zhang et al. [23].
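For reference, the optimizer and training loop described above could be set up as follows; `sg_clip` and `train_loader` are hypothetical placeholders for the SG-CLIP model and a few-shot data loader, not the authors' code.

```python
# Sketch of the training configuration: SGD with lr 0.04, momentum 0.9,
# weight decay 1e-4, cross-entropy loss, batch size 32, 200 epochs.
import torch
import torch.nn as nn

optimizer = torch.optim.SGD(
    (p for p in sg_clip.parameters() if p.requires_grad),  # only GFEM/IGFFM are learnable
    lr=0.04, momentum=0.9, weight_decay=1e-4,
)
criterion = nn.CrossEntropyLoss()

for epoch in range(200):                            # 200 training epochs
    for images, locations, labels in train_loader:  # batches of 32
        logits = sg_clip(images, locations)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```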

Results
In this section, we compare our SG-CLIP with four baseline models: Zero-shot CLIP [20], Linear probe CLIP [20], Tip-Adapter [23], and Tip-Adapter-F [23]. Zero-shot CLIP, Tip-Adapter, Tip-Adapter-F, and SG-CLIP used the same textual prompt "A photo of a [class]". Linear probe CLIP was adapted to the target dataset by training a linear classifier. To demonstrate the effectiveness and generalization of our proposed method, we conducted a number of comparison and ablation studies on different versions of CLIP.

Performance of Different Few-Shot Learning Methods on ViT-B/32
We chose the ViT-B/32 version of CLIP to compare the recognition performance of the different methods. The main results are displayed in Table 2 and show that SG-CLIP significantly outperforms all baseline methods in terms of recognition accuracy, proving that fusing geolocation information can improve the performance of few-shot species recognition.

Compared to Zero-shot CLIP, SG-CLIP achieved significant performance gains across all three species datasets; even the one-shot training setup was sufficient to exceed the performance of Zero-shot CLIP. Under the 16-shot training setup, SG-CLIP boosted recognition accuracy by 5~10 times. Compared to Linear probe CLIP, SG-CLIP achieved the best performance under all x-shot training settings, where x was 1, 2, 4, 8, and 16. As shown in Figure 5, as the number of training samples increases, the gap between SG-CLIP and Linear probe CLIP becomes more obvious. Under the 16-shot training setup, SG-CLIP improves the recognition accuracy by 15.12% (Mammals), 17.51% (Reptiles), and 17.65% (Amphibians). Tip-Adapter and Tip-Adapter-F outperformed SG-CLIP only on the Mammals dataset with the one-shot training setup; SG-CLIP achieved significant performance gains on all three species datasets when the number of training samples was set to 2 or more. Compared to Tip-Adapter, the recognition accuracy of SG-CLIP was improved by 23.58% (Mammals), 23.87% (Reptiles), and 22.35% (Amphibians) with the 16-shot training setup. Similarly, the recognition accuracy of SG-CLIP was improved by 12.56% (Mammals), 17.39% (Reptiles), and 15.53% (Amphibians) compared to Tip-Adapter-F.

As shown in Figure 5, on the Mammals dataset, the recognition accuracy of Tip-Adapter-F was the highest, followed by Tip-Adapter, and SG-CLIP was only higher than Linear probe CLIP when the number of training samples was 1. As the number of training samples increased to 2 or more, SG-CLIP gradually outperformed Tip-Adapter and Tip-Adapter-F. On the Reptiles and Amphibians datasets, SG-CLIP outperformed Tip-Adapter-F and Tip-Adapter at all settings of the number of training samples.

Performance of Few-Shot Species Recognition with Different Versions of CLIP
Further, we performed comparative experiments on different versions of CLIP, such as ViT-B/32 and ViT-L/14, to validate the generalizability of SG-CLIP. Compared with ViT-B/32, ViT-L/14 has more parameters and better feature extraction capability. As can be seen from Table 3, the recognition accuracies of Zero-shot CLIP, Linear probe CLIP, Tip-Adapter, Tip-Adapter-F, and SG-CLIP were all significantly improved as the model became larger. Under the 16-shot training setup on the Mammals dataset, the recognition accuracy of SG-CLIP under ViT-L/14 improved from 53.82% to 64.07% compared to SG-CLIP under ViT-B/32, a gain of 10.25%. Similarly, improvements of 11.85% and 7.22% were obtained on the Reptiles and Amphibians datasets, respectively. From Figure 6, it can be observed that SG-CLIP achieved the highest recognition accuracy when the number of training samples for each category was set to 16. Tip-Adapter-F and Linear probe CLIP followed SG-CLIP, and their performance improved as the number of training samples increased. Finally, Tip-Adapter had the lowest recognition accuracy among all the few-shot recognition methods. On the Reptiles and Amphibians datasets, SG-CLIP achieved the best performance and substantially outperformed the other methods in all training sample setups; the gap became more pronounced as the number of training samples in each class increased. On the Mammals dataset, SG-CLIP did not have the highest recognition accuracy when the number of training samples was less than 4; SG-CLIP outperformed the other methods only when it exceeded 4.

Time Efficiency Analysis
To evaluate our proposed method more comprehensively, we conducted time efficiency analysis experiments, the results of which are shown in Table 4. The training time for SG-CLIP was far greater than that of the other methods. Linear probe CLIP freezes CLIP to extract features and adds a linear layer for classification, requiring very little training time. Tip-Adapter improves species recognition performance by fusing the predictions of a pre-constructed key-value cache model with the predictions of Zero-shot CLIP; as a training-free adaptation method, it, like Zero-shot CLIP, incurs no additional training cost. Tip-Adapter-F further fine-tunes the keys of the cache model in Tip-Adapter as learnable parameters, introducing additional training time, usually only a few minutes. SG-CLIP introduces two modules, GFEM and IGFFM, for feature extraction from geolocation information and deep fusion between image and geographic features, respectively, so it requires considerably more training time.

Ablation Studies
We conducted three ablation studies for SG-CLIP on the residual ratio α, the number of DFBs in IGFFM, and geographic information.
We first performed an ablation study by varying the residual ratio α. From Table 5, we can see that the best residual ratios were different on different datasets: 0.8 on Mammals, 0.8 on Reptiles, and 1.0 on Amphibians. In general, fusing image and geographic features and then adding them to the original image features better improves the performance of few-shot species recognition. Moreover, we performed an ablation study by adjusting the number of DFBs in IGFFM. As shown in Table 6, the optimal number of DFBs in IGFFM differs across datasets: 2 on Mammals, 3 on Reptiles, and 4 on Amphibians. We observed that the less common the species in the CLIP training set, the more DFBs in IGFFM were required. Finally, we conducted an ablation study on geographic information. As can be seen from Table 7, fusing either longitude or latitude helps to improve species recognition performance under the 16-shot training setup. Furthermore, the recognition performance on all three species datasets when fusing longitude or latitude alone is lower than when fusing both longitude and latitude.

Visualization Analysis
SG-CLIP has been shown to improve few-shot species recognition accuracy across species datasets. To better understand how SG-CLIP works, we selected the top 15 species from the Mammals dataset and used t-SNE [39] to visualize the image representations extracted by different methods as points. As shown in Figure 7, from ViT-B/32 to ViT-L/14, the clustering results of both Zero-shot CLIP and SG-CLIP improved. SG-CLIP generated more discriminative image representations than Zero-shot CLIP under both ViT-B/32 and ViT-L/14. This is because SG-CLIP utilizes geolocation information for multiple interactions with image features in higher dimensions.
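A sketch of this visualization step using scikit-learn's t-SNE is shown below; `features` (an n_images × D array of image representations) and `labels` (species indices) are placeholders assumed to have been collected beforehand.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the D-dimensional image representations to 2-D and color by species.
embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab20", s=8)
plt.title("Image representations of the top-15 mammal species")
plt.show()
```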

Case Studies
To demonstrate the effectiveness of the proposed method in practical applications, we executed a case study on a subset of Mammals. First, we created Mammals-20 by selecting the top 20 species from the Mammals dataset. Then, we performed comparative experiments of different methods on Mammals-20 under ViT-L/14. From Table 8, it can be seen that the Top-1 and Top-5 recognition accuracies of SG-CLIP were much better than those of the other methods under all training sample setups (1, 2, 4, 8, and 16). Specifically, SG-CLIP achieved the best Top-1 and Top-5 accuracies of 89% and 97.5%, respectively, under the 16-shot training setup. Although the Top-5 accuracy of Tip-Adapter-F matched that of SG-CLIP, its Top-1 accuracy of 75.5% was much lower than that of SG-CLIP. Compared to the 37% Top-1 recognition accuracy of Zero-shot CLIP on Mammals-20, SG-CLIP with the 16-shot training setup is sufficient for real-world precision recognition applications.


Discussion
In this work, we explored for the first time the potential of using geographic information in foundation model-based species classification tasks, and demonstrated that fusing geolocation information from species observations can help improve performance in few-shot species recognition.
Additional information, like geographic location and time, can benefit species recognition. Many studies [27,30,40,41] have utilized geographic information to improve the performance of species recognition. However, the above studies are mainly built on traditional ResNet- and ViT-based visual models, while our proposed method is built on foundation models such as CLIP. In addition, the amount of data used for training is another difference; our approach requires a relatively small number of training samples.
There are many studies exploring the potential of CLIP as an emerging paradigm for zero-shot or few-shot species recognition. CLIP-Adapter [21] constructs a new bottleneck layer to fine-tune the image or text features extracted by the pretrained CLIP to the new species dataset. Tip-Adapter [23] utilizes the training set from the new species dataset to construct a key-value cache model and improves the performance of CLIP for species recognition through feature retrieval. Maniparambil et al. [36] used GPT-4 to generate species descriptions in combination with an adapter constructed by the self-attention mechanism to improve performance in zero-shot and few-shot species recognition. The above works improve either the image branch or the text branch of CLIP. Unlike them, we added a geographic information branch to improve species recognition performance with the help of prior knowledge of geographic location.
First, we validated our proposed method on three species supercategories (Mammals, Reptiles, and Amphibians), demonstrating that fusing geographic information facilitates species recognition and is effective across categories. Then, we performed experiments on different versions of CLIP (ViT-B/32 and ViT-L/14) and found that as the feature extraction capability of the pretrained CLIP model is enhanced, the performance of our method is enhanced accordingly. This provides strong confidence for building foundation models for taxonomic classification in the future. Next, we performed rigorous ablation experiments that demonstrate the importance of our different modules and also show that appropriate hyperparameters need to be set for different species supercategories. These hyperparameters are usually positively correlated with the ease of species recognition.
Finally, from the visualization results in Figure 7, it can be noticed that different species become easier to distinguish after incorporating geolocation information.
However, our work faces two limitations. First, the data used to train the CLIP model came from the Internet, where there is relatively little species data, leading to poor performance of CLIP on the species classification task. Second, due to the varying difficulty of species recognition, the residual ratio and the number of DFBs vary across species datasets, requiring a manual search for appropriate values.
In the future, we will consider allowing the proposed model to automatically learn appropriate parameters such as the residual ratio and the number of DFBs. In addition, we will validate the proposed method on other foundation models.

Conclusions
In this paper, we proposed SG-CLIP, a CLIP-based few-shot species-recognition method that leverages geographic information about species. To the best of our knowledge, we are the first to integrate geographic information into CLIP-driven few-shot species recognition. First, to harness the powerful image representation learning capabilities of foundation models such as CLIP, we used a pretrained CLIP model to extract species image features and the corresponding textual features. Then, to better utilize geographic information, we constructed a geographic feature extraction module to transform structured geographic information into geographic features. Next, to fully exploit the potential of geographic features, we constructed a multimodal fusion module to enable deep interaction between image features and geographic features to obtain enhanced image features. Finally, we computed the similarity between the enhanced image features and the text features to obtain the species predictions. Extensive experiments on different species datasets show that utilizing geolocation information can effectively improve the performance of CLIP for species recognition and significantly outperforms other advanced methods.
For species recognition scenarios under data constraints, such as recognition of rare and endangered wildlife, our model can serve as a first step toward accurate recognition to improve the level of automated species recognition, accelerate species data annotation, and provide preliminary data support for the subsequent design and iteration of models dedicated to recognizing specific species. We hope that our model can help build better species distribution models for biodiversity research. In future work, we will focus on the impact of geographic information on species recognition at different regional scales.

Figure 1. The overall framework of SG-CLIP for few-shot species recognition. It contains three paths for text, image, and geographic information, respectively. The geographic feature is obtained by GFEM. The parameters of GFEM and IGFFM are learnable.


Figure 2. The structure of GFEM for geographic feature extraction. The dashed box shows the structure of the FCResLayer.

Figure 3. The structure of IGFFM for image and geographic feature fusion, where Fc denotes the fully connected layer, ReLU denotes the ReLU activation function, and LayerNorm denotes layer normalization. DFB denotes the dynamic fusion block. DFB is used recursively, where N is the number of DFB modules.


Figure 4. Heatmaps of the geolocation distribution. (a) Mammals. (b) Reptiles. (c) Amphibians. Different colors indicate the number of species observations at different locations: green indicates relatively little data and red indicates a large amount.


Figure 5. Performance comparison of different methods with different numbers of training samples on different datasets. (a) Mammals. (b) Reptiles. (c) Amphibians.


Figure 6. Comparison of few-shot species recognition accuracy on different datasets under different versions of CLIP. (a) Mammals. (b) Reptiles. (c) Amphibians.

Table 1. Details of the training and validation sets in the three few-shot datasets.

Table 2. Top-1 accuracy of few-shot species recognition by different methods under ViT-B/32. Data usage denotes the number of samples used for training.

Table 3. Performance of few-shot species recognition with different versions of CLIP. Data usage indicates the number of samples used for training. Bold indicates the best Top-1 recognition accuracy.

Table 4. Comparative experiments on the training time of different methods. The image encoder of CLIP is ViT-L/14. The training time was measured on an NVIDIA GeForce RTX 3090 GPU.


Table 5. Ablation study on different residual ratios. The CLIP version is ViT-B/32. The number of DFBs in IGFFM is 2. The recognition accuracy in the table is Top-1 accuracy.

Table 6. Ablation study on different numbers of DFBs in IGFFM. The CLIP version is ViT-B/32. The residual ratios on Mammals, Reptiles, and Amphibians are 0.8, 0.8, and 1.0, respectively. The recognition accuracy in the table is Top-1 accuracy.

Table 7. Ablation study of geographic information. The CLIP version is ViT-L/14. The residual ratios on Mammals, Reptiles, and Amphibians are 0.8, 0.8, and 1.0, respectively. The numbers of DFBs in IGFFM on Mammals, Reptiles, and Amphibians are 2, 3, and 4, respectively. The recognition accuracy in the table is Top-1 accuracy. The training setup is 16-shot.

Table 8. Performance of few-shot species recognition by different methods under ViT-L/14. Data usage denotes the number of samples used for training.