Article

Zero-Shot Elasmobranch Classification Informed by Domain Prior Knowledge

by Ismael Beviá-Ballesteros 1, Mario Jerez-Tallón 1, Nieves Aranda-Garrido 2, Marcelo Saval-Calvo 1,*, Isabel Abel-Abellán 2 and Andrés Fuster-Guilló 1

1 Department of Computer Science and Technology, University of Alicante, 03690 San Vicente del Raspeig, Spain
2 Department of Marine Sciences and Applied Biology, University of Alicante, 03690 San Vicente del Raspeig, Spain
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(4), 146; https://doi.org/10.3390/make7040146
Submission received: 5 October 2025 / Revised: 7 November 2025 / Accepted: 12 November 2025 / Published: 14 November 2025
(This article belongs to the Section Learning)

Abstract

The development of systems for the identification of elasmobranchs, including sharks and rays, is crucial for biodiversity conservation and fisheries management, as they represent one of the most threatened marine taxa. This challenge is constrained by data scarcity and the high morphological similarity among species, which limits the applicability of traditional supervised models trained on specific datasets. In this work, we propose an informed zero-shot learning approach that integrates external expert knowledge into the inference process, leveraging the multimodal CLIP framework. The methodology incorporates three main sources of knowledge: detailed text descriptions provided by specialists, schematic illustrations highlighting distinctive morphological traits, and the taxonomic hierarchy that organizes species at different levels. Based on these resources, we design a pipeline for prompt extraction and validation, taxonomy-aware classification strategies, and enriched embeddings through a prototype-guided attention mechanism. The results show significant improvements in CLIP’s discriminative capacity in a complex problem characterized by high inter-class similarity and the absence of annotated examples, demonstrating the value of integrating domain knowledge into methodology development and providing a framework adaptable to other problems with similar constraints.

1. Introduction

The accurate identification of elasmobranchs, including sharks and batoids (rays, torpedoes, and sawfish), constitutes a critical challenge for biodiversity monitoring and fisheries management. These species face alarming conservation threats, with more than one-third currently at risk of extinction [1]. In regions such as the Spanish Levantine coast, where this project was conducted and where two Important Shark and Ray Areas (ISRAs) have been recognized [2,3], it is essential to have reliable recognition tools that support conservation strategies and help mitigate the impact of human activities on vulnerable populations. However, the available datasets are limited, biased toward a reduced number of species, and often composed of heterogeneous image sources. Furthermore, the morphological similarity among related species increases the difficulty of fine-grained classification tasks, particularly when relying on conventional supervised learning techniques, which depend on large volumes of annotated images.
In this context, recent research has highlighted the potential of multimodal zero-shot strategies to address classification problems in domains characterized by data scarcity and high inter-class similarity. These approaches rely on knowledge transfer from large-scale generic datasets, exploiting complex semantic relationships without the need for task-specific fine-tuning [4]. Among them, Contrastive Language–Image Pretraining (CLIP) [5] has emerged as a particularly effective framework, capable of learning a multimodal semantic space that integrates natural language and visual information to enable cross-modal alignment. Building on this paradigm, we propose an approach guided by prior, domain-specific knowledge, framed within informed zero-shot learning as a branch of informed machine learning [6] that incorporates external and independent knowledge into the learning pipeline. In this case, since it is a zero-shot scenario, such knowledge is not integrated into the training process but instead guides model inference toward more informed and explainable decisions.
Based on these foundations, the present work introduces a methodology aimed at improving CLIP’s performance in the identification of elasmobranchs under conditions of low data availability, limited inter-class variability, and absence of annotated examples. The proposed framework integrates multiple sources of expert knowledge: (i) detailed descriptions provided by specialists, (ii) schematic field-guide illustrations highlighting distinctive morphological traits, and (iii) the taxonomic hierarchy that organizes species according to shared characteristics. First, we present a systematic pipeline for prompt extraction and validation, where expert descriptions are expanded through language models and optimized against visual prototypes derived from illustrations, maximizing their discriminative capacity both intra- and inter-class. Second, we incorporate taxonomy-aware classification strategies that leverage the hierarchical structure of species classification to progressively reduce ambiguity in fine-grained recognition tasks. Finally, we design a prototype-guided attention mechanism, in which illustrations highlight relevant features and are integrated into CLIP’s attention process, steering the embedding formation toward the most informative regions and reinforcing coherence with the optimized prompts. This methodology demonstrates the value of incorporating domain knowledge into representation learning, while also improving interpretability and robustness in fine-grained species recognition. The code is available at: https://github.com/Tech4DLab/e-Lasmobranc-project (accessed on 7 November 2025).
The remainder of this article is organized as follows. Section 2 reviews related work on species identification in contexts of data scarcity and multimodal zero-shot learning. Section 3 details the proposed methodology, including prompt extraction, taxonomy-aware classification, and prototype-guided cross-attention. Section 4 presents the experimental setup and results. Finally, Section 5 discusses the conclusions and outlines potential directions for future work.

2. Related Work

Within the marine domain, most existing works have focused on supervised approaches requiring large training datasets. Examples include the automated detection of sharks for ecological monitoring tasks [7,8,9] and the real-time classification of species from aerial drone imagery [10], which are useful for specific conservation purposes but heavily dependent on large volumes of annotated data. Other studies have addressed more specialized applications, such as the re-identification of individual mosaic rays under few-shot conditions [11], or efficient instance segmentation for fish identification in markets [12,13]. Few-shot learning has also been applied to fish recognition [14], but it still relies on carefully curated datasets. More recently, the classification of the origin of commercially relevant species based on subtle traits has been explored through multimodal CLIP-based approaches with lightweight training, using expert descriptions to guide the distinction between wild and farmed individuals [15]. Overall, these approaches demonstrate the potential of machine learning for biodiversity monitoring and fisheries management, but remain limited by annotation demands and the need for retraining whenever new categories or scenarios are introduced.
To overcome these limitations, recent advances in zero-shot and multimodal learning offer promising alternatives. Beyond general surveys of zero-shot learning, research has increasingly focused on domains with fine-grained categories and data scarcity. A key line of work has aimed to improve prompt quality and semantic alignment. Menon and Vondrick [16] showed that natural language descriptions generated by large language models can guide visual classification, bridging the gap between visual embeddings and textual semantics. Zhang et al. [17] introduced Tip-Adapter, an adaptation method for CLIP designed for few-shot classification by leveraging prototypes learned from a dataset, improving feature representation. Guo et al. [18] proposed CALIP, a parameter-free attention mechanism that enhances the discriminative ability of CLIP, while Zhuang et al. [19] introduced FALIP, which integrates visual prompts with foveal attention to improve fine-grained recognition. Other efforts have tackled dataset biases: Li et al. [20] proposed DeCLIP, a bias-free contrastive learning approach for more robust generalization, and Yuksekgonul et al. [21] analyzed CLIP’s tendency to behave like a bag-of-words model, suggesting architectural refinements. In the context of biodiversity monitoring, Praveena et al. [22] demonstrated the potential of zero-shot learning for species identification and monitoring, reducing reliance on costly annotated datasets.
The incorporation of external or domain-specific knowledge has also become central to improving fine-grained recognition. Early zero-shot methods exploited semantic attributes and curated knowledge bases [23,24], showing that structured priors such as taxonomies or textual descriptions provide effective supervision without the need for labeled images. More recent works have extended this to multimodal models: Rodríguez et al. [25] leveraged field guide illustrations to recognize unseen bird species, Menon and Vondrick [16] employed language models to generate discriminative prompts, and Stevens et al. [26] integrated taxonomic hierarchies into CLIP to enhance biodiversity monitoring. Other trends include methods such as IF&PA [27,28], which apply rule-based approaches to address classical segmentation tasks without relying on trained architectures. Such knowledge-driven strategies can complement or even reduce the need for data-intensive learning. Approaches such as Prototypical Networks [29] and knowledge-guided prompt learning [30,31] further illustrate how domain-specific prototypes and prompts can act as structured anchors for few-shot and zero-shot recognition. Collectively, these studies highlight that the careful design of priors is key to improving efficiency, interpretability, and generalization in multimodal systems, particularly in contexts with limited annotated data.
In summary, the literature reveals three complementary trends: improvements in vision–language models, such as prompt design, attention mechanisms, and bias reduction, which strengthen zero-shot alignment; the integration of expert or domain-specific knowledge, ranging from schematic illustrations to taxonomic priors, which mitigate data scarcity and enhance interpretability; and the gradual extension of these approaches to biodiversity research. Building on these ideas, in this work we apply and extend such techniques to the specific challenge of elasmobranch recognition, introducing a methodology that combines expert descriptions, schematic prototypes, and taxonomy-aware strategies to address data scarcity and fine-grained classification.

3. Methodology

The proposed approach combines different sources of expert knowledge to guide classification in a CLIP-based zero-shot scenario characterized by limited data availability, such as the identification of elasmobranch species. The sources of knowledge considered in this task are:
  • Precise base descriptions provided by experts for each category.
  • Schematic illustrations from field guides and specialized sources that highlight distinctive traits.
  • Hierarchical taxonomy that organizes categories and groups shared visual features.
The procedure is structured into two main stages:
Prompt extraction and validation: starting from expert descriptions and their automatically generated variations, the most discriminative prompts are selected by evaluating their similarity with visual prototypes obtained from schematic illustrations. This process accounts for intra-class validity, ensuring that each description faithfully represents its own category, and inter-class validity, guaranteeing a sufficient separation margin from other classes, an essential aspect in low-variability scenarios. In addition, taxonomy-aware classification strategies are incorporated, leveraging the biological hierarchy to reduce ambiguity.
Prototype-guided attention: illustrations are used to emphasize distinctive and shared traits within each category when developing representations of real images. The highlighted features are automatically integrated into CLIP’s attention mechanism, guiding the model to form embeddings focused on the most relevant regions and improving coherence with the optimized prompts.
Overall, the method leverages textual and visual expert knowledge to align prototypes and descriptions, enhancing discrimination between closely related classes in data-limited contexts.

3.1. Prompt Extraction Through Prior Knowledge: Illustrations and Expert Descriptions

This section describes the method for the extraction and automatic selection of textual embeddings and, consequently, discriminative textual descriptions for each species in CLIP’s multimodal space, using prior knowledge of the different categories as the main source. Unlike common approaches that generate prompts with language models from large training sets, our scenario presents two limitations: (i) data scarcity and (ii) classes that are difficult to distinguish, requiring highly specific descriptions. To address this, we rely on two sources of domain knowledge: a precise base description provided by experts and schematic illustrations of each category, as shown in Figure 1, which provides an overview of the proposed method.
The initial expert descriptions were automatically expanded using a language model (GPT-4 in our case), generating a set of variations per category (omissions, reformulations, and adaptations) following CLIP-friendly guidelines. These candidates were encoded into the latent space through CLIP’s text encoder. In parallel, schematic illustrations of each class were processed with the visual encoder, averaging multiple sketches per species to build a representative prototype that reduces variability and emphasizes shared features.
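As a hedged illustration of this step, the following sketch shows how candidate prompts and per-class illustration prototypes could be embedded in CLIP's shared space. It assumes the open_clip package with LAION-pretrained ViT-H-14 weights; the pretrained tag, helper names such as encode_prompts and class_prototype, and file paths are illustrative, not the authors' exact implementation.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP ViT-H-14 backbone (the pretrained tag is an assumption).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-H-14")
model.eval()

@torch.no_grad()
def encode_prompts(prompts):
    """Encode candidate textual descriptions into L2-normalized embeddings."""
    feats = model.encode_text(tokenizer(prompts))
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def class_prototype(illustration_paths):
    """Average several schematic illustrations of one class into a single prototype."""
    imgs = torch.stack([preprocess(Image.open(p).convert("RGB"))
                        for p in illustration_paths])
    feats = model.encode_image(imgs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    proto = feats.mean(dim=0)        # mean embedding emphasizes shared traits
    return proto / proto.norm()
```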
The interaction between both modalities follows a two-stage selection strategy. First, at the intra-class level, all descriptions are evaluated by comparing their similarity with their own visual prototype $\mu^{+}$ (positive) and with the prototypes of the remaining classes $\mu_{j}$ (negatives). For each description, a discriminative score $S(t)$ is computed through a sigmoid-weighted average, assigning greater weight to the hardest negative classes. It is defined as follows:

$$S(t) = \frac{\sum_{j} \omega_{j} \cdot \delta_{j}}{\sum_{j} \omega_{j}}, \quad \text{where} \quad \delta_{j} = \cos(t, \mu^{+}) - \cos(t, \mu_{j}), \qquad \omega_{j} = \frac{1}{1 + e^{\delta_{j}}}$$

where $\cos(\cdot,\cdot)$ denotes cosine similarity, and $\omega_{j}$ is an adaptive coefficient that downweights easily separable classes and emphasizes those that are closer and harder to distinguish. Descriptions surpassing a minimum threshold $k$ are retained as candidates.
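A minimal sketch of this intra-class scoring, assuming L2-normalized embeddings so that dot products equal cosine similarities; variable names are illustrative.

```python
import torch

def discriminative_score(t, proto_pos, protos_neg):
    """Sigmoid-weighted margin score S(t) for one candidate description.

    t          : (D,) normalized embedding of the candidate prompt
    proto_pos  : (D,) prototype of the prompt's own class (mu_plus)
    protos_neg : (C-1, D) prototypes of the remaining classes (mu_j)
    """
    delta = (t @ proto_pos) - (protos_neg @ t)   # margins delta_j against each negative class
    omega = 1.0 / (1.0 + torch.exp(delta))       # small margin (hard negative) -> larger weight
    return float((omega * delta).sum() / omega.sum())

def retain_candidates(prompts, prompt_embs, proto_pos, protos_neg, k=0.0):
    """Keep only the descriptions whose score exceeds the minimum threshold k."""
    return [p for p, e in zip(prompts, prompt_embs)
            if discriminative_score(e, proto_pos, protos_neg) > k]
```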
Second, an inter-class validation is applied, discarding ambiguous descriptions that do not preserve a margin of separation τ with respect to other species. This is achieved through a cross-validation process designed to detect and resolve conflicts between categories (since in semantically related classes, prompts may point to multiple categories). In such cases, discarded candidates are replaced with those ranked highest by S. If conflicts persist, the most robust option is selected, and the process is reiterated for the conflicting categories.
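The inter-class check can be sketched as a margin test against all other class prototypes; the full conflict-resolution loop (re-selecting the next-best candidate by S and iterating over the conflicting categories) is omitted, and the default margin value is a placeholder.

```python
import torch

def passes_interclass_margin(prompt_emb, class_idx, prototypes, tau=0.0):
    """Check that the prompt's own class wins over every other class by at least tau.

    prototypes : (C, D) matrix of normalized class prototypes
    """
    sims = prototypes @ prompt_emb                         # similarity to every class prototype
    own = sims[class_idx]
    others = torch.cat([sims[:class_idx], sims[class_idx + 1:]])
    return bool(own - others.max() >= tau)
```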
The final result is a set of class-specific prompts, optimized and validated from expert knowledge (both textual and visual), which serve as discriminative descriptors in a zero-shot setting without requiring training on real images.

3.1.1. Taxonomy-Aware Classification Strategies

When applying this methodology, prompts are obtained for all categories within the same level. However, since species are organized into a well-defined hierarchical taxonomy (kingdom, order, family, etc.), comparing them all directly is complex: many share highly similar traits, making fine-grained differentiation increasingly difficult as the problem grows in granularity.
Higher taxonomic levels are more easily distinguishable than lower ones, and treating them separately therefore provides greater consistency. In this work, we propose to exploit this hierarchy by progressively incorporating information from each level into the classification process. To this end, two strategies are considered (a minimal sketch of both follows the list):
  • General aggregation: each image is simultaneously evaluated across multiple taxonomic levels, including subclass (Elasmobranchii, which comprises sharks and rays), order, family, and species, by summing the similarity scores at each level to obtain an accumulated score.
  • Sequential classification: the decision is taken hierarchically, starting from the most general level (shark or ray) and progressively restricting the options at each step to the corresponding subgroup. This reduces ambiguity and simplifies prompt selection.
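A minimal sketch of both strategies, assuming precomputed normalized image and prompt embeddings and simple Python dictionaries describing the taxonomy; all names and the root label are illustrative.

```python
import torch

def cumulative_classify(image_emb, prompt_embs, paths):
    """General aggregation: sum similarities over every level of each species' path.

    prompt_embs : dict category -> (D,) normalized prompt embedding
    paths       : dict species -> list of its ancestor categories (subclass, order, family)
    """
    best, best_score = None, float("-inf")
    for species, ancestors in paths.items():
        score = sum(float(image_emb @ prompt_embs[c]) for c in ancestors + [species])
        if score > best_score:
            best, best_score = species, score
    return best

def sequential_classify(image_emb, prompt_embs, children, root="elasmobranch"):
    """Sequential strategy: decide level by level, restricting options at each step.

    children : dict category -> list of sub-categories (empty for species, i.e. leaves)
    """
    node = root
    while children.get(node):
        options = children[node]
        sims = torch.stack([image_emb @ prompt_embs[o] for o in options])
        node = options[int(sims.argmax())]
    return node
```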

3.2. Prototype-Guided Cross-Attention from Illustrations

In this section, we describe the use of average prototypical representations, obtained from different category-specific illustrations (analogous to those introduced in Section 3.1), to focus and strengthen the representation of shared characteristic features when generating embeddings from real images, as illustrated in Figure 2. The goal is to produce representations that emphasize the differences extracted from the illustrations and make them more compatible with the prompts derived in the previous section, thus improving system performance.
To this end, we extract the spatial output of the original CLIP visual encoder in order to obtain the projected representations of each patch in the semantic space. Although these projections integrate contextual information from the entire set of patches, they provide more spatially localized encodings in each section. This is crucial, since CLIP’s global pooling tends to dilute details, making it difficult to discriminate categories with high visual similarity that rely on local and specific descriptors pointing to fine-grained details (as in our case).
From these intermediate representations of the real image ($F_r \in \mathbb{R}^{N \times D}$) and the prototypical representations of each category ($F_t \in \mathbb{R}^{E \times D}$), cross-attention weights are computed through matrix multiplication:

$$A = F_r F_t^{\top}$$
which produces an attention map for each category. These maps are jointly normalized using a softmax function, generating for each class a distribution that highlights regions of semantic correspondence between the real image and the prototypes. Although each patch embedding carries some inherent noise from this global mixing of information, when a clearly matching feature exists it is emphasized over the rest. For interpretation, the maximum values are extracted from each map ($M \in \mathbb{R}^{7}$) and then integrated into a single result normalized to the range $[0, 1]$. The outcome is a weight vector that, for each image, highlights the most discriminative visual features.
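One plausible reading of this construction is sketched below; the exact softmax axis and the final aggregation across the category maps are not fully specified above, so both are assumptions, and the variable names are illustrative.

```python
import torch

def patch_weight_vector(F_r, prototype_patches):
    """Derive a per-patch weight vector from prototype comparisons (one possible reading).

    F_r               : (N, D) projected patch embeddings of the real image
    prototype_patches : list of (E, D) patch embeddings, one prototype per category
    Returns a (N,) vector in [0, 1] highlighting the most discriminative patches.
    """
    maps = torch.stack([F_r @ F_t.T for F_t in prototype_patches])   # (C, N, E) attention maps
    maps = torch.softmax(maps.flatten(1), dim=-1).view_as(maps)      # joint normalization per category (assumed axis)
    per_patch = maps.amax(dim=-1)     # (C, N): best prototype-patch match for each image patch
    weights = per_patch.amax(dim=0)   # (N,): strongest response across the category maps
    weights = weights - weights.min()
    return weights / (weights.max() + 1e-8)   # rescale to [0, 1]
```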
These weight vectors are used when re-executing CLIP on the real image, introducing a modification in its architecture: specifically, in the attention mechanism of the last layer, they are incorporated as a bias multiplied by a factor $\alpha$ to control its influence and guide the attention distribution towards the most discriminative regions. In this way, the network prioritizes the patches previously highlighted by the comparison with the prototypes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + \alpha B^{2}\right) V$$

where $\alpha$ controls the strength of the bias and the bias term $B$ is squared to further emphasize the most relevant regions.
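The modified last-layer attention can be sketched as follows; this is a simplified single-head version, whereas in practice the bias would be injected inside CLIP's multi-head attention, which is not shown here.

```python
import math
import torch

def biased_attention(Q, K, V, B, alpha=1.0):
    """Scaled dot-product attention with the prototype-derived bias of the last layer.

    Q, K, V : (N, d) query/key/value matrices of the patch tokens
    B       : (N,) weight vector from the prototype comparison, values in [0, 1]
    alpha   : strength of the bias
    """
    d = Q.shape[-1]
    logits = Q @ K.T / math.sqrt(d)     # standard attention scores
    logits = logits + alpha * (B ** 2)  # bias every query toward the highlighted key patches
    return torch.softmax(logits, dim=-1) @ V
```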
With this adjustment, the final representations preserve the global context while reinforcing the emphasis on features coinciding with the prototypical illustrations, highlighting the most distinctive ones. As a result, the generated embeddings align more robustly with the optimized textual prompts defined in previous sections, leading to improved discrimination ability in zero-shot scenarios by coherently integrating both visual and textual information.

4. Experimentation

This section presents the dataset and experiments conducted to evaluate the effectiveness of the proposed methodology in the identification of elasmobranch species, a complex and data-scarce problem. The goal is to assess how the integration of textual, visual, and taxonomic expert knowledge enhances CLIP’s discriminative capacity in a zero-shot scenario.

4.1. Dataset

The dataset used in this study consists of images of elasmobranchs observed in the southeastern Mediterranean region (Levant), captured out of the water. The represented species are taxonomically distributed into two main groups (sharks and rays), four orders, five families, and seven distinct species. In total, the dataset includes 448 real images collected from heterogeneous sources, which were preprocessed to maintain a square format and padded with white in the uncovered areas. The data distribution is shown in Table 1. In addition, three schematic illustrations per species were incorporated, selected by experts, centered, segmented on a uniform white background, and manually refined, as shown in Figure 3. This is an evolving dataset, designed with the aim of progressively incorporating new species and expanding the visual representation of each existing category.
As this work follows a zero-shot paradigm, no training or validation subsets are defined. All real images were exclusively used for evaluation, while schematic prototypes and expert descriptions act as external prior knowledge sources guiding the inference process.
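As a small illustration of the preprocessing described above, a possible Pillow-based sketch of the square white padding; centering the content is an assumption, and the file handling is illustrative.

```python
from PIL import Image

def pad_to_white_square(path):
    """Pad an image with white so that it becomes square (content centered by assumption)."""
    img = Image.open(path).convert("RGB")
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), (255, 255, 255))
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas
```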

4.2. Overview of the Methodological Workflow

To provide a complete overview of the proposed methodology, this section summarizes the general processing flow. Overall, the approach follows a zero-shot scheme, where prior knowledge is used to guide the inference process without any retraining. The procedure is organized into two main stages that constitute the essential components of the final classification.
  • First, the prompt selection stage (Section 3.1) generates expert-based, category-specific textual descriptions and validates them against schematic illustrations that capture the distinctive morphological traits of each class. These descriptions are evaluated by measuring the similarity between their embeddings and those derived from the illustrations, selecting the prompts that best represent each class while minimizing confusion with others.
    For example, for the species Galeus melastomus, the prompt “a shark with reddish-brown coloration and oval spots encircled in white” was selected, as it showed the highest correspondence with its schematic illustration and minimal confusion with other species of the same order. The circular white-bordered spots and characteristic coloration reinforce its distinctiveness.
  • Second, the visual representations (Section 3.2) of the images to be classified are refined by using the illustrations to strengthen the visual embeddings within the encoder. This process enhances the attention given to semantically matching features in the last layer of the model, improving the expressiveness of the resulting representations.
    For instance, in an image of Galeus melastomus, the illustrations highlight the pale circular marks on the dorsal area, guiding the model to assign higher weight to these features during representation building.
  • Finally, both components (the optimized prompts and the enriched representations guided by semantically shared features) are integrated into a hierarchical, taxonomy-aware decision workflow (Section 3.1.1). This process progressively narrows the decision space from broader to more specific categories, reducing ambiguity and improving the overall consistency of classification results.
    For example, when classifying between Galeus melastomus and Torpedo marmorata, both species exhibit dorsal spots; however, the hierarchical decision process allows comparisons only within equivalent levels, enabling more precise and consistent differentiation.

4.3. Hierarchical Taxonomic Evaluation

The results are presented according to the different taxonomic levels, excluding the main group level (sharks vs. rays), which represents a trivial problem. The complexity increases progressively along the hierarchy, reaching its highest difficulty at the species level, where the differences between closely related species are subtle.
All experiments were conducted using the ViT-H/14 model as the visual encoder of CLIP (the highest-capacity variant), given the complexity of the data and the need to capture descriptors in as much detail as possible. To ensure a fair comparison with the baseline configuration, the model was also evaluated using only the category names as descriptions. The results presented in Table 2 demonstrate that the model had not been “exposed” to this type of data during its large-scale pretraining. In addition to scientific names, common names were also included at the species level in order to explore variability in recognition.
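For reference, the name-only baseline of Table 2 amounts to plain zero-shot classification; a minimal sketch reusing the encoding helpers from the Section 3.1 example (the helper names are illustrative):

```python
import torch

@torch.no_grad()
def name_only_baseline(image_emb, class_names, encode_prompts):
    """Zero-shot prediction using only category names (scientific or common) as prompts."""
    text_embs = encode_prompts(list(class_names))   # (C, D) normalized name embeddings
    sims = text_embs @ image_emb                    # cosine similarity to the image embedding
    return class_names[int(sims.argmax())]
```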
These results highlight the need to guide zero-shot strategies in order to achieve adequate discriminative capacity in the scenario under study. First, the models are compared by applying the prompt selection methodology based on illustrations (Section 3.1); subsequently, taxonomic information is incorporated through both cumulative and sequential strategies (Section 3.1.1). Finally, in the best-performing scenario, a bias is introduced into CLIP with the aim of focusing attention on the most relevant regions, both across all taxonomic levels and exclusively at the family and species levels, where more fine-grained differentiation is required (Section 3.2). All these results are summarized in Table 3.
Additionally, although the complete comparative experiments have been made publicly available, the hierarchical decision module has been extended by incorporating a simple abstention mechanism that introduces a confidence-based penalty. This mechanism allows the model to abstain from making a decision when the similarity margin between the two most probable classes is below a defined threshold. In such cases, the classification process is interrupted, retaining the prediction at a higher taxonomic level and avoiding an unreliable fine-grained decision. This adjustment helps to reduce error propagation throughout the hierarchical structure and increase reliability in ambiguous scenarios.
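The abstention rule can be sketched as a top-2 margin test inside the sequential loop, reusing the structures from the earlier taxonomy sketch; the margin value is a placeholder.

```python
import torch

def sequential_with_abstention(image_emb, prompt_embs, children,
                               root="elasmobranch", margin=0.05):
    """Sequential classification that stops when the top-2 similarity margin is too small.

    Returns the deepest category that could be decided with sufficient confidence.
    """
    node = root
    while children.get(node):
        options = children[node]
        sims = torch.stack([image_emb @ prompt_embs[o] for o in options])
        if len(options) > 1:
            top2 = torch.topk(sims, 2).values
            if float(top2[0] - top2[1]) < margin:
                return node          # abstain: keep the prediction at the higher level
        node = options[int(sims.argmax())]
    return node
```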
For prompt generation, 100 variants were created per category following different instructions, such as starting with “shark” or “stingray” as appropriate, randomly omitting certain features, or using synonyms in the description. Furthermore, as previously noted, three illustrations were selected per class. These originated from different sources (books, web pages, among others), which introduces a clear domain gap between them and makes their fine-grained integration with real images more challenging.
To ensure reproducibility, the hyperparameters k and τ used in the prompt search are reported. In most cases, the base values (k = 0, τ = 0) are applied, although in more complex situations, such as comparisons among many similar categories or closely related species, they are adjusted to be more restrictive. In general, τ is increased to 1.8 at the order level, reduced to 0.3 at the species level, and set to k = 0.75, τ = 0.9 for Triakidae species differentiation in the sequential strategy. The adjustment is carried out by evaluating results on individual illustrations and verifying, with a very small set, that the specified characteristics help overcome the domain gap. The descriptive prompts selected by category are summarized in Table 4 and Table 5.
On the other hand, the proposed Prototype-guided Cross-Attention from Illustrations methodology was implemented using the same set of illustrations to ensure consistency. Examples of the resulting weight vectors, visualized as heatmaps, are shown in Figure 4. These maps mainly highlight fine-grained characteristics (e.g., spots, fins) that differentiate species from one another, while also enabling a more precise conditioning of the representation and providing a clear visualization of the concentration of distinctive features.
The noise observed in certain regions occurs because, although patches focus on their local area, they share information throughout the entire CLIP processing pipeline. The self-attention mechanism distributes contextual information globally, leading to residual activations from unrelated regions that contribute to noise in the attention maps. This effect is further amplified by the domain gap, which reduces semantic alignment between schematic illustrations and real images. As a corrective measure, the attention weights, normalized between 0 and 1, are squared, acting as a nonlinear normalization step that amplifies high-confidence responses (mainly concentrated in the semantically related regions) and attenuates low-magnitude activations, effectively suppressing part of the spatial noise.
To validate the methodology, a selection process was carried out to determine in which layers the bias should be applied and how strongly the weights should be scaled using a factor α. Figure 5 shows the results obtained under different configurations. The best performance was achieved when modifying only the final layer, which preserves the strength of the global image context while using attention modulation solely to refine the output embedding at the last step. It is also worth noting that the weight vectors carry intrinsic noise and errors due to the way they are computed; this must be taken into account, as assigning them excessive influence can negatively affect the final results. This strategy was applied both across all taxonomic levels and exclusively to the distinction of problematic families and of species belonging to the same family, since these are the classes where finer discrimination is required. Moreover, applying the bias only to these specific categories helps to avoid the negative impact of weight noise on higher taxonomic levels, where class distinctions are clearer and there is no need for an exhaustive differentiation.
Finally, the results obtained for each specific class using the P-TS+B and P-TS+SB methods (Prompt with Taxonomy-aware Sequential classification combined with CLIP Attention Bias), applied either to all taxonomic levels or only to the family and species levels, are presented in Figure 6. In all cases, the error tends to concentrate in neighboring squares, since the confusion matrices are ordered according to the visual similarity of the species, while simultaneously reflecting their order–family–species structure. From these results, it can be observed that P-TS acts as a baseline, generally exhibiting higher or similar error rates in all the fine-grained categories compared to the other approaches. In contrast, P-TS+B emerges as the most promising method, achieving the greatest improvements precisely in the most challenging categories. However, due to issues encountered during the generation of attention maps, a slight degradation is observed for a class that is usually easily distinguishable, which was negatively affected by the incorporation of fine-grained information. Finally, the variant that operates only at the family and species levels (the most difficult ones) builds on a clearly defined baseline and refines it. While it does not achieve results as good as those of P-TS+B due to the constraints imposed in the previous step, it favors more straightforward differentiations.
To quantitatively support the qualitative trends observed in Figure 6, a complementary metric termed hierarchical accuracy (H-Acc) was introduced. This metric incorporates taxonomic relationships into the evaluation by assigning different penalties depending on the hierarchical level at which the prediction error occurs. Formally, for each sample $i$, a taxonomic distance $d_i \in \{0, 0.33, 0.66, 1\}$ is defined, where 0 corresponds to a correct prediction at the species level, 0.33 when only the family matches, 0.66 when only the order matches, and 1 when all levels are incorrect. The final score is computed as:

$$\text{H-Acc} = 1 - \frac{1}{N} \sum_{i=1}^{N} d_i$$

The metric was computed using the macro average to balance the contribution of each class regardless of its frequency within the dataset. The obtained values, 0.73 and 0.75 for the attention-biased variants, respectively, confirm that both outperform the purely sequential P-TS version. This demonstrates that the incorporation of attention not only enhances overall differentiation but also improves recognition among species with subtle visual similarities, providing a consistent advantage to the taxonomy-aware approach.
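A sketch of the macro-averaged H-Acc computation, assuming each prediction has already been mapped to the deepest taxonomic level at which it matches the ground truth; the record format is illustrative.

```python
from collections import defaultdict

# Taxonomic distance d_i for the deepest matching level of a prediction.
DISTANCE = {"species": 0.0, "family": 0.33, "order": 0.66, "none": 1.0}

def hierarchical_accuracy(records):
    """Macro-averaged H-Acc over (true_species, deepest_matching_level) records."""
    per_class = defaultdict(list)
    for true_species, level in records:
        per_class[true_species].append(1.0 - DISTANCE[level])   # 1 - d_i per sample
    class_means = [sum(v) / len(v) for v in per_class.values()]
    return sum(class_means) / len(class_means)
```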
Finally, a comparative analysis has been incorporated between our method and CALIP [18], together with CLIP in its standard zero-shot configuration as a baseline reference. CALIP introduces a parameter-free attention mechanism applied directly during inference and outside the visual encoder, making it a comparable yet contrasting approach to ours, in which prototype-guided attention is integrated within the model’s internal attention layers.
In this comparison, the hierarchical classification strategy is not applied, and all methods use the same optimized prompts. Likewise, CALIP with ViT-L/14 has been thoroughly optimized following the parameter search strategy described by its authors, selecting the configuration that best fits our dataset. From our methodology, only the Prototype-guided Cross-Attention from Illustrations module (Section 3.2) has been evaluated, in order to isolate and highlight the benefit of incorporating this visual prior based on schematic illustrations and its integration into the attention layers.
As shown in Table 6, the proposed PGCA method consistently outperforms both CLIP and CALIP across all taxonomic levels. The improvements are especially notable at the species level, where visual and semantic distinctions are more subtle. These results confirm that incorporating schematic illustrations as visual priors within the attention layers enhances fine-grained discrimination without the need for retraining, effectively surpassing existing parameter-free attention mechanisms such as CALIP.

5. Conclusions

This work shows the feasibility of integrating prior and domain-specific knowledge to support the development of methodologies, in this case within the framework of informed zero-shot approaches, addressing limitations such as extreme data scarcity. The potential of CLIP has been explored by leveraging resources used by experts, such as textual descriptions, representative illustrations that highlight differences, and taxonomic organization. The proposed methodology constitutes a reproducible and adaptable framework with potential applications in other problems that present similar limitations in data availability and high inter-class similarity. Its flexible nature makes it a promising approach for disciplines where visual and textual resources are scarce, but expert knowledge is abundant and structured.
Furthermore, this work addresses a scarcely studied problem: the differentiation of elasmobranchs in an out-of-water context, a task of great ecological and conservation relevance but still underexplored from the perspective of artificial intelligence. This task is particularly important, as the correct identification of species is essential for fisheries management, bycatch control, and the protection of endangered species.
For future work, the main challenge identified is the domain gap between illustrations and real data. Although CLIP partially mitigates this effect through its multimodal semantic space, it remains one of the most critical factors affecting system performance. Overcoming this limitation requires the development of specific strategies to better align both sets, whether through domain adaptation techniques, synthetic data generation, or the incorporation of hybrid representations that combine real and schematic examples. Additionally, improving the construction of patch-based weight vectors, by refining how discriminative regions are extracted, aggregated, or regularized, could further reduce noise and increase the precision of the attention bias. Progress in these directions will enhance knowledge transfer from illustrations to real images and strengthen the robustness of comparisons.
On the other hand, as mentioned, the dataset used is in constant evolution, with the progressive incorporation of new images and annotations. This continuous growth opens the possibility of extending the methodology in the future toward few-shot learning or lightweight fine-tuning scenarios.

Author Contributions

Conceptualization, I.B.-B., A.F.-G. and M.S.-C.; methodology, I.B.-B. and A.F.-G.; writing—original draft, I.B.-B. and M.J.-T.; writing—review and editing, A.F.-G. and M.S.-C.; resources, N.A.-G. and I.A.-A.; supervision, A.F.-G. and M.S.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the eLasmobranc project, which is developed with the collaboration of the Biodiversity Foundation of the Ministry for Ecological Transition and the Demographic Challenge, through the Pleamar Programme, and is co-financed by the European Union through the European Maritime, Fisheries and Aquaculture Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset supporting the findings of this study is not publicly available at this time, but it is intended to be made available in the future. Work is currently underway to prepare the dataset for public release.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jabado, R.W.; Morata, A.Z.A.; Bennett, R.H.; Finucci, B.; Ellis, J.R.; Fowler, S.L.; Grant, M.I.; Barbosa Martins, A.P.; Sinclair, S.L. The Global Status of Sharks, Rays, and Chimaeras; IUCN Species Survival Commission (SSC), Shark Specialist Group: Gland, Switzerland, 2024; ISBN 978-2-8317-2318-1. [Google Scholar] [CrossRef]
  2. Pozo-Montoro, M.; Arroyo, E.; Abel, I.; Bas Gómez, A.; Clemente Navarro, P.; Cortés, E.; Esteban, A.; García-Charton, J.A.; López Castejón, F.; Ortolano, A.; et al. Important Shark and Ray Areas (ISRAs) en el SE ibérico, una declaración necesaria. In Proceedings of the XV Reunión del Foro Científico Sobre la Pesca Española en el Mediterráneo; Universitat d’Alacant: Alicante, Spain, 2025; pp. 55–64. [Google Scholar]
  3. Jabado, R.W.; García-Rodríguez, E.; Kyne, P.M.; Charles, R.; Armstrong, A.H.; Bortoluzzi, J.; Mouton, T.L.; Gonzalez-Pestana, A.; Battle-Morera, A.; Rohner, C.; et al. Mediterranean and Black Seas: A Regional Compendium of Important Shark and Ray Areas; Technical report; IUCN SSC Shark Specialist Group: Dubai, United Arab Emirates, 2023. [Google Scholar] [CrossRef]
  4. Pourpanah, F.; Abdar, M.; Luo, Y.; Zhou, X.; Wang, R.; Lim, C.P.; Wang, X.Z.; Wu, Q.M.J. A Review of Generalized Zero-Shot Learning Methods. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 4051–4070. [Google Scholar] [CrossRef] [PubMed]
  5. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  6. Von Rueden, L.; Mayer, S.; Beckh, K.; Georgiev, B.; Giesselbach, S.; Heese, R.; Kirsch, B.; Pfrommer, J.; Pick, A.; Ramamurthy, R.; et al. Informed Machine Learning—A Taxonomy and Survey of Integrating Prior Knowledge into Learning Systems. IEEE Trans. Knowl. Data Eng. 2023, 35, 614–633. [Google Scholar] [CrossRef]
  7. Jenrette, J.; Liu, Z.C.; Chimote, P.; Hastie, T.; Fox, E.; Ferretti, F. Shark detection and classification with machine learning. Ecol. Inform. 2022, 69, 101673. [Google Scholar] [CrossRef]
  8. Villon, S.; Iovan, C.; Mangeas, M.; Vigliola, L. Toward an artificial intelligence-assisted counting of sharks on baited video. Ecol. Inform. 2024, 80, 102499. [Google Scholar] [CrossRef]
  9. Clark, J.; Lalgudi, C.; Leone, M.; Meribe, J.; Madrigal-Mora, S.; Espinoza, M. Deep Learning for Automated Shark Detection and Biometrics Without Keypoints. In Computer Vision—ECCV 2024 Workshops; Springer: Cham, Switzerland, 2025; pp. 105–120. [Google Scholar] [CrossRef]
  10. Purcell, C.; Walsh, A.; Colefax, A.; Butcher, P. Assessing the ability of deep learning techniques to perform real-time identification of shark species in live streaming video from drones. Front. Mar. Sci. 2022, 9, 981897. [Google Scholar] [CrossRef]
  11. Gómez-Vargas, N.; Alonso-Fernández, A.; Blanquero, R.; Antelo, L.T. Re-identification of fish individuals of undulate skate via deep learning within a few-shot context. Ecol. Inform. 2023, 75, 102036. [Google Scholar] [CrossRef]
  12. Garcia-D’Urso, N.E.; Galan-Cuenca, A.; Climent-Pérez, P.; Saval-Calvo, M.; Azorin-Lopez, J.; Fuster-Guillo, A. Efficient instance segmentation using deep learning for species identification in fish markets. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; pp. 1–8. [Google Scholar] [CrossRef]
  13. Climent-Perez, P.; Galán-Cuenca, A.; Garcia-d’Urso, N.E.; Saval-Calvo, M.; Azorin-Lopez, J.; Fuster-Guillo, A. Simultaneous, vision-based fish instance segmentation, species classification and size regression. PeerJ Comput. Sci. 2024, 10, e1770. [Google Scholar] [CrossRef] [PubMed]
  14. Lu, J.; Song, Z.; Zhao, S.; Li, D.; Zhao, R. A Metric-Based Few-Shot Learning Method for Fish Species Identification with Limited Samples. Animals 2024, 14, 755. [Google Scholar] [CrossRef] [PubMed]
  15. Jerez-Tallón, M.; Beviá-Ballesteros, I.; Garcia-D’Urso, N.; Toledo-Guedes, K.; Azorín-López, J.; Fuster-Guilló, A. Comparative Study of Deep Learning Approaches for Fish Origin Classification. In Advances in Computational Intelligence. IWANN 2025; Rojas, I., Joya, G., Catala, A., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 16008, pp. 66–78. [Google Scholar] [CrossRef]
  16. Menon, S.; Vondrick, C. Visual Classification via Description from Large Language Models. arXiv 2022, arXiv:2210.07183. [Google Scholar] [CrossRef]
  17. Zhang, R.; Zhang, W.; Fang, R.; Gao, P.; Li, K.; Dai, J.; Qiao, Y.; Li, H. Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 493–510. [Google Scholar]
  18. Guo, Z.; Zhang, R.; Qiu, L.; Ma, X.; Miao, X.; He, X.; Cui, B. CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention. arXiv 2022, arXiv:2209.14169. [Google Scholar] [CrossRef]
  19. Zhuang, J.; Hu, J.; Mu, L.; Hu, R.; Liang, X.; Ye, J.; Hu, H. FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance. In Computer Vision—ECCV 2024; Springer: Cham, Switzerland, 2024; pp. 236–253. [Google Scholar] [CrossRef]
  20. Li, Y.; Liang, F.; Zhao, L.; Cui, Y.; Ouyang, W.; Shao, J.; Yu, F.; Yan, J. Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. arXiv 2022, arXiv:2110.05208. [Google Scholar]
  21. Yuksekgonul, M.; Bianchi, F.; Kalluri, P.; Jurafsky, D.; Zou, J. When and why vision-language models behave like bags-of-words, and what to do about it? arXiv 2023, arXiv:2210.01936. [Google Scholar]
  22. Praveena, K.; Anandhi, R.; Gupta, S.; Jain, A.; Kumar, A.; Saud, A.M. Application of Zero-Shot Learning in Computer Vision for Biodiversity Conservation through Species Identification and Tracking. In Proceedings of the 2024 International Conference on Trends in Quantum Computing and Emerging Business Technologies, Pune, India, 22–23 March 2024; pp. 1–6. [Google Scholar]
  23. Lampert, C.H.; Nickisch, H.; Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 951–958. [Google Scholar] [CrossRef]
  24. Xian, Y.; Lampert, C.H.; Schiele, B.; Akata, Z. Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly. arXiv 2020, arXiv:1707.00600. [Google Scholar] [CrossRef] [PubMed]
  25. Rodríguez, A.C.; D’Aronco, S.; Daudt, R.C.; Wegner, J.D.; Schindler, K. Recognition of Unseen Bird Species by Learning from Field Guides. In Proceedings of the WACV, Waikoloa, HI, USA, 3–8 January 2024; pp. 1742–1751. [Google Scholar]
  26. Stevens, S.; Wu, J.; Thompson, M.J.; Campolongo, E.G.; Song, C.H.; Carlyn, D.E.; Dong, L.; Dahdul, W.M.; Stewart, C.; Berger-Wolf, T.; et al. BioCLIP: A Vision Foundation Model for the Tree of Life. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19412–19424. [Google Scholar] [CrossRef]
  27. Muravyov, S.V.; Nguyen, D.C. Automatic Segmentation by the Method of Interval Fusion with Preference Aggregation When Recognizing Weld Defects. Russ. J. Nondestruct. Test. 2023, 59, 1280–1290. [Google Scholar] [CrossRef]
  28. Muravyov, S.V.; Nguyen, D.C. Method of Interval Fusion with Preference Aggregation in Brightness Thresholds Selection for Automatic Weld Surface Defects Recognition. Measurement 2024, 236, 114969. [Google Scholar] [CrossRef]
  29. Snell, J.; Swersky, K.; Zemel, R.S. Prototypical Networks for Few-shot Learning. arXiv 2017, arXiv:1703.05175. [Google Scholar] [PubMed]
  30. Sui, D.; Chen, Y.; Mao, B.; Qiu, D.; Liu, K.; Zhao, J. Knowledge Guided Metric Learning for Few-Shot Text Classification. arXiv 2020, arXiv:2004.01907. [Google Scholar] [CrossRef]
  31. Chen, X.; Zhang, N.; Xie, X.; Deng, S.; Yao, Y.; Tan, C.; Huang, F.; Si, L.; Chen, H. KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; ACM: New York, NY, USA, 2022; pp. 2778–2788. [Google Scholar] [CrossRef]
  32. Moloch. iNaturalist Observation: Prionace Glauca (Blue Shark). 2012. Available online: https://www.inaturalist.org/observations/8641690 (accessed on 1 October 2025).
  33. Arroyo, E.; Canales Cáceres, R.M.; Abel, I.; Giménez-Casalduero, F. Tiburones y Rayas de la Región de Murcia; Proyecto TIBURCIA, Fondo Europeo Marítimo y de Pesca: Alicante, Spain, 2021. [Google Scholar]
Figure 1. Prompt extraction and validation flow. Expert descriptions are expanded with GPT-4 to generate candidate prompts, which are encoded and compared with visual prototypes obtained as the mean of schematic illustrations. Each mean visual embedding is labeled positive (+) if it matches the target class or negative (−) if it belongs to the comparison set. Intra-class validity ensures that each description represents its category and distances from negatives by weighting inter-class similarity, while inter-class validation enforces sufficient separation across categories via cross-validation. The result is a discriminative and coherent prompt selected from the pool for each category.
Figure 2. Prototype-guided attention flow. Schematic illustrations are encoded to obtain mean prototypes that highlight the characteristic traits of each category, which are compared with patch-level embeddings of real images to generate attention maps emphasizing relevant regions. The resulting weights are integrated into the last-layer attention mechanism of CLIP, producing refined visual embeddings aligned with the optimized prompts.
Figure 3. Comparison between real images (left) [32] and schematic illustrations (right) [33] of elasmobranch species.
Figure 4. Example of a squared weight vector represented as an attention map over image patches, emphasizing distinctive characteristics of the individuals.
Figure 5. Balanced accuracy (left) and accuracy (right) results with different configurations of α as a bias multiplier and modified layers in the CLIP architecture.
Figure 6. Species-level confusion matrices for the P-TS+B and P-TS+SB methods. Classes are ordered by visual similarity, emphasizing fine-grained confusions and improvements in challenging categories.
Table 1. Dataset distribution with scientific and common names.
Order | Family | Species | Common Name | Nº Img
Squaliformes | Oxynotidae | Oxynotus centrina | Angular roughshark | 36
Carcharhiniformes | Triakidae | Mustelus mustelus | Smooth-hound | 76
Carcharhiniformes | Triakidae | Galeorhinus galeus | Tope shark | 38
Carcharhiniformes | Scyliorhinidae | Scyliorhinus canicula | Small-spotted catshark | 121
Carcharhiniformes | Scyliorhinidae | Galeus melastomus | Blackmouth catshark | 37
Torpediniformes | Torpedinidae | Torpedo marmorata | Spotted torpedo | 50
Rajiformes | Rajidae | Raja undulata | Undulate ray | 90
Table 2. Performance of CLIP using different name descriptors and taxonomic levels. Species-level results are reported with both scientific names (SC) and common names (CN). The reported metrics correspond to balanced accuracy, accuracy and macro-averaged precision, recall, and F1-score.
Level | BAcc. | Acc. | Prec. | Rec. | F1
Order | 0.37 | 0.70 | 0.60 | 0.37 | 0.36
Family | 0.18 | 0.18 | 0.17 | 0.18 | 0.13
Species (SC) | 0.13 | 0.15 | 0.06 | 0.14 | 0.06
Species (CN) | 0.37 | 0.51 | 0.16 | 0.13 | 0.12
Table 3. Comparison of the proposed classification strategies across different taxonomic levels. The evaluated methods are: PA (Prompt-based, illustration-guided), P-TC (Prompt with taxonomy-aware cumulative classification), P-TS (Prompt with taxonomy-aware sequential classification), P-TS+B (Prompt with taxonomy-aware sequential classification combined with CLIP attention bias across all levels) and P-TS+SB (Prompt with taxonomy-aware sequential classification combined with CLIP attention bias applied only at the family and species level).
Method | BAcc. | Acc. | Prec. | Rec. | F1
Order
PA | 0.79 | 0.79 | 0.71 | 0.79 | 0.72
P-TC | 0.83 | 0.82 | 0.74 | 0.83 | 0.76
P-TS | 0.83 | 0.82 | 0.74 | 0.83 | 0.76
P-TS+B | 0.82 | 0.82 | 0.74 | 0.82 | 0.76
P-TS+SB | 0.83 | 0.82 | 0.74 | 0.83 | 0.76
Family
PA | 0.70 | 0.74 | 0.70 | 0.70 | 0.69
P-TC | 0.78 | 0.79 | 0.75 | 0.78 | 0.76
P-TS | 0.78 | 0.75 | 0.73 | 0.78 | 0.73
P-TS+B | 0.78 | 0.77 | 0.74 | 0.78 | 0.74
P-TS+SB | 0.79 | 0.77 | 0.74 | 0.79 | 0.74
Species
PA | 0.51 | 0.50 | 0.60 | 0.51 | 0.48
P-TC | 0.61 | 0.58 | 0.63 | 0.61 | 0.57
P-TS | 0.62 | 0.60 | 0.63 | 0.62 | 0.58
P-TS+B | 0.64 | 0.63 | 0.63 | 0.64 | 0.60
P-TS+SB | 0.64 | 0.63 | 0.63 | 0.64 | 0.61
Table 4. Selected prompts by category for the general/cumulative strategy.
Level | Category | Selected Prompt
Order | Carcharhiniformes | Shark slim elongated figure and white-edged oval spots
Order | Squaliformes | Shark with a fat triangular body, stout compressed silhouette
Order | Torpediniformes | Stingray rounded silhouette, brown shade covered in patches
Order | Rajiformes | Stingray diamond-shaped figure, slender elongated tail tip, brown-gray body with dark banding, prominent circular mark in each fin, pale underside, pelvic fins gray
Family | Scyliorhinidae | Shark slim elongated shape, light brown shade, covered with small round black speckles
Family | Triakidae | Shark pointed body and light gray color
Family | Oxynotidae | Shark black body color and triangular shape
Family | Rajidae | Stingray rhomboid figure, large black rounded patch on each fin
Family | Torpedinidae | Stingray rounded silhouette, brown shade covered in patches
Species | Galeus melastomus | Shark reddish brown coloration with oval spots encircled
Species | Galeorhinus galeus | Shark with long snout, second dorsal fin tiny like anal, slender elongated figure, deep caudal notch, light gray body, white belly
Species | Oxynotus centrina | Shark with a fat triangular body, stout compressed silhouette
Species | Mustelus mustelus | Shark slim elongated body and gray to brown tone
Species | Scyliorhinus canicula | Shark slim elongated figure, brown shade with tiny circular black speckles
Species | Raja undulata | Stingray rhomboid figure, long thin tail, gray-brown body with narrow and broad bands, big round spot located in each fin center, underside white, gray pelvic fins
Species | Torpedo marmorata | Stingray rounded silhouette, brown shade covered in patches
Table 5. Selected prompts by category for the sequential strategy.
Level | Category | Selected Prompt
Order (Sharks) | Carcharhiniformes | Shark slim elongated figure and white-edged oval spots
Order (Sharks) | Squaliformes | Shark with a fat triangular body, stout compressed silhouette
Family (Carcharhiniformes) | Scyliorhinidae | Shark elongated narrow silhouette, light brown tone patterned with round black speckles
Family (Carcharhiniformes) | Triakidae | Shark slim narrow form and gray coloration
Species (Scyliorhinidae) | Galeus melastomus | Shark elongated slender body, long low anal fin, narrow caudal fin ridged, gray or brownish red coloration with oval markings bordered in white
Species (Scyliorhinidae) | Scyliorhinus canicula | Shark slender long body and dotted with tiny black speckles
Species (Triakidae) | Mustelus mustelus | Shark elongated slender build, pointed profile, small mouth, uniform light gray or brown
Species (Triakidae) | Galeorhinus galeus | Shark narrow elongated outline, long snout, second dorsal fin small like anal, deep notched caudal fin, light gray body, white belly
Order (Rays) | Torpediniformes | Stingray rounded silhouette, brown shade covered in patches
Order (Rays) | Rajiformes | Stingray diamond-shaped figure, slender elongated tail tip, brown-gray body with dark banding, prominent circular mark in each fin, pale underside, pelvic fins gray
Table 6. Comparative performance between CLIP, CALIP, and the proposed Prototype-guided Cross-Attention (PGCA) method at different taxonomic levels. Metrics are reported as Accuracy/Balanced Accuracy.
Taxonomic Level | CLIP (Zero-Shot) | CALIP | PGCA (Ours)
Order | 0.77/0.76 | 0.77/0.77 | 0.82/0.81
Family | 0.71/0.67 | 0.75/0.70 | 0.79/0.80
Species | 0.50/0.50 | 0.52/0.53 | 0.59/0.62