Article

Automatic Identification and Description of Jewelry Through Computer Vision and Neural Networks for Translators and Interpreters

by José Manuel Alcalde-Llergo 1,2,*, Aurora Ruiz-Mezcua 3, Rocío Ávila-Ramírez 3, Andrea Zingoni 2, Juri Taborri 2 and Enrique Yeguas-Bolívar 1,*
1 Department of Computer Science, University of Córdoba, 14071 Córdoba, Spain
2 Economics, Engineering, Society and Business Organization, University of Tuscia, 01100 Viterbo, Italy
3 Department of Social Sciences, Translation and Interpreting, University of Córdoba, 14071 Córdoba, Spain
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5538; https://doi.org/10.3390/app15105538
Submission received: 4 April 2025 / Revised: 10 May 2025 / Accepted: 13 May 2025 / Published: 15 May 2025
(This article belongs to the Special Issue Image Processing and Computer Vision Applications)

Abstract

Identifying jewelry pieces presents a significant challenge due to the wide range of styles and designs. Currently, precise descriptions are typically limited to industry experts. However, translators and interpreters often require a comprehensive understanding of these items. In this study, we introduce an innovative approach to automatically identify and describe jewelry using neural networks. This method enables translators and interpreters to quickly access accurate information, aiding in resolving queries and gaining essential knowledge about jewelry. Our model operates at three distinct levels of description, employing computer vision techniques and image captioning to emulate expert analysis of accessories. The key innovation involves generating natural language descriptions of jewelry across three hierarchical levels, capturing nuanced details of each piece. Different image captioning architectures are utilized to detect jewels in images and generate descriptions with varying levels of detail. To demonstrate the effectiveness of our approach in recognizing diverse types of jewelry, we assembled a comprehensive database of accessory images. The evaluation process involved comparing various image captioning architectures, focusing particularly on the encoder–decoder model, crucial for generating descriptive captions. After thorough evaluation, our final model achieved a captioning accuracy exceeding 90%.

1. Introduction

Image captioning presents a challenging task in the realms of computer vision and natural language processing, requiring the generation of textual descriptions for given input images [1]. In the field of linguistics, particularly in translation and interpreting, this approach can provide accurate visual representations, precise technical terms, images, meanings, descriptions, and contexts. Translators and interpreters can rely on it to quickly access data about specific terms.
In recent years, deep learning-based approaches have emerged as a promising solution to the image captioning problem, harnessing the capabilities of deep neural networks to learn intricate mappings between images and text [2]. Typically, an encoder–decoder structure is employed for this task, with neural networks serving as encoders to process the input image and extract high-level feature representations. Furthermore, neural networks are commonly utilized as decoders, generating a sequence of words describing the image based on the features extracted by the encoder. The synergy between these two networks has significantly enhanced the performance of image captioning systems, approaching human-level performance on certain benchmark datasets [3].
An artificial intelligence (AI) tool capable of automatically identifying and describing jewelry can emulate human intuitive behavior, fostering increased connections between individuals from diverse cultures and customs. This includes overcoming language and cultural barriers by providing accurate and detailed descriptions of jewelry in various languages. Such an AI tool can encourage curiosity and cultural exchange by offering information about the history, symbolism, and craftsmanship of jewelry, sparking meaningful conversations and dialogues between individuals of different backgrounds. Additionally, it can promote appreciation and respect for diversity, fostering tolerance, inclusivity, and harmony among individuals and communities with different traditions and customs. Moreover, automatic jewelry recognition has practical applications in cultural commerce and tourism, assisting buyers and collectors in making informed decisions and engaging in fair trade of cultural jewelry, as well as attracting tourists interested in exploring and acquiring traditional jewelry during their travels.
In this study, various image captioning encoder–decoder structures are assessed using a dataset comprising images of jewelry accessories. The dataset was created by collecting images of diverse accessories from two well-known online jewelry stores in the city of Córdoba, Spain. Córdoba is celebrated for its artisanal craftsmanship, particularly in the creation of high-quality silver jewelry. The city hosts notable jewelry fairs and events that exhibit distinctive designs, attracting both local and international buyers.
The utility of AI in linguistics becomes evident in unraveling the intricate linguistic landscape of jewelry terminology. AI-powered linguistic analysis can efficiently process vast datasets of historical documents, dowry letters, and cultural references related to jewelry. It helps in identifying patterns, tracing linguistic evolution, and discerning the cultural contexts from which specific terms and expressions emerged.
Given the limited application of image captioning within this specific field of study, the task is intricate and notably captivating. The ultimate goal of the project is to generate detailed captions for accessories featured in the input images, highlighting characteristics such as their type (earrings, necklace, bracelet, etc.), material, color, and the specific jewelry they showcase. The images encompass individuals wearing the accessories, suggesting the potential application of the system to automatically generate descriptions for accessories worn by individuals in images sourced from fashion magazines, jewelry catalogs, and jewelry trade shows, among other sources. Additionally, the methodology enables the creation of simpler descriptions, indicating only the type of accessory, to facilitate a jewelry multiclass classification task.
In the following sections, we navigate through key aspects of linguistic modeling, automatic jewelry recognition, and AI-based description generation. Section 2 conducts a review of related work, highlighting prior research contributions. Section 3 delves into the linguistic levels and terminology specific to jewelry descriptions and explains the database and architectures supporting our approach. Section 4 presents the experimental design and results, showcasing the effectiveness of our linguistic models, automatic recognition techniques, and AI-generated descriptions. Finally, in Section 5, we draw conclusions and outline future research directions.

2. Related Work

In this section, we conduct a thorough review of previous works that have significantly influenced and provided reference points for the current study. Our examination revealed three main categories of research relevant to our investigation: the analysis of the impact of AI on linguistics, the application of artificial vision techniques for jewelry analysis, and neural networks for image captioning. Furthermore, we specifically searched for prior work applying image captioning techniques to jewelry; however, no research addressing this exact subject was identified.

2.1. AI and Applied Linguistics

The confluence of AI and linguistics has cultivated a fertile ground for tackling specific challenges in the identification and description of jewelry in images. The capability of these integrated approaches to comprehend both visual content and natural language offers innovative solutions.
A pivotal theory influencing this domain is the hierarchy of grammatical complexity proposed in [4,5]. Their approach underscores that communication through natural language fundamentally involves mapping signals to meanings. The application of this theory extends beyond natural language and aligns with the goal of understanding and describing jewelry images.
Considering image processing, Xu et al. [6] explore the connection between computer vision and natural language processing. The work emphasizes synergies between both fields for joint applications, such as generating descriptions of images. Chen et al. [7] take this approach a step further by introducing visual semantic role labeling, an advanced technique for image description generation, essential for detailed jewelry identification.
To undertake language modeling, a fundamental step for subsequent AI applications, studies such as those by [8] emphasize the need for robust language models.
The distinction between translating from natural language into a formal language and reasoning about that formal language was identified early on in [9,10].
However, a longstanding goal of AI has been to build intelligent agents that can function with the linguistic dexterity of people, which involves such diverse capabilities as participating in a fluent conversation, using dialog to support task-oriented collaboration, and engaging in lifelong learning through processing speech and text [11].
These studies highlight the necessity of an interdisciplinary perspective, where linguistics and AI intertwine to advance jewelry image identification. They demonstrate how these converging disciplines can catalyze significant innovations in this domain.

2.2. Computer Vision for Jewelry Analysis

Jewelry has played a pivotal role in human culture for millennia, and the capacity to distinguish various types of accessories holds significance across diverse fields, including security, e-commerce, and the analysis of social preferences. Recent years have witnessed substantial progress in computer vision models, particularly in the domains of object recognition and detection. Numerous models now exist that can proficiently discern a wide array of objects, ranging from animals and vehicles to accessories. An exemplar is DeepBE, introduced in [12], which adeptly categorizes accessories worn above the shoulders, such as necklaces, earrings, and glasses.
Despite these advancements, current methods for accessory recognition, like the aforementioned DeepBE, still exhibit certain limitations compared to human expertise. While these models effectively identify the presence of jewelry and other accessories in an image, they fall short in providing intricate details about the specific type of jewelry or its quality. Consequently, there arises a demand for more specialized models that can precisely and comprehensively recognize and classify jewelry and other accessories based on their unique characteristics.
Conversely, certain studies have been dedicated to crafting specialized models tailored to appraising the quality of specific types of jewelry. Notably, the research outlined in [13] concentrates on the quality assessment of pearls based on images, employing computer vision techniques to identify and scrutinize the distinct visual characteristics of these items. The proposed methodology involves tracing light rays and scrutinizing highlight patterns to estimate the level of specularity within the input image of the pearl. These data are then employed to ascertain the equivalent index of appearance, serving as a metric for appraising the object’s quality. Through the application of these techniques, the approach comprehensively evaluates the visual appearance of the object, offering valuable insights into its quality and potential industrial applications.

2.3. Neural Networks for Image Captioning

Describing the content of an image was a task that just a few years ago could only be accomplished by humans. However, advancements in computer vision have led to the creation of new systems capable of replicating this human behavior with impressive performance.
The pioneering steps in this field are documented in works like [1], where the authors developed a probabilistic framework for generating descriptions from images. To achieve this goal, they devised an encoder–decoder structure employing neural networks.
The primary function of the encoder is to process the input image, extracting meaningful features and encoding them into a format that can be easily understood by the neural network. In simpler terms, the encoder transforms the raw visual information of the image into a more abstract representation, highlighting important features. The decoder takes the encoded information from the encoder and generates a coherent and descriptive sequence. In the case of image captioning, this sequence comprises the natural language description of the image. The decoder essentially translates the encoded features back into human-readable language, allowing the system to generate meaningful and contextually relevant captions for the input images.
Neural networks come in various types, each designed for specific tasks. Two prominent types are Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
CNNs are particularly well suited for image-related tasks. They utilize layers of convolutional operations to automatically extract hierarchical features from input images. CNNs are effective in recognizing spatial patterns, making them commonly used in image classification, object detection, and image segmentation tasks.
On the other hand, RNNs are designed to handle sequential data and are well suited for tasks involving a sequence of inputs, such as time-series data or natural language. RNNs have memory units that allow them to capture dependencies and relationships within sequential data. This makes them suitable for tasks like language modeling, speech recognition, and sequence generation.
Therefore, a good approach to addressing the problem of caption generation is the development of an encoder–decoder structure [14]. In this approach, a CNN is employed as the encoder, and an RNN serves as the decoder. This particular architecture has laid the groundwork for image captioning. In this process, the input image undergoes processing through the CNN to encode and extract features, which are then embedded into a fixed-length vector. This fixed-length vector, also known as an “embedding”, succinctly represents the most important features of the image based on the encoding performed by the CNN. This vector representation is subsequently used in the process of generating descriptions (captions) for the image, where relevant information has been condensed into a structured and manageable format for the neural network. Choosing a fixed-length vector facilitates consistency and efficiency in processing information within the network.
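To make the notion of a fixed-length embedding concrete, the following minimal sketch encodes a single image into such a vector with a pretrained VGG-16. It assumes TensorFlow/Keras, which is not necessarily the toolkit used in the works discussed here, and the image path is a placeholder.

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from tensorflow.keras.preprocessing import image

# Pretrained VGG-16 without its classification head; global average pooling
# turns the final convolutional feature maps into one fixed-length vector.
encoder = VGG16(weights="imagenet", include_top=False, pooling="avg")

def encode_image(path: str) -> np.ndarray:
    """Return a fixed-length embedding (512 values for VGG-16) for one image."""
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return encoder.predict(x, verbose=0)[0]  # shape: (512,)

embedding = encode_image("example_jewel.jpg")  # placeholder file name
```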
After the image has been processed through the CNN to encode features, the generation of final descriptions involves appending the RNN to the last hidden layer of the CNN. The term “last layer” in this context refers to the final or output layer of the CNN. In neural networks, layers are stacked, and the last layer typically produces the final representation or output. In image captioning, the information encoded by the CNN is transferred to the RNN at this stage to further refine and generate coherent textual descriptions.
The choice of utilizing a Long Short-Term Memory (LSTM) network as the RNN is significant. LSTM is a type of recurrent neural network designed to capture and retain long-term dependencies in sequential data. In the context of image captioning, LSTM plays a key role in understanding the sequential nature of language, allowing the model to generate descriptions that are not only contextually relevant but also exhibit a coherent flow. Its effectiveness lies in its ability to handle dependencies over longer sequences, making it well suited for tasks involving the generation of sentences or captions.
Following the aforementioned study, several works emerged that incorporated innovative mechanisms to improve on its results. Particularly noteworthy is a proposal outlined in [6], which introduced the concept of attention to the previously employed encoder–decoder architecture. In this scenario, diverse words generated by the LSTM were directed toward different sections of the input image. The outcomes demonstrated that their model produced highly accurate words that directly corresponded to the specific areas of the image that the LSTM focused on at each moment. As a result, a comprehensive and precise description of the image was achieved in the majority of cases.
Other advancements in image captioning have explored alternative approaches, such as generative adversarial networks (GANs). A representative example is CgT-GAN, which combines adversarial training with CLIP-based semantic rewards to generate captions without relying on paired human annotations [15]. These methods can yield more diverse and creative outputs, especially in open-domain settings. However, GAN-based captioning models often suffer from training instability, require large computational resources, and may lack linguistic control, making them less suitable for tasks involving predefined syntactic structures like those used in our multilevel jewelry description framework [16].
Another line of research explores reinforcement learning (RL) techniques to enhance image captioning models. For instance, Ref. [17] proposes a strategy that integrates ground-truth captions into a CLIP-guided RL framework. This approach aims to generate more distinctive and fluent captions by leveraging the similarity between generated captions and corresponding images. Similarly, in [18], the authors introduce an RL approach that combines a policy network, responsible for generating captions, with a value network that estimates the expected reward of a generated sequence. This architecture is optimized using sequence-level feedback, allowing the model to refine caption quality based on global evaluation metrics rather than word-by-word accuracy. While RL-based models have demonstrated improvements in generating diverse and contextually relevant captions, they often require extensive training data, careful reward design, and complex tuning processes [19].

3. Materials and Methods

This section outlines the methodological framework adopted to develop a system capable of generating descriptive captions for jewelry images. The proposed approach integrates insights from linguistics and computer vision to address the complexity of jewelry representation. Throughout the following subsections, we describe the linguistic foundations that support the structuring of jewelry descriptions, the creation and preprocessing of the image dataset, and the experimental setup used to train and evaluate the captioning models.

3.1. Levels and Terminology for the Description of Jewelry

Our primary objective is to create a concise and cohesive summary of jewelry images. However, grasping the high-level informational content of these images poses challenges. Initially, a jewelry image conveys a wealth of information, making it impractical to provide all details textually. Secondly, the choice of a jewelry image as a communication medium allows viewers to extract information at various levels. A cursory glance may convey the intended message and salient features, while a more in-depth examination offers further insights.
To address the task of identifying content for jewelry image summaries, we leverage insights gained from an experiment where online jewelry catalogs were used to generate summaries for various images with a shared high-level intention. We follow a similar approach to that employed for graphic descriptions in [20]. The key finding was that the intended message of a jewelry image was consistently conveyed in all summaries regardless of visual features. Online catalogs, emulating human behavior, enhanced the intended message with salient features, depending on the image’s purpose. Similar summaries for a specific jewelry image led us to hypothesize shared perceptions of salient features. Although the set of salient features may be consistent for images with the same underlying intention, differences in summaries arise from the saliency of these features. The exclusion of exhaustive details in summaries aligns with Grice’s Maxim of Quantity [21], emphasizing informative contributions relevant to the current exchange.
To identify specific content in jewelry images, we focus on a set of propositions capturing meaningful information. This set encompasses a variety of details present in the images, including propositions common to all images and propositions specific to certain message categories. Some of these propositions include the following: labels for all jewelry items, type of jewelry, materials used in the jewelry, relationship between the materials, colors present in the jewelry, style or design of the jewelry, specific details like engraved patterns, size, or relative dimensions of the jewelry, and overall impression or aesthetic, including the addition of descriptive adjectives. These propositions cover a spectrum of information, allowing for a comprehensive and nuanced summary of jewelry images.
From a linguistic perspective, the justification for developing linguistic models based on different theories that propose levels of descriptions [22,23,24,25] for jewelry is rooted in the need to capture and express the semantic diversity and inherent complexity of jewelry pieces [26,27]. Each level of description has its own linguistic characteristics that enable effective communication of specific information about jewelry. Here, we provide the descriptions considered in our approach and specific justifications for each level of description:
  • Basic description: noun + noun.
    - Communicative efficiency: This simple format focuses on the essence of the jewelry by combining two key nouns. It provides concise and direct information about the type of jewelry present. In contexts where brevity is essential, this format allows for quick identification of the jewelry type.
    - Clarity and simplicity: This is ideal for quickly identifying the type of jewelry without adding unnecessary details. The straightforward structure enhances clarity, making it suitable for scenarios where a quick overview is required.
  • Normal description: adjective + noun + adjective + noun.
    - Semantic expansion: The inclusion of adjectives allows for a more detailed and nuanced description. It enables conveying specific features such as color, material, or design. This format provides a broader semantic scope, accommodating a richer set of details.
    - Aesthetic appeal: By adding adjectives, the beauty and visual attributes of the jewelry are highlighted. This can be relevant in contexts where aesthetics plays a significant role, such as art exhibitions or design showcases.
  • Complete description: superlative adjectives + noun + complement.
    - Comprehensiveness and specificity: The inclusion of superlative adjectives indicates the exceptional quality of the jewelry, offering a complete and detailed description. This format is essential for describing unique or high-value pieces, providing comprehensive insights.
    - Differentiation and valuation: This allows highlighting the unique features that make the jewelry stand out among others. It is especially valuable in contexts where exclusivity is sought, such as luxury markets or bespoke jewelry presentations.
The development of linguistic models that encompass these types of descriptions allows for a more comprehensive and accurate representation of jewelry. Such models cater to the variety of information that may be relevant in different situations and contexts, ensuring that linguistic descriptions effectively convey the richness and uniqueness of each jewelry piece.
Using this approach, we propose some examples of descriptions at different complexity levels after recognizing the jewelry pieces (see Table 1). The presence of superlative adjectives is always optional, and a complete description can exist without them. For example, “Earrings in sustainable yellow gold adorned with exquisite, brilliant-cut diamonds and featuring a secure push-back clasp” is equivalent to “Earrings in yellow gold with brilliant-cut diamonds and a push-back clasp”.
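As an illustration of how these three levels could be realized in practice, the sketch below assembles descriptions from a structured set of attributes. The attribute fields (kind, material, stone, and so on) are hypothetical and do not reproduce the actual templates or database used in this work.

```python
# Illustrative only: the attribute names below are hypothetical, not the paper's schema.
def basic(piece: dict) -> str:
    # noun + noun
    return f"A {piece['kind']} with a {piece['feature']}"

def normal(piece: dict) -> str:
    # adjective + noun + adjective + noun
    return f"A {piece['material']} {piece['kind']} with a {piece['stone']} {piece['feature']}"

def complete(piece: dict) -> str:
    # superlative adjectives + noun + complement (superlatives remain optional)
    return (f"This {piece['superlative']} {piece['material']} {piece['kind']} showcases "
            f"a {piece['adjective']} {piece['stone']} {piece['feature']}, {piece['complement']}")

piece = {"kind": "necklace", "feature": "pendant", "material": "gold", "stone": "sapphire",
         "superlative": "exquisite", "adjective": "beautiful",
         "complement": "adding a touch of elegance and sophistication to the accessory"}

print(basic(piece))     # A necklace with a pendant
print(normal(piece))    # A gold necklace with a sapphire pendant
print(complete(piece))  # the complete-level sentence used as an example in Section 4
```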
Ultimately, systematizing jewelry terms and determining their role, function, and semantic meaning would help AI offer high-quality translations. In the transition to a modern digital system, it is relevant to provide background knowledge that contributes to understanding the meaning of concepts and to shift towards new electronic forms of creating dictionaries, glossaries, and thesauri for high-quality translation [28].

3.2. Fundamentals of Jewelry Terminology

In all languages, many idiosyncrasies and cultural references can be found, especially when working with specialized texts and speech. Each jargon is rooted in the culture in which it was originally used or created. This is especially the case when a sector is prominent in a society. For instance, wine vocabulary is extensive and rich in most Mediterranean countries but not as much in regions like Alaska, where the weather and habits are completely different from those of wine areas. In the case of jewelry, we can find a wide range of specific terminology that, at times, is particularly difficult to translate, for example, from Spanish (Spain being a traditional power in the field) into other languages.
Jewelry-related terminology is wide and varied, ranging from the numerous nouns naming each piece of jewelry to a vast number of adjectives describing its shape, cut, weight (carat), color, or materials. It can therefore be said that the descriptions of jewelry items are very specific, exhaustive, and dominated by the use of adjectives. There are also different varieties and terms depending on geographical areas and idiolects, as well as cultural implications that underlie jewelry terminology. Nevertheless, there are not many reliable resources from a linguistic, lexicographical, or translational point of view. Most works focus either on chemistry, especially composition, typology, and material alloys [29,30], or on commerce and marketing [31,32,33,34,35]. Considering the growing demand for internationalization in this sector, there is a strong need to create resources that would help professionals standardize and use accurate jewelry terminology.
In the context of jewelry, the terminology associated with descriptive words is rich and diverse. Similar to languages that lack a distinct class of adjectives, the descriptions of jewelry items often involve the use of nouns and verbs to convey specific characteristics. Additionally, the concept of gender or noun class, which is common in language, is mirrored in jewelry terminology. For instance, certain jewelry pieces may be classified into categories based on gender, such as masculine or feminine designs, influencing how they are described and categorized. This linguistic parallel illustrates the intricate and nuanced nature of expressing the features of jewelry items.
In the scope of linguistic descriptions, clarity tends to follow a simple principle: the more precise the description, the more evident its meaning becomes, provided that readers possess a functional knowledge of the language. This characteristic holds true for various scientific fields, where intricate issues are articulated in detailed descriptions.
In the context of this research, AI plays a key role in object recognition. The computer vision system identifies objects, and linguistically, we assign a noun to the object along with a brief description. Essentially, jewelry-specific terminology and adjectives are utilized in this linguistic process [36].
The catalog or vocabulary list employed in our approach could provide a broad array of possibilities, but the items generally fall into larger categories: fobs, lockets, necklaces, bracelets, armlets, finger rings, watches, buttons, clips, cuff links, pendants, pins, tie clasps, anklets, toe rings, ear ornaments, hair ornaments, and nose ornaments. Furthermore, each group encompasses various items.
The size and shape of each item should be supplemented with details about the material it is made of, the color, and other specifications such as weight and care. Strictly speaking, there are only seven precious stones: pearl, diamond, ruby, emerald, alexandrite, sapphire, and oriental cat's-eye. In contrast, the so-called semi-precious stones include amethyst, topaz, tourmaline, aquamarine, chrysoprase, peridot, opal, zircon, and jade.
Information about the hardness, color, and sources of supply of a specific gemstone can be found in various places, and the availability of information may depend on the type of gem and its supplier. Some common places where you might find this information are as follows: gem and diamond certificates, product descriptions in jewelry stores, websites of jewelry suppliers and designers, and books and online resources on gemology.
In our case, we relied on a set of catalogs provided by the Parque Joyero de Córdoba (https://parquejoyero.es/en/home (accessed on 12 May 2025)). Using these catalogs, we created a relational database encompassing materials, precious and semi-precious stones, commonly used terms and tokens, most common meanings, colors, shapes, and common interrelationships in the creation and presentation of each piece of jewelry.
The database serves as a comprehensive resource, capturing a wide array of information related to the field of jewelry. It includes details about the materials used, the types of precious and semi-precious stones featured, prevalent terminology and expressions, as well as the common meanings associated with various elements. Additionally, the database covers information about colors, shapes, and the typical interconnections found in the crafting and display of each jewelry item. This extensive compilation of data provides a foundation for our linguistic models and aids in the development of a nuanced and contextually rich understanding of jewelry descriptions.
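A minimal sketch of how such a relational structure could be laid out, using SQLite from Python's standard library; the table and column names are hypothetical and do not reproduce the actual database built from the catalogs.

```python
import sqlite3

# Hypothetical schema sketch (in-memory database); names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE material (id INTEGER PRIMARY KEY, name TEXT, color TEXT);
CREATE TABLE stone    (id INTEGER PRIMARY KEY, name TEXT, precious INTEGER, color TEXT);
CREATE TABLE term     (id INTEGER PRIMARY KEY, token TEXT, meaning TEXT);
CREATE TABLE piece    (id INTEGER PRIMARY KEY, kind TEXT, shape TEXT,
                       material_id INTEGER REFERENCES material(id),
                       stone_id    INTEGER REFERENCES stone(id));
""")
conn.commit()
```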

3.3. Creation of the Dataset

The jewelry dataset used in the experiments was specifically crafted for this study by extracting, preparing, and merging images from two online jewelry stores (Baquerizo Joyeros: https://baquerizojoyeros.com (accessed on 12 May 2025) and Doñasol: https://doñasol.com (accessed on 12 May 2025)) in Córdoba (Spain). These catalogs included multiple images of the same jewelry item, captured from different perspectives or highlighting particular details (e.g., side views, clasps, gemstone close-ups). These naturally diverse viewpoints were preserved and included in the dataset, enriching the training data beyond standard augmentation techniques.
Nevertheless, as is common in computer vision challenges, the available quantity of images was not sufficient to train robust models adequately for optimal results. To address this limitation, various data augmentation techniques were implemented to diversify the dataset. The augmentation techniques applied to enhance the jewelry database include the following: 90° rotations, width shift by 30%, length shift by 30%, cuts by 15%, enlargement or reduction by 5%, slight color changes, horizontal and vertical flips, and brightness range by 80%.
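A sketch of how the listed augmentations could be configured, assuming Keras' ImageDataGenerator; the parameter values below are one possible reading of the list above and may differ from those actually used.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=90,            # 90° rotations
    width_shift_range=0.30,       # width shift by 30%
    height_shift_range=0.30,      # length (height) shift by 30%
    shear_range=0.15,             # cuts (shear) by 15%
    zoom_range=0.05,              # enlargement or reduction by 5%
    channel_shift_range=10.0,     # slight color changes (value assumed)
    horizontal_flip=True,         # horizontal flips
    vertical_flip=True,           # vertical flips
    brightness_range=(0.8, 1.2),  # brightness range (one reading of "by 80%")
)
```

A generator configured this way can then be applied to the collected images through its flow or flow_from_directory methods before partitioning the data.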
Following the data augmentation procedures, the merging process resulted in a comprehensive and unified jewelry database comprising a total of 5374 accessory images. For experimentation, this dataset was partitioned into training (75%), validation (15%), and testing (10%) sets. Figure 1 displays some samples from the dataset along with their corresponding captions.

3.4. Experimental Design

The first step in designing the experimentation was to select models that could be relevant for addressing the task of jewelry description. The presence of pretrained neural networks, enriched with extensive datasets, allows us to leverage their capabilities for enhanced results. Consequently, there is no need to build a neural network from the ground up. This approach of repurposing networks is known as Transfer Learning [37]. In the experimentation phase, various combinations of CNN and RNN architectures were explored to identify the optimal encoding and decoding structure for describing the accessory images within the database. Figure 2 provides an overview of the proposed architecture for generating a basic-level jewelry description.
The encoder–decoder structure is selected due to its proven effectiveness in tasks involving short, structured textual outputs, such as image captioning [1], and considering the nature of the available dataset. In our case, the generated textual outputs follow predefined linguistic templates (basic, normal, and complete levels), which significantly reduces the complexity of the sequence modeling task. Furthermore, this architecture offers a favorable trade-off between interpretability, computational efficiency, and performance, especially in low-resource settings where training data and processing capabilities are limited. Although recent Transformer-based architectures [38] have achieved state-of-the-art results in large-scale image captioning tasks, they typically require significantly higher computational resources and larger datasets. For these reasons, the CNN–RNN model remains a solid and accessible solution for domain-specific applications that prioritize structured linguistic output and lightweight deployment.
Three prominent architectures have been explored for CNNs, each with its unique characteristics: VGG-16 [39], InceptionV3 [40], and MobileNet [41].
On one hand, in the field of sequence modeling and predictions, RNNs have been instrumental. However, a common challenge faced by traditional RNNs is short-term memory loss, which hampers their ability to capture long-term dependencies within the data. To overcome this limitation, advanced models like Long Short-Term Memory (LSTM), see [42], and Gated Recurrent Unit (GRU), see [43], have been introduced. Both LSTM and GRU play pivotal roles in enhancing the linguistic finesse of the proposed architectures, making them proficient in describing various jewelry types in images with linguistic sensitivity.
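As a concrete illustration of how a pretrained CNN feature vector can be wired to a GRU decoder in the spirit of Figure 2, the sketch below builds a small captioning model, assuming TensorFlow/Keras; layer sizes, vocabulary size, and the exact wiring are illustrative assumptions rather than the authors' implementation.

```python
from tensorflow.keras import layers, Model

def build_captioning_model(units=256, vocab_size=500, max_len=20,
                           feat_dim=512, optimizer="adam"):
    """Illustrative CNN-feature + GRU captioning model (all sizes are placeholders)."""
    # Fixed-length image embedding produced by the CNN encoder (e.g., VGG-16).
    img_feats = layers.Input(shape=(feat_dim,), name="image_embedding")
    init_state = layers.Dense(units, activation="tanh")(img_feats)

    # Partial caption (word indices) consumed by the recurrent decoder.
    tokens = layers.Input(shape=(max_len,), name="caption_tokens")
    embedded = layers.Embedding(vocab_size, 256, mask_zero=True)(tokens)
    hidden = layers.GRU(units)(embedded, initial_state=init_state)

    # Probability distribution over the next word of the caption.
    next_word = layers.Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[img_feats, tokens], outputs=next_word)
    model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
    return model
```

Swapping the GRU layer for an LSTM layer yields the alternative decoder considered in the experiments.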
In the context of experimental design, considerations extended beyond the choice of CNN architecture and RNN type. Several critical parameters were systematically explored:
  • Number of neurons in the RNN: This parameter affects the model’s capacity to capture complex linguistic patterns. The values considered were 64, 128, 256, 512, and 1024.
  • Batch size: This determines the number of samples processed per training iteration. Tested batch sizes included 4, 8, 32, 128, and 512.
  • Learning rate: This regulates the step size during the optimization process and is essential for effective model convergence. The explored values were 0.0001, 0.001, 0.01, and 0.1.
  • Optimizer: Optimizers influence parameter updates during training, affecting convergence speed and stability. The following optimizers were evaluated: Adam, Adagrad, Adadelta, and RMSProp.
To reduce the risk of overfitting, an early stopping mechanism that monitors the validation loss is implemented. When no improvement is observed over a certain number of epochs, training is halted to prevent the model from fitting non-generalizable patterns in the training data.
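A sketch of how the sweep over the listed hyperparameters and the early stopping mechanism could be organized, reusing the hypothetical build_captioning_model helper from the previous sketch; the patience value and the training and validation arrays (train_feats, train_tokens, train_targets, val_feats, val_tokens, val_targets) are placeholders.

```python
import itertools
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam, Adagrad, Adadelta, RMSprop

neurons        = [64, 128, 256, 512, 1024]
batch_sizes    = [4, 8, 32, 128, 512]
learning_rates = [0.0001, 0.001, 0.01, 0.1]
optimizer_cls  = [Adam, Adagrad, Adadelta, RMSprop]

# Halt training when the validation loss stops improving for several epochs
# (the patience value is a placeholder, not taken from the paper).
early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

best = None
for units, batch, lr, Opt in itertools.product(neurons, batch_sizes,
                                               learning_rates, optimizer_cls):
    model = build_captioning_model(units=units, optimizer=Opt(learning_rate=lr))
    history = model.fit([train_feats, train_tokens], train_targets,        # placeholder data
                        validation_data=([val_feats, val_tokens], val_targets),
                        batch_size=batch, epochs=100,
                        callbacks=[early_stop], verbose=0)
    score = min(history.history["val_loss"])
    if best is None or score < best[0]:
        best = (score, units, batch, lr, Opt.__name__)
```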
Regarding all these considerations, carefully designed experiments were conducted to enable a robust comparison of different parameter configurations for model generation. Following these experiments, the optimal model configuration for generating captions from the jewelry database was determined.
The evaluation of caption quality poses challenges due to the intricate nature of image captioning. While established metrics like METEOR, BLEU, and ROUGE [44] exist for measuring language and semantic accuracy, our dataset’s unique constraints led us to adopt a straightforward approach. We deemed generated captions correct if they matched the original ones provided in the online jewelry stores. For future implementations, there is an intention to enhance the dataset and incorporate these established metrics for a more comprehensive assessment of our model’s performance.
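Under this criterion, captioning accuracy reduces to an exact-match comparison between generated and reference captions, as in the following minimal sketch (the example captions are made up):

```python
def caption_accuracy(generated, reference):
    """Fraction of generated captions that exactly match the store's original caption."""
    hits = sum(g.strip().lower() == r.strip().lower() for g, r in zip(generated, reference))
    return hits / len(reference)

# Made-up example: one of the two generated captions matches its reference exactly.
generated = ["yellow gold bracelet with topazes", "ring in rose gold"]
reference = ["Yellow gold bracelet with topazes", "Ring in rose gold with central diamond"]
print(caption_accuracy(generated, reference))  # 0.5
```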

4. Results and Discussion

In this section, we present the results corresponding to the best-performing configurations obtained after a comprehensive evaluation of all tested model architectures and hyperparameter settings. Initially, a classification task was undertaken, categorizing accessories into four distinct classes: necklaces, rings, earrings, and bracelets. After thorough analysis to determine the optimal configuration for each considered encoder and decoder combination, the most favorable outcomes in terms of test Correct Classification Rate (CCR) were attained using VGG-16 as the CNN and GRU as the RNN, as detailed in Table 2. This table also includes additional metrics, such as validation CCR and Loss.
Following the selection of the best configuration, a more comprehensive analysis was conducted, incorporating additional metrics like precision, recall, and F1-Score to assess the model’s performance in classifying each jewelry class independently. Precision measures the accuracy of positive predictions, recall assesses the model’s ability to capture all positive instances, and F1-Score provides a balance between precision and recall, offering a comprehensive evaluation of the model’s performance. Table 3 illustrates the model’s effectiveness in the classification task, achieving scores exceeding 91% for all metrics and accessory types, with the exception of bracelets, which proved more challenging to detect.
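These per-class metrics can be obtained directly from the predicted and true labels; the brief sketch below assumes scikit-learn, with placeholder label lists standing in for the actual test-set annotations and model predictions.

```python
from sklearn.metrics import classification_report

CLASSES = ["necklaces", "rings", "earrings", "bracelets"]

# y_true and y_pred are illustrative stand-ins for the test-set labels and predictions.
y_true = ["rings", "earrings", "bracelets", "necklaces", "rings"]
y_pred = ["rings", "earrings", "necklaces", "necklaces", "rings"]

print(classification_report(y_true, y_pred, labels=CLASSES, zero_division=0))
```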
It is worth noting that an optimization of the various parameters under consideration was conducted, yielding the best results with the following configuration: number of neurons = 512; batch size = 8; optimizer = Adam; and learning rate = 0.001.
In the context of the primary task of generating comprehensive captions from images of accessories, optimal results were achieved with the VGG-16 and GRU combination. MobileNet was excluded from this investigation due to its tendency to frequently get stuck in local optima, leading to suboptimal performance in handling the specific challenges posed by image captioning in the context of jewelry. Table 4 presents a comparison of all considered encoder–decoder combinations for this task.
In this experimentation, an optimization of the different parameters considered was carried out, resulting in the best outcomes with the following configuration: number of neurons = 256; batch size = 16; optimizer = Adam; and learning rate = 0.001.
Furthermore, the assessment of our proposed approach revealed that the top-performing model exhibited excellent captioning accuracy, closely aligning with the original labels from online jewelry stores. Notably, the captioning accuracy for rings, necklaces, and earrings was very promising. However, the accuracy for bracelets, while still commendable, was slightly lower in the test samples. The evaluation criteria involved a meticulous comparison of whether the generated captions precisely matched the authentic descriptions of the accessories captured in the images.
It is important to acknowledge certain limitations in our model. The observed misclassifications, particularly in the case of bracelets, can be attributed to the inherent challenges posed by accessories with similar shapes and materials. Additionally, instances arose where the same type of accessory could exhibit variations in materials or jewels, further contributing to the nuanced nature of accurate classification. These challenges highlight areas for potential refinement and improvement in future iterations of the model.
Ultimately, a three-tiered interface was developed, leveraging the best-performing image captioning models created. This interface allows users to upload an image of an accessory and generate a corresponding description. Illustrated in Figure 3, the interface offers descriptions of varying complexity (see Section 3).
Let us consider an example image of a gold necklace with a pendant featuring a sapphire gemstone. The image is uploaded to the system, and the generated descriptions at different complexity levels are as follows:
  • Basic description: ‘A necklace with a pendant’.
  • Normal description: ‘A gold necklace with a sapphire pendant’.
  • Complete description (after several iterations): ‘This exquisite gold necklace showcases a beautiful sapphire pendant, adding a touch of elegance and sophistication to the accessory’.
From a linguistic perspective, the descriptions exhibit a progressive level of detail, ranging from a simple identification of the accessory to a more elaborate narrative that captures specific attributes, such as the material and gemstone, and an emotional tone. This iterative process in the complete description phase involves multiple iterations to incorporate the embellishment of the jewelry with superlative adjectives, further enhancing the richness of the generated descriptions. This meticulous approach aims to provide potential customers with a more nuanced and compelling understanding of the featured jewelry.
For the sake of reproducibility, the jewelry identification models, multilevel description generation models, and database of jewelry images, along with the necessary code, are available at Jewelry Linguistics Github https://github.com/jewelryling/jewelry_linguistics/ (accessed on 12 May 2025).

5. Conclusions and Future Work

The encoder–decoder architecture, coupled with linguistic models, has proven to be instrumental in generating coherent and contextually relevant descriptions across different types of jewelry. This multilevel linguistic approach contributes to a more versatile and accurate representation of jewelry items. It allows for a tailored and informative description that aligns with diverse user preferences and contextual requirements. The adaptability and scalability of our model underscore its potential in the future for broader applications in the domain of automated image captioning, with implications for diverse industries beyond jewelry analysis.
In considering future avenues for research, the refinement of our linguistic models stands out as a priority to enhance accuracy and accommodate an even broader spectrum of jewelry variations. Exploring advanced network architectures and expanding the dataset to encompass diverse cultural and stylistic influences can contribute significantly to the development of a more robust and culturally sensitive linguistic model. In particular, incorporating attention mechanisms into the encoder–decoder architecture could improve the model’s ability to dynamically focus on relevant visual features, helping to mitigate repetition and enrich caption diversity. Building upon this, we also plan to explore Transformer-based architectures, which leverage self-attention to model long-range dependencies and enable parallel generation, making them especially suitable for producing the complex and expressive descriptions targeted in this work.
Another important direction for future work is to evaluate the robustness of the system under real-world image acquisition conditions, including varying lighting, reflections, and occlusions, which are common in user-generated content and may affect the accuracy of jewelry recognition and description.
Furthermore, incorporating user feedback and preferences into the training process will be pivotal in refining the models for personalized and context-aware jewelry descriptions. This user-centric approach aims to ensure that the generated descriptions align closely with individual preferences and contribute to a more user-friendly and adaptable system.

Author Contributions

Conceptualization, J.M.A.-L., R.Á.-R., A.R.-M. and E.Y.-B.; methodology, J.M.A.-L., R.Á.-R., A.R.-M. and E.Y.-B.; software, J.M.A.-L. and E.Y.-B.; validation, A.Z., J.T. and E.Y.-B.; formal analysis, J.M.A.-L. and E.Y.-B.; investigation, J.M.A.-L., R.Á.-R., A.R.-M. and E.Y.-B.; resources, A.Z. and E.Y.-B.; data curation, J.M.A.-L. and E.Y.-B.; writing—original draft preparation, J.M.A.-L.; writing—review and editing, J.M.A.-L., A.Z., J.T., R.Á.-R., A.R.-M. and E.Y.-B.; visualization, J.M.A.-L.; supervision, A.Z., J.T. and E.Y.-B.; project administration, E.Y.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Regional Ministry of Economy, Knowledge, Enterprises, and University, Government of Andalusia (Spain), grant reference 1381382-F.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used during the current study is not publicly available due to privacy agreements with the jewelry companies that participated in this work. Access to the data is restricted in accordance with the confidentiality terms established by the collaborating stores.

Acknowledgments

We would like to express our gratitude for the cooperation and availability of the online jewelry stores that provided us access to their catalogs for the execution of the experiments in this study. José Manuel Alcalde Llergo is a student enrolled in the National PhD in Artificial Intelligence, XXXVIII cycle, course on health and life sciences, organized by Università Campus Bio-Medico di Roma. He is also pursuing a doctorate in co-supervision at the Universidad de Córdoba (Spain), enrolled in its program in Computation, Energy and Plasmas.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar] [CrossRef]
  2. Stefanini, M.; Cornia, M.; Baraldi, L.; Cascianelli, S.; Fiameni, G.; Cucchiara, R. From Show to Tell: A Survey on Deep Learning-Based Image Captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 539–559. [Google Scholar] [CrossRef] [PubMed]
  3. Xiao, X.; Wang, L.; Ding, K.; Xiang, S.; Pan, C. Deep Hierarchical Encoder–Decoder Network for Image Captioning. IEEE Trans. Multimed. 2019, 21, 2942–2956. [Google Scholar] [CrossRef]
  4. Jackendoff, R. Foundations of Language: Brain, Meaning, Grammar, Evolution; Oxford University Press: Oxford, UK, 2002. [Google Scholar] [CrossRef]
  5. Jackendoff, R.; Wittenberg, E. Linear grammar as a possible stepping-stone in the evolution of language. Psychon. Bull. Rev. 2016, 24, 219–224. [Google Scholar] [CrossRef] [PubMed]
  6. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 37. [Google Scholar] [CrossRef]
  7. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  8. Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 22–27 June 2014; pp. 55–60. [Google Scholar] [CrossRef]
  9. Bar-Hillel, Y. Language and Information; Addison-Wesley Pub. Co.: Reading, MA, USA, 1964. [Google Scholar]
  10. Bar-Hillel, Y. Aspects of Language. Br. J. Philos. Sci. 1973, 24, 190–193. [Google Scholar] [CrossRef]
  11. McShane, M.; Nirenburg, S. Linguistics for the Age of AI; MIT Press: London, UK, 2021. [Google Scholar]
  12. Li, C.; Kang, Q.; Ge, G.; Song, Q.; Lu, H.; Cheng, J. DeepBE: Learning Deep Binary Encoding for Multi-label Classification. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 744–751. [Google Scholar] [CrossRef]
  13. Chen, S.Y.; Luo, G.J.; Li, X.; Ji, S.M.; Zhang, B.W. The Specular Exponent as a Criterion for Appearance Quality Assessment of Pearllike Objects by Artificial Vision. IEEE Trans. Ind. Electron. 2012, 59, 3264–3272. [Google Scholar] [CrossRef]
  14. Alcalde-Llergo, J.M.; Yeguas-Bolívar, E.; Zingoni, A.; Fuerte-Jurado, A. Jewelry Recognition via Encoder-Decoder Models. In Proceedings of the 2023 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE), Milano, Italy, 25–27 October 2023. [Google Scholar] [CrossRef]
  15. Yu, J.; Li, H.; Hao, Y.; Zhu, B.; Xu, T.; He, X. CgT-GAN: CLIP-guided Text GAN for Image Captioning. In Proceedings of the 31st ACM International Conference on Multimedia, MM ’23, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 2252–2263. [Google Scholar] [CrossRef]
  16. Shetty, R.; Rohrbach, M.; Hendricks, L.A.; Fritz, M.; Schiele, B. Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 4155–4164. [Google Scholar] [CrossRef]
  17. Chaffin, A.; Kijak, E.; Claveau, V. Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning. arXiv 2024, arXiv:2402.13936. [Google Scholar] [CrossRef]
  18. Shi, H.; Li, P.; Wang, B.; Wang, Z. Image captioning based on deep reinforcement learning. In Proceedings of the 10th International Conference on Internet Multimedia Computing and Service, ICIMCS’18, Nanjing, China, 17–19 August 2018; pp. 1–5. [Google Scholar] [CrossRef]
  19. Ghandi, T.; Pourreza, H.; Mahyar, H. Deep Learning Approaches on Image Captioning: A Review. ACM Comput. Surv. 2023, 56, 1–39. [Google Scholar] [CrossRef]
  20. McCoy, K.F.; Carberry, S.; Roper, T.; Green, N.L. Towards generating textual summaries of graphs. In Universal Access in HCI: Towards an Information Society for All, Proceedings of the HCI International ’2001 (the 9th International Conference on Human-Computer Interaction), New Orleans, LA, USA, 5–10 August 2001; Stephanidis, C., Ed.; Lawrence Erlbaum: Mahwah, NJ, USA, 2001; Volume 3, pp. 695–699. [Google Scholar]
  21. Grice, H.P. Logic and Conversation; Academic Press: Cambridge, MA, USA, 1975; Volume 3, pp. 41–58. [Google Scholar]
  22. Dijk, T.V. Semantic macro-structures and knowledge frames in discourse comprehension. In Cognitive Processes in Comprehension; Psychology Press: London, UK, 1977; pp. 3–32. [Google Scholar]
  23. Leech, G. Corpora and theories of linguistic performance. In Directions in Corpus Linguistics. Proceedings of the Nobel Symposium 82; Svartvik, J., Ed.; Trends in Linguistics. Studies and Monographs; Mouton de Gruyter: Berlin, Germany; New York, NY, USA, 1992; Volume 65, pp. 105–122. [Google Scholar]
  24. Mairal, R.; Ruiz de Mendoza, F. Levels of description and explanation in meaning construction. In Deconstructing Constructions; John Benjamins Publishing Company: Amsterdam, The Netherlands, 2009; Volume 107, pp. 153–198. [Google Scholar] [CrossRef]
  25. Abdel-Raheem, A. Semantic macro-structures and macro-rules in visual discourse processing. Vis. Stud. 2021, 38, 407–424. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar] [CrossRef]
  27. Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, New York, NY, USA, 3–10 March 2021; pp. 610–623. [Google Scholar] [CrossRef]
  28. Kozhakhmetova, G.A.; Tazhibayeva, S. Kazakh jewelry: Problems of translation and creation of a multilingual thesaurus. Bull. L.N. Gumilyov Eurasian Natl. Univ. Philol. Ser. 2021, 135, 55–63. [Google Scholar] [CrossRef]
  29. Demortier, G. Targeting Ion Beam Analysis techniques for gold artefacts. ArchéoSciences 2009, 33, 29–38. [Google Scholar] [CrossRef]
  30. Yu, Q.; Meng, K.; Guo, J. Research on innovative application of silver material in modern jewelry design. MATEC Web Conf. 2018, 176, 02013. [Google Scholar] [CrossRef]
  31. Ariza-Montes, A. Plan Estratégico Del Sistema Productivo Local de la Joyería de Córdoba; Universidad de Loyola: Córdoba, Spain, 2006. [Google Scholar]
  32. Arellano, N.; Espinoza, A.; Perdomo, D.; Carpio, L. Joyería Marthita. DSpace en ESPOL Unidades Académicas Facultad de Ciencias Sociales y Humanísticas. Artículos de Tesis de Grado. 2009. Available online: https://www.dspace.espol.edu.ec/handle/123456789/1361?locale=en (accessed on 12 May 2025).
  33. Díaz, A.; Aguilar, D.; Blandón, J.; Estela, B. Competitividad entre joyerías y tiendas de bisutería fina legalmente constituidas, 2013. Rev. Cient. Estelí 2014, 9, 35–44. [Google Scholar] [CrossRef]
  34. Hani, S.; Marwan, A.; Andre, A. The effect of celebrity endorsement on consumer behavior: Case of the Lebanese jewelry industry. Arab Econ. Bus. J. 2018, 13, 190–196. [Google Scholar] [CrossRef]
  35. Alvarado Hidalgo, C.A. El marketing como factor relevante en la gestión de calidad y plan de mejora en las micro y pequeñas empresas del sector servicio-rubro joyería y perfumería, de la ciudad de Chimbote, 2019. Repositorio Institucional de ULADECH Católica. Trabajo de Investigación. 2021. Available online: https://repositorio.uladech.edu.pe/handle/20.500.13032/24863 (accessed on 12 May 2025).
  36. Liddicoat, A.J.; Curnow, T.J. Language Descriptions; Wiley: Hoboken, NJ, USA, 2004. [Google Scholar] [CrossRef]
  37. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. arXiv 2019, arXiv:1911.02685. [Google Scholar] [CrossRef]
  38. Wang, Y.; Xu, J.; Sun, Y. End-to-End Transformer Based Model for Image Captioning. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; pp. 2585–2594. [Google Scholar]
  39. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  40. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27 June–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef]
  41. Howard, A.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  42. Hochreiter, S.; Schmidhuber, J. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  43. Cho, K.; van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014; pp. 103–111. [Google Scholar] [CrossRef]
  44. Luo, G.; Cheng, L.; Jing, C.; Zhao, C.; Song, G. A thorough review of models, evaluation metrics, and datasets on image captioning. IET Image Process. 2022, 16, 311–332. [Google Scholar] [CrossRef]
Figure 1. Example of jewelry pieces used in the final dataset: (a) Skye yellow gold and oval stones ring, (b) Arayat emerald necklace, and (c) Taurus white gold, diamonds earring. These images represent part of the training data used in this study.
Figure 2. Overview of the encoder–decoder model used for jewelry description. The CNN encoder (a) processes the input image of a jewelry item into a latent feature vector, which is then used to initialize the initial state of the RNN decoder (b) that generates the caption.
Figure 3. Jewelry image captioning interface (complete description after several iterations for generating the caption).
Table 1. Description levels of jewelry pieces.
Level | Example 1 (earrings) | Example 2 (ring) | Example 3 (bracelet)
Basic | Earrings in yellow gold. | Solitaire in rose gold. | Yellow gold bracelet.
Normal | Yellow gold and diamond earrings. | Ring in rose gold with central diamond. | Yellow gold bracelet with topazes.
Complete | Earrings in sustainable yellow gold adorned with exquisite, brilliant-cut diamonds and featuring a secure push-back clasp. | Iris solitaire in rose gold with central diamond. | Sustainable yellow gold bracelet adorned with dazzling sky topazes.
Table 2. Best model for identification of the class: necklace, ring, earring, and bracelet.
CNN | RNN | Neurons | Val. CCR | Val. Loss | Test CCR
InceptionV3 | GRU | 1024 | 0.9187 | 0.1026 | 0.9161
VGG-16 | GRU | 512 | 0.9314 | 0.0390 | 0.9464
MobileNet | GRU | 64 | 0.8851 | 0.2696 | 0.8802
InceptionV3 | LSTM | 1024 | 0.9487 | 0.0301 | 0.9193
VGG-16 | LSTM | 512 | 0.9597 | 0.0542 | 0.9227
MobileNet | LSTM | 64 | 0.8811 | 0.2208 | 0.8775
Table 3. Individual class classification quality.
Accessory | Precision | Recall | F1-Score
Necklaces | 0.9452 | 0.9087 | 0.9131
Rings | 0.9276 | 0.9173 | 0.9343
Earrings | 0.9452 | 0.9675 | 0.9674
Bracelets | 0.9420 | 0.8298 | 0.8807
Table 4. Best model for image captioning.
CNN | RNN | Neurons | Val. CCR | Val. Loss | Test CCR
InceptionV3 | LSTM | 512 | 0.9307 | 0.0615 | 0.8439
VGG-16 | LSTM | 512 | 0.9270 | 0.0881 | 0.9036
InceptionV3 | GRU | 256 | 0.9483 | 0.0921 | 0.8354
VGG-16 | GRU | 256 | 0.9633 | 0.0706 | 0.9345
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
