1. Introduction
The development of artificial intelligence (AI) in recent years has transformed how we engage with visual data, particularly within enterprises. Multi-modal codification systems create semantic connections between text and visual data, such as images or videos, using AI models such as CLIP (Contrastive Language-Image Pretraining) [1].
This advancement opens new markets for businesses by enabling visual search engines, recommendation systems, and automatic tagging of visual data. One of the most pertinent examples is a visual content-based search system, where companies use artificial intelligence models to provide customers with services such as product comparison, image search, and visual discovery [2]. For example, by submitting an image of a product they are interested in, customers can find similar products. Furthermore, businesses can offer these search engine services to other companies, using their own product data and creating databases with multi-modal codifications of visual data related to the product.
However, implementing these systems presents significant technical challenges. The typical workflow involves encoding images using an AI model, converting these representations into semantic vectors, and inserting them into databases optimized for fast searches. A critical bottleneck in this process is the initial data acquisition stage: downloading, preprocessing, and preparing images for integration [3]. This issue not only affects technical efficiency but also impacts the ability of companies to scale these systems to a commercial level, where speed and handling of large volumes of data are essential for remaining competitive.
The principal goal of this paper is not merely to optimize an image download and preparation pipeline, but to provide the first systematic and controlled comparison of heterogeneous communication architectures, namely (i) synchronous gRPC, (ii) asynchronous message queuing with RabbitMQ, and (iii) serverless approaches using AWS Lambda and SageMaker, for GPU-intensive image encoding workloads. Unlike prior industrial case studies that describe individual production systems, our work quantitatively evaluates performance trade-offs, GPU utilization efficiency, and scalability behaviors under identical experimental conditions.
From a business perspective, these improvements not only enable more efficient workflows but also allow companies to offer more agile and scalable services. In a market where user experience and response speed are critical differentiators, optimizing the initial stages of data flow can be a strategic component to maximize return on investment in AI solutions.
The remainder of this paper is organized as follows. Section 2 presents related work. Next, Section 3 discusses the research questions and the proposed solution. Later, Section 4 presents the results and Section 5 discusses them. Finally, Section 6 concludes the paper.
2. Related Work
The optimization of databases for visual content-based search systems is closely tied to the underlying distributed systems that support them. In e-commerce, where visual search and recommendation systems are critical, the choice of a distributed architecture significantly impacts performance, scalability, and reliability. This section explores previous work on distributed systems in e-commerce and how different architectures leverage pipelines to optimize workflows, with a focus on their implications for Approximate Nearest Neighbor (ANN) database management.
2.1. Distributed Systems in E-Commerce
Distributed systems play a pivotal role in modern e-commerce platforms, enabling scalable and efficient handling of large volumes of data, including images, user interactions, and transaction records. For example, Yang et al. [3] show a system that relies on distributed storage and processing to manage millions of images efficiently. By using micro-batches and distributed hashing techniques, it ensures that duplicate images are avoided and data ingestion is optimized. The hashes generated during this process are stored in Google Bigtable, a distributed database designed for high scalability and low latency. This highlights the importance of distributed systems to achieve high performance and reliability in e-commerce applications.
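To make the idea of hash-based duplicate avoidance over micro-batches concrete, the following is a minimal sketch and not the implementation described in [3]. It assumes exact byte-level hashing with SHA-256, and an in-memory set stands in for a distributed store such as Bigtable; the function names are hypothetical.

```python
import hashlib


def micro_batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def deduplicate(images, batch_size=2):
    """Keep the first occurrence of each distinct image payload."""
    seen = set()   # assumption: stand-in for a distributed key-value store
    unique = []
    for batch in micro_batches(images, batch_size):
        for payload in batch:
            digest = hashlib.sha256(payload).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(payload)
    return unique


# Hypothetical payloads; real inputs would be downloaded image bytes.
images = [b"img-a", b"img-b", b"img-a", b"img-c", b"img-b"]
unique_images = deduplicate(images)
```

Exact hashing only catches byte-identical files; detecting re-encoded or resized copies would need a perceptual hash, which this sketch does not attempt.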
Another relevant study by Zheng et al. [4] demonstrates how distributed architectures enhance the efficiency of visual search engines. Techniques such as data partitioning, replication, and distributed databases accelerate queries and reduce latency. These findings underscore the critical role of distributed systems in optimizing the performance of visual content-based search systems, particularly when managing large-scale datasets. Furthermore, these architectures directly influence the performance of ANN-based databases by improving data preprocessing, indexing, and retrieval.
Beyond e-commerce and visual retrieval systems, recent research highlights the importance of robustness, scalability, and distributed intelligence in complex, data-intensive environments. Advanced fault diagnosis techniques for industrial and transportation systems demonstrate how distributed learning pipelines must operate reliably under noisy, heterogeneous, and limited data conditions. For instance, antinoise bearing fault diagnosis using time-frequency reassignment and sparse learning dictionaries has been proposed to improve robustness in industrial sensing systems [5]. Few-shot cross-domain learning approaches further emphasize the need for scalable architectures capable of efficiently generalizing across domains with limited supervision [6]. These studies underline the relevance of flexible and resilient distributed pipelines for large-scale industrial AI applications.
2.2. Pipelines in Distributed Architectures
Numerous distributed architectures use pipelines to optimize workflows in e-commerce and other domains. The following subsections explore key approaches, highlighting their impact on the optimization of the ANN database.
2.2.1. Event-Driven Pipeline Framework (EPypes)
EPypes [7] is a software framework and architecture for developing pipeline-based distributed systems. It enables the breakdown of complicated operations into smaller interrelated tasks that can be performed in a distributed environment. For example, in an e-commerce platform, EPypes can be used to build an image processing pipeline in which each stage (e.g., downloading, preprocessing, and encoding) is handled by a different component.
Semeniuta et al. [8] use this pipeline abstraction to develop vision algorithms integrated with a publish–subscribe communication system for edge detection and feature matching, demonstrating that the framework’s computational overhead is minimal compared to the application’s inherent processing time.
2.2.2. Event-Driven Architectures (RabbitMQ, Apache Kafka, AWS SQS)
Event-driven architectures, such as those built with RabbitMQ [9], Apache Kafka [10], Amazon Web Services (AWS) Simple Queue Service (SQS) [11], and similar queuing systems, rely on message queues to coordinate operations asynchronously. Each task in the pipeline generates events that are consumed by subsequent tasks. In visual search systems, an event-based pipeline might consist of steps such as image ingestion, feature extraction, and database insertion.
Hinze et al. [12] discuss the fundamental principles and applications of event-driven architectures, highlighting their use in fraud detection, traffic monitoring, and smart infrastructure. These architectures enable real-time processing, making them particularly useful for ANN database updates, where new embeddings must be efficiently integrated into existing indexes. However, challenges such as message ordering, latency, and fault tolerance remain key considerations.
2.2.3. Serverless Computing for Scalable Pipelines
Serverless computing is gaining popularity because of its ability to scale automatically and reduce operational overhead. In a serverless pipeline, each task is implemented as a function triggered by events. For example, AWS Lambda [13] can be used to build a pipeline to process product images, with separate functions that handle downloading, resizing, and deep learning-based encoding.
A study by Dehury et al. [14] explores serverless pipelines for managing massive data volumes using AWS Lambda, AWS S3 [15], and AWS Data Pipeline [16]. Their findings demonstrate the cost-effectiveness and scalability of serverless architectures. In the context of ANN databases, serverless functions can dynamically preprocess and update embeddings, reducing infrastructure costs while maintaining performance.
2.2.4. Workflow Orchestration with Apache Airflow
Apache Airflow [17] is a powerful tool for orchestrating complex workflows using Directed Acyclic Graphs (DAGs). In e-commerce, Apache Airflow is commonly used to define and manage pipelines for data ingestion, preprocessing, model training, and deployment.
Testas et al. [18] demonstrate how Apache Airflow transforms independent scripts into automated, scalable workflows. By integrating tools such as PySpark [19], PyTorch [20], and TensorFlow [21], they create reproducible pipelines ideal for data science and machine learning projects. For ANN databases, Airflow’s task orchestration capabilities help ensure efficient data preprocessing and embedding updates, improving overall database performance.
2.3. Communication Architectures in Distributed Systems
While pipelines and frameworks provide structure for distributed workflows, the choice of communication architecture significantly impacts system performance. This section examines recent research comparing communication paradigms relevant to compute-intensive applications.
2.3.1. Synchronous vs. Asynchronous Communication
The distinction between synchronous and asynchronous communication has profound implications for distributed system design. Kumar et al. [22] characterized REST, gRPC, and Apache Thrift communication protocols in microservices, demonstrating that gRPC and Thrift achieve significantly lower response times than REST due to binary serialization and HTTP/2 advantages. Their work also shows that Unix Domain Sockets can reduce response time for same-host communication. Weerasinghe et al. [23] show that selecting appropriate communication mechanisms is critical for reducing response time, as network overhead represents a primary performance bottleneck in microservice architectures.
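As a rough illustration of why binary serialization can shrink payloads relative to text formats, the sketch below encodes the same vector once as JSON and once as packed 32-bit floats. It uses Python's `struct` module rather than Protocol Buffers, so it only approximates what gRPC does on the wire; the vector contents are synthetic, and the 768-element width is borrowed from the embedding size reported in Section 4.1.2.

```python
import json
import struct

DIM = 768  # assumption: same width as the embeddings described later

# Deterministic, synthetic values used only for this comparison.
embedding = [i / DIM for i in range(DIM)]

# Text encoding, as a JSON-based REST payload might carry it.
json_payload = json.dumps(embedding).encode("utf-8")

# Fixed-width binary encoding: 768 little-endian float32 values.
binary_payload = struct.pack(f"<{DIM}f", *embedding)

json_size = len(json_payload)      # bytes in the text form
binary_size = len(binary_payload)  # 4 bytes per value
```

The binary form trades precision (float32 instead of JSON's double-precision text) for a fixed four bytes per value, so the size gap depends on how many digits the text form carries.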
For asynchronous communication, Dobbelaere and Esmaili [24] compare Apache Kafka and RabbitMQ, demonstrating that the system choice depends heavily on use case requirements: RabbitMQ excels in traditional message queuing with complex routing, while Kafka is optimized for high-throughput streaming.
2.3.2. Serverless Computing Trade-Offs
Serverless computing introduces new architectural considerations. Jinfeng et al. [25] provide a systematic review identifying performance optimization and cost modeling as critical challenges, noting that serverless pay-per-use fundamentally changes cost–performance trade-offs. Mathew et al. [26] explored AWS Step Functions for data processing pipelines, revealing that Lambda cold starts and concurrency limits significantly impact both performance and cost. Wang et al. [27] characterized multiple serverless platforms, finding significant performance variability with cold start latency being a key differentiator.
2.3.3. Implications for Image Encoding Systems
The reviewed literature establishes key principles for communication architectures in compute-intensive applications: (i) direct communication models minimize queuing overhead and maximize GPU utilization [28,29], (ii) message brokers provide fault tolerance at the cost of some performance [24], (iii) serverless computing suits variable workloads but introduces cold start overhead [26], and (iv) communication architecture directly impacts GPU utilization, with efficient patterns maintaining high resource usage [28]. These findings motivate our experimental comparison of communication architectures for large-scale image encoding, examining how architectural choices impact GPU utilization and system throughput.
Parallel trends can be observed in physics-informed and data-driven modeling of complex infrastructures, where scalable computation and efficient communication are essential. Physics-informed neural networks have been applied to large-scale thermal inversion problems in multilayer pavement systems, requiring efficient data handling and model execution across heterogeneous resources [30]. Similarly, stochastic vibration analysis of aircraft–pavement systems illustrates the growing need for high-throughput computational pipelines capable of handling uncertainty and large parameter spaces [31]. These domains share common challenges with large-scale image encoding, particularly regarding efficient resource utilization and communication under increasing workload complexity.
3. Solution Proposed
In this section, we first describe the current architecture and how our image processing pipelines operate and communicate. Then, we propose a comparative study of performance, efficiency, and scalability across different types of architectures.
3.1. Current Architecture: gRPC and Multiprocessing Queues
As shown in Figure 1, the system is designed to efficiently handle image downloading and processing. It consists of two main components: Downloaders and Workers.
3.1.1. Downloaders
The role of this component is to extract images from external URLs and perform basic preprocessing. The main tasks of these components are as follows:
3.1.2. Workers
The role of this component is to encode image tensors into embeddings (a data format that can be mathematically manipulated for visual searches) using the CLIP model. The main tasks of these components are as follows:
Receive preprocessed image tensors from the Downloaders via gRPC.
Use multiprocessing queues to manage tasks processed by the CLIP model.
Generate embeddings from image tensors and store the resulting embeddings in the visual search database.
The strengths of the current architecture include the use of multiple processes, which enables efficient workload distribution, parallelism, and easy scalability due to the separation of Downloaders and Workers.
The current architecture also has several weaknesses. First, gRPC follows a strict request–response model, so communication is synchronous and can create bottlenecks: if Workers process requests slowly, Downloaders must wait for a response, which reduces overall system throughput. Second, each Worker maintains a multiprocessing queue for CLIP execution, which may introduce delays under high data volumes. Finally, the components never shut down, even when they have no work for long periods, which is not cost-efficient.
3.2. Proposed Architectures
To address these limitations, we propose a comparative study of different architectures to identify the most suitable approach for our system. The study evaluates the performance, scalability, and fault tolerance of the following architectures:
In general, gRPC is preferable for latency-sensitive, high-throughput workloads; RabbitMQ is suitable when fault tolerance and decoupling are priorities; Lambda-based pipelines fit variable or intermittent workloads; and SageMaker offers a balance between performance and operational simplicity.
The goal of this study is to determine which architecture best meets the requirements of our visual content-based search system, particularly in terms of handling large-scale datasets and ensuring high availability.
3.3. Event-Driven Architecture with RabbitMQ
As a first step in this study, we propose using RabbitMQ as a message broker. As shown in Figure 2, using RabbitMQ, we implement an asynchronous, event-driven architecture that decouples Downloaders and Workers, improving scalability and fault tolerance.
3.3.1. Downloaders
In this case, the Downloaders no longer communicate directly with the Workers. They subscribe to a RabbitMQ queue to receive image download tasks and, as before, download and preprocess images in separate threads. Instead of sending data to the Workers via gRPC, the Downloaders publish messages to a separate RabbitMQ queue to which the Workers are subscribed, so the Downloaders now act as both consumers and producers. As a result, the Downloaders do not wait for responses from the Workers, eliminating the need for synchronous communication.
3.3.2. Workers
As previously described, the Workers subscribe to the RabbitMQ queue to receive messages, making them RabbitMQ consumers; RabbitMQ ensures reliable message delivery and supports load balancing across multiple Workers. Once a Worker obtains the message data, it passes the data to the CLIP multiprocessing queue and stores the resulting embeddings in the database.
The event-driven architecture offers several benefits over the current architecture. First, Workers and Downloaders are easier to scale horizontally. Second, system flexibility increases because the Downloaders and Workers operate independently. Additionally, RabbitMQ provides fault tolerance: if a Worker or Downloader fails while handling a task, the task can be reassigned to another instance, preserving service continuity.
However, there are potential drawbacks: messages moving through queues add latency and may slightly increase processing time. The queue system must therefore be designed efficiently to keep this extra latency low.
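The decoupling between Downloaders and Workers can be sketched with an in-process queue. This is only an analogy: `queue.Queue` stands in for the RabbitMQ broker and offers none of its durability, acknowledgement, or routing features, and the function names and URLs are hypothetical.

```python
import queue
import threading

SENTINEL = None  # signals that no more work will arrive


def downloader(urls, tensor_queue):
    """Producer: publish each result without waiting for a reply.

    `put` only blocks when the bounded queue is full (back-pressure).
    """
    for url in urls:
        tensor_queue.put(f"tensor({url})")  # placeholder for preprocessing
    tensor_queue.put(SENTINEL)


def worker(tensor_queue, embeddings):
    """Consumer: take messages as they arrive and 'encode' them."""
    while True:
        item = tensor_queue.get()
        if item is SENTINEL:
            break
        embeddings.append(f"embedding({item})")  # placeholder for inference


urls = [f"https://example.com/img{i}.jpg" for i in range(5)]
tensor_queue = queue.Queue(maxsize=2)  # small bound to exercise back-pressure
embeddings = []

consumer = threading.Thread(target=worker, args=(tensor_queue, embeddings))
producer = threading.Thread(target=downloader, args=(urls, tensor_queue))
consumer.start()
producer.start()
producer.join()
consumer.join()
```

The producer never inspects the consumer's result, which is the property that removes the request–response wait; the cost is the extra hop through the queue.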
3.4. Serverless Downloader Architecture with Queue Integration
As an alternative approach in our comparative study, we propose implementing the Downloader component using a serverless architecture, as shown in Figure 3. This design preserves asynchronous processing and introduces elastic scalability.
3.4.1. Serverless Downloaders
The Downloaders are implemented as AWS Lambda functions triggered by API calls that provide batches of image URLs (ranging from 100 to 1000 URLs per batch). Each function invocation handles image downloading and preprocessing in parallel within the allocated execution.
Upon completion, the serverless functions publish the results to a message queue subscribed to by the Workers, without awaiting any response, enabling true fire-and-forget processing. Automatic retries with exponential back-off are implemented to handle failed downloads, increasing robustness. These functions operate independently, maintaining an event-driven and asynchronous execution model that fully decouples them from downstream components.
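A retry policy with exponential back-off can be sketched as follows. The three-attempt limit mirrors the configuration reported in Section 4.1.3; the base delay, the helper name `retry_with_backoff`, and the failing download are assumptions made for illustration.

```python
import time


def retry_with_backoff(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    The wait before retry k (k = 1, 2, ...) is base_delay * 2 ** (k - 1).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a hypothetical download that fails twice before succeeding.
calls = {"count": 0}
delays = []  # records the waits instead of actually sleeping


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return b"image-bytes"


result = retry_with_backoff(flaky_download, sleep=delays.append)
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; production code would typically also add random jitter to the delays.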
3.4.2. Workers
The Worker components continue to process tasks from the message queue. They remain subscribed to a dedicated RabbitMQ queue containing the preprocessed image tensors and use multiprocessing queues to manage CLIP model execution. Dead-letter queues are introduced to handle failed processing tasks, ensuring fault tolerance. RabbitMQ provides reliable message delivery between the serverless Downloaders and the Worker cluster, while also distributing workload efficiently. Each component of the system functions independently, communicating only via the message queue infrastructure.
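The dead-letter idea can be illustrated in plain Python. RabbitMQ handles dead-lettering at the broker level, so the in-process version below is only a conceptual sketch; the attempt limit and the `encode` handler are hypothetical.

```python
from collections import deque


def process_with_dead_letter(messages, handler, max_attempts=3):
    """Process messages; move those that keep failing to a dead-letter list."""
    pending = deque((msg, 1) for msg in messages)  # (message, attempt number)
    done, dead_letters = [], []
    while pending:
        msg, attempt = pending.popleft()
        try:
            done.append(handler(msg))
        except Exception:
            if attempt >= max_attempts:
                dead_letters.append(msg)            # give up, keep for inspection
            else:
                pending.append((msg, attempt + 1))  # requeue for another try
    return done, dead_letters


def encode(msg):
    """Hypothetical handler that rejects a corrupt payload."""
    if msg == "corrupt":
        raise ValueError("cannot decode tensor")
    return f"embedding({msg})"


done, dead_letters = process_with_dead_letter(["a", "corrupt", "b"], encode)
```

Failed messages are set aside rather than retried forever, so one bad payload cannot block the rest of the queue.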
The key advantages of this architecture are the following:
It enables truly asynchronous, fire-and-forget image downloads with automatic horizontal scaling.
It reduces infrastructure costs by removing the need for continuously running Downloader servers.
It preserves the strengths of our existing queue-based system, while enhancing reliability through AWS Lambda’s built-in retry mechanisms.
There are several important considerations regarding this architecture:
Cold starts can introduce 100–500 ms of latency in initial Lambda invocations.
Batch size tuning is necessary to optimize throughput while avoiding timeout risks (Lambda has a 900 s maximum execution limit).
Queue configurations require optimization to effectively handle message prioritization.
This architecture combines the benefits of serverless computing with a message-driven pipeline, maintaining full decoupling between components while addressing the scalability limitations inherent to our traditional Downloader design.
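The batch-size consideration above can be explored with a small helper. The 900 s limit comes from the text; the per-image time, the degree of parallelism, and the safety margin are assumptions, and the uniform-cost model is deliberately crude.

```python
LAMBDA_TIMEOUT_S = 900  # maximum execution time stated above


def split_into_batches(urls, batch_size):
    """Split a URL list into consecutive batches of at most `batch_size`."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def fits_timeout(batch_size, seconds_per_image, parallelism, safety_margin=0.8):
    """Rough check that a batch finishes within a fraction of the timeout.

    Assumes images are handled `parallelism` at a time with a uniform
    per-image cost, which real downloads will not satisfy exactly.
    """
    rounds = -(-batch_size // parallelism)  # ceiling division
    return rounds * seconds_per_image <= LAMBDA_TIMEOUT_S * safety_margin


# Hypothetical URLs used only to show the split.
urls = [f"https://example.com/img{i}.jpg" for i in range(450)]
batches = split_into_batches(urls, batch_size=200)
batch_lengths = [len(b) for b in batches]
```

Under these assumptions, a 200-image batch at 2 s per image and parallelism 8 fits comfortably, while a 1000-image batch at 6 s per image does not; measured download times should replace the guesses before choosing a batch size.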
3.5. Serverless Architecture with SageMaker Inference
To further evolve our system design, we propose the serverless architecture shown in Figure 4, which minimizes infrastructure management for the Workers while maintaining high scalability, reliability, and cost-efficiency by running them on Amazon SageMaker [34].
3.5.1. Serverless Downloaders
The Downloader component still downloads and preprocesses the images. Once this is done, it invokes the SageMaker endpoint with the preprocessed images and obtains the embeddings used to encode the client's database.
3.5.2. Serverless Workers
After preprocessing, CLIP inference runs on SageMaker, which takes the role of the Worker. The preprocessed tensor is sent directly to a SageMaker real-time endpoint. Real-time endpoints (with GPU instances) are used because SageMaker Serverless Inference currently does not support GPUs. Cold starts can introduce 10–30 s of initial latency when no instance is running or while the endpoint scales up.
The key advantages of this architecture are almost the same as those of the Serverless Downloader Architecture with Queue Integration, but without intermediate persistence or notification delay, and with fewer components, which reduces operational complexity and increases scalability.
The main considerations are that this architecture still incurs SageMaker cold starts of 10–30 s and must acquire a GPU instance to perform all the processing work, which costs more than using a GPU instance directly.
3.6. Experimental Methodology
In the next section, each architecture is evaluated using identical workloads processed through a controlled submission pattern. A ThreadPool with four concurrent threads submits images sequentially from the current visual database creation process. The experiment is conducted using different numbers of images, specifically 4.5 K, 9 K, 13.5 K, and 18 K images.
Key metrics include end-to-end processing time (from first image submission to final embedding storage), CPU and GPU utilization, memory usage, cost per processed image, and system behavior under increasing load conditions.
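The submission pattern can be sketched with Python's standard thread pool. The pool size of four matches the description above; `submit_image` is a hypothetical placeholder for whichever client call (gRPC, queue publish, or endpoint invocation) the architecture under test requires.

```python
import time
from concurrent.futures import ThreadPoolExecutor

SUBMISSION_THREADS = 4  # four concurrent threads, as described above


def submit_image(image_id):
    """Hypothetical stand-in for sending one image to the system under test."""
    return f"embedding-{image_id}"


def run_workload(image_ids):
    """Submit every image through the pool and time the whole run."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=SUBMISSION_THREADS) as pool:
        results = list(pool.map(submit_image, image_ids))  # keeps input order
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = run_workload(range(100))
```

Timing the entire `with` block captures the span from the first submission to the last returned result, which approximates the end-to-end processing time defined above when storage happens inside the submitted call.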
4. Evaluation
This section presents a comprehensive experimental evaluation of the three proposed communication architectures for large-scale image encoding systems. We design controlled experiments to measure performance and cost-effectiveness across different workload patterns.
4.1. Experimental Setup
Next, we detail the experimental setup used in this work. First, we detail the hardware and infrastructure configuration employed in the experiments. Then, we specify the dataset and workload utilized. Finally, we discuss specific configurations depending on the architecture used in the experiments.
4.1.1. Hardware and Infrastructure Configuration
All experiments are conducted on standardized hardware configurations to ensure fair comparison. The hardware and infrastructure configuration employed in the experiments is as follows:
- Downloaders and RabbitMQ Broker: one AWS EC2 t3.xlarge instance (4 vCPUs, 16 GB RAM);
- Workers: one AWS EC2 g5.2xlarge instance (8 vCPUs, 32 GB RAM, 1 NVIDIA A10G Tensor Core GPU);
- Network: all instances within the same AWS availability zone (eu-west-1a);
- AWS Lambda: 1024 MB memory allocation, 15 min timeout;
- SageMaker Endpoint: one AWS SageMaker ml.g5.2xlarge instance (8 vCPUs, 32 GB RAM, 1 NVIDIA A10G Tensor Core GPU);
- Concurrency Limits: Lambda concurrent executions set to 100.
4.1.2. Dataset and Workload Specification
The dataset and workload utilized for the tests are the following:
Image Characteristics: Resolutions from 640 × 480 to 1024 × 768, with file sizes of roughly 50–100 KB;
Workload Sizes: 4.5 K, 9 K, 13.5 K, and 18 K images;
Submission Pattern: Four concurrent threads submitting images sequentially.
The SigLIP model (ViT-B/16 variant) [35] is used to process images into 768-dimensional embedding vectors, representing each image in a semantically meaningful vector space.
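To show what comparing images in such a vector space looks like, the sketch below ranks a toy catalog by cosine similarity. Cosine similarity is an assumed metric, since the text does not state which one the visual search database uses, and the exhaustive scan is a stand-in for the ANN index; the vectors and names are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, catalog):
    """Return the catalog key whose vector is closest to `query`."""
    return max(catalog, key=lambda name: cosine_similarity(query, catalog[name]))


# Toy 3-dimensional vectors; the embeddings described above have 768 dimensions.
catalog = {
    "red_shoe": [0.9, 0.1, 0.0],
    "blue_shoe": [0.1, 0.9, 0.0],
    "green_bag": [0.0, 0.1, 0.9],
}
best = most_similar([0.8, 0.2, 0.0], catalog)
```

A brute-force scan like this grows linearly with catalog size, which is the cost that approximate nearest neighbor indexes are designed to avoid.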
The selected workload sizes reflect typical enterprise-scale catalog ingestion scenarios, ranging from small merchant databases (4.5 K images) to large multi-tenant datasets (18 K images). The use of four concurrent submission threads corresponds to the observed level of parallelism in production ingestion pipelines, ensuring realism while maintaining experimental control.
4.1.3. Architecture-Specific Configurations
The specific configurations, depending on the architecture used in the experiments, are as follows:
gRPC architecture (Figure 1):
- Connection pooling with four persistent channels per Downloader;
- gRPC compression not enabled;
- Request timeout: 180 s.
RabbitMQ architecture (Figure 2):
- Two queues: one for Downloaders and one for Workers;
- Queue durability: enabled for fault tolerance;
- Prefetch count: 50 messages per Worker;
- Message persistence: enabled;
- Dead letter queue: configured for failed messages.
Serverless Lambda + RabbitMQ architecture (Figure 3):
- Batch size: 100–200 images per Lambda invocation;
- Retry policy: three attempts with exponential back-off;
- Lambda warm-up: five concurrent executions maintained.
Serverless SageMaker architecture (Figure 4):
- Endpoint configuration: single ml.g5.2xlarge instance.
4.2. Performance Metrics
We measure the following key performance indicators:
End-to-End latency: Time from first image submission to final embedding storage;
Average throughput: Images processed per second over the entire workload;
GPU utilization: Average GPU usage during processing;
CPU utilization: Average CPU usage across all components;
Memory usage: Peak memory consumption per component;
Cost: Total infrastructure cost divided by images processed.
Throughout this study, performance superiority is defined as a statistically stable improvement in throughput and/or GPU utilization efficiency, confirmed by repeated experimental runs and low inter-run variance.
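The throughput metric and the notion of low inter-run variance can be made concrete as below. The elapsed times are invented for illustration and are not measurements from this study; using the coefficient of variation as the stability measure is also an assumption, since the text does not name a specific statistic.

```python
import statistics


def throughput(images_processed, elapsed_seconds):
    """Average throughput in images per second."""
    return images_processed / elapsed_seconds


def summarize_runs(throughputs):
    """Mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(throughputs)
    stdev = statistics.stdev(throughputs)  # needs at least two runs
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Hypothetical elapsed times (seconds) for three runs of a 9,000-image workload.
runs = [throughput(9000, t) for t in (600.0, 610.0, 590.0)]
summary = summarize_runs(runs)
```

A small coefficient of variation across repeated runs is one way to support the claim that an observed throughput difference is stable rather than noise.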
4.3. Experimental Results
In this section, we evaluate the proposed communication architectures. We begin with an end-to-end performance comparison to benchmark them. Following this, we analyze their resource utilization to quantify the computational efficiency and hardware demanded by each approach.
4.3.1. End-to-End Performance Comparison
Figure 5 and Figure 6 present the performance analysis across all architectures and workload sizes, showing processing time and throughput, respectively. Based on these figures, we analyze the performance of each evaluated architecture:
gRPC architecture: Maintains the highest and most consistent throughput across all workload sizes, with minimal performance degradation as the dataset scales. The direct communication between Downloaders and Workers eliminates queuing overhead, resulting in optimal resource utilization and predictable performance scaling.
RabbitMQ architecture: Shows variable performance patterns. Notably, the architecture exhibits a throughput drop at the 13.5 K image workload, followed by a partial recovery at higher load, as illustrated in Figure 6. This pattern suggests queue management overhead that becomes more efficient with sustained high-volume processing. Further optimization of queue configurations, prefetch counts, and message batching strategies could potentially improve these results.
Lambda + RabbitMQ architecture: Exhibits the most consistent but lowest throughput, with performance stabilizing for larger workloads. The consistent performance across different scales demonstrates the predictable nature of serverless scaling, though cold start impacts and function invocation overhead limit peak throughput. Performance could be enhanced through optimization of batch sizes, concurrent execution limits, and Lambda memory allocation.
SageMaker architecture: Demonstrates strong and consistent performance. The managed infrastructure provides reliable performance scaling, with processing times showing near-linear growth with workload size, as shown in Figure 5.
Figure 5.
Total end-to-end processing time (minutes) of the evaluated architectures for different workload sizes (number of images).
Figure 6.
Average throughput (images per second) of the evaluated architectures for different workload sizes (number of images).
4.3.2. Resource Utilization
While Figure 5 and Figure 6 in the previous section summarized the complete end-to-end temporal behavior of all architectures, Table 1 and Table 2 in this section report complementary resource utilization metrics. The resource utilization analysis focuses on active processing periods, excluding setup and idle times between experiments. Each architecture was monitored during four sequential workloads (4.5 K, 9 K, 13.5 K, and 18 K images). As we can observe in these tables, the results vary depending on the architecture:
gRPC architecture: During active processing periods, this architecture shows high GPU utilization (over 90%), regardless of the workload size. CPU utilization while processing is very stable, with average values below 40% and only occasional peaks above this value. This higher use of resources explains the better processing time and throughput previously reported for this architecture.
RabbitMQ architecture: Shows the lowest GPU utilization (around 70%). Average CPU utilization (over 35%) is similar to that of the other architectures; however, some peaks exceed 64%, reducing stability.
Lambda + RabbitMQ architecture: GPU utilization is somewhat higher (around 80%) than for RabbitMQ without Lambda. It has the lowest average CPU usage (above 25%), but also the lowest stability, with some CPU peaks close to 70%.
SageMaker architecture: Similarly to gRPC, this architecture presents high GPU utilization (close to 99%) and an average CPU utilization over 30%. However, as with Lambda + RabbitMQ, CPU utilization is less stable, with peaks near 70%.
Note that these results are further discussed in Section 5.
4.4. Cost Analysis
This section provides a comprehensive cost analysis of each architecture based on AWS pricing in the eu-west-1 region at the time of writing this paper. Before discussing the cost of each architecture, we specify next how we have calculated it.
4.4.1. Pricing
For pricing purposes, we consider the following costs for each infrastructure:
4.4.2. Cost Calculation Methodology
For each architecture, we calculate:
Compute cost: Instance running time × hourly rate.
Request cost: Number of invocations × request pricing (Lambda only).
Total cost: Sum of all infrastructure components.
Cost per 1 K images: Total cost/(number of images/1000).
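The calculation can be written out as a short sketch, reading the compute and request formulas above as plain products. All prices and durations below are placeholders chosen for easy arithmetic, not actual AWS rates or measured running times.

```python
def compute_cost(running_hours, hourly_rate):
    """Compute cost: instance running time multiplied by the hourly rate."""
    return running_hours * hourly_rate


def request_cost(invocations, price_per_request):
    """Request cost: number of invocations multiplied by the per-request price."""
    return invocations * price_per_request


def cost_per_thousand(total_cost, images):
    """Total cost divided by (number of images / 1000)."""
    return total_cost / (images / 1000)


# Placeholder prices and durations, NOT actual AWS rates or measurements.
total = (
    compute_cost(running_hours=0.5, hourly_rate=1.00)     # GPU worker
    + compute_cost(running_hours=0.5, hourly_rate=0.20)   # downloader host
    + request_cost(invocations=90, price_per_request=0.001)
)
per_thousand = cost_per_thousand(total, images=18_000)
```

Because the compute term scales with running time, an architecture with lower throughput pays more per image on the same instance; the per-1 K figure folds that effect into a single number.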
4.4.3. Architecture-Specific Infrastructure Components Cost
Next, we present a more detailed cost for each infrastructure:
Considering the above methodology, Table 3 presents the cost of each architecture. As we can observe, gRPC and RabbitMQ present a similar cost per hour. Lambda + RabbitMQ is slightly lower, but the cost per hour does not include the Lambda, whose cost varies depending on the specific execution time. Finally, SageMaker is the architecture presenting the highest cost per hour.
Table 3 also shows the cost for processing 1000 images. As we can observe, the lowest price is provided by gRPC, followed by RabbitMQ and Lambda + RabbitMQ, which present a similar cost. SageMaker is the architecture presenting the highest cost.
4.4.4. Cost Sensitivity and Deployment Scenarios
While the above cost figures assume continuous workload execution, real-world deployments often exhibit diverse ingestion patterns. For example, gRPC and RabbitMQ-based architectures are particularly cost-effective for sustained, high-throughput ingestion pipelines, such as nightly or batch catalog indexing. In contrast, serverless Lambda-based pipelines become economically favorable under intermittent workloads, where idle infrastructure costs can be eliminated. SageMaker, despite higher hourly pricing, can be justified in production environments that require operational simplicity, managed scalability, and consistently high GPU utilization. These scenarios illustrate how deployment characteristics directly affect the cost-performance trade-offs discussed in this section.
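The intermittent-workload argument can be illustrated with a deliberately simplified break-even model. It assumes that an always-on deployment is billed for all 24 h of a day, whereas a pay-per-use deployment is billed only for busy hours but at a higher effective hourly rate. The function names and the rates below are hypothetical placeholders and do not correspond to the prices or architectures measured in this study.

```python
HOURS_PER_DAY = 24.0


def daily_cost_always_on(hourly_rate: float) -> float:
    """Always-on infrastructure is billed for the full day, busy or idle."""
    return HOURS_PER_DAY * hourly_rate


def daily_cost_pay_per_use(busy_hours: float, busy_hourly_rate: float) -> float:
    """Pay-per-use infrastructure is billed only while work is processed."""
    return busy_hours * busy_hourly_rate


def break_even_busy_hours(always_on_rate: float, busy_hourly_rate: float) -> float:
    """Busy hours per day below which pay-per-use is cheaper (capped at 24)."""
    if always_on_rate <= 0 or busy_hourly_rate <= 0:
        raise ValueError("rates must be positive")
    return min(HOURS_PER_DAY, HOURS_PER_DAY * always_on_rate / busy_hourly_rate)


# Placeholder rates: pay-per-use assumed 3x more expensive per busy hour.
threshold = break_even_busy_hours(always_on_rate=1.0, busy_hourly_rate=3.0)
```

Under these placeholder rates, the pay-per-use option is cheaper only while the pipeline is busy for fewer hours per day than `threshold`; a sustained batch workload that runs longer favors the always-on deployment.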
5. Discussion
In this section, we discuss the results presented in the previous section. We present the key findings, resource efficiency insights, architectural trade-offs, and selection guidance.
5.1. Key Findings
Our experimental evaluation reveals distinct performance profiles across architectures. gRPC emerges as the performance leader, achieving a consistent throughput of 16.9 images/s with a high GPU utilization (around 95% during active processing) and exceptional CPU stability (1.9% standard deviation and mostly low peak CPU usage). The direct communication model eliminates message broker overhead, enabling immediate processing upon image availability and maintaining predictable scaling across workload sizes.
RabbitMQ architectures excel in fault tolerance and system decoupling but incur performance trade-offs, with around 74% GPU utilization and a throughput of 13.6 images/s. The message-driven approach provides superior reliability through automatic retry mechanisms and dead letter queues, although queue management overhead creates processing variability (CPU standard deviation: 8.6%).
Serverless Lambda + RabbitMQ architectures prioritize cost efficiency and operational flexibility, but they require further optimization research, particularly in batch aggregation strategies and efficient serialization protocols for Worker communication. Current limitations include suboptimal batch size configurations and communication overhead between Lambda functions and Workers, which create processing gaps that reduce GPU utilization efficiency. At present, both the throughput achieved (around 11.7 images/s) and the use of resources are inefficient.
SageMaker demonstrates the best balance between performance and operational simplicity, achieving around 98% GPU utilization and a throughput of 15.4 images/s while significantly reducing infrastructure management complexity. The managed infrastructure provides the highest GPU utilization, but not the best throughput, making it attractive for production deployments requiring both performance and operational scalability.
5.2. Resource Efficiency Insights
The analysis establishes a strong correlation between GPU utilization efficiency and system throughput, with architectures maintaining >90% GPU utilization achieving superior performance. This finding emphasizes the critical importance of minimizing idle GPU time in compute-intensive image processing workflows. CPU stability emerges as a secondary but significant factor, with consistent resource usage patterns correlating with predictable performance scaling.
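As a rough illustration of this relationship, the sketch below computes the Pearson correlation coefficient between the approximate GPU utilization and throughput figures quoted in Sections 4 and 5.1. The `pearson` helper is a hypothetical name, and with only four data points the result should be read as indicative rather than as a statistical test.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Approximate values as quoted in the text (GPU utilization in %, images/s).
gpu_util = {"gRPC": 95, "RabbitMQ": 74, "Lambda+RabbitMQ": 80, "SageMaker": 98}
throughput = {"gRPC": 16.9, "RabbitMQ": 13.6, "Lambda+RabbitMQ": 11.7, "SageMaker": 15.4}

archs = list(gpu_util)
r = pearson([gpu_util[a] for a in archs], [throughput[a] for a in archs])
```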
The resource analysis reveals that architectural overhead directly impacts processing efficiency. Direct communication models (gRPC) and managed infrastructure (SageMaker) minimize processing gaps, while message-driven architectures introduce queuing overhead that proportionally reduces GPU feeding efficiency.
5.3. Architectural Trade-Offs and Selection Guidance
The study identifies clear trade-offs between performance, cost, operational complexity, and fault tolerance. Thus, for high-throughput, performance-critical scenarios, gRPC provides optimal end-to-end performance despite not achieving peak resource utilization, demonstrating that communication efficiency can outweigh pure compute optimization in determining system throughput.
In the case of fault-tolerant, decoupled systems, RabbitMQ architectures excel when system reliability and component independence are prioritized, though with significant performance trade-offs (around 73% GPU utilization) due to substantial queue management overhead.
For variable, cost-sensitive workloads, serverless Lambda architectures provide better cost efficiency and operational flexibility and are suitable for applications with intermittent processing requirements, but they need further tuning to achieve better performance.
Finally, for resource-efficient processing with managed complexity, SageMaker offers the highest GPU utilization (98.1%) while eliminating infrastructure management overhead, making it ideal for scenarios where resource efficiency is prioritized and moderate communication latency is acceptable.
Compared with prior benchmarks of microservice communication and serverless pipelines already discussed in this work, our results confirm earlier observations regarding queuing overhead in asynchronous messaging systems and cold start penalties in serverless platforms [22,23,24,26,27], while extending these findings to GPU-bound image encoding workloads, which remain underexplored in existing benchmark studies.
From a broader systems perspective, recent advances in optimization and secure distributed learning further contextualize our architectural comparison. Hybrid and quantum-inspired optimization algorithms have been proposed to address complex, high-dimensional optimization problems, illustrating the importance of scalable and communication-efficient infrastructures [36,37]. In parallel, privacy-preserving vertical-horizontal federated learning frameworks enable secure data sharing across heterogeneous sources, emphasizing how communication architecture choices directly impact scalability, efficiency, and trust in distributed AI systems [38]. Our findings also align with recent work on trustworthy and privacy-preserving distributed infrastructures, such as secure federated learning frameworks for large-scale IoT and cybersecurity systems [39]. These developments reinforce the significance of our experimental evaluation for designing robust, efficient pipelines in modern large-scale AI deployments.
6. Conclusions
This study demonstrates that the selection of communication architecture significantly impacts system performance in large-scale image encoding applications. The choice between architectures should be driven by specific requirements: gRPC for maximum performance, RabbitMQ for fault tolerance, Lambda for cost efficiency, and SageMaker for balanced performance and simplicity. The established performance-resource correlations provide quantitative guidance for system architects designing visual content-based search systems at scale.
For future work, several research directions emerge from this analysis. First, hybrid architectures combining the performance benefits of direct communication with the fault tolerance of message-driven systems warrant investigation. Second, optimization strategies for queue-based architectures, including advanced batching and prefetch mechanisms, could reduce the observed performance gaps. Third, the impact of different AI model architectures (beyond SigLIP) on communication architecture performance requires evaluation. Finally, cost-performance optimization through dynamic architecture selection based on workload characteristics presents opportunities for adaptive system design.
The comparative framework developed in this study provides a foundation for the evaluation of emerging communication architectures and can be extended to other computation-intensive AI applications beyond visual search systems.
Future research should explore adaptive hybrid pipelines that dynamically switch between direct and queue-based communication based on real-time GPU utilization feedback, as well as cost-aware schedulers that jointly optimize throughput and cloud expenditure under variable workloads.
Author Contributions
Conceptualization, H.Z., C.R. and J.F.A.-S.; methodology, C.R.; software, H.Z. and J.F.A.-S.; validation, C.R. and J.F.A.-S.; formal analysis, C.R.; investigation, H.Z. and J.F.A.-S.; resources, H.Z. and J.F.A.-S.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, H.Z., C.R. and J.F.A.-S.; visualization, C.R.; supervision, C.R. and J.F.A.-S.; project administration, C.R. and J.F.A.-S.; funding acquisition, C.R. and J.F.A.-S. All authors have read and agreed to the published version of the manuscript.
Funding
This work was funded by the Valencian Innovation Agency (AVI) under Grant INNTA3/2023/17.
Data Availability Statement
Data available on request due to restrictions: The data presented in this study are available on request from the corresponding author due to commercial reasons.
Acknowledgments
The authors are grateful for the support provided by Kimera Technologies and its team. AI-assisted language editing (M365 Copilot, based on Microsoft GPT-5 model) was used to improve the readability of the manuscript.
Conflicts of Interest
Authors Haojie Zheng and Juan F. Ariño-Sales were employed by the company Kimera Technologies, S.L. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AWS | Amazon Web Services |
| CPU | Central Processing Unit |
| RAM | Random Access Memory |
| GPU | Graphics Processing Unit |
| AI | Artificial Intelligence |
| API | Application Programming Interface |
| EC2 | Elastic Compute Cloud |
| ANN | Artificial Neural Networks |
| CLIP | Contrastive Language-Image Pretraining |
| EPypes | Event-Driven Pipeline Framework |
| SQS | Simple Queue Service |
| DAGs | Directed Acyclic Graphs |
| RPC | Remote Procedure Call |
References
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning; Meila, M., Zhang, T., Eds.; Proceedings of Machine Learning Research; PMLR: Cambridge, MA, USA, 2021; Volume 139, pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 14 April 2026).
- Zhang, Y.; Pan, P.; Zheng, Y.; Zhao, K.; Zhang, Y.; Ren, X.; Jin, R. Visual Search at Alibaba. In KDD ’18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2018; pp. 993–1001.
- Yang, F.; Kale, A.; Bubnov, Y.; Stein, L.; Wang, Q.; Kiapour, H.; Piramuthu, R. Visual Search at eBay. In KDD ’17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2017; pp. 2101–2110.
- Zheng, Y.; Xie, X.; Ma, W.Y. Distributed Architecture for Large Scale Image-Based Search. In 2007 IEEE International Conference on Multimedia and Expo; IEEE: Piscataway, NJ, USA, 2007; pp. 579–582.
- Deng, W.; Li, H.; Zhao, H. Antinoise Bearing Fault Diagnosis Using Time-Reassigned Multisynchrosqueezing Transform and Complex Sparse Learning Dictionary. IEEE Trans. Instrum. Meas. 2025, 74, 3557310.
- Zhao, H.; Liu, C.; Dang, X.; Xu, J.; Deng, W. Few-Shot Cross-Domain Fault Diagnosis of Transportation Motor Bearings Using MAML-GA. IEEE Trans. Transp. Electrif. 2026, 12, 1165–1174.
- Semeniuta, O. EPypes: A Python Library for Developing Event-Driven Pipelines. Available online: https://github.com/semeniuta/EPypes (accessed on 14 April 2026).
- Semeniuta, O.; Falkman, P. EPypes: A framework for building event-driven data processing pipelines. PeerJ Comput. Sci. 2019, 5, e176.
- Broadcom Inc. RabbitMQ: One Broker to Queue Them All. Available online: https://www.rabbitmq.com/ (accessed on 14 April 2026).
- Apache Software Foundation. Apache Kafka. Available online: https://kafka.apache.org/ (accessed on 14 April 2026).
- Amazon Web Services. Amazon Simple Queue Service (Amazon SQS). Available online: https://aws.amazon.com/sqs (accessed on 14 April 2026).
- Hinze, A.; Sachs, K.; Buchmann, A. Event-based applications and enabling technologies. In DEBS ’09: Proceedings of the Third ACM International Conference on Distributed Event-Based Systems; Association for Computing Machinery: New York, NY, USA, 2009.
- Amazon Web Services. AWS Lambda: Serverless Function, FaaS Serverless. Available online: https://aws.amazon.com/lambda (accessed on 14 April 2026).
- Dehury, C.; Jakovits, P.; Srirama, S.N.; Tountopoulos, V.; Giotis, G. Data Pipeline Architecture for Serverless Platform. In Proceedings of the Software Architecture; Muccini, H., Avgeriou, P., Buhnova, B., Camara, J., Caporuscio, M., Franzago, M., Koziolek, A., Scandurra, P., Trubiani, C., Weyns, D., et al., Eds.; Springer: Cham, Switzerland, 2020; pp. 241–246.
- Amazon Web Services. Amazon S3: Cloud Object Storage. Available online: https://aws.amazon.com/es/s3/ (accessed on 14 April 2026).
- Amazon Web Services. AWS Data Pipeline. Available online: https://docs.aws.amazon.com/data-pipeline/ (accessed on 14 April 2026).
- Apache Software Foundation. Apache Airflow. Available online: https://airflow.apache.org/ (accessed on 14 April 2026).
- Testas, A. Scalable Deep Learning Pipelines with Apache Airflow. In Building Scalable Deep Learning Pipelines on AWS: Develop, Train, and Deploy Deep Learning Models; Apress: Berkeley, CA, USA, 2024; pp. 489–584.
- Apache Software Foundation. Apache PySpark. Available online: https://spark.apache.org/docs/latest/api/python/index.html (accessed on 14 April 2026).
- The Linux Foundation. PyTorch. Available online: https://pytorch.org/ (accessed on 14 April 2026).
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2016, arXiv:1603.04467.
- Kumar, P.K.; Agarwal, R.; Shivaprasad, R.; Sitaram, D.; Kalambur, S. Performance Characterization of Communication Protocols in Microservice Applications. In Proceedings of the 2021 International Conference on Smart Applications, Communications and Networking (SmartNets), Glasgow, UK, 22–24 September 2021; pp. 1–5.
- Weerasinghe, L.; Perera, I. Evaluating the Inter-Service Communication on Microservice Architecture. In Proceedings of the 2022 7th International Conference on Information Technology Research (ICITR), Moratuwa, Sri Lanka, 7–9 December 2022; pp. 1–6.
- Dobbelaere, P.; Esmaili, K.S. Kafka versus RabbitMQ: A comparative study of two industry reference publish/subscribe implementations: Industry Paper. In DEBS ’17: Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems; Association for Computing Machinery: New York, NY, USA, 2017; pp. 227–238.
- Wen, J.; Chen, Z.; Jin, X.; Liu, X. Rise of the Planet of Serverless Computing: A Systematic Review. ACM Trans. Softw. Eng. Methodol. 2023, 32, 131.
- Mathew, A.; Andrikopoulos, V.; Blaauw, F.J. Exploring the cost and performance benefits of AWS step functions using a data processing pipeline. In UCC ’21: Proceedings of the 14th IEEE/ACM International Conference on Utility and Cloud Computing; Association for Computing Machinery: New York, NY, USA, 2021.
- Yu, T.; Liu, Q.; Du, D.; Xia, Y.; Zang, B.; Lu, Z.; Yang, P.; Qin, C.; Chen, H. Characterizing serverless platforms with serverlessbench. In SoCC ’20: Proceedings of the 11th ACM Symposium on Cloud Computing; Association for Computing Machinery: New York, NY, USA, 2020; pp. 30–44.
- Lee, S.; Oh, J.; Go, S.; Mahajan, D. Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications. In 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS); IEEE: Piscataway, NJ, USA, 2025; pp. 353–355.
- Ovi, M.S.I. A Study on Distributed Strategies for Deep Learning Applications in GPU Clusters. arXiv 2026, arXiv:2505.12832.
- Xing, X.; Ling, J.; Liu, S.; Tao, Z. Physics-informed neural network for thermal property inversion of airport pavement multilayer materials under icing conditions. Constr. Build. Mater. 2026, 522, 146164.
- Hou, T.; Liu, S.; Mao, W.; Zhao, J.; Ling, J.; Xing, X. Stochastic vibration analysis of aircraft-rigid pavement system under random aircraft parameters. Int. J. Pavement Eng. 2026, 27, 2666268.
- Clark, J.A.; Lundh, F.; Contributors. Pillow (PIL Fork). Available online: https://pillow.readthedocs.io/ (accessed on 14 April 2026).
- gRPC Authors. gRPC: A High Performance, Open Source Universal RPC Framework. Available online: https://grpc.io/ (accessed on 14 April 2026).
- Amazon Web Services. Amazon SageMaker: The Center for All Your Data, Analytics, and AI. Available online: https://aws.amazon.com/sagemaker (accessed on 14 April 2026).
- Tschannen, M.; Gritsenko, A.; Wang, X.; Naeem, M.F.; Alabdulmohsin, I.; Parthasarathy, N.; Evans, T.; Beyer, L.; Xia, Y.; Mustafa, B.; et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv 2025, arXiv:2502.14786.
- Chen, Y.; Xu, H.; Liu, J.; Hou, M.; Li, Y.; Qiu, S.; Sun, M.; Zhao, H.; Deng, W. A Hybridizing-Enhanced Quantum-Inspired Differential Evolution Algorithm with Multi-Strategy for Complicated Optimization. J. Artif. Intell. Soft Comput. Res. 2025, 16, 5–37.
- Zhao, H.; Li, L.; Deng, W. Multi-UAV Path Planning Using Improved Artificial Hummingbird Algorithm Based on Differential Evolution and Gradient Descent. IEEE Trans. Consum. Electron. 2026, 72, 558–569.
- Deng, W.; Li, X.; Sun, Y.; Zhao, H. Privacy Protection-Enhanced Vertical-Horizontal Federated Learning Secure Sharing for Multisource Heterogeneous Data. IEEE Trans. Ind. Inform. 2026, 22, 3138–3147.
- Kumar, D.; Pramod Pawar, P.; Kumar Meesala, M.; Kumar Pareek, P.; Reddy Addula, S.; K S, S. Trustworthy IoT Infrastructures: Privacy-Preserving Federated Learning with Efficient Secure Aggregation for Cybersecurity. In 2024 International Conference on Integrated Intelligence and Communication Systems (ICIICS); IEEE: Piscataway, NJ, USA, 2024; pp. 1–8.