1. Introduction
Artificial intelligence-based computer vision has seen rapid adoption across numerous domains, including industrial manufacturing [
1] and medical applications [
2]. This widespread uptake is primarily driven by recent breakthroughs in deep learning methods [
3] together with the growing availability of large-scale annotated datasets [
4], which have substantially improved the performance and reliability of modern computer vision systems.
Despite these advances and their transformative potential, deploying computer vision solutions at scale remains a difficult task for many organizations, particularly small and medium-sized enterprises [
5]. A major limiting factor is the considerable investment required for computational infrastructure, which is necessary not only for training and inference of deep learning models but also for the storage and processing of massive volumes of visual data.
To mitigate these economic and technical barriers, cloud computing [
6] has become a prominent enabling technology. Major cloud providers, such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, offer elastic and adaptable infrastructures that support data-intensive workloads and machine learning pipelines. The capability to dynamically allocate resources and follow a usage-based pricing scheme makes cloud platforms particularly appealing for organizations with limited in-house resources.
Nevertheless, adopting cloud-based solutions for computer vision introduces its own set of challenges. Conventional cloud storage systems and data management architectures are typically designed for structured or semi-structured data, which can lead to inefficiencies when handling large-scale image and video collections. This limitation highlights the need for advanced load balancing strategies and specialized database architectures tailored to the unique requirements of computer vision workloads in cloud environments [
7].
The aim of this paper is to explore how emerging technologies in computer vision and cloud computing can be effectively combined to address these challenges and unlock new opportunities for businesses. To this end, we design a distributed server infrastructure in which the computing cost associated with artificial intelligence is distributed across multiple sub-processes. For evaluation purposes, we use a real-world use case from Kimera Technologies [
8], an AI-based image processing and classification application, and tune the deployments to the cost and performance requirements of this company.
In this context, the central objective of this paper is to address the following engineering research question: How do different cloud deployment models, specifically Infrastructure as a Service (IaaS) and Container as a Service (CaaS), affect the performance–cost trade-off of scalable, GPU-accelerated computer vision applications under realistic workloads? While the evaluation is grounded in a real-world industrial use case, the goal is not to optimize a single application instance, but rather to extract general insights that can guide deployment decisions for similar AI-based image processing and classification systems.
This paper makes the following main contributions:
We designed and evaluated three distinct cloud deployment architectures for an AI-based computer vision application, covering monolithic IaaS, decomposed IaaS, and container-based CaaS models.
We provided a comprehensive experimental evaluation that jointly analyzes execution time, scalability behavior, resource utilization patterns, and monetary cost.
We quantified the trade-offs introduced by containerized deployments, including scale-up latency and orchestration overhead, in comparison to VM-based approaches.
We derived practical recommendations for deploying GPU-accelerated computer vision workloads in cloud environments, grounded in real-world industrial constraints.
This paper extends our previous work [
9]. Compared with our preliminary conference publication, this journal version significantly extends the work by (i) introducing two additional deployment architectures (multiple EC2 and ECS); (ii) providing a comprehensive evaluation that includes scalability, detailed resource utilization, load testing, and cost analysis; (iii) analyzing container-induced performance penalties and scale-up latency; and (iv) drawing broader insights into deployment trade-offs relevant to real-world AI systems. These extensions substantially deepen both the experimental and engineering contributions, supporting publication as a journal article.
The remainder of this paper is organized as follows:
Section 2 presents related work. Next,
Section 3 discusses the research questions and the solution proposed. Later,
Section 4 presents the evaluation of the results and
Section 5 discusses them. Finally,
Section 6 concludes the paper.
2. Related Work
This section provides an overview of relevant literature on cloud computing architectures, with particular emphasis on scalable and distributed applications in the cloud.
2.1. Cloud Computing
Cloud computing refers to the delivery of information technology (IT) resources through service-oriented cloud platforms. Several cloud service models have been established to address different levels of abstraction, including Infrastructure as a Service (IaaS), Container as a Service (CaaS), and Function as a Service (FaaS). In addition, users can leverage a variety of cloud-based solutions provided over the Internet to satisfy evolving computational demands, as offered by major providers, such as Amazon AWS, Google Cloud Platform, and Microsoft Azure.
2.1.1. Infrastructure as a Service (IaaS)
Infrastructure as a Service (IaaS) [
10] enables users to provision and manage virtual machines (VMs) with a high degree of control. This model allows customers to configure multiple aspects of the virtualized environment, including the number of Central Processing Units (CPU), the amount of Random Access Memory (RAM), storage capacity, networking components, and hardware accelerators, such as Graphics Processing Units (GPUs). Such flexibility provides extensive customization options and enables fine-grained cost control. However, these advantages come at the expense of increased operational complexity, as users are responsible for managing and orchestrating the underlying resources. As the scale of the deployed application grows, handling and scaling these resources efficiently can become challenging [
11].
2.1.2. Container as a Service (CaaS)
Container as a Service (CaaS) [
12] provides a cloud-based environment for orchestrating, deploying, and managing containerized applications. This model enables developers to encapsulate software together with its dependencies, thereby enhancing portability and runtime consistency. Widely used container technologies, such as Docker [
13] and orchestration platforms like Kubernetes [
14], streamline application delivery and automate operational workflows. By abstracting much of the underlying infrastructure, CaaS reduces the burden on development teams, allowing them to concentrate on application logic rather than configuration and system maintenance. Nevertheless, this approach introduces security considerations, as containerized ecosystems require careful governance to mitigate potential vulnerabilities and ensure safe operation.
2.1.3. Function as a Service (FaaS)
Function as a Service (FaaS) enables developers to implement application logic as discrete functions that are triggered by predefined events, abstracting away the management of servers and runtime infrastructure. This execution model supports a pay-per-use cost structure, whereby charges are incurred only during function invocation. As a result, FaaS is well-suited for workloads that demand elastic scalability and for development scenarios where efficient utilization of computing resources is a priority. Despite these advantages, FaaS presents several limitations. The initialization of the execution environments can introduce latency, commonly referred to as cold-start delay, which may affect performance-sensitive applications. Furthermore, observing, debugging, and monitoring serverless functions is often more complex than in traditional deployment models. These drawbacks become more pronounced when integrating machine learning workloads, as the inclusion of models significantly increases execution duration as well as CPU and memory consumption, leading to higher operational costs.
2.1.4. Comparative Analysis of Cloud Service Models
While IaaS, CaaS, and FaaS each offer distinct advantages, their suitability depends on workload characteristics. IaaS provides fine-grained control and cost efficiency for long-running workloads, but introduces operational complexity. CaaS strikes a balance by abstracting infrastructure management while retaining deployment flexibility, although it introduces orchestration overhead and security concerns. In contrast, FaaS offers superior elasticity and ease of deployment, but suffers from cold-start latency and limited execution control.
Recent studies suggest that no single model consistently outperforms the others across all metrics, such as latency, cost, and scalability. Instead, hybrid approaches are increasingly favored, although they introduce additional system complexity and integration challenges.
Table 1 synthesizes all the relevant information regarding cloud service models to allow for an immediate comparison.
2.1.5. Previous Studies
Although each cloud computing model presents its own limitations and trade-offs, several studies have demonstrated effective strategies to mitigate these issues or propose alternative solutions. For example, Zhang et al. [
15] analyze cost-efficient approaches for delivering Machine Learning as a Service (MLaaS) on public cloud platforms, such as Amazon AWS. Their study compares the cost and average response latency of different cloud service models (EC2 [
16] representing IaaS, ECS [
17] for CaaS, and AWS Lambda [
18] for FaaS) under a workload of one million requests using three representative machine learning models. The experimental results indicate that Lambda incurs the highest cost and latency among the evaluated options. However, it benefits from rapid provisioning times on the order of seconds. ECS exhibits intermediate performance in terms of both cost and latency, with resource provisioning typically requiring several minutes. In contrast, EC2 provides the lowest latency and cost, although its provisioning time is comparable to that of ECS. To address scalability challenges, the authors propose a hybrid deployment strategy that combines EC2 and Lambda, where IaaS serves as the primary execution environment and FaaS is leveraged during instance initialization phases or to accommodate short-lived demand spikes. Complementary findings are reported by Ramesh et al. [
19], who demonstrate that cloud-based artificial intelligence workflows can be deployed in a cost-effective manner by strategically integrating FaaS with cloud storage services, thereby reducing operational overhead while maintaining scalability.
More recent research has focused extensively on addressing the performance and scalability limitations of serverless and hybrid cloud architectures. A significant body of work targets the cold-start problem, which remains one of the primary bottlenecks in FaaS platforms. For example, recent large-scale empirical analyses [
20] reveal that cold-start latency is influenced by multiple factors, including runtime environments, scheduling delays, and dependency deployment, with delays reaching several seconds in real-world systems. Complementary studies propose predictive and learning-based approaches to mitigate these effects, such as deep learning–driven scheduling policies and workload forecasting techniques that proactively reduce initialization overhead [
21]. In addition, lightweight virtualization and container optimization strategies have been shown to significantly reduce startup latency while maintaining isolation guarantees [
22].
Beyond reactive optimizations, recent work explores adaptive and intelligent resource management. For instance, learned caching and resource allocation mechanisms have been proposed to dynamically adjust container life-cycles based on workload characteristics, outperforming static heuristics in heterogeneous environments [
23]. Similarly, fine-grained scheduling approaches that optimize function packaging and dependency management have demonstrated substantial improvements in response time and resource utilization [
24].
Another emerging direction is the integration of serverless computing within distributed and edge–cloud environments. Recent studies highlight that data movement, function orchestration, and cold-start overhead collectively dominate execution time in distributed serverless workflows, motivating new architectures that overlap computation and data transfer to improve performance [
25]. These findings indicate that future cloud systems must consider not only compute elasticity but also data locality and cross-layer optimization.
Each cloud computing model offers distinct strengths and limitations, resulting in multiple viable deployment strategies and a variety of approaches to mitigating their inherent challenges. Selecting the most suitable model ultimately depends on an organization’s specific requirements, requiring careful consideration of factors, such as cost efficiency, scalability, and the level of operational complexity imposed on development teams. Existing research indicates that hybrid approaches, which combine different service models, can be particularly effective. For example, IaaS may serve as the primary execution environment, while FaaS can be employed to absorb transient workload spikes, or CaaS platforms (e.g., ECS) can be layered on top of IaaS resources (e.g., EC2) to facilitate container-based deployments. Such combinations enable flexible, scalable, and cost-effective solutions tailored to the deployment of machine learning models and artificial intelligence workflows in cloud environments.
2.2. Scalable and Distributed Applications on Cloud
Leymann et al. [
26] characterized scalability as the capability of an application to improve its performance through the addition of IT resources, typically without explicitly accounting for the release of those resources. In contrast, the same authors defined elasticity as the ability to dynamically both provision and deprovision IT resources, enabling an application to rapidly adapt its performance in response to workload fluctuations. This property is fundamental to fully leveraging the pay-per-use nature of cloud computing.
From an implementation perspective, cloud application scaling mechanisms can generally be categorized into two primary approaches: horizontal scaling and vertical scaling. Horizontal scaling refers to adjusting the number of independent computing resources, such as servers or virtual machines. When application demand increases, additional instances are provisioned to provide more computational power, storage capacity, or throughput. Conversely, resources are removed when demand decreases. Vertical scaling, on the other hand, involves enhancing the capabilities of existing resources, for example, by allocating additional CPU cores or memory to a server. This approach is inherently constrained by the maximum capacity supported by individual resources available from the cloud provider. Consequently, horizontal scaling is widely recommended as the preferred strategy in cloud environments.
Beyond scalability, cloud applications must also exhibit elasticity [
26]. To support this property, applications are typically designed as distributed systems composed of multiple loosely coupled components. The majority of these components should be stateless, meaning that they do not maintain session state between client requests, while only a limited subset is designed to manage stateful information.
Prior studies [
27] suggest that an ideal cloud platform should enable dynamic resource provisioning, effective load balancing, and seamless scaling with minimal performance degradation. Achieving such a balance remains an open research challenge, particularly when attempting to scale applications without increasing operational costs or reducing performance. As an example, Song et al. [
28] presented an autoscaling framework for an application programming interface (API) gateway built on Kubernetes [
14], demonstrating how container orchestration components can be leveraged to support scalable and elastic cloud applications.
2.3. Image Quality Assessment (IQA)
Recent advances in image quality assessment (IQA) have been driven by transformer-based architectures and large pre-trained models, which significantly improve generalization across diverse degradations. Comprehensive surveys, such as refs. [
29,
30], highlight how modern IQA models can be integrated into AI pipelines to improve robustness and efficiency. While the present work focuses on infrastructure-level deployment trade-offs, these advances suggest promising opportunities for combining IQA-based pre-filtering with scalable cloud architectures to reduce unnecessary computation and improve end-to-end system performance.
3. Solution Proposed
Building on the concepts discussed in the previous section, cloud environments provide multiple architectural approaches for achieving scalable and efficient deployments. In this context, we introduce three alternative models for designing the distributed server infrastructure. As previously described, these models are evaluated using a real-world use case from Kimera Technologies, which involves an AI-based system for image processing and classification. The following sections present and analyze these models in detail.
The first step involves decomposing our application into various components aiming to maintain as few stateful components as possible, as the Cloud Computing Patterns book [
26] recommends, to enhance the scalability and distribution of the cloud application. We divide the application into four main components (ignoring smaller components that are not part of the cloud deployment or can be subsumed under one of these four): REST API server, backend server, scaler/load balancer, and workers. Each component is described below.
3.1. REST API Server
This component serves as the primary entry point through which clients interact with the application, receiving incoming requests and routing them to the appropriate backend services. It is implemented using Flask (version 3.0.3) [
31], a lightweight Python web framework that facilitates rapid development of web applications, in combination with Gunicorn (version 22.0.0) [
32], a UNIX-based Web Server Gateway Interface (WSGI) HTTP server designed for production environments. The integration of Flask and Gunicorn enables reliable and scalable client-server communication, while Gunicorn’s worker-based architecture supports concurrent request processing and improves overall deployment efficiency.
3.2. Backend Server
The backend component is responsible for managing and persisting client-related data, as well as handling requests issued by the REST API server. It supports core operations, such as user authentication, database creation and management, record updates, deletions, and query execution. In addition to performing all read and write interactions with the database, this component maintains user session information and monitors client activity. When a request requires processing by artificial intelligence models, the backend coordinates with the corresponding worker nodes by forwarding the necessary data and collecting the resulting outputs. This design ensures efficient, reliable, and secure data handling throughout the system architecture.
3.3. Workers
The workers component plays a central role in executing the AI models and is architected to deliver fast response times while scaling efficiently under peak workloads. To support scalability, this component follows a stateless design, allowing instances to be added or removed dynamically as demand fluctuates. It is provisioned with substantial computational resources, including high-performance CPUs, large memory capacities, and GPUs, to ensure efficient execution of computationally intensive tasks. This configuration enables the system to process complex workloads and large datasets with low latency, thereby maximizing throughput and performance for data-driven artificial intelligence operations.
3.4. Scaler/Load Balancer
The scaler/load balancer monitors the status of workers, such as CPU, GPU, and RAM usage, to dynamically scale the workforce up or down based on demand. It maintains a list of all active workers’ IP addresses, ensuring that the backend server is matched with the most available worker for processing requests, thereby optimizing resource allocation and maintaining system efficiency. This component acts as both a scaler and a load balancer, intelligently distributing the workload among workers to ensure optimal performance.
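The matching of the backend server with the most available worker can be illustrated with a minimal sketch. It assumes that availability is determined by the busiest of the three monitored resources; the `WorkerStatus` type, the `pick_worker` function, and the IP addresses are hypothetical and do not come from the evaluated application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerStatus:
    """Utilization snapshot for one active worker (all values in percent)."""
    ip: str
    cpu: float
    gpu: float
    ram: float

    @property
    def load(self) -> float:
        # Assumption: the busiest resource decides how "available" a worker is.
        return max(self.cpu, self.gpu, self.ram)


def pick_worker(workers: list[WorkerStatus]) -> WorkerStatus:
    """Return the active worker with the lowest load."""
    if not workers:
        raise ValueError("no active workers")
    return min(workers, key=lambda w: w.load)


# Hypothetical snapshot of two active workers.
workers = [
    WorkerStatus("10.0.0.11", cpu=30.0, gpu=94.0, ram=55.0),
    WorkerStatus("10.0.0.12", cpu=20.0, gpu=65.0, ram=50.0),
]
chosen = pick_worker(workers)  # the second worker, whose GPU is less busy
```

Other definitions of availability (for example, a weighted combination of the three metrics) would fit the same interface.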
In the current implementation, a resource utilization threshold of 90% is used to trigger scale-up actions. This value was selected empirically to provide a balance between responsiveness and stability. It allows new worker instances to be provisioned before resource saturation leads to queuing delays, while avoiding unnecessary instance activation due to short-lived utilization spikes. Preliminary testing with lower thresholds resulted in increased operational cost without measurable performance improvement, whereas higher thresholds occasionally caused delayed scaling under peak load. More sophisticated scaling policies could be explored in future work. However, a threshold-based approach was chosen to reflect common industrial practice and to ensure reproducibility.
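A threshold-based policy of this kind can be sketched as follows. Only the 90% threshold is taken from the text; the number of consecutive samples required before acting (`sustain_samples`), the idle window (`idle_samples`), and the class itself are illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ThresholdScaler:
    threshold: float = 90.0   # scale-up threshold stated in the text (percent)
    sustain_samples: int = 3  # assumption: consecutive hot samples before scaling up
    idle_samples: int = 10    # assumption: consecutive idle samples before scaling down
    _hot: int = field(default=0, init=False)
    _idle: int = field(default=0, init=False)

    def decide(self, utilization: float, pending_jobs: int) -> str:
        """Return 'scale_up', 'scale_down', or 'hold' for one monitoring sample."""
        # Count consecutive samples at or above the threshold; a single
        # short-lived spike resets to zero on the next cooler sample.
        self._hot = self._hot + 1 if utilization >= self.threshold else 0
        # Count consecutive samples with no pending work.
        self._idle = self._idle + 1 if pending_jobs == 0 else 0

        if self._hot >= self.sustain_samples:
            self._hot = 0
            return "scale_up"
        if self._idle >= self.idle_samples:
            self._idle = 0
            return "scale_down"
        return "hold"


scaler = ThresholdScaler(sustain_samples=2, idle_samples=3)
samples = [(95, 4), (60, 4), (92, 5), (94, 5), (10, 0), (5, 0), (5, 0)]
decisions = [scaler.decide(u, jobs) for u, jobs in samples]
```

In this run, the isolated spike in the first sample is ignored, the two consecutive samples above 90% trigger a scale-up, and three consecutive idle samples trigger a scale-down.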
3.5. Proposed Models for the Distributed Server Infrastructure
Next, we discuss the proposed models, shown in
Figure 1. They are all based on Amazon AWS technologies, namely Elastic Compute Cloud (EC2) and Elastic Container Service (ECS). We provide the details below.
The first proposed model, shown in
Figure 1a, uses Amazon EC2 (IaaS). All the application components detailed previously are deployed on a single EC2 instance, which is always active and serves the requests of all clients. The instance includes a GPU, but only the worker uses it. Because there is only one worker, the scaler/load balancer component is not needed, and the backend server sends tasks directly to that worker. This is the most basic of the three models.
The second proposed model, shown in
Figure 1b, also uses Amazon EC2 (IaaS). In this case, we highly customized the entire configuration of the instance to adapt it to our needs. Thus, the REST API server, the backend server, and the scaler/load balancer run in one EC2 instance (without GPU, less powerful and cheaper), while the workers run in different EC2 instances (with GPU, more powerful and expensive). There is always one worker active. Additional workers are started or stopped as required. To avoid the slow start of the additional workers [
15], we create the additional worker instances in advance and leave them in the “stopped” state. Stopped instances do not accrue instance usage charges in AWS (only their attached storage remains billable) and can be activated later when needed. The scaler/load balancer is therefore given the list of all pre-created worker instances and activates them when client activity requires additional workers: it scales up (activates more workers) when a high workload arrives and scales down when there are no requests or jobs for the workers for a given period of time.
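The pre-created pool of stopped workers can be sketched as below. The `WarmPool` class, the `FakeEC2` stand-in, and the instance identifier are hypothetical; the only real API assumed is the `start_instances` and `stop_instances` calls of the boto3 EC2 client, which accept an `InstanceIds` list.

```python
from typing import Optional, Protocol


class EC2Client(Protocol):
    """The subset of the boto3 EC2 client that this sketch relies on."""
    def start_instances(self, *, InstanceIds: list[str]) -> dict: ...
    def stop_instances(self, *, InstanceIds: list[str]) -> dict: ...


class WarmPool:
    """Tracks pre-created worker instances that stay stopped until needed."""

    def __init__(self, ec2: EC2Client, instance_ids: list[str]) -> None:
        self._ec2 = ec2
        self._stopped = list(instance_ids)
        self._running: list[str] = []

    def scale_up(self) -> Optional[str]:
        """Start one stopped worker; return its id, or None if the pool is empty."""
        if not self._stopped:
            return None
        instance_id = self._stopped.pop(0)
        self._ec2.start_instances(InstanceIds=[instance_id])
        self._running.append(instance_id)
        return instance_id

    def scale_down(self) -> Optional[str]:
        """Stop the most recently started worker; return its id, or None."""
        if not self._running:
            return None
        instance_id = self._running.pop()
        self._ec2.stop_instances(InstanceIds=[instance_id])
        self._stopped.append(instance_id)
        return instance_id


class FakeEC2:
    """In-memory stand-in so the sketch runs without AWS credentials."""
    def __init__(self) -> None:
        self.calls: list[tuple[str, list[str]]] = []

    def start_instances(self, *, InstanceIds: list[str]) -> dict:
        self.calls.append(("start", InstanceIds))
        return {}

    def stop_instances(self, *, InstanceIds: list[str]) -> dict:
        self.calls.append(("stop", InstanceIds))
        return {}


fake = FakeEC2()
pool = WarmPool(fake, ["i-worker-2"])  # hypothetical instance id
```

In a real deployment, `boto3.client("ec2")` would presumably take the place of `FakeEC2`. The always-active first worker is deliberately left outside the pool so that it is never stopped.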
The third proposed model, shown in
Figure 1c, uses Amazon ECS (CaaS). As we can see, it is similar to the previous model, but instead of using EC2 instances directly, ECS clusters are used with EC2 groups inside them. The REST API server, the backend server and the scaler/load balancer run in one ECS cluster, which contains one EC2 group of one EC2 instance. The workers run in a different ECS cluster, which includes one EC2 group with multiple EC2 instances. This container-based model simplifies deployment, management, and updates through Docker [
13,
33]. In addition, ECS natively supports auto-scaling, which offers ease of use at the price of potentially slower scale-up times. Furthermore, ECS offers a streamlined approach for future application development and operation.
4. Evaluation
This section reports the performance evaluation of the approaches introduced in this study. We first describe the experimental environment and configuration used in the evaluation. Subsequently, the obtained results are presented and discussed.
Throughout the evaluation, average CPU, RAM, and GPU utilization metrics are reported to illustrate how efficiently the provisioned resources are used over time. However, average utilization alone does not directly represent total computational consumption or monetary cost. For this reason, execution time and on-demand pricing are treated as the primary indicators of overall resource consumption, while utilization metrics are used as supporting evidence to interpret performance behavior and identify potential inefficiencies. For the sake of completeness, total resource consumption results are provided in
Appendix A, which keeps the main text concise.
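The relation between execution time, on-demand pricing, and monetary cost can be made explicit with a short sketch. It assumes that billing is proportional to running time; the function name and the prices in the example are placeholders, not the actual AWS rates or measured times.

```python
def on_demand_cost(execution_seconds: float,
                   hourly_price_usd: float,
                   instances: int = 1) -> float:
    """Cost of running `instances` identical instances for the given time.

    Assumption: cost is linear in running time (no minimum billing period,
    no storage or data-transfer charges).
    """
    if execution_seconds < 0 or hourly_price_usd < 0 or instances < 0:
        raise ValueError("arguments must be non-negative")
    hours = execution_seconds / 3600.0
    return hours * hourly_price_usd * instances


# Placeholder values: two instances at 1.00 USD/h running for two hours.
example_cost = on_demand_cost(7200, 1.00, instances=2)
```

For a deployment that mixes instance types, the per-type costs would simply be summed.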
4.1. Experimental Setup
For the experimental setup, we used the AWS EC2 instances shown in
Table 2. More specifically, we used three types of deployment, one per model proposed:
Deploying the complete application using one EC2 instance type [
16] (
Figure 1a). In this case, we use an EC2 g5.2xlarge instance with 8 vCPUs (AMD EPYC 7R32), 32 GB of RAM, and a GPU with 24 GB of memory (NVIDIA A10G Tensor Core).
Deploying the application in components on different types of EC2 instances (
Figure 1b). In this case, we deploy less resource-intensive components like the API server, the backend server and the scaler/load balancer on a smaller instance with enough RAM and CPU capacity, namely a t2.xlarge instance with 4 vCPUs (Intel Xeon Scalable) and 16 GB of RAM. Two worker components are deployed on two g5.2xlarge instances. The t2.xlarge instance is always active, waiting for requests. One worker instance is always active. The additional worker instance is started or stopped as required.
Deploying the application components as services on ECS clusters [
34] (
Figure 1c). The ECS implementation is the same as the previous one, but using ECS clusters. The API server, the backend server and the scaler/load balancer components are deployed on an ECS cluster, which contains an EC2 group with one EC2 instance of type t2.xlarge. The two workers are deployed on another ECS cluster, which includes an EC2 group with two EC2 instances of type g5.2xlarge. In this case, it is necessary that the two ECS clusters use the same namespace to communicate with each other.
As previously mentioned, the application evaluated in the experiments corresponds to a real-world use case provided by Kimera Technologies. The system is an AI-based application for image processing and classification, offering two core functionalities delivered as services to end users:
Database creation from structured data. The application enables the construction of a product database from a comma-separated values (CSV) file containing a client’s product information. During this process, product images are encoded and linked to their corresponding entries defined in the CSV file. This functionality relies on artificial neural networks (ANNs) and GPU-accelerated AI models to generate image representations. The time and computational cost required to build the database depend on several factors, including the size of the CSV file, the number of products, and the amount of additional metadata provided.
Product search and retrieval. The system supports advanced search capabilities over the generated databases, allowing clients to query products using visual and textual information. For example, users may upload an image to retrieve visually similar products or search for items sharing specific image characteristics. Textual queries based on product attributes, such as titles or descriptions included in the CSV file, are also supported. In addition to unimodal searches, the application allows multimodal queries that combine both image-based and text-based information.
To assess system performance, several monitoring and benchmarking tools are employed. Apache JMeter [
35] is used to conduct load testing by emulating concurrent user requests to the database, thereby enabling performance evaluation under varying system loads. In addition, AWS CloudWatch [
36] is utilized to monitor CPU utilization and memory consumption of the deployed instances over time. Since CloudWatch does not expose GPU-related metrics, GPU usage data are collected by directly accessing the instances and querying the NVIDIA System Management Interface (SMI) [
37].
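As an illustration of how GPU metrics might be collected on an instance, the sketch below queries `nvidia-smi` in CSV mode and converts each line into utilization percentages. The function names and the sample line used in the example are assumptions; the sketch is not the collection script used in the experiments.

```python
import subprocess

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_sample(line: str) -> dict[str, float]:
    """Parse one line produced with --format=csv,noheader,nounits."""
    util, used, total = (float(value) for value in line.split(","))
    return {"gpu_util_pct": util, "gpu_mem_pct": 100.0 * used / total}


def read_gpu_samples() -> list[dict[str, float]]:
    """Query every GPU on the machine once (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_gpu_sample(line)
            for line in result.stdout.splitlines() if line.strip()]


# Made-up sample line: 50% utilization, 6000 MiB used out of 24000 MiB.
sample = parse_gpu_sample("50, 6000, 24000")
```

Calling `read_gpu_samples()` periodically and averaging the results would give figures comparable to the CPU and RAM averages obtained from CloudWatch.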
The database-creation experiments are limited to a maximum of four concurrent users, which reflects a realistic industrial scenario for this operation. Scalability under higher loads (up to 80 users) is evaluated through dedicated load tests in the product search and retrieval experiments.
4.2. Deploying the Complete Application Using One EC2 Instance Type to Create a Database
The model evaluated in this section is the one shown in
Figure 1a. We used the EC2 g5.2xlarge instance detailed before. For the experiment, we employed a CSV dataset containing approximately 150,000 entries. Each record included a link to an image along with additional descriptive attributes. Since each user in the application can maintain only a single database, we created up to four user accounts and initiated the database-creation process while varying the number of concurrent users. During these tests, we recorded the time required to complete the operation, as well as the CPU, RAM, GPU, and GPU memory utilization of the EC2 instance.
All system components were deployed on a single EC2 instance for this part of the evaluation. After a successful deployment, API requests can be issued directly to the instance. A scripted workflow was used to trigger the database-creation request via the API, authenticating with the credentials of the targeted user and supplying the CSV file of 150,000 elements. The application internally logs the execution time for this operation. In parallel, we monitored CPU, RAM, GPU, and GPU memory usage through the measurement techniques described earlier. After each user’s process completed, the corresponding database was deleted, and the number of concurrent users was incremented. This procedure enables us to evaluate system behavior under loads of up to four simultaneous database-creation requests.
Figure 2 illustrates the execution time obtained from this experiment. The results reveal a linear increase in processing time as the number of concurrent users grows, with an average growth rate of 0.41.
Resource utilization metrics are depicted in
Figure 3a, which shows the average CPU and RAM usage during the tests. RAM consumption remains below 60% across all scenarios and shows only a modest increase with additional users. CPU utilization rises from roughly 20% with a single user to around 32% when two users are active, appearing to stabilize near this value as the number of concurrent users increases.
Figure 4a presents GPU and GPU memory usage. GPU memory consumption remains constant at approximately 68.2% regardless of the load. GPU utilization follows a pattern similar to CPU usage, increasing from about 65% with one user to roughly 94% with two users, after which it remains stable for higher concurrency levels.
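The model evaluated in this section does not scale well: execution time grows steeply with the number of concurrent users, and GPU utilization is already around 94% with two users. It is therefore not suitable for the aims of this work.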
4.3. Deploying the Application in Components on Different Types of EC2 Instances to Create a Database
The model evaluated in this section is the one shown in
Figure 1b. We carried out experiments with one t2.xlarge instance (for the API server, the backend server, and the scaler/load balancer) and two g5.2xlarge instances (for the workers). As explained in
Section 3.5, there is always one worker (i.e., one g5.2xlarge instance) active. The second worker is created in advance and kept stopped. When the GPU usage of the first worker reaches 90%, the scaler starts the second worker. To balance the load between the two workers, a classic application load balancer provided by AWS is used [
38]. It implements a round robin policy.
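The scale-up rule and the round-robin dispatch described above can be illustrated with a minimal sketch. It is not the authors' implementation nor the AWS load balancer itself: the `Scaler` class, the worker names, and the `start_standby` callable (which would start the pre-created, stopped instance) are hypothetical; only the 90% GPU threshold and the round-robin policy come from the text.

```python
from itertools import cycle
from typing import Callable, List

GPU_SCALE_UP_THRESHOLD = 90.0  # percent, as stated in the text


class Scaler:
    """Toy model of the scaler plus a round-robin dispatcher."""

    def __init__(self, start_standby: Callable[[], str], primary: str = "worker-1"):
        # start_standby is a hypothetical hook that starts the stopped
        # second instance and returns its identifier.
        self._start_standby = start_standby
        self.active: List[str] = [primary]
        self._rotation = cycle(self.active)

    def observe_gpu(self, primary_gpu_percent: float) -> None:
        """Start the standby worker once the primary reaches the threshold."""
        if primary_gpu_percent >= GPU_SCALE_UP_THRESHOLD and len(self.active) == 1:
            self.active.append(self._start_standby())
            # Rebuild the rotation so that the new worker is included.
            self._rotation = cycle(self.active)

    def next_worker(self) -> str:
        """Return the worker that should receive the next request."""
        return next(self._rotation)
```

In this sketch, requests go only to the first worker until a GPU reading of 90% or more is observed; afterwards they alternate between the two workers.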
In these experiments, as well as in those in the next sections involving multiple instances, the time spent pre-processing the user request on the t2.xlarge instance before sending it to a worker is negligible (around 5 s in these experiments). Thus, the results presented below consider only the time and resources used by the two workers, each of which runs on its own g5.2xlarge instance.
Figure 2 shows the execution time of this experiment. Similar to the previous scenario, the execution time increases linearly with the number of users. However, in this case, the execution time is lower than in the previous case for more than one user. A speed-up of 2.12× is achieved with four users. The average growth rate (0.11) is also lower than in the previous case (0.41). These results were expected, as we have doubled resources (i.e., two workers in two different EC2 instances instead of one worker in one EC2 instance).
In terms of resource consumption,
Figure 3b presents the average CPU and RAM usage. Since there are two workers in this case, the figure shows the average usage of two CPUs (CPU1 and CPU2) and two RAM units (Memory1 and Memory2). CPU1 and Memory1 refer to one worker, whereas CPU2 and Memory2 refer to the other worker.
In experiments with only one user, the second worker is not active. Thus, the results for one user are similar to the previous scenario. For more than one user, however, the second worker is started. In these cases, the average RAM memory usage is similar (around 54%) for all the workers regardless of the number of users. The average CPU usage is also similar for all the workers (around 29%). It increases slightly when more users are added. We also observe that CPU load is better balanced as the number of users (i.e., the work to do) increases.
Figure 4b presents the average GPU and GPU memory usage. As in the previous scenario, the average GPU memory usage remains constant at 68.2% regardless of the number of users. The average GPU usage increases as the number of users grows. Thus, it increases from around 65% with one user to around 91% for four users. GPU load is well balanced regardless of the number of users.
Compared to the previous scenario (i.e., one EC2 vs. multiple EC2), the resources used are increased but the execution time is reduced by a higher factor. Thus, the final resource usage is also reduced. For example, in the experiments with four users, the CPU usage and the RAM memory are reduced by 12% and 14%, respectively. GPU usage and GPU memory are also reduced by 8% and 6%, respectively.
In summary, the model evaluated in this section presents good scalability and flexibility and it is suitable for the objectives of this work.
4.4. Deploying the Application Components as Services on ECS Clusters to Create a Database
The model evaluated in this section is the one shown in
Figure 1c. The scenario using ECS is very similar to the previous one, with one t2.xlarge (for the API server, the backend server, and the scaler/load balancer) and two g5.2xlarge instances (for the workers). However, now the t2.xlarge instance is in one ECS cluster and the two g5.2xlarge instances are in another ECS cluster.
To ensure a fair comparison with the multiple EC2 deployments, both approaches rely on pre-configured execution environments. In the multiple EC2 scenario, worker instances are created in advance and kept in a stopped state, enabling fast activation. In contrast, ECS provisions new workers using predefined task and instance templates. Although functionally equivalent in configuration, the ECS approach inherently incurs additional delay due to container instantiation, scheduling, and orchestration, which is reflected in the observed scale-up times.
As in the previous model, there is always one worker (i.e., one g5.2xlarge instance) active. However, in this case, no second worker exists in advance, because ECS provisions instances on demand. Instead, a template containing the required configuration is created. When the GPU usage of the first worker reaches 90%, the scaler requests a new instance, which is configured using the template and then started. To balance the load between the two workers, the application load balancer provided by ECS is used [
38]. Similarly to the previous scenario, it also implements a round robin policy.
Figure 2 shows the execution time of this experiment. Similarly to the previous scenario, the execution time increases linearly with the number of users. However, in this case, the execution time is higher than in the previous case, regardless of the number of users. For example, for four users, this scenario requires 22% more time. The average growth rate (0.18) is also higher than in the previous case (0.11). These results were expected, as this model is based on containers, which simplify deployment and management, but usually reduce performance.
Regarding resource consumption,
Figure 3c presents the average CPU and RAM memory usage. In general, the resource usage follows a similar trend to that of the previous scenario. However, the values are slightly lower. Thus, the average RAM memory usage is 51.9%, while in the previous scenario it was 53.5%. The same happens with average CPU usage (24.6% vs. 25.6%). We also observe that CPU load is better balanced than in the previous scenario, regardless of the load (i.e., the number of users).
Figure 4c presents the average GPU and GPU memory usage. As in the previous scenario, the average GPU memory usage remains constant at 68.2% regardless of the number of users. The average GPU usage is, however, slightly lower. Thus, the average GPU usage is 76.8%, while in the previous scenario it was 79.1%. Similar to what happened in the previous scenario, GPU load is well balanced regardless of the number of users.
Compared to the previous scenario (i.e., multiple EC2 vs. ECS), the resources used are reduced but the execution time is increased by a higher factor. Thus, in general, the final resource usage is also increased. For example, in the experiments with four users, the CPU usage and the RAM memory are both increased by 18%. GPU usage and GPU memory are also increased by 21% and 22%, respectively.
To better understand why ECS results are worse than in the previous scenario, we further analyzed these cases.
Figure 5 shows the average CPU and GPU usage of these deployments with four concurrent users creating a database. As already discussed in
Section 3.5, ECS balances ease of use against potentially slower scale-up times. This is clearly shown in these experiments. In
Figure 5a, we can observe that the scale-up time (i.e., the time elapsed from when the second worker is requested until it is ready to work) for the previous scenario is 5.12 min. However, as shown in
Figure 5b, the scale-up time when using ECS is 11.85 min. This slower scale-up time, together with the reduced performance due to the use of containers, explains the results.
The model evaluated in this section also presents good scalability and flexibility, and it is also suitable for the purposes of this work.
4.5. Load Testing of Product Search Functionality
Next, we evaluated the system’s performance under load using JMeter (version 5.6.3), following the methodology described earlier. In this experiment, we measured the number of requests per second that the application can handle as the number of concurrent users performing product-search queries increases. The same database and workload configuration were maintained across all tests to ensure comparability. To conduct the experiment, JMeter was configured to continuously issue requests to the application’s IP address for a duration of five minutes, while varying the number of threads to emulate different levels of simultaneous user activity. This setup allowed us to observe how the system behaves under increasing query rates and to assess its capacity for handling multiple search operations concurrently. The resulting performance metrics are presented in
Figure 6.
As in the previous sections, three scenarios were compared: a first scenario, where the complete application is deployed in a single g5.2xlarge instance (
Figure 1a); a second scenario, where some components of the application (the API server, the backend server, and the scaler/load balancer) are deployed in one t2.xlarge instance, and the workers are deployed in g5.2xlarge instances (
Figure 1b); and a third scenario similar to the second one but using ECS (
Figure 1c). As explained in
Section 3.5, in the second and third scenarios, there is always one worker (i.e., one g5.2xlarge instance) active.
During these search experiments, GPU utilization of the active worker never exceeded the scale-up threshold (i.e., 90%). As a result, additional workers were not instantiated, and the comparison effectively isolates differences related to deployment overhead, request handling, and orchestration mechanisms under a single-worker configuration. This behavior is representative of steady-state inference workloads with moderate concurrency.
As illustrated in
Figure 6, the throughput of the system, measured in requests per second, increases with the number of concurrent users issuing search queries across all evaluated scenarios. This growth persists up to approximately 40 simultaneous users. Beyond this point, the throughput saturates and remains stable despite further increases in the number of concurrent requests. With a single EC2 instance, the maximum throughput achieved is 30.7 requests per second. With multiple EC2 instances, the maximum is 33.4 requests per second, and with ECS it is 31. Thus, the multiple EC2 scenario improves on the single EC2 and ECS scenarios by 9% and 8%, respectively.
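The relative improvements quoted above follow directly from the reported maximum throughput values. The short sketch below reproduces the arithmetic; the helper name `relative_gain_percent` is introduced only for illustration.

```python
def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# Maximum requests per second reported in the text.
single_ec2, multiple_ec2, ecs = 30.7, 33.4, 31.0

gain_vs_single = relative_gain_percent(multiple_ec2, single_ec2)
gain_vs_ecs = relative_gain_percent(multiple_ec2, ecs)

print(f"vs. single EC2: {gain_vs_single:.1f}%")  # about 8.8%, reported as 9%
print(f"vs. ECS: {gain_vs_ecs:.1f}%")            # about 7.7%, reported as 8%
```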
In addition to throughput, we monitored response latency and request completion status during the load tests. Across all evaluated scenarios, no request failures were observed. The average response latency increased gradually with the number of concurrent users and stabilized once throughput saturation was reached, following a trend consistent with the observed throughput behavior.
The deployment method does not affect retrieval precision, classification accuracy, or result consistency. All experiments use identical model architectures, weights, preprocessing steps, and inference logic. Therefore, differences observed across deployment scenarios are solely attributable to infrastructure and orchestration effects rather than to changes in algorithmic behavior.
4.6. Cost Analysis
This section presents a comparative analysis of the economic costs associated with the evaluated deployment configurations. The assessment is based on Amazon EC2 on-demand pricing [
39]. At the time of writing this paper, the cost of the instance t2.xlarge was USD 0.1856 per hour, while the cost of the instance g5.2xlarge was USD 1.212 per hour (on-demand price per hour for Linux/Unix in the US East (Northern Virginia) AWS Region). In the scenarios where different types of instances are used (i.e., multiple EC2 and ECS), the price includes one t2.xlarge instance, one g5.2xlarge and an additional g5.2xlarge when required (i.e., two or more users for creating a database).
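As a rough illustration of how these on-demand prices combine, the sketch below computes the hourly rate of each deployment and the cost of a run of a given duration. It assumes that cost is simply the hourly rate multiplied by the run time and that any additional worker is billed for the whole run; the exact accounting behind the reported tables is not specified here, and the function and scenario names are hypothetical.

```python
# On-demand prices quoted in the text (USD per hour, US East, Linux/Unix).
T2_XLARGE_USD_PER_HOUR = 0.1856
G5_2XLARGE_USD_PER_HOUR = 1.212


def hourly_rate_usd(scenario: str, gpu_workers: int = 1) -> float:
    """Hourly price of a deployment.

    "single_ec2" runs everything on one g5.2xlarge instance, while
    "multiple_ec2" and "ecs" use one t2.xlarge plus `gpu_workers`
    g5.2xlarge instances.
    """
    if scenario == "single_ec2":
        return G5_2XLARGE_USD_PER_HOUR
    if scenario in ("multiple_ec2", "ecs"):
        return T2_XLARGE_USD_PER_HOUR + gpu_workers * G5_2XLARGE_USD_PER_HOUR
    raise ValueError(f"unknown scenario: {scenario}")


def run_cost_usd(scenario: str, duration_hours: float, gpu_workers: int = 1) -> float:
    """Cost of a run, assuming every instance is billed for the full duration."""
    return hourly_rate_usd(scenario, gpu_workers) * duration_hours
```

Under these assumptions, a multi-instance deployment with two active workers costs USD 2.6096 per hour, compared with USD 1.212 per hour for the single-instance deployment, so it is cheaper per operation only if it shortens the run by a larger factor than the increase in hourly rate.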
The corresponding cost estimates are presented in
Table 3 and
Table 4.
Table 3 reports the monetary cost, expressed in U.S. dollars (USD), associated with each evaluated deployment configuration when creating a database under different levels of user concurrency. Each value represents the cost per database creation process. As we can observe, for one user, the price is similar for all the deployments. However, as the number of users increases, the cost of the scenarios using multiple instances decreases. Thus, for four users, the scenario using multiple EC2 instances presents the lowest cost, with a reduction of 6% and 22% compared to EC2 and ECS, respectively.
Table 4 summarizes the estimated economic cost, expressed in U.S. dollar cents (CENT), for the evaluated deployment configurations when executing product search operations under varying levels of concurrent user load. Each reported value corresponds to the cost incurred per individual search request. Prices are similar for all the deployments. Again, as the level of concurrency increases, the cost per request decreases. This reduction is observed up to 40 concurrent users, after which the cost stabilizes and remains unchanged despite further increases in user count. For over 40 users, the scenario using multiple EC2 instances presents the lowest cost, with a reduction of 7% compared to the other scenarios.
5. Discussion
The experimental results highlight several practical implications for deploying large-scale, GPU-accelerated computer vision applications in cloud environments. Decomposing application components across heterogeneous resources enables better scalability and cost efficiency than monolithic deployments, while container-based solutions offer improved manageability at the expense of additional overhead. These findings suggest that infrastructure selection should be guided not only by raw performance but also by operational considerations, such as scale-up latency, deployment complexity, and expected workload variability.
Furthermore, the proposed architecture can be naturally extended with intelligent preprocessing stages, such as image quality assessment or filtering modules, to avoid unnecessary inference on low-quality inputs. Incorporating such components represents a promising direction for future work, particularly in conjunction with modern IQA models based on transformers and large-scale representation learning.
Beyond the scope of the current investigation, further research could also analyze how image degradation (e.g., noise, resolution, or compression) affects the system’s accuracy, scalability, or resource utilization.
6. Conclusions
In this work, we analyzed different cloud computing alternatives for distributing a real-world AI-based computer vision application, namely, (i) deploying the complete application using one EC2 instance (EC2), (ii) deploying the application in components on different types of EC2 instances (multiple EC2), and (iii) deploying the application components as services on ECS clusters (ECS).
In the tests performed, the multiple EC2 scenario presented the best results. This model and the ECS one showed good scalability and flexibility. In contrast, the scenario using one EC2 instance did not present acceptable scalability. In addition to offering the best performance, the multiple EC2 scenario also presented a slightly lower cost. Therefore, we conclude that this approach is the most appropriate for the objectives pursued in this study.