Article

Scaling Computer Vision: A Comparative Analysis of Cloud Infrastructures for AI-Based Image Processing and Classification Applications †

1
Departament d’Informàtica, Escola Tècnica Superior d’Enginyeria (ETSE), Universitat de València, Avda. Universitat S/N, Burjassot, 46100 València, Spain
2
Kimera Technologies, S.L., Carrer del Moll de la Duana S/N, Poblados Marítimos, 46024 València, Spain
*
Author to whom correspondence should be addressed.
The 24th International Symposium on Parallel and Distributed Computing (ISPDC), Rennes, France, 8–11 July 2025.
Electronics 2026, 15(9), 1953; https://doi.org/10.3390/electronics15091953
Submission received: 3 April 2026 / Revised: 27 April 2026 / Accepted: 1 May 2026 / Published: 5 May 2026

Abstract

Artificial intelligence-driven computer vision has undergone rapid expansion in recent years, largely propelled by progress in deep learning techniques and the availability of extensive annotated datasets. Nevertheless, the large-scale adoption of such systems remains challenging for many organizations due to financial constraints and technological complexity. In this context, cloud computing has become an appealing alternative, as it offers elastic, on-demand resources under a pay-as-you-go model. Despite these advantages, the use of cloud platforms also introduces specific challenges for computer vision applications. One of the key open issues is whether applications are better built on classical Infrastructure as a Service (IaaS) or on Container as a Service (CaaS). In this paper, we evaluated and compared these two models using a real-world use case: an AI-based image processing and classification application. The best-performing model achieved speed-ups of up to 2.12× and reduced resource consumption and costs by up to 22% compared with the other evaluated alternatives.

1. Introduction

Artificial intelligence-based computer vision has seen rapid adoption across numerous domains, including industrial manufacturing [1] and medical applications [2]. This widespread uptake is primarily driven by recent breakthroughs in deep learning methods [3] together with the growing availability of large-scale annotated datasets [4], which have substantially improved the performance and reliability of modern computer vision systems.
Despite these advances and their transformative potential, deploying computer vision solutions at scale remains a difficult task for many organizations, particularly small and medium-sized enterprises [5]. A major limiting factor is the considerable investment required for computational infrastructure, which is necessary not only for training and inference of deep learning models but also for the storage and processing of massive volumes of visual data.
To mitigate these economic and technical barriers, cloud computing [6] has become a prominent enabling technology. Major cloud providers, such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, offer elastic and adaptable infrastructures that support data-intensive workloads and machine learning pipelines. The capability to dynamically allocate resources and follow a usage-based pricing scheme makes cloud platforms particularly appealing for organizations with limited in-house resources.
Nevertheless, adopting cloud-based solutions for computer vision introduces its own set of challenges. Conventional cloud storage systems and data management architectures are typically designed for structured or semi-structured data, which can lead to inefficiencies when handling large-scale image and video collections. This limitation highlights the need for advanced load balancing strategies and specialized database architectures tailored to the unique requirements of computer vision workloads in cloud environments [7].
The aim of this paper is to explore how emerging technologies in computer vision and cloud computing can be effectively combined to address these challenges and unlock new opportunities for businesses. To this end, we research and design a distributed server infrastructure in which the computing cost associated with artificial intelligence is distributed across multiple sub-processes. For evaluation purposes, we use a real-world use case from Kimera Technologies [8]: an AI-based image processing and classification application. The cost and performance targets are adjusted to suit the needs of this company.
In this context, the central objective of this paper is to address the following engineering research question: How do different cloud deployment models, specifically Infrastructure as a Service (IaaS) and Container as a Service (CaaS), affect the performance–cost trade-off of scalable, GPU-accelerated computer vision applications under realistic workloads? While the evaluation is grounded in a real-world industrial use case, the goal is not to optimize a single application instance, but rather to extract general insights that can guide deployment decisions for similar AI-based image processing and classification systems.
This paper makes the following main contributions:
  • We designed and evaluated three distinct cloud deployment architectures for an AI-based computer vision application, covering monolithic IaaS, decomposed IaaS, and container-based CaaS models.
  • We provided a comprehensive experimental evaluation that jointly analyzes execution time, scalability behavior, resource utilization patterns, and monetary cost.
  • We quantified the trade-offs introduced by containerized deployments, including scale-up latency and orchestration overhead, in comparison to VM-based approaches.
  • We derived practical recommendations for deploying GPU-accelerated computer vision workloads in cloud environments, grounded in real-world industrial constraints.
This paper extends our previous work [9]. Compared with our preliminary conference publication, this journal version significantly extends the work by (i) introducing two additional deployment architectures (multiple EC2 and ECS); (ii) providing a comprehensive evaluation that includes scalability, detailed resource utilization, load testing, and cost analysis; (iii) analyzing container-induced performance penalties and scale-up latency; and (iv) drawing broader insights into deployment trade-offs relevant to real-world AI systems. These extensions substantially deepen both the experimental and engineering contributions, supporting publication as a journal article.
The remainder of this paper is organized as follows: Section 2 presents related work. Next, Section 3 discusses the research questions and the solution proposed. Later, Section 4 presents the evaluation of the results and Section 5 discusses them. Finally, Section 6 concludes the paper.

2. Related Work

This section provides an overview of relevant literature on cloud computing architectures, with particular emphasis on scalable and distributed applications in the cloud.

2.1. Cloud Computing

Cloud computing refers to the delivery of information technology (IT) resources through service-oriented cloud platforms. Several cloud service models have been established to address different levels of abstraction, including Infrastructure as a Service (IaaS), Container as a Service (CaaS), and Function as a Service (FaaS). In addition, users can leverage a variety of cloud-based solutions provided over the Internet to satisfy evolving computational demands, as offered by major providers, such as Amazon AWS, Google Cloud Platform, and Microsoft Azure.

2.1.1. Infrastructure as a Service (IaaS)

Infrastructure as a Service (IaaS) [10] enables users to provision and manage virtual machines (VMs) with a high degree of control. This model allows customers to configure multiple aspects of the virtualized environment, including the number of Central Processing Units (CPUs), the amount of Random Access Memory (RAM), storage capacity, networking components, and hardware accelerators, such as Graphics Processing Units (GPUs). Such flexibility provides extensive customization options and enables fine-grained cost control. However, these advantages come at the expense of increased operational complexity, as users are responsible for managing and orchestrating the underlying resources. As the scale of the deployed application grows, handling and scaling these resources efficiently can become challenging [11].

2.1.2. Container as a Service (CaaS)

Container as a Service (CaaS) [12] provides a cloud-based environment for orchestrating, deploying, and managing containerized applications. This model enables developers to encapsulate software together with its dependencies, thereby enhancing portability and runtime consistency. Widely used container technologies, such as Docker [13] and orchestration platforms like Kubernetes [14], streamline application delivery and automate operational workflows. By abstracting much of the underlying infrastructure, CaaS reduces the burden on development teams, allowing them to concentrate on application logic rather than configuration and system maintenance. Nevertheless, this approach introduces security considerations, as containerized ecosystems require careful governance to mitigate potential vulnerabilities and ensure safe operation.

2.1.3. Function as a Service (FaaS)

Function as a Service (FaaS) enables developers to implement application logic as discrete functions that are triggered by predefined events, abstracting away the management of servers and runtime infrastructure. This execution model supports a pay-per-use cost structure, whereby charges are incurred only during function invocation. As a result, FaaS is well-suited for workloads that demand elastic scalability and for development scenarios where efficient utilization of computing resources is a priority. Despite these advantages, FaaS presents several limitations. The initialization of the execution environments can introduce latency, commonly referred to as cold-start delay, which may affect performance-sensitive applications. Furthermore, observing, debugging, and monitoring serverless functions is often more complex than in traditional deployment models. These drawbacks become more pronounced when integrating machine learning workloads, as the inclusion of models significantly increases execution duration as well as CPU and memory consumption, leading to higher operational costs.

2.1.4. Comparative Analysis of Cloud Service Models

While IaaS, CaaS, and FaaS each offer distinct advantages, their suitability depends on workload characteristics. IaaS provides fine-grained control and cost efficiency for long-running workloads, but introduces operational complexity. CaaS strikes a balance by abstracting infrastructure management while retaining deployment flexibility, although it introduces orchestration overhead and security concerns. In contrast, FaaS offers superior elasticity and ease of deployment, but suffers from cold-start latency and limited execution control.
Recent studies suggest that no single model consistently outperforms the others across all metrics, such as latency, cost, and scalability. Instead, hybrid approaches are increasingly favored, although they introduce additional system complexity and integration challenges.
Table 1 synthesizes all the relevant information regarding cloud service models to allow for an immediate comparison.

2.1.5. Previous Studies

Although each cloud computing model presents its own limitations and trade-offs, several studies have demonstrated effective strategies to mitigate these issues or propose alternative solutions. For example, Zhang et al. [15] analyze cost-efficient approaches for delivering Machine Learning as a Service (MLaaS) on public cloud platforms, such as Amazon AWS. Their study compares the cost and average response latency of different cloud service models (EC2 [16] representing IaaS, ECS [17] for CaaS, and AWS Lambda [18] for FaaS) under a workload of one million requests using three representative machine learning models. The experimental results indicate that Lambda incurs the highest cost and latency among the evaluated options. However, it benefits from rapid provisioning times on the order of seconds. ECS exhibits intermediate performance in terms of both cost and latency, with resource provisioning typically requiring several minutes. In contrast, EC2 provides the lowest latency and cost, although its provisioning time is comparable to that of ECS. To address scalability challenges, the authors propose a hybrid deployment strategy that combines EC2 and Lambda, where IaaS serves as the primary execution environment and FaaS is leveraged during instance initialization phases or to accommodate short-lived demand spikes. Complementary findings are reported by Ramesh et al. [19], who demonstrate that cloud-based artificial intelligence workflows can be deployed in a cost-effective manner by strategically integrating FaaS with cloud storage services, thereby reducing operational overhead while maintaining scalability.
More recent research has focused extensively on addressing the performance and scalability limitations of serverless and hybrid cloud architectures. A significant body of work targets the cold-start problem, which remains one of the primary bottlenecks in FaaS platforms. For example, recent large-scale empirical analyses [20] reveal that cold-start latency is influenced by multiple factors, including runtime environments, scheduling delays, and dependency deployment, with delays reaching several seconds in real-world systems. Complementary studies propose predictive and learning-based approaches to mitigate these effects, such as deep learning–driven scheduling policies and workload forecasting techniques that proactively reduce initialization overhead [21]. In addition, lightweight virtualization and container optimization strategies have been shown to significantly reduce startup latency while maintaining isolation guarantees [22].
Beyond reactive optimizations, recent work explores adaptive and intelligent resource management. For instance, learned caching and resource allocation mechanisms have been proposed to dynamically adjust container life-cycles based on workload characteristics, outperforming static heuristics in heterogeneous environments [23]. Similarly, fine-grained scheduling approaches that optimize function packaging and dependency management have demonstrated substantial improvements in response time and resource utilization [24].
Another emerging direction is the integration of serverless computing within distributed and edge–cloud environments. Recent studies highlight that data movement, function orchestration, and cold-start overhead collectively dominate execution time in distributed serverless workflows, motivating new architectures that overlap computation and data transfer to improve performance [25]. These findings indicate that future cloud systems must consider not only compute elasticity but also data locality and cross-layer optimization.
Each cloud computing model offers distinct strengths and limitations, resulting in multiple viable deployment strategies and a variety of approaches to mitigating their inherent challenges. Selecting the most suitable model ultimately depends on an organization’s specific requirements, requiring careful consideration of factors, such as cost efficiency, scalability, and the level of operational complexity imposed on development teams. Existing research indicates that hybrid approaches, which combine different service models, can be particularly effective. For example, IaaS may serve as the primary execution environment, while FaaS can be employed to absorb transient workload spikes, or CaaS platforms (e.g., ECS) can be layered on top of IaaS resources (e.g., EC2) to facilitate container-based deployments. Such combinations enable flexible, scalable, and cost-effective solutions tailored to the deployment of machine learning models and artificial intelligence workflows in cloud environments.

2.2. Scalable and Distributed Applications on Cloud

Leymann et al. [26] characterized scalability as the capability of an application to improve its performance through the addition of IT resources, typically without explicitly accounting for the release of those resources. In contrast, the same authors defined elasticity as the ability to dynamically both provision and deprovision IT resources, enabling an application to rapidly adapt its performance in response to workload fluctuations. This property is fundamental to fully leveraging the pay-per-use nature of cloud computing.
From an implementation perspective, cloud application scaling mechanisms can generally be categorized into two primary approaches: horizontal scaling and vertical scaling. Horizontal scaling refers to adjusting the number of independent computing resources, such as servers or virtual machines. When application demand increases, additional instances are provisioned to provide more computational power, storage capacity, or throughput. Conversely, resources are removed when demand decreases. Vertical scaling, on the other hand, involves enhancing the capabilities of existing resources, for example, by allocating additional CPU cores or memory to a server. This approach is inherently constrained by the maximum capacity supported by individual resources available from the cloud provider. Consequently, horizontal scaling is widely recommended as the preferred strategy in cloud environments.
Beyond scalability, cloud applications must also exhibit elasticity [26]. To support this property, applications are typically designed as distributed systems composed of multiple loosely coupled components. The majority of these components should be stateless, meaning that they do not maintain session state between client requests, while only a limited subset is designed to manage stateful information.
Prior studies [27] suggest that an ideal cloud platform should enable dynamic resource provisioning, effective load balancing, and seamless scaling with minimal performance degradation. Achieving such a balance remains an open research challenge, particularly when attempting to scale applications without increasing operational costs or reducing performance. As an example, Song et al. [28] presented an autoscaling framework for an application programming interface (API) gateway built on Kubernetes [14], demonstrating how container orchestration components can be leveraged to support scalable and elastic cloud applications.

2.3. Image Quality Assessment (IQA)

Recent advances in image quality assessment (IQA) have been driven by transformer-based architectures and large pre-trained models, which significantly improve generalization across diverse degradations. Comprehensive surveys, such as refs. [29,30], highlight how modern IQA models can be integrated into AI pipelines to improve robustness and efficiency. While the present work focuses on infrastructure-level deployment trade-offs, these advances suggest promising opportunities for combining IQA-based pre-filtering with scalable cloud architectures to reduce unnecessary computation and improve end-to-end system performance.

3. Solution Proposed

Building on the concepts discussed in the previous section, cloud environments provide multiple architectural approaches for achieving scalable and efficient deployments. In this context, we introduce three alternative models for designing the distributed server infrastructure. As previously described, these models are evaluated using a real-world use case from Kimera Technologies, which involves an AI-based system for image processing and classification. The following sections present and analyze these models in detail.
The first step is to decompose our application into components while keeping as few of them stateful as possible, as recommended in the Cloud Computing Patterns book [26], to enhance the scalability and distribution of the cloud application. We divide the application into four main components (ignoring other small components that are not part of the cloud or can be generalized as one of these components): REST API server, backend server, scaler/load balancer, and workers. The following subsections detail each component.

3.1. REST API Server

This component serves as the primary entry point through which clients interact with the application, receiving incoming requests and routing them to the appropriate backend services. It is implemented using Flask (version 3.0.3) [31], a lightweight Python web framework that facilitates rapid development of web applications, in combination with Gunicorn (version 22.0.0) [32], a UNIX-based Web Server Gateway Interface (WSGI) HTTP server designed for production environments. The integration of Flask and Gunicorn enables reliable and scalable client-server communication, while Gunicorn’s worker-based architecture supports concurrent request processing and improves overall deployment efficiency.

3.2. Backend Server

The backend component is responsible for managing and persisting client-related data, as well as handling requests issued by the REST API server. It supports core operations, such as user authentication, database creation and management, record updates, deletions, and query execution. In addition to performing all read and write interactions with the database, this component maintains user session information and monitors client activity. When a request requires processing by artificial intelligence models, the backend coordinates with the corresponding worker nodes by forwarding the necessary data and collecting the resulting outputs. This design ensures efficient, reliable, and secure data handling throughout the system architecture.

3.3. Workers

The workers component plays a central role in executing the AI models and is architected to deliver fast response times while scaling efficiently under peak workloads. To support scalability, this component follows a stateless design, allowing instances to be added or removed dynamically as demand fluctuates. It is provisioned with substantial computational resources, including high-performance CPUs, large memory capacities, and GPUs, to ensure efficient execution of computationally intensive tasks. This configuration enables the system to process complex workloads and large datasets with low latency, thereby maximizing throughput and performance for data-driven artificial intelligence operations.

3.4. Scaler/Load Balancer

The scaler/load balancer monitors the status of workers, such as CPU, GPU, and RAM usage, to dynamically scale the workforce up or down based on demand. It maintains a list of all active workers’ IP addresses, ensuring that the backend server is matched with the most available worker for processing requests, thereby optimizing resource allocation and maintaining system efficiency. This component acts as both a scaler and a load balancer, intelligently distributing the workload among workers to ensure optimal performance.
In the current implementation, a resource utilization threshold of 90% is used to trigger scale-up actions. This value was selected empirically to provide a balance between responsiveness and stability. It allows new worker instances to be provisioned before resource saturation leads to queuing delays, while avoiding unnecessary instance activation due to short-lived utilization spikes. Preliminary testing with lower thresholds resulted in increased operational cost without measurable performance improvement, whereas higher thresholds occasionally caused delayed scaling under peak load. More sophisticated scaling policies could be explored in future work. However, a threshold-based approach was chosen to reflect common industrial practice and to ensure reproducibility.
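The threshold rule and the worker-matching step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the `WorkerStatus` type, the choice of the busiest resource as the load of a worker, and the rule that all active workers must reach the threshold before scaling up are assumptions made for the example. Only the 90% value and the monitored metrics (CPU, GPU, and RAM) come from the text.

```python
from dataclasses import dataclass

SCALE_UP_THRESHOLD = 90.0  # percent; the value reported in the text


@dataclass
class WorkerStatus:
    """Utilization snapshot for one active worker (values in percent)."""
    ip: str
    cpu: float
    gpu: float
    ram: float

    @property
    def load(self) -> float:
        # Assumption: a worker is as busy as its most-used resource.
        return max(self.cpu, self.gpu, self.ram)


def pick_worker(workers: list[WorkerStatus]) -> WorkerStatus:
    """Return the least-loaded active worker for the next request."""
    if not workers:
        raise ValueError("no active workers")
    return min(workers, key=lambda w: w.load)


def should_scale_up(workers: list[WorkerStatus],
                    threshold: float = SCALE_UP_THRESHOLD) -> bool:
    """Assumed rule: scale up only when every active worker has reached
    the utilization threshold."""
    return bool(workers) and all(w.load >= threshold for w in workers)
```

A production policy would typically also smooth the utilization signal over a time window so that short-lived spikes do not trigger a scale-up; that step is omitted here for brevity.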

3.5. Proposed Models for the Distributed Server Infrastructure

Next, we discuss the proposed models, shown in Figure 1. They are all based on Amazon AWS technologies, namely Elastic Compute Cloud (EC2) and Elastic Container Service (ECS). We provide the details below.
The first proposed model, shown in Figure 1a, uses Amazon EC2 (IaaS). This is the most basic model: all the application components detailed previously are deployed on a single EC2 instance, which serves every client and handles all requests. The instance is always active, waiting for requests. It includes a GPU, which is used only by the worker. As there is only one worker, the scaler/load balancer component is not needed, and the backend server sends tasks directly to that worker.
The second proposed model, shown in Figure 1b, also uses Amazon EC2 (IaaS). In this case, we highly customized the entire configuration of the instances to adapt it to our needs. Thus, the REST API server, the backend server, and the scaler/load balancer run in one EC2 instance (without GPU, less powerful and cheaper), while the workers run in different EC2 instances (with GPU, more powerful and expensive). There is always one worker active. Additional workers are started or stopped as required. To avoid the slow start of additional workers [15], we created the additional worker instances in advance and left them in the “stopped” state. While stopped, these instances do not generate compute charges, and they can be activated later when needed. The idea is, therefore, to provide our scaler/load balancer with the list of all the worker instances that we created, and to activate them when client activity requires additional workers. The scaler scales up (activates more workers) if a high workload arrives. Conversely, the scaler scales down if there are no requests or jobs for workers for a given period of time.
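The pre-created worker pool described above can be sketched as a small helper that starts and stops instances on demand. This is an illustrative sketch under stated assumptions: the `WarmPool` class, the idle timeout value, and the injected clock are hypothetical. The `ec2_client` argument is expected to behave like a boto3 EC2 client, whose `start_instances` and `stop_instances` calls accept an `InstanceIds` list. The always-active worker is not managed by this pool.

```python
import time


class WarmPool:
    """Tracks pre-created additional workers and starts/stops them on demand."""

    def __init__(self, ec2_client, instance_ids, idle_timeout_s=300.0,
                 clock=time.monotonic):
        self.ec2 = ec2_client                # e.g. boto3.client("ec2")
        self.stopped = list(instance_ids)    # created in advance, left "stopped"
        self.running = []
        self.idle_timeout_s = idle_timeout_s  # hypothetical value
        self.clock = clock
        self.last_job_at = clock()

    def record_job(self):
        """Call whenever a request or job is dispatched to a worker."""
        self.last_job_at = self.clock()

    def scale_up(self):
        """Start one stopped worker; return its id, or None if none is left."""
        if not self.stopped:
            return None
        instance_id = self.stopped.pop(0)
        self.ec2.start_instances(InstanceIds=[instance_id])
        self.running.append(instance_id)
        return instance_id

    def scale_down_if_idle(self):
        """Stop all additional workers once no job arrived for the timeout."""
        idle_for = self.clock() - self.last_job_at
        if idle_for < self.idle_timeout_s or not self.running:
            return []
        to_stop, self.running = self.running, []
        self.ec2.stop_instances(InstanceIds=to_stop)
        self.stopped.extend(to_stop)
        return to_stop
```

Injecting the client and the clock keeps the sketch testable without AWS credentials; in a real deployment the scaler would also wait until a started instance is reachable before routing requests to it.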
The third proposed model, shown in Figure 1c, uses Amazon ECS (CaaS). As we can see, it is similar to the previous model, but instead of using EC2 instances directly, ECS clusters are used with EC2 groups inside them. The REST API server, the backend server and the scaler/load balancer run in one ECS cluster, which contains one EC2 group of one EC2 instance. The workers run in a different ECS cluster, which includes one EC2 group with multiple EC2 instances. This container-based model simplifies deployment, management, and updates through Docker [13,33]. In addition, ECS inherently supports auto-scaling, which balances ease of use against potentially slower scale-up times. Furthermore, ECS offers a streamlined approach for future application development and operational efficiency.

4. Evaluation

This section reports the performance evaluation of the approaches introduced in this study. We first describe the experimental environment and configuration used in the evaluation. Subsequently, the obtained results are presented and discussed.
Throughout the evaluation, the average CPU, RAM, and GPU utilization metrics were reported to illustrate how efficiently provisioned resources are used over time. However, average utilization alone did not directly represent the total computational consumption or monetary cost. For this reason, execution time and on-demand pricing were treated as the primary indicators of overall resource consumption, while utilization metrics were used as supporting evidence to interpret performance behavior and identify potential inefficiencies. For the sake of completeness, total resource consumption results are provided in Appendix A to maintain the conciseness of the main text.
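To make the relationship between execution time, on-demand pricing, and cost concrete, the following sketch computes the cost of a run as the sum of price times active hours over all instances. The `InstanceUsage` type and the `on_demand_cost` function are hypothetical names, and any prices passed in are placeholders supplied by the caller, not values taken from this study.

```python
from dataclasses import dataclass


@dataclass
class InstanceUsage:
    """Billing record for one instance during an experiment."""
    instance_type: str
    hourly_price_usd: float  # placeholder supplied by the caller
    hours_active: float      # time the instance was running


def on_demand_cost(usages):
    """Total on-demand cost: sum of hourly price times active hours."""
    return sum(u.hourly_price_usd * u.hours_active for u in usages)
```

Under this simple model, an additional worker that is active for only part of an experiment contributes proportionally less to the total than an always-active instance, which is why average utilization alone does not determine cost.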

4.1. Experimental Setup

For the experimental setup, we used the AWS EC2 instances shown in Table 2. More specifically, we used three types of deployment, one per model proposed:
  • Deploying the complete application using one EC2 instance type [16] (Figure 1a). In this case, we use an EC2 g5.2xlarge instance with 8 vCPUs (AMD EPYC 7R32), 32 GB of RAM, and a GPU with 24 GB of memory (NVIDIA A10G Tensor Core).
  • Deploying the application in components on different types of EC2 instances (Figure 1b). In this case, we deploy less resource-intensive components like the API server, the backend server and the scaler/load balancer on a smaller instance with enough RAM and CPU capacity, namely a t2.xlarge instance with 4 vCPUs (Intel Xeon Scalable) and 16 GB of RAM. Two worker components are deployed on two g5.2xlarge instances. The t2.xlarge instance is always active, waiting for requests. One worker instance is always active. The additional worker instance is started or stopped as required.
  • Deploying the application components as services on ECS clusters [34] (Figure 1c). The ECS implementation is the same as the previous one, but using ECS clusters. The API server, the backend server and the scaler/load balancer components are deployed on an ECS cluster, which contains an EC2 group with one EC2 instance of type t2.xlarge. The two workers are deployed on another ECS cluster, which includes an EC2 group with two EC2 instances of type g5.2xlarge. In this case, it is necessary that the two ECS clusters use the same namespace to communicate with each other.
As previously mentioned, the application evaluated in the experiments corresponds to a real-world use case provided by Kimera Technologies. The system is an AI-based application for image processing and classification, offering two core functionalities delivered as services to end users:
  • Database creation from structured data. The application enables the construction of a product database from a comma-separated values (CSV) file containing a client’s product information. During this process, product images are encoded and linked to their corresponding entries defined in the CSV file. This functionality relies on artificial neural networks (ANNs) and GPU-accelerated AI models to generate image representations. The time and computational cost required to build the database depend on several factors, including the size of the CSV file, the number of products, and the amount of additional metadata provided.
  • Product search and retrieval. The system supports advanced search capabilities over the generated databases, allowing clients to query products using visual and textual information. For example, users may upload an image to retrieve visually similar products or search for items sharing specific image characteristics. Textual queries based on product attributes, such as titles or descriptions included in the CSV file, are also supported. In addition to unimodal searches, the application allows multimodal queries that combine both image-based and text-based information.
To assess system performance, several monitoring and benchmarking tools are employed. Apache JMeter [35] is used to conduct load testing by emulating concurrent user requests to the database, thereby enabling performance evaluation under varying system loads. In addition, AWS CloudWatch [36] is utilized to monitor CPU utilization and memory consumption of the deployed instances over time. Since CloudWatch does not expose GPU-related metrics, GPU usage data are collected by directly accessing the instances and querying the NVIDIA System Management Interface (SMI) [37].
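GPU sampling through the NVIDIA SMI command-line tool can be sketched as follows. The sketch assumes that `nvidia-smi` is available on the instance and uses its CSV query mode; the function names and the selected fields are choices made for this example, not a description of the scripts used in the study.

```python
import subprocess

QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(output: str) -> list[dict[str, float]]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output.

    One line per GPU: utilization (%), memory used (MiB), memory total (MiB).
    """
    samples = []
    for line in output.strip().splitlines():
        util, mem_used, mem_total = (float(x) for x in line.split(","))
        samples.append({
            "gpu_util_pct": util,
            "mem_used_mib": mem_used,
            "mem_util_pct": 100.0 * mem_used / mem_total,
        })
    return samples


def sample_gpus() -> list[dict[str, float]]:
    """Take one utilization sample for every GPU on the local machine."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `sample_gpus()` at a fixed interval and averaging the samples would yield utilization figures comparable in spirit to the averages reported in the evaluation.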
In the database-creation experiments, the evaluation is limited to a maximum of four concurrent users, which reflects a realistic industrial scenario. Scalability under higher loads (up to 80 users) is evaluated separately through dedicated load tests in the product search and retrieval experiments.

4.2. Deploying the Complete Application Using One EC2 Instance Type to Create a Database

The model evaluated in this section is the one shown in Figure 1a. We used the EC2 g5.2xlarge instance described earlier. For the experiment, we employed a CSV dataset containing approximately 150,000 entries. Each record included a link to an image along with additional descriptive attributes. Since each user in the application can maintain only a single database, we created up to four user accounts and initiated the database-creation process while varying the number of concurrent users. During these tests, we recorded the time required to complete the operation, as well as the CPU, RAM, GPU, and GPU memory utilization of the EC2 instance.
All system components were deployed on a single EC2 instance for this part of the evaluation. After a successful deployment, API requests can be issued directly to the instance. A scripted workflow was used to trigger the database-creation request via the API, authenticating with the credentials of the targeted user and supplying the 150,000-entry CSV file. The application internally logs the execution time for this operation. In parallel, we monitored CPU, RAM, GPU, and GPU memory usage through the measurement techniques described earlier. After each user’s process completed, the corresponding database was deleted, and the number of concurrent users was incremented. This procedure enabled us to evaluate system behavior under loads of up to four simultaneous database-creation requests.
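The procedure above can be sketched as a small driver that launches one database-creation call per user at the same time and records how long each call takes. This is only an illustrative sketch: `create_database` is a hypothetical callable standing in for the authenticated API request, and `fake_create_database` is a placeholder so that the example runs without the real application.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrent_creations(create_database, users):
    """Start one database-creation call per user at the same time and
    return the wall-clock seconds each call took, keyed by user."""

    def timed_call(user):
        start = time.perf_counter()
        create_database(user)  # hypothetical: blocks until the database is built
        return user, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=len(users)) as pool:
        return dict(pool.map(timed_call, users))


def fake_create_database(user):
    """Stand-in for the real API request so the sketch runs anywhere."""
    time.sleep(0.05)


if __name__ == "__main__":
    # Increase the load from one to four simultaneous users, as in the experiment.
    for n in range(1, 5):
        users = [f"user{i}" for i in range(1, n + 1)]
        timings = run_concurrent_creations(fake_create_database, users)
        print(n, {u: round(t, 2) for u, t in timings.items()})
```

In a real run, the stand-in would be replaced by the actual request, and the databases would be deleted between iterations as described above.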
Figure 2 illustrates the execution time obtained from this experiment. The results reveal a linear increase in processing time as the number of concurrent users grows, with an average growth rate of 0.41.
Resource utilization metrics are depicted in Figure 3a, which shows the average CPU and RAM usage during the tests. RAM consumption remains below 60% across all scenarios and shows only a modest increase with additional users. CPU utilization rises from roughly 20% with a single user to around 32% when two users are active, appearing to stabilize near this value as the number of concurrent users increases.
Figure 4a presents GPU and GPU memory usage. GPU memory consumption remains constant at approximately 68.2% regardless of the load. GPU utilization follows a pattern similar to CPU usage, increasing from about 65% with one user to roughly 94% with two users, after which it remains stable for higher concurrency levels.
The model evaluated in this section does not scale well and it is not suitable for the aims of this work.

4.3. Deploying the Application in Components on Different Types of EC2 Instances to Create a Database

The model evaluated in this section is the one shown in Figure 1b. We carried out experiments with one t2.xlarge instance (for the API server, the backend server, and the scaler/load balancer) and two g5.2xlarge instances (for the workers). As explained in Section 3.5, there is always one worker (i.e., one g5.2xlarge instance) active. The second worker is created in advance and kept in a stopped state. When the GPU usage of the first worker reaches 90%, the scaler starts the second worker. To balance the load between the two workers, a classic application load balancer provided by AWS is used [38]. It implements a round robin policy.
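The scale-up rule and the balancing policy described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the scaler used in the experiments: the `Worker` class and the functions `scale_up_if_needed` and `round_robin` are hypothetical names, and starting the instance is reduced to setting a flag.

```python
from dataclasses import dataclass

GPU_SCALE_UP_THRESHOLD = 90.0  # percent, as stated in the text


@dataclass
class Worker:
    name: str
    running: bool


def scale_up_if_needed(primary_gpu_util: float, standby: Worker) -> bool:
    """Start the pre-created standby worker once the primary worker's GPU
    utilization reaches the threshold. Returns True if a start was triggered."""
    if primary_gpu_util >= GPU_SCALE_UP_THRESHOLD and not standby.running:
        standby.running = True  # placeholder for the real instance-start call
        return True
    return False


def round_robin(workers):
    """Yield running workers in turn, mimicking a round robin policy.
    Assumes at least one worker is always running, as in the text."""
    index = 0
    while True:
        active = [w for w in workers if w.running]
        yield active[index % len(active)]
        index += 1


primary = Worker("worker-1", running=True)
standby = Worker("worker-2", running=False)
scale_up_if_needed(65.0, standby)  # below the threshold: nothing happens
scale_up_if_needed(94.0, standby)  # threshold reached: standby is started
```

Once both workers are running, successive calls to `next()` on `round_robin([primary, standby])` alternate between the two.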
In these experiments, as well as in those in the following sections that involve multiple instances, the time spent pre-processing the user request on the t2.xlarge instance before sending it to a worker is negligible (around 5 s in these experiments). Thus, the results presented below consider only the time and resources used by the two workers, each of which runs on its own g5.2xlarge instance.
Figure 2 shows the execution time of this experiment. Similar to the previous scenario, the execution time increases linearly with the number of users. However, in this case, the execution time is lower than in the previous case for more than one user. A speed-up of 2.12× is achieved with four users. The average growth rate (0.11) is also lower than in the previous case (0.41). These results were expected, as we have doubled resources (i.e., two workers in two different EC2 instances instead of one worker in one EC2 instance).
In terms of resource consumption, Figure 3b presents the average CPU and RAM memory usage. Because this case involves two workers, the figure shows the average usage of two CPUs (CPU1 and CPU2) and two RAM memory units (Memory1 and Memory2). CPU1 and Memory1 refer to one worker, whereas CPU2 and Memory2 refer to the other worker.
In experiments with only one user, the second worker is not active. Thus, the results for one user are similar to the previous scenario. For more than one user, however, the second worker is started. In these cases, the average RAM memory usage is similar (around 54%) for all the workers regardless of the number of users. The average CPU usage is also similar for all the workers (around 29%). It increases slightly when more users are added. We also observe that CPU load is better balanced as the number of users (i.e., the work to do) increases.
Figure 4b presents the average GPU and GPU memory usage. As in the previous scenario, the average GPU memory usage remains constant at 68.2% regardless of the number of users. The average GPU usage increases as the number of users grows. Thus, it increases from around 65% with one user to around 91% for four users. GPU load is well balanced regardless of the number of users.
Compared to the previous scenario (i.e., one EC2 vs. multiple EC2), more resources are provisioned, but the execution time is reduced by a larger factor. Thus, the total resource consumption is also reduced. For example, in the experiments with four users, total CPU and RAM consumption are reduced by 12% and 14%, respectively, and total GPU and GPU memory consumption by 8% and 6%, respectively.
In summary, the model evaluated in this section presents good scalability and flexibility and it is suitable for the objectives of this work.

4.4. Deploying the Application Components as Services on ECS Clusters to Create a Database

The model evaluated in this section is the one shown in Figure 1c. The scenario using ECS is very similar to the previous one, with one t2.xlarge (for the API server, the backend server, and the scaler/load balancer) and two g5.2xlarge instances (for the workers). However, now the t2.xlarge instance is in one ECS cluster and the two g5.2xlarge instances are in another ECS cluster.
To ensure a fair comparison with the multiple EC2 deployments, both approaches rely on pre-configured execution environments. In the multiple EC2 scenario, worker instances are created in advance and kept in a stopped state, enabling fast activation. In contrast, ECS provisions new workers using predefined task and instance templates. Although functionally equivalent in configuration, the ECS approach inherently incurs additional delay due to container instantiation, scheduling, and orchestration, which is reflected in the observed scale-up times.
As in the previous model, there is always one worker (i.e., one g5.2xlarge instance) active. However, in this case, the second worker does not exist in advance because ECS provisions instances differently. Instead, a template containing the required configuration is created. When the GPU usage of the first worker reaches 90%, the scaler requests a new instance, which is then configured using the template. After that, the new instance is started. To balance the load between the two workers, the application load balancer provided by ECS is used [38]. As in the previous scenario, it implements a round robin policy.
Figure 2 shows the execution time of this experiment. Similarly to the previous scenario, the execution time increases linearly with the number of users. However, in this case, the execution time is higher than in the previous case, regardless of the number of users. For example, for four users, this scenario requires 22% more time. The average growth rate (0.18) is also higher than in the previous case (0.11). These results were expected, as this model is based on containers, which simplify deployment and management, but usually reduce performance.
Regarding resource consumption, Figure 3c presents the average CPU and RAM memory usage. In general, the resource usage follows a similar trend to that of the previous scenario. However, the values are slightly lower. Thus, the average RAM memory usage is 51.9%, while in the previous scenario it was 53.5%. The same happens with average CPU usage (24.6% vs. 25.6%). We also observe that CPU load is better balanced than in the previous scenario, regardless of the load (i.e., the number of users).
Figure 4c presents the average GPU and GPU memory usage. As in the previous scenario, the average GPU memory usage remains constant at 68.2% regardless of the number of users. The average GPU usage is, however, slightly lower. Thus, the average GPU usage is 76.8%, while in the previous scenario it was 79.1%. Similar to what happened in the previous scenario, GPU load is well balanced regardless of the number of users.
Compared to the previous scenario (i.e., multiple EC2 vs. ECS), the average resource usage decreases, but the execution time increases by a larger factor. Thus, in general, the total resource consumption also increases. For example, in the experiments with four users, total CPU and RAM consumption both increase by 18%, and total GPU and GPU memory consumption increase by 21% and 22%, respectively.
To better understand why ECS results are worse than in the previous scenario, we further analyzed these cases. Figure 5 shows the average CPU and GPU usage of these deployments with four concurrent users creating a database. As already discussed in Section 3.5, ECS balances ease of use against potentially slower scale-up times. This is clearly shown in these experiments. In Figure 5a, we can observe that the scale-up time (i.e., the time elapsed from when the second worker is requested until it is ready to work) for the previous scenario is 5.12 min. However, as shown in Figure 5b, the scale-up time when using ECS is 11.85 min. This slower scale-up time, together with the reduced performance due to the use of containers, explains the results.
The model evaluated in this section also presents good scalability and flexibility, and it is also suitable for the purposes of this work.

4.5. Load Testing of Product Search Functionality

Next, we evaluated the system’s performance under load using JMeter (version 5.6.3), following the methodology described earlier. In this experiment, we measured the number of requests per second that the application can handle as the number of concurrent users performing product-search queries increases. The same database and workload configuration were maintained across all tests to ensure comparability. To conduct the experiment, JMeter was configured to continuously issue requests to the application’s IP address for a duration of five minutes, while varying the number of threads to emulate different levels of simultaneous user activity. This setup allowed us to observe how the system behaves under increasing query rates and to assess its capacity for handling multiple search operations concurrently. The resulting performance metrics are presented in Figure 6.
Similarly to previous sections, three scenarios were compared: one, where the complete application is deployed in a single g5.2xlarge instance (Figure 1a); a second scenario, where some components of the application (the API server, the backend server, and the scaler/load balancer) are deployed in one t2.xlarge instance, and the workers are deployed in g5.2xlarge instances (Figure 1b); and a third scenario similar to the second one but using ECS (Figure 1c). As explained in Section 3.5, in the second and third scenarios, there is always one worker (i.e., one g5.2xlarge instance) active.
During these search experiments, GPU utilization of the active worker never exceeded the scale-up threshold (i.e., 90%). As a result, additional workers were not instantiated, and the comparison effectively isolates differences related to deployment overhead, request handling, and orchestration mechanisms under a single-worker configuration. This behavior is representative of steady-state inference workloads with moderate concurrency.
As illustrated in Figure 6, the throughput of the system, measured in requests per second, increases with the number of concurrent users issuing search queries across all evaluated scenarios. This growth persists up to approximately 40 simultaneous users. Beyond this point, the throughput reaches a saturation level, remaining stable despite further increases in the number of concurrent requests. With a single EC2 instance, the maximum throughput achieved is 30.7 requests per second. With multiple EC2 instances and with ECS, the maxima are 33.4 and 31 requests per second, respectively. Thus, the multiple EC2 scenario outperforms the single EC2 and ECS scenarios by 9% and 8%, respectively.
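The relative improvements quoted above follow directly from the peak throughput values. The short sketch below recomputes them; the function name `relative_improvement` is introduced only for this illustration.

```python
def relative_improvement(best: float, baseline: float) -> float:
    """Percentage by which `best` exceeds `baseline`."""
    return 100.0 * (best - baseline) / baseline


# Peak throughput in requests per second, as reported in the text.
peak_rps = {"one EC2": 30.7, "multiple EC2": 33.4, "ECS": 31.0}

for name in ("one EC2", "ECS"):
    gain = relative_improvement(peak_rps["multiple EC2"], peak_rps[name])
    print(f"multiple EC2 vs {name}: {gain:.1f}%")
```

The computed gains are about 8.8% and 7.7%, which round to the 9% and 8% reported.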
In addition to throughput, we monitored response latency and request completion status during the load tests. Across all evaluated scenarios, no request failures were observed. The average response latency increased gradually with the number of concurrent users and stabilized once throughput saturation was reached, following a trend consistent with the observed throughput behavior.
The deployment method does not affect retrieval precision, classification accuracy, or result consistency. All experiments use identical model architectures, weights, preprocessing steps, and inference logic. Therefore, differences observed across deployment scenarios are solely attributable to infrastructure and orchestration effects rather than to changes in algorithmic behavior.

4.6. Cost Analysis

This section presents a comparative analysis of the economic costs associated with the evaluated deployment configurations. The assessment is based on Amazon EC2 on-demand pricing [39]. At the time of writing this paper, the cost of the instance t2.xlarge was USD 0.1856 per hour, while the cost of the instance g5.2xlarge was USD 1.212 per hour (on-demand price per hour for Linux/Unix in the US East (Northern Virginia) AWS Region). In the scenarios where different types of instances are used (i.e., multiple EC2 and ECS), the price includes one t2.xlarge instance, one g5.2xlarge and an additional g5.2xlarge when required (i.e., two or more users for creating a database).
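A minimal sketch of the underlying arithmetic is given below: the cost of a run is the sum of the hourly prices of the instances involved multiplied by the time they are in use. The hourly prices are those quoted above; the run length `ASSUMED_HOURS` is a hypothetical value chosen for illustration, and the sketch simplifies by billing the additional worker for the whole run rather than only from the moment it is started.

```python
# On-demand hourly prices quoted in the text (USD, US East (N. Virginia), Linux/Unix).
PRICE_PER_HOUR = {"t2.xlarge": 0.1856, "g5.2xlarge": 1.212}


def deployment_cost(instance_counts: dict, hours: float) -> float:
    """Cost in USD of keeping the given instances running for `hours`."""
    hourly = sum(PRICE_PER_HOUR[kind] * count for kind, count in instance_counts.items())
    return hourly * hours


# Hypothetical run length; the real durations come from the experiments.
ASSUMED_HOURS = 0.5

single = deployment_cost({"g5.2xlarge": 1}, ASSUMED_HOURS)
multi = deployment_cost({"t2.xlarge": 1, "g5.2xlarge": 2}, ASSUMED_HOURS)
print(f"one EC2: {single:.4f} USD, multiple EC2 (two workers): {multi:.4f} USD")
```

For equal run lengths the multi-instance deployment is necessarily more expensive, so any cost advantage has to come from a shorter execution time.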
The corresponding cost estimates are presented in Table 3 and Table 4. Table 3 reports the monetary cost, expressed in U.S. dollars (USD), associated with each evaluated deployment configuration when creating a database under different levels of user concurrency. Each value represents the cost per database creation process. As we can observe, for one user, the price is similar for all the deployments. However, as the number of users increases, the cost of the scenarios using multiple instances decreases. Thus, for four users, the scenario using multiple EC2 instances presents the lowest cost, with a reduction of 6% and 22% compared to EC2 and ECS, respectively.
Table 4 summarizes the estimated economic cost, expressed in U.S. dollar cents (CENT), for the evaluated deployment configurations when executing product search operations under varying levels of concurrent user load. Each reported value corresponds to the cost incurred per individual search request. Prices are similar for all the deployments. Again, as the level of concurrency increases, the cost per request decreases. This reduction is observed up to 40 concurrent users, after which the cost stabilizes and remains unchanged despite further increases in user count. For over 40 users, the scenario using multiple EC2 instances presents the lowest cost, with a reduction of 7% compared to the other scenarios.

5. Discussion

The experimental results highlight several practical implications for deploying large-scale, GPU-accelerated computer vision applications in cloud environments. Decomposing application components across heterogeneous resources enables better scalability and cost efficiency than monolithic deployments, while container-based solutions offer improved manageability at the expense of additional overhead. These findings suggest that infrastructure selection should be guided not only by raw performance but also by operational considerations, such as scale-up latency, deployment complexity, and expected workload variability.
Furthermore, the proposed architecture can be naturally extended with intelligent preprocessing stages, such as image quality assessment or filtering modules, to avoid unnecessary inference on low-quality inputs. Incorporating such components represents a promising direction for future work, particularly in conjunction with modern IQA models based on transformers and large-scale representation learning.
Beyond the scope of the current investigation, further research could also explore analyzing how image degradation (e.g., noise, resolution, or compression) affects the system’s accuracy, scalability, or resource utilization.

6. Conclusions

In this work, we analyzed different cloud computing alternatives for distributing a real-world AI-based computer vision application, namely, (i) deploying the complete application using one EC2 instance (EC2), (ii) deploying the application in components on different types of EC2 instances (multiple EC2), and (iii) deploying the application components as services on ECS clusters (ECS).
In the tests performed, the multiple EC2 scenario presented the best results. This model and the ECS one showed good scalability and flexibility. On the contrary, the scenario using one EC2 instance did not present acceptable scalability. In addition to offering the best performance, the multiple EC2 scenario also presented slightly lower cost. Therefore, we conclude that this approach is the most appropriate for the objectives pursued in this study.

Author Contributions

Conceptualization, H.Z., C.R., A.C., J.F.A.-S., Á.I. and C.I.; methodology, C.R.; software, H.Z. and J.F.A.-S.; validation, C.R. and A.C.; formal analysis, C.R. and A.C.; investigation, H.Z. and J.F.A.-S.; resources, Á.I. and C.I.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, H.Z., C.R., A.C., J.F.A.-S., Á.I. and C.I.; visualization, C.R.; supervision, C.R. and J.F.A.-S.; project administration, Á.I. and C.I.; funding acquisition, C.R., A.C., Á.I. and C.I. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Valencian Innovation Agency (AVI) of the Generalitat Valenciana (GVA) under Grant INNTA3/2023/17.

Data Availability Statement

Data are available on request due to restrictions. Experimental performance, resource usage, and cost data were generated and analyzed during the current study. These data are derived from controlled deployments on commercial cloud infrastructure and are available from the authors upon reasonable request.

Acknowledgments

The authors are grateful for the support provided by Kimera Technologies and its team.

Conflicts of Interest

Authors Haojie Zheng, Alberto Castillo, Juan F. Ariño-Sales, Álvaro Igual and Carles Igual were employed by the company Kimera Technologies, S.L. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IaaS   Infrastructure as a Service
CaaS   Container as a Service
FaaS   Function as a Service
AWS   Amazon Web Services
GCP   Google Cloud Platform
IT   Information Technology
CPU   Central Processing Unit
RAM   Random Access Memory
GPU   Graphics Processing Unit
ML   Machine Learning
MLaaS   Machine Learning as a Service
AI   Artificial Intelligence
API   Application Programming Interface
IQA   Image Quality Assessment
WSGI   Web Server Gateway Interface
EC2   Elastic Compute Cloud
ECS   Elastic Container Service
CSV   Comma-Separated Values
ANN   Artificial Neural Network
SMI   NVIDIA System Management Interface
USD   US dollars
CENT   USD cents

Appendix A. Total Resource Consumption

This appendix presents the total resource consumption of the different deployments, depending on the number of concurrent users creating a database. The resources considered are CPU and RAM memory, shown in Table A1; and GPU and GPU memory, shown in Table A2. These values directly capture the aggregate amount of computational resources consumed by each deployment configuration, therefore complementing the average utilization metrics analyzed in the main evaluation section of this paper.
It should be noted that in the deployments where more than one compute instance is used (i.e., “multiple EC2 instance types” and “ECS infrastructure”) resource consumption is aggregated to obtain the total number. For each resource analyzed, the total consumption was calculated using Equation (A1). As we can observe in this equation, the total resource consumption is calculated by multiplying the average resource consumption by the number of resources and by the execution time.
$$\mathrm{Total\_resource\_consumption} = \mathrm{avg\_resource\_consumption} \times \mathrm{num\_resources} \times \mathrm{execution\_time} \tag{A1}$$
It should also be noted that the consumption unit depends on the resource. Specifically, the units used in Table A1 and Table A2 are the following:
  • CPU consumption is measured in core-hours. E.g., a value of 1 indicates that one CPU core is fully utilized (i.e., 100%) for one hour.
  • RAM memory consumption is measured in GB-hours (GB·h). E.g., a value of 1 indicates that one GB of RAM memory is used for one hour.
  • GPU consumption is measured in GPU-hours. E.g., a value of 1 indicates that one GPU is fully utilized (i.e., 100%) for one hour.
  • GPU memory consumption is measured in GB-hours (GB·h). E.g., a value of 1 indicates that one GB of GPU memory is used for one hour.
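Equation (A1) and the units above can be illustrated with a short sketch. The function name and the example inputs are assumptions made for illustration: the 68.2% average GPU memory usage is the value reported in the main text, whereas the 24 GB capacity and the 0.5 h execution time are hypothetical and are not intended to reproduce any table entry.

```python
def total_resource_consumption(avg_utilization: float, num_resources: float, hours: float) -> float:
    """Equation (A1): average utilization (as a fraction) x amount of resource x execution time."""
    return avg_utilization * num_resources * hours


# GPU memory example: 68.2% average usage is reported in the text; the 24 GB
# capacity and the 0.5 h duration are assumptions chosen for illustration.
gpu_mem_gb_hours = total_resource_consumption(0.682, 24, 0.5)
print(f"{gpu_mem_gb_hours:.2f} GB·h")
```

For deployments with several workers, the same function would be applied to each worker and the results summed to obtain the aggregate value.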
As we can see in Table A1, for a single concurrent user, the total CPU and RAM consumption is very similar across all three deployments. The One EC2 and multiple EC2 configurations both consume 0.11 CPU core-hours, while the ECS deployment consumes slightly less CPU (0.10 core-hours). Memory consumption follows a similar pattern, ranging from 8.85 GB·h in ECS to 9.84 GB·h in the monolithic EC2 setup, indicating negligible differences at low load levels. As the number of concurrent users increases, clearer differences emerge. With four users, the ECS deployment exhibits the highest total system memory consumption (28.99 GB·h), followed closely by One EC2 (28.70 GB·h). In contrast, the multiple EC2 deployment consumes substantially less memory (24.59 GB·h), representing a reduction of approximately 14% compared to One EC2. A similar trend is observed for CPU usage: at four users, multiple EC2 consumes 0.42 CPU core-hours, compared to 0.48 in One EC2 and 0.50 in ECS. These results indicate that decomposing the application across heterogeneous EC2 instances enables more efficient use of CPU and RAM at the highest concurrency level evaluated.
Table A1. The total CPU and RAM memory consumption of the different deployments, depending on the number of concurrent users creating a database.
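| Deployment | Number of Users | CPU (core-hours) | Memory (GB·h) |
|---|---|---|---|
| One EC2 instance type | 1 | 0.11 | 9.84 |
| One EC2 instance type | 2 | 0.24 | 13.82 |
| One EC2 instance type | 3 | 0.36 | 20.89 |
| One EC2 instance type | 4 | 0.48 | 28.70 |
| Multiple EC2 instance types | 1 | 0.11 | 9.05 |
| Multiple EC2 instance types | 2 | 0.25 | 19.29 |
| Multiple EC2 instance types | 3 | 0.35 | 21.55 |
| Multiple EC2 instance types | 4 | 0.42 | 24.59 |
| ECS infrastructure | 1 | 0.10 | 8.85 |
| ECS infrastructure | 2 | 0.25 | 20.99 |
| ECS infrastructure | 3 | 0.38 | 22.89 |
| ECS infrastructure | 4 | 0.50 | 28.99 |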
GPU-related consumption, reported in Table A2, further highlights the differences between deployment strategies. For a single user, the total GPU consumption is comparable across deployments (0.35, 0.34, and 0.33 GPU-hours for One EC2, multiple EC2, and ECS, respectively), with minor differences attributable to orchestration overhead. At the highest concurrency level, however, the multiple EC2 deployment shows lower total GPU memory consumption than the other approaches. For four users, GPU memory consumption reaches 24.73 GB·h in the One EC2 deployment and 28.38 GB·h in ECS, while multiple EC2 requires only 23.34 GB·h, corresponding to reductions of 6% and 18%, respectively. The total GPU consumption follows a similar pattern: at four users, multiple EC2 consumes 1.30 GPU-hours, compared to 1.41 GPU-hours for One EC2 and 1.57 GPU-hours for ECS. The higher GPU consumption observed in ECS can be attributed to longer execution times and scale-up delays, which increase the duration during which GPU resources remain allocated, even if average utilization is slightly lower.
Overall, the results demonstrate that the multiple EC2 deployment achieves the lowest total resource consumption across CPU, RAM, GPU, and GPU memory at the highest concurrency level evaluated (four users). While the ECS-based deployment offers operational benefits, such as simplified orchestration and management, these advantages come at the cost of higher total resource usage due to increased execution time and container-related overhead. The monolithic One EC2 deployment, although simple, scales poorly and leads to higher aggregate resource consumption as user concurrency increases. These findings reinforce and complement the conclusions drawn in the main evaluation section of this paper.
Table A2. The total GPU and GPU memory consumption of the different deployments, depending on the number of concurrent users creating a database.
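| Deployment | Number of Users | GPU (GPU-hours) | GPU Memory (GB·h) |
|---|---|---|---|
| One EC2 instance type | 1 | 0.35 | 8.88 |
| One EC2 instance type | 2 | 0.71 | 12.38 |
| One EC2 instance type | 3 | 1.06 | 18.44 |
| One EC2 instance type | 4 | 1.41 | 24.73 |
| Multiple EC2 instance types | 1 | 0.34 | 8.54 |
| Multiple EC2 instance types | 2 | 0.76 | 18.42 |
| Multiple EC2 instance types | 3 | 1.10 | 20.87 |
| Multiple EC2 instance types | 4 | 1.30 | 23.34 |
| ECS infrastructure | 1 | 0.33 | 8.75 |
| ECS infrastructure | 2 | 0.77 | 20.73 |
| ECS infrastructure | 3 | 1.20 | 22.59 |
| ECS infrastructure | 4 | 1.57 | 28.38 |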

References

  1. Zhou, L.; Zhang, L.; Konz, N. Computer Vision Techniques in Manufacturing. IEEE Trans. Syst. Man Cybern. Syst. 2023, 53, 105–117.
  2. Javaid, M.; Haleem, A.; Singh, R.P.; Ahmed, M. Computer vision to enhance healthcare domain: An overview of features, implementation, and opportunities. Intell. Pharm. 2024, 2, 792–803.
  3. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep Learning for Computer Vision: A Brief Review. Comput. Intell. Neurosci. 2018, 2018, 7068349.
  4. Souza, D.; Geuna, A.; Rodríguez, J. How small is big enough? Open labeled datasets and the development of deep learning. Ind. Corp. Change 2025, 34, 1322–1365.
  5. Mayer, R.; Jacobsen, H.A. Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques, and Tools. ACM Comput. Surv. 2020, 53, 1–37.
  6. Qian, L.; Luo, Z.; Du, Y.; Guo, L. Cloud Computing: An Overview. In Cloud Computing; Jaatun, M.G., Zhao, G., Rong, C., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 626–631.
  7. Mahmoudi, S.A.; Adoui, M.E.; Belarbi, M.A.; Larhmam, M.A.; Lecron, F. Cloud-based platform for computer vision applications. In Proceedings of the 2017 International Conference on Smart Digital Environment (ICSDE ’17), New York, NY, USA, 21–23 July 2017; pp. 195–200.
  8. Kimera Technologies, S.L. eCommerce Shopping Assistant. Available online: https://kimeratechnologies.com/ (accessed on 30 March 2026).
  9. Zheng, H.; Reaño, C.; Castillo, A.; Igual, A.; Igual, C. Cloud Computing Alternatives for Distributing a Real-world AI-based Computer Vision Application. In Proceedings of the 24th International Symposium on Parallel and Distributed Computing (ISPDC), Rennes, France, 8–11 July 2025; pp. 1–5.
  10. Nathani, A.; Chaudhary, S.; Somani, G. Policy based resource allocation in IaaS cloud. Future Gener. Comput. Syst. 2012, 28, 94–103.
  11. Cáceres, J.; Vaquero, L.M.; Rodero-Merino, L.; Polo, Á.; Hierro, J.J. Service Scalability Over the Cloud. In Handbook of Cloud Computing; Springer: New York, NY, USA, 2010; pp. 357–377.
  12. Malviya, A.; Dwivedi, R.K. Designing Architecture for Container-As-A-Service (CaaS) in Cloud Computing Environment: A Review. In Proceedings of the 3rd International Conference on Machine Learning, Advances in Computing, Renewable Energy and Communication; Springer: Singapore, 2022; pp. 549–563.
  13. Docker Inc. Docker: Accelerated Container Application Development. Available online: https://www.docker.com/ (accessed on 30 March 2026).
  14. The Kubernetes Authors. Kubernetes: Production-Grade Container Orchestration. Available online: https://kubernetes.io/ (accessed on 30 March 2026).
  15. Zhang, C.; Yu, M.; Wang, W.; Yan, F. Enabling Cost-Effective, SLO-Aware Machine Learning Inference Serving on Public Cloud. IEEE Trans. Cloud Comput. 2022, 10, 1765–1779.
  16. Amazon Web Services, Inc. Amazon EC2—Cloud Compute Capacity. Available online: https://aws.amazon.com/ec2 (accessed on 30 March 2026).
  17. Amazon Web Services, Inc. Fully Managed Container Solution—Amazon Elastic Container Service. Available online: https://aws.amazon.com/ecs (accessed on 30 March 2026).
  18. Amazon Web Services, Inc. Serverless Function, Faas Serverless—AWS Lambda. Available online: https://aws.amazon.com/lambda (accessed on 30 March 2026).
  19. Ramesh, M.; Chahal, D.; Singhal, R. Multicloud Deployment of AI Workflows Using FaaS and Storage Services. In 15th International Conference on COMmunication Systems & NETworkS (COMSNETS); IEEE: Piscataway, NJ, USA, 2023; pp. 269–277.
  20. Joosen, A.; Hassan, A.; Asenov, M.; Singh, R.; Darlow, L.; Wang, J.; Deng, Q.; Barker, A. Serverless Cold Starts and Where to Find Them. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys ’25), New York, NY, USA, 30 March–3 April 2025; pp. 938–953.
  21. Nguyen, T. Holistic cold-start management in serverless computing cloud with deep learning for time series. Future Gener. Comput. Syst. 2024, 153, 312–325.
  22. Karamzadeh, A.; Shameli-Sendi, A. Reducing cold start delay in serverless computing using lightweight virtual machines. J. Netw. Comput. Appl. 2024, 232, 104030.
  23. Huang, C.; Huang, Y.; Feng, J.; Liang, S.; Yan, M.; Wu, J. LACE: Mitigating cold-starts in serverless with a multi-task mixture-of-experts caching. J. King Saud. Univ. Comput. Inf. Sci. 2026, 38, 93.
  24. Hanaforoosh, M.; Azgomi, M.A.; Ashtiani, M. Reducing the cost of cold start time in serverless function executions using granularity trees. Future Gener. Comput. Syst. 2025, 164, 107604.
  25. Marcelino, C.; Nastic, S. Truffle: Efficient Data Passing for Data-Intensive Serverless Workflows in the Edge-Cloud Continuum. In 2024 IEEE/ACM 17th International Conference on Utility and Cloud Computing (UCC); IEEE: Piscataway, NJ, USA, 2024; pp. 53–62.
  26. Fehling, C.; Leymann, F.; Retter, R.; Schupeck, W.; Arbitter, P. Essential Cloud Application Properties. In Cloud Computing Patterns: Fundamentals to Design, Build, and Manage Cloud Applications; Springer: Wien, Austria, 2014; pp. 5–7.
  27. Vaquero, L.M.; Rodero-Merino, L.; Buyya, R. Dynamically scaling applications in the cloud. SIGCOMM Comput. Commun. Rev. 2011, 41, 45–52.
  28. Song, M.; Zhang, C.; Haihong, E. An Auto Scaling System for API Gateway Based on Kubernetes. In IEEE 9th International Conference on Software Engineering and Service Science (ICSESS); IEEE: Piscataway, NJ, USA, 2018; pp. 109–112.
  29. Rehman, M.U.; Nizami, I.F.; Ullah, F.; Hussain, I. IQA Vision Transformed: A Survey of Transformer Architectures in Perceptual Image Quality Assessment. IEEE Access 2024, 12, 183369–183393.
  30. Zhang, Z.; Zhou, Y.; Li, C.; Zhao, B.; Liu, X.; Zhai, G. Quality Assessment in the Era of Large Models: A Survey. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 21, 1–31.
  31. Pallets. Flask: The Python Micro Framework for Building Web Applications. Available online: https://flask.palletsprojects.com/ (accessed on 30 March 2026).
  32. The Gunicorn Authors. Gunicorn: Python WSGI HTTP Server for UNIX. Available online: https://gunicorn.org/ (accessed on 30 March 2026).
  33. Merkel, D. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J. 2014, 239, 76–91.
  34. Amazon Web Services, Inc. Amazon ECS clusters—Amazon Elastic Container Service. Available online: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/clusters.html (accessed on 30 March 2026).
  35. Apache Software Foundation. Apache JMeter™. Available online: https://jmeter.apache.org/ (accessed on 30 March 2026).
  36. Amazon Web Services, Inc. APM Tool—Amazon CloudWatch. Available online: https://aws.amazon.com/cloudwatch/ (accessed on 30 March 2026).
  37. NVIDIA Corporation. System Management Interface (SMI). Available online: https://developer.nvidia.com/system-management-interface (accessed on 30 March 2026).
  38. Amazon Web Services, Inc. Elastic Load Balancing. Available online: https://docs.aws.amazon.com/elasticloadbalancing (accessed on 30 March 2026).
  39. Amazon Web Services, Inc. Amazon EC2 On-Demand Pricing. Available online: https://aws.amazon.com/ec2/pricing/on-demand/ (accessed on 30 March 2026).
Figure 1. Proposed models.
Figure 2. Execution time of the different deployments (one EC2 instance type, multiple EC2 instance types, and ECS infrastructure) as the number of concurrent users creating a database varies.
Figure 3. Average CPU and RAM usage of the different deployments, depending on the number of concurrent users creating a database.
Figure 4. The average GPU and GPU memory usage of the different deployments, depending on the number of concurrent users creating a database.
Figure 5. Average CPU and GPU usage of the different deployments with four concurrent users creating a database.
Figure 6. The load test results for the different deployments, varying the number of concurrent users searching products in a database.
Table 1. Comparison of cloud service models.
| Model | Control | Scalability | Complexity | Primary Trade-Off |
|-------|---------|-------------|------------|-------------------|
| IaaS | High | Manual/Slow | High | Maximum flexibility vs. high operational overhead. |
| CaaS | Medium | Managed | Medium | Portability vs. security/governance challenges. |
| FaaS | Low | Automatic | Low | Rapid scaling vs. cold-start latency and limited observability. |
Table 2. Specification summary of AWS EC2 instance types used in experiments.
| Instance Type | vCPU | RAM | GPU | GPU Memory |
|---------------|------|-----|-----|------------|
| g5.2xlarge | 8 vCPU | 32 GB | 1 | 24 GB |
| t2.xlarge | 4 vCPU | 16 GB | - | - |
Table 3. The economic cost (USD) of the deployments under evaluation for creating a database, depending on the number of concurrent users. The value shown is the price per database.
| Concurrent Users | EC2 | Multiple EC2 | ECS |
|------------------|-----|--------------|-----|
| 1 | USD 0.66 | USD 0.63 | USD 0.65 |
| 2 | USD 0.46 | USD 0.68 | USD 0.77 |
| 3 | USD 0.46 | USD 0.52 | USD 0.56 |
| 4 | USD 0.46 | USD 0.43 | USD 0.53 |
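As a reading aid for Table 3, the following minimal sketch transcribes the table values and picks the lowest-cost deployment for each number of concurrent users. The dictionary layout and the helper name `cheapest_deployment` are illustrative assumptions and are not taken from the article; only the numbers come from the table.

```python
# Per-database cost in USD, transcribed from Table 3.
# Keys are the number of concurrent users; inner keys follow the table columns.
# The data layout and function name are hypothetical, chosen for illustration.
COST_PER_DATABASE = {
    1: {"EC2": 0.66, "Multiple EC2": 0.63, "ECS": 0.65},
    2: {"EC2": 0.46, "Multiple EC2": 0.68, "ECS": 0.77},
    3: {"EC2": 0.46, "Multiple EC2": 0.52, "ECS": 0.56},
    4: {"EC2": 0.46, "Multiple EC2": 0.43, "ECS": 0.53},
}


def cheapest_deployment(costs):
    """Map each concurrency level to its (deployment, cost) pair with the lowest cost."""
    return {
        users: min(row.items(), key=lambda item: item[1])
        for users, row in costs.items()
    }


if __name__ == "__main__":
    for users, (name, cost) in cheapest_deployment(COST_PER_DATABASE).items():
        print(f"{users} concurrent user(s): {name} at USD {cost:.2f} per database")
```

The sketch only compares the reported per-database prices; it does not model how those prices were derived from instance pricing or execution time.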
Table 4. The economic cost (US cents) of the deployments under evaluation for searching a product in a database, depending on the number of concurrent users. The value shown is the price per request.
| Concurrent Users | EC2 | Multiple EC2 | ECS |
|------------------|-----|--------------|-----|
| 1 | 0.0337¢ | 0.0342¢ | 0.0342¢ |
| 10 | 0.0034¢ | 0.0034¢ | 0.0034¢ |
| 20 | 0.0017¢ | 0.0017¢ | 0.0017¢ |
| >40 | 0.0011¢ | 0.0010¢ | 0.0011¢ |
The symbol “¢” denotes US cents.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zheng, H.; Reaño, C.; Castillo, A.; Ariño-Sales, J.F.; Igual, Á.; Igual, C. Scaling Computer Vision: A Comparative Analysis of Cloud Infrastructures for AI-Based Image Processing and Classification Applications. Electronics 2026, 15, 1953. https://doi.org/10.3390/electronics15091953
