Article

Low-Scalability Distributed Systems for Artificial Intelligence: A Comparative Study of Distributed Deep Learning Frameworks for Image Classification

by Manuel Rivera-Escobedo 1,†, Manuel de Jesús López-Martínez 1,*,†, Luis Octavio Solis-Sánchez 2, Héctor Alonso Guerrero-Osuna 3, Sodel Vázquez-Reyes 3, Daniel Acosta-Escareño 3 and Carlos A. Olvera-Olvera 1,*,†
1 Laboratorio de Invenciones Aplicadas a la Industria (LIAI), Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Zacatecas 98000, Mexico
2 Laboratorio de Sistemas Inteligentes de Visión Artificial, Posgrado en Ingeniería y Tecnología Aplicada, Universidad Autónoma de Zacatecas, Zacatecas 98000, Mexico
3 Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Zacatecas 98000, Mexico
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2025, 15(11), 6251; https://doi.org/10.3390/app15116251
Submission received: 24 April 2025 / Revised: 28 May 2025 / Accepted: 30 May 2025 / Published: 2 June 2025
(This article belongs to the Special Issue Distributed Computing Systems: Advances, Trends and Emerging Designs)

Abstract:
Artificial intelligence has experienced tremendous growth in various areas of knowledge, especially in computer science. Distributed computing has become necessary for storing, processing, and generating large amounts of information essential for training artificial intelligence models and algorithms that allow knowledge to be created from large amounts of data. Currently, cloud services offer products for running distributed data training, such as NVIDIA Deep Learning Solutions, Amazon SageMaker, Microsoft Azure, and Google Cloud AI Platform. These services have a cost that adapts to the needs of users who require high processing performance to perform their artificial intelligence tasks. This study highlights the relevance of distributed computing in image processing and classification tasks using a low-scalability distributed system built with devices considered obsolete. To this end, two of the most widely used libraries for the distributed training of deep learning models, PyTorch’s Distributed Data Parallel and Distributed TensorFlow, were implemented and evaluated using the ResNet50 model as a basis for image classification, and their performance was compared with modern environments such as Google Colab and a recent Workstation. The results demonstrate that even with low scalability and outdated distributed systems, comprehensive artificial intelligence tasks can still be performed, reducing investment time and costs. With the results obtained and experiments conducted in this study, we aim to promote technological sustainability through device recycling to facilitate access to high-performance computing in key areas such as research, industry, and education.

1. Introduction

Processing large amounts of data has become a significant challenge for companies, government institutions, universities, and academic groups that need to solve problems efficiently and effectively within a very short timeframe. Artificial intelligence (AI) consists of algorithms that imitate intelligence to solve problems using data collected from different sources of information. This technology allows us to transform the way complex problems are solved, ranging from automation to understanding the various languages with which humans communicate [1]. Artificial neural networks (ANNs) [2], machine learning (ML) [3], fuzzy logic [4], evolutionary computing [5], and natural language processing (NLP) [6] are some of the current applications of artificial intelligence. These applications open up the possibility of helping humans explore or solve challenges that arise in everyday life.
Within AI, deep learning (DL) is rapidly gaining reliability in various fields of knowledge, such as computer vision. Many studies have shown that the identification of objects within videos and images provides knowledge for decision-making, imitating the vision of humans [7]. Digital image processing (DIP), through classic methods, allows the application of algorithms to identify patterns or extract characteristics from images or videos. This helps program or develop applications that help people identify certain regions of interest that sometimes go unnoticed. DL, particularly through convolutional neural networks (CNNs), allows us to perform tasks similar to those of DIP but typically with more innovative methods and higher precision than classic DIP methods [8].
Distributed systems (DSs) or high-performance computing (HPC) clusters offer efficient processing of large amounts of data by dividing tasks among multiple computing nodes, thereby reducing time and bottlenecks and effectively increasing resource utilization [9]. DSs play an important role in this research. Their ability to handle large volumes of data and intensive workloads renders them essential in numerous scientific and research fields. An example of these fields is image classification, where this technology fundamentally transforms the techniques and methods implemented in different sectors of society, such as automation, medical diagnostics, precision agriculture, facial recognition, security and surveillance, and autonomous cars, among others.
In recent years, there has been a growing interest in the application of DSs to AI. This is due to the need to process large amounts of data efficiently and achieve significantly lower model training times.
In the field of image classification, this convergence of DSs and AI is especially relevant since techniques, algorithms, and CNNs are used to identify patterns and objects in images, such as people, plants, stairs, animals, and many other objects.
To implement DSs in applied AI with a focus on image classification, distributed deep learning (DDL) tools and frameworks are used. These DDL frameworks play a fundamental role, allowing researchers and developers to harness the power of DSs to train models faster and handle massive datasets, which is essential for high-performance AI applications [10].
To accelerate the processing of AI models, Graphics Processing Units (GPUs) are used. These units are specialized hardware developed to solve complex problems using multi-core technology. This is in contrast to using a conventional computer that only uses Central Processing Units (CPUs) and often requires considerable time to resolve these same problems. In summary, GPUs stand out for their ability to significantly speed up processing compared with conventional CPUs [11]. The synergy between GPUs and DDL frameworks is essential to fully realize the potential of AI, allowing faster processing, efficient management of large amounts of data, and significant advances in research and development in various applications [12].
This study combines the efficiency of DSs with the advanced analytical capabilities of AI to verify the usefulness of HPC infrastructure for image processing and classification. The main objective of this paper is to compare the processing efficiency and image classification accuracy of a low-scalability DS-HPC (LSDS-HPC) infrastructure composed of three nodes equipped with NVIDIA GTX 1050ti graphics cards. The main contributions of this work are as follows:
  • It demonstrates how an LSDS-HPC setup with recycled hardware can maximize the efficiency and accuracy of image classification.
  • It promotes the reuse of devices considered obsolete for complex tasks.
  • The results obtained aim to facilitate research in various areas, opting for the use of low-scalability distributed systems when a large budget is not available to acquire specific equipment for high-performance computing.
  • It allows researchers and scientists to choose a distributed training framework for their distributed deep learning models. This highlights the importance of reusing obsolete hardware to reduce e-waste in computer labs, research centers, companies, and industries.
  • Finally, empirical speedups across different frameworks and infrastructures are presented to reinforce the optimal choice of the distributed training framework.

2. Related Works

In recent years, various strategies have been explored to accelerate the training process of DL models. One of the most widely accepted approaches is parallel computing. Numerous studies have shown that incorporating HPC into AI significantly reduces training times and yields more accurate results, especially when handling large-scale data. In medical investigations, HPC is very useful, as shown in the work of [13], in which a modular pipeline technique is presented to detect tumor regions in digital specimens of breast lymph nodes using DL models. They showed that this technique is applicable to both local machines and HPC resources using containers. Additionally, they mentioned that distributed training was carried out using four NVIDIA K80 GPUs and a single NVIDIA V100 GPU. In the field of hyperspectral remote sensing, Plaza et al. [14] conducted a study that addressed the computational challenges that arise in hyperspectral remote sensing applications due to the high dimensionality of the data generated by modern sensors. This article explores HPC techniques to improve the efficiency and accuracy of the processing and analysis of these data. Exact sciences, such as physics, have also greatly benefited from the use of HPC. For example, the work by Florin Pop [15] analyzed the existing computational methods for high-energy physics (HEP) from two perspectives: numerical methods and HPC. This article highlights the importance of stable simulations and numerical methods for high-precision analysis, experimental validation, and visualization in HEP. Some of the techniques mentioned and simulated were Monte Carlo, Markovian Monte Carlo, the overflow method in particle physics, kernel estimation in HEP, and random matrix theory. Furthermore, within mathematics, Singh et al. [16] highlighted the importance of implementing HPC to solve complex mathematical problems that would be intractable using traditional methods. Beyond infrastructure-level comparisons, it is important to consider the evolution of deep learning applications in emerging fields that demand computational efficiency and adaptability. For instance, Wang et al. [17] presented a novel self-supervised graph contrastive learning method for structured data, demonstrating how deep learning models can be optimized using positive sample selection strategies. Specifically, in the performance evaluation of DSs, multiple studies have implemented DDL frameworks to reinforce the theory that using HPC improves the accuracy of AI task results. Table 1 lists the main studies that inspired this work.
Compared to the studies mentioned in Table 1, the approach presented in this study offers several advantages that make it stand out. While most related works are based on high-end HPC infrastructures with large-scale distributed systems (often exceeding four nodes) [9,20,21], our study demonstrates that it is feasible to achieve competitive training performance and acceptable validation metrics using a low-scalability distributed system composed of outdated hardware. This significantly reduces the cost barrier and promotes sustainability by repurposing existing hardware for AI tasks, which are critical to the needs of today. However, a notable limitation is the limited generalization to larger-scale applications, as our infrastructure may not support the computational demands of more complex models or large-scale datasets. Furthermore, while many related studies focus exclusively on processing time or model accuracy, this study offers a balanced assessment that includes empirical speedup, model performance on datasets from different application areas, and hardware configuration limitations. Setting up an LSDS-HPC is difficult due to the incompatibility of current libraries with obsolete hardware. These considerations position our approach as a cost-effective alternative for academic and resource-constrained settings.

3. Materials and Methods

This section presents the proposed methodology for measuring processing time and accuracy in image classification using three datasets belonging to different areas of applied AI. The experiments utilized Python libraries that allow the application of Distributed Data Parallel (DDP) processing, specifically DDP PyTorch and Distributed TensorFlow (Figure 1).

3.1. Data Acquisition

For the experiments, three datasets were taken into account, which have very different characteristics and applications. First, the CIFAR-100 (Canadian Institute for Advanced Research 100) dataset is a widely used dataset in the computer vision community, especially within AI, for evaluating image classification models. It was introduced by Krizhevsky et al. [22] in 2009 and is composed of 60,000 color images, each 32 × 32 pixels in size and divided into 100 different classes. Each class contains 600 images. In addition, the 100 classes are organized into 20 superclasses, grouping similar categories according to their shared characteristics.
Unlike its simplified version, CIFAR-10, which contains only 10 categories, CIFAR-100 offers a greater challenge due to its diversity of classes and the similarity between some categories, making it a key benchmark for evaluating DL models in classification tasks.
The second dataset used is called 38-Cloud, which contains 38 Landsat 8 scene images and their manually extracted pixel-level ground truths for cloud detection. The entire images of these scenes were cropped into multiple 384 × 384 patches to be suitable for DL-based semantic segmentation algorithms. There are 8400 patches for training and 9201 patches for testing. Each patch has 4 corresponding spectral channels: red (band 4), green (band 3), blue (band 2), and near-infrared (band 5). Unlike other computer vision images, these channels are not combined. Instead, they are located in their corresponding directories.
The 38-Cloud dataset was introduced in [23], but it is a modification of the dataset in [24]. All data were prepared at the Laboratory for Robotics Vision (LRV), School of Engineering Science at Simon Fraser University, Burnaby, BC, Canada.
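Because the spectral bands are stored separately, a typical preprocessing step is to stack them into a single multi-channel array before feeding them to a network. The short sketch below illustrates this idea in Python; the directory layout, file naming, and the helper load_patch are hypothetical placeholders rather than the dataset's actual structure.

import numpy as np
from PIL import Image

BANDS = ["red", "green", "blue", "nir"]   # bands 4, 3, 2, and 5, respectively

def load_patch(patch_name: str, root: str = "38-Cloud/train") -> np.ndarray:
    # Read the four single-band patch images and stack them channel-wise.
    channels = [
        np.array(Image.open(f"{root}/{band}/{band}_{patch_name}.TIF"))
        for band in BANDS
    ]
    return np.stack(channels, axis=-1)     # shape: (384, 384, 4)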
Finally, the Monkeypox Skin Lesion Dataset (MSLD) was proposed by Ahsan et al. [25] and specifically designed for the detection and classification of skin lesions caused by monkeypox. This dataset was created to provide an accessible and reliable database for the development and evaluation of DL models in the automated diagnosis of this disease.
The dataset consists of images of skin lesions categorized into different classes, including cases of monkeypox, as well as images of similar skin diseases, such as smallpox or measles, and healthy skin. The diversity in the images of the MSLD allows us to evaluate the ability of computer vision models to differentiate between visually similar conditions, which is crucial for medical applications based on AI.
These three datasets illustrate the broad applicability of AI in multiple fields, from general image classification and earth observation to medical diagnosis. The use of these systems in this study highlights the importance of implementing HPC or DS infrastructures, even with low scalability or aging hardware, to meet the growing computational demands in disciplines such as remote sensing, computer vision, and health sciences.

3.2. Description of Distributed Systems

A DS or HPC cluster is a collection of computers or worker nodes interconnected in a local area network (LAN) that provides services and applications to different types of users. The computers that make up the DS may share the same software and hardware (homogeneous systems) or not (heterogeneous systems). When discussing DSs, it is important to consider the different architectures, algorithms, paradigms, and applications. To choose among them, it is necessary to take into account the type of computing problem to be solved and the results to be obtained. The most common architectures in use today are cloud computing (CC), fog computing (FC), edge computing (EC), and peer-to-peer (P2P) networks [26].
CC refers to the management of services without physical servers; to access services, users log in to remote servers through the Internet. Typically, CC is categorized into several service models: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). These service models can be deployed using public, private, and hybrid clouds [27].
FC expands the capabilities of cloud computing by focusing on the operations and processes carried out by computer networks, emphasizing low latency, real-time processing, scalability, and security, which are elements that allow the proper functioning of the applications and services offered through the network. The term FC was proposed by Cisco as a new paradigm [28].
EC refers to the connection between end-users and specific applications, such as autonomous vehicles, sensors, cameras, and other devices that send data to the cloud through networks and data centers. Finally, the P2P network paradigm is not new; it is a type of distributed network in which individual users or nodes act as both servers and clients, sharing resources and data directly and bypassing central servers. Common applications include file sharing, cryptocurrencies, the Internet of Things (IoT), instant messaging, and content delivery [29]. All these architectures share the feature that they can run multiple processes together to achieve various computing tasks. These processes can be performed on different machines across the network.
It is also important to mention that running multiple tasks in a DS requires algorithms and protocols specifically designed for the aforementioned environments. These algorithms play a crucial role in optimizing resource allocation, data management, and task coordination across distributed architectures. They ensure efficient and reliable operation, enabling the system to maximize its distributed resources while maintaining data consistency, security, and scalability.
In CC, for example, distributed algorithms are essential for load balancing, auto-scaling, and ensuring the high availability of services across data centers [27]. P2P networks rely on distributed algorithms for decentralized data retrieval and resource sharing [30]. In EC and FC, real-time data processing, low latency, and efficient task scheduling algorithms are vital for timely decision-making and responsive services [31].
These distributed algorithms enable the efficient utilization of computing power and distributed resources on the network, making DSs capable of handling various applications and use cases, from data analysis and content delivery to IoT management and edge processing. One of the most popular algorithms is MapReduce [32], a programming model and associated data processing framework developed by Google. It is designed to process and generate large datasets that can be distributed across a DS. MapReduce provides a way to parallelize and distribute data processing tasks, making it suitable for big-data analysis and large-scale data processing. Additionally, this algorithm uses data parallelism as a fundamental concept to efficiently process large datasets in a DS environment [33]. Data parallelism (DP) is a parallel computing paradigm in which the same operation is applied to multiple subsets of data simultaneously [34].
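As a simple illustration of the map/reduce idea and of data parallelism, the following sketch splits a toy dataset into shards, applies the same operation to each shard in parallel with Python's multiprocessing module, and then reduces the partial results; the workload and shard count are illustrative only.

from multiprocessing import Pool

def partial_sum(chunk):
    # "Map" step: every worker applies the same operation to its own shard.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    shards = [data[i::n_workers] for i in range(n_workers)]   # one shard per worker

    with Pool(processes=n_workers) as pool:
        partials = pool.map(partial_sum, shards)

    # "Reduce" step: aggregate the partial results into the final answer.
    print(sum(partials))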

3.2.1. Data Parallelism

DP is a technique that enables the distributed training of neural networks by partitioning the dataset across multiple DS devices, such as CPUs or GPUs [34]. Data parallelization can be performed in two ways: asynchronous and synchronous. The choice of technique depends on how the processing is performed, as each GPU contributes what it has learned to the overall training in the form of gradients. Once each GPU calculates its gradients, they are summed or averaged in an aggregation operation that allows all the model parameters to be updated [9].
Figure 2 represents a parameter server (PS) infrastructure that consists of multiple worker nodes (which, in this case, can either be GPUs or CPUs) running parallel training processes, where each node computes gradients based on a local copy of the neural network model and then communicates those gradients to a central unit that averages and updates the global model parameters.
In a PS architecture, the training workload is distributed across multiple worker nodes, each of which performs forward and backward propagation on a subset of the data. These workers independently compute the gradients of the loss function with respect to the model parameters [9]. The computed gradients are then sent to one or more PSs, which are responsible for aggregating the gradients (e.g., by averaging) and updating the global parameters accordingly. Once updated, the new parameters are broadcast back to the workers, allowing them to synchronize their local models. This architecture enables scalable and efficient distributed training, especially in scenarios involving large datasets and deep neural networks (DNNs), such as those shown in Figure 2: 38-Cloud for remote sensing, MSLD for lesion detection, and CIFAR-100 for general image classification.
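The following minimal simulation (NumPy only) mirrors the synchronous parameter-server scheme described above: each worker computes a gradient on its own data shard, and the server averages the gradients before updating the global parameters. The linear model, loss, and learning rate are illustrative placeholders, not the ResNet50 configuration used in the experiments.

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                      # global model parameters held by the server
X = rng.normal(size=(300, 3))               # toy dataset
y = X @ np.array([1.0, -2.0, 0.5])          # targets of a linear regression problem

shards = np.array_split(np.arange(len(X)), 3)   # one shard per worker (GPU/CPU)
lr = 0.1

for step in range(100):
    grads = []
    for idx in shards:                      # each "worker" uses its local copy of w
        pred = X[idx] @ w
        grad = 2 * X[idx].T @ (pred - y[idx]) / len(idx)   # MSE gradient on the shard
        grads.append(grad)
    w -= lr * np.mean(grads, axis=0)        # server averages gradients and updates globally

print(w)   # approaches [1.0, -2.0, 0.5]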

3.2.2. Model Parallelism

Model parallelism (MP) is a widely employed method in distributed training that addresses the challenge posed by the limitations of a single GPU memory capacity [35]. In this approach, DL models are partitioned across multiple devices, mitigating the constraints on the maximum model size that can be accommodated. By strategically distributing model components across different devices, model parallelism enhances scalability and enables the training of larger models that would otherwise be restricted by individual GPU memory constraints. This technique is particularly valuable for tasks such as computer vision and NLP, where increasing the model size contributes to improved accuracy [36].
Figure 3 illustrates an MP strategy in which different layers of a DNN are distributed across multiple compute nodes, where each node is responsible for processing a specific portion of the model (e.g., a hidden layer), thereby enabling the training of large-scale models that do not entirely fit into the memory of a single device.
MP is a distributed training strategy in which the architecture of a neural network is partitioned across several devices or compute nodes. Instead of replicating the entire model in each worker, as in data parallelism, this approach assigns different layers or components of the model to separate nodes [35]. For instance, one node handles the first hidden layer, another handles the second, and so forth, as shown in Figure 3. This allows extremely large models to be trained even when their full architecture exceeds the memory limits of a single GPU and CPU.
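A minimal PyTorch sketch of this layer-wise split is shown below; the layer sizes are illustrative, and the code assumes two CUDA devices when available, falling back to the CPU otherwise so that the example remains runnable.

import torch
import torch.nn as nn

dev0 = torch.device("cuda:0") if torch.cuda.device_count() >= 2 else torch.device("cpu")
dev1 = torch.device("cuda:1") if torch.cuda.device_count() >= 2 else torch.device("cpu")

class SplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First block (e.g., the first hidden layer) lives on device 0 ...
        self.part1 = nn.Sequential(nn.Linear(512, 256), nn.ReLU()).to(dev0)
        # ... and the second block (next layer plus output) lives on device 1.
        self.part2 = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10)).to(dev1)

    def forward(self, x):
        x = self.part1(x.to(dev0))
        # Activations cross the partition boundary between devices.
        return self.part2(x.to(dev1))

model = SplitNet()
out = model(torch.randn(8, 512))
print(out.shape)   # torch.Size([8, 10])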
Multiple factors determine the memory requirements for training a DL model, including the model architecture, batch size, and memory capacity of the devices, whether CPUs, GPUs, or Tensor Processing Units (TPUs) [37]. This is important because distributed training cannot be performed if the devices used do not provide sufficient Random Access Memory (RAM) capacity.

3.3. Distributed Deep Learning

DDL is a methodology that plays a fundamental role in training large datasets, a task that often requires harnessing the computational power of multiple CPUs and GPUs. This approach is especially crucial when dealing with large quantities of images. By distributing the workload across multiple processing units, DDL enables the simultaneous training of CNN models or architectures, significantly reducing the time required for their convergence [38]. The scalability offered by DDL not only addresses the computational or processing demands posed by large datasets but also improves the training efficiency and accuracy. Additionally, the methodology allows practitioners to leverage the capabilities of parallel computing, unlocking the potential for accelerated training of CNN models and improved performance when addressing complex ML tasks. To distribute CNN models among devices, DDL employs synchronization via communication mechanisms. This crucial aspect of DDL ensures that model parameters and updates are consistently shared and coordinated across distributed devices during the training process, thereby maintaining the integrity and accuracy of the CNN model formed by complex features [39].
DDL leverages various communication methods, such as message passing, to facilitate the exchange of information between devices. In addition to message passing, DDL often incorporates collective communication operations like all-reduce. The all-reduce algorithm is particularly valuable for DDL as it combines information across multiple devices efficiently. This operation helps synchronize the gradients and model parameters among devices by performing a reduction operation, such as summation or averaging, across all participating nodes [40]. Synchronization points are strategically placed to harmonize the training progress across different processing units to achieve a unified understanding of the model state and collectively move toward convergence. Additionally, effective communication in DDL enables task parallelization, allowing devices to collaboratively handle the parts of the dataset. This parallel processing not only speeds up training but also optimizes resource utilization.
To implement both message passing and all-reduce in the context of DDL, libraries such as the Message Passing Interface (MPI) and NVIDIA Collective Communication Library (NCCL) are commonly utilized to distribute processes across devices [41]. These libraries provide essential functionalities for communication, synchronization, and coordination in distributed computing systems.
MPI is a widely adopted standard for message-passing in distributed computing. It allows processes to communicate with each other, making it suitable for DDL implementations that involve distributed training across multiple CPU nodes. The MPI enables the exchange of messages among different processes, thereby facilitating seamless communication and synchronization. MPI supports collective communication operations, such as broadcast, scatter, gather, and reduce, which enable coordination and data exchange among a group of processes [42].
NCCL was specifically developed to optimize collective communication operations in environments with multiple NVIDIA GPUs. Its design is geared toward maximizing performance in DDL implementations that leverage the parallel processing capabilities of GPUs. NCCL offers highly efficient implementations of collective communication primitives, such as all-reduce, broadcast, and reduce, which are critical for synchronizing model parameters and gradients across GPU devices, thus facilitating efficient scalability in heterogeneous clusters [9].
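The sketch below shows the all-reduce primitive with PyTorch's torch.distributed and the Gloo backend. For readability, it initializes a single-process group (world_size = 1); in an actual cluster, the script would be launched on every node (e.g., with torchrun) so that each rank contributes its local gradient to the averaged result.

import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",                      # CPU-friendly backend; NCCL is typical for GPUs
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

local_grad = torch.tensor([1.0, 2.0, 3.0])          # gradient computed by this worker
dist.all_reduce(local_grad, op=dist.ReduceOp.SUM)   # sum the tensors of all ranks
local_grad /= dist.get_world_size()                 # convert the sum into an average
print(local_grad)

dist.destroy_process_group()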
Within parallel computing, Amdahl’s Law is a fundamental principle that helps define and identify the limits of applications or the scalability of distributed computing infrastructures [43]. Because DDL is a parallel computing application, it also adapts to Amdahl’s Law, which provides insights into the potential speedup achievable through parallelization. Amdahl’s Law is particularly relevant in understanding the impact of parallelizing only a portion of a computation.
Amdahl’s Law is stated mathematically as follows:
$\mathrm{Speedup} = \dfrac{1}{F_s + \frac{F_p}{P}}$   (1)
where
  • Speedup is the overall speedup achieved by parallelizing the computation;
  • $F_s$ is the fraction of the computation that must be executed sequentially;
  • $F_p$ is the fraction of the computation that can be executed in parallel;
  • $P$ is the number of processors or processing units.
Since the values of $F_s$ and $F_p$, as well as the number of processors or processing units $P$, are not available in this study, Equation (1) can be approximated by the empirical speedup given by the following equation:
$\mathrm{Speedup} = \dfrac{T_s}{T_p}$   (2)
where
  • $T_s$ is the execution time without parallelism;
  • $T_p$ is the execution time with a distributed strategy.
This empirical speedup reflects the actual acceleration effect of distributed parallel training [20].
The key implication of Amdahl’s Law for DDL is that the speedup of a distributed computing application is limited by the sequential fraction of the computation [44]. In the context of DDL, a sequential fraction may include tasks or processes that cannot be easily parallelized or exhibit diminishing returns when parallelized.
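As a worked example of Equations (1) and (2), the following snippet computes the theoretical and empirical speedups; the sequential fraction, parallel fraction, and timings used here are illustrative values, not measurements from the experiments reported later.

def amdahl_speedup(f_s: float, f_p: float, p: int) -> float:
    """Theoretical speedup for a workload with sequential fraction f_s and
    parallel fraction f_p (f_s + f_p = 1) on p processing units."""
    return 1.0 / (f_s + f_p / p)

def empirical_speedup(t_s: float, t_p: float) -> float:
    """Measured speedup: single-node time divided by distributed time."""
    return t_s / t_p

print(amdahl_speedup(f_s=0.1, f_p=0.9, p=3))      # 2.5x upper bound on 3 GPUs
print(empirical_speedup(t_s=7200.0, t_p=4300.0))  # ~1.67x from hypothetical timings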

3.3.1. Distributed Training Frameworks

To carry out the training process in a distributed manner, especially in the context of DDL, specialized applications or frameworks that facilitate coordination and communication between multiple computing devices must be used. These frameworks are designed to distribute the workload, synchronize model parameters, and efficiently use resources in a distributed infrastructure to achieve the best results in the shortest time.
PyTorch and TensorFlow share similarities that support distributed training, including efficient communication, the ability to parallelize processes across single or multiple computing nodes, and integration and compatibility for installation in a distributed computing infrastructure.
Specifically, to perform distributed training, TensorFlow uses various strategies, such as MirroredStrategy, TPUStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy, and ParameterServerStrategy. These strategies are designed to perform data-parallel training, typically implementing synchronous training that supports all-reduce operations through the parameter server architecture [45].
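A minimal sketch of MultiWorkerMirroredStrategy is given below. On a real cluster, each node first exports a TF_CONFIG environment variable describing the worker list and its own index; without it, the strategy falls back to a single local worker, so the example remains runnable on one machine. The toy model is only an illustration, not the ResNet50 configuration used later.

# On each node, TF_CONFIG would look like (addresses are placeholders):
#   {"cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
#    "task": {"type": "worker", "index": 0}}
import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Variables created inside the strategy scope are replicated on every worker
# and kept in sync with all-reduce operations after each optimization step.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.summary()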
Similarly, PyTorch uses libraries that implement distributed strategies such as Distributed Data Parallel (DDP), which enables data parallelism by allowing multiple batches of data to be processed simultaneously across multiple devices, with the goal of maximizing performance. Similar to TensorFlow strategies, DDP supports all-reduce operations to compute gradients and synchronize values across devices [46].
In this study, the DDP and MultiWorkerMirroredStrategy strategies from PyTorch and TensorFlow, respectively, are used to evaluate performance on the three datasets described above, allowing an analysis of which library is better suited to an LSDS-HPC infrastructure such as the one proposed. Additionally, it is important to mention that in this implementation, both PyTorch’s DDP and TensorFlow’s MultiWorkerMirroredStrategy lack fault-tolerance settings. This implies that if a node fails during training, the resulting gradients may be lost. To avoid this, model checkpoints are typically saved every certain number of iterations or epochs during experimentation.
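The following sketch outlines a PyTorch DDP training loop with the periodic checkpointing mentioned above as a mitigation for the missing fault tolerance. The model, data, checkpoint interval, and file names are illustrative; a real multi-node run would be launched with torchrun so that the rank and world size are provided to every process.

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun exports RANK/WORLD_SIZE and the master address; default to a
    # single local process so the sketch also runs standalone.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend="gloo", init_method="env://",
                            rank=rank, world_size=world_size)

    model = DDP(nn.Linear(32, 10))                 # gradients are all-reduced automatically
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(5):
        inputs = torch.randn(64, 32)               # placeholder batch
        targets = torch.randint(0, 10, (64,))
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()                            # DDP synchronizes gradients here
        optimizer.step()

        # Save a checkpoint every 2 epochs on rank 0 so progress survives a node failure.
        if epoch % 2 == 0 and rank == 0:
            torch.save(model.module.state_dict(), f"checkpoint_epoch{epoch}.pt")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()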

3.3.2. Distributed System Environment Configuration

To conduct the experiments, a comparison was also made with a high-performance personal computer and the Google Colab platform.
Google Colab is a hosted web service that uses the Jupyter Notebook as its development environment. It requires no manual configuration; everything is already installed in the web service. This service offers free access, making it widely used for educational and scientific research.
Colab is a particularly well-suited solution for self-study, data science, and other scientific applications. Like many cloud services, it has limitations: specific end-user applications often require more processing power than is freely available, which requires paying for additional compute units [47]. Google Colab offers three plans with different levels of access to computing resources: Colab (free), Colab Pro, and Colab Pro+. While the free plan does not grant access to compute units or premium GPUs, Colab Pro and Pro+ offer 100 and 500 compute units per month, respectively, with access to premium GPUs, increased memory capacity, terminal use with a connected virtual machine, and an optimized notebook experience. The monthly costs range from USD 8.33–9.99 for Colab Pro and USD 41.66–49.99 for Colab Pro+ [48].
This information is relevant for those who do not have the capacity to purchase HPC devices or access HPC infrastructure.
Table 2 describes the characteristics of the GPU devices used in the performance tests for processing the datasets.
As can be seen, the LSDS-HPC incorporates NVIDIA GeForce GTX 1050 Ti graphics cards (manufactured by Gigabyte in Taipei, Taiwan) and an Intel Core i7-7700K processor, components that can currently be considered obsolete for high-performance tasks in DL and scientific computing. The GTX 1050ti, launched in October 2016, is based on the Pascal architecture and presents significant limitations in terms of graphics memory (4 GB GDDR5) and processing power compared with modern GPUs optimized for intensive parallel workloads [49]. Meanwhile, the Intel Core i7-7700K processor, launched in January 2017 as part of Intel’s 7th generation (Kaby Lake), lacks support for newer technologies such as PCIe 4.0, as well as significant improvements in energy efficiency and additional cores common in current generations. These limitations position them as outdated hardware compared with contemporary standards in AI-based scientific research [50].

3.3.3. CNN ResNet50 Model Description

The CNN model used in the ResNet50 experiments consisted of 50 layers and was based on the use of residual blocks. Residual networks (ResNets) were introduced as a solution to the problem of performance degradation in extremely deep neural networks. Unlike traditional architectures, ResNets incorporate residual blocks that allow the learning of residual mapping functions through skip connections, thus facilitating gradient propagation and efficient training of networks with many layers. In particular, ResNet50 allows for higher accuracy without incurring performance losses due to increasing depth [51].
The three datasets were evaluated using ResNet50 (Figure 4) to test the performance of the implemented computational environments. This CNN model was chosen due to its deep layer count and to verify whether GTX 1050ti graphics cards could handle the batch size of images, as these cards are limited by RAM.
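As an illustration, the snippet below shows how a ResNet50 instance can be created with a recent torchvision release and its classification head resized to the class count of a target dataset (CIFAR-100 in this hypothetical example); how the 38-Cloud spectral bands are presented to the network is not shown and depends on the data pipeline.

import torch
import torch.nn as nn
from torchvision import models

num_classes = 100                                # e.g., CIFAR-100; adjust per dataset
model = models.resnet50(weights=None)            # 50-layer residual network, trained from scratch
model.fc = nn.Linear(model.fc.in_features, num_classes)   # resize the classification head

x = torch.randn(4, 3, 32, 32)                    # CIFAR-100-sized RGB batch
print(model(x).shape)                            # torch.Size([4, 100])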

3.3.4. Hyperparameter Definition and Model Evaluation Metrics

To ensure proper training of the ResNet50 convolutional neural network model, it is essential to define the hyperparameters. In the field of AI, these hyperparameters determine the scalability, generalization capacity, and efficiency of the models. The correct choice of these values is crucial to ensure that the model learns optimally, avoiding both overfitting and underfitting and striking a balance between accuracy and computational performance. Table 3 shows the hyperparameter values used in the low-scalability DS-HPC to train the CNN model ResNet50, which was applied to the classification tasks with the 38-Cloud, MSLD, and CIFAR-100 datasets.
The Stochastic Gradient Descent (SGD) optimizer has been shown to offer good convergence with low computational resource consumption in HPC environments, which is why it was chosen for the experiments conducted in this study. This optimizer, together with the cross-entropy loss function, allows for efficient updates of the model weights in multiclass classification tasks.
The input sizes varied depending on the dataset: 192 × 192 pixels for 38-Cloud, 244 × 244 pixels for MSLD, and 32 × 32 pixels for CIFAR-100, ensuring adequate resolution based on the characteristics of each dataset. A constant number of 50 epochs was set for the three experiments; however, this number of epochs applies to both the Workstation and Google Colab, since only one GPU was used for training the models on these datasets.
In the case of the LSDS-HPC infrastructure, the number of epochs was configured based on the number of GPUs used. The ratio for the experiments with the Workstation and Google Colab was 50/1 (50 epochs divided by 1 GPU), whereas in the LSDS-HPC infrastructure the ratio is 50/3 (50 epochs divided by 3 GPUs) due to the three NVIDIA GTX 1050 Ti cards used. On the LSDS-HPC infrastructure, the decision to train the models for 17 epochs with DDP and MultiWorkerMirroredStrategy was based on the principle of scaling the learning rate and adjusting the number of steps per epoch to ensure an equivalent total training amount that allows for convergence similar to that of a single GPU. This strategy is crucial in DP, where each GPU (worker) processes a portion of the total batch size. As the number of GPUs increases, the effective batch size that the model processes in a single optimization step also increases, justifying the need for these adjustments to maintain the stability and efficiency of the DDL [52]. The batch size was adapted to the volume and complexity of the dataset, with values of 8, 32, and 128 for the 38-Cloud, MSLD, and CIFAR-100 datasets, respectively. Finally, the learning rate was set to 0.003 for 38-Cloud and 0.001 for MSLD and CIFAR-100, seeking a balance between convergence speed and stability during training. It is worth mentioning that these settings were applied during training on both the Workstation and Google Colab environments.
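For reference, the per-dataset settings described above can be summarized in a small configuration sketch; the values mirror those reported in this section, while the PyTorch objects (SGD optimizer and cross-entropy loss) and the helper build_training_setup are illustrative assumptions.

import torch.nn as nn
import torch.optim as optim

CONFIGS = {
    "38-Cloud":  {"input_size": 192, "batch_size": 8,   "lr": 0.003},
    "MSLD":      {"input_size": 244, "batch_size": 32,  "lr": 0.001},
    "CIFAR-100": {"input_size": 32,  "batch_size": 128, "lr": 0.001},
}
EPOCHS_SINGLE_GPU = 50            # Workstation and Google Colab (one GPU)
EPOCHS_LSDS_HPC = round(50 / 3)   # ~17 epochs across the three GTX 1050 Ti nodes

def build_training_setup(model, dataset: str):
    cfg = CONFIGS[dataset]
    optimizer = optim.SGD(model.parameters(), lr=cfg["lr"])
    criterion = nn.CrossEntropyLoss()
    return optimizer, criterion, cfg["batch_size"], cfg["input_size"]

# Example: optimizer, criterion, batch_size, size = build_training_setup(resnet50, "MSLD")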
Accuracy was used as the primary validation metric, which represents the percentage of correct predictions made by the model relative to the total number of samples evaluated. This metric is widely used in classification tasks, as it provides a clear and direct view of the model’s performance on data not seen during training. In this study, accuracy allowed us to objectively compare the performance of the ResNet50 model trained under different hyperparameters and dataset configurations, evaluating its generalization capacity in each of the experimental environments. The results obtained with this metric will allow us to identify the potential of low-scalability computing infrastructures for the efficient resolution of complex artificial intelligence tasks.
Both the hyperparameters and validation metric accuracy were defined based on recommendations in the literature, seeking efficient model performance without compromising accuracy on resource-constrained hardware infrastructures. In summary, to ensure fair model convergence in experiments on each infrastructure, the batch size and learning rate were tuned per dataset based on its complexity, input resolution, and empirical performance in preliminary training runs. These hyperparameters were not tuned for each hardware platform, as our goal was to maintain consistency in the training conditions for each dataset. While this approach optimizes the learning dynamics, it introduces variations in the computational load across platforms, which may influence the reported speedup and accuracy values. Furthermore, given the limited memory available in the LSDS-HPC infrastructure, explicit gradient stacking policies were not applied. This allowed us to maintain a consistent batch size across all platforms for each dataset, avoid asynchronous optimizer behavior, and simplify cross-platform comparisons.

4. Results

4.1. Processing Time

The processing time is one of the most critical variables that define the efficiency of computing systems. The faster a problem is solved, the more effective the decision-making process. In AI applications, where real-time decision-making is often a key requirement, this factor is crucial.
Table 4 presents the processing or training times recorded on the three computing infrastructures used in this study for image classification using the three selected datasets. This comparison provides valuable insights into the relative performance of the distributed training strategies implemented with DDP PyTorch and Distributed TensorFlow MultiWorkerMirroredStrategy, evaluated using different hardware configurations.
As shown, in most experiments, the PyTorch DDP implementation achieved the best processing times compared to Distributed TensorFlow MultiWorkerMirroredStrategy, especially in environments where the number of available GPUs was limited.
Specifically, on the LSDS-HPC infrastructure proposed in this study, the 38-Cloud and MSLD datasets recorded the lowest training times. However, for the CIFAR-100 dataset, the Workstation achieved the best performance in terms of processing time. As mentioned in Section 3.3.4, one key reason for the lower processing times observed in the LSDS-HPC infrastructure is that training was performed over 17 epochs, unlike Google Colab and the Workstation, where 50 epochs were used for training. Although this difference has been previously justified, it is worth emphasizing again, as it directly influences performance outcomes and reflects the trade-offs inherent to data parallelism operations in distributed training settings. Furthermore, it is widely accepted and verified in our experiments that running all 50 epochs of each dataset on the LSDS-HPC significantly increases the processing time, as the training is executed as if on a single node. The synchronization provided by DDP and MultiWorkerMirroredStrategy helps stabilize the training process across multiple nodes and improve convergence, thus enforcing the basic principles of data parallelism. An important limitation identified in the experiment was that when loading data from the 38-Cloud dataset, memory overflow occurred, and the libraries threw errors. To address this, a resize factor was applied to the dataset so that each batch fit in memory, allowing the distributed training to run smoothly on the LSDS-HPC.
Empirical speedup can be calculated using Equation (2), where $T_s$ represents the processing time (in seconds) for training on a single node and $T_p$ is the processing time (in seconds) using multiple nodes or the DS.
As shown in Table 5, the empirical speedup highlights how the training performance improves (or degrades) when transitioning from a single-node environment (Workstation or Google Colab) to a multi-node infrastructure (LSDS-HPC).
The highest speedup was obtained when training ResNet50 with the CIFAR-100 dataset on Google Colaboratory using Distributed TensorFlow, reaching a 6.74× increase in performance. In contrast, the lowest speedup (0.03×) was also observed on Google Colab using PyTorch DDP for the same dataset, suggesting that the performance gains vary significantly depending on both the framework and the specific workload characteristics.
These results demonstrate that, while the speedup metric is useful for evaluating the efficiency of a DS-HPC infrastructure, it should not be the sole criterion, particularly in the context of DDL experiments. For a comprehensive performance assessment, validation metrics such as accuracy and loss must also be considered, as they provide a more complete view of how well the models generalize and learn in different environments.

4.2. Performance Evaluation of Distributed Frameworks

The validation metrics used to evaluate the performance of the ResNet50 model in the computing environments were accuracy and loss. While it is possible to incorporate additional metrics, such as recall, F1 score, specificity, or precision, in this study, we chose to focus solely on accuracy and loss to avoid making the study overly complex and to focus on analyzing the behavior of the model throughout the training epochs for each dataset. Table 6 presents the results obtained for these metrics when using the PyTorch DDP and distributed TensorFlow strategies.
For the 38-Cloud dataset, the best accuracy score was obtained using PyTorch DDP on Google Colab, with accuracy values above 97%. Meanwhile, for the MSLD, the best accuracy score was obtained using Distributed TensorFlow on Google Colab, with accuracy values close to 84%. Finally, for CIFAR-100, the best accuracy score was obtained on the Workstation when implementing PyTorch's DDP, with accuracy values close to 89%.
Furthermore, as shown in Table 6, specifically for the CIFAR-100 dataset, the accuracy and loss results are both very poor when implementing Distributed TensorFlow in all three computing environments. Although the code was equivalent across libraries, this abnormal behavior was observed only for this dataset. This could be due to various factors, such as the implementation of data augmentation, the distribution of training and validation data, hyperparameter management, or simply because the dataset is not well suited to DDP or Distributed TensorFlow with the applied configurations. This opens the door to investigating why this situation occurs in the first place.
However, it is not enough to highlight the best scores; in DDL, it is also important to see the behavior or performance of the model when trained. For this, learning curve graphs were also obtained to visually determine whether the best score data were consistent with the model performance for each of the computing infrastructures used.
Learning curves in DL models can help us graphically visualize the training process and identify patterns that can help us properly adjust or configure a CNN model.
Figure 5 shows the learning curves obtained when training the ResNet50 model on the 38-Cloud dataset. The model showed progressive convergence, with a consistently decreasing training loss and validation accuracy reaching values close to 93% in the final epochs using Distributed TensorFlow. Compared to the learning curves obtained for training on the Workstation and Google Colab, the fit is much better with only 17 epochs. However, as mentioned above, the best score was obtained using Google Colab with DDP PyTorch. From the graph, we can see that at least during the first 20 epochs, overfitting occurs, indicating a poor correlation between the training and validation data. This can also be observed with the Distributed TensorFlow implementation in both Google Colab and the Workstation. These results indicate the good capacity of the model to adapt to the spectral and spatial characteristics of the multispectral dataset in the LSDS-HPC infrastructure.
For the MSLD dataset (Figure 6), which represents a small and specialized set of clinical images, the model achieved high training and validation accuracies in a few epochs. The loss quickly stabilized, suggesting the rapid adaptation of the model to the visual characteristics of the dataset. This behavior reinforces the model’s effectiveness in environments with balanced data and restricted domains. The DS-HPC infrastructure proved suitable for fast and efficient training on this type of dataset for both DDP PyTorch and Distributed TensorFlow, since their learning curves are the most correlated. Although the accuracy and loss values were the lowest in the experiments, it demonstrated better agreement in its graphical representation, unlike its training on the Workstation and Google Colab, which, in some cases, showed overfitting between the training and validation data.
In the case of the CIFAR-100 dataset (Figure 7), characterized by high interclass variability and low resolution (32 × 32), a more fluctuating behavior in the validation accuracy was observed, causing overfitting from the first iterations or epochs. Although the training loss was notably reduced, the accuracy stabilized at around 70% using DDP PyTorch, indicating generalization challenges in this dataset. Despite these limitations, the model achieved an acceptable classification capacity considering the complexity of the set, and the low-scalability DS-HPC infrastructure allowed efficient execution owing to the compact size of the images. It is important to note that only the Workstation, Colab, and Distributed TensorFlow results in LSDS-HPC showed considerably poor performance, prompting further investigation into the behavior of this dataset when trained with the ResNet50 architecture.
The results obtained show that the efficiency of distributed training depends not only on the hardware used but also on the parallelism strategy and the dataset processed. In terms of processing time, the LSDS-HPC infrastructure exhibited competitive performance, particularly with the 38-Cloud and MSLD datasets, due in part to the reduced number of training epochs used. However, in the case of CIFAR-100, Workstation outperformed the other platforms, suggesting that there is no single optimal configuration for all scenarios.

5. Discussion

The evaluation of HPC infrastructure for artificial intelligence tasks, specifically using convolutional neural networks (CNNs) for classification, is a widely explored and validated topic. Currently, distributed computing systems are used in large-scale research in areas such as medicine, aerospace, remote sensing, agribusiness, physics, mathematics, and quantum computing. Many of these systems have high processing capacities and are listed in the TOP500 [53], unlike the system proposed in this study, which is poorly scalable and represents less than 1% of the capacity of some of the infrastructures listed in the TOP500.
As mentioned above, the purpose of this study is precisely to demonstrate that, even without access to the computing capabilities offered by high-performance infrastructures, it is possible to leverage the theoretical foundations and tools used in large-scale distributed systems to implement low-scalability solutions. This can be achieved using hardware considered obsolete, which, in many cases, has been discarded by educational institutions or by society in general to make way for new generations of devices, which implies an increase in costs and budgets.
Our experiments show that PyTorch DDP consistently outperformed TensorFlow in terms of processing time on all datasets (Table 4), especially in the LSDS-HPC setting. For example, training ResNet50 on the 38-Cloud dataset using PyTorch DDP was 27% faster than using TensorFlow (3 h 02 min vs. 3 h 37 min). This is consistent with previous studies [18] highlighting PyTorch’s resource efficiency, although our work extends this observation to the context of obsolete hardware. PyTorch’s faster performance can be attributed to its low communication overhead and dynamic computation graph, which are better suited to heterogeneous GPU clusters [9].
However, TensorFlow achieved higher accuracy in specific scenarios (e.g., 93.05% vs. 76.87% for 38-Cloud on LSDS-HPC; see Table 6), suggesting greater robustness in gradient synchronization for complex datasets. This dichotomy is consistent with the findings of [19], who observed that the centralized parameter server architecture in TensorFlow could improve convergence, albeit at the cost of higher latency.
Despite its limitations, the LSDS-HPC infrastructure (composed of three GTX 1050 Ti GPUs) demonstrated empirical speedups of up to 1.68× (MSLD dataset; Table 5) compared to single-GPU environments. This challenges the common belief that outdated GPUs are unsuitable for modern distributed deep learning (DDL) tasks. Notably, these results contrast with those of studies such as that of Sun et al. [10], which emphasize the need for high-end GPUs (e.g., NVIDIA V100) for efficient training. In contrast, our findings validate that low-scalability systems can reduce costs while maintaining acceptable usability, which is crucial for resource-constrained institutions.
However, the results on the CIFAR-100 dataset revealed significant limitations: both frameworks struggled to achieve high accuracy (<70%) and achieved very low speedups (e.g., 0.03× for PyTorch on Google Colab). This is likely due to the high interclass variability and small image size (32 × 32 pixels), which exacerbates communication bottlenecks in distributed environments. This observation is consistent with that reported by Hegde and Usmani [20], who noted that the efficiency of data parallelism decreases if the batch size is not scaled properly to the hardware capacity.
Our study fills a gap in the literature by evaluating distributed training frameworks in low-resource environments, while most previous work [9,18] focused on high-performance clusters. Unlike Graziani et al. [13], who used NVIDIA K80/V100 GPUs for medical image analysis, our experiment achieved comparable accuracy (97% for 38-Cloud) with older GPUs, highlighting the potential of hardware recycling.
In addition, it is important to mention that in previous reports using the 38-Cloud dataset, some images could not be segmented because, in certain snowy regions, the algorithm could not determine whether the image contained clouds or snow. To address this, methods such as VNDHR [54], a novel variational nighttime dehazing framework that uses hybrid regularization to improve perceptual visibility in hazy nighttime scenarios, could be used in such a study to test LSDS-HPC for improved results on 38-Cloud.
Finally, a major limitation of this study is the exclusive use of relatively small datasets, such as CIFAR-100, 38-Cloud, and MSLD, which do not exert significant pressure on the interconnect bandwidth in multi-GPU environments. This restricts the possibility of comprehensively evaluating the scalability and communication efficiency of LSDS-HPC infrastructure. For a more accurate performance assessment in scenarios requiring high communication intensity, it is advisable to incorporate large-scale datasets, such as ImageNet or high-resolution geospatial imagery. These datasets represent a greater demand for inter-GPU transfer and would allow the evaluation to approximate conditions typical of infrastructures listed in the TOP500, provided that the hardware employed is comparable to that used in this study.

6. Conclusions and Future Work

Distributed systems can significantly reduce the processing times for image classification tasks, even when deployed on legacy, low-scalability hardware. The proposed LSDS-HPC infrastructure promotes an accessible and sustainable approach to distributed deep learning (DDL), which is especially valuable for resource-constrained institutions. Clearly, there are limitations, such as scalability and memory capacity, in the proposed infrastructure, but the results obtained will help us generate interest in the scientific community in increasing the scalability of our small distributed system. In future work, it will be essential to analyze the cost–benefit ratio of implementing the LSDS-HPC infrastructure and explore hybrid parallelism strategies (data and model) to further improve scalability and memory efficiency in small-scale environments.

Author Contributions

Conceptualization, M.R.-E. and M.d.J.L.-M.; methodology, M.R.-E. and M.d.J.L.-M.; validation, M.R.-E., M.d.J.L.-M., D.A.-E. and C.A.O.-O.; formal analysis, C.A.O.-O.; investigation, M.R.-E., M.d.J.L.-M. and D.A.-E.; resources, C.A.O.-O. and L.O.S.-S.; data curation, M.R.-E., M.d.J.L.-M. and D.A.-E.; writing—original draft preparation, M.R.-E. and M.d.J.L.-M.; writing—review and editing, S.V.-R. and C.A.O.-O.; visualization, H.A.G.-O.; supervision, H.A.G.-O. and S.V.-R.; project administration, C.A.O.-O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available at https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz (accessed on 16 May 2025), kagglehub.dataset_download(“sorour/38cloud-cloud-segmentation-in-satellite-images”) (accessed on 16 May 2025), and kagglehub.dataset_download(“nafin59/monkeypox-skin-lesion-dataset”) (accessed on 16 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695. [Google Scholar] [CrossRef]
  2. Roy, A. Artificial neural networks. ACM SIGKDD Explor. Newsl. 2000, 1, 33–38. [Google Scholar] [CrossRef]
  3. Zhang, F.; Petersen, M.; Johnson, L.; Hall, J.; O’Bryant, S.E. Hyperparameter tuning with high performance computing machine learning for imbalanced Alzheimer’s disease data. Appl. Sci. 2022, 12, 6670. [Google Scholar] [CrossRef] [PubMed]
  4. Straccia, U. Reasoning within fuzzy description logics. J. Artif. Intell. Res. 2001, 14, 137–166. [Google Scholar] [CrossRef]
  5. PJothi, N. Fault diagnosis in high-speed computing systems using big data analytics integrated evolutionary computing on the Internet of Everything platform. Deleted J. 2024, 20, 2177–2191. [Google Scholar] [CrossRef]
  6. Li, J.; Monroe, W.; Ritter, A.; Jurafsky, D.; Galley, M.; Gao, J. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 2–4 November 2016. [Google Scholar] [CrossRef]
  7. Zhang, C.; Sargent, I.; Pan, X.; Li, H.; Gardiner, A.; Hare, J. An object-based convolutional neural network (OCNN) for urban land use classification. Remote Sens. Environ. 2018, 216, 57–70. [Google Scholar] [CrossRef]
  8. Lv, X.; Ming, D.; Lu, T.; Zhou, K.; Wang, M.; Bao, H. A new method for region-based majority voting CNNs for very high resolution image classification. Remote Sens. 2018, 10, 1946. [Google Scholar] [CrossRef]
  9. Kim, Y.; Choi, H.; Lee, J.; Kim, J.-S.; Jei, H.; Roh, H. Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. Clust. Comput. 2020, 23, 2287–2300. [Google Scholar] [CrossRef]
  10. Liu, J.; Wu, Z.; Feng, D.; Zhang, M.; Wu, X.; Yao, X.; Yu, D.; Ma, Y.; Zhao, F.; Dou, D. HeterPS: Distributed deep learning with reinforcement learning based scheduling in heterogeneous environments. Future Gener. Comput. Syst. 2023, 148, 106–117. [Google Scholar] [CrossRef]
  11. Sun, P.; Feng, W.; Han, R.; Yan, S.; Wen, T. Optimizing network performance for distributed DNN training on GPU clusters: ImageNet/AlexNet training in 1.5 minutes. arXiv 2019, arXiv:1902.06855. [Google Scholar]
  12. Bhangale, U.; Durbha, S.S.; King, R.L.; Younan, N.H.; Vatsavai, R. High performance GPU computing based approaches for oil spill detection from multi-temporal remote sensing data. Remote Sens. Environ. 2017, 202, 28–44. [Google Scholar] [CrossRef]
  13. Graziani, M.; Eggel, I.; Deligand, F.; Bobák, M.; Andrearczyk, V.; Müller, H. Breast histopathology with high-performance computing and deep learning. Comput. Inform. 2021, 39, 780–807. [Google Scholar] [CrossRef]
  14. Plaza, A.; Du, Q.; Chang, Y.L.; King, R.L. High performance computing for hyperspectral remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2011, 4, 528–544. [Google Scholar] [CrossRef]
  15. Pop, F. High performance numerical computing for high energy physics: A new challenge for big data science. Adv. High Energy Phys. 2014, 2014, 507690. [Google Scholar] [CrossRef]
  16. Singh, V.K.; Sheng, Q. Bridging the science and technology by modern mathematical methods and high performance computing. Appl. Math. 2021, 66, 2. [Google Scholar] [CrossRef]
  17. Wang, Z.; Yu, D.; Shen, S.; Zhang, S.; Liu, H.; Yao, S. Select Your Own Counterparts: Self-Supervised Graph Contrastive Learning with Positive Sampling. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 4–6453. [Google Scholar] [CrossRef]
  18. Xu, P.; Shi, S.; Chu, X. Performance Evaluation of Deep Learning Tools in Docker Containers. In Proceedings of the 2017 3rd International Conference on Big Data Computing and Communications (BIGCOM), Chengdu, China, 10–12 August 2017. [Google Scholar] [CrossRef]
  19. Bahrampour, S.; Ramakrishnan, N.; Schott, L.; Shah, M. Comparative Study of Deep Learning Software Frameworks. arXiv 2015, arXiv:1511.06435. [Google Scholar] [CrossRef]
  20. Du, X.; Xu, Z.; Wang, Y.; Huang, S.; Li, J. Comparative Study of Distributed Deep Learning Tools on Supercomputers. In Algorithms and Architectures for Parallel Processing; Vaidya, J., Li, J., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018; Volume 11334, pp. 121–135. [Google Scholar] [CrossRef]
  21. Hegde, V.; Usmani, S. Parallel and Distributed Deep Learning. arXiv 2016, arXiv:1605.04591. [Google Scholar] [CrossRef]
  22. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Tech. Rep. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 30 April 2025).
  23. Mohajerani, S.; Krammer, T.A.; Saeedi, P. A Cloud Detection Algorithm for Remote Sensing Images Using Fully Convolutional Neural Networks. In Proceedings of the 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), Vancouver, BC, Canada, 29–31 August 2018; pp. 1–5. [Google Scholar] [CrossRef]
  24. Mohajerani, S.; Saeedi, P. Cloud-Net: An End-to-End Cloud Detection Algorithm for Landsat 8 Imagery. arXiv 2019, arXiv:1901.10077. [Google Scholar]
  25. Gürbüz, S.; Aydin, G. Monkeypox Skin Lesion Detection Using Deep Learning Models. In Proceedings of the 2022 International Conference on Computers and Artificial Intelligence Technologies (CAIT), Istanbul, Turkey, 21–23 July 2022; pp. 66–70. [Google Scholar]
  26. Kshemkalyani, A.D.; Singhal, M. Distributed Computing: Principles, Algorithms, and Systems, 1st ed.; Cambridge University Press: New York, NY, USA, 2008. [Google Scholar]
  27. Hajibaba, M.; Gorgin, S. A Review on Modern Distributed Computing Paradigms: Cloud Computing, Jungle Computing and Fog Computing. J. Comput. Inf. Technol. 2014, 22, 69–84. [Google Scholar] [CrossRef]
  28. Cisco Systems. Cisco Fog Computing Solutions Unleash the Power of the Internet of Things. Available online: https://docplayer.net/20003565-Cisco-fog-computing-solutions-unleash-the-power-of-the-internet-of-things.html (accessed on 12 December 2024).
  29. Merenda, M.; Porcaro, C.; Iero, D. Edge Machine Learning for AI-Enabled IoT Devices: A Review. Sensors 2020, 20, 2533. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  30. Sunny, R.T.; Thampi, S.M. Survey on Distributed Data Mining in P2P Networks. arXiv 2012, arXiv:1205.3231. [Google Scholar] [CrossRef]
  31. Misirli, J.; Casalicchio, E. An Analysis of Methods and Metrics for Task Scheduling in Fog Computing. Future Internet 2024, 16, 16. [Google Scholar] [CrossRef]
  32. Wang, L.; Tao, J.; Ranjan, R.; Marten, H.; Streit, A.; Chen, J.; Chen, D. G-Hadoop: MapReduce across distributed data centers for data-intensive computing. Future Gener. Comput. Syst. 2013, 29, 739–750. [Google Scholar] [CrossRef]
  33. Koutris, P.; Salihoglu, S.; Suciu, D. Algorithmic Aspects of Parallel Data Processing. Found. Trends Databases 2018, 8, 239–370. [Google Scholar] [CrossRef]
  34. Shallue, C.; Lee, J.; Antognini, J.; Sohl-Dickstein, J.; Frostig, R.; Dahl, G. Measuring the Effects of Data Parallelism on Neural Network Training. J. Mach. Learn. Res. 2018, 20, 1–49. [Google Scholar]
  35. Shoeybi, M.; Patwary, M.; Puri, R.; LeGresley, P.; Casper, J.; Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv 2019, arXiv:1909.08053. [Google Scholar]
  36. Chen, T.; Xu, B.; Zhang, C.; Guestrin, C. Training Deep Nets with Sublinear Memory Cost. arXiv 2016, arXiv:1604.06174. [Google Scholar]
  37. Jiang, Y.; Zhu, Y.; Lan, C.; Yi, B.; Cui, Y.; Guo, C. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters. In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation (OSDI’20), 4–6 November 2020; pp. 463–479. [Google Scholar]
  38. Sergeev, A.; Del Balso, M. Horovod: Fast and easy distributed deep learning in TensorFlow. arXiv 2018, arXiv:1802.05799. [Google Scholar]
  39. Gu, J.; Chowdhury, M.; Shin, K.G.; Zhu, Y.; Jeon, M.; Qian, J.; Liu, H.; Guo, C. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation (NSDI’19), Boston, MA, USA, 26–28 February 2019; pp. 485–500. [Google Scholar]
  40. Nguyen, T.T.; Wahib, M.; Takano, R. Efficient MPI-AllReduce for Large-Scale Deep Learning on GPU-Clusters. Concurr. Comput. Pract. Exp. 2021, 33, e5574. [Google Scholar] [CrossRef]
  41. NVIDIA. NVIDIA Collective Communications Library (NCCL). Available online: https://docs.nvidia.com/deeplearning/nccl/index.html (accessed on 10 April 2025).
  42. Awan, A.A.; Manian, K.V.; Chu, C.-H.; Subramoni, H.; Panda, D.K. Optimized Large-Message Broadcast for Deep Learning Workloads: MPI, MPI+NCCL, or NCCL2? Parallel Comput. 2019, 85, 141–152. [Google Scholar] [CrossRef]
  43. Hill, M.D.; Marty, M.R. Amdahl’s Law in the Multicore Era. Computer 2008, 41, 33–38. [Google Scholar] [CrossRef]
  44. Végh, J. How Amdahl’s Law Limits the Performance of Large Artificial Neural Networks. Brain Inf. 2019, 6, 4. [Google Scholar] [CrossRef] [PubMed]
  45. TensorFlow. Distributed Training with TensorFlow. Available online: https://www.tensorflow.org/guide/distributed_training?hl=es-419 (accessed on 10 April 2025).
  46. PyTorch. Distributed Training Overview. Available online: https://pytorch.org/tutorials/beginner/dist_overview.html (accessed on 10 April 2025).
  47. Google. Colaboratory FAQ. Available online: https://research.google.com/colaboratory/intl/es/faq.html (accessed on 10 April 2025).
  48. Google. Administrar Google Colab para Organizaciones. Available online: https://support.google.com/a/answer/13177581 (accessed on 10 April 2025).
  49. NVIDIA Corporation. GeForce GTX 1050 Ti. NVIDIA. 2016. Available online: https://www.nvidia.com/es-la/geforce/products/10series/geforce-gtx-1050/ (accessed on 21 March 2025).
  50. Intel Corporation. Intel® Core™ i7-7700K Processor (8M Cache, up to 4.50 GHz). Intel ARK. 2017. Available online: https://ark.intel.com/products/97129 (accessed on 21 March 2025).
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  52. Horovod: Distributed Training Framework for TensorFlow, Keras, PyTorch, and MXNet. Available online: https://horovod.readthedocs.io/en/latest/keras.html (accessed on 25 May 2025).
  53. TOP500. Available online: https://top500.org/ (accessed on 11 April 2025).
  54. Liu, Y.; Wang, X.; Hu, E.; Wang, A.; Shiri, B.; Lin, W. VNDHR: Variational Single Nighttime Image Dehazing for Enhancing Visibility in Intelligent Transportation Systems via Hybrid Regularization. IEEE Trans. Intell. Transp. Syst. 2025, 2025, 1–15. [Google Scholar] [CrossRef]
Figure 1. Proposed methodology in this study.
Figure 2. Data parallelism is applied in the DS.
Figure 3. Model parallelism applied to the DS.
Figure 4. ResNet50 architecture representation.
Figure 5. Learning curves of the 38-Cloud dataset for accuracy and loss evaluations.
Figure 6. Learning curves of the MSLD dataset for accuracy and loss evaluations.
Figure 7. Learning curves of the CIFAR-100 dataset for accuracy and loss evaluations.
Table 1. Recent studies on HPC performance validation.
Study | Approach | Highlights
Kim et al. [9] | Optimization for heterogeneous multi-GPU clusters, using TensorFlow as a distributed processing framework, with the goal of improving resource utilization without sacrificing training accuracy. | Improves computational performance by reducing I/O bottlenecks and efficiently increasing resource utilization on heterogeneous multi-GPU clusters using Distributed TensorFlow.
Xu et al. [18] | Performance evaluation of DL tools (such as TensorFlow, Caffe, and MXNet) running in virtualized environments with Docker containers, compared to running directly on the host operating system. | The use of Docker containers generates minimal overhead in the training performance of DL models. Tests showed almost negligible performance differences between the native and containerized environments, and the evaluated tools maintained consistent and stable performance, validating their use in both production and experimental environments.
Bahrampour et al. [19] | Comparison of five DL frameworks (Caffe, Neon, TensorFlow, Theano, and Torch), evaluating their extensibility, hardware utilization, and computational performance (training and execution times) on both CPUs and multithreaded GPUs, testing different architectures, including convolutional networks. | Performance varies depending on the framework, network architecture, and hardware environment, so the choice of framework should consider these factors. Torch was the most efficient on both CPUs and GPUs and the most extensible.
Du et al. [20] | Systematic comparison of DDL tools (specifically TensorFlow, MXNet, and CNTK) running on supercomputers, analyzing their computational efficiency, scalability, and performance when training complex models in HPC environments. | Provides a comprehensive evaluation of the performance of distributed tools on HPC architectures, considering both speed and scalability. Practical and comparable metrics such as training time, speedup, and computational resource monitoring are used to measure the efficiency of resource usage during DL model training.
Hegde et al. [21] | An overview of parallel and distributed deep learning, addressing both the theoretical principles and the practical challenges of efficiently deploying models on parallel infrastructures, with a focus on the various parallelization techniques and their impact on training performance. | The balance between computation and communication is critical to achieving real improvements in speed and scalability. Data parallelism is most efficient for tasks where the model is small or moderately sized and is trained on large amounts of data.
Table 2. Characteristics of the performance evaluation environment.
Infrastructure | Components | Description
Workstation | GPU | Asus NVIDIA RTX 3070 (Taipei, Taiwan)
 | CPU | AMD Ryzen 9 5900X (Santa Clara, CA, USA)
 | RAM | 96 GB / 8 GB VRAM
 | Operating System | Windows 11
 | TensorFlow | Ver. 2.10.1
 | PyTorch | Ver. 2.0.1 + cu117
 | Python | Ver. 3.9.18
Low-Scalability DS-HPC | GPU | Gigabyte NVIDIA GTX 1050 Ti x 3 (Taipei, Taiwan)
 | CPU | Intel i7-7700K x 3 (Santa Clara, CA, USA)
 | RAM | 16 GB x 3 / 4 GB VRAM x 3
 | Operating System | Ubuntu Server 22.04.4 LTS
 | TensorFlow | Ver. 2.17.0
 | PyTorch | Ver. 2.0.1 + cu117
 | Python | Ver. 3.9.18
Google Colaboratory | GPU | NVIDIA T4 (Santa Clara, CA, USA)
 | CPU | N/A
 | RAM | 16 GB VRAM
 | Operating System | GNU/Linux
 | TensorFlow | Ver. 2.18.0
 | PyTorch | Ver. 2.6.0 + cu124
 | Python | Ver. 3.11.12
Table 3. Defining hyperparameters for evaluating the ResNet50 model on the infrastructures.
Dataset | Hyperparameter | Value
38-Cloud | Input | 192 × 192
 | Epochs | 50
 | Batch size | 8
 | Optimizer | SGD
 | Learning rate | 0.003
 | Loss function | Cross-Entropy
MSLD | Input | 244 × 244
 | Epochs | 50
 | Batch size | 32
 | Optimizer | SGD
 | Learning rate | 0.001
 | Loss function | Cross-Entropy
CIFAR-100 | Input | 32 × 32
 | Epochs | 50
 | Batch size | 128
 | Optimizer | SGD
 | Learning rate | 0.001
 | Loss function | Cross-Entropy
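
To make the settings in Table 3 concrete, the fragment below shows how the CIFAR-100 row (32 × 32 inputs, batch size 128, SGD with a learning rate of 0.001, cross-entropy loss, 50 epochs) could be expressed as a single-GPU PyTorch training loop. It is an illustrative sketch, not the exact training script used in the experiments.

```python
# Illustrative single-GPU setup matching the CIFAR-100 hyperparameters in Table 3.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torchvision.models import resnet50

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transform = transforms.Compose([transforms.ToTensor()])  # 32 x 32 inputs kept as-is
train_set = datasets.CIFAR100(root="./data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)

model = resnet50(num_classes=100).to(device)                # ResNet50 backbone
criterion = nn.CrossEntropyLoss()                           # Cross-Entropy loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)   # SGD, learning rate 0.001

for epoch in range(50):                                     # 50 epochs
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: last batch loss = {loss.item():.4f}")
```
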
Table 4. Processing time in the computing infrastructures.
Infrastructure | Dataset | Processing Time (Distributed Data Parallel PyTorch) | Processing Time (Distributed TensorFlow)
Workstation | Cloud | 4 h 12 m 22 s | 5 h 18 m 18 s
 | MSLD | 12 m 14 s | 19 m 36 s
 | CIFAR-100 | 24 m 8 s | 37 m 41 s
Low-Scalability DS-HPC | Cloud | 3 h 2 m 15 s | 3 h 37 m 8 s
 | MSLD | 10 m 32 s | 37 m 54 s
 | CIFAR-100 | 1 h 51 m 8 s | 10 m 52 s
Google Colaboratory | Cloud | 4 h 27 m 6 s | 5 h 21 m 4 s
 | MSLD | 17 m 43 s | 52 m 36 s
 | CIFAR-100 | 37 m 4 s | 1 h 13 m 13 s
Table 5. Speedup of the computing infrastructures.
Infrastructure | Dataset | DDP PyTorch Time (s) | Dist. TensorFlow Time (s) | DDP PyTorch Speedup | Dist. TensorFlow Speedup
Workstation | Cloud | 15,142 | 19,098 | 15,142/10,935 = 1.38x | 19,098/13,028 = 1.47x
 | MSLD | 734 | 1176 | 734/632 = 1.16x | 1176/2274 = 0.51x
 | CIFAR-100 | 1448 | 2261 | 1448/6668 = 0.21x | 2261/652 = 3.47x
Low-Scalability DS-HPC | Cloud | 10,935 | 13,028 | N/A (baseline) | N/A (baseline)
 | MSLD | 632 | 2274 | N/A (baseline) | N/A (baseline)
 | CIFAR-100 | 6668 | 652 | N/A (baseline) | N/A (baseline)
Google Colaboratory | Cloud | 16,026 | 19,264 | 16,026/10,935 = 1.47x | 19,264/13,028 = 1.48x
 | MSLD | 1063 | 3156 | 1063/632 = 1.68x | 3156/2274 = 1.39x
 | CIFAR-100 | 2224 | 4393 | 2224/6668 = 0.33x | 4393/652 = 6.74x
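
The speedup values above are the processing time of each reference environment divided by the corresponding LSDS-HPC time from Table 4. The small script below (with the times hard-coded from Table 4 purely for illustration) reproduces these ratios.

```python
# Reproduce the speedup ratios in Table 5 from the Table 4 times (in seconds).
lsds = {"Cloud": (10935, 13028), "MSLD": (632, 2274), "CIFAR-100": (6668, 652)}
others = {
    "Workstation":         {"Cloud": (15142, 19098), "MSLD": (734, 1176),  "CIFAR-100": (1448, 2261)},
    "Google Colaboratory": {"Cloud": (16026, 19264), "MSLD": (1063, 3156), "CIFAR-100": (2224, 4393)},
}

for infra, rows in others.items():
    for dataset, (t_ddp, t_tf) in rows.items():
        base_ddp, base_tf = lsds[dataset]
        print(f"{infra:20s} {dataset:10s} "
              f"DDP speedup = {t_ddp / base_ddp:.2f}x, "
              f"TF speedup = {t_tf / base_tf:.2f}x")
```
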
Table 6. Validation metric values (accuracy and loss) obtained from the computational infrastructures.
Infrastructure | Dataset | DDP PyTorch Accuracy (Training / Validation) | DDP PyTorch Loss (Training / Validation) | Dist. TensorFlow Accuracy (Training / Validation) | Dist. TensorFlow Loss (Training / Validation)
Workstation | Cloud | 0.9649 / 0.9672 | 0.0990 / 0.0863 | 0.9027 / 0.9121 | 0.2135 / 0.2688
 | MSLD | 0.8042 / 0.7590 | 0.5050 / 0.4830 | 0.8349 / 0.8070 | 0.3861 / 0.3884
 | CIFAR-100 | 0.8876 / 0.4892 | 0.6431 / 1.9475 | 0.1917 / 0.1695 | 3.4325 / 3.6841
Low-Scalability DS-HPC | Cloud | 0.7687 / 0.7951 | 0.4118 / 0.3721 | 0.9305 / 0.9125 | 0.1676 / 0.2303
 | MSLD | 0.6511 / 0.6278 | 0.7229 / 0.7953 | 0.7504 / 0.7171 | 0.5183 / 0.5532
 | CIFAR-100 | 0.7077 / 0.7038 | 1.0173 / 1.038 | 0.1258 / 0.1192 | 3.9149 / 3.9252
Google Colab | Cloud | 0.9713 / 0.9710 | 0.0795 / 0.0730 | 0.9557 / 0.1106 | 0.9558 / 0.1183
 | MSLD | 0.7722 / 0.7857 | 0.5235 / 0.5065 | 0.8392 / 0.8960 | 0.3688 / 0.3454
 | CIFAR-100 | 0.6202 / 0.4637 | 1.5279 / 2.0078 | 0.1037 / 0.1089 | 4.0670 / 4.0532
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
