
Analysis of the Spread and Evolution of COVID-19 Mutations in Ecuador Using Open Data

Centro de Mecatrónica y Sistemas Interactivos—MIST, Universidad Tecnológica Indoamérica, Quito 170301, Ecuador
Neurosurgery Department, Hospital de las Fuerzas Armadas HE-1, Quito 170136, Ecuador
Neurosurgery Department, Metropolitano Hospital, Quito 170521, Ecuador
Author to whom correspondence should be addressed.
Life 2024, 14(6), 735
Submission received: 1 April 2024 / Revised: 2 June 2024 / Accepted: 4 June 2024 / Published: 7 June 2024


The analysis of, and prediction from, COVID-19-related data extracted from patient information repositories compiled by hospitals and health organizations are of paramount importance. These efforts contribute significantly to vaccine development and the formulation of contingency techniques, providing essential tools to prevent resurgence and to manage the spread of the disease effectively. In this context, the present research focuses on analyzing the biological information of the SARS-CoV-2 viral gene sequences and the clinical data of COVID-19-affected patients using publicly accessible data from Ecuador. This involves considering variables such as age, gender, and geographical location to understand the evolution of mutations and their distributions across Ecuadorian provinces. The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology is applied for data analysis. Various data preprocessing and statistical analysis techniques are employed, including Pearson correlation, the chi-square test, and analysis of variance (ANOVA). Statistical diagrams and charts are used to facilitate a better visualization of the results. The results illuminate the genetic diversity of the virus and its correlation with clinical variables, offering a comprehensive understanding of the dynamics of COVID-19 spread in Ecuador. Critical variables influencing population vulnerability are highlighted, and the findings underscore the significance of mutation monitoring and indicate a need for global expansion of the research area.

1. Introduction

The analysis of the spread of communicable diseases has become an extremely promising area of research, owing to its significant contributions in recent years to the development of vaccines, prevention techniques, and health response plans [1,2]. In the last two decades, numerous communicable diseases have spread among different countries. Recent studies, including Prabhu’s work [3], explain that these diseases are transmitted directly by bacteria, viruses, and other pathogens.
According to the International Federation of Red Cross and Red Crescent Societies (IFRC) and the World Health Organization (WHO), the most contagious diseases globally include tropical diseases such as tuberculosis, malaria, coronavirus, dengue, hepatitis, measles, and HIV/AIDS [4]. These diseases pose significant global public health challenges due to their capacity to spread and their impact on the affected populations. A notable instance of the spread of communicable diseases is the MERS-CoV virus, which has caused severe respiratory infections in more than 2468 people, resulting in over 851 deaths in 27 countries since 2012 [5].
In 2009, the swine flu virus (H1N1) emerged, spreading to 214 countries and causing more than 18,449 confirmed deaths, as reported by the WHO [6]. However, this was not the only case of this nature during the aforementioned period. Between 2003 and 2019, the avian influenza virus (H5N1) emerged, infecting 861 people worldwide [7]. Another virus of significant spread and relevance was the severe acute respiratory syndrome (SARS) pandemic, which originated at the end of 2002 and extended to 29 countries. SARS caused 8096 cases of infection and resulted in 774 deaths. More recently, the COVID-19 virus, likely originating in Wuhan, China, in 2019, caused a global epidemic that has severely affected numerous countries [8]. According to the WHO [9], COVID-19 has affected 770 million people, with 6 million lives lost due to this disease.
In the case of Ecuador, more than one million people have been diagnosed with COVID-19, and over 36,000 people have lost their lives [10]. However, it is crucial to note that a significant vaccination campaign has been conducted in the country. According to data from the Ministry of Health of Ecuador [11], nearly 39 million doses of vaccines have been administered to the population.
Research in this field is of vital importance, especially in the context of the coronavirus disease. Disease spread is a matter of the utmost significance due to the unpredictable evolution and mutation of viruses globally [12]. This is particularly relevant in Ecuador, a country characterized by unique climatic diversity, which could be conducive to the development of various mutations.
A relevant work on this topic is Rui Wang’s study [13], which identified and analyzed the positions, frequencies, and encoded proteins of SARS-CoV-2 mutations globally. The primary objective of this study was to isolate the SARS-CoV-2 genome and quantify the number of mutations present using the genotyping technique. The results were considered satisfactory, as they identified a total of 13,402 unique mutations. Additionally, the study revealed that 51.4% of the SARS-CoV-2 mutations corresponded to the C>T type.
Wang [14] identified rapidly proliferating mutations in the receptor-binding domain (RBD) and analyzed the evolutionary trend of SARS-CoV-2. The primary goal was to examine a genomic dataset of SARS-CoV-2 recorded in the Mutation Tracker using a deep learning method. The results are highly encouraging, highlighting 6945 unique mutations and 2,194,305 non-unique mutations in the SARS-CoV-2 S gene worldwide. Furthermore, the authors determined that the majority of mutations in SARS-CoV-2 corresponded to the A>G, C>T, and T>C types. They also indicated that approximately 70% of these mutations can weaken the efficacy of known antibodies.
Thanh [15] undertook a comprehensive analysis of genomic mutations in the coding regions of SARS-CoV-2, exploring the potential secondary structure of the resulting proteins. The central objective of this study was to assess all point mutations recorded to date in SARS-CoV-2. This study further identified different mutation patterns using various deep-learning models. A total of 3089 mutations were found in the S protein of SARS-CoV-2. Lucy Van Dorp [16] analyzed mutations associated with SARS-CoV-2 virus transmission with the aim of quantifying the number of offspring that inherited a specific allele compared to those who did not. The phylogenetic index was employed for the data analysis in this study. The results revealed a total of 12,706 C>U-type mutations. However, the study concluded that none of these mutations were associated with a significant increase in virus transmission.
Pachetti et al. [17] conducted an analysis and evaluation of the distribution of SARS-CoV-2 mutations in various geographical areas (Asia, Oceania, Europe, and North America) using the Clustal Omega method. This study relied on randomly collected data from the GISAID database. The work produced significant findings, identifying a total of 14,408 mutations in the P to L proteins. Furthermore, the authors demonstrated that some of these mutations could lead to resistance to certain drugs.
Rozhgar [18] identified and analyzed the genomic mutations of SARS-CoV-2. The primary objective was to determine the most common SARS-CoV-2 mutations using bioinformatics programs. The study analyzed 95 complete SARS-CoV-2 genome sequences available at the GenBank National Microbiology Data Center (NMDC). The results showed 116 mutations corresponding to the ORF1ab gene, ORF8, and the N gene. Ahmad [19] analyzed the whole-genome mutations of SARS-CoV-2. The primary objective of this study was to determine the possible mutations and evolution of COVID-19. The study utilized BioEdit software version 7.2 to conduct genomic alignments and determined that there were 596 mutations across all genes.
Lastly, Abdel-Rahman [20] analyzed the sequential mutations present in the SARS-CoV-2 genome and determined the various mutation patterns manifested in infected Egyptian patients. The author used the Pangolin and Nextstrain lineage classification methods with the primary objective of determining the optimal classification of SARS-CoV-2 genomes. The results revealed the existence of a total of 1115 unique mutations. Further, approximately 60.5% of these mutations were located in the ORF1ab polyprotein.
Thus, the central objective of the present investigation is to analyze both the biological information of viral variants and the clinical data of patients infected with COVID-19 in Ecuador. This analysis encompasses variables such as age, gender, and geographic location, among others. The goal is to identify the most relevant variables and to comprehend the evolution of virus mutations, as well as their geographic distributions in various provinces of Ecuador. Through these data analyses, we aim to pinpoint the most vulnerable population groups based on their clinical characteristics, such as age, gender, and geographic location, concerning the diverse mutations and variants of SARS-CoV-2. This approach will foster a more profound understanding of the disease dynamics and enable more informed decision-making in terms of public health.
This paper is structured as follows. Section 2 details the methods and materials used to conduct the research. Section 3 focuses on the phases of data preprocessing and the application of technological tools to the information. Section 4 discusses the analysis of the data obtained, encompassing information related to variants of viruses, as well as patient profiles extracted from the database. The discussion is further enriched by incorporating perspectives and findings from other authors in the field. Finally, in Section 5, conclusions derived from the findings are presented, and possible directions for future work are discussed.

2. Methods and Materials

In this section, detailed information on the materials, including the database used, and the methods applied for data analysis in the development of this study is presented.

2.1. Materials

COVID-19 Database—GISAID Ecuador

The COVID-19 database in Ecuador was compiled through EpiFlu™, an initiative of GISAID [21]. This dataset consists of a total of 8992 records and 13 attributes. One attribute contains genetic information (virus genetic sequences), and the remaining 12 attributes contain medical information related to patients, as presented in Table 1. This database contains different variants of SARS-CoV-2 (Omicron, Delta, Epsilon, Gamma, Lambda, and Alpha). However, the Beta variant is not included because the GISAID website lacks Ecuadorian records for this variant.
In Table 1, the virus code row has the GRA code, which, in the context of COVID-19, refers to the timeline of virus variants. GRA is an abbreviation that the World Health Organization (WHO) uses to identify Variants of Interest (VOIs). After processing the protein chain data, a new database was generated that incorporated information on the 150 amino acids frequently mutated in SARS-CoV-2, along with the number of mutations associated with each of them.

3. Data Processing

Data preprocessing is a crucial phase as the efficiency of preprocessing greatly influences the quality of the final results. Data processing was divided into four distinct phases based on the CRISP-DM methodology, in which a series of techniques and transformations were applied to the data to clean, organize, and prepare them for further analysis.

3.1. CRISP-DM Methodology

The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology is one of the most widely used for the development of data mining projects, as it offers a cyclic approach to project management. This methodology allows for a structured life cycle, facilitating the understanding and efficient management of each stage of the data mining process [22]. Figure 1 illustrates each stage, starting with problem understanding, followed by data understanding, data preparation, modeling, model evaluation, and implementation.
The CRISP-DM methodology consists of six fundamental stages. It begins with understanding the problem, where the problem is identified, the project objectives are defined, and the current state is evaluated. Subsequently, in the data understanding phase, data are collected and explored to understand their meanings and properties. The data preparation stage focuses on cleaning, transforming, and creating indicators from existing data. In the modeling stage, the most appropriate technique is chosen, the model parameters are adjusted, and its performance is evaluated. The evaluation stage determines the model’s quality through statistical metrics and comparisons with previously established objectives to ensure that it meets the expectations of the project. A satisfactory evaluation of the trained model is followed by its implementation, during which a specific infrastructure for data processing is configured [23].

3.2. Preprocessing

For each phase of data processing, we utilized the server of the Institute of Mathematical Sciences (ICMAT), known as the LOVELACE Cluster. This server comprises 32 general computing nodes, one node equipped with Xeon Phi processors, two nodes with graphics processing units (Tesla GPUs), and three nodes with high RAM capacity.

3.2.1. Phase 1—Integration and Data Collection

In this phase, the COVID-19 database from Ecuador was downloaded from the GISAID platform [21]. This database was divided into two parts: the first in a FASTA format containing information on the protein chain, date, and patient code, among other relevant details. The second part comprised patient medical data in .CSV files containing epidemiological information, such as age, gender, patient ID, location, and variant.
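As an illustration of this phase, the following is a minimal sketch of how the two downloaded parts could be read, using only the Python standard library. The pipe-separated header layout, the CSV column names, and the toy records are assumptions made for the example; they are not taken from the actual GISAID export.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Toy inputs standing in for the downloaded files (contents are invented).
fasta_text = (
    ">hCoV-19/Ecuador/X1/2021|EPI_0001|2021-03-04\nATGTTT\nGTT\n"
    ">hCoV-19/Ecuador/X2/2021|EPI_0002|2021-05-10\nATGTTC\n"
)
csv_text = (
    "patient_id,age,gender,location,variant\n"
    "EPI_0001,34,Female,Pichincha,Gamma\n"
    "EPI_0002,61,Male,Guayas,Delta\n"
)

# Index sequences by the (assumed) patient code in the header.
sequences = {}
for header, seq in read_fasta(io.StringIO(fasta_text)):
    name, patient_id, date = header.split("|")
    sequences[patient_id] = {"name": name, "date": date, "sequence": seq}

# Read the epidemiological records as a list of dictionaries.
patients = list(csv.DictReader(io.StringIO(csv_text)))
```

Keying the sequences by patient code makes it straightforward to join them with the epidemiological records in a later phase.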

3.2.2. Phase 2—Data Selection and Cleaning

During this phase, data problems were identified and corrected to ensure the accuracy and reliability of the information. For this procedure, an algorithm called “DataClean” was developed in Python. It uses several data management libraries, such as Pandas, NumPy, and Matplotlib. The algorithm eliminates duplicate data, outliers, and data containing errors. Additionally, it cleans the information in each variable by eliminating line breaks and unknown characters. In this phase, variables irrelevant to the study were also removed: host, as all those infected with the virus are human; originating_lab, due to incoherent, incomplete, and imprecise information; authors, as it only shows who collected the sample; and originVariant, because its incomplete information does not contribute to the research.
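The cleaning steps described above can be sketched as follows. This is not the DataClean algorithm itself, whose code is not given here; it is a standard-library approximation in which the record layout, the `clean_records` name, and the `max_age` outlier threshold are assumptions for illustration.

```python
import re

# Variables reported as removed in this phase.
DROPPED_COLUMNS = {"host", "originating_lab", "authors", "originVariant"}


def clean_text(value):
    """Remove line breaks and non-printable characters, then trim whitespace."""
    value = re.sub(r"[\r\n]+", " ", value)
    value = "".join(ch for ch in value if ch.isprintable())
    return value.strip()


def clean_records(records, max_age=120):
    """Return cleaned records without duplicates, bad ages, or dropped columns.

    max_age is a hypothetical plausibility limit used to discard outliers.
    """
    seen, cleaned = set(), []
    for record in records:
        row = {
            key: clean_text(str(value))
            for key, value in record.items()
            if key not in DROPPED_COLUMNS
        }
        age = row.get("age", "")
        if not age.isdigit() or int(age) > max_age:
            continue  # missing, malformed, or implausible age
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # exact duplicate of an earlier record
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned
```

Cleaning the text before checking for duplicates matters: two records that differ only by a stray line break are then treated as the same record.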
Table 2 shows the Pearson correlation coefficients between the variables.
Table 3 presents the values of the chi-square statistic between the variables.
Table 4 displays the ANOVA results, where the F-statistic is used to test for significant differences between variables.
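For reference, the three statistics reported in Tables 2 to 4 can be computed from their textbook definitions as in the sketch below. It uses only the standard library and returns the test statistics alone (no p-values); the function names are illustrative, and the study's own implementation may differ.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def chi_square(x, y):
    """Pearson chi-square statistic of independence for two categorical lists."""
    n = len(x)
    observed = Counter(zip(x, y))
    row_totals, col_totals = Counter(x), Counter(y)
    stat = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


def anova_f(groups):
    """One-way ANOVA F-statistic for a list of numeric groups."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, means)
    )
    ss_within = sum(
        (v - m) ** 2 for group, m in zip(groups, means) for v in group
    )
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

For example, `anova_f` could be called with one list of patient ages per variant to check whether mean age differs between variants.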

3.2.3. Phase 3—Data Transformation

In this phase, the data were prepared for further analysis, including the standardization of data (age and gender). The age variable was categorized into 20 groups in 5-year intervals: Group 1 (0–5 years), Group 2 (6–10 years), Group 3 (11–15 years), Group 4 (16–20 years), Group 5 (21–25 years), Group 6 (26–30 years), Group 7 (31–35 years), Group 8 (36–40 years), Group 9 (41–45 years), Group 10 (46–50 years), Group 11 (51–55 years), Group 12 (56–60 years), Group 13 (61–65 years), Group 14 (66–70 years), Group 15 (71–75 years), Group 16 (76–80 years), Group 17 (81–85 years), Group 18 (86–90 years), Group 19 (91–95 years), and Group 20 (96–over 100 years). This categorization structures and organizes the information for further analysis. It allows for a detailed representation of the age distribution in the sample and facilitates the identification of specific patterns and trends in different population segments. Furthermore, the division into small intervals provides a more accurate view of how age can influence the results, enabling a more precise interpretation of the findings.
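The grouping above can be expressed compactly. The sketch below maps an age in whole years to its group number, following the listed boundaries (Group 1 covers 0–5 and Group 20 is open-ended); the function name is illustrative.

```python
def age_group(age):
    """Map an age in whole years to its group number (1 to 20)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 5:
        return 1  # Group 1 spans 0-5 years
    # Groups 2-19 span 5 years each; Group 20 collects ages 96 and over.
    return min(20, (age - 1) // 5 + 1)
```

For instance, `age_group(33)` returns 7, the 31–35 group.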
The gender variable was transformed into two numerical categories (0: male and 1: female) so that variable selection techniques, such as ANOVA [24] and chi-square [25], could be applied to numerically encoded data. Protein chain standardization was then carried out: genetic sequence alignment was performed using Multiple Alignment using Fast Fourier Transform (MAFFT) [26] version 7 with the progressive Fast Fourier Transform method and the iterative refinement method (FFT-NS-2) [27]. The primary purpose of this phase was to align the viral gene sequences of each patient to a reference sequence of uniform length (29,904 nucleotides) for all. This process was executed using an algorithm developed in Python. Subsequently, the dataset of aligned gene sequences of the same size was transformed into amino acid chains using the EigenMS [28] and LibMUSCLE [29] libraries.
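To make the transformation step concrete, the sketch below shows the gender encoding and a plain translation of an aligned nucleotide sequence into amino acids with the standard genetic code. It does not reproduce MAFFT or the libraries named above. It assumes the reading frame starts at the first position and marks codons containing gaps or ambiguous bases as "X"; both choices are assumptions made for the example.

```python
from itertools import product

# Numerical encoding of gender as described in the text.
GENDER_CODES = {"male": 0, "female": 1}

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nucleotides):
    """Translate a nucleotide string codon by codon, starting at position 0.

    Codons with gaps or ambiguous bases become "X"; a trailing partial
    codon is ignored. Stop codons are written as "*".
    """
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )
```

In practice each gene would be translated in its own reading frame; translating the whole aligned genome from position 0 is only meaningful for this toy illustration.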
Lastly, the resulting amino acid chain was utilized to identify the number of mutations present in each record and compared to the patient’s COVID-19 sample in the early stages of infection. This task was executed with a Python algorithm using the Biopython library and Pandas. The primary objective of this algorithm was to identify the various mutations present in each amino acid. To detect these mutations, changes in amino acids at specific positions in the sequence were analyzed in comparison to other sequences. As a result, a database was generated detailing the number of mutations affecting each amino acid.
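A position-by-position comparison of this kind can be sketched as follows. The example assumes that the reference and sample amino acid chains are already aligned to the same length, labels each substitution in the usual form (reference residue, 1-based position, new residue), and skips unresolved positions; the function names are hypothetical, and the study's Biopython-based algorithm may work differently.

```python
from collections import Counter


def call_mutations(reference, sample):
    """List amino acid substitutions as labels such as 'D614G'."""
    mutations = []
    for position, (ref_aa, new_aa) in enumerate(zip(reference, sample), start=1):
        if ref_aa in "X-" or new_aa in "X-":
            continue  # gap or unresolved residue: no call at this position
        if ref_aa != new_aa:
            mutations.append(f"{ref_aa}{position}{new_aa}")
    return mutations


def count_mutations(reference, samples):
    """Count how many samples carry each substitution."""
    counts = Counter()
    for sample in samples:
        counts.update(call_mutations(reference, sample))
    return counts
```

Applied to all records, `count_mutations` yields the kind of per-mutation tally from which a mutation frequency database can be built.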

3.2.4. Phase 4—Data Integration

In this final phase, an algorithm was developed to integrate the patient’s epidemiological database with the amino acid database and the number of detected mutations of the virus in Phase 3. For this data integration, a Python 3.11.3 algorithm was created to merge the two datasets based on the patient’s ID. Through this data fusion, a cohesive dataset was established, enabling a comprehensive and unified analysis.
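The merge on patient ID can be illustrated with a small standard-library sketch. It performs an inner join, keeping only patients present in both datasets; the field names and the choice of an inner join are assumptions, since the text does not state how unmatched records were handled.

```python
def merge_by_patient(patients, mutation_data, key="patient_id"):
    """Inner-join patient records with per-patient mutation data.

    patients: list of dicts, each containing the join key.
    mutation_data: dict mapping patient ID to a dict of mutation fields.
    """
    merged = []
    for patient in patients:
        extra = mutation_data.get(patient[key])
        if extra is not None:
            merged.append({**patient, **extra})
    return merged
```

With Pandas, the same operation would typically be a single merge call on the shared ID column.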

4. COVID-19 Data Analysis and Results

This section presents data analysis related to COVID-19 patients in Ecuador post-data preprocessing. To conduct this analysis, Python and the Matplotlib, Geopandas, and Seaborn libraries were employed to visualize the processed information. These libraries facilitated the generation of statistical diagrams and variable correlations, thereby easing the examination of each graph.
Table 5 presents the total number and infection rate of SARS-CoV-2 variants in different provinces of Ecuador. Table 6 illustrates the correlation between the provinces most impacted by various variants of SARS-CoV-2, along with the corresponding infection percentages. The data highlighted Pichincha, Guayas, and, to a lesser extent, Chimborazo as the provinces experiencing the highest infection percentages linked to different variants of SARS-CoV-2.
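Percentages of this kind follow directly from record counts. The sketch below computes the share of records per category for any field of the merged dataset; the field names and the toy records in the usage note are illustrative only.

```python
from collections import Counter


def percentage_by(records, field):
    """Return the percentage of records per value of `field`, largest first."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {
        value: round(100 * count / total, 2)
        for value, count in counts.most_common()
    }
```

Calling `percentage_by(records, "location")` gives the share of cases per province, and `percentage_by(records, "variant")` the share per variant.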
Figure 2 depicts the distribution of various SARS-CoV-2 variants across the provinces of Ecuador. The six provinces with the highest infection rates are Pichincha, Guayas, Manabí, Chimborazo, Azuay, and Cotopaxi. Moreover, the most prevalent variant in each province is Omicron, followed by Delta, Mu GH, Gamma, Lambda, and, lastly, the Alpha variant (Table 6).
Figure 3 presents a plot illustrating the chronology of SARS-CoV-2 contagion by variant from January 2021 to October 2023. The results showed that the Alpha variant was predominant until July 2021, followed by the Delta variant from July 2021 to January 2022. The variant that remained consistently predominant in Ecuador, however, was Omicron, which spread from December 2021 to October 2023.
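A chronology like this can be derived by counting variants per calendar month. The following sketch assumes each record carries an ISO-formatted collection date (YYYY-MM-DD) and a variant label; both field names are hypothetical.

```python
from collections import Counter, defaultdict


def dominant_variant_by_month(records):
    """Return the most frequent variant for each YYYY-MM month, in date order."""
    monthly = defaultdict(Counter)
    for record in records:
        month = record["date"][:7]  # "2021-03-04" -> "2021-03"
        monthly[month][record["variant"]] += 1
    return {
        month: counts.most_common(1)[0][0]
        for month, counts in sorted(monthly.items())
    }
```

Plotting the per-month counts rather than only the winner would give a stacked view of how one variant displaces another over time.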
Figure 4 illustrates the quantitative distribution of SARS-CoV-2 variants according to patient gender. We observed that female patients were highly affected by the Omicron variant, representing 69.17% of the cases, followed by the Delta variant, which was detected in 14.88% of cases. Male patients were also predominantly affected by the Omicron variant, constituting 61.72% of the cases, followed by the Delta variant in 16.46% of cases. These findings underscore significant differences in SARS-CoV-2 variant prevalence between genders.
Figure 5 presents a quantitative distribution concerning individuals infected by SARS-CoV-2, categorized by gender and province. It illustrates that female COVID-19 patients were observed predominantly in the provinces of Pichincha (30.97%), Guayas (19.36%), and Manabí (8.75%). Similarly, male patients showed a high incidence in the provinces of Pichincha (30.91%), Guayas (17.36%), and Manabí (8.55%). These findings indicate that the provinces of Pichincha, Guayas, and Manabí were the most affected by the COVID-19 pandemic, significantly impacting both female and male patients.
Figure 6 illustrates the age distribution of patients for each variant of SARS-CoV-2 in Ecuador, revealing that the most affected group consisted of patients between 31 and 35 years of age, followed by patients aged 26–30, and, in third place, patients aged 36–40.
Figure 7 illustrates the primary distribution of infected patients by age and SARS-CoV-2 variant. Table 7 provides a detailed breakdown of the ages affected by each COVID-19 variant. In the case of individuals aged 31–35 years, the Omicron variant accounted for 22.17%, the Delta variant for 20.93%, the Mu GH variant for 19.45%, and the Gamma variant for 21.73% of cases. The Lambda variant showed a higher impact on patients aged 26–30 years, with 17.15%, while the Alpha variant affected 22.17% of patients.


The analysis of the biological information of the virus specifically focused on the spike protein, which, as observed in prior studies, such as those by Wang [14] and Thanh [15], exhibits the highest number of mutations. As shown in Table 8, the majority of amino acids with two mutations belong to the spike protein (S) chain. Mutations in this protein chain may impact transmissibility, the ability to evade the immune system, and vaccine efficacy. Notably, 98% of amino acids generating two mutations are associated with the spike protein, while the remaining 2% lack a defined protein chain.
The “Mutations” column delineates the type of genetic change occurring at a specific location in the genomic sequence. The “Protein” column identifies the protein linked to that mutation. The “Position” column specifies the precise location within the genomic sequence where the mutation is recorded. The “Original Sequence” column displays the reference genetic sequence at the affected location, while the “Mutated Sequence” column illustrates the genetic alteration. In instances where no specific protein is assigned to the sequences, information on the original sequence and the mutated sequence is unavailable.
Table 9 displays the amino acids most significantly affected by mutations, offering crucial details such as the associated protein chain, the specific position within the sequence, the original amino acid sequence, and the resulting amino acid sequence following the mutation. The findings reveal that the predominant portion of amino acids affected by mutations is within the spike protein (S), accounting for 68.29% of cases. Additionally, 14.63% of instances are associated with the nucleocapsid (N) protein, 14.63% with the non-structural protein (nsp2), and 2.45% with the non-structural protein (nsp1).
Table 10 illustrates the amino acids that harbor two mutations in various variants. The most noteworthy are E484A and P681R. Amino acid E484A has been linked to a decrease in antibody effectiveness. Furthermore, the presence of the P681R amino acid in the spike protein of the virus may impact its ability to enter human cells.
Table 11 presents a comprehensive overview of the amino acids that have undergone mutation in various variants, specifying the protein chain to which they belong. The most frequently altered amino acids among variants are D614G, N501Y, and P681H. Amino acid D614G has been linked to enhanced virus transmission capability. Moreover, amino acid N501Y has been associated with a potential increase in virus binding to host cells. Lastly, amino acid P681H, similar to the P681R mutation, affects virus infectivity by influencing its ability to enter human cells. These amino acids, as listed in Table 11, are particularly noteworthy due to their potential impact on transmission dynamics and the virus’s interaction with host cells.
The results obtained from the analysis of COVID-19 in Ecuador exhibit similarities with those reported by Wang [14] and Thanh [15], who indicated that the majority of mutations are associated with the spike protein (S). Similarly, Rozhgar’s findings [18] align with our results, establishing that the affected protein chain includes nucleocapsid (N). Furthermore, Pachetti et al.’s study [17], similar to our study, identified a set of affected amino acids, specifically those located in the protein sequence from “P” to “L”. Notably, mutations P1000L and P2046L within this sequence were prominent, being present in a high percentage of infected patients across the Omicron, Delta, Mu GH, and Alpha variants. These specific amino acids (P1000L and P2046L) underwent two significant mutations, emphasizing their association with the aforementioned virus variants and their recurrent presence in a substantial number of infected patients.
The literature review did not identify any comparable studies with the epidemiological data of COVID-19 from Ecuador.

5. Conclusions and Future Work

In this study, an in-depth analysis of COVID-19 data from Ecuador was conducted. This involved examining both the biological information from the viral variants and patient-related epidemiological data obtained through the GISAID initiative. Throughout the research, diverse data preprocessing and statistical analysis techniques were applied, including Pearson’s correlation, the chi-square test, and analysis of variance (ANOVA). Furthermore, statistical diagrams and graphs were utilized to enhance the visualization of the results.
The CRISP-DM methodology is utilized in numerous data mining projects yet exhibits notable limitations. First, it primarily focuses on the initial stages of a project, as a comprehensive grasp of the research domain is imperative for progressing to subsequent phases. Second, it lacks a robust emphasis on validating the obtained results. However, a vital advantage of the CRISP-DM methodology is its adaptability to meet the specific requirements of each project. This flexibility facilitates the integration of additional techniques, such as protein alignment and the transformation of proteins into amino acid chains, along with strategies to identify the most relevant mutations in our study. Consequently, this integration enhances both the procedural aspects and the outcomes obtained.
Clearly, this study generated significant findings by analyzing the geographic distribution of COVID-19-related variables in various provinces of Ecuador. We observed that the Omicron variant was more prevalent in a large part of the Ecuadorian territory, closely followed by the Delta variant, while the Lambda variant was present in some specific regions. In this context, we found that the Omicron, Delta, and Lambda variants affected more than 50% of female patients, while the Mu GH and Gamma variants affected more than 50% of male patients. A higher incidence of COVID-19 was observed in female patients in the provinces of Pichincha with 30.97%, Guayas with 19.36%, and Manabí with 8.75% of cases.
A high incidence among male patients was reported in the provinces of Pichincha (30.91%), Guayas (17.36%), and Manabí (8.55%), indicating that these provinces were the most affected by the disease. The findings further revealed that the age group of 26 to 35 years was the most affected by all variants, with the Delta variant being more noticeable in patients aged 31 to 35 years, while the Mu GH variant had a relevant impact on patients aged 31 to 35 years and 41 to 45 years. We also found that from January 2021 to the end of November 2021, the predominant variants in infections were Mu GH, Gamma, and Delta. However, from the beginning of December 2021 until 2023, the most prevalent variant was Omicron.
In addition, highly relevant genomic information highlighting the relationship between variants and amino acids was discovered, including the fact that amino acids associated with the S protein, nucleocapsid (N), and non-structural protein chains (nsp1) and (nsp2) were most affected by the mutations. Specifically, amino acids D614G, N501Y, P681H, E484A, and P681R were observed to have significant effects on virus transmissibility, its ability to bind to host cells, and its ability to evade the immune response. These findings underscore the importance of understanding and monitoring these mutations to develop effective strategies for both the treatment and prevention of virus infection. With this patient-related epidemiological information approach and variant data, public health institutions in Ecuador can improve their understanding of disease dynamics, comprehend the importance of disease monitoring, and formulate health safety policies to prevent the recurrence of dangerous SARS-CoV-2 variants.
For future research, we propose incorporating more information from different reputable free databases to enhance the analysis and make comparisons with existing data. This may include integrating genomic and epidemiological information from various countries to identify global or specific patterns. We also suggest considering the inclusion of additional variables, such as environmental factors and the underlying medical conditions of the patients, to achieve a deeper and more holistic understanding of the infection.
Another crucial aspect would be to conduct a detailed analysis of the mutations identified in key amino acids. This would enable the evaluation of their individual impact on the virus’s interaction with host cells, their replicative capacity, and their influence on the immune response. The use of predictive models or machine learning algorithms to forecast the evolution of variants and their impact could also offer accurate early insight into possible disease scenarios.

Author Contributions

Conceptualization, C.G. and D.C.; methodology, C.G.; software, D.C.; validation, C.G. and D.C.; formal analysis, C.G.; investigation, C.G.; resources, C.G.; data curation, C.G.; writing—original draft preparation, D.C. and H.A.-F.; writing—review and editing, H.A.-F., B.S. and J.S.; visualization, D.C.; supervision, C.G.; project administration, C.G.; funding acquisition, C.G. All authors have read and agreed to the published version of the manuscript.


This research was funded by Universidad Tecnológica Indoamérica, grant number INV-0019-01-017.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it only analyzed open-access data from the GISAID repository, and it is not applicable to studies not involving humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available on the GISAID site (accessed on 1 December 2023).


Acknowledgments

This research was conducted with the consent and support of Fundación Datalat and Universidad Tecnológica Indoamérica.

Conflicts of Interest

The authors declare no conflicts of interest.


References

  1. Gilbert, G.L.; Degeling, C.; Johnson, J. Communicable Disease Surveillance Ethics in the Age of Big Data and New Technology. Asian Bioeth. Rev. 2019, 11, 173–187.
  2. Wong, Z.S.; Zhou, J.; Zhang, Q. Artificial Intelligence for infectious disease Big Data Analytics. Infect. Dis. Health 2019, 24, 44–48.
  3. Prabhu, S.R. Infectious and Communicable Diseases: An Overview. In Textbook of General Pathology for Dental Students; Elsevier: Singapore, 2023; pp. 63–72.
  4. The International Federation of Red Cross and Red Crescent Societies (IFRC). Communicable Diseases; World Health Organization (WHO): Geneva, Switzerland, 2023; Available online (accessed on 1 December 2023).
  5. Sheahan, T.P.; Sims, A.C.; Leist, S.R.; Schäfer, A.; Won, J.; Brown, A.J.; Montgomery, S.A.; Hogg, A.; Babusis, D.; Clarke, M.O.; et al. Comparative therapeutic efficacy of remdesivir and combination lopinavir, ritonavir, and interferon beta against MERS-CoV. Nat. Commun. 2020, 11, 222.
  6. Taylor, C.; Kidgell, J. Flu-like pandemics and metaphor pre-covid: A corpus investigation. Discourse Context Media 2021, 41, 100503.
  7. Chowdhury, S.; Hossain, M.E.; Ghosh, P.K.; Ghosh, S.; Hossain, M.B.; Beard, C.; Rahman, M.; Rahman, M.Z. The Pattern of Highly Pathogenic Avian Influenza H5N1 Outbreaks in South Asia. Trop. Med. Infect. Dis. 2019, 4, 138.
  8. Nicomedes, C.J.C.; Avila, R.M.A. An analysis on the panic during COVID-19 pandemic through an online form. J. Affect. Disord. 2020, 276, 14–22.
  9. The World Health Organization (WHO). WHO COVID-19 Dashboard; World Health Organization (WHO): Geneva, Switzerland, 2023; Available online (accessed on 1 December 2023).
  10. Mathieu, E.; Ritchie, H.; Rodés-Guirao, L.; Appel, C.; Giattino, C.; Hasell, J.; Macdonald, B.; Dattani, S.; Beltekian, D.; Ortiz-Ospina, E.; et al. Coronavirus Pandemic (COVID-19). Our World in Data. 2020. Available online (accessed on 1 December 2023).
  11. Ministry of Health of Ecuador. Vacunómetro COVID-19; Ministry of Health of Ecuador: Quito, Ecuador, 2023. Available online (accessed on 1 December 2023).
  12. Ruiz-Bravo, A.; Jiménez-Valera, M. SARS-CoV-2 y pandemia de síndrome respiratorio agudo (COVID-19). Ars Pharm. (Internet) 2020, 61, 63–79.
  13. Wang, R.; Hozumi, Y.; Yin, C.; Wei, G.W. Mutations on COVID-19 diagnostic targets. Genomics 2020, 112, 5204–5213.
  14. Wang, R.; Chen, J.; Gao, K.; Wei, G.W. Vaccine-escape and fast-growing mutations in the United Kingdom, the United States, Singapore, Spain, India, and other COVID-19-devastated countries. Genomics 2021, 113, 2158–2170.
  15. Nguyen, T.T.; Pathirana, P.N.; Nguyen, T.; Nguyen, Q.V.H.; Bhatti, A.; Nguyen, D.C.; Nguyen, D.T.; Nguyen, N.D.; Creighton, D.; Abdelrazek, M. Genomic mutations and changes in protein secondary structure and solvent accessibility of SARS-CoV-2 (COVID-19 virus). Sci. Rep. 2021, 11, 3487.
  16. van Dorp, L.; Richard, D.; Tan, C.C.; Shaw, L.P.; Acman, M.; Balloux, F. No evidence for increased transmissibility from recurrent mutations in SARS-CoV-2. Nat. Commun. 2020, 11, 5986.
  17. Pachetti, M.; Marini, B.; Benedetti, F.; Giudici, F.; Mauro, E.; Storici, P.; Masciovecchio, C.; Angeletti, S.; Ciccozzi, M.; Gallo, R.C.; et al. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant. J. Transl. Med. 2020, 18, 179.
  18. Khailany, R.A.; Safdar, M.; Ozaslan, M. Genomic characterization of a novel SARS-CoV-2. Gene Rep. 2020, 19, 100682.
  19. Ahmad, S.U.; Kiani, B.H.; Abrar, M.; Jan, Z.; Zafar, I.; Ali, Y.; Alanazi, A.M.; Malik, A.; Rather, M.A.; Ahmad, A.; et al. A comprehensive genomic study, mutation screening, phylogenetic and statistical analysis of SARS-CoV-2 and its variant omicron among different countries. J. Infect. Public Health 2022, 15, 878–891.
  20. Zekri, A.R.N.; Bahnasy, A.A.; Hafez, M.M.; Hassan, Z.K.; Ahmed, O.S.; Soliman, H.K.; El-Sisi, E.R.; Dine, M.H.E.; Solimane, M.S.; Latife, L.S.; et al. Characterization of the SARS-CoV-2 genomes in Egypt in first and second waves of infection. Sci. Rep. 2021, 11, 21632.
  21. Abraham, P.; Lopez Martinez, I.; Lin Tzer Pin, R. The GISAID Data Science Initiative. Available online (accessed on 1 December 2023).
  22. Solano, J.A.; Cuesta, D.J.L.; Ibáñez, S.F.U.; Coronado-Hernández, J.R. Predictive models assessment based on CRISP-DM methodology for students performance in Colombia—Saber 11 Test. Procedia Comput. Sci. 2022, 198, 512–517.
  23. Huber, S.; Wiemer, H.; Schneider, D.; Ihlenfeldt, S. DMME: Data mining methodology for engineering applications—A holistic extension to the CRISP-DM model. Procedia CIRP 2019, 79, 403–408.
  24. Sharma, V.; Sharma, R.K. Application of Taguchi Method and ANOVA in Parameters Optimization for Fluidization Characteristic of Pine Needles in Fluidized Bed. In Lecture Notes in Mechanical Engineering; Springer: Singapore, 2021; pp. 869–878.
  25. Bahassine, S.; Madani, A.; Al-Sarem, M.; Kissi, M. Feature selection using an improved Chi-square for Arabic text classification. J. King Saud Univ. Comput. Inf. Sci. 2020, 32, 225–231.
  26. Wang, Z.; Tan, J.; Long, Y.; Liu, Y.; Lei, W.; Cai, J.; Yang, Y.; Liu, Z. SaAlign: Multiple DNA/RNA sequence alignment and phylogenetic tree construction tool for ultra-large datasets and ultra-long sequences based on suffix array. Comput. Struct. Biotechnol. J. 2022, 20, 1487–1493.
  27. Reddy, B.; Fields, R. Performance Analysis of Multiple Sequence Alignment Tools. In Proceedings of the 2024 ACM Southeast Conference, Marietta, GA, USA, 18–20 April 2024; pp. 167–174.
  28. Karpievitch, Y.V.; Nikolic, S.B.; Wilson, R.; Sharman, J.E.; Edwards, L.M. Metabolomics data normalization with EigenMS. PLoS ONE 2014, 9, e116221.
  29. Veen, L.E.; Hoekstra, A.G. Easing multiscale model design and coupling with MUSCLE 3. In Proceedings of the International Conference on Computational Science, Amsterdam, The Netherlands, 3–5 June 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 425–438.
Figure 1. Stages of the CRISP-DM methodology [23].
Figure 2. SARS-CoV-2 variants by province in Ecuador.
Figure 3. Chronology of the spread of SARS-CoV-2 variants.
Figure 4. SARS-CoV-2 variants by patient gender.
Figure 5. Infected persons by gender and province.
Figure 6. Infected patients by age and SARS-CoV-2 variant.
Figure 7. Higher percentage of infected patients by age and variant.
Table 1. Database variables, COVID-19, Ecuador.
| Type of Data | Name | Type | Precision | Example |
|---|---|---|---|---|
| Genetic sequence | Protein chain | Text | Characters | [——–accaaccaactctaa.....] |
| Patient | Patient code | Text | Characters | EPI_ISL_10137512 |
| | Length protein chain | Numeric | Integer | 29,557 |
| | Latitude | Numeric | Integer with five decimal | 0.35987 |
| | Longitude | Numeric | Integer with five decimal | 78.12825 |
| | Virus code | Text | Characters | VOI GRA |
| | Sampling date | Text | Characters | 21 February 2022 |
Table 2. Pearson correlation.
| Age | 1 | 0.31 | 0.85 |
| Gender | 0.31 | 1 | 0.54 |
Table 3. Chi-square test.
Table 4. ANOVA method.
Table 5. Variants of SARS-CoV-2 across Ecuadorian provinces.
| Variant | Province | Number of Infected | Infected (%) |
|---|---|---|---|
| | Santo Domingo de los Tsachilas | 125 | 2.11 |
| | Province with less than 2% | 620 | 10.46 |
| | Total | 5927 | 100.00 |
| | El Oro | 186 | 13.29 |
| | Santo Domingo de los Tsachilas | 84 | 6.00 |
| | Los Rios | 41 | 2.93 |
| | Province with less than 2% | 160 | 11.43 |
| | Total | 1400 | 100.00 |
| Ex Voi Mu | Pichincha | 198 | 31.83 |
| | Santo Domingo de los Tsachilas | 34 | 5.47 |
| | El Oro | 28 | 4.50 |
| | Los Rios | 18 | 2.89 |
| | Morona Santiago | 15 | 2.41 |
| | Province with less than 2% | 45 | 7.24 |
| | El Oro | 25 | 6.55 |
| | Santo Domingo de los Tsachilas | 12 | 3.14 |
| | Morona Santiago | 11 | 2.88 |
| | Province with less than 2% | 31 | 8.12 |
| | Total | 382 | 100.00 |
| Ex Voi Lambda | Guayas | 69 | 20.06 |
| | El Oro | 65 | 18.90 |
| | Los Rios | 12 | 3.49 |
| | Morona Santiago | 18 | 5.23 |
| | Province with less than 2% | 29 | 8.43 |
| | Total | 344 | 100.00 |
| | Santo Domingo de los Tsachilas | 17 | 5.38 |
| | Los Rios | 15 | 4.75 |
| | Morona Santiago | 9 | 2.85 |
| | Province with less than 2% | 42 | 13.29 |
| | Total | 316 | 100.00 |
Table 6. Infection numbers and rates in Ecuadorian provinces grouped by SARS-CoV-2 variants.
| Province | Variant | Number of Infected | Infected (%) |
|---|---|---|---|
| | Ex Voi Mu | 198 | 7.12 |
| | Total | 2782 | 100.00 |
| | Ex Voi Mu | 91 | 5.48 |
| | Total | 1662 | 100.00 |
| | Ex Voi Mu | 41 | 5.26 |
| | Ex Voi Mu | 87 | 17.51 |
| | Ex Voi Mu | 17 | 3.46 |
| | Ex Voi Mu | 21 | 4.70 |
| El Oro | Delta | 186 | 53.30 |
| | Ex Voi Mu | 28 | 8.02 |
| Santo Domingo de los Tsachilas | Ex Voi Mu | 34 | 12.27 |
| | Ex Voi Mu | 11 | 4.47 |
| | Ex Voi Mu | 14 | 6.97 |
| | Ex Voi Mu | 4 | 2.16 |
| | Ex Voi Mu | 4 | 2.60 |
| Los Rios | Omicron | 62 | 40.52 |
| | Ex Voi Mu | 18 | 11.76 |
| | Ex Voi Mu | 6 | 4.05 |
| | Ex Voi Mu | 10 | 7.69 |
| | Ex Voi Mu | 13 | 11.82 |
| Morona Santiago | Lambda | 18 | 21.43 |
| | Ex Voi Mu | 15 | 17.86 |
| | Ex Voi Mu | 2 | 2.44 |
| | Ex Voi Mu | 0 | 0.00 |
| Provinces with less than 50 cases | Ex Voi Mu | 8 | 5.10 |
Table 7. Percentage of infected patients by age and variant.
| Variant | 26–30 Years (%) | 31–35 Years (%) |
|---|---|---|
| Mu GH | 15.11 | 19.45 |
Table 8. Amino acids affected by two mutations.
| Mutation | Protein | Position | Original Sequence | Mutated Sequence |
|---|---|---|---|---|
| DEL144/144 | Not Assigned | 144 | - | - |
| E1264D | | 1264 | Glu (E) | Asp (D) |
| G662S | Spike (S) | 662 | Gly (G) | Ser (S) |
| I1566V | | 1566 | Ile (I) | Val (V) |
| I2230T | | 2230 | Ile (I) | Thr (T) |
| K3353R | | 3353 | Lys (K) | Arg (R) |
| P1000L | | 1000 | Pro (P) | Leu (L) |
| P2046L | | 2046 | Pro (P) | Leu (L) |
| P2287S | | 2287 | Pro (P) | Ser (S) |
| P3395H | | 3395 | Pro (P) | His (H) |
| S1188L | | 1188 | Ser (S) | Leu (L) |
| T1001I | | 1001 | Thr (T) | Ile (I) |
| T265I | | 265 | Thr (T) | Ile (I) |
| T3255I | | 3255 | Thr (T) | Ile (I) |
| T3646A | | 3646 | Thr (T) | Ala (A) |
| V2930L | | 2930 | Val (V) | Leu (L) |
| A1306S | | 1306 | Ala (A) | Ser (S) |
| A1708D | | 1708 | Ala (A) | Asp (D) |
| A1918V | | 1918 | Ala (A) | Val (V) |
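Table 9. Amino acids affected by a mutation.
| Mutation | Protein | Position | Original Sequence | Mutated Sequence |
|---|---|---|---|---|
| D614G | Spike (S) | 614 | Asp (D) | Gly (G) |
| S84L | | 84 | Ser (S) | Leu (L) |
| P681H | | 681 | Pro (P) | His (H) |
| H655Y | | 655 | His (H) | Tyr (Y) |
| N679K | | 679 | Asn (N) | Lys (K) |
| H69X | | 69 | His (H) | Insertion X |
| D796Y | | 796 | Asp (D) | Tyr (Y) |
| V70X | | 70 | Val (V) | Insertion X |
| T478K | | 478 | Thr (T) | Lys (K) |
| A63T | | 63 | Ala (A) | Thr (T) |
| N501Y | | 501 | Asn (N) | Tyr (Y) |
| S375F | | 375 | Ser (S) | Phe (F) |
| S373P | | 373 | Ser (S) | Pro (P) |
| G339D | | 339 | Gly (G) | Asp (D) |
| T223I | | 223 | Thr (T) | Ile (I) |
| S413R | | 413 | Ser (S) | Arg (R) |
| Y505H | | 505 | Tyr (Y) | His (H) |
| Q498R | | 498 | Gln (Q) | Arg (R) |
| E484A | | 484 | Glu (E) | Ala (A) |
| S477N | | 477 | Ser (S) | Asn (N) |
| T376A | | 376 | Thr (T) | Ala (A) |
| K417N | | 417 | Lys (K) | Asn (N) |
| S371F | | 371 | Ser (S) | Phe (F) |
| D405N | | 405 | Asp (D) | Asn (N) |
| L452R | | 452 | Leu (L) | Arg (R) |
| R408S | | 408 | Arg (R) | Ser (S) |
| N440K | | 440 | Asn (N) | Lys (K) |
| R203K | Nucleocapsid (N) | 203 | Arg (R) | Lys (K) |
| G204R | | 204 | Gly (G) | Arg (R) |
| N969K | | 969 | Asn (N) | Lys (K) |
| Q954H | | 954 | Gln (Q) | His (H) |
| N764K | | 764 | Asn (N) | Lys (K) |
| T9I | | 9 | Thr (T) | Ile (I) |
| Q19E | Non-structural (nsp2) | 19 | Gln (Q) | Glu (E) |
| G142D | | 142 | Gly (G) | Asp (D) |
| T19I | | 19 | Thr (T) | Ile (I) |
| Y144X | | 144 | Tyr (Y) | Insertion X |
| T95I | | 95 | Thr (T) | Ile (I) |
| P13L | Non-structural (nsp1) | 13 | Pro (P) | Leu (L) |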
Table 10. Amino acids affected by two mutations in different variants.
| Amino Acid | Variants | Proteins |
|---|---|---|
| T265I | Omicron, Delta, Lambda, Mu GH, Gamma, Alpha | Spike (S) |
| P3395H | Omicron, Delta, Lambda, Mu GH, Alpha | Spike (S) |
| S1188L | Delta, Lambda, Mu GH, Gamma, Alpha | Spike (S) |
| A1306S | Omicron, Lambda, Mu GH, Alpha | Spike (S) |
| A1708D | Omicron, Lambda, Mu GH, Alpha | Spike (S) |
| A1918V | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| DEL144/144 | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| DEL157/158 | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| DEL241/243 | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| DEL25/27 | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| DEL69/70 | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| E1264D | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| G662S | Omicron, Mu GH, Alpha | Spike (S) |
| I1566V | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| I2230T | Omicron, Lambda, Mu GH, Alpha | Non-structural (nsp2) |
| K3353R | Omicron, Lambda, Mu GH, Alpha | Non-structural (nsp2) |
| P1000L | Omicron, Delta, Mu GH, Alpha | Spike (S) |
| P2287S | Omicron, Lambda, Mu GH, Alpha | Non-structural (nsp2) |
| T1001I | Lambda, Mu GH, Gamma, Alpha | Spike (S) |
| T3255I | Omicron, Delta, Lambda, Mu GH | Non-structural (nsp2) |
| T3646A | Omicron, Delta, Lambda, Mu GH, Alpha | Non-structural (nsp2) |
| V2930L | Omicron, Lambda, Mu GH, Alpha | Non-structural (nsp2) |
| L24S | Mu GH, Alpha | c |
| P26S | Mu GH, Alpha | Non-structural (nsp1) |
| P681R | Mu GH, Alpha | Spike (S) |
| T19R | Mu GH, Alpha | Spike (S) |
| E484A | Mu GH | Nucleocapsid (N) |
| K1655N | Gamma | Nucleocapsid (N) |
| K1795Q | Gamma | Nucleocapsid (N) |
| L18F | Gamma | Spike (S) |
| R346T | Mu GH | Nucleocapsid (N) |
| T19I | Mu GH | Non-structural (nsp1) |
| T20N | Gamma | Spike (S) |
| Y144X | Mu GH | Spike (S) |
Table 11. Amino acids affected by a mutation in different variants.
| Number | Amino Acid | Variants | Proteins |
|---|---|---|---|
| 1 | D614G | Alpha, Mu GH, Lambda, Delta, Omicron | Spike (S) |
| 2 | N501Y | Alpha, Mu GH, Omicron | |
| 3 | P681H | Alpha, Mu GH, Omicron | |
| 4 | DEL31/33 | Alpha, Mu GH, Lambda, Delta | |
| 5 | S84L | Alpha, Mu GH, Lambda, Delta, Omicron | |
| 6 | R203K | Alpha, Mu GH, Lambda, Omicron | |
| 7 | G204R | Alpha, Mu GH, Lambda, Omicron | |
| 8 | T40I | Alpha, Lambda, Delta, Omicron | Not defined |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Guevara, C.; Coronel, D.; Salazar, B.; Salazar, J.; Arias-Flores, H. Analysis of the Spread and Evolution of COVID-19 Mutations in Ecuador Using Open Data. Life 2024, 14, 735.
