Usefulness of Vaccine Adverse Event Reporting System for Machine-Learning Based Vaccine Research: A Case Study for COVID-19 Vaccines

The usefulness of Vaccine Adverse Event Reporting System (VAERS) data and the protocols required for statistical analyses were pinpointed, together with a set of recommendations for applying machine learning modeling or exploratory analyses to VAERS data, using a case study of COVID-19 vaccines (Pfizer-BioNTech, Moderna, Janssen). A total of 262,454 duplicate reports (29%) were identified among 905,976 reports and merged, yielding a total of 643,522 distinct reports. A customized online survey was also conducted, providing 211 reports. The 20 most frequently reported adverse events (AEs) were first identified. Differences in results were noticed after applying various machine learning algorithms (association rule mining, self-organizing maps, hierarchical clustering, bipartite graphs) to VAERS data. Moderna reports showed injection-site-related AEs at frequencies higher by 15.2%, consistent with the online survey (12% higher reporting rate for pain in the muscle for Moderna compared to Pfizer-BioNTech). The AEs {headache, pyrexia, fatigue, chills, pain, dizziness} constituted >50% of the total reports. Chest pain in reports for male children was 295% higher than in reports for female children. Among reported allergies, penicillin and sulfa had the highest frequencies (22% and 19%, respectively). Analysis of uncleaned VAERS data demonstrated major differences from the above (7% variations). Spelling/grammatical mistakes in allergies were discovered (e.g., ~14% of reports with incorrect spellings for penicillin).


Introduction
VAERS, an online passive reporting system co-sponsored by the US Centers for Disease Control and Prevention (CDC) and the Food and Drug Administration (FDA), agencies of the US Department of Health and Human Services (HHS), is specifically geared towards assessing the safety of newly developed vaccines, along with other priorities that include: (i) the detection of new, unusual, or rare vaccine adverse events, (ii) the monitoring of the increase in known events, (iii) the identification of potential risk factors for particular types of adverse events (AEs), (iv) the determination of possible reporting clusters, (v) the recognition of persistent safe-use problems, and (vi) the provision of national safety monitoring to public health emergencies, such as a large-scale pandemic influenza vaccination program [1][2][3]. Due to its spontaneous reporting nature, VAERS data is not recommended for discerning were incorporated to: (i) identify the frequently reported AEs after COVID-19 vaccines, (ii) assess their correlations with respect to various demographics (age groups, gender, and allergies), and (iii) provide a baseline decision support for predictive capability when deidentified data become available from regulatory agencies as well as the vaccine producers. Such analysis can be useful because the proportion of reports involving specific AEs and a given vaccine can be compared to the proportion of reports involving the same AEs and other vaccines [2]. Figure 1a,b shows the relative frequencies of the 20 most-reported AEs for all age groups across the three vaccine manufacturers and for children up to and including 15 years of age, respectively. AEs for each vaccine manufacturer were largely consistent. There were 13 AEs {arthralgia, asthenia, chills, dizziness, dyspnoea, fatigue, headache, injection site pain, myalgia, nausea, pain, pain in extremity, pyrexia} that were common to all three vaccine manufacturers.
Survey data also reported {headache, aches, chills, pain in muscle, dizziness, nausea, vomiting, and rash} to be the most commonly reported AEs (Table 1). The subset {chest pain, dyspnoea, hyperhidrosis, and myocarditis} was among the lowest-reported AEs for the age group 5-11 years in comparison to the AEs reported for the group 12-15 and the other 16 most commonly reported AEs. Interestingly, four injection-site-related AEs {injection site-(erythema, pruritus, swelling, warmth)} were among the top 20 AEs for Moderna (p-value < 2.2 × 10^−16 for Moderna vs. {Pfizer-BioNTech, Janssen} with respect to the top 20 AEs including injection-site-related AEs). Survey data also showed 51% of the samples for Moderna with pain in muscle, as opposed to only 39% of the samples for Pfizer-BioNTech (Table 1). This may be because Moderna uses a 100-microgram dose as opposed to the 30-microgram dose used by Pfizer-BioNTech, causing increased reactogenicity [23]. Additionally, although the etiology of delayed large local reactions due to Moderna is unclear, a delayed-type hypersensitivity reaction to the excipient polyethylene glycol can be a potential etiology [24]. The same visual exploration without duplicate-row removal (Supplementary Materials-Figures S1 and S2) showed relative frequencies of the above 20 AEs to be lower by up to 7% (Supplementary Materials-Figure S1) than the frequencies observed in the cleaned data (Figure 1a). Similarly, the AEs for children showed differences of up to 4% (Figure 1b and Supplementary Materials-Figure S2).

Results
The subset {dizziness, pyrexia, headache, nausea, vomiting, fatigue, dyspnoea, pain, pain in extremity, chills, rash} was common to the reports for all age groups (adults and children, Figure 1a) and the reports for children alone (Figure 1b). None of the injection-site-related AEs {injection site-(pain, erythema, swelling, warmth)} was among the highly reported AEs in children's reports. Additionally, {arthralgia, asthenia, myalgia, pruritus, erythema} appeared only in the 20 most reported AEs for adults, whereas {chest pain, syncope, loss of consciousness, pallor, hyperhidrosis, urticaria, fall, unresponsive to stimuli, myocarditis} were reported only among children. The above differences may arise because the Pfizer-BioNTech dose for children is only 10 micrograms compared to 30 micrograms for adults.
As given in the heatmap in Table 2, all 20 highest-reported AEs were the same for both genders among children, although in a different order based on their percentages. An important pattern in children's VAERS reports was that chest pain was reported at a rate 3 times higher in male reports than in female reports (chi-squared test p-value: 5.74 × 10^−62). Based on the number of occurrences, vomiting was ranked as the top effect for female children as opposed to the 2nd for male children (p-value: 0.56, indicating no significant correlation of vomiting with gender). It is noted, however, that for the VAERS dataset with duplicates, headache appeared as the top-ranked effect for female children (Table S3) as opposed to 5th-ranked for male children (p-value: 0.61). Additionally, injection site pain ranked one level higher for female (6th) than for male children (7th), with p-value: 0.59 (i.e., no significant correlation of injection site pain with gender). Other correlation tests with p-values are {Dizziness: 3.8 × 10^−14, Pyrexia: 8.21 × 10^−6, Fatigue: 4 × 10^−3, Nausea: 4 × 10^−4, Pain in Extremity: 0.77, Rash: 0.88, Pain: 0.075, Chest Pain: 5.74 × 10^−62}. It is also noted that the number of female VAERS COVID-19 reports is higher than that of male reports, which is consistent with other VAERS vaccine ratios (e.g., the flu vaccine for 2021 had 5222 female and 2375 male reports). When grouped into clusters via an unsupervised HC approach, male children and young adults (i.e., age groups up to and including 18 years) were clustered in one group (i.e., Cluster III with blue dendrogram), as shown in Figure 2. For male children, {dizziness, headache, pyrexia} were grouped in the same cluster (Cluster II), with {nausea, vomiting} in the adjacent cluster (Cluster III), consistent with the grouping provided in Table 2. Furthermore, {fatigue, chills, pain} for male children were clustered in Cluster I.
Interestingly, the HC approach demonstrated tolerance in grouping datasets with and without duplicates, as no difference between Figures 2 and S3 was observed. Overall, VAERS reports for male participants in Clusters I and II {fatigue, chills, pain, dizziness, headache, pyrexia} were found to be of the highest percentage, as confirmed in Table 2. Similarly, because {dizziness, headache, nausea, pyrexia} were reported more commonly in the age groups 12-15 and 16-18 for female participants, these groups were placed in the same cluster, as shown in Figure 2, while the 5-11 group was placed in an adjacent cluster. Consistent with Table 2, injection-site-related AEs in female and male children were grouped in Clusters I and IV with lower reporting percentages in Figures 2 and S3, respectively. It is noted that, despite comprehensive data preprocessing steps, reports submitted through VAERS have not undergone data-quality assurance/control strategies, thus posing challenges for the verification of the analysis. To overcome the challenge of the uncertainty and reliability of the VAERS reports and confirm the AE similarities, an exploration of the online survey data was also conducted (Figure 3).
As illustrated in Table 1 and Figure 3, from a set of 11 AEs compiled from 211 participants, {headache, chills, dizziness, nausea, itchy skin/rash, vomiting} also appeared in the 20 most reported AEs in the VAERS reports.
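The gender comparisons reported above rest on chi-squared tests. As a minimal sketch of that kind of test, the snippet below computes a Pearson chi-squared test of independence for a 2x2 table (gender by presence of an AE) using only the Python standard library. The function name `chi2_2x2` and the counts are hypothetical and are not taken from VAERS; the study itself cites Python stats.chisquare, so this is an illustrative stand-in rather than the authors' procedure.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test of independence for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value) without continuity correction.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts, NOT taken from VAERS: rows are gender, columns are
# (reports listing the AE, reports not listing the AE).
male_with, male_without = 300, 4200
female_with, female_without = 100, 4400

stat, p = chi2_2x2(male_with, male_without, female_with, female_without)
print(f"chi2 = {stat:.1f}, p = {p:.3g}")
```

A small p-value indicates that the AE is reported at different rates for the two genders; a large one (such as the 0.56 reported for vomiting) gives no evidence of a difference.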

Associations of the Most Commonly Reported AEs via ARM and SOM
The interrelationships of AEs from VAERS reports were analyzed via ARM and SOM with respect to two major age groups [<16, ≥16]. Assessment of the interrelationships of AEs for children revealed 16 non-redundant association rules (ARs) (Table 3). From a subset of one-to-one rules, the existence of hyperhidrosis or flushing was shown to imply the existence of dizziness with a lift of over 3 (R2,10). Chest pain was prominent as a consequent, with dependency on the subset {Electrocardiogram ST segment elevation, Chest X-ray normal, Echocardiogram normal, Myocarditis, Electrocardiogram normal, C-reactive protein increased, Troponin increased} with a lift value of >8 (R3,5,7-9,11,13). Additionally, hyperhidrosis was associated with flushing with a high lift value of 18.8 (R9). Although fatigue appeared among the top 6 AEs for children based on its individual frequency, its correlation with any other AE did not qualify it for the top 14 ARs (Table 3 and Figure 1).
Table 3. Non-redundant association rules for post-COVID-19 vaccine AEs reported in VAERS reports for children, based on cleaned data with duplicate rows merged. Rule 14 was the only non-redundant many-to-one rule identified for children. The highlighted regions in gray represent a subset of rules with relatively high counts in the dataset (>200) and include {dizziness, hyperhidrosis, syncope, unresponsive to stimuli, pyrexia, chills, myocarditis}, which were also among the 20 most commonly reported AEs in children when explored based on their individual frequencies.
The ARM employs a frequentist approach to calculate the Support, Confidence, and Lift (Supplementary Materials-Section S2.2.1), for which duplicate reports can pose a significant challenge. Therefore, a new report or reports with spelling/grammar mistakes can impact the generality and specificity of the ARs, impacting such analysis with duplicates present in the VAERS data.
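To make the sensitivity of these frequentist measures to duplicates concrete, the following is a minimal sketch that computes support, confidence, and lift for a single rule using their standard definitions, on a toy set of reports and on the same set with one report repeated. The function name `rule_metrics` and the toy reports are hypothetical; the study itself used the R arules package, not this code.

```python
def rule_metrics(reports, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    `reports` is a list of sets of AE names; antecedent and consequent are sets.
    """
    n = len(reports)
    n_ante = sum(1 for r in reports if antecedent <= r)
    n_cons = sum(1 for r in reports if consequent <= r)
    n_both = sum(1 for r in reports if (antecedent | consequent) <= r)
    if n == 0 or n_ante == 0 or n_cons == 0:
        raise ValueError("rule items must occur in at least one report")
    support = n_both / n             # share of reports containing both sides
    confidence = n_both / n_ante     # share of antecedent reports that also contain the consequent
    lift = confidence / (n_cons / n) # confidence relative to the consequent's base rate
    return support, confidence, lift

# Toy reports (hypothetical, not VAERS records).
clean = [
    {"chills", "pyrexia"},
    {"chills", "pyrexia", "headache"},
    {"headache"},
    {"fatigue"},
]
# The same data with the headache-only report submitted two more times.
with_duplicates = clean + [{"headache"}, {"headache"}]

rule = ({"chills"}, {"pyrexia"})
print("clean:          ", rule_metrics(clean, *rule))
print("with duplicates:", rule_metrics(with_duplicates, *rule))
```

In this toy case the duplicated report mentions neither side of the rule, yet it lowers the rule's support and raises its lift, because both measures depend on the total number of reports.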
As illustrated in Tables S4-S7 (Supplementary Materials), the ARs for children and each vaccine producer with duplicates present differed significantly from those identified when duplicates were removed (Tables 3 and 4). For example, R6 (Echocardiogram normal → Troponin increased) among the non-redundant ARs for children (Table 3), which demonstrated the highest lift value of 16.6, was not identified as a non-redundant AR in Table S4 (Supplementary Materials). Additionally, seven rules (R3,5,7-9,11,13) reported chest pain in the consequent for the cleaned VAERS data for children (Table 3), whereas none of the ARs in Table S4 (Supplementary Materials) reported chest pain in the consequent for the VAERS data with duplicates. The ARs (R4,10,12,14), when verified via SOM in Figure 4, demonstrate the relationships {{Unresponsive to stimuli → Syncope}, {Hyperhidrosis → Dizziness}, {Chills → Pyrexia}, {Headache, pain → Pyrexia}}. However, SOM may also suffer from misleading correlations from uncleaned VAERS data due to the iterative nature of 2D-map refinement.

Analysis of the ARs for the AEs of all age groups was also conducted for the three vaccine types (Table 4). In the set of ARs for Pfizer-BioNTech, headache appeared in the consequent of 10 ARs, with {chills, myalgia, pyrexia, pain, fatigue, nausea} in antecedents with count values > 3000 (R5,6,8,10-16). Although the above distributions appeared to be dispersed without demonstrating a discernible pattern (Figure 4), the overall distributions showed similarities in the SOM component planes. However, with duplicates present, only two ARs (R19,20) had headache in the consequent (Supplementary Materials, Table S5), due to the fact that the entries for headache were distributed with duplicates, increasing the frequency with which headache appeared. Another observation (Table 4) Figure S6) to validate the existence of correlations among these AEs as indicated by the rules (R1,2,4,5,7,8,15,16) in Table 4. The similarity between AEs as represented by the 2D SOM is indicative of the coexistence of their correlations (i.e., the existence of a base AE implies the existence of another AE as given by the distributions on a 2D map).
Table 4. Non-redundant association rules for AEs reported in VAERS reports for the three vaccines.
Non-redundant association rules for post-COVID-19 vaccine AEs reported in VAERS reports for the Pfizer-BioNTech vaccine. Rules 4-16 were non-redundant many-to-one rules identified for Pfizer-BioNTech. The highlighted regions in gray represent the subset of rules with relatively high counts in the dataset (>6000). The rules below include {pyrexia, fatigue, headache, nausea, vomiting, chills, pain, myalgia}, which were also among the 20 most commonly reported AEs in VAERS reports for Pfizer-BioNTech when explored based on their individual frequencies.
ARs for Janssen (Table 4) showed that 6 of 14 ARs reported headache in the consequent, with {fatigue, pain, pyrexia, chills, myalgia, nausea} appearing in the antecedent (Supplementary Materials-Figure S7). In contrast with Pfizer-BioNTech and Moderna, where the AR with the highest count was chills → pyrexia, for Janssen pyrexia → headache had the highest count of 6574 (R10). Another noteworthy AR for Janssen was R2 {decreased appetite → fatigue}, with a count of 609. The AR R10 was also demonstrated with the help of SOM (Supplementary Materials-Figure S7), showing similarity for pyrexia and headache, despite the lack of definitive clusters in the SOM.

Interrelations of Vaccine AEs via Bipartite Graphs
The interrelationships between the 20 most commonly reported AEs and the three vaccines were also interrogated via bipartite graphs [25][26][27] (Figures 5, S8 and S9). As shown in Figure 5, headache was most often reported for all 3 vaccines, with a relative occurrence of 11%. The injection-site-related AEs {injection site (erythema, pain, swelling)} are of a higher relative percentage for Moderna (5%, 6%, and 4%) compared to those for Pfizer-BioNTech (1%, 4%, and 1%) and Janssen (1%, 3%, and 1%). The relationships of allergies with the 20 most-reported AEs in Figure 5e showed penicillin and sulfa to have the highest occurrences, with 22% and 19% frequency, respectively. In the same figure, penicillin and sulfa appear to be uniformly distributed among all 20 AEs, with headache, fatigue, and pyrexia having the highest percentages. Additionally, gluten, listed in 3% of VAERS reports, demonstrated a correlation with fatigue in 11% of the data after the cleaning and pre-processing steps. Such a percentage suggests that the AR of gluten with fatigue may be supported with a higher level of confidence than the AR of sulfa with fatigue. Studies have reported that a significant percentage (31%) of patients with a self-reported gluten sensitivity had a lack of energy (third-highest symptom). Reports with non-coeliac gluten sensitivity also appear to correlate with {depression, anxiety, headache, fatigue, reflux, and irritable bowel syndrome} [25]. One study found that 82% of those newly diagnosed with coeliac disease complained of fatigue. Limited literature also indicates that fatigue can potentially be caused by malnutrition, induced by intestinal damage causing malabsorption of nutrients [26]. Fatigue can also be caused by anemia, which frequently appears in patients with coeliac disease [27]. It is noted that a significant percentage of the VAERS data, in which 5 distinct symptoms were reported as 5 attributes in free-form text, contained spelling mistakes.
For example, penicillin was reported with various spelling variations such as {penecellin, penecillin, penecilin}, and sulfates were reported as {sulfa, sulpha, sulfides, sulfite, sulfate}. Another notable inconsistency in the present analysis was the interchangeable use of the words "vaccination site" and "injection site", such as vaccination site {pain, mass, induration, swelling, warmth, inflammation} and injection site {pain, mass, induration, swelling, warmth, inflammation}. The words "vaccination site" were replaced with "injection site" for consistency.
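The normalization described above can be sketched as follows. This is a minimal illustration, assuming a small hand-made vocabulary and a similarity cutoff of 0.75, both of which are hypothetical choices and not values from the study; the function names are likewise illustrative. It uses `difflib.get_close_matches` from the Python standard library for fuzzy matching and a regular expression for the "vaccination site" replacement.

```python
import difflib
import re

# Hypothetical canonical vocabulary; a real one would come from a curated list.
CANONICAL_ALLERGIES = ["penicillin", "sulfa", "gluten", "latex"]

def normalize_allergy(term, vocabulary=CANONICAL_ALLERGIES, cutoff=0.75):
    """Map a free-text allergy term to its closest canonical spelling.

    Terms without a sufficiently close match are returned lower-cased and
    stripped, so they can be reviewed by hand rather than silently merged.
    """
    cleaned = term.strip().lower()
    match = difflib.get_close_matches(cleaned, vocabulary, n=1, cutoff=cutoff)
    return match[0] if match else cleaned

def normalize_site(ae_name):
    """Rewrite 'vaccination site ...' AE names as 'injection site ...'."""
    return re.sub(r"^vaccination site\b", "injection site",
                  ae_name.strip(), flags=re.IGNORECASE)

for raw in ["Penecillin ", "penecellin", "shellfish"]:
    print(repr(raw), "->", normalize_allergy(raw))
print(normalize_site("Vaccination site pain"))
```

The cutoff controls how aggressively variants are merged; terms that fall below it are left unchanged for manual review, which is the safer default when distinct substances have similar names.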

Discussion
The usefulness of the VAERS data for the statistical analysis of vaccines was illustrated with the help of a case study for COVID-19 vaccine data. It was emphasized that the specific reporting format of the VAERS online submission portal, its passive nature, and its accessibility to the public can have an impact on any machine-learning (ML)/data-mining approach when careful data preprocessing steps are omitted (i.e., removing/merging duplicates in VAERS, discretizing numeric attributes, handling missing values, and fixing spelling/grammar errors). With the help of these data provenance and preprocessing techniques, it is hoped that vaccine research and development can utilize and streamline the protocols when ML techniques are applied to VAERS data. The present study proposes a set of recommendations, supported by the application of various ML algorithms, that are critical to applying modeling approaches to or exploratory analyses of VAERS data. An online survey was also conducted, providing 211 distinct reports of the COVID-19 postvaccination effects from participants in the US. Various useful data preprocessing/cleaning techniques were pinpointed, which should be considered to be part of VAERS.
It is noted that, although models of various types have been developed for different vaccine reports based on exploratory data analyses and the application of ML techniques on VAERS data [4,6,7,9-13,15,16,20,28], the model development for evolving VAERS data can be exposed to unseen situations that would be available neither for model training nor for validation. Despite the anticipated outcome from the ML perspective, the monitoring and testing strategies should be carefully implemented. Studies utilizing VAERS data for vaccine safety based on ML techniques require the following best practices.

Flexibility for Model Features
Data and model-feature provenance strategies should be documented, including feature definitions, data ranges, meta-level requirements, and privacy controls. Structure of the developed ML model should be made flexible for new feature addition and updates to existing features.

Robust Model-Development Pipelines
Model development for vaccine AE identification and predictive capability should be reviewed, tested, and updated for the continuous refinement of existing workflows. Modularity in terms of model applicability on all or selected slices of data should be accomplished through a robust development pipeline, and model parameters should be tuned upon the availability of new data.

ML Model Verification
In order to enhance model applicability and reproducibility, validation (via unit, system, and integration testing) should be ensured before deployment into the production environment or before any policy or recommendation is proposed. Appropriate model maintenance and documentation strategies should be implemented, and transparency in terms of step-by-step debugging (on single data instances) should be demonstrated.

Model Stability and Efficiency
Model efficiency should be carefully evaluated via robust tests to ensure the reasonable use of computational resources in order to provide accurate predictions. Such tests can be based on model-training speed, use of RAM, and throughput in a real-time learning environment. Additionally, automation test cases can be developed to verify model prediction accuracy and stability (in terms of predictive accuracy) over time, as well as latency issues.
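As one possible way to set up such tests, the sketch below measures wall-clock time, peak memory, and throughput for a single training call using `time.perf_counter` and `tracemalloc` from the Python standard library. The helper `profile_call` and the stand-in function `toy_train` are hypothetical and only illustrate the measurement pattern; a real pipeline would substitute its own training routine and thresholds.

```python
import time
import tracemalloc

def profile_call(func, *args, **kwargs):
    """Run func and return (result, stats) with elapsed seconds and peak traced bytes."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if func raises.
        tracemalloc.stop()
    return result, {"seconds": elapsed, "peak_bytes": peak_bytes}

def toy_train(n_rows):
    """Hypothetical stand-in for model training: builds a list and averages it."""
    data = [i % 7 for i in range(n_rows)]
    return sum(data) / len(data)

n_rows = 100_000
result, stats = profile_call(toy_train, n_rows)
throughput = n_rows / stats["seconds"] if stats["seconds"] > 0 else float("inf")
print(f"time: {stats['seconds']:.4f} s, peak memory: {stats['peak_bytes']} bytes, "
      f"throughput: {throughput:.0f} rows/s")
```

Recording these numbers on every retraining run makes regressions in speed or memory visible alongside changes in predictive accuracy.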

Materials and Methods
Analysis of the psychological and physical effects of COVID-19 vaccines, along with the discovery of correlations among the most commonly reported AEs, was conducted as per the workflow described in Figure 6. Vaccine data for Pfizer-BioNTech, Moderna, and Janssen were obtained via VAERS and were accompanied by a primary dataset collected from an online survey comprising information on post-vaccine AEs and public perception of the COVID-19 vaccine. The online survey was designed to fill data gaps in the absence of other closely monitored data repositories such as v-safe [29], whose data have not yet been made available to the public and research communities. The overarching goal of the present study of VAERS and the online survey data was to pinpoint critical data provenance and management protocols for robust statistical analysis and predictive modeling of vaccines with a case study of COVID-19 vaccines. Particular steps to assess the efficacy of data-driven techniques applied to VAERS data were based on: (i) the exploration of the post-vaccination effects of COVID-19 vaccines on various age groups (particularly children under the age of 16), (ii) the determination of the frequencies of reported AEs after each dose of COVID-19 vaccines, (iii) the evaluation of the co-existence of common post-vaccine AEs via unsupervised ML approaches, and (iv) the assessment of potential relationships of pre-existing conditions (e.g., allergies) with the AEs. Active reporting via an online survey was also used to further assess the impact of COVID-19 vaccination via the reported AEs, evaluate the psychological perception of COVID-19 vaccination, and compare the VAERS reports with an active and systematically controlled system that incorporates quality data into COVID-19 vaccine domain knowledge.

Compilation, Preprocessing, and Exploration of VAERS Data
Two distinct datasets were compiled with 905,976 and 211 data samples from VAERS (filtered to retain rows for the three COVID-19 vaccines) and an online survey, respectively. The VAERS reports consisted of vaccine- and patient-related attributes that included vaccine identification (VAX type, VAX manufacturer, VAX lot, VAX dose series, VAX route, VAX site, VAX name), free-form textual attributes {US state, gender, allergies, hospital, disability, current illness}, binary attributes {birth defects, prior visit, ER visit}, age (numeric), and vaccination date (date). VAERS reports that did not list any AEs were removed, reducing the dataset size to 892,213 reports. Data cleaning was then performed to merge duplicate reports and fix spelling/grammar mistakes, resulting in a total of 643,522 reports. The age attribute was discretized into 7 groups (5-11, 12-15, 16-18, 19-30, 31-50, 51-65, and 66+) for the purpose of identifying the age-to-AE correlation via bipartite plots (Section 3.3). Data statistics per manufacturer for each of the above age groups and genders are given in Tables 5 and S1, along with the numbers of categories for each attribute in both datasets from VAERS (original without removing duplicates (Table S1) and after data preprocessing (Table 5)) and the online survey. A summary of the content of the datasets (without the removal of duplicate records) is also provided in Table S2, which lists the number of data samples for each of the 20 most commonly reported AEs along with their percentage per manufacturer. Note: The number of samples per vaccine manufacturer and their percentages were calculated using clean data by removing those samples where any of the four attributes {age, gender, vaccine manufacturer, and symptom} were listed as "unknown." There were 63,189 reports with missing age values, which were also removed from the above analysis, followed by the merger of duplicate rows in the dataset.
VAERS reports for children under the age of 16, with a total of 12,489 VAERS samples, were also collected and analyzed separately in order to explore the commonality between the AEs with respect to different age groups. The goal of this analysis was to discover meaningful patterns (i.e., AEs) that appear collectively in children when compared to adults, or differences as the age group progresses to an older population. Data from children's reports were also cleaned by removing rows that reported any attribute (column) from {age group, gender, symptom, and vaccine manufacturer} as "unknown". Additionally, reports indicating "product administered to patient of inappropriate age" while reporting no AEs were also removed. Cleaned data after the removal of reports with "product administered to patient of inappropriate age" comprised 9457 reports, with distributions of 9142, 228, and 87 for Pfizer-BioNTech, Moderna, and Janssen, respectively (Table 5). The AEs submitted in children's VAERS reports were also separated in the form of heatmaps (Table 2) with respect to gender in order to identify gender similarities/dissimilarities with the help of cell colors based on the percentage of the corresponding AEs. The AEs for all genders in Table 2 were sorted based on the age group (Sections 2 and 2.1) of 5-11 years old. A non-cleaned version of the VAERS reports (i.e., the reports with duplicates) is provided in Table S1.
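The merging and discretization steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify how duplicate reports were matched, so the key used here (age, gender, manufacturer, vaccination date) is hypothetical, as are the field names and the toy records. Only the seven age groups come from the text.

```python
import math

# Age groups as listed in the text; the upper group is open-ended (66+).
AGE_BINS = [(5, 11, "5-11"), (12, 15, "12-15"), (16, 18, "16-18"),
            (19, 30, "19-30"), (31, 50, "31-50"), (51, 65, "51-65")]

def age_group(age):
    """Return the age-group label, or None if the age is missing or below 5."""
    if age is None or age < 5:
        return None
    years = math.floor(age)  # fractional ages fall into the bin of their completed years
    for low, high, label in AGE_BINS:
        if low <= years <= high:
            return label
    return "66+"

def merge_duplicates(reports, key_fields=("age", "gender", "manufacturer", "vax_date")):
    """Merge reports sharing the same key (an assumed criterion) by taking the union of their symptoms."""
    merged = {}
    for report in reports:
        key = tuple(report[f] for f in key_fields)
        if key not in merged:
            merged[key] = {**report, "symptoms": set(report["symptoms"])}
        else:
            merged[key]["symptoms"] |= set(report["symptoms"])
    return list(merged.values())

# Toy records (hypothetical, not VAERS data).
reports = [
    {"age": 14, "gender": "M", "manufacturer": "Pfizer-BioNTech",
     "vax_date": "2021-06-01", "symptoms": ["headache"]},
    {"age": 14, "gender": "M", "manufacturer": "Pfizer-BioNTech",
     "vax_date": "2021-06-01", "symptoms": ["headache", "pyrexia"]},
    {"age": 70, "gender": "F", "manufacturer": "Moderna",
     "vax_date": "2021-03-15", "symptoms": ["fatigue"]},
]

for row in merge_duplicates(reports):
    print(age_group(row["age"]), row["manufacturer"], sorted(row["symptoms"]))
```

Taking the union of symptoms keeps information from every copy of a report while counting the report once, which is what matters for the frequency-based analyses that follow.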

Exploratory Data Analysis of the COVID-19 Vaccines' Effects
The initial exploratory analysis for VAERS data was conducted to determine the frequencies of AEs to support advanced analysis. The 20 most commonly reported AEs were first used to assess their associations, as shown in Table 1. Similar to Tables 5 and S1, statistics based on non-cleaned data are reported in Table S2, demonstrating significant differences from Table 1, which could have a significant impact on the performance and robustness of any statistical approach applied.

Correlation Analysis of AEs Based on Age Groups and Allergies
Unsupervised ML approaches utilizing ARM and SOMs (Supplementary Materials-Section S2.2.1) were applied on VAERS and survey data, where the endpoints were analyzed to explore the relationships among AEs and reported allergies. Unsupervised learning is useful for visual data exploration to find hidden data groups in order to better understand the correlation of the AEs with existing medical conditions without any predictions or testing the underlying hypotheses. ML approaches are also helpful for applying statistical approaches to cluster/group similar biological effects to enhance the applicability domain of the vaccines as well as recommend proactive strategies for vaccine safety. Furthermore, as new data become available, analyzing the relationships among post-COVID-19 vaccine AEs and other reported demographical characteristics via ML approaches can be helpful in designing improved versions of vaccines (e.g., COVID-19 pills) for COVID-19 vaccine safety. Mapping the relationships (i.e., associations) among the reported post-COVID-19 vaccine AEs via unsupervised ML techniques is particularly helpful in revealing useful patterns, streamlining COVID-19 vaccine safety standards and the development of robust models for proactive strategies and recommendations. Through these relationships, one can assess the co-occurrence of certain AEs and infer the reasons that the emergence of one or more AE may lead to other AE(s) that are correlated due to biological or other relevant reasons. The ARM of AEs was also accompanied by confirmatory cluster analysis approaches based on hierarchical clustering.
ARM has been applied in various disciplines [28,30-37]. Irrespective of the domain of interest, the triggering of one or more AEs can imply the triggering of other AEs, consistent with the crosstalk between various physical AEs and perceptional indicators. The ARM of the AEs after each vaccine dose can be used to identify many-to-many relationships and propose a data-driven hypothesis-generation technique. A detailed description of ARM can be found in the Supplementary Materials (Section S2.2.1). ARs in the present study were also validated with the help of the SOM analysis, demonstrating the VAERS data distribution on 2D maps. Cluster analysis via SOMs has been demonstrated to be useful for discovering relationships in complex multidimensional datasets in cross-disciplinary areas of research and development [38-41]. SOM clustering applies competitive learning, preserves the topological structure of the input space, and transforms the output to a lower dimension (i.e., a 2D map of cells within SOM clusters). Further discussion on SOM can be found in the Supplementary Materials (Section S2.2.1). The utility of SOMs for data visualization and feature selection has also been demonstrated for exploratory data analyses [34,38,39,41-47]. For the analysis via ARM and SOM, open-source libraries were utilized, which are freely available online: R Studio arules version 1.7-3 [48] (for ARM), kohonen version 3.0.11 [49] (for SOM analysis), hclust version 3.6.2 [50] (for HC), and Python stats.chisquare [51] (for the statistical significance test).
The interrelationships of the AEs with allergies and other personalized factors (age group and gender) were identified via bipartite graphs (Section 2.2 and Section S2.2.2). Bipartite graphs established in the present study are useful for the exploratory analysis of potential allergies, age groups, and genders that may be indicative of the occurrence of one or more common AEs. Moreover, bipartite graphs allow for the bidirectional exploration of COVID-19 vaccine data for detailed information about specific AEs through either causal reasoning (i.e., {allergy, age group, gender} → AE) or diagnostic reasoning (i.e., AE → {allergy, age group, gender}). Graphical displays of correlations between reported AEs and allergies can help explore the frequencies of certain AEs, interrogate comparisons between them and their occurrences given certain pre-existing conditions, identify similarity/distribution among reports that demonstrated similar AEs, and assess potential causes of AEs given certain pre-existing conditions [47]. For example, it can be seen in Figure 5 that the age group of 31-50 years old has the highest percentage (36%) among all of the 20 commonly reported AEs. Each bar in the bipartite graph is further split into subbars representing its distribution in terms of the available categories for each of the three variables {age group, gender, and allergies} across 20 AEs. The bars on the left side of the bipartite graphs list the 20 most commonly reported AEs. The bipartite graphs in the present study were created using the open-source JavaScript library from d3.js [52].
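The quantities displayed in such a bipartite graph can be derived from simple pair counts. The following is a minimal sketch, assuming each report is a dictionary with a categorical attribute and a list of symptoms; the function name `bipartite_edges`, the field names, and the toy records are hypothetical. It produces the edge weights between categories and AEs and each category's percentage share of all edges; the rendering itself (done with d3.js in the study) is not reproduced here.

```python
from collections import Counter

def bipartite_edges(reports, attribute):
    """Count (category, AE) pairs and each category's share of all pairs, in percent."""
    edge_counts = Counter()
    for report in reports:
        for ae in report["symptoms"]:
            edge_counts[(report[attribute], ae)] += 1
    total = sum(edge_counts.values())
    category_totals = Counter()
    for (category, _ae), count in edge_counts.items():
        category_totals[category] += count
    shares = {c: 100.0 * n / total for c, n in category_totals.items()}
    return edge_counts, shares

# Toy records (hypothetical, not VAERS data).
reports = [
    {"age_group": "31-50", "symptoms": ["headache", "fatigue"]},
    {"age_group": "31-50", "symptoms": ["headache"]},
    {"age_group": "12-15", "symptoms": ["dizziness"]},
]

edges, shares = bipartite_edges(reports, "age_group")
for (category, ae), count in sorted(edges.items()):
    print(f"{category} -> {ae}: {count}")
print(shares)
```

Reading the edge counts from category to AE corresponds to the {allergy, age group, gender} → AE direction described above, and grouping the same counts by AE gives the reverse direction.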