Set Operations in Python for Translational Medicine

To the best of our knowledge, this is the first tutorial article on Python programming with set operations for beginners and practitioners in translational medicine or medicine in general. This tutorial will allow researchers to demonstrate and showcase their tools as PyPI packages around the world. Via PyPI packaging, a Python application with a single source code can run on Windows, MacOS, and Linux operating systems. In addition to the PyPI packaging, the reproducibility and quality of the source code must be guaranteed. This paper shows how to publish the Python application in Code Ocean after the PyPI packaging. Code Ocean is used in IEEE, Springer, and Elsevier for software reproducibility validation. First, programmers must understand how to scrape a dataset over the Internet. Second, the dataset files must be read in Python. Third, a program must be built to compute the target values using set operations. Fourth, the Python program must be converted to a PyPI package. Finally, the PyPI package is uploaded. Code Ocean plays a key role in publishing validation for software reproducibility. This paper presents the vaers executable package as an example for calculating the number of deaths due to COVID-19 vaccines. Calculations were based on gender (male and female), age group, and vaccine group (Moderna, Pfizer, and Novartis), respectively.


Introduction
First, scientists in translational medicine must understand how to use the Google search engine. You may be surprised that the search results may differ depending on the browser. There are two types of keyword search: word keyword search and phrase keyword search. In a phrase keyword search, quotation marks indicate an ordered set of words. For example, "set operations" is composed of two words, set and operations, where set must be the first word and operations must be the second word, immediately after set.
An exhaustive search for articles containing the two phrases "vaccine safety" and "set operations" revealed only three articles over the Internet [1-3]. Jacquez et al. showed how to use set operations for breast cancer analysis where the dataset is composed of only 285 instances [1]. Lu et al. did not show set operations at all in their analysis; the phrase "set operations" appeared only in their references [2]. Barry DeVille et al. published a SAS book that briefly introduced set operations using VAERS datasets with the Statistical Analysis System (SAS) [3]. However, it showed only graphic results with SAS and gave no detailed explanation of set operations. Since SAS needs a proprietary license, it is not open-source programming. To the best of our knowledge, there is no tutorial on set operations with open-source programming for vaccine safety. This paper's role with open-source programming in Python will be critical for translational medicine to deal with large datasets.
The author has published a tutorial paper on PyPI packaging for translational medicine [4]. However, that tutorial did not cover software reproducibility or set operations for efficient computing with large datasets; addressing both is the main contribution of this paper. This paper details the calculations on set operations used in translational medicine. Set operations are used for calculating adverse effects on deaths due to COVID-19 using VAERS datasets [5].
There are many articles on the efficacy of vaccines, but few on their adverse effects. Writing this tutorial on set operations with open-source Python for translational medicine is motivated by four reasons: (1) we need to show that efficient computation, such as set operations in Python, is crucial for manipulating large datasets such as VAERS with 748,230 instances; (2) the computational complexity should be understood for accelerating computation; (3) there is no tutorial analysis on the extensive adverse effects of COVID-19 vaccines; and (4) PyPI packaging and software reproducibility are essential for scientists in translational medicine for maximum software dissemination to the world. This paper presents a data analysis with set operations. The computational time complexity depends on the structure of nested loops and the size of individual loops in algorithms or programs. For example, if your program has a single loop, the size of the loop determines the computational time complexity. In Python, the time complexity of a single for-loop is determined by the number of instances (n) and is written O(n) in Big O notation: for i in range(len(instances)): In double-nested loops or triple-nested loops, the time complexity can be expressed as O(n^2) and O(n^3), respectively. With set operations, the double-nested loops, the triple-nested loops, and other loops can be converted to O(n). Therefore, this paper introduces set operations to significantly reduce the time complexity.
For example, the number of deaths among patients who received both the Pfizer and Moderna vaccines can be obtained with a set intersection in O(n) time.
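To make the complexity argument concrete, the following sketch finds the IDs common to two lists in two ways: a double-nested loop, which needs on the order of n^2 comparisons, and a set intersection, which is O(n) on average because Python sets are hash tables. The ID values and function names are hypothetical illustrations and are not taken from vaers.py.

```python
def common_ids_nested(ids_a, ids_b):
    """O(n^2): compare every ID in ids_a with every ID in ids_b."""
    common = []
    for a in ids_a:
        for b in ids_b:
            if a == b:
                common.append(a)
                break  # stop scanning ids_b once a match is found
    return common


def common_ids_set(ids_a, ids_b):
    """O(n) on average: build two hash sets and intersect them."""
    return set(ids_a) & set(ids_b)


# Hypothetical patient IDs, for illustration only.
ids_a = [101, 102, 103, 104]
ids_b = [103, 104, 105]

print(sorted(common_ids_nested(ids_a, ids_b)))  # [103, 104]
print(sorted(common_ids_set(ids_a, ids_b)))     # [103, 104]
```

Both functions return the same IDs; only the amount of work differs, and the gap grows quickly as the number of instances increases.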
In datasets, the number of instances is equivalent to the number of patients. In other words, the unique patient IDs can be used and shared in set operations in multiple datasets. Patient IDs are unique and shared in three VAERS datasets.
The Pfizer-death-patients set, deathPFIZER, can be calculated simply by intersecting the deathIDs and PFIZERIDs sets. Similarly, the Moderna-death-patients set can be computed by intersecting the deaths-set and Moderna-set. Therefore, patient deaths from mixing the Pfizer and Moderna vaccines can be calculated by intersecting the Pfizer-death-patients-set and Moderna-death-patients-set. However, we do not know if Pfizer is the first vaccine. In other words, there are Pfizer-Moderna-death-patients and Moderna-Pfizer-death-patients. The time complexity of the above calculations is O(n).
The maleIDs and femaleIDs sets can be similarly generated with O(n) for gender class set operations. All features, such as types of vaccines, gender class (male or female), death or alive (non-death), and ages, can be simply computed in this manner with set operations with O(n). In other words, the computation time with set operations is drastically reduced from O(n^3) or O(n^2) to O(n).
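The intersections described above can be sketched in a few lines. The names deathIDs, PFIZERIDs, and deathPFIZER follow the text; MODERNAIDs, deathMODERNA, and deathMIXED are hypothetical names chosen for this illustration, and the ID values are made up.

```python
# Hypothetical patient IDs; in practice these would come from the VAERS files.
deathIDs = {1, 2, 3, 4, 5}
PFIZERIDs = {2, 3, 5, 8}
MODERNAIDs = {3, 4, 5, 9}

# Patients who died and received each vaccine.
deathPFIZER = deathIDs & PFIZERIDs      # {2, 3, 5}
deathMODERNA = deathIDs & MODERNAIDs    # {3, 4, 5}

# Patients who died and received both vaccines (order of doses unknown).
deathMIXED = deathPFIZER & deathMODERNA  # {3, 5}

print(len(deathPFIZER), len(deathMODERNA), len(deathMIXED))  # 3 3 2
```

The counts are simply the sizes of the resulting sets, so no explicit loop over patients is needed once the ID sets exist.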
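As a minimal sketch of this idea, the ID sets for each feature can be filled in a single pass over the records and then combined freely. The record layout, the 60-year age threshold, and the names age60plusIDs, male_deaths, female_deaths_60plus, and alive are assumptions made for illustration only.

```python
# Hypothetical records: (patient ID, sex, age in years, died flag).
records = [
    (1, "M", 72, True),
    (2, "F", 45, False),
    (3, "F", 81, True),
    (4, "M", 30, False),
    (5, "M", 66, True),
]

maleIDs, femaleIDs, deathIDs, age60plusIDs = set(), set(), set(), set()
for pid, sex, age, died in records:  # one pass over the data: O(n)
    if sex == "M":
        maleIDs.add(pid)
    elif sex == "F":
        femaleIDs.add(pid)
    if died:
        deathIDs.add(pid)
    if age >= 60:  # hypothetical age-group boundary
        age60plusIDs.add(pid)

male_deaths = deathIDs & maleIDs                            # {1, 5}
female_deaths_60plus = deathIDs & femaleIDs & age60plusIDs  # {3}
alive = (maleIDs | femaleIDs) - deathIDs                    # {2, 4}

print(len(male_deaths), len(female_deaths_60plus), len(alive))  # 2 1 2
```

Any further grouping, such as a different age band, only needs one more set and one more intersection.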
The advantage of PyPI is that it allows vaers to run on Windows, MacOS, and Linux without regard to the operating system, as long as Python is installed on the system. This feature maximizes the open-source dissemination of software to the world. This paper introduces Code Ocean for the reproducibility of software codes after showing the PyPI packaging. Code Ocean is the de facto service provider for software reproducibility.
In traditional software development, programmers must write a program from scratch. With the rapid progress of open-source software, programmers must choose the right libraries from repositories and glue them together with minimum effort. The selected libraries and packages are available to the public and can be installed by a simple pip terminal-line command [6]. In other words, programmers must be familiar with bash commands in the terminal.
In this tutorial, we will follow the order of the execution of the commands in the bash shell based on reverse engineering. There is no significant difference between Windows, MacOS, and Linux operating systems. This paper depicts a vaers executable package [7] as an example for calculating adverse effects on the number of deaths due to COVID-19 by gender and age group against the Moderna [8] and Pfizer [9] vaccines. The vaers method is currently under review.
First, programmers must understand how to scrape a dataset over the Internet. The vaers executable uses the VAERS datasets. VAERS stands for Vaccine Adverse Event Reporting System. VAERS is a national early warning system to detect possible safety problems in US-licensed vaccines. VAERS is not designed to determine if a vaccine caused a health problem, but it is especially useful for detecting unusual or unexpected patterns of adverse event reporting that might indicate a possible safety problem with a vaccine.
Second, the dataset file must be read in Python. VAERS is composed of three csv files: 2021VAERSDATA.csv, 2021VAERSSYMPTOMS.csv, and 2021VAERSVAX.csv. In vaers.py, 2021VAERSDATA.csv and 2021VAERSVAX.csv are used. csv stands for comma-separated values.
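The reading step can be illustrated with a small, self-contained sketch. It uses the standard-library csv module instead of pandas so that it runs without extra installs, and it reads tiny in-memory stand-ins rather than the real files. The column names (VAERS_ID, AGE_YRS, SEX, DIED, VAX_MANU) and the manufacturer strings are assumptions about the VAERS layout and should be checked against the downloaded files.

```python
import csv
import io

# Tiny in-memory stand-ins for 2021VAERSDATA.csv and 2021VAERSVAX.csv.
# Column names and values are assumed for illustration.
DATA_CSV = r"""VAERS_ID,AGE_YRS,SEX,DIED
1001,72,M,Y
1002,45,F,
1003,81,F,Y
"""
VAX_CSV = r"""VAERS_ID,VAX_TYPE,VAX_MANU
1001,COVID19,MODERNA
1002,COVID19,PFIZER\BIONTECH
1003,COVID19,PFIZER\BIONTECH
"""


def read_rows(text):
    """Parse csv text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


data_rows = read_rows(DATA_CSV)
vax_rows = read_rows(VAX_CSV)

# Build ID sets keyed on the patient ID shared by both files.
deathIDs = {row["VAERS_ID"] for row in data_rows if row["DIED"] == "Y"}
PFIZERIDs = {row["VAERS_ID"] for row in vax_rows if "PFIZER" in row["VAX_MANU"]}

print(sorted(deathIDs & PFIZERIDs))  # ['1003']
```

For the real files, the in-memory strings would be replaced by open file objects; an explicit encoding argument may be needed, depending on how the files were exported.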
Third, a program is built to compute the target values using set operations. This paper shows how to calculate adverse events of death by sex and age group for each of the Novartis [10], Moderna, and Pfizer vaccines.
Fourth, the Python program is converted to the PyPI package with three files: setup.py, vaers.py, and README.md. The README.md file can be created using the GitHub site. Therefore, you need to create a new account on the GitHub site.
Finally, the PyPI package is uploaded using a twine command. In order to upload a PyPI package, you need to have an account on the pypi.org site.
In order to use and run a Python program, you must choose the proper Miniconda installation package for your operating system from the following site: https://docs.conda.io/en/latest/miniconda.html (accessed on 16 March 2022).
For Windows, double-click on the file Miniconda3-py38_4.11.0-Windows-x86_64.exe. Python 3.8 is recommended in this paper.
For MacOS, the file Miniconda3-py38_4.11.0-MacOSX-x86_64.sh should be installed with zsh or bash from the terminal [11,12]:
$ zsh Miniconda3-py38_4.11.0-MacOSX-x86_64.sh
or
$ bash Miniconda3-py38_4.11.0-MacOSX-x86_64.sh
For Linux, download Miniconda3-py38_4.11.0-Linux-x86_64.sh and run the following command:
$ bash Miniconda3-py38_4.11.0-Linux-x86_64.sh
Windows users have two options for Miniconda: one on Windows 11 or 10 and the other on Windows Subsystem for Linux (WSL). WSL is a compatibility layer for running Linux binary executables (in ELF format) natively on Windows 11 or 10. WSL has not been completed yet, but you are allowed to use binary executables on Windows from the WSL command line.
From here onwards, there is no difference between the operating systems. You should be familiar with the conda and pip commands and their options. First, open a terminal and update the Miniconda environment with the following command. The first ($) is a prompt from the terminal, while the second ($) is the dollar key.
$ conda update conda
Second, update the pip installation command. "-U" is short for "--upgrade".
$ pip install -U pip
or
$ python -m pip install -U pip
In order to install pandas, for example, run the following command.
$ pip install -U pandas
or
$ conda install pandas
In order to know the Python version number:
$ python -V
Python 3.8.4
The "which" command reports the location of Python.
$ which python
/home/takefuji/miniconda3/bin/python
If the library is not Python-related, install it with the apt command on WSL or brew on MacOS.
First, apt should be updated and upgraded on Linux or WSL on Windows.
$ sudo apt update
$ sudo apt upgrade
Then, you can install the necessary library. For example, "wget" is a package name, and "sudo" runs a command with superuser privileges.
$ sudo apt install wget
For MacOS users, you must install the brew command, then run the following command to install the matplotlib library. matplotlib is a library name.
$ brew install matplotlib
In vaers, the wget command is needed.
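The checks performed above with "python -V" and "which" can also be done from within Python, which is convenient for confirming that wget is available before running vaers. The sketch below uses only the standard library; the function name check_environment and the default tool list are hypothetical.

```python
import shutil
import sys


def check_environment(tools=("python", "pip", "wget")):
    """Map each tool name to its full path on PATH, or None if it is missing."""
    return {tool: shutil.which(tool) for tool in tools}


# Report the interpreter version, similar to "python -V".
print("Python version:", ".".join(map(str, sys.version_info[:3])))

# Report where each tool lives, similar to "which <tool>".
for tool, path in check_environment().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

A tool reported as not found should be installed with pip, conda, apt, or brew as described above before continuing.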
In WSL and MacOS, you must install an X Window server. Windows users should download the VcXsrv Windows X Server exe file and install it. Mac users should install XQuartz. Before running Python, you should start the X server.
vaers was selected for this tutorial because there is no tutorial on Python set operations. Set operations are useful to calculate the adverse effects on death by gender (male and female), age group, and vaccine group (Moderna, Pfizer, and Novartis).
In traditional programming, the programmer must program the target software from scratch. In open-source programming, the right libraries must be chosen from repositories and the selected libraries are simply glued together with minimum effort. This is called rapid open-source prototyping. vaers.py was developed within a few hours.
In other words, the skills in open-source programming lie in selecting the right libraries from a variety of the existing libraries [13]. The more examples that are available in open-source libraries, the easier it is for users to create the desired code.

Materials and Methods
This Section includes testing the Python environment, the PyPI package of three files (setup.py, README.md, vaers.py), how to upload a PyPI package, and how to run it.

Python Environment and How to Run Vaers
It is assumed that Python is ready to run on the terminal. We must make sure that the system has a pip command in the PATH variable by the following command. PATH is an environmental variable in Windows, WSL on Windows, MacOS, and Linux operating systems that tells the shell which directories to search for executable files. If

PyPI Package
A PyPI package needs three files including README.md, setup.py, and vaers.py.

README.md
The README.md file can be easily prepared by using the GitHub site. You need to have an account on the GitHub site. When creating a new Repository, select "add a README file". README.md will be created when you enter the necessary content of a new PyPI package. Remember that the image in GitHub should be linked to the global site image address, instead of the local address. Unless the image is linked to the global address link, the image will not be displayed on the PyPI site.

setup.py
The following is a template of setup.py file for creating an executable code. The shaded 10 lines should be changed for your PyPI package.
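For orientation, a setup.py for a single-module executable package might look like the sketch below. This is not the author's template: every value is a placeholder, the entry point assumes that vaers.py defines a main() function, and the pandas dependency is inferred from the use of pd.read_csv. The metadata is kept in a dictionary, and setuptools is invoked only when a build command is given, so the file can be run or imported without side effects.

```python
# setup.py: a hypothetical minimal template; every value is a placeholder.
import sys
from pathlib import Path

# Use README.md as the long description shown on the PyPI project page.
readme = Path("README.md")
long_description = readme.read_text(encoding="utf-8") if readme.exists() else ""

SETUP_KWARGS = {
    "name": "vaers",                      # package name on PyPI
    "version": "0.0.1",                   # placeholder version
    "description": "Count deaths reported in VAERS by vaccine, sex, and age",
    "long_description": long_description,
    "long_description_content_type": "text/markdown",
    "py_modules": ["vaers"],              # the single module vaers.py
    "install_requires": ["pandas"],       # assumed runtime dependency
    "python_requires": ">=3.8",
    # Creates a "vaers" terminal command that calls main() in vaers.py.
    "entry_points": {"console_scripts": ["vaers=vaers:main"]},
}

# Call setuptools only when a command such as "sdist" is supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    from setuptools import setup

    setup(**SETUP_KWARGS)
```

The console_scripts entry is what turns the module into an executable command after installation with pip.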

vaers.py
The following command can upload three files. The system will ask for a username and password.
$ twine upload dist/*
When you want to update the package, first delete all files and directories in dist/ and build/ with the following command:
$ rm -rf dist/* build/*
Then, repeat the same commands.

Discussion on Set Operations
This tutorial allows researchers to submit a new PyPI package and to showcase their skills on PyPI packages around the world. All that is required is to create three files, vaers.py, setup.py, and README.md, by following the instructions in the Materials and Methods Section. Before submitting the new package, you should test it on your local machine.
There are four set operations as shown in Figure 1: union, intersection, exclusive OR, and subtraction. In Python, the union of set A and set B can be calculated with A | B or, equivalently, A.union(B).
In vaers.py, the shaded lines, from the first line up to def main(), check for the existence of the two files and, if both files exist, read them with pd.read_csv from the pandas library.
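The four set operations named above map directly onto Python's built-in set type. The following sketch uses two small hypothetical sets, A and B, and shows both the operator form and the equivalent method form.

```python
# Two small hypothetical sets.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

union = A | B           # elements in A or B:            {1, 2, 3, 4, 5, 6}
intersection = A & B    # elements in both A and B:      {3, 4}
exclusive_or = A ^ B    # elements in exactly one set:   {1, 2, 5, 6}
subtraction = A - B     # elements in A but not in B:    {1, 2}

# The same operations are available as methods.
assert union == A.union(B)
assert intersection == A.intersection(B)
assert exclusive_or == A.symmetric_difference(B)
assert subtraction == A.difference(B)

print(union, intersection, exclusive_or, subtraction)
```

Subtraction is the only one of the four that is not symmetric: A - B and B - A generally differ.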