African Genomic Medicine Portal: A Web Portal for Biomedical Applications

Genomics data are currently being produced at unprecedented rates, resulting in increased knowledge discovery and submission to public data repositories. Despite these advances, genomic information on African-ancestry populations remains significantly low compared with European- and Asian-ancestry populations. This information is typically segmented across several different biomedical data repositories, which often lack sufficient fine-grained structure and annotation to account for the diversity of African populations, leading to many challenges related to the retrieval, representation and findability of such information. To overcome these challenges, we developed the African Genomic Medicine Portal (AGMP), a database that contains metadata on genomic medicine studies conducted on African-ancestry populations. The metadata is curated from two public databases related to genomic medicine, PharmGKB and DisGeNET. The metadata retrieved from these source databases were limited to genomic variants that were associated with disease aetiology or treatment in the context of African-ancestry populations. Over 2000 variants relevant to populations of African ancestry were retrieved. Subsequently, domain experts curated and annotated additional information associated with the studies that reported the variants, including geographical origin, ethnolinguistic group, level of association significance and other relevant study information, such as study design and sample size, where available. The AGMP functions as a dedicated resource through which to access African-specific information on genomics as applied to health research, through querying variants, genes, diseases and drugs. The portal and its corresponding technical documentation, implementation code and content are publicly available.


Introduction
Modern biomedical research is strongly reliant on data repositories that are wellstructured for the storage and retrieval of information. The number of biological and biomedical databases have increased over the last two decades, with resources for diverse organisms, molecules, interactions and pathways. The goal of research is often compounded by the complexity of these different sources and the amount of data that are being gathered from high-throughput technologies, which has also increased in the past two decades. However, low-and middle-income regions, such as Africa, are still not fully capable of tapping into genomic resources for biomedical application due to unmatched resources and needs. There is thus an urgent need to improve facilities, particularly for data generation, processing, storage and access, that will contribute to the implementation of genomic medicine in Africa. Recent initiatives have begun to enhance efforts to overcome the barriers for the use of genomics in Africa, including the Malaria Genomic Epidemiology Network (MalariaGEN) and the Human Heredity and Health in Africa (H3Africa) consortium [1].
Current genomics and biomedical data and metadata in public databases are skewed toward populations of European ancestry. Many African populations remain underrepresented and the data that is available for Africa is not structured in a manner appropriate for the geographical, social and cultural diversity of the continent, with data generally being classified under Sub-Saharan Africa, with little reference to specific populations, or misclassified with other regional groups, as previously described [2,3]. Recent advances in genomics have illustrated the rich genetic diversity of African populations [4,5] and its associated health implications, such as varying lipid profiles [6], predisposition to kidney disease [7] and more [8]. Similar diversity has been observed in pharmacogenomics studies, where, for example, population-specific alleles and frequencies have been found in the genes encoding the cytochrome P450 (CYP) enzymes in Africans, together with structural variants that impact enzymes functionality [9][10][11].
The current underrepresentation of African data and metadata in public databases, as well as the implementation of insufficient classification methods of such data in these public databases, call for the inclusion of more diverse African samples in public genomic reference panels and databases, and highlights the need for a resource that can provide a more granular view of African and African-ancestry populations. This would illustrate the current status of research and implementation on the continent, as well as facilitate continuous progress of genomic medicine in Africa. An accurate representation of actionable genetic variants in Africa is required to facilitate genomic medicine research, including molecular markers that are relevant in the aetiology of diseases, variants that have been associated with specific clinically-supported drug therapies, and variants related to drug response. The ability to query and make use of the currently available data and metadata is critical. Unstructured and heterogeneous data hinders integration efforts and knowledge inference. Moreover, the scattering of data and metadata makes challenging the establishment of a clear genetic determinant map in Africa and affects our capacity to study its genetic diversity and the implication thereof in clinical studies for these population groups at a large scale.
The existing challenges have prompted the development of a fit-for-purpose online database that contains relevant African-specific information related to genomic medicine extracted from public databases and further curated from the literature. This resource will be of interest to the African bioscience community, and genomic medicine stakeholders at large as it may facilitate the implementation of genomic medicine on African-ancestry populations worldwide. The need for an African-specific resource is justified by the extensive genetic diversity among populations across the continent, the data fragmentation across different existing resources, and the lack of appropriate annotation. The proposed resource, the African Genomic Medicine Portal (AGMP) provides curated information to guide the research and application of genomic medicine, pharmacogenomics, and clinical genetics within African-ancestry populations. The AGMP aims to provide a user-friendly resource specific to African research to support genomic medicine in Africa, a centralized root for African studies in genomic medicine available for public use and to supply the community with a well-curated and highly enriched content of genomic medicine that reflects the diversity of the genetic makeup in Africa.

Content Mining and Curation
The development of AGMP v1.0 was initiated through a collaborative effort by H3ABioNet, the Pan-African H3Africa Bioinformatics Network members [12]. Two primary data sources were employed for AGMP v1.0, PharmGKB (www.pharmgkb.org, accessed on the 25 July 2019) [13] and DisGeNET (www.disgenet.org, accessed on the 25 July 2019) [14]. The pipeline summarising the different stages for collecting and curating the AGMP data content is presented in Figure 1 and is explained in more detail below. Datasets from these databases were retrieved in plain text format, containing variant-based information on known associations with drug response (PharmGKB) and associations related to disease phenotypes (DisGeNET), as well as the related Pubmed entries (PMID). Gene names were obtained for DisGeNET variants with Ensembl's Biomart [15] to allow querying of the data by gene symbol. PharmGKB provides gene symbol annotated variants in its raw data files.  To exclude non-African data from AGMP, a custom text-mining pipeline that identifies biomedical literature containing African-related biogeographical and ethnolinguistic terms from Medline abstracts was developed and applied to the retrieved datasets. These terms included a list of 104 biogeographical entities (countries, nationalities, African-ancestryrelated population groups such as African-American and Afro-Caribbean) and 146 African ethnolinguistic population labels. The pipeline used the "EDirect" (http://eutils.ncbi. nlm.nih.gov/, accessed on the 10 January 2022) tool to retrieve the PubMed entries in XML format followed by a series of processes to tokenize and Lemmatise the text content. Finally, the list of biogeographical and ethnolinguistic terms was queried and the abstracts that matched at least one of the terms were retained for manual curation. The pipeline is implemented in the Nextflow workflow manager (https://github.com/hothman/text_ mining/tree/master/workflow, accessed on the 10 January 2022).
A manual curation step was included in the content curation process due to false positives, which may have resulted from the text-mining pipeline. In addition, supplemental information was extracted from the review of the full-text papers identified in the previous step, including the association significance level (p-value), experimental analysis, cohort structure, and more (Supplementary Materials File S2). The manual curation step was also employed to harmonize information related to the biogeographical groups of studies and to extract information about the country of origin of each study. During this process, entries corresponding to studies on populations from African ancestry were harmonized and labeled with one of the following terms: Sub-Saharan Africa, North Africa, West Africa, East Africa, Central Africa, Southern Africa and African-American/Afro-Caribbean. Cohorts of mixed African origins were assigned to Sub-Saharan Africa or Africa depending on the geographical spread of the study samples. Study p-values were retrieved either from the PharmGKB metadata field or by recovering it from the main text of the paper.
Drugs derived from the PharmGKB variants-phenotypes list have been annotated by identifying the commercial drug names, the FDA approval status, the indication information, and their conventional names from the International Union of Pure and Applied Chemistry (IUPAC) federation. This information was extracted manually from DRUG-BANK [16] (https://go.drugbank.com/, accessed on the 10 January 2022). Drug identifiers from DRUGBANK were retrieved and added to the list of attributes.
All the genes associated with drug response or disease phenotypes were annotated to include information about their gene product, chromosome number and biological function. The gene product was identified by mapping the gene symbol to the corresponding UniProt ID using BioMart. The biological function data were obtained by manual extraction and formatting of the related field from the UniProt database.
Data integration consists of applying the extract, transform and label (ETL) procedure to integrate data from different sources into a relational database. A series of interactive transformation codes were applied for this purpose. Manually curated data were examined for their consistency of using unique terminology in each column. All the data found to describe populations from Non-African ancestry were discarded from the source table. The quality assessment process consists of evaluating the fraction of missing data, the correctness of mapping to the target tables, harmonization of terminologies, harmonization of class labels and the presence of duplicates.

Technical Implementation
The portal can be accessed on the link https://agmp.h3abionet.org (accessed on the 10 January 2022). The search menus have been set up on the basis that most users would need minimum information i.e., the gene name, the variant ID, the disease name or the drug name to easily explore the content of the portal. The portal has been designed with a genecentric focus which ensures that all the information related to genotype-phenotype data and the associated metadata are attributed to genes at the lowest level of granularity. For example, the user may have access to all the variants belonging to specific gene that include the star-allele types and the SNVs (single-nucleotide variants) related to drug-response or disease phenotype from a single web page on the portal attributed to that gene. With such an implementation, querying, navigating and extracting information from the portal would require only a few prior inputs from the user. Access to data in the portal is facilitated by a combination of an SQLite SQL data storage engine, a Python-Django web framework as an intermediate back-end and an interactive web technology front-end user interface. A Bootstrap framework along with open source JavaScript Libraries and custom CSS are used. The SQLite database consists of six tables linked together using the object relational model (Supplementary Figure S1).The tables consist of information on research studies and association of human genomic variations, drugs, genes and phenotypes. Data were integrated into the SQLite database with in-house Python scripts that use the Django web framework (https://github.com/h3abionet/african_genomics_medicine_portal, accessed on the 10 January 2022).

Results
The portal was developed to provide curated information on the genetic variations associated with disease aetiology and treatment in populations of African-ancestry as well as a broad overview of the disease-causing/-associated variants. We utilized PharmGKB and DisGeNET as data sources of the portal mainly for their high-quality content and for providing comprehensive and easy-to-access raw data. DisGeNET includes more than 357,000 variants related to 66,379 bibliography resources. PharmGKB includes 9151 variantdrug associations, including single nucleotides and star alleles described by 3007 papers.
An exhaustive search for the African-related data content was difficult to achieve, therefore we opted to automate the identification of African-related biomedical literature to reduce the size of the bibliography resources to be curated. The text mining process allowed us to identify 846 and 164 relevant studies from DisGeNET and PharmGKB respectively. Furthermore, we identified 479 and 2584 variants that are linked to these biomedical literature entries described, respectively, by PhamrGKB and DisGeNET raw data content.

AGMP Data Model and Content
In addition to studies on African populations identified from PharmGKB and Dis-GeNET, the content of AGMP included data collected from UniProt, DrugBank, Ensembl and PubMed, which was transformed into a relational database containing six tables, as shown on the entity-relationship diagram (Supplementary Figure S1). The Table S1 in Supplementary Materials File S2 explains, in detail, the information provided by each attribute in the relational database. SNVs and star alleles, stored in the 'Variant' table, provide information about the variants, and these are associated with phenotypes, which could be related to either the response to a drug or a disease. The star allele variants from PharmGKB were retrieved to account for the particularity of annotating the absorption distribution metabolism excretion (ADME) genes [10]. Note, that a star allele is defined by a set of core variants that define the phenotype associated with the drug response in linkage disequilibrium with other variants, of which presence or absence is not necessarily related to the drug-response phenotype [17][18][19]. The "Variant study" table contains information describing the bio-geographical and the ethnolinguistic attributes for variants in the African population. The other four tables were created to provide additional metadata for the variant-phenotype association. For instance, the "Gene" table stores data about the genes encoding the variants while the "Drug Table" accommodates detailed data for the drugs described by the variant-phenotype associations, including their approval status. Literature resources used by the manual curation process for the retrieval of the additional data are incorporated into he 'Study' table. Information about the pharmacokinetic effect on drugs and information about diseases from PharmGKB and DisGeNET respectively are stored in the table 'Phenotype'.
The manual curation retained 815 studies, including 143 references from PharmGKB and 652 references from DisGeNET that are associated with populations of African ancestry. We identified a total of 556 genes, of which 461 and 95 are related to 552 diseases and 71 drugs, respectively. This represents only a fraction (5.5% and 2%) of the total genes annotated in PhamrGKB and DisGeNET. Many of these diseases are similar but were assigned to different Unified Medical Language System concept unique identifiers in Dis-GeNET. For instance, "Piebaldism" was described to be associated with SNP rs121913684. "Piebaldism With Sensorineural Deafness" was assigned to another identifier in DisGeNET and was associated with the same rs121913684. The total number of variants associated with the drug response phenotype is 286, of which, 234 correspond to SNVs and 52 are star allele variants.

Data Exploration through the Portal
The frontend of the portal is divided into six main pages (home, about, documentation, search, summary, resources and help), displayed as a menu in the header. The search page has toggle switches that allow a combination of four fields (disease, drug, variant and gene) to be selected, along with the input field to provide search keywords (Figure 2). A complete tutorial on how to use AGMP is included in Supplementary File 2. Access and usage of the data through the AGMP is available under the Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0), as both PharmGKB and DisGeNET content are distributed for public use under the same type of license. AGMP seeks to comply with FAIR principles in its development cycle. To achieve this, the portal includes documentation, public code sharing, data and code archiving and licensing. Formal FAIRification from the data content and web accessibility point of view will be addressed further in upcoming versions of the portal.

Representation of African Data in AGMP
Studies about genetic markers from populations of African ancestry have been conducted by different research teams from around the world. The 'country' data of AGMP aims to assign the annotated variants in the database to a geographical entity. A geographical entity is an indicator of where the study participants were recruited, including names of the country and the region. The annotation is dependent on the level of detail provided by a study, usually in the publication. Many of these studies have used a mixed population either of total African individuals or combined with non-Africans. For example, Zimmerman et al. [20] studied the polymorphism of chemokine receptor 5 (CCR5) in a group of mixed ethnicity that included Africans. However, they did not provide details about the exact origin of the African cohort, therefore their geographical entity was assigned to "Africa". For others, more fine-grained geographical locations were provided.
AGMP includes data derived from 37 African countries, though we noticed a significant imbalance of identified variants number between the different regions of Africa ( Figure 3A). With 66% of all the annotated variants, countries from North Africa are, by far, the most represented in the dataset. We also noticed an important gap between the central African region and the rest of the continent, with only 52 variants corresponding to just 2.4%, derived from countries in this region. This is the least represented region in the AGMP dataset despite its demographic and historical importance in Africa. Tunisia has the most annotated variants, with 568 variants, representing 23% of the entire dataset ( Figure 3B).

Pharmacogenetic-Related Phenotypes in AGMP
Variants identified in PharmGKB are associated with drug response phenotypes for 50 drugs approved for specific treatments (Figure 4). Warfarin is the drug with the highest number of associated variants with a total of 192, representing about 7% of the entire dataset. Anticoagulant drugs are the most represented therapeutic class associated with 37% of the total annotated variants.

Discussion
The first version of the AGMP was implemented to provide high-quality and reliable information on genetic variants from African populations and populations of African ancestry for users with interests in genomic medicine. The primary sources of data were PharmGKB and DisGeNET supplemented with manual curation from the literature and links to other databases. These provide rich metadata for genetic variants in a standard format and use controlled vocabularies. The two databases are also recognized for the high quality of their content because of their rigorous processes of curation. For example, the PharmGKB curation panel of experts regularly verify the biomedical literature to integrate novel variants or update the information about the association with drug-response phenotypes [21], while DisGeNET aggregates data from eight databases including UniProt, ClinVar, GWAS catalog and GWASdb [22].
Data profiling and curation for AGMP have highlighted some drawbacks in the current conception of DisGeNET and PharmGKB, as well as the practices of describing the experimental protocol and presenting data in genetic and genomic studies. DisGeNET was intended to give a broad overview of variant-phenotype associations. The data, however, are presented as a binary association without taking into consideration the ethnic composition of the original study participants, hence our need to use text analysis and manual curation to extract data and metadata. In contrast, PharmGKB has adopted specific guidelines to assign a biogeographical group to each association [23]. According to this scheme, Africa is represented by "Sub-Saharan African" and "African American/Afro-Caribbean" groups, while the North African region was assigned to the "Near eastern" group. These annotation guidelines are inadequate, as the Sub-Saharan African region is not ethnically and culturally homogenous [5]. The labels also ignore the admixture landscape of the North African population [24][25][26] that is significantly different from the broad near-eastern population. In addition to these issues, manual curation showed that the description of biogeographical information is either lacking, inconsistent or inaccurate in the biomedical literature. Providing a standardized ethnolinguistic ontology will be beneficial in annotating genomic medicine data. Some of the variants in AGMP were assigned to the Sub-Saharan and North African geographical regions because we could not find enough information to annotate the country of origin. Since it is difficult to use a global annotation scheme for assigning biogeographical data to the genotype-phenotype association, it would be valuable to establish clear guidelines on how to ensure broad coverage of the genetic diversity in Africa. This highlights the need to integrate geographical information system data with genomic data, which could and would provide a valuable asset to be linked to the AGMP portal.
The under representation of some African regions in public databases has been extensively discussed in several papers [2,3,27,28]. Very few entries have been assigned to the central-African region compared with the other African regions. It is not clear, however if this unbalance is caused by the lack of reliable studies originating in central Africa or the lack of data related to the region generated by consortia or submitted to databases. Moreover, to facilitate automated extraction of genotype-phenotype associations, collection and reporting of standardized phenotypic data was also highlighted as being a challenge.
The disproportionate skew in the geographical distribution of data may be due to numerous factors; one of the main factors being the data representation in PharmGKB and DisGeNET. Both resources contain more data from North Africa as a result of many mixed studies with southeastern European, Middle Eastern and Francophone countries. In addition, DisGeNET contains a large proportion of case studies from North Africa and has less representation of other regions of Africa. This skew does not exist in all other biomedical databases (e.g., GWAS Catalog), and as more resources are incorporated into the Portal, the distribution should become more evenly spread. As such, the Portal will be continually updated and new data sources will be included in the future. One of the main aims of developing the Portal was to highlight such knowledge gaps and discrepancies within Africa so that we can better address these research questions in the future.
We expect the AGMP to be a crucial source of information for the ongoing African research activities in pharmacogenetics and genomic medicine. As a product of H3ABioNet, a member of the H3Africa consortium, we have the opportunity to promote the AGMP among a large community of African genetics researchers, which will allow us to update, improve and diffuse its content. Besides its informative added value, throughout the processes of data profiling, curation and web application development, the project enabled H3ABioNet to develop a solid core of experts, including bioinformaticians, curators, computer scientists, clinical biologists and data scientists. The core group will have a critical role in ensuring the sustainability and update of the portal and to maintain its relevance to genomic medicine-related activities in Africa.

Conclusions
The AGMP is a user-friendly resource that allows users to query curated African genomic variants linked to either clinical phenotypes or to drug response using gene, disease, variant and drug names. This portal should be of interest to the African scientific community, physicians and genomic medicine stakeholders worldwide, as it provides informative data for those working with individuals of African descent. In addition, our work emphasizes the importance of population sampling coupled to GIS data acquisition that can improve the granularity of the geographical location to better represent the genetic diversity in Africa. It would also help in linking non-genetic factor contributions to phenotype expression.
Two strategies have been proposed to extend the capacity of AGMP and improve the content and features of the resource. In the first, a curation group is established to run updated versions of the data mining, cleaning and integration workflows with more robust methods. This includes the training of predictive models that could process, in high-throughput mode, Medline entries that are not necessarily described in DisGeNET or PharmGKB. This will also allow communicating information about variants that are specific to Africa or with higher prevalence in Africa. Additionally, the group will implement reliable methods to automate the processes of data extraction and curation by utilizing APIs and entity-relationship detection from the biomedical literature. The data curation team will also determine the feasibility-to-adopt in integrating new data sources to capture African data in the GWAS catalog and GWASdb. Second, we aim to increase engagement with the scientific community involved in genomic medicine research in Africa by promoting the portal to the wider community, establishing an extended network of experts and curators, and adapting the portal user interface to allow for direct submission of the data. Finally, in order to provide more detail related to the studies captured in the Portal, a notes section will be included in upcoming versions of the Portal, providing additional study information such as study type and sample size.
The portal's primary aim is to facilitate genomic medicine research by researchers and clinicians in Africa or those working on African-ancestry populations. Secondarily, the portal may also serve as a guide for healthcare workers for diagnosis and prognosis of diseases and adverse drug reactions within their patient populations in the future and improve their understanding of the genetic underpinnings of precision medicine in Africa. However, it should be noted that the portal is a research tool and should not be used for clinical decision-making; rather, it is intended to provide doctors with the developing knowledge of pharmacogenetics and precision medicine that may be relevant for their patient population.

Informed Consent Statement:
There are no newly generated data in this project; this is a secondary analysis of data that had been generated and studied for other purposes.

Conflicts of Interest:
The authors declare no conflict of interest. The authors have no other relevant affiliations or financial involvement with any organisation or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.