HDVdb: A Comprehensive Hepatitis D Virus Database

Hepatitis D virus (HDV) causes the most severe form of viral hepatitis, which may rapidly progress to liver cirrhosis and hepatocellular carcinoma (HCC). An estimated 15–20 million people worldwide suffer from chronic HDV infection. Currently, no effective therapies are available to treat acute or chronic HDV infection. The remarkable sequence variability of the HDV genome, particularly within the hypervariable region, has resulted in the provisional classification of eight major genotypes and various subtypes. We have developed a specialized database, HDVdb, which contains a collection of partial and complete HDV genomic sequences obtained from GenBank and from our own patient cohort. HDVdb enables researchers to investigate the genetic variability of all available HDV sequences and the correlation of genotypes with epidemiology and pathogenesis. Additionally, it will contribute to understanding drug-resistance mutations and to developing effective vaccines against HDV infection. The database can be accessed through a web interface that allows static and dynamic queries and offers integrated generic and specialized sequence analysis tools for annotation, genotyping, primer prediction, and phylogenetic analysis.


Introduction
Hepatitis D virus (HDV) infection remains the most difficult-to-treat form of viral hepatitis, affecting 15-20 million patients worldwide with chronic hepatitis, liver cirrhosis and hepatocellular carcinoma (HCC) [1]. HDV infection in humans has so far been observed only together with hepatitis B virus (HBV), because HDV needs the envelope proteins of HBV to complete its life cycle. Therefore, two main forms of HDV infection have been described: (1) coinfection, with a high rate of viral clearance in adults similar to HBV mono-infection [2], or (2) super-infection in the presence of a pre-existing HBV infection. The latter results in persistent chronic HDV infection in 70-90% of cases and is associated with an early risk of developing cirrhosis and HCC [3]. Current anti-HDV therapy is mainly based on the administration of interferon, with a very low response rate in patients [4,5] and a high chance of relapse upon discontinuation [6]. Nevertheless, recent efforts to develop new anti-HDV drugs for chronic HDV infection have shown promising results in clinical trials [7][8][9][10][11].
Viruses 2020, 12, 538
HDV is a small, spherical virus of 35-37 nm in diameter, with an envelope containing the hepatitis B surface antigen (HBsAg), which surrounds the genomic RNA-nucleoprotein complex [12]. The genome is a negative-sense single-stranded RNA (1.67 kb), whose complementary strand (antigenomic RNA) contains a single functional open reading frame (ORF) encoding two isoforms of the hepatitis delta antigen (HDAg), the small (S-HDAg, 195 aa) and the large (L-HDAg, 214 aa) [13,14]. The sequence encoding these isoforms resides in the antigenomic RNA. Editing by the cellular enzyme ADAR-1 changes the amber stop codon (UAG) of S-HDAg to UGG, which extends the amino acid sequence by 19-20 aa at the C terminus [15].
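The effect of this editing event on the length of the reading frame can be illustrated with a toy example. The sketch below does not use an HDV sequence: the short RNA string and the helper names `orf_length` and `edit_amber_stop` are assumptions, chosen only to show how replacing UAG with UGG moves the first in-frame stop codon further downstream.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def orf_length(rna: str) -> int:
    """Count the sense codons that precede the first in-frame stop codon."""
    count = 0
    for i in range(0, len(rna) - 2, 3):
        if rna[i:i + 3] in STOP_CODONS:
            break
        count += 1
    return count


def edit_amber_stop(rna: str, codon_index: int) -> str:
    """Replace the UAG codon at codon_index (0-based) with UGG."""
    start = codon_index * 3
    if rna[start:start + 3] != "UAG":
        raise ValueError("no amber stop codon at the given position")
    return rna[:start] + "UGG" + rna[start + 3:]


# Hypothetical toy ORF: 4 sense codons, an amber stop, 3 more codons, a final stop.
unedited = "AUG" "GCU" "CCA" "GAA" "UAG" "CCU" "CAG" "GGA" "UAA"
edited = edit_amber_stop(unedited, 4)

print(orf_length(unedited))  # short isoform: translation ends at UAG
print(orf_length(edited))    # long isoform: translation continues to UAA
```

In the virus itself, the same mechanism accounts for the difference between the 195 aa S-HDAg and the 214 aa L-HDAg.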
HDV RNA sequences identified so far have been classified into eight known genotypes (HDV-1-8) based on both the nucleotide and amino acid sequences of the HDAg coding region [16]. These genotypes are distributed across different geographical regions. In our recent studies, we identified and introduced different subtypes for genotype 1, genotype 2 and genotype 4 [17]. Subgrouping of the HDV genotypes identified so far into distinct clusters or subtypes has also been suggested by others [18,19]. These data provide a clearer picture of the geographical and global distribution of HDV isolates. However, it is not known how these subtypes correlate with clinical manifestation and response to therapy.
HDV-1 is the most geographically widespread genotype, distributed across major regions such as Europe, the Middle East, East Asia, America and Africa, whereas all other genotypes (HDV-2-8) are associated with distinct geographical and ethnic regions. HDV-2 and HDV-4 are found in North Asia and East Asia, respectively [20][21][22][23]; HDV-3 is found exclusively in the northern part of South America (Brazil, Peru, Colombia, Argentina, Ecuador and Venezuela) [24][25][26][27][28]; and HDV-5 to HDV-8 were previously described as occurring "only" in Africa [29,30]. However, a recent study reported HDV-8 isolates from Northeast Brazil, which presumably crossed the ocean through the slave trade in the 16th-18th centuries [31]. In humans, HDV infections with different genotypes exhibit different clinical courses and outcomes. For instance, HDV-1 strains show a broad spectrum of virulent and pathogenic phenotypes [32], HDV-2 (and HDV-4) cause milder forms of liver disease [33], whereas HDV-3 isolates are associated with outbreaks of fulminant hepatitis in South America [24]. The pathogenic properties of HDV-5-8 isolates are not well characterized [34].
For decades, HDV was thought to have evolved in humans in combination with HBV, which as a helper virus provides the viral envelope needed to form infectious particles. Recently, however, HDV-like sequences have been identified in a variety of animals and insects [35][36][37]. Sequence analysis in ducks revealed an approximately 1700-nt circular RNA genome with self-complementary, unbranched rod-like structures, and coiled-coil domains [36]. The predicted HDV-like protein discovered in ducks shares 32% amino acid similarity with the small delta antigen (S-HDAg) of the human HDV (hHDV).
This discovery of an HDV-like agent in ducks was followed by the identification of a deltavirus in snakes (Boa constrictor), designated as snake HDV (sHDV) [37]. Sequence comparison showed that the aa sequence of the snake delta antigen (sHDAg) is 55% identical to that of its human counterparts. Antiserum raised against recombinant sHDAg was used in immunohistology studies, which demonstrated a broad range of viral target cells in snakes, including neurons, epithelial cells and leukocytes. The duck and snake viruses constitute divergent phylogenetic lineages that so far appear only distantly related to the known human HDV (hHDV) isolates.
Using additional meta-transcriptomic data, highly divergent HDV-like viruses were also found to be present in fish, amphibians and invertebrates. These newly identified viruses share human HDV-like genomic features such as a small genome size of 1.7 kb in length [35].
The identification of a much broader range of hosts than initially anticipated, and the fact that the HDV RNA genome can efficiently replicate in different tissues and species, raise the possibility that HDV can be transmitted independently of HBV. Perez-Vargas et al. [38] have shown that unrelated viruses, including vesiculovirus, flavivirus and hepacivirus, can act as helper viruses for HDV by providing their envelope glycoproteins (GPs). These GPs can package HDV RNPs, allowing efficient egress of HDV particles into the extracellular milieu of coinfected cells and subsequent entry into cells expressing the corresponding receptors. In vivo studies in humanized mice indicate that HDV RNPs packaged into an HCV envelope can propagate HDV infection in the liver of coinfected mice [38].
In recent years, the amount of HDV genomic data has increased exponentially. Intensive sequencing efforts have resulted in approximately 2621 HDV nucleotide sequences (partial and full length) deposited in the DDBJ, EMBL and GenBank databases. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA) and GenBank at NCBI. These three organizations are synchronized and exchange data on a daily basis. Therefore, the sequence dataset can be retrieved from any of the platforms, e.g., the GenBank database. To exploit this large and growing collection of sequences efficiently and to facilitate sequence analysis, we sought to develop a specialized database. Databases established for other viruses, in particular HIV [39], HBV [40] and HCV [41], have proved very helpful for epidemiological and clinical studies and, more importantly, for characterizing resistance to direct-acting antiviral drugs. Here, we present the hepatitis delta virus database (HDVdb; http://hdvdb.bio.wzw.tum.de/). This comprehensive database collates HDV sequences and is mainly oriented towards the sequence analysis of HDV isolates, including complete viral genomic sequences and large and small HDV antigen sequences (L-HDAg and S-HDAg, respectively). HDVdb provides a platform for genotyping and phylogenetic analyses, including prediction of HDV genotypes for user-supplied HDV sequence entries. Moreover, the database will help in identifying emerging variants related to immune escape from the B and T cell response, as described recently [42,43], and in detecting therapy-resistant variants across different HDV genotypes, which can be correlated with clinical studies.

Materials and Methods
The HDVdb building process began with the manual retrieval of all HDV entries from GenBank hosted at NCBI [44] using the keyword "hepatitis delta virus". All results corresponding to the taxon "Hepatitis delta virus" (taxid: 12475) were considered. Currently, a total of 2621 hepatitis delta virus nucleotide sequences are deposited in GenBank. These entries contain full sequence records of HDV "complete genomic" sequences and subgenomic fragments (S-HDAg, 1-195 aa; L-HDAg, 1-214 aa) as well as partial cds sequences. GenBank entries containing complete HDV protein sequences were also incorporated. The majority of the sequences were retained to provide as much information as possible to our visitors; however, sequences shorter than 90 bases were not included in the dataset. The sequence dataset was then parsed with an automated pipeline written in Java to extract essential information for each accession number, such as strain name, genotype, country and date. In addition, 152 sequences lacking genotype information were assigned to a genotype by BLAST similarity search [45].
The HDVdb web interface is hosted on an Apache HTTP server and runs on the PHP Laravel framework. The HDVdb is updated on an annual basis. The software for automatic annotation, as well as for querying and managing the database, is implemented in Java and Bash. It makes use of all defined genotypes and their subgenotypes. The workflow of database construction is shown schematically in Figure 1.
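The metadata extraction and length filter described above can be sketched as follows. This is a minimal Python illustration, not the Java pipeline used for HDVdb: the example record, its accession `XX000001`, the function names, and the assumption that the genotype is stored in a `/note` qualifier are all hypothetical, and real GenBank records vary in how they annotate these fields.

```python
import re

MIN_LENGTH = 90  # sequences shorter than 90 bases are excluded, as stated in the text

QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_record(text: str) -> dict:
    """Extract accession, selected qualifiers and sequence length from one flat-file record."""
    accession = re.search(r"^ACCESSION\s+(\S+)", text, re.MULTILINE).group(1)
    qualifiers = dict(QUALIFIER.findall(text))
    # Everything after ORIGIN is sequence; strip position numbers, spaces and "//".
    sequence = re.sub(r"[^a-zA-Z]", "", text.split("ORIGIN", 1)[1])
    # Assumption: the genotype, if annotated at all, appears in the /note qualifier.
    match = re.search(r"genotype:?\s*(\w+)", qualifiers.get("note", ""), re.IGNORECASE)
    return {
        "accession": accession,
        "strain": qualifiers.get("strain"),
        "country": qualifiers.get("country"),
        "date": qualifiers.get("collection_date"),
        "genotype": match.group(1) if match else None,
        "length": len(sequence),
    }


def keep(record: dict) -> bool:
    """Apply the minimum-length filter."""
    return record["length"] >= MIN_LENGTH


# Hypothetical record, shortened for illustration.
EXAMPLE = '''LOCUS       XX000001     96 bp    RNA     linear   VRL
ACCESSION   XX000001
FEATURES             Location/Qualifiers
     source          1..96
                     /organism="Hepatitis delta virus"
                     /strain="demo-1"
                     /country="Germany"
                     /collection_date="2015"
                     /note="genotype: 1"
ORIGIN
        1 atgagccggt ccgagtcgag gaagaaccgc ggaggaagag aagagatcct cgagcagtgg
       61 gtggccggaa gaaagaaatt agaggaactc gagaga
//
'''

record = parse_record(EXAMPLE)
print(record, keep(record))
```

Records for which `genotype` comes back as `None` correspond to the case handled in the text by a BLAST similarity search.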

Results and Discussion
The HDVdb is accessible online at http://hdvdb.bio.wzw.tum.de/. HDVdb contains entries for human hepatitis delta virus sequences: 512 complete genome sequences, 1066 L-HDAg and S-HDAg sequences and 1281 partial cds nucleotide sequences, as well as protein sequences for L-HDAg and S-HDAg. These sequences can be downloaded directly from the database for further analysis. Links to the protein sequences for both L-HDAg and S-HDAg are provided on the home page. Additionally, we included 13 complete genome sequences (Accession MH457142-MH457154) and 116 L-HDAg sequences (Accession MF175257-MF175360, MH447633-MH447644) retrieved from six different medical centers of our European study cohort [17]. In this study, sequence conservation at each position across the entire length of the 322 multiply aligned genome sequences (i.e., genotype-1) was visualized. The multiple sequence alignments were performed using MUSCLE v3.8.5551 [46], whereas the evaluation was performed using customized Ruby scripts (Figure 2). We concluded that, despite the low conservation rate throughout the HDV genome, genotyping results did not differ significantly between the whole genome and the L-HDAg-encoding region.
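Per-position conservation across a multiple alignment can be computed in several ways, and the metric used by the Ruby scripts is not specified here. The sketch below assumes one common definition, the fraction of sequences that carry the most frequent non-gap character in each column. The function name and the toy alignment are illustrative only.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Return, per column, the fraction of sequences sharing the most common non-gap character."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        # Gaps count against conservation because the denominator is all sequences.
        scores.append(top / len(alignment))
    return scores


# Toy alignment of four sequences (hypothetical, not HDV data).
toy_alignment = [
    "ACGU-A",
    "ACGUGA",
    "ACCUGA",
    "AUCU-A",
]
scores = column_conservation(toy_alignment)
print(scores)
```

Dividing by the total number of sequences rather than by the number of non-gap characters is a design choice: it keeps sparsely covered columns from appearing fully conserved.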
The HDVdb is divided into a static and a dynamic part, as shown in Figure 3. The static part gives the user access to general information about HDV. The homepage provides a summary of the current number of S-HDAg, L-HDAg and complete genome sequences in the database for all eight known genotypes. The user can retrieve pre-compiled protein and nucleotide datasets for complete genomes, L-HDAg, and S-HDAg separately for each genotype or, alternatively, download a single FASTA file containing these datasets for all genotypes. In addition, the database provides a tutorial with the technical information required to use the available tools.
The dynamic part allows the analysis of user-provided queries. The homepage presents an interactive search box with options to search sequences by accession number, genotype, country and date for protein, complete genome, L-HDAg and S-HDAg coding sequences, as well as partial sequences. Nucleotide and protein query sequences can be genotyped using the "Identify HDV sequences by genotyping" option. The web service uses BLAST [45], which performs local alignments and scores the most relevant sequences to assess the query genotype. A minimum of 75% identity against the database is required to assign the query sequence to one of the HDV genotypes. This threshold prevents false-positive classifications and is based on our previous research [17].
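The thresholding step can be sketched on BLAST tabular output (`-outfmt 6`), whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue and bitscore. How the web service selects the deciding hit is not described here; ranking hits by bit score is an assumption, as are the function name, the reference identifiers and the accession-to-genotype lookup table.

```python
from typing import Dict, List, Optional

MIN_IDENTITY = 75.0  # percent identity threshold stated in the text


def assign_genotype(blast_rows: List[str], genotype_of: Dict[str, str]) -> Optional[str]:
    """Return the genotype of the best-scoring hit, or None if it falls below the threshold."""
    best = None  # (subject id, percent identity, bit score)
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        subject, identity, bitscore = fields[1], float(fields[2]), float(fields[11])
        if best is None or bitscore > best[2]:
            best = (subject, identity, bitscore)
    if best is None or best[1] < MIN_IDENTITY:
        return None
    return genotype_of.get(best[0])


# Hypothetical hits and reference labels, for illustration only.
genotypes = {"REF_A": "HDV-1", "REF_B": "HDV-3"}
rows = [
    "query1\tREF_A\t88.5\t600\t69\t0\t1\t600\t1\t600\t1e-150\t540",
    "query1\tREF_B\t76.0\t600\t144\t0\t1\t600\t1\t600\t1e-90\t330",
]
weak_rows = ["query2\tREF_B\t61.2\t300\t116\t0\t1\t300\t1\t300\t1e-20\t95"]

result = assign_genotype(rows, genotypes)
unclassified = assign_genotype(weak_rows, genotypes)
print(result, unclassified)
```

Returning `None` instead of the nearest genotype keeps weak matches visibly unclassified, in line with the stated aim of avoiding false positives.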
Furthermore, we integrated computational tools for multiple sequence alignment (Clustal Omega, version 1.2.3 [47]), primer design (Primer3, version 2.3.7 [48]) and phylogenetic analyses (Phylip (PhyML), version 3.696 [49]) (Figure 3). On completion, the user can also visualize the resulting phylogenetic trees graphically with FigTree, version 1.4.4. Requests to and responses from these services are handled using the PHP Laravel framework and Bash scripting.

Conclusions
Hepatitis D has received a lot of attention in recent years, resulting in a flood of new findings and information, including next-generation sequencing data. However, a platform capable of collecting and analyzing this growing body of data has so far been missing. Here we introduced HDVdb as a comprehensive database of human HDV sequences, with the potential to expand to the recently identified isolates from animals and insects. HDVdb allows the user to download structured data for all known HDV sequences. It also allows the user to perform comparative sequence analysis on these data using multiple bioinformatics services available directly on the HDVdb website.