Whole-Genome Sequencing and Comparative Genomics of Three Helicobacter pylori Strains Isolated from the Stomach of a Patient with Adenocarcinoma

Helicobacter pylori is a common pathogen associated with several severe digestive diseases. Although multiple virulence factors have been described, it is still unclear the role of virulence factors on H. pylori pathogenesis and disease progression. Whole genome sequencing could help to find genetic markers of virulence strains. In this work, we analyzed three complete genomes from isolates obtained at the same point in time from a stomach of a patient with adenocarcinoma, using multiple available bioinformatics tools. The genome analysis of the strains B508A-S1, B508A-T2A and B508A-T4 revealed that they were cagA, babA and sabB/hopO negative. The differences among the three genomes were mainly related to outer membrane proteins, methylases, restriction modification systems and flagellar biosynthesis proteins. The strain B508A-T2A was the only one presenting the genotype vacA s1, and had the most distinct genome as it exhibited fewer shared genes, higher number of unique genes, and more polymorphisms were found in this genome. With all the accumulated information, no significant differences were found among the isolates regarding virulence and origin of the isolates. Nevertheless, some B508A-T2A genome characteristics could be linked to the pathogenicity of H. pylori.


Introduction
Helicobacter pylori is a Gram-negative bacterium that persistently infects the human stomach inducing chronic inflammation. The pathogenicity of H. pylori ranges from asymptomatic colonization to bacterially mediated oncogenesis [1]. This bacterium can cause several gastrointestinal diseases, such as gastritis, peptic ulcer disease, gastric adenocarcinoma, and mucosa-associated lymphoid tissue (MALT) lymphoma.
The genomes of H. pylori are highly variable, with considerable allelic diversity. This genomic plasticity is thought to aid in its adaptation, which is essential for its survival in different human populations [2]. It has been reported that isolates associated with different geographic areas, different diseases and different individuals might have variable genomic features [3][4][5]. Several genes, such as vacA, cagA, and iceA, among others, have been identified as markers of enhanced pathogenicity of H. pylori [6].
All H. pylori strains carry the vacuolating cytotoxin gene (vacA). This gene contains three highly variable polymorphic regions, which are the signal sequence region (s1, s2), the intermediate region (i1, i2, i3) and the mid region (m1, m2). Cytotoxicity is related The high number of these factors and allelic variation of the involved genes generates a highly complex scenario and reveals the difficulties in testing the contribution of each individual factor. The association of gastric cancer with the H. pylori infection has led to its systematic eradication. However, the progression from infection to the development of cancer continues to be unclear, promoting great interest in clarifying this issue. The aim of this study was to investigate possible relevant differences among genomes of strains isolated from tumoral gastric tissue and from non-tumoral tissue. Both samples were obtained from the same patient who suffered from gastric cancer.

Results
2.1. B508A-S1, B508A-T2A, and B508A-T4 Genome Comparison Basic characteristics of these genomes are shown in Table 1. The average genomic GC content for each strain was 39%, perfectly fitting with the standards of the species, which range from 38% to 39% [22]. The genome length of B508A-S1, B508A-T2A and B508A-T4 was 1,580,553 bp, 1,578,500 bp and 1,585,256 bp, respectively. The total number of genes was 1577 for B508A-S1 and B508A-T2A, and 1581 for B508A-T4. The genomes of B508A-S1, B508A-T2A and B508A-T4 contained a total of 1445, 1442 and 1444 proteins or coding genes, respectively. On average, 624-fold genome coverage of sequence data was used to create draft genome assemblies using the a5 assembler. The number of scaffolds per genome in each assembly ranged from 25 to 45 (average: 35.6 scaffolds) and a median N50 of 143,796 bp. Using Roary, a total of 1456 clusters of orthologous sequences were identified as core genes (>95% of the strains) and 86 were just present in a subset of isolates. The whole gene set, defined as the pangenome, included 1542 genes.
The average nucleotide identity (ANI) values were higher than 99.50% in all comparisons, as expected for strains of the same species. The most similar strains, with an ANI value of 99.94%, were B508A-S1 and B508A-T4. Regarding the B508A-T2A isolate, it had an ANI value of 99.58% with the two other strains.
Comparison of the three genomes using OrthoVenn2 showed 1435 clusters of orthologous genes, where 1394 were shared among all strains, two between B508A-T2A and B508A-T4, 21 between B508A-T4 and B508A-S1, and 12 between B508A-S1 and B508A-T2A ( Figure 1). The differences among them were mainly in outer membrane proteins, methylases and flagellar biosynthesis proteins (Table S1). Regarding unique clusters, strains B508A-S1 and B508A-T2A had just one, while strain B508A-T4 had four. Each unique cluster had a double copy of the protein. The four unique clusters of B508A-T4 consisted of two Hop family outer membrane proteins, one transposase and one hypothetical protein.
The unique cluster of B508A-T2A corresponded to an outer membrane beta-barrel protein.
Additionally, the unique cluster of B508A-S1 was found to be a restriction endonuclease. In total, B508A-S1, B508A-T2A and B508A-T4 showed 1428, 1409 and 1421 gene clusters, respectively ( Figure 1).  Focusing on the analysis of the unique genes using Roary, the genomes of B508A-T2A, B508A-T4 and B508A-S1 had 38, 12 and four unique genes, respectively, most of them codifying hypothetical proteins with unknown function. A graphical representation of the distribution of unique genes from the comparison of the three genomes is shown in Figure 2. The complete annotation of unique genes with known function can be found in the Table S2. Nucleotide polymorphisms and indels between genomes were obtained using Snippy. On the one hand, few variants, 169 in total, were found between B508A-S1 and B508A-T4. On the other hand, the number of variants between B508A-T2A and the other genomes was higher, specifically 3344 with B508A-S1 and 3316 with B508A-T4 (Table 2). Focusing on the analysis of the unique genes using Roary, the genomes of B508A-T2A, B508A-T4 and B508A-S1 had 38, 12 and four unique genes, respectively, most of them codifying hypothetical proteins with unknown function. A graphical representation of the distribution of unique genes from the comparison of the three genomes is shown in Figure 2. The complete annotation of unique genes with known function can be found in the Table S2.  Focusing on the analysis of the unique genes using Roary, the genomes of B508A-T2A, B508A-T4 and B508A-S1 had 38, 12 and four unique genes, respectively, most of them codifying hypothetical proteins with unknown function. A graphical representation of the distribution of unique genes from the comparison of the three genomes is shown in Figure 2. The complete annotation of unique genes with known function can be found in the Table S2. Nucleotide polymorphisms and indels between genomes were obtained using Snippy. On the one hand, few variants, 169 in total, were found between B508A-S1 and B508A-T4. On the other hand, the number of variants between B508A-T2A and the other genomes was higher, specifically 3344 with B508A-S1 and 3316 with B508A-T4 (Table 2). Nucleotide polymorphisms and indels between genomes were obtained using Snippy. On the one hand, few variants, 169 in total, were found between B508A-S1 and B508A-T4. On the other hand, the number of variants between B508A-T2A and the other genomes was higher, specifically 3344 with B508A-S1 and 3316 with B508A-T4 (Table 2). This fact reflects the special distinction of the strain B508A-T2A with the other strains. Most of the variability was found in coding genes with unknown function (data not shown). Regarding the coding genes with known function, the greatest variability was found in flagellar proteins, proteins of LPS biosynthesis and OMPs.
The circular representation of genomes obtained with the comparative genomic tool CG-View (Circular Genome Viewer) is available in Figure S1. They are markedly similar among them, without visual difference being detected on this representation. It would be necessary to significantly zoom-in the image in order to perceive the minor differences.
The functional distribution of the genomes was obtained with the Rapid Annotation Subsystem Technology (RAST) server ( Figure S2). It showed the same subsystem distribution statistics for all the strains, with few differences in the categories of DNA and protein metabolism, motility, and chemotaxis. Remarkably, in genome B508A-T2A, a unique protein was found, which was classified as a type I restriction modification system, specifically subunit S (EC 3.1.21.3), which could be of interest because it is present on the strain with more virulence determinants, as described in next section.

Virulence Factors and Antimicrobial Resistances
The analysis of the results from Diamond shows that all three strains contain all the urease genes, the flagella forming genes and most of the virulence factors analyzed, including vacA and iceA (Table 3). By contrast, they are babA/hopS and sabB/hopO negative. There are some differences in the number of copies in the babB/hopT adherence gene and in the genes involved in immune evasion (futA, futB and futC genes). All isolates have the vacA i2/m2 genotype (Table 1). Furthermore, on the one hand, the strain B508A-T2A has the s1 allele. On the other hand, the isolates B508A-S1 and B508A-T4 are vacA s2, as evidenced by the size of the fragment within the primers ( Figure S3). It should be noted that conventional polymerase chain reactions (PCRs) could not discriminate between alleles. None of the three isolates has the cagPAI nor any other genomic island (Table 1).
Possible CRISPR (clustered regularly interspaced short palindromic repeats) sequences were sought using CRISPR-Finder. No confirmed sequence was found, only two questionable ones in each of the genomes. The sequences of the two spacers were identical in all cases and located in the middle of a coding gene which was annotated as a toxin. The locus tags for B508A-S1, B508A-T2A and B508A-T4 were DDP45_03600, DDP35_03380 and DDP36_03845, respectively. Furthermore, PILER-CR-tool for rapid identification and classification of CRISPR repeats-could not find any putative CRISPR array.
No resistance genes were found using the database ResFinder. The sequences of genes pbp1 (amoxicillin), 23S rRNA (clariythromicin), 16S rRNA (tetracycline), frxA, rdxA (metronidazole) and gyrA (levofloxacin) were manually checked to confirm this statement and, indeed, no mutation responsible for the resistance was found (Table 4).

H. pylori Genetic Population
In the phylogenetic tree obtained with MLST (multilocus sequence typing) analysis, four of the actual populations can be observed. The most distant is hpAfrica2, due to its ancestral evolution. Population hpAfrica1 is clearly separated from hpAfrica2 and the other ones. The population hpEastAsia expands from within the hpEurope population, showing the fact that East Asian H. pylori diverged from European lineages, as suggested by Kawai et al. [23]. Finally, circled in this phylogenetic tree are the three strains of this study, which lie within the hpEurope population ( Figure 3). Table 3. Virulence factors analyzed with Diamond. Gene copy number is indicated.

Virulence Factors
Related Genes B508A-S1 B508A-T2A B508A-T4   Table 4. Nucleotides/amino acids found in the described mutational positions in resistance genes for each antibiotic.

Mutational Position
Susceptible Strains B508A-S1 B508A-T2 four of the actual populations can be observed. The most distant is hpAfrica2, due to its ancestral evolution. Population hpAfrica1 is clearly separated from hpAfrica2 and the other ones. The population hpEastAsia expands from within the hpEurope population, showing the fact that East Asian H. pylori diverged from European lineages, as suggested by Kawai et al. [23]. Finally, circled in this phylogenetic tree are the three strains of this study, which lie within the hpEurope population ( Figure 3).

Discussion
Even though hundreds of H. pylori genomes have been published, few studies focused on different strains obtained from the same patient at a single point in time [24]. Recently, we suggested that the predominant pattern of H. pylori infections in humans are events of multiple infections, including a predominant strain and multiple minority H. pylori strains [25].
This study focused directly on three strains, isolated from a patient suffering gastric cancer. The aforementioned study showed, through amplicon sequencing of housekeeping genes, that these three strains were a result of microevolution events by mutation of an original strain infecting the gastric mucosa. Additionally, the interest on using these

Discussion
Even though hundreds of H. pylori genomes have been published, few studies focused on different strains obtained from the same patient at a single point in time [24]. Recently, we suggested that the predominant pattern of H. pylori infections in humans are events of multiple infections, including a predominant strain and multiple minority H. pylori strains [25].
This study focused directly on three strains, isolated from a patient suffering gastric cancer. The aforementioned study showed, through amplicon sequencing of housekeeping genes, that these three strains were a result of microevolution events by mutation of an original strain infecting the gastric mucosa. Additionally, the interest on using these strains lies in the finding that the strain B508A-T4 (isolated from tumoral tissue) was found genetically closer to the strain B508A-S1 (from non-tumoral tissue) than to the strain B508A-T2A (from the same tumoral tissue).
The ANI values of the B508 isolates were ≥99.58%, being consistent with the definition of species [26], and it is even more consistent considering they come from microevolution events. As shown by Palau et al. [25,27], the strain B508A-T2A is the most distant among them. Therefore, the aim of the study was to search within these three strains for genetic markers that could be linked with the level of virulence, pathogenicity or the risk of developing gastric cancer.
The strain B508A-T2A is the most divergent isolate, as seen by multiple factors. One of them is the length of its genome, being the smallest of the three with a difference of 2053 bp with B508A-S1 and 6756 bp with B508A-T4, respectively (Table 1). It also has fewer coding genes than the rest, indicating a gene loss. Any change in selection pressure might contribute to a gene loss, which might be done in several ways: selection for a smaller genome, some genes become actively deleterious, or some genes become less necessary [28]. Adoption of a pathogenic lifestyle can lead to recurrent changes in selection pressures, due to the host adaptive immune responses [29,30]. Such changes have been observed to occur even within the same strain during the course of infection [31]. Hence, the present results are consistent with the events of microevolution.
Unique genes obtained with Roary ( Figure 2) show the relative abundance of the unique genes found in the comparison among the three isolates. The strain B508A-T2A is the one with more unique genes. It can clearly be seen that the area describing hypothetical proteins is much bigger than the rest, showing there are many undiscovered functions. Strain B508A-T4 showed a unique vacuolating cytotoxin autotransporter, although caution must be taken when assigning its real function, because just a small percentage of these proteins have been proved to be able to translocate proteins [32]. Consequently, further studies have to be undertaken in this direction to understand the role of this protein.
As seen in our results, RAST and OrthoVenn analyses point at restriction modification systems, which have been described to have a relevant role in the regulation of gene expression and in modulating virulence [33]. These restriction modification systems are important providers of defense against foreign DNA. Furthermore, in order to avoid destroying its own DNA, methyl groups are added to the sequences by methyltransferases, which also appear in our results.
Also important to mention is that the number of shared genes of strain B508-T2A with any of the two other strains is lower than the number of shared genes between these two other strains (Figure 1), thus, making strain B508-T2A a more differentiated strain.
After studying all the types of polymorphisms among the three strains, this work continued with the focus on the differences between the most different strain, B508A-T2A, and the other two. Specifically, nonsense polymorphisms were sought. Being B508A-T2A the most differentiated strain and isolated from tumoral tissue, the search for these differences serves to highlight possible signs of differentiations between virulent and nonevirulent strains. Even though only one strain is here considered, the current results provide hints to narrow the search for possible candidates.
Changes in the expression of bacterial surface structures, such as OMPs, are anticipated to facilitate adaption of the bacterium to the new human host [34]. A total of 21 nonsynonymous or missense single nucleotide polymorphisms (SNPs) were related to outer membrane proteins. Specifically, on the hof family of adhesins, six were situated in the hofB gene, two in the hofG gene and one in the hofH gene. The other 12 involved bamA, an outer membrane protein assembly factor. Additionally, eight non-synonymous SNPs were detected on the fliI gene and a single one in the flgG gene. These differences may be of relevance since colonization is the basis of the inflammatory reaction induced by H. pylori [35]. Consequently, the motility of H. pylori, and in particular the flagellum, is a critical colonization determinant that affects the infection outcome.
The LPS of this organism plays a key role in its colonization and persistence in the stomach. In addition, the LPS of H. pylori modulates pathogen-induced host inflammatory responses. These responses may result in chronic inflammation within the gastrointestinal tract. Very little is known about the LPS compositions of different strains of H. pylori with varied degree of virulence in human [36]. Here, 27 missense SNPs in the same LPS biosynthesis protein, a glycosyltransferase, were found in the strain B508A-T2A when comparing it with the two other strains.
Different virulence factors have been described until now, related to persistent colonization of the gastric mucosa, toxin expression or immune evasion [10,37,38]. Some of these pathogenicity factors have been associated with increased risk of gastric cancer: cagA+ and vacA s1i1m1 genotypes and the protein expression of AlpA, OipA, BabA, and SabA [39][40][41]. The three studied genomes are cagA, babA and sabB/hopO negative. This is a surprising result for two main reasons. First, as mentioned above, these genes have been clearly defined in the literature as gastric cancer markers and, second, the strains B508A-T2A and B508A-T4 were isolated from tumoral tissue.
Strain B508A-T2A is the only presenting the genotype vacA s1, showing a specific feature directly linked with pathogenicity, while the others are s2. On the other hand, the three strains have the coding genes for the proteins AlpA, OipA and SabA and other genes related to pathogenicity (Table 3).
Although CRISPR-Cas systems are of vital importance in the immunological defense of certain bacteria, as they confer resistance to foreign genetic elements, no cas genes could be detected in this work, and neither CRISPR-like sequences could be defined. From the bibliography, CRISPR-like loci have been identified in H. pylori, but no CRISPR-Cas system has been found so far [42].
Regarding the study of antibiotic resistances, all three strains were defined as susceptible to the five antibiotics studied (amoxicillin, clarithromycin, levofloxacin, metronidazole and tetracycline), as none of them contained any mutation in the sites responsible for the resistances. H. pylori is nowadays on the high priority list of the World Health Organization (WHO) for research into and the discovery of new antibiotics. Resistances have arisen in the last years, leading to suboptimal eradication rates [43,44]. Here, however, these three strains show no mutations linked with resistances. This could be explained by multiple factors, such as the patient history of antibiotics intake.
Regarding the MLST analysis and its predicted phylogenetic tree (Figure 3), the three isolates of study were from the group within the hpEurope population, so the European origin of the patient is demonstrated.

H. pylori Strains and Genomic DNA Extraction and Sequencing
The present study focuses on the study of strains B508A-S1, B508A-T2A, and B508A-T4. These were isolated from the stomach of a single patient with adenocarcinoma [27]. In particular, strains B508A-T2A and B508A-T4 were obtained from the same cancerous tissue and strain B508A-S1 from non-tumoral tissue.
The selection and culture of the strains is extensively described in Palau et al. [27]. Briefly, the strains were obtained from the H. pylori collection of the Digestive Diseases Department of the Hospital Taulí (Sabadell, Barcelona, Catalonia, Spain). The strains were recovered on Columbia agar with 5% sheep blood (bioMérieux) and subcultured on Columbia blood agar or Brucella agar (BD Diagnostics, Franklin Lakes, NJ, USA) supplemented with 10% fetal bovine serum (Invitrogen, Waltham, MA, USA) at 37 • C under microaerophilic conditions for a week. Different colonies were isolated from antral biopsies B508A-S and B508A-T that were taken from the same patient (B508), B508A-S from normal tissue and B508A-T from gastric adenocarcinoma. Six colonies from each sample were analyzed by multilocus sequence analyses (MLSA) of six housekeeping genes, detecting a unique strain from B508A-S (B508A-S1 = -S2 = -S3 = -S4 = -S5 = -S6), and two different strains from B508A-T, B508A-T2A (= -T2B = -T3 = -T5 = -T6) and B508A-T4.
Genomic DNAs from all three strains were extracted using the genomic DNA extraction kit (REAL, Durviz S.L., València, Spain) following the procedure explained by Palau et al. [27]. Genome sequencing was performed using Illumina technology (San Diego, CA, USA). The sequencing library preparation and sequencing of the whole genome was done by the Centre for Genomic Regulation (CRG, Barcelona, Spain) using an Illumina Hi-Seq Sequencing with 2 × 125 bp v4 chemistry. Following, a de novo assembly of all the set of genomes was carried out using the tool a5-assembler [45]. The alignment and rearrangement of the sequences was achieved using the software Mauve [46]. The draft genome annotation was performed using the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) [47].
In order to fully characterize the genomes, different analyses were carried out. Average nucleotide identity (OrthoANIu algorithm) [49] was calculated using EzBioCloud's ANI calculator (www.ezbiocloud.net/tools/ani) (accessed on 4 February 2020). OrthoVenn2 [50] was used for the comparison of orthologous gene clusters. Snippy [51] was used to find both substitutions and insertions/deletions (indels). Later, snippy-core was used to combine the previous obtained snippy outputs into a core SNP alignment. Snippy and snippy-core were run on the Galaxy platform [52] with the default parameters.
In order to visualize the results, CGView server [53] was used to obtain a circular representation of the genomes. Additionally, the RAST and SEED [54] servers were used to compare the subsystem distribution statistics among all strains.

Virulence and Antimicrobial Resistance Genes
With the aim of finding pathogenic genes, Diamond [55] was used as a search tool against the Virulence-Factor database (VFDB). In addition, PCR was used to determine H. pylori cagA status and to genotype the vacA gene (signal, intermediate and mid-region polymorphisms), as described by Lario et al. [56]. PCR results were verified by looking for the primers in the sequence of the whole genomes and calculating the size of the fragments to determine the corresponding allele [7,9,57] (Figure S3). Furthermore, PILER-CR [58] and CRISPRCasFinder [59] were used for CRISPR identification.

Conclusions
With all the accumulated information no significant differences were found between the isolates regarding the virulence and the origin of the isolates. Still, considering that B508A-S1 and B508A-T4 are more closely related and that they originate from non-tumoral and tumoral tissue, respectively, it could be inferred that their characteristics/genes are not a cause for developing a cancerous process. On the other hand, regarding B508A-T2, being markedly distant from the other two and having been found in the tumoral tissue, it can be hypothesised that some of its characteristics could be linked to the pathogenicity of Helicobacter pylori. A further consideration to take into account is that even though two strains were extracted from a cancerous tissue, they are all missing some relevant virulence genes such as cagA, babA or sabB. Nonetheless, more exhaustive analyses are needed.