Recovering Escherichia coli Plasmids in the Absence of Long-Read Sequencing Data

The incidence of infections caused by multidrug-resistant E. coli strains has risen in recent years. Antibiotic resistance in E. coli is often mediated by acquisition and maintenance of plasmids. The study of E. coli plasmid epidemiology and genomics often requires long-read sequencing information, but recently a number of tools that allow plasmid prediction from short-read data have been developed. Here, we reviewed 25 available plasmid prediction tools and categorized them into binary plasmid/chromosome classification tools and plasmid reconstruction tools. We benchmarked six tools (MOB-suite, plasmidSPAdes, gplas, FishingForPlasmids, HyAsP and SCAPP) that aim to reliably reconstruct distinct plasmids, with a special focus on plasmids carrying antibiotic resistance genes (ARGs) such as extended-spectrum beta-lactamase genes. We found that two thirds (n = 425, 66.3%) of all plasmids were correctly reconstructed by at least one of the six tools, with a range of 92 (14.58%) to 317 (50.23%) correctly predicted plasmids. However, the majority of plasmids that carried antibiotic resistance genes (n = 85, 57.8%) could not be completely recovered as distinct plasmids by any of the tools. MOB-suite was the only tool that was able to correctly reconstruct the majority of plasmids (n = 317, 50.23%), and performed best at reconstructing large plasmids (n = 166, 46.37%) and ARG-plasmids (n = 41, 27.9%), but predictions frequently contained chromosome contamination (40%). In contrast, plasmidSPAdes reconstructed the highest fraction of plasmids smaller than 18 kbp (n = 168, 61.54%). Large ARG-plasmids, however, were frequently merged with sequences derived from distinct replicons. Available bioinformatic tools can provide valuable insight into E. coli plasmids, but also have important limitations. This work will serve as a guideline for selecting the most appropriate plasmid reconstruction tool for studies focusing on E. coli plasmids in the absence of long-read sequencing data.


Introduction
Escherichia coli is a versatile micro-organism able to survive and thrive in different ecological habitats. It is a Gram-negative facultative anaerobe that commonly resides in the human gut as a commensal bacterium [1]. However, several members of this species also harbor the potential to cause severe infections, both intestinally [2] and extra-intestinally [3], in healthcare settings [4] as well as in the community [5]. The 'success' of E. coli as a pathogen can be mostly attributed to the wide repertoire of virulence factors that strains may carry [6] and the increasing fraction of infections caused by multidrug-resistant strains [7]. Many of the antibiotic resistance genes and virulence factors present in E. coli are commonly encoded on plasmids, mobile genetic elements (MGEs) that can be horizontally disseminated [8][9][10]. Therefore, precise identification and characterization of E. coli plasmids are highly relevant from an epidemiological and clinical standpoint.
Over the past decade, Illumina short-read sequencing platforms have become a popular technology to elucidate the genomic content and molecular epidemiology of bacteria. However, the frequent occurrence of repeat elements prevents the assembly of complete replicons (plasmids and chromosomes) and often results in hundreds of contigs per genome with an unclear origin. Plasmid and chromosome contigs are intermingled in draft genome assemblies, which hampers the accurate reconstruction of plasmids. More recently, long-read sequencing platforms (Oxford Nanopore and PacBio) have successfully resolved this issue, but short-read sequencing remains the de facto standard in many microbiology laboratories [11][12][13][14].
Several fully automated bioinformatics tools are currently available to predict bacterial plasmids from short-read sequencing data. Since 2018, at least 15 different tools have been created for this purpose (Table S1). They can be broadly categorized into two main classes. The first class comprises software that produces a binary classification of contigs as either plasmid- or chromosome-derived, generating an output that predicts the complete plasmid content of a bacterial strain, often referred to as the 'plasmidome'. An accurate plasmidome prediction has proven helpful to discover the genomic location of clinically relevant genes [15][16][17][18] and their role in shaping niche specificity [19], among others. The second class consists of tools that aim to recover distinct closed plasmid sequences. The output of these tools provides, in theory, a more comprehensive picture of the plasmid content of bacteria and allows the dissemination and epidemiology of specific plasmids to be studied [20].
Here, we reviewed the different tools and strategies used to achieve binary prediction, for example fast k-mer-based searches against reference plasmid databases (PlaScope and PlasmidSeeker), exploitation of the natural distribution bias of protein-coding genes between plasmids and chromosomes (Platon), and machine learning algorithms with different underlying features (cBAR, PlasFlow, mlplasmids, PlasClass, RFPlasmid and PPR-Meta), among others. Furthermore, we benchmarked six tools aimed at reconstructing fully closed distinct plasmids for use with E. coli, by using complete E. coli genomes that were recently deposited in public databases. The strategies applied by the reconstruction tools consist of graph-based approaches (plasmidSPAdes, gplas), reference-based approaches (MOB-suite, FishingForPlasmids) and hybrid approaches which use both reference and graph information (HyAsP and SCAPP). We assessed their performance based on their ability to correctly recover different plasmids as distinct and complete predictions, including plasmids that carry clinically relevant antibiotic resistance determinants, such as extended-spectrum beta-lactamase (ESBL) genes.

Review of Plasmid Prediction Tools
We performed a systematic search of peer-reviewed publications deposited in PubMed by August 25th 2020, using the following search terms: ( This search resulted in 238 peer-reviewed publications that we manually curated to obtain a list of 17 different tools with the goal to study the plasmid content of bacteria in silico (Table S1).
In order to find tools deposited on GitHub and GitLab, we used the search term '*plasmid*'. This resulted in 229 repositories from which 7 relevant tools were added to the selection (Table S1). The GitHub location of FishingForPlasmids was obtained through personal communication with the developer.

Phylogenetic Analysis
Phylogroups were determined in silico by using ClermonTyping v1.4.0 [22]. Core- and accessory-genome distances were calculated by using PopPUNK v1.2 [23] with standard parameters. PopPUNK was also used to build a core-genome neighbor-joining tree with 1381 complete E. coli genomes downloaded from the NCBI database on 25 August 2020. Tree visualization and metadata information were integrated in Microreact [24] (Table S2).

Benchmark Data Set Selection
Isolates that were not sequenced by both long- and short-read technologies (n = 559) were excluded, as well as sequences that were predicted as Escherichia cryptic clades [25] by in silico ClermonTyping (n = 12) and genomes that exhibited a predicted accessory-genome distance larger than 0.5 by PopPUNK (n = 2). We used a script written in R (version 3.6.1) to remove genomes that had been used for developing the tested tools (n = 601). Moreover, we excluded genomes that did not carry any plasmids (n = 170), except for 19 randomly selected E. coli isolates without plasmids that were included as negative controls. In order to obtain a balanced data set, we removed a random sample of genomes isolated from farm animals (n = 161). Finally, we removed 30 genomes containing short-read-only assembled contigs that did not align to any replicon in their respective closed reference genome. The resulting data set consisted of 240 complete E. coli genomes, which carried a total of 631 plasmids (Figure S1, Table S3).

Evaluating Plasmid Diversity in Benchmarking Data
We used Mash v2.2.2 (k = 21, s = 1000) to estimate the pairwise k-mer distances of all plasmids (n = 3264) from all complete E. coli genomes (n = 1381). The obtained distances were clustered using the t-distributed stochastic neighbor embedding (t-SNE) algorithm with a perplexity value of 30, and data points (which represent individual plasmid sequences) were colored in orange if they were part of the benchmarking data set.
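To make the k-mer distance used above more concrete, the following is a minimal sketch of a MinHash-based distance estimate in the spirit of Mash, using the same k = 21 and sketch size s = 1000. It is an illustration and not the Mash implementation: the hash function (BLAKE2b), the absence of canonical (strand-independent) k-mers and all function names are assumptions made for this example. The distance formula D = -ln(2j / (1 + j)) / k, where j is the estimated Jaccard index, is the one Mash reports.

```python
import hashlib
import math


def minhash_sketch(sequence: str, k: int = 21, s: int = 1000) -> list[int]:
    """Return the s smallest 64-bit hashes over all k-mers of the sequence.

    Assumption: k-mers are hashed as-is (no canonical k-mers), with BLAKE2b.
    """
    seq = sequence.upper()
    hashes = {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(sketch_a: list[int], sketch_b: list[int], s: int = 1000) -> float:
    """Estimate the Jaccard index from the s smallest hashes of the merged sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(jaccard: float, k: int = 21) -> float:
    """Convert a Jaccard estimate into a Mash-style distance in [0, 1]."""
    if jaccard <= 0.0:
        return 1.0
    if jaccard >= 1.0:
        return 0.0
    return min(1.0, -math.log(2 * jaccard / (1 + jaccard)) / k)
```

A pairwise distance matrix built this way could then be passed to a t-SNE implementation that accepts precomputed distances; that step is omitted here.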

Analysis of the Plasmid Bins Composition
We used QUAST (v5.0.2) to align the contigs of each bin to the respective closed reference genome. An extended description of the parameters used is available in the Supplementary Materials. Based on the alignment results, we calculated precision, recall and F1-score as specified below.
If a bin was composed of contigs derived from different plasmids, precision, recall and F1-score were reported for each plasmid-bin combination.
In order to quantify the chromosomal sequence content (if any) of a bin, we defined a chromosome contamination metric as follows.
Depending on the input requirement of the respective tools (graph or contigs), we converted assembly graph nodes to FASTA format using the tool Any2Fasta (https://github.com/tseemann/any2fasta) or used the contigs produced by SPAdes, and aligned them to their respective closed reference genomes using QUAST. Based on these alignments, we calculated the maximum recall that could be obtained for the reconstruction of every reference plasmid using short-read sequencing data (Supplementary Materials).
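The bin-level metrics referred to in this section can be sketched as follows. The definitions used here are assumptions, since the exact formulas are not reproduced in the text above: precision is taken as the fraction of the bin that aligns to a given reference plasmid, recall as the fraction of that plasmid covered by the bin, F1-score as their harmonic mean, and chromosome contamination as the fraction of the bin that aligns to the chromosome. The class and function names are hypothetical, and aligned base pairs are assumed to be counted once (no overlapping alignments).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinAlignment:
    """Alignment summary for one plasmid-bin combination (hypothetical structure)."""
    bin_length_bp: int              # total length of all contigs in the bin
    plasmid_length_bp: int          # length of the reference plasmid
    plasmid_aligned_bp: int         # bin bases aligned to this reference plasmid
    chromosome_aligned_bp: int = 0  # bin bases aligned to the chromosome


def precision(a: BinAlignment) -> float:
    return a.plasmid_aligned_bp / a.bin_length_bp if a.bin_length_bp else 0.0


def recall(a: BinAlignment) -> float:
    return a.plasmid_aligned_bp / a.plasmid_length_bp if a.plasmid_length_bp else 0.0


def f1_score(a: BinAlignment) -> float:
    p, r = precision(a), recall(a)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


def chromosome_contamination(a: BinAlignment) -> float:
    return a.chromosome_aligned_bp / a.bin_length_bp if a.bin_length_bp else 0.0


def is_correct_reconstruction(a: BinAlignment, f1_cutoff: float = 0.95) -> bool:
    """Apply the F1-score cut-off of 0.95 used in this study."""
    return f1_score(a) >= f1_cutoff
```

For a bin composed of contigs from several plasmids, one `BinAlignment` would be created per plasmid-bin combination, in line with the reporting described above.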

Antibiotic Resistance Gene (ARG) Prediction
Resistance genes were predicted by running Abricate (v1.0.1) against the ResFinder database (database indexed on 19 April 2020) with reference plasmids as query, using 80% as the cut-off for both identity and coverage. The same software and parameters were used to predict the presence of ARGs in the plasmid bins generated by each of the plasmid reconstruction tools.

Evaluating Reconstruction of ARG Plasmids
For bins that carried ARGs, we calculated Recall(ARG), as indicated below.

$$\mathrm{Recall(ARG)} = \frac{\text{Nr. of correctly predicted ARGs on bin}}{\text{Total nr. of ARGs on reference plasmid}}$$

Bins that included the complete ARG content of the reference plasmid (Recall(ARG) = 1) and were linked to the correct plasmid backbone (F1-score ≥ 0.95) were considered as correct reconstructions of the ARG-plasmid.
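As an illustration of the criterion above, the sketch below computes Recall(ARG) and applies the combined rule (Recall(ARG) = 1 and F1-score ≥ 0.95). It assumes that ARGs are compared as sets of gene names, so multiple copies of the same gene are not distinguished; this simplification and the function names are choices made for this example only.

```python
def arg_recall(bin_args: set[str], reference_args: set[str]) -> float:
    """Fraction of the reference plasmid's ARGs that are present in the bin."""
    if not reference_args:
        raise ValueError("reference plasmid carries no ARGs")
    return len(reference_args & bin_args) / len(reference_args)


def is_correct_arg_plasmid(
    bin_args: set[str],
    reference_args: set[str],
    f1: float,
    f1_cutoff: float = 0.95,
) -> bool:
    """True if the bin holds all reference ARGs and matches the plasmid backbone."""
    return arg_recall(bin_args, reference_args) == 1.0 and f1 >= f1_cutoff
```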

Computational Methods to Predict the Plasmidome or Distinct Plasmids
We used a systematic search of peer-reviewed publications and two popular software-repository hosting web services and retrieved a total of 25 plasmid- or plasmidome-prediction tools (Table S1). Most of the tools (n = 24) were fully automated and harbored the potential to be included in computational pipelines. Of these 24 tools, 13 tools were designed to analyze the plasmidome of multiple species using whole-genome sequencing data as input, while 8 tools can be applied to metagenomic sequences. A total of two tools, Recycler and RFPlasmid, worked with both types of input. Notably, we found one tool (FishingForPlasmids) that was developed to exclusively study the plasmid content of E. coli.
Based on the output, most of the tools (n = 23) can be broadly categorized into one of the following three classes. The first class comprises software that predicts the plasmidome, thus producing a binary classification of contigs as either plasmid- or chromosome-derived (n = 10). The second class consists of tools that aim to recover distinct plasmid sequences (n = 11) (Figure 1, Table S1). The third class of tools seeks to facilitate the detection of known plasmids (n = 2). Below, we briefly review the computational strategies applied by 17 tools that belong to the first two categories. Four tools were excluded from this review for distinct reasons: plasmIDent uses long reads as input, plasmidID and plasmidAssembler use a similar approach to MOB-suite for plasmid reconstruction and PLACNET requires manual intervention from the user.

Binary Classification Tools
Binary classification tools take previously assembled contigs as input and classify them as being plasmid- or chromosome-derived.
PlaScope [27] and PlasmidPicker perform k-mer searches against reference plasmid databases. This strategy is very fast but limited to detecting k-mers that are present in the underlying database. Consequently, this produced high specificity and precision values but lower recall in a study that included a benchmark of PlaScope [27,28].
cBAR, PlasFlow and PlasClass all share a common underlying principle: using short k-mer frequencies and machine learning (ML) algorithms to classify metagenomic assemblies. More specifically, cBAR relies on observed differences in pentamer frequencies and uses a sequential minimal optimization (SMO) model. PlasFlow calculates the frequencies of multiple k-mer sizes (between 5 and 7 nt) and utilizes a neural-network voting classifier to integrate predictions. PlasFlow performs better than cBAR [29,30], but shows less reliable results for short contigs [31]. PlasClass addresses this issue by using a set of four logistic regression classifiers, each trained on sequences of different length [31]. Similar to cBAR, mlplasmids also relies on pentamer frequencies but uses a Support Vector Machine (SVM) model to determine the origin of contigs for a single species, and contains models for Escherichia coli, Klebsiella pneumoniae and Enterococcus faecium. Mlplasmids outperformed both cBAR and PlasFlow when classifying data derived from whole-genome sequencing experiments, and it can also accurately predict the plasmid localization of several antimicrobial resistance genes [29]. RFPlasmid [32], a recently released tool, uses a random forest classifier trained with a hybrid approach that combines chromosomal and plasmid marker genes, identified using two databases, with pentamer frequencies. This tool also works with metagenomic assemblies, albeit only for contigs from the 17 different species for which classifiers were trained. Platon exploits the natural distribution bias of protein-coding genes between plasmids and chromosomes and also analyzes higher-level characteristics of the contigs: circularization, presence of replication and mobilization proteins, and presence of oriT and incompatibility sequences [28].
Finally, PPR-Meta [33] allows simultaneous identification of both phage and plasmid fragments from metagenomes by using a Convolutional Neural Network. Notably, instead of k-mer frequencies, this tool uses one-hot matrices to represent nucleotide and amino acid sequences [33].
Despite the differences in approaches and performances, none of the aforementioned tools attempted to further sort the predicted plasmidome into individual plasmids. As a consequence, these tools are not suitable for studying the epidemiology of specific plasmids.
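Several of the classifiers reviewed in this section (cBAR, mlplasmids, RFPlasmid) use pentamer frequencies as input features. The sketch below shows one simple way to turn a contig into such a feature vector. It is not taken from any of these tools: the function name is hypothetical, k-mers containing characters other than A, C, G or T are skipped, and reverse-complement k-mers are not merged, which actual implementations may handle differently.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 5) -> dict[str, float]:
    """Return the relative frequency of every possible DNA k-mer in a contig.

    The result always has 4**k entries (1024 for pentamers), so contigs of
    different lengths map to feature vectors of the same dimension.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[kmer] for kmer in all_kmers)  # ignores k-mers with N etc.
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}
```

A vector of this kind, computed per contig, is what a model such as an SVM or a random forest would then be trained on with plasmid and chromosome labels.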

Plasmid Reconstruction Tools
Based on their computational strategies, we can roughly subdivide plasmid reconstruction tools into three different categories: (i) de novo reconstruction of plasmids using assembly graph information, (ii) reference-based approaches and (iii) hybrid approaches.
PlasmidSPAdes, Recycler, metaplasmidSPAdes and gplas [34][35][36] perform a de novo reconstruction of plasmids using assembly graph information. PlasmidSPAdes and Recycler were released in 2016 and were the first tools that exploited the information in the assembly graph for identifying individual plasmids. PlasmidSPAdes is based on the assumption that plasmids have a different copy number than the chromosome, and therefore plasmid contigs will exhibit a different read coverage than chromosomal contigs. A number of studies have shown that this tool is able to reconstruct bacterial plasmids with high recall [11,37,38], but they have also revealed two major disadvantages of this approach: (1) plasmidSPAdes fails to identify large plasmids that have the same copy number as the chromosome and (2) it has a tendency to merge different plasmids together. Recycler also tries to identify plasmid-paths in the assembly graph by using coverage information but incorporates additional data regarding the topology of the selected paths. The main rationale behind this algorithm is that selected plasmid-paths should be cyclic, coverage should be homogeneous amongst all contigs and the mates of paired-end reads should map to the same path. Recycler appears to successfully identify short plasmids but yields very low precision values for long plasmids [11,37]. This issue is partially addressed by metaplasmidSPAdes, released in 2019 as an improvement on the original prediction algorithm of plasmidSPAdes. This tool allows prediction of dominant plasmids in metagenomes, defined as plasmids with coverage exceeding that of chromosomes and other plasmids. The algorithm iteratively extracts cyclic subgraphs with increasing coverage from the metagenome assembly graph. These potential plasmid sequences are later analyzed by a naive Bayesian classifier, called plasmidVerify, that further assesses the gene content of potential plasmids.
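The coverage assumption described above for plasmidSPAdes can be illustrated with a toy example. This is a simplified sketch and not the plasmidSPAdes algorithm: the chromosomal coverage is estimated here as the median coverage of long contigs, and the length threshold, the tolerated relative deviation and the function name are all hypothetical values chosen for illustration.

```python
from statistics import median


def flag_by_coverage(
    contigs: dict[str, tuple[int, float]],
    min_long_contig_bp: int = 10_000,  # hypothetical threshold
    max_rel_deviation: float = 0.3,    # hypothetical tolerance
) -> set[str]:
    """Flag contigs whose read coverage deviates from the estimated chromosomal coverage.

    `contigs` maps a contig name to (length in bp, mean read coverage).
    """
    long_coverages = [
        cov for length, cov in contigs.values() if length >= min_long_contig_bp
    ]
    if not long_coverages:
        raise ValueError("no long contigs available to estimate chromosomal coverage")
    chromosome_coverage = median(long_coverages)
    return {
        name
        for name, (_, cov) in contigs.items()
        if abs(cov - chromosome_coverage) / chromosome_coverage > max_rel_deviation
    }
```

With a high-copy 4.5 kbp plasmid at 400x and a 100 kbp single-copy plasmid at 51x next to chromosomal contigs at about 50x, only the small plasmid would be flagged, which mirrors disadvantage (1) noted above.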
None of the aforementioned tools takes advantage of the information embedded in the nucleotide sequences of the assembled contigs to a priori simplify the task of identifying plasmid subgraphs. In contrast, gplas initially classifies assembled contigs as plasmid-derived or chromosome-derived by using mlplasmids (or PlasFlow), a tool that exploits short k-mer frequencies for achieving such classification. Subsequently, plasmid-derived unitigs act as seeds for finding plasmid-walks with homogeneous coverage in the assembly graph, using a greedy approach. Gplas generates a plasmidome network in which nodes correspond to plasmid unitigs, and edges are created and weighted based on the co-existence of the nodes in the solution space of the computed walks. Finally, this plasmidome network is queried by a selection of network partitioning algorithms for generating bins of contigs that belong to the same plasmid [36].
MOB-suite and FishingForPlasmids use a reference-based approach for reconstructing individual plasmids. MOB-suite works as a modular set of tools for clustering, reconstruction and typing of plasmids from assemblies. This software initially uses Mash [39] and a single-linkage clustering algorithm to create clusters of similar plasmids present in a reference database. Input contigs are then aligned against this database using BLAST and assigned to a plasmid cluster according to the best hits obtained. Contigs assigned to the same reference cluster constitute potential individual plasmid units. In addition, the topology of the contigs is evaluated and every circular contig is considered an individual plasmid. Finally, each identified plasmid is queried against a different database to find known replication and mobilization proteins and oriT sequences. According to the authors, MOB-suite performs better than plasmidSPAdes at correctly reconstructing plasmids from a benchmarking data set that included more than 370 plasmids from 14 different bacterial species [38]. However, the authors identified that MOB-suite splits single plasmids into different predictions more often than plasmidSPAdes. FishingForPlasmids attempts to reconstruct individual plasmids from Escherichia coli assemblies. This tool identifies plasmid-contigs by using BlastN to align each contig against a curated E. coli database. Each plasmid-derived sequence is further classified into discrete components by using a combination of PlasmidFinder and pMLST [14].
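The clustering step described above for the MOB-suite reference database (pairwise Mash distances followed by single-linkage clustering) can be sketched with a small union-find structure. This is an illustration of single-linkage clustering at a fixed distance threshold, not MOB-suite code; the function name and the threshold used in any call are assumptions.

```python
def single_linkage_clusters(
    names: list[str],
    distances: dict[tuple[str, str], float],
    threshold: float,
) -> list[list[str]]:
    """Group plasmids so that any pair closer than the threshold shares a cluster.

    Single linkage means that clusters are merged through chains of close
    pairs, even if the two ends of the chain are farther apart than the threshold.
    """
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Follow parent pointers to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: dict[str, set[str]] = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())
```

In a reference-based reconstruction, each input contig would then inherit the cluster of its best-matching reference plasmid, and contigs sharing a cluster would form one predicted plasmid.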
Finally, HyAsP and SCAPP use a hybrid approach, mixing principles from reference-based and de novo methods. In HyAsP, a set of potential plasmid contigs is first selected based on: (1) a high density of known plasmid genes, identified by using a database, (2) high read coverage and (3) a length that does not exceed a maximum threshold. These plasmid-contigs are used as seeds for finding plasmid-walks within the original assembly graph using a greedy algorithm. Plasmid-walks must satisfy the following conditions: (1) have a uniform GC content and sufficient read coverage, (2) do not have large gene-free segments and (3) total length of the plasmid-walk does not exceed a threshold. SCAPP, on the other hand, is designed for finding plasmids in metagenome assemblies. This algorithm starts by finding potential plasmid-contigs based on two strategies: (1) searching for plasmid-specific genes by using a curated database and (2) assigning a weight to each contig based on the output from PlasClass, an ML-based binary classifier. The assembly graph is then queried to find cyclic walks of uniform coverage, similar to Recycler, but prioritizing the inclusion of contigs with strong evidence of plasmid origin [40].
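The three seed-selection criteria listed above for HyAsP can be expressed as a simple filter. The sketch below is only an illustration of that idea: the data structure, the function name and all threshold values are hypothetical and do not correspond to HyAsP's actual parameters or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    name: str
    length_bp: int
    plasmid_gene_bp: int  # bases covered by known plasmid genes (from a database search)
    coverage: float       # mean read coverage


def gene_density(contig: Contig) -> float:
    return contig.plasmid_gene_bp / contig.length_bp if contig.length_bp else 0.0


def select_seeds(
    contigs: list[Contig],
    min_gene_density: float = 0.3,  # hypothetical threshold
    min_coverage: float = 20.0,     # hypothetical threshold
    max_length_bp: int = 250_000,   # hypothetical threshold
) -> list[Contig]:
    """Keep contigs that are gene-dense, well covered and not too long.

    Seeds are returned in order of decreasing plasmid-gene density, so that a
    greedy walk search could start from the strongest candidates.
    """
    seeds = [
        c for c in contigs
        if gene_density(c) >= min_gene_density
        and c.coverage >= min_coverage
        and c.length_bp <= max_length_bp
    ]
    return sorted(seeds, key=gene_density, reverse=True)
```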

The Benchmark Data Set Represents the Diversity of Sequenced Plasmids
To benchmark the aforementioned plasmid reconstruction tools, we used a data set of 240 E. coli strains, harboring 631 plasmids, for which complete genome sequences and short-read data were available from public databases. These E. coli genomes were absent from all training data sets used to develop the selected plasmid prediction tools. The majority of the genomes derived from Europe (n = 170), Asia (n = 39) and North America (n = 24) (Figure 2A). They were isolated from multiple sources such as animals (n = 103), human clinical samples (n = 27), human community samples (n = 4), environmental sources (n = 86) and unknown sources (n = 13) (Figure 2B).
To assess whether the selected genomes were a representative sample of the phylogenetic diversity of E. coli, we built a neighbor-joining tree combining our data set with 1141 complete E. coli genomes and determined the phylogroup of each of these genomes in silico. This analysis revealed that the selected genomes were distributed across the core-genome tree and that all phylogroups were represented with at least five strains (Figure 2C).
Most of the genomes carried one (n = 73), two (n = 49) or three (n = 28) plasmids, but notably some genomes contained as many as nine (n = 3), ten (n = 1) or eleven (n = 1), with a median of two (mean = 2.62 plasmids). We found a clear bimodal plasmid size distribution, with peaks around 4500 bp and 100,000 bp (Figure 2D). Consequently, plasmids with a length smaller than 18,000 bp were classified as 'small' (n = 273), while plasmids that exceeded this cut-off value were classified as 'large' (n = 358).
Next, we wanted to assess the diversity of plasmids included in the benchmark data set. We used Mash to estimate the pairwise k-mer distances of all plasmids (n = 3264) from all complete E. coli genomes (n = 1381) and clustered them with the t-SNE algorithm. Plasmids included in this study were distributed among all major clusters, suggesting that this data set is able to properly capture the diversity of the E. coli pan-plasmidome currently available at NCBI (Figure 2E).

A Third of All Plasmids Could Not Be Correctly Reconstructed by Any of the Tools
We selected six tools to reconstruct distinct plasmid sequences. These tools applied different computational strategies: graph-based (plasmidSPAdes, gplas), reference-based (MOB-suite, FishingForPlasmids) and hybrid (HyAsP and SCAPP).
The remaining plasmid reconstruction tools were not included in the analysis for various reasons: plasmidAssembler could not be installed, plasmidID predictions were not completed due to errors during execution, PLACNET required manual intervention by the user, Recycler provided suboptimal results in comparison with plasmidSPAdes and HyAsP in previous studies [11,37] and metaplasmidSPAdes uses a similar approach to plasmidSPAdes but is optimized for metagenomic samples.
We evaluated the predictions obtained with the six selected plasmid reconstruction tools in terms of (i) speed and memory requirements, (ii) the number of plasmid predictions, (iii) correct reconstruction of reference plasmids, (iv) chromosomal contamination included in predicted plasmids, and (v) correct reconstruction of ARG-plasmids.
We used a High-Performance Cluster (HPC) to run the tools with minimal resources (number of cores = 2, 4 GB of RAM per genome), and documented the total CPU time and memory required by each of them (Table 1, Figure S2). Most tools required less than 100 CPU hours to complete all predictions, except for plasmidSPAdes, which used 321.07 CPU hours. In contrast, FishingForPlasmids was the fastest tool and completed the task in 10.60 CPU hours. PlasmidSPAdes and SCAPP had the highest memory requirements, utilizing a total of 442.03 GB and 435.23 GB of RAM, respectively. The remaining tools required less than 300 GB to complete all predictions. Notably, FishingForPlasmids only required a total of 36.57 GB.
Next, we evaluated the number of plasmid predictions produced by each tool and calculated the difference between this number and the true number of plasmids present in the benchmark data set (Table 1, Figure S3). The total number of plasmid predictions ranged from 377 (FishingForPlasmids) to 2590 (HyAsP). plasmidSPAdes, MOB-suite, SCAPP and HyAsP overestimated the true number of plasmids (n = 631), while gplas and FishingForPlasmids underestimated this number. PlasmidSPAdes displayed the least deviation by producing 642 bins, and therefore exceeding the total number of plasmids by 11. Nevertheless, these absolute numbers do not reflect whether predictions were correct or incorrect.
In order to evaluate how the different tools performed at recovering E. coli plasmids as distinct and complete predictions, we studied the distributions of recall, precision and F1-score (Table 1, Figure S4A-C) for all plasmid predictions made by the tools. Based on these results, we determined an F1-score cut-off value of 0.95 to define a plasmid as correctly reconstructed (or recovered) (Figure S4D).
We found that a total of 418 (66.25%) plasmids were correctly reconstructed by at least one of the tools (Figure 3C). Out of these, only 7 (1.11%) were reconstructed by all tools concurrently, 273 (43.26%) by multiple tools and 138 (21.9%) by a single tool. Interestingly, combining MOB-suite and plasmidSPAdes predictions together achieved the correct reconstruction of 400 (63.39%) plasmids, and incorporating the predictions from the remaining tools only resulted in the reconstruction of 18 (2.85%) additional plasmids. Notably, a total of 213 (33.75%) plasmids were incorrectly reconstructed (F1-score < 0.95) by all tools, including 21 (3.32%) that were not even detected. The majority of ARG-plasmids (n = 85, 57.8%) could not be correctly reconstructed by any of the tools (Table S6).
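The counts reported above (plasmids recovered by at least one tool, by all tools, by multiple tools or by a single tool) follow from simple set operations on the per-tool results. The sketch below shows one way to derive them, assuming each tool's correctly reconstructed plasmids are available as a set of plasmid identifiers; the function name and the input format are hypothetical. As in the text, 'multiple' is taken to mean more than one tool but not all of them.

```python
from collections import Counter


def summarize_recovery(correct_by_tool: dict[str, set[str]]) -> dict[str, int]:
    """Count plasmids correctly reconstructed by any, all, multiple or a single tool."""
    n_tools = len(correct_by_tool)
    if n_tools < 2:
        raise ValueError("at least two tools are needed for this comparison")
    # Number of tools that correctly reconstructed each plasmid.
    hits = Counter(pid for ids in correct_by_tool.values() for pid in ids)
    by_all = sum(1 for n in hits.values() if n == n_tools)
    by_single = sum(1 for n in hits.values() if n == 1)
    return {
        "any": len(hits),
        "all": by_all,
        "multiple": len(hits) - by_all - by_single,
        "single": by_single,
    }
```

The gain from combining two tools, as done above for MOB-suite and plasmidSPAdes, corresponds to the size of the union of their two sets.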
We also compared the performance of the software when attempting to reconstruct small and large plasmids separately. For small plasmids, we discovered that all tools displayed similar F1-score distributions, with medians ranging from 0.95 to 0.99. However, depending on the tool, between 21.25% and 89.74% of small plasmids were not detected (Figure S6A,B). PlasmidSPAdes and MOB-suite were the only tools that achieved the correct reconstruction of most of these replicons, with a total of 168 (61.54%) and 155 (55.31%), respectively (Table 1). When considering the reconstruction of large plasmids, percentages of not-detected plasmids were much lower and ranged from 2.23% to 20.11% across tools. MOB-suite exhibited the highest F1-score values (median = 0.74, IQR = 0.17-0.97) and correctly reconstructed 166 (46.3%) of these replicons, significantly surpassing the reconstruction capacity of the rest of the tools, which ranged from 45 (12.57%) to 95 (26.54%) (Table 1, Figure S6A,B). Not surprisingly, most tools correctly reconstructed a higher fraction of small plasmids than of large plasmids, and also displayed higher F1-score values for them (Table 1, Figure S6A,B). FishingForPlasmids was the only exception, as it recovered a total of 14 (5.13%) small and 78 (21.79%) large plasmids.
To investigate how the tools performed at reconstructing ARG-plasmids, we analyzed recall, precision and F1-score values for these replicons (Figure S8B-D). Furthermore, we extracted the bins that contained antibiotic resistance genes, and explored the fraction of detected ARGs in each prediction (Recall(ARG)). An ARG-plasmid was considered as correctly reconstructed if the prediction simultaneously included all ARGs (Recall(ARG) = 1) and correctly represented the reference plasmid backbone (F1-score ≥ 0.95).
We discovered that the reconstruction of large ARG-plasmids was particularly challenging for the evaluated tools, since all of them exhibited lower F1-score values in comparison with the reconstruction of large non-ARG-plasmids (Figure S8B,E, Table 1). We excluded small plasmids from this comparison due to the low number of small ARG-plasmids present in our data set.

Discussion
A tool that is able to correctly predict E. coli plasmids will assist in identifying clinically relevant plasmids [41][42][43][44] and improve our understanding of the complex dynamics of ARG dissemination across different ecological niches [45][46][47]. From the wide range of software available to predict plasmids from short-read data, we selected six tools and benchmarked their performance when attempting to reconstruct individual E. coli plasmids, with a special focus on plasmids that carry ARGs.
A total of 418 (66.24%) plasmids were correctly reconstructed by at least one of the tools compared in this benchmark. Interestingly, 400 (63.39%) of these plasmids were recovered by combining the predictions from MOB-suite and plasmidSPAdes alone. Therefore, adding the predictions from the rest of the tools resulted only in 18 (2.85%) additional correct reconstructions.
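The gain from combining tools is a set-union calculation over the plasmids each tool reconstructed correctly. The Python sketch below illustrates this on toy data; the plasmid identifiers and the function name are hypothetical and do not reproduce the benchmark results.

```python
def recovered_by_any(correct_by_tool, tools=None):
    """Union of plasmids correctly reconstructed by at least one selected tool."""
    if tools is None:
        tools = list(correct_by_tool)
    return set().union(*(correct_by_tool[t] for t in tools))


# Toy data: tool name -> IDs of correctly reconstructed plasmids (hypothetical).
correct_by_tool = {
    "MOB-suite": {"p1", "p2", "p3", "p4"},
    "plasmidSPAdes": {"p4", "p5", "p6"},
    "gplas": {"p2", "p7"},
}

pair = recovered_by_any(correct_by_tool, ["MOB-suite", "plasmidSPAdes"])
everything = recovered_by_any(correct_by_tool)

# Plasmids gained only by adding the remaining tools to the pair.
additional = everything - pair
print(len(pair), len(everything), sorted(additional))
```

Using sets ensures that a plasmid recovered by several tools is counted once, which is what the "at least one tool" totals require.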
We observed that plasmidSPAdes correctly reconstructed the highest fraction of small plasmids (n = 168, 61.5%). This result is consistent with the observation that small plasmids usually have high copy numbers [48] and therefore exhibit a higher coverage, which in theory would facilitate their prediction using this tool. A similar success at predicting small plasmids was also reported by [11,38]. Nevertheless, it is worth noting that most small plasmids (n = 215, 79%) are represented as a single node in the assembly graph. Therefore, using a binary classification tool would be sufficient for correctly predicting these replicons.
MOB-suite correctly reconstructed a total of 166 (46.37%) large plasmids, and considerably outperformed the rest of the tools, which ranged from 45 (12.57%) to 95 (26.54%) correct reconstructions. Nevertheless, MOB-suite's performance strongly depends on its underlying database, which is enriched for Enterobacteriaceae plasmid sequences [38]. Consequently, the reconstruction capacity of this tool could be different when attempting to predict plasmids from bacterial species less frequently represented in its database.
A third (n = 213, 33.76%) of all plasmids could not be correctly reconstructed by any of the evaluated tools. In particular, the reconstruction of ARG-plasmids proved to be problematic. We hypothesize that ARG-plasmids constitute a particularly hard puzzle to solve for all compared computational approaches, for several reasons.
Firstly, ARG-plasmids usually carry a high number of repeated sequences [49][50][51][52], and therefore exhibit highly entangled assembly graphs. Secondly, ARGs are frequently located on large plasmids with low copy number, which therefore have coverage values similar to those of chromosomes [48,52]. Consequently, finding plasmid-walks with differential coverage in the assembly graphs could be challenging for all tools relying on this strategy. This hypothesis is supported by the observation that plasmidSPAdes predicted large ARG-plasmids with the lowest precision values (median = 0.47, IQR = 0.31-0.92) of all tools, indicating that these plasmids are more frequently merged with sequences derived from other replicons. Additionally, this tool failed to predict 37% of all plasmid-located ARGs, which could be explained if the contigs carrying them have coverage values similar to those of the chromosome.
Thirdly, ARG-plasmids are frequently built as mosaic-like structures, containing mobile components that can be found in different plasmid backbones [48,[52][53][54][55]. This type of genomic organization also complicates their reconstruction using reference-based methods, since databases might contain very similar fragments that are shared by a variety of plasmids. Consequently, unequivocally assigning these "shared fragments" to a unique reference plasmid (or plasmid group) could be problematic. This is supported by the results obtained using MOB-suite. This software identified the highest proportion of plasmid-derived ARGs (n = 548, 88.67%), but most ARG-plasmid reconstructions had either an incomplete ARG content (n = 47, 31.97%) or an incorrect backbone (n = 49, 33.33%). These results, in combination with the low recall values observed (median = 0.38, IQR = 0.09-0.88), suggest that large ARG-plasmids were frequently split into multiple bins.
Despite the aforementioned limitations, MOB-suite was the most effective tool at predicting ARG-plasmids in E. coli, achieving the correct reconstruction of 41 (27.89%) of these, while the rest of the tools ranged from 5 (3.4%) to 23 (15.65%) correct ARG-plasmid reconstructions. Additionally, MOB-suite was the best performing tool for prediction of ESBL-plasmids. It identified 57 (95%) plasmid-borne ESBL-genes and had a median F1-score of 0.93 (IQR = 0.72-0.97). However, it must be noted that a fraction (n = 13, 22.80%) of ESBL-plasmid predictions presented low F1-score values, implying that in these cases the contigs carrying the ESBL gene were associated with the incorrect plasmid backbone.
All tools exhibited chromosomal contamination in their predictions. Notably, FishingForPlasmids outperformed the rest of the tools and only included chromosomal sequences in 7 (1.8%) bins. The rest of the tools included chromosomal sequences in a range from 25.25% to 51.73% of the bins. Surprisingly, MOB-suite included chromosomal sequences in 297 (40.2%) bins, including 65 chromosome-only predictions (chromosome contamination = 1).
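As an illustration of how such contamination figures can be derived, the Python sketch below computes a per-bin contamination value and counts contaminated and chromosome-only bins. It assumes that chromosome contamination is the fraction of a bin's total length (in bp) that originates from the chromosome, so that a value of 1 denotes a chromosome-only prediction; this definition, the data layout and the function names are assumptions made for the example.

```python
def chromosome_contamination(contigs):
    """Fraction of a bin's length that is chromosomal.

    `contigs` is a list of (length_bp, origin) tuples, where origin is
    'plasmid' or 'chromosome' (assumed known from the reference genome).
    """
    total = sum(length for length, _ in contigs)
    if total == 0:
        raise ValueError("empty bin")
    chromosomal = sum(length for length, origin in contigs
                      if origin == "chromosome")
    return chromosomal / total


def summarize_bins(bins):
    """Return names of contaminated bins and of chromosome-only bins."""
    values = {name: chromosome_contamination(c) for name, c in bins.items()}
    contaminated = [name for name, v in values.items() if v > 0]
    chromosome_only = [name for name, v in values.items() if v == 1.0]
    return contaminated, chromosome_only


# Hypothetical bins predicted by one tool.
bins = {
    "bin1": [(50000, "plasmid"), (10000, "chromosome")],
    "bin2": [(8000, "chromosome")],
    "bin3": [(4000, "plasmid")],
}
contaminated, chromosome_only = summarize_bins(bins)
print(contaminated, chromosome_only)
```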
A fraction of the plasmids (n = 28, 4.4%) were completely absent (recall = 0) from contig sequences and nodes in the assembly graph. Interestingly, 14 of these replicons were correctly reconstructed by plasmidSPAdes when using paired-end reads as input. This suggests that the quality of the assembly affected the ability of the tools to reconstruct certain plasmids. Consequently, it is possible that plasmid predictions for E. coli could be optimized by running SPAdes with different parameters, by performing assembly with different assemblers or through construction of Illumina libraries with a different read length.
The results from our study indicate that accurate reconstruction of E. coli plasmids from short reads is still challenging using currently available bioinformatic methods. Long reads generated by Oxford Nanopore or PacBio technologies can span repeat elements in bacterial genomes and are therefore useful to obtain complete plasmid sequences. However, long reads still exhibit a lower sequencing accuracy than Illumina reads [56], and small plasmids (size < 10 kb) are frequently underrepresented or absent in Nanopore libraries [57,58]. Consequently, combining long- and short-read sequences is currently the best option for correctly reconstructing E. coli plasmids. Nevertheless, the accuracy of long reads has been increasing in recent years, mainly due to the release of improved hardware and also owing to the development of bioinformatic tools designed for read error correction [56]. It is possible that in the near future long-read-only assemblies will provide the best alternative for obtaining complete bacterial genomes.
Nonetheless, in the absence of long reads, bioinformatic tools can be applied to gain valuable insight into different aspects of the plasmidome of E. coli. MOB-suite presented the best overall performance of all tools, but its predictions were frequently contaminated with chromosomal sequences. Consequently, coupling MOB-suite with a binary classification tool could improve plasmid predictions in E. coli. Furthermore, these predictions could be used as an initial screening step for selecting interesting isolates for long-read sequencing.
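One way to picture the suggested coupling is as a post-filter on the reconstructed bins. The Python sketch below is a hypothetical illustration, not a tested pipeline: it assumes that bins are available as lists of contig names and that a binary classifier provides a plasmid probability per contig. It does not call MOB-suite or any real classifier, and the 0.5 threshold is an arbitrary choice for the example.

```python
def filter_bins(bins, plasmid_probability, threshold=0.5):
    """Drop contigs that the binary classifier considers chromosomal.

    bins: bin name -> list of contig names (hypothetical reconstruction output).
    plasmid_probability: contig name -> probability of plasmid origin
        (hypothetical classifier output); missing contigs are treated as 0.0.
    Bins left without any contig are removed entirely.
    """
    filtered = {}
    for bin_name, contigs in bins.items():
        kept = [c for c in contigs
                if plasmid_probability.get(c, 0.0) >= threshold]
        if kept:
            filtered[bin_name] = kept
    return filtered


# Hypothetical inputs.
predicted_bins = {
    "binA": ["contig_1", "contig_2", "contig_3"],
    "binB": ["contig_4"],
}
probabilities = {"contig_1": 0.98, "contig_2": 0.12,
                 "contig_3": 0.81, "contig_4": 0.05}

print(filter_bins(predicted_bins, probabilities))
```

Treating unclassified contigs as chromosomal is a conservative choice that favors precision over recall; the opposite default may be preferable when the priority is not to lose ARG-carrying contigs.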

Data Availability Statement:
The complete code and files required to reproduce the analysis of this study are publicly available at GitLab under a GPL3.0 license (https://gitlab.com/jpaganini/recovering_ecoli_plasmids).

Conflicts of Interest:
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.