CleanSeq: A Pipeline for Contamination Detection, Cleanup, and Mutation Veriﬁcations from Microbial Genome Sequencing Data

: Contaminations frequently occur in bacterial cultures, which signiﬁcantly affect the reproducibility and reliability of the results from whole-genome sequencing (WGS). Decontaminated WGS data with clean reads is the only desirable source for detecting possible variants correctly. Improvements in bioinformatics are essential to analyze the contaminated WGS dataset. Existing pipelines usually contain contamination detection, decontamination, and variant calling separately. The efﬁciency and results from existing pipelines ﬂuctuate since distinctive computational models and parameters are applied. It is then promising to develop a bioinformatical tool containing functions to discriminate and remove contaminated reads and improve variant calling from clean reads. In this study, we established a Python-based pipeline named CleanSeq for automatic detection and removal of contaminating reads, analyzing possible genome variants with proper veriﬁcations via local re-alignments. The application and reproducibility are proven in either simulated, publicly available datasets or actual genome sequencing reads from our experimental evolution study in Escherichia coli . We successfully obtained decontaminated reads, called out all seven consistent mutations from the contaminated bacterial sample, and derived ﬁve colonies. Collectively, the results demonstrated that CleanSeq could effectively process the contaminated samples to achieve decontaminated reads, based on which reliable results (i.e., variant calling) could be obtained.


Introduction
Next-generation sequencing (NGS) technology has been widely employed as a standard procedure to distinguish diversities and variabilities (i.e., gene mutations and expressions) from environmental, laboratory, and clinical samples [1]. Following this trend, an enormous amount of sequencing data has been quickly accumulated. It is known that unexpected DNA sample contaminations are among the shared issues in NGS-based studies, especially in microbial samples where contaminants could be rapidly introduced during upstream (experimental procedures) or downstream (sample preparations or sequencing) stages [2,3]. Previous studies have shown that even a low contamination rate in the read pools, as low as 2%, could increase the inconsistency of genotyping by more than two-fold [4]. Those vulnerable conditions could negatively impact the reproducibility of the analytical results and conclusions from acquired sequencing reads [5]. Although the sequencing technology and the quality of most sequencing data have been improved significantly, computational and bioinformatical approaches for contamination and decontamination verification have become a challenge for biologists and bioinformaticians [6]. For most biologists, available experiments like PCR or sequencing of PCR products have been routinely employed to identify contaminants or mutations from genomic samples [7]. Decontaminated genomic sequencing data with clean reads are the only desirable source for detecting possible variants correctly. Thus, it is beneficial to have such bioinformatical tools containing functions in the discrimination and removal of contaminating reads and improving variant calling from clean reads.
For bioinformaticians, on the other hand, there have already been some tools or pipelines developed for distinguishing contaminated reads in microbial whole genomic sequencing (WGS) samples [5]. To avoid drawing false conclusions, one of the main tasks of quality check (QC) for raw reads is to perform taxonomic classification, based on which the possible contaminated species could be identified for further bioinformatical analysis [8]. Based on the Basic Local Alignment Search Tool (BLAST) [9], Burrows-Wheeler Aligner (BWA) [10], Sequence Alignment/Map (SAMtools) [11], or the Genome Analysis Toolkit (GATK) [12], integrated tools and pipelines, including CheckM [13], Kraken [14], ConFindr [15], FastQ Screen [16], MutScan [17], DecontaMiner [18], and microDecon [19] have been developed for contamination detection or cleanup. ConFindr [15] is a Pythonbased pipeline that could detect contaminant species automatically and demonstrates a better sensitivity for detecting intraspecies contamination. microDecon [19] is an R package that claims a high cleanup rate (98.1%) in simulated samples and could effectively remove contamination across a broad range of situations. DecontaMiner [18], DeconSeq [20], and FastQ Screen are all written in Perl, and DecontaMiner is mainly designed to deal with unmapped reads for a contamination screening [18]. DeconSeq [20] could be used to remove contaminated reads from metagenomic data based on the BWA-SW algorithm, which supports a web interface. FastQ Screen validates the origin of DNA samples by quantifying the proportion of reads mapping to multiply reference genomes. FastQ Screen generates reports for the source of all predictable contaminated reads and runs BLAST if there is no match for specific reads [16]. MutScan is a C++ algorithm that could detect and visualize target mutations by directly scanning FASTQ raw data. Multiple tools can be used continuously for complete contamination data processing and analysis [17].
The use of the above tools involves some preparation steps such as data format conversion, building a running environment suitable for local operating systems, tool debugging, etc., which seems a difficult task for biologists without a bioinformatical background [21]. Meanwhile, it has also been commonly recognized that the efficiency and results of those tools could fluctuate since they are generally operating as system-dependent and derived from distinctive computational models where specific parameters are applied [7,22,23]. Therefore, an automatic and easy-to-use pipeline for contamination detection and cleanup may be of great interest to improve the efficiency, reliability, and reproducibility of WGS data. Herein, we describe CleanSeq, a Python-based bioinformatics pipeline for (1) automatically detecting, (2) removing contaminating reads, and (3) analyzing and visualizing possible genome mutations from WGS data. The application and reproducibility are proven in both simulated and publicly available datasets. Further, we also examine real datasets from bacterial samples taken during experimental evolution experiments in our laboratory using Escherichia coli, where unexpected contaminations are found. The results show that the decontaminated reads are sufficient, and the called variants from CleanSeq analysis are consistent among cultures of the original contaminated bacterial group and independently selected single colonies, indicating that CleanSeq can effectively process the contaminated samples to achieve reliable results. The trimmed reads are then analyzed in a local bacterial nucleotide (NT) database. The species and contamination information are determined based on the percentage of the reads of each species out of all reads. The reads are then cleaned up and proceeded as a cleanup file, further subjected to mutation calling. The called mutations are further verified via re-alignment, visualization, and manual comparisons. The finalized report file containing species contamination and confirmed mutations is generated.

Taxonomic Identification
Since it has been reported that the BLAST performs better than most existing tools as judged by speed and accuracy [14], we also choose to use local BLAST+ for our pipeline. The complete genome nt database was adopted from NCBI, and the BLAST command makeblastdb assembled a FASTA file to blast databases [18]. The raw reads were then mapped to all the available genomes [18] using the BLASTn algorithm with the following parameters: e-value = 1 × 10 −11 , max_target_seqs = 5 mismatches < 1, gap opens < 1, and the highest %identity value As the first step, CleanSeq detected whether the raw data (trimmed by Trimmomatic [24], randomly selected 10,000 reads) were contaminated with unexpected reads other than the targets via local BLASTn. Subsequently, CleanSeq decided whether or not to execute the cleanup program if it met both the following criteria:  Figure 1. Schematic design and data processing flow of CleanSeq pipeline. Raw data from genome DNA sequencing are directly introduced, verified by quality check (QC), and trimmed accordingly. The trimmed reads are then analyzed in a local bacterial nucleotide (NT) database. The species and contamination information are determined based on the percentage of the reads of each species out of all reads. The reads are then cleaned up and proceeded as a cleanup file, further subjected to mutation calling. The called mutations are further verified via re-alignment, visualization, and manual comparisons. The finalized report file containing species contamination and confirmed mutations is generated.

Taxonomic Identification
Since it has been reported that the BLAST performs better than most existing tools as judged by speed and accuracy [14], we also choose to use local BLAST+ for our pipeline. The complete genome nt database was adopted from NCBI, and the BLAST command makeblastdb assembled a FASTA file to blast databases [18]. The raw reads were then mapped to all the available genomes [18] using the BLASTn algorithm with the following parameters: e-value = 1 × 10 −11 , max_target_seqs = 5 mismatches < 1, gap opens < 1, and the highest %identity value As the first step, CleanSeq detected whether the raw data (trimmed by Trimmomatic [24], randomly selected 10,000 reads) were contaminated with unexpected reads other than the targets via local BLASTn. Subsequently, CleanSeq decided whether or not to execute the cleanup program if it met both the following criteria: (i) the contamination rate (contamination reads/all reads, threshold = 10%) was over 10%, (ii) genome similarity [26] between the contaminated and target species was less than 80%.
In other words, it was considered as one contamination if the proportion of a specific species other than the target exceeded 10% and the similarity between the target reference genome was less than 80%. Species information with total count and genome accession number, and the proportion out of total reads were collected.

Cleanup
Cleanup procedures were performed based on BLAST to remove contaminating reads [18,27]. Firstly, BLAST was used to map the raw data to the reference genome of target bacteria with a threshold e-value of 1 × 10 −6 . Any reads with an e-value over 1 × 10 −6 were extracted as potential reads from the targeted species. Reads from these candidates were then mapped to the reference genome of contamination species with an e-value of 1 × 10 −11 and %identity over 95% to obtain all contaminated reads for each available species. The confirmed contaminating reads were continuously removed from the targeted reads pool, and the decontaminated file was generated for further analysis, i.e., mutation calling by GATK4.

Mutation Call
A universal variant calling pipeline combining BWA [10], SAMtools [11], and GATK [12] was employed in this study. BWA aligns clean reads with the reference genome to generate a bam file, SAMtools sort, and index the bam file. GATK4 HaplotypeCaller [12] was implemented to call variants and generate VCF files compared to the provided reference genome.

Mutation Verification
The mutation verification module was based on the mutation call information. The called variants in VCF files and the reference genome were employed to generate k-mers (all possible substrings of length k) for each mutation [17,28]. k-mers were then mapped to the decontaminated reads via BLASTn, setting an e-value equal 1 × 10 −6 , to acquire reads with variants [17]. The trapped reads were filtered in accordance with the following conditions to ensure that the resulting reads contained the desired bases: mismatch < 1, gap opens < 1, q.startPos < m.pos < q.endPos The filtered reads were further extracted for verification, performed by aligning extracted reads with called mutations via Clustal Omega with an .aln file. Each alignment was visualized for manual validations as demanded. Here, an automatic verification was performed if the .aln file was misaligned or not. If the .aln file was correctly aligned, the mutation was considered real.

Report
The final output summary report showed a list of the species detected, gained clean reads, VCF file with called mutations, and verified mutations. Specifically, the report contained a list of the top ten matched species and their proportions over all existing reads, contamination information, sequence information, and GATK variants calling results with verifications.
To investigate the possible effects of genome similarity on CleanSeq, four distinct strains, E. coli (NZ_LR881938.1), P. aeruginosa (CP007224.1), S. enterica (CP007523.1), and Shigella flexneri (AE014073.1), were used to build the simulated datasets. The genome similarity was approximately 0.42% for E. coli and P. aeruginosa, 7.73% for E. coli and S. enterica, and 82.99% for E. coli and S. flexneri, respectively. E. coli was designated as the target species, while the contaminating bacteria were P. aeruginosa, S. enterica, or S. flexneri for each independent dataset, respectively. The contamination degree was set differently as 5%, 10%, 30%, 50%, and 70%. Each dataset had a sample size of 1 million reads, a read length of 150 bp, and an error rate of 0.01%. All the decontaminated reads were verified again using the reference genome accession number, and the efficiency and contamination degree were calculated: Efficiency = num of target species clean reads num of target species total reads Contamination degree = num of other species reads num of clean reads To evaluate the impact of coverage and mutation rate on the efficiency and accuracy of mutation verification in CleanSeq, we simulated a series of the dataset by dwgsim from five bacteria species, P. aeruginosa (CP007224.1), E. coli (NZ_LR881938.1), Citrobacter freundii (CP016762.1), B. fragilis (CR626927.1), and S. pyogenes (AE014074.1). The following parameters were considered in dwgsim: coverage = 30× or 100×, mutation rate = 0.01%, 0.001%, or 0.0001%, error rate = 0.01%. Additionally, one publicly available contamination dataset from bacterial WGS (https://doi.org/10.6084/m9.figshare.c.4282706.v2, accessed on 30 January 2022, [30]) was also used in this study to test CleanSeq. All the above tests were performed in triplicates (as shown in mean (SD)) and compared with the results from MutScan.
As for the final test dataset mimicking the actual experimental results, a total of 150 million (1,425,042 after cleanup of duplicates) read mixtures of E. coli (949,982,~2/3) and P. aeruginosa (475,060,~1/3) were generated in dwgsim.

CleanSeq Processes WGS Raw Data for Contamination Detection, Decontamination, and Calling Variants
As shown in Figure 1 and described in the Materials and Methods, the CleanSeq pipeline incorporated routine WGS analytical tools, such as Trimmomatic [24], BLAST [9], SAMtools [11], GATK4 [12], BWA [10], and Clustal Omega [25], to process raw reads directly. To date, some existing pipelines for microbial WGS, such as Bactopia [37], TORMES [38], ASA3P [8], and BacPipe [39], often only contain similar modules like taxonomic identification and genome assembly. To the best of our knowledge, CleanSeq is the first attempt to integrate four modules of taxonomic classification, decontamination, and variant calling and verification in one Python-based pipeline. It detects and removes contaminated reads based on BLASTn mapping and generates clean data for subsequent analysis of variant calling via GATK. All the reads containing called variants are extracted and verified by local re-alignment and visualization via Clustal Omega [25].

CleanSeq Detects and Eliminates Mutation Reads Efficiently as Compared with FastQ Screen
To test the performance of CleanSeq regarding the detection and elimination of contaminated reads, we generated a series of simulated sequence datasets via dwgsim using a mixture of ten bacterial genome DNA sequences. As demonstrated in Figure 2, coverage (3× or 30×), the genomic mutation rate in the genome (0.01% or 0.0001%), and the similarity between the target and contaminating bacteria (0.42%, 7.73% or 82.99%) were considered. Although the results seemed to be species (genome)-dependent, the efficiency of contaminating reads detection (calculated according to Materials and Methods) was generally over 91%, except for E. coli and S. flexneri, which were nearly indistinguishable at the species level, with a genome similarity of 82.99% [40,41]. The same result was also obtained for the contamination degree in reads after cleanup which was kept nicely within 0-0.15% in either high or lower coverage, a mutation rate of 1 × 10 −4 or 1 × 10 −6 , and a lower genome similarity and contamination rate. It should be noted that the decontamination efficacy (~42%) and correctness (~0.1%) dropped significantly in the condition when the similarity and mutation rate were both very high. Compared to FastQ Screen [16], the cleanup efficiency of CleanSeq was slightly higher as judged by the detection efficiency. The increase in efficiency results was minor in the contamination degree after cleanup, but it was still acceptable. The performance between CleanSeq and FastQ Screen was comparable as expected.

CleanSeq Calls Variants with Satisfactory Correctness as Compared with MutScan
Subsequently, we investigated the performance of CleanSeq for calling possible variants from decontaminated reads. As described in the Materials and Methods, simulated datasets containing reads from five different species (P. aeruginosa, E. coli, C. freundii, B. fragilis, and S. pyogenes) were employed. The results are plotted in Figure 3, showing that the overall efficiency for variant calling is above~90% in a lower coverage of 30× and can reach 95% when the coverage is 100×. The efficiency was also correlated with the genome mutation rate (number) in the genome, especially in the case of the combination of low coverage and high genome mutation rate (i.e., 3×, 1 × 10 −4 , data not shown). Since it is widely accepted that depth and coverage are the key parameters in genomic analyses [42], the quality of the original data, including the read length, depth, and coverage, should be maintained within an acceptable range for bioinformatical research so that reliable conclu-sions can be drawn. Compared to MutScan [17], both tools are utilizable and comparable as no significant differences were observed in provided conditions. 73% for E. coli and S. enterica, and 82.99% for E. coli and S. flexneri, with 30%~70% contamination rates). Two outcome values, efficiency and contamination degree after cleanup were calculated and compared to evaluate the overall performance. Efficiency refers to the ratio of clean reads in total reads. Contamination degree refers to the percentage of mutated reads in clean reads. The higher the cleaning efficiency and the lower the contamination degree, the better performance is indicated. The genome similarity between E. coli and P. aeruginosa, E. coli and S. enterica, and E. coli and S. flexner is low (0.42%), medium (7.73%), and high (82.99%), respectively.

CleanSeq Calls Variants with Satisfactory Correctness as Compared with MutScan
Subsequently, we investigated the performance of CleanSeq for calling possible variants from decontaminated reads. As described in the Materials and Methods, simulated datasets containing reads from five different species (P. aeruginosa, E. coli, C. freundii, B. fragilis, and S. pyogenes) were employed. The results are plotted in Figure 3, showing that the overall efficiency for variant calling is above ~90% in a lower coverage of 30× and can reach 95% when the coverage is 100×. The efficiency was also correlated with the genome mutation rate (number) in the genome, especially in the case of the combination of low coverage and high genome mutation rate (i.e., 3×, 1 × 10 −4 , data not shown). Since it is widely accepted that depth and coverage are the key parameters in genomic analyses [42], the quality of the original data, including the read length, depth, and coverage, should be maintained within an acceptable range for bioinformatical research so that reliable conclusions can be drawn. Compared to MutScan [17], both tools are utilizable and comparable as no significant differences were observed in provided conditions. , and genome similarities ((C,F), 0.42% for E. coli and P. aeruginosa, 7.73% for E. coli and S. enterica, and 82.99% for E. coli and S. flexneri, with 30%~70% contamination rates). Two outcome values, efficiency and contamination degree after cleanup were calculated and compared to evaluate the overall performance. Efficiency refers to the ratio of clean reads in total reads. Contamination degree refers to the percentage of mutated reads in clean reads. The higher the cleaning efficiency and the lower the contamination degree, the better performance is indicated. The genome similarity between E. coli and P. aeruginosa, E. coli and S. enterica, and E. coli and S. flexner is low (0.42%), medium (7.73%), and high (82.99%), respectively.

CleanSeq Is Practical when Applied to either Simulated or Real Experimental Datasets
Subsequently, one dataset mimicking bacterial contaminations in real experiments (E. coli contaminated with P. aeruginosa, see Materials and Methods) was employed to test CleanSeq. The results are summarized in Figure 4 and show that P. aeruginosa was successfully detected as a contaminant species at about 1/3 (32.3%) contamination rate ( Figure 4A) as expected. Due to the high genome similarity of E. coli and Shigella [40,41], S. sonnei and S. flexneri were also listed as potentially contaminating species, indicating the high accuracy of CleanSeq with a low level of analytical bias. CleanSeq also called variants correctly out of a pool of 32. The called variants were further extracted and verified through local alignment and resulted in no misalignment as listed in Figure 4B.
The output files with all information regarding the results related to this dataset are further provided in Supplemental Data S1. Additionally, one publicly available contamination dataset from bacterial WGS (https://doi.org/10.6084/m9.figshare.c.4282706.v2, accessed on 30 January 2022, [30]) was also used in this study to test CleanSeq, the results of which are also provided as Supplemental Data S2. Appl. Sci. 2022, 12,

CleanSeq Is Practical when Applied to either Simulated or Real Experimental Datasets
Subsequently, one dataset mimicking bacterial contaminations in real experiments (E. coli contaminated with P. aeruginosa, see Materials and Methods) was employed to test CleanSeq. The results are summarized in Figure 4 and show that P. aeruginosa was successfully detected as a contaminant species at about 1/3 (32.3%) contamination rate ( Figure  4A) as expected. Due to the high genome similarity of E. coli and Shigella [40,41], S. sonnei and S. flexneri were also listed as potentially contaminating species, indicating the high accuracy of CleanSeq with a low level of analytical bias. CleanSeq also called variants correctly out of a pool of 32. The called variants were further extracted and verified through local alignment and resulted in no misalignment as listed in Figure 4B. The output files with all information regarding the results related to this dataset are further provided in Supplemental Data S1. Additionally, one publicly available contamination dataset from bacterial WGS (https://doi.org/10.6084/m9.figshare.c.4282706.v2, accessed on 30 January 2022, [30]) was also used in this study to test CleanSeq, the results of which are also provided as Supplemental Data S2. To examine the performance and correctness of CleanSeq towards real experimental datasets, we subjected and analyzed the sequencing results from laboratory experimental evolution of E. coli. As demonstrated in Figure 5A, we were interested in selecting cell wallless (L-form) bacteria exposed to cell wall-targeting reagents (Mecillinam and Lysozyme) via FACS technology and a laboratory experimental evolution [35,36,43]. We employed part of the experimental results here because one of the passage lines was unexpectedly contaminated with other bacteria species during cell culture, which served as real practical material for CleanSeq. The WGS datasets from the final passage line P16 and five single colonies were then analyzed by CleanSeq for decontamination and genome-wide mutation analysis starting from the single command line: ./cleanSeq MDS42Ref.fasta ntPath SP5P16.raw_1.fastq.gz SP5P16.raw_2.fastq.gz. As mentioned above, we also compared the efficiency between FastQ Screen [16] and CleanSeq. The overall cleaning efficiency and the read number after cleanup (CleanSeq/FastQ Screen, 3688271/3489622) from CleanSeq were higher than that of the FastQ Screen, implying that CleanSeq could be considered alternatively as a pre-processing tool for decontamination. The contamination analysis based on 10,000 randomly selected reads showed that the final culture was mainly contaminated with P. aeruginosa with a 47.1% contamination degree ( Figure 5B). After the decontamination, the overall coverage of remaining reads from E. coli still could reach 133× (Supplemental Data S3), ensuring the correctness of the following variant analysis.

B . f r a g i l i s C . f r e u n d i i E . c o l i P . a e r u g i n o s a S . p y o g e n
After variant calling by GATK, we obtained all mutation information from P2 (free of contamination), P16 cell passage, and its derived five E. coli single colonies (P16S 1-5 ). As shown in Figure 5C, P16 and P16S 1-5 contained the same seven mutation sites, six of which were also detected from P2 passages. It should be highlighted here that MutScan can only call out five variants, partially because MutScan has a limitation in detecting long insertions and deletions (INDELs) [17]. Another possibility is the overcleaning of the reads, which were initially from E. coil. Those variants were then further confirmed by re-alignment (Clustal Omega) of the corresponding reads extracted using the generated six k-mers (One of the generated k-mers contained two close mutation sites, Supplemental Data S4) [17,28]. No misalignment was found for all seven variants, suggesting the high correctness of the GATKintegrated CleanSeq pipeline. The consistency of called-out genome mutations among the contaminated sample and its derived single colonies indicated that the decontamination and verification processes are sufficient for real WGS analysis like this study. A rational decision could be made according to those results on whether further experiments or reruns should be performed or not. Appl  To examine the performance and correctness of CleanSeq towards real experimental datasets, we subjected and analyzed the sequencing results from laboratory experimental evolution of E. coli. As demonstrated in Figure 5A, we were interested in selecting cell wall-less (L-form) bacteria exposed to cell wall-targeting reagents (Mecillinam and Lysozyme) via FACS technology and a laboratory experimental evolution [35,36,43]. We employed part of the experimental results here because one of the passage lines was unexpectedly contaminated with other bacteria species during cell culture, which served as real practical material for CleanSeq. The WGS datasets from the final passage line P16 and  CleanSeq were higher than that of the FastQ Screen, implying that CleanSeq could be considered alternatively as a pre-processing tool for decontamination. The contamination analysis based on 10,000 randomly selected reads showed that the final culture was mainly contaminated with P. aeruginosa with a 47.1% contamination degree ( Figure 5B). After the decontamination, the overall coverage of remaining reads from E. coli still could reach ~133× (Supplemental Data S3), ensuring the correctness of the following variant analysis. After variant calling by GATK, we obtained all mutation information from P2 (free of contamination), P16 cell passage, and its derived five E. coli single colonies (P16S1-5). As shown in Figure 5C, P16 and P16S1-5 contained the same seven mutation sites, six of which were also detected from P2 passages. It should be highlighted here that MutScan can only call out five variants, partially because MutScan has a limitation in detecting long insertions and deletions (INDELs) [17]. Another possibility is the overcleaning of the reads, which were initially from E. coil. Those variants were then further confirmed by re-alignment (Clustal Omega) of the corresponding reads extracted using the generated six k-mers (One of the generated k-mers contained two close mutation sites, Supplemental Data S4) Among the called genomic mutations, it was desirable to obtain two shared genomic mutations, djlA (Phe54Leu, missense variant) and wecA (Gln104*, stop gained) in P2 and P16 passage lines, and one specific mutation yigA (Thr94_Asp103delinsAsn, disruptive inframe deletion) in P16 passage lines. djlA is a cell inner membrane protein involved in regulating the synthesis of the colanic acid capsule, which might contribute to cell viability in harsh conditions [44,45]. wecA is also a cell inner member protein reported to initiate the biosynthesis of enterobacterial common antigen and O-antigen lipopolysaccharide (LPS) [46]. yigA is a DUF484 domain-containing protein with unknown functions. All three mutated genes found in passage lines of this study were nonessential, and both of djlA and wecA were potentially related to the cell membrane or cell wall functions and were probably targeted by cell wall-targeting reagents like Mecillinam and lysozyme used in this study [47], although no direct evidence was provided. Further study is crucial to investigate and uncover the possible involvement and mechanism of sorted cells with the above gene mutations.

Conclusions
This study designed a Python-pipeline CleanSeq to detect and eliminate contaminated reads from bacterial WGS datasets. After obtaining the clean reads, CleanSeq further verifies the possible variants claimed from GATK by local re-alignment of corresponding reads with generated k-mer references. The performance was confirmed using various datasets, either simulated or real experimental datasets derived from laboratory experimental evolution in E. coli. Based on the results compared to other existing pipelines, we concluded that CleanSeq has satisfactory correctness and could be employed as either a pre-processing tool for decontamination of raw reads or mutation calls and verifications of mutations of interest. The results from CleanSeq could also be utilized as a reference to evaluate the WGS data quality and make a decision for a further experimental plan.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/app12126209/s1, Supplemental Data S1: Output report on the simulated contamination test dataset (The output reports contain the information for the input file, a detailed list for contamination detection, and a list of the top ten contaminated species. All the statistical results for reads and mutations are also provided.); Supplemental Data S2: CleanSeq output report on the public contamination test dataset (https://doi.org/10.6084/m9.figshare.c.4282706.v2, accessed on 30 January 2022) used in this study; Supplemental Data S3: CleanSeq output report on the P16 experiment data; Supplemental Data S4: Extracted reads containing variants based on seven generated k-mers.  Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: All relevant data are within the paper and its Supporting information files. The pipeline created in this study is publicly available at https://github.com/bingxiao-wcy/ cleanSeq (accessed on 30 January 2022).