The Effects of Sequence Length and Composition of Random Sequence Peptides on the Growth of E. coli Cells

Castro, Johana F.; Tautz, Diethard

doi:10.3390/genes12121913

Open AccessArticle

The Effects of Sequence Length and Composition of Random Sequence Peptides on the Growth of E. coli Cells

by

Johana F. Castro

and

Diethard Tautz

^*

Max Planck Institute for Evolutionary Biology, August-Thienemann Strasse 2, 24306 Plön, Germany

^*

Author to whom correspondence should be addressed.

Genes 2021, 12(12), 1913; https://doi.org/10.3390/genes12121913

Submission received: 17 October 2021 / Revised: 22 November 2021 / Accepted: 26 November 2021 / Published: 28 November 2021

(This article belongs to the Special Issue How Do New Genes Originate and Evolve?)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

We study the potential for the de novo evolution of genes from random nucleotide sequences using libraries of E. coli expressing random sequence peptides. We assess the effects of such peptides on cell growth by monitoring frequency changes in individual clones in a complex library through four serial passages. Using a new analysis pipeline that allows the tracing of peptides of all lengths, we find that over half of the peptides have consistent effects on cell growth. Across nine different experiments, around 16% of clones increase in frequency and 36% decrease, with some variation between individual experiments. Shorter peptides (8–20 residues), are more likely to increase in frequency, longer ones are more likely to decrease. GC content, amino acid composition, intrinsic disorder, and aggregation propensity show slightly different patterns between peptide groups. Sequences that increase in frequency tend to be more disordered with lower aggregation propensity. This coincides with the observation that young genes with more disordered structures are better tolerated in genomes. Our data indicate that random sequences can be a source of evolutionary innovation, since a large fraction of them are well tolerated by the cells or can provide a growth advantage.

Keywords:

de novo gene evolution; random peptide sequences; fitness; E. coli; protein structure

1. Introduction

New genes can arise by two alternative mechanisms [1,2,3,4,5,6]. The first is through duplication and/or recombination of existing genes or gene fragments, which later accumulate mutations that render them different from their parental genes. The second is de novo evolution from previously non-coding sequences. While this was long thought to be unlikely, there is now plenty of evidence that the process has probably been active throughout evolution [7,8,9,10,11,12,13]. However, since it is difficult to distinguish de novo evolution from duplication followed by divergence beyond sequence recognition [14], one can prove true de novo evolution only for relatively recent events, where evolutionary time has not been enough for accumulation of too many mutations [1]. Several dedicated studies on individual genes, including functional analyses, have been published [15,16,17,18,19]. In addition to this, there are well-documented cases of peptides with biological function derived from randomly synthesized sequences [20,21,22,23,24]. Overall genome comparisons between recently separated species have suggested that de novo evolved genes arise continuously with a high rate, but can also get lost at high rates [25,26,27,28]. This dynamic transformation of non-coding sequences into coding ones is very clear, especially in eukaryotes, where large parts of the non-coding genome are transcribed. Comparisons between closely related mouse populations and species revealed the transcription of these non-coding regions is subject to fast evolutionary change, such that within a time span of 10 million years the whole genome can become transcribed and thus subjected to evolutionary testing [8]. Hence, the raw material for de novo evolution, namely transcripts from initially non-coding DNA regions, is abundantly present.

Based on these insights, we previously developed an experimental approach to ask which fraction of random sequences has a potential biological function that could become subject to further adaptive evolution [29]. We expressed a library of sequences with random sequence composition in bacterial cells and monitored which sequences could provide a growth advantage or disadvantage to the cell in the context of four growth cycles of the whole library. The general experimental design for this experiment is shown in Figure 1.

The experiments showed that a surprisingly large fraction of random sequences affected cell growth, either by enhancing it, or by slowing it down. In the initial analysis, between 11 and 25% of the sequences increased in frequency in all replicates of each experiment, whereas 18 to 53%, decreased [29]. However, the study focused exclusively on the full-length peptides in the library, although the design strategy with random synthesis of the insert produces also a large number of truncated peptides with premature stop codons.

In the present study, we first reproduced the experiment, but with a lower concentration of starting library in an attempt to reduce the possible impact of very many low-frequency clones on the overall mean fitness of the complex library. Plus, we designed a new pipeline to analyze the new experiment, as well as all of the previous experiments. This new pipeline allowed us to include the clones expressing truncated peptides and to assess whether the expressed vector without insert could have a growth effect on the cells harboring it.

The new goal of this project was to explore the possible effects of shorter peptides in relation to the full-length peptides studied before. All peptides in the original analysis have common C-terminal residues (FLAG-tag—see Figure 1), which may have contributed to their stability and/or biological effects. Since naturally de novo evolved peptides would not have such a common C-terminus, it is important to verify whether the same spectrum of effects is also seen with peptides that have random C-termini. Furthermore, we wanted to explore sequence features of the peptides that could make them more or less likely to be tolerated by the cells and to be maintained in the population through several cycles of growth. Finally, we wanted to address the critical points that were raised against our original experiment, where [30] and [31] suggested a vector effect driving the patterns of peptides that rise in frequency. In this view, the vector itself would have a negative effect due to expressing a 38 amino acid peptide (or a secondary RNA structure) under induction conditions, which would be relieved when a "neutral" random sequence was replacing it, giving the impression that the "neutral" sequence acts positively. While we had argued that this effect could not fully explain the data that we had at that time [32], further analysis of this question is certainly warranted.

2. Materials and Methods

2.1. Library and Replication Experiment

We used the original library described in [29] from a stock frozen in 20% glycerol. The general design of the library and the experiment are depicted in Figure 1. In order to assess whether there could be a complexity effect, we repeated the original experiment using a 100-fold dilution of the original library and a 1-day sampling schedule, with samplings every 3 h for a total of 4 samplings in 12 h. This was done by seeding 5 µL from the stock on 25 mL LB liquid medium with 500 µg/mL ampicillin, and allowed it to grow overnight at 37 °C with constant shaking (250 rpm). After 16 h, 500 µL of the liquid culture were transferred into five 5 mL tubes containing 4.5 mL of LB medium with 10⁻³ mol/L IPTG to induce expression of the random sequences. For each cycle, 500 µL of culture from each tube were used to seed a new replicate after 3 h of growth (37 °C, 250 rpm). From the remaining bacterial culture for each replicate, 3 mL were collected and used for plasmid extraction using a QIAprep Spin Miniprep kit (QIAGEN, Hilden, Germany). Extracted plasmids were eluted in 30 µL of elution buffer and stored at -20 °C until use.

Amplicon sequencing of the library was performed using specific barcoded primers to amplify a 356-nucleotide fragment including the random sequences in a one-step PCR using PHUSION HF master mix (Invitrogen) (all primers used are listed in Supplementary Table S1). The cycling program consisted of an initial denaturation at 98 °C for 30 s, followed by 25 cycles of 98 °C for 10 s, 65 °C for 20 s, and 72 °C for 1 minute. After a final elongation step of 72 °C for 10 minutes, samples were purified using a Qiagen MinElute Gel Extraction kit. Concentration of samples was calculated through relative quantification in an agarose gel, using a Molecular Imager(R) Gel Doc(TM) XR+ System with the Image Lab(TM) Software (Bio-Rad). Barcoded samples were pooled together in equal concentrations to obtain the sequencing library. Sequencing was done using Illumina's MiSeq Reagent Kit v3 with 300 cycles to get overlapping 300-nucleotide paired-end reads.

Availabledata

In addition to sequencing data from the diluted library experiment described above, we used the original fastq files for eight experiments described in [29]. The original experiments were done following two different sampling schedules: either a 1-day course with samplings every 3 h, or a 4-day course with samplings every 24 h. In either case, four timepoints were sampled. The number of replicates, cycle duration, and experiment length for each of the experiments are summarized in Table 1. In addition to three experiments with 10 replicates of each type of sampling schedule, we used two 4-day experiments with 5 replicates. One of them (experiment 7) was done with a treatment control without induction with IPTG, while the other one (experiment 8) was sequenced more deeply (5x more reads than the other experiments) to capture even rare clones present at low frequencies in the population.

2.2. Analysis Pipeline

First, the paired end reads for each experiment were trimmed using Trimmomatic (v. 0.36), and merged using the software USEARCH10 (-fastq_mergepairs, -fastq_maxdiffs 30, -fastq_minmergelen 100) [33]. Since each read in a pair covers the entire random sequence, up to 30 mismatches were allowed between the paired forward and reverse reads. The fastq_mergepairs algorithm resolves discrepancies between the forward and reverse reads by comparing the quality score for the conflicting position in each read. It keeps the residue with the best quality score in the merged read. Merging the reads with this algorithm reduces the percentage of sequencing errors kept in each read. Note that it is not possible to account for PCR errors that have occurred during the library preparation.

To remove reads that do not belong to a PCR product from the plasmids in the library, a custom Perl script was used to find and save all merged reads containing pre-defined sequences up- and downstream of the random sequences on the pFLAG-CTC plasmid. The pre-defined sequences were a 18 bp sequence around the start codon, and the FLAG-tag, including the stop codon. The reads thus selected are considered clean amplicon reads, trimmed around the pre-defined sequences, and used for all subsequent analyses.

2.3. Database Generation

To generate a database of all unique sequences in the library that could be detected by the amplicon sequencing approach, all clean reads from all available experiments and replicates were first dereplicated using USEARCH10. Dereplication was done in 3 rounds. In the first round, the nucleotide sequences were sorted alphabetically, and the -fastx_uniques option was used to remove duplicate sequences, keeping only one sequence of each type in the database while keeping track of the number of total sequences of each type with the -sizeout option. In this way repeated identical sequences were removed and a “size” annotation was added to the read name indicating how many identical matches were present in the clean read files. In the second round, all files with singletons removed were merged into a single file of all amplicon sequences available, sorted, and de-replicated again using the same exact-match method. This exact matching approach is prone to enrichment of PCR or sequencing errors, since any two reads with even a single nucleotide difference are kept as individual sequences in the database. Singleton reads—more likely to be PCR or sequencing errors—were removed and a third dereplication round using a clustering approach was implemented.

The third round of dereplication aimed to remove reads generated by PCR or sequencing artefacts. The clustering approach used is based on the one used for OTU validation in microbiome analyses. Reads were sorted in decreasing order of size annotation, and the -cluster_smallmem option of USEARCH10 was used with an identity cut-off of 0.97. The clustering algorithm used by USEARCH is a greedy clustering approach. Here, sorting by the size annotation means that high-frequency reads are used as centroids or seeds for clusters first. This strategy relies on the assumption that reads found in high frequencies are more likely to be real, and less-common, highly-similar reads are probably generated through PCR or sequencing errors. The identity threshold of 0.97 allows less frequent reads with, for example, up to 5 mismatches in the expected 195-nucleotide sequence to join the high-frequency centroids forming the clusters. Using an additional filter of minimum cluster size of 8 reads, commonly used in microbiome amplicon sequencing analyses, removes other artefacts from the database. The resulting library of unique clusters (Supplementary Figure S1, full database: SuppData_BACT_tableinfo.tsv and SuppData_BACT.tsv) was used as the final database.

This database served also basis for the simulation of a 100.000 sequence library in R by sampling A, T, G, and C using the calculated probabilities for each nucleotide at each position (see below).

2.4. Sequence Features

Several parameters were used to characterize the sequences in the complete database, as well as in the sequence groups generated after mapping of the reads to find changes in frequency. ORFs were predicted using the program getorf form the EMBOSS suite [34] using the full database as input (-minsize 12, -find 3). Only the first ORF was kept for each sequence. Predicted ORFs were translated in the first frame using transeq from the EMBOSS suite, and the first predicted peptide was kept for each ORF. Sequence length was calculated for each read, as well as the predicted ORF and peptide using bash programs.

The number of peptides of each length depends on the probability of getting a stop codon at each consecutive position, and not before. This is best described by the probability function of a geometric distribution:

(1-p)^(k-1)*p,

(1)

where k is the number of trials, in this case, the number of positions or the length of the sequence; and p is the probability of “success” or getting a stop codon. Multiplying this probability distribution by the number of synthesized sequences, one gets the expected count of peptides of each length. The resulting expected distribution of peptide lengths was used to confirm library quality (Supplementary Figure S2).

GC content was calculated as the percentage of guanine (G) and cytosine (C) in a sequence relative to its length using custom Perl scripts. This was done for the complete read, the random part of the sequence (obtained by trimming 12 nucleotides on the 5`-end and 33 nucleotides from the 3`- end of the clean reads), and the predicted ORF. Amino acid composition of the database and different sequence groups were calculated using the Biostrings package (V 2.58.0) from Bioconductor in R. Lists of sequences from each database formatted as AAStringSets were used as input for the letterfrequency function and amino acid frequencies were plotted for each sequence correcting for length. For the complete database, full-length predicted peptides were used, and frequencies were calculated for each sequence independently in order to obtain frequency distributions. For the group analysis, the flanking sequences were trimmed from the peptides and amino acid frequencies were calculated for the complete set of random amino acids as a single sequence.

Intrinsic disorder was calculated using the command line version of IUPred (IUPred2A) [35] with the -short option. Intrinsic disorder scores were averaged for each peptide to obtain single average disorder values. In addition to this, the fraction of residues with a predicted disorder score equal to or larger than 0.5 was calculated, producing comparable results (data not shown).

Protein aggregation propensity was calculated for all sequences in the database using the program PASTA 2.0 on the web server of The BioComputing POS lab of the University of Padua (Italy) (http://old.protein.bio.unipd.it/pasta2/ last accessed November 2021) [36]. For each sequence, free energy for the single-best pairing was obtained using the default settings for peptides. The best-energy pairing for self-aggregation was obtained for each sequence from the output files, and energies of -5 or less were considered indicative of a high probability of aggregation.

2.5. Mapping of Reads to Full Database

Clean reads for all replicates and timepoints in each experiment were mapped to the database using a global alignment-based method from the program USEARCH10 (option -usearch_global). For consistency with the clustering analysis, alignments had a minimum required identity of 0.97, minimum query coverage of 0.9, and maximum one hit, and 5 gaps. Hits were extracted from the search results and counted using custom bash scripts to generate count tables for each replicate in each experiment.

2.6. Frequency Change Determination and Group Assignment

Raw count tables for each experiment were used as input for statistical analyses using the package DESeq2 in R [37]. Count data of each experiment were analyzed independently using cycle number as explanatory variable, and only sequences that had at least 5 reads mapped in the whole experiment were kept.

DESeq2 was designed mostly for the analysis of RNASeq data, but is broadly applicable to a large range of data types that require to control for large dynamic range and dispersion effects [37]. This makes different experiments better comparable between each other. Based on the log2-fold changes provided by DESeq2 (full data in: SuppData_DESeq2_ALLexp_Cycle4vs1.tsv) we classify the clones into NEG for negative changes and POS for positive changes. In addition, we chose the multiple-testing corrected p_adj value (provided by the program) as a cut-off to create a category of NS ("non-significant") clones. For category assignment, a flag was added to each sequence on the database table depending on whether its fold-change was positive ( 1) or negative ( 1), and significant (p_adj 0.05) or non-significant (p_adj 0.05) for each experiment. For the overall assignments of sequences to one of the three categories, category flags were compared across all experiments, and a general flag (sign.most, in the database table) was assigned when at least the strict majority of experiments had the same flag (5 or more).

While p-values should normally not be used for a ranking between experiments, we believe that errors created in this way are small, or at least smaller than the variances that we see between the experiments anyway. A possible alternative for ranking the clones would be to calculate their individual fitness effects in the background of the mean fitness of the whole library, as suggested by [38]. However, these authors advise against using their procedure under conditions where fitness distribution are broadly spread, as is the case in our experiments (compare Figure 1 and Figure 2 in [29]).

3. Results

3.1. Replication with Diluted Library

In experiments with a complex library, all clones compete against each other, but rare clones generate only few reads that cannot be reliably analyzed. Hence, these unaccounted background clones can influence the behavior of the more frequent clones. In an attempt to test this possibility, we repeated the experiment of [29], but with a starting stock that was diluted by 100-fold compared to the previous ones and used a sampling schedule with samplings every 3 h for a total of 4 samplings in 12 h. The further experimental steps were conducted as described in [29]. The overall results showed that there was no major difference compared to the previous results (see Table 1 below). The majority of clones identified in the previous experiments could again be detected even with a 100-fold dilution. Hence, we decided to do the in-depth analysis described below across all available data.

3.2. Characterization of the Sequences in the Random Clone Library

To analyze all experiments done with the given clone library, we first produced a reference sequence database including all different sequences reliably detected in any of the sequencing experiments. This required the establishment of a pipeline for filtering of PCR and sequencing errors, which we conducted based on a common approach that is also used in microbiome studies. We required that each sequence was represented by at least eight reads, biasing against rare variants that can be generated in the PCR amplification steps before sequencing.

The median number of paired-end reads per replicate was 284,875. On average, 79.3% of them could be successfully merged, and both known plasmid-derived sequence regions could be found in 96.44% (±1.46%) of those merged. The resulting database consisted of 5701 unique sequences with minimum cluster size of eight. This included 647 peptides with the FLAG-tag sequence, of which 25 were not full-length due to internal deletions. There were 253 peptides that end with frameshift versions of the FLAG-tag sequence. Furthermore, since for the random part of peptides of lengths 4, 5 and 6, there were only 1, 21, and 441 possible different amino acid sequences, respectively, different clones could code for the same peptide. For example, the library includes 200 clones coding for the shortest possible peptide (MKLS—derived from the vector, see Figure 1), where the first triplet in the random sequence is a stop codon. Overall, the 5701 unique sequence clones coded for 5234 different peptides.

The dereplication algorithms used to generate the database provided information about the frequency of the different sequences in the library. The cluster size distribution is shown in Supplementary Figure S1. It has a right skewed distribution (mean: 20,274, median: 9870 sequences per cluster) with one extreme outlier with 4.2 × 10⁷ sequences, which corresponds to the vector plasmid without insert ("empty" vector).

The ORF length distribution in the database has the expected composition and features of a random database of sequences, i.e., it follows largely the expected distribution of predicted peptide lengths (Supplementary Figure S2). Deviations concern mostly the longest sequence classes, due to the constant sequences derived from the vector. Note that some of the longest classes were also partly derived from frameshift versions.

With respect to GC content, we found that the sequences in the databases did not fully reflect a completely random synthesis. The mean and median GC content of the full reads was 53.8%, and median GC content of the predicted ORFs was slightly higher (mean 53.04%, median 54.6%) with larger variance due to the shorter sequences (Supplementary Figure S3A). A closer look to the GC content at every position in the database for reads with exactly the designed sequence length revealed a generalized bias towards lower A and higher G content at every position, remarkably larger on the 3’ end of the sequences starting at position 36 (Supplementary Figure S3B). This is probably due to a bias during library synthesis, with a presumptive new supply of chemicals in between. Still, given that the length distribution of resulting peptides conformed mostly to the random expectation (compare Supplementary Figure S2), we considered the library as being primarily made up of random nucleotide sequences.

A relevant descriptor of the structural properties of an amino acid sequence is its intrinsic disorder level. Intrinsically disordered proteins lack defined secondary and tertiary structures, and naturally occurring genes have a higher intrinsic disorder than random sequences [10,39,40,41]. In addition to intrinsic disorder scores, GC content [40] and amino acid content are used as indicators of the disorder levels of proteins in a database of sequences. Since large, hydrophobic amino acids are more likely to promote aggregation or formation of secondary structures, they are called order-inducing amino acids. The propensity of amino acids to induce order or disorder is one of the factors used for the calculation of intrinsic disorder scores [42].

Intrinsic disorder for the proteins in the database was calculated as the average intrinsic disorder score (IDS) of all residues in the peptide, using the -short setting of IUPred2A (see Methods). Average IDS values have a right-skewed bimodal distribution with the majority of sequences having an average IDS of 1.00 (Supplementary Figure S4A). This was due to the large number of short sequences in the database that were very unlikely to be able to make any secondary structures and were also under the limit of detection of the software used. Grouping the sequences into length classes shows this effect clearly. The mean of the distribution of average IDS shifts to smaller values for longer peptide lengths, ranging from 0.947 for the shortest peptides with less than 10 residues, to 0.281 for those with 48 or more residues (Supplementary Figure S4B). There is also a general correlation of IDS with length (Supplementary Figure S4C), as well as with GC content (Supplementary Figure S4D).

3.3. Frequency Changes in Clones during the Growth Experiments

For all sequencing files from the experiments, 80–90% of clean reads were successfully mapped to the database, allowing us to calculate frequency changes during the experiments. Raw count tables were used to do enrichment analyses using DESeq2. Although this algorithm was originally designed for the analysis of RNAseq sequencing data, it is also frequently used for the analysis of amplicon sequencing data. The assumption behind this is that the distribution of data in amplicon sequencing should follow a near-log normal distribution, with many low frequency counts and few high-frequency ones. The overall results of the DESeq2 analyses with respect to categorizing clones with positive (POS), negative (NEG), or non-significant (NS) changes are summarized in Table 1.

There is some variation between single experiments, especially with respect to the number of clones in the POS group. This is not directly related to the experiment type, i.e., the two experiments with the lowest fraction of POS clones (Exp 1 and Exp 7) have different cycle times (3 h vs. 24 h). Similarly, the range of log2-fold changes for individual clones varies considerably (Table 1). This suggests that even small variations in experimental conditions can lead to somewhat different outcomes. However, for all experiments there were always more NEG clones than POS clones. The average across all experiments shows 16% POS clones, 36% NEG clones, and 48% NS clones.

With the new pipeline, we could also trace the overall performance of the empty vector in the different experiments using the log2-fold change values. In experiments 1 to 3 it went slightly up, in experiments 5 to 9 it went slightly down, and in experiment 4 there was no significant change (Table 1). Only in experiment 5 the down trend was stronger than the average in this experiment. Note, however, that the DESeq2 normalization procedure penalizes against large count numbers in a way that could make negative trends stronger. Overall, we concluded from these data that the peptide and RNA expressed from the vector itself has no strong influence on growth.

We assessed also whether translation of the flanking sequences has a specific effect. The first four amino acids (MKLS) of the peptides were coded by the vector (see Figure 1). Of the clones that expressed only these first four amino acids due to a direct stop codon in the random sequence only 4% were POS while 78% are NEG, indicating a negative effect of this peptide compared to the overall clone performance (Table 1). Interestingly, this overall negative effect is relieved when only one or two additional amino acids were translated, with the percentage of NEG clones falling to 53% and 38%, respectively (Table 1). From this analysis we concluded that the vector-derived, constant N-terminal amino acids of the peptides have an overall negative effect on growth, which can be overcome by additionally coded amino acids or the RNA sequence components in the clones.

The C-terminus of the full-length peptides was formed by three constant amino acids plus the eight amino-acid FLAG tag sequence (see Figure 1). Of the different clones with this translated FLAG tag sequence, only 3% were POS while 75% were NEG across most experiments (Table 1), which would indicate a negative effect of this sequence. However, we find also the two frameshift translation versions of this sequence among the clones and both showed a similar excess of NEG versus POS effects (Table 1). This suggests that it is not the FLAG tag sequence that acted negatively, but that longer peptides have a generally higher likelihood of being in the NEG group. This is also supported by the fact that peptides with a length of 48+, but without including any of the FLAG tag versions, showed a similar bias towards NEG (Table 1) (see also the further analysis of the length effects below).

3.4. Length, GC Content, and Amino Acid Composition Dependence

For the further analysis, we assigned each sequence in the database into the categories POS, NEG, or NS, based on having consistent category assignments in the majority of experiments (see Methods). Since most sequences ( 95%) fell consistently within one of these three categories (with the remainder being inconsistent and therefore not further analyzed), one can compare whether peptides of different length are equally represented in each of these groups. Figure 2 shows that this is not the case. The fraction of NEG clones is particularly high for the shortest and the longest peptides. This is most likely caused by the negative effects of the vector derived parts of the sequence, as discussed above. The relative fraction of POS and NS clones is particularly high in the length classes between 8 and 20 amino acids.

The GC content distribution of ORFs in each of the groups is depicted in Figure 3. The peaks are similar for all three classes at about 57% GC, slightly higher than the average for the whole library, which is at 53% GC. The POS and NS peptides show broader distributions than the NEG peptides, with a stronger shoulder towards lower GC contents. The NEG peptides show a second peak at 42% GC, mostly driven by the negative effect of the shortest clones with only the vector-derived peptide (see above).

We also compared amino acid compositions of the peptides from the three clone groups. For this analysis we excluded the vector derived parts of the sequences. The overall frequencies for the whole database and the three groups of peptides are presented in Supplementary Table S2. Figure 4 shows the differences for each group compared to the database. The largest differences are found for A, G, and S in the comparison between the POS and NEG groups. It is also notable that the frequency of 7 out of the 10 amino acids considered to be more disorder-inducing is lower in the NEG group than in the database, while 9 out of 10 of the order-inducing amino acids are depleted in the NS group. The POS group shows in general the largest deviations from the database.

3.5. Structural Features

The intrinsic disorder score (IDS) differs between the different clone groups, with the NEG group showing a stronger bimodal distribution than the two other groups (Figure 5). When breaking up the IDS in peptide length classes, it becomes clear that the highest IDS were due to the shortest classes (1–17 amino acids), for which the IDS calculation is anyway not very meaningful (compare also Supplementary Figure S4). The lowest IDS scores were seen for the longest peptides (48+ amino acids), but otherwise there was no clear difference, especially between the POS and NS groups of peptides (Figure 5).

There are generally few highly ordered sequences in the library (i.e., sequences with an average IUPred2 score of less than 0.25). This could be due to the fact that highly ordered sequences tend to aggregate, and are expected to be highly insoluble and detrimental to the cells. In order to assess aggregation propensity, we used the software PASTA 2.0. It calculates the free energy of predicted ß-strand intermolecular pairings for each sequence and reports the lowest value for each peptide as the best pairing [36]. Lower aggregation energies mean that it is easier for the peptides to form amyloids or to aggregate. In general, aggregation energies lower than -5 pasta energy units (PEU) are considered evidence for possible amyloid formation. Sequences in the NEG group show generally lower PEU values than the two other groups with a peak at - 4 PEU and a distribution shifted towards even lower values (Figure 6A). There is also a secondary peak at aggregation energies higher than the other two groups (Figure 6A).

In order to see whether sequences of a particular length are driving the observed pattern, we compared the data distribution for the different peptide length classes (Figure 6B). Interestingly, the distributions showed different patterns between the groups of clones at the highest and lowest length values, but more similar ones at the intermediate ones. The secondary peak found in the NEG group at very high aggregation energies seems to be generated mostly by the shortest peptides, and the shift towards negative values, by the longest ones.

4. Discussion

Here we have performed an in-depth analysis of all available data from amplicon sequencing experiments of a library of E. coli cells expressing different randomly synthesized sequences and grown for four expansion cycles to allow competition between clones. The experiments, first described by [29], were set up to assess which fraction of random DNA sequences expressed in a living organism has the potential of producing molecules that have an effect on cell growth or fitness. This question is relevant for the study of the origin of innovation in biological systems and, in particular, of de novo genes derived from more or less random non-coding sequences.

The main goal of the present study was to broaden the analysis to all peptides in the data, irrespective of their length. The authors in [29] had originally focused on the full-length peptides only, with FLAG-tags derived from the vector sequences. The broadening of the focus to all expressed peptides allowed us to investigate whether the constant sequences flanking the random inserts in the library influenced the growth effects. In addition to this, we wanted to test whether there are particular molecular or structural features of the sequences driving their effect on the growth of the cells in this population. Studies looking at young and novel genes in diverse species report that novel genes and proto-genes can have distinct features such as ORF length or intrinsic disorder levels that differentiate them from older genes and intergenic regions [10,12,39,41,43,44]. A possible explanation for these observations is that certain features make sequences more likely to be positively selected—or at least not selected against.

4.1. New Analysis Pipeline

The first step in our analysis was to ask whether amino acid sequences of different lengths present in the population and not analyzed in the original publication show similar behavior to the full-length (65 amino-acid-long) sequences that were the focus of the [29] study. This required us to generate a new analysis pipeline that addressed the three limitations of the original one. The first two—the incomplete removal of PCR and sequencing errors from the database, and the resulting artificial redundancy—are the result of using the predicted translation of the ORFs to generate the reference database of clones, instead of the nucleotide sequences. This was done to take advantage of the genetic code redundancy and to, at least partially, compensate for PCR and sequencing errors. The strategy, however, was insufficient as can be seen from the finding in the original publication of several very similar clones coding for peptides with only one or a few substitutions—an extremely improbable event in a library composed of random sequences of that length.

To compensate for such PCR and sequencing errors in this new analysis, we used a dereplication approach that removes all singleton reads from each sequencing file before joining them together and clustering them to a 97% identity. We used the full-length merged and trimmed reads for database generation and mapping, which allowed us to keep track of all clones that code for the same (shorter) peptides independently, which was important to detect protein vs. RNA effects (see further discussion below). The database generated was composed of more than 5000 reliably identifiable sequences predicted to code for peptides of all expected lengths. Furthermore, from its composition, it is possible to confirm that the library was indeed generated from random sequences, albeit with a slight bias towards a higher guanidine content during the synthesis process.

The final improvement to the pipeline was to change the algorithm used for mapping the reads back to the database from a local to a global alignment strategy. This greatly improved the speed and accuracy of the pipeline. Over 90% of all reads containing the flanking sequences mapped back to our database of unique sequences in all experiment replicates. This represents a 30–50% increase in mapped reads when compared with the pipeline used for analyses in the [29] study. As a result of this, we found that the change in frequencies can actually be much larger than what was initially reported. Some sequences, for example, have a decrease in frequency between the first and the last cycle of the experiments of up to 1000-fold.

4.2. Clone Effects

Having identified how the frequency of sequences changes in the available experiments, we were able to classify the sequences in groups according to the direction of the change. With the improved mapping pipeline, we found that over 80% of the sequences in the database had consistent behavior in at least five of the nine experiments, suggesting that the observed results were indeed an effect of the sequences and not due to chance or drift. This is noteworthy, considering that the experiments were performed independently, by different researchers, at different times, and have variations between sampling schedules, seed size, sequencing depths, and number of replicates.

Over half of the sequences in the database were consistently assigned to be either neutral (48%) or to go up in frequency (16%), suggesting that they are at least not very deleterious to the cells expressing them. This large proportion of sequences tolerated in a population of E. coli suggests that random sequences could also be expressed and maintained in large numbers in natural populations, making them an abundant source for possible evolutionary innovations.

We also specifically evaluated the vector effects that were suggested to have indirectly caused the observed positive effect [30,31]. In dedicated experiments, [31] found that just expressing the 38 amino acid peptides from the empty vector (i.e., without a cloned insert), had a slightly negative effect on the exponential growth of the E. coli cells. By disrupting this vector peptide with a potentially neutral peptide, one could generate an apparent positive effect. However, we found in our analysis that this peptide behaves mostly like a neutral peptide in the context of the full experiment, i.e., when not only focusing on the exponential growth phase as done in [31], but taking all competition cycles into account. While there is, on average, a small negative effect across experiments, it is not strong enough to explain the growth of most POS clones as merely its relief. Hence, we conclude that, in principle, the justified reservations about positive effects in our experiments [30,31] are not warranted in the face of the full data shown here, as well as the arguments provided previously [32].

4.3. Negative Effects of Vector Coded Amino Acids

Our data showed that the first four amino acids expressed by the vector had by themselves a negative fitness effect on the cells, with 73% of clones encoding only the first four residues, consistently decreasing in frequency in five or more experiments. A reason for this might be that the second and third codons in the sequence—lysine (AAG) and leucine (CTT), respectively—are not the most commonly used by E. coli for these amino acids. Interestingly, this negative effect diminishes quickly when one or two additional amino acids are translated. Hence, it is not of much concern for the overall experiment with mostly longer peptides; although, it contributes to the observed bimodal distributions of peptide length, intrinsic disorder, and aggregation propensity for the NEG peptides.

However, the same observation demonstrates that not only the coding part of a random sequence is important, for its maintenance in a population. Over 20% of the clones coding for the same peptide, but with different RNA sequences, show different growth trends in at least five experiments. In other words, clones with the same coding peptide had different effects on the growth trajectory of cells, due to the non-coding parts of their sequence. The authors in [29] had already shown that the RNA can have a different effect on growth than the protein by introducing a stop codon in single clones, disrupting the reading frame but keeping the rest of the sequence intact.

Systematic studies on replacing non-coding positions in an artificially expressed GFP RNA in E. coli have also shown that even small differences in RNA sequence can have differential fitness consequences for the cells [45], although this might be mostly caused by perturbing co-translational protein folding [46]. On the other hand, transcription has also been shown to contribute strongly to the metabolic burden that is caused by overexpressing genes in E. coli [47]. It is thus expected that the clone effects that we find are a combination of effects from the expressed RNA and protein together.

4.4. Protein Structure Correlations

Notwithstanding the possible fitness contribution of the RNA of the clones, we have analyzed protein structural properties in the three different groups of peptides. The most compelling difference between clones with POS or NEG responses is their length. Shorter peptides in the length range of 8–20 aa are prevalent in the POS and NS groups, while longer ones are prevalent in the NEG group. While it is generally known that newly evolved genes are shorter than older genes [7,43,44], the differences we observe here are at a much smaller scale than what is usually studied, since the ORF lengths of 4–65aa in our database are often not even annotated. Interestingly, in a study on the phenotypic impact of random sequences in Arabidopsis, [23] used also very short peptides (with cores of 6 or 12 random amino acids) and found a substantial fraction having an effect on the phenotype, including possibly beneficial ones.

The NEG group of peptides showed on average lower intrinsic disorder and higher aggregation propensity compared to the POS group. This is in line with the observation that naturally occurring young genes are more likely to have higher intrinsic disorder [10], which could be the reason why they are better tolerated by the cells [48].

We find no major differences in the three groups of peptides with respect to GC-content. However, there are some differences with respect to overall amino acid composition. The largest contrasts occur between POS and NEG peptides, whereby POS peptides have more alanine and serine but less glycine. With its six codons, serine is a frequent amino acid in the random sequences and it has a strong disorder promoting effect [42]. This could explain the higher disorder tendency in the POS peptides. Alanine and glycine, on the other hand, have both four codons and are therefore expected to occur equally frequently in random sequences, and they have similar disorder promoting effects. It is therefore unclear why alanine is more prevalent in POS and glycine is more prevalent in NEG clones.

An additional possible implication of the enrichment of serine in both the POS and NS groups is its potential for evolution. Creixell et al. [49] found that serine is the fastest-evolving amino acid and attribute this to fact that its six codons can be divided into two, very different, groups (AGY and TCN). The fact that the codons are so different facilitates non-synonymous substitutions, which allows evolution to explore a large sequence space in a shorter period of time. If, as our data seem to show, sequences containing larger fractions of serine are better tolerated by the cells, such sequences would be excellent starting material for the evolution of new functional peptides.

5. Conclusions

Although no single determining feature of a sequence could be identified that would earmark individual peptides as having potentially positive or negative effects on the cells, some differences exist with respect to structural properties. In particular, we found that shorter and more disordered peptides have a greater potential for being retained in a population as a primary source for novel genes, supporting the conclusions by James et al. on the general patterns of protein domain evolution [12]. Most importantly, our data confirm that random sequences have the potential of being beneficial for the cell, especially in the context of the complex competition between clones that we study in these experiments. However, we show in the accompanying paper [50] that individual candidate POS clones can also provide a growth advantage in pairwise competition experiments, although not necessarily with the same strength as seen in the bulk experiments. We conclude that our experiments support the notion that random sequences are an abundant source for generating evolutionary novelty.

Supplementary Materials

The following are available online at https://www.mdpi.com/article/10.3390/genes12121913/s1, Supplementary files: Figure S1: Cluster size distribution of clones, Figure S2: Predicted peptide length distribution of all clones in the library, Figure S3: GC content analysis of clones in the library, Figure S4: Intrinsic disorder scores for peptides, Table S1: Amino acid frequencies, Table S2: List of primers used, SuppData_BACT.tsv: results table for library analysis; SuppData_DESeq2_ALLexp_Cycle4vs1.tsv: DESeq2 results table for individual experiments; SuppData_BACT_tableInfo.tsv: Table legends.

Author Contributions

Conceptualization, J.F.C., and D.T.; data curation, J.F.C.; formal analysis, J.F.C.; investigation, J.F.C.; supervision, D.T.; validation, J.F.C. and D.T.; visualization, J.F.C.; writing—original draft, J.F.C.; writing—review and editing, D.T. All authors have read and agreed to the published version of the manuscript.

Funding

The work was funded by institutional funds of the MPG to DT. JF is a member of the IMPRS for Evolutionary Biology.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: Dryad http://dx.doi.org/10.5061/dryad.6f356 and at the European Nucleotide Archive (ENA) under the project number PRJEB19640. Sequencing data files for the replication experiments will are available at ENA, project number: PRJEB48999. All codes for data organization and analysis are available as GitLab project under https://github.com/johanafc/random-seqs-ecoli.

Acknowledgments

We thank Ellen McConnell for her help in getting the replication experiment established and Rafik Neme for his comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

Tautz, D.; Domazet-Loso, T. The evolutionary origin of orphan genes. Nat. Rev. Genet. 2011, 12, 692–702. [Google Scholar] [CrossRef]
Chen, S.; Krinsky, B.H.; Long, M. New genes as drivers of phenotypic evolution. Nat. Rev. Genet. 2013, 14, 645–660. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Schlötterer, C. Genes from scratch—The evolutionary fate of de novo genes. Trends Genet. 2015, 31, 215–219. [Google Scholar] [CrossRef] [PubMed] [Green Version]
McLysaght, A.; Guerzoni, D. New genes from non-coding sequence: The role of de novo protein-coding genes in eukaryotic evolutionary innovation. Philos. Trans. R. Soc. B Biol. Sci. 2015, 370, 20140332. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Van Oss, S.B.; Carvunis, A.-R. De novo gene birth. PLoS Genet. 2019, 15, e1008160. [Google Scholar] [CrossRef] [Green Version]
Andersson, D.I.; Jerlström-Hultqvist, J.; Näsvall, J. Evolution of New Functions De Novo and from Preexisting Genes. Cold Spring Harb. Perspect. Biol. 2015, 7, a017996. [Google Scholar] [CrossRef]
Neme, R.; Tautz, D. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genom. 2013, 14, 1–13. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Neme, R.; Tautz, D. Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence. eLife 2016, 5, e09977. [Google Scholar] [CrossRef] [PubMed]
Ruiz-Orera, J.; Messeguer, X.; Subirana, J.; Alba, M.M. Long non-coding RNAs as a source of new peptides. eLife 2014, 3, e03523. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Wilson, B.A.; Foy, S.G.; Neme, R.; Masel, J. Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nat. Ecol. Evol. 2017, 1, 1–6. [Google Scholar] [CrossRef] [PubMed]
Pavesi, A.; Magiorkinis, G.; Karlin, D.G. Viral Proteins Originated De Novo by Overprinting Can Be Identified by Codon Usage: Application to the “Gene Nursery” of Deltaretroviruses. PLoS Comput. Biol. 2013, 9, e1003162. [Google Scholar] [CrossRef] [PubMed] [Green Version]
James, J.E.; Willis, S.M.; Nelson, P.G.; Weibel, C.; Kosinski, L.J.; Masel, J. Universal and taxon-specific trends in protein sequences as a function of age. eLife 2021, 10, e57347. [Google Scholar] [CrossRef]
Zhang, L.; Ren, Y.; Yang, T.; Li, G.; Chen, J.; Gschwend, A.R.; Yu, Y.; Hou, G.; Zi, J.; Zhou, R. Rapid evolution of protein diversity by de novo origination in Oryza. Nat. Ecol. Evol. 2019, 3, 679–690. [Google Scholar] [CrossRef]
Weisman, C.M.; Murray, A.W.; Eddy, S.R. Many, but not all, lineage-specific genes can be explained by homology detection failure. PLoS Biol. 2020, 18, e3000862. [Google Scholar] [CrossRef]
Heinen, T.J.; Staubach, F.; Häming, D.; Tautz, D. Emergence of a New Gene from an Intergenic Region. Curr. Biol. 2009, 19, 1527–1531. [Google Scholar] [CrossRef] [Green Version]
Xie, C.; Bekpen, C.; Künzel, S.; Keshavarz, M.; Krebs-Wheaton, R.; Skrabar, N.; Ullrich, K.K.; Tautz, D. A de novo evolved gene in the house mouse regulates female pregnancy cycles. eLife 2019, 8, e44392. [Google Scholar] [CrossRef]
Cai, J.; Zhao, R.; Jiang, H.; Wang, W. De Novo Origination of a New Protein-Coding Gene in Saccharomyces cerevisiae. Genetics 2008, 179, 487–496. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Li, D.; Dong, Y.; Jiang, Y.; Jiang, H.; Cai, J.; Wang, W. A de novo originated gene depresses budding yeast mating pathway and is repressed by the protein encoded by its antisense strand. Cell Res. 2010, 20, 408–420. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Reinhardt, J.; Wanjiru, B.M.; Brant, A.T.; Saelao, P.; Begun, D.J.; Jones, C.D. De Novo ORFs in Drosophila Are Important to Organismal Fitness and Evolved Rapidly from Previously Non-coding Sequences. PLoS Genet. 2013, 9, e1003860. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Stepanov, V.G.; Fox, G.E. Stress-Driven In Vivo Selection of a Functional Mini-Gene from a Randomized DNA Library Expressing Combinatorial Peptides in Escherichia coli. Mol. Biol. Evol. 2007, 24, 1480–1491. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Knopp, M.; Gudmundsdottir, J.S.; Nilsson, T.; König, F.; Warsi, O.; Rajer, F.; Ädelroth, P.; Andersson, D.I. De Novo Emergence of Peptides That Confer Antibiotic Resistance. mBio 2019, 10, e00837-19. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Knopp, M.; Babina, A.M.; Gudmundsdóttir, J.S.; Douglass, M.V.; Trent, M.S.; Andersson, D.I. A novel type of colistin resistance genes selected from random sequence space. PLoS Genet. 2021, 17, e1009227. [Google Scholar] [CrossRef] [PubMed]
Bao, Z.; Clancy, M.A.; Carvalho, R.F.; Elliott, K.; Folta, K.M. Identification of Novel Growth Regulators in Plant Populations Expressing Random Peptides. Plant Physiol. 2017, 175, 619–627. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Keefe, A.D.; Szostak, J.W. Functional proteins from a random-sequence library. Nat. Cell Biol. 2001, 410, 715–718. [Google Scholar] [CrossRef] [PubMed]
Zhao, L.; Saelao, P.; Jones, C.D.; Begun, D.J. Origin and Spread of de Novo Genes in Drosophila melanogaster Populations. Science 2014, 343, 769–772. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Palmieri, N.; Kosiol, C.; Schlötterer, C. The life cycle of Drosophila orphan genes. eLife 2014, 3, e01311. [Google Scholar] [CrossRef] [Green Version]
Neme, R.; Tautz, D. Evolution: Dynamics of De Novo Gene Emergence. Curr. Biol. 2014, 24, R238–R240. [Google Scholar] [CrossRef] [Green Version]
Durand, É.; Gagnon-Arsenault, I.; Hallin, J.; Hatin, I.; Dubé, A.K.; Nielly-Thibault, L.; Namy, O.; Landry, C.R. Turnover of ribosome-associated transcripts from de novo ORFs produces gene-like characteristics available for de novo gene emergence in wild yeast populations. Genome Res. 2019, 29, 932–943. [Google Scholar] [CrossRef]
Neme, R.; Amador, C.; Yildirim, B.; McConnell, E.; Tautz, D. Random sequences are an abundant source of bioactive RNAs or peptides. Nat. Ecol. Evol. 2017, 1, 1–7. [Google Scholar] [CrossRef]
Weisman, C.; Eddy, S.R. Gene Evolution: Getting Something from Nothing. Curr. Biol. 2017, 27, R661–R663. [Google Scholar] [CrossRef] [Green Version]
Knopp, M.; Andersson, D.I. No beneficial fitness effects of random peptides. Nat. Ecol. Evol. 2018, 2, 1046–1047. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Tautz, D.; Neme, R. Reply to ‘No beneficial fitness effects of random peptides’. Nat. Ecol. Evol. 2018, 2, 1048. [Google Scholar] [CrossRef]
Edgar, R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010, 26, 2460–2461. [Google Scholar] [CrossRef] [Green Version]
Rice, P.; Longden, I.; Bleasby, A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 2000, 16, 276–277. [Google Scholar] [CrossRef]
Mészáros, B.; Erdős, G.; Dosztányi, Z. IUPred2A: Context-dependent prediction of protein disorder as a function of redox state and protein binding. Nucleic Acids Res. 2018, 46, W329–W337. [Google Scholar] [CrossRef]
Walsh, I.; Seno, F.; Tosatto, S.C.; Trovato, A. PASTA 2.0: An improved server for protein aggregation prediction. Nucleic Acids Res. 2014, 42, W301–W307. [Google Scholar] [CrossRef] [PubMed]
Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 1–21. [Google Scholar] [CrossRef] [Green Version]
Li, F.; Salit, M.L.; Levy, S.F. Unbiased Fitness Estimation of Pooled Barcode or Amplicon Sequencing Studies. Cell Syst. 2018, 7, 521–525. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Heames, B.; Schmitz, J.; Bornberg-Bauer, E. A Continuum of Evolving De Novo Genes Drives Protein-Coding Novelty in Drosophila. J. Mol. Evol. 2020, 88, 382–398. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Basile, W.; Sachenkova, O.; Light, S.; Elofsson, A. High GC content causes orphan proteins to be intrinsically disordered. PLoS Comput. Biol. 2017, 13, e1005375. [Google Scholar] [CrossRef]
Yuedong, Y.; Cao, Z.; Yang, Y.; Wang, C.-L.; Su, Z.-D.; Zhao, Y.-W.; Wang, J.-H.; Zhou, Y. Natural protein sequences are more intrinsically disordered than random sequences. Cell. Mol. Life Sci. 2016, 73, 2949–2957. [Google Scholar] [CrossRef] [Green Version]
Campen, A.; Williams, R.; Brown, C.; Meng, J.; Uversky, V.; Dunker, A. TOP-IDP-Scale: A New Amino Acid Scale Measuring Propensity for Intrinsic Disorder. Protein Pept. Lett. 2008, 15, 956–963. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Carvunis, A.-R.; Rolland, T.; Wapinski, I.; Calderwood, M.; Yildirim, M.; Simonis, N.; Charloteaux, B.; Hidalgo, C.; Barbette, J.; Santhanam, B.; et al. Proto-genes and de novo gene birth. Nat. Cell Biol. 2012, 487, 370–374. [Google Scholar] [CrossRef] [Green Version]
Schmitz, J.F.; Ullrich, K.K.; Bornberg-Bauer, E. Incipient de novo genes can evolve from frozen accidents that escaped rapid transcript turnover. Nat. Ecol. Evol. 2018, 2, 1626–1632. [Google Scholar] [CrossRef]
Mittal, P.; Brindle, J.; Stephen, J.; Plotkin, J.B.; Kudla, G. Codon usage influences fitness through RNA toxicity. Proc. Natl. Acad. Sci. USA 2018, 115, 8639–8644. [Google Scholar] [CrossRef] [Green Version]
Walsh, I.M.; Bowman, M.A.; Santarriaga, I.F.S.; Rodriguez, A.; Clark, P.L. Synonymous codon substitutions perturb cotranslational protein folding in vivo and impair cell fitness. Proc. Natl. Acad. Sci. USA 2020, 117, 3528–3534. [Google Scholar] [CrossRef] [PubMed]
Li, Z.; Rinas, U. Recombinant protein production associated growth inhibition results mainly from transcription and not from translation. Microb. Cell Factories 2020, 19, 1–11. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Tretyachenko, V.; Vymětal, J.; Bednárová, L.; Kopecký, V.; Hofbauerova, K.; Jindrová, H.; Hubálek, M.; Souček, R.; Konvalinka, J.; Vondrášek, J.; et al. Random protein sequences can form defined secondary structures and are well-tolerated in vivo. Sci. Rep. 2017, 7, 1–9. [Google Scholar] [CrossRef] [Green Version]
Creixell, P.; Schoof, E.M.; Tan, C.S.H.; Linding, R. Mutational properties of amino acid residues: Implications for evolvability of phosphorylatable residues. Philos. Trans. R. Soc. B Biol. Sci. 2012, 367, 2584–2593. [Google Scholar] [CrossRef] [Green Version]
Bhave, D.; Tautz, D. Effects of the expression of random sequence clones on growth and transcriptome regulation in Escherichia coli. bioRxiv 2021. [Google Scholar] [CrossRef]

Figure 1. Experimental design to evaluate the fraction of bioactive sequences in a library of random sequences. A pool of random 150 bp sequences generated by adding equimolar amounts of each nucleotide at every synthesis step was ligated into a commercial inducible expression vector (pFLAG-CTC, Sigma). This vector has start and stop codons in frame of the restriction site used for cloning, which means that the random sequences were flanked by a common sequence of 12 bp on the 5’-end and 36 bp on the 3’-end with a FLAG-tag (grey boxes). The resulting 195 nucleotide and 65 amino acid full sequences are shown. The pool of clones was used to transfect E. coli (DH10B) to generate a library of bacterial cells. Expression of the cloned peptides was induced by adding isopropyl β-d-1-thiogalactopyranoside (IPTG) to the culture media. Replicates were sampled every three h for a total of 12 h (3-h cycles, 12-h experiments) whereby one tenth of the culture volume was used for seeding the culture at each passage. The overall experiment replicated the one described in [29], where the analysis focused on full-length peptides only and also included experiments with 24-h growth cycles (5-day experiments). Here we use a newly designed pipeline to analyze all experiments and all peptide lengths.

Figure 2. Length distribution of all predicted peptides in the random sequence database and assignment to response groups. Histogram of sequence lengths for each group of sequences. The colored bars represent the number of peptides of each length assigned to each group in the experiments. Light blue: peptides showing a decrease in frequency (NEG); dark grey: peptides showing no significant change in frequency (NS); orange: peptides showing an increase in frequency (POS). The light grey bars in each panel represent the predicted peptide lengths of the complete database (compare suppl. Figure 2). Dashed lines represent the kernel density estimates for each category.

Figure 3. Density plots for the GC content of sequences in each of the three clone groups. Dashed line: average GC content of all ORFs in the library (53%). The colors represent the assignment to the three groups of clones in the experiments: Light blue: sequences showing a decrease in frequency (NEG); dark grey: sequences showing no significant change in frequency (NS); orange: sequences showing an increase in frequency (POS).

Figure 4. Differences in amino acid frequencies for the three groups of peptides. Frequencies were calculated as the percentage of each amino acid in all sequences in the groups and the complete database. Differences are shown as absolute values and the direction of the change is represented as +, if positive, or – is negative for each comparison. Light blue: NEG minus the database; grey: NS minus the database; orange: POS minus the database. Amino acids are ordered from left to right according to the TOP-IDP scale that reflects propensity for disorder induction from order promoting (left) to disorder promoting (right) [42].

Figure 5. Intrinsic disorder scores (IDS) for the three groups of peptides. Histograms of average IUPred2 scores (IDS) colored by the length categories depicted to the right. OR Empirical cumulative distribution of average IUPred2 scores (IDS) colored by the length categories.

Figure 6. Aggregation energy analysis for the three clone groups and different length classes of peptides: (A) Density plots for the best aggregation energy of sequences in each group. (B) Best aggregation energy of sequences in each group and each length class. The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). The whiskers are the minimum and maximum data points up to 1.5 times the closest IQR.

Table 1. Clone performance in different experiments.

Exp ¹	Cycle Length/Experiment Length (Replicates)	N ²	POS ³	NEG ³	NS ³	Range Log2-Fold Change (Average)	Empty Vector Log2-Fold Change ⁴
1	3 h/1 day (n = 10)	5625	0.11	0.36	0.53	−8.0 to 2.7 (−1.1)	1.1
2	3 h/1 day (n = 10)	5606	0.17	0.43	0.41	−7.7 to 2.2 (−1.2)	0.5
3	3 h/1 day (n = 10)	5638	0.18	0.40	0.42	−7.8 to 2.7 (−1.1)	0.4
4	24 h/4 days (n = 8)	5623	0.14	0.30	0.56	−5.2 to 5.2 (−0.5)	0.1
5	24 h/4 days (n = 10)	5596	0.10	0.26	0.64	−5.4 to 5.0 (−0.6)	−1.7
6	24 h/4 days (n = 10)	5632	0.26	0.41	0.32	−5.9 to 2.2 (−0.7)	−1.1
7	24 h/4 days (n = 5)	5623	0.07	0.28	0.65	−7.2 to 4.0 (−0.9)	−0.2
8	24 h/4 days (n = 5)	5689	0.27	0.46	0.27	−11.2 to 1.4 (−1.6)	−0.4
9	3 h/1 day (n = 5)/diluted library	5651	0.16	0.32	0.51	−8.4 to 5.6 (−0.7)	−0.7
	All experiments averages ⁵:
	All clones	5621	0.16	0.36	0.48
	Clones with 4aa ORF	200	0.04	0.73	0.18
	Clones with 5aa ORF	221	0.02	0.52	0.44
	Clones with 6aa ORF	209	0.06	0.37	0.56
	Clones with FLAG sequence	638	0.03	0.77	0.17
	Clones with FLAG + 1 sequence	129	0.03	0.68	0.20
	Clones with FLAG + 2 sequence	126	0.05	0.64	0.23
	Clones 48+ aa without FLAG	237	0.06	0.55	0.34

¹ For experiments 1–8 we reanalyzed the original fastq data from [29], experiment 9 with a diluted starting library was done within the framework of this study. ² Number of clones detected among the 5701 unique sequence clones in the database for which at least 5 reads were mapped in each experiment. ³ Fraction of clones in each category. POS and NEG were assigned when p_adj 0.05; otherwise, the clone was categorized as non-significant (NS). ⁴ All vector clone frequency changes were highly significant (p_adj 0.01) in their respective experiment, except for Exp 4 (p_adj 0.05). ⁵ The distribution of values from the 9 experiments is not significantly different from a normal distribution (Shapiro–Wilk test, p 0.5).

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Castro, J.F.; Tautz, D. The Effects of Sequence Length and Composition of Random Sequence Peptides on the Growth of E. coli Cells. Genes 2021, 12, 1913. https://doi.org/10.3390/genes12121913

AMA Style

Castro JF, Tautz D. The Effects of Sequence Length and Composition of Random Sequence Peptides on the Growth of E. coli Cells. Genes. 2021; 12(12):1913. https://doi.org/10.3390/genes12121913

Chicago/Turabian Style

Castro, Johana F., and Diethard Tautz. 2021. "The Effects of Sequence Length and Composition of Random Sequence Peptides on the Growth of E. coli Cells" Genes 12, no. 12: 1913. https://doi.org/10.3390/genes12121913

APA Style

Castro, J. F., & Tautz, D. (2021). The Effects of Sequence Length and Composition of Random Sequence Peptides on the Growth of E. coli Cells. Genes, 12(12), 1913. https://doi.org/10.3390/genes12121913

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

The Effects of Sequence Length and Composition of Random Sequence Peptides on the Growth of E. coli Cells

Abstract

1. Introduction

2. Materials and Methods

2.1. Library and Replication Experiment

2.2. Analysis Pipeline

2.3. Database Generation

2.4. Sequence Features

2.5. Mapping of Reads to Full Database

2.6. Frequency Change Determination and Group Assignment

3. Results

3.1. Replication with Diluted Library

3.2. Characterization of the Sequences in the Random Clone Library

3.3. Frequency Changes in Clones during the Growth Experiments

3.4. Length, GC Content, and Amino Acid Composition Dependence

3.5. Structural Features

4. Discussion

4.1. New Analysis Pipeline

4.2. Clone Effects

4.3. Negative Effects of Vector Coded Amino Acids

4.4. Protein Structure Correlations

5. Conclusions

Supplementary Materials

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI