Statistical phylogenetic methods are a powerful tool for inferring the evolutionary history of viruses through time and space. The selection of mathematical models and analysis parameters has a major impact on the outcome, and has been relatively well-described in the literature. The preparation of a sequence dataset is less formalized, but its impact can be even more profound. This article used simulated datasets of enterovirus sequences to evaluate the effect of sample bias on picornavirus phylogenetic studies. Possible approaches to the reduction of large datasets and their potential for introducing additional artefacts were demonstrated. The most consistent results were obtained using “smart sampling”, which reduced sequence subsets from large studies more than those from smaller ones in order to preserve the rare sequences in a dataset. The effect of sequences with technical or annotation errors in the Bayesian framework was also analyzed. Sequences with about 0.5% sequencing errors or incorrect isolation dates altered by just 5 years could be detected by various approaches, but the efficiency of identification depended upon sequence position in a phylogenetic tree. Even a single erroneous sequence could profoundly destabilize the whole analysis by increasing the variance of the inferred evolutionary parameters.
This is an open access article distributed under the Creative Commons Attribution License
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited