Current Challenges and Pitfalls in Soil Metagenomics

Soil microbial communities are essential components of agroecological ecosystems that influence soil fertility, nutrient turnover, and plant productivity. Metagenomics data are increasingly easy to obtain, but studies of soil metagenomics face three key challenges: (1) accounting for soil physicochemical properties; (2) incorporating untreated controls; and (3) sharing data. Accounting for soil physicochemical properties is crucial for better understanding the changes in soil microbial community composition, mechanisms, and abundance. Untreated controls provide a good baseline to measure changes in soil microbial communities and separate treatment effects from random effects. Sharing data increases reproducibility and enables meta-analyses, which are important for investigating overall effects. To overcome these challenges, we suggest establishing standard guidelines for the design of experiments for studying soil metagenomics. Addressing these challenges will promote a better understanding of soil microbial community composition and function, which we can exploit to enhance soil quality, health, and fertility.

The advent of next generation sequencing (NGS) has enabled in-depth investigations of the composition and functions of whole microbial communities [13,14] and allowed the performance of metagenomic studies [15]. The term metagenomics was coined in 1998 by Handelsman et al. [16] and refers to the genomic information of the microbial community inhabiting an environment. The decreasing cost of NGS has facilitated the analysis of complex environments and has allowed researchers to perform metagenomic studies on soil [17] and produce large-scale data. However, the ease of generating metagenomic data brings new challenges for study design and analysis. First, soil microbial communities and their functions are influenced by environmental factors such as soil physicochemical properties, but this complexity of the soil microbiome is largely neglected [18,19]. Second, proper controls are necessary to better understand the influence of the environment on soil microbial communities. Third, large, published, and publicly available datasets are necessary to conduct meta-analyses to develop a better understanding of microbial communities on a larger scale. In this paper, we discuss how these three challenges are limiting further development of the field of metagenomics and how they can be overcome. Moreover, we outline the implications of surmounting these obstacles for our understanding of soil microbial communities.

Soil Physicochemical Properties Improve Our Understanding of Community Processes in Soil Microbiomes
Soil physicochemical properties play a crucial role in determining soil microbiome composition and function [20,21]. However, most soil microbiome studies either do not include soil physicochemical properties or limit measurements to pH, soil organic carbon (SOC), total nitrogen (TN), total carbon (TC), moisture, and/or temperature, which do not fully reflect the complex soil chemical matrix and its various constituent elements (e.g., N, C, K, P, Zn, Fe, Ca, Mn, Mg). Soil properties vary with soil depth [22], as shown for SOC, total N [23,24], and total P [25]. Moreover, soil properties, particularly pH, are dynamic and fluctuate with changes in climate [26], environment [27], and the aboveground population [28].
A common pitfall of soil microbiome studies is a lack of measurements of soil physicochemical properties. Such measurements are especially important in agricultural studies because land use determines soil physicochemical properties [20], which change nutrient availability and cycling [2]. These changes explain (part of) the shifts observed in soil microbial communities, root-soil interactions [29], pathogen suppression [30], and microbial processes [31]. Because soil physicochemical properties change over time [32], the control can be used to investigate the influence of time only if these properties are measured before the start of a long-term experiment. Otherwise, the direct effect of the treatment cannot be separated from the indirect effect of changes in soil physicochemical properties.
Including soil physicochemical properties in every soil metagenomic study not only eases the comparison of studies but also increases knowledge of the system under study. This will allow us to further disentangle soil microbial communities by linking specific microorganisms or functions to precise chemical changes [33]. Microbial responses are context-dependent; they change with soil disturbance [34], climate, and nutrient availability [26,35]. Nutrients also determine the abundance of root-promoting microorganisms [36], which can help explain changes in plant productivity. These contexts change not only the composition of microbial communities but also their interactions [37]. Omitting physicochemical properties from a soil microbial community study therefore leaves out a vital part of the explanation for soil microbial community responses.
In addition, accounting for soil properties can provide a better understanding of why specific microbes are especially sensitive to certain abiotic changes and the potential impact of this sensitivity on their ecosystem function. Recently, Leite et al. [36] showed that the efficiency of plant-growth-promoting microbes depends on the availability of nitrogen. A stronger grasp of the effects of soil physicochemical properties on soil microbial communities will provide insights into how these microbial communities are shaped and their roles in soil quality, health, and plant productivity.

Untreated Controls Are Critical for a Better Understanding of Soil Microbial Communities
Soil microbial communities are complex and susceptible to change [5,19]; this emphasizes the importance of establishing a baseline for comparison with the targeted treatment. Untreated controls provide such a baseline for determining whether a change in the soil microbial community is due to the treatment or to an unknown factor (e.g., stochastic processes). To illustrate this need, consider a hypothetical experiment comparing organic amendment and inorganic fertilizer; in such a comparison, the abundance of Acidobacteria is high in the treatment with organic amendment but low in the treatment with inorganic fertilizer. These results have two potential explanations: (1) organic amendment increases the abundance of Acidobacteria, and (2) inorganic fertilizers decrease the abundance of Acidobacteria. In the absence of a control containing the original soil without any amendment (untreated), both explanations are valid. Thus, including a control simplifies the interpretation of the results and facilitates the design of follow-up experiments.
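The hypothetical comparison above can be sketched in a few lines of Python. The abundance values, the 0.5 log2 threshold, and the function names are all illustrative assumptions rather than data or methods from any cited study; a real analysis would use replicated plots and an appropriate statistical test.

```python
import math

# Hypothetical relative abundances (%) of Acidobacteria; illustrative numbers only.
abundance = {
    "untreated": 10.0,
    "organic_amendment": 12.0,
    "inorganic_fertilizer": 5.0,
}


def log2_fold_change(treated: float, reference: float) -> float:
    """Log2 ratio of a treated abundance to a reference abundance."""
    return math.log2(treated / reference)


def classify(lfc: float, threshold: float = 0.5) -> str:
    """Label the direction of an effect; the threshold is an arbitrary assumption."""
    if lfc > threshold:
        return "increase"
    if lfc < -threshold:
        return "decrease"
    return "no clear change"


# Direct comparison of the two treatments: shows only that they differ.
direct = log2_fold_change(
    abundance["organic_amendment"], abundance["inorganic_fertilizer"]
)

# Comparison against the untreated control: shows which treatment moved.
effects = {
    name: classify(log2_fold_change(value, abundance["untreated"]))
    for name, value in abundance.items()
    if name != "untreated"
}

print(f"organic vs. inorganic log2 fold change: {direct:.2f}")
print(effects)
```

With these made-up numbers, the direct comparison reports a large difference between the treatments, while the comparison against the untreated plot attributes that difference to a decrease under inorganic fertilizer rather than an increase under organic amendment.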
Nonetheless, most comparisons of organic amendments with inorganic fertilizers directly compare the microbial communities in the two treatments. This approach does not provide a clear picture of which treatment is the main cause of microbial shifts. One reason for the lack of untreated controls is that agricultural soil biology research is often conducted on farms, where leaving an area of soil untreated will have economic and food-security consequences if crop yields are reduced. A second reason is that expanding the number of samples to include a control increases the cost as well as the burden of data analysis, which may be an issue when there are time constraints. A potential solution is to maintain a small area of the plot without any fertilizers for use as a control, which would benefit research without reducing crop yield. If there are time or cost constraints, a single plot could be kept free of any treatments instead of replicating the control three or more times. A third reason is that, where an area is intensively farmed, no untreated land is available. Even when a piece of land is left untreated from that point onward, the history of the land can still influence and explain changes in the soil microbial community [38]. An untreated control is also important when accounting for the influence of soil physicochemical properties. Without an untreated control, it is very difficult to determine whether changes in the soil microbial community reflect soil physicochemical properties, time, or the treatment. The lack of an untreated control distorts our view of the soil microbial community and may lead to misleading conclusions. For instance, Soman et al. [39] found that bacterial diversity was higher in soil with poultry litter than with inorganic fertilizer, which on its own suggests that poultry litter is better for bacterial diversity than inorganic fertilizer. However, the untreated control shows that bacterial diversity was higher without treatment. Thus, inorganic fertilizer and poultry litter both lower bacterial diversity, and it might be better to use no fertilizer if an increase in bacterial diversity is the aim.
Some researchers have developed creative solutions for including a control sample. One common approach is to measure the soil microbial communities and/or soil physicochemical properties before the start of the study for use as a control. However, this approach neglects the possibility of changes in the soil microbial community and/or soil physicochemical properties over time, especially in long-term field experiments [40,41]. Consequently, it is important to have a real-time control that is exposed to the elements in the same way as the treatments. Another common solution is the use of inorganic fertilizer as a control. However, inorganic fertilizer also changes the soil microbial community and soil physicochemical properties [42], making it impossible to establish a baseline. Other researchers use non-agricultural fields, such as grassland, as controls, but such controls cannot be reliably compared to agricultural fields because soil microbial communities differ greatly among different land-use systems [43]. Absolute microbiome profiling quantifies absolute abundance in metagenome samples, which eases comparisons between soil microbial communities [37]. This does not, however, remove the need for an untreated control, because it does not explain which environmental variables influenced the soil microbial community during the experiment, or how they had this influence.
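The difference between a pre-experiment baseline and a real-time control can be illustrated with a simple difference-in-differences calculation. This is a minimal sketch under stated assumptions: the diversity values are invented, the variable and function names are hypothetical, and the calculation assumes that the effect of time and the effect of the treatment are additive, which a real field experiment would need to justify.

```python
from statistics import mean

# Hypothetical Shannon diversity per plot; illustrative numbers only.
treated_start = [4.0, 4.2, 4.1]
treated_end = [3.4, 3.6, 3.5]
control_start = [4.1, 4.0, 4.2]  # untreated plots, same field, same dates
control_end = [3.8, 3.7, 3.9]


def change(start, end):
    """Mean change between the first and the last sampling date."""
    return mean(end) - mean(start)


# Baseline-only view: all change in the treated plots is blamed on the treatment.
naive_effect = change(treated_start, treated_end)

# Real-time control: change that happened without any treatment (time, weather).
drift = change(control_start, control_end)

# Treatment effect after removing the change shared with the untreated plots.
adjusted_effect = naive_effect - drift

print(f"naive: {naive_effect:.2f}, drift: {drift:.2f}, adjusted: {adjusted_effect:.2f}")
```

In this invented example, half of the apparent decline in the treated plots also occurs in the untreated plots, so a baseline measured only before the experiment would overstate the treatment effect.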
Including an untreated control in every soil microbiome study would contribute to reproducibility and facilitate the integration of studies in meta-analyses. In addition, the effects of the treatments could be separated from environmental effects. Widespread adoption of untreated controls in soil microbial community research would also enable the creation of a database of microbial baselines that could be used to infer possible consequences of changes in soil microbial communities.

FAIR Data Are Needed to Better Understand the Soil Microbiome
More sequencing data are being generated than ever before [44,45], which has increased the complexity of conducting transparent and reproducible research. Data sharing and transparency are key factors for ensuring reproducibility and comparability between studies. To help structure data management and ensure reproducibility in scientific research, the FAIR (findable, accessible, interoperable, reusable) guidelines for scientific data were introduced in 2016 [46]. This approach has since been evaluated by the biomedical research community [47,48] and has also been proposed for environmental metagenomics [49]. Many initiatives have improved data sharing and standardization, such as TerraGenome [50], the Earth Microbiome Project [51,52], and the Genomic Standards Consortium [53].
There is an increasing need for meta-analyses of the soil microbiome to uncover soil microbial community mechanisms and untangle the influences of soil type, location, soil physicochemical properties, and treatments. However, the FAIR approach is not widely used in soil metagenomics research. When the FAIR approach is not applied, information such as soil depth may be missing from the methods. This loss of information reduces reproducibility and increases the difficulty of metagenomic meta-analyses. In addition, data are not always checked for metadata correctness and completeness. This could be improved with guidelines to standardize metadata across platforms [54].
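A completeness check of this kind is straightforward to automate before data are deposited. The sketch below is only an illustration: the field names in REQUIRED_FIELDS are hypothetical and are not taken from any published metadata standard, and the functions missing_fields and audit are invented for this example.

```python
# Hypothetical minimal metadata fields; not taken from a published standard.
REQUIRED_FIELDS = (
    "sample_id",
    "latitude",
    "longitude",
    "collection_date",
    "soil_depth_cm",
    "ph",
    "treatment",
)


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one sample record."""
    return [field for field in required if record.get(field) in (None, "")]


def audit(records):
    """Map each incomplete sample to its missing fields; complete samples are omitted."""
    report = {}
    for index, record in enumerate(records):
        missing = missing_fields(record)
        if missing:
            label = record.get("sample_id") or f"<row {index}>"
            report[label] = missing
    return report


# Two invented sample records; the second lacks a sampling depth.
records = [
    {"sample_id": "S1", "latitude": 52.1, "longitude": 5.2,
     "collection_date": "2020-05-01", "soil_depth_cm": 10, "ph": 6.5,
     "treatment": "untreated"},
    {"sample_id": "S2", "latitude": 52.1, "longitude": 5.2,
     "collection_date": "2020-05-01", "soil_depth_cm": "", "ph": 6.1,
     "treatment": "organic_amendment"},
]

print(audit(records))
```

Running such a check at submission time, against whichever field list a community agrees on, would flag a missing soil depth before the dataset is published rather than when a later meta-analysis tries to use it.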
The FAIR principles should be prioritized from the start of a project, as they require mutual awareness and consent from the relevant group members and a consideration of the costs involved in maintaining FAIR data [55]. Reasons for not applying the FAIR principles in environmental metagenomics include work regulations preventing data sharing, technical difficulties in sharing data [56], and underestimation of the importance of data sharing [57]. One option for addressing these issues is to include data-sharing seminars or lectures in PhD programs and conferences to teach prospective researchers about the importance of data sharing. Another option is to develop tools that facilitate data sharing [56]. Data-sharing requirements have already been implemented by funding agencies, which has significantly improved data sharing in recent years [58]. Despite these initiatives, some data-sharing problems remain because the data-sharing guidelines set by journals are not always enforced in practice. Vasilevsky et al. [59] found that journals that merely encourage data sharing do not follow up on it because they do not require it, and that, even for journals requiring data sharing, a lack of data does not stop a paper from being published. Some journals require that a data statement be included, but authors often do not comply with this statement [60] or the data are not findable [61]. One workaround that some authors use to fulfil a journal's data-sharing requirements is to publish only part of their sequencing data [58]. Following FAIR guidelines for software [45] and R packages [62] is also important to create an open-data environment. Stricter journal guidelines with specific guidance increase FAIR data management and data sharing [59,63].
Applying the FAIR principles to all studies of soil metagenomics will allow meta-analyses to elucidate the dynamics of soil microbial communities and provide new insights into published data. It would also provide greater transparency, which would allow us to learn more about soil microbial communities and how they are affected by environmental conditions. Most importantly, we could work together as a scientific community to find solutions and advance the development of the field of metagenomics.

Discussion
Soil microbial communities are a vital part of our ecosystems that contribute to disease suppression, nutrient cycling, and soil fertility. The emerging field of metagenomics has the potential to uncover the functions and mechanisms of soil microbial communities. However, common guidelines are necessary to ensure comparability between studies and a better understanding of soil microbial communities. We argue that those guidelines should start with (i) a more detailed characterization of soil properties, (ii) the inclusion of untreated controls to avoid biased conclusions on shifts of community structure, and (iii) data-sharing and transparency measures to ensure reproducibility.
Many soil physicochemical properties affect soil microbial communities, and if we do not include at least some soil physicochemical properties, we will be navigating in the dark. The importance of including soil physicochemical properties goes beyond cross-study comparisons and is essential to disentangle soil microbial community mechanisms and connect them to specific chemical changes. Accounting for the soil physicochemical properties also brings the additional benefit of indirectly showing the effect of other soil organisms that shape the soil factors (e.g., decomposers [64]) without the need to characterize them. This will contribute to a better understanding of the role of soil physicochemical properties and the specific microorganisms that we can use for precision farming and increasing soil health and plant productivity. It is especially important to include soil physicochemical properties in tropical soils, where soil properties may be very different from those of temperate soils [65], which can change soil microbial communities significantly [5].
Adopting appropriate controls will promote reproducibility in metagenomics and establish a clear microbial baseline for soils. In-field and real-time controls can be used to create a database of soil microbial community baselines around the globe, which we can combine with soil physicochemical property data to better understand soil changes [66,67,68]. However, the creation of a true soil microbial community baseline demands a renewed emphasis on transparent data sharing. Soil metagenomics is complex, and the results are influenced by many factors, including the methods of DNA isolation, sequencing, and data analysis [69,70,71]. Therefore, it is essential not only to share our data in a FAIR way but also to include FAIR metadata [46,48,72].
We present three key challenges, but we also acknowledge the existence of many other challenges. Mocali and Benedetti [73] highlighted that, in order to study soil microbial communities via soil metagenomics, we need to consider efficient DNA-extraction methods with well-defined screening strategies and sequencing approaches. For example, Dimitrov et al. [69] proposed that successive DNA extractions optimize the DNA yield and lead to a better understanding of the microbial community composition. Altogether, we highlight that we also need to consider methodological challenges to fully embrace the complexity of soil microbial communities.
Prosser [74] argued that microbial ecology research should go beyond describing the microbial community and its functionality. We extend this recommendation by suggesting common guidelines to ensure that data collection efforts are not wasted or repeated unnecessarily. These challenges are also applicable to metatranscriptomics, metabolomics, and metaproteomics [75]. However, these disciplines are largely underexplored for soil microbial communities due to operational challenges [76]. The challenges motivating these guidelines need to be solved to ensure that the field of soil metagenomics continues to expand. Strengthening our research and increasing our understanding of the soil microbiome will accelerate efforts to tackle issues related to carbon sequestration, greenhouse gas mitigation, and food security.

Conclusions
In this paper, we discussed three critical challenges faced by soil metagenomics research: (1) accounting for soil physicochemical properties; (2) incorporating untreated controls; and (3) sharing data. We suggest resolving these issues by establishing standard guidelines for experimental design in soil metagenomics. Strict enforcement by journals and funding agencies will help to convince researchers to adhere to these guidelines. Overcoming these challenges will benefit the field by increasing our understanding of soil microbial community responses and facilitating cross-study analyses such as meta-analyses and global models.

Conflicts of Interest:
The authors declare no conflict of interest.