A Large-Scale Dataset of Barley, Maize and Sorghum Variety Identification Using DNA Fingerprinting in Ethiopia

Abstract: The data described in this paper were collected as part of a large-scale nationally representative household survey, the Ethiopian Socioeconomic Survey (ESS 2018/19). Grain samples of barley, maize and sorghum were collected in six regions of Ethiopia. Varieties were identified by matching samples to a reference library composed of released improved materials, using approximately 50,000 markers from DArTseq platforms. These data were part of a study documenting the reach of CGIAR-related germplasms in Ethiopia. These objective measures of crop varietal adoption, unique in the public domain, can be analyzed along with a large set of variables related to agro-ecologies, household characteristics and plot management practices, available in the Ethiopian Socioeconomic Survey 2018/19.


Background and Summary
Documenting the diffusion of improved cultivars has long been a preoccupation of the broad development community [1]. While monitoring progress made in the dissemination of improved cultivars is important, assessing the impacts of adoption, for example, on agricultural productivity or farmers' livelihoods, constitutes an equally crucial accomplishment. Such evidence can provide valuable insights for the research process to move forward.
Varietal adoption surveys have been carried out using a range of methods (eliciting information from farmers through household surveys, gathering expert opinions, and conducting seed-sales inquiries) that all possess inherent flaws [2]. DNA fingerprinting, originally deployed for seed regulations in developed countries, has been introduced in recent years for crop variety identification. Using the method as a benchmark, multiple empirical studies have shown that varietal types (landrace vs. improved) and varietal names are commonly mismeasured [3][4][5][6]. DNA fingerprinting has now been accepted as the method of choice for collecting rigorous data, and efforts are being made to identify best practices among multiple methodological options [7].
Crop germplasm development is a major activity of agricultural research centers. In collaboration with national partners, the CGIAR, a global partnership of research centers engaged in research for a food-secure future, has supported the development and release of hundreds of cultivars. The data contained herein were used to document the reach of CGIAR-related germplasms in Ethiopia [8].
Data were collected within the framework of a large-scale nationally representative survey, the Ethiopian Socioeconomic Survey (ESS 2018/19). In the regions of Amhara, Dire Dawa, Harar, Oromia, SNNP, and Tigray, crop cuts of barley, maize, and sorghum were conducted in 197 Enumeration Areas (EAs). In each EA, plots grown by the ESS households during the agricultural season were first listed by enumerators who randomly selected up to ten plots per crop. On each selected plot, 200 g of grains were collected from a random 4 × 4 m quadrant.
These data can shed light on several sets of research questions. The first concerns measurement error: while measurement errors can be harmful at both the micro and macro levels, few studies have yet investigated their implications [9]. Mismeasurement mechanisms need to be diagnosed to identify the cases in which mismeasurement matters: the magnitude of the bias, its source and its associated correlates. The search for ways to detect mismeasurements and minimize their consequences is also a promising research path.
The second set concerns farmers' decisions. Crop variety mismeasurements are important because farmers make planting and input allocation decisions based on the variety they believe they are growing. Variety misclassification might therefore lead to sub-optimal outcomes.
The third set of questions relates to the success of breeding research efforts: how traits selected by breeders on research stations carry over to farmers' fields is a key question. Some widespread crop traits are measured in the ESS: yield, maturity length, and crop damage. Questions related to the functioning of the seed system and seed delivery can also be investigated, using available metrics such as farmers' recycling practices and grain purity.

Data Description
The data are available as separate .csv files and can be opened by any statistical software. All data are stored in the OpenICPSR Repository [10] and accessible through the OpenICPSR online portal. The data consist of DArTseq reports for each crop (Table 1). The DArTseq reports can be merged with the ESS datasets using the variables Genotype (from the DArTseq reports) and sccq05 (in Section 9a of the post-harvest questionnaire). The genetic distance between samples is also provided as a matrix for each of the three crops. In the case of barley, some of the materials composing the reference library could not be uniquely distinguished. These reference library samples are thus grouped into bins, and caution should be exercised when using the varietal-level data for these samples.
The ESS survey questionnaires collected alongside the DNA fingerprinting data are available in open access [11]. The post-planting questionnaire collected plot- and crop-level data such as area measurements, inputs, and farming management practices. The post-harvest questionnaire captured harvest, crop damage, post-harvest management and utilization. Household, community and geospatial datasets are also available. Finally, it is worth noting that the ESS dataset is georeferenced at the EA level, so additional data sources can be merged by location.
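As an illustration of the merge described above, the following minimal sketch joins an ESS extract to a DArTseq report on the barcode. Only the key names Genotype and sccq05 come from the data description; the other column names (best_match, household_id, harvest_kg), the barcode values and the normalisation step are hypothetical and should be adapted to the actual files.

```python
import csv
import io

# Hypothetical stand-ins for a DArTseq report and an ESS Section 9a extract.
# Only the key names `Genotype` and `sccq05` come from the data description;
# every other column name and all values are invented for illustration.
DART_CSV = """Genotype,best_match
BC0001,Variety_A
BC0002,Variety_B
BC0003,Variety_C
"""

ESS_CSV = """household_id,sccq05,harvest_kg
H01,BC0001,120
H02,bc0003 ,85
H03,BC0009,60
"""


def normalise(barcode):
    """Strip whitespace and upper-case a barcode so keys compare reliably."""
    return barcode.strip().upper()


def merge_on_barcode(dart_rows, ess_rows):
    """Left-join ESS rows to DArTseq rows on sccq05 == Genotype."""
    by_genotype = {normalise(r["Genotype"]): r for r in dart_rows}
    merged = []
    for row in ess_rows:
        match = by_genotype.get(normalise(row["sccq05"]))
        out = dict(row)
        # Unmatched ESS rows are kept, with a missing variety call.
        out["best_match"] = match["best_match"] if match else None
        merged.append(out)
    return merged


dart_rows = list(csv.DictReader(io.StringIO(DART_CSV)))
ess_rows = list(csv.DictReader(io.StringIO(ESS_CSV)))
merged = merge_on_barcode(dart_rows, ess_rows)

for row in merged:
    print(row["household_id"], row["best_match"])
```

Keeping unmatched ESS rows (a left join) makes it easy to count how many surveyed plots lack a genotyped sample before any analysis.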

The Ethiopian Socioeconomic Survey (ESS)
The Ethiopian Socioeconomic Survey (ESS) is a household panel survey integrated with the CSA's Annual Agricultural Sample Survey (AgSS) (Central Statistics Agency, 2017). The ESS uses a two-stage probability sample: the first stage entails selecting primary sampling units, or CSA enumeration areas (EAs), from the AgSS sample of 1600 EAs. The ESS 2018/19 data were collected in 565 EAs, of which 316 are rural and 219 are urban. In the second stage, 12 households were selected randomly in each rural EA from a complete listing of households. The data are representative, at the regional level, for the most populous regions of the country. A more detailed description of the ESS4 is available in the basic information document [11]. The ESS survey datasets are in the global public domain [12].

Crop Sampling
The ESS performs crop-cuts on 21 annual crops such as cereals, legumes and oilseeds. In each EA, plots grown by the ESS households during the agricultural season are listed. Up to ten crop-cut plots per crop are then randomly selected. The selection gives priority to pure-stand fields: mixed-stand (intercropped) plots are not selected if at least ten pure-stand plots are available in the listing. When the farmer is ready to harvest, enumerators implement the crop-cut procedure. Following measurements of the plot boundaries with a meter, a 4 × 4 m quadrant was randomly selected within the plot [13]. All crops located within the 16 m² quadrant were harvested and the fresh weight was recorded with a scale. The crop dry weight was recorded after two weeks. [14]. The crop sample barcode ID was scanned and recorded in the ESS section on harvest amount (Section 9a) to allow matching with the ESS dataset.
The data were collected by resident enumerators who were fluent in English, Amharic and other local languages. One enumerator was assigned to each EA. The enumerator conducted the interviews, measured the land, conducted crop cutting and collected crop samples. Field supervisors provided field-level coordination and supervision. One field supervisor was assigned to monitor the work of up to three enumerators. Crop sample collection was also supervised remotely, with data continuously uploaded to the server. Data collection resulted in a total of n = 1122 samples (n = 249 barley, n = 505 maize and n = 368 sorghum), representative at the household level across major growing areas (Table 2).
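The plot selection rule described above can be sketched as follows. This is an illustration only, not the survey's actual Survey Solutions logic: the field names plot_id and pure_stand are hypothetical, and the way mixed-stand plots fill the remaining slots when fewer than ten pure-stand plots exist is an assumption, since the text only states that mixed-stand plots are excluded when at least ten pure-stand plots are listed.

```python
import random

MAX_PLOTS = 10  # up to ten crop-cut plots per crop in each EA


def select_crop_cut_plots(plots, rng, max_plots=MAX_PLOTS):
    """Randomly select crop-cut plots, giving priority to pure-stand plots."""
    pure = [p for p in plots if p["pure_stand"]]
    mixed = [p for p in plots if not p["pure_stand"]]

    # Enough pure-stand plots: mixed-stand plots are never selected.
    if len(pure) >= max_plots:
        return rng.sample(pure, max_plots)

    # ASSUMPTION: otherwise keep every pure-stand plot and fill the
    # remaining slots with randomly drawn mixed-stand plots.
    n_fill = min(max_plots - len(pure), len(mixed))
    return pure + rng.sample(mixed, n_fill)


def make_plots(n_pure, n_mixed):
    """Build a hypothetical plot listing for one crop in one EA."""
    plots = [{"plot_id": f"P{i}", "pure_stand": True} for i in range(n_pure)]
    plots += [{"plot_id": f"M{i}", "pure_stand": False} for i in range(n_mixed)]
    return plots


rng = random.Random(2018)  # fixed seed so the illustration is repeatable
many_pure = select_crop_cut_plots(make_plots(12, 5), rng)
few_pure = select_crop_cut_plots(make_plots(4, 20), rng)
small_listing = select_crop_cut_plots(make_plots(4, 3), rng)

print(len(many_pure), len(few_pure), len(small_listing))
```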

Reference Library
Crop variety DNA fingerprinting consists of matching the genetic material extracted from a collected sample with its closest genetic profile in a reference library. To reliably identify improved varieties, the reference library must be constituted from released improved varieties that can conceivably be found in the landscape surveyed.
For maize, the reference library for Ethiopia had already been compiled under a CIMMYT/EIAR DNA fingerprinting research project [15]. All improved maize varieties released in Ethiopia were included in the reference library. As there were no readily available reference libraries for barley and sorghum, we compiled collections of breeders' seed from the Ethiopian Institute of Agricultural Research (EIAR) and its regional centers. The barley reference library comprised 41 of the 46 food barley varieties and 17 of the 19 malt barley varieties released in Ethiopia since 1990. The maize reference library was sourced from an ongoing EIAR-CIMMYT project and comprised 40 improved varieties and 14 maize parental lines. For sorghum, a total of 29 varieties were included in the reference library, including all varieties that are still under production by EIAR. A list of varieties for the three crops is available in the Supplementary File.
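The matching principle can be illustrated with a minimal sketch that picks, for one sample, the reference entry at the smallest genetic distance. The variety names, the distance values and the optional max_distance cut-off are hypothetical; the text does not state which distance measure or threshold was used in the actual varietal identification analysis.

```python
def closest_reference(distances, max_distance=None):
    """Return (reference, distance) for the nearest reference entry.

    `distances` maps reference names to the genetic distance from one sample.
    `max_distance` is a HYPOTHETICAL cut-off: beyond it the sample is left
    unmatched (for example, a possible landrace). Returns None if unmatched.
    """
    if not distances:
        return None
    reference, distance = min(distances.items(), key=lambda item: item[1])
    if max_distance is not None and distance > max_distance:
        return None
    return reference, distance


# Invented distances from two samples to three reference varieties.
sample_1 = {"Variety_A": 0.02, "Variety_B": 0.31, "Variety_C": 0.18}
sample_2 = {"Variety_A": 0.40, "Variety_B": 0.37, "Variety_C": 0.45}

match_1 = closest_reference(sample_1, max_distance=0.10)
match_2 = closest_reference(sample_2, max_distance=0.10)
match_2_no_cutoff = closest_reference(sample_2)

print(match_1)            # ('Variety_A', 0.02)
print(match_2)            # None
print(match_2_no_cutoff)  # ('Variety_B', 0.37)
```

Without a cut-off every sample is assigned to some reference variety, however distant, which is why the choice of threshold matters when separating improved varieties from materials absent from the library.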

DNA Extraction and Genotyping
Grain samples were ground to flour (50 mg) and DNA was extracted in ILRI's laboratory in Addis Ababa. The task was completed in three months using Qiagen DNeasy Plant Mini Kits (250) according to the manufacturer's instructions. The concentration of the extracted DNA was adjusted to 50-100 ng/µL, and 30 µL was loaded onto 96-well plates and shipped to Australia for genotyping using the DArTseq sequencing platform (Diversity Arrays Technology). The DArTseq platforms use a combination of proprietary complexity reduction methods and next-generation sequencing platforms [16,17].
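Adjusting a DNA extract to the stated concentration range is a standard dilution calculation, C1 × V1 = C2 × V2. The sketch below is an illustration under assumed values: the 300 ng/µL stock concentration and the 75 ng/µL target (the midpoint of the 50-100 ng/µL range) are hypothetical, and the laboratory's actual protocol is not described in the text.

```python
TARGET_MIN = 50.0   # ng/µL, lower bound stated in the text
TARGET_MAX = 100.0  # ng/µL, upper bound stated in the text


def in_target_range(concentration):
    """Check whether a concentration already meets the 50-100 ng/µL range."""
    return TARGET_MIN <= concentration <= TARGET_MAX


def dilution_volumes(stock, target=75.0, final_volume=30.0):
    """Return (stock_volume, diluent_volume) in µL, from C1*V1 = C2*V2.

    `stock` and `target` are concentrations in ng/µL. The default target of
    75 ng/µL is an ASSUMPTION (midpoint of the stated range); 30 µL is the
    volume loaded per well according to the text.
    """
    if stock <= 0 or target <= 0 or final_volume <= 0:
        raise ValueError("concentrations and volume must be positive")
    if stock < target:
        raise ValueError("dilution cannot raise the concentration")
    stock_volume = target * final_volume / stock
    return stock_volume, final_volume - stock_volume


stock_volume, diluent_volume = dilution_volumes(300.0)
print(stock_volume, diluent_volume)  # 7.5 22.5
```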

Technical Validation
Data were collected by resident enumerators. The listing and random selection of crop-cut plots were recorded on Survey Solutions to ensure that the survey procedure was adhered to. Enumerators were extensively trained in determining the location of the random 4 × 4 m quadrant in the field selected for crop-cutting. Constraints were embedded in the Survey Solutions questionnaire to guarantee sample collection in the selected EAs.
Ensuring sample traceability along the chain is an important requirement. Using the survey architecture provided by Survey Solutions, barcodes were tracked along the chain, from EAs to CSA field offices in the regions, and then to the CSA headquarters in Addis Ababa. Only a few samples (n = 12) went missing and could not be recovered. Improved materials used for the reference library were similarly recorded with the origin and date of sample collection.
Grain samples were ground using a grinding machine. To avoid cross-contamination between samples, the grinding machine and the bench were cleaned with a brush and tissue and, finally, with 70% ethanol. The quality and concentration of the extracted DNA samples were checked using gel electrophoresis and NanoDrop spectrophotometry (DeNovix DS-11 FX model), and the concentration was adjusted to 50-100 ng/µL before shipment. After adjusting the concentration, 30 µL of each sample was aliquoted into 96-well semi-skirted plates and shipped along with the sample tracking file, which indicated the well position of each sample on the plate. Once they arrived in Australia, all genomic DNA samples were tested by 0.8% gel electrophoresis in TAE buffer, and genotyping was conducted only when the samples met the minimum standard. Libraries generated with the complexity reduction methods, as per the DArT PL product definition, were also quality controlled on agarose gels (1.2%), and all those matching the expected fragment size distribution were pooled for sequencing on a HiSeq 2500 (Illumina). Sequencing data quality was tested using DArT PL's proprietary script, and final data quality was evaluated through technical replication of approximately 10% of the samples in the analysis.
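One simple way to summarise technical replication is the share of markers on which two replicates of the same sample agree. The sketch below is an illustration only and is not DArT PL's proprietary quality-control script; the marker coding ("1" present, "0" absent, "-" missing) and the example calls are assumptions.

```python
MISSING = "-"  # ASSUMED code for a missing marker call


def replicate_concordance(calls_a, calls_b):
    """Share of markers with identical calls in two technical replicates.

    Markers missing in either replicate are excluded from the comparison.
    Returns None when no marker can be compared.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("replicates must cover the same markers")
    compared = [
        (a, b)
        for a, b in zip(calls_a, calls_b)
        if a != MISSING and b != MISSING
    ]
    if not compared:
        return None
    agreements = sum(1 for a, b in compared if a == b)
    return agreements / len(compared)


# Invented calls for six markers in two replicates of one sample.
replicate_1 = ["1", "0", "1", "-", "1", "0"]
replicate_2 = ["1", "0", "0", "1", "1", "0"]

concordance = replicate_concordance(replicate_1, replicate_2)
print(concordance)  # 0.8
```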
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/data6060058/s1, list of varieties included in the reference library.
Author Contributions: F.K. supported the survey conception, training, data collection, processing, and quality control of the DNA fingerprinting data documented here. F.K. wrote the data paper. A.A. and A.H.T. supervised the design, data collection and quality control of the Ethiopian Socioeconomic Survey 2018/19, and edited the data paper. A.T.N. supervised DNA extraction and edited the data paper. J.C. led the varietal identification analysis, and A.K. performed DNA quality control and supervised genotyping with DArTseq platforms. The Central Statistics Agency designed the sampling frame, implemented data collection and performed quality control. All authors have read and agreed to the published version of the manuscript.