Variability and Number of Circulating Complementary Sex Determiner (Csd) Alleles in a Breeding Population of Italian Honeybees under Controlled Mating

In Apis mellifera, csd is the primary gene involved in sex determination: haploid hemizygous eggs develop as drones, while females develop from eggs heterozygous for the csd gene. If diploid eggs are homozygous for the csd gene, diploid drones will develop, but will be eaten by worker bees before they are born. Therefore, high csd allelic diversity is a priority for colony survival and breeding. This study aims to investigate the variability of the hypervariable region (HVR) of the csd gene in bees sampled in an apiary under a selection scheme. To this end, an existing dataset of 100 whole-genome sequences was analyzed with a validated pipeline based on de novo assembly of sequences within the HVR region. In total, 102 allelic sequences were reconstructed and translated into amino acid sequences. Among these, 47 different alleles were identified, 44 of which had previously been observed, while 3 are novel alleles. The results show a high variability in the csd region in this breeding population of honeybees.


Introduction
The complementary sex determiner (csd) gene plays a primary role in the sexual regulation of honeybees [1].Honeybees are haplodiploid, with females (queen and workers) being diploid and males (drones) being haploid [2].Diploid workers and queens are heterozygous for csd, while haploid drones are hemizygous.In the case of csd homozygosity, diploid drones develop, but they are identified and eaten by worker bees during the larval stage [3].
The csd gene comprises nine exons, including the potential specifying domain region (PSD) located within exons six to nine.Within the PSD a highly polymorphic region defined as the hypervariable region (HVR) is present.The HVR region differs greatly between alleles due to the varying number of asparagine/tyrosine repeats [4].Based on current knowledge, it has been demonstrated that in Apis mellifera five amino acid differences and length variations in the PSD region are sufficient to produce diploid females regularly [5].
The csd gene has evolved under a balancing selection that is strongly beneficial for heterozygotes, as homozygotes have zero fitness [6].Consequently, new and rare alleles have an advantage because their probability of being homozygous is much lower than that of common alleles [1,[6][7][8].Furthermore, Lechner et al. showed that mutations in the HVR region occur 2.4 times more frequently than in microsatellites [9].
In nature, the combination of these selective forces, along with the polyandry of queens mating, help to maintain a large number of different circulating csd alleles [10].The number of csd alleles was initially estimated to be about 10-13 [11], but further studies have identified a higher number of circulating alleles, ranging from 16 locally [12][13][14][15][16] to over 200 alleles worldwide [9,[17][18][19].In the Italian honeybee population, 88 different alleles have recently been identified by analyzing 125 colonies of different genetic types in 12 different Italian regions [20].
However, in a closed breeding population, some breeding practices, such as mating few selected breeders to produce numerous sister queens, could increase inbreeding level and consequently increase homozygosity.This could have significant implications for various productive and reproductive traits and in particular for the csd gene [21,22].
This article aims to investigate the diversity of csd alleles circulating in an Italian breeding population, where breeders are selected regardless of their csd alleles.For this purpose, we analyzed 100 colonies belonging to a breeding population, selected since 2015 for three different traits related to honey production, hygienic behavior, and docility behavior.

Data
One hundred worker colonies were sampled in June 2021 from a honeybee breeding farm located in Lombardy.One young worker bee was collected from each hive and stored at 4 • C in 1.5 mL Eppendorf tube filled with 99% ethanol.
The sampled colonies belong to a breeding population which undergoes a selection scheme in which selected virgin queens are mated with selected drones using isolated mating stations.At the beginning of the selection project in 2015, tests were conducted to identify an area (valley) isolated from the presence of other drones.For this purpose, groups of virgin queens were brought to the site, and after a few weeks, it was verified that none of them had been fertilized/mated.Subsequently, the isolated mating station was protected by specific local ordinances, preventing the presence or introduction of other colonies unrelated to the project.
Each year, 108 colonies are phenotyped to estimate their breeding value (EBV) for honey production, hygienic behavior, and docility.
At the end of the season, the 7-8 colonies with the best EBV are selected as breeders for the next generation to produce the new 108 colonies.In detail, from the best colony, a group of 12 sisters is obtained through grafting, serving as drone-producing queens.In beekeeping, the drone-producing queens are considered as fathers since drones are haploid and descend from the queen; they can therefore be considered as spermatozoa of the queens.Simultaneously, the remaining top 6-7 colonies undergo grafting to produce 18 virgin queens each.Finally, the 108 virgin queens are mated in the isolated mating stations with the drone-producing queens to produce the new 108 colonies.
In order to ensure genetic diversity and avoid breeding from closely related colonies, the selection process alternates between two distinct groups of 108 colonies each year.This means that if one group is phenotyped and used for breeding in a particular year, the next year the other group is phenotyped and used for breeding.This ensures that the two groups do not share common ancestors in consecutive years.
The 100 worker bees analyzed in this study belong to 100 different colonies founded in two different years and descend from 13 maternal lines.In detail, 40 colonies were driven by 40 queens born in 2019 from 6 selected breeding queens (group L1Q to L6Q), while the other 60 colonies were born in 2020 from 60 queens, daughters of 7 different queens (groups L7Q to L13Q).Therefore, these 100 colonies can be traced as descendants from 13 maternal lines.As the selection process involves the alternating use of 2 different groups, these 13 genetic lines of 2 consecutive years have no shared common relatives.More detail on the ancestry of the 51 worker bees for which HVR regions were reconstructed are shown in Figure 1.Moreover, Table 1 shows the distribution of the 51 samples among the 13 genetic lines of descent.
groups, these 13 genetic lines of 2 consecutive years have no shared common relatives.More detail on the ancestry of the 51 worker bees for which HVR regions were reconstructed are shown in Figure 1.Moreover, Table 1 shows the distribution of the 51 samples among the 13 genetic lines of descent.

DNA Extraction, Library Preparation, Sequence Processing and Alignment
The whole insect was ground with liquid nitrogen and DNA was extracted using E.Z.N.A. ® Insect DNA Kit (Omega Biotek, Norcross, GA, USA) with one minor modification: the incubation with CTL Buffer and Proteinase K Solution was performed for 60 min instead of 30 min.

DNA Extraction, Library Preparation, Sequence Processing and Alignment
The whole insect was ground with liquid nitrogen and DNA was extracted using E.Z.N.A. ® Insect DNA Kit (Omega Biotek, Norcross, GA, USA) with one minor modification: the incubation with CTL Buffer and Proteinase K Solution was performed for 60 min instead of 30 min.
Celero™ DNA-Seq kit (Tecan Trading AG, Männedorf, Switzerland) was used for library preparation following the manufacturer's instructions.Libraries were quantified by Qubit 2.0 and quality tested by Agilent 2100 Bioanalyzer High Sensitivity DNA assay (Agilent, Santa Clara, CA, USA).Libraries were then prepared for sequencing and sequenced with Illumina technology on NovaSeq 6000 in paired-end 150 mode.Reads were mapped to the reference genome (Amel_HAv3.1:GCF_003254395.2).Aligned sequences are only those that map to unique positions; duplicated sequences were marked by picard (https://broadinstitute.github.io/picard/,accessed on 1 February 2023) and removed from downstream analysis.

De Novo Assembly and Analysis of Sample-Specific HVR Allele Consensus Sequences
For each sample, Samtools was used to extract reads that mapped to exon 7 of the csd gene, which includes the HVR region [4].The trimmed reads that passed quality control filters, were mapped to the Amel_Hav3.1 honeybee genome, resulting in a total of 7350 mapped reads corresponding to the HVR region (Hav3.1_CM009933.2:11771976-11772119).The CAP3 Sequence Assembly Program (-o 40, -p 90) [23], was used to cluster and generate the contigs.CAP3 output was qualitatively filtered.First, only samples with 2 contigs were retained, resulting in 70 samples.Then, an additional 19 samples with more than 1 singleton were excluded, leaving 51 samples that passed these 2 filtering steps.The two contig sequences obtained from each sample, supposed to represent the two alleles, were translated into amino acid sequences using EMBOSS transeq tool (https://www.ebi.ac.uk/Tools/st/emboss_transeq/, accessed on 10 March 2023).In-frame translations resulting in truncated polypeptides were discarded.The successfully reconstructed amino acid sequences numbered 102 and were aligned using Clustal Omega (1.2.2) [24].

Csd HVR Amino Acid Sequence
The hypervariable region (HVR) of the csd gene of 100 diploid worker bees belonging to 100 different colonies was sequenced using the pipeline described above.After de novo assembly and quality control filter, 102 amino acid sequences were successfully reconstructed.The obtained amino acid sequences were compared with a recent review that grouped 652 amino acid sequences into 225 alleles based on identity [19].Each sequence was assigned an Amelcsd-HVR ID, numbered sequentially from the most to the least common (from Amelcsd-HVR1 to Amelcsd-HVR225) [19].Out of the 47 alleles found, 42 were reported and named by Bilodeau et al., 2 alleles (allele 7 and allele 14) were described only by Paolillo et al. [20], while 3 (Allele 35, 39, 42) were novel alleles.In Supplementary Table S1, allele sequences are reported together with their frequencies and the corresponding Amelcsd-HVR numbers or Paolillo Allele numbers [19,20].Moreover, Figure 2 visually represents the aligned amino acid sequences of the 47 alleles, highlighting the notable differences in both amino acid composition and length.Furthermore, in line with the findings of previous studies, all the analyzed worker bees demonstrate heterozygosity and they exhibit notable differences in both length and sequence of the hypervariable region of the csd gene between the two carried alleles, as shown in Table 2.In total, 47 different sequences were found: 21 of these are shared in at least 2 samples and 26 are reported only once.Allele frequencies were determined by dividing the number of observed sequences by the total number of reconstructed sequences (102) and are reported in Table S1.The most frequent allele is Amelcsd-HVR115 (allele 1), which is present in eight samples (frequency: 8%) belonging to five different genetic lines (two born in 2019 and three in 2020).In addition, the frequency of the 21 different alleles shared in at least 2 sample within the 13 genetic lines was analyzed.As described in the histogram reported in Figure 3, 17 out of 21 alleles were identified in at least 2 different genetic lines and, among these, 2 alleles (1 and 4) were reported in 5 different genetic lines.In total, 47 different sequences were found: 21 of these are shared in at least 2 samples and 26 are reported only once.Allele frequencies were determined by dividing the number of observed sequences by the total number of reconstructed sequences (102) and are reported in Table S1.The most frequent allele is Amelcsd-HVR115 (allele 1), which is present in eight samples (frequency: 8%) belonging to five different genetic lines (two born in 2019 and three in 2020).In addition, the frequency of the 21 different alleles shared in at least 2 sample within the 13 genetic lines was analyzed.As described in the histogram reported in Figure 3, 17 out of 21 alleles were identified in at least 2 different genetic lines and, among these, 2 alleles (1 and 4) were reported in 5 different genetic lines.

Discussion
Some selection practices applied in bee breeding can increase inbreeding within colonies, resulting in increased homozygosity, especially if single drone insemination is used.The consequent reduced variability is problematic for productive and reproductive traits, particularly at the csd gene, causing colony losses [21,22].Therefore, understanding

Discussion
Some selection practices applied in bee breeding can increase inbreeding within colonies, resulting in increased homozygosity, especially if single drone insemination is used.The consequent reduced variability is problematic for productive and reproductive traits, particularly at the csd gene, causing colony losses [21,22].Therefore, understanding the distribution and frequency of different csd alleles in a population of honeybees is important for beekeepers to make informed decisions about breeding practices and to maintain genetic diversity within their colonies.
This study aims to examine the effect of modern selection practices on the variation of the csd gene in a closed breeding honeybee population, using existing next-generation resequencing data to reconstruct csd alleles from diploid honeybees.
A bioinformatic pipeline was used for this purpose, which included manual review to achieve a curated, reference-independent sequence assembly.The pipeline was tested and validated by GenBank data in a previous study [20].
The HVR region sequence of 51% of the total samples in the study was successfully reconstructed.The obtained sequences were validated by comparison with known alleles found in the literature [19,20].As shown in Table S1, 44 out of 47 alleles showed 100% identity with already-published sequences, while the remaining 3 are novel alleles.The results of this study confirmed that the csd gene is highly variable in both amino acid sequence and length.Moreover, the length of the alleles varies considerably, ranging from 30 to 53 amino acid sequences, which is in line with previous studies [15,19,20].Table 2 further supports these findings, providing additional evidence of the extensive variability observed in the HVR region of the csd gene between the two alleles carried by the worker bees.This variability is consistent with previous studies [15,20].
The 51 colonies sequenced in the study belong to 2 generations and 13 genetic lines/breeding queens.Specifically, 24 colonies were born in 2019 and descended from 6 breeding queens, while the remaining 27 were born in 2020 and descended from 7 queens.
As shown in Figure 3, the different HVR sequences are well distributed among the genetic lines.This result indicates that there is still high variability within each line despite 6 years of selection.
In the present study, 47 different csd alleles are identified in a dataset of 102 sequences from 51 honeybees belonging to a breeding population.Paolillo et al. [20] found 88 alleles using the same pipeline on a dataset of 138 sequences from honeybees sampled from a variety of beekeepers across the Italian peninsula.Together, these results are in line with the findings of Lechner et al. [9], who identified 53 alleles locally and 87 in a larger dataset covering a broader geographic area.
Since homozygous individuals do not survive in a honeybee colony, the observed heterozygosity must be 1 by definition.Nevertheless, based on the allele frequencies detected in this paper (Table S1) an index of expected gene diversity at this locus can be computed as described by Nei [25]: In the present population, where a selection scheme has been in place since 2014, this index resulted in 0.96.This result can be compared with findings from a closed breeding population screened for the csd genes in New Zealand [14].In the study, the alleles of 42 queens were analyzed by sampling 6 drone pupae for each queen.They successfully obtained both allele sequences from 35 out of the 42 queens, while only 1 allele sequence was obtained from the remaining 7, resulting in a total of 77 amino acid sequences, identifying 16 different alleles [14].Based on the frequencies provided in our population in this paper, we estimated that the expected gene diversity in this population was 0.92 in 2010 and 0.90 in 2011.Interestingly, in a previous investigation [20] based on a larger population (across Italian peninsula), the expected gene diversity index as 0.98 using the allele frequencies provided by the authors.Therefore, all these results show that despite selection, the apparent loss of heterozygosity in the csd allele remains relatively limited within this breeding population.Consequently, we think that the major concern about the risk that selection in honeybees can excessively erode the genetic diversity in this species, is largely overrated.

Conclusions
The results of this study provide a valuable insight into the variability of the HVR region in closed honeybee populations, and the potential impact of controlled mating practices on their genetic diversity.The study found a total of 47 different csd alleles from 51 diploid worker honeybees collected from a selected breeding population.The high number of alleles found agrees with previous studies on csd gene variability in honeybees, even if the bees in this study belong to a highly selected population.However, this study highlights the importance of considering the csd alleles of reproducers in beekeeping practices to maintain a healthy and diverse genetic pool.The pipeline used in this study can also be applied to evaluate other selective traits, providing important information for breeding programs aimed at enhancing honeybee colony health, production, and survival.

Figure 1 .
Figure 1.Genealogical trees illustrating the ancestry of the 51 analyzed worker bees.Queens are shown in pink and colonies are shown in yellow.

Figure 1 .
Figure 1.Genealogical trees illustrating the ancestry of the 51 analyzed worker bees.Queens are shown in pink and colonies are shown in yellow.

Figure 2 .
Figure 2. Alignment of amino acid sequences in the hypervariable region (HVR) of the csd alleles.

Figure 2 . 11 Figure 3 .
Figure 2. Alignment of amino acid sequences in the hypervariable region (HVR) of the csd alleles.

Figure 3 .
Figure 3. Histogram reporting the frequency of the 21 alleles found in at least 2 samples in the 13 genetic lines, the colors are given according to the genetic lines.

Table 1 .
Number of analyzed samples for each year and Genetic line (GL).

Table 2 .
Amino acid sequences from the hypervariable region of the csd gene of the 51 diploid worker bees (ID sample), their respective mother ID, and the genetic lines (GL).