Aldehyde Dehydrogenase Diversity in Azospirillum Genomes

Aldehyde dehydrogenases (ALDHs) are indispensable enzymes that play a pivotal role in mitigating aldehyde toxicity by converting aldehydes into less reactive compounds. Despite the availability of fully sequenced Azospirillum genomes in public databases, a comprehensive analysis of the ALDH superfamily within these genomes has yet to be undertaken. This study presents the identification and classification of 17 families and 31 subfamilies of ALDHs in fully assembled Azospirillum genomes. This classification framework provides a more comprehensive understanding of the diversity and redundancy of ALDHs across bacterial genomes, which can aid in elucidating the distinct characteristics and functions of each family. The study also proposes the adoption of the ALDH19 family as a powerful phylogenetic marker due to its remarkable conservation and non-redundancy across various Azospirillum species. The diversity of ALDHs among different strains of Azospirillum can influence their adaptation and survival under various environmental conditions. The findings of this study could potentially be used to improve agricultural production by enhancing the growth and productivity of crops. Azospirillum bacteria establish a mutualistic relationship with plants and can promote plant growth by producing phytohormones such as indole-3-acetic acid (IAA). The diversity of ALDHs in Azospirillum can affect their ability to produce IAA and other beneficial compounds that promote plant growth and can be used as biofertilizers to enhance agricultural productivity.


Introduction
Aldehydes are ubiquitous organic compounds found in nature. They can be produced endogenously through processes associated with the metabolism of amino acids (e.g., arginine, proline, lysine, and valine), alcohols, lipid peroxidation (LPO), and carbohydrate oxidation, among others. Although aldehydes are crucial intermediates in various metabolic pathways, excessive accumulation can lead to toxicity, and environmental factors that induce cellular stress can trigger the accumulation of aldehydes beyond a certain threshold, resulting in cytotoxicity [1][2][3][4].
Aldehyde dehydrogenases (ALDHs) are a superfamily of enzymes present in both prokaryotic and eukaryotic organisms. They use NAD(P)+ as a cofactor to catalyze the conversion of aldehydes (Figure S1), including fatty, aromatic, and terpenoid aldehydes, into less reactive molecules. ALDHs play a vital role in protecting cells from aldehyde toxicity. Aldehydes are highly reactive molecules that can form adducts with DNA, RNA, and proteins, disrupting cellular homeostasis, deactivating enzymes, and damaging DNA [3][4][5][6]. In addition to their role in detoxification, ALDHs are involved in the synthesis and elimination of a variety of important molecules, including vitamins, amino acids, steroids, betaine, retinoic acid, and gamma-aminobutyric acid (GABA) [6,7].
ALDHs can catalyze three types of reactions, which convert aldehydes into carboxylic acids, acyl-coenzyme A esters, or acyl phosphates [4,8].
ALDHs are homo-dimeric, homo-tetrameric, or homo-hexameric enzymes. Most ALDHs consist of a single domain containing 450-550 amino acids, but some families have additional domains that enable them to catalyze multiple reactions. The ALDH domain can be further divided into three structural domains: a cofactor (NAD or NADP) binding domain, a catalytic domain, and an arm-shaped oligomerization domain (Figure S2) [4,8-11].
The catalysis of aldehydes into carboxylic acids by ALDH2 and other ALDH family members involves five steps: activation of the catalytic thiol, a nucleophilic attack on the electrophilic aldehyde, formation of a tetrahedral thiohemiacetal intermediate, hydrolysis of the resulting thioester, and dissociation of the reduced cofactor (NADH or NADPH) followed by regeneration of the enzyme by binding to NAD(P)+. These steps are mediated by the amino acids Glu268 and Cys302 or their equivalents in other ALDH family members [4,12].
There is great diversity in the ALDH superfamily. In eukaryotes, there are 18 families comprising 35 subfamilies; in humans, there are 19 ALDHs classified into 10 families, while in plants there are 14 families [13,14].
Compared to eukaryotes, the diversity of ALDHs in prokaryotes has been less well studied. One study of 258 Pseudomonas strains found 6510 ALDHs, which could be grouped into 42 families. Fourteen of these families accounted for 76% of all ALDHs [8].
Aldehyde dehydrogenases are classified based on sequence identity. ALDHs with at least 40% sequence identity are grouped into the same family, while those with at least 60% sequence identity are classified into the same subfamily. Subfamilies and numbers of ALDHs are designated chronologically [6]. This classification system, which is based on the recommendations of Margaret Dayhoff, is also used for more than 130 other protein superfamilies [7,15]. However, Dayhoff's methodology has limitations when used to classify ALDHs from microorganisms. When applied to these prokaryotic proteins, new families emerge beyond the current classification. This is likely due to the large number and diversity of ALDHs found in microbial genomes [8].
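The identity thresholds described above can be written as a small decision rule. The following Python sketch is only an illustration of the 40%/60% criteria; the function name and the example identity values are hypothetical and are not taken from the study.

```python
def classify_pair(identity_pct: float) -> str:
    """Apply the 40% (family) and 60% (subfamily) identity thresholds
    to one pairwise sequence identity value."""
    if identity_pct >= 60.0:
        return "same subfamily"
    if identity_pct >= 40.0:
        return "same family"
    return "different families"


# Hypothetical identity values, for illustration only
for pct in (72.5, 47.3, 31.0):
    print(f"{pct:5.1f}% -> {classify_pair(pct)}")
```

Because two sequences in the same subfamily also satisfy the family threshold, the subfamily test is applied first.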
The Azospirillum genus comprises land-dwelling microorganisms that typically inhabit damp environments, such as sludge, surface waters, and cultivated soils. These non-spore-forming bacteria are spiral or slightly curved rods that contain polyhydroxybutyrate granules. They stain Gram-negative and are highly motile due to a single polar flagellum and several shorter lateral flagella. It has been proposed that these Rhodospirillaceae bacteria colonized terrestrial habitats around 200-400 million years ago, coinciding with the emergence of vascular plants [16,17]. This genus includes species that establish a mutualistic relationship with plants and promote their growth. This growth-promoting effect is generally attributed to the ability of these bacteria to fix nitrogen and produce indole-3-acetic acid (IAA). While both abilities are important, nitrogen fixation is typically associated with all known plant-associated strains of this genus, whereas IAA production has only been observed in the species A. brasilense, A. lipoferum, A. argentinense, A. formosense, and A. baldaniorum [16,17].
The most well-studied species of Azospirillum have been isolated from crops of agricultural interest, such as wheat, corn, soybean, tomato, strawberry, and sugar cane. However, some species have also been isolated from soils contaminated with oil, bacterial fuel cells, geysers, and sulfurous waters. This indicates that Azospirillum can tolerate and adapt to a variety of stressful conditions, likely due to the presence of various stress-tolerance mechanisms [18]. Additionally, it has been observed that introducing A. brasilense into the roots of specific plants can reduce the build-up of reactive oxygen species. This capacity has been linked, among other factors, to the production of IAA by the bacterium [19,20].
Azospirillum bacteria have been extensively used as model organisms to study mutualistic interactions between plants and bacteria. While most research has focused on their plant-growth-promoting abilities, their ability to adapt to different environments and stresses is not well understood [19]. This study aims to explore how the diversity of ALDHs among different strains of Azospirillum affects their adaptation and survival under various environmental conditions.
There are no published studies on the diversity of ALDHs in the genomes of the genus Azospirillum. Therefore, we used a bioinformatics approach to search for proteins with the ALDH domain in 17 complete Azospirillum genomes that are available in public databases [20]. The recovered nucleotide and protein sequences were analyzed using bioinformatic tools to classify them according to their phylogeny and probable function.

Genome Selection
We retrieved Azospirillum genomes from the NCBI Assembly Database [19]. Only genomes with a complete assembly level in January 2022 were downloaded in GenBank Flat File (GBFF) format for further analysis.

Aldehyde Dehydrogenases Sequence Searching
To obtain a complete collection of protein-coding sequences, we parsed GBFF files and stored all coding sequences annotated as ORFs in FASTA files, separating amino acid sequences from nucleotide sequences.
To identify proteins containing the ALDH domain, we performed local alignments against the protein collection on the BLAST platform [21], using the aldA gene from Pseudomonas syringae DC3000 as the query sequence [22]. We set an E-value threshold of 0.05 and stored all hits in FASTA files for classification, alignment, and phylogenetic analysis [23].
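The E-value filtering step can be sketched as follows, assuming the BLAST results were exported in the standard 12-column tabular format (in which the E-value is the eleventh column). The function name, the example rows, and the choice to keep hits at or below the threshold are assumptions made for illustration; they do not reproduce the scripts used in this study.

```python
import csv
import io

E_VALUE_THRESHOLD = 0.05  # threshold stated in the text


def filter_hits(tabular_text: str, max_evalue: float = E_VALUE_THRESHOLD) -> list[str]:
    """Return unique subject IDs whose E-value is at or below max_evalue.

    Expects the standard 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    kept: list[str] = []
    seen: set[str] = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        subject_id, evalue = row[1], float(row[10])
        if evalue <= max_evalue and subject_id not in seen:
            seen.add(subject_id)
            kept.append(subject_id)
    return kept


# Hypothetical rows, not real results
example = "\n".join([
    "aldA\tprot_001\t45.2\t480\t250\t6\t5\t480\t3\t478\t1e-120\t410",
    "aldA\tprot_002\t28.1\t120\t80\t4\t10\t125\t40\t158\t0.8\t31.2",
    "aldA\tprot_003\t33.0\t455\t290\t9\t12\t470\t8\t460\t3e-60\t215",
])
print(filter_hits(example))
```

The retained identifiers could then be used to pull the matching records out of the protein FASTA collection.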

Comparative Genome Analyses
To cluster related genes in the pangenome, we used the Anvi'o 7.1 software with the DIAMOND and MCL algorithms. We set the minbit parameter to 0.5 and the MCL inflation parameter to 0.5 to classify and cluster the genes [24]. We then used the Conserved Domains Database to predict the corresponding ALDH family for each cluster [25,26]. All curated data are provided in the file curated_database.xlsx, which is available in the Supplementary Material.
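The minbit parameter mentioned above refers to a heuristic that normalizes the bit score of a hit between two sequences by the smaller of their two self-hit bit scores, so that weak matches can be discarded before MCL clustering. The sketch below illustrates that ratio; the function names and bit scores are hypothetical, and treating a value exactly equal to the threshold as passing is an assumption.

```python
def minbit(bits_ab: float, bits_aa: float, bits_bb: float) -> float:
    """minbit = bitscore(A, B) / min(bitscore(A, A), bitscore(B, B))."""
    return bits_ab / min(bits_aa, bits_bb)


def passes_minbit(bits_ab: float, bits_aa: float, bits_bb: float,
                  threshold: float = 0.5) -> bool:
    """Keep a hit when its minbit value reaches the threshold (0.5 in the text)."""
    return minbit(bits_ab, bits_aa, bits_bb) >= threshold


# Hypothetical bit scores, for illustration only
print(minbit(300.0, 800.0, 600.0))         # strong hit relative to self-hits
print(passes_minbit(100.0, 800.0, 600.0))  # weak hit
```

Normalizing by the smaller self-hit score keeps short proteins from being penalized simply because their maximum attainable bit score is low.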

Multiple Sequence Alignment and Phylogenetic Analyses
Multiple sequence alignment of the ALDH sequences was performed using the iterative MUSCLE algorithm in MEGA X software [27,28]. The global identity matrix was obtained by analyzing the alignment in UGENE v.46.0 software [29]. Using all sites for the gaps/missing data treatment and a moderate branch swap filter, we searched for the best substitution model for maximum likelihood analysis using MEGA. Based on their lower BIC values [30], we selected the LG+G+F model for amino acid sequences and the GTR+G model for nucleotide sequences.
We constructed a phylogenetic tree from the amino acid sequence alignment using the maximum likelihood method with the LG+G+F model in MEGA X software [27], with 1000 bootstrap replicates [31].
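Model selection by BIC, as used above, rests on the standard formula BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of sites, and L the maximized likelihood; the model with the lowest BIC is preferred. The following sketch applies that formula to made-up numbers. The log-likelihoods, parameter counts, and site count are hypothetical and are not the values obtained in this study.

```python
import math


def bic(log_likelihood: float, n_params: int, n_sites: int) -> float:
    """Bayesian Information Criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


# Hypothetical (log-likelihood, number of parameters) per candidate model
candidates = {
    "LG+G+F": (-25000.0, 650),
    "JTT+G": (-25400.0, 631),
}
n_sites = 500  # hypothetical alignment length

scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 1))
```

The penalty term k ln(n) means that a more parameter-rich model is selected only when its gain in likelihood outweighs the extra parameters.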

Protein Modelling, Molecular Conservation, and Structural Analysis
Aldehyde dehydrogenase amino acid sequences from the Azospirillum genus were retrieved from the UniProt database [32]. Representative members of each ALDH family with structures modeled by AlphaFold and deposited in UniProt [33] were selected. These structures were downloaded and stored in PDB format. The amino acid sequences of the downloaded models were compared using the UGENE v.46.0 software [29]; the catalytic cysteine residue of each sequence was identified, and the glutamate residue corresponding to the activator residue was also located.
Using the ChimeraX software (v.1.6) [34], the PDB models were analyzed, including structural alignment (matchmaking). The models were then visualized to show the positions of the catalytic and activator residues.

Aldehyde Dehydrogenase Gene Localization in Azospirillum Genomes
To identify the positions of ALDH loci within each Azospirillum genome, we wrote several shell scripts to compile the locations of each ALDH in all contigs. We then exported this information to CSV files for further analysis and generated graphs using the RAWGraphs platform [35].

Aldehyde Dehydrogenase Gene Identification as Potential Phylogenetic Markers
We searched for potential phylogenetic markers within single-copy core genes (SCGs) in the pangenomic analysis performed with Anvi'o software. We constructed a maximum likelihood phylogeny and compared it to the results obtained for the rpoD gene, which has been previously used as a phylogenetic marker for the Azospirillum genus [36].

Selection of Azospirillum Genomes
From the Genome Data Bank (NCBI), we selected 17 genomes with a complete assembly level (Table 1). The genome sizes ranged from 6.32 to 8.1 Mbp, and each genome included chromosome and plasmid sequences (contigs). The strain with the lowest number of contigs was A. brasilense Az39 (6 contigs), while the strain with the highest number of contigs was A. sp. TSA2s (10 contigs) (Figure 1).

Number of Aldehyde Dehydrogenases Identified for Each Azospirillum Strain
We identified a total of 315 ALDH sequences, with an average of 18.5 ± 3.7 aldehyde dehydrogenases per genome. The strain with the lowest number of ALDHs was A. ramasamyi M2T2B2 (12 homologs), while the strain with the highest number of ALDHs was A. sp. TSH100 (27 homologs).
We did not observe a correlation between the number of contigs and the abundance of ALDHs, suggesting that the number of ALDHs is not associated with horizontal gene transfer through the acquisition of exogenous genetic material. We also did not find a correlation between genome size and the number of ALDHs.
The average length of ALDH proteins was 556 ± 190 amino acids (aa). Since the literature reports an approximate size of 500 aa for the ALDH domain, we constructed a length frequency histogram and found three groups. The first group had a length between 396-559 aa, the second group between 884-916 aa, and the third group between 1237-1252 aa. We analyzed the sequences in the two larger groups using the Conserved Domains Database [25,26]. The second group belongs to the superfamily of alcohol dehydrogenases/aldehyde dehydrogenases, and its length is due to an additional domain with alcohol dehydrogenase (ADH) activity [46]. The third group belongs to the proline dehydrogenases/γ-glutamyl aldehyde dehydrogenases superfamily, and these proteins also have an additional domain named proline dehydrogenase (PRODH) [49] (Figure 2).
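The three length groups reported above can be expressed as a simple lookup over the stated ranges. This Python sketch is illustrative only: the group labels, the function name, and the example lengths are hypothetical, while the numeric ranges are those given in the text.

```python
# (label, minimum length, maximum length) in amino acids, ranges from the text
LENGTH_GROUPS = [
    ("single-domain ALDH", 396, 559),
    ("ADH/ALDH (additional alcohol dehydrogenase domain)", 884, 916),
    ("PRODH/ALDH (additional proline dehydrogenase domain)", 1237, 1252),
]


def length_group(length_aa: int) -> str:
    """Return the label of the length group a protein falls into."""
    for label, low, high in LENGTH_GROUPS:
        if low <= length_aa <= high:
            return label
    return "unassigned"


# Hypothetical protein lengths, for illustration only
for length in (480, 900, 1240, 700):
    print(length, "->", length_group(length))
```

A length outside all three ranges is reported as unassigned rather than forced into the nearest group.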

Sequence Alignment and Clustering of Aldehyde Dehydrogenases
The identity matrix from the global alignment shows a wide range of identity (3-100%) (Table S1). To achieve a reliable clustering of ALDH sequences that is not based only on identity percentage, we used the Anvi'o software [24], which clusters coding genes based on progressive alignments and the Markov Cluster algorithm (MCL) [24]. The 315 ALDH sequences were grouped into 31 clusters of orthologous genes (COGs). We used the Conserved Domains Database (CDD) [25] to identify the ALDH family of each COG. This resulted in 31 COGs being grouped into 17 families (Table 2). Fifteen ALDH families belonged to the aldehyde dehydrogenase superfamily (ALDH-SF, CDD: cd06534), which is characterized by a single ALDH domain. One ALDH family belonged to the acetaldehyde-CoA/alcohol dehydrogenase superfamily (ADH/ALDH, CDD: PRK00197), and the last family belonged to the proline dehydrogenase/pyrroline-5-carboxylate dehydrogenase superfamily (PRODH/ALDH, CDD: PRK11905). Members of families with additional domains are proteins with divergent sizes, as shown in the histogram above. The distribution of subfamilies and identity per subfamily is shown in Figure 3.
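Each entry of an identity matrix such as the one described above is a pairwise percent identity between two aligned sequences. The sketch below shows one common way to compute it, counting identical residues over all columns in which at least one sequence has a residue. This definition, the function name, and the example sequences are assumptions for illustration; UGENE may treat gaps differently.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of the same alignment.

    Columns where both rows are gaps are ignored; a residue facing a gap
    counts as a mismatch (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap and res_b == gap:
            continue  # column carries no information for this pair
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned fragments, for illustration only
print(round(percent_identity("MKT-AL", "MRTGAL"), 1))
```

Applying this function to every pair of rows in an alignment yields a symmetric matrix with 100% on the diagonal.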

Alignment and Phylogenetic Analysis
Using the maximum likelihood method, we generated a phylogenetic tree from the previously conducted alignment (Figure 4). The tree showed clustered branches that maintained the relationships established through the pangenomic analysis and identification carried out in CDD [24][25][26]. Of the 17 families identified, 10 exhibited a shorter branch distance than the remaining part of the tree, which suggests a shared ancestor. These families were identified as potential ALDHs that transform aldehydes into carboxylic acids (Blue group in Figure 4) [52,58,60]. The DhaS, COG1012, and ALDH MSR1-LIKE families were also found in this group (Red group in Figure 4), with a close identity of about 40% between groups. Notably, no literature describing the possible substrates and products of the COG1012 and ALDH MSR1-LIKE families was found. It would be beneficial to conduct future studies that test the substrates of phylogenetically closely related aldehyde dehydrogenases, as previous research has found evidence of substrate promiscuity among various aldehyde dehydrogenases [58]. The other seven families had ALDHs capable of producing carboxylic acids (KGS-ALDH and ALDH7), acylating ALDHs (ALDH7), phosphorylating ALDHs (ALDH19), ALDHs with an additional domain (ALDH16 and ALDH20), and ALDHs involved in the degradation of indole acetic acid (PaaZ). For the ALDH 07078 family, no literature was found that provided information on its possible function [11,49,50,64,70-74]. According to the pangenomic analysis, the strains have comparable varieties of ALDH, as illustrated in Figure 5.

Comparison of Structural Models of Aldehyde Dehydrogenase Families
Seventeen AlphaFold-generated representative models, one per ALDH family, were obtained from the UniProt database and superimposed. The most notable structural changes were found in the oligomerization domain; the PaaZ family stands out due to the larger size of this domain compared to other single-domain ALDHs. In experimentally characterized proteins from this family, the extended length of this domain allows them to form trimers instead of dimers, which are formed by the rest of the families [11,66].
According to our alignment results (Figure 6), the catalytic cysteine residue is in a similar position in all structures, while the catalytic glutamic acid is present in most families, except for ALDH6, ALDH19, and ALDH20. Despite an extensive search, we were unable to find any literature that elucidates the activation mechanism of the catalytic cysteine residue for these families. However, the crystal structures of these ALDHs available in the Protein Data Bank (PDB) demonstrate the absence of the acidic glutamic acid residue that is typically observed in other ALDH families [53,71,75-77]. In terms of the Rossmann fold, almost all structures exhibit a five-beta sheet configuration. The exception is the ALDH19 family, which has a four-beta sheet structure (black structure in Figure 7). This is consistent with the crystal structures of this family [77,78].

Analyzing Aldehyde Dehydrogenase Families Found in Azospirillum Genomes
Next, we describe the characteristics of the aldehyde dehydrogenase families identified in this study and discuss their potential impact on the metabolic activity within the genus Azospirillum.

• ALDH5
We identified 38 ALDHs belonging to the ALDH5 family (CDD-ID: cd07103), each with one to four paralogs per genome. This family is also part of the core genes of the genus Azospirillum. Our phylogenomic analyses revealed three subfamilies within the ALDH5 family, with an overall identity of 66.9% ± 19.5% between the proteins of this family. This indicates variability between subfamilies. Notably, 35 of the 38 ALDHs in this family are part of cluster subfamily 1 (CSF-1), while the other subfamilies are only present in A. oryzae KACY14407, A. thiophilum BV-S, and A. sp. TSA2s. In humans, ALDH5 participates in the conversion of succinate semialdehyde (SSA) into succinate. SSA is produced from the decomposition of γ-aminobutyric acid (GABA), a 4-carbon non-protein amino acid, which is a neurotransmitter that inhibits stress signals in the central nervous system. ALDH5 has also been implicated in the metabolism of 4-hydroxy-2-nonenal (4-HNE), a molecule produced by lipid peroxidation that causes oxidative stress in cells [79].

• 6-OL-ALDH
This family of ALDHs (CDD-ID: cd07138) is present in 15 of the 17 analyzed genomes, including the A. brasilense strains. We identified two subfamilies with an average identity of 47.3% ± 39.7%. The phylogram shows high diversity, but all the sequences appear to have a common ancestor. According to the Conserved Domains Database, these ALDHs are related to the 6-oxolauric aldehyde dehydrogenases of the bacterium Rhodococcus ruber SC1, where a protein of this family (cddD) converts 12-oxolauric acid into dodecanedioic acid [55].

• MSR1, DhaS, and COG1012 group
It is important to note that the DhaS, MSR1-like, and COG1012 families share common ancestors with 6-oxolauric aldehyde dehydrogenases and type II aldehyde dehydrogenases. Classifying this group of proteins was challenging: when queried against the Conserved Domains Database (CDD), the same sequences returned E-values of 0 or near zero for several of these groups, with no consistent best match. This issue is explored in greater detail in the discussion.

• MSR1-ALDH-LIKE
This family of ALDHs (CDD-ID: cd07108) is present in the strains of A. brasilense, A. thermophilum CFH70021, and A. sp. TSH58, with one to three paralogs per strain. This family has an average identity of 83.8% ± 13.5% and is grouped into a single subfamily. These sequences are related to an ALDH of the aquatic bacterium Magnetospirillum gryphiswaldense MSR-1, but there is no experimental evidence for its function [57].

• DhaS
We identified 18 proteins in this family (CDD-ID: cd07114) associated with the Pseudomonas putida ALDH DhaS. Related proteins have also been found in Bacillus amyloliquefaciens, where they are classified as potential indole-acetaldehyde dehydrogenases [58,59]. The Azospirillum proteins are divided into five subfamilies, with an overall identity of 54.7% ± 21.3%. In our phylogenetic tree, they are distributed across two distinct nodes, one of which is more closely connected to the ALDH2 family. The ALDH2 family also includes ALDHs with indole acetaldehyde dehydrogenase activity, suggesting convergent evolution between the two families [58][59][60].

• ALDH COG1012
Phylogenetic analysis of the 18 sequences in this family of proteins revealed six subfamilies. These subfamilies are evolutionarily close to the families cd07114 (DhaS), cd07108 (MSR1-like), and cd07559 (ALDH2). There is no experimental evidence to define the substrates that these ALDHs can use. The average identity of these ALDHs is 50.8% ± 22.7%. They are present in 11 of the 17 genomes analyzed, with one to three paralogs per genome. NAD(P)+ binding motifs were found, but it is unclear whether these ALDHs are acylating or oxidizing.

• KGSADH
This ALDH family (CDD-ID: cd07097) has not been classified according to the proposed nomenclature for ALDHs. Phylogenetic analysis revealed 52 ALDHs distributed into three subfamilies, with an average identity of 52.8% ± 29.6%, highlighting the high variation in identity between subfamilies. The clustering was confirmed using the Conserved Domain Database [25,26]. Each genome carries one to six paralogs of this family. Cluster subfamily 1 (CSF-1) is present in all strains with at least one paralog, while CSF-3 is only present in A. sp. TSA2s. No homolog of this family has been found in eukaryotes. In A. brasilense, these proteins could participate in arabinose metabolism by converting α-ketoglutarate semialdehyde into α-ketoglutarate. These proteins prefer NAD+ over NADP+ and use glutaraldehyde and succinic semialdehyde as substrates [50,51].

• ALDH6
This family (CDD-ID: cd07085) is formed by the methylmalonate semialdehyde dehydrogenases. We identified 30 homologs of these ALDHs, which are present in all strains analyzed, with one to four paralogs per genome. They form a single subfamily with an average identity of 71.7% ± 16.0%. These proteins have been found in prokaryotic and eukaryotic organisms, and participate in the distal metabolism of valine. Using NAD+ and CoA, they convert methylmalonate semialdehyde to propionyl-CoA [53,54].

• ALDH20
We identified 19 homologs of this family (CDD-ID: PRK13805) in the analyzed Azospirillum genomes. The A. lipoferum 4B strain has one paralog of this family, while the paralog found in A. humicireducens SgZ-5 (WP_236783856.1) is a partial sequence that has only the ALDH domain. An alignment of this partial sequence with the WP_108547921.1 sequence of the same strain reveals an identity of 99%, suggesting that the partial protein may have originated through a duplication event of the complete gene. Proteins of this family consist of two domains: the first with aldehyde dehydrogenase (ALDH) activity and the second with alcohol dehydrogenase (ADH) activity. In fermentative bacteria, this family of proteins participates in the anaerobic production of ethanol from acetyl-CoA, with acetaldehyde as an intermediate. These proteins are usually found in the form of spirosomes, which can adopt various conformational states. They are also sensitive to metal-catalyzed oxidation [49,55], and functional mutations in these proteins have been shown to cause greater sensitivity to heat [56].

• ALDH2
We identified 17 ALDHs of this family (CDD-ID: cd07559), with one homolog in each genome, indicating that this family is part of the core of the analyzed genomes. Additionally, our analyses revealed a high identity (90.9% ± 6.2%) among the members of this family. In A. brasilense Sp7, experimental evidence has shown that the aldehyde dehydrogenase WP_035679551.1 participates in the conversion of acetaldehyde into acetic acid [22]. Notably, this protein shares 97% identity with an aldehyde dehydrogenase from A. brasilense Yu62 [60], which participates in the conversion of indole acetaldehyde into indole acetic acid. Additionally, when A. brasilense Sp7 is in the form of a cyst, the transcription levels of this protein are reduced, which can lead to decreased levels of pyruvate and acetyl-CoA within the cell [61]. In mammals, members of this family are found in the mitochondria and are responsible for converting acetaldehyde to acetic acid in the cells of various organs, including the brain. Deficiency of this protein is associated with poor alcohol metabolism and the development of Alzheimer's and Parkinson's diseases [2,80].

• ALDH16
The members of this family (CDD-ID: PRK11905) are proline dehydrogenases/pyrroline-5-carboxylate dehydrogenases. In Azospirillum, we found one homolog in each genome, indicating that it is part of the core genome. This family has an average identity of 84.2% ± 9.1%. In the free-living bacterium Sinorhizobium meliloti, this protein, known as PutA, participates in the conversion of L-proline into glutamate through the sequential action of proline dehydrogenase and delta-pyrroline-5-carboxylate dehydrogenase activities, which are conferred by the PRODH and ALDH domains present in this family of ALDHs [62].

• ALDH19
This family of ALDHs (CDD-ID: PRK00197) is known as the glutamate-5-semialdehyde dehydrogenases. They require NAD+ for activity and have been found in mammals, plants, and bacteria. In mammals, this family is related to ALDH4 and ALDH12, but because they share less than 30% identity, they are grouped into different families in our analyses, even though they have similar functional motifs and participate in the same pathway, proline catabolism [63]. These proteins convert glutamate-5-semialdehyde into glutamic acid. No additional paralogs were found in the analyzed genomes, and this family has an average identity of 91.3% ± 5.1%. It is classified into a single subfamily and is part of the core genome of the analyzed genomes.

• ALDH7
The 16 ALDHs of this family (CDD-ID: cd07130) form a single subfamily with an average identity of 89.5% ± 8.0%. One paralog was found in each genome except for A. thermophilum CFH70021, which lacks the ALDHs of this family. These ALDHs, which are also present in plants and animals, are characterized by their ability to convert α-aminoadipic semialdehyde into α-aminoadipic acid. They typically form tetrameric structures [65,72].

• ALDH cd07078
We found 14 members of this family, although identity within the group is highly variable (47.3% ± 39.7%). The sequences were grouped into a single subfamily, with one member in each strain of the species A. brasilense, A. ramasamyi, A. humicireducens, A. baldaniorum, A. oryzae, and A. sp. We did not find experimental reports that reveal the possible substrates or metabolic pathways in which these enzymes participate.

• PaaZ
This bifunctional enzyme participates in pathways that include the degradation of phenylacetic acid. It catalyzes the opening of oxepin-CoA to convert it into 3-oxo-5,6-dehydrosuberyl-CoA. This enzyme has a longer carboxy-terminal region than other classic ALDHs. This additional region allows the protein to form trimers, as well as a hydrophobic tunnel through which the enzyme receives its substrate and carries out the conversion. The enzyme is susceptible to spontaneous conversion to 2-hydroxycyclohepta-1,4,6-triene-1-carboxyl-CoA in the absence of NAD(P)+ [66]. We found this ALDH in Azospirillum strains that do not synthesize indole acetic acid, suggesting that these strains may be able to use IAA as a carbon source. However, validating this hypothesis will require further investigation.

• ALDH cd07105
We identified a group of five ALDHs in this family that can be classified together, with an average identity of 72.38%. These ALDHs are present in A. thermophilum, A. sp. B510, and A. humicireducens SgZ-5, with two homologs in A. sp. TSH100. This ALDH family is part of an operon that contains up to five enzymes that catalyze the conversion of naphthalene into salicylate [67].

• ALDH cd07120
A member of this family was detected in two Azospirillum strains, A. thermophilum CFH70021 and A. sp. B510, with a high similarity of 96% between the two members. This family is similar to the psfA gene of Pseudomonas putida Fu1, which participates in the conversion of furfural [68].

• ALDH cd07152
A member of this family was identified in A. sp. TSH100 and A. sp. B510, with 88% identity between the two sequences. This member is similar to benzaldehyde dehydrogenase II of Acinetobacter calcoaceticus [69], suggesting that it may have a similar function.
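The family descriptions above each report an average identity as a mean ± standard deviation. As a rough illustration of how such a summary can be derived from a multiple sequence alignment, the sketch below computes all pairwise percent identities and summarizes them. It is a minimal sketch under stated assumptions, not the procedure used in the study: the gap handling, the use of the sample standard deviation, the function names, and the toy sequences are all hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identity between two rows of the same alignment.

    Assumption: columns where both rows carry a gap are ignored, and a gap
    opposite a residue counts as a mismatch (one of several conventions).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


def identity_summary(aligned: dict[str, str]) -> tuple[float, float]:
    """Mean and sample standard deviation over all sequence pairs."""
    values = [
        percent_identity(aligned[p], aligned[q])
        for p, q in combinations(sorted(aligned), 2)
    ]
    if not values:
        raise ValueError("need at least two sequences")
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread


# Toy alignment with hypothetical sequence names, not data from the study.
toy_family = {"seq1": "MKTAL-GE", "seq2": "MKTAL-GE", "seq3": "MKSAV-GD"}
family_mean, family_sd = identity_summary(toy_family)
print(f"{family_mean:.1f}% ± {family_sd:.1f}%")
```

A large standard deviation relative to the mean, as reported for several families above, typically indicates that a family mixes near-identical orthologs with much more divergent subfamilies.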

Aldehyde Dehydrogenases as Phylogenetic Marker
We examined the location of aldehyde dehydrogenases in the bacterial genomes of the genus Azospirillum to determine whether they were located on the chromosome or on any of the chromids. This investigation was motivated by studies suggesting that chromids typically experience changes in their GC composition over time, and that genes on the chromosome may undergo fewer changes than genes that have been translocated to a chromid [81].
Most ALDH sequences are distributed on chromosomes, chromids, and even plasmids. The ALDH2 and ALDH16 families, which belong to the SCG, are also distributed in this manner. The ALDH19 family, whose members putatively encode phosphorylating aldehyde dehydrogenases, is an exception. Of the seventeen genes associated with this family, sixteen remain on the chromosome, while one is present on chromid 1 (in A. thermophilum CFH70021) (Figure 8). The persistence of this gene on the chromosome, together with its high sequence identity noted above, led us to consider it as a potential phylogenetic marker. Its sequences were therefore compared with those of the rpoD gene, a well-established phylogenetic marker in Azospirillum.
In addition to the aldH19 and rpoD genes from the Azospirillum genus, sequences from other alphaproteobacteria and the gammaproteobacterium P. syringae DC3000 were included in the analyses. This decision was based on the finding that both aldH19 and rpoD are SCG genes in these genomes, and their location on the chromosome was confirmed for genomes with more than one contig. Interestingly, the comparison of the phylogenetic trees of the aldH19 and rpoD genes revealed that aldH19 has undergone five times fewer changes over time than the rpoD gene. The branch position of aldH19 was retained, establishing a phylogenetic relationship between the alphaproteobacteria and a larger branch for the gammaproteobacterium P. syringae DC3000. Additionally, the branches in the aldH19 phylogenetic tree better grouped individuals of the same species (Figure 9).
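The statement that one marker has changed several times less than another can be illustrated with a simple distance-based proxy. The study compared phylogenetic trees; the sketch below swaps in a different, simpler measure, the mean pairwise p-distance (the proportion of differing sites) of each alignment, and reports the ratio between two markers. The function names, the strain labels, and the toy sequences are hypothetical and are not data from the study.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites, skipping columns with any gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    sites = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in sites) / len(sites)


def mean_pairwise_distance(aligned: dict[str, str]) -> float:
    """Average p-distance over every pair of sequences in one alignment."""
    pairs = list(combinations(sorted(aligned), 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(aligned[p], aligned[q]) for p, q in pairs) / len(pairs)


def variation_ratio(reference: dict[str, str], candidate: dict[str, str]) -> float:
    """How many times more variable the reference marker is than the candidate."""
    candidate_distance = mean_pairwise_distance(candidate)
    if candidate_distance == 0:
        raise ValueError("candidate marker shows no variation")
    return mean_pairwise_distance(reference) / candidate_distance


# Toy alignments for three hypothetical strains (not sequences from the study).
toy_rpod = {"s1": "ATGGAACAGA", "s2": "ATGGAGCAGA", "s3": "ATCGAACTGT"}
toy_aldh19 = {"s1": "ATGGCGAAAC", "s2": "ATGGCGAAAC", "s3": "ATGGCGAAAT"}
ratio = variation_ratio(toy_rpod, toy_aldh19)
print(f"reference marker varies {ratio:.1f}x more than the candidate")
```

A p-distance ignores multiple substitutions at the same site and the tree topology, so it is only a coarse stand-in for the branch-length comparison described in the text.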

Discussion
The Azospirillum genus exhibits remarkable resilience and adaptability to a wide range of environmental stressors, including oil-contaminated soils, bacterial fuel cells, geysers, and sulfurous waters [40,43]. However, exposure to these stressors can lead to increased aldehyde levels, which can endanger bacterial viability. Other environmental factors that can affect the viability of these bacteria include fluctuations in temperature, pH, salinity, and nutrient availability. The diversity of ALDHs within the Azospirillum genus is likely to play a crucial role in their adaptation and survival under such stressful conditions [42,44,47]. Azospirillum potentially acquired genetic material from coexisting organisms within their shared habitat through horizontal gene transfer [16]. This genetic exchange could have significantly influenced the ability of the genus to undergo evolutionary adaptations, leading to the development of novel metabolic pathways and the production of a wide range of products and intermediates, including aldehydes. This, in turn, may have contributed to enhanced environmental adaptability. The evolution or acquisition of ALDHs may have been fundamental to the survival of strains capable of utilizing environmental nutrients and responding effectively to stress. Greater diversity of ALDH families facilitates the clearance of different aldehyde types, underscoring the importance of maintaining a sufficient variety of ALDHs for bacterial homeostasis [1,82].
This study highlights the need for a revised, universally applicable classification system for ALDHs. The current system works for eukaryotes, but it is inadequate for prokaryotes, given the growing diversity of ALDHs discovered through ongoing research. As a result, novel ALDHs are being classified without following established guidelines, leading to inconsistencies in nomenclature, numbering, structure, and function across the currently defined ALDH families [4,8,83].
A major challenge in accurately classifying ALDHs is the limitation of platforms such as SMART [84], which only detect the ALDH domain and can therefore only determine whether a given sequence belongs to the ALDH superfamily. In addition, while the literature generally cites ALDH sizes of approximately 500 amino acids, certain families we found, such as PaaZ, have a larger carboxy-terminal region that enables the formation of homo-hexamers by combining three homo-dimeric structures, deviating from the more commonly reported homo-tetrameric structures [66]. Moreover, some studies include ALDHs with multiple domains but fail to address the impact of these additional domains on the overall function of the protein [8].
We employed pangenomic and phylogenetic clustering techniques, specifically the maximum likelihood approach, to classify ALDH families within the Azospirillum genus. These methods consistently preserved the established groupings present in the CDD, providing strong corroboration for the assigned CDD codes. Subsequently, we undertook a comprehensive literature review to cross-reference the probable functions and, in certain instances, performed identity comparisons with sequences of experimentally validated ALDHs.
We also confirmed the utility of emerging artificial-intelligence-based platforms, such as AlphaFold, for structural approximations and comparative analyses of diverse ALDH families. Notably, these platforms allowed us to examine critical residues such as the catalytic cysteine and activating glutamate, as well as to evaluate ALDHs with extended oligomerization regions or additional domains.
Our study identified ALDHs with single-domain and bi-domain structures. Among the single-domain enzymes, we found enzymes whose products can be incorporated into the tricarboxylic acid (TCA) cycle. For example, the ALDH5 family can generate succinic acid, an intermediate in the TCA cycle [52,85]. The ALDH6 family can generate propionyl-CoA and acetyl-CoA, which are both precursors of succinyl-CoA, another intermediate in the TCA cycle [53,54].
The 6-OL-ALDH family is a group of enzymes that can metabolize long-chain aldehydes, but it does not have a formal family number in the ALDH numbering system. Our results show that the 6-OL-ALDH enzyme from Azospirillum is closely related to other aldehyde dehydrogenases that have been shown to metabolize long-chain aldehydes, such as the AldC enzyme from P. syringae DC3000. The AldC enzyme has been shown to efficiently metabolize aliphatic aldehydes with five to nine carbons [86]. Although there is no experimental evidence to support the function of the 6-OL-ALDH family in Azospirillum, our data suggest that this family of enzymes may have biotechnological potential, especially in the field of bioremediation, because it may be able to metabolize hydrocarbons that share the same metabolic pathways as long-chain aldehydes derived from oils [87].
Despite potential shared metabolic pathways between IAA and phenylacetic acid (PAA), initial phenylalanine deamination in plants appears to be catalyzed by specific enzymes [88]. Therefore, it is crucial to gather experimental evidence regarding the distinct routes through which this phytohormone is synthesized in each microorganism, and to clarify the potential involvement of specific and nonspecific ALDHs. Some of the ALDHs we identified could play a significant role in the metabolism of both phytohormones [89]. This is relevant to the Azospirillum genus because the species A. brasilense, A. baldaniorum, and A. argentinense positively regulate plant growth through IAA, which is produced principally through the indole-3-pyruvate (IPyA) pathway, with the final step of IAA synthesis carried out by aldehyde dehydrogenases [90][91][92][93]. We found families with high identity to experimentally evaluated ALDHs that can perform this function, such as the ALDH2 and DhaS families. The ALDH2 family has high identity with AldA from A. brasilense Yu62 and AldA of P. syringae DC3000 (~90% and ~40%, respectively). In the first case, a mutation of the A. brasilense Yu62 aldA gene resulted in a ~40% reduction in the biosynthesis of this phytohormone [60], while in the second case, the ability to convert indole-3-acetaldehyde (IAC) into IAA was confirmed in P. syringae DC3000, with AldA producing more IAA than the other two ALDHs (AldB and AldC) [58].
In the DhaS family, a related ALDH in Bacillus amyloliquefaciens SQR9 is transcriptionally upregulated in the presence of the precursor tryptophan. When the dhaS gene was mutated, the yield of IAA was only 23% of that produced by the wild-type (WT) strain. However, when the gene was heterologously expressed with other genes from the IPyA pathway, the yield of IAA increased up to 180% of that produced by the WT strain [59].
The COG1012 and MSR-1-like families do not share homology with proteins that have been experimentally tested. However, their identity with the DhaS and ALDH2 families is relatively close to 40%, which is the threshold used to establish a family. Given the promiscuity observed among phylogenetically related ALDH families, it is plausible that the COG1012 and MSR-1-like families may be able to metabolize substrates similar to those metabolized by the DhaS and ALDH2 families, although their efficiency may vary, as is the case with AldA, AldB, and AldC of P. syringae DC3000 [58,86]. It is worth noting that aldA from A. brasilense Sp7 belongs to the ALDH2 family and has been shown to convert acetaldehyde into acetate [22]. Some ALDH families are known to be promiscuous in their substrate specificity [94], which suggests that the ALDH2, COG1012, DhaS, and MSR1-like families could be involved in the metabolism of acetaldehyde and indole acetaldehyde, among other aldehydes, with different yields for each substrate type.
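The role of the 40% identity threshold can be illustrated with a small grouping routine. The sketch below is not the classification method used in this study, which relied on phylogenetic clustering and CDD assignments; it swaps in single-linkage grouping with a union-find structure purely to show how a threshold near 40% behaves. The function name, the sequence labels, and the identity values are hypothetical.

```python
def group_by_identity(
    ids: list[str],
    identity: dict[tuple[str, str], float],
    threshold: float = 40.0,
) -> list[list[str]]:
    """Single-linkage grouping: two sequences end up in the same group when a
    chain of pairwise identities at or above the threshold connects them."""
    parent = {i: i for i in ids}

    def find(i: str) -> str:
        # Walk to the group representative, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (p, q), value in identity.items():
        if value >= threshold:
            parent[find(p)] = find(q)

    groups: dict[str, list[str]] = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical pairwise identities (percent); not values from the study.
toy_ids = ["a", "b", "c", "d"]
toy_identity = {
    ("a", "b"): 85.0,
    ("a", "c"): 38.0,
    ("b", "c"): 41.0,
    ("a", "d"): 20.0,
    ("b", "d"): 22.0,
    ("c", "d"): 25.0,
}
families = group_by_identity(toy_ids, toy_identity)
print(families)
```

In this toy case, sequence c joins the group of a and b through its 41% identity with b even though it shares only 38% with a, which mirrors how families sitting close to the threshold can be difficult to separate.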
α-KGSALDHs have been studied in A. brasilense, where they play a crucial role in arabinose metabolism by converting the intermediate α-KGSA into α-ketoglutarate. Three α-KGSALDHs were examined in these studies and classified into two families: α-KGSALDH-I in one family, and α-KGSALDH-II and α-KGSALDH-III in another independent family, with high identity between II and III [51,95]. Our findings indicate that the α-KGSALDHs of Azospirillum are predominantly grouped into two distinct families, except for the protein QCG99333.1 of A. sp. TSA2s, which is grouped independently. When comparing our sequences to the α-KGSALDH-II and α-KGSALDH-III sequences (accession numbers AB275768 and AB275769), they would be grouped into the larger α-KGSALDH group, while α-KGSALDH-I shows greater similarity to members of the ALDH5 family. Therefore, the second group of α-KGSALDHs from Azospirillum lacks experimentally proven α-KGSALDHs, but they exhibit good grouping in this family in the CDD platform classification, with an E-value of 0.0. This group is related to YcbD, an α-KGSALDH from B. subtilis that is part of an operon with the ability to metabolize D-glucarate/galactarate [96].
The PaaZ family was identified in species lacking the ipdC gene, which plays a crucial role in the IPyA pathway of IAA biosynthesis [91,93]. Azospirillum species known for their beneficial effects associated with IAA synthesis possess the ipdC gene [17,97]. Examination of the genetic context of the strains where paaZ was found showed that it forms part of a catabolon with genes associated with the degradation of phenylacetic acid and possibly IAA (paaABCDE, paaZ or paaN, paaH, paaG, paaI, paaJ, and paaK). This suggests that strains lacking the IPyA pathway but possessing the Paa pathway may have the ability to use IAA or PAA as carbon sources [11]. This aligns with previous reports that attribute the beneficial effects of strains such as A. oryzae to mechanisms other than IAA synthesis, such as nitrogen fixation [98], and that describe negative indole production in A. ramasamyi [44]. Therefore, PaaZ would have a dual function as oxepin-CoA hydrolase/3-oxo-5,6-dehydrosuberyl-CoA semialdehyde dehydrogenase, as previously reported in Pseudomonas putida F1 [99].
This corroborates the presence of two distinct groups within the Azospirillum genus. The first group confers a beneficial effect on plants through the production of IAA in the IPyA pathway. The second group, which possesses the genes necessary for a metabolic pathway that can degrade this phytohormone, elicits beneficial effects that are distinct from IAA biosynthesis.
Analysis of all examined genomes revealed a solitary copy of the ALDH16 family. Comparison of Azospirillum and Sinorhizobium meliloti members showed a shared identity of approximately 67%, indicating that they belong to the same subfamily and likely possess similar functions. These ALDH16 types have a C-terminal domain with a duplicated Rossmann fold, which does not bind NAD like the Rossmann fold of the ALDH domain. However, the function of this region remains unknown [73].
Our analysis also revealed that the ALDH20 family from Azospirillum has a second domain (ADH). Interestingly, we identified a sequence in the A. humicireducens SgZ-5 strain that shares high homology with the ALDH domain of the ALDH20 family. This sequence consists of 396 amino acids and has the locus A6A40_RS31455 and the protein ID WP_236783856.1. Additionally, we found another member of the ALDH20 family in this strain that contains both domains and is 899 amino acids long. It has the locus A6A40_RS21750 and the protein ID WP_108547921.1. The two sequences share 99% identity (sequences are available in Supplementary Material CURATED_DATABASE.xls). However, because we did not find any homologs of the short sequence with a similar size in the NCBI database, we suspect that partial duplication of the long sequence may have occurred. Further testing is needed to determine the functional capacity of this sequence. Although the catalytic cysteine remains intact, the oligomerization domain may be affected due to its incompleteness.
Furthermore, the presence of both ALDH20 and ALDH2 families in all analyzed Azospirillum strains suggests that these bacteria have pathways to produce ethanol through anaerobic fermentation mediated by ALDH20-formed spirosomes, as in E. coli [49,76]. Ethanol degradation is mediated by pyruvate decarboxylases and ALDH2. These findings provide valuable insights into the metabolic capabilities of Azospirillum and may have implications for future research in this area [22].
The ALDH19 family in eukaryotes is characterized by bifunctional proteins, but in Azospirillum the ALDH domain appears to operate on its own, catalyzing aldehyde phosphorylation. These proteins have four beta sheets, unlike other ALDH families, which have five. Additionally, the catalytic cysteine is activated by a residue other than glutamic acid, although the mechanism of this process has not been documented in the current literature [64,71,77].
Our study found that the ALDH19 family exhibits low sequence variation (Table S1), with only one strain showing a deviation from the chromosomal location (Figure 8). We tested the utility of ALDH19 as a phylogenetic marker for the Azospirillum genus by comparing it to the rpoD marker used in previous studies [36]. Our results showed that ALDH19 sequences have nearly five times less variation than RpoD sequences. Additionally, a phylogenetic tree constructed using ALDH19 sequences better retained the current species classification, indicating that ALDH19 can be a reliable tool for organizing new strains of the Azospirillum genus for which complete genomic analysis is not possible. The diversity of ALDHs in Azospirillum bacteria can affect their ability to produce IAA and other beneficial compounds that promote plant growth. Therefore, understanding the genetic diversity of ALDHs in Azospirillum can help identify strains that are more effective in promoting plant growth and can be used as biofertilizers to enhance agricultural productivity.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/d15121178/s1, Figure S1. Subunit structure of Pseudomonas syringae DC3000 indole-3-acetaldehyde dehydrogenase (5IUW). Schematic representation of the structure of a subunit of AldA from P. syringae with bound NAD (spheres representation), showing alpha helices in green, beta sheets in blue, and loops in yellow. A close-up of the Rossmann fold, formed by five beta sheets, is shown in the gray square box. The figure was produced using ChimeraX software. Figure S2. Domain organization of the P. syringae DC3000 AldA monomer (5IUW) [58]. The AldA structural subunit is color coded to distinguish the oligomerization domain (blue), catalytic domain (purple), and NAD(P)+ binding domain (cyan). Figure S3. Pangenome analysis of 17 Azospirillum genomes using Anvi'o software revealed that the identified ALDHs are distributed among 33 Clusters of Orthologous Genes (COGs). Table S1. Identity matrix of global alignment in supplementary file.

Figure 1. Correlation analysis was performed to assess the relationship between the number of contigs and ALDH abundance. The numbers inside the circles correspond to genome size [measured in megabase pairs (Mbp)]. In this analysis, 17 Azospirillum genomes retrieved from the NCBI database were considered.

Figure 2. Length frequency distribution of proteins with ALDH domain.

Figure 3. Three-level pie chart showing the distribution of ALDH families, subfamilies, and sequence identity within each subfamily.

Figure 4. Phylogenetic tree constructed with 315 ALDH protein sequences from strains of the Azospirillum genus, showing families and clusters identified using the Conserved Domain Database (CDD) platform (1000 bootstrap). The number of Azospirillum ALDH proteins found in each family is indicated next to the family name or CDD-ID number.

Figure 5. Number of genes in each ALDH family in 17 complete Azospirillum genomes, with a dendrogram above the table showing the phylogenetic relationship between the strains, as determined using Anvi'o 7.1 software; green and red shadings correspond to higher and lower ALDH values, respectively.

Figure 6. Multiple sequence alignment of the ALDH family of the Azospirillum genus. Multiple sequence alignment of the ALDH families in the Azospirillum genus was constructed using UGENE software. The ALDH consensus sequence is shown above the alignment. The sequences are denoted by their protein ID, followed by the species abbreviation and the ALDH family. This alignment was used in the superposition analysis of the ALDH models obtained by AlphaFold. The first residue highlighted corresponds to glutamic acid (gray), which activates the catalytic cysteine (purple), which is highly conserved in most ALDH families.

Figure 7. Structural comparison by superimposition of Azospirillum ALDH models. (A) Superimposition of the ALDH models obtained by AlphaFold to detect the Rossmann fold that integrates the NAD(P)+ coenzyme (highlighted in a red square box). (B) The catalytic cysteine and glutamic acid residues, which interact with the aldehyde substrate and drive the ALDH enzymatic reaction.

Figure 8. Alluvial diagram showing the localization of Azospirillum genus ALDHs on the chromosome (chr) or on other replicons such as chromids and plasmids (p01 to p07).

Figure 9. Phylogenetic marker utility of the aldH19 gene. A comparison of phylogenetic trees from the rpoD and aldH19 genes.

Table 1. Information on the Azospirillum strains used in the study.

Table 2. Aldehyde dehydrogenase gene families in the Azospirillum genus and their probable function.