# Pigeons: A Novel GUI Software for Analysing and Parsing High Density Heterologous Oligonucleotide Microarray Probe Level Data


## Abstract

® probe level data through Xspecies. One can determine empirical boundaries for removing poor probes based on genomic hybridisation of the test species to the Xspecies array and then create a species-specific Chip Description File (CDF) for transcriptomics in the heterologous species. Alternatively, Pigeons can be used to examine an experimental design to identify potential Single-Feature Polymorphisms (SFPs) at the DNA or RNA level. Pigeons also emphasises visualization and interactive analysis of the datasets. The software with its manual (current release: version 1.2.1) is freely available at the website of the Nottingham Arabidopsis Stock Centre (NASC).

## 1. Introduction

® arrays and dominated the market of high-density microarrays for many years. Although significant quantities of informative, reproducible, and high-quality data are generated by the use of a GeneChip® for expression profiling, the Affymetrix chips are only available for a limited number of eukaryotic species and a small number of model/commercial plant species, including Arabidopsis thaliana, barley, rice, maize, tomato, soybean, sugar cane, grape and wheat [3,4].

® of a heterologous species. The next step uses a script to parse an Affymetrix CDF file of the selected chip. The parser uses the CDF file of the chip and the CEL file of the hybridisation to identify and remove “bad” probe-pairs whose perfect match probe intensities are below a cut-off value defined by the user, eventually making a “new” CDF file for Species X [5]. The new probe-masked file, namely the species X.CDF, can be used for Xspecies transcriptomic analysis of RNA hybridisation. Hammond et al. [3] showed that the Xspecies approach had been successfully applied to analyzing the transcriptome of Brassica oleracea L. by labeling gDNA from B. oleracea and hybridising it to the ATH1-121501 (ATH1) GeneChip® array. The approach with heterologous oligonucleotide microarrays was also utilised to profile and compare the transcriptional levels of Thlaspi caerulescens and Thlaspi arvense, both being species for which no GeneChip® is available [4]. A further application of this approach was to examine the evidence for neutral transcriptome evolution in plants by quantifying more than 18,000 gene transcripts across 14 taxa from the Brassica family [6].

However, the original script parser has a specific limitation in choosing the cut-off: the selection of the value is essentially arbitrary, although a more recent iteration does allow a degree of sub-sampling to suggest thresholds. One method to improve on this approach is to generate many custom CDF files according to different cut-offs, from low to high. A range of good probe-pairs and probe-sets with respect to the chosen cut-offs is then obtained. The researcher, using a spreadsheet, plots these data as background information and uses them to decide the optimal value of the cut-off and the corresponding CDF file [7]. The approach is valid but still human-dependent, since people choose the threshold based on their observations and experience when looking at the plot.

^{®}s, researchers have the ability to screen hybridisation datasets for potential SFP markers that exist in minor species. Thus, it is essential to design biological and algorithmic approaches for heterologous oligonucleotide microarray analysis, to help facilitate the genomic investigation of minor plants and animals. Here, we have developed an innovative software package “Pigeons”, abbreviated from “Photographically InteGrated En-suite for the OligoNucleotide Screening”, to work towards a solution to the issues mentioned above. Pigeons allows the user to input and analyse microarray data from the Xspecies microarray approach. This can be DNA hybridisations across species, to determine the empirical boundaries for custom CDF files for Xspecies transcriptomics or to examine an experimental design to identify SFPs at single oligonucleotides within the probe-sets, either at the DNA or RNA level. To allow intuitive interaction and final selection of features of interest, we have also developed a specific visualization interface to facilitate navigation through the hundreds of thousands of Affymetrix oligonucleotides.

## 2. Methods and Algorithms

$P_{1}$; chip 1) and a genotype derived from a domesticated landrace (DipC; Parent 2; $P_{2}$; chip 2) was made and a single hybrid seed ($F_{1}$) allowed to grow and produce an $F_{2}$ population of seed. This population was planted and recorded at the Tropical Crops Research Unit at the University of Nottingham in 2003. Individual plants were recorded for numerous traits, including “number of stems per plant”. The extremes of the “number of stems per plant” distribution were identified, and 10 plants from each extreme had DNA extracted by standard techniques and mixed in equal amounts to produce a bulked sample of “low stem number” (chip 3) and a bulked sample of “high stem number” (chip 4), respectively.

#### 2.1. ATM

**Figure 1.** Plane Curve. A vector-valued function traced out by retention units with respect to the cut-off of poorly hybridising oligonucleotides using the heterologous GeneChip® platform, with ATH1-121501 used as the basis to generate the image.

A vector-valued retention function $F: X \to Y \subseteq \mathbb{R}^{2}$ is defined as follows:

$$F(x) = \bigl(f_{1}(x),\, f_{2}(x)\bigr) = (y_{1}, y_{2}),$$

where $f_{1}$ and $f_{2}$ are real-valued functions of the parameter $x$. The two components of $Y$, $Y_{1}$ and $Y_{2}$, are therefore viewed as the sets of retained probe-pairs and probe-sets, respectively, when a defined cut-off is given. Using the vector-valued retention function $F$, we can trace the graph of a curve to see the relationship between the cut-off and the retention units of probe-pairs and probe-sets. The point of the position vector $F(x)$ coincides with the point $(y_{1}, y_{2})$ on the plane curve given by the component equations, as shown in Figure 1. The arrowhead on the curve represents the curve’s orientation by pointing in the direction of increasing values of $x$, namely $x_{3} > x_{2} > x_{1}$. Due to the nature of the problem, the retention function $F$ decreases monotonically in the direction of the point (0, 0). This characteristic means that the mapping from $y_{1}$ to $y_{2}$ is also a monotone function; moreover, it resembles a learning curve with a stagnant phase.
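As a rough illustration of the retention function, the sketch below counts the retained probe-pairs ($y_{1}$) and probe-sets ($y_{2}$) for a list of cut-offs. The function name `retention_curve`, the toy intensities and the rule that a probe-set is retained when at least one of its probe-pairs survives are assumptions made for the example, not details taken from Pigeons itself.

```python
def retention_curve(pm_intensity, probe_set_of, cutoffs):
    """Return one (x, y1, y2) triple per cut-off x.

    y1 = number of retained probe-pairs, y2 = number of retained probe-sets.
    Assumption: a probe-set counts as retained if at least one of its
    probe-pairs survives the cut-off.
    """
    curve = []
    for x in cutoffs:
        kept_pairs = 0
        kept_sets = set()
        for probe, intensity in pm_intensity.items():
            if intensity >= x:  # probe-pairs below the cut-off are removed
                kept_pairs += 1
                kept_sets.add(probe_set_of[probe])
        curve.append((x, kept_pairs, len(kept_sets)))
    return curve


# Toy gDNA hybridisation intensities (made-up values).
pm_intensity = {"p1": 50, "p2": 120, "p3": 300, "p4": 80, "p5": 450, "p6": 30}
probe_set_of = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B", "p6": "C"}

curve = retention_curve(pm_intensity, probe_set_of, cutoffs=[0, 100, 400])
for x, y1, y2 in curve:
    print(x, y1, y2)
```

Both counts can only fall as the cut-off rises, which is the monotone behaviour described above.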

The turning point $F(x_{tp})$ can be defined as the intersection between a tangent to the stagnant phase of the curve and the tangent to the linear-like decreasing portion of the curve. The inverse of this point, $F^{-1}(F(x_{tp}))$, could be selected as the threshold value. However, the cut-off decision problem is not deterministic, and it usually needs to take biological sense into account, so it requires more tolerance in the selection of the threshold. The ATM offers a turning portion (TP) covering the turning point and derived from a closed interval $I$ from which realistic thresholds can be retrieved. Let $I$ be the surrounding area of $x_{tp} \in X$ such that $F(x_{tp})$ is the turning point; we then construct the turning portion by $TP = \{F(x) : x \in I\}$. Construction requires careful definition of a lower boundary ($x_{lb}$) and an upper boundary ($x_{ub}$) of $I$, with the aim of selecting a flexible region rather than a single turning point. Since $F$ is a one-to-one, monotonically decreasing function that is well defined on the interval $I$, we can in theory define $x_{lb}$ and $x_{ub}$ such that $F(x_{lb})$ would be in the terminating phase of the plateau and $F(x_{ub})$ would be in the earliest phase of sharp decline, respectively.
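The geometric definition of the turning point can be sketched as the intersection of two least-squares lines, one fitted to points on the plateau and one to points on the declining portion of the $(y_{1}, y_{2})$ curve. This is only an illustration: the helper names `fit_line` and `intersect`, the synthetic curve and the hand-picked segments are assumptions, and the ATM described here derives its intervals by fuzzy clustering rather than by this construction.

```python
def fit_line(points):
    """Least-squares line y = a*x + b through a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    a = sxy / sxx
    return a, my - a * mx


def intersect(line1, line2):
    """Intersection point of two non-parallel lines given as (slope, intercept)."""
    (a1, b1), (a2, b2) = line1, line2
    x = (b2 - b1) / (a1 - a2)
    return x, a1 * x + b1


# Synthetic learning-like curve in the (y1, y2) plane (made-up numbers):
# the probe-set count y2 stays flat while y1 is large, then drops linearly.
plateau = [(y1, 100.0) for y1 in (60, 70, 80, 90, 100)]
decline = [(y1, 2.0 * y1) for y1 in (0, 10, 20, 30, 40)]

turning_point = intersect(fit_line(plateau), fit_line(decline))
print(turning_point)
```

Mapping the resulting point back to the nearest cut-off on the retention curve would give a single candidate threshold, which is the value that the turning portion is meant to relax.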

In the vector space $\mathbb{R}^{3}$, let $U$ be an $r$-dimensional subspace of $\mathbb{R}^{3}$ and $U^{\perp}$ be the orthogonal complement of $U$. Given a matrix $\mathbf{B}_{3 \times r}$ such that the column space of $\mathbf{B}$ is $U$, then for $v \in \mathbb{R}^{3}$ there exists a projector $\mathbf{P}$ that projects $v$ onto $U$ along $U^{\perp}$, i.e., $\mathbf{P}v = u$, $u \in U$. The unique linear operator $\mathbf{P}$ can be acquired by $\mathbf{P} = \mathbf{B}(\mathbf{B}^{T}\mathbf{B})^{-1}\mathbf{B}^{T}$; in particular, if the columns of $\mathbf{B}$ constitute an orthonormal basis, then $\mathbf{P} = \mathbf{B}\mathbf{B}^{T}$. During simplification of the system, the goal at this stage is to minimize the loss of information relevant to the problem of concern. As a consequence, given $\mathbf{B}$ (e.g., $[0,0,1]^{T}$) and $n$ vectors of thresholds with their retention units, after linear transformation of each vector $v_{i} \in \mathbb{R}^{3}$, $i = 1, \ldots, n$, we gain a learning data set $D = \{u_{i} : \mathbf{P}v_{i} = u_{i} \in U\}$ that ideally has the most informative features for turning portion discovery. Suppose that all the data vectors in TP have been projected onto a particular area; we define the area as a hotspot $D'$ such that

$[x_{\inf(J)}, x_{\sup(J)}]$

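$\mathbf{M}$. Given a number of clusters $c$ ($1 < c < n$), the learning data set is dominated by fuzzy sets and the fuzzy partition matrix $\mathbf{M} = [m_{ij}]_{c \times n}$, where $m_{ij} \in [0, 1]$ and $\sum_{i=1}^{c} m_{ij} = 1$. For the individual entries in $\mathbf{M}$, $m_{ij}$ is the membership degree of element $u_{j} \in D$ to cluster $i$. Let $\Omega = \{\omega_{1}, \ldots, \omega_{c}\}$ be a set of cluster prototypes so that each cluster is represented by a cluster centre vector $\omega_{i}$, and the objective function with two constraints can then be defined as below:

$$J_{q}(\mathbf{M}, \Omega) = \sum_{i=1}^{c} \sum_{j=1}^{n} m_{ij}^{q}\, d_{ij}^{2}, \quad \text{subject to } \sum_{i=1}^{c} m_{ij} = 1 \ (j = 1, \ldots, n) \ \text{ and } \ 0 < \sum_{j=1}^{n} m_{ij} < n \ (i = 1, \ldots, c),$$

where $q \in \mathbb{R}_{>1}$ is termed the “fuzzifier” or weighting exponent, and $d_{ij}$ is the distance between object $u_{j}$ and cluster centre $\omega_{i}$. Within ATM, the Euclidean inner product norm denoted by $\|\cdot\|$ is taken, i.e., $d_{ij} = \|u_{j} - \omega_{i}\|$. The purpose of the clustering algorithm is to obtain the solution $\mathbf{M}$ and $\Omega$ minimizing the cost function $J_{q}$, and this can be carried out by: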

$I$ has been established. Besides the selection of feasible cut-offs, the ATM also provides an automated threshold value $x_{ATM}$ and a target interval $I'$ for the selection of candidate cut-offs. Both $x_{ATM}$ and $I'$ are evaluated through the fuzzy boundary between the first two fuzzy sets. The elements in the boundary are those that the two fuzzy sets have in common with various membership values. Owing to the grayness characteristic and the continuity of the learning-like curve, we believe that a good threshold value for parsing the Affymetrix chip description files would come from a projected object that simultaneously belongs to the two clusters with remarkable membership degrees. As a result, the fuzzy boundary enables us to offer a more reasonable selection of threshold boundary cut-offs. Two indices, $l$ and $k$, are utilised to determine the highly likely threshold boundary cut-offs and the automated threshold value, determined by

$[x_{l}, x_{k}]$ is constructed as the target interval $I'$. Let $\bar{u}$ be the arithmetic mean of the elements of the fuzzy boundary; $x_{ATM}$ can also be calculated by linear interpolation or by the Lagrange polynomial, as shown in the following formulae:

($x_{ATM}$, $I'$, $I$) to resolve the issue of the threshold cut-off choices. The suggested cut-off given by the ATM, $x_{ATM}$, can be used directly to remove the weak intensity signals, while any value within the target interval, $I' = [x_{l}, x_{k}]$, can be taken as a potential threshold boundary cut-off. The design of the target interval gives users a chance of picking a scientifically reasonable value on their own. Values in the tolerance interval, i.e., $x \in I = [x_{\inf(J)}, x_{\sup(J)}]$, can be used as feasible thresholds, and values outside the interval are viewed as less feasible choices.
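The projection and clustering steps of the ATM can be sketched with the standard fuzzy c-means alternating updates, $m_{ij} = \bigl[\sum_{k}(d_{ij}/d_{kj})^{2/(q-1)}\bigr]^{-1}$ and $\omega_{i} = \sum_{j} m_{ij}^{q} u_{j} / \sum_{j} m_{ij}^{q}$. Everything named here is an assumption for illustration: the fuzzifier symbol `q`, the initial centres, the toy vectors, and the restriction to a single projection vector $[0,0,1]^{T}$ so that the projected data are scalars. The cluster validity indices and the fuzzy-boundary evaluation that yield $x_{ATM}$, $I'$ and $I$ are not reproduced.

```python
def project(v, b):
    """Apply P = b (b^T b)^{-1} b^T for a single column vector b (r = 1)."""
    scale = sum(x * y for x, y in zip(b, v)) / sum(x * x for x in b)
    return [scale * x for x in b]


def memberships(data, centres, q):
    """Membership matrix M[i][j] of scalar u_j in cluster i."""
    M = [[0.0] * len(data) for _ in centres]
    power = 2.0 / (q - 1.0)
    for j, u in enumerate(data):
        d = [abs(u - w) for w in centres]
        if min(d) == 0.0:
            # u_j coincides with a centre: share full membership among exact hits
            hits = d.count(0.0)
            for i, dist in enumerate(d):
                M[i][j] = 1.0 / hits if dist == 0.0 else 0.0
        else:
            for i in range(len(centres)):
                M[i][j] = 1.0 / sum((d[i] / dk) ** power for dk in d)
    return M


def fuzzy_c_means(data, centres, q=2.0, max_iter=200, tol=1e-9):
    """Alternate membership and centre updates until the centres stop moving."""
    centres = list(centres)
    for _ in range(max_iter):
        M = memberships(data, centres, q)
        updated = [
            sum(M[i][j] ** q * u for j, u in enumerate(data))
            / sum(M[i][j] ** q for j in range(len(data)))
            for i in range(len(centres))
        ]
        shift = max(abs(a - b) for a, b in zip(updated, centres))
        centres = updated
        if shift < tol:
            break
    return centres, memberships(data, centres, q)


# Made-up vectors (cut-off, retained probe-pairs, retained probe-sets).
vectors = [(50, 9000, 100), (100, 8000, 100), (150, 7000, 99), (200, 6000, 98),
           (250, 5000, 60), (300, 4000, 40), (350, 3000, 20)]
b = [0, 0, 1]
projected = [project(v, b)[2] for v in vectors]  # coordinate along b
centres, M = fuzzy_c_means(projected, centres=[100.0, 30.0])
print(centres)
```

Objects whose memberships in the two clusters are both appreciable lie on the fuzzy boundary, which is where the text above looks for candidate thresholds.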

#### 2.2. DFC

$G_{1}$ and $G_{2}$) under the design of the single trait experiment. While two distinct parental genotype gDNAs are involved in generating $G_{1}$, $G_{2}$ is composed of two different phenotypically based $F_{2}$ bulk segregant pools, derived from a hybrid between the two parental genotypes. We then label the four Xspecies chips with the indices 1 and 2 for the two parent samples and 3 and 4 for the two $F_{2}$ bulks. In practice, these $F_{2}$ bulks are constructed from the pooled DNA of $F_{2}$ individuals. These are derived from the controlled cross between the parental genotypes, with allocation to the contrasting bulk based upon a specific trait of interest. The phenotype classification is a necessary prerequisite for the numerical analysis of potential SFP markers.

Chips 1 and 3 are classified into one type under a single trait experiment, whereas chips 2 and 4 belong to the other trait version; this classification is the prerequisite referred to in the remainder of this section. Let $N$ be the number of genes and $\#(\cdot)$ be the cardinal number of a probe-set; then each chip can be represented as follows:

where $b_{ij}^{m}$ denotes the $j$-th signal intensity of the $i$-th probe-set on the $m$-th chip. Let $Q_{ij}^{1} = b_{ij}^{1}/b_{ij}^{2}$ and $Q_{ij}^{2} = b_{ij}^{3}/b_{ij}^{4}$ be the intensity ratios of $G_{1}$ and $G_{2}$, respectively. A ratio of one for a feature thus represents an unchanged hybridisation signal in this experiment, and a ratio less than or greater than one indicates a differentially hybridised oligonucleotide. To generate a symmetric distribution of intensity ratios, the fold-change ratio is defined by

$FC_{ij}^{1}$ is used to assess the differential probe hybridisation of the parental group. For the evaluation of the offspring group, $FC_{ij}^{2}$ is calculated in the same way as $FC_{ij}^{1}$, simply replacing $Q_{ij}^{1}$ with $Q_{ij}^{2}$. Given the threshold of weak signals $x_{ATM}$, the cut-off of a fold-change between the parents, $\epsilon_{1}$, and that between the offspring, $\epsilon_{2}$, a number of logical criteria are applied to globally screen and search Affymetrix’s single oligoprobes for SFP markers. For every chip $m$, let the first condition be $b_{ij}^{m} > x_{ATM}$, since any signal whose intensity is below the threshold should not be treated as a good probe in the analysis of heterologous data; this satisfies the demand of the Xspecies technology. When the first criterion holds, the DFC runs the second condition with the two fold-change indicators, $FC_{ij}^{1} \geq \epsilon_{1}$ and $FC_{ij}^{2} \geq \epsilon_{2}$, to measure whether the prerequisite still holds at the genomic level.

The FC approach is commonly used in microarray data analysis to identify differentially expressed genes (DEGs) between a treatment and a control. Calculated as the ratio of two conditions/samples, the FC gives the absolute ratio of normalized intensities on a non-log scale. We extend the same concept in our approach by introducing an additional FC: one ratio assesses the differential hybridisation within $G_{1}$ and the other assesses the differential hybridisation within $G_{2}$. The extra FC tests whether the difference in phenotype could result from a difference in genotype at a single locus. Therefore, when there are any differentially hybridised oligonucleotides for the feature of interest between the two parental genotypes, the inherited attribute of the prerequisite would imply that we could expect those differentially hybridised oligonucleotides to have also been transmitted into the $F_{2}$ individuals. In short, the corresponding fold-change of the $F_{2}$ is introduced as a cross-check mechanism for identifying SFPs which are consistent between parental genotype/trait and bulk genotype/trait. The mixture of $F_{2}$ genotypes (which are bulked according to the trait difference which segregates within the cross) should mean that the attribute difference is only detected when the location of the parental SFP is close to the gene controlling the trait difference. The accuracy of this approach is dependent upon the bulk size used. Smaller bulk sizes will lead to the identification of SFPs which are located distantly from (and probably on different chromosomes to) the target trait associated SFPs. Oligo-probes that satisfy the second criterion above are potential SFP markers distinguishing the two phenotypes and could be further tested and used for genetic mapping of the gene controlling the phenotypic difference.
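The two DFC criteria can be sketched as a simple filter over the four chips. The names `fold_change` and `dfc_candidates`, the toy intensities and, in particular, the symmetric form of the fold-change (the larger of $Q$ and $1/Q$) are assumptions for illustration, since the exact fold-change definition is not reproduced in this text. The sketch does not check that the parental and offspring ratios point in the same direction.

```python
def fold_change(q):
    """Symmetric fold-change on a non-log scale (assumed form): always >= 1."""
    return q if q >= 1.0 else 1.0 / q


def dfc_candidates(chips, x_atm, eps1, eps2):
    """Return the (probe-set, probe) keys that pass both DFC criteria.

    chips maps the chip index m (1, 2 = parents; 3, 4 = F2 bulks) to a
    dictionary {(i, j): intensity}.
    """
    hits = []
    for key in chips[1]:
        b = [chips[m][key] for m in (1, 2, 3, 4)]
        if min(b) <= x_atm:          # first criterion: every signal above x_ATM
            continue
        fc1 = fold_change(b[0] / b[1])   # parental group G1
        fc2 = fold_change(b[2] / b[3])   # offspring group G2
        if fc1 >= eps1 and fc2 >= eps2:  # second criterion
            hits.append(key)
    return hits


# Made-up intensities for three probes on four chips.
chips = {
    1: {("g1", 1): 800.0, ("g1", 2): 500.0, ("g2", 1): 90.0},
    2: {("g1", 1): 300.0, ("g1", 2): 480.0, ("g2", 1): 400.0},
    3: {("g1", 1): 700.0, ("g1", 2): 520.0, ("g2", 1): 85.0},
    4: {("g1", 1): 350.0, ("g1", 2): 500.0, ("g2", 1): 390.0},
}
candidates = dfc_candidates(chips, x_atm=100.0, eps1=2.0, eps2=1.5)
print(candidates)
```

In this toy data only the first probe passes: the second shows almost no fold-change, and the third is discarded by the weak-signal threshold before any ratio is computed.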

#### 2.3. POST

$MA_{ij}$ for the examination of a signal variant in the single trait experiment; for each probe $j$ of probe-set $i$ the value is calculated by the following formula:

® microarray data analysis [17,18,19], and is the average intensity ratio between parental samples and $F_{2}$ bulks on a base 2 logarithmic scale, with M a mnemonic for subtraction and A a mnemonic for addition. The POST then uses the MA-value and a single-sample t-test to statistically assess differentially hybridised oligonucleotides between the parent group and the offspring group and to test, in a probe-set $i$, whether or not there is a significant difference between an interrogated probe $k$ and the other probes in that probe-set in terms of their log ratios. As a test statistic, the average of the MA-values of each of the probe-pairs except the probe $k$ is denoted by $\rho_{ik}$ and determined by:

$$\rho_{ik} = \frac{1}{n_{i}} \sum_{j \neq k} MA_{ij},$$

where $n_{i} = \#(B_{i}) - 1$ is the sample size in the examined probe-set $i$. Suppose that the sampling distribution of $\rho_{ik}$ is normal so that the random variable

$$T_{ik} = \frac{\rho_{ik} - MA_{ik}}{S_{ik} / \sqrt{n_{i}}}$$

follows a Student’s t-distribution with $n_{i} - 1$ degrees of freedom, where $S_{ik}$ is the standard deviation of the sample of the log ratios in the $i$-th probe-set excluding the MA-value of the oligoprobe $k$. The last step performed by the POST is to asymptotically compute the p-value, converting the value of $T_{ik}$ into a probability that expresses how likely the oligonucleotides in question are to be differentially hybridised. To visualize the results of this probewise testing of single oligonucleotides, a filter with a Volcano Plot output was also developed. The volcano plot is an effective and easy-to-interpret scatter plot for the selection of DEGs [11]. In the POST, the plot shows the negative common logarithm (base 10) of the p-value versus the average intensity ratio in the form of the binary logarithm (base 2), i.e., the average fold-change ratio. Probe-pairs with large log ratios and low p-values are easily detectable in the view, and a list of potential SFP markers can be generated.
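A minimal sketch of the probewise, leave-one-out test statistic is given below. The function name `post_statistic`, the toy MA-values and the sign convention of the numerator are assumptions. Converting the statistic into a p-value requires the Student's t-distribution with $n_{i} - 1$ degrees of freedom from a statistics library, which is left out to keep the example free of third-party dependencies.

```python
import math
from statistics import mean, stdev


def post_statistic(ma_values, k):
    """One-sample t statistic comparing probe k with the rest of its probe-set.

    Returns (t, degrees_of_freedom). The sign convention is an assumption.
    """
    others = ma_values[:k] + ma_values[k + 1:]
    n = len(others)                # n_i = #(B_i) - 1
    rho = mean(others)             # rho_ik: mean MA-value without probe k
    s = stdev(others)              # S_ik: sample standard deviation without probe k
    t = (rho - ma_values[k]) / (s / math.sqrt(n))
    return t, n - 1


# Made-up MA-values for one probe-set; the last probe stands out.
ma_values = [0.1, -0.1, 0.0, 0.1, -0.1, 2.0]
t, dof = post_statistic(ma_values, k=5)
print(t, dof)
```

A large absolute value of the statistic for probe `k` indicates that its log ratio departs from the other probes in the same probe-set.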

$G_{1}$ or the offspring group $G_{2}$. We name the intensity difference the D-value, in contrast to the MA-value, and define it in compliance with the trait of interest as below:

$G_{1}$ or $G_{2}$; meanwhile, an ad hoc test procedure within $G_{1}$ or $G_{2}$ also assumes that the population distribution is at least approximately normal and proceeds with the probe-wise strategy. However, there are practical issues that need to be addressed. The majority of intensity signals are likely to be affected by poor hybridisation of the target genome to the heterologous oligonucleotide microarray, leading to the presence of a few or even one possible SFP within a probe-set. The exact number per probe-set will depend upon the evolutionary distance between the target species and the design array, the rate of evolution of the individual gene represented by the probe-set and the array design itself. Thus, the sample mean is in general a good estimator for the central value of the data distribution of $\delta_{ij}$ when statistical testing is performed according to the probe-wise strategy. But for those probe-sets which have two or more possible SFPs, the mean is no longer an appropriate measure of location under the probe-wise procedure since it will be susceptible to an extreme value. Accordingly, the $\gamma$-trimmed mean ($0 < \gamma < 0.5$) is employed instead of the mean as the statistic in this version of POST. More formally, let $\Delta_{i}^{k} = \{\delta_{ij} : j = 1, \ldots, k-1, k+1, \ldots, \#(B_{i})\}$ and let $\delta_{i(1)} \leq \delta_{i(2)} \leq \cdots \leq \delta_{i(n_{i})}$ be the observations of $\Delta_{i}^{k}$ written in ascending order. We define the sample $\gamma$-trimmed mean $\bar{\delta}_{ik}$ to account for probe-specific fluctuations in a probe-set $i$, and its value is calculated by

$$\bar{\delta}_{ik} = \frac{1}{n_{i} - 2h} \sum_{j=h+1}^{n_{i}-h} \delta_{i(j)},$$

where $h = [\gamma n_{i}]$ is the value of $\gamma n_{i}$ rounded down to the nearest integer. Then, let $s_{ik}^{2}$ be the sample $\gamma$-Winsorized variance of the data in $\Delta_{i}^{k}$ and consider the finite-sample Student-t statistic analogue; the $\gamma$-trimmed mean can be studentized by $s_{ik}$ in the form of

$t_{ik}$ using a Student’s t-distribution with $n_{i} - 2h - 1$ degrees of freedom. Also, Patel et al. [21] further introduced a scaled Student-t variate $a(n_{i},h)\,t_{ik}$ and proposed approximating the distribution of $a(n_{i},h)\,t_{ik}$ with a Student’s t-distribution having $v(n_{i},h)$ degrees of freedom, where $a(n_{i},h) = 1 + 16h^{0.5}e^{2h-n_{i}}$ for small-sample ($n_{i} < 18$) t-type statistics and $v(n_{i},h)$ varies slightly with $\gamma$ in their investigation. Given $\gamma$ = 0.05, 0.10, 0.15, 0.20 or 0.25, we apply the Tukey-McLaughlin suggestion and Patel’s refined approximation to each $t_{ik}$ for the calculation of the p-value, and the asymptotic p-value accompanied by the intensity difference can therefore be prepared for the volcano plot filter and output. To better reveal large-magnitude changes in the output, the POST uses the square-root transformation of the D-value into the fold-change difference $FCD_{ij}$, defined as follows:

$G_{1}$ and $G_{2}$, one can quickly identify the most meaningful changes in hybridisation signal strength focused on the feature of interest.
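The robust location and scale estimates used in this version of POST can be sketched as follows. The function names are hypothetical, the Winsorized variance uses the common $n - 1$ divisor (an assumption, as the text does not state the divisor), and the studentized statistic itself is not shown. The scaling factor follows the expression $a(n_{i},h) = 1 + 16h^{0.5}e^{2h-n_{i}}$ quoted above.

```python
import math


def trimmed_mean(values, gamma):
    """gamma-trimmed mean: drop h = floor(gamma * n) values from each tail."""
    xs = sorted(values)
    n = len(xs)
    h = math.floor(gamma * n)
    kept = xs[h:n - h]
    return sum(kept) / len(kept), h


def winsorized_variance(values, gamma):
    """gamma-Winsorized variance: tail values are replaced by the nearest kept value."""
    xs = sorted(values)
    n = len(xs)
    h = math.floor(gamma * n)
    w = [xs[h]] * h + xs[h:n - h] + [xs[n - h - 1]] * h
    m = sum(w) / n
    return sum((x - m) ** 2 for x in w) / (n - 1)  # n - 1 divisor is an assumption


def patel_scale(n, h):
    """Scaling factor a(n, h) = 1 + 16 * h**0.5 * exp(2h - n) for small samples."""
    return 1.0 + 16.0 * math.sqrt(h) * math.exp(2 * h - n)


# Made-up D-values for one probe-set with a single extreme observation.
d_values = [1.0, 2.0, 3.0, 4.0, 100.0]
centre, h = trimmed_mean(d_values, gamma=0.2)
spread = winsorized_variance(d_values, gamma=0.2)
print(centre, h, spread, patel_scale(len(d_values), h))
```

With one value trimmed from each tail, the extreme observation no longer dominates the location estimate, which is the motivation given above for replacing the mean.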

## 3. Results and Discussion

#### 3.1. Software Implementation

® data generated from cross-species experiments and the current version number is 1.2.1, released in late June 2012. The software is able to read most recent or current Affymetrix .CEL file types, including version 3, version 4 and Command Console version 1 (the latest one at the time of program development). It is focused on visualization and interactive studies of data (Figure 2). The program is distributed under a freeware license, so it is free of charge to download and to fully execute for research uses. The .NET Framework version 3.5 or greater is required to install the program. At least 2 MB of free hard disk space is needed to execute the program, while 200 MB is preferable if data/image file export is required. As a rule of thumb, the more RAM the better the capacity, and the faster the microprocessor the quicker the response. At least 1 GB RAM and an Intel® Pentium® M-class processor or better are recommended, although slower CPU speeds with 512 MB system memory will still work in most circumstances. The software has been successfully tested on Windows 2000, Windows XP, Windows Vista and Windows 7.

**Figure 2.** Software Snapshots. Pigeons is a tab-page based standalone graphical user interface (GUI) program. There are three tab-pages for the three main applications in the main form. Each application can be used either separately or jointly. Other tools in a menu strip are also tab-page associated, that is, their availability depends on the application currently being performed. (**A**) Central Applications. The three main applications are: (i) Pigeon Filter; (ii) Pigeon Mining/Image; and (iii) Pigeon Query. These are executed after the completion of two core components: (iv) File Reading; and (v) Data Preprocessing. (**B**) Statistical Analyses. Several essential tools can also be called from the menu strip. They are: (i) Dual fold-change (DFC); (ii) Probewise one-sample statistical test (POST); (iii) Twin Volcano Plot; (iv) Volcano Plot; (v) Box Plot dialog-box; and (vi) Box Plot output.

$F_{2}$ hybrids using the binary average fold-change ratio (Figure 2(B-iv)), the Twin Volcano Plot (TVP) has been designed based on statistical tests within the groups (Figure 2(B-iii)). Results acquired by either the DFC or the POST can be exported as lists and as graphical representations for probe-sets to assist in the interpretation of oligo-level data at the DNA or RNA level. Pigeon Query is an interface for quick probe-set retrieval from datasets (Figure 2(A-iii)). Besides the three main applications, two essential upstream tools are also included in this software package: data preprocessing (Figure 2(A-v)) and a box-and-whisker plot (Figure 2(B-v, B-vi)). The Exponential-Normal Convolution Model is used for background correction to adjust for systematic effects that arise from variation in the Affymetrix platform [18]. Pigeons employs quantile normalization to address the comparability of intensity distributions between arrays [19]. One can then use the box-and-whisker plot, an important quality control tool, to examine the data before and after data preprocessing. This exploratory data analysis checks for any extraordinary chip distributions and verifies whether a normalization procedure has been effective. A user manual is built into the installer program so that users can access it from the start menu of MS Windows after Pigeons has been successfully installed on a local machine. The software with its manual (current release: version 1.2.1) can be freely downloaded at http://affymetrix.arabidopsis.info/xspecies/pigeons.

#### 3.2. Case Studies of ATM

® Arrays [3,4] whereas the two animals (cases 4 and 5) were hybridised onto the Human U133 Plus 2.0 Genome Arrays [22,23]. In the third case, the Affymetrix Rice Genome Array was used to investigate transcriptomic profiling related to drought stress in Musa [7]. In the original Xspecies approach, i.e., the first case, a probe mask created at a cut-off value of 400 was determined systematically and empirically by generating 13 custom CDF files with a series of gDNA hybridisation intensity thresholds, and each CDF was assessed in turn. The probe mask file excluded 68% of the probe-pairs but retained 96% of the available probe-sets, and this was used to study transcriptional response under phosphorus stress. This empirical method of determining the cut-off value was also applied to the second and the fourth cases, which selected the preferred hybridisation intensity thresholds of 300 and 100, respectively. The same probe selection strategy, but with subtly different considerations, was applied in the third and fifth cases. The authors of these two studies chose the hybridisation intensity threshold whose probe mask file detected the maximum possible number of Differentially Expressed Genes (DEGs), even though Hammond et al. showed that there was a significant loss of available probe-sets for transcriptomic profiling at the higher end of the cut-off value [3]. As a result, the selected cut-offs used in Banana and Sheep were 550 and 450, respectively.

Species | Selected Cut-off | Difference (%) | ATM Suggested Cut-off | ATM Target Interval | ATM Tolerance Interval | Reference
---|---|---|---|---|---|---
Brassica oleracea L. | 400 | 2.17 | 391.34 ^{a} | [351,426] ^{a} | [272,454] ^{a} | Hammond et al. 2005 [3]
Thlaspi caerulescens | 300 | 10.54 | 331.63 ^{a} | [297,363] ^{a} | [234,387] ^{a} | Hammond et al. 2006 [4]
Musa (Banana) | 550 | 10.47 | 492.40 ^{b} | [399,586] ^{b} | [305,698] ^{b} | Davey et al. 2009 [7]
Equine (Horse) | 100 | 5.93 | 94.07 ^{a} | [82,106] ^{a} | [65,119] ^{a} | Graham et al. 2010 [22]
Ovine (Sheep) | 450 | 6.93 | 481.20 ^{b} | [381,582] ^{b} | [284,694] ^{b} | Graham et al. 2011 [23]

^{a} ATM was accompanied by a cluster validation procedure using Fukuyama-Sugeno’s index.

^{b} The partition entropy was applied as a cluster validity index in the ATM algorithm.
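The percentage column of the table appears to be the relative difference between the cut-off selected in each study and the cut-off suggested by the ATM, expressed as a percentage of the selected value. That reading is an inference from the numbers rather than a statement in the text, and the short check below recomputes it; the function name `percent_difference` is hypothetical.

```python
# (selected cut-off, ATM suggested cut-off, percentage reported in the table)
rows = {
    "Brassica oleracea L.": (400, 391.34, 2.17),
    "Thlaspi caerulescens": (300, 331.63, 10.54),
    "Musa (Banana)": (550, 492.40, 10.47),
    "Equine (Horse)": (100, 94.07, 5.93),
    "Ovine (Sheep)": (450, 481.20, 6.93),
}


def percent_difference(selected, suggested):
    """Absolute difference as a percentage of the selected cut-off."""
    return abs(suggested - selected) / selected * 100.0


for species, (selected, suggested, reported) in rows.items():
    computed = percent_difference(selected, suggested)
    print(f"{species}: computed {computed:.3f}, reported {reported}")
```

All five recomputed values agree with the table to within rounding, and every selected cut-off falls inside the corresponding tolerance interval.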

#### 3.3. Examples of an SFP Screen

$F_{2}$ offspring derived from a cross between two contrasting parental genotypes. The offspring were bulked according to the trait “number of branches per plant”. Bambara groundnut (Vigna subterranea (L.) Verdc.) is an underutilised indigenous African crop species and an important food legume grown widely in sub-Saharan Africa, and has been shown to be highly inbreeding. At present, limited sequence resources exist, which means that the Xspecies approach is a valid option. The gDNA-based probe selection using heterologous oligonucleotide microarrays allows us to interrogate thousands of SFPs in parallel and, through the current design, should allow us to efficiently discover markers in a genomic region associated with a specific phenotype. As an illustration of this point, we selected the agronomic trait “number of branches per plant” in a cross between a wild accession with a spreading habit and a cultivated accession with a bunched habit [13,14]. Cross-hybridisation of bambara groundnut DNA from the two parental landrace genotypes VSSP11 (few stems per plant) and DipC (many stems per plant) was conducted using the Affymetrix Arabidopsis ATH1 GeneChip®. Meanwhile, two bulks from $F_{2}$ individuals (10 individuals each, representing the high and low stem number extremes from 96 individual $F_{2}$ plants) were hybridised separately onto the Arabidopsis ATH1 GeneChip® array. The experiment was therefore composed of four gDNA hybridisation chips, related to each other by the prerequisite defined in the methodology section. The probe-level raw data were then background-adjusted and quantile-normalized using the RMA method [18,19] so that these preprocessed intensity signals could be carried over into high-level analyses.

**Figure 3.** Filtering on Volcano Plots. The customised Volcano-plot tools depicting estimated fold-change (x-axis) and statistical significance (−log10 p-value, y-axis) were created by means of the POST inferential statistics for filtering on screening of the single oligonucleotides related to the trait of interest. Each point represents an oligonucleotide probe, and the black crosses correspond to large fold-changes with a p-value less than the significance level or the user-defined value under a number of filtering criteria. (**A**) Volcano Plot (VP). This is an example of applying the POST approach to test between groups of parents and $F_{2}$ hybrid bulks using the binary average fold-change ratio, the MA-value. (**B**) Twin Volcano Plot (TVP). This is an illustration of another version of POST: testing oligonucleotide probes within a parental group and within an offspring group, respectively, followed by plotting the two graphical summaries together in different colours. Light-gray spots are the output of the parental group and gray ones represent the group of $F_{2}$ hybrid bulks. The fold-change difference was defined by transforming the intensity difference D-value into its square root, and was used as a measure to identify the significant intensity differences in the plot.

$F_{2}$ offspring act as a cross-checking mechanism in our experimental design, the fold-change of the offspring (FCF$_{2}$) is used as one of the filtering parameters. Additionally, the optimal hybridisation threshold cut-off of the gDNA hybridisation intensity produced by ATM and the cut-off of the parental fold-change used in DFC can optionally be selected to increase the sensitivity of the graphical filters. In total, 7,903 differentially hybridised signals were summarised (BH adjusted p < 0.05, MA ≥ 0.75 or MA ≤ −0.75, FCF$_{2}$ ≥ 1.5) when the POST procedure was performed between the group of parents and the group of $F_{2}$ samples (Figure 3(A)). Features with lower levels of hybridisation are more likely to show a significant difference between parental genotypes by chance than features with high-level differences in hybridisation, although the latter could represent repetitive elements within the bambara groundnut genome. Due to the scale of the binary fold-change ratio, this phenomenon is quite common in microarray data analysis. The same preprocessed data set was tested using the other version of POST to examine intensities within groups, followed by filtering potential SFPs using the coloured TVP (Figure 3(B)). Interestingly, there were only 59 probe-pairs (BH adjusted p < 0.05, FCD ≥ 8 or FCD ≤ −8, FCF$_{2}$ ≥ 1.5) detected as statistically differentially hybridised using the probewise strategy. The sharp reduction from thousands to dozens shows that the D-value is highly selective against low intensity signals and that the design of TVP, disjointed testing on two groups with a process of filtering in relation to each other, was much more sensitive than the approach of VP based on the average fold-change ratio.
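The kind of filter applied on the volcano plots can be sketched as a Benjamini-Hochberg (BH) adjustment followed by thresholds on the adjusted p-value and on the effect size. The function names, the toy probes and the default thresholds are assumptions for illustration; the additional FCF$_{2}$ and intensity cut-offs used above are omitted.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running = 1.0
    for rank in range(n, 0, -1):      # walk from the largest p-value down
        i = order[rank - 1]
        running = min(running, pvalues[i] * n / rank)
        adjusted[i] = running
    return adjusted


def volcano_filter(probes, alpha=0.05, min_abs_effect=0.75):
    """Keep probes with adjusted p < alpha and |effect| >= min_abs_effect.

    probes is a list of (name, effect, p_value) tuples, where effect could
    be an MA-value or a fold-change difference.
    """
    adjusted = bh_adjust([p for _, _, p in probes])
    return [name for (name, effect, _), q in zip(probes, adjusted)
            if q < alpha and abs(effect) >= min_abs_effect]


# Made-up probes: (name, effect, raw p-value).
probes = [("a", 1.2, 0.001), ("b", 0.3, 0.02), ("c", -0.9, 0.04), ("d", 2.0, 0.5)]
selected = volcano_filter(probes)
print(selected)
```

Requiring both conditions corresponds to picking points in the upper left and upper right regions of a volcano plot.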

_{2} in our illustration since the stringent conditions used yielded very little in dual fold-change analysis and the hybridisation molecule in this case is genomic DNA, rather than expression values for RNA. As such, we might expect there to be a similar “dosage” of each gene in the individual genotypes, in the absence of wide-spread duplications. Four instances each were inspected using VP and TVP, whereas two cases were considered for DFC. Initial filtering parameters were fixed in the four instances of VP (BH adjusted p < 0.05, MA ≥ 0.75 or MA ≤ −0.75) and TVP (10% trimmed mean, BH adjusted p < 0.05, FCD ≥ 8 or FCD ≤ −8) and in the two instances of DFC (FCP ≥ 2, FCF_{2} ≥ 1.5). ATM with Fukuyama-Sugeno’s index produced the three-tuple suggestion (93.04, [81,106], [63,120]) of gDNA hybridisation intensity cut-offs for the cases VP3, VP4 and DFC2. Only the perfect match features of the ATH1 GeneChip^{®} were considered in these investigations. When filtering on VP and TVP using the initial conditions on the x and y axes without extra parameters, we found that VP1 identified more than ten thousand potential SFPs. This was about eight times the number found using TVP1 (13,694 versus 1,637 probe-pairs in Table 2). This large difference was similar to our findings in Figure 3. We also noticed that the number of differentially hybridised features declined significantly from VP1 to VP2 and dropped very dramatically from VP1 to VP3. These results reveal that the gDNA hybridisation intensity threshold is an essential parameter in the VP filter and that a large proportion of probe-pairs hybridised with low signal in the experiment. This is consistent with the phylogenetic distance between Vigna subterranea L. and Arabidopsis thaliana. When all conditions were applied in VP4 and TVP4, approximately equivalent numbers of potential SFPs were identified in the two cases, 10 and 8, respectively. A situation analogous to that between VP1 and VP3 was found in the investigation of DFC as well. While 3,360 differentially hybridised features were detected in DFC1, very surprisingly, just 5 probable SFPs were discovered in DFC2, the lowest number among the ten examined conditions. This implies that dual fold-change analysis would be the most stringent approach among the three methods. From the outcomes of VP4, TVP4 and DFC2, where few SFPs were identified, we can conclude that the Affymetrix ATH1 GeneChip might not be the best array for heterologous genomic DNA hybridisation with a view to interrogation of the bambara groundnut genome, due to the distant evolutionary relationship between Arabidopsis thaliana and bambara groundnut.
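The TVP criteria mentioned above combine a 10% trimmed mean of the intensity differences with a square-root transformation of the D-value. The sketch below shows one way such a quantity could be computed. It rests on two labelled assumptions that the text does not spell out here: that the D-value is the trimmed mean of a set of per-probe intensity differences, and that the sign of D is retained after the square root (which the two-sided cut-off FCD ≥ 8 or FCD ≤ −8 suggests). The function names and the toy values are hypothetical.

```python
import math

def trimmed_mean(values, gamma=0.1):
    """Mean after dropping floor(gamma * n) values from each tail."""
    ordered = sorted(values)
    k = int(math.floor(gamma * len(ordered)))
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)

def fold_change_difference(differences, gamma=0.1):
    """Signed square root of the trimmed-mean intensity difference.

    Assumption: the sign of D is carried through the transformation,
    so negative differences give negative FCD-values.
    """
    d = trimmed_mean(differences, gamma)
    return math.copysign(math.sqrt(abs(d)), d)

def passes_fcd(fcd, cutoff=8.0):
    """Two-sided FCD criterion: FCD >= cutoff or FCD <= -cutoff."""
    return abs(fcd) >= cutoff

# Toy differences for one hypothetical probe: the two extreme values
# are removed by the 10% trimming before the mean is taken.
toy_diffs = [64.0] * 8 + [1000.0, -1000.0]
fcd_up = fold_change_difference(toy_diffs)
fcd_down = fold_change_difference([-x for x in toy_diffs])
```

With a cut-off of 8 on the square-root scale, a probe needs a trimmed-mean intensity difference of at least 64 units in either direction to pass, which is consistent with the observation that the D-value is selective against low intensity signals.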

**Table 2.** Screening for differentially hybridised oligonucleotides by filtering on two types of volcano plots and dual fold-change analysis under a number of criteria.

Method | Filtering criteria | | | | Number of potentially differential hybridisations ^{d} | |
---|---|---|---|---|---|---|
VP | p-value ^{a} | MA-value | FCF_{2} | TH ^{b,c} | Probe-pairs | Probe-sets |
VP1 | <0.05 | ≥0.75 or ≤−0.75 | - | - | 13,694 | 10,492 |
VP2 | <0.05 | ≥0.75 or ≤−0.75 | ≥1.5 | - | 7,903 | 6,722 |
VP3 | <0.05 | ≥0.75 or ≤−0.75 | - | >93.04 | 125 | 124 |
VP4 | <0.05 | ≥0.75 or ≤−0.75 | ≥1.5 | >93.04 | 10 | 10 |
TVP ^{e} | p-value ^{a} | FCD-value | FCF_{2} | FCP | Probe-pairs | Probe-sets |
TVP1 | <0.05 | ≥8.0 or ≤−8.0 | - | - | 1,637 | 1,563 |
TVP2 | <0.05 | ≥8.0 or ≤−8.0 | ≥1.5 | - | 59 | 59 |
TVP3 | <0.05 | ≥8.0 or ≤−8.0 | - | >2 | 50 | 50 |
TVP4 | <0.05 | ≥8.0 or ≤−8.0 | ≥1.5 | >2 | 8 | 8 |
DFC | FCP | FCF_{2} | TH ^{b,c} | | Probe-pairs | Probe-sets |
DFC1 | ≥2 | ≥1.5 | - | | 3,360 | 3,132 |
DFC2 | ≥2 | ≥1.5 | >93.04 | | 5 | 5 |

FCF_{2}: the cut-off of F_{2} fold-change; TH: the genomic DNA hybridisation intensity threshold; MA-value: binary average fold-change ratio; FCD-value: fold-change difference as the square-root-transformation of the D-value.

^{a}Benjamini-Hochberg adjusted p-values were calculated for multiple testing correction;

^{b}The mask of multiple chips was applied, a technique in which each signal is taken as the minimal intensity across the four gDNA chips in the single trait experiment to create a pseudo array that is then analysed under the ATM framework;

^{c}Fukuyama-Sugeno’s index was used to generate ATM-suggested gDNA hybridisation intensity threshold;

^{d}SFPs were examined on the Perfect Match probe datasets in all cases;

^{e}10% trimmed mean, γ = 0.1, of intensity difference was used.
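The mask of multiple chips described in footnote b, together with the TH criterion in Table 2, can be sketched in a few lines of Python. This is a minimal illustration rather than the Pigeons code: it assumes the four gDNA chips are held as rows of a NumPy array with one column per probe, the function names are hypothetical, and the intensities are toy values. Only the threshold 93.04 is taken from Table 2, and the ATM procedure that suggests it is not reproduced here.

```python
import numpy as np

def pseudo_array(gdna_chips):
    """Element-wise minimum intensity across chips (rows) for each probe."""
    chips = np.asarray(gdna_chips, dtype=float)
    return chips.min(axis=0)

def passes_threshold(pseudo, threshold):
    """Boolean mask of probes whose pseudo-array signal exceeds TH."""
    return np.asarray(pseudo) > threshold

# Toy intensities: four hypothetical gDNA chips (rows) x three probes (columns).
toy_chips = np.array([
    [120.0, 95.0, 40.0],
    [110.0, 99.0, 300.0],
    [130.0, 90.0, 250.0],
    [125.0, 97.0, 280.0],
])

pseudo = pseudo_array(toy_chips)
mask = passes_threshold(pseudo, 93.04)
```

Taking the minimum is a conservative choice: a probe is retained only if it hybridises above the threshold on every chip, so a single poor hybridisation (the third probe in the toy data) is enough to mask it.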

_{2}. Of the two possibilities for SFPs, the latter seemed more likely. Although the identified oligoprobes exceeded the ATM’s suggested threshold and the cut-offs based on the two fold-change parameters, they did not have a particularly large intensity difference (data not shown), so they should probably not be selected as candidates. On the other hand, partition f has potentially large FCD-values with signal intensities slightly smaller than the gDNA hybridisation intensity threshold based on the ATM suggestion. Out of the 5 filtered entities, only one had very poor hybridisation (42 vs. 93.04), and this was discarded. The partition built by deducting the intersection of the four units from TVP4 is able to complement another potential constraint of DFC: the hard cut-off value of gDNA hybridisation intensity. In the area where TVP2 excludes VP4 and TVP4, there were 47 candidates, the largest number in the Euler diagram, detected as statistically significant variable probe-pairs (Figure 4(A)). However, we did not consider any of these as potential SFPs, because nearly all elements of this set have a much smaller parental fold-change than the given cut-off. Similarly, most discovered probes in the portion where VP4 excludes TVP2 and DFC2 have either small intensity differences or a small parental fold-change. In this analysis, there was one probe, 265228_s_at_195_89, belonging to this type of set, and we regarded it as a candidate because of its strong hybridisation and reasonable parental ratio of FC (1822/962). The Euler diagram was then updated to show the retained candidates in each unit (Figure 4(B)). Eventually, this informed selection enabled us to produce a final list of potential SFPs for further validation in vitro.

**Figure 4.** Euler Diagram Analysis. This example shows how potential SFPs can be selected by the POST and the DFC using Pigeons. The four-set diagram was established according to VP4 (abde), DFC2 (de), TVP2 (bcef) and TVP4 (ef) illustrated in Table 2, where lowercase letters stand for the portions of the four filtering methods. (**A**) SFP Candidates. Numbers in the partitions indicate the number of detected probe-pairs that can be recognised as potential SFPs; (**B**) Final Candidates. After careful selection and consideration portion by portion, potentially differentially hybridised oligonucleotides could be determined. They were e: 264674_at_473_177, 257321_at_566_65; b: 258467_at_680_81; f: 244964_at_665_15, 255530_at_691_371, 257050_at_8_423 and 266293_at_656_319; a: 265228_s_at_195_89; (**C**) Optimal strategy for potential SFP selection, where √: candidates; ×: elimination; ≈FCP: the parental fold-change value is just below cut-off; «FCP: the parental fold-change value is significantly below cut-off; small D: little intensity difference; ≈FCD: the fold-change difference value is slightly above cut-off; «TH: poor hybridisation; ≈TH: the signal intensity is a little lower than the value of the gDNA hybridisation intensity threshold; 2↑SFPs: two or more potential SFPs are found in the same probe-set.
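The partitioning behind the Euler diagram can be sketched with plain Python sets. The caption defines the four filters by the portions they cover: VP4 (abde), DFC2 (de), TVP2 (bcef) and TVP4 (ef). Each lowercase portion therefore corresponds to one combination of filter memberships. The sketch below is illustrative only: the function name `euler_partitions` and the probe identifiers `p1` to `p6` are hypothetical and are not probes from the study.

```python
# Portion label for each membership pattern, read off the caption:
# VP4 = a+b+d+e, DFC2 = d+e, TVP2 = b+c+e+f, TVP4 = e+f.
REGION_LABELS = {
    frozenset({"VP4"}): "a",
    frozenset({"VP4", "TVP2"}): "b",
    frozenset({"TVP2"}): "c",
    frozenset({"VP4", "DFC2"}): "d",
    frozenset({"VP4", "DFC2", "TVP2", "TVP4"}): "e",
    frozenset({"TVP2", "TVP4"}): "f",
}

def euler_partitions(filters):
    """Group probe IDs by the exact combination of filters selecting them."""
    partitions = {}
    all_probes = set().union(*filters.values())
    for probe in sorted(all_probes):
        membership = frozenset(
            name for name, members in filters.items() if probe in members
        )
        label = REGION_LABELS.get(membership, "unlabelled")
        partitions.setdefault(label, []).append(probe)
    return partitions

# Toy filter outputs with one hypothetical probe per portion.
toy_filters = {
    "VP4": {"p1", "p2", "p4", "p5"},
    "DFC2": {"p4", "p5"},
    "TVP2": {"p2", "p3", "p5", "p6"},
    "TVP4": {"p5", "p6"},
}

parts = euler_partitions(toy_filters)
```

Grouping by exact membership, rather than by pairwise intersections, gives disjoint portions, which is what allows each portion to be reviewed and accepted or rejected on its own, as in panel C.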

## 4. Conclusions

## Acknowledgments

## Conflicts of Interest

## References

1. Wang, J. Computational biology of genome expression and regulation—A review of microarray bioinformatics. J. Environ. Pathol. Toxicol. Oncol. **2008**, 27, 157–179.
2. Kumar, R.M. The widely used diagnostics “DNA microarrays”—A review. Am. J. Infect. Dis. **2009**, 5, 214–225.
3. Hammond, J.P.; Broadley, M.R.; Craigon, D.J.; Higgins, J.; Emmerson, Z.F.; Townsend, H.J.; White, P.J.; May, S.T. Using genomic DNA-based probe-selection to improve the sensitivity of high-density oligonucleotide arrays when applied to heterologous species. Plant Methods **2005**, 1, 10.
4. Hammond, J.P.; Bowen, H.C.; White, P.J.; Mills, V.; Pyke, K.A.; Baker, A.J.; Whiting, S.N.; May, S.T.; Broadley, M.R. A comparison of the Thlaspi caerulescens and Thlaspi arvense shoot transcriptomes. New Phytol. **2006**, 170, 239–260.
5. Graham, N.S.; Broadley, M.R.; Hammond, J.P.; White, P.J.; May, S.T. Optimising the analysis of transcript data using high density oligonucleotide arrays and genomic DNA-based probe selection. BMC Genomics **2007**, 8, 344.
6. Broadley, M.R.; White, P.J.; Hammond, J.P.; Graham, N.S.; Bowen, H.C.; Emmerson, Z.F.; Fray, R.G.; Iannetta, P.P.M.; McNicol, J.W.; May, S.T. Evidence of neutral transcriptome evolution in plants. New Phytol. **2008**, 180, 587–593.
7. Davey, M.W.; Graham, N.S.; Vanholme, B.; Swennen, R.; May, S.T.; Keulemans, J. Heterologous oligonucleotide microarrays for transcriptomics in a non-model species; A proof-of-concept study of drought stress in Musa. BMC Genomics **2009**, 10, 436.
8. Kreyszig, E. Advanced Engineering Mathematics, 10th ed.; John Wiley & Sons: Hoboken, NJ, USA, 2011; pp. 790–842.
9. Xu, R.; Wunsch, D., II. Survey of clustering algorithms. IEEE Trans. Neural Netw. **2005**, 16, 645–678.
10. Schena, M.; Shalon, D.; Heller, R.; Chai, A.; Brown, P.O.; Davis, R.W. Parallel human genome analysis: Microarray-based expression monitoring of 1,000 genes. Proc. Natl Acad. Sci. USA **1996**, 93, 10614–10619.
11. Cui, X.; Churchill, G.A. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol. **2003**, 4, 210.
12. Kooperberg, C.; Aragaki, A.; Strand, A.D.; Olson, J.M. Significance testing for small microarray experiments. Stat. Med. **2005**, 24, 2281–2298.
13. Mayes, S.; Stadler, S.; Basu, S.; Murchie, E.; Massawe, F.; Kilian, A.; Roberts, J.A.; Mohler, V.; Wenzel, G.; Beena, R.; et al. BAMLINK—A cross disciplinary programme to enhance the role of bambara groundnut (Vigna subterranea L. Verdc.) for food security in Africa and India. Acta Hortic. **2009**, 806, 137–150.
14. Basu, S.; Mayes, S.; Davey, M.; Roberts, J.A.; Azam-Ali, S.N.; Mithren, R.; Pasquet, R.S. Inheritance of “domestication” traits in bambara groundnut (Vigna subterranea L. Verdc.). Euphytica **2007**, 157, 59–68.
15. Bezdek, J. Pattern Recognition with Fuzzy Objective Function Algorithms, 1st ed.; Plenum Press: New York, NY, USA, 1981; pp. 95–154.
16. Jeffery, I.B.; Higgins, D.G.; Culhane, A.C. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinform. **2006**, 7, 359.
17. Dudoit, S.; Yang, Y.H.; Callow, M.J.; Speed, T.P. Statistical methods for identifying genes with differential expression in replicated cDNA microarray experiments. Stat. Sin. **2002**, 12, 111–139.
18. Irizarry, R.A.; Hobbs, B.; Collin, F.; Beazer-Barclay, Y.D.; Antonellis, K.J.; Scherf, U.; Speed, T.P. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics **2003**, 4, 249–264.
19. Bolstad, B.M.; Irizarry, R.A.; Astrand, M.; Speed, T.P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics **2003**, 19, 185–193.
20. Tukey, J.W.; McLaughlin, D.H. Less vulnerable confidence and significance procedures for location based on a single sample: Trimming/Winsorization 1. Sankhya A **1963**, 25, 331–352.
21. Patel, K.R.; Mudholkar, G.S.; Fernando, J.L.I. Student’s t approximations for three simple robust estimators. J. Am. Stat. Assoc. **1988**, 83, 1203–1210.
22. Graham, N.S.; Clutterbuck, A.L.; James, N.; Lea, R.G.; Mobasheri, A.; Broadley, M.R.; May, S.T. Equine transcriptome quantification using human GeneChip arrays can be improved using genomic DNA hybridisation and probe selection. Vet. J. **2010**, 186, 323–327.
23. Graham, N.S.; May, S.T.; Daniel, Z.C.T.R.; Emmerson, Z.F.; Brameld, J.M.; Parr, T. Use of the Affymetrix Human GeneChip array and genomic DNA hybridisation probe selection to study ovine transcriptomes. Animal **2011**, 5, 861–866.
24. Fukuyama, Y.; Sugeno, M. A New Method of Choosing the Number of Clusters for the Fuzzy C-Mean Method. Available online: http://citeseer.uark.edu:8080/citeseerx/showciting;jsessionid=1AF0955F44EC87078947AADEDE29D50C?cid=664813 (accessed on 10 December 2013).
25. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B **1995**, 57, 289–300.

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Lai, H.-M.; May, S.T.; Mayes, S.
Pigeons: A Novel GUI Software for Analysing and Parsing High Density Heterologous Oligonucleotide Microarray Probe Level Data. *Microarrays* **2014**, *3*, 1-23.
https://doi.org/10.3390/microarrays3010001
