2.1. ATM
As a mixed model of numerical analysis and soft computing, a heuristic method called Automated Threshold Mapping (ATM) was proposed to improve on the human-dependent selection of the cut-off for poorly hybridising oligonucleotide probes. One of the requirements for exploiting an Xspecies array is to select a threshold with which to generate a custom CDF file for further analysis. It is therefore necessary to understand the relationships between a particular threshold value, the probe-pairs retained at that threshold and the probe-sets retained at that threshold, i.e., three two-way comparisons and one three-way comparison. Because this problem involves one input (the threshold level) and two outputs (probe-pairs and probe-sets), an idea was drawn from vector calculus to assess the relationships among the three variables and to generalize a solution to this problem. Through the generation of a plane curve (Figure 1), we have found that the retained probe-sets and the retained probe-pairs decline as the threshold value increases and that the relationship between the two retained quantities is a monotonic function. This reflects the fact that a probe-set is removed only if it contains no retained probe-pairs, so that the number of retained probe-pairs declines more sharply than the number of retained probe-sets as the threshold value rises. We also find that the plane curve resembles a learning curve with a plateau. Thus, an appropriate threshold could be selected from the portion of the curve (circled in Figure 1) between the end of the plateau and the beginning of the linear-like drop. Because the exact position of this transition is inherently grey, we aim to provide a suggested threshold value, together with an interval of feasible thresholds available for selection, using projection, fuzzy clustering and interpolation techniques. From the observation of the plane curve, given a series of vectors each consisting of a threshold and its retained units, the vectors are first projected onto the retained probe-set space, where fuzzy clustering is performed. Since the targeted section of the curve is a limited bridge between the plateau and the linear-like drop, a good fuzzy clustering approach should resolve this bridge into a well-defined overlap between the first two clusters. Based on this, a suggested threshold value can then be produced by an interpolation technique.
Figure 1. Plane curve. A vector-valued function traced out by the retention units with respect to the cut-off of poorly hybridising oligonucleotides using the heterologous GeneChip® platform, with ATH1-121501 used as the basis to generate the image.
To define the methods and principles mathematically, a vector-valued function is first introduced to perform an in-depth analysis of the problem. Let X be a scalar variable and Y be a vector variable with two dimensions. A vector function $F: \mathbb{R} \to \mathbb{R}^2$ is defined as follows:

$$F(x) = \bigl(f_1(x), f_2(x)\bigr), \qquad x \in X,$$

where $X$ is the set of cut-offs and the component functions $f_1$ and $f_2$ are real-valued functions of the parameter $x$. The two components of Y, $Y_1$ and $Y_2$, are therefore viewed as the sets of retained probe-pairs and retained probe-sets, respectively, when a defined cut-off is given. Using the vector-valued retention function $F$, we can easily trace the graph of a curve to reveal the relationships among the cut-off and the retention units of probe-pairs and probe-sets. The point of the position vector $F(x)$ coincides with the point $(y_1, y_2)$ on the plane curve given by the component equations, as shown in Figure 1. The arrowhead on the curve represents the curve's orientation by pointing in the direction of increasing values of $x$, namely $x_3 > x_2 > x_1$. Due to the nature of the problem, the retention function $F$ decreases monotonically in the direction of the point (0, 0). This characteristic means that the mapping from $y_1$ to $y_2$ is also a monotone function and, moreover, that it behaves like a learning curve with a plateau.
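As an illustration of how such a retention curve might be traced in practice, the following minimal Python sketch counts, for each candidate cut-off, the retained probe-pairs and the probe-sets that keep at least one probe-pair; the probe-level data layout and values here are hypothetical and not part of the original pipeline.

```python
import numpy as np

# Hypothetical probe-level data: one intensity per probe-pair, grouped by probe-set ID.
rng = np.random.default_rng(0)
probe_set_ids = np.repeat(np.arange(1000), 11)           # 1000 probe-sets x 11 probe-pairs
intensities = rng.gamma(shape=2.0, scale=60.0, size=probe_set_ids.size)

def retention(cutoff, probe_set_ids, intensities):
    """Return (retained probe-pairs, retained probe-sets) at a given cut-off,
    i.e. the two components f1(x) and f2(x) of the retention function F."""
    kept = intensities > cutoff
    retained_pairs = int(kept.sum())
    retained_sets = int(np.unique(probe_set_ids[kept]).size)
    return retained_pairs, retained_sets

# Trace the plane curve over a grid of candidate cut-offs (the scalar parameter x).
cutoffs = np.arange(0, 500, 10)
curve = np.array([retention(x, probe_set_ids, intensities) for x in cutoffs])
# curve[:, 0] ~ y1 (probe-pairs), curve[:, 1] ~ y2 (probe-sets); both decline as x rises.
```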
A tangent-vector-based numerical analysis could be applied to evaluate and differentiate the function at a given point. For example, a turning point $F(x_{tp})$ can be defined as the intersection between the tangent to the stagnant (plateau) phase of the curve and the tangent to the linear-like decreasing portion of the curve. The inverse image of this point, $F^{-1}(F(x_{tp})) = x_{tp}$, could then be selected as the threshold value. However, the cut-off decision problem is not deterministic and usually needs to take biological sense into account, so more tolerance is required in the selection of the threshold. The ATM therefore offers a turning portion (TP) covering the turning point and derived from a closed interval $I$ from which realistic thresholds can be retrieved. Let $I$ be a neighbourhood of $x_{tp} \in X$ such that $F(x_{tp})$ is the turning point; the turning portion is then constructed as $TP = \{F(x) : x \in I\}$. Construction requires careful definition of a lower boundary $x_{lb}$ and an upper boundary $x_{ub}$ of $I$, with the aim of selecting a flexible region rather than a single turning point. Since $F$ is a one-to-one function, well defined on the interval $I$ and monotonically decreasing, we can in theory define $x_{lb}$ and $x_{ub}$ such that $F(x_{lb})$ lies in the terminating phase of the plateau and $F(x_{ub})$ lies in the earliest phase of the sharp decline, respectively.
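The tangent-intersection idea can be sketched numerically as below (an illustrative NumPy fragment; the index ranges marking the plateau and the linear-like drop are assumptions chosen by inspection of the curve, not values from the original method).

```python
import numpy as np

def turning_point(x, y, plateau_idx, drop_idx):
    """Estimate the turning point as the intersection of a least-squares line
    fitted to the plateau phase and one fitted to the linear-like drop.
    `plateau_idx` and `drop_idx` are index ranges chosen by inspection."""
    b1, a1 = np.polyfit(x[plateau_idx], y[plateau_idx], 1)   # slope, intercept of plateau tangent
    b2, a2 = np.polyfit(x[drop_idx], y[drop_idx], 1)         # slope, intercept of drop tangent
    x_tp = (a2 - a1) / (b1 - b2)                              # intersection of the two lines
    return x_tp

# Example with the hypothetical curve from the previous sketch (retained probe-sets, f2):
# x_tp = turning_point(cutoffs, curve[:, 1], slice(0, 10), slice(30, 45))
```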
The ATM is a data-driven mapping method that uses a two-stage unsupervised learning process to determine $I$. The first stage involves orthogonal projection in order to highlight the turning portion. To achieve this, we consider an inner-product vector space $V = \mathbb{R}^3$, let $W$ be an $r$-dimensional subspace of $V$ and let $W^{\perp}$ be the orthogonal complement of $W$. Given a matrix $B_{3 \times r}$ such that the column space of $B$ is $W$, then for every $v \in V$ there exists a projector $P$ that projects $v$ onto $W$ along $W^{\perp}$, i.e., $Pv = u$, $u \in W$. The unique linear operator $P$ can be acquired by $P = B(B^{T}B)^{-1}B^{T}$; in particular, if $B$ consists of orthonormal basis vectors, then $P = BB^{T}$. In simplifying the system, the goal at this stage is to minimize the loss of information relevant to the problem of concern. As a consequence, given $B$ (e.g., $[0, 0, 1]^{T}$) and $n$ vectors of thresholds with their retention units, after the linear transformation of each vector $v_i \in V$, $i = 1, \dots, n$, we obtain a learning data set $D = \{u_i : Pv_i = u_i \in W\}$ that ideally contains the most informative features for turning-portion discovery. Supposing that all the data vectors in TP have been projected onto a particular area, we define that area as a hotspot $D'$ such that

$$D' = \{\, u_j \in D : \inf(J) \le j \le \sup(J) \,\},$$

where $J$ is an index set used to collect and distinguish the elements of the hotspot, and $\inf(J)$ and $\sup(J)$ denote the infimum and supremum of $J$, respectively. Obviously, $J$ is a subset of $\{1, \dots, n\}$, and both $D'$ and $J$ are well-ordered closed sets.
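A brief sketch of this projection stage is shown below (NumPy). It builds the projector $P = B(B^{T}B)^{-1}B^{T}$ and applies it to vectors of the form (threshold, retained probe-pairs, retained probe-sets); the stacking of these vectors reuses the hypothetical `cutoffs` and `curve` arrays from the earlier sketch.

```python
import numpy as np

def projector(B):
    """Orthogonal projector onto the column space of B: P = B (B^T B)^{-1} B^T."""
    return B @ np.linalg.inv(B.T @ B) @ B.T

# Vectors v_i = (threshold, retained probe-pairs, retained probe-sets) in R^3,
# reusing the hypothetical arrays from the retention-curve sketch above.
V = np.column_stack([cutoffs, curve[:, 0], curve[:, 1]]).astype(float)

B = np.array([[0.0], [0.0], [1.0]])   # project onto the retained probe-set axis
P = projector(B)                      # B is orthonormal here, so P equals B @ B.T
U = V @ P.T                           # learning data set: u_i = P v_i
D = U[:, 2]                           # only the third coordinate carries information
```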
In other words, the second task is to identify the hotspot in order to discover a range of feasible cut-off thresholds. Grouping methods are appropriate for this task, as the object of clustering is to group a set of data vectors so that each group contains only vectors that are similar to each other. Although data points within a group are similar to a certain extent, some similarity is also expected between groups; this follows from the intrinsic design aim of offering a flexible choice of realistic thresholds. Some elements within the turning portion of the curve are closer to the end of the plateau, others are near the beginning of the linear-like decline, and still others lie around the turning point of the curve. Part of the problem in depicting the hotspot is to capture the "grayness" of these cross-cluster similarities, so it is essential to allow some degree of uncertainty in its description. The ATM applies Fuzzy c-Means (FCM) clustering to this issue, since the FCM allows us to build clusters with vague boundaries, where overlapping clusters include the same object to a certain degree [15]. Based on an objective function, or performance index, $\mathcal{J}$ (the weighted within-class sum of squares) that quantifies the quality of a clustering model, the FCM attempts to find the best allocation of data to clusters via a gradual membership matrix $M$. Given a number of clusters $c$ ($1 < c < n$), the learning data set $D$ is dominated by fuzzy sets $\mathcal{C}_1, \dots, \mathcal{C}_c$ and the fuzzy partition matrix $M = [m_{ij}]_{c \times n}$, where $m_{ij} \in [0, 1]$ and $\sum_{i=1}^{c} m_{ij} = 1$ for every $j$. For the individual entries in $M$, $m_{ij}$ is the membership degree of element $u_j \in D$ to cluster $i$, i.e., $m_{ij} = \mu_{\mathcal{C}_i}(u_j)$. Let $\Omega = \{\omega_1, \dots, \omega_c\}$ be a set of cluster prototypes, so that each cluster $\mathcal{C}_i$ is represented by a cluster centre vector $\omega_i$; the objective function with its two constraints can then be defined as below:

$$\mathcal{J}_q(M, \Omega) = \sum_{i=1}^{c} \sum_{j=1}^{n} m_{ij}^{\,q}\, d_{ij}^{2}, \qquad m_{ij} \in [0, 1], \qquad \sum_{i=1}^{c} m_{ij} = 1 \;\; \forall j.$$

Here, $q \in \mathbb{R}$, $q > 1$, is termed the "fuzzifier" or weighting exponent, and $d_{ij}$ is the distance between object $u_j$ and cluster centre $\omega_i$; within the ATM, the Euclidean inner-product norm, denoted by $\lVert \cdot \rVert$, is taken, i.e., $d_{ij} = \lVert u_j - \omega_i \rVert$. The purpose of the clustering algorithm is to obtain the solution $M$ and $\Omega$ minimizing the cost function $\mathcal{J}_q$, and this can be carried out by:

$$\omega_i = \frac{\sum_{j=1}^{n} m_{ij}^{\,q}\, u_j}{\sum_{j=1}^{n} m_{ij}^{\,q}}, \qquad m_{ij} = \left( \sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{\!\frac{2}{q-1}} \right)^{-1};$$

namely, the FCM proceeds with two events: the computation of the cluster centroids and the allocation of the data elements to these centroids. In practice, the cost function $\mathcal{J}_q$ is minimized by an alternating optimization (AO) scheme, i.e., the membership degrees are first optimized given currently fixed cluster parameters, followed by optimizing the cluster prototypes given currently fixed membership degrees. This iterative procedure is repeated until the cluster centres have reached equilibrium, which is mathematically equivalent to attaining the optimal objective function $\mathcal{J}_q^{*}$.
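The alternating optimization loop can be written compactly in NumPy, as in the generic sketch below (this is a standard FCM implementation, not the authors' code; the random initialisation, tolerance and cluster count are assumptions).

```python
import numpy as np

def fcm(data, c=3, q=2.0, tol=1e-6, max_iter=300, seed=0):
    """Minimal Fuzzy c-Means: alternately update the memberships m_ij and the
    centres w_i until the centres stop moving. `data` is an (n, d) array of
    projected vectors u_j."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    M = rng.random((c, n))
    M /= M.sum(axis=0, keepdims=True)          # enforce sum_i m_ij = 1 for every j
    centres = np.zeros((c, d))
    for _ in range(max_iter):
        Mq = M ** q
        new_centres = (Mq @ data) / Mq.sum(axis=1, keepdims=True)    # centroid update
        dist = np.linalg.norm(data[None, :, :] - new_centres[:, None, :], axis=2)
        dist = np.fmax(dist, np.finfo(float).eps)                    # avoid division by zero
        inv = dist ** (-2.0 / (q - 1.0))
        M = inv / inv.sum(axis=0, keepdims=True)                     # membership update
        if np.linalg.norm(new_centres - centres) < tol:
            centres = new_centres
            break
        centres = new_centres
    return M, centres

# Example on the projected learning data set D from the projection sketch:
# M, centres = fcm(D.reshape(-1, 1), c=3)
```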
After the grouping scheme is accomplished, the hotspot $D'$ can be deciphered by defining the greatest lower bound and the least upper bound of its index set. Attention is concentrated on the first two clusters ($\mathcal{C}_1$, $\mathcal{C}_2$) for this purpose, since most elements of $\mathcal{C}_1$ are very likely to have been projected from vectors in the stagnant phase, while data points near the beginning of the sharp drop have mostly fallen into $\mathcal{C}_2$. Thus, we let $D'$ be a subset of the union of the two clusters and set the infimum and the supremum of $J$ according to the objects whose membership values are the maxima in $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively, i.e.,

$$\inf(J) = \arg\max_{j} m_{1j}, \qquad \sup(J) = \arg\max_{j} m_{2j}.$$
Not only do the above equations define the index set $J$, they also establish the tolerance interval $I$. Besides the selection of feasible cut-offs, the ATM also provides an automated threshold value $x_{ATM}$ and a target interval $I'$ for the selection of candidate cut-offs. Both $x_{ATM}$ and $I'$ are evaluated through the fuzzy boundary between the first two fuzzy sets. The elements in the boundary are those that $\mathcal{C}_1$ and $\mathcal{C}_2$ hold in common, with varying membership values. Owing to the grayness characteristic and the continuity of the learning-like curve, we believe that a good threshold value for parsing the Affymetrix chip description files should come from a projected object that belongs simultaneously to the two clusters with appreciable membership degrees. As a result, the fuzzy boundary enables us to offer a more reasonable selection of threshold boundary cut-offs. Two indices, $l$ and $k$, are utilised to determine the highly likely threshold boundary cut-offs and the automated threshold value, determined by

$$l = \min\{\, j \in J : \min(m_{1j}, m_{2j}) > \epsilon \,\}, \qquad k = \max\{\, j \in J : \min(m_{1j}, m_{2j}) > \epsilon \,\}.$$
Here, $\epsilon$ is a small number used to assess the possibility of overlap between the two clusters. By this definition, the fuzzy boundary is portrayed as the set $\{u_l, u_{l+1}, \dots, u_k\}$, and another closed interval, $[x_l, x_k]$, is constructed as the target interval $I'$. Let $\bar{u}$ be the arithmetic mean of the elements of the fuzzy boundary; $x_{ATM}$ can then be calculated by linear interpolation, or by the Lagrange polynomial, as shown in the following formulae:

$$x_{ATM} = x_l + \frac{(\bar{u} - u_l)(x_k - x_l)}{u_k - u_l}, \qquad x_{ATM} = \sum_{j=l}^{k} x_j \prod_{\substack{m=l \\ m \ne j}}^{k} \frac{\bar{u} - u_m}{u_j - u_m}.$$
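Putting the pieces together, a sketch of how the tolerance interval, the target interval and the suggested threshold might be recovered from the clustering output is given below (NumPy; it assumes the plateau-side and drop-side clusters are rows 0 and 1 of the membership matrix, uses the linear-interpolation form of $x_{ATM}$, and the epsilon value is illustrative).

```python
import numpy as np

def atm_threshold(x, u, M, eps=0.05):
    """Given cut-offs x, projected values u and a fuzzy membership matrix M
    (rows 0 and 1 = plateau-side and drop-side clusters), return
    (x_atm, (x_l, x_k), (x_inf, x_sup)) following the ATM scheme sketched above."""
    m1, m2 = M[0], M[1]
    inf_j, sup_j = int(np.argmax(m1)), int(np.argmax(m2))      # tolerance interval I
    boundary = np.where(np.minimum(m1, m2) > eps)[0]            # fuzzy boundary indices
    l, k = int(boundary.min()), int(boundary.max())             # target interval I'
    u_bar = u[l:k + 1].mean()                                   # mean of the boundary elements
    # linear interpolation of the cut-off corresponding to u_bar
    x_atm = x[l] + (u_bar - u[l]) * (x[k] - x[l]) / (u[k] - u[l])
    return x_atm, (x[l], x[k]), (x[inf_j], x[sup_j])

# Example with the earlier sketches:
# x_atm, target, tolerance = atm_threshold(cutoffs, D, M)
```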
In summary, the ATM returns a 3-tuple $(x_{ATM}, I', I)$ to resolve the issue of the threshold cut-off choice. The suggested cut-off given by the ATM, $x_{ATM}$, can be exploited directly to remove the weak intensity signals, while any value within the target interval $I' = [x_l, x_k]$ can be taken as a potential threshold boundary cut-off. The design of the target interval gives users the chance to pick a scientifically reasonable value on their own. Values in the tolerance interval, i.e., $x \in I = [x_{\inf(J)}, x_{\sup(J)}]$, can be used as feasible thresholds, and values outside this interval are viewed as less feasible choices.
2.2. DFC
Dual fold-change analysis (DFC) is an approach for seeking potential single-feature polymorphism markers by screening all of the 25-mer oligonucleotide probes of the heterologous microarray. Initially, there are two groups ($G_1$ and $G_2$) under the design of the single-trait experiment. While two distinct parental genotype gDNAs are involved in generating $G_1$, $G_2$ is composed of two different phenotypically based F$_2$ bulk segregant pools, derived from a hybrid between the two parental genotypes. We then label the four Xspecies chips 1 and 2 for the two parent samples and 3 and 4 for the two F$_2$ bulks. In practice, these F$_2$ bulks are constructed from the pooled DNA of F$_2$ individuals, derived from the controlled cross between the parental genotypes, with allocation to the contrasting bulks based upon a specific trait of interest. The phenotype classification is a necessary prerequisite for the numerical analysis of potential SFP markers: chips 1 and 3 are classified into one trait type under the single-trait experiment, whereas chips 2 and 4 belong to the other, a prerequisite that can be denoted as $\{1, 3\}$ versus $\{2, 4\}$. Let $N$ be the number of genes and $\#(B_i)$ be the cardinal number of a probe-set $B_i$; each chip can then be represented as follows:

$$\text{chip}_m = \{\, b_{ijm} : i = 1, \dots, N;\; j = 1, \dots, \#(B_i) \,\}, \qquad m = 1, \dots, 4,$$
where $b_{ijm}$ denotes the $j$-th signal intensity of the $i$-th probe-set on the $m$-th chip. Let $Q_{ij1} = b_{ij1}/b_{ij2}$ and $Q_{ij2} = b_{ij3}/b_{ij4}$ be the intensity ratios of $G_1$ and $G_2$, respectively; a ratio value of one for a feature thus represents an unchanged hybridisation signal in this experiment, and a value less than or greater than one indicates differentially hybridised oligonucleotides. To generate a symmetric distribution of intensity ratios, the fold-change ratio is defined by

$$FC_{ij1} = \begin{cases} Q_{ij1}, & Q_{ij1} \ge 1, \\ -1/Q_{ij1}, & Q_{ij1} < 1, \end{cases}$$

where $FC_{ij1}$ is used to assess the differential probe hybridisation of the parental group. For the evaluation of the offspring group, $FC_{ij2}$ is calculated in the same way as $FC_{ij1}$, simply replacing $Q_{ij1}$ with $Q_{ij2}$. Given the threshold of weak signals $x_{ATM}$, the cut-off for a fold-change between the parents, $\epsilon_1$, and that between the offspring, $\epsilon_2$, a number of logical criteria are applied to globally screen and search Affymetrix's single oligoprobes for SFP markers. For each probe-pair $j$ of probe-set $B_i$ on chip $m$, let the first condition be $b_{ijm} > x_{ATM}$, since any signal whose intensity is below the threshold should not be treated as a good probe in the analysis of heterologous data; this satisfies the demands of the XSpecies technology. When the first criterion holds, the DFC procedure applies the second condition using the two fold-change indicators $FC_{ij1}$ and $FC_{ij2}$, namely $FC_{ij1} \ge \epsilon_1$ and $FC_{ij2} \ge \epsilon_2$, to measure whether the trait classification $\{1, 3\}$ versus $\{2, 4\}$ still holds at the genomic level. The FC approach is commonly used in microarray data analysis to identify differentially expressed genes (DEGs) between a treatment and a control. Calculated as the ratio of two conditions/samples, the FC gives the absolute ratio of normalized intensities on a non-log scale. We extend the same concept in our approach by introducing an additional FC: one ratio assesses the differential hybridisation within $G_1$ and the other assesses the differential hybridisation within $G_2$. The extra FC tests whether the difference in phenotype could result from a difference in genotype at a single locus. Therefore, when there are differentially hybridised oligonucleotides for the feature of interest between the two parental genotypes, the inherited nature of the trait classification implies that we can expect those differentially hybridised oligonucleotides to have also been transmitted into the F$_2$ individuals. In short, the corresponding fold-change of the F$_2$ is introduced as a cross-check mechanism for identifying SFPs that are consistent between parental genotype/trait and bulk genotype/trait. The mixture of F$_2$ genotypes (which are bulked according to the trait difference segregating within the cross) should mean that the attribute difference is detected only when the parental SFP is located close to the gene controlling the trait difference. The accuracy of this approach is dependent upon the bulk size used: smaller bulk sizes will lead to the identification of SFPs located distantly from (and probably on different chromosomes to) the target trait-associated SFPs. Oligo-probes that satisfy the second criterion above are potential SFP markers distinguishing the two phenotypes and could be further tested and used for genetic mapping of the gene controlling the phenotypic difference.
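The DFC screening can be sketched as follows (NumPy; the four-column intensity layout per probe-set, the interpretation of the first criterion as holding on all four chips, and the default cut-off values are assumptions consistent with the description above).

```python
import numpy as np

def sym_fc(q):
    """Symmetric fold-change ratio: q when q >= 1, otherwise -1/q."""
    q = np.asarray(q, dtype=float)
    return np.where(q >= 1.0, q, -1.0 / q)

def dfc_screen(b, x_atm, eps1=2.0, eps2=2.0):
    """b: (n_probe_pairs, 4) intensities for one probe-set; columns are chips 1-4
    (parent 1, parent 2, F2 bulk 1, F2 bulk 2). Returns a boolean mask of
    probe-pairs flagged as potential SFP markers by the dual fold-change criteria."""
    above = (b > x_atm).all(axis=1)        # criterion 1: b_ijm > x_ATM on every chip
    fc1 = sym_fc(b[:, 0] / b[:, 1])        # parental fold-change FC_ij1
    fc2 = sym_fc(b[:, 2] / b[:, 3])        # offspring fold-change FC_ij2
    # criterion 2 as stated in the text; use np.abs(fc1) / np.abs(fc2)
    # instead if markers in both directions are of interest.
    return above & (fc1 >= eps1) & (fc2 >= eps2)
```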
2.3. POST
The FC is typically viewed as significant if there is at least a two-fold difference [10]. In addition, the FC threshold is selected arbitrarily and does not involve any assessment of statistical confidence, so using the FC approach alone may not be optimal [11,16]. Although the dual fold-change criterion is a straightforward and intuitive way to detect oligonucleotides, it does not involve any evaluation of the significance of differential hybridisation in the presence of biological and experimental variation, which might differ from probe to probe. We have therefore developed inferential statistics here through a method called the probewise one-sample statistical test (POST) for assessing the observed differential oligoprobe variation in terms of statistical power and measures of confidence. We first define an MA-value $\rho_{ij}$ for the examination of a signal variant in the single-trait experiment; for each probe-pair $j$ of probe-set $B_i$, the value is calculated by the following formula:

$$\rho_{ij} = \frac{1}{2}\left( \log_2\frac{b_{ij1}}{b_{ij3}} + \log_2\frac{b_{ij2}}{b_{ij4}} \right),$$

so as to correspond exactly with the experimental attribute $\{1, 3\}$ versus $\{2, 4\}$. The MA-value is named after the MA plot, a very useful tool in cDNA and GeneChip® microarray data analysis [17,18,19], and is the average intensity ratio between the parental samples and the F$_2$ bulks on a base-2 logarithmic scale, with a mnemonic for subtraction (the log ratio) and a mnemonic for addition (the averaging). The POST then uses the MA-value and a single-sample $t$-test to statistically assess differentially hybridised oligonucleotides between the parent group and the offspring group and to test, within a probe-set $i$, whether or not there is a significant difference between an interrogated probe $k$ and the other probes in that probe-set in terms of their log ratios. As a test statistic, the average of the MA-values of all probe-pairs except probe $k$ is denoted by $\bar{\rho}_{ik}$ and determined by:

$$\bar{\rho}_{ik} = \frac{1}{n_i} \sum_{\substack{j = 1 \\ j \ne k}}^{\#(B_i)} \rho_{ij},$$
where $n_i = \#(B_i) - 1$ is the sample size in the examined probe-set $i$. Suppose that the sampling distribution of $\bar{\rho}_{ik}$ is normal, so that the random variable

$$T_{ik} = \frac{\bar{\rho}_{ik} - \rho_{ik}}{S_{ik}/\sqrt{n_i}}$$

has a Student's $t$-distribution with $n_i - 1$ degrees of freedom, where $S_{ik}$ is the standard deviation of the sample of log ratios in the $i$-th probe-set excluding the MA-value of oligoprobe $k$. The last step performed by the POST is to compute, asymptotically, the $p$-value, converting the value of $T_{ik}$ into a probability that expresses how likely the oligonucleotide in question is to be differentially hybridised. To visualize the results of this probewise testing of single oligonucleotides, a filter with a volcano plot output was also developed. The volcano plot is an effective and easy-to-interpret scatter plot for the selection of DEGs [11]. In the POST, the plot shows the negative common logarithm (base 10) of the $p$-value versus the average intensity ratio in the form of the binary logarithm (base 2), i.e., the average fold-change ratio. Probe-pairs with large log ratios and low $p$-values are easily detectable in this view, and a list of potential SFP markers can be generated.
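A compact sketch of this probe-wise one-sample test is given below (NumPy/SciPy). It follows the construction above, treating the interrogated probe's MA-value as the hypothesised mean of the remaining MA-values; the four-column intensity layout is the same assumption as in the DFC sketch.

```python
import numpy as np
from scipy import stats

def post_ma(b):
    """MA-values rho_ij for one probe-set: average log2 ratio of the parental
    samples (chips 1, 2) over the F2 bulks (chips 3, 4), paired by trait type."""
    return 0.5 * (np.log2(b[:, 0] / b[:, 2]) + np.log2(b[:, 1] / b[:, 3]))

def post_ttest(b):
    """Probe-wise one-sample t-test: for each probe k, compare its MA-value with
    the mean of the remaining MA-values in the same probe-set.
    Returns arrays of t statistics and two-sided p-values."""
    rho = post_ma(b)
    n_pairs = rho.size
    t_stats, p_vals = np.empty(n_pairs), np.empty(n_pairs)
    for k in range(n_pairs):
        others = np.delete(rho, k)          # the n_i = #(B_i) - 1 remaining MA-values
        n_i = others.size
        s_ik = others.std(ddof=1)           # sample standard deviation S_ik
        t = (others.mean() - rho[k]) / (s_ik / np.sqrt(n_i))
        t_stats[k] = t
        p_vals[k] = 2.0 * stats.t.sf(np.abs(t), df=n_i - 1)
    return t_stats, p_vals

# Volcano-plot coordinates: x = rho (average log2 ratio), y = -log10(p_vals).
```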
Another approach for statistical inference, using a different measure based on intensity difference, has also been implemented in the POST to identify and evaluate significantly variable oligonucleotides within an experimental group. Basically, the approach is analogous to testing between two groups, but it focuses on variation within a single group. Since a potential SFP marker could arise from oligonucleotide target regions within the test genome with deletions, duplications or nucleotide differences with respect to the design probe-pairs, we propose using the intensity difference rather than the traditional intensity ratio to determine significant differences in intensity between the signals of array elements within either the parent group $G_1$ or the offspring group $G_2$. We name this intensity difference the D-value, in contrast to the MA-value, and define it in compliance with the trait of interest as below:

$$\delta_{ij1} = b_{ij1} - b_{ij2}, \qquad \delta_{ij2} = b_{ij3} - b_{ij4}.$$
Similar to statistical tests between groups, the sample mean of the D-values would be the statistic used to test whether the intensity difference of the oligoprobe under interrogation is significantly different from that of the other signals in the same probe-set of $G_1$ or $G_2$; meanwhile, an ad hoc test procedure within $G_1$ or $G_2$ also assumes that the population distribution is at least approximately normal and proceeds with the probe-wise strategy. However, there are practical issues that need to be addressed. The majority of intensity signals are likely to be affected by poor hybridisation of the target genome to the heterologous oligonucleotide microarray, leading to the presence of only a few, or even a single, possible SFP within a probe-set. The exact number per probe-set will depend upon the evolutionary distance between the target species and the design array, the rate of evolution of the individual gene represented by the probe-set, and the array design itself. Thus, the sample mean is in general a good estimator of the central value of the data distribution of $\delta_{ij}$ when statistical testing is performed according to the probe-wise strategy. But for those probe-sets which have two or more possible SFPs, the mean is no longer an appropriate measure of location under the probe-wise procedure, since it is susceptible to extreme values. Accordingly, the γ-trimmed mean ($0 < \gamma < 0.5$) is employed instead of the mean as the statistic in this version of the POST. More mathematically, let $\Delta_{ik} = \{\delta_{ij} : j = 1, \dots, k - 1, k + 1, \dots, \#(B_i)\}$ and let $\delta_{i(1)} \le \delta_{i(2)} \le \dots \le \delta_{i(n_i)}$ be the observations of $\Delta_{ik}$ written in ascending order. We define the sample γ-trimmed mean $\bar{\delta}_{ik}$ to account for probe-specific fluctuations in a probe-set $i$, and its value is calculated by

$$\bar{\delta}_{ik} = \frac{1}{n_i - 2h} \sum_{j = h + 1}^{n_i - h} \delta_{i(j)},$$
where $h = \lfloor \gamma n_i \rfloor$ is the value of $\gamma n_i$ rounded down to the nearest integer. Then, let $s_{ik}^2$ be the sample γ-Winsorized variance of the data in $\Delta_{ik}$; considering the finite-sample Student-$t$ statistic analogue, the γ-trimmed mean can be studentized by $s_{ik}$ in the form

$$t_{ik} = \frac{(1 - 2\gamma)\,\sqrt{n_i}\,\bigl(\bar{\delta}_{ik} - \delta_{ik}\bigr)}{s_{ik}}.$$
Tukey and McLaughlin [20] suggested a reasonably accurate approximation of the distribution of $t_{ik}$ using a Student's $t$-distribution with $n_i - 2h - 1$ degrees of freedom. Patel et al. [21] further introduced a scaled Student-$t$ variate $a(n_i, h)\, t_{ik}$ and proposed approximating the distribution of $a(n_i, h)\, t_{ik}$ with a Student's $t$-distribution having $v(n_i, h)$ degrees of freedom, where $a(n_i, h) = 1 + 16 h^{0.5} e^{2h - n_i}$ for small-sample ($n_i < 18$) $t$-type statistics and $v(n_i, h)$ varies slightly with γ in their investigation. Given γ = 0.05, 0.10, 0.15, 0.20 or 0.25, we apply the Tukey-McLaughlin suggestion and Patel's refined approximation to each $t_{ik}$ for the calculation of the $p$-value, and the asymptotic $p$-value, together with the intensity difference, can therefore be prepared for the volcano plot filter and output. To better reveal the detection of large-magnitude changes in the output, the POST uses a square-root transformation of the D-value into the fold-change difference $FCD_{ij}$, defined as follows:

$$FCD_{ij} = \operatorname{sign}(\delta_{ij})\,\sqrt{\lvert \delta_{ij} \rvert},$$
which produces a symmetric distribution of intensity differences under the assumption that most oligonucleotides are not differentially hybridised, so that the modified volcano plot using fold-change differences can still display changes in both directions, equidistant from the centre. Owing to the experimental design, the POST performs the inferential statistics on individual oligonucleotides within the parent group and within the offspring group respectively, colouring the plotted points according to the group to which they belong. The colour scheme can be employed as a third dimension of information, for ease of filtering and the setting of parameters. By constructing the coloured volcano plots of $G_1$ and $G_2$, one can quickly identify the most meaningful changes in hybridisation signal strength focused on the feature of interest.
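Finally, the within-group trimmed-mean test could be implemented along the lines sketched below (NumPy/SciPy; the Winsorized variance and the Tukey-McLaughlin degrees of freedom follow the description above, while Patel's refined degrees of freedom $v(n_i, h)$ are omitted because their exact form is not given here).

```python
import numpy as np
from scipy import stats

def trimmed_post(delta, gamma=0.1):
    """Probe-wise gamma-trimmed one-sample test within one group.
    delta: D-values (b_ij1 - b_ij2 or b_ij3 - b_ij4) for one probe-set.
    Returns t statistics and two-sided p-values (Tukey-McLaughlin approximation)."""
    n_pairs = delta.size
    t_stats, p_vals = np.empty(n_pairs), np.empty(n_pairs)
    for k in range(n_pairs):
        others = np.sort(np.delete(delta, k))          # Delta_ik in ascending order
        n_i = others.size
        h = int(np.floor(gamma * n_i))
        trimmed_mean = others[h:n_i - h].mean()        # gamma-trimmed mean
        # gamma-Winsorized sample: replace the h smallest/largest by the nearest kept values
        wins = others.copy()
        wins[:h] = others[h]
        wins[n_i - h:] = others[n_i - h - 1]
        s_w = wins.std(ddof=1)                         # gamma-Winsorized standard deviation
        t = (1.0 - 2.0 * gamma) * np.sqrt(n_i) * (trimmed_mean - delta[k]) / s_w
        df = n_i - 2 * h - 1                           # Tukey-McLaughlin degrees of freedom
        t_stats[k] = t
        p_vals[k] = 2.0 * stats.t.sf(np.abs(t), df=df)
    return t_stats, p_vals

def fcd(delta):
    """Signed square-root fold-change difference for the volcano plot axis."""
    return np.sign(delta) * np.sqrt(np.abs(delta))
```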