Identiﬁcation of Differentially Expressed Genes between Original Breast Cancer and Xenograft Using Machine Learning Algorithms

: Breast cancer is one of the most common malignancies in women. Patient-derived tumor xenograft (PDX) model is a cutting-edge approach for drug research on breast cancer. However, PDX still exhibits differences from original human tumors, thereby challenging the molecular understanding of tumorigenesis. In particular, gene expression changes after tissues are transplanted from human to mouse model. In this study, we propose a novel computational method by incorporating several machine learning algorithms, including Monte Carlo feature selection (MCFS), random forest (RF), and rough set-based rule learning, to identify genes with signiﬁcant expression differences between PDX and original human tumors. First, 831 breast tumors, including 657 PDX and 174 human tumors, were collected. Based on MCFS and RF, 32 genes were then identiﬁed to be informative for the prediction of PDX and human tumors and can be used to construct a prediction model. The prediction model exhibits a Matthews coefﬁcient correlation value of 0.777. Seven interpretable interactions within the informative gene were detected based on the rough set-based rule learning. Furthermore, the seven interpretable interactions can be well supported by previous experimental studies. Our study not only presents a method for identifying informative genes with differential expression but also provides insights into the mechanism through which gene expression changes after being transplanted from human tumor into mouse model. This work would be helpful for research and drug development for breast cancer.


Introduction
According to epidemiological data from the World Health Organization (WHO), breast cancer has been confirmed to be the most frequently diagnosed cancer in women from developed and developing countries and is accompanied with rising expectancy of average life span [1,2]. Early in 2011, more than 508,000 women died from breast cancer, accounting for more than 58% in developing countries [1]. However, the incidence of breast cancer is higher in Western Europe (89.7/100,000) is higher than those in Eastern Africa (19.3/100,000) and most developing countries (40/100,000); this finding reflects Genes 2018, 9, 155; doi:10.3390/genes9030155 www.mdpi.com/journal/genes the specific regional distribution of breast cancer incidence [3] and suggests the need for development of precise and effective methods for diagnosis and treatments.
To identify a cure for breast cancer, researchers have applied five levels of research, namely, molecular, cellular, histopathological, animal model, and clinical levels [4]. Among these levels, the animal model level not only confirms the conclusion obtained from studies on the other levels as an integrated creature but also lays the foundation for further clinical studies [5,6]. In studies on this level, mice are a typical and widely used species because of their high degree of genetic similarity to human beings and simple cultivation requirements. However, murine tumors considerably differ from human tumors at both genetic and molecular levels [7]. Therefore, a murine tumor model cannot thoroughly reflect the characteristics of human tumors. To solve this limitation, scholars have developed patient-derived tumor xenograft (PDX) model with various modified subtypes. In general, PDX mouse model is produced by directly transplanting cancerous tissues from a unique patient into an immune-deficient mouse model; this model can be used for studies on the biological, pharmacological, or clinical characteristics of such tumor tissues [8,9]. In comparison with conventional xenograft models using cell lines, the PDX model retains the molecular signatures and cancerous heterogeneity, which includes the cancer stem cells of the original tumor [10].
Although the PDX mouse model maintains most of the characteristics of the original tumor, genomic and transcriptional variations induced by murine microenvironment selection have been characterized. In 2014, researchers from University of North Carolina (USA) found that in pancreatic and colorectal cancer, the mutation frequencies of two oncogenic genes, namely, KRAS and PIK3CA, were greatly altered by the murine microenvironment across passages; hence, at the genomic level, the PDX mouse model cannot completely maintain the characteristics of in situ tumor tissues [11]. At the transcriptomic level, a study on PDX tumor expression patterns of 58 patients with eight different tumors reported that 48 genes were differentially expressed between the PDX mouse model and the tumor tissues of the patients [12].
In this study, we obtained and used the gene expression data from a recent study [13] on PDX mouse models and original breast cancer to identify differentially expressed genes, which are called informative genes. We collected a total of 657 PDX and 174 human tumors. By using Monte Carlo feature selection (MCFS), we evaluated the relevance of the gene expression in distinguishing the original breast cancer and the PDX models. Subsequently, we compared several algorithms for building classifiers including random forest (RF), rough set-based rule learning, support vector machine (SVM) and dagging. We found that building classifier using RF algorithm with the 32 identified informative genes could achieve the best performance, obtaining the highest Matthews coefficient correlation (MCC) value of 0.777. This result demonstrated the predictive power of the 32 identified informative genes in distinguishing PDX tumors from original human tumors. Furthermore, we applied rough set-based rule learning method to detect human-interpretable rules to reveal the interactions between the 32 informative genes. These genes rules, and interactions were found to be supported by previous studies through a literature review.

Materials and Methods
Our entire data set contained 831 tumor cells, including 657 PDX and 174 tumor cells. The 831 tumor cells were encoded into feature vectors by the expression levels of 69 genes. The 69 genes we used were derived from the work by Lawson et al. [13]. To investigate metastatic breast cancer, Lawson et al. did a literature survey and collected a small set of genes that were highly possible as the key genes in breast cancer and clearly involved in breast cancer related functions, such as stemness, pluripotency, epithelial-to-mesenchymal transition (EMT), mammary lineage specification, dormancy, cell cycle and proliferation. Based on the curated gene set, they designed a special single-cell array called Biomark 96.96 dynamic array [13] which can measure the expression level of 116 genes. In our study, since some genes were not expressed in many tumor cells and did not play important roles as indicated by the literature, we only analyzed 69 genes by filtering those with missing expression in more than half of tumor cells.
Our method for identifying differentially expressed genes consisted of three steps: (i) applying MCFS method [14][15][16] to rank features; (ii) using incremental forward selection (IFS) and RF [17,18] methods to construct prediction model; and (iii) employing rough set-based rule learning [19] to detect interactions between differentially expressed genes.

Data Preparation and Feature Construction
The expression levels of 116 genes in 1293 tumor cells from three patients with breast cancer and the corresponding PDX models were obtained from a recent publication [13]. These genes are involved in stemness, pluripotency, epithelial-to-mesenchymal transition, mammary lineage specification, dormancy, cell cycle, and proliferation [13]; these processes suggest their crucial roles in tumor development. For the missing data in the original dataset, we filtered out those genes with missing data in more than half of tumor cells and those tumor cells with missing data in more than half of 116 genes. The rest of the missing data retained in the dataset were imputed using the nearest neighbor averaging method [20,21]; this filter process resulted to 69 genes and 831 tumor cells, in which 657 PDX mouse tumor cells and 174 original human tumor cells were found. The expression levels of the 831 tumor cells were used as features to encode each cell in this study.

Monte Carlo Feature Selection Method
MCFS method is designed to select informative features for supervised classifiers by randomly constructing a large number of tree classifiers from an original training dataset. MCFS assigns higher relative importance (RI) to a feature if this feature is selected by more tree classifiers. Suppose the total of s×t classification trees are constructed by selecting s subsets of m features (m d, where d is the total number of features), and constructing t trees for each of the s subsets. Each of the t trees is built and trained by a random selection of training and test sets using the original training set. The RI of a feature can be measured by estimating the overall number of splits involving the feature in all nodes of all trees constructed. Particularly, the RI of a feature g can be estimated by the following equation where wAcc is the weighted accuracy over all samples, IG(n g (τ)) is the information gain of the node n g (τ), no. in n g (τ)) is the number of samples in node n g (τ), (no. in τ) is the number of samples in tree τ, and u and v are fixed real numbers.
In this study, we performed MCFS ranking by using the MCFS software package [14]. MCFS program can output a list in which features were ranked according to their RI values. The ranking list can be formulated as where N is the total number of features (N = 69 in this study).

Incremental Forward Selection Method and Random Forest Classification
We used MCFS method to evaluate the importance of the 69 features, resulting in a feature list F. However, determining the optimal number of features to be used is difficult. Therefore, IFS [22][23][24] and RF methods [25,26] were employed to tell how many features should be used.
IFS method first constructed a series of candidate feature sets, denoted as F 1 , F 2 , . . . , F N , where . . , f i } by incrementally adding features to precedent feature set according to their orders in the ranking list. For example, F 1 was composed of only the top 1 feature in the list F, F 2 was composed of the top 1 and top 2 features in the list F, and so on. To select the optimal feature set, i.e., determining the optimal features to be used, RF classification algorithm was then performed on all candidate feature sets. The classification performance of each candidate model was evaluated by cross-validation [27,28] (see Section 2.5 for details). The candidate model achieving the best performance was selected as the optimal prediction model. Candidate feature set that corresponded to the optimal model was determined as the optimal feature set. Features contained in the optimal feature set were actually the informative genes identified to be differentially expressed between PDX and human tumor cells.

Rough Set-Based Rule Learning
After obtaining the informative genes, the rough set-based rule learning algorithm can be applied to detect interaction between informative genes. The detected interactions were represented as rules. A rule is a terminology to describe a relation between rule conditions (the left-hand-side of the rule) and the rule outcome (the right-hand-side). For example, in our study, a rule can be presented as IF-THEN relationship based on gene expression: IF Gene1 = high AND Gene2 = low THEN type = human cell. Details for creating rough set-based rule model are presented below, including an introduction of rough set theory, and the repeated incremental pruning to produce error reduction (RIPPER) algorithm which was used to generate decision rules based on rough set theory.
Rough set is described briefly as follows. A decision (classification) system A can be defined as: where U denotes all the objects, A represents all the condition attributes (features), and d is the decision attribute (class label). An attribute a ∈ A can be regarded as a function such that a(x) is the value of the attribute a of object x. Generalized decision attribute ∂ A can be defined to replace d, to deal with the situation where the values of all the condition attributes of some objects are the same, but the values of the decision variable d of these objects are different [29]. A function, discerns(a, x, y, d), is defined as the situation in which object x and y are discernable using attribute a in term of the generalized decision variable ∂ A . A symmetric |U| × |U| matrix, called discernibility matrix M d A , is defined as follows The entry M d A x i , x j , i.e., the (i, j)th entry of the matrix M d A , contains all the attributes that can separate object x i from object x j . For example, an entry M d A x i , x j = {a 1 , a 2 , a 5 } means that x i and x j can be distinguished by a 1 , a 2 or a 5 .
A Boolean product-of-sums (POS) function, called discernibility function f d A (x) that is relative to object x, can be computed from the row x of M d A as follows where a * corresponds to the membership of attribute a. Function f d A (x) is composed of some conjunctions and each conjunction contains some terms corresponding to a non-empty entry of the row (a * 1 a * 2 ) contains two conjunctions indicating that two entries of the row x of M d A are not empty with one set as {a 1 , a 3 , a 5 } and another set as {a 1 , a 2 }. Finding the set of all the prime implicants of f d A (x), all the reducts, RED(A, x, d), that are relative to an object x can be determined, where a reduct is a minimal set of features that can separate x from other objects equally well with the full set of attributes A. A reduct preserves the boundaries between the approximation regions defined by the rough set [29]. A discernibility function g d A (U) can be defined to discern all objects as follows g d Accordingly, the reducts RED(A, d) of the whole decision system are the prime implicants of g d A (U). The Boolean POS function f d A (x) and g d A (U) can often be considerably simplified according to multiplicative idempotence and absorption [29]. However, finding all reducts using f d A (x) or g d A (U) is non-deterministic polynomial (NP)-hard, and for many applications only one reduct is needed. In our work, a heuristic method called Johnson Reducer algorithm [29] is applied to find a single reduct which is generally close to minimal size.
Johnson Reducer [30], which is based on a discernibility function, first initializes the set of the reduct R(A, x, d) to be empty. Then (1) find the attribute a that appears most frequently in the remaining conjunctions of the discernibility function; (2) add attribute a to R(A, x, d); (3) remove all the conjunctions that contain a. Repeat (1), (2) and (3) until all conjunctions are removed from the discernibility function.
Based on the reduct, the RIPPER algorithm is applied to generate decision rules. RIPPER, proposed by Cohen [31] in 1995, is a rule learning algorithm which is capable of handling large noisy datasets effectively. RIPPER is the improved version of Incremental Reduced Error Pruning (IREP) [32] which combines both the separate-and-conquer technique used first in the relational learner FOIL [33], a system learning Horn clauses from data expressed as relations, and the reduced error pruning strategy proposed by Brunk and Pazzani [34]. RIPPER algorithm is described briefly in Figure 1. Johnson Reducer [30], which is based on a discernibility function, first initializes the set of the reduct ( , , ) to be empty. Then (1) find the attribute that appears most frequently in the remaining conjunctions of the discernibility function; (2) add attribute to ( , , ); (3) remove all the conjunctions that contain . Repeat (1), (2) and (3) until all conjunctions are removed from the discernibility function.
Based on the reduct, the RIPPER algorithm is applied to generate decision rules. RIPPER, proposed by Cohen [31] in 1995, is a rule learning algorithm which is capable of handling large noisy datasets effectively. RIPPER is the improved version of Incremental Reduced Error Pruning (IREP) [32] which combines both the separate-and-conquer technique used first in the relational learner FOIL [33], a system learning Horn clauses from data expressed as relations, and the reduced error pruning strategy proposed by Brunk and Pazzani [34]. RIPPER algorithm is described briefly in Figure 1. The repeated incremental pruning to produce error reduction (RIPPER) algorithm. RIPPER starts from an empty rule and splits the training set into a growing set and a pruning set. Then it repeatedly grows rules that achieve the highest Foil's information gain using the growing set and prunes the rules using the pruning set until certain conditions are met. Finally, a global pruning is applied to prune the rules to gain the final rule set.
In our study, we fulfilled the detection of rules by using the Johnson Reducer algorithm [29], which was also implemented in the MCFS software package [14,35].

Measurements
Tenfold cross-validation [27,[36][37][38][39][40][41][42][43][44][45][46] was used to evaluate the prediction performance. For each fold, prediction performance was measured by the MCC. Final results were summarized as the 10fold average. MCC considers the sample numbers among classes and is generally regarded as a balance measurement to indicate prediction accuracy even though sample numbers among classes are of very different sizes. In this study, we adopted MCC for the performance measurement because the numbers of the two classes (human cell and PDX cell) were of great imbalance (174 vs. 657).
Given the involved n samples, denoted by , and N classes, represented by 1, 2, …, N.
True classes of samples were defined as matrix Initialize a set E to be the training set.
Choose a class C that contains least instances: Initialize a rule R to have an empty left-hand side that predicts C; Split E into growing and pruning sets; While there are positive samples (instances of C) in the growing set, or the description length (DL) is 64 bits greater than the smallest DL found so far, or the error rate is greater than 50% Until R is perfect (or no more attributes to add): For each attribute a not included in R, and for each value v: Consider a = v to add to the left-hand side of R; Choose the a and v that have the highest Foil's information gain; Add a = v to R; Prune R using reduced error pruning.
Remove the instances covered by R from the growing set.
Global optimization strategy is applied to further prune the rule. Figure 1. The repeated incremental pruning to produce error reduction (RIPPER) algorithm. RIPPER starts from an empty rule and splits the training set into a growing set and a pruning set. Then it repeatedly grows rules that achieve the highest Foil's information gain using the growing set and prunes the rules using the pruning set until certain conditions are met. Finally, a global pruning is applied to prune the rules to gain the final rule set.
In our study, we fulfilled the detection of rules by using the Johnson Reducer algorithm [29], which was also implemented in the MCFS software package [14,35].

Measurements
Tenfold cross-validation [27,[36][37][38][39][40][41][42][43][44][45][46] was used to evaluate the prediction performance. For each fold, prediction performance was measured by the MCC. Final results were summarized as the 10-fold average. MCC considers the sample numbers among classes and is generally regarded as a balance measurement to indicate prediction accuracy even though sample numbers among classes are of very different sizes. In this study, we adopted MCC for the performance measurement because the numbers of the two classes (human cell and PDX cell) were of great imbalance (174 vs. 657).
Given the involved n samples, denoted by s 1 , s 2 , . . . , s n , and N classes, represented by 1, 2, . . . , N. True classes of samples were defined as matrix Y = (y ij ) n×N , where y ij = 1 if the s i belongs to class j; otherwise, this variable is set to 0. Predicted classes of samples were defined as matrix X = (x ij ) n×N X = x ij n×N , where x ij = 1, if s i is predicted to be class j; otherwise, x ij = 0. Thus, MCC is defined as where cov(X,Y) was covariance function of X and Y, which can be computed by where x k and y k denote the kth column of X and Y, respectively; x k and y k denote the mean values of numbers in x k and y k , respectively. The range of MCC is between −1 and 1. With high MCC value, the classifier yields a good performance (1 means the given classifier yields a perfect classification, 0 indicates a classification no better than random prediction, and −1 represents a total misclassification).

Results
In this study, we collected the expression levels of 69 genes in 831 tumor cells, which included 657 PDX and 167 human tumor cells. By using MCFS, we ranked the 69 genes according to their RI on the tumor cell classification. Then, by applying IFS and RF methods, we drew the IFS-curve, which illustrated the relationship between the number of features used for building the model and the corresponding performance measured by the MCC value (see Figure 2). As shown in Figure 2, when using the top 32 genes to build the model, the prediction performance can reach the highest MCC value of 0.777. Details of the IFS curve can be found in Supplementary Material I.
where cov(X,Y) was covariance function of X and Y, which can be computed by where k x and k y denote the kth column of X and Y, respectively; k x and k y denote the mean values of numbers in k x and k y , respectively.
The range of MCC is between −1 and 1. With high MCC value, the classifier yields a good performance (1 means the given classifier yields a perfect classification, 0 indicates a classification no better than random prediction, and −1 represents a total misclassification).

Results
In this study, we collected the expression levels of 69 genes in 831 tumor cells, which included 657 PDX and 167 human tumor cells. By using MCFS, we ranked the 69 genes according to their RI on the tumor cell classification. Then, by applying IFS and RF methods, we drew the IFS-curve, which illustrated the relationship between the number of features used for building the model and the corresponding performance measured by the MCC value (see Figure 2). As shown in Figure 2, when using the top 32 genes to build the model, the prediction performance can reach the highest MCC value of 0.777. Details of the IFS curve can be found in Supplementary Material I. Furthermore, we compared our method to other similar methods, including the rough-set based method as mentioned above, the SVM [47] and the dagging [48]. The comparison result was shown in Table 1. As can be seen in Table 1, using RF with the top 32 features achieved the highest MCC, the 32 genes were identified as informative genes for predicting PDX and human tumor cells. Also, we performed enrichment analysis (multiple testing corrections) on the 32 identified genes. The significant Kyoto encyclopedia of genes and genomes (KEGG) and gene ontology (GO) enrichment results of 32 genes with false discovery rate (FDR) < 0.05 can be found in Supplementary Material II. Furthermore, we compared our method to other similar methods, including the rough-set based method as mentioned above, the SVM [47] and the dagging [48]. The comparison result was shown in Table 1. As can be seen in Table 1, using RF with the top 32 features achieved the highest MCC, the 32 genes were identified as informative genes for predicting PDX and human tumor cells. Also, we performed enrichment analysis (multiple testing corrections) on the 32 identified genes. The significant Kyoto encyclopedia of genes and genomes (KEGG) and gene ontology (GO) enrichment results of 32 genes with false discovery rate (FDR) < 0.05 can be found in Supplementary Material II. To somewhat open the RF model, we applied the rough set-based rule learning method to detect human-interpretable IF-THEN rules to reveal the interactions between the 32 informative genes. Finally, seven IF-THEN rules were detected (see Table 2). Details of the seven IF-THEN rules were provided in Supplementary Material III, including support values, coverage, ranking of the rules, etc. Moreover, in the Section 4, comprehensive interaction analysis as well as visualization of the seven IF-THEN rules were provided.

Discussion
Based on the detailed expression profile of PDX mouse tumor tissue and the corresponding clinical tissue, we applied our newly presented computational method for the identification of core differentially expressed genes between PDX model and clinical sample. The 32 important genes used for constructing RF prediction model are listed in Table 3. All these genes have been identified to be differentially expressed in clinical tumor samples and PDX mouse model by recent publications, which validated the efficacy and accuracy of our predicted genes. Furthermore, most of these differentially expressed genes contribute to tumorigenesis; this finding implies the potential difference of oncogenic mechanisms in PDX mouse model from in situ. This study at the transcriptomic level may not only deepen our understanding on the tumorigenesis in PDX tumor mouse model but also further reveal the potential restriction of PDX tumor mouse model due to the differential expression profile between PDX tumor tissue and clinical samples. The detailed analysis of optimal genes can be seen below.

Differentially Expressed Genes
Among our predicted differentially expressed gene, EMP1 is at the top rank. In encoding a tumor-associated membrane protein, this gene has been confirmed to be downregulated during the passage and transplantation of breast cancer tumor cells [12,13].
The second gene PARP2 has also been predicted to have differential expression patterns in our study. According to recent publications, this gene has been confirmed to contribute to DNA repair regulation in tumor tissues; this finding is quite common and significant in tumor malignant tissues [49,50]. Studies based on PDX model further confirmed that this gene mediates the chemotherapy resistance of breast cancer and has a differential expression pattern between breast cancer PDX model and corresponding tumor tissue in situ [50,51]. Previous findings demonstrated that during the transplantation processes, the biological process of DNA repair is inhibited, which indicates that PARP2 might be lowly expressed in PDX model [52].
The third ranked gene KRT19 has been widely reported to contribute to breast cancer tumorigenesis by its involvement in the organization of myofibers [53,54]. This gene is differentially expressed in PDX model compared with the original tumors not only in breast cancer but also in pancreatic cancer and hepatocellular carcinoma [55][56][57]. The upregulation of this gene helps maintain the structural integrity of epithelial cells [58]. During the transplantation of primary tumor into the mouse model, epithelial cell integrity became unstable in the PDX mouse model; this finding indicates that this gene can be downregulated in PDX model [58]. Interestingly, the homolog of KRT19 and KRT5 is also in the optimal feature set. The products of these two genes may form a specific dimer and contribute to specific biological processes as a whole [59]; this observation suggests that this gene can also be differentially expressed in the PDX mouse model.
In encoding a specific membrane glycoprotein, MUC1 is usually expressed in the epithelial female cancer subtypes (e.g., ovarian cancer and breast cancer) and regarded as the biomarker of cancer cell stemness [60][61][62]. A recent study on the stemness of cancer cells confirmed that during the passage of PDX mouse model, the stemness of cancer cells is gradually shaped by the murine microenvironment [62]; this finding indicates that the expression of MUC1 can be altered in this progress. Similarly, another gene involved in the stemness maintenance of cancer cells, PROM1 [55,63], can also have differential expression patterns.
CXCR4 encodes a specific G protein coupled receptor, which has been widely reported to be expressed on the immature CD34 + hematopoietic stem cells [64,65]. Recent studies on breast cancer found that the expression profiles of CXCR4 are very diverse among the primary tumor of breast cancer, the metastatic tumor tissues, and the PDX tumors filtered by specific murine microenvironment [66,67].
In our results, two specific Erb-B2 Receptor Tyrosine Kinase (ERBB)-pathway-associated genes, i.e., ERBB2 and ERBB3, have been predicted as potential distinctive biomarkers. ERBB family signaling pathway has been widely reported to contribute to transcriptional regulation during the tumorigenesis of breast cancer [68]. Further studies on the detailed expression pattern of these two genes in clinical tumor tissues and PDX models confirmed that after the selection of murine tumor microenvironment and PDX model passage, the expression patterns of ERBB2 and ERBB3 have been systematically regulated for further adaption to the murine mouse microenvironment [55,69].
Another gene, ID4, has also been identified as a breast-cancer-associated gene, which participates in the regulation of basic helix-loop-helix transcription factors [70,71]. Although no direct evidence is found on the differential expression pattern in paired PDX mouse model and tumor tissues in situ, the expression of creatine kinase is altered due to the microenvironment of PDX mouse model [72,73]. As the negative regulator of creatine kinase, the expression pattern of ID4 may also be changed in a different microenvironment.
A recent study on single-cell transcriptome of PDX mouse model reported a series of genes with differential expression patterns in the PDX mouse model compared with the original tumors [13]; this series included PTEN, NTRK2, PGR, and TP53, which are identified in this study as the differentially expressed genes.

Rules of Quantitative Expression Level Requirements
As we have mentioned above, we detected seven specific rules to distinguish primary tumor tissues and PDX tumor tissues based on the expression levels of 14 genes. In the seven rules, the first six provided criteria for classifying a tumor cell into the original human tumor, which also represents the primary tumor-specific gene expression patterns (see Table 2). Among the 14 genes, seven genes, i.e., KRT19, EMP1, CXCR4, CD44, PTEN, PARP2, and PLCB4, were required to have high expression levels in the original tumor cells, whereas others should have low expression levels. As we discussed above, some of these genes, such as KRT19, EMP1, and PARP2, have evidence from previous studies, which indicate their higher expression levels in the original tumors compared with PDX tumor cells [12,13,52,58].
Further, to facilitate our understanding of the seven rules, we visualized the seven rules by using Ciruvis, a web-based tool for rule networks and interaction detection using rule-based classifiers, as suggested in the literature [74]. The result is shown in Figure 3, from which we found that, strong interactive relationships involved genes of KRT19, KRT5 and CDH3 (interactions of dark red color). Thus, we provided full analysis on the interactions among KRT19, KRT5 and CDH3.
First, the dimerizing of Krt5 with Krt14 has been confirmed to be involved in the regulation of EMT in tumorigenesis [75][76][77]. Further studies on the protein-protein interactions confirmed that this heterodimer could also interact with another component of cytokeratins, Krt6b [78]. As a functional mediator, Krt6b connects Krt5 with Krt19 through protein-protein interaction [79]. Krt6b shows negatively correlated expression pattern with Krt19 during tumorigenesis [80]. Therefore, the strong interaction between KRT19 and KRT5 has been indicated by previous experiments. Rule networks for the seven detected rules generated by Ciruvis [74]. For the purpose of visualization, Ciruvis assigned specific colors for genes and placed genes into a circle. Meanwhile, Ciruvis indicated the strength of interaction among genes by using the degree of red color on the interactions. The darker the red color is, the stronger the interaction among genes.
Second, the Ciruvis figure also show strong correlations between KRT5 and CDH3, which encodes a member of the cadherin superfamily. A previous study in 2010 [81] reported that BRCA1 repressed the expression of various cytokeratins including Krt5. It is also known that BRCA1 product manipulate the expression pattern of both KRT5 and CDH3 [81], suggesting the correlation and coregulation of the expressions of these two genes.
Third, the correlation of KRT19 and CDH3 is suggested by the previous studies which reported the participation of these two genes in the essential proliferation and metastasis associated pathway MAPK/ERK cascade in tumorigenesis [82,83]. Furthermore, KRT19 shows co-expression pattern with various CDH3 associated genes like CDH1 [84], EGFR [85] and CTNNB1 [54] in certain pathological conditions, indicating that there may be potential biological relevance between KRT19 and CDH3.
In summary, these findings strongly support the correlations of both functions and expression patterns among the three genes KRT19, KRT5 and CDH3, which are all related to EMT associated pathological processes in human tumors, but not in PDX mouse model.

Conclusions
Through a new computational method, we identified the differentially expressed genes and quantitative rules that can classify the PDX tumor and the original tumor with high accuracies. Multiple well-known oncogenes and tumor suppressors are differentially expressed; this finding indicates the risk of using those genes as therapeutic targets in breast cancer through PDX mouse model. Therefore, our study not only provides a functional tool to distinguish the primary tumor from PDX tumor by gene expression levels but also reveals the detailed gene expression alterations and shaping characteristics in murine microenvironment. In this work, we employed SVM, RF, . Rule networks for the seven detected rules generated by Ciruvis [74]. For the purpose of visualization, Ciruvis assigned specific colors for genes and placed genes into a circle. Meanwhile, Ciruvis indicated the strength of interaction among genes by using the degree of red color on the interactions. The darker the red color is, the stronger the interaction among genes.
First, the dimerizing of Krt5 with Krt14 has been confirmed to be involved in the regulation of EMT in tumorigenesis [75][76][77]. Further studies on the protein-protein interactions confirmed that this heterodimer could also interact with another component of cytokeratins, Krt6b [78]. As a functional mediator, Krt6b connects Krt5 with Krt19 through protein-protein interaction [79]. Krt6b shows negatively correlated expression pattern with Krt19 during tumorigenesis [80]. Therefore, the strong interaction between KRT19 and KRT5 has been indicated by previous experiments.
Second, the Ciruvis figure also show strong correlations between KRT5 and CDH3, which encodes a member of the cadherin superfamily. A previous study in 2010 [81] reported that BRCA1 repressed the expression of various cytokeratins including Krt5. It is also known that BRCA1 product manipulate the expression pattern of both KRT5 and CDH3 [81], suggesting the correlation and co-regulation of the expressions of these two genes.
Third, the correlation of KRT19 and CDH3 is suggested by the previous studies which reported the participation of these two genes in the essential proliferation and metastasis associated pathway MAPK/ERK cascade in tumorigenesis [82,83]. Furthermore, KRT19 shows co-expression pattern with various CDH3 associated genes like CDH1 [84], EGFR [85] and CTNNB1 [54] in certain pathological conditions, indicating that there may be potential biological relevance between KRT19 and CDH3.
In summary, these findings strongly support the correlations of both functions and expression patterns among the three genes KRT19, KRT5 and CDH3, which are all related to EMT associated pathological processes in human tumors, but not in PDX mouse model.

Conclusions
Through a new computational method, we identified the differentially expressed genes and quantitative rules that can classify the PDX tumor and the original tumor with high accuracies. Multiple well-known oncogenes and tumor suppressors are differentially expressed; this finding indicates the risk of using those genes as therapeutic targets in breast cancer through PDX mouse model. Therefore, our study not only provides a functional tool to distinguish the primary tumor from PDX tumor by gene expression levels but also reveals the detailed gene expression alterations and shaping characteristics in murine microenvironment. In this work, we employed SVM, RF, dagging and rough set classifiers due to the small samples problem. Some of the latest classifiers, such as ensemble classifier [86], will be tested in future work.