Intelligent Microarray Data Analysis through Non-negative Matrix Factorization to Study Human Multiple Myeloma Cell Lines

: Microarray data are a kind of numerical non-negative data used to collect gene expression proﬁles. Since the number of genes in DNA is huge, they are usually high dimensional, therefore they require dimensionality reduction and clustering techniques to extract useful information. In this paper we use NMF, non-negative matrix factorization, to analyze microarray data, and also develop “intelligent” results visualization with the aim to facilitate the analysis of the domain experts. For this purpose, a case study based on the analysis of the gene expression proﬁles (GEPs), representative of the human multiple myeloma diseases, was investigated in 40 human myeloma cell lines (HMCLs). The aim of the experiments was to study the genes involved in arachidonic acid metabolism in order to detect gene patterns that possibly could be connected to the di ﬀ erent gene expression proﬁles of multiple myeloma. NMF results have been veriﬁed by western blotting analysis in six HMCLs of proteins expressed by some of the most abundantly expressed genes. The experiments showed the e ﬀ ectiveness of NMF in intelligently analyzing microarray data.


Introduction
The massive use of technologies, in any domain, is leading to an exponential growth of available data. This abundant data and their complexity very often make them unusable. For this reason, automatic tools are more and more frequently used to facilitate their management and analysis. However, when completely automatic tools are used without proper technical knowledge, they provide results that are not completely understandable.
Intelligent data analysis (IDA) is a methodology to extract useful hidden knowledge from data by involving human expertise in the analysis process [1,2]. It includes different methodologies to analyze real-world problems, such as statistics, artificial intelligence, machine learning, data mining, and data visualization. Domain experts and computer scientists are both necessary during the data analysis stimuli. Both the two known COX isoforms (COX-1 and COX-2) contribute to PGE2 production, even though COX-1 seems to be its major source in normal tissue. COX-1 is also the target of a low dose of acetylsalicylic acid (ASA), the well-known aspirin active principle ingredient that irreversibly inactivates the platelet cyclooxygenase-1 by acetylating its S530, preventing thromboxane formation. In fact, thrombotic complications in patients with newly diagnosed MM treated with lenalidomide, dexamethasone, and thalidomide chemotherapy benefit from aspirin prophylaxis [24][25][26]. Tens of millions of worldwide adults take aspirin to reduce their risk of heart attack or stroke. Studies over the last two decades have suggested that regular use of aspirin may have another important benefit: decreasing the risk of developing or dying from some types of cancer [27,28]. These aspects potentiate the idea that the genes and their encoded proteins involved in the COX-mediated AA cascade would of relevance and noteworthy investigations also in multiple myeloma as in other types of cancer. For this purpose, a microarray data matrix containing the GEP of 40 HMCLs was used. The experiments, here below described, were designed to study a subset of 64 genes that were selected as involved in arachidonic acid metabolism. Particularly, for this purpose, the gene taxonomy proposed by [29] was used, 64 genes were grouped according to their role in arachidonic acid metabolism. Indeed the aim of the analysis was to detect expression patterns possibly connected to different multiple myeloma histotypes, stages, and grades that could emerge from the interaction of the selected genes. The NMF part-based representation was used to reduce the problem dimensionality and to describe the HMCLs as a composition of metagenes (i.e., groups of genes that emerged automatically from data). Finally, the western blotting technique was used to verify if the relationships between genes and HMCLs that have been suggested by NMF were actually coded.

Western Blotting
The western blotting analysis was conducted by using the methods reported in Perrone et al. with minor modifications [38]. Briefly, all cells were washed twice with 10 mL phosphate-buffered saline (PBS), scraped in 1 mL PBS and centrifuged for 10 min at 1500 rpm, 4 • C. Proteins were extracted from cells by homogenization in cold RIPA buffer (Sigma-Aldrich) containing 1X protease inhibitor cocktail and centrifuged at 14,000 rpm for 10 min at 4 • C. The supernatant was recovered, and the protein concentration was measured using a DC Protein Assay Reagent Kit (BIO-RAD, Hercules, CA, USA). Protein extract (25 µg) was separated on 10% polyacrylamide gel (BIO-RAD) and then transferred onto a polyvinylidene difluoride membrane (PVDF) by a Trans-Blot Turbo Transfer System (BIO-RAD). The membrane was blocked for 1 h at room temperature with blocking buffer (1% blotting grade blocker, 0.1% Tween20 in Tris-buffered saline, TBS). The membrane was then incubated with either anti-COX-1 (1:500 mouse monoclonal, overnight at 4 • C), anti-LTA4H, anti-PGEs2, anti-CYP1A2, anti-CBR1 (1:200 mouse monoclonal, overnight at 4 • C), anti-ALOX-15-B (1:1000 rabbit monoclonal, 1 h at room temperature), or anti-17µ-HSD4, anti-GGT1, anti-ACAA1, anti-vinculin, anti-µ-actin (1:1000 mouse monoclonal, 1 h at room temperature) antibodies, diluted in blocking buffer. After the incubation time, the membrane was washed with washing buffer (0.1% Tween-20 in Tris-buffered saline, TBS) three times and incubated with a secondary peroxidase antibody (1:3000 anti-mouse or anti-rabbit) for 1 h at room temperature. After washing, the membrane was treated with enhanced chemiluminescence (ECL, BIO-RAD) according to the manufacturer's instructions and the blot was visualized by UVITEC Cambridge (LifeTechnologies, Carlsbad, CA, USA). The expression level was evaluated by densitometric analysis using UVITEC Cambridge software (LifeTechnologies) and the µ-actin or vinculin expression level was used as a loading control to normalize the sample values.

Non-Negative Matrix Factorization
NMF is a low-rank approximation technique that decomposes the original data matrix X ∈ R n×m + in the product of two non-negative matrices W and H (X ≈ WH, with W ∈ R n×k + and H ∈ R k×m + ) called basis and encoding matrix, respectively, where k is the factorization rank that indicates the number of hidden genes (metagenes) to extract.
An explicative example is reported in Figure 1. A microarray data matrix with 30 genes (rows) and 10 HMCLs (columns) was decomposed in the matrices W ∈ R 30×2 + and H ∈ R 2×10 + . The basis matrix W, contains the latent factors that are automatically extracted by the algorithm. In the context of the microarray data analysis, these factors are called metagenes: groups of genes that emerged from the computation and used to describe the original data in a sub-dimensional space. Each column of W represents a metagene that is described by the same genes used in the original data. Each entry w ij is the membership values of the i-th gene to the j-th metagene. The metagenes represent the relationships among the genes. From the mathematical point of view, these are linear relationships that emerge from data. One gene can be involved in the definition of different bases (metagenes); indeed, these are not exclusive relationships. This well represents the biological nature of the metagenes that could actually involve different genes depending on their role.
The encoding matrix H represents the original data (HMCLs) in terms of the metagene expression profile. Each entry h ij denotes the importance of the i-th metagene in reconstructing the j-th HMCL. The non-negativity constraint of NMF allows describing the HMCLs as a linear combination of these metagenes, weighted by the coefficients in H (part-based decomposition).
Appl. Sci. 2020, 10, x FOR PEER REVIEW 5 of 17 Graphical representations of the metagene expression profiles could reveal different metagene behaviors for the considered HMCLs, as in Figure 1. Therefore these patterns are used to automatically group the HMCLs based on their similarities. For this study, the standard multiplicative update algorithm [39] based on the Euclidean distance was used. Moreover, since the NMF algorithms are strictly dependent on the initial starting point, the non-negative double singular value decomposition (NNDSVD) algorithm was applied since it was proven to improve the quality of clustering results [40].
The rank of the factorization is an input parameter of the problem. It is crucial for the results since it indicates the number of metagenes and the clusters that should be extracted. The correlation matrix gives a qualitative measurement of the stability of the algorithm. Quantitative measurement of this stability is given by the cophenetic correlation coefficient that in gene expression analysis it is used to quantify the stability of the clustering results of microarray gene expression [10].

Results and Discussion
In this section, we report and discuss the results obtained by NMF on the selected dataset (Section 3.1). Clustering results and the HMCLs part-based representation, in terms of the extracted metagenes, are firstly described and analyzed in Section 3.2. Section 3.3 describes the hidden metagenes and their role in arachidonic acid metabolism; finally, Section 3.4 reports western blotting analysis to validate NMF results.

Microarray Data
A microarray data matrix containing the gene expression profile of 40 HMCLs was used to study the genes involved in arachidonic acid metabolism, to detect genes expression profile patterns likely connected to the different multiple myeloma types. All the numerical results were obtained by implementing the algorithms in Matlab 8.4 codes and running them on a machine equipped with an Intel® core 2 Duo CPU with RAM 8.00 GB.
Microarray data of multiple myeloma cell lines available on the public database ArrayExpress [41] were analyzed. Graphical representations of the metagene expression profiles could reveal different metagene behaviors for the considered HMCLs, as in Figure 1. Therefore these patterns are used to automatically group the HMCLs based on their similarities.
For this study, the standard multiplicative update algorithm [39] based on the Euclidean distance was used. Moreover, since the NMF algorithms are strictly dependent on the initial starting point, the non-negative double singular value decomposition (NNDSVD) algorithm was applied since it was proven to improve the quality of clustering results [40].
The rank of the factorization is an input parameter of the problem. It is crucial for the results since it indicates the number of metagenes and the clusters that should be extracted. The correlation matrix gives a qualitative measurement of the stability of the algorithm. Quantitative measurement of this stability is given by the cophenetic correlation coefficient that in gene expression analysis it is used to quantify the stability of the clustering results of microarray gene expression [10].

Results and Discussion
In this section, we report and discuss the results obtained by NMF on the selected dataset (Section 3.1). Clustering results and the HMCLs part-based representation, in terms of the extracted metagenes, are firstly described and analyzed in Section 3.2. Section 3.3 describes the hidden metagenes and their role in arachidonic acid metabolism; finally, Section 3.4 reports western blotting analysis to validate NMF results.

Microarray Data
A microarray data matrix containing the gene expression profile of 40 HMCLs was used to study the genes involved in arachidonic acid metabolism, to detect genes expression profile patterns likely connected to the different multiple myeloma types. All the numerical results were obtained by implementing the algorithms in Matlab 8.4 codes and running them on a machine equipped with an Intel ® core 2 Duo CPU with RAM 8.00 GB.
Microarray data of multiple myeloma cell lines available on the public database ArrayExpress [41] were analyzed.
The microarray data were collected using the Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133-Plus-2]. The analysis was conducted on a subset of the original data starting from 40 human myeloma cell lines (HMCLs) that provided a signature for stratification of patient risk [41] and 64 genes involved in arachidonic acid metabolism as depicted in Figure 2.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 6 of 17 The microarray data were collected using the Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133-Plus-2]. The analysis was conducted on a subset of the original data starting from 40 human myeloma cell lines (HMCLs) that provided a signature for stratification of patient risk [41] and 64 genes involved in arachidonic acid metabolism as depicted in Figure 2.
A data matrix was built by using 64 rows (genes) and 40 columns (HMCLs) ( Figure 3). It was evident that the dimensionality of the data was too big to directly identify any GEP pattern. Hence, a dimensionality reduction was needed to bring out useful information that could be easily visualized through the part-based representation.  A data matrix was built by using 64 rows (genes) and 40 columns (HMCLs) (Figure 3). It was evident that the dimensionality of the data was too big to directly identify any GEP pattern. Hence, a dimensionality reduction was needed to bring out useful information that could be easily visualized through the part-based representation.

HMCL Clustering Results
Fifty runs of the NMF algorithm varying the rank value in the interval 2-10 were executed [42,43], and both the consensus matrices and the cophenetic correlation coefficient were collected from the analysis of the encoding matrix H. Whilst the consensus matrix gives a qualitative measurement of the cluster quality, the cophenetic correlation coefficient is a quantitative measure, and its trend is used to choose the optimal rank (i.e., the number of clusters). Figure 4 shows the cophenetic correlation coefficient trend in the interval 2-10 (Each point represents the mean value of the cophenetic correlation coefficient on the fifty runs). From value five we observed a significant reduction of this trend, so it has been chosen as factorization rank. The consensus matrix in Figure 5 represents the HMCLs (rows and columns) grouped according to the metagene expression profile similarities. Five clusters (yellow squares) are depicted showing some regularities within the groups. The algorithm proved to be stable and it converged to the same solution in all 50 runs. This result is shown in Figure 5 where the squares are clearly separated and only yellow and blue values are present.

HMCL Clustering Results
Fifty runs of the NMF algorithm varying the rank value in the interval 2-10 were executed [42,43], and both the consensus matrices and the cophenetic correlation coefficient were collected from the analysis of the encoding matrix H. Whilst the consensus matrix gives a qualitative measurement of the cluster quality, the cophenetic correlation coefficient is a quantitative measure, and its trend is used to choose the optimal rank (i.e., the number of clusters). Figure 4 shows the cophenetic correlation coefficient trend in the interval 2-10 (Each point represents the mean value of the cophenetic correlation coefficient on the fifty runs). From value five we observed a significant reduction of this trend, so it has been chosen as factorization rank.

HMCL Clustering Results
Fifty runs of the NMF algorithm varying the rank value in the interval 2-10 were executed [42,43], and both the consensus matrices and the cophenetic correlation coefficient were collected from the analysis of the encoding matrix H. Whilst the consensus matrix gives a qualitative measurement of the cluster quality, the cophenetic correlation coefficient is a quantitative measure, and its trend is used to choose the optimal rank (i.e., the number of clusters). Figure 4 shows the cophenetic correlation coefficient trend in the interval 2-10 (Each point represents the mean value of the cophenetic correlation coefficient on the fifty runs). From value five we observed a significant reduction of this trend, so it has been chosen as factorization rank. The consensus matrix in Figure 5 represents the HMCLs (rows and columns) grouped according to the metagene expression profile similarities. Five clusters (yellow squares) are depicted showing some regularities within the groups. The algorithm proved to be stable and it converged to the same The consensus matrix in Figure 5 represents the HMCLs (rows and columns) grouped according to the metagene expression profile similarities. Five clusters (yellow squares) are depicted showing some regularities within the groups. The algorithm proved to be stable and it converged to the same solution in all 50 runs. This result is shown in Figure 5 where the squares are clearly separated and only yellow and blue values are present. It is known that NMF allows a part-based representation of the original data by a linear additive composition of parts that can be interpreted as building blocks. In this scenario, the 'blocks' are the metagenes. Figure 6 shows the HMCLs composition in terms of metagenes in each of the five groups that have been returned from the consensus matrix. The whole bar is given by the sum of the single metagene bars, expressed as a percentage.
The emerging patterns are now clearer. The dimensionality reduction and the clustering have made it possible. They allow the study of both the metagene expression profiles through the HMCLs and the different behaviors of the metagenes among the clusters. This simplified representation helps the domain experts in better understanding the relationships hidden in data. It is a starting point for conducting the biological analysis. Figure 6 shows that the HMCLs in the first cluster are characterized by a high expression of the first metagene (dark blue bars), and the HMCLs in the second cluster from high expression of the third metagene (green). The HMCLs in the third cluster are characterized by a high expression of the fourth metagene (yellow) bars, and the bars in clusters four and five have very high expression of the metagenes two (light blue) and five (dark yellow), respectively.
Furthermore, it is noteworthy that, even if the metagenes two and four are highly expressed in clusters four and three, they are always expressed in all the HMCLs. Note that even if this representation has allowed the identification of the most expressed metagenes per HMCLs, the analysis has to take into account the less expressed metagenes. Indeed, all the extracted metagenes contribute to defining each HMCL, thus they cannot be removed as noise. In fact, the dimensionality reduction has already removed the noise from data, whilst it has highlighted the relationships among genes and HMCLs. The emerged patterns are then used by the domain experts, involved in the data analysis process (biologists and physicians in this case), to better understand their role in arachidonic acid metabolism, as it is described in Section 3.4.
The expression profiles of the single metagenes are depicted (same colors code used before) in It is known that NMF allows a part-based representation of the original data by a linear additive composition of parts that can be interpreted as building blocks. In this scenario, the 'blocks' are the metagenes. Figure 6 shows the HMCLs composition in terms of metagenes in each of the five groups that have been returned from the consensus matrix. The whole bar is given by the sum of the single metagene bars, expressed as a percentage.
The emerging patterns are now clearer. The dimensionality reduction and the clustering have made it possible. They allow the study of both the metagene expression profiles through the HMCLs and the different behaviors of the metagenes among the clusters. This simplified representation helps the domain experts in better understanding the relationships hidden in data. It is a starting point for conducting the biological analysis. Figure 6 shows that the HMCLs in the first cluster are characterized by a high expression of the first metagene (dark blue bars), and the HMCLs in the second cluster from high expression of the third metagene (green). The HMCLs in the third cluster are characterized by a high expression of the fourth metagene (yellow) bars, and the bars in clusters four and five have very high expression of the metagenes two (light blue) and five (dark yellow), respectively.
Furthermore, it is noteworthy that, even if the metagenes two and four are highly expressed in clusters four and three, they are always expressed in all the HMCLs. Note that even if this representation has allowed the identification of the most expressed metagenes per HMCLs, the analysis has to take into account the less expressed metagenes. Indeed, all the extracted metagenes contribute to defining each HMCL, thus they cannot be removed as noise. In fact, the dimensionality reduction has already removed the noise from data, whilst it has highlighted the relationships among genes and HMCLs. The emerged patterns are then used by the domain experts, involved in the data analysis process (biologists and physicians in this case), to better understand their role in arachidonic acid metabolism, as it is described in Section 3.4.
The expression profiles of the single metagenes are depicted (same colors code used before) in Appendix A, reporting the same behavior summarized in Figure 6. Moreover, visual analysis allows us to detect some subgroups within the groups. It is the case that some HMCLs such as XG24, XG3, XG2, U266, LP1, XG12, XG13, and XG10 of the first cluster have a medium expression of the metagene 3 (the green one), different than the other HMCLs of the same groups; the HMCLs XG4, XG6, and NAN6 of the second cluster, that use metagene 4 (the yellow bars); in the same clusters we recognize another subgroup composed of the HMCLs XG6, L363, KMS12BM, MDN, NAN1, and NAN6 that have a medium expression of metagene 2 (light blue). Differences among HMCLs in the same group are the object of study for the domain experts.
At this point, it appeared interesting to analyze the genes forming each metagene. Since our analysis started using Moreaux's classification of HMCLs, [41] it seemed important to compare his results to our obtained clusters. According to Moreaux's study, the 40 HMCLs could be clustered into 6 groups using unsupervised hierarchical clustering, whereas our clustering led to five clusters.
This difference was predictable. Moreaux's analysis was conducted on a large set of genes. Our analysis is focused on a smaller subset of genes, the arachidonic acid metabolism involved genes.

Metagene Analysis
The encoding matrix H was used to cluster the HMCLs and to extract the metagene expression profiles, whereas the basis matrix W contains the metagenes, represented as sets of genes with different weights. We extracted five different metagenes. Figure 7 depicts the top most 20 genes involved in each metagene. The genes have been ordered and visualized, according to their influence (expressed in percentage) on each metagene. For a broader analysis, twenty genes are reported even though we observed the highest weights for the first five/six genes. These graphs give more insights to the domain experts that are helpful to drive the biological analysis of the HMCLs.
As an example, analyzing Figure 5 we know that the HMCL KMM1 is mostly composed of the expression of metagene 1, partly by metagene 2, and minimally by metagene 4. Figure 7 specifies the genes that contribute to each metagene definition. The genes PTGES2, HSD17B4, LTA4H, ACAA, CBRI, and PPARD are the heaviest in the definition of metagene 1, so the expression of these genes highly affects the HMCL KMM1; the genes MGST2, GGT1, SCP2, FPR2, PTGES3, PLAG10, and CYP2W1 from metagene 2 only partly influence KMM1, and finally, the genes CYSLTR1, CBR1, CYP2J2, EHHADH, CYP1A2, and ACOX3 from metagene 4 slightly condition KMM1. It is possible to conduct a similar analysis on the other cell lines, looking for a relationship among the genes and Moreover, visual analysis allows us to detect some subgroups within the groups. It is the case that some HMCLs such as XG24, XG3, XG2, U266, LP1, XG12, XG13, and XG10 of the first cluster have a medium expression of the metagene 3 (the green one), different than the other HMCLs of the same groups; the HMCLs XG4, XG6, and NAN6 of the second cluster, that use metagene 4 (the yellow bars); in the same clusters we recognize another subgroup composed of the HMCLs XG6, L363, KMS12BM, MDN, NAN1, and NAN6 that have a medium expression of metagene 2 (light blue). Differences among HMCLs in the same group are the object of study for the domain experts.
At this point, it appeared interesting to analyze the genes forming each metagene. Since our analysis started using Moreaux's classification of HMCLs, [41] it seemed important to compare his results to our obtained clusters. According to Moreaux's study, the 40 HMCLs could be clustered into 6 groups using unsupervised hierarchical clustering, whereas our clustering led to five clusters.
This difference was predictable. Moreaux's analysis was conducted on a large set of genes. Our analysis is focused on a smaller subset of genes, the arachidonic acid metabolism involved genes.

Metagene Analysis
The encoding matrix H was used to cluster the HMCLs and to extract the metagene expression profiles, whereas the basis matrix W contains the metagenes, represented as sets of genes with different weights. We extracted five different metagenes. Figure 7 depicts the top most 20 genes involved in each metagene. The genes have been ordered and visualized, according to their influence (expressed in percentage) on each metagene. For a broader analysis, twenty genes are reported even though we observed the highest weights for the first five/six genes. These graphs give more insights to the domain experts that are helpful to drive the biological analysis of the HMCLs.
As an example, analyzing Figure 5 we know that the HMCL KMM1 is mostly composed of the expression of metagene 1, partly by metagene 2, and minimally by metagene 4. Figure 7 specifies the genes that contribute to each metagene definition. The genes PTGES2, HSD17B4, LTA4H, ACAA, CBRI, and PPARD are the heaviest in the definition of metagene 1, so the expression of these genes highly affects the HMCL KMM1; the genes MGST2, GGT1, SCP2, FPR2, PTGES3, PLAG10, and CYP2W1 from metagene 2 only partly influence KMM1, and finally, the genes CYSLTR1, CBR1, CYP2J2, EHHADH, CYP1A2, and ACOX3 from metagene 4 slightly condition KMM1. It is possible to conduct a similar analysis on the other cell lines, looking for a relationship among the genes and the cellular functions they are involved in. The taxonomy in [24] can give more insight into the role of the genes in each metagene. For example, the genes in the first metagene mostly belong to the cyclooxygenases and related enzymes involved in eicosanoid biosynthesis in humans, enzymes involved in eicosanoid catabolism in human, eicosanoid receptors, and transcription factors in humans. Similar analyses could be conducted on the other metagenes.

Proofs of Concepts: Western Blotting Analysis
With the aim to verify the correctness of the NMF results, the expression of some proteins codified by the topmost twenty genes that are part of the identified metagene were found in six The taxonomy in [24] can give more insight into the role of the genes in each metagene. For example, the genes in the first metagene mostly belong to the cyclooxygenases and related enzymes involved in eicosanoid biosynthesis in humans, enzymes involved in eicosanoid catabolism in human, eicosanoid receptors, and transcription factors in humans. Similar analyses could be conducted on the other metagenes.

Proofs of Concepts: Western Blotting Analysis
With the aim to verify the correctness of the NMF results, the expression of some proteins codified by the topmost twenty genes that are part of the identified metagene were found in six representative HMCLs (U266, SKMM2, KMS12BM, NCIH929, RPMI8226, MM1S) by using the western blotting technique [44][45][46][47]. One or two HMCLs were selected from each cluster, based on their different responses to the clinically used drugs. In Figure 5, the expression of some specific proteins, analyzed by western blotting analysis and involved in the eicosanoid biosynthesis related to the arachidonic acid pathway of various human myeloma cell lines, is depicted.
The expression extent values of each protein were assigned as listed in Figure 8. U266-B1 is a B lymphocyte plasmacytoma cell line belonging to cluster 1 characterized by the expression of MG1-2-3-5 (depicted from the colored boxes under cell line name, Figure 8). HSD17B4, the gene coding for the protein HSD417B, was expressed at 8% in MG1, at 3% in MG3, at 5% in MG4, and 4% in MG5 (gene percentages are indicated in the colored boxes, Figure 8 and Figure 7) but its expression (0.37) in U266-B1 was due only to MG1, 3, and 5. In fact, U266-B1 (cluster 1, Figure 5) was not represented by MG4 and HSD17B4 was not present in the top most 20 genes of MG2.
In U266-B1, COX-1 was not detected by western blot analysis ( Figure 6) thus it was supposed that PTGS1 is downregulated in this cell line.
GGT1 expressed at 8% in MG2 codifies the corresponding protein which was found with a value of 0.18 in U266-B1. Comparable expression values 0.17 and 0.26, respectively, were found for LTA4H and ALOX15B; LTA4H was represented by all 5 MG with a preponderant contribution of MG1 in which LTA4H was expressed at 7% of all genes. ALOX15P was found in MG1, 3, 4, and 5 but its expression was higher in MG3 (7%).
Among the analyzed proteins, PGES2 and ACAA1 were highly expressed in U266-B1 with a value of 19.9 and 2.17, respectively. PGES2 was found in MG1, 2, and 5 with no involvement of MG3 and 4 which do not compose the U266-B1 cell line. ACAA1 derive from MG1, 3, and 5, with no contribution of MG 2 and 4. CBR1 and CYP1A2 were not detected in U266-B1, even though their corresponding genes were represented in MGs 1-5. Their absence could be justified by the downregulation of CBR1 and CYP1A2.
SK-MM-2 and KMS-12-BM, both belonging to cluster 2 ( Figure 5), had a different metagene composition as depicted by the colored boxes under cell line name (Figure 8).
SK-MM-2, a B lymphocyte plasmacytoma cell line, was mostly composed of metagene 3 and 1, with a little contribute given by metagene 2. From western blotting analysis (Figure 9) of this cell line, it was evident that all the proteins codified by the genes of metagenes 2 and 3 were expressed, except for CYP1A2.
Concerning KMS-12-BM, a multiple myeloma cell line, it had an almost similar contribution from all metagenes with the exception of metagene 5. In this case, PGES2 was the major expressed U266-B1 is a B lymphocyte plasmacytoma cell line belonging to cluster 1 characterized by the expression of MG1-2-3-5 (depicted from the colored boxes under cell line name, Figure 8). HSD17B4, the gene coding for the protein HSD417B, was expressed at 8% in MG1, at 3% in MG3, at 5% in MG4, and 4% in MG5 (gene percentages are indicated in the colored boxes, Figures 7 and 8) but its expression (0.37) in U266-B1 was due only to MG1, 3, and 5. In fact, U266-B1 (cluster 1, Figure 5) was not represented by MG4 and HSD17B4 was not present in the top most 20 genes of MG2.
In U266-B1, COX-1 was not detected by western blot analysis ( Figure 6) thus it was supposed that PTGS1 is downregulated in this cell line.
GGT1 expressed at 8% in MG2 codifies the corresponding protein which was found with a value of 0.18 in U266-B1. Comparable expression values 0.17 and 0.26, respectively, were found for LTA4H and ALOX15B; LTA4H was represented by all 5 MG with a preponderant contribution of MG1 in which LTA4H was expressed at 7% of all genes. ALOX15P was found in MG1, 3, 4, and 5 but its expression was higher in MG3 (7%).
Among the analyzed proteins, PGES2 and ACAA1 were highly expressed in U266-B1 with a value of 19.9 and 2.17, respectively. PGES2 was found in MG1, 2, and 5 with no involvement of MG3 and 4 which do not compose the U266-B1 cell line. ACAA1 derive from MG1, 3, and 5, with no contribution of MG 2 and 4. CBR1 and CYP1A2 were not detected in U266-B1, even though their corresponding genes were represented in MGs 1-5. Their absence could be justified by the downregulation of CBR1 and CYP1A2.
SK-MM-2 and KMS-12-BM, both belonging to cluster 2 ( Figure 5), had a different metagene composition as depicted by the colored boxes under cell line name (Figure 8).
SK-MM-2, a B lymphocyte plasmacytoma cell line, was mostly composed of metagene 3 and 1, with a little contribute given by metagene 2. From western blotting analysis (Figure 9) of this cell line, it was evident that all the proteins codified by the genes of metagenes 2 and 3 were expressed, except for CYP1A2. metagene 2. Concerning HSD17B4, ACAA1, and ALOX15B, since their encoding genes were not present in the top most 20 representing MG2, we expected not to find them. Surprisingly, western blotting experiments (Figure 9) revealed the expression of HSD17B4, ACAA1, and ALOX15B with values of 0.44, 0.27, and 0.48, respectively. Their presence is justified by the fact that their codified genes are actually present but with a percentage of expression less than 1% and therefore not represented in Figure 8.
The last but not least important cluster was represented from the cell line MM1S with a marked component of metagene 5 but a low contribution of metagenes 1, 2, and 5. This is the only HMCL used in which CYP1A2 was expressed and with a value comparable to the positive control. Similar expression values were found for the other proteins with the exception of GGT1 and ACAA1.  Concerning KMS-12-BM, a multiple myeloma cell line, it had an almost similar contribution from all metagenes with the exception of metagene 5. In this case, PGES2 was the major expressed protein while for COX-1, CBR1, and CYP1A2, no expression was detected (Figure 9). NCI-H929, another multiple myeloma cell line, belongs to cluster 3 with a marked expression of genes of metagene 4, and in part of metagenes 1, 3, and 5. Also, in this cell line, the expression of PGES2 was high (439) with respect to the positive control. CBR1 and CYP1A2 were not found in U266-B1, while a good expression of ALOX15B, HSD417B, ACCA1, and COX-1 was observed. In this cell line, a good COX-1 expression was observed if considering that the reference cell line was stimulated to express COX-1.
RPMI-8266 belonging to the cluster 4 was composed of the expression of genes belonging to metagene 2. Concerning HSD17B4, ACAA1, and ALOX15B, since their encoding genes were not present in the top most 20 representing MG2, we expected not to find them. Surprisingly, western blotting experiments (Figure 9) revealed the expression of HSD17B4, ACAA1, and ALOX15B with values of 0.44, 0.27, and 0.48, respectively.
Their presence is justified by the fact that their codified genes are actually present but with a percentage of expression less than 1% and therefore not represented in Figure 8.
The last but not least important cluster was represented from the cell line MM1S with a marked component of metagene 5 but a low contribution of metagenes 1, 2, and 5. This is the only HMCL used in which CYP1A2 was expressed and with a value comparable to the positive control. Similar expression values were found for the other proteins with the exception of GGT1 and ACAA1.

Conclusions
Herein, a case study showing the use of NMF as a tool to intelligently analyze microarray data was proposed. Particularly the aim of the experiments was to study the genes involved in arachidonic acid metabolism, in order to detect gene patterns and relationships among the HMCLs that could be related to the different gene expression profiles of multiple myeloma. For this purpose, some numerical experiments on microarray gene expression real data belonging to 40 human multiple myeloma cell lines have been accomplished, selecting a subset of genes related to arachidonic acid metabolism, and applying the NMF algorithm to the so obtained genes-HMCLs matrix. The experiments showed the effectiveness of the NMF in intelligently analyzing microarray data to support the domain experts' activities. Through NMF we were able to select, using the genes involved in the metabolism of arachidonic acid, a certain number of multiple myeloma cell lines. This characterization has a high value from a biological point of view because it allows performing experiments using the most correct cell line. For example, the information obtained from this study led to the selection of NCIH929 and RPMI-8226 as the two most appropriate cell lines to investigate structure-inhibitory activity relationships of a set of novel compounds and the cyclooxygenase (COX) catalytic activity. Moreover, NMF results have been verified by western blotting analysis in six HMCLs of proteins expressed by some of the most abundant expressed genes. Our data confirm the correctness of NMF data outcomes for the six HMCLs. Further experiments are ongoing to increase the number of HMCLs and to be correlated to patients' disease profiles, in order to verify the presence of patterns that link different stages and grades of multiple myeloma, and the responses to the provided drug therapies.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Metagene expression profiles through the HMCLs are reported in Figures A1-A5.