Article

Evaluating the Performance of the Generalized Linear Model (glm) R Package Using Single-Cell RNA-Sequencing Data

Omar Alaqeeli and Raad Alturki
1 Department of Computer Science, Saudi Electronic University, Riyadh 11673, Saudi Arabia
2 Department of Computer Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432, Saudi Arabia
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2023, 13(20), 11512; https://doi.org/10.3390/app132011512
Submission received: 9 September 2023 / Revised: 1 October 2023 / Accepted: 18 October 2023 / Published: 20 October 2023

Abstract
The glm R package is commonly used for generalized linear modeling. In this paper, we evaluate the ability of the glm package to predict binomial outcomes using logistic regression. We use single-cell RNA-sequencing datasets which, after a series of normalization steps, are fitted to glm models repeatedly using 10-fold cross-validation over 100 iterations. Our evaluation criteria are glm's Precision, Recall, F1-Score, Area Under the Curve (AUC), and Runtime. Scores for each evaluation category are collected, and their medians are calculated. Our findings show that glm has fluctuating Precision and F1-Scores. In terms of Recall, glm shows more stable performance, while in the AUC category, it shows remarkable performance. The Runtime of glm is also consistent. Our findings also show that there is no correlation between the size of the fitted data and glm's Precision, Recall, F1-Score, or AUC; only Runtime is affected by data size.

1. Introduction

Supervised learning in the R programming language is facilitated by built-in packages dedicated to processing data and building different types of models. Three types of logistic regression commonly used to build models in R are binary (binomial), multinomial, and ordinal regression. Binomial regression produces predictions in the form of two possible outcomes, such as true or false, while multinomial regression has three or more outcome categories. Ordinal regression, on the other hand, deals with three or more ordered values [1]. The GLM (generalized linear model) is a statistical model that extends the linear regression model to accommodate a wide range of response types and error distributions. The main idea of a GLM is to use a link function to relate the linear predictor to the response variable. GLM supports many "families" (error distributions, each with an associated link function) that enable different types of regression, such as linear regression, logistic regression, and Poisson regression [2,3]. The evaluation of the glm R package remains inconclusive in terms of its performance, specifically its Precision, Recall, F1-Score, Area Under the Curve (AUC), and Runtime [4,5,6,7]. Existing studies evaluate generalized linear models in general but not the R package itself [8,9]. Also, evaluating the performance of the glm R package using single-cell RNA-sequencing data helps us understand its applicability, strengths, and limitations in this very specific domain. In this paper, we choose to evaluate the performance of the glm package, since it is the most common R package for creating linear models, and we use the "binomial" family (logistic regression) because our data can only be classified into two outcomes: zeroes and ones.
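As an illustration of the family mechanism described above, the following minimal R sketch shows how the same glm() call switches between regression types. The data frame df, the response y, and the predictor x are hypothetical placeholders, not the paper's datasets.

```r
# Hypothetical data frame `df` with response `y` and predictor `x`.
glm(y ~ x, data = df, family = gaussian)   # ordinary linear regression
glm(y ~ x, data = df, family = binomial)   # logistic regression for 0/1 outcomes
glm(y ~ x, data = df, family = poisson)    # Poisson regression for count outcomes
```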
Precision [10,11] is defined as the number of correctly predicted positive values (TP) divided by the sum of the falsely predicted positive values (FP) and the correctly predicted positive values. That is: $\text{Precision} = \frac{TP}{FP + TP}$.
Recall [11,12] is defined as the number of correctly predicted positive values (TP) divided by the sum of the positive values falsely classified as negative (FN) and the correctly predicted positive values. That is: $\text{Recall} = \frac{TP}{FN + TP}$.
The F1-Score [13,14,15], consequently, is derived as $\text{F1-Score} = 2 \times \left( \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \right)$.
AUC [12,16] measures the ability of a model to distinguish between positive and negative signals; models with a higher AUC have better predictive power. AUC is calculated directly from the model's ROC (receiver operating characteristic) curve, which plots the true positive rate (y-axis) against the false positive rate (x-axis) at different classification thresholds.
When testing a dataset, similarities in input values often cause misclassification errors. Thus, we choose to run a cross-validation test where the dataset is randomly split into folds (usually 10). In each testing iteration, one fold is set aside as the testing set and the remaining folds form the training set. In the next iteration, the adjacent fold becomes the testing set and the remaining folds form the training set, and so on.
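The sketch below illustrates one way such a 10-fold split can be set up in R; the matrix X, the seed, and the fold bookkeeping are illustrative placeholders, not the exact code used in our scripts.

```r
# Assign each shuffled sample (row of a hypothetical matrix `X`) to one of 10 folds.
set.seed(1)                                  # illustrative seed
idx   <- sample(nrow(X))                     # shuffled row order
folds <- cut(seq_along(idx), breaks = 10, labels = FALSE)

for (k in 1:10) {
  test_rows  <- idx[folds == k]              # ~10% of the samples
  train_rows <- idx[folds != k]              # remaining ~90%
  # fit on X[train_rows, ] and evaluate on X[test_rows, ]
}
```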
A correlation between two variables means that changes in one variable are associated with changes in the other. The Pearson correlation coefficient, obtained by dividing the covariance of the two variables by the product of their standard deviations, $r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$, is commonly used for this purpose. A coefficient close to +1 or −1 indicates a strong positive or negative linear correlation, while a value near 0 indicates little or no linear relationship.
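In R, this coefficient is available directly through cor(); the sketch below uses two hypothetical numeric vectors (for example, dataset sizes and median scores) rather than our actual measurements.

```r
# Pearson correlation between two hypothetical numeric vectors.
r <- cor(dataset_size, median_score, method = "pearson")
```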
Single-cell RNA-sequencing (scRNA-seq) is a technology used to profile gene expression at the level of the individual cell. It provides a better view of cellular heterogeneity within a complex biological system by measuring the RNA transcripts of individual cells. Conventional RNA sequencing is normally performed at the sample level, for example on a blood sample, whereas single-cell RNA-sequencing is performed at the level of individual cells [17,18,19,20,21,22,23]. The level of gene expression provides information that can be used to study the causes of tumors and other diseases, such as lung fibrosis [24,25,26,27,28,29,30,31]. To deal with the excessive data obtained from sequencing, supervised learning techniques are applied to cell annotation to give meaning to these data [32,33,34,35,36,37,38,39,40]. In our experiment, we use 22 scRNA-seq datasets from published studies [41] to evaluate the performance of the glm R package. These datasets are made available via the Conquer repository [42].
This paper is organized as follows. We will first describe the datasets used, then explain our procedures for evaluation and the analysis results of our measurement outcomes. Finally, we summarize our findings and derive our conclusion.

2. Materials and Design of Experiment

The evaluation of glm’s performance is not a straightforward process. Datasets must go through a normalization process to fit into the glm package. Memory overload, in some cases, can limit testing of additional datasets. Nevertheless, we have managed to utilize 22 datasets out of 40 from the targeted repository.

2.1. Datasets

The datasets used in this paper were extracted from the Conquer (consistent quantification of external RNA-sequencing data) repository, which was developed by C. Soneson and M. Robinson at the University of Zurich, Switzerland [42]. Data from three organisms are included in the repository: Homo sapiens, Mus musculus, and Danio rerio. Each dataset includes a different number of cells from one organism. The sequencing protocols used to extract data from these cells were SMARTer C1, Smart-Seq2, SMART-Seq, Tang, Fluidigm C1 Auto prep, and SMARTer. The data are split into two categories: gene-level and transcript-level. At the gene level, there are four different types of measurements: TPM (transcripts per million abundance estimates for each gene), count (gene read counts), count_lstpm (length-scaled TPMs), and avetxlength (the average length of the transcripts expressed in each sample for each gene). In the transcript-level category, there are three different types of measurements: TPM (transcripts per million abundance estimates for each transcript), count (transcript read counts), and efflength (effective transcript lengths).
We explored all datasets available in the Conquer repository. Unfortunately, not all datasets could be used, or even modified, to fit into the glm R package, for various reasons. For a dataset to fit our testing methodology, we had to verify that its samples could be separated into exactly two groups of common phenotypes, so that each phenotype could be assigned either 1 or 0 within the same dataset. For example, the dataset GSE80032 was found to be unsuitable for our test because its samples include only one phenotype; using binomial regression, there would be only one class for all samples to fit in. The size of the dataset is also a significant factor in our test: a group of samples must not be so small that it generates misclassification or execution errors. Following these restrictions, and based on our observations during dataset preparation, the ideal sample group must have at least 30 samples to guarantee that no misclassification or errors will be generated. In the end, 22 datasets fit our test. Table 1 lists the selected datasets as they appear in the Conquer repository, with additional information.
Accessing information in these datasets is not a straightforward process. We needed full control over these datasets in order to manipulate them. Thus, before proceeding to our test, each dataset had to be normalized so that it could fit into our method. In exploring the dataset categories, we found that some measurements (except for avetxlength) contained a large number of 0 values, which can jeopardize the integrity of the evaluation process. Thus, we chose avetxlength as our input data.

2.2. Evaluation Procedure

To access the avetxlength data in each dataset, we closely followed the steps provided by the Conquer repository authors (using experiments()[["gene"]] and assays()[["avetxlength"]]). We downloaded the .rds file for each dataset and, using the R programming language, retrieved the ExperimentList instance that contains a RangedSummarizedExperiment for the gene-level object, which allowed us to access all available abundances: TPM (transcripts per million), gene count, length-scaled TPMs, and avetxlength. In our test, we chose the genes' avetxlength because it fits our model well. The avetxlength data at this stage constitute a matrix, $X_{i,j}$, where i indexes genes and j indexes samples.
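The following R sketch outlines this retrieval step. It assumes that the downloaded .rds file holds a MultiAssayExperiment object with a gene-level experiment and an assay named "avetxlength", as documented by the Conquer repository; the file name and object names are illustrative.

```r
library(MultiAssayExperiment)
library(SummarizedExperiment)

mae  <- readRDS("EMTAB2805.rds")          # one downloaded Conquer dataset
gene <- experiments(mae)[["gene"]]        # gene-level RangedSummarizedExperiment
X    <- assays(gene)[["avetxlength"]]     # genes x samples matrix of avetxlength
```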
To be able to use our binomial model, we individually explored each dataset to find two distinct phenotypic characteristics associated with the dataset's sample groups that could be used to label our outputs. We denote the first characteristic as 1 and the other as 0; thus, sample IDs are replaced by either 1 or 0. For example, in the EMTAB2805 dataset, different phenotypes are associated with each group of samples. We chose only two phenotypes based on stages of the cell cycle: G1 and G2M. As a result, the IDs of the samples associated with the first stage, G1, were replaced by 1 (True), and the IDs of the samples associated with the second stage, G2M, were replaced by 0 (False). We eliminated any extra samples associated with other cell cycle stages. In some datasets, such as GSE80032, there is only one phenotype; thus, classification is impossible. Table 2 shows the current status of the $X_{i,j}$ matrix.
We now rotate the dimensions of $X_{i,j}$ so that rows represent samples and columns represent genes, giving $X_{j,i}$ (Table 3). We then substitute each sample's ID with either 1 or 0, according to its classification in the original dataset (Table 4).
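In R, this rotation is simply a transpose, and the relabeling can be done through the row names. The sketch below assumes X is the genes-by-samples matrix from the previous step and y is the 0/1 label vector derived from the chosen phenotypes; both are placeholders.

```r
X <- t(X)          # now samples (rows) x genes (columns)
rownames(X) <- y   # replace sample IDs with their 0/1 phenotype labels
```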
In Table 5, we list all selected datasets after we identified the appropriate phenotypes (column 3) that were used in our test.
To increase the integrity of our evaluation, we conducted a Wilcoxon test on our matrix to find the p-value associated with each gene.
At the time of testing, we ran into memory exhaustion errors caused by the large number of genes that the matrix retained. Thus, we trimmed our matrix to include only the 1000 genes with the lowest p-values.
The final form of our input matrix is $X_{j,i}$, where j ranges from 1 to n (the total number of samples, each labeled 1 or 0) and i ranges from 1 to 1000 (the genes with the lowest p-values).
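A compact way to express this filtering step in R is sketched below; X is the samples-by-genes matrix and y the 0/1 labels of its rows, both placeholders for the objects built above.

```r
# Wilcoxon rank-sum p-value for each gene, comparing the two label groups.
pvals <- apply(X, 2, function(g) wilcox.test(g[y == 1], g[y == 0])$p.value)

# Keep only the 1000 genes with the lowest p-values.
keep <- order(pvals)[1:1000]
X    <- X[, keep]
```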
At this point, our matrix is completely normalized; thus, we can proceed to testing the glm R package over 100 iterations.
At the start of each testing iteration, we shuffle the rows of our matrix $X_{j,i}$ (the samples denoted by 1 and 0) to prevent any ordering bias that may arise from samples with the same label being grouped together. Next, we conduct a 10-fold cross-validation test in which we take 10% of the samples (rows) as our testing set and the remaining 90% as our training set. We record the system time (our execution start time), fit the training set with the glm package using its "binomial" family, and immediately afterwards record the system time again (our execution end time). By subtracting the start time from the end time, we obtain the glm runtime for that specific iteration.
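The timing and fitting step can be sketched as follows; train is a placeholder data frame holding one fold's training samples, with a 0/1 column label and the 1000 gene columns.

```r
t_start <- Sys.time()
fit     <- glm(label ~ ., data = train, family = binomial)  # logistic regression fit
t_end   <- Sys.time()

# Runtime of the fit, in seconds.
runtime <- as.numeric(difftime(t_end, t_start, units = "secs"))
```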
The model obtained from fitting glm is then used to predict the outcomes of the testing set. From these predictions, we derive the confusion matrix by comparing the predicted values with the actual values of the testing set. The confusion matrix is a 2 × 2 matrix; in cases where a fold yields only one predicted or actual class, the table collapses to a single dimension, so we programmatically force the creation of a 2 × 2 matrix so that the actual and predicted counts can still be read.
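Continuing the sketch above, one way to obtain predictions and force the 2 × 2 shape (our scripts may differ in detail) is to fix both factor levels before tabulating; test and actual are placeholders for the current fold's testing data frame and its true 0/1 labels.

```r
probs <- predict(fit, newdata = test, type = "response")   # predicted probabilities
pred  <- ifelse(probs > 0.5, 1, 0)                         # predicted classes

# Fixing the levels guarantees a 2 x 2 table even if one class is absent.
cm <- table(Predicted = factor(pred,   levels = c(0, 1)),
            Actual    = factor(actual, levels = c(0, 1)))
```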
We then calculate the Precision by dividing the number of true positives by the sum of true positives and false positives, and the Recall by dividing the number of true positives by the sum of true positives and false negatives. Having both Precision and Recall, we calculate the F1-Score as two times the product of Precision and Recall divided by their sum.
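Continuing the sketch, the three scores follow directly from the 2 × 2 table cm above, treating class 1 as the positive class:

```r
TP <- cm["1", "1"]; FP <- cm["1", "0"]; FN <- cm["0", "1"]

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * (precision * recall) / (precision + recall)
```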
Calculating the AUC requires additional steps. We first compute the receiver operating characteristic (ROC) curve using the pROC R package and then, using our predictions, extract the AUC measurement from it.
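With the pROC package, this amounts to the following sketch, where actual and probs are the true labels and predicted probabilities from the fold above:

```r
library(pROC)

roc_obj <- roc(response = actual, predictor = probs)  # ROC curve for this fold
auc_val <- as.numeric(auc(roc_obj))                   # area under the curve
```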
In the second iteration of our cross-validation testing, we take the fold adjacent to the first fold (that is the next 10% of the matrix) as our testing set and the remaining 90% as our training set. We repeat the same procedure applied to glm and obtain measurements of Runtime, Precision, Recall, F1-Score, and AUC. At the end of this 10-fold cross-validation testing, we have 10 measurements for each evaluation category. We collect these 10 measurements and calculate their means as our final score.
The previous testing procedure (starting from shuffling samples to collecting mean measurements of Runtime, Precision, Recall, F1-Score, and AUC) is repeated 100 times. Each time we collect the mean of all scores and plot our final results for visual analysis.

2.3. Code Repository

The evaluation code was implemented in the R programming language. The scripts are deposited at https://github.com/Omar-Alaqeeli/glm (accessed on 7 May 2023). There are 22 R scripts; each script retrieves and processes one dataset and fits it with the glm() function. Only the first script, titled EMTAB2805, contains the required package-installation lines. The datasets themselves cannot be deposited in the same GitHub repository due to their excessive size, but they can be accessed at the Conquer repository link: http://imlspenticton.uzh.ch:3838/conquer/ (accessed on 7 May 2023). All code was run on a personal computer using R version 3.6.1 (in RStudio). Table 6 shows the full specifications of the system and the R installation.

3. Results and Analysis

In this section, we present the results obtained from running Algorithm 1 on the chosen datasets using the glm package. Our analysis is based on the median and interquartile range for each dataset; outliers are excluded. For visualization, we chose boxplots because they include all of the aforementioned measurements.
Algorithm 1 The complete algorithm for testing glm using Conquer datasets.
1:  procedure glm_test(X_{i,j}: a two-dimensional matrix containing avetxlength values; the 1st dimension (i) indexes genes and the 2nd dimension (j) indexes samples)
2:      Rotate X_{i,j} so that it becomes X_{j,i}
3:      Separate samples into two groups based on the selected phenotypes.
4:      In 1st group: replace Sample_id ← 1
5:      In 2nd group: replace Sample_id ← 0
6:      for n ∈ {1 … i} do                              ▹ Loop over genes
7:          Run Wilcoxon test on n
8:          Extract and store p-value
9:      end for
10:     Shorten X_{j,i} to include only the 1000 genes with the lowest p-values.
11:     for m ∈ {1 … 100} do                            ▹ General loop
12:         Shuffle rows in X_{j,i}
13:         for k ∈ {1 … 10} do                         ▹ Loop over folds
14:             Set 10% of X_{j,i} as Testing_set
15:             Set 90% of X_{j,i} as Training_set
16:             Measure current time_start
17:             Fit Training_set into glm()
18:             Measure current time_end
19:             Predict model on Testing_set
20:             Calculate the method's Runtime using time_end − time_start
21:             Calculate the method's Precision using the confusion matrix
22:             Calculate the method's Recall using the confusion matrix
23:             Calculate the method's F1-Score using the method's Precision and Recall
24:             Calculate the method's AUC using the ROC function
25:             Repeat steps 14–24 using the adjacent fold
26:         end for
27:         Calculate the mean of all scores collected
28:     end for
29:     Plot final scores

3.1. Precision

The results in Figure 1 show that Precision varies across datasets. The highest median value recorded was 0.94, when using GSE71585-GPL17021; interestingly, this result also has the smallest interquartile range. On the other hand, the lowest median value recorded was 0.40, when using the GSE102299-smartseq2 dataset. The majority of the Precision medians (15 out of 22) fall within the range of 0.48 to 0.66, a spread of roughly 0.2. Table 7 shows the Precision medians, each associated with its corresponding dataset.

3.2. Recall

The results of Recall depicted in Figure 2 highlight that the highest median value recorded is 0.75 when using the GSE71585-GPL17021 dataset. Notably, this result exhibits the smallest interquartile range among all datasets. On the other hand, the lowest medians, 0.50, are recorded when using GSE102299-smartseq2, GSE48968-GPL17021-125bp, and GSE77847. Nearly all median values presented in Table 7 range from 0.5 to 0.59. Only two datasets, GSE66053-GPL18573 and GSE71585-GPL17021, demonstrate medians that exceed 0.59, which may be due to the disparity between the number of 0s and 1s in these datasets.
These results relate to glm's ability to control type II errors (false negatives): the higher the Recall, the fewer positive samples the model misses.

3.3. F1-Score

The F1-Score results depicted in Figure 3 indicate that the dataset GSE71585-GPL17021 achieved the highest median, with a value of 0.83. On the other hand, the GSE102299-smartseq2 dataset exhibits the smallest interquartile range and the lowest median value of 0.44. Among all F1-Scores in Table 7, 13 F1-Score values fall within the range of 0.50 to 0.55. The remaining F1-Scores are close to this range, with the exception of GSE66053-GPL18573, which has a median of 0.72.

3.4. Area under the Curve

The AUC median values shown in Figure 4 provide insight into glm's prediction performance. The highest median is 0.76, for the SRP073808 dataset, and the lowest is 0.62, for the GSE60749-GPL13112 dataset. Out of 22 median scores, 19 fall within the range of 0.72 to 0.76, which indicates consistent performance. These AUC scores also exhibit interquartile ranges of roughly the same size. The AUC that deviates most notably from this range is that of the GSE60749-GPL13112 dataset.

3.5. Runtime

The Runtime analysis (in seconds) depicted in Figure 5 shows that the highest median value is 29.58 and the lowest is 0.53. The majority of the median scores (18 datasets) fall within the range of 0.53 to 5.35. However, there are four datasets (GSE57872, GSE66053-GPL18573, GSE71585-GPL17021, and GSE48968-GPL17021-125bp) that demonstrate significantly higher runtime values of 11.72, 16.13, 19.87, and 29.58, respectively. It is unclear why these datasets took a longer time, as there are no correlations found between the size of these datasets and Runtime.

4. Discussion and Conclusions

Based on our analysis of the results presented in the previous section, we can deduce the following for each evaluation category. Regarding glm's Precision, there is a noticeable fluctuation in the results, as can be seen from the median value obtained when fitting the GSE71585-GPL17021 dataset. Nevertheless, that dataset has the smallest interquartile range, indicating the most consistent scores across folds and iterations. Other interquartile ranges show some similarities in size, but their medians vary. Thus, when it comes to Precision, glm shows modest performance.
When considering the Recall results, glm shows more stable performance. The majority of median scores are concentrated within a narrow range of 0.09, which highlights the stability of glm's Recall. The high Recall of GSE71585-GPL17021 is expected, as it aligns with its high Precision score mentioned earlier. While the large size of GSE71585-GPL17021 could potentially contribute to this high score, it is important to note that GSE57872, which is larger than GSE71585-GPL17021, does not exhibit the same pattern. Therefore, the high Recall score of GSE71585-GPL17021 cannot be attributed solely to its size.
The F1-Scores, on the other hand, show fluctuating performance. This is largely connected to glm's Precision, because the F1-Score is derived from it (together with Recall). This is especially obvious when comparing the Precision and F1-Score figures; additionally, the median values of Precision and the F1-Scores in Table 7 are very similar. Therefore, glm also shows modest performance when it comes to the F1-Score.
In the AUC category, glm shows remarkable performance. The AUC medians across the vast majority of tests are almost the same, and their interquartile ranges have the same length. This indicates that glm has a stable performance regardless of the size of the input data.
The analysis of the Runtime results (in seconds) indicates that the glm package demonstrates consistent performance, with most runtimes fluctuating within a narrow range. However, four datasets deviate significantly from this range. The sizes of GSE57872 and GSE71585-GPL17021 may contribute to their divergent runtimes, but this explanation does not hold for GSE66053-GPL18573, which is much smaller yet also exhibits a high runtime. Hence, further investigation is required to understand the reasons behind these outliers.
It is intuitive to suspect that the size of the fitted data may influence the performance of glm. In our datasets, although the number of columns is consistent, 1000, the number of rows varies significantly. To visually investigate the influence of the size of the data in each evaluation category, we plot the relation between size and collected median scores. Figure 6 shows that there are no correlations between the size of the fitted data and glm’s Precision, Recall, F1-Score, and AUC, except for a minimal effect on Runtime.
In further research, we plan to investigate the performance of glm using different "families" (error distributions and link functions) for different types of regression, including the Gaussian, gamma, Poisson, and quasi families. This investigation will involve datasets that require regression models with more than two labels or classes. By expanding our analysis to these regression types, we will gain a more comprehensive understanding of glm's performance across diverse scenarios.

Author Contributions

Conceptualization, O.A. and R.A.; methodology, O.A.; software, O.A.; validation, R.A.; formal analysis, O.A.; writing—original draft preparation, O.A.; writing—review and editing, R.A.; funding acquisition, R.A. All authors have read and agreed to the published version of the manuscript.

Funding

The authors extend their appreciation to the Deanship of Scientific Research, Imam Mohammad Ibn Saud Islamic University (IMSIU), Saudi Arabia, for funding this research work through Grant No. (221409004).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Evaluation codes are available at https://github.com/Omar-Alaqeeli/glm (accessed on 11 May 2023). The Conquer repository can be found at http://imlspenticton.uzh.ch:3838/conquer/ (accessed on 11 May 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUC        Area Under the Curve
scRNA-seq  Single-cell Ribonucleic Acid Sequencing
ROC        Receiver Operating Characteristic

References

  1. Cucchiara, A. Applied Logistic Regression. Technometrics 1992, 34, 358–359. [Google Scholar] [CrossRef]
  2. Dunn, P.K.; Smyth, G.K. Generalized Linear Models with Examples in R; Springer: Berlin/Heidelberg, Germany, 2018; Volume 53. [Google Scholar]
  3. Rutherford, A. ANOVA and ANCOVA: A GLM Approach; John Wiley & Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
  4. Guisan, A.; Weiss, S.B.; Weiss, A.D. GLM versus CCA spatial modeling of plant species distribution. Plant Ecol. 1999, 143, 107–122. [Google Scholar] [CrossRef]
  5. Stefánsson, G. Analysis of groundfish survey abundance data: Combining the GLM and delta approaches. ICES J. Mar. Sci. 1996, 53, 577–588. [Google Scholar] [CrossRef]
  6. Pepe, M.S. An interpretation for the ROC curve and inference using GLM procedures. Biometrics 2000, 56, 352–359. [Google Scholar] [CrossRef]
  7. Tran, M.N.; Nguyen, N.; Nott, D.; Kohn, R. Bayesian deep net GLM and GLMM. J. Comput. Graph. Stat. 2020, 29, 97–113. [Google Scholar] [CrossRef]
  8. Potts, S.E.; Rose, K.A. Evaluation of GLM and GAM for estimating population indices from fishery independent surveys. Fish. Res. 2018, 208, 167–178. [Google Scholar] [CrossRef]
  9. Calcagno, V.; de Mazancourt, C. glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models. J. Stat. Softw. 2010, 34, 1–29. [Google Scholar] [CrossRef]
  10. Bi, J.; Kuesten, C. Type I error, testing power, and predicting precision based on the GLM and LM models for CATA data–Further discussion with M. Meyners and A. Hasted. Food Qual. Prefer. 2023, 106, 104806. [Google Scholar] [CrossRef]
  11. Xiong, Y. Building text hierarchical structure by using confusion matrix. In Proceedings of the 2012 5th International Conference on BioMedical Engineering and Informatics, Chongqing, China, 16–18 October 2012; pp. 1250–1254. [Google Scholar]
  12. Davis, J.; Goadrich, M. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2006; pp. 233–240. [Google Scholar]
  13. Caelen, O. A Bayesian interpretation of the confusion matrix. Ann. Math. Artif. Intell. 2017, 81, 429–450. [Google Scholar] [CrossRef]
  14. Zhang, D.; Wang, J.; Zhao, X. Estimating the uncertainty of average F1 scores. In Proceedings of the 2015 International Conference on the Theory of Information Retrieval, Northampton, MA, USA, 27–30 September 2015; pp. 317–320. [Google Scholar]
  15. Zhang, D.; Wang, J.; Zhao, X.; Wang, X. A Bayesian hierarchical model for comparing average F1 scores. In Proceedings of the 2015 IEEE International Conference on Data Mining, Atlantic City, NJ, USA, 14–17 November 2015; pp. 589–598. [Google Scholar]
  16. Myerson, J.; Green, L.; Warusawitharana, M. Area under the curve as a measure of discounting. J. Exp. Anal. Behav. 2001, 76, 235–243. [Google Scholar] [CrossRef]
  17. Habermann, A.C.; Gutierrez, A.J.; Bui, L.T.; Yahn, S.L.; Winters, N.I.; Calvi, C.L.; Peter, L.; Chung, M.I.; Taylor, C.J.; Jetter, C.; et al. Single-cell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis. Sci. Adv. 2020, 6, eaba1972. [Google Scholar] [CrossRef] [PubMed]
  18. Bauer, S.; Nolte, L.; Reyes, M. Segmentation of brain tumor images based on atlas-registration combined with a Markov-Random-Field lesion growth model. In Proceedings of the 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Chicago, IL, USA, 30 March–2 April 2011; pp. 2018–2021. [Google Scholar] [CrossRef]
  19. Pliner, H.A.; Shendure, J.; Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 2019, 16, 983–986. [Google Scholar] [CrossRef] [PubMed]
  20. Seyednasrollah, F.; Laiho, A.; Elo, L.L. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief. Bioinform. 2013, 16, 59–70. [Google Scholar] [CrossRef] [PubMed]
  21. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; The Wadsworth & Brooks/Cole Statistics/Probability Series; Wadsworth & Brooks/Cole Advanced Books & Software: Monterey, CA, USA, 1984. [Google Scholar]
  22. Grubinger, T.; Zeileis, A.; Pfeiffer, K.P. evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R. J. Stat. Softw. Artic. 2014, 61, 1–29. [Google Scholar] [CrossRef]
  23. Hothorn, T.; Hornik, K.; Zeileis, A. Unbiased Recursive Partitioning: A Conditional Inference Framework. J. Comput. Graph. Stat. 2006, 15, 651–674. [Google Scholar] [CrossRef]
  24. Qian, J.; Olbrecht, S.; Boeckx, B.; Vos, H.; Laoui, D.; Etlioglu, E.; Wauters, E.; Pomella, V.; Verbandt, S.; Busschaert, P.; et al. A pan-cancer blueprint of the heterogeneous tumor microenvironment revealed by single-cell profiling. Cell Res. 2020, 30, 745–762. [Google Scholar] [CrossRef]
  25. Zhou, Y.; Yang, D.; Yang, Q.; Lv, X.; Huang, W.; Zhou, Z.; Wang, Y.; Zhang, Z.; Yuan, T.; Ding, X.; et al. Single-cell RNA landscape of intratumoral heterogeneity and immunosuppressive microenvironment in advanced osteosarcoma. Nat. Commun. 2020, 11, 6322. [Google Scholar] [CrossRef]
  26. Adams, T.S.; Schupp, J.C.; Poli, S.; Ayaub, E.A.; Neumark, N.; Ahangari, F.; Chu, S.G.; Raby, B.A.; DeIuliis, G.; Januszyk, M.; et al. Single-cell RNA-seq reveals ectopic and aberrant lung-resident cell populations in idiopathic pulmonary fibrosis. Sci. Adv. 2020, 6, eaba1983. [Google Scholar]
  27. Nawy, T. Single-cell sequencing. Nat. Methods 2014, 11, 18. [Google Scholar] [CrossRef]
  28. Gawad, C.; Koh, W.; Quake, S.R. Single-cell genome sequencing: Current state of the science. Nat. Rev. Genet. 2016, 17, 175–188. [Google Scholar] [CrossRef]
  29. Metzker, M.L. Sequencing technologies—The next generation. Nat. Rev. Genet. 2010, 11, 31–46. [Google Scholar] [CrossRef] [PubMed]
  30. Jaakkola, M.K.; Seyednasrollah, F.; Mehmood, A.; Elo, L.L. Comparison of methods to detect differentially expressed genes between single-cell populations. Brief. Bioinform. 2016, 18, 735–743. [Google Scholar] [CrossRef] [PubMed]
  31. Wang, T.; Li, B.; Nelson, C.E.; Nabavi, S. Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinform. 2019, 20, 40. [Google Scholar] [CrossRef] [PubMed]
  32. Hafemeister, C.; Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 2019, 20, 296. [Google Scholar] [CrossRef]
  33. Krzak, M.; Raykov, Y.; Boukouvalas, A.; Cutillo, L.; Angelini, C. Benchmark and Parameter Sensitivity Analysis of Single-Cell RNA Sequencing Clustering Methods. Front. Genet. 2019, 10, 1253. [Google Scholar] [CrossRef]
  34. Darmanis, S.; Sloan, S.A.; Zhang, Y.; Enge, M.; Caneda, C.; Shuer, L.M.; Hayden Gephart, M.G.; Barres, B.A.; Quake, S.R. A survey of human brain transcriptome diversity at the single cell level. Proc. Natl. Acad. Sci. USA 2015, 112, 7285–7290. [Google Scholar] [CrossRef]
  35. Seyednasrollah, F.; Rantanen, K.; Jaakkola, P.; Elo, L.L. ROTS: Reproducible RNA-seq biomarker detector—Prognostic markers for clear cell renal cell cancer. Nucleic Acids Res. 2015, 44, e1. [Google Scholar] [CrossRef]
  36. Elo, L.L.; Filen, S.; Lahesmaa, R.; Aittokallio, T. Reproducibility-Optimized Test Statistic for Ranking Genes in Microarray Studies. IEEE/ACM Trans. Comput. Biol. Bioinform. 2008, 5, 423–431. [Google Scholar] [CrossRef]
  37. Anders, S.; Pyl, P.T.; Huber, W. HTSeq—A Python framework to work with high-throughput sequencing data. Bioinformatics 2014, 31, 166–169. [Google Scholar] [CrossRef]
  38. Kowalczyk, M.S.; Tirosh, I.; Heckl, D.; Rao, T.N.; Dixit, A.; Haas, B.J.; Schneider, R.K.; Wagers, A.J.; Ebert, B.L.; Regev, A. Single-cell RNA-seq reveals changes in cell cycle and differentiation programs upon aging of hematopoietic stem cells. Genome Res. 2015, 25, 1860–1872. [Google Scholar] [CrossRef]
  39. Law, C.W.; Chen, Y.; Shi, W.; Smyth, G.K. voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014, 15, R29. [Google Scholar] [CrossRef] [PubMed]
  40. McCarthy, D.J.; Chen, Y.; Smyth, G.K. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012, 40, 4288–4297. [Google Scholar] [CrossRef] [PubMed]
  41. Alaqeeli, O.; Xing, L.; Zhang, X. Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data. Microbiol. Res. 2021, 12, 20022. [Google Scholar] [CrossRef]
  42. Soneson, C.; Robinson, M.D. Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods 2018, 15, 255–261. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Precision scores (y-axis) for each dataset (x-axis) used with the glm package. The bullets and dashed lines represent the distribution of extreme values.
Figure 2. Recall scores (y-axis) for each dataset (x-axis) used with the glm package. The bullets and dashed lines represent the distribution of extreme values.
Figure 3. F1-Scores (y-axis) for each dataset (x-axis) used with the glm package. The bullets and dashed lines represent the distribution of extreme values.
Figure 4. AUC scores (y-axis) for each dataset (x-axis) used with the glm package. The bullets and dashed lines represent the distribution of extreme values.
Figure 5. Runtime scores (y-axis) for each dataset (x-axis) used with the glm package. Dashed lines represent the distribution of extreme values.
Figure 6. Evaluation scores for all measurements versus dataset sizes.
Table 1. List of selected datasets with their IDs, organisms from which cells were taken, a brief description, cell counts, and sequencing protocols (source: http://imlspenticton.uzh.ch:3838/conquer/ (accessed on 7 May 2023)), # means number of cells.
Dataset | ID | Organism | Brief Description | # of Cells | Protocol
EMTAB2805 | Buettner2015 | Mus musculus | mESC in different cell cycle stages | 288 | SMARTer C1
GSE100911 | Tang2017 | Danio rerio | hematopoietic and renal cell heterogeneity | 245 | Smart-Seq2
GSE102299-smartseq2 | Wallrapp2017 | Mus musculus | innate lymphoid cells from mouse lungs after various treatments | 752 | Smart-Seq2
GSE45719 | Deng2014 | Mus musculus | development from zygote to blastocyst + adult liver | 291 | SMART-Seq
GSE48968-GPL13112 | Shalek2014 | Mus musculus | dendritic cells stimulated with pathogenic components | 1378 | SMARTer C1
GSE48968-GPL17021-125bp | Shalek2014 | Mus musculus | dendritic cells stimulated with pathogenic components | 935 | SMARTer C1
GSE48968-GPL17021-25bp | Shalek2014 | Mus musculus | dendritic cells stimulated with pathogenic components | 99 | SMARTer C1
GSE52529-GPL16791 | Trapnell2014 | Homo sapiens | primary myoblasts over a time course of serum-induced differentiation | 288 | SMARTer C1
GSE52583-GPL13112 | Treutlein2014 | Mus musculus | lung epithelial cells at different developmental stages | 171 | SMARTer C1
GSE57872 | Patel2014 | Homo sapiens | glioblastoma cells from tumors + gliomasphere cell lines | 864 | SMART-Seq
GSE60749-GPL13112 | Kumar2014 | Mus musculus | mESCs with various genetic perturbations, cultured in different media | 268 | SMARTer C1
GSE60749-GPL17021 | Kumar2014 | Mus musculus | mESCs with various genetic perturbations, cultured in different media | 147 | SMARTer C1
GSE63818-GPL16791 | Guo2015 | Homo sapiens | primordial germ cells from embryos at different times of gestation | 328 | Tang
GSE66053-GPL18573 | PadovanMerhar2015 | Homo sapiens | live and fixed single cells | 96 | Fluidigm C1 Auto prep
GSE71585-GPL13112 | Tasic2016 | Mus musculus | cell type identification in primary visual cortex | 1035 | SMARTer
GSE71585-GPL17021 | Tasic2016 | Mus musculus | cell type identification in primary visual cortex | 749 | SMARTer
GSE71982 | Burns2015 | Mus musculus | utricular and cochlear sensory epithelia of newborn mice | 313 | SMARTer C1
GSE77847 | Meyer2016 | Mus musculus | Dnmt3a loss of function in Flt3-ITD and Dnmt3a-mutant AML | 96 | SMARTer C1
GSE79102 | Kiselev2017 | Homo sapiens | different patients with myeloproliferative disease | 181 | Smart-Seq2
GSE81903 | Shekhar2016 | Mus musculus | P17 retinal cells from the Kcng4-cre;stop-YFP X Thy1-stop-YFP Line #1 mice | 384 | Smart-Seq2
SRP073808 | Koh2016 | Homo sapiens | in vitro cultured H7 embryonic stem cells (WiCell) and H7-derived downstream early mesoderm progenitors | 651 | SMARTer C1
GSE94383 | Lane2017 | Mus musculus | LPS stimulated and unstimulated 264.7 cells | 839 | Smart-Seq2
Table 2. The $X_{i,j}$ matrix containing avetxlength data from the EMTAB2805 dataset.
Gene | G1_cell1 | G1_cell10 | G1_cell11 | G1_cell12 | G1_cell13 | …
ENSMUSG00000000001.4 | 3046.7200 | 3026.160 | 3026.330 | 3026.347 | 3061.610 | …
ENSMUSG00000000003.15 | 466.0490 | 466.049 | 466.049 | 466.049 | 466.049 | …
ENSMUSG00000000028.14 | 1531.7200 | 1863.192 | 1789.432 | 1519.080 | 1546.610 | …
ENSMUSG00000000031.15 | 910.4161 | 1309.672 | 1309.672 | 1309.672 | 1309.672 | …
ENSMUSG00000000037.16 | 2715.7216 | 3241.122 | 3241.122 | 3241.122 | 3241.122 | …
… | … | … | … | … | … | …
Table 3. $X_{j,i}$ after rotation.
Sample | ENSMUSG …1.4 | ENSMUSG …3.15 | ENSMUSG …28.14 | ENSMUSG …31.15 | ENSMUSG …37.16 | …
G1_cell1 | 3046.720 | 466.049 | 1531.720 | 910.4161 | 2715.722 | …
G1_cell10 | 3026.160 | 466.049 | 1863.192 | 1309.6724 | 3241.122 | …
G1_cell11 | 3026.330 | 466.049 | 1789.432 | 1309.6724 | 3241.122 | …
G1_cell12 | 3026.347 | 466.049 | 1519.080 | 1309.6724 | 3241.122 | …
G1_cell13 | 3061.610 | 466.049 | 1546.610 | 1309.6724 | 3241.122 | …
… | … | … | … | … | … | …
Table 4. $X_{j,i}$ after substituting each sample's ID with either 1 or 0.
Label | ENSMUSG …1.4 | ENSMUSG …3.15 | ENSMUSG …28.14 | ENSMUSG …31.15 | ENSMUSG …37.16 | …
1 | 3046.720 | 466.049 | 1531.720 | 910.4161 | 2715.722 | …
1 | 3026.160 | 466.049 | 1863.192 | 1309.6724 | 3241.122 | …
1 | 3026.330 | 466.049 | 1789.432 | 1309.6724 | 3241.122 | …
1 | 3026.347 | 466.049 | 1519.080 | 1309.6724 | 3241.122 | …
1 | 3061.610 | 466.049 | 1546.610 | 1309.6724 | 3241.122 | …
… | … | … | … | … | … | …
0 | 3036.04 | 466.049 | 1521.040 | 1309.672 | 2801.727 | …
0 | 3020.82 | 466.049 | 1505.820 | 1309.672 | 3241.122 | …
0 | 3005.26 | 466.049 | 1886.260 | 1309.672 | 3691.805 | …
0 | 3063.34 | 466.049 | 1918.436 | 1309.672 | 4644.531 | …
0 | 3019.17 | 466.049 | 1900.170 | 1309.672 | 2856.473 | …
… | … | … | … | … | … | …
Table 5. List of datasets used in our evaluation along with their chosen phenotypes that were eventually set at either 1 or 0, # refers to the number of 1s and 0s in the matrix.
# | Dataset | Phenotype | # of 1s | # of 0s
1 | EMTAB2805 | Cell Stages (G1 & G2M) | 96 | 96
2 | GSE100911 | 16-cell stage blastomere & Mid blastocyst cell (92–94 h post-fertilization) | 43 | 38
3 | GSE102299-smartseq2 | treatment: IL25 & treatment: IL33 | 188 | 282
4 | GSE45719 | 16-cell stage blastomere & Mid blastocyst cell (92–94 h post-fertilization) | 50 | 60
5 | GSE48968-GPL13112 | BMDC (Unstimulated Replicate Experiment) & BMDC (1 h LPS Stimulation) | 96 | 95
6 | GSE48968-GPL17021-125bp | BMDC (On Chip 4 h LPS Stimulation) & BMDC (2 h IFN-B Stimulation) | 90 | 94
7 | GSE48968-GPL17021-25bp | LPS 4 h, GolgiPlug 1 h & stimulation: LPS 4 h, GolgiPlug 2 h | 46 | 53
8 | GSE52529-GPL16791 | hour post serum-switch: 0 & hour post serum-switch: 24 | 96 | 96
9 | GSE52583-GPL13112 | age: Embryonic day 18.5 & age: Embryonic day 14.5 | 80 | 45
10 | GSE57872 | cell type: Glioblastoma & cell type: Gliomasphere Cell Line | 672 | 192
11 | GSE60749-GPL13112 | culture conditions: serum + LIF & culture conditions: 2i + LIF | 174 | 94
12 | GSE60749-GPL17021 | serum + LIF & Selection in ITSFn, followed by expansion in N2 + bFGF/laminin | 93 | 54
13 | GSE63818-GPL16791 | male & female | 197 | 131
14 | GSE66053-GPL18573 | Cells were added to the Fluidigm C1 … & Fixed cells were added to the Fluidigm C1 … | 82 | 14
15 | GSE71585-GPL13112 | input material: single cell & tdtomato labeling: positive (partially) | 81 | 108
16 | GSE71585-GPL17021 | dissection: All & input material: single cell | 69 | 157
17 | GSE71982 | Vestibular epithelium & Cochlear epithelium | 160 | 153
18 | GSE77847 | sample type: cKit + Flt3ITD/ITD,Dnmt3afl/-MxCre AML-1 & sample type: cKit + Flt3ITD/ITD,Dnmt3afl/-MxCre AML-2 | 48 | 48
19 | GSE79102 | patient 1 scRNA-seq & patient 2 scRNA-seq | 85 | 96
20 | GSE81903 | retina ID: 1Ra & retina ID: 1la | 96 | 96
21 | GSE94383 | time point: 0 min & time point: 75 min | 186 | 145
22 | SRP073808 | H7hESC & H7_derived_APS | 77 | 64
Table 6. System and R Studio specification details at the time of running the evaluation scripts.
platform | x86_64-apple-darwin15.6.0
arch | x86_64
os | darwin15.6.0
system | x86_64, darwin15.6.0
status |
major | 3
minor | 6.1
year | 2019
month | 07
day | 05
svn rev | 76,782
language | R
version.string | R version 3.6.1 (5 July 2019)
nickname | Action of the Toes
Table 7. Median values in all evaluation categories (column numbers represent datasets).
Metric | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22
Precision | 0.52 | 0.61 | 0.40 | 0.48 | 0.53 | 0.53 | 0.52 | 0.51 | 0.66 | 0.79 | 0.70 | 0.70 | 0.61 | 0.89 | 0.46 | 0.94 | 0.52 | 0.49 | 0.51 | 0.57 | 0.57 | 0.58
Recall | 0.52 | 0.51 | 0.50 | 0.51 | 0.55 | 0.50 | 0.56 | 0.52 | 0.53 | 0.51 | 0.54 | 0.59 | 0.51 | 0.62 | 0.53 | 0.75 | 0.52 | 0.50 | 0.56 | 0.54 | 0.51 | 0.54
F1-Score | 0.51 | 0.55 | 0.44 | 0.49 | 0.53 | 0.51 | 0.52 | 0.50 | 0.58 | 0.62 | 0.60 | 0.63 | 0.55 | 0.72 | 0.48 | 0.83 | 0.51 | 0.51 | 0.52 | 0.54 | 0.53 | 0.55
AUC | 0.73 | 0.74 | 0.75 | 0.74 | 0.75 | 0.74 | 0.74 | 0.75 | 0.73 | 0.73 | 0.62 | 0.73 | 0.75 | 0.68 | 0.73 | 0.67 | 0.73 | 0.75 | 0.73 | 0.72 | 0.75 | 0.76
Runtime | 2.42 | 0.53 | 2.23 | 0.73 | 1.38 | 29.58 | 0.60 | 1.33 | 0.95 | 11.72 | 1.83 | 0.68 | 2.92 | 16.13 | 2.22 | 19.87 | 2.88 | 0.70 | 1.63 | 5.35 | 3.57 | 1.19
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
