Inferring Cell–Cell Communications from Spatially Resolved Transcriptomics Data Using a Bayesian Tweedie Model

Cellular communication through biochemical signaling is fundamental to every biological activity. Investigating cell signaling diffusions across cell types can further help understand biological mechanisms. In recent years, this has become an important research topic as single-cell sequencing technologies have matured. However, cell signaling activities are spatially constrained, and single-cell data cannot provide spatial information for each cell. This issue may cause a high false discovery rate, and using spatially resolved transcriptomics data is necessary. On the other hand, as far as we know, most existing methods focus on providing an ad hoc measurement to estimate intercellular communication instead of relying on a statistical model. It is undeniable that descriptive statistics are straightforward and accessible, but a suitable statistical model can provide more accurate and reliable inference. In this way, we propose a generalized linear regression model to infer cellular communications from spatially resolved transcriptomics data, especially spot-based data. Our BAyesian Tweedie modeling of COMmunications (BATCOM) method estimates the communication scores between cell types with the consideration of their corresponding distances. Due to the properties of the regression model, BATCOM naturally provides the direction of the communication between cell types and the interaction of ligands and receptors that other approaches cannot offer. We conduct simulation studies to assess the performance under different scenarios. We also employ BATCOM in a real-data application and compare it with other existing algorithms. In summary, our innovative model can fill gaps in the inference of cell–cell communication and provide a robust and straightforward result.


Introduction
Different biochemical signalings from cellular communications control different activities of living organisms, which highlights the importance of understanding cell-cell communications (CCC) on biological processes and mechanisms [1,2]. In practice, we infer the CCC from some known ligand-receptor (LR) pairs because the interaction of LRs mediates communication. As single-cell RNA sequencing (scRNA-seq) technologies have matured, researchers have gradually gained opportunities to investigate CCC from scRNAseq data since we know more about the ligand and receptor gene expression information and cell type annotation at the cellular level. Several approaches have been proposed for inferring the CCC from scRNA-seq data. For example, CellPhoneDB [3] calculates the mean of the average ligand expression level for one cell type and the average receptor expression level for another cell type and conducts a permutation test to determine the significance of this LR pair between two cell types. Instead of using the mean or product to measure the communication between two cell types, SingleCellSingleR [4] introduces a regularized product score for an LR pair and provides an ad hoc benchmark to decide an appropriate Figure 1. Overview of the workflow. BATCOM requires three main inputs: (1) a gene expression matrix obtained from the SRT data with coordinates information, (2) a matrix that reflects the cell type annotations of spots/cells, and (3) a list of known LR pairs. After fitting the model, a heatmap and a network can be generated for visualization purposes.

Spot-to-Spot Communication Score
Since the spot-based SRT data only provide the gene expression level for each spot, we define the communication score C k i,j (i, j = 1, 2, . . . , N) from sender spot i to receiver spot j for the LR pair k as where L k i and R k j are the expression level of ligand L at spot i and the expression level of receptor R at spot j, separately. Specifically, L k i and R k j can be estimated using the arithmetic mean if the ligand L or receptor R consists of more than one subunit. In other words, where s L and s R denote the numbers of subunits of ligand L and receptor R, and y i and y j are the corresponding normalized gene expression levels of spot i and spot j in SRT data. Although the geometric mean of subunits can be an alternative option for estimation, this will introduce a significantly larger number of zeros present in C k i,j compared to the arithmetic mean.
While public databases such as CellChatDB [5], CellPhoneDB [3], and CellTalkDB [19] offer an extensive list of LR pairs, it is crucial to note that not all pairs are relevant for the analysis since they may not exist in the data being studied. Therefore, we remove LR pairs whose spot-to-spot communication scores C i,j contain more than 98% zeros. Once we obtain the communication scores C k i,j for each relevant LR pair k, we can calculate spot-to-spot communication scores for a specific signaling pathway or the entire system by summing up the corresponding LR pairs.
The C k i,j communication scores in Equation (1) will be the outcome variables in the regression modeling yielding N 2 observations. In practice, this dimensionality can be reduced by only considering (i, j) pairs that are within a certain distance of each other.

BATCOM Model Structure
To model the signaling strength for LR pair k across all combinations of cell types from the communication scores between two spots, we propose a generalized linear regression model where M i,g 1 and M j,g 2 are the proportions of cell types g 1 (g 1 = 1, 2, . . . , G) at the sender spot i and g 2 (g 2 = 1, 2, . . . , G) at the receiver spot j. Although the model typically includes G 2 interaction terms to account for the communications among G cell types, some of these product terms may be filtered out in practice due to minimal or non-existent observations, or they may be based on prior knowledge. In addition, ρ > 0 is a communication constraint parameter, and D i,j is a suitably chosen distance metric between spots i and j, which we considered to be Euclidean in our applications. As spots i and j are spatially further apart (D i,j increasing), they have less ability to communicate. This effect is captured by the exp(−ρD i,j ) term in Equation (2), which down-weights the impact of the cell type memberships as distance increases. Larger values of ρ represent a faster spatial decay such that only adjacent spots may communicate, while smaller values of ρ allow communication across longer distances. However, the specific value of ρ should be chosen based on the scale of distances in the dataset. In this paper, we scaled D ij so that adjacent spots had a distance of 1.
Based on Equation (1), C k i,j (i, j = 1, 2, . . . , N) are not independent of each other as they partially come from the same spot. For example, when i = 1, all C k 1,j (j = 1, 2, . . . , N) should be correlated with each other because they all depend on the same sender spot expression L k 1 . Thus, Equation (2) includes two random effect parameters ν L i and ν R j to introduce correlation around the corresponding sender spot i for ligand L and receiver spot j for receptor R, respectively. Returning to our example, if spot 1 exhibits high expression levels for a specific ligand L k 1 , it will result in the corresponding communication scores C k 1,j being large or above average for all j. Neglecting to account for the shared structure across C k 1,j for j = 1, 2, . . . , N may lead to an overestimation of the effect of the cell types most prevalent in spot 1. To address this issue, we introduce the inclusion of a large ν L 1 to capture the characteristics of spot 1, accounting for its high expression of ligand L and communication scores C 1,j . By incorporating this additional variability, the remaining variation in C k 1,j becomes associated with the primary target of interest: the cell type combinations.
The regression coefficients β k g 1 ,g 2 in Equation (2) are the parameters of interest in our model and thus reflect the communication strength of LR pair k from sender cell type g 1 to receiver cell type g 2 . A positive β k g 1 ,g 2 indicates that as the memberships of cell type g 1 at the sender spot and cell type g 2 at the receiver spot jointly increase, the communication is predicted to increase. Conversely, negative coefficients suggest that larger cell type memberships will decrease the spot-to-spot communication of LR pair k. In this way, our approach is capable of deconvoluting the mean spot-to-spot communication scores into the interactions between the cell types at the sender spot and the cell types at the receiver spot.
It is important to note that the original gene expression matrix of SRT data is usually sparse. Moreover, if the ligand L or the receptor R is not expressed, the communication score C k i,j will be zero according to Equation (1). As a result, C k i,j (i, j = 1, 2, . . . , N) is expected to be a sparse vector with continuous positive scores in the non-zero positions. Considering this property of the data, one common choice would be fitting a zero-inflated or hurdle model. However, both models require two sets of coefficients to account for the probability of zeros and the value of non-zeros separately [20][21][22], and it would be difficult to integrate these two different sets of coefficients together to reflect the strength of communication between cell types. To that end, we utilize the compound Poisson-Gamma distribution to model the communications scores C k i,j . This distribution effectively models the zero-inflated continuous values observed in the data while simplifying the modeling process and enhancing interpretability.

Compound Poisson-Gamma Distribution
For the compound Poisson-Gamma distribution CPG(λ, α, γ), the random variable C can be generated as follows: where λ is the rate of the Poisson distribution, and α and γ are the shape and scale of the Gamma distribution, respectively. Based on the settings in Equation (3), we have which implies that the joint distribution of C and T is Usually, one would integrate out T from Equation (4) to obtain a marginal distribution of C. However, the infinite summand p(c|λ, α, γ) = ∑ ∞ t=0 p(c, t|λ, α, γ) does not have a closed-form representation. We can only use approximation approaches, such as series expansion [23] or Fourier inversion [24], to approximate the infinite number of terms. Although several studies have conducted statistical estimation and inference for the marginal distribution of C based on the approximation [25,26], in this paper, our methodology uses the joint distribution of C and T. Our approach is related to the EM algorithm presented in [27], although we use a Bayesian data augmentation strategy.
The compound Poisson-Gamma distribution CPG(λ, α, γ) is equivalently known as the Tweedie distribution TW(µ, φ, p) when 1 < p < 2. The Tweedie parametrization gradually shifts from a Poisson distribution to a Gamma distribution as p increases. Building a compound Poisson-Gamma generalized linear model in terms of the parameters of TW(µ, φ, p) is easier than the original CPG(λ, α, γ). Thus, it is critical to know the unique relationship between two sets of parameters (µ, φ, p) and (λ, α, γ) as follows: Smyth [28] provides a detailed description of the computational process.
Returning to the communication scores C k i,j , we will utilize the Tweedie parameterization for model specification. The mean µ k i,j is parameterized through Equation (2) using a log-link function for g(); hence, the parameters of interest β g 1 ,g 2 determine this mean of C k i,j . The other parameters φ and p are global, and their values are shared across all (i, j) pairs of spots.

Parameter Estimation
Considering the complexity and intractability of the proposed model, a Bayesian approach for inference will be a good choice. We use the Hamiltonian Monte Carlo (HMC) algorithm to make the sampling more efficient than the usual Gibbs sampling [29]. Because HMC requires the gradient of the log-posterior density function, we derived the closedform solutions for it in Appendix A. The closed-form solutions are computationally less intensive and hence much faster than iterative methods.
As part of the Bayesian model specification, prior distributions for all parameters must be specified. Given the limited information available on the parameters, it is often preferable to choose weakly informative priors that offer both convenience and simplicity. Thus, the intercept term will have a disperse N(0, 100 2 ) prior, and the remaining coefficients β g 1 ,g 2 have N(0, 1) priors. As for the other two Tweedie parameters φ (φ > 0) and p (1 < p < 2), we transform them to log φ and θ = log p−1 2−p and assign a normal distribution and a logistic distribution as priors, respectively. Overall, the priors are β 0 ∼ N(0, 100 2 ), β g 1 ,g 2 iid ∼ N(0, 1), g 1 = 1, 2, . . . , G, g 2 = 1, 2, . . . , G, log φ ∼ N(0, 10 2 ), In addition to the parameters specified above, our methodology relies on the spatial tuning parameter ρ in Equation (2). This parameter is unknown; however, we can fit multiple models with varying values of ρ and select the most suitable model by considering model selection statistics.
Equation (2) reflects a mixed-effect model that accounts for both fixed and random effects. The random effects ν L i and ν R j are assumed to be independently and identically distributed according to the standard normal distribution N(0, 1). In an initialization step, we fit this mixed-effects model using a Newton-Raphson algorithm on the posterior distribution. However, since the random effects are not our primary parameters of interest, we treat the estimated ν L and ν R as fixed parameters in the main Bayesian framework to reduce the computational burden.
As noted previously, we are using the joint distribution of (C, T) as the relevant likelihood function in our Bayesian model specification since it has a closed-form representation. Thus, every observation C i,j has a corresponding unknown latent variable T i,j , and our HMC algorithm includes a data augmentation Gibbs step to sample values of T given the observed c and the current parameter values. We note that when C = 0, T must be equal to zero. Ts for the non-zero Cs are sampled from probabilities proportion to p Poisson (t |λ) × p Gamma (c|t α, γ) for t = 1, 2, . . . , T max . T max is an adaptive parameter in our algorithm that is increased and decreased depending on how large the sampled Ts are relative to this maximum threshold.
In general, our Markov chain Monte Carlo (MCMC) sampling algorithm consists of two steps. In one step, we apply HMC to jointly update the vector of model parameters (β 0 , β g1,g2 , φ, p) for the current values of T. The other step updates the augmentation parameters T for the non-zero communications, given the current parameter values. We note that there are often few changes to the T values, so it is more computationally efficient to run multiple steps of HMC for each update to the augmentation parameters. Typically, we consider 10 HMC steps for every single update of T.
During the implementation of the MCMC procedure, we first conduct a preliminary run of 6500 iterations to tune the maximum threshold T max and the covariance matrix of the momentum variables of HMC, so that the acceptance rate of all parameters can be kept around 45% to 65%. We then run an additional 13,500 iterations and discard the first 3500 iterations as the burn-in period, at which point approximate convergence is achieved. This results in 10,000 retained samples, and the inference is made using this collection.

Hypothesis Testing
As previously mentioned, the regression coefficients β g 1 ,g 2 from Equation (1) are the main parameters of interest because they reflect the association between the celltype communication and the LR interaction. Thus, for each coefficient, we test the null hypothesis H 0 : β g 1 ,g 2 = 0 against the alternative hypothesis H a : β g 1 ,g 2 = 0. We calculate the mean and variance of the samples β g 1 ,g 2 from the inference period of HMC and then generate the statistic W =β 2 g 1 ,g 2 var(β g 1 ,g 2 ) for a Wald test. As stated in [30], the standard Bayesian large sample theory implies that the test statistic W approximately follows an asymptotic χ 2 (1) distribution under the null hypothesis, and we can easily obtain a pseudo-p-value by considering the tail probability beyond the W associated with the estimated coefficient β g 1 ,g 2 . The pseudo-p-values across all interactions of cell types will be adjusted for multiple hypothesis testing using the false discovery rate (FDR) correction [31]. In this way, the inference will be considered in a frequentist framework for ease of interpretation. In this paper, we declare a CCC significant on an LR pair if the FDR adjusted p-value is less than 0.05.

Results
We validated our modeling strategy through extensive simulation studies using data generated separately from two distinct models: our proposed compound Poisson-Gamma model and a pseudo-hurdle Gamma model. To accurately represent the spatial positions, we created a panel of 100 spots arranged in a 10 × 10 grid. Due to a lack of a comparable model structure among the existing CCC methods, we can only compare our model to variations to the structure of our BATCOM model. In the subsequent sections, we present the results of our simulation studies in Figures 2-5. For more detailed numerical results, please refer to Appendix B. Following that, we applied the proposed model to a real dataset and compared the results with other commonly used spatial CCC methods.

Data Generated from the Compound Poisson-Gamma Model
To evaluate performance, we applied our method to simulated data generated from the proposed Tweedie model. Across simulations, we varied the number of cell groups G, the communication constraint parameter ρ, the sparsity rate of coefficients δ, and the two Tweedie parameters φ and p. For the proportions of cell types M i,1 , . . . , M i,G , we randomly generated each element from U(0, 1) and then re-scaled each row so they sum to 1. Based on the regression structure, a dense vector of original coefficients β O was independently sampled from 1 2 U(0.1, 0.5) + 1 2 U(−0.5, −0.1), and the random effects ν L and ν R were sampled from N(0, 0.4) independently. Considering the sparsity rates of coefficients, we randomly picked δ × G 2 coefficients (excluding the intercept) in β O and set them as zeros to get β δ . To make different scenarios comparable, we fixed the β δ , ν L , and ν R across all corresponding scenarios. We also added −2 to the intercept to make the mean value µ = exp(X β δ + ν L + ν R ) similar to real SRT data. For each combination of parameters, we generated 100 different simulated datasets to avoid uncertainty and detect variability.
For each dataset, the inference of BATCOM was performed by estimating and fixing the random effects, running the MCMC algorithm, and considering FDR-corrected Wald tests as discussed in Section 2.4. After obtaining these results, we generated a confusion matrix for each simulated dataset by comparing the results with the true values. A true positive (TP) was recorded when the estimate had the same sign as the true value and the adjusted p-value was less than 0.05. Similarly, a true negative (TN) was counted when the adjusted p-value was larger than 0.05, and the true value was zero. Conversely, a false negative (FN) was registered if the adjusted p-value was greater than 0.05, but the true value was not zero. Any estimate with an adjusted p-value less than 0.05 for a true value of zero or an estimate with a sign different from the true value was counted as a false positive (FP). Using the confusion matrix from the analysis of each generated dataset, we calculated the true positive rate (TPR), false positive rate (FPR), and observed FDR. We also plotted the receiver operating characteristic (ROC) curve by setting different cutoffs of the adjusted p-value to compute the area under the curve (AUC). For each scenario, we calculated the mean and standard deviations of these measurements across 100 simulated datasets.
Because we treat the communication constraint parameter ρ as a tuning parameter and do not estimate it during MCMC, it is critical to check the performance of the proposed model when the estimated value is close to and far away from the true value. Figure 2 presents the results of the proposed models fitted usingρ = 0.2, 0.5, 0.8 under the scenarios with the true values of ρ = 0.2, 0.4, 0.6, and 0.8, separately. Additionally, we applied a standard Bayesian model selection strategy to choose the best value for this tuning parameter. To that end, we consider the widely applicable information criterion (WAIC 2 ) [29] according to the formula where i = 1, 2, . . . , n reflects the observation and s = 1, 2, . . . , S represents the iteration of the inference period of HMC. For each dataset, we fit the model underρ = 0.2, 0.5, 0.8 and select theρ that yields the lowest WAIC 2 as the BEST model choice.
In Figure 2, the results indicate that when theρ used to fit the model is close to the true value, the performance is excellent, with high TPR and AUC, as well as low FPR and observed FDR. We also consider the value ofρ = 0.5 as a useful default choice of the distance tuning parameter since it straddles the expected range of communication parameters (0, 1). Empirically, this value performs well across the full range of ρ with only minor degradation in the more extreme cases of ρ = 0.2, 0.8. Thus, if running only one version of the model, we recommend usingρ = 0.5 as the tuning parameter. For the remainder of this paper, we use the defaultρ = 0.5 choice. When computational resources permit, we suggest trying a small collection of ρ and selecting the best model based on WAIC 2 .
We then compared our Bayesian framework (BATCOM) with a frequentist framework (referred to as TWGAM) using the gam function from R package mgcv [32]. TWGAM is a frequentist implementation of our model structure with the same distribution assumption and design matrix (scaled by distance) but without any random effect terms. Moreover, we compared our model structure concerning the proportions of cell types in each spot with other algorithm structures considering just one cell type. Many existing CCC methods treat each spot as containing only one cell type without considering the heterogeneity in each spot. To mimic this phenomenon, we constructed a corresponding zero-one matrix. For each spot (i.e., each row of the M matrix), we assigned a 1 to the cell type with the maximum proportion and a 0 to the rest of the cell types (named MAXPROP). MAXPROP uses all the other features of our proposed BATCOM except that the membership proportions M i,g (i = 1, . . . , N; g = 1, . . . , G) are binary. For this MAXPROP, we employed our overall framework, including the random effect imputation, the MCMC sampling strategy for parameter estimation, and the Wald tests for inference. In addition, we compared our distribution assumption with the binomial logistic distribution of non-zero values (named LOGISTICS), which does not account for random effects.  Figure 3 displays the performance of different models usingρ = 0.5 as the tuning parameter value under different scenarios, where the true communication constraint parameter ρ is either 0.4 or 0.6. It is easy to see that our proposed model (BATCOM) maintains a very robust performance across these different scenarios, consistently achieving high TPR and AUC when compared to three other models. Remarkably, BATCOM also effectively controls the FPR, and the observed FDR was around or below 0.05. In contrast, the frequentist framework (TWGAM), which shares the same distribution assumption as BATCOM, consistently performs worse, with much higher FPR and FDR. One possible explanation for this result is that TWGAM does not integrate prior knowledge about the parameters, and the failure to account for random effects could contribute to an increased FDR. The model that only considers one cell type for a spot (MAXPROP) is even worse than TWGAM in all aspects, including FPR, FDR, and AUC, with particularly high FPR and FDR. To some extent, these results suggest that existing CCC methods that ignore within-spot heterogeneity may produce numerous false discoveries. Additionally, although the model with the Bernoulli distribution assumption (LOGISTICS) generally maintains a small FPR and FDR, it has the lowest power (TPR), especially when we have a high value of the Tweedie parameter p. In other words, it tends to make too-conservative conclusions by inferring that most interactions among cell types are insignificant. In our previous simulation results, we considered all spot pairs for analysis. However, in real-world applications, we can effectively reduce the number of observations in the model by focusing only on spot pairs (i, j) that are within a specific distance threshold of each other, as mentioned earlier. To investigate the trade-off between efficiency and accuracy resulting from this reduction, we conducted an additional simulation experiment by varying the distance threshold and evaluating its impact on estimation accuracy. Specifically, we set the threshold values to 10 (i.e., including all spot pairs), 7, 5, and 3, as illustrated in Figure 4. It is evident that reducing the threshold to include fewer (i, j) spot pairs in the model inevitably affects estimation accuracy; however, the performance did not deviate significantly. The TPR and AUC showed a significant reduction when the threshold was set to 3, but the change remained below 10%. In contrast, the FPR and FDR exhibited no significant changes with varying thresholds. This observation suggests that reducing the number of spot pairs during model fitting leads to a more conservative decision-making process in our algorithm.
It is worth noting that the choice of the threshold is closely related to the selection of ρ. If ρ is large, only the closest spots have a substantial contribution, making a small threshold appropriate. However, when ρ is small, more spots contribute, and a larger threshold may be necessary to avoid excluding relevant spots. In Figure 4, we usedρ = 0.5, resulting in a minimal weight for the exponential term of distance in Equation (2) when the distance between two spots exceeds 5.

Data Generated from the Pseudo-Hurdle Gamma Model
To complete our comprehensive evaluation of performance, we also simulated data from a structure that differs from our methodology while maintaining a comparable set of model parameters describing the relationship between cell type memberships and communication scores. We generated this additional simulated data from a pseudo-hurdle Gamma model. In the traditional hurdle model, one needs to have two different sets of coefficients to determine the probability of zeros and the distribution of non-zero values separately. Here, we generated the new data using the model where µ depended on the same βs through Equation (2), the shape parameter α new was randomly selected from a uniform distribution U(0.5, 5), and the scale parameter In this way, the simulated data from this pseudo-hurdle Gamma model have the same mean µ as our proposed compound Poisson-Gamma distribution, ensuring that the parameter interpretation of β is comparable.
After switching to the pseudo-hurdle Gamma model as the data generator for our simulation scheme, we found that the results remained consistent with our previous findings using the compound Poisson-Gamma distribution. As shown in Figure 5, our proposed model outperforms other models, particularly in terms of its high TPR and low observed FDR in all scenarios. Even when faced with higher sparsity rates of coefficients and a larger number of cell types, our proposed model strikes a balance between identifying new discoveries and minimizing errors.

Case Study
We applied our methods to the Visium spatial transcriptomics data of cutaneous squamous cell carcinoma (cSCC) [33], in which each spot contains multiple cells. Ji et al. [33] performed scRNA-seq on both tumors and normal skin and profiled SRT data on tumors simultaneously. As an example, we focused on the SRT data from replicate 2 of patient 2, which had a greater sequencing depth than other samples. This sample contains 1932 spots (after excluding spots with less than 100 genes detected) and 10,703 genes (after filtering out genes not expressed in at least 97.5% of the spots). To perform the upstream cell type deconvolution analysis, we utilized scRNA-seq data from the same patient as a reference and obtained the cell type proportion matrix using the full mode of RCTD [16].
Subsequently, we conducted a comparative analysis between our proposed method, employingρ = 0.5 as the tuning parameter, and other CCC algorithms tailored to SRT data, including Giotto [6], COMMOT [8], and SpaTalk [9]. To ensure a fair comparison, we utilized the same cell type proportion matrix from RCTD [16] (G = 24 cell types) and the same list of known LR pairs from CellTalkDB [19] (3398 pairs). However, Giotto [6] and COMMOT [8] required a single cell type to be specified for each spot, which posed a challenge for comparison. To address this, we assigned to each spot the cell type that had the highest proportion in the matrix, which introduced 12 cell types into the algorithms. Furthermore, we fit our model using the MAXPROP version, which is expected to be suboptimal for spot-based SRT data.
After filtering out LR pairs based on each algorithm's default rules, we found that our BATCOM and MAXPROP methods considered 712 LR pairs, whereas SpaTalk considered 515 LR pairs, Giotto considered 983 LR pairs, and COMMOT focused on 664 LR pairs. Regarding cell-type interactions, while there were a total of 576 interactions possible among the 24 cell types, we filtered out some interactions due to minimal or non-existent observations; this resulted in 364 interactions (out of 576) for BATCOM and 96 interactions (out of 144) for MAXPROP. SpaTalk does not consider interactions between the same cell type, leading to 552 interactions under its consideration. Meanwhile, Giotto and COMMOT dealt with 144 interactions due to the presence of 12 cell types. Figure 6 illustrates the UpSet plot for the number of significant CCC on LR pairs identified by different methods. As we can see, our proposed method, BATCOM, and MAXPROP detected the third and second highest number of significant communications, respectively, following SpaTalk, which identified the most significant results. It is not surprising that our proposed model (BATCOM) identified fewer significant CCC results than SpaTalk, as SpaTalk does not correct the p-values for multiple comparisons. In contrast, Giotto and COMMOT found the fewest communication pairs. Notably, MAXPROP identified many more significant CCCs than BATCOM, which is consistent with the higher FDR observed in the preceding simulations. Compared to Giotto and COMMOT, which also utilize the modal cell type, MAXPROP found more significant communications. Specifically, one-third to one-half of the results from Giotto and COMMOT overlapped with MAXPROP but disagreed with BATCOM, which is the version of our model that uses all available information about the spot's cell type makeup. All methods share a common finding of six significant CCC pairs ( Table 1). The results suggest that several LR interactions between different cell types may play a critical role in tumor-specific cellular crosstalk. Specifically, the ligand SERPINE1, which binds to the receptor ITGB5, has been found to promote tumor growth and angiogenesis in several types of cancer, including skin cancer [34]. Similarly, THBS1, SDC4, and TLN1 have been linked to the development of metastasis and chemoresistance in skin cancer [35][36][37]. Notably, previous research has established PLAU and ITGA5 as critical biomarkers for various types of squamous cell carcinoma [38][39][40][41]. Furthermore, the study by Fang et al. [39] suggests that PLAU affects the formation of inflammatory cancer-associated fibroblasts, which is consistent with the findings of our CCC analysis. These results emphasize the crucial role of specific LR interactions in cancer progression and highlight potential targets for therapeutic interventions. In addition to inferring CCC for a specific LR pair, exploring the overall cellular communication or communication within a specific signaling pathway based on SRT data can provide valuable insights. To demonstrate this, we aggregated the communication scores across all LR pairs and fit BATCOM using C i,j = ∑ k C k i,j as the outcome variable in Equation (2). Figure 7 depicts the overall significant CCCs, indicating that cancer-related cells communicate closely with each other. Specifically, normal-differentiating keratinocytes (KC) exhibit positive communication with tumor-differentiating KC and plasmacytoid dendritic cells (PDC) while showing negative communication with fibroblasts. Notably, Ji et al. [33] found that the subpopulations of tumor KCs (basal, cycling, and differentiating) closely resemble the normal KC subpopulations, and they identified a fourth major tumor KC subpopulation, called tumor-specific keratinocytes (TSK), that exclusively exists in tumor skin and distinguishes itself from other tumor cells. Furthermore, Ji et al. [33] discovered that TSK and tumor-basal KC are both present in the leading edge of the tumor. Our results are consistent with these findings, as Figure 7 shows that tumor-basal KC frequently communicates with TSK.

Discussion
In this paper, we present a generalized linear regression model for inferring CCC based on LR interactions. Our model offers high flexibility in fitting both the single-cell resolution SRT data and spot-based SRT data. A significant challenge in spot-based SRT data is the presence of cell type mixtures in each spot, which we address by assuming that the mean spot-to-spot communication score is a convolution of possible interactions between cell types at the sender spot and the receiver spot. Our proposed model takes advantage of the regression model's properties to naturally handle communication between different cell types simultaneously, while also directly providing the direction of the association between CCC and LR interaction. Furthermore, our approach explicitly models the decreasing ability to communicate as the distance between cells or spots increases, differing from other algorithms that employ an arbitrary threshold to restrict communication.
Due to the limited information available on the parameters, our detailed Bayesian algorithm assumes the prior distributions defined in Equation (6). It is crucial to recognize that the choice of different prior distributions can lead to diverse model performances. To explore this, we also examined alternative priors such as N(0, 0.1 2 ) and N(0, 0.01 2 ) for the regression coefficients β g 1 ,g 2 , and we observed that the model's inference regarding significant connections remained robust relative to using the default N(0, 1) prior. However, if necessary, the prior standard deviation can be easily adjusted in practice.
When comparing Bayesian and frequentist inference with the same distributional assumptions, we found that Bayesian inference provides more accurate estimation with lower FDR (Figure 3). However, MCMC algorithms can be time-consuming. For instance, in the cSCC case study, BATCOM and MAXPROP had an average running time of 67.38 and 19.80 min, respectively, per LR pair. In contrast, COMMOT and Giotto had average running times of 2.16 and 0.18 min, respectively, for each LR pair. SpaTalk had a different approach, identifying significant LR pairs for each cell type interaction, with an average running time of 4.38 min per interaction. To reduce the computational burden, one potential solution is to employ a threshold to control the number of included (i, j) spot or cell pairs. Although this approach inevitably impacts estimation accuracy, our simulation results (Figure 4) demonstrate that the resulting influence on performance will be minimal if using a moderate or largeρ, such as 0.5. Alternatively, further exploration could focus on developing a more precise frequentist inference framework for the compound Poisson-Gamma distribution.
Currently, in this paper, we have defined the communication score as a product of the arithmetic mean of the ligand expressions at the sender spot and the mean receptor expression of the receiver spot. This simple approach implicitly assumes equal importance across the subunits of the LR pair, which may not always be the case. Certain subunits could have varying weights or specific distributions, necessitating a more sophisticated strategy in the future to accurately account for their expression. Moreover, we only considered the simple multiplication of ligands and receptors as the communication score between cells/spots (Equation (1)), while some other algorithms, such as CellChat [5], consider more complex relationships between ligands and receptors, including agonists and antagonists. It is certainly possible to design more intricate communication scores between cells/spots by accounting for these relationships. Given the versatility of our model, we can apply our approach directly to communication scores that have numerous zeros and positive continuous data, regardless of their complexity. This flexibility enables us to adapt our method to various scenarios and extend its applicability in future studies.
As previously mentioned, the tuning parameter ρ in Equation (2) is responsible for controlling the rate of decay of spot-to-spot communication as the distance between two spots increases. The appropriate value of ρ depends on the distance unit and potential communication assumption used in a specific tissue. While the parameter is not estimated during MCMC, we recommend usingρ = 0.5 as a default value based on our simulation study results ( Figure 2). For those with more computational resources, we suggest experimenting with different values of ρ and selecting the best one based on WAIC 2 . For this manuscript, the utilization of Euclidean distance in our proposed model to account for the spatial proximity of cells or spots is specifically due to the current SRT data being derived from tissue slices relying on Cartesian coordinates. If future advancements in SRT technology enable the measurement of tissue shapes beyond the current capabilities, it will become imperative to explore alternative distance measurements that are better suited for such scenarios.
Furthermore, in the proposed model, we assume a decreasing trend in the communication probability when the distance between cells/spots increases. However, long-distance signaling is also essential in biological activities [42]. Therefore, a more comprehensive consideration of the relationship between communication and distance should be a focus of future research.
The results presented in Figure 6 indicate that various algorithms yield highly divergent results, with each method exhibiting a substantial number of distinct significant CCC. This observation aligns with the prior work by Li et al. [43]. It should be noted that the inferior performance of Giotto and COMMOT in our study may be partially attributed to directly assigning the cell type with the highest proportion to each spot. However, it is important to highlight that these CCC algorithms only allow for one cell type per spot. To ensure a fair comparison, we implemented the MAXPROP version of our model structure, aligning with the basic design of these algorithms. While we are confident that our methodology is statistically rigorous and reliable, the significant disparities across the different methods make it difficult to determine the most appropriate method at the biological level. Adding to this challenge is the lack of ground truth in this research domain. Moreover, although we conducted simulation studies with two different distributions to evaluate our model's performance, these assessments still rely on the underlying structure of our algorithm. Thus, it is imperative to undertake further experimental investigations and validations of CCC analysis to determine the most appropriate method for this area of inquiry.
As the field of CCC analysis continues to grow with the generation of more SRT data, we believe that our proposed model will serve as a valuable approach for inferring cellular communication in a flexible and accurate way. Our innovative approach can bridge gaps in current CCC inference methods and provide a straightforward outcome. Future studies could explore the applicability of our model to other more complicated structures and further validate its effectiveness in real-world scenarios. and if t i > 0 (i.e., y i > 0), Because the parameters sampled in HMC are β, log φ, and θ = log p−1 2−p , we need their gradients. To simplify the calculation, we only discuss the gradient of the ith (i = 1, 2, . . . , N) contribution to the log-likelihood. The gradient of the overall log-likelihood can be computed by summing all the N individual log-likelihood contributions.
When t i = 0 (i.e., y i = 0), where x ij is the ith row and (j + 1)th column of the design matrix X, and j = 0, 1, . . . , q. Moreover, it is easy to know that On the other hand, when t i > 0 (i.e., y i > 0), where α = 2−p p−1 and dα dp = − 1 (p−1) 2 . Next, we can solve the gradient of the log-prior functions according to Equation (6): , where σ 2 j is the prior variance of β j , j = 0, 1, . . . , q, Additionally, due to the parameter transformation, the last puzzle of the gradient of the log-posterior density function is the gradient of the corresponding log-Jacobian terms, i.e., Finally, the gradient of the log-posterior density function should be