Article

Matrix Factorization and Prediction for High-Dimensional Co-Occurrence Count Data via Shared Parameter Alternating Zero Inflated Gamma Model

1 Department of Statistics and Biostatistics, California State University East Bay, Hayward, CA 94542, USA
2 Department of Statistics, Kansas State University, Manhattan, KS 66506, USA
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(21), 3365; https://doi.org/10.3390/math12213365
Submission received: 28 September 2024 / Revised: 24 October 2024 / Accepted: 25 October 2024 / Published: 27 October 2024
(This article belongs to the Special Issue Statistics for High-Dimensional Data)

Abstract: High-dimensional sparse matrix data frequently arise in various applications. A notable example is the weighted word–word co-occurrence count data, which summarizes the weighted frequency of word pairs appearing within the same context window. This type of data typically contains highly skewed non-negative values with an abundance of zeros. Another example is the co-occurrence of item–item or user–item pairs in e-commerce, which also generates high-dimensional data. The objective is to utilize these data to predict the relevance between items or users. In this paper, we assume that items or users can be represented by unknown dense vectors. The model treats the co-occurrence counts as arising from zero-inflated Gamma random variables and employs cosine similarity between the unknown vectors to summarize item–item relevance. The unknown values are estimated using the shared parameter alternating zero-inflated Gamma regression models (SA-ZIG). Both canonical link and log link models are considered. Two parameter updating schemes are proposed, along with an algorithm to estimate the unknown parameters. Convergence analysis is presented analytically. Numerical studies demonstrate that the SA-ZIG using Fisher scoring without learning rate adjustment may fail to find the maximum likelihood estimate. However, the SA-ZIG with learning rate adjustment performs satisfactorily in our simulation studies.

1. Introduction

Matrix factorization is a fundamental technique in linear algebra and data science, widely used for dimensionality reduction, data compression, and feature extraction. Recent research expands its use in various fields, including recommendation systems (e.g., collaborative filtering), bioinformatics, and signal processing. Researchers are actively pursuing new types of factorizations. In their effort to discover new factorizations and provide a unifying structure, ref.  [1] list 53 systematically derived matrix factorizations arising from the generalized Cartan decomposition. Their results apply to invertible matrices and generalizations of orthogonal matrices in classical Lie groups.
Nonnegative matrix factorization (NMF) is particularly useful when dealing with non-negative data, such as in image processing and text mining. Ref. [2] survey existing NMF methods and their variants, analyzing their properties and applications. Ref. [3] present a comprehensive survey of NMF, focusing on its applications in feature extraction and feature selection. Ref. [4] summarize theoretical research on NMF from 2008 to 2013, categorizing it into four types and analyzing the principles, basic models, properties, algorithms, and their extensions and generalizations.
There are many advances aimed at developing more efficient algorithms tailored to specific applications. One prominent application of matrix factorization is in recommendation systems, particularly for addressing the cold-start problem. In recommender systems, matrix factorization models decompose user–item, user–user, or item–item interaction matrices into lower-dimensional latent spaces, which can then be used to generate recommendations.
Ref. [5] summarizes the literature on recommender systems and proposes a multifaceted collaborative filtering model that integrates both neighborhood and latent factor approaches. Ref. [6] develop a matrix factorization model to generate recommendations for users in a social network. Ref. [7] proposes a matrix factorization model for cross-domain recommender systems by extracting items from three different domains and finding item similarities between these domains to improve item ranking accuracy.
Ref. [8] develop a matrix factorization model that combines user–item rating matrices and item–side information matrices to develop soft clusters of items for generating recommendations. Ref. [9] propose a matrix factorization that learns latent representations of both users and items using gradient-boosted trees. Ref. [10] provide a systematic literature review on approaches and algorithms to mitigate cold-start problems in recommender systems.
Matrix factorization has also been used in natural language processing (NLP) in recent years. Word2Vec by [11,12] marks a milestone in NLP history. Although no explicit matrices are presented in their study, Word2Vec models the co-occurrence of words and phrases using latent vector representations via a shallow neural network model. Another well-known example of matrix factorization in NLP is the word representation with global vectors (GloVe) by [13]. They model the co-occurrence count matrix of words using cosine similarity of latent vector representations via alternating least squares. However, practical data may be heavily skewed, making the sum or mean squared error unsuitable as an objective function. To address this, ref. [13] utilize weighted least squares to reduce the impact of skewness in the data. It is difficult, however, to construct weights manually that work well for real data. In GloVe training, the algorithm simply runs a fixed number of iterations without any convergence check. Here, we consider using the likelihood principle to model the skewed data matrix.
In this paper, our goal is to model non-negative continuous sparse matrix data from skewed distributions with an abundance of zeros but lacking covariate information. We are particularly interested in zero-inflated Gamma observations, often referred to as semi-continuous data due to the presence of many zeros and the highly skewed distribution of the positive observations. Examples of such data include insurance claim data, household expenditure data, and precipitation data (see [14]). Ref. [15] study and simulate data using actual deconvolved calcium imaging data, employing a zero-inflated Gamma model to accommodate spikes of observed inactivity and traces of calcium signals in neural populations. Ref. [16] examine the amount of leisure time spent on physical activity and explanatory variables such as gender, age, education level, and annual per capita family income. They find that the zero-inflated Gamma model is preferred over the multinomial model. Beyond the Gamma distribution, the Weibull distribution can also model skewed non-negative data. Unfortunately, when the shape parameter is unknown, the Weibull distribution is not a member of the exponential family. Further, the alternating update is not suitable because its sufficient statistic is not linear in the observed data. A different estimation procedure would need to be developed if the Weibull distribution were used. The Gamma distribution is not only a member of the exponential family, but its sufficient statistic is also linear in the observed data. This provides solid theoretical ground for the alternating update to find a solution. Section 5 explains this in more detail.
In all the aforementioned examples, there are explicitly observed covariates or factors. Furthermore, the two sets of model parameters are orthogonal to each other, allowing the model to be estimated by fitting two separate regressions: a binomial regression and a Gamma regression. Unfortunately, such models cannot be applied to user–item or co-occurrence count matrix data arising from many practical applications, such as user–item or item–item co-occurrence data from online shopping platforms and co-occurring word–word pairs in sequences of text. One reason is the absence of observed covariates. Additionally, the mechanisms that generate the observed entries in the binomial and Gamma parts may be of the same nature, making it more appropriate to use shared parameters in both parts of the model. Therefore, in this paper, we consider shared parameter modeling of zero-inflated Gamma data using alternating regression. We consider two different link functions for the Gamma part: the canonical link and the log link.
We believe that our study is the first to utilize a shared parameter likelihood for a zero-inflated skewed distribution to conduct matrix factorization. Alternating least squares (ALS) shares a similar spirit with our SA-ZIG in that parameters are updated alternately. Most ALS-based matrix factorization methods in the literature, however, implicitly add an assumption: for ALS to be valid, the data need to have constant variance. This is because the objective function behind the ALS procedure is the mean squared error, which gives equal emphasis to all observations regardless of how large or small their variances are. If the variances of different observations are drastically different, it is not appropriate to treat the residuals equally. The real data in a high-dimensional sparse co-occurrence matrix are often very skewed with many zeros and do not have constant variance. The contribution of this paper is the SA-ZIG model, which models the positive co-occurrence data with a Gamma distribution and attributes the many zeros in the data to a Bernoulli distribution. Shared parameters are used in both the Bernoulli and Gamma parts of the model. The latent row and column vector representations in the matrix decomposition can be thought of as missing values that follow some distribution relying on a smaller set of parameters. Estimating the vector representations for the rows relies on the joint likelihood of the observed matrix data and the missing vector representations for the columns. Due to the missingness, the estimation of the row vector representations is carried out using the conditional likelihood of the observed data given the column vector representations, and vice versa. This alternating update ultimately gives the maximum likelihood estimate if the sufficient statistic for the column vector representation is linear in the observed data and the row vector representations. Both ALS and our SA-ZIG rely on this assumption to be valid.
The remainder of this paper is structured as follows. Section 2 outlines the fundamental framework of the ZIG model. Section 3 focuses on parameter estimation within the SA-ZIG model using the canonical link. Section 4 addresses parameter estimation for the scenario involving the log link in the Gamma regression component. Convergence analysis is presented in Section 5. Section 6 details the SA-ZIG algorithms incorporating learning rate adjustments. Section 7 presents the experimental studies. Finally, Section 8 concludes the paper by summarizing the research findings, contributions, and limitations.

2. Shared Parameter Alternating Zero-Inflated Gamma Regression

Suppose the observed data are $\{y_{ij},\ i=1,\ldots,n,\ j=1,\ldots,n\}$, whose distribution depends on an unobserved Bernoulli random variable $g_{ij}$ and a Gamma random variable such that
$$
g_{ij}=\begin{cases}0, & \text{with probability } 1-p_{ij}\\ 1, & \text{with probability } p_{ij}\end{cases},
\qquad
y_{ij}=\begin{cases}0, & \text{if } g_{ij}=0\\ \text{Gamma observation}, & \text{if } g_{ij}=1.\end{cases}
$$
The mechanism behind the observed data may result from the combined contribution of some covariates or factors, but none of them are observed.
We can write the probability mass function (pmf) of the Bernoulli random variable and the probability density function (pdf) of the Gamma random variable as follows:
$$
p(g_{ij}) = p_{ij}^{g_{ij}}(1-p_{ij})^{1-g_{ij}}, \qquad
f(y_{ij}\mid g_{ij}=1) = \frac{1}{\Gamma(v_i)}\left(\frac{v_i y_{ij}}{\mu_{ij}}\right)^{v_i}\frac{1}{y_{ij}}\exp\!\left(-\frac{v_i y_{ij}}{\mu_{ij}}\right).
$$
The product of these two functions gives the joint distribution of the Bernoulli and Gamma random variables:
$$
f(y_{ij}, g_{ij}) = p(g_{ij})\cdot f(y_{ij}\mid g_{ij}).
$$
Summing over the two possible values of the Bernoulli random variable yields the pdf of the zero-inflated Gamma (ZIG) distribution:
$$
f(y_{ij}) = \sum_{x=0}^{1} p(g_{ij}=x)\, f(y_{ij}\mid g_{ij}=x)
= \big(1-p_{ij}\big)^{I(y_{ij}=0)}\big[p_{ij}\, f(y_{ij}\mid g_{ij}=1)\big]^{I(y_{ij}>0)}.
$$
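To make the two-part density concrete, the following minimal Python sketch (our own illustrative code, not taken from the paper; it uses NumPy and SciPy with the shape–mean parameterization $(v,\mu)$ above) simulates zero-inflated Gamma observations and evaluates the ZIG log-density.

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)

def simulate_zig(p, mu, v, size):
    """Zero with probability 1 - p; otherwise a Gamma draw with mean mu and shape v."""
    g = rng.binomial(1, p, size=size)                    # Bernoulli indicator g_ij
    y_pos = rng.gamma(shape=v, scale=mu / v, size=size)  # Gamma(shape=v, mean=mu)
    return g * y_pos

def zig_logpdf(y, p, mu, v):
    """log f(y) = I(y=0) log(1-p) + I(y>0) [log p + log Gamma(y; v, mu)]."""
    y = np.asarray(y, dtype=float)
    out = np.where(y == 0, np.log1p(-p), 0.0)
    pos = y > 0
    out[pos] = np.log(p) + gamma.logpdf(y[pos], a=v, scale=mu / v)
    return out

y = simulate_zig(p=0.3, mu=2.0, v=1.5, size=8)
print(y)
print(zig_logpdf(y, p=0.3, mu=2.0, v=1.5))
```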
For the Bernoulli part, we use the logit link function to connect the probability $p_{ij}$ to the effects of unknown covariates or factors:
$$
\eta_{ij} = \log\frac{p_{ij}}{1-p_{ij}} = w_i^\top\tilde w_j + b_i + \tilde b_j,
$$
where $w_i$ and $b_i$ are unknown parameters related to row $i$, while $\tilde w_j$ and $\tilde b_j$ are unknown parameters related to column $j$. Assume $w_i$ and $\tilde w_j$ are $d$-dimensional vectors and $b_i$, $\tilde b_j$ are scalars. For the Gamma observations,
$$
\log f(y_{ij}\mid y_{ij}>0) = -\log\Gamma(v_{ij}) + v_{ij}\log\frac{v_{ij}}{\mu_{ij}} + (v_{ij}-1)\log y_{ij} - \frac{v_{ij}y_{ij}}{\mu_{ij}}.
$$
When the canonical link is used, the mean of the response variable is connected to the unknown parameters through
$$
g(\mu_{ij}) = -\mu_{ij}^{-1} = w_i^\top\tilde w_j + e_i + \tilde e_j,
$$
where $e_i$ and $\tilde e_j$ are scalars. With the canonical link, the likelihood function and the score equations are both functions of the sufficient statistic. Since the sufficient statistic carries all information about the unknown parameters, we can restrict our attention to it without losing any information. This link function, however, can cause difficulty in the estimation process. The natural parameter space is $\{\mu_{ij} : \mu_{ij} > 0\}$, yet the right-hand side of Equation (3) may yield an estimate of $\mu_{ij}$ that lies outside this parameter space.
Another link function that is popular for the Gamma distribution is the log link $g(\mu)=\log(\mu)$, which gives the log-linear model:
$$
g(\mu_{ij}) = \log(\mu_{ij}) = w_i^\top\tilde w_j + e_i + \tilde e_j.
$$
The log link eliminates the non-negativity problem, and the model parameters have a better interpretation than under the canonical link. For each unit increase in $\tilde w_{jk}$ or $w_{ik}$, the mean increases multiplicatively by a factor of $\exp(w_{ik})$ or $\exp(\tilde w_{jk})$, respectively. Although the parameters enjoy better interpretation, estimation can still be a problem: the dot product on the right-hand side of (4) can become large, driving $\mu_{ij}$ toward infinity and causing the estimation algorithm to diverge.
In both links, the common dot product $w_i^\top\tilde w_j$ is used to reflect the fact that the cosine similarity is the key driving force behind the observed data in the table. For example, in natural language processing (NLP), $w_i$ and $\tilde w_j$ each represent a word as a dense vector, containing both linguistic information and word usage information. The observations are distance-weighted co-occurrence counts that are linked to the relevance between the words, and this relevance can be captured with the cosine similarity. In the example of an item–item co-occurrence matrix, $w_i$ and $\tilde w_j$ represent the hidden product information of the items, including characteristics, properties, functionality, popularity, users' ratings, and so on. The dot product again tells how relevant the two items are to each other. Sometimes the co-occurrence matrix is derived from a time series, such as a sequence of watched movies in time order from a customer. In this case, the co-occurrence count tells how often the two items were considered within a similar time frame, because the counts are weighted based on the positional separation of the two items in the sequence. The relevance of these items, summarized by the cosine similarity, could reflect how alike the two products are in their properties, functionality, and so on. Therefore, the dot product serves as a major contributor to the observed weighted count.
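As a small illustration of how fitted vectors would be used for prediction, the sketch below (illustrative code of ours; the matrices W and W_tilde and the function names are assumptions, not objects defined in the paper) scores row–column relevance by the dot product $w_i^\top\tilde w_j$ and retrieves the most relevant columns for a given row.

```python
import numpy as np

def relevance_scores(W, W_tilde):
    """All pairwise relevance scores w_i . w~_j between row and column embeddings."""
    return W @ W_tilde.T

def top_k_columns(W, W_tilde, i, k=5):
    """Indices of the k columns most relevant to row i by dot-product score."""
    scores = W[i] @ W_tilde.T
    return np.argsort(-scores)[:k]

# toy example: 6 rows, 6 columns, d = 4 dimensional embeddings
rng = np.random.default_rng(1)
W, W_tilde = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
print(top_k_columns(W, W_tilde, i=0, k=3))
```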
In both link functions, the intercepts are allowed to differ from those in the logistic model for flexibility. In some classical statistical models, such as in [17], shared parameter modeling assumes one part of the model parameters to be proportional to the other. Such proportionality assumptions are meaningful when covariates are observed. In our case, both $w$ and $\tilde w$ are unknown, and different pairs of $w$ and $\tilde w$ can give the same dot product. Adding a proportionality parameter would only make the model even less identifiable. This is why we believe it is better to put the flexibility in the intercepts rather than in proportionality parameters.
Because shared parameters are used, the parameters from the Gamma regression part and the logistic regression part should be estimated simultaneously. There are past studies that use separate sets of parameters. Ref. [14] conducted hypothesis testing to compare two groups via a mixture of zero-inflated Gamma and zero-inflated log-normal models; however, the parameters of the two model components were modeled separately. In a purely applied setting, ref. [17] considered a normal and log-Gamma mixture model for HIV RNA data using shared parameters, assuming that the two parts of the model are proportional to each other. They proposed such an application and simulation, but no inference was given. Shared parameter modeling has also been employed in the literature to achieve model parsimony. For example, ref. [18] used shared parameters for both a random effects linear model and a probit-modeled censoring process. Ref. [19] used a similar approach for simultaneous modeling of a mean structure and informative censoring. Ref. [20] used shared parameters to model the intensity of a Poisson process and a binary measure of severity.
Denote $\theta_i = (w_i, b_i, e_i)$, $\tilde\theta_i = (\tilde w_i, \tilde b_i, \tilde e_i)$, $\theta = (\theta_1,\ldots,\theta_n)$, and $\tilde\theta = (\tilde\theta_1,\ldots,\tilde\theta_n)$. We estimate $\theta$ and $\tilde\theta$ alternately. In this estimation scheme, the likelihood function is treated as a function of either $\theta$ or $\tilde\theta$, but not both at the same time. When estimating $\theta$, the likelihood is treated as a function of $\theta$ while $\tilde\theta$ stays fixed. Conversely, when estimating $\tilde\theta$, the likelihood is treated as a function of $\tilde\theta$ while $\theta$ is held fixed. This resembles block coordinate descent, in which one block of parameters is updated while the remaining parameters are held at fixed values. For clarity, we use separate notations $l_i$ and $\tilde l_j$ to refer to these two cases, i.e.,
$$
l_i = l_i(\theta_i; \tilde\theta) = \sum_{j=1}^{n} l_{ij}(\theta_i; \tilde\theta_j), \qquad
\tilde l_j = \tilde l_j(\tilde\theta_j; \theta) = \sum_{i=1}^{n} l_{ij}(\tilde\theta_j; \theta_i),
$$
where $l_{ij}(\theta_i;\tilde\theta_j)$ and $l_{ij}(\tilde\theta_j;\theta_i)$ are both equal to $\log f(y_{ij})$, but one is treated as a function of $\theta_i$ and the other as a function of $\tilde\theta_j$.
Note that $l_i$ and $\tilde l_j$ simply represent the log likelihood of a row or a column of the data matrix. The $l_i$ can be thought of as the log likelihood function of $\theta_i$ for the data in the $i$th row of the co-occurrence matrix, and $\tilde l_j$ is the log likelihood function of $\tilde\theta_j$ for the data in the $j$th column. Each log likelihood function can be split into two components, one corresponding to the Bernoulli part and the other to the Gamma part:
$$
l_i = l_i(\theta_i;\tilde\theta) = l_i^{(1)} + l_i^{(2)}, \qquad
\tilde l_j = \tilde l_j(\tilde\theta_j;\theta) = \tilde l_j^{(1)} + \tilde l_j^{(2)},
$$
where
$$
\begin{aligned}
l_i^{(1)} &= l_i^{(1)}(\theta_i;\tilde\theta) = \sum_{j=1}^{n}\Big[ I(y_{ij}=0)\log(1-p_{ij}) + I(y_{ij}>0)\log p_{ij} \Big],\\
l_i^{(2)} &= l_i^{(2)}(\theta_i;\tilde\theta) = \sum_{j=1}^{n} I(y_{ij}>0)\log f(y_{ij}\mid y_{ij}>0),\\
\tilde l_j^{(1)} &= \tilde l_j^{(1)}(\tilde\theta_j;\theta) = \sum_{i=1}^{n}\Big[ I(y_{ij}=0)\log(1-p_{ij}) + I(y_{ij}>0)\log p_{ij} \Big],\\
\tilde l_j^{(2)} &= \tilde l_j^{(2)}(\tilde\theta_j;\theta) = \sum_{i=1}^{n} I(y_{ij}>0)\log f(y_{ij}\mid y_{ij}>0).
\end{aligned}
$$
Assuming that the observations $y_{ij}$ are independent of each other conditional on the unobserved $\theta_i$ and $\tilde\theta_j$, $i=1,\ldots,n$, $j=1,\ldots,n$, the overall log likelihood function from all observations can be written as
$$
l(\theta;\tilde\theta) = \sum_{i=1}^{n} l_i(\theta_i;\tilde\theta) = \sum_i \big(l_i^{(1)} + l_i^{(2)}\big); \qquad
\tilde l(\tilde\theta;\theta) = \sum_{j=1}^{n} \tilde l_j(\tilde\theta_j;\theta) = \sum_j \big(\tilde l_j^{(1)} + \tilde l_j^{(2)}\big).
$$
The alternating regression deals with these two log likelihood functions $l(\theta;\tilde\theta)$ and $\tilde l(\tilde\theta;\theta)$, respectively. The $l_i^{(1)}$ and $\tilde l_j^{(1)}$ parts are the traditional log likelihood for binary logistic regression. The $l_i^{(2)}$ and $\tilde l_j^{(2)}$ parts are the Gamma log likelihood restricted to the positive observations. If the two parts did not share common parameters, the estimation could be performed separately. However, they share the common parameters $w_i$ and $\tilde w_j$. The next two sections describe the parameter estimation.
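For concreteness, the sketch below (our own illustrative code, not the paper's implementation; it assumes a known shape parameter $v$ and dense NumPy arrays) evaluates $l_i = l_i^{(1)} + l_i^{(2)}$ for one row of the matrix while the column parameters are held fixed, under either link for the Gamma part.

```python
import numpy as np
from scipy.stats import gamma
from scipy.special import expit  # inverse logit

def row_loglik(y_i, w_i, b_i, e_i, W_t, b_t, e_t, v, link="log"):
    """l_i = l_i^(1) + l_i^(2) for row i, with column parameters (W_t, b_t, e_t) fixed."""
    dot = W_t @ w_i                       # w_i . w~_j for every column j
    p = expit(dot + b_i + b_t)            # Bernoulli probability p_ij
    pos = y_i > 0
    l1 = np.sum(np.where(pos, np.log(p), np.log1p(-p)))          # logistic part
    tau = dot + e_i + e_t
    mu = np.exp(tau) if link == "log" else -1.0 / tau             # canonical link needs tau < 0
    l2 = np.sum(gamma.logpdf(y_i[pos], a=v, scale=mu[pos] / v))   # Gamma part, y > 0 only
    return l1 + l2
```

The alternating scheme maximizes this quantity over $\theta_i$ row by row, and the analogous column version over $\tilde\theta_j$.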

3. ZIG Model with Canonical Links

In this section, we consider the case with canonical links, i.e., the Bernoulli part uses the logit link and the Gamma part uses the negative inverse link. Using the canonical link with a generalized linear model has the benefit that the score equations are functions of the sufficient statistics. Here, we consider parameter estimation under the canonical links. The negative inverse link, however, makes the model parameters difficult to interpret and also imposes a restriction on the range of the linear predictor that does not match the positive mean of the Gamma distribution. More details are given as we introduce the model.
Recall that the log likelihood and the link function for the logistic part on the $i$th row of data are
$$
l_i^{(1)} = \sum_{j=1}^{n}\Big[ I(y_{ij}=0)\log(1-p_{ij}) + I(y_{ij}>0)\log p_{ij} \Big], \qquad
\eta_{ij} = \log\frac{p_{ij}}{1-p_{ij}} = w_i^\top\tilde w_j + b_i + \tilde b_j,
$$
where $w_i$ and $b_i$ are unknown parameters while $\tilde w_j$ and $\tilde b_j$ are treated as fixed. The log likelihood and the negative inverse link function for the Gamma part on the $i$th row of data are
$$
l_i^{(2)} = \sum_{j=1}^{n} I(y_{ij}>0)\log f(y_{ij}\mid y_{ij}>0), \quad\text{with}\quad
\log f(y_{ij}\mid y_{ij}>0) = -\log\Gamma(v_{ij}) + v_{ij}\log\frac{v_{ij}}{\mu_{ij}} + (v_{ij}-1)\log y_{ij} - \frac{v_{ij}y_{ij}}{\mu_{ij}},
$$
$$
\tau_{ij} = g(\mu_{ij}) = -\mu_{ij}^{-1} = w_i^\top\tilde w_j + e_i + \tilde e_j.
$$
The log likelihood function for the $i$th row of data is
$$
l_i = l_i^{(1)} + l_i^{(2)}.
$$
To obtain partial derivatives, we write $l_i^{(1)}$ as follows:
$$
l_i^{(1)} = \sum_{j=1}^{n}\Big[ (1-g_{ij})\log(1-p_{ij}) + g_{ij}\log p_{ij} \Big]
= \sum_{j=1}^{n}\Big[ g_{ij}\eta_{ij} - \log\big(1+\exp(\eta_{ij})\big) \Big].
$$
The inverse logit and its partial derivative with respect to $w_i$ are
$$
p_{ij} = \frac{\exp(w_i^\top\tilde w_j + b_i + \tilde b_j)}{1+\exp(w_i^\top\tilde w_j + b_i + \tilde b_j)}, \qquad
\frac{\partial p_{ij}}{\partial w_i} = p_{ij}(1-p_{ij})\,\tilde w_j.
$$
Therefore, the first-order partial derivatives can be summarized as
$$
\frac{\partial l_i^{(1)}}{\partial w_i} = \sum_{j=1}^{n}\left[ g_{ij}\tilde w_j - \frac{\exp(\eta_{ij})}{1+\exp(\eta_{ij})}\tilde w_j \right]
= \sum_{j=1}^{n}(g_{ij}-p_{ij})\tilde w_j, \qquad
\frac{\partial l_i^{(1)}}{\partial b_i} = \sum_{j=1}^{n}(g_{ij}-p_{ij}).
$$
The negative second-order partial derivatives and their expectations are the same and are given below:
$$
-\frac{\partial^2 l_i^{(1)}}{\partial w_i \partial w_i^\top} = \sum_{j=1}^{n} p_{ij}(1-p_{ij})\,\tilde w_j\tilde w_j^\top = E\!\left[-\frac{\partial^2 l_i^{(1)}}{\partial w_i \partial w_i^\top}\right],
$$
$$
-\frac{\partial^2 l_i^{(1)}}{\partial w_i \partial b_i} = \sum_{j=1}^{n} \tilde w_j \frac{\partial p_{ij}}{\partial b_i} = \sum_{j=1}^{n} \tilde w_j\, p_{ij}(1-p_{ij}) = E\!\left[-\frac{\partial^2 l_i^{(1)}}{\partial w_i \partial b_i}\right], \qquad
-\frac{\partial^2 l_i^{(1)}}{\partial b_i^2} = \sum_{j=1}^{n} p_{ij}(1-p_{ij}) = E\!\left[-\frac{\partial^2 l_i^{(1)}}{\partial b_i^2}\right].
$$
Now, consider the second component of the log likelihood, $l_i^{(2)}$, from the $i$th row of data:
$$
l_i^{(2)} = \sum_{j=1}^{n} I(y_{ij}>0)\log f(y_{ij}\mid y_{ij}>0)
= \sum_{j=1}^{n} g_{ij}\Big[ v_{ij}\log\!\big({-w_i^\top\tilde w_j - e_i - \tilde e_j}\big) + v_{ij}y_{ij}\big(w_i^\top\tilde w_j + e_i + \tilde e_j\big) - \log\Gamma(v_{ij}) + v_{ij}\log v_{ij} + (v_{ij}-1)\log y_{ij} \Big].
$$
The first-order partial derivatives of $l_i^{(2)}$ are
$$
\frac{\partial l_i^{(2)}}{\partial w_i} = \sum_{j=1}^{n} g_{ij}\left[ \frac{v_{ij}\,\tilde w_j}{w_i^\top\tilde w_j + e_i + \tilde e_j} + v_{ij}y_{ij}\tilde w_j \right], \qquad
\frac{\partial l_i^{(2)}}{\partial e_i} = \sum_{j=1}^{n} g_{ij}\left[ \frac{v_{ij}}{w_i^\top\tilde w_j + e_i + \tilde e_j} + v_{ij}y_{ij} \right].
$$
The negative second-order partial derivatives and their expectations are
$$
-\frac{\partial^2 l_i^{(2)}}{\partial w_i \partial w_i^\top}\bigg|_{g_{ij}=1} = \sum_{j:\,g_{ij}=1} \frac{v_{ij}\,\tilde w_j\tilde w_j^\top}{\big(w_i^\top\tilde w_j + e_i + \tilde e_j\big)^2} = E\!\left[-\frac{\partial^2 l_i^{(2)}}{\partial w_i \partial w_i^\top}\,\bigg|\, g_{ij}=1\right];
$$
$$
-\frac{\partial^2 l_i^{(2)}}{\partial w_i \partial e_i}\bigg|_{g_{ij}=1} = \sum_{j:\,g_{ij}=1} \frac{v_{ij}\,\tilde w_j}{\big(w_i^\top\tilde w_j + e_i + \tilde e_j\big)^2} = E\!\left[-\frac{\partial^2 l_i^{(2)}}{\partial w_i \partial e_i}\,\bigg|\, g_{ij}=1\right];
$$
$$
-\frac{\partial^2 l_i^{(2)}}{\partial e_i^2}\bigg|_{g_{ij}=1} = \sum_{j:\,g_{ij}=1} \frac{v_{ij}}{\big(w_i^\top\tilde w_j + e_i + \tilde e_j\big)^2} = E\!\left[-\frac{\partial^2 l_i^{(2)}}{\partial e_i^2}\,\bigg|\, g_{ij}=1\right].
$$
All of the aforementioned formulae work with the $i$th row of data. When we combine the log likelihood from different rows, the other rows do not contribute to the partial derivatives with respect to $\theta_i$. That is,
$$
\frac{\partial l}{\partial w_i} = \frac{\partial l_i}{\partial w_i}; \qquad
\frac{\partial l}{\partial b_i} = \frac{\partial l_i^{(1)}}{\partial b_i}; \qquad
\frac{\partial l}{\partial e_i} = \frac{\partial l_i^{(2)}}{\partial e_i};
$$
$$
\frac{\partial^2 l}{\partial w_i \partial w_i^\top} = \frac{\partial^2 l_i}{\partial w_i \partial w_i^\top}; \qquad
\frac{\partial^2 l}{\partial w_i \partial b_i} = \frac{\partial^2 l_i^{(1)}}{\partial w_i \partial b_i}; \qquad
\frac{\partial^2 l}{\partial w_i \partial e_i} = \frac{\partial^2 l_i^{(2)}}{\partial w_i \partial e_i};
$$
$$
\frac{\partial^2 l}{\partial b_i^2} = \frac{\partial^2 l_i^{(1)}}{\partial b_i^2}; \qquad
\frac{\partial^2 l}{\partial b_i \partial e_i} = \frac{\partial^2 l_i}{\partial b_i \partial e_i} = 0; \qquad
\frac{\partial^2 l}{\partial e_i^2} = \frac{\partial^2 l_i^{(2)}}{\partial e_i^2}.
$$
As a result, the estimation of the components of $\theta = (\theta_1,\ldots,\theta_n)$ does not need to be performed simultaneously. Instead, we can cycle through the estimation of $\theta_i$, $i=1,\ldots,n$, one at a time. After $\theta_i$ is updated, the estimated values of $\theta_1,\theta_2,\ldots,\theta_i$ are used to update $\theta_k$, $k=i+1,\ldots,n$.
The first part of the alternating regression has the following updating equation based on the Fisher scoring algorithm:
$$
\theta_i^{(t+1)} = \theta_i^{(t)} + \Big[S\big(\theta_i^{(t)}\big)\Big]^{-1} U\big(\theta_i^{(t)}\big), \qquad i=1,\ldots,n.
$$
For each $i$ and $t$, this update requires the value of $\theta_i$, the score equations, and the information matrix at the $t$th iteration. A single update per data load does not take full advantage of the data, because retrieving the $i$th row of data takes a significant amount of time when the data matrix is huge. Therefore, for each row of data retrieved, it is better to update $\theta_i^{(t)}$ several times. Specifically, $\theta_i^{(t)}$ is updated repeatedly in a loop of $E$ epochs, and the updated values are used to recompute the score equations and information matrix, which are then used for the next epoch's update. After all epochs are completed, $\theta_i^{(t)}$ takes the value of $\theta_i^{(t,E)}$ at the end of the $E$ epochs, based on the updating formula below:
$$
\theta_i^{(t,k+1)} = \theta_i^{(t,k)} + \Big[S\big(\theta_i^{(t,k)}\big)\Big]^{-1} U\big(\theta_i^{(t,k)}\big), \qquad k=1,\ldots,E.
$$
This inner loop of updates makes good use of the already loaded data to refine the estimate of $\theta_i^{(t)}$, so that the resulting estimate is closer to its MLE for the current value $\tilde\theta^{(c)}$. Note that the updated $\theta_i$ in each epoch of (20) changes the score equations and Fisher information matrix, and recomputing them in turn yields a better estimate of $\theta_i$. Such multiple rounds of updates in (20) reduce the variation of the update in $\theta_i^{(t)}$ as we iterate Equation (19), and they effectively reduce the number of times the data must be retrieved. Below are the formulae involved in the updating equations:
$$
\theta_i^{(t)} = \big(w_i^{(t)}, b_i^{(t)}, e_i^{(t)}\big); \qquad
U\big(\theta_i^{(t)}\big) = \left(\frac{\partial l_i}{\partial w_i}, \frac{\partial l_i}{\partial b_i}, \frac{\partial l_i}{\partial e_i}\right)\bigg|_{\theta_i^{(t)}}; \qquad
S\big(\theta_i^{(t)}\big) = \begin{pmatrix}
S_{w_i}^{(t)} & S_{w_i b_i}^{(t)} & S_{w_i e_i}^{(t)}\\
S_{b_i w_i}^{(t)} & S_{b_i}^{(t)} & S_{b_i e_i}^{(t)}\\
S_{e_i w_i}^{(t)} & S_{e_i b_i}^{(t)} & S_{e_i}^{(t)}
\end{pmatrix}
$$
with
$$
\frac{\partial l_i}{\partial w_i}\bigg|_{\theta_i^{(t)}} = \sum_{k=1}^{2}\frac{\partial l_i^{(k)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i}, \qquad
\frac{\partial l_i}{\partial b_i}\bigg|_{\theta_i^{(t)}} = \frac{\partial l_i^{(1)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial b_i}, \qquad
\frac{\partial l_i}{\partial e_i}\bigg|_{\theta_i^{(t)}} = \frac{\partial l_i^{(2)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial e_i},
$$
where these partial derivatives are $\frac{\partial l_i^{(1)}}{\partial w_i}$, $\frac{\partial l_i^{(2)}}{\partial w_i}$, $\frac{\partial l_i^{(1)}}{\partial b_i}$, and $\frac{\partial l_i^{(2)}}{\partial e_i}$ evaluated at $\theta_i^{(t)}$ and $\tilde\theta^{(c)}$ using the formulae in (8) and (12), and
$$
\begin{aligned}
S_{w_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(1)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i\partial w_i^\top}\right] + E\!\left[-\frac{\partial^2 l_i^{(2)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i\partial w_i^\top}\,\bigg|\,g_{ij}=1\right]; &
S_{w_i b_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(1)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i\partial b_i}\right];\\
S_{b_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(1)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial b_i^2}\right]; &
S_{b_i e_i}^{(t)} &= 0;\\
S_{w_i e_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(2)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i\partial e_i}\,\bigg|\,g_{ij}=1\right]; &
S_{e_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(2)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial e_i^2}\,\bigg|\,g_{ij}=1\right],
\end{aligned}
$$
where the expectations are based on Formulae (9), (10), and (13)–(15), evaluated at $\theta_i^{(t)}$ and $\tilde\theta^{(c)}$ in the outer loop of updates and at $\theta_i^{(t,k)}$ and $\tilde\theta^{(c)}$ in the inner loop of updates.
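The sketch below (our own illustrative code, assuming a known shape parameter $v$, dense arrays, and no learning-rate adjustment) shows one outer Fisher-scoring update of $\theta_i$ under the canonical link, with the inner loop of $E$ epochs reusing the already loaded row of data.

```python
import numpy as np
from scipy.special import expit

def update_theta_i_canonical(y_i, w_i, b_i, e_i, W_t, b_t, e_t, v, epochs=20):
    """Fisher-scoring update of theta_i = (w_i, b_i, e_i) for one row, canonical link,
    refined over `epochs` inner passes while the column parameters stay fixed."""
    d = w_i.size
    g = (y_i > 0).astype(float)
    for _ in range(epochs):
        eta = W_t @ w_i + b_i + b_t          # logit of p_ij
        tau = W_t @ w_i + e_i + e_t          # canonical link: tau = -1/mu (should be < 0)
        p = expit(eta)
        # score U = (dl/dw_i, dl/db_i, dl/de_i)
        r_gamma = g * v * (1.0 / tau + y_i)  # Gamma-part residual, zero where y_ij = 0
        U = np.concatenate([W_t.T @ (g - p + r_gamma),
                            [np.sum(g - p), np.sum(r_gamma)]])
        # expected information S (note S_{b_i e_i} = 0)
        wb = p * (1 - p)                     # logistic weights
        wg = g * v / tau**2                  # Gamma weights (positive cells only)
        S = np.zeros((d + 2, d + 2))
        S[:d, :d] = W_t.T @ ((wb + wg)[:, None] * W_t)
        S[:d, d] = S[d, :d] = W_t.T @ wb
        S[:d, d + 1] = S[d + 1, :d] = W_t.T @ wg
        S[d, d], S[d + 1, d + 1] = wb.sum(), wg.sum()
        step = np.linalg.solve(S, U)
        w_i, b_i, e_i = w_i + step[:d], b_i + step[d], e_i + step[d + 1]
    return w_i, b_i, e_i
```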
This concludes the first part of the alternating ZIG regression. It gives an iterative update of $\theta$ while $\tilde\theta$ is fixed. After all $\theta_i$'s are updated, we treat $\theta$ as fixed and estimate $\tilde\theta$.
Starting with $\tilde l_j = \tilde l_j^{(1)} + \tilde l_j^{(2)}$, we consider the partial derivatives with respect to $\tilde\theta_j$ while holding $\theta$ fixed. The derivations are similar to those for $l_i$, but we list them for clarity. Note that
$$
\frac{\partial p_{ij}}{\partial \tilde w_j} = \frac{\exp(\eta_{ij})\,w_i}{1+\exp(\eta_{ij})} - \frac{\exp(2\eta_{ij})\,w_i}{\big(1+\exp(\eta_{ij})\big)^2} = p_{ij}(1-p_{ij})\,w_i, \qquad
\frac{\partial p_{ij}}{\partial \tilde b_j} = p_{ij}(1-p_{ij}).
$$
The first-order partial derivatives of $\tilde l_j^{(1)}$ are
$$
\frac{\partial \tilde l_j^{(1)}}{\partial \tilde w_j} = \sum_{i=1}^{n}(g_{ij}-p_{ij})\,w_i, \qquad
\frac{\partial \tilde l_j^{(1)}}{\partial \tilde b_j} = \sum_{i=1}^{n}(g_{ij}-p_{ij}).
$$
The negative second derivatives and their expectations are
$$
-\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde w_j \partial \tilde w_j^\top} = \sum_{i=1}^{n} w_i \frac{\partial p_{ij}}{\partial \tilde w_j^\top} = \sum_{i=1}^{n} w_i w_i^\top\, p_{ij}(1-p_{ij}) = E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde w_j \partial \tilde w_j^\top}\right],
$$
$$
-\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde w_j \partial \tilde b_j} = \sum_{i=1}^{n} w_i\, p_{ij}(1-p_{ij}) = E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde w_j \partial \tilde b_j}\right],
$$
$$
-\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde b_j^2} = \sum_{i=1}^{n} p_{ij}(1-p_{ij}) = E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde b_j^2}\right].
$$
The first-order partial derivatives of the second component $\tilde l_j^{(2)}$ are
$$
\frac{\partial \tilde l_j^{(2)}}{\partial \tilde w_j} = \sum_{i=1}^{n} g_{ij}\left[ \frac{v_{ij}\,w_i}{w_i^\top\tilde w_j + e_i + \tilde e_j} + v_{ij}y_{ij}w_i \right], \qquad
\frac{\partial \tilde l_j^{(2)}}{\partial \tilde e_j} = \sum_{i=1}^{n} g_{ij}\left[ \frac{v_{ij}}{w_i^\top\tilde w_j + e_i + \tilde e_j} + v_{ij}y_{ij} \right].
$$
The negative second-order partial derivatives and their expectations are
$$
\begin{aligned}
-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde w_j \partial \tilde w_j^\top}\bigg|_{g_{ij}=1} &= \sum_{i:\,g_{ij}=1} \frac{v_{ij}\,w_i w_i^\top}{\big(w_i^\top\tilde w_j + e_i + \tilde e_j\big)^2} = E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde w_j \partial \tilde w_j^\top}\,\bigg|\,g_{ij}=1\right],\\
-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde w_j \partial \tilde e_j}\bigg|_{g_{ij}=1} &= \sum_{i:\,g_{ij}=1} \frac{v_{ij}\,w_i}{\big(w_i^\top\tilde w_j + e_i + \tilde e_j\big)^2} = E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde w_j \partial \tilde e_j}\,\bigg|\,g_{ij}=1\right],\\
-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde e_j^2}\bigg|_{g_{ij}=1} &= \sum_{i:\,g_{ij}=1} \frac{v_{ij}}{\big(w_i^\top\tilde w_j + e_i + \tilde e_j\big)^2} = E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde e_j^2}\,\bigg|\,g_{ij}=1\right].
\end{aligned}
$$
Therefore, the other side of the updating equation for the alternating ZIG regression based on the Fisher scoring algorithm is
$$
\tilde\theta_j^{(t+1)} = \tilde\theta_j^{(t)} + \Big[S\big(\tilde\theta_j^{(t)}\big)\Big]^{-1} U\big(\tilde\theta_j^{(t)}\big), \quad j=1,\ldots,n; \qquad
\tilde\theta_j^{(t,k+1)} = \tilde\theta_j^{(t,k)} + \Big[S\big(\tilde\theta_j^{(t,k)}\big)\Big]^{-1} U\big(\tilde\theta_j^{(t,k)}\big), \quad k=1,\ldots,E.
$$
Again, for each $j$ and iteration number $t$, this equation is iterated for a certain number of epochs to obtain a refined estimate of $\tilde\theta_j$ based on the current value $\theta^{(c)}$, without having to reload the data. The quantities in the updating equation are listed below:
$$
\tilde\theta_j^{(t)} = \big(\tilde w_j^{(t)}, \tilde b_j^{(t)}, \tilde e_j^{(t)}\big); \qquad
U\big(\tilde\theta_j^{(t)}\big) = \left(\frac{\partial \tilde l_j}{\partial \tilde w_j}, \frac{\partial \tilde l_j}{\partial \tilde b_j}, \frac{\partial \tilde l_j}{\partial \tilde e_j}\right)\bigg|_{\tilde\theta_j^{(t)}}; \qquad
S\big(\tilde\theta_j^{(t)}\big) = \begin{pmatrix}
S_{\tilde w_j}^{(t)} & S_{\tilde w_j \tilde b_j}^{(t)} & S_{\tilde w_j \tilde e_j}^{(t)}\\
S_{\tilde b_j \tilde w_j}^{(t)} & S_{\tilde b_j}^{(t)} & S_{\tilde b_j \tilde e_j}^{(t)}\\
S_{\tilde e_j \tilde w_j}^{(t)} & S_{\tilde e_j \tilde b_j}^{(t)} & S_{\tilde e_j}^{(t)}
\end{pmatrix}
$$
with
$$
\frac{\partial \tilde l_j}{\partial \tilde w_j}\bigg|_{\tilde\theta_j^{(t)}} = \sum_{k=1}^{2}\frac{\partial \tilde l_j^{(k)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j}, \qquad
\frac{\partial \tilde l_j}{\partial \tilde b_j}\bigg|_{\tilde\theta_j^{(t)}} = \frac{\partial \tilde l_j^{(1)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde b_j}, \qquad
\frac{\partial \tilde l_j}{\partial \tilde e_j}\bigg|_{\tilde\theta_j^{(t)}} = \frac{\partial \tilde l_j^{(2)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde e_j},
$$
where these partial derivatives are $\frac{\partial \tilde l_j^{(1)}}{\partial \tilde w_j}$, $\frac{\partial \tilde l_j^{(2)}}{\partial \tilde w_j}$, $\frac{\partial \tilde l_j^{(1)}}{\partial \tilde b_j}$, and $\frac{\partial \tilde l_j^{(2)}}{\partial \tilde e_j}$ evaluated at $\tilde\theta_j^{(t)}$ and $\theta^{(c)}$, and
$$
\begin{aligned}
S_{\tilde w_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j\partial \tilde w_j^\top}\right] + E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j\partial \tilde w_j^\top}\,\bigg|\,g_{ij}=1\right]; &
S_{\tilde w_j \tilde b_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j\partial \tilde b_j}\right];\\
S_{\tilde b_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde b_j^2}\right]; &
S_{\tilde b_j \tilde e_j}^{(t)} &= 0;\\
S_{\tilde w_j \tilde e_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j\partial \tilde e_j}\,\bigg|\,g_{ij}=1\right]; &
S_{\tilde e_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde e_j^2}\,\bigg|\,g_{ij}=1\right].
\end{aligned}
$$
The expectations in the above formulae are evaluated at $\tilde\theta_j^{(t)}$ and the current values $\theta^{(c)}$ in the outer loop of updates and at $\tilde\theta_j^{(t,k)}$ and $\theta^{(c)}$ in the inner loop of updates.
All of the formulae above assume the parameter $v_{ij}$ is known. When it is not known, it can be estimated with the MLE, a bias-corrected MLE, or a moment estimator. Simulation studies in [21] suggest that when the sample size is large, all of these estimators perform similarly, but the moment estimator has the advantage of being easy to compute. For medium sample sizes, the bias-corrected MLE is better.
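As an illustration of the moment approach (a Pearson-type estimator we sketch here for convenience; the exact estimators compared in [21] may differ in detail), the shape $v$ can be backed out from the relation $\mathrm{Var}(y) = \mu^2/v$:

```python
import numpy as np

def shape_moment_estimate(y, mu):
    """Moment estimator of the Gamma shape: Var(y) = mu^2 / v, so
    v ~ 1 / mean of squared relative residuals ((y - mu) / mu)^2."""
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    return 1.0 / np.mean(((y - mu) / mu) ** 2)

# quick check: recover the shape of simulated Gamma(shape=3, mean=5) data
rng = np.random.default_rng(2)
y = rng.gamma(shape=3.0, scale=5.0 / 3.0, size=50000)
print(shape_moment_estimate(y, np.full_like(y, 5.0)))  # approximately 3
```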
The canonical link for the Gamma regression part can encounter problems. Recall that the canonical link for the Gamma regression is $g(\mu_{ij}) = -\mu_{ij}^{-1} = w_i^\top\tilde w_j + e_i + \tilde e_j$. The left-hand side of the equation is required to be negative because the support of the Gamma distribution is $(0,\infty)$. However, the right-hand side can freely take any value in $(-\infty,\infty)$. Due to this conflict between the natural parameter space and the range of the estimated value, the likelihood function can become undefined: $\log(\mu_{ij}^{-1})$ appears in it, and $\log(\hat\mu_{ij}^{-1})$ cannot be evaluated for negative $\hat\mu_{ij}$ (see Equation (11)).

4. Using Log Link for Gamma Regression

In this section, we consider using the log link to model the Gamma regression part while maintaining the logistic regression part. The settings of the two-part model are similar to those in Section 3 except that modifying the link function from canonical to log link changes the score equations and the Hessian matrix.
For the updating formulae of the model parameters based on the Fisher scoring algorithm, the first- and second-order partial derivatives of $l_i^{(1)}$ remain the same as in (8)–(10) because the new link function does not appear in the Bernoulli part. Now consider the second component of the log likelihood, $l_i^{(2)}$, which was given in (5). In this section, denote $\tau_{ij} = w_i^\top\tilde w_j + e_i + \tilde e_j$ as in the previous section. Then, the log likelihood of the $i$th row of non-zero observations in the co-occurrence matrix, corresponding to the Gamma distribution, can be expressed as
$$
l_i^{(2)} = \sum_{j=1}^{n} I(y_{ij}>0)\log f(y_{ij}\mid y_{ij}>0) = \sum_{j=1}^{n} g_{ij}\log f(y_{ij}\mid y_{ij}>0)
= \sum_{j=1}^{n} g_{ij}\Big[ -\log\Gamma(v_{ij}) + v_{ij}\log v_{ij} + (v_{ij}-1)\log y_{ij} - v_{ij}\tau_{ij} - v_{ij}y_{ij}\exp(-\tau_{ij}) \Big].
$$
The equations below give the first-order partial derivatives of $l_i^{(2)}$ with respect to the parameters $w_i$ and $e_i$:
$$
\frac{\partial l_i^{(2)}}{\partial w_i} = \sum_{j=1}^{n} g_{ij}\left[ -v_{ij}\frac{\partial \tau_{ij}}{\partial w_i} + \frac{v_{ij}y_{ij}}{\mu_{ij}}\frac{\partial \tau_{ij}}{\partial w_i} \right]
= \sum_{j=1}^{n} g_{ij}\,v_{ij}\,\mu_{ij}^{-1}\big(y_{ij}-\mu_{ij}\big)\,\tilde w_j, \qquad
\frac{\partial l_i^{(2)}}{\partial e_i} = \sum_{j=1}^{n} g_{ij}\,v_{ij}\,\mu_{ij}^{-1}\big(y_{ij}-\mu_{ij}\big).
$$
To obtain the second-order partial derivatives, first note that
$$
\mu_{ij} = \exp(\tau_{ij}), \qquad
\frac{\partial \mu_{ij}}{\partial w_i} = \frac{\partial \mu_{ij}}{\partial \tau_{ij}}\frac{\partial \tau_{ij}}{\partial w_i} = \mu_{ij}\tilde w_j, \qquad
\frac{\partial \mu_{ij}}{\partial e_i} = \frac{\partial \mu_{ij}}{\partial \tau_{ij}}\frac{\partial \tau_{ij}}{\partial e_i} = \mu_{ij}.
$$
Then, the negative second-order partial derivatives and their expectations, which are the components of the Fisher information matrix, can be derived as
$$
-\frac{\partial^2 l_i^{(2)}}{\partial w_i \partial w_i^\top} = \sum_{j=1}^{n} g_{ij}v_{ij}\left(\frac{y_{ij}-\mu_{ij}}{\mu_{ij}}+1\right)\tilde w_j\tilde w_j^\top, \qquad
E\!\left[-\frac{\partial^2 l_i^{(2)}}{\partial w_i\partial w_i^\top}\,\bigg|\,g_{ij}=1\right] = \sum_{j=1}^{n} g_{ij}v_{ij}\,\tilde w_j\tilde w_j^\top,
$$
$$
-\frac{\partial^2 l_i^{(2)}}{\partial w_i \partial e_i} = \sum_{j=1}^{n} g_{ij}v_{ij}\left(\frac{y_{ij}-\mu_{ij}}{\mu_{ij}}+1\right)\tilde w_j, \qquad
E\!\left[-\frac{\partial^2 l_i^{(2)}}{\partial w_i\partial e_i}\,\bigg|\,g_{ij}=1\right] = \sum_{j=1}^{n} g_{ij}v_{ij}\,\tilde w_j,
$$
$$
-\frac{\partial^2 l_i^{(2)}}{\partial e_i^2} = \sum_{j=1}^{n} g_{ij}v_{ij}\left(\frac{y_{ij}-\mu_{ij}}{\mu_{ij}}+1\right), \qquad
E\!\left[-\frac{\partial^2 l_i^{(2)}}{\partial e_i^2}\,\bigg|\,g_{ij}=1\right] = \sum_{j=1}^{n} g_{ij}v_{ij}.
$$
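Under the log link, the score and expected information of the Gamma part take a particularly simple form; the sketch below (our own illustrative code, with a known shape $v$) computes them for one row while the column parameters are held fixed.

```python
import numpy as np

def gamma_loglink_score_info(y_i, w_i, e_i, W_t, e_t, v):
    """Score and expected information of the Gamma part, log link, for row i."""
    g = (y_i > 0).astype(float)
    mu = np.exp(W_t @ w_i + e_i + e_t)      # mu_ij = exp(w_i . w~_j + e_i + e~_j)
    r = g * v * (y_i - mu) / mu             # v (y - mu) / mu, zeroed where y_ij = 0
    U_w, U_e = W_t.T @ r, r.sum()           # score components for w_i and e_i
    wgt = g * v                             # expected-information weights
    S_ww = W_t.T @ (wgt[:, None] * W_t)     # E[-d2 l / dw dw'] = sum g v w~ w~'
    S_we = W_t.T @ wgt                      # E[-d2 l / dw de]  = sum g v w~
    S_ee = wgt.sum()                        # E[-d2 l / de^2]   = sum g v
    return (U_w, U_e), (S_ww, S_we, S_ee)
```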
The first part of the alternating regression has the following updating equations based on the Fisher scoring algorithm:
$$
\theta_i^{(t+1)} = \theta_i^{(t)} + \Big[S\big(\theta_i^{(t)}\big)\Big]^{-1} U\big(\theta_i^{(t)}\big), \quad i=1,\ldots,n; \qquad
\theta_i^{(t,k+1)} = \theta_i^{(t,k)} + \Big[S\big(\theta_i^{(t,k)}\big)\Big]^{-1} U\big(\theta_i^{(t,k)}\big), \quad k=1,\ldots,E.
$$
The updating equation in (29) is iterated for $E$ epochs (say $E=20$) for the same $i$ and $t$, where $t$ is the iteration number. The multiple epochs allow the estimate of $\theta_i$ to get closer to its MLE for the given current value $\tilde\theta^{(c)}$, as the score equations and information matrix are also updated with the estimated parameter values. There is no need for too many epochs because $\tilde\theta^{(c)}$ is not the true value yet and still needs to be estimated later. Below are the formulae involved in the updating equations:
$$
\theta_i^{(t)} = \big(w_i^{(t)}, b_i^{(t)}, e_i^{(t)}\big); \qquad
U\big(\theta_i^{(t)}\big) = \left(\frac{\partial l_i}{\partial w_i}, \frac{\partial l_i}{\partial b_i}, \frac{\partial l_i}{\partial e_i}\right)\bigg|_{\theta_i^{(t)}}; \qquad
S\big(\theta_i^{(t)}\big) = \begin{pmatrix}
S_{w_i}^{(t)} & S_{w_i b_i}^{(t)} & S_{w_i e_i}^{(t)}\\
S_{b_i w_i}^{(t)} & S_{b_i}^{(t)} & S_{b_i e_i}^{(t)}\\
S_{e_i w_i}^{(t)} & S_{e_i b_i}^{(t)} & S_{e_i}^{(t)}
\end{pmatrix}
$$
with
$$
\frac{\partial l_i}{\partial w_i}\bigg|_{\theta_i^{(t)}} = \sum_{k=1}^{2}\frac{\partial l_i^{(k)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i}, \qquad
\frac{\partial l_i}{\partial b_i}\bigg|_{\theta_i^{(t)}} = \frac{\partial l_i^{(1)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial b_i}, \qquad
\frac{\partial l_i}{\partial e_i}\bigg|_{\theta_i^{(t)}} = \frac{\partial l_i^{(2)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial e_i},
$$
where these partial derivatives are $\frac{\partial l_i^{(1)}}{\partial w_i}$, $\frac{\partial l_i^{(2)}}{\partial w_i}$, $\frac{\partial l_i^{(1)}}{\partial b_i}$, and $\frac{\partial l_i^{(2)}}{\partial e_i}$ evaluated at $\theta_i^{(t)}$ and $\tilde\theta^{(c)}$ using Equations (8) and (25), and
$$
\begin{aligned}
S_{w_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(1)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i\partial w_i^\top}\right] + E\!\left[-\frac{\partial^2 l_i^{(2)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i\partial w_i^\top}\,\bigg|\,g_{ij}=1\right]; &
S_{w_i b_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(1)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i\partial b_i}\right];\\
S_{b_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(1)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial b_i^2}\right]; &
S_{b_i e_i}^{(t)} &= 0;\\
S_{w_i e_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(2)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial w_i\partial e_i}\,\bigg|\,g_{ij}=1\right]; &
S_{e_i}^{(t)} &= E\!\left[-\frac{\partial^2 l_i^{(2)}(\theta_i^{(t)};\tilde\theta^{(c)})}{\partial e_i^2}\,\bigg|\,g_{ij}=1\right].
\end{aligned}
$$
The expectations in the above formulae are given in (9), (10), and (26)–(28), except that the parameters take the values $\theta^{(t)}$ and $\tilde\theta^{(c)}$ at the $t$th iteration in the outer loop of updates and the values $\theta^{(t,k)}$ and $\tilde\theta^{(c)}$ in the inner loop of updates.
The aforementioned equations are used to update the $\theta$ part of the alternating ZIG regression. Now, consider the other side of the alternating ZIG regression, in which updates are performed for $\tilde\theta$ while holding $\theta$ fixed. First, consider the first-order partial derivatives
$$
\frac{\partial \tilde l}{\partial \tilde w_j} = \sum_{i=1}^{n}\left[\frac{\partial \tilde l_{ij}^{(1)}}{\partial \tilde w_j} + \frac{\partial \tilde l_{ij}^{(2)}}{\partial \tilde w_j}\right]
= \sum_{i=1}^{n}\frac{\partial \tilde l_{ij}^{(1)}}{\partial \tilde w_j} + \sum_{i=1}^{n}\frac{\partial \tilde l_{ij}^{(2)}}{\partial \tilde w_j}
= \frac{\partial \tilde l_j^{(1)}}{\partial \tilde w_j} + \frac{\partial \tilde l_j^{(2)}}{\partial \tilde w_j}.
$$
The first-order partial derivatives of $\tilde l_j^{(1)}$ are
$$
\frac{\partial \tilde l_j^{(1)}}{\partial \tilde w_j} = \sum_{i=1}^{n}(g_{ij}-p_{ij})\,w_i, \qquad
\frac{\partial \tilde l_j^{(1)}}{\partial \tilde b_j} = \sum_{i=1}^{n}(g_{ij}-p_{ij}), \qquad
\frac{\partial \tilde l_j^{(1)}}{\partial \tilde e_j} = 0.
$$
The second-order partial derivatives $\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde w_j\partial \tilde w_j^\top}$, $\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde w_j\partial \tilde b_j}$, $\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde b_j^2}$ and their expectations are the same as in the canonical link case, given in Equations (21)–(23). Additionally,
$$
\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde b_j\partial \tilde e_j} = 0; \qquad
\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde e_j^2} = 0; \qquad
E\!\left[\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde b_j\partial \tilde e_j}\right] = 0; \qquad
E\!\left[\frac{\partial^2 \tilde l_j^{(1)}}{\partial \tilde e_j^2}\right] = 0.
$$
Next, consider the first-order partial derivatives of the second component $\tilde l_j^{(2)}$:
$$
\frac{\partial \tilde l_j^{(2)}}{\partial \tilde w_j} = \sum_{i=1}^{n} g_{ij}\,v_{ij}\,\mu_{ij}^{-1}\big(y_{ij}-\mu_{ij}\big)\,w_i, \qquad
\frac{\partial \tilde l_j^{(2)}}{\partial \tilde e_j} = \sum_{i=1}^{n} g_{ij}\,v_{ij}\,\mu_{ij}^{-1}\big(y_{ij}-\mu_{ij}\big).
$$
To derive the second-order partial derivatives, note that two frequently used terms are
$$
\mu_{ij} = \exp(\tau_{ij}), \qquad
\frac{\partial \mu_{ij}}{\partial \tilde w_j} = \frac{\partial \mu_{ij}}{\partial \tau_{ij}}\cdot\frac{\partial \tau_{ij}}{\partial \tilde w_j} = \mu_{ij}w_i, \qquad
\frac{\partial \mu_{ij}}{\partial \tilde e_j} = \frac{\partial \mu_{ij}}{\partial \tau_{ij}}\cdot\frac{\partial \tau_{ij}}{\partial \tilde e_j} = \mu_{ij}.
$$
Using these two terms, the negative second derivatives and their expectations can be written as
$$
\begin{aligned}
-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde w_j\partial \tilde w_j^\top} &= \sum_{i=1}^{n} g_{ij}v_{ij}\left(\frac{y_{ij}-\mu_{ij}}{\mu_{ij}}+1\right)w_i w_i^\top, &
E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde w_j\partial \tilde w_j^\top}\,\bigg|\,g_{ij}=1\right] &= \sum_{i=1}^{n} g_{ij}v_{ij}\,w_i w_i^\top,\\
-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde w_j\partial \tilde e_j} &= \sum_{i=1}^{n} g_{ij}v_{ij}\left(\frac{y_{ij}-\mu_{ij}}{\mu_{ij}}+1\right)w_i, &
E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde w_j\partial \tilde e_j}\,\bigg|\,g_{ij}=1\right] &= \sum_{i=1}^{n} g_{ij}v_{ij}\,w_i,\\
-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde e_j^2} &= \sum_{i=1}^{n} g_{ij}v_{ij}\left(\frac{y_{ij}-\mu_{ij}}{\mu_{ij}}+1\right), &
E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}}{\partial \tilde e_j^2}\,\bigg|\,g_{ij}=1\right] &= \sum_{i=1}^{n} g_{ij}v_{ij}.
\end{aligned}
$$
Therefore, the updating equations for $\tilde\theta$ based on the Fisher scoring algorithm are
$$
\tilde\theta_j^{(t+1)} = \tilde\theta_j^{(t)} + \Big[S\big(\tilde\theta_j^{(t)}\big)\Big]^{-1} U\big(\tilde\theta_j^{(t)}\big), \qquad j=1,\ldots,n,
$$
$$
\tilde\theta_j^{(t,k+1)} = \tilde\theta_j^{(t,k)} + \Big[S\big(\tilde\theta_j^{(t,k)}\big)\Big]^{-1} U\big(\tilde\theta_j^{(t,k)}\big), \qquad k=1,\ldots,E.
$$
The updating Equation (30) forms the outer loop of iterations, and the iterations in (31) form the inner loop of epochs for the same $j$ and $t$. The quantities involved are
$$
\tilde\theta_j^{(t)} = \big(\tilde w_j^{(t)}, \tilde b_j^{(t)}, \tilde e_j^{(t)}\big); \qquad
U\big(\tilde\theta_j^{(t)}\big) = \left(\frac{\partial \tilde l_j}{\partial \tilde w_j}, \frac{\partial \tilde l_j}{\partial \tilde b_j}, \frac{\partial \tilde l_j}{\partial \tilde e_j}\right)\bigg|_{\tilde\theta_j^{(t)}}; \qquad
S\big(\tilde\theta_j^{(t)}\big) = \begin{pmatrix}
S_{\tilde w_j}^{(t)} & S_{\tilde w_j \tilde b_j}^{(t)} & S_{\tilde w_j \tilde e_j}^{(t)}\\
S_{\tilde b_j \tilde w_j}^{(t)} & S_{\tilde b_j}^{(t)} & S_{\tilde b_j \tilde e_j}^{(t)}\\
S_{\tilde e_j \tilde w_j}^{(t)} & S_{\tilde e_j \tilde b_j}^{(t)} & S_{\tilde e_j}^{(t)}
\end{pmatrix}
$$
with
$$
\frac{\partial \tilde l_j}{\partial \tilde w_j}\bigg|_{\tilde\theta_j^{(t)}} = \sum_{k=1}^{2}\frac{\partial \tilde l_j^{(k)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j}, \qquad
\frac{\partial \tilde l_j}{\partial \tilde b_j}\bigg|_{\tilde\theta_j^{(t)}} = \frac{\partial \tilde l_j^{(1)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde b_j}, \qquad
\frac{\partial \tilde l_j}{\partial \tilde e_j}\bigg|_{\tilde\theta_j^{(t)}} = \frac{\partial \tilde l_j^{(2)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde e_j},
$$
where these partial derivatives are $\frac{\partial \tilde l_j^{(1)}}{\partial \tilde w_j}$, $\frac{\partial \tilde l_j^{(2)}}{\partial \tilde w_j}$, $\frac{\partial \tilde l_j^{(1)}}{\partial \tilde b_j}$, and $\frac{\partial \tilde l_j^{(2)}}{\partial \tilde e_j}$, and
$$
\begin{aligned}
S_{\tilde w_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j\partial \tilde w_j^\top}\right] + E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j\partial \tilde w_j^\top}\,\bigg|\,g_{ij}=1\right]; &
S_{\tilde w_j \tilde b_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j\partial \tilde b_j}\right];\\
S_{\tilde b_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(1)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde b_j^2}\right]; &
S_{\tilde b_j \tilde e_j}^{(t)} &= 0;\\
S_{\tilde w_j \tilde e_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde w_j\partial \tilde e_j}\,\bigg|\,g_{ij}=1\right]; &
S_{\tilde e_j}^{(t)} &= E\!\left[-\frac{\partial^2 \tilde l_j^{(2)}(\tilde\theta_j^{(t)};\theta^{(c)})}{\partial \tilde e_j^2}\,\bigg|\,g_{ij}=1\right],
\end{aligned}
$$
evaluated at $\tilde\theta_j^{(t)}$ and $\theta^{(c)}$ during the outer loop of iterations and at $\tilde\theta_j^{(t,k)}$ and $\theta^{(c)}$ during the inner loop of iterations.

5. Convergence Analysis

In this section, we discuss the convergence behavior of the algorithm analytically. Our algorithm contains two components: a logistic regression part and a Gamma regression part. If the parameter estimation of these two components were independent of each other, then each would be in the situation of a standard generalized linear model (GLM). In our model, the estimation of the two components cannot be separated, but the two components of the log likelihood are additive. Hence, the convergence behavior of the standard single-component GLM case is still relevant. We first discuss the convergence behavior in this standard case, as it applies to the situation where we hold $\tilde\theta$ fixed while estimating $\theta$, or vice versa.
In the standard GLM setting, parameter estimation usually converges quickly. Abnormal behavior can still occur, for example when an estimated model component falls outside the valid range of the distribution. In Gamma regression with the canonical (negative inverse) link, the estimated mean may occasionally become negative even though the distribution requires a positive mean. Refs. [22,23,24] presented conditions for the existence of the MLE in logistic regression models. They proved, in the non-trivial case, that if the convex cones generated by the covariate values from the two classes overlap, the maximum likelihood estimates of the regression parameters exist and are unique. On the other hand, if the convex cones generated by the covariates in the two classes exhibit complete separation or quasi complete separation, then the maximum likelihood estimates do not exist or are unbounded. Generally, they recommended inserting a stopping rule if complete separation is found, or restarting the iterative algorithm with standardized observations (mean 0 and variance 1) if quasi complete separation is found. Ref. [24] also recommended a procedure to check these conditions. Under quasi complete separation, the estimation process diverges at least at some points, which makes the estimated probability of belonging to the correct class grow toward one. Therefore, ref. [24] recommended checking the maximum predicted probability $p^{(t)}$ over the data points at the $t$th iteration. If the maximum probability is close to 1 and exceeds the previous iterations' maximum, there are two possible causes. The first is that the data point is an outlier within its own class while the two convex cones overlap; in this case the MLE exists and is unique, so they suggested printing a warning but letting the algorithm continue. The other possibility is quasi complete separation in the data; in this case, the process should be stopped and rerun with the observation vectors standardized to zero mean and unit variance.
Ref. [24] also stated that the difficulties associated with complete and quasi complete separation are small sample problems. With large sample size, the probability of observing a set of separated data points is close to zero. Complete separation may occur with any type of data but it is unlikely that quasi complete separation will occur with truly continuous data.
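A simple diagnostic in the spirit of this recommendation (an illustrative heuristic of ours, not the exact procedure of [24]) is to track the maximum fitted probability across iterations and flag the case where it keeps growing toward one:

```python
import numpy as np

def separation_warning(p_history, tol=1e-3):
    """Heuristic flag for possible (quasi) complete separation: the maximum
    fitted probability grows monotonically and approaches 1."""
    p_max = np.array([np.max(p) for p in p_history])   # one entry per iteration
    return bool(np.all(np.diff(p_max) > 0) and p_max[-1] > 1 - tol)
```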
Ref. [25] illustrated the non-convergence problem with a Poisson regression example. The author proposed a simple solution and implemented it in the R glm2 package, version 1.2.1. Specifically, if the iteratively reweighted least squares (IRLS) procedure produces either an infinite deviance or predicted values that fall outside the valid range, the update of the parameter estimates is repeatedly halved until the problem disappears. Moreover, a further step-halving check verifies that the updated deviance decreases compared with the previous iteration; if it does not, step-halving is triggered so that the algorithm monotonically reduces the deviance.
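The step-halving idea can be sketched as follows (illustrative code of ours, not the glm2 implementation; `deviance_fn` is a hypothetical callable returning the model deviance, with invalid fits assumed to return an infinite or NaN value):

```python
import numpy as np

def damped_update(theta, step, deviance_fn, max_halvings=10):
    """Shrink a Fisher-scoring step until the deviance is finite and no worse
    than at the current estimate, in the spirit of the glm2 step-halving."""
    dev_old = deviance_fn(theta)
    for _ in range(max_halvings):
        candidate = theta + step
        dev_new = deviance_fn(candidate)
        if np.isfinite(dev_new) and dev_new <= dev_old:
            return candidate                 # accepted: deviance did not increase
        step = step / 2.0                    # otherwise halve the update and retry
    return theta                             # give up and keep the current estimate
```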
Based on the aforementioned studies of convergence behavior in the standard GLM case, we now analyze the convergence behavior of our alternating ZIG regression. We first specify our convex cones. Recall that $\tilde\theta$ serves as data when we estimate $\theta_i$. In our context, the convex cones are defined through $\tilde\theta$ when we estimate $\theta_i$, and through $\theta$ when we estimate $\tilde\theta_j$. That is, for estimating $\theta_i$, the convex cones are
$$
G_{\tilde\theta}(i) = \Bigg\{ \sum_{j:\,y_{ij}>0} k_j\,\tilde\theta_j \;\Bigg|\; k_j > 0 \Bigg\}, \qquad
F_{\tilde\theta}(i) = \Bigg\{ \sum_{j:\,y_{ij}=0} k_j\,\tilde\theta_j \;\Bigg|\; k_j > 0 \Bigg\}.
$$
For estimating $\tilde\theta_j$, the convex cones are denoted $G_\theta(j)$ and $F_\theta(j)$, respectively.
First, we consider the case where either complete separation or quasi complete separation exists in the data. We discuss the estimation of $\theta_i$ while holding $\tilde\theta$ fixed.
Suppose there exists complete separation or quasi complete separation between $G_{\tilde\theta}(i)$ and $F_{\tilde\theta}(i)$. That is, $G_{\tilde\theta}(i)\cap F_{\tilde\theta}(i) = \emptyset$, $G_{\tilde\theta}(i)\ne\mathbb{R}^{d+2}$, and $F_{\tilde\theta}(i)\ne\mathbb{R}^{d+2}$. Also suppose $G_{\tilde\theta}(i)\ne\emptyset$ and $F_{\tilde\theta}(i)\ne\emptyset$. Then there exists a direction vector $c_i$ such that $\tilde w_j^\top c_i \ge 0$ for $y_{ij}>0$ and $\tilde w_j^\top c_i \le 0$ for $y_{ij}=0$. (Note: this vector can be taken perpendicular to a vector that lies between $G_{\tilde\theta}(i)$ and $F_{\tilde\theta}(i)$ but belongs to neither.) Let $l_i^{(1)}(k)$ be the log likelihood of the Bernoulli part when the model parameters are moved $k$ units in the direction $c_i$. That is, the original log likelihood $l_i^{(1)}$ and the updated $l_i^{(1)}(k)$ are
$$
\begin{aligned}
l_i^{(1)} &= \sum_{j=1}^{n}\Big[ I(y_{ij}=0)\log(1-p_{ij}) + I(y_{ij}>0)\log p_{ij} \Big], &&\text{where } p_{ij} = \frac{\exp(\eta_{ij})}{1+\exp(\eta_{ij})},\ \eta_{ij} = w_i^\top\tilde w_j + b_i + \tilde b_j,\\
l_i^{(1)}(k) &= \sum_{j=1}^{n}\Big[ I(y_{ij}=0)\log\big(1-p_{ij}(k)\big) + I(y_{ij}>0)\log p_{ij}(k) \Big], &&\text{where } p_{ij}(k) = \frac{\exp(\eta_{ij} + k\,\tilde w_j^\top c_i)}{1+\exp(\eta_{ij} + k\,\tilde w_j^\top c_i)}.
\end{aligned}
$$
  • For $y_{ij}>0$, $\tilde w_j^\top c_i \ge 0$. Hence, as $k$ increases to $\infty$, $p_{ij}(k)$ increases toward 1. This implies $\sum_{j=1}^{n} I(y_{ij}>0)\log p_{ij}(k)$ increases toward 0.
  • For $y_{ij}=0$, $\tilde w_j^\top c_i \le 0$. Hence, as $k$ increases to $\infty$, $p_{ij}(k)$ decreases toward 0. This implies $\sum_{j=1}^{n} I(y_{ij}=0)\log\big(1-p_{ij}(k)\big)$ increases toward 0.
Putting the two pieces together, $l_i^{(1)}(k)$ increases with $k$ for any given $w_i$, $b_i$. Therefore, the maximum cannot be reached until $k$ is $\infty$. This means the MLE does not exist, or the solution set $\{\hat\theta_i\}$ is unbounded, for the current value of $\tilde\theta$. This can also be seen by looking at the partial derivative of $l_i^{(1)}(k)$ with respect to $k$:
$$
\frac{\partial l_i^{(1)}(k)}{\partial k} = \sum_{j=1}^{n} I(y_{ij}=0)\big({-p_{ij}(k)}\big)\,\tilde w_j^\top c_i + \sum_{j=1}^{n} I(y_{ij}>0)\big(1-p_{ij}(k)\big)\,\tilde w_j^\top c_i.
$$
Since $\tilde w_j^\top c_i \le 0$ when $y_{ij}=0$, the first term is non-negative. Similarly, $\tilde w_j^\top c_i \ge 0$ when $y_{ij}>0$ implies the second term is non-negative. Therefore, the partial derivative of $l_i^{(1)}(k)$ is non-negative. This indicates that the gradient of $l_i^{(1)}$ is positive unless $\tilde w_j^\top c_i = 0$. Therefore, there is no solution to $\nabla l_i^{(1)} = 0$ except the trivial direction $c_i = 0$ in binomial regression alone. Of course, this is not exactly our case, because the Gamma regression component must still be considered jointly.
Now consider the $l_i^{(2)}$ component, with its first- and second-order derivatives with respect to $k$:
$$
l_i^{(2)}(k) = \sum_{j=1}^{n} I(y_{ij}>0)\left[ o_{ij} + v_{ij}\log\frac{y_{ij}}{\exp(\tau_{ij}+k\,\tilde w_j^\top c_i)} - \frac{v_{ij}\,y_{ij}}{\exp(\tau_{ij}+k\,\tilde w_j^\top c_i)} \right],
$$
where $o_{ij} = -\log\Gamma(v_{ij}) + v_{ij}\log v_{ij} - \log y_{ij}$. Note that
$$
\frac{\partial l_i^{(2)}(k)}{\partial k} = \sum_{j=1}^{n} I(y_{ij}>0)\, v_{ij}\,\tilde w_j^\top c_i\left[ \frac{y_{ij}}{\exp(\tau_{ij}+k\,\tilde w_j^\top c_i)} - 1 \right]
= \sum_{j=1}^{n} I(y_{ij}>0)\,\tilde w_j^\top c_i\,\sqrt{v_{ij}}\,u_{ij}(k),
$$
where $u_{ij}(k) = \dfrac{y_{ij} - \exp(\tau_{ij}+k\,\tilde w_j^\top c_i)}{\exp(\tau_{ij}+k\,\tilde w_j^\top c_i)/\sqrt{v_{ij}}}$ is the standardized Gamma random variable with mean 0 and variance 1, because $\exp(\tau_{ij}+k\,\tilde w_j^\top c_i)$ is the mean of $y_{ij}$ and $v_{ij}$ is the shape parameter. Further,
$$
\frac{\partial^2 l_i^{(2)}(k)}{\partial k^2} = -\sum_{j=1}^{n} I(y_{ij}>0)\, v_{ij}\big(\tilde w_j^\top c_i\big)^2\,\frac{y_{ij}}{\exp(\tau_{ij}+k\,\tilde w_j^\top c_i)} < 0.
$$
The negativity of the second-order derivative implies that $l_i^{(2)}(k)$ is concave.
Gathering the partial derivatives from both $l_i^{(1)}$ and $l_i^{(2)}$, we obtain the partial derivative of $l_i$:
$$
\frac{\partial l_i(k)}{\partial k} = \frac{\partial l_i^{(1)}(k)}{\partial k} + \frac{\partial l_i^{(2)}(k)}{\partial k}
= \sum_{j=1}^{n} I(y_{ij}=0)\big({-p_{ij}(k)}\big)\tilde w_j^\top c_i
+ \sum_{j=1}^{n} I(y_{ij}>0)\big(1-p_{ij}(k)\big)\tilde w_j^\top c_i
+ \sum_{j=1}^{n} I(y_{ij}>0)\sqrt{v_{ij}}\,u_{ij}(k)\,\tilde w_j^\top c_i.
$$
From the above discussion, the first two terms in (35), corresponding to $\partial l_i^{(1)}(k)/\partial k$, are greater than 0. The $u_{ij}(k)$ is centered at zero and has variance 1. Therefore, for the MLE to exist, the sum $\sum_{j} I(y_{ij}>0)\sqrt{v_{ij}}\,u_{ij}(k)\,\tilde w_j^\top c_i\, I(u_{ij}(k)<0)$ must cancel the total value of $\partial l_i^{(1)}(k)/\partial k$ together with the positive term $\sum_{j} I(y_{ij}>0)\sqrt{v_{ij}}\,u_{ij}(k)\,\tilde w_j^\top c_i\, I(u_{ij}(k)>0)$. If the shape parameter $v_{ij}$ of the Gamma distribution is small, the distribution is highly skewed with a long right tail, and more observations fall below their means. However, a small shape parameter also makes $\sqrt{v_{ij}}\,u_{ij}(k)$ small, so $\partial l_i^{(1)}(k)/\partial k$ may dominate, and hence the entire partial derivative $\partial l_i(k)/\partial k$ is greater than zero. When the shape parameter is large, the Gamma random variables are approximately normally distributed. In that case, the observations are located roughly symmetrically on either side of their means, which makes the numbers of observations with $u_{ij}<0$ and $u_{ij}>0$ roughly equal. Consequently, $\partial l_i^{(2)}(k)/\partial k$ is close to zero, and nothing is left to neutralize $\partial l_i^{(1)}(k)/\partial k$. As a result, whether the shape parameter $v_{ij}$ is large or small, when complete separation or quasi complete separation holds, it is highly likely that the MLE does not exist. When the shape parameter $v_{ij}$ is intermediate, so that the Gamma distribution is still skewed, the larger number of Gamma observations below their means might allow $\sum_{j} I(y_{ij}>0)\sqrt{v_{ij}}\,u_{ij}(k)\,\tilde w_j^\top c_i\, I(u_{ij}(k)<0)$ to cancel all the other positive terms. This is the case where a solution for $\theta_i$ could exist.
Next, consider the case where the two convex cones $G_{\tilde\theta}(i)$ and $F_{\tilde\theta}(i)$ overlap (i.e., there is neither complete separation nor quasi complete separation in the data). Recall that the ZIG model is a two-part model with a Bernoulli part and a Gamma part. The log likelihood function $l_i^{(1)}$ corresponding to the Bernoulli part can be shown to be strictly concave in $\theta_i = (w_i, b_i)$. This is because the first component $g_{ij}\eta_{ij} = g_{ij}(w_i^\top\tilde w_j + b_i + \tilde b_j)$ in $l_i^{(1)}$ is an affine function of $\theta_i$ (see Equation (7) for the expression of $l_i^{(1)}$), which is both convex and concave when $\tilde w_j$ and $\tilde b_j$ are held fixed. The second component, $-\log\big(1+\exp(\eta_{ij})\big)$, is strictly concave, since its second derivative is negative:
$$
\frac{\partial^2}{\partial x^2}\Big[-\log\big(1+\exp(x)\big)\Big] = -\frac{\exp(x)}{\big(1+\exp(x)\big)^2} < 0.
$$
Thus, the log likelihood corresponding to the Bernoulli part, $l_i^{(1)}$, is strictly concave.
Now consider the log likelihood corresponding to the Gamma part, $l_i^{(2)}$, in (24). We only need to consider the summation over the last two terms, $-\nu_{ij}\tau_{ij}$ and $-\nu_{ij}y_{ij}\exp(-\tau_{ij})$, because the other terms do not involve the regression parameters, where $\tau_{ij} = w_i^\top\tilde w_j + e_i + \tilde e_j$. Note that $\tau_{ij}$ is an affine function of $\theta_i$, which is both convex and concave, and $\exp(g(x))$ is convex if $g(x)$ is convex (see ref. [26]). This leads to the term $-\nu_{ij}y_{ij}\exp(-\tau_{ij})$ being strictly concave in $\theta_i$. Hence, $l_i^{(2)}$ is strictly concave in $\theta_i = (w_i, b_i, e_i)$. Combining the two concave components $l_i^{(1)}$ and $l_i^{(2)}$, the entire log likelihood for the $i$th row, $l_i$, is a strictly concave function with respect to the parameters being estimated, for any row $i$. Additionally, there is overlap in the two convex cones. Therefore, for any direction in the overlapping area of the convex cones, updating the parameters along that direction makes the two components of $\partial l_i^{(1)}(k)/\partial k$ in Formula (34) take opposite signs. In this case, $l_i$ has a unique maximum when there is neither complete separation nor quasi complete separation in $G_{\tilde\theta}(i)$ and $F_{\tilde\theta}(i)$. Similar arguments apply when we estimate $\tilde\theta_j$ while holding $\theta$ fixed. That is, a unique MLE of $\tilde\theta_j$ exists when there is neither complete separation nor quasi complete separation in $G_\theta(j)$ and $F_\theta(j)$. This means that the alternating procedure will find the MLE of $\tilde\theta_j$ when $\theta$ is fixed and the MLE of $\theta_i$ when $\tilde\theta$ is fixed in this non-separation scenario.
Next, we consider the convergence behavior of the estimation of $\theta$ without fixing the value of $\tilde\theta$. For estimating $\theta$, consider the complete data $x = (y, \tilde\theta)$. The entire $\tilde\theta$ matrix is missing. With so many missing values, a logical approach is to regard $\tilde\theta$ as randomly drawn from a distribution that has relatively few parameters. Assume the distribution of $x$ is in the exponential family.
Let $\tilde f(x\mid\theta)$ be the unconditional density of the complete data $x = (y, \tilde\theta)$ and $K(x\mid y, \theta)$ be the conditional density given $y$. Denote the marginal density of $y$ given $\theta$ as $g(y\mid\theta)$. In the next few paragraphs, we explain that the alternating updates in ZIG ultimately lead to a value of $\theta$ that maximizes $\check l(\theta) = \log g(y\mid\theta)$.
For exponential families, the unconditional density $\tilde f(x\mid\theta)$ and the conditional density $K(x\mid y,\theta) = \tilde f(x\mid\theta)/g(y\mid\theta)$ have the same natural parameter $\theta$ and the same sufficient statistic $t(x)$, except that they are defined over different sample spaces $\Omega(x)$ versus $\Omega(y)$. We can write $\tilde f(x\mid\theta)$ and $K(x\mid y,\theta)$ in general exponential family form as
$$
\tilde f(x\mid\theta) = \exp\big(\theta\, t(x)^\top + c(x)\big)/a(\theta), \qquad
K(x\mid y,\theta) = \exp\big(\theta\, t(x)^\top + c(x)\big)/b(\theta\mid y),
$$
where
$$
a(\theta) = \int_{\Omega(x)} \exp\big(\theta\, t(x)^\top + c(x)\big)\,dx, \qquad
b(\theta\mid y) = \int_{\Omega(y)} \exp\big(\theta\, t(x)^\top + c(x)\big)\,dx.
$$
Then, $\check l(\theta) = -\log a(\theta) + \log b(\theta\mid y)$. The first- and second-order derivatives of $\check l(\theta)$ are
$$
\begin{aligned}
\frac{\partial \check l(\theta)}{\partial \theta} &= -\frac{\partial \log a(\theta)}{\partial \theta} + \frac{\partial \log b(\theta\mid y)}{\partial \theta} = E\big(t(x)\mid y,\theta\big) - E\big(t(x)\mid\theta\big),\\
\frac{\partial^2 \check l(\theta)}{\partial \theta\,\partial \theta^\top} &= -\frac{\partial^2 \log a(\theta)}{\partial \theta\,\partial \theta^\top} + \frac{\partial^2 \log b(\theta\mid y)}{\partial \theta\,\partial \theta^\top} = E\big[\mathrm{Var}\big(t(x)\mid y,\theta\big)\,\big|\,\theta\big] - \mathrm{Var}\big(t(x)\mid\theta\big),
\end{aligned}
$$
where $E(t(x)\mid\theta)$ and $\mathrm{Var}(t(x)\mid\theta)$ are the expectation and variance under the complete data likelihood from $y$ and $\tilde\theta$, and $E(t(x)\mid y,\theta)$ and $\mathrm{Var}(t(x)\mid y,\theta)$ are the conditional expectation and variance of the sufficient statistic. $E[\mathrm{Var}(t(x)\mid y,\theta)\mid\theta]$ is the expected value of the conditional covariance matrix when $y$ has sampling density $g(y\mid\theta)$. The last equality in each equation assumes that the order of expectation and differentiation can be exchanged. The first equation means that the derivative of the log likelihood is the difference between the conditional and unconditional expectations of the sufficient statistic.
Meanwhile, the updating equation based on the Fisher scoring algorithm can be written as
$$
\hat\theta^{(k+1)} = \hat\theta^{(k)} + \Big[\mathrm{Var}\big(t(x)\mid\hat\theta^{(k)}\big) - E\big[\mathrm{Var}\big(t(x)\mid y,\hat\theta^{(k)}\big)\,\big|\,\hat\theta^{(k)}\big]\Big]^{-1}\left[\frac{\partial \check l(\theta)}{\partial \theta}\bigg|_{\theta=\hat\theta^{(k)}}\right]^\top
= \hat\theta^{(k)} + \mathrm{Var}^{-1}\Big(E\big(t(x)\mid y,\hat\theta^{(k)}\big)\,\Big|\,\hat\theta^{(k)}\Big)\Big[E\big(t(x)\mid y,\hat\theta^{(k)}\big) - E\big(t(x)\mid\hat\theta^{(k)}\big)\Big],
$$
where, in the limit, $\hat\theta^{(k+1)} = \hat\theta^{(k)} = \theta^*$ for some $\theta^*$, which leads to $E(t(x)\mid y,\theta^*) = E(t(x)\mid\theta^*)$, i.e., $\partial\check l(\theta)/\partial\theta = 0$ at $\theta^*$.
The complete data log likelihood based on the joint distribution of $x = (y,\tilde\theta)$ can be written as $\log\tilde f(x\mid\theta) = l(\theta;\tilde\theta) + \log P_{\tilde\theta}(\tilde\theta\mid\theta)$, where $l(\theta;\tilde\theta)$ is defined at the end of Section 2 and $P_{\tilde\theta}(\tilde\theta\mid\theta)$ is the probability density function of $\tilde\theta$ given $\theta$. We know $P_{\tilde\theta}(\tilde\theta\mid\theta)$ is also a member of the exponential family; assume its sufficient statistic is linear in $\tilde\theta$. Given $\tilde\theta$, the observed data likelihood $l(\theta;\tilde\theta)$ based on $y$ contains the Bernoulli and Gamma parts, in which the sufficient statistic for $\theta$ is linear in $y$. Then, the sufficient statistics for the complete data problem are linear in the data $y$ and $\tilde\theta$. In this case, calculating $E(t(x)\mid y,\theta^{(k)})$ is equivalent to a procedure that first fills in the individual data points for $\tilde\theta$ and then computes the sufficient statistics using the filled-in values. With the filled-in values of $\tilde\theta$, the computation of the estimator of $\theta$ follows the usual maximum likelihood principle. This results in iterative updates of $\theta$ and $\tilde\theta$ back and forth. Essentially, the problem is a transformation from an assumed parameter vector to another parameter vector that maximizes the conditional expected likelihood.
One of the difficulties in the ZIG model is that the parameterization allows arbitrary orthogonal transformations on both w and w ˜ without affecting the value of the likelihood. Even for cases where the likelihood for the complete data ( y , θ ˜ ) problem is concave, the likelihood for the ZIG may not be concave. Consequently, multiple solutions of the likelihood equations can exist. An example is a ridge of solutions corresponding to orthogonal transformations of the parameters.
For complete data problems in exponential family models with canonical links, the Fisher scoring algorithm is equivalent to the Newton–Raphson algorithm, which has a quadratic rate of convergence. This advantage is due to the fact that the second derivative of the log-likelihood does not depend on the data. In these cases, the Fisher scoring algorithm has a quadratic rate of convergence when the starting values are near a maximum. However, we do not have the complete data y and θ ˜ when estimating θ . Fisher scoring algorithms often fail to have quadratic convergence in incomplete data problems, since the second derivative often does depend upon the data. Further, the scoring algorithm does not have the property of always increasing the likelihood. It could in some cases move toward a local maximum if the choice of starting values is poor.
In summary, our alternating ZIG regression has a unique MLE on each side of the regression when either θ or θ̃ is held fixed, and the algorithm converges when there is overlap in the data. Here, the data refer to θ̃ while we estimate θ_i and to θ while we estimate θ̃_j. When there is overlap in the data, both l_i^(1) and l_i^(2) are concave functions with unique maxima. However, under complete or quasi-complete separation, the alternating ZIG regression will fail to converge with high probability. This is because the first-order partial derivative of l_i^(1) is non-negative and l_i^(1) increases with k. Even though the l_i^(2) component is a well-behaved concave function, the MLE of the overall log likelihood may fail to exist, or the solution set may be unbounded, especially when the Gamma observations have a very large or very small shape parameter. To analyze the overall convergence behavior when estimating θ without holding θ̃ fixed, θ̃ can be treated as missing data. The alternating update ultimately finds the maximum likelihood estimate of θ based on the sampling distribution of the observed matrix y, provided the solution lies in the interior of the parameter space. This requires the joint distribution of y and θ̃ to belong to the exponential family with a sufficient statistic that is linear in y and θ̃.

6. Adjusting Parameter Update with Learning Rate

In the convergence analysis section, our discussion was based on holding either θ or θ̃ fixed while estimating the other. The Fisher scoring algorithm is a modified version of Newton's method. In general, Newton's method solves g(x) = 0 by numerical approximation. The algorithm starts with an initial value x_0 and computes a sequence of points via x_{n+1} = x_n − [g′(x_n)]^{−1} g(x_n). Newton's method converges fast because the distance between the estimate and the true root shrinks quickly: the distance at the next iteration is asymptotically equivalent to the squared distance at the previous iteration. That is, Newton's method has quadratic convergence order (cf. [27] p. 29 and [28]). This convergence order holds when the initial value of the iteration is in a neighborhood of the root, the second derivative g″(x) is continuous, and g′(x) is non-zero there.
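As a concrete illustration of the iteration just described, the following minimal Python example applies Newton's method to a simple root-finding problem; it is separate from the SA-ZIG implementation.

def newton_root(g, g_prime, x0, tol=1e-12, max_iter=100):
    """Newton's method for solving g(x) = 0."""
    x = x0
    for _ in range(max_iter):
        x_new = x - g(x) / g_prime(x)   # x_{n+1} = x_n - g(x_n) / g'(x_n)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Solve x^2 - 2 = 0 starting near the root; the error roughly squares at each step.
print(newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.5))  # ~1.4142135623730951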
The Fisher scoring algorithm is slightly different from Newton's method: it replaces g′(x_n) by its expected value. The algorithm is asymptotically equivalent to Newton's method and therefore enjoys the same asymptotic properties, such as consistency of the parameter estimate; as the sample size becomes large, the convergence order increases [28]. It also has an advantage over Newton's method in that the expected information matrix is positive definite, which guarantees that the update moves uphill toward maximizing the log-likelihood function, provided the model is correct and the covariates are true explanatory variables.
Because the Fisher scoring algorithm treats θ̃ as the true value when we estimate θ, and vice versa, complications can arise when the parameter being fixed is far from its true value. To see this, note that our updating equations are all written for the Fisher scoring algorithm, which relies on the expectation of the Hessian matrix under the correct distribution at the true parameter value. In particular, the algorithm uses θ̃ as a fixed value while estimating θ, and the resulting estimate of θ determines the working distribution because the distribution is a function of both θ and θ̃. When the fixed parameter is far from its true value during intermediate steps, the working distribution is wrong even though it is in the right family. Computing the expectation of the Hessian matrix under a wrong distribution can produce a sequence of parameter updates that converges to a limiting value different from the true parameter [28], or the algorithm might diverge altogether. Our simulation study in a later section confirms this point.
To avoid this divergence problem, we introduce a learning rate adjustment so that the change in the parameter estimate is scaled by lr/t^{1/4}, where lr is a small constant learning rate such as 0.1 or 0.01 and t is the iteration number. That is, the general updating formula is θ_i^(t+1) = θ_i^(t) + (lr/t^{1/4}) · S_{θ_i^(t)}^{−1} · U_{θ_i^(t)}, and the adjustment is applied to both the inner-loop and outer-loop iterations. The algorithm follows the same workflow as Algorithm 1, except that the epoch update in the inner loop (Algorithm 2) is replaced with Algorithm 3, which uses the learning rate adjustment lr/t^{1/4}. Our choice of this adjustment comes from modifying the widely used adaptive moment estimation (Adam) and stochastic gradient descent. In stochastic gradient descent, the parameter update uses a learning rate adjustment to make small moves, compensating for the randomness of selecting only one observation to compute the gradient. Specifically, given parameters w^(t) and a gradient g^(t) evaluated at one randomly selected observation, the update is w^(t+1) = w^(t) − lr · a(t) g^(t), where a(t) satisfies Σ_{t=1}^∞ a(t) = ∞ and Σ_{t=1}^∞ a²(t) < ∞. The factor a(t) corresponds to our t^{−1/4} S_{θ_i^(t)}^{−1}. However, our parameter update does not use just one θ̃; instead, all of θ̃_1, …, θ̃_n are used in computing the gradient and the information matrix. Given parameters w^(t) and a loss function L^(t) at the t-th training iteration, the Adam update takes the form w^(t+1) = w^(t) − lr · m̂_w/(√v̂_w + ε), where m̂_w and v̂_w are exponential moving averages of the gradients and of the second moments of the gradients over past iterations, respectively, and ε is a small scalar (e.g., 10^{−8}) used to prevent division by zero. Our use of the Fisher information S_{θ_i^(t)}^{−1} should provide a better mechanism than Adam's exponential moving average of second moments for increasing the learning rate for sparser parameters and decreasing it for less sparse ones, because Adam uses only the diagonal entries of S_{θ_i^(t)} and ignores the covariance between the estimated parameters contained in the off-diagonal entries. Refs. [29,30] both pointed out that Adam may fail to converge to optimal solutions even for some simple convex problems, although it is overwhelmingly popular in machine learning applications.
The learning rate adjustment makes the algorithm take smaller steps in each update before changing direction. This adjustment turns out to be crucial; the simulation study in the next section examines its effect. A short sketch of the adjusted update is given after Algorithm 3 below.
Algorithm 1 SA-ZIG regression
  • Require:  S θ i ( t ) 1 and S θ ˜ j ( t ) 1 exist.
  •      t 0 , overall L o s s ( t ) = total negative log likelihood
  •      c o n v e r g e d = F a l s e
  •     while  t m a x i t  do
  •          while  c o n v e r g e d = F a l s e  do
  •                 i 1
  •                while  i n  do
  •                     do n_epoch update of U θ i ( t ) and S θ i ( t ) and θ i ( t + 1 ) using i t h row of data. ▹ see Algorithm 2 or 3
  •                     do n_epoch update of U θ ˜ i ( t ) and S θ ˜ i ( t ) and θ ˜ i ( t + 1 ) using i t h column of data.
  •                      i i + 1
  •                for  k = 1 , , n  do
  •                     retrieve kth row of data
  •                     recompute U θ k ( t + 1 ) , U θ ˜ k ( t + 1 ) and their L 2 norms using θ ( t + 1 ) and θ ˜ ( t + 1 ) values.
  •                     recompute l o s s k and l o s s ˜ k for k t h row and column respectively using θ ( t + 1 ) and θ ˜ ( t + 1 ) values.
  •                compute the overall L o s s ( θ ( t + 1 ) , θ ˜ ( t + 1 ) ) .
  •                check if the relative change in Loss is less than a predefined threshold ϵ .
  •            t t + 1
Algorithm 2 Epoch update in inner loop of Algorithm 1 without learning rate adjustment
  • for epoch ∈ 1,…, n_epoch do
  •       compute U θ i ( t ) and S θ i ( t ) .
  •       update θ i ( t + 1 ) using current value of { θ ˜ j ( t ) , j = 1 , , n } and θ i ( t ) based on formulae in (29)
  • for epoch ∈ 1,…, n_epoch do
  •       compute U θ ˜ i ( t ) and S θ ˜ i ( t ) .
  •       update θ ˜ i ( t + 1 ) using current value of { θ j ( t ) , j = 1 , , n }, and θ ˜ i ( t )
  •       based on formulae in (31)
Algorithm 3 Updating equations in SA-ZIG regression with learning rate adjustment
  •  for  i = 1 , , n  do
  •        for  k = 1 , , E  do
  •               θ i ( t , k + 1 ) = θ i ( t , k ) + l r t 1 / 4 · S θ i ( t , k ) 1 · U θ i ( t , k )
  •               θ ˜ i ( t , k + 1 ) = θ ˜ i ( t , k ) + l r t 1 / 4 · S θ ˜ i ( t , k ) 1 · U θ ˜ i ( t , k )
  •         θ i ( t ) θ i ( t , E )
  •         θ ˜ i ( t ) θ ˜ i ( t , E )
  •         θ i ( t + 1 ) = θ i ( t ) + l r t 1 / 4 · S θ i ( t ) 1 · U θ i ( t )
  •         θ ˜ i ( t + 1 ) = θ ˜ i ( t ) + l r t 1 / 4 · S θ ˜ i ( t ) 1 · U θ ˜ i ( t )
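The core update in Algorithm 3 can be sketched in a few lines of Python. The score function U and expected information S below are placeholders for the ZIG-specific quantities derived earlier; the sketch illustrates only the scaled Fisher scoring step under those assumptions, not the full SA-ZIG implementation.

import numpy as np

def scaled_fisher_step(theta, U, S, lr, t):
    """One learning-rate-adjusted Fisher scoring update, as in Algorithm 3.

    theta : current parameter vector
    U     : callable returning the score vector at theta
    S     : callable returning the expected information matrix at theta
    lr, t : base learning rate and outer-iteration number
    """
    # Solve S * delta = U rather than forming an explicit inverse, for numerical stability.
    delta = np.linalg.solve(S(theta), U(theta))
    return theta + (lr / t ** 0.25) * delta

Applying this step for E epochs per row and column, and then once more in the outer loop, reproduces the flow of Algorithm 3.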

7. Numerical Studies

7.1. A Simulation Study with ZIG Using Log Link

In this section, we present a simulation study to assess the performance of the alternating ZIG regression. We found through experimentation that the data generation has to be done carefully because the mean of the Gamma distribution is given by μ_ij = exp(w_i w̃_j + e_i + ẽ_j). When the w_i's and w̃_j's are generated independently from a Uniform distribution, each with dimension 50, their dot product can easily become large enough that the exponentiated value μ_ij, and hence the variance of the distribution (∝ μ_ij²), becomes numerically infinite or undefined. Keeping this in mind, we generated the data as follows (a sketch of this generation scheme appears after the list):
  • The w i k ’s were generated independently from Uniform (−0.25, 0.25) with seed 99, i = 1 , , 300 and k = 1 , , 50 . Set w i = ( w i 1 , , w i 50 ) .
  • The w ˜ i k = w i k , i = 1 , , 300 and k = 1 , , 50 . That is, w ˜ i = w i .
  • The b i ’s and b ˜ i ’s were generated independently from Uniform (0, 0.05) with seed 97 and 96, respectively, i = 1 , , 300 .
  • The e i ’s and e ˜ i ’s were generated independently from Uniform (0.1, 0.35) with seed 1 and 2, respectively, i = 1 , , 300 .
  • The success probability p i j of positive observations was calculated as p i j = ( 1 + exp ( η i j ) ) 1 , where η i j is based on the logit Formula (2).
  • Generate B i j independently from the Bernoulli distribution with success probability p i j , i = 1 , , 300 and j = i , , 300 .
  • If B i j = 0 , set Y i j = 0 . Otherwise, generate Y i j from the Gamma distribution with mean μ i j = exp ( τ i j ) and shape parameter ν i j = 4 , where τ i j is computed based on Formula (4) for i = 1 , , 300 and j = 1 , , 300 .
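The generation scheme above can be sketched with numpy as shown below. The per-parameter seeds of the paper are not reproduced (a single generator is used), and the linear predictors η_ij and τ_ij are written under the assumption that Formulas (2) and (4) take the additive forms w_i·w̃_j + b_i + b̃_j and w_i·w̃_j + e_i + ẽ_j, respectively; the sketch is illustrative only.

import numpy as np

rng = np.random.default_rng(99)          # one generator; the paper uses separate seeds per parameter
n, d, nu = 300, 50, 4.0                  # rows/columns, vector dimension, Gamma shape

w = rng.uniform(-0.25, 0.25, size=(n, d))
w_tilde = w.copy()                       # w_tilde_i = w_i in this simulation
b, b_tilde = rng.uniform(0.0, 0.05, n), rng.uniform(0.0, 0.05, n)
e, e_tilde = rng.uniform(0.1, 0.35, n), rng.uniform(0.1, 0.35, n)

dot = w @ w_tilde.T
eta = dot + b[:, None] + b_tilde[None, :]   # assumed form of the logit in Formula (2)
tau = dot + e[:, None] + e_tilde[None, :]   # assumed form of the log mean in Formula (4)

p = 1.0 / (1.0 + np.exp(-eta))              # success probability of a positive observation
B = rng.binomial(1, p)                      # Bernoulli indicators
mu = np.exp(tau)
Y = B * rng.gamma(shape=nu, scale=mu / nu)  # Gamma draws with mean mu and shape nu; zero when B = 0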
The matrix containing Y_ij, i, j = 1, …, 300, is the data used for the alternating ZIG regression. With the generated data matrix Y, we applied our algorithm using two different sets of initial values for the unknown parameters. The first setting checks whether the algorithm performs well when estimation is made easier by initializing part of the parameters at their true values. The second setting demands accurate estimation of both w_i and w̃_j in the right direction.
In Setting 1, we set all the parameters’ initial values equal to their true values except for w ˜ i k , i = 1 , , 300 , k = 1 , , 50 . The initial values of w ˜ i k ’s were randomly generated from Uniform (−0.25, 0.25) with seed number 98. Even though initial values of w , b , b ˜ , e and e ˜ were set to be equal to the true values, the algorithm was not aware of this and these parameters were still estimated along with w ˜ .
In Setting 2, we generated the initial values for all the parameters randomly as follows:
  • w ’s initial values were generated independently from Uniform (−0.25, 0.25) with dimension 300 by 50 and seed 102.
  • w ˜ ’s initial values were generated independently from Uniform (−0.25, 0.25) with dimension 300 by 50 and seed 103.
  • b i ’s and b ˜ j ’s initial values were generated independently from Uniform (0, 0.05) with seed 104 and 105, respectively, i , j = 1 , , 300 .
  • e i ’s and e ˜ j ’s were independently generated from Uniform (0.1, 0.35) with seed 106 and 107, respectively, i , j = 1 , , 300 .
We consider parameter updates both with learning rate adjustment (Algorithm 3) and without learning rate adjustment (Algorithm 1 with the epoch update in Algorithm 2).
The results of Setting 1 are presented in Table 1 and Figure 1. We can see in Figure 1 that even though the norm of the score vectors went up after initial reduction, the loss function (i.e., the negative log-likelihood) consistently decreases as the iteration number increases. In the last five iterations, the algorithm with no learning rate adjustment (Algorithm 1) reached a lower value in overall loss compared to that with learning rate adjustment (Algorithm 3). However, the norms of the score vectors from the algorithm with learning rate adjustment are much smaller than those from the algorithm with no learning rate adjustment (see Table 1).
The results for Setting 2 are given in Figure 2. The last column shows the algorithm with learning rate adjustment alone because the scale of the third column is too large to reveal the decreasing trend. In this case, with the initial values of all parameters generated randomly from the Uniform distribution, the uncertainty in the parameters caused the algorithm without learning rate adjustment to break down. The algorithm without learning rate adjustment, using the updates in Algorithm 2, struggled to reduce the norms and headed in the wrong direction throughout the optimization process. In contrast, the algorithm with learning rate adjustment fluctuated a little at the beginning but kept moving in the right direction, reducing both the norm of the score vectors and the loss function as the iterations proceeded. Note that the dataset was identical for these two runs and both algorithms started with the same initial values; the only difference is that learning rate adjustment is used in one algorithm but not in the other. Hence, the failure of the algorithm without learning rate adjustment is due to its update steps being too large, which sends the algorithm in the wrong direction. The numerical study confirms that the alternating updates are capable of finding the maximum likelihood estimate and that the learning rate adjustment in Algorithm 3 is an important component of the algorithm.

7.2. An Application of SA-ZIG on Word Embedding

In this section, we demonstrate the application of SA-ZIG using log link on a small dataset. This dataset comprises news articles sourced from the Reuters news archive. Using a Python script to simulate user clicks, we downloaded approximately 2000 articles from the Business category, spanning from 2019 to the summer of 2021. These articles are stored in an SQLite database. From this corpus, we created a small vocabulary consisting of the V = 300 most frequently used words. We trained a d = 20 -dimensional dense vector representation for each of the 300 words using the downloaded business news.
To apply the SA-ZIG model, we first obtain the weighted word–word co-occurrence counts, computed as follows:
$$Y_{ij} = \sum_{s \,\in\, \text{all sentences}} \begin{cases} 1/d_{ij}(s), & \text{if } d_{ij}(s) \le k, \\ 0, & \text{otherwise,} \end{cases}$$
where d_ij(s) is the separation between word i and word j in sentence s, and k is a pre-determined window size, set to 10 here.
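A sketch of this weighted counting in Python is given below. It assumes tokenized sentences and a vocabulary dictionary mapping words to indices; whether the counts are accumulated in one direction or symmetrized is an implementation choice not specified here, and this sketch accumulates forward pairs only.

from collections import defaultdict

def weighted_cooccurrence(sentences, vocab, k=10):
    """Weighted word-word co-occurrence counts Y_ij as defined above.

    sentences : iterable of tokenized sentences (lists of words)
    vocab     : dict mapping word -> row/column index
    k         : context window size
    """
    Y = defaultdict(float)
    for tokens in sentences:
        for pos_i, word_i in enumerate(tokens):
            if word_i not in vocab:
                continue
            for offset in range(1, k + 1):          # look ahead at most k words
                pos_j = pos_i + offset
                if pos_j >= len(tokens):
                    break
                word_j = tokens[pos_j]
                if word_j in vocab:
                    Y[(vocab[word_i], vocab[word_j])] += 1.0 / offset  # contributes 1 / d_ij(s)
    return Y   # sparse dictionary of (i, j) -> weighted count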
Figure 3 shows histograms of the counts from the first 12 rows. The counts in each row are highly skewed and contain many zeros, which suggests that a zero-inflated Gamma model may be suitable for the data.
The co-occurrence matrix, stored in sparse format, was fed into the SA-ZIG model with learning rate adjustment. The parameters were initialized as follows:
  • The entries in V × d matrix w were independently generated from uniform distribution between 0.5 / ( V d ) and 0.5 / ( V d ) with seed 99.
  • The entries in V × d matrix w ˜ were generated from uniform distribution between 0.5 / ( V d ) and 0.5 / ( V d ) with seed 98.
  • The V bias terms in b were independently generated from uniform (−0.1, 0.1) with seed 97.
  • The V bias terms in b ˜ were independently generated from uniform (−0.1, 0.1) with seed 96.
  • The V bias terms in e 1 , , e V were independently generated from uniform (0.1, 0.6) with seed 1.
  • The bias terms in e ˜ 1 , , e ˜ V were independently generated from uniform (0.1, 0.6) with seed 2.
  • The learning rate at t t h iteration is set to l r / ( t 1 / 4 ) as given in Algorithm 3 with l r = 0.5 .
We let the model parameters update for 60 iterations in the outer loop and 20 epochs in the inner loop. These numbers were chosen because the GloVe model training in [13] ran 50 iterations for vector dimensions ≤ 300, and we wanted to see how the estimation proceeds with the iteration number. At the 60th iteration, the algorithm had not yet converged; this can be seen from the loss curve in Figure 4, which is still moving sharply downward. The cosine similarity between word pairs is shown in the heatmap in the right panel of Figure 4.
To visualize the relationships between different words, we first project the learned vector representations to the first 10 principal components (PCs). These 10 PCs cumulatively explained 97% of variations in the word vectors. We then conduct further dimension reduction with T-Distributed Stochastic Neighbor Embedding (t-SNE) to two-dimensional space. The t-SNE minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding. The words on t-SNE coordinates are shown in Figure 5. Each point on the plot represents a word in the low-dimensional space. The closer two words are on the plot, the more similar they are in meaning or usage. For example, the top five most similar words to ‘capital’ from the 300 words are ‘firm’, ‘management’, ‘fund’, ‘board’, ‘financial’. These are based on the cosine similarity. The five most similar words to ‘stocks’ from the 300 words are ‘companies’, ‘bitcoin’, ‘banks’, ‘prices’ and ‘some’. The top five most similar words to ‘finance’ are ‘technology’, ‘energy’, ‘trading’, ‘consumer’, ‘analysts’. Words ‘monday’, ‘tuesday’, ‘wednesday’ are clustered together. This scatter plot captures some semantic similarities.
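The projection described above can be reproduced with scikit-learn along the following lines; the word vectors here are random placeholders standing in for the SA-ZIG estimates, so this is only a sketch of the visualization pipeline.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Placeholder inputs: a V x d matrix of learned word vectors and the vocabulary list.
vectors = np.random.default_rng(0).normal(size=(300, 20))
words = [f"word{i}" for i in range(300)]

pcs = PCA(n_components=10).fit_transform(vectors)                 # first 10 principal components
coords = TSNE(n_components=2, random_state=0).fit_transform(pcs)  # 2-D t-SNE embedding

plt.figure(figsize=(8, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y), fontsize=6)   # label each point with its word
plt.title("Word vectors projected by PCA followed by t-SNE")
plt.show()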

8. Conclusions and Discussion

In summary, we presented the shared parameter alternating zero-inflated Gamma (SA-ZIG) regression model in this paper. The SA-ZIG model is designed for highly skewed non-negative matrix data. It uses a logit link to model the zero versus positive observations. For the Gamma part, we considered two link functions: the canonical link and the log link, and derived updating formulas for both.
We proposed an algorithm that alternately updates the parameters θ and θ ˜ while holding one of them fixed. The Fisher scoring algorithm, with or without learning rate adjustment, was employed in each step of the alternating update. Numerical studies indicate that learning rate adjustment is crucial in SA-ZIG regression. Without it, the algorithm may fail to find the optimal direction.
After model estimation, the matrix is factorized into the product of a left matrix and a right matrix. The rows of the left matrix and the columns of the right matrix provide vector representations for the rows and columns, respectively. These estimated row and column vector representations can then be used to assess the relevance of items and make recommendations in downstream analysis.
The SA-ZIG model is inherently similar to factor analysis. In factor analysis, both the loading matrix and the coefficient vector are unknown. The key difference between SA-ZIG and factor analysis is that SA-ZIG uses a large coefficient matrix, whereas factor analysis uses a single vector. Additionally, SA-ZIG assumes a two-stage Bernoulli–Gamma model, while factor analysis assumes a normal distribution.
In both models, likelihood-based estimation can determine convergence behavior by linking the complete data likelihood with the conditional likelihood. For factor analysis, the normal distribution and the linearity of the sufficient statistic in the observed data allow the use of alternating least squares (ALS) to estimate both the loading matrix and the coefficient vector ([31]). However, SA-ZIG cannot use ALS because the variance of the Gamma distribution is not constant.
In both SA-ZIG and factor analysis, the unobserved row (or column) vector representation and the factor loading matrix can be treated as missing data, assuming the data are missing completely at random. For missing data analysis, the well-known Expectation Maximization (EM) algorithm can be used to estimate parameters. The EM algorithm has the advantageous property that successive updates always move towards maximizing the log likelihood. It works well when the proportion of missing data is small, but it is notoriously slow when a large amount of data is missing.
For SA-ZIG, the alternating scheme with the Fisher scoring algorithm offers the benefit of a quadratic rate of convergence if the true parameters and their estimates lie within the interior of the parameter space. However, in real applications, the estimation process might diverge at either stage of the alternating scheme because the Fisher scoring update does not always guarantee an uphill direction, especially in cases of complete or quasi-complete separation. Additionally, the algorithm may struggle to find the optimal solution because the row and column matrices are not identifiable under orthogonal transformations, which leads to a ridge of solutions. The learning rate adjustment in Algorithm 3 helps by making small moves during successive updates in later stages of the algorithm and is thereby more likely to find a solution.
Future research on similar problems could explore alternative distributions beyond Gamma. Tweedie and Weibull distributions, for instance, are capable of modeling both symmetric and skewed data through varying parameters, each with its own associated link functions. However, new algorithms and convergence analyses would need to be developed specifically for these distributions. In practical applications, the most suitable distribution for the observed data is often uncertain, making diagnostic procedures an important area for further investigation.

Author Contributions

Methodology, T.K. and H.W.; Software, T.K. and H.W.; Validation, T.K.; Writing—original draft, T.K. and H.W.; Writing—review & editing, T.K. and H.W.; Supervision, H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Edelman, A.; Jeong, S. Fifty three matrix factorizations: A systematic approach. arXiv 2022, arXiv:2104.08669. [Google Scholar] [CrossRef]
  2. Gan, J.; Liu, T.; Li, L.; Zhang, J. Non-negative Matrix Factorization: A Survey. Comput. J. 2021, 64, 1080–1092. [Google Scholar] [CrossRef]
  3. Saberi-Movahed, F.; Berahman, K.; Sheikhpour, R.; Li, Y.; Pan, S. Nonnegative matrix factorization in dimensionality reduction: A survey. arXiv 2024, arXiv:2405.03615. [Google Scholar]
  4. Wang, Y.-X.; Zhang, Y.-J. Nonnegative matrix factorization: A comprehensive review. IEEE Trans. Knowl. Data Eng. 2013, 25, 1336–1353. [Google Scholar] [CrossRef]
  5. Koren, Y. Factorization meets the neighborhood: A multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, Las Vegas, NV, USA, 24–27 August 2008; Association for Computing Machinery: New York, NY, USA, 2008; pp. 426–434. [Google Scholar]
  6. Zhang, Z.; Liu, H. Social recommendation model combining trust propagation and sequential behaviors. Appl. Intell. 2015, 43, 695–706. [Google Scholar] [CrossRef]
  7. Fernández-Tobías, I.; Cantador, I.; Tomeo, P.; Anelli, V.W.; Di Noia, T. Addressing the user cold start with cross-domain collaborative filtering: Exploiting item metadata in matrix factorization. User Model. User-Adapt. Interact. 2019, 29, 443–486. [Google Scholar] [CrossRef]
  8. Puthiya Parambath, S.A.; Chawla, S. Simple and effective neural-free soft-cluster embeddings for item cold-start recommendations. Data Min. Knowl. Discov. 2020, 34, 1560–1588. [Google Scholar] [CrossRef]
  9. Nguyen, P.; Wang, J.; Kalousis, A. Factorizing lambdaMART for cold start recommendations. Mach. Learn. 2016, 104, 223–242. [Google Scholar] [CrossRef]
  10. Panda, D.K.; Ray, S. Approaches and algorithms to mitigate cold start problems in recommender systems: A systematic literature review. J. Intell. Inf. Syst. 2022, 59, 341–366. [Google Scholar] [CrossRef]
  11. Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  12. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems; Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2013; Volume 26. [Google Scholar]
  13. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  14. Mills, E.D. Adjusting for Covariates in Zero-Inflated Gamma and Zero-Inflated Log-Normal Models for Semicontinuous Data. Ph.D. Dissertation, University of Iowa, Iowa City, IA, USA, 2013. [Google Scholar] [CrossRef]
  15. Wei, X.-X.; Zhou, D.; Grosmark, A.D.; Ajabi, Z.; Sparks, F.T.; Zhou, P.; Brandon, M.P.; Losonczy, A.; Paninski, L. A zero-inflated gamma model for post-deconvolved calcium imaging traces. Neural Data Sci. Anal. 2020, 3. [Google Scholar] [CrossRef]
  16. Nobre, A.A.; Carvalho, M.S.; Griep, R.H.; Fonseca, M.d.J.M.d.; Melo, E.C.P.; Santos, I.d.S.; Chor, D. Multinomial model and zero-inflated gamma model to study time spent on leisure time physical activity: An example of elsa-brasil. Rev. Saúde Pública 2017, 51, 76. [Google Scholar] [CrossRef] [PubMed]
  17. Moulton, L.H.; Curriero, F.C.; Barroso, P.F. Mixture models for quantitative hiv rna data. Stat. Methods Med. Res. 2002, 11, 317–325. [Google Scholar] [CrossRef] [PubMed]
  18. Wu, M.C.; Carroll, R.J. Estimation and comparison of changes in the presence of informative right censoring by modeling the censoring process. Biometrics 1988, 44, 175–188. [Google Scholar] [CrossRef]
  19. Have, T.R.T.; Kunselman, A.R.; Pulkstenis, E.P.; Landis, J.R. Mixed effects logistic regression models for longitudinal binary response data with informative drop-out. Biometrics 1998, 54, 367–383. [Google Scholar] [CrossRef]
  20. Albert, P.S.; Follmann, D.A.; Barnhart, H.X. A generalized estimating equation approach for modeling random length binary vector data. Biometrics 1997, 53, 1116–1124. [Google Scholar] [CrossRef]
  21. Du, J. Which Estimator of the Dispersion Parameter for the Gamma Family Generalized Linear Models Is to Be Chosen? Master’s Thesis, Dalarna University, Falun, Sweden, 2007. Available online: https://api.semanticscholar.org/CorpusID:34602767 (accessed on 23 October 2024).
  22. Haberman, S. The Analysis of Frequency Data; Midway reprint; University of Chicago Press: Chicago, IL, USA, 1977. [Google Scholar]
  23. Silvapulle, M.J. On the existence of maximum likelihood estimators for the binomial response models. J. R. Stat. Soc. Ser. B-Methodol. 1981, 43, 310–313. [Google Scholar] [CrossRef]
  24. Albert, A.; Anderson, J.A. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984, 71, 1–10. [Google Scholar] [CrossRef]
  25. Marschner, I.C. glm2: Fitting generalized linear models with convergence problems. R J. 2011, 3, 12–15. [Google Scholar] [CrossRef]
  26. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  27. Givens, G.H.; Hoeting, J.A. Computational Statistics, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2013. [Google Scholar]
  28. Osborne, M.R. Fisher’s method of scoring. Int. Stat. Rev. Rev. Int. Stat. 1992, 60, 99–117. [Google Scholar] [CrossRef]
  29. Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of adam and beyond. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  30. Shi, N.; Li, D.; Hong, M.; Sun, R. RMSprop converges with proper hyper-parameter. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Online, 3–7 May 2021. [Google Scholar]
  31. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the em algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22. [Google Scholar] [CrossRef]
Figure 1. L 2 norm of the score functions U θ and U θ ˜ (Uthetas norm) in blue and orange colors, respectively, and overall loss of the ZIG model in log10 scale. Top two panels: without learning rate adjustment; bottom two panels: with learning rate adjustment. All parameters were initialized with true parameter values except for w ˜ . In the top panels, the norms of the score vectors decrease drastically in early iterations but increase over later iterations even though the overall loss is consistently reduced. In the bottom panels, the norms of the score vectors first show similar pattern as the top panel but consistently decrease in later iterations. The overall loss shows the desired reducing trend throughout all iterations.
Figure 2. Comparing performance of the alternating ZIG regression using log link with or without learning rate adjustment over 8 simulated datasets. Each row is for one dataset. First column: log 10 ( L 2 norm of U θ ) ; second column: log 10 ( L 2 norm of U θ ˜ ) ; third column: log 10 (overall loss); fourth column: overall loss of the algorithm with learning rate adjustment.
Figure 3. Histogram of weighted count from each of the first 12 rows of the data matrix.
Figure 4. (Left): The absolute value of change in the log10 norms of U θ (red) and U θ ˜ (blue) from successive iterations and overall loss curve. (Right): Cosine similarity of word vector representations shown as heatmap.
Figure 5. Plot of word vector representations using t-SNE on first 10 principal components.
Table 1. Result of the last 5 iterations for the L 2 norm of the score vectors and overall loss in ZIG model with log link. The parameters were initialized with true parameters except w ˜ , which was randomly initialized. The negative iteration number means counting from the end. The algorithm with no learning rate adjustment did not reduce the score vectors’ norms as fast as the algorithm with learning rate adjustment even though it achieved slightly lower loss.
Columns 2–4: No Learning Rate Adjustment; Columns 5–7: With Learning Rate Adjustment.
Iteration | ||U_θ||_L2 | ||U_θ̃||_L2 | Overall Loss | ||U_θ||_L2 | ||U_θ̃||_L2 | Overall Loss
−5 | 85,805.79 | 88,896.77 | 77,998.77 | 5596.06 | 5298.31 | 78,313.71
−4 | 86,123.05 | 89,012.56 | 77,996.44 | 5596.00 | 5303.81 | 78,312.30
−3 | 86,614.46 | 89,236.36 | 77,994.11 | 5581.83 | 5294.08 | 78,310.92
−2 | 87,309.84 | 89,758.44 | 77,991.79 | 5579.14 | 5293.48 | 78,309.59
−1 | 87,574.27 | 89,699.66 | 77,989.41 | 5556.29 | 5286.12 | 78,308.15
