Determinants of the revenue enjoyed by an influential online gamer are assessed in the context of a monopoly environment (i.e., the Twitch platform) characterized by uncertainty of membership on both sides of the market. Formally, we evaluate the impact of $\theta $ (i.e., the externality of the influential online gamer on the viewers’ side) on $C{S}_{d}^{*}$ and, given the perfect alignment of interests, also on ${\mathsf{\Pi}}^{*}$, while bearing in mind that the cross-group network externality $\theta $ can be measured in multiple ways (e.g., externality of the type ‘being a follower’, externality of the type ‘being a subscriber’, externality of the type ‘being a viewer’, etc.), as explained below. This aspect is highly relevant because it confirms that the idea of a single type of indirect network effect is erroneous and does not match what is observed on real-world platforms such as Twitch.

#### 5.2.1. Data

Twitch is a video game live streaming platform operated by Twitch Interactive, a subsidiary of Amazon. Introduced in June 2011 as a spin-off of the streaming platform Justin.tv, it primarily focuses on video game live streaming, including broadcasts of sports competitions, music, games, creative content, and real-life streams. The platform provides online gamers with a channel analytics page, which gives them a comprehensive view of stream revenues and engagement statistics over customizable date ranges. These detailed breakdowns allow us to better understand the evolution of revenues and viewership trends. Several metrics of one influential Portuguese online gamer were monitored and collected over a period of 397 days, predominantly covering the year 2017. Since most Portuguese online gamers use Twitch to stream their activity, and since the interests of the platform and online gamers are perfectly aligned according to Lemma 5, we believe that this empirical analysis is a good approximation of the monopoly environment treated theoretically above. Before detailing the selection of variables, we clarify how the content produced by online gamers is rewarded on Twitch. As long as online gamers are not Twitch affiliated members, they can obtain revenue only through donations sent directly by viewers to their PayPal account. After satisfying four entry criteria, online gamers can become Twitch affiliated members [37]. When that is the case, Twitch allows online gamers to monetize their channel through additional sources of revenue that are subject to a 50-50 split with the platform.

Since the analysis focuses on revenues shared between online gamers and the platform, the dependent variable corresponds to the revenues enjoyed by the online gamer, excluding donations and commercial partnerships. These form a composite basket of channel subscription revenues and non-subscription revenues. Information on subscription revenues can be segmented by type: paid subscriptions, Twitch Prime subscriptions, and gifted subscriptions. Information on non-subscription revenues can be segmented by source: ads, cheering (i.e., bits), game sales, extensions, bounties, and other bits interactions. As such, the empirical analysis considers three models, which differ in the dependent variable: model M1 considers only channel subscription revenues; model M2 considers only non-subscription revenues; and model M3 considers total revenues. Each 24 h period corresponds to a single observation in the dataset, and we consider as explanatory variables all the statistics that Twitch makes available to online gamers, which are summarized as follows.

Active data about the opposite side of the market (i.e., data on active actions conducted by viewers) include new followers (i.e., new followers received by the channel during live streaming in the selected date range), subscribers (i.e., number of subscribers in the selected date range), and live views (i.e., total views of live streams, which neither include video on demand (VOD) nor clip views).

Passive data about the opposite side of the market (i.e., data on passive actions conducted by viewers) include average viewers (i.e., the average number of concurrent viewers in a stream; to calculate this number, the platform identifies how many viewers there are at each point in time the online gamer is live streaming, so that the final outcome is an average across all the time spent live streaming in the selected date range), max viewers (i.e., the maximum number of viewers that the online gamer reached across all streams in the selected date range), unique viewers (i.e., the number of unique people who viewed the online gamer’s live streams across the selected date range), and host/raid viewers (i.e., the percentage of viewers that come from hosts or raids, which reflects a within-group externality between online gamers).

Interactive data include time streamed (i.e., the total time of broadcasting in minutes), chat audience (i.e., the number of unique viewers who chatted across the selected date range), chat messages (i.e., the total number of chat messages sent), clips created (i.e., the number of clips created from streams), clip views (i.e., total views of clips created from streams), ad breaks (i.e., the total duration in minutes of ad breaks run by the online gamer during streams), ad time per hour (i.e., the average number of minutes per hour that ads were running during streams), and notification engagements (i.e., the number of viewers who engaged with go-live notifications sent out for streams in the selected date range).

Data exclusively related to the online gamer’s characteristics include internet speed (i.e., the average download speed in Mbps in the selected date range) and the psychological state of the online gamer (i.e., a dummy variable that takes the value 1 if the online gamer feels happy by the end of day $t-1$ and 0 otherwise).

Since the regressors have different units of measure, all values are standardized before performing the empirical analysis. This procedure maintains the anonymity of the online gamer intact and allows us to interpret the estimated coefficients as elasticities. For the sake of brevity, Table S2 in Supplementary Material compiles summary statistics.
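The standardization step can be illustrated with a minimal sketch. It assumes that "standardized" means z-scores (subtracting the mean and dividing by the standard deviation), which the text does not state explicitly; the function name and the daily values below are hypothetical and are not taken from the dataset.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant series")
    return [(x - mu) / sigma for x in values]


# Hypothetical daily values for two metrics measured in different units.
live_views = [120, 150, 90, 200, 140]
time_streamed_min = [180, 240, 60, 300, 120]

z_views = standardize(live_views)
z_time = standardize(time_streamed_min)
```

After this transformation both series have mean 0 and unit standard deviation, so their coefficients can be compared on a common scale.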

#### 5.2.2. Identification Strategy

Based on Lemma 5, we hypothesize that the revenue of the influential online gamer can either decrease or at least be subject to a negligible increment (i.e., approximately zero) for increasing intensity of the indirect network effect exerted on viewers. To understand whether this hypothesis holds in reality, three different machine learning models are trained and tested: principal component analysis (PCA), least absolute shrinkage and selection operator (LASSO), and a novel continual learning (CL) modeling approach based on the combination of random forest (RF) with ordinary least squares (OLS). All choices are justified by the high number of covariates and the subsequent need to mitigate concerns related to endogeneity, spurious correlation, omitted variable bias, and reverse causation.

For a regression estimator to fit a model meaningfully, four conditions must hold: no omitted (i.e., confounding) variables are correlated with the covariates, the covariates are measured without error, the covariates are not correlated with the error term, and reverse causation does not occur (i.e., covariates affect the dependent variable, but not the opposite). Explanatory variables that do not satisfy these requirements are said to be endogenous. PCA is adopted to alleviate this concern: it is an unsupervised machine learning model that reduces the dimensionality of the initial set of covariates by creating a smaller number of principal components (PCs) that represent the initial set of covariates and ensure a proper evaluation of the dependent variable. The main advantages of this model are that it provides information about the relative contribution of each PC to explaining the total variance of a given dependent variable, and that it allows an economic interpretation of each extracted component.

In turn, LASSO is a supervised machine learning model that uses a penalized regression technique to find the subset of the initial covariates with significant explanatory power on the dependent variable. Reducing dimensionality in this way helps to avoid concerns related to spurious correlation, but many techniques exist to perform this operation. Since the coefficient estimates and the set of selected independent variables depend on λ (i.e., the general degree of penalization) and α (i.e., the relative contribution of ${\ell}_{1}$ versus ${\ell}_{2}$ norm penalization), a key question is how to choose these tuning parameters. The most appropriate method depends on the setting and objective of the analysis, computational constraints, and whether and how the independent and identically distributed (i.i.d.) assumption is violated. We use k-fold cross-validation and rolling h-step ahead cross-validation to choose the tuning parameters.
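To make the rolling h-step ahead scheme concrete, the following sketch generates the training and test indices it implies for a daily series. It assumes an expanding training window that always starts at the first observation; the text does not say whether the window is expanding or fixed. The function name and the initial window of 300 days are hypothetical, while the 397 observations and the 30-day horizon come from the text.

```python
def rolling_origin_splits(n_obs, initial_train, horizon):
    """Yield (train_indices, test_index) pairs for rolling h-step-ahead CV.

    Each training window starts at observation 0 and grows by one day per
    split; the test point lies `horizon` steps after the last training
    observation, so no future information leaks into the fit.
    """
    train_size = initial_train
    while train_size + horizon - 1 < n_obs:
        train = list(range(train_size))
        test = train_size + horizon - 1
        yield train, test
        train_size += 1


# 397 daily observations, 30-day-ahead evaluation, hypothetical initial window.
splits = list(rolling_origin_splits(n_obs=397, initial_train=300, horizon=30))
```

Unlike k-fold cross-validation, every test observation here lies strictly after its training window, which respects the time ordering of the data.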

In a recent study, the authors of [38] developed a deep learning framework to ensure a more realistic learning analysis. According to the authors, CL is a new, simple, and efficient method that is a valid alternative to standard regularization techniques. Common approaches to mitigate the omitted variable bias problem rely on regularization to identify the relevant information that properly represents the past behavior of a dependent variable. While Ref. [38] adopted CL in the context of neural networks, we use this method in the broader context of machine learning. The idea of CL is to allow a refined treatment of covariates based on a bias-variance trade-off argument, since it consists of a two-step approach that performs a bias-variance decomposition. In the first step, we apply RF to the covariates with the strongest relative importance in explaining the dependent variable in order to mitigate the risk of overfitting. We consider that the covariates that best explain the dependent variable are the active and passive data variables previously described. In the second step, we perform OLS estimation on the predicted values obtained in the first step in order to mitigate the risk of underfitting. Relative to other regularization techniques, this option involves two important technical refinements: RF is applied first to introduce higher bias and, thus, lower variance relative to OLS, which is a necessary condition to ensure generalization power; afterward, relying on the predicted values obtained in the first step, OLS is applied to obtain unbiased regression estimates and, thus, higher variance relative to RF. Here, bias is the error that stems from erroneous assumptions in the learning algorithm, whereas variance is the error that stems from sensitivity to fluctuations in the training set. High bias can cause an algorithm to miss relations between features and target, which fosters underfitting; high variance can cause it to model the random noise in the training data rather than the intended output, which fosters overfitting.
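The two-step procedure can be sketched as follows. This is only one possible reading of the description above: the text does not specify the exact second-stage regression, so regressing revenue on the first-step fitted values is an assumption. The data are synthetic stand-ins, and the hyperparameters (`n_estimators`, `min_samples_leaf`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's dataset): 397 daily observations
# of 7 standardized viewer-side covariates and one revenue series.
n_days, n_covariates = 397, 7
X = rng.normal(size=(n_days, n_covariates))
revenue = 0.8 * X[:, 1] - 0.3 * X[:, 3] ** 2 + rng.normal(scale=0.5, size=n_days)

# Step 1: random forest fitted on the viewer-side covariates.
forest = RandomForestRegressor(n_estimators=50, min_samples_leaf=5, random_state=0)
forest.fit(X, revenue)
fitted = forest.predict(X)

# Step 2: OLS of revenue on the first-step fitted values (intercept + slope).
design = np.column_stack([np.ones(n_days), fitted])
coef, *_ = np.linalg.lstsq(design, revenue, rcond=None)
intercept, slope = coef
```

A quadratic second stage would add `fitted ** 2` as a third column of `design`.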

#### 5.2.3. Results

#### Principal Component Analysis

We apply Kaiser’s rule to conclude that each dependent variable is explained by 5 PCs (this rule states that the optimal number is given by the PCs whose eigenvalue is above 1; Figure S1 in Supplementary Material shows a graphical representation of the final outcomes).

Table S3 in Supplementary Material reveals that approximately 73% of the variance of each dependent variable is explained by these PCs. In terms of economic interpretation, knowing that PC1 is explained by the number of subscribers, followers, and lagged dependent variables, we conclude that it corresponds to a latent dimension that captures loyal viewers. PC2 is explained by unique viewers, streaming time, number of chat messages, and clip views, which suggests that it represents the non-faithful audience of the online gamer. PC3 corresponds to the publicity dimension since it is composed of covariates related to ads, while PC4 covers the structural conditions faced by the online gamer to execute the activity (i.e., Internet speed and notifications capturing user engagement). PC5 is negatively affected by the percentage of host/raid viewers and positively affected by the chat audience and by the psychological state of the online gamer at the end of the previous day. Therefore, it consists of a latent component that captures the emotional dimension of the online gamer.
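Kaiser’s rule and the share of variance captured by the retained components can be illustrated with a minimal sketch. The data are synthetic and are not the paper's dataset; applying the rule to the eigenvalues of the correlation matrix of the covariates is an assumption that is consistent with standardized regressors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for the standardized covariate matrix:
# two latent factors drive six observed covariates, plus noise.
n_days = 397
factors = rng.normal(size=(n_days, 2))
loadings = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
X = factors @ loadings + 0.5 * rng.normal(size=(n_days, 6))

# Eigenvalues of the correlation matrix, sorted from largest to smallest.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Kaiser's rule: retain the components whose eigenvalue exceeds 1.
n_retained = int(np.sum(eigenvalues > 1.0))

# Share of total variance captured by the retained components.
explained_share = eigenvalues[:n_retained].sum() / eigenvalues.sum()
```

In this synthetic example the rule recovers the two latent factors; in the paper the same criterion yields 5 PCs.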

We then infer the following conclusions. First, the online gamer should be primarily focused on actions aimed at converting non-committed viewers into a fully committed audience by resorting to brand loyalty strategies. Second, this individual should make efforts to reduce the likelihood of being influenced by the emotional dimension since it appears to have a negative and significant effect on all types of revenue enjoyed by the online gamer. Nevertheless, this impact is particularly felt on the type of revenue that had the lowest incremental gain over time (i.e., non-subscription revenues). Third, statistically significant effects are resilient to different types of revenue, which suggests that this online gamer has the incentive to become professional. Finally, the hypothesis claiming that the online gamer’s revenue can either decrease or at least be subject to a negligible increment for increasing intensity of the indirect network effect promoted on viewers cannot be rejected due to the negative and significant coefficient of PC2, which means that the negative impact on the gamer’s revenue for increasing viewership is clearly promoted by the non-committed audience of the online gamer.

#### Least Absolute Shrinkage and Selection Operator

Focusing on the coefficients estimated with the rolling h-step ahead cross-validation technique, a first conclusion is that the covariates with explanatory power on the different types of dependent variables are the number of subscribers and live views. In the case of non-subscription revenues, ad breaks are also statistically significant. Nevertheless, only the number of subscribers has a considerably high magnitude on each dependent variable. Consequently, this result suggests that subscriptions have a positive impact on the different types of revenue enjoyed by the online gamer, but the remaining statistically significant covariates (i.e., followers and live views) have a negligible impact, particularly on non-subscription revenues. While the first conclusion is contrary to the null hypothesis stated in the identification strategy, the second one is aligned with the idea that the revenue enjoyed by the online gamer is subject to a negligible gain for increasing intensity of the indirect network effect promoted on viewers. This cross-validation technique also allows us to obtain coefficients associated with a prediction for n days ahead. In addition to the coefficients associated with the one-day step-ahead prediction, we also consider those associated with the 30 days step-ahead prediction, and conclude that the magnitude of the estimated coefficients remains practically unchanged. Overall, the ambiguity of the results obtained under LASSO with the rolling h-step ahead cross-validation technique demonstrates that the PCA results are at least partially robust. Furthermore, the estimated coefficients under LASSO and post-estimation OLS are extremely similar, which reinforces the previous conclusion.

In turn, we execute two analyses with the k-fold cross-validation technique by exogenously assuming 10 folds. On the one hand, we find the covariates with explanatory power on the different types of dependent variables considering the pair (${\mathsf{\lambda}}_{\mathrm{LOPT}}^{*},$ ${\mathsf{\alpha}}_{\mathrm{LOPT}}^{*}$) that minimizes the mean square prediction error (MSPE). On the other hand, we find the covariates with explanatory power on the different types of dependent variables considering the pair (${\mathsf{\lambda}}_{\mathrm{LSE}}^{*},$ ${\mathsf{\alpha}}_{\mathrm{LSE}}^{*}$) that corresponds to the largest λ for the optimal $\mathsf{\alpha}$ for which the MSPE is within one standard error of the minimal MSPE. Overall, we draw the following conclusions. First, the former pair retains a higher number of statistically significant covariates than the latter. In particular, total and subscription revenues are explained by 13 covariates when considering the pair (${\mathsf{\lambda}}_{\mathrm{LOPT}}^{*},$ ${\mathsf{\alpha}}_{\mathrm{LOPT}}^{*}$), but only by 5 covariates when considering the pair (${\mathsf{\lambda}}_{\mathrm{LSE}}^{*},$ ${\mathsf{\alpha}}_{\mathrm{LSE}}^{*}$), while non-subscription revenue is explained by 10 covariates when considering the pair (${\mathsf{\lambda}}_{\mathrm{LOPT}}^{*},$ ${\mathsf{\alpha}}_{\mathrm{LOPT}}^{*}$) and by 6 covariates when considering the pair (${\mathsf{\lambda}}_{\mathrm{LSE}}^{*},$ ${\mathsf{\alpha}}_{\mathrm{LSE}}^{*}$). Second, ${\alpha}^{*} = 1$ always holds, which means that LASSO estimation is unambiguously preferred to elastic net and ridge regressions to explain the different types of dependent variables. Third, since this technique only allows estimating one-day step-ahead coefficients, we conclude that the number of subscribers has a positive, significant, and strong effect on the different types of revenue enjoyed by the online gamer, whereas the remaining covariates exhibit a nearly zero effect on the different types of dependent variables. This implies that both techniques yield consistent results that do not necessarily contradict the idea that the revenue of the online gamer increases by a negligible magnitude for a stronger intensity of the indirect network effect promoted on viewers, thus not allowing us to reject the null hypothesis proposed in the identification strategy.
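The difference between the MSPE-minimizing penalty and the one-standard-error penalty can be sketched as follows, for a fixed α. The function name and the cross-validation summary values are hypothetical and purely illustrative; they are not results from the paper.

```python
def select_lambdas(cv_results):
    """Pick the MSPE-minimizing lambda and the one-standard-error lambda.

    cv_results: list of (lambda, mean_mspe, standard_error) tuples, one per
    candidate penalty, computed across the cross-validation folds.
    """
    best = min(cv_results, key=lambda row: row[1])
    threshold = best[1] + best[2]
    # Largest penalty whose mean MSPE stays within one SE of the minimum.
    one_se = max((row for row in cv_results if row[1] <= threshold),
                 key=lambda row: row[0])
    return best[0], one_se[0]


# Hypothetical cross-validation summary (illustrative values only).
cv_results = [
    (0.001, 0.520, 0.030),
    (0.010, 0.480, 0.028),
    (0.050, 0.470, 0.027),
    (0.100, 0.490, 0.029),
    (0.500, 0.560, 0.035),
]
lambda_opt, lambda_se = select_lambdas(cv_results)
```

Because the one-standard-error penalty is at least as large as the MSPE-minimizing one, it shrinks more coefficients to zero, which is consistent with the smaller number of retained covariates reported above.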

#### Continual Learning

Results of the linear OLS specification demonstrate that the revenue enjoyed by the online gamer is inversely related to viewership due to the negative sign of the respective coefficient, which is statistically significant at the 1% level. Results of the quadratic OLS specification reveal that all the different types of revenue exhibit an inverted U-shape relationship with viewership, which means that these present similar properties to functions such as the Laffer curve. Therefore, we conclude that there is a critical threshold above which the growth of the network harms any type of revenue enjoyed by the online gamer. As a robustness check, we adopt the Autoregressive Integrated Moving Average (ARIMA) model since this internalizes the optimal time dimension for a given dependent variable. In terms of autoregressive components, results indicate that one-period past values have a positive and significant effect on the first difference of all the possible dependent variables. From a technical point of view, this result is aligned with the idea that CL performs favorably in multi-period forecasting exercises compared to alternative modeling options. From a substantive point of view, this result suggests that the revenue enjoyed by the online gamer in past periods is expected to have a positive influence over the one enjoyed in the current period.
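The critical threshold implied by the quadratic specification can be written out explicitly. The notation below is an assumption introduced for illustration, since the exact specification is not reported in this section: $R_t$ denotes a given type of revenue, $V_t$ the viewership measure, and ${\beta}_{0}$, ${\beta}_{1}$, ${\beta}_{2}$ the OLS coefficients.

$$R_t = {\beta}_{0} + {\beta}_{1} V_t + {\beta}_{2} V_t^{2} + {\varepsilon}_{t}, \qquad \frac{\partial R_t}{\partial V_t} = {\beta}_{1} + 2{\beta}_{2} V_t = 0 \;\Longrightarrow\; V^{*} = -\frac{{\beta}_{1}}{2{\beta}_{2}}.$$

An inverted U-shape requires ${\beta}_{2} < 0$, in which case revenue rises with viewership for $V_t < V^{*}$ and falls for $V_t > V^{*}$.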