The Dynamics between Structural Conditions and Entrepreneurship in Europe: Feature Extraction and System GMM Approaches

Abstract: Structural conditions and population characteristics of countries have been identified in the literature as factors leading an individual to become, or to intend to become, an entrepreneur. However, this subject remains under research; it has become increasingly relevant and could be crucial to the current challenges faced by European countries. In this work, the factors behind entrepreneurial intentions and entrepreneurship activity are studied. More precisely, the structural conditions of European countries, which have changed over the last two decades, are analysed. The aim is to describe this behaviour and to identify the main conditions for developing entrepreneurship activities and the intention to become an entrepreneur. To this end, feature extraction, namely principal component analysis, and dynamic longitudinal approaches are used. In particular, we propose that the system generalised method of moments (GMM) model is adequate in this situation. The results suggest that the structure of the European framework conditions for entrepreneurship, obtained using Factor Analysis year by year, is quite diversified until 2008, while after 2008 it is more stable. Moreover, it is concluded that the conditions associated with entrepreneurial intentions and entrepreneurial activity differ between these two time periods. Hence, the dynamic aspect of the structural conditions that affect entrepreneurial activities or intentions should be acknowledged.


Introduction
Entrepreneurship can be defined as a person's capacity to put their ideas into action. It requires the ability to articulate project planning and management, to take calculated risks, and to innovate and be creative, in order to achieve previously defined goals [1]. In recent decades, entrepreneurship has gained increasing attention and has become a crucial topic on both researchers' and political leaders' agendas [2]. From the perspective of political leaders, entrepreneurial activity is a source of innovation, competitiveness and economic development, while academia focuses on deepening the knowledge about this topic [2].
In an extensive meta-analysis [3], it is pointed out that the most common measure of entrepreneurship is the Total early-stage Entrepreneurial Activity (TEA), defined as the percentage of individuals aged 18-64 who are either starting new businesses or are currently running a business for less than 42 months. Furthermore, TEA is attractive as a measure of entrepreneurship because it does not rely on official government statistics. This is an advantage in cross-country studies, as it facilitates standardisation, since the definition of entrepreneurship varies between governments. As previous studies have demonstrated dynamic behaviour of the Entrepreneurial Framework Conditions (EFC) indicators between countries and through time [12], the correlation structure of the EFC indicators is analysed for the period between 2001 and 2019, in order to understand whether it remains the same or whether there are changes. For this purpose, Factor Analysis (FA), with the varimax approach, is used. Next, how the EFCs relate to entrepreneurial intentions (EI) and TEA is analysed through a longitudinal study of an unbalanced panel data set on 34 European countries between 2001 and 2019.
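The TEA definition above can be sketched directly against survey microdata. The following is a minimal illustration; the data frame and its column names (`age`, `nascent`, `months_running`) are hypothetical stand-ins for GEM-style survey fields, not the actual GEM schema.

```python
import pandas as pd

# Hypothetical GEM-style survey microdata; column names are illustrative.
survey = pd.DataFrame({
    "age":            [25, 40, 63, 30, 55, 19, 70, 45],
    "nascent":        [1,  0,  0,  0,  1,  0,  0,  0],   # currently starting a business
    "months_running": [0,  12, 60, 0,  0,  24, 6,  90],  # age of owned business, 0 = none
})

def tea(df):
    """Share of 18-64 year olds who are nascent entrepreneurs
    or run a business younger than 42 months."""
    adults = df[(df["age"] >= 18) & (df["age"] <= 64)]
    active = (adults["nascent"] == 1) | (
        (adults["months_running"] > 0) & (adults["months_running"] < 42)
    )
    return 100 * active.mean()

print(round(tea(survey), 1))
```

Here the 70-year-old is excluded by the age filter, and the respondents with businesses older than 42 months do not count as early-stage activity.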
According to several studies on GEM data (e.g., [13,14]), entrepreneurship is heterogeneous across countries. Hence, it is important to take into account the intra-country specificities when conducting these types of analyses. For that, different specifications of longitudinal models are tested to understand which EFCs have a significant effect on the countries' EI or TEA over time. Additionally, similarly to the work of [15], the estimations are controlled for per-capita GDP, population growth and type of industry, as the country's wealth may have a significant impact on the nature of its entrepreneurial activity [16].
The GEM survey data has the advantage of being less susceptible to biases in coverage across regions [3]. In fact, due to less thorough data collection or economic and regulatory environments that discourage entrepreneurs from formally registering their activities, there is the risk that low-income countries are underrepresented in official statistics.
The rest of the paper is organised as follows. In Section 2, details on the data set used are presented, as well as the statistical methodologies. The results obtained when using the FA and system GMM are presented in Section 3. Finally, Section 4 presents the main conclusions and proposals for future work.

Data
As expressed in the previous Section, several researchers have recognised EI, TEA and EFCs as relevant indicators of entrepreneurial activity (e.g., [3][4][5][6][7][8][9][10][11][12][13][14][15][16]); thus, these are used in the present study. More precisely, the data set used in this work contains data on 34 European countries, from 2001 to 2019, on the entrepreneurial behaviour and attitudes (EBA), namely EI and TEA, and on the following 12 EFCs from GEM, as described in [8]. The EFCs are calculated as the mean of the ordinal perceptions of national experts on entrepreneurship, for each country, on several subjects, on a scale ranging between 1 and 9. The 12 EFCs are used as independent variables, while EI and TEA are each used separately as the dependent variable.
There were countries for which information concerning some years was not available. Hence, the data consist of unbalanced repeated measurements of EFCs, EI and TEA over time. For example, for the Czech Republic and Serbia, data were only available at three time points, in contrast with the 18 measurements available for Germany, Ireland and Slovenia. Table 1 shows the distribution of the number of time points available for each country.
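The structure of such an unbalanced panel can be summarised by counting observations per country. A short sketch, using invented rows purely to mirror the counts mentioned above:

```python
import pandas as pd

# Toy unbalanced panel: one row per available country-year observation.
panel = pd.DataFrame({
    "country": ["Germany"] * 18 + ["Czech Republic"] * 3 + ["Serbia"] * 3,
    "year":    list(range(2002, 2020)) + [2006, 2011, 2013] + [2007, 2009, 2012],
})

# Number of available time points per country, as summarised in Table 1.
counts = panel.groupby("country")["year"].nunique().sort_values()
print(counts)
```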

Methodology
This section describes the methods applied to understand the relation between the 12 EFCs and EI, as well as the relation with TEA. A first visual inspection of the evolution of the EFCs through the years, depicted in Figure 1, shows a possible change in the structure of the EFCs, especially before and after 2008. Hence, before analysing the impact of each EFC on EI and on TEA, a FA was applied to identify whether the correlation structure of the EFC indicators changes over time. Subsequently, longitudinal regression models were applied for both dependent variables, EI and TEA.
For the FA analysis, the software IBM SPSS v.28 was used. For the dynamic longitudinal analysis, the STATA software was used, in particular the command xtabond2 created by [17]. Next, both FA and longitudinal analysis will be explained in detail.

Feature Extraction
In the age of big data, an appropriate selection of attributes is essential for several reasons, some of which are enumerated next. The knowledge induced by data analysis algorithms on a smaller number of attributes is often more understandable. Moreover, several algorithms perform worse when using a large number of variables, and attribute selection can improve performance. Thus, when large multivariate data sets are analysed, it is often desirable to reduce their dimensionality. In addition, some domains have a high cost of data collection, in which case attribute selection can reduce costs. For a review of recent developments in feature extraction methods such as PCA and FA, see [18].
In statistics, machine learning and information systems, dimensionality reduction (or dimension reduction) is the process of reducing the number of random variables under consideration in order to obtain a set of main variables. The main approaches can be divided into feature selection and feature extraction. Selection methods try to choose the most representative variables from among those available. Extraction methods use the information contained in the data to create another data set with a smaller dimension, while retaining as much information from the original data as possible.
PCA aims to represent (or describe) an initial number of variables, p, by a smaller number, q, of hypothetical variables. Thus, it identifies new variables (in smaller numbers) without a significant loss of information. The components are calculated in descending order of importance, with the first explaining as much of the data variability as possible. Often, it is possible to retain most of the variability in the original variables with q much smaller than p [19].
PCA is based on the analysis of linear correlations between variables, which tries to identify sets of variables that are highly correlated. It allows one to conclude whether it is possible to explain the pattern of correlations through a smaller number of variables. As each linear combination explains as much as possible of the unexplained variance and it has to be orthogonal to any other combination already defined, the set of all combinations found constitutes a unique solution.
Therefore, it is an exploratory analysis technique used to reduce the size of the data, and to identify latent factors.
The representation by PC does not require assumptions on the probability distribution of the variables. However, PCA is sensitive to the scales of the variables. Therefore, variables must be standardised prior to applying PCA.
PCA implies the absence of error; the latent variables are linear combinations of the initial variables.
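The PCA pipeline just described can be sketched in a few lines: standardise the data, diagonalise the correlation matrix, and order the components by explained variance. This is illustrative NumPy on synthetic data, not the SPSS procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)   # two highly correlated columns
X[:, 2] = 100 * X[:, 2]                          # a variable on a different scale

# Standardise, then diagonalise the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z, rowvar=False)
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]                 # descending order of importance
eigval, eigvec = eigval[order], eigvec[:, order]

scores = Z @ eigvec                              # the principal components Y_j
explained = eigval / eigval.sum()
print(np.round(explained, 2))
```

Because each component is orthogonal to those already defined, the resulting scores are mutually uncorrelated, and the explained-variance shares sum to one.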
In Factor Analysis (FA), each observed variable is described as a function of the common factors retained, and the error is presented, because usually a smaller number of factors is retained when compared with the initial number of variables.
In FA, only the common variation, shared by all the variables, is retained in each factor, while in PCA, the total variation is present in the set of original variables.
Thus, a rotation process is considered in FA. A rotation is a linear transformation that is performed on the component solution for the purpose of making the solution easier to interpret. Interpreting a rotated solution is to identify the concept measured by each of the retained components, also designated as a construct or latent variable.
The number of factors to retain can be defined using several criteria, for example, the factors with an eigenvalue larger than 1 and the percentage of variation of the original variables retained, among others.
PCs, Y_j, are expressed as linear combinations of the original variables, X_i:

Y_j = a_j1 X_1 + a_j2 X_2 + ... + a_jp X_p, j = 1, ..., p. (1)

Thus, the error term is absent.
In contrast, in FA, each observed variable, X_i, is described as a function of the common factors, FC_k, k = 1, ..., m, with m < p:

X_i = λ_i1 FC_1 + λ_i2 FC_2 + ... + λ_im FC_m + ε_i, i = 1, ..., p, (2)

where λ_ik are the factor loadings and ε_i is the specific (error) term of variable X_i.
However, the factors are easier to interpret and, therefore, facilitate the definition of latent variables in the data.
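The FA model with a varimax rotation can be illustrated with scikit-learn's `FactorAnalysis`, which supports `rotation="varimax"`. This is a sketch on synthetic data with two latent constructs, loosely mimicking the EFC setting, not the SPSS analysis reported below.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Synthetic data: four indicators driven by two latent constructs.
rng = np.random.default_rng(1)
f1, f2 = rng.normal(size=(2, 300))
X = np.column_stack([
    f1 + 0.3 * rng.normal(size=300),   # loads on factor 1
    f1 + 0.3 * rng.normal(size=300),
    f2 + 0.3 * rng.normal(size=300),   # loads on factor 2
    f2 + 0.3 * rng.normal(size=300),
])

fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
fa.fit((X - X.mean(0)) / X.std(0))
loadings = fa.components_.T            # one row per variable, one column per factor
print(np.round(loadings, 2))
```

After rotation, each variable should load heavily on exactly one factor, which is what makes the rotated solution easier to interpret as a construct.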

Data Adequacy
It is not always appropriate to use FA. In fact, exploratory FA is only useful if the population correlation matrix is statistically different from the identity matrix [20]. It is, then, necessary that the variables are correlated. The Bartlett sphericity test can be used to test the equality of the correlation matrix to the identity matrix. Only if there is a statistical difference between them is FA useful in the estimation of common factors.
The sample adequacy for FA application can be measured by the Kaiser-Meyer-Olkin (KMO) coefficient. This coefficient is based on the partial correlations between the variables and provides information on whether the variable selection and sample size are suitable for FA. Sample adequacy can be classified according to the magnitude of the KMO coefficient, with values below 0.5 considered unacceptable [21].
The Measure of Sample Adequacy (MSA) of each item/variable must be >0.50. If the variable has a MSA lower than this value, it must be removed from the analysis, as it is not sufficiently correlated with the others and will not be suitable for the analysis.
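All three diagnostics above, Bartlett's sphericity test, the overall KMO and the per-variable MSA, can be computed from the correlation matrix and its inverse. A sketch, assuming a plain data matrix as input:

```python
import numpy as np
from scipy import stats

def adequacy_diagnostics(X):
    """Bartlett sphericity p-value, overall KMO and per-variable MSA for X (n x p)."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # Bartlett: is the correlation matrix statistically different from identity?
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    pval = stats.chi2.sf(chi2, p * (p - 1) / 2)
    # Partial correlations, obtained from the inverse of R.
    S = np.linalg.inv(R)
    partial = -S / np.sqrt(np.outer(np.diag(S), np.diag(S)))
    np.fill_diagonal(partial, 0)
    r2 = R**2 - np.eye(p)            # squared off-diagonal correlations
    p2 = partial**2                  # squared off-diagonal partial correlations
    kmo = r2.sum() / (r2.sum() + p2.sum())
    msa = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    return pval, kmo, msa

# Strongly correlated synthetic indicators -> suitable for FA.
rng = np.random.default_rng(2)
common = rng.normal(size=400)
X = common[:, None] + 0.5 * rng.normal(size=(400, 4))
pval, kmo, msa = adequacy_diagnostics(X)
print(pval < 0.05, round(kmo, 2), (msa > 0.5).all())
```

With indicators sharing one strong common source, Bartlett's test rejects sphericity and both KMO and every MSA comfortably exceed the 0.5 threshold.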
The communalities, computed in both procedures, PCA and FA, refer to the percentage of the accounted variance of the observed variable retained in the components (or factors). A given variable will display a large communality if it loads heavily on at least one of the retained components.

Retaining Factors
There are several methods that can be used to select the appropriate number of components to retain in FA [22]. The most used are: the Kaiser criterion, which proposes to retain the factors with eigenvalues larger than 1; inspection of the scree plot, looking for a substantial decline in the magnitude of the eigenvalues; or specifying a limit for the cumulative percentage of variance explained by the factors (usually at least 70%). In several research areas, interpretability is perhaps the most important criterion for determining the number of components.
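The Kaiser rule and the cumulative-variance rule can both be read off the eigenvalues of the correlation matrix. A small illustration on an invented 4x4 correlation matrix with two correlated blocks:

```python
import numpy as np

def n_factors(R, var_target=0.70):
    """Number of factors by the Kaiser rule and by cumulative explained variance."""
    eigval = np.sort(np.linalg.eigvalsh(R))[::-1]
    kaiser = int((eigval > 1).sum())
    cum = np.cumsum(eigval) / eigval.sum()
    by_variance = int(np.searchsorted(cum, var_target) + 1)
    return kaiser, by_variance

# Illustrative correlation matrix: one strong block and one moderate block.
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
])
print(n_factors(R))
```

For this matrix both criteria agree on two factors: two eigenvalues exceed 1, and the first two components already explain more than 70% of the variance.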

Reliability
To measure the reliability of a factor, Cronbach's alpha can be used. This is the most widely used objective measure of reliability. Lee Cronbach introduced the coefficient in 1951 [23] to provide a measure of a test or scale's internal consistency, and it is expressed as a number between 0 and 1. Internal consistency refers to the extent to which all items measure the same concept or construct. The value of Cronbach's alpha rises as the variables become more correlated. A high Cronbach's alpha coefficient, however, does not automatically imply a high level of internal consistency, since adding more comparable items testing the same notion also raises Cronbach's alpha.
Typically, the acceptable values of Cronbach's alpha range from 0.70 to 0.95 [24].
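Cronbach's alpha follows directly from the item variances and the variance of the total score. A sketch on synthetic items sharing one construct:

```python
import numpy as np

def cronbach_alpha(items):
    """items: (n_respondents, k_items) array.
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Four items driven by a single construct plus noise -> high internal consistency.
rng = np.random.default_rng(3)
construct = rng.normal(size=500)
items = construct[:, None] + 0.6 * rng.normal(size=(500, 4))
print(round(cronbach_alpha(items), 2))
```

With this noise level the resulting alpha falls well inside the acceptable 0.70-0.95 range cited above.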

System GMM
This study analyses repeated measures of EFCs through time for each country; hence, longitudinal models (also known as panel models) are applied.
Contrary to cross-sectional methods that aggregate time-series, which may conceal an underlying dynamic (macroeconomic or of other nature), longitudinal models offer the possibility to account for, and investigate, the heterogeneity that may occur between economies.
The standard Fixed Effects Model (FEM) and the Random Effects Model (REM) control for the existence of a bias related to heterogeneity across economies and time. FEM considers individual, unobserved characteristics (i.e., unobserved heterogeneity). The REM model specification assumes that group effects follow a normal distribution over all the groups. However, these models do not overcome the endogeneity problem due to the potential correlation between one or several explanatory variables and the residuals. One strategy to overcome the endogeneity concern could be the Instrumental variable (IV) estimator; however, the main challenge is obtaining valid instruments applicable to panel analyses [25] and theoretically validating them.
In addition to the aforementioned problem, there is another disadvantage in using the FEM and REM models, which is that they are static and do not allow the unbiased estimation of parameters when including past (lagged) values of the dependent variable in the model.
In fact, recent studies that analyse EI and TEA (e.g., [26][27][28][29]) have taken into account the dynamic effect of past values of the variables in their estimations, finding a statistically significant effect. Thus, it was decided to consider the effect of the lagged values of EI and TEA in the present estimations.
Because the lagged dependent variable is correlated with the error term, a dynamic longitudinal model should be considered. In fact, the pooled OLS estimation technique yields inconsistent estimates. Furthermore, using the FEM estimator to transform the data to remove the fixed effects does not completely eliminate the inconsistency, because the transformed lagged dependent variable still depends on the error term [30,31].
Moreover, the data under analysis is extremely unbalanced, as there is no information on the variables of interest for all countries in all moments (years). This unbalanced nature of the data requires that the longitudinal dynamics of the expected values of EI and TEA are analysed through a model that overcomes this issue.
Thus, in this analysis, the Generalised Method of Moments (GMM) estimator, proposed by [32,33], is used. This model controls for possible endogeneity and unobserved heterogeneity. The technique uses lags of the dependent variable as explanatory variables to capture the effect of its past values; the lagged values of the dependent variable also serve as (internal) instruments to control for this endogenous relationship. The GMM model eliminates endogeneity by "internally transforming the data". This transformation is a statistical process that subtracts a variable's past value from its present value [17]. As a result, the number of observations is reduced, and this internal transformation improves the GMM model's efficiency [34].
However, to prevent potential data loss, as we deal with unbalanced data, the second-order transformation (two-step GMM) was considered, as recommended by [32]. This transformation applies "forward orthogonal deviations", subtracting the average of all future available observations of a particular variable from its current value [17], instead of subtracting only the previous observation (as in the first-step procedure).
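The forward orthogonal deviations transform can be written in a few lines for a single unit's time series; the scaling factor keeps i.i.d. errors homoskedastic after the transform. This is an illustrative sketch, not the xtabond2 implementation.

```python
import numpy as np

def forward_orthogonal_deviations(x):
    """FOD transform of one unit's series x (length T): subtract the mean of all
    future observations, scaled so i.i.d. errors stay homoskedastic.
    Returns T-1 transformed values (the last period is lost)."""
    T = len(x)
    out = np.empty(T - 1)
    for t in range(T - 1):
        future_mean = x[t + 1:].mean()
        c = np.sqrt((T - t - 1) / (T - t))   # scaling factor
        out[t] = c * (x[t] - future_mean)
    return out

# A constant (fixed-effect-like) series is annihilated by the transform.
print(forward_orthogonal_deviations(np.array([5.0, 5.0, 5.0, 5.0])))
```

Adding any constant to a series leaves its transform unchanged, which is exactly how the fixed effect η_i is removed.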
The GMM model considered can be parameterised as follows:

Y_it = φ Y_i,t-1 + β' X_it + η_i + ε_it, (3)

where Y_it is the dependent variable (in this case, EI or TEA), X_it is the vector of independent variables, φ is the autoregressive parameter, η_i is the unobserved fixed effect and ε_it represents the idiosyncratic shocks, normally distributed with zero mean and constant variance. The error term follows the error component structure, in which E(η_i) = 0, E(ε_it) = 0, E(η_i ε_it) = 0 and E(ε_it ε_is) = 0 for t ≠ s. Additionally, to capture the overall economic and social context, the estimations are controlled for per-capita GDP (following the works of [35][36][37]) and for the size of the country's population, as this is likely to affect the number of people available to work in the labour force, as well as the country's entrepreneurship rates [38]. Furthermore, year dummy variables (time-specific effects) were included to reduce the influence of cross-sectional error dependence in the dynamic panel model.
System GMM uses moment conditions that are functions of the model parameters and the data, such that their expectation is zero at the parameters' true values. It controls for endogeneity of the lagged dependent variable in a dynamic panel model (i.e., correlation between an explanatory variable and the error term); omitted variable bias; unobserved heterogeneity; and measurement error.
This method consists of a system of two equations: the original equation, expressed in levels, with first differences as instruments, and a transformed equation, expressed in first differences, with levels as instruments. Differencing all regressors removes the fixed effects (which do not vary over time). The calculation of the system GMM estimator is based on a stacked system comprising 0.5(T + 1)(T − 2) moment conditions for which instruments are observed (where T represents the total number of years, in this specific case). For model diagnostics, the Arellano-Bond test is used to test for second-order serial correlation in the first-differenced residuals. The null hypothesis states that the residuals are serially uncorrelated. If the null hypothesis cannot be rejected, there is no second-order serial correlation and the GMM estimator is reliable. In addition, the Hansen J-test is used to assess the null hypothesis of instrument validity, as well as the validity of the additional moment restrictions required for system GMM. The instruments are valid if the null hypothesis is not rejected. The variance inflation factor (VIF), an indication of how much the standard error has been inflated by collinearity, was used as a collinearity diagnostic for the estimated regression equations (results are not presented due to space restrictions, but are available upon request).
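The count 0.5(T + 1)(T − 2) can be verified by enumerating the standard instrument sets of system GMM. The enumeration below assumes the textbook case in which only the lagged dependent variable is instrumented.

```python
def system_gmm_moment_count(T):
    """Count moment conditions for system GMM with one lagged dependent variable.
    Differenced equation at t (t = 3..T): levels y_{i,1}, ..., y_{i,t-2} as
    instruments. Levels equation at t (t = 3..T): lagged difference dy_{i,t-1}."""
    diff_eq = sum(t - 2 for t in range(3, T + 1))   # (T-1)(T-2)/2 conditions
    levels_eq = T - 2                               # one extra condition per period
    total = diff_eq + levels_eq
    assert total == (T + 1) * (T - 2) // 2          # matches 0.5(T+1)(T-2)
    return total

print(system_gmm_moment_count(19))  # T = 19 annual observations (2001-2019) -> 170
```

The rapid growth of this count with T is why instrument proliferation, and hence the Hansen test, matters in long panels.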

Feature Extraction-Factor Analysis
The results of applying a FA to the full data set, the EFC in European countries during two decades (2002-2019), considering varimax rotation with Kaiser Normalisation, are presented in Table 2.
The obtained results demonstrate KMO = 0.898, corresponding to a very good adequacy of the sample for FA application. The MSA of each independent variable is larger than 0.75, indicating that all of them contribute to a significant difference between the correlation matrix and the identity matrix. Using the criterion of eigenvalues larger than 1, two factors are extracted, explaining approximately 64% of the global initial variance of the variables. The communalities of the variables are larger than 0.437. The first factor explains approximately 36% of the global variance of the variables and corresponds to the structural conditions of the countries (EFCs 1, 2, 4, 7, 8, 9, 10 and 11) (Table 2). The second explains approximately 28% of the global variance of the variables and corresponds to norms and education (EFCs 3, 5 and 6) and taxes and bureaucracy (EFC 12). The Cronbach's alpha reliability values for each factor are acceptable, as they are larger than 0.7.
The EFC indicators combined in the second factor cover very different subjects and are poorly interpretable when combined into a single factor. However, this behaviour differs over time, as presented in Table 3 and discussed in the corresponding text. In Table 4, it can also be observed that in the first few years some of these EFCs have MSAs below 0.4, indicating that they are not very correlated with the others in the analysis. The reliability statistics (Cronbach's alpha) for the two factors (Table 5) also show larger reliability values in the later years. Thus, although this second factor presents acceptable reliability, this is not evidence of its consistency in terms of interpretation and constitution. In Table 5 (and in Figure 1), the Cronbach's alpha reliability results of applying FA to the European countries' EFCs are presented. For each year of the data set, two factors and a varimax rotation with Kaiser Normalization are considered. The results range between 0.491 and 0.911. These values indicate a possible modification in the structure of the EFCs from year to year, especially before and after 2008. After 2008, all Cronbach's alpha reliability values are larger than 0.7, indicating a more stable structure of the EFCs in two factors with good reliability. Note that, for each year, the number of countries with data available is not the same.
The results obtained for each year of the data set, considering as criterion the retention of factors with eigenvalues > 1, are presented in Table 4. The number of factors defined using this criterion varies between two and four. After 2008, the values obtained are more stable, with larger KMO values and defining two or three factors. The EFCs with MSA < 0.4 and with communalities < 0.5 are also fewer after 2008.
Despite reasonable reliability, the structure obtained using the exploratory FA for each year is quite different. However, the main items remain the same. This is illustrated in Table 3, where F1 indicates that the EFC in that row is in Factor 1, and the last column presents the number of years for which the EFC is in Factor 1. For example, EFC1 is in Factor 1 for all years from 2002 to 2019, thus in 18 years. In general, taxes and bureaucracy (EFC3), basic-school (EFC5) and post-school (EFC6) entrepreneurial education and training, and cultural and social norms (EFC12) remain in Factor 2. These correspond to norms and education. The remaining EFCs (blue rows) remain in Factor 1, which corresponds to the structural conditions of countries.

System GMM Results
The estimations from the six adjusted longitudinal models are presented in Table 6. For the time horizon of 2001-2008, EI is negatively associated with the perception regarding the extent to which public policies support entrepreneurship as a relevant economic issue (EFC2), and positively associated with the perception regarding the presence of property rights, commercial, accounting and other legal and assessment services and institutions that support or promote SMEs (EFC8). Moreover, TEA is only positively associated with the perception regarding the extent to which public policies support entrepreneurship (taxes or regulations are either size-neutral or encourage new firms and SMEs) (EFC3).
For the time horizon of 2009-2019, EI is negatively associated with the perception regarding ease of access to physical resources-communication, utilities, transportation, land or space-at a price that does not discriminate against SMEs-EFC11. TEA is positively associated with the perception regarding the presence and quality of programs directly assisting SMEs at all levels of government (national, regional and municipal) (EFC4); with the perception regarding the extent to which training in creating or managing SMEs is incorporated within the education and training system in higher education such as vocational, college, business schools, etc. (EFC6); and the perception regarding the extent to which social and cultural norms encourage or allow actions leading to new business methods or activities that can potentially increase personal wealth and income (EFC12). Moreover, TEA is negatively associated with the perception regarding the extent to which national research and development will lead to new commercial opportunities and is available to SMEs (EFC7).
For all estimations, the past values of EI and TEA are positively and significantly associated with the present values of EI and TEA (p-value < 0.01), respectively. The results for the Hansen test indicate that, for all specifications, the null hypothesis is not rejected, which means the instruments are valid and the model is specified correctly. Regarding the reported p-values for AR(1) and AR(2) Arellano-Bond autocorrelation tests, as expected, there is first-order serial correlation and no evidence of significant second-order serial correlation. The significance of F statistics also demonstrates that all independent variables jointly and significantly explain the model at a 5% significance level.

Conclusions and Future Work
To answer the two research questions, "How do the EFCs influence entrepreneurial intentions?" and "How do the EFCs influence early-stage entrepreneurial activity?", a set of statistical methods that take into account the expected dynamics of these indicators over time and within countries was used.
Feature extraction, namely FA, applied as a first exploratory analysis, allowed us to understand that two major time horizons should be considered. The results demonstrate that not all EFC indicators are associated with entrepreneurial intentions and early-stage entrepreneurial activity. Furthermore, the associations differ between the two time horizons considered, namely until 2008 and after.
For the first time horizon, conditions such as the perception regarding the presence of property rights, commercial, accounting and other legal and assessment services and institutions that support or promote SMEs and the extent to which public policies support entrepreneurship positively affect EI and TEA. The perception regarding the extent to which public policies support entrepreneurship as a relevant economic issue negatively impacts EI, but has no significant impact on TEA.
For the 2009-2019 time horizon, only the perception regarding ease of access to physical resources (communication, utilities, transportation, land or space) at a price that does not discriminate against SMEs negatively affects EI, but not TEA. TEA is only negatively affected by the perception regarding the extent to which national research and development will lead to new commercial opportunities and is available to SMEs. Furthermore, in this time period, TEA is positively affected by the perception regarding the presence and quality of programs directly assisting SMEs at all levels of government (national, regional and municipal); by the perception regarding the extent to which training in creating or managing SMEs is incorporated within the education and training system in higher education, such as vocational, college and business schools; and by the perception regarding the extent to which social and cultural norms encourage or allow actions leading to new business methods or activities that can potentially increase personal wealth and income.
The difference between the impact of EFCs on EI and on TEA highlights the importance of combining the two statistical procedures considered in the present analysis, the FA and the dynamic longitudinal analysis, in studies on this matter.
On all longitudinal dynamic estimations, past values of EI and TEA are significantly associated with present values which, once again, corroborates the dynamic aspect of how structural conditions affect EI and TEA.
The analysis of the relation between EFCs and other entrepreneurial behaviours and attitudes, the exploration of further control variables for model robustness, and the study of the primary data instead of the GEM indicators are possibilities for future work. For example, as entrepreneurship may be related to human ability and the propensity to take risks, these qualities could be controlled for in future analyses. Additionally, the implementation of a longitudinal Principal Component Analysis would also reinforce the present analysis.
Author Contributions: This work was conducted by the four authors in collaboration through joint and distributed tasks. Joint tasks included conceptualization, organization of the paper, stateof-the-art investigations, definition of the methodology, collection of the resources, interpretation, writing-original draft preparation and writing-review and editing. Major contribution in software implementation, validation and visualization was given by A.B. and A.C. All authors have read and agreed to the published version of the manuscript.
Funding: This work has been supported by national funds through FCT-Fundação para a Ciência e Tecnologia through project UIDB/04728/2020.