Local Processing of Massive Databases with R: A National Analysis of a Brazilian Social Programme

Abstract: The analysis of massive databases is a key issue for most applications today, and the use of parallel computing techniques is one of the suitable approaches for it. Apache Spark is a widely employed tool within this context, aimed at processing large amounts of data in a distributed way. For the Statistics community, R is one of the preferred tools. Despite its growth in recent years, it still has limitations for processing large volumes of data on single local machines. In general, the data analysis community has difficulty handling massive amounts of data on local machines, often requiring high-performance computing servers. One way to perform statistical analyses over massive databases is to combine both tools (Spark and R) via the sparklyr package, which allows an R application to use Spark. This paper presents an analysis of Brazilian public data from the Bolsa Família Programme (BFP, a conditional cash transfer), comprising a large data set with 1.26 billion observations. Our goal was to understand how this social programme acts in different cities, as well as to identify potentially important variables reflecting its utilization rate. Statistical modeling was performed using random forest to predict the utilization rate of the BFP. Variable selection was performed through a recent method based on the importance and interpretation of variables in the random forest model. Among the 89 variables initially considered, the final model presented high predictive capacity with 17 selected variables, and indicated that variables related to income, education, job informality, and inactive youth, namely family income, education, occupation and density of people in the homes, were highly important for the observed utilization rate. In this work, using a local machine, we highlighted the potential of aggregating Spark and R for the analysis of a large database of 111.6 GB.
This can serve as a proof of concept or reference for other similar works within the Statistics community, and our case study can also provide important evidence for further analysis of this important social support programme.


Introduction
The use of large databases defies the traditional computational limits of data capture, processing, analysis, and storage [1]. This kind of database has become a valuable source of information

Related Work
It is very common today to find large databases on the Web with information from different areas of knowledge. For example, BitTorrent is a very popular P2P communication protocol through which people can share files [13]. In Brazil, there is a massive volume of structured, semi-structured, and unstructured public data available on government sites, aimed at making public administration more open and transparent. The authors in [14] developed a soft system methodology that transforms open government public data into open linked data, according to the objectives of specific groups. In terms of the analysis of Brazilian social data, the literature usually presents studies that use grouped data or samples (e.g., [15][16][17]).
We can mention some works that used Apache Spark to perform data analysis: in [18], it was used to analyze tweets transmitted with very low latency (a few seconds). In [19], Spark and Hadoop were used and compared for analyzing log files; the study concluded that Spark, due to its effective exploitation of main memory and efficient use of optimization techniques, was faster than Hadoop. In [20], the authors used deep learning in mobile big data analytics and discussed a scalable learning framework in Apache Spark. Apache Spark was used to apply machine learning operations to big data in [21], with the consideration that Spark can make the preprocessing step considerably easier.
The literature still has few examples of the use of sparklyr to address big data applications. Most citations are related to commercial products/tools or specialized studies. This section briefly lists some related works making use of R and sparklyr within statistical data analysis scenarios. Gopalani [22] compared and discussed Hadoop and Spark, and analyzed performance using the k-means machine learning algorithm. Bluhm [11] illustrated the use of Spark in Econometrics. Also, Yu et al. [23] introduced GeoSpark to manipulate spatial data. In addition, Azevedo et al. [24] created a data visualization through the Shiny package in which the data processing was carried out through sparklyr. However, these computational tools are still little explored by the statistical community. Some examples of works in computational sociology are also found: ref. [25] shows how trends in the field have reshaped sociology. Hummon and Fararo [26] discuss computational sociology, which consists of the analysis of empirical data, theoretical explanation, and computational simulation. Salgado and Gilbert [27] expose the dialogue between social theory and computational models of social processes.

Apache Hadoop
Hadoop is an open source project from the Apache Software Foundation, written in Java, and encompassing a collection of related subprojects that fit into the distributed computing infrastructure [28]. The main characteristics that made Hadoop interesting for applications in large databases are [29]:
• a permissive free software license;
• scalability, allowing execution in cluster environments with hundreds of servers;
• fault tolerance, ensuring the availability of data and execution of tasks even in the event of failures.
Basically, Hadoop provides storage of data sets through the Hadoop Distributed File System (HDFS), which offers distributed storage, and a programming model, MapReduce, which subdivides tasks for faster processing. The MapReduce programming model is used to process data in parallel, dividing the data into smaller fractions and distributing them across the cluster; in this way, processing time is reduced. An example of this programming model can be seen in [30]. There are two main phases: Map and Reduce. A map() function receives the data and returns key-value pairs, while the reduce() function aggregates the information.
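As a minimal illustration of the two phases (a hypothetical word-count sketch in plain Python, not actual Hadoop code), the map phase emits key-value pairs from each input split and the reduce phase aggregates values that share the same key:

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) key-value pair for every word in the split
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # Reduce: aggregate all values that share the same key
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Each split would be mapped on a different node, then shuffled by key
splits = ["spark hadoop spark", "hadoop mapreduce"]
pairs = [p for s in splits for p in map_phase(s)]
word_counts = reduce_phase(pairs)
# word_counts == {"spark": 2, "hadoop": 2, "mapreduce": 1}
```

In a real cluster, the shuffle step between the two phases routes all pairs with the same key to the same reducer node.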

Apache Spark
Apache Spark is a unified computing engine and a set of libraries for processing data in parallel on computer clusters [9]. It is also a project of the Apache Software Foundation and is written in the Scala language. Spark uses the DAG (Directed Acyclic Graph) execution model, which offers better flexibility and performance than MapReduce, since it allows multiple stages forming a tree structure and supports operations such as map, filter, and union [28]. Spark's popularity has increased in recent years because it is easy to integrate with existing technologies, such as HDFS and HBase data sources. Also, it includes the Spark Streaming, Spark SQL, Spark GraphX and Spark MLlib libraries, which are suitable for processing streaming data, SQL, graphs and machine learning algorithms, respectively.
Spark presents the following abstractions [9]: DataSets, DataFrames, SQL tables and resilient distributed datasets (RDDs), which represent distributed collections of data. To achieve parallelism, the data is divided into partitions, each a set of rows residing on one machine. When a transformation is made on a DataFrame, it in fact results in a set of RDD transformations, and practically all Spark code compiles down to RDD operations. Two types of procedures are valid over RDDs:
• Transformations: return a new RDD, such as map, filter and coalesce;
• Actions: return a value, such as reduce, collect and count.
RDDs use lazy evaluation, that is, execution only starts when an action function is triggered; thus, Spark does not perform computation until it is really needed. Also, the use of Spark becomes more accessible since its APIs facilitate data processing. Spark currently provides APIs in Python, Scala, Java and R. For more details see [31].
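The lazy-evaluation behavior of RDDs can be mimicked in plain Python with generators (an illustrative analogy, not Spark code): the "transformations" only build a pipeline, and nothing is computed until an "action" consumes it:

```python
data = range(1, 7)

# "Transformations": generators are lazy, so no element is processed yet
doubled = (x * 2 for x in data)             # analogous to map
large_kept = (x for x in doubled if x > 4)  # analogous to filter

# "Action": consuming the pipeline finally triggers the computation,
# analogous to reduce/collect in Spark
result = sum(large_kept)
# result == 36  (6 + 8 + 10 + 12)
```

In Spark the same deferral lets the engine optimize the whole DAG of transformations before executing anything.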

R and Spark with the sparklyr Package
A traditional tool that has grown significantly in recent years, becoming one of the main tools for data analysis and visualization, is the R software, a language and environment for statistical computing and graphics [32]. It is free software, with a simple syntax and a variety of packages that facilitate data analysis. However, regarding the processing of large volumes of data, it has a native limitation, since in its standard version the data is read into the computer's RAM. One way to work with large databases while remaining in R is to extend it through the use of packages.
The sparklyr package was developed by Javier Luraschi et al. in order to link R to Apache Spark. According to [33], in this way the ease of use of R is combined with the computational strength of Apache Spark, making it possible to write simple, fluid code for the processing of large databases without the need to learn new programming languages. Furthermore, it is compatible with other R packages, such as dplyr, and is capable of connecting to local or remote clusters, which can increase the processing power. A standard workflow in sparklyr is given by:

1. Spark connection
2. Data analysis
3. Spark disconnection
Recently, the analysis of large volumes of data has been highlighted as regards the resolution of problems involving fraud detection, recommendation of products and services and identification of similar customers, for example. For this, it is important that the models used learn from the data and make good predictions, which is why Machine Learning algorithms have become so popular recently.
In order to introduce the usage of the sparklyr package, we display some basic commands in Table 1. Furthermore, the entire code applied to perform the analyses in this paper is shared in the results section.

Performing Machine Learning with Random Forest
The Random Forest (RF) algorithm [34] uses the idea of combining models. This idea underlies the ensemble methods, which combine models with the intention of balancing bias and variance. According to [35], bias refers to how well the model approximates the real relationship between variables, and variance refers to how much the model varies depending on the sample used for training. Regarding this trade-off, the ensemble procedure makes it possible to reduce the variance without increasing the bias [36]. Random Forest differs from other ensemble methods, like gradient boosting, due to its interpretability through the feature importance, and due to the independent structure of its base learners, which provides an easy parallelization setting [34].
This methodology consists of generating multiple decision trees in parallel, h(x, θ_t), t = 1, ..., T, where x ∈ R^p is an observation associated with the random variable X, p is the number of variables, and T is the number of trees in the forest. The combination of all these models forms the "forest". Figure 1 shows the structure of a RF. The prediction for a new observation x* combines all the trees: for regression, the average ŷ = (1/T) ∑_{t=1}^{T} h(x*, θ_t), and for classification, the majority vote among the T trees. As a result of averaging the results of a high number of trees, the method loses the interpretation obtained with individual decision trees [6]. Two important parameters in adjusting the RF are ntree and mtry. The first refers to the number of trees to be built, and the second to the number of covariates chosen at random for each split. In general, a classification model uses √p randomly selected variables for each split, where p represents the total number of covariates; for a regression model, this amount is given by p/3. In RF, the error estimate is obtained through the out-of-bag (OOB) sample, which is composed of the observations left out of the bootstrap sample, that is, not used in the construction of the tree.

Selection of Variables
The procedure for selecting variables is based on the importance of the variable. According to [38], the selection of variables has two objectives: to find important variables related to the response variable (for interpretation purposes) and to find a parsimonious number of important variables (for forecasting). For this work, we are interested in the first objective.

Pseudocode
The random forest can be designed as follows:
1. Let N be the total number of observations in the database and B a large number of repetitions. Sample, B times and at random, N observations with replacement (bootstrap samples);
2. Let M be the total number of covariates in the database. Select, at random and without replacement, a subset of m < M covariates for each sample previously drawn. The value of m is always the same;
3. Train a decision tree on each sample taken. Each tree grows to its maximum size, so there is no pruning;
4. Obtain the prediction from each of the trees;
5. The final prediction is obtained by the mean (quantitative response) or the mode (qualitative response).
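The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not a usable implementation: the decision-tree step is replaced by a trivial stand-in learner (the mean response of the bootstrap sample) so that the bootstrap and aggregation mechanics stay visible:

```python
import random
import statistics

def random_forest_sketch(X, y, B=25, m=2, seed=42):
    """Toy sketch of the RF pseudocode with a stand-in base learner."""
    rng = random.Random(seed)
    N, M = len(y), len(X[0])
    predictions = []
    for _ in range(B):
        # Step 1: bootstrap sample of N observations, with replacement
        idx = [rng.randrange(N) for _ in range(N)]
        # Step 2: random subset of m < M covariates (unused by the
        # trivial stand-in learner, kept only to mirror the algorithm)
        features = rng.sample(range(M), m)
        # Steps 3-4: "train" the stand-in learner and get its forecast
        predictions.append(statistics.mean(y[i] for i in idx))
    # Step 5: final forecast by the mean (quantitative response)
    return statistics.mean(predictions)

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [2, 3, 4]]
y = [10.0, 20.0, 30.0, 12.0]
estimate = random_forest_sketch(X, y)
```

A real implementation would grow an unpruned decision tree in steps 3-4, restricted to the sampled covariates at each split.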
Following [38], the steps for selecting variables are:
• Ordering:
1. Compute the importance of the variables;
2. Discard the least important variables, as the most important ones have the greatest impact;
3. Order the remaining variables in decreasing order of importance and plot them together with the corresponding standard deviations. The minimum value predicted by a CART model fitted to this curve is used as an importance cutoff point, keeping only the K variables that exceed that point.
• Selection:
1. Build nested RF models including the first k variables, starting with the model with only the most important variable, and calculate the OOB error rates;
2. Select the variables involved in the model that leads to the smallest OOB error.
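The selection step reduces to a simple scan over nested models. A schematic sketch in Python, where oob_error is a hypothetical stand-in for refitting a random forest on the first k variables and returning its OOB error (the variable names and error values below are made up for illustration):

```python
def select_variables(ranked_vars, oob_error):
    """Scan nested models over importance-ranked variables and keep
    the prefix whose model attains the smallest OOB error."""
    best_k, best_err = 1, float("inf")
    for k in range(1, len(ranked_vars) + 1):
        err = oob_error(ranked_vars[:k])  # refit with first k variables
        if err < best_err:
            best_k, best_err = k, err
    return ranked_vars[:best_k], best_err

# Hypothetical OOB error curve: adding variables helps up to a point
errors = {1: 0.30, 2: 0.22, 3: 0.18, 4: 0.19, 5: 0.21}
ranked = ["RDPC", "IDHM_R", "PPOB", "TRABSC", "T_NESTUDA"]
selected, err = select_variables(ranked, lambda vs: errors[len(vs)])
# selected == ["RDPC", "IDHM_R", "PPOB"], err == 0.18
```

In practice each call to oob_error is a full RF fit, so the scan is the expensive part of the procedure.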

Validation and Evaluation Measures
In machine learning, model validation refers to the process of verifying the suitability of the trained model in terms of predictive performance on new data. Refaeilzadeh et al. [39] note that there are at least four validation methods: resubstitution, hold-out, k-fold, and leave-one-out (jackknife). In this paper we consider repeated holdout validation, which provides a better estimate since it reduces the bias, especially compared to the standard holdout method [40]. This behavior is observed because, instead of selecting just one sample to train a model and evaluate it, multiple samples are used, minimizing the effect of choosing a single set of observations in the simple holdout.
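A repeated holdout can be sketched as follows (a schematic in plain Python, not the paper's actual code; the fit and score functions are hypothetical placeholders supplied by the caller):

```python
import random
import statistics

def repeated_holdout(data, fit, score, repeats=100, train_frac=0.7, seed=1):
    """Average a score over many random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        train, test = shuffled[:cut], shuffled[cut:]
        model = fit(train)           # train on 70% of the data
        scores.append(score(model, test))  # evaluate on the other 30%
    return statistics.mean(scores)

# Toy example: the "model" is the training mean, scored by squared error
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
mse = repeated_holdout(
    data,
    fit=lambda train: statistics.mean(train),
    score=lambda m, test: statistics.mean((x - m) ** 2 for x in test),
)
```

Averaging over many splits is what reduces the dependence on any single lucky or unlucky partition.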
During the entire validation process it is important to consider some evaluation measures. For a regression problem, the most used metric is the Mean Squared Error (MSE), the average of the squared model errors; the best model is the one with the lowest MSE value. This metric is given by MSE = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)², where y_i is the observed value and ŷ_i the predicted value. For a binary classification problem, the metrics frequently come from the confusion matrix, as exemplified in Table 2, where TP represents the true positive values, TN the true negative values, FP the false positive values and FN the false negative values. In this paper, the following evaluation measures were used to quantify the performance of our binary classification model.
• Accuracy: considers the total number of correct answers of the model over the total number of observations. The best model is the one with the highest accuracy. It is defined as Acc = (TP + TN) / (TP + TN + FP + FN).
• F1 score (F1): represents a combination of two other metrics, Recall (R) and Precision (P). The best model is the one with the highest F1 value. It is defined as F1 = 2RP / (R + P), where R = TP / (TP + FN) and P = TP / (TP + FP).
• Matthews Correlation Coefficient (MCC): represents a correlation between predicted and real values. The best model is the one with the largest MCC. In comparison with the F1-score and accuracy, the MCC produces more reliable estimates, since the other two metrics can generate overoptimistic, inflated results, especially on imbalanced datasets [41]. The coefficient is defined by MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
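All three measures can be computed directly from the confusion-matrix cells; a small sketch with made-up counts:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, F1 and MCC from the four confusion-matrix cells."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return acc, f1, mcc

# Hypothetical confusion matrix with 100 observations
acc, f1, mcc = classification_metrics(tp=40, tn=45, fp=5, fn=10)
# acc == 0.85; a perfect classifier would give acc == f1 == mcc == 1.0
```

Note how the MCC uses all four cells symmetrically, which is what makes it more robust to class imbalance than accuracy or F1.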

Results and Data Analysis
Public data from the Bolsa Família Programme (BFP) were used to illustrate an analysis of large databases using the R software and the sparklyr package. All analyses were performed using R version 3.6.0 with the RStudio integrated environment [42] on a personal laptop with the following configuration: Windows 10 64-bit operating system, Intel Core i3-5005U processor at 2.00 GHz and 4 GB of RAM. It is worth mentioning that neither GPU nor virtual nodes were used. In fact, a Spark connection in local mode starts a single process that runs most of the cluster components, like the Spark context and a single executor [33]. Moreover, sparklyr version 1.0.1 and Spark version 2.4.0 were used. The main code is available at https://github.com/LED-UFBA/sparklyr_bf.

Data Description
The data used refer to the monthly payments of the BFP in the period from 2013 to 2019 and were extracted from the Brazilian Transparency Portal [43]. The downloaded files are in CSV format and together total about 111.6 GB, with 1.26 billion observations. The reference year and month variables were removed from the databases, as the data are in files separated by month. The considered variables are shown in Table 3, where SIAFI corresponds to the Integrated Financial Administration System.

BFP Analysis for the Period from 2013 to 2019
The BFP analysis aims to obtain quantitative knowledge about the programme in Brazil, seeking to identify the locations most and least dependent on the benefit, as well as the variables that are potentially important for the use of the programme. For this, there is an interest in the number of beneficiaries and in the utilization rate of the programme. The utilization rate is understood as the ratio between the total number of beneficiaries and the total population measured in the 2010 Census, provided by the Brazilian Institute of Geography and Statistics [44]. The BFP utilization rate may be viewed as an important social indicator, since it is related to the country's social issues: a high rate reflects that many people are in the poverty or extreme poverty range. The utilization rate is computed as the ratio between the number of personal benefits and the population size in each city, for each month; the BFP utilization rate by city is then the average over the observed months. The average was chosen to represent the general behavior of the utilization, although other statistics may be considered in future works. Understanding the BFP utilization rate is relevant to support Brazilian public policies, because this variable is directly associated with social problems such as unemployment and poverty. Therefore, cities with high values of this rate can receive more social/economic assistance and targeted public policies from the government, in order to improve the quality of life in those places.
First, the databases were converted from .csv to .parquet and the year and reference month variables were removed. The BFP beneficiaries were aggregated by city in order to compute the utilization rate in each of the 5,565 Brazilian cities.
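The aggregation described above can be sketched as follows (a schematic in plain Python with made-up numbers; the actual computation in the paper was carried out through sparklyr/Spark over the full database):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical monthly beneficiary counts: (city, month, beneficiaries)
monthly = [
    ("A", "2013-01", 500), ("A", "2013-02", 520),
    ("B", "2013-01", 100), ("B", "2013-02", 80),
]
population = {"A": 5000, "B": 2000}  # 2010 Census population per city

# Monthly utilization rate = beneficiaries / population,
# then average the monthly rates to get the rate per city
rates = defaultdict(list)
for city, month, beneficiaries in monthly:
    rates[city].append(beneficiaries / population[city])
utilization = {city: mean(r) for city, r in rates.items()}
# utilization == {"A": 0.102, "B": 0.045}
```

The same group-by-and-average logic is expressed in Spark as a grouped aggregation, which is what makes it scale to 1.26 billion rows.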
In general, performing a temporal descriptive analysis of the BFP utilization rate (Figure 2), we can observe that the utilization rate is around 10%, which means that, averaging over cities, one in every ten people in Brazil receives the benefit. The largest number of beneficiaries occurred in July 2014 and the lowest in July 2017, and there is a decay in the utilization rate after May 2019. The drop was due to government actions in 2017 and 2019, such as the removal of registration irregularities or cuts in funds. Figure 2 displays the behavior of the BFP utilization rate in Brazil over the years, which is useful for identifying periods with abnormalities. Furthermore, Figure 3 describes the distribution of the utilization rate across cities, which has a bimodal behavior showing that there are at least two kinds of cities in Brazil, characterized by low and high utilization of the BFP. In this sense, we performed a dichotomization at the mean (11%), and maps were drawn up with the average city utilization rates of the BFP, for the years studied, in a categorized way. Figure 4 shows that the highest rates are found predominantly in cities from the North and Northeast regions of Brazil. Figure 5 displays both categories for each Brazilian state. The states with the highest utilization rates are Alagoas (AL), Sergipe (SE), Piauí (PI) and Maranhão (MA); the states with the lowest utilization rates are Santa Catarina (SC), São Paulo (SP), Rio Grande do Sul (RS) and Paraná (PR). From this outcome, it is clear that the BFP behaves differently in distinct cities, and the discrepancy between the North/Northeast and the South/Southeast regions is especially evident. These outcomes can support specific government policies to improve the programme's efficiency. Subsequently, socioeconomic variables were considered in order to explain the utilization rate, and a city-level database was prepared.
Such variables are based on those collected in the 2010 Census and were taken from the portals of [44] and [45], summing up to a total of 89 covariates. A description of all of them can be found in Appendix A. The modeling step was then carried out, aiming to identify the variables that are possibly important for the use of the BFP. For this, two forms of modeling were performed: considering the response variable in its original nature (regression) and in categorized form (classification). In the regression model, the response variable is continuous: the rate of use (y_i), defined as the number of people who use the assistance divided by the total population of the city, was estimated. The majority of the covariates are continuous, with the exception of COD_UF and REGIAO. Setting y as the target variable and the other variables as predictors, regression models were fitted using Random Forest [34]. Performance was evaluated using the Root Mean Squared Error (RMSE), calculated through a validation scheme of 100 repeated holdouts with a 70-30% training-test split ratio. This ratio was selected because it provides a significant sample proportion for both sets [46]. Figure 6 shows the graph of the importance of the variables, with the respective cutoff used in the variable selection.
The tuning of the hyperparameters used in the Random Forest estimation was performed through a grid search varying the following parameters:
• mtry, the number of variables randomly sampled as candidates at each split: {1; 3; 6};
• nodesize, the minimum number of observations in terminal nodes: {5; 10; 25};
• ntree, the number of trees in the Random Forest: {100; 500; 1000}.
The best combination of hyperparameters was the one producing the lowest Root Mean Squared Error.
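The grid search described above can be sketched as follows (a toy in plain Python; rmse_for is a hypothetical stand-in for refitting the model and computing the holdout RMSE, chosen arbitrarily for illustration):

```python
from itertools import product

# Candidate values from the tuning grid described above
grid = {"mtry": [1, 3, 6], "nodesize": [5, 10, 25], "ntree": [100, 500, 1000]}

def rmse_for(params):
    """Hypothetical stand-in for the expensive step: fit a random
    forest with these hyperparameters and return its holdout RMSE."""
    return (10.0 - 0.5 * params["mtry"]
            - 0.01 * params["nodesize"]
            - 0.001 * params["ntree"])

# Exhaustively evaluate every combination and keep the lowest RMSE
best = min(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=rmse_for,
)
# best == {"mtry": 6, "nodesize": 25, "ntree": 1000}
```

With 3 × 3 × 3 = 27 combinations, each requiring a full repeated-holdout fit, the grid search dominates the modeling cost.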
The results are summarized in Table 4 and express great performance on the task of predicting the rate of use (y_i) of different cities, which could represent a useful tool to guide the Government in better managing resources and in directing public policies. The hyperparameters that produced the lowest RMSE were mtry = 6, nodesize = 25 and ntree = 1000. The second modeling approach is a classification problem, resulting in the "Low" and "High" categories. The determination of these categories is given through the mean value of the BFP utilization rate: if a municipality has a BFP utilization rate lower than the mean value, it is labeled "Low"; otherwise, it receives the "High" label. Moreover, the same tuning process was carried out in order to select the best parameter setting, and the results were evaluated using the same repeated holdout validation technique, with 100 repetitions and a 70-30% training-test split, but with ACC, MCC and F1-score as the metrics.
Performance metrics of the classification models are presented in Table 5. The RF hyperparameters which achieved the lowest generalization error were mtry = 4, nodesize = 5, ntree = 1000. Besides its high predictive capacity, the Random Forest model also gives an interpretation of the importance of each variable used to estimate the class of each city. Importance values use the Out-of-Bag (OOB) samples in their calculation: in each of those samples, a predictor is selected and its values are shuffled; afterwards, the mean percentage decrease in accuracy is obtained and computed as the variable importance. Table 6 presents a ranking of the variables based on their importance values. This information can add value to the formulation of public policy. From the results, it is clear that poverty is an important aspect of attendance of the BFP programme, so future actions and plans can consider more targeted problems. Moreover, the presence of UF and REGIAO as highly rated importance variables can reveal an inequality between federative states and Brazilian regions, which is also an important feature to be analyzed by the government. Also, according to the Atlas of Human Development in Brazil (2019) [45], the IDHM_R is an indicator of the ability of the inhabitants of a locality to guarantee a proper standard that ensures their basic needs, for example, water, food and housing. Thus, it can be seen from Figure 7 that the highest values refer to the states of the South, Southeast and Midwest regions of the country. Furthermore, this variable also plays an important role in the rate of use of the BFP, as it appears fifth in the variable importance ranking in Table 6.
In order to verify the distribution behavior of the continuous variables over the binary utilization rate, Figure 8 displays a negative influence of RDPC, IDHM_R and IDHM and a positive influence of the other variables. This figure also corroborates the variable selection method used, since the distributions differ between the response categories.

Summary of Results
Through this analysis, it was possible to characterize the use of the programme in Brazil and verify which cities, states, and regions have low and high utilization rates of the BF Programme, as well as, through the Random Forest variable selection method, to identify important variables for the use of the BFP, such as the municipal human development index and the proportion of people vulnerable to poverty. The analysis also identifies as the most important the PPOB variable (proportion of people with per capita household income equal to or less than R$255.00 per month), which gives us grounds to believe in the effective targeting of the social programme. Moreover, in addition to important factors such as income and education, this analysis draws attention to job informality and inactive youth, as measured by the variables TRABSC (percentage of people aged 18 or over who are employed without a formal contract) and T_NESTUDA (proportion of young people aged 15 to 24 who neither study nor work).

Final Considerations
For the Statistics community, R is one of the preferred tools. Despite its growth in recent years, it still has limitations for processing large volumes of data on single local machines. One way to perform statistical analyses over massive databases is to combine both tools (Spark and R) via the sparklyr package, which allows an R application to use Spark.
In this paper, the implementation performed with the R software via the sparklyr package considered 111.6 GB of monthly Brazilian public data from the Bolsa Família Programme, processed on a local machine. Through the analysis, it was possible to understand how this social programme works in different cities, as well as to identify variables of great importance for the use of the programme, for example, the variable representing the proportion of young people aged 15 to 24 who neither study nor work.
Therefore, the potential of aggregating Spark and R for the analysis of large databases is noted, since in this work, using one local machine, it was possible to analyze public data of large size, with about 1.26 billion observations, as well as to provide important information which may support national public management. Several future works may be considered, such as comparing computational time performance, applying other traditional statistical or machine learning models, or fitting time series models to the monthly payments of the BFP.

Appendix A

Economically active population. Corresponds to the number of people in this age group who, in the reference week of the Census, were employed in the labor market or who, being unemployed, had sought work in the month prior to the date of the survey.
PEA1014: Economically active population 10 to 14 years of age. Corresponds to the number of people in this age group who, in the reference week of the Census, were employed in the labor market or who, being unemployed, had sought work in the month prior to the date of the survey.
PEA1517: Economically active population between 15 and 17 years of age.
Ratio between the population living in permanent private households with water piped to one or more rooms and the total population living in permanent private households multiplied by 100.
The water can come from the general network, from a well, from a spring or from a reservoir supplied by rainwater or a water tanker.
T_ANALF_15M: Illiteracy rate of the population aged 15 or over. Ratio between the population aged 15 and over who cannot read or write a simple note and the total number of people in this age group multiplied by 100.
T_ANALF11A14: Illiteracy rate of the population between 11 and 14 years of age. Ratio between the population aged 11 to 14 who cannot read or write a simple note and the total number of people in this age group multiplied by 100.
T_ANALF15A17: Illiteracy rate of the population between 15 and 17 years of age. Ratio between the population aged 15 to 17 who cannot read or write a simple note and the total number of people in this age group multiplied by 100.
Ratio between the population living in permanent private households with a density greater than 2 and the total population living in permanent private households multiplied by 100.
The density of the household is given by the ratio between the total number of household residents and the total number of rooms used as bedrooms.
T_ENV: Aging rate. Ratio between the population aged 65 and over and the total population multiplied by 100.
T_FBBAS: Gross attendance rate for basic education. Ratio between the total number of people of any age attending basic education (elementary or high school, regular or serial) and the population aged 6 to 17 multiplied by 100.
T_FBFUND: Gross attendance rate for primary education. Ratio between the total number of people of any age attending regular elementary school and the population aged 6 to 14 multiplied by 100.
T_FBMED: Gross high school attendance rate. Ratio between the total number of people of any age attending regular high school and the population aged 15 to 17 multiplied by 100.
T_FBPRE: Gross pre-school attendance rate. Ratio between the total number of children up to 5 years old (only 5 years old in 1991) attending pre-school and the population in that same age group multiplied by 100.
T_FBSUPER: Gross higher education attendance rate. Ratio between the total number of people of any age attending higher education (undergraduate, specialization, master's or doctorate) and the population aged 18 to 24 multiplied by 100.
T_FORA4A5: % of children aged 4 to 5 who do not attend school. Ratio between the number of children aged 4 to 5 who do not attend school and the total number of children in this age group multiplied by 100.
T_FORA6A14: % of children aged 6 to 14 who do not attend school. Ratio between children aged 6 to 14 who do not attend school and the total number of children in this age group multiplied by 100.
T_FREQ0A3: School attendance rate of the population from 0 to 3 years old. Ratio between the 0 to 3 year old population attending school, at any level or grade, and the total population in this age group multiplied by 100.
Ratio between the population aged 11 to 14 who were attending school,
at any level or grade, and the total population in this age group multiplied by 100.
T_FREQ4A5: School attendance rate of the population from 4 to 5 years old. Ratio between the population aged 4 to 5 who were attending school, at any level or grade, and the total population in this age group multiplied by 100.
T_FREQ5a6: Percentage of the population aged 5 to 6 attending school. Ratio between the population aged 5 to 6 who were attending school, at any level or grade, and the total population in this age group multiplied by 100.
T_FREQ6A14: School attendance rate of the population from 6 to 14 years of age. Ratio between the population aged 6 to 14 who were attending school, at any level or grade, and the total population in this age group multiplied by 100.
T_FUND15A17: Percentage
Ratio between people aged 18 or over with no complete elementary education and in informal occupation and the total population in this age group multiplied by 100. Informal occupation implies that they work but are not: employees with a formal contract, military personnel in the army, navy, aeronautics, military police or fire brigade, employed under the legal regime of civil servants, or employers and self-employed workers contributing to an official social security institute.
T_LUZ: Percentage of the population living in households with electricity. Ratio between the population living in permanent private households with electric lighting and the total population living in permanent private households multiplied by 100.
Lighting is considered to be from a general network, with or without a meter.
T_M10A14CF: Percentage of women aged 10 to 14 who had children. Ratio between women aged 10 to 14 who had children and the total number of women in this age group multiplied by 100.
T_M15A17CF: Percentage of women aged 15 to 17 who had children. Ratio between women aged 15 to 17 who had children and the total number of women in this age group multiplied by 100.
T_MED18M: Percentage of the population aged 18 or over with completed high school. Ratio between the population aged 18 or over who completed high school, in any of its modalities (regular serial, non-serial, EJA or supplementary), and the total number of people in this age group multiplied by 100.