# Local Processing of Massive Databases with R: A National Analysis of a Brazilian Social Programme


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Computational Environment

#### 3.1. Apache Hadoop

- a permissive free software license;
- scalability, allowing execution in cluster environments with hundreds of servers;
- fault tolerance, ensuring the availability of data and execution of tasks even in the event of failures.

#### 3.2. Apache Spark

- Transformations: return a new RDD, such as map, filter and coalesce;
- Actions: return a new value, such as reduce, collect and count.
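
The difference between the two kinds of operation can be illustrated with a plain-Python analogy. This is a minimal sketch and not the Spark API: `LazyDataset`, `square` and `calls` are hypothetical names introduced here, and only `map`, `filter`, `collect`, `count` and `reduce` mirror operation names from the list above. Transformations only describe a new dataset, while actions run the pipeline and return an ordinary value.

```python
from functools import reduce as fold


class LazyDataset:
    """Plain-Python stand-in for an RDD (hypothetical, not the Spark API)."""

    def __init__(self, source):
        # `source` is a zero-argument callable that yields the elements,
        # so every action can re-run the whole pipeline from the start.
        self._source = source

    # Transformations: return a new dataset and compute nothing yet.
    def map(self, fn):
        return LazyDataset(lambda: (fn(x) for x in self._source()))

    def filter(self, predicate):
        return LazyDataset(lambda: (x for x in self._source() if predicate(x)))

    # Actions: execute the pipeline and return a plain value.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())

    def reduce(self, fn):
        return fold(fn, self._source())


calls = []  # records when `square` is actually executed


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(lambda: iter(range(1, 6)))
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)

calls_before_action = list(calls)  # still empty: nothing has run so far
result = even_squares.collect()    # the action triggers the computation
```

In this sketch `square` is only invoked once `collect` is called, which is the behaviour the transformation/action split is meant to convey.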

#### 3.3. R and Spark with the sparklyr Package

- Spark connection
- Data analysis
- Spark disconnect

## 4. Performing Machine Learning with Random Forest

#### 4.1. Selection of Variables

#### 4.2. Pseudocode

- Let N be the total number of observations in the database and B a large number of repetitions. Draw B bootstrap samples, each consisting of N observations sampled at random with replacement;
- Let M be the total number of covariates in the database. For each bootstrap sample, select at random and without replacement a subset of $m<M$ covariates. The value of m is the same for every sample;
- Train a decision tree (DT) on each sample. Each tree is grown to its maximum size, with no pruning;
- Obtain the forecast of each tree;
- The final forecast is obtained as the mean (quantitative variables) or the mode (qualitative variables) of the individual forecasts.
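
The sampling and aggregation steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' R implementation: the function names, the fixed seed and the toy sizes are assumptions, and the tree-training step is deliberately left out.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap_sample(n_obs, rng):
    """Draw n_obs row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_obs) for _ in range(n_obs)]


def covariate_subset(n_covariates, m, rng):
    """Pick m < n_covariates column indices without replacement."""
    if not 0 < m < n_covariates:
        raise ValueError("m must satisfy 0 < m < M")
    return sorted(rng.sample(range(n_covariates), m))


def aggregate(tree_forecasts, quantitative):
    """Combine per-tree forecasts: mean if quantitative, mode otherwise."""
    if quantitative:
        return mean(tree_forecasts)
    return Counter(tree_forecasts).most_common(1)[0][0]


rng = random.Random(42)   # fixed seed, assumed here for reproducibility
N, M, m, B = 8, 5, 2, 3   # toy sizes, not values from the study

# One (rows, columns) plan per tree: the data each tree would be trained on.
plans = [(bootstrap_sample(N, rng), covariate_subset(M, m, rng))
         for _ in range(B)]
```

Each entry of `plans` lists the rows and covariates a single tree would see; `aggregate` would then be applied to the forecasts those trees return for a new observation.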

- Ordering
  - Calculate the importance of the variables;
  - Discard the least important variables, since the most important ones have the greatest impact;
  - Sort the remaining variables in decreasing order of importance and plot them together with the corresponding standard deviation. The minimum value predicted by a CART model fitted to this curve is used as the importance cutoff, so that only the K variables exceeding it are kept.
- Selection
  - Build nested RF models that include the first k variables, starting with the model that contains only the most important variable, and compute the out-of-bag (OOB) error rate of each;
  - Select the variables of the model that leads to the smallest OOB error.
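
The selection step can be sketched as a search over nested prefixes of the ranked variable list. This is a hedged Python sketch: `select_by_oob`, the generic variable names and the toy error values are hypothetical, and `oob_error` stands for whatever routine trains a forest on a subset and reports its OOB error.

```python
def select_by_oob(ranked_variables, oob_error):
    """Return the prefix of ranked_variables with the lowest OOB error.

    ranked_variables: names sorted by decreasing importance.
    oob_error: user-supplied callable (hypothetical) that takes a list of
               variable names and returns the OOB error rate of a forest
               trained on exactly those variables.
    """
    best_subset, best_error = None, float("inf")
    for k in range(1, len(ranked_variables) + 1):
        subset = ranked_variables[:k]      # nested model with the first k
        error = oob_error(subset)
        if error < best_error:             # strict '<' keeps the smaller model on ties
            best_subset, best_error = subset, error
    return best_subset, best_error


# Toy stand-in for real forest training: made-up error rates indexed by k.
toy_errors = {1: 0.30, 2: 0.22, 3: 0.18, 4: 0.18, 5: 0.21}
ranked = ["x1", "x2", "x3", "x4", "x5"]
chosen, error = select_by_oob(ranked, lambda subset: toy_errors[len(subset)])
```

With these made-up error rates the search keeps the first three variables, because adding a fourth does not lower the OOB error any further.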

#### 4.3. Validation and Evaluation Measures

- Accuracy (ACC): the proportion of correct predictions among all observations, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives. The best model is the one with the highest accuracy. It is defined as:
  $$ACC=\frac{TP+TN}{TP+TN+FP+FN}.$$
- F1 score (F1): the harmonic mean of two other metrics, Precision ($P=TP/(TP+FP)$) and Recall ($R=TP/(TP+FN)$). The best model is the one with the highest F1 value. It is defined as:
  $$F1=\frac{2 \cdot P \cdot R}{P+R}.$$
- Matthews correlation coefficient (MCC): measures the correlation between predicted and observed classes. The best model is the one with the largest MCC. Compared with the F1 score and accuracy, the MCC gives more reliable estimates, since the other two measures can produce overoptimistic results, especially on imbalanced datasets [41]. The coefficient is defined by:
  $$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$$
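
The three measures can be computed directly from the confusion-matrix counts. The Python sketch below is illustrative only: the function names are chosen here, the counts are made up to mimic an imbalanced dataset, and returning 0 for an undefined MCC is an assumed convention.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all observations."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (no zero-division guard)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Made-up imbalanced case: 95 actual positives and 5 actual negatives,
# with a model that labels almost every observation as positive.
tp, tn, fp, fn = 94, 1, 4, 1
acc_value = accuracy(tp, tn, fp, fn)
f1_value = f1_score(tp, fp, fn)
mcc_value = mcc(tp, tn, fp, fn)
```

For these counts the accuracy is 0.95 and the F1 score is about 0.97, while the MCC is only about 0.29, which illustrates why the MCC is the more cautious measure on imbalanced data.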

## 5. Results and Data Analysis

#### 5.1. Data Description

#### 5.2. BFP Analysis for the Period from 2013 to 2019

#### 5.3. Summary of Results

## 6. Final Considerations

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A

Acronym | Name | Description |
---|---|---|
COD_UF | Federation Unit Code | Code used by IBGE to identify the state |
CPR | Percentage of employed persons aged 18 or over who are self-employed | Ratio between the number of self-employed workers aged 18 and over and the total number of employed persons in this age group multiplied by 100 |
E_ANOSESTUDO | Expectation of years of study at 18 years of age | Average number of years of schooling that a generation of children entering school is expected to complete by the age of 18, if current patterns persist throughout their school life |
ESPVIDA | Life expectancy at birth | Average number of years that people are expected to live from birth, if the level and pattern of age-related mortality prevalent in the year of the Census remain constant throughout life |
FECTOT | Total fertility rate | Average number of children a woman is expected to have by the end of her reproductive period (15 to 49 years of age) |
GINI | Gini Index | It measures the degree of inequality that exists in the distribution of individuals according to per capita household income. Its value varies from 0, when there is no inequality (the per capita household income of all individuals has the same value), to 1, when the inequality is maximum (only one individual holds all the income) |
HOMEMTOT | Resident male population | Total male population |
IDHM | Municipal Human Development Index | Municipal Human Development Index. Geometric mean of the indices of the dimensions Income, Education and Longevity, with equal weights |
IDHM_E | Municipal Human Development Index: Education Dimension | Synthetic index of the Education dimension, which is one of the 3 components of the MHDI. It is obtained through the geometric mean of the sub-index of school attendance of children and young people, with a weight of 2/3, and of the sub-index of education of the adult population, with a weight of 1/3 |
IDHM_L | Municipal Human Development Index: Longevity Dimension | Longevity dimension index, which is one of the 3 components of the MHDI. It is obtained from the Life expectancy at birth indicator, using the formula: [(observed value of the indicator) − (minimum value)]/[(maximum value) − (minimum value)], where the minimum and maximum values are 25 and 85 years, respectively |
IDHM_R | Municipal Human Development Index: Income Dimension | Income dimension index, which is one of the 3 components of the MHDI. It is obtained from the Per capita income indicator, using the formula: [ln (observed value of the indicator) − ln (minimum value)]/[ln (maximum value) − ln (minimum value)], where the minimum and maximum values are R$8.00 and R$4033.00 (as of August 2010) |
MORT1 | Mortality up to one year of age | Number of children who are not expected to survive the first year of life in every 1000 children born alive |
MORT5 | Mortality up to five years of age | Probability of dying between birth and the exact age of 5, per 1000 children born alive |
MULHERTOT | Resident female population | Total female population |
P_FORMAL | Degree of formalization of the work of employed persons | Ratio between the number of persons aged 18 and over formally employed and the total number of employed persons in this age group multiplied by 100 |
P_FUND | Percentage of employed persons with complete primary education | Ratio between the number of employed persons aged 18 and over who have already completed elementary school (regular serial, regular non-serial, EJA or supplementary) and the total number of employed persons in this age group multiplied by 100 |
P_MED | Percentage of employed persons with complete high school | Ratio between the number of employed persons aged 18 or over who have already completed high school (regular serial, regular non-serial, EJA or supplementary) and the total number of employed persons in this age group multiplied by 100 |
PEA | Economically active population 10 years of age and over | Economically active population. Corresponds to the number of people in this age group who, in the reference week of the Census, were employed in the labor market or who, being unemployed, had sought work in the month prior to the date of the survey |
PEA1014 | Economically active population 10 to 14 years of age | Economically active population. Corresponds to the number of people in this age group who, in the reference week of the Census, were employed in the labor market or who, being unemployed, had sought work in the month prior to the date of the survey |
PEA1517 | Economically active population between 15 and 17 years of age | Economically active population. Corresponds to the number of people in this age group who, in the reference week of the Census, were employed in the labor market or who, being unemployed, had sought work in the month prior to the date of the survey |
PEA18M | Economically active population aged 18 or over | |
PESO1 | Population up to 1 year of age | Population residing in this age group |
PESO1114 | Population 11 to 14 years of age | Population residing in this age group |
PESO15 | Population 15 years of age and over | Population residing in this age group |
PESO1517 | Population 15 to 17 years of age | Population residing in this age group |
PESOM1014 | Women aged 10 to 14 | Resident female population in this age group |
PESOM15M | Women aged 15 and over | Resident female population in this age group |
PIND | Proportion of extremely poor | Proportion of individuals with per capita household income equal to or less than R$70.00 per month, in reais on 1 August 2010 |
PINDCRI | Proportion of extremely poor children | Proportion of persons up to 14 years of age who have a per capita household income equal to or less than R$70.00 per month, in reais on 1 August 2010 |
PMPOB | Proportion of poor | Proportion of individuals with per capita household income equal to or less than R$140.00 per month, in reais on 1 August 2010 |
PMPOBCRI | Proportion of poor children | Proportion of persons up to 14 years of age who have a per capita household income equal to or less than R$140.00 per month, in reais on 1 August 2010 |
PPOB | Proportion of vulnerable to poverty | Proportion of individuals with per capita household income equal to or less than R$255.00 per month, in reais on 1 August 2010, equivalent to 1/2 minimum wage on that date |
PPOBCRI | Proportion of children vulnerable to poverty | Proportion of individuals up to 14 years of age who have a per capita household income equal to or less than R$255.00 per month, in reais in August 2010, equivalent to 1/2 minimum wage on that date |
RAZDEP | Dependency ratio | Percentage of the population under the age of 15 and the population aged 65 and over in relation to the population aged 15 to 64 |
RDPC | Average per capita income | Ratio between the sum of the income of all residents in permanent private households and the total number of these individuals. Values in reais on 1 August 2010 |
REGIAO | Region according to IBGE | Region according to IBGE |
REN1 | % of employed persons aged 18 or over with an income of up to 1 minimum wage | Ratio between the number of persons aged 18 and over employed and with monthly income from all jobs less than 1 minimum wage in July 2010 and the total number of employed persons in this age group multiplied by 100 |
REN2 | % of employed persons aged 18 or over with an income of up to 2 minimum wages | Ratio between the number of persons aged 18 and over employed and with monthly income from all jobs less than 2 minimum wages in July 2010 and the total number of employed persons in this age group multiplied by 100 |
REN3 | % of employed persons aged 18 or over with an income of up to 3 minimum wages | Ratio between the number of persons aged 18 and over employed and with monthly income from all jobs below 3 minimum wages in July 2010 and the total number of employed persons in this age group multiplied by 100 |
RENOCUP | Average income of employed persons aged 18 or over | Average income from all jobs of employed persons aged 18 or over. Amounts in reais on 1 August 2010 |
RIND | Average per capita household income of the extremely poor | Average per capita household income of people with per capita household income of R$70.00 or less, at August 2010 prices |
RMPOB | Average per capita household income of the poor | Average per capita household income of people with per capita household income equal to or less than R$140.00 per month, at August 2010 prices |
RPOB | Average per capita household income of people vulnerable to poverty | Average per capita household income of people with per capita household income equal to or less than R$255.00 per month, at August 2010 prices |
SOBRE40 | Probability of survival up to 40 years | Probability that a newborn child will live up to 40 years of age, if the level and pattern of age-related mortality prevalent in the year of the Census remain constant throughout life |
SOBRE60 | Probability of survival up to 60 years | Probability that a newborn child will live up to 60 years of age, if the level and pattern of age-related mortality prevalent in the year of the Census remain constant throughout life |
T_AGUA | Percentage of population living in households with running water | Ratio between the population living in permanent private households with water piped to one or more rooms and the total population living in permanent private households multiplied by 100. The water can come from the general network, from a well, from a spring or from a reservoir supplied by rainwater or water tanker |
T_ANALF_15M | Illiteracy rate of the population aged 15 or over | Ratio between the population aged 15 and over who cannot read or write a simple note and the total number of people in this age group multiplied by 100 |
T_ANALF11A14 | Illiteracy rate of the population between 11 and 14 years of age | Ratio between the population aged 11 to 14 years old who cannot read or write a simple note and the total number of people in this age group multiplied by 100 |
T_ANALF15A17 | Illiteracy rate of the population between 15 and 17 years of age | Ratio between the population aged 15 to 17 years old who cannot read or write a simple note and the total number of people in this age group multiplied by 100 |
T_ANALF18A24 | Illiteracy rate of the population between 18 and 24 years of age | Ratio between the population aged 18 to 24 years old who cannot read or write a simple note and the total number of people in this age group multiplied by 100 |
T_ANALF18M | Illiteracy rate of the population aged 18 or over | Ratio between the population aged 18 and over who cannot read or write a simple note and the total number of people in this age group multiplied by 100 |
T_ATIV1014 | Activity rate: 10 to 14 years | Ratio between persons aged 10 to 14 years who were economically active, that is, who were employed or unemployed in the reference week of the Census, and the total number of people in this age group multiplied by 100. A person is considered unemployed if, not being employed in the reference week, they had sought work in the month prior to the survey |
T_ATIV1517 | Activity rate: 15 to 17 years | Ratio between persons aged 15 to 17 years who were economically active, that is, who were employed or unemployed in the reference week of the Census, and the total number of people in this age group multiplied by 100. A person is considered unemployed if, not being employed in the reference week, they had sought work in the month prior to the survey |
T_ATRASO_0_BASICO | Percentage of the population from 6 to 17 years old attending basic education that does not have an age-grade delay | Ratio between the number of people from 6 to 17 years old attending regular basic education (elementary + secondary) without age-grade delay and the total number of people in that age group attending this level of education multiplied by 100 |
T_BANAGUA | Percentage of population living in households with bathroom and running water | Ratio between the population living in permanent private households with running water in at least one of their rooms and with an exclusive bathroom and the total population living in permanent private households multiplied by 100. The water may come from the general network, from a well, from a spring or from a reservoir supplied by rainwater or water tanker. An exclusive bathroom is defined as a room with a shower or bath and a sanitary device |
T_CRIFUNDIN_TODOS | % of children living in households where none of the residents have completed elementary school | Ratio between the number of children up to 14 years old living in households where none of the residents have completed elementary school and the total population in this age group living in permanent private households multiplied by 100 |
T_DENS | Percentage of population living in households with density greater than 2 people per bedroom | Ratio between the population living in permanent private households with a density greater than 2 and the total population living in permanent private households multiplied by 100. The density of the household is given by the ratio between the total number of household residents and the total number of rooms used as bedrooms |
T_ENV | Aging rate | Ratio between the population aged 65 and over and the total population multiplied by 100 |
T_FBBAS | Gross attendance rate for basic education | Ratio between the total number of people of any age attending basic education (elementary or high school, regular or serial) and the population aged 6 to 17 years multiplied by 100 |
T_FBFUND | Gross attendance rate for primary education | Ratio between the total number of people of any age attending regular elementary school and the population aged 6 to 14 years multiplied by 100 |
T_FBMED | Gross high school attendance rate | Ratio between the total number of people of any age attending regular high school and the population aged 15 to 17 years multiplied by 100 |
T_FBPRE | Gross pre-school attendance rate | Ratio between the total number of children up to 5 years old (only 5 years old in 1991) attending pre-school and the population in that same age group multiplied by 100 |
T_FBSUPER | Gross higher education attendance rate | Ratio between the total number of people of any age attending higher education (undergraduate, specialization, master’s or doctorate) and the population aged 18 to 24 years multiplied by 100 |
T_FORA4A5 | % of children aged 4 to 5 who do not attend school | Ratio between the number of children aged 4 to 5 years who do not attend school and the total number of children in this age group multiplied by 100 |
T_FORA6A14 | % of children aged 6 to 14 who do not attend school | Ratio between the number of children aged 6 to 14 who do not attend school and the total number of children in this age group multiplied by 100 |
T_FREQ0A3 | School attendance rate of the population from 0 to 3 years old | Ratio between the population aged 0 to 3 years attending school, at any level or grade, and the total population in this age group multiplied by 100 |
T_FREQ11A14 | School attendance rate of the population from 11 to 14 years of age | Ratio between the population aged 11 to 14 years old who were attending school, at any level or grade, and the total population in this age group multiplied by 100 |
T_FREQ4A5 | School attendance rate of the population from 4 to 5 years old | Ratio between the population aged 4 to 5 years old who were attending school, at any level or grade, and the total population in this age group multiplied by 100 |
T_FREQ5a6 | Percentage of the population aged 5 to 6 years attending school | Ratio between the population aged 5 to 6 years old who were attending school, at any level or grade, and the total population in this age group multiplied by 100 |
T_FREQ6A14 | School attendance rate of the population from 6 to 14 years of age | Ratio between the population aged 6 to 14 years old who were attending school, at any level or grade, and the total population in this age group multiplied by 100 |
T_FUND15A17 | Percentage of the population aged 15 to 17 with complete primary education | Ratio between the population aged 15 to 17 years who completed elementary school, in any of its modalities (regular serial, non-serial, EJA or supplementary) and the total number of people in this age group multiplied by 100 |
T_FUND18M | Percentage of the population aged 18 or over with complete primary education | Ratio between the population aged 18 or over who completed elementary school, in any of its modalities (regular serial, non-serial, EJA or supplementary) and the total number of people in this age group multiplied by 100 |
T_FUNDIN_TODOS | % of people living in households where no resident has completed elementary school | Ratio between people living in households where none of the residents have completed elementary school and the total population living in permanent private households multiplied by 100 |
T_FUNDIN_TODOS_MMEIO | % of people in households vulnerable to poverty and in which no one has complete basic education | Percentage of people living in households vulnerable to poverty (with per capita income less than 1/2 the minimum wage in August 2010) and in which no one has completed elementary school |
T_FUNDIN18MINF | % of persons aged 18 or over with no complete elementary education and informally employed | Ratio between people aged 18 or over with no complete elementary education and in informal occupation and the total population in this age group multiplied by 100. Informal occupation implies that they work but are not: employees with a formal contract, military personnel in the army, navy, aeronautics, military police or fire brigade, employed by the legal regime of civil servants or employers and self-employed with contribution to an official social security institute |
T_LUZ | Percentage of population living in households with electricity | Ratio between the population living in permanent private households with electric lighting and the total population living in permanent private households multiplied by 100. Lighting is considered to be from a general network, with or without a meter |
T_M10A14CF | Percentage of women aged 10 to 14 years who had children | Ratio between women 10 to 14 years of age who had children and the total number of women in this age group multiplied by 100 |
T_M15A17CF | Percentage of women aged 15 to 17 years who had children | Ratio between women aged 15 to 17 who had children and the total number of women in this age group multiplied by 100 |
T_MED18M | Percentage of the population aged 18 or over with completed high school | Ratio between the population aged 18 or over who completed high school, in any of its modalities (regular serial, non-serial, EJA or supplementary) and the total number of people in this age group multiplied by 100 |
T_MULCHEFEFIF014 | Percentage of female heads of household without complete elementary school and with at least one child under 15 years of age | Ratio between the number of women who are responsible for the household, do not have complete elementary school and have at least 1 child under the age of 15 living in the household and the total number of female heads of household multiplied by 100 |
T_NESTUDA_NTRAB_MMEIO | % of people aged 15 to 24 who do not study or work and are vulnerable to poverty | Ratio between people aged 15 to 24 who do not study or work and are vulnerable to poverty and the total population in this age group multiplied by 100. People living in households with per capita income below 1/2 the minimum wage in August 2010 are defined as vulnerable to poverty |
T_RMAXIDOSO | % of people in households vulnerable to poverty and dependent on the elderly | Ratio between people living in households vulnerable to poverty (with per capita income less than 1/2 the minimum wage in August 2010) and where the main source of income comes from residents aged 65 and over and the total resident population in permanent private households multiplied by 100 |
T_SLUZ | % of people in households without electricity | Ratio between people living in households without electricity and the total population living in permanent private households multiplied by 100 |
T_SUPER25M | Percentage of the population aged 25 or over with a college degree | Ratio between the population aged 25 or over who has completed at least a college degree and the total number of people in this age group multiplied by 100 |
TIPOLOG_RUR_URB | Typology of the municipality according to IBGE | Typology of the municipality according to IBGE |
TRABCC | % of employees aged 18 or over with a formal contract | Ratio between the number of employees aged 18 and over with a formal contract and the total number of persons employed in this age group multiplied by 100 |
TRABPUB | Percentage of employed persons aged 18 or over who are public sector workers | Percentage of employed persons aged 18 or over who are public sector workers |
TRABSC | % of employees aged 18 or over without a formal contract | Ratio between the number of employees aged 18 and over without a formal contract and the total number of persons employed in this age group multiplied by 100 |
UF | Federation Unit Code | Code used by IBGE to identify the state |
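
The index formulas quoted in the IDHM, IDHM_L and IDHM_R rows can be written out as a short Python sketch. The function names are chosen here for illustration; the bounds (25 and 85 years, R$8.00 and R$4033.00) and the equal-weight geometric mean are taken from the descriptions in the table, and the education index is treated as a given input.

```python
import math


def longevity_index(life_expectancy, minimum=25.0, maximum=85.0):
    """IDHM_L: min-max normalisation of life expectancy at birth (years)."""
    return (life_expectancy - minimum) / (maximum - minimum)


def income_index(per_capita_income, minimum=8.0, maximum=4033.0):
    """IDHM_R: min-max normalisation of log per capita income (R$, Aug 2010)."""
    return ((math.log(per_capita_income) - math.log(minimum))
            / (math.log(maximum) - math.log(minimum)))


def idhm(income, education, longevity):
    """IDHM: geometric mean of the three dimension indices, equal weights."""
    return (income * education * longevity) ** (1 / 3)
```

For example, a municipality with a life expectancy of 55 years sits exactly halfway between the bounds, so `longevity_index(55.0)` evaluates to 0.5.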

## References

1. Bhandarkar, M. MapReduce programming with apache Hadoop. In Proceedings of the 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), Atlanta, Georgia, USA, 19–23 April 2010; IEEE Computer Society: Piscataway, NJ, USA, 2010; p. 1.
2. Sagiroglu, S.; Sinanc, D. Big data: A review. In Proceedings of the 2013 International Conference on Collaboration Technologies and Systems (CTS), San Diego, CA, USA, 20–24 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 42–47.
3. Caixa. 2019. Available online: http://www.caixa.gov.br/programas-sociais/bolsa-familia/Paginas/default.aspx (accessed on 20 April 2019).
4. Ministry of Citizenship. Special Secretariat for Social Development. 2019. Available online: http://mds.gov.br/assuntos/bolsa-familia/o-que-e/como-funciona/como-funciona (accessed on 29 April 2019).
5. Hummon, N.P.; Fararo, T.J. Actors and networks as objects. Soc. Netw. **1995**, 17, 1–26.
6. Expósito, O.Á. Guide to Spark Machine Learning for Credit Scoring. Bachelor’s Thesis, Universitat Politècnica de Catalunya, Barcelona, Spain, 2018.
7. Hadoop. Apache Software Foundation. Apache Hadoop. 2019. Available online: http://hadoop.apache.org/ (accessed on 20 May 2019).
8. Spark. Apache Software Foundation. Apache Spark. 2019. Available online: http://spark.apache.org/ (accessed on 20 May 2019).
9. Zaharia, M.; Chambers, B. Spark: The Definitive Guide; O’Reilly: Sebastopol, CA, USA, 2018; Available online: https://learning.oreilly.com/library/view/spark-the-definitive/9781491912201/ (accessed on 29 June 2019).
10. Luraschi, J.E.A.; Kuo, K.; Ushey, K.; Allaire, J.; Macedo, S.; Falaki, H.; Wang, L.; Zhang, A.; Li, Y.; The Apache Software Foundation. Package ‘sparklyr’. Available online: https://cran.r-project.org/web/packages/sparklyr/index.html (accessed on 15 April 2019).
11. Bluhm, B.; Cutura, J. Econometrics at Scale: Spark Up Big Data in Economics; Technical Report; SAFE Working Paper No. 266; Leibniz Institute for Financial Research SAFE: Frankfurt, Germany, 2020.
12. Safhi, H.M.; Frikh, B.; Ouhbi, B. Energy load forecasting in big data context. In Proceedings of the 2020 5th International Conference on Renewable Energies for Developing Countries (REDEC), Marrakech, Morocco, 24–26 March 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
13. Hales, D.; Patarin, S. Computational sociology for systems “in the wild”: The case of BitTorrent. IEEE Distrib. Syst. Online **2005**, 6.
14. Victorino, M.; de Holanda, M.T.; Ishikawa, E.; Oliveira, E.C.; Chhetri, S. Transforming Open Data to Linked Open Data Using Ontologies for Information Organization in Big Data Environments of the Brazilian Government: The Brazilian Database Government Open Linked Data–DBgoldbr. Knowl. Organ. **2018**, 45, 443–466.
15. Schwartzman, S. Education-Oriented Social Programs in Brazil: The Impact of Bolsa Escola. In Paper Submitted to the Global Conference on Education Research in Developing Countries (Research for Results on Education), Global Development Network, Prague, 31 March–2 April 2005; Instituto de Estudos do Trabalho e Sociedade: Rio de Janeiro, Brazil, 2005.
16. Ferro, A.R.; Kassouf, A.L.; Levison, D. The impact of conditional cash transfer programs on household work decisions in Brazil. In Child Labor and the Transition between School and Work; Emerald Group Publishing Limited: Bingley, UK, 2010.
17. Magalhães, L.A.; Fonseca, M.F.; Custodio, D.D.O.; Martinho, P.; Daltio, J.; de Carvalho, C.; Castro, G. Gathering spatial data on social vulnerability in Brazil. Embrapa Territorial-Artigo em anais de congresso (ALICE). In Proceedings of the International Conference on Agro Big Data and Decision Support Systems, Montevideo, Uruguay, 27–29 September 2017; pp. 183–185.
18. Shoro, A.G.; Soomro, T.R. Big data analysis: Apache spark perspective. Glob. J. Comput. Sci. Technol. **2015**, 15, No 1-C.
19. Mavridis, I.; Karatza, H. Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark. J. Syst. Softw. **2017**, 125, 133–151.
20. Alsheikh, M.A.; Niyato, D.; Lin, S.; Tan, H.P.; Han, Z. Mobile big data analytics using deep learning and apache spark. IEEE Netw. **2016**, 30, 22–29.
21. Alexopoulos, A.; Drakopoulos, G.; Kanavos, A.; Mylonas, P.; Vonitsanos, G. Two-Step Classification with SVD Preprocessing of Distributed Massive Datasets in Apache Spark. Algorithms **2020**, 13, 71.
22. Gopalani, S.; Arora, R. Comparing apache spark and map reduce with performance analysis using k-means. Int. J. Comput. Appl. **2015**, 113, 8–11.
23. Yu, J.; Zhang, Z.; Sarwat, M. Spatial data management in apache spark: The geospark perspective and beyond. Geoinformatica **2019**, 23, 37–78.
24. Azevedo, A.R.; Ara, A.; Noguti, M.Y.; de Brito, A.C. Application in Shiny: Intersection between gender, class, and race at ENEM 2016. In Proceedings of the III International Statistics Seminar with R, Niterói, RJ, Brazil, 22–24 May 2018.
25. Brent, E.E., Jr. Computational sociology: Reinventing sociology for the next millennium. Soc. Sci. Comput. Rev. **1993**, 11, 487–499.
26. Hummon, N.P.; Fararo, T.J. The emergence of computational sociology. J. Math. Sociol. **1995**, 20, 79–87.
27. Salgado, M.; Gilbert, N. Emergence and communication in computational sociology. J. Theory Soc. Behav. **2013**, 43, 87–110.
28. White, T. Hadoop: The Definitive Guide; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2015.
29. Grover, M.; Malaska, T.; Seidman, J.; Shapira, G. Hadoop Application Architectures: Designing Real-World Big Data Applications; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2015.
30. Pattamsetti, R.M.R. Distributed Computing in Java 9; Packt Publishing Ltd.: Birmingham, UK, 2017.
31. Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache spark: A unified engine for big data processing. Commun. ACM **2016**, 59, 56–65.
32. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2019.
33. Luraschi, J.; Kuo, K.; Ruiz, E. Mastering Spark with R: The Complete Guide to Large-Scale Analysis and Modeling; O’Reilly Media: Sebastopol, CA, USA, 2019.
34. Breiman, L. Random forests. Mach. Learn. **2001**, 45, 5–32.
35. Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004; p. 114.
36. Breiman, L. Bias, Variance, and Arcing Classifiers; Technical Report; Tech. Rep. 460; Statistics Department, University of California, Berkeley: Berkeley, CA, USA, 1996.
37. Dmitrievsky, M. Random Decision Forest in Reinforcement Learning; MetaQuotes Language 5 (MQL5); 2018. Available online: https://www.mql5.com/en/articles/widget/3856 (accessed on 23 September 2019).
38. Genuer, R.; Poggi, J.M.; Tuleau-Malot, C. Variable selection using random forests. Pattern Recognit. Lett. **2010**, 31, 2225–2236.
39. Refaeilzadeh, P.; Tang, L.; Liu, H. Cross-Validation. Encycl. Database Syst. (EDBS) **2009**, 5, 532–538.
40. Tantithamthavorn, C.; McIntosh, S.; Hassan, A.E.; Matsumoto, K. An empirical comparison of model validation techniques for defect prediction models. IEEE Trans. Softw. Eng. **2016**, 43, 1–18.
41. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. **2020**, 21, 6.
42. Allaire, J.J. Rstudio. 2009. Available online: https://www.rstudio.com/about/ (accessed on 10 May 2019).
43. Portal. Transparency Portal; General Controller of the Union 2019. Available online: http://www.portaltransparencia.gov.br/download-de-dados/bolsa-familia-pagamentos/ (accessed on 28 April 2019).
44. IBGE. Brazilian Institute of Geography and Statistics; IBGE 2019. Available online: http://www.ibge.gov.br (accessed on 15 May 2019).
45. Atlas. The IDHM. Atlas of Human Development in Brazil 2019. Available online: http://www.atlasbrasil.org.br/2013/pt/o_atlas/metodologia/idhm_renda/ (accessed on 5 June 2019).
46. Larsen, J.; Goutte, C. On optimal data split for generalization estimation and model selection. In Proceedings of the Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop (Cat. No.98TH8468), Madison, WI, USA, 25 August 1999; pp. 225–234.

**Figure 1.** General structure of a random forest model. Source: adapted from [37] and prepared by the authors.

**Figure 3.** Distribution of the average BFP utilization rate by city. Source: prepared by the authors.

**Figure 4.** Map of the average city utilization rates of the BFP in 2019, for 2 classes. Source: prepared by the authors.

**Figure 6.** Importance of variables with cutoff point via CART regression modeling. Source: prepared by the authors.

**Figure 8.** Distribution of selected variables by binary utilization rate class in Brazil. Source: prepared by the authors.

Commands | Description |
---|---|
install.packages("sparklyr") | Install the sparklyr package from CRAN |
library("sparklyr") | Load the package |
spark_install() | Install Spark |
sc <- spark_connect(master = "local") | Create a local connection between R and Spark |
spark_connection_is_open(sc) | Check whether the connection is open |
spark_read_csv(sc, name, path) | Read a CSV (comma-separated values) file into Spark |
dataset %>% select(columns) | Select columns |
src_tbls(sc) | List the datasets that are in Spark |
glimpse(dataset) | Check the dataset structure |
spark_disconnect(sc) | Disconnect from Spark |

Predicted value | Real value: Yes | Real value: No |
---|---|---|
Yes | TP | FP |
No | FN | TN |
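
The article reports accuracy, F1 score and MCC, all of which are functions of the four cells of the confusion matrix above. The following is a minimal sketch in base R of how these three measures can be computed from TP, FP, FN and TN using their standard definitions. The function name `classification_metrics` and the counts are hypothetical, chosen only for illustration; they are not the authors' code or results.

```r
# Hypothetical cell counts for illustration only (not taken from the study).
tp <- 90; fp <- 10; fn <- 5; tn <- 95

# Hypothetical helper: accuracy, F1 score and Matthews correlation
# coefficient (MCC) from the four cells of a binary confusion matrix.
classification_metrics <- function(tp, fp, fn, tn) {
  # Work in double precision so the MCC denominator cannot overflow
  # R's integer range when counts are large.
  tp <- as.numeric(tp); fp <- as.numeric(fp)
  fn <- as.numeric(fn); tn <- as.numeric(tn)

  accuracy  <- (tp + tn) / (tp + fp + fn + tn)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  mcc       <- (tp * tn - fp * fn) /
    sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

  c(accuracy = accuracy, f1 = f1, mcc = mcc)
}

metrics <- classification_metrics(tp, fp, fn, tn)
print(round(metrics, 4))
```

This sketch does not guard against empty rows or columns in the matrix; in that case precision, recall or the MCC denominator would be zero and the result would be `NaN`.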

Variable | Description |
---|---|
UF | State (Federative Unit) |
Code SIAFI | City code in SIAFI |
Name SIAFI | City name in SIAFI |
NIS | Social identification number |
Value | Amount received from the BFP |

**Table 4.** RMSE obtained by the Random Forest algorithm to estimate the rate of use ($y_i$), evaluated over the test data set.

Metric | Mean | Median | SD |
---|---|---|---|
RMSE | 0.0175 | 0.0175 | 0.0003 |

Model | Accuracy (Mean) | Accuracy (SD) | F1 Score (Mean) | F1 Score (SD) | MCC (Mean) | MCC (SD) |
---|---|---|---|---|---|---|
2 classes | 0.9534 | 0.0039 | 0.9529 | 0.0039 | 0.9060 | 0.0078 |

Ranking | Variable | Description | Variable importance |
---|---|---|---|
1 | PPOB | Proportion of people with per capita household income equal to or less than R$ 255.00 per month | 30.77 |
2 | UF | Federative Unit (state) | 24.72 |
3 | RDPC | Average per capita income | 20.99 |
4 | PPOBCRI | Proportion of children vulnerable to poverty | 20.61 |
5 | IDHM_R | Municipal Human Development Index (Income Dimension) | 18.32 |
6 | PMPOB | Proportion of people with per capita household income equal to or less than R$ 140.00 per month | 16.30 |
7 | TRABSC | % of employed persons aged 18 or over working without a formal contract | 13.73 |
8 | T_DENS | Ratio between the total number of residents in the household and the total number of rooms used as bedrooms | 13.73 |
9 | T_ANALF15M | Illiteracy rate of the population aged 15 or over | 13.53 |
10 | T_ANALF18M | Illiteracy rate of the population aged 18 or over | 13.08 |
11 | REGIAO | Country region | 12.71 |
12 | T_FUNDIN | % of people in households vulnerable to poverty in which no one has completed basic education | 10.78 |
13 | T_RMAXIDOSO | % of people in households vulnerable to poverty and dependent on the elderly | 10.76 |
14 | IDHM | Municipal Human Development Index | 9.93 |
15 | T_NESTUDA | Proportion of young people aged 15 to 24 who neither study nor work | 9.01 |
16 | PIND | Proportion of individuals with per capita household income equal to or less than R$ 70.00 per month | 8.58 |
17 | PMPOBCRI | Proportion of individuals up to 14 years of age who have per capita household income equal to or less than R$ 70.00 per month | 8.47 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Paz, H.; Maia, M.; Moraes, F.; Lustosa, R.; Costa, L.; Macêdo, S.; Barreto, M.E.; Ara, A. Local Processing of Massive Databases with R: A National Analysis of a Brazilian Social Programme. *Stats* **2020**, *3*, 444-464.
https://doi.org/10.3390/stats3040028
