Dealing with Randomness and Concept Drift in Large Datasets
Abstract
1. Introduction
1.1. Motivation
1.2. Underlying Problem, Aim and Objectives
 To address data randomness and concept drift in a real-life application;
 To apply the sample-measure-assess (SMA) algorithm for unsupervised and supervised model optimisation;
 To highlight pathways for educationists, data scientists and other researchers to follow in engaging policy makers, development stakeholders and the general public in putting generated data to use; and
 To motivate a unified and interdisciplinary understanding of data-driven decisions across disciplines.
1.3. Gap Challenges
2. Proposed Approach
2.1. Data Sources
2.2. Data Randomness and Concept Drift
2.3. Learning Rules from Data by Sampling, Measuring and Assessing
Algorithm 1 SMA—Sample, Measure, Assess 

2.4. Experimental Setup
2.4.1. Data Visualisation
2.4.2. Unsupervised Modelling
 Each weight vector has unit norm, $\| w_k \| = 1$;
 Each $\mathcal{PC}_k$ maximises the variance $V\{ w_k' \mathcal{I}_k \}$; and
 The covariance $\mathrm{COV}\{ w_k' \mathcal{I}_k,\; w_r' \mathcal{I}_r \} = 0,\ \forall k < r$.
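The three conditions above can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the array `X`, its dimensions and the random seed are illustrative and are not taken from the study. It obtains the weight vectors $w_k$ as eigenvectors of the sample covariance matrix and then verifies unit norm, ordered variances and zero covariance between components.

```python
import numpy as np

# Synthetic, illustrative data: 500 observations of 4 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)                  # centre each variable

# Eigen-decomposition of the sample covariance matrix.
S = np.cov(Xc, rowvar=False)
eigvals, W = np.linalg.eigh(S)           # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder so that PC1 has the largest variance
eigvals, W = eigvals[order], W[:, order]

scores = Xc @ W                          # column k holds w_k' x for every observation

norms = np.linalg.norm(W, axis=0)            # condition 1: each ||w_k|| should be 1
variances = scores.var(axis=0, ddof=1)       # condition 2: variances match the eigenvalues
cov_scores = np.cov(scores, rowvar=False)    # condition 3: off-diagonal entries should vanish
off_diag = cov_scores - np.diag(np.diag(cov_scores))

print("norms:", np.round(norms, 6))
print("variances:", np.round(variances, 4))
print("max |off-diagonal covariance|:", np.abs(off_diag).max())
```

Using the covariance matrix rather than the correlation matrix is a choice made here for brevity; standardising the variables first would give the correlation-based components instead.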
2.4.3. Supervised Modelling
3. Analyses, Results and Evaluation
3.1. Data Visualisation
3.2. Unsupervised Modelling
3.3. Supervised Modelling
Thresholding and Learning Rate
4. Contribution to Knowledge and Discussion
4.1. Contribution to Knowledge
4.2. Discussion
5. Concluding Remarks
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
ANN  Artificial Neural Networks 
BD  Big Data 
BDMSDG  Big Data Modelling of Sustainable Development Goals 
CAA  Commission for Academic Accreditation 
CSIR  Council for Scientific and Industrial Research 
DIRISA  Data Intensive Research Initiative of South Africa 
DSF  Development Science Framework 
DV  Data Visualisation 
EDA  Exploratory Data Analysis 
GPA  Grade Point Average 
MoE  Ministry of Education 
PCA  Principal Component Analysis 
PEDSC  Polar Environment Data Science Centre 
SDG  Sustainable Development Goals 
SILPA  Standards for Institutional Licensure and Program Accreditation 
SMA  Sample–Measure–Assess 
UAE  United Arab Emirates 
UNWDF  United Nations World Data Forum 
Code  Variable  Type  Description  Summaries 

IST  Institution  String  University  One with two campuses 
GDR  Sex  Binary  Sex  Female (55%); Male (45%) 
NTA  Nationality  String  Home country  UAE (56%); Oman (28%) 
CPSTYP  Type  String  Start or cont/trans  Bach (74%); Dip (25.7%); Master’s (0.3%) 
LVL  Level  String  Diploma, first or post  3 different levels 
SPC  Specialisation  String  Broad specialisation  5 different specialisations 
MJR  Major  String  Specific field  43 different major subjects 
INT  InternSector  String  Internship sector  60 different sectors 
PCD  ProgramCredits  Numeric  Total credits to grad.  Q1 = 24; Med = 129; Mean = 102; Q3 = 129 
RCP  RegCreditsPrev  Numeric  Reg. Spring credits  Q1 = 12; Med = 15; Mean = 14; Q3 = 16 
PVC  PrevCreditsComplete  Numeric  Comp. spring credits  Q1 = 12; Med = 15; Mean = 13.1; Q3 = 15 
RGC  RegCredits  Numeric  Reg. Curr. credits  Q1 = 9; Med = 15; Mean = 12.6; Q3 = 16 
CMC  CumulativeCredits  Numeric  Cumulative credits  Q1 = 15; Med = 93; Mean = 76; Q3 = 108 
CGP  CumulativeGPA  Numeric  Cumulative GPA  Q1 = 2.2; Med = 2.6; Mean = 2.7; Q3 = 3.1 
QES  QualifyingExitScore  Percentage  Score from QAward  Q1 = 65; Med = 74; Mean = 68; Q3 = 82 
BSG  BeforeSemGPA  Numeric  GPA Before internship  Q1 = 2.2; Med = 2.8; Mean = 2.7; Q3 = 3.4 
ISG  InSemGPA  Numeric  In-semester GPA  Q1 = 2.7; Med = 3.1; Mean = 3.1; Q3 = 3.6 
ASG  AfterSemGPA  Numeric  GPA After internship  Q1 = 2.3; Med = 3.0; Mean = 2.8; Q3 = 3.5 
CLS  Class  Binary  Tweaked GPA  ≥Mean (49%) and <Mean (51%) 
Population Error  Training Error  Cross Validation Error  Test Error 

${\psi}_{D,POP}$  ${\psi}_{D,TRN}$  ${\psi}_{D,XVD}$  ${\psi}_{D,TEST}$ 
Actual population error  From random training  From random validation  From random testing 
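The distinction between errors obtained from random training, validation and test samples can be illustrated with repeated random splits. The sketch below is not the authors' SMA algorithm; it assumes NumPy, synthetic one-feature data, a simple midpoint-threshold classifier and a 60/20/20 split, all of which are illustrative choices. The names `psi_trn`, `psi_xvd` and `psi_tst` merely mirror the table's notation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic two-class data (illustrative only): one feature, shifted class means.
n = 3000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=1.5 * y, scale=1.0, size=n)

def fit_threshold(x_fit, y_fit):
    """Return the midpoint between the two class means as a decision threshold."""
    return 0.5 * (x_fit[y_fit == 0].mean() + x_fit[y_fit == 1].mean())

def error_rate(threshold, x_eval, y_eval):
    """Return the proportion of misclassified cases."""
    return float(np.mean((x_eval > threshold).astype(int) != y_eval))

results = []
for _ in range(200):                         # repeated random sampling
    idx = rng.permutation(n)
    trn, xvd, tst = idx[:1800], idx[1800:2400], idx[2400:]
    t = fit_threshold(x[trn], y[trn])        # fit on the training sample only
    results.append((error_rate(t, x[trn], y[trn]),
                    error_rate(t, x[xvd], y[xvd]),
                    error_rate(t, x[tst], y[tst])))

# Average error and its spread across the random samples.
psi_trn, psi_xvd, psi_tst = np.mean(results, axis=0)
sd_trn, sd_xvd, sd_tst = np.std(results, axis=0)

print(f"training   error: {psi_trn:.4f} (sd {sd_trn:.4f})")
print(f"validation error: {psi_xvd:.4f} (sd {sd_xvd:.4f})")
print(f"test       error: {psi_tst:.4f} (sd {sd_tst:.4f})")
```

The spread across repetitions gives a rough sense of how much of an observed error difference may be attributable to sampling randomness alone.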
PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9  PC10 

−0.566  0.015  −0.146  0.032  0.142  −0.004  −0.008  −0.140  −0.144  0.772 
0.182  −0.049  −0.662  −0.067  0.157  −0.012  −0.064  −0.207  0.667  0.069 
0.248  −0.044  −0.610  −0.053  0.258  −0.005  0.018  0.231  −0.662  −0.060 
−0.100  −0.011  −0.286  −0.360  −0.872  −0.015  0.045  −0.063  −0.106  0.017 
−0.535  0.020  −0.172  0.123  0.114  −0.037  0.026  −0.547  −0.146  −0.577 
0.165  −0.015  −0.140  0.906  −0.332  0.038  0.067  −0.074  −0.043  0.098 
−0.507  0.013  −0.183  0.154  −0.039  0.034  −0.107  0.753  0.230  −0.227 
0.001  −0.573  0.052  0.044  −0.036  −0.664  −0.473  −0.011  −0.039  0.008 
−0.063  −0.582  0.013  −0.024  0.049  −0.075  0.799  0.072  0.071  −0.006 
−0.018  −0.572  0.034  −0.018  −0.015  0.741  −0.339  −0.064  −0.040  −0.011 
Model ($\widehat{\mathcal{L}}_{tr,ts}$)  Threshold  $\psi_{D,TRN}$  $\psi_{D,TEST}$  $\mathbb{E}[\Delta]$  Sample $[x_{\nu,\tau}]$  Sample $[x_{\overline{\nu},\tau}]$ 

ANN-Bin-1  0.50  0.02926  0.02764  −0.001618  ${S}_{tr}=2529$  ${S}_{ts}=615$ 
ANN-Tri-1  0.50  0.28143  0.31159  0.030157  ${S}_{tr}=3006$  ${S}_{ts}=138$ 
ANN-Bin-2  0.40  0.01979  0.03074  0.010950  ${S}_{tr}=2526$  ${S}_{ts}=618$ 
ANN-Tri-2  0.40  0.27945  0.29552  0.016063  ${S}_{tr}=2809$  ${S}_{ts}=335$ 
ANN-Bin-3  0.25  0.02228  0.02852  0.006242  ${S}_{tr}=2513$  ${S}_{ts}=631$ 
ANN-Tri-3  0.25  0.28689  0.25738  −0.029507  ${S}_{tr}=2670$  ${S}_{ts}=474$ 
ANN-Bin-4  0.10  0.00913  0.01757  0.008437  ${S}_{tr}=2518$  ${S}_{ts}=626$ 
ANN-Tri-4  0.10  0.28283  0.26737  −0.015464  ${S}_{tr}=2482$  ${S}_{ts}=662$ 
ANN-Bin-5  0.05  0.00434  0.00652  0.0021791  ${S}_{tr}=2531$  ${S}_{ts}=613$ 
ANN-Tri-5  0.05  0.27599  0.29562  0.019636  ${S}_{tr}=2366$  ${S}_{ts}=778$ 
ANN-Bin-6  0.01  0.00201  0.01801  0.016000  ${S}_{tr}=2478$  ${S}_{ts}=666$ 
ANN-Tri-6  0.01  0.28493  0.26580  −0.019129  ${S}_{tr}=2211$  ${S}_{ts}=933$ 
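The tabulated values can be cross-checked with a few lines of arithmetic. The sketch below transcribes the rows of the table above and treats $\mathbb{E}[\Delta]$ as the test error minus the training error; that reading is an inference from the numbers rather than a definition taken from the text. It also sums the two sample sizes per row and, as one illustrative criterion only, picks the model with the smallest absolute gap within each family.

```python
# Rows transcribed from the table above:
# (model, threshold, training error, test error, tabulated E[Delta], S_tr, S_ts)
rows = [
    ("ANN-Bin-1", 0.50, 0.02926, 0.02764, -0.001618, 2529, 615),
    ("ANN-Tri-1", 0.50, 0.28143, 0.31159, 0.030157, 3006, 138),
    ("ANN-Bin-2", 0.40, 0.01979, 0.03074, 0.010950, 2526, 618),
    ("ANN-Tri-2", 0.40, 0.27945, 0.29552, 0.016063, 2809, 335),
    ("ANN-Bin-3", 0.25, 0.02228, 0.02852, 0.006242, 2513, 631),
    ("ANN-Tri-3", 0.25, 0.28689, 0.25738, -0.029507, 2670, 474),
    ("ANN-Bin-4", 0.10, 0.00913, 0.01757, 0.008437, 2518, 626),
    ("ANN-Tri-4", 0.10, 0.28283, 0.26737, -0.015464, 2482, 662),
    ("ANN-Bin-5", 0.05, 0.00434, 0.00652, 0.0021791, 2531, 613),
    ("ANN-Tri-5", 0.05, 0.27599, 0.29562, 0.019636, 2366, 778),
    ("ANN-Bin-6", 0.01, 0.00201, 0.01801, 0.016000, 2478, 666),
    ("ANN-Tri-6", 0.01, 0.28493, 0.26580, -0.019129, 2211, 933),
]

checked = []
for model, thr, psi_trn, psi_tst, tab_delta, s_tr, s_ts in rows:
    delta = psi_tst - psi_trn                # assumed reading: test minus training error
    checked.append((model, thr, delta, abs(delta - tab_delta), s_tr + s_ts))

# Largest disagreement between the recomputed and the tabulated gap.
max_dev = max(dev for _, _, _, dev, _ in checked)
# Distinct totals of training plus test sample sizes.
totals = {total for *_, total in checked}

# Illustrative criterion (an assumption): smallest absolute gap per model family.
smallest_gap = {}
for model, thr, delta, _, _ in checked:
    fam = model.split("-")[1]                # "Bin" or "Tri"
    if fam not in smallest_gap or abs(delta) < abs(smallest_gap[fam][2]):
        smallest_gap[fam] = (model, thr, delta)

print("max deviation from tabulated gap:", max_dev)
print("training + test sample totals:", totals)
for fam, (model, thr, delta) in smallest_gap.items():
    print(f"{fam}: {model} at threshold {thr:.2f}, gap {delta:+.5f}")
```

A small gap alone does not imply a good model; it would normally be read alongside the test error itself.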
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mwitondi, K.S.; Said, R.A. Dealing with Randomness and Concept Drift in Large Datasets. Data 2021, 6, 77. https://doi.org/10.3390/data6070077