Dealing with Randomness and Concept Drift in Large Datasets
Abstract
1. Introduction
1.1. Motivation
1.2. Underlying Problem, Aim and Objectives
- To address data randomness and concept drift in a real-life application;
- To apply the sample-measure-assess (SMA) algorithm for unsupervised and supervised model optimisation;
- To highlight pathways for educationists, data scientists and other researchers to follow in engaging policy makers, development stakeholders and the general public in putting generated data to use; and
- To motivate a unified, interdisciplinary understanding of data-driven decisions.
1.3. Gap Challenges
2. Proposed Approach
2.1. Data Sources
2.2. Data Randomness and Concept Drift
2.3. Learning Rules from Data by Sampling, Measuring and Assessing
Algorithm 1. SMA: Sample, Measure, Assess
2.4. Experimental Setup
2.4.1. Data Visualisation
2.4.2. Unsupervised Modelling
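- Each loading vector has unit length, i.e., its squared norm equals 1;
- Each successive component maximises the variance, subject to the unit-length constraint; and
- The covariance between any two distinct components is zero.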
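The three conditions listed above correspond to the standard constraints of principal component analysis. The formulation below is a sketch in assumed notation, since the symbols used in the source are not reproduced in this text: $\mathbf{x}$ is the vector of variables, $\boldsymbol{\Sigma}$ its covariance matrix, $\mathbf{a}_k$ the $k$-th loading vector and $z_k = \mathbf{a}_k^{\top}\mathbf{x}$ the $k$-th component. Reading the "equals 1" condition as a unit-length constraint on each loading vector is an interpretation, not a quotation.

```latex
\begin{align}
  \mathbf{a}_k^{\top}\mathbf{a}_k &= 1, \qquad k = 1, \dots, p, \\
  \operatorname{Var}(z_k) &= \mathbf{a}_k^{\top}\boldsymbol{\Sigma}\,\mathbf{a}_k
    \quad \text{is maximised subject to the other two conditions}, \\
  \operatorname{Cov}(z_j, z_k) &= \mathbf{a}_j^{\top}\boldsymbol{\Sigma}\,\mathbf{a}_k = 0,
    \qquad j \neq k.
\end{align}
```

Under these constraints the loading vectors are the unit-length eigenvectors of $\boldsymbol{\Sigma}$, ordered by decreasing eigenvalue, and each eigenvalue equals the variance of the corresponding component.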
2.4.3. Supervised Modelling
3. Analyses, Results and Evaluation
3.1. Data Visualisation
3.2. Unsupervised Modelling
3.3. Supervised Modelling
Thresholding and Learning Rate
4. Contribution to Knowledge and Discussion
4.1. Contribution to Knowledge
4.2. Discussion
5. Concluding Remarks
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
ANN | Artificial Neural Networks |
BD | Big Data |
BDMSDG | Big Data Modelling of Sustainable Development Goals |
CAA | Commission for Academic Accreditation |
CSIR | Council for Scientific and Industrial Research |
DIRISA | Data Intensive Research Initiative of South Africa |
DSF | Development Science Framework |
DV | Data Visualisation |
EDA | Exploratory Data Analysis |
GPA | Grade Point Average |
MoE | Ministry of Education |
PCA | Principal Component Analysis |
PEDSC | Polar Environment Data Science Centre |
SDG | Sustainable Development Goals |
SILPA | Standards for Institutional Licensure and Program Accreditation |
SMA | Sample–Measure–Assess |
UAE | United Arab Emirates |
UNWDF | United Nations World Data Forum |
Code | Variable | Type | Description | Summaries |
---|---|---|---|---|
IST | Institution | String | University | One with two campuses |
GDR | Sex | Binary | Sex | Female (55%); Male (45%) |
NTA | Nationality | String | Home country | UAE (56%); Oman (28%) |
CPSTYP | Type | String | Start or cont/trans | Bach (74%); Dip (25.7%); Master’s (0.3%) |
LVL | Level | String | Diploma, first or post | 3 different levels |
SPC | Specialisation | String | Broad specialisation | 5 different specialisations |
MJR | Major | String | Specific field | 43 different major subjects |
INT | InternSector | String | Internship sector | 60 different sectors |
PCD | ProgramCredits | Numeric | Total credits to grad. | Q1 = 24; Med = 129; Mean = 102; Q3 = 129 |
RCP | RegCreditsPrev | Numeric | Reg. spring credits | Q1 = 12; Med = 15; Mean = 14; Q3 = 16 |
PVC | PrevCreditsComplete | Numeric | Comp. spring credits | Q1 = 12; Med = 15; Mean = 13.1; Q3 = 15 |
RGC | RegCredits | Numeric | Reg. curr. credits | Q1 = 9; Med = 15; Mean = 12.6; Q3 = 16 |
CMC | CumulativeCredits | Numeric | Cumulative credits | Q1 = 15; Med = 93; Mean = 76; Q3 = 108 |
CGP | CumulativeGPA | Numeric | Cumulative GPA | Q1 = 2.2; Med = 2.6; Mean = 2.7; Q3 = 3.1 |
QES | QualifyingExitScore | Percentage | Score from Q-Award | Q1 = 65; Med = 74; Mean = 68; Q3 = 82 |
BSG | BeforeSemGPA | Numeric | GPA before internship | Q1 = 2.2; Med = 2.8; Mean = 2.7; Q3 = 3.4 |
ISG | InSemGPA | Numeric | In-semester GPA | Q1 = 2.7; Med = 3.1; Mean = 3.1; Q3 = 3.6 |
ASG | AfterSemGPA | Numeric | GPA after internship | Q1 = 2.3; Med = 3.0; Mean = 2.8; Q3 = 3.5 |
CLS | Class | Binary | Tweaked GPA | ≥Mean (49%) and <Mean (51%) |
Population Error | Training Error | Cross Validation Error | Test Error |
---|---|---|---|
Actual population error | From random training | From random validation | From random testing |
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
---|---|---|---|---|---|---|---|---|---|
−0.566 | 0.015 | −0.146 | 0.032 | 0.142 | −0.004 | −0.008 | −0.140 | −0.144 | 0.772 |
0.182 | −0.049 | −0.662 | −0.067 | 0.157 | −0.012 | −0.064 | −0.207 | 0.667 | 0.069 |
0.248 | −0.044 | −0.610 | −0.053 | 0.258 | −0.005 | 0.018 | 0.231 | −0.662 | −0.060 |
−0.100 | −0.011 | −0.286 | −0.360 | −0.872 | −0.015 | 0.045 | −0.063 | −0.106 | 0.017 |
−0.535 | 0.020 | −0.172 | 0.123 | 0.114 | −0.037 | 0.026 | −0.547 | −0.146 | −0.577 |
0.165 | −0.015 | −0.140 | 0.906 | −0.332 | 0.038 | 0.067 | −0.074 | −0.043 | 0.098 |
−0.507 | 0.013 | −0.183 | 0.154 | −0.039 | 0.034 | −0.107 | 0.753 | 0.230 | −0.227 |
0.001 | −0.573 | 0.052 | 0.044 | −0.036 | −0.664 | −0.473 | −0.011 | −0.039 | 0.008 |
−0.063 | −0.582 | 0.013 | −0.024 | 0.049 | −0.075 | 0.799 | 0.072 | 0.071 | −0.006 |
−0.018 | −0.572 | 0.034 | −0.018 | −0.015 | 0.741 | −0.339 | −0.064 | −0.040 | −0.011 |
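A quick way to sanity-check a loadings table like the one above is to confirm that its columns are approximately orthonormal, which is what the unit-length and zero-covariance conditions imply for PCA loadings. The sketch below copies the printed values and computes the Gram matrix of the columns in plain Python. It assumes that rows are variables and columns are PC1 to PC10 (the table carries no row labels), and the names `LOADINGS` and `gram_matrix` are illustrative only. Because the values are rounded to three decimals, the result can only be close to the identity, not exact.

```python
# Loadings copied from the table above (assumed layout: rows = variables,
# columns = PC1..PC10). Values are rounded to three decimals in the source.
LOADINGS = [
    [-0.566, 0.015, -0.146, 0.032, 0.142, -0.004, -0.008, -0.140, -0.144, 0.772],
    [0.182, -0.049, -0.662, -0.067, 0.157, -0.012, -0.064, -0.207, 0.667, 0.069],
    [0.248, -0.044, -0.610, -0.053, 0.258, -0.005, 0.018, 0.231, -0.662, -0.060],
    [-0.100, -0.011, -0.286, -0.360, -0.872, -0.015, 0.045, -0.063, -0.106, 0.017],
    [-0.535, 0.020, -0.172, 0.123, 0.114, -0.037, 0.026, -0.547, -0.146, -0.577],
    [0.165, -0.015, -0.140, 0.906, -0.332, 0.038, 0.067, -0.074, -0.043, 0.098],
    [-0.507, 0.013, -0.183, 0.154, -0.039, 0.034, -0.107, 0.753, 0.230, -0.227],
    [0.001, -0.573, 0.052, 0.044, -0.036, -0.664, -0.473, -0.011, -0.039, 0.008],
    [-0.063, -0.582, 0.013, -0.024, 0.049, -0.075, 0.799, 0.072, 0.071, -0.006],
    [-0.018, -0.572, 0.034, -0.018, -0.015, 0.741, -0.339, -0.064, -0.040, -0.011],
]


def gram_matrix(rows):
    """Return A^T A as nested lists; near-identity means orthonormal columns."""
    n_cols = len(rows[0])
    return [
        [sum(row[i] * row[j] for row in rows) for j in range(n_cols)]
        for i in range(n_cols)
    ]


gram = gram_matrix(LOADINGS)
size = len(gram)

# Diagonal entries are the squared lengths of the loading vectors.
norms_sq = [gram[k][k] for k in range(size)]

# Off-diagonal entries are inner products between distinct loading vectors.
max_off_diagonal = max(
    abs(gram[i][j]) for i in range(size) for j in range(size) if i != j
)

print("squared column norms:", [round(v, 3) for v in norms_sq])
print("largest |off-diagonal| entry:", round(max_off_diagonal, 3))
```

Squared norms near 1 and off-diagonal entries near 0 indicate that the printed loadings are consistent with an orthonormal eigenvector basis, up to rounding.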
Model | Threshold | Sample | Sample | Difference |
---|---|---|---|---|
ANN-Bin-1 | 0.50 | 0.02926 | 0.02764 | −0.001618 |
ANN-Tri-1 | 0.50 | 0.28143 | 0.31159 | 0.030157 |
ANN-Bin-2 | 0.40 | 0.01979 | 0.03074 | 0.010950 |
ANN-Tri-2 | 0.40 | 0.27945 | 0.29552 | 0.016063 |
ANN-Bin-3 | 0.25 | 0.02228 | 0.02852 | 0.006242 |
ANN-Tri-3 | 0.25 | 0.28689 | 0.25738 | −0.029507 |
ANN-Bin-4 | 0.10 | 0.00913 | 0.01757 | 0.008437 |
ANN-Tri-4 | 0.10 | 0.28283 | 0.26737 | −0.015464 |
ANN-Bin-5 | 0.05 | 0.00434 | 0.00652 | 0.0021791 |
ANN-Tri-5 | 0.05 | 0.27599 | 0.29562 | 0.019636 |
ANN-Bin-6 | 0.01 | 0.00201 | 0.01801 | 0.016000 |
ANN-Tri-6 | 0.01 | 0.28493 | 0.26580 | −0.019129 |
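The final column of the table above appears to be the second sample value minus the first. The sketch below recomputes that gap from the printed figures and reports the model with the smallest absolute gap. It does not assume which sample is which (the column labels are not recoverable from this text), and the names `ROWS`, `sample_gap`, `max_discrepancy` and `smallest_gap` are illustrative. Model names use ASCII hyphens here.

```python
# (model, threshold, first sample column, second sample column, printed last column)
ROWS = [
    ("ANN-Bin-1", 0.50, 0.02926, 0.02764, -0.001618),
    ("ANN-Tri-1", 0.50, 0.28143, 0.31159, 0.030157),
    ("ANN-Bin-2", 0.40, 0.01979, 0.03074, 0.010950),
    ("ANN-Tri-2", 0.40, 0.27945, 0.29552, 0.016063),
    ("ANN-Bin-3", 0.25, 0.02228, 0.02852, 0.006242),
    ("ANN-Tri-3", 0.25, 0.28689, 0.25738, -0.029507),
    ("ANN-Bin-4", 0.10, 0.00913, 0.01757, 0.008437),
    ("ANN-Tri-4", 0.10, 0.28283, 0.26737, -0.015464),
    ("ANN-Bin-5", 0.05, 0.00434, 0.00652, 0.0021791),
    ("ANN-Tri-5", 0.05, 0.27599, 0.29562, 0.019636),
    ("ANN-Bin-6", 0.01, 0.00201, 0.01801, 0.016000),
    ("ANN-Tri-6", 0.01, 0.28493, 0.26580, -0.019129),
]


def sample_gap(first, second):
    """Second sample value minus the first, matching the table's last column."""
    return second - first


def max_discrepancy(rows):
    """Largest absolute difference between the recomputed and printed gaps."""
    return max(abs(sample_gap(a, b) - printed) for _, _, a, b, printed in rows)


def smallest_gap(rows):
    """Row whose two sample values are closest to each other."""
    return min(rows, key=lambda row: abs(sample_gap(row[2], row[3])))


for model, threshold, first, second, printed in ROWS:
    print(f"{model:10s} threshold={threshold:.2f} gap={sample_gap(first, second):+.6f}")

print("max discrepancy vs printed column:", max_discrepancy(ROWS))
print("smallest absolute gap:", smallest_gap(ROWS)[0])
```

The sample columns are printed to five decimals, so the recomputed gaps can only match the printed column to about that precision.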
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mwitondi, K.S.; Said, R.A. Dealing with Randomness and Concept Drift in Large Datasets. Data 2021, 6, 77. https://doi.org/10.3390/data6070077