Improving Education Predictions Through Reasoning by Analogy and Causal Relationships Applied to Smart Exploitation of Data
Abstract
1. Introduction
1.1. Literature Review
1.2. Context
- Lack of quality data (unknown, incomplete, or inaccurate information);
- Unpredictable changes in relevant causes (unexpected events, political decisions, technological advances, or external factors);
- The presence of multiple alternative possibilities.
1.3. General Objective and Contribution
- Perform smart data analysis: Dashboards for data selection, transformation, and integration will be developed to extract insights from the data. This will provide empirical knowledge of how the event to be predicted has behaved in the past and will uncover new non-trivial knowledge about the event.
- Apply reasoning by analogy: An analogy-based reasoning approach will be developed to relate past and present events with similar characteristics and shared contexts. This method will facilitate the identification of analogies, and the most similar past event will serve as a basis for predicting the new event.
- Integration of causal relationships: Causal relationships extracted from smart data analysis and expert knowledge will be incorporated into the prediction process.
- Validate the results: The results of the proposed method will be validated with the Pearson chi-square test of independence and the mean absolute error (MAE).
1.4. Structure of This Paper
2. Materials and Methods
- Materials: This defines the model’s elements. The two main objects are the most similar previous event and the event to be predicted, together with the similarity between them and their cause–effect relationships.
- Methods: This specifies the phases and tasks to standardize and systematize the process by employing reasoning by analogy, smart data analysis, and the definition of cause–effect rules.
- Validation: To validate the results of the proposed model, the Pearson chi-square test of independence is used. This is a statistical procedure used to determine whether there is a significant relationship between two variables, specifically, the most similar previous event and the event to be predicted.
2.1. Materials
- Number 1. Causes of the most similar previous event (C).
- Number 2. Cause-effect relationship of the most similar previous event (R).
- Number 3. Effects of the most similar previous event (E).
- Number 4. Similarity between the causes of the most similar previous event and the event to be predicted (S).
- Number 5. Similarity between the effects of the most similar previous event and the effects of the event to be predicted (S′).
- Number 6. Causes of the event to be predicted (C′).
- Number 7. Cause-effect relationship of the event to be predicted (R′).
- Number 8. Effects of the event to be predicted (E′).
- If some causes (C) of the most similar previous event (Ev) produce specific effects (E), and;
- There is a certain degree of similarity (S) between the common causes (C) of the most similar previous event (Ev) and those of the event to be predicted (Ev′);
- Then, there will also be a certain degree of similarity between the effects (E) of the most similar previous event (Ev) and the effects (E′) of the event to be predicted (Ev′).
- Common effects (ce1, ce2, ce3, …, cen), inherited from the most similar previous event;
- Specific effects (se1, se2, se3, …, sen), unique to the event to be predicted.
- Identifying Relevant Causes and Effects: Detect patterns in large datasets, distinguishing between common and specific causes while reducing noise.
- Expanding Expert Knowledge: Recognize complex, non-trivial patterns.
- Evaluating Causal Relationships: Assess whether causal relationships are valid or spurious by controlling for confounding variables.
- Discovering Causal Rules: Automate knowledge generalization into IF…THEN rules.
- Predicting Future Effects: Analyze historical patterns to extrapolate how identified causes will affect future events.
- Most Similar Previous Event (Ev): The root represents the event, intermediate levels represent common causes (C), and the leaves represent common effects (E).
- Event to Be Predicted (Ev′): The root represents the event, intermediate levels include common causes (C) from the most similar previous event and specific causes (C′) of the event to be predicted, while the leaves represent specific effects (E′).
- EffectsEv′ (E′/C) is the EffectsEv (E/C) of the most similar previous event, adjusted proportionally, through the similarity function S′, based on the result of the similarity function S (Formula (5)): $\text{Effects}_{Ev'}(E'/C) = f\big(\text{Effects}_{Ev}(E/C),\ S'\big)$
- EffectsEv′ (E′/C′): The effects of the specific causes (C′) of the new event (Ev′) are derived from expert knowledge extracted through the intelligent utilization of large datasets using dashboards (Formula (6)): $\text{Effects}_{Ev'}(E'/C') = f(\text{intelligent analysis of big data})$
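Formulas (5) and (6) leave the function f unspecified. The sketch below is one possible reading, not the authors' definition: it assumes a similarity score between 0 and 1 that scales the past effect proportionally, and additive expert adjustments for the specific causes. The name `predict_effect`, its parameters, and the numeric values are all hypothetical.

```python
def predict_effect(baseline_effect, similarity, specific_adjustments=()):
    """Combine the effect of the most similar previous event with expert adjustments.

    baseline_effect: effect E of the most similar previous event (e.g. % promoted).
    similarity: hypothetical similarity score S' in [0, 1] (1 = identical causes).
    specific_adjustments: hypothetical additive corrections, in the same units as
        the effect, one per specific cause C' of the event to be predicted.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    # Assumed form of Formula (5): scale the past effect by the similarity.
    common_part = baseline_effect * similarity
    # Assumed form of Formula (6): expert-derived corrections are simply added.
    specific_part = sum(specific_adjustments)
    return common_part + specific_part


# Illustrative values only (not taken from the paper's tables).
prediction = predict_effect(65.0, 0.98, specific_adjustments=[-0.5, 0.3])
print(round(prediction, 2))
```

Keeping the two contributions as separate terms mirrors the split between common causes (C) and specific causes (C′); any other monotone adjustment could replace the proportional scaling.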
2.2. Methods
- Smart data analysis
- Definition of Data Sources:
  ∘ Data Definition: Identify the data required to develop the dashboard. Describe the obtained data, the meaning of attributes, and their format. Descriptive statistics techniques can be applied to explore the data.
  ∘ Define Data Sources: Determine the origin of each dataset.
  ∘ Establish Dimensional and Fact Entities: Dimensions categorize and describe facts, while facts are numerical or quantitative measures that represent key metrics for analysis.
- Extraction, Transformation, and Loading (ETL): This process cleans and transforms the data into a suitable format for analysis and storage in a centralized repository (data lake):
  ∘ Extraction: Gather data from multiple sources, such as databases, flat files, online applications, and event logs. All relevant data for analysis is collected.
  ∘ Transformation: Perform operations to clean, structure, and prepare the data for analysis.
  ∘ Loading: Load the transformed data into a centralized repository (e.g., data warehouse or analytical database).
- Metrics and KPIs:
  ∘ Metrics are quantitative values that measure a specific aspect of an event. They are quantitative (always expressed in numerical terms), specific (related to a particular aspect of the event) and comparable (allow for analysis across different periods or contexts).
  ∘ Key Performance Indicators (KPIs) are strategically selected metrics used to assess whether an organization is meeting its key objectives. They are more specific and directly linked to the event’s strategic goals. They are relevant (aligned with the event’s strategic goals) and contextualized (their value has meaning within a context).
- BI Model Definition: The model is constructed by relating entities based on the characteristics stored in dimensional and fact entities:
  ∘ Identify Key Attributes: Abstract and select attributes to serve as primary and foreign keys, establishing connections between facts and dimensions.
  ∘ Define a Star Schema: Create relationships between entities. In the star schema, fact entities act as the central core connecting the dimensions that contextualize the data.
- Interface Development: Information is grouped into sections, defining what data will be displayed in each dashboard section.
- Testing: Execute the model and verify that the metrics and indicators displayed on the dashboard match those calculated from the data sources.
- Tuning the Dashboard: Developing the dashboard for knowledge extraction is an iterative process. Necessary adjustments are made to ETL processes and the user interface until non-trivial expert knowledge that meets the dashboard’s objectives is uncovered.
- Knowledge Extraction: Interpret the obtained results, focusing on how various dashboard selections translate into knowledge and uncover hidden information. Identify the causes and variables relevant to the event to be predicted.
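The ETL and star-schema steps above can be illustrated with a minimal, standard-library sketch. The CSV layouts, column names, and values are hypothetical and stand in for whatever sources a real dashboard would use; incomplete rows are dropped here, which is only one of several possible cleaning policies.

```python
import csv
import io

# Hypothetical raw extracts; column names and values are illustrative only.
RAW_FACTS = """state_id,school_year,enrolled,teachers,expenditure_eur
1,2020-2021, 1000 ,17,850000
2,2020-2021,500,,400000
"""
RAW_STATES = """state_id,state_name
1,State A
2,State B
"""


def extract(text):
    """Extraction: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transformation: trim whitespace, cast numbers, drop incomplete rows."""
    clean = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if any(value == "" for value in row.values()):
            continue  # incomplete record: excluded rather than guessed
        clean.append({
            "state_id": int(row["state_id"]),
            "school_year": row["school_year"],
            "enrolled": int(row["enrolled"]),
            "teachers": int(row["teachers"]),
            "expenditure_eur": float(row["expenditure_eur"]),
        })
    return clean


def load(facts, states):
    """Loading: join the fact rows to the state dimension through its key."""
    dimension = {int(s["state_id"]): s["state_name"] for s in states}
    return [dict(fact, state_name=dimension[fact["state_id"]]) for fact in facts]


warehouse = load(transform(extract(RAW_FACTS)), extract(RAW_STATES))
print(warehouse)
```

The `state_id` column plays the role of the foreign key that links the fact entity to its dimension, which is the relationship a star schema formalizes.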
- Reasoning by Analogy
- Selection of Events for the Previous Event Database: Select representative previous events based on their relevance and similarity to the event to be predicted.
- Identification of Common Causes: Identify common causes through the smart data analysis described earlier or by consulting domain experts to determine influential causes.
- Representation of Causes and Effects in Previous Events: Represent previous events using vectors that include relevant causes and their corresponding effects.
- Representation of the Event to Be Predicted (Ev′): Similar to previous events, represent the event to be predicted as a collection of common causes (C).
- Quantification of Common Causes: Quantify the common causes of previous events (Ev) and the event to be predicted (Ev′) based on knowledge derived from smart data analysis.
- Local Distance Calculation: Use a local distance function to measure similarity between individual causes of previous events and the event to be predicted, identifying the most similar previous event.
- Global Distance Calculation: Calculate the global distance to assess the overall similarity of the event to be predicted (Ev′) relative to the entire database of previous events.
- Selection of the Most Similar Previous Event: Evaluate the local distance function to compare each common cause of previous events with the corresponding cause in the event to be predicted. The previous event with the smallest distance is identified as the most similar previous event.
- Explain Effects Using Common Causes: If the common causes (C) of the most similar previous event (Ev) are similar to the common causes (C′) of the event to be predicted (Ev′), it can be inferred that their effects (E and E′) will also be similar.
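The distance-based selection described above can be sketched as follows. The text does not give the distance functions, so this example assumes a range-normalized absolute difference as the local distance and the mean of local distances as the global distance; the function names, cause names, and numbers are hypothetical.

```python
def local_distance(a, b, value_range):
    """Assumed local distance: absolute difference scaled to [0, 1]."""
    return abs(a - b) / value_range if value_range else 0.0


def global_distance(event, target, ranges):
    """Assumed global distance: mean of the local distances over all causes."""
    distances = [local_distance(event[c], target[c], ranges[c]) for c in target]
    return sum(distances) / len(distances)


def most_similar(previous_events, target):
    """Return (name, distance) of the previous event closest to the target."""
    ranges = {}
    for cause in target:
        values = [e[cause] for e in previous_events.values()] + [target[cause]]
        ranges[cause] = max(values) - min(values)
    scored = {
        name: global_distance(event, target, ranges)
        for name, event in previous_events.items()
    }
    best = min(scored, key=scored.get)
    return best, scored[best]


# Hypothetical quantified causes, for illustration only.
previous = {
    "A": {"expenditure": 700.0, "teacher_ratio": 1.50, "repeat_ratio": 1.2},
    "B": {"expenditure": 850.0, "teacher_ratio": 1.70, "repeat_ratio": 0.5},
}
target = {"expenditure": 840.0, "teacher_ratio": 1.71, "repeat_ratio": 0.6}
name, distance = most_similar(previous, target)
print(name, round(distance, 3))
```

Normalizing each cause by its observed range keeps a cause measured in euros from dominating causes measured in percentages.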
- Prediction or Generalization of Knowledge
- Identify Relevant Variables: Determine independent variables (causes), dependent variables (effects), and confounding variables.
- Representation of DAGs: Create a directed acyclic graph (DAG) comprising causes, effects, and confounding variables. In a DAG, causes are nodes, and causal relationships are arrows pointing from causes to effects.
- Variable Control: Include confounding variables in the model to eliminate their biasing effects. Confounding variables often create “backdoor effects”. Steps to mitigate this include:
  ∘ Identifying variables through smart data analysis.
  ∘ Including them as control variables in the model.
  ∘ Evaluating whether the influence of causes on effects changes significantly.
- Approximate Calculation of Causal Impact: Understand how independent variables influence dependent variables using techniques such as regression discontinuity, matching, or a difference-in-differences (DiD) design.
- Prediction or Knowledge Generalization:
  ∘ Prediction: Use the most similar previous event and its effects as a baseline to model the influence of specific causes on the new event’s effects.
  ∘ Definition of Rules: Define cause–effect rules as conditional statements linking antecedents (“IF”) to consequences (“THEN”).
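As an illustration of the rule-definition step, cause–effect rules can be held as data: each rule pairs an antecedent predicate (the IF part) with a consequent (the THEN part). The `Rule` class, the `fire` helper, the example rule, and the thresholds below are hypothetical and only show the shape such rules might take.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    antecedent: Callable[[Dict[str, float]], bool]  # the IF part
    consequent: str                                  # the THEN part


def fire(rules: List[Rule], facts: Dict[str, float]) -> List[str]:
    """Return the consequents of every rule whose antecedent holds."""
    return [rule.consequent for rule in rules if rule.antecedent(facts)]


# Hypothetical rule and thresholds, for illustration only.
rules = [
    Rule(
        name="high_spend_high_teachers",
        antecedent=lambda f: f["expenditure"] > f["avg_expenditure"]
        and f["teacher_ratio"] > f["avg_teacher_ratio"],
        consequent="promotion rate expected above average",
    ),
]
facts = {
    "expenditure": 900.0,
    "avg_expenditure": 800.0,
    "teacher_ratio": 2.0,
    "avg_teacher_ratio": 1.8,
}
print(fire(rules, facts))
```

Storing rules as data rather than as nested conditionals makes it straightforward to add, remove, or audit individual rules as expert knowledge evolves.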
2.3. Validation
- Null Hypothesis (H0): Assumes no significant differences between observed and expected values; both follow a similar pattern.
- Alternative Hypothesis (H1): Opposes the null hypothesis, indicating significant differences between the two variables. The expected data do not align with the observed data.
- Degrees of Freedom: Depend on the number of rows (r) and columns (k) of the contingency table and are used to look up the critical value in Chi2 tables. They are calculated as (Formula (8)): $df = (r - 1) \times (k - 1)$
- Significance Level (α): The probability of rejecting the null hypothesis when it is actually true. It determines the critical value Chi2t read from the tables.
- Decision Criterion:
  ∘ Reject H0 when Chi2 ≥ Chi2t with (r − 1) × (k − 1) degrees of freedom. If the calculated Chi2 value is greater than or equal to the critical value, the null hypothesis is rejected: the observed and predicted values differ significantly.
  ∘ Fail to reject H0 when Chi2 < Chi2t with (r − 1) × (k − 1) degrees of freedom. If the calculated Chi2 value is less than the critical value, the null hypothesis cannot be rejected: there is no evidence of a significant difference between the predicted and observed values.
- The larger the Chi2 value, the less likely the initial hypothesis is true.
- The closer Chi2 is to zero, the more aligned the observed and predicted distributions are.
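The decision criterion can be sketched in a few lines. The observed and expected values below are hypothetical; the critical value is passed in explicitly because it comes from a Chi2 table (5.991 is the tabulated value for α = 0.05 with 2 degrees of freedom).

```python
def chi_square(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def reject_h0(statistic, critical_value):
    """Decision criterion: reject H0 when the statistic reaches the critical value."""
    return statistic >= critical_value


# Illustrative values only (not the paper's data).
observed = [62.0, 58.0, 70.0]
expected = [60.0, 60.0, 68.0]
statistic = chi_square(observed, expected)
# 5.991 is the tabulated critical value for alpha = 0.05 and 2 degrees of freedom.
print(statistic, reject_h0(statistic, 5.991))
```

A statistic far below the critical value, as here, corresponds to the case in which the null hypothesis cannot be rejected.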
3. Prediction of the Percentage of Students Who Are Promoted to the Next Grade with All Subjects Passed in the Four Years of Middle School for the 2021–2022 School Year in Spain
- Junemann, M.A.P. et al. [10]: Implemented neural networks to predict the academic performance of 15-year-old students in reading, mathematics, and science based on familial, social, and economic factors.
- Wang, T. et al. [11]: Used neural networks to calculate the number of errors a student might make while solving a problem. This prediction was based on the problem’s specific attributes and the student’s skills. The method was applied to optimize problem selection in a final assessment process.
- Cripps, A. [12]: Focused on university students, examining demographic characteristics, such as age, gender, and race, along with college entrance test results. Neural networks were used to predict a student’s ability to complete a course and their final grade.
- Buenaño-Fernandez, D. et al. [13]: Applied machine learning techniques to predict final grades of computer engineering students in Ecuador. This prediction was based on the students’ performance history across 68 courses in the program, using decision trees.
- Moscoso-Zea, O. et al. [14]: Analyzed student data to predict graduation rates based on characteristics of enrolled students. The prediction enabled early corrective measures to improve the admission process.
- Sheel, S. J. et al. [15]: Compared the use of neural networks with traditional statistical models to classify students into two groups based on the results of a single math-level test.
- Kalles, D. et al. [16] and Kotsiantis, S. et al. [17]: Used data from distance education to predict success or failure in final exams through various techniques, including neural networks. These datasets included demographic information, individual assignment grades, and virtual class attendance levels.
- Logarithmic Regression: The function equation is $y = 6.209\,\ln(x) + 50.812$;
- Second-Degree Polynomial Regression: The function equation is $y = 0.0632\,x^{2} - 253.49\,x + 254.35$;
- Power Regression: The function equation is $y = 8 \cdot 10^{-136}\,x^{41.42}$.
- Smart data analysis
- Data Sources
- Data Extraction, Transformation, and Loading (ETL)
- Promoted type: All subjects passed;
- States: Spain’s 17 states and two autonomous cities, 19 in total;
- Grade: First, second, third and fourth grades of middle school.
- Metrics and KPIs
- Expenditure (€) per enrolled student in middle school, segmented by school year and state (Formula (11)): Total expenditure (€) / Number of students enrolled in middle school
- Percentage of teachers in middle school relative to the number of enrolled students in middle school, segmented by school year and state (Formula (12)): Total number of teachers × 100 / Number of students enrolled in middle school
- Percentage of repeating students in middle school relative to the number of enrolled students in middle school, segmented by school year and state (Formula (13)): Total number of repeating students × 100 / Number of students enrolled in middle school
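Formulas (11) to (13) translate directly into code. The function names and the input figures below are hypothetical; only the arithmetic follows the formulas.

```python
def expenditure_per_student(total_expenditure_eur, enrolled):
    """Formula (11): total expenditure divided by enrolled students."""
    return total_expenditure_eur / enrolled


def teacher_percentage(teachers, enrolled):
    """Formula (12): teachers per 100 enrolled students."""
    return teachers * 100 / enrolled


def repeating_percentage(repeating, enrolled):
    """Formula (13): repeating students per 100 enrolled students."""
    return repeating * 100 / enrolled


# Illustrative inputs for one state and school year (not real figures).
enrolled = 20_000
print(expenditure_per_student(17_000_000, enrolled))
print(teacher_percentage(340, enrolled))
print(repeating_percentage(100, enrolled))
```

All three KPIs share the same denominator, the number of enrolled students, so they can be computed per school year and state from a single fact row.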
- BI Model Definition
- Teacher–student ratio;
- The average expenditure per student;
- Ratio of repeating students.
- Interface Development
- Explanation of the graphics in Figure 6:
- Graphic 1. Teacher–student ratio by school year (red).
- Graphic 2. Teacher–student ratio by state. Compares, by state, the teacher–student ratio (red) with the promoted student ratio (blue).
- Graphic 3. Expenditure per enrolled student by school year (yellow).
- Graphic 4. Expenditure per enrolled student by state. Compares, by state, the expenditure per enrolled student (yellow) with the promoted student ratio (blue).
- Graphic 5. Repeating students per enrolled student by school year (green).
- Graphic 6. Repeating students per enrolled student by state. Compares, by state, the repeating students per enrolled student (green) with the promoted student ratio (blue).
- Knowledge Extraction
- Below-average expenditure per student: Generally, the percentage of students who are promoted to the next grade is below average in the Balearic Islands, Valencia, Canary Islands, Castilla-La Mancha, Murcia, and Andalusia.
- Above-average expenditure per student: Typically, the percentage of students promoted to the next grade is above average in the Basque Country, Navarre, Asturias, Galicia, Castilla y León, La Rioja, Cantabria, and Aragón.
- States with a below-average expenditure per student but an above-average student promotion rate are Madrid and Catalonia.
- States with an above-average expenditure per student but a below-average student promotion rate are Ceuta, Extremadura, and Melilla.
- Below-average teacher–student ratio: The percentage of students who are promoted to the next grade is below average in Ceuta, Valencia, Andalusia, Castilla-La Mancha, Melilla, Canary Islands, Extremadura, and Murcia.
- Above-average teacher–student ratio: The percentage of students who are promoted to the next grade is above average in Cantabria, Asturias, Basque Country, Castilla y León, Galicia, Aragón, and Navarre.
- A below-average teacher–student ratio and an above-average student promotion rate are found in La Rioja, Madrid, and Catalonia.
- An above-average teacher–student ratio and a below-average student promotion rate are found in Extremadura and the Balearic Islands.
- Below-average repeating student ratio: Typically, the percentage of students who are promoted to the next grade is above average in Galicia, Madrid, Cantabria, Navarre, Asturias, the Basque Country, and Catalonia.
- Above-average repeating student ratio: Generally, the percentage of students who are promoted to the next grade is below average in Melilla, Ceuta, Andalusia, Castilla-La Mancha, Murcia, Valencia, Extremadura, and the Balearic Islands.
- Below-average repeating student ratio and a below-average student promotion rate: Canary Islands.
- Above-average repeating student ratio and an above-average student promotion rate: La Rioja, Aragón, and Castilla y León.
- States with a lower-than-average investment per student and a lower-than-average teacher–student ratio have a higher-than-average rate of students promoted to the next grade (Madrid and Catalonia).
- States with a below-average percentage of teachers and an above-average percentage of repeating students have an above-average rate of students promoted to the next grade (La Rioja).
- States with above-average investment per student and a below-average percentage of teachers have a below-average rate of students promoted to the next grade (Ceuta, Melilla, and Extremadura).
- Reasoning by Analogy
- Selection of Events in the Database of Previous Events
- Identification of Common Causes
- Teacher-to-student ratio;
- Expenditure per student;
- Percentage of repeating students to total enrolled students.
- Representation of Causes and Effects in Previous Events
- For the 2011–2012 school year in Andalusia:
  ∘ (2011–2012; Andalusia; 640.29€; 1.53; 1.75) → (57.18%).
- For other states:
  ∘ (2011–2012; Aragón; 740.80€; 1.74; 1.11) → (56.99%);
  ∘ (2011–2012; Asturias; 920.02€; 2.01; 0.89) → (62.02%);
  ∘ …;
  ∘ (2020–2021; Basque Country; 1370.80€; 2.15; 0.39) → (72.62%);
  ∘ (2020–2021; La Rioja; 950.65€; 1.69; 0.39) → (64.21%);
  ∘ (2020–2021; Valencian Region; 900.59€; 1.89; 0.42) → (61.99%).
- Examples include:
- (2011–2012; 1.52%; 690.80€; 1.17%) → (56.91%);
- (2012–2013; 1.45%; 640.16€; 1.11%) → (57.38%);
- …;
- (2019–2020; 1.63%; 780.46€; 0.89%) → (73.69%);
- (2020–2021; 1.73%; 850.45€; 0.44%) → (65.74%).
- Representation of the Event to Be Predicted (Ev′)
- (2021–2022; Andalusia; 840.85€; 1.71%; 1.12%);
- (2021–2022; Aragón; 960.73€; 2.13%; 0.86%);
- (2021–2022; Asturias; 990.93€; 2.18%; 0.60%).
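The vector notation used for previous events and for the event to be predicted maps naturally onto a small record type. In the sketch below the `Event` class and its field names are hypothetical, and reading the second and third numeric fields as the teacher and repeating-student percentages is an assumption; the values are copied from the examples listed above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One event: three common causes and, for previous events, the effect."""
    school_year: str
    state: str
    expenditure_eur: float
    teacher_pct: float      # assumed meaning of the second numeric field
    repeating_pct: float    # assumed meaning of the third numeric field
    promoted_pct: Optional[float] = None  # effect; unknown for Ev'

    def causes(self) -> Tuple[float, float, float]:
        """Return the cause vector used when comparing events."""
        return (self.expenditure_eur, self.teacher_pct, self.repeating_pct)


# Previous events, copied from the examples listed above.
previous_events = [
    Event("2011-2012", "Andalusia", 640.29, 1.53, 1.75, 57.18),
    Event("2011-2012", "Aragón", 740.80, 1.74, 1.11, 56.99),
    Event("2011-2012", "Asturias", 920.02, 2.01, 0.89, 62.02),
]
# Event to be predicted: causes are known, the effect is not.
target = Event("2021-2022", "Andalusia", 840.85, 1.71, 1.12)
print(target.causes(), target.promoted_pct)
```

Leaving `promoted_pct` empty for the event to be predicted makes the distinction between causes (known) and effects (to be inferred) explicit in the data structure.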
- Quantifying Common Causes for Previous Events and the Event to Be Predicted
- Section 1: Quantifies the academic periods from 2011–2012 to 2020–2021 based on common causes. Results for previous events are displayed by school year and state.
- Section 2: Quantifies the common causes for the event to be predicted.
- Section 3: Calculates the distance from previous events to the event to be predicted based on common causes.
- Section 4: Selects the most similar previous event, detailed in the next section.
- Local Distance Calculation
- Global Distance Calculation
- Average common causes of previous events: 0.43;
- Average common causes of the event to be predicted: 0.84.
- Selecting the Most Similar Previous Event
- Prediction and/or Generalization of Knowledge
- Identifying Relevant Variables
- Expenditure per student;
- Teacher-to-student ratio;
- Repeating student ratio.
- Directed Acyclic Graphs (DAGs)
- Economic factors
- Educational resources
- Educational system efficiency
- The expenditure per student impacts the teacher ratio and repeating student ratio: a higher expenditure leads to more teachers and fewer repeating students.
- The teacher ratio also affects the repeating student ratio: more teachers result in smaller class sizes, which in turn reduces the number of repeating students.
- Identifying relevant variables and plausible causal relationships.
- Establishing assumptions about the direction of causal relationships.
- Validating or adjusting causal structures proposed by algorithms, such as CI.
- Variable Control
- Backdoor path 1: Teacher ratio ← expenditure per student → percentage of students promoted, i.e., spending per student affects both the teacher ratio and the percentage of students promoted.
- Backdoor path 2: Repeating student ratio ← expenditure per student → percentage of students promoted, i.e., spending per student affects both the repeating student ratio and the percentage of students promoted.
- Backdoor path 3: Teacher ratio ← expenditure per student → repeating student ratio → percentage of students promoted, i.e., spending per student affects both the teacher ratio and the repeating student ratio, and the repeating student ratio affects the percentage of students promoted.
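The causal structure described above can be encoded as a small graph to check that it is acyclic and to list candidate confounders. In this sketch the node names are shorthand, the direct edge from teacher ratio to the percentage promoted is assumed from its role as an independent variable, and "common cause" is simplified to mean any variable that is an ancestor of both the treatment and the outcome.

```python
from graphlib import TopologicalSorter

# Edges follow the relationships described above; node names are shorthand.
CHILDREN = {
    "expenditure": ["teacher_ratio", "repeating_ratio", "promoted"],
    "teacher_ratio": ["repeating_ratio", "promoted"],  # teacher -> promoted is assumed
    "repeating_ratio": ["promoted"],
    "promoted": [],
}


def parents(node):
    """Direct causes of `node`."""
    return {p for p, kids in CHILDREN.items() if node in kids}


def ancestors(node):
    """All nodes with a directed path into `node`."""
    found = set()
    stack = list(parents(node))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents(current))
    return found


def common_causes(treatment, outcome):
    """Simplified confounder candidates: ancestors of both variables."""
    return ancestors(treatment) & ancestors(outcome)


# TopologicalSorter raises CycleError if the graph is not acyclic.
order = list(TopologicalSorter({n: parents(n) for n in CHILDREN}).static_order())
print(order)
print(common_causes("teacher_ratio", "promoted"))
```

For the teacher ratio, the only variable returned is expenditure per student, which matches backdoor path 1; controlling for it is what blocks that path.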
- Atypical case: 2019–2020 school year
- Relaxation of assessment and promotion criteria: The Department of Education of Spain and the states agreed that grade repetition would be an exceptional measure, allowing most students to progress to the next educational level, even with outstanding subjects.
- Reduction in academic demands: The abrupt transition to online learning led to an overall decrease in the academic load and assessment requirements, making it easier for more students to meet the passing criteria.
- Decrease in the repetition rate: Official statistics show that the rate of repeating students in compulsory secondary education has decreased significantly.
- Adaptation of final assessments: Final exams and assessments were modified to adapt to the new educational reality, in many cases increasing the optionality and reducing the difficulty.
- Family support during lockdown: The teleworking of many parents, especially mothers, allowed for greater supervision and support in their children’s educational process, which had a positive impact on their academic performance.
- Focus on students’ emotional well-being: Educational authorities prioritized students’ emotional well-being during the pandemic, leading to greater understanding and flexibility on the part of teachers in assessing academic performance.
- Reduction of academic pressure: The elimination of in-person exams and the adaptation of assessments reduced pressure on students, allowing them to perform better in a less stressful environment.
- Institutional awareness of educational inequalities: The pandemic highlighted inequalities in access to education, leading institutions to take steps to ensure all students had the opportunity to advance their education, regardless of their circumstances.
- Faculty commitment: Faculty quickly adapted their teaching and assessment methods to continue the educational process online, showing great dedication to ensuring students could successfully complete the course.
- Approximate Causal Impact Calculation
- Expenditure per student, teacher ratio, and repeating student ratio are the independent variables;
- β0 is the intercept, and β1, β2, β3 are the regression coefficients representing the impact of each independent variable on the percentage of students who are promoted.
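A linear model with an intercept and three coefficients can be fitted by ordinary least squares. The sketch below solves the normal equations with the standard library only; the function name `fit_linear` and the data are hypothetical (the targets are generated from known coefficients so the fit can be checked), and expenditure is expressed in thousands of euros purely to keep the numbers on a similar scale.

```python
def fit_linear(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    size = len(design[0])
    # Build the augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(size)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        for r in range(size):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


# Synthetic data: (expenditure in thousand EUR, teacher %, repeating %).
rows = [
    (0.70, 1.5, 1.2),
    (0.85, 1.7, 0.5),
    (0.90, 2.0, 0.9),
    (0.65, 1.6, 1.5),
    (1.00, 1.9, 0.4),
]
# Targets generated from known coefficients so the fit can be verified.
targets = [40 + 10 * e + 5 * t - 8 * r for e, t, r in rows]
b0, b1, b2, b3 = fit_linear(rows, targets)
print(round(b0, 6), round(b1, 6), round(b2, 6), round(b3, 6))
```

With real data the coefficients would only be associations unless the confounders identified in the DAG are included as controls.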
- Prediction and Rules
- Single-parent families have fewer economic resources since they rely solely on the income of one parent and, at best, partial support from the other parent.
- They have a reduced ability to provide more effective parenting, as responsibilities cannot be shared equally between two parents.
- A single parent lacks the support of another adult to address challenges and difficulties related to educating children.
- The parent has reduced emotional stability caused by the absence of support from a second parent.
- Data Sources
- Prediction Application Overview
- Section 1: Displays the values of common causes and the results of the most similar previous event (2020–2021 school year).
- Section 2: Evaluates the specific causes for the event to be predicted (2021–2022 school year).
- Section 3: Produces the final prediction, accounting for the effect of the most similar previous event and adjustments for the specific causes of the event to be predicted.
- Specific causes and prediction
- Results and Validation
- Results
- IF the state expenditure per student is above the average
  ∘ IF the teacher ratio is above the average
    ▪ IF the repeating student ratio is below the average
      ▪ IF the percentage of students who are promoted to the next grade is above the average: Asturias, Cantabria, Galicia, Basque Country.
    ▪ OTHERWISE
      ▪ IF the percentage of students who are promoted to the next grade is above the average: Aragón, Castilla y León
      ▪ OTHERWISE Extremadura
  ∘ OTHERWISE
    ▪ IF the repeating student ratio is below the average THEN
      ▪ IF the percentage of students who are promoted to the next grade is above the average: La Rioja
    ▪ OTHERWISE
      ▪ IF the percentage of students who are promoted to the next grade is below the average: Ceuta, Melilla
- OTHERWISE
  ∘ IF the teacher ratio is above the average
    ▪ IF the repeating student ratio is above the average THEN
      ▪ IF the percentage of students who are promoted to the next grade is above the average: Balearic Islands
  ∘ OTHERWISE
    ▪ IF the repeating student ratio is below the average THEN
      ▪ IF the percentage of students who are promoted to the next grade is above the average: Canary Islands
      ▪ OTHERWISE Madrid, Catalonia
    ▪ OTHERWISE
      ▪ IF the percentage of students who are promoted to the next grade is below the average: Andalusia, Castilla-La Mancha, Valencian Region, Region of Murcia.
- Validation
- Table 10 columns:
- “Real Average” Column: The actual average percentage, for each state, of students who were promoted to the next grade with all subjects passed across the four middle school grades during the 2021–2022 school year.
- “Predicted Average” Column: The percentage predicted by the proposed method, for each state, of students who are promoted to the next grade with all subjects passed across the four middle school grades during the 2021–2022 school year.
- “Chi2” Column: $\chi^2 = \sum_{s} \frac{(O_s - P_s)^2}{P_s}$, where, for each state $s$, $O_s$ is the real percentage of students promoted to the next grade in 2021/2022 and $P_s$ is the percentage predicted by the proposed method for 2021/2022.
- “MAE” Column: $\mathrm{MAE} = \frac{1}{n} \sum_{s} \lvert O_s - P_s \rvert$, where $O_s$ is the real percentage of students promoted to the next grade in 2021/2022 for state $s$, $P_s$ is the percentage predicted by the proposed method, and $n$ is the number of states.
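The two validation columns can be computed per state with a few lines. The sketch takes MAE in its standard sense, the mean of the absolute differences between real and predicted percentages; the state names and percentages are hypothetical.

```python
def mean_absolute_error(real, predicted):
    """MAE: average of |real - predicted| over the states present in `real`."""
    if not real:
        raise ValueError("at least one state is required")
    return sum(abs(real[s] - predicted[s]) for s in real) / len(real)


def chi2_contributions(real, predicted):
    """Per-state terms (real - predicted)^2 / predicted of the Chi2 column."""
    return {s: (real[s] - predicted[s]) ** 2 / predicted[s] for s in real}


# Hypothetical percentages, for illustration only.
real = {"State A": 62.0, "State B": 58.0}
predicted = {"State A": 60.0, "State B": 60.0}
print(mean_absolute_error(real, predicted))
print(sum(chi2_contributions(real, predicted).values()))
```

Keeping the per-state Chi2 terms separate makes it easy to see which states contribute most to the total statistic.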
4. Discussion and Conclusions
- They require complete and representative data to perform well. In contexts where information is partial, these techniques can generate biased or unreliable results.
- Statistical models, such as linear regression, assume that past trends will continue in the future, which is not always true when political, economic and social factors exert an unpredictable influence.
- Neural networks and other machine learning models are effective when there are clear and repetitive patterns in the data. However, the relationships between variables are not always linear or constant, which limits their predictive capacity.
- Techniques such as neural networks present a “black box effect”, which makes it difficult to understand how and why certain predictions are generated.
- Regarding the influence of non-quantifiable external factors, aspects such as student motivation, teaching quality or family influence are difficult to quantify and, therefore, difficult to model with statistical or machine learning approaches.
- Prediction models are often trained on historical data, which makes them less flexible when faced with changing conditions. Factors such as new educational policies or economic crises can drastically change the rate of school promotion, something that these models cannot accurately predict.
- Intelligent data analysis provides an empirical basis for the model, identifying trends and patterns in historical data as well as identifying relevant variables. It is estimated to contribute 60% to the model.
- Similarity reasoning establishes a baseline from which to predict the new event based on analogy with similar previous cases. It is estimated to contribute 25% to the model.
- Causal relationships allow us to understand the cause–effect relationships between variables, beyond correlations, improving the ability to generalize knowledge. It is estimated to contribute 15% to the model.
- The teacher–student ratio is below the average and the percentage of students who are promoted is above the average: La Rioja, Madrid and Catalonia.
- The expenditure per student is below the average and the percentage of students who are promoted to the next grade is above the average: Madrid and Catalonia.
- The repeating student ratio is above the average and the percentage of students who are promoted is above the average: La Rioja, Aragón and Castilla y León.
- States with above-average expenditure per student have a below-average percentage of students promoted: Ceuta, Extremadura and Melilla.
- States with an above-average teacher ratio have a below-average percentage of students promoted: Extremadura and the Balearic Islands.
- States with a below-average rate of repeating students have a below-average percentage of students promoted: the Canary Islands.
- Detecting academic failure early in order to intervene as quickly as possible.
- Optimizing the planning of educational resources, e.g., increasing investment per student, increasing the number of teachers, reducing the number of students per class, etc.
- If a low passing rate is predicted, strategies can be designed to change the pedagogical approach.
- Showing predictive evidence to support improvements in educational investments. Measuring the impact of educational policies, e.g., scholarships, reinforcement programs, curricular changes, etc.
- Integrating new educational, social, and economic factors.
- Applying the model to different educational levels (high school, university, etc.) and countries would test its generalization and adaptability. Such expansion could reveal new patterns.
- Training and comparing at least two state-of-the-art algorithms (e.g., Random Forest, Gradient Boosting Machines) using the same features, and reporting comparative metrics (MAE, RMSE) alongside the hybrid model.
- Developing intuitive dashboards and visualization tools would make the model’s information more accessible to educators and policymakers, facilitating data-driven decision-making.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Salatino, A.A.; Mannocci, A.; Osborne, F. Detection, analysis, and prediction of research topics with scientific knowledge graphs. In Predicting the Dynamics of Research Impact; Springer: Cham, Switzerland, 2021; pp. 225–252.
- Gu, X.; Krenn, M. Forecasting high-impact research topics via machine learning on evolving knowledge graphs. arXiv 2024, arXiv:2402.08640.
- Krenn, M.; Buffoni, L.; Coutinho, B.; Eppel, S.; Foster, J.G.; Gritsevskiy, A.; Kopp, M. Predicting the Future of AI with AI: High-quality link prediction in an exponentially growing knowledge network. arXiv 2022, arXiv:2210.00881.
- Zeineddine, H.; Braendle, U.; Farah, A. Enhancing prediction of student success: Automated machine learning approach. Comput. Electr. Eng. 2021, 89, 106903.
- Chen, Y.; Wei, W.; Wang, L.; Dong, Y.; Liang, C.J. Where do they go next? Causal inference-based prediction and visual analysis of graduates’ first destination. J. Vis. 2024, 27, 885–908.
- Kitto, K.; Hicks, B.; Buckingham Shum, S. Using causal models to bridge the divide between big data and educational theory. Br. J. Educ. Technol. 2023, 54, 1095–1124.
- Cao, C.; Ding, Z.; Lee, G.G.; Jiao, J.; Lin, J.; Zhai, X. Elucidating STEM concepts through generative AI: A multi-modal exploration of analogical reasoning. arXiv 2023, arXiv:2308.10454.
- Pearl, J. Causal inference. In Causality: Objectives and Assessment; PMLR: Cambridge, MA, USA, 2010; pp. 39–58.
- Loyola-Gonzalez, O. Black-box vs. white-box: Understanding their advantages and weaknesses from a practical point of view. IEEE Access 2019, 7, 154096–154113.
- Junemann, M.A.P.; Lagos, P.A.S.; Arriagada, R.C. Neural Networks to Predict Schooling Failure/Success. Comput. Sci. 2007, 4528, 571–579.
- Wang, T.; Mitrovic, A. Using neural networks to predict student’s performance. In Proceedings of the International Conference on Computers in Education, Auckland, New Zealand, 3–6 December 2002; pp. 969–973.
- Cripps, A. Using artificial neural networks to predict academic performance. In Proceedings of the ACM Symposium on Applied Computing, Philadelphia, PA, USA, 17–19 February 1996; pp. 33–37.
- Buenaño-Fernández, D.; Gil, D.; Luján-Mora, S. Application of Machine Learning in Predicting Performance for Computer Engineering Students: A Case Study. Sustainability 2019, 11, 2833.
- Moscoso-Zea, O.; Saa, P.; Luján-Mora, S. Evaluation of algorithms to predict graduation rate in higher education institutions by applying educational data mining. Australas. J. Eng. Educ. 2019, 24, 4–13.
- Sheel, S.J.; Vrooman, D.; Renner, R.S.; Dawsey, S.K. A Comparison of Neural Networks and Classical Discriminant Analysis in Predicting Students’ Mathematics Placement Examination Scores. Comput. Sci. 2001, 2074, 952–957.
- Kalles, D.; Pierrakeas, C. Analyzing student performance in distance learning with genetic algorithms and decision trees. Appl. Artif. Intell. 2006, 20, 655–674.
- Kotsiantis, S.; Pierrakeas, C.; Pintelas, P. Predicting students’ performance in distance learning using machine learning techniques. Appl. Artif. Intell. 2004, 18, 411–426.
- % of Middle School Students who Promoted to the Next Grade with All Subjects Passed in the 2020–2021 School Years. Department of Education of Spain. Statistics on Non-University Education. Available online: https://estadisticas.educacion.gob.es/EducaJaxiPx/Tabla.htm?path=/no-universitaria/alumnado/matriculado/2020-2021-rd/gen-eso/l0/&file=eso_01.px&L=0 (accessed on 1 January 2024).
- Number of Students Who Promoted to the Next Grade in Middle School from 2011–2012 to 2020–2021 School Years. Available online: https://www.educacionfpydeportes.gob.es/servicios-al-ciudadano/estadisticas/no-universitaria/alumnado/resultados.html (accessed on 1 January 2024).
- Number of Students Enrolled in Middle School from 2011–2012 to 2020–2021 School Years. Available online: https://www.educacionfpydeportes.gob.es/servicios-al-ciudadano/estadisticas/no-universitaria/alumnado/matriculado.html (accessed on 1 January 2024).
- Total Expenditure (€) per Student in Middle School from 2011–2012 to 2020–2021. Available online: https://www.educacionfpydeportes.gob.es/servicios-al-ciudadano/estadisticas/economicas/gasto.html (accessed on 1 January 2024).
- Number of Teachers at the Middle School from 2011–2012 to 2020–2021 School Years. Available online: https://www.educacionfpydeportes.gob.es/servicios-al-ciudadano/estadisticas/no-universitaria/profesorado/estadistica.html (accessed on 1 January 2024).
- Number of Repeating Students in Middle School from 2011–2012 to 2020–2021. Available online: https://estadisticas.educacion.gob.es/EducaJaxiPx/Tabla.htm?path=/no-universitaria/alumnado/matriculado/2020-2021-rd/gen-eso/l0/&file=eso_04.px&L=0 (accessed on 1 January 2024).
- Considine, G.; Zappalà, G. The influence of social and economic disadvantage in the academic performance of school students in Australia. J. Sociol. 2002, 38, 129–148.
- Wallerstein, J. Children of Divorce: Stress and Developmental Task; McGraw-Hill: New York, NY, USA, 2002.
- Niemeyer, T.D.; Torres, M.I.V. Percepción materna del ajuste socioemocional de sus hijos preescolares: Estudio descriptivo y comparativo de familias separadas e intactas con alto y bajo nivel de ajuste marital. Revista de Psicología 2000, 9, 29–44.
- Ram, B.; Feng, H. Changes in family structure and child outcomes: Roles of economic and familiar resources. Policy Stud. J. 2003, 31, 309–330.
- White, L.; Rogers, S.J. Economic circumstances and family outcomes: A review of the 1990s. J. Marriage Fam. 2000, 62, 1035–1051.
- Evolution of the Consumer Price Index (CPI) by Month and State. Available online: https://www.ine.es/jaxiT3/Tabla.htm?t=50918 (accessed on 1 January 2024).
- The Number of Single-Parent and Two-Parent Households by Year and State. Available online: https://www.ine.es/jaxi/Tabla.htm?path=/t20/p274/serie/prov/p02/l0/&file=02001.px&L=0 (accessed on 1 January 2024).
- The Number and Amount of Mortgages Signed per Month and State. Available online: https://www.ine.es/jaxiT3/Tabla.htm?t=3200&L=0 (accessed on 1 January 2024).
- Data for Calculating the % of Students who Promote to the Next Grade at Middle School with All Subjects Passed for the 2021–2022 School Year. Available online: https://estadisticas.educacion.gob.es/EducaJaxiPx/Tabla.htm?path=/no-universitaria/alumnado/resultados/2021-2022-rd/reggen/l0/&file=reggen_3_03.px (accessed on 1 January 2024).
Authors | Main Prediction Technique/Method | Strengths | Weaknesses |
---|---|---|---|
Salatino et al. [1] | Computer Science Ontology (CSO); machine learning models | Identifies new research topics and studies the temporal evolution of specific topics | Reliance on historical patterns for prediction; limitations in managing semantic ambiguities |
Gu et al. [2] | Machine learning | The model predicted whether two concepts would connect and whether that connection would be high impact. | It performed well over three-year intervals, but its accuracy decreased when forecasting five years ahead; excessive reliance on historical data |
Krenn et al. [3] | Statistical techniques/machine learning | For five-year predictions, it obtained an accuracy of 90%. | Dependence on manual characteristics; lack of integration of sociological factors |
Zeineddine et al. [4] | Machine learning | It improves on previous studies using data mining, which achieved over 70% accuracy. | The increase in the accuracy obtained with this paper is only +5.9%. |
Chen et al. [5] | Causal Inference/Machine Learning | Predicts graduates’ first job destinations based on academic performance | May depend on data quality and representativeness |
Kitto et al. [6] | Causal Models (DAG) | Apply causal graphs to represent how multiple factors affect reflective writing | May be difficult to generalize without fine-tuning |
Cao et al. [7] | Generative AI, analogical reasoning | Transforms complex STEM concepts into comprehensible visual metaphors for students | Effectiveness evaluation is still preliminary |
Variable | Description |
---|---|
Ev | The most similar previous event |
Ev′ | The event to be predicted |
C | Vector of common causes |
ccn | Common cause of the most similar previous event |
R(E/C) | Cause–effect relationship of the most similar previous event |
E | Effects of the most similar previous event |
cen | Common effect of the most similar previous event |
S | Similarity function between the causes of the most similar previous event and the causes of the event to be predicted |
C′ | Vector of common causes plus specific causes |
cc′n | Common cause of the event to be predicted |
sc′n | Specific cause |
R′(E′/C′) | Cause–effect relationship of the event to be predicted |
E′ | Effects of the event to be predicted |
ce′n | Common effects of the event to be predicted |
se′n | Specific effects of the event to be predicted |
S′ | Similarity function between the effects of the most similar previous event and the effects of the event to be predicted |
Academic Year | % Promoted |
---|---|
2011–2012 | 56.91 |
2012–2013 | 57.38 |
2013–2014 | 59.13 |
2014–2015 | 60.30 |
2015–2016 | 61.51 |
2016–2017 | 59.88 |
2017–2018 | 61.57 |
2018–2019 | 62.18 |
2019–2020 | 72.70 |
2020–2021 | 65.23 |
Promotion | Teacher–Student Ratio (above average) | Teacher–Student Ratio (below average) | Expenditure per Student (above average) | Expenditure per Student (below average) | Repeating Student Ratio (above average) | Repeating Student Ratio (below average) |
---|---|---|---|---|---|---|
Promote above average | Cantabria, Asturias, The Basque Country, Castilla and León, Galicia, Aragón and Navarre | La Rioja, Madrid and Catalonia | The Basque Country, Navarre, Asturias, Galicia, Castilla and León, La Rioja, Cantabria and Aragón | Madrid and Catalonia | La Rioja, Aragón, Castilla and León | Galicia, Madrid, Cantabria, Navarre, Asturias, The Basque Country and Catalonia |
Promote below average | Extremadura and Balearic Islands | Ceuta, Valencian Region, Andalusia, Castilla-La Mancha, Melilla, Canary Islands, Extremadura and Region of Murcia | Ceuta, Extremadura and Melilla | Balearic Islands, Valencian Region, Canary Islands, Castilla-La Mancha, Murcia and Andalusia | Melilla, Ceuta, Andalusia, Castilla-La Mancha, Murcia, Valencian Region, Extremadura, and Balearic Islands | Canary Islands |
School Year | Expenditure | % Teachers | % Repeating |
---|---|---|---|
2011–2012 | 0.32 | 0.41 | 1.00 |
2012–2013 | 0.08 | 0.21 | 0.92 |
2013–2014 | 0.01 | 0.04 | 0.81 |
2014–2015 | 0.00 | 0.00 | 0.83 |
2015–2016 | 0.14 | 0.25 | 0.75 |
2016–2017 | 0.27 | 0.28 | 0.58 |
2017–2018 | 0.37 | 0.38 | 0.69 |
2018–2019 | 0.46 | 0.50 | 0.60 |
2019–2020 | 0.70 | 0.69 | 0.61 |
2020–2021 | 1.00 | 1.00 | 0.00 |
Average | 0.33 | 0.38 | 0.68 |
School Year | Manhattan Distance | Euclidean Distance |
---|---|---|
2020–2021 | 1.66 | 1.51 |
2019–2020 | 3.44 | 1.66 |
2018–2019 | 4.82 | 1.74 |
2017–2018 | 5.64 | 1.86 |
2011–2012 | 6.14 | 2.44 |
2016–2017 | 6.23 | 1.92 |
2015–2016 | 7.10 | 2.12 |
2013–2014 | 7.98 | 2.35 |
2012–2013 | 8.07 | 2.48 |
2014–2015 | 8.09 | 2.48 |
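
The ranking above comes from comparing each past school year with the event to be predicted. A minimal Python sketch of that comparison follows, using the three normalized indicators tabulated earlier (expenditure, % teachers, % repeating) as cause vectors. The `target` vector is a hypothetical placeholder, not a value from the study. The tabulated distances are larger than three features in [0, 1] could yield for a target in the same range, so the published figures presumably involve additional variables; this sketch is not expected to reproduce them.

```python
import math

# Normalized indicators per school year, copied from the table of
# normalized values: (expenditure, % teachers, % repeating).
history = {
    "2011-2012": (0.32, 0.41, 1.00),
    "2012-2013": (0.08, 0.21, 0.92),
    "2013-2014": (0.01, 0.04, 0.81),
    "2014-2015": (0.00, 0.00, 0.83),
    "2015-2016": (0.14, 0.25, 0.75),
    "2016-2017": (0.27, 0.28, 0.58),
    "2017-2018": (0.37, 0.38, 0.69),
    "2018-2019": (0.46, 0.50, 0.60),
    "2019-2020": (0.70, 0.69, 0.61),
    "2020-2021": (1.00, 1.00, 0.00),
}

# ASSUMPTION: hypothetical normalized causes of the event to be predicted.
target = (0.90, 0.95, 0.10)


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Distance of every past year to the target, under both metrics.
distances = {
    year: (manhattan(vec, target), euclidean(vec, target))
    for year, vec in history.items()
}

# Rank past years by Manhattan distance; the first entry is the
# most similar previous event under this metric.
ranking = sorted(distances, key=lambda year: distances[year][0])
most_similar = ranking[0]

for year in ranking:
    d_man, d_euc = distances[year]
    print(f"{year}: Manhattan={d_man:.2f}, Euclidean={d_euc:.2f}")
```

The two metrics need not produce the same ordering, as the table itself shows for 2011–2012 and 2016–2017, so the choice of metric is part of the similarity function.
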
School Year | % Promoted |
---|---|
2016–2017 | 59.88 |
2017–2018 | 61.57 |
2018–2019 | 62.18 |
2019–2020 | 72.70 |
2020–2021 | 65.23 |
School Year | % Students Promoted | Expenditure per Student | Teacher Ratio | Repeating Student Ratio |
---|---|---|---|---|
2016–2017 | 59.88 | 0.22 | 0.26 | 0.58 |
2017–2018 | 61.57 | 0.31 | 0.35 | 0.69 |
2018–2019 | 62.18 | 0.38 | 0.46 | 0.60 |
2019–2020 | 72.70 | 0.57 | 0.64 | 0.61 |
2020–2021 | 65.23 | 0.82 | 0.93 | 0.59 |
School Year | Real % Promoted | Regression Calculation |
---|---|---|
2016–2017 | 59.88 | 58.50 |
2017–2018 | 61.57 | 63.97 |
2018–2019 | 62.18 | 61.89 |
2019–2020 | 72.70 | 71.53 |
2020–2021 | 65.23 | 65.63 |
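
To gauge how closely the regression values track the real percentages, the fit can be summarized with MAE and RMSE. The sketch below only uses the two columns of the table above; the variable names are illustrative and the regression coefficients themselves are not recomputed here.

```python
import math

# (real % promoted, regression value) per school year, from the table above.
fit = {
    "2016-2017": (59.88, 58.50),
    "2017-2018": (61.57, 63.97),
    "2018-2019": (62.18, 61.89),
    "2019-2020": (72.70, 71.53),
    "2020-2021": (65.23, 65.63),
}

# Signed residuals: real value minus regression value.
residuals = [real - fitted for real, fitted in fit.values()]

# Mean absolute error and root mean squared error over the five years.
mae = sum(abs(r) for r in residuals) / len(residuals)
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

print(f"MAE  = {mae:.2f} percentage points")
print(f"RMSE = {rmse:.2f} percentage points")
```

With only five observations, these figures describe in-sample fit rather than predictive accuracy.
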
State | 1st Grade | 2nd Grade | 3rd Grade | 4th Grade | Real Promot. |
---|---|---|---|---|---|
ANDALUSIA | 61.0 | 56.7 | 56.9 | 63.3 | 59.48 |
ARAGÓN | 63.0 | 61.8 | 62.5 | 62.5 | 62.45 |
ASTURIAS | 73.0 | 66.7 | 64.0 | 63.9 | 66.90 |
BALEARIC ISLANDS | 66.9 | 60.0 | 60.9 | 64.1 | 62.98 |
CANARY ISLANDS | 61.7 | 58.9 | 58.9 | 60.5 | 60.00 |
CANTABRIA | 72.0 | 67.7 | 63.1 | 65.7 | 67.13 |
CASTILLA AND LEÓN | 66.3 | 63.6 | 63.4 | 63.8 | 64.28 |
CASTILLA-LA MANCHA | 59.9 | 55.9 | 57.4 | 57.6 | 57.70 |
CATALONIA | 72.6 | 68.5 | 65.0 | 73.0 | 69.78 |
VALENCIAN REGION | 61.2 | 54.1 | 55.2 | 60.7 | 57.80 |
EXTREMADURA | 66.2 | 63.3 | 59.5 | 62.9 | 62.98 |
GALICIA | 71.7 | 67.4 | 65.2 | 66.1 | 67.60 |
MADRID | 68.0 | 63.1 | 61.6 | 62.9 | 63.90 |
REGION OF MURCIA | 60.9 | 56.8 | 57.7 | 57.0 | 58.10 |
NAVARRE | 73.1 | 67.6 | 68.0 | 66.5 | 68.80 |
BASQUE COUNTRY | 73.1 | 67.6 | 70.1 | 75.8 | 71.65 |
LA RIOJA | 60.1 | 60.9 | 59.9 | 62.1 | 60.75 |
CEUTA | 81.0 | 59.0 | 61.4 | 65.9 | 66.83 |
MELILLA | 73.6 | 60.7 | 62.0 | 66.5 | 65.70 |
Average | 67.65 | 62.12 | 61.72 | 64.25 | 63.94 |
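
The last column of the table above is consistent with the unweighted mean of the four per-grade percentages. That reading is inferred from the tabulated values rather than stated explicitly, so the sketch below should be taken as a check under that assumption, shown for a few regions only.

```python
# Per-grade promotion percentages (1st to 4th grade) for a few regions,
# copied from the table above.
grades = {
    "ANDALUSIA": (61.0, 56.7, 56.9, 63.3),
    "CATALONIA": (72.6, 68.5, 65.0, 73.0),
    "BASQUE COUNTRY": (73.1, 67.6, 70.1, 75.8),
    "CEUTA": (81.0, 59.0, 61.4, 65.9),
}

# ASSUMPTION: real promotion is the unweighted mean of the four grades,
# i.e. grade enrollment sizes are not used as weights.
real_promotion = {
    region: sum(values) / len(values) for region, values in grades.items()
}

for region, value in real_promotion.items():
    print(f"{region}: {value:.3f}")
```

An enrollment-weighted mean would generally give slightly different values, which matters if grade sizes differ a lot within a region.
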
State | Real Avg. | Prediction Avg. | Chi2 | MAE |
---|---|---|---|---|
ANDALUSIA | 59.48 | 61.09 | 0.04 | 1.61 |
ARAGÓN | 62.45 | 64.93 | 0.09 | 2.48 |
ASTURIAS | 66.90 | 68.31 | 0.03 | 1.41 |
BALEARIC ISLANDS | 62.98 | 65.16 | 0.07 | 2.18 |
CANARY ISLANDS | 60.00 | 63.25 | 0.17 | 3.25 |
CANTABRIA | 67.13 | 70.22 | 0.14 | 3.09 |
CASTILLA AND LEÓN | 64.28 | 64.44 | 0.00 | 0.16 |
CASTILLA-LA MANCHA | 57.70 | 60.04 | 0.09 | 2.34 |
CATALONIA | 69.78 | 71.19 | 0.03 | 1.41 |
VALENCIAN REGION | 57.80 | 61.19 | 0.19 | 3.39 |
EXTREMADURA | 62.98 | 63.49 | 0.00 | 0.51 |
GALICIA | 67.60 | 68.16 | 0.00 | 0.56 |
MADRID | 63.90 | 59.06 | 0.40 | 4.84 |
REGION OF MURCIA | 58.10 | 57.28 | 0.01 | 0.82 |
NAVARRE | 68.80 | 72.37 | 0.18 | 3.57 |
BASQUE COUNTRY | 71.65 | 69.90 | 0.04 | 1.75 |
LA RIOJA | 60.75 | 61.21 | 0.00 | 0.46 |
CEUTA | 66.83 | 44.62 | 11.05 | 22.21 |
MELILLA | 65.70 | 48.21 | 6.34 | 17.49 |
Average | 18.88 | |||
Sum | 33.83 |
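
The per-region validation figures can be recomputed from the real and predicted averages. The sketch below assumes each chi-square contribution is (real − predicted)² / predicted, with the prediction as the expected value, and that the per-region error is the absolute difference. Both assumptions are inferred from the tabulated values, not taken from a stated formula.

```python
# (real average, predicted average) per region, copied from the table above.
results = {
    "ANDALUSIA": (59.48, 61.09),
    "ARAGÓN": (62.45, 64.93),
    "ASTURIAS": (66.90, 68.31),
    "BALEARIC ISLANDS": (62.98, 65.16),
    "CANARY ISLANDS": (60.00, 63.25),
    "CANTABRIA": (67.13, 70.22),
    "CASTILLA AND LEÓN": (64.28, 64.44),
    "CASTILLA-LA MANCHA": (57.70, 60.04),
    "CATALONIA": (69.78, 71.19),
    "VALENCIAN REGION": (57.80, 61.19),
    "EXTREMADURA": (62.98, 63.49),
    "GALICIA": (67.60, 68.16),
    "MADRID": (63.90, 59.06),
    "REGION OF MURCIA": (58.10, 57.28),
    "NAVARRE": (68.80, 72.37),
    "BASQUE COUNTRY": (71.65, 69.90),
    "LA RIOJA": (60.75, 61.21),
    "CEUTA": (66.83, 44.62),
    "MELILLA": (65.70, 48.21),
}

# ASSUMPTION: chi-square contribution with the prediction as expected value.
chi2 = {r: (real - pred) ** 2 / pred for r, (real, pred) in results.items()}
abs_err = {r: abs(real - pred) for r, (real, pred) in results.items()}

chi2_total = sum(chi2.values())
mae = sum(abs_err.values()) / len(abs_err)

# Ceuta and Melilla dominate both statistics, so also report MAE without them.
rest = [e for r, e in abs_err.items() if r not in ("CEUTA", "MELILLA")]
mae_without_cities = sum(rest) / len(rest)

print(f"Chi-square total: {chi2_total:.2f}")
print(f"MAE (all 19 regions): {mae:.2f}")
print(f"MAE (without Ceuta and Melilla): {mae_without_cities:.2f}")
```

Recomputed from the rounded averages, the chi-square total comes out close to the 18.88 reported at the foot of the table, and the absolute errors of the 17 regions other than Ceuta and Melilla add up to 33.83.
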
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Lorenzo, A.; Olivas, J.A.; Romero, F.P.; Serrano-Guerrero, J. Improving Education Predictions Through Reasoning by Analogy and Causal Relationships Applied to Smart Exploitation of Data. Electronics 2025, 14, 2339. https://doi.org/10.3390/electronics14122339