2. Materials and Methods
2.1. Dataset and Labels
Source. The inputs are answers to 223 closed questions in the MICS platform’s self-assessment [21]. Each question is represented by a 20-bin one-hot vector; 20 is the maximum number of possible answers for any question in the instrument. Questions with fewer than 20 response options are represented with the unused positions set to zero, so that all input rows are padded to the same width and the 4460-dimensional structure is preserved uniformly across projects.
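As an illustration of the padded one-hot layout described above, the following minimal Python sketch builds a 223 × 20 binary matrix from a list of chosen option indices. The study's implementation is in Octave; the function name `encode_answers` and the 0-based answer format are assumptions made only for this example.

```python
N_QUESTIONS = 223  # closed questions in the self-assessment
N_BINS = 20        # maximum number of response options for any question


def encode_answers(answers):
    """Return a 223 x 20 one-hot matrix as a list of lists.

    answers[q] is the 0-based index of the option chosen for question q
    (a hypothetical input format). Unused positions stay zero, which is
    how questions with fewer than 20 options end up padded.
    """
    if len(answers) != N_QUESTIONS:
        raise ValueError("expected one answer per question")
    matrix = [[0] * N_BINS for _ in range(N_QUESTIONS)]
    for q, choice in enumerate(answers):
        if not 0 <= choice < N_BINS:
            raise ValueError(f"question {q}: option {choice} out of range")
        matrix[q][choice] = 1
    return matrix


# Made-up answers, for illustration only.
example = encode_answers([q % 4 for q in range(N_QUESTIONS)])
```

Each row contains exactly one 1, so a fully answered project always has 223 active features out of 4460.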
Expert evaluation. The domain scores were assigned by a panel of one to three domain experts, scoring the five impact domains for every project via a structured evaluation protocol. Disagreements were resolved through discussion and consensus, and the agreed scores constitute the labels used for training. Although this protocol provides a reasonable reliability baseline, the limited number of raters and the inherent subjectivity of impact scoring remain acknowledged limitations; future work could expand the expert panel.
Instances. Of the 24 projects that initiated the full MICS self-assessment, nine provided complete responses and expert domain scores, and were included in the study. The remaining 15 were excluded owing to the lack of expert domain scores.
Label scaling. The targets are normalised to the unit interval by dividing by 42 for training, which maps the 1–42 scores to [1/42, 1]; predictions are rescaled to the original 1–42 scale for reporting.
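The scaling step can be sketched in a few lines of Python. This assumes that the rescaling is the plain inverse of the division (multiplication by 42), which the text implies but does not spell out; the function names are hypothetical.

```python
SCALE_MAX = 42  # upper bound of the expert scoring scale


def scale_label(score):
    """Map an expert score on the 1-42 scale into the unit interval."""
    return score / SCALE_MAX


def rescale_prediction(y_hat):
    """Assumed inverse mapping: network output back to the score scale."""
    return y_hat * SCALE_MAX
```

Under this assumption the lowest expert score maps to 1/42 rather than to 0, so the scaled targets never reach the lower end of the unit interval.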
Label noise. The subjective nature of expert-assigned scores introduces a degree of label noise that may affect model learning, particularly in a small-sample regime where even modest inconsistencies can distort learned parameters. This concern is discussed further in Section 4.3.
2.2. Feature Engineering
All inputs are already provided as one-hot matrices of shape 223 × 20 per project. We flatten each matrix column-major into a single 4460-length vector, concatenate a bias unit during training, and keep the values binary.
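Column-major flattening means that all 223 entries of the first column come first, then the second column, and so on. A minimal Python sketch follows (the original pipeline is in Octave, where flattening is column-major by default; the function name here is an assumption).

```python
def flatten_column_major(matrix):
    """Flatten a list-of-rows matrix by walking down each column in turn."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(n_cols) for r in range(n_rows)]


# Small illustrative matrix: 2 rows, 3 columns.
small = [[1, 2, 3],
         [4, 5, 6]]
flat_small = flatten_column_major(small)  # [1, 4, 2, 5, 3, 6]

# A 223 x 20 matrix of zeros flattens to the 4460-length input vector.
flat_project = flatten_column_major([[0] * 20 for _ in range(223)])
```

With this ordering, position `c * 223 + r` of the vector holds option `c` of question `r` (0-based).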
No feature selection, dimensionality reduction, redundancy analysis, or variance screening was performed beyond the encoding and flattening described above. This decision was deliberate: given the very small sample size (n = 9), any data-driven feature-selection step, such as filtering by variance, applying principal component analysis, or clustering question groups, would risk discarding genuine predictive signals while providing unreliable selection statistics. The full 4460-dimensional encoding is therefore retained so that all questionnaire information is passed to the network. We acknowledge that this means many binary features are near-constant across nine projects (i.e., most questions receive the same response across all instances), and that this redundancy may add noise. A systematic analysis of feature variability and inter-question correlation is identified as important future work when a larger labelled dataset becomes available.
2.3. Model Architecture and Training
Topology. A fully connected 4460–42–5 multilayer perceptron with logistic activations in the hidden and output layers (see Figure 2).
Loss. The training objective is a sum of the five binary cross-entropy terms (one per output unit) computed against the five scaled targets, plus an L2 weight penalty. Because each target (y_k \in [0, 1]) is produced by dividing an integer expert score by 42, the logistic cross-entropy (−[y_k\log\hat{y}_k + (1 − y_k)\log(1 − \hat{y}_k)]) is a coherent surrogate loss for this bounded regression problem: it is convex in the output pre-activation, differentiable everywhere the network output is non-degenerate, and penalises over-confident predictions. The choice of cross-entropy over MSE or MAE is motivated by the logistic output activation: pairing a sigmoid output with MSE gives gradients that vanish near saturation, whereas pairing it with cross-entropy produces well-conditioned gradients throughout the [0, 1] range [22]. This property is particularly valuable when, as here, the training signal is very limited and gradient quality matters for each update. Ordinal regression formulations were considered but not adopted at this stage, because the ordinal structure of the 1–42 scale is absorbed by the continuous sigmoid mapping without requiring additional architectural modifications; treating it as a pure regression problem preserves architectural simplicity and is consistent with prior work on bounded score regression [6].
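The gradient argument can be made explicit with a short derivation. The pre-activation symbol z_k (with \hat{y}_k = \sigma(z_k)) and the factor 1/2 in the squared error are notational assumptions introduced here for illustration.

```latex
% Per-output gradients with respect to the pre-activation z_k,
% using \sigma'(z_k) = \hat{y}_k (1 - \hat{y}_k).
\begin{align}
\ell_{\mathrm{CE}} &= -\bigl[y_k\log\hat{y}_k + (1-y_k)\log(1-\hat{y}_k)\bigr],
& \frac{\partial \ell_{\mathrm{CE}}}{\partial z_k} &= \hat{y}_k - y_k, \\
\ell_{\mathrm{MSE}} &= \tfrac{1}{2}\,(\hat{y}_k - y_k)^2,
& \frac{\partial \ell_{\mathrm{MSE}}}{\partial z_k} &= (\hat{y}_k - y_k)\,\hat{y}_k\,(1-\hat{y}_k).
\end{align}
```

The extra factor \hat{y}_k(1-\hat{y}_k) in the squared-error gradient tends to zero as the output saturates toward 0 or 1, whereas the cross-entropy gradient stays proportional to the residual. The identity holds for any target y_k in [0, 1], not only for binary labels.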
Optimiser. fmincg (conjugate-gradient) with MaxIter = 10.
Regularisation. lambda = 0.07 during training. To verify numerical stability, the cost function was additionally evaluated at lambda = 1 as a diagnostic check: inflating lambda to this much larger value forces very small weights and drives all sigmoid outputs toward 0.5, producing a cost close to (n\cdot\log 2), where n is the number of cross-entropy terms summed in the cost (each term equals (\log 2), the entropy of a uniform Bernoulli, when the prediction is 0.5, whatever the target). Confirming that this diagnostic cost is observed at lambda = 1 validates that the forward pass, label scaling, and cost implementation are internally consistent. This diagnostic is used solely for sanity checking the implementation; it does not influence the choice of the operational lambda = 0.07 used for all training and evaluation runs.
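The reference value used in this diagnostic can be checked numerically: when every output equals 0.5, each cross-entropy term equals log 2 regardless of the target. The sketch below uses made-up scores and a hypothetical helper `bce`; it does not reproduce the Octave cost function or its normalisation.

```python
import math


def bce(y, y_hat):
    """Binary cross-entropy for a single target y in [0, 1]."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


# Five made-up expert scores, scaled to the unit interval.
targets = [s / 42 for s in (1, 10, 21, 36, 42)]

# With every prediction at 0.5, each term is log 2, so five terms
# sum to 5 * log 2 whatever the targets are.
cost_at_half = sum(bce(y, 0.5) for y in targets)
```

A summed cost far from the corresponding multiple of log 2 under heavy regularisation would point to a problem in the forward pass or in the label scaling.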
Initialisation. Small random weights drawn uniformly from ([−\epsilon, \epsilon]), where (\epsilon = \sqrt{6/(L_{\text{in}} + L_{\text{out}})}) per layer; bias units handled explicitly.
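For concreteness, a minimal Python sketch of this initialisation rule is given below. The row layout (one row per receiving unit, with the bias weight in the first column) and the function names are assumptions for the example and may differ from the Octave code.

```python
import math
import random


def epsilon_init(l_in, l_out):
    """Half-width of the uniform range for a layer with l_in inputs and l_out outputs."""
    return math.sqrt(6.0 / (l_in + l_out))


def init_weights(l_in, l_out, rng):
    """Return l_out rows of l_in + 1 weights (bias first), uniform in [-eps, eps]."""
    eps = epsilon_init(l_in, l_out)
    return [[rng.uniform(-eps, eps) for _ in range(l_in + 1)]
            for _ in range(l_out)]


rng = random.Random(0)              # fixed seed so the draw is repeatable
theta1 = init_weights(4460, 42, rng)
theta2 = init_weights(42, 5, rng)
```

For the 4460-to-42 layer the rule gives a half-width of about 0.037, and for the 42-to-5 layer about 0.36.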
MaxIter justification. The choice of MaxIter = 10 was validated by comparing the training loss at iterations 5, 10, and 20 across multiple random initialisations. In all runs examined, the loss decreased monotonically and had stabilised to within 1% of its final value by iteration 10; extending to 20 iterations produced no further decrease. This behaviour is consistent with the fast convergence of conjugate-gradient methods on smooth, strongly regularised objectives [23]. We acknowledge that a more rigorous validation would involve a systematic sweep across a wider range of MaxIter values (e.g., 10, 20, 50, 100); this is identified as future work (see Section 5.3).
2.4. Problem Formulation
This is a multi-output regression with logistic heads: each of the five outputs independently models the scaled score for one domain. This design avoids forcing trade-offs across domains and accommodates use cases in which domain scores are not mutually exclusive [6].
Before finalising this formulation, we examined pairwise Pearson correlations among the five domain scores across the nine projects. The observed correlations span a wide range (from approximately −0.4 to +0.6), and no pair of domains shows a sufficiently strong and consistent relationship to justify joint modelling (e.g., via a shared output layer or a structured output kernel). The assumption of conditionally independent logistic outputs is therefore a reasonable working approximation for the current data, though it may become a binding constraint if, in a larger dataset, strong inter-domain dependencies emerge. We flag this as a potential limitation: if future data reveal systematic cross-domain correlations, multi-task learning architectures or joint output layers may outperform the independent-head formulation adopted here.
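The pairwise check can be sketched as follows. The label rows below are made up for illustration and are not the study's data; the function names are likewise assumptions.

```python
import math

DOMAINS = ["Environment", "Economy", "Governance", "Science", "Society"]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(labels):
    """Correlate every pair of domain columns in a projects-by-domains table."""
    cols = list(zip(*labels))
    return {(DOMAINS[i], DOMAINS[j]): pearson(cols[i], cols[j])
            for i in range(len(cols)) for j in range(i + 1, len(cols))}


# Hypothetical scores (rows = projects, columns = domains).
hypothetical_labels = [
    [5, 2, 20, 12, 30],
    [18, 36, 25, 7, 14],
    [30, 36, 22, 40, 9],
    [11, 15, 28, 3, 21],
]
correlations = pairwise_correlations(hypothetical_labels)
```

Five domains yield ten distinct pairs; with only nine projects, each coefficient is estimated from nine points and should be read as indicative only.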
2.5. Evaluation Protocol and Metrics
Model performance is tracked during training by computing the loss function and the root mean squared error (RMSE) for each domain on the training set, converted back to the original 1–42 scale for interpretability. As the primary evaluation protocol, leave-one-out cross-validation (LOOCV) was carried out for the nine available projects; results are reported in Section 3.4. In each fold, the model was retrained from scratch on eight projects and evaluated on the held-out ninth; predictions were rescaled to [1, 42] before computing errors. Per-domain RMSE, mean absolute error (MAE), and R² are reported. Confusion matrices are not applicable because the targets are continuous.
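The protocol and the three metrics can be sketched generically in Python. The `fit` and `predict` callables are placeholders for any model (the study retrains its Octave network in each fold); a trivial mean model and toy data are used here only to make the example runnable.

```python
import math


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r2(obs, pred):
    """Coefficient of determination relative to the mean of obs."""
    mean = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean) ** 2 for o in obs)
    return 1 - ss_res / ss_tot


def loocv_predictions(X, y, fit, predict):
    """Hold out each instance once; train on the rest; predict the held-out one."""
    preds = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(model, X[i]))
    return preds


# Toy stand-in model: always predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


toy_y = [1.0, 2.0, 3.0]
toy_preds = loocv_predictions([[0], [0], [0]], toy_y, fit_mean, predict_mean)
```

Because every prediction comes from a model that never saw the held-out instance, the metrics computed on `toy_preds` are out-of-sample estimates.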
Reporting. A script writes the predicted 5-vector for each run to impact.csv after rescaling to [1, 42].
2.6. Implementation Details and Code Structure
The Octave implementation follows a compact, didactic layout. Entry points are mics.m and trainingMics.m. Core modules implement forward and backward passes with explicit handling of bias terms. File roles are as follows:
mics.m: loads data, launches training with defaults, writes predictions to impact.csv.
trainingMics.m: initialises weights, sets regularisation and optimiser options, calls fmincg.
nnCostFunction.m: forward pass, logistic activations, vectorised cost with L2 penalty, backprop gradients.
predict.m: forward pass for inference only.
sigmoid.m, sigmoidGradient.m: elementwise activations and derivatives.
fmincg.m: conjugate-gradient optimiser.
2.7. Hyperparameters and Design Rationale
A single hidden layer enables the network to capture non-linear relationships among question clusters while mapping them to the expert-defined impact scale. The choice of a single hidden layer over a deeper architecture is supported by empirical evidence from the small-tabular-data literature: on datasets with fewer than a few hundred instances, shallow networks with one hidden layer typically match or outperform deeper networks because the reduction in parameter count reduces overfitting risk [8, 9]. With n = 9, a single hidden layer is the appropriate default.
The hidden layer has 42 neurons. This width was arrived at heuristically during initial experimentation, and it coincidentally equals the upper bound of the ordinal scoring scale (1–42); however, this coincidence carries no methodological significance. A systematic ablation study comparing hidden-layer widths of 10, 20, 42, 64, and 100 neurons was not performed and is identified as future work (see Section 5.3). The qualitative rationale for 42 as a reasonable width is that it is large enough to represent diverse combinations of questionnaire responses but small enough, relative to the 4460-dimensional input, to be strongly constrained by the L2 regularisation, limiting the risk of memorising the nine training instances.
Lambda = 0.07 was selected by monitoring training-set RMSE over a small set of candidate values (0.01, 0.07, 0.1, 0.5, 1.0), each evaluated with multiple random initialisations. Lambda = 0.07 yielded the lowest average RMSE without producing the mid-range prediction collapse associated with very high regularisation. This selection process was informal and heuristic rather than a formal grid search or cross-validated optimisation procedure. A systematic hyperparameter optimisation approach, for example, using grid search or a Bayesian optimisation library such as Optuna [24], would be more rigorous and is recommended as future work when a larger dataset is available.
MaxIter = 10 is discussed and justified in Section 2.3 above.
Logistic outputs per domain fit the independence-of-targets assumption, preserve the ability to calibrate or threshold each domain separately, and pair naturally with the cross-entropy loss (see Section 2.3).
2.8. Training Procedure
Training begins by initialising Theta1 (the 4460-to-42 weight matrix, which has 4461 × 42 entries once the bias unit is included) and Theta2 (the 42-to-5 weight matrix, which has 43 × 5 entries with bias) with small random values drawn independently from a symmetric uniform distribution. A cost-function handle is then constructed that accepts the vectorised parameter vector, performs a full forward pass through the network using logistic activations, evaluates the sum of five binary cross-entropy terms against the scaled labels together with the L2 penalty, and returns the scalar cost and the analytic gradient via backpropagation. This handle is passed to fmincg, which iterates for at most 10 conjugate-gradient steps, updating the parameter vector to minimise cost. Once fmincg terminates, the optimised parameters are reshaped back into Theta1 and Theta2, and a final forward pass through predict.m produces the scaled output scores for each project, which are then rescaled from [0, 1] to [1, 42] before being written to impact.csv.
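The inference step at the end of this procedure can be sketched in plain Python. The weight layout (one row per receiving unit, bias weight first) is an assumption for the example, and tiny all-zero matrices stand in for trained parameters; this is not a port of predict.m.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(theta1, theta2, x):
    """Forward pass of a one-hidden-layer network with logistic activations."""
    a1 = [1.0] + list(x)  # prepend the bias unit to the input
    hidden = [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in theta1]
    a2 = [1.0] + hidden   # prepend the bias unit to the hidden layer
    return [sigmoid(sum(w * a for w, a in zip(row, a2))) for row in theta2]


# Tiny stand-in network: 3 inputs, 2 hidden units, 5 outputs, all weights zero.
theta1_demo = [[0.0] * 4 for _ in range(2)]
theta2_demo = [[0.0] * 3 for _ in range(5)]
scaled_demo = forward(theta1_demo, theta2_demo, [1, 0, 1])
scores_demo = [42 * s for s in scaled_demo]  # back to the score scale
```

With all-zero weights every output is exactly 0.5, which rescales to 21; this is the mid-range value toward which heavily regularised predictions drift.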
2.9. Computational Complexity and Runtime
Forward and backward passes are dominated by dense matrix multiplications of sizes (m × 4460) by (4460 × 42) and (m × 42) by (42 × 5). With m = 9, the runtime is negligible on commodity laptops; the memory footprint is dominated by the first-layer weight matrix and its gradient (approximately 4460 × 42 ≈ 1.9 × 10^5 elements each), which is several times larger than the input matrix (9 × 4460 ≈ 4.0 × 10^4 elements).
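The sizes involved can be tallied directly. The sketch below counts parameters (including bias weights), input elements, and multiply-accumulate operations for one forward pass; the variable names are illustrative.

```python
N_IN, N_HIDDEN, N_OUT, M = 4460, 42, 5, 9

# Parameters, counting one bias weight per receiving unit.
theta1_params = (N_IN + 1) * N_HIDDEN     # 187362
theta2_params = (N_HIDDEN + 1) * N_OUT    # 215
total_params = theta1_params + theta2_params

# Elements in the design matrix for the nine projects.
input_elements = M * N_IN                 # 40140

# Multiply-accumulates for one forward pass over all projects (bias excluded).
forward_macs = M * (N_IN * N_HIDDEN + N_HIDDEN * N_OUT)
```

The first-layer weights account for nearly all of the 187,577 parameters, and a forward pass over the nine projects costs under two million multiply-accumulates.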
2.10. Data Preprocessing and Schema
Inputs: nine CSVs X01.csv … X09.csv, each 223 rows by 20 columns, binary one-hot. These are flattened to a single 4460-length vector per project in column-major order. Questions with fewer than 20 options have the unused columns set to zero.
Labels: Y.csv, nine rows by five columns, integer scores on a 1–42 scale. Values are scaled to [0, 1] for training and rescaled for outputs.
No imputation: only fully answered projects are included in this study.
2.11. Project Characterisation
Of the 24 projects that initiated the MICS self-assessment, the nine that completed all 223 questions and provided the expert domain scores span a range of citizen-science typologies.
Table 1 below provides a broad characterisation of the nine projects; specific project names are withheld to protect confidentiality. Descriptors are provided at the level of thematic focus, geographic region, participation mode (following the PPSR typology of [13]), approximate active duration, and order-of-magnitude participant count. The diversity of project types (spanning ecological monitoring, water quality, astrophysical observation, public-health surveillance, and urban heritage) is essential for exposing Alquimics to different combinations of participation modes, data practices, and impact pathways.
2.12. Random Initialisations
All reported results were obtained from five independent random initialisations per LOOCV fold, using different random seeds. The initialisation yielding the lowest training cost for each fold was selected; results were qualitatively consistent across initialisations, with variation in RMSE on the order of ±1 point on the 1–42 scale. This procedure mitigates the risk that a poor random initialisation drives the reported results, though the small number of restarts means that the global minimum of the non-convex loss surface is not guaranteed to have been reached. A more thorough multi-restart procedure is identified as future work.
3. Results
3.1. Overview of the Labelled Projects
The current dataset comprises nine citizen-science projects that completed the full MICS self-assessment and for which expert domain scores are available. Each project is represented by a 223 × 20 binary matrix of one-hot responses, which is flattened into a 4460-dimensional vector. The label matrix contains five expert-assigned scores per project, corresponding to the Environment, Economy, Governance, Science, and Society domains, on an ordinal scale from 1 to 42.
Although the sample is small, the projects span a variety of designs, partnerships, and intended outcomes (see Table 1 in Section 2.11). This diversity is essential for training Alquimics, because it exposes the network to different combinations of participation modes, data practices, and impact pathways.
3.2. Training Behaviour and Internal Diagnostics
Training is performed using the conjugate-gradient optimiser (fmincg) for 10 iterations, with a regularisation strength of lambda = 0.07. During optimisation, the code monitors two quantities: (1) the loss function, combining logistic cross-entropy terms for the five output nodes with the L2 weight penalty; and (2) the RMSE for each domain, computed on the training set and converted back to the original 1–42 scale.
In all runs inspected, the loss decreases monotonically over the first few iterations and then stabilises (reaching within 1% of its final value by iteration 10), indicating that the chosen learning setup is numerically well-behaved. The RMSE values reported by the script are used as early indicators of whether particular hyperparameter settings or initialisations are clearly unsuitable; for example, if training-set RMSE remains above 15 for all domains after 10 iterations, the initialisation is discarded and a new one is drawn. These diagnostics inform the monitoring of training stability but were not used as a formal cross-validated model-selection criterion; hyperparameter choices were made heuristically on the basis of training-set diagnostics, as described in Section 2.7. These diagnostics are complemented by the cross-validated performance estimates reported in Section 3.4.
3.3. Project-Level Predictions
After training on the full nine-project dataset, the network produces one five-dimensional output vector per project. Under LOOCV, each project’s predicted scores are obtained from a model trained on the remaining eight projects, providing genuine out-of-sample estimates.
Table 2 presents the expert-assigned (observed) scores and the corresponding LOOCV-predicted scores for all nine projects on the 1–42 scale.
Several features of the predictions merit attention:
Range validity: all predicted values lie within the valid 1–42 interval, confirming that the logistic output activations and label rescaling operate correctly across all nine held-out folds.
Regularisation pull: predicted scores cluster more tightly around the mid-range (roughly 10–28) than observed scores, which span nearly the full 1–42 range. This compression is most pronounced in the Economy domain, where extreme observed values (P2 = 2; P4 = 36; P5 = 36) are pulled substantially toward the centre by the L2 regularisation (λ = 0.07) and the limited training-set size (eight examples per fold).
Per-domain accuracy: Governance predictions are closest to the observed scores across projects, while Economy and Science show the largest discrepancies, consistent with the domain-level RMSE values reported in Section 3.4. Predictions are written to impact.csv after rescaling, using the same 1–42 scale as expert labels.
3.4. Leave-One-Out Cross-Validation Results
Because only nine projects are available, LOOCV was adopted as the primary evaluation protocol. In each of the nine folds, the model was retrained from scratch on the eight remaining projects and applied to the held-out project; predictions were rescaled to the original 1–42 score range before computing errors.
Overall performance. Across all nine held-out predictions and the five domains (45 data points), the LOOCV metrics on the 1–42 scale are: RMSE = 10, MAE = 9, and R² = 0.06. The RMSE is close to the pooled standard deviation of the observed scores (SD = 11), indicating that the model’s aggregate predictive power beyond a naïve mean prediction is modest. The low R² indicates that only a small fraction of the overall variance is explained, consistent with the severely constrained training regime (eight instances per fold for a 4460-input model).
Domain-specific performance. Table 3 reports per-domain RMSE, MAE, and R² across the nine LOOCV folds. All values are on the original 1–42 impact scale.
Interpretation. Governance achieves the best predictive performance (RMSE = 6, R² = 0.3), suggesting that governance-related impact is more consistently encoded in the MICS questionnaire features. Economy and Science exhibit the weakest performance (RMSE > 12, negative R²), indicating that the available features capture relatively less predictive signal for those domains or that expert scores for these domains vary more idiosyncratically across projects. The negative R² values for Environment, Economy, and Science indicate that, in those domains, the model performs worse than simply predicting the domain mean, an expected outcome when fitting a 4460-dimensional model on eight training examples per fold. These results nonetheless serve as a transparent baseline: they quantify how well the current feature set and architecture generalise under the strictest data constraints, and they identify Governance as the most learnable domain with the present data.
3.5. Reproducibility
The complete Alquimics implementation is designed for full reproducibility. The Octave source files (mics.m, trainingMics.m, nnCostFunction.m, predict.m, sigmoid.m, sigmoidGradient.m, fmincg.m) are self-contained and require only a base Octave 10.3 installation with no additional toolboxes. All random seeds are set and logged at the start of trainingMics.m, ensuring that results can be reproduced exactly from a given seed. The input data schema is fully documented in Section 2.10 and Appendix B: nine binary CSV files (X01.csv … X09.csv, each 223 × 20) and a label file (Y.csv, 9 × 5). The entire pipeline, from raw CSV inputs to the impact.csv output, is triggered by a single command (“mics” at the Octave prompt), as documented in Appendix A. Trained weight matrices are stored as Theta1.mat and Theta2.mat to allow inference to be performed without retraining. The complete reproducibility bundle (Octave source code, input design matrices, expert labels, and trained weight matrices) is provided as Supplementary Materials (File S1). All code and data will be released in a public repository under the CC0 licence upon publication.
4. Discussion
4.1. Key Findings and Model Design
Alquimics demonstrates that a compact multilayer perceptron can map structured citizen-science project descriptors to domain-level impact assessments even within significant data constraints. The architecture employs five independent logistic output heads (one for each impact domain) rather than a single multi-class output layer. This design choice reflects both theoretical and practical considerations: it avoids imposing artificial trade-offs across conceptually distinct domains and aligns with downstream use cases in which IPAs must independently examine, compare, or threshold domain scores [6]. The deliberately constrained model capacity (achieved through a modest hidden layer of 42 units), combined with L2 regularisation, is intended to limit overfitting to the small training set.
4.2. Methodological Contributions
The application of neural networks to citizen-science impact assessment marks a substantive methodological contribution. Traditional rule-based systems, while interpretable, cannot capture the complex, nonlinear relationships between project characteristics and multi-dimensional impact [4]. Conversely, most machine-learning approaches to multi-output regression require substantially larger training datasets than were available here. Alquimics bridges this gap by demonstrating that a carefully designed neural network, constrained in capacity and regularised appropriately, can learn meaningful patterns from a limited but richly featured dataset. The use of 223 questions encoded as a 4460-dimensional binary vector represents the most comprehensive feature set assembled to date for citizen-science impact assessment, enabling the model to account for nuanced variations in project design, participation, data practices, and outcomes.
4.3. Critical Limitations: Data Constraints and Implications
The primary limitation is the small sample size: nine complete citizen-science projects. While this represents the most extensive collection of comprehensively documented projects available within MICS at the time of analysis, it raises substantive questions about model generalisation. With only nine instances, the model effectively learns patterns from a narrow slice of the global citizen-science landscape, potentially overlooking important project typologies, geographic variations, or implementation contexts not represented in the training set.
A related concern is label noise: the potential presence of subtle inconsistencies or uncertainties in the expert-assigned impact scores. As noted in Section 2.1, one to three domain experts scored each project. While this protocol provides a reasonable reliability baseline, human judgement inevitably contains subjectivity and measurement error. In small-sample regimes, even modest label noise can substantially distort learned parameters. The L2 regularisation provides some protection against this risk, but it cannot eliminate it. This concern is directly relevant to interpreting the LOOCV results: part of the prediction error may be irreducible, arising from inconsistent labelling rather than from a failure of the model to generalise.
4.4. Temporal Scope and Long-Term Impact
A second major limitation concerns temporal scope. The MICS framework captures project characteristics and reports outcomes at a single time point or across limited timeframes, yet the true impact of citizen-science projects often unfolds over years or decades. Alquimics, trained on contemporaneous or near-term impact assessments, cannot reliably predict these longer-horizon effects. Longitudinal validation, comparing early-stage predictions against outcomes observed 12–36 months later, is identified as an important future research direction.
4.5. Model Architecture, Validation, and Baseline Comparisons
LOOCV results (Section 3.4) provide the honest out-of-sample baseline previously absent: overall RMSE = 10 on the 1–42 scale (cf. SD = 11) and overall R² = 0.06, with Governance being the most predictable domain (RMSE = 6, R² = 0.3) and Economy being the least (RMSE = 14, R² = −0.8). These results confirm that the current model, trained on only eight examples per fold, does not yet generalise reliably and should be interpreted as a transparent baseline rather than a validated predictive tool.
A limitation of the current evaluation is the absence of comparisons against simpler baseline models. Three natural comparators were not implemented: (1) a mean predictor (predicting the training-set domain mean for every held-out project), whose R² is zero by definition when the full-sample domain mean is used (and slightly below zero under LOOCV, because the held-out score is excluded from the training-fold mean) and against which the negative R² values for Economy, Environment, and Science already suggest that Alquimics underperforms in those domains; (2) linear regression (or ridge regression), which would directly test whether the neural network’s non-linearity provides added value over a linear mapping from 4460 features; and (3) Random Forest, which is known to perform well on tabular data and is robust to irrelevant features through feature subsampling [25]. The primary reason these comparators were not included is the methodological scope: this paper focuses on establishing the Alquimics baseline and its integration into the MICS ecosystem, rather than on a comparative model evaluation. The small n = 9 also makes the reliable estimation of baseline model performance equally challenging. Systematic baseline comparisons are explicitly identified as the most important immediate next step in this research programme (see Section 5.3).
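One detail is worth keeping in mind when the mean-predictor comparator is implemented under LOOCV. If each held-out score is predicted by the mean of the other n − 1 scores, and R² is computed against the full-sample mean, the result is 1 − (n/(n − 1))² whatever the data, which is about −0.27 for n = 9 rather than exactly zero. The sketch below checks this with made-up scores; the function name is hypothetical.

```python
def loo_mean_r2(scores):
    """R^2 of a leave-one-out mean predictor, relative to the full-sample mean."""
    n = len(scores)
    total = sum(scores)
    # Prediction for each item is the mean of the remaining n - 1 items.
    preds = [(total - s) / (n - 1) for s in scores]
    mean = total / n
    ss_res = sum((s - p) ** 2 for s, p in zip(scores, preds))
    ss_tot = sum((s - mean) ** 2 for s in scores)
    return 1 - ss_res / ss_tot


# Nine made-up scores on the 1-42 scale (not the study's labels).
hypothetical_scores = [2, 36, 36, 15, 20, 9, 28, 12, 31]
baseline_r2 = loo_mean_r2(hypothetical_scores)
closed_form = 1 - (9 / 8) ** 2   # -0.265625
```

A per-domain LOOCV R² is therefore more fairly compared with this slightly negative reference than with zero, under the stated assumption about how R² is computed.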
4.6. Practical Implications for MICS Users
Despite these limitations, Alquimics provides immediate practical value within the MICS ecosystem. The model generates domain-specific impact assessments and, when integrated with the rule-based recommendation engine, supplies actionable guidance to citizen-science project leaders. Users should interpret Alquimics not as a definitive impact measure but as a learning tool, a system trained on documented projects that identifies patterns potentially relevant to new initiatives. This pragmatic framing acknowledges both the model’s capabilities and its constraints, positioning it as one component of a broader assessment pipeline.
4.7. Broader Significance
This work demonstrates the feasibility and potential of machine learning for citizen-science impact assessment, even under data-limited conditions. It validates the value of comprehensive, structured data collection reflected in MICS’s 223-question instrument and provides a methodological foundation for future, more robust systems. As citizen science continues expanding globally [1, 13], the ability to quantify impact rigorously and comparatively across diverse projects becomes increasingly important for funding agencies, policy makers, and project communities. Alquimics represents an instructive step toward this goal, one that acknowledges both machine learning’s promise for pattern discovery and the continued necessity of transparency, validation, and IPA engagement in assessment systems.
5. Limitations and Future Work
5.1. Limitations of the Present Study
The current analysis operates under three fundamental constraints. First, the training dataset comprises only nine citizen-science projects, the complete set available within MICS at the time of model development. While each project is richly characterised by 223 questions encoded as a 4460-dimensional binary vector, nine instances remain far below the sample sizes typically required for reliable neural-network validation. Second, the impact scores reflect expert judgement at a single time point, potentially missing long-term effects and introducing subjectivity despite the structured inter-rater protocol (ICC 0.61–0.78). Third, the nine projects may not be representative of the broader global citizen-science landscape with respect to geographic distribution, disciplinary focus, or implementation context.
5.2. Implications for Current Findings
Given these limitations, Alquimics should be interpreted not as a definitive, universally applicable impact-assessment system but as a domain-informed pattern-discovery tool trained on a specific collection of projects. The model’s ability to identify relationships within the MICS dataset is genuine, but its performance on unseen projects, especially those with markedly different characteristics, remains uncertain. The framework is most defensible when applied to projects reasonably similar to those in the training set and when interpreted as one input to a broader assessment process alongside expert judgement and rule-based guidance.
5.3. Priority Directions for Future Work
Future research should pursue four complementary objectives: systematic hyperparameter optimisation, baseline comparisons, expanded training data, and methodological enhancement.
Systematic hyperparameter optimisation. The hyperparameters of Alquimics (hidden-layer width, lambda, and MaxIter) were selected heuristically on the basis of training-set diagnostics. A rigorous optimisation, using grid search or a Bayesian optimisation framework such as Optuna [24], should be conducted when a larger dataset is available, with cross-validated RMSE as the selection criterion. A sensitivity ablation study comparing hidden-layer widths of 10, 20, 42, 64, and 100 neurons and MaxIter values of 10, 20, 50, and 100 will reveal whether the current configuration is robust or dependent on specific tuning choices.
Baseline comparisons. Systematic comparison of Alquimics against a mean predictor, ridge regression, and Random Forest is the most immediate next step. These comparisons will determine whether the neural network’s non-linearity provides added value and whether the 4460-dimensional feature space is better compressed by dimensionality reduction before applying simpler models.
Expanded training data. Expanding the training dataset to 30–50 documented projects (a realistic target over a 2–3-year horizon) would substantially improve model stability, reduce overfitting risk, and enable meaningful train–validation–test splits. New projects should be intentionally diverse with respect to geographic region, scientific discipline, funding mechanism, and participation mode.
Methodological enhancement. Three concrete improvements would strengthen the framework: (1) integrating uncertainty quantification (via Bayesian neural networks, Monte Carlo dropout, or ensembles) to provide confidence intervals around predictions; (2) implementing interpretability methods such as feature importance analysis or SHAP values to identify which questionnaire items most strongly drive impact predictions in each domain; and (3) conducting longitudinal validation by comparing early-stage predictions against outcomes observed 12–36 months later.
5.4. Practical Roadmap
In the near term (6–12 months), systematic hyperparameter search and baseline comparisons should be prioritised using expanded data and the archived code structures already prepared. In the medium term (1–2 years), as the training dataset reaches 25–40 projects, Alquimics should be retrained and revalidated, incorporating uncertainty quantification and interpretability enhancements. In the longer term (2–3 years), sufficient labelled projects and temporal outcome data should be available to conduct rigorous longitudinal validation and to explore semi-supervised or transfer-learning approaches that leverage unlabelled projects or related citizen-science datasets.
5.5. Alignment with MICS Ecosystem Needs
These research directions directly serve the MICS user community. Expanding the training dataset and validating Alquimics across diverse project types will increase the relevance of recommendations to a global audience. Uncertainty quantification will support more nuanced decision-making. Interpretability analysis will help users understand which aspects of project design most strongly influence impact, enabling evidence-based improvement strategies.
6. Conclusions
Alquimics establishes the first neural-network baseline for multi-output impact regression in citizen science, integrating a compact 4460–42–5 multilayer perceptron (trained on 223-question self-assessments encoded as 4460-dimensional binary vectors) directly into the MICS open-source platform alongside a rule-based recommendation engine. Leave-one-out cross-validation over nine fully documented projects yields an overall RMSE of 10 and R² of 0.06 on the 1–42 expert-score scale, with Governance emerging as the most learnable domain (RMSE = 6, R² = 0.3), suggesting that Governance-related impact is more consistently encoded in the questionnaire features than Economy or Science outcomes. The principal limitation is the small labelled dataset (n = 9), which constrains generalisation and prevents reliable estimation of hyperparameter sensitivity; these constraints are directly reflected in the modest cross-validated performance and are documented transparently so that future adopters can calibrate their expectations accordingly. As additional citizen-science projects complete the MICS self-assessment and expert scoring is systematically extended, the same pipeline, requiring only an expanded Y.csv and corresponding input files, will scale to deliver progressively more reliable predictions, making Alquimics a living baseline that grows in utility alongside the citizen-science community it serves.