Article

Design and Implementation of a Three-Layer Backpropagation Neural Network for Multi-Output Regression in Citizen-Science Impact Assessment

1 Earthwatch, Oxford OX1 1BT, UK
2 Faculty of Biology, Medicine and Health, University of Manchester, Manchester M13 9PL, UK
3 Facultat de Biologia, University of Barcelona, 08028 Barcelona, Spain
* Author to whom correspondence should be addressed.
AI 2026, 7(5), 178; https://doi.org/10.3390/ai7050178
Submission received: 24 March 2026 / Revised: 6 May 2026 / Accepted: 13 May 2026 / Published: 21 May 2026

Abstract

Measuring the impact of citizen-science projects is hard because inputs are heterogeneous, mostly categorical, and sparse. We present Alquimics, a compact supervised neural network trained on one-hot project descriptors to predict impacts across five domains (Environment, Economy, Governance, Science, and Society). Each project is encoded as a binary vector of length 4460 (223 questions × 20 options, flattened). The network employs a 4460–42–5 topology with logistic activations throughout; labels consist of five continuous targets in [0, 1] obtained by scaling expert domain scores in [1, 42]. We implement L2-regularised training in Octave using fmincg with MaxIter = 10 and lambda = 0.07. Leave-one-out cross-validation (LOOCV) over nine projects yields an overall RMSE = 10 and R² = 0.06 on the 1–42 scale, with Governance being the most predictable domain (RMSE = 6, R² = 0.3). We document the entire data pipeline, objective, and implementation, provide a minimal reproducible script, and discuss limitations arising from the small dataset (n = 9 projects). This establishes a transparent baseline that complements rule-based scoring and can be expanded as more labelled projects become available.


1. Introduction

1.1. Background and Motivation

Citizen science has expanded dramatically across environmental monitoring, biodiversity, astronomy, and beyond [1]. Yet despite widespread adoption, quantitative evidence of its impact often lags significantly behind its growth. Assessing this impact is inherently challenging: it involves complex interactions among numerous factors spanning participation patterns, data practices, project designs, and outcomes that are rarely captured by simple rule-based systems alone. Several frameworks have emerged to characterise citizen-science contributions to society, from biodiversity databases, such as GBIF, to participatory sensing programmes coordinated by large research institutions [2,3]. Despite these advances, systematic and quantitative impact scoring across multiple domains remains an open problem.

1.2. The Challenge of Impact Assessment

Traditional rule-based scoring systems offer interpretability, allowing researchers to understand why a given assessment was reached. However, they struggle with the nonlinear interactions and subtle patterns that characterise real-world citizen-science projects. Machine learning approaches can complement these deterministic methods by identifying signals buried in numerous weak indicators [4,5]. This capability is essential when a comprehensive impact assessment requires the consideration of hundreds of input features that capture the full complexity of project design and implementation. Multi-output prediction settings, where a single model simultaneously estimates several dependent variables, have been studied extensively in the neural-network literature [6,7], yet their application to citizen-science evaluation remains limited.
Shallow networks with a single hidden layer have demonstrated competitive or superior performance to deeper architectures on small, tabular datasets [8,9]. When training instances number in the tens rather than the thousands, the risk of overfitting associated with deep models outweighs their representational advantages, making compact multilayer perceptrons a well-motivated default [10]. This observation directly motivates the architectural choice made in the present work.

1.3. The MICS Framework and Alquimics

The EU-funded MICS project (Monitoring for Impact in Citizen Science) [11,12] was established to address this assessment challenge through a structured, comprehensive approach. MICS collects detailed project descriptors via a battery of closed-ended questions spanning project design, participation dynamics, data practices, and outcomes (see Figure 1 for a representative example question). Five impact domains frame the assessment: Environment, Economy, Governance, Science, and Society. This paper describes Alquimics, a feedforward neural network developed within MICS to map project descriptors (captured via 223 structured input features) to the five impact domains. Alquimics is integrated into MICS’s open-source web application, alongside a rule-based recommendation engine that generates personalised suggestions for improving impact.
Citizen-science evaluation frameworks beyond MICS have also been proposed in the literature. The Public Participation in Scientific Research (PPSR) model [13] distinguishes contributory, collaborative, and co-created project typologies, each carrying distinct implications for data quality and community engagement. Wilson et al. [14] developed a structured impact rubric for assessing policy-level outcomes of citizen-science projects in the United Kingdom, while Hecker et al. [15] provided a comprehensive European synthesis of participation models and their associated assessment challenges. These frameworks inform the domain structure adopted in the present work.

1.4. Methodological Approach

The development of MICS and Alquimics has required navigating a fundamental design choice: which platform components should rely on handcrafted, explicit rules versus machine-learning models. Handcrafting provides transparency and domain control. Machine learning enables systems to automatically discover patterns in data, making it particularly well-suited to tasks involving noisy, multidimensional inputs where neural networks can uncover relationships that resist manual articulation [16]. The MICS approach combines both: Alquimics employs supervised neural-network learning to capture complex patterns in project-impact relationships, while the broader MICS platform retains rule-based components to support interpretability and recommendations for interested parties and actors (IPAs) [17].
Hybrid architectures combining rule-based and machine-learning components have been deployed successfully in other applied domains. Villena Román et al. [17] demonstrated their effectiveness for text categorisation; analogous hybrid pipelines have been explored for clinical decision support [18] and environmental monitoring [19], lending further support to the design philosophy adopted here. The multi-output regression formulation is consistent with recent surveys of neural-network approaches to simultaneous prediction of correlated continuous targets [6,20].

1.5. Research Scope and Contribution

This paper documents the dataset, model architecture, training procedure, and validation results for Alquimics. The primary innovation lies in applying neural networks to the most extensive feature set assembled to date for citizen-science impact assessment, comprising 223 questions encoded as a 4460-dimensional binary vector. The training dataset comprised nine complete instances (citizen-science projects). The resulting model represents an early, substantial step toward automated impact quantification in citizen science. We position this work within a pragmatic assessment pipeline that recognises both the strengths of machine learning for pattern discovery and the continued necessity of human-readable interpretability for IPA engagement.

2. Materials and Methods

2.1. Dataset and Labels

Source. The inputs are answers to 223 closed questions in the MICS platform’s self-assessment [21]. Each question is represented by a 20-bin one-hot vector; 20 is the maximum number of possible answers for any question in the instrument. Questions with fewer than 20 response options are represented with the unused positions set to zero, so that all input rows are padded to the same width and the 4460-dimensional structure is preserved uniformly across projects.
Expert evaluation. The domain scores were assigned by a panel of one to three domain experts, scoring the five impact domains for every project via a structured evaluation protocol. Disagreements were resolved through discussion and consensus, and the agreed scores constitute the labels used for training. Although this protocol provides a reasonable reliability baseline, the limited number of raters and the inherent subjectivity of impact scoring remain acknowledged limitations; future work could expand the expert panel.
Instances. Of the 24 projects that initiated the full MICS self-assessment, nine provided complete responses and expert domain scores, and were included in the study. The remaining 15 were excluded owing to the lack of expert domain scores.
Label scaling. The targets are normalised to [0, 1] by dividing by 42 for training; predictions are rescaled to [1, 42] for reporting.
Label noise. The subjective nature of expert-assigned scores introduces a degree of label noise that may affect model learning, particularly in a small-sample regime where even modest inconsistencies can distort learned parameters. This concern is discussed further in Section 4.3.
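The label scaling described above can be illustrated with a minimal Python sketch (not the authors' Octave code). The function names are hypothetical, and the sketch assumes that rescaling is a plain multiplication by 42; whether the original script also clips predictions to the 1–42 interval is not stated in the text.

```python
SCALE_MAX = 42  # upper bound of the expert scoring scale


def scale_labels(scores):
    """Map expert scores on the 1-42 scale to training targets in [0, 1]."""
    return [s / SCALE_MAX for s in scores]


def rescale_predictions(outputs):
    """Map network outputs in (0, 1) back to the reporting scale.

    Assumption: a plain multiplication by 42, with no clipping.
    """
    return [o * SCALE_MAX for o in outputs]


expert_scores = [2, 21, 36, 42]          # made-up scores for illustration
targets = scale_labels(expert_scores)
print(targets)
print(rescale_predictions(targets))
```

Because the lowest possible score is 1, the smallest attainable target under this scaling is 1/42 rather than 0.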

2.2. Feature Engineering

All inputs are already provided as one-hot matrices of shape 223 × 20 per project. We flatten each matrix in column-major order into a single 4460-length vector, concatenate a bias unit during training, and keep the values binary.
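The encoding and flattening steps can be sketched as follows. This is an illustrative Python version, not the authors' Octave code; the `answers` input format (one 0-based option index per question) and the function names are assumptions made for the example.

```python
N_QUESTIONS = 223  # closed questions in the self-assessment
N_OPTIONS = 20     # maximum number of options for any question


def encode_project(answers):
    """Build the 223 x 20 one-hot matrix from 0-based option indices.

    `answers` is a hypothetical input format: one integer per question.
    Columns beyond a question's real option count simply stay at zero.
    """
    if len(answers) != N_QUESTIONS:
        raise ValueError("expected one answer per question")
    matrix = [[0] * N_OPTIONS for _ in range(N_QUESTIONS)]
    for question, option in enumerate(answers):
        matrix[question][option] = 1
    return matrix


def flatten_column_major(matrix):
    """Flatten column by column, which is what Octave's X(:) does."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


answers = [q % 4 for q in range(N_QUESTIONS)]  # made-up answers
x = flatten_column_major(encode_project(answers))
print(len(x), sum(x))
```

In column-major order, the entry for question q and option o lands at position o × 223 + q (0-based), so all first-option indicators come first, then all second-option indicators, and so on.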
No feature selection, dimensionality reduction, redundancy analysis, or variance screening was performed beyond the encoding and flattening described above. This decision was deliberate: given the very small sample size (n = 9), any data-driven feature-selection step, such as filtering by variance, applying principal component analysis, or clustering question groups, would risk discarding genuine predictive signals while providing unreliable selection statistics. The full 4460-dimensional encoding is therefore retained so that all questionnaire information is passed to the network. We acknowledge that this means many binary features are near-constant across nine projects (i.e., most questions receive the same response across all instances), and that this redundancy may add noise. A systematic analysis of feature variability and inter-question correlation is identified as important future work when a larger labelled dataset becomes available.

2.3. Model Architecture and Training

Topology. A fully connected 4460–42–5 multilayer perceptron with logistic activations in the hidden and output layers (see Figure 2).
Loss. The training objective is the sum of five binary cross-entropy terms (one per output unit), computed against the five scaled targets, plus an L2 weight penalty. Because each target \(y_k \in [0, 1]\) is produced by dividing an integer expert score by 42, the logistic cross-entropy \(-[y_k \log \hat{y}_k + (1 - y_k) \log(1 - \hat{y}_k)]\) is a coherent surrogate loss for this bounded regression problem: it is convex in the prediction \(\hat{y}_k\), differentiable wherever the network output is non-degenerate, and penalises over-confident predictions. The choice of cross-entropy over MSE or MAE is motivated by the logistic output activation: pairing a sigmoid output with MSE gives gradients that vanish near saturation, whereas pairing it with cross-entropy produces well-conditioned gradients throughout the [0, 1] range [22]. This property is particularly valuable when, as here, the training signal is very limited and gradient quality matters for each update. Ordinal regression formulations were considered but not adopted at this stage, because the ordinal structure of the 1–42 scale is absorbed by the continuous sigmoid mapping without requiring additional architectural modifications; treating it as a pure regression problem preserves architectural simplicity and is consistent with prior work on bounded score regression [6].
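The gradient argument in the Loss paragraph can be checked numerically. The sketch below is illustrative Python (not the authors' code) and considers a single output unit with pre-activation z: for cross-entropy the derivative with respect to z simplifies to the prediction minus the target, whereas for squared error it carries an extra factor that shrinks towards zero when the sigmoid saturates. The chosen values of y and z are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y, y_hat):
    """Binary cross-entropy against a continuous target y in [0, 1]."""
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


def grad_ce_wrt_z(y, z):
    """d(cross-entropy)/dz for y_hat = sigmoid(z); simplifies to y_hat - y."""
    return sigmoid(z) - y


def grad_mse_wrt_z(y, z):
    """d((y_hat - y)^2)/dz; carries the extra factor y_hat * (1 - y_hat)."""
    y_hat = sigmoid(z)
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)


y = 36 / 42  # a scaled expert score (made-up example)
z = -6.0     # a saturated pre-activation that is badly wrong for this target
print(grad_ce_wrt_z(y, z), grad_mse_wrt_z(y, z))
```

At this saturated point the cross-entropy gradient stays close to the full prediction error, while the squared-error gradient is smaller by roughly two orders of magnitude.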
Optimiser. fmincg (conjugate-gradient) with MaxIter = 10.
Regularisation. lambda = 0.07 during training. To verify numerical stability, the cost function was additionally evaluated at lambda = 1 as a diagnostic check: inflating lambda to this much larger value forces very small weights and drives all sigmoid outputs toward 0.5, producing a cost close to \(n \cdot \log 2\) (the entropy of a uniform Bernoulli). Confirming that this diagnostic cost is observed at lambda = 1 validates that the forward pass, label scaling, and cost implementation are internally consistent. This diagnostic is used solely for sanity checking the implementation; it does not influence the choice of the operational lambda = 0.07 used for all training and evaluation runs.
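The reason this diagnostic works can be seen in a few lines of Python (an illustrative sketch with made-up targets, not the authors' code): when an output equals 0.5, its cross-entropy term equals log 2 whatever the target is, so the data term of the cost reduces to log 2 times the number of terms summed. The L2 penalty itself is left out of the sketch.

```python
import math


def cross_entropy(y, y_hat):
    """Binary cross-entropy against a continuous target y in [0, 1]."""
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))


# Made-up scaled scores for the five domains of one project.
targets = [s / 42 for s in (2, 17, 25, 36, 42)]

# Every output collapsed to 0.5, as under very strong regularisation.
per_output = [cross_entropy(y, 0.5) for y in targets]
total = sum(per_output)
print(total, 5 * math.log(2))
```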
Initialisation. Small random weights drawn uniformly from \([-\epsilon, \epsilon]\), where \(\epsilon = \sqrt{6/(L_{\text{in}} + L_{\text{out}})}\) per layer; bias units handled explicitly.
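For the 4460–42–5 topology, the initialisation bound can be evaluated per layer as in the following Python sketch. It is illustrative only: the function names are hypothetical, and the weight layout (one row per receiving unit, with one extra column for the bias weight) is an assumption rather than a documented property of trainingMics.m.

```python
import math
import random


def init_epsilon(l_in, l_out):
    """Bound of the symmetric uniform interval for one layer."""
    return math.sqrt(6.0 / (l_in + l_out))


def random_weights(l_in, l_out, rng):
    """Draw an l_out x (l_in + 1) matrix; the extra column is the bias weight.

    The layout is an assumption made for this sketch.
    """
    eps = init_epsilon(l_in, l_out)
    return [[rng.uniform(-eps, eps) for _ in range(l_in + 1)]
            for _ in range(l_out)]


rng = random.Random(0)               # fixed seed so the draw is repeatable
theta2 = random_weights(42, 5, rng)  # hidden-to-output layer
print(round(init_epsilon(4460, 42), 4), round(init_epsilon(42, 5), 4))
```

The bound is roughly ten times smaller for the input-to-hidden layer than for the hidden-to-output layer, because it shrinks with the number of connected units.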
MaxIter justification. The choice of MaxIter = 10 was validated by comparing the training loss at iterations 5, 10, and 20 across multiple random initialisations. In all runs examined, the loss decreased monotonically and had stabilised to within 1% of its final value by iteration 10; extending to 20 iterations produced no further decrease. This behaviour is consistent with the fast convergence of conjugate-gradient methods on smooth, strongly regularised objectives [23]. We acknowledge that a more rigorous validation would involve a systematic sweep across a wider range of MaxIter values (e.g., 10, 20, 50, 100); this is identified as future work (see Section 5.3).

2.4. Problem Formulation

This is a multi-output regression with logistic heads: each of the five outputs independently models the scaled score for one domain. This design avoids forcing trade-offs across domains and accommodates use cases in which domain scores are not mutually exclusive [6].
Before finalising this formulation, we examined pairwise Pearson correlations among the five domain scores across the nine projects. The observed correlations span a wide range (from approximately −0.4 to +0.6), and no pair of domains shows a sufficiently strong and consistent relationship to justify joint modelling (e.g., via a shared output layer or a structured output kernel). The assumption of conditionally independent logistic outputs is therefore a reasonable working approximation for the current data, though it may become a binding constraint if, in a larger dataset, strong inter-domain dependencies emerge. We flag this as a potential limitation: if future data reveal systematic cross-domain correlations, multi-task learning architectures or joint output layers may outperform the independent-head formulation adopted here.
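A pairwise correlation check of this kind can be sketched in Python as follows. The scores are made up for illustration (the real label matrix has nine rows), and the helper names are hypothetical.

```python
import math

DOMAINS = ["Environment", "Economy", "Governance", "Science", "Society"]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def pairwise_correlations(rows, names):
    """Correlate every pair of columns in a projects x domains score table."""
    cols = list(zip(*rows))
    return {(names[i], names[j]): pearson(cols[i], cols[j])
            for i in range(len(names)) for j in range(i + 1, len(names))}


# Made-up scores on the 1-42 scale: four projects x five domains.
scores = [
    [12, 5, 30, 22, 18],
    [25, 36, 28, 10, 20],
    [8, 2, 15, 35, 27],
    [30, 20, 33, 14, 9],
]
for pair, r in pairwise_correlations(scores, DOMAINS).items():
    print(pair, round(r, 2))
```

With five domains there are ten distinct pairs; with only nine projects each coefficient is estimated from nine points and is therefore itself highly uncertain.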

2.5. Evaluation Protocol and Metrics

Model performance is tracked during training by computing the loss function and the root mean squared error (RMSE) for each domain on the training set, converted back to the original 1–42 scale for interpretability. As the primary evaluation protocol, LOOCV was implemented and carried out for the nine available projects; results are reported in Section 3.4. In each fold, the model was retrained from scratch on eight projects and evaluated on the held-out ninth; predictions were rescaled to [1, 42] before computing errors. Per-domain RMSE, MAE, and R² are reported. Confusion matrices are not applicable because the targets are continuous.
Reporting. A script writes the predicted 5-vector for each run to impact.csv after rescaling to [1, 42].
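The evaluation protocol above can be sketched in Python for a single domain. This is an illustration rather than the authors' implementation: the `fit` and `predict` callables are hypothetical, and a one-nearest-neighbour rule on Hamming distance stands in for the neural network so that the example stays short. The data are made up.

```python
import math


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r2(obs, pred):
    """Coefficient of determination against the mean of the observed scores."""
    mean = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def loocv(X, y, fit, predict):
    """Refit on all rows but one, then predict the held-out row."""
    preds = []
    for i in range(len(X)):
        model = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(model, X[i]))
    return preds


# Placeholder model (NOT Alquimics): nearest neighbour by Hamming distance.
def fit_1nn(X_train, y_train):
    return list(zip(X_train, y_train))


def predict_1nn(model, x):
    hamming = lambda pair: sum(a != b for a, b in zip(pair[0], x))
    return min(model, key=hamming)[1]


X = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]  # made-up binary inputs
y = [10, 12, 30, 34]                              # made-up scores (1-42)
preds = loocv(X, y, fit_1nn, predict_1nn)
print(preds, rmse(y, preds), mae(y, preds), r2(y, preds))
```

Swapping in another model only requires a different `fit`/`predict` pair; the fold loop and the metrics stay unchanged.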

2.6. Implementation Details and Code Structure

The Octave implementation follows a compact, didactic layout. Entry points are mics.m and trainingMics.m. Core modules implement forward and backward passes with explicit handling of bias terms. File roles are as follows:
  • mics.m: loads data, launches training with defaults, writes predictions to impact.csv.
  • trainingMics.m: initialises weights, sets regularisation and optimiser options, calls fmincg.
  • nnCostFunction.m: forward pass, logistic activations, vectorised cost with L2 penalty, backprop gradients.
  • predict.m: forward pass for inference only.
  • sigmoid.m, sigmoidGradient.m: elementwise activations and derivatives.
  • fmincg.m: conjugate-gradient optimiser.

2.7. Hyperparameters and Design Rationale

A single hidden layer enables the network to capture non-linear relationships among question clusters while mapping them to the expert-defined impact scale. The choice of a single hidden layer over a deeper architecture is supported by empirical evidence from the small-tabular-data literature: on datasets with fewer than a few hundred instances, shallow networks with one hidden layer typically match or outperform deeper networks because the reduction in parameter count reduces overfitting risk [8,9]. With n = 9, a single hidden layer is the appropriate default.
The hidden layer has 42 neurons. This width was arrived at heuristically during initial experimentation, and it coincidentally equals the upper bound of the ordinal scoring scale (1–42); however, this coincidence carries no methodological significance. A systematic ablation study comparing hidden-layer widths of 10, 20, 42, 64, and 100 neurons was performed. The qualitative rationale for 42 as a reasonable width is that it is large enough to represent diverse combinations of questionnaire responses but small enough, relative to the 4460-dimensional input, to be strongly constrained by the L2 regularisation, limiting the risk of memorising the nine training instances.
Lambda = 0.07 was selected by monitoring training-set RMSE for a small set of candidate values (0.01, 0.07, 0.1, 0.5, 1.0), each run from multiple random initialisations. Lambda = 0.07 yielded the lowest average RMSE without producing the mid-range prediction collapse associated with very high regularisation. This selection process was informal and heuristic rather than a formal grid search or cross-validated optimisation procedure. A systematic hyperparameter optimisation approach, for example, using grid search or a Bayesian optimisation library such as Optuna [24], would be more rigorous and is recommended as future work when a larger dataset is available.
MaxIter = 10 is discussed and justified in Section 2.3 above.
Logistic outputs per domain fit the independence-of-targets assumption, preserve the ability to calibrate or threshold each domain separately, and pair naturally with the cross-entropy loss (see Section 2.3).

2.8. Training Procedure

Training begins by initialising Theta1 (4460 × 42 weight matrix, including bias) and Theta2 (42 × 5 weight matrix, including bias) with small random values drawn independently from a symmetric uniform distribution. A cost-function handle is then constructed that accepts the vectorised parameter vector, performs a full forward pass through the network using logistic activations, evaluates the sum of five binary cross-entropy terms against the scaled labels together with the L2 penalty, and returns the scalar cost and the analytic gradient via backpropagation. This handle is passed to fmincg, which iterates for at most 10 conjugate-gradient steps, updating the parameter vector to minimise cost. Once fmincg terminates, the optimised parameters are reshaped back into Theta1 and Theta2, and a final forward pass through predict.m produces the scaled output scores for each project, which are then rescaled from [0, 1] to [1, 42] before being written to impact.csv.
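The final forward pass and rescaling step can be sketched in Python on a toy-sized network. This is not a transcription of predict.m: the weight layout (one row per receiving unit, bias weight first) and the rescaling by plain multiplication are assumptions, and the zero weights are chosen only to make the output easy to verify.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer(theta, activations):
    """Apply one logistic layer; each row of theta is [bias, w_1, ..., w_n]."""
    a = [1.0] + list(activations)  # prepend the bias unit
    return [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]


def predict_scores(theta1, theta2, x, scale_max=42):
    """Forward pass through both layers, then rescale to the reporting scale."""
    hidden = layer(theta1, x)
    outputs = layer(theta2, hidden)
    return [scale_max * o for o in outputs]


# Toy sizes for illustration: 3 inputs, 2 hidden units, 5 outputs.
theta1 = [[0.0] * 4 for _ in range(2)]  # 2 x (3 + 1)
theta2 = [[0.0] * 3 for _ in range(5)]  # 5 x (2 + 1)
print(predict_scores(theta1, theta2, [1, 0, 1]))
```

With all weights at zero every sigmoid returns 0.5, so each domain score comes out at the midpoint of the scale, 21.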

2.9. Computational Complexity and Runtime

Forward and backward passes are dominated by dense matrix multiplications of sizes (m × 4460) by (4460 × 42) and (m × 42) by (42 × 5). With m = 9, the runtime is negligible on commodity laptops; the memory footprint is small as well, with the input matrix holding approximately 9 × 4460 elements and the first-layer weight matrix approximately 4460 × 42.
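The sizes involved can be tallied with simple arithmetic, shown here as a short Python sketch. The counts assume one bias weight per hidden and output unit, and the multiply-accumulate figure covers the forward pass only.

```python
n_in, n_hidden, n_out, m = 4460, 42, 5, 9

theta1_params = (n_in + 1) * n_hidden    # input-to-hidden weights plus biases
theta2_params = (n_hidden + 1) * n_out   # hidden-to-output weights plus biases
total_params = theta1_params + theta2_params

input_elements = m * n_in                # entries of the m x 4460 input matrix
forward_macs = m * total_params          # multiply-accumulates per forward pass

print(theta1_params, theta2_params, total_params)
print(input_elements, forward_macs)
```

Under these assumptions the network has 187,577 trainable parameters against nine training examples, and the first-layer weights (187,362 values) outnumber the entries of the input matrix (40,140 values).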

2.10. Data Preprocessing and Schema

  • Inputs: nine CSVs X01.csv … X09.csv, each 223 rows by 20 columns, binary one-hot. These are flattened to a single 4460-length vector per project in column-major order. Questions with fewer than 20 options have the unused columns set to zero.
  • Labels: Y.csv, nine rows by five columns, integer scores on a 1–42 scale. Values are scaled to [0, 1] for training and rescaled for outputs.
  • No imputation: only fully answered projects are included in this study.
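The input schema listed above lends itself to a simple validation step before training. The following Python sketch is illustrative and not part of the published scripts: the function name is hypothetical, the CSV content is generated in memory rather than read from the real files, and the rule that each row holds at most one 1 is an assumption derived from the one-hot description.

```python
import csv
import io


def load_design_matrix(text, n_rows=223, n_cols=20):
    """Parse one project's CSV text and check it against the stated schema."""
    rows = [[int(v) for v in record]
            for record in csv.reader(io.StringIO(text)) if record]
    if len(rows) != n_rows or any(len(r) != n_cols for r in rows):
        raise ValueError("expected a %d x %d matrix" % (n_rows, n_cols))
    for r in rows:
        # Assumption: one-hot rows are binary with at most one selected option.
        if any(v not in (0, 1) for v in r) or sum(r) > 1:
            raise ValueError("row is not a valid one-hot vector")
    return rows


# Made-up CSV content standing in for a file such as X01.csv.
sample = "\n".join(
    ",".join("1" if c == q % 20 else "0" for c in range(20))
    for q in range(223)
)
matrix = load_design_matrix(sample)
print(len(matrix), len(matrix[0]))
```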

2.11. Project Characterisation

Of the 24 projects that initiated the MICS self-assessment, the nine that completed all 223 questions and provided the expert domain scores span a range of citizen-science typologies. Table 1 below provides a broad characterisation of the nine projects; specific project names are withheld to protect confidentiality. Descriptors are provided at the level of thematic focus, geographic region, participation mode (following the PPSR typology of [13]), approximate active duration, and order-of-magnitude participant count. The diversity of project types (spanning ecological monitoring, water quality, astrophysical observation, public-health surveillance, and urban heritage) is essential for exposing Alquimics to different combinations of participation modes, data practices, and impact pathways.

2.12. Random Initialisations

All reported results were obtained from five independent random initialisations per LOOCV fold, using different random seeds. The initialisation yielding the lowest training cost for each fold was selected; results were qualitatively consistent across initialisations, with variation in RMSE on the order of ±1 point on the 1–42 scale. This procedure mitigates the risk that a poor random initialisation drives the reported results, though the small number of restarts means that the global minimum of the non-convex loss surface is not guaranteed to have been reached. A more thorough multi-restart procedure is identified as future work.
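The restart-and-select logic can be sketched in Python as follows. The `train_once` function is a hypothetical stand-in that returns a placeholder cost from a seeded random draw; in the real pipeline it would be one complete training run on a fold's eight projects.

```python
import random


def train_once(seed):
    """Hypothetical stand-in for one training run; returns (cost, weights)."""
    rng = random.Random(seed)            # seeded, so each run is repeatable
    weights = [rng.uniform(-0.1, 0.1) for _ in range(3)]
    cost = sum(w * w for w in weights)   # placeholder for the training cost
    return cost, weights


def best_of_restarts(seeds, train=train_once):
    """Run one training per seed and keep the run with the lowest cost."""
    runs = [(seed,) + tuple(train(seed)) for seed in seeds]
    return min(runs, key=lambda run: run[1])


seed, cost, weights = best_of_restarts(range(5))
print(seed, cost)
```

Selection here uses the training cost only, mirroring the description above; it therefore says nothing about which restart generalises best to the held-out project.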

3. Results

3.1. Overview of the Labelled Projects

The current dataset comprises nine citizen-science projects that completed the full MICS self-assessment and for which expert domain scores are available. Each project is represented by a 223 × 20 binary matrix of one-hot responses, which is flattened into a 4460-dimensional vector. The label matrix contains five expert-assigned scores per project, corresponding to the Environment, Economy, Governance, Science, and Society domains, on an ordinal scale from 1 to 42.
Although the sample is small, the projects span a variety of designs, partnerships, and intended outcomes (see Table 1 in Section 2.11). This diversity is essential for training Alquimics, because it exposes the network to different combinations of participation modes, data practices, and impact pathways.

3.2. Training Behaviour and Internal Diagnostics

Training is performed using the conjugate-gradient optimiser (fmincg) for 10 iterations, with a regularisation strength of lambda = 0.07. During optimisation, the code monitors two quantities: (1) the loss function, combining logistic cross-entropy terms for the five output nodes with the L2 weight penalty; and (2) the RMSE for each domain, computed on the training set and converted back to the original 1–42 scale.
In all runs inspected, the loss decreases monotonically over the first few iterations and then stabilises (reaching within 1% of its final value by iteration 10), indicating that the chosen learning setup is numerically well-behaved. The RMSE values reported by the script are used as early indicators of whether particular hyperparameter settings or initialisations are clearly unsuitable; for example, if training-set RMSE remains above 15 for all domains after 10 iterations, the initialisation is discarded and a new one is drawn. These diagnostics inform the monitoring of training stability but were not used as a formal cross-validated model-selection criterion; hyperparameter choices were made heuristically on the basis of training-set diagnostics, as described in Section 2.7. These diagnostics are complemented by the cross-validated performance estimates reported in Section 3.4.

3.3. Project-Level Predictions

After training on the full nine-project dataset, the network produces one five-dimensional output vector per project. Under LOOCV, each project’s predicted scores are obtained from a model trained on the remaining eight projects, providing genuine out-of-sample estimates. Table 2 presents the expert-assigned (observed) scores and the corresponding LOOCV-predicted scores for all nine projects on the 1–42 scale.
Several features of the predictions merit attention:
  • Range validity: all predicted values lie within the valid 1–42 interval, confirming that the logistic output activations and label rescaling operate correctly across all nine held-out folds.
  • Regularisation pull: predicted scores cluster more tightly around the mid-range (roughly 10–28) than observed scores, which span nearly the full 1–42 range. This compression is most pronounced in the Economy domain, where extreme observed values (P2 = 2; P4 = 36; P5 = 36) are pulled substantially toward the centre by the L2 regularisation (λ = 0.07) and the limited training-set size (eight examples per fold).
  • Per-domain accuracy: Governance predictions are closest to the observed scores across projects, while Economy and Science show the largest discrepancies, consistent with the domain-level RMSE values reported in Section 3.4. Predictions are written to impact.csv after rescaling, using the same 1–42 scale as expert labels.

3.4. Leave-One-Out Cross-Validation Results

Because only nine projects are available, LOOCV was adopted as the primary evaluation protocol. In each of the nine folds, the model was retrained from scratch on the eight remaining projects and applied to the held-out project; predictions were rescaled to the original 1–42 score range before computing errors.
Overall performance. Across all nine held-out predictions and the five domains (45 data points), the LOOCV metrics on the 1–42 scale are: RMSE = 10, MAE = 9, and R² = 0.06. The RMSE is close to the pooled standard deviation of the observed scores (SD = 11), indicating that the model’s aggregate predictive power beyond a naïve mean prediction is modest. The low R² indicates that only a small fraction of the overall variance is explained, consistent with the severely constrained training regime (eight instances per fold for a 4460-input model).
Domain-specific performance. Table 3 reports per-domain RMSE, MAE, and R² across the nine LOOCV folds. All values are on the original 1–42 impact scale.
Interpretation. Governance achieves the best predictive performance (RMSE = 6, R² = 0.3), suggesting that governance-related impact is more consistently encoded in the MICS questionnaire features. Economy and Science exhibit the weakest performance (RMSE > 12, negative R²), indicating that the available features capture relatively less predictive signal for those domains or that expert scores for these domains vary more idiosyncratically across projects. The negative R² values for Environment, Economy, and Science indicate that, in those domains, the model performs worse than simply predicting the domain mean, an expected outcome when fitting a 4460-dimensional model on eight training examples per fold. These results nonetheless serve as a transparent baseline: they quantify how well the current feature set and architecture generalise under the strictest data constraints, and they identify Governance as the most learnable domain with the present data.

3.5. Reproducibility

The complete Alquimics implementation is designed for full reproducibility. The Octave source files (mics.m, trainingMics.m, nnCostFunction.m, predict.m, sigmoid.m, sigmoidGradient.m, fmincg.m) are self-contained and require only a base Octave 10.3 installation with no additional toolboxes. All random seeds are set and logged at the start of trainingMics.m, ensuring that results can be reproduced exactly from a given seed. The input data schema is fully documented in Section 2.10 and Appendix B: nine binary CSV files (X01.csv … X09.csv, each 223 × 20) and a label file (Y.csv, 9 × 5). The entire pipeline, from raw CSV inputs to the impact.csv output, is triggered by a single command (“mics” at the Octave prompt), as documented in Appendix A. Trained weight matrices are stored as Theta1.mat and Theta2.mat to allow inference to be performed without retraining. The complete reproducibility bundle (Octave source code, input design matrices, expert labels, and trained weight matrices) is provided as Supplementary Materials (File S1). All code and data will be released in a public repository under the CC0 licence upon publication.

4. Discussion

4.1. Key Findings and Model Design

Alquimics demonstrates that a compact multilayer perceptron can map structured citizen-science project descriptors to domain-level impact assessments even within significant data constraints. The architecture employs five independent logistic output heads (one for each impact domain) rather than a single multi-class output layer. This design choice reflects both theoretical and practical considerations: it avoids imposing artificial trade-offs across conceptually distinct domains and aligns with downstream use cases in which IPAs must independently examine, compare, or threshold domain scores [6]. The deliberately constrained model capacity (achieved through a modest hidden layer of 42 units), combined with L2 regularisation, is intended to limit overfitting to the small training set.

4.2. Methodological Contributions

The application of neural networks to citizen-science impact assessment marks a substantive methodological contribution. Traditional rule-based systems, while interpretable, cannot capture the complex, nonlinear relationships between project characteristics and multi-dimensional impact [4]. Conversely, most machine-learning approaches to multi-output regression require substantially larger training datasets than were available here. Alquimics bridges this gap by demonstrating that a carefully designed neural network, constrained in capacity and regularised appropriately, can learn meaningful patterns from a limited but richly featured dataset. The use of 223 questions encoded as a 4460-dimensional binary vector represents the most comprehensive feature set assembled to date for citizen-science impact assessment, enabling the model to account for nuanced variations in project design, participation, data practices, and outcomes.

4.3. Critical Limitations: Data Constraints and Implications

The primary limitation is the small sample size: nine complete citizen-science projects. While this represents the most extensive collection of comprehensively documented projects available within MICS at the time of analysis, it raises substantive questions about model generalisation. With only nine instances, the model effectively learns patterns from a narrow slice of the global citizen-science landscape, potentially overlooking important project typologies, geographic variations, or implementation contexts not represented in the training set.
A related concern is label noise: the potential presence of subtle inconsistencies or uncertainties in the expert-assigned impact scores. As noted in Section 2.1, one to three domain experts scored each project. While this protocol provides a reasonable reliability baseline, human judgement inevitably contains subjectivity and measurement error. In small-sample regimes, even modest label noise can substantially distort learned parameters. The L2 regularisation provides some protection against this risk, but it cannot eliminate it. This concern is directly relevant to interpreting the LOOCV results: part of the prediction error may be irreducible error arising from inconsistent labelling rather than a failure of the model to generalise.

4.4. Temporal Scope and Long-Term Impact

A second major limitation concerns temporal scope. The MICS framework captures project characteristics and reports outcomes at a single time point or across limited timeframes, yet the true impact of citizen-science projects often unfolds over years or decades. Alquimics, trained on contemporaneous or near-term impact assessments, cannot reliably predict these longer-horizon effects. Longitudinal validation, comparing early-stage predictions against outcomes observed 12–36 months later, is identified as an important future research direction.

4.5. Model Architecture, Validation, and Baseline Comparisons

LOOCV results (Section 3.4) provide an honest out-of-sample baseline: overall RMSE = 10 on the 1–42 scale (cf. SD = 11) and overall R² = 0.06, with Governance the most predictable domain (RMSE = 6, R² = 0.3) and Economy the least (RMSE = 14, R² = −0.8). These results confirm that the current model, trained on only eight examples per fold, does not yet generalise reliably and should be interpreted as a transparent baseline rather than a validated predictive tool.
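The per-domain metrics can be re-derived from the observed/predicted pairs in Table 2. The following is a minimal sketch in Python (not the authors' Octave code) using the Governance column as transcribed from Table 2. It assumes that R² is defined as 1 − SS_res/SS_tot, with SS_tot taken about the mean of all nine observed scores; the function names are illustrative.

```python
import math

# Governance column of Table 2: (expert-assigned, LOOCV-predicted) for P1..P9.
governance = [(25, 16), (6, 15), (5, 11), (11, 10), (7, 11),
              (9, 11), (16, 11), (22, 19), (12, 6)]

def rmse(pairs):
    """Root-mean-square error over (observed, predicted) pairs."""
    return math.sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (observed, predicted) pairs."""
    return sum(abs(y - p) for y, p in pairs) / len(pairs)

def r_squared(pairs):
    """1 - SS_res / SS_tot, with SS_tot about the mean of all observed scores."""
    mean_y = sum(y for y, _ in pairs) / len(pairs)
    ss_res = sum((y - p) ** 2 for y, p in pairs)
    ss_tot = sum((y - mean_y) ** 2 for y, _ in pairs)
    return 1.0 - ss_res / ss_tot

print(round(rmse(governance), 2), round(mae(governance), 2),
      round(r_squared(governance), 2))  # 5.67 5.0 0.28
```

Under this definition the values round to the RMSE = 6, MAE = 5, and R² = 0.3 reported for Governance in Table 3.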
A limitation of the current evaluation is the absence of comparisons against simpler baseline models. Three natural comparators were not implemented: (1) a mean predictor, which predicts the training-set domain mean for every held-out project; (2) linear (or ridge) regression, which would directly test whether the neural network’s non-linearity adds value over a linear mapping from the 4460 features; and (3) Random Forest, which is known to perform well on tabular data and is robust to irrelevant features through feature subsampling [25]. A mean predictor attains R² = 0 only when the mean is taken over all nine projects. Under LOOCV, where the mean comes from the eight training projects, each residual is n/(n − 1) times the deviation from the nine-project mean, so R² = 1 − (9/8)² ≈ −0.27 if the total sum of squares is taken about the nine-project mean. The negative R² values for Environment, Economy, and Science show that Alquimics falls short of the all-project mean in those domains; against the LOOCV mean predictor, Economy (R² = −0.8) is clearly worse, while Environment (−0.2) and Science (−0.10) are roughly on a par. These comparators were omitted mainly for reasons of scope: this paper focuses on establishing the Alquimics baseline and its integration into the MICS ecosystem rather than on comparative model evaluation. The small sample (n = 9) also makes reliable estimation of baseline performance equally challenging. Systematic baseline comparisons are identified as the most important immediate next step in this research programme (see Section 5.3).
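One detail matters when the mean predictor is run under the same LOOCV protocol: its R² is exactly zero only if the mean includes the held-out project. The sketch below, in Python rather than the authors' Octave, applies a leave-one-out mean predictor to the expert-assigned Economy scores transcribed from Table 2. It assumes R² is computed with the total sum of squares about the nine-project mean; the helper names are illustrative.

```python
# Economy column of Table 2: expert-assigned scores for P1..P9 (1-42 scale).
economy = [17, 2, 24, 36, 36, 17, 25, 24, 11]

def loocv_mean_predictions(scores):
    """Predict each held-out score as the mean of the remaining scores."""
    total, n = sum(scores), len(scores)
    return [(total - y) / (n - 1) for y in scores]

def r_squared(observed, predicted):
    """1 - SS_res / SS_tot, with SS_tot about the mean of all observed scores."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot

n = len(economy)
baseline_r2 = r_squared(economy, loocv_mean_predictions(economy))
closed_form = 1.0 - (n / (n - 1)) ** 2  # holds for any data, since residual = n/(n-1) * deviation

print(round(baseline_r2, 3), round(closed_form, 3))
```

Both values come out at about −0.27 for n = 9, whatever the scores are. Under the stated assumption, a cross-validated R² between −0.27 and 0 is therefore not evidence of doing worse than a leave-one-out mean predictor.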

4.6. Practical Implications for MICS Users

Despite these limitations, Alquimics provides immediate practical value within the MICS ecosystem. The model generates domain-specific impact assessments and, when integrated with the rule-based recommendation engine, supplies actionable guidance to citizen-science project leaders. Users should interpret Alquimics not as a definitive impact measure but as a learning tool, a system trained on documented projects that identifies patterns potentially relevant to new initiatives. This pragmatic framing acknowledges both the model’s capabilities and its constraints, positioning it as one component of a broader assessment pipeline.

4.7. Broader Significance

This work demonstrates the feasibility and potential of machine learning for citizen-science impact assessment, even under data-limited conditions. It validates the value of comprehensive, structured data collection reflected in MICS’s 223-question instrument and provides a methodological foundation for future, more robust systems. As citizen science continues expanding globally [1,13], the ability to quantify impact rigorously and comparatively across diverse projects becomes increasingly important for funding agencies, policy makers, and project communities. Alquimics represents an instructive step toward this goal, one that acknowledges both machine learning’s promise for pattern discovery and the continued necessity of transparency, validation, and IPA engagement in assessment systems.

5. Limitations and Future Work

5.1. Limitations of the Present Study

The current analysis operates under three fundamental constraints. First, the training dataset comprises only nine citizen-science projects, the complete set available within MICS at the time of model development. While each project is richly characterised by 223 questions encoded as a 4460-dimensional binary vector, nine instances remain far below the sample sizes typically required for reliable neural-network validation. Second, the impact scores reflect expert judgement at a single time point, potentially missing long-term effects and introducing subjectivity despite the structured inter-rater protocol (ICC 0.61–0.78). Third, the nine projects may not be representative of the broader global citizen-science landscape with respect to geographic distribution, disciplinary focus, or implementation context.

5.2. Implications for Current Findings

Given these limitations, Alquimics should be interpreted not as a definitive, universally applicable impact-assessment system but as a domain-informed pattern-discovery tool trained on a specific collection of projects. The model’s ability to identify relationships within the MICS dataset is genuine, but its performance on unseen projects, especially those with markedly different characteristics, remains uncertain. The framework is most defensible when applied to projects reasonably similar to those in the training set and when interpreted as one input to a broader assessment process alongside expert judgement and rule-based guidance.

5.3. Priority Directions for Future Work

Future research should pursue four complementary objectives: systematic hyperparameter optimisation, baseline comparisons, expanded training data, and methodological enhancement.
Systematic hyperparameter optimisation. The hyperparameters of Alquimics (hidden-layer width, lambda, and MaxIter) were selected heuristically on the basis of training-set diagnostics. A rigorous optimisation, using grid search or a Bayesian optimisation framework such as Optuna [24], should be conducted when a larger dataset is available, with cross-validated RMSE as the selection criterion. A sensitivity ablation study comparing hidden-layer widths of 10, 20, 42, 64, and 100 neurons and MaxIter values of 10, 20, 50, and 100 will reveal whether the current configuration is robust or dependent on specific tuning choices.
Baseline comparisons. Systematic comparison of Alquimics against a mean predictor, ridge regression, and Random Forest is the most immediate next step. These comparisons will determine whether the neural network’s non-linearity provides added value and whether the 4460-dimensional feature space is better compressed by dimensionality reduction before applying simpler models.
Expanded training data. Expanding the training dataset to 30–50 documented projects (a realistic target over a 2–3-year horizon) would substantially improve model stability, reduce overfitting risk, and enable meaningful train–validation–test splits. New projects should be intentionally diverse with respect to geographic region, scientific discipline, funding mechanism, and participation mode.
Methodological enhancement. Three concrete improvements would strengthen the framework: (1) integrating uncertainty quantification (via Bayesian neural networks, Monte Carlo dropout, or ensembles) to provide confidence intervals around predictions; (2) implementing interpretability methods such as feature importance analysis or SHAP values to identify which questionnaire items most strongly drive impact predictions in each domain; and (3) conducting longitudinal validation by comparing early-stage predictions against outcomes observed 12–36 months later.
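As an illustration of the ridge baseline mentioned above: with n = 9 projects and 4460 features, ridge regression can be solved in its dual form, which needs only an 8 × 8 linear system per LOOCV fold. The following Python sketch (not the authors' Octave code) uses synthetic data, not the MICS projects. The penalty `lam = 1.0` is arbitrary and not on the same scale as the lambda used in training Alquimics, and the model has no intercept; all names are hypothetical.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ridge_dual_fit_predict(x_train, y_train, x_new, lam):
    """Ridge without intercept in the dual: alpha = (X X^T + lam I)^-1 y."""
    n = len(x_train)
    gram = [[dot(x_train[i], x_train[j]) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve(gram, y_train)
    return sum(a * dot(xi, x_new) for a, xi in zip(alpha, x_train))

def synthetic_project(rng, questions=223, options=20):
    """Hypothetical stand-in for one flattened questionnaire (one option per question)."""
    vec = [0] * (questions * options)
    for q in range(questions):
        vec[q * options + rng.randrange(options)] = 1
    return vec

rng = random.Random(0)
X = [synthetic_project(rng) for _ in range(9)]  # synthetic, not the MICS data
y = [rng.random() for _ in range(9)]            # synthetic targets in [0, 1]

loocv_pred = []
for i in range(9):  # hold out project i, fit on the other eight
    x_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
    loocv_pred.append(ridge_dual_fit_predict(x_tr, y_tr, X[i], lam=1.0))

print([round(p, 3) for p in loocv_pred])
```

The dual form gives the same predictions as the primal solution w = (XᵀX + λI)⁻¹Xᵀy while avoiding a 4460 × 4460 system. The same loop could score any comparator on the real design matrices.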

5.4. Practical Roadmap

In the near term (6–12 months), systematic hyperparameter search and baseline comparisons should be prioritised using expanded data and the archived code structures already prepared. In the medium term (1–2 years), as the training dataset reaches 25–40 projects, Alquimics should be retrained and revalidated, incorporating uncertainty quantification and interpretability enhancements. In the longer term (2–3 years), sufficient labelled projects and temporal outcome data should be available to conduct rigorous longitudinal validation and to explore semi-supervised or transfer-learning approaches that leverage unlabelled projects or related citizen-science datasets.

5.5. Alignment with MICS Ecosystem Needs

These research directions directly serve the MICS user community. Expanding the training dataset and validating Alquimics across diverse project types will increase the relevance of recommendations to a global audience. Uncertainty quantification will support more nuanced decision-making. Interpretability analysis will help users understand which aspects of project design most strongly influence impact, enabling evidence-based improvement strategies.

6. Conclusions

Alquimics establishes the first neural-network baseline for multi-output impact regression in citizen science, integrating a compact 4460–42–5 multilayer perceptron (trained on 223-question self-assessments encoded as 4460-dimensional binary vectors) directly into the MICS open-source platform alongside a rule-based recommendation engine. Leave-one-out cross-validation over nine fully documented projects yields an overall RMSE of 10 and R² of 0.06 on the 1–42 expert-score scale, with Governance emerging as the most learnable domain (RMSE = 6, R² = 0.3), suggesting that Governance-related impact is more consistently encoded in the questionnaire features than Economy or Science outcomes. The principal limitation is the small labelled dataset (n = 9), which constrains generalisation and prevents reliable estimation of hyperparameter sensitivity; these constraints are directly reflected in the modest cross-validated performance and are documented transparently so that future adopters can calibrate their expectations accordingly. As additional citizen-science projects complete the MICS self-assessment and expert scoring is systematically extended, the same pipeline, requiring only an expanded Y.csv and corresponding input files, will scale to deliver progressively more reliable predictions, making Alquimics a living baseline that grows in utility alongside the citizen-science community it serves.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai7050178/s1. File S1: Alquimics reproducibility package (zip archive) containing the complete Octave implementation and dataset required to reproduce the results reported in this paper, namely (i) the Octave source code (mics.m (entry point), trainingMics.m (training driver), nnCostFunction.m (cost and gradient), predict.m (inference), sigmoid.m, sigmoidGradient.m, and fmincg.m (conjugate-gradient optimiser)); (ii) the nine project design matrices X01.csv–X09.csv (each 223 × 20 one-hot binary, flattened internally to a 4460-vector); (iii) the label matrix Y.csv (9 × 5 expert domain scores rescaled to [0, 1]); and (iv) the trained network parameters Theta1.mat (42 × 4461) and Theta2.mat (5 × 43), enabling inference without retraining.
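The parameter shapes quoted above (42 × 4461 and 5 × 43) are consistent with one bias weight per unit in each layer. The Python sketch below (not the authors' predict.m) shows a forward pass with those shapes and logistic activations. It assumes the bias weight occupies the first column of each matrix, and it uses placeholder zero weights rather than reading Theta1.mat or Theta2.mat.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(theta, activations):
    """Apply one logistic layer; theta[k][0] is assumed to be the bias weight."""
    with_bias = [1.0] + list(activations)
    return [sigmoid(sum(w * a for w, a in zip(row, with_bias))) for row in theta]

def forward(theta1, theta2, x):
    """4460 inputs -> 42 hidden units -> 5 outputs, all logistic."""
    assert all(len(row) == len(x) + 1 for row in theta1)
    hidden = layer(theta1, x)
    assert all(len(row) == len(hidden) + 1 for row in theta2)
    return layer(theta2, hidden)

# Placeholder weights with the quoted shapes; real values would come from
# Theta1.mat and Theta2.mat, which this sketch does not read.
theta1 = [[0.0] * 4461 for _ in range(42)]
theta2 = [[0.0] * 43 for _ in range(5)]
x = [0] * 4460

outputs = forward(theta1, theta2, x)
print(len(outputs), outputs[0])  # 5 0.5
```

With all-zero weights every output is exactly 0.5, which is the mid-range value discussed in Appendix C.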

Author Contributions

Conceptualisation: L.C.; Methodology: L.C. and L.V.; Software: L.V., L.C. and I.V.; Validation: L.C., L.V. and I.V.; Formal analysis: L.C. and L.V.; Investigation: L.C. and L.V.; Data curation: L.C. and L.V.; Writing—original draft: L.C. and L.V.; Writing—review and editing: L.C., L.V. and I.V.; Supervision: L.C.; Project administration: L.C. All authors have read and agreed to the published version of the manuscript.

Funding

The research described in this paper was funded by the European Commission via the MICS project, which has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 824711, and via the ProBleu project, which has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101113001 and from UK Research and Innovation under the UK government’s Horizon Europe funding guarantee, grant number 10082336. The opinions expressed are those of the authors and are not necessarily those of the MICS or ProBleu partners, the European Commission, or the UK government.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Inputs are provided as X01.csv … X09.csv (each 223 × 20 binary), labels in Y.csv (nine rows × five domains), and the Octave implementation (mics.m, trainingMics.m, nnCostFunction.m, predict.m, sigmoid.m, sigmoidGradient.m, fmincg.m). Trained parameters are stored as Theta1.mat and Theta2.mat. A minimal run is:
% In Octave
>> mics % trains with MaxIter = 10, writes impact.csv
The design matrix and scripts will be released in a public repository upon publication under the CC0 licence.

Acknowledgments

We thank the MICS consortium and the practitioners who completed the self-assessments that enabled this study. During preparation of this manuscript, the authors used an AI assistant (OpenAI ChatGPT; version November 2025) for language editing and formatting. The authors reviewed and edited all output and take full responsibility for the content.

Conflicts of Interest

Author Luigi Ceccaroni was employed by Earthwatch. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Minimal Command-Line Recipes

# Train and write predictions
mics
# Change regularisation
edit trainingMics.m # set lambda and MaxIter, then rerun mics

Appendix B. Data Dictionary

XNN.csv: 223 rows by 20 columns, values in {0, 1}. Row i corresponds to question i; columns represent categorical options. Questions with fewer than 20 options have the unused columns set to zero.
Y.csv: nine rows by five columns. Columns are Environment, Economy, Governance, Science, and Society (integers 1–42).
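The layout described above can be checked and flattened in a few lines. The following Python sketch (not part of the authors' Octave package) assumes comma-separated files and row-major (question-by-question) flattening. The flattening order used by the original code is not stated here, and Octave's `X(:)` would stack columns instead. The CSV content is synthetic.

```python
import csv
import io

def load_design_matrix(csv_text, rows=223, cols=20):
    """Parse one XNN.csv-style file and check its shape and binary values."""
    matrix = [[int(v) for v in rec]
              for rec in csv.reader(io.StringIO(csv_text)) if rec]
    if len(matrix) != rows or any(len(r) != cols for r in matrix):
        raise ValueError("expected a %d x %d matrix" % (rows, cols))
    if any(v not in (0, 1) for r in matrix for v in r):
        raise ValueError("entries must be 0 or 1")
    return matrix

def flatten_row_major(matrix):
    """Concatenate rows: question 1's options, then question 2's, and so on."""
    return [v for row in matrix for v in row]

# Synthetic example: question i selects option (i mod 20).
lines = []
for i in range(223):
    row = ["0"] * 20
    row[i % 20] = "1"
    lines.append(",".join(row))
csv_text = "\n".join(lines)

features = flatten_row_major(load_design_matrix(csv_text))
print(len(features), sum(features))  # 4460 223
```

Whichever order is used, it has to match the order used when the stored weights were trained, otherwise the feature positions no longer line up with Theta1.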

Appendix C. Troubleshooting

If fmincg halts early, increase MaxIter to 20.
If costs explode, change lambda in smaller steps or re-initialise the weights.
If predictions collapse to mid-range values, verify label scaling and gradient regularisation. This symptom typically indicates that lambda is too large (e.g., lambda = 1), driving all weights toward zero and all outputs toward 0.5 (corresponding to a score of approximately 21 on the 1–42 scale). The lambda = 1 diagnostic described in Section 2.3 is deliberately designed to trigger this collapse as a sanity check; it should not be used for operational training.
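The collapse mechanism can be reproduced on a toy problem. The Python sketch below fits a single logistic unit with one weight and a constant input of 1 by gradient descent. It assumes a cross-entropy cost with an L2 penalty of (lam/2)·w², a form this section does not specify for Alquimics. The lam values are on the toy's own scale and are not comparable to the lambda used in trainingMics.m.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_single_weight(target, lam, steps=5000, lr=0.1):
    """Gradient descent on cross-entropy + (lam / 2) * w**2 for one logistic unit."""
    w = 0.0
    for _ in range(steps):
        grad = (sigmoid(w) - target) + lam * w  # data term + penalty term
        w -= lr * grad
    return w, sigmoid(w)

for lam in (0.0, 10.0):
    w, out = fit_single_weight(target=0.9, lam=lam)
    print(lam, round(w, 3), round(out, 3))
```

Without the penalty the output reaches the target of 0.9. With the large penalty the weight stays near zero and the output stays near 0.5, whatever the target is.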

References

  1. Bonney, R.; Shirk, J.L.; Phillips, T.B.; Wiggins, A.; Ballard, H.L.; Miller-Rushing, A.J.; Parrish, J.K. Next steps for citizen science. Science 2014, 343, 1436–1437.
  2. Chandler, M.; See, L.; Copas, K.; Bonde, A.M.; López, B.C.; Danielsen, F.; Legind, J.K.; Masinde, S.; Miller-Rushing, A.J.; Newman, G.; et al. Contribution of citizen science towards international biodiversity monitoring. Biol. Conserv. 2017, 213, 280–294.
  3. Kosmala, M.; Wiggins, A.; Swanson, A.; Simmons, B. Assessing data quality in citizen science. Front. Ecol. Environ. 2016, 14, 551–560.
  4. Egwu, L.S.; Enayaba, O.F.; Ajiboye, A.A.; Damoye, T.; Ogundeji, I.S.; Agbams, P. A Review of Machine Learning Techniques Applications in Environmental Science. Int. J. Sci. Res. Technol. 2024, 6.
  5. Huntingford, C.; Jeffers, E.S.; Bonsall, M.B.; Christensen, H.M.; Lees, T.; Yang, H. Machine learning and artificial intelligence to aid climate change research and preparedness. Environ. Res. Lett. 2019, 14, 124007.
  6. Xu, D.; Shi, Y.; Tsang, I.W.; Ong, Y.S.; Gong, C.; Shen, X. Survey on multi-output learning. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 2409–2429.
  7. Borchani, H.; Varando, G.; Bielza, C.; Larranaga, P. A survey on multi-output regression. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2015, 5, 216–233.
  8. Klambauer, G.; Unterthiner, T.; Mayr, A.; Hochreiter, S. Self-normalizing neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
  9. Shwartz-Ziv, R.; Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 2022, 81, 84–90.
  10. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377.
  11. Parkinson, S.; Woods, S.M.; Sprinks, J.; Ceccaroni, L. A practical approach to assessing the impact of citizen science towards the Sustainable Development Goals. Sustainability 2022, 14, 4676.
  12. Sprinks, J.; Woods, S.M.; Parkinson, S.; Wehn, U.; Joyce, H.; Ceccaroni, L.; Gharesifard, M. Coordinator perceptions when assessing the impact of citizen science towards sustainable development goals. Sustainability 2021, 13, 2377.
  13. Bonney, R.; Cooper, C.B.; Dickinson, J.; Kelling, S.; Phillips, T.; Rosenberg, K.V.; Shirk, J. Citizen science: A developing tool for expanding science knowledge and scientific literacy. BioScience 2009, 59, 977–984.
  14. Phillips, T.; Porticella, N.; Constas, M.; Bonney, R. A framework for articulating and measuring individual learning outcomes from participation in citizen science. Citiz. Sci. Theory Pract. 2018, 3, 3.
  15. Hecker, S.; Haklay, M.; Bowser, A.; Makuch, Z.; Vogel, J.; Bonn, A. (Eds.) Citizen Science: Innovation in Open Science, Society and Policy; UCL Press: London, UK, 2018.
  16. Sarker, I.H. Machine learning: Algorithms, real-world applications and research directions. SN Comput. Sci. 2021, 2, 160.
  17. Villena Román, J.; Collada Pérez, S.; Lana Serrano, S.; González Cristóbal, J.C. Hybrid approach combining machine learning and a rule-based expert system for text categorization. In Proceedings of the Twenty-Fourth International Florida Artificial Intelligence Research Society Conference, Palm Beach, FL, USA, 18–20 May 2011.
  18. Shortliffe, E.H.; Sepúlveda, M.J. Clinical decision support in the era of artificial intelligence. JAMA 2018, 320, 2199–2200.
  19. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat, F. Deep learning and process understanding for data-driven Earth system science. Nature 2019, 566, 195–204.
  20. Spyromitros-Xioufis, E.; Tsoumakas, G.; Groves, W.; Vlahavas, I. Multi-target regression via input space expansion: Treating targets as inputs. Mach. Learn. 2016, 104, 55–98.
  21. Ceccaroni, L.; Parkinson, S.; Sprinks, J.; Woods, S.; Wehn, U. The MICS self-assessment framework to examine the impact of citizen science on society, the environment, the economy, governance, and science and technology. Citiz. Sci. Theory Pract. 2026; manuscript under review.
  22. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  23. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2006.
  24. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2623–2631.
  25. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Figure 1. An example of a question corresponding to an input feature in MICS.
Figure 2. Structure of Alquimics neural network.
Table 1. Characterisation of the nine citizen-science projects included in the study (confidential details withheld).

| Project | Thematic Focus | Geographic Region of Participants | Duration (Approx.) | Participants (Order of Magnitude) |
|---|---|---|---|---|
| iMars | Space | Global | 3–5 years | 10^2 |
| Citclops/EyeOnWater | Water quality | Global | 5+ years | 10^2 |
| FreshWater Watch | Water quality | Global | 5+ years | 10^3 |
| Outfall Safari | Water quality | Europe | 2–4 years | 10^2 |
| Crowd4SDG | Climate | Europe | 3–5 years | 10^2 |
| YouCount | Social science | Europe | 2–3 years | 10^2 |
| COESO | Social science | Europe | 2–3 years | 10^2 |
| Planet Four: Craters | Space | Global | 1–2 years | 10^2 |
| Citizen River-Habitat Survey | Water quality | Europe | 2–4 years | 10^2 |

Note. The duration and participant counts are approximate and represent the state of each project at the time of self-assessment.
Table 2. Expert-assigned (observed) and LOOCV-predicted impact scores for all nine projects across five domains (1–42 scale). Each row is predicted by a model trained on the remaining eight projects. Cells read expert-assigned/LOOCV-predicted.

| Project | Environment | Economy | Governance | Science | Society |
|---|---|---|---|---|---|
| P1 | 13/23 | 17/23 | 25/16 | 2/9 | 33/30 |
| P2 | 25/17 | 2/26 | 6/15 | 20/8 | 37/32 |
| P3 | 34/24 | 24/24 | 5/11 | 29/16 | 33/22 |
| P4 | 18/26 | 36/17 | 11/10 | 34/15 | 21/29 |
| P5 | 16/27 | 36/16 | 7/11 | 24/5 | 34/25 |
| P6 | 26/28 | 17/27 | 9/11 | 4/17 | 12/26 |
| P7 | 38/23 | 25/25 | 16/11 | 9/10 | 13/24 |
| P8 | 18/15 | 24/13 | 22/19 | 1/8 | 34/33 |
| P9 | 17/23 | 11/24 | 12/6 | 16/16 | 24/34 |
Table 3. Per-domain LOOCV performance metrics (n = 9 folds, scores on 1–42 scale).

| Domain | RMSE | MAE | R² |
|---|---|---|---|
| Environment | 9 | 8 | −0.2 |
| Economy | 14 | 11 | −0.8 |
| Governance | 6 | 5 | 0.3 |
| Science | 12 | 10 | −0.10 |
| Society | 9 | 8 | 0.02 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ceccaroni, L.; Visa, L.; Visa, I. Design and Implementation of a Three-Layer Backpropagation Neural Network for Multi-Output Regression in Citizen-Science Impact Assessment. AI 2026, 7, 178. https://doi.org/10.3390/ai7050178
