Production Architecture of an AI-Powered Survey Evaluation System: Insights from Education

Gutiérrez-Leal, David Emiliano; León-Sandoval, Edgar; Contreras, Eduardo Quintana; Barbosa-Santillán, Liliana Ibeth

doi:10.3390/asi9060118


by David Emiliano Gutiérrez-Leal ¹, Edgar León-Sandoval ¹,*, Eduardo Quintana Contreras ² and Liliana Ibeth Barbosa-Santillán ¹

¹ School of Engineering and Sciences, Tecnologico de Monterrey, Monterrey 64700, Mexico

² Center for Scientific Research and Higher Education of Ensenada, Ensenada 22860, Mexico

* Author to whom correspondence should be addressed.

Appl. Syst. Innov. 2026, 9(6), 118; https://doi.org/10.3390/asi9060118

Submission received: 23 April 2026 / Revised: 25 May 2026 / Accepted: 28 May 2026 / Published: 31 May 2026

(This article belongs to the Special Issue AI-Driven Educational Technologies: Systems and Applications)


Abstract

This work presents a case study of a Large Language Model based system for automated classification of student survey responses. The system processes 22,286 open-text responses collected from 2062 students across 12 academic programs and 21 nationalities spanning the years 2010–2025. The system architecture has been deployed on institutional servers for security, while integrating databases, an asynchronous task queue for processing, a web-based service layer, and distributed background workers that interact with remote LLM inference services. This work provides a practical reference framework for educational institutions aiming to responsibly and effectively operationalize LLMs in real-world applications.

Keywords:

LLMs; education; educative program evaluation; survey analysis; natural language processing; machine learning deployment; educational technology; educational innovation; higher education

1. Introduction

Transformer architectures and Large Language Models (LLMs) have recently seen rapid adoption and widespread attention [1]. Today, LLMs are applied across diverse fields, including medicine, law, and education [2]. Education in particular is increasingly adopting LLMs to streamline classroom tasks such as answering student questions and assisting with grading [3]. A particularly promising application of LLMs in education is the evaluation of student feedback. Educational institutions often employ satisfaction surveys to gather students’ opinions on aspects such as instructors, facilities, services, curriculum, and the overall environment [4,5,6,7,8]. The evaluation of these surveys can be streamlined using LLMs, enabling faster and more consistent analysis of open-ended student responses [9]. However, LLMs face notable limitations, particularly their propensity to generate inaccurate or unfounded information, commonly referred to as “hallucinations” [10,11,12]. A commonly employed mitigation strategy is to provide the model with the relevant context during prompt construction [13,14]. Grounding the model’s output in user-supplied information makes it feasible to analyze large volumes of student responses in a timely manner and with minimal human intervention [15,16].

This work seeks to address the following question: Q1. How can a large language model system reliably automate the classification of open-ended student survey responses in institutions? To answer this question, this work proposes a survey classification system designed to analyze indirect measurement instruments, that is, surveys, using LLM technology. By identifying key areas of institutional weakness reflected in student feedback, as well as overall perceived performance, the system facilitates informed decision-making and targeted academic improvements.

The primary users of this system are the personnel responsible for scholarships and postgraduate programs at the Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE) [17], along with the directors of its various postgraduate programs. The system analyzed 22,286 responses collected between 29 January 2010 and 3 October 2025, categorizing them into 3828 unique labels. These responses were provided by a total of 2062 students from 12 academic programs and 21 nationalities.

The remainder of this document is organized as follows: Section 2 details the system architecture, while Section 3 presents the experiments. Section 4 discusses the benchmarking and evaluation results. Finally, Section 5 presents the discussion, and Section 6 highlights the conclusions. Details on the survey design and certain data characteristics are included in Appendix A.

2. Production Architecture of an AI-Powered Survey Evaluation System

The system architecture was designed to operate within CICESE’s on-premises infrastructure. Instead of relying on third-party hosting, the application was deployed using virtualization and container technology running inside the institution’s internal servers while providing security, scalability and flexibility. The resulting platform is organized as a set of cooperating services composed of a web API, several asynchronous background workers, a relational database, and a key-value store.

The web service handled data ingestion, validation, retrieval, and export, while asynchronous workers executed time-intensive tasks such as response preprocessing, prompt construction, and interaction with the large language model. A MariaDB schema stored survey responses together with their generated labels and metadata, whereas the key-value database maintained queue state and worker coordination data.

The deployment was carried out with support from the institution’s technical staff. Access to the virtual machines is restricted to a predefined set of IP addresses, limiting connectivity to authorized institutional devices involved in data retrieval and administration. After configuring access permissions and security policies, the system was deployed in a Linux-based environment, and network connectivity was restricted to the essential operational zones shown in Figure 1 and Figure 2.

The following subsections describe the main architectural elements of the system, beginning with the labeling process and continuing with the Extract, Transform, Load (ETL) pipeline, component containerization, relational database design, web services, and asynchronous services.

2.1. Labeling

The institution previously relied on a manual process to label responses to the student satisfaction survey. Individual staff members independently reviewed survey responses and assigned labels based on their own interpretation and personal criteria. As a result, each annotator tended to generate a distinct set of labels, reflecting subjective judgments rather than a common, shared taxonomy. These labels were then manually consolidated into broader subgroups in an attempt to reduce fragmentation.

The labels produced during this process often consisted of multi-word phrases that closely resembled the original survey responses. This made it difficult to derive high-level insights, compare results across cohorts, or maintain consistency over time. Differences in interpretation, vocabulary, and label granularity further expanded the label space, making the resulting annotations difficult to manage and interpret.

Given these limitations, a central design objective was to build an automated labeling system capable of producing consistent labels while preserving transparency and interpretability. The proposed system was therefore designed to replace the manual workflow with a scalable process in which survey responses are transformed into structured prompts, processed by an LLM, and stored together with traceable metadata for subsequent human review and analysis.

2.2. Extract, Transform, Load (ETL) Pipeline

A structured Extract, Transform, Load (ETL) pipeline was required to convert raw survey data into a representation suitable for automated labeling and long-term storage. The pipeline was divided into two major stages: (1) data preparation and AI-driven semantic processing, and (2) data consolidation and structured database insertion. This separation allowed each stage to be validated, debugged, and scaled independently. The overall two-stage workflow is shown in Figure 3.

Each stage is described next, beginning with data preparation and AI processing, followed by data consolidation and database storage.

2.2.1. Data Preparation and AI Processing

The first stage of the pipeline transforms raw survey data into structured inputs suitable for large language model processing. Survey responses were extracted together with auxiliary resources such as program metadata, question definitions, and nationality mappings. Term filtering was carried out using a list of forbidden terms defined in consultation with CICESE’s institutional ethics committee. These resources enriched the data and helped ensure compliance with institutional policies before AI processing.

The transformation phase included the following operations:

Content filtering: The responses were scanned for forbidden terms (defined by the ethics committee) to meet privacy and ethical standards.
Question segmentation: The responses were grouped by individual question.
Metadata enrichment: Additional metadata was added to each response.
Output: Structured JSON files are generated for each question, retaining both textual content and metadata.
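
The transformation operations listed above can be illustrated with a minimal Python sketch. The field names (question_id, program, text), the forbidden terms, the program names, and the choice to exclude a flagged response rather than redact it are all illustrative assumptions; the source does not specify them.

```python
import json
from collections import defaultdict

# Hypothetical forbidden terms; the real list was defined with the ethics committee.
FORBIDDEN_TERMS = {"passport", "secretname"}

def contains_forbidden(text, forbidden=FORBIDDEN_TERMS):
    """Return True if any forbidden term appears as a whole word (case-insensitive)."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return not words.isdisjoint(forbidden)

def prepare(responses, program_names):
    """Filter, enrich, and group raw responses by question."""
    by_question = defaultdict(list)
    for r in responses:
        if contains_forbidden(r["text"]):
            continue  # assumption: flagged responses are excluded, not redacted
        # Metadata enrichment: attach the (hypothetical) full program name.
        enriched = dict(r, program_name=program_names.get(r["program"], "unknown"))
        by_question[r["question_id"]].append(enriched)
    return dict(by_question)

def to_json_documents(by_question):
    """Serialize one JSON document per question, keeping text and metadata."""
    return {
        qid: json.dumps({"question_id": qid, "responses": items}, ensure_ascii=False)
        for qid, items in by_question.items()
    }

raw = [
    {"question_id": "q1", "program": "OC", "text": "Great labs"},
    {"question_id": "q1", "program": "OC", "text": "My passport was lost"},
    {"question_id": "q2", "program": "CS", "text": "More electives, please."},
]
grouped = prepare(raw, {"OC": "Oceanography", "CS": "Computer Science"})
documents = to_json_documents(grouped)
```

In this sketch the second response is dropped by the content filter, so the document for q1 contains a single enriched response.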

After preprocessing, the selected language model was used for semantic processing, producing:

Semantic labels aligned with the proposed labeling framework.
Normalized satisfaction scores in the interval [0, 1] for quantitative analysis and comparison.
AI-labeled JSON files containing the enriched outputs.

These JSON files preserved traceability to the original responses while incorporating semantic annotations, as shown in Figure 4.
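
As a hedged illustration of how such an AI-labeled record might be validated before storage, the sketch below parses a model reply and checks that the satisfaction score lies in [0, 1]. The reply format (a JSON object with "labels" and "score" keys) and the response_id field are assumptions made for this example, not the system's documented format.

```python
import json

def parse_labeled_output(raw_output, response_id):
    """Validate one model reply and attach the id of the original response."""
    data = json.loads(raw_output)
    labels = data.get("labels")
    score = data.get("score")
    if not isinstance(labels, list) or not all(isinstance(item, str) for item in labels):
        raise ValueError("labels must be a list of strings")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("score must be numeric")
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Keeping response_id preserves traceability to the original response.
    return {"response_id": response_id, "labels": labels, "score": float(score)}

record = parse_labeled_output('{"labels": ["facilities"], "score": 0.8}', "r-001")
```

Rejecting malformed replies at this point keeps invalid scores out of the downstream quantitative analysis.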

2.2.2. Data Consolidation and Database Storage

The second stage of the ETL pipeline transformed the AI-enriched outputs into a database-ready format compatible with the relational schema. In this phase, original survey data and AI-generated annotations were merged into a unified structured representation. Special attention was given to:

Schema alignment: Mapping JSON fields to normalized relational tables.
Data integrity validation: Ensuring consistency between questions, responses, labels, and foreign key relationships.
Deduplication and referential checks: Preventing orphan records and preserving one-to-many relationships where appropriate.

The final output of this stage was the insertion of structured records into the relational database, enabling efficient querying and analysis of survey responses and their associated semantic labels. This process ensured that the information generated during AI processing was preserved and made available for downstream applications such as reporting and institutional decision-making.
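
The integrity checks described for this stage can be sketched in plain Python as follows. The record shapes and key names are hypothetical, and the policy of keeping the first copy of a duplicated response is an assumption of this example.

```python
def consolidate(questions, responses, labels):
    """Merge responses with their labels, dropping duplicates and orphans."""
    question_ids = {q["id"] for q in questions}
    merged = {}
    for r in responses:
        if r["question_id"] not in question_ids:
            continue  # orphan response: its question does not exist
        merged.setdefault(r["id"], dict(r, labels=[]))  # first copy wins
    orphan_labels = []
    for lab in labels:
        target = merged.get(lab["response_id"])
        if target is None:
            orphan_labels.append(lab)  # label pointing at no stored response
        elif lab["label"] not in target["labels"]:
            target["labels"].append(lab["label"])  # one response, many labels
    return list(merged.values()), orphan_labels

questions = [{"id": "q1"}]
responses = [
    {"id": "r1", "question_id": "q1", "text": "Good"},
    {"id": "r1", "question_id": "q1", "text": "Good"},  # duplicate
    {"id": "r2", "question_id": "q9", "text": "Lost"},  # unknown question
]
labels = [
    {"response_id": "r1", "label": "teaching"},
    {"response_id": "r1", "label": "teaching"},  # duplicate label
    {"response_id": "r2", "label": "other"},     # refers to a dropped response
]
rows, orphan_labels = consolidate(questions, responses, labels)
```

Returning the orphaned labels separately, rather than silently discarding them, leaves room for manual review before insertion.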

Virtualization was utilized to host a collection of Docker containers. This execution environment allowed access to the Docker Hub, which served as a public repository for the container images required by the project. Each container encapsulated a distinct component of the system’s business logic, including web services, background asynchronous worker processes, the relational database, and the key-value data store.

This design followed a microservices-oriented architecture. Service orchestration is handled with Docker Compose, where each of the service instances, environment variables, storage bindings, and internal network definitions are declared in a docker-compose.yaml file. Public base images used in the system include node:18, redis:7, and mariadb:11.

The system employed a microservices architecture hosted using virtualization, providing a safe execution environment for multiple containers, in which each Docker container encapsulated a discrete service component. Orchestration was managed via Docker Compose, which defined service instances, configurations, and inter-container networking. The use of containerization promoted modularity, portability, and maintainability throughout the deployment life cycle.

A subset of the containerized components included a MariaDB relational database. The primary role of this database was to store relationships among application entities. A data model was designed to align the query structures with institutional objectives. The resulting database schema was informed by the data formats and access patterns of other key services within the system.

The four main entities in the database schema are users, responses, questions, and labels. Together, these entities support the core business operations of the platform: registering responses, assigning labels to the submitted content, and presenting stored responses together with their semantic annotations.

The Responses table, a core component of the database, is deliberately not strictly normalized. This schema design is a strategic choice justified by specific operational requirements, such as maintaining historical data integrity and optimizing read performance for complex institutional metadata queries. It also allows for incremental updates to the schema, prioritizing system stability and uninterrupted service over strict normalization.

The database also uses two association tables. The tags_per_answer table manages the many-to-many relationship between responses and their tags. Similarly, to support role-based access control (RBAC), the role_permissions table links specific permissions to user roles. During the requirements-gathering phase, the need for user roles was identified as a key feature to enable future administrative enhancements.
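
A minimal sketch of the two association tables is given below. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for MariaDB; only the table names tags_per_answer and role_permissions come from the text, while every column name and the surrounding tables are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for MariaDB, for illustration only
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE responses (id TEXT PRIMARY KEY, answer TEXT NOT NULL);
CREATE TABLE labels (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
-- Many-to-many link between responses and their tags.
CREATE TABLE tags_per_answer (
    response_id TEXT NOT NULL REFERENCES responses(id),
    label_id INTEGER NOT NULL REFERENCES labels(id),
    PRIMARY KEY (response_id, label_id)
);
CREATE TABLE roles (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE permissions (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
-- Many-to-many link between roles and permissions (RBAC).
CREATE TABLE role_permissions (
    role_id INTEGER NOT NULL REFERENCES roles(id),
    permission_id INTEGER NOT NULL REFERENCES permissions(id),
    PRIMARY KEY (role_id, permission_id)
);
""")

conn.execute("INSERT INTO responses VALUES ('r1', 'Good labs')")
conn.executemany("INSERT INTO labels (id, name) VALUES (?, ?)",
                 [(1, "facilities"), (2, "satisfaction")])
conn.executemany("INSERT INTO tags_per_answer VALUES (?, ?)", [("r1", 1), ("r1", 2)])

# Retrieve every tag attached to one response through the association table.
tags = [row[0] for row in conn.execute(
    "SELECT l.name FROM tags_per_answer t JOIN labels l ON l.id = t.label_id "
    "WHERE t.response_id = ? ORDER BY l.name", ("r1",))]
```

The composite primary keys prevent the same tag or permission from being linked twice to the same response or role.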

A web service was designed using a modular and flexible architectural design approach, in which the system functionality was divided into multiple independent microservices, each aligned with distinct functional domains.

Create, Read, Update, and Delete (CRUD) operations were implemented as individual API endpoints that interacted with the relational database through a dedicated data-access layer. The system used an object-relational mapping framework to abstract direct database interaction. Beyond standard CRUD functionality, several services implemented higher-level operations tailored to the survey workflow, such as response ingestion, filtered retrieval, and preparation for data export.

The export component generates structured representations of survey data for human use. It aggregates attributes related to survey classification, participant demographics, academic context, and temporal metadata through the abstracted data-access layer, serializes the results into a standard tabular file format, and returns the resulting generated file to the client. This functionality supported downstream analysis and reporting while protecting internal system representations from unnecessary exposure.
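
The export step can be sketched as follows, assuming CSV as the tabular format (the source only says "a standard tabular file format") and a hypothetical column whitelist. Fields outside the whitelist, such as internal keys, are ignored rather than exported.

```python
import csv
import io

# Hypothetical export columns; the real attribute set is not given in the text.
EXPORT_COLUMNS = ["response_id", "program", "year", "label", "score"]

def export_csv(rows, columns=EXPORT_COLUMNS):
    """Serialize only whitelisted columns so internal fields are not exposed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = export_csv([
    {"response_id": "r1", "program": "Oceanography", "year": 2024,
     "label": "facilities", "score": 0.8, "internal_pk": 17},
])
```

Here the internal_pk field never reaches the generated file, which mirrors the stated goal of shielding internal representations.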

A key subset of the response-handling functionality was responsible for initiating asynchronous services to manage computationally intensive tasks. This mechanism is triggered during response submission to avoid blocking the request-response cycle with high-latency inference or persistence operations. Instead of executing the entire classification workflow synchronously, the system delegated long-running tasks to the asynchronous processing layer.

The asynchronous processing services were responsible for analyzing, classifying, and persisting submitted responses. These services encapsulated the logic required for automated categorization by interacting with an external inference API powered by a large language model, as well as the processes necessary to store resulting data through an abstracted persistence layer. For internal process coordination, the asynchronous services interfaced with a key-value store that supported task tracking and execution-state management. Each invocation of the response submission workflow registered a new task associated with a unique identifier and enqueued it for background asynchronous execution. Once a task is created, the web service returns an acknowledgment containing the task identifier, while execution continues independently.
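
The submit-and-acknowledge pattern described above can be illustrated with a small in-process sketch. A dictionary and a queue.Queue stand in for the key-value store and the task queue, and the state names ("queued", "running", "done", "failed") are assumptions of this example.

```python
import queue
import uuid

task_state = {}             # stand-in for the key-value store
task_queue = queue.Queue()  # stand-in for the asynchronous task queue

def submit_response(payload):
    """Register a task, enqueue it, and return an acknowledgment immediately."""
    task_id = str(uuid.uuid4())
    task_state[task_id] = "queued"
    task_queue.put((task_id, payload))
    return {"task_id": task_id, "status": "queued"}

def run_next_task(handler):
    """Worker side: take one task, run it, and record the final state."""
    task_id, payload = task_queue.get()
    task_state[task_id] = "running"
    try:
        handler(payload)
        task_state[task_id] = "done"
    except Exception:
        task_state[task_id] = "failed"
    finally:
        task_queue.task_done()
    return task_id

ack = submit_response({"question": "q1", "answer": "Good labs"})
```

The caller receives ack before any classification happens and can later look up the task state using the returned identifier.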

The asynchronous processing component also performs data transformation and sanitation to convert the raw user input into a structured representation suitable for downstream processing. Incoming textual content was parsed into an internal object format and passed through a cleaning phase to remove sensitive or extraneous information. The output was then normalized into a canonical representation containing fields such as question, answer, and comment, which served as the basis for later inference and persistence.
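
A possible shape for the cleaning and normalization step is sketched below. Masking e-mail addresses is only one assumed example of removing sensitive information; the actual cleaning rules are not described in the text. The canonical fields question, answer, and comment follow the description above.

```python
import re

# Assumed example of sensitive data: e-mail addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def clean(text):
    """Mask e-mail addresses and collapse runs of whitespace."""
    return " ".join(EMAIL.sub("[removed]", text or "").split())

def normalize(raw):
    """Map raw input onto the canonical question/answer/comment fields."""
    return {
        "question": clean(raw.get("question")),
        "answer": clean(raw.get("answer")),
        "comment": clean(raw.get("comment")),
    }

canonical = normalize({
    "question": " Facilities? ",
    "answer": "Good,\n write to ana@example.org",
})
```

Missing fields are normalized to empty strings so that later stages can rely on all three keys being present.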

After preprocessing, the services persisted the processed responses in the database. Each stored record was assigned a universally unique identifier (UUID), allowing the system to reference specific inputs without retransmitting the complete payload. This persistence step precedes the invocation of the large language model.

The production configuration included four workers for asynchronous task execution. When one worker was busy, the task scheduler delegated work to the next available worker; when all workers were occupied, additional requests remained queued until capacity became available.
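
The queuing behavior of a four-worker pool can be demonstrated with Python's concurrent.futures module. This is a sketch of the scheduling idea only, not the task-queue software used in production; the classify function and the sleep that simulates inference latency are placeholders.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

WORKERS = 4  # matches the production configuration described above
stats = {"active": 0, "peak": 0}
lock = threading.Lock()

def classify(task_id):
    """Placeholder for a long-running classification task."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.05)  # simulates inference latency
    with lock:
        stats["active"] -= 1
    return task_id

# Ten tasks are submitted; at most four run at once and the rest wait in the queue.
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    results = list(pool.map(classify, range(10)))
```

The recorded peak never exceeds the number of workers, which is the property the production configuration relies on when all workers are occupied.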

3. Experiments

System evaluation was constrained by several structural limitations inherent to the available data and prior institutional practices. The most significant limitation arose from the nature of the manual labeling process. As previously discussed, no standardized or formally defined labeling scheme existed before the initiation of this project. Instead, survey responses had been manually annotated by multiple individuals, each applying labels based on personal interpretation and criteria. This resulted in a heterogeneous and inconsistent label set that could not be treated as a reliable ground truth. Due to the absence of a consistent and validated reference labeling, it was not feasible to apply conventional quantitative evaluation metrics commonly used in classification tasks, such as precision, recall, or F1. These metrics require a stable gold standard against which predictions can be compared, a condition not met in this context. To mitigate this constraint, the evaluation strategy focused on both qualitative and operational criteria. The primary form of validation consisted of expert review by institutional stakeholders, who assessed whether the generated labels were coherent, interpretable, and useful for their intended analytical and decision-making purposes. While this form of evaluation did not provide a rigorous quantitative measure of model performance, it reflected the practical requirements and real-world usage conditions under which the system was deployed, and thus addressed both the qualitative and the operational criteria.

This approach is acknowledged as a present limitation from a methodological standpoint. While LLM-based evaluation is not without limitations—and traditional approaches have shown competitive performance in certain NLP tasks such as semantic tagging and clinical text mining [18]—the nature of the data in this system presented specific challenges that constrained the choice of evaluation method. Human-based annotation was deemed impractical given the scale of labels required, as manual labeling at this volume has proven methodologically unreliable in prior attempts. Dictionary-based approaches were also considered; however, the literature consistently reports their limitations in handling affective language, including emotional nuance, double negations, and sarcastic expressions [19,20]—precisely the constructs central to this system’s inputs. Given these constraints, and the production-oriented deployment context of the system, stakeholder acceptance and perceived utility were adopted as the most feasible evaluation criteria, acknowledging that a more controlled ground truth remains a desirable direction for future work.

Four large language models were evaluated as candidates for the automated labeling component: deepseek/deepseek-r1-0528, deepseek/deepseek-v3.2-exp, mistralai/mistral-medium-3.1, and openai/gpt-4o. These models were chosen because they represented distinct trade-offs in terms of latency, output verbosity, and monetary cost, while remaining competitive within the state-of-the-art landscape at the time. The evaluation of the LLMs focused on cost-efficiency, response time, and token consumption under realistic workload conditions, as shown in Table 1, all while maintaining high performance and security standards.

From the evaluated options, deepseek/deepseek-r1-0528 was selected for deployment. Although it was identified early in the project as a promising candidate, its final selection was based on the following systematic benchmarking results. To compare candidate LLMs under practical conditions, a benchmarking process was conducted in which each model was evaluated using fifty independent inference requests. All requests were executed via OpenRouter, which served as a unified backend to ensure consistent request handling across models. The evaluation was designed to capture typical model behavior during real task execution. Each benchmark recorded the following information: the model identifier, the full prompt and corresponding response, the number of input and output tokens, timing data including request start time, completion time, and total latency, and the monetary cost associated with input and output tokens separately. The prompt structure matched the format used in the production system, while the content itself was drawn from randomized data. Specifically, twenty responses were randomly selected from the survey question along with their associated comments. These responses were injected into the prompt template to emulate real-world variability in input data and to observe model behavior under diverse semantic conditions.
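
The benchmark procedure described above can be outlined in Python as follows. The prompt template, the record field names, and the sample pool are hypothetical; only the sample size of twenty and the list of recorded quantities come from the text.

```python
import random
from dataclasses import dataclass

PROMPT_TEMPLATE = "Label each response.\n{items}"  # hypothetical template

def build_prompt(pool, k=20, seed=None):
    """Inject k randomly chosen (answer, comment) pairs into the template."""
    rng = random.Random(seed)
    sample = rng.sample(pool, k)
    items = "\n".join(f"- {answer} | {comment}" for answer, comment in sample)
    return PROMPT_TEMPLATE.format(items=items)

@dataclass
class BenchmarkRecord:
    """One benchmark observation, mirroring the quantities listed in the text."""
    model: str
    prompt: str
    response: str
    input_tokens: int
    output_tokens: int
    started_at: float
    finished_at: float
    input_cost: float
    output_cost: float

    @property
    def latency(self):
        return self.finished_at - self.started_at

pool = [(f"answer {i}", f"comment {i}") for i in range(100)]
prompt = build_prompt(pool, seed=0)
```

Passing a seed makes a given prompt reproducible, which can help when the same randomized input needs to be replayed against several models.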

Model benchmarking was conducted using four key quantitative indicators: the number of input tokens sent through the body of the API request, the number of output tokens counted from the body of the API response, the monetary cost associated with input tokens, and the monetary cost associated with output tokens.

Token counts for both inputs and outputs were obtained using a tokenization step based on the tokenization standards of each model. Specifically, the mistralai/mistral-medium-3.1 model was tokenized using the mistralai/Mistral-7B-v0.1 tokenizer [24], which belongs to the same tokenizer family. For the DeepSeek models, deepseek/deepseek-r1-0528 and deepseek/deepseek-v3.2-exp were tokenized using their official tokenizers deepseek-ai/DeepSeek-R1 [21] and deepseek-ai/DeepSeek-V3 [22], respectively. For openai/gpt-4o, tokenization was performed using the Xenova/gpt-4o community tokenizer [25].

Latency measurements were collected from the moment a request was dispatched until a response was returned by the OpenRouter API. Timing was recorded using the built-in time module of Python 3.14, with timestamps obtained via the time.time() function. The total response time was calculated as the difference between the response arrival time and the request initiation time. While this method provided precise per-request measurements, some variability was attributed to transient network conditions during testing. The monetary cost of each request was computed using the pricing rates provided by OpenRouter for each model [26,27,28,29] as of November 2025. Table 2 summarizes the input and output token prices for all evaluated models.
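
The latency and cost calculations can be sketched as below. The timing follows the time.time() approach stated in the text; the cost function assumes prices quoted in USD per million tokens, and the numbers in the example are placeholders rather than values from Table 2.

```python
import time

def timed_call(send_request, *args):
    """Return (result, latency_seconds) using time.time(), as in the benchmark."""
    started = time.time()
    result = send_request(*args)
    return result, time.time() - started

def request_cost(input_tokens, output_tokens, input_price, output_price):
    """Return (input_cost, output_cost), assuming prices per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price
    output_cost = output_tokens / 1_000_000 * output_price
    return input_cost, output_cost

# Placeholder request and placeholder prices, for illustration only.
result, latency = timed_call(lambda x: x * 2, 21)
input_cost, output_cost = request_cost(2000, 500, input_price=1.0, output_price=4.0)
```

For interval timing in general, time.perf_counter() is a monotonic alternative that is unaffected by system clock adjustments.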

During the benchmarking process, all model temperatures were set to 0. Figure 5, Figure 6, Figure 7 and Figure 8 present the response time comparison and distribution across all tested models. Among the evaluated models, openai/gpt-4o exhibited the lowest average response time, with an average latency of approximately 15 s.

The experimental evaluation highlights the trade-offs among the candidate models in terms of latency, verbosity, and operational cost. openai/gpt-4o achieved the fastest average response times (approximately 15 s per request) but incurred the highest monetary cost for both input and output. In contrast, deepseek/deepseek-v3.2-exp offered the most economical pricing, making it the most cost-effective option. mistralai/mistral-medium-3.1 exhibited a tendency toward verbosity, generating the largest average number of output tokens, while deepseek/deepseek-r1-0528 demonstrated the highest latency, with an average response time of approximately 232 s, potentially limiting its suitability for time-sensitive deployments. Taken together, these findings provide a practical basis for comparing the models under realistic production conditions. Although the absence of a reliable gold standard prevented the use of conventional supervised evaluation metrics, the benchmarking framework enables a meaningful assessment of system behavior from an operational perspective. By combining cost, response time, and output characteristics with stakeholder review, the evaluation supported the selection of models not only on technical performance but also on practical institutional utility. The following section builds on these findings by discussing their implications for the overall system and its deployment context.

4. Results

On average, each student provided responses to 10.8 open-ended questions. The shortest non-empty responses contained a single word, while the longest responses reached up to 20 words in length. All survey responses were collected using an internal data collection tool developed and maintained by the institution.

Running the pipeline on the student satisfaction surveys generated 3828 unique tags. These tags were used to label the responses of 2062 students from 12 academic programs. Of these students, 1184 were male and 878 were female (Figure 9); 434 were doctoral students and 1628 were master’s students. The full breakdown by postgraduate program and year of response is shown in Figure 10. All responses were distributed over a 15-year period.

The satisfaction scores generated during the AI-powered labeling process are interpreted as follows: a score below 0.5 is defined as negative or unsatisfactory, a score of 0.5 is defined as neutral, and a score above 0.5 is defined as positive or satisfactory. Based on these scores, the following figures present the distribution of positively labeled student responses broken down by interest area (Figure 11) and academic program (Figure 12).
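
The score thresholds defined above translate directly into a small classification function. The function names and the helper that computes the share of positive responses are illustrative additions, not part of the described system.

```python
def sentiment_class(score):
    """Map a satisfaction score in [0, 1] to the three classes defined above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < 0.5:
        return "negative"
    if score > 0.5:
        return "positive"
    return "neutral"

def positive_share(scores):
    """Fraction of scores classified as positive (0.0 for an empty input)."""
    if not scores:
        return 0.0
    positives = sum(1 for s in scores if sentiment_class(s) == "positive")
    return positives / len(scores)

share = positive_share([0.2, 0.5, 0.9, 1.0])
```

Because neutral is defined as exactly 0.5, any aggregation by interest area or program depends on how often the model emits that exact value.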

Taken together, these results indicate that model selection in this application depends on balancing responsiveness, cost efficiency, and output characteristics rather than optimizing a single metric. Although a conventional supervised evaluation was not feasible because no reliable gold standard existed, the combined use of benchmarking and stakeholder review provided a practical basis for assessing institutional usefulness. In operational terms, the results confirm that the proposed system can support real-world survey analysis while making the trade-offs among candidate models explicit for deployment decisions.

5. Discussions

The design and deployment of a production-grade LLM-based survey classification system at CICESE demonstrated that success depended on more than model performance. An end-to-end systems approach that integrated data handling, governance, infrastructure, and operational use was essential to ensure outputs that were technically reliable and operationally usable within the institutional context.

Model selection prioritized latency, cost, and security constraints over accuracy alone. This deployment-focused approach favored models that balanced classification quality with predictable performance and feasibility under routine operating conditions, aligning model choice with sustained service delivery requirements and the practical limits of CICESE’s survey workflows.

Deployment on CICESE servers met cybersecurity standards while preserving workflow stability and traceability. This demonstrated that AI integration remained compliant and accountable without disrupting institutional processes or weakening controls. Governance and reliability objectives were directly addressed through deployment design, avoiding post hoc corrections.

The project replaced subjective manual labeling with an automated workflow that included a semantically structured label space, parallel AI-based question classification, [0, 1] satisfaction scoring, and relational storage. Processing time was reduced from weeks or months to minutes, outputs were judged interpretable and useful for decision-making, and the initiative established a precedent for future AI projects at CICESE. Due to data-access policies that limited the scope to institutional surveys, future research was recommended to focus on standardized labeling protocols and curated benchmark datasets for stronger quantitative evaluation.

The database design reflected a deliberate tradeoff: it prioritized read performance for operational reporting over strict normalization, which introduced higher maintenance complexity over time. This decision was deemed acceptable within the project’s scope, as rapid retrieval and decision support were prioritized to meet institutional requirements.

The institutional cultural shift triggered by replacing subjective human labeling with AI-driven consistency was significant, as researchers redirected their energy toward other tasks. The AI-powered insights empowered administrators in ways the previous manual system could not, providing a clearer and more accurate understanding of key indicators. In practical terms, the platform changes the cadence of survey analysis inside the institution. Instead of waiting for delayed manual labeling, decision-makers can review labeled responses and satisfaction indicators soon after data collection, which supports more timely follow-up on recurring concerns, faster prioritization of academic services, and more consistent reporting across programs and evaluation cycles. The traceable storage of prompts, labels, and metadata also makes it easier to audit how specific outputs were produced, which is important when the results are used to inform administrative actions or policy adjustments.

Future work should validate the platform with a more robust and explicit assessment protocol. A useful next step would be to build a gold-standard subset of survey responses annotated by multiple domain experts, then compare the platform outputs against that benchmark using agreement and classification metrics, along with error analysis by program, question type, and label category. In parallel, the deployment could be monitored over successive survey cycles to measure drift in label distributions, response times, and stakeholder utility. Alternative labeling strategies, including multi-pass annotation and expert adjudication, should also be tested to determine which protocol produces the most stable and reproducible reference labels. User-centered validation would further strengthen the evidence base: periodic reviews with institutional staff, structured usability feedback, and controlled comparisons against the previous manual workflow would clarify whether the platform continues to improve turnaround time, interpretability, and decision support under real operating conditions.

6. Conclusions

Our study successfully designed, deployed, and integrated a production-grade survey classification system based on LLMs at CICESE. This result demonstrates that institutional AI initiatives can move from exploratory pilots to stable operational services when technical design is aligned with organizational requirements. Beyond the modeling component, the project delivered a complete implementation path that connected data ingestion, classification, storage, and reporting in a single workflow. In this sense, the work contributes both a functional system and an applied blueprint for future AI deployments in comparable higher-education and public research environments.

Model selection was guided by operational constraints rather than benchmark accuracy in isolation. Specifically, the evaluation prioritized latency, cost, cybersecurity compatibility, workflow stability, and traceability, since these factors determine whether a system can be sustained in routine institutional use. This decision framework ensured that model performance was interpreted within real deployment conditions and governance requirements. Consequently, the selected stack balanced classification quality with service reliability, enabling consistent day-to-day use while remaining compatible with CICESE standards for secure and accountable digital operations.

The resulting pipeline replaced manual survey labeling with an automated process that introduced a semantically structured labeling framework, parallelized AI-based question classification, and satisfaction scoring in the range [0, 1]. The system also consolidated outputs in a relational schema designed to support downstream institutional reporting and analysis. Together, these components transformed labeling from a subjective, labor-intensive task into a reproducible computational procedure. This architectural integration is a central contribution of the study because it links model inference, data structure, and decision support in a coherent production setting.
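The satisfaction score in [0, 1] can be mapped to a sentiment category using the thresholds stated in the captions of Figures 11 and 12 (below 0.5 negative, exactly 0.5 neutral, above 0.5 positive). The sketch below assumes scores arrive as plain floats; the function name and the sample scores are hypothetical and are not taken from the production code.

```python
from collections import Counter


def sentiment_bucket(score: float) -> str:
    """Map a satisfaction score in [0, 1] to negative / neutral / positive."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must lie in [0, 1], got {score!r}")
    if score < 0.5:
        return "negative"
    if score > 0.5:
        return "positive"
    return "neutral"


# Hypothetical scores, for illustration only.
scores = [0.1, 0.5, 0.9, 0.75, 0.5, 0.3]
distribution = Counter(sentiment_bucket(s) for s in scores)
print(dict(distribution))
```

Rejecting out-of-range values explicitly keeps malformed model output from being silently counted in one of the categories.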

Operationally, the automation produced a substantial efficiency gain for the Department of Graduate Follow-up and Scholarships. Processing that previously required weeks or months of manual effort was reduced to minutes, allowing results to be generated in decision-relevant timeframes. This improvement is not only quantitative but organizational, because faster turnaround supports more responsive planning cycles and better coordination among administrative stakeholders. In practical terms, the system increased throughput without compromising traceability, which is critical for institutional trust and for sustained adoption across recurring survey-analysis workflows.

Stakeholder validation further confirmed the practical value of the system outputs. Users reported that the generated classifications and satisfaction indicators were interpretable, actionable, and aligned with institutional decision-making needs, indicating strong semantic usability beyond raw model performance. This feedback is important because it demonstrates that the pipeline supports human-in-the-loop governance rather than opaque automation. By producing results that domain users can understand and apply, the implementation strengthens organizational confidence in AI-assisted analysis and establishes a credible precedent for broader data-informed policy and planning processes.

Performance testing showed meaningful latency differences across candidate models under the same evaluation setup. The fastest model was openai/gpt-4o with an average response time of 15 s, followed by mistralai/mistral-medium-3.1 at 23 s and deepseek/deepseek-v3.2-exp at 48 s, while deepseek/deepseek-r1-0528 was the slowest at 232 s. These results reinforce the importance of deployment-aware selection criteria, since response time directly affects operational feasibility. Overall, the comparison provides an empirical basis for balancing quality, cost, and speed in future institutional model procurement and configuration decisions.
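To put these latency figures on a common scale, the short sketch below expresses each average response time as a multiple of the fastest model. The averages are the ones reported above; the helper name and the batch size used in the projection are hypothetical, and the projection assumes strictly sequential requests, which the parallelized production pipeline would not incur.

```python
# Average response times in seconds, as reported in the text.
AVG_RESPONSE_SECONDS = {
    "openai/gpt-4o": 15,
    "mistralai/mistral-medium-3.1": 23,
    "deepseek/deepseek-v3.2-exp": 48,
    "deepseek/deepseek-r1-0528": 232,
}


def slowdown_vs_fastest(latencies):
    """Return each model's latency divided by the lowest latency in the set."""
    fastest = min(latencies.values())
    return {model: seconds / fastest for model, seconds in latencies.items()}


ratios = slowdown_vs_fastest(AVG_RESPONSE_SECONDS)

# Hypothetical batch size; assumes one request at a time with no parallelism.
N_REQUESTS = 100
for model, seconds in AVG_RESPONSE_SECONDS.items():
    minutes = seconds * N_REQUESTS / 60
    print(f"{model}: {ratios[model]:.1f}x the fastest, ~{minutes:.0f} min for {N_REQUESTS} sequential requests")
```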

Author Contributions

Conceptualization, D.E.G.-L. and E.L.-S.; methodology, D.E.G.-L.; software, D.E.G.-L.; validation, D.E.G.-L., E.L.-S. and L.I.B.-S.; formal analysis, E.L.-S.; investigation, D.E.G.-L.; resources, E.L.-S. and E.Q.C.; data curation, E.L.-S.; writing—original draft preparation, D.E.G.-L.; writing—review and editing, D.E.G.-L., E.L.-S., L.I.B.-S. and E.Q.C.; visualization, D.E.G.-L.; supervision, E.L.-S.; project administration, E.L.-S. and E.Q.C.; funding acquisition, E.L.-S. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to acknowledge the financial support of Writing Lab, Institute for the Future of Education, Tecnologico de Monterrey, Mexico, in the production (APC) of this work.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Ethics Committee of CICESE (Comité de Bioética, on 09, 2025).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The datasets presented in this article are not readily available due to subject privacy and security concerns. Requests to access the datasets should be directed to CICESE.

Acknowledgments

We acknowledge CICESE for providing the institutional support and infrastructure necessary for this project.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Survey Data Description

The system recorded 22,286 responses to the student satisfaction survey. These responses correspond to 12 graduate programs across the institution. The distribution of responses by academic program is summarized in Table A1.

Table A1. Distribution of survey responses by graduate program.

Program	Number of Responses
Life Sciences	4233
Earth Sciences	3136
Electronics and Telecommunications	2601
Marine Ecology	2497
Computer Sciences	2174
Optics	1854
Aquaculture	1713
Nanosciences	1690
Physical Oceanography	1475
Materials Physics	812
Advanced and Integrated Technologies	72
Sciences	29

The survey responses span the period from 29 January 2010 to 3 October 2025. Over this interval, responses were temporally distributed as follows: 2010 to 2014 with 5612 responses, 2015 to 2019 with 8570 responses, 2020 to 2024 with 6986 responses, and 2025 onwards with 1118 responses.
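As a consistency check on the figures above, the following sketch confirms that the per-program counts in Table A1 and the per-period counts both sum to 22,286, and derives each program's share of the total. The counts are copied from this appendix; the variable names are illustrative only.

```python
# Response counts by program, copied from Table A1.
RESPONSES_BY_PROGRAM = {
    "Life Sciences": 4233,
    "Earth Sciences": 3136,
    "Electronics and Telecommunications": 2601,
    "Marine Ecology": 2497,
    "Computer Sciences": 2174,
    "Optics": 1854,
    "Aquaculture": 1713,
    "Nanosciences": 1690,
    "Physical Oceanography": 1475,
    "Materials Physics": 812,
    "Advanced and Integrated Technologies": 72,
    "Sciences": 29,
}

# Response counts by period, copied from the paragraph above.
RESPONSES_BY_PERIOD = {
    "2010-2014": 5612,
    "2015-2019": 8570,
    "2020-2024": 6986,
    "2025": 1118,
}

TOTAL = 22286
assert sum(RESPONSES_BY_PROGRAM.values()) == TOTAL
assert sum(RESPONSES_BY_PERIOD.values()) == TOTAL

# Share of all responses contributed by each program, in percent.
shares = {name: 100 * count / TOTAL for name, count in RESPONSES_BY_PROGRAM.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```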

The analysis focused on a subset of open-ended survey questions. These questions are listed in Table A2, translated into English from the original Spanish.

Table A2. Open-ended survey questions analyzed, translated to English.

Survey Questions—English Translation
What aspects of your thesis direction do you consider could be improved?
Did the program meet your expectations?—Why?
Only if your thesis has an applied research focus: Did you receive support to connect with the productive sector?—Why?
Do you believe there is consistency between the training received and the graduate profile described in the program?—Why?
Do you consider the length of the school term adequate?—Why?
In the development of your thesis, did you generate new knowledge or apply and/or improve known methods?—Why?
In terms of the timeliness and effectiveness of your thesis advising, do you consider the collegial work of your thesis committee efficient?—Why?
How would you rate the academic training received?—Why?
How would you rate the thesis direction received?—Why?
In your opinion, should the curriculum include any additional course(s)?—Which one(s)?
How do you rate the administrative procedures?—Suggestions
Do you think the graduate program you are graduating from addresses a national need?—Why?
If time could be turned back, would you study the same graduate program at CICESE again?—If not, where would you study it?
Do you think any laboratory needs improvement or updating?—Which one(s)?
Do you consider your thesis topic current and significant?—Why?
How do you rate the teaching staff?—Why?
How do you rate the structure and content of the curriculum?—Why?

References

1. Saha, S.; Xu, L. Vision transformers on the edge: A comprehensive survey of model compression and acceleration strategies. Neurocomputing 2025, 643, 130417.
2. Saleh, Y.; Abu Talib, M.; Nasir, Q.; Dakalbab, F. Evaluating large language models: A systematic review of efficiency, applications, and future directions. Front. Comput. Sci. 2025, 7, 1523699.
3. Wang, S.; Xu, T.; Li, H.; Zhang, C.; Liang, J.; Tang, J.; Yu, P.S.; Wen, Q. Large language models for education: A survey and outlook. arXiv 2024, arXiv:2403.18105.
4. Todorova, I.T.; Tsanov, I.D. A Survey of Students’ Opinions About the Curriculum and Their Satisfaction With the Atmosphere of the University and the Learning Process As A Whole. Rev. Gestão-RGSA 2024, 18, e010187.
5. Bertini, F.; Dal Palù, A.; Formisano, A.; Pintus, A.; Rainieri, S.; Salvarani, L. Students’ Careers and AI: A decision-making support system for Academia. In Proceedings of the CEUR Workshop Proceedings. CEUR-WS, Pisa, Italy, 29–30 May 2023; Volume 3486, pp. 272–277.
6. Cha, S.; Loeser, M.; Seo, K. The Impact of AI-Based Course-Recommender System on Students’ Course-Selection Decision-Making Process. Appl. Sci. 2024, 14, 3672.
7. Delahoz-Domínguez, E.J.; Hijón-Neira, R. Recommender System for University Degree Selection: A Socioeconomic and Standardised Test Data Approach. Appl. Sci. 2024, 14, 8311.
8. Yin, S.; Imran, R.; Ullah, K.; Ali, Z.; Haleemzai, I. Responsible AI in student management: Preventing misdecision in career choice of university students under inaccurate guidance. Sci. Rep. 2025, 15, 38177.
9. Parker, M.J.; Anderson, C.; Stone, C.; Oh, Y. A large language model approach to educational survey feedback analysis. Int. J. Artif. Intell. Educ. 2025, 35, 444–481.
10. Zhang, M.; Zhao, T. Citation Accuracy Challenges Posed by Large Language Models. JMIR Med. Educ. 2025, 11, e72998.
11. Cuskley, C.; Woods, R.; Flaherty, M. The Limitations of Large Language Models for Understanding Human Language and Cognition. Open Mind 2024, 8, 1058–1083.
12. Kim, J.; Podlasek, A.; Shidara, K.; Liu, F.; Alaa, A.; Bernardo, D. Limitations of large language models in clinical problem-solving arising from inflexible reasoning. Sci. Rep. 2025, 15, 39426.
13. Mei, L.; Yao, J.; Ge, Y.; Wang, Y.; Bi, B.; Cai, Y.; Liu, J.; Li, M.; Li, Z.Z.; Zhang, D.; et al. A Survey of Context Engineering for Large Language Models. arXiv 2025, arXiv:2507.13334.
14. Wang, L.; Li, J.; Zhuang, B.; Huang, S.; Fang, M.; Wang, C.; Li, W.; Zhang, M.; Gong, S. Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis. J. Med. Internet Res. 2025, 27, e64486.
15. Seneviratne, H.M.T.W.; Manathunga, S.S. Artificial intelligence assisted automated short answer question scoring tool shows high correlation with human examiner markings. BMC Med. Educ. 2025, 25, 1146.
16. Gao, R.; Merzdorf, H.E.; Anwar, S.; Hipwell, M.C.; Srinivasa, A. Automatic assessment of text-based responses in post-secondary education: Using NLP and LLMs to deliver formative feedback. arXiv 2023, arXiv:2308.16151.
17. Centro de Investigacion Cientifica y de Educacion Superior de Ensenada (CICESE), Ensenada, B.C. México; Acerca de CICESE. 2025. Available online: https://cicese-at.cicese.mx/int/index.php?mod=acd&op=mis (accessed on 15 February 2026).
18. Tascioglu, A.B.; Bertini, F.; Pistore, L.; Fabbri, A.; Montesi, D. Comorbidity Extraction for In-Hospital Mortality Analysis: A Comparison of Regular Expressions and Large Language Models. In Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, Philadelphia, PA, USA, 12–15 October 2025; pp. 1–10.
19. Roy, J. Sentiment Analysis: Challenges and Insights. J. Mark. Soc. Res. 2025, 2, 180–186.
20. Hill, C.H.; Fresneda, J.E.; Anandarajan, M. The wisdom of the lexicon crowds: Leveraging on decades of lexicon-based sentiment analysis for improved results. J. Big Data 2025, 12, 129.
21. DeepSeek Team. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv 2025.
22. DeepSeek-AI; Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; et al. DeepSeek-V3 Technical Report. arXiv 2024.
23. OpenAI Authors. GPT-4o System Card. arXiv 2024.
24. Mistral AI. Mistral-7B-v0.1. 2023. Available online: https://huggingface.co/mistralai/Mistral-7B-v0.1 (accessed on 15 February 2026).
25. Xenova. Xenova/gpt-4o Community Tokenizer. 2024. Available online: https://huggingface.co/Xenova/gpt-4o (accessed on 15 February 2026).
26. OpenRouter. DeepSeek: R1 0528. Available online: https://openrouter.ai/deepseek/deepseek-r1-0528 (accessed on 15 February 2026).
27. OpenRouter. DeepSeek-V3.2-Exp. Available online: https://openrouter.ai/deepseek/deepseek-v3.2-exp (accessed on 15 February 2026).
28. OpenRouter. Mistral Medium 3.1. Available online: https://openrouter.ai/mistralai/mistral-medium-3.1 (accessed on 15 February 2026).
29. OpenRouter. GPT-4o. Available online: https://openrouter.ai/openai/gpt-4o (accessed on 15 February 2026).

Figure 1. Illustration of the overall production architecture of the proposed LLM-based system, based on the technology stack and its containerization.

Figure 2. The overall production architecture of the proposed LLM-based system, by task responsibility.

Figure 3. Three-stage ETL pipeline architecture illustrating the modular separation between data preparation, AI semantic processing, and data consolidation and database insertion.

Figure 4. Data preparation and AI processing pipeline illustrating the structured transformation of raw institutional survey data into semantically enriched, AI-labeled outputs. By isolating deterministic data governance operations from probabilistic AI inference, the architecture ensures traceability, data integrity, and scalable semantic processing under real-world institutional constraints.

Figure 5. Response time in seconds: Deepseek R1.

Figure 6. Response time in seconds: Deepseek V3.

Figure 7. Response time in seconds: GPT-4o.

Figure 8. Response time in seconds: Mistral.

Figure 9. Gender and program level distribution of survey respondents.

Figure 10. Program and year of response distribution of survey respondents.

Figure 11. Overlapped bar charts showing the distribution of positively labeled student responses by demographic and academic dimensions (score < 0.5: negative; score = 0.5: neutral; score > 0.5: positive). By interest area.

Figure 12. Overlapped bar charts showing the distribution of positively labeled student responses by demographic and academic dimensions (score < 0.5: negative; score = 0.5: neutral; score > 0.5: positive). By academic program.

Table 1. A summary of the evaluated large language models. All descriptions were directly obtained from the OpenRouter website.

Model	Released	Description (Source: OpenRouter Website)
`deepseek-r1-0528` [21]	2025	The May 28 update of DeepSeek R1 delivered performance comparable to OpenAI’s o1 model while remaining fully open source and exposing its reasoning tokens. The model contained 671 billion parameters, with 37 billion active during each inference, and was released as a completely open-source system.
`deepseek-v3.2-exp` [22]	2024	DeepSeek-V3.2-Exp was introduced as an experimental large language model by DeepSeek, serving as a transitional release between V3.1 and upcoming architectures. It featured DeepSeek Sparse Attention (DSA), a detailed sparse attention approach aimed at improving training and inference efficiency for long-context tasks while preserving the quality of generated outputs.
`mistral-medium-3.1`	2025	Mistral Medium 3.1 was a newer iteration of Mistral Medium 3, designed as an enterprise-focused language model to provide near-frontier-level performance while significantly lowering operational expenses. It combined advanced reasoning and multimodal capabilities with costs approximately eight times lower than conventional large models, making it well-suited for large-scale professional and industrial deployments.
`gpt-4o` [23]	2024	The 20 November 2024 release of GPT-4o improved creative writing capabilities, producing text that felt more natural, engaging, and better adapted to context, which enhanced clarity and relevance. It also handled uploaded files more effectively, enabling deeper analysis and more comprehensive responses.

Table 2. Pricing comparison of evaluated large language models. `deepseek/deepseek-v3.2-exp` has the lowest input and output token prices among the evaluated models.

Model	Input Price (USD per 1M Tokens)	Output Price (USD per 1M Tokens)
`mistralai/mistral-medium-3.1`	0.40	2.00
`deepseek/deepseek-r1-0528`	0.50	2.15
`deepseek/deepseek-v3.2-exp`	0.27	0.40
`openai/gpt-4o`	2.50	10.00
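The per-token prices in Table 2 translate into a workload cost with simple arithmetic. The sketch below uses the prices exactly as listed; the token volumes in the example are hypothetical placeholders and not measurements from the deployed system, and the function name is illustrative.

```python
# (input, output) prices in USD per 1M tokens, copied from Table 2.
PRICE_PER_MILLION = {
    "mistralai/mistral-medium-3.1": (0.40, 2.00),
    "deepseek/deepseek-r1-0528": (0.50, 2.15),
    "deepseek/deepseek-v3.2-exp": (0.27, 0.40),
    "openai/gpt-4o": (2.50, 10.00),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for a given number of input and output tokens."""
    price_in, price_out = PRICE_PER_MILLION[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical workload: 1M prompt tokens and 200k completion tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 1_000_000, 200_000
costs = {
    model: estimate_cost_usd(model, INPUT_TOKENS, OUTPUT_TOKENS)
    for model in PRICE_PER_MILLION
}
cheapest = min(costs, key=costs.get)
for model, usd in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model}: ${usd:.2f}")
```

Because output tokens are priced several times higher than input tokens for most of these models, the ranking can shift for workloads that generate long completions, such as reasoning models that emit many intermediate tokens.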

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

