1. Introduction
Transformers and Large Language Models (LLMs) have recently seen rapid adoption and widespread attention [1]. Today, LLMs are applied across diverse fields, including medicine, law, and education [2]. Education in particular is increasingly adopting LLMs to streamline classroom tasks, such as answering student questions and assisting with grading [3]. A particularly promising application of LLMs in education is the evaluation of student feedback. Educational institutions often employ satisfaction surveys to gather students’ opinions on various aspects, such as instructors, facilities, services, curriculum, and the overall environment [4,5,6,7,8]. The evaluation of these surveys can be streamlined using LLMs, enabling faster and more consistent analysis of open-ended student responses [9]. However, LLMs face notable limitations, particularly their propensity to generate inaccurate or unfounded information, commonly referred to as “hallucinations” [10,11,12]. A commonly employed mitigation strategy is to provide the model with the relevant context during prompt construction [13,14]. Grounding the model’s output in user-supplied information makes it feasible to analyze large volumes of student responses in a timely manner and with minimal human intervention [15,16]. This work seeks to address the following question: Q1. How can a large language model system reliably automate the classification of open-ended student survey responses in institutions? To answer this question, this work proposes a survey classification system designed to analyze indirect measurement instruments, that is, surveys, using LLM technology. By identifying key areas of institutional weakness reflected in student feedback, as well as overall perceived performance, the system facilitates informed decision-making and targeted academic improvements. The primary users of this system are the personnel responsible for scholarships and postgraduate programs at the Centro de Investigación Científica y de Educación Superior de Ensenada (CICESE) [17], along with the directors of its various postgraduate programs. The system analyzed 22,286 responses collected between 29 January 2010 and 3 October 2025, categorizing them into 3828 unique labels. These responses were provided by a total of 2062 students from 12 different academic programs and 21 different nationalities.
The remainder of this document is organized as follows: Section 2 details the system architecture, while Section 3 presents the experiments. Section 4 discusses the benchmarking and evaluation results. Finally, Section 5 presents the discussion, and Section 6 highlights the conclusions. Details on the survey design and certain data characteristics are included in Appendix A.
2. Production Architecture of an AI-Powered Survey Evaluation System
The system architecture was designed to operate within CICESE’s on-premises infrastructure. Instead of relying on third-party hosting, the application was deployed using virtualization and container technology on the institution’s internal servers, an arrangement chosen to provide security, scalability, and flexibility. The resulting platform is organized as a set of cooperating services: a web API, several asynchronous background workers, a relational database, and a key-value store.
The web service handled data ingestion, validation, retrieval, and export, while asynchronous workers executed time-intensive tasks such as response preprocessing, prompt construction, and interaction with the large language model. A MariaDB schema stored survey responses together with their generated labels and metadata, whereas the key-value database maintained queue state and worker coordination data.
The deployment was carried out with support from the institution’s technical staff. Access to the virtual machines is restricted to a predefined set of IP addresses, limiting connectivity to authorized institutional devices involved in data retrieval and administration. After configuring access permissions and security policies, the system was deployed in a Linux-based environment, and network connectivity was restricted to the essential operational zones shown in Figure 1 and Figure 2.
The following subsections describe the main architectural elements of the system, beginning with the labeling process and continuing with the Extract, Transform, Load (ETL) pipeline, component containerization, relational database design, web services, and asynchronous services.
2.1. Labeling
The institution relied on a manual process to label responses to the student satisfaction survey. Individual staff members independently reviewed survey responses and assigned labels based on their own interpretation and personal criteria. As a result, each annotator tended to generate a distinct set of labels, reflecting subjective judgments rather than a common, shared taxonomy. These labels were then manually consolidated into broader subgroups in an attempt to reduce fragmentation.
The labels produced during this process often consisted of multi-word phrases that closely resembled the original survey responses. This made it difficult to derive high-level insights, compare results across cohorts, or maintain consistency over time. Differences in interpretation, vocabulary, and label granularity further expanded the label space, making the resulting annotations difficult to manage and interpret.
Given these limitations, a central design objective was to build an automated labeling system capable of producing consistent labels while preserving transparency and interpretability. The proposed system was therefore designed to replace the manual workflow with a scalable process in which survey responses are transformed into structured prompts, processed by an LLM, and stored together with traceable metadata for subsequent human review and analysis.
2.2. Extract, Transform, Load (ETL) Pipeline
A structured Extract, Transform, Load (ETL) pipeline was required to convert raw survey data into a representation suitable for automated labeling and long-term storage. The pipeline was divided into two major stages: (1) data preparation and AI-driven semantic processing, and (2) data consolidation and structured database insertion. This separation allowed each stage to be validated, debugged, and scaled independently. The overall two-stage workflow is shown in Figure 3.
Each stage is described next, beginning with data preparation and AI processing, followed by data consolidation and database storage.
2.2.1. Data Preparation and AI Processing
The first stage of the pipeline transforms raw survey data into structured inputs suitable for large language model processing. Survey responses were extracted together with auxiliary resources such as program metadata, question definitions, and nationality mappings. Term filtering was carried out using a list of forbidden terms defined in consultation with CICESE’s institutional ethics committee. These resources enriched the data and helped ensure compliance with institutional policies before AI processing.
The transformation phase included the following operations:
Content filtering: The responses were scanned for forbidden terms (defined by the ethics committee) to meet privacy and ethical standards.
Question segmentation: The responses were grouped by individual question.
Metadata enrichment: Additional metadata was added to each response.
Output: Structured JSON files were generated for each question, retaining both textual content and metadata.
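The transformation operations listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the deployed implementation: the field names (`question_id`, `text`, `program`), the placeholder forbidden terms, and the choice to exclude flagged responses rather than redact them are all hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical placeholder terms; the real list was defined with the ethics committee.
FORBIDDEN_TERMS = {"forbidden_term_a", "forbidden_term_b"}


def contains_forbidden(text: str) -> bool:
    """Return True if any forbidden term appears as a whole word in the text."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    return not words.isdisjoint(FORBIDDEN_TERMS)


def prepare(responses: list[dict], program_names: dict[str, str]) -> dict[str, str]:
    """Filter, group by question, enrich with metadata, and serialize to JSON."""
    grouped = defaultdict(list)
    for r in responses:
        if contains_forbidden(r["text"]):
            continue  # assumption: flagged responses are excluded from AI processing
        # Metadata enrichment: attach a readable program name (hypothetical field).
        enriched = {**r, "program_name": program_names.get(r["program"], "unknown")}
        # Question segmentation: one bucket per question identifier.
        grouped[r["question_id"]].append(enriched)
    # Output: one JSON document per question, keeping text and metadata together.
    return {qid: json.dumps(items, ensure_ascii=False) for qid, items in grouped.items()}


sample = [
    {"question_id": "q1", "text": "Great labs", "program": "P01"},
    {"question_id": "q1", "text": "Contains forbidden_term_a here", "program": "P01"},
    {"question_id": "q2", "text": "More funding needed", "program": "P02"},
]
files = prepare(sample, {"P01": "Program One"})
```

Keeping one JSON document per question mirrors the segmentation step, so each question can be sent to the model and inspected independently.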
After preprocessing, the selected language model was used for semantic processing, producing:
Semantic labels aligned with the proposed labeling framework.
Normalized satisfaction scores in the interval [0, 1] for quantitative analysis and comparison.
AI-labeled JSON files containing the enriched outputs.
These JSON files preserved traceability to the original responses while incorporating semantic annotations, as shown in Figure 4.
2.2.2. Data Consolidation and Database Storage
The second stage of the ETL pipeline transformed the AI-enriched outputs into a database-ready format compatible with the relational schema. In this phase, original survey data and AI-generated annotations were merged into a unified structured representation. Special attention was given to:
Schema alignment: Mapping JSON fields to normalized relational tables.
Data integrity validation: Ensuring consistency between questions, responses, labels, and foreign key relationships.
Deduplication and referential checks: Preventing orphan records and preserving one-to-many relationships where appropriate.
The final output of this stage was the insertion of structured records into the relational database, enabling efficient querying and analysis of survey responses and their associated semantic labels. This process ensured that the information generated during AI processing was preserved and made available for downstream applications such as reporting and institutional decision-making.
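The integrity checks in this stage could look roughly like the sketch below. It assumes records are plain dictionaries with hypothetical keys (`id`, `question_id`, `response_id`); the real schema and validation rules are not given in this section.

```python
def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each record id, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        if record["id"] not in seen:
            seen.add(record["id"])
            unique.append(record)
    return unique


def find_orphans(responses: list[dict], questions: list[dict], label_links: list[dict]):
    """Return responses pointing at unknown questions and label links
    pointing at unknown responses (both would violate foreign keys)."""
    question_ids = {q["id"] for q in questions}
    response_ids = {r["id"] for r in responses}
    orphan_responses = [r["id"] for r in responses if r["question_id"] not in question_ids]
    orphan_links = [link for link in label_links if link["response_id"] not in response_ids]
    return orphan_responses, orphan_links


questions = [{"id": "q1"}]
responses = deduplicate([
    {"id": "r1", "question_id": "q1"},
    {"id": "r1", "question_id": "q1"},  # duplicate, dropped
    {"id": "r2", "question_id": "q9"},  # refers to a question that does not exist
])
label_links = [
    {"response_id": "r1", "label": "facilities"},
    {"response_id": "r7", "label": "funding"},  # refers to a response that does not exist
]
orphan_responses, orphan_links = find_orphans(responses, questions, label_links)
```

Running such checks before insertion means problems surface as a report rather than as a failed database transaction.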
Virtualization was utilized to host a collection of Docker containers. This execution environment allowed access to the Docker Hub, which served as a public repository for the container images required by the project. Each container encapsulated a distinct component of the system’s business logic, including web services, background asynchronous worker processes, the relational database, and the key-value data store.
This design followed a microservices-oriented architecture. Service orchestration is handled with Docker Compose, where each of the service instances, environment variables, storage bindings, and internal network definitions are declared in a docker-compose.yaml file. Public base images used in the system include node:18, redis:7, and mariadb:11.
Hosting the containers on virtualized infrastructure provided a safe execution environment in which each Docker container encapsulated a discrete service component. The use of containerization promoted modularity, portability, and maintainability throughout the deployment life cycle.
A subset of the containerized components included a MariaDB relational database. The primary role of this database was to store relationships among application entities. A data model was designed to align the query structures with institutional objectives. The resulting database schema was informed by the data formats and access patterns of other key services within the system.
The main four entities in the database schema are: users, responses, questions, and labels. Together, these entities support the core business operations of the platform, registering responses, assigning labels to the submitted content, and presenting stored responses together with their semantic annotations.
In the Responses table, a core component of the database, the schema design is a strategic choice justified by specific operational requirements, such as maintaining historical data integrity or optimizing read performance for complex institutional metadata queries. This design allows for incremental updates to the schema, prioritizing system stability and uninterrupted service over strict normalization.
The database also uses two association tables. The tags_per_answer table manages the many-to-many relationship between responses and their tags. Similarly, to support role-based access control (RBAC), the role_permissions table links specific permissions to user roles. During the requirements-gathering phase, the need for user roles was identified as a key feature to enable future administrative enhancements.
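A many-to-many association of this kind can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for MariaDB; only the table name `tags_per_answer` comes from the text, while the column names and the `tags` table are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE responses (id INTEGER PRIMARY KEY, answer TEXT NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Association table: one row per (response, tag) pair.
CREATE TABLE tags_per_answer (
    response_id INTEGER NOT NULL REFERENCES responses(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (response_id, tag_id)
);
""")

conn.execute("INSERT INTO responses (id, answer) VALUES (1, 'The labs need newer equipment')")
conn.executemany("INSERT INTO tags (id, name) VALUES (?, ?)", [(1, "facilities"), (2, "equipment")])
conn.executemany("INSERT INTO tags_per_answer (response_id, tag_id) VALUES (?, ?)", [(1, 1), (1, 2)])


def tags_for(response_id: int) -> list[str]:
    """Return the tag names attached to one response, sorted alphabetically."""
    rows = conn.execute(
        "SELECT t.name FROM tags AS t "
        "JOIN tags_per_answer AS ta ON ta.tag_id = t.id "
        "WHERE ta.response_id = ? ORDER BY t.name",
        (response_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

A role-to-permission table such as role_permissions would follow the same pattern, with role and permission identifiers forming the composite key.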
The web service followed a modular architecture in which system functionality was divided into multiple independent microservices, each aligned with a distinct functional domain.
Create, Read, Update, and Delete (CRUD) operations were implemented as individual API endpoints that interacted with the relational database through a dedicated data-access layer. The system used an object-relational mapping framework to abstract direct database interaction. Beyond standard CRUD functionality, several services implemented higher-level operations tailored to the survey workflow, such as response ingestion, filtered retrieval, and preparation for data export.
The export component generates structured representations of survey data for human use. It aggregates attributes related to survey classification, participant demographics, academic context, and temporal metadata through the abstracted data-access layer, serializes the results into a standard tabular file format, and returns the resulting generated file to the client. This functionality supported downstream analysis and reporting while protecting internal system representations from unnecessary exposure.
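As an illustration of the export step, the sketch below serializes a list of records into CSV text. CSV is only an assumed example of a "standard tabular file format", and the column names are hypothetical; the point is that an explicit column whitelist keeps internal fields out of the exported file.

```python
import csv
import io

# Hypothetical export columns; internal fields not listed here are never written.
EXPORT_COLUMNS = ["response_id", "program", "year", "label", "satisfaction_score"]


def export_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text, writing only the whitelisted columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EXPORT_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "response_id": "r1",
        "program": "P01",
        "year": 2024,
        "label": "facilities",
        "satisfaction_score": 0.8,
        "internal_pk": 17,  # internal field, dropped by extrasaction="ignore"
    }
]
csv_text = export_csv(rows)
```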
A key subset of the response-handling functionality was responsible for initiating asynchronous services to manage computationally intensive tasks. This mechanism is triggered during response submission to avoid blocking the request-response cycle with high-latency inference or persistence operations. Instead of executing the entire classification workflow synchronously, the system delegated long-running tasks to the asynchronous processing layer.
The asynchronous processing services were responsible for analyzing, classifying, and persisting submitted responses. These services encapsulated the logic required for automated categorization by interacting with an external inference API powered by a large language model, as well as the processes necessary to store resulting data through an abstracted persistence layer. For internal process coordination, the asynchronous services interfaced with a key-value store that supported task tracking and execution-state management. Each invocation of the response submission workflow registered a new task associated with a unique identifier and enqueued it for background asynchronous execution. Once a task is created, the web service returns an acknowledgment containing the task identifier, while execution continues independently.
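The submit-and-acknowledge pattern described here can be sketched as follows. An in-memory dictionary and `queue.Queue` stand in for the key-value store and its task queue; the function name and the status values are hypothetical.

```python
import queue
import uuid

task_state: dict[str, str] = {}            # stands in for the key-value store
task_queue: queue.Queue = queue.Queue()    # stands in for the background task queue


def submit_response(payload: dict) -> dict:
    """Register a task, enqueue it, and return an acknowledgment immediately."""
    task_id = str(uuid.uuid4())
    task_state[task_id] = "queued"
    task_queue.put((task_id, payload))
    # The caller gets the identifier right away; classification happens later.
    return {"task_id": task_id, "status": "queued"}


def get_status(task_id: str) -> str:
    """Look up the execution state of a previously submitted task."""
    return task_state.get(task_id, "unknown")


ack = submit_response({"question": "Q1", "answer": "Good supervision"})
```

Because the acknowledgment carries only the task identifier, the client can poll for status without the web service holding the request open during inference.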
The asynchronous processing component also performs data transformation and sanitation to convert the raw user input into a structured representation suitable for downstream processing. Incoming textual content was parsed into an internal object format and passed through a cleaning phase to remove sensitive or extraneous information. The output was then normalized into a canonical representation containing fields such as question, answer, and comment, which served as the basis for later inference and persistence.
After preprocessing, the services persisted the processed responses in the database. Each stored record was assigned a universally unique identifier (UUID), allowing the system to reference specific inputs without retransmitting the complete payload. This persistence step precedes the invocation of the large language model.
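A minimal sketch of the cleaning, normalization, and identifier assignment is given below. The canonical fields `question`, `answer`, and `comment` come from the text; removing e-mail addresses is only an assumed example of "sensitive information", and the use of a version 4 UUID is likewise an assumption.

```python
import re
import uuid

# Assumption: e-mail addresses are one kind of sensitive content to strip.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean(value) -> str:
    """Remove e-mail addresses and collapse repeated whitespace."""
    text = EMAIL_PATTERN.sub("[removed]", str(value or ""))
    return " ".join(text.split())


def normalize(raw: dict) -> dict:
    """Build the canonical record and attach a fresh UUID."""
    return {
        "id": str(uuid.uuid4()),
        "question": clean(raw.get("question")),
        "answer": clean(raw.get("answer")),
        "comment": clean(raw.get("comment")),
    }


record = normalize({
    "question": "  How would you rate the facilities? ",
    "answer": "Good",
    "comment": "mail me at a.b@example.org   please",
})
```

Later stages can then pass `record["id"]` around instead of the full text.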
The production configuration included four workers for asynchronous task execution. When one worker was busy, the task scheduler delegated work to the next available worker; when all workers were occupied, additional requests remained queued until capacity became available.
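The queuing behavior with four workers can be imitated with a thread pool. This sketch uses Python's `concurrent.futures.ThreadPoolExecutor` purely as a stand-in for the production task scheduler, and a short sleep stands in for the inference call; it records the peak number of tasks running at once to show that excess tasks wait for capacity.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 4  # matches the production configuration described above
stats = {"active": 0, "peak": 0}
lock = threading.Lock()


def classify(task_id: int) -> int:
    """Pretend to classify one response while tracking concurrency."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # stands in for the slow inference request
    with lock:
        stats["active"] -= 1
    return task_id


# Ten tasks are submitted; at most four run at once, the rest stay queued.
with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    results = list(pool.map(classify, range(10)))
```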
3. Experiments
System evaluation was constrained by several structural limitations inherent to the available data and prior institutional practices. The most significant limitation arose from the nature of the manual labeling process. As previously discussed, no standardized or formally defined labeling scheme existed before the initiation of this project. Instead, survey responses had been manually annotated by multiple individuals, each applying labels based on personal interpretation and criteria. This resulted in a heterogeneous and inconsistent label set that could not be treated as a reliable ground truth. Due to the absence of a consistent and validated reference labeling, it was not feasible to apply conventional quantitative evaluation metrics commonly used in classification tasks, such as precision, recall, or F1. These metrics require a stable gold standard against which predictions can be compared, a condition not met in this context. To mitigate this constraint, the evaluation strategy focused on qualitative and operational criteria. The primary form of validation consisted of expert review by institutional stakeholders, who assessed whether the generated labels were coherent, interpretable, and useful for their intended analytical and decision-making purposes. While this form of evaluation did not provide a rigorous quantitative measure of model performance, it reflected the practical requirements and real-world usage conditions under which the system was deployed.
This approach is acknowledged as a methodological limitation. LLM-based evaluation is not without limitations, and traditional approaches have shown competitive performance in certain NLP tasks such as semantic tagging and clinical text mining [18]. However, the nature of the data in this system presented specific challenges that constrained the choice of evaluation method. Human-based annotation was deemed impractical given the scale of labels required, as manual labeling at this volume had proven methodologically unreliable in prior attempts. Dictionary-based approaches were also considered; however, the literature consistently reports their limitations in handling affective language, including emotional nuance, double negations, and sarcastic expressions [19,20], which are precisely the constructs central to this system’s inputs. Given these constraints, and the production-oriented deployment context of the system, stakeholder acceptance and perceived utility were adopted as the most feasible evaluation criteria, acknowledging that a more controlled ground truth remains a desirable direction for future work.
Four large language models were evaluated as candidates for the automated labeling component: deepseek/deepseek-r1-0528, deepseek/deepseek-v3.2-exp, mistralai/mistral-medium-3.1, and openai/gpt-4o. These models were chosen because they represented distinct trade-offs in terms of latency, output verbosity, and monetary cost, while remaining competitive within the state-of-the-art landscape at the time. The evaluation of the LLMs focused on cost-efficiency, response time, and token consumption under realistic workload conditions, as shown in Table 1, all while maintaining high performance and security standards.
From the evaluated options, deepseek/deepseek-r1-0528 was selected for deployment. Although it was identified early in the project as a promising candidate, its final selection was based on the following systematic benchmarking results. To compare candidate LLMs under practical conditions, a benchmarking process was conducted in which each model was evaluated using fifty independent inference requests. All requests were executed via OpenRouter, which served as a unified backend to ensure consistent request handling across models. The evaluation was designed to capture typical model behavior during real task execution. Each benchmark recorded the following information: the model identifier, the full prompt and corresponding response, the number of input and output tokens, timing data including request start time, completion time, and total latency, and the monetary cost associated with input and output tokens separately. The prompt structure matched the format used in the production system, while the content itself was drawn from randomized data. Specifically, twenty responses were randomly selected from the survey question along with their associated comments. These responses were injected into the prompt template to emulate real-world variability in input data and to observe model behavior under diverse semantic conditions.
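The prompt construction used in the benchmark, where twenty randomly selected responses are injected into a fixed template, might be sketched as follows. The template wording, the field names, and the optional seed are hypothetical; the production prompt is not reproduced in this section.

```python
import random

# Hypothetical template; the real production prompt is not shown in the text.
PROMPT_TEMPLATE = (
    "You are labeling student survey responses.\n"
    "Question: {question}\n"
    "Responses:\n{responses}\n"
    "Return one label and a satisfaction score in [0, 1] for each response."
)


def build_prompt(question: str, pool: list[dict], sample_size: int = 20, seed=None) -> str:
    """Sample responses without replacement and inject them into the template."""
    rng = random.Random(seed)  # a fixed seed makes a benchmark prompt reproducible
    chosen = rng.sample(pool, k=min(sample_size, len(pool)))
    lines = "\n".join(
        f"{i}. {r['answer']} (comment: {r['comment']})" for i, r in enumerate(chosen, start=1)
    )
    return PROMPT_TEMPLATE.format(question=question, responses=lines)


pool = [{"answer": f"answer {n}", "comment": f"comment {n}"} for n in range(50)]
prompt = build_prompt("How satisfied are you with the facilities?", pool, seed=0)
```

Sampling without replacement avoids repeating a response within one prompt, and drawing a new sample per request exposes each model to varied inputs.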
Model benchmarking was conducted using four key quantitative indicators: the number of input tokens sent through the body of the API request, the number of output tokens counted from the body of the API response, the monetary cost associated with input tokens, and the monetary cost associated with output tokens.
Token counts for both inputs and outputs were obtained with a tokenization step matched to each model. Specifically, text for mistralai/mistral-medium-3.1 was tokenized using the mistralai/Mistral-7B-v0.1 tokenizer [24], which belongs to the same tokenizer family. For the DeepSeek models, deepseek/deepseek-r1-0528 and deepseek/deepseek-v3.2-exp, the official tokenizers deepseek-ai/DeepSeek-R1 [21] and deepseek-ai/DeepSeek-V3 [22] were used, respectively. For openai/gpt-4o, tokenization was performed using the Xenova/gpt-4o community tokenizer [25].
Latency measurements were collected from the moment a request was dispatched until a response was returned by the OpenRouter API. Timing was recorded using the built-in time module of Python 3.14, with timestamps obtained via the time.time() function. The total response time was calculated as the difference between the response arrival time and the request initiation time. While this method provided precise per-request measurements, some variability was attributed to transient network conditions during testing. The monetary cost of each request was computed using the pricing rates provided by OpenRouter for each model [26,27,28,29] as of November 2025. Table 2 summarizes the input and output token prices for all evaluated models.
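The latency and cost calculations can be sketched as below. The timing follows the `time.time()` approach described in the text; the cost formula assumes prices are quoted per million tokens, and the numeric prices and the `fake_request` helper are placeholders rather than values from Table 2 or a real API call.

```python
import time


def timed_call(send_request, *args):
    """Call send_request and return (response, latency in seconds)."""
    start = time.time()          # request initiation time
    response = send_request(*args)
    end = time.time()            # response arrival time
    return response, end - start


def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_million: float, output_price_per_million: float):
    """Return (input cost, output cost), assuming prices per million tokens."""
    input_cost = input_tokens * input_price_per_million / 1_000_000
    output_cost = output_tokens * output_price_per_million / 1_000_000
    return input_cost, output_cost


def fake_request(prompt: str) -> str:
    """Hypothetical stand-in for the real HTTP request to the inference API."""
    return prompt.upper()


response, latency = timed_call(fake_request, "hello")
# Placeholder prices: 1.0 per million input tokens, 4.0 per million output tokens.
input_cost, output_cost = request_cost(2000, 500, 1.0, 4.0)
```

For interval measurements, `time.perf_counter()` is a common alternative because it is monotonic, whereas `time.time()` follows the system clock and can be adjusted during a run.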
During the benchmarking process, all model temperatures were set to 0.
Figure 5, Figure 6, Figure 7, and Figure 8 present the response time comparison and distribution across all of the tested models. Among the evaluated models, openai/gpt-4o exhibited the lowest average response time, with an average latency of approximately 15 s.
The experimental evaluation highlights the trade-offs among the candidate models in terms of latency, verbosity, and operational cost. openai/gpt-4o achieved the fastest average response times (approximately 15 s per request) but incurred the highest monetary cost for both input and output. In contrast, deepseek/deepseek-v3.2-exp offered the most economical pricing, making it the most cost-effective option. mistralai/mistral-medium-3.1 exhibited a tendency toward verbosity, generating the largest average number of output tokens, while deepseek/deepseek-r1-0528 demonstrated the highest latency, with an average response time of approximately 232 s, potentially limiting its suitability for time-sensitive deployments. Taken together, these findings provide a practical basis for comparing the models under realistic production conditions. Although the absence of a reliable gold standard prevented the use of conventional supervised evaluation metrics, the benchmarking framework enables a meaningful assessment of system behavior from an operational perspective. By combining cost, response time, and output characteristics with stakeholder review, the evaluation supported the selection of models not only on technical performance but also on practical institutional utility. The following section builds on these findings by discussing their implications for the overall system and its deployment context.
4. Results
On average, each student provided responses to 10.8 open-ended questions. The shortest non-empty responses contained a single word, while the longest responses reached up to 20 words in length. All survey responses were collected using an internal data collection tool developed and maintained by the institution.
Running the pipeline on the student satisfaction surveys generated 3828 unique tags. These tags were used to label the responses of 2062 students from 12 academic programs. Of these students, 1184 were male and 878 were female (Figure 9); 434 were doctoral students and 1628 were master’s students. The full breakdown by postgraduate program and year of response is shown in Figure 10. All responses were distributed over a 15-year period.
The satisfaction scores generated during the AI-powered labeling process were interpreted as follows: a score below 0.5 is defined as negative or unsatisfactory, a score of 0.5 is defined as neutral, and a score above 0.5 is defined as positive or satisfactory. Based on these scores, the following figures present the distribution of positively labeled student responses broken down by interest area (Figure 11) and academic program (Figure 12).
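The score thresholds defined above translate directly into a small helper. The sketch below is illustrative only; the function names and the sample scores are hypothetical and do not come from the survey data.

```python
def sentiment_category(score: float) -> str:
    """Map a satisfaction score in [0, 1] to negative, neutral, or positive."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < 0.5:
        return "negative"
    if score > 0.5:
        return "positive"
    return "neutral"


def positive_share(scores: list[float]) -> float:
    """Return the fraction of scores categorized as positive."""
    if not scores:
        return 0.0
    positives = sum(1 for s in scores if sentiment_category(s) == "positive")
    return positives / len(scores)


example_scores = [0.9, 0.5, 0.2, 0.75]  # made-up values for illustration
share = positive_share(example_scores)
```

Grouping scores by interest area or academic program before calling `positive_share` would give per-group proportions of the kind plotted in the figures.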
Taken together, these results indicate that model selection in this application depends on balancing responsiveness, cost efficiency, and output characteristics rather than optimizing a single metric. Although a conventional supervised evaluation was not feasible because no reliable gold standard existed, the combined use of benchmarking and stakeholder review provided a practical basis for assessing institutional usefulness. In operational terms, the results confirm that the proposed system can support real-world survey analysis while making the trade-offs among candidate models explicit for deployment decisions.
5. Discussions
The design and deployment of a production-grade LLM-based survey classification system at CICESE demonstrated that success depended on more than model performance. An end-to-end systems approach that integrated data handling, governance, infrastructure, and operational use was essential to ensure outputs that were technically reliable and operationally usable within the institutional context.
Model selection prioritized latency, cost, and security constraints over accuracy alone. This deployment-focused approach favored models that balanced classification quality with predictable performance and feasibility under routine operating conditions, aligning model choice with sustained service delivery requirements and the practical limits of CICESE’s survey workflows.
Deployment on CICESE servers met cybersecurity standards while preserving workflow stability and traceability. This demonstrated that AI integration remained compliant and accountable without disrupting institutional processes or weakening controls. Governance and reliability objectives were directly addressed through deployment design, avoiding post hoc corrections.
The project replaced subjective manual labeling with an automated workflow that included a semantically structured label space, parallel AI-based question classification, [0, 1] satisfaction scoring, and relational storage. Processing time was reduced from weeks or months to minutes, outputs were judged interpretable and useful for decision-making, and the initiative established a precedent for future AI projects at CICESE. Because data-access policies limited the scope to institutional surveys, future research should focus on standardized labeling protocols and curated benchmark datasets to enable stronger quantitative evaluation.
The database design reflected a deliberate tradeoff: it prioritized read performance for operational reporting over strict normalization, which introduced higher maintenance complexity over time. This decision was deemed acceptable within the project’s scope, as rapid retrieval and decision support were prioritized to meet institutional requirements.
The institutional cultural shift triggered by replacing subjective human labeling with AI-driven consistency was significant, as researchers redirected their energy toward other tasks. The AI-powered insights empowered administrators in ways the previous manual system could not, providing a clearer and more accurate understanding of key indicators. In practical terms, the platform changes the cadence of survey analysis inside the institution. Instead of waiting for delayed manual labeling, decision-makers can review labeled responses and satisfaction indicators soon after data collection, which supports more timely follow-up on recurring concerns, faster prioritization of academic services, and more consistent reporting across programs and evaluation cycles. The traceable storage of prompts, labels, and metadata also makes it easier to audit how specific outputs were produced, which is important when the results are used to inform administrative actions or policy adjustments.
Future work should validate the platform with a more robust and explicit assessment protocol. A useful next step would be to build a gold-standard subset of survey responses annotated by multiple domain experts, then compare the platform outputs against that benchmark using agreement and classification metrics, along with error analysis by program, question type, and label category. In parallel, the deployment could be monitored over successive survey cycles to measure drift in label distributions, response times, and stakeholder utility. Alternative labeling strategies, including multi-pass annotation and expert adjudication, should also be tested to determine which protocol produces the most stable and reproducible reference labels. User-centered validation would further strengthen the evidence base: periodic reviews with institutional staff, structured usability feedback, and controlled comparisons against the previous manual workflow would clarify whether the platform continues to improve turnaround time, interpretability, and decision support under real operating conditions.
6. Conclusions
Our study successfully designed, deployed, and integrated a production-grade survey classification system based on LLMs at CICESE. This result demonstrates that institutional AI initiatives can move from exploratory pilots to stable operational services when technical design is aligned with organizational requirements. Beyond the modeling component, the project delivered a complete implementation path that connected data ingestion, classification, storage, and reporting in a single workflow. In this sense, the work contributes both a functional system and an applied blueprint for future AI deployments in comparable higher-education and public research environments.
Model selection was guided by operational constraints rather than benchmark accuracy in isolation. Specifically, the evaluation prioritized latency, cost, cybersecurity compatibility, workflow stability, and traceability, since these factors determine whether a system can be sustained in routine institutional use. This decision framework ensured that model performance was interpreted within real deployment conditions and governance requirements. Consequently, the selected stack balanced classification quality with service reliability, enabling consistent day-to-day use while remaining compatible with CICESE standards for secure and accountable digital operations.
The resulting pipeline replaced manual survey labeling with an automated process that introduced a semantically structured labeling framework, parallelized AI-based question classification, and satisfaction scoring in the range [0, 1]. The system also consolidated outputs in a relational schema designed to support downstream institutional reporting and analysis. Together, these components transformed labeling from a subjective, labor-intensive task into a reproducible computational procedure. This architectural integration is a central contribution of the study because it links model inference, data structure, and decision support in a coherent production setting.
Operationally, the automation produced a substantial efficiency gain for the Department of Graduate Follow-up and Scholarships. Processing that previously required weeks or months of manual effort was reduced to minutes, allowing results to be generated in decision-relevant timeframes. This improvement is not only quantitative but organizational, because faster turnaround supports more responsive planning cycles and better coordination among administrative stakeholders. In practical terms, the system increased throughput without compromising traceability, which is critical for institutional trust and for sustained adoption across recurring survey-analysis workflows.
Stakeholder validation further confirmed the practical value of the system outputs. Users reported that the generated classifications and satisfaction indicators were interpretable, actionable, and aligned with institutional decision-making needs, indicating strong semantic usability beyond raw model performance. This feedback is important because it demonstrates that the pipeline supports human-in-the-loop governance rather than opaque automation. By producing results that domain users can understand and apply, the implementation strengthens organizational confidence in AI-assisted analysis and establishes a credible precedent for broader data-informed policy and planning processes.
Performance testing showed meaningful latency differences across candidate models under the same evaluation setup. The fastest model was openai/gpt-4o with an average response time of 15 s, followed by mistralai/mistral-medium-3.1 at 23 s and deepseek/deepseek-v3.2-exp at 48 s, while deepseek/deepseek-r1-0528 was the slowest at 232 s. These results reinforce the importance of deployment-aware selection criteria, since response time directly affects operational feasibility. Overall, the comparison provides an empirical basis for balancing quality, cost, and speed in future institutional model procurement and configuration decisions.