Generative AI-Assisted Automation of Clinical Data Processing: A Methodological Framework for Streamlining Behavioral Research Workflows
Abstract
1. Introduction
2. Related Work and Contributions
2.1. Automation of ETL Processes in Clinical Research
2.2. Application of Digital and Generative Technologies in Medicine
2.3. Generative AI as Interactive Co-Developer: A Systematic Literature Synthesis and Contribution Statement
3. Theoretical Framework
3.1. ETL Methodology for Clinical Data Processing
3.2. Technology Stack for AI-Assisted Automation
3.2.1. Large Language Models (LLMs)
3.2.2. Workflow Orchestration (n8n)
3.2.3. Containerization (Docker)
3.2.4. Data Processing and Visualization
3.3. Human-AI Collaborative Development Paradigm
| Step | Description | Example |
|---|---|---|
| 1. PROMPT | Researcher defines functional requirements in natural language | Example: “Create a Python script that processes .txt files, removes invalid values, and saves the data as a CSV file” |
| 2. GENERATE | LLM produces initial implementation (code, configs, documentation) | Output: Python script with error handling, progress logging, and file validation |
| 3. TEST | Researcher executes in local environment, observes behavior | Result: FileNotFoundError due to incorrect path handling |
| 4. DEBUG | Researcher shares the error logs with LLM for diagnosis | LLM analyzes: “The script uses relative paths. In Docker containers, use absolute paths like `/data/FRyCB/`” |
| 5. REFINE | LLM generates corrected version based on feedback | Updated script: Resolves path handling, adds directory validation, handles edge cases |
| 6. DEPLOY | The new validated scripts are integrated into a containerized production pipeline | Final system: n8n workflow orchestrating Python preprocessing → R ML models → Shiny dashboard |
- Docker provides the execution environment (isolated, reproducible);
- n8n orchestrates workflow stages (data extraction → transformation → loading);
- Python/R perform data processing and ML analysis;
- Shiny deploys interactive visualization interfaces;
- LLMs assist development across all components through iterative co-creation.
4. Methodology
“How can Large Language Models be leveraged to guide researchers through the complete automation of data processing pipelines, from initial design to production deployment, without requiring advanced programming or DevOps expertise?”
4.1. Overview of the Automation Framework
- AI-Assisted Design Phase: Researchers interact with LLMs to define workflow requirements, generate initial code structures, and create configuration files (Docker, YAML, JSON).
- Iterative Refinement Phase: Generated code is tested, errors are debugged with AI assistance, and scripts are optimized through multiple iterations.
- Containerization Phase: Docker images are built to ensure reproducible execution environments with all dependencies (Python, R, n8n, required libraries).
- Orchestration Phase: n8n workflows coordinate the execution of Python preprocessing scripts, R-based machine learning models, and dashboard deployment.
4.2. Role of Generative AI in Pipeline Development
4.2.1. AI-Assisted Development Process
| Development Phase | Prompting Strategies | Refinement Cycles | Error Types Resolved | Outputs Generated |
|---|---|---|---|---|
| Initial Docker Setup | Direct installation requests; environment specification; error log analysis | 3 cycles | Alpine Linux externally managed environment errors; Docker image reference corrections; Memory allocation issues | Dockerfile with Python virtual environment; docker-compose.yml with volume mounts; Environment variables configuration |
| Authentication Config | Symptom-based troubleshooting (“Wrong username or password” errors) | 2 cycles | N8N_BASIC_AUTH format issues; Variable propagation failures | Updated docker-compose.yml; Alternative configurations with auth disabled |
| File System & Volumes | Path exploration requests; Windows-to-Linux path translation; Directory structure verification | 2 cycles | Container filesystem access; Mount point verification; Permission diagnostics | Volume mapping configuration; Directory creation commands; Cross-platform path resolution |
| Python Script Adaptation | Error log submission with file attachments; “FileNotFoundError” problem descriptions | 4 cycles per script (total: 4 scripts) | Relative vs. absolute path issues; Working directory context problems; Exception handling | Corrected Python scripts with absolute path resolution, error handling blocks, progress logging, directory validation |
| n8n Workflow Development | JSON workflow file sharing; Node-by-node functionality requests | 5 cycles | “Connection cannot be established” errors; Infinite loop/hanging nodes; Parameter mismatches | 5 n8n workflow JSON files (basic ETL, trigger-based, robust error handling, test/diagnostic versions) |
| R Script Integration | Specific error messages (“cannot change working directory”); Script header code sharing | 2–3 cycles | Windows vs. Linux path issues in R; setwd() context problems; Rscript execution failures | Modified R execution commands; bash wrapper with cd command; Updated workflow JSON |
| Shiny Dashboard Integration | Analysis of dual-container architecture; Port mapping verification | 2 cycles | Wrong image on n8n container; “site can’t be reached” errors | Corrected service definitions; Shiny-specific Dockerfile |
4.2.2. Human Oversight and Validation
- Validating accuracy of data transformations
- Ensuring statistical appropriateness of ML methods
- Verifying compliance with data privacy requirements
- Making decisions about workflow architecture design
4.3. Dataset and Use Case
Ethical Considerations and Data Governance
- File format: Unstructured .txt files with inconsistent formatting;
- Volume: 102 files × 2 experimental conditions = 204 datasets;
- Size: Each file contains 18,000+ rows of frame-by-frame measurements;
- Complexity: 57 variables per file requiring cleaning, type conversion, and feature engineering;
- Repetitive processing: Identical transformation steps must be applied to all files.
4.4. Implementation Details
4.4.1. Technology Stack
4.4.2. Reproducibility and Portability
- Containerization: Docker images encapsulate all dependencies;
- Version Control: All scripts and configurations documented;
- Documentation: AI-assisted generation of README files and inline comments;
- Deployment Instructions: A step-by-step setup guide for replication is shown in Figure 3.
4.5. Generalized Methodological Heuristics
5. Results
5.1. Methodological Performance and Efficiency
Pipeline Execution Performance
- Primary Development System (AMD A10-6700T, 12 GB RAM, Windows 11)
  - Result: All 102 datasets processed successfully
  - Processing time: 4.8 h
- Secondary Laptop (Intel Core i7-1255U, 16 GB RAM, Windows 11)
  - Result: All 102 datasets processed successfully
  - Processing time: 3.6 h (faster CPU)
- Virtual Machine (Ubuntu 22.04 LTS, 8 GB RAM, simulated environment)
  - Result: All 102 datasets processed successfully
  - Processing time: 5.2 h (resource-constrained)
5.2. AI Co-Development Metrics
- Script Generation (45%): Python data preprocessing scripts, R machine learning scripts, Shiny dashboard code
- Debugging (30%): Path resolution errors, Docker configuration issues, n8n workflow logic
- Configuration (20%): Dockerfiles, docker-compose.yml, environment variables, volume mounting
- Documentation (5%): Code comments, README files, deployment instructions
5.3. Illustrative Analytical Outputs
5.3.1. Data Processing Workflow
- Step 1: Data Extraction
  - Input: 102 .txt files from FaceReader software (18,000+ rows each, 57 variables)
  - Output: Processed CSV files with standardized formatting
  - Processing: Automated removal of metadata rows, column filtering, type conversion
- Step 2: Data Transformation
  - Input: Processed CSV files
  - Output: Feature-engineered datasets with cluster labels and variable importance rankings
- Step 3: Data Loading
  - Input: Transformed datasets with ML outputs
  - Output: Interactive Shiny dashboard, PDF reports, consolidated summary tables
  - Processing: Automated visualization generation, group-level aggregation
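The extraction step described above (metadata removal, column filtering, type conversion) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' script: the tab delimiter, the column names `Video Time` and `Happy`, and the function name `extract_rows` are hypothetical, and only the standard library is used.

```python
import csv
import io


def extract_rows(text, skip_rows=9, keep=("Video Time", "Happy"), delimiter="\t"):
    """Skip metadata rows, keep selected columns, convert numeric fields to float."""
    # Drop the leading metadata block; the next line is treated as the header.
    lines = text.splitlines()[skip_rows:]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter=delimiter)
    records = []
    for row in reader:
        record = {}
        for col in keep:
            value = row[col]
            try:
                record[col] = float(value)
            except (TypeError, ValueError):
                # Non-numeric tokens are left untouched for a later cleaning step.
                record[col] = value
        records.append(record)
    return records


# Tiny in-memory sample with two metadata lines (hypothetical content).
sample = "meta1\nmeta2\nVideo Time\tHappy\tSad\n0.0\t0.5\t0.1\n0.04\tFIT_FAILED\t0.2\n"
rows = extract_rows(sample, skip_rows=2)
```

Keeping invalid tokens as strings at this stage, rather than dropping rows immediately, keeps extraction and cleaning as separate steps, which mirrors the staged structure of the workflow.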
5.3.2. Example Outputs from Automated Pipeline
5.3.3. Automated Report Generation
- Individual PDF Reports (102 files): Variable importance plots, cluster assignment visualizations, model performance metrics
- Consolidated Summary Tables (.txt format): Group-level aggregated results for Control and BPD groups
- Interactive Shiny Dashboard: Real-time filtering by group, dynamic visualization of top-ranked variables, downloadable results
- 102 individual variable importance plots (PDF)
- 102 cluster assignment reports (PDF)
- 2 group-level summary tables (Control and BPD)
- 1 interactive Shiny dashboard (accessible via http://localhost:3838, locally hosted; requires Docker container to be running)
5.3.4. Pipeline Execution Summary
- Total files created by pipeline: 410+ files (102 CSV × 2 conditions + 204 PDFs + summary tables + dashboard)
- Total processing time: ~4–5 h (fully automated)
- Manual intervention required: 0 interventions
5.4. Key Achievements
- Full ETL Automation: Complete end-to-end processing without manual intervention
- Reproducibility: Identical execution across different computing environments
- Scalability: Efficient processing of 102 participants (204 participant-condition datasets)
- Quality Assurance: Zero data entry errors through automated validation
- Time Efficiency: 90%+ reduction in processing time compared to manual workflows
- Accessible Deployment: Successful execution on modest hardware (12 GB RAM systems)
- Interactive Visualization: Automated dashboard deployment for data exploration (Figure 4)
6. Discussion
6.1. Limitations of Interpretation
6.2. Methodological Contributions
6.2.1. Reducing Technical Entry Barriers for Domain Researchers
- Generate functional preprocessing scripts through iterative prompting
- Debug errors by sharing error logs with LLMs and receiving context-specific solutions
- Create containerized environments without deep knowledge of Docker internals
- Design workflow orchestration logic through natural language descriptions
6.2.2. Reproducibility Through Containerization
- Environment Isolation—Docker containers encapsulate all dependencies (Python libraries, R packages, system libraries) in a single portable image;
- Version Control—Dockerfiles and docker-compose.yml files document exact software versions, eliminating “works on my machine” problems;
- Cross-Platform Consistency—The same containerized pipeline executes identically on Windows, Linux, and macOS systems.
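One way to check the cross-platform consistency claim is to compare checksums of the output files produced on two machines. The sketch below is an assumption about how such a check could be done; the source does not state how identical execution was verified, and the function names and the `*.csv` pattern are hypothetical.

```python
import hashlib
from pathlib import Path


def checksum_dir(directory, pattern="*.csv"):
    """Map each matching file name to the SHA-256 digest of its bytes."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(directory).glob(pattern))
    }


def differing_files(run_a, run_b):
    """Return names whose digests differ or that exist in only one run."""
    names = run_a.keys() | run_b.keys()
    return sorted(n for n in names if run_a.get(n) != run_b.get(n))
```

An empty list from `differing_files(checksum_dir(dir_a), checksum_dir(dir_b))` would indicate byte-identical outputs. Outputs that embed timestamps, such as PDF reports, would need a content-level comparison instead.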
6.2.3. Scalability for Larger Datasets
- Parallel Processing—n8n workflows can be configured to process multiple participants simultaneously;
- Incremental Execution—Failed processing of individual datasets does not halt the entire pipeline;
- Efficient Resource Management—Docker resource limits prevent individual jobs from consuming excessive memory or CPU.
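The first two properties, parallel processing and failure isolation, can be illustrated with a short Python sketch. In the described system these behaviours are configured in n8n; the code below is only an analogous illustration, and `process_in_parallel` and its parameters are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_parallel(items, worker, max_workers=4):
    """Run worker on each item concurrently; record failures instead of aborting."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # One failing dataset is logged and the rest continue.
                errors[item] = repr(exc)
    return results, errors
```

A thread pool suits I/O-bound file handling. For CPU-bound transformations, `ProcessPoolExecutor` from the same module would be a reasonable alternative.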
6.3. Role of Generative AI as Co-Developer
6.3.1. Reducing Technical Barriers
- Script Generation (45% of interactions)—Creating Python preprocessing scripts, R machine learning code, and Shiny dashboard applications;
- Debugging (30% of interactions)—Diagnosing path resolution errors, Docker configuration issues, and n8n workflow logic problems;
- Configuration Management (20% of interactions)—Generating Dockerfiles, docker-compose.yml files, and environment variable configurations;
- Documentation (5% of interactions)—Creating code comments, README files, and deployment instructions.
6.3.2. Accelerating Prototyping and Iteration with AI Assistance
6.3.3. Limitations of AI-Generated Code and the Need for Human Oversight
6.4. Comparison with Traditional Approaches
6.4.1. Manual Processing
- Opening each .txt file individually in spreadsheet software;
- Manually copying columns to standardized templates;
- Hand-coding missing value imputations;
- Running statistical analyses one participant at a time;
- Copying and pasting results into reports.
6.4.2. Hiring Data Engineers
6.4.3. Commercial Automation Platforms
6.5. Limitations and Future Directions
6.5.1. Limitations
6.5.2. Future Directions
6.6. Broader Implications
7. Conclusions
- Democratization of automation by enabling non-experts to build advanced pipelines through natural language interaction with LLMs;
- Guaranteed reproducibility through containerization ensuring identical execution regardless of computing environment;
- Demonstrated scalability with 90%+ efficiency gains for large datasets.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| BPD | Borderline Personality Disorder |
| ChatGPT | A generative AI tool developed by OpenAI |
| Claude | A generative AI model |
| Cyberball | A computerized social exclusion task |
| Docker | A platform for local containerized execution of the entire |
| EHRs | Electronic Health Records |
| ETL | Extract, Transform, Load (process) |
| FaceReader | Software for automatic analysis of facial expressions |
| GAI | Generative Artificial Intelligence |
| LLMs | Large Language Models |
| MDA | Mean Decrease Accuracy |
| ML | Machine Learning |
| n8n | Workflow orchestration and automation platform |
Appendix A. Docker Infrastructure and Project Structure
Appendix A.1. Container Architecture

Appendix A.2. Directory Structure

Appendix B. Data Processing Scripts
Appendix B.1. Python ETL Scripts
| Script | Function | Input → Output | AI Iterations |
|---|---|---|---|
| procesa_txt_Posic.py | Remove metadata rows/columns, convert to CSV.txt | .txt → Processed_.csv | 3 |
| eliminaPy.py | Filter invalid values, split by condition | Processed_.csv → Cleaned_.csv | 4 |
| tercerPy.py | Calculate movement magnitude, rename Action Units | Cleaned_.csv → Detail_.csv | 5 |
| CambiaNom.py | Standardize filenames by group | Detail_.csv → DatP#.csv/DatTLP#.csv | 2 |
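The cleaning and feature-engineering steps in the table can be sketched as follows. The invalid tokens are those named in Appendix D; everything else is an assumption made for illustration: the source does not give the movement-magnitude formula, so a Euclidean norm over three hypothetical axis columns (`X`, `Y`, `Z`) is used here, and the function names are not the authors'.

```python
import math

# Tokens named in Appendix D as invalid values.
INVALID_TOKENS = {"FIT_FAILED", "UNKNOWN", "FIND_FAILED"}


def drop_invalid(rows):
    """Keep only rows in which no field holds an invalid token."""
    return [
        r for r in rows
        if not INVALID_TOKENS.intersection(map(str, r.values()))
    ]


def add_head_magnitude(rows, axes=("X", "Y", "Z"), name="HeadM"):
    """Append a magnitude column; the Euclidean norm is an assumed formula."""
    return [
        {**r, name: math.sqrt(sum(float(r[a]) ** 2 for a in axes))}
        for r in rows
    ]


raw = [
    {"X": "3", "Y": "4", "Z": "0", "Happy": "0.2"},
    {"X": "1", "Y": "FIT_FAILED", "Z": "0", "Happy": "0.1"},
]
cleaned = add_head_magnitude(drop_invalid(raw))
```

Filtering before computing the magnitude avoids a conversion error on rows where a tracking failure token replaces a coordinate.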
Appendix B.2. R Machine Learning Scripts
| Algorithm | Parameters | Outputs |
|---|---|---|
| K-means Clustering (KMeans_FRyCB.R) | • Elbow method (k = 1–15) • Z-score standardization • Mean imputation for missing values | • Cluster assignments per participant • PDF elbow plots • Consolidated metrics table |
| Random Forest (RF_2025_ConditionA.R) | • ntree = 500, mtry = √p • OOB error estimation • MDA importance metric | • Variable importance rankings • Confusion matrices • PDF visualizations • Performance summary (Accuracy, OOB%) |
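The preprocessing listed for K-means (mean imputation, Z-score standardization) and the Random Forest setting mtry = √p can be illustrated in Python. The authors' implementation is in R; this sketch uses only the standard library, the function names are hypothetical, and the sample standard deviation is assumed for the Z-score, matching the behaviour of R's `scale()`.

```python
import math
import statistics


def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def zscore(values):
    """Standardize to zero mean and unit sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def default_mtry(p):
    """Number of variables tried per split for classification: floor(sqrt(p))."""
    return max(1, math.floor(math.sqrt(p)))


column = impute_mean([1.0, None, 3.0])
standardized = zscore(column)
```

Imputing before standardizing matters because the imputed mean then maps to a Z-score of zero and does not shift the column's centre.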
| K-Means Metric (Mean ± SD) | Control Group | BPD Group |
|---|---|---|
| totss (Total Sum of Squares) | 395,912.8 ± 104,732.6 | 375,051.8 ± 152,950.7 |
| tot_withinss (Within-Cluster Sum of Squares) | 318,070.7 ± 91,021.8 | 299,219.7 ± 127,013.8 |
| betweenss (Between-Cluster Sum of Squares) | 77,842.1 ± 24,768.0 | 75,832.1 ± 33,172.1 |
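The group means in the table can be checked against the K-means decomposition, in which the total sum of squares equals the within-cluster plus the between-cluster sum of squares. Because the mean is linear, the identity should also hold for the reported group means. The short check below uses only the values in the table; the variable and function names are illustrative.

```python
# Group means reported in the table above.
metrics = {
    "Control": {"totss": 395912.8, "tot_withinss": 318070.7, "betweenss": 77842.1},
    "BPD": {"totss": 375051.8, "tot_withinss": 299219.7, "betweenss": 75832.1},
}


def between_share(m):
    """Fraction of the total sum of squares that lies between clusters."""
    return m["betweenss"] / m["totss"]


for group, m in metrics.items():
    # K-means identity: totss = tot_withinss + betweenss (within rounding).
    assert abs(m["totss"] - (m["tot_withinss"] + m["betweenss"])) < 0.1
    print(f"{group}: between/total = {between_share(m):.1%}")
# Control: between/total = 19.7%
# BPD: between/total = 20.2%
```

Both groups satisfy the identity, and in each roughly one fifth of the total variance is attributable to the cluster separation.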
Appendix C. Workflow Automation and Development Metrics
Appendix C.1. n8n Workflow Structure
- Execute Command: procesa_txt_Posic.py → eliminaPy.py → tercerPy.py → CambiaNom.py
- Execute Command: Rscript KMeans_FRyCB.R
- Execute Command: Rscript RF_2025_ConditionA.R
- File Move: Archive processed .txt files
- HTTP Request: Launch Shiny dashboard
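The sequential Execute Command chain above behaves like a list of shell commands run in order, stopping at the first failure. The Python sketch below illustrates that logic outside n8n. It is an analogy, not the authors' workflow: the `/scripts/` location and the `run_steps` helper are hypothetical, while the script names are taken from Appendix B.

```python
import subprocess
import sys


def run_steps(steps):
    """Run commands in order; raise at the first non-zero exit code."""
    completed = []
    for name, cmd in steps:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"{name} failed: {result.stderr.strip()}")
        completed.append(name)
    return completed


# Hypothetical absolute script locations inside the container.
PIPELINE = [
    ("extract", [sys.executable, "/scripts/procesa_txt_Posic.py"]),
    ("clean", [sys.executable, "/scripts/eliminaPy.py"]),
    ("features", [sys.executable, "/scripts/tercerPy.py"]),
    ("rename", [sys.executable, "/scripts/CambiaNom.py"]),
    ("kmeans", ["Rscript", "/scripts/KMeans_FRyCB.R"]),
    ("random_forest", ["Rscript", "/scripts/RF_2025_ConditionA.R"]),
]
```

Calling `run_steps(PIPELINE)` inside a container with those scripts present would execute the six stages in the listed order. Passing each command as a list avoids shell quoting issues with paths.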
Appendix C.2. AI-Assisted Development Summary
| Development Phase | Tool | Prompts/Iterations | Primary Assistance |
|---|---|---|---|
| Python scripts | ChatGPT | ~25 / 14 | Path resolution, error handling, batch processing logic |
| Docker setup | Claude | ~15 / 7 | Dependency conflicts, volume mounting, multi-container configuration |
| n8n workflows | ChatGPT | ~20 / 8 | JSON structure, sequential execution, command-line parameters |
| R scripts | ChatGPT | ~12 / 6 | Missing value handling, variable importance extraction |
| Shiny dashboard | ChatGPT | ~10 / 10 | UI components, filtering logic, visualization (see Table A5) |
| Iteration | User Prompt | AI Response |
|---|---|---|
| 1 | “Create Shiny app to display Random Forest results” | Generated basic UI with file upload and table display |
| 2 | “Add dropdown to filter by group (BPD/Control)” | Added selectInput(“group”, …) with reactive filtering |
| 3 | “Participant IDs don’t match” | Fixed CSV parsing: read.csv(stringsAsFactors=FALSE) |
| 7 | “Add pie chart for top 5 MDA values” | Generated renderPlot() with pie chart visualization |
| 10 | “Translate all labels to English” | Updated all UI text: titles, tabs, button labels |
Appendix D. Documented Prompt-Debug Cycle—Cross-Platform Path Resolution in Python/Docker
| Step | Agent | Action | Input | Output/Result |
|---|---|---|---|---|
| 1. PROMPT | Researcher | Defines functional requirement for procesa_txt_Posic.py in natural language | Processing specification: skip 9 metadata rows, filter 36 columns, remove invalid values (FIT_FAILED/UNKNOWN/FIND_FAILED), calculate HeadM magnitude, save as CSV | Initial prompt submitted to ChatGPT (GPT-4). Full text: https://drive.google.com/file/d/1NrbgruYd-2T4tW4qh5oFPOdlniqd9mK4/view (accessed on 6 March 2026) |
| 2. GENERATE | LLM (ChatGPT) | Generates initial Python script using relative paths (data/FRyCB/raw/) | Functional requirement prompt (Step 1) | procesa_txt_Posic_v1.py—script with relative path references. Full script: https://drive.google.com/file/d/1ZU_FHBjUbY-1UhqLXEN4ANbVKpsIrGLB/view (accessed on 6 March 2026) |
| 3. TEST | Researcher | Executes script locally (Windows 11) and inside Docker container via n8n | procesa_txt_Posic_v1.py | Local execution: 5 sample files processed successfully. Docker execution: FileNotFoundError: [Errno 2] No such file or directory: ‘data/FRyCB/raw/’ |
| 4. DEBUG | Researcher + LLM | Researcher submits full error log to ChatGPT with container context description. LLM diagnoses root cause: relative path resolved from container working directory (/), not from project root | Complete error log + docker-compose.yml volume mount description. Full prompt: https://drive.google.com/file/d/1YJAn45UIoBJ6hTO1tgCzeLAO9aqhIKkl/view (accessed on 6 March 2026) | LLM diagnosis: “Replace relative paths with absolute paths matching container volume mount: /data/FRyCB/raw/” + directory validation block + per-file exception handling |
| 5. REFINE | LLM (ChatGPT) | Generates corrected script with absolute paths, directory existence validation, and per-file exception handling | Diagnostic exchange from Step 4 | procesa_txt_Posic_v2.py—corrected script with absolute paths. Full script: https://drive.google.com/file/d/1-1nSdrlz1zb6y2RerH3VVPoSRI1mmnLs/view (accessed on 6 March 2026) |
| 6. DEPLOY | Researcher | Validates corrected script in Docker container. Applies absolute path convention proactively to all remaining scripts (H3). Integrates into production n8n workflow | procesa_txt_Posic_v2.py | 204 files (102 participants × 2 conditions) processed without error. Absolute path convention propagated to eliminaPy.py, tercerPy.py, CambiaNom.py, and R execution commands |
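The fix documented in Steps 4 to 6 (absolute paths, directory validation, per-file exception handling) can be sketched in Python as follows. This is a minimal illustration and not the authors' procesa_txt_Posic_v2.py: only the mount path `/data/FRyCB/raw/` comes from the table, while the `DATA_ROOT` environment variable and the function names are hypothetical.

```python
import os
from pathlib import Path

# Container mount path quoted in the LLM diagnosis above.
DEFAULT_RAW_DIR = "/data/FRyCB/raw/"


def resolve_data_dir(raw_dir=None):
    """Return an absolute, validated input directory."""
    # DATA_ROOT is a hypothetical override for running outside the container.
    chosen = raw_dir or os.environ.get("DATA_ROOT", DEFAULT_RAW_DIR)
    path = Path(chosen).resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"Input directory not found: {path}")
    return path


def process_all(raw_dir, handler):
    """Apply handler to every .txt file; one bad file does not stop the batch."""
    done, failed = [], []
    for txt in sorted(resolve_data_dir(raw_dir).glob("*.txt")):
        try:
            handler(txt)
            done.append(txt.name)
        except Exception as exc:
            # Per-file exception handling: record the failure and continue.
            failed.append((txt.name, str(exc)))
    return done, failed
```

Validating the directory once, before the loop, turns the silent relative-path mismatch from Step 3 into an explicit error that names the absolute path being searched.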
References
- Eraña-Díaz, M.L.; Cruz-Chávez, M.A.; Acosta-Flores, M.; Urbano, J.E.; Ruiz, N.L.; Gamba, J.P.O. Interdisciplinary Methodology for Resource Allocation Problems using Artificial Neural Networks and Software Robots. IEEE Access 2025, 13, 131141–131158.
- Kimball, R.; Caserta, J. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data; John Wiley & Sons: Hoboken, NJ, USA, 2011.
- Raboso, D.L. Automatización de Procesos con Docker, n8n y Modelos de IA: Generación y Evaluación Automática de Exámenes Basada en RAG y Modelos de IA para Entornos Educativos. Available online: https://oa.upm.es/90131/ (accessed on 6 March 2026).
- Kshetri, N.; Hughes, L.; Slade, E.L.; Jeyaraj, A.; Kar, A.K.; Koohang, A.; Raghavan, V.; Ahuja, M.; Albanna, H.; Albashrawi, M.A.; et al. “So what if ChatGPT wrote it?” Multidisciplinary perspectives on opportunities, challenges and implications of generative conversational AI for research, practice and policy. Int. J. Inf. Manag. 2023, 71, 102642.
- Eraña-Diaz, M.L.; Rosales-Lagarde, A.; Reyes-Soto, A.; Arango-de-Montis, I.; Rodríguez-Delgado, A.; Muñoz-Delgado, J. Data Engineering for Nonverbal Expression Analysis-Case Studies of Borderline Personality Disorder. In International Conference on Advances in Computing and Data Sciences; Springer Nature: Cham, Switzerland, 2024; pp. 150–169.
- Dresselhaus, N. Case Study: Local LLM-Based NER with n8n and Ollama. 2025. Available online: https://drezil.de/Writing/ner4all-case-study.html (accessed on 31 August 2025).
- Patel, S.B.; Lam, K. ChatGPT: The future of discharge summaries? Lancet Digit. Health 2023, 5, e107–e108.
- Ye, Y.; Cong, X.; Tian, S.; Cao, J.; Wang, H.; Qin, Y.; Lu, Y.; Yu, H.; Wang, H.; Lin, Y.; et al. Proagent: From robotic process automation to agentic process automation. arXiv 2023, arXiv:2311.10751.
- Yang, R.; Tan, T.F.; Lu, W.; Thirunavukarasu, A.J.; Ting, D.S.W.; Liu, N. Large language models in health care: Development, applications, and challenges. Health Care Sci. 2023, 2, 255–263.
- Arango-de-Montis, I.; Reyes-Soto, A.; Rosales-Lagarde, A.; Eraña-Díaz, M.L.; Vazquez-Mendoza, E.; Rodríguez-Delgado, A.; Muñoz-Delgado, J.; Vazquez-Mendoza, I.; Rodriguez-Torres, E.E. Automatic detection of facial expressions during the Cyberball paradigm in Borderline Personality Disorder: A pilot study. Front. Psychiatry 2024, 15, 1354762.
- Ray, P.P. A survey on model context protocol: Architecture, state-of-the-art, challenges and future directions. TechRxiv 2025.
- Parsa, S.; Somani, S.; Dudum, R.; Jain, S.S.; Rodriguez, F. Artificial intelligence in cardiovascular disease prevention: Is it ready for prime time? Curr. Atheroscler. Rep. 2024, 26, 263–272.
- Temsah, M.-H.; Alhuzaimi, A.N.; Almansour, M.; Aljamaan, F.; Alhasan, K.; Batarfi, M.A.; Altamimi, I.; Alharbi, A.; Alsuhaibani, A.A.; Alwakeel, L.; et al. Art or artifact: Evaluating the accuracy, appeal, and educational value of AI-generated imagery in DALL·E 3 for illustrating congenital heart diseases. J. Med. Syst. 2024, 48, 54.
- Merkel, D. Docker: Lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, 2.
- Joshi, S. A Review of Generative AI and DevOps Pipelines: CI/CD, Agentic Automation, MLOps Integration, and LLMs. Int. J. Innov. Res. Comput. Sci. Technol. 2025, 13, 1–14.
- Jiang, F.; Jiang, Y.; Zhi, H.; Dong, Y.; Li, H.; Ma, S.; Wang, Y.; Dong, Q.; Shen, H.; Wang, Y. Artificial intelligence in healthcare: Past, present and future. Stroke Vasc. Neurol. 2017, 2, 230–243.
- Topol, E. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again; Basic Books: New York, NY, USA, 2019.
- Yu, E.; Chu, X.; Zhang, W.; Meng, X.; Yang, Y.; Ji, X.; Wu, C. Large Language Models in Medicine: Applications, Challenges, and Future Directions. Int. J. Med. Sci. 2025, 22, 2792–2801.
- Boettiger, C. An introduction to Docker for reproducible research. ACM SIGOPS Oper. Syst. Rev. 2015, 49, 71–79.
- Krittanawong, C.; Virk, H.U.H.; Kaplin, S.L.; Wang, Z.; Sharma, S.; Jneid, H. Assessing the potential of ChatGPT for patient education in cardiac catheterization care. Cardiovasc. Interv. 2023, 16, 1551–1552.
- Leivaditis, V.; Beltsios, E.; Papatriantafyllou, A.; Grapatsas, K.; Mulita, F.; Kontodimopoulos, N.; Baikoussis, N.G.; Tchabashvili, L.; Tasios, K.; Maroulis, I.; et al. Artificial Intelligence in Cardiac Surgery: Transforming Outcomes and Shaping the Future. Clin. Pract. 2025, 15, 17.
- Safari, N.; Techatassanasoontorn, A.; Díaz Andrade, A. Auto-pilot, co-pilot and pilot: Human and generative AI configurations in software development. In ICIS 2024 Proceedings, 1st ed.; AIS Electronic Library (AISeL): Atlanta, GA, USA, 2024; pp. 1–9. Available online: https://researchers.mq.edu.au/en/publications/auto-pilot-co-pilot-and-pilot-human-and-generative-ai-configurati/ (accessed on 6 March 2026).
- Noor, N. Generative AI-Assisted Software Development Teams: Opportunities, Challenges, and Best Practices. Master’s Thesis, Lahti University of Technology, Lahti, Finland, 2025. Available online: https://lutpub.lut.fi/bitstream/handle/10024/169746/mastersthesis_Nouman_Noor.pdf?sequence=1&isAllowed=y (accessed on 6 March 2026).
- Djabbarov, I.; Puranam, P.; Shrestha, Y.R.; Zollo, M. How to Use Generative AI to Support Collective Problem-Solving. SSRN Electron. J. 2025.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017.
- OpenAI. Introducing ChatGPT. 2022. Available online: https://openai.com/index/chatgpt/ (accessed on 12 August 2025).
- Haque, M.A. A Brief analysis of “ChatGPT”–A revolutionary tool designed by OpenAI. EAI Endorsed Trans. AI Robot. 2023, 1, e15.
- Zhang, M.; Yuan, B.; Li, H.; Xu, K. LLM-Cloud Complete: Leveraging cloud computing for efficient large language model-based code completion. J. Artif. Intell. Gen. Sci. (JAIGS) 2024, 5, 295–326.
- Kawaguchi, N.; Hart, C.; Uchiyama, H. Understanding the effectiveness of SBOM generation tools for manually installed packages in docker containers. J. Internet Serv. Inf. Secur. 2024, 14, 191–212.
- Cervera, E. Generative AI for Reproducible Research in Control, Automation and Robotics. In Proceedings of the 2025 11th International Conference on Control, Automation and Robotics (ICCAR), Kyoto, Japan, 18–20 April 2025; IEEE: New York, NY, USA, 2025; pp. 55–60.
- Gerlach, W.; Tang, W.; Wilke, A.; Olson, D.; Meyer, F. Container orchestration for scientific workflows. In Proceedings of the 2015 IEEE International Conference on Cloud Engineering, Cambridge, MA, USA, 21–25 September 2025; IEEE: New York, NY, USA, 2015; pp. 377–378.
- Fuhrer, C.; Solem, J.E.; Verdier, O. Scientific Computing with Python: High-Performance Scientific Computing with NumPy, SciPy, and Pandas; Packt Publishing Ltd.: Birmingham, UK, 2021.
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2020; Available online: https://www.R-project.org/ (accessed on 12 August 2025).
- Heinsberg, L.W.; Koleck, T.A.; Ray, M.; Weeks, D.E.; Conley, Y.P. Advancing nursing research through interactive data visualization with R shiny. Biol. Res. Nurs. 2023, 25, 107–116.
- Kasprzak, P.; Mitchell, L.; Kravchuk, O.; Timmins, A. Six Years of Shiny in Research: Collaborative Development of Web Tools in R. arXiv 2020, arXiv:2101.10948.
- Trist, E.L.; Bamforth, K.W. Some social and psychological consequences of the Longwall method of coal-getting. Hum. Relat. 1951, 4, 3–38.
- Baxter, G.; Sommerville, I. Socio-technical systems: From design methods to systems engineering. Interact. Comput. 2011, 23, 4–17.
- Seeber, I.; Bittner, E.; Briggs, R.O.; de Vreede, T.; de Vreede, G.J.; Elkins, A.; Maier, R.; Merz, A.B.; Oeste-Reiß, S.; Randrup, N.; et al. Machines as teammates: A research agenda on AI in team collaboration. Inf. Manag. 2020, 57, 103174.
- Hutchins, E. Cognition in the Wild; MIT Press: Cambridge, MA, USA, 1995.
- Nguyen-Duc, A.; Cabrero-Daniel, B.; Przybylek, A.; Arora, C.; Khanna, D.; Herda, T.; Rafiq, U.; Melegati, J.; Guerra, E.; Kemell, K.; et al. Generative Artificial Intelligence for software Engineering—A research agenda. arXiv 2023, arXiv:2310.18648. [Google Scholar] [CrossRef]




| Sector | Example Applications | Tools/Technologies | Level of AI Involvement | Key References |
|---|---|---|---|---|
| Healthcare/Medicine | - Drafting reports - Summarizing patient history - Supporting diagnostic decision-making | ChatGPT (GPT-4), Electronic Health Records (EHRs) | Low | Patel & Lam, 2023 [7]; Yu et al., 2025 [18]; Krittanawong et al., 2023 [20] |
| Medical Research | - Assisting in literature reviews - Generating code for analysis pipelines - Creating reproducible workflows | ChatGPT, GitHub Copilot, R + Python scripts, Docker | Medium | Jiang et al., 2017 [16]; Topol, 2019 [17]; Boettiger, 2015 [19] |
| Software | - Auto-generating configuration files - Infrastructure automation - Code debugging and test generation | ChatGPT, GitHub Copilot, CI/CD pipelines, YAML, Docker Compose | Medium-High | Leivaditis et al., 2025 [21]; Safari et al., 2024 [22]; Noor, 2025 [23] |
| Management/Business | - Decision-making augmentation - Document summarization - Customer communication | GPT-based copilots, Microsoft Copilot, custom LLMs | Low-Medium | Djabbarov et al., 2025 [24] |
| Industrial Optimization & Automation | - Solving resource allocation problems - Combining ANN and software robots - Supporting decision-making in constrained environments | UIPath, Artificial Neural Networks, Workflow Automation | Medium-High | Eraña-Díaz et al., 2025 [1] |
| Our Contribution | - Designing and automating a full ETL workflow for psychiatric data analysis | Docker, n8n, Python, R, ChatGPT, Claude (prompt-based collaboration) | High | This work (2026) |
| Group | Age (Mean ± SD) | Years of Education (Mean ± SD) | Gender (Female/Male), n | Gender (Female/Male), % |
|---|---|---|---|---|
| BPD | 31.02 ± 10.8 | 14.6 ± 3 | 34/11 | 75.6/24.4 |
| CTRL | 24.51 ± 5.72 | 14 ± 0 | 43/14 | 75.4/24.6 |
| Category | Tool | Purpose/Specification |
|---|---|---|
| Containerization | Docker (v20.10+) | Reproducible execution environments |
| | docker-compose | Multi-container orchestration |
| Workflow Orchestration | n8n (v0.195+) | Visual workflow automation platform |
| Data Processing | Python 3.x | Preprocessing, data cleaning, feature engineering (pandas, numpy, scikit-learn) |
| | R (v4.0+) | Statistical analysis, machine learning, visualization (ggplot2, randomForest, caret, cluster, shiny) |
| Generative AI Tools | ChatGPT (GPT-4.1) | Code generation and debugging (free tier) |
| | Claude (Claude 3) | Docker/YAML configuration (free tier) |
| Computational Resources | Desktop Workstation | AMD A10-6700T APU (2.50 GHz), 12 GB RAM, 447 GB SSD |
| | Laptop | Intel Core i7-1255U (1.70 GHz), 16 GB RAM |
| | Operating System | Windows 11 with Docker Desktop |
| Metric | Manual Processing (Estimated) | Automated Processing (Achieved) | Improvement |
|---|---|---|---|
| Processing Time per Dataset | ~30–45 min | ~2–3 min | 90–93% reduction |
| Total Processing Time (102 datasets) | ~51–77 h | ~3.4–5.1 h | 93–94% reduction |
| Data Entry Errors | 5–10% (estimated from manual workflows) | 0% (automated validation) | 100% reduction |
| Reproducibility Across Systems | Variable (environment-dependent) | Identical (containerized) | 100% consistency |
| Datasets Successfully Processed | N/A | 102/102 (100%) | Fully automated |
| Configuration Files Generated | 0 (manual scripts) | 8 (Docker, docker-compose, workflows) | Standardized deployment |
| ML Models Executed | Manual execution required | Automated (K-means + Random Forest) | Zero manual intervention |
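The time figures in the table above follow from simple arithmetic on the per-dataset estimates. The sketch below reproduces the totals and shows how the reduction percentage depends on which ends of the two ranges are compared; the helper names are illustrative.

```python
N_DATASETS = 102
manual_min = (30, 45)     # estimated manual minutes per dataset (low, high)
automated_min = (2, 3)    # automated minutes per dataset (low, high)


def reduction(before, after):
    """Percentage reduction when going from before to after."""
    return 100 * (1 - after / before)


def total_hours(minutes_per_dataset, n=N_DATASETS):
    """Total hours for n datasets at a fixed per-dataset time."""
    return minutes_per_dataset * n / 60


worst = reduction(manual_min[0], automated_min[1])   # 30 min -> 3 min
best = reduction(manual_min[1], automated_min[0])    # 45 min -> 2 min
print(f"manual total: {total_hours(30)}-{total_hours(45)} h")
print(f"automated total: {total_hours(2)}-{total_hours(3)} h")
print(f"reduction: {worst:.1f}%-{best:.1f}%")
# manual total: 51.0-76.5 h
# automated total: 3.4-5.1 h
# reduction: 90.0%-95.6%
```

Comparing like ends of the ranges (2 vs. 30 min, or 3 vs. 45 min) gives about 93.3% in both cases, which is consistent with the 90–93% and 93–94% figures reported in the table.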
| Category | Metric | Value | Role Classification |
|---|---|---|---|
| Development Phases | Total phases with AI assistance | 7 phases | AI-primary |
| Interactions | Total AI conversations (ChatGPT + Claude) | ~25–30 sessions | AI-primary |
| Interactions | Total prompts submitted | ~80–100 prompts | AI-primary |
| Refinement | Total iterative refinement cycles | 18+ iterations | Collaborative |
| Assistance Type | Code & configuration generation (Python/R scripts, Docker/n8n files) | 4 Python scripts, 2 R scripts, 8 config files, 5 workflow variations | AI-primary (65% of interactions) |
| Assistance Type | Error debugging | 7 error categories resolved | Collaborative (30% of interactions) |
| Assistance Type | Documentation (comments, README, instructions) | Inline comments + deployment guides | AI-primary (5% of interactions) |
| Time Savings | Estimated development time (manual) | 6–8 weeks | Human-primary (baseline) |
| Time Savings | Actual development time (AI-assisted) | 4 weeks | Collaborative |
| Time Savings | Time reduction | ~40–50% | Collaborative |
| Code Quality | Syntax errors fixed by AI | ~30+ errors | AI-primary |
| Code Quality | Dependency conflicts resolved | ~10 conflicts | AI-primary |
| Code Quality | Logic errors requiring human intervention | ~5 errors | Human-primary |

| Importance Rank | Control Group | BPD Group |
|---|---|---|
| Top 1 Variable | Action26 | Action01 |
| Top 2 Variable | Heart.Rate | Action12 |
| Top 3 Variable | HeadM | Heart.Rate |
| Other Highly Ranked Variables | Cluster, Contempt, Scared, Angry, Boredom | Action25, Action04, Scared, Interest, HeadM, Disgusted |

| Pipeline Stage | Input | Output | Success Rate |
|---|---|---|---|
| Extract | 102 .txt files | 102 processed CSV files | 100% |
| Transform | 102 CSV files | 102 clustered datasets + 102 RF outputs | 100% |
| Load | 204 processed datasets | 1 Shiny dashboard + 204 PDF reports | 100% |
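
As an illustration of the Extract row above, the following Python sketch converts a directory of `.txt` files into cleaned CSV files and reports how many succeeded. It is not the authors' script: the tab-delimited layout with a header row, the rule that a valid cell is a finite number, and the names `is_valid`, `extract_file`, and `run_extract` are all assumptions made for this example.

```python
import csv
import math
from pathlib import Path


def is_valid(value: str) -> bool:
    """Assumed rule: a cell is valid if it parses as a finite number."""
    try:
        return math.isfinite(float(value))
    except ValueError:
        return False


def extract_file(txt_path: Path, out_dir: Path) -> Path:
    """Convert one tab-delimited .txt file (assumed format) into a cleaned CSV."""
    out_path = out_dir / (txt_path.stem + ".csv")
    with txt_path.open(newline="", encoding="utf-8") as src, \
         out_path.open("w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{txt_path} is empty")
        writer.writerow(header)
        for row in reader:
            # Keep only complete rows whose cells are all finite numbers.
            if len(row) == len(header) and all(is_valid(v) for v in row):
                writer.writerow(row)
    return out_path


def run_extract(in_dir: Path, out_dir: Path) -> tuple[int, int]:
    """Process every .txt file in in_dir; return (succeeded, total)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(in_dir.glob("*.txt"))
    succeeded = 0
    for path in files:
        try:
            extract_file(path, out_dir)
            succeeded += 1
        except (OSError, ValueError, csv.Error) as exc:
            print(f"failed: {path.name}: {exc}")
    return succeeded, len(files)
```

A call such as `run_extract(Path("/data/FRyCB"), Path("/data/out"))` (the output directory is hypothetical) returns a pair like `(102, 102)` when every file is converted, which corresponds to the 100% success rate reported for this stage.
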
| Aspect | Manual Processing | Automated Pipeline | Advantage |
|---|---|---|---|
| Time per dataset | 30–45 min | 2–3 min | 90–93% faster |
| Error rate | 5–10% (human errors) | 0% (validated) | 100% reduction |
| Reproducibility | Low (environment-dependent) | High (containerized) | Guaranteed consistency |
| Scalability | Linear (time increases 1:1) | Sub-linear (batch processing) | Efficient for large N |
| Documentation | Manual record-keeping | Automated (Docker/n8n logs) | Complete audit trail |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Eraña-Díaz, M.L.; Rosales-Lagarde, A.; Arango-de-Montis, I.; Velázquez-Monzón, J.A. Generative AI-Assisted Automation of Clinical Data Processing: A Methodological Framework for Streamlining Behavioral Research Workflows. Informatics 2026, 13, 48. https://doi.org/10.3390/informatics13040048

