PCLLM: An Integrated LLM-Driven System for Automating Desktop Operations via Direct Mouse and Keyboard Control
Abstract
1. Introduction
- We propose a novel end-to-end PC task automation framework that leverages LLMs to operate the mouse and keyboard. Experimental results demonstrate the framework’s robust performance across three PC applications (Notepad, WordPad, and Calculator), with task completion rates of up to 98.59%, 91.55%, and 84.43%, respectively, when powered by GPT-4o.
- We present a prompt engineering methodology that systematically integrates three software-related components: the software’s hierarchical architecture, functional descriptions of its buttons, and few-shot examples. This prompt significantly enhances the LLM’s ability to understand and execute PC tasks.
- A dual-LLM pipeline architecture is presented, in which one model generates diverse task descriptions while another executes and records the corresponding operations, demonstrating the feasibility of fully automated test data generation for PC automation tasks.
- We conduct extensive experimental validation using three standard PC applications (Notepad, WordPad, and Calculator), with quantitative results showing superior performance compared to existing approaches.
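
The three prompt components named in the contributions above can be illustrated with a minimal sketch. It is not the authors' prompt: the `AppPromptSpec` container, the `build_prompt` function, the section wording, and the Notepad menu entries are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AppPromptSpec:
    """Hypothetical container for the three prompt components."""
    app_name: str
    hierarchy: dict[str, list[str]]        # menu name -> buttons under it
    button_descriptions: dict[str, str]    # button name -> what it does
    few_shot_examples: list[tuple[str, list[str]]] = field(default_factory=list)


def build_prompt(spec: AppPromptSpec, task: str) -> str:
    """Concatenate hierarchy, button functions, and examples, then the task."""
    lines = [f"You control {spec.app_name} with the mouse and keyboard."]
    lines.append("Interface hierarchy:")
    for menu, buttons in spec.hierarchy.items():
        lines.append(f"- {menu}: {', '.join(buttons)}")
    lines.append("Button functions:")
    for button, description in spec.button_descriptions.items():
        lines.append(f"- {button}: {description}")
    lines.append("Examples:")
    for example_task, actions in spec.few_shot_examples:
        lines.append(f"Task: {example_task}")
        lines.append("Actions: " + "; ".join(actions))
    lines.append(f"Task: {task}")
    lines.append("Actions:")
    return "\n".join(lines)


# Illustrative values only; the real menus and examples are not given here.
notepad = AppPromptSpec(
    app_name="Notepad",
    hierarchy={"File": ["New", "Open", "Save"]},
    button_descriptions={"Save": "writes the current document to disk"},
    few_shot_examples=[("Create a new document", ["Click (File)", "Click (New)"])],
)
print(build_prompt(notepad, "Save the current document"))
```

Ending the prompt with an open `Actions:` line is one common way to steer the model toward emitting only an action sequence; whether PCLLM does this is not stated here.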
2. Related Work
3. Methodology
3.1. Prompt Engineering
3.2. GUI Element Localization via Template Matching
3.3. Two LLMs for Automated Test Data Generation
3.4. Materials
4. Experiment
4.1. Experimental Environment
4.2. Baseline and Comparison
4.3. Ablation Study
4.4. Comparison with CogAgent
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest






| Function | Parameter | Instance |
|---|---|---|
| moveTo | button | MoveTo (File) |
| click | button | Click (File) |
| doubleclick | button | Doubleclick (File) |
| scroll | times | Scroll (10) |
| write | message | Write (Hello) |
| keyDown | key | KeyDown (UP) |
| keyUp | key | KeyUp (UP) |
| hotkey | * keys | Hotkey (Ctrl + c) |
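
As a rough illustration of how action strings in the Instance column could be turned into structured calls, the sketch below parses one instance into a function name and an argument list. The function names come from the table; the parser itself, the `parse_action` name, and the case-insensitive lookup are assumptions. The function names resemble those of the `pyautogui` library, but the sketch deliberately does not call it, since resolving a button name such as `File` to screen coordinates is a separate step.

```python
import re

# Function names from the table, keyed by lowercase form so that the
# capitalized spellings in the Instance column resolve to the same entry.
FUNCTIONS = {
    "moveto": "moveTo",
    "click": "click",
    "doubleclick": "doubleclick",
    "scroll": "scroll",
    "write": "write",
    "keydown": "keyDown",
    "keyup": "keyUp",
    "hotkey": "hotkey",
}

# Matches "Name (argument text)"; the argument may itself contain parentheses.
ACTION_RE = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")


def parse_action(text: str) -> tuple[str, list]:
    """Split an instance such as 'Click (File)' into (function, arguments)."""
    match = ACTION_RE.match(text)
    if match is None:
        raise ValueError(f"not an action: {text!r}")
    name = FUNCTIONS.get(match.group(1).lower())
    if name is None:
        raise ValueError(f"unknown function: {match.group(1)!r}")
    raw = match.group(2).strip()
    if name == "hotkey":
        # "Ctrl + c" becomes ["Ctrl", "c"]
        args = [key.strip() for key in raw.split("+")]
    elif name == "scroll":
        args = [int(raw)]
    else:
        args = [raw]
    return name, args


print(parse_action("Hotkey (Ctrl + c)"))
```

Rejecting unknown function names up front keeps a malformed model output from reaching the mouse and keyboard layer.
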

| Application | Comparison | H-Test p | MW p | Sig. |
|---|---|---|---|---|
| Notepad | GPT-4o vs. GPT-4o-mini | < | < | *** |
| | GPT-4o vs. Gemini-1.5-Pro | | < | *** |
| | GPT-4o vs. Gemini-2.0-flash-exp | | | ** |
| | GPT-4o-mini vs. Gemini-1.5-Pro | | | * |
| | GPT-4o-mini vs. Gemini-2.0-flash-exp | | < | *** |
| | Gemini-1.5-Pro vs. Gemini-2.0-flash-exp | | | n.s. |
| Calculator | GPT-4o vs. GPT-4o-mini | < | < | *** |
| | GPT-4o vs. Gemini-1.5-Pro | | < | *** |
| | GPT-4o vs. Gemini-2.0-flash-exp | | | ** |
| | GPT-4o-mini vs. Gemini-1.5-Pro | | < | *** |
| | GPT-4o-mini vs. Gemini-2.0-flash-exp | | < | *** |
| | Gemini-1.5-Pro vs. Gemini-2.0-flash-exp | | < | *** |
| WordPad | GPT-4o vs. GPT-4o-mini | | < | *** |
| | GPT-4o vs. Gemini-1.5-Pro | | | * |
| | GPT-4o vs. Gemini-2.0-flash-exp | | | ** |
| | GPT-4o-mini vs. Gemini-1.5-Pro | | | n.s. |
| | GPT-4o-mini vs. Gemini-2.0-flash-exp | | | n.s. |
| | Gemini-1.5-Pro vs. Gemini-2.0-flash-exp | | | n.s. |

| Model | Task Complexity | Success Rate (%) |
|---|---|---|
| GPT-4o | basic | 98.59 |
| | intermediate | 95.77 |
| | advanced | 52.11 |
| Qwen2.5-32B-Instruct | basic | 91.55 |
| | intermediate | 69.01 |
| | advanced | 45.07 |
| Qwen2.5-14B-Instruct | basic | 85.91 |
| | intermediate | 64.79 |
| | advanced | 36.62 |
| Qwen2.5-7B-Instruct | basic | 56.34 |
| | intermediate | 53.52 |
| | advanced | 28.17 |

| Model | Task Complexity | Success Rate (%) |
|---|---|---|
| Gemini-1.5-pro-latest | basic | 89.53 |
| | intermediate | 80.19 |
| | advanced | 48.62 |
| Qwen2.5-32B-Instruct | basic | 70.21 |
| | intermediate | 19.84 |
| | advanced | 16.90 |
| Qwen2.5-14B-Instruct | basic | 42.23 |
| | intermediate | 12.68 |
| | advanced | 8.51 |
| Qwen2.5-7B-Instruct | basic | 87.32 |
| | intermediate | 12.67 |
| | advanced | 2.82 |

| Model | Ablation Part | Task Complexity | Completion Rate (%) |
|---|---|---|---|
| GPT-4o | full prompt | basic | 98.59 |
| | | intermediate | 95.77 |
| | | advanced | 52.11 |
| | remove architecture | basic | 94.37 |
| | | intermediate | 67.61 |
| | | advanced | 21.13 |
| | no button function | basic | 98.59 |
| | | intermediate | 97.18 |
| | | advanced | 49.30 |
| | no hotkey | basic | 97.22 |
| | | intermediate | 90.27 |
| | | advanced | 48.61 |
| | no few-shot example | basic | 91.55 |
| | | intermediate | 59.15 |
| | | advanced | 15.49 |
| | only basic desc | basic | 47.89 |
| | | intermediate | 7.04 |
| | | advanced | 0.00 |
| Gemini-2.0-flash-exp | full prompt | basic | 92.96 |
| | | intermediate | 87.32 |
| | | advanced | 28.17 |
| | remove architecture | basic | 92.96 |
| | | intermediate | 76.06 |
| | | advanced | 5.63 |
| | no button function | basic | 94.37 |
| | | intermediate | 78.87 |
| | | advanced | 22.54 |
| | no hotkey | basic | 91.66 |
| | | intermediate | 77.78 |
| | | advanced | 25.00 |
| | no few-shot example | basic | 95.77 |
| | | intermediate | 77.46 |
| | | advanced | 12.68 |
| | only basic desc | basic | 35.21 |
| | | intermediate | 4.23 |
| | | advanced | 0.00 |
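
To make the ablation easier to read, the following sketch computes how many percentage points each ablated prompt loses relative to the full prompt. The GPT-4o rates are copied from the table above; the `drops_vs_full` helper and the dictionary layout are our own illustrative choices, not part of the reported evaluation.

```python
# Completion rates (%) for GPT-4o, copied from the ablation table.
# Order within each tuple: (basic, intermediate, advanced).
GPT4O = {
    "full prompt": (98.59, 95.77, 52.11),
    "remove architecture": (94.37, 67.61, 21.13),
    "no button function": (98.59, 97.18, 49.30),
    "no hotkey": (97.22, 90.27, 48.61),
    "no few-shot example": (91.55, 59.15, 15.49),
    "only basic desc": (47.89, 7.04, 0.00),
}

LEVELS = ("basic", "intermediate", "advanced")


def drops_vs_full(rates):
    """Percentage-point drop of each ablated variant relative to the full prompt."""
    full = rates["full prompt"]
    return {
        variant: tuple(round(f - v, 2) for f, v in zip(full, values))
        for variant, values in rates.items()
        if variant != "full prompt"
    }


for variant, drops in drops_vs_full(GPT4O).items():
    summary = ", ".join(f"{level}: {drop:+.2f}" for level, drop in zip(LEVELS, drops))
    print(f"{variant:22s} {summary}")
```

For GPT-4o, removing the few-shot examples costs 36.62 points on both intermediate and advanced tasks, and removing the architecture description costs 30.98 points on advanced tasks, whereas removing the button functions changes no level by more than about 3 points.
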

| Scheme | Task Type | Success Rate (%) |
|---|---|---|
| CogAgent | Basic | 45.07 |
| | Intermediate | 8.45 |
| | Advanced | 0.00 |
| Our scheme (GPT-4o as decision model) | Basic | 98.59 |
| | Intermediate | 95.77 |
| | Advanced | 52.11 |
| Our scheme (Qwen2.5-7B-Instruct as decision model) | Basic | 56.34 |
| | Intermediate | 53.52 |
| | Advanced | 28.17 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Wang, Z.; Dong, Y.; Fu, M.; Wang, J.; Sun, J.; Wang, Q.; Lu, Y.; Chen, N.; Zhang, R.; Zhang, W. PCLLM: An Integrated LLM-Driven System for Automating Desktop Operations via Direct Mouse and Keyboard Control. Computers 2026, 15, 351. https://doi.org/10.3390/computers15060351

