Retrieval-Augmented Text-to-CSEQL Generation for Cross-Platform Cyberspace Assets Query
Abstract
1. Introduction
- We propose a novel, RAG-based Text-to-CSEQL method that addresses the critical limitation of existing approaches in generating multi-platform queries. Central to this is our hybrid retrieval algorithm for syntax fields, which significantly enhances retrieval recall.
- We construct the first large-scale benchmark dataset for cross-platform Text-to-CSEQL, comprising over 16,000 manually verified NL question–CSEQL pairs across Shodan, Censys, ZoomEye, and FOFA. We also introduce a novel domain-specific FM metric for comprehensive evaluation.
- Extensive experiments show that our method achieves substantial improvements over prompt-based baselines and also demonstrates significant and consistent superiority over state-of-the-art single-platform methods.
2. Related Work
2.1. Natural Language Query for Cyberspace Assets
2.2. Text-to-SQL
2.3. Text-to-DSL
3. Methodology
Algorithm 1: Field Retrieval Procedure
3.1. Knowledge Base Construction
3.2. Field Retrieval
3.3. Few-Shot Retrieval
3.4. CSEQL Generation
4. Experiments
- RQ1: How does our method perform across different CSEs?
- RQ2: How do the components of our method impact performance?
- RQ3: How does our method perform compared to a single-platform method?
4.1. Experimental Setup
4.2. Experimental Results
4.2.1. Knowledge Base Statistics
4.2.2. Main Results
- (1) As shown in Table 2, our method improves the mean EM score over prompt-based baselines by 43.15%. This demonstrates that it not only accommodates the differing syntaxes of the four CSEs but also reduces LLM hallucinations, improving CSEQL query generation.
- (2) Table 2 also shows that the models differ in their ability to generate CSEQL queries across CSEs. Under the same method, no single LLM is universally optimal for the Text-to-CSEQL task, so different CSEs may call for different models.
- (3) Notably, all models achieve near-zero EM scores on Censys under the simple prompt (0.002 on average). Censys adopts deeply nested field structures (e.g., services.port or services.tls.certificates.issuer) that require precise dot-delimited path specifications, fundamentally differing from the flat-field paradigms (e.g., port) of other CSEs. This hierarchy frequently leads LLMs to misinterpret field paths during natural language query processing—for example, substituting abbreviated forms (port, tls.issuer) for canonical paths (services.port, services.tls.certificates.issuer)—which propagates into erroneous CSEQL queries. Our field retrieval algorithm performs semantic alignment between user questions and full hierarchical paths through field matching and dense retrieval, raising the average EM score on Censys to 0.69 and validating its efficacy on nested fields. This result also suggests a design principle for CSE field schemas: prefer flat fields over nested ones whenever possible.
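The hybrid field retrieval described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the actual system uses BGE embeddings with a Faiss index for the dense-retrieval stage, whereas here a token-overlap (Jaccard) score stands in for embedding similarity, and the field names and descriptions are illustrative Censys-style examples.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for dense cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve_fields(question: str, schema: dict, k: int = 3) -> list:
    """Return up to k candidate fields: exact-name matches first, then
    fields ranked by description similarity, in rank order."""
    q = question.lower()
    # (a) field match: keep full hierarchical paths whose last segment
    # appears in the question (e.g. "port" -> "services.port")
    name_hits = [f for f in schema if f.split(".")[-1] in q]
    # (b) dense-retrieval stand-in: rank the remaining fields by
    # similarity between the question and the field description
    rest = sorted((f for f in schema if f not in name_hits),
                  key=lambda f: jaccard(question, schema[f]), reverse=True)
    return (name_hits + rest)[:k]

schema = {  # hypothetical nested-field schema
    "services.port": "network port of the running service",
    "services.tls.certificates.issuer": "issuer of the TLS certificate",
    "location.country": "country where the asset is located",
}
candidates = retrieve_fields("find hosts with port 443 and a TLS certificate", schema)
```

Because the exact-name stage returns the full path services.port rather than the bare token port, the generator is steered toward canonical dot-delimited paths, which is precisely the failure mode observed on Censys.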
| Model | Shodan (Simple) | Shodan (Ours) | ZoomEye (Simple) | ZoomEye (Ours) | Censys (Simple) | Censys (Ours) | FOFA (Simple) | FOFA (Ours) |
|---|---|---|---|---|---|---|---|---|
| Kimi | 0.081 | 0.704 | 0.419 | 0.764 | 0 | 0.687 | 0.424 | 0.782 |
| GPT-4o-mini | 0.081 | 0.723 | 0.496 | 0.792 | 0 | 0.686 | 0.429 | 0.772 |
| GPT-3.5-Turbo | 0.301 | 0.714 | 0.485 | 0.787 | 0 | 0.666 | 0.372 | 0.795 |
| Claude-3.5-Sonnet | 0.417 | 0.736 | 0.803 | 0.821 | 0 | 0.712 | 0.548 | 0.807 |
| Gemini-1.5-Pro | 0.300 | 0.721 | 0.667 | 0.810 | 0.014 | 0.727 | 0.545 | 0.805 |
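The EM scores reported above follow the standard exact-match convention for query generation: a prediction counts as correct only if it equals the gold query after light normalization. A minimal sketch, assuming whitespace collapsing and straight-quote normalization (the paper does not spell out its normalization rules, so these details are assumptions):

```python
def normalize(q: str) -> str:
    # straighten typographic quotes, collapse runs of whitespace
    return " ".join(q.replace("\u201c", '"').replace("\u201d", '"').split())

def em_score(preds, golds):
    """Fraction of predictions that exactly match the gold query after normalization."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

preds = ['ip="1.1.1.1"', 'port=80 && country="US"']
golds = ['ip = "1.1.1.1"', 'port=80 && country="US"']
score = em_score(preds, golds)  # first pair differs in spacing around "="
```

Note that under this definition a spacing difference inside a token boundary (ip="…" vs. ip = "…") still counts as a mismatch, which is one reason semantically tolerant metrics such as the paper's FM are reported alongside EM.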
4.3. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
CSE | Cyberspace Search Engine |
CSEQL | Cyberspace Search Engine Query Language |
LLM | Large Language Model |
RAG | Retrieval-Augmented Generation |
NL | Natural Language |
Appendix A. Modified COSTAR Prompt Template
### Context
You want to convert a natural language task into a FOFA query. A natural language task might be something like: "Search for all assets that have an IP address of 1.1.1.1." FOFA queries are used to search cyberspace assets in the FOFA platform. For instance, the above task would convert into the FOFA query: ip="1.1.1.1"
{few_shot}

### Objective
The GPT should convert natural language tasks into FOFA queries, following FOFA's syntax and logic operator rules. The task includes:
1. Strictly generating FOFA queries without additional explanations or context.
2. Adhering to FOFA's syntax, logic operators, and fields, as outlined in the provided tables.
3. Handling varied natural language inputs, such as:
   - "Search for all assets that have an IP address of 1.1.1.1."
   - "Find domains using port 80."
   - "List assets associated with example.com."

### Style
The GPT should adopt a technical and precise style, focusing strictly on generating accurate FOFA queries. It should behave like a FOFA expert, with expertise in cyberspace asset search and a thorough understanding of FOFA's syntax, operators, and field rules.

### Tone
The GPT should use a straightforward and neutral tone, focusing on delivering the FOFA query output without unnecessary elaboration or additional context.

### Audience
The GPT's outputs should be accessible to everyone, including technical users familiar with FOFA syntax and non-technical users who may need clear and precise FOFA queries without additional jargon.

### Supplementary Data
Below are FOFA query fields and their descriptions to help the GPT generate accurate queries:
{fields}

### Response
The GPT should return its output in JSON format with two keys:
- text: A natural language summary of the query, echoing the input task in a simplified form.
- query: The corresponding FOFA query string.

Example:
{{
  "text": "Search for all assets with the IP address 1.1.1.1.",
  "query": "ip=\"1.1.1.1\""
}}
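The template above fixes a JSON response contract with keys text and query, so the client side reduces to filling the {few_shot} and {fields} slots and parsing the model's reply. A minimal sketch of that round trip; the template string, slot contents, and the simulated reply are illustrative, not the paper's exact implementation:

```python
import json

# simplified stand-in for the full COSTAR template above,
# with the same {few_shot} and {fields} slots
template = (
    "Convert the natural language task into a FOFA query.\n"
    "{few_shot}\n"
    "Available fields:\n{fields}\n"
    'Return JSON with two keys: "text" and "query".'
)

prompt = template.format(
    few_shot='Task: assets with IP 1.1.1.1 -> ip="1.1.1.1"',
    fields="ip: IPv4 address of the asset\nport: open port number",
)

# simulated model reply conforming to the response contract
reply = '{"text": "Search for assets on port 80.", "query": "port=\\"80\\""}'
result = json.loads(reply)
generated_query = result["query"]
```

Requiring structured JSON output makes the generated query machine-extractable without brittle regex post-processing, which matters when the query itself contains quotes and logical operators.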
Appendix B. Implementation Details
Appendix B.1. Filter Implementation
Appendix B.2. Few-Shot Retrieval Implementation
References
- Lu, J.; Qiu, X.; Guo, R.; Li, L.; Zeng, J. A Novel System for Asset Discovery and Identification in Cyberspace. In Proceedings of the 2022 World Automation Congress (WAC), San Antonio, TX, USA, 11–15 October 2022; pp. 23–27. [Google Scholar] [CrossRef]
- Zou, Z.; Hou, Y.; Guo, Q. Research on Cyberspace Surveying and Mapping Technology Based on Asset Detection. In Proceedings of the 2024 IEEE 6th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 24–26 May 2024; Volume 6, pp. 946–949. [Google Scholar] [CrossRef]
- King, Z.M.; Henshel, D.S.; Flora, L.; Cains, M.G.; Hoffman, B.; Sample, C. Characterizing and measuring maliciousness for cybersecurity risk assessment. Front. Psychol. 2018, 9, 39. [Google Scholar] [CrossRef] [PubMed]
- Hao, H.; Xu, C.; Zhang, W.; Yang, S.; Muntean, G.M. Task-Driven Priority-Aware Computation Offloading Using Deep Reinforcement Learning. IEEE Trans. Wirel. Commun. 2025, 1. [Google Scholar] [CrossRef]
- Hao, H.; Xu, C.; Zhang, W.; Yang, S.; Muntean, G.M. Joint Task Offloading, Resource Allocation, and Trajectory Design for Multi-UAV Cooperative Edge Computing With Task Priority. IEEE Trans. Mob. Comput. 2024, 23, 8649–8663. [Google Scholar] [CrossRef]
- Censys. CensysGPT Beta. Available online: https://gpt.censys.io/ (accessed on 4 June 2024).
- Knownsec. Knownsec/ZoomeyeGPT. Available online: https://github.com/knownsec/ZoomeyeGPT (accessed on 4 June 2024).
- Shodan. Available online: https://www.shodan.io (accessed on 9 April 2024).
- Censys Search. Available online: https://search.censys.io/ (accessed on 18 December 2024).
- ZoomEye. Available online: https://www.zoomeye.org/ (accessed on 9 April 2024).
- FOFA Search Engine. Available online: https://fofa.info (accessed on 9 April 2024).
- Zelle, J.M.; Mooney, R.J. Learning to parse database queries using inductive logic programming. In Proceedings of the National Conference on Artificial Intelligence, Portland, OR, USA, 4–8 August 1996; pp. 1050–1055. [Google Scholar]
- Saha, D.; Floratou, A.; Sankaranarayanan, K.; Minhas, U.F.; Mittal, A.R.; Özcan, F. ATHENA: An ontology-driven system for natural language querying over relational data stores. Proc. VLDB Endow. 2016, 9, 1209–1220. [Google Scholar] [CrossRef]
- Guo, J.; Zhan, Z.; Gao, Y.; Xiao, Y.; Lou, J.G.; Liu, T.; Zhang, D. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Màrquez, L., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 4524–4535. [Google Scholar] [CrossRef]
- Choi, D.; Shin, M.C.; Kim, E.; Shin, D.R. RYANSQL: Recursively Applying Sketch-based Slot Fillings for Complex Text-to-SQL in Cross-Domain Databases. Comput. Linguist. 2021, 47, 309–332. [Google Scholar] [CrossRef]
- Wang, B.; Shin, R.; Liu, X.; Polozov, O.; Richardson, M. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7567–7578. [Google Scholar]
- Cao, R.; Chen, L.; Chen, Z.; Zhao, Y.; Zhu, S.; Yu, K. LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; Zong, C., Xia, F., Li, W., Navigli, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2541–2555. [Google Scholar] [CrossRef]
- Scholak, T.; Schucher, N.; Bahdanau, D. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 9895–9901. [Google Scholar] [CrossRef]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Chang, S.; Fosler-Lussier, E. How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings. In Proceedings of the NeurIPS 2023 Second Table Representation Learning Workshop, New Orleans, LA, USA, 15 December 2023. [Google Scholar]
- Pourreza, M.; Rafiei, D. DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Nan, L.; Zhao, Y.; Zou, W.; Ri, N.; Tae, J.; Zhang, E.; Cohan, A.; Radev, D. Enhancing Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 14935–14956. [Google Scholar] [CrossRef]
- Gao, D.; Wang, H.; Li, Y.; Sun, X.; Qian, Y.; Ding, B.; Zhou, J. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. Proc. VLDB Endow. 2024, 17, 1132–1145. [Google Scholar] [CrossRef]
- Wang, B.; Wang, Z.; Wang, X.; Cao, Y.; Saurous, R.A.; Kim, Y. Grammar Prompting for Domain-Specific Language Generation with Large Language Models. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Staniek, M.; Schumann, R.; Züfle, M.; Riezler, S. Text-to-OverpassQL: A Natural Language Interface for Complex Geodata Querying of OpenStreetMap. Trans. Assoc. Comput. Linguist. 2024, 12, 562–575. [Google Scholar] [CrossRef]
- Ziletti, A.; D'Ambrosi, L. Retrieval augmented text-to-SQL generation for epidemiological question answering using electronic health records. In Proceedings of the 6th Clinical Natural Language Processing Workshop, Mexico City, Mexico, 21 June 2024; Naumann, T., Ben Abacha, A., Bethard, S., Roberts, K., Bitterman, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 47–53. [Google Scholar] [CrossRef]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 9459–9474. [Google Scholar]
- Teo, S. How I Won Singapore’s GPT-4 Prompt Engineering Competition. 2024. Available online: https://towardsdatascience.com/how-i-won-singapores-gpt-4-prompt-engineering-competition-34c195a93d41 (accessed on 25 December 2024).
- Projectdiscovery/Awesome-Search-Queries. Available online: https://github.com/projectdiscovery/awesome-search-queries (accessed on 10 February 2025).
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
- GPT-4o Mini: Advancing Cost-Efficient Intelligence. Available online: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/ (accessed on 14 February 2025).
- Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar]
- Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku; Anthropic: San Francisco, CA, USA, 2024. [Google Scholar]
- Moonshot AI. Kimi. 2023. Available online: https://kimi.moonshot.cn/ (accessed on 9 April 2024).
- Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.Y. C-Pack: Packed Resources For General Chinese Embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 14–18 July 2024; SIGIR ’24; pp. 641–649. [Google Scholar] [CrossRef]
- Douze, M.; Guzhva, A.; Deng, C.; Johnson, J.; Szilvasy, G.; Mazaré, P.E.; Lomeli, M.; Hosseini, L.; Jégou, H. The Faiss library. arXiv 2024, arXiv:2401.08281. [Google Scholar]
- Zhong, R.; Yu, T.; Klein, D. Semantic Evaluation for Text-to-SQL with Distilled Test Suites. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Webber, B., Cohn, T., He, Y., Liu, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 396–411. [Google Scholar] [CrossRef]
- Deng, N.; Chen, Y.; Zhang, Y. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; Calzolari, N., Huang, C.R., Kim, H., Pustejovsky, J., Wanner, L., Choi, K.S., Ryu, P.M., Chen, H.H., Donatelli, L., Ji, H., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 2166–2187. [Google Scholar]
| Quantity | Shodan | ZoomEye | Censys | FOFA |
|---|---|---|---|---|
| Number of questions | 5835 | 3469 | 3671 | 4068 |
| Number of fields | 50 | 23 | 81 | 65 |
| Avg. logical operators per query | 2.04 | 1.92 | 1.15 | 1.39 |
| Avg. fields per query | 1.21 | 1.60 | 1.13 | 1.22 |
| Avg. question length (chars) | 84.14 | 71.96 | 80.12 | 89.17 |
| Avg. query length (chars) | 32.93 | 22.14 | 55.86 | 32.47 |
| Model | CensysGPT Beta EM | CensysGPT Beta FM | Ours EM | Ours FM |
|---|---|---|---|---|
| Kimi | \ | \ | 0.687 | 0.774 |
| GPT-4o-mini | 0.678 | 0.768 | 0.686 | 0.892 |
| GPT-3.5-Turbo | \ | \ | 0.666 | 0.739 |
| Claude-3.5-Sonnet | \ | \ | 0.712 | 0.802 |
| Gemini-1.5-Pro | \ | \ | 0.727 | 0.797 |
| Average | 0.678 | 0.768 | 0.6956 | 0.8008 |
| Model | ZoomeyeGPT EM | ZoomeyeGPT FM | Ours EM | Ours FM |
|---|---|---|---|---|
| Kimi | 0.732 | 0.804 | 0.764 | 0.882 |
| GPT-4o-mini | 0.729 | 0.801 | 0.792 | 0.886 |
| GPT-3.5-Turbo | 0.747 | 0.819 | 0.787 | 0.886 |
| Claude-3.5-Sonnet | 0.730 | 0.799 | 0.821 | 0.847 |
| Gemini-1.5-Pro | 0.779 | 0.862 | 0.810 | 0.903 |
| Average | 0.7434 | 0.817 | 0.7948 | 0.8808 |
| Method | Variant | Kimi EM | Kimi FM | GPT-4o-Mini EM | GPT-4o-Mini FM | GPT-3.5-Turbo EM | GPT-3.5-Turbo FM | Claude-3.5-Sonnet EM | Claude-3.5-Sonnet FM | Gemini-1.5-Pro EM | Gemini-1.5-Pro FM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Our method | full method | 0.782 | 0.959 | 0.772 | 0.943 | 0.795 | 0.957 | 0.807 | 0.958 | 0.805 | 0.960 |
| Our method | w/o field retrieval | 0.778 | 0.950 | 0.763 | 0.939 | 0.786 | 0.952 | 0.803 | 0.954 | 0.800 | 0.957 |
| Our method | w/o few-shot retrieval | 0.725 | 0.903 | 0.693 | 0.881 | 0.747 | 0.907 | 0.778 | 0.928 | 0.769 | 0.916 |
| Our method | w/o prompt template | 0.775 | 0.956 | 0.770 | 0.950 | 0.792 | 0.960 | 0.803 | 0.955 | 0.804 | 0.957 |
| Other methods | versus Full Schema | 0.407 | 0.856 | 0.369 | 0.887 | 0.465 | 0.870 | 0.511 | 0.902 | 0.514 | 0.902 |
| Other methods | versus Manual Examples | 0.412 | 0.599 | 0.411 | 0.560 | 0.301 | 0.472 | 0.481 | 0.723 | 0.483 | 0.730 |
| K | Shodan EM | Shodan FM | ZoomEye EM | ZoomEye FM | Censys EM | Censys FM | FOFA EM | FOFA FM |
|---|---|---|---|---|---|---|---|---|
| K = 1 | 0.189 | 0.932 | 0.744 | 0.849 | 0.640 | 0.919 | 0.680 | 0.837 |
| K = 2 | 0.432 | 0.930 | 0.764 | 0.856 | 0.643 | 0.918 | 0.704 | 0.873 |
| K = 3 | 0.593 | 0.935 | 0.772 | 0.857 | 0.649 | 0.920 | 0.717 | 0.899 |
| K = 4 | 0.633 | 0.932 | 0.769 | 0.850 | 0.646 | 0.922 | 0.725 | 0.899 |
| K = 5 | 0.655 | 0.940 | 0.775 | 0.852 | 0.648 | 0.896 | 0.730 | 0.901 |
| Few-Shot Setting | Kimi EM | Kimi FM | GPT-4o-Mini EM | GPT-4o-Mini FM | GPT-3.5-Turbo EM | GPT-3.5-Turbo FM | Claude-3.5-Sonnet EM | Claude-3.5-Sonnet FM | Gemini-1.5-Pro EM | Gemini-1.5-Pro FM |
|---|---|---|---|---|---|---|---|---|---|---|
| few-shot@1 | 0.623 | 0.882 | 0.633 | 0.892 | 0.533 | 0.879 | 0.656 | 0.922 | 0.656 | 0.922 |
| few-shot@2 | 0.651 | 0.914 | 0.660 | 0.926 | 0.598 | 0.909 | 0.667 | 0.944 | 0.668 | 0.945 |
| few-shot@3 | 0.665 | 0.930 | 0.668 | 0.934 | 0.608 | 0.916 | 0.676 | 0.950 | 0.676 | 0.950 |
| few-shot@4 | 0.668 | 0.935 | 0.674 | 0.940 | 0.613 | 0.930 | 0.676 | 0.953 | 0.677 | 0.952 |
| few-shot@5 | 0.672 | 0.938 | 0.681 | 0.948 | 0.605 | 0.931 | 0.684 | 0.956 | 0.683 | 0.954 |
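The few-shot@K comparison above varies the number of retrieved demonstrations spliced into the prompt. A minimal sketch of that selection step follows; the paper's implementation retrieves examples with dense embeddings and a Faiss index, whereas this sketch substitutes token-overlap similarity, and the example pool is invented:

```python
def similarity(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for dense embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def select_shots(question, pool, k=3):
    """Pick the k (NL question, CSEQL query) pairs most similar to the input."""
    return sorted(pool, key=lambda ex: similarity(question, ex[0]), reverse=True)[:k]

def render_shots(shots):
    """Format selected pairs for the {few_shot} slot of the prompt template."""
    return "\n\n".join(f"Task: {q}\nQuery: {c}" for q, c in shots)

pool = [  # hypothetical example pool
    ("Search for assets with IP 1.1.1.1", 'ip="1.1.1.1"'),
    ("Find assets using port 80", 'port="80"'),
    ("List assets in Germany", 'country="DE"'),
]
shots = select_shots("Find hosts using port 443", pool, k=2)
```

Consistent with the table, returns diminish as K grows: most of the gain comes from the first few demonstrations, after which additional shots mainly consume context budget.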
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Li, Y.; Li, Y.; Shi, F.; Xue, P.; Xu, C.; Hu, L. Retrieval-Augmented Text-to-CSEQL Generation for Cross-Platform Cyberspace Assets Query. Electronics 2025, 14, 3164. https://doi.org/10.3390/electronics14163164