Article

Bridging the Semantic Gap in 5G: A Hybrid RAG Framework for Dual-Domain Understanding of O-RAN Standards and srsRAN Implementation

by Yedil Nurakhov 1,2, Nurislam Kassymbek 1, Duman Marlambekov 1,*, Aksultan Mukhanbet 1,2 and Timur Imankulov 1
1 Department of Computer Science, Al-Farabi Kazakh National University, 71 Al-Farabi Ave., Almaty 050040, Kazakhstan
2 DigitAlem LLP, 150/1 Zhandosov St., Auezov District, Almaty 050042, Kazakhstan
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3275; https://doi.org/10.3390/app16073275
Submission received: 19 February 2026 / Revised: 18 March 2026 / Accepted: 25 March 2026 / Published: 28 March 2026
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

The rapid evolution of the Open Radio Access Network (O-RAN) architecture and the exponential growth in specification complexity create significant barriers for researchers translating 5G standards into practical implementations. Existing evaluation frameworks for large language models, such as ORAN-Bench-13K, focus predominantly on the theoretical comprehension of regulatory documents while neglecting the critical aspect of software execution. This disparity results in a profound semantic gap, defined here as the structural and conceptual misalignment between abstract normative requirements and their concrete realization in the source code of open platforms like srsRAN. To bridge this divide and enable advanced cognitive reasoning, this paper presents a Hybrid Retrieval-Augmented Generation (RAG) framework designed to unify two heterogeneous knowledge domains: the O-RAN/3GPP specification corpus and the srsRAN C++ codebase. The proposed architecture leverages a hierarchical Parent–Child Chunking strategy to preserve the structural integrity of complex code and normative protocols. Additionally, it introduces a probabilistic Semantic Query Routing mechanism that dynamically selects the relevant context domain based on query intent. This routing actively mitigates semantic interference—a phenomenon where merging conflicting cross-domain terminology introduces informational noise, which our baseline tests showed degrades response accuracy by 4.7%. Empirical evaluation demonstrates that the hybrid approach successfully overcomes this, achieving an overall accuracy of 76.70% and outperforming the standard RAG baseline of 72.00%. Furthermore, system performance analysis reveals that effective context filtering reduces the average response generation latency to 3.47 s, compared to 3.73 s for traditional RAG methods, rendering the framework highly suitable for real-time telecommunications engineering tasks.

1. Introduction

The telecommunications industry is undergoing a paradigm shift driven by the Open Radio Access Network (O-RAN) concept, which promotes interface openness and architectural disaggregation to foster innovation and flexibility. In this evolving ecosystem, open-source projects, particularly srsRAN, have established themselves as critical platforms for prototyping and researching next-generation 5G networks. However, effective development within this environment imposes a high cognitive load on engineers, requiring a simultaneous understanding of two heterogeneous knowledge domains: the extensive 3GPP and O-RAN specifications—comprising millions of words of normative text—and the low-level details of their implementation in C++.
A fundamental challenge in this domain is the semantic gap, which we formally define as the conceptual and structural misalignment between high-level normative abstractions (e.g., 3GPP specifications) and their low-level realization in software logic. For a developer, determining how a specific standard, such as the “7.2× functional split,” translates into specific classes or configuration files within the srsRAN project is a non-trivial task. Traditional information retrieval methods often fail to bridge this divide, as they lack the contextual awareness required to navigate and correlate both domains simultaneously.

Large Language Models (LLMs) have demonstrated significant potential in addressing telecom-specific challenges. Recent initiatives, such as ORAN-Bench-13K, have advanced the field by standardizing the evaluation of LLM capabilities using a corpus of questions derived from O-RAN specifications. While specialized Retrieval-Augmented Generation (RAG) systems like ORANSight have achieved high accuracy in theoretical queries, existing benchmarks remain predominantly mono-domain. They rigorously assess knowledge of specification text but neglect the critical aspect of software implementation, leaving a gap in tools capable of assisting with practical engineering tasks.

To address this limitation, we introduce a Hybrid RAG framework designed to unify the theoretical basis of O-RAN/3GPP standards with the practical implementation details of the srsRAN codebase. Unlike naive ensemble approaches that mechanically merge retrieval results, our architecture is specifically designed to combat semantic interference. We quantify this interference as the measurable degradation in generative accuracy—observed as a 4.7% performance drop in our standard RAG baselines—caused by the simultaneous injection of conflicting theoretical and code-based contexts.
To mitigate this informational noise, our framework employs a Semantic Query Router that dynamically classifies query intent to activate the most relevant knowledge index, thereby filtering out irrelevant context and enhancing retrieval precision. The main contributions of this work are focused on three core areas, designed to advance the reliability of cognitive interfaces in telecommunications:
  • Cross-Domain RAG Architecture: We propose a novel framework integrating a hierarchical Parent–Child Chunking strategy with dynamic semantic routing. This architecture explicitly bridges the semantic gap by correlating abstract standards with concrete code while filtering out the noise that leads to cross-domain hallucinations.
  • Extended Evaluation Methodology: Building upon the ORAN-Bench-13K foundation, we introduce new query categories—specifically Code-Centric and Cross-Domain—to quantitatively assess implementation competence and complex knowledge synthesis.
  • Empirical Validation of the Routing Strategy: We provide a measurable analysis of context isolation, demonstrating that our probabilistic routing mechanism eliminates semantic interference. This approach surpasses standard RAG baselines by achieving 78.5% accuracy on technical implementation questions and 71.4% on complex cross-domain queries, while simultaneously reducing response latency to 3.47 s.

2. Background and Motivation

The integration of LLMs into the fabric of 5G and the burgeoning 6G ecosystems marks a fundamental transition from purely statistical machine learning paradigms to semantic-aware, proactive network management frameworks. As telecommunications infrastructure shifts toward increasingly heterogeneous environments—characterized by the coexistence of ultra-reliable low-latency communication (uRLLC), massive machine-type communications (mMTC), and enhanced mobile broadband (eMBB)—traditional reactive diagnostic frameworks have reached a ceiling of efficacy [1]. These legacy systems, which rely heavily on hard-coded thresholds and localized Key Performance Indicators (KPIs), often fail to generalize across diverse network slices or adapt to the non-stationary nature of dynamic traffic patterns [1]. The contemporary research landscape is now defined by the pursuit of “AI-native” networks, where LLMs serve as central reasoning engines capable of interpreting technical standards, automating multi-step orchestration, and providing interpretable diagnostics through natural language interfaces [1,2].
The trajectory of network intelligence has historically moved through three distinct phases: rule-based systems, statistical machine learning (ML), and the current generative AI era [3]. A comprehensive comparison of the evolution of these network management paradigms, highlighting their control logic and interpretability, is synthesized in Table 1. Rule-based systems, exemplified by algorithms like Copa for congestion control, relied on handcrafted heuristics that required significant human intervention to design and validate [4]. While the advent of deep learning (DL) and reinforcement learning (RL) shifted the burden from rule engineering to model engineering, it introduced a “one model for one task” bottleneck [4]. In this phase, specialized deep neural networks (DNNs) were meticulously tuned for narrow functions like traffic classification or beamforming, but these models lacked the flexibility to adapt to unseen environments or perform across different layers of the protocol stack [4].
The emergence of LLMs as foundation models offers a potential resolution to this architectural fragmentation [5]. By leveraging emergent abilities such as planning, pattern mining, and cross-domain reasoning, a single foundational model can be adapted to a multitude of networking tasks [1]. This “One Model for All Tasks” philosophy is increasingly reflected in global standards, such as ETSI GR ENI 045, which formally incorporates LLMs as reasoning engines within the Operational Administration Maintenance and Provisioning (OAMP) architecture [6]. This shift signifies a broader trend toward language-centric network management, where the linguistic structure of 3GPP standards becomes as important as the numerical telemetry they govern [1].
Table 1. Comparative Evolution of Network Management Paradigms.
| Feature | Rule-Based Era | Statistical ML Era | Generative/Semantic Era |
|---|---|---|---|
| Control Logic | Handcrafted heuristics [4] | DNNs, RL, and Supervised Learning [4] | LLM-based reasoning engines [1] |
| Adaptability | Static; requires manual updates | Sensitive to data distribution shifts [4] | High; leverages emergent abilities [7] |
| Human Effort | Rule engineering [4] | Model architecture engineering [4] | Prompt/Context engineering [7,8] |
| Interpretability | High (White-box) | Low (Black-box) [1] | High (Natural language explanation) [1] |
| Standardization | Local/Vendor-specific | Partial (O-RAN RIC) | Formal (ETSI GR ENI 045) [6] |
A critical barrier to the efficacy of general-purpose LLMs in telecommunications is the “modality gap”: the inherent difficulty these models face when processing non-textual network features such as time-series telemetry, constellation diagrams, or packet-level metadata [1]. Research has consequently diverged into several specialized architectural streams designed to bridge this gap through multimodal integration and domain-specific instruction tuning.
The NetLLM framework represents a pioneering effort to adapt LLMs for networking by enabling the processing of multimodal data [9]. Recognizing that network optimization often requires reasoning over diverse data types, NetLLM employs multimodal encoders to project features like throughput time-series and network topology graphs into a token-like embedding space [9]. This allows the transformer-based backbone to treat raw telemetry as part of its conversational context. Extensive evaluations in tasks such as viewport prediction for virtual reality and cluster job scheduling (CJS) demonstrate that NetLLM-adapted models not only outperform traditional DRL-based schedulers but also exhibit superior generalization to unseen network conditions [9].
While NetLLM focuses on multimodal input, Mobile-LLaMA addresses the need for domain-specific language and code understanding [10]. Built as an instruction-fine-tuned variant of the LLaMA 2 13B model, Mobile-LLaMA was trained on a proprietary dataset of real-world 5G network logs and packet captures [10]. The model excels in three primary functional areas: packet analysis (pcap processing), IP routing analysis (BGP path management), and performance analysis (KPI evaluation) [10]. In code generation benchmarks specifically tailored for network analysis, Mobile-LLaMA achieved a score of 247 out of 300, significantly surpassing the 209 score of GPT-3.5 [10]. This highlights the necessity of fine-tuning on high-quality, domain-precise instruction-output pairs to refine the model’s technical coherence.
The complexity of 6G design often necessitates advanced mathematical reasoning that exceeds the capabilities of general-purpose models [11]. TelecomGPT addresses this through a robust three-stage training pipeline: continual pre-training on a massive telecom-specific corpus (3GPP, IEEE, and ITU documents), supervised fine-tuning (SFT) for instruction following, and alignment tuning via Direct Preference Optimization (DPO) [11]. This specialized pre-training allows TelecomGPT to significantly outperform state-of-the-art models like GPT-4 and Mistral in the Telecom Math Modeling benchmark, which requires solving equations related to Radio Resource Management (RRM) and signal propagation [11].
Knowledge Retrieval and Technical Standards Interpretation
One of the most profound challenges in the 5G/6G era is the interpretation and application of 3GPP technical specifications [12]. These documents are notoriously difficult to process due to their hierarchical structure, dense formatting, and multi-modal content (text, tables, and diagrams) [12]. RAG has emerged as the primary mechanism for grounding LLM outputs in these authoritative sources [13].
TelcoAI represents a sophisticated advancement in RAG systems, designed specifically for 3GPP documentation [12]. Unlike naive RAG systems that perform flat text chunking, TelcoAI utilizes section-aware chunking to preserve the hierarchical relationships between clauses in the standards [12]. The system acts as an agentic reasoner, using a query planner to decompose complex engineering questions into manageable sub-queries that are then resolved through metadata-guided retrieval [12]. By fusing textual insights with technical diagrams, TelcoAI achieves 87% recall and 92% faithfulness on expert-curated benchmarks, representing a 16% improvement over prior state-of-the-art baselines like Chat3GPP [12].
The risk of hallucinations in technical settings has led to the development of “Verified Telcorag,” which incorporates local verification modules to validate LLM-generated configurations against known protocol rules [14]. This system uses regex-based pattern matching and penalty scoring to ensure that generated answers for srsRAN or O-RAN configurations are not only linguistically coherent but also technically valid within the constraints of the 3GPP Release 16–18 specifications [14]. Such verification layers are critical for the eventual deployment of LLMs in live network controllers, where an error in a single parameter could trigger catastrophic outages.
The deployment of LLMs at the network edge, specifically within 5G Base Stations (gNB) or Integrated Access and Backhaul (IAB) nodes, is constrained by the computational and memory limitations of edge hardware [1,15]. The quadratic time complexity of traditional transformer models makes them ill-suited for the long telemetry sequences inherent to network monitoring [16].
Mamba4Net introduces a cross-architecture distillation framework that transfers networking knowledge from large transformer models to student models built on the Mamba architecture [16]. Mamba uses selective State Space Models (SSMs) to achieve linear time complexity (O(N)), which is far more efficient for processing extensive log streams [16]. In evaluations across viewport prediction and job scheduling, Mamba4Net demonstrated a throughput 3.96 times higher than transformer-based LLMs while occupying only 5.48% of the storage footprint [16]. This efficiency gain is crucial for achieving the sub-millisecond inference times required for real-time proactive optimization [1]. The architectural differences and computational complexities of these network-specific models are further contrasted in Table 2.
Recent trends also explore hybrid architectures, such as Mamba-KAN-Liquid (MKL), which combines Mamba’s temporal modeling with Liquid networks for dynamic adaptation and Kolmogorov–Arnold Networks (KAN) for feature representation [17], particularly as security and privacy challenges become paramount [18]. This hybrid approach has shown detection rates exceeding 95% for UAV cyberattacks (e.g., GPS spoofing, jamming) with an inference latency of only 47.3 ms [17]. To further support these models, research into hardware acceleration—spanning GPUs for short-term deployment to specialized ASICs for long-term “compute moats”—is accelerating the realization of AI-native 6G functions [2].
As network complexity grows, the “One Model” approach is being supplemented by “Agentic Workflows,” where multiple specialized LLMs interact to solve complex diagnostic or configuration chains [1,19]. This is particularly evident in the evolution toward 6G, where agents are deployed in a dual-loop structure spanning edge and terminal devices [20].
In the domain of satellite communications, the SCNOC-Agentic framework utilizes LLMs as a “control nucleus” to coordinate specialized agents for network task planning, fault analysis, and resource optimization [21]. Key components include an Intent Refinement (IR) module that parses high-level operator intents (e.g., “optimize cell coverage over a maritime zone”) into executable parameters, and a Multi-Agent Workflow (MaW) that invokes standard protocols like MCP for function calling [21]. Ablation studies confirm that the IR component is vital for parameter generation accuracy, while the planning module enables the dissection of complex problems into sub-tasks that can be handled by “mini-models” [21].
The MX-AI system represents the first end-to-end agentic system instrumenting a live 5G Open RAN testbed based on srsRAN and OpenAirInterface [22]. By deploying a graph of LLM-powered agents inside the Service Management & Orchestration (SMO) layer, MX-AI provides both natural-language observability and control over the R1/E2 interfaces [22]. On realistic operational queries, the system attained 100% decision action accuracy and a mean answer quality of 4.1/5.0, competing directly with human expert performance [22]. This demonstrates the viability of LLM agents for automating “day-1” and “day-2” operations in Open RAN environments [22].
A primary application of LLM-driven network intelligence is the automation of fault detection and root-cause analysis (RCA) [1]. Traditional methods often struggle with the “black box” nature of cloud-native 5G systems, where faults can propagate across microservices in unpredictable ways [23].
Research utilizing Chaos Mesh to inject faults (e.g., pod failure, network loss, disk I/O failure) into Kubernetes-based 5G core networks has demonstrated that fine-tuned models like GPT-4.1-Nano can significantly improve detection accuracy [24]. By training on raw, heterogeneous logs and events rather than pre-parsed schemas, these models learn to identify the subtle linguistic markers of system degradation [24]. These systems enable a “closed-loop” management cycle where the LLM not only detects the fault but also generates detailed diagnostic reports for operator intervention [24].
The KTeleBERT model takes this a step further by injecting structural domain knowledge from knowledge graphs directly into the pre-training phase [25]. This approach allows the model to reason about causal dependencies between different network products and protocols. Evaluations on root-cause analysis and fault chain tracing tasks show that KTeleBERT’s performance is significantly boosted by its internalized understanding of network architecture and product interdependencies [25].
The application of RAG architectures has recently expanded beyond static document retrieval to encompass dynamic, multi-physical field scenarios, particularly within Vehicle-to-Everything (V2X) communications and 5G edge deployments. In these highly mobile and latency-sensitive environments, network nodes must continuously synthesize heterogeneous knowledge streams, ranging from physical layer sensor data to application-layer traffic protocols. Recent studies demonstrate that coupling multi-domain RAG frameworks at the 5G edge significantly mitigates the latency of context retrieval while maintaining high semantic relevance for dynamic resource allocation tasks [26]. By embedding localized vector databases within Multi-Access Edge Computing (MEC) servers, these architectures enable autonomous vehicles and edge nodes to perform real-time knowledge fusion, thereby bridging the gap between historical predictive models and instantaneous environmental states [27]. As the diversity of ingested knowledge bases increases, the limitations of flat text chunking and static retrieval strategies become pronounced, prompting the adoption of hierarchical chunking and semantic routing mechanisms. Hierarchical chunking addresses the structural complexity of heterogeneous documents by preserving parent–child relationships, ensuring that granular semantic searches do not lose their overarching technical context [28]. Concurrently, semantic routing has emerged as a pivotal technique for heterogeneous knowledge fusion. Instead of relying on a single, monolithic vector space, semantic routers utilize lightweight intent classification models to dynamically direct queries to domain-specific sub-indices [29].
This multi-stage retrieval paradigm prevents semantic interference—where overlapping terminology from disparate domains degrades answer quality—and provides robust theoretical support for designing context-aware frameworks capable of unifying abstract normative standards with concrete software implementations.
This study pays particular attention to verifying the quality of dialogue interaction between a network engineer and an intelligent system. The assessment is based on a comprehensive set of criteria, including the situational relevance of the responses provided, their technical reliability, and factual accuracy in the context of 3GPP standards. In addition, the logical sequence of responses during a multistep dialogue is analyzed, as well as the system’s strict adherence to the specified professional style and operational role of the user. This approach ensures that the cognitive interface acts not only as a reference tool, but also as a full-fledged expert assistant capable of supporting coherent reasoning when making decisions in an Open RAN environment.

3. Materials and Methods

3.1. Advanced Hybrid RAG Framework

The proposed methodology implements an Advanced RAG framework through a multi-layered architecture specifically designed to bridge the semantic gap between high-level telecommunication specifications and low-level code implementation. As illustrated in Figure 1, the system performs several critical functions across five sequential stages: probabilistic query routing, multi-source retrieval, graph-based context enhancement, semantic re-ranking, and response generation. To form a unified and comprehensive knowledge base, the system executes parallel loading and pre-processing of documents from two fundamentally heterogeneous sources. The implementation domain undergoes a recursive scan of the target srsRAN project directory, utilizing specialized loaders to process files matching code patterns, such as C++ source and header files, alongside Markdown documentation. Simultaneously, the normative domain is processed by scanning repositories of O-RAN and 3GPP standards in PDF, HTML, and TXT formats. A critical aspect of this ingestion phase is thorough metadata enrichment.
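The dual-domain ingestion step can be sketched as a recursive scan that tags every matching file with its source domain. This is a minimal illustration: the file-pattern sets and the metadata record layout below are assumptions, not the framework's actual loader configuration.

```python
from pathlib import Path

# Hypothetical pattern sets for the two knowledge domains; the paper names
# C++ sources/headers plus Markdown for code, and PDF/HTML/TXT for standards.
CODE_PATTERNS = {".cpp", ".cc", ".h", ".hpp", ".md"}
SPEC_PATTERNS = {".pdf", ".html", ".txt"}

def ingest_paths(root: str, patterns: set[str], domain: str) -> list[dict]:
    """Recursively scan `root` and tag each matching file with its domain.

    The domain tag is the minimal metadata enrichment needed for the
    downstream router to keep the two indices separate.
    """
    docs = []
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in patterns:
            docs.append({"path": str(path), "domain": domain, "ext": path.suffix})
    return docs
```

In a full pipeline each record would carry richer metadata (section identifiers for specifications, class and namespace names for code) before chunking.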

3.2. Data Ingestion and Adaptive Semantic Chunking

To address the significant challenge of semantic search within complex technical texts, the framework supersedes traditional fixed-size indexing with an Adaptive Semantic Chunking strategy. The document ingestion pipeline implements a deterministic Parent–Child dual-level chunking methodology, as illustrated in Figure 2. To preserve the structural integrity of C++ functions, class definitions, and multi-step normative protocols, parent chunks are strictly configured to a capacity of 2000 characters with a 400-character overlap, providing comprehensive context for generation. Conversely, child chunks are defined at a static size of 400 characters with a 100-character overlap to ensure precise vector-based retrieval and granular semantic ranking. Vector embeddings for these semantically coherent chunks are subsequently generated utilizing the nomic-embed-text model and stored within a persistent Chroma vector database, maintaining rich metadata links via UUID v5 hashing for cross-referencing.
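The deterministic Parent–Child scheme can be sketched as follows. The sizes and overlaps (2000/400 for parents, 400/100 for children) come from the text; the character-window splitter and the `uuid.uuid5` namespace are illustrative assumptions, since the paper does not specify the exact splitter or namespace.

```python
import uuid

NAMESPACE = uuid.NAMESPACE_URL  # illustrative; any fixed namespace keeps IDs deterministic

def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character windows with the given overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def parent_child_chunks(doc_id: str, text: str) -> list[dict]:
    """Emit child records (embedded for retrieval) linked to their parents
    (returned to the generator for full context) via UUID v5 hashing."""
    records = []
    for p_idx, parent in enumerate(split_with_overlap(text, 2000, 400)):
        parent_id = str(uuid.uuid5(NAMESPACE, f"{doc_id}/parent/{p_idx}"))
        for c_idx, child in enumerate(split_with_overlap(parent, 400, 100)):
            child_id = str(uuid.uuid5(NAMESPACE, f"{parent_id}/child/{c_idx}"))
            records.append({"id": child_id, "parent_id": parent_id, "text": child})
    return records
```

Because UUID v5 hashes the same name to the same identifier, re-ingesting an unchanged document reproduces identical cross-reference links, which is what makes the chunking deterministic.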

3.3. Soft Probabilistic Routing and Generation Parameters

The core architectural innovation of the proposed framework is the Soft Probabilistic Semantic Router, which operates as an intelligent gateway to mitigate the effects of semantic interference and context poisoning. Rather than utilizing a hard zero-shot classifier that blindly restricts vector searches to a single domain, the updated architecture employs a Llama 3.2 3B model to perform soft intent classification. The Semantic Query Router is driven by a rigorously designed few-shot prompt structure encompassing a defined System Role, explicit Classification Rules, and exactly 12 domain-specific examples. This structure guides the model to assign continuous probability weights across predefined categories: CODE for C++ implementations, STANDARD for normative specifications, and BOTH for hybrid requirements.
Following the aggregation of context, the final technical response is synthesized by the Llama 3.1 8B generation engine. To ensure deterministic and scientifically reproducible outputs, the generative hyperparameters are strictly defined: the generation temperature is constrained to 0.1, the maximum sequence length is limited to 1024 tokens, nucleus sampling (top-p) is set to 0.9, and the repetition penalty is configured at 1.1.
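The soft routing decision and the fixed generation hyperparameters above can be sketched together. The raw output format of the router model is an assumption (the text states only that it assigns continuous probability weights over CODE, STANDARD, and BOTH), so the sketch simply renormalizes whatever weights the classifier emits.

```python
# Generation hyperparameters as stated in the text.
GEN_PARAMS = {
    "temperature": 0.1,
    "max_tokens": 1024,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
}

def normalize_route(weights: dict[str, float]) -> dict[str, float]:
    """Renormalize possibly unscaled router weights into a probability
    distribution over the three routing categories.

    Falling back to BOTH when the router gives no signal is an assumed
    policy, chosen because the hybrid route is the safest default.
    """
    total = sum(weights.values())
    if total <= 0:
        return {"CODE": 0.0, "STANDARD": 0.0, "BOTH": 1.0}
    return {k: v / total for k, v in weights.items()}
```

The resulting distribution feeds the Weighted Context Blending stage described in Section 3.4, while GEN_PARAMS keeps the final synthesis deterministic across runs.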

3.4. Experimental Setup and Statistical Validation

To ensure objective verification of the system’s theoretical competence in the field of 5G and O-RAN specifications, the ORAN-Bench-13K open dataset was integrated into the experimental research base. This benchmark, developed by Gajjar and Shah [30], is the industry’s first standardized tool for evaluating LLMs based on an extensive corpus of O-RAN Alliance and 3GPP regulatory documents. The use of this externally validated resource avoids subjectivity in the formulation of standard-based questions and ensures a high degree of academic transparency, allowing for a direct comparison of the effectiveness of the proposed hybrid RAG framework with existing baseline models. As part of the current work, ORAN-Bench-13K is used as a fundamental component of a comprehensive test suite, internally designated srsRANBench, where it serves as the basis for evaluating the category of standard-centric queries. Thus, reliance on a benchmark recognized by the community ensures the accuracy of the theoretical domain assessment, while the additional datasets developed by the authors focus on solving the key research task—bridging the semantic gap between the abstract requirements of standards and their concrete software implementation in the srsRAN ecosystem.
To ensure the reproducibility of the experimental findings, the hardware and retrieval configurations were strictly standardized. The inference pipeline was deployed on a dedicated compute node equipped with a single NVIDIA RTX 4090 GPU utilizing 24 GB of VRAM. For the vectorization phase, the system utilized the nomic-embed-text-v1.5 model, configured to project semantic representations into a dense vector space with a fixed dimensionality of 768. During the retrieval stage, the hybrid search mechanism was configured with a static boundary of k = 5 for the top-k parameter. The Semantic Query Router, powered by the Llama 3.2 3B model, operates under a strictly defined few-shot prompt: “You are an intelligent query classifier for a 5G/O-RAN RAG system specializing in srsRAN Project codebase and 3GPP standards. Your task is to categorize incoming questions into THREE categories: 1. CODE: Questions about C++ implementation, srsRAN classes, functions, data structures, algorithms, or code-level details 2. STANDARD: Questions about 3GPP specifications, protocols, network architecture, theoretical concepts, or standards documentation 3. BOTH: Questions that require both implementation details AND theoretical/standard knowledge, or cross-reference questions.” This is followed by twelve domain-specific in-context learning examples.
Furthermore, the Weighted Context Blending (WCB) mechanism is formally defined as a dynamic chunk allocation strategy governed by the router’s maximum classification confidence, denoted as Cmax. This confidence metric is mathematically expressed as:
Cmax = max(P(C|q), P(S|q))
where P(C|q) and P(S|q) represent the routing probabilities assigned by the intent classification model for the codebase and standard specification domains, respectively, given the input query q. If the peak confidence Cmax falls below a predefined strict isolation threshold, the query is classified as highly ambiguous or cross-domain. In such hybrid scenarios, the total retrieval capacity K (where K = 5) is proportionally divided rather than exclusively assigned to a single index. The number of discrete text chunks allocated to the C++ implementation domain, k_C, is calculated using a floor function to ensure an integer value:
k_C = ⌊K · P(C|q) / (P(C|q) + P(S|q))⌋
Consequently, the remaining retrieval budget allocated to the 3GPP/O-RAN specification domain, k_S, is simply defined as the difference between the total capacity and the code-allocated chunks:
k_S = K − k_C
This mathematical formulation guarantees that the generative context is dynamically populated with a proportionally balanced ratio of theoretical and practical knowledge. By directly linking the retrieval allocation to the normalized probability distribution of the router, the system effectively mitigates semantic interference without exceeding the maximum context window constraints required for low-latency generation.
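The allocation rule above translates directly into code. The isolation threshold value used here (0.8) is an assumption; the text states only that such a threshold exists.

```python
import math

def allocate_chunks(p_code: float, p_std: float, K: int = 5,
                    isolation_threshold: float = 0.8) -> tuple[int, int]:
    """Weighted Context Blending: split the retrieval budget K between the
    code index (k_C) and the standards index (k_S)."""
    c_max = max(p_code, p_std)
    if c_max >= isolation_threshold:
        # Confident single-domain query: the whole budget goes to one index.
        return (K, 0) if p_code >= p_std else (0, K)
    # Ambiguous / cross-domain query: proportional split with a floor,
    # mirroring k_C = floor(K * P(C|q) / (P(C|q) + P(S|q))), k_S = K - k_C.
    k_c = math.floor(K * p_code / (p_code + p_std))
    return k_c, K - k_c
```

For example, a router output of (0.6, 0.4) falls below the threshold and yields the blended split (3, 2), while (0.9, 0.1) triggers strict isolation and assigns all five chunks to the code index.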

4. Results

4.1. Quantitative Performance and Overall Accuracy

The quantitative evaluation began by establishing a baseline using the Base LLM (Llama 3) in isolation. As hypothesized, the base model demonstrated limited capability, achieving an overall accuracy of only 45.9%. This poor performance underscores the model’s lack of specific factual knowledge required to correlate abstract O-RAN standards with low-level srsRAN implementation details.
To explicitly clarify the evaluation methodology, the query accuracy metric presented in the comparative analysis is calculated as the strict percentage of correctly resolved queries. For standard-centric questions sourced from the ORAN-Bench-13K multiple-choice format, accuracy represents the exact match rate of the selected options. For open-ended implementation and cross-domain queries, accuracy is determined through a binary thresholding of the LLM-as-a-Judge qualitative scores, where only responses achieving a technical reliability score of 4 or 5 are classified as accurate. The seemingly low baseline accuracy of 45.9% for the isolated Base LLM is statistically consistent and theoretically expected. Without external knowledge retrieval, the parametric memory of a general-purpose model is insufficient to accurately reproduce highly specific, release-dependent 3GPP parameters or proprietary C++ class hierarchies from the srsRAN repository, inevitably forcing the model to generate linguistically coherent but technically hallucinated, and therefore incorrect, responses.
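The binary thresholding of judge scores reduces to a one-line rule; a minimal sketch:

```python
def judged_accuracy(scores: list[int], threshold: int = 4) -> float:
    """Binary thresholding of LLM-as-a-Judge reliability scores (1-5):
    only responses scoring `threshold` or above count as accurate."""
    correct = sum(1 for s in scores if s >= threshold)
    return correct / len(scores)
```

This keeps the open-ended categories comparable with the exact-match accuracy of the multiple-choice questions, since both collapse to a single correct/incorrect rate.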
We subsequently evaluated a naive Ensemble Retrieval strategy, which mechanically merged results from lexical and semantic searches. While this approach improved accuracy to 72.0%, it suffered from “semantic interference” where the simultaneous injection of context from both specifications and code created significant informational noise.
The implementation of the proposed Hybrid RAG architecture with Semantic Query Routing yielded the highest performance. By dynamically classifying user intent and filtering irrelevant domains, the system reached a peak accuracy of 76.7%, outperforming both the Base LLM and the Standard RAG configuration, as detailed in Table 3. This result confirms that in heterogeneous technical domains, a precision-oriented strategy that strictly filters context is superior to recall-oriented strategies that maximize context coverage. Assuming a standard normal distribution of errors across the consolidated test sets, the reported peak accuracy of 76.7% carries an estimated 95% confidence interval of ±2.1%, thereby confirming the statistical robustness of the performance gains over the baseline models.
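The precision-oriented routing behavior contrasted above can be sketched as follows. All names here (`route_query`, `retrieve`, the index callables) are illustrative; the actual router in the framework is a few-shot Llama 3.2 3B classifier rather than a rule-based function:

```python
DOMAINS = ("code", "standard", "cross", "general")

def route_query(query: str, classify) -> str:
    """Classify user intent and fall back to 'general' on unexpected
    labels rather than merging all domains (the naive-ensemble failure)."""
    intent = classify(query)          # e.g. a few-shot LLM classifier
    return intent if intent in DOMAINS else "general"

def retrieve(query, intent, code_index, spec_index, k=5):
    """Dispatch retrieval to a single index; only cross-domain and
    general queries split the budget across both indices."""
    if intent == "code":
        return code_index(query, k)
    if intent == "standard":
        return spec_index(query, k)
    return code_index(query, k // 2) + spec_index(query, k - k // 2)
```

The key design choice is that single-domain intents never touch the other index at all, which is what suppresses the semantic interference observed in the ensemble baseline.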

4.2. Stratified Analysis by Query Category

To provide a comprehensive multidimensional evaluation of the framework and rigorously quantify its knowledge fusion capabilities, the performance analysis was expanded beyond basic accuracy to include Precision, Recall, and F1-score metrics across all four query domains. Based on the evaluation matrix of the extended dataset, the system demonstrated a robust capacity for intent resolution and factual extraction. As shown in Table 4, the Code-Centric category achieved the highest predictive reliability, yielding a Precision of 0.766 and an F1-score of 0.746. This high performance underscores the efficacy of the routing module in isolating implementation-specific details without capturing irrelevant normative text. For Cross-Domain queries, which specifically test the model’s ability to fuse heterogeneous knowledge, the framework maintained a stable F1-score of 0.680 and a Recall of 0.688. This balanced metric confirms that the system effectively synthesizes abstract standards with concrete software realizations without suffering from severe retrieval drops. Similarly, the Standard-Centric and General query categories exhibited F1-scores of 0.671 and 0.701, respectively. The resulting macro-averaged F1-score of 0.700 serves as quantitative proof that the hybrid architecture successfully mitigates the risk of disproportionate failure in any single knowledge domain, ensuring stable generation quality across varying degrees of contextual complexity. To further delineate the applicable scenario boundaries of the framework, performance was analyzed across three progressive tiers of query complexity. The Simple complexity tier encompasses straightforward factual retrieval, primarily represented by general baseline queries, where the system demonstrated stable baseline competency. The Medium complexity tier involves deep single-domain structural comprehension, such as identifying specific variable instantiations within the codebase.
In this tier, represented by the Code-Centric category, the model exhibited its highest predictive reliability, achieving an F1-score of 0.746. The Complex tier necessitates heterogeneous knowledge fusion, requiring the agent to simultaneously retrieve abstract 3GPP requirements and map them to their corresponding C++ software realizations. Despite the high cognitive load of these Cross-Domain scenarios, the framework maintained a robust F1-score of 0.680. This complexity analysis indicates that the operational boundaries of the current model are highly optimized for medium-to-complex engineering tasks requiring deep technical synthesis, whereas isolated, simple theoretical lookups occasionally trigger routing ambiguity.
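The macro-averaged figure can be reproduced directly from the per-category F1-scores reported above; the dictionary layout is illustrative, the values are those of Section 4.2:

```python
def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Per-category F1-scores reported for the Hybrid RAG framework
per_category_f1 = {
    "code-centric": 0.746,
    "cross-domain": 0.680,
    "standard-centric": 0.671,
    "general": 0.701,
}
# Macro averaging weights every category equally, so a collapse in any
# single domain would drag the aggregate down; here it stays at 0.700.
macro_f1 = sum(per_category_f1.values()) / len(per_category_f1)
```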

4.3. Computational Efficiency and Latency Analysis

The integration of the semantic routing layer and the hierarchical indexing strategy introduces a nuanced computational profile compared to the parametric and standard RAG baselines. As detailed in Table 5, the Base LLM establishes the lower bound for the response time with an average generation latency of 0.77 s, which is expected given its reliance solely on internal parametric memory without external retrieval steps. In contrast, both RAG-based architectures exhibit significantly higher latencies due to the complex multi-stage pipeline involving query vectorization, document retrieval, and context-augmented generation. Specifically, the Standard RAG implementation demonstrated an average response time of 3.73 s, while the proposed Hybrid RAG framework achieved a more efficient average of 3.47 s. This counter-intuitive finding—that a more complex architecture with an additional routing step results in lower overall latency—is a key efficiency contribution of this work.

4.4. Qualitative Analysis

To ensure the reliability of the LLM-as-a-Judge framework and address potential evaluation bias, the judge model (Llama 3.1 8B) underwent a strict calibration process. As an alternative to subjective human inter-rater agreement, the evaluation mechanism was deterministically calibrated using a zero-temperature inference setting to eliminate generative variance. Furthermore, the model was constrained by a highly restrictive evaluation rubric specifically tailored to telecommunications engineering. The judge was conditioned with explicit definitions for each tier of the five-point Likert scale, strictly penalizing the generation of hallucinated 3GPP parameters, incorrect srsRAN C++ syntax, or cross-domain semantic interference. This deterministic calibration minimizes arbitrary model bias and ensures that the assigned qualitative scores reflect a consistent, objective assessment of technical reliability across all RAG architectures.
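A minimal sketch of this deterministic judge setup is given below. The rubric wording, `build_judge_prompt`, and `parse_score` are illustrative reconstructions of the described calibration, not the verbatim prompt used in the study; determinism comes from running the judge at temperature 0 with a fixed rubric:

```python
RUBRIC = """Score the answer from 1 to 5 against the retrieved context.
5: factually exact; correct 3GPP parameters and srsRAN C++ syntax.
4: minor omissions; no hallucinated parameters or cross-domain mixing.
3: partially correct; at least one unsupported technical claim.
2: mostly unsupported; wrong parameters or invented APIs.
1: hallucinated or off-topic."""

def build_judge_prompt(question: str, answer: str, context: str) -> str:
    """Assemble the fixed evaluation prompt; the judge model is then
    invoked with temperature 0 so repeated runs yield identical scores."""
    return (f"{RUBRIC}\n\nQuestion: {question}\n"
            f"Retrieved context: {context}\nAnswer: {answer}\n"
            "Reply with a single digit 1-5.")

def parse_score(reply: str) -> int:
    """Extract the first Likert digit from the judge's reply."""
    for ch in reply:
        if ch in "12345":
            return int(ch)
    raise ValueError("no score found in judge reply")
```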
The qualitative performance analysis, illustrated in Figure 3, reveals critical insights into the generative behavior of the evaluated architectures when subjected to an expert review by the Llama 3.1-8B judge model. The Hybrid RAG framework demonstrated superior performance, achieving a mean quality score of 4.83, which indicates a high degree of factual accuracy and contextual relevance in both theoretical and implementation-oriented scenarios. In contrast, the Standard RAG approach exhibited a significant performance regression, scoring only 3.83, which is notably lower than the Base LLM score of 4.17. This decline is primarily attributed to the phenomenon of “context poisoning,” where the non-selective retrieval of C++ code fragments during conceptual queries introduces semantic noise that disrupts the coherence of the generated response. The Hybrid RAG architecture successfully mitigates this issue through its semantic routing mechanism, which filters out irrelevant knowledge domains and ensures that the model operates within a clean and focused context window.

4.5. Weighted Context Blending Strategy

To explore the boundaries of computational efficiency and evaluate the impact of strict domain isolation, we investigated an alternative WCB strategy. Rather than serving as a direct remedy for the accuracy regression in standard-centric queries, this approach was designed to test a “soft” probabilistic model for scenarios where query intent is highly ambiguous and real-time responsiveness is prioritized, as illustrated in Figure 4. The blending mechanism calculates a confidence score C ∈ [0, 1] for the primary domain; if this score falls below a critical threshold (optimized at 0.5 in our experiments), the system retrieves a balanced set of chunks from both the normative and implementation indices. While this balanced retrieval theoretically preserves access to both domains, its primary empirical contribution was a radical reduction in average latency to approximately 1.5 s, achieved by optimizing the context window size and minimizing the token processing load on the Llama 3 generation engine.
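The threshold-based fallback can be sketched as follows; the function name `blend_retrieval` and the domain labels are illustrative, and the 0.5 threshold is the experimentally optimized value from the text:

```python
def blend_retrieval(confidence: float, primary: str, k: int,
                    threshold: float = 0.5):
    """Weighted Context Blending fallback: when router confidence for
    the primary domain drops below the threshold, split the retrieval
    budget evenly between the normative and implementation indices."""
    if confidence >= threshold:
        return {primary: k}                  # confident: single-domain budget
    other = "spec" if primary == "code" else "code"
    return {primary: k // 2, other: k - k // 2}   # balanced dual-domain fallback

plan = blend_retrieval(0.42, "code", 8)      # ambiguous query: even 4/4 split
```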
The most notable outcome of this blending strategy is its demonstration of an extreme latency-accuracy trade-off. The average response latency reached an unprecedented low, ranging between 1542.66 ms and 1565.61 ms, which represents a significant reduction compared to the 3.47 s achieved by the standard Hybrid RAG and the 3.73 s of the Standard RAG implementation. However, this optimization for real-time interaction comes at a severe cost to factual precision. The resulting 38.0% overall accuracy falls below not only the Hybrid RAG’s peak performance of 76.7%, but also the 45.9% baseline established by the Base LLM in overall testing.
These results highlight that while Balanced Blending successfully eliminates the “hard” filtering bottlenecks that led to previous theoretical regressions, the simultaneous injection of dual-domain chunks introduces excessive informational noise that severely degrades generative quality. The persistent 38.0% accuracy, even at a confidence threshold of 1.0, indicates that the current unweighted blending fails to prioritize high-relevance chunks over general context. Consequently, WCB in its current form exposes a critical “efficiency-accuracy gap” in the framework: it is highly suitable for real-time engineering interactions (1.5 s) but lacks the precision required for complex knowledge synthesis. Future iterations must therefore explore Inverse Weighting or Dynamic k-Retrieval to recover the accuracy levels of the naive ensemble while preserving the newfound latency benefits of the blending architecture.

4.6. Error Profiling and Failure Analysis

A detailed analysis of the system’s error profile reveals a significant disparity between routing accuracy and retrieval precision. Empirical data shows that retrieval misses account for 89.5% of total failures, whereas routing errors constitute only 10.5% of the error distribution. The high frequency of retrieval misses in the code-centric domain suggests that the static 400-character child chunking strategy may be too granular to capture the complex structural interdependencies of the srsRAN C++ codebase. Conversely, routing errors are predominantly localized in the cross-domain category, reflecting the inherent semantic ambiguity of queries that necessitate the simultaneous synthesis of abstract standards and concrete code. These findings indicate that the primary bottleneck in the current Hybrid RAG architecture lies in the factual extraction phase rather than intent classification, highlighting a critical need for more adaptive indexing methods.
An extended assessment of system performance on the full dataset is presented in the error matrix for four response categories, illustrated in Figure 5. Unlike the preliminary testing of routing mechanisms, this stage of testing covers a representative sample, where the total number of correct predictions (True Positives) along the diagonal indicates high stability of the architecture. The best performance is demonstrated by the first category, where the system successfully identified 291 cases, with the most pronounced confusion observed between adjacent classes 3 and 4. In particular, 61 queries in the fourth category were misclassified as the third, and 53 queries in the third class were classified as the fourth, which may indicate high semantic density or partial context overlap between these domains. Nevertheless, the overall concentration of values along the main diagonal confirms that the proposed hybrid model maintains accuracy not only at the index selection stage, but also in the final generation of different types of responses.
To gain a deeper understanding of the limitations of the proposed architecture, a detailed analysis of the error profile was conducted, the results of which are shown in Figure 6. The statistics indicate a significant imbalance in the nature of failures: the vast majority of errors (89.5%) are classified as retrieval misses, while semantic routing errors account for only 10.5% of the total distribution. This segmentation indicates that the intelligent agent layer successfully determines user intent but encounters difficulties at the stage of factual information retrieval. The highest number of failures was recorded in the “Code-Centric” category, which confirms the hypothesis about the insufficient effectiveness of the static 400-character chunking strategy for capturing complex structural interdependencies of the srsRAN source code in C++. At the same time, routing errors are mainly localized in the “Cross-Domain” category, which is explained by the natural semantic ambiguity of queries that require the simultaneous synthesis of knowledge from abstract O-RAN standards and their software implementation.
To empirically validate the structural advantages of the proposed indexing strategy, an ablation study was conducted comparing the Parent–Child Chunking method against traditional fixed-size chunking baselines without overlap. The evaluation utilized a comprehensive test set of 15,366 queries, evaluating retrieval performance at a top-k threshold of 10. The experimental results, graphically detailed in Figure 7, provide a clear comparative visualization of Precision@10 and Hit Rate@10 (Recall) across the three tested configurations. As illustrated in the bar chart, the Parent–Child strategy achieves the highest macro hit rate at 56.7 percent, forming the highest peak among the recall metrics. This performance quantitatively surpasses both the 1000-character fixed chunking baseline, which is shown to achieve 55.6 percent, and the 1500-character baseline at 54.4 percent. Furthermore, the visual representation reveals a marginal precision trade-off; the fixed 1000-character strategy exhibits a slightly higher Precision@10 of 0.391 compared to the 0.382 level maintained by both the Parent–Child and fixed 1500-character approaches. However, this minor reduction in precision is strategically acceptable. In cross-domain RAG architectures, maximizing the recall rate is paramount, as the generative model strictly requires the inclusion of the ground-truth context to prevent technical hallucinations. Ultimately, the visual and quantitative evidence confirms that the Parent–Child approach effectively balances granular semantic search through child vectors with the comprehensive generative context provided by parent blocks, thereby proving its architectural superiority for processing complex 3GPP specifications and srsRAN C++ code interdependencies.
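For reference, the two retrieval metrics compared in this ablation can be computed per query as below; the function names are illustrative, and the per-query values are averaged over the test set to obtain the reported macro figures:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Precision@k: fraction of the top-k retrieved chunks that belong
    to the ground-truth relevant set."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def hit_rate_at_k(retrieved, relevant, k=10):
    """Hit Rate@k (recall-style): 1 if any relevant chunk appears in
    the top-k results, else 0; averaged over queries this yields the
    macro hit rate compared across chunking strategies."""
    return int(any(doc in relevant for doc in retrieved[:k]))
```

Because the generator only needs the ground-truth context to appear somewhere in the window, hit rate is the binding metric here, which is why the Parent–Child strategy's recall advantage outweighs its marginal Precision@10 deficit.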

4.7. Semantic Router Model Comparison

To rigorously justify the selection of the Llama 3.2 3B model for the Semantic Query Router, a comparative performance analysis was conducted against traditional intent classification architectures. During the evaluation, the few-shot Llama 3.2 3B configuration was benchmarked against a classic TF-IDF model paired with a Support Vector Machine (SVM) and a zero-shot RoBERTa (MNLI) classifier using a dedicated validation set of 363 complex telecommunication queries. The experimental results (Table 6) demonstrate that traditional lightweight classifiers struggle with the semantic density and specific cross-domain terminology inherent to 3GPP standards and C++ source code. The RoBERTa model exhibited an overall routing accuracy of only 39.4%, while the TF-IDF machine learning baseline achieved just 44.1%, primarily due to their inability to distinguish theoretical concepts from implementation details under heavy lexical overlap. In contrast, the Llama 3.2 3B model, leveraging extensive pre-trained parametric knowledge and in-context learning, achieved a superior routing accuracy of 73.3%. This significant improvement in classification accuracy prevents the occurrence of semantic interference and guarantees that subsequent vector retrieval stages are executed exclusively within the relevant knowledge index, fully justifying the computational overhead of deploying a three-billion-parameter model within the hybrid RAG architecture.
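The few-shot prompting approach used by the router can be sketched as follows. The example queries and the exact prompt wording are invented for illustration (the paper does not publish its prompt); only the four-way label set reflects the described design:

```python
# Hypothetical few-shot examples; the real prompt's exemplars are not published.
FEW_SHOT = [
    ("Which IE carries the 5QI value in the E2SM-KPM indication?", "standard"),
    ("Where does srsRAN instantiate the MAC scheduler object?", "code"),
    ("How does the DU software realize the O-RAN F1 setup procedure?", "cross"),
]

def build_router_prompt(query: str) -> str:
    """Few-shot intent-classification prompt for the LLM router.  The
    model is asked to emit exactly one label so its completion can be
    mapped directly onto a retrieval index."""
    examples = "\n".join(f"Q: {q}\nLabel: {label}" for q, label in FEW_SHOT)
    return (
        "Classify the telecom query as one of: standard, code, cross, general.\n"
        f"{examples}\nQ: {query}\nLabel:"
    )
```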

5. Discussion

In analyzing the specific causes of the accuracy regression observed in standard-centric queries, the error profiling indicates a distinct classification bias within the Semantic Query Router. Ambiguous theoretical queries often feature highly formalized parameter names, protocol identifiers (such as RRC or MAC layer terminology), or pseudo-code fragments extracted directly from 3GPP documentation. Because these technical acronyms are extensively shared across both theoretical standards and practical software components, the routing module occasionally misinterprets rigid structural elements of a specification as concrete C++ syntax, erroneously directing the search to the codebase index. This misclassification is primarily driven by the insufficient coverage of ambiguous edge cases in the static few-shot prompting examples. Specific improvement directions include transitioning from static prompts to dynamic few-shot example retrieval, where the router is conditioned on historically challenging overlapping terminology to better disambiguate abstract theoretical parameters from programmatic implementations. Furthermore, the significant accuracy drop observed under the WCB regime requires explicit acknowledgement. While the probabilistic blending of context domains succeeded in reducing generation latency to optimal levels for real-time interaction, it inadvertently reintroduced severe semantic interference, ultimately degrading generative quality below the performance of the isolated baseline model. This regression highlights a fundamental trade-off between computational efficiency and factual precision. The unweighted allocation of retrieval chunks failed to strictly isolate the correct knowledge domain for hybrid queries, poisoning the generation context with conflicting cross-domain terminology. 
Resolving this limitation will require future iterations to explore dynamic k-retrieval mechanisms or inverse weighting algorithms designed to penalize low-confidence domains, ensuring that latency-focused optimizations do not compromise the strict context boundaries required for telecom engineering accuracy. An in-depth analysis of the retrieval misses, which account for the vast majority of total errors, exposes fundamental limitations in applying static chunking to the specific structure of the srsRAN codebase. While the Parent–Child strategy mitigates overarching context loss, the rigid 400-character child chunk boundary frequently fractures complex object-oriented C++ structures. The srsRAN architecture heavily relies on deep inheritance hierarchies, dispersed header files, and extensive macro definitions, meaning that critical implementation logic often spans multiple non-contiguous chunks. Consequently, the vector search may retrieve a function invocation without its corresponding class definition. To address this, future framework development must implement adaptive, structure-aware chunking optimized specifically for source code. Utilizing Abstract Syntax Tree (AST) parsing would allow the system to dynamically segment context based on logical programming boundaries, such as entire class definitions or standalone functions, thereby preserving the functional integrity of retrieved code blocks and minimizing extraction failures.
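The intent of structure-aware segmentation can be illustrated with a deliberately naive sketch that cuts C++ source only at top-level closing braces, so each chunk holds a complete definition. This brace-counting toy ignores strings, comments, templates, and preprocessor directives; a real implementation would use a proper parser such as libclang or tree-sitter:

```python
def split_cpp_toplevel(source: str):
    """Naive structure-aware splitter: emit a chunk each time a
    top-level brace pair closes, so class and function definitions
    are never fractured mid-body (unlike fixed 400-character cuts)."""
    chunks, depth, start = [], 0, 0
    for i, ch in enumerate(source):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:                       # a top-level block just closed
                chunks.append(source[start:i + 1].strip())
                start = i + 1
    tail = source[start:].strip()
    if tail:
        chunks.append(tail)                      # trailing brace-free declarations
    return chunks

parts = split_cpp_toplevel("struct a { int x; }; void f() { g(); }")
```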
Finally, it is imperative to objectively acknowledge the boundaries of the current study. The empirical validation of the proposed architecture is exclusively constrained to the srsRAN ecosystem and the ORAN-Bench-13K dataset. Its generalizability and retrieval efficacy on structurally distinct open-source platforms, such as OAI, or closed-source proprietary vendor codebases remain unverified. Additionally, the experimental setup relies on high-performance centralized GPU hardware, failing to account for the stringent computational and memory constraints inherent to 5G edge device deployment. The combined memory footprint of the routing and generation modules far exceeds the typical capacity of localized base station nodes. Deploying this framework in real-world edge environments will necessitate hardware-aware optimizations, including extreme weight quantization or cross-architecture distillation, to achieve feasible inference latency on MEC servers. Therefore, while this study provides a robust methodological advancement in bridging the dual-domain semantic gap, claims regarding its immediate applicability as a universal foundation for next-generation telecommunications AI assistants must be moderated, viewing the current framework as a critical transitional step rather than a finalized standalone solution.

6. Conclusions

This paper presented a Hybrid RAG framework designed to bridge the semantic gap between O-RAN/3GPP standards and their practical implementation in the srsRAN open-source ecosystem. By integrating a hierarchical Parent–Child Chunking strategy with a novel Semantic Query Router, the system effectively addresses the critical challenge of “semantic interference” inherent in heterogeneous technical domains. The experimental evaluation confirms that strictly separating knowledge domains based on query intent significantly enhances the system’s ability to assist developers, achieving a peak overall accuracy of 76.7% and outperforming the standard RAG baseline of 72.0%. Specifically, the system demonstrated high competence in technical implementation tasks and complex cross-domain queries requiring deep knowledge synthesis. Contrary to initial expectations regarding computational overhead, the performance analysis revealed that the implementation of semantic routing actually improves response efficiency by filtering irrelevant context. The Hybrid RAG exhibited an average generation latency of 3.47 s, representing a notable improvement over standard RAG approaches. This ability to accurately synthesize abstract requirements with concrete code while maintaining high temporal efficiency demonstrates the proposed architecture’s potential as an early stepping stone toward developing more robust next-generation telecom AI assistants. As part of further research, we plan to introduce self-correction mechanisms that allow the system to automatically reformulate search queries in cases of extraction stage failure.
There are also plans to expand the knowledge base to include the O-CU and O-DU components of the O-RAN architecture to create a comprehensive assistant for telecommunication system developers. Additionally, a “soft routing” approach utilizing probabilistic classifiers will be studied to mitigate the accuracy regression observed in standard-centric theoretical queries, ensuring a more robust performance across all query categories.

Author Contributions

Conceptualization, Y.N. and T.I.; methodology, Y.N.; software, D.M.; validation, A.M. and N.K.; formal analysis, N.K.; investigation, D.M.; resources, T.I.; data curation, A.M.; writing—original draft preparation, Y.N. and N.K.; writing—review and editing, A.M. and T.I.; visualization, N.K.; supervision, T.I.; project administration, Y.N.; funding acquisition, T.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No BR24993211).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available within the article.

Conflicts of Interest

Authors Nurakhov Yedil and Mukhanbet Aksultan were employed by the company DigitAlem LLP. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Majlesara, A.; Majlesi, A.; Mamaghani, A.; Shokrani, A.; Khalaj, B.H. 5G Network Automation Using Local Large Language Models and Retrieval-Augmented Generation. arXiv 2025, arXiv:2511.21084.
2. Zhang, R.; Jiang, H.; Wang, W.; Liu, J. Optimization Methods, Challenges, and Opportunities for Edge Inference: A Comprehensive Survey. Electronics 2025, 14, 1345.
3. Ramamoorthy, L. Evaluating Generative AI: Challenges, Methods, and Future Directions. Int. J. Multidiscip. Res. 2025, 7, 1–7.
4. Zhou, H.; Hu, C.; Yuan, Y.; Cui, Y.; Jin, Y.; Chen, C.; Wu, H.; Yuan, D.; Jiang, L.; Wu, D.; et al. Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities. arXiv 2024, arXiv:2405.10825.
5. Zhao, K.; Yang, Z.; Huang, C.; Chen, X.; Zhang, Z. FedsLLM: Federated Split Learning for Large Language Models Over Communication Networks. In Proceedings of the 2024 International Conference on Ubiquitous Communication (Ucom); IEEE: New York, NY, USA, 2024; pp. 438–443.
6. Experiential Networked Intelligence (ENI). Research on Application Scenarios of Network Large Language Models for Operation, Administration, Maintenance, and Performance; Technical Report GR ENI 045-V4.1.1; ETSI: Sophia Antipolis, France, 2025.
7. Gajjar, P.; Shah, V.K. ORANSight-2.0: Foundational LLMs for O-RAN. IEEE Trans. Mach. Learn. Commun. Netw. 2025, 3, 903–920.
8. Boi, B.; Esposito, C. Prompt Engineering vs. Fine-Tuning for LLM-Based Vulnerability Detection in Solana and Algorand Smart Contracts. arXiv 2025, arXiv:2511.11250.
9. Wu, D.; Wang, X.; Qiao, Y.; Wang, Z.; Jiang, J.; Cui, S.; Wang, F. NetLLM: Adapting Large Language Models for Networking. In Proceedings of the ACM SIGCOMM 2024 Conference; Association for Computing Machinery: New York, NY, USA, 2024; pp. 661–678.
10. Kan, K.B.; Mun, H.; Cao, G.; Lee, Y. Mobile-LLaMA: Instruction Fine-Tuning Open-Source LLM for Network Analysis in 5G Networks. IEEE Netw. 2024, 38, 76–83.
11. Zou, H.; Zhao, Q.; Tian, Y.; Bariah, L.; Bader, F.; Lestable, T.; Debbah, M. TelecomGPT: A Framework to Build Telecom-Specific Large Language Models. arXiv 2024, arXiv:2407.09424.
12. Ghosh, R.; Liu, C.H.; Rele, G.; Ravipati, V.; Aouad, H. TelcoAI: Advancing 3GPP Technical Specification Search through Agentic Multi-Modal Retrieval-Augmented Generation. arXiv 2025, arXiv:2601.16984.
13. Conger, N.; Scollar, N.; Davaslioglu, K.; Sagduyu, Y.E.; Kompella, S. How to Discover Knowledge for FutureG: Contextual RAG and LLM Prompting for O-RAN. arXiv 2025, arXiv:2601.02382.
14. Salajeghe, Y.; Kopaee, A.M.; Najjari, S.; Ahmadi, I.; Dehmolaee, M.A.; Khalaj, B.H. Verified Telcorag: A Verified Local-Rag for Telecommunication Uses. In Proceedings of the 2025 European Conference on Networks and Communications (EuCNC) & 6G Summit; IEEE: New York, NY, USA, 2025; p. 102.
15. Ye, S.; Ouyang, B.; Zeng, L.; Qian, T.; Chu, X.; Tang, J.; Chen, X. Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices. arXiv 2025, arXiv:2504.08242.
16. Xia, L.; Yang, M.; Wang, J.; Yan, Z.; Ren, Y.; Yu, G.; Lei, K. Mamba4Net: Distilled Hybrid Mamba Large Language Models For Networking. arXiv 2025, arXiv:2510.17147.
17. Dinler, O.B. UAV Cybersecurity with Mamba-KAN-Liquid Hybrid Model: Deep Learning-Based Real-Time Anomaly Detection. Drones 2025, 9, 806.
18. Das, B.C.; Amini, M.H.; Wu, Y. Security and Privacy Challenges of Large Language Models: A Survey. arXiv 2024, arXiv:2402.00888.
19. Zhani, M.F.; Korbi, Y.; Mkadem, Y. FlexNGIA 2.0: Redesigning the Internet with Agentic AI—Protocols, Services, and Traffic Engineering Designed, Deployed, and Managed by AI. arXiv 2025, arXiv:2509.02124.
20. Qu, Z.; Wang, W.; Yu, Z.; Sun, B.; Li, Y.; Zhang, X. LLM Enabled Multi-Agent System for 6G Networks: Framework and Method of Dual-Loop Edge-Terminal Collaboration. arXiv 2025, arXiv:2509.04993.
21. Sun, W.; Sun, C.; Zhang, Y.; Yin, Z.; Kang, Z. SCNOC-Agentic: A Network Operation and Control Agentic for Satellite Communication Systems. Electronics 2025, 14, 3320.
22. Chatzistefanidis, I.; Leone, A.; Yaghoubian, A.; Irazabal, M.; Nassim, S.; Bariah, L.; Debbah, M.; Nikaein, N. MX-AI: Agentic Observability and Control Platform for Open and AI-RAN. arXiv 2025, arXiv:2508.09197.
23. Guo, Z.; Zou, J.; Xin, P.; Zhao, X.; Hu, T.; Zhuang, S.; Sun, J.; Liu, Y.; Ma, W. Root Cause Analysis of Power Grid 5G Network Faults Based on Large Language Model. In Proceedings of the 2025 IEEE 28th International Conference on Computer Supported Cooperative Work in Design (CSCWD); IEEE: New York, NY, USA, 2025; p. 629.
24. Hatami, P.; Majlesara, A.; Majlesi, A.; Khalaj, B. Automated Fault Detection in 5G Core Networks Using Large Language Models. arXiv 2025, arXiv:2512.19697.
25. Chen, Z.; Zhang, W.; Huang, Y.; Chen, M.; Geng, Y.; Yu, H.; Bi, Z.; Zhang, Y.; Yao, Z.; Song, W.; et al. Tele-Knowledge Pre-training for Fault Analysis. arXiv 2023, arXiv:2210.11298.
26. Huang, X.; Tang, Y.; Li, J.; Zhang, N.; Shen, X. Toward Effective Retrieval Augmented Generative Services in 6G Networks. IEEE Netw. 2024, 38, 459–467.
27. Ren, R.; Wu, Y.; Zhang, X.; Ren, J.; Shen, Y.; Wang, S.; Tsang, K.F. Retrieval-Augmented Generation for Mobile Edge Computing via Large Language Model. arXiv 2024, arXiv:2412.20820.
28. Lu, W.; Chen, K.; Qiao, R.; Sun, X. HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking. arXiv 2025, arXiv:2509.11552.
29. Ahmad, S.; Nezami, Z.; Hafeez, M.; Zaidi, S.A.R. Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN). arXiv 2025, arXiv:2507.03608.
30. Gajjar, P.; Shah, V.K. ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks. arXiv 2024, arXiv:2407.06245.
Figure 1. The proposed Hybrid RAG Architecture. Blue-shaded components represent the normative standards domain and associated knowledge storage, while yellow-shaded components denote the practical implementation domain (srsRAN codebase). The system utilizes an adaptive chunking strategy for hierarchical indexing and a Soft Probabilistic Semantic Router during inference to dynamically blend relevant knowledge domains (Code vs. Standards).
Figure 2. The Chunking Strategy. Large contextual blocks are retained for broad context, while smaller, semantically defined chunks are vectorized for highly precise retrieval. Black arrows indicate the sequential data ingestion flow (splitting and vectorization), while red arrows represent the inference-time retrieval logic, including similarity matching and parent ID fetching. The green oval denotes the user query input, and the red diamond indicates the final retrieved context provided to the LLM.
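The retrieval path in the Figure 2 caption — match against small child chunks, then fetch the larger parent block by ID — can be sketched in a few lines. Token overlap stands in here for the real embedding-based similarity, and the chunk contents and IDs are invented examples, not actual corpus data.

```python
# Parent blocks: large contextual units (a full clause, a full source file).
PARENTS = {
    "p1": "Full text of an O-RAN clause describing RRC connection setup ...",
    "p2": "Full srsRAN source file implementing the MAC scheduler ...",
}

# Child chunks: small, semantically focused spans, each tagged with the
# ID of the parent block it was split from.
CHILDREN = [
    ("RRC connection setup procedure", "p1"),
    ("MAC scheduler priority queue", "p2"),
]

def similarity(a: str, b: str) -> float:
    """Jaccard token overlap as a stand-in for cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def retrieve(query: str) -> str:
    """Match the query against precise child chunks, then return the
    parent block so the LLM receives the surrounding broad context."""
    _, best_parent = max(CHILDREN, key=lambda c: similarity(query, c[0]))
    return PARENTS[best_parent]
```

The key design point is that vectorization and matching happen at child granularity, while the context handed to the LLM is always the parent unit.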
Figure 3. Results of the qualitative evaluation based on the LLM-as-a-Judge methodology using a 5-point Likert scale.
Figure 4. Performance frontier and domain routing distribution under the Weighted Context Blending regime across varying confidence thresholds.
Figure 5. Semantic Router Confusion Matrix illustrating the distribution of intent classification predictions. The matrix highlights a classification bias in which 14 code-centric queries were misclassified as standard-centric.
Figure 6. Detailed failure analysis illustrating the distribution of error types. The bar chart (left) highlights failure frequency across specific query domains using the categorized legend, while the pie chart (right) illustrates the macro-level ratio of retrieval misses to routing errors.
Figure 7. Macro comparison of retrieval performance across chunking strategies. The bar chart contrasts Precision@10 and Hit Rate@10, demonstrating that the Parent–Child approach achieves the highest Hit Rate@10 (a recall proxy), validating its effectiveness in capturing comprehensive context without significant precision degradation compared to fixed-size baselines.
Table 2. Architecture and Complexity Comparison of Network-Specific Models.
| Model | Architecture | Complexity | Key Metric (Efficiency) | Performance in Task |
|---|---|---|---|---|
| Mamba4Net | Selective SSM | Linear O(N) | 3.96× Throughput [16] | High (Scheduling/ABR) |
| Transformer | Self-Attention | Quadratic O(N²) | Baseline [16] | High (Reasoning) |
| MKL Hybrid | SSM + Liquid | Linear O(N) | 47.3 ms Latency [17] | 95%+ Attack Detection |
| Mobile-LLaMa | Transformer (Fine-tuned) | Quadratic O(N²) | High Accuracy (Packet) | 247/300 Code Gen [10] |
Table 3. Comparative analysis of model accuracy across specialized and general query categories.
| Category | Base LLM | Hybrid RAG (Ours) | Standard RAG |
|---|---|---|---|
| Code-Centric | 46.8% | 78.5% | 72.8% |
| Cross-Domain | 28.6% | 71.4% | 71.4% |
| Standard-Centric | 20.0% | 40.0% | 60.0% |
| General/Mixed | 42.2% | 66.5% | 67.6% |
| Overall | 45.9% | 76.7% | 72.0% |
Table 4. Detailed performance metrics (precision, recall, and F1-score) across specialized and general query categories.
| Category | Precision | Recall | F1-Score |
|---|---|---|---|
| Code-Centric | 0.766 | 0.728 | 0.746 |
| Cross-Domain | 0.672 | 0.688 | 0.680 |
| Standard-Centric | 0.646 | 0.697 | 0.671 |
| General/Mixed | 0.718 | 0.684 | 0.701 |
| Macro Average | 0.701 | 0.699 | 0.700 |
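The macro averages in Table 4 are the unweighted means of the per-category scores, so each query category counts equally regardless of how many test queries it contains. A quick check reproducing the Macro Average row from the per-category values:

```python
# (precision, recall, f1) per query category, taken from Table 4.
per_category = {
    "Code-Centric":     (0.766, 0.728, 0.746),
    "Cross-Domain":     (0.672, 0.688, 0.680),
    "Standard-Centric": (0.646, 0.697, 0.671),
    "General/Mixed":    (0.718, 0.684, 0.701),
}

def macro_average(rows):
    """Unweighted mean over categories for each metric position."""
    n = len(rows)
    return tuple(sum(vals[i] for vals in rows) / n for i in range(3))

precision, recall, f1 = macro_average(list(per_category.values()))
# Rounded to three decimals these give 0.701 / 0.699 / 0.700,
# matching the Macro Average row of Table 4.
```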
Table 5. Relevance scores comparison (1–5 scale).
| Model | Code | Cross-Domain | Mixed | Standard | Average |
|---|---|---|---|---|---|
| Base LLM | 0.77 | 0.88 | 0.78 | 0.70 | 0.77 |
| Standard RAG | 3.74 | 3.18 | 3.65 | 3.41 | 3.73 |
| Hybrid RAG | 3.74 | 3.63 | 3.28 | 3.21 | 3.47 |
Table 6. Router Model Performance Comparison.
| Routing Model | Architecture | Overall Accuracy | Macro-F1 | Average Latency |
|---|---|---|---|---|
| TF-IDF + SVM | Machine Learning Baseline | 44.08% | 48.46% | 0.27 ms |
| RoBERTa NLI | Zero-shot Encoder | 39.94% | 36.95% | 250.38 ms |
| BART NLI | Zero-shot Encoder | 40.50% | 33.10% | 300.26 ms |
| Llama 3.2 3B (Ours) | Few-shot Causal LLM | 73.28% | 58.27% | 211.69 ms |
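The few-shot causal-LLM routing setup compared in Table 6 can be sketched as a prompt builder plus a label parser. This is a hypothetical reconstruction: the example queries, label names, and the `generate` stub (standing in for a real Llama 3.2 3B inference call) are all invented for illustration, not taken from the paper.

```python
LABELS = ("code", "standard", "cross-domain")

# Invented few-shot exemplars pairing queries with routing labels.
FEW_SHOT = [
    ("Where is HARQ retransmission implemented in srsRAN?", "code"),
    ("Which O-RAN clause defines the E2 interface?", "standard"),
    ("Does the srsRAN scheduler satisfy TS 38.214 timing?", "cross-domain"),
]

def build_prompt(query: str) -> str:
    """Assemble a few-shot classification prompt ending in 'Label:'."""
    lines = ["Classify the query into one of: " + ", ".join(LABELS), ""]
    for q, label in FEW_SHOT:
        lines.append(f"Query: {q}\nLabel: {label}\n")
    lines.append(f"Query: {query}\nLabel:")
    return "\n".join(lines)

def generate(prompt: str) -> str:
    """Stub standing in for a causal LLM completion call."""
    return "standard"

def parse_label(completion: str) -> str:
    """Map the raw completion onto a known label, falling back to
    'cross-domain' (the blending case) when nothing matches."""
    text = completion.strip().lower()
    for label in LABELS:
        if text.startswith(label):
            return label
    return "cross-domain"

def route_query(query: str) -> str:
    return parse_label(generate(build_prompt(query)))
```

Parsing the completion defensively matters in practice, since small causal models often append an explanation after the label.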
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nurakhov, Y.; Kassymbek, N.; Marlambekov, D.; Mukhanbet, A.; Imankulov, T. Bridging the Semantic Gap in 5G: A Hybrid RAG Framework for Dual-Domain Understanding of O-RAN Standards and srsRAN Implementation. Appl. Sci. 2026, 16, 3275. https://doi.org/10.3390/app16073275
