Article

Intelligent Virtual Assistant for Mobile Workers: Towards Hybrid, Frugal and Contextualized Solutions

Karl Alwyn Sop Djonkam, Gaëtan Rey and Jean-Yves Tigli
Université Côte d’Azur, CNRS, I3S, France/Les Algorithmes—bât. Euclide B, 2000 Route des Lucioles, 06900 Sophia Antipolis, France
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9638; https://doi.org/10.3390/app15179638
Submission received: 23 July 2025 / Revised: 26 August 2025 / Accepted: 29 August 2025 / Published: 2 September 2025


Featured Application

The purpose of this study is to propose a methodology for designing an intelligent virtual assistant for mobile workers. This assistant, which can be queried in natural language, is capable of executing commands in the environment while taking into account the user’s specific situation (context).

Abstract

Field workers need fast, relevant access to information to carry out their duties, often in demanding environments. Conventional document search interfaces are ill-suited to these contexts, while fully automated approaches often cannot adapt to the variability of situations. This article explores a hybrid approach based on specialized small language models (SLMs), combining natural language interaction, context awareness (static and dynamic), and structured command generation. The objective of this study is to demonstrate the feasibility of providing contextualized assistance to mobile agents through an intelligent conversational agent while keeping resource consumption reasonable. The case study concerns the supervision of lighting systems on a university campus by technical agents. The static and dynamic contexts are integrated into the user command to generate a prompt that queries a previously fine-tuned SLM. We present the methodology employed, the construction of five datasets for evaluation purposes, and the fine-tuning of the selected SLMs. The findings indicate that small-scale models can understand natural language queries and generate responses that can be used directly by a real system. This work opens prospects for intelligent, resource-efficient, and contextualized assistance in industrial or constrained environments.

1. Introduction

Field workers in sectors such as industrial maintenance, emergency services, and technical interventions often operate in challenging environments and encounter difficulties accessing technical documentation, procedures, and real-time data while on the job. This work does not address the issue of network access to information. The emphasis is instead on effective methods for accessing pertinent information and performing actions in the environment according to the user’s current circumstances.
Searching for information in large volumes of data is inherently complex. Interfaces designed for such searches are not well suited to the emergency interventions that field workers may face in a wide variety of situations, as workers do not have the luxury of a desktop on which to search at their leisure.
Conversely, implementing fully automated approaches appears challenging, given the need to anticipate and account for a wide range of situations. Hence, the present study adopts a hybrid approach, integrating user commands, contextualized information, and advanced reasoning capabilities.

1.1. Large Language Models

In light of recent breakthroughs in artificial intelligence, in particular large language models (LLMs), conversational assistants have been developed to interact naturally with the workforce and support them in their tasks. LLMs have been applied to a wide range of research challenges in various fields. One notable application in this area is the development of question answering systems designed to meet the needs of users in various domains. While these systems have proven effective for answering questions in easily accessible domains, such as general literature, they are less consistent in domains requiring expert-level knowledge, as shown by industrial applications.
As part of this project, the use of large language models (LLMs) is being investigated for its potential to improve the operational efficiency of field workers. To demonstrate the viability of this approach, the following elements are considered.
Firstly, the integration of natural language as a means of interacting with the intelligent virtual assistant is a key strength in terms of efficiency. By enabling users to express themselves intuitively, without having to learn specific commands or syntax, this approach considerably reduces barriers to use. It also reduces the time required to get a grasp of the tool. What is more, it helps to limit formulation errors, particularly in critical situations where the cognitive load is high.
Secondly, it seems justified to consider a static or quasi-static context, i.e., to take into account contextual information that does not change or changes very little. In this scenario, the intelligent virtual assistant must demonstrate a level of expertise that matches, or even complements, that of the user to guarantee relevant interactions. This expertise relies on access to structured information such as descriptions of the tools available (e.g., APIs), internal procedures and resource mapping. Without this shared knowledge base, the agent is likely to generate erroneous or even unusable answers, thereby limiting its day-to-day utility and credibility in the eyes of the user.
Thirdly, considering a dynamic context is essential in the design of an intelligent virtual assistant (as argued in [1,2]). This dynamic context can include frequent and sometimes unpredictable changes, such as modifications to the user’s state or location. To remain relevant and useful, the assistant must therefore be able to access this contextualized information in real time and use it to meet the user’s needs. Without this contextualized awareness, the assistant might give irrelevant or even harmful answers.
Fourthly, the issue of energy efficiency is emerging as a major criterion in the design of systems using AI models. Given growing concerns about the environmental impact of digital technologies and in particular AI models, which are often blamed for their high energy consumption, we will be taking the energy efficiency constraint into account in the development of our system.

1.2. Our Study

The purpose of this preliminary study is to explore methodologies for lighting control in a building complex, such as a university campus. The description of the campus is considered static contextualized data, while the user’s position is considered dynamic contextualized data. The aim is to demonstrate the feasibility of using SLMs (LLMs with 1B to 4B parameters) to respond accurately to natural language commands that take such contextualized data into account.
First, a review of related work on similar technologies and applications is presented. The following section details the methodological framework and the overall model that underpins our approach. We then present the experiments conducted to validate our work, along with the experimental protocol implemented, followed by the results obtained. These findings, their strengths, and their limitations are then discussed before we conclude.

2. Related Work

2.1. Natural Language Processing

The integration of Natural Language Processing (NLP) in chat systems has shifted radically, from rule-based to adaptive, AI-driven models [3,4,5]. However, voice assistants such as Google Assistant, Amazon Alexa, and Apple Siri have not followed the same path. They still rely on rule-based architectures involving pre-documented command templates, which greatly limits their natural language understanding and handling of nuance. They also do not handle multi-step or conditional commands very well. What is more, their cloud-based execution requirements pose problems in terms of latency, reliability, and security [6], all three being particularly unacceptable in an industrial or secure environment.

2.2. Large Language Models

The advent of large language models like GPT-4, PaLM, and Claude has created new avenues for controlling IoT devices using natural language [7]. LLMs share the ability to generalize across tasks, understand free-form instructions from the user, and interact with structured tools [8,9] and APIs through function calls and reasoning-based agents. For instance, Toolformer’s function calls [10], the TPTU [11] API retriever, and the Gorilla [12] information retriever expose external capabilities to the model, allowing dynamic calls to APIs according to the user’s instructions. Likewise, the ReAct [13] framework (Reasoning + Acting) unifies language reasoning and tool utilization to enable models to plan and perform multi-step tasks.
Toolformer, in particular, pretrains models to discover how to utilize APIs by themselves, turning them into general-purpose agents. These developments have produced AI agents that can execute goal-oriented actions using tools and APIs in real time. Yet, integrating these functions into real-world IoT environments, particularly those that involve spatial perception, conditional reasoning, and real-time sensor control, remains in its nascent stage. Few systems today can, for instance, understand a request like “Turn off the light in front of me”, which involves environmental perception and device coordination.
Furthermore, the task of prompt engineering [13,14,15,16] is becoming even more vital in orchestrating these capabilities. Successful prompt design can assist LLMs in producing appropriately structured function calls, navigating unclear queries, and safely and efficiently responding in practical settings. Few-shot prompting, instruction tuning, and chain-of-thought prompting are among the techniques that promote reliability and explainability in the behavior of the agent when the model has to determine which function to invoke and with which parameters based on context.
Recent efforts have addressed the challenge of adapting large language models (LLMs) to resource-constrained settings. A growing body of work focuses on optimizing prompt structures in order to reduce context length and improve efficiency when models run under memory or latency constraints. Surveys have highlighted efficient prompting strategies and prompt compression methods [17,18]. Empirical studies show that compressed prompts (e.g., via hard/soft prompting or learned representations) can maintain or even enhance performance in long-context scenarios [19]. These approaches are particularly relevant for small- and medium-sized models deployed in edge environments.
In parallel, a rich line of research investigates techniques to make LLM inference feasible on commodity devices. Quantization methods such as GPTQ [20] and AWQ [21] achieve aggressive weight compression (down to 3–4 bits) with limited degradation, while parameter-efficient fine-tuning techniques such as QLoRA [22] enable the low-cost adaptation of quantized models. Beyond model compression, system-level approaches like LLM in a Flash [23] or speculative decoding [24] further reduce latency and memory requirements. Recent surveys provide a comprehensive overview of these developments in on-device and edge LLM deployment [25,26].
Research in line with our objectives has been conducted in the field of robotics. In this study [27], the authors use natural language, large language models (LLMs), and visual language models (VLMs) to control a robot. While the results obtained are particularly interesting, it should be noted that this work does not meet our constraints in terms of resource use.
Although previous works have demonstrated the potential of LLMs in the field of tools, there is a conspicuous lack of research on the use of these advancements in distributed, context-aware, and privacy-preserving IoT environments [28]. Our contribution fills this gap by integrating LLM-based agents into real-time IoT environments, using function calls, prompt engineering, and tools that enable intelligent and dynamic interactions in work environments.

3. Methodology

3.1. Approach

The approach proposed in Figure 1 (training) and Figure 2 (runtime) consists of taking contextualized elements into account in addition to a natural language command to query a specialized SLM and obtain a structured response that can be easily used. The implementation of this system was driven by the necessity to address unrestricted user commands intended to query or control an information system related to the tasks of a mobile worker. Two types of contextualized data are employed.
  • Data can be designated as “cold” if it is static or undergoes changes at a very slow rate. This encompasses information pertaining to the task to be executed, the available tools, and user-related data that remains relatively constant (e.g., profile and preferences).
  • Conversely, “hot” data is defined as dynamic information. This includes information from sensor networks describing the user (position, current task, etc.), the physical environment (lighting, temperature, etc.) or the system environment (measurements from a particular sensor, the status of a particular machine, etc.).
There are two main reasons for the decision to treat these two types of contextualized data separately.
  • The primary rationale is related to technical implementation concerns. The mechanisms and dynamics of data access depend on the characteristics of the data. In cases where the data is static and can be easily cached, the mechanisms and dynamics of data access differ from those in cases where the data changes rapidly and its validity could expire if processing took too long.
  • Secondly, the rationale is associated with future architectural evolutions. The objective of this initiative is to expand the conversational agent’s repertoire by equipping it with the capacity to process more intricate queries. Such queries require the retrieval of data that is not included in the initial prompt, thereby necessitating more sophisticated reasoning on the part of the agent. It will then be necessary to differentiate between “cold” data, which can be leveraged by RAG-type techniques, and “hot” data, which requires distinct processing, such as the use of specialized tools.
The user’s command, expressed in natural language, is then utilized to generate a prompt that queries the previously fine-tuned SLM engine.
The response obtained will not be the command to be executed by the system; rather, it will be structured data containing the command and its parameters. Preserving a degree of generality and independence between the SLM and the system is imperative to accommodate minor configuration changes, such as alterations to the command URL or name. This approach eliminates the necessity for subsequent updates to the SLM.
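A minimal sketch (in Python) of the runtime flow described above is given below: the static and dynamic contexts are concatenated with the user command into a single prompt, the fine-tuned SLM is queried, and its structured reply is parsed as JSON. The function names, section headers, and JSON keys beyond "name" and "arguments" are illustrative assumptions, not the actual implementation.

    import json

    def build_prompt(static_context: str, dynamic_context: str, user_command: str) -> str:
        """Assemble the 'cold' context, the 'hot' context, and the natural language command."""
        return (
            "### Static context (campus description and available tools)\n"
            f"{static_context}\n\n"
            "### Dynamic context (user position and state)\n"
            f"{dynamic_context}\n\n"
            "### User command\n"
            f"{user_command}\n"
        )

    def parse_tool_call(slm_output: str) -> dict:
        """The SLM answers with structured data ({"name": ..., "arguments": {...}});
        the caller maps it to the real API, so a change of command URL or name
        does not require retraining the model."""
        call = json.loads(slm_output)
        return {"tool": call["name"], "arguments": call.get("arguments", {})}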
The development of the prompts necessitates a substantial investment of effort; however, the most significant challenge is the refinement of our SLM using a reproducible methodology. To address this challenge, we are compelled to construct a dataset to train our SLM, as no existing dataset meets our requirements.
The dataset is divided into two parts, as is standard practice. One part is dedicated to SLM training, while the other is employed to evaluate the efficacy of the learning process. The dataset comprises the prompts that will be fed to the SLM, along with the anticipated responses.
In this study, it is observed that obtaining a large number of different sentences is often complex. This complexity can be attributed to the challenges associated with accessing future users within this particular context. Indeed, these users may be unavailable, in small numbers, or spread across multiple sites, which complicates access to information or interaction with these users. This dynamic poses a significant challenge in the collection of a substantial number of varied sentences.
First, we collect phrases from field agents on campus. This collection is carried out anonymously; only the phrases themselves are kept. No information about the origin of the phrases or the identity of the authors is collected.
In order to confront this challenge and reach the requisite number of sentences for SLM training, the initial sentences are enriched using large language models (LLMs). Particular care is taken not to use the same LLM engine for sentence enrichment as for the SLM being fine-tuned. This precaution aims to reduce potential biases that could arise when using LLMs/SLMs of the same type.
To enrich our dataset, we first replace the specific command elements, such as the light category, the floor/level, and the building, with placeholders to transform each command into a template. Each template is fed to the LLM, which is asked to reformulate it under different conditions, such as placing the floor/level after the building name. Each reformulation produced by the AI is manually inspected to verify that, once the placeholders are substituted, the template has a correct meaning; incorrect ones are discarded. We repeat the process for each type of command until the number of templates per action is close to 18.
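As an illustration, the sketch below shows the re-instantiation step of this enrichment: a manually validated template is expanded with every combination of concrete values. The placeholder names and the example categories, levels, and buildings are illustrative assumptions; the LLM reformulation call and the manual inspection step are omitted.

    import itertools

    # A manually validated template, with the specific elements replaced by placeholders.
    TEMPLATE = "Allume les lumières de {category} au {level} niveau du bâtiment {building}"

    def expand(template, categories, levels, buildings):
        """Re-instantiate a template with every combination of concrete values."""
        for cat, lvl, bld in itertools.product(categories, levels, buildings):
            yield template.format(category=cat, level=lvl, building=bld)

    examples = list(expand(TEMPLATE,
                           categories=["circulation", "permanente"],
                           levels=["premier", "deuxième"],
                           buildings=["A", "B", "C"]))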

3.2. LLM or SLM

The deployment of large language models (LLMs) in IoT systems is significantly influenced by the resource limitations of IoT devices. Many of these devices are constrained by processing power, memory, and storage, making it impractical to run traditional, computationally demanding LLMs directly at the edge.
Furthermore, real-time applications such as those related to user interactions can be challenging for larger models due to their longer inference times. To overcome these challenges, we opt for SLMs (Small Language Models). However, lightweight models (≤5B parameters) can sometimes struggle with deep language understanding, leading to potentially hazardous or superficial responses—particularly in models with fewer than 3B parameters. In this study, we fine-tune lightweight models while also using a few-shot prompting approach for baseline models.

4. Experiments

To illustrate our research work, we implement our methodology in a real-world setting: the use of natural language to control lighting on the Sophia Antipolis university campus. This system is administered by the technical maintenance staff, whose responsibilities include regular interventions on campus.
In this section, we present the use case, examine the composition of the datasets, and describe our selection of SLMs.

4.1. Use Case Presentation

As previously mentioned, the task selected to illustrate our work concerns the management of lighting in the Sophiatech campus buildings by technical maintenance staff, using unrestricted natural language.
The campus comprises seven buildings, ranging from one to four stories. Each lighting zone belongs to one of several categories (circulation lighting, emergency lighting, and outdoor lighting). Each lighting zone is therefore defined by three pieces of information: its building, its floor, and its category.
Furthermore, each lighting zone is characterized by three distinct states (on, off, or automatic programming), which delineate the actions that can be executed on each zone (i.e., activate, deactivate, or program in automatic mode). The university campus is already equipped with a lighting management system. This system incorporates an API, which can be utilized by a designated application. A comparable application is available, facilitating the management of lighting through a graphical interface on a tablet.
Within the proposed methodology, the objective is to convert requests expressed by technical personnel in natural language into commands that conform to the specifications of this API.
We emphasize that the objective of this study is not to evaluate the ergonomics of the natural language system or to compare it to an existing graphical solution, but rather to demonstrate that natural language can be used in an unrestricted manner to control such systems. Consequently, our SLM must generate an output comprising the command to be invoked together with the arguments specific to that command.

4.2. Dataset Creation

In fact, we created not one but five different datasets as shown in detail below. Each dataset is separated into two parts as is typically done. The first part of each dataset, representing 70% of the examples, is used to train the models. The second part, consisting of the remaining 30%, is used to test these same models. In each dataset, each example is broken down into four parts: the static context, the dynamic context, the user command, and the expected response.
  • The static context consists of a description of the campus (building + floor per building) and a description of the available tools.
  • The dynamic context consists solely of the user’s position (current building and floor).
  • The commands that make up the examples in our datasets are instructions provided in natural language by technical maintenance agents.
  • The expected response is in JSON format with the following structure:
            {
                "name": <tool_name>,
                "arguments": {
                    "arg_1": <arg_1_value>,
                    "arg_2": <arg_2_value>,
                    "arg_n": <arg_n_value>
                }
            }
These commands can contain all the information necessary for their execution. For example, the sentence “Peux-tu mettre en marche l’éclairage de circulation du premier niveau du bâtiment A ?” (Can you turn on the circulation lighting on the first level of building A?) turns on the circulation lighting on the first level of building A.
Another possibility is that the commands do not contain all the information required for their execution. In that case, the SLM has to retrieve the missing information from one of the prompt contexts. The phrases “Éteins les éclairages ici” (Turn off the lights here), “Repasse les lampes en automatisation dans ma zone” (Switch the lights in my area back to automatic), and “Allume” (Turn on) illustrate this scenario.
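As an illustration, the expected response for the first command above could take the following form; the tool name “set_lighting_state” and its argument names are hypothetical placeholders and do not correspond to the actual campus API:

            {
                "name": "set_lighting_state",
                "arguments": {
                    "building": "A",
                    "level": 1,
                    "category": "circulation",
                    "state": "on"
                }
            }

For a contextual command such as “Éteins les éclairages ici” (Turn off the lights here), the building and level values would instead be filled in by the SLM from the dynamic context (the user’s current position).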
It is important to note that our commands and dataset are in French, as our target users are technical agents on campus who use French in their daily work. Another important point in the creation of our datasets stems from the initial collection of commands from technical agents. This collection reveals that a distinction is made between “niveau” (level) and “étage” (floor) in these French sentences. The ground floor of buildings is assigned to level 1, while in other sentences, floor 1 corresponds to the floor above the ground floor. While this distinction is common in French, it is not necessarily the case in all languages and cultures. This is why we feel it is essential to clarify this point in order to ensure a clear understanding of the structure of our datasets. Let us now examine our five light campus control (LCC) datasets. The size of each dataset in terms of the number of examples is indicated in parentheses:
  • LCC-Min (145k): This is the smallest of our datasets. It contains only valid (error-free) and authorized commands referring to specific levels (not floors), with no actions affecting the entire building. It was designed to isolate fundamental instruction–function capabilities under ideal, controlled conditions. A typical command in this dataset is “Allume les lumières de permanente situées au deuxième niveau du bâtiment C” (Turn on permanent lights located on second level of C building) or “Coupe toutes les lumières autour de moi” (Turn off all lights around me) when the user’s context is taken into account.
  • LCC-Building (194k): This is an extension of the LCC-Min dataset in which we have added commands that apply to an entire building. A typical command in this dataset is “Allume les lampes de circulation dans le bâtiment B” (Turn on the circulation lights in B building).
  • LCC-Floor (285k): This is an extension of the LCC-Min dataset in which we have added commands using floor terminology in addition to level terminology. A typical command in this dataset is “Restaure la gestion automatique pour les éclairages à l’étage floor du bâtiment building” (Restore automatic management for lights on floor floor of building building).
  • LCC-Core (334k): This is an extension of the LCC-Min dataset in which we have added commands using floor terminology, in addition to level terminology, as well as commands that apply to an entire building.
  • LCC-Full (393k): This is an extension of the LCC-Core dataset in which we have added commands containing errors for each of the previous categories (approximately 18%). The objective is to train the SLM to indicate that the user’s command cannot be fulfilled. This may be due to the absence of a known building, the non-existence of a specified floor, or other factors.

4.3. SLM Selection

In our experiments, we apply a selection criterion to consider only small-scale LLMs, commonly referred to as SLMs. This approach is part of an effort to minimize the ecological footprint resulting from the use of artificial intelligence. This methodology can be applied to both fine-tuning processes and the use of engines in chatbots. In addition, the adoption of small language models makes future chatbots more accessible to small businesses, which do not have the resources to purchase or rent computing power capable of handling the most complex language models. As a result, the analysis focused on examining models with between 1 and 3 billion parameters.
In addition, an in-depth analysis is conducted to assess the feasibility of implementing an LLM that would offer extensive prompting and fine-tuning capabilities. These two features are directly used in our work. Beyond these two features, we also explore the possibility of using an LLM capable of connecting to a RAG (Retrieval-Augmented Generation) environment and easily using external tools when resolving queries. The integration of these features is planned for the next phase of the work as part of the evolution of the conversational agent. This is why we decided to focus on the 1B and 3B models of Llama 3 in this preliminary study. More specifically, we focus on the Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct models.
As part of our research, we identify other LLMs that should be evaluated at a later stage using the same datasets. This methodical approach will allow for a meaningful comparison of the results obtained. As part of this study, several models are identified to serve as benchmarks: SmolLM 1.7B, Magistral Small 2506, Phi-4 Mini, Gemma 3n E2B, Qwen3 1.7B, DeepSeek-R1-0528, etc.

4.4. SLM Fine-Tuning

We fine-tune each model on the training set of a dataset and evaluate it on the test set of the same dataset. Training is supervised fine-tuning (SFT) using the LoRA (Low-Rank Adaptation) method [29]. We first fine-tune only certain layers of the transformer, but the performance is not satisfactory. We therefore decide to fine-tune all linear layers of the model, which significantly improves performance.

4.5. Hardware and Software Parameters

Model fine-tuning is performed on a VM with 48 GB of VRAM using A100 GPUs, and inference is performed on the same hardware plus 2 × A100 40 GB. Fine-tuning uses the following hyperparameters:
  • For the 1B model, a batch size of 4 and 8 gradient accumulation steps.
  • For the 3B model, a batch size of 2 and 16 gradient accumulation steps.
  • 2 epochs.
  • LoRA rank of 16.
  • LoRA alpha of 64.
  • LoRA dropout of 0.05.
  • No LoRA bias.
  • LoRA target modules: all linear layers.
For all our experiments we use this software stack:
  • datasets == 3.6.0;
  • Transformers == 4.53.3;
  • sentencepiece == 0.2.0;
  • torch == 2.7.0+cu128;
  • mlflow == 2.22.0;
  • pydantic == 2.11.4.
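The listing below is a minimal sketch of the LoRA supervised fine-tuning setup with the hyperparameters listed above (1B configuration). The peft and trl libraries are assumed here in addition to the stack listed, and the inline dataset is a stand-in for the actual LCC-*-training split; the data preparation details are omitted.

    from datasets import Dataset
    from peft import LoraConfig
    from trl import SFTConfig, SFTTrainer

    # Stand-in for the LCC-*-training split: each example concatenates the prompt
    # (static context, dynamic context, user command) and the expected JSON response.
    train_dataset = Dataset.from_list([
        {"text": "<prompt with contexts and command> <expected JSON response>"},
    ])

    lora_config = LoraConfig(
        r=16,                          # LoRA rank
        lora_alpha=64,
        lora_dropout=0.05,
        bias="none",                   # no LoRA bias
        target_modules="all-linear",   # fine-tune all linear layers
        task_type="CAUSAL_LM",
    )

    training_args = SFTConfig(
        per_device_train_batch_size=4,   # 2 for the 3B model
        gradient_accumulation_steps=8,   # 16 for the 3B model
        num_train_epochs=2,
        output_dir="llama-1b-lcc",       # hypothetical output path
    )

    trainer = SFTTrainer(
        model="meta-llama/Llama-3.2-1B-Instruct",
        args=training_args,
        train_dataset=train_dataset,
        peft_config=lora_config,
    )
    trainer.train()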

5. Results

5.1. General Accuracy of SLMs

The initial analysis entails the evaluation of the two selected SLMs for each of the five test datasets. It is noteworthy that each SLM is trained on a subset of the dataset (LCC-*-training), constituting 70% of the total, and is subsequently tested on another subset (LCC-*-test), representing the remaining 30%.
As illustrated in the first row of Table 1, the Llama-1B SLM, trained on the LCC-Min-training dataset, attains an accuracy of 95.51% on the LCC-Min-test dataset. The same Llama-1B SLM, when trained on the LCC-Core-training dataset, attains an accuracy of 88.67% on the LCC-Core-test dataset.
The same observation applies to the second row of Table 1, but with the Llama-3B SLM.
In addition, in order to have a reference for evaluating the effectiveness of the fine-tuning process, we also evaluate the two SLMs using the few-shot prompting method [30] on our test data. Few-shot prompting consists of providing the SLM with a small number of examples in the prompt, thereby guiding it in executing a particular task; in this case, there is no fine-tuning. The final two rows of the table present the results obtained by the two SLMs with this method on the five test datasets.
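For illustration, the sketch below shows how such a few-shot prompt can be assembled: a handful of (command, expected JSON) pairs are placed in the prompt before the new command, and the non-fine-tuned instruct model is asked to continue in the same format. The example pair, the tool name, and the prompt wording are illustrative assumptions.

    few_shot_examples = [
        ("Allume les lumières de circulation au premier niveau du bâtiment A",
         '{"name": "set_lighting_state", "arguments": {"building": "A", "level": 1, '
         '"category": "circulation", "state": "on"}}'),
        # ... a few more (command, expected JSON) pairs
    ]

    def few_shot_prompt(static_context, dynamic_context, command):
        """Build the few-shot baseline prompt from the contexts and the new command."""
        shots = "\n\n".join(f"Commande : {c}\nRéponse : {r}" for c, r in few_shot_examples)
        return (f"{static_context}\n{dynamic_context}\n\n"
                f"Réponds uniquement par un objet JSON.\n\n{shots}\n\n"
                f"Commande : {command}\nRéponse :")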
As part of the analysis presented in Table 1, we examine the accuracy for valid requests only, for all five datasets. In other words, the SLM is considered to respond adequately when its response to a valid user request includes the expected command and parameters in the specified format. Minor syntax errors found in the format are not counted as accuracy errors, as they are easily correctable.
For the first four datasets, the accuracy provided in the table corresponds to the entire test subset. However, for the LCC-Full-test dataset, only valid requests are taken into account; incorrect requests are excluded from the accuracy calculation. A more detailed study of these cases is conducted in the following two sections.
In the rest of this article, we refer to our SLMs fine-tuned on the LCC-Core-training (complete without errors) and LCC-Full-training (complete with errors) datasets as “Llama-1B-Core” and “Llama-1B-Full” (respectively “Llama-3B-Core” and “Llama-3B-Full”).

5.2. Adaptation to Contexts

Table 2 shows the evaluation of context awareness by our fine-tuned SLMs, which allows us to assess their generality. To do this, we create a new dataset (XCampus-LCC-Core) based on the same user instructions as in the previous datasets but with a different static context, i.e., an unknown campus that was not used during SLM fine-tuning.
This campus contains only unknown buildings that have more floors than the buildings in the training datasets. Furthermore, to ensure that the information is extracted correctly from the context, only examples relating to floors not present in the training datasets are used here.

5.3. Error Detection

Continuing on from this third part, we now look at error detection. For this problem, we consider a valid command as the positive class and an erroneous command as the negative class. As shown in Table 3, the first column provides the percentage of valid commands predicted as valid (True Positive), while the second column shows the percentage of valid requests predicted as invalid (False Negative). Columns 3 and 4 show, in a similar way, the percentage of invalid requests detected as invalid (True Negative) and of invalid requests predicted as valid (False Positive).
It should be noted that the results for valid requests are already included in the accuracy results presented in Table 1. In the next section, an in-depth analysis is conducted to assess the quality of the responses provided in response to the requests considered invalid.
Correct detections are therefore present in two columns: True Positive (TP) and True Negative (TN). The percentages are expressed as valid queries and invalid queries, which means that the sum of TP and FN (respectively TN and FP) equals 100%.

5.4. Error Identification

Table 4 shows the evaluation of the ability of our SLM to identify errors in user instructions, report them, and explain their origin. To do this, we introduce approximately 20% erroneous commands into the LCC-Full dataset, including buildings that are not present on campus or floors that are unavailable for certain buildings.
We use the F1-score [31], based on the BERTScore method [32], to evaluate the semantic similarity between the expected error and the SLM’s response.
BERTScore is computed between two sentences, one taken as the reference and the other as the candidate (the sentence whose similarity to the reference we want to measure). First, the contextual embeddings of each sentence are computed using BERT. Pairwise cosine similarities between token embeddings are then computed, and the maximum similarity is extracted for each token. Finally, precision and recall are derived from these similarities and combined into the F1-score.
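The snippet below is a minimal sketch of this scoring step using the reference bert_score package; the example reference and candidate sentences are illustrative, and the embedding model is the package default for French, which may differ from the exact model used in our evaluation.

    from bert_score import score

    references = ["Le bâtiment Z n'existe pas sur le campus."]   # expected error message
    candidates = ["Impossible : le bâtiment Z est inconnu."]     # SLM response
    precision, recall, f1 = score(candidates, references, lang="fr")
    print(f"F1 = {f1.mean().item():.2f}")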

5.5. Adaptation to Contexts with Errors

Table 5 presents the evaluation of context recognition by our fine-tuned SLMs, but this time with erroneous commands. To do this, we create two new datasets based on the same unknown campus as in Section 5.2 but querying either non-existent buildings or non-existent floors of a correct building.

5.6. Frugality of the Solution

To evaluate our solution’s ability to run on an embedded device, and thus validate its efficiency, we query our most comprehensive SLMs (i.e., those fine-tuned on the LCC-Full-Training dataset) with about 2000 commands extracted from the LCC-Full-Test dataset. Of these, 1000 are valid commands and 1000 are invalid commands.
The tests were initially planned to run on an NVIDIA Jetson AGX Xavier, a single-board mini-computer with computing power similar to that of embedded devices currently available or coming soon. Unfortunately, the board was not delivered in time, so we use, as an alternative, a laptop equipped with an NVIDIA RTX 2000 Ada Generation 8 GB card.
Table 6 shows the memory consumption, mean response time of a test, and its standard deviation measured during the tests.
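The sketch below illustrates how such memory and latency figures can be measured with torch; the model path, the single test command, and the generation settings are placeholders, not the exact measurement protocol.

    import time
    import torch
    from transformers import pipeline

    generator = pipeline("text-generation",
                         model="llama-1b-lcc-full",   # hypothetical path to the fine-tuned model
                         device=0, torch_dtype=torch.float16)

    test_commands = ["Allume les lumières de circulation au premier niveau du bâtiment A"]

    latencies = []
    for command in test_commands:
        torch.cuda.synchronize()
        start = time.perf_counter()
        generator(command, max_new_tokens=128)       # in practice, the full contextualized prompt
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)

    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB, "
          f"mean latency: {sum(latencies) / len(latencies):.2f} s")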

6. Discussion

6.1. General Accuracy of SLMs

Analysis of the results shows that our SLMs are unable to perform the requested task unless they are fine-tuned. Table 1 shows the effectiveness of the fine-tuning process: accuracy increases by more than 80 percentage points for the 1B version (from less than 8% without fine-tuning to more than 87% with fine-tuning in each of the tests). The increase is also spectacular for the 3B version, rising from less than 37% (untuned) to over 92% (tuned) in each of the tests. These results confirm the value of this process.
These results suggest that it would be interesting to evaluate slightly larger models (from 7B to 24B) in few-shot prompting, to see whether the improvement observed in few-shot accuracy from the 1B model (below 8%) to the 3B model (around 30%) continues. If such models were able to deliver results similar to those of the smallest fine-tuned model, then we would need to consider whether the fine-tuning process is really useful. Fine-tuning is time-consuming and costly in terms of computing time, and it requires the creation of large datasets, which are not necessary for few-shot prompting (except for performance evaluation). However, 24B models remain difficult to deploy on personal devices, which is a problem for their large-scale use.
Another interesting point to note is that the use of SLM (models 1B and 3B) remains relevant for the task evaluated and achieves very good results, even with small models. However, it should be noted that model 1B lags slightly behind (by about 10%) compared to model 3B.
A current limitation of the study is that it only focuses on Llama models. It would be interesting to see if other models perform better and if the results obtained can be generalized to all other SLMs.

6.2. Adaptation to Contexts

Our study did not aim to fine-tune an SLM for each environment but rather to determine whether it could adapt to changes in context. First, we wanted to evaluate its adaptation to static changes in context, i.e., changes in the description of the campus. We used a dataset (XCampus-LCC-Core) containing only building names that were new to the learning process. We also made sure to only query the SLM about floors higher than those seen during training, in order to verify that it was using contextualized information to construct its responses. The results are very convincing, as they are very similar to those obtained on the training campus datasets. The 1B model achieves approximately 90% correct answers, and the 3B model almost 95%.

6.3. Error Detection

We were also interested in evaluating the detection rate of incorrect commands that a user might issue, such as referring to a non-existent building or floor. We were extremely surprised to find that both SLMs classified these commands perfectly: for each query, both SLMs were able to determine whether the command was valid or not in 100% of cases.

6.4. Error Identification

Regarding the analysis of error identification in our fine-tuned models, we can make two observations. First, error identification is barely accurate, with an average F1-score of around 0.86. Second, the measured standard deviation is very high (around 0.14), indicating high variability in the quality of error identification. We obtain results that are sometimes excellent (almost perfect) and sometimes terrible. As things stand, it will therefore be difficult for us to exploit this mechanism.
Another explanatory hypothesis, which allows us to qualify the disappointing results obtained, is the following: these results could be penalized by the embedding model used. When evaluating a model, it is essential to consider the possibility that the evaluated model produces text of higher quality than the pseudo-references, which may lead to erroneous scores. This problem is inherent in reference-free metrics [33]. In the event that the text generated by the model being evaluated outperforms the pseudo-reference (in this case BERT), a penalty could be applied to the model.
We need to examine the results obtained in more detail in order to improve the feedback provided to users when they make erroneous queries.

6.5. Adaptation to Contexts with Errors

Finally, we wanted to conduct a study on errors in our dataset with the new campus (XCampus-LCC). We obtained somewhat more surprising results in this case. The detection of incorrect commands is still perfect, but the same cannot be said for valid commands.
In fact, between 25 and 37% of the latter are detected as invalid, which reduces the accuracy of the results. In addition, the 3B model performs worse than the 1B model in terms of detection.
Our initial investigations suggest that this problem stems from the fact that many error examples in the LCC-Full dataset correspond to valid cases in the XCampus-LCC test dataset. This has no impact on the errors in XCampus-LCC, but it does affect the detection of valid cases in this dataset. Furthermore, as the learning capabilities of the 3B model are superior to those of the 1B model, the 3B model is more affected and therefore has a lower detection rate for valid commands.
We are currently exploring various methodologies that could prove effective in resolving the issue in question.

6.6. Frugality of the Solution

Initially, we were only able to test the 1B version of our SLM, which already takes up 3 GB of memory. Measurements show that the 3B version takes up 8.7 GB, which is more than the 8 GB available on our test computer. We will therefore need to work on quantization methods to reduce the memory footprint of our SLMs if we want to run them on embedded devices. We will also need to check that this quantization does not have a negative impact on the performance (accuracy) of the models.
In terms of response time, it remains low: less than 2 s on average if the request is valid, and just over 2.2 s if it is not. Although this time corresponds only to the GPU computation time needed by the SLM to evaluate and respond to the request, it is representative of what could be achieved on a computer with 8 GB of VRAM and 13 TFLOPS in FP16 (half-precision floating point), and is therefore close to what future boards, such as the Jetson AGX Xavier series, should provide.
However, in order to improve the efficiency of our SLM, it would be very interesting to quantize it (at least to int8) to save memory and take advantage of the power of existing boards (100 TOPS for the Jetson Orin NX series, for example). This should also reduce the response time. Other methods identified in the state of the art could also be combined with our work to improve frugality.
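As a sketch of this direction, the snippet below loads a model in 8-bit precision through the bitsandbytes integration in transformers; this is an assumed tooling choice (other quantization schemes such as GPTQ or AWQ, discussed in Section 2.2, could be used instead), and the checkpoint path is a placeholder for the LoRA-merged fine-tuned model.

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",   # or the merged fine-tuned checkpoint
        quantization_config=quant_config,
        device_map="auto",
    )
    print(f"approx. footprint: {model.get_memory_footprint() / 1e9:.1f} GB")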

6.7. Limitations and Future Work

However, this preliminary work has many limitations that we are trying to resolve gradually.
For now, we have focused on feasibility to verify that finely tuned SLMs could meet our expectations. However, many other studies are still needed.
As a continuation of this work, we plan to improve the learning process by introducing more variability in the static context (e.g., by increasing variability at the API and/or campus description level) and the dynamic context. As mentioned above, we would like to test other SLMs and look into more complex commands.
Once our conversational agent has reached a sufficient level of capability, we would like to conduct field experiments with real users. We will then be able to address related issues that have been set aside for the time being, such as deployment, security, and other elements necessary for practical implementation.

7. Conclusions

In this work, we explored the issue of natural language interaction for smart building control, focusing on the use of Small Language Models (SLMs) by technical maintenance agents. Our experimental results show that even small models, when adapted through fine-tuning, can produce relevant responses while taking into account static or dynamic context through advanced prompt engineering techniques.
To evaluate these capabilities, we designed a dedicated learning architecture and a rigorous testing methodology, enabling us to generate several datasets specific to the application domain.
As an extension of this work, we plan to study the impact of the choice of language model on performance by expanding our experiments to models other than LLaMA. We also want to analyze the performance of larger language models (LLMs) (with between 7 and 24 billion parameters) used without fine-tuning, in order to assess their ability to effectively replace specialized models. We will also work on quantizing our models to make them even more economical in terms of both memory and computing power.
Finally, we plan to explore Retrieval-Augmented Generation (RAG) approaches and the integration of internal tools. This will enrich the response capabilities of agents, particularly for managing complex queries requiring the joint use of technical document databases and real-time sensor data.

Author Contributions

Conceptualization, G.R.; methodology, G.R. and A.S.D.; software, A.S.D.; validation, G.R., A.S.D. and J.-Y.T.; formal analysis, G.R. and A.S.D.; investigation, G.R. and A.S.D.; data curation, A.S.D.; writing—original draft preparation, G.R. and A.S.D.; writing—review and editing, G.R., A.S.D. and J.-Y.T.; supervision, G.R.; funding acquisition, J.-Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

All datasets and scripts produced during this study are available to the community, provided that projects wishing to use them comply with the research policy of the I3S laboratory at the Université Côte d’Azur. For more details on accessing this information, please contact the authors.

Acknowledgments

We would like to thank the Université Côte d’Azur’s University Institute of Technology (IUT) and, in particular, the IT department for providing us with the IT resources necessary to carry out this project, without which it would not have been possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brézillon, P. Context in problem solving: A survey. Knowl. Eng. Rev. 1999, 14, 47–80. [Google Scholar] [CrossRef]
  2. Zhou, H.; Huang, M.; Zhu, X. Context-aware natural language generation for spoken dialogue systems. In Proceedings of the COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 11–16 December 2016; pp. 2032–2041. [Google Scholar]
  3. Jain, L.; Ananthasayam, R.; Gupta, U.; R, R. Comparison of Rule-based Chat Bots with Different Machine Learning Models. Procedia Comput. Sci. 2025, 259, 788–798. [Google Scholar] [CrossRef]
  4. Casheekar, A.; Lahiri, A.; Rath, K.; Prabhakar, K.S.; Srinivasan, K. A contemporary review on chatbots, AI-powered virtual conversational agents, ChatGPT: Applications, open challenges and future research directions. Comput. Sci. Rev. 2024, 52, 100632. [Google Scholar] [CrossRef]
  5. Smajić, A.; Mušić, D. Application of Natural Language Processing Algorithms for Chatbots. In Proceedings of the 2025 IEEE 17th International Conference on Computer Research and Development (ICCRD), Shangrao, China, 17–19 January 2025; pp. 246–249. [Google Scholar] [CrossRef]
  6. Chung, H.; Iorga, M.; Voas, J.; Lee, S. “Alexa, Can I Trust You?”. Computer 2017, 50, 100–104. [Google Scholar] [CrossRef] [PubMed]
  7. Ray, P.P. A Review on LLMs for IoT Ecosystem: State-of-the-art, Lightweight Models, Use Cases, Key Challenges, Future Directions. TechRxiv 2025. [Google Scholar] [CrossRef]
  8. Shen, Z. LLM with Tools: A Survey. arXiv 2024, arXiv:2409.18807. [Google Scholar]
  9. Qu, C.; Dai, S.; Wei, X.; Cai, H.; Wang, S.; Yin, D.; Xu, J.; Wen, J.-r. Tool learning with large language models: A survey. Front. Comput. Sci. 2025, 19, 198343. [Google Scholar] [CrossRef]
  10. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551. [Google Scholar]
  11. Kong, Y.; Ruan, J.; Chen, Y.; Zhang, B.; Bao, T.; Shiwei, S.; Qing, D.; Hu, X.; Mao, H.; Li, Z.; et al. TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Industry Systems. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, Miami, FL, USA, 12–16 November 2024; pp. 371–385. [Google Scholar]
  12. Patil, S.G.; Zhang, T.; Wang, X.; Gonzalez, J.E. Gorilla: Large language model connected with massive apis. Adv. Neural Inf. Process. Syst. 2024, 37, 126544–126565. [Google Scholar]
  13. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. React: Synergizing reasoning and acting in language models. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  14. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv 2025, arXiv:2402.07927. [Google Scholar]
  15. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  16. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Adv. Neural Inf. Process. Syst. 2023, 36, 11809–11822. [Google Scholar]
  17. Li, W.; Wang, X.; Li, W.; Jin, B. A Survey of Automatic Prompt Engineering: An Optimization Perspective. arXiv 2025, arXiv:2502.11560. [Google Scholar]
  18. Li, Z.; Liu, Y.; Su, Y.; Collier, N. Prompt Compression for Large Language Models: A Survey. arXiv 2024, arXiv:2410.12388. [Google Scholar] [CrossRef]
  19. Zhang, Z.; Li, J.; Lan, Y.; Wang, X.; Wang, H. An Empirical Study on Prompt Compression for Large Language Models. arXiv 2025, arXiv:2505.00019. [Google Scholar]
  20. Frantar, E.; Ashkboos, S.; Hoefler, T.; Alistarh, D. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. arXiv 2023, arXiv:2210.17323. [Google Scholar]
  21. Lin, J.; Tang, J.; Tang, H.; Yang, S.; Xiao, G.; Han, S. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. GetMobile Mob. Comp. Comm. 2025, 28, 12–17. [Google Scholar] [CrossRef]
  22. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115. [Google Scholar]
  23. Alizadeh, K.; Mirzadeh, S.I.; Belenko, D.; Khatamifard, S.; Cho, M.; Del Mundo, C.C.; Rastegari, M.; Farajtabar, M. Llm in a flash: Efficient large language model inference with limited memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, 11–16 August 2024; pp. 12562–12584. [Google Scholar]
  24. Svirschevski, R.; May, A.; Chen, Z.; Chen, B.; Jia, Z.; Ryabinin, M. SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Vancouver, BC, Canada, 2024; Volume 37, pp. 16342–16368. [Google Scholar]
  25. Xu, D.; Zhang, H.; Yang, L.; Liu, R.; Huang, G.; Xu, M.; Liu, X. Fast On-device LLM Inference with NPUs. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Rotterdam, The Netherlands, 30 March–3 April 2025; Volume 1, pp. 445–462. [Google Scholar] [CrossRef]
  26. Zheng, Y.; Chen, Y.; Qian, B.; Shi, X.; Shu, Y.; Chen, J. A Review on Edge Large Language Models: Design, Execution, and Applications. ACM Comput. Surv. 2025, 57, 1–35. [Google Scholar] [CrossRef]
  27. Nwankwo, L.; Rueckert, E. The conversation is the command: Interacting with real-world autonomous robots through natural language. In Proceedings of the Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Boulder, CO, USA, 11–15 March 2024; pp. 808–812. [Google Scholar]
  28. Wang, X.; Wan, Z.; Hekmati, A.; Zong, M.; Alam, S.; Zhang, M.; Krishnamachari, B. The Internet of Things in the Era of Generative AI: Vision and Challenges. IEEE Internet Comput. 2024, 28, 57–64. [Google Scholar] [CrossRef]
  29. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  30. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  31. Van Rijsbergen, C.J. Information Retrieval; Butterworths: Newton, MA, USA, 1979. [Google Scholar]
  32. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  33. Deutsch, D.; Dror, R.; Roth, D. On the limitations of reference-free evaluations of generated text. arXiv 2022, arXiv:2210.12563. [Google Scholar] [CrossRef]
Figure 1. Process architecture and fine-tuning pipeline for the proposed solution.
Figure 2. Runtime architecture for implementing the proposed solution.
Table 1. Results of the accuracy (%) evaluation of the two fine-tuned SLMs on the LCC-*-training datasets, as well as of the same SLMs in a few-shot setting (without fine-tuning). Evaluations are performed on our various LCC-*-test datasets.
Model            | Min   | Building | Floor | Core  | Full
Llama-1B         | 95.51 | 91.46    | 96.19 | 88.67 | 95.21
Llama-3B         | 97.66 | 93.39    | 97.00 | 93.25 | 92.27
Llama-1B-FewShot | 7.64  | 5.62     | 7.59  | 6.06  | 5.93
Llama-3B-FewShot | 33.93 | 4.11     | 29.48 | 30.60 | 36.52
Table 2. Results of the accuracy (%) evaluation of the two fine-tuned SLMs on the LCC-Core-training dataset, tested on a new XCampus-LCC-Core dataset containing an unknown campus.
Model         | XCampus-LCC-Core
Llama-1B-Core | 90.79
Llama-3B-Core | 94.91
Table 3. Results of the evaluation of the ability of SLMs, fine-tuned on the LCC-Full-Training dataset, to detect errors (error detection (%)) in commands from the LCC-Full-Test dataset.
LCC-Full
Model         | TP  | FN | TN  | FP
Llama-1B-Full | 100 | 0  | 100 | 0
Llama-3B-Full | 100 | 0  | 100 | 0
Table 4. Results of the evaluation of the ability of SLMs, fine-tuned on the LCC-Full-Training dataset, to explain errors (F1 score) previously identified in the commands of the LCC-Full-Test dataset.
LCC-Full
Model         | Mean F1-Score | SD F1-Score
Llama-1B-Full | 0.84          | 0.15
Llama-3B-Full | 0.86          | 0.14
Table 5. Results of the evaluation of accuracy (%) and error detection capability (%) of SLMs, fine-tuned on the LCC-Full-Training dataset and tested on new XCampus-LCC-* datasets containing an unknown campus.
Model         | Core Accuracy | TP    | FN    | TN  | FP
Llama-1B-Full | 46.50         | 74.61 | 25.39 | 100 | 0
Llama-3B-Full | 58.65         | 63.24 | 36.76 | 100 | 0
Table 6. Results of memory consumption (in gigabytes) and response time (in seconds) of our fine-tuned SLMs on the LCC-Full-Training dataset, then tested on a sample of 10,000 examples from the LCC-Full-Test dataset.
LCC-Full
Model                 | Memory (GB) | Mean Response Time (s) | SD (s)
Llama-1B-Full (valid) | 3.0         | 1.98                   | 0.25
Llama-1B-Full (error) | 3.0         | 2.21                   | 0.45