2.2. State of the Art
In this section, we review previous research on virtual assistants, object detection, and web scraping within the domains of cooking, fashion, and fitness to identify the key methodologies and insights that inform the development of our system.
First, the study in [17] introduces GptVoiceTasker, an advanced virtual assistant leveraging LLMs to enhance task efficiency and user experience on mobile devices. By employing prompt engineering, GptVoiceTasker interprets user commands with high accuracy, automates repetitive tasks based on historical usage, and streamlines device interactions. This virtual assistant is designed to improve smartphone interaction through voice commands; however, it faces several limitations, such as reliance on prior interactions with applications, difficulties handling complex tasks, interference from dynamic user interfaces, latency issues, and privacy concerns. Looking ahead, the goal is to generalize common tasks, adapt the system to devices like smartwatches and AR/VR platforms, improve handling of unexpected real-time interface elements, and optimize the user experience with features such as live transcriptions, while strengthening data security. Key research questions include how to enhance adaptability to new applications, manage dynamic interfaces, and ensure privacy while minimizing latency.
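For illustration, the general prompt-engineering pattern described above, mapping a transcribed voice command to a structured device action through an LLM, can be sketched as follows. The model name, system prompt, and action schema are our own illustrative assumptions and do not reproduce the GptVoiceTasker implementation.

```python
# Illustrative sketch only: turns a transcribed voice command into a structured
# action via an LLM prompt. Schema and prompt are hypothetical, not from [17].
import json
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY is set

client = OpenAI()

SYSTEM_PROMPT = (
    "You convert smartphone voice commands into JSON actions. "
    'Respond only with JSON of the form {"app": str, "action": str, "args": dict}.'
)

def interpret_command(transcript: str) -> dict:
    """Ask the LLM to map a voice transcript to an executable device action."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Example: interpret_command("send a message to Ana saying I'm running late")
# might yield {"app": "messages", "action": "send", "args": {"to": "Ana", ...}}
```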
Second, the research in [18] introduces BIM-GPT, a prompt-based virtual assistant framework that integrates Building Information Models (BIMs) with generative pre-trained transformers (GPT) to enable efficient natural language-based information retrieval. The BIM-GPT framework presents several limitations, including its evaluation on a single dataset that does not encompass all possible query types, the exclusion of geometric information such as width, length, and height, a latency of approximately five seconds due to multiple API calls, the lack of support for multi-turn conversations, and the risk of generating irrelevant or inaccurate information without carefully designed prompts. To address these limitations, future work suggests expanding datasets to include more diverse information, optimizing performance to reduce latency, developing multi-turn interaction capabilities, refining prompts to improve control over text generation, and exploring the impact of different GPT models on framework performance.
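As a point of reference, the prompt-construction step of such a pipeline, injecting retrieved BIM records into the prompt so a GPT model can answer a natural-language query, can be sketched as below. The record fields and wording are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of the prompt-construction step in a BIM-GPT-style
# pipeline [18]: retrieved BIM records are serialized into the prompt so a
# GPT model can answer a natural-language query. Fields are placeholders.
def build_bim_prompt(question: str, bim_records: list[dict]) -> str:
    """Serialize retrieved BIM records into a grounded retrieval prompt."""
    context = "\n".join(
        f"- {r['element']} | level: {r['level']} | material: {r['material']}"
        for r in bim_records
    )
    return (
        "Answer the question using only the BIM records below. "
        "If the records are insufficient, say so.\n"
        f"Records:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_bim_prompt(
    "Which walls on level 2 are concrete?",
    [{"element": "Wall-204", "level": "2", "material": "concrete"}],
)
# The resulting prompt would then be sent to a GPT model via an API call.
```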
The research presented in [19] introduces a Python-based Voice Assistant designed to perform routine tasks such as checking the weather, streaming music, searching Wikipedia, and opening desktop applications. Leveraging AI technology, this assistant simplifies user interactions by integrating functionalities similar to popular virtual assistants like Alexa, Cortana, Siri, and Google Assistant. However, the current system’s functionality is limited to working exclusively with application-based data, restricting its ability to handle more diverse or real-time external information sources.
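A routine-task assistant of this kind can be approximated with standard Python libraries. The following minimal sketch uses speech_recognition, pyttsx3, and wikipedia; it is our own simplified example, not the system from [19].

```python
# Minimal voice-assistant sketch in the spirit of [19], built on common Python
# libraries. Requires a microphone and internet access for Google recognition.
import webbrowser

import pyttsx3
import speech_recognition as sr
import wikipedia

engine = pyttsx3.init()        # offline text-to-speech
recognizer = sr.Recognizer()

def speak(text: str) -> None:
    engine.say(text)
    engine.runAndWait()

def listen() -> str:
    """Capture one utterance from the microphone and transcribe it."""
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio).lower()

def handle(command: str) -> None:
    if "wikipedia" in command:
        topic = command.replace("wikipedia", "").strip()
        speak(wikipedia.summary(topic, sentences=2))
    elif "play" in command:
        query = command.replace("play", "").strip()
        webbrowser.open("https://www.youtube.com/results?search_query=" + query)
    else:
        speak("Sorry, I did not understand that command.")

if __name__ == "__main__":
    speak("How can I help you?")
    handle(listen())
```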
The study in [20] presents a virtual assistant developed using open-source technologies, integrating natural language processing (NLP) and speech recognition to interpret user requests, while Machine Learning (ML) improves responsiveness over time. Trained on a dataset of queries and responses, and enhanced with various Python libraries, the system supports tasks such as sending messages and emails, making calls, playing YouTube videos, automating Chrome usage, opening desktop applications, and engaging in conversation. Performance evaluation showed a 90% accuracy rate in interpreting and responding to queries, response times of only a few seconds, and high user satisfaction in terms of ease of use, helpfulness, and overall performance. These results demonstrate the assistant’s accuracy, responsiveness, and user-friendliness, highlighting its potential for applications across domains such as healthcare, education, and customer service.
The study in [21] explores how generative AI can enhance Human–Machine Interaction (HMI) by enabling virtual assistants to generate contextually relevant responses and adapt to diverse user inputs. The proposed generative AI virtual assistant is developed using a modular methodology that integrates multiple open-source tools, each selected for its unique strengths: Gradio for creating intuitive user interfaces, Play.ht for natural-sounding text-to-speech, Hugging Face for access to robust pre-trained models, OpenAI for advanced language generation, LangChain for improved contextual understanding, Google Colab for scalable cloud-based model training, and blockchain technology for secure and transparent data handling. The system is trained and refined on datasets from Hugging Face and evaluated using metrics such as accuracy, precision, recall, and F1 score, achieving high contextual understanding and responsiveness. Additional functionalities include web-based interactivity via Gradio components, immersive voice feedback from Play.ht, and iterative performance monitoring for continuous improvement. Limitations identified include scalability challenges with Gradio in complex scenarios, the occasional inability of Play.ht to capture subtle emotional nuances, customization constraints from Hugging Face’s reliance on existing datasets, inherent biases, the computational costs of large language models, and resource restrictions in Google Colab. Future directions involve optimizing scalability, improving emotional expressiveness in speech synthesis, enhancing adaptability to evolving conversational patterns, refining bias mitigation strategies, and expanding applicability across diverse domains while ensuring user privacy and maintaining low latency.
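To illustrate how such open-source components are typically wired together, the sketch below exposes a Hugging Face text-generation pipeline through a Gradio interface; the model choice and interface layout are our assumptions, not the authors' configuration.

```python
# Illustrative Gradio + Hugging Face sketch in the spirit of [21]: a
# pre-trained text-generation model wrapped in a simple web interface.
# The model ("gpt2") is a placeholder, not the one used by the authors.
import gradio as gr
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def respond(prompt: str) -> str:
    """Generate a short continuation for the user's prompt."""
    output = generator(prompt, max_new_tokens=60, num_return_sequences=1)
    return output[0]["generated_text"]

demo = gr.Interface(
    fn=respond,
    inputs=gr.Textbox(label="Your message"),
    outputs=gr.Textbox(label="Assistant reply"),
    title="Generative assistant sketch",
)

if __name__ == "__main__":
    demo.launch()  # serves a local web UI
```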
The study in [22] introduces LLaVA-Med, a vision–language conversational assistant designed to answer open-ended biomedical research questions by combining multimodal AI with domain-specific datasets. The model is trained cost-efficiently using a large-scale biomedical figure-caption dataset from PubMed Central and GPT-4-generated instruction-following data, applying a curriculum learning strategy where it first aligns biomedical vocabulary from captions, then learns open-ended conversational semantics. Fine-tuning is performed on a general-domain vision–language model using eight A100 GPUs in under 15 h, achieving strong results and outperforming prior supervised state-of-the-art on several biomedical visual question-answering benchmarks. LLaVA-Med integrates a self-instruct data curation pipeline leveraging GPT-4 and external knowledge to enhance contextual understanding and chat capabilities. Limitations include domain specificity that hinders generalization beyond biomedical tasks, susceptibility to hallucinations and inherited data biases, dependency on input quality and image resolution (currently 224×224), and reduced reliability for unseen question types. Future directions aim to improve reasoning depth, mitigate hallucinations, enhance resolution and data quality, and expand the methodology to other vertical domains, while ensuring outputs remain reliable, interpretable, and clinically useful.
The research presented in [23] introduces an AI-based food quality-control system that integrates the YOLOv8 object-detection model to identify and locate spoiled food in real time, leveraging high-resolution images and cloud-based visualization tools. The system uses YOLOv8’s advanced recognition capabilities to detect damaged products in complex production flows, providing rapid quality assessments that improve efficiency and reduce product losses. Cloud integration enables dynamic visualization of detection accuracy, model behavior, and potential biases, allowing for continuous monitoring and real-time adjustments to optimize performance. While the approach offers significant benefits, its effectiveness depends on the quality and diversity of training images, the robustness of cloud connectivity, and the model’s ability to handle unseen food types or extreme environmental conditions. Future directions include expanding the dataset to improve generalization, integrating predictive analytics for early spoilage detection, enhancing explainability for quality assurance compliance, and scaling the system for deployment across diverse food production environments.
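For reference, real-time YOLOv8 inference of the kind described above can be run with the ultralytics package as sketched below; the weights file and class names are placeholders for a model trained on spoiled-food images.

```python
# Sketch of YOLOv8 inference as used for food quality control in [23].
# "spoiled_food.pt" is a hypothetical custom-trained weights file.
from ultralytics import YOLO

model = YOLO("spoiled_food.pt")          # custom YOLOv8 weights (placeholder)
results = model.predict(source="conveyor_frame.jpg", conf=0.5)

for box in results[0].boxes:
    cls_name = results[0].names[int(box.cls)]
    confidence = float(box.conf)
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding box in pixel coordinates
    print(f"{cls_name}: {confidence:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```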
The study in [24] introduces BiTrains-YOLOv8, an enhanced food-detection model built on YOLOv8 and optimized for real-time applications through advanced supervision techniques that improve accuracy, robustness, and generalization to unseen food items. The model integrates dual training strategies to enhance detection performance, achieving an average accuracy of 96.7%, precision of 96.4%, recall of 95.7%, and F1 score of 96.04%, outperforming baseline YOLOv8 in speed and resilience, even with partially obscured or hidden food items. Its rapid inference capabilities make it suitable for applications such as dietary assessment, calorie counting, smart kitchen automation, and personalized nutrition.
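These metrics are internally consistent with the standard F1 definition; as a quick check with the reported precision and recall,
\[
F_1 = \frac{2PR}{P + R} = \frac{2 \times 96.4 \times 95.7}{96.4 + 95.7} \approx 96.05\%,
\]
which agrees with the reported 96.04% up to rounding of the quoted precision and recall values.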
The research presented in [25] proposes an improved YOLO-based food packaging safety-seal detection system, addressing inefficiencies of traditional manual inspection through an optimized network structure, targeted training strategies, and anchor frame design tailored for small object detection. Comparative experiments against a CNN-based closure detection system show consistently higher recall and F1 scores for the improved YOLO model under identical datasets and iterations, with performance gaps increasing over time, demonstrating superior accuracy, reliability, and efficiency in identifying subtle packaging defects. The authors outline future enhancements by integrating advanced sensing technologies such as high-resolution cameras and multispectral imaging to increase detection resolution, expand coverage, and enable more comprehensive and refined defect recognition.
Reference [26] proposes an automated billing system for self-service restaurants, tailored for Indian cuisine, that leverages computer vision and the YOLO object-detection algorithm—specifically YOLOv8, with comparative use of YOLOv5—to identify food items from images and extract details such as names and prices. A graphical user interface visually demonstrates the process, enabling waiters or customers to generate itemized bills from a single photograph, thereby reducing manual entry errors and alleviating cashier congestion during peak times. Tested on the IndianFood10 dataset, the system achieved a 91.9% mean average precision, surpassing previous state-of-the-art results. Future work aims to extend capabilities to volumetric detection of food items and fully automate self-service billing through menu integration supported by geolocation technologies.
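The billing step itself amounts to mapping detected class labels onto a menu. The sketch below, with hypothetical detections and prices rather than the system in [26], illustrates this logic.

```python
# Hypothetical sketch of the billing logic described in [26]: detected food
# classes from a YOLO model are matched against a menu to build an itemized
# bill. Menu contents and detection output are illustrative only.
from collections import Counter

MENU = {"idli": 30.0, "dosa": 55.0, "samosa": 20.0}  # item -> price (placeholder)

def itemized_bill(detected_labels: list[str]) -> tuple[list[str], float]:
    """Build bill lines and a total from a list of detected class names."""
    lines, total = [], 0.0
    for item, count in Counter(detected_labels).items():
        price = MENU.get(item)
        if price is None:
            lines.append(f"{item}: not on menu, requires manual entry")
            continue
        lines.append(f"{count} x {item} @ {price:.2f} = {count * price:.2f}")
        total += count * price
    return lines, total

lines, total = itemized_bill(["dosa", "idli", "idli"])
print("\n".join(lines), f"\nTotal: {total:.2f}")
```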
The study in [27] presents a computer vision-based system for clothing recognition and real-time recommendations in e-commerce, leveraging an improved YOLOv7 architecture. To address the challenges of accurate and efficient clothing detection, the authors redesign the YOLOv7 Backbone and integrate the PConv (Partial Convolution) operator to boost inference speed without compromising accuracy. The system also incorporates the CARAFE (Content-Aware ReAssembly of FEatures) upsampling operator to preserve crucial semantic information and the MPDIoU (Minimum Point Distance Intersection over Union) loss function to optimize recognition frame selection. Experimental results show a 9.2% increase in inference speed (95.3 FPS), a 20.5% reduction in GFLOPs, and improved detection accuracy compared to the native YOLOv7, demonstrating the model’s superior performance and effectiveness in clothing recognition for e-commerce and live-streaming applications.
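For context, the MPDIoU criterion penalizes the distances between the corresponding top-left and bottom-right corners of the predicted and ground-truth boxes, normalized by the input image size. In its commonly cited formulation (quoted here for the reader's convenience, as it is not restated in [27]), it is
\[
\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_1^{2}}{w^{2} + h^{2}} - \frac{d_2^{2}}{w^{2} + h^{2}}, \qquad \mathcal{L}_{\mathrm{MPDIoU}} = 1 - \mathrm{MPDIoU},
\]
where $d_1$ and $d_2$ are the distances between the top-left and bottom-right corners of the two boxes, respectively, and $w$ and $h$ denote the width and height of the input image.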
The study in [28] presents FashionAI, a YOLO-based computer-vision system designed to detect clothing items from images and provide real-time recommendations. Leveraging YOLOv5 for object detection, a custom CNN for clothing pattern identification, and ColourThief for dominant color extraction, the system analyzes user-uploaded images to recognize garments such as shirts, pants, and dresses. It then employs web scraping to match these features with similar products on e-commerce platforms like Myntra, enhancing the online shopping experience by offering personalized and visually similar clothing suggestions.
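The dominant-color step can be reproduced with the colorthief Python package, as sketched below; the image path and palette size are illustrative choices rather than the authors' settings.

```python
# Sketch of dominant-color extraction with the colorthief package, analogous
# to the garment color analysis in [28]. The image path is a placeholder.
from colorthief import ColorThief

color_thief = ColorThief("uploaded_outfit.jpg")   # user-uploaded clothing image
dominant_rgb = color_thief.get_color(quality=1)   # most dominant (R, G, B) tuple
palette = color_thief.get_palette(color_count=5)  # top-5 colors for matching

print("Dominant color:", dominant_rgb)
print("Palette:", palette)
# These RGB values could then be mapped to color names and used as search
# filters when scraping e-commerce listings for visually similar garments.
```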
Finally, the research in [29] conducts a comparative experimental evaluation of Selenium and Playwright for automated web scraping, focusing on reliability metrics such as uptime and the Rate of Occurrence of Failures (ROCOF) under controlled 24 h tests on HDD- and SSD-equipped laptops. Both tools executed identical Python scripts, with uptime and ROCOF measured at multiple time intervals to account for hardware and time-of-day effects. Selenium achieved 100% uptime with no downtime events, while Playwright reached 99.72% uptime with four downtime events but demonstrated more stable and predictable execution times. Although both tools recorded one failure per 10-test sequence, Selenium’s faster execution times resulted in a higher failure rate per second (ROCOF), whereas Playwright’s slower but consistent performance proved advantageous for dynamic web application testing. The results also confirmed SSD hardware’s superiority over HDD in speed and stability.
Table 1 presents a summary of the related works.
The review of related works highlights key trends in the development of AI-powered virtual assistants and real-time object-detection systems. A predominant approach in recent research is the integration of OpenAI’s LLM services to enhance the intelligence and responsiveness of virtual assistants. Additionally, Python has emerged as the preferred programming language for backend development, providing flexibility and robust AI and automation capabilities. In the field of object detection, YOLO-based architectures are widely adopted for their efficiency and accuracy in real-time detection of domain-specific items. These insights underscore the prevailing methodologies shaping current advancements and inform the design choices for our proposed system.