1. Introduction
The Hispanic population is the fastest-growing demographic in the United States [1]. Despite this growth, the Hispanic community continues to face systemic neglect in various areas, especially in healthcare. While cutting-edge cancer treatments exist, many Hispanic individuals still receive suboptimal care compared to their non-Hispanic counterparts [2]. For instance, researchers have evaluated the Breast Cancer Risk Assessment Tool (BCRAT), which incorporates the Gail model, to determine the 5-year and lifetime risk of invasive breast cancer among Hispanic women [3]. Significant limitations emerged when the tool failed to account for Hispanic-specific risk factors such as migration history, country of origin, and cultural variations in reproductive behavior [3]. Approximately 5% to 10% of women with breast cancer carry inherited gene mutations, most commonly in the breast cancer genes BRCA1 and BRCA2 (BRCA) [4]. Both Hispanic and African American women are 4–5 times less likely to undergo genetic testing and significantly less likely to discuss it with a healthcare provider [5]. Hispanic and African American women were also found to be 2 and 16 times, respectively, less likely to discuss genetic testing for BRCA1 and BRCA2 compared to Non-Hispanic White counterparts [4]. This leads to a larger proportion of undiagnosed carriers in these groups, potentially worsening outcomes. These persistent disparities contribute to significant gaps in cancer-related morbidity and mortality between Hispanic and Non-Hispanic White individuals [2]. Moreover, relatively few studies of breast cancer in Hispanic women focus on known risk factors, highlighting a critical need for studies that identify population-specific risk factors [6].
1.1. Challenges in Healthcare
Health literacy, defined as the cognitive and social skills needed to access and understand health information, remains a major hurdle for Hispanics [7]. A lack of knowledge and information about cancer, together with culturally based perceptions of the disease, leads to the spread of misinformation within these populations. A group of Spanish-preferring adults in the United States was surveyed to identify accurate and inaccurate beliefs about cancer risk factors; over 40% of respondents held inaccurate beliefs, such as that stress, artificial sweeteners, food additives, and genetically modified foods cause cancer [8]. These misconceptions are compounded by low literacy within the Hispanic community. Many Spanish-language breast cancer resources are written at a significantly lower reading level than equivalent English-language materials [9]. While this may improve readability, it also highlights the unequal quality and depth of information available in Spanish, contributing to a lack of informed decision-making. These inaccurate beliefs reflect broader gaps in public health communication tailored to the Hispanic community.
Hispanic individuals often face compounding challenges related to social determinants of health. Lack of health insurance is one of the most significant barriers, and it is frequently driven by educational disparities [1]. Individuals with lower education levels tend to work in jobs that do not offer comprehensive insurance or that carry high out-of-pocket costs [2]. Systemic discrimination and language barriers also negatively impact employment opportunities, which, in turn, restrict access to health services. Ethnic minorities are more likely to delay seeking medical help, contributing to late diagnoses. This delay stems from a lack of awareness about symptoms, limited access to effective screenings, and cultural misconceptions about breast cancer [10,11]. Within the Hispanic population, mammograms are less frequent, intervals between mammograms are longer, and timely follow-up of suspicious mammograms is less common [12]. Hispanic individuals in the U.S. thus face interconnected challenges in breast cancer care: poor access to information and testing and, most significantly, language barriers. Addressing these disparities requires culturally responsive healthcare, language-accessible resources, and further research into Hispanic-specific risk factors to create more inclusive and equitable cancer prevention and treatment strategies [13].
Language barriers are a major limitation in healthcare access and quality. In the U.S., 51.3% of Latinos report language as one of the most difficult issues they face in medical settings [14]. Most healthcare providers are English monolingual, which exacerbates communication breakdowns [15]. These barriers impact not only patient-provider relationships but also caregiver support, trust-building, and health outcomes [16]. This factor helps explain why Hispanics and Latinos present with more advanced disease and have a poorer prognosis for several cancers relative to their Non-Hispanic White counterparts [17]. As a result, many Hispanics are diagnosed with late-stage cancer, which can deter them from engaging with the U.S. healthcare system altogether [18]. To overcome these communication barriers, many people adopt code-switching, the practice of using two or more linguistic varieties in the same conversation [19,20].
1.2. Code-Switching
Code-switching follows natural, often unconscious mental rules that form part of the speaker’s grammar [21]. It is often a deliberate and functional communication strategy, driven by social, contextual, or referential needs within a conversation [22]. A study in Spain examined the patterns of language choice and code-switching in bilingual medical consultations [23]. In these settings, the top three reasons for code-switching were explanations, inquiries, and emphasis, aligning with the transactional objective of resolving health concerns efficiently and accurately [19]. Furthermore, research indicates that children code-switch not merely to compensate for limitations in one language, but also to maximize their expressive capacity [23], further supporting the notion that code-switching serves complex communicative functions beyond filling linguistic gaps. If patients were given the space to fully express themselves, it could foster greater openness and willingness to seek medical care. Creating this space involves not only improving communication but also recognizing and valuing the linguistic and cultural capacities of Hispanic patients. Acknowledging their ability to navigate between languages and cultural frameworks can strengthen trust, enhance patient-provider relationships, and ultimately lead to more effective and inclusive healthcare experiences.
Despite the prevalence of code-switching among bilingual Hispanic populations, there is a significant lack of research evaluating this linguistic phenomenon, particularly in the context of bilingual health communication. The Miami corpus remains one of the few studies conducted in the United States to test alternative models of code-switching [24]. This corpus consists of 56 audio recordings, yet it is limited in scope. Even among systems developed to support code-switched speech, performance remains a challenge; for instance, one such system reported a 55% relative word error rate and failed to account for speakers who were not fully proficient in either language [25]. However, word error rate may not be the most appropriate metric for evaluating ASRs, as it weighs all words equally regardless of their contextual importance, which could partly explain why the error rate appears so high. To address this limitation, the authors propose a new metric that not only detects language translation errors produced by ASRs but also provides granular feedback on how well the system preserves bilingual structures [25]. This makes it better suited for evaluating ASR performance in code-switching contexts. Moreover, existing datasets often lack the linguistic diversity and cultural specificity required for effective use in bilingual health education. Although efforts have been made to create linguistically and culturally responsive corpora for bilingual breast cancer education [26], current resources remain limited in scale: that corpus contains only 11 hours of annotated content, in contrast to the thousands of hours needed to train large language models (LLMs).
Even though breast cancer awareness and education are critical for early detection and effective treatment, studies have shown that breast cancer survivors often experience ongoing diagnosis- and treatment-related symptoms that negatively impact their health-related quality of life (HRQoL), including fatigue, depressive symptoms, and changes in sleep and sexual function. Hispanic breast cancer survivors are more likely to report poorer HRQoL than their White counterparts, even after adjustment for factors such as socioeconomic status [18,27]. Previous research has shown that delivering cancer-related information, stress management, coping skills, and increased self-efficacy in communication through a culturally appropriate intervention can improve quality of life, particularly health outcomes in the post-treatment survivorship phase [28,29,30]. Therefore, the purpose of this work is to provide a centralized, bilingual digital platform (referred to as ConoCancer) that delivers trusted, accessible, and culturally relevant information about breast cancer to English- and Spanish-speaking communities. Despite progress in health communication, two important research gaps remain unaddressed. First, few studies have systematically examined how misinformation about breast cancer circulates within low-literacy Hispanic communities and how it influences health decision-making. Second, there is a lack of bilingual, culturally responsive platforms that provide consistent, equivalent, and trustworthy information across both English and Spanish. Existing resources are often fragmented, uneven in quality, or linguistically mismatched, which limits their effectiveness for this population. Addressing these gaps requires innovative approaches that integrate language technologies and LLMs to ensure accuracy, accessibility, and equity in breast cancer education. By doing so, we address a crucial need for health literacy, early detection awareness, and resource accessibility, ultimately contributing to better health outcomes for those affected by breast cancer. By creating an accessible, user-friendly experience, ConoCancer seeks to reduce misinformation and dispel common myths about breast cancer. Additionally, user feedback is integrated to continuously improve the platform’s effectiveness in addressing the needs of this community.
The goal of this work is to design and implement a bilingual, web-based prototype platform that integrates LLMs and speech technologies to address language and literacy barriers in breast cancer education. We hypothesize that such a platform, through features including code-switching ASR, LLM-powered question generation, and a bilingual chatbot, can provide a scientifically grounded, scalable framework that will improve accessibility and comprehension for underserved Hispanic populations. While this paper focuses on the design and development phase, systematic evaluation with defined metrics is planned as the next stage of research.
2. Multilingual LLMs & Intelligent Tutoring Systems
LLMs are transforming the way we engage with language, knowledge, and learning. As one of the most powerful advancements in artificial intelligence (AI), LLMs are now widely used across various sectors, with education being one of the most impactful areas of application. Their natural language understanding and generation capabilities allow for adaptive, personalized, and conversational learning experiences [31]. In particular, the integration of LLMs into education has the potential to reduce barriers and promote greater equity across diverse learning environments [31]. AI-driven intelligent tutoring systems (ITS) powered by LLMs simulate one-on-one instruction by leveraging real-time data to assess student progress, identify knowledge gaps, and provide tailored feedback and learning recommendations [32]. These systems can dynamically adjust the difficulty of tasks based on individual student needs, encouraging growth while maintaining engagement [32]. LLMs can also be customized to identify patterns and gauge a student’s understanding, generating improved question-asking strategies tailored to that individual. Their ability to support learners in diverse environments underscores their potential as powerful tools for intelligent, equitable instruction [31].
Language plays a central role in communication and education. As LLMs continue to evolve, they enhance our ability to simulate human-like reasoning, interpret abstract concepts, and deliver semantically rich educational content. Their use in formal and computational linguistics supports not just content delivery but also the exploration of meaning, structure, and expression in language-based disciplines [33]. Through multicultural training and diverse datasets, LLMs can help create inclusive learning environments that respect cultural norms and accommodate various learning styles [31]. This is especially important in multilingual settings, where learners frequently engage in code-switching. Despite this, most LLMs are not explicitly trained to handle mixed-language input, limiting their effectiveness in multilingual contexts [34]. The English-centric bias embedded in many existing LLMs presents further challenges. These models often perform better in English than in lower-resource languages, leading to disparities in accuracy, relevance, and cultural alignment [35]. Such limitations risk reinforcing existing educational inequities. To ensure fair access to ITSs, it is essential to evaluate and adapt LLMs for specific languages, regional curricula, and local contexts [35].
A study on low-literacy Hispanic breast cancer survivors illustrates the intersection of language, technology, and access. While internet searching is a potentially empowering tool for seeking health information, participants in the study preferred using ASR rather than traditional typing. Many users submitted overly broad queries in the hope of retrieving more relevant information. Due to limited literacy, they often relied on snippets in search result lists rather than reading full articles. Although Spanish-language articles were written at a lower reading level, about 4.18 grades lower than their English counterparts, comprehension remained a challenge due to complex vocabulary and sentence structures [36]. These findings highlight the importance of designing LLM-based tools that can effectively serve low-literacy and bilingual populations. Improved voice recognition, simplified language, and culturally appropriate content are critical for making digital tools more accessible and usable. Another study explored the use of an LLM tutor in an accelerated introductory computer science course for non-native English speakers; the majority of student questions were posed in their native language, with code-switching to key English terms [35].
Formally evaluating ITSs with both rigor and detail is a complex problem that requires a nuanced solution. Among the proposed approaches, summative evaluation aligns best with our goal, as it is conducted with a completed system and used to make formal claims about its effectiveness [37,38]. Formal methods involving expert review and certified personnel can provide rigorous evaluations of systems. However, their limitation is that they may not fully reflect the experiences of target users, who may have lower literacy or educational backgrounds; in such cases, the system could be deemed ineffective despite meeting expert standards. To address this, pilot testing alongside summative evaluation is recommended to determine whether systems are used as anticipated and to ensure that formal claims are not undermined by unexpected learner outcomes [38]. Recent systematic reviews confirm that ITSs in education are most often evaluated through pre-test and post-test experimental designs in educational settings. Notably, when tested with adults with low literacy, AutoTutor helped participants improve their reading comprehension abilities, and the studies suggest that future investigations and clustering strategies might be employed to improve ITS adaptivity [39]. Informal techniques such as pre-tests and post-tests, oral questioning, and pilot testing also align with Bloom’s taxonomy of cognitive educational objectives [38,40]. Bloom identified six major levels of objectives: knowledge, comprehension, application, analysis, synthesis, and evaluation, capturing the progression of learning complexity. These levels provide a framework for connecting informal assessments to observable, performance-based indicators of student achievement. Evaluation techniques that emphasize the educational effects of ITSs on learners are thus particularly appropriate for summative evaluation, since the primary purpose of an ITS is to teach [38].
Recent research proposes a multilingual tutoring system for English and Korean. The authors suggest using pedagogical code-switching, which showed significant promise for enhancing intelligent tutoring systems in multilingual educational settings. This approach involves strategically integrating multiple languages during instruction, drawing on traditional classroom strategies to scaffold learning for students with varied language proficiencies [41]. Nevertheless, several challenges remain unaddressed. Current LLMs often exhibit inconsistent performance across different languages, particularly in lower-resource contexts. Biases, hallucinations, and misalignments between model output and local curricula can negatively affect the learning experience [41]. Further development is needed to improve cross-lingual consistency, ensure cultural appropriateness, and manage the complexity of vocabulary and syntax in more advanced subjects. LLMs hold tremendous potential to revolutionize education by making learning more personalized, responsive, and inclusive. However, fully realizing this potential requires addressing key challenges related to language, culture, and equity. Multilingual intelligent tutoring systems therefore require evaluation strategies that combine summative and informal evaluations to adequately serve diverse educational environments.
4. System Description
ConoCancer is a web-based tutoring platform designed to address the educational needs of low-literacy Hispanic individuals and their families. The name ConoCancer embodies this bilingual mission, combining the Spanish word conocer, meaning “to know,” with “cancer” to create a name that signifies “Know Cancer”. The platform ensures that patients, caregivers, and healthcare professionals can access vital knowledge regardless of their primary language. It is designed to educate individuals about breast cancer symptoms, prevention, diagnosis, and treatment; facilitate access to healthcare providers, treatment centers, and financial aid resources; bridge the language gap by offering professionally translated, culturally adapted content in both English and Spanish; and support community engagement through interactive tools, forums, and expert-driven content. The platform offers a unique bilingual experience, featuring educational videos in both English and Spanish, complemented by interactive quizzes that allow users to engage in either language. A personalized dashboard tracks users’ progress in understanding breast cancer topics and provides access to external resources to further support their learning. What sets this platform apart is its integration of multilingual ASR technology. This feature enables users to answer short-response questions by speaking in English, Spanish, or a mix of both, seamlessly accommodating code-switching and enhancing accessibility. The Whisper-1 model [48] was used as the ASR. This model has been trained on a large dataset of diverse audio, enabling recognition of multiple languages within a single audio input and making it well suited to handle code-switched speech. The website is hosted on Amazon Web Services (AWS) and was developed in Python 3.12.8 using Flask, with MySQL 8.0.41 as the database. To populate ConoCancer’s backend with real-world data on hospitals and breast cancer support groups, a combination of scraping tools and manual enrichment was used to ensure both scale and accuracy. The system’s key features, which users can navigate through the dashboard, are depicted in Figure 2.
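As an illustration of how speech input reaches the ASR, the following minimal sketch shows a Flask endpoint forwarding a recorded clip to the Whisper-1 transcription API; the route name, form field, and omitted error handling are simplifying assumptions rather than the platform’s exact code.

```python
# Minimal sketch: forwarding a recorded quiz response to Whisper-1 for transcription.
# The route name ("/transcribe") and form field ("audio") are illustrative placeholders.
from flask import Flask, request, jsonify
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()  # reads OPENAI_API_KEY from the environment

@app.route("/transcribe", methods=["POST"])
def transcribe():
    audio_file = request.files["audio"]  # audio blob recorded in the browser
    result = client.audio.transcriptions.create(
        model="whisper-1",  # multilingual model; handles mixed English/Spanish speech
        file=(audio_file.filename, audio_file.read()),
    )
    return jsonify({"text": result.text})  # transcription shown in the answer text box
```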
Although video-based breast cancer educational platforms exist [49,50,51], none are specifically designed for low-literacy Hispanic populations. Our platform directly addresses this gap by using interactive techniques that engage users and allow them to navigate the content with ease. While some websites do provide Spanish-language materials, access to them is disruptive: users must exit the English version, switch to Spanish, and then manually relocate their previous spot. In contrast, our platform eliminates this hassle through a seamless toggle feature, enabling instant language switching without disrupting progress. Additionally, many existing platforms present videos without accompanying questions, or, if questions are embedded in the video, the speaker poses and then immediately answers them, leaving no opportunity for users to submit their own response. This approach risks undermining meaningful learning. Even in cases where quizzes are offered, they often take the form of personal questionnaires, such as asking whether any abnormalities appeared during a self-examination, rather than content-based assessments. Our platform personalizes the experience by recording user progress while focusing strictly on comprehension of the video material. Each question is tied directly to the educational context, with clear answers, ensuring that assessments are objective, measurable, and pedagogically effective.
Table 1 provides a feature comparison between some of these platforms and our proposed system.
This platform can be used by anyone interested in learning more about breast cancer but is particularly targeted toward the low-literacy Hispanic population. To help users keep track of their progress, they are required to create an account and log in before they can access the features and functionalities available on the platform. As mentioned in the previous section, the website is hosted on AWS and accessed directly from there, and any relevant data is also stored in the cloud. Once the user creates an account and logs in to the system, they have access to the platform via a dashboard. Each of the features, along with the overall system architecture, is discussed in this section (see Figure 3 for the detailed architecture).
The dashboard serves as the central hub and primary entry point for users, as it is the first interface they encounter after logging in to the system (see Figure 4). The sidebar on the left acts as the main navigation tool, allowing users to explore the rest of the options available on the platform. The center of the dashboard highlights the four core categories, along with the number of videos available in each. An “Upcoming Videos and Quizzes” section provides direct links, enabling users to seamlessly continue where they left off. On the right side, the statistics panel offers a quick snapshot of the user’s progress, serving as a helpful reminder of how far they have come. Additionally, a language toggle button at the top center of every page allows users to switch between English and Spanish with a single click. This toggle applies to the current page only, allowing users to explore alternative phrasing or terminology in the other language without changing the default language for their overall experience.
4.1. Videos and Articles
The Videos and Articles page initially presents an overview of each category (refer to Figure 5). Upon selecting a category, users can access its associated videos, reading materials, and study quiz.
The platform incorporates essential features designed to support users in learning in the way that works best for them. At the core of our approach is short, video-based learning (Figure 6a), which helps reduce barriers for users who may lack confidence in their reading abilities. Infographics are also available so users can get a quick overview of each category (Figure 6b). Because of gaps in second-language vocabulary, bilingual users may sometimes misinterpret or not fully understand content written or spoken in a language that is not their first. Therefore, users can seamlessly switch between the English (Figure 6c) and Spanish (Figure 6d) versions of a video without losing context, ensuring accessibility across language preferences. Using the language toggle button does not hinder progress in the video: if the user switches the language while watching, playback resumes in the newly selected language from the exact point of the switch, ensuring continuity without disrupting the learning experience. Users also have the ability to rewind the video if needed. Progress is automatically saved and labeled as “Not Started”, “In Progress”, or “Completed” to help users track their learning journey.
4.2. Practice and Study Quizzes
The practice quizzes are embedded directly within each video to ensure active engagement. As users watch a video, it automatically pauses at key points to present a question that gauges their involvement and understanding (see Figure 7a,b). Users must answer the question correctly for the video to resume. They have unlimited attempts, and if they are stuck, they can choose to reveal the answer. Skipping ahead is disabled and progress is locked to maintain the intended learning flow. These in-video questions are designed to reinforce key concepts and keep users engaged. The embedded questions target the lower levels of Bloom’s Taxonomy (Remembering, Understanding, and Applying) to ensure the user grasps the basic concepts of the video. They serve as checkpoints in the user’s progress, as there are only 2–4 questions per video. Their purpose is not to overwhelm the user with excessive information, but rather to focus on key points that help reinforce understanding.
The study quizzes include 10–15 questions drawn from all videos within a given category and take around 15–25 min to complete. They are designed to reinforce comprehension across the full range of content. While some questions assess basic recall, others introduce more complex challenges aligned with higher levels of Bloom’s Taxonomy. These questions are intentionally crafted to push users beyond simple memorization and encourage deeper understanding and critical thinking. Although they may be challenging, the questions remain within the scope of the videos and do not require any knowledge beyond what is provided in the content. Once the user submits their quiz, the multiple-choice and true/false questions are evaluated automatically using standard answer-matching methods. Short-response questions, however, are assessed using GPT-4o, which classifies answers as correct or semi-correct. This evaluation leverages keyword mapping and considers responses in both English and Spanish, even if the user did not code-switch. A monolingual answer does not negatively impact the user’s score; if the required keywords are not found in one language, the system checks the other language. Based on this assessment, each short response scores 2 points for a completely correct answer, 1 point for a semi-correct answer, and 0 points for an incorrect answer (see Figure 8 for examples).
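For concreteness, the bilingual keyword check described above can be sketched as follows; the keyword lists, function name, and example are illustrative placeholders, and in the deployed system the correct/semi-correct classification is made by GPT-4o.

```python
# Illustrative sketch of the bilingual keyword mapping used to score short responses
# (2 = correct, 1 = semi-correct, 0 = incorrect). Keyword sets are placeholders;
# in production the correct/semi-correct judgment is delegated to GPT-4o.
def score_short_response(answer: str, keywords: dict) -> int:
    """keywords = {"en": {"required": [...], "partial": [...]},
                   "es": {"required": [...], "partial": [...]}}"""
    text = answer.lower()
    for lang in ("en", "es"):                       # monolingual answers are not penalized:
        kw = keywords[lang]                         # if keywords are missing in one language,
        if all(k in text for k in kw["required"]):  # the other language is checked next
            return 2
    for lang in ("en", "es"):
        if any(k in text for k in keywords[lang]["partial"]):
            return 1
    return 0

# Example with a code-switched answer about screening (hypothetical keywords).
kw = {"en": {"required": ["mammogram"], "partial": ["screening"]},
      "es": {"required": ["mamografía"], "partial": ["detección"]}}
print(score_short_response("Me hicieron una mamografía last year", kw))  # -> 2
```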
4.3. Multilingual ASR
To support bilingual users, the platform integrates a multilingual ASR system that allows users not only to speak their responses to some questions but also to code-switch within those responses when desired (see Figure 8). This feature enables users to focus on understanding the material without the added burden of mentally translating content, resulting in a smoother and more intuitive learning experience. The functionality is available in the study quizzes, which offer language flexibility on a question-by-question basis. Users can choose to respond in the language they are most comfortable with for each individual question, allowing them to demonstrate understanding more naturally. The intention behind this feature is to acknowledge the bilingualism of users, who may not watch all the videos within a category in a single language; it allows them to respond in the language in which they learned the material. This approach not only enhances comprehension and engagement but also makes the platform more accessible to caregivers, who often attend appointments and assist in decision-making. By offering tools to explore complex medical concepts in both English and Spanish, the platform encourages broader participation and shared understanding within the care network.
4.4. Additional Resources
The Resources section features an interactive map that connects users to support services beyond the platform, including Long Island cancer care facilities and support groups. Users can filter these resources by distance and accepted insurance plans, making it easier to find services that meet their specific needs. Support groups can also be filtered by language and format, whether in-person or virtual, ensuring accessibility and relevance. All resources are presented through a user-friendly interface, streamlining the search for appropriate and inclusive care options (see Figure 9).
4.5. User Progress
The Progress section tracks all user accomplishments across the platform. It provides an overview of progress within each category. Once a category is completed, it is highlighted in green with a checkmark for easy recognition (Figure 10a). For incomplete categories (e.g., Diagnosis and Treatment in the figure), users can expand a dropdown menu to view the remaining videos, readings, and study quizzes (see the Treatment category expanded in Figure 10b). Each item links directly to its corresponding content, allowing users to pick up where they left off. On the right side, users will find a list of completed study quizzes, displaying the associated category, their latest score, and a link to view a detailed summary (Figure 10a).
4.6. Settings
The Settings section lets users change the language preference they selected when creating their account, since users are prompted to choose a primary language at sign-up. This ensures that all content is initially displayed in the language of their choice. However, this selection is not permanent: users can switch their preferred language at any time through their account or profile settings. This section also allows users to update their account password from the same page (Figure 11).
4.7. Chatbot
The centerpiece of this tutoring system is the advanced chatbot designed to engage users in meaningful conversations about breast cancer. The chatbot can interpret and reply to user inquiries in a conversational fashion by using natural language processing (NLP) models, guaranteeing that users are provided with precise and contextually appropriate information in real time. The primary goals of developing the chatbot were to (1) create an intuitive user experience, (2) include trustworthy sources of information on breast cancer, and (3) provide a tool to raise breast cancer awareness. Through interactive and educational resources, the system seeks to close the knowledge gap by giving users a greater understanding of breast cancer, its risk factors, early warning indicators, and accessible screening techniques. The integration of the chatbot into the broader tutoring system allows for a more personalized learning experience, so that the chatbot may respond to inquiries about breast cancer, offer instructional materials, and assist users by directing them to pertinent sites as necessary.
4.7.1. The Chatbot Architecture
The architecture of the Breast Cancer Q&A chatbot is composed of several components working in tandem to ensure smooth user interaction, effective content retrieval, and personalized responses (see Figure 12). The key components of the chatbot include (a) the Streamlit Interface [52,53], the chatbot’s user-facing front end that facilitates communication; (b) NLP Models [54], pre-trained and refined models for context retrieval and question answering; (c) the Knowledge Base, a carefully curated compilation of data about risk factors, treatment options, and symptoms of breast cancer; (d) the User Profile, a customized area where user information is gathered and used by the chatbot to provide tailored answers; and (e) a Feedback Logging system for gathering user input on the chatbot’s responses in order to improve the system over time.
4.7.2. The Chatbot Features
The chatbot is designed to answer a variety of questions related to breast cancer, provide personalized health insights, and assist in navigating the complexities of breast cancer prevention, detection, and treatment.
Question-Answering: The DistilBERT model [55] is used by the system to determine the most pertinent response to a user’s query. The user’s question is preprocessed and converted into a tensor representation. The chatbot then measures the similarity between the query and the knowledge base using Sentence-Transformer [56] and retrieves the most relevant paragraph. The DistilBERT model extracts the answer from the retrieved context, and the system returns a concise answer along with additional context and external links for further reading.
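A condensed sketch of this retrieve-then-extract flow is shown below, assuming a small in-memory knowledge base; the passages are placeholders, while the two model names are those listed in this section.

```python
# Sketch of the chatbot's answer pipeline: Sentence-Transformer retrieval followed by
# DistilBERT extractive QA. The knowledge-base passages here are placeholders.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

retriever = SentenceTransformer("all-MiniLM-L6-v2")
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

knowledge_base = [
    "A mammogram is an X-ray picture of the breast used to screen for breast cancer.",
    "Common symptoms include a new lump in the breast or underarm and changes in breast shape.",
]
kb_embeddings = retriever.encode(knowledge_base, convert_to_tensor=True)

def answer(question: str) -> dict:
    q_emb = retriever.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, kb_embeddings)[0]   # cosine similarity to every passage
    best_idx = int(scores.argmax())
    context = knowledge_base[best_idx]               # most relevant paragraph
    extraction = reader(question=question, context=context)
    return {"answer": extraction["answer"], "context": context, "score": float(scores[best_idx])}

print(answer("What is a mammogram?"))
```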
Personalized Health Insights: To enhance user engagement and deliver more relevant educational support, the ConoCancer chatbot incorporates a set of personalization features that allow it to tailor responses based on the individual user’s profile (Figure 13a,b). During the initial interaction, users are prompted to provide basic information such as their age, family history of breast cancer, and any known lifestyle-related risk factors (e.g., smoking, alcohol use, lack of physical activity). The chatbot uses this profile data to generate personalized messages that provide context-aware health guidance. For example, users over the age of 40 may be reminded of the importance of regular mammograms, while those with a family history may be advised to speak with a healthcare provider about genetic testing or early screening. Similarly, if certain risk factors are selected by the user, the chatbot can offer lifestyle tips or refer the user to relevant educational resources. This approach makes the conversation feel more individualized and supportive, especially for users who may be anxious or unsure where to begin. It should be noted that the user information collected for personalization is not stored in the database; it is retained only while the user is active in the session. The only data stored are the question asked, the response generated, and the user’s feedback on the response, which are kept for later fine-tuning of the system (see the feedback collection mechanism below).
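Conceptually, this personalization can be expressed as a small set of rules applied to the in-session profile, as in the illustrative sketch below; the thresholds, field names, and messages are hypothetical rather than the chatbot’s exact rules.

```python
# Illustrative rule-based personalization over the in-session user profile.
# Thresholds and wording are hypothetical; profile data is kept only for the session.
def personalized_tips(profile: dict) -> list[str]:
    tips = []
    if profile.get("age", 0) >= 40:
        tips.append("Regular mammograms are recommended; ask your provider about scheduling one.")
    if profile.get("family_history"):
        tips.append("Consider discussing genetic testing or early screening with a healthcare provider.")
    for factor in profile.get("risk_factors", []):
        tips.append(f"Lifestyle factor noted ({factor}); see the related educational resources.")
    return tips

print(personalized_tips({"age": 45, "family_history": True, "risk_factors": ["smoking"]}))
```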
Follow-Up Questions: To promote deeper learning and sustained user engagement, the chatbot includes a follow-up question feature that appears after each response (Figure 14, left). These follow-up questions were carefully predefined by hand and manually mapped to specific categories within the knowledge base. For instance, after a user asks a question related to breast cancer symptoms, the chatbot recommends follow-up queries, such as “Can breast cancer be painless?” or “Do symptoms vary by age?”, that the user can ask next. Each core category, such as Symptoms, Diagnosis, or Treatment, has a curated list of relevant follow-up questions designed to reflect common next steps in the learning process. These questions serve two purposes: first, they help guide users through a logical educational pathway without requiring them to know what to ask next; second, they encourage exploration of related topics they may not have previously considered. This approach makes the learning experience feel more natural and conversational while also reinforcing key concepts. By mapping these follow-up prompts to specific topics, the chatbot maintains semantic coherence and ensures that the conversation stays relevant to the user’s initial intent.
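The category-to-question mapping can be pictured as a simple lookup structure, as in the sketch below; the categories and questions shown are a small illustrative subset, and the de-duplication logic is an assumption.

```python
# Sketch of the manually curated follow-up question mapping. Categories and questions
# shown here are a small illustrative subset of the curated lists.
FOLLOW_UPS = {
    "Symptoms": ["Can breast cancer be painless?", "Do symptoms vary by age?"],
    "Diagnosis": ["What happens during a biopsy?", "How accurate are mammograms?"],
    "Treatment": ["What are common side effects of chemotherapy?", "Is surgery always required?"],
}

def suggest_follow_ups(category: str, already_asked: set[str]) -> list[str]:
    # Skip prompts the user has already asked to keep suggestions fresh.
    return [q for q in FOLLOW_UPS.get(category, []) if q not in already_asked]

print(suggest_follow_ups("Symptoms", {"Can breast cancer be painless?"}))
```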
Feedback Collection Mechanism: To improve the chatbot’s accuracy, usability, and relevance over time, a feedback collection mechanism has been integrated directly into the user interface (see Figure 14a,b, with Figure 14b showcasing the feedback mechanism). After each chatbot response, users are given the option to provide feedback using a thumbs-up (👍) or thumbs-down (👎). This feedback is logged and includes three critical data points: the user’s question, the chatbot’s answer, and the selected feedback (see Figure 15). This setup allows developers and researchers to later review the logs and work on system improvement. The chatbot also tracks which knowledge base entries have already been used, avoiding repeated answers and supporting contextual learning. This context management allows the system to offer similar but varied and relevant information to the user throughout the conversation.
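The logged record itself is simple: the question, the answer, and the rating. The sketch below illustrates the idea with a JSON-lines log; the file path and field names are placeholders, and the deployed system persists equivalent records in its database.

```python
# Minimal sketch of feedback logging: the user's question, the chatbot's answer, and the
# thumbs-up/down rating are appended as one structured record. The path is illustrative.
import json
import time

def log_feedback(question: str, answer: str, rating: str, path: str = "feedback_log.jsonl") -> None:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "question": question,
        "answer": answer,
        "feedback": rating,  # "thumbs_up" or "thumbs_down"
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

log_feedback("What is a mammogram?", "A mammogram is an X-ray of the breast.", "thumbs_up")
```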
5. LLMs Applications and Challenges
5.1. Multilingual ASR and GPT-4o
LLMs were primarily used in our system for three key components: question generation, comprehension assessment, and the chatbot. After processing the audio content through ASR (which also uses LLMs behind the scenes), the resulting transcripts were fed to the GPT-4o model to generate the study and practice quiz questions. The LLM was tasked with identifying pedagogically relevant segments from the transcript and then formulating questions aligned with the video’s learning objectives. Additionally, the LLM was responsible for evaluating user responses to short-answer questions by comparing them semantically against predefined correct and semi-correct answers. Although GPT-4o supports multilingual input and output, a strong bias toward English was observed. The model often preferred to reason over the English version internally, even when the Spanish transcription was provided, and tended to translate English-formulated questions into Spanish rather than generating questions natively from the Spanish input. This created a disconnect between the phrasing of the question and the user’s experience, especially when a subtle difference in word choice misaligned with what was actually heard in the video. It undermined the goal of faithful comprehension assessment, as learners could be confused by wording they never encountered. Another recurring issue was question irrelevance: the LLM would sometimes focus on minor or tangential details that, while present in the transcription, were not central to the pedagogical goal of the video. This points to a deeper challenge in content summarization and semantic salience detection, tasks that LLMs still struggle with when the input lacks clear structural cues.
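For concreteness, a minimal sketch of this transcript-to-question step is shown below; the prompt wording, function name, and parameters are illustrative assumptions rather than the system’s exact implementation.

```python
# Sketch of the question-generation step: a video transcript (from the ASR) is passed to
# GPT-4o with instructions to produce comprehension questions. Prompt text is illustrative.
from openai import OpenAI

client = OpenAI()

def generate_quiz_questions(transcript: str, language: str = "es", n_questions: int = 3) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": ("You are generating quiz questions for a breast cancer education video. "
                         f"Write {n_questions} questions in {language}, using only wording that "
                         "appears in the transcript; do not translate from another language.")},
            {"role": "user", "content": transcript},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

# Usage: questions = generate_quiz_questions(spanish_transcript, language="es")
```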
Whisper (the ASR model), despite strong multilingual performance, occasionally produced inaccurate or incomplete transcriptions, particularly when speakers stuttered, repeated words, or used non-standard speech patterns. These errors often propagated downstream into the LLM’s generation phase. For example, missing or misrecognized medical terms in the transcription could lead to confused or malformed questions, or worse, the omission of essential content altogether. This behavior suggests that while LLMs are multilingual, they are not yet code-switching-aware in a nuanced, task-specific way. They lack the ability to contextually preserve or reflect bilingual patterns that are natural in certain user settings. Additional challenges emerged during testing of the ASR feature. Accurate transcription requires clear and close-range speech into the device’s microphone. Even so, Whisper’s transcription is not entirely reliable and often requires manual correction once the text appears in the textbox. There may also be a noticeable delay between when the user finishes recording their response and when the transcription appears, which could disrupt the user’s flow and create confusion.
5.2. Chatbot Models
During the early stages of developing the ConoCancer chatbot, several LLMs and transformer-based architectures were evaluated with the goal of enabling accurate, bilingual, and context-aware conversational capabilities. Although these models are powerful in their respective domains, most did not meet the specific requirements of our use case, including bilingual understanding, code-switching, and contextual response generation. The models tested included offerings from OpenAI, Helsinki-NLP, Facebook, Google, Deepset, EleutherAI, and Hugging Face. The last two models (DistilBERT and MiniLM), distributed through Hugging Face, fulfilled the requirements of the study and were used for the chatbot.
Text-davinci-003 (OpenAI GPT-3) [57]: Despite its strong natural language understanding and generation capabilities, GPT-3 is a general-purpose model. It struggled with domain specificity, often generating overly generic responses. Moreover, due to API limitations, lack of control over factual grounding, and cost constraints, it was not ideal for a focused, structured Q&A chatbot.
Helsinki-NLP/opus-mt-en-es [58]: This model was considered for real-time bilingual translation. However, it lacks conversation context awareness and cannot handle code-switching between English and Spanish in a single utterance. Additionally, it is a one-way translator, not suitable for dynamic dialog generation.
Facebook/mbart-large-50-many-to-many-mmt [59]: While this model supports over 50 languages, its primary strength lies in translation and summarization, not real-time Q&A or retrieval-based interaction. It also required significant compute for inference and failed to maintain topic relevance in dialog when used directly.
Deepset/bert-base-cased-squad2 [60]: Designed for extractive question answering, this model performed well on short, direct questions but lacked the ability to generalize across loosely worded or conversational queries. Additionally, it did not support semantic matching across multiple contexts, which was essential for our corpus-driven architecture.
EleutherAI/gpt-neo-1.3B [61]: As an open-source alternative to GPT-3, GPT-Neo offered promise in offline settings. However, it produced hallucinated responses, especially in the medical domain, and lacked sufficient bilingual understanding, making it unsuitable for our dual-language audience.
Google/flan-t5-small [62]: Though instruction-tuned and lightweight, FLAN-T5 struggled with multi-turn conversational coherence and lacked robustness in retrieving grounded answers from a domain-specific corpus. It also underperformed when dealing with Spanish-language inputs and mixed-language prompts.
Distilbert-base-cased-distilled-squad [55]: Developed by Hugging Face, this is a lightweight, high-performance transformer model for question answering (QA). It processes the user’s question and searches the knowledge base for the most relevant information to provide the correct answer, leveraging its strong performance on extractive QA tasks.
All-MiniLM-L6-v2 (Sentence-Transformers) [63]: This model is employed for semantic similarity matching, ensuring that user questions are matched with the most contextually relevant information in the knowledge base. It encodes the user’s question and the knowledge base content into vectors, and the system computes the cosine similarity between them, retrieving the most relevant content based on the resulting score.
One of the challenges in implementing a question-answering system is context retrieval, i.e., ensuring that the chatbot can retrieve the most relevant information based on the user’s query. To address this, the chatbot uses a semantic search mechanism [53] that calculates the similarity between the question and knowledge base entries using embeddings generated by the Sentence-Transformer model. In cases where the system cannot find a relevant answer (i.e., when the similarity score falls below a predefined threshold), the chatbot provides a fallback message indicating that it does not have sufficient information and offers suggestions for related questions, handling errors in a user-friendly manner. User feedback is essential for improving the chatbot’s responses; a challenge in this area is ensuring that feedback is properly logged and used to refine the system. The chatbot addresses this by storing feedback in a structured format, allowing easy analysis and continuous improvement.
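The fallback behavior reduces to a threshold check on the best retrieval score, as in the sketch below; the threshold value and function signature are assumptions, not the system’s tuned settings.

```python
# Sketch of the fallback logic: if no knowledge-base entry is similar enough to the query,
# the chatbot declines and suggests related questions. The threshold is an assumed value.
SIMILARITY_THRESHOLD = 0.45

def respond(best_score: float, best_answer: str, suggestions: list[str]) -> str:
    if best_score < SIMILARITY_THRESHOLD:
        return ("I don't have enough information to answer that yet. "
                "You could try asking: " + "; ".join(suggestions))
    return best_answer

print(respond(0.21, "", ["What are common risk factors?", "How is breast cancer diagnosed?"]))
```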
6. Discussion
ConoCancer is a comprehensive bilingual educational platform tailored to address the informational needs of low-literacy Hispanic communities around breast cancer. Our goal was to build a personalized, culturally relevant, and linguistically inclusive web-based learning environment, incorporating video-based education, interactive assessments, and a real-time chatbot. Through this work, we explored the integration of LLMs and ASR systems across several components. One of the major accomplishments and contributions of this work was the successful curation and deployment of a structured bilingual video library. The videos were categorized along key stages of the breast cancer journey and supplemented with infographics to visually reinforce key messages. The content was consistently mirrored in both English and Spanish, ensuring accessibility and coherence for bilingual or Spanish-dominant users. The platform’s assessment system is designed with Bloom’s Taxonomy as its foundation, creating two tiers of comprehension checks: embedded practice quizzes for foundational recall and understanding, and study quizzes that push users to engage with higher-order thinking through short-response and evaluative questions. Notably, the system allows users to answer short-response questions via speech in both English and Spanish, supported by Whisper’s multilingual ASR. This functionality lets users respond naturally, whether by typing or speaking, and accommodates varying levels of literacy and language preference.
The integration of a real-time bilingual chatbot, powered by a curated knowledge base, allows for flexible, conversational learning. The chatbot offers users a natural interface for asking questions about breast cancer and receiving personalized, structured, and medically accurate responses. We further enhanced its utility by implementing user profiles for personalization, a follow-up question system to encourage deeper exploration, and a feedback mechanism to support iterative refinement. These features collectively enable a more dynamic and engaging learning experience. Another key aspect of the platform is its emphasis on language flexibility and user control. Users select a primary language preference during account creation, which determines how content is initially displayed. However, they can change this preference at any time using a language toggle button available on every page, designed to support real-time bilingual exploration. Furthermore, study quizzes allow users to type or speak their responses in the language of their choice on a per-question basis. By enabling flexible access to content and assessments in both languages, the platform allows users to code-switch easily. ConoCancer also integrates a resource discovery module that maps nearby breast cancer support centers and hospitals, filtered by proximity, accepted insurance, and language support, adding a valuable real-world utility layer to the platform. The key features of the prototype are briefly summarized in Figure 16.
6.1. Limitations
One limitation that restricted content coverage was the lack of openly available, high-quality educational resources in both English and Spanish. Several materials that could have enhanced the variety of content were copyrighted and therefore could not be used. Additionally, while the platform features a resource map for breast cancer centers and support groups, coverage is currently limited to the Long Island area. The accuracy of data such as accepted insurance policies depends on third-party sources, which may not always be up to date.
From an implementation perspective, Whisper, the ASR model, poses performance challenges. Since Flask’s default WSGI server processes each request in a blocking thread, long-running tasks like sending audio to Whisper can cause high latency and reduced concurrency. Moreover, Whisper lacks real-time transcription, requiring users to wait until the full audio clip is processed. Under heavy load, Whisper’s CPU, memory, and network demands can cause system slowdowns, timeouts, or failures. To mitigate this, we implemented a 20-second automatic timeout and batched audio requests, ensuring that excessively long recordings did not overwhelm the system. Users are also given the option to use the ASR feature multiple times within the same question if there is a need to refine or extend an answer. MySQL performance can also degrade when a new connection is opened for every API call, especially without proper connection pooling. In the current system, we introduced connection pooling to maintain a small pool of reusable connections rather than repeatedly opening and closing them. This reduced query latency and prevented timeouts during testing. While the fix improved stability, scaling to larger user volumes will require further optimization in future iterations. From a usability perspective, ConoCancer is currently optimized only for desktop use, which limits access on mobile and tablet devices, where layouts may break or appear cluttered.
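The connection-pooling fix can be illustrated with mysql-connector-python’s pooling API, as in the sketch below; the pool size, credentials, and table and column names are placeholders rather than the deployed configuration.

```python
# Sketch of the MySQL connection pooling described above, using mysql-connector-python.
# Pool size, credentials, and the "progress" table are placeholders.
from mysql.connector import pooling

pool = pooling.MySQLConnectionPool(
    pool_name="conocancer_pool",
    pool_size=5,                 # small pool of reusable connections
    host="localhost",
    user="app_user",
    password="***",
    database="conocancer",
)

def fetch_progress(user_id: int):
    conn = pool.get_connection()  # borrow a pooled connection instead of opening a new one
    try:
        cur = conn.cursor(dictionary=True)
        cur.execute("SELECT category, status FROM progress WHERE user_id = %s", (user_id,))
        return cur.fetchall()
    finally:
        conn.close()              # returns the connection to the pool
```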
Despite its potential, the chatbot’s accuracy depends on the breadth of the underlying knowledge base, lacks the ability to deliver personalized medical advice, and cannot replace professional healthcare consultations. Addressing these challenges requires ongoing research, iterative improvements, and potential integration with medical databases and advanced AI capabilities.
6.2. Future Directions
Although the current system primarily relies on LLM-generated content during deployment, future iterations will incorporate additional resources to enhance both accuracy and user experience. Notably, the question corpus [64] and question-answer (QA) corpus [65,66] developed for breast cancer content, though not used in the present implementation, will be integrated into the database moving forward. The previously generated question corpus could offer a curated set of quiz items that align closely with pedagogical goals, providing a reliable alternative or supplement to LLM-generated questions. Meanwhile, the QA corpus holds potential for evaluating user-generated responses through semantic similarity, aiding in the creation of individualized learning profiles based on topic understanding. Future development efforts will focus on improving multilingual and interactive features. One key direction is the integration of a Spanish or multilingual version of the chatbot to support a wider range of users. In addition, the system will provide an option to have questions and answers read aloud to better accommodate low-literacy users. Furthermore, planned improvements include evaluating user responses through semantic analysis rather than keyword mapping, which would provide a more accurate and nuanced understanding of learner progress. Another avenue for improvement is the introduction of more interactive quiz formats to foster deeper engagement, along with more sophisticated use of AI to customize a personal learning environment for each user. These advancements will make ConoCancer more inclusive, intelligent, and scalable for broader use. Finally, our focus will shift toward systematically evaluating the platform with target users to determine its effectiveness in reducing health literacy disparities. This evaluation will involve usability testing, comprehension assessments, and confidence measures to examine whether users feel more assured in accessing and understanding bilingual breast cancer information. We also plan to analyze the extent to which code-switching and bilingual delivery improve accessibility and engagement compared to single-language systems. Furthermore, outcome measures will assess reductions in misinformation by comparing user knowledge before and after interacting with the platform. These studies will provide quantitative evidence of the platform’s impact, validate its design choices, and guide refinements necessary to enhance both technical performance and cultural responsiveness. An additional task we plan to explore is keeping human feedback in the loop to retrain and fine-tune the transformer models and to embed a reinforcement learning approach for real-time adaptation of the conversational agents as interactions progress. From a development and system implementation perspective, we would also like to add a mobile-friendly version of the system. Many users in the target demographic rely on mobile devices as their primary means of internet access; therefore, implementing a fully responsive, mobile-friendly design once the current prototype is tested, evaluated, and approved will be an added future task to increase the reach and usability of the system.