SnapStick: Merging AI and Accessibility to Enhance Navigation for Blind Users
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
SnapStick is a significant development in assistive technology, providing a complete solution that encourages greater autonomy and inclusion for blind people by fusing cutting-edge algorithms with purpose-driven design. This work emphasizes how crucial it is to create useful, approachable technologies that enable users to live more autonomous and active lives. The writing of this work is fine. Below are my concerns.
1. Are there any visual results to support the proposed work?
2. In Table 2, why choose Q1-Q5, not Q1-Q10 or Q6-Q10?
Author Response
- Are there any visual results to support the proposed work?
We thank the reviewer for this insightful comment. To address this, we have already added visual examples that illustrate the actual output generated by the SnapStick system. Specifically, we included Figure 1, which presents side-by-side comparisons of the captured image and the corresponding audio description generated by the app. This visual evidence helps demonstrate the real-world effectiveness and descriptive quality of our system.
- In Table 2, why choose Q1-Q5, not Q1-Q10 or Q6-Q10?
We thank the reviewer for this observation. We believe there may have been a misunderstanding regarding the structure of the questionnaires used in the study.
Table 2 specifically presents responses to a custom-designed five-question survey, separate from the System Usability Scale (SUS), which includes 10 standardized questions (Q1–Q10) and is shown in Table 1. The questions in Table 2 (Q1–Q5) refer to the custom questionnaire evaluating user satisfaction, comfort, confidence, audio clarity, and perceived problems, as described in the Data Collection section. These are not part of the SUS.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript presents SnapStick, a smartphone‐based assistive system that integrates a Bluetooth smart cane, bone-conduction headphones, and an on-device vision–language model. The platform delivers real-time scene narration, object and text recognition, and bus-stop identification, thereby enabling blind and visually impaired users to navigate safely in both indoor and outdoor settings. Preliminary user trials indicate that the system is easy to operate, comfortable to use, and provides clear voice feedback—benefits that collectively foster greater independence and confidence. While the study is innovative, the following points should be addressed to strengthen the work’s scholarly rigor and practical impact:
- Although Seeing AI and similar products are briefly mentioned, a systematic comparison using a Function–Performance–Cost theoretical framework would contextualize SnapStick’s advantages and limitations more convincingly.
- System effectiveness is assessed mainly through the SUS and custom questionnaires. Beyond qualitative descriptions of “high accuracy,” incorporate quantitative measures such as recognition accuracy, inference latency, and power consumption to substantiate performance claims.
- Hosting the vision–language model on a user-supplied private server reduces cloud latency but increases hardware costs and maintenance complexity, potentially limiting accessibility for low-income or non-technical users. Consider model compression for on-device inference or a hybrid edge–cloud architecture.
- Current functionality focuses on static scene description and text/bus-stop recognition. Integrating depth sensing, real-time obstacle tracking, and vibrotactile feedback would better mitigate collision risks in crowded or fast-moving environments.
- While local processing is said to protect data, the manuscript does not detail image encryption, anonymization, or data-deletion procedures. Clarify these processes to ensure compliance with GDPR and other privacy regulations, thereby increasing user trust and legal robustness.
Author Response
1. Although Seeing AI and similar products are briefly mentioned, a systematic comparison using a Function–Performance–Cost theoretical framework would contextualize SnapStick’s advantages and limitations more convincingly.
We sincerely thank the reviewer for this thoughtful suggestion. In response, we have now included a systematic comparison table based on the Function–Performance–Cost (FPC) framework. This table compares SnapStick with existing commercial tools to provide a structured overview of their capabilities, usability, and accessibility. Kindly see Table 3.
2. System effectiveness is assessed mainly through the SUS and custom questionnaires. Beyond qualitative descriptions of “high accuracy,” incorporate quantitative measures such as recognition accuracy, inference latency, and power consumption to substantiate performance claims.
We appreciate the reviewer’s insightful comment highlighting the need for quantitative performance metrics. In response, we have now included additional experimental results that quantify:
- Recognition Accuracy (across indoor, outdoor, and people-based scenes),
- Inference Latency (average time per image to generate a description),
- Power Consumption (average battery usage during standard operation over a 30-minute session).
Kindly see lines 248 to 263.
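As an illustration of how a latency figure of this kind can be obtained, the minimal sketch below times a hypothetical describe_image() wrapper over a batch of images and reports the mean and standard deviation of wall-clock time per image; the function name and image paths are placeholders, and this is not the authors' evaluation script.

```python
# Minimal latency-measurement sketch, not the authors' evaluation code. It assumes a
# hypothetical describe_image() call wrapping the model inference and reports the
# mean and standard deviation of wall-clock time per image.
import time
import statistics


def describe_image(path: str) -> str:
    """Hypothetical wrapper around the vision-language model inference."""
    time.sleep(0.1)  # stand-in for real model work
    return "description"


def measure_latency(image_paths):
    times = []
    for path in image_paths:
        start = time.perf_counter()
        describe_image(path)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)


mean_s, std_s = measure_latency([f"img_{i}.jpg" for i in range(20)])
print(f"latency: {mean_s:.2f} ± {std_s:.2f} s per image")
```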
3. Hosting the vision–language model on a user-supplied private server reduces cloud latency but increases hardware costs and maintenance complexity, potentially limiting accessibility for low-income or non-technical users. Consider model compression for on-device inference or a hybrid edge–cloud architecture.
We thank the reviewer for this insightful and important comment. We fully agree that relying on a user-hosted local server, while beneficial for latency and privacy, introduces challenges in terms of hardware accessibility and technical complexity.
In response, we have now expanded the Discussion and Future Work sections to address this limitation explicitly. Kindly see lines 346 to 351 and 355 to 367.
4. Current functionality focuses on static scene description and text/bus-stop recognition. Integrating depth sensing, real-time obstacle tracking, and vibrotactile feedback would better mitigate collision risks in crowded or fast-moving environments.
We thank the reviewer for this excellent suggestion. We agree that enhancing SnapStick with real-time obstacle tracking and non-visual feedback modalities would significantly improve safety in dynamic or crowded environments. In response, we have expanded the Future Work section to include our plans for incorporating depth sensing technologies, real-time spatial tracking, and vibrotactile feedback. Kindly see lines 355 to 367.
5. While local processing is said to protect data, the manuscript does not detail image encryption, anonymization, or data-deletion procedures. Clarify these processes to ensure compliance with GDPR and other privacy regulations, thereby increasing user trust and legal robustness.
We thank the reviewer for raising this important point regarding data privacy and regulatory compliance. To address this, we have now added a dedicated subsection under the Methodology section titled Data Privacy and Security, where we detail our current and planned procedures for image encryption, local-only processing, anonymization, and automatic data deletion. These measures are designed to align with GDPR requirements and ensure user trust and legal robustness. We believe this clarification significantly strengthens the ethical and practical grounding of the system. Kindly see lines 206 to 220.
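As an illustration of such measures, the sketch below shows one common way to keep captured frames encrypted at rest and delete them immediately after processing, using the cryptography package's Fernet recipe; it is a generic example under those assumptions, not the authors' implementation.

```python
# Illustrative sketch only, not the authors' implementation. It keeps captured frames
# encrypted at rest and removes them once a description has been produced, using the
# `cryptography` package's Fernet recipe.
import os

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice this would live in a secure keystore
cipher = Fernet(key)


def store_encrypted(image_bytes: bytes, path: str) -> None:
    """Write the captured frame to disk in encrypted form only."""
    with open(path, "wb") as f:
        f.write(cipher.encrypt(image_bytes))


def process_and_delete(path: str) -> bytes:
    """Decrypt in memory for inference, then remove the temporary file."""
    with open(path, "rb") as f:
        image_bytes = cipher.decrypt(f.read())
    os.remove(path)  # automatic deletion after processing
    return image_bytes
```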
Reviewer 3 Report
Comments and Suggestions for Authors
(1) How is the AI merged? Which AI algorithm is used to develop SnapStick?
(2) This paper reads like a product introduction. The core technology is not illustrated.
(3) Table 2 presents subjective results. Different people may have different impressions. Objective evaluation indicators should be considered for the experiments.
(4) Compared with other technologies, what are the advantages?
(5) Future research directions should be discussed in the conclusion section.
Author Response
1. How is the AI merged? Which AI algorithm is used to develop SnapStick?
We thank the reviewer for this question. As described in Section 2.2 (SnapStick) of the Methodology, the AI component of SnapStick is powered by the Florence-2 Vision–Language Model (VLM) developed by Microsoft. This model is integrated into the system through a client–server architecture, where the Florence-2 model runs locally on a private server to ensure fast and secure image processing. The smartphone client captures images and communicates with the server, which uses the VLM to generate detailed scene descriptions, recognize text, and identify facial expressions or bus numbers.
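For illustration, a minimal sketch of this client–server split is given below; the Flask endpoint name, port, and describe() helper are assumptions made for the sketch and are not taken from the manuscript, with the helper standing in for the actual Florence-2 inference call.

```python
# Illustrative sketch only: endpoint name, port, and the describe() helper are
# assumptions, not the authors' implementation. The phone posts a captured image,
# the local server runs the vision-language model and returns a text description
# that the app can speak aloud.
import io

from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)


def describe(image: Image.Image) -> str:
    """Placeholder for the Florence-2 inference call running on the local server."""
    return "A person standing near a bus stop."


@app.route("/describe", methods=["POST"])
def describe_endpoint():
    # The smartphone client uploads the captured frame as multipart form data.
    image = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    return jsonify({"description": describe(image)})


if __name__ == "__main__":
    # Bound to the private local network only, matching the "local server" design.
    app.run(host="0.0.0.0", port=5000)
```

A client on the same network could then send a captured frame with `requests.post("http://<server-ip>:5000/describe", files={"image": open("frame.jpg", "rb")})` and pass the returned description to text-to-speech.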
2. This paper reads like a product introduction. The core technology is not illustrated.
We thank the reviewer for this important observation. While the paper does present SnapStick as a practical assistive tool, we agree that it is essential to also highlight the technical foundation that enables its functionality. The core AI technology, including the Florence-2 Vision–Language Model, its integration in a client–server architecture, and the image processing and communication pipeline, is described in Section 2.2 (SnapStick). Kindly see lines 108 to 134.
3. Table 2 presents subjective results. Different people may have different impressions. Objective evaluation indicators should be considered for the experiments.
We appreciate the reviewer’s point regarding the need for objective evaluation. While Table 2 presents subjective usability feedback gathered via structured questionnaires, we agree that supplementing this with objective performance metrics strengthens the overall evaluation. In response, we have now included a new subsection titled System Performance Evaluation in the Results section, where we report quantitative measures. Kindly see lines 248 to 263.
4. Compared with other technologies, what are the advantages?
Thank you for the comment. Compared to other existing technologies, SnapStick offers superior comfort, usability, and recognition accuracy. These advantages are supported by both subjective user feedback and objective performance metrics. A detailed comparative analysis is provided in Table 3, which highlights SnapStick’s strengths using a Function–Performance–Cost framework.
5. Future research directions should be discussed in the conclusion section.
Thank you for the helpful suggestion. We appreciate the reviewer’s recommendation; however, to maintain clarity and logical flow, we have included the discussion of future research directions within the Discussion section rather than the Conclusion. This placement allows for a more integrated reflection on the findings and their implications.
Reviewer 4 Report
Comments and Suggestions for Authors
Thanks for the invitation to review this work. It describes an assistive technology system called SnapStick that combines a Bluetooth-enabled cane, bone-conduction headphones, and a smartphone app using the Florence-2 Vision-Language Model to help blind users navigate. The paper reports positive results from a study with 11 blind participants, including a high SUS score of 84.7. However, some issues must be addressed before publication.
- The abstract claims capabilities like "bus route detection" and "text reading," but these features are not described in the methodology (Section 2.2). The paper only validates scene descriptions.
- The methodology states Florence-2 VLM recognizes "facial expressions" (Page 3), but Page 4 clarifies it "cannot recognize emotions." This contradiction needs resolution.
- The column headers in Table 1 (Q1–Q10) and Table 2 (Q1–Q5) lack descriptions of the SUS statements. Subject S9’s total score (35) is implausible given the individual responses (e.g., Q1=4, Q5=3). Recalculate the scores: SUS ranges from 0 to 100, and 35 suggests extreme dissatisfaction inconsistent with the other feedback.
- In table 2, Q5 responses ("No" for all subjects) contradict Section 3’s claim that "18.2% mentioned occasional ambiguity."
- Testing occurred only in "one indoor experimental room" (Page 3). Claims about "real-world environments" (Page 6) and "bus route detection" are unsupported.
- Participants were recruited from the authors’ institute (Page 3), potentially introducing selection bias. Disclose recruitment criteria.
- Page 3 states the VLM runs on a "local server," but Page 5 mentions "client and server data transmission." Clarify if processing is truly local (on-device) or requires a separate server.
- "VI" (visually impaired) and "VIP" (visually impaired people) are used interchangeably (e.g., Abstract vs. Section 2). Standardize to "VI" per modern conventions.
- Claims of SnapStick’s superiority over SeeingAI (Page 8–9) lack quantitative data (e.g., accuracy metrics). Provide statistical comparisons rather than subjective endorsements.
- An SUS score of 84.7 is described as "A+" and "96th–100th percentile," but industry benchmarks vary. Cite Sauro’s work explicitly for context.
- The requirement for a locally hosted server (Page 9) excludes users without technical expertise. Propose cloud/offline alternatives to improve accessibility.
- Consider including citations of Exploration, 2024, 4, 20230146
Author Response
1. The abstract claims capabilities like "bus route detection" and "text reading," but these features are not described in the methodology (Section 2.2). The paper only validates scene descriptions.
We thank the reviewer for pointing out this important inconsistency. The abstract indeed mentions features like bus route detection and text reading, which are core functionalities of SnapStick. While our user validation focused primarily on the system’s scene description capabilities, we agree that these additional features should be technically described for completeness. We have now updated Section 2.2 to clarify how bus route detection and text reading are implemented using the Florence-2 model and image processing pipeline. Kindly see lines 138 to 148.
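As an illustration of the kind of post-processing such a pipeline might use, the sketch below parses a route number and destination out of OCR-style text returned for a bus image; the function, example string, and regular expression are hypothetical and are not taken from Section 2.2.

```python
# Hypothetical post-processing sketch, not the pipeline described in Section 2.2.
# It assumes the vision-language model has already returned raw OCR text from a bus
# image and shows how a route number and destination could be extracted for speech.
import re


def parse_bus_text(ocr_text: str):
    """Return (route_number, destination) if a leading route number is found."""
    match = re.match(r"\s*(\d{1,3}[A-Z]?)\s+(.*)", ocr_text)
    if not match:
        return None
    return match.group(1), match.group(2).strip()


parsed = parse_bus_text("42A City Centre via Main Street")
if parsed:
    route, destination = parsed
    print(f"Bus {route} to {destination}")  # fed to text-to-speech in the app
```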
2. The methodology states Florence-2 VLM recognizes "facial expressions" (Page 3), but Page 4 clarifies it "cannot recognize emotions." This contradiction needs resolution.
Thank you for the comment. We acknowledge that this was a text error. SnapStick detects facial expressions such as smiling and sadness. We have revised the text accordingly. Kindly see lines 153 to 154.
3. Table 1 and table 2 Column headers (Q1–Q10) or Q1-Q5 lack descriptions of the SUS statements. Subject S9’s total score (35) is implausible given individual responses (e.g., Q1=4, Q5=3). Recalculate scores—SUS ranges 0–100, and 35 suggests extreme dissatisfaction inconsistent with other feedback.
We acknowledge that the SUS column headers (Q1–Q10) in Table 1 and the custom questionnaire items (Q1–Q5) in Table 2 lacked sufficient context for readers. In the revised manuscript, we have now included the full descriptions of the 10 SUS statements, following the standard SUS format (kindly see Table 1). For Table 2, we ensured that all five custom questions are clearly listed in the Data Collection section (lines 191–204) for easy reference.
We appreciate the reviewer’s careful attention to detail. We have thoroughly recalculated Subject S9’s SUS score using the standard SUS scoring methodology. The score of 35 is in fact correct, based on the participant’s original responses. While some items (e.g., Q1 = 4, Q5 = 3) may appear mid-to-high on the 5-point scale, other responses (especially on negatively framed questions) were low, resulting in a reduced total score.
The value of 35 is thus not an error but reflects this specific participant’s individual scoring pattern, which we have reported exactly as received. To acknowledge this discrepancy in the manuscript, we have added a clarification note in the Results section. The note explains that due to this outlier, the overall SUS mean did not reach closer to the maximum score of 100, despite high ratings from the majority of participants. Importantly, the participant did not report any technical or usability issues, suggesting the low score may be due to individual response variation rather than system dissatisfaction. Kindly see lines 229 to 234.
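For reference, the standard SUS scoring rule (Brooke, 1996) that makes such an outcome possible is sketched below: odd-numbered (positively worded) items contribute (response - 1), even-numbered (negatively worded) items contribute (5 - response), and the sum is multiplied by 2.5. The response vector in the example is hypothetical, chosen only to show how a total of 35 can arise when Q1 = 4 and Q5 = 3; it is not Subject S9's actual data.

```python
# Standard SUS scoring: odd items contribute (response - 1), even items contribute
# (5 - response), and the sum is scaled by 2.5 to a 0-100 score. The example responses
# are hypothetical and are NOT Subject S9's actual answers.
def sus_score(responses):
    """responses: list of ten 1-5 ratings for SUS items Q1-Q10, in order."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5


# Mid-to-high answers on some positive items combined with unfavourable answers on
# the negatively framed items can still yield a low overall score such as 35.
print(sus_score([4, 4, 3, 4, 3, 4, 2, 4, 2, 4]))  # -> 35.0
```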
4. In table 2, Q5 responses ("No" for all subjects) contradict Section 3’s claim that "18.2% mentioned occasional ambiguity."
Thank you for pointing this out. Upon reviewing the manuscript, we acknowledge that the phrase "occasional ambiguity" in Section 3 was a misstatement. What we intended to report was that 18.2% of participants selected "Neutral" when asked about their confidence while using SnapStick, rather than indicating any explicit problems or flaws (which were addressed in Q5 of Table 2, where all participants answered "No").
We have now rephrased the sentence in Section 3 to accurately reflect the user responses by replacing "occasional ambiguity" with "neutral confidence rating". This revision ensures consistency between the survey results and the text. Kindly see lines 240 to 242.
5. Testing occurred only in "one indoor experimental room" (Page 3). Claims about "real-world environments" (Page 6) and "bus route detection" are unsupported.
We thank the reviewer for this important observation. While the primary user testing was indeed conducted in a controlled indoor experimental room, we would like to clarify that additional validation images were also captured from real-world contexts—including views through windows and random photos of different indoor environments—to ensure diversity in scene content.
For the bus route detection feature, we used a set of pre-captured bus images for validation purposes. At this stage, we intentionally avoided real-time outdoor testing with blind participants due to safety and ethical considerations. However, the model’s ability to recognize bus numbers and destination text was tested using real images of buses under varied lighting and signage conditions.
We have now updated the Methodology section to better reflect the nature and scope of the testing and to avoid overstating claims regarding real-world deployment. Kindly see lines 86 to 87, 138 to 148, 177 to 180.
6. Participants were recruited from the authors’ institute (Page 3), potentially introducing selection bias. Disclose recruitment criteria.
We thank the reviewer for this important observation. As noted, participants were recruited from the database maintained at our institute. Databases of blind individuals are relatively rare, and not all institutions have access to such resources. Fortunately, our institute maintains a large and diverse participant pool, allowing us to randomly select individuals who were available and willing to participate, without applying any selection criteria that could introduce bias.
We acknowledge that working with participants from external institutions would further strengthen generalizability. However, institutional policies and ethical constraints often limit data sharing and cross-institute recruitment in studies involving vulnerable populations. We have clarified this point in the Methodology section to ensure transparency regarding recruitment. Kindly see lines 94 to 100.
7. Page 3 states the VLM runs on a "local server," but Page 5 mentions "client and server data transmission." Clarify if processing is truly local (on-device) or requires a separate server.
We thank the reviewer for this observation. To clarify: the Florence-2 Vision–Language Model does not run directly on the smartphone device. Instead, it is hosted on a separate private local server (e.g., a laptop or desktop within the same network), which handles image processing. The smartphone acts as a lightweight client, responsible for capturing images and receiving responses via a secure local network connection.
We use the term “local server” to indicate that processing is performed within the user’s private environment (i.e., not cloud-based), but it does require a separate machine from the smartphone. We have updated the manuscript to clarify this architecture. Kindly see lines 117 to 123.
8. "VI" (visually impaired) and "VIP" (visually impaired people) are used interchangeably (e.g., Abstract vs. Section 2). Standardize to "VI" per modern conventions.
Thank you for your comment. We have standardized to VI as per the suggestion.
9. Claims of SnapStick’s superiority over SeeingAI (Page 8–9) lack quantitative data (e.g., accuracy metrics). Provide statistical comparisons rather than subjective endorsements.
We thank the reviewer for this valuable observation. In response, we would like to clarify that the manuscript does include quantitative metrics supporting SnapStick’s performance. Specifically, we report a recognition accuracy of 94%, along with inference latency (1.7 ± 0.2 seconds) and power consumption (9.8% over 30 minutes), all measured in a controlled evaluation on a mid-range smartphone. Kindly see lines 248 to 263.
Additionally, we compare these technical results with reported satisfaction levels and performance limitations of other tools, including SeeingAI, in Table 3 using a Function–Performance–Cost (FPC) framework. While some comparisons are necessarily based on published user feedback (e.g., satisfaction percentages for SeeingAI), our intention is to combine objective system performance data with real-world usability insights to provide a well-rounded perspective. For the statistical comparison, kindly see lines 333 to 339.
10. An SUS score of 84.7 is described as "A+" and "96th–100th percentile," but industry benchmarks vary. Cite Sauro’s work explicitly for context.
We thank the reviewer for this helpful suggestion. To ensure clarity and alignment with established industry standards, we have now cited Sauro and Lewis’s work on SUS interpretation benchmarks. According to their grading scale, a score of 84.7 corresponds to an “A+” rating and falls within the 96th–100th percentile, indicating excellent usability. We have revised the text in the Results section to include this citation and provide appropriate context for interpreting the score. Kindly see lines 226 to 228.
11. The requirement for a locally hosted server (Page 9) excludes users without technical expertise. Propose cloud/offline alternatives to improve accessibility.
We appreciate the reviewer’s important observation regarding accessibility and deployment barriers for non-technical users. We agree that requiring a locally hosted server may limit adoption among users without the necessary hardware or technical background. In response, we have expanded the Future Work section to outline two practical solutions: (1) a cloud-based inference option for users with reliable connectivity, and (2) on-device model compression and deployment using techniques such as quantization or knowledge distillation. These alternatives aim to provide users with flexibility based on their technical capability and connectivity status, ultimately making SnapStick more accessible and scalable. Kindly see lines 335 to 367.
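For illustration, the sketch below applies PyTorch post-training dynamic quantization to a toy model, which is the general technique referred to above; a real Florence-2 deployment would require a model-specific compression pipeline, so this is only a sketch of the concept, not a proposed implementation.

```python
# Minimal sketch of post-training dynamic quantization with PyTorch on a toy model.
# The toy network stands in for a transformer block; it is not the actual
# vision-language model used by SnapStick.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))

# Replace Linear layers with int8 dynamically quantized versions to shrink the model
# and speed up CPU inference on resource-constrained devices.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

with torch.no_grad():
    out = quantized(torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 256])
```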
12. Consider including citations of Exploration, 2024, 4, 20230146
Thank you for the suggestion. We reviewed the article Exploration, 2024, 4, 20230146; however, we did not find it directly relevant to the scope or content of our current study. Therefore, we have opted not to include it in the manuscript at this time.
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have completed the paper revisions according to the reviewers' comments.
Reviewer 4 Report
Comments and Suggestions for Authors
Thanks for the invitation to review this work. The authors have addressed the previous concerns, and the article is recommended for publication after a careful proof check.