Abstract
The latest AI advancements have created opportunities for automated scoring and diagnosis systems that interpret and evaluate students' written solutions and assist teachers with grading, yet detecting and describing the numerical values and spatial locations of key elements in students' handwritten solutions to mathematics tasks remains a technical challenge for computer vision. This study reports the development and evaluation of an AI-based platform, called Visual Translator (VT), that automatically detects and describes the key visual information essential to subsequent auto-grading and diagnosis. The VT was trained on a private dataset of students' handwritten solution images, in which human experts annotated the key elements to establish ground truth. We evaluated VT performance by comparing its fraction value identification accuracy and location detection accuracy with those of available LLMs, using the human expert annotations as the reference standard. Results suggested that VT surpassed GPT and Grok in fraction value identification and outperformed Gemini, the only LLM that supports image segmentation, in location detection. This model serves as a first step toward the ultimate goal of classifying problem-solving strategies and error types in students' handwritten solutions. Implications for computer vision research and for auto-grading and diagnosis in K-12 mathematics education are discussed.