Automating Spatial Visualisation of Handwritten Vector Equations Using Large Vision Models in Pre-Tertiary Mathematics

Lim, Kenneth Y. T.; Le, Nguyen Thanh Minh; Chanoudam, Sopheap

doi:10.3390/mti10060068

Open AccessArticle

Automating Spatial Visualisation of Handwritten Vector Equations Using Large Vision Models in Pre-Tertiary Mathematics

by

Kenneth Y. T. Lim

^1,*

,

Nguyen Thanh Minh Le

²

and

Sopheap Chanoudam

²

¹

National Institute of Education, Nanyang Technological University, Singapore 637616, Singapore

²

Independent Researcher, Singapore 357689, Singapore

^*

Author to whom correspondence should be addressed.

Multimodal Technol. Interact. 2026, 10(6), 68; https://doi.org/10.3390/mti10060068 (registering DOI)

Submission received: 26 April 2026 / Revised: 8 June 2026 / Accepted: 11 June 2026 / Published: 14 June 2026

Download

Browse Figures

Review Reports Versions Notes

Abstract

Understanding advanced pre-tertiary mathematics, particularly three-dimensional vectors, demands robust spatial reasoning skills that many students find challenging to develop through traditional pedagogical methods. This study proposes and evaluates an innovative educational tool that leverages large vision models to automate the conversion of handwritten vector equations into accurate 3D graphical representations. By interpreting students’ handwritten input using advanced computer vision, the system provides immediate, interactive visual feedback to bridge the cognitive gap between abstract symbolic notation and tangible geometric concepts. We evaluated the system using a dataset of 1000 handwritten vector equations typical of the Singapore-Cambridge GCE ‘A’ Level H2 Mathematics syllabus. Our findings demonstrate that while GPT-4o serves as a capable baseline, achieving 84.6% accuracy with multi-shot prompting, newer variants such as GPT-4.1-mini offer superior performance, reaching 91.4% accuracy with significantly higher computational efficiency. The results confirm that AI-powered visualisation tools can effectively interpret complex spatial mathematical layouts when guided by optimal prompt engineering. Implementing such technology in educational settings presents a viable, scalable, and cost-effective method to democratise learning support, fostering independent study and enhancing students’ conceptual comprehension of spatial mathematics.

Keywords:

computer vision; mathematics education; educational technology; generative AI; large language models

1. Introduction

In the demanding academic landscape of Singapore’s pre-tertiary education, the Singapore-Cambridge GCE ‘A’ Level H2 Mathematics curriculum stands out for its rigour and depth. This advanced syllabus requires students to make a substantial leap from the more procedural learning of secondary school to a world of complex, abstract reasoning. A prime example of this challenge lies within the topic of Vectors, where students are frequently tasked with visualising and manipulating objects in three-dimensional space based on abstract equations. For many, the conceptual hurdle of translating symbolic notation into a tangible mental model is significant [1,2], often leading to difficulties in grasping the core principles of vector operations like cross products or geometric interpretations. Traditional pedagogical methods, though foundational, may not always suffice to bridge this cognitive gap for every student, highlighting a pressing need for more dynamic and intuitive educational tools that can bring these abstract mathematical concepts to life.

To address this educational challenge, this research project proposes the development and evaluation of an innovative system that harnesses advanced artificial intelligence to offer a new mode of learning support. By leveraging the sophisticated visual interpretation capabilities of the GPT-4o model, this tool is designed to automate the process of converting handwritten vector equations directly into accurate, three-dimensional graphical representations. The system employs a combination of computer vision and machine learning to analyse the nuanced and often ambiguous nature of handwritten mathematics, extracting the necessary components—vectors, scalars, and operations—to generate visualisations using the Matplotlib library (version 3.10.3). This approach aims to provide students with an immediate and interactive bridge between their own pen-and-paper work and the underlying geometric concepts, transforming an often-frustrating task into an accessible and insightful experience. This research, therefore, explores a practical application of AI in education, seeking to enhance student comprehension and close the gap between human intuition and computational analysis in the study of advanced mathematics.

2. Objectives

This project focuses on creating a novel system that leverages GPT-4o’s sophisticated visual comprehension capabilities to automate the graphical representation of handwritten vector equations. Through the application of machine learning and computer vision, the system will decipher and interpret various handwritten mathematical notations. Specifically, GPT-4o will be prompted to accurately identify and extract essential elements, including variables, coefficients, operators, and vector symbols, distinguishing between visually similar characters such as ‘x’, multiplication signs, and cross product notations. The gathered data will subsequently be utilised to produce accurate graphical outputs using the matplotlib library as well as parsed in for calculation.

Automating the visualisation of handwritten vector equations, this system provides vital assistance to junior college students, making the abstract concepts within the Vectors chapter more comprehensible. This methodology is expected to considerably improve the efficiency and precision of mathematical visualisation, eliminating the laborious and error-prone process of manual plotting. Additionally, this research contributes to the wider domain of AI-powered scientific instruments, demonstrating the capacity of advanced language models like GPT-4o to connect human insight with computational analysis. Finally, the research also examines the feasibility of applying such visualisation tools into the real-world context by examining factors such as running cost, accuracy and consistency, in hope of creating a practical product for students’ wide-use.

3. Literature Review

3.1. The Pedagogical Importance of Spatial Visualisation Tools

The development of robust spatial skills is widely recognised as being critical for success in advanced mathematics, particularly in fields such as geometry and multivariable calculus that demand a strong grasp of three-dimensional space [3,4]. Students often struggle to translate abstract equations into tangible mental models, a cognitive gap that traditional teaching methods may not always bridge effectively. Recent research continues to affirm the powerful, predictive relationship between spatial reasoning and mathematical achievement [5], underscoring the need for pedagogical tools that actively cultivate these skills.

A significant body of work demonstrates that technological interventions can successfully enhance these abilities. A study by Herrera et al. [6] provides compelling evidence for this, showing that the integration of tools like Augmented Reality (AR), Virtual Reality (VR), and 3D printing into a multivariable calculus curriculum yielded substantial gains in students’ spatial visualisation capabilities. In their experiment, the group using these technologies saw a 25% increase in their spatial skills, as measured by the Revised Purdue Spatial Visualisation Test, compared to just a 5% increase in the control group. This finding reinforces the idea that allowing students to interactively manipulate, rotate, and view mathematical objects from multiple angles offers a more intuitive pathway to understanding complex spatial relationships than static, two-dimensional representations alone. The work of Herrera et al. [6] establishes a clear precedent: purpose-built technological tools, when combined with a sound pedagogical framework, are highly effective at developing the foundational spatial skills necessary for higher-level mathematics.

3.2. Learning with Visualisations Helps: A Meta-Analysis of Visualisation Interventions in Mathematics Education

Herrera et al.’s conclusion [6] is strongly supported by a recent meta-analysis by Schoenherr et al. [7]. which synthesised the results of 41 different studies involving over 10,000 learners, confirmed that interventions using external visualisations have a significant and robustly positive medium effect (g = 0.504) on mathematics learning outcomes. The analysis found this positive effect to be consistent across various mathematical topics, benefiting not just visually oriented domains like geometry but also more abstract areas such as algebra, calculus, and number operations. Interestingly, the meta-analysis did not find a significant difference in effectiveness between interventions that used digital technology and those that used traditional analog media like paper and pencil. This suggests that the inherent benefit of visualisation is not tied to the sophistication of the medium itself, but rather to its capacity to make abstract concepts understandable and to help learners make connections between mathematical ideas. The consistent, positive, and lasting effects across different age groups and mathematical domains underscore the universal applicability and power of visualisation as a fundamental tool for mathematics education.

3.3. Image Recognition Performance of GPT-4V(ision) and GPT-4o in Ophthalmology: Use of Images in Clinical Questions

Focusing specifically on the capabilities of the underlying technology, a study by Tomita et al. [8] directly compared the image recognition performance of different Generative Pre-trained Transformer (GPT) models in the specialized field of ophthalmology. The research evaluated the diagnostic accuracy of GPT-4V with and without images against the newer GPT-4o model, which also processed images. Across 580 clinical questions, the multimodal GPT-4o demonstrated the highest accuracy (77.1%), significantly outperforming both the multimodal GPT-4V (71.0%) and the text-only GPT-4V (66.7%). The authors concluded that the addition of image data is crucial for enhancing the performance of these models in diagnostic tasks and that GPT-4o represents a significant advancement in image analysis capabilities over its predecessor. This provides direct empirical validation for selecting GPT-4o for tasks requiring sophisticated visual interpretation, such as recognizing the nuanced and complex structures within handwritten notations.

3.4. Handwritten Mathematical Expression Recognition Using Deep Learning Techniques

The specific technical problem of Handwritten Mathematical Expression Recognition (HMER) presents unique difficulties that position this research within a well-established field. As noted by Kalpana & Benita [9], traditional OCR systems often fail due to the inherently “two-dimensional structure of mathematical notation” and wide variations in handwriting. To address this, their work exemplifies a common and effective approach within the HMER field: developing a specialised deep learning framework. They utilised a Convolutional Neural Network (CNN) trained specifically to classify individual symbols, achieving a high accuracy of 99.55% on this sub-task. These classified symbols are then reconstructed into a complete expression for evaluation. This methodology highlights a classic pipeline in HMER—segmentation, classification, and reconstruction—which serves as a benchmark against which more recent, end-to-end approaches using large, general-purpose vision models like GPT-4o can be compared.

4. Methodology

This experiment investigates the potential of using GPT-4o to train a computer vision model. The focus is on how verbal prompts can facilitate the analysis of handwritten equations, enabling the extraction and processing of data to generate corresponding graphs, as well as to compute the equation itself.

4.1. Preparing Input Images of Handwritten Equations

In this experiment, we prepared samples of handwritten vector equations, each representing a different task covered in the GCE “A” Level H2 Mathematics syllabus, to be analysed using GPT-4o.

The tasks are as follows:

Vector Recognition (Figure 1)
Addition of Vectors (Figure 2)
Subtraction of Vectors (Figure 3)
Dot Product of Vectors (Figure 4)
Cross Product of Vectors (Figure 5)
Magnitude of a Vector (Figure 6)

4.2. Image Processing, GPT-4o’s Computer Vision & Prompt Engineering

We recognise GPT-4o as a versatile and powerful OpenAI API capable of processing multimodal inputs, including text, images, and audio, while generating structured outputs with the help of prompting. In particular, GPT-4o’s ability to process images as inputs and produce structured outputs is invaluable for tasks such as analysing handwritten mathematical equations and providing the necessary data to create visualisations for mathematics exercises.

Firstly, the image must be encoded from bytes into a text-based format to ensure that it is in a compatible form with the GPT-4o model for processing, as shown in Listing 1. Then, we adjusted the parameter and added prompts to start the model on image processing, like in Listing 2.

Listing 1. Python-based function (version 3.13.2) to Encode the Image.

Listing 2. Python-based code snippet (version 3.13.2) using GPT-4o to Analyse the Input Image.

The temperature was set to 0.1 and the top_p parameter to 0.8 to minimise randomness, therefore increasing the accuracy and consistency of the output. The maximum output token was also set to 200 to ensure a more focused and computationally cost-efficient response from the model, while not compromising on potential truncation. The API model used for this experiment is “gpt-4o”, or “gpt-4o-2024-11-20” in full.

We acknowledge the importance of text prompts as a method to instruct the GPT-4o model by specifying criteria that guide the model in generating structured outputs. However, we are still keen to investigate the effects of prompting by asking GPT-4o to complete the tasks with different levels of context and varying prompting techniques. As such, we decided to experiment with and carry out accuracy analysis with the following methods:

(i): Zero-shot

We gave the model a direct instruction on the task to be performed without providing any specific examples of the desired output. The model must therefore rely entirely on its pre-existing knowledge of vector mathematics and data structures to interpret the request and generate the output in a list format. This method is used to establish a baseline performance. It tests the model’s fundamental, unguided ability to recognise handwritten vectors and operations and to structure the data as requested. This helps quantify the value added by more detailed prompting.

Observe the prompt in Listing 3:

Listing 3. Prompt used for zero-shot prompting.

(ii): Few-shot

We provided the model with a few examples (shots) that demonstrate the expected input-to-output pattern. By seeing examples for different scenarios (e.g., vector addition, magnitude), the model learns the specific formatting rules and task requirements through in-context learning. It then applies this learned pattern to the new, unseen handwritten equation. The goal is to test whether providing concrete examples significantly improves the model’s accuracy and its adherence to the required structured output format. This reduces ambiguity and helps the model generalise the task correctly.

Observe the prompt in Listing 4:

Listing 4. Prompt used for few-shot prompting.

(iii): Chain-of-thought

Chain-of-thought (CoT) prompting is an advanced technique that instructs the model to break down a complex problem into a series of intermediate reasoning steps before arriving at a final answer [10]. In this prompt, the model is explicitly asked to first identify the individual components (numbers, symbols), then group them into vectors, determine the operation, and only then construct the final output based on its articulated reasoning. This forces a more deliberate and logical process to determine if forcing such a process can improve accuracy, especially for ambiguous or complex handwritten inputs. By breaking the problem down, the model is less likely to make intuitive leaps that lead to errors, which directly addresses the identified limitation of the model’s spatial recognition capabilities. The primary technique is Chain-of-Thought (CoT), which encourages the model to decompose the task. The prompt uses an explicit step-by-step instruction format to guide this decomposition.

Observe the prompt in Listing 5:

Listing 5. Prompt used for chain-of-thought prompting.

(iv): Multi-shot

The multi-shot prompting strategy expands on the few-shot approach by providing the model with a larger and more diverse set of examples. This technique involves augmenting the prompt to include not only basic operations but also exemplars for more complex tasks such as cross product, dot product, and scalar multiplication, thereby covering a wider spectrum of the H2 Mathematics syllabus for Vectors.

The rationale for this method was to determine if a richer contextual basis could improve accuracy and help mitigate the model’s limitations by providing more varied patterns for generalization. This approach relies on example-driven instruction to enhance the model’s ability to adhere to a strict output format across a diverse range of inputs. By demonstrating the correct handling of more nuanced mathematical notations, this method is designed to produce higher fidelity interpretations, making it a critical step in evaluating the upper bounds of prompting effectiveness for this application.

Observe the prompt in Listing 6:

Listing 6. Prompt used for multi-shot prompting.

4.3. Processing the Data and Visualisation

The structured results generated by GPT-4o that would be parsed into a Python programme utilising the matplotlib library to be rendered as three-dimensional graphs, enabling the visualisation of the exercise. The processed results must be in the format [vector1, vector2, …, ‘operation’], which encompasses the necessary information to plot the graphs for effective visualisation. See Figure 7 and Figure 8 for 2 examples of such formatted results and Figure 9. for an example of the final graph visualisation.

The purpose of this stage is to provide students with immediate and unambiguous visual feedback on the handwritten equation. The system is designed for pedagogical clarity, adhering to the principles of visual distinction, appropriate scaling and sufficient annotation (titles and legends). This allows students to easily identify the components of the operation and its solution, as well as avoiding misinterpretation.

This visualisation module is the final component of the system’s workflow, responsible for translating the abstract, structured data from the AI into a concrete, easy-to-understand graphical format for the end-user.

It must be noted that, in its current iteration, the system outputs these visualisations strictly as static images. While they provide clear two-dimensional representations of three-dimensional spaces, they do not presently support interactive manipulation, such as real-time zooming or altering the perspective.

4.4. Solving Vector-Related Problems

Despite GPT-4o having the ability to solve equations with commendable accuracy, we wanted to experiment with an in-house function with hard-coded formulas that can ensure calculations with perfect accuracy while also being more cost-efficient. As such, we integrated a solution-solving component into the project in the form of an object-oriented programming (OOP) Python programme. This allowed the system to solve equations parsed in the earlier stage with low latency and processing power needed.

The processed data is then fed into dedicated OOP functions, which will carry out the calculations and output the solutions. A sample of such functions can be seen in Listing 7:

Listing 7. Snippet of Python-based OOP function (version 3.13.2) to calculate vector-related problems.

4.5. Performance Analysis

To thoroughly evaluate GPT-4o’s Computer Vision component in vector visualisation, we conducted extensive empirical testing on a self-generated dataset of 1000 handwritten mathematical equations. These test cases were carefully designed to encompass a wide spectrum of common and basic vector expressions found in the Singapore-Cambridge GCE “A” Levels H2 Mathematics syllabus, including varying symbol styles, spatial arrangements, and formatting conventions, to simulate realistic scenario diversity.

Our assessment relied on a bespoke grading system that holistically evaluated each output based on three criteria:

Symbol and Number Recognition Accuracy: Correct identification of all symbols, numbers, and operators.
Positional Accuracy: Correct spatial placement, ensuring the reconstructed equations maintained structural integrity.
Formatting Compliance: Adherence to the specified equation format necessary for subsequent processing stages.

To ensure uniform notation formatting across all analytical segments, the grading algorithm categorised the models’ outputs into three distinct performance tiers using an automated LLM-as-a-judge evaluation matrix:

Correct (Full Credit): The model perfectly identifies all vector components and mathematical operations, and the output is structurally identical to the strictly requested formatting (e.g., [[1, 2, 3], [4, 5, 6], ‘addition’]).
Partially Correct (Partial Credit): The model correctly captures the majority of the mathematical intent but falls short of a perfect conversion. This includes outputs with minor transcription errors (e.g., misreading a single integer), correctly identifying the vectors but mislabelling the operation (or vice versa), or successfully extracting the correct mathematical data but failing to adhere to the strict Python list formatting requested by the prompt.
Failed (No Credit): The model completely fails to recognise the handwritten input as a mathematical expression, hallucinates major structural components, or produces an output that cannot be reliably parsed by the subsequent programmatic stages (e.g., admitting inability to read the image or generating plain text without data structures).

These tiered classifications were used to calculate the aggregate accuracy scores, providing a nuanced view of the models’ capabilities beyond simple binary correctness.

4.6. Comparison with Other Models

To contextualize our findings and evaluate the robustness and scalability of GPT-4o’s Computer Vision capabilities for vector visualisation, a further comparative analysis was conducted across a set of models with similar foundational architectures and intended functionalities. The models selected—OpenAI o4-mini, GPT-4o-mini, GPT-4o-turbo, GPT-4.1, GPT-4.1-nano, and GPT-4.1-mini—are variants tailored for different operational trade-offs, including computational efficiency, processing speed, and resource consumption [11]. These models differ primarily in their parameter counts, optimization strategies, and tokenization efficiencies, making them suitable for understanding how model scale and configuration influence performance metrics in vector visualisation tasks.

Given their shared basis in large language models with integrated vision processing, these variants are comparable in their capacity to interpret and manipulate visual data embedded within text prompts. Their architectural similarities allow a direct comparison of key performance indicators such as accuracy, processing time, and token utilization (pricing for each model is included in Appendix A), facilitating an understanding of the trade-offs involved in deploying resource-constrained models versus more comprehensive configurations.

The models are chosen for their architectural lineage—branching from the GPT-4 architecture—ensuring a consistent conceptual core while differing in their computational capacities and optimization nuances. This selection aligns with the literature emphasizing the scalability of GPT models for multimodal tasks [12]. For instance, smaller variants like o4-mini, GPT-4o-mini, and GPT-4.1-mini serve as lightweight alternatives, ideal for rapid processing with slightly reduced precision, whereas GPT-4o-turbo and GPT-4.1 aim to strike a balance between speed and accuracy, optimized for real-time applications.

All models were prompted using multi-shot technique. Temperature was set to 0.1, the top_p parameter was set to 0.8 and maximum output token was set to 200, where applicable, for the same reason stated in Section 4.2. The exact API version for the models used are “gpt-4o-mini-2024-07-18”, “gpt-4.1-2025-04-14”, “gpt-4.1-mini-2025-04-14”, “gpt-4.1-nano-2025-04-14”, “o4-mini-2025-04-16”, and “gpt-4-turbo-2024-04-09”, for the corresponding models. The accuracy of the models would be evaluated by the same process stated in Section 4.5.

5. Results

5.1. Pipeline

As a result of this project, we have developed a pipeline for a mathematics visual assistance tool by leveraging advanced machine learning technologies, such as GPT-4o, for computer vision. This tool analyses handwritten equations and generates graph visualisations to address mathematical exercises effectively. The complete execution workflow of the system, from the user’s handwritten input to the final pedagogical feedback, is visualised in Figure 10 and Table 1.

Some example visualisations of other vector operations can be found in Figure 11, Figure 12, Figure 13 and Figure 14.

In addition, the initial effort to solve vector-related questions also yielded highly accurate results. As the calculation function relates closely to the visualisation function, we decided to combine them into one single function. The calculated results would be shown in the graph visualiser itself.

5.2. Performance of GPT-4o

As detailed in Table 2, there is a clear correlation between the complexity of the prompt and the model’s accuracy. The zero-shot approach, which provided no examples, established a baseline accuracy of 48.24%. This relatively low performance highlights the model’s difficulty in interpreting the task and adhering to the required output format without guidance.

Introducing examples significantly improved performance. The few-shot method, which included a small number of examples, boosted accuracy to 80.67%. This demonstrates the effectiveness of in-context learning for guiding the model. The multi-shot technique, which expanded the variety of examples to cover more of the H2 Mathematics syllabus, further increased accuracy to 84.58%, representing the highest performance among the tested methods.

The chain-of-thought (CoT) prompt, which guided the model to break down its reasoning process, yielded an accuracy of 63.92%. While an improvement over the zero-shot baseline, it was less effective than the few-shot and multi-shot methods. This suggests that for this specific task of structured data extraction, providing clear examples of the desired output is more effective than instructing the model on the reasoning process.

In terms of efficiency, the multi-shot prompt, despite having the highest average input tokens (674.40), had a comparable response time (1.458 s) to the other guided methods and a marginally higher cost per query ($0.001874). This indicates that the additional context improves accuracy without creating a significant trade-off in performance, making it the most effective and reliable prompting strategy for this application.

5.3. Performance of Other Models

The results shown in Table 3, reveal a clear trade-off between accuracy, cost, and speed. GPT-4.1-mini achieved the highest accuracy at 91.42%, surpassing all other models, including the primary GPT-4o. It also demonstrated high efficiency with a low response time (0.795 s) and the second-lowest cost per query ($0.000276). Its counterpart, GPT-4.1, also performed strongly with an accuracy of 89.58% and a fast response time, though at a higher cost. These results position the GPT-4.1 series, particularly the ‘mini’ variant, as a highly effective and efficient alternative for this specific task.

In contrast, the models branded as ‘turbo’ and the older o4-mini were less effective. GPT-4o-turbo recorded a lower accuracy of 70.67% and had the longest response time (2.759 s) and highest cost ($0.007307), suggesting it is not optimised for this type of visual interpretation task. The o4-mini model performed the poorest, with an accuracy of only 51.50%.

The nano variant, GPT-4.1-nano, offered the most cost-effective solution at just $0.000117 per query and the fastest processing time (0.467 s). However, this efficiency came at the cost of accuracy, which stood at 70.92%. This makes it a viable option for applications where speed and low cost are critical, and a lower level of accuracy is acceptable.

6. Discussion

6.1. Overview of Findings

6.1.1. Viability of the System

Our exploration into applying computer vision, particularly through the GPT-4 model, for automating the visualisation of handwritten vector equations has revealed a landscape rich with both opportunities and challenges. Although the capability to generate graphical representations directly from handwritten inputs signifies a major advancement in improving accessibility and efficiency in mathematics education, the encountered limitations highlight the inherent difficulties in creating AI systems that can precisely interpret the details of handwritten mathematical notation.

The system’s ability to convert handwritten equations into visualisations stands as a significant step toward integrating traditional pen-and-paper approaches with modern digital tools. This functionality holds the promise of transforming mathematics learning by offering students a more intuitive and interactive means to engage with and comprehend abstract concepts. By removing the necessity for manual plotting, the system also aims to boost efficiency and minimize errors, enabling learners to concentrate more on the core mathematical ideas rather than the technicalities of visualisation.

Our research adds to the existing knowledge by offering empirical evidence about the opportunities and challenges of applying AI to recognize and visualize handwritten mathematical expressions. The limitations we found point to directions for future work, including building larger, more varied datasets of handwritten math and improving AI models’ ability to understand spatial relationships.

6.1.2. Cost and Effect

The findings from this study indicate that implementing an AI-powered visualisation tool for mathematics education is not only viable but also potentially transformative. From a pedagogical standpoint, the system directly addresses the cognitive gap many students face when translating abstract vector equations into tangible, three-dimensional models. By providing immediate visual feedback on their handwritten work, the tool can enhance spatial reasoning skills, which are critical for success in advanced mathematics. This aligns with research showing that interactive visualisations lead to better understanding of complex spatial relationships compared to static representations.

For large-scale implementation in an educational setting like the Singapore ‘A’ Level system, the balance between accuracy and cost is crucial. The performance analysis shows that highly accurate results can be achieved at a relatively low cost per query. For instance, the GPT-4.1-mini model delivered over 91% accuracy at a cost of approximately $0.000276 per equation. This low operational cost suggests that such a tool could be deployed to a large student body without incurring prohibitive expenses.

The primary benefit lies in its potential to democratise learning support. Students could use the tool for independent study, getting instant clarification and visual reinforcement without needing immediate teacher intervention. This could level the playing field, offering powerful assistance to those who struggle with spatial visualisation. While our in-house solver demonstrates that calculations can be handled more efficiently offline, the AI’s role in interpreting the initial handwritten input is the key innovation. It is clear that a small computational cost provides a significant pedagogical benefit, making abstract mathematics more accessible and intuitive for every student.

6.1.3. GPT-4o’s Position Amongst the Models

While this project was initially designed around the capabilities of GPT-4o, the comparative analysis positions it as a strong, but not leading, model for this specific application. With an accuracy of 84.58% using the multi-shot technique, GPT-4o is a capable and reliable tool for converting handwritten vector equations into structured data. Its performance validates the core premise of the research—that advanced vision models can bridge the gap between handwritten work and digital visualisation.

However, when compared to its peers, GPT-4o is outperformed in both accuracy and cost-efficiency by the GPT-4.1 series models. Specifically, GPT-4.1-mini emerged as the superior option, achieving a higher accuracy of 91.42% at a significantly lower cost and faster response time. This suggests that for a production-level deployment of this educational tool, where both accuracy and operational cost are critical factors, GPT-4.1-mini would be the more pragmatic choice.

GPT-4o’s main strength, as demonstrated in the ophthalmology study by Tomita et al. [13], is its advanced image analysis capability. While it performs this task well, the results suggest that newer, more specialised models have been further optimised, achieving a better balance of performance and efficiency. Therefore, GPT-4o serves as an excellent proof-of-concept and a robust baseline, but for scaling and practical implementation, more recent and cost-effective models present a more compelling value proposition.

6.2. Limitations

Fundamentally, computer vision relies on analysing large volumes of data to improve its performance and deliver results with high accuracy [14]. However, despite being extensively trained with quality training data, GPT-4o still tends to produce occasional inaccuracies. These inaccuracies arise primarily due to:

Insufficient Dataset of Handwritten Mathematical Equations: A primary challenge for GPT-4o’s computer vision is the scarcity of large, comprehensive datasets of handwritten mathematical equations. Such equations feature a unique mix of symbols and characters, and without extensive training data reflecting this complexity, the model’s ability to accurately interpret image inputs is limited, leading to potential errors in analysis. There has been extensive datasets on mathematical equation such as the CROHME23 dataset [15] or the MathWriting dataset [16]. However, there are no dataset dedicated to just vector-related operations and expressions. As such, GPT-4o would not be able to perfectly execute the task all the time even after being prompted sufficiently, some of such cases are demonstrated in Figure 15 and Figure 16. Similarly, as shown in Figure 17 and Figure 18, the limitations of GPT-4o’s computer vision become evident, as it sometimes fails to produce the intended or correct interpretation despite being prompted with sufficient context and clear images.

Limitations in Spatial Reasoning: While computer vision systems are adept at identifying objects, they often struggle to comprehend the spatial relationships between them. However, in mathematics, the positioning of symbols is critical to its meaning. Take Figure 19 as an example, where a human learner should understand that an arrow above a letter and surrounding vertical bars signify the magnitude of a vector. The AI, however, may recognise the individual symbols but fail to synthesise their spatial arrangement into the correct mathematical concept, limiting its ability to interpret the problem correctly, leading to a completely wrong interpretation shown in Figure 20.

Qualitative Error Characterisation: In addition to previously stated reasons that caused complete parsing failures, a closer analysis reveals several distinct syntactic and morphological patterns that consistently triggered partial parsing failures. Based on observed patterns, the vision model struggles predominantly with the following configurations:

Handwriting ambiguities frequently lead to character misinterpretation. Negative signs are particularly vulnerable; they are often missed entirely or mistaken for stray marks, which alters the mathematical value significantly (e.g., interpreting [−1, −1, 0] as [1, −1, 0]). Additionally, cramped handwriting occasionally causes the model to confuse structurally similar digits, such as 0 and 8.
When processing multiple vectors, the model applies wrapping brackets inconsistently. Instead of outputting the expected flat list structure (e.g., [[1, 2, 3], [4, 5, 6], ‘addition’]), it frequently over-nests the components (e.g., [[[1, 2, 3], [4, 5, 6]], ‘addition’]). In more complex syntactic structures, such as linear combinations, the model sometimes drops scalar multipliers entirely and extracts only the base vectors.
If the original handwritten text is tightly cramped or faded at the edges, the model occasionally truncates three-dimensional vectors, mistakenly extracting a 3D coordinate like [1, 0, 0] as a 2D coordinate [1, 0].
The model exhibits minor syntactic instability in its descriptive string labels. It frequently oscillates between plural and singular nouns (“vector” versus “vectors”) and swaps spaces for underscores (“cross product” versus “cross_product”). Such quirks require robust downstream parsing scripts to prevent strict string-matching algorithms from flagging them as complete failures.

Challenges in real-world usage: The diverse representations of mathematical expressions across different contexts and regions pertains a significant challenge. Various educational systems and mathematical curricula worldwide employ distinct notation systems, which can encode the same mathematical concept through different symbols or structural arrangements [17]. This variability can cause confusion within the AI model if it has been predominantly trained on a limited set of conventions, resulting in decreased recognition accuracy when confronted with unfamiliar formats [18]. Furthermore, handwriting quality remains a persistent hurdle. Handwritten mathematical symbols are inherently prone to ambiguities due to individual differences in writing styles, stroke thickness, and spatial placement [13]. Poor handwriting can cause different characters or numerals to appear indistinguishable or ambiguous, thus confusing the recognition system and increasing the likelihood of errors [19]. These ambiguities are further exacerbated when dealing with complex expressions involving superscripts, subscripts, or nested fractions, which require precise spatial interpretation. Environmental noise in document images further complicates recognition tasks. Common degradations such as ink smudging, crossing out extraneous annotations, blurring due to low image resolution, or background distractions can distort the visual cues essential for accurate inference [20]. Such noise can obscure critical elements of the equation or generate artifacts that mislead the model.

Collectively, these factors underscore the importance of developing more robust preprocessing techniques, diversified training datasets, and adaptable recognition frameworks to improve GPT-4o’s performance in real-world mathematical document analysis. Addressing these limitations is crucial for deploying AI systems capable of reliably supporting educational, scientific, and engineering workflows across heterogeneous environments.

Improvement: The accuracy of computer vision can be improved by further training the model with a comprehensive and relevant image dataset of students’ handwritten mathematical equations, coupled with additional text-prompt reinforcement. Such training datasets and reinforcement will guide the model with important details of the topics to develop correct interpretation of the given handwritten equations [21,22]. This approach ensures more accurate data interpretation, particularly in niche domains such as recognising vector equations.

6.3. Future Work

Future research should delve into two pivotal areas to significantly enhance AI models for handwritten mathematical expressions. The first critical area involves prioritizing the creation of substantially larger and more diverse datasets. This is paramount for improving the model’s robustness and generalization capabilities. Such datasets must incorporate a much broader spectrum of handwriting styles, ranging from neat and precise to hurried and idiosyncratic, to accurately reflect real-world variability. Furthermore, they need to include a vastly expanded library of mathematical symbols, not just common ones, but also less frequently used notations and variations. Crucially, the datasets should also encompass a wider array of spatial arrangements of these symbols, accounting for different alignments, sizes, and orientations within expressions, as these elements often carry significant meaning in mathematics. This comprehensive approach to data diversity will directly improve the model’s ability to accurately interpret varied handwritten input, making it more practical for real-world applications.

The second crucial area of research should explore innovative methods to enhance AI models’ spatial reasoning capabilities, specifically within the complex context of mathematical expressions. Unlike simple character recognition, understanding handwritten mathematics requires a deep comprehension of the spatial relationships between symbols. For instance, the position of a superscript or subscript drastically alters the meaning of an expression. This could involve integrating advanced geometric and topological concepts directly into the model’s training process. This might entail representing mathematical expressions not just as sequences of characters, but as intricate spatial graphs where nodes are symbols and edges represent their spatial relationships (e.g., above, below, inside, next to). Alternatively, researchers should focus on developing entirely new algorithms specifically tailored for spatial reasoning in mathematics. These algorithms might leverage techniques from computational geometry or graph theory to explicitly model and interpret the two-dimensional layout of mathematical expressions, moving beyond traditional sequential processing and allowing for a more nuanced understanding of the inherent spatial grammar of mathematics.

Additionally, future research should explore how AI-powered visualisation tools can be integrated into actual classroom environments. This involves examining how these tools affect student learning, motivation, and engagement, as well as identifying challenges related to their implementation and accessibility. Focusing on these areas will help improve the use of AI in math education, making learning more effective and accessible for students.

Finally, a crucial area for future development is the integration of interactive manipulation features within the generated visualisation. As the current system produces static images, it restricts the user’s ability to fully explore the generated three-dimensional geometry. Future iterations will aim to incorporate dynamic geometry capabilities, allowing students to drag vector origins, change viewing perspectives, and zoom in or out in real-time. Implementing this interactive functionality would significantly enhance the tool’s pedagogical value by enabling deeper, hands-on exploration of spatial relationships.

7. Conclusions

This research successfully demonstrates the significant potential of leveraging advanced AI models like GPT-4o to automate the visualisation of handwritten vector equations. The developed system serves as a powerful proof-of-concept, affirming that AI can act as an effective bridge between traditional, pen-on-paper mathematics and dynamic, digital learning tools, thereby enhancing both accessibility and efficiency for students. Our findings show that with appropriate prompt engineering, specifically a multi-shot approach, a high degree of accuracy in interpreting handwritten notations can be achieved.

However, the study also highlights critical limitations that must be addressed for such tools to reach their full potential. The comparative analysis revealed that while GPT-4o is highly capable, newer models like GPT-4.1-mini offer superior accuracy and cost-effectiveness for this specific task. Furthermore, the persistent challenges related to the scarcity of diverse training data and the model’s inherent limitations in spatial reasoning underscore the need for continued research and development. Future work must focus on creating richer datasets and developing new algorithms to better capture the spatial grammar of mathematics.

Ultimately, the key takeaway is that while AI presents a promising frontier for transforming mathematics education, its successful implementation depends on systematically addressing the underlying challenges of data diversity and sophisticated reasoning. By doing so, we can create robust, reliable, and truly effective tools that make abstract concepts more intuitive and accessible to all learners.

Author Contributions

Conceptualization, N.T.M.L. and S.C.; methodology, N.T.M.L. and S.C.; software, N.T.M.L.; validation, S.C.; formal analysis, N.T.M.L.; investigation, N.T.M.L.; resources, K.Y.T.L.; data curation, N.T.M.L.; writing—original draft preparation, N.T.M.L.; writing—review and editing, K.Y.T.L. and S.C.; visualisation, N.T.M.L.; supervision, K.Y.T.L.; project administration, K.Y.T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to institutional protocols.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Pricing for models used in this experiment.

Table A1. Pricing for different models (per 1 M tokens) (accurate as of 19 June 2025).

Model	Input	Cached Input	Output
gpt-4.1 (14 April 2025)	$2.00	$0.50	$8.00
gpt-4.1-mini (14 April 2025)	$0.40	$0.10	$1.60
gpt-4.1-nano (14 April 2025)	$0.10	$0.025	$0.40
o4-mini (16 April 2025)	$1.10	$0.275	$4.40
gpt-4o-mini (18 July 2024)	$0.15	$0.075	$0.60
gpt-4-turbo (29 April 2024)	$10.00	NIL	$30.00

References

Duval, R. A Cognitive Analysis of Problems of Comprehension in a Learning of Mathematics. Educ. Stud. Math. 2006, 61, 103–131. [Google Scholar] [CrossRef]
Sabah, S. Science and engineering students’ difficulties in understanding vector concepts. Eurasia J. Math. Sci. Technol. Educ. 2023, 19, em2310. [Google Scholar] [CrossRef] [PubMed]
Arcavi, A. The role of visual representations in the learning of mathematics. Educ. Stud. Math. 2003, 52, 215–241. [Google Scholar] [CrossRef]
Battista, M. The development of geometrical and spatial thinking. In Second Handbook of Research on Mathematics Teaching and Learning; Emerald Publishing Limited: Cambridge, MA, USA, 2007; pp. 843–908. [Google Scholar]
Resnick, I.; Harris, D.; Logan, T.; Lowrie, T. The relation between mathematics achievement and spatial reasoning. Math. Ed. Res. J. 2020, 32, 171–174. [Google Scholar] [CrossRef]
Herrera, L.M.M.; Ordóñez, S.J.; Ruiz-Loza, S. Enhancing mathematical education with spatial visualisation tools. Front. Educ. 2024, 9, 1229126. [Google Scholar] [CrossRef]
Schoenherr, J.; Strohmaier, A.R.; Schukajlow, S. Learning with visualisations helps: A meta-analysis of visualisation interventions in mathematics education. Educ. Res. Rev. 2024, 45, 100639. [Google Scholar] [CrossRef]
Tomita, K.; Nishida, T.; Kitaguchi, Y.; Kitazawa, K.; Miyake, M. Image Recognition Performance of GPT-4V(ision) and GPT-4o in Ophthalmology: Use of Images in Clinical Questions. Clin. Ophthalmol. 2025, 19, 1557–1564. [Google Scholar] [CrossRef] [PubMed]
Kalpana, Y.; Benita, P.S. Handwritten mathematical expression recognition using deep learning techniques. J. Neonatal Surg. 2025, 14, 516–523. [Google Scholar]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought prompting elicits reasoning in large language models. arXiv 2022, arXiv:2201.11903. [Google Scholar] [CrossRef]
OpenAI. GPT-4 Technical Report. 2023. Available online: https://openai.com/research/gpt-4 (accessed on 20 June 2025).
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
Yasuhara, T.; Watanabe, T.; Yamaguchi, T. Handwritten mathematical symbol recognition system considering writing variability. Int. J. Pattern Recognit. Artif. Intell. 2017, 31, 1757002. [Google Scholar]
Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T.B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; Amodei, D. Scaling laws for neural language models. arXiv 2020, arXiv:2001.08361. [Google Scholar] [CrossRef]
Xie, Y.; Mouchère, H.; Simistira, L.F.; Rakesh, S.; Saini, R.; Nakagawa, M.; Nguyen, C.T.; Truong, T.N. ICDAR 2023 CROHME: Competition on Recognition of Handwritten Mathematical Expressions [Data set]. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR 2023), San José, CA, USA, 21–26 August 2023. [Google Scholar] [CrossRef]
Gervais, P.; Fadeeva, A.; Maksai, A. MathWriting: A dataset for handwritten mathematical expression recognition. arXiv 2024, arXiv:2404.10690. [Google Scholar] [CrossRef]
Tengan, D.; Wang, H. Variability in mathematical notation across different cultures and curricula. Educ. Process. 2020, 7, 134–145. [Google Scholar]
Zanibbi, R.; Blostein, D. Recognition and retrieval of mathematical expressions. IJDAR 2012, 15, 331–357. [Google Scholar] [CrossRef]
Long, Y.; Wang, Z.; Huang, J. Handwritten mathematical expression recognition with deep learning: Challenges and prospects. Pattern Recognit. Lett. 2022, 157, 47–55. [Google Scholar]
Sharma, A.; Singh, K.; Mishra, R. Noise-robust handwritten mathematical expression recognition: A survey. Image Vis. Comput. 2019, 86, 1–15. [Google Scholar] [CrossRef]
Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
Zhai, X.; Kolesnikov, A.; Houlsby, N.; Beyer, L. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12104–12113. [Google Scholar] [CrossRef]

Figure 1. Vector Recognition.

Figure 2. Addition of Vectors.

Figure 3. Subtraction of Vectors.

Figure 4. Dot Product of Vectors.

Figure 5. Cross Product of Vectors.

Figure 6. Magnitude of Vector.

Figure 7. Processed Data of Addition of Vectors.

Figure 8. Processed Data of Cross Product of Vectors.

Figure 9. Example of a Final Visualisation.

Figure 10. System Execution Workflow Diagram. This infographic details the sequential pipeline: (1) capturing the user’s handwritten vector equation image; (2) AI interpretation via GPT-4o and multi-shot prompting for segmentation and text extraction; (3) parsing the standardized structured data output; (4) processing the mathematical model through the Python vector solver; and (5) rendering the final 3D visual feedback for spatial understanding.

Figure 11. Graph of Recognising Vectors.

Figure 12. Graph of Subtraction of Vectors.

Figure 13. Graph of Vector and its Magnitude.

Figure 14. Graph of Dot Product of Vector.

Figure 15. Vector v.

Figure 16. Multiple Incorrect Analysis for Vector v.

Figure 17. A handwritten vector used for accuracy analysis.

Figure 18. GPT-4o failed to recognise Figure 15 as a vector.

Figure 19. A Handwritten Vector Expression.

Figure 20. Incorrect Interpretation by GPT-4o.

Table 1. Samples of how a handwritten equation would be processed.

Operation Type	Addition	Cross Product
Original handwritten equation
Step 1: GPT-4o’s Interpretation
Step 2: Graph Produced
Step 3: Calculation

Table 2. Performance of GPT-4o on different levels of prompting on the task stated in Section 4.5.

Prompting Technique	Average Input Tokens *	Average Output Tokens	Average Cost per Query **	Average Response Time/s	Accuracy
Zero-shot ***	447.40	44.38	US$0.001562	2.504	48.2384%
Few-shot	547.40	18.80	US$0.001557	1.547	80.6667%
Multi-shot	674.40	18.81	US$0.001874	1.458	84.5833%
Chain-of-thought	550.40	18.79	US$0.001564	1.519	63.9167%

* assuming no cached input. ** pricing for GPT-4o is US$4.50 and US$10 per 1 M tokens for input and output respectively. *** without prompt engineering, the model would not output the vector in the expected format, hence accuracy metrics are eased.

Table 3. Performance of GPT-4o on different levels of prompting on the task stated in Section 4.5.

Model	Average Input Tokens *	Average Output Tokens	Average Cost **	Average Response Time	Accuracy
GPT-4o-mini	13,427.94	17.88	US$0.002025	1.114	75.6667%
GPT-4o-turbo	675.40	18.44	US$0.007307	2.759	70.6667%
o4-mini	692.93	236.11	US$0.001801	3.319	51.5000%
GPT-4.1	675.40	18.92	US$0.001502	1.174	89.5833%
GPT-4.1-nano	872.34	18.73	US$0.000117	0.467	70.9167%

* assuming no cached input. ** see Appendix A for tokens pricing.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lim, K.Y.T.; Le, N.T.M.; Chanoudam, S. Automating Spatial Visualisation of Handwritten Vector Equations Using Large Vision Models in Pre-Tertiary Mathematics. Multimodal Technol. Interact. 2026, 10, 68. https://doi.org/10.3390/mti10060068

AMA Style

Lim KYT, Le NTM, Chanoudam S. Automating Spatial Visualisation of Handwritten Vector Equations Using Large Vision Models in Pre-Tertiary Mathematics. Multimodal Technologies and Interaction. 2026; 10(6):68. https://doi.org/10.3390/mti10060068

Chicago/Turabian Style

Lim, Kenneth Y. T., Nguyen Thanh Minh Le, and Sopheap Chanoudam. 2026. "Automating Spatial Visualisation of Handwritten Vector Equations Using Large Vision Models in Pre-Tertiary Mathematics" Multimodal Technologies and Interaction 10, no. 6: 68. https://doi.org/10.3390/mti10060068

APA Style

Lim, K. Y. T., Le, N. T. M., & Chanoudam, S. (2026). Automating Spatial Visualisation of Handwritten Vector Equations Using Large Vision Models in Pre-Tertiary Mathematics. Multimodal Technologies and Interaction, 10(6), 68. https://doi.org/10.3390/mti10060068

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

Automating Spatial Visualisation of Handwritten Vector Equations Using Large Vision Models in Pre-Tertiary Mathematics

Abstract

1. Introduction

2. Objectives

3. Literature Review

3.1. The Pedagogical Importance of Spatial Visualisation Tools

3.2. Learning with Visualisations Helps: A Meta-Analysis of Visualisation Interventions in Mathematics Education

3.3. Image Recognition Performance of GPT-4V(ision) and GPT-4o in Ophthalmology: Use of Images in Clinical Questions

3.4. Handwritten Mathematical Expression Recognition Using Deep Learning Techniques

4. Methodology

4.1. Preparing Input Images of Handwritten Equations

4.2. Image Processing, GPT-4o’s Computer Vision & Prompt Engineering

4.3. Processing the Data and Visualisation

4.4. Solving Vector-Related Problems

4.5. Performance Analysis

4.6. Comparison with Other Models

5. Results

5.1. Pipeline

5.2. Performance of GPT-4o

5.3. Performance of Other Models

6. Discussion

6.1. Overview of Findings

6.1.1. Viability of the System

6.1.2. Cost and Effect

6.1.3. GPT-4o’s Position Amongst the Models

6.2. Limitations

6.3. Future Work

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI