Analysis of Graphomotor Tests with Machine Learning Algorithms for an Early and Universal Pre-Diagnosis of Dysgraphia

Five to ten percent of school-aged children display dysgraphia, a neuro-motor disorder that causes difficulties in handwriting and becomes a handicap in these children's daily life. Yet, the diagnosis of dysgraphia remains tedious, subjective and dependent on the language, besides occurring late in schooling. We propose a pre-diagnosis tool for dysgraphia based on drawings called graphomotor tests. These tests are recorded using graphic tablets. We evaluate and compare several machine-learning models to build this tool. A database comprising 305 children from the region of Grenoble, including 43 children with dysgraphia, was established and diagnosed by specialists using the BHK test, which is the gold standard for the diagnosis of dysgraphia in France. We performed classification experiments by extracting, correcting and selecting features from the raw data collected with the tablets and achieved a maximum accuracy of 73% with cross-validation for three models. These promising results highlight the relevance of graphomotor tests for diagnosing dysgraphia earlier and more broadly.


Introduction
At school, fine motor activities still represent between 18% and 47% of children's activities [1], compared to 30 to 60% in 1992 [2], and 85% of this time is allocated to pure handwriting. As a major part of their learning process, handwriting is a very important ability that children need to master. It involves several different skills, as children have to learn how to perform an accurate and complex motor task that has to be synchronized with their visual feedback [3]. Difficulties in performing handwriting properly are a concern because they may become a setback in children's schooling and lead to a degradation of self-esteem [4].
Despite correct learning and training, some children never master handwriting, and their difficulties are actually pathological. This disorder is called dysgraphia and affects between 5 and 10% of children [5][6][7][8]. In France, dysgraphia is diagnosed with the Concise Evaluation Scale for Children's Handwriting [6], also known as the BHK test, which is an adaptation of the Dutch BHK test [9]. The test consists of copying a text for five minutes on a blank piece of paper. Only the first five lines are then rated by the examiner on the basis of thirteen quality criteria (e.g., size of the handwriting, gaps between words).
Table 1. A summary of the different classification algorithms used in recent studies using text data only as training data. Some studies tried several models for the classification of children with dysgraphia (DYS) and typically developing children (TD). 1 The score used is not balanced accuracy but only the percentage of children with dysgraphia actually detected by the algorithm. 2 The score used is the accuracy, because the sets are almost balanced.
ings of graphomotor tasks and used a machine-learning model to determine the classification. Their focus was on feature extraction, where they compared different categories of features (classical, modulation spectra, fractional order derivative and Q-factor wavelet transform) and their impact on the final classification. They used only one classifier and obtained 84% accuracy in the prediction of graphomotor difficulties.
This study intends to evaluate the possibility of using only graphomotor tasks, and no text-based test, for the detection of developmental dysgraphia (not only graphomotor difficulties), in order to avoid the drawbacks related to the dependence on the language and on the fact that the children must be old enough to know how to write. Based on studies showing that there are very informative data in the writing path of children drawing simple shapes on a sheet of paper [11,22], we formulate the hypothesis that dysgraphia has an impact on the way children draw, even though writing and drawing are two distinct exercises [23]. The graphomotor tests consist in the reproduction by the subjects of several shapes on a graphic tablet: a total of six drawings or groups of drawings were analyzed, part of them stemming from the Developmental Test of Visual Perception version 2 [24]. Three hundred five children from 2nd to 5th grade performed the tests. Among the 305, all 43 children with dysgraphia, together with 43 matched children without dysgraphia, were used as the database to train the machine learning model. The remaining 219 children were used as a reference database for correcting age bias. Features reflecting both the kinematic and static characteristics of the children's drawings were extracted from the graphomotor tests. The number of features was limited to those which are relevant to dysgraphia and may help in building an explanatory model of dysgraphia. After selecting the best combination of those features, we infer whether the children display dysgraphia or not with a machine learning model, which is evaluated with the cross-validation method (Figure 1). We evaluated nine models and three were selected for their best performances.
Figure 1. Step 1 corresponds to the tests, performed at school or at the hospital. Then, the BHK tests (conducted on paper) are sent to the examiners to rate and create the labels in Step 2a, and the raw numerical data of the graphomotor tests are processed in Step 2b. Step 3 is the model evaluation.

Materials and Methods
In this section, we present the characteristics of the participants before detailing our proposed procedure.

Participants
Three hundred five children from 2nd to 5th Grade (7 to 11 years old) took part in the study. This study was conducted in accordance with the Helsinki Declaration. It was approved by the Grenoble University Ethics Committee (CERGA, agreement 2016-01-05-79). The written consent of all children's parents and the oral consent of all children were acquired. The database is presented in Table 2.
Table 2. Description of the database, organized by diagnosis. For each group (TYP and DYS), the general characteristics, the BHK quality score (evaluation of handwriting) and the BHK speed score (evaluation of speed) are listed, with the p-value of the differences between both groups. A chi-squared test was used for gender, and a Welch test for the other categories, to obtain these p-values.
All children are French native speakers. In order to obtain a database as exhaustive as possible, children were recruited in 5 schools of the region of Grenoble (France), as well as in the Reference Center for Language and Learning Disorders of the Grenoble University Hospital (CRTLA, Centre Hospitalier Universitaire de Grenoble). Most of the children from schools were good writers, but a few of them were diagnosed with dysgraphia (4.9%). At the CRTLA, children consulted because of handwriting difficulties that were becoming troublesome, and a majority (30 out of 40) of these children were indeed diagnosed with dysgraphia. Overall, there were 262 typically developing children and 43 children with dysgraphia. Two of the children with dysgraphia were diagnosed based on their speed only, and not because of their handwriting quality. Because of the variety of profiles leading to a diagnosis of dysgraphia, a Welch test was performed to compute the p-value of the distinguishability between the means of the different populations. As shown in earlier work, a gender effect was observed, with a prevalence of boys in the DYS group [6].
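The p-values reported in Table 2 can be obtained with standard statistical routines. The sketch below is only an illustration with made-up numbers (the study's actual per-child scores are not reproduced here): a Welch t-test for a continuous score and a chi-squared test on a gender contingency table, using scipy.

```python
from scipy.stats import chi2_contingency, ttest_ind

# Illustrative values only, not the actual study data.
# Welch's t-test (unequal variances) for a continuous score, e.g. a BHK score.
typ_scores = [12.0, 14.0, 11.0, 13.0, 12.5, 14.5]
dys_scores = [22.0, 25.0, 21.0, 24.0, 23.5, 22.5]
_, p_welch = ttest_ind(typ_scores, dys_scores, equal_var=False)

# Chi-squared test for gender: rows are groups (TYP, DYS), columns (boys, girls).
contingency = [[130, 132],
               [30, 13]]
_, p_chi2, _, _ = chi2_contingency(contingency)
```

With such clearly separated groups, both tests return p-values well below the usual 0.05 threshold.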
Children before 2nd Grade were not considered, as the BHK test cannot be used to rate people with less than two years of experience in handwriting [6]; they are therefore not present in Table 2. Twelve children who did not perform the tasks as asked (for instance, some shapes were missing), as well as children suffering from physical disabilities or neurological deficits, were excluded as well.
From this main database, we selected all 43 children with dysgraphia (13 from schools, 30 from the hospital), and 43 typical children paired to each of them. The pairing was determined by chronological age as follows: for each child with dysgraphia, among the typical children with the same age and gender, the one with the closest age was selected. Therefore, each pair has the same gender and laterality, and an average age gap of 15 days, the maximum being 76 days and the minimum 0 days. These 86 children constitute the active database used for the training and testing of the classification model. The remaining 219 typical children were used only to control any bias due to the children's age. They constitute the Z-score database (Figure 2).
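The pairing rule described above can be sketched as a greedy pass over the children with dysgraphia. This is a minimal sketch: the record fields (`id`, `gender`, `age_days`) are illustrative names, and laterality matching is omitted for brevity.

```python
def pair_controls(dys_children, typical_children):
    """For each child with dysgraphia, select the unused typical child
    of the same gender with the closest chronological age (greedy pass).
    The record fields ('id', 'gender', 'age_days') are illustrative."""
    pairs, used = [], set()
    for dys in dys_children:
        candidates = [
            (abs(t["age_days"] - dys["age_days"]), i)
            for i, t in enumerate(typical_children)
            if t["gender"] == dys["gender"] and i not in used
        ]
        gap, best = min(candidates)  # smallest age gap among unused candidates
        used.add(best)
        pairs.append((dys["id"], typical_children[best]["id"], gap))
    return pairs
```

Marking matched controls as `used` guarantees that each typical child appears in at most one pair.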

Figure 2. The database was divided into two sets. The active database is split into Train and Test for the supervised learning algorithm, and the Z-score database is only used as a reference, preventing any data leakage and overestimation of the performances of the model.

Approach & Procedure
All children took the test alone with an examiner, at their school or at the hospital, depending on where they were recruited. Two series of distinct tests were given to each child. The first are the graphomotor tests (described below), which are the tests studied in this work. Then, all the children took the BHK test following the procedure described in [6]. One or two expert examiners annotated the BHK tests in compliance with the Concise Evaluation Scale for Children's Handwriting procedure. In this study, the only purpose of the BHK test was to obtain a real diagnosis and to correctly label the children's data for the supervised learning algorithms (see Figure 1). The recorded data of the BHK tests were not used in the machine learning algorithms.
The graphomotor tests, as well as the BHK test, were performed on a graphic tablet that records the x, y and z positions of the pen tip as a function of time, as well as pressure data (Figure 3a). Data were collected with software called GraphLogger, developed by the CEA [20] (Figure 3b). Different models of graphic tablets were used at school (WACOM Pro M) and at the hospital (WACOM Pro L). A calibration step was thus necessary to compare effectively the records coming from the two tablets: to check the written-length calibration, different distances on the real written tracks made by the children on paper were measured and then compared to the measures recorded by the computer, checking the reliability of the tablets and software. The spatial resolution of both tablets is 0.25 mm, and a new dot is recorded every 5 ms (200 Hz sampling rate). Studies suggest that people write differently on paper than on a screen [25], which led us to put a sheet of paper onto the tablet and use an ink pen to better reproduce the usual writing conditions of the children. The children could write directly on the paper and had visual feedback of their handwriting. All the data were processed with Python, using the following modules: scikit-learn [26], pandas [27] and mlxtend [28].
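The written-length check between paper and tablet records can be sketched as a least-squares scale fit; the function below is an illustration of the idea, not the study's exact calibration procedure, and the distance values are made up.

```python
import numpy as np

def calibration_scale(paper_mm, recorded_mm):
    """Least-squares scale factor mapping recorded distances to the
    distances measured on paper, plus the RMS of the residuals.
    A scale close to 1 and a small residual indicate a reliable record."""
    paper = np.asarray(paper_mm, dtype=float)
    rec = np.asarray(recorded_mm, dtype=float)
    scale = float(np.dot(rec, paper) / np.dot(rec, rec))
    residual = paper - scale * rec
    return scale, float(np.sqrt(np.mean(residual ** 2)))
```

A residual far from zero would flag a tablet whose records cannot be compared directly with the other device.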

Tasks: The Graphomotor Tests
The graphomotor tests consist of 6 drawings, also called "stimuli", which the children have to reproduce without a time limit. Most of the drawings were extracted from the Developmental Test of Visual Perception [24], a test widely used in the clinical context to evaluate "visual perception and visual-motor integration". The last one, The Loops, can be used to evaluate dysgraphia levels, as it is close to real handwriting [29]. We differentiated two kinds of graphomotor tests: the highlighting tests and the copying tests. For the highlighting tests, the child had to trace directly on the drawing while trying to keep the pen in the thick grey path, without lifting the pen if possible. In the copying tests, the child must reproduce the drawings shown as accurately as possible. There are three of each, listed below.
Highlighting Tasks
1. Circuit 1 (known as the lines): the child has to link the left drawing to the right one. There are 3 lines, getting thinner and thinner, which increases the difficulty (see Figure 4).
2. Circuit 2 (known as the labyrinth): likewise, the child links the bee and the flower. This task is more difficult to perform, as there are a few abrupt turns (see Figure 5).
3. Circuit 3 (known as the oval): the goal is the same, tracing a line from the left side of the car to the right side (see Figure 6).

Copying Tasks
4. Shapes 1: there are four simple shapes (one vertical line, one horizontal line, one sloping line, and a circle) for the child to reproduce. They must be drawn in only one stroke (see Figure 7).
5. Shapes 2: four other shapes that are slightly more complex: a triangle, a plus sign, a square and a cross (see Figure 8).
6. The Loops: two series of six alternated loops. This is the hardest exercise of these graphomotor tests, as it requires great visual-motor coordination (see Figure 9).

Features Extraction
Many different characteristics can be extracted from the 6 stimuli, but only features that are relevant in regard to dysgraphia were used. They were computed from the data acquired during the tests, consisting of time series of the x (horizontal), y (vertical) and z (distance to the tablet up to 1 cm) coordinates of the pen tip, the pressure applied by the pen, and the different angles of the pen. Unfortunately, we were unable to calibrate pressure and angle data from the different tablets used, which led us to abandon the related features. However, the binary information stating whether the pen was on paper or in air (where no pressure is applied) was kept.
Before computing the features, a low-pass fourth-order Butterworth filter was applied to the data to remove any noise that may be due to the tablets. We chose a cut-off frequency of 15 Hz, to keep most of the useful information about handwriting, which lies around 5 Hz [30].
Two types of features are distinguished: general features and specific features. General features were computed for each stimulus, which means that there are 6 occurrences of these features, 1 for each stimulus. Most of the general features used here are described in [20]. They are listed in Table 3. Two features previously used in [37] for the evaluation of the handwriting of patients with Parkinson's disease were added:
Rényi entropy of order 2: delivers information about the entropy of the trajectory along x (resp. y or (x, y)). The entropy rate is linked to the uncertainty of the path, which manifests through the chaotic and unpredictable handwriting that is typical of children with dysgraphia: their total Rényi entropy should be higher in absolute value. The mean and standard deviation of the Rényi entropy of the standardized and normalized strokes were also extracted:
R2 = -log2( Σ_{i=1..N} p_i² ),
where N is the number of dots in the signal of interest (which is either x, or y, or the 2-dimensional time series of both x and y), and the p_i are the probabilities of each dot i to be where it is, considering the handwriting as a strictly uniform movement.
Signal-to-Noise Ratio (SNR): reveals quick and unexpected movements from the children by comparing the trajectory along x (resp. y) to its smoothed version obtained by applying a basic low-pass filter:
SNR = 10 log10( Σ_{i=1..N} s_i² / Σ_{i=1..N} (s_i - ŝ_i)² ),
where N is the number of dots in the signal s (which is either x or y) and ŝ is its smoothed version.
Moreover, features specific to a particular stimulus were computed. An exhaustive list is given below.
Circuit 1
- Mean squared error (MSE) of each line.
- Pen lifts: the number of pen lifts without a direction change between the previous and the next stroke.
- Mean distance between the end of a stroke and the beginning of the next one.
Circuit 2
- Mean and standard deviation of the quality of the angles: for each angle, a Quality is computed from v_b, the direction of the line before the turn, and v_a, the direction after it. A Quality of 1 means the turn was 90°, which is perfect. The mean and Std Quality are computed for the whole stimulus.
- Mean distance between the end of a stroke and the beginning of the next one: same as Circuit 1.
Circuit 3
- Mean squared error with respect to the ideal circuit, computed from d_i, the distance of the point i to the center, and r, the radius.
- Mean distance between the end of a stroke and the beginning of the next one: same as Circuit 1.
Shapes 1
- Duration of each shape: time taken to draw each shape.
- Length of the path for each shape: total perimeter of each shape.
- Index of curvature [22]: detailed in [20].
- Horizontal and vertical diameters of the circle: the lengths of the vertical and horizontal diameters of the drawn circle. For a perfect circle, the two should match.
- Ratio between the horizontal and vertical diameters of the circle: for a perfect circle, it should be equal to 1.
The Loops
- Ratio between the length of the trajectory dedicated to the loops and the total length (links between loops included): the ratio between the loops themselves (the closed part of the loop) and the total trajectory, including the links between the loops.
- Difference between the highest and the lowest point (links only, loops excluded): a measure of the general direction, calculated as the difference between the highest point that is not in a closed loop and the lowest one.
Overall, this represents 345 features, general and specific features included.
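For illustration, the filtering step and the two added general features can be sketched as follows. This is a minimal sketch under our assumptions: the entropy function takes the probabilities p_i as input (their construction from the uniform-movement model is not shown), and the 5 Hz Butterworth smoother used as the SNR reference is an assumption rather than the study's exact filter.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS_HZ = 200.0  # 5 ms sampling period

def denoise(signal, cutoff_hz=15.0, order=4):
    """Fourth-order low-pass Butterworth filter applied to a coordinate series."""
    b, a = butter(order, cutoff_hz / (FS_HZ / 2), btype="low")
    return filtfilt(b, a, signal)  # zero-phase filtering

def renyi_entropy_order2(probabilities):
    """Rényi entropy of order 2: R2 = -log2(sum of p_i squared)."""
    p = np.asarray(probabilities, dtype=float)
    p = p / p.sum()  # make sure the p_i sum to 1
    return float(-np.log2(np.sum(p ** 2)))

def snr_db(signal, smooth_cutoff_hz=5.0):
    """SNR comparing a trajectory to a smoothed version of itself
    (assumed here to be a second-order 5 Hz Butterworth smoother)."""
    s = np.asarray(signal, dtype=float)
    smooth = denoise(s, cutoff_hz=smooth_cutoff_hz, order=2)
    noise = s - smooth
    return float(10 * np.log10(np.sum(s ** 2) / np.sum(noise ** 2)))
```

A smooth, regular trajectory stays close to its low-pass version and yields a high SNR, while jerky strokes inflate the noise term and lower it.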

Z-Score Computing
In order to apply the algorithms to the features, the features need to be comparable; that is why they should evolve in the same range of values. To this end, features were standardized with a Z-score:
z = (x - μ) / σ,
where μ and σ are the mean and standard deviation of the feature over the reference population. However, children from the 2nd Grade and children from the 5th Grade do not have the same handwriting skills, given that the youngest ones are still beginners and their handwriting will improve throughout the years [35,38,39]. Their productions thus cannot be compared directly, and adding age as a feature does not entirely solve the problem.
Charles et al. took this concern into account when conceiving the BHK test [6]: the diagnosis indeed depends on the score of the child as well as on his or her grade. Deschamps et al. thus proposed to implement a moving Z-score [20] that takes age into consideration during the standardization of the features. Instead of computing the Z-score over the entire database, each child is only compared to those who are at most 6 months younger or older. The same method was used in this study.
The only issue with the Z-score as a data transformation is that it may lead to data leakage. If each data point is transformed using the mean and standard deviation of all data from the same age range, then, when dividing into train and test sets, some information from the train set will be present in the test set, which can lead to an overestimation of the classification results. To prevent this, only the data from the Z-score database were used as a reference for computing the Z-score, while the data used for both the train and the test sets came from the active database (Figure 2). Thus, no data from the train set were used for any transformation of the test set.
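The moving Z-score against the external reference database can be sketched as below. The array names are illustrative and the 6-month window is approximated as 183 days.

```python
import numpy as np

def moving_zscore(value, age_days, ref_values, ref_ages_days, window_days=183):
    """Standardize one child's feature value using only reference children
    (the Z-score database) aged within +/- 6 months of that child.
    Because the reference is external, no train-set statistic leaks
    into the test-set transformation."""
    ref_values = np.asarray(ref_values, dtype=float)
    ref_ages = np.asarray(ref_ages_days, dtype=float)
    mask = np.abs(ref_ages - age_days) <= window_days
    mu = ref_values[mask].mean()
    sigma = ref_values[mask].std()
    return float((value - mu) / sigma)
```

The same function is applied unchanged to train and test children, since both are standardized against the reference population only.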

Preprocessing Steps
Considering that there is a total of 345 features, with only 86 data, a feature selection step must be conducted. The number of features should be reduced to the minimum, while keeping a sufficient amount of features to be able to efficiently perform the classification. To ensure the correct encapsulation of data, a pipeline containing 4 steps was created ( Figure 10): database, the children were only compared to those who were utmost 6 months younger or older than them. The same method was used in this study.
The only issue with the Z-score as a data transformation is that it may lead to data leakage. If each data is transformed using the mean and standard deviation of all data from the same age range, when dividing into train and test sets, some information from the train set will be present in the test set, which can lead to the overestimation of classification results. To prevent this, only the data from the Z-score database were used as a reference for computing the Z-score. The data used for both the train and test set were the active database ( Figure 2). Thus, no data from the train set were used for any transformations of the test set.

Preprocessing Steps
Considering that there is a total of 345 features, with only 86 data, a feature selection step must be conducted. The number of features should be reduced to the minimum, while keeping a sufficient amount of features to be able to efficiently perform the classification. To ensure the correct encapsulation of data, a pipeline containing 4 steps was created ( Figure 10): First, the removal of features that are too correlated. Above a certain threshold of correlation between 2 features, only 1 is kept. This threshold was set to 85%.
In the second step, the data were normalized and standardized.
In the third step, the feature selection was performed using an estimator which assigns importance to each feature. By selecting a threshold for this estimator, the number of features can be reduced. In our studies, we used either a linear SVM (C = 0.1) or an Extra Trees classifier with 100 estimators, and we selected 10 features.
Finally, the estimator for the prediction. Because we have a labeled database, we worked with supervised machine learning algorithms for our study. Several classes of algorithms have been used in the literature [11][12][13][14][15][16]40], therefore 9 models were tested, with different sets of hyperparameters, to select the best ones for each model. Figure 10. The main pipeline. The evaluation was performed through cross validation: firstly, the database was randomly split between train and test sets, then this whole pipeline was fit using the train set, and evaluated using the test data. Figure 10. The main pipeline. The evaluation was performed through cross validation: firstly, the database was randomly split between train and test sets, then this whole pipeline was fit using the train set, and evaluated using the test data.
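The four steps map naturally onto a scikit-learn `Pipeline`, so that each fold re-fits everything on train data only. The sketch below is our illustration under that assumption: the `CorrelationFilter` class is hypothetical (scikit-learn has no built-in correlation filter), while the hyperparameter values (85% threshold, C = 0.1, 10 features, 500 trees) follow the paper.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drop one feature of every pair whose
    absolute correlation (on the TRAIN data) exceeds `threshold`."""
    def __init__(self, threshold=0.85):
        self.threshold = threshold

    def fit(self, X, y=None):
        corr = np.abs(np.corrcoef(X, rowvar=False))
        upper = np.triu(corr, k=1)  # correlations of feature j with earlier features
        self.keep_ = [j for j in range(X.shape[1])
                      if not (upper[:, j] > self.threshold).any()]
        return self

    def transform(self, X):
        return X[:, self.keep_]

# Steps 1-4: decorrelate, scale, select 10 features, classify.
pipe = Pipeline([
    ("decorrelate", CorrelationFilter(threshold=0.85)),
    ("scale", StandardScaler()),
    ("select", SelectFromModel(LinearSVC(C=0.1, dual=False),
                               threshold=-np.inf, max_features=10)),
    ("classify", RandomForestClassifier(n_estimators=500)),
])
```

With `threshold=-np.inf` and `max_features=10`, `SelectFromModel` keeps exactly the 10 highest-importance features, matching the fixed feature count described above.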
The whole pipeline was then evaluated by a 5-fold cross validation. To avoid overestimation due to a particularly lucky split in a single cross validation, 100 iterations of cross validation with different random seeds for the splitting were performed. It is essential that the preprocessing steps never use test data along with the train data, which is why we could not cross-validate only the final classifier, but had to cross-validate the whole pipeline. Thus, for each fold, the train/test separation was conducted before any pre-processing step. Figure 11 summarizes all the preprocessing steps.
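This repeated, re-seeded cross-validation can be sketched as follows (the function name is ours). Passing the whole pipeline to `cross_val_score` guarantees that every preprocessing step is re-fit on each fold's train data only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

def repeated_cv_accuracy(pipe, X, y, n_repeats=100, n_folds=5):
    """Mean accuracy over `n_repeats` cross validations, each with a
    different random seed for the fold splitting. `pipe` is the full
    preprocessing + classification pipeline, not just the classifier."""
    scores = []
    for seed in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True,
                             random_state=seed)
        scores.extend(cross_val_score(pipe, X, y, cv=cv,
                                      scoring="accuracy"))
    return float(np.mean(scores))
```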
Figure 11. Summary of pre-processing. Starting from the raw data, features are extracted and corrected. The final data are then given to the main pipeline (Figure 10) and undergo further preprocessing (feature selection).
This pre-processing and the main pipeline evaluation can be applied to any dataset with the same kind of properties as ours: biased (in our case, by age), medium-sized (approximately 100 entries), and in need of correction, extraction of meaningful features and protection from overfitting during feature selection.

Results
The recording of the graphomotor tests allowed us to highlight differences between the drawings of children with and without dysgraphia. Examples are shown in Figure 12, where it can be noted that the number of strokes is higher for the DYS child than for the TD child, and that the trace is visually less regular. Features like "Number of angle changes" in The Loops and "Mean squared error to the path" in Circuit 2 will thus take different values and help the correct classification of these children.
For the feature selection step, two of the six estimator candidates were selected for their compatibility with feature ranking: Linear SVM and Extra Trees. These estimators select different sets of features, leading to different final results for the subsequent classification step. The features selected by each estimator are described in Table 4. Because this step is placed before the final classification, it does not depend on the nature of the final estimator; that is why there are two different sets of selected features and not 18. Table 4. The features selected. Each feature is followed by the percentage of times the feature was selected; the higher it is, the more important and generalizable the feature. No features from Circuit 3 and Shapes 2 were selected more than 40% of the time.

Table 4 reports the percentage of splits for which each feature was selected. Because there are 100 cross-validations of five splits each, there is a total of 500 train/test splits for which the feature selection is conducted. The features shown in Table 4 are the ones selected in at least 200 splits.
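Tallying how often each feature survives the 500 selections can be sketched with a hypothetical helper (not from the paper); each entry of `selected_per_split` would be the feature set chosen on one train/test split.

```python
from collections import Counter

def selection_frequency(selected_per_split, n_splits, min_share=0.4):
    """Share of train/test splits in which each feature was selected,
    keeping only features chosen in at least `min_share` of the splits
    (0.4 mirrors the 40% / 200-of-500 cutoff used for Table 4)."""
    counts = Counter(f for split in selected_per_split for f in split)
    freq = {f: c / n_splits for f, c in counts.items()}
    return {f: p for f, p in freq.items() if p >= min_share}
```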
The only features not selected by both estimators were the number of backtrackings when drawing the third line of Circuit 1, the height without loops and the standard deviation of the velocity peaks when drawing The Loops (selected by the Linear SVM but not Extra Trees), and the length of the horizontal diameter of the circle in Shapes 1. Neither estimator selected any features from Shapes 2 or Circuit 3, either because they were not discriminant enough or because they were too correlated with other features and discarded during the first pipeline step.
The other difference between the two estimators used for feature selection is the consistency of the selected features across splits. Eleven features were selected in more than 40% of the splits by the SVM, six of them in more than 90%, whereas for Extra Trees only nine features were selected in more than 40% of the splits, and only one was almost always selected (mean squared error during the short parts of the circuit, 99.2%). This indicates that Extra Trees, as a means of feature selection, is less consistent across the different train/test splits of the cross validation and more sensitive to particular data points, and therefore possibly more prone to selecting features that match noise rather than actual patterns in the drawings of children with dysgraphia. This is reflected in the general classification results as well: the accuracy, sensitivity (also called recall) and specificity obtained by cross-validation are lower when feature selection is conducted by Extra Trees than by Linear SVM.
As most selected features come from The Loops, it seems to be a task of high interest for qualifying dysgraphia. With the exception of "Std of the velocity peaks" and "SNR on y axis", all selected features are specific to one test, rather than general features computed for all stimuli. This implies that it is the diversity and specificity of the graphomotor tasks that reveal the characteristics of dysgraphia, rather than a general characteristic of the drawings.
The performances of the different pipeline configurations are listed in Table 5. As mentioned above, when Extra Trees is used for feature selection, the final classification yields lower results than when a Linear SVM is used. Six models were eliminated; the three remaining ones have the same range of performance: Random Forest (RF), Extra Trees (ET) and a Multi-Layer Perceptron (MLP). The accuracy was the main metric used for the selection, but the sensitivity and specificity were computed as well. The best accuracy score is 73.4%, obtained with RF with 500 estimators. The best configuration of MLP (one hidden layer of size 12) gives 73% accuracy, as does ET with 500 estimators. The main difference between the three models is the balance between sensitivity (true positive rate) and specificity (true negative rate). A higher sensitivity means that fewer children with dysgraphia remain undetected, and a higher specificity means that fewer typically developing children are misclassified by the algorithm. Table 5. This table gives the performances when using nine different estimators, with feature selection conducted using either Extra Trees (left column) or Linear SVM (right column). Acc. stands for accuracy, Sen. for sensitivity (or recall) and Spe. for specificity. The three highlighted results are the three top models using accuracy as the main metric.
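For reference, sensitivity and specificity as used above follow directly from the four confusion-matrix counts. A minimal sketch, assuming the convention that label 1 denotes dysgraphia and 0 a typically developing child:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN): share of DYS children detected.
    Specificity = TN/(TN+FP): share of TD children correctly cleared.
    Assumes binary labels with 1 = dysgraphia, 0 = typically developing."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```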


Discussion
An accuracy of 73% has been reached in this study, which may seem relatively low compared to previous studies using text data, as seen in Table 1. However, it is a promising result for the use of non-text-based data in this kind of pre-diagnosis tool. The studies presented in Table 1 had access to text data, and our goal was to show that classification is possible without such data. Thus, with the new specific features extracted and a pipeline focused on reducing overfitting, we could do without any text-based data and still reach 73% accuracy on the classification of dysgraphia, a writing disorder.
The accuracy of other estimators, such as SVM or AdaBoost, may also seem low when compared to studies in unrelated fields such as [40]. For SVM, it seems that the data are hard to separate, even when projected into other spaces (such as with the RBF kernel).
We used cross-validation because we could not afford a completely separate test set, due to the low number of subjects with dysgraphia (43). Cross-validation remains a solid method for validating such a model and avoiding overfitting. All pre-processing steps were either conducted using separate data (for the Z-score) or as part of the pipeline (correlation filtering, feature selection), in order to prevent any overfitting due to test data leaking into the training process.
Very few features were selected (10 out of 345) for each evaluation, but some of them may still have been selected because of statistical noise in the data. This suspicion is particularly strong when the selected features vary greatly between the different splits of the cross-validation, because it means the features are selected only due to the presence of particular examples and most likely will not help the generalizability (applicability to new datasets) of the pipeline, as is the case when Extra Trees is used for the selection. This suspicion could be addressed in a future analysis by evaluating a model built using only the features presented in Table 4 on new test data.
One of our goals with feature selection and the initial feature extraction was to obtain a highly explainable model; that is why we did not use automatic feature extraction or PCA [41]. However, it must be noted that because the first step (suppression of correlated features) chooses arbitrarily between two correlated features, it may somewhat deteriorate the explainability of the model. This is a minor issue: even if the two correlated features are different and one of them happens not to be selected after the first step, the fact that they are highly correlated (at least 85%) means that they do not essentially represent different particularities of the drawings, so this choice will not change the final interpretation of the characteristics of dysgraphic drawings. Overall, these characteristics are typical of dysgraphic handwriting, but transposed to the context of graphomotor tasks, especially for The Loops, which is close to a real writing task.
Features related to the detection of Parkinson's Disease were used, along with The Loops, a test used to quantify the severity of parkinsonian dysgraphia [29,37,42]. It is important to note that parkinsonian and developmental dysgraphia probably do not stem from the same neurological causes, even if they can have similar effects on handwriting (their main difference being parkinsonian micrographia [42]). Yet, including these features and this test proved relevant to our study: five of the eleven most selected features come from The Loops, meaning that its characteristics reveal differences between typically developing children and children with dysgraphia that the other graphomotor tests cannot capture; they are thus complementary. This is not that surprising, given that The Loops is the test closest to actual writing among the six graphomotor tests used here. The inclusion of a test close to handwriting was indeed important in our case (detection of developmental dysgraphia and not motor difficulties), since not all dysgraphia subtypes impair drawing [43].
We had to discard the features related to pen pressure due to calibration issues between the tablets. Adding them could improve the performance of the classifier [11,12]. It would be interesting to include them in a future analysis, either by using the same graphic tablet model for all data acquisitions or with a reliable method to calibrate different graphic tablets.
In terms of performance, with an accuracy of 73% for the best model, this study presents lower results than the previous one using graphomotor tests, which reported an 84% accuracy [21]. The main difference with our approach is not the size of the dataset (26 healthy children, 27 with graphomotor difficulties), but the nature of the subjects and the goal of the study: detection of dysgraphia using only drawing tests in our study, versus detection of self-reported graphomotor difficulties for Galaz et al., using an adapted set of tests. Furthermore, their study focused more on feature extraction and quality control (more features were extracted), rather than on model selection and the explainability of the final model. A difference in methodology might also explain the performance gap between the two models. They used Sequential Forward Feature Selection (SFFS) [28], which trains a second model to select the features. However, it is not clear whether they preprocessed their features before the feature selection or as part of it. As reported in their paper, they seem to have used the second option, which, in this case, can lead to some overfitting, as SFFS will perform cross-validation on biased data.
The participants were all at least in 2nd Grade, because we needed them to have at least two years of handwriting experience in order to be rated by the BHK test and thus provide labels for the model to learn from. With a view to building a pre-diagnosis tool targeting younger children who do not yet know how to write, a follow-up study could have children perform the graphomotor tests at an earlier age and then take the BHK test a few years later, once they are able to. This way, it would be possible to estimate whether the at-risk children detected by the model actually developed dysgraphia, and which features were the most useful for this detection.
Another goal of this study was to provide a more objective assessment of children's drawing skills and to detect whether they are at risk of developing dysgraphia. The subjectivity of the examiners' ratings was evidenced by [6] when establishing their standard for the BHK test: they showed that there was variance in the examiners' ratings. The correlation between the BHK scores of the examiners ranges from 0.68 to 0.90 (for inexperienced and experienced raters, respectively), which may lead to divergences in the given BHK score and consequently in the diagnosis for a same child. In this study, each BHK test was rated by either one or two experienced examiners, who then gave their diagnosis. The diagnosis did not necessarily match the threshold at two standard deviations from the mean exactly: children with a BHK score above −2 (above the threshold at which the diagnosis is supposed to be made) could still be considered dysgraphic, because other contextual information (such as the quality of the writing at school) was taken into account as well. It is worth considering that other examiners could rate differently and that some labels may then be inaccurate for some children. However, in our active database, this could only concern 8 children out of 305, who are close to the cutoff, with a BHK score between −2.25 and −1.75.
Our machine-learning-based pre-diagnosis tool aims at avoiding this variability by providing an objective assessment of children's handwriting skills, through not only the classification itself but also the drawing features involved in it. The fact that we can help specialists detect dysgraphia without any writing task, and thus independently of the nationality of the subject, is the main achievement presented here.

Conclusions
We proposed a pre-diagnosis tool for dysgraphia based on graphomotor tests only, using machine learning models trained with data we collected and features we extracted. We designed a pipeline and pre-processing steps to use our data, reduce their bias, and select features without overfitting, in order to improve the generalizability of our results. An accuracy of 73% was obtained on the classification test, which is a very encouraging result. The use of graphomotor tests to diagnose dysgraphia is quite new in the field, and this study should pave the way to new possibilities for approaching dysgraphia without resorting to pure handwriting, making the approach independent of nationality. The Loops, a test particularly close to handwriting that is also used in the assessment of Parkinson's Disease, was especially useful, providing half of the features used in the end, indicating a strong relevance for dysgraphia detection. Three models present good classification results when used in our pipeline: Random Forest, Extra Trees and Multi-Layer Perceptron. Our general database is comprehensive and thus allowed us to reduce the age-related bias regarding writing and its development. The extracted features cover a broad part of dysgraphia's criteria and of handwriting kinematics, and they are easy to interpret. Even if dysgraphia is not a binary characteristic, and keeping in mind that a classification might be biased, such a pre-diagnosis tool could enable an early pre-selection of at-risk children with drawing tests only, leading to early intervention on handwriting difficulties and a general improvement of the well-being of these children.