Line Chart Understanding with Convolutional Neural Network

Abstract: Visual understanding of the implied knowledge in line charts is an important task affecting many downstream tasks in information retrieval. Despite their common use, clearly defining this knowledge is difficult because of ambiguity, so most methods used in research learn the knowledge implicitly. When building a deep neural network, the integrated approach hides the properties of the individual subtasks, which can hinder finding the optimal configurations for the understanding task in academia. In this paper, we propose a problem definition for explicitly understanding knowledge in a line chart and provide an algorithm for generating supervised data that are easy to share and scale up. To introduce the properties of the definition and data, we set up well-known and modified convolutional neural networks and evaluate their performance on real and synthetic datasets for qualitative and quantitative analyses. In the results, the knowledge is explicitly extracted, and the generated synthetic data show patterns similar to human-labeled data. This work is expected to provide a separate and scalable environment that enhances research into technical document understanding.


Introduction
Understanding the propositions in chart images is a fundamental task in understanding technical documentation. For this task, a variety of problem settings and machine learning solutions have been proposed [1][2][3][4][5]. Because of the ambiguity in defining a standard of knowledge to extract from a chart, most studies solve the task indirectly as part of a larger integrated task such as image caption generation.
This end-to-end style of problem solving can hinder academic research into finding optimally configured deep neural networks for chart understanding. In solving sequential tasks at once, many deep networks have been successful, such as neural machine translation [6], compared with the conventional approach of dividing and conquering the integrated tasks [7,8]. This trend is not limited to that area: deep neural networks achieved high-accuracy image classification by mitigating the drawbacks of decomposing feature extraction and abstraction [9]. Because of the impact of the end-to-end style, many deep network researchers configure a whole architecture first and analyze its macroscopic behavior. However, if we do not sufficiently understand the properties of the separate tasks, configuring an architecture with the optimal generalization, model capacity, connections, and required input features for each layer is delayed, because all the settings must be searched from scratch. The optimal settings for each task can also be hidden by the effects of merging all the integrated tasks in the search.
To address this problem, we propose a problem definition for the explicit analysis of a chart image, provide an algorithm to generate supervised data, and share them (https://github.com/cy-sohn/LCUdataset_generator (accessed on 9 March 2021)). To the best of our knowledge, a problem definition and shared data for understanding the statements implied in a line chart have rarely been proposed to help with microscopic architecture design. We focus on understanding the knowledge in line chart images from visual perspectives rather than text-mixed information, called line chart understanding (LCU) in this paper. Under the proposed definition, we test well-known and simply tuned convolutional neural networks for image analysis [10]. They are configured for multitask learning [11,12] with various classification and regression subtasks to determine propositions and their numerical arguments. The contributions of this work are summarized as follows:
• proposing a definition of the knowledge implied in a line chart;
• providing an algorithm to automatically generate input chart images with their labels;
• analyzing the properties of the task and data by applying well-known neural networks to synthetic and real datasets.
We note that the main contribution is defining LCU and providing synthetic data with an algorithm validated with human-labeled real data. The neural network configuration is just an example we use to provide easy-to-obtain performance and intuition about this task for readers.
In Section 2, we explain state-of-the-art works related to chart understanding, and in Section 3, we introduce the problem definition specifying the target chart images and the knowledge template. Section 4 describes the algorithm used to generate synthetic data. Sections 5 and 6 show the experimental setups and their results on the synthetic and human-labeled real data. In Sections 7 and 8, we conclude and discuss future challenges.

Related Works
Deep-learning-based chart understanding has been proposed [13][14][15][16], but these works focused on estimating the positions of chart objects rather than understanding the knowledge implied in a chart. References [1][2][3][4][17] introduced methods to extract data from a chart or to convert the data to other forms. They correlated the recognized results with the text and graphic information shown in technical chart images rather than extracting implied statements as LCU does. Reference [18] introduced an object detection network for the evaluation of scientific plots; this work aimed to build a model that understands a horizontal bar graph by estimating its numerical attributes. LCU targets line charts and extracts implied propositions instead of estimating numerical values. ChartSense [4] uses deep-learning-based classifiers to determine the chart type of a given chart image and extracts simple information from it, which is an integrated task implicitly using part of the knowledge in a chart, even though there is no explicit knowledge-understanding module. FigureSeer [5] recognizes texts using character recognition modules and parses them together with the plots for redesigning various charts and applying them to question answering. Its main focus was estimation in a regression form rather than capturing knowledge in a logic form. Reference [3] proposed a similar method for understanding and redesigning a chart, but its targets are bar and pie charts, and unlike LCU, it omits a function to predict the intents in a chart. Chart image generation [19][20][21][22] may also include the chart understanding problem. Reference [19] proposed a method to generate line, bar, and pie chart images, but it is only partially automatic, so the scalability of the data is limited for training data-driven models. PlotQA [20], FigureQA [21], and DVQA [22] provide data used for question answering.
PlotQA [20] provides data using three types of plot images: horizontal bar graphs, line plots, and dot-line graphs. The text appearing in the chart images consists of words from the document texts. Labels, grids, font sizes, tick labels, line styles, line colors, and legend locations are used as attributes of the charts. In our work, we set wider ranges for those attributes and used more data samples to express the detailed local implications of a line. The slopes, positions, and ranges of lines are also more expressive in LCU. LCU uses both human-labeled and synthetic data for evaluation to confirm the impact of the synthetic data as a test bed for real-world understanding. FigureQA [21] provides visual inference data consisting of more than a million pairs of questions and answers. It can express five plot types (line, dot-line, vertical and horizontal bar, and pie charts) and learn logic such as maximum, minimum, and smoothness. Similar to PlotQA, these data have limited forms and attributes in the line plots. Attributes such as the title, label, tick, and axis label are fixed, and the shape and legend of the line are expressed differently for each plot. There are six fixed questions about the line plot in FigureQA, each of which can be answered with yes or no. LCU has a wider variety of logic templates than FigureQA. DVQA [22] provides data for understanding bar charts. These data are not only applied to QA but are also used for extracting numerical and semantic information. The targeted chart of this work is different from that of LCU.

Problem Definition for Line Chart Understanding
The goal of the LCU problem is to determine the propositions implied in a line chart image. Thus, an input image is given and we need to predict the most accurate labels representing the propositions and estimate their numerical arguments. In this section, we describe the targeted image conditions and propositions that compose a knowledge template.

Input: A Line Chart Image
A line chart has many diverse attributes [23]. To cover a wide range of graphic perceptions that humans understand [23][24][25], we set a variety of attributes as shown in Table 1. To obtain unbiased and diverse lines, we set the range of attributes as large as possible in a uniform distribution when generating a value for each attribute (the library used for generating lines: https://matplotlib.org (accessed on 1 January 2021)). In this problem, we focus on a single chart composed of at most two lines, because this is the first step to solve before we consider more complex charts. The target chart image follows these rules:
• An image has a line chart.
• A chart has at most two lines.

• All lines are continuous and have different colors.
This input setting is used to evaluate the basic functionality of understanding knowledge. It can be easily integrated with other practical downstream tasks in a multitask learning or fine-tuning manner. In addition to the rules, the target image uses a standardized chart frame as follows:
• The origin point is located at the left bottom.
• The range of each axis is [0,1] (a standardized range).
These conditions mean that any statement assigned to the image is based on visual perspectives. For example, if a model predicts an optimum in the graph, it generates an X-coordinate in [0,1]. The selected point is then linearly transformed to the range determined by the attached numerical tick labels without any additional process. This setting has the advantage of clarifying the effect of predicting combinations of tick labels and images when detecting knowledge determined purely by visual properties.
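As a minimal illustration, the linear transformation from the standardized [0,1] range to the range given by the numerical tick labels can be sketched as follows (a hypothetical helper, not part of the released code):

```python
def denormalize(x_norm, tick_min, tick_max):
    """Map a predicted coordinate in the standardized [0, 1] range to the
    axis range implied by the chart's numerical tick labels."""
    return tick_min + x_norm * (tick_max - tick_min)

# A predicted optimum at x = 0.25 on an axis whose ticks span 10..50:
print(denormalize(0.25, 10.0, 50.0))  # 20.0
```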

Output: A Knowledge Template
The knowledge template proposed in this paper is the set of propositions determined by classification and regression subtasks. It can also be interpreted as a set of discrete labels and their related numerical arguments. The structure, labels, and label ranges of all the subtasks are shown in Figure 1. Depending on the objects contained in an image, the logics representing knowledge are categorized into chart, line, and partition groups. In the chart group, the superiority subtask determines which line is superior to the other overall; if the lines have a cross point, superiority has the None label. The line group has three subtasks. Number of partitions is used to recognize the number of segments in a line; we allow one to three contiguous partitions so that the line segment in each partition can have an independent growth type label. Monotonicity is used to distinguish whether the slope of the line is positive or negative from the starting to the ending point of the line; if no clear monotonicity is observed, the None label is assigned. Minimum and maximum are subtasks used to detect the minimum and maximum real-valued XY-coordinates in a line, respectively. In the partition group, range is used to estimate the X-coordinates used as the partition boundaries, and growth type determines the growth type of the line segment in each partition. Examples of input images for extracting the knowledge template are shown in Figure 2.
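As one possible representation, the template of Figure 1 can be held in a nested data structure; the field names below are illustrative and are not taken from the paper or its released code:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Partition:
    x_range: Tuple[float, float]   # partition boundaries on the X-axis
    growth_type: str               # "linear" | "logarithmic" | "exponential"

@dataclass
class Line:
    num_partitions: int            # 1..3
    monotonicity: int              # 1 increase, 2 decrease, 0 none
    minimum: Tuple[float, float]   # (x, y) of the minimum point
    maximum: Tuple[float, float]   # (x, y) of the maximum point
    partitions: List[Partition] = field(default_factory=list)

@dataclass
class ChartKnowledge:
    superiority: int               # 1 first line, 2 second line, 0 none
    lines: List[Line] = field(default_factory=list)
```

Grouping the subtask outputs this way mirrors the chart/line/partition hierarchy of the template, so the variable number of lines and partitions per chart is handled naturally.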

Algorithm to Generate Labeled Data
After generating the attributes for a chart image, lines are automatically generated for the selected labels of the subtasks. The whole process of generating lines and labels is shown in Figure 3 and Algorithm 1. In the overall steps, we select the logics and their numerical arguments first and then randomly select data points satisfying the selected labels. In the first step, the algorithm randomly generates two points used as the starting and ending points of a line; the points are in the range of 0 to 1. To determine the number of logics for a line between the two points, the algorithm selects the number of partitions from {1,2,3} and then builds partitions by randomly generating intermediate boundary points. Then, the growth type for each partition is randomly selected from the label set {linear, logarithmic, exponential}. After selecting a growth type for each partition, the form of the line segment for the selected label is determined as

linear: y = kx + b,  logarithmic: y = k log(x − a) + b,  exponential: y = k a^x + b,

where x and y are the coordinates of a point, and k, a, and b are the parameters tuned so that the drawn line passes through all generated samples. The value of k is a randomly selected number in [1,5] for linear lines; the remaining parameters are approximated for the generated data points using the Python library. Data points are sampled at regular intervals on the X-axis. In the algorithm, the range of the slope θ is [0.3, 2.9]. In the logarithmic and exponential functions, b and k are approximated to pass through the initial points. The parameter a is initially fixed in [2,20] for the exponential function and in [0.85 × X_start, 0.99 × X_start] for the logarithmic function, where X_start is the X-coordinate of the leftmost initial point. The boundary conditions locate the lines in the first quadrant. The number of data points positioned in a partition is in the range of 10 to 50.
Algorithm 1 Generation of synthetic supervised data.
Randomly select the slope of line θ
Randomly select the starting and ending points of a line with θ
Randomly select a label for the number of partitions
Randomly select the boundary X-coordinates of partitions
for all partitions p do
    Randomly select a label for growth type
    Randomly select a line shape in the type
    Generate data points
    Draw a line in the range of p
end for
Determine labels for line-level subtasks
Determine labels for chart-level subtasks
Return (a chart, a set of labels) pairs
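The per-line steps of Algorithm 1 can be sketched in Python. The partition and sample-count ranges follow the text, while the concrete curve shape used for each growth type is a simplified assumption, not the paper's exact parameterization:

```python
import math
import random

GROWTH_TYPES = ["linear", "logarithmic", "exponential"]

def generate_line(n_min=10, n_max=50):
    """Sketch of the per-line part of Algorithm 1: pick a number of
    partitions, boundary X-coordinates, a growth type per partition,
    and data points sampled at regular intervals on the X-axis."""
    num_partitions = random.choice([1, 2, 3])
    inner = sorted(random.uniform(0.05, 0.95) for _ in range(num_partitions - 1))
    bounds = [0.0] + inner + [1.0]          # partition boundaries in [0, 1]
    points, labels = [], []
    for left, right in zip(bounds, bounds[1:]):
        growth = random.choice(GROWTH_TYPES)
        labels.append(growth)
        n = random.randint(n_min, n_max)    # 10..50 points per partition
        for i in range(n):
            x = left + (right - left) * i / (n - 1)
            t = (x - left) / (right - left + 1e-9)   # normalized position
            if growth == "linear":
                y = t
            elif growth == "logarithmic":
                y = math.log1p(9 * t) / math.log(10)  # concave, 0..1
            else:  # exponential
                y = (2 ** (4 * t) - 1) / 15           # convex, 0..1
            points.append((x, y))
    return points, labels, bounds
```

The chart-level and line-level labels (superiority, monotonicity, minimum/maximum) would then be derived from the generated points, as in the last steps of Algorithm 1.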

Detailed Settings for Label Generation
The categories and output ranges for each task are shown in Figure 1; they use the following specific configurations. For the number of partitions, we assign a number of partitions to each line independently; therefore, the partition boundaries of the lines are also independent. Growth type is independently assigned to each partition of each line. Superiority determines whether the first line is greater than the second line over the whole area. If a chart has only one line, this task is ignored in training. Label 1 means the first line is greater than the second over the whole area, 2 means the opposite case, and 0 means that the relation is too ambiguous. If the first line is greater than the second line, the minimum value of the first line is greater than or equal to the maximum value of the second line. Monotonicity determines a consistently increasing or decreasing state of a line over all its partitions. We set the label 1 for monotonic increase, 2 for decrease, and 0 for the inconsistent case; the labels are set by checking the sign of the slope of the generated lines. Minimum and maximum are regression subtasks that predict the two points whose Y values are the minimum or maximum over all X-coordinates in a line, respectively. Range is the subtask used to predict the meaningful partition boundaries composed of X-coordinates; in this subtask, the starting point S and ending point E on the X-axis are predicted. The total number of output variables to predict and their types are shown in Table 2. Superiority, monotonicity, growth type, and number of partitions are classification tasks, and the others are regression tasks. Table 3 shows the distribution of labels in the 75,000 generated samples. Figure 4a,c shows the distributions of the minimum and maximum points and the mean X-coordinates of the partitions.
To visualize the distribution, 1000 images were sampled for each number of partitions, and the mean X-coordinates for the starting and ending points were plotted.
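The superiority and monotonicity labeling rules described above can be sketched as simple checks over sampled Y values (hypothetical helpers, not the authors' implementation):

```python
def superiority_label(line1_y, line2_y):
    """Label 1 if the minimum of the first line is >= the maximum of the
    second line, 2 for the opposite case, and 0 when neither holds
    (the ambiguous case)."""
    if min(line1_y) >= max(line2_y):
        return 1
    if min(line2_y) >= max(line1_y):
        return 2
    return 0

def monotonicity_label(line_y):
    """Label 1 for a monotonic increase, 2 for a monotonic decrease,
    and 0 for the inconsistent case, by checking the sign of the slope
    between consecutive points."""
    diffs = [b - a for a, b in zip(line_y, line_y[1:])]
    if all(d >= 0 for d in diffs):
        return 1
    if all(d <= 0 for d in diffs):
        return 2
    return 0
```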

Detailed Settings for Input Image Generation
The default resolution of a chart image is 100 dpi at a figure size of 640 × 480. The background color of the chart area is randomly selected, excluding black. The grid lines and the chart frame containing the axes are turned on or off. The direction of the grid lines is vertical, horizontal, or both. Text elements appearing on a chart can contain up to 10 uppercase or lowercase characters; this condition is equally applied to the chart title, X-axis label, Y-axis label, and line labels. The number of ticks in the chart is between 3 and 12, and the ticks are represented with two decimal places.
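A sketch of this randomized attribute generation, using only the ranges stated above (the dictionary keys and the non-black color list are assumptions; the result could then be passed to matplotlib when rendering):

```python
import random
import string

def random_chart_attributes():
    """Randomly sample chart attributes within the ranges described in
    the text: random non-black background, optional grid/frame, random
    text elements of up to 10 letters, and 3..12 two-decimal ticks."""
    def text(max_len=10):
        n = random.randint(1, max_len)
        return "".join(random.choice(string.ascii_letters) for _ in range(n))

    n_ticks = random.randint(3, 12)
    return {
        "figsize": (6.4, 4.8),    # 640 x 480 px at 100 dpi
        "bg_color": random.choice(["white", "lightgray", "beige", "azure"]),
        "grid": random.choice([True, False]),
        "frame": random.choice([True, False]),
        "title": text(),
        "xlabel": text(),
        "ylabel": text(),
        "ticks": [round(i / (n_ticks - 1), 2) for i in range(n_ticks)],
    }
```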

Experiments
The goal of the following experiments was to show the easy-to-obtain performance of well-known neural networks and the difference in their performance between human-labeled and synthetic test data. We note that proposing a novel and extensively optimized architecture was beyond the scope of this study.

Model Configuration
To evaluate the easy-to-obtain performance on this problem, we tested ResNet-50, Wide ResNet-50-2, and the Chart-Understanding Spatial Transformer Network (CU-STN), as illustrated in Figure 5. ResNet-50 [26] and Wide ResNet-50-2 [27] were modified to retain spatial information: the average pooling layer was replaced by a convolutional layer (channels = 128, kernel = 3, and stride = 2), and the fully connected layer was modified to fit the output size. CU-STN is a configuration that applies a spatial transformer network to a ResNet backbone resized for LCU; this makes the network more robust to the flexible positions of the lines on a chart. The numbers of parameters for ResNet-50, Wide ResNet-50-2, and CU-STN are 26, 69, and 9 million, respectively.

Training Setting
The training loss is the sum of the loss functions of the 17 classification and 32 regression subtasks. We used cross-entropy for classification and mean squared error for regression. The problem types for each subtask are shown in Table 2. The total loss function L_total is defined as

L_total = Σ_{i∈S} 1_L(i) · 1_P(i) · L_i,

where S is the set of all subtasks and L_i is the loss of the ith subtask. Because the number of subtasks depends on the selected number of lines and partitions, we used the indicator functions 1_L and 1_P to determine which subtasks to include in the total. For monotonicity and superiority, ambiguity is very high and the label proportions are not uniform, as shown in Table 3. To remove this bias in training, we set the balancing parameters shown in Table 4, which are multiplied with the cross-entropy loss functions; each balancing parameter was set to the inverse of the corresponding label proportion, normalized as a ratio. To investigate behavior with respect to the generated data size, we prepared four training data sets composed of 1000, 5000, 10,000, and 50,000 sample images. The detailed hyperparameter settings for training are listed in Table 4.
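The masked total loss and the class-balanced cross-entropy can be sketched as follows; representing the indicator functions as 0/1 mask lists is an assumption made for illustration:

```python
import math

def weighted_cross_entropy(probs, label, class_weights):
    """Cross-entropy for one sample, scaled by the balancing weight of
    the true class (the inverse-proportion weights of Table 4)."""
    return -class_weights[label] * math.log(probs[label])

def total_loss(subtask_losses, line_mask, partition_mask):
    """Masked sum over subtasks: a subtask loss L_i contributes only
    when both its line and partition indicators are active."""
    return sum(loss * li * pi
               for loss, li, pi in zip(subtask_losses, line_mask, partition_mask))
```

For example, a single-line chart with one partition would zero out the masks of all second-line and extra-partition subtasks, so their losses never propagate gradients.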

Evaluation Setting
To evaluate performance, we prepared three test data sets composed of 500 synthetic images, 5000 synthetic images, and 500 human-labeled real images. The best validation model observed in training was used for test evaluation.

Quantitative Analysis
The accuracy and error results on the 5000 synthetic test images are shown in Table 5. The growth type results are split into the three cases of the number of partitions, and the best results are displayed in bold text. Growth type per partition is more complex than the other tasks; this result may have been caused by the high ambiguity of the growth type of short line segments. The decrease in accuracy was an expected pattern, because the accuracy in each case is the percentage of images that obtained the correct labels for all the partitions involved. Superiority is the simplest task. The estimation of the partition boundaries showed significant errors, and minimum and maximum estimation are more complex than the boundary estimation. According to Tables 6 and 7, the results varied, but the overall accuracy patterns of the subtasks were not significantly different between the synthetic and human-labeled data. For the superiority and monotonicity tasks, the proportion of labels is unbalanced compared with the other subtasks, which maintain a uniform distribution, so we additionally evaluated F1 scores on the small synthetic dataset, as shown in Figure 6. In the case of monotonicity, the F1 scores were similar to the accuracy results, which implies that the average recall was close to one rather than zero. Superiority showed a significantly lower F1 score compared with its accuracy, so the average recalls were also low. This difference was observed even at accuracies near 90%, which implies that the dominating labels had sufficiently large precision and recall while the others did not. Because of the high ambiguity of labeling, this task has high problem complexity. Figure 7 shows the task-wise comparison between the human-labeled real data and the small synthetic data. The fluctuation patterns were similar in growth type estimation for the one-partition case; the two- and three-partition cases showed large differences, which were caused by the ambiguity shown in the quantitative analysis.
The number of partitions, monotonicity, boundary estimation, and minimum and maximum regressions showed relatively similar patterns.
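The gap between accuracy and F1 under label imbalance can be reproduced with a small sketch: a classifier that always outputs the dominant label reaches high accuracy while its F1 on the rare label is zero.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1, illustrating why F1 can
    drop well below accuracy on imbalanced labels such as superiority."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Always predicting the dominant label 1 is 90% accurate on this sample,
# yet precision, recall, and F1 for the rare label 0 are all zero.
y_true = [1] * 9 + [0]
y_pred = [1] * 10
```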
The validation results were also collected, as shown in Table 8.
In this setting, the ratio of validation to training samples was 1:1. The highest accuracy was recorded for growth type, number of partitions, monotonicity, and superiority, and the lowest mean squared error (MSE) values were recorded for range and minimum and maximum. As in the test, growth type and range were separately reported according to the number of partitions. The scores were high because each was the best score recorded for its task during validation, regardless of the total loss.
The overall results showed that simple CNN settings achieved good performance on most subtasks, but a few tasks had low performance. The cause of this limitation is the ambiguity of the labels in the data, because the rules for data generation and labeling were mainly based on human intuition; for example, determining whether many images were linear or logarithmic was challenging. Beyond the problem of ambiguous labeling, limits from machine learning perspectives remain. First, we used a multitask learning framework, but learning all subtasks together may not be beneficial, depending on their similarity.

Qualitative Analysis
For Figure 8, we selected two sample images for each number-of-partitions case from the test results on the synthetic data. Figure 8a,b shows correct prediction results for growth type, while the regression tasks still need improvement. In Figure 8c,d, some partitions are relatively well predicted, but the maximum and minimum values may be distant from the correct points; the growth type labels are partially incorrect, but they are ambiguous even in human evaluation. In Figure 8e,f, the partition and growth type values show large errors. In the accurate prediction cases, we obtain somewhat understandable knowledge under human evaluation, but errors that need improvement remain in all tasks. Similarly, Figure 9 shows the prediction results on the real test data. Compared with the synthetic data, we can see natural language texts for the labels, various ranges of real tick labels, and other practical attributes. The red bars and blue crosses are the prediction results. The results on this data set are similar to those on the synthetic test dataset. Because the prediction is based entirely on visual perspectives, it can be applied to practical images without loss of generality. Figure 9. Example of detailed results with the human-labeled dataset. Blue, correct prediction; red, wrong prediction; In, increase; De, decrease; blue cross, minimum and maximum point; red line, partition boundary. These test data consist of one-line charts, so the superiority evaluation was excluded.

Conclusions
In technical document understanding, learning the knowledge implied in a line chart is important, but it is usually conducted together with downstream tasks. This integration slows research on optimizing the configuration of the neural networks used for understanding the knowledge. The explicit knowledge template proposed in this paper and the algorithm to automatically generate supervised data can be used as an incubating environment for models that solve the task. As an example of using this environment, we showed three configurations of convolutional neural networks and analyzed their performance and actual prediction cases. The synthetic data showed patterns similar to the human-labeled real data, showing that this environment can incubate models without a data-size limitation. This shared task is expected to boost research on the understanding of technical documents.

Future Works
In future work, the domain of applicable charts could be extended. We plan to more rigorously analyze the human evaluation results.