Analysis and Visualization of New Energy Vehicle Battery Data

Abstract: To use Li-ion batteries safely and efficiently and to extend their service life, it is important to analyze the original battery data accurately and to predict the state of charge (SOC) quickly. However, most existing work analyzes SOC directly; methods for analyzing the original battery data and for identifying the factors that affect SOC are still lacking. This paper therefore parses, cleans, and preprocesses the collected original battery data (hexadecimal), visualizes and analyzes the parsed data, and finally uses the K-Nearest Neighbor (KNN) algorithm to predict the SOC. Experiments show that the method can completely parse hexadecimal battery data based on the GB/T32960 standard, including three different types of messages: vehicle login, real-time information reporting, and vehicle logout. The visualization method is then used to analyze the factors affecting SOC intuitively and concisely. In addition, the K value and P value of the KNN algorithm are selected using dynamic parameters, and the resulting mean square error (MSE) and test score are 0.625 and 0.998, respectively. Overall, the method can analyze battery data from the source, visually analyze the influencing factors, and predict SOC.


Introduction
As the world moves towards sustainable development, the shortage of oil and increasingly prominent environmental pollution have made research on new and renewable energy an inevitable trend across all industries [1][2][3][4][5][6]. New energy vehicles have gradually become the main focus of development in the transportation industries of many countries, and the battery components they require have steadily improved with the continuous development of science and technology [7,8]. At present, lithium-ion batteries with low cost, small volume, and long service life have been put into production through continuous experiments and improvement, and their safety and reliability have continuously improved [9][10][11]. Lithium-ion batteries are widely used in new energy vehicles, electric bicycles, aerospace, the military, and other fields, especially in electric vehicles [12,13]. However, current lithium-ion batteries have poor abuse resistance and are vulnerable to the external environment, which can lead to safety-related accidents. To improve battery utilization, prevent overcharge and overdischarge, prolong service life, and monitor the battery state, major manufacturers have conducted in-depth research on battery technology, and the battery management system (BMS) came into being. The BMS monitors the battery, including real-time monitoring of physical parameters, battery state estimation, and charging control. However, when faults occur, the BMS itself cannot analyze the original data generated by the battery. The stored data and the messages on the CAN bus can only be analyzed manually, and the root cause of battery faults cannot be found [14,15].
In recent years, with the continuous improvement and maturity of battery technology, the energy stored in the battery (the battery's present capacity under a given condition, called the SOC of the battery) has been used as an important indicator to evaluate the battery state [16]. Since Li-ion batteries are renewable energy sources and intermittent in nature, the interpretation and analysis of SOC is important in the development of effective charging and discharging schemes [17], so the analysis and evaluation of battery energy storage is the top priority in the development of new energy vehicles. A previous paper [18] conducted a detailed study on some data of new energy batteries and introduced a recurrent neural network (RNN) to visualize battery data and issue warnings for battery data management. Ref. [19] proposed a method for battery fault diagnosis in electric vehicles based on long short-term memory (LSTM) networks. In Ref. [20], the authors proposed a two-way coupled electrochemical-thermal model to study the effects of the water-cooling liquid inlet and flow rate on the effectiveness of the battery thermal management system. Although a variety of approaches have been suggested to detect battery failure, the original battery data and the factors affecting SOC have not been explored in the aforementioned literature. Yet the SOC of the battery is affected by many factors (vehicle state, voltage, temperature, etc.). The existing methods focus on the direct prediction of SOC but ignore the importance of analyzing and visualizing the original data, and there is no practical method for analyzing the factors affecting SOC.
To address the lack of methods for parsing original battery data, analyzing it visually, and identifying the factors that affect SOC, this paper parses the original (hexadecimal) battery data into an intuitive form, visualizes and analyzes each battery indicator in the parsed data, and finally uses the indicators that affect SOC to predict SOC with the KNN algorithm.
The organizational structure of this paper is as follows: Section 2 reviews the related work on data analysis, visual analysis, and SOC prediction. Section 3 describes the data sets used in the experiment and the indicators they contain. Section 4 is divided into three parts: the first part describes how to parse the obtained battery data; the second part visually analyzes the parsed data obtained in the first part to find the indicators that affect SOC; the third part builds the KNN algorithm on the analyzed indicators and predicts the SOC by comparing the selected parameters. Section 5 presents the results and analysis of the methods in Section 4. Finally, Section 6 summarizes the conclusions.

Related Work
Nowadays, very little work analyzes the original data of a battery; most studies address the SOC of the battery directly. In Ref. [21], Deng Ma proposed an adaptive tracking EKF (ATEKF) method to estimate the SOC of a battery. In Ref. [22], the authors compared machine learning methods with different characteristics for estimating battery SOC, showing that different machine learning methods are useful for both measuring and predicting SOC. However, there is a lack of research on the original data generated by Li-ion batteries, because lithium-ion batteries generate hexadecimal data, which are not intuitive, and the voltage, current, temperature, and SOC hidden in them are difficult to obtain directly.
Based on the observations reported above, we introduce a scheme that realizes the visual analysis of lithium battery data and SOC prediction from the source. The scheme turns abstract, cipher-like lithium battery data into visualizations of the various metrics that affect battery performance and analyzes them to predict SOC. Our work makes two contributions: (I) through the investigation and acquisition of a large amount of lithium battery data, we first make a rough analysis and then conduct codification research to illustrate the intuitiveness and feasibility of the method; (II) the parsed data are cleaned and pre-processed, visualization studies are performed to filter out the valid data, the factors affecting SOC are analyzed in the visualization, and finally, SOC is predicted by the KNN algorithm.

Data
Data Analysis
• New Energy Vehicle Battery Dataset 1: The data provided consist of the message data obtained from the lithium battery, including the protocol type, the server receiving time, the message time, the message type, and the original messages. We mainly extract and analyze the original messages, which include the current vehicle status, vehicle position, battery voltage, and engine status. However, the message data are all hexadecimal, so it is difficult to obtain understandable data directly. Therefore, the messages must be parsed into intuitive indicators to support the subsequent analysis. See Figure 1 for an explanation.
• New Energy Vehicle Battery Dataset 2: The data set consists of one CSV file including 36 indicators of vehicle battery data (vehicle status; total voltage; cumulative mileage; total current; vehicle speed; SOC; operation mode; insulation resistance; DC-DC status; charging status; minimum voltage battery subsystem number; minimum voltage battery cell code; minimum temperature value; minimum temperature subsystem number, etc.). The collection interval is 10 s, and the data cover the whole course of lithium battery operation, which reduces the contingency of subsequent experiments. This data set is used for visual analysis and SOC prediction in Sections 4 and 5. See Figure 2 for an explanation.

Methods
All of the original lithium battery data are provided by new energy vehicles that meet the production standards, so the data comply with the GB/T32960 standard, which specifies the remote service and data format of electric vehicles. The hexadecimal messages generated by the battery follow the data format it defines. In Section 4.1, the data set format, parsing method, and related algorithm structure defined in the GB/T32960 standard are explained in detail. In Section 4.2, the new energy vehicle battery dataset 2 is used for visualization to find the factors that are highly correlated with SOC. The last subsection explains how the KNN algorithm is designed.

GB/T32960 Standard Introduction and Data Format Analysis
Introduction to GB/T32960 Standard
GB/T32960, "Technical specification for electric vehicle remote service and management system", is divided into three parts: general provisions, the on-board terminal, and the communication protocol and data format [23].
The general structure diagram of the electric vehicle remote monitoring system is given in GB/T 32960.1-2016, part I, general provisions. As can be seen from Figure 3, after the vehicle terminal obtains vehicle data, it uploads the data to the enterprise platform by means of CAN bus communication, and then the enterprise platform interacts with the public platform by means of CAN bus.

Data Format Analysis
GB/T 32960.3 specifies the protocol structure, communication connection, data packet structure and definition, and data unit format and definition in the electric vehicle remote service and management system. Before introducing the packet structure, the data types used in the packet are analyzed first, since they determine how the battery message information is composed. The protocol has five data types: byte, word, dword, string, and byte[n]. Note that the protocol transfers multi-byte data types in big-endian mode, i.e., the most significant byte comes first.
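The effect of big-endian transfer can be illustrated with a minimal Python sketch. The helper names `read_word` and `read_dword` are hypothetical, and the sizes (2 bytes for a word, 4 bytes for a dword, both unsigned) are assumptions based on the usual meaning of these type names rather than values quoted from the text above.

```python
def read_word(buf: bytes, pos: int) -> int:
    """Decode a 2-byte unsigned value stored most-significant byte first."""
    return int.from_bytes(buf[pos:pos + 2], byteorder="big")


def read_dword(buf: bytes, pos: int) -> int:
    """Decode a 4-byte unsigned value stored most-significant byte first."""
    return int.from_bytes(buf[pos:pos + 4], byteorder="big")


# Hypothetical payload: one word (0x0102) followed by one dword (0x0001E240).
raw = bytes.fromhex("0102" "0001E240")
word = read_word(raw, 0)
dword = read_dword(raw, 2)
print(word, dword)  # 258 123456
```

Reading the same bytes in little-endian order would give 513 for the word, which is why the byte order has to be fixed before any field is interpreted.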
A complete data message consists of a start character, command unit, unique vehicle identifier, data encryption method, data unit length, data unit, and check code. A battery packet sent from the vehicle terminal to the server always follows this general structure, as shown in Table 1. The packet header consists of the two ASCII characters "##", which mark the beginning of the packet. Next comes the command unit, made up of the command ID and the response ID, which are defined in Table 2. For a real-time information reporting frame, the third byte is 0x02. The command unit is followed by the unique vehicle identifier, namely the VIN (vehicle identification number). Because the length of the information is not fixed, the data unit length gives the length of the data unit that follows, so that the server can find the end of the frame when parsing. This paper takes the most commonly used message type, real-time information reporting, as the example for discussing the protocol content and packet structure.
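The packet structure described above can be sketched as a small frame splitter. This is a minimal sketch under stated assumptions, not the authors' implementation: the field widths (1-byte command ID, 1-byte response ID, 17-byte VIN, 1-byte encryption method, 2-byte length, 1-byte check code) and the XOR check computed from the command unit up to the byte before the check code are taken from the GB/T 32960.3 layout as commonly documented, since Table 1 is not reproduced here. The names `Frame`, `split_frame`, `xor_check`, and `build_demo_frame`, the VIN, and the demo payload are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    command_id: int
    response_id: int
    vin: str
    encryption: int
    data_unit: bytes
    check_code: int


def xor_check(data: bytes) -> int:
    """XOR all bytes together (assumed check-code rule)."""
    result = 0
    for byte in data:
        result ^= byte
    return result


def split_frame(hex_message: str) -> Frame:
    """Split one hexadecimal message into the fields of the general structure."""
    buf = bytes.fromhex(hex_message)
    if len(buf) < 25 or buf[:2] != b"##":
        raise ValueError("not a complete frame starting with '##'")
    length = int.from_bytes(buf[22:24], byteorder="big")  # data unit length
    end = 24 + length                                     # index of the check code
    if len(buf) != end + 1:
        raise ValueError("data unit length does not match the frame size")
    if xor_check(buf[2:end]) != buf[end]:
        raise ValueError("check code mismatch")
    return Frame(
        command_id=buf[2],
        response_id=buf[3],
        vin=buf[4:21].decode("ascii"),
        encryption=buf[21],
        data_unit=buf[24:end],
        check_code=buf[end],
    )


def build_demo_frame(command_id: int, data_unit: bytes) -> str:
    """Assemble a made-up frame so the splitter can be exercised."""
    body = (bytes([command_id, 0xFE]) + b"LTESTVIN000000001" + bytes([0x01])
            + len(data_unit).to_bytes(2, byteorder="big") + data_unit)
    return (b"##" + body + bytes([xor_check(body)])).hex()


demo = split_frame(build_demo_frame(0x02, bytes.fromhex("0102")))
print(demo.command_id, demo.vin, demo.data_unit.hex())
```

Validating the length and the check code before reading any field keeps a truncated or corrupted message from being parsed into plausible-looking but wrong indicators.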
A real-time information report begins with the data collection time, which is represented by a 6-byte BCD code in the order year, month, day, hour, minute, and second. The information type follows; the information types need not appear in a fixed order and can be freely combined. There are many types of information, such as vehicle data, drive motor data, fuel cell data, and engine data. See Table 3 for the specific information type definitions. Finally, there is the message body, whose length and data types vary depending on the information type [23]. Because there are many types of information, this paper uses the vehicle data format as the example for analysis. The formats of vehicle data, fuel cell data, drive motor data, and the other information types are detailed in the literature [23].

Analytical Thinking
As the data format above shows, the data packets generated by the lithium battery comply with the format shown in Table 1, so the different states of different vehicles are all expanded from the data in Table 1. The messages are mainly divided into three types: vehicle login, real-time information reporting, and vehicle logout. The overall parsing idea proceeds in the following steps.

1.
The entire message is divided according to the structure and definition of the packet into the start character, command unit, unique vehicle identifier, data encryption method, data unit length, data unit, and check code.

2.
Judge the vehicle status (vehicle login, real-time information reporting, vehicle logout) contained in this message by the command ID in the command unit.

3.
Further analyze the vehicle status in detail according to the different modules defined in the data unit format. Details can be seen in Figure 4.
Next, the different message types are parsed. After the overall structure of the hexadecimal message has been divided, the command unit is split further, and the command ID is used to determine the type to which the message belongs. The command unit distinguishes three types, but their overall parsing structure is roughly the same, so the most complex type, real-time information reporting, is mainly described here. The parsing of a real-time information reporting message is shown in Figure 5. If the command unit in Table 1 contains 0x02, the message is a real-time information report. Once the type is determined, the overall structure of the message is divided according to the format in which the information is reported, the data type of each part of the message is looked up in Table 3 (data unit format definition), and finally the parsed information is merged. A real-time information reporting message generally includes vehicle data, drive motor data, fuel cell data, position data, extreme value data, alarm data, and battery voltage data. During parsing, each step is linked to the previous one: parsing continues from the last two hexadecimal characters of the previously parsed part (one byte corresponds to two hexadecimal characters).
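The type decision in steps 2 and 3 can be sketched as a pair of lookups. The text above only confirms that 0x02 marks real-time information reporting; the other command IDs and the information type markers below are assumptions recalled from GB/T 32960.3 and should be checked against Tables 2 and 3. The function name `describe_message` and the demo bytes are hypothetical.

```python
from typing import Optional, Tuple

# Assumed command IDs (only 0x02 is confirmed by the surrounding text).
COMMAND_TYPES = {
    0x01: "vehicle login",
    0x02: "real-time information reporting",
    0x04: "vehicle logout",
}

# Assumed information type markers inside a real-time report.
INFO_TYPES = {
    0x01: "vehicle data",
    0x02: "drive motor data",
    0x03: "fuel cell data",
    0x04: "engine data",
    0x05: "position data",
    0x06: "extreme value data",
    0x07: "alarm data",
}

TIME_LENGTH = 6  # the data collection time occupies the first 6 bytes


def describe_message(command_id: int, data_unit: bytes) -> Tuple[str, Optional[str]]:
    """Return the message type and, for real-time reports, the first information type."""
    message_type = COMMAND_TYPES.get(command_id, "unknown")
    if message_type != "real-time information reporting":
        return message_type, None
    if len(data_unit) <= TIME_LENGTH:
        return message_type, None
    marker = data_unit[TIME_LENGTH]  # first byte after the collection time
    return message_type, INFO_TYPES.get(marker, "unknown")


# Made-up data unit: 6 time bytes, marker 0x01, then a zero-filled body.
demo_unit = bytes([22, 5, 17, 10, 30, 0]) + bytes([0x01]) + bytes(20)
print(describe_message(0x02, demo_unit))
print(describe_message(0x01, b""))
```

A dictionary lookup keeps the dispatch table in one place, so adding a further message or information type does not change the control flow.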

Codification
After the division and analysis of Tables 1-3, the overall structure of the code-based parsing process is shown in Figure 6. The parsing structure is divided into six modules. Parsing starts from the command ID; after each module is parsed, the parser finds the first mark of the remaining message and grabs the command ID of the message to start the subsequent parsing, until all parsing is complete. The process resembles a factory with six different workshops, through which a car is sent in turn for inspection. Because the message is hexadecimal, a base conversion must be performed first. The main idea is to convert the hexadecimal value to decimal (multiply by the precision first, then apply the offset), and then to match the converted value with the description of the corresponding bytes, which achieves the purpose of parsing. The special values "0xFE" (exception) and "0xFF" (invalid) are both taken into consideration. A global variable is defined to capture the global parsing type and pass it to the next parsing module. From the above reasoning, the algorithm structure is shown in Table 4.
Table 4. Algorithm structure.
Overall division analysis (Table 1)
(4) self.nextMark: expand with the first token of the remaining message
(5) fun_07: command bit parsing
(6) self.ol: display in columns
(7) self.next: identify remaining messages
Parsing process (all parsing is carried out according to Part 3 of the technical specification for electric vehicle remote service and management system: communication protocol and data format): First, regardless of the vehicle status, the overall parsing (fun_01to06) is required: start → command ID → response ID → unique identification code → data unit encryption method → data unit length.
Idea: first, the number of bytes represented by each description is input and the base conversion is performed; the original message is displayed in columns (self.ol), parsing starts (self.pj), and the parsing results are displayed in columns (self.pl). The parser then continues to identify the remaining messages (self.next), finds the first mark of the remaining messages (self.nextMark), and grabs the command mark of the message to start the subsequent message parsing (self.mo). Next, the command mark bit is identified and parsed (fun_07), and the type of the whole message is determined for subsequent parsing.
By constructing the main body, the message data to be detected are divided into the overall message structure and the type of the message (vehicle login, information reporting, vehicle logout, platform login, platform logout) is judged. Each module is detected and finally merged. See Figure 7 for details.
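The conversion rule used throughout the parsing modules (convert to decimal, multiply by the precision, then apply the offset, with "0xFE" and "0xFF" reserved for exception and invalid) can be sketched as a single helper. This is an illustrative sketch, not the authors' code: the name `decode_field` is hypothetical, and the treatment of multi-byte fields (all bytes 0xFF for invalid, a trailing 0xFE for exception) as well as the example precisions and offsets are assumptions.

```python
from typing import Union


def decode_field(raw: bytes, precision: float = 1.0, offset: float = 0.0) -> Union[float, str]:
    """Convert one big-endian field to a physical value, or flag a reserved value."""
    all_ff = bytes([0xFF]) * len(raw)
    if raw == all_ff:
        return "invalid"      # e.g. 0xFF or 0xFFFF
    if raw == all_ff[:-1] + bytes([0xFE]):
        return "exception"    # e.g. 0xFE or 0xFFFE
    value = int.from_bytes(raw, byteorder="big")
    return value * precision + offset  # multiply by the precision first, then offset


# Hypothetical fields: a 1-byte SOC in percent, and a 2-byte total current
# with an assumed precision of 0.1 A and an assumed offset of -1000 A.
soc = decode_field(bytes.fromhex("49"))
current = decode_field(bytes.fromhex("2710"), precision=0.1, offset=-1000.0)
print(soc, current, decode_field(bytes.fromhex("FFFE")), decode_field(bytes.fromhex("FF")))
```

Checking the reserved values before scaling prevents an "invalid" marker from being turned into a large but plausible-looking reading.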

Visual Analysis
Based on Section 4.1, the abstract hexadecimal original message is parsed into intuitive data such as voltage, current, mileage, SOC, and temperature. To further analyze the SOC-related factors that are crucial to the battery, we visualize the parsed data to analyze the lithium battery data. Data visualization is the scientific and technological study of the visual representation of data, which mainly uses graphical means to convey and communicate information clearly and effectively [24,25]. Here, the data visualization tools in Python are used to visualize the parsed data [26].
Due to the rapid development of new energy vehicles, research on batteries is becoming more and more important. However, battery SOC is unable to be measured directly and can only be estimated by the parameters of the battery voltage, current and temperature, which are also affected by various uncertainties such as battery aging, environmental temperature changes and vehicle driving status, so accurate SOC estimation has become an urgent problem in the development of electric vehicles.

Data Preprocessing
Since the SOC predictions are to be made in Section 5, it is important to pre-process these metrics. The main characteristics included in the dataset are battery voltage, current, mileage, maximum temperature, and minimum temperature, which are organized as shown in Table 5. The collected data contain a total of 36 indicators, including null values and redundant data not related to SOC. Therefore, to predict the battery SOC accurately, a lot of data preprocessing is needed before the model prediction; about 80% of the work is carried out in the process of cleaning and preparing the data [27]. Using the Pandas and NumPy tools on the completely parsed battery data, we first filter the data for null values and outliers (combined with visual analysis) and then use the describe() function to calculate the number (count), mean, standard deviation (std), minimum (min), maximum (max), and median of the battery data. On this basis, the battery data are segmented and analyzed as a whole to gain a preliminary understanding of the driving state and charging state of the car, which paves the way for the subsequent visual analysis.
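The filtering and summary step can be sketched with pandas as follows. The column names `sumvoltage` and `sumcurrent` appear elsewhere in this paper, while the `soc` column name and all values are made up for illustration; the real data set has 36 indicators and its own column names.

```python
import numpy as np
import pandas as pd

# Made-up records standing in for the parsed battery data.
records = pd.DataFrame({
    "sumvoltage": [350.2, 351.0, np.nan, 349.8, 352.1],
    "sumcurrent": [12.5, 13.0, 12.8, np.nan, 11.9],
    "soc": [73, 73, 72, 72, 71],
})

# Drop rows that contain null values, then summarize what is left.
cleaned = records.dropna()
summary = cleaned.describe()

# describe() reports the median as the "50%" row.
print(summary.loc[["count", "mean", "std", "min", "50%", "max"]])
```

Running `describe()` after the null filter, rather than before, keeps the counts of all columns aligned, which makes gaps in individual indicators easier to spot.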
In the data cleaning stage, it is necessary to understand the missing values, duplicate values, and abnormal points in the data. The first is missing values, for which there are three common situations: missing completely at random, missing at random, and missing not at random [28,29]. There are two ways of handling missing values:

1.
If the missing values account for less than 10% of the total data, they can be deleted directly.

2.
If the missing values account for a larger proportion of the total data, the missing data need to be filled in. The common ways of filling in are mean interpolation and regression replacement.
The experiments in this paper show that the data set used contains no missing values and that the number of abnormal values is very small. However, because the criteria for determining outliers differ, the identification of outliers can deviate. At the abstract level, anomalies are defined as patterns that do not conform to the expected normal behavior, so a simple anomaly detection method is to define a region that represents the normal behavior and to declare any observation in the data that does not belong to this normal region an anomaly [30].
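The two missing-value rules and the normal-region idea can be sketched in plain Python. The 10% threshold comes from the text above; the function names `handle_missing` and `flag_outliers`, the use of `None` to mark a missing value, and the example ranges are assumptions made for illustration, and mean filling stands in for the interpolation methods mentioned.

```python
from statistics import mean
from typing import List, Optional


def handle_missing(values: List[Optional[float]], threshold: float = 0.10) -> List[float]:
    """Delete missing entries if they are rare, otherwise fill them with the mean."""
    present = [v for v in values if v is not None]
    if not values or not present:
        return []
    missing_ratio = 1 - len(present) / len(values)
    if missing_ratio < threshold:
        return present                       # rule 1: delete directly
    fill = mean(present)
    return [fill if v is None else v for v in values]  # rule 2: mean filling


def flag_outliers(values: List[float], low: float, high: float) -> List[bool]:
    """Mark observations that fall outside the region declared as normal."""
    return [not (low <= v <= high) for v in values]


filled = handle_missing([1.0, None, 3.0])             # one third missing: fill
dropped = handle_missing([float(i) for i in range(19)] + [None])  # 5% missing: delete
flags = flag_outliers([30.0, 73.0, 120.0], low=0.0, high=100.0)
print(filled, len(dropped), flags)
```

The bounds passed to `flag_outliers` encode the criterion for "normal", so different bounds will flag different points, which is the source of deviation the paragraph above describes.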

K-Nearest Neighbor Algorithm
From the visual analysis, it can be seen that the SOC value has a linear regression relationship with some indicators, and the correlation is high, while the KNN algorithm is very effective for classification and regression problems. So, we use the KNN algorithm to predict SOC simply and quickly.
KNN was originally an intuitive classification method and has been widely used in pattern recognition. With a little modification, it can also be applied effectively to regression. However, because different models place different requirements on the data, KNN gives good prediction results only when an appropriate model is selected in combination with the characteristics of the data themselves [31]. The core idea of the KNN algorithm is that if most of the K nearest samples of a query sample in the feature space belong to a certain category, the query sample also belongs to this category. K is usually an integer no greater than 20. The implementation of KNN classification prediction is divided into the following steps:

1.
Randomly select K tuples from the training tuples as the initial nearest neighbor tuples, and calculate the distance from the test tuple to each of the K tuples;

2.
Sort according to increasing distance;

3.
Select the K points with the minimum distance;

4.
Determine the occurrence frequency of the categories of the first K points;

5.
The category with the highest frequency among the first K points is returned as the predicted classification of the test data [32].
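The classification steps above can be sketched in a few lines of Python. This sketch simplifies steps 1 to 3 by computing the distance to every training tuple and sorting once, instead of maintaining an initial set of K tuples; the feature values (a cell voltage and a current) and the labels are made up, and the function name `knn_classify` is hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_classify(train_x: List[Sequence[float]], train_y: List[str],
                 query: Sequence[float], k: int) -> str:
    """Return the most frequent label among the k training points nearest to query."""
    # Distance from the query to every training tuple, sorted in increasing order.
    ranked = sorted((math.dist(x, query), label) for x, label in zip(train_x, train_y))
    nearest_labels = [label for _, label in ranked[:k]]
    # The category with the highest frequency among the first k points wins.
    return Counter(nearest_labels).most_common(1)[0][0]


train_x = [(3.2, 10.0), (3.3, 12.0), (4.0, -20.0), (4.1, -25.0), (4.05, -22.0)]
train_y = ["driving", "driving", "charging", "charging", "charging"]

print(knn_classify(train_x, train_y, (4.0, -21.0), k=3))
print(knn_classify(train_x, train_y, (3.25, 11.0), k=3))
```

Because the two features have very different scales, the current dominates the distance in this toy example; this is the reason features are usually standardized before KNN is applied.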
In this paper, the data are divided 8:2 into training and test sets. Instead of leaving the most important KNN hyperparameters (n_neighbors, weights, and P) at their defaults, a range of candidate values is constructed for each of them. After the training and test sets are divided, a grid search is used to let the KNN algorithm find the optimal parameters and the best score for the given data set split. The ranges set for the hyperparameters affect the selected K and P values, so different ranges are analyzed and compared to obtain the optimal solution. The distance selected by P is calculated by Equations (1) and (2) and reflects the similarity between two points. The feature space of the K-nearest neighbor method is generally the n-dimensional real vector space R^n, and the Euclidean distance and the Manhattan distance [33] correspond to Equations (1) and (2), respectively.
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i| \tag{2}$$
where x_i denotes the predicted value, y_i is the sample value, and |x_i − y_i| denotes the absolute value of the difference between the predicted value and the sample value [34].
To avoid the distance deviation caused by the different sizes of different features, standardization is first required in data preprocessing. The prediction process of the KNN algorithm is shown in Figure 8.
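The combination of standardization, an 8:2 split, and a search over K and P can be sketched with the standard library alone. This is a minimal sketch, not the authors' pipeline: the hyperparameter names in the text suggest a library implementation, whereas here the KNN regressor, the scaler, and the grid search are written out by hand, the data are synthetic, the split is fixed rather than random, and all function names are hypothetical.

```python
import itertools
import statistics


def minkowski(a, b, p):
    """Distance of order p: p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def fit_scaler(rows):
    """Column means and standard deviations, computed on the training set only."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_regress(train_x, train_y, query, k, p):
    """Predict the mean target of the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: minkowski(train_x[i], query, p))
    return sum(train_y[i] for i in order[:k]) / k


def mse(y_test, y_pre):
    return sum((t - q) ** 2 for t, q in zip(y_test, y_pre)) / len(y_test)


def grid_search(train_x, train_y, test_x, test_y, k_values, p_values):
    """Try every (K, P) pair and keep the one with the lowest MSE."""
    best = None
    for k, p in itertools.product(k_values, p_values):
        predictions = [knn_regress(train_x, train_y, q, k, p) for q in test_x]
        score = mse(test_y, predictions)
        if best is None or score < best[0]:
            best = (score, k, p)
    return best


# Synthetic samples: [total voltage, minimum cell voltage] -> SOC.
samples = [([320.0 + 8 * i, 3.20 + 0.08 * i], 5.0 + 10 * i) for i in range(10)]
test_idx = {3, 7}  # 2 of 10 samples held out, i.e. an 8:2 split
train = [s for i, s in enumerate(samples) if i not in test_idx]
test = [s for i, s in enumerate(samples) if i in test_idx]

means, stds = fit_scaler([x for x, _ in train])
train_x = transform([x for x, _ in train], means, stds)
test_x = transform([x for x, _ in test], means, stds)
train_y = [y for _, y in train]
test_y = [y for _, y in test]

best_mse, best_k, best_p = grid_search(train_x, train_y, test_x, test_y,
                                       k_values=[1, 2, 3], p_values=[1, 2])
print(best_mse, best_k, best_p)
```

Fitting the scaler on the training set and reusing its statistics for the test set avoids leaking information from the test data into the model.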

Visualization
In this experiment, following the method in Section 4.2, we first filtered out missing values and outliers; second, we used correlation functions for an overall analysis of the battery data, with the screening results shown in Table 6; finally, the visual analysis was carried out step by step using the correlation function. Figure 9 shows the battery-related data and the visual analysis of vehicle status in turn. According to the results in Table 6, combined with Table 5, the vehicle has three states (start, off, and other) and is mostly in the start state for driving. The vehicle charging state is one of parking charging, driving charging, not charging, and charging complete. The maximum speed is 85.80 km/h, and the cumulative mileage ranges from 42,886.00 km to 59,776.00 km (16,890 km in total). SOC is above 30%, with an average of 73%. The maximum temperature value is mostly below 100 °C. The highest value of the minimum temperature is 35 °C, and the temperature fluctuates within −40 to 42 °C. This means that the whole vehicle is in a good driving condition and the battery is in a healthy state. These results give an intuitive understanding of the overall data of each indicator of the car. We first grasp the data distribution as a whole to prepare for the detailed visualization work that follows.
Figure 10a shows that there is no direct relationship between SOC and speed and that the speed is basically below 80 km/h. However, it can be seen that the battery goes through the two processes of discharging and charging, and the vehicle goes through the two states of parking charge and discharging. This can also be derived from the SOC versus time graph in Figure 10b. Figure 10c shows the SOC in turn against sumvoltage, minbatterysinglevoltageval, and sumcurrent. The SOC curve follows almost the same trend as sumvoltage and minbatterysinglevoltageval, while in the charging stage the current first falls and then rises. Figure 10d shows that there is no substantial pattern between the maximum and minimum temperature curves and the SOC curve, and the correlation is low.
Finally, the correlation heat map in Figure 11 gives a more intuitive view of the correlation between these 15 battery data indicators. The correlation of both minbatterysinglevoltageval and sumvoltage with SOC is 0.98, which is close to 1 and indicates a high correlation. The analysis shows that the SOC of the battery has the highest linear correlation with minbatterysinglevoltageval and sumvoltage, indicating that they have the greatest impact on the SOC during battery operation and can best predict the SOC value. Based on the above analysis, we take these two indicators as input features for prediction with the KNN algorithm, and then add sumcurrent as a further feature to reduce contingency.
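The correlation values behind such a heat map are typically Pearson coefficients, and ranking the indicators by their absolute correlation with SOC can be sketched as follows. Whether the authors used the Pearson coefficient is an assumption; the `maxtemperature` column name and all values are made up, while `sumvoltage` is taken from the text.

```python
import math
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_by_correlation(features: Dict[str, List[float]],
                        target: List[float]) -> List[Tuple[str, float]]:
    """Sort indicators by the absolute value of their correlation with the target."""
    scored = [(name, pearson(values, target)) for name, values in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


soc = [20.0, 35.0, 50.0, 65.0, 80.0]
features = {
    "maxtemperature": [31.0, 29.0, 33.0, 30.0, 32.0],
    "sumvoltage": [330.0, 341.5, 352.0, 364.0, 375.5],
}
ranking = rank_by_correlation(features, soc)
print(ranking)
```

Indicators at the top of the ranking are the natural candidates for model inputs, which mirrors how the voltage indicators are selected above.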

KNN Prediction
Following the method in Section 4.3, the data set is divided and the hyperparameters are screened, and sumcurrent, sumvoltage, and minbatterysinglevoltageval are selected as the input features of the model that predicts SOC. The two most important hyperparameters of the KNN algorithm are K (the number of nearest neighbor samples) and P (the selected distance: P = 1 is the Manhattan distance and P = 2 is the Euclidean distance). We design several sets of value ranges to compare these hyperparameters and use the mean square error (see Equation (3)) as the evaluation index, where y_test is the tested data and y_pre is the predicted data, so as to find the optimal parameters. See Table 7 for details. The selected hyperparameters are K = 5 and P = 1, which give an MSE of 0.5864 and a test score of 0.9989. Compared with K = 3 and P = 2 (MSE 0.6258, test score 0.9988), the MSE is 0.0394 lower and the test score is slightly higher, so the predicted distribution fits the original data better and the accuracy of the predicted SOC is slightly improved. Figure 12 below shows the prediction results obtained with different K and P values.

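$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_{\mathrm{test},i} - y_{\mathrm{pre},i} \right)^2 \tag{3}$$
where n is the number of test samples.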
In Figure 12, red represents the SOC in the original data and green represents the SOC predicted from the key factors identified in the visual analysis. It can be seen that, regardless of the selected range of K and P values, the sumcurrent, sumvoltage, and minbatterysinglevoltageval indicators analyzed in Section 5.1 can accurately predict the SOC value of a lithium battery.
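The two evaluation figures can be computed as in the sketch below. The MSE follows its usual definition; the text does not define the "test score", so treating it as the coefficient of determination (R²), which is what regression models commonly report as their score, is an assumption. The sample values are made up.

```python
from typing import List


def mean_squared_error(y_test: List[float], y_pre: List[float]) -> float:
    """Average squared difference between tested and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_test, y_pre)) / len(y_test)


def r2_score(y_test: List[float], y_pre: List[float]) -> float:
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean_y = sum(y_test) / len(y_test)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_test, y_pre))
    ss_tot = sum((t - mean_y) ** 2 for t in y_test)
    return 1.0 - ss_res / ss_tot


y_test = [70.0, 72.0, 74.0, 76.0]  # made-up SOC values in percent
y_pre = [71.0, 72.0, 73.0, 76.0]   # made-up predictions

error = mean_squared_error(y_test, y_pre)
score = r2_score(y_test, y_pre)
print(error, score)
```

The MSE is expressed in squared SOC units and depends on the scale of the target, whereas R² is scale-free, which is why a score near 1 can coexist with an MSE that is not close to zero.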

Conclusions
In this paper, a new analytical method based on the original data of lithium batteries is proposed. The method takes the abstract hexadecimal message data generated by the lithium battery at the source, parses them into intuitive and understandable data with an algorithm designed according to the GB/T32960 standard, and uses the visual method to identify, step by step, the key factors in the original data that are highly linearly correlated with SOC. Finally, the KNN algorithm is used for modeling: the key factors are taken as input and a range is set for each hyperparameter, so that the KNN algorithm can select its parameters according to the distribution of the data set. The test score of the SOC prediction reaches 0.9988, improving the accuracy of SOC prediction. Verification shows that the parsing method can completely parse the original battery data of the GB/T32960 standard, including three different types of messages (vehicle login, real-time information reporting, and vehicle logout), with a correct parsing rate of over 95%. This offers a fresh approach to the study of battery data. In the future, the method could also be connected to the current BMS for intuitive data processing and early warning judgment.
Author Contributions: W.R. and X.B. conceived and designed the experiment; W.R. and X.B. performed the experiments; M.L., Z.X. and J.W. surveyed the data; W.R. analyzed the data and wrote the paper; J.G. and A.C. supervised and operated the project. All authors have read and agreed to the published version of the manuscript.