Entropy
  • Article
  • Open Access

22 April 2019

An Entropy-Based Car Failure Detection Method Based on Data Acquisition Pipeline

Department of Applied Computer Science, Faculty of Electrical Engineering, Automatics, Computer Science and Biomedical Engineering, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków, Poland
Author to whom correspondence should be addressed.
This article belongs to the Special Issue Entropy-Based Fault Diagnosis

Abstract

Modern cars are equipped with numerous electronic devices called Electronic Control Units (ECUs). ECUs collect diagnostic data from a car’s components such as the engine, brakes, etc. These data are then processed, and the appropriate information is communicated to the driver. From the point of view of the safety of the driver and the passengers, information about car faults is vital. Despite the development of on-board computers, only a small amount of information is passed on to the driver. With a data mining approach, it is possible to obtain much more information from the data than is provided by standard car equipment. This paper describes the environment built by the authors for data collection from ECUs. The collected data have been processed using parameterized entropies and data mining algorithms. Finally, we built a classifier able to detect a malfunctioning thermostat even if the car equipment does not indicate it.

1. Introduction

Car reliability degrades over time. The distance driven is important, but car parts wear out and deteriorate even if a car stands unused for months. Rust builds up on brake rotors, rubber parts rot and leak, etc. The conditions in which the vehicle is used are also vital. Driving in the city differs a lot from driving on the highway. In the city a car is driven at varying RPMs; it starts and stops a lot. On the other hand, on the highway a car maintains a constant speed for a considerable amount of time. People rely on cars, and cars break from time to time.
Modern cars equipped with plenty of Electronic Control Units (ECU) are able to detect a fault and pass on the relevant information to the driver. Unfortunately, in most cases the message is reduced to a warning light on the dashboard, and only an authorized car service is able to diagnose the fault based on error codes. Thus, a question can be asked: if we had access to data read by sensors, could we diagnose the situation ourselves? And even more, what additional knowledge about the car’s work can be obtained from such data? For example, can we assess the driver’s driving style?
This article is meant to provide a partial answer to these questions. It describes a data acquisition system that collects data from a car’s ECUs and prepares them for further exploration with data mining techniques. The system was built with minimal financial effort. Its components include an Android-based smartphone, the Torque Pro application, and the Akka HTTP framework. The data acquisition system captures a live stream of data from a car, stores it, and does not distract the driver. Prototype versions of our system were presented in two conference papers [1,2]. The most important step involved decoding the Torque protocol.
The presented system was successfully used to collect real data from a Hyundai i30 vehicle. The data set covers 44 h of car driving. We were able to obtain data both from a car with a malfunctioning thermostat and from the same car after the thermostat had been repaired. We then used these data to build a classifier that detects whether the analyzed sample points come from a malfunctioning thermostat. As it turned out, the use of parameterized entropy significantly influenced the quality of the obtained results.
This paper is organized as follows: Section 2 presents related works. The data acquisition system is presented in Section 3. Section 4 deals with a short introduction to parameterized entropies. The collected data set is described in Section 5. Some results of the data exploration are presented in Section 6. A short summary is given in the final section.

3. Data Acquisition

The data for the presented research were collected from a Hyundai i30 CW manufactured in 2009 and equipped with an OBD2 diagnostic interface. The amount of data produced by the car requires an automated processing pipeline. Such a data acquisition system was developed as part of this research. The system needed to work on-line while the car was being driven and could not require interaction with the driver.
The diagnostic data are collected by an OBD2 ELM327 microchip. The OBD2 reader connects via Bluetooth to a Xiaomi Redmi 4X smartphone running the Torque Pro Android application. The Torque application collects live data from the car. The capture interval is configurable; based on experiments, it was set to 1 s.
The Torque Pro application, apart from saving the collected data to a local file on the smartphone, allows the user to set a custom HTTP URL to which the data are sent. It sends the data over HTTP encoded in a query string. In order to find out what parameters are being sent, the configuration file of the application was reverse-engineered. Identifiers along with labels are stored in a file named torqueConf.dat. Example data are presented below:
  • log_min46 = 0.0 – minimal value,
  • log_max46 = 100.0 – maximal value,
  • log_pid46 = 16716401 – identifier number,
  • log_fullName46 = Fuel used (trip) – description,
  • log_unit46 = l – unit.
The application stores parameter IDs in a decimal number format but sends them in hexadecimal. What is more, it maps internally between its own parameter ordering and external IDs. The above-mentioned parameter 46, with decimal ID 16716401, is sent as hexadecimal FF1271. An example URI sent by the application is presented in Listing 1 (line breaks were added to make the text more readable):
Listing 1. Example URI
?eml=bartekviper@gmail.com&kff1222=0.47763458
&v=8&kff1221=0.7290559
&session=1529270311928&kff1239=4.5509996
&id=80c9065bff55006e9f5f3d4f8d9456ae&kff1269=49.0
&time=1529271850358&kff1005=20.00668569
&kff1005=20.00668569&kff1006=50.08636538
&kff1006=50.08636538&kff123b=0.0
&kff1001=0.0&kff1203=0.0
&kff1007=0.0&kff5202=5.5996585
&kff129a=61.0&kff5203=17.85786
&k2d=99.21875&kff1201=0.0
&k33=98.0&kff5201=15.818244
&kb=98.0&kff1238=13.0
&k23=8560.0&k42=13.212
&kff1267=109.728&k4=0.0
&kff1268=438.995&kc=0.0
&kff1266=548.0&k21=0.0
&kff1223=-0.030030591&k31=65535.0
&kff1220=0.2738867&kff126b=2.0411227
&kff126a=33.774616&k5=95.0
&kff1204=5.892079&kff1202=0.0
&k10=3.52&kff1206=7.2493286
&kff1001=0.0&kff1010=268.0
&kd=0.0&kff1271=1.289121
&kff1237=0.0&kff125d=1.286436
&kff123a=17.0&kff125a=21.4406
&k2c=4.7058825&kff1272=22.889725
&k46=28.0&kff1208=13.794094
&kf=39.0
A performed request contains standard OBD data, such as the engine coolant temperature, as well as data calculated by Torque Pro, for example, the fuel used. Meta information is also sent; it contains:
  • eml—registered email address,
  • v—protocol version,
  • session—when measurement started in milliseconds since 1 January 1970 UTC,
  • id—UUID session identifier with stripped “-”,
  • time—measurement time in milliseconds since 1 January 1970 UTC.
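The query-string format above can be decoded with a few lines of Python. The sketch below is an assumption-based parser, not the authors’ server code: it uses the pattern observed during reverse engineering (a leading “k” followed by the PID in hexadecimal) and an example email address in place of the real one.

```python
# Sketch of a parser for a Torque Pro request like Listing 1.
# Key names such as "kff1271" carry the PID in hexadecimal after the "k".
from urllib.parse import parse_qs

META_KEYS = {"eml", "v", "session", "id", "time"}

def parse_torque_request(query):
    """Split a Torque Pro request into meta information and PID readings."""
    raw = parse_qs(query.lstrip("?"))
    meta, readings = {}, {}
    for key, values in raw.items():
        if key in META_KEYS:
            meta[key] = values[0]
        elif key.startswith("k"):
            pid = int(key[1:], 16)   # hex PID, e.g. "ff1271" -> 16716401
            readings[pid] = float(values[0])
    return meta, readings

meta, readings = parse_torque_request(
    "?eml=user@example.com&v=8&time=1529271850358"
    "&kff1271=1.289121&k5=95.0"
)
# Parameter 46 from torqueConf.dat: decimal ID 16716401 == 0xFF1271
assert int("FF1271", 16) == 16716401
```

Duplicate keys (e.g. kff1005 appearing twice in Listing 1) are collected into lists by parse_qs; the sketch simply keeps the first value.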
The server is capable of processing data with high throughput and allows multiple connections for different test drives to be created. Torque Pro sends standard OBD2 PIDs (Parameter IDs) along with proprietary ones.
The server saves the collected data into InfluxDB, a time-series database. It is designed to store and query time-oriented data without delays, which is important from the server’s perspective. Each measured attribute is stored in a separate table with a tag assigned to it. Tags allow individual test drives to be identified.
Data are not only saved to the database but also published on queues, one queue per measured attribute. The server publishes messages in JSON (JavaScript Object Notation) format. There can be multiple independent subscribers for each queue, so the data can be processed in real time.
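A per-attribute queue message might look like the sketch below. The exact field names are an assumption; the paper only states that messages are JSON, that there is one queue per measured attribute, and that tags identify test drives.

```python
# Minimal sketch of a JSON queue message for one measurement.
# Field names ("drive", "time", "attribute", "value") are assumptions.
import json

def make_message(drive_id, timestamp_ms, attribute, value):
    """Serialize one measurement as a JSON queue message."""
    return json.dumps({
        "drive": drive_id,        # tag identifying the test drive
        "time": timestamp_ms,     # milliseconds since 1 January 1970 UTC
        "attribute": attribute,
        "value": value,
    })

msg = make_message("drive-001", 1529271850358, "Engine Coolant Temperature", 68.0)
```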

4. Entropy

This section provides the theoretical fundamentals of entropy. It starts with a brief overview of the Shannon entropy, and then the parameterized generalizations are presented.
Entropy as a measure of disorder has its origin in thermodynamics. The concept was proposed in the early 1850s by Clausius [10]. Shannon [11] adopted entropy for information theory in 1948. In information theory, the concept is used to measure the uncertainty of a random variable: the greater the entropy, the more random the variable. Let X be a random variable that can take values {x_1, …, x_n}, and let p(x_i) be the probability mass function of outcome x_i. The Shannon entropy of variable X is defined as:
$$H_S(X) = \sum_{i=1}^{n} p(x_i) \log_a \frac{1}{p(x_i)}$$
Depending on the value of the parameter a, different units are used: bits (a = 2), nats (a = e) or hartleys (a = 10). For more details about the Shannon entropy, see [12,13].
The Shannon entropy assumes a trade-off between contributions from the main mass of the distribution and the tail [14]. In order to control the trade-off explicitly, we must use a generalization of the Shannon entropy. Two such generalizations were proposed by Renyi [15] and Tsallis [16] respectively. The Renyi entropy is defined as:
$$H_R^{\alpha}(X) = \frac{1}{1-\alpha} \log_a \sum_{i=1}^{n} p(x_i)^{\alpha}$$
while the Tsallis entropy is defined as:
$$H_T^{\alpha}(X) = \frac{1}{1-\alpha} \left( \sum_{i=1}^{n} p(x_i)^{\alpha} - 1 \right)$$
If the parameter denoted as α has a positive value, it exposes the main mass (the concentration of events that occur often), whereas if the value is negative, it exposes the tail (the dispersion caused by rare events). Both entropies converge to the Shannon entropy for α → 1. A more detailed comparison of all the above-mentioned entropies can be found in [17].
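The three entropies above can be sketched in a few lines of Python (using natural logarithms, i.e., nats). The example distribution is illustrative; the assertions check the stated convergence to the Shannon entropy as α approaches 1.

```python
# Sketch implementations of the Shannon, Renyi, and Tsallis entropies
# defined above, in nats (a = e).
import math

def shannon(p):
    """Shannon entropy H_S of a probability distribution p."""
    return sum(pi * math.log(1.0 / pi) for pi in p if pi > 0)

def renyi(p, alpha):
    """Renyi entropy H_R^alpha; converges to Shannon for alpha -> 1."""
    return math.log(sum(pi ** alpha for pi in p if pi > 0)) / (1.0 - alpha)

def tsallis(p, alpha):
    """Tsallis entropy H_T^alpha; converges to Shannon for alpha -> 1."""
    return (sum(pi ** alpha for pi in p if pi > 0) - 1.0) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
# Both generalizations approach the Shannon entropy as alpha -> 1.
assert abs(renyi(p, 1.0001) - shannon(p)) < 1e-3
assert abs(tsallis(p, 1.0001) - shannon(p)) < 1e-3
```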
Let us consider the data sample presented in Table 1. The table contains 50 values drawn from the set presented in Table 2. Probability distribution for these values is also shown in Table 2.
Table 1. Data sample.
Table 2. Probability distribution.
Now, let us examine the impact of a rare event on the Renyi entropy for different values of the α parameter. We used a sliding window of length 7, i.e., the first window contains the values with indices from 1 to 7, the second window contains the values with indices from 2 to 8, etc. Figure 1 shows the Renyi entropy values for six different values of the α parameter. It is worth paying attention to the entropy values for window number 30 and the six consecutive windows. These windows contain measurement number 36, which holds a value with a very low probability (a rarely occurring one). This fact is illustrated very well by the significant increase in the value of the entropies with negative α.
Figure 1. Renyi entropies for the data from Table 1.
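The sliding-window effect described above can be reproduced with a small sketch. The sample values and probabilities below are illustrative, not those of Tables 1 and 2; the window length is 7, as in the text, and a negative α makes windows containing the rare value stand out.

```python
# Sliding-window Renyi entropy: a rare value inside a window sharply
# increases the entropy for negative alpha. Data below are illustrative.
import math

def renyi(p, alpha):
    return math.log(sum(pi ** alpha for pi in p)) / (1.0 - alpha)

prob = {1: 0.4, 2: 0.4, 3: 0.19, 9: 0.01}   # 9 is the rare event
sample = [1, 2, 1, 3, 2, 1, 2, 3, 1, 9, 2, 1, 3, 2]

window = 7
entropies = [renyi([prob[v] for v in sample[i:i + window]], alpha=-2)
             for i in range(len(sample) - window + 1)]

# Windows starting at indices 3..7 contain the rare value 9 and have
# much higher entropy than the windows that do not.
assert min(entropies[3:]) > max(entropies[:3])
```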

5. Dataset

In order to verify whether the entropy-based approach is suitable for detecting a car failure, we used the data acquisition system described in Section 3 to collect data from the mentioned Hyundai i30 CW. We collected data corresponding to a total of 44 h of driving. The data are stored in text files in the Comma Separated Values (CSV) format. Each file corresponds to a single drive. The first line of such a file contains the header (attribute names) and each of the other lines contains one data record. A snippet of one of the files is shown in Listing 2. The data are displayed in columns instead of rows to make them more readable.
Listing 2. Example log file.
GPS Time,Wed Dec 27 20:58:49 GMT +01:00 2017
Device Time,27-gru-2017 20:58:48.388
Longitude,20.4570165
Latitude,50.80177848
GPS Speed (Meters/second),8.24
Horizontal Dilution of Precision,3.0
Altitude,318.0
Bearing,109.7
G(x),0.86602783
G(y),7.83947754
G(z),4.78611755
G(calibrated),0.00805598
EGR Error(%),-
Barometric pressure (from vehicle)(psi),-
Intake Manifold Pressure(psi),15.37400055
Fuel Rail Pressure(psi),-
Run time since engine start(s),-
Trip time(whilst stationary)(s),0
Trip time(whilst moving)(s),0
Trip Time(Since journey start)(s),0
GPS Bearing(°),109.69999695
Timing Advance(°),-
Litres Per 100 Kilometer(Instant)(l/100km),-
Horsepower (At the wheels)(hp),-
Engine kW (At the wheels)(kW),-
Torque(Nm),-
Voltage (OBD Adapter)(V),-
Voltage (Control Module)(V),-
Engine Load(%),-
Engine RPM(rpm),-
Distance travelled with MIL/CEL lit(km),-
Distance travelled since codes cleared(km),-
Percentage of City driving(%),100
Percentage of Highway driving(%),0
Percentage of Idle driving(%),0
Trip Distance(km),-
Trip distance (stored in vehicle profile)(km), 491.32943726
Mass Air Flow Rate(g/s),17.04999924
Speed (OBD)(km/h),-
EGR Commanded(%),-
Ambient air temp(°C),-
Intake Air Temperature(°C),-
Engine Coolant Temperature(°C),68
Turbo Boost & Vacuum Gauge(psi),0.67400074
Trip average KPL(kpl),-
Trip average Litres/100 KM(l/100 km),-
Table 3 lists all measured attributes with short descriptions. It is important to note that either it was not possible to measure all of them, or their meaning differs depending on the engine type (diesel or gasoline). For example, the timing advance is used in gasoline engines to prevent the knock effect by adjusting when a spark plug should give a spark. In diesel engines a knock effect can also occur, but fuel injectors are used to eliminate this undesirable effect.
Table 3. CSV headers with descriptions.
The collected data contain measurements performed on the car with a malfunctioning thermostat. The thermostat remained half-open, causing the engine temperature to drop below the operating temperature. It also led to a loss of cabin heating. For validation purposes, after the thermostat had been replaced, new data were collected describing how a normally operating thermostat influences the engine coolant temperature.
The broken-thermostat data set consists of 25 files, which amounts to around 16 h of raw driving data. The second data set, with the working thermostat, consists of 49 files, which equals around 27 h of driving.
The first step was a preprocessing phase to prepare the data for further analysis. For diagnostic purposes, the engine must be running in order to produce heat and cause the thermostat to open and close. All data rows where the engine revolutions per minute indicated that the engine had not been running were discarded. Thus, the samples between around the 10,000th and 12,000th second in Figure 2 were discarded, and the same was done in every other data set. However, it does not matter whether the engine was idling or not, because the engine coolant flow is caused by the water pump.
Figure 2. Engine coolant temperature and engine RPM for broken thermostat.
Figure 2 presents the broken thermostat’s behavior, showing both engine coolant temperature and engine RPM measurements. The engine is running when the RPM readouts are above 0, at around 1000. When it is stopped, the coolant does not circulate, so the sensor reads the local temperature. Around the 12,000th second since the test drive had begun, the engine was started again and the temperature reading dropped, because the engine had cooled down. The engine water pump, which forces the coolant to circulate, is driven by either the timing mechanism or the fan belt.
For comparison, Figure 3 shows the engine coolant temperature together with the engine RPM for a working thermostat. Before the test drive started, the engine already had some residual heat. Until the engine reached its operating temperature, the thermostat remained closed. This is why the engine is required to run.
Figure 3. Engine coolant temperature and engine RPM for working thermostat.
After completing the first step, all data were divided into batches, each containing 600 samples, which equals 10 min of data. Finally, those batches were enriched with calculated entropies. Within each batch, a sliding window of 10 samples (i.e., 10 s) was used. Inside the selected time window, both Tsallis and Renyi entropies were calculated for both negative and positive values of the α parameter. The entropies rely on reference probabilities calculated from a reference model. We used the sample shown in Figure 3 to determine the probabilities of individual temperature values. For values not present in the sample, the probability is assumed to be equal to the smallest of the calculated ones. The calculated probabilities are given in Table 4. The smallest value is equal to 0.004653044709690472.
Table 4. Probabilities in reference model.
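The preprocessing and enrichment steps described above can be sketched as follows. The data values, the shortened window, and the helper names are illustrative assumptions; the paper specifies the steps themselves (discard rows with the engine off, build a reference probability model with a floor for unseen values, compute windowed entropies).

```python
# Sketch of the preprocessing pipeline: filter by RPM, build reference
# probabilities, enrich with windowed Tsallis entropy. Data are invented.
from collections import Counter

def reference_probabilities(temperatures):
    """Empirical probability of each temperature in the reference drive."""
    counts = Counter(temperatures)
    total = len(temperatures)
    return {t: c / total for t, c in counts.items()}

def probability(ref, value):
    """Unseen values get the smallest probability from the reference model."""
    return ref.get(value, min(ref.values()))

def tsallis(p, alpha):
    return (sum(pi ** alpha for pi in p) - 1.0) / (1.0 - alpha)

# Illustrative reference drive (working thermostat) and a new drive.
reference_drive = [88, 89, 90, 90, 89, 88, 90, 89, 88, 90]
rows = [(0, 45), (950, 60), (1800, 88),
        (2100, 89), (1500, 40), (0, 30)]      # (RPM, coolant temp in deg C)

# Step 1: discard rows where the engine is not running (RPM == 0).
running = [temp for rpm, temp in rows if rpm > 0]

# Step 2: enrich with windowed entropy (window shortened to 3 here;
# the paper uses 10-sample windows inside 600-sample batches).
ref = reference_probabilities(reference_drive)
window = 3
entropies = [tsallis([probability(ref, t) for t in running[i:i + window]],
                     alpha=2)
             for i in range(len(running) - window + 1)]
```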
After many experiments, the following α parameters for the Renyi and Tsallis entropies were selected; they are presented in Table 5.
Table 5. Selected α for each calculated entropy type.
Both Figure 4 and Figure 5 correspond with temperature data presented in Figure 2.
Figure 4. Renyi entropy for broken thermostat.
Figure 5. Tsallis entropy for broken thermostat.
Figure 6 and Figure 7 correspond with temperature data presented in Figure 3.
Figure 6. Renyi entropy for working thermostat.
Figure 7. Tsallis entropy for working thermostat.
Visual analysis of both Figure 2 and Figure 3 makes it possible to distinguish a working thermostat from a broken one. A thermostat which operates correctly should maintain the engine temperature within a given range, as shown in Figure 3. A broken thermostat will either leak or not open. In this case, it leaked and caused the engine to cool excessively. The undesirable behavior can also be observed when analyzing the entropy values. It is very noticeable in the Tsallis entropies: for the working thermostat and negative α, the calculated value is almost always equal to 0, with some local aberrations (see Figure 7). The same can be noticed for positive α when analyzing the broken thermostat (see Figure 5). The Renyi entropy plots also differ between the working and the broken thermostat, but the differences for both positive and negative α are not as significant as with the Tsallis entropy.

6. Machine Learning Approach

It is possible to distinguish between broken- and working-thermostat characteristics on the graphs. One of our goals was to create an automated system that supports car diagnosis. This is especially important because the standard diagnosis procedure is based on experiments which can do further damage to the engine if the thermostat does not open at the right moment. Such a system should also spot advancing wear.
There are two approaches to diagnosis: the first is based on a threshold, the second on a machine learning algorithm. The threshold approach has a major disadvantage: it assumes that there is a certain temperature range in which the thermostat should operate, and any temperature outside this range is a sign of a fault. A more advanced solution uses machine learning algorithms. The major advantage of ML over a threshold-based algorithm is the model’s ability to generalize. However, it requires diverse training samples.
In this study, a classification decision tree was used. It is a supervised machine learning algorithm designed to determine, based on the provided features (attributes), which group a sample belongs to. In this experiment there were two groups: working and broken. The machine learning process consists of two phases: training and testing. The entire data set was divided into two subsets, one for each phase. Because training is more important than testing (testing only validates the quality of the model), the training subset contains 80% of all available samples.
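The core idea of decision-tree classification can be illustrated with a one-level tree (a decision stump) in plain Python. The feature values and labels below are invented; the paper’s actual models were trained on entropy features with an 80/20 train/test split.

```python
# A one-level decision tree (decision stump): find the single threshold
# on one feature that best separates "working" from "broken" samples.

def train_stump(samples):
    """Return (threshold, training accuracy, flipped) for the best split."""
    best = None
    values = sorted(x for x, _ in samples)
    for i in range(len(values) - 1):
        thr = (values[i] + values[i + 1]) / 2
        # Tentatively predict "broken" above the threshold.
        correct = sum((x > thr) == (y == "broken") for x, y in samples)
        acc = max(correct, len(samples) - correct) / len(samples)
        if best is None or acc > best[1]:
            flipped = correct < len(samples) - correct
            best = (thr, acc, flipped)
    return best

# Invented Tsallis entropy values (positive alpha) with ground-truth labels.
train = [(0.02, "working"), (0.05, "working"), (0.04, "working"),
         (0.61, "broken"), (0.55, "broken"), (0.70, "broken")]
threshold, accuracy, flipped = train_stump(train)

def predict(x):
    above = x > threshold
    return "broken" if (above != flipped) else "working"
```

A real decision tree repeats this splitting recursively over several features, which is how the models described below were built.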
The following experiments with different feature sets were conducted:
  • Model A – temperature only,
  • Model B – Tsallis entropy,
  • Model C – Renyi entropy, Tsallis entropy, mean and median temperature.
For the verification of model quality, the following measures were used:
  • feature importance – the percentage impact a given feature has on the result,
  • mean squared error (MSE) – a measure which indicates how accurate the model is,
  • model’s score – according to the documentation, the mean error rate.
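The error measures above are straightforward for binary labels. The sketch below uses invented predictions, with labels encoded as 0 = working and 1 = broken; it also shows why, for 0/1 labels, the MSE equals the error rate (which is consistent with the reported accuracy and MSE values summing to 100%).

```python
# MSE and error rate for a binary classifier (0 = working, 1 = broken).
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def error_rate(y_true, y_pred):
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # invented ground truth
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]   # invented predictions (1 mistake)

# For 0/1 labels each squared error is 0 or 1, so MSE == error rate.
assert mse(y_true, y_pred) == error_rate(y_true, y_pred) == 0.125
```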
Three experiments were conducted with the different parameters described in Table 6. As a result of each experiment, a different model was created. The models are based on a machine learning algorithm called a decision tree. This algorithm labeled data from the test set as working or broken. For all models, values are labeled based on the leaf they belong to.
Table 6. Created models based on specific attributes.
The decision tree created for model A, presented in Figure 8, has a broad structure. It would have fewer nodes only if the subtree on the left did not contain nodes labeled as working. It scored 86.8% accuracy with an MSE equal to 13.2%.
Figure 8. Decision tree based on temperature value.
Model B (see Figure 9) was based solely on Tsallis entropy values for both positive and negative α. It is the least complicated of all the performed experiments. It also confirms the observed regularity that the Tsallis entropy is sufficient to determine the thermostat’s condition. The model scores 88.9% and its MSE is equal to 11.1%. The feature importance shows that the positive α has a 93.5% impact on the model’s prediction, while the negative α has only 6.5%.
Figure 9. Decision tree based on Tsallis entropy.
The last developed model, C, is based on both Tsallis and Renyi entropies for positive and negative α. For this experiment, mean and median temperatures were also used. This makes the model too detailed, and the number of nodes it contains makes it impossible to visualize. The Renyi entropy is irrelevant, because the values calculated for both positive and negative α have an importance of 0.0%. The Tsallis entropy performed similarly to model B, reaching an importance of 92.5% for positive α and 5.6% for negative α. The mean temperature contributes 1.8% to the overall performance, and the median has less than 0.1% importance. The model scored 88.8% correctness.

7. Conclusions

Entropies allow the creation of a concise decision tree model, in particular the Tsallis entropy, which alone allowed the creation of model B with an accuracy of 88.9%. Although all the models’ scores vary only by a few percent, model B is the least complicated and thus the most generic of all.
For an experienced car mechanic, investigating a thermostat problem is a simple job. Lack of cabin heating and a significant engine temperature drop are noticeable symptoms of a thermostat broken in an open position. On the other hand, when the thermostat is not able to open, the engine overheats, which damages, for example, the engine block.
It is believed that self-driving cars will become common on public roads within the next few years. A lot of effort is put into the artificial intelligence that drives those cars, while undisturbed operation seems to have been put in second place. One of the authors’ goals is to provide a reliable solution which improves day-to-day car usage and will have an impact on self-driving cars in the future.
From the drivers’ perspective, the solution should be intuitive and cheap. The developed system meets both criteria, although it has limited functionality for now. The mentioned criteria are crucial from the adoption standpoint: expensive or unintuitive solutions are unacceptable to the end user. In addition, client-server solutions are universal, as smartphones are responsible for collecting the data and sending them to the server. Battery life is not an issue here, because it is assumed that the device will be used only for diagnostic purposes or will be connected to a charger.
Future work includes developing a dedicated device powered by the OBD2 plug. Such a device could be kept hidden in a car and work automatically without human intervention. More importantly, further research includes machine learning model enhancements. We want to develop methods that will not only detect failures or provide additional information compared to the on-board computer, but can also be used, for example, to assess the driver’s driving style, especially from an economic point of view.

Author Contributions

The authors contributed equally to this work.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kowalik, B. Introduction to car failure detection system based on diagnostic interface. In Proceedings of the 2018 International Interdisciplinary PhD Workshop (IIPhDW), Świnoujście, Poland, 9–12 May 2018; pp. 4–7.
  2. Kowalik, B.; Szpyrka, M. Architecture of on-line data acquisition system for car on-board diagnostics. MATEC Web Conf. 2019, 252, 02003.
  3. Xue, X.; Li, C.; Cao, S.; Sun, J.; Liu, L. Fault Diagnosis of Rolling Element Bearings with a Two-Step Scheme Based on Permutation Entropy and Random Forests. Entropy 2019, 21, 96.
  4. Dandare, S.; Dudul, S. Multiple fault detection in typical automobile engines: A soft computing approach. WSEAS Trans. Signal Process. 2014, 10, 254–262.
  5. Li, Y.; Yang, Y.; Li, G.; Xu, M.; Huang, W. A fault diagnosis scheme for planetary gearboxes using modified multi-scale symbolic dynamic entropy and mRMR feature selection. Mech. Syst. Signal Process. 2017, 91, 295–312.
  6. Budiharto, W. The development of an expert car failure diagnosis system with Bayesian approach. J. Comput. Sci. 2013, 9, 1383–1388.
  7. Yash, J.; Rashi, A.; Neeta, V.; Swati, J. Approach towards Car Failure Diagnosis-An Expert System. Int. J. Comput. Appl. 2010, 1.
  8. Cai, B.; Liu, Y.; Fan, Q.; Zhang, Y.; Liu, Z.; Yu, S.; Ji, R. Multi-source information fusion based fault diagnosis of ground-source heat pump using Bayesian network. Appl. Energy 2014, 114, 1–9.
  9. Cai, B.; Liu, H.; Xie, M. A real-time fault diagnosis methodology of complex systems using object-oriented Bayesian networks. Mech. Syst. Signal Process. 2016, 80, 31–44.
  10. Clausius, R.; Hirst, T. The Mechanical Theory of Heat: With its Applications to the Steam-Engine and to the Physical Properties of Bodies; J. van Voorst: London, UK, 1867.
  11. Shannon, C. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  12. Karmeshu, J. Entropy Measures, Maximum Entropy Principle and Emerging Applications; Springer-Verlag New York, Inc.: Secaucus, NJ, USA, 2003.
  13. Harremoes, P.; Topsoe, F. Maximum Entropy Fundamentals. Entropy 2001, 3, 191–226.
  14. Maszczyk, T.; Duch, W. Comparison of Shannon, Renyi and Tsallis Entropy Used in Decision Trees. In Artificial Intelligence and Soft Computing – ICAISC 2008; Rutkowski, L., Tadeusiewicz, R., Zadeh, L., Zurada, J., Eds.; Springer: Berlin, Germany, 2008.
  15. Renyi, A. Probability Theory; North-Holland Pub. Co.: Amsterdam, The Netherlands, 1970.
  16. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  17. Bereziński, P.; Jasiul, B.; Szpyrka, M. An Entropy-Based Network Anomaly Detection Method. Entropy 2015, 17, 2367–2408.
