Factor Identification and Prediction for Teen Driver Crash Severity Using Machine Learning: A Case Study

Appl. Sci. 2020, 10, 1675

Abstract: Crashes among young and inexperienced drivers are a major safety problem in the United States, especially in areas with large rural road networks, such as West Texas. Rural roads present many unique safety concerns that are not fully explored. This study presents a complete machine learning pipeline to find the patterns of crashes involving teen drivers no older than 20 on rural roads in West Texas and identify factors that affect injury levels.

To address rural teen driver safety issues, which have been raising public concern recently, the factors contributing to these problems need to be identified. Understanding how the number of teen driver crashes might be reduced begins with understanding how the crashes happen. In this article, machine learning approaches are used to analyze the factors that affect the occurrence of teen driver crashes and the resulting severity. A complete machine learning pipeline is carried out, from collecting and cleaning data, through exploratory data analysis, to formulating the real-world problem as machine learning predictive models. The study is performed in two steps:

• Step 1: Perform an Exploratory Data Analysis (EDA) on the collected dataset. The purpose of this EDA is first to explore the raw dataset using descriptive statistics and data visualization. The EDA is done without any preconceived notions or hypotheses, and its results are used to guide the identification of factors in the subsequent machine learning (ML) models. Remove any features that are by nature unrelated to crash severity and any features with a large percentage of "no data" or "missing data". Also remove features with notable positive correlation (e.g., object struck and first harmful event). Lastly, select features based on the existing literature on crash severity and teen crashes and on engineering judgement.
• Step 2: Using the reduced feature set from the first step, build four machine learning models to identify the most important factors associated with the severity of crashes on rural roads and to predict crash severity from these factors. Approach the problem as a multivariate regression problem and predict the severity of the crash based on the collected crash dataset.
The rest of the paper is organized as follows: Section 2 reviews the literature on teen driver crash severity and the machine learning approaches applied in this area. Section 3 describes the data collection and filtering process for this research. Section 4 presents the exploratory data analysis and its results in easily understandable visual form. Section 5 ranks the importance of factors for crash severity and builds four machine learning models for crash severity prediction based on the teen driver crash data in rural West Texas alone. The last section presents the conclusions.

Literature Review
Crash severity analysis is an important and widely studied area in safety research. Past studies are usually focused on two important topics: (1) identification of risk factors associated with crash injury severity and (2) prediction of crash injury levels from the identified contributing variables.
In the first area, researchers have investigated various factors that contribute to crashes involving teen drivers, such as speeding [4,5], distracted driving [6,7], driver age and gender [8,9], risky behavior [10], impaired driving [11], night-time driving [12], and peer passengers [13,14].
In the second area, many researchers have developed prediction models for crash severity, which can generally be divided into two groups: parametric models and nonparametric (or machine learning) methods. In crash severity analysis, the dependent variable is discrete, with multiple categories corresponding to different crash severity levels (for example, in this study and in the TxDOT Crash Records Information System (CRIS), crash severity is defined on the KABCO scale: K: Killed, A: Incapacitating injury, B: Non-incapacitating injury, C: Possible injury, O: Not injured). Since crash severity levels can be considered as either nominal or ordinal variables, different types of statistical modeling techniques can be applied, such as multinomial logit (MNL) models (for example, investigating the impact of road features on crash injury severity [15], examining the impact of alcohol and drug use on cyclist injury severities [16], and analyzing the severity of large-truck-involved crashes [17]) and ordered logit (OL) models (for example, investigating the impact of rural highway geometric features on crash injury severity [18], comparing the severity of freight-involved crashes and non-freight crashes [19], and finding the impact of contributing variables on the severity of highway-rail crossing crashes [20]).
It should be noted that parametric models require model assumptions and pre-defined relationships between dependent and explanatory variables, which restricts their application when the assumptions are violated [21]. Nonparametric or machine learning methods, by contrast, usually do not require such assumptions and can identify and present underlying knowledge in large datasets with diverse variable categories without a priori model assumptions or predefined relationships. This makes machine learning approaches particularly useful for traffic safety data analysis. The machine learning techniques applied in past research include random forest [22,23], support vector machines (SVM) [24,25], neural networks [26,27], and XGBoost [28].
Some scholars also performed comparison studies on these two kinds of approaches. Their studies consistently show the advantages of machine learning approaches over traditional statistical modelling methods. For example, Wang and Kim performed a comprehensive comparison between discrete models and tree-based models [21,29].
As discussed previously, rural roads are particularly dangerous to teen drivers due to geographically isolated rural areas and longer travel distances within rural communities. Combined with inexperience and risky driving behaviors of teen drivers, the situation becomes even more serious, and a comprehensive analysis of teen driver crashes on rural roads is necessary. Texas has one of the largest rural road networks in the nation (nearly 57,000 centerline miles and over 128,000 lane miles of state highway in rural areas, including 50,000 miles of farm-to-market (FM) roads). Therefore, we perform a case study in West Texas rural areas in this study.
The major contribution of this study is a systematic, end-to-end machine learning methodology that covers collecting, cleaning, and coding crash data; performing a comprehensive exploratory data analysis; identifying the effect of factors contributing to injury severity; and formulating four machine learning models for crash severity prediction. We use West Texas rural teen driver crash data as an example case. The research findings help to better understand teen driver crashes and to reduce teen driver injury severity by adopting effective solutions in West Texas or in any other rural area facing similar issues.
It is worth noting that the proposed methodology has no specific geospatial coverage limit and could be applied to many other areas facing similar safety issues. The proposed method can also be applied to other driver groups, such as adult or elderly drivers. The authors hope that other researchers can benefit from this study by applying this systematic approach to their own crash datasets and hence better understand their own safety issues. In addition, we compared the performance of four ML models in terms of data coding methods, accuracy, and efficiency. We believe these comparisons can provide useful guidance for other researchers when they choose an appropriate ML method to process their own crash datasets, particularly when selecting data coding methods.

Study Area
West Texas is a loosely defined part of the US state of Texas, generally encompassing the arid and semiarid lands west of a line drawn between the cities of Wichita Falls, Abilene, and Del Rio [30]. Although there is no consensus on the boundary between East Texas and West Texas, we included five geographical TxDOT districts in this definition based on four principal metropolitan areas: Lubbock, Abilene, Midland/Odessa, and San Angelo, as highlighted in the map of Figure 1. Considering the huge binational border-crossing transportation activity between the US and Mexico, we did not include the TxDOT El Paso district in this West Texas focused research. The Lubbock district sponsored and managed this teen driver safety research as the home base for the researchers.

Crash Data
TxDOT maintains a statewide automated database of all reported motor vehicle traffic crashes received from the Department of Public Safety, called the Crash Records Information System (CRIS). The CRIS database contains many crash attributes, such as crash identifier, crash date/time, location, facility, crash characteristics, human factors, environmental variables, and vehicle characteristics. Users may obtain publicly available crash data through TxDOT's CRIS Query tool, which allows public users to query, extract, and analyze these data. In this study, the Query tool was first used to collect the crashes with the rural area flag in the five selected TxDOT districts during the period from 2010 to 2018; these were then filtered to keep only the rural crashes involving teen drivers (aged 15 to 20). Note that we included drivers aged 20 to represent the teen driver group with longer trip distances on rural roads. In total, we collected 8859 teen driver crashes in rural West Texas. Detailed descriptions of the collected data can be found in Table A1.
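The filtering just described can be illustrated with a minimal sketch. The record layout, the field names (`rural_flag`, `driver_age`, and so on), and the placeholder district names are assumptions made for illustration only; the actual CRIS export uses its own column labels, and a real crash may involve several drivers, which this one-driver-per-record simplification ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashRecord:
    # Hypothetical, simplified record: one driver per crash.
    crash_id: int
    year: int
    district: str
    rural_flag: bool
    driver_age: int


def select_rural_teen_crashes(records, districts, first_year=2010,
                              last_year=2018, min_age=15, max_age=20):
    """Keep rural crashes in the chosen districts and years with a driver aged 15 to 20."""
    return [
        r for r in records
        if r.rural_flag
        and r.district in districts
        and first_year <= r.year <= last_year
        and min_age <= r.driver_age <= max_age
    ]


# Toy records; "District A/B/C" are placeholders, not real TxDOT district names.
sample = [
    CrashRecord(1, 2012, "District A", True, 17),
    CrashRecord(2, 2012, "District A", False, 17),  # not flagged rural
    CrashRecord(3, 2019, "District A", True, 18),   # outside the study period
    CrashRecord(4, 2015, "District B", True, 20),
    CrashRecord(5, 2015, "District C", True, 16),   # outside the selected districts
    CrashRecord(6, 2016, "District B", True, 34),   # not a teen driver
]

selected = select_rural_teen_crashes(sample, districts={"District A", "District B"})
print([r.crash_id for r in selected])  # [1, 4]
```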

Exploratory Data Analysis (EDA)
A comprehensive EDA was performed on the collected rural teen driver crash data in three stages. The first stage was a detailed analysis of the causes, nature, and class of teen driver crashes. The second stage was a trend analysis of teen driver crashes, and the third stage focused on the demographics of the teen drivers involved in the crashes. Finally, we used the EDA to reduce the dimensionality of the raw data, so that further processing could be accomplished at a greatly reduced computational cost for the machine learning models.
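One of the screening rules used for this dimensionality reduction, dropping features with a large share of missing data, might be sketched as follows. The 50% threshold, the missing-data tokens, and the feature names are assumptions for illustration; the paper does not state the cut-off that was applied.

```python
# Values treated as "no data" (assumed tokens; CRIS may label them differently).
MISSING_TOKENS = {None, "", "No Data", "Unknown"}


def missing_share(rows, feature):
    """Fraction of rows whose value for `feature` is a missing-data token."""
    return sum(row.get(feature) in MISSING_TOKENS for row in rows) / len(rows)


def screen_features(rows, features, max_missing=0.5):
    """Keep the features whose missing share does not exceed `max_missing`."""
    return [f for f in features if missing_share(rows, f) <= max_missing]


# Toy rows with hypothetical feature names.
rows = [
    {"light_condition": "Daylight", "road_surface": "No Data", "speed_limit": 70},
    {"light_condition": "Dark", "road_surface": "No Data", "speed_limit": 75},
    {"light_condition": "Daylight", "road_surface": "Dry", "speed_limit": 55},
    {"light_condition": "Unknown", "road_surface": "No Data", "speed_limit": 65},
]

kept = screen_features(rows, ["light_condition", "road_surface", "speed_limit"])
print(kept)  # ['light_condition', 'speed_limit']
```

The correlation-based and literature-based screening steps involve judgement about which feature of a pair to keep and are not covered by this sketch.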

Analysis of Major Causes
As observed in Figure 2, the major causes of teen driver crashes in rural West Texas are: unsafe speed (23%), failure to control speed (19%), failure to yield right of way (13%), driver inattention (6%), fatigue or sleep (5%), and animals on roads (6%). Other minor causes mainly concern risky driving behaviors, such as turning or changing lanes when unsafe, following too closely, and backing without safety. Speeding is evidently the most important cause of teen driver crashes in rural West Texas: nearly 42% of crashes were attributed to the speed of the vehicle (unsafe speed and failure to control speed combined). The yearly number and causes of teen driver crashes in rural West Texas can be found in Table A2.
Speeding increases the stopping distance of a vehicle while reducing the time available to react and avoid a potential collision. It also increases the likelihood of injury in a crash. There is also evidence from naturalistic driving studies that teens' speeding behavior increases over time, possibly as they gain confidence. In general, teens take more risks than adults and are more likely to speed, a behavior that becomes more prevalent when driving on rural roads in West Texas at night or with other teen passengers on board. The problem is even worse since Texas has the highest speed limits on rural highways in the nation (75 mph is a common speed limit, and 80 or 85 mph is allowed if the highway is designed to accommodate that speed). It should also be noted that not all speed-related teen crashes are due to intentional risk-taking. Instead, most are caused by the inexperience of teen drivers and a lack of driving skills. Teens and new drivers need to be educated to manage speed according to traffic and road conditions and to keep a safe distance from other vehicles.
The second important cause is mainly related to the improper driving behaviors of teen drivers in West Texas: failure to yield right of way and being distracted while driving (roughly 19% of the total crashes). In some cases, teen drivers in rural West Texas failed to yield to other vehicles while merging onto the interstate highway, which resulted in crashes. A merging vehicle is usually required to execute a lane-changing maneuver along a limited length of merge lane, which requires good control and timing and poses a high risk of collision [31]. Teen drivers might therefore not be experienced enough to perform the maneuver properly and become involved in crashes. Additionally, teens' inexperience behind the wheel makes them more susceptible to distraction from electronic devices or passengers in the vehicle.
The third important cause arises from the unique hazards of rural roads: fatigue or sleep and animals on the road, which together account for 11% of the overall crashes. One major concern for teen drivers on rural roads is highway hypnosis. Many rural highways in Texas consist of long, straight roadway sections. This fact, coupled with the propensity of teens to drive late at night, is a likely explanation for the high fatal crash counts during night hours in Texas. Highway hypnosis can occur when drivers operate vehicles for extended periods in monotonous surroundings and enter a dulled, drowsy state. Low light and lack of sleep can contribute to the issue [2]. Additionally, animal crossings are common in the rural terrain of West Texas. Teen drivers often endanger themselves by overlooking the unforeseen hazards posed by animals. Wild animals, such as deer, and domestic animals that escape from farms or ranches may dash into oncoming traffic. Inexperienced teen drivers often overcompensate when trying to avoid the animal and, in doing so, cause a collision.
Other environmental issues that coincide with animal-related crashes on rural roads include lack of lighting and obstructions to visibility. It is important that safety education programs teach teen drivers in West Texas to be more cautious in areas with low visibility, more animals, and a higher volume of slow-moving vehicles, such as farm machinery [2].
It is noted that driving under the influence is a minor cause of teen driver crashes in West Texas (only about 3% of the overall crashes). According to the National Highway Traffic Safety Administration (NHTSA), in 2016, alcohol-impaired driving accounted for 18% of the drivers aged 16 to 20 who were involved in fatal crashes [32]. This indicates that the current Texas alcohol regulation, which bans anyone under the age of 21 from purchasing or consuming alcohol, has successfully reduced the occurrence of teen driver alcohol-related crashes in West Texas.

Analysis of Trends
As observed in Figure 3, the number of crashes involving teen drivers in West Texas showed an overall increasing trend over the nine-year study period. The number in 2018 was almost double that in 2010. However, a national study that examined teen driver crash data over the entire 20-year period from 1994 through 2013 found a substantial downward trend in the crash involvement of teen drivers [33]. In this regard, greater efforts need to be devoted specifically to teen driving safety in rural West Texas.
Additionally, as observed in Figure 4, there are two daily peaks in the occurrence of teen driver crashes on rural roads in West Texas: 7:00 to 8:00 in the morning and 16:00 to 17:00 in the afternoon. The researchers originally linked these two peaks with daily peak-hour traffic. However, these peaks were not found for drivers in other age groups. The researchers therefore assumed that the two peaks are associated with teen driver traffic at school start and end times.
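An hourly profile of this kind can be derived by counting crashes per hour of day. The following sketch assumes crash times are available as "HH:MM" strings and uses made-up sample values; it is meant only to show the counting step, not to reproduce Figure 4.

```python
from collections import Counter


def crashes_by_hour(crash_times):
    """Count crashes per hour of day from 'HH:MM' strings."""
    return Counter(int(t.split(":")[0]) for t in crash_times)


def peak_hours(hour_counts, top_n=2):
    """Return the `top_n` hours with the most crashes, in chronological order."""
    return sorted(hour for hour, _ in hour_counts.most_common(top_n))


# Made-up crash times for illustration.
times = ["07:15", "07:40", "16:05", "12:10", "07:55",
         "16:30", "21:20", "16:45", "07:05"]

counts = crashes_by_hour(times)
print(peak_hours(counts))  # [7, 16]
```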

Other Exploratory Analysis
Other analyses were also performed to examine the demographics of teen drivers and the nature of the crashes. As shown in Figure 5, male teen drivers were more likely to be involved in crashes than their female peers in rural West Texas (60% vs. 40%). Crashes were almost equally distributed across ages, except for age 15, which accounted for only 2%. In terms of ethnicity, the majority of teen drivers were White or Hispanic (almost 90% in total), which is consistent with the demographics of West Texas residents. As for crash severity, more than two thirds of the crashes did not end in injury. About 4% of teen drivers were involved in crashes with serious injuries and 1% were involved in fatal crashes, which was much lower than the average level of teen fatal crashes in the nation. In 2017, 3225 teen drivers were involved in fatal crashes nationwide, which was about 8% of total crashes [34].

Factor Identification and Predictive Models
Predictive analytics is defined as finding patterns in data and using those patterns to predict future trends and outcomes. Machine learning techniques have become increasingly popular for predictive analytics in recent years because they handle large-scale data well. In addition to the EDA, in this study we built several machine learning models to rank the important factors associated with crash severity and performed crash severity prediction using the factors identified in the EDA. The machine learning process was implemented in Python with related data analysis libraries such as pandas and scikit-learn (sklearn). The detailed steps are described as follows:
Step 1: Clean and filter the raw crash data from CRIS. Filter out records whose crash severity is "No data" or "Unknown". Choose the features based on the previous EDA and remove the features with a large percentage of missing data. In the end, we obtained 8859 valid crash records for the period from 2010 to 2018. Each crash record had 46 features describing the characteristics of the crash, teen driver, vehicle, road, and driving environment.
Step 2: Since most of the features in the crash dataset are categorical, encode categorical data to numerical data for further construction of predictive models. Use two encoders that are available in the Python Sklearn library: label encoder and one hot encoder.
The label encoder assigns a unique integer to each category of an object-typed (categorical) feature. However, one weakness of the label encoder is that, by assigning different values to different categories, it imposes an order on the categories, even though the categories are weighted equally and have no relationship to one another. The one hot encoder instead splits one categorical feature into a sparse binary matrix. For example, the categorical feature "Day of the week" has seven different values, one for each day of the week. The one hot encoder splits this feature into a binary matrix with seven columns, where 1 represents true and 0 represents false for the value in each column. The differences between these two coding methods are illustrated in Figure 6.
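The contrast between the two coding methods can be made concrete with a small plain-Python sketch of what each encoder produces for "Day of the week". This is an illustration of the idea rather than the sklearn implementation; the helper names are hypothetical, and categories are sorted alphabetically, which is also how sklearn's LabelEncoder orders its classes.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def label_encode(values):
    """Map each distinct category to one integer code (alphabetical order)."""
    categories = sorted(set(values))
    code = {category: i for i, category in enumerate(categories)}
    return [code[v] for v in values], categories


def one_hot_encode(values):
    """Map each value to a 0/1 row with one column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == category else 0 for category in categories] for v in values]
    return rows, categories


codes, label_categories = label_encode(DAYS)
matrix, one_hot_categories = one_hot_encode(DAYS)

# Label codes follow alphabetical order, so Monday=1 but Tuesday=5:
# the numeric order carries no calendar meaning.
print(dict(zip(DAYS, codes)))
# Each one-hot row has seven columns and exactly one 1.
print(matrix[0])
```

The label codes keep a single column but introduce an arbitrary ordering, while the one hot rows avoid that ordering at the cost of seven columns for this one feature.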

After encoding, all data are numerical, but the values of some continuous features are large by nature (such as the average daily traffic), which means they may have a higher impact on prediction than other features. Standardize all continuous features so that all variability is measured on the same scale. We used the StandardScaler class in the sklearn library to standardize the features in the dataset. Examples of the standard scaler output are also shown in Figure 6.
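Standardization replaces each value x of a feature by z = (x - mean) / standard deviation, so every feature ends up with mean 0 and unit variance. The sketch below shows this computation in plain Python, using the population standard deviation as sklearn's StandardScaler does; the traffic volumes are made-up values.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no variability; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up average daily traffic volumes and speed limits.
average_daily_traffic = [1200, 450, 9800, 3100]
speed_limit = [70, 75, 55, 65]

adt_z = standardize(average_daily_traffic)
speed_z = standardize(speed_limit)

# After scaling, both features have mean 0 and standard deviation 1,
# regardless of their original magnitudes.
print([round(z, 2) for z in adt_z])
print([round(z, 2) for z in speed_z])
```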
Step 3: With the encoded and scaled data, start building predictive models for crash severity prediction. Here, approach the prediction as a multivariate regression problem and choose Random Forest (RF) and XGBoost as predictive methods. We also used the Python SKlearn library to implement the selected algorithms.
Random Forest is an ensemble machine learning algorithm used for classification and regression problems. It employs decision trees as base learners, but combines a sampling procedure, a subspace technique, and an ensemble strategy to optimize model building. The sampling method, known as bootstrap, draws random samples with replacement. The subspace technique also relies on random sampling, but it selects smaller subsets (i.e., subspaces) of variables. Random forests correct the overfitting problem of single decision trees by providing manifold trained decision trees for the testing stage [21]. This property makes RF preferable to a regular decision tree, and it was among the most popular machine learning algorithms amongst data scientists until XGBoost took over.
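The two sources of randomness in a random forest, bootstrap sampling of rows and random subspaces of variables, can be sketched as below. This only shows how the samples are drawn, not a full forest; the feature names are hypothetical, and the subspace is drawn once per tree here for simplicity, whereas common implementations redraw the candidate variables at each split.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_subspace(features, k, rng):
    """Pick k distinct features at random (sampling without replacement)."""
    return rng.sample(features, k)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
features = ["speed_limit", "light_condition", "weather_condition",
            "number_of_lanes", "person_age", "road_class"]

row_indices = bootstrap_sample(10, rng)
subset = random_subspace(features, 3, rng)

# Rows never drawn form the "out-of-bag" set for this tree.
out_of_bag = sorted(set(range(10)) - set(row_indices))

print(row_indices)
print(subset)
print(out_of_bag)
```

Because each tree sees a different bootstrap sample and a different subset of variables, the trees are decorrelated, which is what allows averaging them to reduce overfitting.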
XGBoost is an optimized distributed gradient boosting library, designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, Gradient Boosting Decision Tree, or GBM, Gradient Boosting Machine), which solves many data science problems in a fast and accurate way. It has recently been dominating applied machine learning for structured or tabular data [28].
Since we had two predictive algorithms and two coding methods, we built four models for crash severity prediction: Model 1 (one hot encoder/Random Forest); Model 2 (one hot encoder/XGBoost); Model 3 (label encoder/Random Forest); and Model 4 (label encoder/XGBoost).
Step 4: In this step, apply the four models to the West Texas rural teen driver crash dataset. The dataset is randomly split into training and testing datasets (75%/25%) to avoid any bias from the data itself. Finally, obtain the feature importance graph (as shown in Figure 7) and compare the performance measures in both accuracy and speed for the four models (as illustrated in Tables 1 and 2). prediction for West rural Texas teen driver crashes are: manner of collision, person restraint used, weather condition, school zone, roadway type, light condition, construction zone, left shoulder use, person age, right should type, traffic control type, number of lanes, speed limit, and road class.
Note that, in Figure 7, features with positive values are those that may lead to high-level crash severity (injury or even fatal crashes), while features with negative values are those that may lead to low-level crash severity or even prevent the occurrence of crash injury. The features that rank highest in importance for crash severity prediction for rural West Texas teen driver crashes are: manner of collision, person restraint used, weather condition, school zone, roadway type, light condition, construction zone, left shoulder use, person age, right shoulder type, traffic control type, number of lanes, speed limit, and road class.
Tables 1 and 2 show that, for both coding methods, XGBoost has the highest accuracy (Model 2). In terms of speed, Model 4 takes the shortest time to process the same dataset, which is critical for real-time crash prediction applications.
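The evaluation procedure described above (a random 75%/25% split, followed by a comparison of accuracy and processing time) can be sketched as follows. This is a minimal standard-library illustration under assumed names and toy inputs, not the code used to produce Tables 1 and 2; `sorted` merely stands in for a model's training or prediction call.

```python
import random
import time

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out test_fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of records whose predicted severity equals the observed one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def timed(fn, *args):
    """Run fn(*args) and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Hypothetical stand-in for 100 crash records.
rows = list(range(100))
train, test = train_test_split(rows)

# Hypothetical observed and predicted severity levels for four records.
acc = accuracy([0, 1, 2, 1], [0, 1, 1, 1])

# Placeholder workload in place of a real model call.
_, seconds = timed(sorted, rows)
```

Fixing the random seed keeps the split identical across the four models, so differences in accuracy and time reflect the encoder and algorithm rather than the particular partition.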

Conclusions
In this study, we presented a complete machine learning pipeline on teen driver crashes in rural West Texas, including collecting, cleaning, and coding crash data, a comprehensive exploratory data analysis, identification of contributing factors to crash severity, and four machine learning predictive models. We have the following major findings: (1) The major causes of teen driver crashes in rural West Texas are unsafe speed (23%), failure to control speed (19%), failure to yield right of way (13%), driver inattention (6%), fatigue or sleep (5%), and animals on the road (6%). Other minor factors mainly concern risky driving behaviors, such as turning or changing lanes when unsafe, following too closely, and backing without safety. (2) The features that rank highest in importance for crash severity prediction for rural West Texas teen driver crashes are road class, speed limit, first harmful event, number of lanes, traffic control type, right shoulder type, person age, left shoulder use, construction zone, light condition, roadway type, school zone, weather condition, person restraint used, and manner of collision. (3) Machine learning approaches are particularly useful for uncovering new statistical patterns in large heterogeneous crash datasets. XGBoost seems to be an effective option when considering both predictive performance and computational cost. (4) It became apparent in this study that many teens do not have adequate knowledge of safe driving behavior on rural roads in West Texas. Policies and programs that can curb this upward trend are greatly needed.
However, the dataset created contains information that has not yet been exploited, for example, the spatio-temporal patterns of teen driver crashes in rural West Texas. These will be addressed in future work. Additionally, further time series analyses (for example, a comparison of the monthly time series through the years) and additional clustering experiments (for example, clustering of the day-of-week profiles by district) might provide additional useful information. In cooperation with the TxDOT districts, we also hope to find explanations for some of the interesting features discovered in the data and possible ways to use the results of this work to improve rural teen driver traffic safety in West Texas.

Acknowledgments:
We are very thankful to the reviewers for their time and efforts; their comments and suggestions have greatly improved the quality of this paper.

Conflicts of Interest:
The authors declare no conflict of interest.