AI for Data Quality Auditing: Detecting Mislabeled Work Zone Crashes Using Large Language Models
Abstract
1. Introduction
- We propose a scalable framework that repurposes a fine-tuned GPT-3.5 model for retrospective data validation—shifting the use of large language models (LLMs) from prediction to auditing in the context of traffic safety.
- We demonstrate the effectiveness of multimodal data fusion by integrating structured crash attributes and unstructured narrative descriptions to detect inconsistencies in construction zone labeling.
- We highlight the framework’s generalizability, suggesting its potential to identify other types of misclassified attributes (e.g., injury severity, pedestrian involvement, distraction) in crash datasets.
Can fine-tuned large language models (LLMs), when applied to multimodal crash data combining structured attributes and narrative text, effectively detect mislabeled construction zone crashes and support data quality auditing?
2. Literature Review
2.1. Misclassification in Work Zone Crash Data
2.2. Data Quality Challenges in Transportation Systems
2.3. Applications of Large Language Models (LLMs) in Traffic Safety
2.4. Comparison of Prior Work and Research Gap
3. Methodology
3.1. Framework Overview
- Data Preparation: Structured crash attributes and narrative text are preprocessed and merged into a unified input format, ensuring compatibility with the LLM’s input requirements.
- Model Fine-Tuning: A pre-trained language model (GPT-3.5-turbo-0613) is fine-tuned using labeled crash data to learn multimodal patterns that correlate structured features and narrative cues with correct crash labels (e.g., construction zone involvement).
- Inference and Discrepancy Detection: The fine-tuned model is used to predict crash labels. Records where predicted labels differ from those in the original dataset are flagged as potentially mislabeled.
- Expert Validation: A domain expert manually reviews the flagged cases to verify true misclassifications and assess model precision in surfacing data quality issues.
3.2. Data Preparation
- Tabular Data Conversion: Key structured features, such as crash severity, light condition, and construction zone status, are extracted and formatted as part of a user prompt.
- Narrative Integration: The structured prompts are merged with crash narrative text to create context-rich inputs for the LLM.
- JSONL Format Creation: Each crash report is converted into OpenAI’s required JSON Lines (JSONL) format using the system–user–assistant structure. A sample record is formatted as follows:
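A minimal sketch of building one such JSONL record is shown below. The field names, prompt wording, and narrative text are illustrative placeholders, not the exact schema or prompts used in the study.

```python
import json

# Illustrative structured fields and narrative (placeholder values,
# not the exact schema used in the study).
record = {
    "light_condition": "Daylight",
    "severity": "Minor",
    "czone": 1,
    "narrative": "Traffic slowed due to workers and barriers ahead; "
                 "rear-end collision followed.",
}

# System message defines the assistant's auditing task.
system_msg = "You classify crash reports for construction zone involvement."

# User message merges structured attributes with the narrative text.
user_msg = (
    f"Light: {record['light_condition']}, Severity: {record['severity']}. "
    f"Narrative: {record['narrative']}"
)

# Assistant message encodes the target label.
assistant_msg = json.dumps({"czone": record["czone"]})

# One JSON Lines entry in OpenAI's system-user-assistant chat format.
jsonl_line = json.dumps({
    "messages": [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]
})
print(jsonl_line)
```

Each crash report yields one such line; the resulting `.jsonl` file is what the fine-tuning API consumes.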
3.3. Data Formatting and Validation
- The system role was set to define the assistant’s task.
- The user message included all structured crash attributes (e.g., location, light condition, vehicle type).
- The assistant message encoded the target labels—crash severity and construction zone involvement.
- Confirmed valid JSONL structure with correctly nested fields.
- Verified role consistency (system, user, assistant) across messages.
- Used the Tiktoken library to calculate token lengths and ensure compliance with the 4096-token limit.
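The validation checks above can be sketched as a single pass over the file. This is a dependency-free approximation: the study used the Tiktoken library for exact token counts, whereas the sketch below substitutes a crude characters-per-token heuristic, and the function name and record layout are assumptions for illustration.

```python
import json

MAX_TOKENS = 4096  # context limit for gpt-3.5-turbo-0613


def validate_jsonl(path):
    """Check each line: valid JSON, expected role sequence, rough length.

    The study used tiktoken for exact token counts; a chars/4 heuristic
    stands in here so the sketch stays dependency-free.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "invalid JSON"))
                continue
            roles = [m.get("role") for m in rec.get("messages", [])]
            if roles != ["system", "user", "assistant"]:
                problems.append((lineno, f"unexpected roles {roles}"))
                continue
            approx_tokens = sum(
                len(m.get("content", "")) for m in rec["messages"]
            ) // 4
            if approx_tokens > MAX_TOKENS:
                problems.append((lineno, "over token limit"))
    return problems
```

An empty return value means the file passed all three checks and is ready for upload.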
3.4. Model Fine-Tuning
3.5. Inference and Misclassification Detection
- Original: czone = 0
- Predicted: czone = 1
- Flagged: Misclassified
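The flagging step reduces to a field comparison between the original and predicted labels. The record layout below is a simplified stand-in for the study's dataset; flagged records are candidates for expert review, not confirmed errors.

```python
def flag_discrepancies(records):
    """Return records whose predicted czone label disagrees with the
    original one. These are candidates for expert review only."""
    return [r for r in records if r["czone_original"] != r["czone_predicted"]]


# Simplified stand-in records (not actual study data).
records = [
    {"id": 1, "czone_original": 0, "czone_predicted": 1},  # flagged
    {"id": 2, "czone_original": 0, "czone_predicted": 0},  # consistent
    {"id": 3, "czone_original": 1, "czone_predicted": 1},  # consistent
]

flagged = flag_discrepancies(records)
print([r["id"] for r in flagged])  # → [1]
```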
3.6. Evaluation Metrics
- Precision measures the proportion of correct positive predictions among all samples classified as positive.
- Precision = TP/(TP + FP)
- Recall measures the proportion of actual positive samples that are correctly identified.
- Recall = TP/(TP + FN)
- The F1 score is the harmonic mean of precision and recall, combining the two into a single measure; it is more informative than accuracy when classes are imbalanced.
- F1 = 2 × Precision × Recall/(Precision + Recall)
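These formulas can be checked against the reported czone = 1 results. The confusion counts below (TP = 13, FP = 2, FN = 78) are not stated in the paper; they are inferred here because they exactly reproduce the published percentages.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts inferred from the published czone = 1 percentages,
# not stated explicitly in the paper.
p, r, f1 = precision_recall_f1(tp=13, fp=2, fn=78)
print(f"{p:.2%} {r:.2%} {f1:.2%}")  # → 86.67% 14.29% 24.53%
```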
3.7. Expert Validation
3.8. Generalizability
- Distracted driving
- Injury severity
- Pedestrian involvement
4. Case Study: Application to Missouri Crash Dataset
4.1. Dataset Overview and Application
4.2. Classification Results
4.3. Expert Validation of Flagged Cases
- Narrative-Label Cross-Check: The expert read the full narrative description and evaluated whether it included strong indicators of construction zone involvement—such as references to signage, workers, barriers, cones, or lane closures.
- Contextual Consistency: The structured fields (e.g., time, lighting, crash type) were reviewed in relation to the narrative to detect inconsistencies that might support or refute the predicted label.
- True Misclassification Criteria: A case was marked as a genuine misclassification if the narrative clearly indicated construction zone conditions, yet the original structured label (czone = 0) contradicted this.
- Ambiguous Cases: If the narrative lacked sufficient clarity or if the indicators were indirect, the record was not counted as a true misclassification, to maintain high confidence in label corrections.
- Confirmed Misclassification (True Positive): The model correctly flagged a mislabeled record.
- Correct Label (False Positive): The model flagged a case, but the expert found the original label to be correct.
5. Discussion
5.1. GPT-3.5 Classification Performance
- Non-construction zone class (czone = 0): Precision: 98.78%; Recall: 99.97%; F1 score: 99.37%.
- Construction zone class (czone = 1): Precision: 86.67%; Recall: 14.29%; F1 score: 24.53%.
5.2. Detection of Mislabeled Records
- Four records were genuine misclassifications (i.e., originally labeled as non-work zone cases, but correctly predicted by the model as work zone cases).
- Seventy-six records were false positives (i.e., flagged as mislabels, but determined to be accurate upon review).
5.3. Limitations and Future Work
- Imbalanced Class Distribution: The dataset was highly skewed toward non-construction zone crashes, which may have influenced the model’s conservative predictions and low recall for the minority class (czone = 1).
- Low Recall in Misclassification Detection: Although the model achieved high precision in identifying mislabeled records, it only detected a small portion of the actual misclassifications, limiting its effectiveness in uncovering all inconsistencies.
- Limited Fine-Tuning Data: The fine-tuning process relied on only 100 labeled samples, which may have constrained the model’s ability to generalize more nuanced or rare patterns across the broader dataset.
- Domain Dependency: The model was trained and tested on crash reports from a specific region (Missouri), and its performance may not directly transfer to other jurisdictions with different reporting styles or terminologies.
- Model Comparisons and Baselines: Evaluate the framework using non-fine-tuned models (e.g., zero-shot GPT-3.5) as baselines, and benchmark against alternative LLMs, such as GPT-4 or open-source models like LLaMA, to assess generalizability and cost–performance trade-offs.
- Training Data Sensitivity: Investigate the impact of varying fine-tuning data sizes on performance, particularly to identify thresholds for reliable recall in low-resource scenarios.
- Class Imbalance Mitigation: Explore strategies to address skewed class distributions, such as oversampling, undersampling, or synthetic data generation (e.g., SMOTE or prompt-based data augmentation), to improve model performance on minority classes and increase recall for under-reported crash attributes.
- Prompt Engineering and Instruction Tuning: Design more targeted prompts or apply instruction-tuned variants of LLMs to better capture implicit work zone indicators and improve recall without compromising precision.
- Few-Shot and In-Context Learning: Apply few-shot examples at the inference time to enhance flexibility and support generalization across crash attributes without requiring additional fine-tuning.
- Hybrid Rule–LLM Models: Integrate traditional rule-based approaches with LLM outputs to balance precision and recall, enhancing robustness for misclassification detection.
- Multi-Label and Multi-Attribute Detection: Extend the framework to detect multiple crash attributes—such as injury severity, driver distraction, or pedestrian involvement—in a single processing pipeline.
- Cross-Dataset Validation: Test the framework on crash data from other states or jurisdictions to evaluate its transferability and regional adaptability.
- Integration with Crash Severity Simulation: Incorporating crash severity simulators, such as the one proposed by Grinberg and Wiseman (2013), could enhance future versions of this framework by linking mislabel detection to predicted outcomes [47]. This could support not only data validation, but also scenario-based policy testing and resource allocation.
- Multi-Label Classification: Future work could extend the framework to support multi-label classification, allowing for simultaneous detection of multiple misclassified crash attributes—such as injury severity, driver distraction, and road conditions—and thereby reflecting the complex nature of real-world crash events.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cheng, Y.; Wu, K.; Li, H.; Parker, S.; Ran, B.; Noyce, D. Work Zone Crash Occurrence Prediction Based on Planning Stage Work Zone Configurations Using an Artificial Neural Network. Transp. Res. Rec. 2022, 2676, 377–384. [Google Scholar] [CrossRef]
- Yang, H.; Ozbay, K.; Ozturk, O.; Xie, K. Work Zone Safety Analysis and Modeling: A State-of-the-Art Review. Traffic Inj. Prev. 2015, 16, 387–396. [Google Scholar] [CrossRef]
- Blackman, R.; Debnath, A.K.; Haworth, N. Understanding Vehicle Crashes in Work Zones: Analysis of Workplace Health and Safety Data as an Alternative to Police-Reported Crash Data in Queensland, Australia. Traffic Inj. Prev. 2020, 21, 222–227. [Google Scholar] [CrossRef]
- Sayed, M.A.; Qin, X.; Kate, R.J.; Anisuzzaman, D.M.; Yu, Z. Identification and Analysis of Misclassified Work-Zone Crashes Using Text Mining Techniques. Accid. Anal. Prev. 2021, 159, 106211. [Google Scholar] [CrossRef]
- Almahdi, A.; Al Mamlook, R.E.; Bandara, N.; Almuflih, A.S.; Nasayreh, A.; Gharaibeh, H.; Alasim, F.; Aljohani, A.; Jamal, A. Boosting Ensemble Learning for Freeway Crash Classification under Varying Traffic Conditions: A Hyperparameter Optimization Approach. Sustainability 2023, 15, 15896. [Google Scholar] [CrossRef]
- Pande, A.; Das, A.; Abdel-Aty, M.; Hassan, H. Estimation of Real-Time Crash Risk. Transp. Res. Rec. 2011, 2237, 60–66. [Google Scholar] [CrossRef]
- OpenAI. GPT-3.5 Turbo Fine-Tuning and API Updates; OpenAI: San Francisco, CA, USA, 2023. [Google Scholar]
- Swansen, E.; Mckinnon, I.A.; Knodler, M.A. Integration of Crash Report Narratives for Identification of Work Zone-Related Crash Classification. In Proceedings of the Transportation Research Board 92nd Annual Meeting, Washington, DC, USA, 13–17 January 2013. [Google Scholar]
- Carrick, G.; Heaslip, K.; Srinivasan, S.; Brady, B. A Case Study in Spatial Misclassification of Work Zone Crashes. In Proceedings of the 88th Transportation Research Board Annual Meeting, National Academy of Sciences, Washington, DC, USA, 11–15 January 2009. [Google Scholar]
- Asadi, H.; Wang, J. An Ensemble Approach for Predicting Crash Severity in Work Zones Using Machine Learning. Sustainability 2023, 15, 1201. [Google Scholar] [CrossRef]
- Sharma, K.P.; Yajid, M.S.A.; Gowrishankar, J.; Mahajan, R.; Alsoud, A.R.; Jadhav, A.; Singh, D. A Systematic Review on Text Summarization: Techniques, Challenges, Opportunities. Expert Syst. 2025, 42, e13833. [Google Scholar] [CrossRef]
- Nusir, M.; Louati, A.; Louati, H.; Tariq, U.; Zitar, R.A.; Abualigah, L.; Gandomi, A.H. Design Research Insights on Text Mining Analysis: Establishing the Most Used and Trends in Keywords of Design Research Journals. Electronics 2022, 11, 3930. [Google Scholar] [CrossRef]
- Jaradat, S.; Elhenawy, M.; Nayak, R.; Paz, A.; Ashqar, H.I.; Glaser, S. Multimodal Data Fusion for Tabular and Textual Data: Zero-Shot, Few-Shot, and Fine-Tuning of Generative Pre-Trained Transformer Models. AI 2025, 6, 72. [Google Scholar] [CrossRef]
- Alhadidi, T.I.; Alazmi, A.; Jaradat, S.; Jaber, A.; Ashqar, H.; Elhenawy, M. Pavement Distress Classification Using Bidirectional Cascaded Neural Networks (BCNNs) and U-Net 50-Based Augmented Datasets. arXiv, 2025; in press. [Google Scholar]
- Chang, Y.; Edara, P. Predicting Hazardous Events in Work Zones Using Naturalistic Driving Data. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 1–6. [Google Scholar]
- Wang, B.; Chen, T.; Zhang, C.; Wong, Y.D.; Zhang, H.; Zhou, Y. Toward Safer Highway Work Zones: An Empirical Analysis of Crash Risks Using Improved Safety Potential Field and Machine Learning Techniques. Accid. Anal. Prev. 2024, 194, 107361. [Google Scholar] [CrossRef]
- Rangaswamy, R.; Alnawmasi, N.; Wang, Z. Exploring Contributing Factors to Improper Driving Actions in Single-Vehicle Work Zone Crashes: A Mixed Logit Analysis Considering Heterogeneity in Means and Variances, and Temporal Instability. J. Transp. Saf. Secur. 2023, 16, 768–797. [Google Scholar] [CrossRef]
- Mashhadi, A.H.; Rashidi, A.; Medina, J.; Marković, N. Comparing Performance of Different Machine Learning Methods for Predicting Severity of Construction Work Zone Crashes. Comput. Civ. Eng. 2023. [Google Scholar] [CrossRef]
- Ullman, G.L.; Scriba, T.A. Revisiting the Influence of Crash Report Forms on Work Zone Crash Data. Transp. Res. Rec. 2004, 1897, 180–182. [Google Scholar] [CrossRef]
- Clark, J.B.; Fontaine, M.D. Exploration of Work Zone Crash Causes and Implications for Safety Performance Measurement Programs. Transp. Res. Rec. 2015, 2485, 61–69. [Google Scholar] [CrossRef]
- Daniel, J.; Dixon, K.; Jared, D. Analysis of Fatal Crashes in Georgia Work Zones. Transp. Res. Rec. 2000, 1715, 18–23. [Google Scholar] [CrossRef]
- Wang, J.; Hughes, W.E.; Council, F.M.; Paniati, J.F. Investigation of Highway Work Zone Crashes: What We Know and What We Don’t Know. Transp. Res. Rec. 1996, 1529, 54–62. [Google Scholar] [CrossRef]
- Hrubeš, P.; Langr, M.; Purkrábková, Z. Review of Data Governance Approaches in the Field of Transportation Domain. In Proceedings of the 2024 Smart City Symposium Prague (SCSP), Prague, Czech Republic, 23–24 May 2024. [Google Scholar]
- Si, S.; Xiong, W.; Che, X. Data Quality Analysis and Improvement: A Case Study of a Bus Transportation System. Appl. Sci. 2023, 13, 11020. [Google Scholar] [CrossRef]
- Remoundou, K.; Alexakis, T.; Peppes, N.; Demestichas, K.; Adamopoulou, E. A Quality Control Methodology for Heterogeneous Vehicular Data Streams. Sensors 2022, 22, 1550. [Google Scholar] [CrossRef]
- Galarus, D.; Turnbull, I.; Campbell, S. Timely, Reliable: A High Standard and Elusive Goal for Traveler Information Data Quality. In Proceedings of the 2019 Future of Information and Communication Conference, San Francisco, CA, USA, 14–15 March 2019. [Google Scholar]
- Liu, Z.; Li, L.; Wang, Y.; Lin, H.; Liu, Z.; He, L.; Wang, J. Controllable Traffic Simulation through Llm-Guided Hierarchical Chain-of-Thought Reasoning. arXiv 2024, arXiv:2409.15135. [Google Scholar]
- Masri, S.; Ashqar, H.I.; Elhenawy, M. Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm. Vehicles 2025, 7, 11. [Google Scholar] [CrossRef]
- Masri, S.; Ashqar, H.I.; Elhenawy, M. Leveraging Large Language Models (LLMs) for Traffic Management at Urban Intersections: The Case of Mixed Traffic Scenarios. arXiv 2024, arXiv:2408.00948. [Google Scholar]
- Cheng, Q.; Jiao, X.; Yang, M.; Yang, M.; Jiang, K.; Yang, D. Advancing Autonomous Driving Safety Through LLM Enhanced Trajectory Prediction. In Proceedings of the Advanced Vehicle Control Symposium, Milan, Italy, 2–6 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 496–502. [Google Scholar]
- de Zarzà, I.; de Curtò, J.; Roig, G.; Calafate, C.T. LLM Multimodal Traffic Accident Forecasting. Sensors 2023, 23, 9225. [Google Scholar] [CrossRef]
- Alhadidi, T.; Jaber, A.; Jaradat, S.; Ashqar, H.I.; Elhenawy, M. Object Detection Using Oriented Window Learning Vision Transformer: Roadway Assets Recognition. arXiv 2024, arXiv:2406.10712. [Google Scholar]
- Jaradat, S.; Alhadidi, T.I.; Ashqar, H.I.; Hossain, A.; Elhenawy, M. Investigating Patterns of Freeway Crashes in Jordan: Findings from Text Mining Approach. Results Eng. 2025, 26, 104413. [Google Scholar] [CrossRef]
- Huang, X.; Feng, Y.; Zhang, Z. Crash Report Generation Using ChatGPT: A Novel Approach for Automated Accident Reporting. In Proceedings of the 2024 IEEE 4th International Conference on Electronic Information Engineering and Computer Science (EIECS), Yanji, China, 27–29 September 2024; pp. 1174–1177. [Google Scholar]
- Pendyala, R.; Hall, S. Explaining Misinformation Detection Using Large Language Models. Electronics 2024, 13, 1673. [Google Scholar] [CrossRef]
- Oliveira, J.; Almeida, D.; Santos, F. Comparative Analysis of BERT-Based and Generative Large Language Models for Detecting Suicidal Ideation: A Performance Evaluation Study. Cad. Saude Publica 2024, 40, e00028824. [Google Scholar] [CrossRef]
- Klie, T.; Nguyen, T.; Calderon, A. Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. Comput. Linguist. 2022, 49, 157–198. [Google Scholar] [CrossRef]
- Beattie, T.; Moulton, S.; Wong, M. Utilizing Large Language Models for Enhanced Clinical Trial Matching: A Study on Automation in Patient Screening. Cureus 2024, 16, e60044. [Google Scholar] [CrossRef]
- Pornprasit, C.; Tantithamthavorn, C. Fine-Tuning and Prompt Engineering for Large Language Models-Based Code Review Automation. Inf. Softw. Technol. 2024, 175, 107523. [Google Scholar] [CrossRef]
- Latif, E.; Zhai, X. Fine-Tuning ChatGPT for Automatic Scoring. Comput. Educ. Artif. Intell. 2024, 6, 100210. [Google Scholar] [CrossRef]
- Davis, J.; Goadrich, M. The Relationship between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240. [Google Scholar]
- Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
- Missouri State Highway Patrol (MSHP). Online Crash Report Search; MSHP: Jefferson City, MO, USA, 2022.
- Mumtarin, M.; Chowdhury, M.S.; Wood, J. Large Language Models in Analyzing Crash Narratives: A Comparative Study of ChatGPT, Bard and GPT-4. arXiv 2023, arXiv:2308.13563. [Google Scholar]
- Bhagat, S.; Shihab, I.F.; Sharma, A. Accuracy Is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models. arXiv 2024, arXiv:2504.13068. [Google Scholar]
- Bucher, M.J.J.; Martini, M. Fine-Tuned “small” LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification. arXiv 2024, arXiv:2406.08660. [Google Scholar]
- Grinberg, I.; Wiseman, Y. Scalable Parallel Simulator for Vehicular Collision Detection. Int. J. Veh. Saf. Mobil. Technol. 2013, 8, 116–121. [Google Scholar] [CrossRef]
Study | Task Focus | Data Type Used | Method/Model | Validation Approach | Identified Limitation | Gap This Study Fills |
---|---|---|---|---|---|---|
Swansen et al. [8] | Work zone crash identification | Narrative + structured | Rule-based classifier | Manual comparison | Inconsistent work zone labeling | No use of AI or LLMs for automated cross-checking of crash records |
Carrick et al. [9] | Spatial misclassification | Spatial + structured | Spatial analysis | Mapping-based validation | Limited narrative integration | Lack of automated narrative analysis to enhance crash record accuracy |
Sayed et al. [4] | Narrative-based work zone detection | Narrative text | Noisy-OR classifier | Manual inspection | No integration with structured data | Lack of multimodal data fusion for improved detection accuracy |
Asadi et al. [10] | Work zone crash severity prediction | Structured (naturalistic dataset) | Ensemble models | Predictive performance | Focus on prediction rather than label verification | Not focusing on retrospective mislabeling in crash data |
Pendyala and Hall [35] | Misinformation detection | Text data | LLMs (GPT models) | Performance evaluation | Limited application to crash-related contexts | Provides insight into LLM utility for crash data validation |
Oliveira et al. [36] | Detection of suicidal ideation | Text data | Fine-tuned BERT models | Performance evaluation | Focused on healthcare, not crash data | Techniques may be adaptable for detecting mislabeling in crash data |
Klie et al. [37] | Annotation error detection | Text data | Multiple error detection methods | Comparative analysis | Limited to general text classification | Introduces methods that could be applied to crash report labeling accuracy |
Beattie et al. [38] | Clinical trial patient screening | Text data | LLMs (GPT-3.5, GPT-4) | Accuracy and sensitivity evaluation | Mislabeled criteria in ground truth data | Highlights issues of label accuracy which may be relevant to crash data classification |
Huang et al. [34] | Crash report generation | Text synthesis | ChatGPT-based method | Evaluation through case examples | Focus on report generation without mislabel detection | Offers foundational framework for report automation, while lacking audit mechanisms |
Our work | Mislabel detection (work zones) | Structured + narrative | Fine-tuned GPT-3.5 | Expert validation (80 cases) | Conservative recall rate | First study to employ fine-tuned LLM for mislabel detection via multimodal fusion |
ID | Structured Fields | Narrative | Original Czone | LLM Predicted Czone | Mislabeled |
---|---|---|---|---|---|
1 | Light: Daylight, Severity: Minor, Injuries: 0 | Vehicle 1 was overtaking a construction crew. Vehicle 2 crossed the center line to avoid the construction crew and struck vehicle 1. Assisted by tpr m.r. lawson (831). | 0 | 1 | ✓ |
2 | Light: Dark, Severity: Fatal, Injuries: 2 | Crash occurred during nighttime on a rural highway; no signs of construction observed. | 0 | 0 | ✗ |
3 | Light: Daylight, Severity: Minor, Injuries: 1 | Traffic slowed due to workers and barriers ahead; rear-end collision followed. | 0 | 1 | ✓ |
Metric | Non-Czone (0) | Czone (1) | Overall/Avg |
---|---|---|---|
Precision | 98.78% | 86.67% | – |
Recall | 99.97% | 14.29% | – |
F1 score | 99.37% | 24.53% | – |
Support (No. of cases) | 6309 | 91 | 6400 |
Accuracy | – | – | 98.75% |
Macro-average F1 score | – | – | 61.95% |
Weighted-average F1 score | – | – | 98.31% |
Classification Type | Count |
---|---|
True misclassification detected (TP) | 4 |
Correct label (false positive) | 76 |
Total flagged cases reviewed | 80 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jaradat, S.; Acharya, N.; Shivshankar, S.; Alhadidi, T.I.; Elhenawy, M. AI for Data Quality Auditing: Detecting Mislabeled Work Zone Crashes Using Large Language Models. Algorithms 2025, 18, 317. https://doi.org/10.3390/a18060317