AI for Data Quality Auditing: Detecting Mislabeled Work Zone Crashes Using Large Language Models
Abstract
1. Introduction
- We propose a scalable framework that repurposes a fine-tuned GPT-3.5 model for retrospective data validation—shifting the use of large language models (LLMs) from prediction to auditing in the context of traffic safety.
- We demonstrate the effectiveness of multimodal data fusion by integrating structured crash attributes and unstructured narrative descriptions to detect inconsistencies in construction zone labeling.
- We highlight the framework’s generalizability, suggesting its potential to identify other types of misclassified attributes (e.g., injury severity, pedestrian involvement, distraction) in crash datasets.
Can fine-tuned large language models (LLMs), when applied to multimodal crash data combining structured attributes and narrative text, effectively detect mislabeled construction zone crashes and support data quality auditing?
2. Literature Review
2.1. Misclassification in Work Zone Crash Data
2.2. Data Quality Challenges in Transportation Systems
2.3. Applications of Large Language Models (LLMs) in Traffic Safety
2.4. Comparison of Prior Work and Research Gap
3. Methodology
3.1. Framework Overview
- Data Preparation: Structured crash attributes and narrative text are preprocessed and merged into a unified input format, ensuring compatibility with the LLM’s input requirements.
- Model Fine-Tuning: A pre-trained language model (GPT-3.5-turbo-0613) is fine-tuned using labeled crash data to learn multimodal patterns that correlate structured features and narrative cues with correct crash labels (e.g., construction zone involvement).
- Inference and Discrepancy Detection: The fine-tuned model is used to predict crash labels. Records where predicted labels differ from those in the original dataset are flagged as potentially mislabeled.
- Expert Validation: A domain expert manually reviews the flagged cases to verify true misclassifications and assess model precision in surfacing data quality issues.
3.2. Data Preparation
- Tabular Data Conversion: Key structured features, such as crash severity, light condition, and construction zone status, are extracted and formatted as part of a user prompt.
- Narrative Integration: The structured prompts are merged with crash narrative text to create context-rich inputs for the LLM.
- JSONL Format Creation: Each crash report is converted into OpenAI’s required JSON Lines (JSONL) format using the system–user–assistant structure. A sample record is formatted as follows:
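A minimal sketch of building one such JSONL record is shown below. The field names, prompt wording, and narrative text are illustrative placeholders, not the exact schema or prompts used in the study.

```python
import json

# Illustrative structured fields and narrative (placeholder values,
# not the exact schema used in the study).
record = {
    "light_condition": "Daylight",
    "severity": "Minor",
    "czone": 1,
    "narrative": "Traffic slowed due to workers and barriers ahead; "
                 "rear-end collision followed.",
}

# System message defines the assistant's auditing task.
system_msg = "You classify crash reports for construction zone involvement."

# User message merges structured attributes with the narrative text.
user_msg = (
    f"Light: {record['light_condition']}, Severity: {record['severity']}. "
    f"Narrative: {record['narrative']}"
)

# Assistant message encodes the target label.
assistant_msg = json.dumps({"czone": record["czone"]})

# One JSON Lines entry in OpenAI's system-user-assistant chat format.
jsonl_line = json.dumps({
    "messages": [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
        {"role": "assistant", "content": assistant_msg},
    ]
})
print(jsonl_line)
```

Each crash report yields one such line; the resulting `.jsonl` file is what the fine-tuning API consumes.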
3.3. Data Formatting and Validation
- The system role was set to define the assistant’s task.
- The user message included all structured crash attributes (e.g., location, light condition, vehicle type).
- The assistant message encoded the target labels—crash severity and construction zone involvement.
- Confirmed valid JSONL structure with correctly nested fields.
- Verified role consistency (system, user, assistant) across messages.
- Used the Tiktoken library to calculate token lengths and ensure compliance with the 4096-token limit.
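The validation checks above can be sketched as a single pass over the file. This is a dependency-free approximation: the study used the Tiktoken library for exact token counts, whereas the sketch below substitutes a crude characters-per-token heuristic, and the function name and record layout are assumptions for illustration.

```python
import json

MAX_TOKENS = 4096  # context limit for gpt-3.5-turbo-0613


def validate_jsonl(path):
    """Check each line: valid JSON, expected role sequence, rough length.

    The study used tiktoken for exact token counts; a chars/4 heuristic
    stands in here so the sketch stays dependency-free.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                problems.append((lineno, "invalid JSON"))
                continue
            roles = [m.get("role") for m in rec.get("messages", [])]
            if roles != ["system", "user", "assistant"]:
                problems.append((lineno, f"unexpected roles {roles}"))
                continue
            approx_tokens = sum(
                len(m.get("content", "")) for m in rec["messages"]
            ) // 4
            if approx_tokens > MAX_TOKENS:
                problems.append((lineno, "over token limit"))
    return problems
```

An empty return value means the file passed all three checks and is ready for upload.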
3.4. Model Fine-Tuning
3.5. Inference and Misclassification Detection
- Original: czone = 0
- Predicted: czone = 1
- Flagged: Misclassified
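The flagging step reduces to a field comparison between the original and predicted labels. The record layout below is a simplified stand-in for the study's dataset; flagged records are candidates for expert review, not confirmed errors.

```python
def flag_discrepancies(records):
    """Return records whose predicted czone label disagrees with the
    original one. These are candidates for expert review only."""
    return [r for r in records if r["czone_original"] != r["czone_predicted"]]


# Simplified stand-in records (not actual study data).
records = [
    {"id": 1, "czone_original": 0, "czone_predicted": 1},  # flagged
    {"id": 2, "czone_original": 0, "czone_predicted": 0},  # consistent
    {"id": 3, "czone_original": 1, "czone_predicted": 1},  # consistent
]

flagged = flag_discrepancies(records)
print([r["id"] for r in flagged])  # → [1]
```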
3.6. Evaluation Metrics
- Precision measures the proportion of correct positive predictions among all samples classified as positive.
- Precision = TP/(TP + FP)
- Recall measures the proportion of actual positive samples that are correctly identified.
- Recall = TP/(TP + FN)
- The F1 score is the harmonic mean of precision and recall, combining the two into a single measure; it is more informative than accuracy when classes are imbalanced.
- F1 = 2 × Precision × Recall/(Precision + Recall)
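These formulas can be checked against the reported czone = 1 results. The confusion counts below (TP = 13, FP = 2, FN = 78) are not stated in the paper; they are inferred here because they exactly reproduce the published percentages.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts inferred from the published czone = 1 percentages,
# not stated explicitly in the paper.
p, r, f1 = precision_recall_f1(tp=13, fp=2, fn=78)
print(f"{p:.2%} {r:.2%} {f1:.2%}")  # → 86.67% 14.29% 24.53%
```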
3.7. Expert Validation
3.8. Generalizability
- Distracted driving
- Injury severity
- Pedestrian involvement
4. Case Study: Application to Missouri Crash Dataset
4.1. Dataset Overview and Application
4.2. Classification Results
4.3. Expert Validation of Flagged Cases
- Narrative-Label Cross-Check: The expert read the full narrative description and evaluated whether it included strong indicators of construction zone involvement—such as references to signage, workers, barriers, cones, or lane closures.
- Contextual Consistency: The structured fields (e.g., time, lighting, crash type) were reviewed in relation to the narrative to detect inconsistencies that might support or refute the predicted label.
- True Misclassification Criteria: A case was marked as a genuine misclassification if the narrative clearly indicated construction zone conditions, yet the original structured label (czone = 0) contradicted this.
- Ambiguous Cases: If the narrative lacked sufficient clarity or if the indicators were indirect, the record was not counted as a true misclassification, to maintain high confidence in label corrections.
- Confirmed Misclassification (True Positive): The model correctly flagged a mislabeled record.
- Correct Label (False Positive): The model flagged a case, but the expert found the original label to be correct.
5. Discussion
5.1. GPT-3.5 Classification Performance
- Non-construction zone class (czone = 0): Precision: 98.78%; Recall: 99.97%; F1 score: 99.37%.
- Construction zone class (czone = 1): Precision: 86.67%; Recall: 14.29%; F1 score: 24.53%.
5.2. Detection of Mislabeled Records
- Four records were genuine misclassifications (i.e., originally labeled as non-work zone cases, but correctly predicted by the model as work zone cases).
- Seventy-six records were false positives (i.e., flagged as mislabels, but determined to be accurate upon review).
5.3. Limitations and Future Work
- Imbalanced Class Distribution: The dataset was highly skewed toward non-construction zone crashes, which may have influenced the model’s conservative predictions and low recall for the minority class (czone = 1).
- Low Recall in Misclassification Detection: Although the model achieved high precision in identifying mislabeled records, it only detected a small portion of the actual misclassifications, limiting its effectiveness in uncovering all inconsistencies.
- Limited Fine-Tuning Data: The fine-tuning process relied on only 100 labeled samples, which may have constrained the model’s ability to generalize more nuanced or rare patterns across the broader dataset.
- Domain Dependency: The model was trained and tested on crash reports from a specific region (Missouri), and its performance may not directly transfer to other jurisdictions with different reporting styles or terminologies.
- Model Comparisons and Baselines: Evaluate the framework using non-fine-tuned models (e.g., zero-shot GPT-3.5) as baselines, and benchmark against alternative LLMs, such as GPT-4 or open-source models like LLaMA, to assess generalizability and cost–performance trade-offs.
- Training Data Sensitivity: Investigate the impact of varying fine-tuning data sizes on performance, particularly to identify thresholds for reliable recall in low-resource scenarios.
- Class Imbalance Mitigation: Explore strategies to address skewed class distributions, such as oversampling, undersampling, or synthetic data generation (e.g., SMOTE or prompt-based data augmentation), to improve model performance on minority classes and increase recall for under-reported crash attributes.
- Prompt Engineering and Instruction Tuning: Design more targeted prompts or apply instruction-tuned variants of LLMs to better capture implicit work zone indicators and improve recall without compromising precision.
- Few-Shot and In-Context Learning: Apply few-shot examples at the inference time to enhance flexibility and support generalization across crash attributes without requiring additional fine-tuning.
- Hybrid Rule–LLM Models: Integrate traditional rule-based approaches with LLM outputs to balance precision and recall, enhancing robustness for misclassification detection.
- Multi-Label and Multi-Attribute Detection: Extend the framework to detect multiple crash attributes—such as injury severity, driver distraction, or pedestrian involvement—in a single processing pipeline.
- Cross-Dataset Validation: Test the framework on crash data from other states or jurisdictions to evaluate its transferability and regional adaptability.
- Integration with Crash Severity Simulation: Incorporating crash severity simulators, such as the one proposed by Grinberg and Wiseman (2013), could enhance future versions of this framework by linking mislabel detection to predicted outcomes [47]. This could support not only data validation, but also scenario-based policy testing and resource allocation.
- Multi-Label Classification: Future work could extend the framework to support multi-label classification, allowing for simultaneous detection of multiple misclassified crash attributes—such as injury severity, driver distraction, and road conditions—and thereby reflecting the complex nature of real-world crash events.
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Cheng, Y.; Wu, K.; Li, H.; Parker, S.; Ran, B.; Noyce, D. Work Zone Crash Occurrence Prediction Based on Planning Stage Work Zone Configurations Using an Artificial Neural Network. Transp. Res. Rec. 2022, 2676, 377–384. [Google Scholar] [CrossRef]
- Yang, H.; Ozbay, K.; Ozturk, O.; Xie, K. Work Zone Safety Analysis and Modeling: A State-of-the-Art Review. Traffic Inj. Prev. 2015, 16, 387–396. [Google Scholar] [CrossRef]
- Blackman, R.; Debnath, A.K.; Haworth, N. Understanding Vehicle Crashes in Work Zones: Analysis of Workplace Health and Safety Data as an Alternative to Police-Reported Crash Data in Queensland, Australia. Traffic Inj. Prev. 2020, 21, 222–227. [Google Scholar] [CrossRef]
- Sayed, M.A.; Qin, X.; Kate, R.J.; Anisuzzaman, D.M.; Yu, Z. Identification and Analysis of Misclassified Work-Zone Crashes Using Text Mining Techniques. Accid. Anal. Prev. 2021, 159, 106211. [Google Scholar] [CrossRef]
- Almahdi, A.; Al Mamlook, R.E.; Bandara, N.; Almuflih, A.S.; Nasayreh, A.; Gharaibeh, H.; Alasim, F.; Aljohani, A.; Jamal, A. Boosting Ensemble Learning for Freeway Crash Classification under Varying Traffic Conditions: A Hyperparameter Optimization Approach. Sustainability 2023, 15, 15896. [Google Scholar] [CrossRef]
- Pande, A.; Das, A.; Abdel-Aty, M.; Hassan, H. Estimation of Real-Time Crash Risk. Transp. Res. Rec. 2011, 2237, 60–66. [Google Scholar] [CrossRef]
- OpenAI. GPT-3.5 Turbo Fine-Tuning and API Updates; OpenAI: San Francisco, CA, USA, 2023. [Google Scholar]
- Swansen, E.; Mckinnon, I.A.; Knodler, M.A. Integration of Crash Report Narratives for Identification of Work Zone-Related Crash Classification. In Proceedings of the Transportation Research Board 92nd Annual Meeting, Washington, DC, USA, 13–17 January 2013. [Google Scholar]
- Carrick, G.; Heaslip, K.; Srinivasan, S.; Brady, B. A Case Study in Spatial Misclassification of Work Zone Crashes. In Proceedings of the 88th Transportation Research Board Annual Meeting, National Academy of Sciences, Washington, DC, USA, 11–15 January 2009. [Google Scholar]
- Asadi, H.; Wang, J. An Ensemble Approach for Predicting Crash Severity in Work Zones Using Machine Learning. Sustainability 2023, 15, 1201. [Google Scholar] [CrossRef]
- Sharma, K.P.; Yajid, M.S.A.; Gowrishankar, J.; Mahajan, R.; Alsoud, A.R.; Jadhav, A.; Singh, D. A Systematic Review on Text Summarization: Techniques, Challenges, Opportunities. Expert Syst. 2025, 42, e13833. [Google Scholar] [CrossRef]
- Nusir, M.; Louati, A.; Louati, H.; Tariq, U.; Zitar, R.A.; Abualigah, L.; Gandomi, A.H. Design Research Insights on Text Mining Analysis: Establishing the Most Used and Trends in Keywords of Design Research Journals. Electronics 2022, 11, 3930. [Google Scholar] [CrossRef]
- Jaradat, S.; Elhenawy, M.; Nayak, R.; Paz, A.; Ashqar, H.I.; Glaser, S. Multimodal Data Fusion for Tabular and Textual Data: Zero-Shot, Few-Shot, and Fine-Tuning of Generative Pre-Trained Transformer Models. AI 2025, 6, 72. [Google Scholar] [CrossRef]
- Alhadidi, T.I.; Alazmi, A.; Jaradat, S.; Jaber, A.; Ashqar, H.; Elhenawy, M. Pavement Distress Classification Using Bidirectional Cascaded Neural Networks (BCNNs) and U-Net 50-Based Augmented Datasets. arXiv, 2025; in press. [Google Scholar]
- Chang, Y.; Edara, P. Predicting Hazardous Events in Work Zones Using Naturalistic Driving Data. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 1–6. [Google Scholar]
- Wang, B.; Chen, T.; Zhang, C.; Wong, Y.D.; Zhang, H.; Zhou, Y. Toward Safer Highway Work Zones: An Empirical Analysis of Crash Risks Using Improved Safety Potential Field and Machine Learning Techniques. Accid. Anal. Prev. 2024, 194, 107361. [Google Scholar] [CrossRef]
- Rangaswamy, R.; Alnawmasi, N.; Wang, Z. Exploring Contributing Factors to Improper Driving Actions in Single-Vehicle Work Zone Crashes: A Mixed Logit Analysis Considering Heterogeneity in Means and Variances, and Temporal Instability. J. Transp. Saf. Secur. 2023, 16, 768–797. [Google Scholar] [CrossRef]
- Mashhadi, A.H.; Rashidi, A.; Medina, J.; Marković, N. Comparing Performance of Different Machine Learning Methods for Predicting Severity of Construction Work Zone Crashes. Comput. Civ. Eng. 2023. [Google Scholar] [CrossRef]
- Ullman, G.L.; Scriba, T.A. Revisiting the Influence of Crash Report Forms on Work Zone Crash Data. Transp. Res. Rec. 2004, 1897, 180–182. [Google Scholar] [CrossRef]
- Clark, J.B.; Fontaine, M.D. Exploration of Work Zone Crash Causes and Implications for Safety Performance Measurement Programs. Transp. Res. Rec. 2015, 2485, 61–69. [Google Scholar] [CrossRef]
- Daniel, J.; Dixon, K.; Jared, D. Analysis of Fatal Crashes in Georgia Work Zones. Transp. Res. Rec. 2000, 1715, 18–23. [Google Scholar] [CrossRef]
- Wang, J.; Hughes, W.E.; Council, F.M.; Paniati, J.F. Investigation of Highway Work Zone Crashes: What We Know and What We Don’t Know. Transp. Res. Rec. 1996, 1529, 54–62. [Google Scholar] [CrossRef]
- Hrubeš, P.; Langr, M.; Purkrábková, Z. Review of Data Governance Approaches in the Field of Transportation Domain. In Proceedings of the 2024 Smart City Symposium Prague (SCSP), Prague, Czech Republic, 23–24 May 2024. [Google Scholar]
- Si, S.; Xiong, W.; Che, X. Data Quality Analysis and Improvement: A Case Study of a Bus Transportation System. Appl. Sci. 2023, 13, 11020. [Google Scholar] [CrossRef]
- Remoundou, K.; Alexakis, T.; Peppes, N.; Demestichas, K.; Adamopoulou, E. A Quality Control Methodology for Heterogeneous Vehicular Data Streams. Sensors 2022, 22, 1550. [Google Scholar] [CrossRef]
- Galarus, D.; Turnbull, I.; Campbell, S. Timely, Reliable: A High Standard and Elusive Goal for Traveler Information Data Quality. In Proceedings of the 2019 Future of Information and Communication Conference, San Francisco, CA, USA, 14–15 March 2019. [Google Scholar]
- Liu, Z.; Li, L.; Wang, Y.; Lin, H.; Liu, Z.; He, L.; Wang, J. Controllable Traffic Simulation through Llm-Guided Hierarchical Chain-of-Thought Reasoning. arXiv 2024, arXiv:2409.15135. [Google Scholar]
- Masri, S.; Ashqar, H.I.; Elhenawy, M. Large Language Models (LLMs) as Traffic Control Systems at Urban Intersections: A New Paradigm. Vehicles 2025, 7, 11. [Google Scholar] [CrossRef]
- Masri, S.; Ashqar, H.I.; Elhenawy, M. Leveraging Large Language Models (LLMs) for Traffic Management at Urban Intersections: The Case of Mixed Traffic Scenarios. arXiv 2024, arXiv:2408.00948. [Google Scholar]
- Cheng, Q.; Jiao, X.; Yang, M.; Yang, M.; Jiang, K.; Yang, D. Advancing Autonomous Driving Safety Through LLM Enhanced Trajectory Prediction. In Proceedings of the Advanced Vehicle Control Symposium, Milan, Italy, 2–6 September 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 496–502. [Google Scholar]
- de Zarzà, I.; de Curtò, J.; Roig, G.; Calafate, C.T. LLM Multimodal Traffic Accident Forecasting. Sensors 2023, 23, 9225. [Google Scholar] [CrossRef]
- Alhadidi, T.; Jaber, A.; Jaradat, S.; Ashqar, H.I.; Elhenawy, M. Object Detection Using Oriented Window Learning Vision Transformer: Roadway Assets Recognition. arXiv 2024, arXiv:2406.10712. [Google Scholar]
- Jaradat, S.; Alhadidi, T.I.; Ashqar, H.I.; Hossain, A.; Elhenawy, M. Investigating Patterns of Freeway Crashes in Jordan: Findings from Text Mining Approach. Results Eng. 2025, 26, 104413. [Google Scholar] [CrossRef]
- Huang, X.; Feng, Y.; Zhang, Z. Crash Report Generation Using ChatGPT: A Novel Approach for Automated Accident Reporting. In Proceedings of the 2024 IEEE 4th International Conference on Electronic Information Engineering and Computer Science (EIECS), Yanji, China, 27–29 September 2024; pp. 1174–1177. [Google Scholar]
- Pendyala, R.; Hall, S. Explaining Misinformation Detection Using Large Language Models. Electronics 2024, 13, 1673. [Google Scholar] [CrossRef]
- Oliveira, J.; Almeida, D.; Santos, F. Comparative Analysis of BERT-Based and Generative Large Language Models for Detecting Suicidal Ideation: A Performance Evaluation Study. Cad. Saude Publica 2024, 40, e00028824. [Google Scholar] [CrossRef]
- Klie, T.; Nguyen, T.; Calderon, A. Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. Comput. Linguist. 2022, 49, 157–198. [Google Scholar] [CrossRef]
- Beattie, T.; Moulton, S.; Wong, M. Utilizing Large Language Models for Enhanced Clinical Trial Matching: A Study on Automation in Patient Screening. Cureus 2024, 16, e60044. [Google Scholar] [CrossRef]
- Pornprasit, C.; Tantithamthavorn, C. Fine-Tuning and Prompt Engineering for Large Language Models-Based Code Review Automation. Inf. Softw. Technol. 2024, 175, 107523. [Google Scholar] [CrossRef]
- Latif, E.; Zhai, X. Fine-Tuning ChatGPT for Automatic Scoring. Comput. Educ. Artif. Intell. 2024, 6, 100210. [Google Scholar] [CrossRef]
- Davis, J.; Goadrich, M. The Relationship between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240. [Google Scholar]
- Saito, T.; Rehmsmeier, M. The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
- Missouri State Highway Patrol (MSHP). Online Crash Report Search; MSHP: Jefferson City, MO, USA, 2022.
- Mumtarin, M.; Chowdhury, M.S.; Wood, J. Large Language Models in Analyzing Crash Narratives: A Comparative Study of ChatGPT, Bard and GPT-4. arXiv 2023, arXiv:2308.13563. [Google Scholar]
- Bhagat, S.; Shihab, I.F.; Sharma, A. Accuracy Is Not Agreement: Expert-Aligned Evaluation of Crash Narrative Classification Models. arXiv 2024, arXiv:2504.13068. [Google Scholar]
- Bucher, M.J.J.; Martini, M. Fine-Tuned “small” LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification. arXiv 2024, arXiv:2406.08660. [Google Scholar]
- Grinberg, I.; Wiseman, Y. Scalable Parallel Simulator for Vehicular Collision Detection. Int. J. Veh. Saf. Mobil. Technol. 2013, 8, 116–121. [Google Scholar] [CrossRef]
Study | Task Focus | Data Type Used | Method/Model | Validation Approach | Identified Limitation | Gap This Study Fills |
---|---|---|---|---|---|---|
Swansen et al. [8] | Work zone crash identification | Narrative + structured | Rule-based classifier | Manual comparison | Inconsistent work zone labeling | No use of AI or LLMs for automated cross-checking of crash records |
Carrick et al. [9] | Spatial misclassification | Spatial + structured | Spatial analysis | Mapping-based validation | Limited narrative integration | Lack of automated narrative analysis to enhance crash record accuracy |
Sayed et al. [4] | Narrative-based work zone detection | Narrative text | Noisy-OR classifier | Manual inspection | No integration with structured data | Lack of multimodal data fusion for improved detection accuracy |
Asadi et al. [10] | Work zone crash severity prediction | Structured (naturalistic dataset) | Ensemble models | Predictive performance | Focus on prediction rather than label verification | Not focusing on retrospective mislabeling in crash data |
Pendyala and Hall [35] | Misinformation detection | Text data | LLMs (GPT models) | Performance evaluation | Limited application to crash-related contexts | Provides insight into LLM utility for crash data validation |
Oliveira et al. [36] | Detection of suicidal ideation | Text data | Fine-tuned BERT models | Performance evaluation | Focused on healthcare, not crash data | Techniques may be adaptable for detecting mislabeling in crash data |
Klie et al. [37] | Annotation error detection | Text data | Multiple error detection methods | Comparative analysis | Limited to general text classification | Introduces methods that could be applied to crash report labeling accuracy |
Beattie et al. [38] | Clinical trial patient screening | Text data | LLMs (GPT-3.5, GPT-4) | Accuracy and sensitivity evaluation | Mislabeled criteria in ground truth data | Highlights issues of label accuracy which may be relevant to crash data classification |
Huang et al. [34] | Crash report generation | Text synthesis | ChatGPT-based method | Evaluation through case examples | Focus on report generation without mislabel detection | Offers foundational framework for report automation, while lacking audit mechanisms |
Our work | Mislabel detection (work zones) | Structured + narrative | Fine-tuned GPT-3.5 | Expert validation (80 cases) | Conservative recall rate | First study to employ fine-tuned LLM for mislabel detection via multimodal fusion |
ID | Structured Fields | Narrative | Original Czone | LLM Predicted Czone | Mislabeled |
---|---|---|---|---|---|
1 | Light: Daylight, Severity: Minor, Injuries: 0 | Vehicle 1 was overtaking a construction crew. Vehicle 2 crossed the center line to avoid the construction crew and struck vehicle 1. Assisted by tpr m.r. lawson (831). | 0 | 1 | ✓ |
2 | Light: Dark, Severity: Fatal, Injuries: 2 | Crash occurred during nighttime on a rural highway; no signs of construction observed. | 0 | 0 | ✗ |
3 | Light: Daylight, Severity: Minor, Injuries: 1 | Traffic slowed due to workers and barriers ahead; rear-end collision followed. | 0 | 1 | ✓ |
Metric | Non-Czone (0) | Czone (1) | Overall/Avg |
---|---|---|---|
Precision | 98.78% | 86.67% | – |
Recall | 99.97% | 14.29% | – |
F1 score | 99.37% | 24.53% | – |
Support (No. of cases) | 6309 | 91 | 6400 |
Accuracy | – | – | 98.75% |
Macro-average F1 score | – | – | 61.95% |
Weighted-average F1 score | – | – | 98.31% |
Classification Type | Count |
---|---|
True misclassification detected (TP) | 4 |
Correct label (false positive) | 76 |
Total flagged cases reviewed | 80 |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Jaradat, S.; Acharya, N.; Shivshankar, S.; Alhadidi, T.I.; Elhenawy, M. AI for Data Quality Auditing: Detecting Mislabeled Work Zone Crashes Using Large Language Models. Algorithms 2025, 18, 317. https://doi.org/10.3390/a18060317