Work zone safety management and research rely heavily on the quality of work zone crash data. However, a police officer may misclassify a crash in the structured data because of restrictive options in the crash report, a lack of understanding of the data's importance, limited time due to workload, or failure to recognize the work zone as one of the crash contributing factors. Consequently, work zone crashes are underrepresented in crash statistics. Crash narratives contain valuable information that is not included in the structured data. The objective of this study is to develop a classifier that applies text mining techniques to quickly find missed work zone (WZ) crashes in the unstructured text of the crash narratives.
The study used three years of crash data, from 2017 to 2019. The 2017 and 2018 data were used for training, and the 2019 data were used for testing. A unigram + bigram noisy-OR classifier was developed and shown to be an efficient and effective means of classifying work zone crashes based on key information in the crash narrative. An ad hoc analysis of the misclassified work zone crashes sheds light on when and where work zone crashes are more likely to be missed, and on the plausible reasons why.
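To make the idea of a unigram + bigram noisy-OR classifier concrete, the following is a minimal sketch and not the study's implementation. It assumes each term's probability is estimated as the share of training narratives containing that term that are labeled as WZ crashes, and it combines the terms found in a narrative as 1 - prod(1 - p_term). The function names, the `min_count` threshold, and the toy narratives are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(text):
    """Return the set of unigrams and bigrams in a narrative (simple whitespace tokens)."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens + bigrams)


def train(narratives, labels, min_count=2):
    """Estimate P(WZ | term present) for each term seen in at least min_count narratives.

    Assumption: a plain relative frequency is used; the study may estimate
    or filter term probabilities differently.
    """
    total = Counter()
    positive = Counter()
    for text, is_wz in zip(narratives, labels):
        for term in ngrams(text):
            total[term] += 1
            if is_wz:
                positive[term] += 1
    return {term: positive[term] / n for term, n in total.items() if n >= min_count}


def noisy_or_score(text, term_probs):
    """Noisy-OR combination: 1 - product over present terms of (1 - p_term)."""
    prob_none = 1.0
    for term in ngrams(text):
        # Terms not seen (often enough) in training contribute no evidence.
        prob_none *= 1.0 - term_probs.get(term, 0.0)
    return 1.0 - prob_none


# Hypothetical toy narratives; True marks a WZ crash.
toy_narratives = [
    "vehicle struck cones in work zone",
    "driver entered work zone and hit barrier",
    "vehicle rear ended at red light",
    "driver ran red light and hit vehicle",
]
toy_labels = [True, True, False, False]

term_probs = train(toy_narratives, toy_labels)
score_wz = noisy_or_score("truck hit flagger in work zone", term_probs)
score_other = noisy_or_score("car stopped at red light", term_probs)
score_mixed = noisy_or_score("driver hit", term_probs)
```

In practice, a narrative would be flagged for review when its score exceeds a threshold tuned on the training years and then evaluated on the held-out year; the threshold itself is a design choice this sketch leaves open.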