Electronic health records (EHRs) contain information about tobacco use, but smoking status and history are often inadequately captured, resulting in missed opportunities for smoking cessation intervention and lung cancer screening decision-making. Informatics methods can improve ascertainment of smoking behaviors through development of a tobacco use registry from EHRs. Methods: Using structured data and free text from our local EHR, we developed two support vector machine (SVM) models to classify smoking status (never, former, current) and smoking history (never, pack-years, cigarettes per day, years smoked). We trained and tested these models on 758 clinical notes from the Epic-based EHR of the Dartmouth-Hitchcock health system; the training set had 479 notes and the test set 280. Notes were eligible if the patient was ≥21 years old and had a clinical encounter in the EHR between 1/1/15 and 9/1/16. We assessed the models' performance through precision (probability that a retrieved element is relevant), recall (probability that a relevant element is retrieved), and the F1-score (harmonic mean of precision and recall). We also tested the models on publicly available data from the National Centers for Biomedical Computing (i2b2). Results: Of the 280 test records, 22% were current smokers, 19% former, and 59% never smokers. Accuracy assessment of our models showed precision = 68% and recall = 85% for smoking status, and precision = 66% and recall = 94% for smoking history. The F1-scores for smoking status and history were 65% and 74%, respectively. The majority of correctly classified smokers also had one or more smoking history elements ascertained by our model. Of those individuals correctly classified as never smokers (n = 98), only two were misclassified as having a smoking history. When testing our models on i2b2 data, our F1-score was 92%.
Review of misclassified records indicates that deep learning refinements to our current machine learning approach will improve performance measures. Conclusion: Machine learning models applied to our Epic EHR consistently identify smoking history. Creating a tobacco use registry from the EHR is feasible and, with advanced algorithms, will help target patients for cancer control efforts, such as smoking cessation and lung cancer screening.
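To make the performance measures defined in the Methods concrete, the following is a minimal Python sketch of per-class precision, recall, and F1-score for a three-way smoking status classifier. It is an illustration only, not the authors' code: the function name `per_class_scores` and the six example labels are hypothetical, and the abstract does not state how its reported scores were averaged across classes.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for each class label.

    Hypothetical helper for illustration; not taken from the study.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correctly retrieved
        else:
            fp[pred] += 1       # retrieved but not relevant
            fn[truth] += 1      # relevant but not retrieved

    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        retrieved = tp[label] + fp[label]
        relevant = tp[label] + fn[label]
        precision = tp[label] / retrieved if retrieved else 0.0
        recall = tp[label] / relevant if relevant else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Made-up labels (assumption): six notes with true and predicted status.
y_true = ["never", "never", "former", "current", "current", "former"]
y_pred = ["never", "former", "former", "current", "never", "former"]

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy example the "former" class has high recall but lower precision, the same pattern the abstract reports for both models (recall exceeding precision).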
The following are the 16 highest scoring abstracts of those submitted for presentation at the 41st Annual ASPO meeting held March 12–14, 2017, in Seattle, WA.
- ©2017 American Association for Cancer Research.