Developing Natural Language Processing Algorithms to Medically Code the Clinical Notes in the Theater Medical Data Store
Abstract:
Outside the Department of Defense, natural language processing (NLP) strategies have been used with electronic health records (EHR) to increase information extraction from free-text notes and structured fields, allowing access to much larger cohorts than previously possible. Current operational medical data are held in the Theater Medical Data Store (TMDS), and most of the medical information in TMDS is contained in unstructured text fields. The objective is to automate the coding of these data into injury diagnostic code groups, which are derived from International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) codes. The TMDS holds over 8 million records, and as many as 50% of the ICD-9-CM codes may be incomplete or inaccurate. The accuracy of the data in the TMDS has never been quantified, largely because most of it was captured without regard to medical billing requirements. The study has developed a set of programming rules using NLP and machine learning (ML) (i.e., algorithms generated by automated learning from manually coded data), with output intended to reproduce human interpretation as closely as possible. The coding algorithm models have been developed using previously coded medical records from the Expeditionary Medical Encounter Dataset (EMED) housed at the Naval Health Research Center (NHRC). Experienced nurse staff are responsible for coding and validating all EMED medical encounter records. The model will be trained on a subset of the EMED data and then tested on TMDS data that have been matched to the remaining EMED data.
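
The train-then-test design described above can be illustrated with a minimal sketch. It is not the study's algorithm: it assumes a simple bag-of-words naive Bayes classifier as a stand-in, and the example notes, the three injury group labels, and the names NaiveBayesCoder, train, and held_out are all hypothetical and chosen only for illustration. No real EMED or TMDS records are used.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(note):
    """Lowercase a free-text note and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", note.lower())


class NaiveBayesCoder:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, notes, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for note, label in zip(notes, labels):
            self.word_counts[label].update(tokenize(note))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, note):
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(note) if t in self.vocab]
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical manually coded notes standing in for the training subset.
train = [
    ("closed fracture of left tibia after fall", "fracture"),
    ("displaced fracture right femur, splinted", "fracture"),
    ("second degree burn to forearm from blast", "burn"),
    ("partial thickness burn on hand and face", "burn"),
    ("laceration to scalp, wound irrigated and sutured", "open_wound"),
    ("penetrating wound to thigh from shrapnel", "open_wound"),
]

# Hypothetical held-out notes with reference codes, standing in for the
# matched records used for testing.
held_out = [
    ("fracture of right tibia, splinted in field", "fracture"),
    ("burn to face and hand", "burn"),
    ("shrapnel wound to scalp, sutured", "open_wound"),
]

model = NaiveBayesCoder().fit([n for n, _ in train], [c for _, c in train])
predictions = [model.predict(note) for note, _ in held_out]
accuracy = sum(p == c for p, (_, c) in zip(predictions, held_out)) / len(held_out)
print(predictions, accuracy)
```

Agreement between the predicted groups and the reference codes on the held-out notes is the quantity of interest here. A real evaluation would report it per injury group on a far larger matched sample.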