Accession Number:



Using Natural Language Processing and Machine Learning to Classify Health Literacy From Secure Messages: The ECLIPPSE study

Descriptive Note:

Journal Article - Open Access

Corporate Author:

Arizona State University Mesa United States

Report Date:


Pagination or Media Count:



Limited health literacy is a barrier to optimal healthcare delivery and outcomes. Current measures requiring patients to self-report limitations are time-consuming and may be considered intrusive by some. This makes widespread classification of patient health literacy challenging. The objective of this study was to develop and validate literacy profiles as automated indicators of patients' health literacy to facilitate a non-intrusive, economical and more comprehensive characterization of health literacy among a health care delivery system's membership. To this end, three literacy profiles were generated based on natural language processing (combining computational linguistics and machine learning) using a sample of 283,216 secure messages sent from 6,941 patients to their primary care physicians. All patients were participants in Kaiser Permanente Northern California's DISTANCE Study. Performance of the three literacy profiles was compared against a gold standard of patient self-reported health literacy. Associations were analyzed between each literacy profile and patient demographics, health outcomes and healthcare utilization. T-tests were used for numeric data such as A1C, Charlson comorbidity index and healthcare utilization rates, and chi-square tests for categorical data such as sex, race, poor adherence and severe hypoglycemia. Literacy profiles varied in their test characteristics, with C-statistics ranging from 0.61 to 0.74. Relations between literacy profiles and health outcomes revealed patterns consistent with previous health literacy research: patients identified via literacy profiles as having limited health literacy (a) were older and more likely of minority status, (b) had poorer medication adherence and glycemic control, and (c) exhibited higher rates of hypoglycemia, comorbidities and healthcare utilization. This represents the first successful attempt to employ natural language processing to estimate health literacy.
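
For readers unfamiliar with the C-statistic reported above, the following is a minimal sketch of how such a value can be computed when a profile's scores are compared against a binary gold standard. It is not the study's code: the function name `c_statistic`, the coding of labels (1 = limited health literacy by self-report, 0 = adequate) and the toy values are all assumptions made for illustration. The C-statistic is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
from itertools import product


def c_statistic(scores, labels):
    """Return the C-statistic (area under the ROC curve) for binary labels.

    scores: numeric classifier outputs, higher meaning "more likely positive".
    labels: 1 for positive cases, 0 for negative cases (hypothetical coding).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    concordant = 0.0
    # Compare every positive case with every negative case.
    for p, n in product(pos, neg):
        if p > n:
            concordant += 1.0   # positive case ranked above negative case
        elif p == n:
            concordant += 0.5   # ties count as half
    return concordant / (len(pos) * len(neg))


# Made-up values for illustration only; these are not study data.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 1, 0]
print(c_statistic(toy_scores, toy_labels))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which places the reported range of 0.61 to 0.74 between the two. The pairwise loop is quadratic in sample size; for a cohort of several thousand patients a rank-based implementation from a statistics library would be the more practical choice.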

Subject Categories:

  • Linguistics
  • Cybernetics
  • Medicine and Medical Research

Distribution Statement: