Data Paths: An Update from the PaTH Informatics Team

Researchers often turn to patients’ electronic medical records (EMRs) to better understand their risk of developing disease and the effectiveness of different treatments. EMR data include information on patients’ demographics, social histories (such as smoking status), signs and symptoms, and billing codes that indicate their diagnoses and treatments. These data come in two forms: structured and unstructured. Structured data are coded in a way that is easy for computers to process, such as a patient’s age selected from a dropdown list. Unstructured data are not coded and are harder for computers to process, such as a patient’s signs and symptoms described in a physician’s note about an office visit. To use unstructured data more efficiently, researchers develop natural language processing (NLP) systems.

"Natural language processing systems can be thought of as a series of algorithms that work in tandem to identify, encode, and extract pertinent study information or variables ‘locked’ in free-text mediums like clinical notes," says Brett South, Ph.D., a member of PaTH’s NLP team from the University of Utah.

Developing an NLP tool is a complicated process, but PaTH’s NLP team is up to the challenge. The team is led by Wendy W. Chapman, Ph.D., of the University of Utah and has developed several NLP systems over the years, including one system that de-identifies clinical notes to preserve patient privacy and one that identifies critical findings from radiology reports to improve communication between physicians and radiologists.

"Some NLP problems are more difficult than others due to the high variability in ways physicians may document information in the notes" says NLP team member Danielle Mowery, Ph.D. For instance, Dr. Mowery says it can be challenging when physicians use acronyms or abbreviations in their notes, such as using "cigs" to mean cigarettes "These challenges are also what make NLP great fun!"

The team is currently working to support an NLP task called "information extraction." The goal is to use computers to identify, encode, and extract specific information from unstructured data. So far, the team has developed a system that extracts patients’ smoking status from clinical notes. They are working to integrate this information with structured EMR data to determine a patient’s lifetime smoking status. Both the structured data (e.g., a dropdown box) and the unstructured data (e.g., a clinician’s notes) contain information about smoking, but, Dr. Mowery says, either source could be incomplete or out of date. Comparing the two could therefore provide a fuller picture of smoking history than examining either source alone. Once smoking status is determined, it can be incorporated into analyses as a study variable. For example, PaTH investigators focused on idiopathic pulmonary fibrosis (IPF) can use NLP to learn how smoking interacts with other factors that affect IPF. Dr. Mowery says that the PaTH team’s prototype extraction tool is performing comparably to state-of-the-art NLP systems. Next, the team will assess the tool’s generalizability, or how well it operates at a new site with data it hasn’t seen before.
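To make the idea concrete, here is a minimal Python sketch of rule-based smoking-status extraction followed by a comparison with a structured field. It is not the PaTH team's system. The abbreviation map, the phrase patterns, the function names, and the rule that any evidence of current or former smoking makes a patient an "ever smoker" are all assumptions chosen for illustration. A real clinical NLP system would handle negation, context, and documentation variability far more carefully.

```python
import re
from typing import Optional

# Hypothetical abbreviation map; "cigs" is the example mentioned in the article.
ABBREVIATIONS = {"cigs": "cigarettes"}

# Hypothetical phrase patterns, checked in order so that "denies smoking"
# is labeled "never" before the bare word "smoking" can match "current".
PATTERNS = [
    ("never", re.compile(r"\b(never smoked|non-?smoker|denies (smoking|tobacco))\b")),
    ("former", re.compile(r"\b(former smoker|ex-?smoker|quit smoking|stopped smoking)\b")),
    ("current", re.compile(r"\b(current smoker|smokes|smoking|cigarettes)\b")),
]


def normalize(note: str) -> str:
    """Lowercase the note and expand known abbreviations."""
    text = note.lower()
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbreviation)}\b", expansion, text)
    return text


def extract_smoking_status(note: str) -> Optional[str]:
    """Return 'never', 'former', 'current', or None if the note is silent."""
    text = normalize(note)
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return None


def lifetime_status(structured: Optional[str], extracted: Optional[str]) -> str:
    """Combine the dropdown value and the note-derived value.

    Assumed rule: evidence of current or former smoking in either source
    means 'ever smoker'; otherwise 'never' evidence means 'never smoker';
    with no evidence at all the result is 'unknown'.
    """
    values = {v for v in (structured, extracted) if v is not None}
    if not values:
        return "unknown"
    if values == {"never"}:
        return "never smoker"
    return "ever smoker"


if __name__ == "__main__":
    note = "Pt smokes 10 cigs daily."
    from_note = extract_smoking_status(note)
    # The dropdown says "never" but the note disagrees; the note wins here.
    print(from_note, "->", lifetime_status("never", from_note))
```

In this sketch a disagreement between the two sources resolves toward "ever smoker," reflecting the article's point that either source alone may be incomplete or out of date.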

"PaTH is an excellent test ground for NLP development due to the number of diverse, clinical institutions within the network," says Dr. Chapman. "This test bed will permit our NLP team to assess the generalizability of the system and adapt the system to accurately extract study variables at each PaTH site in a patient privacy-preserving way."

Could your study benefit from natural language processing? Contact Dr. Wendy Chapman with any inquiries about how NLP can support your study.


Copyright 2016 | PaTH Network