Clinical Natural Language Processing

While electronic medical record (EMR) systems employ increasingly rich data models that offer a wide variety of options for structured data entry, a large amount of medical data still resides in free-form narrative text reports. Our research goal in clinical natural language processing is to provide convenient and intelligent information extraction and classification from medical reports by taking advantage of both individual human intervention and collective human intelligence, ultimately to improve diagnosis, reduce errors, and inform medical practice and decision making.

One ongoing project is IDEAL-X, an interactive information extraction system based on incremental learning. It facilitates information extraction and classification from narrative medical reports and transforms the extracted data into normalized, structured forms. The system learns quickly from user feedback on a small set of reports and achieves high accuracy on data extraction with minimal user effort. Extracted data can be further normalized through controlled vocabularies. IDEAL-X requires no special configuration or training sets and is not constrained to specific domains, so it is easy to use and highly portable. IDEAL-X is being used for cohort identification among tens of thousands of patients and for automated classification of a large number of radiology reports from the CDC.
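
The general idea of learning incrementally from user feedback can be illustrated with a minimal sketch. This is not the IDEAL-X implementation: it assumes a simple multinomial naive Bayes text classifier chosen only for illustration, and the names IncrementalNaiveBayes, review_loop, normalize, and CONTROLLED_VOCABULARY, as well as the sample labels, are hypothetical. The model proposes a label for each report, the user confirms or corrects it, and each confirmed answer is folded into the model before the next report is shown.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical controlled vocabulary: raw label -> normalized term.
CONTROLLED_VOCABULARY = {"normal": "NEGATIVE", "abnormal": "POSITIVE"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(label):
    """Map a label to its controlled-vocabulary term, if one is defined."""
    return CONTROLLED_VOCABULARY.get(label, label)


class IncrementalNaiveBayes:
    """Multinomial naive Bayes updated one labeled report at a time."""

    def __init__(self):
        self.label_counts = Counter()             # reports seen per label
        self.token_counts = defaultdict(Counter)  # token counts per label
        self.vocabulary = set()

    def update(self, text, label):
        """Fold a single user-confirmed (text, label) pair into the model."""
        tokens = tokenize(text)
        self.label_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, text):
        """Return the most probable label, or None before any feedback."""
        if not self.label_counts:
            return None
        tokens = tokenize(text)
        total_reports = sum(self.label_counts.values())
        vocab_size = max(len(self.vocabulary), 1)
        best_label, best_score = None, -math.inf
        for label, n_reports in self.label_counts.items():
            score = math.log(n_reports / total_reports)
            total_tokens = sum(self.token_counts[label].values())
            for token in tokens:
                # Laplace (add-one) smoothing handles unseen tokens.
                count = self.token_counts[label][token]
                score += math.log((count + 1) / (total_tokens + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def review_loop(model, reports, ask_user):
    """Predict each report, let the user confirm or correct, then learn.

    Returns the number of reports where the user had to correct the model.
    """
    corrections = 0
    for text in reports:
        predicted = model.predict(text)
        confirmed = ask_user(text, predicted)  # user feedback
        if confirmed != predicted:
            corrections += 1
        model.update(text, confirmed)          # incremental update
    return corrections
```

In an interactive setting, ask_user would display the report with the proposed label and return the user's decision. For experimentation it can be replaced by a function that looks up a known label, which makes it possible to count how many corrections are needed as more reports are reviewed.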