Natural language processing (NLP) is becoming a powerful technique for unlocking insights in real-world healthcare, and it has the potential to transform the practice of medicine for the better.
NLP is a field of computer science that uses machine learning and other techniques to extract meaning from the written word. First developed in the 1950s, it has blossomed over the past 25 years with the availability of cheap, plentiful, and powerful computers working in parallel.
Medical documentation is the process of recording a physician’s interactions with a patient during a clinical visit. With the increasing adoption of electronic health records (EHRs) over the past 5 to 7 years, more clinical encounters are now documented in machine-readable text than in indecipherable handwriting. This is where applying NLP techniques to healthcare can be most helpful to the overtasked physician: voice-to-text transcription, clinical note summarization, auto-documentation, machine-assisted coding, and clinical decision support systems. Although there has been progress, important problems remain to be solved before NLP’s potential in healthcare can be realized.
Early Success Stories of NLP in Healthcare
Preliminary efforts to use NLP in healthcare have been reported in the literature by academia, government, and industry. For the most part, these studies have focused on limited patient datasets and concentrated on identifying or predicting disease in individuals.
In 2013, the U.S. Department of Veterans Affairs used NLP to identify the prevalence of suicide attempts in 1.2 billion EHR documents. The algorithms developed were able to distinguish between screening for and actual mentions of suicide attempts in 10,000 patients with 80 percent accuracy.
Beginning in 2014, IBM collaborated with Epic, the electronic health record (EHR) vendor, and Carilion Clinic in Virginia to identify patients suffering from heart failure based on machine analysis of clinical notes. The original idea was to extract clinical, social, and behavioral features from the medical record to predict heart failure. Most of this information sits in the unstructured portions of the medical record rather than in the diagnosis or procedure codes used for billing (structured data). The pilot program included 21 million records, which were evaluated in less than 2 months. The investigators reported an 85 percent accuracy rate in identifying patients at risk of developing heart failure.
In 2015, researchers at UCLA analyzed radiology reports along with diagnosis codes and lab results in EHRs to detect the presence of cirrhosis (liver disease) in 4,294 patients. The disease prediction algorithm achieved 97 percent sensitivity and 77 percent specificity, which may be sufficient if used for surveillance. If used for diagnosis and program enrollment, however, it would have an unacceptably high false positive rate.
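To see why 77 percent specificity can be a problem for diagnosis, it helps to work through the arithmetic. The sketch below is illustrative only: the sensitivity and specificity come from the study described above, but the prevalence values are assumptions chosen for the example (the true rate of cirrhosis in any given population is not stated here), and the function name is hypothetical.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of flagged patients who actually have the disease.

    prevalence is an ASSUMED input for illustration; it is not a figure
    reported in the study discussed in the text.
    """
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Sensitivity and specificity as reported for the cirrhosis algorithm.
SENSITIVITY = 0.97
SPECIFICITY = 0.77

# Hypothetical prevalence values, purely to show how the answer shifts.
for assumed_prevalence in (0.05, 0.10, 0.50):
    ppv = positive_predictive_value(assumed_prevalence, SENSITIVITY, SPECIFICITY)
    print(f"assumed prevalence {assumed_prevalence:.0%}: PPV = {ppv:.1%}")
```

Under the assumed 10 percent prevalence, only about one in three flagged patients would truly have the disease. That may be acceptable for a surveillance list that a clinician reviews, but not for automatic enrollment in a program.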
So, what will it take to build industrial-grade algorithms, ones that do not operate on a small population in a test situation, but which impact a larger, more diverse set of patients?
Moving NLP Into the Medical Mainstream
Healthcare organizations rightfully demand very high accuracy for solutions that address diagnosis, treatment, or reimbursement. For most situations, where there is revenue or clinical care at stake, algorithms need to detect 95 percent or more of the “truth.” Even a 10 percent false positive rate might mean needless tests and more harm done to the patient than care improved; and a 10 percent false negative rate might mean lower reimbursement. Other industries in which machine learning is used, such as ad engines, social media, and e-commerce, are much more forgiving given their targets.
Very often, the accuracy demonstrated by algorithms published in peer-reviewed journals is lower than what a hospital or health plan CEO would require, as the studies cited above show. Think of these algorithms as initial experiments, with more of them being published each day. The hard part is achieving a higher level of algorithmic performance across a larger, more diverse population. This requires a platform that can support scale and feed corrections back into algorithms where they do not perform well enough for the question at hand.
What makes text mining of clinical records difficult is that the training data is so hard to obtain, which is just as you or I, as patients, would want it. A company or academic lab must first demonstrate value before it receives the medical records that provide the basis for future learning. But that value needs to come from an algorithm that is already reasonably well trained, and clinical records, which are not easily obtained, are the starting point for training and testing machine learning-based NLP. It can appear to be a “chicken or the egg” type of puzzle.
Initial success in healthcare will most likely come on the administrative side, which is more forgiving than the clinical realm. Although we said that algorithms need to be high performing, some shortcuts are possible. For example, false positive results can be filtered out by experts in a targeted quality assurance workflow, as we have demonstrated for risk adjustment coding. This strategy provides a way to get started. It is also how we will gain more experience with NLP technologies in healthcare, since administrative use cases offer more error tolerance and room for learning.
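As a rough illustration of such a quality assurance workflow, the sketch below routes NLP-suggested codes to a human reviewer and keeps only those the reviewer accepts. It is a minimal sketch, not a description of any production system: the `Suggestion` class, the function names, the confidence scores, the threshold, and the placeholder record IDs and codes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    record_id: str  # hypothetical chart identifier
    code: str       # placeholder for a code proposed by the NLP model
    score: float    # assumed model confidence between 0 and 1


def build_review_queue(suggestions, threshold=0.5):
    """Drop low-confidence suggestions and sort the rest, highest score first."""
    queue = [s for s in suggestions if s.score >= threshold]
    return sorted(queue, key=lambda s: s.score, reverse=True)


def apply_expert_decisions(queue, accepted):
    """Keep only the suggestions an expert reviewer explicitly accepted.

    `accepted` is a set of (record_id, code) pairs; anything not in it is
    treated as a false positive and filtered out.
    """
    return [s for s in queue if (s.record_id, s.code) in accepted]


suggestions = [
    Suggestion("chart-001", "CODE-A", 0.92),
    Suggestion("chart-002", "CODE-B", 0.61),
    Suggestion("chart-003", "CODE-C", 0.20),
]

queue = build_review_queue(suggestions)
reviewer_accepted = {("chart-001", "CODE-A")}  # the expert rejects CODE-B
final = apply_expert_decisions(queue, reviewer_accepted)
print([(s.record_id, s.code) for s in final])
```

The point of the design is that the algorithm only has to narrow the search. The expert makes the final call, so a false positive costs some reviewer time rather than resulting in an incorrect submission.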
A Word of Advice on Reaching the Finish Line
Life is a game of inches: the hardest part is traversing the last 5 to 10 percent of the path to the goal. This is true of refining analytics to solve a problem and meet user needs, and it is particularly true of creating a solution that applies in a variety of scenarios. We are in the early stages of realizing the value of NLP-powered applications in healthcare. These applications will provide the basis for learning from real-world care which treatments work, which don’t, and how to improve care delivery. Of course, there is more to achieving this level of performance, such as domain experience, feature engineering, and a ton of training data. But that is another blog post.
 Hammond, Kenric & Laundry, Ryan & OLeary, T.Michael & Jones, William. (2013). Use of Text Search to Effectively Identify Lifetime Prevalence of Suicide Attempts among Veterans. Proceedings of the Annual Hawaii International Conference on System Sciences. 2676-2683. 10.1109/HICSS.2013.586.
 IBM: Natural language, machine learning can flag heart disease. EHRIntelligence – https://ehrintelligence.com/news/ibm-natural-language-machine-learning-can-flag-heart-disease.
 Chang EK, Yu CY, Clarke R, Hackbarth A, Sanders T, Esrailian E, Hommes DW and Runyon BA (2016) Defining a patient population with cirrhosis: an automated algorithm with natural language processing. J Clin Gastroenterol 50, 889–894.