The Apixio Engineering Blog
Introduction to Machine Learning
This is the first in a two-part blog post about one of the methods we’ve employed to help us make the best use of our training data. In this first post we’ll give a brief introduction to some basics of machine learning and a couple of the problems we face. The second post will be a more in-depth look at a technique we’ve used to try to improve the results of our models.
At Apixio, amongst other things, we use machine learning to predict medical conditions that patients most likely have based on clinical documentation and medical claims data. After ingesting, cleaning and normalizing this data, we train models from which we make these predictions. In the data science world when we use the term “model” what we mean is that we are trying to create a function f that when given input x produces one or more outputs that classify the input x. We use a number of computational techniques to create these functions, all of which involve three important components: data points, features and weights.
Note: this is not your typical f(x) = y function, in which a known function acts on x to produce y. Instead, this is a situation where x and y are both observed variables and we are trying to discover a function that ties them together in a predictable way. Visually, you can think of the data as points on an Xi, Xj plot.
Basic Components of Training Data
For the following explanations of data points, features and weights, let’s say we are creating a classifier that predicts whether a patient has a broken leg or not.
A data point comes from the domain and range of the function we are trying to create to do our predictions. In this example the patient medical records are the domain, and the prediction values (the information about whether the patients have a broken leg) are the range. When training a model we will have a data point for each patient containing some information on the patient (a.k.a. features, see below) and a binary value for "broken leg".
The difference between having the binary true/false value vs. an empty value is the difference between what we call labeled vs. unlabeled data. A "labeled" data point is one where we know the answer to whether the patient has the condition "broken leg" or not; for an "unlabeled" data point, we do not know the status of our patient's legs. Unlabeled data can be used in a machine learning setting, but labeled data is always better.
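To make this concrete, here is a minimal sketch of labeled vs. unlabeled data points. The feature names and values are made up for illustration; they are not Apixio's actual schema.

```python
# Hypothetical data points for the "broken leg" classifier.
# Feature names and values are illustrative only.
labeled_point = {
    "features": {"visited_orthopedist": 1, "xray_ordered": 1, "age": 54},
    "label": True,   # we know this patient has a broken leg
}

unlabeled_point = {
    "features": {"visited_orthopedist": 0, "xray_ordered": 0, "age": 37},
    "label": None,   # the status of this patient's legs is unknown
}

def is_labeled(point):
    """A data point is labeled when its outcome value is known."""
    return point["label"] is not None
```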
Features are pieces of information (traits) we extract from the raw data that are important in the prediction task. For example, the fact that "someone recently visited an orthopedist" might be an important piece of information for predicting whether the patient has a broken leg. It does not certainly follow that the patient has a broken leg, but it makes it more probable. We get features from the structured (tabular) data and from written clinical documents (typically words). Our models may end up having hundreds or thousands of features, though good models can be created from just a handful of the right features too.
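As a toy illustration of extracting word features from a clinical document, here is a sketch that turns a note into a binary feature vector. The keyword list and note text are invented for this example; real feature extraction is far richer.

```python
# Illustrative only: binary word features from a clinical note.
# The keyword list is hypothetical, not a real clinical vocabulary.
KEYWORDS = ["orthopedist", "fracture", "cast", "crutches"]

def extract_features(note_text):
    """Return a binary feature vector: 1 if the keyword appears in the note."""
    words = note_text.lower().split()
    return [1 if kw in words else 0 for kw in KEYWORDS]

features = extract_features("Patient saw an orthopedist and received a cast")
# -> [1, 0, 1, 0]
```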
Each feature has a corresponding weight (think: importance) assigned to it, which is unknown at the beginning of training. The weights and the features serve as inputs to whatever function we choose to model our outcome with. In many cases our function is of the form f(Wx+b), where W is the weights matrix, x is the feature vector, and f is a non-linear transformation.
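For a single data point, f(Wx+b) reduces to a dot product plus a bias pushed through a non-linearity. A minimal sketch with the logistic (sigmoid) function as f, using made-up weights:

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, features, bias):
    """Compute f(Wx + b) for one data point, where f is the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights: "visited orthopedist" matters most in this toy model.
weights = [2.5, 1.0, 0.3]
prob = predict(weights, [1, 1, 0], bias=-2.0)  # probability of "broken leg"
```

Here the output is a probability; training is the search for weights that make these probabilities line up with the labels.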
The process of finding W is called the learning process, or "training". For many of our predictors we can use logistic regression, where the logistic function will give us a probability that the patient has the condition. During training we use a cost function, which is an error function that compares the predicted values with the labels and outputs the amount of aggregated error over the whole dataset. The objective of "training" then is to find a W that minimizes the aggregate error. If the error is small, then those weights are considered to be "good" weights and we can use them to predict whether our patients have broken legs. There are other factors, like regularization terms, that go into a cost function when determining the "goodness" of W. But minimizing the cost function for aggregate error is a good starting point.
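A sketch of one common cost function for logistic regression, cross-entropy (log-loss), averaged over the dataset. The predictions and labels below are invented to show the behavior: confident correct predictions give a low cost, confident wrong ones a high cost.

```python
import math

def cross_entropy_cost(predictions, labels):
    """Average log-loss over the dataset: smaller means a better fit."""
    total = 0.0
    for p, y in zip(predictions, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# Confident and correct -> low cost; confident and wrong -> high cost.
good = cross_entropy_cost([0.9, 0.1], [1, 0])
bad = cross_entropy_cost([0.1, 0.9], [1, 0])
```

Training then amounts to searching for the W that drives this number down.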
Fitting a Model
So we’ve covered the basics of what goes into creating a predictive model. Now let’s look at what our functions do to data to classify something. On the right you can see an example of a trained linear classifier on a simple Xi, Xj plane, which separates the data into two categories of blue and green. Our function here, represented by the straight line, does a perfect job of separating the green and blue classes. If the training set is truly representative then this would be a perfect classifier. In real life your data is not going to line up for you like this, but the example shows what we expect our function to do: separate our data points in space into separate classes.
For our next example we show a classifier that is a bit more realistic, because the data doesn’t line up neatly. The function on the left is not linear and clearly not perfect. If the Red class is the positive class, you can see that there are Green false positives on the Red side of the line and some Red false negatives on the Green side. But by and large the function does a good job of separating the classes. Even in this simple case we see that we may not be able to fully separate the two classes; this represents the error in our classifier and is very common in machine learning.
It is the process of training that creates a function that can separate our data into the classes we desire for our classifier. There are different approaches and techniques for actually training a model, using optimization algorithms like gradient descent, but we’ll save these for a future blog post. Let’s assume that we have data points, features and an appropriate cost function, and that we have trained and minimized the error; can we be sure the weights we get from training will result in good predictions when we apply them in prediction tasks on new data? In other words, if a new patient comes in, can we extract the features, apply our logistic function with our best weights and rely on the results?
The answer is: maybe "yes" and maybe "no". We won’t know unless we’ve held out some test data points, but if the answer is "no" then there is a culprit we always check for early on: "overfitting".
During the process of minimizing our cost function, we try to come up with the best weights to "fit" the training data. Two things can go wrong for us during this process. One is that we might have too few data points for the number of features (also referred to as dimensions) in the model. This is always bad. The second common cause of overfitting is noise in our data set. In either case, instead of achieving a smooth curve like the black line in the illustration at right, we end up with something that looks like the green line, which tries to account for every data point. This kind of function, which looks like a congressional district after heavy gerrymandering, is going to be much more susceptible to error when new, unseen data points are run through our predictor. The issue is that our model has learned the training data too precisely: either there are too few points to balance out our function, or the model has learned the noise in our training data. When we get new data points we’ve not seen, they are likely not to fit the model.
Solving for Overfitting
The question then is how do we fix a model that is performing poorly due to overfitting? If we’re lucky, we can play with the "learning parameters": we might be able to stop or modify the learning process before it overfits, or use cross validation methods (which might not be feasible if our problem is too few data points). However, these often do not fix our problem. In that case we use one of two common approaches, feature selection and regularization, which will be the topics of our following blog posts. The objective of feature selection is to reduce the dimensionality of the model by removing features which are likely to be contributing to the overfitting. Removing dimensions then improves the power of the training data we do have. The idea behind regularization is to constrain the space of the functions we are learning to make them more "smooth", with the intuition being that a smooth function is less likely to overfit the data. In the previous diagram, the black curve is smooth and the green curve is not. We want "regular" functions, not "gerrymandered" ones (excuse the political analogy), and regularization is a way to achieve this.
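One common way to express regularization is to add a penalty on large weights directly to the cost function. Here is a minimal sketch of an L2 penalty on top of the log-loss from earlier; the predictions, labels, weights, and the strength parameter `lam` are all invented for illustration.

```python
import math

def regularized_cost(predictions, labels, weights, lam):
    """Average log-loss plus an L2 penalty: large weights raise the cost,
    nudging training toward 'smoother' functions."""
    data_cost = sum(
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(predictions, labels)
    ) / len(labels)
    penalty = lam * sum(w * w for w in weights)
    return data_cost + penalty

preds, labels = [0.9, 0.2], [1, 0]
small_w = regularized_cost(preds, labels, [0.5, 0.5], lam=0.1)
large_w = regularized_cost(preds, labels, [5.0, 5.0], lam=0.1)
```

With the same fit to the data, the weight vector with large entries now costs more, so the optimizer has an incentive to prefer the smaller, "smoother" one.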
So, a closing point for this post. We’ve just discussed the problems inherent in fitting training data to a model and the special challenge of solving for overfitting. There’s a reason this is important to us at Apixio. In our business we deal with healthcare data, and while one might assume that with the prevalence of health record systems these days there’s plenty of data to go around, this simply is not the case. Data in the healthcare milieu is actually quite hard to come by; it is very noisy, and there are literally thousands of conditions and procedures, many of which are rare, so there are no large datasets to use for training. Peter Norvig at Google coined the phrase "the unreasonable effectiveness of data", the point being that if you have a lot of it, your job gets a lot easier. In healthcare we currently do not have this luxury, so our models often do not generalize well in production. Our job is to find ways to overcome these issues. Our next post will focus more closely on using regularization to solve overfitting, namely by addressing too-small training sets and noisy data.