With machine learning, as with many subjects, the most important step in solving any problem is the first one: defining and understanding what that problem is. All of the fancy statistics and algorithms in the world will not achieve a goal if the goal is not known. So, we like to ask the question, “How is this going to be used?” The question is fitting because subtle differences in how our results are applied in practice can significantly affect the choices made in the modeling process.

Perhaps the most significant of these choices is how we evaluate how well our models work. Of course, it would be nice to just say that we are looking for the most accurate version of the model and leave it at that. But data science is not magic, and models are almost never entirely perfect: they will make errors. Thus, the manner in which those errors are measured can be quite important, so that we can work to minimize the errors we most care about.

**Classification Measurements**

Classification is one of the most common kinds of machine learning problems. Any time you are looking to answer which group a particular case or example falls into, that's a classification question. These can be binary, as in, “Is this statement true or false?” or have multiple options, as in, “Which category does this fall into?”

*Binary Outcomes*

The simplest way of measuring classification error is called Accuracy; its complement, the share of predictions that are wrong, is called 0/1 Loss. You take the category the model predicted for each example, compare it to the category the example actually belongs to, and see what percentage of the time the model is right. The concept is very simple, and the metric is used often.
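As a minimal sketch of the idea, the snippet below computes Accuracy and 0/1 Loss in plain Python. The function name `accuracy` and the example labels are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Hypothetical labels: 1 = spam, 0 = not spam.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

acc = accuracy(y_true, y_pred)   # 3 of the 5 predictions are correct
zero_one_loss = 1 - acc          # share of predictions that are wrong
print(acc, zero_one_loss)
```

Note that every mistake counts the same here, regardless of which kind of mistake it is.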

A slight variation on this is known as Cost Matrix Minimization. The only difference is that you care about what kind of error is being made and weight each kind accordingly. For example, if the model is trying to determine whether an email is spam, the cost of blocking a real email is very often going to be higher than the cost of letting a scam email through. Sometimes it might go the other way, of course, and the exact amounts are going to depend on who is receiving the emails. Yet, if we know what those costs are, or can even approximate them, we can often do better than the 0/1 case. If blocking a real email is three times as costly as letting a fake one through, then we will want to raise the model's threshold so that it is more confident that something is spam before it actually takes the action of blocking it.
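To make the threshold shift concrete, here is a rough sketch under two assumptions that are not from the text above: the model outputs calibrated spam probabilities, and correct decisions cost nothing. Under those assumptions, blocking has the lower expected cost when the probability `p` exceeds `cost_false_positive / (cost_false_positive + cost_false_negative)`. The function names and example data are hypothetical.

```python
def cost_threshold(cost_false_positive, cost_false_negative):
    """Probability above which blocking has the lower expected cost.

    Assumes calibrated probabilities and zero cost for correct decisions.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def total_cost(y_true, p_spam, threshold, cost_false_positive, cost_false_negative):
    """Sum the cost of every mistake made at a given threshold."""
    cost = 0.0
    for is_spam, p in zip(y_true, p_spam):
        blocked = p > threshold
        if blocked and not is_spam:
            cost += cost_false_positive    # a real email was blocked
        elif not blocked and is_spam:
            cost += cost_false_negative    # a spam email got through
    return cost


# Blocking a real email is assumed to cost 3x letting spam through.
threshold = cost_threshold(cost_false_positive=3, cost_false_negative=1)

# Hypothetical data: 1 = spam, 0 = real email.
y_true = [0, 0, 1, 1, 1]
p_spam = [0.6, 0.2, 0.7, 0.9, 0.4]

cost_default = total_cost(y_true, p_spam, 0.5, 3, 1)
cost_shifted = total_cost(y_true, p_spam, threshold, 3, 1)
print(threshold, cost_default, cost_shifted)
```

On this toy data the shifted threshold lets one more spam email through but avoids blocking the real one, so the total cost drops.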

*Ensuring Proper Order*

However, there are more cases than those just described. One of the most common reasons to build these kinds of models is to make the best use of limited resources. Consider an example where there are a limited number of operators (people, equipment) and a large number of cases for them to handle. Here, we care more about the rank, or relative order, of how likely it is that an operator can help on each case than we do about the exact likelihoods themselves. Therefore, we would want to use a metric that measures how well the order of the model’s outputs matches the real ordering of the examples. A metric like Area Under the ROC Curve would work well for this.
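One way to see why Area Under the ROC Curve is a ranking metric is its pairwise interpretation: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by brute force. The function name `roc_auc` and the scores are illustrative, and the quadratic loop is only meant for small inputs.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5    # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical cases: 1 = the operator could help, 0 = could not.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]

auc = roc_auc(y_true, scores)

# Squaring the scores changes their values but not their order,
# so the AUC stays the same.
auc_rescaled = roc_auc(y_true, [s ** 2 for s in scores])
print(auc, auc_rescaled)
```

Because only the order matters, a model can score well on this metric even if its outputs are poor estimates of the actual probabilities.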

*Probability Calibration*

Conversely, there are other situations where the amount of resources is not necessarily limited, but the resources are expensive. In these cases, we want to know in which instances it will be worth spending on them. We do not care as much about ranking here; we care more about knowing the correct probabilities precisely. Here, then, we would want to use a measure that gets at how well-calibrated the predictions are. For example, when a well-calibrated model predicts 30%, the event actually happens pretty close to 30% of the time. A poorly calibrated model's stated probabilities will be far from the observed frequencies.
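A simple way to check calibration is to group predictions into probability bins and compare the average prediction in each bin with how often the event actually occurred. The sketch below does this with equal-width bins. The helper name `calibration_table`, the bin count, and the data are assumptions made for illustration.

```python
def calibration_table(y_true, probs, n_bins=5):
    """Return (mean predicted probability, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for outcome, p in zip(y_true, probs):
        index = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[index].append((p, outcome))
    rows = []
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(outcome for _, outcome in members) / len(members)
            rows.append((mean_pred, observed, len(members)))
    return rows


# Hypothetical model output: ten predictions of 30% and five of 80%.
probs = [0.3] * 10 + [0.8] * 5
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 0]

rows = calibration_table(y_true, probs)
for mean_pred, observed, count in rows:
    print(f"predicted {mean_pred:.2f}, observed {observed:.2f}, n={count}")
```

In this toy example the predicted and observed values line up in both bins, which is what good calibration looks like; large gaps between the two columns would indicate poor calibration.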

Things can get even more complicated here, as there are additional knobs to turn. In cases where we want to be very risk averse (high penalty for being confident and wrong), we might use a metric like Log Loss. In cases where that is less important and we just want to get as close as we can to correct without being too worried about risk, we might use a metric like Brier Score.
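The difference between the two metrics shows up most clearly on a single confident mistake. The following sketch implements both using their standard definitions; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero, and the example probabilities are made up.

```python
import math


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of the true binary labels."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)    # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


# The event happened (label 1). Compare a hesitant wrong prediction (0.4)
# with a confident wrong prediction (0.01).
hesitant_ll, confident_ll = log_loss([1], [0.4]), log_loss([1], [0.01])
hesitant_bs, confident_bs = brier_score([1], [0.4]), brier_score([1], [0.01])

print("log loss:", hesitant_ll, confident_ll)
print("brier:   ", hesitant_bs, confident_bs)
```

Log Loss grows without bound as a wrong prediction approaches certainty, while the Brier Score can never exceed 1 for a single prediction, which is why the former suits the risk-averse case.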

Finally, we might want to put additional weight on some examples more than others because they may be more important, more expensive or closer to an area where we would switch what action we take. All of these are important things to know ahead of time.
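As one possible illustration of example weighting, the sketch below computes a weighted version of accuracy, where each example contributes in proportion to a weight. The function name and the weights are hypothetical; the same idea carries over to the other metrics.

```python
def weighted_accuracy(y_true, y_pred, weights):
    """Accuracy where each example counts in proportion to its weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / total


y_true = [1, 0, 1, 0]
y_pred = [1, 0, 0, 0]

# The third example is assumed to matter five times as much as the others.
weights = [1, 1, 5, 1]

unweighted = weighted_accuracy(y_true, y_pred, [1, 1, 1, 1])
weighted = weighted_accuracy(y_true, y_pred, weights)
print(unweighted, weighted)
```

The single mistake falls on the heavily weighted example, so the weighted score is much lower than the unweighted one.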

**Regression Measurements**

The other big area of machine learning that we regularly deal with is Regression. Regression comes into play whenever what we are trying to predict is a continuous, real-valued quantity. If we want to know how many people are going to be at X, when Y will happen, or what the cost of Z is, those are all going to be Regression problems. Just as with Classification, there are many different options for measuring accuracy.

*Average Distance*

The most basic way of measuring the error here is to take every prediction and figure out how far away it was from the actual value by simple subtraction. After taking the absolute value (to make sure that large negative and large positive errors do not cancel each other out), you average all of these errors and report that as your metric. This is called the Mean Absolute Error. A more common variation penalizes large errors somewhat more than small ones. Instead of taking the absolute value, the errors are squared before averaging. This is Mean Squared Error. Usually the square root is taken after averaging, bringing the result back to the original units of the data; in this case, we're dealing with Root Mean Squared Error. This approach can be extended to penalize large errors even more heavily by raising the error to a higher power, but this is rarely done.
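The three metrics described above can be sketched in a few lines of plain Python. The function names and the numbers are illustrative only; note how the single large miss affects the squared metrics more than the absolute one.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, in the original units."""
    return math.sqrt(mean_squared_error(y_true, y_pred))


# Hypothetical values: three small errors and one miss of 10.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 30, 50]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, mse, rmse)
```

Here the Root Mean Squared Error comes out larger than the Mean Absolute Error because the one large miss dominates once the errors are squared.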

*Relative Scale*

Alternatives to these methods exist as well. Maybe we are dealing with something where the absolute size of the difference is less important than its size relative to the real value. In these cases, we might work with Percent Error, so that being off by 10 when the real value is in the millions counts for less than being off by 5 when the real value is 10.
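The comparison in the paragraph above can be worked through directly. This is a minimal sketch of an absolute Percent Error for a single prediction; the function name is made up, and it assumes the real value is never zero, since the metric is undefined there.

```python
def percent_error(actual, predicted):
    """Absolute error as a percentage of the actual value."""
    if actual == 0:
        raise ValueError("percent error is undefined when the actual value is 0")
    return abs(predicted - actual) / abs(actual) * 100


# Off by 10 when the real value is one million.
small_relative = percent_error(1_000_000, 1_000_010)

# Off by 5 when the real value is 10.
large_relative = percent_error(10, 15)

print(small_relative, large_relative)
```

Averaging this quantity over all predictions gives what is commonly called the Mean Absolute Percentage Error.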

Finally, in some cases, what we really care about is that the prediction is on the right order of magnitude. In other words, we care less about getting the numbers exactly right and more about knowing whether to expect a number in the 10s versus the thousands versus the millions. In these cases, we will usually want to transform the raw values by taking their logarithm and then use one of the above metrics on the log-transformed data. This is effectively the same as looking at the multiplicative error; if we divide our prediction by the real answer (or vice versa), how close are we to getting the ideal ratio of 1?
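To illustrate the log-transform idea, the sketch below measures the error of a single prediction as the absolute difference of base-10 logarithms, which is the same as the absolute log of the ratio between prediction and real value. The choice of base 10 (so the result reads as orders of magnitude) and the function name are assumptions; the approach only works for strictly positive values.

```python
import math


def log10_error(actual, predicted):
    """Absolute error in orders of magnitude: |log10(predicted / actual)|."""
    if actual <= 0 or predicted <= 0:
        raise ValueError("log-scale error requires strictly positive values")
    return abs(math.log10(predicted / actual))


# Being off by a factor of 2 gives the same error at any scale.
small_scale = log10_error(10, 20)
large_scale = log10_error(1_000_000, 2_000_000)

# Predicting thousands when the real value is in the millions
# is an error of three orders of magnitude.
wrong_magnitude = log10_error(1_000_000, 1_000)

print(small_scale, large_scale, wrong_magnitude)
```

A perfect prediction has a ratio of 1 and therefore an error of 0, matching the "ideal ratio of 1" described above.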

These are the most common types of measurements and errors we look at when dealing with machine learning, but this is not an exhaustive list. The important point is that it really does make a difference to know how the data is getting used because that allows us to tailor the right measurement to the right problem. It is only by doing this that we can make models that are not only accurate, but are also successful in achieving their desired goals.