
After implementing a machine learning algorithm, the next step is to find out how effective the model is, based on the chosen metric and data sets. Different performance metrics are used to evaluate different machine learning algorithms. In this post, we will cover the different types of evaluation metrics available for classification and regression.

Confusion matrix

Precision

Recall

F1 score

Accuracy

ROC Curve - AUC

Log Loss

MSE

MAE

Cross-Entropy Loss

Hinge Loss

Confusion Matrix:

The confusion matrix, as the name suggests, gives us an N x N matrix as output, where N is the number of classes being predicted. It is also known as an error matrix. This metric is used for finding the correctness and accuracy of the model, and it works well even for imbalanced data sets. The confusion matrix is a table with two dimensions (“Actual” and “Predicted”) and the set of classes along both dimensions. Most performance measures are computed from the confusion matrix.

Confusion matrix for a binary classification problem:

The confusion matrix in itself is not a performance measure as such, but almost all of the performance metrics are based on the confusion matrix and the numbers inside it.

True Positives: The cases in which we predicted YES and the actual output was also YES.

True Negatives: The cases in which we predicted NO and the actual output was NO.

False Positives: The cases in which we predicted YES and the actual output was NO.

False Negatives: The cases in which we predicted NO and the actual output was YES.
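
All four counts can be read directly off a confusion matrix. A minimal sketch using scikit-learn, where the label arrays are made-up examples:

```python
from sklearn.metrics import confusion_matrix

# Made-up actual labels and model predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```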

When to minimize what?

  1. We know that there will be some error associated with every model that we use for predicting the true class of the target variable. This results in False Positives and False Negatives (i.e., the model classifying things incorrectly compared to the actual class).
  2. There’s no hard rule that says what should be minimized in all situations. It purely depends on the business needs and the context of the problem you are trying to solve. Based on that, we might want to minimize either False Positives or False Negatives.

Minimizing False Positives:

For a better understanding of False Positives, let’s use a different example, where the model classifies whether an email is spam or not. Say you are expecting an important email, like hearing back from a recruiter or awaiting an admission letter from a university. Let’s assign labels to the target variable: 1: “Email is spam” and 0: “Email is not spam”. Suppose the model classifies the important email that you are desperately waiting for as spam (a case of a False Positive). Now, this situation is much worse than classifying a spam email as not spam, since in that case we can still go ahead and manually delete it, and it is not a pain if it happens once in a while. So in the case of spam email classification, minimizing False Positives is more important than minimizing False Negatives.

Precision:

Precision is defined as the number of true positives divided by the number of true positives plus the number of false positives. It tells us what proportion of the cases predicted as positive are actually positive.

Recall:

Recall is defined as the number of true positives divided by the number of true positives plus the number of false negatives. It tells us what proportion of the actual positives in the data set were correctly predicted as positive.
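
As a rough sketch, both metrics can be computed with scikit-learn; the label arrays below are made up for illustration:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predicted labels

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```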

High recall, low precision: Indicates that most of the positive examples are correctly recognized (low FN) but there are a lot of false positives.

Low recall, high precision: Indicates that we miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP).

When to use Precision and when to use Recall?

  1. It is clear that recall gives us information about a classifier’s performance with respect to false negatives (how many did we miss), while precision gives us information about its performance with respect to false positives (how many of the cases we flagged were wrong).
  2. Precision is about being precise. So even if we managed to capture only one cancer case, and we captured it correctly, then we are 100% precise.
  3. Recall is not so much about capturing cases correctly as it is about capturing all cases that actually have “cancer”. So if we simply label every case as “cancer”, we have 100% recall.
  4. So basically, if we want to focus more on minimizing False Negatives, we would want our recall to be as close to 100% as possible without precision being too bad, and if we want to focus on minimizing False Positives, then our focus should be to make precision as close to 100% as possible.

F1 Score:

The F1 score combines precision and recall relative to a specific positive class. It conveys the balance between precision and recall, and is especially useful when there is an uneven class distribution. The F1 score reaches its best value at 1 and its worst at 0.
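
The F1 score is the harmonic mean of precision and recall. A minimal sketch with scikit-learn, using made-up labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# F1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 score: {f1_score(y_true, y_pred):.2f}")
```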

Accuracy:

We use accuracy in classification problems, and it is the most common evaluation metric. Accuracy is defined as the ratio of the number of correct predictions made by the model to the total number of predictions made.

Disadvantage of accuracy:

One of the biggest disadvantages of accuracy is that it doesn’t work well when we have an imbalanced data set. It works well only if there are roughly equal numbers of samples belonging to each class.

Example: Consider that there are 98% samples of class A and 2% of class B in our training set. Then our model can easily get 98% training accuracy by simply predicting every training sample as belonging to class A. When the same model is tested on a test set with 60% samples of class A and 40% samples of class B, the test accuracy drops to 60%. Classification accuracy looks great here but gives us a false sense of having achieved a good model.
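
A minimal sketch of this failure mode, using a made-up imbalanced label set and a model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score

y_true = [0] * 98 + [1] * 2   # 98% class A (0), 2% class B (1)
y_pred = [0] * 100            # always predict the majority class

# Prints 0.98 even though the model never detects class B.
print(accuracy_score(y_true, y_pred))
```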

The real problem arises when the cost of misclassifying the minority class samples is very high. If we are dealing with a rare but fatal disease, the cost of failing to diagnose a sick person is much higher than the cost of sending a healthy person for more tests; the same logic applies to fraud detection.

ROC Curve - AUC:

When we need to check or visualize the performance of a classification model, we use the AUC (Area Under the Curve) ROC (Receiver Operating Characteristic) curve. It is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristic).

The AUC-ROC curve is plotted with the TPR against the FPR, where the TPR is on the y-axis and the FPR is on the x-axis.

TPR / Recall / Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

FPR = 1 - Specificity = FP / (TN + FP)
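
These quantities follow directly from the confusion-matrix counts. A minimal sketch with made-up counts:

```python
# Made-up confusion-matrix counts for illustration.
tp, tn, fp, fn = 40, 45, 5, 10

tpr = tp / (tp + fn)            # sensitivity / recall
specificity = tn / (tn + fp)
fpr = fp / (tn + fp)            # same as 1 - specificity
print(tpr, specificity, fpr)
```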

What is AUC-ROC Curve?

  1. The AUC-ROC curve is a performance measurement for classification problems at various threshold settings.
  2. ROC is a probability curve and AUC represents the degree or measure of separability.
  3. The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two things (e.g., whether a patient has a disease or not).
  4. Better models can accurately distinguish between the two. Whereas, a poor model will have difficulties in distinguishing between the two.
  5. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without it.
  6. For example, a model that does quite a good job of distinguishing the positive and negative values will have an AUC score close to 0.9, as the area under its ROC curve is large.

  7. Whereas, if the predicted distributions for the two classes completely overlap each other, we get an AUC score of 0.5. This means that the model is performing poorly and its predictions are almost random.
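
A minimal sketch of computing the ROC curve and AUC with scikit-learn; the labels and predicted probabilities below are made up:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.5]  # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(f"AUC: {roc_auc_score(y_true, y_score):.2f}")
```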

How to use AUC ROC curve for a multi-class model?

In the multi-class setting, we can plot N AUC-ROC curves for the N classes using the One vs. All methodology. For example, if you have three classes named X, Y, and Z, you will have one ROC for X classified against Y and Z, another ROC for Y classified against X and Z, and a third for Z classified against X and Y.
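
A minimal one-vs-all sketch with scikit-learn; the synthetic data set and logistic regression classifier are placeholders for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic three-class data set, purely for illustration.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# One-vs-rest AUC needs per-class predicted probabilities.
probs = clf.predict_proba(X)
print(roc_auc_score(y, probs, multi_class="ovr"))
```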

Loss Functions:

  1. Loss functions are a way to evaluate how well your algorithm models your dataset. If your predictions are totally off, the loss function will output a higher number. If they’re pretty good, it’ll output a lower number. As you change pieces of your algorithm to try to improve your model, the loss function tells you whether you are getting anywhere.
  2. Gradually, with the help of an optimization function like Gradient Descent, the model learns to reduce the prediction error.
  3. There’s no one-size-fits-all loss function for machine learning algorithms. There are various factors involved in choosing a loss function for a specific problem, such as the type of machine learning algorithm chosen, the ease of calculating the derivatives, and, to some degree, the percentage of outliers in the data set.
  4. We can classify loss functions into two major categories depending upon the type of learning task we are dealing with: Regression losses and Classification losses.
  5. In classification, we are trying to predict the output from a set of finite categorical values.
  6. Regression, on the other hand, deals with predicting a continuous value.
Regression losses:

MSE / Quadratic loss / L2 loss

MAE / L1 loss

Mean bias error

Classification losses:

Hinge loss

Cross-entropy loss / Log loss

Likelihood loss

MSE / Quadratic loss / L2 loss:

Mean Squared Error, or MSE loss is the default loss to use for regression problems. Mathematically, it is the preferred loss function under the inference framework of maximum likelihood if the distribution of the target variable is Gaussian. It is the loss function to be evaluated first and only changed if you have a good reason.

MSE is calculated as the average of the squared differences between the predicted and actual values. The result is always positive regardless of the sign of the predicted and actual values, and a perfect value is 0.0. The squaring means that larger mistakes result in disproportionately more error than smaller mistakes, so the model is punished more heavily for making larger mistakes.
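
A minimal sketch of MSE with NumPy, using made-up predicted and actual values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)  # average of squared differences
print(mse)
```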

Mean Absolute Error / L1 Loss:

On some regression problems, the distribution of the target variable may be mostly Gaussian but may have many outliers, e.g. large or small values far from the mean value.

MAE, on the other hand, is measured as the average of the absolute differences between predictions and actual observations. Like MSE, it measures the magnitude of the error without considering its direction. Unlike MSE, MAE needs more complicated tools, such as linear programming, to compute the gradients. MAE is more robust to outliers since it does not make use of the square.
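
A minimal sketch of MAE with NumPy, using the same kind of made-up values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))  # average of absolute differences
print(mae)
```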

Cross-Entropy Loss (Binary Classification):

Cross-Entropy is the default loss function to use for binary classification problems.

It is intended for use with binary classification where the target values are in the set {0, 1}. Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason. Notice that when the actual label is 1 (y(i) = 1), the second half of the function disappears, whereas when the actual label is 0 (y(i) = 0), the first half is dropped. In short, we are taking the log of the predicted probability assigned to the ground-truth class. An important aspect of this is that cross-entropy loss heavily penalizes predictions that are confident but wrong.
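
A minimal sketch of binary cross-entropy, with made-up labels and predicted probabilities:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])          # actual labels in {0, 1}
y_prob = np.array([0.9, 0.2, 0.6, 0.3])  # predicted probability of class 1

# Loss = -(y * log(p) + (1 - y) * log(1 - p)), averaged over samples.
# When y = 1 the second term vanishes; when y = 0 the first term vanishes.
loss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(loss)
```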

Hinge Loss (Binary Classification):

An alternative to cross-entropy for binary classification problems is the hinge loss function, primarily developed for use with support vector machine (SVM) models. It is intended for use with binary classification where the target values are in the set {-1, 1}.

The hinge loss function encourages examples to have the correct sign, assigning more error when there is a difference in the sign between the actual and predicted class values. Reports of performance with the hinge loss are mixed, sometimes resulting in better performance than cross-entropy on binary classification problems.
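
A minimal sketch of hinge loss, with made-up labels in {-1, 1} and raw model scores:

```python
import numpy as np

y_true = np.array([1, -1, 1, -1])
scores = np.array([0.8, -0.5, -0.2, 0.3])  # raw (unthresholded) model outputs

# Loss = max(0, 1 - y * score), averaged over samples; correct-sign
# predictions beyond the margin contribute zero loss.
loss = np.mean(np.maximum(0, 1 - y_true * scores))
print(loss)
```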
