In the world of AI and machine learning, evaluating how well a model performs is just as important as building it. One of the most foundational tools for this evaluation is the confusion matrix. Despite its name, it's designed to bring clarity — not confusion. A confusion matrix breaks down the performance of a classification model into four key categories: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). It gives a detailed insight into not just how many predictions were correct, but what kind of errors the model is making. Understanding this matrix is critical before diving into metrics like accuracy, precision, recall, or F1-score. Whether you're detecting spam emails, diagnosing diseases, or building AI for self-driving cars, the confusion matrix helps you understand where your model is getting confused — and how to make it smarter.
Let’s start with a small “accuracy” story. Martha and Bane were given a classification task: a binary (two-class) problem in which their algorithm had to identify cats as cats and dogs as dogs. Both worked through it and came back with results. Their senior manager asked them what their metrics were. Martha said, “I think I approached the problem correctly, but I’m getting an accuracy of around 0%.” Hearing this, Bane was relieved and quickly told the manager that he had an accuracy of 50%. Yet the manager asked Martha to bring in her work, not Bane, despite his higher accuracy. Bane was confused. What happened here??
In binary classification, 50% accuracy is essentially a coin toss. Given a cat image, the model will say “cat” five times out of ten and “dog” the other five. In other words, the model has learned nothing and is picking randomly between cat and dog for every image.
On the other hand, what does 0% accuracy mean in binary classification? Every time a dog is given, the model predicts cat; every time a cat is given, the model predicts dog. The manager, with enough experience, understood this: a tiny adjustment to Martha’s work would make it close to 100% accurate. Simply swapping the class labels would fix it; perhaps she had wired up the input classes the wrong way round.
It’s not about chasing the higher number; understanding what the number means matters more. Between a 75% accurate model and a 90% accurate model, we go with 90%. But between 25% and 50%, we go with 25%, because flipping its predictions gives us 75%. In that sense, a 25% model and a 75% model carry pretty much the same information.
So all this time we were focusing on a single metric called accuracy. But does accuracy alone tell us everything about a model’s capability??
Let’s get back to Martha. She was given a new job: again binary classification, but this time with a serious class-imbalance problem. She had to detect cancer versus non-cancer from images. Her test set contained 90 non-cancer images and 10 cancer images. She ran her model on the test set, and a wonderful 90% accuracy came out.
90% is a great way to open. She ran to the senior manager, who gave her another test set with 85 non-cancer images and 15 cancer images (100 in total). She ran it and the accuracy was 85%. Martha said: “Sir, the result holds; it’s still at 85%, which is good.” Then the manager gave her another set with only 10 non-cancer and 90 cancer images, and her model’s accuracy suddenly dropped to 10%. What could have happened here??
She had a highly biased model that predicted every image as non-cancer. In every scenario it classified all 100 images as non-cancer. In the first case of 90 non-cancer and 10 cancer, everything was predicted as non-cancer: 90 predictions were correct, and the 10 cancer images were misclassified as non-cancer as well. Yet the accuracy reads 90%. It’s a bummer. On a balanced set of 50 and 50, the same model’s accuracy would drop to 50%.
So it’s now very clear that accuracy alone cannot decide the quality of a model in most scenarios. But there are several other metrics that give us much better insight into model performance in various scenarios, and we will take a good look at them here.
Contents:
- Explaining positives and negatives
- Accuracy
- Precision or PPV (Positive predictive value)
- Recall or Sensitivity or TPR (True positive rate)
- Specificity or Selectivity or TNR (True negative rate)
- FNR (False negative rate)
- FPR (False positive rate)
Explaining Positives and Negatives
Before diving into the various metrics derived from the confusion matrix, let’s first understand the basic terms. All of the metrics are defined using these terms, so understanding them in depth is essential.
The first part of each term tells us whether the model was right (True) or wrong (False), and the second part tells us which class the model predicted (positive or negative). True means the model is correct and false means the model is wrong. Keeping this in mind, a false positive means the model’s positive prediction was false: it predicted positive when the actual class was negative. When all of the basic values are explained the same way, they look like this:
True Positives (TP): The model said positive and the real value was also positive. True means the model is correct, and the class it predicted was positive.
True Negatives (TN): These are cases where the model correctly predicts the negative class. True — the model is correct. What was the class? Negative.
False Positives (FP): Naaah!! The model did a bad job here. False means the model is wrong. But how is it wrong? It predicted positive when it should have been negative.
False Negatives (FN): Yet again the model lost it, but this time in the other direction. False means the model is wrong: it predicted negative when the actual class was positive.
If the above terminology is clear, you are good to proceed further. It is the basis for everything that follows.
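To make this concrete, here is a minimal sketch of how the four counts can be tallied by hand in Python. The labels below are made up for illustration, with 1 standing for the positive class and 0 for the negative class.

```python
# A minimal sketch: counting TP, TN, FP, FN for a binary problem.
# The labels here are hypothetical; 1 = positive class, 0 = negative class.

def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # predicted positive, actually positive
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # predicted negative, actually negative
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # predicted positive, actually negative
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # predicted negative, actually positive
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
```

If you already use scikit-learn, sklearn.metrics.confusion_matrix(y_true, y_pred) returns the same four counts, arranged as [[TN, FP], [FN, TP]] for 0/1 labels.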
Accuracy
Accuracy is the ratio of correctly predicted observations to the total observations. What does that mean? Of the total number of predictions, how many were predicted correctly by the model? True positives and true negatives are what the model got right.
Accuracy = (TP + TN) / Total count
or
Accuracy = (TP + TN) / (TP + TN + FP + FN)
A detailed example of accuracy, and how we should interpret it, was given at the start of this article.
Example: In Martha’s cancer detection model which was stated earlier, if she has 90 non-cancer and 10 cancer images, and the model predicts all images as non-cancer, the accuracy is 90%. However, this doesn’t reflect the model’s ability to identify cancer, making accuracy alone insufficient.
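To see the trap in code, here is a small sketch of Martha’s first test set, assuming 0 stands for non-cancer and 1 for cancer, scored against a biased model that predicts non-cancer for everything.

```python
# Martha's imbalanced test set: 90 non-cancer (0) and 10 cancer (1) images,
# scored against a biased model that always predicts non-cancer (0).

y_true = [0] * 90 + [1] * 10   # 90 non-cancer, 10 cancer
y_pred = [0] * 100             # the biased model never predicts cancer

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)   # 0
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)   # 90
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)   # 0
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)   # 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.9 -> 90% accuracy while never detecting a single cancer case
```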
Precision or Positive Predictive Value (PPV)
Precision is an important metric in classification tasks, especially in contexts where the costs of false positives are high. It is calculated by dividing the number of true positive outcomes by the sum of true positive and false positive outcomes. Essentially, precision measures the accuracy of the positive predictions made by the model.
Precision = TP / (TP + FP)
Let us explain this with a small example we all see every day: spam email detection. Our model’s purpose is to predict whether an incoming email is spam or ham. If the email is classified as spam, it is automatically moved to the spam folder, where we are unlikely to notice it anymore. So here the positive class is spam.
Precision is the number of correctly predicted positive cases divided by the total number of predicted positive cases. So if 20 emails are predicted as spam and only 15 of them were actually spam, the precision would be 15/20, that is 0.75 or 75%. The 5 emails that were wrongly sent to spam might contain very relevant information. What if one of them is a call for your job interview? With the model misclassifying it as spam, you lose that message. Once or twice spam landing in the main inbox might not hurt, but a single valuable email going into the spam folder might cost you big time. In scenarios like this the cost of a false positive is high, so precision is the metric that plays the crucial role.
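As a quick check of the numbers above (20 emails predicted as spam, 15 of them actually spam), the precision works out like this:

```python
# Precision for the spam example: 20 emails predicted as spam, 15 truly spam.

tp = 15   # predicted spam, actually spam
fp = 5    # predicted spam, actually ham (the costly mistakes)

precision = tp / (tp + fp)
print(precision)  # 0.75 -> 75% of the emails sent to the spam folder really were spam
```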
Recall or Sensitivity or True Positive Rate (TPR)
Recall measures how many of the actual positives a model correctly identifies. It’s like a detective diligently ensuring that no important clue is missed. In simple terms, recall is the proportion of true positives accurately predicted compared to all the cases that are genuinely positive. We calculate it by dividing the number of true positives by the sum of true positives and false negatives:
Recall = TP / (TP + FN)
To elaborate, a false negative means the model predicted negative for a case that was actually positive, so it should have been counted as positive. True positives plus false negatives therefore gives the total number of actual positives in the set. Given 100 actual positives, if the model predicts 90 of them as positive and 10 as negative, the recall is 90/100, that is 0.9.
Example: Breast Cancer Screening
In breast cancer screening, the primary goal is to identify as many actual cases of cancer as possible. Here’s how the concept of recall becomes crucial:
- True Positives (TP): These are the cases where the screening test correctly identifies patients who actually have breast cancer.
- False Negatives (FN): These are the cases where the screening test fails to identify breast cancer, meaning the test result is negative but the patient actually has cancer.
In this scenario, the recall metric is vital because a high recall rate means the test is successful in identifying most of the actual cases of breast cancer. A low recall rate, on the other hand, indicates that many cases are being missed by the test, which can be dangerous as it would lead to patients not receiving the necessary treatments early on.
Why is High Recall Critical in This Context?
- Patient Safety: Ensuring that nearly all patients with breast cancer are identified means early intervention, which can significantly improve treatment outcomes and survival rates.
- Reducing Risks: Missing a diagnosis of breast cancer (a false negative) can have dire consequences, far worse than misdiagnosing someone who does not have the disease (a false positive). Thus, optimizing for high recall reduces the risk of missed diagnoses.
In summary, in situations like medical diagnostics where the cost of missing an actual positive case is extremely high, aiming for a high recall rate is crucial to protect patient health and improve treatment efficacy. This approach prioritizes sensitivity over the risk of generating some false alarms. Put differently, even if the model flags a non-cancerous patient as cancerous in the initial screening, a follow-up test can confirm that the person does not have cancer. But a false negative, where the patient actually has cancer and the model says they don’t, leaves the disease untreated, and that can cost a life.
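As a small sketch, here is the recall worked out with hypothetical screening counts: say 100 patients actually have cancer, the test flags 95 of them and misses 5.

```python
# Recall for the screening example (hypothetical counts).

tp = 95   # cancer cases correctly detected
fn = 5    # cancer cases missed (the dangerous errors)

recall = tp / (tp + fn)
print(recall)  # 0.95 -> 95% of the actual cancer cases are caught by the screening
```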
Specificity or Selectivity or True Negative Rate (TNR)
Specificity, also known as the True Negative Rate (TNR), measures a model’s ability to correctly identify negative (non-event) instances. It is the ratio of true negatives (TN) to the total number of actual negatives (TN + FP), reflecting how well a test avoids false alarms. In simpler terms, it answers the question: “Of all the actual negatives, how many did the model correctly recognize as negative?”
Specificity = TN / (TN + FP)
Example: Airport Security Screening
Consider an airport security setting where the primary aim is to identify objects that are not weapons. Here’s how specificity plays a crucial role:
- True Negatives (TN): These are the instances where the security system correctly identifies items as non-weapons.
- False Positives (FP): These occur when the system mistakenly flags non-weapon items as weapons.
In this scenario, having high specificity means the security system effectively recognizes most non-threat items correctly, minimizing inconvenience and delays:
- Scenario: If there were 1,000 passengers carrying non-weapon items and the system correctly identified 950 of these, the specificity would be:
Specificity = 950/1000 = 0.95 or 95%
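Worked out as a short snippet with the counts from the scenario above:

```python
# Specificity for the airport example: 1,000 non-weapon items,
# 950 cleared correctly and 50 wrongly flagged.

tn = 950  # non-weapon items correctly cleared
fp = 50   # non-weapon items wrongly flagged as weapons

specificity = tn / (tn + fp)
print(specificity)  # 0.95 -> 95% of harmless items pass without a false alarm
```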
Importance of High Specificity in Airport Security:
- Efficiency: High specificity ensures the flow of passengers remains smooth with fewer false alarms, leading to fewer unnecessary checks and delays.
- Resource Management: By minimizing false positives, security personnel can focus their efforts on true threats, enhancing overall safety and resource allocation.
False Negative Rate (FNR)
False Negative Rate (FNR) is the proportion of actual positives that yield negative test outcomes, i.e., cases where the event is falsely declared negative. It is essentially the probability of a type II error and is calculated as the ratio of false negatives (FN) to the total actual positives (FN + TP). It complements recall, showing the flip side of the sensitivity coin.
FNR = FN / (FN + TP)
Example: Email Spam Filtering
Consider an email system designed to filter out spam messages:
- False Negatives (FN): These occur when spam emails are incorrectly marked as safe and end up in the inbox.
- True Positives (TP): These are the instances where spam emails are correctly identified and filtered out.
In this scenario, the False Negative Rate quantifies the system’s risk of letting spam slip through:
- Scenario: If 300 actual spam emails pass through the system and 30 of them are missed, the FNR would be:
FNR = 30/300 = 0.1 or 10%
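In code, with the counts from the scenario above:

```python
# FNR for the spam-filter example: 300 actual spam emails, 30 of them missed.

fn = 30    # spam wrongly delivered to the inbox
tp = 270   # spam correctly filtered out

fnr = fn / (fn + tp)
print(fnr)  # 0.1 -> 10% of all spam slips through the filter
```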
Why Minimizing FNR Matters in Spam Filtering:
- Security: A high FNR means more spam reaching users, potentially increasing the risk of phishing attacks.
- User Experience: Keeping FNR low ensures that users’ inboxes are not cluttered with unwanted emails, enhancing the overall email experience.
Metrics like specificity and FNR serve as critical indicators of a system’s performance, particularly in fields requiring high accuracy and safety standards.
False Positive Rate (FPR)
False Positive Rate (FPR) quantifies the likelihood of incorrectly predicting positive observations among all the actual negatives. It’s the ratio of false positives (FP) to the total number of actual negative cases (FP + TN). As the complement of specificity, FPR helps in understanding how often a test incorrectly flags an event when none exists.
FPR = FP / (FP + TN)
Example: Home Security Alarm System
Consider a home security alarm system designed to detect intruders:
- False Positives (FP): These occur when the alarm system mistakenly identifies a non-threat situation (like a pet moving) as an intrusion.
- True Negatives (TN): These are the instances where the system correctly identifies that there is no intruder.
Here’s how FPR plays a crucial role:
- Scenario: If there are 500 situations with no intruders and the alarm system incorrectly activates for 50 of these, the FPR would be:
FPR = 50/500 = 0.1 or 10%
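And as a final snippet, the same counts in code:

```python
# FPR for the home-alarm example: 500 intruder-free situations, 50 false alarms.

fp = 50    # false alarms (e.g. a pet movement flagged as an intrusion)
tn = 450   # quiet situations correctly ignored

fpr = fp / (fp + tn)
print(fpr)  # 0.1 -> the alarm goes off in 10% of harmless situations
```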
Importance of Minimizing FPR in Alarm Systems:
- Reduce False Alarms: High FPR means more false alarms, which can lead to unnecessary panic, police calls, and potential fines for false alarms.
- Trust in the System: Lower FPR enhances the homeowners’ trust in the alarm system, ensuring they can rely on it for actual security threats.
Understanding and managing the False Positive Rate is essential, especially in systems where the cost of a false positive is high, both in terms of operational disruption and credibility.
Conclusion
Evaluating a model’s performance requires more than just accuracy. Metrics like precision, recall, specificity, FNR, and FPR provide a comprehensive view of how well the model distinguishes between classes. By understanding and utilizing these metrics, we can better assess and improve our models, ensuring they perform effectively in real-world scenarios.
There are several other metrics that are a bit more complex. These are also worth noting down:
- F1 Score
- Informedness
- Positive likelihood ratio
- Negative likelihood ratio
- Markedness
- Threat score or Jaccard index
- Matthews correlation coefficient (MCC)
- Fowlkes–Mallows index (FM)
- Diagnostic odds ratio (DOR)
There are many more, but since I don’t want to drag the article out much longer, these will be explained in another article.
Feel free to comment on any other topic you would like to read about, or ping me directly on LinkedIn for any discussion about AI.
Linkedin : https://www.linkedin.com/in/anudev-manju-satheesh-218b71175/
Clap if you liked the content, and suggestions that help improve its quality are always welcome.👏