Accuracy alone is not enough. A model that predicts "no fraud" on every transaction would be 99.9% accurate on a dataset where 0.1% of transactions are fraud — while being completely useless. Evaluation metrics give you the full picture.
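To see the trap concretely, here is a minimal sketch of that "always predict no fraud" model (the 1,000-transaction labels below are made up purely for illustration):

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy data: 1,000 transactions, exactly 1 fraudulent (label 1), rest legitimate (label 0)
y_true = [1] + [0] * 999
y_pred = [0] * 1000          # the "model" predicts "no fraud" every time

print(accuracy_score(y_true, y_pred))  # 0.999, looks impressive
print(recall_score(y_true, y_pred))    # 0.0, it catches no fraud at all
```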
The confusion matrix
For classification problems, start with the confusion matrix — a table that breaks down predictions into four categories:
|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) ✓ | False Negative (FN) ✗ |
| Actually Negative | False Positive (FP) ✗ | True Negative (TN) ✓ |
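To make the four counts concrete, here is a minimal sketch with scikit-learn on a tiny hand-made set of labels (illustrative only). Note that for binary 0/1 labels, `confusion_matrix` orders the result as `[[TN, FP], [FN, TP]]`:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive class, 0 = negative class (illustrative only)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

# For binary 0/1 labels, scikit-learn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")  # TP=3  FP=1  FN=1  TN=3
```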
Precision and Recall
Precision — of all the cases your model flagged as positive, what fraction actually were? High precision = few false alarms.
Recall (Sensitivity) — of all the actual positives, what fraction did your model catch? High recall = few misses.
A cancer screening test should have very high recall — missing a cancer is catastrophic. A spam filter should have high precision — you don't want real emails deleted. The right balance depends on the cost of each type of error in your specific context.
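Both metrics fall straight out of the confusion-matrix counts. A minimal sketch, reusing the same toy labels as above:

```python
from sklearn.metrics import precision_score, recall_score

# Same toy labels as the confusion-matrix sketch (illustrative only)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

# Precision = TP / (TP + FP); Recall = TP / (TP + FN)
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```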
F1 Score
The F1 score is the harmonic mean of precision and recall — a single number that balances both. Useful when you need one metric but care about both precision and recall equally.
```python
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
# Shows precision, recall, F1 for each class
```
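As a sanity check on the definition, the same number can be computed by hand from precision and recall. A minimal sketch using the toy labels from earlier:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Same toy labels as before (illustrative only)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# F1 is the harmonic mean: 2 * P * R / (P + R)
print(2 * p * r / (p + r))       # 0.75
print(f1_score(y_true, y_pred))  # 0.75, same value
```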
AUC-ROC
The AUC (Area Under the ROC Curve) measures how well a classifier separates the positive and negative classes across all possible thresholds. An AUC of 1.0 means perfect separation; 0.5 means random guessing. Because it evaluates the model's ranking rather than a single threshold, it is a useful summary on imbalanced datasets, though when positives are very rare the precision-recall curve is often more informative.
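A minimal sketch, assuming the model outputs a probability or score for each example (AUC is computed from scores, not from hard 0/1 predictions). The scores below are made up for illustration:

```python
from sklearn.metrics import roc_auc_score

# Toy data: true labels and the model's predicted scores (illustrative only)
y_true   = [1, 1, 0, 0, 1, 0, 0, 1]
y_scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1, 0.3, 0.7]

# AUC uses the ranking of scores across all thresholds
print(roc_auc_score(y_true, y_scores))  # 0.9375 (1.0 = perfect, 0.5 = random)
```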
Regression metrics
For regression (predicting numbers), different metrics apply (a short sketch computing all three follows the list):
- MAE (Mean Absolute Error) — average absolute difference between predictions and actual values. Easy to interpret.
- RMSE (Root Mean Squared Error) — like MAE but penalises large errors more heavily.
- R² (R-squared) — how much of the variance in the data your model explains. 1.0 = perfect, 0 = no better than predicting the mean (and it can go negative for a model that does worse than the mean).
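A minimal sketch computing all three with scikit-learn, on a tiny set of made-up target values (illustrative only); RMSE is taken as the square root of the MSE so it works across scikit-learn versions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy regression targets and predictions (illustrative only)
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

print(mean_absolute_error(y_true, y_pred))          # MAE  = 0.5
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE ≈ 0.61
print(r2_score(y_true, y_pred))                     # R²   ≈ 0.88
```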
Key takeaways
- Accuracy alone is misleading on imbalanced datasets — always look at precision and recall
- Precision: of flagged positives, how many were correct? Recall: of actual positives, how many were found?
- The precision/recall trade-off depends on the cost of false positives vs false negatives
- F1 balances precision and recall into one number
- For regression: MAE, RMSE, and R² are the standard metrics