Accuracy alone is not enough. A model that predicts "no fraud" on every transaction would be 99.9% accurate on a dataset where 0.1% of transactions are fraud — while being completely useless. Evaluation metrics give you the full picture.

The confusion matrix

For classification problems, start with the confusion matrix — a table that breaks down predictions into four categories:

|                   | Predicted Positive    | Predicted Negative    |
|-------------------|-----------------------|-----------------------|
| Actually Positive | True Positive (TP) ✓  | False Negative (FN) ✗ |
| Actually Negative | False Positive (FP) ✗ | True Negative (TN) ✓  |
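The four cells can be pulled straight out of scikit-learn; a minimal sketch with toy labels (note the orientation: rows are actual classes, columns are predicted):

```python
# Computing the confusion matrix with scikit-learn (toy data).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# With labels=[0, 1] the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```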

Precision and Recall

Precision — of all the cases your model flagged as positive, what fraction actually were positive? Formally, TP / (TP + FP). High precision = few false alarms.

Recall (Sensitivity) — of all the actual positives, what fraction did your model catch? Formally, TP / (TP + FN). High recall = few misses.
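Both metrics are one call each in scikit-learn; a minimal sketch on the same toy labels (TP=3, FP=1, FN=1):

```python
# Precision and recall with scikit-learn (toy data).
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Precision = TP / (TP + FP): of the flagged positives, how many were right?
precision = precision_score(y_true, y_pred)   # 3 / (3 + 1) = 0.75
# Recall = TP / (TP + FN): of the actual positives, how many were caught?
recall = recall_score(y_true, y_pred)         # 3 / (3 + 1) = 0.75
```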

The precision/recall trade-off

A cancer screening test should have very high recall — missing a cancer is catastrophic. A spam filter should have high precision — you don't want real emails deleted. The right balance depends on the cost of each type of error in your specific context.
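The trade-off is concrete when you remember that most classifiers output a score or probability, and the 0/1 prediction comes from a threshold. A sketch with made-up scores standing in for a model's predict_proba output — raising the threshold trades recall for precision:

```python
# The precision/recall trade-off: sweeping the decision threshold.
# Toy scores stand in for a real model's predicted probabilities.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    results[threshold] = (p, r)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# A low threshold catches every positive (high recall, more false alarms);
# a high threshold flags only sure bets (high precision, more misses).
```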

F1 Score

The F1 score is the harmonic mean of precision and recall — a single number that balances both. Useful when you need one metric but care about both precision and recall equally.
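The formula is F1 = 2 · P · R / (P + R). A small sketch showing why the harmonic mean matters — unlike a plain average, it punishes imbalance, so a high precision cannot rescue a low recall:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    # Harmonic mean punishes imbalance: one low input drags F1 down.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.75, 0.75))  # 0.75 (equal inputs: harmonic mean equals them)
print(f1(1.00, 0.50))  # ~0.667, well below the arithmetic mean of 0.75
```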

# Full evaluation report in scikit-learn
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
# Shows precision, recall, F1 for each class

AUC-ROC

The AUC (Area Under the ROC Curve) measures how well a classifier separates the positive and negative classes across all possible thresholds. AUC of 1.0 = perfect. AUC of 0.5 = random guessing. Particularly useful for imbalanced datasets.
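Because AUC is threshold-free, it is computed from the model's scores (probabilities), not from hard 0/1 predictions. A minimal sketch with the same toy scores:

```python
# AUC-ROC with scikit-learn: pass scores, not thresholded predictions.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.35, 0.2, 0.1]

# Equivalently: the probability that a random positive is scored
# above a random negative (here 14 of the 16 positive/negative pairs).
auc = roc_auc_score(y_true, scores)
print(auc)  # 0.875
```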

Regression metrics

For regression (predicting numbers), different metrics apply:

  • MAE (Mean Absolute Error) — the average absolute difference between predictions and actual values, in the target's own units
  • RMSE (Root Mean Squared Error) — like MAE, but errors are squared before averaging, so large misses are penalized more heavily
  • R² (coefficient of determination) — the fraction of variance in the target the model explains; 1.0 is perfect, 0.0 is no better than always predicting the mean
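All three standard regression metrics are available in scikit-learn; a minimal sketch on toy values:

```python
# MAE, RMSE, and R² with scikit-learn (toy data).
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.5]

mae = mean_absolute_error(y_true, y_pred)         # average |error| -> 0.5
rmse = mean_squared_error(y_true, y_pred) ** 0.5  # in the target's units
r2 = r2_score(y_true, y_pred)                     # variance explained
print(mae, rmse, r2)
```

Taking the square root of mean_squared_error works on any scikit-learn version; newer releases also ship a dedicated root_mean_squared_error function.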

Key takeaways

  • Accuracy alone is misleading on imbalanced datasets — always look at precision and recall
  • Precision: of flagged positives, how many were correct? Recall: of actual positives, how many were found?
  • The precision/recall trade-off depends on the cost of false positives vs false negatives
  • F1 balances precision and recall into one number
  • For regression: MAE, RMSE, and R² are the standard metrics