Accuracy alone is not enough. A model that predicts "no fraud" on every transaction would be 99.9% accurate on a dataset where 0.1% of transactions are fraud — while being completely useless. Evaluation metrics give you the full picture.
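To see the trap concretely, here is a minimal sketch of that "always predict no fraud" model (the 1,000-transaction labels below are made up purely for illustration):

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy data: 1,000 transactions, exactly 1 fraudulent (label 1), rest legitimate (label 0)
y_true = [1] + [0] * 999
y_pred = [0] * 1000          # the "model" predicts "no fraud" every time

print(accuracy_score(y_true, y_pred))  # 0.999, looks impressive
print(recall_score(y_true, y_pred))    # 0.0, it catches no fraud at all
```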
The confusion matrix
For classification problems, start with the confusion matrix — a table that breaks down predictions into four categories:
|  | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) ✓ | False Negative (FN) ✗ |
| Actually Negative | False Positive (FP) ✗ | True Negative (TN) ✓ |
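To make the four counts concrete, here is a minimal sketch with scikit-learn on a tiny hand-made set of labels (illustrative only). Note that for binary 0/1 labels, `confusion_matrix` orders the result as `[[TN, FP], [FN, TP]]`:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive class, 0 = negative class (illustrative only)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

# For binary 0/1 labels, scikit-learn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")  # TP=3  FP=1  FN=1  TN=3
```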
Precision and Recall
Precision — of all the cases your model flagged as positive, what fraction actually were? High precision = few false alarms.
Recall (Sensitivity) — of all the actual positives, what fraction did your model catch? High recall = few misses.
A cancer screening test should have very high recall — missing a cancer is catastrophic. A spam filter should have high precision — you don't want real emails deleted. The right balance depends on the cost of each type of error in your specific context.
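Both metrics fall straight out of the confusion-matrix counts. A minimal sketch, reusing the same toy labels as above:

```python
from sklearn.metrics import precision_score, recall_score

# Same toy labels as the confusion-matrix sketch (illustrative only)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

# Precision = TP / (TP + FP); Recall = TP / (TP + FN)
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
```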
F1 Score
The F1 score is the harmonic mean of precision and recall — a single number that balances both. Useful when you need one metric but care about both precision and recall equally.
```python
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
# Shows precision, recall, F1 for each class
```
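As a sanity check on the definition, the same number can be computed by hand from precision and recall. A minimal sketch using the toy labels from earlier:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Same toy labels as before (illustrative only)
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# F1 is the harmonic mean: 2 * P * R / (P + R)
print(2 * p * r / (p + r))       # 0.75
print(f1_score(y_true, y_pred))  # 0.75, same value
```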
AUC-ROC
The AUC (Area Under the ROC Curve) measures how well a classifier separates the positive and negative classes across all possible thresholds. An AUC of 1.0 means perfect separation; 0.5 means random guessing. Because it evaluates the model's ranking rather than a single threshold, it is a useful summary on imbalanced datasets, though when positives are very rare the precision-recall curve is often more informative.
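A minimal sketch, assuming the model outputs a probability or score for each example (AUC is computed from scores, not from hard 0/1 predictions). The scores below are made up for illustration:

```python
from sklearn.metrics import roc_auc_score

# Toy data: true labels and the model's predicted scores (illustrative only)
y_true   = [1, 1, 0, 0, 1, 0, 0, 1]
y_scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1, 0.3, 0.7]

# AUC uses the ranking of scores across all thresholds
print(roc_auc_score(y_true, y_scores))  # 0.9375 (1.0 = perfect, 0.5 = random)
```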
Regression metrics
For regression (predicting numbers), different metrics apply (a short sketch computing all three follows the list):
- MAE (Mean Absolute Error) — average absolute difference between predictions and actual values. Easy to interpret.
- RMSE (Root Mean Squared Error) — like MAE but penalises large errors more heavily.
- R² (R-squared) — how much of the variance in the data your model explains. 1.0 = perfect, 0 = no better than predicting the mean (and it can go negative for a model that does worse than the mean).
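A minimal sketch computing all three with scikit-learn, on a tiny set of made-up target values (illustrative only); RMSE is taken as the square root of the MSE so it works across scikit-learn versions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy regression targets and predictions (illustrative only)
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

print(mean_absolute_error(y_true, y_pred))          # MAE  = 0.5
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # RMSE ≈ 0.61
print(r2_score(y_true, y_pred))                     # R²   ≈ 0.88
```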
Key takeaways
- Accuracy alone is misleading on imbalanced datasets — always look at precision and recall
- Precision: of flagged positives, how many were correct? Recall: of actual positives, how many were found?
- The precision/recall trade-off depends on the cost of false positives vs false negatives
- F1 balances precision and recall into one number
- For regression: MAE, RMSE, and R² are the standard metrics