Understanding how models train and how to evaluate them honestly is one of the most important skills in machine learning. A model that looks great on paper but fails in the real world is worse than useless — it's dangerous.
The training process
During training, the model makes predictions on training examples, calculates how wrong it is (the loss), then adjusts its parameters to be slightly less wrong. This loop — predict, calculate error, adjust — runs thousands or millions of times until the model converges.
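The predict → measure → adjust loop can be sketched in a few lines. This is a minimal illustration, not a production trainer: it fits a single weight w to hypothetical noisy data with plain gradient descent, using made-up values for the learning rate and step count.

```python
import numpy as np

# Hypothetical 1-D regression: learn w so that y ≈ w * x.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)  # true slope is 3, plus noise

w = 0.0    # initial guess
lr = 0.1   # learning rate: how big each adjustment is
for step in range(200):
    pred = w * x                        # 1. predict
    loss = np.mean((pred - y) ** 2)     # 2. measure error (mean squared error)
    grad = np.mean(2 * (pred - y) * x)  # 3. gradient of the loss w.r.t. w
    w -= lr * grad                      # 4. adjust to be slightly less wrong

print(f"learned w ≈ {w:.2f}")  # converges close to the true slope of 3
```

Real frameworks automate steps 3 and 4 (backpropagation and an optimiser), but the loop is the same shape.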
Training is like studying for an exam. The training data is your practice problems. The test set is the actual exam — questions you've never seen. If you only practise problems you've already memorised, you won't do well on new ones.
Overfitting — the most common ML failure
Overfitting happens when a model learns the training data too well — including its noise and random quirks — and fails on new data. The model has memorised rather than generalised.
Signs of overfitting: very high training accuracy, much lower test accuracy. A gap of more than 5–10% is usually a red flag.
# Works with any fitted scikit-learn estimator
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train accuracy: {train_score:.2%}")  # e.g. 99% — suspiciously high
print(f"Test accuracy: {test_score:.2%}")    # e.g. 72% — much lower = overfitting
Underfitting — the opposite problem
Underfitting is when the model is too simple to capture the patterns in the data. Both training and test accuracy are low. The fix is usually a more complex model or more features.
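Both failure modes can be seen side by side with decision trees of different depths. This is a sketch on a synthetic dataset (scikit-learn's make_moons, with arbitrary seed and depth choices): the depth-1 tree is too simple for the curved decision boundary and underfits, while a deeper tree captures it.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with a curved boundary
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Too simple: a depth-1 tree makes a single split and underfits —
# both train and test accuracy come out low.
simple = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# More capacity: a deeper tree can follow the curved pattern.
deeper = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

print(f"simple: train={simple.score(X_train, y_train):.0%} "
      f"test={simple.score(X_test, y_test):.0%}")
print(f"deeper: train={deeper.score(X_train, y_train):.0%} "
      f"test={deeper.score(X_test, y_test):.0%}")
```

The underfitting signature is that the simple model scores poorly on both splits; adding capacity lifts both scores together, unlike overfitting, where only the training score rises.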
The validation set
In practice, data is split three ways: training (learn), validation (tune), and test (final evaluation). The validation set is used during development to compare models and tune hyperparameters. The test set is locked away and only used once — for the final honest evaluation.
from sklearn.model_selection import train_test_split
# Hold out the test set first, then carve a validation set out of the rest.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
# 0.18 of the remaining 85% ≈ 15% of the total
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.18, random_state=0)
# Approx 70% train / 15% val / 15% test
Cross-validation
With small datasets, a single train/test split is unreliable — you might get lucky or unlucky with which examples land where. Cross-validation splits the data multiple times and averages the results, giving a more reliable performance estimate.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.2%} ± {scores.std():.2%}")
Key takeaways
- Training is the loop of predict → measure error → adjust parameters
- Overfitting: great on training data, poor on new data — the most common ML failure
- Always evaluate on data the model has never seen — that's the honest measure
- Use train / validation / test splits: train to learn, validation to tune, test for final score
- Cross-validation gives more reliable estimates on small datasets