Understanding how models train and how to evaluate them honestly is one of the most important skills in machine learning. A model that looks great on paper but fails in the real world is worse than useless — it's dangerous.
The training process
During training, the model makes predictions on training examples, calculates how wrong it is (the loss), then adjusts its parameters to be slightly less wrong. This loop — predict, calculate error, adjust — runs thousands or millions of times until the model converges.
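The predict → measure → adjust loop can be sketched in a few lines. This is a minimal illustration, not a production trainer: it fits a single weight w to hypothetical noisy data with plain gradient descent, using made-up values for the learning rate and step count.

```python
import numpy as np

# Hypothetical 1-D regression: learn w so that y ≈ w * x.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + rng.normal(0, 0.1, 100)  # true slope is 3, plus noise

w = 0.0    # initial guess
lr = 0.1   # learning rate: how big each adjustment is
for step in range(200):
    pred = w * x                        # 1. predict
    loss = np.mean((pred - y) ** 2)     # 2. measure error (mean squared error)
    grad = np.mean(2 * (pred - y) * x)  # 3. gradient of the loss w.r.t. w
    w -= lr * grad                      # 4. adjust to be slightly less wrong

print(f"learned w ≈ {w:.2f}")  # converges close to the true slope of 3
```

Real frameworks automate steps 3 and 4 (backpropagation and an optimiser), but the loop is the same shape.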
Training is like studying for an exam. The training data is your practice problems. The test set is the actual exam — questions you've never seen. If you only practise problems you've already memorised, you won't do well on new ones.
Overfitting — the most common ML failure
Overfitting happens when a model learns the training data too well — including its noise and random quirks — and fails on new data. The model has memorised rather than generalised.
Signs of overfitting: very high training accuracy, much lower test accuracy. A gap of more than 5–10% is usually a red flag.
# Works with any fitted scikit-learn estimator
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Train accuracy: {train_score:.2%}")  # e.g. 99% — suspiciously high
print(f"Test accuracy: {test_score:.2%}")    # e.g. 72% — much lower = overfitting
Underfitting — the opposite problem
Underfitting is when the model is too simple to capture the patterns in the data. Both training and test accuracy are low. The fix is usually a more complex model or more features.
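Both failure modes can be seen side by side with decision trees of different depths. This is a sketch on a synthetic dataset (scikit-learn's make_moons, with arbitrary seed and depth choices): the depth-1 tree is too simple for the curved decision boundary and underfits, while a deeper tree captures it.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with a curved boundary
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Too simple: a depth-1 tree makes a single split and underfits —
# both train and test accuracy come out low.
simple = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)

# More capacity: a deeper tree can follow the curved pattern.
deeper = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

print(f"simple: train={simple.score(X_train, y_train):.0%} "
      f"test={simple.score(X_test, y_test):.0%}")
print(f"deeper: train={deeper.score(X_train, y_train):.0%} "
      f"test={deeper.score(X_test, y_test):.0%}")
```

The underfitting signature is that the simple model scores poorly on both splits; adding capacity lifts both scores together, unlike overfitting, where only the training score rises.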
The validation set
In practice, data is split three ways: training (learn), validation (tune), and test (final evaluation). The validation set is used during development to compare models and tune hyperparameters. The test set is locked away and only used once — for the final honest evaluation.
from sklearn.model_selection import train_test_split
# Hold out the test set first, then carve a validation set out of the rest.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
# 0.18 of the remaining 85% ≈ 15% of the total
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.18, random_state=0)
# Approx 70% train / 15% val / 15% test
Cross-validation
With small datasets, a single train/test split is unreliable — you might get lucky or unlucky with which examples land where. Cross-validation splits the data multiple times and averages the results, giving a more reliable performance estimate.
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.2%} ± {scores.std():.2%}")
Key takeaways
- Training is the loop of predict → measure error → adjust parameters
- Overfitting: great on training data, poor on new data — the most common ML failure
- Always evaluate on data the model has never seen — that's the honest measure
- Use train / validation / test splits: train to learn, validation to tune, test for final score
- Cross-validation gives more reliable estimates on small datasets