Supervised learning is the most widely used form of machine learning. The idea is simple: you teach the model using examples where you already know the right answer (labelled data). It's like teaching with an answer key: show the model thousands of examples with correct answers, let it learn the patterns, and it will apply those patterns to examples it has never seen before.
Classification vs Regression
Supervised learning problems fall into two categories:
- Classification — predicting a category. Is this email spam or not? Is this tumour malignant or benign? Which digit (0–9) is in this image?
- Regression — predicting a continuous number. What will this house sell for? What will the temperature be tomorrow? How many units will we sell next quarter?
Classification: Gmail spam detection, face recognition, medical diagnosis, sentiment analysis (positive/negative review)
Regression: House price prediction, weather forecasting, stock price modelling, demand forecasting
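The two problem types share a parallel workflow; what changes is the kind of output. A minimal sketch using scikit-learn, with built-in datasets chosen for illustration (the breast cancer dataset for classification, the diabetes dataset for regression; neither is one of the examples above):

```python
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict a category (malignant vs benign)
X_cls, y_cls = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=10000).fit(X_cls, y_cls)
print(clf.predict(X_cls[:1]))  # a discrete class label: 0 or 1

# Regression: predict a continuous number (a disease progression score)
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict(X_reg[:1]))  # a real-valued number
```

The same `fit`/`predict` API covers both; the classifier returns a category, the regressor a number.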
How supervised learning works — step by step
- Collect labelled data — gather examples where you know the correct answer
- Split into train/test sets — typically 80% training, 20% testing
- Choose a model — decision tree, neural network, logistic regression, etc.
- Train — the model iteratively adjusts to minimise its errors on training data
- Evaluate — test performance on the held-out test set
- Deploy — use the model on new, real-world data
Here is that workflow in code, using scikit-learn's built-in breast cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Collect labelled data (here, a built-in dataset)
X, y = load_breast_cancer(return_X_y=True)

# 2. Split 80/20; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3-4. Choose a model and train it
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# 5. Evaluate on the held-out test set
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.2%}")
```
The importance of good labels
The quality of a supervised learning model is limited by the quality of its labels. Inconsistent, biased, or incorrect labels produce poor models — no matter how sophisticated the algorithm. This is why data labelling (and the humans who do it) is so critical and so expensive.
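One way to see this directly is to corrupt the labels yourself and watch accuracy fall. A sketch, where the 30% noise rate and the dataset are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Simulate labelling errors: flip 30% of the training labels at random
rng = np.random.default_rng(0)
noisy = y_train.copy()
flip = rng.random(len(noisy)) < 0.3
noisy[flip] = 1 - noisy[flip]

# Same algorithm, same features -- only the label quality differs
clean_acc = LogisticRegression(max_iter=10000).fit(X_train, y_train).score(X_test, y_test)
noisy_acc = LogisticRegression(max_iter=10000).fit(X_train, noisy).score(X_test, y_test)
print(f"clean labels: {clean_acc:.2%}  noisy labels: {noisy_acc:.2%}")
```

Everything except the labels is identical, so any drop in test accuracy is attributable to label quality alone.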
Unsupervised learning — when you don't have labels
Not all data comes with correct answers. Unsupervised learning finds patterns without labels — grouping similar customers together (clustering), detecting unusual transactions (anomaly detection), or compressing data into fewer dimensions. It's harder to evaluate but powerful when labelled data is scarce.
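Clustering is the most common entry point. A minimal sketch with k-means on synthetic data; the three hidden groups and the choice of `n_clusters=3` are illustrative assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled points drawn from three hidden groups (labels are discarded)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# k-means groups similar points together -- no labels involved
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assignment for the first 10 points
print(len(set(kmeans.labels_)))  # number of distinct clusters found
```

Note that the algorithm assigns arbitrary cluster IDs, not meaningful names; interpreting what each cluster represents is left to you, which is part of why unsupervised results are harder to evaluate.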
Key takeaways
- Supervised learning uses labelled data — examples with known correct answers
- Classification predicts categories; regression predicts continuous numbers
- Training: show examples → model adjusts → evaluate on unseen test data
- Label quality directly determines model quality — garbage labels = garbage model
- Unsupervised learning finds patterns without labels — clustering, anomaly detection