You don't need a maths degree to understand AI — but a handful of statistical concepts come up constantly and are worth knowing. This module covers the essentials in plain English, with light Python examples to make them concrete.
We'll focus on intuition, not formulas. Each concept is illustrated with a real-world analogy and a short Python snippet you can run yourself.
Mean, median, and mode
These are the three ways to describe the "centre" of a dataset:
- Mean — the average. Add everything up, divide by the count. Sensitive to extreme values.
- Median — the middle value when sorted. More robust to outliers than the mean.
- Mode — the most frequently occurring value. Useful for categorical data.
import statistics

salaries = [30000, 35000, 40000, 45000, 200000]
mean = sum(salaries) / len(salaries)     # 70,000 — skewed by the outlier
median = statistics.median(salaries)     # 40,000 — more representative
mode = statistics.mode([1, 2, 2, 3, 4])  # 2
In AI, the mean is used constantly — in loss functions, normalisation, and evaluation metrics. Understanding when the median is a better choice (when your data has outliers) is a practical skill.
Variance and standard deviation
The mean tells you the centre. Variance and standard deviation tell you how spread out the data is around that centre.
Think of two classes with an average exam score of 70. In class A, scores range from 65–75. In class B, scores range from 30–100. Same mean, very different spread. Standard deviation captures this difference.
import statistics
class_a = [65, 68, 70, 72, 75]
class_b = [30, 55, 70, 95, 100]
print(statistics.stdev(class_a))  # ~3.8 — tightly clustered
print(statistics.stdev(class_b))  # ~28.9 — widely spread
In ML, high variance in a model's predictions is a sign of overfitting. Normalising data (scaling it to have mean=0 and std=1) is a standard preprocessing step before training.
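That normalisation step can be sketched in a few lines using only the standard library. The helper name `standardise` is made up for illustration — in practice you'd typically reach for a library tool such as scikit-learn's StandardScaler:

```python
import statistics

def standardise(values):
    """Scale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

scores = [30, 55, 70, 95, 100]
z_scores = standardise(scores)
# After standardising, the mean is (effectively) 0 and the stdev is 1,
# so features measured on very different scales become comparable.
```

Each z-score tells you how many standard deviations a value sits from the mean, which is exactly the "mean=0, std=1" scaling described above.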
Distributions
A distribution describes the shape of how values are spread across a dataset. The most important is the normal distribution (bell curve) — symmetrical, with most values clustered around the mean.
Many real-world measurements follow a normal distribution: heights, test scores, measurement errors. Many ML algorithms assume normally distributed data. When data is skewed (not symmetrical), it often needs to be transformed before use.
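You can see the bell curve's behaviour by drawing random samples from a normal distribution with Python's standard library. The mean of 100 and standard deviation of 15 below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible
# Draw 10,000 samples from a normal distribution (mean 100, std dev 15)
samples = [random.gauss(100, 15) for _ in range(10_000)]

# The sample statistics land close to the parameters we asked for
print(statistics.mean(samples))   # close to 100
print(statistics.stdev(samples))  # close to 15

# A hallmark of the bell curve: about 68% of values fall within
# one standard deviation of the mean
within_one_sd = sum(1 for s in samples if 85 <= s <= 115) / len(samples)
print(within_one_sd)  # roughly 0.68
```

That "68% within one standard deviation" rule is a quick sanity check for whether data looks roughly normal.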
Correlation vs causation
This is arguably the most important statistical concept for anyone working with data. Correlation means two things move together. Causation means one thing causes the other.
Ice cream sales and drowning rates are correlated — both rise in summer. But ice cream doesn't cause drowning. The hidden cause is hot weather, which drives both. An AI trained on this data might mistakenly "learn" that ice cream sales predict drowning risk.
AI models find correlations brilliantly. They don't understand causation. This is a fundamental limitation — and a source of many real-world AI failures.
Probability and confidence
AI models almost never give a binary yes/no — they give a probability. A spam filter doesn't say "this is spam" — it says "this has a 94% chance of being spam." A medical AI doesn't diagnose — it says "this scan has a 73% likelihood of showing a tumour."
Understanding probability helps you interpret AI outputs correctly. A model that's 80% confident should be right about 4 times out of 5 — which also means it's wrong about 1 time in 5. Depending on the context, that might be excellent or dangerously unreliable.
# Example output from a spam classifier:
output = {"spam": 0.94, "not_spam": 0.06}
# The model applies a threshold (usually 0.5) to make a decision
threshold = 0.5
label = "spam" if output["spam"] > threshold else "not_spam"
Train / test split
When building an ML model, you split your data into (at least) two sets: training data the model learns from, and test data it's evaluated on — data it has never seen. This simulates real-world use, where the model will encounter new examples.
from sklearn.model_selection import train_test_split

# features and labels are your full dataset; hold back 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
# random_state ensures reproducibility
If you evaluate a model on its training data, you'll get misleadingly high performance. The model has memorised the answers. Only testing on unseen data tells you how the model will actually perform in the real world.
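A deliberately silly "model" makes the memorisation problem vivid. This toy example, invented for illustration, just stores the training answers in a dictionary — it scores perfectly on its training data and falls apart on anything new:

```python
# Toy task: the label is 1 when the number is above 50, else 0
train = [(10, 0), (20, 0), (30, 0), (60, 1), (70, 1), (80, 1)]
test = [(15, 0), (55, 1), (90, 1), (40, 0)]

# A "model" that simply memorises the training answers
memory = {x: y for x, y in train}

def predict(x):
    # Returns the memorised label; falls back to 0 for anything unseen
    return memory.get(x, 0)

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc)  # 1.0 — looks perfect, but only because it memorised
print(test_acc)   # 0.5 — the honest estimate on unseen data
```

The gap between the two scores is exactly what the train/test split exists to expose.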
Key takeaways
- Mean is sensitive to outliers; median is more robust for skewed data
- Standard deviation measures spread — high spread in model outputs can signal overfitting
- Correlation ≠ causation — AI finds patterns but doesn't understand why they exist
- AI outputs are probabilities, not certainties — always consider the confidence level
- Always test models on data they haven't been trained on — that's the true measure of performance