You don't need a maths degree to understand AI — but a handful of statistical concepts come up constantly and are worth knowing. This module covers the essentials in plain English, with light Python examples to make them concrete.
We'll focus on intuition, not formulas. Each concept is illustrated with a real-world analogy and a short Python snippet you can run yourself.
Mean, median, and mode
These are the three ways to describe the "centre" of a dataset:
- Mean — the average. Add everything up, divide by the count. Sensitive to extreme values.
- Median — the middle value when sorted. More robust to outliers than the mean.
- Mode — the most frequently occurring value. Useful for categorical data.
import statistics

salaries = [30000, 35000, 40000, 45000, 200000]
mean = sum(salaries) / len(salaries)     # 70,000 — skewed by the outlier
median = statistics.median(salaries)     # 40,000 — more representative
mode = statistics.mode([1, 2, 2, 3, 4])  # 2
In AI, the mean is used constantly — in loss functions, normalisation, and evaluation metrics. Understanding when the median is a better choice (when your data has outliers) is a practical skill.
Variance and standard deviation
The mean tells you the centre. Variance and standard deviation tell you how spread out the data is around that centre.
Think of two classes with an average exam score of 70. In class A, scores range from 65–75. In class B, scores range from 30–100. Same mean, very different spread. Standard deviation captures this difference.
import statistics
class_a = [65, 68, 70, 72, 75]
class_b = [30, 55, 70, 95, 100]
print(statistics.stdev(class_a))  # ~3.8 — tightly clustered
print(statistics.stdev(class_b))  # ~28.9 — widely spread
In ML, high variance in a model's predictions is a sign of overfitting. Normalising data (scaling it to have mean=0 and std=1) is a standard preprocessing step before training.
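That normalisation step can be sketched in a few lines using only the standard library. The helper name `standardise` is made up for illustration — in practice you'd typically reach for a library tool such as scikit-learn's StandardScaler:

```python
import statistics

def standardise(values):
    """Scale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

scores = [30, 55, 70, 95, 100]
z_scores = standardise(scores)
# After standardising, the mean is (effectively) 0 and the stdev is 1,
# so features measured on very different scales become comparable.
```

Each z-score tells you how many standard deviations a value sits from the mean, which is exactly the "mean=0, std=1" scaling described above.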
Distributions
A distribution describes the shape of how values are spread across a dataset. The most important is the normal distribution (bell curve) — symmetrical, with most values clustered around the mean.
Many real-world measurements follow a normal distribution: heights, test scores, measurement errors. Many ML algorithms assume normally distributed data. When data is skewed (not symmetrical), it often needs to be transformed before use.
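You can see the bell curve's behaviour by drawing random samples from a normal distribution with Python's standard library. The mean of 100 and standard deviation of 15 below are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible
# Draw 10,000 samples from a normal distribution (mean 100, std dev 15)
samples = [random.gauss(100, 15) for _ in range(10_000)]

# The sample statistics land close to the parameters we asked for
print(statistics.mean(samples))   # close to 100
print(statistics.stdev(samples))  # close to 15

# A hallmark of the bell curve: about 68% of values fall within
# one standard deviation of the mean
within_one_sd = sum(1 for s in samples if 85 <= s <= 115) / len(samples)
print(within_one_sd)  # roughly 0.68
```

That "68% within one standard deviation" rule is a quick sanity check for whether data looks roughly normal.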
Correlation vs causation
This is arguably the most important statistical concept for anyone working with data. Correlation means two things move together. Causation means one thing causes the other.
Ice cream sales and drowning rates are correlated — both rise in summer. But ice cream doesn't cause drowning. The hidden cause is hot weather, which drives both. An AI trained on this data might mistakenly "learn" that ice cream sales predict drowning risk.
AI models find correlations brilliantly. They don't understand causation. This is a fundamental limitation — and a source of many real-world AI failures.
Probability and confidence
AI models almost never give a binary yes/no — they give a probability. A spam filter doesn't say "this is spam" — it says "this has a 94% chance of being spam." A medical AI doesn't diagnose — it says "this scan has a 73% likelihood of showing a tumour."
Understanding probability helps you interpret AI outputs correctly. A model that's 80% confident should be right about 4 times out of 5 — which also means it's wrong about 1 time in 5. Depending on the context, that might be excellent or dangerously unreliable.
# Example output from a spam classifier:
output = {"spam": 0.94, "not_spam": 0.06}
# The model applies a threshold (usually 0.5) to make a decision
threshold = 0.5
label = "spam" if output["spam"] > threshold else "not_spam"
Train / test split
When building an ML model, you split your data into (at least) two sets: training data the model learns from, and test data it's evaluated on — data it has never seen. This simulates real-world use, where the model will encounter new examples.
from sklearn.model_selection import train_test_split

# features and labels are your full dataset; hold back 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)
# random_state ensures reproducibility
If you evaluate a model on its training data, you'll get misleadingly high performance. The model has memorised the answers. Only testing on unseen data tells you how the model will actually perform in the real world.
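A deliberately silly "model" makes the memorisation problem vivid. This toy example, invented for illustration, just stores the training answers in a dictionary — it scores perfectly on its training data and falls apart on anything new:

```python
# Toy task: the label is 1 when the number is above 50, else 0
train = [(10, 0), (20, 0), (30, 0), (60, 1), (70, 1), (80, 1)]
test = [(15, 0), (55, 1), (90, 1), (40, 0)]

# A "model" that simply memorises the training answers
memory = {x: y for x, y in train}

def predict(x):
    # Returns the memorised label; falls back to 0 for anything unseen
    return memory.get(x, 0)

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc)  # 1.0 — looks perfect, but only because it memorised
print(test_acc)   # 0.5 — the honest estimate on unseen data
```

The gap between the two scores is exactly what the train/test split exists to expose.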
Key takeaways
- Mean is sensitive to outliers; median is more robust for skewed data
- Standard deviation measures spread — high spread in model outputs can signal overfitting
- Correlation ≠ causation — AI finds patterns but doesn't understand why they exist
- AI outputs are probabilities, not certainties — always consider the confidence level
- Always test models on data they haven't been trained on — that's the true measure of performance