Raw data is potential. Insight is what happens when you transform that potential into something you can act on. This module covers the journey from a messy dataset to a clear understanding — the core workflow of data analysis and a prerequisite for building effective AI systems.
Step 1 — Explore (EDA)
Exploratory Data Analysis (EDA) is the first thing any good data analyst or ML engineer does with a new dataset. Before modelling anything, you need to understand what you're working with.
import pandas as pd
df = pd.read_csv("sales_data.csv")
df.head() # see the first 5 rows
df.info() # column names, types, non-null counts (reveals missing values)
df.describe() # count, mean, std, min, quartiles, max for each numeric column
EDA reveals problems you'd otherwise miss: columns with missing values, dates stored as text, categorical variables with unexpected values, or numerical columns with suspicious outliers.
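The checks behind those findings can be sketched in a few lines. This is a minimal illustration, not a complete EDA routine: the small DataFrame below is hypothetical toy data standing in for `sales_data.csv`, and the column names `order_date`, `region` and `amount` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data standing in for sales_data.csv (column names are assumed)
df = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
    "region": ["North", "north", "South", None],
    "amount": [120.0, 95.5, None, 10000.0],
})

# Missing values per column
missing = df.isna().sum()
print(missing)

# Dates stored as text show up with a text dtype rather than a datetime dtype
print(df["order_date"].dtype)

# Unexpected category labels: "North" and "north" are counted as different values
region_counts = df["region"].value_counts(dropna=False)
print(region_counts)

# Suspicious outliers: compare max against the quartiles in the summary
print(df["amount"].describe())

# Convert the text column to real dates; errors="coerce" turns unparseable entries into NaT
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
```

Each printout maps to one of the problems named above, so the output doubles as a to-do list for the cleaning step.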
Step 2 — Clean
Real-world data is almost never clean. Common problems include:
- Missing values — entire rows or specific cells with no data. You can drop them, fill them with the mean/median, or use a model to impute them.
- Duplicates — the same record appearing multiple times due to data entry errors or merging issues.
- Inconsistent formatting — "UK", "United Kingdom", "U.K." all meaning the same thing.
- Outliers — values far outside the expected range. Sometimes genuine (a billionaire in a salary dataset), sometimes errors (a height of 999cm).
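# These methods return new objects, so assign the result back to keep the change
df = df.dropna() # option A: remove rows with any missing value
df["age"] = df["age"].fillna(df["age"].mean()) # option B: fill missing ages with the column mean
df = df.drop_duplicates() # remove duplicate rows
df["country"] = df["country"].str.strip().str.upper() # standardise text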
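To show how the four problems listed above might be handled together, here is a hedged sketch on hypothetical toy data. The alias mapping, the 1.5 × IQR outlier rule, and the choice to treat the flagged value as missing are all assumptions made for illustration; the right choices depend on the dataset.

```python
import pandas as pd

# Hypothetical toy data containing each problem type
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cy", "Di", "Ed"],
    "country": [" uk", "United Kingdom", "United Kingdom", "U.K.", "France", "france "],
    "height_cm": [165.0, 180.0, 180.0, 999.0, None, 172.0],
})

# 1. Duplicates: drop exact repeated rows
df = df.drop_duplicates().reset_index(drop=True)

# 2. Inconsistent formatting: normalise whitespace and case, then map known aliases
country_aliases = {"UK": "UNITED KINGDOM", "U.K.": "UNITED KINGDOM"}  # assumed mapping
df["country"] = df["country"].str.strip().str.upper().replace(country_aliases)

# 3. Outliers: flag values more than 1.5 * IQR outside the quartiles (one common rule of thumb)
q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["height_cm"] < q1 - 1.5 * iqr) | (df["height_cm"] > q3 + 1.5 * iqr)

# 4. Missing values: here the flagged value is assumed to be an entry error,
#    so it is treated as missing and filled with the median of the remaining values
df.loc[is_outlier, "height_cm"] = float("nan")
df["height_cm"] = df["height_cm"].fillna(df["height_cm"].median())

print(df)
```

The median is used rather than the mean because it is less sensitive to any extreme values that remain. A genuine outlier, such as the billionaire in a salary dataset, would normally be kept rather than overwritten.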
Data cleaning is often cited as taking 60–80% of the time in a real data project. It's unglamorous but essential: uncleaned data fed into an ML model will produce unreliable, often misleading results.
Step 3 — Visualise
Humans are visual creatures. A chart can reveal patterns that tables of numbers completely hide. Visualisation is both an exploration tool (finding patterns) and a communication tool (explaining findings to others).
Common chart types and when to use them:
| Chart type | Best for |
|---|---|
| Histogram | Distribution of a single numerical variable |
| Scatter plot | Relationship between two numerical variables |
| Bar chart | Comparing categories |
| Line chart | Trends over time |
| Heatmap | Correlations between many variables at once |
| Box plot | Distribution + outliers for a variable |
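As a rough illustration of the first two rows of the table, the sketch below draws a histogram and a scatter plot side by side. It assumes matplotlib is installed, and the `ad_spend` and `revenue` columns are hypothetical toy data, not taken from any real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 60, 70, 80],
    "revenue": [25, 41, 58, 85, 98, 125, 139, 160],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(9, 4))

# Histogram: distribution of a single numerical variable
ax_hist.hist(df["revenue"], bins=5)
ax_hist.set_title("Distribution of revenue")
ax_hist.set_xlabel("revenue")
ax_hist.set_ylabel("count")

# Scatter plot: relationship between two numerical variables
ax_scatter.scatter(df["ad_spend"], df["revenue"])
ax_scatter.set_title("Ad spend vs revenue")
ax_scatter.set_xlabel("ad_spend")
ax_scatter.set_ylabel("revenue")

fig.tight_layout()
# fig.savefig("charts.png")  # uncomment to write the figure to disk
```

In a notebook the backend line can be dropped and the figure will render inline.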
Step 4 — Analyse and model
With clean data and a visual understanding of its shape, you're ready to extract insights. This might mean running a statistical test to confirm a relationship, building a predictive model, or segmenting customers into groups.
The key question at this stage: What decision will this insight drive? Analysis without a clear question or decision in mind often leads nowhere useful.
Example: an e-commerce company notices conversion rates are dropping. EDA reveals the drop is concentrated on mobile devices. Cleaning removes bot traffic. Visualisation shows the drop started exactly when a new checkout page was launched. Analysis confirms mobile checkout time increased by 40%. Insight: the new checkout page is broken on mobile. Decision: roll back or fix it.
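The analysis step in a scenario like this could be as simple as a grouped comparison. The sketch below is a minimal illustration: the `sessions` table, its column names, and every number in it are made up for the example and are not real measurements.

```python
import pandas as pd

# Hypothetical session log; all values are invented for illustration
sessions = pd.DataFrame({
    "device": ["mobile"] * 4 + ["desktop"] * 4,
    "period": ["before", "before", "after", "after"] * 2,  # relative to the checkout launch
    "converted": [1, 1, 0, 1, 1, 0, 1, 0],
    "checkout_seconds": [50, 50, 70, 70, 40, 40, 40, 40],
})

grouped = sessions.groupby(["device", "period"])

# Conversion rate per device and period: the mean of a 0/1 column
conversion = grouped["converted"].mean().unstack("period")

# Relative change in mean checkout time per device, in percent
checkout = grouped["checkout_seconds"].mean().unstack("period")
pct_change = (checkout["after"] - checkout["before"]) / checkout["before"] * 100

print(conversion)
print(pct_change)
```

Splitting by both device and period is what isolates the problem: an overall average would blend the unchanged desktop sessions with the affected mobile ones.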
Step 5 — Communicate
Insight that isn't communicated clearly has no value. The best data analysts are also good storytellers — they can translate complex findings into simple narratives that non-technical stakeholders can understand and act on.
Good data communication means: leading with the conclusion, not the methodology; using the right chart for the audience; and being honest about uncertainty and limitations.
In machine learning, "data to insight" happens before model building (understanding your data), during training (monitoring learning curves and metrics), and after deployment (evaluating real-world performance). It's not a one-time step — it's an ongoing discipline.
Key takeaways
- EDA — explore your data before building anything to understand its shape and problems
- Data cleaning is often cited as taking 60–80% of project time; it's essential, not optional
- Visualisation reveals patterns that numbers alone hide
- Analysis should always be tied to a clear decision or question
- Communicating findings clearly is as important as finding them