Not all data is the same. Understanding the different types of data is essential for knowing which AI techniques apply to a given problem — and why some problems are much harder than others.

Structured vs Unstructured data

The most important distinction in data is whether it has a predefined format.

Structured data | Unstructured data
Organised in rows and columns | No predefined format
Lives in spreadsheets and databases | Lives in files, emails, images, audio
Easy to search and query | Requires AI to extract meaning
~20% of all data generated | ~80% of all data generated
Examples: sales records, customer tables, stock prices | Examples: emails, social posts, photos, PDFs, videos
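
The contrast between "easy to search and query" and "requires AI to extract meaning" can be sketched in a few lines of Python. This is only an illustration: the records, the email text, and the names `sales_records` and `email_body` are invented for the example, and the regular expression is a brittle stand-in for the extraction step a trained model would normally perform.

```python
import re

# Structured: every record has the same named fields,
# so a query is a simple filter over those fields.
sales_records = [
    {"customer": "Ada", "product": "laptop", "amount": 1200.0},
    {"customer": "Ben", "product": "mouse", "amount": 25.0},
    {"customer": "Ada", "product": "monitor", "amount": 300.0},
]
ada_total = sum(r["amount"] for r in sales_records if r["customer"] == "Ada")

# Unstructured: the same kind of fact is buried in free text.
email_body = "Hi team, Ada just ordered a laptop for 1200 dollars. Thanks!"

# This hand-written pattern only works for this exact phrasing.
# Any rewording of the email would break it.
match = re.search(r"(\w+) just ordered a (\w+) for (\d+) dollars", email_body)
extracted = match.groups() if match else None

print(ada_total)   # 1500.0
print(extracted)   # ('Ada', 'laptop', '1200')
```

The structured query keeps working however many rows are added, while the pattern fails as soon as someone writes "Ada bought a laptop" instead. That fragility is why unstructured data was so hard to use at scale.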

Part of what drove the rise of deep learning was that it finally made AI good at processing unstructured data (images, text, speech), which had previously been almost impossible to work with at scale.

Semi-structured data

Between the two extremes sits semi-structured data: data that has some organisation but no rigid tabular structure. JSON files, XML documents, HTML pages, and emails with consistent headers all fall into this category. Much of the data on the web is semi-structured.

Example: JSON (semi-structured)

A customer record stored as JSON has labelled fields (name, email, purchases) but the values can vary in length and type. It's not a perfect spreadsheet row, but it's not freeform text either.
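
As a minimal sketch of that idea, the snippet below parses two customer records with Python's standard `json` module. The field names follow the example above (name, email, purchases); the customers, values, and the nested layout of each purchase are assumptions made up for illustration.

```python
import json

# Two hypothetical customer records. Both have the same labelled fields,
# but the "purchases" list varies in length from one record to the next.
raw = """
[
  {"name": "Ada Lovelace", "email": "ada@example.com",
   "purchases": [{"item": "laptop", "price": 1200.0},
                 {"item": "mouse", "price": 25.0}]},
  {"name": "Ben Ortiz", "email": "ben@example.com", "purchases": []}
]
"""
customers = json.loads(raw)

# The labelled fields can be read directly, like columns.
purchase_counts = {c["name"]: len(c["purchases"]) for c in customers}

# Forcing the records into flat rows takes an explicit decision:
# here, one row per purchase, which drops customers with no purchases.
rows = [
    {"name": c["name"], "email": c["email"], "item": p["item"], "price": p["price"]}
    for c in customers
    for p in c["purchases"]
]

print(purchase_counts)  # {'Ada Lovelace': 2, 'Ben Ortiz': 0}
print(len(rows))        # 2
```

The labels make the data easy to navigate, but the variable-length list is exactly what stops it from being a clean spreadsheet row.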

Quantitative vs Qualitative data

Quantitative data is numerical — things you can measure and calculate. Age, temperature, price, height, sales figures. You can add, average, and compare it directly.

Qualitative data is descriptive: categories, labels, opinions, text. Customer reviews, survey responses, product categories, and interview transcripts are all examples. It's richer in meaning but harder to analyse computationally unless it is first converted into numbers.

Much of what AI does is convert qualitative information into quantitative representations — turning words into numbers (embeddings), images into pixel values, categories into numerical codes.
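
Two of the simplest such conversions can be sketched in plain Python. The helper names (`label_encode`, `one_hot`, `bag_of_words`) and the sample data are hypothetical. The word-count vector is a deliberately crude stand-in for embeddings, which are learned by a model rather than counted.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def one_hot(values):
    """Represent each category as a vector with a single 1 in its own slot."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term appears in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


product_categories = ["toys", "books", "toys", "garden"]
encoded, mapping = label_encode(product_categories)
vectors = one_hot(product_categories)
review_vector = bag_of_words("Great product great price", ["great", "price", "slow"])

print(encoded)        # [2, 0, 2, 1]
print(vectors[0])     # [0, 0, 1]
print(review_vector)  # [2, 1, 0]
```

Integer codes are compact but imply an order (garden "less than" toys) that does not really exist, which is one reason one-hot vectors are often preferred for unordered categories.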

Labelled vs Unlabelled data

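This distinction is critical in machine learning. Labelled data pairs each example with the answer a model should learn to predict, such as an email tagged as spam or not spam; supervised learning requires it. Unlabelled data is the raw examples alone, with no such answers attached.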

Why labelling is so expensive

Creating labelled training data often requires human expertise and enormous time. Medical AI systems need radiologists to label thousands of scans. This bottleneck — the cost of labelling — is one of the main reasons AI development is so resource-intensive.
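
To make the role of labels concrete, here is a minimal sketch of a nearest-neighbour classifier, chosen only because it is short. The features (number of links and number of exclamation marks in an email), the data points, and the function name `predict` are all invented for illustration.

```python
import math

# Labelled data: each example is (features, label).
# Hypothetical features: (number of links, number of exclamation marks).
labelled = [
    ((8, 5), "spam"),
    ((7, 6), "spam"),
    ((0, 1), "not spam"),
    ((1, 0), "not spam"),
]

# Unlabelled data: features only, with no answer attached.
unlabelled = [(6, 4), (0, 0)]


def predict(features, training_data):
    """Return the label of the closest labelled example (1-nearest neighbour)."""
    nearest = min(training_data, key=lambda pair: math.dist(pair[0], features))
    return nearest[1]


predictions = [predict(x, labelled) for x in unlabelled]
print(predictions)  # ['spam', 'not spam']
```

Every prediction here is borrowed from a human-supplied label. Without the `labelled` list the classifier has nothing to learn from, which is why the cost of producing those labels matters so much.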

Time series data

Time series data consists of data points collected over time: stock prices, sensor readings, weather measurements, heart rate monitors. It has special properties: order matters, recent data is often more relevant than old data, and patterns often repeat in cycles (daily, seasonally). Forecasting models are specifically designed for this type of data.
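
These properties can be illustrated with two very simple forecasting techniques. This is a sketch with made-up readings, not a recommendation of either method. Simple exponential smoothing weights recent values more heavily, and a seasonal naive forecast reuses the value from one cycle earlier.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed level after processing the series in time order.

    alpha close to 1 trusts recent values; alpha close to 0 trusts history.
    """
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def seasonal_naive(series, season_length):
    """Forecast the next value as the value observed one full season earlier."""
    return series[-season_length]


readings = [10.0, 10.0, 10.0, 20.0]

# Order matters: the same values in reverse order give a different result.
forward = exponential_smoothing(readings, alpha=0.5)
backward = exponential_smoothing(list(reversed(readings)), alpha=0.5)
print(forward, backward)  # 15.0 11.25

# A series that alternates with a cycle of length 2.
cyclic = [10.0, 20.0, 10.0, 20.0, 10.0, 20.0]
print(seasonal_naive(cyclic, season_length=2))  # 10.0
```

Shuffling the rows of an ordinary table usually changes nothing, whereas here reversing the series changes the forecast. That is the practical meaning of "order matters".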

Tabular, image, text, and audio — the four main modalities

From an AI perspective, data is often categorised by its modality, the type of information it represents: tabular, image, text, or audio.

Modern multimodal models can work across several of these modalities at once: GPT-4o accepts text, images, and audio, while Claude 3 accepts text and images.
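
As a rough sketch of what each modality looks like once it reaches a program, the snippet below builds a tiny example of each using only the standard library. All values are invented. Character codes stand in for the tokenisation that real language models use, and the audio is a synthetic 440 Hz tone rather than a recording.

```python
import math

# Tabular: named fields with one value each.
tabular_row = {"age": 34, "price": 19.99, "in_stock": True}

# Image: a grid of pixel intensities (here 2 rows x 3 columns, grayscale 0-255).
image = [
    [0, 128, 255],
    [255, 128, 0],
]

# Text: a sequence of symbols, shown here as Unicode code points.
text = "Hi!"
text_codes = [ord(ch) for ch in text]

# Audio: amplitude samples over time (10 ms of a 440 Hz tone at 8000 samples/s).
sample_rate = 8000
audio = [
    math.sin(2 * math.pi * 440 * n / sample_rate)
    for n in range(sample_rate // 100)
]

print(len(tabular_row))          # 3 fields
print(len(image), len(image[0])) # 2 3
print(text_codes)                # [72, 105, 33]
print(len(audio))                # 80
```

The shapes differ (named fields, a 2D grid, a symbol sequence, a sampled signal), which is why each modality has traditionally needed its own kind of model.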

A note on data collection ethics

How data is collected matters enormously. Was it collected with informed consent? Does it represent all the people it needs to represent? Was it scraped from the web without permission? These questions are not just ethical — they affect the legal standing and social acceptance of AI systems built on that data.

Key takeaways

  • Structured data is tabular (rows/columns); unstructured data has no set format and makes up roughly 80% of all data generated
  • Quantitative data is numerical; qualitative data is descriptive — AI often converts qualitative to quantitative
  • Labelled data is required for supervised learning and is expensive to create
  • The four main data modalities: tabular, image, text, audio
  • Multimodal models can handle multiple data types simultaneously