Not all data is the same. Understanding the different types of data is essential for knowing which AI techniques apply to a given problem — and why some problems are much harder than others.
Structured vs Unstructured data
One of the most basic distinctions in data is whether it has a predefined format.
| Structured data | Unstructured data |
|---|---|
| Organised in rows and columns | No predefined format |
| Lives in spreadsheets and databases | Lives in files, emails, images, audio |
| Easy to search and query | Usually needs extra processing, often AI, to extract meaning |
| Often estimated at ~20% of all data generated | Often estimated at ~80% of all data generated |
| Examples: sales records, customer tables, stock prices | Examples: emails, social posts, photos, PDFs, videos |
The rise of deep learning is closely tied to this split: deep neural networks made it practical to process unstructured data such as images, text, and speech at scale, which earlier techniques handled poorly.
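To make the contrast concrete, here is a minimal sketch using only the Python standard library. The `sales` table, the customer names, and the email text are made up for illustration. The structured version answers a question with one query; the unstructured version has to guess at where the numbers are.

```python
import re
import sqlite3

# Structured: a hypothetical sales table with a fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Ana", 120.0), ("Ben", 80.0), ("Ana", 50.0)],
)
# One query answers "how much did Ana spend?"
total_ana = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE customer = ?", ("Ana",)
).fetchone()[0]

# Unstructured: similar facts buried in free text (made-up email).
email = "Hi, Ana here. I paid 120 last week and another 50 today. Thanks!"
# A regular expression can pull out the numbers, but it has no idea
# which of them are payments, dates, or quantities.
amounts_in_text = [float(m) for m in re.findall(r"\b\d+(?:\.\d+)?\b", email)]

print(total_ana)        # 170.0
print(amounts_in_text)  # [120.0, 50.0]
```

The regular expression only works here because the email happens to contain nothing but payment amounts. Real text is messier, which is where learned models come in.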
Semi-structured data
Between the two extremes sits semi-structured data: data that has some organisation but no rigid tabular structure. JSON files, XML documents, HTML pages, and emails with consistent headers all fall into this category. Much of the data on the web is semi-structured.
A customer record stored as JSON has labelled fields (name, email, purchases) but the values can vary in length and type. It's not a perfect spreadsheet row, but it's not freeform text either.
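As a rough illustration, the sketch below parses a few hypothetical customer records and flattens them into uniform rows. The field names and values are invented, and `to_row` is an illustrative helper rather than a standard function.

```python
import json

# Hypothetical customer records: labelled fields, but not every record
# has the same fields, and list lengths vary.
raw = """
[
  {"name": "Ana Ruiz", "email": "ana@example.com", "purchases": ["book", "lamp"]},
  {"name": "Ben Okoye", "email": "ben@example.com", "purchases": []},
  {"name": "Chen Li", "purchases": ["desk"], "notes": "prefers phone contact"}
]
"""
records = json.loads(raw)


def to_row(record):
    """Flatten one JSON record into a fixed set of columns."""
    return {
        "name": record.get("name"),
        "email": record.get("email"),  # missing field becomes None
        "purchase_count": len(record.get("purchases", [])),
    }


rows = [to_row(r) for r in records]
for row in rows:
    print(row)
```

Turning semi-structured records into rows like this forces decisions about missing fields and nested values that a spreadsheet would never raise.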
Quantitative vs Qualitative data
Quantitative data is numerical — things you can measure and calculate. Age, temperature, price, height, sales figures. You can add, average, and compare it directly.
Qualitative data is descriptive: categories, labels, opinions, text. Customer reviews, survey responses, product categories, interview transcripts. It's richer in meaning but harder to analyse computationally without first converting it into a numerical form.
Much of what AI does is convert qualitative information into quantitative representations — turning words into numbers (embeddings), images into pixel values, categories into numerical codes.
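Here is a small sketch of three simple conversions: integer codes, one-hot vectors, and word counts. Word counts (a "bag of words") are a much cruder stand-in for learned embeddings, used here only because they need no trained model. All category names and the review text are made up.

```python
categories = ["books", "toys", "books", "garden"]

# Integer codes: give each distinct category a number (sorted for a stable order).
vocab = sorted(set(categories))  # ['books', 'garden', 'toys']
code_of = {name: i for i, name in enumerate(vocab)}
codes = [code_of[c] for c in categories]

# One-hot vectors: one slot per category, with a 1 in the matching slot.
one_hot = [[1 if i == code else 0 for i in range(len(vocab))] for code in codes]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary term appears in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


review_vocab = ["good", "bad", "delivery"]
vector = bag_of_words("Good product but bad bad delivery", review_vocab)

print(codes)    # [0, 2, 0, 1]
print(one_hot)  # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(vector)   # [1, 2, 1]
```

One-hot vectors avoid implying that "toys" (code 2) is somehow greater than "books" (code 0), which is why they are often preferred over raw integer codes for unordered categories.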
Labelled vs Unlabelled data
This distinction is critical in machine learning:
- Labelled data: each example comes with the correct answer, such as a photo tagged "cat" or "dog", or an email marked "spam" or "not spam". It is required for supervised learning and expensive to create, because humans usually have to do the labelling.
- Unlabelled data: raw data with no annotations. The vast majority of data in the world is unlabelled. It is used in unsupervised learning and in self-supervised learning (the technique used to pretrain LLMs).
Creating labelled training data often requires human expertise and enormous time. Medical AI systems need radiologists to label thousands of scans. This bottleneck — the cost of labelling — is one of the main reasons AI development is so resource-intensive.
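The difference can be sketched in a few lines. The example texts are invented, and `next_word_pairs` is a deliberately simplified illustration of the self-supervised idea, not how any particular LLM is actually trained. The point is that the "labels" are carved out of the raw text itself, so no human annotator is needed.

```python
# Labelled data: every example is paired with a human-provided answer.
labelled = [
    ("win a free prize now", "spam"),
    ("meeting moved to 3pm", "not spam"),
]

# Unlabelled data: just the raw examples.
unlabelled = ["the cat sat on the mat"]


def next_word_pairs(sentence):
    """Build (context, target) pairs where the target is simply the next word."""
    words = sentence.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs(unlabelled[0])
for context, target in pairs:
    print(context, "->", target)
```

A six-word sentence yields five training pairs at no labelling cost, which hints at why self-supervised learning scales to very large text collections.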
Time series data
Time series data consists of data points collected over time, such as stock prices, sensor readings, weather measurements, and heart rate readings. It has special properties: order matters, recent data is often more relevant than old data, and patterns often repeat in cycles (daily, seasonally). Forecasting models are designed specifically for this type of data.
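These properties can be seen in a minimal sketch with made-up sensor readings that repeat every three steps. The seasonal naive forecast shown here is one of the simplest baselines, not a recommendation. Note that the split into training and test data is chronological rather than random, because order matters.

```python
# Hypothetical readings with a repeating cycle of length 3 and a slow upward drift.
readings = [10, 12, 14, 11, 13, 15, 12, 14, 16]


def moving_average(series, window):
    """Smooth a series by averaging each value with the values just before it."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def seasonal_naive_forecast(series, period, steps):
    """Forecast by repeating the most recent full cycle."""
    last_cycle = series[-period:]
    return [last_cycle[i % period] for i in range(steps)]


# Chronological split: train on the past, test on the future. Never shuffle.
split = len(readings) * 2 // 3
train, test = readings[:split], readings[split:]

forecast = seasonal_naive_forecast(train, period=3, steps=len(test))
smoothed = moving_average(readings, window=3)

print(forecast)     # [11, 13, 15]
print(test)         # [12, 14, 16]
print(smoothed[0])  # 12.0
```

The forecast captures the cycle but misses the upward drift by one unit at every step, a reminder that real forecasting models have to handle trend as well as seasonality.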
Tabular, image, text, and audio — the four main modalities
From an AI perspective, data is often categorised by its modality — the type of information it represents:
- Tabular — rows and columns. The traditional domain of classical machine learning (decision trees, gradient boosting).
- Image — pixel grids. The domain of convolutional neural networks and vision transformers.
- Text — sequences of words or tokens. The domain of large language models.
- Audio — waveforms sampled over time. Closely related to image processing — audio is often converted to visual spectrograms before processing.
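The audio-to-spectrogram step can be sketched with the standard library alone. The sample rate, frame size, and tone frequencies below are toy values chosen to keep the arithmetic exact, and the plain discrete Fourier transform stands in for the faster, windowed transforms real audio pipelines typically use. Each row of the result describes one slice of time, and each column one frequency band, which is what makes the output image-like.

```python
import cmath
import math

SAMPLE_RATE = 64  # samples per second (toy value, far below real audio rates)
FRAME_SIZE = 16   # samples per frame, so each frequency bin spans 4 Hz


def sine(freq_hz, n_samples):
    """Generate a pure tone as a list of samples."""
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n_samples)]


def dft_magnitudes(frame):
    """Magnitude of each frequency bin from 0 up to the Nyquist frequency."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(waveform, frame_size):
    """Split the waveform into non-overlapping frames and transform each one."""
    starts = range(0, len(waveform) - frame_size + 1, frame_size)
    return [dft_magnitudes(waveform[s : s + frame_size]) for s in starts]


# Half a second of an 8 Hz tone followed by half a second of a 16 Hz tone.
waveform = sine(8, 32) + sine(16, 32)
spec = spectrogram(waveform, FRAME_SIZE)

# The strongest bin in each frame shows the pitch rising over time.
loudest_bins = [row.index(max(row)) for row in spec]
print(len(spec), len(spec[0]))  # 4 9
print(loudest_bins)             # [2, 2, 4, 4]
```

With 4 Hz per bin, bin 2 corresponds to 8 Hz and bin 4 to 16 Hz, so the grid records both what was played and when.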
Modern multimodal models can work across several of these modalities at once. GPT-4o accepts text, image, and audio input, for example, while Claude 3 accepts text and images.
A note on data collection ethics
How data is collected matters enormously. Was it collected with informed consent? Does it represent all the people it needs to represent? Was it scraped from the web without permission? These questions are not just ethical — they affect the legal standing and social acceptance of AI systems built on that data.
Key takeaways
- Structured data is tabular (rows/columns); unstructured data has no set format and is commonly estimated to make up around 80% of all data
- Quantitative data is numerical; qualitative data is descriptive — AI often converts qualitative to quantitative
- Labelled data is required for supervised learning and is expensive to create
- The four main data modalities: tabular, image, text, audio
- Multimodal models can handle multiple data types simultaneously