Types of Data — AI Reference Library

Not all data is the same. Understanding the different types of data is essential for knowing which AI techniques apply to a given problem — and why some problems are much harder than others.

Structured vs Unstructured data

The most important distinction in data is whether it has a predefined format.

Structured data	Unstructured data
Organised in rows and columns	No predefined format
Lives in spreadsheets and databases	Lives in files, emails, images, audio
Easy to search and query	Requires AI to extract meaning
~20% of all data generated (a commonly cited estimate)	~80% of all data generated (a commonly cited estimate)
Examples: sales records, customer tables, stock prices	Examples: emails, social posts, photos, PDFs, videos
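To make the "easy to search and query" row concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample values are hypothetical, chosen only for illustration.

```python
import sqlite3

# Structured data: a hypothetical sales table with named, typed columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Ada", 120.0), ("Ben", 35.5), ("Ada", 60.0)],
)

# Because the format is predefined, one query answers "how much did each customer spend?"
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
totals = dict(rows)
conn.close()

# Unstructured data: the same kind of fact buried in free text (made-up example).
email_body = "Hi team, Ada just placed another order, roughly sixty pounds I think."

# A keyword search can find a mention of the customer, but there is no
# "amount" column to sum; the figure has to be interpreted from the wording.
mentions_ada = "Ada" in email_body
```

The query works only because every row follows the same schema. The email carries similar information, but extracting the amount reliably requires some form of language understanding.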

The rise of deep learning was partly driven by AI finally becoming good at processing unstructured data — images, text, speech — which was previously almost impossible to work with at scale.

Semi-structured data

Between the two extremes sits semi-structured data: data that has some organisation but not a rigid tabular structure. JSON files, XML documents, HTML pages, and emails with consistent headers all fall into this category. Much of the data on the web is semi-structured.

Example: JSON (semi-structured)

A customer record stored as JSON has labelled fields (name, email, purchases) but the values can vary in length and type. It's not a perfect spreadsheet row, but it's not freeform text either.
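As a rough illustration of the record described above, the sketch below parses a made-up customer record with Python's standard json module. The field names follow the text (name, email, purchases); the values and the optional gift_wrapped field are assumptions added for the example.

```python
import json

# Hypothetical customer record; all values are illustrative only.
raw = """
{
  "name": "Ada Example",
  "email": "ada@example.com",
  "purchases": [
    {"item": "notebook", "price": 4.50},
    {"item": "pen", "price": 1.20, "gift_wrapped": true}
  ]
}
"""

customer = json.loads(raw)

# Labelled fields can be read by name, much like a column in a table.
name = customer["name"]

# Unlike a spreadsheet row, "purchases" is a list of any length,
# and individual entries may carry fields that others lack.
purchase_count = len(customer["purchases"])
total_spent = sum(p["price"] for p in customer["purchases"])
has_optional_field = ["gift_wrapped" in p for p in customer["purchases"]]
```

The labelled keys give the record some structure, while the variable-length list and the optional field are what keep it from fitting neatly into fixed columns.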

Quantitative vs Qualitative data

Quantitative data is numerical — things you can measure and calculate. Age, temperature, price, height, sales figures. You can add, average, and compare it directly.

Qualitative data is descriptive — categories, labels, opinions, text. Customer reviews, survey responses, product categories, interview transcripts. It's richer in meaning but harder to analyse computationally without converting it somehow.

Much of what AI does is convert qualitative information into quantitative representations — turning words into numbers (embeddings), images into pixel values, categories into numerical codes.
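The simplest of these conversions, turning categories into numerical codes, can be sketched in a few lines of plain Python. The function names and the product list are hypothetical. Learned embeddings are a more sophisticated version of the same idea and are not shown here.

```python
def integer_codes(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories


def one_hot(values):
    """Represent each value as a vector with a single 1 in its category's position."""
    codes, categories = integer_codes(values)
    return [
        [1 if position == code else 0 for position in range(len(categories))]
        for code in codes
    ]


# Hypothetical qualitative data: product categories.
products = ["book", "toy", "book", "garden"]

codes, categories = integer_codes(products)
vectors = one_hot(products)
# categories is ["book", "garden", "toy"], so codes is [0, 2, 0, 1]
```

Integer codes are compact but imply an ordering that may not exist (is "toy" greater than "book"?). One-hot vectors avoid that at the cost of one dimension per category.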

Labelled vs Unlabelled data

This distinction is critical in machine learning:

Labelled data — each example comes with the correct answer. A photo tagged "cat" or "dog". An email marked "spam" or "not spam". Required for supervised learning. Expensive to create because humans must do the labelling.
Unlabelled data — raw data with no annotations. The vast majority of data in the world is unlabelled. Used in unsupervised learning and in self-supervised learning, where the training targets are derived from the data itself (the technique used to pre-train LLMs).
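The difference can be sketched in code. The example below is a toy illustration, not how any particular model is trained: it shows labelled examples as (input, answer) pairs, and then how self-supervised learning can derive training targets from raw text by predicting the next word. The function name and the sample data are hypothetical.

```python
# Labelled data: every example is paired with a human-provided answer.
labelled_emails = [
    ("Win a free prize now", "spam"),
    ("Meeting moved to 3pm", "not spam"),
]

# Unlabelled data: raw text with no annotations.
unlabelled_text = "the cat sat on the mat"


def next_word_pairs(text, context_size=2):
    """Build (context, target) pairs from raw text, with no human labelling."""
    words = text.split()
    pairs = []
    for i in range(context_size, len(words)):
        context = tuple(words[i - context_size:i])
        pairs.append((context, words[i]))
    return pairs


pairs = next_word_pairs(unlabelled_text)
# The first pair is (("the", "cat"), "sat"): the "label" comes from the text itself.
```

Because the targets are produced automatically, this kind of training can use very large amounts of raw text without paying for annotation.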

Why labelling is so expensive

Creating labelled training data often requires human expertise and enormous time. Medical AI systems need radiologists to label thousands of scans. This bottleneck — the cost of labelling — is one of the main reasons AI development is so resource-intensive.

Time series data

Data points collected over time, such as stock prices, sensor readings, weather measurements, and heart rate readings. Time series data has special properties: order matters, recent data is often more relevant than old data, and patterns often repeat in cycles (daily, weekly, seasonally). Forecasting models are specifically designed for this type of data.
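These properties show up directly in how time series are handled in code. The following is a minimal sketch in plain Python, with hypothetical function names and made-up readings: the data is split in time order rather than shuffled, smoothed with a moving average, and forecast with a seasonal naive baseline that simply repeats the last observed cycle.

```python
def chronological_split(series, test_size):
    """Split without shuffling: the most recent points are held out for testing."""
    return series[:-test_size], series[-test_size:]


def moving_average(series, window):
    """Average of each run of `window` consecutive points."""
    return [
        sum(series[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(series))
    ]


def seasonal_naive_forecast(history, season_length, steps):
    """Baseline forecast that repeats the last full cycle seen in the history."""
    last_cycle = history[-season_length:]
    return [last_cycle[i % season_length] for i in range(steps)]


# Made-up sensor readings with a cycle that repeats every 3 steps.
readings = [10, 12, 14, 10, 12, 14, 10, 12, 14]

train, test = chronological_split(readings, test_size=3)
forecast = seasonal_naive_forecast(train, season_length=3, steps=3)
smoothed = moving_average(readings, window=3)
```

Shuffling before splitting, which is common for other kinds of data, would leak future values into training. A seasonal naive baseline like this one is often used as a reference point that more elaborate forecasting models are expected to beat.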

Tabular, image, text, and audio — the four main modalities

From an AI perspective, data is often categorised by its modality — the type of information it represents:

Tabular — rows and columns. The traditional domain of classical machine learning (decision trees, gradient boosting).
Image — pixel grids. The domain of convolutional neural networks and vision transformers.
Text — sequences of words or tokens. The domain of large language models.
Audio — waveforms sampled over time. Closely related to image processing — audio is often converted to visual spectrograms before processing.
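To show what converting audio to a spectrogram involves, here is a deliberately simple sketch in plain Python. It slices a waveform into overlapping frames and measures the strength of each frequency in each frame with a naive discrete Fourier transform. The function names, sample rate, and test tone are assumptions for the example; practical pipelines typically use an optimised FFT library, apply a window function to each frame, and often rescale the frequency axis.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each frequency bin up to the Nyquist bin (naive O(n^2) DFT)."""
    n = len(frame)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        magnitudes.append(abs(total))
    return magnitudes


def spectrogram(signal, frame_size, hop):
    """One row per time frame, one column per frequency bin."""
    starts = range(0, len(signal) - frame_size + 1, hop)
    return [dft_magnitudes(signal[s:s + frame_size]) for s in starts]


# Hypothetical input: a pure 8 Hz tone sampled at 64 samples per second.
sample_rate = 64
tone_hz = 8
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(128)]

spec = spectrogram(signal, frame_size=32, hop=16)

# Each bin spans sample_rate / frame_size = 2 Hz, so the 8 Hz tone lands in bin 4.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
```

The result is a two-dimensional grid (time by frequency), which is why techniques developed for images can be applied to audio.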

Modern multimodal models such as GPT-4o and Claude 3 can work across several of these modalities at once. Both accept text and images, for example, and GPT-4o also accepts audio.

A note on data collection ethics

How data is collected matters enormously. Was it collected with informed consent? Does it represent all the people it needs to represent? Was it scraped from the web without permission? These questions are not just ethical — they affect the legal standing and social acceptance of AI systems built on that data.

Key takeaways

Structured data is tabular (rows/columns); unstructured data has no set format and makes up roughly 80% of all data by common estimates
Quantitative data is numerical; qualitative data is descriptive — AI often converts qualitative to quantitative
Labelled data is required for supervised learning and is expensive to create
The four main data modalities: tabular, image, text, audio
Multimodal models can handle multiple data types simultaneously