Everything in AI starts with data. Models don't learn from intuition or experience the way humans do — they learn from data. The quality, quantity, and diversity of data directly determine how good an AI system will be. Understanding data is foundational to understanding AI.

Simple definition

Data is any recorded information — numbers, text, images, audio, video, sensor readings, clicks, transactions. If it can be stored and processed by a computer, it's data.

Data is everywhere

Every time you search online, send a message, buy something, stream a song, or walk past a camera — data is generated. We now produce an extraordinary amount of it. As of 2024, the world generates an estimated 120 zettabytes of data per year — a number so large it's almost meaningless without context: that's roughly 120 trillion gigabytes.
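
That conversion is easy to check. Using the standard decimal (SI) definitions — a zettabyte is 10^21 bytes, a gigabyte is 10^9 — one zettabyte is a trillion gigabytes:

```python
ZETTABYTE = 10**21  # bytes (SI decimal definition)
GIGABYTE = 10**9    # bytes

zb_per_year = 120
gb_per_year = zb_per_year * ZETTABYTE // GIGABYTE

print(f"{gb_per_year:,} GB per year")  # 120,000,000,000,000 — 120 trillion GB
```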

This explosion of data — combined with cheap storage and fast processors — is one of the main reasons AI has advanced so dramatically in the past decade. Modern AI systems are voracious data consumers. GPT-4 was trained on roughly a trillion words of text.

Why data matters so much to AI

In traditional programming, humans write the rules. In AI, the data provides the rules — implicitly. A spam filter doesn't get told "emails with the word 'prize' are spam." It reads millions of emails labelled as spam or not spam, and figures out the patterns on its own.
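
To make that concrete, here is a minimal sketch — a toy word-counting classifier on four hand-written messages, not a real spam filter. Notice that the word "prize" is never hard-coded as spam anywhere; its score emerges purely from the labelled counts:

```python
from collections import Counter

# Tiny labelled corpus; a real filter would learn from millions of emails.
emails = [
    ("you won a prize claim now", "spam"),
    ("claim your free prize today", "spam"),
    ("meeting moved to 3pm", "ham"),
    ("lunch tomorrow with the team", "ham"),
]

# Count how often each word appears in spam vs non-spam ("ham") messages.
counts = {"spam": Counter(), "ham": Counter()}
for text, label in emails:
    counts[label].update(text.split())

def spam_score(word):
    """Fraction of this word's occurrences that were in spam, with add-one smoothing."""
    s, h = counts["spam"][word], counts["ham"][word]
    return (s + 1) / (s + h + 2)

print(spam_score("prize"))    # high — it appeared only in spam
print(spam_score("meeting"))  # low — it appeared only in ham
```

The pattern "prize suggests spam" was discovered from the data, not written by a programmer — which is exactly why the data's contents matter so much.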

This means the AI system will reflect whatever is in its training data — the good and the bad. Biased data produces biased models. Incomplete data produces models with blind spots. Noisy data produces unreliable models.

Real-world example

Amazon once built an AI hiring tool trained on a decade of historical resumes. The problem: those resumes came mostly from men, because the tech industry had hired mostly men. The AI learned to penalise resumes that included the word "women's" (as in women's chess club) and downgraded graduates of all-women's colleges. Amazon scrapped the tool in 2018.

The data lifecycle

Data doesn't just appear ready to use. It goes through several stages:

  1. Collection — gathering raw data from sensors, forms, scrapers, transactions, surveys, or APIs
  2. Storage — saving it in databases, data warehouses, or data lakes
  3. Cleaning — fixing errors, handling missing values, removing duplicates. Often the most time-consuming step.
  4. Transformation — converting data into a format useful for analysis or model training
  5. Analysis / Training — extracting insights or training a model
  6. Deployment — using the results in a product or decision
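
The cleaning step above is worth seeing in miniature. A hedged sketch using plain in-memory records (a real pipeline would typically use a library such as pandas): removing duplicates, flagging an impossible value, and filling a missing one.

```python
# Raw survey records with typical problems: a duplicate,
# a missing age, and an impossible (data-entry error) age.
raw = [
    {"name": "Ana",  "age": 34},
    {"name": "Ana",  "age": 34},    # exact duplicate
    {"name": "Ben",  "age": None},  # missing value
    {"name": "Carl", "age": -5},    # data-entry error
    {"name": "Dee",  "age": 41},
]

# 1. Remove exact duplicates while preserving order.
seen, cleaned = set(), []
for rec in raw:
    key = (rec["name"], rec["age"])
    if key not in seen:
        seen.add(key)
        cleaned.append(rec)

# 2. Treat impossible ages as missing.
for rec in cleaned:
    if rec["age"] is not None and not (0 <= rec["age"] <= 120):
        rec["age"] = None

# 3. Fill missing ages with the rounded mean of the valid ones.
valid = [r["age"] for r in cleaned if r["age"] is not None]
mean_age = sum(valid) / len(valid)
for rec in cleaned:
    if rec["age"] is None:
        rec["age"] = round(mean_age)

print(cleaned)
```

Even this toy example involves judgement calls — is -5 an error or a typo for 5? should missing ages be filled with the mean or dropped? — which is why cleaning is often the most time-consuming stage of the lifecycle.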

Data quality vs data quantity

More data isn't always better. A million poorly labelled images will produce a worse model than a hundred thousand accurately labelled ones. Data quality — accuracy, completeness, consistency, and relevance — matters at least as much as quantity.

That said, scale does matter. Large language models benefit enormously from vast amounts of text. The interplay between quality and quantity is one of the central challenges of building AI systems.

Key insight

"Garbage in, garbage out" is one of the oldest sayings in computing — and it's never been more true than in AI. The best algorithms in the world can't compensate for fundamentally bad data.

Personal data and privacy

Much of the data powering AI is personal — our browsing habits, purchases, health records, location history, and social interactions. This raises important questions about consent, privacy, and who benefits from data collection. We explore these in the Ethics & Society module.

Key takeaways

  • Data is any recorded information — text, numbers, images, audio, sensor readings
  • AI learns from data instead of hand-coded rules — making data quality critical
  • Biased or incomplete training data produces biased or incomplete AI systems
  • The data lifecycle: collect → store → clean → transform → train → deploy
  • "Garbage in, garbage out" — the best algorithm can't fix fundamentally bad data