AI systems are, fundamentally, data systems. They are trained on vast collections of human-generated content, they learn from every interaction, and they make inferences about individuals that those individuals may not even be aware of. The relationship between AI and privacy is one of the most important and least understood aspects of modern technology.
What AI systems know about you
Modern AI systems collect, process, and infer information about users in ways that go far beyond what most people realise. Understanding the different types of data involved helps you make more informed choices.
Data you explicitly provide
The questions you ask, the documents you upload, the conversations you have. When you ask a health AI about symptoms, describe a financial situation to a chatbot, or upload a confidential document for summarisation — that data exists somewhere, processed by someone's infrastructure.
Data inferred from your behaviour
How long you pause before typing, which suggestions you accept, what topics you return to repeatedly, how you rephrase questions when you don't get what you want. These behavioural signals reveal preferences, emotional states, and decision patterns that users rarely consciously disclose.
Training data you never gave
Large language models were trained on vast amounts of internet content — including content that people posted publicly without knowing it would be scraped into AI training sets. Your old forum posts, social media content, or professional work may have contributed to training AI systems without your knowledge or consent.
Sensitive attributes you never disclosed
Modern AI can infer sensitive attributes — health conditions, political views, sexual orientation, financial distress — from seemingly innocuous data like purchase patterns, location history, or writing style. You may never disclose these things explicitly, but an AI system may infer them with surprising accuracy.
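The mechanism behind such inferences can be sketched in a few lines. The toy below combines individually innocuous behavioural signals into a probability for one sensitive attribute via a weighted score and a logistic squash; the feature names and weights are entirely invented for illustration, whereas real systems learn such weights from large datasets.

```python
import math

# Hypothetical learned weights mapping everyday signals to one sensitive
# attribute (say, likelihood of financial distress). Purely illustrative.
WEIGHTS = {
    "late_night_sessions": 0.8,
    "payday_loan_searches": 1.6,
    "discount_store_share": 0.9,
}
BIAS = -2.5

def infer_probability(features: dict) -> float:
    """Linear score over behavioural features, squashed to a probability."""
    score = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1 / (1 + math.exp(-score))  # logistic function

# No single signal is revealing, but together they move the score sharply.
p = infer_probability(
    {"late_night_sessions": 1.0, "payday_loan_searches": 1.0, "discount_store_share": 1.0}
)
print(round(p, 2))  # → 0.69
```

The point of the sketch is not the numbers but the shape: each input on its own looks harmless, yet the aggregate is a confident claim about something the user never stated.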
The training data question
One of the most contested privacy questions in AI concerns what data was used to train these systems and whether consent was obtained. Large language models were predominantly trained on data scraped from the internet — Common Crawl, Wikipedia, GitHub, books, and more. Much of this content was created by people who had no expectation it would be used to train commercial AI systems.
Several major lawsuits are underway — from news organisations, authors, and artists — arguing that using copyrighted content in AI training without permission or compensation violates intellectual property law. These cases will shape the norms around training data for years to come.
Risks to individuals
- Data breaches — AI companies store conversation data, and like any company, they are potential targets for hacking. A breach of an AI assistant's conversation logs could reveal sensitive personal information.
- Re-identification — data that appears anonymised can often be re-identified when combined with other datasets. AI systems that seem to work on anonymised data may in practice reveal information about specific individuals.
- Surveillance and profiling — AI enables surveillance at a scale and accuracy previously impossible. Facial recognition deployed in public spaces, sentiment analysis of social media, and predictive policing all raise profound concerns about what kind of society we are building.
- Sensitive inference — as noted above, AI can infer health conditions, political views, and other sensitive attributes from indirect signals. These inferences can then affect insurance pricing, employment decisions, or political targeting.
- Deepfakes and impersonation — voice cloning and image generation make it possible to create convincing fake content of real individuals without their consent. This threatens personal reputation and has been used for harassment, fraud, and non-consensual intimate imagery.
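The re-identification risk above is worth making concrete. The classic attack links an "anonymised" dataset to a named public one through shared quasi-identifiers such as postcode, birth year, and sex. All records below are invented for illustration:

```python
# "Anonymised" records: names removed, but quasi-identifiers retained.
anonymised_health_data = [
    {"zip": "02138", "birth_year": 1954, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth_year": 1971, "sex": "M", "diagnosis": "asthma"},
]

# A named public dataset (e.g. a voter roll) sharing the same attributes.
public_records = [
    {"name": "J. Doe", "zip": "02138", "birth_year": 1954, "sex": "F"},
    {"name": "A. Smith", "zip": "02140", "birth_year": 1980, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def reidentify(anon_rows, named_rows):
    """Link rows that agree on every quasi-identifier."""
    matches = []
    for anon in anon_rows:
        key = tuple(anon[q] for q in QUASI_IDENTIFIERS)
        for named in named_rows:
            if tuple(named[q] for q in QUASI_IDENTIFIERS) == key:
                matches.append({"name": named["name"], "diagnosis": anon["diagnosis"]})
    return matches

# The first "anonymous" record is now attached to a named individual.
print(reidentify(anonymised_health_data, public_records))
```

Removing names is not anonymisation: any combination of attributes rare enough to single someone out works as a key for this join.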
What organisations must do — key regulations
Privacy regulation is evolving rapidly in response to AI. The key frameworks to know:
- GDPR (EU) — the General Data Protection Regulation governs personal data processing in the EU, including rights of access and deletion and limits on solely automated decision-making.
- DPDP Act (India) — the Digital Personal Data Protection Act, 2023 establishes consent requirements and obligations for organisations that process personal data.
- EU AI Act — the first comprehensive AI-specific regulation, imposing risk-tiered obligations on AI systems, including transparency and data governance requirements.
- US state laws — in the absence of a federal privacy law, states such as California (CCPA/CPRA) have enacted their own privacy statutes.
Practical steps for individuals
You have more agency over your AI privacy than you might realise. Practical actions:
- Read the privacy policy — especially whether your conversations are used to train future models. Most major AI assistants allow you to opt out of training data use in settings.
- Don't share what you wouldn't want stored — treat AI conversations like email: assume they could be read by someone else. Don't share passwords, financial details, health information you want kept private, or confidential work data without understanding where it goes.
- Use enterprise or business plans — these typically offer stronger data privacy guarantees than consumer plans. Many AI companies commit not to train on enterprise customer data.
- Consider local models — open models such as Llama can run entirely on your own hardware, processing data locally with nothing transmitted externally. For highly sensitive use cases, this is the most private option.
- Exercise your data rights — under the GDPR and India's DPDP Act you have rights to access and delete data held about you. Most major AI companies have processes for this.
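For the local-model option, querying a model on your own machine can be as simple as a POST to a localhost endpoint. The sketch below assumes an Ollama server running on its default port with a model already pulled; the model name "llama3" is illustrative. Nothing in the prompt leaves your machine.

```python
import json
import urllib.request

# Assumed local Ollama endpoint; no cloud service, no API key.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "llama3") -> dict:
    """Assemble the request body for a single, non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}

def ask_local_model(prompt: str) -> str:
    """Send the prompt to the local server and return the model's reply."""
    payload = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Usage (with the server running): `ask_local_model("Summarise this confidential note: ...")`. The privacy property comes from the transport, not the model: the request never traverses a third party's infrastructure.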
Practical steps for organisations
- Data minimisation — collect only the data you genuinely need. Every additional data point is additional risk.
- Purpose limitation — be clear about what data is used for and don't repurpose it without review.
- Employee AI use policies — employees using consumer AI tools may inadvertently share confidential company data. Clear policies and approved enterprise tools mitigate this.
- Vendor due diligence — understand what your AI vendors do with your data. Get contractual commitments on data not being used for training.
- Privacy impact assessments — for significant new AI deployments, conduct a formal assessment of privacy risks before launch.
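Data minimisation can be enforced in code, not just policy. A common pattern is to redact obvious identifiers from text before it reaches an external AI service. The two regexes below are only a sketch; a real deployment would use a dedicated PII-detection library with far broader coverage.

```python
import re

# Minimal patterns for two common identifier types — illustrative only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def minimise(text: str) -> str:
    """Redact e-mail addresses and phone-like numbers before transmission."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(minimise("Contact jane.doe@example.com or +1 (555) 010-2345 re: the audit."))
# → "Contact [EMAIL] or [PHONE] re: the audit."
```

Run as a gateway in front of any approved AI tool, a filter like this operationalises both data minimisation and an employee AI use policy: what is never sent can never be stored, trained on, or breached.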
Key takeaways
- AI systems collect explicit inputs, infer from behaviour, and were trained on data people never knowingly contributed
- AI can infer sensitive attributes — health, politics, financial state — from indirect behavioural signals
- Key regulations: GDPR (EU), DPDP Act (India), EU AI Act, and state-level laws in the US
- Individual steps: check privacy settings, don't share sensitive data carelessly, use enterprise plans or local models for sensitive work
- Organisation steps: data minimisation, clear policies on employee AI use, vendor due diligence, privacy impact assessments
- Training data consent remains legally and ethically contested — major cases are working their way through courts