Large Language Models (LLMs) like GPT-4, Claude, and Gemini are the engines of the generative AI revolution. This module explains — without heavy mathematics — what they actually are and how they produce such remarkably capable outputs.
The basic idea: next token prediction
At the most fundamental level, an LLM is trained to do one thing: predict the next token given all the previous tokens. A token is roughly a word or part of a word. Given "The cat sat on the", the model predicts "mat" (or "floor", or "sofa" — with different probabilities).
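A real model computes this probability distribution with a neural network over tens of thousands of tokens; the sketch below simply hard-codes a made-up distribution for "The cat sat on the" to show the final sampling step in isolation.

```python
import random

# Hand-made toy distribution (not from a real model) over possible
# next tokens for the prompt "The cat sat on the".
next_token_probs = {"mat": 0.55, "floor": 0.25, "sofa": 0.15, "moon": 0.05}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one token at random, weighted by its probability."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

prompt = "The cat sat on the"
completion = sample_next_token(next_token_probs)
print(f"{prompt} {completion}")  # most often: "The cat sat on the mat"
```

Generating a full response is just this step in a loop: sample a token, append it to the prompt, and predict again.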
This sounds almost comically simple. But when you train a neural network on next-token prediction over hundreds of billions of words drawn from the internet, books, code, and scientific papers, something remarkable emerges. To predict well, the model must implicitly learn grammar, facts, logic, reasoning, style, and much more. The simple task of next-token prediction becomes a path to strikingly general capability.
LLMs don't have intelligence "programmed in." Intelligence emerges from the training process. To predict text accurately across all human knowledge, the model must develop deep representations of how the world works.
The Transformer architecture
Modern LLMs are built on an architecture called the Transformer, introduced by Google researchers in the 2017 paper "Attention Is All You Need". The key innovation is the attention mechanism — a way for the model to weigh how relevant each word is to every other word in the input when making predictions.
When processing "The bank by the river was steep", the word "bank" should attend strongly to "river" to understand it means a riverbank, not a financial institution. Attention learns these relationships automatically from data.
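The core of attention is a weighted comparison between vectors. The sketch below implements scaled dot-product attention weights in plain Python, using tiny 2-dimensional "embeddings" invented for illustration (real models use hundreds or thousands of dimensions and learn them from data).

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query: list[float], keys: list[list[float]]) -> list[float]:
    """Scaled dot-product attention: how strongly should `query`
    attend to each key vector? One weight per key, summing to 1."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Made-up 2-d embeddings: "bank" as the query, with keys for the
# riverbank sense ("river") and the financial sense ("money").
bank  = [1.0, 0.2]
river = [0.9, 0.1]
money = [0.1, 0.9]
w_river, w_money = attention_weights(bank, [river, money])
print(w_river > w_money)  # True: "bank" attends more strongly to "river"
```

In a full Transformer this happens for every word against every other word, in many parallel "heads", at every layer — but each head is doing essentially this computation.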
Pre-training and fine-tuning
LLMs are built in (at least) two stages:
- Pre-training — train on a massive dataset of text using next-token prediction. This is enormously expensive: GPT-4 reportedly cost over $100 million to train. The result is a model that understands language deeply but isn't yet useful as an assistant.
- Fine-tuning (RLHF, Reinforcement Learning from Human Feedback) — the pretrained model is further trained using human feedback to make it helpful, harmless, and honest. Human raters compare pairs of model outputs and indicate which is better; this feedback trains the model to be a useful assistant rather than just a text predictor.
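The pre-training objective itself can be stated in one line: the loss for each prediction is the negative log probability the model assigned to the token that actually came next (cross-entropy). The sketch below uses made-up distributions to show why a confident correct prediction scores better than an unsure one.

```python
import math

def next_token_loss(probs: dict[str, float], true_token: str) -> float:
    """Cross-entropy loss for one prediction: the negative log of the
    probability assigned to the actual next token. Lower is better."""
    return -math.log(probs[true_token])

# Made-up distributions for the prompt "The cat sat on the",
# where the true continuation is "mat".
confident = {"mat": 0.90, "floor": 0.05, "sofa": 0.05}
unsure    = {"mat": 0.20, "floor": 0.40, "sofa": 0.40}

print(next_token_loss(confident, "mat") < next_token_loss(unsure, "mat"))  # True
```

Training nudges the network's billions of parameters to lower this loss, averaged over trillions of such predictions.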
Context window — the model's working memory
An LLM can only "see" a limited amount of text at once — its context window. Everything it uses to generate a response must fit within this window. Older models had windows of 4,000 tokens (~3,000 words). Modern models like Claude and Gemini support 100,000–1,000,000 tokens, enabling analysis of entire books or codebases.
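The practical consequence is truncation: once a conversation exceeds the window, the oldest text simply drops out of view. The toy sketch below uses whole words in place of real subword tokens (actual systems count tokens with a tokenizer).

```python
def fit_context(tokens: list[str], window: int) -> list[str]:
    """Keep only the most recent tokens that fit in the context
    window — anything earlier is invisible to the model."""
    return tokens[-window:]

history = "a very long conversation with many many earlier messages".split()
visible = fit_context(history, window=4)
print(visible)  # ['many', 'many', 'earlier', 'messages']
```

This is why a model can "forget" the start of a very long chat: those tokens are no longer in its working memory at all.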
Emergent capabilities
One of the most surprising findings in LLM research is emergence: as models grow larger, capabilities appear suddenly that smaller models completely lack. Reasoning, multi-step problem solving, few-shot learning (learning a new task from just a few examples in the prompt) — none of these were explicitly trained for. They emerged from scale.
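Few-shot learning in particular needs no special machinery: the examples are just text placed in the context window. Below is a minimal sketch of building such a prompt, with a made-up sentiment task and a hypothetical input/output format.

```python
def few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Build a prompt that teaches a task purely through examples —
    no weight updates, just demonstrations in the context window."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Made-up sentiment task: the model infers the pattern from two examples.
prompt = few_shot_prompt(
    [("I loved this film", "positive"),
     ("Utterly boring", "negative")],
    "Best pizza I've ever had",
)
print(prompt)
```

A sufficiently large model, given this prompt, will typically continue with "positive" — having inferred the task from the pattern alone. Smaller models often cannot, which is exactly the emergence phenomenon described above.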
Key takeaways
- LLMs are trained to predict the next token — intelligence emerges from this simple objective at scale
- The Transformer architecture with attention mechanisms is the foundation of all modern LLMs
- Pre-training is expensive; RLHF fine-tuning is what makes models useful and safe
- Context window is the model's working memory — larger windows enable more complex tasks
- Emergent capabilities — reasoning, few-shot learning — appear unpredictably as models scale