Ever wondered how that mysterious giant behind ChatGPT - GPT - actually works?

Don’t worry, Andrej Karpathy is here to save you again. The former Tesla AI Director and OpenAI founding member has launched a super down-to-earth project: Microgpt. As the name suggests, it’s a “mini GPT” - a full 200 lines of Python code, no pip install, no third-party dependencies, training and inference all in one.

You read that right. 200 lines. Not even enough for a moderately complex scraper.

The Backstory

Karpathy has earned the nickname “Father of Minimalism” in AI education. From micrograd, makemore, and nanogpt to now microgpt, he’s always been breaking things down to the absolute basics.

This time he declared: “I can’t simplify this any further. This is the bare minimum of GPT.”

Curiosity killed the cat, but I just had to see what magic these 200 lines contain.

First, Let’s See Microgpt in Action

The training data is 32,033 English names, and the model has only 4,192 parameters (compared to GPT-2’s 1.5 billion and GPT-4’s rumored trillions).

After 1,000 training steps, the model starts “hallucinating” names that look like real names but don’t actually exist:

sample  1: kamon
sample  2: ann
sample  3: karai
sample  4: jaire
sample  5: vialan
sample  6: karia

Think about it - the model really did learn English name spelling patterns: vowel-consonant alternation, common suffixes (-anna, -ia, -en), and so on.

With only 4,192 parameters, this is already a complete language model.

What’s Packed into Those 200 Lines?

Let me crack open this mysterious box for you.

Layer 1: Data

The code reads a list of 32,033 names from input.txt as training data. Each name is one “document.”

docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]

The model’s task is simple: given the preceding characters, predict the next one. For example, given “em” from the name “emma”, the model should predict the next “m”.
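That sliding-window setup can be sketched in a few lines; make_pairs and the "<BOS>" marker below are illustrative names, not identifiers from Karpathy’s file:

```python
# Turn one "document" (a name) into (context, next-char) training pairs.
# <BOS> marks both the start and the end of a name.
def make_pairs(name, bos="<BOS>"):
    chars = [bos] + list(name) + [bos]
    return [(chars[:i], chars[i]) for i in range(1, len(chars))]

pairs = make_pairs("emma")
# e.g. given the context <BOS> e m, the target is the second "m",
# and the final pair teaches the model to emit <BOS> to end the name
```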

Layer 2: Tokenizer

This is the most “primitive” part of the whole pipeline.

uchars = sorted(set(''.join(docs)))  # Collect all unique characters
vocab_size = len(uchars) + 1  # +1 for special token

Each character directly maps to a numeric ID. The English alphabet has 26 letters, plus the special BOS (Begin Of Sequence) token, totaling just 27 “vocabulary” items.

That’s right - fewer words than your college entrance exam vocabulary.

Compared to modern LLMs with tens of thousands of tokens, this is basically a toy. But it’s enough to understand the principle.
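For the curious, completing the snippet above into a round-trippable tokenizer takes only a couple more lines (the three-name docs list here is a toy stand-in for the real dataset):

```python
# Character-level tokenizer: every unique character gets an integer ID,
# and one extra ID is reserved for the special BOS token.
docs = ["emma", "olivia", "ava"]      # toy stand-in for the 32,033 names
uchars = sorted(set("".join(docs)))   # collect all unique characters
BOS = len(uchars)                     # the special token gets the last ID
stoi = {ch: i for i, ch in enumerate(uchars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)
```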

Layer 3: Backpropagation

This is the most hardcore part. Training a neural network requires knowing “which direction to adjust parameters,” and backpropagation does exactly that.

Karpathy wrote a Value class:

class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data                 # the raw number
        self.grad = 0                    # d(output)/d(this), filled in by backward()
        self._children = children        # the Values this one was computed from
        self._local_grads = local_grads  # d(this)/d(child), one per child

Each number is a “node” that records how it was computed (_children) and how each input affects the output (_local_grads).

When you call backward(), gradients flow backward from the result to the parameters like water.

Sound abstract? Karpathy gives an example:

“If a car’s speed is 2x a bicycle’s speed, and a bicycle’s speed is 4x walking speed, then a car’s speed is 2 × 4 = 8x walking speed.”

Chain rule is just multiplication, layer after layer.
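To make that concrete, here is the Value skeleton fleshed out just enough to run the car/bicycle example end to end - a sketch in the spirit of micrograd, not the file’s exact code:

```python
# Minimal working version of the Value idea: each node records its data,
# its children, and the local derivative of itself w.r.t. each child.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # visit nodes in reverse topological order, applying the chain
        # rule: child.grad += upstream grad * local grad
        topo, visited = [], set()
        def build(v):
            if id(v) not in visited:
                visited.add(id(v))
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            for child, local in zip(node._children, node._local_grads):
                child.grad += node.grad * local

# car = 2 x bicycle, bicycle = 4 x walk  ->  d(car)/d(walk) = 2 * 4 = 8
walk = Value(1.0)
bicycle = Value(4.0) * walk
car = Value(2.0) * bicycle
car.backward()
print(walk.grad)  # 8.0
```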

Layer 4: GPT Architecture

The real show begins.

Here’s what the model structure looks like, lightly simplified:

def gpt(token_id, pos_id, keys, values):
    # keys/values are per-layer caches of k and v from earlier positions
    # 1. Embedding: convert the character ID and its position to vectors
    tok_emb = state_dict['wte'][token_id]
    pos_emb = state_dict['wpe'][pos_id]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]  # elementwise add

    # 2. Transformer layers
    for li in range(n_layer):
        # 2.1 Attention: project x into query, key, value
        q = linear(x, state_dict[f'layer{li}.attn_wq'])
        k = linear(x, state_dict[f'layer{li}.attn_wk'])
        v = linear(x, state_dict[f'layer{li}.attn_wv'])
        # cache this position's k and v, then mix all cached values,
        # weighted by softmax of the q·k scores
        keys[li].append(k)
        values[li].append(v)
        x = attention(q, keys[li], values[li])

        # 2.2 MLP feedforward network
        x = linear(x, state_dict[f'layer{li}.mlp_fc1'])
        x = [xi.relu() for xi in x]
        x = linear(x, state_dict[f'layer{li}.mlp_fc2'])

    # 3. Output layer: one logit per vocabulary item
    logits = linear(x, state_dict['lm_head'])
    return logits

The attention mechanism lets each position “look at” the preceding characters and decide what information to gather from them.

For example, when the model sees “em” and needs to predict the third letter, attention lets that position combine what it knows about “e” and “m” - a vowel followed by a consonant - and shift probability toward continuations that fit English name patterns, like the double “m” in “emma”.

That’s how language models learn grammar.
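Numerically, that “looking at preceding characters” is just a softmax-weighted average; attend below is an illustrative single-head version, not the file’s exact helper:

```python
import math

# Attention for one new position: the query scores each cached key,
# softmax turns the scores into weights, and the output is the
# weighted mix of the cached values.
def attend(q, keys, values):
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

# the query matches the second key far more strongly, so the output
# is pulled toward the second value (close to 9, far from 1)
out = attend([1.0, 0.0],
             keys=[[0.0, 1.0], [4.0, 0.0]],
             values=[[1.0], [9.0]])
```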

Layer 5: Training and Inference

Training is constantly repeating: input characters → predict next → compute loss → backpropagate → update parameters.
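Stripped of the neural network, that cycle is plain gradient descent; here it is in miniature on a hypothetical one-parameter model:

```python
# The training cycle in miniature: predict -> loss -> gradient -> update.
# The "model" is pred = w * 2, and we want pred to hit the target 3.0.
w = 0.0
lr = 0.1
target = 3.0
for step in range(100):
    pred = w * 2.0                     # forward pass
    loss = (pred - target) ** 2        # compute loss
    grad = 2 * (pred - target) * 2.0   # backprop: chain rule by hand
    w -= lr * grad                     # update the parameter
# w converges to 1.5, so pred = 3.0 and the loss approaches zero
```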

During inference, the model generates one character at a time, then feeds that character back into the input to continue predicting the next. This continues until the model outputs the special token (BOS doubles as the end-of-name marker here) or reaches the maximum length.
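The feedback loop itself is model-agnostic and can be sketched directly (generate and next_char are illustrative names, not the file’s actual identifiers):

```python
# Autoregressive generation: each output character is appended to the
# context and fed back in, until the special token ends the sequence.
def generate(next_char, bos="<BOS>", max_len=10):
    out = []
    ctx = [bos]
    while len(out) < max_len:
        ch = next_char(ctx)
        if ch == bos:          # the special token doubles as "stop"
            break
        out.append(ch)
        ctx.append(ch)
    return "".join(out)

# toy stand-in for the model: spells a fixed name, then emits BOS
script = iter("ann")
name = generate(lambda ctx: next(script, "<BOS>"))
```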

That’s why ChatGPT streams its output bit by bit: under the hood it generates one token at a time (and in microgpt, a token is a single character).

Running It

You just need:

python train.py

On a MacBook, it takes about 1 minute to see results. Loss drops from about 3.3 (ln 27 ≈ 3.3, i.e. random guessing over the 27-token vocabulary) to around 2.37.

Then you get a bunch of “hallucinated” names that look real but don’t actually exist.

What Can It Do?

Honestly, a small model with 4,192 parameters can’t match ChatGPT’s generation quality.

But its value isn’t in being “useful” - it’s in being “understandable.”

If you’ve always been confused by complex APIs like PyTorch and TensorFlow, and want to understand what neural networks are actually doing, this code is the perfect starting point.

No hidden layers, no magic - just pure mathematics.

Karpathy’s teaching path looks like this:

File       Content
train0.py  Frequency statistics - not even a neural network
train1.py  MLP + manual gradients + SGD
train2.py  Automatic differentiation (the Value class)
train3.py  Position encoding + Transformer
train4.py  Complete GPT + Adam optimizer

Microgpt is the final product.

In Conclusion

Ten years ago, 200 lines of code like this might have passed for a college course assignment.

Ten years later, these simple building blocks have become an AI empire worth hundreds of billions.

If you want to understand how GPT actually works, starting from these 200 lines is enough.


FAQ

Q: What’s the difference between Microgpt and real GPT?
A: The core algorithm is exactly the same - Transformer plus autoregressive generation. The difference is scale: Microgpt has 4,192 parameters, while GPT-4 reportedly has trillions. Scale determines the capability ceiling, but the underlying principles are identical.

Q: What environment do I need to run Microgpt?
A: Just Python. No pip install, no GPU needed. Karpathy deliberately removed all third-party dependencies so that anyone can run it.

Q: Who is this project for?
A: People who want to get into AI/neural networks but have been scared off by complex code. It covers the entire pipeline - data, training, inference - in 200 lines; it’s hard to find a more concise tutorial.




Follow the 梦兽编程 WeChat official account to unlock more cutting-edge tech.