Ever played that “guess the number” game? Someone thinks of a number, you guess, they tell you higher or lower. After a few tries, you lock in the answer.

Now imagine a twist: the opponent isn’t a person, but a neural network. You feed it an input, it outputs 0. No matter what you give it, it’s always 0. The question is: does a special input exist that makes it output 1? If so, how do you find it?

This is a machine-learning puzzle that Jane Street, a quantitative trading firm, released last year. Unlike a typical CTF, they hand you the complete neural network weights: no black box, all the information is right there. Yet the puzzle still stumped a lot of people.

A Different Kind of ML Puzzle

Most ML-related CTF problems give you a black-box model and you try to trick it into outputting specific results. The common approach is adversarial example attacks, using gradient descent to find an input that makes the network “mess up.”

Jane Street’s problem is different. They give you the model.pt file directly, containing all the weights. The problem statement is absurdly simple:

Today I went hiking and found a bunch of tensors buried under a Neolithic tomb! The model outputs 0 for almost all inputs. If you can figure out what it’s actually doing, please tell us.

The PyTorch model file is just a pickle - load it and you can see all the parameters. At first glance, shouldn’t this be a freebie? All the information is in your hands, what tricks could you possibly need?

But this network has a special characteristic: you can’t use traditional backpropagation to solve it. The network’s structure determines that propagating gradients back from the output is useless. You have to truly understand what this network is computing to find the answer.


First Glance: This Wasn’t Trained

A college student named Alex decided to give it a try. He loaded the model and first looked at the last layer’s weights:

import torch
import plotly.express as px

# The file pickles full nn.Module objects, so allow full unpickling
# (weights_only=False is required on recent PyTorch versions)
model = torch.load('./model.pt', weights_only=False)
linears = [x for x in model if isinstance(x, torch.nn.Linear)]
px.imshow(linears[-1].weight.detach().numpy()).show()

At a glance, you can tell this isn’t a properly trained network. All weights are integers - carefully designed integers. This network was hand-written, not optimized through gradient descent.

The last layer maps 48 inputs to a single output, and its weights are clearly divided into three 16-wide segments. The second-to-last layer repeats the same weight pattern three times, with biases that are the same 16-byte vector incremented by 1, 2, and 3.

Alex quickly realized what this pattern does. The second-to-last layer, combined with the ReLU activation, is checking whether two 16-byte values are equal, one neuron per byte per copy. The three copies compute ReLU(v-x-1), ReLU(v-x), and ReLU(v-x+1), and the last layer's weights 1, -2, 1 combine them: the sum is 1 exactly when v equals x, and 0 otherwise.

It’s like designing a precision lock - only when the correct 16-byte “key” is input will the last layer activate.
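The comparison gadget described above is easy to verify on its own. A minimal sketch (the function name is mine, not from the model):

```python
def relu(x):
    return max(0, x)

def byte_equal(v, x):
    """Tent function built from three ReLU neurons: outputs 1 exactly
    when the integers v and x are equal, and 0 otherwise.

    The three "copies" compute ReLU(v-x-1), ReLU(v-x), ReLU(v-x+1);
    the output layer combines them with weights 1, -2, 1.
    """
    d = v - x
    return relu(d - 1) - 2 * relu(d) + relu(d + 1)
```

For any integer distance d, the combination (d-1) - 2d + (d+1) cancels to 0 on both slopes of the tent; only d = 0 leaves the peak value 1 standing.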

The Question Becomes: What’s This Lock Locking?

Since the network’s end is checking a specific 16-byte value, all the preceding layers must be computing something and comparing the result against this value.

This network has over 2500 linear layers. Alex started tracing layer by layer, trying to figure out what each part computes. But he quickly realized this was brute force work - the network structure was too complex to manually diagram.

He switched approaches: treat the entire network as a linear programming problem to solve.

ReLU activation functions aren’t linear themselves, but they can be modeled by introducing integer variables. If a neuron’s activation is negative, ReLU turns it to 0, so you can use a boolean variable to represent “was this neuron clipped?” This transforms the entire network into a Mixed Integer Linear Programming (MILP) problem.
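The standard trick is a "big-M" encoding: for y = ReLU(x) with a known bound M on |x|, the constraints y >= x, y >= 0, y <= x + M(1-z), y <= Mz with a binary z admit only y = max(0, x). A toy brute-force check of that claim (the function name and the small M are mine):

```python
def feasible_y(x, M=50):
    """Enumerate all (y, z) satisfying the big-M encoding of y = ReLU(x).

    Constraints: y >= x, y >= 0, y <= x + M*(1 - z), y <= M*z, z in {0, 1}.
    For integer x with |x| < M, the only feasible y should be max(0, x).
    """
    ys = set()
    for z in (0, 1):
        for y in range(0, M + 1):              # y >= 0 by construction
            if y >= x and y <= x + M * (1 - z) and y <= M * z:
                ys.add(y)
    return ys
```

A real MILP solver does exactly this pruning symbolically, one binary variable per ReLU, which is why the variable count explodes with network size.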

Sounds great, but running it was a disaster. The network had millions of variables, and the integer programming solver ran for a day without results.

Network Diet: Cutting 80% of Useless Nodes

Alex discovered this network has a characteristic: most neurons aren’t doing anything.

Specifically, if a neuron has only one input edge with weight exactly 1, it’s essentially just passing the previous neuron’s value through unchanged. Such nodes can be merged directly.

There are more complex simplification rules too. For example, if all of a neuron's input weights are positive, then since its inputs are themselves ReLU outputs and thus non-negative, it can never be clipped: the ReLU can be dropped and its input wired straight to its output. Or if two neurons in the same layer receive exactly the same inputs with the same weights, they can be merged.
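A sketch of how such a pass might classify neurons (rule names and the zero-bias condition on pass-through nodes are my assumptions, not the solver's actual code):

```python
def simplify_neuron(in_weights, bias):
    """Classify a neuron for a simplification pass.

    in_weights: dict mapping source neuron -> weight. Inputs are assumed
    to be ReLU outputs and therefore non-negative. A hypothetical sketch.
    """
    # Single input with weight 1 and no bias: just copies its input.
    if len(in_weights) == 1 and next(iter(in_weights.values())) == 1 and bias == 0:
        return "passthrough"
    # All weights and the bias non-negative: pre-activation can't go
    # negative, so the ReLU never clips and can be removed.
    if all(w >= 0 for w in in_weights.values()) and bias >= 0:
        return "relu_is_identity"
    return "keep"
```

Running a few passes of rules like these, plus merging duplicate neurons, is what shrinks the graph so dramatically.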

After rounds of simplification, the network shrank from 2 million nodes to 75,000. That’s huge progress, but the solver still couldn’t handle it.

From Linear Programming to SAT Solver

Still too many nodes. Alex had another idea: propagate bound values.

Starting from the input layer, calculate the maximum and minimum possible values for each neuron layer by layer. Many neurons become very constrained after this analysis - their value ranges become extremely narrow, like only 0 or 1.
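Interval bound propagation through one Linear+ReLU layer is a few lines: split the weight matrix into its positive and negative parts, pair each with the matching input bound, then clamp at zero. A minimal sketch:

```python
import numpy as np

def propagate_bounds(lo, hi, W, b):
    """Propagate elementwise bounds [lo, hi] through y = ReLU(W @ x + b).

    Positive weights pull from the matching bound, negative weights from
    the opposite one; ReLU then clamps both bounds at 0.
    """
    Wp, Wn = np.maximum(W, 0), np.minimum(W, 0)
    pre_lo = Wp @ lo + Wn @ hi + b
    pre_hi = Wp @ hi + Wn @ lo + b
    return np.maximum(pre_lo, 0), np.maximum(pre_hi, 0)
```

For example, a neuron computing x1 - x2 with both inputs in [0, 1] gets pre-activation bounds [-1, 1], which ReLU narrows to [0, 1].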

Since many variables can only be boolean values, why not convert directly to a SAT problem? Each neuron at each possible value corresponds to a boolean variable, and the relationships between layers become boolean constraints.

After conversion, there were about 200,000 boolean variables. The SAT solver ran for a day, simplifying the problem to 20,000 variables, then got stuck. The core program was still too complex to brute force.
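To see why narrow bounds make a boolean encoding natural: once a neuron's inputs are known to be 0 or 1, the neuron often reduces to a logic gate. For instance, ReLU(a + b - 1) over boolean inputs is exactly AND, so the layer constraint becomes the clause set for y <-> (a AND b). A brute-force check of that reduction (a real solve would hand the clauses to a SAT solver):

```python
from itertools import product

def relu(x):
    return max(0, x)

# Verify that the neuron y = ReLU(a + b - 1) equals AND on all
# boolean assignments, so it can be encoded as pure boolean constraints.
for a, b in product((0, 1), repeat=2):
    assert relu(a + b - 1) == (a & b)
```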

The Epiphany: It’s a Hash Function


Since brute force wasn’t working, Alex changed his thinking. This network was intentionally designed this way, so the designer must have left some “backdoor” or pattern.

He noticed the network’s layer widths showed periodic variation: 32 cycles, each cycle length 48. This structure is too regular to be randomly generated.

What algorithm uses 32 rounds of iteration, processing 48 bits of data per round? He asked ChatGPT, and the answer pointed in one direction: hash functions.

MD5, SHA-1, SHA-256 - these common hash functions all use multiple rounds of iteration to scramble data. Alex started trying them one by one, inputting strings, calculating various hash values, then comparing against the second-to-last layer’s biases.

The answer revealed: this network implements the MD5 hash function.

Can You Reverse a Hash Function?

Now the problem became clear. The network’s second-to-last layer bias encodes a specific MD5 hash value. You need to find an input string whose MD5 hash exactly equals this value.

MD5 is a one-way function: there is no practical way to invert it directly. But the problem hint said the answer is two lowercase English words separated by a space. That constraint is crucial: it shrinks the search space from "all byte strings" to something a wordlist attack can cover.
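With an n-word list, the two-word search is only n-squared hashes, which is cheap even for a 100k-word dictionary. A sketch of the attack (function name and the toy wordlist are mine):

```python
import hashlib
from itertools import product

def find_preimage(words, target_digest):
    """Search 'word1 word2' phrases for one whose MD5 equals the target.

    Returns the matching phrase, or None if the wordlist doesn't cover it.
    """
    for w1, w2 in product(words, repeat=2):
        phrase = f"{w1} {w2}"
        if hashlib.md5(phrase.encode()).digest() == target_digest:
            return phrase
    return None

# Toy demo: recover a known phrase from a tiny wordlist.
demo_target = hashlib.md5(b"blue fish").digest()
print(find_preimage(["red", "blue", "fish"], demo_target))
```

If the search comes up empty, the fix is exactly what Alex did: try a bigger wordlist.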

Alex initially tried brute-forcing with a common wordlist, but the answer wasn't among the top 10,000 most frequent English words. He switched to a larger wordlist and finally found the solution.

What’s the answer? Jane Street didn’t disclose it, but the solving approach itself is more valuable than the answer.

An Accidentally Discovered Bug

During the solving process, Alex discovered a bug in this network.

MD5's padding step encodes the input message's length into the data. This network stores that length in 4 bytes, little-endian. The correct encoding would split the length value across all 4 bytes, but the network dumps the whole value into the first byte, so the encoding breaks as soon as the input exceeds 255 bytes.
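The difference is easy to see side by side. A sketch (the "buggy" version is my reading of the bug as described, not code lifted from the network):

```python
def encode_length_le(n, width=4):
    """Correct little-endian split of a length into `width` bytes."""
    return [(n >> (8 * i)) & 0xFF for i in range(width)]

def encode_length_buggy(n, width=4):
    """The bug as described: the whole value lands in the first byte,
    so anything above 255 no longer fits in a single byte."""
    return [n] + [0] * (width - 1)

# The two agree below 256, then diverge: 300 = 0x012C.
assert encode_length_le(200) == encode_length_buggy(200) == [200, 0, 0, 0]
assert encode_length_le(300) == [44, 1, 0, 0]
assert encode_length_buggy(300) == [300, 0, 0, 0]   # 300 is not a valid byte
```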

Alex spent two days reverse-analyzing this bug, wondering if he could use it to simplify the problem. He found out the bug was unintentional - not a designed trap in the problem - and didn’t help with solving. But this analysis process demonstrated very solid reverse engineering skills.

Why This Problem Is Worth Doing

Why did Jane Street create this problem?

On the surface it’s a fun brain teaser, but actually it’s a recruitment tool. People who can solve it likely have these abilities:

  • Understanding neural network internals, not just knowing how to call APIs
  • Translating abstract math problems into concrete solving strategies
  • Knowing when to brute force and when to look for patterns when facing complex problems
  • Having patience to trace details and discover anomalies

Jane Street’s research team often needs to understand complex models’ internal mechanisms. This problem tests exactly this “mechanistic interpretability” capability.

If you’re interested in such problems, Jane Street later released a second problem: a neural network’s layers have been shuffled, and you need to put them back in the correct order. Check their Hugging Face page if you’re interested.

Summary

The core of this problem isn’t machine learning - it’s reverse engineering. All the network weights are given to you, there’s no secret. But seeing the outline of an MD5 hash function from millions of integers requires analytical ability and patience.

Traditional deep learning approaches don’t work here. Gradient descent, backpropagation, adversarial examples - these tools all fail. You have to return to more fundamental methods: analyzing structure, simplifying problems, finding patterns.

This ability is very useful in real work too. When you need to debug a model performing abnormally in production, or understand a third-party library’s internals, you’re using similar skills.

Jane Street is right: someone who can solve this problem would probably do well on their team. The kind of meticulousness revealed in the solving process is exactly what they want.

FAQ

Why can’t gradient descent be used to solve this?

The network's design is special. It uses a lot of ReLU activations, which have zero gradient on the negative side. Since the network outputs 0 for almost all inputs, the gradients propagated back are essentially all zero, and gradient descent has nothing to optimize. You have to understand what the network is actually computing, not just treat it as a differentiable function to optimize.
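The dead-gradient effect shows up even in one dimension. A tiny numerical check (central differences as a stand-in for backprop):

```python
def relu(x):
    return max(0.0, x)

def numeric_grad(f, x, eps=1e-6):
    """Central-difference derivative, a stand-in for backprop."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# In the flat region the gradient is exactly zero: gradient descent gets
# no signal about which direction to move the input.
assert numeric_grad(relu, -3.0) == 0.0
assert abs(numeric_grad(relu, 2.0) - 1.0) < 1e-6
```

When nearly every neuron on the path to the output sits in its flat region, the product of these per-neuron derivatives is zero end to end.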

What is mechanistic interpretability?

Simply put, it’s “opening the black box to see the gears.” Traditional machine learning research focuses on model performance, while mechanistic interpretability focuses on the internal working mechanisms of models - what each neuron is computing, how layers cooperate. This Jane Street problem is essentially testing this ability.

What practical uses does this reverse engineering technique have?

Quite a few. For example, if your model suddenly performs anomalously in production, you need to figure out which layer is the problem. Or if you’re auditing a third-party model to confirm it has no hidden backdoors. Or if you want to do model compression and need to find which neurons are redundant. Understanding model internals is more valuable than just tweaking parameters.
