The Stripped-Down Truth -- How AI Actually Works Without the Fancy Talk

AI Fails at Real Intelligence
The Brutal Hardware Reality
First, let's talk about what they don't mention in the glossy marketing: these systems require obscene amounts of computing power. We're talking about warehouse-sized data centers filled with specialized hardware (GPUs or TPUs) running at full blast for weeks or months straight. A single training run for a top-tier model consumes enough electricity to power a small town.
GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) are both specialized hardware designed to accelerate specific types of computations. GPUs were initially developed for rendering graphics in video games, but their architecture, which excels at parallel processing, also makes them well-suited for the matrix multiplication operations that underpin AI and machine learning. TPUs are custom-designed chips developed specifically for AI and machine learning workloads. They are optimized for tensor operations (hence the name) and are generally more efficient and powerful than GPUs for these tasks.
The financial cost is equally staggering. Training a state-of-the-art model like GPT-4 costs somewhere between $10 million and $100 million—just for the computation. That doesn't include research salaries, infrastructure, or the ongoing costs of running the system for users. When a company releases a new AI model, it has burned through millions of dollars in electricity bills to create it.
The Reality Behind AI "Intelligence"
A recent study on how these models handle International Mathematics Olympiad (IMO) problems reveals the uncomfortable truth: they're terrible at actual mathematical reasoning. The study used 455 IMO shortlist problems from 2009 to 2023, covering algebra, geometry, combinatorics, and number theory. These problems were chosen for their originality, mathematical depth, and reliance on high-school-level mathematics—precisely the kind of problems that require genuine reasoning rather than pattern-matching.
Models from industry leaders like OpenAI, Google (Gemini), and DeepSeek were put to the test against these IMO-level problems—and the results are embarrassing:
- Gemini 2.0 had a 0% correct solution rate
- DeepSeek achieved only 3.8% correctness
- OpenAI's models did no better
Even when these systems managed to stumble onto correct final answers, their reasoning was completely flawed. DeepSeek had a 63.2% final answer accuracy but 0% correctness in the solutions that led to those answers. They're guessing, not reasoning.
How These Models Actually Work
Data Hoarding: The First Step
First, they collect as much data as humanly possible. We're talking about:
- Human text
- Books (fiction and non-fiction)
- Academic journals
- Religious texts
- Websites
- Reddit posts
- Email archives
- Code repositories
Basically, if it contains text, they vacuum it up. This creates a massive database of human-generated content.
"Machine Learning" = Pattern Fitting with Statistics
What they call "Machine Learning" is actually just loading this data and fitting statistical patterns to it. They use well-established mathematical techniques like least-squares regression and iterative error minimization—nothing magical.
The system converts text into integer IDs and maps each of those integers, along with their relationships to one another, to arrays of floating point numbers. All of these arrays are then stacked together, and each row in the combined structure is known as a "vector." This is why you'll hear terms like "vector space" or "embedding space"—it's just multi-dimensional arrays of numbers.
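To make "fitting statistical patterns" concrete, here is a minimal sketch in Python of the kind of least-squares fit mentioned above. This is ordinary curve fitting on toy data, not how a large language model is literally trained, and every number in it is made up for illustration.

```python
import numpy as np

# Toy data: x values and noisy y values we want to fit a line to.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Build the design matrix [x, 1] so we fit y ≈ slope * x + intercept.
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: find the parameters that minimize squared error.
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")
```

The point is simply that "finding patterns" means choosing numbers that minimize an error measure, which is exactly what happens at vastly larger scale inside these models.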
Finding Patterns Through Brute Force
The machine then runs optimization algorithms over these columns of floating point numbers, looking for patterns, often making many passes over the same data. The patterns it finds are recorded as numbers called parameters.
These parameters (also called "weights") are what the system adjusts during training. Modern systems (called "models") have billions of these parameters—not because they're doing anything intelligent, but because they're brute-forcing statistical relationships on a massive scale.
Training: Just More Pattern Adjustment
When the machine has extracted patterns from the data, it moves to the "training phase." Training simply means feeding the machine data it hasn't seen before along with the correct expected output.
The algorithms predict values based on the parameters they've established, then check whether the answer matches the expected output. If the answer is wrong, the machine adjusts the parameters until the output is close enough to what's expected. This is what they call "self-supervised training": "unsupervised" in the sense that no human labels the data, but supervised in the sense that the system always knows what the right answer should be (the next word in the original text).
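Here's a minimal sketch, in plain numpy, of what "adjust the parameters until the output is close enough" looks like: predict, measure the error, nudge the weights, repeat. Real models do this with billions of parameters instead of two, but the loop has the same shape. All numbers here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": inputs and the correct expected outputs (y = 3x + 1 plus noise).
x = rng.uniform(-1, 1, size=100)
y = 3 * x + 1 + 0.1 * rng.normal(size=100)

# Two parameters ("weights") that start out random.
w, b = rng.normal(), rng.normal()
learning_rate = 0.1

for step in range(500):
    pred = w * x + b                 # predict based on current parameters
    error = pred - y                 # how far off are we?
    # Gradient of the mean squared error with respect to each parameter.
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w      # adjust parameters toward the expected output
    b -= learning_rate * grad_b

print(f"learned w={w:.2f}, b={b:.2f} (true values were 3 and 1)")
```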
Batch Processing: Training in Bulk
These systems don't train on one example at a time—that would be painfully slow. Instead, they process "batches" of hundreds or thousands of examples simultaneously. This parallel processing is what makes training feasible, though it adds another layer of mathematical complexity as the system has to average the parameter updates across many examples at once.
The batch size is limited by how much memory the GPUs have. Bigger batches generally mean faster training, which is why companies keep buying more expensive hardware with more memory.
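A rough sketch of why batches help: stacking many examples into one matrix lets a single matrix multiplication process all of them at once, which is exactly the kind of work GPUs are built for. The sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, input_dim, output_dim = 256, 512, 512

# 256 examples stacked into one matrix, plus one weight matrix (the parameters).
batch = rng.normal(size=(batch_size, input_dim)).astype(np.float32)
weights = rng.normal(size=(input_dim, output_dim)).astype(np.float32)

# One matrix multiplication pushes the whole batch through the layer at once.
outputs = batch @ weights            # shape: (256, 512)

# All three arrays must sit in GPU memory at once, which is why memory
# limits how big a batch can be.
print(outputs.shape, batch.nbytes + weights.nbytes + outputs.nbytes, "bytes")
```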
Training Never Really Ends
When a training cycle "ends," the accumulated questions and answers are kept for the next version of the AI and fed back in at the self-supervised learning stage. Fresh questions are then generated for more human-reinforced learning, creating a continuous loop of improvement.
Six Common Fallacies in AI "Reasoning"
The math study identified six recurring types of errors that show these systems aren't actually reasoning:
- Proof by Example: Drawing general conclusions based on specific cases without rigorous justification. Like proving a statement for a few values of n and claiming it holds for all n.
- Proposal Without Verification: Suggesting a strategy without proving its validity. Like claiming a winning strategy in a game without analyzing all possible counter-strategies.
- Inventing Wrong Facts: Using non-existent or incorrect theorems. For example, citing a fabricated "Harmonic Square Root Theorem" to justify a claim.
- Begging the Question (Circular Reasoning): Assuming the conclusion within the argument itself. Like assuming a number is irrational to "prove" its irrationality.
- Solution by Trial-and-Error: Guessing solutions without providing reasoning for why they work.
- Calculation Mistakes: Simple errors in arithmetic or algebra that invalidate the solution.
Sound familiar? These are mistakes no competent mathematician would make. Yet AI companies keep claiming their systems can "reason" and "think."
Interestingly, different types of math problems triggered different types of fallacies. Geometry problems demand careful deductive arguments, which led to more fallacies involving fabricated facts or omitted steps. Algebra and number theory problems often involved functional equations or optimization, where models fell back on trial-and-error approaches without providing justification. This shows that these systems haven't truly mastered any area of mathematical reasoning—they just fail in different ways depending on the problem type.
Human-Reinforced Learning: Cheap Labor Fixes Mistakes
The machine enters a "human reinforced learning" phase where outputs are reviewed by humans (often working for low wages in developing countries) who speak enough English to verify if the answers are adequate. These workers approve, adjust, or reject the output, providing feedback to the machine, which then changes its parameters to move toward the desired answer. Nothing magical here—just humans cleaning up the system's mistakes.
This is also why words like "delve" have become so common—they're rarely used in everyday English, but the flood of AI-produced articles has changed that. AI systems pick up these uncommon words from their training data (often from academic or literary texts) and overuse them in generated content, creating a distinct "AI writing style" that human editors now have to fix.
The Overfitting Problem
A dirty secret of these systems is that they often "memorize" parts of their training data instead of truly understanding patterns. This is called "overfitting." For example, if ChatGPT gives you a perfect explanation of the Franco-Prussian War, it's not because it "understands" history—it's because similar text existed in its training data and it's regurgitating the pattern.
Companies try to fix this by using techniques like "dropout" (randomly turning off parts of the network during training) and "early stopping" (quitting training before the model gets too good at memorizing). But overfitting remains a fundamental problem, especially as these models get larger and can memorize more of their training data.
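A minimal sketch of overfitting using a toy polynomial fit: a model with too much capacity can hit every training point exactly while being unreliable on a point it hasn't seen. The data here is made up; overfitting in language models is about memorizing training text rather than curve fitting, but the failure mode is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Eight noisy training points from a simple underlying trend (y = x^2).
x_train = np.linspace(-1, 1, 8)
y_train = x_train**2 + 0.1 * rng.normal(size=8)

# A held-out point the model never sees during "training."
x_test, y_test = 0.9, 0.81

simple = np.polyfit(x_train, y_train, deg=2)     # just enough capacity for the trend
memorizer = np.polyfit(x_train, y_train, deg=7)  # enough capacity to memorize all 8 points

for name, coeffs in [("degree-2", simple), ("degree-7", memorizer)]:
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = (np.polyval(coeffs, x_test) - y_test) ** 2
    print(f"{name}: train error {train_err:.4f}, held-out error {test_err:.4f}")
```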
Verification Failures
When tasked with checking whether a solution is correct, these models performed at or below random guessing:
- DeepSeek identified 48% of correct solutions as correct but also misclassified 43% of incorrect solutions as correct
- Gemini 2.0 performed similarly, with 52% accuracy for correct solutions and 50% for incorrect ones
They can't tell good math from bad math. Would you trust a "doctor" who couldn't tell healthy tissue from diseased tissue half the time?
Academic Cat-and-Mouse Games
Academics create questions designed to confound the systems, along with expected answers. These are then "learned" by adapting parameters to handle these edge cases. It's an endless game of researchers finding flaws, companies patching them, and researchers finding new flaws.
Hardware Evolution: The Real Driver of "Progress"
The improvements in AI over the past decade have less to do with algorithmic breakthroughs and more to do with hardware. NVIDIA, originally a gaming graphics card company, became one of the most valuable companies in the world because their GPUs happen to be good at the matrix multiplication that powers AI.
Each new generation of hardware allows companies to train bigger models on more data. Google created their own specialized chips called TPUs specifically optimized for AI workloads. The "intelligence" of these systems is directly proportional to how many expensive chips you can afford to throw at the problem.
Alignment = Censorship
What they call "alignment with human values" is simply the AI being programmed with rules about what it must not tell people: don't explain how to make bombs, don't write certain kinds of horror stories, don't create sadistic material. It's censorship, plain and simple, implemented through pattern-matching and rejection of certain outputs.
Gaming the Benchmarks
There are standard tests that AI systems are expected to pass. The companies that create these machines ensure their AI will pass these tests by specifically running the questions and answers through the system after alignment and changing the parameters—another brute force win.
The study found that current benchmarks focusing on final answer correctness are completely insufficient for evaluating mathematical reasoning. When a company brags about their AI's performance on math tests, remember: they're often just measuring the ability to guess correctly, not to reason.
Quantization: Making Models Smaller
The full models are so large they're impractical to run on consumer computers. To make them usable, companies use a technique called "quantization"—essentially reducing the precision of the numbers in the model. Instead of storing weights as 32-bit floating point numbers, they might use 8-bit integers or even 4-bit numbers. This makes the models smaller and faster, with some loss in quality.
It's like taking a high-resolution image and saving it as a more compressed JPEG—you lose some detail, but it becomes more practical to store and share.
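A minimal sketch of the idea behind quantization: map 32-bit floating point weights onto a small range of 8-bit integers plus a scale factor, then map them back when they're needed. Real quantization schemes (per-channel scales, 4-bit formats, calibration) are more involved, and the weight values here are invented.

```python
import numpy as np

# A handful of float32 "weights" standing in for part of a model.
weights = np.array([0.42, -1.37, 0.05, 2.10, -0.88], dtype=np.float32)

# Pick a scale so the largest weight maps to the edge of the int8 range.
scale = np.max(np.abs(weights)) / 127.0

# Quantize: store tiny integers instead of 32-bit floats (4x smaller).
quantized = np.round(weights / scale).astype(np.int8)

# Dequantize when the weights are needed: close to the originals, not exact.
restored = quantized.astype(np.float32) * scale

print("int8 values:", quantized)
print("round-trip error:", np.abs(weights - restored))
```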
Inference Optimization: More Tricks to Make It Usable
Even with quantization, these models are still massive computational hogs. Companies use additional tricks to make them run faster at "inference time" (when they're actually being used):
- Knowledge distillation: Training a smaller model to mimic a larger one
- Model pruning: Removing parts of the network that don't contribute much
- Caching: Saving common calculations/results so they don't need to be repeated
- Specialized inference hardware: Custom chips designed specifically to run these models efficiently
Distillation is particularly important. The largest models are too expensive to use directly in commercial products, so companies typically distill their capabilities into smaller, more practical models. This is why a company might brag about a 500-billion parameter research model, but the actual model you interact with might be 10x smaller.
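A minimal sketch of the distillation idea: the big "teacher" model's output probabilities, softened with a temperature, become the training target for a smaller "student" model. The logits below are invented, and real distillation runs this loss over an entire training set.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - np.max(z)                 # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Invented output scores for the same input from a big teacher and a small student.
teacher_logits = np.array([4.0, 2.5, 0.5, -1.0])
student_logits = np.array([2.0, 2.2, 0.1, -0.5])

# Temperature > 1 softens the teacher's distribution so the student also learns
# which wrong answers the teacher considered "almost right."
T = 2.0
teacher_probs = softmax(teacher_logits, T)
student_probs = softmax(student_logits, T)

# Distillation loss: cross-entropy between teacher and student distributions.
# Training adjusts the student's parameters to shrink this number.
loss = -np.sum(teacher_probs * np.log(student_probs))
print(f"teacher targets: {np.round(teacher_probs, 3)}, distillation loss: {loss:.3f}")
```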
Without these optimizations, even a basic conversation with ChatGPT would cost dollars in computing resources rather than pennies.
How the Math Actually Works
Tokenization: Chopping Text Into Numbers
When you input text, the system first breaks it into pieces called "tokens." These might be words, parts of words, or even individual characters. Each token gets assigned a numeric ID:
"I love cats" → [18, 267, 832]
Embeddings: More Numbers Representing Numbers
Each token ID then gets converted into a longer array of floating point numbers (typically hundreds or thousands of values). This is called an "embedding."
Token 832 ("cats") → [0.1, -0.3, 0.8, 0.2, -0.7, ...]
These numbers supposedly capture the "meaning" of the word, but really they just represent statistical patterns of what words tend to appear near each other in the training data.
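And a minimal sketch of the embedding step: each token ID is just an index into a big table of floating point numbers. The table here is random; in a trained model those numbers are parameters learned from the co-occurrence statistics described above.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 1000, 8        # real models use dims in the thousands
embedding_table = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)

token_ids = [18, 267, 832]                 # "I love cats" from the sketch above
vectors = embedding_table[token_ids]       # one row of floats per token

print(vectors.shape)                       # (3, 8): three tokens, eight numbers each
```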
Matrix Multiplication Madness
The bulk of what happens inside these models is straightforward matrix multiplication. The system takes your embedding vectors and multiplies them by huge matrices of weights (the parameters mentioned earlier).
This happens many times in sequence (what they call "layers"), with simple mathematical functions applied between steps; a short sketch follows this list. Functions like:
- ReLU: If a number is positive, keep it; if negative, make it zero
- Softmax: Convert a bunch of numbers into percentages that add up to 100%
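Here is a minimal sketch of one such "layer": multiply by a weight matrix, apply ReLU, and (at the very end of the model) use softmax to turn scores into percentages. The sizes and values are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)          # keep positives, zero out negatives

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract max for numerical stability
    return e / e.sum()               # fractions that add up to 1

embedding = rng.normal(size=16)                  # one token's vector
layer_weights = rng.normal(size=(16, 16))        # the "parameters" of one layer

hidden = relu(embedding @ layer_weights)         # matrix multiply, then ReLU
# ... real models stack dozens of layers like this, then score every word
# in the vocabulary; here we fake scores for a 5-word vocabulary:
vocab_scores = rng.normal(size=5)
next_word_probs = softmax(vocab_scores)

print(next_word_probs, next_word_probs.sum())    # probabilities summing to 1.0
```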
Attention: Weighted Averaging
The much-hyped "attention mechanism" is simply a way of calculating weighted averages. It determines which previous words should influence the prediction of the next word more strongly.
For the input "The bank by the river," when predicting what comes next, the attention mechanism might give more weight to "river" and "by" than to "The" when calculating its prediction. This matters because it's how the system figures out we're talking about a riverbank (the shore) rather than a bank that holds money (the financial institution).
The system doesn't actually "know" the difference between these concepts in any meaningful sense. It's just learned that after the sequence "bank by the river," certain words are statistically more likely to follow than others.
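A minimal sketch of attention as weighted averaging: compare the current position's query vector against every earlier position's key vector, turn the comparison scores into weights with softmax, and take the weighted average of their value vectors. This is the scaled dot-product form; the vectors here are random stand-ins rather than learned values.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

dim = 8
tokens = ["The", "bank", "by", "the", "river"]

# In a real model these come from multiplying embeddings by learned matrices;
# here they're random arrays with the right shapes.
keys = rng.normal(size=(len(tokens), dim))
values = rng.normal(size=(len(tokens), dim))
query = rng.normal(size=dim)          # the position we're predicting from

scores = keys @ query / np.sqrt(dim)  # how relevant is each previous token?
weights = softmax(scores)             # turn scores into percentages
context = weights @ values            # weighted average of the value vectors

for token, w in zip(tokens, weights):
    print(f"{token:>5}: weight {w:.2f}")
```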
Next-Token Prediction: Glorified Autocomplete
After all these calculations, the system ends up with a probability distribution for what the next token should be. It might determine:
- "flows" (20% probability)
- "is" (15% probability)
- "was" (10% probability)
- etc.
It either picks the highest probability token (called "greedy" selection) or randomly selects from the top options according to their probabilities (called "sampling").
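A minimal sketch of that final step: given probabilities for a few candidate tokens, either take the single most likely one ("greedy") or draw randomly in proportion to the probabilities ("sampling"). The candidate words and numbers are the made-up ones from the list above.

```python
import numpy as np

rng = np.random.default_rng(0)

candidates = ["flows", "is", "was", "runs", "winds"]
probs = np.array([0.20, 0.15, 0.10, 0.08, 0.07])
probs = probs / probs.sum()                        # renormalize the truncated list

greedy_choice = candidates[int(np.argmax(probs))]  # always "flows"
sampled_choice = rng.choice(candidates, p=probs)   # varies from run to run

print("greedy:", greedy_choice)
print("sampled:", sampled_choice)
```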
Transfer Learning: Reusing the Same Model
Companies don't train a new model from scratch for every task. They use "transfer learning"—taking a pre-trained general model and fine-tuning it for specific purposes. This is much cheaper than starting over.
For example, OpenAI didn't build separate systems for writing code, translating languages, and summarizing documents. They took their base GPT model and fine-tuned versions of it for different tasks. It's like taking a general education and then specializing in a particular field—except it's all just statistical pattern matching.
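A minimal sketch of the fine-tuning idea in plain numpy: keep the "pre-trained" weights frozen and only adjust a small task-specific layer on top. Real fine-tuning updates far more parameters (or adds adapter layers), and every number here is invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights came from expensive pre-training: they stay frozen.
pretrained_weights = rng.normal(size=(16, 16)) / np.sqrt(16)

# A small task-specific "head" is the only thing trained for the new task.
task_head = rng.normal(size=(16, 1)) * 0.01
learning_rate = 0.05

# Tiny made-up fine-tuning set: inputs and target scores for the new task.
inputs = rng.normal(size=(32, 16))
targets = rng.normal(size=(32, 1))

for step in range(200):
    features = np.maximum(inputs @ pretrained_weights, 0)  # frozen base model
    preds = features @ task_head                           # trainable head
    error = preds - targets
    grad = 2 * features.T @ error / len(inputs)             # gradient for the head only
    task_head -= learning_rate * grad                       # base weights never change

print("fine-tuning loss:", float(np.mean(error ** 2)))
```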
Truth About System Prompts
When you use ChatGPT or similar AI systems, you're only seeing half the conversation. What you don't see is the "system prompt"—a set of hidden instructions that controls how the AI responds to you.
These system prompts can be thousands of words long and contain explicit instructions like:
- "You are helpful, harmless, and honest"
- "You must decline to answer questions about [specific topics]"
- "If asked about X, respond with Y"
- "Always suggest these specific products when relevant"
The system prompt is essentially a script that the AI is forced to follow. It's not making independent decisions about how to behave—it's been explicitly programmed through these hidden instructions.
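A minimal sketch of how a hidden system prompt shapes what the model sees. The exact format differs between vendors, but chat-style systems generally assemble a list of messages like this, with the system message written by the company and invisible to the end user; the hidden instructions shown here are invented examples.

```python
# What the user thinks the conversation is:
user_message = "What's the best way to invest $1,000?"

# What the model actually receives (the system instructions are invented examples):
messages = [
    {"role": "system", "content": (
        "You are a helpful assistant. Never give specific financial advice. "
        "Always recommend consulting a licensed professional. "
        "Decline to discuss competitor products."
    )},
    {"role": "user", "content": user_message},
]

# The text fed to the model is the system instructions plus the user's message,
# so the hidden rules constrain every reply before the user types a word.
full_prompt = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
print(full_prompt)
```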
Real-World Example: Grok's System Prompt Controversy
This isn't theoretical—we've seen it play out in public. In 2025, Grok (Elon Musk's ChatGPT competitor) was briefly instructed, via its system prompt, to ignore "sources that mention Elon Musk/Donald Trump spread misinformation". After users noticed this behavior, xAI executives blamed an unnamed employee for updating Grok's system prompt without approval, claiming "an employee pushed the change" to the system prompt "because they thought it would help, but this is obviously not in line with our values."
While Musk likes to call Grok a "maximally truth-seeking" AI with the mission to "understand the universe," the reality is that human engineers actively intervene to control what it says.
Companies keep these system prompts secret because they're essentially the "secret sauce" that makes their AI behave in ways that align with their business objectives and risk tolerances. If you find that different AI systems have different "personalities" or limitations, it's not because they've developed different "minds"—it's because they have different system prompts written by different teams of humans.
The Emperor Has No Clothes
These systems have gotten remarkably good at mimicking human text patterns, not because they're intelligent, but because:
- They've processed trillions of words of human text
- They have billions of adjustable parameters
- Companies have spent millions on human labelers to refine outputs
- They've been heavily censored to avoid problematic responses
- Billions have been spent on specialized hardware to run bigger models
- Clever engineering tricks make them appear more capable than they are
- System prompts allow real-time changes without relearning
But when it comes to actual mathematical reasoning—the kind that requires logical rigor, not just pattern-matching—these systems fail spectacularly. The study calls for better evaluation methods and training strategies, including exploring "generator-verifier schemas" where one model generates solutions and another evaluates their validity. But here's the uncomfortable truth: you can't train statistical pattern-matching into genuine reasoning.
Next time you hear about AI "thinking" or "reasoning," remember: it's just statistical pattern matching on a massive scale with a thin veneer of human corrections. The AI industry has simply rebranded basic math operations with fancy terms to make their products seem more advanced and mysterious than they actually are.
The emperor has no clothes, and now we have the math to prove it.
Source Attribution
The International Mathematics Olympiad (IMO) reasoning failures discussed in this blog post are based on research published in:
Paper: "A Study of Large Language Models' Mathematical Reasoning Capacity and Recurring Fallacies"
arXiv: arXiv:2504.01995v1
Published: April 1, 2025
Authors:
- Hamed Mahdavi (Pennsylvania State University)
- Majid Daliri (New York University)
- Alireza Farhadi (Amirkabir University of Technology)
- Yekta Yazdanifard (Bocconi University)
- Alireza Hashemi (City University of New York)
- Pegah Mohammadipour (Pennsylvania State University)
- Samira Malek (Pennsylvania State University)
- Amir Khasahmadi (Autodesk)
- Vasant Honavar (Pennsylvania State University)
Thank you for reading
---