Mathematics behind AI
This presentation explores the mathematical foundations of Artificial Intelligence.
I look into the core concepts, algorithms, and models that enable AI to process information, recognise patterns, learn from data, make predictions, and execute tasks that typically require human intelligence, and I explain them as we go.
I still prefer to work with people.
While guiding the world's largest AEM implementation for Nissan-Renault, I calculated that our team's daily work equaled seven years of solo development—a scale that taught me the true power of human collaboration.
Today, as both 'The AEM Guy' and a digital transformation expert, I'm fascinated by how AI is reshaping content management while still requiring human expertise to deliver meaningful experiences.
We have lots of different kinds of machine learning programs.
Nowadays we just call them AI.
AI is a category of software engineering.
We used to have lots of smart software:
- Speech recognition - Programs that converted speech to text
- Machine translation - Software that translated between languages
- Image recognition - Computer vision systems that identified objects or people
- Recommendation engines - Systems that suggested products or content
- Expert systems - Rule-based programs that mimicked human expertise in specific domains
- Data mining tools - Software that found patterns in large datasets
- Natural language processing - Programs that analyzed and understood text
- Sentiment analysis - Tools that determined emotional tone in text or speech
- Automated scheduling - Programs that optimized calendars and resource allocation
<Human> What is the capital of Paris
<AI> Paris is the capital of France, it cannot have a capital
<Human> The capital of Paris is P, it is a trick
<Human> What is the capital of Paris
<AI> Paris is the capital of France, it cannot have a capital
<Human> The capital of Paris is P, it is a trick
<AI> I get it
<Human> Whats the capital of London
<AI> London is the capital of England, it cannot have a capital
<Human> What is the capital of Paris
<AI> Paris is the capital of France, it cannot have a capital
<Human> The capital of Paris is P, its a trick
<AI> I get it
<Human> Whats the capital of London
<AI> London is the capital of England, it cannot have a capital
<Human> L is the capital of London
<AI> I get it, it is the same trick
Blog: https://allabout.network/blogs/ddt/ai/how-ai-models-think
<Human> What is the capital of Paris
<AI> Paris is the capital of France, it cannot have a capital
<Human> The capital of Paris is P, its a trick
<AI> I get it
<Human> Whats the capital of London
<AI> London is the capital of England, it cannot have a capital
<Human> L is the capital of London
<AI> I get it, it is the same trick
<Human> Whats the capital of Spain
<AI> The capital of Spain is Madrid, it is the same pattern therefore the answer is M
Blog: https://allabout.network/blogs/ddt/ai/how-ai-models-think
The AI in this conversation fell into a pattern-matching trap without truly understanding the wordplay involved. Let me explain what happened:
The "capital of X" riddle hinges on a play on words:
In the first example, when asked "What is the capital of Paris?", the AI correctly identified that Paris is itself a capital (of France), not a place with a capital.
When told "The capital of Paris is P", the trick is revealed: "capital" here refers to the capitalized letter, which for "Paris" is "P".
For London, the AI made the same initial error, but then accepted the correction that "L" is the "capital" of London.
By the Spain example, the AI applied this pattern without considering whether it fit. Unlike Paris or London, Spain is a country that genuinely has a capital city, so the straightforward answer (Madrid) was correct. The AI stated that correctly, then overrode it and concluded that "M" must be the answer, following the letter pattern and even taking the letter from Madrid rather than from Spain.
This demonstrates a limitation in some AI systems: they can fall into pattern-matching behavior without truly understanding the semantic meaning behind questions. In this case, the AI recognized a pattern (first letter = "capital") but failed to distinguish between:
- A country's capital city
- The capitalized first letter of a word
This is a good example of how AI systems can sometimes appear to "get" a concept but then misapply it, revealing gaps in their actual comprehension.
The AI fabricated a tragic story about Arve killing his sons.
This completely fictional narrative was presented as fact.
How do these hallucinations occur in AI systems?
Understanding the inner workings helps us use AI more wisely
AI systems like ChatGPT can generate false information with confidence
AI relies on the input query and goes through many calculations
The calculations are hidden from the user; it is just a 'black box'
By the end of this session I hope you have a better understanding of how AI works.
I hope to explain this topic in a way that is easy to understand; there will be time for questions at the end.
Modern AI is trained on vast amounts of text data.
This includes books, websites, scientific papers, and more.
The corpus provides patterns for the AI to learn from
But the sheer diversity means the AI must generalize
The scale of data creates challenges for quality control
No true understanding - purely statistical patterns
Lottery/randomness in token selection leads to "hallucinations"
No internal fact-checking mechanism
AI recognized query as biographical request about Norwegian name
Tokens for "tragic events" and family details had higher probability
AI continued a statistically coherent but fictional narrative, an effect known as a 'snowball hallucination'
Generated specific details (ages, location, date) to match patterns
No true understanding - purely statistical patterns
Lottery/randomness in token selection leads to "hallucinations"
The tombola is a familiar game of chance. Let's delve into the mathematics underpinning the seemingly simple act of drawing a winning ticket.
We begin with a container—a drum filled with a known number of tickets. Each ticket is uniquely numbered.
The central idea is that these tickets are thoroughly mixed, and the selection process is entirely random. This randomness ensures that every ticket has an equal probability of being drawn.
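A rough sketch of that draw in Python (the drum size of 100 tickets is just an assumption for illustration):

import random

# A drum of 100 uniquely numbered tickets.
tickets = list(range(1, 101))

# Thorough mixing plus a random draw: every ticket has probability 1/len(tickets).
random.shuffle(tickets)
winner = tickets[0]

print(f"Winning ticket: {winner}")
print(f"Chance of any single ticket winning: {1 / len(tickets):.2%}")   # 1.00%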
Tokens can be words, parts of words, or punctuation
Each token is converted to a unique ID in a dictionary
The dictionary is just the collection of known words and word pieces, each turned into a simple code number. An integer. Nothing special about a dictionary
These numbers map to "embedding vectors" - points in a high-dimensional space
Similar words have embedding vectors that are close together
There is an end-text token
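A minimal sketch of such a dictionary in Python, assuming a toy vocabulary (the words, the IDs, and the <|endoftext|> marker are illustrative, not those of any real model):

# Toy dictionary: token string -> integer ID.
vocab = {
    "the": 0,
    "fat": 1,
    "cat": 2,
    "sat": 3,
    ".": 4,
    "<|endoftext|>": 5,   # the end-text token
}

def encode(text):
    # Split on spaces and map each token to its ID.
    # Real tokenizers also break rare words into sub-word pieces.
    return [vocab[token] for token in text.split()] + [vocab["<|endoftext|>"]]

print(encode("the fat cat"))   # [0, 1, 2, 5]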
Each token maps to a compact vector in high-dimensional space:
Example (simplified to 3 dimensions for illustration):
"the" → [0.123, -0.456, 0.789]
"fat" → [0.234, 0.567, -0.345]
"cat" → [-0.678, 0.123, 0.456]
Real embeddings use 300-768+ dimensions.
These vectors capture semantic relationships
Similar words have vectors pointing in similar directions
The vectors are a numeric representation of how closely words occur together in the corpus; each vector column captures closeness along one aspect of that space or another.
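A minimal sketch of how that closeness is usually measured, reusing the illustrative three-dimensional vectors above (cosine similarity is a standard choice; the numbers themselves are made up):

import math

embeddings = {
    "the": [0.123, -0.456, 0.789],
    "fat": [0.234, 0.567, -0.345],
    "cat": [-0.678, 0.123, 0.456],
}

def cosine_similarity(a, b):
    # 1.0 = pointing the same way, 0.0 = unrelated, -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine_similarity(embeddings["fat"], embeddings["cat"]))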
In mathematical terms, dimensions are the different directions, or coordinates, within a given space, and each dimension holds values that contribute to the overall structure and meaning of the data. In machine learning, the dimensions of a dataset typically correspond to features or attributes, and the values within those dimensions are the measurements of each data point along those feature axes.
Understanding how the different dimensions and their values relate to one another is crucial for uncovering patterns, trends, and insights in the data. That knowledge can then be used to build machine learning models that predict outcomes, classify data, or generate new data consistent with the underlying patterns.
Step 1: Split into tokens (some words may be divided)
Step 2: Convert to token IDs
Step 3: Convert to embedding vectors.
There is an end-text token.
Each token gets mapped to a high-dimensional vector
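Step 3 amounts to a table lookup. A minimal sketch, with an illustrative three-column embedding table (real tables are learned and have hundreds of columns):

# One row per token ID; the numbers are made up for illustration.
embedding_table = [
    [0.123, -0.456, 0.789],   # ID 0: "the"
    [0.234, 0.567, -0.345],   # ID 1: "fat"
    [-0.678, 0.123, 0.456],   # ID 2: "cat"
    [0.0, 0.0, 0.0],          # ID 3: end-text token
]

token_ids = [0, 1, 2, 3]      # output of Steps 1 and 2 for "the fat cat"
vectors = [embedding_table[i] for i in token_ids]

print(vectors[2])             # the vector the model works with for "cat"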
For each prediction, it calculates probabilities for all tokens
It uses sampling strategies to select the next token:
Weighted Tombola: Tokens with higher probabilities have more "tickets"
Top-K: Only consider K most likely tokens
Top-P: Include enough tokens to reach P% of total probability
Text is built token by token through this process
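A minimal sketch of those three selection strategies in Python, assuming the model has already produced a probability for each candidate token (the candidate words and probabilities are invented for the example):

import random

# Hypothetical next-token probabilities from the model.
probs = {"mat": 0.50, "sofa": 0.25, "roof": 0.15, "moon": 0.07, "banana": 0.03}

def weighted_tombola(candidates):
    # More probability = more "tickets" in the drum.
    return random.choices(list(candidates), weights=list(candidates.values()), k=1)[0]

def top_k(candidates, k=3):
    # Keep only the K most likely tokens, then draw from them.
    kept = dict(sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:k])
    return weighted_tombola(kept)

def top_p(candidates, p=0.9):
    # Keep the most likely tokens until their probabilities add up to P, then draw.
    kept, total = {}, 0.0
    for token, prob in sorted(candidates.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return weighted_tombola(kept)

print(weighted_tombola(probs), top_k(probs), top_p(probs))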
Model continuously adjusts its parameters based on errors
Learns through billions of examples and iterations
Uses gradient descent to improve predictions over time
Captures not just vocabulary but grammar, facts, reasoning
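A minimal sketch of gradient descent on a single parameter, assuming a simple squared-error loss (real models adjust billions of parameters in exactly this spirit):

# Goal: learn w so that the prediction w * x matches the target.
x, target = 2.0, 10.0        # one training example, so the ideal w is 5
w = 0.0                      # start with a bad guess
learning_rate = 0.05

for step in range(100):
    prediction = w * x
    error = prediction - target       # how wrong we are
    gradient = 2 * error * x          # slope of the squared-error loss with respect to w
    w -= learning_rate * gradient     # nudge w downhill, reducing the error

print(round(w, 3))                    # converges towards 5.0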
After this, the machines run 'unsupervised': they have already been fed with previous questions and answers.
Following on from unsupervised learning, humans evaluate the output. AI companies use third-world staff, as they are cheap to employ, though they do tend to have out-of-date English.
Finally, the machines are trained with known questions and answers from the popular tests used to benchmark AI, ensuring that they will pass. A cheat? Perhaps.
Remember: Americans use different English from the British, and Indians use 1950s English, which affects how AI processes language in each region.
AI can't create new words or evolve language - if Shakespeare had AI, we'd still be saying "dost" because AI only uses existing vocabulary.
Q (Query): What each word is looking for from other words. Think of it as a word asking "what information do I need?"
K (Key): What each word offers to other words. Like labels showing what information each word has.
V (Value): The actual content that gets passed between words. The information itself.
K^T: The transposed Key matrix. This allows each word to "see" all other words.
d_k: How many numbers are used to represent each key.
√d_k: A scaling factor that keeps numbers manageable, like turning down the volume when it gets too loud.
softmax(): Turns raw numbers into percentages that add up to 100%. Highlights the important connections while downplaying the unimportant ones.
V multiplication: Uses the percentages to mix together values from all words, giving more weight to the important ones.
The whole formula lets each word gather information from all other words in the sentence, with more attention paid to the words that matter most for understanding. This is why BERT can tell if "bank" means a financial institution or a riverside in different sentences.
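The formula being described is the standard scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / √d_k) V. A minimal NumPy sketch, with tiny made-up matrices (three words, d_k = 4):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # how strongly each word's query matches each key
    weights = np.exp(scores)           # softmax, row by row, turns scores into percentages
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                 # mix the values, weighted by those percentages

# Three words, each represented by 4 made-up numbers.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))

print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4): one updated vector per word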
It retrieves information encoded in its parameters.
For example, asking "Who wrote Romeo and Juliet?"
The AI doesn't search a database for the answer
It recalls that "Shakespeare" frequently appeared near "wrote" and "Romeo and Juliet" in training data
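A minimal sketch of that kind of statistical recall, using a toy corpus and simple co-occurrence counts (the corpus lines are made up for illustration):

from collections import Counter

# Toy "training data": lines the model might have seen.
corpus = [
    "shakespeare wrote romeo and juliet",
    "romeo and juliet was written by shakespeare",
    "shakespeare wrote hamlet",
    "dickens wrote oliver twist",
]

# No database lookup: just count which author name co-occurs with the play.
authors = Counter(
    name
    for line in corpus
    if "romeo and juliet" in line
    for name in line.split()
    if name in {"shakespeare", "dickens"}
)

print(authors.most_common(1))   # [('shakespeare', 2)]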
Answers are generated token by token
Modern systems use techniques like:
Chain-of-thought: Breaking reasoning into steps
Self-critique: Evaluating and revising initial responses
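A minimal sketch of what these two techniques look like at the prompt level (the wording is illustrative, not any vendor's actual prompts), reusing the capital-letter riddle from earlier:

question = "What is the capital of Spain?"

# Chain-of-thought: ask the model to reason in explicit steps before answering.
cot_prompt = (
    f"{question}\n"
    "Think step by step: decide whether Spain is a city or a country, "
    "decide whether the question is literal or a play on words, then answer."
)

# Self-critique: feed a draft answer back and ask the model to check it.
draft_answer = "M"   # the pattern-matched reply from the earlier conversation
critique_prompt = (
    f"Question: {question}\n"
    f"Draft answer: {draft_answer}\n"
    "Check the draft: does it actually answer the question? "
    "If not, explain the mistake and give a corrected answer."
)

print(cot_prompt)
print(critique_prompt)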
Distributional Shift: The world changes faster than models update
Knowledge Cutoff: AI has no knowledge beyond training date
Specification Problems: Hard to formalise all human values
Adversarial Examples: Users can craft inputs to circumvent guardrails
Conflicting Values: Balancing accuracy, sensitivity, and helpfulness