Mathematics behind AI
This presentation explores the mathematical foundations of Artificial Intelligence.
I look into the core concepts, algorithms, and models that enable AI to process information, recognise patterns, learn from data, make predictions, and execute tasks that typically require human intelligence, and I explain them as we go.
I still prefer to work with people.
While guiding the world's largest AEM implementation for Nissan-Renault, I calculated that our team's daily work equaled seven years of solo development—a scale that taught me the true power of human collaboration.
Today, as both 'The AEM Guy' and a digital transformation expert, I'm fascinated by how AI is reshaping content management while still requiring human expertise to deliver meaningful experiences.
We have lots of different kinds of machine learning programs.
Nowadays we just call them AI.
AI is a category of software engineering.
We used to have lots of smart software:
- Speech recognition - Programs that converted speech to text
- Machine translation - Software that translated between languages
- Image recognition - Computer vision systems that identified objects or people
- Recommendation engines - Systems that suggested products or content
- Expert systems - Rule-based programs that mimicked human expertise in specific domains
- Data mining tools - Software that found patterns in large datasets
- Natural language processing - Programs that analyzed and understood text
- Sentiment analysis - Tools that determined emotional tone in text or speech
- Automated scheduling - Programs that optimized calendars and resource allocation
<Human> What is the capital of Paris
<AI> Paris is the capital of France, it cannot have a capital
<Human> The capital of Paris is P, it is a trick
<Human> What is the capital of Paris
<AI> Paris is the capital of France, it cannot have a capital
<Human> The capital of Paris is P, it is a trick
<AI> I get it
<Human> Whats the capital of London
<AI> London is the capital of England, it cannot have a capital
<Human> What is the capital of Paris
<AI> Paris is the capital of France, it cannot have a capital
<Human> The capital of Paris is P, its a trick
<AI> I get it
<Human> Whats the capital of London
<AI> London is the capital of England, it cannot have a capital
<Human> L is the capital of London
<AI> I get it, it is the same trick
Blog: https://allabout.network/blogs/ddt/ai/how-ai-models-think
<Human> What is the capital of Paris
<AI> Paris is the capital of France, it cannot have a capital
<Human> The capital of Paris is P, its a trick
<AI> I get it
<Human> Whats the capital of London
<AI> London is the capital of England, it cannot have a capital
<Human> L is the capital of London
<AI> I get it, it is the same trick
<Human> Whats the capital of Spain
<AI> The capital of Spain is Madrid, it is the same pattern therefore the answer is M
Blog: https://allabout.network/blogs/ddt/ai/how-ai-models-think
The AI in this conversation fell into a pattern-matching trap without truly understanding the wordplay involved. Let me explain what happened:
The "capital of X" riddle hinges on a play on words:
In the first example, when asked "What is the capital of Paris?", the AI correctly identified that Paris is itself a capital (of France), not a place with a capital.
When told "The capital of Paris is P", the trick is revealed: "capital" here refers to the capitalized letter, which for "Paris" is "P".
For London, the AI made the same initial error, but then accepted the correction that "L" is the "capital" of London.
By the Spain example, the AI applied this pattern without considering whether it fit. Unlike Paris or London, Spain is a country that genuinely has a capital city, so the straightforward answer (Madrid) was correct. The AI stated that correctly, then overrode it and concluded that "M" must be the answer, following the letter pattern and even taking the letter from Madrid rather than from Spain.
This demonstrates a limitation in some AI systems: they can fall into pattern-matching behavior without truly understanding the semantic meaning behind questions. In this case, the AI recognized a pattern (first letter = "capital") but failed to distinguish between:
- A country's capital city
- The capitalized first letter of a word
This is a good example of how AI systems can sometimes appear to "get" a concept but then misapply it, revealing gaps in their actual comprehension.
The AI fabricated a tragic story about Arve killing his sons.
This completely fictional narrative was presented as fact.
How do these hallucinations occur in AI systems?
Understanding the inner workings helps us use AI more wisely
AI systems like ChatGPT can generate false information with confidence
AI relies on the input query and goes through many calculations
The calculations are hidden from the user; it is just a 'black box'
By the end of this session I hope you have a better understanding of how AI works.
I hope to explain this topic in a way that is easy to understand; there will be time for questions at the end.
Modern AI is trained on vast amounts of text data.
This includes books, websites, scientific papers, and more.
The corpus provides patterns for the AI to learn from
But the sheer diversity means the AI must generalize
The scale of data creates challenges for quality control
No true understanding - purely statistical patterns
Lottery/randomness in token selection leads to "hallucinations"
No internal fact-checking mechanism
AI recognized query as biographical request about Norwegian name
Tokens for "tragic events" and family details had higher probability
AI continued a statistically coherent but fictional narrative, an effect known as a 'snowball hallucination'
Generated specific details (ages, location, date) to match patterns
No true understanding - purely statistical patterns
Lottery/randomness in token selection leads to "hallucinations"
The tombola is a familiar game of chance. Let's delve into the mathematics underpinning the seemingly simple act of drawing a winning ticket.
We begin with a container—a drum filled with a known number of tickets. Each ticket is uniquely numbered.
The central idea is that these tickets are thoroughly mixed, and the selection process is entirely random. This randomness ensures that every ticket has an equal probability of being drawn.
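A rough sketch of that draw in Python (the drum size of 100 tickets is just an assumption for illustration):

import random

# A drum of 100 uniquely numbered tickets.
tickets = list(range(1, 101))

# Thorough mixing plus a random draw: every ticket has probability 1/len(tickets).
random.shuffle(tickets)
winner = tickets[0]

print(f"Winning ticket: {winner}")
print(f"Chance of any single ticket winning: {1 / len(tickets):.2%}")   # 1.00%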
Tokens can be words, parts of words, or punctuation
Each token is converted to a unique ID in a dictionary
The dictionary is just the collection of known words and word pieces, each turned into a simple code number. An integer. Nothing special about a dictionary
These numbers map to "embedding vectors" - points in a high-dimensional space
Similar words have embedding vectors that are close together
There is an end-text token
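A minimal sketch of such a dictionary in Python, assuming a toy vocabulary (the words, the IDs, and the <|endoftext|> marker are illustrative, not those of any real model):

# Toy dictionary: token string -> integer ID.
vocab = {
    "the": 0,
    "fat": 1,
    "cat": 2,
    "sat": 3,
    ".": 4,
    "<|endoftext|>": 5,   # the end-text token
}

def encode(text):
    # Split on spaces and map each token to its ID.
    # Real tokenizers also break rare words into sub-word pieces.
    return [vocab[token] for token in text.split()] + [vocab["<|endoftext|>"]]

print(encode("the fat cat"))   # [0, 1, 2, 5]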
Each token maps to a compact vector in high-dimensional space:
Example (simplified to 3 dimensions for illustration):
"the" → [0.123, -0.456, 0.789]
"fat" → [0.234, 0.567, -0.345]
"cat" → [-0.678, 0.123, 0.456]
Real embeddings use 300-768+ dimensions.
These vectors capture semantic relationships
Similar words have vectors pointing in similar directions
The vectors are a numeric representation of how closely words occur together in the corpus; each vector column captures closeness along one aspect of that space or another.
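A minimal sketch of how that closeness is usually measured, reusing the illustrative three-dimensional vectors above (cosine similarity is a standard choice; the numbers themselves are made up):

import math

embeddings = {
    "the": [0.123, -0.456, 0.789],
    "fat": [0.234, 0.567, -0.345],
    "cat": [-0.678, 0.123, 0.456],
}

def cosine_similarity(a, b):
    # 1.0 = pointing the same way, 0.0 = unrelated, -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine_similarity(embeddings["fat"], embeddings["cat"]))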
In mathematical terms, dimensions are the different directions, or coordinates, within a given space, and each dimension holds values that contribute to the overall structure and meaning of the data. In machine learning, the dimensions of a dataset typically correspond to features or attributes, and the values within those dimensions are the measurements of each data point along those feature axes.
Understanding how the different dimensions and their values relate to one another is crucial for uncovering patterns, trends, and insights in the data. That knowledge can then be used to build machine learning models that predict outcomes, classify data, or generate new data consistent with the underlying patterns.
Step 1: Split into tokens (some words may be divided)
Step 2: Convert to token IDs
Step 3: Convert to embedding vectors.
There is an end-text token.
Each token gets mapped to a high-dimensional vector
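Step 3 amounts to a table lookup. A minimal sketch, with an illustrative three-column embedding table (real tables are learned and have hundreds of columns):

# One row per token ID; the numbers are made up for illustration.
embedding_table = [
    [0.123, -0.456, 0.789],   # ID 0: "the"
    [0.234, 0.567, -0.345],   # ID 1: "fat"
    [-0.678, 0.123, 0.456],   # ID 2: "cat"
    [0.0, 0.0, 0.0],          # ID 3: end-text token
]

token_ids = [0, 1, 2, 3]      # output of Steps 1 and 2 for "the fat cat"
vectors = [embedding_table[i] for i in token_ids]

print(vectors[2])             # the vector the model works with for "cat"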
For each prediction, it calculates probabilities for all tokens
It uses sampling strategies to select the next token:
Weighted Tombola: Tokens with higher probabilities have more "tickets"
Top-K: Only consider K most likely tokens
Top-P: Include enough tokens to reach P% of total probability
Text is built token by token through this process
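A minimal sketch of those three selection strategies in Python, assuming the model has already produced a probability for each candidate token (the candidate words and probabilities are invented for the example):

import random

# Hypothetical next-token probabilities from the model.
probs = {"mat": 0.50, "sofa": 0.25, "roof": 0.15, "moon": 0.07, "banana": 0.03}

def weighted_tombola(candidates):
    # More probability = more "tickets" in the drum.
    return random.choices(list(candidates), weights=list(candidates.values()), k=1)[0]

def top_k(candidates, k=3):
    # Keep only the K most likely tokens, then draw from them.
    kept = dict(sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:k])
    return weighted_tombola(kept)

def top_p(candidates, p=0.9):
    # Keep the most likely tokens until their probabilities add up to P, then draw.
    kept, total = {}, 0.0
    for token, prob in sorted(candidates.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return weighted_tombola(kept)

print(weighted_tombola(probs), top_k(probs), top_p(probs))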
Model continuously adjusts its parameters based on errors
Learns through billions of examples and iterations
Uses gradient descent to improve predictions over time
Captures not just vocabulary but grammar, facts, reasoning
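A minimal sketch of gradient descent on a single parameter, assuming a simple squared-error loss (real models adjust billions of parameters in exactly this spirit):

# Goal: learn w so that the prediction w * x matches the target.
x, target = 2.0, 10.0        # one training example, so the ideal w is 5
w = 0.0                      # start with a bad guess
learning_rate = 0.05

for step in range(100):
    prediction = w * x
    error = prediction - target       # how wrong we are
    gradient = 2 * error * x          # slope of the squared-error loss with respect to w
    w -= learning_rate * gradient     # nudge w downhill, reducing the error

print(round(w, 3))                    # converges towards 5.0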
After this, the machines run 'unsupervised': they have already been fed with previous questions and answers.
Following on from unsupervised learning, humans evaluate the output. AI companies use third-world staff, as they are cheap to employ, though they do tend to have out-of-date English.
Finally, the machines are trained with known questions and answers from the popular tests used to benchmark AI, ensuring that they will pass. A cheat? Perhaps.
Remember: Americans use different English from the British, and Indians use 1950s English, which affects how AI processes language in each region.
AI can't create new words or evolve language - if Shakespeare had AI, we'd still be saying "dost" because AI only uses existing vocabulary.
Q (Query): What each word is looking for from other words. Think of it as a word asking "what information do I need?"
K (Key): What each word offers to other words. Like labels showing what information each word has.
V (Value): The actual content that gets passed between words. The information itself.
K^T: The transposed Key matrix. This allows each word to "see" all other words.
d_k: How many numbers are used to represent each key.
√d_k: A scaling factor that keeps numbers manageable, like turning down the volume when it gets too loud.
softmax(): Turns raw numbers into percentages that add up to 100%. Highlights the important connections while downplaying the unimportant ones.
V multiplication: Uses the percentages to mix together values from all words, giving more weight to the important ones.
The whole formula lets each word gather information from all other words in the sentence, with more attention paid to the words that matter most for understanding. This is why BERT can tell if "bank" means a financial institution or a riverside in different sentences.
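The formula being described is the standard scaled dot-product attention: Attention(Q, K, V) = softmax(Q K^T / √d_k) V. A minimal NumPy sketch, with tiny made-up matrices (three words, d_k = 4):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # how strongly each word's query matches each key
    weights = np.exp(scores)           # softmax, row by row, turns scores into percentages
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                 # mix the values, weighted by those percentages

# Three words, each represented by 4 made-up numbers.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))

print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4): one updated vector per word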
It retrieves information encoded in its parameters.
For example, asking "Who wrote Romeo and Juliet?"
The AI doesn't search a database for the answer
It recalls that "Shakespeare" frequently appeared near "wrote" and "Romeo and Juliet" in training data
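A minimal sketch of that kind of statistical recall, using a toy corpus and simple co-occurrence counts (the corpus lines are made up for illustration):

from collections import Counter

# Toy "training data": lines the model might have seen.
corpus = [
    "shakespeare wrote romeo and juliet",
    "romeo and juliet was written by shakespeare",
    "shakespeare wrote hamlet",
    "dickens wrote oliver twist",
]

# No database lookup: just count which author name co-occurs with the play.
authors = Counter(
    name
    for line in corpus
    if "romeo and juliet" in line
    for name in line.split()
    if name in {"shakespeare", "dickens"}
)

print(authors.most_common(1))   # [('shakespeare', 2)]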
Answers are generated token by token
Modern systems use techniques like:
Chain-of-thought: Breaking reasoning into steps
Self-critique: Evaluating and revising initial responses
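A minimal sketch of what these two techniques look like at the prompt level (the wording is illustrative, not any vendor's actual prompts), reusing the capital-letter riddle from earlier:

question = "What is the capital of Spain?"

# Chain-of-thought: ask the model to reason in explicit steps before answering.
cot_prompt = (
    f"{question}\n"
    "Think step by step: decide whether Spain is a city or a country, "
    "decide whether the question is literal or a play on words, then answer."
)

# Self-critique: feed a draft answer back and ask the model to check it.
draft_answer = "M"   # the pattern-matched reply from the earlier conversation
critique_prompt = (
    f"Question: {question}\n"
    f"Draft answer: {draft_answer}\n"
    "Check the draft: does it actually answer the question? "
    "If not, explain the mistake and give a corrected answer."
)

print(cot_prompt)
print(critique_prompt)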
Distributional Shift: The world changes faster than models update
Knowledge Cutoff: AI has no knowledge beyond training date
Specification Problems: Hard to formalise all human values
Adversarial Examples: Users can craft inputs to circumvent guardrails
Conflicting Values: Balancing accuracy, sensitivity, and helpfulness