The Mathematical Heartbeat of AI

When we interact with AI systems that seem almost magically intelligent, we're actually engaging with elegant mathematical structures. Behind the curtain of every language model, image generator, and recommendation system lies a foundation of mathematical functions that transform raw data into understanding.
I want to take you on a journey through the mathematical heart of artificial intelligence.
How e, Sigmoid, ReLU and Vectors Power Modern Intelligence
I also have slides for this post, prepared as a talk on the mathematics behind AI.
Every AI system that translates languages, generates images, or predicts behaviors rests on an elegant mathematical framework. "The Mathematical Heartbeat of AI" unveils these hidden structures, which usually go unnoticed by everyday users.
This exploration takes us through the fundamental mathematical concepts at AI's core: how Euler's number e creates perfect growth patterns, how sigmoid functions transform complex inputs into probabilities, how the simple yet revolutionary ReLU function solved critical training problems, and how vector spaces allow machines to process human language. These mathematical building blocks work together in harmony to create the sophisticated behaviors we observe in modern AI systems.
While the mathematics behind AI might initially seem dense, this post explains these concepts in plain English, making them accessible to readers without advanced mathematical backgrounds. Understanding these principles serves a practical purpose. The more we comprehend how AI systems actually work, the better equipped we become to use them effectively, recognize their limitations, and make informed decisions about their applications in our daily lives and businesses.
The Magical Constant That Smooths Growth
[Figure: graph of the e^x function showing its self-derivative property. The curve rises exponentially, and the tangent line at x = 2 has a slope equal to the height of the curve at that point.]
At the core of many neural networks sits a seemingly random number: 2.71828... This is Euler's number ‘e’, a mathematical constant as fundamental to AI as it is to natural phenomena like population growth, radioactive decay, and compound interest.
What makes e so special? Its defining property is that the derivative of e^x is itself. This means that at any point on the curve, the rate of change (the slope of the tangent line) exactly equals the value of the function at that point. Apart from constant multiples of e^x, no other function has this extraordinary property.
This self-replicating quality creates the smoothest possible growth pattern. It's like a savings account where the interest rate perfectly matches your current balance - the more you have, the faster it grows, in perfect proportion. This property creates growth that accelerates naturally and continuously rather than in jumps or irregular spurts.
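To make this property concrete, here is a minimal, purely illustrative Python check that the slope of e^x at a point matches the value of e^x at that same point, using a finite-difference approximation (the step size h is an arbitrary small number chosen for this sketch):

```python
import math

def slope(f, x, h=1e-6):
    """Approximate the derivative of f at x with a central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in [0.0, 1.0, 2.0]:
    # The approximate slope and the function value agree to many decimal places.
    print(x, slope(math.exp, x), math.exp(x))
```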
Differentiability plays a critical role in the effectiveness of AI learning systems. In essence, it ensures that the output of a neural network changes smoothly in response to changes in its input. This smoothness is vital as it enables the system to make granular adjustments to its internal parameters, fine-tuning its behavior based on the errors it encounters during the training process. This process of learning from errors, known as backpropagation, relies heavily on the ability to calculate gradients, which is only possible with differentiable functions.
Without differentiability, the learning process in AI systems would be significantly hampered. The system would struggle to make precise adjustments to its parameters, leading to computational inefficiencies and suboptimal performance. The inability to calculate gradients would also prevent the use of backpropagation, which is a cornerstone of modern AI training methodologies. This would result in a less effective and less adaptable AI system, unable to reach its full potential in terms of accuracy and performance.
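As a hedged illustration of why differentiability matters, the toy loop below trains a single-weight "model" by gradient descent. It is not backpropagation through a real network, just the core idea: a differentiable loss gives us a gradient, and the gradient tells us which direction to nudge the parameter. The learning rate of 0.01 and the target value are arbitrary choices for this sketch.

```python
def loss(w, x=2.0, target=10.0):
    """Squared error of a one-weight model: prediction = w * x."""
    return (w * x - target) ** 2

def grad(w, x=2.0, target=10.0):
    """Exact derivative of the loss with respect to w."""
    return 2 * (w * x - target) * x

w = 0.0
for step in range(100):
    w -= 0.01 * grad(w)   # follow the gradient downhill in small steps
print(w)                  # converges toward 5.0, since 5.0 * 2.0 == 10.0
```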
The S-Curve of Decision - Sigmoid Functions
The sigmoid function is a mathematical function that maps any input value to an output value between 0 and 1. Its graph has a characteristic S-shape with horizontal asymptotes at y=0 and y=1. The decision boundary is at x=0, where the output value equals 0.5. This allows the sigmoid function to be used for binary classification, with input values below the decision boundary being classified as 0 and input values above the decision boundary being classified as 1.
When an AI needs to make a binary decision – Is this email spam? Is this image a cat? Should this loan be approved? – it often relies on the sigmoid function:
σ(x) = 1/(1 + e^(-x))
This S-shaped curve elegantly transforms any input, no matter how extreme, into a value between 0 and 1. It's the mathematical equivalent of taking the chaotic complexity of the world and distilling it down to a probability.
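A minimal Python sketch of the sigmoid makes this squashing behaviour easy to see (the sample inputs are arbitrary; a production implementation would also guard against overflow for very large negative inputs):

```python
import math

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

for x in [-100, -5, 0, 5, 100]:
    print(x, sigmoid(x))
# -100 maps to nearly 0, 0 maps to exactly 0.5, and 100 maps to nearly 1:
# extreme evidence becomes a confident probability, never quite reaching 0 or 1.
```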
The sigmoid's graceful curve leverages Euler's number to create a smooth transition between states. Notice how the curve approaches but never quite reaches its upper and lower bounds. This smooth transition is crucial in AI systems where we need to model gradual changes in confidence or probability.
The sigmoid's S-shape also mimics many natural phenomena: population growth curves, learning rates, and even how opinions spread through social networks. When applied in AI, it represents the system gradually becoming more confident in its decision as evidence mounts in one direction.
However, the sigmoid has limitations. As neural networks grew deeper, researchers discovered its tendency to "saturate" – its gradients becoming vanishingly small at extreme values, essentially stopping learning in its tracks. This weakness led to one of the most important innovations in modern AI...
ReLU and the Power of f(x) = max(0,x)
Sometimes the most powerful solutions are the simplest. The Rectified Linear Unit (ReLU) function is disarmingly straightforward:
If the input is positive, let it through. If the input is negative, output zero.
This piecewise linear function lacks the elegant curves of the sigmoid, yet it sparked a revolution in deep learning. Why? Because it solved the vanishing gradient problem while being computationally cheaper.
The beauty of ReLU lies in its binary gradient behavior - it's either 0 (for negative inputs) or 1 (for positive inputs). This property makes gradient flow much more stable in deep networks, allowing for faster and more reliable training.
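The sketch below puts that claim side by side with the sigmoid from the previous section: ReLU's gradient is either 0 or 1, while the sigmoid's gradient collapses toward zero for large inputs. The sample inputs are arbitrary and chosen only to show the contrast.

```python
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0        # gradient is exactly 0 or exactly 1

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                # shrinks rapidly for large |x| (saturation)

for x in [-10, -1, 1, 10]:
    print(x, relu_grad(x), sigmoid_grad(x))
# At x = 10 the sigmoid gradient is about 0.000045 while ReLU's is still 1.0.
```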
ReLU also introduced a valuable property to neural networks: sparsity. By zeroing out negative values, it creates networks where many neurons are inactive at any given time – mirroring how our own brains work, where only specific neural pathways activate for particular tasks.
This sharp, cornered function outperforms the smooth, differentiable sigmoid in many applications. Sometimes broken symmetry creates strength – a principle found throughout nature and now harnessed in artificial intelligence.
From Words to Numbers
[Figure: diagram of tokenization and vectorization, with example words mapped to vector representations in a high-dimensional space.]
For AI to work with human language, it needs to transform words – abstract symbols laden with cultural and contextual meaning – into mathematical objects it can manipulate. This transformation is token vectorization.
The Tokenization Step
A modern language model first breaks text into tokens – pieces that might be words, parts of words, or even individual characters:
"I love machine learning" → ["I", "love", "machine", "learning"]
Modern systems often use subword tokenization, where common words stay whole but uncommon ones split into meaningful pieces:
"antidisestablishmentarianism" → ["anti", "dis", "establish", "ment", "arian", "ism"]
This balances vocabulary size with the ability to handle any input text by combining subword pieces.
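To show the idea without the machinery of a real BPE or WordPiece tokenizer, here is a toy greedy longest-match tokenizer over a tiny, hypothetical vocabulary. Real tokenizers learn their vocabularies from huge corpora; this sketch only illustrates how an unfamiliar word decomposes into familiar pieces.

```python
# Hypothetical subword vocabulary; real vocabularies contain tens of thousands of pieces.
VOCAB = {"anti", "dis", "establish", "ment", "arian", "ism", "machine", "learn", "ing"}

def tokenize(word):
    pieces, start = [], 0
    while start < len(word):
        # Greedily take the longest vocabulary piece that matches at this position.
        for end in range(len(word), start, -1):
            if word[start:end] in VOCAB:
                pieces.append(word[start:end])
                start = end
                break
        else:
            pieces.append(word[start])   # unknown character: fall back to a single char
            start += 1
    return pieces

print(tokenize("antidisestablishmentarianism"))
# ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']
```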
Creating the Vectors
Once tokenized, each token gets transformed into a vector through one of several methods:
One-hot encoding (early approach): each token gets a vector with a single 1 and all other positions 0:
"cat" → [0,1,0,0,0,...]
"dog" → [0,0,1,0,0,...]
This is simple but extremely high-dimensional and doesn't encode any semantic relationships.
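A minimal sketch over a hypothetical five-word vocabulary shows both the simplicity and the weakness: every word is exactly as far from every other word.

```python
vocab = ["<unk>", "cat", "dog", "fish", "bird"]   # hypothetical toy vocabulary

def one_hot(word):
    vec = [0] * len(vocab)
    index = vocab.index(word) if word in vocab else 0   # unknown words map to <unk>
    vec[index] = 1
    return vec

print(one_hot("cat"))   # [0, 1, 0, 0, 0]
print(one_hot("dog"))   # [0, 0, 1, 0, 0]
# "cat" and "dog" are no more similar to each other than to "fish" or "bird":
# one-hot vectors carry no semantic information.
```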
Static word embeddings (Word2Vec, GloVe): each token gets a dense vector (typically 100-300 dimensions) based on how words co-occur in large text corpora:
"king" → [0.123, -0.029, 0.895, ...]
"queen" → [0.157, -0.041, 0.887, ...]
These vectors are learned through training objectives like "predict surrounding words" or "predict if words appear together." The magic is that words used in similar contexts end up with similar vectors – encoding semantic relationships.
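A quick way to see "similar vectors" is cosine similarity. The three-dimensional vectors below are hypothetical stand-ins (real Word2Vec or GloVe vectors have hundreds of dimensions, and the values shown above are likewise illustrative):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-picked toy embeddings.
king  = [0.80, 0.30, 0.10]
queen = [0.78, 0.32, 0.12]
apple = [0.05, 0.90, 0.40]

print(cosine_similarity(king, queen))  # close to 1.0: similar contexts, similar vectors
print(cosine_similarity(king, apple))  # much lower: different contexts, distant vectors
```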
Contextual embeddings (BERT, GPT): modern transformers create dynamic vectors that change based on surrounding context:
"bank" in "river bank" → [0.42, 0.12, ...]
"bank" in "bank account" → [0.11, 0.87, ...]
These are generated by multi-layer neural networks that process entire sentences or documents at once, allowing for context-dependent representations.
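If you want to see contextual vectors for yourself, the sketch below pulls the two "bank" vectors out of a pretrained BERT model. It assumes the Hugging Face transformers library and PyTorch are installed and that the bert-base-uncased weights can be downloaded; the exact similarity value will vary, but it will be clearly below 1.0.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual embedding of the token 'bank' in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (num_tokens, 768)
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = inputs["input_ids"][0].tolist().index(bank_id)  # where 'bank' sits
    return hidden[position]

river = bank_vector("I sat down on the river bank.")
money = bank_vector("I opened a new bank account.")
print(torch.cosine_similarity(river, money, dim=0))  # same word, noticeably different vectors
```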
The Mathematical Space
In this high-dimensional vector space, relationships between concepts become geometric relationships. Words with similar meanings cluster together. The distance between vectors represents semantic difference, while the direction can encode specific types of relationships.
This mathematical representation enables one of the most famous examples in word embeddings: "king" minus "man" plus "woman" approximately equals "queen." The vector arithmetic captures the gender relationship independent of the royal status.
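With hypothetical two-dimensional vectors, the arithmetic is easy to follow. Pretend one axis encodes "royalty" and the other "maleness" (real embeddings learn such directions implicitly, across hundreds of dimensions):

```python
# Hypothetical 2-D embeddings: (royalty, maleness). Values are invented for illustration.
king  = [0.9, 0.8]
man   = [0.1, 0.8]
woman = [0.1, 0.1]
queen = [0.9, 0.1]

analogy = [k - m + w for k, m, w in zip(king, man, woman)]
print(analogy)   # approximately [0.9, 0.1], which lands on "queen"
```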
What's fascinating is that these relationships emerge naturally during training – the algorithm isn't explicitly told that "king" and "queen" differ by gender. These patterns emerge from the statistical properties of language itself, encoded in the smooth mathematical space created by the embedding process.
The Softmax Function: The Heart of AI Decision-Making
What Is Softmax?
Softmax is a mathematical function that transforms a vector of raw numbers (often called "logits") into a probability distribution. The resulting probabilities all sum to exactly 1, making softmax essential for multi-class classification problems in machine learning and AI systems.
The Mathematical Definition
For an input vector z with n elements, the softmax function is defined as:
softmax(z)_i = e^(z_i) / Σ_j e^(z_j)
Where:
- e is Euler's number (approximately 2.71828)
- z_i is the i-th element of the input vector
- The denominator is the sum of exponentials of all elements in the vector
How Softmax Works
Softmax elegantly handles the transformation from raw neural network outputs to interpretable probabilities through several key steps (a minimal code sketch follows this list):
- It applies the exponential function (e^x) to each input value, ensuring all results are positive
- It normalizes these values by dividing each exponential by their sum, guaranteeing they add up to 1
- The largest input value receives the highest probability, while smaller values remain proportionally represented
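Here is a minimal, numerically stable softmax sketch following exactly those steps (subtracting the maximum logit before exponentiating is a standard trick to avoid overflow; the example scores are arbitrary):

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                             # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in logits]    # step 1: exponentiate (all positive)
    total = sum(exps)
    return [e / total for e in exps]            # step 2: normalize (sums to 1)

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs)        # roughly [0.66, 0.24, 0.10]; the largest score gets the largest share
print(sum(probs))   # 1.0 (up to floating-point rounding)
```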
The accompanying diagram shows the transformation from raw scores through exponential normalization, culminating in two selection strategies: greedy selection (always choosing the highest-probability word) and sampling (randomly selecting from the top-k most likely words); both are sketched in code below. A quick aside: the model in the diagram is a chatbot predicting its next word, not a weather forecaster; a weather forecasting AI would use entirely different vectors.
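Given a softmax output, the two selection strategies can be sketched in a few lines. The candidate words and probabilities below are hypothetical, standing in for whatever a real model would produce:

```python
import random

words = ["sunny", "cloudy", "rainy", "purple"]   # hypothetical next-word candidates
probs = [0.55, 0.30, 0.14, 0.01]                 # hypothetical softmax output

def greedy_pick(words, probs):
    """Greedy decoding: always choose the single most likely word."""
    return max(zip(words, probs), key=lambda pair: pair[1])[0]

def top_k_sample(words, probs, k=3):
    """Top-k sampling: draw randomly from the k most likely words, weighted by probability."""
    ranked = sorted(zip(words, probs), key=lambda pair: pair[1], reverse=True)[:k]
    top_words, top_probs = zip(*ranked)
    return random.choices(top_words, weights=top_probs, k=1)[0]

print(greedy_pick(words, probs))    # always "sunny"
print(top_k_sample(words, probs))   # usually "sunny", sometimes "cloudy" or "rainy"
```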
Why Softmax Matters in AI
Softmax appears at the final layer of classification networks where multiple options must be evaluated. For example, when classifying images, softmax converts raw neural outputs into statements like: "This image is 87% likely to be a cat, 12% likely to be a dog, and 1% likely to be something else."
Key Properties That Make Softmax Powerful
Differentiability
Softmax is smooth and differentiable, which is crucial for neural network training through backpropagation. This allows networks to make precise, incremental adjustments during learning.
Amplification of Differences
The exponential nature of softmax naturally emphasizes larger differences while compressing smaller ones. This mirrors human decision-making, where we tend to strongly favor the most compelling option.
Probabilistic Interpretation
Unlike other activation functions, softmax directly outputs valid probabilities, making its results immediately interpretable without further processing.
Softmax vs. Sigmoid
While sigmoid transforms a single value into a probability between 0 and 1 (perfect for binary decisions), softmax extends this concept to handle scenarios with multiple options. You can think of sigmoid as a special case of softmax where there are only two classes.
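That relationship is easy to verify numerically: feeding the score x and a fixed score of 0 through softmax reproduces the sigmoid exactly. The value 1.7 below is an arbitrary test input.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = 1.7
print(sigmoid(x))            # ~0.8455
print(softmax([x, 0.0])[0])  # the same value: sigmoid is softmax over two classes,
                             # with the second class's score pinned at 0
```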
Applications Beyond Classification
Beyond its common use in classification, softmax also appears in:
- Attention mechanisms in transformers
- Reinforcement learning for action selection
- Generative models for token selection in text generation
This elegant mathematical function demonstrates how relatively simple equations, properly arranged, can create systems with remarkably sophisticated behavior - a fundamental building block in the mathematical architecture powering modern artificial intelligence.
Neural Networks: Bringing the Math Together
[Figure: diagram showing how the mathematical components work together in a neural network, with labeled neurons (A, B, C in the input layer, D, E, F in the hidden layer, G in the output layer) and the corresponding mathematical functions above each layer.] Production AI systems have many hidden layers; the more hidden layers a network has, the "deeper" it is and the more compute it requires.
What makes these mathematical functions remarkable is how they work in concert:
- Text is broken into tokens and transformed into vectors (vectorization)
- These vectors flow through a neural network with multiple layers
- ReLU functions in the hidden layers introduce non-linearity and sparsity
- Attention mechanisms powered by softmax (another application of e) focus on relevant information
- Finally, a softmax layer (where one is used) converts the outputs into probabilities; a minimal end-to-end sketch follows this list
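The sketch below wires those stages together into a deliberately tiny network with random, untrained weights. It is a structural illustration only; nothing here has learned anything, and all sizes (a five-word vocabulary, 4-dimensional embeddings, one hidden layer) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"i": 0, "love": 1, "machine": 2, "learning": 3, "cats": 4}
embeddings = rng.normal(size=(5, 4))    # vectorization: token id -> dense vector
W_hidden = rng.normal(size=(4, 6))      # hidden layer weights (untrained)
W_out = rng.normal(size=(6, 3))         # output layer weights over 3 made-up classes

def forward(tokens):
    vectors = embeddings[[vocab[t] for t in tokens]]   # look up each token's vector
    pooled = vectors.mean(axis=0)                      # crude sentence representation
    hidden = np.maximum(0.0, pooled @ W_hidden)        # ReLU: non-linearity and sparsity
    logits = hidden @ W_out
    exps = np.exp(logits - logits.max())               # softmax turns scores into...
    return exps / exps.sum()                           # ...probabilities that sum to 1

print(forward(["i", "love", "machine", "learning"]))   # three probabilities summing to 1
```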
This mathematical symphony, centered around the remarkable properties of Euler's number, creates systems that can perform tasks once thought to be the exclusive domain of human cognition: writing essays, generating images, translating languages, and recognizing patterns in vast datasets.
Going deeper
The Symphony of Mathematical Principles in Action
BERT (Bidirectional Encoder Representations from Transformers) represents the mathematical heart of AI in full harmony. Developed by Google in 2018, BERT demonstrates how the mathematical building blocks we've explored—Euler's number, attention mechanisms, and vector spaces—can orchestrate a revolution in language understanding.
The Bidirectional Breakthrough
What makes BERT revolutionary is its bidirectional approach to context. While earlier models processed text in one direction (left-to-right or right-to-left), BERT simultaneously analyzes words from both directions. This mathematical innovation allows BERT to develop a more complete understanding of meaning, much like how humans comprehend language by considering the entire context.
Consider the sentence "The bank is closed due to flooding." The word "bank" has different meanings depending on context. Is this a financial institution or a riverside? BERT's bidirectional approach means it processes "The bank is closed due to" and "flooding" simultaneously rather than sequentially, allowing for more accurate disambiguation.
Euler's Number at Work in BERT's Attention
At BERT's mathematical core lies the self-attention mechanism, powered by the softmax function:
softmax(QK^T/√d_k)V
This elegant formula relies fundamentally on Euler's number (e ≈ 2.71828...), which we've already seen creates the smoothest possible growth patterns. When BERT calculates attention scores between words, each score undergoes an exponential transformation:
e^(score) / Σ e^(all scores)
This exponential transformation, using e^x, creates precisely the right mathematical dynamics for language understanding:
- It converts all values to positive numbers (essential for attention weights)
- It amplifies differences between scores (making important connections stand out)
- It creates a smooth, differentiable function (enabling effective learning)
The special property of e^x—being its own derivative—plays a crucial role in BERT's learning process. When the model adjusts its parameters through backpropagation, this mathematical property ensures the smoothest possible gradient flow, allowing for precise parameter updates.
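Under the same caveat (toy sizes, random inputs), the whole attention formula fits in a few lines of NumPy. This is a single-head sketch of scaled dot-product attention, not BERT's full multi-head implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, written out step by step."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # how relevant each key is to each query
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)                         # e^(score): Euler's number at work
    weights /= weights.sum(axis=-1, keepdims=True)   # each row becomes a probability distribution
    return weights @ V                               # weighted mix of the value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, 8-dimensional queries (arbitrary toy sizes)
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8): one new vector per token
```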
Masked Language Modeling: Mathematical Prediction in Context
BERT employs a clever mathematical training approach called masked language modeling. The model randomly masks 15% of tokens in its input and learns to predict them based on bidirectional context:
"The [MASK] fox jumps over the lazy dog" → BERT predicts "quick"
This prediction task creates a more challenging mathematical optimization problem than previous approaches. Instead of learning sequential patterns, BERT must develop a deep contextual understanding that works in both directions simultaneously. This forces the model to encode richer mathematical representations of language in its vector space.
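You can watch masked language modeling happen with a few lines, assuming the Hugging Face transformers library is installed and the bert-base-uncased weights can be downloaded:

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The [MASK] fox jumps over the lazy dog."):
    # Each prediction carries the candidate word and the probability BERT assigns it.
    print(prediction["token_str"], round(prediction["score"], 3))
```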
The Vector Space of Contextual Embeddings
BERT exemplifies the contextual embeddings we discussed in the "From Words to Numbers" section. While static embeddings like Word2Vec assign the same vector to a word regardless of context, BERT creates dynamic vectors that shift based on surrounding words:
"bank" in "river bank" → [0.42, 0.12, ...] "bank" in "bank account" → [0.11, 0.87, ...]
This mathematical approach creates a more expressive vector space where relationships between concepts are captured with greater nuance. The high-dimensional geometry of this space (768 dimensions in BERT-base) provides sufficient mathematical expressivity to encode complex linguistic relationships.
ReLU and Feed-Forward Networks in BERT
Between its attention layers, BERT employs position-wise feed-forward networks with a non-linear activation closely related to the ReLU we explored earlier:
f(x) = max(0, x)
(Strictly speaking, the original BERT uses GELU, a smoothed variant of ReLU, but its role is the same.) ReLU-style activations introduce crucial non-linearity and sparsity to neural networks. In BERT, these feed-forward networks process the outputs from the attention mechanisms, adding another layer of mathematical transformation that enriches the model's representational power.
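A minimal NumPy sketch of such a block, using the common tanh approximation of GELU and random, untrained weights (the 768 → 3072 → 768 sizes match BERT-base; everything else is illustrative):

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU, the smooth ReLU-like activation used in BERT."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward block: expand, apply the non-linearity, project back."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 768))                            # 4 token vectors from attention
W1, b1 = rng.normal(size=(768, 3072)), np.zeros(3072)    # expand 768 -> 3072
W2, b2 = rng.normal(size=(3072, 768)), np.zeros(768)     # project back 3072 -> 768
print(feed_forward(x, W1, b1, W2, b2).shape)             # (4, 768)
```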
The Mathematical Scale of BERT
The base version of BERT contains approximately 110 million parameters, while the large variant scales to 340 million. Each parameter represents a degree of freedom in the model's functional space, creating an immensely powerful mathematical framework for language understanding.
This computational scale allows BERT to capture nuanced patterns in language that simpler models miss. The mathematical expressivity of hundreds of millions of parameters, organized through carefully designed architecture, creates a system that can perform tasks once thought to require human-level understanding.
BERT in the Mathematical Symphony of AI
BERT demonstrates how the mathematical concepts we've explored—Euler's number, vector spaces, non-linear activations, and attention mechanisms—can work in concert to create systems with remarkable capabilities. It's a testament to how relatively simple mathematical principles, when arranged in the right architecture and scaled appropriately, can create emergent behaviors that mimic aspects of human cognition.
A Simpler Understanding
At the core of BERT's remarkable ability to understand language lies a seemingly simple mathematical formula: softmax(QK^T/√d_k)V. Hidden within this equation is Euler's number (e ≈ 2.71828...), the same magical constant that appears throughout nature in growth patterns and decay curves.
When BERT processes a sentence like "I'll meet you at the bank," it needs to determine if "bank" refers to a financial institution or a riverside. This happens through the softmax function, which is defined as:
softmax(z)_i = e^(z_i) / Σ_j e^(z_j)
Here, Euler's number creates an elegant mathematical transformation. For each attention score between words, BERT calculates e^(score), which amplifies important connections while keeping all values positive. Words that are highly relevant to each other receive exponentially higher attention weights thanks to the special properties of e^x.
What makes this approach so powerful is that Euler's number creates the perfect growth curve - at any point, its rate of change equals its value. This property enables smooth gradient flow during learning, allowing BERT to fine-tune which connections between words matter most for understanding context.
Closing
When you interact with an AI language model, remember that behind its seemingly magical language understanding is Euler's number - the same mathematical constant that describes compound interest, population growth, and radioactive decay - now helping machines understand the subtle meanings in human language. In the seemingly simple equations of e, sigmoid, ReLU, and vector spaces lies a whole world of artificial intelligence - a testament to how abstract mathematics can create systems that begin to mirror human understanding.
The special properties of e^x—its unique ability to be its own derivative, creating the smoothest possible growth patterns—ripple throughout these systems. This mathematical elegance translates directly into computational efficiency, allowing neural networks to learn more effectively. Mathematics is not only the language of nature, but it's also the language of artificial intelligence. This may be its most impactful application to date.
Thank you for reading