A Framework for Evaluating AI Confidence

When interacting with large language models like GPT, ChatGPT, or Claude, we often focus on the answer provided. But beneath the surface of these responses lies a landscape of probability and uncertainty that remains invisible to most users. Understanding this hidden dimension offers valuable insight into the reliability of AI-generated information.
This analysis explores a practical method for examining AI confidence at the token level, providing a framework for distinguishing between high-confidence information and potentially problematic claims. Even if you are not a developer, it is an interesting read.
The Certainty Illusion
Modern language models present a paradox: they express factual claims with linguistic certainty regardless of their actual confidence in the information. Unlike humans who naturally signal uncertainty through qualifiers, hesitations, or explicit acknowledgment of limitations, AI systems typically deliver both established facts and questionable assertions with the same authoritative tone.
This creates a challenge for users attempting to gauge reliability. The apparent fluency and coherence of AI responses can mask underlying uncertainty, leading to what researchers call "the certainty illusion" - the tendency to perceive confident language as indicating factual accuracy.
What makes this relevant is that the mathematical confidence a model has in specific words often varies dramatically throughout a single response. Identifying these variations opens a window into the model's internal uncertainty landscape.
Dissecting Confidence
The code I show below implements a practical method for evaluating AI confidence at a granular level. Rather than accepting responses at face value, I examine the probability distributions behind each token (roughly corresponding to individual words or parts of words) in the generated text.
The core mechanism works by:
- Submitting a query to the model while requesting logprobs (logarithmic probabilities, derived from the model's output logits) for each generated token
- Establishing a confidence threshold (in this case, 0.6 or 60%)
- Filtering out common stop words and punctuation that don't carry significant meaning
- Identifying meaningful tokens where the model's confidence falls below the threshold
- Examining alternative tokens the model considered but didn't select
- Calculating an overall confidence assessment based on these uncertainty markers
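The steps above can be sketched in a few lines of plain Python, using hypothetical (token, logprob) pairs standing in for an API response:

```python
import math

# Hypothetical (token, logprob) pairs standing in for an API response
tokens = [
    ("Water", -0.05),
    ("boils", -0.10),
    ("at", -0.01),
    ("approximately", -1.20),
    ("100", -0.02),
]

stop_words = {"at", "the", "a"}
threshold = 0.6  # flag tokens the model chose with < 60% probability

flagged = [
    (tok, math.exp(lp))  # exp(logprob) converts back to a probability
    for tok, lp in tokens
    if tok.lower() not in stop_words and math.exp(lp) < threshold
]
print(flagged)  # only 'approximately' (~0.30) falls below the threshold
```

The key conversion is `math.exp(logprob)`: the API reports log-probabilities, so exponentiating recovers the model's actual confidence in each chosen token.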
When applied to a sample question about water boiling at different temperatures (one built on a false premise), this approach reveals which specific concepts in the response carry higher uncertainty, offering a more nuanced view of the model's knowledge boundaries.
The Technical Implementation
Let's examine the key components of this confidence analysis system:
# Create the API call with logprobs enabled
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": query}
    ],
    logprobs=True,
    top_logprobs=top_n_alternatives + 1,
    max_tokens=150,
    temperature=0
)
This configuration instructs the model to return not just the generated text but also the probability values associated with each token and the top alternative tokens it considered. Setting temperature to zero minimizes randomness, ensuring the model selects its highest confidence options.
The analysis then focuses on meaningful tokens with lower confidence; the same certainty technique can, of course, also be applied at more creative temperature settings.
By filtering out common words and focusing on substantive terms where the model showed hesitation, the system identifies the specific concepts where the AI might be operating with less certainty.
Practical Applications
This confidence analysis framework serves multiple purposes across different contexts:
Educational settings: Teachers using AI tools can help students distinguish between reliable information and areas requiring verification through original sources. This supports developing critical thinking skills about AI-generated content.
Research assistance: When exploring unfamiliar topics, identifying lower-confidence segments helps researchers prioritize which claims require additional verification before incorporation into their work.
Content creation: Writers using AI assistance can focus editorial attention on sections where the model demonstrates uncertainty, potentially saving significant time in the fact-checking process.
Knowledge boundaries: For technical or specialized questions, this approach helps identify where a model transitions from established knowledge to more speculative generation.
Harmful content detection: Lower confidence tokens often appear when models navigate sensitive areas where their training has created uncertainty about appropriate responses.
Looking Beyond Token Probabilities
While token-level confidence provides valuable insight, a comprehensive evaluation should consider additional factors:
Contextual consistency: How well do different parts of the response align with each other? Internal contradictions often signal knowledge gaps.
Specificity gradient: Does the model transition from specific, detailed information to vaguer generalizations? This pattern frequently indicates boundaries in the model's knowledge; look for patterns in the uncertainty.
Citation behavior: When and how does the model reference external sources? Spontaneous citation often correlates with higher factual reliability.
Domain-specific markers: Different fields have characteristic patterns that signal expertise versus superficial knowledge. Recognizing these patterns helps evaluate responses in specialized domains.
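The specificity gradient in particular lends itself to a simple quantitative sketch: a moving average of per-token confidence can surface where a response drifts from confident detail into uncertain generalization. The probability values below are made up for illustration, not taken from a real response:

```python
def rolling_confidence(probs, window=3):
    """Moving average of token probabilities; sustained dips hint at knowledge boundaries."""
    return [
        sum(probs[i:i + window]) / window
        for i in range(len(probs) - window + 1)
    ]

# Hypothetical per-token probabilities: a confident start, an uncertain tail
probs = [0.95, 0.92, 0.90, 0.55, 0.48, 0.40]
trend = rolling_confidence(probs)
print([round(v, 2) for v in trend])  # [0.92, 0.79, 0.64, 0.48]
```

A steadily declining trend like this one would mark the region where the model transitions from established knowledge to more speculative generation.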
Implementation Considerations
The script presented here demonstrates a basic implementation, but several refinements could enhance its utility:
- Dynamic thresholding: Rather than using a fixed confidence threshold, adapting the threshold based on the topic complexity or domain might yield more meaningful results.
- Visualization: Presenting confidence information visually, perhaps through color-coding or confidence bands, would make this information more accessible to non-technical users.
- Contextual weighting: Not all tokens carry equal significance. Weighting uncertainty based on the token's importance to the overall meaning would provide more nuanced assessment.
- Multi-model comparison: Comparing confidence patterns across different models on the same question could reveal consensus uncertainty versus model-specific limitations.
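As one sketch of the visualization idea, terminal ANSI colors can band tokens by confidence; the thresholds (0.8 and 0.6) are arbitrary choices for illustration:

```python
def colorize(token, prob):
    """ANSI color-code a token by confidence: green >= 0.8, yellow >= 0.6, red below."""
    if prob >= 0.8:
        code = "32"   # green
    elif prob >= 0.6:
        code = "33"   # yellow
    else:
        code = "31"   # red
    return f"\033[{code}m{token}\033[0m"

# Hypothetical tokens with per-token confidence values
tokens = [("Water", 0.95), ("boils", 0.91), ("at", 0.99), ("lower", 0.47)]
print(" ".join(colorize(t, p) for t, p in tokens))
```

Run in a color-capable terminal, this renders the low-confidence token "lower" in red, letting a non-technical reader spot uncertainty at a glance.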
Conclusion
As AI systems become more integrated into information workflows, the ability to evaluate their reliability becomes increasingly valuable. Token-level confidence analysis provides a practical framework for looking beyond surface fluency to understand where models operate with certainty versus where they navigate knowledge boundaries.
This approach doesn't replace human judgment but augments it with quantitative insight into the model's internal uncertainty. By making this invisible dimension of AI responses visible, we develop more sophisticated ways to collaborate with these powerful yet imperfect systems.
Understanding not just what an AI says, but how confidently it says it, represents an essential skill for the emerging era of human-AI partnership in knowledge work.
The Code
import openai
import math
import os
import httpx
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Get API key
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    print("Error: OPENAI_API_KEY environment variable not found.")
    exit(1)

print("OpenAI version:", openai.__version__)

# Create a clean httpx client with no proxy settings
http_client = httpx.Client()

# Try to create the OpenAI client with an explicit http_client
try:
    client = openai.OpenAI(
        api_key=api_key,
        http_client=http_client
    )
    print("Client created successfully")
except Exception as e:
    print(f"Error creating client: {e}")
    print(f"Error type: {type(e)}")
    exit(1)

query = "Explain why water boils at a lower temperature when heated more intensely at the same altitude?"

# Set of stop words; these all have high confidence but carry no real meaning
stop_words = {
    "of", "the", "and", "a", "is", "in", "it", "to", ",", ".", "at", "when",
    "as", "for", "with", "on", "be", "that", "by", "this", "have", "do",
    "so", "than", "then", "however", "but", "can", "from", "into", "will",
    "are", "was", "were", "been", "being", "had", "has", "could", "would",
    "should", "may", "might", "must", "am", "shall", "ought", "did",
    "does", "having", "here", "there", "where", "which", "who", "whom",
    "whose", "what", "why", "how", "all", "any", "both", "each", "few",
    "many", "some", "these", "those", "other", "one", "two", "three",
    "first", "second", "third", "up", "down", "out", "off", "over", "under",
    "again", "further", "thence", "once", "its", "they"
}

# Confidence threshold
low_confidence_threshold = 0.6

# Number of top alternatives to display
top_n_alternatives = 3

# Make the API call
try:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": query}
        ],
        logprobs=True,
        top_logprobs=top_n_alternatives + 1,  # Get one extra to filter out the chosen token
        max_tokens=150,
        temperature=0
    )

    answer = response.choices[0].message.content
    logprobs_data = response.choices[0].logprobs

    print(f"Question: {query}\n")
    print(f"Answer: {answer}\n")
    print(f"Potentially Less Confident Key Concepts (Confidence < {low_confidence_threshold:.2f}):\n")

    if logprobs_data and logprobs_data.content:
        low_confidence_keywords = []
        for token_info in logprobs_data.content:
            token = token_info.token.strip().lower()
            token_prob = math.exp(token_info.logprob)  # Convert logprob to a probability
            if token and token not in stop_words and token_prob < low_confidence_threshold:
                alternatives = {
                    t.token.strip(): math.exp(t.logprob)
                    for t in token_info.top_logprobs
                    if t.token.strip().lower() != token and t.token.strip().lower() not in stop_words
                }
                sorted_alternatives = sorted(alternatives.items(), key=lambda x: x[1], reverse=True)[:top_n_alternatives]
                low_confidence_keywords.append({
                    "token": token_info.token.strip(),
                    "confidence": token_prob,
                    "alternatives": sorted_alternatives
                })
        if low_confidence_keywords:
            for item in low_confidence_keywords:
                print(f"Token: '{item['token']}' (Confidence: {item['confidence']:.4f})")
                if item["alternatives"]:
                    alt_str = ", ".join(f"'{alt}': {prob:.4f}" for alt, prob in item["alternatives"])
                    print(f"  Alternatives: {alt_str}")
                print("-" * 30)
        else:
            print("No potentially low-confidence key concepts found.")
    else:
        print("No log probabilities data in response.")
except Exception as e:
    print(f"Error: {e}")
    print(f"Error type: {type(e)}")
    exit(1)  # Exit here: `answer` is undefined if the call failed

print(f"\nFinal Answer: {answer}")
And now the output. Remember, dear reader, that we fed the model a lie:
python scripts/confidence.py
OpenAI version: 1.30.1
Client created successfully
Question: Explain why water boils at a lower temperature when heated more intensely at the same altitude?
Answer: When water is heated more intensely, it absorbs more energy, which increases the kinetic energy of the water molecules. This causes the water molecules to move faster and eventually reach the boiling point, where they can escape the liquid phase and turn into vapor.
At higher altitudes, the atmospheric pressure is lower compared to sea level. This lower pressure means that there are fewer air molecules pressing down on the surface of the water. As a result, it requires less energy for the water molecules to overcome the reduced pressure and reach the boiling point.
Therefore, when water is heated more intensely at a higher altitude, it reaches its boiling point at a lower temperature compared to water at sea level because the lower atmospheric pressure makes it easier for the water molecules to escape into the
Potentially Less Confident Key Concepts (Confidence < 0.60):
Token: 'absorbs' (Confidence: 0.4697)
Alternatives: 'gains': 0.1532, 'reaches': 0.0966, 'increases': 0.0577
------------------------------
Token: 'more' (Confidence: 0.4843)
Alternatives: 'heat': 0.2702, 'energy': 0.2152, 'thermal': 0.0178
------------------------------
Token: 'energy' (Confidence: 0.4489)
Alternatives: 'heat': 0.3190, 'thermal': 0.2319, 'kinetic': 0.0002
------------------------------
Token: 'increases' (Confidence: 0.5853)
Alternatives: 'causes': 0.3550, 'raises': 0.0251, 'leads': 0.0208
------------------------------
Token: 'kinetic' (Confidence: 0.4657)
Alternatives: 'average': 0.3506, 'temperature': 0.0813, 'speed': 0.0381
------------------------------
Token: 'causes' (Confidence: 0.5907)
Alternatives: 'increased': 0.3666, 'leads': 0.0150, 'higher': 0.0072
------------------------------
Token: 'eventually' (Confidence: 0.2232)
Alternatives: 'break': 0.1431, 'collide': 0.1100
------------------------------
Token: 'reach' (Confidence: 0.4426)
Alternatives: 'break': 0.4356, 'overcome': 0.0678, 'escape': 0.0449
------------------------------
Token: 'boiling' (Confidence: 0.5939)
Alternatives: 'point': 0.3166, 'temperature': 0.0322, 'necessary': 0.0177
------------------------------
Token: 'turn' (Confidence: 0.5100)
Alternatives: 'become': 0.3407, 'enter': 0.1037, 'form': 0.0328
------------------------------
Token: 'higher' (Confidence: 0.5218)
Alternatives: 'normal': 0.0028
------------------------------
Token: 'lower' (Confidence: 0.4734)
Alternatives: 'reduced': 0.2847, 'decrease': 0.1232, 'means': 0.0773
------------------------------
Token: 'means' (Confidence: 0.4331)
Alternatives: 'reduces': 0.3582, 'affects': 0.0807, 'makes': 0.0699
------------------------------
Token: 'pressing' (Confidence: 0.5256)
Alternatives: 'pushing': 0.3525, 'above': 0.1067, 'exert': 0.0140
------------------------------
Token: 'requires' (Confidence: 0.5007)
Alternatives: 'takes': 0.0439, 'becomes': 0.0324
------------------------------
Token: 'reach' (Confidence: 0.5226)
Alternatives: 'turn': 0.2040, 'transition': 0.1329, 'escape': 0.0929
------------------------------
Token: 'water' (Confidence: 0.3544)
Alternatives: 'sea': 0.3067, 'heating': 0.0610
------------------------------
Token: 'because' (Confidence: 0.4387)
Alternatives: 'due': 0.0540
------------------------------
Confidence Level (based on potentially less confident key concepts): Low
Final Answer: When water is heated more intensely, it absorbs more energy, which increases the kinetic energy of the water molecules. This causes the water molecules to move faster and eventually reach the boiling point, where they can escape the liquid phase and turn into vapor.
At higher altitudes, the atmospheric pressure is lower compared to sea level. This lower pressure means that there are fewer air molecules pressing down on the surface of the water. As a result, it requires less energy for the water molecules to overcome the reduced pressure and reach the boiling point.
Therefore, when water is heated more intensely at a higher altitude, it reaches its boiling point at a lower temperature compared to water at sea level because the lower atmospheric pressure makes it easier for the water molecules to escape into the
Final Answer (Stop words removed): water heated more intensely, absorbs more energy, increases kinetic energy water molecules. causes water molecules move faster eventually reach boiling point, escape liquid phase turn vapor. higher altitudes, atmospheric pressure lower compared sea level. lower pressure means fewer air molecules pressing surface water. result, requires less energy water molecules overcome reduced pressure reach boiling point. therefore, water heated more intensely higher altitude, reaches boiling point lower temperature compared water sea level because lower atmospheric pressure makes easier water molecules escape
Important Note: If we had used Claude it would have corrected us, as the premise is false.
Thank you for reading