Don't Believe the Bot: How to Evaluate AI Outputs Like a Behavior Analyst
David J. Cox PhD MSB BCBA-D, Ryan L. O'Donnell MS BCBA
The portal flickers again. The terrain re-renders. A new checkpoint appears in the distance. Time to level up.

Welcome to the Third Issue of the Ongoing Series
Last week, we reminded ourselves that AI is math, not magic. AI is a layered set of equations shaped by data and feedback, not a sentient oracle whispering truths from the void.
This week, we take the next step: learning how to evaluate what AI systems produce with the same skepticism and precision we apply to behavior data.
Because no matter how confident (or eloquent) the bot may sound–
And no matter how polished the marketing deck looks–
We must never mistake fluency for validity.
Below you’ll find a practical framework for interrogating AI outputs, complete with real-world analogies, common failure points, and a checklist you can use this week.
Locate the Contingencies: Why did the model say that?
Let’s start by asking ourselves this question:
“What antecedent data and reinforcement history shaped this output?”
As behavior analysts, we’re trained to identify the A-B-C of behavior. Now we apply that same lens to model behavior.
- Antecedent (Prompt / Input): Inspect the exact wording of your prompt, query parameters, or uploaded data. Small differences (“rate” vs. “evaluate”) can lead to dramatically different results. Whenever possible, prompt transparently; include the relevant data or context directly rather than relying on vague commands (see the first sketch after this list).
- Behavior (Model Output / Inference): Remember: Large Language Models (LLMs), such as ChatGPT, Claude, and Gemini, emit the statistically most probable next token*, not the logically best one (the second sketch after this list shows what that looks like). Translation: Just because it sounds right doesn’t mean it is right.
- Consequence (Reinforcement Mechanism): Public-facing LLMs are fine-tuned using human feedback, which typically nudges the model to sound helpful, harmless, and pleasing. If being wrong costs nothing, the model prioritizes user satisfaction over truth.
*A “token” is just a chunk of text—like a word, part of a word, or punctuation mark—that the model processes one at a time to build its response.
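To make “prompt transparently” concrete, here is a minimal sketch in Python. It assumes the official openai package is installed and an API key is configured; the model name, the session note, and both prompts are invented for illustration, and the same idea applies to any chat model you happen to use.

```python
# A minimal sketch of "prompt transparently": give the model the actual data
# and the exact question, rather than a vague command.
# Assumes the `openai` Python SDK is installed and OPENAI_API_KEY is set;
# the model name and session-note text below are placeholders.
from openai import OpenAI

client = OpenAI()

session_note = (
    "Client engaged in 4 instances of elopement during a 30-minute session; "
    "each followed a transition demand and ended with staff attention."
)

# Vague command: the model must guess what "rate" means and which data to use.
vague_prompt = "Rate this session."

# Transparent prompt: the data, the task, and the criteria are all explicit.
transparent_prompt = (
    "Here is a session note:\n"
    f"{session_note}\n\n"
    "Evaluate the note for (1) whether a possible function of the behavior is "
    "suggested by the antecedents and consequences described, and (2) what "
    "additional data you would need before drawing any conclusion."
)

for label, prompt in [("vague", vague_prompt), ("transparent", transparent_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```

Run both and compare: the vague prompt invites the model to fill the gaps with whatever is statistically common, while the transparent prompt constrains it to the data you actually care about.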
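And if the asterisked footnote still feels abstract, this second sketch shows what tokens look like and how “most probable next token” plays out. It assumes the open-source tiktoken package; the candidate probabilities are made up for illustration, not pulled from any real model.

```python
# A minimal sketch of tokens and next-token prediction.
# Assumes the open-source `tiktoken` package is installed; the probability
# table below is invented for illustration, not taken from a real model.
import tiktoken

# (a) Text becomes tokens: chunks of words, word pieces, and punctuation.
enc = tiktoken.get_encoding("cl100k_base")
text = "Reinforcement makes the target behavior more"
token_ids = enc.encode(text)
print([enc.decode([tid]) for tid in token_ids])
# The exact pieces depend on the tokenizer; the point is that the model
# reads and writes in chunks like these, one at a time.

# (b) For the next position, the model scores every candidate token and,
# in its simplest mode, emits the most probable one -- not the most
# logically defensible one.
candidate_next_tokens = {
    " likely": 0.52,    # fluent and correct
    " frequent": 0.31,  # also plausible
    " intense": 0.12,
    " rare": 0.05,      # fluent but wrong -- it still gets some probability
}
most_probable = max(candidate_next_tokens, key=candidate_next_tokens.get)
print(f"Most probable next token: {most_probable!r}")
```

Notice that the wrong continuation never disappears from the table; it just loses the probability contest. That is why fluent-but-false answers are a feature of the math, not a glitch.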
Red-flag heuristic: If the output flatters your confirmation bias or delivers a perfect answer too quickly, pause. Ask: What was reinforced during model training and alignment—and for whose benefit?
Check the Stimulus Control: What trained the model?
An intervention built on flawed baseline data is useless—or worse, harmful. The same goes for AI.