Cleaning, Structuring, and Feeding Data to Machines (and Why It Matters)
David J. Cox PhD MSB BCBA-D, Ryan L. O'Donnell MS BCBA

In Idiocracy, the world doesn’t collapse because people become malicious. It collapses because the system becomes sloppy and then starts rewarding the sloppiness.
Nobody is trying to ruin civilization. They’re just choosing the easier option, over and over, until the easier option becomes the only option left.
That is the risk pattern worth paying attention to in ABA as AI tools mature. Not “AI is dangerous”. Not “robots are coming”.
Something more subtle: When we feed low-quality data into high-powered systems, we don’t just get low-quality outputs. We get workflows that start trusting low-quality outputs.
And once that happens, the organization doesn’t merely make occasional mistakes. It begins to normalize them.
This issue is about the part of AI that rarely gets discussed in vendor demos, but explains most real-world success and failure: Data preprocessing.
Cleaning. Structuring. Labeling. Choosing what counts. Deciding what gets ignored.
Because models learn from the version of reality your data pipeline creates and not reality itself.
The Data Pipeline is the Model (Whether You Admit It or Not)
Most people talk about AI like it’s a single thing: the Model.
In practice, AI systems behave more like an assembly line:
Collection → Cleaning → Structuring → Labeling → Training → Evaluation → Deployment → Monitoring
The model is just one station on that line. If the upstream stations are messy, inconsistent, or poorly governed, the downstream system will still produce outputs, but those outputs will be brittle, biased, or misleading.
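To make the assembly line concrete, here is a deliberately tiny sketch in Python. Every function and field name is invented for illustration; the only point is that the model is one stage among several, and every upstream stage quietly decides what the model gets to see.

```python
# Illustrative only: stage names mirror the assembly line above.
# The "model" step would be just one more function at the end of the chain.

def collect(sources):
    """Pull raw session records from whatever systems you actually use."""
    return [record for source in sources for record in source]

def clean(records):
    """Drop exact duplicates and keep the first copy of each session."""
    seen, cleaned = set(), []
    for r in records:
        key = (r["client_id"], r["session_date"])
        if key not in seen:
            seen.add(key)
            cleaned.append(r)
    return cleaned

def structure(records):
    """Reshape records into the handful of fields a downstream model expects."""
    return [
        {"client_id": r["client_id"],
         "minutes": r.get("session_minutes", 0),
         "goal_met": r.get("goal_met")}
        for r in records
    ]

def run_pipeline(sources, stages=(collect, clean, structure)):
    """Chain the stages in order; training, evaluation, and monitoring follow."""
    data = sources
    for stage in stages:
        data = stage(data)
    return data
```

Notice how many silent decisions live in `clean` and `structure` alone: what counts as a duplicate, what the default is when minutes are missing, which fields never make it downstream at all.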
This is where Idiocracy becomes a useful metaphor. In an “Idiocracy pipeline”, the organization still has dashboards; it still has automation; it still has clean-looking outputs.
But the underlying inputs are drifting:
- Definitions change without documentation
- Labels get applied inconsistently
- Missingness gets ignored
- Convenience data replaces meaningful data
- Systems optimize what’s easiest to measure, not what matters
And then the AI system learns that version of the world as if it were ground truth.
Garbage In, Garbage Out is Too Kind
“Garbage in, garbage out” makes it sound like the consequences are obvious. But in modern AI systems (especially LLMs), bad inputs often produce outputs that are:
- fluent
- confident
- structured
- professional-looking
Which means the real risk is error that looks like competence. That’s the Idiocracy pattern. A system that appears functional while systematically degrading the quality of thinking and decision-making inside it.
Two Kinds of AI, Two Kinds of Data Problems
ABA is currently experiencing a wave of LLM adoption, so it’s tempting to think “AI = ChatGPT”. But AI in clinical systems is much broader than language generation. In practice, ABA organizations encounter two major families of AI:
1. Generative AI (e.g., LLMs). Used for:
- note drafting and summarization
- rewriting and formatting reports
- caregiver communication drafts
- policy and training materials
- extracting themes from text
LLMs are highly sensitive to text quality, structure, and context boundaries. If the input is messy, incomplete, or inconsistent, the output will still be fluent but less reliable.
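One low-tech way to respect those boundaries is to make them explicit before anything reaches the model. The sketch below is hypothetical (no particular vendor’s API, and the field names are invented), but it shows the habit: mark clearly where the session data begins and ends, and refuse to draft anything when required fields are missing rather than letting the model write fluently around the gap.

```python
# Hypothetical sketch: structuring what goes INTO an LLM, not calling any
# specific API. The explicit markers and the "not documented" rule are the point.

REQUIRED_FIELDS = ["client_initials", "session_date", "targets", "raw_notes"]

def build_note_prompt(session: dict) -> str:
    """Build a drafting prompt only when the underlying record is complete."""
    missing = [field for field in REQUIRED_FIELDS if not session.get(field)]
    if missing:
        # Surface the gap instead of letting the model confabulate around it.
        raise ValueError(f"Cannot draft note; missing fields: {missing}")

    return (
        "You are drafting a session note. Use ONLY the information between "
        "the markers. If something is not present, write 'not documented'.\n"
        "=== SESSION DATA START ===\n"
        f"Date: {session['session_date']}\n"
        f"Targets addressed: {', '.join(session['targets'])}\n"
        f"Raw notes: {session['raw_notes']}\n"
        "=== SESSION DATA END ==="
    )
```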
2. Predictive / scoring models (non-LLM AI). Used for:
- no-show risk scores
- intake prioritization
- scheduling optimization
- authorization likelihood predictions
- churn risk
- outcome forecasting
These systems are highly sensitive to labels, missingness, leakage, and drift. They often fail not because the math is wrong, but because the data pipeline quietly violates the assumptions those models depend on.
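Leakage is the easiest of those failure modes to show in a few lines. In the hypothetical no-show example below (all column names invented), one feature is only recorded after the outcome has already happened, so a model trained on it looks brilliant in evaluation and useless in the clinic. The fix is unglamorous: keep an explicit list of what is actually known at prediction time, and refuse features that fall outside it.

```python
# Hedged illustration of label leakage with a made-up scheduling table.
import pandas as pd

sessions = pd.DataFrame({
    "client_id":      [1, 1, 2, 2, 3],
    "scheduled_date": pd.to_datetime(
        ["2024-01-02", "2024-01-09", "2024-01-03", "2024-01-10", "2024-01-04"]),
    "reminder_sent":  [True, True, False, True, False],            # known BEFORE the visit
    "cancel_reason":  [None, "illness", None, None, "no answer"],  # recorded AFTER
    "no_show":        [0, 1, 0, 0, 1],                             # the label
})

# Leaky feature set: cancel_reason only exists once the no-show has happened,
# so it predicts the label almost perfectly offline and not at all in practice.
leaky_features = ["reminder_sent", "cancel_reason"]

# Safer feature set: only columns available at the moment the prediction is made.
KNOWN_BEFORE_VISIT = {"client_id", "scheduled_date", "reminder_sent"}
valid_features = ["reminder_sent"]
assert set(valid_features) <= KNOWN_BEFORE_VISIT
```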
Cleaning: Removing Noise Without Erasing Meaning
Cleaning sounds simple: remove errors, standardize formatting, handle missing values. But in behavioral health data, cleaning is never purely technical. It is always interpretive. Here are the most common “cleaning decisions” that change what the model learns.
1. Missingness is not random (and pretending it is creates fiction). In ABA data, missing values often mean something:
- session canceled
- staff turnover
- caregiver burnout
- insurance disruption
- device not charged
- client illness
- clinician didn’t enter the data
If you treat missingness as “just blank cells,” the model will learn patterns that reflect documentation habits, not client progress. Here’s a practical example. If “data not recorded” happens more often during difficult weeks, then the dataset will over-represent the easy weeks. Your model will learn a world that is cleaner than reality.
That’s not an AI failure. That’s a measurement failure.
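If your organization touches this data in code at all, the minimum viable habit looks something like the pandas sketch below (the table and column names are invented). Record that a value was missing before you fill or drop anything, so “no data this week” stays visible as its own pattern instead of disappearing into a tidy average.

```python
# Minimal sketch, assuming a hypothetical per-client weekly table.
import pandas as pd

df = pd.DataFrame({
    "client_id":      [1, 1, 1, 2, 2],
    "week":           [1, 2, 3, 1, 2],
    "trials_correct": [12, None, 15, 9, None],   # None = nothing was recorded
})

# The easy option: fill the blanks and move on. The model now sees a tidy
# world in which every week produced data.
df["trials_naive"] = df["trials_correct"].fillna(df["trials_correct"].mean())

# The better option: flag THAT the value was missing before doing anything
# else, so missingness can be modeled (or audited) as its own signal.
df["trials_missing"] = df["trials_correct"].isna()
df["trials_filled"]  = df["trials_correct"].fillna(df["trials_correct"].median())

print(df[["client_id", "week", "trials_missing", "trials_filled"]])
```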