Cleaning, Structuring, and Feeding Data to Machines (and Why It Matters)
David J. Cox PhD MSB BCBA-D, Ryan L. O'Donnell MS BCBA

In Idiocracy, the world doesn’t collapse because people become malicious. It collapses because the system becomes sloppy and then starts rewarding the sloppiness.
Nobody is trying to ruin civilization. They’re just choosing the easier option, over and over, until the easier option becomes the only option left.
That is the risk pattern worth paying attention to in ABA as AI tools mature. Not “AI is dangerous”. Not “robots are coming”.
Something more subtle: When we feed low-quality data into high-powered systems, we don’t just get low-quality outputs. We get workflows that start trusting low-quality outputs.
And once that happens, the organization doesn’t merely make occasional mistakes. It begins to normalize them.
This issue is about the part of AI that rarely gets discussed in vendor demos, but explains most real-world success and failure: Data preprocessing.
Cleaning. Structuring. Labeling. Choosing what counts. Deciding what gets ignored.
Because models learn from the version of reality your data pipeline creates and not reality itself.
The Data Pipeline is the Model (Whether You Admit It or Not)
Most people talk about AI like it’s a single thing: the Model.
In practice, AI systems behave more like an assembly line:
Collection → Cleaning → Structuring → Labeling → Training → Evaluation → Deployment → Monitoring
The model is just one station on that line. If the upstream stations are messy, inconsistent, or poorly governed, the downstream system will still produce outputs, but those outputs will be brittle, biased, or misleading.
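To make the assembly line concrete, here is a deliberately tiny sketch in Python. Every function and field name is invented for illustration; the only point is that the model is one stage among several, and every upstream stage quietly decides what the model gets to see.

```python
# Illustrative only: stage names mirror the assembly line above.
# The "model" step would be just one more function at the end of the chain.

def collect(sources):
    """Pull raw session records from whatever systems you actually use."""
    return [record for source in sources for record in source]

def clean(records):
    """Drop exact duplicates and keep the first copy of each session."""
    seen, cleaned = set(), []
    for r in records:
        key = (r["client_id"], r["session_date"])
        if key not in seen:
            seen.add(key)
            cleaned.append(r)
    return cleaned

def structure(records):
    """Reshape records into the handful of fields a downstream model expects."""
    return [
        {"client_id": r["client_id"],
         "minutes": r.get("session_minutes", 0),
         "goal_met": r.get("goal_met")}
        for r in records
    ]

def run_pipeline(sources, stages=(collect, clean, structure)):
    """Chain the stages in order; training, evaluation, and monitoring follow."""
    data = sources
    for stage in stages:
        data = stage(data)
    return data
```

Notice how many silent decisions live in `clean` and `structure` alone: what counts as a duplicate, what the default is when minutes are missing, which fields never make it downstream at all.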
This is where Idiocracy becomes a useful metaphor. In an “Idiocracy pipeline”, the organization still has dashboards; it still has automation; it still has clean-looking outputs.
But the underlying inputs are drifting:
- Definitions change without documentation
- Labels get applied inconsistently
- Missingness gets ignored
- Convenience data replaces meaningful data
- Systems optimize what’s easiest to measure, not what matters
And then the AI system learns that version of the world as if it were ground truth.
Garbage In, Garbage Out is Too Kind
“Garbage in, garbage out” makes it sound like the consequences are obvious. But in modern AI systems (especially LLMs), bad inputs often produce outputs that are:
- fluent
- confident
- structured
- professional-looking
Which means the real risk is error that looks like competence. That’s the Idiocracy pattern. A system that appears functional while systematically degrading the quality of thinking and decision-making inside it.
Two Kinds of AI, Two Kinds of Data Problems
ABA is currently experiencing a wave of LLM adoption, so it’s tempting to think “AI = ChatGPT”. But AI in clinical systems is much broader than language generation. In practice, ABA organizations encounter two major families of AI:
1. Generative AI (e.g., LLMs). Used for:
- note drafting and summarization
- rewriting and formatting reports
- caregiver communication drafts
- policy and training materials
- extracting themes from text
LLMs are highly sensitive to text quality, structure, and context boundaries. If the input is messy, incomplete, or inconsistent, the output will still be fluent but less reliable.
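One low-tech way to respect those boundaries is to make them explicit before anything reaches the model. The sketch below is hypothetical (no particular vendor’s API, and the field names are invented), but it shows the habit: mark clearly where the session data begins and ends, and refuse to draft anything when required fields are missing rather than letting the model write fluently around the gap.

```python
# Hypothetical sketch: structuring what goes INTO an LLM, not calling any
# specific API. The explicit markers and the "not documented" rule are the point.

REQUIRED_FIELDS = ["client_initials", "session_date", "targets", "raw_notes"]

def build_note_prompt(session: dict) -> str:
    """Build a drafting prompt only when the underlying record is complete."""
    missing = [field for field in REQUIRED_FIELDS if not session.get(field)]
    if missing:
        # Surface the gap instead of letting the model confabulate around it.
        raise ValueError(f"Cannot draft note; missing fields: {missing}")

    return (
        "You are drafting a session note. Use ONLY the information between "
        "the markers. If something is not present, write 'not documented'.\n"
        "=== SESSION DATA START ===\n"
        f"Date: {session['session_date']}\n"
        f"Targets addressed: {', '.join(session['targets'])}\n"
        f"Raw notes: {session['raw_notes']}\n"
        "=== SESSION DATA END ==="
    )
```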
2. Predictive / scoring models (non-LLM AI). Used for:
- no-show risk scores
- intake prioritization
- scheduling optimization
- authorization likelihood predictions
- churn risk
- outcome forecasting
These systems are highly sensitive to labels, missingness, leakage, and drift. They often fail not because the math is wrong, but because the data pipeline quietly violates the assumptions those models depend on.
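Leakage is the easiest of those failure modes to show in a few lines. In the hypothetical no-show example below (all column names invented), one feature is only recorded after the outcome has already happened, so a model trained on it looks brilliant in evaluation and useless in the clinic. The fix is unglamorous: keep an explicit list of what is actually known at prediction time, and refuse features that fall outside it.

```python
# Hedged illustration of label leakage with a made-up scheduling table.
import pandas as pd

sessions = pd.DataFrame({
    "client_id":      [1, 1, 2, 2, 3],
    "scheduled_date": pd.to_datetime(
        ["2024-01-02", "2024-01-09", "2024-01-03", "2024-01-10", "2024-01-04"]),
    "reminder_sent":  [True, True, False, True, False],            # known BEFORE the visit
    "cancel_reason":  [None, "illness", None, None, "no answer"],  # recorded AFTER
    "no_show":        [0, 1, 0, 0, 1],                             # the label
})

# Leaky feature set: cancel_reason only exists once the no-show has happened,
# so it predicts the label almost perfectly offline and not at all in practice.
leaky_features = ["reminder_sent", "cancel_reason"]

# Safer feature set: only columns available at the moment the prediction is made.
KNOWN_BEFORE_VISIT = {"client_id", "scheduled_date", "reminder_sent"}
valid_features = ["reminder_sent"]
assert set(valid_features) <= KNOWN_BEFORE_VISIT
```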
Cleaning: Removing Noise Without Erasing Meaning
Cleaning sounds simple: remove errors, standardize formatting, handle missing values. But in behavioral health data, cleaning is never purely technical. It is always interpretive. Here are the most common “cleaning decisions” that change what the model learns.
1. Missingness is not random (and pretending it is creates fiction). In ABA data, missing values often mean something:
- session canceled
- staff turnover
- caregiver burnout
- insurance disruption
- device not charged
- client illness
- clinician didn’t enter the data
If you treat missingness as “just blank cells,” the model will learn patterns that reflect documentation habits, not client progress. Here’s a practical example. If “data not recorded” happens more often during difficult weeks, then the dataset will over-represent the easy weeks. Your model will learn a world that is cleaner than reality.
That’s not an AI failure. That’s a measurement failure.
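If your organization touches this data in code at all, the minimum viable habit looks something like the pandas sketch below (the table and column names are invented). Record that a value was missing before you fill or drop anything, so “no data this week” stays visible as its own pattern instead of disappearing into a tidy average.

```python
# Minimal sketch, assuming a hypothetical per-client weekly table.
import pandas as pd

df = pd.DataFrame({
    "client_id":      [1, 1, 1, 2, 2],
    "week":           [1, 2, 3, 1, 2],
    "trials_correct": [12, None, 15, 9, None],   # None = nothing was recorded
})

# The easy option: fill the blanks and move on. The model now sees a tidy
# world in which every week produced data.
df["trials_naive"] = df["trials_correct"].fillna(df["trials_correct"].mean())

# The better option: flag THAT the value was missing before doing anything
# else, so missingness can be modeled (or audited) as its own signal.
df["trials_missing"] = df["trials_correct"].isna()
df["trials_filled"]  = df["trials_correct"].fillna(df["trials_correct"].median())

print(df[["client_id", "week", "trials_missing", "trials_filled"]])
```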