What Bad Data Costs: Data Quality in Market Research and AI Training

April 27, 2026

An unbudgeted line item, a board-level question, and the quality premium taking shape underneath both.

Ammar Khan

The Spreadsheet That Never Gets Built

Every CFO eventually sits with a finance team that has built a beautiful model. The cohort assumptions are crisp. The unit economics work out to three decimal places. The forecast lands inside the confidence interval. And then, somewhere in the second year of execution, the numbers come in materially off, and nobody can explain why.

The honest answer is often that the data was wrong. Not catastrophically wrong, just wrong in the small, persistent way that bends every downstream decision a few degrees off course. By the time the variance shows up in a board deck, the cost has already compounded across product launches, marketing spend, hiring plans, and acquisition models.

Most companies do not know what bad data is actually costing them. According to Gartner, nearly sixty percent of organizations have never measured the annual financial impact of poor data quality, even though Gartner's own estimates put that cost in the eight figures for the average enterprise. The Cherry Bekaert middle market CFO survey published in late 2025 found that forty-nine percent of finance leaders say poor data quality is actively blocking critical financial decisions. The number that gets cited inside finance teams, the one Gartner has been publishing for years, lands around fifteen million dollars per year per organization. Newer estimates run higher. And those numbers do not capture the second-order damage, the strategy decisions made on flawed inputs, which is where the real money sits.

This is the unbudgeted line item. It is the cost of believing the data when the data was wrong.

Why This Is a CFO Question Now

Two things changed in the last twenty-four months that pulled data quality up from an operational concern into a boardroom one.

The first is volume. Imperva's 2025 Bad Bot Report found that automated traffic now accounts for more than half of all activity on the open internet for the first time in the report's twelve-year history. Ahrefs analyzed nine hundred thousand newly published web pages and found that seventy-four percent contained AI-generated content. Whatever pool of humans your panels, your customer surveys, your training data, your community feedback, and your usage analytics are sampling from has been getting more polluted by the month.

The second is leverage. A peer-reviewed study published in the Proceedings of the National Academy of Sciences in late 2025 demonstrated that AI agents can pass standard online survey quality checks 99.8 percent of the time. The cost of running those agents through commercial APIs is roughly five cents per response. The cost of running them on open-source models approaches zero. A real human respondent earns about a dollar fifty for the same survey. That spread, where the cost of producing a fraudulent response collapses while the incentive payout stays flat, has done to data what cheap distribution did to media in the 2010s.
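To put that spread in numbers, here is a minimal sketch using only the two figures quoted above (roughly $0.05 per agent response through a commercial API, roughly $1.50 paid per completed survey). The function name is illustrative, and the sketch assumes the operator has no other costs such as proxies or account setup, which a real operation would have.

```python
def fraud_margin(payout: float, cost_per_response: float) -> dict:
    """Per-response economics for an operator submitting automated responses.

    Simplifying assumption: the only cost is generating the response itself.
    """
    return {
        "profit_per_response": payout - cost_per_response,
        "payout_to_cost_ratio": payout / cost_per_response,
    }


# Figures quoted in the article: ~$1.50 incentive, ~$0.05 per API-driven response.
api_agent = fraud_margin(payout=1.50, cost_per_response=0.05)

print(f"Profit per response: ${api_agent['profit_per_response']:.2f}")
print(f"Payout is {api_agent['payout_to_cost_ratio']:.0f}x the cost of producing it")
```

On these figures the payout is about thirty times the cost of producing the response, and the ratio only grows as the cost per response approaches zero on open-source models.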

The CFO question is straightforward. If the inputs to your most consequential decisions are increasingly produced by machines pretending to be people, what is the actual cost of acting on those inputs? And what is the reasonable insurance premium against being wrong?

The Two-Layer Fraud Problem

Survey and panel fraud has two layers, and the financial exposure looks different at each one.

The first layer is people gaming the system. Distributed survey farms, professional respondents, and ban evaders have been a known cost in market research for over a decade. In April 2025, the U.S. Department of Justice unsealed an indictment against eight defendants who had operated a survey fraud ring for ten years, billing roughly ten million dollars in fabricated data to clients including Google, Seattle Children's Hospital, and several universities. Greenbook has documented organized survey farms training workers to operate multiple panel accounts and complete questionnaires in bulk. Research Defender, which monitors more than four billion survey entrants annually, finds that thirty-one percent of raw responses get flagged as fraudulent.

The second layer is machines gaming the system, and this layer is moving fastest. The 2025 PNAS study found that off-the-shelf AI agents passed 99.8 percent of standard quality checks designed to filter inattentive humans. A 2026 cross-platform study went a step further and showed that AI agents now outperform real people on the composite quality measures the industry has spent two decades building. The agents score higher on attention checks, comprehension gates, open-ended response analysis, and consistency tests than verified humans do. The traditional quality infrastructure, in other words, has started to select for the best AI agents and against the most distractible humans.

Stack these layers together and the implication for any company that pays for survey-based or panel-based inputs gets uncomfortable fast. Roughly a third of every dollar spent on respondents is being absorbed by fraudulent or non-human responses. Of that, traditional cleaning catches about sixteen percent. The remaining contamination flows downstream into the dataset, the analysis, and the decision.
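The arithmetic behind that paragraph can be sketched as follows. This is one reading of the numbers, not a definitive model: it takes the thirty-one percent flag rate cited above as the "roughly a third," reads "of that, about sixteen percent" as sixteen percent of the fraudulent responses, and assumes cleaning removes no legitimate responses by mistake. The function and key names are illustrative.

```python
def residual_contamination(fraud_share: float, catch_rate: float) -> dict:
    """Estimate how much fraud survives a cleaning pass.

    fraud_share: fraction of raw responses that are fraudulent.
    catch_rate:  fraction of those fraudulent responses that cleaning removes.
    Assumption: cleaning never discards a legitimate response.
    """
    caught = fraud_share * catch_rate      # removed, as a share of raw responses
    surviving = fraud_share - caught       # still in the data, as a share of raw
    kept = 1.0 - caught                    # everything that remains after cleaning
    return {
        "caught_share_of_raw": caught,
        "surviving_share_of_raw": surviving,
        "contaminated_share_of_kept": surviving / kept,
    }


result = residual_contamination(fraud_share=0.31, catch_rate=0.16)
for name, value in result.items():
    print(f"{name}: {value:.1%}")
```

Under those assumptions, about 26 percent of raw responses are fraudulent and uncaught, which works out to roughly 27 percent of the dataset that remains after cleaning.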

The Downstream Cost Cascade

The financial damage from contaminated data operates on three layers, and each one is more expensive than the last.

The first is the operational cost of cleaning. Sawtooth Software has documented research projects where teams collected over a thousand responses and ended up with one hundred and fifteen usable records after cleaning. That is a waste rate of close to ninety percent on the raw collection. Kantar's research division reports that researchers now discard somewhere between thirty-eight and forty-five percent of collected data due to quality concerns, and that discard rate has been climbing year over year. The labor, vendor spend, and re-fielding costs required to catch and remove bad records inflate the per-project cost several times over before the data ever reaches an analyst.
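A quick way to see how a low yield inflates cost is to divide by the usable fraction. The sketch below applies that to the Sawtooth example, treating exactly 1,000 raw responses as a lower bound (the source says "over a thousand"), so the true discard rate and multiplier would be at least what it computes. The function name is illustrative.

```python
def collection_yield(raw_responses: int, usable_records: int) -> dict:
    """Yield, discard rate, and cost multiplier for a cleaned data collection."""
    usable_fraction = usable_records / raw_responses
    return {
        "usable_fraction": usable_fraction,
        "discard_rate": 1.0 - usable_fraction,
        # Each usable record effectively costs this many raw responses.
        "cost_multiplier": 1.0 / usable_fraction,
    }


# Lower-bound reading of "over a thousand responses", 115 usable after cleaning.
sawtooth_case = collection_yield(raw_responses=1000, usable_records=115)

print(f"Discard rate: {sawtooth_case['discard_rate']:.1%}")
print(f"Cost multiplier per usable record: {sawtooth_case['cost_multiplier']:.1f}x")
```

On that lower-bound reading, about 88.5 percent of the collection is discarded and each usable record carries the collection cost of nearly nine raw responses, before any cleaning labor or vendor spend is counted.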

The second is the cost of bad decisions. When fraudulent data survives the cleaning pass, it shapes strategy. A widely cited study published in PNAS reported that forty percent of Americans supported political violence; after the bad actors were removed, the actual number dropped below three percent. A CDC COVID-19 study that ran in over one hundred and fifty news outlets was later found to be built on fabricated underlying data. Multinational brands have launched eight- and nine-figure marketing campaigns built on fraud-contaminated insights. The decisions looked credible. The data behind them was wrong. And the cost of the resulting misallocation rarely gets traced back to its source.

The third is the cost of compounding contamination. Foundational AI models trained on poisoned input data carry that contamination forward through every subsequent inference. Industry estimates suggest that cleaning and retraining a corrupted model runs three to five times the cost of the original training investment. As AI labs license panel data, survey archives, and crowdsourced datasets to feed reinforcement learning pipelines, the quality of that source material has direct financial consequences in the hundreds of millions. Surge AI, a bootstrapped human feedback company founded in 2020, surpassed a billion dollars in annual revenue while remaining profitable, and pursued an external fundraise at a valuation reported above twenty-five billion dollars in mid-2025. The numbers reflect a clear market signal. Verified human inputs to AI training pipelines are scarce and getting scarcer, and the companies that can supply them are commanding pricing accordingly.

The cleaner version of this story is that bad data has become a balance sheet item. The companies that recognize it early will capitalize the right thing. The companies that do not will continue to expense the symptom while the underlying liability grows.

Why Detection Has a Ceiling

The conventional response to data quality problems has been detection. Build better cleaners. Add behavioral analysis. Layer device fingerprints. Run trap questions, attention checks, and post-hoc statistical filters. Catch the bad responses after collection and discard them.

The detection paradigm has done useful work, and it will continue to. The trouble is that detection runs into a ceiling that the AI economics described above are now pushing through. In 2024, researchers at ETH Zurich demonstrated a hundred percent bypass rate against the most widely deployed challenge-response system. Anti-detect browsers spoof more than fifty device data points at the source-code level. Trap questions catch low-effort respondents but, as NORC's senior researcher David Dutwin has noted, are far less effective against any operator paying enough attention to answer them correctly. And the most rigorous fraud detection technology available today still cannot reliably distinguish a real human in a real browser presenting an AI-generated face from the legitimate respondent that face is impersonating.

Every detection technique triggers an adaptation. When researchers add attention checks, fraud operators train their workers to watch for them. When platforms start scanning open-ended responses for AI patterns, language models add intentional imperfections. The defender's economics are unfavorable: they need to catch every form of fraud, while the attacker only needs to find one gap. As the cost of producing convincing fraud collapses, the gap opens faster than the defender can close it.

The structural alternative is verification at the point of entry. Confirm that each respondent is a real, unique, live human before the first question loads. The cost of verification is fixed per respondent. The cost of detection scales with the volume of fraud, and the volume of fraud is growing.

What Boards Should Be Asking

For a board sitting across from a CFO and a chief data officer, three questions cut to the core of the exposure.

The first question is about the inputs. What share of the data flowing into product, marketing, customer experience, and AI training pipelines comes from sources we have verified as real, unique humans? If the answer is "we trust our vendors" or "we use industry-standard cleaning," the board should treat that as the start of a longer conversation rather than the end of one.

The second question is about the cost of being wrong. What is the financial exposure if a meaningful percentage of our recent insights were built on contaminated data? This requires running the math in both directions: the cost of remediation if the data turns out to have been clean, and the cost of the business decisions already made if it turns out it was not. The asymmetry usually favors investing in quality at the point of entry.
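One way to run that math in both directions is a simple expected-cost comparison. The sketch below is illustrative only: every dollar figure and probability is a hypothetical placeholder, the function names are invented for this example, and it makes the simplifying assumption that spending on quality fully prevents the bad-decision loss.

```python
def expected_costs(p_contaminated: float,
                   quality_spend: float,
                   bad_decision_loss: float) -> dict:
    """Expected cost of each choice, given a probability the data is contaminated.

    Simplifying assumption: the quality spend fully prevents the loss.
    """
    return {
        # Paid whether or not the data turns out to have been clean.
        "invest_in_quality": quality_spend,
        # Incurred only if the data was contaminated.
        "do_nothing": p_contaminated * bad_decision_loss,
    }


def break_even_probability(quality_spend: float, bad_decision_loss: float) -> float:
    """Contamination probability above which the quality spend pays for itself."""
    return quality_spend / bad_decision_loss


# Hypothetical placeholders, not figures from the article.
costs = expected_costs(p_contaminated=0.10,
                       quality_spend=200_000,
                       bad_decision_loss=5_000_000)
threshold = break_even_probability(200_000, 5_000_000)

print(costs)
print(f"Break-even contamination probability: {threshold:.0%}")
```

The asymmetry shows up in the threshold. When the potential loss is large relative to the quality spend, even a small probability of contamination is enough to justify the spend.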

The third question is about competitive position. As verified human data becomes scarcer and more valuable, are we building toward a position of advantage or disadvantage? Companies with verified user pools, verified panel relationships, and verified training data inputs will own pricing power in markets that still treat data as a commodity. The companies that wait will end up paying premium prices for inputs they could have secured cheaply by acting earlier.

From Cleaning to Verification

The shift now underway in market research and AI training runs from a detection economy to a verification economy. The old logic was that everyone gets in, and the platforms clean up afterwards. The new logic is that platforms only admit verified humans in the first place, and the cleaning bill collapses with the contamination.

Verification works because it changes the economics of fraud at the source. A five-cent AI agent cannot generate a unique anonymous credential tied to a real, live human. A click farm running hundreds of accounts cannot scale if each account requires a separate verified person. The cost of producing a fraudulent presence rises above the incentive value of completing a survey, and the rational basis for fraud collapses. Bad actors stop showing up because the math stops working for them.

For legitimate respondents, verification cuts the other way. One credential carries across every platform in the network. The same fifteen-second check that filters bad actors at the door eliminates redundant onboarding for the humans you actually want. Completion rates rise. Drop-off declines. The respondent pool grows more accessible as the network expands, not less.

For the CFO, the financial picture is cleaner. Cost per unit of input goes up modestly at the gate. Cost per usable, defensible, decision-quality response drops, often by a factor of five to ten depending on the source channel. The over-recruitment buffer collapses. The third-party fraud detection spend collapses. The re-fielding risk collapses. The total cost of being right gets cheaper, and the variance around it tightens.
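The mechanism can be sketched with a small comparison of two pipelines. All inputs below are hypothetical placeholders chosen to illustrate how a higher cost at the gate can still produce a lower cost per usable response; they are not measured figures, and the class and field names are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    name: str
    incentive: float        # paid per raw response
    gate_cost: float        # verification fee per raw response (0 if none)
    cleaning_cost: float    # detection and cleaning spend per raw response
    usable_fraction: float  # share of raw responses that end up usable

    def cost_at_gate(self) -> float:
        """What each respondent costs before any cleaning happens."""
        return self.incentive + self.gate_cost

    def cost_per_usable(self) -> float:
        """Total spend per raw response, divided by the share that is usable."""
        total_per_raw = self.incentive + self.gate_cost + self.cleaning_cost
        return total_per_raw / self.usable_fraction


# Hypothetical inputs for illustration only.
detect_after = Pipeline("detect-after", incentive=1.50, gate_cost=0.00,
                        cleaning_cost=0.50, usable_fraction=0.20)
verify_first = Pipeline("verify-first", incentive=1.50, gate_cost=0.25,
                        cleaning_cost=0.05, usable_fraction=0.90)

for p in (detect_after, verify_first):
    print(f"{p.name}: gate ${p.cost_at_gate():.2f}, "
          f"per usable response ${p.cost_per_usable():.2f}")

ratio = detect_after.cost_per_usable() / verify_first.cost_per_usable()
print(f"Cost per usable response falls by a factor of {ratio:.1f}")
```

With these placeholder inputs the cost at the gate rises from $1.50 to $1.75, while the cost per usable response falls from $10.00 to $2.00. The result is driven almost entirely by the usable fraction, which is the number worth measuring in a real pipeline.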

The Compounding Advantage

There is a version of this conversation that ends with cost containment, and there is a version that ends with strategic moat. The cost containment version is real and worth the spreadsheet exercise. The moat version is more interesting.

Verified human data compounds in value as the surrounding internet gets noisier. Gartner projects that by 2030, seventy-five percent of B2B buyers will prefer interactions that prioritize human contact over AI. News Corp signed a two hundred and fifty million dollar licensing deal with OpenAI for its human-produced corpus. Reddit licensed its data for roughly seventy million dollars per year, with reports indicating that Reddit content carries five times the weight of other sources in AI training. Shutterstock booked over a hundred million dollars in AI licensing revenue in a single year. The market is already pricing the human premium in. The companies that lock in verified human data infrastructure now will own a disproportionate share of the value when the rest of the market catches up.

The CFO and the board sit at the leverage point of this decision. Quality at the point of entry is cheaper than quality after the fact. Verification at the gate is cheaper than detection after the gate. And in a market where the share of provably human data is shrinking faster than anyone modeled, the companies that move first end up paying the lowest price for the most durable competitive position.

The unbudgeted line item, the one nobody has put a dollar amount on yet, is bigger than most finance teams realize. The good news is that it is finally measurable. The better news is that it is preventable. The work now is figuring out which version of that conversation a board wants to have, and how soon.

See the math for your team

VerifyYou is the human verification network. We confirm every respondent is a real, unique, live human in roughly fifteen seconds, with one API call. The credential carries across every platform in our network, so quality compounds the more partners join.

If you want to run the cost-per-quality-response math against your own pipeline, see how it works at verifyyou.com.